Recorded Talks: Machine Learning
How Transformers Learn Causal Structure with Gradient Descent
Jason Lee, Princeton University
The incredible success of transformers on sequence modeling tasks can be largely attributed to the self-attention mechanism, which allows information to be transferred between different parts of a sequence. Self-attention allows transformers to encode causal structure which makes them particularly suitable for sequence modeling. However, the process by which transformers learn such causal structure via gradient-based training algorithms remains poorly understood. To better understand this process, we introduce an in-context learning task that requires learning latent causal structure. We prove that gradient descent on a simplified two-layer transformer learns to solve this task by encoding the latent causal graph in the first attention layer. The key insight of our proof is that the gradient of the attention matrix encodes the mutual information between tokens. As a consequence of the data processing inequality, the largest entries of this gradient correspond to edges in the latent causal graph. As a special case, when the sequences are generated from in-context Markov chains, we prove that transformers learn an induction head (Olsson et al., 2022). We confirm our theoretical findings by showing that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.
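As a rough illustration of that information-theoretic step (and not of the transformer construction or its training dynamics), the Python sketch below generates sequences from a hypothetical latent tree over positions and checks that the plug-in mutual information between a position and its parent dominates all other pairs, which is the data-processing-inequality argument in miniature; all parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V, n_seq, eps = 8, 5, 20000, 0.1   # positions, vocabulary size, sequences, noise level

# Hypothetical latent causal graph over positions: each position copies its parent
# with probability 1 - eps and is resampled uniformly otherwise.
parent = {t: int(rng.integers(0, t)) for t in range(1, T)}

X = np.zeros((n_seq, T), dtype=int)
X[:, 0] = rng.integers(0, V, n_seq)
for t in range(1, T):
    resample = rng.random(n_seq) < eps
    X[:, t] = np.where(resample, rng.integers(0, V, n_seq), X[:, parent[t]])

def mutual_info(a, b, V):
    """Plug-in estimate of I(a; b) for two integer-valued samples."""
    joint = np.zeros((V, V))
    np.add.at(joint, (a, b), 1.0)
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

# By the data processing inequality, the parent of each position carries the largest
# mutual information, mirroring the role of the largest attention-gradient entries.
for t in range(1, T):
    mi = [mutual_info(X[:, t], X[:, s], V) for s in range(t)]
    print(t, "true parent:", parent[t], "argmax MI:", int(np.argmax(mi)))
```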
Jason Lee is an associate professor of Electrical and Computer Engineering at Princeton University, with a secondary appointment in Computer Science. Prior to that, he was in the Data Science and Operations department at the University of Southern California and a postdoctoral researcher at UC Berkeley working with Michael I. Jordan. Jason received his PhD at Stanford University, advised by Trevor Hastie and Jonathan Taylor. His research interests are in the theory of machine learning, optimization, and statistics. Lately, he has worked on the foundations of deep learning, representation learning, and reinforcement learning. He has received the Samsung AI Researcher of the Year Award, the NSF CAREER Award, the ONR Young Investigator Award in Mathematical Data Science, a Sloan Research Fellowship, a NeurIPS Best Student Paper Award, the Finalist distinction for the Best Paper Prize for Young Researchers in Continuous Optimization, and a Princeton Commendation for Outstanding Teaching.
Off-the-shelf Algorithmic Stability
Rebecca Willett, University of Chicago
Algorithmic stability holds when our conclusions, estimates, fitted models, predictions, or decisions are insensitive to small changes to the training data. Stability has emerged as a core principle for reliable data science, providing insights into generalization, cross-validation, uncertainty quantification, and more. Whereas prior literature has developed mathematical tools for analyzing the stability of specific machine learning (ML) algorithms, we study methods that can be applied to arbitrary learning algorithms to satisfy a desired level of stability. First, I will discuss how bagging is guaranteed to stabilize any prediction model, regardless of the input data. Thus, if we remove or replace a small fraction of the training data at random, the resulting prediction will typically change very little. Our analysis provides insight into how the size of the bags (bootstrap datasets) influences stability, giving practitioners a new tool for guaranteeing a desired level of stability. Second, I will describe how to extend these stability guarantees beyond prediction modeling to more general statistical estimation problems where bagging is not as well known but equally useful for stability. Specifically, I will describe a new framework for stable classification and model selection by combining bagging on class or model weights with a stable, “soft” version of the argmax operator.
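As a schematic illustration of the two ingredients described above (bagging plus a soft argmax), and not of the authors' exact estimator or guarantees, the sketch below averages class-probability estimates over bootstrap bags and then applies a softmax to the averaged scores; the base learner, bag size, and temperature are arbitrary, assumed choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_soft_classify(X, y, X_test, n_bags=200, bag_size=None, tau=0.05, seed=0):
    """Average class-probability estimates over bootstrap bags, then replace the
    hard argmax with a softmax ("soft argmax") over the averaged scores.
    Assumes integer labels 0..K-1; the base learner, bag size, and temperature
    are illustrative choices (smaller bags generally give more stability)."""
    rng = np.random.default_rng(seed)
    n, K = len(X), int(y.max()) + 1
    m = bag_size or n // 2
    avg = np.zeros((len(X_test), K))
    for _ in range(n_bags):
        idx = rng.choice(n, size=m, replace=True)          # one bootstrap bag
        clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X[idx], y[idx])
        proba = np.zeros((len(X_test), K))
        proba[:, clf.classes_] = clf.predict_proba(X_test)
        avg += proba / n_bags
    soft = np.exp(avg / tau)                               # soft argmax over bagged scores
    return soft / soft.sum(axis=1, keepdims=True)

# Toy check: replacing one training label barely changes the (soft) output.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)); y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)
X2, y2 = X.copy(), y.copy(); y2[0] = 1 - y2[0]
out1 = bagged_soft_classify(X, y, X[:10])
out2 = bagged_soft_classify(X2, y2, X[:10])
print(np.abs(out1 - out2).max())
```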
Rebecca Willett is a Professor of Statistics and Computer Science and the Director of AI in the Data Science Institute at the University of Chicago, and she holds a courtesy appointment at the Toyota Technological Institute at Chicago. Her research is focused on machine learning foundations, scientific machine learning, and signal processing. Willett received the inaugural Data Science Career Prize from the Society for Industrial and Applied Mathematics in 2024, was named a Fellow of the Society for Industrial and Applied Mathematics in 2021, and was named a Fellow of the IEEE in 2022. She is the Deputy Director for Research at the NSF-Simons Foundation National Institute for Theory and Mathematics in Biology, Deputy Director for Research at the NSF-Simons Institute for AI in the Sky (SkAI), and a member of the NSF Institute for the Foundations of Data Science Executive Committee. She is the Faculty Director of the Eric and Wendy Schmidt AI in Science Postdoctoral Fellowship. She helps direct the Air Force Research Lab University Center of Excellence on Machine Learning. She received the National Science Foundation CAREER Award in 2007, was a member of the DARPA Computer Science Study Group, and received an Air Force Office of Scientific Research Young Investigator Program award in 2010. She completed her PhD in Electrical and Computer Engineering at Rice University in 2005. She was an Assistant and then tenured Associate Professor of Electrical and Computer Engineering at Duke University from 2005 to 2013. She was an Associate Professor of Electrical and Computer Engineering, Harvey D. Spangler Faculty Scholar, and Fellow of the Wisconsin Institutes for Discovery at the University of Wisconsin-Madison from 2013 to 2018.
What Kinds of Functions do Neural Networks Learn? Theory and Practical Applications
Robert Nowak, University of Wisconsin
This talk presents a theory characterizing the types of functions neural networks learn from data. Specifically, the function space generated by deep ReLU networks consists of compositions of functions from the Banach space of second-order bounded variation in the Radon transform domain. This Banach space includes functions with smooth projections in most directions. A representer theorem associated with this space demonstrates that finite-width neural networks suffice for fitting finite datasets. The theory has several practical applications. First, it provides a simple and theoretically grounded method for network compression. Second, it shows that multi-task training can yield significantly different solutions compared to single-task training, and that multi-task solutions can be related to kernel ridge regressions. Third, the theory has implications for improving implicit neural representations, where multi-layer neural networks are used to represent continuous signals, images, or 3D scenes. This exploration bridges theoretical insights with practical advancements, offering a new perspective on neural network capabilities and future research directions.
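One plausible, illustrative instantiation of the compression idea (assuming, in the spirit of the theory, that weight-decay training drives most neurons' contributions toward zero) is sketched below: train a wide shallow ReLU network with weight decay, score each hidden unit by the product of its outer-weight magnitude and inner-weight norm, and drop low-scoring units. This is a sketch under those assumptions, not necessarily the specific method presented in the talk.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1, 1, 64).unsqueeze(1)          # toy 1-D regression data
y = torch.sin(3 * x) + 0.05 * torch.randn_like(x)

width = 256
model = nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-3)
for _ in range(3000):                                # weight decay encourages few active neurons
    opt.zero_grad()
    ((model(x) - y) ** 2).mean().backward()
    opt.step()

# "Compression": rank hidden units by |outer weight| * ||inner weights||, drop the small ones.
W1, b1 = model[0].weight.detach(), model[0].bias.detach()
w2, b2 = model[2].weight.detach(), model[2].bias.detach()
score = w2.squeeze(0).abs() * torch.cat([W1, b1.unsqueeze(1)], dim=1).norm(dim=1)
keep = score > 0.01 * score.max()
k = int(keep.sum())
pruned = nn.Sequential(nn.Linear(1, k), nn.ReLU(), nn.Linear(k, 1))
with torch.no_grad():
    pruned[0].weight.copy_(W1[keep]); pruned[0].bias.copy_(b1[keep])
    pruned[2].weight.copy_(w2[:, keep]); pruned[2].bias.copy_(b2)
print(f"kept {k}/{width} neurons; max deviation:",
      (model(x) - pruned(x)).abs().max().item())
```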
Robert Nowak is the Grace Wahba Professor of Data Science and Keith and Jane Nosbusch Professor in Electrical and Computer Engineering at the University of Wisconsin-Madison. His research focuses on machine learning, optimization, and signal processing. He serves on the editorial boards of the SIAM Journal on the Mathematics of Data Science and the IEEE Journal on Selected Areas in Information Theory.
Transformers Learn In-context by (Functional) Gradient Descent
Xiang Cheng, TILOS Postdoctoral Scholar at MIT
Motivated by the in-context learning phenomenon, we investigate how the Transformer neural network can implement learning algorithms in its forward pass. We show that a linear Transformer naturally learns to implement gradient descent, which enables it to learn linear functions in-context. More generally, we show that a non-linear Transformer can implement functional gradient descent with respect to some RKHS metric, which allows it to learn a broad class of functions in-context. Additionally, we show that the RKHS metric is determined by the choice of attention activation, and that the optimal choice of attention activation depends in a natural way on the class of functions that need to be learned. I will end by discussing some implications of our results for the choice and design of Transformer architectures.
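In the linear case, the construction can be written down explicitly. The toy sketch below hand-builds a single linear-attention head whose prediction on an in-context regression prompt coincides with one gradient-descent step (from zero) on the in-context least-squares objective; the talk is about what trained Transformers learn and the functional-gradient/RKHS generalization, which this toy does not cover.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 5, 32, 0.1

# In-context regression prompt: (x_i, y_i) pairs plus a query x_q with unknown label.
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d)); y = X @ w_star
x_q = rng.normal(size=d)

# One gradient-descent step on the in-context least-squares loss, starting from w = 0:
#   w_1 = (eta / n) * sum_i y_i x_i   ->   prediction w_1^T x_q
gd_pred = (eta / n) * (y @ X) @ x_q

# A linear-attention head with hand-picked weights reproduces the same prediction:
# keys/queries keep the x-part of each token, values keep the y-part, no softmax.
tokens = np.hstack([X, y[:, None]])                 # each row: (x_i, y_i)
query = np.hstack([x_q, 0.0])
P = np.zeros((d + 1, d + 1)); P[:d, :d] = np.eye(d) # key/query projection
V = np.zeros((d + 1, d + 1)); V[d, d] = 1.0         # value projection
scores = (tokens @ P) @ (P @ query)                 # x_i^T x_q (linear attention)
attn_out = (eta / n) * scores @ (tokens @ V.T)      # eta/n * sum_i (x_i^T x_q) y_i
print(gd_pred, attn_out[d])                         # the two predictions coincide
```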
How Large Models of Language and Vision Help Agents to Learn to Behave
Roy Fox, Assistant Professor and Director of the Intelligent Dynamics Lab, UC Irvine
If learning from data is valuable, can learning from big data be very valuable? So far, it has been so in vision and language, for which foundation models can be trained on web-scale data to support a plethora of downstream tasks; not so much in control, for which scalable learning remains elusive. Can information encoded in vision and language models guide reinforcement learning of control policies? In this talk, I will discuss several ways for foundation models to help agents to learn to behave. Language models can provide better context for decision-making: we will see how they can succinctly describe the world state to focus the agent on relevant features; and how they can form generalizable skills that identify key subgoals. Vision and vision–language models can help the agent to model the world: we will see how they can block visual distractions to keep state representations task-relevant; and how they can hypothesize about abstract world models that guide exploration and planning.
Roy Fox is an Assistant Professor of Computer Science at the University of California, Irvine. His research interests include theory and applications of control learning: reinforcement learning (RL), control theory, information theory, and robotics. His current research focuses on structured and model-based RL, language for RL and RL for language, and optimization in deep control learning of virtual and physical agents.
The Synergy between Machine Learning and the Natural Sciences
Max Welling, Research Chair in Machine Learning, University of Amsterdam
Traditionally, machine learning has been heavily influenced by neuroscience (hence the name "artificial neural networks") and physics (e.g., MCMC, belief propagation, and diffusion-based generative AI). We have recently witnessed that the flow of information has also reversed, with new tools developed in the ML community impacting physics, chemistry, and biology. Examples include faster DFT, force-field-accelerated MD simulations, neural PDE surrogate models, the generation of drug-like molecules, and many more. In this talk I will review the exciting opportunities for further cross-fertilization between these fields, ranging from faster (classical) DFT calculations and enhanced transition path sampling to traveling waves in artificial neural networks.
Prof. Max Welling is a research chair in Machine Learning at the University of Amsterdam and a Distinguished Scientist at MSR. He is a fellow at the Canadian Institute for Advanced Research (CIFAR) and the European Lab for Learning and Intelligent Systems (ELLIS), where he also serves on the founding board. His previous appointments include VP at Qualcomm Technologies, professor at UC Irvine, postdoc at the University of Toronto and UCL under the supervision of Prof. Geoffrey Hinton, and postdoc at Caltech under the supervision of Prof. Pietro Perona. He finished his PhD in theoretical high-energy physics under the supervision of Nobel laureate Prof. Gerard 't Hooft.
TILOS Seminar: Learning from Diverse and Small Data
Ramya Korlakai Vinayak, Assistant Professor, University of Wisconsin at Madison
How can we learn reliably about a diverse population when only a small amount of data is available per individual? In this talk, we will address this question in the following settings:
(i) In many applications, we observe count data that can be modeled as Binomial (e.g., polling, surveys, epidemiology) or Poisson (e.g., single-cell RNA) data. Since a single parameter, or any finite set of parameters, does not capture the diversity of the population in such datasets, the data are often modeled as nonparametric mixtures. In this setting, we will address the following question: "How well can we learn the distribution of parameters over the population without learning the individual parameters?" We show that nonparametric maximum likelihood estimators are in fact minimax optimal; a toy sketch of such an estimator appears after this list.
(ii) Learning preferences from human judgments using comparison queries plays a crucial role in cognitive and behavioral psychology, crowdsourced democracy, surveys in social science applications, and recommendation systems. Models in the literature often focus on learning the average preference over the population due to limitations on the amount of data available per individual. We will discuss some recent results on how we can reliably capture diversity in preferences while pooling together data from individuals.
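A toy version of setting (i) can be simulated directly: draw each individual's parameter from an unknown mixing distribution, observe only a handful of Binomial trials per individual, and approximate the nonparametric maximum likelihood estimate of the mixing distribution with EM over weights on a fixed grid. The grid, trial count, and mixture below are arbitrary choices, and the sketch says nothing about the minimax optimality established in the talk.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
t, n = 10, 5000                                  # few trials per individual, many individuals

# True mixing distribution over individual parameters p_i: a 50/50 mix of 0.2 and 0.7.
p_true = np.where(rng.random(n) < 0.5, 0.2, 0.7)
k = rng.binomial(t, p_true)                      # observed counts

grid = np.linspace(0.01, 0.99, 99)               # support grid for the approximate NPMLE
L = binom.pmf(k[:, None], t, grid[None, :])      # likelihood of each count at each grid point
w = np.full(len(grid), 1 / len(grid))

for _ in range(500):                             # EM updates for the mixing weights
    resp = w * L
    resp /= resp.sum(axis=1, keepdims=True)
    w = resp.mean(axis=0)

top = np.argsort(w)[-5:]
print(np.round(grid[top], 2), np.round(w[top], 2))   # mass should concentrate near 0.2 and 0.7
```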
Ramya Korlakai Vinayak is an assistant professor in the Dept. of ECE and affiliated faculty in the Dept. of Computer Science and the Dept. of Statistics at UW-Madison. Her research interests span the areas of machine learning, statistical inference, and crowdsourcing. Her work focuses on addressing theoretical and practical challenges that arise when learning from societal data. Prior to joining UW-Madison, Ramya was a postdoctoral researcher in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. She received her Ph.D. in Electrical Engineering from Caltech. She obtained her master's degree from Caltech and her bachelor's degree from IIT Madras. She is a recipient of the Schlumberger Foundation Faculty for the Future fellowship (2013-15), was an invited participant at the Rising Stars in EECS workshop in 2019, and is a recipient of the NSF CAREER Award (2023-2028).
TILOS Seminar: Machine Learning Training Strategies Inspired by Humans' Learning Skills
Pengtao Xie, Assistant Professor, UC San Diego
Humans, as the most powerful learners on the planet, have accumulated many learning skills, such as learning through tests, interleaving learning, self-explanation, and active recall, to name a few. These learning skills and methodologies enable humans to learn new topics more effectively and efficiently. We are interested in investigating whether humans' learning skills can be borrowed to help machines learn better. Specifically, we aim to formalize these skills and leverage them to train better machine learning (ML) models. To achieve this goal, we develop a general framework, Skillearn, which provides a principled way to represent humans' learning skills mathematically and to use the formally represented skills to improve the training of ML models. In two case studies, we apply Skillearn to formalize two human learning skills, learning by passing tests and interleaving learning, and use the formalized skills to improve neural architecture search.
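As a purely illustrative, data-level example of one such skill, the sketch below interleaves minibatches from several tasks during training rather than training on the tasks one at a time; Skillearn's actual mathematical formalization of these skills (and its use for neural architecture search) is not reproduced here, and all names in the sketch are placeholders.

```python
import itertools
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def interleaved_training(model, task_loaders, steps=300, lr=1e-2):
    """Round-robin over task-specific loaders, one minibatch per task at a time,
    instead of finishing one task before starting the next ("blocked" practice)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    iters = [itertools.cycle(loader) for loader in task_loaders]
    for step in range(steps):
        x, y = next(iters[step % len(iters)])        # alternate tasks batch by batch
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()
    return model

# Toy usage with two hypothetical classification tasks sharing one model.
torch.manual_seed(0)
tasks = []
for shift in (0.0, 2.0):
    X = torch.randn(256, 4) + shift
    y = (X[:, 0] > shift).long()
    tasks.append(DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True))
model = interleaved_training(nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2)), tasks)
```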
Pengtao Xie is an assistant professor at UC San Diego. He received his PhD from the Machine Learning Department at Carnegie Mellon University in 2018. His research interests lie in machine learning inspired by human learning and its applications in healthcare. His research outcomes have been adopted by medical device companies, medical imaging centers, and hospitals, and have been published at top-tier artificial intelligence conferences and journals including ICML, NeurIPS, ACL, ICCV, and TACL. He is the recipient of the Tencent AI-Lab Faculty Award, the Tencent WeChat Faculty Award, the Innovator Award presented by the Pittsburgh Business Times, the Siebel Scholars award, and the Goldman Sachs Global Leader Scholarship.
TILOS Seminar: Causal Discovery for Root Cause Analysis
Murat Kocaoglu, Assistant Professor, Purdue University
Cause-effect relations are crucial for several fields, from medicine to policy design, as they inform us of the outcomes of our actions a priori. However, causal knowledge is hard to curate for complex systems that might be changing frequently. Causal discovery algorithms allow us to extract causal knowledge from the available data. In this talk, we first provide a short introduction to algorithmic causal discovery. Next, we propose a novel causal discovery algorithm that works from a collection of observational and interventional datasets in the presence of unobserved confounders, with unknown intervention targets. Finally, we demonstrate the effectiveness of our algorithm for root-cause analysis in microservice architectures.
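For orientation only, the sketch below implements the simplest constraint-based ingredient of causal discovery: recovering an undirected skeleton from purely observational, linear-Gaussian data with partial-correlation independence tests and no latent confounders. It is a textbook-style toy, not the algorithm proposed in the talk, which additionally handles interventional datasets with unknown targets and unobserved confounders.

```python
import itertools
import numpy as np
from scipy import stats

def partial_corr(data, i, j, S):
    """Partial correlation of columns i and j given the columns in S (via OLS residuals)."""
    x, y = data[:, i], data[:, j]
    if S:
        Z = np.column_stack([data[:, sorted(S)], np.ones(len(data))])
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(x, y)[0, 1])

def independent(data, i, j, S, alpha=0.01):
    """Fisher-z test; True means we fail to reject X_i independent of X_j given X_S."""
    r = np.clip(partial_corr(data, i, j, S), -0.9999, 0.9999)
    z = 0.5 * np.log((1 + r) / (1 - r))
    stat = np.sqrt(len(data) - len(S) - 3) * abs(z)
    return stat < stats.norm.ppf(1 - alpha / 2)

def pc_skeleton(data, alpha=0.01, max_cond=2):
    """PC-style skeleton search: start complete, remove edges that test independent."""
    d = data.shape[1]
    adj = {i: set(range(d)) - {i} for i in range(d)}
    for size in range(max_cond + 1):
        for i in range(d):
            for j in sorted(adj[i]):
                if j < i:
                    continue
                for S in itertools.combinations(adj[i] - {j}, size):
                    if independent(data, i, j, set(S), alpha):
                        adj[i].discard(j); adj[j].discard(i)
                        break
    return sorted((i, j) for i in range(d) for j in adj[i] if i < j)

# Toy linear-Gaussian chain X0 -> X1 -> X2: the recovered skeleton should be [(0, 1), (1, 2)].
rng = np.random.default_rng(0)
n = 5000
X0 = rng.normal(size=n)
X1 = 0.8 * X0 + rng.normal(size=n)
X2 = 0.8 * X1 + rng.normal(size=n)
print(pc_skeleton(np.column_stack([X0, X1, X2])))
```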
Dr. Kocaoglu received his B.S. degree in Electrical-Electronics Engineering with a minor in Physics from the Middle East Technical University in 2010, his M.S. degree from Koc University, Turkey in 2012, and his Ph.D. degree from The University of Texas at Austin in 2018 under the supervision of Prof. Alex Dimakis and Prof. Sriram Vishwanath. He was a Research Staff Member in the MIT-IBM Watson AI Lab at IBM Research, Cambridge, Massachusetts from 2018 to 2020. Since 2021, he has been an assistant professor in the School of ECE at Purdue University. His current research interests include causal inference and discovery, causal machine learning, deep generative models, and information theory.
IEEE Seasonal School: Manufacturability, Testing, Reliability, and Security
00:00:00 - Introduction
01:51:00 - "Machine Learning for DFM", Bei Yu, Associate Professor, Chinese University of Hong Kong
59:54:00 - "ML for Testing and Yield", Li-C. Wang, Professor, UC Santa Barbara
01:59:00 - "ML for Cross-Layer Reliability and Security", Muhammad Shafique, Professor of Computer Engineering, NYU Abu Dhabi
IEEE Seasonal School: Standard Platforms for ML in EDA and IC Design
00:00:00 - Introduction
00:02:45 - "Exchanging EDA data for AI/ML using Standard API", Kerim Kalafala, Senior Technical Staff Member, IBM (co-chair, AI/ML for EDA Special Interest Group, Si2), Richard Taggart, Senior Software Engineering Manager, IBM and Akhilesh Kumar, Principal R&D Engineer, Ansys
01:29:30 - "IEEE CEDA DATC RDF and METRICS2.1: Toward a Standard Platform for ML-Enabled EDA and IC Design", Jinwook Jung, Research Staff Member, IBM Research
TILOS Seminar: Rare Gems: Finding Lottery Tickets at Initialization
Dimitris Papailiopoulos, Associate Professor, University of Wisconsin-Madison
Large neural networks can be pruned to a small fraction of their original size, with little loss in accuracy, by following a time-consuming "train, prune, re-train" approach. Frankle & Carbin (2019) conjectured that we can avoid this by training lottery tickets, i.e., special sparse subnetworks found at initialization that can be trained to high accuracy. However, a subsequent line of work presents concrete evidence that current algorithms for finding trainable networks at initialization fail simple baseline comparisons, e.g., against training random sparse subnetworks. Finding lottery tickets that train to better accuracy than such simple baselines remains an open problem. In this work, we resolve this open problem by discovering Rare Gems: sparse, trainable networks at initialization that achieve high accuracy even before training. When Rare Gems are trained with SGD, they achieve accuracy competitive with, or better than, Iterative Magnitude Pruning (IMP) with warmup.
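For context, one earlier family of approaches in this area learns a score per frozen random weight and keeps only the top-scoring weights, using a straight-through gradient for the discrete mask. The sketch below is in that spirit only; it is not the procedure proposed in the talk, and the abstract above notes that such baselines can fall short of what Rare Gems achieve.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Linear layer whose weights stay frozen at their random initialization;
    only a per-weight score is learned, and the forward pass keeps the top-scoring
    fraction of weights (straight-through gradient to the scores)."""
    def __init__(self, in_f, out_f, keep=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_f, in_f), requires_grad=False)
        nn.init.kaiming_normal_(self.weight)
        self.scores = nn.Parameter(0.01 * torch.rand(out_f, in_f))
        self.keep = keep

    def forward(self, x):
        k = max(1, int(self.keep * self.scores.numel()))
        thresh = torch.topk(self.scores.flatten(), k).values.min()
        hard_mask = (self.scores >= thresh).float()
        mask = hard_mask + self.scores - self.scores.detach()   # straight-through estimator
        return F.linear(x, self.weight * mask)

# Toy demo: search for a sparse subnetwork of a frozen random MLP on synthetic data.
torch.manual_seed(0)
X = torch.randn(2000, 20)
y = (X[:, :5].sum(dim=1) > 0).long()
net = nn.Sequential(MaskedLinear(20, 128), nn.ReLU(), MaskedLinear(128, 2))
opt = torch.optim.Adam([p for p in net.parameters() if p.requires_grad], lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    F.cross_entropy(net(X), y).backward()
    opt.step()
print("accuracy of the masked-but-untrained network:",
      (net(X).argmax(1) == y).float().mean().item())
```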
Dimitris Papailiopoulos is the Jay & Cynthia Ihlenfeld Associate Professor of Electrical and Computer Engineering at the University of Wisconsin-Madison, a faculty fellow of the Grainger Institute for Engineering, and a faculty affiliate at the Wisconsin Institute for Discovery. His research interests span machine learning, information theory, and distributed systems, with a current focus on efficient large-scale training algorithms. Before coming to Madison, Dimitris was a postdoctoral researcher at UC Berkeley and a member of the AMPLab. He earned his Ph.D. in ECE from UT Austin under the supervision of Alex Dimakis, and received his ECE Diploma and M.Sc. degree from the Technical University of Crete in Greece. Dimitris is a recipient of the NSF CAREER Award (2019), Sony Faculty Innovation Awards in three consecutive years (2018, 2019, and 2020), a joint IEEE ComSoc/ITSoc Best Paper Award (2020), an IEEE Signal Processing Society Young Author Best Paper Award (2015), the Vilas Associate Award (2021), the Emil Steiger Distinguished Teaching Award (2021), and the Benjamin Smith Reynolds Award for Excellence in Teaching (2019). In 2018, he co-founded MLSys, a new conference that targets research at the intersection of machine learning and systems.
IEEE Seasonal School: Applications / Future Frontiers
00:00:00 - Introduction
01:23:00 - "Automating Analog Layout: Why This Time is Different", Sachin Sapatnekar, Professor, University of Minnesota
57:20:00 - "Machine Learning-Powered Tools and Methodologies for 3D Integration", Sung Kyu Lim, Professor, Georgia Institute of Technology
01:58:18 - "ML for Verification", Shobha Vasudevan, Researcher at Google and Adjunct Professor at UIUC
IEEE Seasonal School: Deep / Reinforcement Learning
00:00:00 - Introduction
02:30:00 - "Machine Learning for EDA Optimization", Mark Ren, Senior Manager, NVIDIA Research
01:02:30 - "Learning to Optimize", Ismail Bustany, Fellow, AMD
02:01:50 - "Circuit Training: An open-source framework for generating chip floor plans with distributed deep reinforcement learning", Joe Jiang, Staff Software Engineer and Manager, Google Brain
TILOS Seminar: Robust and Equitable Uncertainty Estimation
Aaron Roth, Professor, University of Pennsylvania
Machine learning provides us with an amazing set of tools to make predictions, but how much should we trust particular predictions? To answer this, we need a way of estimating the confidence we should have in particular predictions of black-box models. Standard tools for doing this give guarantees that are averages over predictions. For instance, in a medical application, such tools might paper over poor performance on one medically relevant demographic group if it is made up for by higher performance on another group. Standard methods also depend on the data distribution being static—in other words, the future should be like the past.
In this lecture, I will describe new techniques to address both of these problems: methods that produce prediction sets for arbitrary black-box prediction methods, with correct empirical coverage even when the data distribution changes in arbitrary, unanticipated ways, and with correct coverage even when we zoom in to focus on demographic groups that can be arbitrary and intersecting. When we just want correct group-wise coverage and are willing to assume that the future will look like the past, our algorithms are especially simple.
This talk is based on two papers, which are joint work with Osbert Bastani, Varun Gupta, Chris Jung, Georgy Noarov, and Ramya Ramalingam.
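For the especially simple regime mentioned above (group-wise coverage for disjoint groups, assuming the future looks like the past), a split-conformal-style sketch with one calibrated threshold per group might look as follows; the methods in the talk go well beyond this, handling arbitrary intersecting groups and arbitrary distribution shift.

```python
import numpy as np

def groupwise_conformal_thresholds(scores, groups, alpha=0.1):
    """Split-conformal-style calibration with a separate threshold per (disjoint) group.
    scores: nonconformity scores, e.g. |y - f(x)|, on a held-out calibration set."""
    thresholds = {}
    for g in np.unique(groups):
        s = np.sort(scores[groups == g])
        k = int(np.ceil((len(s) + 1) * (1 - alpha))) - 1   # conformal quantile index
        thresholds[g] = s[min(k, len(s) - 1)]
    return thresholds

# Toy usage: group 1 is noisier, so its prediction intervals come out wider.
rng = np.random.default_rng(0)
n = 4000
groups = rng.integers(0, 2, n)
x = rng.normal(size=n)
y = x + np.where(groups == 1, 2.0, 0.5) * rng.normal(size=n)
pred = x                                                   # a fixed black-box predictor
thr = groupwise_conformal_thresholds(np.abs(y - pred), groups, alpha=0.1)
print({int(g): round(float(t), 2) for g, t in thr.items()})  # interval = pred +/- thr[group]
```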
Aaron Roth is the Henry Salvatori Professor of Computer and Cognitive Science in the Computer and Information Science department at the University of Pennsylvania, with a secondary appointment in the Wharton Statistics Department. He is affiliated with the Warren Center for Network and Data Sciences and is co-director of the Networked and Social Systems Engineering (NETS) program. He is also an Amazon Scholar at Amazon AWS. He is the recipient of a Presidential Early Career Award for Scientists and Engineers (PECASE) awarded by President Obama in 2016, an Alfred P. Sloan Research Fellowship, an NSF CAREER Award, and research awards from Yahoo, Amazon, and Google. His research focuses on the algorithmic foundations of data privacy, algorithmic fairness, game theory, learning theory, and machine learning. Together with Cynthia Dwork, he is the author of the book “The Algorithmic Foundations of Differential Privacy.” Together with Michael Kearns, he is the author of “The Ethical Algorithm.”
TILOS Seminar: Non-convex Optimization for Linear Quadratic Gaussian (LQG) Control
Yang Zheng, Assistant Professor, UC San Diego
Recent studies have started to apply machine learning techniques to the control of unknown dynamical systems and have achieved impressive empirical results. However, the convergence behavior, statistical properties, and robustness of these approaches are often poorly understood due to the non-convex nature of the underlying control problems. In this talk, we revisit Linear Quadratic Gaussian (LQG) control and present recent progress towards its landscape analysis from a non-convex optimization perspective. We view the LQG cost as a function of the controller parameters and study its analytical and geometrical properties. Due to the inherent symmetry induced by similarity transformations, the LQG landscape is very rich yet complicated. We show that 1) the set of stabilizing controllers has at most two path-connected components, and 2) despite the non-convexity, all minimal stationary points (i.e., controllable and observable controllers) are globally optimal. Based on this special non-convex optimization landscape, we further introduce a novel perturbed policy gradient (PGD) method to escape a large class of suboptimal stationary points (including high-order saddles). These results shed some light on the performance analysis of direct policy gradient methods for solving the LQG problem. The talk is based on our recent papers: https://arxiv.org/abs/2102.04393 and https://arxiv.org/abs/2204.00912.
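To fix ideas, the sketch below evaluates the LQG cost as a function of the dynamic output-feedback controller parameters (A_K, B_K, C_K) for a toy discrete-time plant and checks the similarity-transformation symmetry mentioned above: two controllers related by a change of controller coordinates incur identical cost. The numbers are arbitrary, and the linked papers should be consulted for the actual setting and analysis.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqg_cost(A, B, C, Q, R, W, V, Ak, Bk, Ck):
    """LQG cost of a dynamic output-feedback controller (Ak, Bk, Ck) for the plant
    x+ = A x + B u + w,  y = C x + v,  with running cost E[x'Qx + u'Ru].
    A discrete-time sketch of the 'cost as a function of controller parameters' view;
    returns inf when the closed loop is unstable."""
    n, m = A.shape[0], Ak.shape[0]
    Acl = np.block([[A, B @ Ck], [Bk @ C, Ak]])                 # closed-loop dynamics
    if np.max(np.abs(np.linalg.eigvals(Acl))) >= 1:
        return np.inf
    Ncl = np.block([[W, np.zeros((n, m))], [np.zeros((m, n)), Bk @ V @ Bk.T]])
    Qcl = np.block([[Q, np.zeros((n, m))], [np.zeros((m, n)), Ck.T @ R @ Ck]])
    Sigma = solve_discrete_lyapunov(Acl, Ncl)                   # stationary covariance
    return float(np.trace(Qcl @ Sigma))

# Two controllers related by a similarity transformation T have the same cost,
# which is the symmetry behind the rich LQG landscape mentioned above.
A = np.array([[0.9, 0.1], [0.0, 0.8]]); B = np.array([[0.0], [1.0]]); C = np.array([[1.0, 0.0]])
Q, R, W, V = np.eye(2), np.eye(1), 0.1 * np.eye(2), 0.1 * np.eye(1)
Ak, Bk, Ck = 0.5 * np.eye(2), np.array([[0.1], [0.1]]), np.array([[0.1, 0.1]])
T = np.array([[1.0, 0.5], [0.0, 2.0]]); Ti = np.linalg.inv(T)
print(lqg_cost(A, B, C, Q, R, W, V, Ak, Bk, Ck),
      lqg_cost(A, B, C, Q, R, W, V, Ti @ Ak @ T, Ti @ Bk, Ck @ T))
```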
Yang Zheng is an assistant professor in the ECE department at UC San Diego. He received his DPhil (Ph.D.) degree in Engineering Science from the University of Oxford in 2019, and his B.E. and M.S. degrees from Tsinghua University in 2013 and 2015, respectively. From February 2019 to August 2020, he was a postdoctoral researcher at Harvard University. He was a research associate at Imperial College London in 2021.
Dr. Zheng’s research interests include learning, optimization, and control of network systems, and their applications to cyber-physical systems, autonomous vehicles, and traffic systems. His work has been acknowledged by several awards, including the 2019 European Ph.D. Award on Control for Complex and Heterogeneous Systems, the Best Student Paper Award Finalist at the 2019 European Control Conference, the Best Student Paper Award at the 17th IEEE International Conference on Intelligent Transportation Systems, and the Best Paper Award at the 14th Intelligent Transportation Systems Asia-Pacific Forum. He received the National Scholarship, Outstanding Graduate at Tsinghua University, the Clarendon Scholarship at the University of Oxford, and the Chinese Government Award for Outstanding Self-financed Students Abroad.