Recorded Talks: Foundations of AI and Optimization
Flat Minima and Generalization: from Matrix Sensing to Neural Networks
Maryam Fazel, University of Washington
When do overparameterized neural networks avoid overfitting and generalize to unseen data? Empirical evidence suggests that the shape of the training loss function near the solution matters: minima where the loss is “flatter” tend to generalize better. Yet quantifying flatness and analyzing it rigorously, even in simple models, has proved difficult. In this talk, we examine overparameterized nonconvex models such as low-rank matrix sensing, matrix completion, and robust PCA, as well as a 2-layer neural network, as test cases. We show that under standard statistical assumptions, “flat” minima (minima with the smallest local average curvature, measured by the trace of the Hessian matrix) provably generalize in all these cases. These algorithm-agnostic results suggest a theoretical basis for favoring methods that bias iterates towards flat solutions, and they help inform the design of better training algorithms.
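As a concrete illustration of the flatness measure used above, here is a minimal sketch (not code from the talk) of Hutchinson's estimator for the trace of the Hessian, computed from gradient calls alone on a toy quadratic loss; all names and sizes are illustrative.

```python
# Illustrative sketch: estimate the flatness proxy tr(Hessian) with
# Hutchinson's estimator, using finite-difference Hessian-vector products.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
H_true = A.T @ A / 20.0                       # Hessian of the toy loss below

def grad(w):
    # gradient of the toy loss L(w) = 0.5 * w^T H_true w
    return H_true @ w

def hutchinson_trace(w, n_probes=200, eps=1e-5):
    """Estimate tr(H(w)) = E[v^T H(w) v] with Rademacher probes v."""
    d, est = w.size, 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=d)
        hvp = (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)  # H(w) v
        est += v @ hvp
    return est / n_probes

w_star = np.zeros(10)                          # minimizer of the toy loss
print(hutchinson_trace(w_star), np.trace(H_true))  # the two should agree
```

In large models the same estimator is typically run with automatic-differentiation Hessian-vector products rather than finite differences.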
Hunting the Hessian
Madeleine Udell, Stanford University
Ill-conditioned loss landscapes are ubiquitous in machine learning, and they slow down optimization. Preconditioning the gradient to make the loss more isotropic is a natural solution, but it is challenging for extremely large problems, where direct access to the problem Hessian is prohibitively expensive. We present two fresh approaches to preconditioning that use tools from randomized numerical linear algebra and online convex optimization to access Hessian information efficiently, motivated by the question: what is the most useful information we can query from the problem Hessian using linear memory and compute?
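To make the idea concrete, here is a minimal sketch, with details assumed by me, of one way randomized numerical linear algebra gives cheap Hessian access: a rank-k randomized Nystrom approximation built from k Hessian-vector products and applied as a damped preconditioner. It illustrates the general recipe, not the speakers' algorithm.

```python
# Sketch: rank-k Nystrom approximation of a PSD Hessian from k matvecs,
# used to precondition a gradient with only O(d*k) memory per step.
import numpy as np

rng = np.random.default_rng(0)
d, k, mu = 50, 10, 1e-2                       # dimension, sketch rank, damping
M = rng.standard_normal((d, d))
H = M @ M.T / d                               # stand-in for Hessian matvec access

Omega = np.linalg.qr(rng.standard_normal((d, k)))[0]   # random test matrix
Y = H @ Omega                                           # k Hessian-vector products
shift = 1e-8 * np.linalg.norm(Y)
C = np.linalg.cholesky(Omega.T @ Y + shift * np.eye(k))
B = Y @ np.linalg.inv(C.T)                    # H_nys = B @ B.T  (rank k)
U, s, _ = np.linalg.svd(B, full_matrices=False)
lam = np.maximum(s**2 - shift, 0.0)           # approximate top eigenvalues

def precondition(g):
    """Apply (H_nys + mu * I)^{-1} to g in O(d * k) time."""
    coeff = U.T @ g
    return U @ (coeff / (lam + mu)) + (g - U @ coeff) / mu

g = rng.standard_normal(d)
print(np.linalg.norm(precondition(g)))        # preconditioned gradient direction
```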
Accelerating Nonconvex Optimization via Online Learning
Aryan Mokhtari, UT Austin
A fundamental problem in optimization is finding an ε-first-order stationary point of a smooth function using only gradient information. The best-known gradient query complexity for this task, assuming both the gradient and Hessian of the objective function are Lipschitz continuous, is O(ε^(−7/4)). In this talk, I present a method with a gradient complexity of O(d^(1/4) ε^(−13/8)), where d is the problem dimension, yielding improved complexity when d = O(ε^(−1/2)). The proposed method builds on quasi-Newton ideas and operates by solving two online learning problems under the hood.
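The regime quoted above follows from a one-line comparison of the two rates (a worked check, not part of the talk):

```latex
d^{1/4}\,\epsilon^{-13/8} \;\le\; \epsilon^{-7/4}
\;\iff\; d^{1/4} \le \epsilon^{-7/4 + 13/8} = \epsilon^{-1/8}
\;\iff\; d \le \epsilon^{-1/2},
```

so the new method wins precisely when the dimension is at most on the order of ε^(−1/2).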
The Binary Iterative Hard Thresholding Algorithm
Arya Mazumdar, TILOS & UC San Diego
We will discuss our work on the convergence of iterative hard thresholding algorithms for sparse signal recovery problems. For classification problems with nonseparable data, this algorithm can be thought of as minimizing the so-called ReLU loss. It proves to be very effective (a simple iterative method that is statistically optimal) for a large class of models of nonseparable data, namely sparse generalized linear models. It is also robust to adversarial perturbations. Based on joint work with Namiko Matsumoto.
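For concreteness, here is a minimal sketch of the classical binary iterative hard thresholding (BIHT) iteration for one-bit measurements y = sign(Ax); the exact variant, step size, and statistical model analyzed in the talk may differ.

```python
# Sketch of classical BIHT for 1-bit compressed sensing.
import numpy as np

def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-s:]
    out[idx] = x[idx]
    return out

def biht(A, y, s, n_iters=100, step=1.0):
    m, d = A.shape
    x = np.zeros(d)
    for _ in range(n_iters):
        x = hard_threshold(x + (step / m) * A.T @ (y - np.sign(A @ x)), s)
    n = np.linalg.norm(x)
    return x / n if n > 0 else x      # direction is all that 1-bit data reveals

rng = np.random.default_rng(0)
m, d, s = 400, 100, 5
x_true = np.zeros(d); x_true[:s] = rng.standard_normal(s)
x_true /= np.linalg.norm(x_true)
A = rng.standard_normal((m, d))
x_hat = biht(A, np.sign(A @ x_true), s)
print(np.linalg.norm(x_hat - x_true))          # direction error; should be small
```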
Reverse diffusion Monte Carlo
Yian Ma, TILOS & UC San Diego
I will introduce a novel Monte Carlo sampling approach that uses the reverse diffusion process. In particular, the intermediary updates—the score functions—can be explicitly estimated to arbitrary accuracy, leading to an unbiased Bayesian inference algorithm. I will then discuss how to use this idea to improve sampling in the diffusion models via reverse transition kernels.
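For reference, a sketch of the standard reverse-time construction that such a sampler discretizes, written here for an Ornstein-Uhlenbeck forward process (the talk's contribution, the explicit score estimator, is not shown):

```latex
\text{forward:}\quad dX_t = -X_t\,dt + \sqrt{2}\,dB_t, \qquad X_0 \sim \pi\ (\text{target}), \quad t \in [0, T];

\text{reverse:}\quad dY_s = \bigl(Y_s + 2\,\nabla \log p_{T-s}(Y_s)\bigr)\,ds + \sqrt{2}\,d\bar{B}_s, \qquad Y_0 \sim p_T \approx \mathcal{N}(0, I),
```

where p_t is the law of X_t, so that Y_T is approximately distributed as the target π once the scores ∇ log p_t can be estimated.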
Linear Bregman Divergence Control
Babak Hassibi, Caltech
In the past couple of decades, the use of "non-quadratic" convex cost functions has revolutionized signal processing, machine learning, and statistics, allowing one to customize solutions to have desired structures and properties. However, the situation is not the same in control, where the use of quadratic costs still dominates, ostensibly because determining the "value function", i.e., the optimal expected cost-to-go, which is critical to the construction of the optimal controller, becomes computationally intractable as soon as one considers general convex costs. As a result, practitioners often resort to heuristics and approximations, such as model predictive control, which only looks a few steps into the future. In the quadratic case, the value function is easily determined by appealing to certainty equivalence and solving Riccati equations. In this talk, we consider a special class of convex cost functions constructed from Bregman divergences and show how, with appropriate choices, they can be used to fully extend the framework developed for the quadratic case. The resulting optimal controllers are infinite-horizon, come with stability guarantees, and have state-feedback, or estimated state-feedback, laws. They exhibit a much wider range of behavior than their quadratic counterparts since the feedback laws are nonlinear. We demonstrate the applicability of the approach to several cases of interest, including safety control, sparse control, and bang-bang control.
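For reference, the Bregman divergence generated by a strictly convex function φ, together with the quadratic special case it generalizes:

```latex
D_{\varphi}(x, y) \;=\; \varphi(x) - \varphi(y) - \langle \nabla\varphi(y),\, x - y \rangle,
\qquad
\varphi(x) = \tfrac{1}{2}\|x\|_2^2 \;\Rightarrow\; D_{\varphi}(x, y) = \tfrac{1}{2}\|x - y\|_2^2 .
```

Nonquadratic choices of φ are what produce the richer, nonlinear feedback behaviors described above while keeping the quadratic framework as a special case.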
Optimization and Reasoning
Sicun (Sean) Gao, TILOS & UC San Diego
For a while, the remarkable progress in ML/AI has led many to dismiss the old-fashioned concerns of NP-hardness, with the belief that sufficiently large nonlinear functions can be trained to encode solutions to everything of practical relevance. Yet, as these functions are increasingly deployed as black-box models with agency, their usability is once again constrained by our ability to answer fundamental questions that demand deeper understanding across the entire training and inference pipeline. These questions inevitably correspond to solving NP-hard problems that remain well beyond the reach of existing algorithms. The formal reasoning community has spent decades developing a rich arsenal of tools for tackling similar problems, but mostly for discrete symbolic computing systems. Extending the same rigor and algorithmic power to the continuous domain is a grand challenge that must be confronted. We need to unify optimization and reasoning in new generations of capable algorithms that bring together numerical/analytic, combinatorial/algebraic, and statistical/probabilistic approaches. Addressing these challenges can also establish new computational foundations for real-world engineering disciplines.
Foundational Methods for Foundation Models for Scientific Machine Learning
Michael W. Mahoney, LBNL and UC Berkeley
The remarkable successes of ChatGPT in natural language processing (NLP) and related developments in computer vision (CV) motivate the question of what foundation models would look like, and what new advances they would enable, when built on the rich, diverse, multimodal data available from large-scale experiments and simulations in scientific computing (SC), broadly defined. Such models could provide a robust and principled foundation for scientific machine learning (SciML), going well beyond simply using ML tools developed for internet and social media applications to help solve future scientific problems. I will describe recent work demonstrating the potential of the “pre-train and fine-tune” paradigm, widely used in CV and NLP, for SciML problems, which points to a clear path towards building SciML foundation models, as well as recent work highlighting multiple “failure modes” that arise when trying to interface data-driven ML methodologies with domain-driven SC methodologies, which reveals clear obstacles to traversing that path successfully. I will also describe initial work on developing novel methods to address several of these challenges, as well as their implementations at scale; a general solution to these challenges will be needed to build robust and reliable SciML models with millions, billions, or trillions of parameters.
Michael W. Mahoney is at the University of California at Berkeley in the Department of Statistics and at the International Computer Science Institute (ICSI). He is also an Amazon Scholar as well as head of the Machine Learning and Analytics Group at the Lawrence Berkeley National Laboratory. He works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, scientific machine learning, scalable stochastic optimization, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, computational methods for neural network analysis, physics informed machine learning, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale University with a dissertation in computational statistical mechanics, and he has worked and taught at Yale University in the mathematics department, at Yahoo Research, and at Stanford University in the mathematics department. Among other things, he was on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI), he was on the National Research Council’s Committee on the Analysis of Massive Data, he co-organized the Simons Institute’s fall 2013 and 2018 programs on the foundations of data science, he ran the Park City Mathematics Institute’s 2016 PCMI Summer Session on The Mathematics of Data, he ran the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets, and he was the Director of the NSF/TRIPODS-funded FODA (Foundations of Data Analysis) Institute at UC Berkeley. More information is available at https://www.stat.berkeley.edu/~mmahoney/.
Single location regression and attention-based models
Claire Boyer, Université Paris-Saclay
Attention-based models, such as Transformer, excel across various tasks but lack a comprehensive theoretical understanding, especially regarding token-wise sparsity and internal linear representations. To address this gap, we introduce the single-location regression task, where only one token in a sequence determines the output, and its position is a latent random variable, retrievable via a linear projection of the input. To solve this task, we propose a dedicated predictor, which turns out to be a simplified version of a non-linear self-attention layer. We study its theoretical properties, by showing its asymptotic Bayes optimality and analyzing its training dynamics. In particular, despite the non-convex nature of the problem, the predictor effectively learns the underlying structure. This work highlights the capacity of attention mechanisms to handle sparse token information and internal linear structures.
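One plausible formalization of the task, consistent with the abstract but with details assumed by me (the talk's exact model may differ): tokens X_1, ..., X_L, a latent position J, and

```latex
Y \;=\; \langle X_J,\, v \rangle + \varepsilon,
\qquad
J \;=\; \arg\max_{1 \le j \le L} \langle X_j,\, u \rangle,
```

so the output depends on a single token whose (random) position is recoverable from the input by the linear projection u, and a predictor must in effect learn both u and v.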
Synthetic Tasks as Testbeds for Attributing Model Behavior
Surbhi Goel, University of Pennsylvania
Understanding how different components of the machine learning pipeline—spanning data composition, architectural choices, and optimization dynamics—shape model behavior remains a fundamental challenge. In this talk, I will argue that synthetic tasks, which enable precise control over data distribution and task complexity, serve as powerful testbeds for analyzing and attributing behaviors in deep learning. Focusing on the sparse parity learning problem, a canonical task in learning theory, I will present insights into: (1) the phenomenon of “hidden progress” in gradient-based optimization, where models exhibit consistent advancement despite stagnating loss curves; (2) nuanced trade-offs between data, compute, model width, and initialization that govern learning success; and (3) the role of progressive distillation in implicitly structuring curricula to accelerate feature learning. These findings highlight the utility of synthetic tasks in uncovering empirical insights into the mechanisms driving deep learning, without the cost of training expensive models.
This talk is based on joint work with a lot of amazing collaborators: Boaz Barak, Ben Edelman, Sham Kakade, Bingbin Liu, Eran Malach, Sadhika Malladi, Abhishek Panigrahi, Andrej Risteski, and Cyril Zhang.
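As an illustration of the kind of synthetic testbed discussed above, here is a minimal sketch of the (k, d)-sparse parity data distribution; names and sizes are placeholders of mine.

```python
# Sketch of the sparse parity task: uniform sign vectors, label equal to
# the parity (product) of k hidden coordinates.
import numpy as np

def sparse_parity_batch(n, d, support, rng):
    """Sample n examples of d-dimensional sparse parity on `support`."""
    X = rng.choice([-1.0, 1.0], size=(n, d))
    y = np.prod(X[:, support], axis=1)        # labels in {+1, -1}
    return X, y

rng = np.random.default_rng(0)
d, k = 50, 3
support = rng.choice(d, size=k, replace=False)    # hidden relevant coordinates
X, y = sparse_parity_batch(1024, d, support, rng)
# Gradient-trained networks on (X, y) can show flat loss curves for a long
# time while accumulating the kind of hidden progress described above.
```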
Surbhi Goel is the Magerman Term Assistant Professor of Computer and Information Science at the University of Pennsylvania. She is associated with the theory group, the ASSET Center on safe, explainable, and trustworthy AI systems, and the Warren Center for Network and Data Sciences. Surbhi’s research focuses on theoretical foundations of modern machine learning paradigms, particularly deep learning, and is supported by Microsoft Research and OpenAI. Previously, she was a postdoctoral researcher at Microsoft Research NYC and completed her Ph.D. at the University of Texas at Austin under Adam Klivans, receiving the UTCS Bert Kay Dissertation Award. She has also been a visiting researcher at IAS, Princeton, and the Simons Institute at UC Berkeley. Surbhi co-founded the Learning Theory Alliance (LeT‐All) and holds several leadership roles, including Office Hours co-chair for ICLR 2024 and co-treasurer for the Association for Computational Learning Theory.
Unlearnable Facts Cause Hallucinations in Pretrained Language Models
Adam Tauman Kalai, OpenAI
Pretrained language models (LMs) tend to preserve many qualities present in their training data, such as grammaticality, formatting, and politeness. However, for specific types of factuality, even LMs pretrained on factually correct statements tend to produce falsehoods at high rates. We explain these “hallucinations” by drawing a connection to binary classification, enabling us to leverage insights from supervised learning. We prove that pretrained LMs (which are “calibrated”) fail to mimic criteria that cannot be learned. Our analysis explains why pretrained LMs hallucinate on facts such as people’s birthdays but not on systematic facts such as even vs. odd numbers.
Of course, LM pretraining is only one stage in the development of a chatbot, and thus hallucinations are *not* inevitable in chatbots.
This is joint work with Santosh Vempala.
Adam Tauman Kalai is a Research Scientist at OpenAI working on AI Safety and Ethics. He has worked in Algorithms, Fairness, Machine Learning Theory, Game Theory, and Crowdsourcing. He received his PhD from Carnegie Mellon University. He has served as an Assistant Professor at Georgia Tech and TTIC, and is on the science team of the whale-translation Project CETI. He has co-chaired AI and crowdsourcing conferences and has numerous honors, most notably the Majulook prize.
How Transformers Learn Causal Structure with Gradient Descent
Jason Lee, Princeton University
The incredible success of transformers on sequence modeling tasks can be largely attributed to the self-attention mechanism, which allows information to be transferred between different parts of a sequence. Self-attention allows transformers to encode causal structure which makes them particularly suitable for sequence modeling. However, the process by which transformers learn such causal structure via gradient-based training algorithms remains poorly understood. To better understand this process, we introduce an in-context learning task that requires learning latent causal structure. We prove that gradient descent on a simplified two-layer transformer learns to solve this task by encoding the latent causal graph in the first attention layer. The key insight of our proof is that the gradient of the attention matrix encodes the mutual information between tokens. As a consequence of the data processing inequality, the largest entries of this gradient correspond to edges in the latent causal graph. As a special case, when the sequences are generated from in-context Markov chains, we prove that transformers learn an induction head (Olsson et al., 2022). We confirm our theoretical findings by showing that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.
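A small sketch, with assumptions of mine, illustrating the key quantity in the result above: for sequences generated by an in-context Markov chain, the empirical mutual information between token positions is largest for adjacent positions, i.e., for the edges of the latent causal graph.

```python
# Sketch: mutual information between positions of a Markov chain is
# concentrated on adjacent (causally linked) positions.
import numpy as np

rng = np.random.default_rng(0)
V, T, n_seq = 5, 8, 20000                  # vocab size, length, num sequences
P = rng.dirichlet(np.ones(V) * 0.3, size=V)    # random transition matrix

seqs = np.zeros((n_seq, T), dtype=int)
seqs[:, 0] = rng.integers(V, size=n_seq)
for t in range(1, T):
    probs = P[seqs[:, t - 1]]
    seqs[:, t] = (rng.random((n_seq, 1)) < probs.cumsum(axis=1)).argmax(axis=1)

def mutual_info(a, b):
    """Plug-in estimate of the mutual information between two token columns."""
    joint = np.zeros((V, V))
    np.add.at(joint, (a, b), 1.0)
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return (joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum()

print(mutual_info(seqs[:, 2], seqs[:, 3]))     # adjacent positions: large
print(mutual_info(seqs[:, 2], seqs[:, 6]))     # distant positions: near zero
```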
Jason Lee is an associate professor in Electrical Engineering and Computer Science (secondary) at Princeton University. Prior to that, he was in the Data Science and Operations department at the University of Southern California and a postdoctoral researcher at UC Berkeley working with Michael I. Jordan. Jason received his PhD at Stanford University advised by Trevor Hastie and Jonathan Taylor. His research interests are in the theory of machine learning, optimization, and statistics. Lately, he has worked on the foundations of deep learning, representation learning, and reinforcement learning. He has received the Samsung AI Researcher of the Year Award, NSF Career Award, ONR Young Investigator Award in Mathematical Data Science, Sloan Research Fellowship, NeurIPS Best Student Paper Award and Finalist for the Best Paper Prize for Young Researchers in Continuous Optimization, and Princeton Commendation for Outstanding Teaching.
Off-the-shelf Algorithmic Stability
Rebecca Willett, University of Chicago
Algorithmic stability holds when our conclusions, estimates, fitted models, predictions, or decisions are insensitive to small changes to the training data. Stability has emerged as a core principle for reliable data science, providing insights into generalization, cross-validation, uncertainty quantification, and more. Whereas prior literature has developed mathematical tools for analyzing the stability of specific machine learning (ML) algorithms, we study methods that can be applied to arbitrary learning algorithms to achieve a desired level of stability. First, I will discuss how bagging is guaranteed to stabilize any prediction model, regardless of the input data. Thus, if we remove or replace a small fraction of the training data at random, the resulting prediction will typically change very little. Our analysis provides insight into how the size of the bags (bootstrap datasets) influences stability, giving practitioners a new tool for guaranteeing a desired level of stability. Second, I will describe how to extend these stability guarantees beyond prediction modeling to more general statistical estimation problems where bagging is not as well known but equally useful for stability. Specifically, I will describe a new framework for stable classification and model selection by combining bagging on class or model weights with a stable, “soft” version of the argmax operator.
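A minimal sketch of the recipe described above, with the base learner and all details assumed by me: bag an arbitrary classifier over bootstrap resamples, average the resulting class votes, and replace the final hard argmax with a softmax-style soft argmax so that the output varies smoothly with the training data.

```python
# Sketch: bagging + soft argmax for off-the-shelf stable classification.
import numpy as np

def fit_centroids(X, y, n_classes):
    """Toy base learner: per-class centroids (stand-in for any algorithm)."""
    return np.stack([X[y == c].mean(axis=0) if np.any(y == c) else
                     np.zeros(X.shape[1]) for c in range(n_classes)])

def bagged_soft_classifier(X, y, x_test, n_classes, B=100, temp=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    weights = np.zeros(n_classes)
    for _ in range(B):
        idx = rng.integers(n, size=n)                    # bootstrap resample
        centroids = fit_centroids(X[idx], y[idx], n_classes)
        scores = -np.linalg.norm(centroids - x_test, axis=1)
        weights += np.eye(n_classes)[np.argmax(scores)]  # hard vote per bag
    weights /= B                                         # averaged class weights
    probs = np.exp(weights / temp)
    return probs / probs.sum()                           # soft argmax output

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(bagged_soft_classifier(X, y, np.array([2.5, 2.5]), n_classes=2))
```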
Rebecca Willett is a Professor of Statistics and Computer Science and the Director of AI in the Data Science Institute at the University of Chicago, and she holds a courtesy appointment at the Toyota Technological Institute at Chicago. Her research is focused on machine learning foundations, scientific machine learning, and signal processing. Willett received the inaugural Data Science Career Prize from the Society of Industrial and Applied Mathematics in 2024, was named a Fellow of the Society of Industrial and Applied Mathematics in 2021, and was named a Fellow of the IEEE in 2022. She is the Deputy Director for Research at the NSF-Simons Foundation National Institute for Theory and Mathematics in Biology, Deputy Director for Research at the NSF-Simons Institute for AI in the Sky (SkAI), and a member of the NSF Institute for the Foundations of Data Science Executive Committee. She is the Faculty Director of the Eric and Wendy Schmidt AI in Science Postdoctoral Fellowship. She helps direct the Air Force Research Lab University Center of Excellence on Machine Learning. She received the National Science Foundation CAREER Award in 2007, was a DARPA Computer Science Study Group member, and received an Air Force Office of Scientific Research Young Investigator Program award in 2010. She completed her PhD in Electrical and Computer Engineering at Rice University in 2005. She was an Assistant and then tenured Associate Professor of Electrical and Computer Engineering at Duke University from 2005 to 2013. She was an Associate Professor of Electrical and Computer Engineering, Harvey D. Spangler Faculty Scholar, and Fellow of the Wisconsin Institutes for Discovery at the University of Wisconsin-Madison from 2013 to 2018.
What Kinds of Functions do Neural Networks Learn? Theory and Practical Applications
Robert Nowak, University of Wisconsin
This talk presents a theory characterizing the types of functions neural networks learn from data. Specifically, the function space generated by deep ReLU networks consists of compositions of functions from the Banach space of second-order bounded variation in the Radon transform domain. This Banach space includes functions with smooth projections in most directions. A representer theorem associated with this space demonstrates that finite-width neural networks suffice for fitting finite datasets. The theory has several practical applications. First, it provides a simple and theoretically grounded method for network compression. Second, it shows that multi-task training can yield significantly different solutions compared to single-task training, and that multi-task solutions can be related to kernel ridge regressions. Third, the theory has implications for improving implicit neural representations, where multi-layer neural networks are used to represent continuous signals, images, or 3D scenes. This exploration bridges theoretical insights with practical advancements, offering a new perspective on neural network capabilities and future research directions.
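Schematically, the representer theorem mentioned above has the following flavor for a single hidden layer (constants, the precise norm, and regularity conditions are suppressed; this is a paraphrase, not the talk's exact statement): a solution of the regularized data-fitting problem over the second-order Radon-domain bounded variation space can be taken to be a finite-width ReLU network,

```latex
\min_{f}\; \sum_{i=1}^{N} \ell\bigl(f(x_i),\, y_i\bigr) + \lambda\, \|f\|_{\mathcal{R}\mathrm{BV}^2}
\quad\Longrightarrow\quad
f(x) \;=\; \sum_{k=1}^{K} v_k\, \sigma\!\bigl(\langle w_k, x \rangle - b_k\bigr) + \langle c, x \rangle + c_0,
\qquad K \le N,
```

which is the sense in which finite-width networks suffice for fitting finite datasets.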
Robert Nowak is the Grace Wahba Professor of Data Science and Keith and Jane Nosbusch Professor in Electrical and Computer Engineering at the University of Wisconsin-Madison. His research focuses on machine learning, optimization, and signal processing. He serves on the editorial boards of the SIAM Journal on the Mathematics of Data Science and the IEEE Journal on Selected Areas in Information Theory.
Transformers Learn In-context by (Functional) Gradient Descent
Xiang Cheng, TILOS Postdoctoral Scholar at MIT
Motivated by the in-context learning phenomenon, we investigate how the Transformer neural network can implement learning algorithms in its forward pass. We show that a linear Transformer naturally learns to implement gradient descent, which enables it to learn linear functions in-context. More generally, we show that a non-linear Transformer can implement functional gradient descent with respect to some RKHS metric, which allows it to learn a broad class of functions in-context. Additionally, we show that the RKHS metric is determined by the choice of attention activation, and that the optimal choice of attention activation depends in a natural way on the class of functions that need to be learned. I will end by discussing some implications of our results for the choice and design of Transformer architectures.
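A concrete special case consistent with the result described above (notation mine): for in-context linear regression with context pairs (x_1, y_1), ..., (x_n, y_n) and query x_{n+1}, one step of gradient descent from W_0 = 0 on the in-context least-squares loss predicts

```latex
L(W) \;=\; \tfrac{1}{2}\sum_{i=1}^{n} \bigl(y_i - W x_i\bigr)^2,
\qquad
\hat{y}_{n+1} \;=\; \eta \sum_{i=1}^{n} y_i\, \langle x_i,\, x_{n+1} \rangle,
```

which is exactly the computation performed by a linear attention head. Replacing the inner product with a kernel κ(x_i, x_{n+1}) induced by a nonlinear attention activation turns the same computation into one step of functional gradient descent in the corresponding RKHS, matching the claim that the attention activation determines the RKHS metric.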
How Large Models of Language and Vision Help Agents to Learn to Behave
Roy Fox, Assistant Professor and Director of the Intelligent Dynamics Lab, UC Irvine
If learning from data is valuable, can learning from big data be very valuable? So far, it has been so in vision and language, for which foundation models can be trained on web-scale data to support a plethora of downstream tasks; not so much in control, for which scalable learning remains elusive. Can information encoded in vision and language models guide reinforcement learning of control policies? In this talk, I will discuss several ways for foundation models to help agents to learn to behave. Language models can provide better context for decision-making: we will see how they can succinctly describe the world state to focus the agent on relevant features; and how they can form generalizable skills that identify key subgoals. Vision and vision–language models can help the agent to model the world: we will see how they can block visual distractions to keep state representations task-relevant; and how they can hypothesize about abstract world models that guide exploration and planning.
Roy Fox is an Assistant Professor of Computer Science at the University of California, Irvine. His research interests include theory and applications of control learning: reinforcement learning (RL), control theory, information theory, and robotics. His current research focuses on structured and model-based RL, language for RL and RL for language, and optimization in deep control learning of virtual and physical agents.