Recorded Talks: Machine Learning
Transformers Learn Generalizable Chain-of-Thought Reasoning via Gradient Descent
Yuejie Chi, Yale University
Transformers have demonstrated remarkable chain-of-thought reasoning capabilities, yet our understanding of the mechanisms by which they acquire and extrapolate these capabilities remains limited. This talk presents a theoretical analysis of transformers trained via gradient descent on symbolic reasoning and state-tracking tasks of increasing problem complexity. Our analysis reveals how multiple attention heads coordinate to solve multiple subtasks in a single autoregressive pass, and how inherently sequential reasoning is bootstrapped through a recursive self-training curriculum. Our optimization-based guarantees demonstrate that even shallow multi-head transformers, when equipped with chain-of-thought, can be trained to effectively solve problems that would otherwise require deeper architectures.
Yuejie Chi is the Charles C. and Dorothea S. Dilley Professor of Statistics and Data Science at Yale University, with a secondary appointment in Computer Science, and a member of the Yale Institute for Foundations of Data Science. Before joining Yale, Dr. Chi was the Sense of Wonder Group Endowed Professor of Electrical and Computer Engineering in AI Systems at Carnegie Mellon University, with affiliations in MLD and CyLab. She also spent time as a visiting researcher at Meta's Fundamental AI Research (FAIR). Dr. Chi's research interests lie in the theoretical and algorithmic foundations of data science, generative AI, reinforcement learning, and signal processing, motivated by applications in scientific and engineering domains. Her current focus is on improving the performance, efficiency, and reliability of generative AI and decision making, driven by data-intensive but resource-constrained scenarios.
(De)regularized Wasserstein Gradient Flows via Reproducing Kernels
Bharath Sriperumbudur, Pennsylvania State University
Wasserstein gradient flows have become a popular tool in machine learning, with applications in sampling, variational inference, generative modeling, and reinforcement learning, among others. The Wasserstein gradient flow (WGF) minimizes a probability functional over the Wasserstein space, taking into account the intrinsic geometry of that space. In this work, we introduce approximate/regularized Wasserstein gradient flows in two different settings: (a) approximating the probability functional and (b) approximating the Wasserstein geometry. In (a), we consider the probability functional to be the chi^2-divergence, whose WGF is difficult to implement. To this end, we propose a (de)-regularization of the Maximum Mean Discrepancy (DrMMD) as an approximation of the chi^2-divergence and develop an approximate WGF, which is easy to implement and has applications in generative modeling. In setting (b), we use the Kullback-Leibler divergence as the probability functional and develop an approximation to the Wasserstein geometry, which allows for a more efficient implementation than the exact WGF, with applications in sampling. In both settings, we present a variety of theoretical results relating the approximate flows to the exact flows and demonstrate the superiority of the approximate flows via numerical simulations.
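For a concrete feel of kernel-based flows, here is a minimal numpy sketch of a plain MMD particle flow with a Gaussian kernel. This is illustrative only and is not the DrMMD flow from the talk (whose (de)regularization differs); the function names, kernel bandwidth, and step sizes are made up. Particles descend the witness function of the current particle set against target samples.

```python
import numpy as np

def gaussian_kernel_grad(x, y, sigma=1.0):
    """Gradient wrt x of k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    diff = x - y
    k = np.exp(-np.sum(diff**2) / (2 * sigma**2))
    return -(diff / sigma**2) * k

def mmd_flow_step(particles, targets, step=0.1, sigma=1.0):
    """One explicit Euler step of the MMD gradient flow on the particles."""
    n, m = len(particles), len(targets)
    new = particles.copy()
    for i, x in enumerate(particles):
        # Witness-function gradient: repulsion from particles, attraction to targets.
        grad = (sum(gaussian_kernel_grad(x, xj, sigma) for xj in particles) / n
                - sum(gaussian_kernel_grad(x, yj, sigma) for yj in targets) / m)
        new[i] = x - step * grad
    return new

rng = np.random.default_rng(0)
particles = rng.normal(loc=-3.0, size=(50, 2))   # initial source samples
targets = rng.normal(loc=+3.0, size=(50, 2))     # samples from the target measure
for _ in range(200):
    particles = mmd_flow_step(particles, targets)
print("mean of particles after flow:", particles.mean(axis=0))
```

The same particle-discretization template applies to other regularized flows; only the velocity field (here, the MMD witness gradient) changes.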
Bharath Sriperumbudur is a professor in the Department of Statistics (with a courtesy appointment in the Department of Mathematics) at the Pennsylvania State University. His research interests include non-parametric statistics, machine learning, statistical learning theory, optimal transport and gradient flows, regularization and inverse problems, reproducing kernel spaces in probability and statistics, functional and topological data analysis.
Neuromorphic LLMs
Jason Eshraghian, UC Santa Cruz
This talk will show you what neuromorphic computing can do when an academic lab accidentally pulls $2 million of GPU-hours. We will showcase a series of frontier reasoning LLMs developed out of an academic lab, from data curation and pre-training to post-training and alignment. These models surpass leading LLMs from Meta, Google, and other heavily-resourced labs in the ~10-billion-parameter regime, despite being 5x smaller.
We have deployed several models on neuromorphic hardware at just 2 watts, bringing state-of-the-art reasoning from the datacenter to the edge. Along the way, we dispel a series of widely-held assumptions about large-scale neuromorphic computation, revealing how it fundamentally differs from conventional deep learning, and why that difference matters.
Jason Eshraghian is an Assistant Professor and Fulbright Scholar in the Department of Electrical and Computer Engineering at the University of California, Santa Cruz. He is the developer of snnTorch, a Python library with over 500,000 downloads for training spiking neural networks. He is a dual-appointed IEEE CAS and EMBS Distinguished Lecturer, an Associate Editor of APL Machine Learning, the Chair of the IEEE Neural Systems and Applications Technical Committee, and a Scientific Advisory Board Member of BrainChip; he has received seven IEEE Best Paper Awards and leads the Neuromorphic Agents Team at Conscium.
Incentivizing Emergent Behaviors for LLMs via Reinforcement Learning
Yi Wu, Tsinghua University
Reinforcement Learning (RL) has become a powerful post-training method for eliciting advanced behaviors in large language models (LLMs). This talk presents recent results showing how RL can incentivize the emergence of LLM capabilities across three domains: (1) the multi-player deduction game Werewolf, where RL-trained LLM agents develop strategic behaviors and outperform strong human players; (2) agentic search, where large-scale RL enables a 32B model to perform multi-step search and answer non-trivial questions beyond commercial baselines; and (3) efficient reasoning, where RL mitigates over-thinking and improves both reliability and compute efficiency.
The papers can be found at:
- Werewolf: https://arxiv.org/abs/2310.18940 (ICML24), https://arxiv.org/abs/2502.04686 (ICML25)
- ASearcher: https://arxiv.org/abs/2508.07976
- Thinking Efficiency: https://www.arxiv.org/abs/2506.07104 (NeurIPS25)
All the projects are trained using our large-scale agentic RL system, AReaL, which is open-source at https://github.com/
Yi Wu is an assistant professor at the Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University. He obtained his Ph.D. from UC Berkeley and was a researcher at OpenAI from 2019 to 2020. His research focuses on reinforcement learning, multi-agent learning, and LLM agents. His representative works include the value iteration network, the MADDPG and MAPPO algorithms, OpenAI’s hide-and-seek project, and the AReaL project. He received the Best Paper Award at NIPS 2016, was a Best Demo Award finalist at ICRA 2024, and received the MIT TR35 Asia Pacific award in 2025.
Stochastic-Gradient and Diagonal-Scaling Algorithms for Constrained Optimization and Learning
Frank E. Curtis, Lehigh University
I will motivate and provide an overview of recent efforts in my research group on the design and analysis of stochastic-gradient-based algorithms for solving constrained optimization problems. I will focus in particular on our motivation for informed supervised learning, where constraints in the training problem can be used to impose prior knowledge on the properties that should be possessed by a trained prediction model. In addition, I will provide a detailed look at our newest extensions of heavy-ball and Adam schemes from the unconstrained to the equality-constrained setting, for which we have shown state-of-the-art convergence guarantees. I will demonstrate the impressive practical performance of our methods using a few informed supervised learning problems.
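As a rough illustration of the equality-constrained setting (not the exact algorithms from the talk; the toy problem, step size, and identity Hessian model are assumptions), the sketch below takes one stochastic-SQP-style step: the search direction solves the linearized KKT system built from a sampled gradient, so the iterate simultaneously reduces the sampled objective and the constraint violation.

```python
import numpy as np

def sqp_step(x, stoch_grad, c, J, alpha=0.1):
    """One stochastic-SQP-style step for min f(x) s.t. c(x) = 0.

    stoch_grad : sampled gradient of f at x
    c, J       : constraint values and constraint Jacobian at x
    Solves the KKT system of  min_d  g^T d + 0.5 ||d||^2  s.t.  J d + c = 0.
    """
    n, m = len(x), len(c)
    K = np.block([[np.eye(n), J.T],
                  [J, np.zeros((m, m))]])
    rhs = np.concatenate([-stoch_grad, -c])
    sol = np.linalg.solve(K, rhs)
    d = sol[:n]                                 # primal search direction
    return x + alpha * d

# Toy problem: minimize E[ 0.5 * ||x - xi||^2 ] subject to sum(x) = 1.
rng = np.random.default_rng(0)
x = np.zeros(3)
for _ in range(500):
    xi = rng.normal(size=3)                     # stochastic sample
    g = x - xi                                  # sampled gradient
    c = np.array([x.sum() - 1.0])               # equality constraint value
    J = np.ones((1, 3))                         # constraint Jacobian
    x = sqp_step(x, g, c, J)
print(x, "sum =", x.sum())
```

The heavy-ball and Adam extensions mentioned in the talk replace the simple sampled gradient and identity Hessian model in this template with momentum-averaged gradients and diagonal scalings.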
Frank E. Curtis is a Professor in the Department of Industrial and Systems Engineering at Lehigh University, where he has been employed since 2009. He received a bachelor’s degree from the College of William and Mary in 2003 with a double major in Computer Science and Mathematics, received a master’s degree in 2004 and Ph.D. degree in 2007 from the Department of Industrial Engineering and Management Science at Northwestern University, and spent two years as a Postdoctoral Researcher in the Courant Institute of Mathematical Sciences at New York University from 2007 until 2009. His research focuses on the design, analysis, and implementation of numerical methods for solving large-scale nonlinear optimization problems. He received an Early Career Award from the Advanced Scientific Computing Research (ASCR) program of the U.S. Department of Energy (DoE), and has received funding from various programs of the U.S. National Science Foundation (NSF), including through a TRIPODS Phase I grant awarded to him and his collaborators at Lehigh, Northwestern, and Boston University. He has also received funding from the U.S. Office of Naval Research (ONR) and DoE’s Advanced Research Projects Agency-Energy (ARPA-E). He received, along with Leon Bottou (Meta AI) and Jorge Nocedal (Northwestern), the 2021 SIAM/MOS Lagrange Prize in Continuous Optimization. He was awarded, with James V. Burke (U. of Washington), Adrian Lewis (Cornell), and Michael Overton (NYU), the 2018 INFORMS Computing Society Prize. He and team members Daniel Molzahn (Georgia Tech), Andreas Waechter (Northwestern), Ermin Wei (Northwestern), and Elizabeth Wong (UC San Diego) were awarded second place in the ARPA-E Grid Optimization Competition in 2020. He currently serves as Area Editor for Continuous Optimization for Mathematics of Operations Research and serves as an Associate Editor for Mathematical Programming, SIAM Journal on Optimization, Operations Research, IMA Journal of Numerical Analysis, and Mathematical Programming Computation. He previously served as the Vice Chair for Nonlinear Programming for the INFORMS Optimization Society, and is currently very active in professional societies and groups related to mathematical optimization, including INFORMS, the Mathematics Optimization Society, and the SIAM Activity Group on Optimization.
Training Neural Networks at Any Scale
Volkan Cevher, EPFL
At the heart of deep learning’s transformative impact lies the concept of scale, encompassing both data and computational resources, as well as their interaction with neural network architectures. Scale, however, presents critical challenges, such as increased instability during training and prohibitively expensive model-specific tuning. Given the substantial resources required to train such models, formulating high-confidence scaling hypotheses backed by rigorous theoretical research has become paramount.
To bridge theory and practice, the talk explores a key mathematical ingredient of scaling in tandem with scaling theory: the numerical solution algorithms commonly employed in deep learning, spanning domains from vision to language models. We unify these algorithms under a common master template, making their foundational principles transparent. In doing so, we reveal the interplay between adaptation to smoothness structures via online learning and the exploitation of optimization geometry through non-Euclidean norms. Our exposition moves beyond simply building larger models; it emphasizes strategic scaling, offering insights that promise to advance the field while economizing on resources.
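For intuition on what "exploiting optimization geometry through non-Euclidean norms" means in the simplest possible case, the toy sketch below contrasts a Euclidean gradient step with an l-infinity steepest-descent (sign) step on an ill-conditioned quadratic; the curvatures and step sizes are made up, and the algorithms in the talk are organized around far richer versions of this choice.

```python
import numpy as np

# Ill-conditioned quadratic: f(x) = 0.5 * x^T diag(h) x
h = np.array([1.0, 100.0])
grad = lambda x: h * x

def euclidean_step(x, lr=0.009):      # plain GD: steepest descent in the l2 norm
    return x - lr * grad(x)

def sign_step(x, lr=0.05):            # steepest descent in the l-infinity norm
    return x - lr * np.sign(grad(x))

x_e = x_s = np.array([1.0, 1.0])
for _ in range(100):
    x_e = euclidean_step(x_e)
    x_s = sign_step(x_s)
print("GD   :", x_e)
print("sign :", x_s)
```

The sign update moves every coordinate at the same rate regardless of curvature (in practice it is combined with momentum and a decaying step size), while plain GD is throttled by the largest curvature direction.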
Volkan Cevher received the B.Sc. (valedictorian) in electrical engineering from Bilkent University in Ankara, Turkey, in 1999 and the Ph.D. in electrical and computer engineering from the Georgia Institute of Technology in Atlanta, GA in 2005. He was a Research Scientist with the University of Maryland, College Park from 2006-2007 and also with Rice University in Houston, TX, from 2008-2009. Currently, he is an Associate Professor at the Swiss Federal Institute of Technology Lausanne and a Faculty Fellow in the Electrical and Computer Engineering Department at Rice University. His research interests include machine learning, signal processing theory, optimization theory and methods, and information theory. Dr. Cevher is an ELLIS fellow and was the recipient of the Google Faculty Research award in 2018, the IEEE Signal Processing Society Best Paper Award in 2016, a Best Paper Award at CAMSAP in 2015, a Best Paper Award at SPARS in 2009, and an ERC CG in 2016 as well as an ERC StG in 2011.
Certifiably Correct Machine Perception
David Rosen, Northeastern University
Many fundamental machine perception and state estimation tasks require the solution of a high-dimensional nonconvex estimation problem; this class includes (for example) the fundamental problems of simultaneous localization and mapping (in robotics), 3D reconstruction (in computer vision), and sensor network localization (in distributed sensing). Such problems are known to be computationally hard in general, with many local minima that can entrap the smooth local optimization methods commonly applied to solve them. The result is that standard machine perception algorithms (based upon local optimization) can be surprisingly brittle, often returning egregiously wrong answers even when the problem to which they are applied is well-posed.
In this talk, we present a novel class of certifiably correct estimation algorithms that are capable of efficiently recovering provably good (often globally optimal) solutions of generally-intractable machine perception problems in many practical settings. Our approach directly tackles the problem of nonconvexity by employing convex relaxations whose minimizers provide provably good approximate solutions to the original estimation problem under moderate measurement noise. We illustrate the design of this class of methods using the fundamental problem of pose-graph optimization (a mathematical abstraction of robotic mapping) as a running example. We conclude with a brief discussion of open questions and future research directions.
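As a small taste of the relaxation idea, and much simpler than the pose-graph problems in the talk, the sketch below solves 2D rotation (phase) synchronization from noisy relative-angle measurements by relaxing the nonconvex problem to a top-eigenvector computation and rounding back to angle estimates. The problem sizes and noise level are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
theta = rng.uniform(0, 2 * np.pi, n)          # ground-truth angles (unknown)

# Noisy relative measurements of theta_i - theta_j for all pairs.
noise = 0.05 * rng.normal(size=(n, n))
rel = theta[:, None] - theta[None, :] + noise

# Hermitian data matrix C_ij ~ exp(i * (theta_i - theta_j)); relax the
# nonconvex problem max_z z^* C z over |z_i| = 1 to its top eigenvector.
C = np.exp(1j * rel)
C = (C + C.conj().T) / 2
w, V = np.linalg.eigh(C)
z = V[:, -1]
est = np.angle(z / z[0])                      # fix the global rotation ambiguity

err = np.angle(np.exp(1j * (est - (theta - theta[0]))))
print("max angular error (radians):", np.abs(err).max())
```

Certifiably correct methods go one step further than this spectral relaxation: they solve a convex (typically semidefinite) relaxation and check an optimality certificate that verifies when the relaxed solution is in fact globally optimal for the original problem.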
David M. Rosen is an Assistant Professor in the Departments of Electrical & Computer Engineering and Mathematics and the Khoury College of Computer Sciences (by courtesy) at Northeastern University, where he leads the Robust Autonomy Laboratory (NEURAL). Prior to joining Northeastern, he was a Research Scientist at Oculus Research (now Meta Reality Labs) from 2016 to 2018, and a Postdoctoral Associate at MIT’s Laboratory for Information and Decision Systems (LIDS) from 2018 to 2021. He holds the degrees of B.S. in Mathematics from the California Institute of Technology (2008), M.A. in Mathematics from the University of Texas at Austin (2010), and ScD in Computer Science from the Massachusetts Institute of Technology (2016).
He is broadly interested in the mathematical and algorithmic foundations of trustworthy machine perception, learning, and control. His work has been recognized with the IEEE Transactions on Robotics Best Paper Award (2024), an Honorable Mention for the IEEE Transactions on Robotics Best Paper Award (2021), a Best Student Paper Award at Robotics: Science and Systems (2020), a Best Paper Award at the International Workshop on the Algorithmic Foundations of Robotics (2016), and selection as an RSS Pioneer (2019).
High-dimensional Optimization with Applications to Compute-Optimal Neural Scaling Laws
Courtney Paquette, McGill University
Given the massive scale of modern ML models, we now only get a single shot to train them effectively. This restricts our ability to test multiple architectures and hyper-parameter configurations. Instead, we need to understand how these models scale, allowing us to experiment with smaller problems and then apply those insights to larger-scale models. In this talk, I will present a framework for analyzing scaling laws in stochastic learning algorithms using a power-law random features model (PLRF), leveraging high-dimensional probability and random matrix theory. I will then use this scaling law to address the compute-optimal question: How should we choose model size and hyper-parameters to achieve the best possible performance in the most compute-efficient manner? Then using this PLRF model, I will devise a new momentum-based algorithm that (provably) improves the scaling law exponent. Finally, I will present some numerical experiments on LSTMs that show how this new stochastic algorithm can be applied to real data to improve the compute-optimal exponent.
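A toy sketch of the kind of experiment this framework analyzes (the exponents alpha and beta, dimensions, and learning rate are hypothetical, and this is not the exact PLRF construction from the talk): linear regression with a power-law feature spectrum and power-law target coefficients, trained by one-pass SGD, with the loss tracked against a rough compute proxy.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 500
alpha, beta = 1.2, 0.8                         # hypothetical power-law exponents
scales = np.arange(1, d + 1) ** (-alpha / 2)   # feature std devs decay as j^(-alpha/2)
w_star = np.arange(1, d + 1) ** (-beta)        # target coefficients decay as j^(-beta)

def sample_batch(batch):
    x = rng.normal(size=(batch, d)) * scales   # power-law covariance features
    y = x @ w_star + 0.01 * rng.normal(size=batch)
    return x, y

w = np.zeros(d)
lr, batch = 0.2, 8
for step in range(1, 20001):
    x, y = sample_batch(batch)
    grad = x.T @ (x @ w - y) / batch
    w -= lr * grad
    if step % 5000 == 0:
        xt, yt = sample_batch(4096)            # fresh data estimates the population loss
        loss = np.mean((xt @ w - yt) ** 2)
        print(f"steps={step:6d}  compute~{step * batch * d:.1e}  loss={loss:.4f}")
```

Scaling-law analyses fit the loss-versus-compute curves produced by runs like this one as the model size and data budget grow, and then ask which allocation of compute between the two is optimal.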
Courtney Paquette is an assistant professor at McGill University in the Mathematics and Statistics department, a CIFAR AI Chair (MILA), and an active member of the Montreal Machine Learning Optimization Group (MTL MLOpt) at MILA. Her research broadly focuses on designing and analyzing algorithms for large-scale optimization problems, motivated by applications in data science, using techniques that draw from a variety of fields, including probability, complexity theory, and convex and nonsmooth analysis. Dr. Paquette has been a lead organizer of the OPT-ML Workshop at NeurIPS since 2020, and is a lead organizer (and original creator) of the High-dimensional Learning Dynamics (HiLD) Workshop at ICML.
A New Paradigm for Learning with Distribution Shift
Adam Klivans, UT Austin
We revisit the fundamental problem of learning with distribution shift, where a learner is given labeled samples from training distribution D, unlabeled samples from test distribution D′ and is asked to output a classifier with low test error. The standard approach in this setting is to prove a generalization bound in terms of some notion of distance between D and D′. These distances, however, are difficult to compute, and this has been the main stumbling block for efficient algorithm design over the last two decades.
We sidestep this issue and define a new model called TDS learning, where a learner runs a test on the training set and is allowed to reject if this test detects distribution shift relative to a fixed output classifier. This approach leads to the first set of efficient algorithms for learning with distribution shift that do not make any assumptions on the test distribution. Finally, we discuss how our techniques have recently been used to solve longstanding problems in supervised learning with contamination.
Adam Klivans is a Professor of Computer Science at the University of Texas at Austin and Director of the NSF AI Institute for Foundations of Machine Learning (IFML). His research interests lie in machine learning and theoretical computer science, in particular, Learning Theory, Computational Complexity, Pseudorandomness, Limit Theorems, and Gaussian Space. Dr. Klivans is a recipient of the NSF CAREER Award and serves on the editorial board for the Theory of Computing and Machine Learning Journal.
Flat Minima and Generalization: from Matrix Sensing to Neural Networks
Maryam Fazel, University of Washington
When do overparameterized neural networks avoid overfitting and generalize to unseen data? Empirical evidence suggests that the shape of the training loss function near the solution matters—the minima where the loss is “flatter” tend to lead to better generalization. Yet quantifying flatness and its rigorous analysis, even in simple models, has been limited. In this talk, we examine overparameterized nonconvex models such as low-rank matrix sensing, matrix completion, robust PCA, as well as a 2-layer neural network as test cases. We show that under standard statistical assumptions, "flat" minima (minima with the smallest local average curvature, measured by the trace of the Hessian matrix) provably generalize in all these cases. These algorithm-agnostic results suggest a theoretical basis for favoring methods that bias iterates towards flat solutions, and help inform the design of better training algorithms.
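The flatness measure used here, the trace of the loss Hessian, can be estimated without ever forming the Hessian. Below is a minimal PyTorch sketch using Hutchinson's estimator with Hessian-vector products; the toy model, data, and sample count are assumptions for illustration, not the paper's code.

```python
import torch

def hessian_trace(loss_fn, params, n_samples=50):
    """Hutchinson estimate of tr(H): average of v^T H v over Rademacher vectors v."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    est = 0.0
    for _ in range(n_samples):
        vs = [torch.randint(0, 2, p.shape).float() * 2 - 1 for p in params]  # +/-1 entries
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params, retain_graph=True)             # H v via autodiff
        est += sum((h * v).sum() for h, v in zip(hvs, vs)).item()
    return est / n_samples

# Toy model and data.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
x, y = torch.randn(128, 10), torch.randn(128, 1)
loss_fn = lambda: torch.nn.functional.mse_loss(model(x), y)
print("estimated tr(H):", hessian_trace(loss_fn, list(model.parameters())))
```

Comparing this estimate across minima found by different training runs is one practical way to probe the "flatter minima generalize better" hypothesis the talk makes rigorous.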
Hunting the Hessian
Madeleine Udell, Stanford University
Ill-conditioned loss landscapes are ubiquitous in machine learning, and they slow down optimization. Preconditioning the gradient to make the loss more isotropic is a natural solution, but is challenging for extremely large problems, as direct access to the problem Hessian is prohibitively expensive. We present two fresh approaches to preconditioning using tools from randomized numerical linear algebra and online convex optimization for efficient access to Hessian information, motivated by the question: what is the most useful information we can query from the problem Hessian using linear memory and compute?
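One way to get cheap curvature information with only matrix-vector access, in the spirit of randomized numerical linear algebra (a sketch of the general idea, not the specific methods from the talk; the spectrum, rank, and damping are made up): build a low-rank Nystrom approximation of the Hessian from a handful of Hessian-vector products and use it, plus damping, as a preconditioner.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, rho = 200, 10, 1e-2

# Toy ill-conditioned quadratic; in practice hvp() would use autodiff, not an explicit H.
U = np.linalg.qr(rng.normal(size=(d, d)))[0]
eigs = np.concatenate([np.logspace(2, 1, 10), np.full(d - 10, 1e-2)])
H = (U * eigs) @ U.T
hvp = lambda v: H @ v                          # Hessian-vector product oracle
grad = rng.normal(size=d)

# Randomized Nystrom approximation  H_nys = Y (Omega^T Y)^+ Y^T  from `rank` HVPs.
Omega = np.linalg.qr(rng.normal(size=(d, rank)))[0]
Y = np.column_stack([hvp(Omega[:, j]) for j in range(rank)])
H_nys = Y @ np.linalg.pinv(Omega.T @ Y) @ Y.T

# Preconditioned search direction: solve (H_nys + rho I) p = grad.
p = np.linalg.solve(H_nys + rho * np.eye(d), grad)

# Effect on conditioning of the (preconditioned) Hessian.
pre = np.linalg.solve(H_nys + rho * np.eye(d), H)
evals = np.sort(np.real(np.linalg.eigvals(pre)))
print("condition number: original %.1e  preconditioned %.1e"
      % (eigs.max() / eigs.min(), evals[-1] / evals[0]))
```

The online-convex-optimization view mentioned in the abstract instead treats the choice of preconditioner itself as a sequential decision updated from the same kind of Hessian queries.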
One Small Step, One Giant Leap: From Test-Time Tweaks to Global Guarantees
Mahdi Soltanolkotabi, USC
Simple first-order methods like Gradient Descent (GD) remain foundational to modern machine learning. Yet, despite their widespread use, our theoretical understanding of the GD trajectory—how and why it works—remains incomplete in both classical and contemporary settings. This talk explores new horizons in understanding the behavior and power of GD across two distinct but connected fronts.
In the first part, we examine the surprising power of a single gradient step in enhancing model reasoning. We focus on test-time training (TTT)—a gradient-based approach that adapts model parameters using individual test instances. We introduce a theoretical framework that reveals how TTT can effectively handle distribution shifts and significantly reduce the data required for in-context learning, shedding light on why such simple methods often outperform expectations.
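A minimal sketch of the test-time-training idea analyzed here (an illustrative setup with a linear stand-in model, not the paper's construction; the loss, learning rate, and data are assumptions): before predicting on a prompt, clone the model, take one gradient step on a loss built from the prompt's in-context demonstrations, and predict with the adapted copy.

```python
import copy
import torch

torch.manual_seed(0)
model = torch.nn.Linear(5, 1)          # stand-in for a pretrained predictor

def ttt_predict(model, demos_x, demos_y, query_x, lr=0.1):
    """Adapt a copy of the model on the test prompt's demonstrations, then predict."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    loss = torch.nn.functional.mse_loss(adapted(demos_x), demos_y)
    loss.backward()
    opt.step()                          # a single gradient step on the test instance
    with torch.no_grad():
        return adapted(query_x)

# A test prompt drawn from a shifted task distribution.
w_shifted = torch.randn(5, 1)
demos_x = torch.randn(8, 5)
demos_y = demos_x @ w_shifted
query_x = torch.randn(1, 5)

print("before TTT:", model(query_x).item())
print("after  TTT:", ttt_predict(model, demos_x, demos_y, query_x).item())
print("target    :", (query_x @ w_shifted).item())
```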
The second part turns to a more classical optimization setting: learning shallow neural networks with GD. Despite extensive study, even fitting a one-hidden-layer model to basic target functions lacks rigorous performance guarantees. We present a comprehensive analysis of the GD trajectory in this regime, showing how it avoids suboptimal stationary points and converges efficiently to global optima. Our results offer new theoretical foundations for understanding how GD succeeds in the presence of sub-optimal stationary points.
Unleashing the Power of Variance Reduction for Training Large Models
Quanquan Gu, Caltech
Training deep neural networks, and more recently large language models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this talk, I will introduce a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within this framework, I will present three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. In addition, I will draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
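The core mechanism can be sketched in a few lines on a toy stochastic quadratic (a schematic only; the actual MARS updates, scalings, clipping, and hyperparameters differ): each step evaluates the sampled gradient at both the current and previous iterate, uses their difference as a variance-reduction correction to the momentum, in the spirit of STORM-style recursive momentum, and then applies an Adam-style preconditioned update.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
A = np.diag(np.linspace(0.1, 10.0, d))           # toy quadratic curvature

def stoch_grad(x, xi):
    return A @ x + xi                             # gradient with additive noise

x, x_prev = np.ones(d), np.ones(d)
m, v = np.zeros(d), np.zeros(d)
lr, beta1, beta2, gamma, eps = 0.05, 0.9, 0.999, 0.05, 1e-8

for t in range(1, 2001):
    xi = rng.normal(size=d)                       # the SAME sample at both iterates
    g_curr, g_prev = stoch_grad(x, xi), stoch_grad(x_prev, xi)
    # Variance-reduced correction built from the gradient difference.
    c = g_curr + gamma * (beta1 / (1 - beta1)) * (g_curr - g_prev)
    m = beta1 * m + (1 - beta1) * c               # recursive momentum
    v = beta2 * v + (1 - beta2) * c**2            # Adam-style second moment
    x_prev = x.copy()
    x = x - lr * m / (np.sqrt(v) + eps)

print("final loss:", 0.5 * x @ A @ x)
```

Swapping the final update rule (here Adam-style) for Lion- or Shampoo-style preconditioning gives the other instances referenced in the abstract.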
Foundational Methods for Foundation Models for Scientific Machine Learning
Michael W. Mahoney, LBNL and UC Berkeley
The remarkable successes of ChatGPT in natural language processing (NLP) and related developments in computer vision (CV) motivate the question of what foundation models would look like and what new advances they would enable, when built on the rich, diverse, multimodal data that are available from large-scale experimental and simulational data in scientific computing (SC), broadly defined. Such models could provide a robust and principled foundation for scientific machine learning (SciML), going well beyond simply using ML tools developed for internet and social media applications to help solve future scientific problems. I will describe recent work demonstrating the potential of the “pre-train and fine-tune” paradigm, widely-used in CV and NLP, for SciML problems, demonstrating a clear path towards building SciML foundation models; as well as recent work highlighting multiple “failure modes” that arise when trying to interface data-driven ML methodologies with domain-driven SC methodologies, demonstrating clear obstacles to traversing that path successfully. I will also describe initial work on developing novel methods to address several of these challenges, as well as their implementations at scale, a general solution to which will be needed to build robust and reliable SciML models consisting of millions or billions or trillions of parameters.
Michael W. Mahoney is at the University of California at Berkeley in the Department of Statistics and at the International Computer Science Institute (ICSI). He is also an Amazon Scholar as well as head of the Machine Learning and Analytics Group at the Lawrence Berkeley National Laboratory. He works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, scientific machine learning, scalable stochastic optimization, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, computational methods for neural network analysis, physics informed machine learning, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale University with a dissertation in computational statistical mechanics, and he has worked and taught at Yale University in the mathematics department, at Yahoo Research, and at Stanford University in the mathematics department. Among other things, he was on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI), he was on the National Research Council’s Committee on the Analysis of Massive Data, he co-organized the Simons Institute’s fall 2013 and 2018 programs on the foundations of data science, he ran the Park City Mathematics Institute’s 2016 PCMI Summer Session on The Mathematics of Data, he ran the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets, and he was the Director of the NSF/TRIPODS-funded FODA (Foundations of Data Analysis) Institute at UC Berkeley. More information is available at https://www.stat.berkeley.edu/~mmahoney/.
Unlearnable Facts Cause Hallucinations in Pretrained Language Models
Adam Tauman Kalai, OpenAI
Pretrained language models (LMs) tend to preserve many qualities present in their training data, such as grammaticality, formatting, and politeness. However, for specific types of factuality, even LMs pretrained on factually correct statements tend to produce falsehoods at high rates. We explain these “hallucinations” by drawing a connection to binary classification, enabling us to leverage insights from supervised learning. We prove that pretrained LMs (which are “calibrated”) fail to mimic criteria that cannot be learned. Our analysis explains why pretrained LMs hallucinate on facts such as people’s birthdays but not on systematic facts such as even vs. odd numbers.
Of course, LM pretraining is only one stage in the development of a chatbot, and thus hallucinations are *not* inevitable in chatbots.
This is joint work with Santosh Vempala.
Adam Tauman Kalai is a Research Scientist at OpenAI working on AI Safety and Ethics. He has worked in Algorithms, Fairness, Machine Learning Theory, Game Theory, and Crowdsourcing. He received his PhD from Carnegie Mellon University. He has served as an Assistant Professor at Georgia Tech and TTIC, and is on the science team of the whale-translation Project CETI. He has co-chaired AI and crowdsourcing conferences and has numerous honors, most notably the Majulook prize.
How Transformers Learn Causal Structure with Gradient Descent
Jason Lee, Princeton University
The incredible success of transformers on sequence modeling tasks can be largely attributed to the self-attention mechanism, which allows information to be transferred between different parts of a sequence. Self-attention allows transformers to encode causal structure which makes them particularly suitable for sequence modeling. However, the process by which transformers learn such causal structure via gradient-based training algorithms remains poorly understood. To better understand this process, we introduce an in-context learning task that requires learning latent causal structure. We prove that gradient descent on a simplified two-layer transformer learns to solve this task by encoding the latent causal graph in the first attention layer. The key insight of our proof is that the gradient of the attention matrix encodes the mutual information between tokens. As a consequence of the data processing inequality, the largest entries of this gradient correspond to edges in the latent causal graph. As a special case, when the sequences are generated from in-context Markov chains, we prove that transformers learn an induction head (Olsson et al., 2022). We confirm our theoretical findings by showing that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.
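The mutual-information insight can be checked in a toy setting (a small sketch, not the paper's experiments; the alphabet size, sequence length, and Dirichlet concentration are arbitrary): sample sequences from a random Markov chain and estimate the mutual information between the last token and each earlier position. The immediate parent carries the most information, and by the data processing inequality this is exactly the signal the attention gradient picks up.

```python
import numpy as np

rng = np.random.default_rng(0)
k, T, n_seq = 5, 6, 20000                       # alphabet size, length, number of sequences

P = rng.dirichlet(np.ones(k) * 0.3, size=k)     # random Markov transition matrix
seqs = np.zeros((n_seq, T), dtype=int)
seqs[:, 0] = rng.integers(k, size=n_seq)
for t in range(1, T):
    for s in range(k):
        idx = seqs[:, t - 1] == s
        seqs[idx, t] = rng.choice(k, size=idx.sum(), p=P[s])

def mutual_info(a, b):
    """Plug-in estimate of I(a; b) for discrete samples a, b in {0,...,k-1}."""
    joint = np.zeros((k, k))
    for i, j in zip(a, b):
        joint[i, j] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

# MI between the last token and each earlier position: the parent (lag 1) dominates.
for lag in range(1, T):
    print(f"I(x_T; x_(T-{lag})) = {mutual_info(seqs[:, -1], seqs[:, -1 - lag]):.3f}")
```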
Jason Lee is an associate professor in Electrical Engineering and Computer Science (secondary) at Princeton University. Prior to that, he was in the Data Science and Operations department at the University of Southern California and a postdoctoral researcher at UC Berkeley working with Michael I. Jordan. Jason received his PhD at Stanford University advised by Trevor Hastie and Jonathan Taylor. His research interests are in the theory of machine learning, optimization, and statistics. Lately, he has worked on the foundations of deep learning, representation learning, and reinforcement learning. He has received the Samsung AI Researcher of the Year Award, NSF Career Award, ONR Young Investigator Award in Mathematical Data Science, Sloan Research Fellowship, NeurIPS Best Student Paper Award and Finalist for the Best Paper Prize for Young Researchers in Continuous Optimization, and Princeton Commendation for Outstanding Teaching.