Recorded Talks: Optimization for ML and AI Seminar Series
Transformers Learn Generalizable Chain-of-Thought Reasoning via Gradient Descent
Yuejie Chi, Yale University
Transformers have demonstrated remarkable chain-of-thought reasoning capabilities, yet our understanding of the mechanisms by which they acquire and extrapolate these capabilities remains limited. This talk presents a theoretical analysis of transformers trained via gradient descent on symbolic reasoning and state-tracking tasks of increasing problem complexity. Our analysis reveals how multi-head attention coordinates to solve multiple subtasks within a single autoregressive pass, and how inherently sequential reasoning is bootstrapped through a recursive self-training curriculum. Our optimization-based guarantees demonstrate that even shallow multi-head transformers, when equipped with chain-of-thought, can be trained to effectively solve problems that would otherwise require deeper architectures.
Yuejie Chi is the Charles C. and Dorothea S. Dilley Professor of Statistics and Data Science at Yale University, with a secondary appointment in Computer Science, and a member of the Yale Institute for Foundations of Data Science. Before joining Yale, Dr. Chi was the Sense of Wonder Group Endowed Professor of Electrical and Computer Engineering in AI Systems at Carnegie Mellon University, with affiliations in MLD and CyLab. She also spent some time as a visiting researcher at Meta's Fundamental AI Research (FAIR). Dr. Chi's research interests lie in the theoretical and algorithmic foundations of data science, generative AI, reinforcement learning, and signal processing, motivated by applications in scientific and engineering domains. Her current focus is on improving the performance, efficiency, and reliability of generative AI and decision making in data-intensive but resource-constrained scenarios.
(De)regularized Wasserstein Gradient Flows via Reproducing Kernels
Bharath Sriperumbudur, Pennsylvania State University
Wasserstein gradient flows have become a popular tool in machine learning with applications in sampling, variational inference, generative modeling, and reinforcement learning, among others. The Wasserstein gradient flow (WGF) minimizes a probability functional over the Wasserstein space, taking into account its intrinsic geometry. In this work, we introduce approximate/regularized Wasserstein gradient flows in two different settings: (a) approximating the probability functional and (b) approximating the Wasserstein geometry. In (a), we take the probability functional to be the chi^2-divergence, whose WGF is difficult to implement. To this end, we propose a (de)-regularization of the Maximum Mean Discrepancy (DrMMD) as an approximation of the chi^2-divergence and develop an approximate WGF, which is easy to implement and has applications in generative modeling. In the setting of (b), we use the Kullback-Leibler divergence as the probability functional and develop an approximation to the Wasserstein geometry, which allows for a more efficient implementation than the exact WGF, with applications in sampling. In both settings, we present a variety of theoretical results that relate the approximate flow to the exact flow and demonstrate the superiority of the approximate flows via numerical simulations.
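As a rough illustration of the kernel quantity underlying DrMMD (not the flow itself), the sketch below estimates the squared Maximum Mean Discrepancy between two samples using a Gaussian kernel; the kernel choice, bandwidth, and sample sizes are illustrative assumptions, not details from the talk.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise squared distances between the rows of X and Y
    d2 = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    # Biased estimate of the squared Maximum Mean Discrepancy
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))  # "particle" sample
Y = rng.normal(0.5, 1.0, size=(200, 2))  # target sample
print(mmd2(X, Y))  # positive; shrinks as the two distributions match
```

In a particle-based flow, one would differentiate such a discrepancy with respect to the particle locations X and move them downhill; the snippet only shows the discrepancy itself.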
Bharath Sriperumbudur is a professor in the Department of Statistics (with a courtesy appointment in the Department of Mathematics) at the Pennsylvania State University. His research interests include non-parametric statistics, machine learning, statistical learning theory, optimal transport and gradient flows, regularization and inverse problems, reproducing kernel spaces in probability and statistics, and functional and topological data analysis.
Extended Convex Lifting for Policy Optimization in Control
Yang Zheng, UC San Diego
Direct policy search has achieved great empirical success in reinforcement learning. Many recent studies have revisited its theoretical foundation for continuous control, which reveals elegant nonconvex geometry in various benchmark problems. In this talk, we introduce an Extended Convex Lifting (ECL) framework, which reveals hidden convexity in classical optimal and robust control problems from a modern optimization perspective. Our ECL offers a bridge between nonconvex policy optimization and convex reformulations. Despite non-convexity and non-smoothness, the existence of an ECL not only reveals that minimizing the original function is equivalent to solving a convex problem, but also certifies a class of first-order non-degenerate stationary points to be globally optimal. This ECL framework encompasses many benchmark control problems, including LQR, LQG, and state-feedback and output-feedback H-infinity robust control. We believe that the ECL framework may be of independent interest for analyzing nonconvex problems beyond control.
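As a toy illustration of the benign nonconvex landscapes that motivate this line of work (not the ECL construction itself), the sketch below runs gradient descent on the policy cost of a scalar discrete-time LQR problem, where every stationary point is globally optimal; all constants are invented, and the gradient is taken numerically for brevity.

```python
# Toy scalar LQR: x_{t+1} = a*x_t + b*u_t with static policy u_t = -k*x_t.
# For x_0 = 1, the infinite-horizon cost is J(k) = (q + r*k^2) / (1 - (a - b*k)^2),
# which is nonconvex in k yet has no spurious stationary points.
a, b, q, r = 1.2, 1.0, 1.0, 0.1

def cost(k):
    closed_loop = a - b * k
    assert abs(closed_loop) < 1, "policy must stabilize the system"
    return (q + r * k**2) / (1 - closed_loop**2)

def grad(k, eps=1e-6):
    # Central-difference gradient; an analytic form exists but this keeps it short
    return (cost(k + eps) - cost(k - eps)) / (2 * eps)

k = 1.0  # a stabilizing initialization
for _ in range(500):
    k -= 0.01 * grad(k)
print(k, cost(k))  # a first-order stationary, hence globally optimal, policy
```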
Yang Zheng is an Assistant Professor in the ECE Department at UC San Diego. His research focuses on control theory, convex and nonconvex optimization, and their applications to autonomous vehicles and traffic systems. He received his DPhil (Ph.D.) in Engineering Science from the University of Oxford in 2019, and his B.E. and M.S. degrees from Tsinghua University in 2013 and 2015, respectively. His work has been recognized with several awards, including the 2019 European Ph.D. Award on Control for Complex and Heterogeneous Systems, the 2022 Best Paper Award from IEEE Transactions on Control of Network Systems, the 2023 Best Graduate Teacher Award from UC San Diego’s ECE Department, the 2024 NSF CAREER Award, and the 2025 Donald P. Eckman Award from the American Automatic Control Council.
Randomized Linear Algebra with Subspace Injections
Joel Tropp, Caltech
To achieve the greatest possible speed, practitioners regularly implement randomized algorithms for low-rank approximation and least-squares regression with structured dimension reduction maps. This talk outlines a new perspective on structured dimension reduction, based on the injectivity properties of the dimension reduction map. This approach provides sharper bounds for sparse dimension reduction maps, and it leads to exponential improvements for tensor-product dimension reduction. Empirical evidence confirms that these types of structured random matrices offer exemplary performance for a range of synthetic problems and contemporary scientific applications.
Joint work with Chris Camaño, Ethan Epperly, and Raphael Meyer; available at https://arxiv.org/abs/2508.21189.
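For readers new to the area, the sketch below shows the generic sketch-and-solve paradigm for overdetermined least squares with an unstructured Gaussian dimension-reduction map; the structured maps analyzed in the talk (sparse, tensor-product) would play the role of `S` here. Problem sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s = 5000, 50, 400            # tall least-squares problem; sketch size s >> n
A = rng.normal(size=(m, n))
b = A @ rng.normal(size=n) + 0.01 * rng.normal(size=m)

# Sketch-and-solve: compress the m-row problem to s rows, then solve exactly
S = rng.normal(size=(s, m)) / np.sqrt(s)   # Gaussian dimension-reduction map
x_sketch = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]
x_exact = np.linalg.lstsq(A, b, rcond=None)[0]

rel_err = np.linalg.norm(x_sketch - x_exact) / np.linalg.norm(x_exact)
print(rel_err)  # small: the sketch approximately preserves the least-squares geometry
```

Structured maps replace the dense matrix-vector product `S @ A` with a much cheaper operation, which is where the speedups discussed in the talk come from.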
Joel A. Tropp is the Steele Family Professor of Applied & Computational Mathematics at the California Institute of Technology. His research centers on applied mathematics, machine learning, data science, numerical algorithms, and random matrix theory. Some of his best-known contributions include matching pursuit algorithms, randomized SVD algorithms, matrix concentration inequalities, and statistical phase transitions. Prof. Tropp attained the Ph.D. degree in Computational Applied Mathematics at the University of Texas at Austin in 2004, and he joined Caltech in 2007. He won the PECASE in 2008, and he was recognized as a Highly Cited Researcher in Computer Science each year from 2014 to 2018. He is co-founder of the SIAM Journal on Mathematics of Data Science (SIMODS), and he was co-chair of the inaugural 2020 SIAM Conference on the Mathematics of Data Science. Prof. Tropp was elected SIAM Fellow in 2019, IEEE Fellow in 2020, and IMS Fellow in 2024. He received the 2025 Richard P. Feynman Prize for Excellence in Teaching at Caltech. He is an invited speaker at the 2026 International Congress of Mathematicians (ICM).
Stochastic-Gradient and Diagonal-Scaling Algorithms for Constrained Optimization and Learning
Frank E. Curtis, Lehigh University
I will motivate and provide an overview of recent efforts in my research group on the design and analysis of stochastic-gradient-based algorithms for solving constrained optimization problems. I will focus in particular on our motivation for informed supervised learning, where constraints in the training problem can be used to impose prior knowledge on the properties that should be possessed by a trained prediction model. In addition, I will provide a detailed look at our newest extensions of heavy-ball and Adam schemes from the unconstrained to the equality-constrained setting, for which we have shown state-of-the-art convergence guarantees. I will demonstrate the impressive practical performance of our methods using a few informed supervised learning problems.
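As a minimal illustration of stochastic-gradient methods under an equality constraint (a deliberate simplification, not the heavy-ball or Adam extensions from the talk), the toy below minimizes an expected squared loss subject to an affine constraint by alternating a stochastic-gradient step with an exact projection; the problem data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x = np.zeros(d)
x[0] = 1.0  # feasible start: sum(x) = 1

def project(z):
    # Euclidean projection onto the affine set {z : sum(z) = 1}
    return z - (z.sum() - 1.0) / len(z)

for t in range(2000):
    xi = rng.normal(0.3, 0.1, size=d)     # stochastic data sample
    g = 2 * (x - xi)                      # stochastic gradient of ||x - xi||^2
    x = project(x - 0.5 / (t + 10) * g)   # diminishing step, then restore feasibility

print(x)  # approaches the projection of E[xi] = 0.3 onto the constraint: 0.2 per coordinate
```

The constraint here plays the role of prior knowledge imposed on the model: every iterate (and hence the final model) satisfies it exactly, regardless of the noise in the gradients.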
Frank E. Curtis is a Professor in the Department of Industrial and Systems Engineering at Lehigh University, where he has been employed since 2009. He received a bachelor’s degree from the College of William and Mary in 2003 with a double major in Computer Science and Mathematics, received a master’s degree in 2004 and a Ph.D. degree in 2007 from the Department of Industrial Engineering and Management Sciences at Northwestern University, and spent two years as a Postdoctoral Researcher in the Courant Institute of Mathematical Sciences at New York University from 2007 until 2009. His research focuses on the design, analysis, and implementation of numerical methods for solving large-scale nonlinear optimization problems. He received an Early Career Award from the Advanced Scientific Computing Research (ASCR) program of the U.S. Department of Energy (DoE), and has received funding from various programs of the U.S. National Science Foundation (NSF), including through a TRIPODS Phase I grant awarded to him and his collaborators at Lehigh, Northwestern, and Boston University. He has also received funding from the U.S. Office of Naval Research (ONR) and DoE’s Advanced Research Projects Agency-Energy (ARPA-E). He received, along with Leon Bottou (Meta AI) and Jorge Nocedal (Northwestern), the 2021 SIAM/MOS Lagrange Prize in Continuous Optimization. He was awarded, with James V. Burke (U. of Washington), Adrian Lewis (Cornell), and Michael Overton (NYU), the 2018 INFORMS Computing Society Prize. He and team members Daniel Molzahn (Georgia Tech), Andreas Waechter (Northwestern), Ermin Wei (Northwestern), and Elizabeth Wong (UC San Diego) were awarded second place in the ARPA-E Grid Optimization Competition in 2020.
He currently serves as Area Editor for Continuous Optimization for Mathematics of Operations Research and serves as an Associate Editor for Mathematical Programming, SIAM Journal on Optimization, Operations Research, IMA Journal of Numerical Analysis, and Mathematical Programming Computation. He previously served as the Vice Chair for Nonlinear Programming for the INFORMS Optimization Society, and is currently very active in professional societies and groups related to mathematical optimization, including INFORMS, the Mathematical Optimization Society, and the SIAM Activity Group on Optimization.
Training Neural Networks at Any Scale
Volkan Cevher, EPFL
At the heart of deep learning’s transformative impact lies the concept of scale, encompassing both data and computational resources as well as their interaction with neural network architectures. Scale, however, presents critical challenges, such as increased instability during training and prohibitively expensive model-specific tuning. Given the substantial resources required to train such models, formulating high-confidence scaling hypotheses backed by rigorous theoretical research has become paramount.
To bridge theory and practice, the talk explores a key mathematical ingredient of scaling in tandem with scaling theory: the numerical solution algorithms commonly employed in deep learning, spanning domains from vision to language models. We unify these algorithms under a common master template, making their foundational principles transparent. In doing so, we reveal the interplay between adaptation to smoothness structures via online learning and the exploitation of optimization geometry through non-Euclidean norms. Our exposition moves beyond simply building larger models; it emphasizes strategic scaling, offering insights that promise to advance the field while economizing on resources.
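A hedged sketch of the "master template" viewpoint: many popular training updates can be written as x <- x - lr * precondition(g, state), differing only in the preconditioner, which in turn encodes a choice of norm or geometry. The function names below are illustrative, and the Adam-style variant omits bias correction for brevity.

```python
import numpy as np

def sgd_precond(g, state):
    return g  # plain (Euclidean) steepest descent

def adam_precond(g, state, b1=0.9, b2=0.999, eps=1e-8):
    # Adam-style diagonal preconditioning (bias correction omitted for brevity)
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g**2
    return state["m"] / (np.sqrt(state["v"]) + eps)

def sign_precond(g, state):
    return np.sign(g)  # steepest descent in the l-infinity norm (signSGD)

def minimize(precond, x0, grad, lr=0.1, steps=200):
    # The master template: only the preconditioner changes between methods
    x, state = x0.copy(), {}
    for _ in range(steps):
        x = x - lr * precond(grad(x), state)
    return x

grad = lambda x: 2 * x  # gradient of the quadratic ||x||^2
x0 = np.array([3.0, -2.0])
print(minimize(sgd_precond, x0, grad))  # near the minimizer at the origin
```

Swapping the preconditioner swaps the implicit geometry while the rest of the loop stays fixed, which is the transparency the template is meant to provide.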
Volkan Cevher received the B.Sc. (valedictorian) in electrical engineering from Bilkent University in Ankara, Turkey, in 1999 and the Ph.D. in electrical and computer engineering from the Georgia Institute of Technology in Atlanta, GA in 2005. He was a Research Scientist with the University of Maryland, College Park from 2006-2007 and also with Rice University in Houston, TX, from 2008-2009. Currently, he is an Associate Professor at the Swiss Federal Institute of Technology Lausanne and a Faculty Fellow in the Electrical and Computer Engineering Department at Rice University. His research interests include machine learning, signal processing theory, optimization theory and methods, and information theory. Dr. Cevher is an ELLIS fellow and was the recipient of the Google Faculty Research award in 2018, the IEEE Signal Processing Society Best Paper Award in 2016, a Best Paper Award at CAMSAP in 2015, a Best Paper Award at SPARS in 2009, and an ERC CG in 2016 as well as an ERC StG in 2011.
High-dimensional Optimization with Applications to Compute-Optimal Neural Scaling Laws
Courtney Paquette (McGill University)
Given the massive scale of modern ML models, we often get only a single shot to train them effectively. This restricts our ability to test multiple architectures and hyper-parameter configurations. Instead, we need to understand how these models scale, allowing us to experiment with smaller problems and then apply those insights to larger-scale models. In this talk, I will present a framework for analyzing scaling laws in stochastic learning algorithms using a power-law random features model (PLRF), leveraging high-dimensional probability and random matrix theory. I will then use this scaling law to address the compute-optimal question: How should we choose model size and hyper-parameters to achieve the best possible performance in the most compute-efficient manner? Using the PLRF model, I will also devise a new momentum-based algorithm that provably improves the scaling-law exponent. Finally, I will present some numerical experiments on LSTMs that show how this new stochastic algorithm can be applied to real data to improve the compute-optimal exponent.
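As a hedged caricature of power-law scaling (not the PLRF analysis itself), the snippet below assumes a target whose coefficient on feature j decays like j^(-beta) over a feature spectrum decaying like j^(-alpha), and checks that the loss remaining after keeping the top-d features follows the predicted power law in d; the exponents and sizes are arbitrary choices.

```python
import numpy as np

alpha, beta = 0.6, 0.4     # feature-spectrum and target decay exponents (arbitrary)
D = 200_000                # ambient number of features
j = np.arange(1, D + 1)
# Loss left after keeping the top-d features: tail[d] = sum over j > d of j^(-2(alpha+beta))
tail = (j ** (-2 * (alpha + beta)))[::-1].cumsum()[::-1]

sizes = np.array([100, 300, 1000, 3000, 10000])   # candidate "model sizes" d
loss = tail[sizes]
slope = np.polyfit(np.log(sizes), np.log(loss), 1)[0]
print(slope)  # close to the predicted exponent 1 - 2*(alpha + beta) = -1.0
```

Fitting the log-log slope on small models and extrapolating to large ones is the basic move behind compute-optimal scaling analyses; the PLRF framework in the talk makes this rigorous for stochastic training dynamics rather than a static truncation.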
Courtney Paquette is an assistant professor at McGill University in the Mathematics and Statistics department, a CIFAR AI Chair (MILA), and an active member of the Montreal Machine Learning Optimization Group (MTL MLOpt) at MILA. Her research broadly focuses on designing and analyzing algorithms for large-scale optimization problems, motivated by applications in data science, and using techniques that draw from a variety of fields, including probability, complexity theory, and convex and nonsmooth analysis. Dr. Paquette has been a lead organizer of the OPT-ML Workshop at NeurIPS since 2020, and is a lead organizer (and original creator) of the High-dimensional Learning Dynamics (HiLD) Workshop at ICML.