Recorded Talks: Foundations of AI and Optimization
Machine learning for discrete optimization: Theoretical foundations
Ellen Vitercik, Stanford University
Many of the most important optimization problems in practice are massive in scale, mathematically complex, and involve numerous unknown parameters. Machine learning offers a powerful way to address these challenges by uncovering hidden structure across problem instances, but integrating predictions into algorithms raises fundamental questions: which architectures align with combinatorial structure, and how can we ensure robustness to error? This talk presents two case studies. First, we show how graph neural networks can approximate the optimal dynamic program for online matching, yielding algorithms that generalize across graph sizes and achieve strong empirical performance. Second, we investigate calibration as a principled interface between machine learning and decision-making, demonstrating through rent-or-buy and job scheduling problems that calibrated predictions yield both theoretical guarantees and practical improvements. This is joint work with Alexandre Hayderi, Amin Saberi, Anders Wikum, and Judy Hanwen Shen.
Ellen Vitercik is an Assistant Professor at Stanford University with a joint appointment between the Management Science & Engineering department and the Computer Science department. Her research interests include machine learning, algorithm design, discrete and combinatorial optimization, and the interface between economics and computation. Before joining Stanford, she spent a year as a Miller Postdoctoral Fellow at UC Berkeley and received a PhD in Computer Science from Carnegie Mellon University. Her research has been recognized with a Schmidt Sciences AI2050 Early Career Fellowship, an NSF CAREER award, the SIGecom Doctoral Dissertation Award, and the CMU School of Computer Science Distinguished Dissertation Award, among other honors.
A survey of the mixing times of the Proximal Sampler algorithm
Andre Wibisono, Yale University
Sampling is a fundamental algorithmic task with many connections to optimization. In this talk, we survey a recent algorithm for sampling known as the Proximal Sampler, which can be seen as a proximal discretization of the continuous-time Langevin dynamics, and achieves the current state-of-the-art iteration complexity for sampling in discrete time. We survey the mixing time guarantees of the Proximal Sampler algorithm and show they match the guarantees for the Langevin dynamics. When the target distribution satisfies log-concavity or isoperimetry, the Proximal Sampler has rapid convergence guarantees. We illustrate the proof technique via the strong data processing inequality along the Gaussian channel and its time reversal under isoperimetry.
Andre Wibisono is an assistant professor in the Department of Computer Science at Yale University, with a secondary appointment in the Department of Statistics & Data Science. His research interests are in the design and analysis of algorithms for machine learning, in particular for problems in optimization, sampling, and game theory. He received his BS degrees in Mathematics and in Computer Science from MIT, his MEng in Computer Science from MIT, his MA in Statistics from UC Berkeley, and his PhD in Computer Science from UC Berkeley. He has done postdoctoral research at the University of Wisconsin-Madison and at the Georgia Institute of Technology.
Transformers Learn Generalizable Chain-of-Thought Reasoning via Gradient Descent
Yuejie Chi, Yale University
Transformers have demonstrated remarkable chain-of-thought reasoning capabilities, yet, the underlying mechanisms by which they acquire and extrapolate these capabilities remain limited. This talk presents a theoretical analysis of transformers trained via gradient descent for symbolic reasoning and state tracking tasks with increasing problem complexity. Our analysis reveals the coordination of multi-head attention to solve multiple subtasks in a single autoregressive path, and the bootstrapping of inherently sequential reasoning through recursive self-training curriculum. Our optimization-based guarantees demonstrate that even shallow multi-head transformers, with chain-of-thought, can be trained to effectively solve problems that would otherwise require deeper architectures.
Yuejie Chi is the Charles C. and Dorothea S. Dilley Professor of Statistics and Data Science at Yale University, with a secondary appointment in Computer Science, and a member of the Yale Institute for Foundations of Data Science. Before joining Yale, Dr. Chi was the Sense of Wonder Group Endowed Professor of Electrical and Computer Engineering in AI Systems at Carnegie Melon University, with affiliation in MLD and CyLab. She also spent some time as a visiting researcher at Meta's Fundamental AI Research (FAIR). Dr. Yue's research interests lie in the theoretical and algorithmic foundations of data science, generative AI, reinforcement learning, and signal processing, motivated by applications in scientific and engineering domains. Her current focus is on improving the performance, efficiency and reliability of generative AI and decision making, driven by data-intensive but resource-constrained scenarios.
(De)regularized Wasserstein Gradient Flows via Reproducing Kernels
Bharath Sriperumbudur, Pennsylvania State University
Wasserstein gradient flows have become a popular tool in machine learning with applications in sampling, variational inference, generative modeling, and reinforcement learning, among others. The Wasserstein gradient flow (WGF) involves minimizing a probability functional over the Wasserstein space (by taking into account the intrinsic geometry of the Wasserstein space). In this work, we introduce approximate/regularized Wasserstein gradient flows in two different settings: (a) approximate the probability functional and (b) approximate the Wasserstein geometry. In (a), we consider the probability functional to be chi^2-divergence, whose WGF is difficult to implement. To this end, we propose a (de)-regularization of the Maximum Mean Discrepancy (DrMMD) as an approximation of chi^2-divergence and develop an approximate WGF, which is easy to implement and has applications in generative modeling. On the other hand, in the setting of (b), we use Kullback-Leibler divergence as the probability functional and develop an approximation to the Wassertein geometry, which allows for an efficient implementation than that of the exact WGF, with applications in sampling. In both settings, we present a variety of theoretical results that relate the approximate flow to the exact flow and demonstrate the superiority of the approximate flows via numerical simulations.
Bharath Sriperumbudur is a professor in the Department of Statistics (with a courtesy appointment in the Department of Mathematics) at the Pennsylvania State University. His research interests include non-parametric statistics, machine learning, statistical learning theory, optimal transport and gradient flows, regularization and inverse problems, reproducing kernel spaces in probability and statistics, functional and topological data analysis.
Neuromorphic LLMs
Jason Eshraghian, UC Santa Cruz
This talk will show you what neuromorphic computing can do when an academic lab accidentally pulls $2-million of GPU-hours. We will showcase a series of frontier reasoning LLMs developed out of an academic lab, from data curation and pre-training to post-training and alignment. These models surpass leading LLMs from Meta, Google, and other heavily-resourced labs in the ~10-billion parameter regime, despite being 5x smaller.
We have deployed several models on neuromorphic hardware at just 2 watts, bringing state-of-the-art reasoning from the datacenter to the edge. Along the way, we dispel a series of widely-held assumptions about large-scale neuromorphic computation, revealing how it fundamentally differs from conventional deep learning, and why that difference matters.
Jason Eshraghian is an Assistant Professor and Fulbright Scholar in the Department of Electrical and Computer Engineering at the University of California, Santa Cruz. He is the developer of snnTorch, a Python library with over 500,000 downloads for training spiking neural networks. He is a dual-appointed IEEE CAS and EMBS Distinguished Lecturer, an Associate Editor of APL Machine Learning, the Chair of the IEEE Neural Systems and Applications Technical Committee, has been the recipient of seven IEEE Best Paper Awards, a Scientific Advisory Board Member of BrainChip and leads the Neuromorphic Agents Team at Conscium.
Kinetic Theory Perspective of Foundation Models for Physics
Maarten de Hoop, Rice University
We present a kinetic theory perspective of foundation models for physics. We begin with providing a mathematical framework for analyzing transformers. To uniformly address their expressivity, we consider the case that the mappings are conditioned on a context represented by a probability distribution of tokens. That is, transformers become mappings between probability measures. The relevant notion of smoothness then corresponds to continuity in terms of the Wasserstein distance between such contexts. We demonstrate that deep transformers are universal and can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains. We then characterize the conditions on mappings between measures that enable these to be represented in terms of in-context mappings as transformers. The solution map of the Vlasov equation, which is of nonlocal transport type, for interacting particle systems in the mean-field regime for the Cauchy problem satisfies the conditions; conversely, we prove that the measure-theoretic self-attention has the properties that ensure that the infinite depth, mean-field transformer can be identified with a Vlasov flow. Extending this framework from interactions to collisions leads to a further development of structured architectures inspired by Lattice Boltzmann Models, while flow motivates a design based on self-warping.
Professor Maarten V. de Hoop, Simons Chair in Computational and Applied Mathematics and Earth Science at Rice University, is internationally recognized for his contributions to the mathematical foundations of seismology, wave propagation, and inverse problems. His research bridges microlocal and harmonic analysis, scattering theory, and structured numerical methods with applications to seismic imaging, geophysical inversion, and large-scale computational modeling of acoustic, elastic, and electromagnetic phenomena. De Hoop has been a pioneer in developing techniques to extract subtle information from massive, complex seismic datasets, advancing our ability to probe the Earth’s interior with unprecedented resolution, and more recently has integrated deep learning and data-driven discovery with rigorous mathematical frameworks to open new frontiers in the analysis of multiscale wave phenomena and inverse spectral problems. He is the recipient of the J. Clarence Karcher Award from the Society of Exploration Geophysicists and the Young Scientists Award from the International Society for Analysis, its Applications and Computation, has been elected a Fellow of the Institute of Physics and an External Member of the Finnish Academy of Science and Letters, and has served as associate editor for Inverse Problems, Inverse Problems and Imaging, and the International Journal on Geomathematics.
Randomized Linear Algebra with Subspace Injections
Joel Tropp, Caltech
To achieve the greatest possible speed, practitioners regularly implement randomized algorithms for low-rank approximation and least-squares regression with structured dimension reduction maps. This talk outlines a new perspective on structured dimension reduction, based on the injectivity properties of the dimension reduction map. This approach provides sharper bounds for sparse dimension reduction maps, and it leads to exponential improvements for tensor-product dimension reduction. Empirical evidence confirms that these types of structured random matrices offer exemplary performance for a range of synthetic problems and contemporary scientific applications.
Joint work with Chris Camaño, Ethan Epperly, and Raphael Meyer; available at https://arxiv.org/abs/2508.21189.
Joel A. Tropp is Steele Family Professor of Applied & Computational Mathematics at the California Institute of Technology. His research centers on applied mathematics, machine learning, data science, numerical algorithms, and random matrix theory. Some of his best-known contributions include matching pursuit algorithms, randomized SVD algorithms, matrix concentration inequalities, and statistical phase transitions. Prof. Tropp attained the Ph.D. degree in Computational Applied Mathematics at the University of Texas at Austin in 2004, and he joined Caltech in 2007. He won the PECASE in 2008, and he was recognized as a Highly Cited Researcher in Computer Science each year from 2014–2018. He is co-founder of the SIAM Journal on Mathematics of Data Science (SIMODS), and he was co-chair of the inaugural 2020 SIAM Conference on the Mathematics of Data Science. Prof. Tropp was elected SIAM Fellow in 2019, IEEE Fellow in 2020, and IMS Fellow in 2024. He received the 2025 Richard P. Feynman Prize for Excellence in Teaching at Caltech. He is an invited speaker at the 2026 International Congress of Mathematicians (ICM).
Stochastic-Gradient and Diagonal-Scaling Algorithms for Constrained Optimization and Learning
Frank E. Curtis, Lehigh University
I will motivate and provide an overview of recent efforts in my research group on the design and analysis of stochastic-gradient-based algorithms for solving constrained optimization problems. I will focus in particular on our motivation for informed supervised learning, where constraints in the training problem can be used to impose prior knowledge on the properties that should be possessed by a trained prediction model. In addition, I will provide a detailed look at our newest extensions of heavy-ball and Adam schemes from the unconstrained to the equality-constrained setting, for which we have shown state-of-the-art convergence guarantees. I will demonstrate the impressive practical performance of our methods using a few informed supervised learning problems.
Frank E. Curtis is a Professor in the Department of Industrial and Systems Engineering at Lehigh University, where he has been employed since 2009. He received a bachelor’s degree from the College of William and Mary in 2003 with a double major in Computer Science and Mathematics, received a master’s degree in 2004 and Ph.D. degree in 2007 from the Department of Industrial Engineering and Management Science at Northwestern University, and spent two years as a Postdoctoral Researcher in the Courant Institute of Mathematical Sciences at New York University from 2007 until 2009. His research focuses on the design, analysis, and implementation of numerical methods for solving large-scale nonlinear optimization problems. He received an Early Career Award from the Advanced Scientific Computing Research (ASCR) program of the U.S. Department of Energy (DoE), and has received funding from various programs of the U.S. National Science Foundation (NSF), including through a TRIPODS Phase I grant awarded to him and his collaborators at Lehigh, Northwestern, and Boston University. He has also received funding from the U.S. Office of Naval Research (ONR) and DoE’s Advanced Research Projects Agency-Energy (ARPA-E). He received, along with Leon Bottou (Meta AI) and Jorge Nocedal (Northwestern), the 2021 SIAM/MOS Lagrange Prize in Continuous Optimization. He was awarded, with James V. Burke (U. of Washington), Adrian Lewis (Cornell), and Michael Overton (NYU), the 2018 INFORMS Computing Society Prize. He and team members Daniel Molzahn (Georgia Tech), Andreas Waechter (Northwestern), Ermin Wei (Northwestern), and Elizabeth Wong (UC San Diego) were awarded second place in the ARPA-E Grid Optimization Competition in 2020. He currently serves as Area Editor for Continuous Optimization for Mathematics of Operations Research and serves as an Associate Editor for Mathematical Programming, SIAM Journal on Optimization, Operations Research, IMA Journal of Numerical Analysis, and Mathematical Programming Computation. He previously served as the Vice Chair for Nonlinear Programming for the INFORMS Optimization Society, and is currently very active in professional societies and groups related to mathematical optimization, including INFORMS, the Mathematics Optimization Society, and the SIAM Activity Group on Optimization.
High-dimensional Optimization with Applications to Compute-Optimal Neural Scaling Laws
Courtney Paquette (McGill University)
Given the massive scale of modern ML models, we now only get a single shot to train them effectively. This restricts our ability to test multiple architectures and hyper-parameter configurations. Instead, we need to understand how these models scale, allowing us to experiment with smaller problems and then apply those insights to larger-scale models. In this talk, I will present a framework for analyzing scaling laws in stochastic learning algorithms using a power-law random features model (PLRF), leveraging high-dimensional probability and random matrix theory. I will then use this scaling law to address the compute-optimal question: How should we choose model size and hyper-parameters to achieve the best possible performance in the most compute-efficient manner? Then using this PLRF model, I will devise a new momentum-based algorithm that (provably) improves the scaling law exponent. Finally, I will present some numerical experiments on LSTMs that show how this new stochastic algorithm can be applied to real data to improve the compute-optimal exponent.
Courtney Paquette is an assistant professor at McGill University in the Mathematics and Statistics department, a CIFAR AI Chair (MILA), and an active member of the Montreal Machine Learning Optimization Group (MTL MLOpt) at MILA. Her research broadly focuses on designing and analyzing algorithms for large-scale optimization problems, motivated by applications in data science, and using techniques that draw from a variety of fields, including probability, complexity theory, and convex and nonsmooth analysis. Dr. Paquette is a lead organizer of the OPT-ML Workshop at NeurIPS since 2020, and a lead organizer (and original creator) of the High-dimensional Learning Dynamics (HiLD) Workshop at ICML.
A New Paradigm for Learning with Distribution Shift
Adam Klivans, UT Austin
We revisit the fundamental problem of learning with distribution shift, where a learner is given labeled samples from training distribution D, unlabeled samples from test distribution D′ and is asked to output a classifier with low test error. The standard approach in this setting is to prove a generalization bound in terms of some notion of distance between D and D′. These distances, however, are difficult to compute, and this has been the main stumbling block for efficient algorithm design over the last two decades.
We sidestep this issue and define a new model called TDS learning, where a learner runs a test on the training set and is allowed to reject if this test detects distribution shift relative to a fixed output classifier. This approach leads to the first set of efficient algorithms for learning with distribution shift that do not take any assumptions on the test distribution. Finally, we discuss how our techniques have recently been used to solve longstanding problems in supervised learning with contamination.
Adam Klivans is a Professor of Computer Science at the University of Texas at Austin and Director of the NSF AI Institute for Foundations of Machine Learning (IFML). His research interests lie in machine learning and theoretical computer science, in particular, Learning Theory, Computational Complexity, Pseudorandomness, Limit Theorems, and Gaussian Space. Dr. Klivans is a recipient of the NSF CAREER Award and serves on the editorial board for the Theory of Computing and Machine Learning Journal.
Flat Minima and Generalization: from Matrix Sensing to Neural Networks
Maryam Fazel, University of Washington
When do overparameterized neural networks avoid overfitting and generalize to unseen data? Empirical evidence suggests that the shape of the training loss function near the solution matters—the minima where the loss is “flatter” tend to lead to better generalization. Yet quantifying flatness and its rigorous analysis, even in simple models, has been limited. In this talk, we examine overparameterized nonconvex models such as low-rank matrix sensing, matrix completion, robust PCA, as well as a 2-layer neural network as test cases. We show that under standard statistical assumptions, "flat" minima (minima with the smallest local average curvature, measured by the trace of the Hessian matrix) provably generalize in all these cases. These algorithm-agnostic results suggest a theoretical basis for favoring methods that bias iterates towards flat solutions, and help inform the design of better training algorithms.
Hunting the Hessian
Madeleine Udell, Stanford University
Ill conditioned loss landscapes are ubiquitous in machine learning, and they slow down optimization. Preconditioning the gradient to make the loss more isotropic is a natural solution, but is challenging for extremely large problems, as direct access to the problem Hessian is prohibitively expensive. We present two fresh approaches to preconditioning using tools from randomized numerical linear algebra and online convex optimization for efficient access to Hessian information, motivated by the question: what is the most useful information we can query from the problem Hessian using linear memory and compute?
Accelerating Nonconvex Optimization via Online Learning
Aryan Mokhtari, UT Austin
A fundamental problem in optimization is finding an ε-first-order stationary point of a smooth function using only gradient information. The best-known gradient query complexity for this task, assuming both the gradient and Hessian of the objective function are Lipschitz continuous, is O( ε^(−7/4) ). In this talk, I present a method with a gradient complexity of O( d^(1/4) ε^(−13/8) ), where d is the problem dimension—yielding improved complexity when d = O( ε^(−1/2) ). The proposed method builds on quasi-Newton ideas and operates by solving two online learning problems under the hood.
The Binary Iterative Hard Thresholding Algorithm
Arya Mazumdar, TILOS & UC San Diego
We will discuss our work on the convergence of iterative hard threshold algorithms for sparse signal recovery problems. For classification problems with nonseparable data this algorithm can be thought of minimizing the so-called ReLU loss. It seems to be very effective (statistically optimal, simple iterative method) for a large class of models of nonseparable data—sparse generalized linear models. It is also robust to adversarial perturbation. Based on joint work with Namiko Matsumoto.
Reverse diffusion Monte Carlo
Yian Ma, TILOS & UC San Diego
I will introduce a novel Monte Carlo sampling approach that uses the reverse diffusion process. In particular, the intermediary updates—the score functions—can be explicitly estimated to arbitrary accuracy, leading to an unbiased Bayesian inference algorithm. I will then discuss how to use this idea to improve sampling in the diffusion models via reverse transition kernels.
Linear Bregman Divergence Control
Babak Hassibi, Caltech
In the past couple of decades, the use of "non-quadratic" convex cost functions has revolutionized signal processing, machine learning, and statistics, allowing one to customize solutions to have desired structures and properties. However, the situation is not the same in control where the use of quadratic costs still dominates, ostensibly because determining the "value function", i.e., the optimal expected cost-to-go, which is critical to the construction of the optimal controller, becomes computationally intractable as soon as one considers general convex costs. As a result, practitioners often resort to heuristics and approximations, such as model predictive control that only looks a few steps into the future. In the quadratic case, the value function is easily determined by appealing to certainty-equivalence and solving Riccati equations. In this talk, we consider a special class of convex cost functions constructed from Bregman divergence and show how, with appropriate choices, they can be used to fully extend the framework developed for the quadratic case. The resulting optimal controllers are infinite horizon, come with stability guarantees, and have state-feedback, or estimated state-feedback, laws. They exhibit a much wider range of behavior than their quadratic counterparts since the feedback laws are nonlinear. We demonstrate the applicability of the approach to several cases of interest, including safety control, sparse control, and bang-bang control.
















