Certifiably Correct Machine Perception
David Rosen, Northeastern University
Many fundamental machine perception and state estimation tasks require the solution of a high-dimensional nonconvex estimation problem; this class includes (for example) the fundamental problems of simultaneous localization and mapping (in robotics), 3D reconstruction (in computer vision), and sensor network localization (in distributed sensing). Such problems are known to be computationally hard in general, with many local minima that can entrap the smooth local optimization methods commonly applied to solve them. The result is that standard machine perception algorithms (based upon local optimization) can be surprisingly brittle, often returning egregiously wrong answers even when the problem to which they are applied is well-posed.
In this talk, we present a novel class of certifiably correct estimation algorithms that are capable of efficiently recovering provably good (often globally optimal) solutions of generally-intractable machine perception problems in many practical settings. Our approach directly tackles the problem of nonconvexity by employing convex relaxations whose minimizers provide provably good approximate solutions to the original estimation problem under moderate measurement noise. We illustrate the design of this class of methods using the fundamental problem of pose-graph optimization (a mathematical abstraction of robotic mapping) as a running example. We conclude with a brief discussion of open questions and future research directions.
David M. Rosen is an Assistant Professor in the Departments of Electrical & Computer Engineering and Mathematics and the Khoury College of Computer Sciences (by courtesy) at Northeastern University, where he leads the Robust Autonomy Laboratory (NEURAL). Prior to joining Northeastern, he was a Research Scientist at Oculus Research (now Meta Reality Labs) from 2016 to 2018, and a Postdoctoral Associate at MIT’s Laboratory for Information and Decision Systems (LIDS) from 2018 to 2021. He holds a B.S. in Mathematics from the California Institute of Technology (2008), an M.A. in Mathematics from the University of Texas at Austin (2010), and an Sc.D. in Computer Science from the Massachusetts Institute of Technology (2016).
He is broadly interested in the mathematical and algorithmic foundations of trustworthy machine perception, learning, and control. His work has been recognized with the IEEE Transactions on Robotics Best Paper Award (2024), an Honorable Mention for the IEEE Transactions on Robotics Best Paper Award (2021), a Best Student Paper Award at Robotics: Science and Systems (2020), a Best Paper Award at the International Workshop on the Algorithmic Foundations of Robotics (2016), and selection as an RSS Pioneer (2019).
AI Safety Theory: The Missing Middle Ground
Adam Oberman, McGill University
Over the past few years, the capabilities of generative artificial intelligence (AI) systems have advanced rapidly. Along with the benefits of AI, there is also a risk of harm. In order to benefit from AI while mitigating the risks, we need a grounded theoretical framework.
The current AI safety theory, which predates generative AI, is insufficient. Most theoretical AI safety results tend to reason absolutely: a system is either “aligned” or “misaligned”, “honest” or “dishonest”. But in practice safety is probabilistic, not absolute. The missing middle ground is a quantitative or relative theory of safety: a way to reason formally about degrees of safety. Such a theory is required for defining safety and harms, and is essential for technical solutions as well as for making good policy decisions.
In this talk I will:
- Review current AI risks (misuse, lack of reliability, and systemic risks to the economy) as well as important future risks (lack of control).
- Review theoretical predictions of bad AI behavior and discuss experiments demonstrating that such behavior can occur in current LLMs.
- Explain why technical and theoretical safety solutions are valuable, even when contributed by researchers outside of the major labs.
- Discuss some gaps in the theory and present some open problems which could address the gaps.
Adam Oberman is a Full Professor of Mathematics and Statistics at McGill University, a Canada CIFAR AI Chair, and an Associate Member of Mila. He is a research collaborator at LawZero, Yoshua Bengio’s AI Safety Institute. He has been researching AI safety since 2024. His research spans generative models, reinforcement learning, optimization, calibration, and robustness. Earlier in his career, he made significant contributions to optimal transport and nonlinear partial differential equations. He earned degrees from the University of Toronto and the University of Chicago, and previously held faculty and postdoctoral positions at Simon Fraser University and the University of Texas at Austin.
High-dimensional Optimization with Applications to Compute-Optimal Neural Scaling Laws
Courtney Paquette, McGill University
Given the massive scale of modern ML models, we now only get a single shot to train them effectively. This restricts our ability to test multiple architectures and hyper-parameter configurations. Instead, we need to understand how these models scale, allowing us to experiment with smaller problems and then apply those insights to larger-scale models. In this talk, I will present a framework for analyzing scaling laws in stochastic learning algorithms using a power-law random features model (PLRF), leveraging high-dimensional probability and random matrix theory. I will then use this scaling law to address the compute-optimal question: How should we choose model size and hyper-parameters to achieve the best possible performance in the most compute-efficient manner? Then using this PLRF model, I will devise a new momentum-based algorithm that (provably) improves the scaling law exponent. Finally, I will present some numerical experiments on LSTMs that show how this new stochastic algorithm can be applied to real data to improve the compute-optimal exponent.
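As a rough illustration of the setup (a toy proxy with assumed exponents, not the exact PLRF construction analyzed in the talk), the sketch below draws least-squares data whose feature variances and target coefficients both decay as power laws and records the mini-batch SGD loss curve, from which a scaling exponent could be read off.

```python
# Toy stand-in for a power-law random-features setup; alpha, beta, d, and the learning
# rate are illustrative assumptions, not values from the talk.
import numpy as np

rng = np.random.default_rng(0)
d, alpha, beta = 512, 1.0, 1.5            # dimension and power-law exponents (assumed)
lam = np.arange(1, d + 1) ** (-alpha)     # feature variances decay as j^(-alpha)
w_star = np.arange(1, d + 1) ** (-beta)   # target coefficients decay as j^(-beta)

def sgd_loss_curve(steps, lr=0.1, batch=8):
    w = np.zeros(d)
    losses = []
    for _ in range(steps):
        x = rng.standard_normal((batch, d)) * np.sqrt(lam)   # anisotropic Gaussian features
        y = x @ w_star
        g = x.T @ (x @ w - y) / batch                        # mini-batch least-squares gradient
        w -= lr * g
        losses.append(0.5 * np.mean((x @ w - y) ** 2))
    return np.array(losses)

curve = sgd_loss_curve(4000)
print(curve[[10, 100, 1000, 3999]])       # the loss decays roughly as a power of the step count
```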
Courtney Paquette is an assistant professor at McGill University in the Mathematics and Statistics department, a CIFAR AI Chair (MILA), and an active member of the Montreal Machine Learning Optimization Group (MTL MLOpt) at MILA. Her research broadly focuses on designing and analyzing algorithms for large-scale optimization problems, motivated by applications in data science, using techniques that draw from a variety of fields, including probability, complexity theory, and convex and nonsmooth analysis. Dr. Paquette has been a lead organizer of the OPT-ML Workshop at NeurIPS since 2020, and is a lead organizer (and original creator) of the High-dimensional Learning Dynamics (HiLD) Workshop at ICML.
A New Paradigm for Learning with Distribution Shift
Adam Klivans, UT Austin
We revisit the fundamental problem of learning with distribution shift, where a learner is given labeled samples from training distribution D, unlabeled samples from test distribution D′ and is asked to output a classifier with low test error. The standard approach in this setting is to prove a generalization bound in terms of some notion of distance between D and D′. These distances, however, are difficult to compute, and this has been the main stumbling block for efficient algorithm design over the last two decades.
We sidestep this issue and define a new model called TDS learning, where a learner runs a test on the training set and is allowed to reject if this test detects distribution shift relative to a fixed output classifier. This approach leads to the first set of efficient algorithms for learning with distribution shift that make no assumptions on the test distribution. Finally, we discuss how our techniques have recently been used to solve longstanding problems in supervised learning with contamination.
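To make the workflow concrete, here is a heavily simplified TDS-style sketch: train an ordinary classifier, compare low-order moments of the training and unlabeled test data, and reject if they disagree. The linear least-squares classifier, the moment test, and the tolerance are illustrative stand-ins rather than the algorithms from the talk (though moment comparisons are in the spirit of the testers discussed there).

```python
# Schematic TDS-style learner: output a classifier, or REJECT if a train/test discrepancy
# test fails. All concrete choices below are simplifications for illustration.
import numpy as np

rng = np.random.default_rng(1)

def tds_learn(X_train, y_train, X_test_unlabeled, tol=0.15):
    # 1) Ordinary training step (a least-squares linear classifier as a stand-in).
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    # 2) Test for distribution shift by comparing first and second moments.
    shift = max(
        np.abs(X_train.mean(0) - X_test_unlabeled.mean(0)).max(),
        np.abs(np.cov(X_train.T) - np.cov(X_test_unlabeled.T)).max(),
    )
    if shift > tol:
        return None   # REJECT: detected distribution shift
    return w          # ACCEPT: output the classifier

X_tr = rng.standard_normal((5000, 5))
y_tr = np.sign(X_tr[:, 0])
X_te = rng.standard_normal((5000, 5))                   # same distribution: expect ACCEPT
print(tds_learn(X_tr, y_tr, X_te) is not None)
print(tds_learn(X_tr, y_tr, X_te + 2.0) is not None)    # shifted test data: expect REJECT
```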
Adam Klivans is a Professor of Computer Science at the University of Texas at Austin and Director of the NSF AI Institute for Foundations of Machine Learning (IFML). His research interests lie in machine learning and theoretical computer science, in particular, Learning Theory, Computational Complexity, Pseudorandomness, Limit Theorems, and Gaussian Space. Dr. Klivans is a recipient of the NSF CAREER Award and serves on the editorial boards of Theory of Computing and the Machine Learning journal.
Flat Minima and Generalization: from Matrix Sensing to Neural Networks
Maryam Fazel, University of Washington
When do overparameterized neural networks avoid overfitting and generalize to unseen data? Empirical evidence suggests that the shape of the training loss function near the solution matters—the minima where the loss is “flatter” tend to lead to better generalization. Yet quantifying flatness and its rigorous analysis, even in simple models, has been limited. In this talk, we examine overparameterized nonconvex models such as low-rank matrix sensing, matrix completion, robust PCA, as well as a 2-layer neural network as test cases. We show that under standard statistical assumptions, "flat" minima (minima with the smallest local average curvature, measured by the trace of the Hessian matrix) provably generalize in all these cases. These algorithm-agnostic results suggest a theoretical basis for favoring methods that bias iterates towards flat solutions, and help inform the design of better training algorithms.
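As a concrete handle on the flatness measure mentioned above, the sketch below estimates the trace of the Hessian with Hutchinson's estimator on a toy quadratic, forming Hessian-vector products by finite differences of the gradient; the toy objective and probe count are assumptions for illustration only.

```python
# Hutchinson's estimator for tr(Hessian), a common proxy for local "flatness".
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
H_true = A @ A.T                      # toy objective f(w) = 0.5 w^T H w, so the Hessian is H

def grad(w):
    return H_true @ w

def hutchinson_trace(w, n_probes=1000, eps=1e-4):
    est = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=w.shape)                  # Rademacher probe
        hv = (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)   # Hessian-vector product
        est += v @ hv                                              # E[v^T H v] = tr(H)
    return est / n_probes

w = rng.standard_normal(10)
print(hutchinson_trace(w), np.trace(H_true))   # the estimate should be close to the true trace
```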
Hunting the Hessian
Madeleine Udell, Stanford University
Ill-conditioned loss landscapes are ubiquitous in machine learning, and they slow down optimization. Preconditioning the gradient to make the loss more isotropic is a natural solution, but it is challenging for extremely large problems, as direct access to the problem Hessian is prohibitively expensive. We present two fresh approaches to preconditioning that use tools from randomized numerical linear algebra and online convex optimization for efficient access to Hessian information, motivated by the question: what is the most useful information we can query from the problem Hessian using linear memory and compute?
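One simple way to extract curvature information with only linear memory, offered purely as an illustration rather than as the methods developed in the talk, is to estimate the Hessian diagonal from Hessian-vector products and use it to precondition gradient steps:

```python
# Diagonal-Hessian preconditioning from Hessian-vector products only (linear memory).
import numpy as np

rng = np.random.default_rng(0)
d = 50
H = np.diag(np.linspace(1.0, 100.0, d))   # badly conditioned toy quadratic (diagonal for simplicity)

def grad(w):
    return H @ w

def hvp(w, v, eps=1e-4):
    # Hessian-vector product via finite differences of the gradient.
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

def estimate_diag_hessian(w, n_probes=30):
    # Hutchinson-style diagonal estimator: E[v * (H v)] = diag(H) for Rademacher v.
    acc = np.zeros_like(w)
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=w.shape)
        acc += v * hvp(w, v)
    return acc / n_probes

w = rng.standard_normal(d)
D = np.maximum(estimate_diag_hessian(w), 1e-3)
for _ in range(50):
    w -= grad(w) / D                      # preconditioned gradient step
print(np.linalg.norm(w))                  # near zero despite the poor conditioning
```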
One Small Step, One Giant Leap: From Test-Time Tweaks to Global Guarantees
Mahdi Soltanolkotabi, USC
Simple first-order methods like Gradient Descent (GD) remain foundational to modern machine learning. Yet, despite their widespread use, our theoretical understanding of the GD trajectory—how and why it works—remains incomplete in both classical and contemporary settings. This talk explores new horizons in understanding the behavior and power of GD across two distinct but connected fronts.
In the first part, we examine the surprising power of a single gradient step in enhancing model reasoning. We focus on test-time training (TTT)—a gradient-based approach that adapts model parameters using individual test instances. We introduce a theoretical framework that reveals how TTT can effectively handle distribution shifts and significantly reduce the data required for in-context learning, shedding light on why such simple methods often outperform expectations.
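As a minimal illustration of the TTT idea (the linear model, masked-reconstruction objective, and step size below are hypothetical stand-ins, not the setting analyzed in the talk), a single gradient step on a self-supervised loss built from the test instance adapts the parameters before prediction:

```python
# One-step test-time training on a single test instance (schematic).
import numpy as np

rng = np.random.default_rng(0)
d = 20
W = rng.standard_normal((d, d)) * 0.1            # "pretrained" weights (stand-in)

def self_supervised_grad(W, x):
    # Assumed auxiliary task: reconstruct x from a masked copy of itself.
    mask = np.ones_like(x)
    mask[: d // 2] = 0.0
    r = W @ (mask * x) - x                       # reconstruction residual
    return np.outer(r, mask * x)                 # gradient of 0.5 * ||W (mask*x) - x||^2

def ttt_predict(W, x, lr=0.05):
    W_adapted = W - lr * self_supervised_grad(W, x)   # one gradient step on the test instance
    return W_adapted @ x                              # predict with the adapted parameters

x_test = rng.standard_normal(d)
print(ttt_predict(W, x_test)[:3])
```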
The second part turns to a more classical optimization setting: learning shallow neural networks with GD. Despite extensive study, even fitting a one-hidden-layer model to basic target functions lacks rigorous performance guarantees. We present a comprehensive analysis of the GD trajectory in this regime, showing how it avoids suboptimal stationary points and converges efficiently to global optima. Our results offer new theoretical foundations for understanding how GD succeeds in the presence of sub-optimal stationary points.
The Wisdom of the Body Revisited
Benjamin Recht, UC Berkeley
In 1932, Walter Cannon published his seminal text, The Wisdom of the Body, introducing the notion of homeostasis. He conceived of the body as a complex system actively working to keep itself in a stable state despite adversarial engagement with an uncertain and dangerous environment. Cannon’s concept of homeostasis would not only revolutionize the way we think about medicine but also inspire cyberneticists and early artificial intelligence researchers to think about the body and brain as well-regulated machines.
In this talk, I refocus Cannon's work under a contemporary lens, showing how non-neural biological networks do very smart things. I will describe concepts from feedback control that illuminate necessary architectures for homeostasis. I will show how such systems can be resilient to most disturbances while remaining fragile to specific adversarial vectors. Identifying these fragilities can guide positive interventions that steer dysregulated systems back to stable behavior. Throughout, I aim to highlight the role of mathematical and qualitative theory in our understanding and optimization of systems that behave effectively in unknown futures.
Accelerating Nonconvex Optimization via Online Learning
Aryan Mokhtari, UT Austin
A fundamental problem in optimization is finding an ε-first-order stationary point of a smooth function using only gradient information. The best-known gradient query complexity for this task, assuming both the gradient and Hessian of the objective function are Lipschitz continuous, is O( ε^(−7/4) ). In this talk, I present a method with a gradient complexity of O( d^(1/4) ε^(−13/8) ), where d is the problem dimension—yielding improved complexity when d = O( ε^(−1/2) ). The proposed method builds on quasi-Newton ideas and operates by solving two online learning problems under the hood.
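To see where the crossover condition comes from, compare the two bounds up to constants:

```latex
d^{1/4}\,\varepsilon^{-13/8} \;\le\; \varepsilon^{-7/4}
\quad\Longleftrightarrow\quad
d^{1/4} \;\le\; \varepsilon^{-7/4 + 13/8} \;=\; \varepsilon^{-1/8}
\quad\Longleftrightarrow\quad
d \;\le\; \varepsilon^{-1/2},
```

so the new bound is an improvement exactly in the regime d = O(ε^(−1/2)) stated above.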
The Binary Iterative Hard Thresholding Algorithm
Arya Mazumdar, TILOS & UC San Diego
We will discuss our work on the convergence of iterative hard thresholding algorithms for sparse signal recovery problems. For classification problems with nonseparable data, this algorithm can be thought of as minimizing the so-called ReLU loss. It appears to be very effective (statistically optimal, and a simple iterative method) for a large class of models of nonseparable data, namely sparse generalized linear models. It is also robust to adversarial perturbations. Based on joint work with Namiko Matsumoto.
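For concreteness, here is a hedged sketch of the basic binary iterative hard thresholding update on synthetic one-bit measurements y = sign(Ax*); the step size, iteration count, and normalization are illustrative choices, and the variant analyzed in the talk may differ.

```python
# Basic BIHT: gradient-style step on the one-bit residual, followed by hard thresholding.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 200, 5
x_star = np.zeros(d)
x_star[:k] = rng.standard_normal(k)
x_star /= np.linalg.norm(x_star)              # one-bit measurements only identify the direction
A = rng.standard_normal((n, d))
y = np.sign(A @ x_star)

def hard_threshold(v, k):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]          # keep the k largest-magnitude entries
    out[idx] = v[idx]
    return out

x = np.zeros(d)
for _ in range(100):
    g = A.T @ (y - np.sign(A @ x)) / n        # negative (sub)gradient of a one-sided, ReLU-type loss
    x = hard_threshold(x + 0.5 * g, k)
    x /= max(np.linalg.norm(x), 1e-12)
print(np.linalg.norm(x - x_star))             # should be small: the direction is recovered approximately
```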
Reverse diffusion Monte Carlo
Yian Ma, TILOS & UC San Diego
I will introduce a novel Monte Carlo sampling approach that uses the reverse diffusion process. In particular, the intermediate updates, namely the score functions, can be explicitly estimated to arbitrary accuracy, leading to an unbiased Bayesian inference algorithm. I will then discuss how to use this idea to improve sampling in diffusion models via reverse transition kernels.
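The toy below runs a reverse-diffusion sampler for a one-dimensional Gaussian mixture in which the score of the noised distribution is available in closed form; the method described in the talk instead estimates these scores by Monte Carlo, which is what makes the approach applicable to general targets. All constants here are illustrative.

```python
# Reverse-time Ornstein-Uhlenbeck (diffusion) sampling with exact noised scores (toy).
import numpy as np

rng = np.random.default_rng(0)
means, sig2 = np.array([-3.0, 3.0]), 0.5      # target: equal-weight Gaussian mixture

def noised_score(x, t):
    # Forward OU: x_t = e^{-t} x_0 + sqrt(1 - e^{-2t}) z, so each mixture component
    # becomes N(e^{-t} mu_i, e^{-2t} sig2 + 1 - e^{-2t}).
    m = np.exp(-t) * means
    v = np.exp(-2 * t) * sig2 + 1 - np.exp(-2 * t)
    w = np.exp(-0.5 * (x[:, None] - m) ** 2 / v)
    w /= w.sum(axis=1, keepdims=True)          # posterior component responsibilities
    return (w * (m - x[:, None]) / v).sum(axis=1)

T, steps = 5.0, 500
h = T / steps
x = rng.standard_normal(4000)                  # start from the OU stationary law N(0, 1)
for i in range(steps):
    t = T - i * h
    x = x + h * (x + 2 * noised_score(x, t)) + np.sqrt(2 * h) * rng.standard_normal(x.size)
print(x.mean(), x.std())                       # roughly the mixture's mean (0) and std (about 3.1)
```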
Linear Bregman Divergence Control
Babak Hassibi, Caltech
In the past couple of decades, the use of "non-quadratic" convex cost functions has revolutionized signal processing, machine learning, and statistics, allowing one to customize solutions to have desired structures and properties. However, the situation is not the same in control where the use of quadratic costs still dominates, ostensibly because determining the "value function", i.e., the optimal expected cost-to-go, which is critical to the construction of the optimal controller, becomes computationally intractable as soon as one considers general convex costs. As a result, practitioners often resort to heuristics and approximations, such as model predictive control that only looks a few steps into the future. In the quadratic case, the value function is easily determined by appealing to certainty-equivalence and solving Riccati equations. In this talk, we consider a special class of convex cost functions constructed from Bregman divergence and show how, with appropriate choices, they can be used to fully extend the framework developed for the quadratic case. The resulting optimal controllers are infinite horizon, come with stability guarantees, and have state-feedback, or estimated state-feedback, laws. They exhibit a much wider range of behavior than their quadratic counterparts since the feedback laws are nonlinear. We demonstrate the applicability of the approach to several cases of interest, including safety control, sparse control, and bang-bang control.
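For reference (background notation, not the talk's specific cost construction), the Bregman divergence generated by a strictly convex potential φ is

```latex
D_\varphi(x, y) \;=\; \varphi(x) - \varphi(y) - \langle \nabla \varphi(y),\, x - y \rangle,
```

and taking φ(x) = (1/2)‖x‖² gives D_φ(x, y) = (1/2)‖x - y‖², recovering the quadratic cost of the classical setting that this class of cost functions extends.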
Unleashing the Power of Variance Reduction for Training Large Models
Quanquan Gu, UCLA
Training deep neural networks, and more recently large language models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this talk, I will introduce a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within this framework, I will introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. In addition, I will draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
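As background for the recursive-momentum component (a sketch under simplifying assumptions that omits MARS's scaling and preconditioning), the STORM-style update below re-evaluates a single stochastic gradient sample at two consecutive iterates, so the variance of the gradient estimate shrinks when the iterates move slowly:

```python
# STORM-style recursive momentum on a noisy quadratic (schematic, not the MARS update).
import numpy as np

rng = np.random.default_rng(0)
d, noise = 20, 1.0
x_star = rng.standard_normal(d)

def stochastic_grad(x, z):
    # Noisy gradient of f(x) = 0.5 ||x - x_star||^2, with an explicit noise sample z.
    return (x - x_star) + noise * z

x = np.zeros(d)
m = stochastic_grad(x, rng.standard_normal(d))   # initial gradient estimate
lr, a = 0.1, 0.1
for _ in range(500):
    x_prev, x = x, x - lr * m
    z = rng.standard_normal(d)                   # one fresh sample, evaluated at both iterates
    m = stochastic_grad(x, z) + (1 - a) * (m - stochastic_grad(x_prev, z))
print(np.linalg.norm(x - x_star))                # small: the recursive correction suppresses noise
```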
Optimization and Reasoning
Sicun (Sean) Gao, TILOS & UC San Diego
For a while, the remarkable progress in ML/AI has led many to dismiss the old-fashioned concerns of NP-hardness, with the belief that sufficiently large nonlinear functions can be trained to encode solutions to everything of practical relevance. Yet, as these functions are increasingly deployed as black-box models with agency, their usability is once again constrained by our ability to answer fundamental questions that demand deeper understanding across the entire training and inference pipeline. These questions inevitably correspond to solving NP-hard problems that remain well beyond the reach of existing algorithms. The formal reasoning community has spent decades developing a rich arsenal of tools for tackling similar problems, but mostly for discrete symbolic computing systems. Extending the same rigor and algorithmic power to the continuous domain is a grand challenge that has to be confronted. We need to unify optimization and reasoning towards new generations of capable algorithms that bring together numerical/analytic, combinatorial/algebraic, and statistical/probabilistic approaches. Addressing these challenges can establish new computational foundations for all real-world engineering disciplines too.
The Architecture of Intelligence
John Doyle, Caltech
Organisms and machines that show even minimal “intelligence”, from bacteria to humans to the latest LLMs, are vastly diverse, yet they share a universal architecture involving layers, levels, and laws, or ULA for short. We will discuss the most important features of ULAs, which are Diversity-enabled Sweet Spots (DeSS), efficiency-speed-accuracy constraints, tradeoffs, and conservation laws, and “constraints that deconstrain.” Depending on interest, motivating case studies can come from biology, neuroscience, medicine, and technology, with language offering a timely, familiar, fun, and controversial subject. Much of the relevant math is relatively new and will be sketched with links to publications. Most of it can be viewed as applications of optimization to control, networks, and “intelligence.”
Amplifying human performance in combinatorial competitive programming
Petar Veličković, Google DeepMind
Recent years have seen a significant surge in complex AI systems for competitive programming, capable of performing at admirable levels against human competitors. While steady progress has been made, the highest percentiles still remain out of reach for these methods on standard competition platforms such as Codeforces. In this talk, I will describe and dive into our recent work, where we focussed on combinatorial competitive programming. In combinatorial challenges, the target is to find as-good-as-possible solutions to otherwise computationally intractable problems, over specific given inputs. We hypothesise that this scenario offers a unique testbed for human-AI synergy, as human programmers can write a backbone of a heuristic solution, after which AI can be used to optimise the scoring function used by the heuristic. We deploy our approach on previous iterations of Hash Code, a global team programming competition inspired by NP-hard software engineering problems at Google, and we leverage FunSearch to evolve our scoring functions. Our evolved solutions significantly improve the attained scores from their baseline, successfully breaking into the top percentile on all previous Hash Code online qualification rounds, and outperforming the top human teams on several. To the best of our knowledge, this is the first known AI-assisted top-tier result in competitive programming.
Foundational Methods for Foundation Models for Scientific Machine Learning
Michael W. Mahoney, LBNL and UC Berkeley
The remarkable successes of ChatGPT in natural language processing (NLP) and related developments in computer vision (CV) motivate the question of what foundation models would look like, and what new advances they would enable, when built on the rich, diverse, multimodal data available from large-scale experiments and simulations in scientific computing (SC), broadly defined. Such models could provide a robust and principled foundation for scientific machine learning (SciML), going well beyond simply using ML tools developed for internet and social media applications to help solve future scientific problems. I will describe recent work demonstrating the potential of the “pre-train and fine-tune” paradigm, widely used in CV and NLP, for SciML problems, demonstrating a clear path towards building SciML foundation models; as well as recent work highlighting multiple “failure modes” that arise when trying to interface data-driven ML methodologies with domain-driven SC methodologies, demonstrating clear obstacles to traversing that path successfully. I will also describe initial work on developing novel methods to address several of these challenges, as well as their implementations at scale, a general solution to which will be needed to build robust and reliable SciML models consisting of millions or billions or trillions of parameters.
Michael W. Mahoney is at the University of California at Berkeley in the Department of Statistics and at the International Computer Science Institute (ICSI). He is also an Amazon Scholar as well as head of the Machine Learning and Analytics Group at the Lawrence Berkeley National Laboratory. He works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, scientific machine learning, scalable stochastic optimization, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, computational methods for neural network analysis, physics informed machine learning, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale University with a dissertation in computational statistical mechanics, and he has worked and taught at Yale University in the mathematics department, at Yahoo Research, and at Stanford University in the mathematics department. Among other things, he was on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI), he was on the National Research Council’s Committee on the Analysis of Massive Data, he co-organized the Simons Institute’s fall 2013 and 2018 programs on the foundations of data science, he ran the Park City Mathematics Institute’s 2016 PCMI Summer Session on The Mathematics of Data, he ran the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets, and he was the Director of the NSF/TRIPODS-funded FODA (Foundations of Data Analysis) Institute at UC Berkeley. More information is available at https://www.stat.berkeley.edu/~mmahoney/.
Single location regression and attention-based models
Claire Boyer, Université Paris-Saclay
Attention-based models, such as Transformers, excel across various tasks but lack a comprehensive theoretical understanding, especially regarding token-wise sparsity and internal linear representations. To address this gap, we introduce the single-location regression task, where only one token in a sequence determines the output, and its position is a latent random variable, retrievable via a linear projection of the input. To solve this task, we propose a dedicated predictor, which turns out to be a simplified version of a non-linear self-attention layer. We study its theoretical properties, showing its asymptotic Bayes optimality and analyzing its training dynamics. In particular, despite the non-convex nature of the problem, the predictor effectively learns the underlying structure. This work highlights the capacity of attention mechanisms to handle sparse token information and internal linear structures.
Synthetic Tasks as Testbeds for Attributing Model Behavior
Surbhi Goel, University of Pennsylvania
Understanding how different components of the machine learning pipeline—spanning data composition, architectural choices, and optimization dynamics—shape model behavior remains a fundamental challenge. In this talk, I will argue that synthetic tasks, which enable precise control over data distribution and task complexity, serve as powerful testbeds for analyzing and attributing behaviors in deep learning. Focusing on the sparse parity learning problem, a canonical task in learning theory, I will present insights into: (1) the phenomenon of “hidden progress” in gradient-based optimization, where models exhibit consistent advancement despite stagnating loss curves; (2) nuanced trade-offs between data, compute, model width, and initialization that govern learning success; and (3) the role of progressive distillation in implicitly structuring curricula to accelerate feature learning. These findings highlight the utility of synthetic tasks in uncovering empirical insights into the mechanisms driving deep learning, without the cost of training expensive models.
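For readers unfamiliar with the task, the snippet below generates sparse parity data: inputs are uniform over {−1, +1}^d and the label is the product of a hidden size-k subset of coordinates, so no single coordinate is correlated with the label. The dimensions and subset size are arbitrary illustrative choices.

```python
# Sparse parity: a synthetic task where the label is the parity of k hidden coordinates.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 50, 3, 10000
support = rng.choice(d, size=k, replace=False)   # hidden relevant coordinates

X = rng.choice([-1.0, 1.0], size=(n, d))         # uniform Boolean inputs
y = np.prod(X[:, support], axis=1)               # parity (product of signs) of the hidden subset

corr = (X * y[:, None]).mean(axis=0)
print(np.abs(corr).max())   # O(n^{-1/2}): no single coordinate reveals the support, a property
                            # connected to the "hidden progress" phenomenon described above
```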
This talk is based on joint work with a lot of amazing collaborators: Boaz Barak, Ben Edelman, Sham Kakade, Bingbin Liu, Eran Malach, Sadhika Malladi, Abhishek Panigrahi, Andrej Risteski, and Cyril Zhang.
Surbhi Goel is the Magerman Term Assistant Professor of Computer and Information Science at the University of Pennsylvania. She is associated with the theory group, the ASSET Center on safe, explainable, and trustworthy AI systems, and the Warren Center for Network and Data Sciences. Surbhi’s research focuses on theoretical foundations of modern machine learning paradigms, particularly deep learning, and is supported by Microsoft Research and OpenAI. Previously, she was a postdoctoral researcher at Microsoft Research NYC and completed her Ph.D. at the University of Texas at Austin under Adam Klivans, receiving the UTCS Bert Kay Dissertation Award. She has also been a visiting researcher at IAS, Princeton, and the Simons Institute at UC Berkeley. Surbhi co-founded the Learning Theory Alliance (LeT‐All) and holds several leadership roles, including Office Hours co-chair for ICLR 2024 and co-treasurer for the Association for Computational Learning Theory.
Tutorial on AI Alignment (part 2 of 2): Methodologies for AI Alignment
Ahmad Beirami, Google DeepMind
Hamed Hassani, University of Pennsylvania
The second part of the tutorial focuses on AI alignment techniques and is structured as three segments. In the first segment, we examine black-box techniques aimed at aligning models towards various goals (e.g., safety), such as controlled decoding and the best-of-N algorithm. In the second segment, we turn to efficiency, examining information-theoretic techniques designed to improve inference latency, such as model compression and speculative decoding. If time permits, in the final segment we discuss inference-aware alignment, a framework for aligning models to work better with inference-time compute algorithms.
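As a schematic of the best-of-N technique mentioned above (the generate and reward_model callables are hypothetical stand-ins, not a particular library's API), the sampler simply draws N candidate responses and returns the one the reward model prefers:

```python
# Best-of-N sampling: draw N candidates from the base model, keep the highest-reward one.
import random
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward_model: Callable[[str, str], float],
              n: int = 8) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]

# Toy usage with stub functions standing in for a real language model and reward model.
random.seed(0)
stub_generate = lambda p: p + " -> answer " + str(random.randint(0, 100))
stub_reward = lambda p, c: -abs(int(c.split()[-1]) - 42)   # prefers answers near 42
print(best_of_n("question", stub_generate, stub_reward, n=8))
```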



















