HOT-AI: Horizons for Optimization in AI Workshop
The TILOS Horizons for Optimization in AI ("HOT-AI") Workshop will bring together leading researchers and practitioners to explore the evolving landscape of optimization for AI, and how AI itself is reshaping optimization. Over two days of engaging talks and interactive discussions, participants will dive into cutting-edge research, share insights, and exchange ideas on the most pressing challenges and opportunities at the intersection of these fields.
Confirmed Speakers and Panelists
Caltech
University of Washington
TILOS (UC San Diego)
University of California, Los Angeles
Caltech
TILOS (UC San Diego)
TILOS (UC San Diego)
The University of Texas at Austin
University of California, Berkeley
University of Southern California
Stanford University
Schedule
Thursday, April 17, 2025
8:30 - 9:00am | Registration and Breakfast |
9:00 - 9:10am | Opening Remarks | Yusu Wang, TILOS Director; Arya Mazumdar, TILOS Associate Director of Research |
9:10 - 9:55am | Keynote: Accelerating Nonconvex Optimization via Online Learning | Aryan Mokhtari, The University of Texas at Austin [ RECORDING ] |
10:00 - 10:20am | TILOS Faculty Talk: Reverse diffusion Monte Carlo | Yian Ma, UC San Diego [ RECORDING ] |
10:20 - 10:45am | Break |
10:45 - 11:30am | Keynote: The Architecture of Intelligence | John Doyle, Caltech [ RECORDING ] |
11:35am - 12:20pm | Keynote: Linear Bregman Divergence Control | Babak Hassibi, Caltech [ RECORDING ] |
12:25 - 1:45pm | Lunch |
1:45 - 2:30pm | Keynote: Hunting the Hessian | Madeleine Udell, Stanford University [ RECORDING ] |
2:35 - 2:55pm | TILOS Faculty Talk: Optimization and Reasoning | Sicun Gao, UC San Diego [ RECORDING ] |
2:55 - 3:15pm | Break |
3:15 - 4:00pm | Keynote: One Small Step, One Giant Leap: From Test-Time Tweaks to Global Guarantees | Mahdi Soltanolkotabi, University of Southern California [ RECORDING ] |
4:05 - 4:25pm | TILOS Faculty Talk: The Binary Iterative Hard Thresholding Algorithm | Arya Mazumdar, UC San Diego [ RECORDING ] |
Friday, April 18, 2025
8:30 - 9:00am | Registration and Breakfast |
9:00 - 9:45am | Keynote: Unleashing the Power of Variance Reduction for Training Large Models | Quanquan Gu, University of California, Los Angeles [ RECORDING ] |
9:50 - 10:35am | Keynote: Flat Minima and Generalization: from Matrix Sensing to Neural Networks | Maryam Fazel, University of Washington [ RECORDING ] |
10:35 - 11:00am | Break |
11:00am - 12:00pm | Panel Discussion: Should we still learn the theory of optimization today? | John Doyle, Caltech; Quanquan Gu, University of California, Los Angeles; Benjamin Recht, University of California, Berkeley; Mahdi Soltanolkotabi, University of Southern California; Moderator: Misha Belkin, TILOS & UC San Diego |
12:00 - 1:30pm | Lunch |
1:30 - 2:15pm | Keynote: The Wisdom of the Body Revisited | Benjamin Recht, University of California, Berkeley [ RECORDING ] |
2:15 - 2:45pm | Break |
2:45 - 3:45pm | Student & Postdoc Lightning Talks |
Date & Time
Thursday, April 17 | 8:30am - 4:30pm
Friday, April 18 | 8:30am - 4:00pm
Registration
Registration is complimentary but required, as space is limited. Register HERE by Friday, April 11, 2025.
Zoom information will be provided before the workshop to those who register to attend remotely.
Venue
Halıcıoğlu Data Science Institute Room 123
University of California, San Diego
3234 Matthews Lane
La Jolla, CA 92093
[MAP]
Parking
Gilman Parking Structure (252 Russell Ln, La Jolla, CA 92093; 5-minute walk to venue).
Hopkins Parking Structure (9800 Hopkins Dr, La Jolla, CA 92093; 10-minute walk to venue).
Parking fees are payable at pay stations or by phone. Note that many visitor spots are limited to two hours. Even though the app allows you to pay for longer periods, you will get a ticket after two hours if parked in a 2-hour space.
Contacts
Talk Abstracts
Thursday, April 17
Accelerating Nonconvex Optimization via Online Learning
9:10 - 9:55am | Aryan Mokhtari, UT Austin
[ RECORDING ]
A fundamental problem in optimization is finding an $\epsilon$-first-order stationary point of a smooth function using only gradient information. The best-known gradient query complexity for this task, assuming both the gradient and Hessian of the objective function are Lipschitz continuous, is $O(\epsilon^{-7/4})$. In this talk, I present a method with a gradient complexity of $O(d^{1/4}\epsilon^{-13/8})$, where $d$ is the problem dimension, yielding improved complexity when $d = O(\epsilon^{-1/2})$. The proposed method builds on quasi-Newton ideas and operates by solving two online learning problems under the hood.
Reverse diffusion Monte Carlo
10:00 - 10:20am | Yian Ma, TILOS & UC San Diego
[ RECORDING ]
I will introduce a novel Monte Carlo sampling approach that uses the reverse diffusion process. In particular, the intermediary updates (the score functions) can be explicitly estimated to arbitrary accuracy, leading to an unbiased Bayesian inference algorithm. I will then discuss how to use this idea to improve sampling in the diffusion models via reverse transition kernels.
The Architecture of Intelligence
10:45 - 11:30am | John Doyle, Caltech
[ RECORDING ]
Organisms and machines that show even minimal “intelligence,” from bacteria to humans to the latest LLMs, are vastly diverse yet share a universal architecture involving layers, levels, and laws, or ULA for short. We will discuss the most important features of ULAs, which are Diversity-enabled Sweet Spots (DeSS), efficiency-speed-accuracy constraints, tradeoffs, and conservation laws, and “constraints that deconstrain.” Depending on interest, motivating case studies can come from biology, neuroscience, medicine, and technology, with language offering a timely, familiar, fun, and controversial subject. Much of the relevant math is relatively new and will be sketched with links to publications. Most of it can be viewed as applications of optimization to control, networks, and “intelligence.”
Linear Bregman Divergence Control
11:35am - 12:20pm | Babak Hassibi, Caltech
[ RECORDING ]
In the past couple of decades, the use of "non-quadratic" convex cost functions has revolutionized signal processing, machine learning, and statistics, allowing one to customize solutions to have desired structures and properties. However, the situation is not the same in control where the use of quadratic costs still dominates, ostensibly because determining the "value function", i.e., the optimal expected cost-to-go, which is critical to the construction of the optimal controller, becomes computationally intractable as soon as one considers general convex costs. As a result, practitioners often resort to heuristics and approximations, such as model predictive control that only looks a few steps into the future. In the quadratic case, the value function is easily determined by appealing to certainty-equivalence and solving Riccati equations. In this talk, we consider a special class of convex cost functions constructed from Bregman divergence and show how, with appropriate choices, they can be used to fully extend the framework developed for the quadratic case. The resulting optimal controllers are infinite horizon, come with stability guarantees, and have state-feedback, or estimated state-feedback, laws. They exhibit a much wider range of behavior than their quadratic counterparts since the feedback laws are nonlinear. We demonstrate the applicability of the approach to several cases of interest, including safety control, sparse control, and bang-bang control.
Hunting the Hessian
1:45 - 2:30pm | Madeleine Udell, Stanford University
[ RECORDING ]
Ill-conditioned loss landscapes are ubiquitous in machine learning, and they slow down optimization. Preconditioning the gradient to make the loss more isotropic is a natural solution, but is challenging for extremely large problems, as direct access to the problem Hessian is prohibitively expensive. We present two fresh approaches to preconditioning using tools from randomized numerical linear algebra and online convex optimization for efficient access to Hessian information, motivated by the question: what is the most useful information we can query from the problem Hessian using linear memory and compute?
Optimization and Reasoning
2:35 - 2:55pm | Sicun Gao, TILOS & UC San Diego
[ RECORDING ]
For a while, the remarkable progress in ML/AI has led many to dismiss the old-fashioned concerns of NP-hardness, with the belief that sufficiently large nonlinear functions can be trained to encode solutions to everything of practical relevance. Yet, as these functions are increasingly deployed as black-box models with agency, their usability is once again constrained by our ability to answer fundamental questions that demand deeper understanding across the entire training and inference pipeline. These questions inevitably correspond to solving NP-hard problems that remain well beyond the reach of existing algorithms. The formal reasoning community has spent decades developing a rich arsenal of tools for tackling similar problems, but mostly for discrete symbolic computing systems. Extending the same rigor and algorithmic power to the continuous domain is a grand challenge that has to be confronted. We need to unify optimization and reasoning towards new generations of capable algorithms that bring together numerical/analytic, combinatorial/algebraic, and statistical/probabilistic approaches. Addressing these challenges can establish new computational foundations for all real-world engineering disciplines too.
One Small Step, One Giant Leap: From Test-Time Tweaks to Global Guarantees
3:15 - 4:00pm | Mahdi Soltanolkotabi, University of Southern California
[ RECORDING ]
Simple first-order methods like Gradient Descent (GD) remain foundational to modern machine learning. Yet, despite their widespread use, our theoretical understanding of the GD trajectory—how and why it works—remains incomplete in both classical and contemporary settings. This talk explores new horizons in understanding the behavior and power of GD across two distinct but connected fronts.
In the first part, we examine the surprising power of a single gradient step in enhancing model reasoning. We focus on test-time training (TTT)—a gradient-based approach that adapts model parameters using individual test instances. We introduce a theoretical framework that reveals how TTT can effectively handle distribution shifts and significantly reduce the data required for in-context learning, shedding light on why such simple methods often outperform expectations.
The second part turns to a more classical optimization setting: learning shallow neural networks with GD. Despite extensive study, even fitting a one-hidden-layer model to basic target functions lacks rigorous performance guarantees. We present a comprehensive analysis of the GD trajectory in this regime, showing how it avoids suboptimal stationary points and converges efficiently to global optima. Our results offer new theoretical foundations for understanding how GD succeeds in the presence of suboptimal stationary points.
The Binary Iterative Hard Thresholding Algorithm
4:05 - 4:25pm | Arya Mazumdar, TILOS & UC San Diego
[ RECORDING ]
We will discuss our work on the convergence of iterative hard thresholding algorithms for sparse signal recovery problems. For classification problems with nonseparable data, this algorithm can be thought of as minimizing the so-called ReLU loss. It seems to be very effective (statistically optimal, simple iterative method) for a large class of models of nonseparable data: sparse generalized linear models. It is also robust to adversarial perturbation. Based on joint work with Namiko Matsumoto.
Friday, April 18
Unleashing the Power of Variance Reduction for Training Large Models
9:00 - 9:45am | Quanquan Gu, UCLA
[ RECORDING ]
Training deep neural networks, and more recently large language models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this talk, I will introduce a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within this framework, I will introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. In addition, I will draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
Flat Minima and Generalization: from Matrix Sensing to Neural Networks
9:50 - 10:35am | Maryam Fazel, University of Washington
[ RECORDING ]
When do overparameterized neural networks avoid overfitting and generalize to unseen data? Empirical evidence suggests that the shape of the training loss function near the solution matters: the minima where the loss is “flatter” tend to lead to better generalization. Yet quantifying flatness and its rigorous analysis, even in simple models, has been limited. In this talk, we examine overparameterized nonconvex models such as low-rank matrix sensing, matrix completion, robust PCA, as well as a 2-layer neural network as test cases. We show that under standard statistical assumptions, "flat" minima (minima with the smallest local average curvature, measured by the trace of the Hessian matrix) provably generalize in all these cases. These algorithm-agnostic results suggest a theoretical basis for favoring methods that bias iterates towards flat solutions, and help inform the design of better training algorithms.
The Wisdom of the Body Revisited
1:30 - 2:15pm | Benjamin Recht, UC Berkeley
[ RECORDING ]
In 1932, Walter Cannon published his seminal text, The Wisdom of the Body, introducing the notion of homeostasis. He conceived of the body as a complex system actively working to keep itself in a stable state despite adversarial engagement with an uncertain and dangerous environment. Cannon’s concept of homeostasis would not only revolutionize the way we think about medicine but also inspire cyberneticists and early artificial intelligence researchers to think about the body and brain as well-regulated machines.
In this talk, I refocus Cannon's work under a contemporary lens, showing how non-neural biological networks do very smart things. I will describe concepts from feedback control that illuminate necessary architectures for homeostasis. I will show how such systems can be both resilient to most disturbances and fragile to specific adversarial vectors. Identifying these fragilities can guide positive interventions that can steer dysregulated systems back to stable behavior. Throughout, I aim to highlight the role of mathematical and qualitative theory in our understanding and optimization of systems that behave effectively in unknown futures.
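For readers new to the feedback-control view of homeostasis, one standard textbook illustration is integral feedback, which holds a regulated variable at its setpoint despite a constant disturbance. This is an illustrative assumption, not necessarily an example from the talk. The minimal Python sketch below uses a hypothetical scalar model; the function name `simulate` and the parameters `a`, `kp`, `ki`, and `disturbance` are made up for illustration.

```python
def simulate(kp, ki, disturbance=1.0, a=1.0, dt=0.01, steps=2000):
    """Euler simulation of a hypothetical scalar system
        dx/dt = -a*x + u + disturbance,
    where x is the deviation from the setpoint and the controller is
        u = -kp*x - ki*z,   dz/dt = x   (z integrates the deviation).
    Returns the deviation x after `steps` time steps.
    """
    x, z = 0.0, 0.0
    for _ in range(steps):
        u = -kp * x - ki * z                      # feedback action
        x_next = x + dt * (-a * x + u + disturbance)
        z = z + dt * x                            # accumulate past deviation
        x = x_next
    return x


# Proportional feedback alone leaves a residual offset of
# disturbance / (a + kp); adding the integral term removes it.
offset_p = simulate(kp=1.0, ki=0.0)
offset_pi = simulate(kp=1.0, ki=1.0)
print(f"proportional only:     {offset_p:.4f}")
print(f"proportional+integral: {offset_pi:.6f}")
```

In this toy model, the proportional-only controller settles at a nonzero deviation, while the integral term drives the deviation to zero for any constant disturbance. The integral term does not, in general, reject disturbances that vary over time.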