# Recorded Talks: Foundations of AI and Optimization

# What Kinds of Functions do Neural Networks Learn? Theory and Practical Applications

*Robert Nowak, University of Wisconsin*

This talk presents a theory characterizing the types of functions neural networks learn from data. Specifically, the function space generated by deep ReLU networks consists of compositions of functions from the Banach space of second-order bounded variation in the Radon transform domain. This Banach space includes functions with smooth projections in most directions. A representer theorem associated with this space demonstrates that finite-width neural networks suffice for fitting finite datasets. The theory has several practical applications. First, it provides a simple and theoretically grounded method for network compression. Second, it shows that multi-task training can yield significantly different solutions compared to single-task training, and that multi-task solutions can be related to kernel ridge regressions. Third, the theory has implications for improving implicit neural representations, where multi-layer neural networks are used to represent continuous signals, images, or 3D scenes. This exploration bridges theoretical insights with practical advancements, offering a new perspective on neural network capabilities and future research directions.

Robert Nowak is the Grace Wahba Professor of Data Science and Keith and Jane Nosbusch Professor in Electrical and Computer Engineering at the University of Wisconsin-Madison. His research focuses on machine learning, optimization, and signal processing. He serves on the editorial boards of the SIAM Journal on the Mathematics of Data Science and the IEEE Journal on Selected Areas in Information Theory.

# Transformers Learn In-context by (Functional) Gradient Descent

*Xiang Cheng, TILOS Postdoctoral Scholar at MIT*

Motivated by the in-context learning phenomenon, we investigate how the Transformer neural network can implement learning algorithms in its forward pass. We show that a linear Transformer naturally learns to implement gradient descent, which enables it to learn linear functions in-context. More generally, we show that a non-linear Transformer can implement functional gradient descent with respect to some RKHS metric, which allows it to learn a broad class of functions in-context. Additionally, we show that the RKHS metric is determined by the choice of attention activation, and that the optimal choice of attention activation depends in a natural way on the class of functions that need to be learned. I will end by discussing some implications of our results for the choice and design of Transformer architectures.
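As a minimal illustration of the linear case (a sketch of a standard identity, not the construction presented in the talk), one gradient step on a least-squares loss from zero initialization gives the same prediction as a softmax-free attention readout whose keys are the context inputs and whose values are the context labels. The function names, the learning rate `lr`, and the toy data are assumptions made for this example.

```python
import numpy as np

def gd_one_step_prediction(X, y, x_query, lr):
    """Predict at x_query after one gradient step on least squares, starting from w = 0."""
    n = X.shape[0]
    # Loss L(w) = (1 / (2n)) * sum_i (w @ x_i - y_i)^2; its gradient at w = 0 is -(1/n) X^T y.
    w = lr * (X.T @ y) / n
    return w @ x_query

def linear_attention_prediction(X, y, x_query, lr):
    """The same prediction written as linear (softmax-free) attention."""
    n = X.shape[0]
    scores = X @ x_query          # key-query inner products, one per context example
    return lr * (scores @ y) / n  # the values are the labels y_i

# Hypothetical toy context: 32 examples of a 4-dimensional linear function.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
w_true = rng.normal(size=4)
y = X @ w_true
x_query = rng.normal(size=4)

pred_gd = gd_one_step_prediction(X, y, x_query, lr=0.5)
pred_attn = linear_attention_prediction(X, y, x_query, lr=0.5)
```

Both functions compute `lr / n * sum_i y_i * (x_i @ x_query)`, so the two predictions agree up to floating-point rounding.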

# How Large Models of Language and Vision Help Agents to Learn to Behave

*Roy Fox, Assistant Professor and Director of the Intelligent Dynamics Lab, UC Irvine*

If learning from data is valuable, can learning from big data be very valuable? So far, it has been so in vision and language, for which foundation models can be trained on web-scale data to support a plethora of downstream tasks; not so much in control, for which scalable learning remains elusive. Can information encoded in vision and language models guide reinforcement learning of control policies? In this talk, I will discuss several ways for foundation models to help agents to learn to behave. Language models can provide better context for decision-making: we will see how they can succinctly describe the world state to focus the agent on relevant features; and how they can form generalizable skills that identify key subgoals. Vision and vision–language models can help the agent to model the world: we will see how they can block visual distractions to keep state representations task-relevant; and how they can hypothesize about abstract world models that guide exploration and planning.

Roy Fox is an Assistant Professor of Computer Science at the University of California, Irvine. His research interests include theory and applications of control learning: reinforcement learning (RL), control theory, information theory, and robotics. His current research focuses on structured and model-based RL, language for RL and RL for language, and optimization in deep control learning of virtual and physical agents.

# TILOS Seminar: The Dissimilarity Dimension: Sharper Bounds for Optimistic Algorithms

*Aldo Pacchiano, Assistant Professor, Boston University Center for Computing and Data Sciences*

The principle of Optimism in the Face of Uncertainty (OFU) is one of the foundational algorithmic design choices in Reinforcement Learning and Bandits. Optimistic algorithms balance exploration and exploitation by deploying data collection strategies that maximize expected rewards in plausible models. This is the basis of celebrated algorithms like the Upper Confidence Bound (UCB) for multi-armed bandits. For nearly a decade, the analysis of optimistic algorithms, including Optimistic Least Squares, in the context of rich reward function classes has relied on the concept of eluder dimension, introduced by Russo and Van Roy in 2013. In this talk we shed light on the limitations of the eluder dimension in capturing the true behavior of optimistic strategies in the realm of function approximation. We remedy these limitations by introducing a novel statistical measure, the “dissimilarity dimension”. We show it can be used to provide a sharper sample analysis of algorithms like Optimistic Least Squares by establishing a link between regret and the dissimilarity dimension. To illustrate this, we will show that some function classes have arbitrarily large eluder dimension but constant dissimilarity dimension. Our regret analysis draws inspiration from graph theory and may be of interest to the mathematically minded beyond the field of statistical learning theory. This talk sheds new light on the fundamental principle of optimism and its algorithms in the function approximation regime, advancing our understanding of these concepts.
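For readers unfamiliar with optimism, the classical UCB1 rule for multi-armed bandits can be sketched as follows. This is background only, not the Optimistic Least Squares algorithm or the dissimilarity-dimension analysis from the talk. The Bernoulli reward model, the arm means, and the function name `ucb1` are assumptions chosen for the example.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms and return how often each arm was pulled.

    Assumes horizon >= number of arms and rewards in [0, 1].
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k    # number of pulls per arm
    sums = [0.0] * k    # total reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialize the estimates
        else:
            # Optimism: pick the arm with the largest upper confidence bound.
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Hypothetical three-armed instance.
counts = ucb1([0.2, 0.5, 0.8], horizon=2000)
```

The bonus term shrinks as an arm is pulled more often, so arms that look poor after enough pulls stop being selected, while rarely pulled arms keep a large bonus and are still explored.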

# TILOS Seminar: Building Personalized Decision Models with Federated Human Preferences

*Aadirupa Saha, Research Scientist at Apple*

Customer statistics collected in several real-world systems show that users often prefer to express their liking for a given pair of items, say (A,B), through relative queries such as “Do you prefer Item A over B?” rather than through their absolute counterparts, such as “How much do you score items A and B on a scale of [0-10]?”. This observation, together with the search for a more effective feedback collection mechanism, led to the famous formulation of Dueling Bandits (DB), a widely studied online learning framework for efficient information aggregation from relative/comparative feedback. Despite this novel objective, however, most existing DB techniques have been limited to simpler settings of finite decision spaces and stochastic environments, which are unrealistic in practice.

In this talk, we will start with the basic problem formulations for DB and familiarize ourselves with some of the breakthrough results. Following this, we will dive deeper into a more practical framework of contextual dueling bandits (C-DB), where the goal of the learner is to make personalized predictions based on the user contexts. We will see a new algorithmic approach that can efficiently achieve the optimal O(sqrt T) regret performance for this problem, resolving an open problem from Dudík et al. [COLT, 2015]. In the last part of the talk, we will extend the aforementioned models to a federated framework, which entails developing preference-driven prediction models for distributed environments in order to create large-scale personalized systems, including recommender systems and chatbot interactions. Apart from exploiting the limited preference feedback model, the challenge lies in ensuring user privacy and reducing communication complexity in the federated setting. We will conclude the talk with some interesting open problems.

# Towards Foundation Models for Graph Reasoning and AI 4 Science

*Michael Galkin, Research Scientist at Intel AI Lab*

Foundation models in graph learning are hard to design due to the lack of common invariances that transfer across different structures and domains. In this talk, I will give an overview of the two main tracks of my research at Intel AI: creating foundation models for knowledge graph reasoning that can run zero-shot inference on any multi-relational graphs, and foundation models for materials discovery in the AI4Science domain that capture physical properties of crystal structures and transfer to a variety of predictive and generative tasks. We will also talk about theoretical and practical challenges like scaling behavior, data scarcity, and diverse evaluation of foundation graph models.

Michael Galkin is a Research Scientist at Intel AI Lab in San Diego working on Graph Machine Learning and Geometric Deep Learning. Previously, he was a postdoc at Mila–Quebec AI Institute with Will Hamilton, Reihaneh Rabbany, and Jian Tang, focusing on many graph representation learning problems. Sometimes, Mike writes long blog posts on Medium about graph learning.

# TILOS Fireside Chat: Theory in the Age of Modern AI

A conversation about theory in the age of modern artificial intelligence (AI) with TILOS member panelists Nisheeth Vishnoi, Tara Javidi, Misha Belkin, and Arya Mazumdar (moderator).

# TILOS Seminar: Rare Gems: Finding Lottery Tickets at Initialization

*Dimitris Papailiopoulos, Associate Professor, University of Wisconsin-Madison*

Large neural networks can be pruned to a small fraction of their original size, with little loss in accuracy, by following a time-consuming “train, prune, re-train” approach. Frankle & Carbin conjectured in 2019 that we can avoid this by training lottery tickets, i.e., special sparse subnetworks found at initialization that can be trained to high accuracy. However, a subsequent line of work presents concrete evidence that current algorithms for finding trainable networks at initialization fail simple baseline comparisons, e.g., against training random sparse subnetworks. Finding lottery tickets that train to better accuracy than simple baselines remains an open problem. In this work, we resolve this open problem by discovering Rare Gems: sparse, trainable networks at initialization that achieve high accuracy even before training. When Rare Gems are trained with SGD, they achieve accuracy competitive with or better than Iterative Magnitude Pruning (IMP) with warmup.

Dimitris Papailiopoulos is the Jay & Cynthia Ihlenfeld Associate Professor of Electrical and Computer Engineering at the University of Wisconsin-Madison, a faculty fellow of the Grainger Institute for Engineering, and a faculty affiliate at the Wisconsin Institute for Discovery. His research interests span machine learning, information theory, and distributed systems, with a current focus on efficient large-scale training algorithms. Before coming to Madison, Dimitris was a postdoctoral researcher at UC Berkeley and a member of the AMPLab. He earned his Ph.D. in ECE from UT Austin, under the supervision of Alex Dimakis. He received his ECE Diploma M.Sc. degree from the Technical University of Crete, in Greece. Dimitris is a recipient of the NSF CAREER Award (2019), three years of Sony Faculty Innovation Awards (2018, 2019 and 2020), a joint IEEE ComSoc/ITSoc Best Paper Award (2020), an IEEE Signal Processing Society Young Author Best Paper Award (2015), the Vilas Associate Award (2021), the Emil Steiger Distinguished Teaching Award (2021), and the Benjamin Smith Reynolds Award for Excellence in Teaching (2019). In 2018, he co-founded MLSys, a new conference that targets research at the intersection of machine learning and systems.

# TILOS Seminar: Robust and Equitable Uncertainty Estimation

*Aaron Roth, Professor, University of Pennsylvania*

Machine learning provides us with an amazing set of tools to make predictions, but how much should we trust particular predictions? To answer this, we need a way of estimating the confidence we should have in particular predictions of black-box models. Standard tools for doing this give guarantees that are averages over predictions. For instance, in a medical application, such tools might paper over poor performance on one medically relevant demographic group if it is made up for by higher performance on another group. Standard methods also depend on the data distribution being static—in other words, the future should be like the past.

In this lecture, I will describe new techniques to address both these problems: a way to produce prediction sets for arbitrary black-box prediction methods that have correct empirical coverage even when the data distribution might change in arbitrary, unanticipated ways, and that retain correct coverage even when we zoom in to focus on demographic groups that can be arbitrary and intersecting. When we just want correct group-wise coverage and are willing to assume that the future will look like the past, our algorithms are especially simple.
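As background, and explicitly not the algorithm from the talk, the standard split conformal recipe shows what a coverage guarantee looks like when the future is assumed to resemble the past. The sketch below calibrates one threshold per group and assumes the groups are disjoint, which is a much simpler setting than the arbitrary, intersecting groups described above. The function names and the absolute-residual score are assumptions made for the example.

```python
import numpy as np

def split_conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    rank = int(np.ceil((n + 1) * (1 - alpha)))  # 1-indexed order statistic
    if rank > n:
        return np.inf  # too few calibration points for this coverage level
    return np.sort(cal_scores)[rank - 1]

def groupwise_thresholds(cal_scores, cal_groups, alpha):
    """One threshold per group label; assumes every point lies in exactly one group."""
    return {
        g.item(): split_conformal_threshold(cal_scores[cal_groups == g], alpha)
        for g in np.unique(cal_groups)
    }

# Hypothetical calibration data: scores are |y - y_hat| from some black-box model,
# and group 1 has noisier predictions than group 0.
rng = np.random.default_rng(0)
cal_groups = rng.integers(0, 2, size=1000)
cal_scores = np.abs(rng.normal(scale=np.where(cal_groups == 0, 1.0, 3.0)))

thresholds = groupwise_thresholds(cal_scores, cal_groups, alpha=0.1)
# A prediction interval for a new point in group g is
# [y_hat - thresholds[g], y_hat + thresholds[g]].
```

A single pooled threshold would give correct coverage only on average over both groups, which is the failure mode the abstract describes. Calibrating per group widens the intervals for the noisier group instead.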

This talk is based on two papers that are joint work with Osbert Bastani, Varun Gupta, Chris Jung, Georgy Noarov, and Ramya Ramalingam.

Aaron Roth is the Henry Salvatori Professor of Computer and Cognitive Science, in the Computer and Information Sciences department at the University of Pennsylvania, with a secondary appointment in the Wharton statistics department. He is affiliated with the Warren Center for Network and Data Science, and co-director of the Networked and Social Systems Engineering (NETS) program. He is also an Amazon Scholar at Amazon AWS. He is the recipient of a Presidential Early Career Award for Scientists and Engineers (PECASE) awarded by President Obama in 2016, an Alfred P. Sloan Research Fellowship, an NSF CAREER award, and research awards from Yahoo, Amazon, and Google. His research focuses on the algorithmic foundations of data privacy, algorithmic fairness, game theory, learning theory, and machine learning. Together with Cynthia Dwork, he is the author of the book “The Algorithmic Foundations of Differential Privacy.” Together with Michael Kearns, he is the author of “The Ethical Algorithm.”

# TILOS Seminar: On Policy Optimization Methods for Control

*Maryam Fazel, Professor, University of Washington*

Policy Optimization methods enjoy wide practical use in reinforcement learning (RL) for applications ranging from robotic manipulation to game-playing, partly because they are easy to implement and allow for richly parameterized policies. Yet their theoretical properties, from optimality to statistical complexity, are still not fully understood. To help develop a theoretical basis for these methods, and to bridge the gap between RL and control theoretic approaches, recent work has studied whether gradient-based policy optimization can succeed in designing feedback control policies.

In this talk, we start by showing the convergence and optimality of these methods for linear dynamical systems with quadratic costs, where, despite nonconvexity, convergence to the optimal policy occurs under mild assumptions. Next, we make a connection between convex parameterizations in control theory on the one hand and the Polyak-Łojasiewicz property of the nonconvex cost function on the other. Such a connection between the nonconvex and convex landscapes provides a unified view towards extending the results to more complex control problems.
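The simplest instance of this setting can be sketched in a few lines: a scalar linear system x_{t+1} = a x_t + b u_t with static feedback u_t = -k x_t and stage cost q x_t^2 + r u_t^2. The sketch runs plain gradient descent directly on the gain k and compares the result with the gain obtained from the Riccati recursion. The parameter values, step size, and function names are assumptions for illustration, and the example makes no claim about the general results in the talk.

```python
import math

def lqr_cost(k, a=1.0, b=1.0, q=1.0, r=1.0):
    """Infinite-horizon cost of u = -k x from x0 = 1; infinite if the loop is unstable."""
    c = a - b * k  # closed-loop dynamics x_{t+1} = c x_t
    if abs(c) >= 1.0:
        return math.inf
    return (q + r * k * k) / (1.0 - c * c)

def lqr_cost_grad(k, a=1.0, b=1.0, q=1.0, r=1.0):
    """Derivative of lqr_cost with respect to k (valid for stabilizing k)."""
    c = a - b * k
    denom = 1.0 - c * c
    # Quotient rule on (q + r k^2) / (1 - c^2), using dc/dk = -b.
    return (2.0 * r * k * denom - (q + r * k * k) * (2.0 * b * c)) / (denom * denom)

def riccati_gain(a=1.0, b=1.0, q=1.0, r=1.0, n_iters=200):
    """Optimal gain from iterating the scalar discrete-time Riccati recursion."""
    p = q
    for _ in range(n_iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)

# Gradient descent on the policy parameter, starting from a stabilizing gain.
k = 1.0
for _ in range(200):
    k -= 0.05 * lqr_cost_grad(k)

k_star = riccati_gain()
```

With a = b = q = r = 1, both procedures should arrive at the same gain, (sqrt(5) - 1) / 2. The starting gain must be stabilizing, since the cost is infinite outside the stabilizing set.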

Maryam Fazel is the Moorthy Family Professor of Electrical and Computer Engineering at the University of Washington, with adjunct appointments in Computer Science and Engineering, Mathematics, and Statistics. Maryam received her MS and PhD from Stanford University, and her BS from Sharif University of Technology in Iran, and was a postdoctoral scholar at Caltech before joining UW. She is a recipient of the NSF Career Award, UWEE Outstanding Teaching Award, and a UAI conference Best Student Paper Award with her student. She directs the Institute for Foundations of Data Science (IFDS), a multi-site NSF TRIPODS Institute. Her current research interests are in the area of optimization in machine learning and control.

# TILOS Seminar: Non-convex Optimization for Linear Quadratic Gaussian (LQG) Control

*Yang Zheng, Assistant Professor, UC San Diego*

Recent studies have started to apply machine learning techniques to the control of unknown dynamical systems. They have achieved impressive empirical results. However, the convergence behavior, statistical properties, and robustness performance of these approaches are often poorly understood due to the non-convex nature of the underlying control problems. In this talk, we revisit Linear Quadratic Gaussian (LQG) control and present recent progress towards its landscape analysis from a non-convex optimization perspective. We view the LQG cost as a function of the controller parameters and study its analytical and geometrical properties. Due to the inherent symmetry induced by similarity transformations, the LQG landscape is very rich yet complicated. We show that 1) the set of stabilizing controllers has at most two path-connected components, and 2) despite the nonconvexity, all minimal stationary points (controllable and observable controllers) are globally optimal. Based on the special non-convex optimization landscape, we further introduce a novel perturbed policy gradient (PGD) method to escape a large class of suboptimal stationary points (including high-order saddles). These results shed some light on the performance analysis of direct policy gradient methods for solving the LQG problem. The talk is based on our recent papers: https://arxiv.org/abs/2102.04393 and https://arxiv.org/abs/2204.00912.

Yang Zheng is an assistant professor in the ECE department at UC San Diego. Yang Zheng received the DPhil (Ph.D.) degree in Engineering Science from the University of Oxford in 2019. He received the B.E. and M.S. degrees from Tsinghua University in 2013 and 2015, respectively. From February 2019 to August 2020, he was a postdoctoral researcher at Harvard University. He was a research associate at Imperial College London in 2021.

Dr. Zheng’s research interests include learning, optimization, and control of network systems, and their applications to cyber-physical systems, autonomous vehicles, and traffic systems. His work has been acknowledged by several awards, including the 2019 European Ph.D. Award on Control for Complex and Heterogeneous Systems, the Best Student Paper Award Finalist at the 2019 European Control Conference, the Best Student Paper Award at the 17th IEEE International Conference on Intelligent Transportation Systems, and the Best Paper Award at the 14th Intelligent Transportation Systems Asia-Pacific Forum. He received the National Scholarship, Outstanding Graduate at Tsinghua University, the Clarendon Scholarship at the University of Oxford, and the Chinese Government Award for Outstanding Self-financed Students Abroad.

# TILOS Seminar: How to use Machine Learning for Combinatorial Optimization

*Sherief Reda, Professor, Brown University and Principal Research Scientist at Amazon*

Combinatorial optimization methods are routinely used in many scientific fields to identify optimal solutions among a large but finite set of possible solutions for problems of interest. Given the recent success of machine learning techniques in classification of natural signals (e.g., voice, image, text), it is natural to ask how machine learning methods can be used to improve the solution quality or the runtime of combinatorial optimization algorithms. In this talk I will provide a general taxonomy and research directions for the use of machine learning techniques in combinatorial optimization. I will illustrate these directions using a number of case studies from my group's research, which include (1) improving the quality of results of integer linear programming (ILP) solvers using deep metric learning, and (2) using reinforcement learning techniques to optimize the size of graphs arising in digital circuit design.

Sherief Reda is a Full Professor at the School of Engineering and Computer Science Department at Brown University and a Principal Research Scientist at Amazon. He joined Brown University in 2006 after receiving his Ph.D. in computer science and engineering from University of California, San Diego. He has over 135 research articles in the areas of energy-efficient computing, electronic design automation and combinatorial optimization, as well as several patents. Professor Reda received a number of research acknowledgments and awards, including eight best paper nominations, three best paper awards, and a National Science Foundation CAREER award. He has been a PI or co-PI on more than $21.1M of funded projects from federal agencies and industry corporations. He is a senior member of IEEE.

# TILOS Seminar: Reasoning Numerically

*Sicun Gao, Assistant Professor, UC San Diego*

Highly-nonlinear continuous functions have become a pervasive model of computation. Despite newsworthy progress, the practical success of “intelligent” computing is still restricted by our ability to answer questions regarding their quality and dependability: How do we rigorously know that a system will do exactly what we want it to do and nothing else? For traditional software and hardware systems that primarily use digital and rule-based designs, automated reasoning has provided the fundamental principles and widely-used tools for ensuring their quality in all stages of design and engineering. However, the rigid symbolic formulations of typical automated reasoning methods often make them unsuitable for dealing with computation units that are driven by numerical and data-driven approaches. I will overview some of our attempts in bridging this gap. I will highlight how the core challenge of NP-hardness is shared across discrete and continuous domains, and how it motivates us to seek the unification of symbolic, numerical, and statistical perspectives towards better understanding and handling of the curse of dimensionality.

Sicun Gao is an Assistant Professor in Computer Science and Engineering at the University of California, San Diego. He works on search and optimization algorithms for improving the quality of automation and autonomous systems. He is a recipient of the Air Force Young Investigator Award, Amazon Research Award, NSF Career Award, and Silver Medal for the Kurt Gödel Research Prize. He received his PhD from Carnegie Mellon University and was a postdoctoral researcher at CMU and MIT.

# TILOS Seminar: Deep Generative Models and Inverse Problems

*Alexandros G. Dimakis, Professor, The University of Texas at Austin*

Sparsity has given us MP3, JPEG, MPEG, Faster MRI and many fun mathematical problems. Deep generative models like GANs, VAEs, invertible flows and Score-based models are modern data-driven generalizations of sparse structure. We will start by presenting the CSGM framework by Bora et al. to solve inverse problems like denoising, filling missing data, and recovery from linear projections using an unsupervised method that relies on a pre-trained generator. We generalize compressed sensing theory beyond sparsity, extending Restricted Isometries to sets created by deep generative models. Our recent results include establishing theoretical results for Langevin sampling from full-dimensional generative models, generative models for MRI reconstruction and fairness guarantees for inverse problems.
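In the CSGM approach of Bora et al. mentioned above, recovery is posed as an optimization over the latent input of the pre-trained generator. The notation here is an assumption introduced for illustration: G is the generator, A the measurement matrix, y the observed measurements, and k the latent dimension.

```latex
\hat{z} \in \arg\min_{z \in \mathbb{R}^{k}} \; \lVert A\,G(z) - y \rVert_2^2,
\qquad
\hat{x} = G(\hat{z})
```

The objective is typically minimized with gradient-based methods by backpropagating through G, and the reconstruction is constrained to lie in the range of the generator in the same way that sparse recovery constrains it to the set of sparse vectors.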

Alexandros G. Dimakis is a Professor at the ECE department at UT Austin and the co-director of the National AI Institute on the Foundations of Machine Learning (IFML). He received his Ph.D. from UC Berkeley and the Diploma degree from the National Technical University of Athens. He received several awards including the James Massey Award, NSF Career, a Google research award, the UC Berkeley Eli Jury dissertation award and several best paper awards. He served as an Associate editor for IEEE Transactions on Information Theory and as an Area Chair for major Machine Learning conferences (NeurIPS, ICML, AAAI) and as the chair of the Technical Committee for MLSys 2021. He was selected as an IEEE Fellow for contributions to distributed coding and learning. His research interests include information theory, coding theory and machine learning.

# TILOS Seminar: Learning in the Presence of Distribution Shifts: How does the Geometry of Perturbations Play a Role?

*Hamed Hassani, Assistant Professor, University of Pennsylvania*

In this talk, we will focus on the emerging field of (adversarially) robust machine learning. The talk will be self-contained and no particular background on robust learning will be needed. Recent progress in this field has been accelerated by the observation that despite unprecedented performance on clean data, modern learning models remain fragile to seemingly innocuous changes such as small, norm-bounded additive perturbations. Moreover, recent work in this field has looked beyond norm-bounded perturbations and has revealed that various other types of distributional shifts in the data can significantly degrade performance. However, in general our understanding of such shifts is in its infancy and several key questions remain unaddressed.

The goal of this talk is to explain why robust learning paradigms have to be designed—and sometimes rethought—based on the geometry of the input perturbations. We will cover a wide range of perturbation geometries from simple norm-bounded perturbations, to sparse, natural, and more general distribution shifts. As we will show, the geometry of the perturbations necessitates fundamental modifications to the learning procedure as well as the architecture in order to ensure robustness. In the first part of the talk, we will discuss our recent theoretical results on robust learning with respect to various geometries, along with fundamental tradeoffs between robustness and accuracy, phase transitions, etc. The remaining portion of the talk will be about developing practical robust training algorithms and evaluating the resulting (robust) deep networks against state-of-the-art methods on naturally-varying, real-world datasets.

# TILOS Seminar: The Connections Between Discrete Geometric Mechanics, Information Geometry, Accelerated Optimization and Machine Learning

*Melvin Leok, Professor of Mathematics, UC San Diego*

Geometric mechanics describes Lagrangian and Hamiltonian mechanics geometrically, and information geometry formulates statistical estimation, inference, and machine learning in terms of geometry. A divergence function is an asymmetric distance between two probability densities that induces differential geometric structures and yields efficient machine learning algorithms that minimize the duality gap. The connection between information geometry and geometric mechanics will yield a unified treatment of machine learning and structure-preserving discretizations. In particular, the divergence function of information geometry can be viewed as a discrete Lagrangian, which is a generating function of a symplectic map, that arises in discrete variational mechanics. This identification allows the methods of backward error analysis to be applied, and the symplectic map generated by a divergence function can be associated with the exact time-h flow map of a Hamiltonian system on the space of probability distributions. We will also discuss how time-adaptive Hamiltonian variational integrators can be used to discretize the Bregman Hamiltonian, whose flow generalizes the differential equation that describes the dynamics of the Nesterov accelerated gradient descent method.
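For reference, the Nesterov accelerated gradient method mentioned at the end of the abstract can be sketched in its usual discrete form. This is the textbook iteration for smooth convex objectives, not the variational integrators from the talk. The quadratic test function, the step size, and the function name are assumptions for the example.

```python
import numpy as np

def nesterov_agd(grad, x0, step, n_iters):
    """Nesterov's accelerated gradient method with momentum (k - 1) / (k + 2)."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    for k in range(1, n_iters + 1):
        x = y - step * grad(y)                    # gradient step at the look-ahead point
        y = x + (k - 1) / (k + 2) * (x - x_prev)  # momentum extrapolation
        x_prev = x
    return x_prev

# Hypothetical ill-conditioned quadratic f(x) = 0.5 * x^T A x, minimized at x = 0.
A = np.diag([1.0, 10.0])

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

# step = 1 / L, where L = 10 is the largest eigenvalue of A.
x_final = nesterov_agd(grad_f, x0=[1.0, 1.0], step=0.1, n_iters=200)
```

In the small-step limit this iteration is known to track the second-order differential equation X'' + (3/t) X' + grad f(X) = 0, which is the kind of continuous-time dynamics the abstract refers to.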

Melvin Leok is professor of mathematics and co-director of the CSME graduate program at UC San Diego. His research interests are in computational geometric mechanics, computational geometric control theory, discrete geometry, and structure-preserving numerical schemes, and particularly how these subjects relate to systems with symmetry. He received his Ph.D. in 2004 from the California Institute of Technology in Control and Dynamical Systems under the direction of Jerrold Marsden. He is a three-time NAS Kavli Frontiers of Science Fellow, a Simons Fellow in Mathematics, and has received the DoD Newton Award for Transformative Ideas, the NSF Faculty Early Career Development (CAREER) award, the SciCADE New Talent Prize, the SIAM Student Paper Prize, and the Leslie Fox Prize (second prize) in Numerical Analysis. He has given plenary talks at Foundations of Computational Mathematics, NUMDIFF, and the IFAC Workshop on Lagrangian and Hamiltonian Methods for Nonlinear Control. He serves on the editorial boards of the Journal of Nonlinear Science, the Journal of Geometric Mechanics, and the Journal of Computational Dynamics, and has served on the editorial boards of the SIAM Journal on Control and Optimization, and the LMS Journal of Computation and Mathematics.