Nonconvex Optimization and Transformer Architectures

Deciding whether saddle points exist or can be approximated for nonconvex-nonconcave problems is in general intractable. Zhang, Zhang & Sra [SIAM Journal on Optimization 2023] takes a step toward understanding a broad class of nonconvex-nonconcave minimax problems that do remain tractable. Specifically, it studies minimax problems in geodesic metric spaces. The first main result of the paper is a geodesic metric space version of Sion’s minimax theorem; the authors’ proof is novel and broadly accessible, as it relies only on the finite intersection property. The second main result is a specialization to geodesically complete Riemannian manifolds, for which the authors analyze first-order methods for smooth minimax problems.
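To make the first result concrete, the following is a schematic statement of the kind of Sion-type identity involved; the precise hypotheses (geodesic convexity/concavity and semicontinuity in each variable, compactness of one factor) are those specified in the paper, and the display below is only a sketch.

```latex
% Schematic Sion-type minimax identity on geodesic metric spaces X and Y.
% Hypotheses are hedged placeholders: f(., y) geodesically quasiconvex and lower
% semicontinuous, f(x, .) geodesically quasiconcave and upper semicontinuous,
% and one of the two factors compact, as in the paper.
\[
  \inf_{x \in X} \, \sup_{y \in Y} f(x, y)
  \;=\;
  \sup_{y \in Y} \, \inf_{x \in X} f(x, y).
\]
```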

Weber & Sra (2023) studies algorithms for efficiently computing Brascamp–Lieb constants, a task that has recently received much interest. In particular, it reduces the computation to a nonlinear matrix-valued iteration, whose convergence is analyzed through the lens of fixed-point methods under the Thompson metric. This approach permits the authors to obtain (weakly) polynomial time guarantees, and it offers an efficient and transparent alternative to previous state-of-the-art approaches based on Riemannian optimization and geodesic convexity.

Motivated by the striking ability of transformers to learn in context, several works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate gradient descent iterations. Going beyond the question of expressivity, Ahn et al. [NeurIPS 2023] asks: “Can Transformers learn to implement such algorithms by training over random problem instances?” This work makes the first theoretical progress toward this question via an analysis of the loss landscape for linear transformers trained over random instances of linear regression. For a single attention layer, it proves that the global minimum of the training objective implements a single iteration of preconditioned gradient descent. Notably, the preconditioning matrix adapts not only to the input distribution but also to the variance induced by data inadequacy. For a transformer with k attention layers, the work proves that certain critical points of the training objective implement k iterations of preconditioned gradient descent. The results call for future theoretical studies on learning algorithms by training transformers.
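As a numerical illustration of the kind of equivalence involved, the sketch below checks that one linear-attention head with a simple, illustrative weight layout (not the exact parameterization analyzed by Ahn et al.) reproduces a single preconditioned gradient descent step on an in-context linear regression prompt; the preconditioner A, the dimensions, and the token format are assumptions made for the demo.

```python
"""Illustrative sketch: one linear-attention head vs. one preconditioned GD step
on an in-context linear regression prompt. The weight construction is a simple
choice for demonstration, not the parameterization studied in the paper."""
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 50

# In-context linear regression instance: y_i = <w_star, x_i>, plus a query x_q.
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_star
x_q = rng.standard_normal(d)

# Any symmetric positive definite preconditioner A (here a random SPD matrix).
G = rng.standard_normal((d, d))
A = G @ G.T + np.eye(d)

# (1) One preconditioned GD step from w_0 = 0 on L(w) = 1/(2n) * sum_i (<w, x_i> - y_i)^2:
#     w_1 = w_0 - A @ grad L(w_0) = A @ (X.T @ y) / n
w_1 = A @ (X.T @ y) / n
pred_gd = x_q @ w_1

# (2) The same prediction from a single linear attention head on tokens z_i = [x_i, y_i]
#     and query token z_q = [x_q, 0], with weights that encode A.
Z = np.hstack([X, y[:, None]])          # context tokens, shape (n, d+1)
z_q = np.concatenate([x_q, [0.0]])      # query token

W_Q = np.zeros((d + 1, d + 1))
W_Q[:d, :d] = np.eye(d)                 # query projection keeps x_q
W_K = np.zeros((d + 1, d + 1))
W_K[:d, :d] = A                         # key projection applies the preconditioner A
W_V = np.zeros((d + 1, d + 1))
W_V[d, d] = 1.0                         # value projection keeps only the label y_i

scores = (Z @ W_K.T) @ (W_Q @ z_q)      # s_i = x_i^T A x_q   (A symmetric)
out_q = (scores @ (Z @ W_V.T)) / n      # (1/n) * sum_i s_i * (W_V z_i)
pred_attn = out_q[d]                    # read the prediction off the y-coordinate

print(pred_gd, pred_attn)               # the two predictions coincide
assert np.isclose(pred_gd, pred_attn)
```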

Many neural network architectures have been shown to be Turing complete and can thus implement arbitrary algorithms. However, Transformers are unique in that they can implement gradient-based learning algorithms under simple parameter configurations. A line of recent work shows that linear Transformers naturally learn to implement gradient descent (GD) when trained on a linear regression in-context learning task. But the linearity assumption (either in the Transformer architecture or in the learning task) is far from realistic settings, where nonlinear activations crucially enable Transformers to learn complicated nonlinear functions. In Cheng, Chen & Sra [ICML 2024], the authors provide theoretical and empirical evidence that nonlinear Transformers can, and in fact do, learn to implement learning algorithms for nonlinear functions in context. The results apply to a broad class of combinations of nonlinear architectures and nonlinear in-context learning tasks. Interestingly, the authors show that the optimal choice of nonlinear activation depends in a natural way on the nonlinearity of the learning task.
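In the same spirit as the previous sketch, and only as a rough illustration: an attention head whose score nonlinearity acts as a kernel can reproduce one step of functional (kernel-space) gradient descent. The exponential kernel, step size, and token handling below are assumptions for the demo, not the construction used in the paper.

```python
"""Illustrative sketch: a single attention head with a nonlinear score activation
computing one step of functional gradient descent in the RKHS of the induced kernel.
Kernel choice, step size, and target function are illustrative assumptions."""
import numpy as np

rng = np.random.default_rng(1)
d, n, eta = 3, 40, 0.5

# Nonlinear in-context task: y_i = g(x_i) for a nonlinear target g, plus a query x_q.
g = lambda x: np.sin(x.sum())
X = rng.standard_normal((n, d))
y = np.array([g(x) for x in X])
x_q = rng.standard_normal(d)

# Kernel induced by the attention nonlinearity: k(x, x') = exp(<x, x'>).
k = lambda u, v: np.exp(u @ v)

# (1) One functional GD step from f_0 = 0 on L(f) = 1/(2n) * sum_i (f(x_i) - y_i)^2
#     in the RKHS of k:  f_1(.) = (eta/n) * sum_i y_i * k(x_i, .)
pred_fgd = (eta / n) * sum(y_i * k(x_i, x_q) for x_i, y_i in zip(X, y))

# (2) The same prediction from one attention head: exponential activation applied to
#     raw scores <x_i, x_q>, values carrying y_i, no softmax normalization.
scores = np.exp(X @ x_q)                # sigma(<x_i, x_q>) = k(x_i, x_q)
pred_attn = (eta / n) * (scores @ y)    # (eta/n) * sum_i k(x_i, x_q) * y_i

print(pred_fgd, pred_attn)              # the two predictions coincide
assert np.isclose(pred_fgd, pred_attn)
```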

Team Members

Stefanie Jegelka1
Melvin Leok2
Arya Mazumdar2
Suvrit Sra1
Nisheeth Vishnoi3
Yusu Wang2

Collaborators

Melanie Weber4
Jingzhao Zhang5

1. MIT
2. UC San Diego
3. Yale
4. Harvard University
5. Tsinghua University

Publications

Zhang, Zhang & Sra, SIAM Journal on Optimization, 2023
Weber & Sra, preprint, 2023
Ahn et al., NeurIPS 2023
Cheng, Chen & Sra, ICML 2024