Nonconvex Optimization in Deep Learning

Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based optimizer (Foret et al., ICLR 2021) that greatly improves the prediction performance of deep neural networks. Consequently, there has been a surge of interest in explaining its empirical success. In their work on the crucial role of normalization in sharpness-aware minimization, Suvrit Sra’s team studies normalization, a key component of the SAM update. They theoretically and empirically analyze the effect of normalization in SAM for both convex and nonconvex functions, revealing two key roles it plays: (i) it helps stabilize the algorithm; and (ii) it enables the algorithm to drift along a continuum (manifold) of minima, a property that recent theoretical works identify as the key to better performance. They further argue that these two properties of normalization make SAM robust to the choice of hyperparameters, supporting its practicality.
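For reference, the SAM update analyzed in this line of work can be written as (notation is ours)

\[
x_{t+1} \;=\; x_t \;-\; \eta\,\nabla f\!\left(x_t + \rho\,\frac{\nabla f(x_t)}{\|\nabla f(x_t)\|}\right),
\]

where \(\eta\) is the step size and \(\rho\) is the perturbation radius; the unnormalized variant used as a point of comparison simply drops the factor \(\|\nabla f(x_t)\|\) in the denominator.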

Figure 1. Role of normalization for stabilizing algorithms (η = 0.05).

Ahn, Jadbabaie & Sra (ICML 2024) undertake a formal study that (i) formulates a notion of flat minima and (ii) studies the complexity of finding them. Specifically, they adopt the trace of the Hessian of the cost function as a measure of flatness and use it to formally define approximate flat minima. Under this notion, they analyze algorithms that find approximate flat minima efficiently. For general cost functions, they propose a gradient-based algorithm that efficiently finds an approximate flat local minimum; its main component uses gradients computed at randomly perturbed iterates to estimate a direction that leads to flatter minima. For the setting where the cost function is an empirical risk over training data, they present a faster algorithm inspired by sharpness-aware minimization (SAM), further supporting SAM's success in practice.
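To illustrate the perturbed-gradient idea, the sketch below (a minimal NumPy example, not the authors' algorithm) estimates the trace of the Hessian, i.e., the flatness measure above, using only gradients evaluated at randomly perturbed points:

```python
import numpy as np

def hessian_trace_estimate(grad_f, x, sigma=1e-3, num_samples=100, rng=None):
    """Hutchinson-style estimate of tr(Hessian of f at x) using only gradients
    evaluated at randomly perturbed points (illustrative sketch only)."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    total = 0.0
    for _ in range(num_samples):
        u = rng.standard_normal(d)
        # Finite-difference Hessian-vector product: H u ~ (grad f(x + s u) - grad f(x)) / s
        hvp = (grad_f(x + sigma * u) - grad_f(x)) / sigma
        total += u @ hvp  # E[u^T H u] = tr(H) for u ~ N(0, I)
    return total / num_samples

# Toy check on f(x) = 0.5 x^T A x, whose Hessian is A (tr(A) = 6).
A = np.diag([1.0, 2.0, 3.0])
print(hessian_trace_estimate(lambda x: A @ x, np.zeros(3)))
```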

Stefanie Jegelka’s team and external collaborators analyze the remarkable generalization ability of neural networks, which is usually attributed to the implicit bias of stochastic gradient descent (SGD) toward models of lower complexity that use simpler (e.g., linear) and low-rank features (Huh et al., 2021). Recent works have provided empirical and theoretical evidence that particular variants of SGD (such as label noise SGD) are biased toward flatter regions of the loss landscape. Despite the folklore intuition that flat solutions are ‘simple’, the connection with the simplicity of the final trained model (e.g., low rank) is not well understood. They take a step toward bridging this gap by studying the simplicity structure that arises from minimizers of the sharpness for a class of two-layer neural networks. Jegelka’s team shows that, for any high-dimensional training data and certain activations, with a sufficiently small step size, label noise SGD always converges to a network that replicates a single linear feature across all neurons, thereby implying a simple rank-one feature matrix. The main technical contribution behind this result is to show that, for two-layer networks, label noise SGD always minimizes the sharpness on the manifold of models with zero loss. Along the way, they discover a novel property, a local geodesic convexity of the trace of the Hessian of the loss at approximate stationary points on the manifold of zero loss, which links sharpness to the geometry of the manifold.
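For concreteness, a minimal sketch of a single label-noise SGD step for a two-layer network is given below; the squared loss, tanh activation, Gaussian label noise, and parameterization are illustrative assumptions rather than the exact setting of the paper.

```python
import numpy as np

def label_noise_sgd_step(W, a, x, y, lr=1e-2, noise_std=0.1, rng=None):
    """One label-noise SGD step for a two-layer network f(x) = a^T tanh(W x)
    with squared loss (illustrative sketch; assumptions noted in the text)."""
    rng = np.random.default_rng() if rng is None else rng
    y_noisy = y + noise_std * rng.standard_normal()   # perturb the label
    h = np.tanh(W @ x)                                # hidden-layer features
    resid = a @ h - y_noisy                           # prediction error on noisy label
    grad_a = resid * h
    grad_W = resid * np.outer(a * (1.0 - h**2), x)    # chain rule through tanh
    return W - lr * grad_W, a - lr * grad_a
```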

In related work (Gatmiry et al., NeurIPS 2023), Jegelka’s team takes a first step toward understanding the inductive bias of solutions that minimize the trace of the Hessian in an important setting: learning deep linear networks from linear measurements, also known as deep matrix factorization. They show that under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters (i.e., the product of all layer matrices), which in turn leads to better generalization.
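In symbols, writing \(W = W_L W_{L-1} \cdots W_1\) for the end-to-end matrix of a depth-\(L\) linear network, the result roughly says that, under RIP, minimizing \(\operatorname{tr}\big(\nabla^2 L(W_1,\dots,W_L)\big)\) over interpolating solutions is approximately equivalent to minimizing the Schatten 1-norm (nuclear norm) \(\|W\|_{S_1} = \sum_i \sigma_i(W)\) of the end-to-end matrix; the precise approximation factors are given in the paper.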

Vishnoi & Mangoubi (COLT 2023) address the challenge of approximating a covariance matrix M by the “best” matrix in the nonconvex space of rank-k matrices under a differential privacy constraint. The authors present and analyze a complex variant of the Gaussian mechanism, providing an optimal bound on the Frobenius-norm difference between the output of this mechanism and the best rank-k approximation of M. The analysis leverages the stronger eigenvalue repulsion in complex matrix Brownian motion compared to the real case, using Dyson’s stochastic differential equations to show that the eigenvalues of a matrix perturbed by complex Gaussian noise have large gaps with high probability. These results enhance the understanding of low-rank approximations under average-case perturbations and contribute to the study of eigenvalue gaps in random matrices.
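A minimal NumPy sketch of the mechanism's structure is below: perturb M with Hermitian complex Gaussian noise and truncate to rank k. The noise scale required for a given privacy guarantee and the exact output format are assumptions left abstract here and should be taken from the paper.

```python
import numpy as np

def complex_gaussian_rank_k(M, k, noise_scale, rng=None):
    """Illustrative sketch: add Hermitian complex Gaussian noise to a symmetric
    matrix M and return the rank-k truncation of the perturbed matrix.
    noise_scale is a placeholder; calibrating it to (epsilon, delta)-DP is
    the subject of the cited analysis."""
    rng = np.random.default_rng() if rng is None else rng
    d = M.shape[0]
    G = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    N = (G + G.conj().T) / 2                  # Hermitian complex Gaussian noise
    vals, vecs = np.linalg.eigh(M + noise_scale * N)
    idx = np.argsort(vals)[::-1][:k]          # top-k eigenpairs
    return (vecs[:, idx] * vals[idx]) @ vecs[:, idx].conj().T
```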

In Zhang et al. (NeurIPS 2023), Amin Karbasi’s team examines the trade-off between convergence rate (gradient complexity) and reproducibility. They challenge the perception that such a trade-off is necessary and demonstrate that both optimal reproducibility and near-optimal convergence guarantees can be achieved for smooth convex minimization and smooth convex-concave minimax problems under various error-prone oracle settings. Specifically, with an inexact initialization oracle, regularization-based algorithms achieve both optimal reproducibility and near-optimal gradient complexity for minimization and minimax optimization. With an inexact gradient oracle, near-optimal guarantees are also maintained for minimax optimization. Additionally, with a stochastic gradient oracle, they show that stochastic gradient descent ascent is optimal in terms of both reproducibility and gradient complexity.

In Zhu et al. (ICML 2024), Mikhail (Misha) Belkin’s team analyzes some remarkable connections between the non-linear “catapult” dynamics observed in neural networks and certain empirical phenomena. The authors first explain the common occurrence of spikes in the training loss when neural networks are trained with stochastic gradient descent (SGD), providing evidence that these spikes are indeed "catapults", an optimization phenomenon originally observed in GD with large learning rates by Lewkowycz et al. (2020). They show empirically that these catapults occur in a low-dimensional subspace spanned by the top eigenvectors of the tangent kernel, for both GD and SGD. Furthermore, they explain how catapults lead to better generalization by demonstrating that catapults promote feature learning: they increase alignment with the Average Gradient Outer Product (AGOP) of the true predictor. Significantly, they demonstrate that a smaller batch size in SGD induces a larger number of catapults, thereby improving AGOP alignment and test performance.
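For reference, the AGOP of a scalar-valued predictor \(f\) over inputs \(x_1,\dots,x_n\) is \(\tfrac{1}{n}\sum_i \nabla_x f(x_i)\,\nabla_x f(x_i)^{\top}\). The sketch below computes it and one common (Frobenius-cosine) notion of alignment, which may differ in detail from the metric used in the paper:

```python
import numpy as np

def agop(grad_fn, X):
    """Average Gradient Outer Product of a scalar-valued predictor:
    (1/n) * sum_i grad_x f(x_i) grad_x f(x_i)^T, where grad_fn(x) returns
    the input gradient of the predictor at x."""
    d = X.shape[1]
    G = np.zeros((d, d))
    for x in X:
        g = grad_fn(x)
        G += np.outer(g, g)
    return G / X.shape[0]

def agop_alignment(G_model, G_true):
    """Cosine similarity of two AGOP matrices under the Frobenius inner product."""
    return float(np.sum(G_model * G_true) /
                 (np.linalg.norm(G_model) * np.linalg.norm(G_true)))
```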