Nonlinear Feature Learning in Neural Networks
Learning nonlinear features from data is thought to be one of the fundamental reasons for the success of deep neural networks; this has been observed across a wide range of domains, including computer vision and natural language processing. Among the many theoretical approaches to studying neural networks, much work has focused on two-layer fully-connected networks with a randomly generated, untrained first layer and a trained second layer, known as random features models. However, feature learning is absent in random features models, because the first-layer weights are randomly generated and then kept fixed. Random features models therefore fall short of providing a comprehensive explanation for the success of deep learning.
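The following is a minimal sketch of a random features model of the kind described above, not the paper's code: the first-layer weights W are drawn at random and never updated, and only the second layer is fit, here with ridge regression. The dimensions, the ReLU activation, the toy target, and the regularization strength lam are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, p = 1000, 20, 500          # samples, input dimension, number of random features
lam = 1e-2                       # ridge regularization strength (illustrative value)

X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)   # toy nonlinear target

W = rng.standard_normal((p, d)) / np.sqrt(d)         # random, untrained first layer
F = np.maximum(X @ W.T, 0.0)                         # ReLU random features, shape (n, p)

# Ridge regression on the fixed features: a = (F^T F + lam * I)^{-1} F^T y
a = np.linalg.solve(F.T @ F + lam * np.eye(p), F.T @ y)

def predict(X_new):
    """Predict with the fixed random features and the fitted second layer."""
    return np.maximum(X_new @ W.T, 0.0) @ a
```

Because W never moves, the features are determined entirely by the random initialization, which is precisely why such models cannot capture feature learning.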
To obtain a theoretical model that does capture feature learning, in [Moniri et al., ICML 2024] we consider a two-layer neural network in which the first layer is trained with a single gradient step and the second layer is trained using ridge regression. For the update of the first layer, we assume a step size that grows polynomially with the sample size. We present a spectral analysis of the updated feature matrix and show that its spectrum undergoes phase transitions depending on the regime of the step size. In particular, we show that this one-step update causes polynomial features of the input of finite degree to be learned, and that the degree of these polynomials depends on the step size. We then fully characterize the asymptotics of the training and test errors under the different step-size regimes and show that the one-step updated neural network outperforms linear and kernel methods.
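Below is a minimal sketch, under illustrative assumptions, of the two-stage procedure described above: take one gradient step on the first-layer weights W with a large step size eta that grows polynomially with the sample size, then refit the second layer on the updated features with ridge regression. The squared loss, ReLU activation, initialization scales, step-size exponent, and regularization value are placeholders, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d, p = 1000, 20, 500          # samples, input dimension, width
lam = 1e-2                       # ridge regularization (illustrative)
eta = n ** 0.5                   # step size growing polynomially with n (illustrative exponent)

X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)   # toy nonlinear target

W = rng.standard_normal((p, d)) / np.sqrt(d)         # random initial first layer
a = rng.standard_normal(p) / np.sqrt(p)              # random initial second layer

def relu(z):
    return np.maximum(z, 0.0)

# One gradient step on W for the loss (1/2n) * ||relu(X W^T) a - y||^2,
# with the second layer held at its initialization.
F0 = relu(X @ W.T)                                   # initial feature matrix, shape (n, p)
resid = F0 @ a - y                                   # residuals, shape (n,)
grad_W = ((resid[:, None] * (F0 > 0)) * a).T @ X / n # gradient w.r.t. W, shape (p, d)
W1 = W - eta * grad_W                                # one-step updated first layer

# Refit the second layer by ridge regression on the updated feature matrix.
F1 = relu(X @ W1.T)
a1 = np.linalg.solve(F1.T @ F1 + lam * np.eye(p), F1.T @ y)
```

In this sketch the exponent chosen for eta controls how far the feature matrix F1 moves from the random-features regime; in the paper, the scaling of this step size determines the phase of the feature spectrum and the degree of the polynomial features that are learned.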
Team Members
Hamed Hassani¹
Stefanie Jegelka²
Amin Karbasi³
Yusu Wang⁴
Collaborators
Edgar Dobriban¹
1. University of Pennsylvania
2. MIT
3. Yale University
4. UC San Diego