Transformers Learn In-context by (Functional) Gradient Descent

Name: TILOS Seminar: Transformers Learn In-context by (Functional) Gradient Descent
Uploaded: 2024-04-18T12:00:13-07:00
Duration: 55 min 35 s

Xiang Cheng, TILOS Postdoctoral Scholar at MIT

Motivated by the in-context learning phenomenon, we investigate how the Transformer neural network can implement learning algorithms in its forward pass. We show that a linear Transformer naturally learns to implement gradient descent, which enables it to learn linear functions in-context. More generally, we show that a non-linear Transformer can implement functional gradient descent with respect to some RKHS metric, which allows it to learn a broad class of functions in-context. Additionally, we show that the RKHS metric is determined by the choice of attention activation, and that the optimal choice of attention activation depends in a natural way on the class of functions that need to be learned. I will end by discussing some implications of our results for the choice and design of Transformer architectures.
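
To make the abstract's central claim concrete, here is a minimal NumPy sketch of the widely used textbook-style construction in which a single attention layer reproduces one gradient descent step on in-context examples. It is an illustration under stated assumptions, not necessarily the construction used in the talk: the token layout `[x_i, y_i]`, the hand-set matrices `A` and `V`, the step size `eta`, the zero initialization, and all function names are hypothetical choices made for this example. With the identity activation the layer matches one gradient descent step on least squares; with a non-linear activation it matches one functional gradient descent step in the RKHS whose kernel is k(x, x') = activation(x . x').

```python
import numpy as np


def gd_step_prediction(X, y, x_query, eta):
    """Predict at x_query after one gradient descent step from w = 0
    on the least-squares loss (1 / 2n) * sum_i (w . x_i - y_i)^2."""
    n, d = X.shape
    w = np.zeros(d)
    grad = X.T @ (X @ w - y) / n
    w = w - eta * grad
    return w @ x_query


def kernel_fgd_prediction(X, y, x_query, eta, kernel):
    """Predict at x_query after one functional gradient descent step from f = 0.
    The RKHS gradient of the same loss is (1 / n) * sum_i (f(x_i) - y_i) k(x_i, .)."""
    n = X.shape[0]
    k = np.array([kernel(x_i, x_query) for x_i in X])
    return (eta / n) * (y @ k)


def attention_prediction(X, y, x_query, eta, activation=lambda s: s):
    """One attention layer with hand-set weights (assumed, not learned).
    Context tokens are [x_i, y_i]; the query token is [x_query, 0]."""
    n, d = X.shape
    Z = np.hstack([X, y[:, None]])            # (n, d + 1) context tokens
    z_q = np.concatenate([x_query, [0.0]])    # query token, label slot empty
    A = np.zeros((d + 1, d + 1))
    A[:d, :d] = np.eye(d)                     # key-query product compares x parts only
    V = np.zeros((d + 1, d + 1))
    V[d, d] = eta / n                         # value map writes scaled labels to the label slot
    scores = activation(Z @ A @ z_q)          # activation(x_i . x_query) for each i
    update = V @ (Z.T @ scores)               # sum_i scores_i * V z_i
    return (z_q + update)[d]                  # residual connection, read the label slot


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 4))
    y = X @ rng.normal(size=4)
    x_query = rng.normal(size=4)
    print(gd_step_prediction(X, y, x_query, eta=0.1))
    print(attention_prediction(X, y, x_query, eta=0.1))
```

Passing `activation=np.exp` to `attention_prediction` gives the same value as `kernel_fgd_prediction` with the kernel `lambda a, b: np.exp(a @ b)`, which illustrates in the simplest setting how the choice of attention activation fixes the kernel, and hence the function class the layer is suited to.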


Tags: machine learning architectures, neural networks, transformers