TILOS Seminar: Transformers learn in-context by (functional) gradient descent

Xiang Cheng, TILOS Postdoctoral Scholar at MIT

Abstract: Motivated by the in-context learning phenomenon, we investigate how the Transformer neural network can implement learning algorithms in its forward pass. We show that a linear Transformer naturally learns to implement gradient descent, which enables it to learn linear functions in-context. More generally, we show that a non-linear Transformer can implement functional gradient descent with respect to a reproducing kernel Hilbert space (RKHS) metric, which allows it to learn a broad class of functions in-context. Additionally, we show that the RKHS metric is determined by the choice of attention activation, and that the optimal choice of attention activation depends in a natural way on the class of functions that need to be learned. I will end by discussing some implications of our results for the choice and design of Transformer architectures.
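
To make the abstract's claim concrete, the following is a minimal numerical sketch (illustrative, not code from the talk): a single attention "read-out" over in-context examples matches one (functional) gradient-descent step, and the choice of attention activation plays the role of the kernel. The variable names (eta, gamma, n, d) and the identity key/query weights are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 16, 4, 0.1                  # context length, input dim, GD step size

X = rng.normal(size=(n, d))             # in-context inputs x_1..x_n
y = X @ rng.normal(size=d)              # labels from a hidden linear function
x_q = rng.normal(size=d)                # query point

# Linear case: one gradient-descent step from w = 0 on
# L(w) = 0.5 * sum_i (w^T x_i - y_i)^2 gives w1 = eta * sum_i y_i x_i.
w1 = eta * (X.T @ y)                    # w1 = -eta * grad L(0)
pred_gd = w1 @ x_q

# Linear attention with key/query product eta * I makes the same prediction:
# sum_i (x_q^T (eta * I) x_i) * y_i.
pred_linear_attn = eta * (X @ x_q) @ y
assert np.allclose(pred_gd, pred_linear_attn)

# Kernel case: one functional-GD step from f = 0 in the RKHS of kernel k gives
# f_1(x) = eta * sum_i y_i k(x_i, x); kernel attention computes exactly this,
# so the attention activation determines the RKHS metric.
gamma = 0.5
k = np.exp(-gamma * np.sum((X - x_q) ** 2, axis=1))   # RBF attention scores
pred_kernel_attn = eta * k @ y

print(pred_gd, pred_linear_attn, pred_kernel_attn)
```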


Date

Apr 17 2024

Time

Pacific Daylight Time
10:00 AM - 11:00 AM


Location

Halıcıoğlu Data Science Building Room 123
3234 Matthews Ln, La Jolla, CA 92093

Organizer

TILOS

Speaker