Transformers Learn Generalizable Chain-of-Thought Reasoning via Gradient Descent

Yuejie Chi, Yale University

Transformers have demonstrated remarkable chain-of-thought reasoning capabilities, yet the mechanisms by which they acquire and extrapolate these capabilities remain poorly understood. This talk presents a theoretical analysis of transformers trained via gradient descent on symbolic reasoning and state tracking tasks of increasing problem complexity. Our analysis reveals how multi-head attention coordinates to solve multiple subtasks within a single autoregressive pass, and how inherently sequential reasoning is bootstrapped through a recursive self-training curriculum. Our optimization-based guarantees demonstrate that even shallow multi-head transformers, when equipped with chain-of-thought, can be trained to effectively solve problems that would otherwise require deeper architectures.
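To give a flavor of the kind of setup the abstract describes, the following is a minimal illustrative sketch, not the talk's actual construction: a one-layer multi-head transformer trained by gradient descent on a parity state-tracking task, where the chain-of-thought targets (the running prefix parities) externalize the sequential state. All names here (ShallowCoTTransformer, the parity task, hyperparameters) are hypothetical choices for illustration.

```python
# Hypothetical sketch: shallow transformer + chain-of-thought supervision
# on parity state tracking. The model reads a bit string, then
# autoregressively emits each running prefix parity; the intermediate
# CoT steps let a single attention layer track state that would
# otherwise require depth growing with sequence length.
import torch
import torch.nn as nn

SEQ_LEN, D_MODEL, N_HEADS, VOCAB = 16, 64, 4, 2

class ShallowCoTTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(2 * SEQ_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            D_MODEL, N_HEADS, dim_feedforward=4 * D_MODEL, batch_first=True
        )
        self.block = nn.TransformerEncoder(layer, num_layers=1)  # one layer only
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):
        T = tokens.size(1)
        x = self.embed(tokens) + self.pos.weight[:T]
        mask = nn.Transformer.generate_square_subsequent_mask(T)  # causal mask
        return self.head(self.block(x, mask=mask))

def batch(bsz):
    bits = torch.randint(0, 2, (bsz, SEQ_LEN))
    parity = bits.cumsum(dim=1) % 2          # chain-of-thought targets
    seq = torch.cat([bits, parity], dim=1)   # input bits, then CoT steps
    return seq[:, :-1], seq[:, 1:]           # shifted for next-token prediction

model = ShallowCoTTransformer()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()
for step in range(2000):
    inp, tgt = batch(64)
    logits = model(inp)
    # supervise only the chain-of-thought positions (second half of the sequence)
    loss = loss_fn(logits[:, SEQ_LEN - 1:].reshape(-1, VOCAB),
                   tgt[:, SEQ_LEN - 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

The design choice the sketch illustrates: each emitted parity token summarizes the state so far, so the next step only needs to attend to the previous CoT token and the next input bit, which is within reach of a shallow architecture.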


Yuejie Chi is the Charles C. and Dorothea S. Dilley Professor of Statistics and Data Science at Yale University, with a secondary appointment in Computer Science, and a member of the Yale Institute for Foundations of Data Science. Before joining Yale, Dr. Chi was the Sense of Wonder Group Endowed Professor of Electrical and Computer Engineering in AI Systems at Carnegie Mellon University, with affiliations in the Machine Learning Department (MLD) and CyLab. She also spent time as a visiting researcher at Meta’s Fundamental AI Research (FAIR). Dr. Chi’s research interests lie in the theoretical and algorithmic foundations of data science, generative AI, reinforcement learning, and signal processing, motivated by applications in scientific and engineering domains. Her current focus is on improving the performance, efficiency, and reliability of generative AI and decision making in data-intensive but resource-constrained scenarios.

