BEGIN:VCALENDAR
VERSION:2.0
PRODID:-// - ECPv6.15.18//NONSGML v1.0//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-ORIGINAL-URL:https://tilos.ai
X-WR-CALDESC:Events for TILOS
REFRESH-INTERVAL;VALUE=DURATION:PT1H
X-Robots-Tag:noindex
X-PUBLISHED-TTL:PT1H
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:20230312T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:20231105T020000
END:STANDARD
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:20240310T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:20241103T020000
END:STANDARD
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:20250309T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:20251102T020000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/Los_Angeles:20241120T110000
DTEND;TZID=America/Los_Angeles:20241120T120000
DTSTAMP:20260404T033335Z
CREATED:20250828T200101Z
LAST-MODIFIED:20250828T200101Z
UID:7294-1732100400-1732104000@tilos.ai
SUMMARY:TILOS Seminar: How Transformers Learn Causal Structure with Gradient Descent
DESCRIPTION:Jason Lee\, Princeton University \nAbstract: The incredible success of transformers on sequence modeling tasks can be largely attributed to the self-attention mechanism\, which allows information to be transferred between different parts of a sequence. Self-attention allows transformers to encode causal structure\, which makes them particularly suitable for sequence modeling. However\, the process by which transformers learn such causal structure via gradient-based training algorithms remains poorly understood. To better understand this process\, we introduce an in-context learning task that requires learning latent causal structure. We prove that gradient descent on a simplified two-layer transformer learns to solve this task by encoding the latent causal graph in the first attention layer. The key insight of our proof is that the gradient of the attention matrix encodes the mutual information between tokens. As a consequence of the data processing inequality\, the largest entries of this gradient correspond to edges in the latent causal graph. As a special case\, when the sequences are generated from in-context Markov chains\, we prove that transformers learn an induction head (Olsson et al.\, 2022). We confirm our theoretical findings by showing that transformers trained on our in-context learning task are able to recover a wide variety of causal structures. \n\nJason Lee is an associate professor in Electrical Engineering and Computer Science (secondary) at Princeton University. Prior to that\, he was in the Data Science and Operations department at the University of Southern California and a postdoctoral researcher at UC Berkeley working with Michael I. Jordan. Jason received his PhD at Stanford University advised by Trevor Hastie and Jonathan Taylor. His research interests are in the theory of machine learning\, optimization\, and statistics. Lately\, he has worked on the foundations of deep learning\, representation learning\, and reinforcement learning. He has received the Samsung AI Researcher of the Year Award\, NSF CAREER Award\, ONR Young Investigator Award in Mathematical Data Science\, Sloan Research Fellowship\, NeurIPS Best Student Paper Award and Finalist for the Best Paper Prize for Young Researchers in Continuous Optimization\, and Princeton Commendation for Outstanding Teaching.
URL:https://tilos.ai/event/tilos-seminar-how-transformers-learn-causal-structure-with-gradient-descent/
LOCATION:HDSI 123 and Virtual\, 3234 Matthews Ln\, La Jolla\, CA\, 92093\, United States
CATEGORIES:TILOS Seminar Series
ATTACH;FMTTYPE=image/jpeg:https://tilos.ai/wp-content/uploads/2025/08/lee-jason-e1727126682884-UcJAUD.jpg
END:VEVENT
END:VCALENDAR