How faithful and trustworthy are neuron explanations in mechanistic interpretability?

Understanding what individual units in a neural network represent is a cornerstone of mechanistic interpretability. A common approach is to generate a human-friendly text explanation for each neuron describing its functionality—but how can we trust that these explanations are faithful reflections of the model’s actual behavior? In work published at the 2025 International Conference on […]


Unpacking the bias of large language models

MIT News || A team of MIT researchers, including TILOS Foundations team member and Associate Professor Stefanie Jegelka and postdoctoral scholar Yifei Wang, has developed a theoretical framework to study how information flows through the machine learning (ML) architecture that forms the backbone of large language models (LLMs). Their work has uncovered the root cause of “position bias” […]


Opinion: We Can’t Regulate Our Way to Crypto Leadership. We Still Need Science

CoinDesk || National Science Foundation funding cuts threaten to devastate U.S. crypto research, say 10 leading professors, including TILOS Chips team co-lead Farinaz Koushanfar, Nemat-Nasser Endowed Chair Professor of Electrical and Computer Engineering at UC San Diego and founding co-director of the UC San Diego Center for Machine Intelligence, Computing, and Security (MICS).
