How faithful and trustworthy are neuron explanations in mechanistic interpretability?
Understanding what individual units in a neural network represent is a cornerstone of mechanistic interpretability. A common approach is to generate a human-friendly text explanation for each neuron describing its function—but how can we trust that these explanations are faithful reflections of the model's actual behavior?