Mechanistic interpretability
Mechanistic interpretability (often abbreviated as mech interp, mechinterp, or MI) is a subfield of research within explainable artificial intelligence that aims to understand the internal workings of neural networks by analyzing the mechanisms present in their computations.[1] The approach seeks to reverse-engineer neural networks in a manner similar to how computer programs are analyzed to understand their functions.

History
The term mechanistic interpretability was coined by Chris Olah.[2] Early work combined techniques such as feature visualization, dimensionality reduction, and attribution with human-computer interaction methods to analyze models like the vision model Inception v1.[3] Later developments include the 2020 paper Zoom In: An Introduction to Circuits, which proposed an analogy between neural network components and biological neural circuits.[4] In recent years, mechanistic interpretability has gained prominence with the study of large language models (LLMs) and transformer architectures. The field has expanded rapidly, with dedicated workshops such as the ICML 2024 Mechanistic Interpretability Workshop.[5]

Key concepts
Mechanistic interpretability posits that neural networks implement their computations through independent, reverse-engineerable mechanisms encoded in their weights and activations. This contrasts with earlier interpretability methods that focused primarily on input-output explanations, such as saliency maps.[6] Multiple definitions of the term exist, ranging from narrow technical definitions (the study of causal mechanisms inside neural networks) to broader cultural definitions that encompass much of AI interpretability research.[2]

Linear representation hypothesis
This hypothesis suggests that high-level concepts are represented as linear directions in the activation space of neural networks. Empirical evidence from word embeddings and more recent studies supports this view, although it does not hold universally.[7][8]

Superposition
Superposition describes how neural networks may represent many unrelated features within the same neurons or subspaces, leading to densely packed and overlapping feature representations.[9]

Methods

Probing
Probing involves training simple classifiers on a neural network's activations to test whether particular features are encoded in them.[10]

Causal interventions
Mechanistic interpretability employs causal methods to understand how internal model components influence outputs, often using formal tools from causality theory.[11]

Sparse decomposition
Methods such as sparse dictionary learning and sparse autoencoders help disentangle complex, overlapping features by learning interpretable, sparse representations.[12]

Applications and significance
Mechanistic interpretability plays an important role in AI safety, where it is used to understand and verify the behavior of increasingly complex AI systems. It helps identify potential risks and improves transparency.[13]
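As an illustration of the linear representation hypothesis described above, the following minimal sketch estimates a concept direction as a difference of class means and scores activations by projecting onto it. All vectors are synthetic toy data, not taken from any real model.

# Minimal sketch of the linear representation hypothesis: a concept is modelled
# as a direction in activation space. All values here are synthetic toy data.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # toy activation dimension

# Hypothetical activations: one class is shifted along a hidden concept direction.
singular = rng.normal(size=(20, d))
true_direction = rng.normal(size=d)      # stand-in for the concept's true direction
plural = rng.normal(size=(20, d)) + true_direction

# Estimate the concept direction as a difference of class means,
# then score activations by projecting onto it.
direction = plural.mean(axis=0) - singular.mean(axis=0)
direction /= np.linalg.norm(direction)

def concept_score(activation):
    """Scalar projection of an activation onto the estimated concept direction."""
    return float(activation @ direction)

print("plural example:  ", concept_score(plural[0]))
print("singular example:", concept_score(singular[0]))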
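The superposition idea can also be illustrated with a toy sketch in which ten sparse features are stored in a five-dimensional space using non-orthogonal directions; the sizes, directions, and sparsity level are arbitrary assumptions for illustration.

# Toy sketch of superposition: more sparse features than dimensions.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 10, 5               # more features than dimensions

# Assign each feature a (necessarily non-orthogonal) unit direction.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse feature vector: only a couple of features are active at once.
f = np.zeros(n_features)
f[rng.choice(n_features, size=2, replace=False)] = 1.0

x = f @ W                                # features packed into 5 dimensions
readout = x @ W.T                        # dot product with every feature direction

# Because features rarely co-occur, the strongest readouts usually match the
# active features despite the interference between overlapping directions.
print("active features:   ", sorted(np.flatnonzero(f)))
print("strongest readouts:", sorted(np.argsort(-readout)[:2]))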
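The probing method described above can be sketched as follows, with random arrays standing in for cached hidden states and a synthetic label standing in for the feature being tested; this is a minimal illustration, not a prescribed implementation.

# Minimal probing sketch: train a linear classifier on stored activations to test
# whether a feature (here, a synthetic binary label) is linearly decodable.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))       # stand-in hidden states, shape (examples, d_model)
labels = (acts[:, 3] + 0.1 * rng.normal(size=1000) > 0).astype(int)  # feature hidden in dim 3

X_train, X_test, y_train, y_test = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# High held-out accuracy suggests the feature is linearly encoded in the activations.
print("probe accuracy:", probe.score(X_test, y_test))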
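One widely used causal-intervention technique is activation patching, in which an activation cached from one input is substituted into the forward pass on another input. The sketch below uses a tiny toy model and PyTorch forward hooks; the model and inputs are illustrative assumptions, not from the sources above.

# Minimal activation-patching sketch with a toy two-layer model.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

clean = torch.randn(1, 4)
corrupted = torch.randn(1, 4)

# 1) Cache the hidden activation from the clean run.
cache = {}
def save_hook(module, inputs, output):
    cache["hidden"] = output.detach()

handle = model[1].register_forward_hook(save_hook)
clean_out = model(clean)
handle.remove()

# 2) Re-run on the corrupted input, but patch in the cached clean activation.
def patch_hook(module, inputs, output):
    return cache["hidden"]

handle = model[1].register_forward_hook(patch_hook)
patched_out = model(corrupted)
handle.remove()

# In this toy case the patched output matches the clean output exactly, because
# all information flows through the patched layer; in a real model, the size of
# the shift indicates how much that component causally carries the information.
print(clean_out, model(corrupted), patched_out)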
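A sparse autoencoder of the kind mentioned above can be sketched as follows; the dictionary size, penalty coefficient, and random stand-in activations are arbitrary choices for illustration.

# Minimal sparse-autoencoder sketch: learn an overcomplete, L1-regularised
# dictionary over activations. Data and hyperparameters are arbitrary toy choices.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_dict, l1_coef = 32, 128, 1e-3
acts = torch.randn(4096, d_model)         # stand-in for cached model activations

encoder = nn.Linear(d_model, d_dict)
decoder = nn.Linear(d_dict, d_model, bias=False)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(200):
    batch = acts[torch.randint(0, len(acts), (256,))]
    codes = torch.relu(encoder(batch))     # sparse feature activations
    recon = decoder(codes)
    loss = ((recon - batch) ** 2).mean() + l1_coef * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of decoder.weight is a candidate feature direction in activation space.
print("final loss:", float(loss),
      "mean active features:", float((codes > 0).float().sum(1).mean()))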