Authors: Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen and Fazl Barez
Short paper accepted at the RTML workshop at ICLR 2023: https://arxiv.org/abs/2304.12918
Longer preprint available here: https://arxiv.org/abs/2305.19911
We provide a new interpretability tool for Language Models, Neuron to Graph (N2G). N2G builds an interpretable graph representation of any neuron in a Language Model. The graph can be visualised for human interpretation, and it can simulate the behaviour of the target neuron by predicting activations on input text, which gives a direct measure of the representation's quality via comparison with the neuron's ground-truth activations. The resulting graphs are searchable and programmatically comparable, facilitating greater automation of interpretability research.
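To make the simulate-and-compare idea concrete, here is a minimal sketch of a token-level evaluation, assuming the simulated and ground-truth activations have both been normalised per neuron; the firing threshold and the use of F1 are illustrative assumptions rather than the paper's exact metric:

```python
from typing import List

def token_level_f1(
    predicted: List[float],
    ground_truth: List[float],
    threshold: float = 0.5,
) -> float:
    """Score a graph's simulated activations against the neuron's real ones.

    Both lists hold one (normalised) activation per token. A token counts as
    "firing" when its activation exceeds `threshold` (an illustrative choice,
    not a value taken from the paper).
    """
    pred_fires = [p > threshold for p in predicted]
    true_fires = [t > threshold for t in ground_truth]

    tp = sum(p and t for p, t in zip(pred_fires, true_fires))
    fp = sum(p and not t for p, t in zip(pred_fires, true_fires))
    fn = sum(t and not p for p, t in zip(pred_fires, true_fires))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```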
Given dataset examples that maximally activate a target neuron within a Language Model, N2G extracts the minimal sub-string required for activation, computes the saliency of each token for the neuron's activation, and creates additional examples by replacing important tokens with likely substitutes proposed by DistilBERT (sketched below). The enriched examples with their token saliencies are then converted into a trie representing the tokens on which the neuron activates, together with the context required for activation on those tokens.

The trie can process text and output token-level activations, which can be compared to the ground-truth activations of the neuron for automatic evaluation. A simplified version of the trie can also be visualised for human interpretation: activating tokens are coloured red according to how strongly they activate the neuron, and context tokens are coloured blue according to their importance for the neuron's activation.

Once a model has been processed and a neuron graph has been generated for every neuron, the graphs can be searched to identify neurons with particular properties, such as activating on a particular token when it occurs with another context token.
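The DistilBERT substitution step can be sketched with a standard Hugging Face fill-mask pipeline. The model name and `top_k` below are ordinary defaults, not necessarily N2G's exact configuration:

```python
# Illustrative sketch of the example-enrichment step: mask a salient token
# and ask DistilBERT for likely substitutes to generate extra examples.
from typing import List

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

def substitutes(text_with_mask: str, top_k: int = 5) -> List[str]:
    """Return likely replacements for the [MASK] token in `text_with_mask`."""
    return [result["token_str"] for result in fill_mask(text_with_mask, top_k=top_k)]

# Example: substitutes("The cat sat on the [MASK].")
# could yield candidates like ["mat", "bed", "floor", "couch", "ground"].
```

And a minimal sketch of how the trie can simulate the neuron on new text. The data layout here (each stored pattern is the activating token followed by its preceding context tokens in reverse order, with an activation at the accepting node) is an illustrative assumption, not N2G's exact structure:

```python
from typing import Dict, List, Optional

class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.activation: Optional[float] = None  # set on accepting nodes

def insert(root: TrieNode, pattern: List[str], activation: float) -> None:
    """Add a pattern: [activating token, context tokens in reverse order]."""
    node = root
    for token in pattern:
        node = node.children.setdefault(token, TrieNode())
    node.activation = activation

def simulate(root: TrieNode, tokens: List[str]) -> List[float]:
    """Predict a token-level activation for each position in `tokens`."""
    activations = []
    for i in range(len(tokens)):
        # Walk the trie with the current token, then its context right-to-left,
        # keeping the strongest activation of any accepting node reached.
        node, best = root, 0.0
        for token in [tokens[i]] + tokens[:i][::-1]:
            node = node.children.get(token)
            if node is None:
                break
            if node.activation is not None:
                best = max(best, node.activation)
        activations.append(best)
    return activations
```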
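Because each graph is an explicit data structure rather than an opaque description, searching across a fully processed model reduces to ordinary traversal and filtering of the stored patterns, which is what enables the automation described above.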
Figure: A neuron graph exhibiting polysemanticity, with three disconnected subgraphs, each responding to a phrase in a different language.
If you use N2G in your research, please cite one of our papers:
@article{foote2023neuron2graph,
  title={Neuron to Graph: Interpreting Language Model Neurons at Scale},
  author={Foote, Alex and Nanda, Neel and Kran, Esben and Konstas, Ioannis and Cohen, Shay and Barez, Fazl},
  journal={arXiv preprint arXiv:2305.19911},
  year={2023}
}