Guarantees-Based Mechanistic Interpretability

This is the codebase for the Guarantees-Based Mechanistic Interpretability MARS stream. It is the successor to https://github.com/JasonGross/neural-net-coq-interp.

Writeups

Compact Proofs of Model Performance via Mechanistic Interpretability

@misc{gross2024compact,
  author      = {Jason Gross and Rajashree Agrawal and Thomas Kwa and Euan Ong and Chun Hei Yip and Alex Gibson and Soufiane Noubir and Lawrence Chan},
  title       = {Compact Proofs of Model Performance via Mechanistic Interpretability},
  year        = {2024},
  month       = {June},
  doi         = {10.48550/arxiv.2406.11779},
  eprint      = {2406.11779},
  url         = {https://arxiv.org/abs/2406.11779},
  eprinttype  = {arXiv},
}

Abstract:

In this work, we propose using mechanistic interpretability – techniques for reverse engineering model weights into human-interpretable algorithms – to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-K task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.

Setup

The code can be run in any environment with Python 3.9 or later.

We use poetry for dependency management, which can be installed by following the official poetry installation instructions.

To build a virtual environment with the required packages, simply run

poetry config virtualenvs.in-project true
poetry install

Notes

- On some systems you may need to set the environment variable PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring to avoid keyring-based errors.
- The first command, poetry config virtualenvs.in-project true, tells poetry to create the virtual environment in the project directory, which allows VS Code to find it.
- If you are using caches from other machines and see errors like "dbm.error: db type is dbm.gnu, but the module is not available", you can probably solve the issue by following instructions from StackOverflow:
  - sudo apt-get install libgdbm-dev python3-gdbm
  - If you are using conda or some other Python version manager, inspect the output of dpkg -L python3-gdbm and copy the lib-dynload/_gdbm.cpython-*-x86_64-linux-gnu.so file to the corresponding lib/ directory of the Python you are using.
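Before applying the dbm fix, it may help to confirm that the GNU dbm backend is really the missing piece. The following is a minimal sketch using only the standard library; the helper name gnu_dbm_available is illustrative and not part of this codebase.

```python
import importlib


def gnu_dbm_available() -> bool:
    """Return True if the standard-library dbm.gnu backend can be imported."""
    try:
        # dbm.gnu depends on the _gdbm extension module, which some Python
        # builds (for example certain conda environments) do not ship.
        importlib.import_module("dbm.gnu")
    except ImportError:
        return False
    return True


available = gnu_dbm_available()
if available:
    print("dbm.gnu is importable; caches written with GNU dbm should open.")
else:
    print("dbm.gnu is missing; the python3-gdbm steps in the notes may help.")
```

Run it with the same interpreter that poetry uses (for example via poetry run python), since the result depends on the specific Python build.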

Data Cache

A cache for pre-computed data for the Max-of-K experiments is available on the following branches of JasonGross/guarantees-based-mechanistic-interpretability-with-data: max-of-4-cache, max-of-5-cache, max-of-10-cache, and max-of-10-dvocab-128-cache.
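One way to fetch a single cache branch without downloading the others is a shallow, single-branch clone. The sketch below only builds the git command. The GitHub URL is inferred from the repository name above, and the helper names and the destination directory are assumptions. Where the cache must be placed for the code to pick it up is not specified here.

```python
import subprocess
from typing import List

# Assumption: the repository named above is hosted at this GitHub URL.
CACHE_REPO_URL = (
    "https://github.com/JasonGross/"
    "guarantees-based-mechanistic-interpretability-with-data.git"
)

# Branch names as listed in this README.
CACHE_BRANCHES = (
    "max-of-4-cache",
    "max-of-5-cache",
    "max-of-10-cache",
    "max-of-10-dvocab-128-cache",
)


def clone_command(branch: str, dest: str) -> List[str]:
    """Build a shallow, single-branch git clone command for one cache branch."""
    if branch not in CACHE_BRANCHES:
        raise ValueError(f"unknown cache branch: {branch!r}")
    return [
        "git", "clone",
        "--depth", "1",
        "--single-branch",
        "--branch", branch,
        CACHE_REPO_URL,
        dest,
    ]


def fetch_cache(branch: str, dest: str) -> None:
    """Run the clone; raises CalledProcessError if git fails."""
    subprocess.run(clone_command(branch, dest), check=True)


# Hypothetical destination directory, printed here rather than executed.
command = clone_command("max-of-4-cache", "data-cache")
print(" ".join(command))
```

Calling fetch_cache("max-of-4-cache", "data-cache") would perform the download; the example only prints the command so it can be inspected first.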

Running notebooks

To open a Jupyter notebook, run

poetry run jupyter lab

If this doesn't work (e.g. you have multiple Jupyter kernels already installed on your system), you may need to make a new kernel for this project:

poetry run python -m ipykernel install --user --name=gbmi

Training models

Models for existing experiments can be trained by running e.g.

poetry run python -m gbmi.exp_max_of_n.train

or by running e.g.

from gbmi.exp_max_of_n.train import MAX_OF_10_CONFIG
from gbmi.model import train_or_load_model

rundata, model = train_or_load_model(MAX_OF_10_CONFIG)

from a Jupyter notebook.

The train_or_load_model function will attempt to pull a trained model with the specified config from Weights and Biases; if no such model exists, it will train the relevant model and save the weights to Weights and Biases.
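The load-or-train behaviour described above follows a common caching pattern. The sketch below illustrates that pattern only and is not the implementation of train_or_load_model: an in-memory dict stands in for Weights and Biases, and load_or_train, fake_train, and the config key are hypothetical names.

```python
from typing import Callable, Dict, Hashable, Tuple, TypeVar

Model = TypeVar("Model")


def load_or_train(
    config_key: Hashable,
    train_fn: Callable[[], Model],
    store: Dict[Hashable, Model],
) -> Tuple[Model, bool]:
    """Return (model, was_cached), training only when the store has no entry."""
    if config_key in store:
        # A model for this config already exists: reuse it.
        return store[config_key], True
    # Otherwise train, then save so that later calls can reuse the result.
    model = train_fn()
    store[config_key] = model
    return model, False


# Toy usage: the "model" is just a dict and training is counted, not performed.
store: Dict[Hashable, dict] = {}
train_calls = []


def fake_train() -> dict:
    train_calls.append(1)
    return {"weights": [0.0]}


first, first_cached = load_or_train("max_of_10", fake_train, store)
second, second_cached = load_or_train("max_of_10", fake_train, store)
print(first_cached, second_cached, len(train_calls))
```

The second call returns the stored object without invoking the training function again, which is the property that makes repeated notebook runs cheap.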

Adding new experiments

The convention for this codebase is to store experiment-specific code in an exp_[NAME]/ folder, with

exp_[NAME]/analysis.py storing functions for visualisation / interpretability
exp_[NAME]/verification.py storing functions for verification
exp_[NAME]/train.py storing training / dataset code

See the exp_template directory for more details.
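The layout above can be sketched as a small scaffolding script. This is a hypothetical helper, not something shipped in the codebase; copying the exp_template directory is likely the intended route. The name scaffold_experiment and the placeholder docstrings are assumptions.

```python
import tempfile
from pathlib import Path
from typing import List

# File names and purposes as described in the convention above.
EXPERIMENT_FILES = {
    "analysis.py": "functions for visualisation / interpretability",
    "verification.py": "functions for verification",
    "train.py": "training / dataset code",
}


def scaffold_experiment(package_root: Path, name: str) -> List[Path]:
    """Create exp_<name>/ under package_root with placeholder modules.

    Existing files are left untouched; only newly created paths are returned.
    """
    exp_dir = Path(package_root) / f"exp_{name}"
    exp_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for filename, purpose in EXPERIMENT_FILES.items():
        path = exp_dir / filename
        if not path.exists():
            path.write_text(f'"""Placeholder: {purpose}."""\n')
            created.append(path)
    return created


# Demonstrate in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_experiment(Path(tmp), "my_task")
    created_names = sorted(p.name for p in created)
    exp_dir_name = created[0].parent.name
    print(exp_dir_name, created_names)
```

In a real checkout, package_root would presumably be the gbmi package directory, and "my_task" would be replaced by the experiment name.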

Adding dependencies

To add new dependencies, run poetry add my-package.

Code Style

We use black to format our code. To set up the pre-commit hooks that enforce code formatting, run

make pre-commit-install

Tests

This codebase advocates for expect tests in machine learning, and as such uses @ezyang's expecttest library for unit and regression tests.

[TODO: add tests?]
