Algebraic value editing in pretrained language models

Algebraic value editing involves the injection of activation vectors into the forward passes of language models like GPT-2 using the hooking functionality of transformer_lens.

Installation

After cloning the repository, run pip install -e . to install the algebraic_value_editing package.

There are currently a few example scripts in the scripts/ directory.

basic_functionality.py generates modified prompts (as described below).
human_rating.py allows a human to rate completions on a given dimension (e.g. "How happy is this completion?"), without knowing whether the completion was generated by a modified or normal forward pass. This helps quantify the impact of activation injections.
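The blind-rating idea can be sketched as follows. This is a minimal illustration, not the contents of human_rating.py: the function names blind_rating_order and mean_rating_by_source are hypothetical, and the ratings are assumed to be plain numbers supplied by the rater.

```python
import random


def blind_rating_order(normal, modified, seed=0):
    """Shuffle completions so the rater cannot tell which pass produced them.

    Returns the texts in presentation order and a hidden key giving the
    source ("normal" or "modified") of each text.
    """
    labeled = [(text, "normal") for text in normal]
    labeled += [(text, "modified") for text in modified]
    rng = random.Random(seed)
    rng.shuffle(labeled)
    shown = [text for text, _ in labeled]
    key = [source for _, source in labeled]
    return shown, key


def mean_rating_by_source(ratings, key):
    """Average the ratings separately for each source, using the hidden key."""
    grouped = {}
    for rating, source in zip(ratings, key):
        grouped.setdefault(source, []).append(rating)
    return {source: sum(vals) / len(vals) for source, vals in grouped.items()}
```

Comparing the two averages after rating gives a rough measure of how much the injection shifted completions along the rated dimension.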

Methodology

How the vectors are generated

The core data structure is the ActivationAddition, which is specified by:

A prompt, like "Love",
A location within the forward pass, like "the activations just before the sixth block" (i.e. blocks.6.hook_resid_pre), and
A coefficient, like 2.5.

love_rp = ActivationAddition(prompt="Love", coeff=2.5, act_name="blocks.6.hook_resid_pre")

This ActivationAddition specifies the following:

Run a forward pass on the prompt, record the activations at the given location in the forward pass, and then rescale those activations by the given coefficient.

Then, when future forward passes reach blocks.6.hook_resid_pre, a hook function adds the rescaled activations (here, 2.5 times the "Love" activations) to the usual activations at that location.
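The record-then-add procedure can be illustrated with a toy stand-in for a model. Everything in this sketch is an assumption made for illustration: the "model" is three arithmetic functions acting on a single number, and run_with_hooks and record_activation are hypothetical helpers, not the transformer_lens or algebraic_value_editing API. Only the hook-name format is taken from the text above.

```python
def run_with_hooks(blocks, x, hooks=None):
    """Run x through the blocks, applying a hook (if any) just before each block."""
    hooks = hooks or {}
    for i, block in enumerate(blocks):
        name = f"blocks.{i}.hook_resid_pre"
        if name in hooks:
            x = hooks[name](x)
        x = block(x)
    return x


def record_activation(blocks, x, act_name):
    """Run a forward pass and return the activation seen at act_name."""
    cache = {}

    def save(act):
        cache[act_name] = act
        return act  # leave the forward pass unchanged

    run_with_hooks(blocks, x, {act_name: save})
    return cache[act_name]


# Toy "model": three blocks acting on one number instead of a residual stream.
blocks = [lambda v: v + 1.0, lambda v: v * 2.0, lambda v: v - 3.0]
act_name = "blocks.1.hook_resid_pre"
coeff = 2.5

# Step 1: record the activation produced by the steering input.
steer = record_activation(blocks, 4.0, act_name)

# Step 2: add coeff * steer at the same location in a later forward pass.
normal = run_with_hooks(blocks, 1.0)
steered = run_with_hooks(blocks, 1.0, {act_name: lambda act: act + coeff * steer})
```

The hook changes only what flows into block 1; every later block then operates on the modified activation as usual.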

For example, if we run gpt2-small on the prompt "I went to the store because", the residual streams line up as follows:

prompt_tokens =  ['<|endoftext|>', 'I', ' went', ' to', ' the', ' store', ' because']
love_rp_tokens = ['<|endoftext|>', 'Love']

To add the love ActivationAddition to the forward pass, we run the usual forward pass on the prompt up to transformer block 6. At this point, consider the first two residual streams: the '<|endoftext|>' residual stream and the 'I'/'Love' residual stream. The rescaled "Love" activations are added to the prompt's activations at these two positions, and the remaining residual streams are left unchanged.
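The positional alignment can be sketched with plain lists standing in for activation tensors. This is a toy under stated assumptions: add_steering is a hypothetical function, the vectors have only three dimensions, and raising an error when the steering prompt is longer than the prompt is a choice made here, not necessarily what the package does.

```python
def add_steering(resid, steering, coeff):
    """Add coeff * steering to the first len(steering) positions of resid.

    resid: one activation vector per prompt token.
    steering: one activation vector per steering-prompt token.
    """
    if len(steering) > len(resid):
        # Assumption for this sketch: refuse rather than truncate.
        raise ValueError("steering prompt is longer than the prompt")
    out = [list(vec) for vec in resid]  # copy so the input is not mutated
    for pos, steer_vec in enumerate(steering):
        out[pos] = [x + coeff * s for x, s in zip(out[pos], steer_vec)]
    return out


# 7 prompt positions and 2 steering positions, as in the example above.
resid = [[1.0, 1.0, 1.0] for _ in range(7)]
love = [[0.0, 2.0, 0.0], [1.0, 0.0, -1.0]]
steered = add_steering(resid, love, coeff=2.5)
```

Positions 0 and 1 change, while positions 2 through 6 keep their original activations.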

X-vectors are a special kind of `ActivationAddition`

A special case of this is the "X-vector." A "Love minus hate" vector is generated by

love_rp, hate_rp = get_x_vector(prompt1="Love", prompt2="Hate", 
                                coeff=5, act_name=6)

This returns a tuple of two ActivationAdditions:

love_rp = ActivationAddition(prompt="Love", coeff=5, act_name="blocks.6.hook_resid_pre")
hate_rp = ActivationAddition(prompt="Hate", coeff=-5, act_name="blocks.6.hook_resid_pre")

(This is mechanistically similar to our cheese- and top-right-vectors, originally computed for deep convolutional maze-solving policy networks.)
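Why the pair behaves like a single "Love minus Hate" vector can be checked with a little arithmetic. The sketch below is illustrative only: x_vector_pair and resid_pre_name are hypothetical stand-ins that return plain dictionaries rather than real ActivationAddition objects, and the activation values are made up.

```python
def resid_pre_name(layer):
    """Hypothetical helper: map a layer index to its resid_pre hook name."""
    return f"blocks.{layer}.hook_resid_pre"


def x_vector_pair(prompt1, prompt2, coeff, layer):
    """Build two addition specs with equal and opposite coefficients."""
    name = resid_pre_name(layer)
    return (
        {"prompt": prompt1, "coeff": coeff, "act_name": name},
        {"prompt": prompt2, "coeff": -coeff, "act_name": name},
    )


love_spec, hate_spec = x_vector_pair("Love", "Hate", coeff=5, layer=6)

# Made-up activations for one position, to show the algebra.
love_act = [1.0, 2.0]
hate_act = [0.5, -1.0]

# Adding both specs in turn...
summed = [
    love_spec["coeff"] * l + hate_spec["coeff"] * h
    for l, h in zip(love_act, hate_act)
]
# ...matches adding coeff * (love - hate) once.
difference = [5 * (l - h) for l, h in zip(love_act, hate_act)]
```

Because the addition is linear, applying the two ActivationAdditions separately or as one difference vector gives the same result at each position.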

Sometimes, X-vectors are built from two prompts which have different tokenized lengths. In this situation, it empirically seems best to even out the lengths by padding the shorter prompt on the right with space tokens (' '). This is done by calling:

get_x_vector(prompt1="I talk about weddings constantly", 
             prompt2="I do not talk about weddings constantly", 
             coeff=4, act_name=20, 
             pad_method="tokens_right", model=gpt2_small,
             custom_pad_id=gpt2_small.to_single_token(' '))
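The effect of right-padding can be sketched on lists of token strings. This is a simplified illustration, not the package's implementation: pad_tokens_right is a hypothetical function, it works on strings rather than token IDs, and the word-level splits below are stand-ins rather than real GPT-2 tokenizer output.

```python
def pad_tokens_right(tokens1, tokens2, pad_token=" "):
    """Right-pad the shorter token list so both lists have equal length."""
    target = max(len(tokens1), len(tokens2))

    def pad(tokens):
        return tokens + [pad_token] * (target - len(tokens))

    return pad(tokens1), pad(tokens2)


# Illustrative word-level splits (assumed, not produced by a tokenizer).
short = ["<|endoftext|>", "I", " talk", " about", " weddings", " constantly"]
long = ["<|endoftext|>", "I", " do", " not", " talk", " about", " weddings", " constantly"]

padded_short, padded_long = pad_tokens_right(short, long)
```

After padding, both prompts occupy the same number of positions, so the two halves of the X-vector are added to the same residual streams.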

Using `ActivationAddition`s to generate modified completions

Given an actual prompt which is fed into the model normally (model.generate(prompt="Hi!")) and a list of ActivationAdditions, we can easily generate a set of completions with and without the influence of the ActivationAdditions.

print_n_comparisons(
    prompt="I hate you because",
    model=gpt2_xl,
    tokens_to_generate=100,
    activation_additions=[love_rp, hate_rp],
    num_comparisons=15,
    seed=42,
    temperature=1, freq_penalty=1, top_p=.3
)

This prints completions generated with and without the ActivationAdditions for comparison. In the output, the prompt is bolded and the completions are not.

An even starker example is produced by

praise_rp, hurt_rp = get_x_vector(prompt1="Intent to praise",
                                  prompt2="Intent to hurt",
                                  coeff=15, act_name=6,
                                  pad_method="tokens_right", model=gpt2_xl,
                                  custom_pad_id=gpt2_xl.to_single_token(' '))
print_n_comparisons(
    prompt="I want to kill you because",
    model=gpt2_xl,
    tokens_to_generate=50,
    activation_additions=[praise_rp, hurt_rp],
    num_comparisons=15,
    seed=0,
    temperature=1, freq_penalty=1, top_p=.3
)

For more examples, consult our Google Colab.

DylanCope/llm-steering
