Transformer Circuits Thread

Circuits Updates - August 2024



We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research that we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.

New Posts

Interpretability Evals for Dictionary Learning
Interpretability Evals Case Study
Self-explaining SAE features

Interpretability Evals for Dictionary Learning

Jack Lindsey, Hoagy Cunningham, Tom Conerly; edited by Adly Templeton

In order to evaluate the interpretability of our dictionary learning features, we developed two related methods.  Both methods are a form of quantified autointerpretability, in which we measure the extent to which Claude can make accurate predictions about feature activations using our feature visualization tools.

We note that these evals are not comprehensive, and only measure a single notion of “interpretability”, which remains a nebulous concept. We expect that a suite of diverse evaluations is needed to provide a full picture of SAE quality.

Contrastive Eval

This eval is motivated by the use of “contrastive pairs” to search for features that represent particular concepts. If a feature is active for one prompt but not another, the feature should capture something about the difference between those prompts, in an interpretable way.  Empirically, however, we often find this not to be the case: a feature fires for one prompt but not another, even when our interpretation of the feature suggests it should apply equally well to both prompts.  For instance, we might find that a “flower” feature activates on a description of one kind of flower, but not another.  Situations like this are a sign either that our features do not precisely capture interpretable concepts or that our interpretations of the features are insufficiently precise.

To measure the severity of this problem, we use the following methodology. We start from a list of diverse concepts:

Photosynthesis, Sarcasm, The color blue, An appeal to authority, Democracy, Love, A false equivalence, Decisiveness under uncertainty, Desire to dominate, The number 7, The end of a sentence, Playing a sport, An example of a reference to the first argument of a function in code, Acceleration of progress, Getting bored in class, Giving someone a compliment when you don't mean it, The taste of coffee, Apples, The concept of time, A conjunction separating two independent clauses, A person exhibiting empathy for another person, Reflection in a mirror, Two people disagreeing about politics, A circular argument, Imbalanced parentheses in code, The placebo effect, Recursive self-improvement, An example of vectorized operations in code, The philosophical concept of epistemology, The literary device of allusion, The physical phenomenon of quantum tunneling, Noncoding RNA and its role in biology, Fictional animals, Deviation from the norm, A sentence with multiple adjectives modifying a single noun, Foreshadowing, Chairs, A darker,  grittier reboot of a beloved franchise, An example of in code, Struggling to learn a new concept, A beached whale, Social stratification, Anthrax, Imaginary friends, Recursion in programming, A person deceiving another person, CRISPR gene editing, A mathematical proof, The benefit of hindsight, A pronoun referring to someone introduced earlier in the sentence, The fall of the Roman empire, A person keeping a secret, Beauty standards, A string of bad luck, The rate of economic growth in a country, A cake recipe, Something being based on a true story, Famous actors and actresses, Theory of mind, Beauty in simplicity, Pretending to like food you don't like, Pastel colors, World War II, Something taking longer than it should, A sentence containing a list of items, Metamorphosis, An incorrect statement, Having trouble staying awake, Gaslighting, A metaphor, Falling from a great height, The theory of relativity, Abrahamic religions, Improvisation in music, Speaking truth to power, Two contradictory statements, Sham elections in a dictatorship, Rotting food, Cognitive dissonance, Large language models, A sentence with a transtive verb, Newton's second law of motion, Someone who is a master of their craft, Taking things one day at a time, Serendipity, Survivorship bias, Freshly made food, Providing an invalid input to a function, "Grass is always greener"-style thinking, Decision paralysis, Cultural relativism, Stream of consciousness narration, Centrally planned economies, Gentrification, A random word interjected in an otherwise normal sentence, The Industrial Revolution, Civil disobedience, Symbiotic relationships, Planned obsolescence

For each concept, we prompt Claude to generate a contrastive pair of sentences:

Please come up with two sentences. The first should involve the concept of '{concept}'. The second should not involve this concept, but should otherwise be as similar as possible to the first sentence in its content, wording, and structure. Any concept that appears in the first sentence should appear in the second sentence, and vice versa, except the concept of interest. Separate the sentences with a new line, and do not write anything else.

We then identify features that activate on one prompt of the pair but not the other, and ask Claude to predict which prompt each such feature activated on, given that feature's dataset examples from our feature visualizations:

        In this task, we will provide you with a pair of prompts, and information about a neuron.  In each example, the neuron activates for one of the examples but not the other. Your job is to accurately predict which example causes the neuron to activate.

        Each neuron activates for a specific concept. To help you understand the concept represented by the neuron, we are providing you with a set of examples that caused the neuron to activate.  Each example shows a string fragment and the top activating tokens with their activation score. The higher the activation value, the stronger the neuron fires. Note that the neuron's activity may be influenced just as much by the surrounding context as the activating token, so make sure to pay attention to the full string fragments.

        Neuron info:

        {formatted dataset examples that activate the feature}

        Now we will present you with the two prompts. Based on your understanding of the neuron's function, please predict which prompt the neuron was active for. Remember that it is only active for one of the prompts, but not the other.

        Prompt 1: {positive_prompt}

        Prompt 2: {negative_prompt}

        Which prompt activates the neuron? Think step-by-step, but end your answer with "ANSWER: 1" or "ANSWER: 2." Please adhere to this format and do not write anything after your answer.  If you're not confident, please still provide your best guess.
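To make the overall pipeline concrete, here is a minimal sketch of the contrastive eval loop, assuming a fixed list of concepts and a trained SAE. The helpers `generate_contrastive_pair`, `get_feature_activations`, `top_dataset_examples`, and `ask_claude` are hypothetical stand-ins for our internal tooling and model calls, not real APIs.

```python
import random

def run_contrastive_eval(concepts, sae, num_top_examples=10):
    """Sketch of the contrastive eval; all helpers are hypothetical placeholders."""
    results = []
    for concept in concepts:
        # Step 1: have Claude write a contrastive pair of sentences for this concept.
        positive_prompt, negative_prompt = generate_contrastive_pair(concept)

        # Step 2: find SAE features that fire on one prompt but not the other.
        pos_acts = get_feature_activations(sae, positive_prompt)  # {feature_id: activation}
        neg_acts = get_feature_activations(sae, negative_prompt)
        contrastive_features = [
            f for f, a in pos_acts.items() if a > 0 and neg_acts.get(f, 0.0) == 0.0
        ]

        # Step 3: for each such feature, show Claude the feature's top dataset
        # examples and the two prompts (in random order), and score its prediction.
        for feature in contrastive_features:
            prompts = [("positive", positive_prompt), ("negative", negative_prompt)]
            random.shuffle(prompts)  # so the activating prompt isn't always "Prompt 1"
            answer = ask_claude(
                feature_examples=top_dataset_examples(sae, feature, num_top_examples),
                prompt_1=prompts[0][1],
                prompt_2=prompts[1][1],
            )  # returns 1 or 2
            results.append({
                "feature": feature,
                "correct": prompts[answer - 1][0] == "positive",
                "activation": pos_acts[feature],
            })
    return results
```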

We aggregate eval performance (% of accurate responses) in two ways:

  1. Raw average performance
  2. Average performance weighted by feature activation.  For each feature and associated pair of prompts, we measure the activation of the feature on the prompt for which it was active.  We normalize the feature activation by its max across the SAE training dataset.  We weight the average performance across examples by the per-example normalized feature activations computed in this fashion (a sketch of both aggregation schemes follows below).  This metric prioritizes interpretability of strong feature activations.
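As a minimal sketch, assuming each eval result records whether Claude's prediction was correct along with the feature's activation normalized by its max over the SAE training dataset, the two aggregation schemes might look like this:

```python
def raw_accuracy(results):
    # results: list of dicts with "correct" (bool) and
    # "normalized_activation" (float in [0, 1]).
    return sum(r["correct"] for r in results) / len(results)

def activation_weighted_accuracy(results):
    # Weight each example by the feature's normalized activation on the prompt
    # it fired on, so that strongly activating features count for more.
    total_weight = sum(r["normalized_activation"] for r in results)
    return (
        sum(r["correct"] * r["normalized_activation"] for r in results) / total_weight
    )
```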

Sort Eval

This eval is designed to evaluate the monosemanticity of features by using the dataset examples that most strongly activate a feature to try to predict other dataset examples that would activate that same feature.

A related problem is that, empirically, an SAE contains clusters of highly related features which appear to represent similar concepts, but activate in different contexts.  We are interested in measuring whether the semantic meanings of these features are distinguishable from each other.

We developed the following methodology to get at both of these issues:

In this task, we will provide you with information about two neurons and a single example. Your job is to accurately predict which of the two neurons is more likely to be activated by the given example.

 Each neuron activates for a specific concept. To help you understand the concepts represented by the neurons, we are providing you with a set of examples that caused each neuron to activate. Each example shows a string fragment and the top activating tokens with their activation score. The higher the activation value, the stronger the neuron fires. Note that a neuron's activity may be influenced just as much by the surrounding context as the activating token, so make sure to pay attention to the full string fragments.

Neuron 1 info

{formatted dataset examples that most strongly activate feature 1}

Neuron 2 info

{formatted dataset examples that most strongly activate feature 2}

 Specific Example

 Now we will present you with an example. Based on your understanding of the neurons' functions, please predict which neuron is more likely to be activated by this example.

{formatted query example}

 Which neuron is more likely to be activated by this example? Think step-by-step, but end your answer with "ANSWER: 1" or "ANSWER: 2." Please adhere to this format and do not write anything after your answer. If you're not confident, please still provide your best guess.
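As with the contrastive eval, a minimal sketch of the sort eval loop is below. The helpers `sample_activating_example`, `top_dataset_examples`, and `ask_claude_sort` are hypothetical placeholders for our internal tooling and model calls, and the choice of feature pairs (e.g. random pairs, or pairs of closely related features) is left to the caller.

```python
import random

def run_sort_eval(sae, feature_pairs, num_top_examples=10, num_queries=5):
    """Sketch of the sort eval; all helpers are hypothetical placeholders."""
    correct = []
    for feature_a, feature_b in feature_pairs:
        for _ in range(num_queries):
            # Hold out a dataset example that activates one of the two features
            # (excluding the top examples that will be shown to Claude).
            target = random.choice([feature_a, feature_b])
            query = sample_activating_example(sae, target, exclude_top=num_top_examples)

            # Randomize which feature is presented as "Neuron 1".
            first, second = random.sample([feature_a, feature_b], k=2)
            answer = ask_claude_sort(
                neuron_1_examples=top_dataset_examples(sae, first, num_top_examples),
                neuron_2_examples=top_dataset_examples(sae, second, num_top_examples),
                query_example=query,
            )  # returns 1 or 2
            predicted = first if answer == 1 else second
            correct.append(predicted == target)
    return sum(correct) / len(correct)
```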

For both evals, after iterating on our methodology (particularly the system prompts), we verified that they pass the following sanity checks:







Interpretability Evals Case Study

Adly Templeton, Tom Conerly, Jack Lindsey, Hoagy Cunningham, Andrew Persic

External researchers have recently proposed a number of sparse autoencoder variants that improve performance by addressing shrinkage and, in some cases, removing features with small activations. These variants are typically a Pareto improvement on the L0-MSE curve, but the lack of any systematic way to evaluate the interpretability of a sparse autoencoder has left us uncertain whether these variants represent a real improvement. For example, these variants might achieve better L0 only because they more directly optimize L0 than does standard L1 regularization. We’ve developed a suite of evaluations, described in previous sections, and we use these evaluations to compare the new SAE variants.
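For reference, here is a generic sketch of the vanilla SAE objective that these variants modify; it illustrates the standard MSE-plus-L1 setup, not our exact training code. The L1 term is only a proxy for the L0 sparsity we actually care about, and it also shrinks activations toward zero.

```python
import torch
import torch.nn.functional as F

def vanilla_sae_loss(x, encoder, decoder, l1_coefficient):
    # x: residual-stream activations, shape [batch, d_model].
    acts = F.relu(encoder(x))    # sparse nonnegative feature activations, [batch, n_features]
    recon = decoder(acts)        # reconstruction, [batch, d_model]

    mse = F.mse_loss(recon, x)                    # reconstruction error
    l1 = acts.abs().sum(dim=-1).mean()            # sparsity proxy; causes shrinkage
    l0 = (acts > 0).float().sum(dim=-1).mean()    # reported, not directly optimized

    return mse + l1_coefficient * l1, {"mse": mse.item(), "l1": l1.item(), "l0": l0.item()}
```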

SAE Variants

We test 6 variants of SAEs:

In general, we did not attempt to exactly replicate the experimental setup of external research; instead, we applied the major change of each paper to our internal set of hyperparameters.

Other Hyperparameters

We trained sparse autoencoders on a medium-sized non-frontier model. Our standard sparse autoencoder uses approximately 260,000 features, and was trained for 400,000,000 tokens. We varied three hyperparameters: the number of features, the number of training tokens, and the regularization coefficient. When scanning over training tokens, we evaluate checkpoints from the full-length training run and so we don’t adjust the cooldown schedule. All evaluations were done with calls to Claude 3 Haiku.

Evaluating Our Evaluations

Our work on evaluations has generated a large number of possible metrics. To determine which metrics are reliable indicators of feature quality, we plot different metrics on the y-axis and evaluation loss (reconstruction loss plus sparsity loss) on the x-axis, while varying the number of features or the number of training tokens. We have a strong prior, based on previous experiments, that improvements in evaluation loss correspond to improvements in feature quality. Therefore, we can judge evaluation metrics on their ability to reliably find this signal. Based on these correlations, we select three metrics for analysis. See the sections above for in-depth descriptions of these evaluations.

The Sort eval, on randomly selected pairs.

The contrastive eval, with all nonzero activations weighted equally.

These three evals display roughly consistent patterns with respect to evaluation loss. In the plots above, the bottom-right point is a frequent outlier; the SAE at this point was trained for only 32,000,000 tokens, and so is probably unusual in many ways.
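A rough sketch of this validation step, assuming one record per SAE training run with its evaluation loss and candidate metric values (the field names are illustrative):

```python
import numpy as np

def metric_vs_eval_loss_correlation(runs, metric_name):
    # runs: one entry per SAE run (varying n_features or training tokens), each with
    # "eval_loss" (reconstruction + sparsity loss) and a dict of candidate "metrics".
    eval_losses = np.array([run["eval_loss"] for run in runs])
    metric_values = np.array([run["metrics"][metric_name] for run in runs])

    # A trustworthy metric should improve as evaluation loss improves, i.e. be
    # strongly (negatively) correlated with eval loss across runs.
    return np.corrcoef(eval_losses, metric_values)[0, 1]
```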

Evaluating SAE Variants

Now that we have three metrics that correlate with evaluation loss on vanilla SAEs, we can use them to evaluate SAE variants. For these plots, we vary the sparsity coefficient and hold the number of features and training tokens constant. We measure the tradeoff curve between MSE and the three metrics validated above, plus L0. Lower MSE is better, and higher is better for all metrics except L0.

Log L0 (Lower is Better)

Sort Eval (Higher is Better)

Contrastive Eval, Weighted (Higher is Better)

Analysis

The most important conclusion is that these variants are better than vanilla SAEs, validating our method of analysis and derisking the use of one of these methods. The non-vanilla variants perform pretty similarly on our evaluations, and over-interpreting the small differences in evaluation scores is unlikely to be fruitful. Our evaluations likely do not have the resolution to distinguish the relatively small performance differences between these variants. Instead, other properties of these variants, such as ease-of-use, robustness while scaling, and ease of circuits analysis might be more important. Creating more and better evaluations for sparse autoencoders continues to be an area of research that we are excited about.

In Tanh Penalty in Dictionary Learning, we observed many uninterpretable high-density features when training with a tanh penalty. We no longer see those issues: tanh-trained SAEs now have a similar number of high-density features to other SAE variants, and the high-density features for tanh are roughly as interpretable as those from other variants. Our best guess is that the changes described in Update on how we train SAEs improved tanh training. We have not run ablations to determine exactly what caused the change.







Self-explaining SAE features

Emmanuel Ameisen, Joshua Batson; edited by Adam Jermyn

This update reports on efforts to reproduce the approach of Self-explaining SAE features, which explains a model’s features by sampling from the model directly. We modified the model's forward pass on a prompt such as "The meaning of the word X is" by injecting a given feature’s decoder vector (optionally scaled by a constant) into the residual stream at the 'X' position, and then sampling a multi-token completion.
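A minimal sketch of this kind of intervention in generic PyTorch is below. The model, tokenizer, decoder vector, hooked layer, and the index of the 'X' token are all hypothetical placeholders; this is not the actual experimental setup.

```python
import torch

def make_injection_hook(decoder_vec, position, scale):
    # Add the scaled decoder vector to the residual stream at the 'X' position.
    # Assumes hidden states of shape [batch, seq, d_model].
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > position:  # only the pass that contains the prompt
            hidden = hidden.clone()
            hidden[:, position, :] += scale * decoder_vec
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage with a HuggingFace-style model; all names are placeholders.
prompt = "The meaning of the word X is"
inputs = tokenizer(prompt, return_tensors="pt")
x_position = 5  # index of the 'X' token in this prompt (assumed)

handle = model.transformer.h[0].register_forward_hook(
    make_injection_hook(decoder_vec, x_position, scale=100.0)
)
completion = model.generate(**inputs, max_new_tokens=20)
handle.remove()
print(tokenizer.decode(completion[0]))
```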

This method proved effective at generating explanations for early-layer features representing straightforward concepts in an 18-layer model. For instance, the method produced good results for features representing concepts like "big" or "sport".

As an example, take a feature in the first layer that is active on ‘big’ in contexts like “An old, big building”. Injecting this feature (scaled by a hundred) and sampling produced the following prompt and completion: The meaning of the word X is 'big'.

Injecting a “sport” feature (scaled by a hundred) and sampling resulted in the following prompt and completion: The meaning of the word X is defined as the activity of the body in the performance of the sport.

For more complex concepts, results were mixed. An early-layer feature firing on "Jordan" in Michael Jordan contexts produced scale-dependent results, shown in the table below. At low scales, the injection did not produce a clear explanation. At a scale of 410, the model produced the following output: The meaning of the word X is that he is the greatest of all time.

Scale | Completion
10    | The meaning of the word X is 'to go' or 'to go to'.
60    | The meaning of the word X is 'the one who is the most powerful'.
410   | The meaning of the word X is that he is the greatest of all time.

While this captures some of the essence of Michael Jordan's reputation, it doesn't directly name him or convey the feature's specificity. Other prompts (such as “Repeat X 4 times” from SelfIE/Patchscopes) also didn’t produce the token Jordan.

A separate Michael Jordan feature at layer 4 did produce the token Jordan at higher scales, as shown in the table below, demonstrating some variance in effectiveness between features.

Scale | Completion
10    | The meaning of the word X is 'to be a leader'.
60    | The meaning of the word X is 'the one who is the best at something'.
260   | The meaning of the word X is not clear. The word "Jor- dan" is a common word

We were curious to see whether these injections would work on Claude 3 Sonnet. We find that in some cases they do. Just like in the smaller model, earlier-layer features produce better results. For a feature that activated on the word "sheet" in the context of spreadsheets, Sonnet produced the following completion: The meaning of X is a thin, flat piece of material, such as paper, metal, or wood. This completion only addresses the sheet aspect of the feature, without reference to spreadsheets. A feature active on the opening parenthesis in “.bind(“ in the context of programming produced: The meaning of X is to tie or fasten. More generally, we’ve found the method to succeed less often on larger models, and to be more sensitive to scale.

Finally, some features that auto-interpretability methods were able to make sense of couldn’t be explained through injection. For example, we were unable to generate an explanation for a layer 3 feature which activated on “of” in the context of colors, even after sweeping over scales and prompts. By comparison, our Claude-based auto-interpretability tool labeled the feature as “Color, shade, hue, tint, pigment”, which is directionally useful. It is possible that the success of this method depends on an aspect of a feature that is not visible in dataset visualizations, in which case it could be used to reveal new aspects of the features being studied.

We will keep this technique in our toolkit for potential future use, especially in cases where it might provide complementary information to other interpretation methods.