The Nonlinear Library

LW - A circuit for Python docstrings in a 4-layer attention-only transformer by StefanHex



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A circuit for Python docstrings in a 4-layer attention-only transformer, published by StefanHex on February 20, 2023 on LessWrong.
Produced as part of the SERI ML Alignment Theory Scholars Program under the supervision of Neel Nanda - Winter 2022 Cohort.
TL;DR: We found a circuit in a pre-trained 4-layer attention-only transformer language model. The circuit predicts repeated argument names in docstrings of Python functions, and it features
3 levels of composition,
a multi-function head that does different things in different parts of the prompt,
an attention head that derives positional information using the causal attention mask.
Epistemic Status: We believe that we have identified most of the core mechanics and information flow of this circuit. However, our circuit only recovers up to half of the model's performance, and there are several leads we have not yet followed.
Introduction
What are circuits?
What do we mean by circuits? A circuit in a neural network is a small subset of model components and weights that (a) accounts for a large fraction of a particular behavior and (b) corresponds to a human-interpretable algorithm. A focus of the field of mechanistic interpretability is finding and better understanding circuits, and recently the field has concentrated on circuits in transformer language models. Anthropic found the small and ubiquitous Induction Head circuit in various models, and a team at Redwood Research found the Indirect Object Identification (IOI) circuit in GPT2-small.
How we chose the candidate task
We looked for interesting behaviors in a small, attention-only transformer with 4 layers from Neel Nanda's open-source toy language models, trained on natural language and Python code. Inspired by Neel's open problems list, we scanned the code dataset for examples where the 4-layer model did much better than a similar 3-layer one. Interestingly, although the circuit seemingly requires just 3 levels of composition, only the 4-layer model could do the task.
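As a rough illustration of this kind of scan (not the authors' actual code), one can compare per-token losses of the two models; the model aliases attn-only-3l / attn-only-4l and the loss_per_token flag below are assumptions about the public TransformerLens API, and the snippet is only a sketch:

```python
# Sketch: find token positions where the 4-layer attention-only model beats the 3-layer one.
# Assumes the TransformerLens library with its "attn-only-3l" / "attn-only-4l" checkpoints,
# and that both models share the same tokenizer.
from transformer_lens import HookedTransformer

model_3l = HookedTransformer.from_pretrained("attn-only-3l")
model_4l = HookedTransformer.from_pretrained("attn-only-4l")

# Any snippet from the code dataset would do here; this one is just a placeholder.
code_snippet = "def add(first, second):\n    return first + second\n"

tokens = model_4l.to_tokens(code_snippet)
loss_3l = model_3l(tokens, return_type="loss", loss_per_token=True)
loss_4l = model_4l(tokens, return_type="loss", loss_per_token=True)

# Large positive entries mark token positions that only the deeper model predicts well.
loss_gap = loss_3l - loss_4l
print(loss_gap.squeeze())
```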
The docstring task
The clearest example we found was in Python docstrings, where it is possible to predict argument names in the docstring: in this randomly generated example, a function has the arguments load, size, files, and last. The docstring convention here demands that each line start with :param followed by an argument name, which makes the argument name very predictable. It turns out that attn-only-4l is capable of this task, predicting the next token (files in the example shown here) correctly in ~75% of cases.
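For concreteness, here is a sketch of such a prompt; the argument names match the example described above, while the function name and docstring descriptions are invented for illustration:

```python
# A docstring-task prompt of the kind described above; the model is given everything
# up to the third ":param" and should predict " files" as the next token.
prompt = '''def process(load, size, files, last):
    """Process the given inputs.

    :param load: description of load
    :param size: description of size
    :param'''

target_next_token = " files"
```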
Methods: Investigating the circuit
Possible docstring algorithms
There are multiple algorithms that could solve this task (sketched in code after this list), such as
"Docstring Induction": Always predict the argument that, in the definition, follows the argument seen in the previous docstring line. I.e., look for param size, check the order in the definition (size, files), and predict files accordingly.
Line-number based: In the Nth line, predict the Nth variable from the definition, irrespective of the content of the other lines. I.e., after the 3rd param token, predict the 3rd variable, files.
Inhibition based: Predict variable names from the definition, but inhibit variable names that have occurred twice (similar to the inhibition in the IOI circuit), i.e. predict load, size, files, last, and inhibit the former two. Add some preference for earlier tokens to prefer files over last.
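To make the candidate algorithms concrete, here is a hypothetical pure-Python sketch of each heuristic operating on the list of argument names from the definition; this only illustrates the algorithms as described above, not the model's actual computation:

```python
def docstring_induction(def_args, prev_param):
    # "Docstring Induction": predict the argument that follows, in the definition,
    # the argument named in the previous ":param" line.
    return def_args[def_args.index(prev_param) + 1]

def line_number_based(def_args, line_number):
    # Line-number based: in the Nth docstring line, predict the Nth argument
    # from the definition, ignoring the content of the other lines.
    return def_args[line_number - 1]

def inhibition_based(def_args, already_mentioned):
    # Inhibition based: predict definition arguments, suppress those already seen
    # in the docstring, and prefer earlier arguments over later ones.
    return next(arg for arg in def_args if arg not in already_mentioned)

args = ["load", "size", "files", "last"]
print(docstring_induction(args, prev_param="size"))                 # -> files
print(line_number_based(args, line_number=3))                       # -> files
print(inhibition_based(args, already_mentioned={"load", "size"}))   # -> files
```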
We are quite certain that at least the first two algorithms are implemented to some degree. This is surprising, since one of the two should be sufficient to perform the task; we do not investigate further why this is the case. A brief investigation showed that the implementation of the 2nd algorithm seems less robust and less generalizable than our model's implementation of th...