Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Before smart AI, there will be many mediocre or specialized AIs, published by Lukas Finnveden on May 26, 2023 on The AI Alignment Forum.
Summary:
In the current paradigm, training is much more expensive than inference. So whenever we finish end-to-end training a language model, we can run many copies of it in parallel.
If a language model were trained with Chinchilla scaling laws on the FLOP-equivalent of a large fraction of the world’s current GPUs and TPUs, I estimate that the training budget could instead produce at least ~20 million tokens per second.
Larger models trained on more data would support more tokens per second.
Language models can also run faster than humans think: current models generate 10-100 tokens per second. It’s unclear whether future models will be slower or faster than that.
This suggests that, before AI changes the world via being broadly superior to human experts, it will change the world via providing a lot of either mediocre (by the standard of human experts) or specialized thinking.
This might make the early alignment problem easier. But the full alignment problem will come soon thereafter, in calendar-time, so this mainly matters if we can use the weaker AI to buy time or make progress on alignment.
More expensive AI means you can run more AIs with your training budget
(...assuming that we’re making them more expensive by increasing parameter-count and training data.)
We’re currently in a paradigm where:
Training isn’t very sample-efficient.
As capabilities increase, training costs grow faster than inference costs (roughly as their square).
Training is massively parallelizable.[1]
While this paradigm holds, it implies that the most capable models will be trained using massively parallelized training schemes, equivalent to running a large number of models in parallel. The larger the model, the more data it needs, and so the more copies of it have to run in parallel during training in order to finish within a reasonable time-frame.[2]
This means that, once you have trained a highly capable model, you are guaranteed to have the resources to run a huge number of copies in parallel. And the bigger and more expensive the model, the more copies of it can run in parallel on your training cluster.
Here’s a rough calculation of how many language models you can run in parallel using just your training cluster:
Let’s say you use p parameters.
Running the model for one token takes kp FLOP, for some k.
Chinchilla scaling laws say training data is proportional to parameters, implying that the model is trained for mp tokens.
For Chinchilla, m=20 tokens / parameter.
Total training costs are 3kmp^2.
The 3 is there because backpropagation is ~2x as expensive as forward propagation.
You spend N seconds training your model.
During training, you use (3kmp^2/N) FLOP/s, while inference costs kp FLOP per token. So by reallocating your training compute to inference, you can process (3kmp^2/N)/(kp) = 3mp/N tokens per second.
If you take a horizon-length framework seriously, you might expect that we’ll need more training data to handle longer-horizon tasks. Let’s introduce a parameter H that describes how many token-equivalents correspond to one data-point.
Total training costs are now 3kmHp^2.
So with the compute you used to train your models, you can process 3mpH/N token-equivalents per second.
Some example numbers (each line after the first changes one or two parameters from it; the code sketch after this list reproduces them):
For p=1e14, N=1y, H=1, m=20, the above equation says you can process 200 million token-equivalents per second, with just your training budget.
For p=1e15, N=1y, H=1, m=20, it’s ~2 billion token-equivalents/second.
For p=1e14, N=3 months, H=1 hour, m=20, it’s ~1 trillion token-equivalents/second.
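To make the arithmetic easy to check, here is a minimal Python sketch of the 3mpH/N calculation. The function name, the seconds-per-year constant, and the token-equivalents-per-hour conversion in the last example are assumptions of the sketch, not figures from the post.

```python
# Minimal sketch of the 3mpH/N calculation above. k never appears because it
# cancels: training uses ~3*k*m*H*p^2 FLOP over N seconds, while inference
# costs ~k*p FLOP per token-equivalent.

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # ~3.16e7

def token_equivalents_per_second(p, n_seconds, m=20, h=1):
    """Token-equivalents/second from reallocating training compute to inference."""
    return 3 * m * p * h / n_seconds

# p=1e14, N=1 year, H=1, m=20  ->  ~2e8 (200 million) per second
print(token_equivalents_per_second(p=1e14, n_seconds=SECONDS_PER_YEAR))

# p=1e15, N=1 year, H=1, m=20  ->  ~2e9 (2 billion) per second
print(token_equivalents_per_second(p=1e15, n_seconds=SECONDS_PER_YEAR))

# p=1e14, N=3 months, H=1 hour. Expressing "1 hour" in token-equivalents
# needs a conversion rate; ~1,300 token-equivalents per hour (an assumption
# of this sketch) reproduces the ~1e12 (~1 trillion) figure.
print(token_equivalents_per_second(p=1e14, n_seconds=SECONDS_PER_YEAR / 4, h=1300))
```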
In addition, there are various tricks for lowering inference costs. For example, reducing precision (whi...