The Nonlinear Library: Alignment Forum

AF - Thomas Kwa's research journal by Thomas Kwa



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thomas Kwa's research journal, published by Thomas Kwa on November 23, 2023 on The AI Alignment Forum.
Thomas Kwa
Why I'm writing this
Research feedback loops for junior AI safety researchers are pretty poor right now. They're better than in the past due to the recent explosion in empirical work, but AI safety is still a very new field whose methodology is not quite nailed down, and which cannot admit the same level of aimlessness as other sciences, or ML at large. There are very likely mistakes being repeated by scientist after scientist, and hopefully I can alleviate the problem slightly by publicly writing about successes and failures in my research process.
This is a monologue containing my ~daily research thoughts, using the LW dialogue format because it allows continuous publication. Hopefully this lets some people either give me feedback or compare their research process to mine. If you do have some thoughts, feel free to leave a comment! People I'm collaborating with might occasionally leave dialogue entries.
With that, let's share the state of my work as of 11/12. Currently I have a bunch of disconnected projects:
Characterize planning inside KataGo by retargeting it to output the worst move (with Adria Garriga-Alonso).
Improve circuit discovery by implementing edge-level subnetwork probing on sparse autoencoder features (with Adria and David Udell).
Create a tutorial for using TransformerLens on arbitrary (e.g. non-transformer) models by extending 'HookedRootModule', which could make it easy to use TransformerLens for e.g. ARENA 3.0 projects (a minimal sketch follows this list).
Create proofs for the accuracy of small neural nets in Coq (with Jason Gross and Rajashree Agrawal).
Create demonstrations of catastrophic regressional Goodhart and possibly strengthen theoretical results (one way to formalize the setup is glossed below).
Help Peter Barnett and Jeremy Gillen wrap up some threads from MIRI, including editing an argument that misaligned mesa-optimizers are very likely.
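For the TransformerLens item above, the core pattern is small: subclass HookedRootModule, wrap the activations you care about in HookPoint layers, and call setup() so the hooks get registered. Here is a minimal sketch with a toy MLP; the model and hook names are made up for illustration, and this is not the tutorial itself.

```python
import torch
import torch.nn as nn
from transformer_lens.hook_points import HookedRootModule, HookPoint

class HookedMLP(HookedRootModule):
    """A toy non-transformer model with TransformerLens-style hooks."""

    def __init__(self, d_in: int = 16, d_hidden: int = 32, d_out: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)
        # HookPoints are identity layers that hooks and caching attach to.
        self.hook_pre = HookPoint()   # hidden pre-activation
        self.hook_post = HookPoint()  # hidden post-activation
        # setup() names every HookPoint in the module tree, enabling
        # run_with_hooks / run_with_cache.
        self.setup()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.hook_pre(self.fc1(x))
        h = self.hook_post(torch.relu(h))
        return self.fc2(h)

model = HookedMLP()
out, cache = model.run_with_cache(torch.randn(8, 16))
print(cache["hook_post"].shape)  # torch.Size([8, 32])
```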
I plan to mostly write about the first three, but might write about any of these if it doesn't make things too disorganized.
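For context on the regressional Goodhart item above, here is one common way to formalize the setup (my gloss; the journal entry itself doesn't spell it out): the proxy is the true value plus independent error, and we optimize by selecting on high proxy scores.

```latex
% Assumed setup, not quoted from the post: proxy = true value +
% independent error, optimized by conditioning on a high proxy score.
%   U = V + X,  with V and X independent.
% Regressional Goodhart is *catastrophic* when selecting ever harder
% on the proxy yields no true value in the limit, e.g. when X is
% sufficiently heavy-tailed relative to V, so that extreme proxy
% scores are dominated by error rather than value:
\[
  U = V + X, \qquad V \perp X, \qquad
  \lim_{t \to \infty} \mathbb{E}\left[V \mid U \ge t\right] = \mathbb{E}[V].
\]
```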
Thomas Kwa
Monday 11/13
I did SERI MATS applications and thought about Goodhart, but most of my time was spent on the KataGo project. I might write more about it later, but the idea is to characterize the nature of planning in KataGo.
Early training runs had produced promising results-- a remarkably sparse mask lets the network output almost the worst possible move as judged by the value network-- but I was a bit suspicious that the hooked network was implementing some trivial behavior, like always moving in the corner. I adapted some visualization code previously used for FAR's adversarial Go attack paper to see what the policy was doing, and well...
[Image: board visualization of the hooked policy's move outputs]
Turns out the network is doing the trivial behavior of moving in either the top left or bottom left corner. I wish I had checked this earlier (it took ~14 days of work to get here), but it doesn't kill this project-- I can just redo the next training run to only allow moves on the 3rd line or above, and hopefully the worst behavior here won't be so trivial.
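As a concrete sketch of that fix (my illustration, not the project's actual code): restrict the move mask so every legal move is on the 3rd line or above, i.e. at least two intersections from every edge.

```python
import torch

def third_line_mask(board_size: int = 19) -> torch.Tensor:
    """Boolean grid of points on the 3rd line or above: at least two
    intersections from every edge, which excludes the 1st and 2nd
    lines where the trivial corner moves live."""
    idx = torch.arange(board_size)
    inner = (idx >= 2) & (idx <= board_size - 3)
    return inner[:, None] & inner[None, :]  # shape (19, 19)

# Hypothetical usage: mask the policy logits before the softmax so the
# retargeted network can never answer with a corner move. `logits` here
# is a flat (361,) tensor of move scores; the pass move is omitted.
logits = torch.randn(19 * 19)
masked = logits.masked_fill(~third_line_mask().flatten(), float("-inf"))
```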
Tomorrow I'm going to table this project and start implementing edge-level subnetwork probing-- estimate is 2 days for the algorithm and maybe lots more effort to run benchmarks.
Thomas Kwa
Wednesday 11/15
Updates from the last two days:
I finished the basic edge-level subnetwork probing code. This is exciting because it might outperform ACDC and even attribution patching for circuit discovery. The original ACDC paper included a version of subnetwork probing, but that version was severely handicapped because it operated on the node level (structural pruning) rather than the edge level.
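For readers unfamiliar with the technique, here is a minimal sketch of the general idea as I understand it (illustrative shapes and names, not the actual implementation): each edge from an upstream node to a downstream node gets a learnable gate, the downstream node reads a gate-weighted mixture of the clean upstream activation and a corrupted baseline, and a sparsity penalty on the gates prunes edges that don't matter. Real implementations typically use a hard-concrete relaxation rather than a plain sigmoid.

```python
import torch
import torch.nn as nn

class EdgeMask(nn.Module):
    """Learnable per-edge gates for edge-level subnetwork probing
    (a simplified sigmoid-gate version)."""

    def __init__(self, n_upstream: int, n_downstream: int, init_logit: float = 3.0):
        super().__init__()
        # One logit per (downstream, upstream) edge; a positive init
        # starts every edge mostly "on".
        self.logits = nn.Parameter(torch.full((n_downstream, n_upstream), init_logit))

    def forward(self, clean: torch.Tensor, corrupt: torch.Tensor) -> torch.Tensor:
        # clean, corrupt: (batch, n_upstream, d_model). Returns what each
        # downstream node reads from each upstream node:
        # (batch, n_downstream, n_upstream, d_model).
        gate = torch.sigmoid(self.logits)[None, :, :, None]
        return gate * clean[:, None] + (1 - gate) * corrupt[:, None]

    def sparsity_penalty(self) -> torch.Tensor:
        # L1-style regularizer pushing gates toward 0, i.e. pruning edges.
        return torch.sigmoid(self.logits).sum()

# Training-loop idea: minimize task loss on the masked model plus
# lambda * mask.sparsity_penalty(), then threshold the gates to read
# off a sparse circuit.
```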
Adria is now on vacation, so I'm planning to get as far as I can running experiments before getting stuck somewhere and coming back to this after Thanksgiving.