Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inside the mind of a superhuman Go model: How does Leela Zero read ladders?, published by Haoxing Du on March 1, 2023 on The AI Alignment Forum.
tl;dr: We did some interpretability on Leela Zero, a superhuman Go model. With a technique similar to the logit lens, we found that the residual structure of Leela Zero induces a preferred basis throughout the network, giving rise to persistent, interpretable channels. By directly analyzing the weights of the policy and value heads, we found that the model stores information related to the probability of the pass move along the top edge of the board, and information related to the board value in checkerboard patterns. We also took a deep dive into a specific Go technique, the ladder, and identified a very small subset of model components that are causally responsible for the model’s judgement of ladders.
Introduction
We live in a strange world where machine learning systems can generate photo-realistic images, write poetry and computer programs, play and win games, and predict protein structures. As machine learning systems become more capable and relevant to many aspects of our lives, it is increasingly important that we understand how the models produce the outputs that they do; we don’t want important decisions to be made by opaque black boxes. Interpretability is an emerging area of research that aims to offer explanations for the behavior of machine learning systems.
Early interpretability work began in the domain of computer vision, and in more recent years the focus has shifted to interpreting transformer-based large language models. Applying interpretability techniques to game-playing agents and reinforcement learning is still relatively uncharted territory. In this work, we look into the inner workings of Leela Zero, an open-source Go-playing neural network. This work is also the first application of many mechanistic interpretability techniques to reinforcement learning.
Why interpret a Go model? Go models are very capable. Many of us remember the emotional experience of watching AlphaGo’s 2016 victory over the human world champion, Lee Sedol. Not only have there been algorithmic improvements since AlphaGo, but these models also improve via self-play, and can essentially continue getting better the longer they are trained. The best open-source Go model, KataGo, is trained in a distributed fashion, and the training is still ongoing as of February 2023. Just as AlphaGo was clearly one notch above Lee Sedol, every generation of Go models has been a decisive improvement over the previous generation. KataGo in 2022 was estimated to be at the level of a top-100 European player using only the policy, and it can easily beat all human players with a small amount of search. Understanding a machine learning system that performs at a superhuman level seems particularly worthwhile, as future machine learning systems are only going to become more capable.
Little is known about models trained to approximate the outcome of a search process. Much interpretability effort has focused on models trained on large amounts of human-generated data, such as labeled images for image models and Internet text for language models. In contrast, when training AlphaZero-style models, moves are selected via Monte-Carlo Tree Search (MCTS), and the policy network of the model is trained to predict the outcome of this search process (see Model section for more detail). In other words, the policy network learns to distill the result of search. While it is relatively easy to get a grasp of what GPT-2 is trained to do by reading some OpenWebText, it’s much less clear what an AlphaZero-style model learns. How does a neural network approximate a search process? Does it have to perform internal search? It seems very useful to try to get an answer to these questions.
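To make "distilling the result of search" a little more concrete, here is a minimal sketch of the policy part of an AlphaZero-style training objective: the search's visit counts are normalized into a target distribution, and the network's policy is scored against it with a cross-entropy loss. This is an illustration under assumptions, not Leela Zero's actual training code; the function names (`search_target`, `policy_loss`), the `temperature` parameter, and the three-move example are all hypothetical.

```python
import math

def softmax(logits):
    """Convert raw policy logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def search_target(visit_counts, temperature=1.0):
    """Turn MCTS visit counts into a target distribution over moves.

    Assumption: the target is proportional to N(move) ** (1 / temperature).
    """
    scaled = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(scaled)
    return [s / total for s in scaled]

def policy_loss(policy_logits, visit_counts):
    """Cross-entropy between the search target and the network's policy."""
    target = search_target(visit_counts)
    probs = softmax(policy_logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

# Hypothetical position with three candidate moves.
visits = [80, 15, 5]              # search spent most of its visits on move 0
uniform_logits = [0.0, 0.0, 0.0]  # a policy with no opinion
matched_logits = [math.log(p) for p in search_target(visits)]

loss_uniform = policy_loss(uniform_logits, visits)
loss_matched = policy_loss(matched_logits, visits)
```

In this sketch the loss is lowest when the policy reproduces the search distribution exactly, so gradient descent on it pushes the network toward predicting what search would have chosen, without the network ever running the search itself.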
Compared to a g...