Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Universality and Hidden Information in Concept Bottleneck Models, published by Hoagy on April 5, 2023 on The AI Alignment Forum.
Summary
I use the CUB dataset to finetune models to classify images of birds, via a layer trained to predict relevant concepts (e.g. "has_wing_pattern::spotted"). When trained end-to-end, these models naturally place additional relevant information into the concept layer, and this information is encoded in very similar ways from run to run. The form of encoding is robust to the initialisation of the concept heads, and fairly robust to changes in architecture (presence of dropout, weighting of the two loss components), though not to initialisation of the entire model. The additional information seems to be placed primarily into low-entropy concepts.
This suggests that if steganography were to arise in, for example, fine-tuned autoregressive models, there would be convergent ways for the hidden information to be encoded. I previously thought this unlikely, and it constrains the space of possible mitigations.
This work is ongoing and there are many more experiments to run, but this is an intermediate post to show the arc of my current results and to help gather feedback. Code for all experiments is on GitHub.
Background
In my Distilled Representations Research Agenda I laid out the basic plan for this research. The quick version: we showed that we can construct toy cases in which auto-encoders are trained to encode vectors where some dimensions have a preferred way of being represented while others don't. By training multiple models and comparing them, we can distinguish between these two kinds of dimension and build a new model which encodes only those dimensions that have a preferred encoding.
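A minimal sketch of that cross-run comparison (the architecture, data, and correlation metric below are illustrative assumptions, not the agenda's actual code): train two auto-encoders from different initialisations on the same data, then correlate their latent spaces; latent directions that reappear across runs indicate a preferred encoding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM, LATENT, N = 16, 8, 4096  # illustrative sizes

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU(), nn.Linear(32, LATENT))
        self.dec = nn.Sequential(nn.Linear(LATENT, 32), nn.ReLU(), nn.Linear(32, DIM))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

def train(model, data, steps=2000):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        recon, _ = model(data)
        loss = ((recon - data) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Plain Gaussian data keeps the sketch short; in the toy cases some
# dimensions are given structure so that they acquire a preferred encoding.
data = torch.randn(N, DIM)
m1 = train(AutoEncoder(), data)
m2 = train(AutoEncoder(), data)

with torch.no_grad():
    _, z1 = m1(data)
    _, z2 = m2(data)
    # Correlate the two latent spaces: a latent direction that reappears
    # across independently initialised runs shows up as a row of the
    # correlation matrix with one near-1 entry.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-8)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-8)
    corr = (z1.T @ z2) / N
    print(corr.abs().max(dim=1).values)  # values near 1.0 => preferred encoding
```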
The next step is to see whether this toy model has any relevance to a real-world case, which I explore in this post.
Training Procedure
This work uses the CUB dataset, which contains over 10K images, each labelled with one of 200 species of bird. Each image is also annotated with 109 binary features describing the bird's appearance. These features are divided into 28 categories, such as beak shape. Usually only one feature within a category is true for any given image, though the raw annotations can mark multiple attributes in a category as true.
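As a hypothetical illustration of this label structure (the category names and index ranges below are made up; the real ones come from the dataset's attribute list):

```python
import numpy as np

N_CONCEPTS = 109
# Hypothetical grouping: each of the 28 categories owns a slice of the
# 109 attribute indices, e.g. "beak_shape" might cover the first few.
CATEGORY_SLICES = {"beak_shape": slice(0, 9), "wing_pattern": slice(9, 13)}  # etc.

def concept_vector(true_attribute_ids):
    """Build the 0/1 concept target from the annotated attribute indices."""
    v = np.zeros(N_CONCEPTS, dtype=np.float32)
    v[list(true_attribute_ids)] = 1.0
    return v

def categories_with_multiple(v):
    """Flag categories where the raw annotations mark more than one attribute true."""
    return [name for name, s in CATEGORY_SLICES.items() if v[s].sum() > 1]
```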
I perform experiments on the CUB dataset using Concept Bottleneck Models, specifically by finetuning an ImageNet-trained model to predict the species of bird from an image, with an auxiliary loss which incentivises each neuron in a layer near the final outputs to fire if and only if a human labeller has indicated that a particular 'concept', i.e. a feature of the bird, is true of that bird. The model has a separate fully-connected 2-layer network for each concept, and then a single fully-connected 2-layer network to predict the class from the concept vector. These fully-connected layers are reinitialised from scratch on each run, but the ImageNet model is pre-trained. The use of the pre-trained network is important to this shared encoding of information, but it also causes a host of other behavioural changes, such that I don't think non-pre-trained networks are an ideal model of behaviour either (see final section).
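A minimal sketch of this architecture, assuming a torchvision ResNet-18 backbone purely for concreteness (the actual backbone and layer widths used in the experiments may differ):

```python
import torch
import torch.nn as nn
from torchvision import models

N_CONCEPTS, N_CLASSES, HIDDEN = 109, 200, 128  # HIDDEN width is illustrative

class ConceptBottleneckModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Pre-trained ImageNet backbone; only the new heads below are
        # reinitialised from scratch on each run.
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()
        self.backbone = backbone
        # One small 2-layer head per concept, each emitting a single logit.
        self.concept_heads = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 1))
            for _ in range(N_CONCEPTS)
        )
        # A single 2-layer network from the concept vector to the class logits.
        self.class_head = nn.Sequential(
            nn.Linear(N_CONCEPTS, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, N_CLASSES)
        )

    def forward(self, x):
        feats = self.backbone(x)
        concept_logits = torch.cat([h(feats) for h in self.concept_heads], dim=1)
        # One common choice is to pass the sigmoided concept activations to
        # the class head; the 109-dim concept vector is the bottleneck.
        class_logits = self.class_head(torch.sigmoid(concept_logits))
        return concept_logits, class_logits
```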
Concept bottleneck models have three basic forms:
Independent: the image-to-concept model and concept-to-class model are trained totally separately, and only combined into a single model at test time.
Sequential: the image-to-concept model is trained first, and then the concept-to-class model is trained to predict the class from the generated concept vectors - but without propagating the gradient back to the image-to-concept model.
Joint: the model is trained end-to-end using a loss combining the concept prediction error and the class prediction error. The relative weight of these can be adjusted for different tradeoffs between class performance and concept accuracy (a sketch of this combined loss follows below).
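A sketch of the joint objective, assuming binary cross-entropy on the concept logits and standard cross-entropy on the class logits (the particular loss functions and weighting scheme here are my assumptions, not confirmed details of the post's code):

```python
import torch.nn.functional as F

def joint_loss(concept_logits, class_logits, concept_targets, class_targets, lam=0.5):
    # Concept loss: each of the 109 concepts is an independent binary
    # prediction; concept_targets is a float tensor of shape (batch, 109).
    concept_loss = F.binary_cross_entropy_with_logits(concept_logits, concept_targets)
    # Class loss: standard 200-way classification on integer class labels.
    class_loss = F.cross_entropy(class_logits, class_targets)
    # lam sets the tradeoff between class performance and concept accuracy.
    return class_loss + lam * concept_loss
```

In code terms, the other two forms use the same pair of losses: the sequential form trains the class head on detached concept activations (e.g. `concept_logits.detach()`), so no gradient flows back into the image-to-concept model, while the independent form trains the class head directly on the ground-truth concept vectors.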