An LLM-based “exemplary actor”, by Roman Leventov, published on May 29, 2023 on The AI Alignment Forum.
Intro and summary
This post is the second section of "Aligning an H-JEPA agent via training on the outputs of an LLM-based 'exemplary actor'", posted separately because I think it could warrant a separate discussion, largely independent of the discussion of the H-JEPA agent with GFlowNet actors. Here's the summary of this post, copied from the "Overview" section of the main article:
In section 2, I describe the “exemplary actor”, an LMCA (language model cognitive architecture) that takes a simple, “brute force” approach to alignment: a powerful LLM (think GPT-5/6 level, with a vast or quasi-unlimited context) is given a list of “approved” textbooks on methodological and scientific disciplines: epistemology, rationality, ethics, physics, etc. The LLM is also given tools: narrow AIs (such as for protein folding, for predicting properties of materials, or for formal scientific modelling). Finally, the LLM is given a compute engine such as Wolfram and a knowledge base such as Wikidata or Wolfram Knowledgebase.
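To make the architecture concrete, here is a minimal Python sketch of the exemplary actor's components. All names here (`ExemplaryActor`, `complete`, and the field names) are illustrative assumptions, not an API from the post; the post only specifies that the LLM has approved textbooks, narrow-AI tools, a compute engine, and a knowledge base at its disposal.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Protocol


class LLM(Protocol):
    """Interface of the underlying large language model (hypothetical)."""
    def complete(self, prompt: str) -> str: ...


@dataclass
class ExemplaryActor:
    """A sketch of the exemplary actor's components, as described above."""
    llm: LLM                                          # the GPT-5/6-level model
    textbooks: Dict[str, str]                         # e.g. {"rationality": <full text>, ...}
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)  # narrow AIs
    compute_engine: Optional[Callable[[str], str]] = None  # e.g. a Wolfram query function
    knowledge_base: Optional[Callable[[str], str]] = None  # e.g. a Wikidata lookup
```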
The exemplary actor creates plans or predictions for given situations (described in language and fed as prompts to the LLM underlying the exemplary actor) and iteratively critiques and refines its own plans and predictions while swapping different textbooks into the LLM's context (first the textbook on rationality, then epistemology, then physics, etc., with potentially dozens of textbooks relevant to the plan or prediction being criticised), for many iterations, until convergence.
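A hypothetical rendering of this loop, reusing the `ExemplaryActor` sketch above. The textbook ordering, prompt wording, and convergence test are all assumptions for illustration (in particular, the exact-match convergence check stands in for whatever real stopping criterion would be used):

```python
def refine_until_convergence(actor: ExemplaryActor, situation: str,
                             max_rounds: int = 100) -> str:
    """Iteratively critique and refine a plan, one textbook at a time."""
    plan = actor.llm.complete(f"Propose a plan for: {situation}")
    for _ in range(max_rounds):
        revised = plan
        # One full pass over the textbooks: rationality, epistemology, physics, ...
        for name, textbook in actor.textbooks.items():
            critique = actor.llm.complete(
                f"{textbook}\n\nCritique this plan strictly according to the "
                f"{name} textbook above:\n{revised}"
            )
            revised = actor.llm.complete(
                f"Revise the plan to address the critique.\n"
                f"Plan:\n{revised}\nCritique:\n{critique}"
            )
        if revised == plan:  # convergence: a full pass changed nothing
            return plan
        plan = revised
    return plan
```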
In section 2.1, I note that the type of alignment that the exemplary actor’s architecture tries to ensure is called (world) model alignment and that it is stronger and also more essential than goal alignment.
Then, I discuss the properties of the exemplary actor. In section 2.2, I discuss what I see as likely non-issues or straightforwardly addressable issues: the “divergent reasoning nature” of LLMs, the lack of grounded common-sense reasoning, and the bias of the quick reactive network (“System 1”), if it is added to the architecture to make it more practically usable in lower-stakes reasoning settings.
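Purely as an illustration of how such a “System 1” network might sit alongside the full architecture, here is a sketch of a stakes-based dispatch. The stakes score, threshold, and function names are invented for the sketch; the post only says a quick reactive network could be added for lower-stakes settings:

```python
def respond(actor: ExemplaryActor, fast_system1: LLM,
            situation: str, stakes: float, threshold: float = 0.5) -> str:
    """Dispatch between a quick reactive network and the full critique loop."""
    if stakes < threshold:
        # Low stakes: answer with the quick reactive network ("System 1"),
        # accepting its known biases in exchange for speed.
        return fast_system1.complete(situation)
    # High stakes: run the full iterated critique-and-refinement loop.
    return refine_until_convergence(actor, situation)
```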
In section 2.3, I discuss the outstanding technical issues and risks of the exemplary actor’s architecture:
The risk of direct access to the underlying LLM (section 2.3.1).
The exemplary actor’s reasoning could still be partially directed by “alien” thinking patterns (i.e., the world model) of the underlying LLM even though these influences won’t surface in the explanations of the plan (section 2.3.2).
Iterated critique and refinement probably won’t make plans strictly conform to the theories described in the textbooks (section 2.3.3).
In section 2.3.4, I discuss the alignment tax of the exemplary actor (compared with the baseline of a bare, minimally fine-tuned LLM) and conclude that the main source of alignment tax may be the theory of ethics, which could force the exemplary actor to refuse to participate in “games” (i.e., real-world situations and environments) where it doesn’t see ethical ways of “winning”, and thus to consider inaction (or some form of palliative action) the only ethical way forward. This is not a technical problem with the exemplary actor per se, but rather a problem with the higher-level system, i.e., the current economic, social, and political structure of the world. I mention this and other kinds of “higher-level” risks of the plans to build and deploy the exemplary actor (roughly the plans that OpenAI and Anthropic seem to me to be betting on) in section 2.4.
2. An LLM-based “exemplary actor”
Let's assume that we have three things:
First, a very powerful auto-regressive LLM (think GPT-5/6 level) with the ability to effe...