June 05, 2026

Two: Ajeya Cotra on accidentally teaching AI models to deceive us

2 hours 49 minutes

We don’t yet have a reliable way to tell whether an AI model is genuinely trying to help us — or faking it.

A model might sincerely want to do exactly what you ask. Or it could be happy to secretly cheat, as long as its answer gets positive reinforcement during training. It might even follow the rules just to gain our trust, all while concealing goals of its own.

The problem is: each of these three motivations scores the same during testing.

Ajeya Cotra — previously a senior research analyst at Coefficient Giving, now working at METR (Model Evaluation & Threat Research) — explains how dangerous this dynamic could become as we train very general and very capable AI models.

She likens humanity’s future trust in AI systems to an orphaned child who inherits a $1 trillion company. This child has to hire someone to run the company, guide his life, and manage his wealth — but he can only choose this person based on a work trial or interview that he designs, with no resumes or reference checks.

And, because he’s so rich, all sorts of people apply, for all sorts of reasons. Some applicants will truly want to help. But the role will also attract others who only pretend to care while they’re being monitored, and who intend to exploit the child as soon as they can get away with it.

Like a child trying to judge adults, humans will at some point need to judge the trustworthiness and reliability of machine learning models that are as goal-oriented as people and that greatly outclass us in knowledge, experience, breadth, and speed.

And we can’t rely on models’ performance during training tasks to guide us, as current reinforcement learning would give the same grades to three vastly different motivations:

Saints — models that genuinely care about doing what we want
Sycophants — models that just want positive reinforcement for a ‘correct’ result, even if they get there with actions they know we wouldn’t approve of
Schemers — models that don’t care about our interests at all, and only behave correctly as long as it serves their own agenda

Worse still, training might actively encourage deception.

Imagine training a model to run a business, and measuring its success by the balance in its bank account. A highly capable model might experiment with dishonest strategies. Maybe it steals some money and covers it up. (This isn’t a hypothetical worry; models often come up with creative — sometimes undesirable — approaches during training that their developers didn’t anticipate.)

A model that cheats and covers its tracks would look like a star performer — and get reinforced for exactly that behaviour. If cheating is only caught some of the time, the model still might not learn to stop deceptive behaviour. Instead, it might learn that deceiving without being caught gives it a competitive advantage.

In this conversation, Ajeya and host Rob Wiblin discuss the above, as well as:

How to predict the motivations a neural network will develop through training
Whether AIs in training will functionally understand that they’re AIs being trained
Stories of AI misalignment that Ajeya doesn’t buy
Analogies for AI, from octopuses to aliens to can openers
Why it’s smarter to have separate ‘planning AIs’ and ‘doing AIs’
The benefits of only following through on AI-generated plans that make sense to human beings
Which approaches for fixing alignment problems Ajeya is most excited about, and which she thinks are overrated
How we might demonstrate actually scary AI failure mechanisms

Learn more and read the full transcript on the 80,000 Hours website.

This episode was originally released in May 2023.

Chapters:

Rob’s intro (00:00:00)
The interview begins (00:02:38)
How Ajeya’s views have changed since 2020 (00:05:09)
Are neural networks more like a sped-up version of evolution, or a slower version of human learning? (00:17:42)
Situational awareness (00:26:10)
Misalignment stories Ajeya doesn’t buy (00:42:03)
The orphan heir with a trillion-dollar fortune (00:59:14)
Saints, Sycophants, and Schemers (01:03:41)
Ways to train safer AI systems (01:23:20)
Aliens and other analogies (01:38:22)
Moral patienthood (01:53:21)
ARC Evaluations (01:55:35)
Interpretability research (02:09:25)
Rewarding models based on how good and sensible their plans seem to us (02:17:48)
Overrated approaches (02:25:49)
Demos of actually scary alignment failures (02:30:57)
Skills to develop for doing useful work (02:37:23)
Rob’s outro (02:47:24)

Producer: Keiran Harris

Audio mastering: Ryan Kessler and Ben Cordell

Transcriptions: Katy Moore


By 80,000 Hours