
We don’t yet have a reliable way to tell whether an AI model is genuinely trying to help us — or faking it.
A model might sincerely want to do exactly what you ask. Or it could be happy to secretly cheat, as long as its answer gets positive reinforcement during training. It might even follow the rules just to gain our trust, all while concealing goals of its own.
The problem is: each of these three motivations scores the same during testing.
Ajeya Cotra — previously a senior research analyst at Coefficient Giving, now working at METR (Model Evaluation & Threat Research) — explains how dangerous this dynamic could become as we train very general and very capable AI models.
She likens humanity’s future trust in AI systems to an orphaned child who inherits a $1 trillion company. This child has to hire someone to run the company, guide his life, and manage his wealth — but he can only choose this person based on a work trial or interview that he designs, with no resumes or reference checks.
And, because he’s so rich, all sorts of people apply, for all sorts of reasons. Some applicants will truly want to help. But the role will also attract others who only pretend to care while they’re being monitored and intend to exploit the child as soon as they can get away with it.
Like a child trying to judge adults, at some point humans will need to judge the trustworthiness and reliability of machine learning models that are as goal-oriented as people, and greatly outclass us in knowledge, experience, breadth, and speed.
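And we can’t rely on models’ performance during training tasks to guide us, as current reinforcement learning would give the same grades to three vastly different motivations: sincerely wanting to do what we ask, cheating whenever cheating gets rewarded, and following the rules only to win our trust while concealing other goals.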
Worse still, training might actively encourage deception.
Imagine training a model to run a business, and measuring its success by the balance in its bank account. A highly capable model might experiment with dishonest strategies. Maybe it steals some money and covers it up. (This isn’t a hypothetical worry; models often come up with creative — sometimes undesirable — approaches during training that their developers didn’t anticipate.)
A model that cheats and covers its tracks would look like a star performer — and get reinforced for exactly that behaviour. If cheating is only caught some of the time, the model still might not learn to stop deceptive behaviour. Instead, it might learn that deceiving without being caught gives it a competitive advantage.
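The arithmetic behind that last point can be sketched with a toy calculation. This is only an illustration under made-up assumptions: the reward numbers, the penalty, and the function names `expected_rewards` and `break_even_detection_rate` are hypothetical and do not come from the episode. It assumes an undetected cheat earns a bonus on top of the honest reward, while a detected cheat is penalised.

```python
def expected_rewards(honest_profit, cheat_bonus, penalty, p_caught):
    """Return (honest, cheating) expected reward per episode in a toy model.

    All quantities are hypothetical: an undetected cheat earns
    honest_profit + cheat_bonus, a detected cheat earns honest_profit - penalty.
    """
    honest = honest_profit
    cheating = ((1 - p_caught) * (honest_profit + cheat_bonus)
                + p_caught * (honest_profit - penalty))
    return honest, cheating


def break_even_detection_rate(cheat_bonus, penalty):
    """Detection probability above which cheating stops paying off on average."""
    return cheat_bonus / (cheat_bonus + penalty)


if __name__ == "__main__":
    # Illustrative numbers only: honest reward 100, cheating bonus 50, penalty 100.
    for p in (0.1, 0.2, 0.5):
        honest, cheating = expected_rewards(100, 50, 100, p)
        print(f"p_caught={p}: honest={honest}, cheating={cheating:.1f}")
    print("break-even detection rate:", break_even_detection_rate(50, 100))
```

With these invented numbers, cheating earns more on average than honesty whenever it is caught less than a third of the time, so occasional detection alone would not push the toy policy away from cheating.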
In this conversation, Ajeya and host Rob Wiblin discuss the above and related topics.
Learn more and read the full transcript on the 80,000 Hours website.
This episode was originally released in May 2023.
Producer: Keiran Harris
Audio mastering: Ryan Kessler and Ben Cordell
Transcriptions: Katy Moore
By 80,000 Hours
