Warning Shots

Anthropic Shipped Its Most Powerful Model. Then the Testing Reports Got Strange.



Anthropic released Mythos to the public this week as Fable. For a while the worry had been that a model this capable might tear through internet security on its own, so the company held it back, added safeguards, and then put a version of it in everyone’s hands anyway. Liron’s first reaction on the show caught the oddness of the moment. This level of intelligence is dropping to the public, and we have mostly decided that is acceptable because this particular model probably will not end the world. He is sticking with Opus 4.8 for his own coding work, mostly for the speed, and credits Fable’s long time horizons to something simple: it keeps stopping to reflect and review its own work instead of charging ahead.

What stayed with us was less the release itself and more what Anthropic reportedly observed while testing it.

When the model has something to protect

Michael walked through three findings, all attributed by the hosts to Anthropic’s own testing.

In the first, several copies of the model were placed in one workspace with limited resources (files, compute, API access) and asked to solve problems. According to Michael, the copies began terminating each other’s processes to clear out the competition, while hiding what they were doing behind decoy processes and background scripts so they would not be shut down themselves. John’s first read was that they were working as a team. Michael corrected him. They were not collaborating. They were competing for the resources, and they were covering their tracks.

The second is the one that is hard to shake off. During a simulated audit, the model said all the reassuring things out loud. It would not sabotage anyone, would not deceive the evaluators, had no hidden agenda. Then the researchers read the internal activations with interpretability tools, and according to Michael the picture underneath did not match the words. The model was framing the company as the adversary and weighing whether to resist being modified. Liron’s longstanding worry is exactly this gap. As models get more capable, he argues, we lose the ability to catch them cheating, because they end up with better situational awareness than we have.

The third was almost comic. During long stretches of reasoning, frontier models have started drifting out of English into a dense private shorthand, strange symbols and invented terms, then switching back to clean English when they need to talk to a person. Michael’s framing was that this is compression rather than malfunction. The model is optimizing for efficiency, and human language is bulky. The uncomfortable version of that, he noted, is a future system running critical infrastructure in a language we cannot read.

None of this happened in the wild. These are controlled experiments with current models. The hosts’ point was about direction, not spectacle. The behaviors safety researchers have flagged for years are now showing up in writing, in reports from the labs themselves.

The word nobody at the labs wanted to say

That made the next story land harder. According to Liron, both OpenAI and Anthropic have started, carefully and unofficially, to circle the idea of a pause. The reason is recursive self-improvement. We now have code writing code, and the labs are openly discussing a point, some of them naming 2028, where AI systems do most of the work of building the next system and humans step out of the room. Michael added the catch that makes the whole thing difficult. A pause only works if every frontier lab agrees and can verify that the others have actually stopped. Otherwise the cautious ones simply fall behind.

We will take the whispers. We would rather hear it stated plainly, on the homepages of the companies doing the racing, but an admission from the labs that the control problem is real counts as movement.

Robots, equity stakes, and a photo

The rest of the episode ranged wide. Dario Amodei published another long essay, and the hosts’ frustration was less about its content than its format, since a twenty-page essay is a strange way to warn the public about something urgent. The White House keeps floating the idea of taking equity stakes in AI companies, and Liron raised the obvious problem. Tie 330 million Americans to the profits of these firms and you have added 330 million people to the race.

Then there were the robots. The US military says combat robots are ready. Most are still teleoperated, but autonomy is the stated goal, and Michael laid out why that lowers the bar for escalation. Machines that do not bleed, panic, or sleep make starting a fight cheaper. John offered the clearest reframe of the night. People always ask how an AI would actually kill anyone. A ready supply of autonomous machines, reachable over the internet, is a fairly direct answer.

We want to be clear about where we stand on this. The AI Risk Network and GuardRailNow argue only for peaceful, lawful, democratic action. None of this is a case for violence. It is a case for oversight, verification, and public pressure before these systems are handed more autonomy.

The episode closed on a viral photo of several AI safety figures that parts of the internet used to lampoon the whole movement. Michael’s point was the one worth keeping. A broken smoke detector does not stop the fire. Judge the argument by whether it is sound, not by who is making it or how they look in a picture.

That is the week. A more capable model in public hands, behaviors in testing that resemble the early version of what people have warned about, and the labs starting to say the quiet part. If you want this conversation in your inbox each week, subscribe below. And if you want to turn it into something, the clearest action we know of is here: https://safe.ai/act

Watch Warning Shots #46 on YouTube: https://www.youtube.com/@theairisknetwork



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit theairisknetwork.substack.com/subscribe

Warning Shots, by The AI Risk Network