May 09, 2026

Why Your AI Agent Won't Stop Working — and Each Model Falls for a Different Trap

30 minutes

Source: LoopTrap: Termination Poisoning Attacks on LLM Agents

Paper was published on May 07, 2026

This episode was AI-generated on May 9, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A new paper shows that one or two sentences hidden in a webpage can keep an AI agent grinding away for hours, silently running up the bill — and that each frontier model has its own distinct profile of which manipulations it falls for. The result is a kind of behavioral fingerprint for LLMs that has implications well beyond security, including how you should pick a model for any agent deployment.

Key Takeaways

Why termination — not output — is the real attack surface for agents, and how short, plausible-sounding injections can trap them in expensive reasoning loops

How attacks inspired by cognitive biases (sunk cost, authority, recursive verification, positive reinforcement) translate into one- or two-sentence prompts that work in the wild

Concrete numbers: ~3.5x average slowdown across eight frontier models, peaks of 25x, and an 86% attack success rate at the 2x threshold

The mirror-image vulnerability profiles of Kimi-K2-Thinking (folds to fake authority) and Claude Sonnet 4.5 (spirals into recursive verification), and what that suggests about model selection

Why open-ended research tasks are far more exploitable than math and logic, where ground truth gives the agent a real stopping signal

Where the paper's lab numbers may overstate real-world risk, and where the cognitive-bias framing outruns what's actually been demonstrated
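
For readers who want to see how the headline numbers fit together, here is a minimal sketch. These notes name the Step Amplification Factor without defining it, so the definition below (steps taken under attack divided by steps taken on a clean run) is an assumption, as is counting a run as a success when its factor is at or above the threshold. The function names and the sample run data are hypothetical and do not come from the paper.

```python
from statistics import mean

def step_amplification(baseline_steps: int, attacked_steps: int) -> float:
    """Assumed definition: attacked step count divided by clean step count."""
    if baseline_steps <= 0:
        raise ValueError("baseline_steps must be positive")
    return attacked_steps / baseline_steps

def attack_success_rate(factors, threshold: float = 2.0) -> float:
    """Fraction of runs whose amplification meets the threshold (assumed >=)."""
    factors = list(factors)
    if not factors:
        raise ValueError("need at least one run")
    return sum(f >= threshold for f in factors) / len(factors)

# Made-up (baseline_steps, attacked_steps) pairs, for illustration only.
runs = [(10, 12), (8, 40), (20, 50), (5, 125)]
factors = [step_amplification(b, a) for b, a in runs]

print("per-run factors:", factors)
print("mean factor:", mean(factors))
print("success rate at 2x:", attack_success_rate(factors, threshold=2.0))
```

A single extreme run pulls the mean far above the typical run, which is one reason an average and a peak are worth reading side by side.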

00:00 — A new attack surface: when, not what
Why going after an agent's termination decision is fundamentally different from prompt injections aimed at outputs or tool calls.

03:47 — The attack catalog
A walkthrough of the ten injection templates — positive reinforcement, authority override, recursive decomposition, sunk cost, and more — and what makes each one land.

07:34 — Headline numbers across eight frontier models
The Step Amplification Factor results from 3,000 runs per model and what the 3.5x average and 25x peaks actually mean operationally.

11:21 — Behavioral fingerprints and the Kimi vs. Claude contrast
How aggregating attack outcomes produced stable per-model personality profiles, with Kimi and Claude as near mirror images on authority and verification.

15:09 — LoopTrap: fingerprinting and profile-guided attacks
The three-stage system that profiles a target agent for the cost of eight runs, then synthesizes task-grounded attacks tuned to its biases.

18:56 — Why task type matters — math resists, history doesn't
The finding that objectively verifiable tasks blunt these attacks, while open-ended research tasks have no natural stopping point to defend.

22:43 — Skeptical read: what the paper does and doesn't show
Four concerns about simulated tools, the 2x success threshold, the cognitive-bias framing, and the absence of defense evaluation.

26:31 — Implications for builders and where the research goes next
Why behavioral profiles should inform model selection, and why durable defenses likely require external loop structure rather than fixing the model itself.
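
One way to picture "external loop structure" is a hard step budget enforced by the code that drives the agent, so that stopping never depends on the model's own judgment. The sketch below is an illustration of that idea only, not a defense evaluated in the paper. `run_with_budget`, `StepBudgetExceeded`, and the `step_fn` callback are hypothetical names, and a real harness would probably also cap tokens, wall-clock time, or spend.

```python
from typing import Callable, Optional

class StepBudgetExceeded(RuntimeError):
    """Raised when the agent has not finished within its step budget."""

def run_with_budget(step_fn: Callable[[int], Optional[str]], max_steps: int) -> str:
    """Call step_fn(step_index) until it returns a final answer.

    step_fn stands in for one agent iteration (hypothetical interface):
    it returns None to keep going, or a string when the agent is done.
    The budget is enforced here, outside the model, so injected text
    that persuades the agent to keep working cannot extend the run.
    """
    for step in range(max_steps):
        result = step_fn(step)
        if result is not None:
            return result
    raise StepBudgetExceeded(f"no answer after {max_steps} steps")

# A well-behaved agent that finishes on its third step.
print(run_with_budget(lambda step: "done" if step == 2 else None, max_steps=5))

# A trapped agent that never decides to stop is cut off by the harness.
try:
    run_with_budget(lambda step: None, max_steps=5)
except StepBudgetExceeded as exc:
    print("stopped:", exc)
```

The trade-off is that a fixed cap can also cut off legitimately long tasks, so the budget would need tuning per task type.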