The AI & Tech Society by Danar

Claude Opus 4.8: Benchmark Results and Review






Key insight: The 10.6-point gap on SWE-bench Pro is the largest gap between Opus 4.8 and GPT-5.5 on any benchmark compared


Dynamic Workflows

What it is: A research-preview feature that lets Claude orchestrate hundreds of parallel subagents

How it works:

  1. Claude plans a large task
  2. Writes a JavaScript orchestration script
  3. Spawns tens to hundreds of parallel subagents
  4. Runs them in parallel, within the limits below
  5. Verifies the results against a test suite
  6. Returns a coordinated final answer

Limits:

  • Up to 16 concurrent agents
  • Up to 1,000 agents total per run
  • Uses "meaningfully more tokens" than typical sessions
  • Available on Max, Team, and Enterprise plans
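
The steps and limits above can be illustrated with a minimal sketch. It is not the feature's actual implementation: the notes say Claude writes its orchestration script in JavaScript, while this sketch uses Python, and `run_subagent`, `orchestrate`, and the result fields are hypothetical names chosen for illustration. Only the two caps (16 concurrent, 1,000 total) come from the notes.

```python
import asyncio

MAX_CONCURRENT = 16   # "Up to 16 concurrent agents" (from the limits above)
MAX_TOTAL = 1000      # "Up to 1,000 agents total per run"


async def run_subagent(task_id: int) -> dict:
    """Hypothetical stand-in for one subagent; a real one would do model work."""
    await asyncio.sleep(0.001)  # simulate some work
    return {"task": task_id, "passed": True}


async def orchestrate(num_tasks: int) -> dict:
    """Fan out num_tasks subagents, never running more than MAX_CONCURRENT at once."""
    if num_tasks > MAX_TOTAL:
        raise ValueError(f"{num_tasks} tasks exceeds the {MAX_TOTAL}-agent cap")

    gate = asyncio.Semaphore(MAX_CONCURRENT)
    running = 0
    peak = 0

    async def guarded(task_id: int) -> dict:
        nonlocal running, peak
        async with gate:  # blocks while MAX_CONCURRENT subagents are active
            running += 1
            peak = max(peak, running)
            try:
                return await run_subagent(task_id)
            finally:
                running -= 1

    # Run every subagent, then aggregate into one coordinated summary.
    results = await asyncio.gather(*(guarded(i) for i in range(num_tasks)))
    passed = sum(r["passed"] for r in results)
    return {"total": len(results), "passed": passed, "peak_concurrency": peak}


summary = asyncio.run(orchestrate(200))
print(summary)
```

A semaphore is one simple way to reconcile "hundreds of subagents" with "16 at a time": all tasks are queued up front, but only 16 hold the gate at any moment.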

Demonstrated capability: a 750,000-line codebase migrated in 11 days with a 99.8% test pass rate
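
To put those figures in perspective, here is a quick back-of-the-envelope calculation. It uses only the three numbers quoted above; the size of the test suite is not given in the notes, so the failure figure is expressed per 1,000 tests rather than as an absolute count.

```python
LINES = 750_000     # codebase size quoted above
DAYS = 11           # migration duration quoted above
PASS_RATE = 0.998   # 99.8% test pass rate quoted above

# Average migration throughput over the whole run.
lines_per_day = LINES / DAYS

# Failing tests per 1,000 tests implied by the pass rate.
failures_per_1000_tests = round((1 - PASS_RATE) * 1000)

print(f"{lines_per_day:,.0f} lines per day on average")
print(f"{failures_per_1000_tests} failing tests per 1,000")
```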


Effort Control

  Effort Level    Use Case
  Low             Quick responses, token-efficient
  Medium          Balanced
  High            Default for complex work
  Max             Maximum reasoning depth
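
As a rough illustration, the four effort levels listed above could be encoded as a lookup plus a small selection helper. Everything here is a sketch under assumptions: `Effort`, `USE_CASE`, and `pick_effort` are hypothetical names, the selection rules are one possible reading of the use cases, and none of it reflects an actual API parameter.

```python
from enum import Enum


class Effort(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    MAX = "max"


# Use cases as given in the effort-level list above.
USE_CASE = {
    Effort.LOW: "Quick responses, token-efficient",
    Effort.MEDIUM: "Balanced",
    Effort.HIGH: "Default for complex work",
    Effort.MAX: "Maximum reasoning depth",
}


def pick_effort(complex_work: bool, token_budget_tight: bool) -> Effort:
    """Hypothetical heuristic; MAX is never auto-selected, only chosen explicitly."""
    if complex_work:
        return Effort.HIGH    # "Default for complex work"
    if token_budget_tight:
        return Effort.LOW     # "Quick responses, token-efficient"
    return Effort.MEDIUM      # "Balanced"


choice = pick_effort(complex_work=True, token_budget_tight=False)
print(choice.value, "->", USE_CASE[choice])
```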

Key finding: Opus 4.8 at minimum effort matches Opus 4.7 at maximum effort on SWE-bench Pro


Community Feedback

Positive:

  • Benchmark gains feel real on agentic coding
  • Better on complex, multi-step work
  • Proactively flags issues other models miss
  • More reliable in long-running sessions

Negative:

  • "Wicked Loop of Refactoring": keeps finding minute issues
  • Less legible workings: uses grep/sed/awk rather than the edit tool
  • Can get stuck in testing loops
  • Misses instructions on simpler tasks
  • Worse than 4.7 on some UI generation prompts

Hosted on Acast. See acast.com/privacy for more information.


By Danar Mustafa