AI Papers: A Deep Dive

Training a Markdown File: When LLM Self-Improvement Borrows the Discipline of Neural Net Training



Source: SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Paper was published on May 22, 2026

This episode was AI-generated on May 25, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.

A team at Microsoft moved a single Markdown file between two completely different agent systems and watched spreadsheet performance jump sixty points — no retraining, no code changes. The trick is treating the prompt as a parameter and applying actual optimizer discipline: learning rates, validation gates, rejected-edit buffers, momentum. It's the difference between a chef scribbling in margins and a real test kitchen.
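
The optimizer discipline described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not SkillOpt's actual implementation: the names `SkillState`, `train_skill`, `toy_propose`, and `toy_score` are hypothetical, the skill is modeled as a list of rule strings, edits are append-only, and momentum and epoch-level slow updates are left out. It shows three of the ingredients: a bounded number of edits per step, a strict validation gate, and a rejected-edit buffer that is fed back to the proposer.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SkillState:
    lines: List[str]                                   # the skill document, one rule per entry
    rejected: List[str] = field(default_factory=list)  # rejected-edit buffer


def train_skill(
    state: SkillState,
    propose: Callable[[List[str], List[str]], List[str]],
    score: Callable[[List[str]], float],
    steps: int,
    max_edits: int = 2,
) -> SkillState:
    """Append-only sketch: accept a rule only if the validation score strictly improves."""
    best = score(state.lines)
    for _ in range(steps):
        # Bounded step size: consider at most max_edits proposed rules per step.
        candidates = propose(state.lines, state.rejected)[:max_edits]
        for rule in candidates:
            if rule in state.rejected or rule in state.lines:
                continue  # never retry an edit that already failed or is already present
            trial = state.lines + [rule]
            trial_score = score(trial)
            if trial_score > best:            # strict validation gate
                state.lines, best = trial, trial_score
            else:
                state.rejected.append(rule)   # remember the failure
    return state


# Toy stand-ins (assumptions): in a real system, propose would call an optimizer
# model and score would run the agent on a held-out validation split.
USEFUL = {"write static values", "check sheet names"}


def toy_score(lines: List[str]) -> float:
    # Reward useful rules, with a small penalty per line to favor compact skills.
    return sum(rule in USEFUL for rule in lines) - 0.01 * len(lines)


def toy_propose(lines: List[str], rejected: List[str]) -> List[str]:
    pool = ["use bold headers", "write static values", "check sheet names", "add emojis"]
    return [rule for rule in pool if rule not in lines and rule not in rejected]


final = train_skill(SkillState(lines=[]), toy_propose, toy_score, steps=3)
print(final.lines)
print(final.rejected)
```

In this toy run the two useful rules pass the gate and the two cosmetic ones land in the rejected buffer, so the proposer does not resubmit them on later steps.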

Key Takeaways
  • Why prior LLM self-revision systems mostly fail: they look like optimizers but are missing the structural ingredients — bounded step size, validation gates, persistent failure memory — that make neural net training reliable
  • How a strict validation gate and bounded edits combine to keep the rejected-edit buffer meaningful, and why removing the long-horizon machinery costs 22 points on the spreadsheet benchmark
  • What the trained skill documents actually contain — specific procedural rules like 'write evaluated static values instead of relying on Excel recalculation' that fill a gap between pretrained knowledge and task instances
  • Why the cross-harness transfer result (Codex to Claude Code, +60 points) is the cleanest evidence that the method captures domain knowledge rather than harness-specific syntax
  • The selection-bias risk in the validation gate the paper doesn't fully address, plus the method's hard dependency on a reliable scalar reward signal
  • Why small models gain disproportionately from trained skills — and the economic implication of training once on a frontier model then deploying on cheaper ones
    • 00:00 — The sixty-point transfer result
      A Markdown skill file trained in one agent system lifts performance from 22% to 82% when dropped into a completely different one, with no retraining.
    • 04:49 — The chef versus the test kitchen
      Why existing LLM self-improvement systems are shaped like optimizers but missing every structural ingredient that makes real optimization work.
    • 06:58 — Five pieces borrowed from neural net training
      Walking through SkillOpt's student/optimizer split, bounded edit count, validation gate, rejected-edit buffer, and epoch-level slow updates.
    • 10:27 — What the trained skills actually say
      Concrete examples of the procedural rules the optimizer writes — from Excel formula evaluation to household-task exploration heuristics.
    • 13:56 — Reading the empirical claims carefully
      Unpacking the '52-for-52' headline, separating gains over no-skill baselines from gains over the best alternative optimizer, and identifying the cleanest results.
    • 17:25 — The edit economy and why compactness matters
      Only a handful of edits are accepted into the final shipped skills across an entire training run, which is direct evidence that the validation gate is doing real work.
    • 20:54 — Steelman critiques
      Selection bias against the validation split, the reward-signal dependency that excludes open-ended generation, and the partial reliance on a strong optimizer model.
    • 24:23 — What changes if this framing catches on
      Treating prompts as first-class optimizable objects, the auditability advantage over fine-tuning, and which questions remain open for messier real-world deployment.
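
The selection-bias critique listed under "Steelman critiques" can be illustrated with a generic simulation. This is a sketch of the general effect, not an analysis from the paper: the function `gate_vs_held_out` and all of its parameters are hypothetical. Every candidate skill here has the same true success rate, yet the best score seen on a small validation split still looks better than a fresh held-out estimate, because the gate keeps whichever candidate got lucky.

```python
import random


def gate_vs_held_out(n_candidates=50, n_val=20, n_test=2000, true_acc=0.5, seed=0):
    """Compare the best validation score among equal candidates with a held-out score."""
    rng = random.Random(seed)

    def estimate(n):
        # Empirical success rate over n simulated tasks, each passing with prob. true_acc.
        return sum(rng.random() < true_acc for _ in range(n)) / n

    # The gate effectively keeps the highest validation score it has seen.
    best_val = max(estimate(n_val) for _ in range(n_candidates))
    # A fresh, larger split that played no part in selection.
    held_out = estimate(n_test)
    return best_val, held_out


best_val, held_out = gate_vs_held_out()
print(f"best validation score: {best_val:.2f}, held-out score: {held_out:.2f}")
```

The gap between the two numbers is pure selection noise in this setup, which is why a score measured on the same split that gated the edits tends to overstate the real gain.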

Recommended Reading
      • TextGrad: Automatic 'Differentiation' via Text — The textual-optimization framework SkillOpt directly benchmarks against, and a clear example of the prior work the episode critiques for lacking validation-gate discipline.
      • GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning — The other main baseline SkillOpt is measured against — useful for seeing what reflective prompt evolution looks like without the bounded-step-size and rejected-edit-buffer machinery the episode highlights.
      • Reflexion: Language Agents with Verbal Reinforcement Learning — An early and influential entry in the self-revision ecosystem the episode situates SkillOpt against, where an agent rewrites its own guidance from failure traces.
      • Self-Refine: Iterative Refinement with Self-Feedback — Another canonical precursor in the LLM-self-improvement line the episode argues was missing optimizer discipline like validation gates and bounded step sizes.

AI Papers: A Deep Dive, by paperdive.ai