Training a Markdown File: When LLM Self-Improvement Borrows the Discipline of Neural Net Training
Source: SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Paper was published on May 22, 2026
This episode was AI-generated on May 25, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
A team at Microsoft moved a single Markdown file between two completely different agent systems and watched spreadsheet performance jump sixty points — no retraining, no code changes. The trick is treating the prompt as a parameter and applying actual optimizer discipline: learning rates, validation gates, rejected-edit buffers, momentum. It's the difference between a chef scribbling in margins and a real test kitchen.
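For readers who want the optimizer analogy in concrete form, here is a minimal Python sketch of three of the ingredients named above: a bounded number of edits per step, a validation gate, and a rejected-edit buffer. It is an illustration under stated assumptions, not SkillOpt's implementation. The class name `GatedSkillTrainer`, the `toy_score` function, and the second rule in `USEFUL_RULES` are hypothetical, and the toy scorer stands in for whatever scalar reward a real benchmark would provide. Momentum and epoch-level slow updates are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GatedSkillTrainer:
    """Hypothetical sketch of gated skill editing, not the paper's code."""
    score: Callable[[str], float]        # scalar reward on a validation split (assumed)
    max_edits_per_step: int = 2          # bounded step size, loosely like a learning rate
    rejected: List[str] = field(default_factory=list)  # rejected-edit buffer

    def step(self, skill: str, proposed_edits: List[str]) -> str:
        baseline = self.score(skill)
        # Skip edits that already failed the gate or are already in the skill.
        fresh = [e for e in proposed_edits
                 if e not in self.rejected and e not in skill]
        # Try at most max_edits_per_step edits in this step.
        for edit in fresh[: self.max_edits_per_step]:
            candidate = skill + "\n" + edit
            candidate_score = self.score(candidate)
            if candidate_score > baseline:
                # Validation gate: keep the edit only if the score strictly improves.
                skill, baseline = candidate, candidate_score
            else:
                self.rejected.append(edit)
        return skill


# Toy reward: the fraction of "useful" rules present in the skill text.
# The first rule is quoted in the takeaways; the second is made up for illustration.
USEFUL_RULES = (
    "write evaluated static values instead of relying on Excel recalculation",
    "check the target cell range before writing",
)


def toy_score(skill: str) -> float:
    return sum(rule in skill for rule in USEFUL_RULES) / len(USEFUL_RULES)


proposals = [USEFUL_RULES[0], "always use bright colors", USEFUL_RULES[1]]
trainer = GatedSkillTrainer(score=toy_score)
skill = "Spreadsheet skill"
skill = trainer.step(skill, proposals)  # first two proposals tried; one kept, one rejected
skill = trainer.step(skill, proposals)  # the rejected edit is skipped; the third is tried
print(skill)
print("rejected:", trainer.rejected)
```

In this toy run the unhelpful edit fails the gate once and is never retried, which is the job the rejected-edit buffer is described as doing.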
Key Takeaways
Why prior LLM self-revision systems mostly fail: they look like optimizers but are missing the structural ingredients (bounded step size, validation gates, persistent failure memory) that make neural net training reliable
How a strict validation gate plus bounded edits combine to keep the rejected-edit buffer meaningful, and why removing the long-horizon machinery costs 22 points on the spreadsheet benchmark
What the trained skill documents actually contain: specific procedural rules like 'write evaluated static values instead of relying on Excel recalculation' that fill a gap between pretrained knowledge and task instances
Why the cross-harness transfer result (Codex to Claude Code, +60 points) is the cleanest evidence that the method captures domain knowledge rather than harness-specific syntax
The selection-bias risk in the validation gate the paper doesn't fully address, plus the method's hard dependency on a reliable scalar reward signal
Why small models gain disproportionately from trained skills, and the economic implication of training once on a frontier model then deploying on cheaper ones

00:00 - The sixty-point transfer result
A Markdown skill file trained in one agent system lifts performance from 22% to 82% when dropped into a completely different one, with no retraining.
04:49 - The chef versus the test kitchen
Why existing LLM self-improvement systems are shaped like optimizers but missing every structural ingredient that makes real optimization work.
06:58 - Five pieces borrowed from neural net training
Walking through SkillOpt's student/optimizer split, bounded edit count, validation gate, rejected-edit buffer, and epoch-level slow updates.
10:27 - What the trained skills actually say
Concrete examples of the procedural rules the optimizer writes, from Excel formula evaluation to household-task exploration heuristics.
13:56 - Reading the empirical claims carefully
Unpacking the '52-for-52' headline, separating gains over no-skill baselines from gains over the best alternative optimizer, and identifying the cleanest results.
17:25 - The edit economy and why compactness matters
Final shipped skills accept only a handful of edits across an entire training run, which is direct evidence the validation gate is doing real work.
20:54 - Steelman critiques
Selection bias against the validation split, the reward-signal dependency that excludes open-ended generation, and the partial reliance on a strong optimizer model.
24:23 - What changes if this framing catches on
Treating prompts as first-class optimizable objects, the auditability advantage over fine-tuning, and which questions remain open for messier real-world deployment.

Recommended Reading
TextGrad: Automatic 'Differentiation' via Text
The textual-optimization framework SkillOpt directly benchmarks against, and a clear example of the prior work the episode critiques for lacking validation-gate discipline.
GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
The other main baseline SkillOpt is measured against, useful for seeing what reflective prompt evolution looks like without the bounded-step-size and rejected-edit-buffer machinery the episode highlights.
Reflexion: Language Agents with Verbal Reinforcement Learning
An early and influential entry in the self-revision ecosystem the episode situates SkillOpt against, where an agent rewrites its own guidance from failure traces.
Self-Refine: Iterative Refinement with Self-Feedback
Another canonical precursor in the LLM-self-improvement line the episode argues was missing optimizer discipline like validation gates and bounded step sizes.