May 25, 2026

Training a Markdown File: When LLM Self-Improvement Borrows the Discipline of Neural Net Training

27 minutes

Source: SkillOpt: Executive Strategy for Self-Evolving Agent Skills

The paper was published on May 22, 2026.

This episode was AI-generated on May 25, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

A team at Microsoft moved a single Markdown file between two completely different agent systems and watched spreadsheet performance jump sixty points — no retraining, no code changes. The trick is treating the prompt as a parameter and applying actual optimizer discipline: learning rates, validation gates, rejected-edit buffers, momentum. It's the difference between a chef scribbling in margins and a real test kitchen.

Key Takeaways

Why prior LLM self-revision systems mostly fail: they look like optimizers but are missing the structural ingredients — bounded step size, validation gates, persistent failure memory — that make neural net training reliable

How a strict validation gate plus bounded edits combine to keep the rejected-edit buffer meaningful, and why removing the long-horizon machinery costs 22 points on the spreadsheet benchmark

What the trained skill documents actually contain — specific procedural rules like 'write evaluated static values instead of relying on Excel recalculation' that fill a gap between pretrained knowledge and task instances

Why the cross-harness transfer result (Codex to Claude Code, +60 points) is the cleanest evidence that the method captures domain knowledge rather than harness-specific syntax

The selection-bias risk in the validation gate the paper doesn't fully address, plus the method's hard dependency on a reliable scalar reward signal

Why small models gain disproportionately from trained skills — and the economic implication of training once on a frontier model then deploying on cheaper ones

00:00 — The sixty-point transfer result
A Markdown skill file trained in one agent system lifts performance from 22% to 82% when dropped into a completely different one, with no retraining.

04:49 — The chef versus the test kitchen
Why existing LLM self-improvement systems are shaped like optimizers but missing every structural ingredient that makes real optimization work.

06:58 — Five pieces borrowed from neural net training
Walking through SkillOpt's student/optimizer split, bounded edit count, validation gate, rejected-edit buffer, and epoch-level slow updates.

10:27 — What the trained skills actually say
Concrete examples of the procedural rules the optimizer writes — from Excel formula evaluation to household-task exploration heuristics.

13:56 — Reading the empirical claims carefully
Unpacking the '52-for-52' headline, separating gains over no-skill baselines from gains over the best alternative optimizer, and identifying the cleanest results.

17:25 — The edit economy and why compactness matters
Final shipped skills contain only a handful of accepted edits from an entire training run, which is direct evidence that the validation gate is doing real work.

20:54 — Steelman critiques
Selection bias toward the validation split, the reward-signal dependency that excludes open-ended generation, and the partial reliance on a strong optimizer model.

24:23 — What changes if this framing catches on
Treating prompts as first-class optimizable objects, the auditability advantage over fine-tuning, and which questions remain open for messier real-world deployment.
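
For readers who want a concrete picture of the pieces named in the 06:58 chapter, here is a minimal Python sketch of three of them: a bounded edit size, a strict validation gate, and a rejected-edit buffer. It is an illustration under stated assumptions, not the paper's implementation. The names `train_skill`, `toy_score`, `max_rules_per_edit`, and the rule strings are all hypothetical; the scoring function stands in for running a student agent on a validation split and reading back a scalar reward; and the candidate edits stand in for proposals from an optimizer model. Epoch-level slow updates are left out because these notes do not describe their mechanics.

```python
from typing import Callable, List, Sequence, Tuple

Skill = Tuple[str, ...]   # a skill document, modeled as an ordered tuple of rules
Edit = Tuple[str, ...]    # one proposed edit: a small batch of rules to append


def train_skill(
    skill: Skill,
    candidate_edits: Sequence[Edit],
    score: Callable[[Skill], float],
    max_rules_per_edit: int = 2,
) -> Tuple[Skill, List[Edit], List[Edit]]:
    """Try candidate edits in order, keeping only those that pass the gate."""
    accepted: List[Edit] = []
    rejected: List[Edit] = []      # rejected-edit buffer (persistent failure memory)
    best = score(skill)
    for edit in candidate_edits:
        if edit in rejected:
            continue               # do not retry an edit that already failed
        if len(edit) > max_rules_per_edit:
            rejected.append(edit)  # bounded step size: oversized edits are refused
            continue
        trial = skill + edit
        trial_score = score(trial)
        if trial_score > best:     # strict gate: the validation score must improve
            skill, best = trial, trial_score
            accepted.append(edit)
        else:
            rejected.append(edit)
    return skill, accepted, rejected


# Toy validation score (an assumption for this sketch): the fraction of
# hypothetical "useful" rules present, minus a small penalty per rule.
USEFUL = {"write static values", "check sheet names"}


def toy_score(skill: Skill) -> float:
    hits = len(USEFUL & set(skill))
    return hits / len(USEFUL) - 0.01 * len(skill)


edits = [
    ("write static values",),
    ("always use bold headers",),   # no gain on validation, so it is rejected
    ("always use bold headers",),   # already in the buffer, so it is skipped
    ("a", "b", "c"),                # exceeds the edit bound, so it is refused
    ("check sheet names",),
]
final_skill, accepted, rejected = train_skill((), edits, toy_score)
print(final_skill)
print(len(accepted), len(rejected))
```

In this toy run only two of the five proposals end up in the skill. The small per-rule penalty is one simple way to make an unhelpful rule fail the gate rather than tie; a real system would presumably rely on measured task reward instead.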