1. Summary and overview
LLMs seem to lack metacognitive skills that help humans catch errors. Improvements to those skills might be net positive for alignment, despite improving capabilities in new directions.
Better metacognition would reduce LLM errors by catching mistakes, and by managing complex cognition to produce better answers in the first place. This could stabilize or regularize alignment, allowing systems to avoid actions they would not "endorse on reflection" (in some functional sense).[1] Better metacognition could also make LLM systems useful for clarifying the conceptual problems of alignment. It would reduce sycophancy, and help LLMs organize the complex thinking necessary for clarifying claims and cruxes in the literature.
Without such improvements, collaborating with LLM systems on alignment research could be the median doom-path: slop, not scheming. Current LLMs are sycophantic, agreeing with their users too much, and they produce compelling-but-erroneous "slop". Human brains produce slop and sycophancy, too, but we have metacognitive skills, mechanisms, and strategies to catch those errors. Considering our own metacognitive skills gives some insight into how such skills might be developed for LLMs, and how they might help with alignment (§6, §7).
I'm not advocating for this. I'm noting that work is underway, noting the potential for [...]
---
Outline:
(00:13) 1. Summary and overview
(04:50) 2. Human metacognitive skills and why we don't notice them
(08:01) 2.1. Brain mechanisms for metacognition
(10:49) 3. Why we might expect LLMs' metacognitive skills to lag humans
(13:28) 4. Evidence that LLM metacognition lags humans
(17:02) 5. Current approaches to improving metacognition in reasoning models
(23:11) 6. Improved metacognition would reduce slop and errors in human/AI teamwork on conceptual alignment
(25:02) Rationalist LLM systems for research
(26:38) Better LLM systems for alignment
(29:12) 7. Improved metacognition would improve alignment stability
(33:23) 8. Conclusion
The original text contained 6 footnotes which were omitted from this narration.
---