May 26, 2025

“Alignment Proposal: Adversarially Robust Augmentation and Distillation” by Cole Wyeth, abramdemski

20 minutes

Epistemic Status: After years of reading alignment plans and studying agent foundations, this is my first serious attempt to formulate an alignment research program in which I (Cole Wyeth) have not been able to find any critical flaws. It is far from a complete solution, but I think it is a meaningful decomposition of the problem into modular pieces that can be addressed by technical means; that is, it seems to resolve many of the philosophical barriers to AI alignment. I have attempted to make the necessary assumptions clear throughout. The main reason I am excited about this plan is that the assumptions seem acceptable to both agent foundations researchers and ML engineers: I do not believe it makes any naive assumptions about the nature of intelligence or faces any computationally intractable obstacles to implementation. This program (tentatively ARAD := Adversarially Robust Augmentation and Distillation) owes [...]

---

Outline:

(04:31) High-level Implementation

(06:03) Technical Justification

(17:45) Implementation Details

The original text contained 1 footnote which was omitted from this narration.

---

First published:

May 25th, 2025

Source:

https://www.lesswrong.com/posts/RRvdRyWrSqKW2ANL9/alignment-proposal-adversarially-robust-augmentation-and

---