December 19, 2023

AF - Assessment of AI safety agendas: think about the downside risk by Roman Leventov

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Assessment of AI safety agendas: think about the downside risk, published by Roman Leventov on December 19, 2023 on The AI Alignment Forum.

This post is a response to The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda.

You evidently follow a variant of the 80,000 Hours framework for comparing problems by the expected impact of solving them: Neglectedness x Scale (potential upside) x Solvability.

I think for assessing AI safety ideas, agendas, and problems to solve, we should augment the assessment with another factor: the potential for a Waluigi turn, or more prosaically, the uncertainty about the sign of the impact (scale) and, therefore, the risks of solving the given problem or advancing far on the given agenda.

This reminds me of Taleb's mantra that to survive, we need to make many bets, but also limit the downside potential of each bet, i.e., the "ruin potential". See "The Logic of Risk Taking".
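As a rough illustration of the augmented assessment, here is a minimal sketch in Python. The field names `p_negative` and `downside_scale`, the way the sign uncertainty is folded into the score, and all numbers are assumptions made up for this example; nothing here comes from 80,000 Hours or from the AE Studio post.

```python
from dataclasses import dataclass


@dataclass
class Agenda:
    """Hypothetical scoring record; all values are unitless illustration scores."""
    name: str
    neglectedness: float
    scale: float           # potential upside if the impact is positive
    solvability: float
    p_negative: float      # assumed probability that the impact has the wrong sign
    downside_scale: float  # assumed magnitude of harm if the sign is negative


def upside_score(a: Agenda) -> float:
    """Upside-only product: Neglectedness x Scale x Solvability."""
    return a.neglectedness * a.scale * a.solvability


def sign_adjusted_score(a: Agenda) -> float:
    """Same product, with Scale replaced by its sign-weighted expectation."""
    expected_scale = (1 - a.p_negative) * a.scale - a.p_negative * a.downside_scale
    return a.neglectedness * a.solvability * expected_scale


if __name__ == "__main__":
    # Two made-up agendas with the same upside but different sign uncertainty.
    agendas = [
        Agenda("agenda with clear sign", 2.0, 10.0, 0.5, 0.0, 0.0),
        Agenda("agenda with uncertain sign", 2.0, 10.0, 0.5, 0.5, 30.0),
    ]
    for a in agendas:
        print(a.name, upside_score(a), sign_adjusted_score(a))
```

The two example agendas look identical under the upside-only product and differ only once the sign uncertainty is included. A plain expectation like this still understates the "ruin potential" concern, which is about capping the worst case of each bet rather than averaging over it.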

Of the approaches that you listed, some sound risky to me in this respect. In particular, "4. 'Reinforcement Learning from Neural Feedback' (RLNF)" sounds to me like a direct invitation to wireheading. More generally, scaling BCI in any form without falling into a dystopia at some stage is akin to walking a tightrope (at least at the current stage of civilisational maturity, I would say). This speaks to agendas #2 and #3 on your list.

There are similar qualms about AI interpretability. At least four posts on LW warn of its potential risks:

Why and When Interpretability Work is Dangerous

Against Almost Every Theory of Impact of Interpretability

AGI-Automated Interpretability is Suicide

AI interpretability could be harmful?

This speaks to the agenda "9. Neuroscience x mechanistic interpretability" on your list.
