Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some Thoughts on Virtue Ethics for AIs, published by peligrietzer on May 2, 2023 on LessWrong.
This post argues for the desirability and plausibility of AI agents whose values have a structure I call ‘praxis-based.’ The idea draws on various aspects of virtue ethics, and basically amounts to an RL-flavored take on that philosophical tradition.
Praxis-based values as I define them are, informally, reflective decision-influences matching the description ‘promote x x-ingly’: ‘promote peace peacefully,’ ‘promote corrigibility corrigibly,’ ‘promote science scientifically.’
I will later propose a quasi-formal definition of this type of value, but the general idea is that certain values are an ouroboros of means and end. Such values frequently come up in human “meaning of life” activities (e.g. math, art, craft, friendship, athletics, romance, technology), as well as in complex forms of human morality (e.g. peace, democracy, compassion, respect, honesty). While this is already an indirect reason to suspect that a human-aligned AI should have ‘praxis-based’ values, there is also a central direct reason: traits such as corrigibility, transparency, and niceness can only function properly in the form of ‘praxis-based’ values.
It’s widely accepted that if early strategically aware AIs possess values like corrigibility, transparency, and perhaps niceness, further alignment efforts are much more likely to succeed. But values like corrigibility or transparency or niceness don’t easily fit into an intuitively consequentialist form like ‘maximize lifetime corrigible behavior’ or ‘maximize lifetime transparency.’ In fact, an AI valuing its own corrigibility or transparency or niceness in an intuitively consequentialist way can lead to extreme power-seeking whereby the AI violently remakes the world to (at a minimum) protect itself from the risk that humans will modify said value.
On the other hand, constraints or taboos or purely negative values (a.k.a. ‘deontological restrictions’) are widely believed to be weak, in the sense that an advanced AI will come to work around them or uproot them: ‘never lie’ or ‘never kill’ or ‘never refuse a direct order from the president’ are poor substitutes for active transparency, niceness, and corrigibility.
The idea of ‘praxis-based’ values is meant to capture the normal, sensible way we want an agent to value corrigibility or transparency or niceness, which intuitively-consequentialist values and deontology both fail to capture. We want an agent that (e.g.) actively tries to be transparent, and to cultivate its own future transparency and its own future valuing of transparency, but that will not (for instance) engage in deception and plotting when it expects a high future-transparency payoff.
Having lightly motivated the idea that ‘praxis-based’ values are desirable from an alignment point of view, the rest of this post will survey key premises of the hypothesis that ‘praxis-based’ values are a viable alignment goal. I’m going to assume an agent with some form of online reinforcement learning going on, and draw on ‘shards’ talk pretty freely.
I informally described a ‘praxis-based’ value as having the structure 'promote x x-ingly.' Here is a rough formulation of what I mean, put in terms of a utility-theoretic description of a shard that implements an alignment-enabling value x:
Actions (or more generally 'computations') get an x-ness rating. We define the x shard's expected utility conditional on a candidate action a as the sum of two utility functions: a bounded utility function on the x-ness of a and a more tightly bounded utility function on the expected aggregate x-ness of the agent's future actions conditional on a. (So the shard will choose an action with mildly suboptimal x-ness if it gives a big boost to expected aggregate future x-ness, but r...
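In symbols, one way to cash this out (the particular notation is only illustrative): let $x(a)$ be the x-ness rating of a candidate action $a$, and let $X_{\mathrm{fut}}$ be the aggregate x-ness of the agent's future actions. Then the x shard's expected utility conditional on $a$ is

$$U_x(a) \;=\; u\big(x(a)\big) \;+\; v\big(\mathbb{E}[X_{\mathrm{fut}} \mid a]\big),$$

where $u$ is a bounded utility function and $v$ is a more tightly bounded one (its range is narrower than $u$'s), so that a large gain in expected future x-ness can justify only a mild sacrifice in the x-ness of the present action.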