The Nonlinear Library

AF - For alignment, we should simultaneously use multiple theories of cognition and value by Roman Leventov


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: For alignment, we should simultaneously use multiple theories of cognition and value, published by Roman Leventov on April 24, 2023 on The AI Alignment Forum.
This post is a follow-up to "A multi-disciplinary view on AI safety research". I elaborate on some arguments behind this view.
TL;DR: please skim section headings and bolded sentences in the text.
Computationally tractable mathematical models of alignment are bound to be biased and blind to certain aspects of human values
No single mathematical model of human values with orders of magnitude fewer degrees of freedom than an actual human will adequately capture the complexity of value: humans are complex systems and therefore cannot be reduced to a much simpler model.
If the model is complex enough to robustly capture human values, as in whole-brain emulation, then ethical concerns and S-risks arise from actually using such models for alignment, because the model itself may suffer.
Many mathematical theories of human cognition and frameworks for computing (inferring) human values are considered as the basis for alignment, as are process theories of alignment that implicitly rely on a particular mathematical theory even if they don’t infer (humans’ or AIs’) values explicitly: the shard theory (RL-based), Beren Millidge’s recent computational anatomy of human values, cooperative inverse reinforcement learning, Bayesian models and approaches, and various linguistic process theories of alignment that I expect to become very hot this year due to the astonishing success of LLMs. However, since all these theories collapse the complexity of humans (or else they would be equivalent to full human simulations), they are all bound to be incomplete.
Moreover, all these theories are bound to be biased (a form of inductive bias, if you wish), that is, to be relatively blind to specific kinds of human values, or to specific aspects of human nature that we can see as somehow related to (or producing) “values”.
In other words, human values are not only complex in the sense that they are very elaborate. Crucially, human values are also not capturable within a single mathematical framework or ontology for describing them.
From “solving the alignment problem” to engineering the alignment process
The main implication of the above thesis is that we should abandon the frame in which the alignment problem is to be “solved” by a single smart theory that will “crack” it.
I feel that a fair amount of unproductive debate and unproductive allocation of alignment research resources stems from this illusion. People often debate whether this or that theory “can or cannot succeed” (in “solving” alignment, it is implied), or try to find the “best” theory and invest their effort into improving it because it’s the “best bet”.
Instead, we should adopt a portfolio approach. Say theory A captures 90% of the “value complexity” to be aligned; theory B largely overlaps with theory A, but together they capture 95% of value complexity; adding theory C to the mix raises it to 97%, and so on. (Of course, these “percentages” are fictitious and cannot actually be computed.)
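To make the portfolio intuition concrete, here is a minimal, purely illustrative Python sketch: each theory is modelled as a made-up set of “value-complexity aspects” it captures (the 100-aspect universe and the specific sets are invented for illustration, just like the percentages above), and the portfolio’s coverage is the union of those sets.

universe = set(range(100))  # pretend "value complexity" splits into 100 aspects

# Hypothetical coverage sets for theories A, B, and C (invented for illustration)
theory_a = set(range(0, 90))                        # captures 90% on its own
theory_b = set(range(5, 80)) | set(range(90, 95))   # mostly overlaps with A
theory_c = set(range(40, 70)) | set(range(95, 97))  # adds a little more

covered = set()
for name, theory in [("A", theory_a), ("B", theory_b), ("C", theory_c)]:
    gain = len(theory - covered)  # marginal contribution of this theory
    covered |= theory
    print(f"+ theory {name}: +{gain} aspects, portfolio covers {len(covered)}/{len(universe)}")

# Prints:
# + theory A: +90 aspects, portfolio covers 90/100
# + theory B: +5 aspects, portfolio covers 95/100
# + theory C: +2 aspects, portfolio covers 97/100

The shrinking marginal gains (+90, +5, +2) are the diminishing returns discussed below; the point is not the numbers, which are invented, but that adding partially overlapping theories keeps raising coverage in a way that perfecting any single theory does not.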
This is an engineering approach: keep adding extra assurances to the alignment process until all stakeholders of the system agree that its quality (here, the quality of being sufficiently aligned, or alignable, to humans) is assured well enough for production deployment of the system.
When we consider this, it becomes clear that marshalling all effort behind improving a single theory is not optimal, roughly speaking, due to the law of diminishing returns (also, as noted above, a good fraction of the alignment research community’s brainpower goes into “finding” that best theory, on both the individual and the collective level).
Su...