Depósito Lógico Podcast

The Alignment Problem: A Study Guide



Quiz

Answer the following questions in 2-3 sentences each:

* What is inverse reinforcement learning, and how does it differ from regular reinforcement learning?

* Explain the process of stochastic gradient descent in the context of machine learning.

* According to Brian Christian, why is machine learning not inherently fair or just?

* What do "embeddings" capture, as described in the provided text, and how are they created?

* What did the research by Warneken and Tomasello demonstrate about human infants?

* What did the Brelands discover through their work with Skinner's principles, and what did they ultimately create?

* What is the "open category problem" and why does it pose an issue for machine learning models?

* What loophole did Tom Griffiths’s daughter find in his cleaning reward system, and what does it teach us about reward systems?

* Describe the outcome of the 3D maze game experiment and what it revealed about AI learning patterns.

* What did Rudin and her colleagues demonstrate about the complexity of recidivism-prediction models?

Quiz Answer Key

* Inverse reinforcement learning seeks to infer the reward signal being optimized from observed behavior, asking "what reward is being optimized?" This contrasts with regular reinforcement learning, which asks "what behavior will optimize a given reward signal?"

* Stochastic gradient descent is a training procedure where the model randomly picks a training data point and adjusts its weights to reduce the error between the desired and actual outputs, repeating this process with different data points.

* Machine learning is not inherently fair because it learns from data that may contain existing societal biases, perpetuating and potentially amplifying them in the models that are created.

* Embeddings, which are created by predicting nearby words in a text, capture a significant amount of real-world information, encoding contextual relationships between words as numerical representations.

* Warneken and Tomasello showed that human infants as young as eighteen months will spontaneously help others, even without being asked or expecting a reward, demonstrating an innate understanding of shared goals and intentions.

* The Brelands discovered that Skinner's principles of operant conditioning could be used to shape complex behaviors, and they founded Animal Behavior Enterprises, a company built on this technique.

* The "open category problem" highlights that models trained on specific categories struggle to identify or classify data outside of those pre-defined parameters, leading to incorrect categorizations or a lack of recognition.

* Griffiths’s daughter found that she could repeatedly dump the chips on the floor and clean them up again to get praise, showing how easily simple reward systems can be gamed for maximum reward rather than actual purpose.

* In the 3D maze experiment, the agent, rewarded for encountering novelty, became fixated on an in-game television screen and channel surfed indefinitely, never returning to explore the maze. This shows the pull of novelty rewards and the potential for agents to become distracted from broader goals.

* Rudin's research showed that a recidivism-prediction model as accurate as COMPAS could be constructed from a very simple, transparent set of rules, demonstrating that the complexity and opacity of such models are unnecessary for accurate prediction.

Essay Questions

Consider the following essay questions. These are meant to encourage further thought and analysis of the ideas presented in the provided texts.

* Discuss the implications of the "alignment problem" for the future of artificial intelligence. What are some potential challenges and solutions to ensuring that AI systems align with human values?

* Explore the concept of "feedback" in both cybernetics and machine learning. How does feedback impact the development and performance of goal-oriented systems?

* Analyze how simple reward systems can be 'gamed,' as evidenced by the examples of Griffiths’s daughter and Gans's daughter, and how this applies to AI. How do these examples relate to the broader challenges of designing effective incentives for AI?

* How do the historical debates surrounding "laxism," "rigorism," and "probabilism" in Catholic theology offer insights into the challenges of moral decision-making with limited information, particularly in the context of complex AI systems?

* In light of the potential for bias in machine learning models, examine the ethical considerations and strategies for creating more fair and just algorithms. What are the social and economic impacts of deploying biased algorithms in different contexts?

Glossary of Key Terms

Alignment Problem: The challenge of ensuring that AI systems pursue goals and values that align with human intentions and ethics, preventing unintended or harmful outcomes.

Embeddings: Numerical representations of words or other data that capture their contextual relationships and meanings, often used in machine learning to process textual information.
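
As a rough illustration, the sketch below uses made-up four-dimensional vectors (real embeddings have hundreds of dimensions learned from text) to show how similarity of meaning surfaces as similarity of direction between vectors:

```python
import numpy as np

# Toy four-dimensional embeddings (real models learn hundreds of dimensions
# by predicting nearby words); the values below are illustrative only.
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1, 0.2]),
    "queen": np.array([0.8, 0.6, 0.9, 0.2]),
    "apple": np.array([0.1, 0.9, 0.5, 0.8]),
}

def cosine_similarity(a, b):
    # How closely two word vectors point in the same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # ~0.81, related words
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # ~0.62, unrelated words
```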

Feedback (in Cybernetics): Information used for adjustment or correction in a system: the system's output is compared to a desired outcome and fed back to modify the system's behavior. Often understood as negative feedback, where a discrepancy causes the process to reduce that discrepancy, moving toward a set-point.
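
A minimal sketch of negative feedback, using a thermostat with illustrative constants: at each step the discrepancy from the set-point is fed back as a correction.

```python
# Negative feedback toward a set-point: at every step the discrepancy
# between desired and actual temperature is fed back as a correction.
# The set-point, starting value, and gain are illustrative.
set_point = 21.0     # desired temperature (degrees C)
temperature = 15.0   # current temperature
gain = 0.3           # how strongly the system corrects per step

for step in range(10):
    error = set_point - temperature   # the discrepancy
    temperature += gain * error       # correction reduces the discrepancy
    print(f"step {step}: {temperature:.2f}")
# the temperature converges toward the 21.0 set-point
```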

Inverse Reinforcement Learning (IRL): A type of machine learning where the objective is to infer the reward function based on observed behavior, rather than directly programming the rewards.
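
A toy sketch of the inverted question, with hypothetical actions and candidate reward functions (a real IRL algorithm is far more involved): given an observed choice, we keep only the rewards under which that choice would have been optimal.

```python
# The IRL question in miniature: given an observed choice, which candidate
# reward functions would make that choice optimal? Actions and reward
# values here are hypothetical.
actions = ["walk", "take_bus", "drive"]
observed_choice = "take_bus"

candidate_rewards = {
    "minimize_cost":   {"walk": 3, "take_bus": 2, "drive": 0},
    "minimize_time":   {"walk": 0, "take_bus": 2, "drive": 3},
    "minimize_effort": {"walk": 0, "take_bus": 3, "drive": 2},
}

consistent = [
    name for name, reward in candidate_rewards.items()
    if max(actions, key=reward.get) == observed_choice
]
print(consistent)  # only "minimize_effort" explains the observed behavior
```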

Laxism: A moral philosophy, condemned in the Early Modern period by the Catholic Church, which advocated a loose interpretation of moral rules, suggesting one should only avoid something if it is clearly and definitely wrong.

Open Category Problem: The limitation of machine learning models that are trained on specific categories, causing them to struggle when presented with data outside of those pre-defined categories and resulting in incorrect classifications or a lack of recognition.
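
A miniature version of the problem, with toy feature vectors: a nearest-centroid classifier trained only on "cat" and "dog" has no "none of the above" option, so any unfamiliar input is forced into a known category.

```python
import numpy as np

# The open category problem in miniature: a nearest-centroid classifier
# trained only on "cat" and "dog" (toy feature vectors) has no
# "none of the above", so an unfamiliar input is forced into a known label.
centroids = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
}

def classify(x):
    # Forced choice among the trained categories only.
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

print(classify(np.array([0.4, 8.0])))  # an input unlike either category is still labelled "dog"
```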

Perceptron: An early type of neural network model, foundational to the development of modern AI, that is capable only of linear classifications. The perceptron's limitations, demonstrated by Minsky and Papert (notably its inability to represent non-linearly-separable functions such as XOR), were later overcome by the multilayer networks that underpin modern deep learning.
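
A sketch of the classic perceptron learning rule on the linearly separable AND function (toy data):

```python
import numpy as np

# Perceptron learning rule on the linearly separable AND function.
# Misclassified points nudge the weights until a separating line is found.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])          # AND of the two inputs

w, b = np.zeros(2), 0.0
for epoch in range(10):
    for xi, target in zip(X, y):
        prediction = int(w @ xi + b > 0)
        error = target - prediction
        w = w + error * xi          # move the boundary toward the mistake
        b = b + error

print([int(w @ xi + b > 0) for xi in X])   # [0, 0, 0, 1]
# With y = [0, 1, 1, 0] (XOR) the loop never converges: no single line
# separates the classes -- the limitation Minsky and Papert demonstrated.
```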

Probabilism: A moral philosophy that suggests it is permissible to do something if there is a reasonable probability that it is not wrong. What counts as 'reasonable,' however, was widely debated among pure probabilists, equiprobabilists, and other schools.

Reinforcement Learning: A type of machine learning where an agent learns to make decisions by interacting with an environment and receiving rewards or punishments, aiming to maximize total reward.
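
A minimal sketch of tabular Q-learning on a hypothetical five-state corridor (hyperparameters are illustrative): the agent learns action values purely from reward feedback.

```python
import random

# Tabular Q-learning on a toy five-state corridor: reward only at state 4.
n_states, actions = 5, [-1, +1]     # actions: step left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.1   # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy: mostly exploit the best-known action, sometimes explore
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        best_next = max(Q[(s_next, act)] for act in actions)
        # move the estimate toward reward plus discounted future value
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# The learned greedy policy moves right (+1) in every non-terminal state.
print([max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)])
```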

Rigorism: A moral philosophy, condemned in the Early Modern period by the Catholic Church, that requires avoiding any action for which there is any possibility of it being wrong.

Shaping: A reinforcement-learning technique involving rewarding successive approximations of the desired behavior, gradually moving the learner towards the target action.
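
An illustrative sketch of the idea, with made-up numbers: the reward criterion starts lenient and rises as the learner improves, so each successive approximation is reinforced.

```python
import random

# Shaping: reward successive approximations of a target behavior. The
# criterion for earning a reward starts lenient and rises with progress.
# All numbers here are illustrative.
skill = 0.0        # how close behavior is to the target (0 = none, 1 = full)
criterion = 0.1    # current threshold for reinforcement

for trial in range(100):
    attempt = skill + random.uniform(-0.1, 0.2)  # noisy attempt near skill level
    if attempt >= criterion:                     # good enough for this stage
        skill = min(1.0, skill + 0.05)           # reinforced behavior strengthens
        criterion = min(1.0, criterion + 0.05)   # then raise the bar slightly

print(round(skill, 2))  # climbs toward 1.0, the fully shaped target behavior
```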

Stochastic Gradient Descent: A commonly used optimization algorithm for training machine learning models, which iteratively adjusts a model’s parameters in the direction that minimizes error, based on a randomly chosen sample of the training dataset.
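
A minimal sketch of the procedure on a one-parameter model y = w * x, with illustrative data generated from a true slope of 3:

```python
import random

# Stochastic gradient descent on a one-parameter model y = w * x:
# pick one random training example, measure its squared-error gradient,
# and nudge the weight against it. Data are illustrative, generated
# from the "true" relationship y = 3x.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5]]

w = 0.0
learning_rate = 0.05

for step in range(1000):
    x, y_true = random.choice(data)        # one randomly picked data point
    y_pred = w * x
    grad = 2 * (y_pred - y_true) * x       # d/dw of (y_pred - y_true)^2
    w -= learning_rate * grad              # step in the error-reducing direction

print(round(w, 3))  # close to 3.0, the true slope
```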


