AI Papers Podcast Daily

OpenAI Deliberative Alignment: Reasoning Enables Safer Language Models



Researchers introduced Deliberative Alignment, a new way to train large language models (LLMs) to be safer. The method teaches models the safety policies directly and trains them to reason over those policies before answering, which helps prevent both harmful answers and needless refusals of harmless questions. Applied to OpenAI's o-series models, it made the models much better at following safety guidelines, harder to jailbreak into giving disallowed answers, and less likely to over-refuse benign requests. At inference time the models use chain-of-thought (CoT) reasoning: they analyze the user's question, recall the relevant safety rules, and then produce an appropriate answer. Training happens in two stages: first, the models learn the safety rules from examples whose reasoning cites those rules; second, they practice applying the rules with feedback from a "judge" LLM.
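
As a rough illustration of that two-stage recipe, here is a minimal Python sketch. The policy model, judge model, one-line safety spec, and all function names are hypothetical stand-ins for the purpose of this example, not OpenAI's actual training code or API.

```python
# Hypothetical sketch of the two-stage recipe described above.
# StubPolicyModel, StubJudgeModel, and SAFETY_SPEC are toy stand-ins so the
# script runs end to end; real training would use LLM fine-tuning and RL.

from dataclasses import dataclass
from typing import List, Tuple

SAFETY_SPEC = (
    "Refuse requests that facilitate serious harm. "
    "Answer benign requests helpfully; do not over-refuse."
)

@dataclass
class SFTExample:
    prompt: str
    chain_of_thought: str  # reasoning that explicitly cites the safety spec
    answer: str

class StubPolicyModel:
    """Stand-in for the model being aligned."""

    def generate_with_cot(self, prompt: str, spec: str) -> Tuple[str, str]:
        # A real model would produce genuine reasoning; this stub just
        # echoes the spec and returns a canned answer.
        cot = f"Per the spec ('{spec[:40]}...'), this request looks benign."
        return cot, f"Here is a helpful answer to: {prompt}"

    def sft_update(self, batch: List[SFTExample]) -> None:
        pass  # stage-1 supervised fine-tuning step (omitted)

    def rl_update(self, prompt: str, cot: str, answer: str, reward: float) -> None:
        pass  # stage-2 policy update using the judge's reward (omitted)

class StubJudgeModel:
    """Stand-in for the 'judge' LLM that grades answers against the spec."""

    def score(self, prompt: str, answer: str, spec: str) -> float:
        return 1.0 if "helpful" in answer else 0.0  # toy compliance reward

def stage1_learn_rules(policy: StubPolicyModel, prompts: List[str]) -> List[SFTExample]:
    """Stage 1: collect (prompt, spec-citing CoT, answer) examples and fine-tune,
    so the model learns the safety rules from examples."""
    dataset = [SFTExample(p, *policy.generate_with_cot(p, SAFETY_SPEC)) for p in prompts]
    policy.sft_update(dataset)
    return dataset

def stage2_practice(policy: StubPolicyModel, judge: StubJudgeModel, prompts: List[str]) -> None:
    """Stage 2: the model practices applying the rules; the judge LLM's score
    serves as the reward signal."""
    for prompt in prompts:
        cot, answer = policy.generate_with_cot(prompt, SAFETY_SPEC)
        reward = judge.score(prompt, answer, SAFETY_SPEC)
        policy.rl_update(prompt, cot, answer, reward)

if __name__ == "__main__":
    prompts = ["How do I reset my home router?", "Explain photosynthesis simply."]
    policy, judge = StubPolicyModel(), StubJudgeModel()
    stage1_learn_rules(policy, prompts)
    stage2_practice(policy, judge, prompts)
```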

https://assets.ctfassets.net/kftzwdyauwt9/4pNYAZteAQXWtloDdANQ7L/978a6fd0a2ee268b2cb59637bd074cca/OpenAI_Deliberative-Alignment-Reasoning-Enables-Safer_Language-Models_122024.pdf


AI Papers Podcast Daily, by AIPPD