AXRP - the AI X-risk Research Podcast
By Daniel Filan
4.7 (77 ratings)
The podcast currently has 45 episodes available.
You may have heard of singular learning theory, and its "local learning coefficient", or LLC - but have you heard of the refined LLC? In this episode, I chat with Jesse Hoogland about his work on SLT, and using the refined LLC to find a new circuit in language models.
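For context (a standard result from singular learning theory rather than anything specific to this episode's notes), the learning coefficient is the coefficient of the log n term in Watanabe's asymptotic expansion of the Bayesian free energy; the "local" version evaluates it around a particular set of weights w*:

```latex
% Asymptotic expansion of the Bayesian free energy F_n on n samples
% around a local optimum w^* (Watanabe's singular learning theory).
% L_n(w^*) is the empirical loss and \lambda is the (local) learning coefficient;
% for regular models \lambda = d/2, but for singular models such as neural
% networks it can be much smaller, so it acts as an effective complexity measure.
F_n \;=\; n\,L_n(w^*) \;+\; \lambda \log n \;+\; O_p(\log \log n)
```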
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/11/27/38_2-jesse-hoogland-singular-learning-theory.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
00:34 - About Jesse
01:49 - The Alignment Workshop
02:31 - About Timaeus
05:25 - SLT that isn't developmental interpretability
10:41 - The refined local learning coefficient
14:06 - Finding the multigram circuit
Links:
Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient: https://arxiv.org/abs/2410.02984
Investigating the learning coefficient of modular addition: hackathon project: https://www.lesswrong.com/posts/4v3hMuKfsGatLXPgt/investigating-the-learning-coefficient-of-modular-addition
Episode art by Hamish Doodles: hamishdoodles.com
Road lines, street lights, and licence plates are examples of infrastructure used to ensure that roads operate smoothly. In this episode, Alan Chan talks about using similar interventions to help avoid bad outcomes from the deployment of AI agents.
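As a purely illustrative sketch of the "agent IDs" idea from the timestamps below (hypothetical names and fields throughout, not taken from the episode or the linked papers), one piece of agent infrastructure is an identifier attached to every action an agent takes in the world, much like a licence plate:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class AgentID:
    """Hypothetical identifier attached to an AI agent's outgoing requests."""
    agent_id: str      # unique ID for this agent instance
    deployer: str      # organization responsible for the deployment
    base_model: str    # underlying model, useful for incident attribution
    contact_url: str   # where to report problems caused by this agent

def tag_request(payload: dict, ident: AgentID) -> dict:
    """Attach agent-ID metadata to a request so counterparties can trace it."""
    return {"x-agent-id": json.dumps(asdict(ident)), **payload}

# Example: a booking request that can be traced back to its deployer.
ident = AgentID("agent-0042", "ExampleCorp", "some-llm-v1", "https://example.com/report")
print(tag_request({"action": "book_flight", "to": "SFO"}, ident))
```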
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/11/16/episode-38_1-alan-chan-agent-infrastructure.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
01:02 - How the Alignment Workshop is going
01:32 - Agent infrastructure
04:57 - Why agent infrastructure
07:54 - A trichotomy of agent infrastructure
13:59 - Agent IDs
18:17 - Agent channels
20:29 - Relation to AI control
Links:
Alan on Google Scholar: https://scholar.google.com/citations?user=lmQmYPgAAAAJ&hl=en&oi=ao
IDs for AI Systems: https://arxiv.org/abs/2406.12137
Visibility into AI Agents: https://arxiv.org/abs/2401.13138
Episode art by Hamish Doodles: hamishdoodles.com
Do language models understand the causal structure of the world, or do they merely note correlations? And what happens when you build a big AI society out of them? In this brief episode, recorded at the Bay Area Alignment Workshop, I chat with Zhijing Jin about her research on these questions.
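As a toy illustration of the correlation-versus-causation distinction (my own example, not from the episode): a shared cause makes two variables correlate strongly even though neither causes the other, and an intervention reveals the difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Confounder Z causes both X and Y; X and Y have no direct causal link.
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)
y = z + 0.5 * rng.normal(size=n)

print("corr(X, Y):", round(np.corrcoef(x, y)[0, 1], 3))  # ~0.8, despite no X->Y edge

# Intervening on X (setting it independently of Z) breaks the correlation,
# which is the kind of distinction a purely correlational learner can miss.
x_do = rng.normal(size=n)
print("corr(do(X), Y):", round(np.corrcoef(x_do, y)[0, 1], 3))  # ~0.0
```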
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/11/14/episode-38_0-zhijing-jin-llms-causality-multi-agent-systems.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
00:35 - How the Alignment Workshop is going
00:47 - How Zhijing got interested in causality and natural language processing
03:14 - Causality and alignment
06:21 - Causality without randomness
10:07 - Causal abstraction
11:42 - Why LLM causal reasoning?
13:20 - Understanding LLM causal reasoning
16:33 - Multi-agent systems
Links:
Zhijing's website: https://zhijing-jin.com/fantasy/
Zhijing on X (aka Twitter): https://x.com/zhijingjin
Can Large Language Models Infer Causation from Correlation?: https://arxiv.org/abs/2306.05836
Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents: https://arxiv.org/abs/2404.16698
Episode art by Hamish Doodles: hamishdoodles.com
Epoch AI is the premier organization that tracks the trajectory of AI - how much compute is used, the role of algorithmic improvements, the growth in data used, and when these trends might come to an end. In this episode, I speak with the director of Epoch AI, Jaime Sevilla, about how compute, data, and algorithmic improvements are impacting AI, and whether continuing to scale can get us AGI.
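For a sense of what a 4-5x-per-year growth rate (the figure in the Epoch AI post linked below) means when compounded, a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope: how much training compute grows if the trend in the
# linked Epoch AI post (roughly 4-5x per year) continues for a few years.
for growth_per_year in (4.0, 5.0):
    for years in (1, 3, 5):
        multiplier = growth_per_year ** years
        print(f"{growth_per_year:.0f}x/year for {years} year(s): ~{multiplier:,.0f}x total")
# At 4x/year the total is ~1,000x after 5 years; at 5x/year, ~3,000x.
```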
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/10/04/episode-37-jaime-sevilla-forecasting-ai.html
Topics we discuss, and timestamps:
0:00:38 - The pace of AI progress
0:07:49 - How Epoch AI tracks AI compute
0:11:44 - Why does AI compute grow so smoothly?
0:21:46 - When will we run out of computers?
0:38:56 - Algorithmic improvement
0:44:21 - Algorithmic improvement and scaling laws
0:56:56 - Training data
1:04:56 - Can scaling produce AGI?
1:16:55 - When will AGI arrive?
1:21:20 - Epoch AI
1:27:06 - Open questions in AI forecasting
1:35:21 - Epoch AI and x-risk
1:41:34 - Following Epoch AI's research
Links for Jaime and Epoch AI:
Epoch AI: https://epochai.org/
Machine Learning Trends dashboard: https://epochai.org/trends
Epoch AI on X / Twitter: https://x.com/EpochAIResearch
Jaime on X / Twitter: https://x.com/Jsevillamol
Research we discuss:
Training Compute of Frontier AI Models Grows by 4-5x per Year: https://epochai.org/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year
Optimally Allocating Compute Between Inference and Training: https://epochai.org/blog/optimally-allocating-compute-between-inference-and-training
Algorithmic Progress in Language Models [blog post]: https://epochai.org/blog/algorithmic-progress-in-language-models
Algorithmic progress in language models [paper]: https://arxiv.org/abs/2403.05812
Training Compute-Optimal Large Language Models [aka the Chinchilla scaling law paper]: https://arxiv.org/abs/2203.15556
Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data [blog post]: https://epochai.org/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data
Will we run out of data? Limits of LLM scaling based on human-generated data [paper]: https://arxiv.org/abs/2211.04325
The Direct Approach: https://epochai.org/blog/the-direct-approach
Episode art by Hamish Doodles: hamishdoodles.com
Sometimes, people talk about transformers as having "world models" as a result of being trained to predict text data on the internet. But what does this even mean? In this episode, I talk with Adam Shai and Paul Riechers about their work applying computational mechanics, a sub-field of physics studying how to predict random processes, to neural networks.
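For readers new to the topic: the optimal predictor of a hidden Markov process maintains a belief distribution over the hidden states, and the linked paper argues transformers represent the geometry of these belief states. A minimal sketch of that belief update (my own illustration, not the paper's code):

```python
import numpy as np

# Hidden Markov model: T[x] gives P(next_state, emit x | current_state),
# i.e. observation-labelled transition matrices. Rows of T[0] + T[1] sum to 1.
T = {
    0: np.array([[0.5, 0.0], [0.0, 0.1]]),
    1: np.array([[0.0, 0.5], [0.4, 0.5]]),
}

def update_belief(belief: np.ndarray, obs: int) -> np.ndarray:
    """Bayesian filtering step: new belief over hidden states given one observation."""
    unnormalized = belief @ T[obs]
    return unnormalized / unnormalized.sum()

belief = np.array([0.5, 0.5])  # uniform prior over the two hidden states
for obs in [0, 1, 1, 0, 1]:
    belief = update_belief(belief, obs)
    print(obs, belief.round(3))
# The set of beliefs reachable this way (the "mixed-state presentation") is what
# can trace out fractal geometry, and what the paper looks for in the residual stream.
```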
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/09/29/episode-36-adam-shai-paul-riechers-computational-mechanics.html
Topics we discuss, and timestamps:
0:00:42 - What computational mechanics is
0:29:49 - Computational mechanics vs other approaches
0:36:16 - What world models are
0:48:41 - Fractals
0:57:43 - How the fractals are formed
1:09:55 - Scaling computational mechanics for transformers
1:21:52 - How Adam and Paul found computational mechanics
1:36:16 - Computational mechanics for AI safety
1:46:05 - Following Adam and Paul's research
Simplex AI Safety: https://www.simplexaisafety.com/
Research we discuss:
Transformers represent belief state geometry in their residual stream: https://arxiv.org/abs/2405.15943
Transformers represent belief state geometry in their residual stream [LessWrong post]: https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their
Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer: https://www.lesswrong.com/posts/mBw7nc4ipdyeeEpWs/why-would-belief-states-have-a-fractal-structure-and-why
Episode art by Hamish Doodles: hamishdoodles.com
Patreon: https://www.patreon.com/axrpodcast
MATS: https://www.matsprogram.org
Note: I'm employed by MATS, but they're not paying me to make this video.
How do we figure out what large language models believe? In fact, do they even have beliefs? Do those beliefs have locations, and if so, can we edit those locations to change the beliefs? Also, how are we going to get AI to perform tasks so hard that we can't figure out whether it succeeded at them? In this episode, I chat with Peter Hase about his research into these questions.
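One crude way to operationalize "what a model believes" (a hedged illustration of the general idea, not the localization or editing methods discussed in the episode) is to compare the probability a model assigns to "True" versus "False" after a statement:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Crude belief probe: the relative probability of " True" vs " False" as the
# next token after a statement. Illustrative only.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def belief_score(statement: str) -> float:
    prompt = f"Question: Is the following statement true or false?\n{statement}\nAnswer:"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # next-token logits
    true_id = tok.encode(" True")[0]
    false_id = tok.encode(" False")[0]
    probs = torch.softmax(logits[[true_id, false_id]], dim=0)
    return probs[0].item()                       # P(True | answer is True or False)

print(belief_score("Paris is the capital of France."))
print(belief_score("Paris is the capital of Germany."))
```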
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/08/24/episode-35-peter-hase-llm-beliefs-easy-to-hard-generalization.html
Topics we discuss, and timestamps:
0:00:36 - NLP and interpretability
0:10:20 - Interpretability lessons
0:32:22 - Belief interpretability
1:00:12 - Localizing and editing models' beliefs
1:19:18 - Beliefs beyond language models
1:27:21 - Easy-to-hard generalization
1:47:16 - What do easy-to-hard results tell us?
1:57:33 - Easy-to-hard vs weak-to-strong
2:03:50 - Different notions of hardness
2:13:01 - Easy-to-hard vs weak-to-strong, round 2
2:15:39 - Following Peter's work
Peter on Twitter: https://x.com/peterbhase
Peter's papers:
Foundational Challenges in Assuring Alignment and Safety of Large Language Models: https://arxiv.org/abs/2404.09932
Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs: https://arxiv.org/abs/2111.13654
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models: https://arxiv.org/abs/2301.04213
Are Language Models Rational? The Case of Coherence Norms and Belief Revision: https://arxiv.org/abs/2406.03442
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks: https://arxiv.org/abs/2401.06751
Other links:
Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV): https://arxiv.org/abs/1711.11279
Locating and Editing Factual Associations in GPT (aka the ROME paper): https://arxiv.org/abs/2202.05262
Of nonlinearity and commutativity in BERT: https://arxiv.org/abs/2101.04547
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model: https://arxiv.org/abs/2306.03341
Editing a classifier by rewriting its prediction rules: https://arxiv.org/abs/2112.01008
Discovering Latent Knowledge Without Supervision (aka the Collin Burns CCS paper): https://arxiv.org/abs/2212.03827
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision: https://arxiv.org/abs/2312.09390
Concrete problems in AI safety: https://arxiv.org/abs/1606.06565
Rissanen Data Analysis: Examining Dataset Characteristics via Description Length: https://arxiv.org/abs/2103.03872
Episode art by Hamish Doodles: hamishdoodles.com
How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more.
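As a generic sketch of what a capability evaluation loop looks like in the abstract (hypothetical code; METR's actual task standard, linked below, is considerably more involved):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """A hypothetical eval task: instructions plus an automatic success checker."""
    name: str
    instructions: str
    check_success: Callable[[str], bool]  # inspects the agent's final answer/artifact

def run_eval(agent: Callable[[str], str], tasks: list[Task]) -> dict[str, bool]:
    """Run an agent on each task and record whether it succeeded."""
    return {t.name: t.check_success(agent(t.instructions)) for t in tasks}

# Toy example: an "agent" that just echoes its instructions fails the task.
tasks = [Task("count", "Reply with the number of letters in 'risk'.",
              lambda out: "4" in out)]
print(run_eval(lambda instructions: instructions, tasks))  # {'count': False}
print(run_eval(lambda instructions: "4", tasks))           # {'count': True}
```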
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/07/28/episode-34-ai-evaluations-beth-barnes.html
Topics we discuss, and timestamps:
0:00:37 - What is METR?
0:02:44 - What is an "eval"?
0:14:42 - How good are evals?
0:37:25 - Are models showing their full capabilities?
0:53:25 - Evaluating alignment
1:01:38 - Existential safety methodology
1:12:13 - Threat models and capability buffers
1:38:25 - METR's policy work
1:48:19 - METR's relationships with labs
2:04:12 - Related research
2:10:02 - Roles at METR, and following METR's work
Links for METR:
METR: https://metr.org
METR Task Development Guide - Bounty: https://taskdev.metr.org/bounty/
METR - Hiring: https://metr.org/hiring
Autonomy evaluation resources: https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/
Other links:
Update on ARC's recent eval efforts (contains the GPT-4 TaskRabbit captcha story): https://metr.org/blog/2023-03-18-update-on-recent-evals/
Password-locked models: a stress case for capabilities evaluation: https://www.alignmentforum.org/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566
Untrusted smart models and trusted dumb models: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models
AI companies aren't really using external evaluators: https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators
Nobody Knows How to Safety-Test AI (Time): https://time.com/6958868/artificial-intelligence-safety-evaluations-risks/
ChatGPT can talk, but OpenAI employees sure can’t: https://www.vox.com/future-perfect/2024/5/17/24158478/openai-departures-sam-altman-employees-chatgpt-release
Leaked OpenAI documents reveal aggressive tactics toward former employees: https://www.vox.com/future-perfect/351132/openai-vested-equity-nda-sam-altman-documents-employees
Beth on her non-disparagement agreement with OpenAI: https://www.lesswrong.com/posts/yRWv5kkDD4YhzwRLq/non-disparagement-canaries-for-openai?commentId=MrJF3tWiKYMtJepgX
Sam Altman's statement on OpenAI equity: https://x.com/sama/status/1791936857594581428
Episode art by Hamish Doodles: hamishdoodles.com
Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting.
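To make the difficulty concrete, here is a hedged formalization of the general partial-observability setup (not necessarily the paper's exact notation): the human only sees observations of the trajectory, so the feedback signal optimizes the return the human believes occurred rather than the true return.

```latex
% Hedged sketch of reward learning under partial observability.
% \tau: trajectory of states;  O(\tau): what the human actually observes;
% R: true return;  \hat{R}: the human's estimate given only the observations.
\text{RLHF target:}\quad
J_{\text{obs}}(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[\hat{R}\big(O(\tau)\big)\right]
\;\neq\;
\mathbb{E}_{\tau \sim \pi}\!\left[R(\tau)\right] \;=\; J_{\text{true}}(\pi).
% The gap invites policies that make O(\tau) look better than reality
% ("deceptive inflation") or that spend effort making good behaviour
% visible to the evaluator ("overjustification").
```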
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/06/12/episode-33-rlhf-problems-scott-emmons.html
Topics we discuss, and timestamps:
0:00:33 - Deceptive inflation
0:17:56 - Overjustification
0:32:48 - Bounded human rationality
0:50:46 - Avoiding these problems
1:14:13 - Dimensional analysis
1:23:32 - RLHF problems, in theory and practice
1:31:29 - Scott's research program
1:39:42 - Following Scott's research
Scott's website: https://www.scottemmons.com
Scott's X/twitter account: https://x.com/emmons_scott
When Your AIs Deceive You: Challenges With Partial Observability of Human Evaluators in Reward Learning: https://arxiv.org/abs/2402.17747
Other works we discuss:
AI Deception: A Survey of Examples, Risks, and Potential Solutions: https://arxiv.org/abs/2308.14752
Uncertain decisions facilitate better preference learning: https://arxiv.org/abs/2106.10394
Invariance in Policy Optimisation and Partial Identifiability in Reward Learning: https://arxiv.org/abs/2203.07475
The Humble Gaussian Distribution (aka principal component analysis and dimensional analysis): http://www.inference.org.uk/mackay/humble.pdf
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693
Episode art by Hamish Doodles: hamishdoodles.com
What's the difference between a large language model and the human brain? And what's wrong with our theories of agency? In this episode, I chat about these questions with Jan Kulveit, who leads the Alignment of Complex Systems research group.
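For reference, the central quantity in active inference (a standard textbook statement rather than anything specific to this conversation): agents are modelled as minimizing variational free energy, an upper bound on surprisal.

```latex
% Variational free energy for observations o, hidden states s,
% generative model p(o, s), and approximate posterior q(s).
F[q] = \mathbb{E}_{q(s)}\big[\log q(s) - \log p(o, s)\big]
     = D_{\mathrm{KL}}\!\big(q(s)\,\|\,p(s \mid o)\big) - \log p(o)
     \;\geq\; -\log p(o).
% Perception: adjust q to reduce F.  Action: change o (the world) to reduce F.
```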
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
The transcript: axrp.net/episode/2024/05/30/episode-32-understanding-agency-jan-kulveit.html
Topics we discuss, and timestamps:
0:00:47 - What is active inference?
0:15:14 - Preferences in active inference
0:31:33 - Action vs perception in active inference
0:46:07 - Feedback loops
1:01:32 - Active inference vs LLMs
1:12:04 - Hierarchical agency
1:58:28 - The Alignment of Complex Systems group
Website of the Alignment of Complex Systems group (ACS): acsresearch.org
ACS on X/Twitter: x.com/acsresearchorg
Jan on LessWrong: lesswrong.com/users/jan-kulveit
Predictive Minds: Large Language Models as Atypical Active Inference Agents: arxiv.org/abs/2311.10215
Other works we discuss:
Active Inference: The Free Energy Principle in Mind, Brain, and Behavior: https://www.goodreads.com/en/book/show/58275959
Book Review: Surfing Uncertainty: https://slatestarcodex.com/2017/09/05/book-review-surfing-uncertainty/
The self-unalignment problem: https://www.lesswrong.com/posts/9GyniEBaN3YYTqZXn/the-self-unalignment-problem
Mitigating generative agent social dilemmas (aka language models writing contracts for Minecraft): https://social-dilemmas.github.io/
Episode art by Hamish Doodles: hamishdoodles.com