
What if we could figure out how large language models really work by getting inside their "heads"?
It's possible, but it relies on a controversial technique called mechanistic interpretability ("mech interp"), also known as the neuroscience of AI.
In this episode, Ihor Kendiukhov — lead researcher at SPAR, a research programme for AI risks — explains mech interp in layperson's terms, and what it's currently able to reveal about the thinking processes of AIs.
We also talk about why mech interp has charted an unusually emotional course in AI research over the past few years, from physicists falling in love with it to prominent AI safety figures throwing shade on it, and where it might be headed next.
Some of the papers and announcements referenced in this episode include:
By Witch of Glitch