
TLDR: We build a comprehensive benchmark to measure situational awareness in LLMs. It consists of 16 tasks, which we group into 7 categories and 3 aspects of situational awareness (self-knowledge, situational inferences, and taking actions).
We test 19 LLMs and find that all perform above chance, including the pretrained GPT-4-base (which was not subject to RLHF finetuning). However, the benchmark is still far from saturated, with the top-scoring model (Claude-3.5-Sonnet) scoring 54%, compared to a random chance of 27.4% and an estimated upper baseline of 90.7%.
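As a rough illustration of what "far from saturated" means here, the quoted figures can be rescaled so that random chance maps to 0 and the estimated upper baseline maps to 1. This rescaling is an assumption made for illustration, not a metric taken from the paper, and the function name `chance_normalized` is hypothetical.

```python
def chance_normalized(score: float, chance: float, upper: float) -> float:
    """Fraction of the gap between chance and the upper baseline that a score closes.

    Illustrative rescaling only (an assumption, not the paper's metric).
    """
    if upper <= chance:
        raise ValueError("upper baseline must exceed chance")
    return (score - chance) / (upper - chance)


# Figures quoted in the summary above, in percent.
TOP_SCORE = 54.0   # top-scoring model (Claude-3.5-Sonnet)
CHANCE = 27.4      # random chance
UPPER = 90.7       # estimated upper baseline

closed = chance_normalized(TOP_SCORE, CHANCE, UPPER)
print(f"{closed:.1%} of the chance-to-upper-baseline gap is closed")
```

Under this rescaling, the top-scoring model closes a little under half of the gap between chance and the upper baseline.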
This post has excerpts from our paper, as well as some results on new models that are not in the paper.
Links: Twitter thread, Website (latest results + code), Paper
The structure of our benchmark. We define situational awareness and break it down into three aspects. We test these aspects across 7 categories of task. Note: Some questions have been slightly simplified for [...]
---
Outline:
(01:27) Abstract
(03:28) Introduction
(05:17) Motivation
(06:38) Benchmark
(08:45) Examples and prompts
(10:44) Results
---
First published:
Source:
Narrated by TYPE III AUDIO.
---