
This is a rough research note where the primary objective was my own learning. I am sharing it because I’d love feedback and I thought the results were interesting.
Introduction
A recent METR paper [1] showed that the length of software engineering tasks that LLMs could successfully complete appeared to be doubling roughly every seven months. I asked the same question for offensive cybersecurity, a domain with distinct skills and unique AI-safety implications.
Using METR's methodology on five cyber benchmarks, with tasks ranging from 0.5 seconds to 25 hours in human-expert estimated time, I evaluated many state-of-the-art model releases from the past five years. I found:
Below I outline the datasets, IRT-based analysis, results and caveats. [...]
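To make the doubling claim in the introduction concrete, here is a minimal sketch of the arithmetic behind a fixed doubling time. It assumes steady exponential growth at the seven-month rate cited from the METR paper for software engineering tasks; the 60-minute starting horizon and the function names are hypothetical, chosen only for illustration, and none of this represents the cyber results of this note.

```python
import math


def horizon_after(h0_minutes: float, months: float, doubling_months: float = 7.0) -> float:
    """Task-length horizon after `months`, assuming it doubles every `doubling_months`."""
    return h0_minutes * 2.0 ** (months / doubling_months)


def months_to_reach(h0_minutes: float, target_minutes: float, doubling_months: float = 7.0) -> float:
    """Months needed for the horizon to grow from `h0_minutes` to `target_minutes`."""
    return doubling_months * math.log2(target_minutes / h0_minutes)


if __name__ == "__main__":
    # Hypothetical starting point: a 60-minute horizon today.
    print(horizon_after(60, 14))     # two doublings
    print(months_to_reach(60, 480))  # three doublings (60 -> 480 minutes)
```

Under these assumptions, two doublings (14 months) turn a one-hour horizon into a four-hour one, and reaching an eight-hour horizon takes three doublings, or 21 months.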
---
Outline:
(00:20) Introduction
(01:34) Methodology
(04:07) Datasets
(11:49) Models
(13:33) Results
(18:26) Limitations
(20:47) Personal Retrospective & Next Steps
(23:08) References
---
Narrated by TYPE III AUDIO.
By LessWrong
