


This is a rough research note where the primary objective was my own learning. I am sharing it because I’d love feedback and I thought the results were interesting.
Introduction
A recent METR paper [1] showed that the length of software engineering tasks that LLMs could successfully complete appeared to be doubling roughly every seven months. I asked the same question for offensive cybersecurity, a domain with distinct skills and unique AI-safety implications.
Using METR's methodology on five cyber benchmarks, with tasks ranging from 0.5s to 25h in human-expert estimated completion time, I evaluated many state-of-the-art model releases from the past five years. I found:
Below I outline the datasets, IRT-based analysis, results and caveats. [...]
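As a rough illustration of what "doubling roughly every seven months" and a 50% time horizon mean numerically, here is a minimal sketch. It is not the note's analysis code. The function names, the logistic form, and the `slope` parameter are assumptions made for illustration; only the seven-month figure comes from the text above.

```python
import math


def horizon_after(h0_minutes: float, months: float, doubling_months: float = 7.0) -> float:
    """Project a time horizon forward, assuming constant exponential growth."""
    return h0_minutes * 2.0 ** (months / doubling_months)


def doubling_time(h_start: float, h_end: float, months_elapsed: float) -> float:
    """Months per doubling implied by two horizon measurements."""
    return months_elapsed * math.log(2.0) / math.log(h_end / h_start)


def success_probability(task_minutes: float, h50_minutes: float, slope: float = 1.0) -> float:
    """Assumed logistic (IRT-style) link between task length and success.

    The probability is exactly 0.5 when the task length equals the
    50% horizon, and falls as tasks get longer (for slope > 0).
    """
    z = slope * (math.log2(h50_minutes) - math.log2(task_minutes))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # A 60-minute horizon that doubles every 7 months, 14 months later.
    print(horizon_after(60.0, 14.0))
    # Doubling time recovered from the two horizons above.
    print(doubling_time(60.0, 240.0, 14.0))
    # Success chance on a task twice as long as the 50% horizon.
    print(success_probability(120.0, 60.0))
```

Under these assumptions, two doublings in 14 months turn a one-hour horizon into a four-hour one, and a task twice as long as the 50% horizon is solved less than half the time.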
---
Outline:
(00:20) Introduction
(01:34) Methodology
(04:07) Datasets
(11:49) Models
(13:33) Results
(18:26) Limitations
(20:47) Personal Retrospective & Next Steps
(23:08) References
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
By LessWrong
