
This is a rough research note where the primary objective was my own learning. I am sharing it because I’d love feedback and I thought the results were interesting.
Introduction
A recent METR paper [1] showed that the length of software engineering tasks that LLMs could successfully complete appeared to be doubling roughly every seven months. I asked the same question for offensive cybersecurity, a domain with distinct skills and unique AI-safety implications.
Using METR's methodology on five cyber benchmarks, with tasks ranging from 0.5s to 25h in human-expert-estimated completion time, I evaluated many state-of-the-art model releases from the past five years. I found:
Below I outline the datasets, IRT-based analysis, results and caveats. [...]
---
Outline:
(00:20) Introduction
(01:34) Methodology
(04:07) Datasets
(11:49) Models
(13:33) Results
(18:26) Limitations
(20:47) Personal Retrospective & Next Steps
(23:08) References
---
First published:
Source:
---
Narrated by TYPE III AUDIO.