
In this eye-opening deep dive, we unravel the surprising challenge that’s stumping even the most advanced AI systems today: web browsing. Yes, the thing you do every day—searching the internet—is still one of AI’s biggest hurdles. Enter BrowseComp, a brutal new benchmark created by OpenAI to test whether AI agents can actually navigate the messy, tangled reality of the web to find “needle-in-a-haystack” information.
We break down what BrowseComp really is—1,266 handcrafted, devilishly specific questions designed to test AI’s persistence, creativity, and judgment online. Think: finding a soccer match with an exact number of yellow cards and a referee from a specific country, or tracking down a research paper by identifying the undergrad schools of its authors. These aren’t trivia questions. They’re research problems.
The results? Shocking. Top language models like GPT-4o? Under 2%, even with browsing tools. The only model that crossed the human threshold was OpenAI’s specialized “Deep Research” agent—designed specifically for this kind of task—and even it struggled with calibration and overconfidence.
We dig into the wild strategies behind the benchmark (like building questions backwards from known answers), the tension between reasoning and searching, and why just giving an AI “access to the internet” is nowhere near enough. And here’s the kicker: humans didn’t do much better, with most giving up after hours of searching.
This episode isn’t just about AI benchmarks. It’s about how complex our information landscape really is—and what it says about the future of AI as a research partner. Will we learn to trust AIs to think for us online? Or are we still the best search engines we’ve got?
Listen now to explore the future of AI, web literacy, and the art of finding what matters in an overwhelming digital world.