The Daily ML

Ep1. LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation Of OpenAI's O1 On PlanBench



This episode discusses a research paper that evaluates the planning capabilities of large language models (LLMs) and large reasoning models (LRMs) on the PlanBench benchmark. The authors compare several LLMs, including GPT-4 and LLaMA, with o1, OpenAI’s newly released LRM. They find that o1 significantly outperforms LLMs on simple planning problems but struggles with more complex or obfuscated tasks. They also examine the trade-off between accuracy and efficiency, arguing that o1’s higher accuracy comes at a high computational cost, which makes it less practical than traditional planners or LLM-based systems. Finally, the paper highlights o1’s lack of interpretability and correctness guarantees, raising concerns about its reliability in safety-critical applications.

By The Daily ML