September 29, 2024

Ep1. LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation Of OpenAI's O1 On PlanBench

11 minutes

This episode discusses a research paper that evaluates the planning capabilities of large language models (LLMs) and large reasoning models (LRMs) on the PlanBench benchmark. The authors compare several LLMs, including GPT-4 and LLaMA, with OpenAI’s newly released o1 model, an LRM. They find that while o1 significantly outperforms LLMs on simple planning problems, it struggles with more complex or obfuscated tasks. They also examine the trade-off between accuracy and efficiency, arguing that o1’s increased accuracy comes at a high computational cost, making it less practical than traditional planners or LLM-based systems. Finally, the paper highlights o1’s lack of interpretability and correctness guarantees, raising concerns about its reliability in safety-critical applications.

By The Daily ML
