
This research paper from Microsoft and Salesforce identifies a significant performance gap in Large Language Models (LLMs) when they move from single-turn, fully specified instructions to multi-turn, underspecified conversations. Through large-scale simulations, the authors found that even state-of-the-art models suffer an average 39% drop in performance when instructions are revealed gradually rather than all at once. This degradation is primarily attributed to a phenomenon called "lost in conversation," where models make premature assumptions, propose incomplete solutions, and fail to recover once they take a wrong turn. The study decomposes the degradation into two components: a slight loss in aptitude and a massive increase in unreliability. Ultimately, the findings suggest that current evaluation methods overestimate model capabilities by ignoring the underspecification common in real-world human-AI interactions.
By Enoch H. Kang