


This research addresses the performance gap in large language models between single-turn and multi-turn interactions. The authors introduce TURNWISEEVAL, a new benchmark that isolates conversational ability by comparing model responses in long dialogues against equivalent single-turn prompts. To improve model performance, they also developed TURNWISEDATA, a scalable pipeline that generates synthetic multi-turn training data from existing single-turn instructions. Their experiments demonstrate that even advanced models often struggle with extended context, but incorporating a small amount of this synthetic data during training significantly boosts chat capabilities. Ultimately, the study highlights that multi-turn proficiency is a distinct skill set that requires dedicated evaluation and specialized training data.
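To make the TURNWISEDATA idea concrete, here is a minimal, purely illustrative sketch of how a single-turn instruction might be expanded into a multi-turn conversation by revealing its constraints across later user turns. All function and placeholder names are assumptions for illustration, not the authors' actual pipeline or API; in the real pipeline a teacher model would fill in the assistant responses.

```python
# Hypothetical sketch of converting a single-turn instruction into a
# synthetic multi-turn conversation (names are illustrative, not the
# authors' actual implementation).

def to_multi_turn(instruction: str, constraints: list[str]) -> list[dict]:
    """Build a chat-style message list: the core task first, then each
    constraint revealed in a later user turn, mimicking how real users
    refine a request over a dialogue."""
    messages = [{"role": "user", "content": instruction}]
    for constraint in constraints:
        # A teacher model would generate the reply here; we leave a
        # placeholder target to be filled in during data generation.
        messages.append({"role": "assistant", "content": "<teacher response>"})
        messages.append({"role": "user", "content": constraint})
    messages.append({"role": "assistant", "content": "<teacher response>"})
    return messages

convo = to_multi_turn(
    "Write a short poem about the sea.",
    ["Make it rhyme.", "Limit it to four lines."],
)
print(len(convo))  # 6 messages: 3 user turns, 3 assistant placeholders
```

Training on conversations shaped like this, rather than on the equivalent single-turn prompt, is what the summary describes as boosting multi-turn chat capability with only a small amount of synthetic data.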
By Enoch H. Kang