
This research paper investigates whether large language models (LLMs) can self-improve at long-context reasoning, i.e., processing and reasoning over information spread across long stretches of text. The authors propose SEALONG, an approach that samples multiple outputs from an LLM for each question and scores them with Minimum Bayes Risk (MBR), which prioritizes outputs that agree with one another and thereby filters out likely incorrect or hallucinated ones. SEALONG then uses the highest-scoring outputs for further training, via either supervised fine-tuning or preference optimization. Through extensive experiments, the authors demonstrate that SEALONG significantly improves LLMs' long-context reasoning without requiring annotations from expert models or human labeling.
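To make the MBR idea concrete, here is a minimal Python sketch of scoring sampled outputs by mutual agreement. It is illustrative only: the `sample` strings are hypothetical, and token-level Jaccard overlap is an assumed stand-in for whatever similarity metric the paper actually uses.

```python
# Minimal sketch of Minimum Bayes Risk (MBR) selection over sampled outputs.
# Assumption: Jaccard token overlap is a cheap stand-in for a proper
# semantic-similarity metric between two candidate answers.

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def mbr_scores(outputs: list[str]) -> list[float]:
    """Score each output by its average similarity to every other output.
    Outputs that agree with the consensus score high; outliers (which are
    more likely to be incorrect or hallucinated) score low."""
    scores = []
    for i, out in enumerate(outputs):
        sims = [jaccard(out, other) for j, other in enumerate(outputs) if j != i]
        scores.append(sum(sims) / len(sims) if sims else 0.0)
    return scores

if __name__ == "__main__":
    # Hypothetical sampled answers to one long-context question.
    samples = [
        "The merger closed in 2019 after regulatory approval.",
        "The merger closed in 2019 following approval by regulators.",
        "The company was founded in 1987 in Austin.",  # off-consensus outlier
    ]
    ranked = sorted(zip(mbr_scores(samples), samples), reverse=True)
    chosen, rejected = ranked[0][1], ranked[-1][1]
    # `chosen` could feed supervised fine-tuning; the (chosen, rejected)
    # pair could serve as one preference-optimization example.
    print("chosen: ", chosen)
    print("rejected:", rejected)
```

In this toy run, the two mutually consistent answers about the merger outscore the outlier about the founding date, mirroring how MBR scoring lets the model supervise itself: consensus among its own samples substitutes for an external judge.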