
December 14, 2024

【Episode 75】cDPO: Correcting Answers by Uncovering Critical Tokens

12 minutes

Seventy3: Using NotebookLM to turn research papers into podcasts, so everyone can keep learning alongside AI.

Today's topic: Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM’s Reasoning Capability

Summary

This research paper introduces cDPO, a novel approach for improving the reasoning capabilities of Large Language Models (LLMs). cDPO identifies "critical tokens" (tokens that strongly influence whether a reasoning trajectory ends in a correct or an incorrect answer) using contrastive estimation: it compares the token likelihoods of two models, one fine-tuned on correct reasoning trajectories and one fine-tuned on incorrect ones. These estimates enable token-level reward adjustments during preference optimization, which improves accuracy. Experiments on the GSM8K and MATH500 benchmarks with Llama-3 and DeepSeek-math models show that cDPO outperforms existing methods. The paper also examines the effect of various hyperparameters and compares cDPO in depth with related techniques in contrastive estimation and reinforcement learning. The findings suggest that focusing on critical tokens substantially improves LLM reasoning accuracy.
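The summary describes contrastive estimation and token-level rewards only at a high level. Below is a minimal, illustrative Python sketch of the general idea, assuming the per-token log-probabilities of one response under the "positive" model (trained on correct trajectories) and the "negative" model (trained on incorrect ones) have already been computed. The plain log-likelihood ratio, the sigmoid weighting, and applying weights only to the rejected response in a DPO-style loss are all assumptions made for illustration, not the paper's exact formulation; every function name here is hypothetical.

```python
import math
from typing import List, Sequence


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))


def contrastive_scores(logp_pos: Sequence[float], logp_neg: Sequence[float]) -> List[float]:
    """Hypothetical per-token score: log p_pos(y_t) - log p_neg(y_t).

    A strongly negative score means the token is much more likely under the
    negative model, which makes it a candidate critical token.
    """
    if len(logp_pos) != len(logp_neg):
        raise ValueError("both models must score the same token sequence")
    return [p - n for p, n in zip(logp_pos, logp_neg)]


def top_critical(scores: Sequence[float], k: int = 1) -> List[int]:
    """Indices of the k tokens with the lowest (most negative) scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]


def token_weights(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Assumed mapping of scores to weights in (0, 1): sigmoid(-score / temperature).

    Tokens favored by the negative model receive weights close to 1.
    """
    return [_sigmoid(-s / temperature) for s in scores]


def weighted_dpo_loss(pol_chosen, ref_chosen, pol_rejected, ref_rejected,
                      rejected_weights, beta: float = 0.1) -> float:
    """DPO-style loss -log(sigmoid(beta * margin)) for one preference pair.

    Inputs are per-token log-probabilities under the policy and reference
    models; only the rejected response's tokens are reweighted (an assumption).
    """
    chosen = sum(p - r for p, r in zip(pol_chosen, ref_chosen))
    rejected = sum(w * (p - r) for w, p, r in zip(rejected_weights, pol_rejected, ref_rejected))
    margin = beta * (chosen - rejected)
    return _softplus(-margin)  # equals -log(sigmoid(margin))


if __name__ == "__main__":
    # Toy log-probabilities for a 3-token response (made-up numbers).
    scores = contrastive_scores([-0.1, -0.2, -3.0], [-0.1, -0.3, -0.2])
    print(top_critical(scores))  # [2]
    print(token_weights(scores))
```

In this toy input, the third token is far more likely under the negative model, so it is flagged as the critical token and receives the largest weight. Scaling the rejected response's token-level log-ratios by such weights concentrates the preference-optimization signal on the tokens that contrastive estimation marks as most responsible for a wrong answer.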

Original paper: https://arxiv.org/abs/2411.19943


By 任雨山