
This episode introduces Confidence-Reward driven Preference Optimization (CRPO), a novel method for improving machine translation by selecting more informative training data for large language models (LLMs). The paper highlights the challenges of applying LLMs to translation, given their pretraining on English-centric data and the complexity of traditional reinforcement learning from human feedback. While Direct Preference Optimization (DPO) simplifies training, its success hinges on high-quality preference data. CRPO addresses this by combining reward scores with model confidence to identify challenging sentence pairs on which the model is uncertain or underperforming, leading to more efficient fine-tuning. The authors demonstrate CRPO's effectiveness on both LLMs and encoder-decoder models, showing that it outperforms existing data selection methods in translation accuracy and data efficiency.
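The selection idea can be sketched in a few lines: score each candidate preference pair by how strongly the reward model prefers one translation relative to how confident the policy already is in that preference, then keep the highest-scoring (most challenging) pairs for preference fine-tuning. This is a minimal sketch under assumed details; the scoring rule, field names, and functions below are illustrative, not the paper's exact formulation.

```python
def confidence_reward_score(reward_chosen, reward_rejected,
                            logprob_chosen, logprob_rejected):
    """Score a (chosen, rejected) translation pair.

    A high score means the reward model clearly prefers the chosen
    translation, but the policy's own log-probabilities do not yet
    reflect that preference, i.e. the pair is still challenging.
    (Assumed scoring rule: reward margin minus confidence margin.)
    """
    reward_margin = reward_chosen - reward_rejected
    confidence_margin = logprob_chosen - logprob_rejected
    return reward_margin - confidence_margin


def select_pairs(candidates, k):
    """Keep the k highest-scoring pairs for preference fine-tuning."""
    scored = sorted(
        candidates,
        key=lambda c: confidence_reward_score(
            c["reward_chosen"], c["reward_rejected"],
            c["logprob_chosen"], c["logprob_rejected"]),
        reverse=True,
    )
    return scored[:k]


if __name__ == "__main__":
    pairs = [
        # Model already confident and the reward agrees: low priority.
        {"reward_chosen": 0.9, "reward_rejected": 0.2,
         "logprob_chosen": -5.0, "logprob_rejected": -12.0},
        # Reward prefers chosen but the model is uncertain: high priority.
        {"reward_chosen": 0.8, "reward_rejected": 0.3,
         "logprob_chosen": -9.0, "logprob_rejected": -8.5},
    ]
    print(select_pairs(pairs, k=1))
```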