AI on Air

CREAM: A New Self-Rewarding Method that Allows the Model to Learn More Selectively and Emphasize Reliable Preference Data


CREAM, which stands for Confidence-based REward Adjustment Method, is a new technique for training language models. It aims to improve their performance and alignment by using the model's confidence in its own preference judgments to adjust rewards.

The method gives more weight to high-confidence preferences and less to low-confidence ones, so the model learns more selectively and spends less effort on unreliable data. CREAM builds on earlier self-rewarding methods, such as the self-rewarding language models introduced in the Meta and NYU paper and the later meta-rewarding technique, by adding confidence-based reward adjustments.
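To make "prioritizing high-confidence preferences" concrete, here is a minimal sketch of one way confidence could scale a preference-learning loss. It assumes a DPO-style objective and a per-pair confidence score in [0, 1]; the names `PreferencePair`, `confidence_weighted_dpo_loss`, and the `beta` value are hypothetical illustrations, not CREAM's actual formulation, and how the confidence score is computed is left out because the episode does not describe it.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    # log pi_theta(y|x) - log pi_ref(y|x) for the response the model judged better.
    chosen_logratio: float
    # The same log-ratio for the response the model judged worse.
    rejected_logratio: float
    # Hypothetical confidence in [0, 1] that the model's own judgment is correct.
    confidence: float


def log_sigmoid(z: float) -> float:
    # Numerically stable log(sigmoid(z)) for both large positive and negative z.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def confidence_weighted_dpo_loss(pairs: list[PreferencePair], beta: float = 0.1) -> float:
    """DPO-style loss where each pair's term is scaled by its confidence.

    Illustrative assumption only: low-confidence pairs contribute little,
    and a pair with confidence 0 contributes nothing.
    """
    total_weight = sum(p.confidence for p in pairs)
    if total_weight == 0:
        return 0.0
    loss = 0.0
    for p in pairs:
        # Margin between chosen and rejected responses, scaled by beta.
        margin = beta * (p.chosen_logratio - p.rejected_logratio)
        loss += p.confidence * -log_sigmoid(margin)
    # Normalize by total confidence so the loss scale does not depend on how many pairs are down-weighted.
    return loss / total_weight


if __name__ == "__main__":
    pairs = [
        PreferencePair(chosen_logratio=0.0, rejected_logratio=0.0, confidence=1.0),
        # A pair the model is unsure about: it is ignored entirely.
        PreferencePair(chosen_logratio=-5.0, rejected_logratio=5.0, confidence=0.0),
    ]
    print(confidence_weighted_dpo_loss(pairs))
```

Normalizing by the total confidence rather than the number of pairs is one design choice among several; dividing by the pair count instead would also shrink the overall gradient when many pairs are low-confidence.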

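This approach offers a more refined path to self-improvement and alignment: the training signal still comes from the model's own judgments, but only the judgments it is confident about carry full weight.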

AI on Air by Michael Iversen