
This document presents RENT, a novel method for improving the reasoning abilities of language models using unsupervised reinforcement learning. Instead of relying on external feedback or ground-truth answers, RENT uses the model's own confidence, specifically the negative entropy of its token distributions, as a reward signal. Experiments on various reasoning benchmarks and models demonstrate that minimizing entropy leads to improved performance, suggesting a strong correlation between confidence and accuracy, particularly in the later tokens of the generated response. While acknowledging the limitations of unsupervised learning, the paper highlights RENT's generality and effectiveness in enhancing language model reasoning.
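The confidence reward the summary describes can be sketched as follows: compute the entropy of each generated token's probability distribution, average over the response, and negate it so that more confident (lower-entropy) generations receive higher reward. This is an illustrative sketch only; the function name and interface are assumptions, not the paper's actual code.

```python
import math

def negative_entropy_reward(token_dists):
    """Illustrative confidence reward: negative mean entropy.

    token_dists: one probability distribution per generated token,
    each a list of floats summing to 1. A fully confident model
    (entropy 0 at every step) gets the maximum reward of 0; more
    uncertain distributions yield more negative rewards.
    """
    entropies = []
    for dist in token_dists:
        # Shannon entropy in nats; skip zero-probability entries.
        h = -sum(p * math.log(p) for p in dist if p > 0.0)
        entropies.append(h)
    # Negate the mean entropy: higher confidence -> higher reward.
    return -sum(entropies) / len(entropies)

# A peaked (confident) distribution scores higher than a uniform one.
confident = negative_entropy_reward([[1.0, 0.0, 0.0, 0.0]])
uncertain = negative_entropy_reward([[0.25, 0.25, 0.25, 0.25]])
```

In an RL setup, a scalar like this would stand in for an external reward (e.g. in GRPO or PPO), which is what lets the method train without ground-truth answers.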