
Alright PaperLedge crew, Ernis here, ready to dive into some fascinating research! Today, we're talking about making big brains smaller...and smarter! Think of it like this: you've got a super-genius professor, and we want to teach all their amazing knowledge to a bright but much younger student. The paper we're looking at explores a new way to do just that with AI models, specifically in the realm of reasoning.
These massive reasoning models, like DeepSeek-R1 or OpenAI's o1, are incredible at solving complex problems. But they're also huge and power-hungry! So, the idea is to "distill" their knowledge into smaller, more efficient models. It's like taking all the wisdom from a giant encyclopedia and condensing it into a pocket-sized guide. The problem is, the usual way to do this involves throwing away a lot of learning material.
Normally, when we teach the little AI student, we only show it the correct answers and reasoning steps. If it gets something wrong, we just toss that example out. It's like saying, "Okay, forget you ever tried that wrong path, let's just focus on what's right." But this paper asks a really smart question: what if we could learn from the mistakes too? After all, we humans learn just as much from our failures as we do from our successes!
The researchers introduce a new method called REDI, which stands for Reinforcement Distillation. It's a two-step process. First, they do the usual thing: Supervised Fine-Tuning (SFT). They show the student model all the correct reasoning examples. It's like giving the student a solid foundation.
But here's the cool part: Step 2. They use a special loss function (don't worry too much about the technical details!) that helps the model learn from both positive (correct) and negative (incorrect) reasoning examples. It's like saying, "Okay, you got this problem wrong, but let's analyze why you got it wrong and learn from that mistake." Think of it as learning what not to do! They found that their method, REDI, works better than just showing the student the right answers (plain SFT), and better than combining those right answers with other common techniques like DPO or SimPO.
To put it simply, REDI is like having a teacher that not only shows you the right path but also explains why the wrong paths are wrong, helping you avoid them in the future!
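For the code-curious in the crew, here's a rough sketch of the two-stage idea in PyTorch. To be clear: this is my own illustrative approximation of the recipe, not the paper's actual objective, and names like stage2_loss and neg_weight are invented just for the example.

```python
import torch.nn.functional as F

# Stage 1 (standard SFT): plain cross-entropy on the correct (positive) traces only.
def sft_loss(logits, tokens):
    # logits: (batch, seq_len, vocab_size); tokens: (batch, seq_len) token ids
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))

# Stage 2 (sketch of the REDI-style idea): raise the likelihood of correct traces
# and lower the likelihood of incorrect ones. The negative term is down-weighted
# (neg_weight is a made-up knob for this sketch) so punishing mistakes doesn't
# wipe out what Stage 1 taught the student.
def stage2_loss(pos_logits, pos_tokens, neg_logits, neg_tokens, neg_weight=0.5):
    def mean_token_logprob(logits, tokens):
        logp = F.log_softmax(logits, dim=-1)
        return logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1).mean()

    pos_term = mean_token_logprob(pos_logits, pos_tokens)  # want this to go up
    neg_term = mean_token_logprob(neg_logits, neg_tokens)  # want this to go down
    return -(pos_term - neg_weight * neg_term)
```

The intuition behind a setup like this is the asymmetry: the pull toward correct traces stays strong, while the push away from incorrect ones is gentler, so the student learns what not to do without unlearning the good habits from Stage 1.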
The results are pretty impressive! They trained a model called Qwen-REDI-1.5B (that's a relatively small model) on a dataset of both correct and incorrect reasoning examples from a bigger model. And guess what? It crushed it on math problems!
Not only did it do incredibly well, but it even outperformed a model that was trained on far more data (800k examples versus REDI's 131k), and that training data wasn't even publicly available! That's huge!
So, why does this matter? Well, think about it: smaller, more efficient AI models are cheaper to run, easier to deploy, and more accessible. This research shows that we can get these smaller models to perform at a really high level by being smarter about how we train them. It means potentially bringing advanced reasoning capabilities to more people and more applications.
For students, this could mean better AI tutors. For businesses, it could mean more efficient AI assistants. And for researchers, it opens up a whole new avenue for exploring how AI learns and reasons.
Here are a couple of questions that come to mind:
That's all for this episode, learning crew! Hope you enjoyed diving into the world of model distillation and the power of learning from our mistakes. Until next time, keep exploring!