Here is a short summary of the paper "ZEPHYR: Direct Distillation of LM Alignment":
The Problem: While smaller, open-source large language models (LLMs) have become significantly more capable through distilled supervised fine-tuning (dSFT), they often struggle with "intent alignment": they do not behave in accordance with human preferences or respond well to natural prompts when compared to proprietary models or models trained with costly human feedback.
The Solution: The researchers introduce distilled direct preference optimization (dDPO), a highly efficient method that aligns a small open LLM entirely through distillation, without requiring any human annotation or any sampling during fine-tuning.
The methodology consists of three main steps:
- Distilled Supervised Fine-Tuning (dSFT): Initial training using a large-scale dataset of instructions and responses (UltraChat).
- AI Feedback (AIF) Collection: Gathering responses from an ensemble of different language models and using a powerful teacher model (like GPT-4) to score and rank these responses to create preference data (UltraFeedback).
- dDPO: Optimizing the dSFT model using this static AI preference data to maximize the likelihood of ranking the preferred responses over the rejected ones.
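The dDPO step above optimizes the standard DPO objective on the static AI preference pairs. The following is a minimal sketch of the per-pair loss, not the paper's implementation: the function name and scalar arguments are illustrative, and in practice each log-probability is the sum of token log-probabilities for a full response under the policy being trained or under the frozen dSFT reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    logp_*     : log-prob of the response under the policy being trained
    ref_logp_* : log-prob under the frozen dSFT reference model
    beta       : strength of the implicit KL penalty toward the reference
    """
    # Reward margin: how much more the policy (relative to the reference)
    # favors the preferred response over the rejected one.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the scaled margin: minimizing this maximizes
    # the likelihood of ranking the preferred response over the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference the margin is zero and the loss is log 2; training drives it toward zero by increasing the policy's relative preference for the chosen responses across the dataset. Because the preference data is static, no sampling from the policy is needed during optimization.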
The Results: By applying this method to the Mistral-7B base model, the authors created ZEPHYR-7B, which sets a new state of the art on chat benchmarks (MT-Bench and AlpacaEval) among 7B-parameter models. Remarkably, ZEPHYR-7B achieves conversational performance comparable to, and in some cases surpassing, much larger 70B-parameter models trained with human feedback, such as LLAMA2-CHAT-70B. Furthermore, the entire training process can be completed in just a few hours.