The paper "Finetuned Language Models are Zero-Shot Learners" introduces instruction tuning, a simple but highly effective method for improving the zero-shot learning capabilities of large language models on unseen tasks.
Here is a summary of the paper's core methodology, results, and key findings:
Methodology
The researchers took a 137-billion-parameter pretrained language model and finetuned it on a mixture of over 60 diverse NLP datasets. Crucially, the examples in these datasets were verbalized into natural-language instructions (e.g., "Is the sentiment of this movie review positive or negative?"). The resulting instruction-tuned model is called FLAN (Finetuned Language Net).
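The verbalization step can be sketched as follows. This is a minimal illustration of the templating idea, not the paper's actual templates; the field names and template strings are assumptions for the example.

```python
# Hedged sketch of "verbalizing" a dataset example into natural-language
# instructions, in the spirit of FLAN's instruction templates.
# The templates and field names below are illustrative assumptions.

def verbalize(example: dict, template: str) -> str:
    """Render one dataset example as a natural-language instruction."""
    return template.format(**example)

# Several templates per dataset increase instruction diversity.
SENTIMENT_TEMPLATES = [
    "Is the sentiment of this movie review positive or negative?\n{review}",
    "Movie review: {review}\nDid the reviewer like the movie?",
]

example = {"review": "A charming, heartfelt film."}
prompts = [verbalize(example, t) for t in SENTIMENT_TEMPLATES]
```

Each finetuning example is thus a plain-text instruction rather than a raw (input, label) pair, which is what lets the model respond to instructions it has never seen.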
Evaluation and Results
To rigorously evaluate FLAN's zero-shot capabilities on unseen tasks, the researchers grouped the datasets into clusters based on task type (e.g., translation, natural language inference, closed-book QA). They then held out entire clusters during training to ensure the model had never seen that specific type of task before evaluation.
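The leave-one-cluster-out protocol amounts to a simple split: drop the evaluation cluster from the finetuning mixture entirely. A minimal sketch, with illustrative cluster and dataset names rather than the paper's full task list:

```python
# Sketch of the held-out-cluster evaluation protocol: to evaluate on one
# task type, exclude that whole cluster from finetuning.
# Cluster and dataset names here are illustrative, not exhaustive.

CLUSTERS = {
    "translation": ["wmt16_de_en", "wmt16_ro_en"],
    "nli": ["anli", "rte", "cb"],
    "closed_book_qa": ["nq", "triviaqa"],
}

def split_for_eval(clusters: dict, held_out: str):
    """Return (finetuning datasets, evaluation datasets) for one held-out cluster."""
    train = [d for name, ds in clusters.items() if name != held_out for d in ds]
    evaluate = list(clusters[held_out])
    return train, evaluate

train_sets, eval_sets = split_for_eval(CLUSTERS, "nli")
# No NLI dataset appears in training, so NLI performance is genuinely zero-shot.
```

Repeating this split per cluster means every reported number comes from a task type the model never saw during instruction tuning.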
The results showed that FLAN substantially improved upon the zero-shot performance of its unmodified counterpart. Furthermore, zero-shot FLAN outperformed the zero-shot 175B-parameter GPT-3 on 20 out of 25 evaluated datasets. On specific tasks like ANLI, RTE, and BoolQ, zero-shot FLAN even surpassed the performance of few-shot GPT-3 by a large margin.
Key Findings from Ablation Studies
• Model Scale is Crucial: The generalization benefits of instruction tuning only emerge with sufficient model scale (around 100B parameters). For smaller models (8 billion parameters or fewer), instruction tuning actually hurt zero-shot performance on unseen tasks, likely because the smaller models used up their entire capacity simply learning the training tasks.
• The Importance of Instructions: The performance gains were not merely a result of multi-task learning. Training with actual natural language instructions was found to be critical; setups that only used dataset names or no templates performed substantially worse on unseen tasks.
• Task Variety: Increasing the number of task clusters used during instruction tuning continuously improved the model's performance on novel tasks, suggesting that exposure to a wider variety of instructions helps generalization.
By Yun Wu