The paper "Textbooks Are All You Need II: phi-1.5 technical report" investigates the capabilities of small language models (SLMs) when trained on highly curated datasets.
Here is a short summary of the key findings:
- Textbook-Quality Data: The researchers developed phi-1.5, a 1.3 billion parameter model, by training it primarily on synthetically generated "textbook-like" data rather than traditional, raw web data. This dataset included roughly 20 billion tokens specifically crafted to teach general world knowledge and common sense reasoning.
- Surprising Performance: Despite its small size, phi-1.5 achieves performance on natural language tasks comparable to models five times larger. It also substantially outperforms other models of similar size on complex multi-step reasoning tasks, such as grade-school mathematics and basic Python coding.
- Reduced Toxicity: By relying on synthetic textbook data instead of standard internet scrapes, the model demonstrated a much lower propensity for generating toxic, biased, or harmful content.
- Challenging Scaling Laws: Ultimately, the research challenges the prevailing assumption that language models must rely on massive scale (hundreds of billions of parameters) to achieve high capability. Instead, it argues that data quality can matter as much as, or more than, raw parameter count.
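Benchmarks for grade-school math (GSM8K-style problems) are typically scored by extracting the final number from a model's free-form chain-of-thought answer and comparing it to the gold answer. As a minimal, hypothetical sketch of that scoring step (the sample problem text and function names are illustrative, not taken from the paper):

```python
import re

def extract_final_number(text: str):
    """Return the last number appearing in a model's free-text answer, or None."""
    # Strip thousands separators, then find all integers/decimals.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def is_correct(model_output: str, gold_answer: float) -> bool:
    """Score a free-form answer by comparing its final number to the gold answer."""
    predicted = extract_final_number(model_output)
    return predicted is not None and abs(predicted - gold_answer) < 1e-6

# Hypothetical multi-step model output for a grade-school word problem.
sample_output = (
    "Alice has 3 bags with 4 apples each, so she has 3 * 4 = 12 apples. "
    "After giving away 5, she has 12 - 5 = 7 apples. The answer is 7."
)
print(is_correct(sample_output, 7))  # → True
```

Exact-match scoring like this is deliberately strict: a model must carry the intermediate arithmetic through to a correct final number, which is why multi-step reasoning benchmarks separate small curated-data models like phi-1.5 from similarly sized web-trained ones.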