


This paper introduces the first theory capable of quantitatively predicting neural scaling law exponents for large language models based solely on the statistical properties of natural language. The researchers identify two primary drivers of performance: the decay of next-token conditional entropy as context length increases, and the weakening of pairwise token correlations as the separation between tokens grows. By combining these metrics, they derive a first-principles formula that accurately forecasts how test loss improves with larger training datasets, without requiring synthetic data or free parameters. Their theoretical predictions closely match experimental results from GPT-2 and LLaMA-style models trained on the TinyStories and WikiText benchmarks. Ultimately, the study suggests that a model's learning efficiency is fundamentally governed by a data-dependent prediction horizon, where more data progressively unlocks the ability to utilize longer-range linguistic patterns.
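As a rough illustration of the kind of relationship being predicted (not the paper's actual derivation), a data scaling law is often written as L(D) = L_inf + A * D^(-alpha), where D is the dataset size, L_inf the irreducible loss, and alpha the scaling exponent the theory aims to predict. The helper below, with hypothetical naming of my own, recovers alpha from measured losses via a log-log fit:

```python
import numpy as np

def fit_scaling_exponent(dataset_sizes, losses, irreducible_loss):
    """Estimate alpha and A in L(D) = L_inf + A * D**(-alpha).

    Subtracting the irreducible loss leaves A * D**(-alpha), which is
    linear in log-log space: log(L - L_inf) = log(A) - alpha * log(D).
    """
    reducible = np.asarray(losses) - irreducible_loss
    slope, intercept = np.polyfit(np.log(dataset_sizes), np.log(reducible), 1)
    return -slope, np.exp(intercept)  # (alpha, A)

if __name__ == "__main__":
    # Synthetic measurements generated with alpha = 0.3, A = 5.0, L_inf = 1.2
    D = np.logspace(5, 9, 9)
    loss = 1.2 + 5.0 * D ** (-0.3)
    alpha, A = fit_scaling_exponent(D, loss, irreducible_loss=1.2)
    print(f"alpha = {alpha:.3f}, A = {A:.2f}")  # recovers alpha = 0.300
```

The paper's contribution, per the summary above, is predicting alpha from language statistics alone rather than fitting it empirically as this sketch does.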
By Enoch H. Kang