Ref: https://arxiv.org/abs/2001.08361
Summary: a study of empirical scaling laws for the cross-entropy loss of Transformer-based language models. The authors find that performance scales as a power law with model size, dataset size, and training compute, with trends spanning more than seven orders of magnitude. Other architectural details (such as network width versus depth) have minimal impact within a wide range. Larger models are markedly more sample-efficient, so compute-efficient training favors very large models trained on relatively less data and stopping before convergence. The study also offers guidance on optimal resource allocation under a fixed compute budget.
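The power-law relationship between loss and model size can be sketched as follows. The exponent and critical scale below are the approximate fitted values reported in the paper (alpha_N ≈ 0.076, N_c ≈ 8.8e13 non-embedding parameters); treat them as illustrative rather than exact.

```python
ALPHA_N = 0.076   # approximate power-law exponent for model size
N_C = 8.8e13      # approximate critical scale (non-embedding parameters)

def loss_from_model_size(n_params: float) -> float:
    """Predicted cross-entropy loss as a function of model size N,
    following the paper's form L(N) = (N_c / N) ** alpha_N."""
    return (N_C / n_params) ** ALPHA_N
```

Because the exponent is small, each order-of-magnitude increase in parameter count yields a modest but steady reduction in predicted loss, which is what makes the trend visible only across many orders of magnitude.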