Bucilă et al. (2006) were doing model compression via supervised imitation on model ensembles. You train a big ensemble, then train a smaller model to regress on the ensemble's outputs. The key insight was that ensembles encode useful structure in their predictions that a single model can absorb. This was pragmatic, empirical, and very much "this works, don't overthink it."

Hinton et al. (2015) took that idea, cleaned it up, and made the hidden assumption explicit: the information lives in the soft targets, not the hard labels. Temperature-scaled softmax exposes class similarities, uncertainty, and dark knowledge. Given this, ensembles become optional. A single large teacher works. Multiple teachers work. Self-distillation works. The loss becomes a principled KL divergence instead of "just regress the logits and hope."

Source: Geoffrey Hinton, Oriol Vinyals, Jeff Dean, "Distilling the Knowledge in a Neural Network." Google Inc., University of Toronto, Canadian Institute for Advanced Research. URL: https://arxiv.org/pdf/1503.02531
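The temperature-scaled softmax and the KL loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the example logits and the choice of T=4 are arbitrary, and the T² scaling follows Hinton et al.'s note that it keeps gradient magnitudes comparable across temperatures.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T flattens the distribution,
    # exposing the relative similarities among non-target classes
    # (the "dark knowledge").
    z = logits / T
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=4.0):
    # KL(teacher || student) on temperature-softened distributions.
    # The T**2 factor compensates for the 1/T**2 shrinkage of the
    # soft-target gradients, so hard- and soft-loss terms stay on
    # a comparable scale when mixed.
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)   # student's softened predictions
    return (T ** 2) * np.sum(p * (np.log(p) - np.log(q)))

teacher = np.array([10.0, 5.0, 1.0])   # confident teacher logits (illustrative)
student = np.array([8.0, 6.0, 2.0])    # student logits mid-training (illustrative)

# At T=1 the teacher's softmax is nearly one-hot; at T=4 the
# non-target classes become visible in the soft targets.
print(softmax(teacher, T=1.0))   # ≈ [0.9932, 0.0067, 0.0001]
print(softmax(teacher, T=4.0))   # ≈ [0.7184, 0.2058, 0.0757]
print(distillation_loss(teacher, student, T=4.0))
```

In practice this soft-target term is combined with a standard cross-entropy on the true labels, weighted so the soft term dominates.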