
The researchers introduce DroPE, a novel method for extending the context length of large language models by removing positional embeddings after pretraining. While explicit positional encodings such as RoPE are essential for fast training convergence, they create a bottleneck that prevents models from processing sequences longer than those seen during training. The authors demonstrate that these embeddings act as a temporary scaffold: they can be discarded and replaced with a brief recalibration phase at the original context length. This approach lets models achieve zero-shot context extension far beyond their initial training limits, without the performance degradation typical of traditional scaling methods. Empirically, DroPE maintains high accuracy on long-range retrieval tasks across model sizes, outperforming specialized architectures and complex frequency-scaling techniques. Ultimately, the work suggests that the inductive bias of explicit positions is only necessary during early learning and can be removed to unlock robust, scalable inference.
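To make the idea concrete, here is a minimal sketch of what "removing positional embeddings" means at the attention level. This is an illustrative toy implementation, not the paper's code: `rope_rotate` and `attention_scores` are hypothetical helper names, and the setup assumes a single attention head with standard RoPE applied to queries and keys. Dropping RoPE simply means computing attention scores from content alone (the NoPE setting that DroPE recalibrates into):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply rotary position embedding (RoPE) to a vector x at position pos.
    Pairs of dimensions are rotated by position-dependent angles."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)  # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied independently to each (x1_i, x2_i) pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def attention_scores(q, k, positions, use_rope=True):
    """Scaled dot-product attention scores for one head.
    With use_rope=False the scores are position-agnostic: this is the
    setting a model lands in once positional embeddings are dropped."""
    d = q.shape[-1]
    if use_rope:
        q = np.stack([rope_rotate(q[i], p) for i, p in enumerate(positions)])
        k = np.stack([rope_rotate(k[i], p) for i, p in enumerate(positions)])
    return q @ k.T / np.sqrt(d)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))  # 4 tokens, head dimension 8
k = rng.normal(size=(4, 8))

with_pos = attention_scores(q, k, positions=np.arange(4), use_rope=True)
without_pos = attention_scores(q, k, positions=np.arange(4), use_rope=False)
# Without RoPE the scores depend only on token content, so there is no
# position-dependent pattern to extrapolate (or break) at longer lengths.
```

The sketch highlights why position-free attention extrapolates trivially: the score between two tokens no longer depends on how far apart they are, so nothing changes when the sequence grows past the training length. What the summary calls "recalibration" is then a short continued-training phase that lets the model adapt to this position-free regime.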
By Enoch H. Kang