This episode explores a paper on “in-place” test-time training for autoregressive transformer LLMs, asking whether a standard model can update some of its own weights during inference without requiring a new architecture. It explains how test-time training differs from in-context learning by storing temporary information in fast-changing parameters rather than only in the context tokens or the KV cache, and argues that the paper’s main contribution is to reuse an existing transformer MLP projection and train it with a next-token-prediction-aligned objective instead of a generic self-supervised loss. The discussion also situates the work within earlier test-time training and long-sequence modeling research, highlighting why prior approaches struggled to fit mainstream LLM serving stacks. Listeners may find it interesting as a clear look at a possible path toward models that keep adapting after deployment, paired with a skeptical examination of whether that promise is truly practical as a “drop-in” enhancement.
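To make the core mechanism concrete, here is a minimal sketch of the idea as we understand it from the paper’s framing, not the authors’ code: during decoding, one existing MLP projection is treated as fast weights and updated with the same next-token-prediction loss used in pretraining. `TinyLM`, `generate_with_ttt`, and every hyperparameter below are illustrative assumptions, not names from the paper.

```python
# Minimal sketch of in-place test-time training (illustrative, not the
# paper's implementation): one MLP projection is updated during decoding
# with an ordinary next-token-prediction loss over the recent context.

import torch
import torch.nn.functional as F

class TinyLM(torch.nn.Module):
    def __init__(self, vocab=256, d=64):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, d)
        self.up = torch.nn.Linear(d, 4 * d)    # slow weights, stay frozen
        self.down = torch.nn.Linear(4 * d, d)  # fast weights, updated at test time
        self.head = torch.nn.Linear(d, vocab)

    def forward(self, ids):
        h = self.emb(ids)
        h = h + self.down(F.gelu(self.up(h)))  # one MLP block with residual
        return self.head(h)                    # logits over the vocabulary

def generate_with_ttt(model, ids, new_tokens=32, lr=1e-3, chunk=8):
    # Freeze everything, then re-enable only the designated projection.
    for p in model.parameters():
        p.requires_grad_(False)
    model.down.weight.requires_grad_(True)
    opt = torch.optim.SGD([model.down.weight], lr=lr)

    for _ in range(new_tokens):
        # Ordinary inference step: greedy next-token choice.
        with torch.no_grad():
            nxt = model(ids)[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, nxt], dim=1)

        # In-place TTT step: one gradient update on the most recent chunk,
        # using next-token prediction so the test-time objective matches
        # pretraining rather than a generic self-supervised proxy.
        ctx = ids[:, -(chunk + 1):]
        logits = model(ctx[:, :-1])
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               ctx[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return ids

out = generate_with_ttt(TinyLM(), torch.randint(0, 256, (1, 16)))
```

Even at toy scale this shows the property the episode dwells on: the updated weights belong to a projection the forward pass already contains, so nothing about the model’s architecture or serving interface has to change.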
Sources:
1. In-Place Test-Time Training — Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Di He, Wenhao Huang, Tianle Cai, 2026
http://arxiv.org/abs/2604.06169
2. Test-Time Training with Self-Supervision for Generalization under Distribution Shifts — Yu Sun, Xiaolong Wang, Ziwei Liu, John Miller, Alexei A. Efros, Moritz Hardt, 2020
https://scholar.google.com/scholar?q=Test-Time+Training+with+Self-Supervision+for+Generalization+under+Distribution+Shifts
3. TTT Layers: Online Throughput-Optimized Training for Long Sequence Modeling — Michael A. Ahn, Zhiqing Sun, et al., 2024
https://scholar.google.com/scholar?q=TTT+Layers:+Online+Throughput-Optimized+Training+for+Long+Sequence+Modeling
4. Learning to (Learn at Test Time): RNNs with Expressive Hidden States — Yu Sun, Xinhao Li, Karan Dalal, et al., 2024
https://scholar.google.com/scholar?q=Learning+to+(Learn+at+Test+Time):+RNNs+with+Expressive+Hidden+States
5. Overcoming Catastrophic Forgetting in Neural Networks — James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, et al., 2017
https://scholar.google.com/scholar?q=Overcoming+Catastrophic+Forgetting+in+Neural+Networks
6. Continual Learning with Deep Generative Replay — Hanul Shin, Jung Kwon Lee, Jaehong Kim, Jiwon Kim, 2017
https://scholar.google.com/scholar?q=Continual+Learning+with+Deep+Generative+Replay
7. Dark Experience for General Continual Learning: a Strong, Simple Baseline — Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, Simone Calderara, 2020
https://scholar.google.com/scholar?q=Dark+Experience+for+General+Continual+Learning:+a+Strong,+Simple+Baseline
8. Continual Lifelong Learning with Neural Networks: A Review — German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, Stefan Wermter, 2019
https://scholar.google.com/scholar?q=Continual+Learning+in+Neural+Networks:+An+Overview
9. The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry — Michael Zhang, Kush Bhatia, Hermann Kumbong, Christopher Ré, 2024
https://scholar.google.com/scholar?q=The+Hedgehog+&+the+Porcupine:+Expressive+Linear+Attentions+with+Softmax+Mimicry
10. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Albert Gu, Tri Dao, 2023
https://scholar.google.com/scholar?q=Mamba:+Linear-Time+Sequence+Modeling+with+Selective+State+Spaces
11. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality — Tri Dao, Albert Gu, 2024
https://scholar.google.com/scholar?q=Transformers+are+SSMs:+Generalized+Models+and+Efficient+Algorithms+Through+Structured+State+Space+Duality
12. Titans: Learning to Memorize at Test Time — Ali Behrouz, Peilin Zhong, Vahab Mirrokni, 2025
https://scholar.google.com/scholar?q=Titans:+Learning+to+Memorize+at+Test+Time
13. LoRA: Low-Rank Adaptation of Large Language Models — Edward J. Hu, Yelong Shen, Phillip Wallis, et al., 2021
https://scholar.google.com/scholar?q=LoRA:+Low-Rank+Adaptation+of+Large+Language+Models
14. Memorizing Transformers — Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, Christian Szegedy, 2022
https://scholar.google.com/scholar?q=Memorizing+Transformers
15. Compute or Load KV Cache? Why Not Both? — recent systems paper; authors not identified from the available snippet, c. 2024/2025
https://scholar.google.com/scholar?q=Compute+or+Load+KV+Cache?+Why+Not+Both?
16. AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving — recent systems paper; authors not identified from the available snippet, c. 2024/2025
https://scholar.google.com/scholar?q=AdaptCache:+KV+Cache+Native+Storage+Hierarchy+for+Low-Delay+and+High-Quality+Language+Model+Serving
17. DynamicKV: Task-aware Adaptive KV Cache Compression for Long Context LLMs — recent systems paper; authors not identified from the available snippet, c. 2024/2025
https://scholar.google.com/scholar?q=DynamicKV:+Task-aware+Adaptive+KV+Cache+Compression+for+Long+Context+LLMs
18. Streaming Lifelong Learning with Any-Time Inference — recent continual-learning paper; authors not identified from the available snippet, c. 2024/2025
https://scholar.google.com/scholar?q=Streaming+Lifelong+Learning+with+Any-Time+Inference
19. Enabling Real-Time Inference in Online Continual Learning via Device-Cloud Collaboration — recent continual-learning/systems paper; authors not identified from the available snippet, c. 2024/2025
https://scholar.google.com/scholar?q=Enabling+Real-Time+Inference+in+Online+Continual+Learning+via+Device-Cloud+Collaboration
20. Test-Time Training on Nearest Neighbors for Large Language Models — Moritz Hardt, Yu Sun, 2024
https://scholar.google.com/scholar?q=Test-Time+Training+on+Nearest+Neighbors+for+Large+Language+Models
21. Test-Time Learning for Large Language Models — survey/position paper; authors not identified from the available snippet, c. 2024/2025
https://scholar.google.com/scholar?q=Test-Time+Learning+for+Large+Language+Models
22. AI Post Transformers: NVIDIA: TTT-E2E: Unlocking Long-Context Learning via End-to-End Test-Time Training — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/nvidia-ttt-e2e-unlocking-long-context-learning-via-end-to-end-test-time-training/
23. AI Post Transformers: Jet-Nemotron and PostNAS for Faster Long Context — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-24-jet-nemotron-and-postnas-for-faster-long-436381.mp3
24. AI Post Transformers: Memory Sparse Attention for 100M-Token Scaling — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-04-07-memory-sparse-attention-for-100m-token-s-377cff.mp3
25. AI Post Transformers: Native Sparse Attention: Efficient Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/native-sparse-attention-efficient-long-context-llms/
26. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
27. AI Post Transformers: NeurIPS 2025: Self-Adapting Language Models — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/neurips-2025-self-adapting-language-models/
28. AI Post Transformers: MetaClaw: Just Talk and Continual Agent Adaptation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-31-metaclaw-meta-learning-agents-in-the-wil-ab324c.mp3
Interactive Visualization: In-Place Test-Time Training for Transformers