
MatFormer is a novel Transformer architecture designed for elastic inference, allowing a single trained model to yield numerous smaller, fully functional submodels.
This is achieved by nesting sub-networks, primarily within the Feed-Forward Network (FFN) blocks, and jointly optimizing them during training.
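A minimal PyTorch sketch may make the nesting idea concrete. The class name, granularity fractions, and training loop below are illustrative assumptions rather than the paper's exact recipe; the point is that each smaller submodel is a prefix slice of the full FFN's weights, and that sampling a granularity per step optimizes all the nested sizes jointly.

```python
import random

import torch
import torch.nn as nn

class MatFormerFFN(nn.Module):
    """Nested FFN in the spirit of MatFormer: each smaller submodel uses a
    prefix slice of the same hidden dimension, so the small models' weights
    are literally contained inside the large one."""

    def __init__(self, d_model: int, d_ff: int, granularities=(0.25, 0.5, 1.0)):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)   # shared "matryoshka" weights
        self.w_out = nn.Linear(d_ff, d_model)
        self.granularities = granularities      # hypothetical size fractions
        self.d_ff = d_ff

    def forward(self, x: torch.Tensor, fraction: float = 1.0) -> torch.Tensor:
        # Use only the first `fraction` of the FFN hidden units; prefix
        # slices nest, so fraction=0.25 is a strict subset of fraction=1.0.
        m = int(self.d_ff * fraction)
        h = torch.relu(x @ self.w_in.weight[:m].T + self.w_in.bias[:m])
        return h @ self.w_out.weight[:, :m].T + self.w_out.bias

# Joint optimization: sample one granularity per step (one simple variant
# of the joint objective) so every nested submodel gets trained.
ffn = MatFormerFFN(d_model=512, d_ff=2048)
opt = torch.optim.Adam(ffn.parameters())
for step in range(100):
    x = torch.randn(8, 512)
    frac = random.choice(ffn.granularities)
    loss = ffn(x, frac).pow(2).mean()   # stand-in loss for illustration
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the slices share one set of weights, extracting a smaller submodel after training is just running the forward pass with a smaller fraction; no retraining or separate checkpoints are needed.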
Complementing MatFormer is Per-Layer Embeddings (PLE), a memory-offloading technique that significantly reduces the model's VRAM footprint by storing large embedding tables in slower memory, exemplified by Google's Gemma 3n models.
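A rough sketch of the offloading pattern follows, assuming a plain PyTorch setup; the class, vocabulary size, and embedding dimension are illustrative, not Gemma 3n's actual implementation, and the table is kept frozen here for brevity (the real per-layer tables are learned). What matters is where the memory lives.

```python
import torch
import torch.nn as nn

class OffloadedPerLayerEmbedding(nn.Module):
    """Sketch of the PLE memory trick: the large per-layer embedding table
    lives in host RAM, and only the rows for the current token ids are
    copied to the accelerator, so the table never occupies VRAM."""

    def __init__(self, vocab_size: int, dim: int,
                 device: str = "cuda" if torch.cuda.is_available() else "cpu"):
        super().__init__()
        # Full table stays on the CPU; pinned memory speeds up transfers.
        self.table = torch.empty(vocab_size, dim,
                                 pin_memory=torch.cuda.is_available())
        nn.init.normal_(self.table, std=0.02)
        self.device = device

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Gather only the needed rows (num_tokens x dim, tiny compared to
        # the whole table) and ship that slice to fast device memory.
        rows = self.table[token_ids.cpu()]
        return rows.to(self.device, non_blocking=True)

# Usage: a batch of token ids yields per-layer embeddings on the device
# while the large table never leaves system RAM.
ple = OffloadedPerLayerEmbedding(vocab_size=262_144, dim=256)
emb = ple(torch.tensor([[1, 42, 7]]))   # shape: (1, 3, 256)
```

The trade-off is a small per-token transfer cost in exchange for freeing VRAM that the embedding tables would otherwise occupy, which is what makes the approach attractive on memory-constrained devices.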
This combined approach addresses the computational and memory constraints of deploying large foundation models across diverse hardware, enabling flexible and efficient AI applications.
By Benjamin Alloul