Learning GenAI via SOTA Papers

EP090: Pixtral 12B Beats Llama With Better Eyesight



Pixtral 12B is a 12-billion-parameter multimodal language model developed by Mistral AI, designed to understand both text and images seamlessly. Released under the open-source Apache 2.0 license, the model achieves state-of-the-art performance on a range of multimodal benchmarks without compromising its strong natural-language reasoning capabilities.

Here are the key takeaways from the paper:

  • Innovative Architecture: Pixtral 12B is built on top of the Mistral Nemo 12B text model and integrates a newly trained 400-million-parameter vision encoder called Pixtral-ViT.
  • Native Resolution and Aspect Ratio: Unlike traditional vision encoders that require images to be broken into fixed-size square tiles, Pixtral uses a novel ROPE-2D implementation. This allows the model to natively ingest images at their original resolution and aspect ratio, providing flexibility and better performance on complex visual tasks.
  • Multi-Image Context: The model features an expansive 128K-token context window, enabling it to process an arbitrary number of images within long, multi-turn conversations.
  • State-of-the-Art Performance: Pixtral 12B substantially outperforms other open models in its weight class, such as Llama-3.2 11B and Qwen-2-VL 7B. It also matches or exceeds the performance of much larger models (like Llama-3.2 90B) and leading closed-source models (like Claude-3 Haiku and Gemini-1.5 Flash 8B) on various multimodal benchmarks.
  • New Evaluation Benchmark (MM-MT-Bench): Noting that current evaluation protocols for vision-language models are poorly standardized, the authors introduced MM-MT-Bench. This new open-source benchmark is specifically designed to evaluate how well multimodal models follow instructions in practical, multi-turn, long-form assistant scenarios.
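The ROPE-2D idea above can be illustrated with a small sketch: instead of a single 1D sequence position, each image patch gets a (row, column) coordinate, and the rotary embedding rotates half of the feature dimensions by the row index and the other half by the column index. Because positions come straight from the patch grid, an image of any height and width can be encoded without resizing it into square tiles. This is a minimal NumPy illustration of the general 2D rotary-embedding technique, not Mistral's actual implementation; all function names and the tiny dimensions are invented for the example.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding: rotate consecutive pairs of feature
    dimensions by an angle proportional to the integer position `pos`."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per dim pair
    angles = pos[:, None] * freqs[None, :]      # (n_tokens, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]         # even/odd feature pairs
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(patches, rows, cols):
    """2D rotary embedding: first half of the features encodes the row
    coordinate, second half encodes the column coordinate."""
    half = patches.shape[-1] // 2
    out = patches.copy()
    out[:, :half] = rope_1d(patches[:, :half], rows)
    out[:, half:] = rope_1d(patches[:, half:], cols)
    return out

# A non-square image becomes a 3x5 grid of patches; positions are read
# directly off the grid, so no fixed-size square tiling is required.
H, W, d = 3, 5, 8
rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
patches = np.random.default_rng(0).normal(size=(H * W, d))
rotated = rope_2d(patches, rows.ravel(), cols.ravel())
print(rotated.shape)  # (15, 8)
```

Because the transformation is a pure rotation, it preserves each patch vector's norm, and the patch at grid position (0, 0) is left unchanged; attention scores between two patches then depend on their relative row and column offsets rather than absolute positions.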

Learning GenAI via SOTA Papers, by Yun Wu