
This academic paper presents MANTIS, a new approach to training large multimodal models (LMMs) to handle interleaved text and images. Instead of relying on massive, potentially noisy pre-training datasets, the researchers developed MANTIS-INSTRUCT, a focused instruction-tuning dataset of 721K instances designed to improve multi-image understanding. The paper evaluates MANTIS on several multi-image and single-image benchmarks, demonstrating that this instruction-tuning approach achieves state-of-the-art performance on multi-image tasks with significantly less compute than previous methods. The research also highlights the importance of the choice of vision encoder and of a well-structured text-image interleaving format for effectively processing multiple images.