
This academic paper presents MANTIS, a new approach to training large multimodal models (LMMs) to handle interleaved text and images. Instead of relying on massive, potentially noisy pre-training datasets, the researchers developed MANTIS-INSTRUCT, a focused instruction-tuning dataset of 721K instances designed to improve multi-image understanding. The paper evaluates MANTIS on several multi-image and single-image benchmarks, demonstrating that this instruction-tuning approach achieves state-of-the-art performance on multi-image tasks with significantly less compute than previous methods. The research also highlights the importance of the choice of vision encoder and of a well-structured text-image interleaving format for effectively processing multiple images.