October 14, 2024

[Apple] MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

12 minutes

I. Introduction to MM1.5

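MM1.5 is a family of multimodal large language models (MLLMs) that includes dense models (from 1B to 30B parameters) and mixture-of-experts (MoE) models. It is a significant upgrade over MM1 [118] and handles a wide range of multimodal tasks well.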

II. Key Capabilities of MM1.5

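Visual referring and grounding: MM1.5 offers strong fine-grained image understanding and can interpret both text prompts and visual prompts such as points and bounding boxes.
"MM1.5 offers robust, fine-grained image understanding, extending beyond text prompts to interpret visual prompts such as points and bounding boxes."
Multi-image reasoning and in-context learning: thanks to large-scale interleaved pre-training, MM1.5 has strong in-context learning and multi-image reasoning capabilities.
"MM1.5 benefits from large-scale interleaved pre-training, resulting in strong in-context learning and multi-image reasoning capabilities right out of the box."
Scalability: the MM1.5 architecture scales well, up to 30B parameters, and achieves competitive performance across a range of benchmarks.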
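
To make the idea of a bounding-box visual prompt concrete, here is a minimal sketch of one way a box might be written into a text prompt. The function names, the 0 to 999 integer normalization, and the prompt wording are all illustrative assumptions; this summary does not specify MM1.5's actual coordinate format.

```python
def box_to_text(box, image_width, image_height, bins=1000):
    """Format a pixel-space box (x1, y1, x2, y2) as normalized integers.

    Assumption: coordinates are scaled to integers in [0, bins - 1].
    This is a common convention, not a confirmed MM1.5 detail.
    """
    x1, y1, x2, y2 = box
    if not (0 <= x1 <= x2 <= image_width and 0 <= y1 <= y2 <= image_height):
        raise ValueError("box must lie inside the image")

    def scale(value, size):
        # Map [0, size] onto integer bins, clamping the right edge.
        return min(bins - 1, int(value / size * bins))

    coords = [
        scale(x1, image_width),
        scale(y1, image_height),
        scale(x2, image_width),
        scale(y2, image_height),
    ]
    return "[" + ", ".join(str(c) for c in coords) + "]"


def build_referring_prompt(question, box, image_width, image_height):
    """Attach the normalized box to a question (hypothetical prompt wording)."""
    return f"{question} Region: {box_to_text(box, image_width, image_height)}"


print(box_to_text((100, 50, 300, 200), 400, 400))  # [250, 125, 750, 500]
```

Normalizing by image size keeps the text representation independent of the input resolution, which is why a scheme like this is often used when boxes have to appear in plain text.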

III. MM1.5 Variants

IV. Key Design Choices in MM1.5

Dynamic image splitting: also known as AnyRes [101], used for high-resolution image comprehension.
"Besides data ablation, we also provide detailed ablation regarding dynamic image splitting, also known as AnyRes [101] (Section 3.5, also see Figure 1), for high-resolution image comprehension."
Coordinate tokens: used for visual referring and grounding, so that text output can be tied to image bounding boxes. By contrast, even strong proprietary models such as GPT-4o rely on set-of-mark (SoM) prompting [167] to reference image regions.
"MM1.5 can generate grounded responses by grounding text output with image bounding boxes. This capability is notably under-explored in most open-source models (e.g., LLaVA-OneVision [74] and Phi-3-Vision [3]), and even in strong proprietary models like GPT-4o, which rely on set-of-mark (SoM) prompting [167] to reference image regions."
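
The general idea behind dynamic image splitting can be sketched as follows: pick a tile grid that roughly matches the image's aspect ratio, then cut the resized image into fixed-size tiles for the vision encoder. This is a simplified illustration, not the paper's selection rule. The function names, the tie-breaking rule, the `max_tiles` budget, and the tile size of 336 are all assumptions made for the example.

```python
import math


def choose_grid(width, height, max_tiles=4):
    """Pick (rows, cols) with rows * cols <= max_tiles whose aspect ratio
    is closest to the image's. Simplified rule, assumed for illustration."""
    target = width / height
    best = None
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles // rows + 1):
            # Compare in log space so 2:1 and 1:2 mismatches count equally.
            error = abs(math.log((cols / rows) / target))
            # Prefer the closer aspect ratio; break ties with more tiles.
            key = (error, -(rows * cols))
            if best is None or key < best[0]:
                best = (key, (rows, cols))
    return best[1]


def tile_boxes(rows, cols, tile_size=336):
    """Return (left, top, right, bottom) crops, row by row, for an image
    already resized to (cols * tile_size) x (rows * tile_size)."""
    return [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]


rows, cols = choose_grid(1344, 672)
print(rows, cols)              # 1 2
print(tile_boxes(rows, cols))  # [(0, 0, 336, 336), (336, 0, 672, 336)]
```

A wide image gets a wide grid and a tall image gets a tall one, so each tile keeps roughly the original proportions instead of being squashed into a single square input.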

V. The Importance of the Data Mixture

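Effect of each data category on model performance: text-rich data significantly improves the average scores on text-rich and knowledge benchmarks. Science data improves the average knowledge score. Referring and grounding data is what gives the model that capability.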
"Text-rich data significantly improves text-rich and knowledge benchmarks on average. Science data improves knowledge average score. Referring and grounding data enables this capability."
Mixture ratios of single-image, multi-image, and text-only data:
"Mixture of single-image, multi-image, and text-only data. Now, we study the mixture ratios, w_single, w_multi and w_text."
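
One simple way to apply mixture ratios like these is to normalize the three weights and draw each training example's category in proportion to them. The sketch below assumes that reading. The function names and category labels are hypothetical, and the weights in the usage lines are arbitrary placeholders, not values from the paper.

```python
import random
from collections import Counter


def normalize_weights(w_single, w_multi, w_text):
    """Turn raw mixture weights into probabilities that sum to 1."""
    total = w_single + w_multi + w_text
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {
        "single_image": w_single / total,
        "multi_image": w_multi / total,
        "text_only": w_text / total,
    }


def sample_categories(weights, num_samples, seed=0):
    """Draw a data category for each training example, with replacement."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=num_samples)


# Placeholder weights chosen only for illustration.
weights = normalize_weights(w_single=6, w_multi=3, w_text=1)
counts = Counter(sample_categories(weights, num_samples=10_000))
print(weights)
print(counts)
```

Sampling by category first, and only then picking an example within that category, keeps the mixture ratio fixed even when the underlying datasets differ greatly in size.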

VI. Comparison with Other SOTA Models

On several benchmarks, MM1.5 performs on par with or better than other SOTA models. See the tables in the original paper for the detailed comparison.

VII. Summary

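MM1.5 is a family of capable multimodal large language models with broad multimodal understanding and reasoning abilities and competitive performance on several benchmarks. The model's open-source release is expected to advance research and applications in the multimodal field.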


By DjvuLee