Seventy3

[Episode 129] Sa2VA: SAM2 + LLaVA



Seventy3: We use NotebookLM to turn papers into podcasts, so everyone can keep learning alongside AI.

Today's topic: Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Summary

The research introduces Sa2VA, a unified model for understanding images and videos. Sa2VA combines the strengths of SAM-2 (video segmentation) and LLaVA (vision-language model) to perform various tasks like referring segmentation and conversation. A new dataset, Ref-SAV, with complex video scenes, was created to improve model performance. Experiments show Sa2VA achieves state-of-the-art results across multiple benchmarks, particularly in referring video object segmentation. The code, dataset, and models are publicly available.

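The coupling the summary describes — a LLaVA-style vision-language model whose answer includes a special segmentation token, with that token's hidden state used to prompt a SAM-2-style mask decoder — can be sketched roughly as follows. This is a toy illustration under assumed names: `ToyLLM`, `ToyMaskDecoder`, `segment_by_reference`, and the `[SEG]` token handling here are hypothetical stand-ins, not the actual Sa2VA code or API.

```python
# Hedged sketch of a Sa2VA-style coupling: the VLM answers a referring
# query and emits a special "[SEG]" token; that token's hidden state is
# turned into a prompt embedding for a SAM-2-style mask decoder.
# All class and function names are illustrative placeholders.

import random

SEG_TOKEN = "[SEG]"

class ToyLLM:
    """Stand-in for the LLaVA-style VLM: returns tokens plus per-token states."""
    def generate(self, image, prompt):
        tokens = ["The", "dog", "is", SEG_TOKEN]
        # Fake 4-dimensional hidden state per token.
        states = {t: [random.random() for _ in range(4)] for t in tokens}
        return tokens, states

class ToyMaskDecoder:
    """Stand-in for SAM-2's mask decoder, prompted by an embedding."""
    def predict_mask(self, image, prompt_embedding):
        # Toy logic: threshold pixels by the mean of the prompt embedding.
        thr = sum(prompt_embedding) / len(prompt_embedding)
        return [[1 if px > thr else 0 for px in row] for row in image]

def segment_by_reference(image, query):
    llm, decoder = ToyLLM(), ToyMaskDecoder()
    tokens, states = llm.generate(image, query)
    if SEG_TOKEN not in tokens:          # plain conversation: no mask needed
        return tokens, None
    seg_embedding = states[SEG_TOKEN]    # real models would project this first
    mask = decoder.predict_mask(image, seg_embedding)
    return tokens, mask

image = [[0.1, 0.9], [0.8, 0.2]]         # 2x2 toy "image"
tokens, mask = segment_by_reference(image, "segment the dog")
```

The design point this mirrors is that one generative pass serves both tasks: when the model is only chatting, no segmentation token appears and no mask is produced; when the query refers to an object, the same output stream carries the cue that drives the decoder.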

Paper: https://arxiv.org/abs/2501.04001


Seventy3, by 任雨山