December 03, 2024

2024.12.03 Daily AI Papers | X-Prompt improves image generation; GATE OpenING evaluates interleaved image-text generation.

16 minutes

The 24 papers in this episode are:

[00:23] 🖼 X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

[00:58] 📊 GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

[01:32] 🖼 Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

[02:09] 🎥 Open-Sora Plan: Open-Source Large Video Generation Model

[02:55] 🎥 TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video

[03:37] 🤖 o1-Coder: an o1 Replication for Coding

[04:12] 🤖 SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

[04:49] 🎥 VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

[05:38] 🔍 TinyFusion: Diffusion Transformers Learned Shallow

[06:19] 🔍 VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

[06:52] 🎙 FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

[07:32] 🚀 Efficient Track Anything

[08:15] 🌊 Steering Rectified Flow Models in the Vector Field for Controlled Image Generation

[08:50] 🎥 Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

[09:33] 📹 WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

[10:11] 🔍 VLSBench: Unveiling Visual Leakage in Multimodal Safety

[10:51] 🧠 VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

[11:41] 🎮 PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

[12:14] 🗣 Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input

[12:51] 🌍 INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

[13:28] 🎨 Art-Free Generative Models: Art Creation Without Graphic Art Knowledge

[14:02] 📈 A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models

[14:41] 🌐 World-consistent Video Diffusion with Explicit 3D Modeling

[15:22] 🔊 Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning

[Follow Us]

You can also find us on the following platform for more information beyond the podcast:

Xiaohongshu (小红书): AI速递


By duan
