June 07, 2024

Ep. 240 - Part 2 - June 6, 2024

52 minutes

ArXiv Computer Vision research for Thursday, June 06, 2024.

00:20: M3LEO: A Multi-Modal, Multi-Label Earth Observation Dataset Integrating Interferometric SAR and RGB Data

02:34: Understanding Information Storage and Transfer in Multi-modal Large Language Models

04:27: Conv-INR: Convolutional Implicit Neural Representation for Multimodal Visual Signals

06:01: Localized Gaussian Point Management

07:59: A Survey on 3D Human Avatar Modeling -- From Reconstruction to Generation

09:25: GeoGen: Geometry-Aware Generative Modeling via Signed Distance Functions

11:07: MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

13:02: ELFS: Enhancing Label-Free Coreset Selection via Clustering-based Pseudo-Labeling

14:39: VideoTetris: Towards Compositional Text-to-Video Generation

16:00: SpectralZoom: Efficient Segmentation with an Adaptive Hyperspectral Camera

17:04: Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment

18:51: Neural Surface Reconstruction from Sparse Views Using Epipolar Geometry

20:05: Vision-LSTM: xLSTM as Generic Vision Backbone

21:01: ReFiNe: Recursive Field Networks for Cross-modal Multi-scene Representation

22:03: ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization

23:43: Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step

25:32: Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking

27:23: VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

28:33: DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

30:24: SF-V: Single Forward Video Generation Model

31:51: ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

34:06: Parameter-Inverted Image Pyramid Networks

35:50: Coarse-To-Fine Tensor Trains for Compact Visual Representations

37:23: BitsFusion: 1.99 bits Weight Quantization of Diffusion Model

38:37: DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

40:24: Coherent Zero-Shot Visual Instruction Generation

41:17: Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion

42:58: RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation

44:56: GLACE: Global Local Accelerated Coordinate Encoding

46:43: Interpreting the Second-Order Effects of Neurons in CLIP

48:03: Learning 1D Causal Visual Representation with De-focus Attention Networks

49:41: Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image

51:14: Stereo-Depth Fusion through Virtual Pattern Projection

By Brad Edwards