
ArXiv Computer Vision research for Thursday, June 06, 2024.
00:20: M3LEO: A Multi-Modal, Multi-Label Earth Observation Dataset Integrating Interferometric SAR and RGB Data
02:34: Understanding Information Storage and Transfer in Multi-modal Large Language Models
04:27: Conv-INR: Convolutional Implicit Neural Representation for Multimodal Visual Signals
06:01: Localized Gaussian Point Management
07:59: A Survey on 3D Human Avatar Modeling -- From Reconstruction to Generation
09:25: GeoGen: Geometry-Aware Generative Modeling via Signed Distance Functions
11:07: MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding
13:02: ELFS: Enhancing Label-Free Coreset Selection via Clustering-based Pseudo-Labeling
14:39: VideoTetris: Towards Compositional Text-to-Video Generation
16:00: SpectralZoom: Efficient Segmentation with an Adaptive Hyperspectral Camera
17:04: Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment
18:51: Neural Surface Reconstruction from Sparse Views Using Epipolar Geometry
20:05: Vision-LSTM: xLSTM as Generic Vision Backbone
21:01: ReFiNe: Recursive Field Networks for Cross-modal Multi-scene Representation
22:03: ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization
23:43: Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step
25:32: Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking
27:23: VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
28:33: DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data
30:24: SF-V: Single Forward Video Generation Model
31:51: ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
34:06: Parameter-Inverted Image Pyramid Networks
35:50: Coarse-To-Fine Tensor Trains for Compact Visual Representations
37:23: BitsFusion: 1.99 bits Weight Quantization of Diffusion Model
38:37: DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
40:24: Coherent Zero-Shot Visual Instruction Generation
41:17: Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion
42:58: RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation
44:56: GLACE: Global Local Accelerated Coordinate Encoding
46:43: Interpreting the Second-Order Effects of Neurons in CLIP
48:03: Learning 1D Causal Visual Representation with De-focus Attention Networks
49:41: Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image
51:14: Stereo-Depth Fusion through Virtual Pattern Projection