


This academic paper introduces DiMSUM, an architecture for image generation that enhances diffusion models by integrating spatial and frequency information. To address the limitations of state-space models such as Mamba on image data, the authors incorporate wavelet transforms and a cross-attention fusion layer, which together capture both local details and long-range dependencies more effectively. The model also includes globally shared transformer blocks to improve global relationship modeling, a known weakness of Mamba. According to the authors' experiments, DiMSUM achieves better image quality and faster training convergence than current state-of-the-art models on several benchmarks.
Source: 2025 - https://arxiv.org/pdf/2411.04168 - DiMSUM : Diffusion Mamba - A Scalable and
Unified Spatial-Frequency Meth
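
For listeners unfamiliar with the "frequency information" mentioned above, here is a minimal sketch of a single-level 2D wavelet decomposition. It uses the Haar wavelet purely as an assumption for illustration; the summary does not say which wavelet the paper uses, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical, not taken from the authors' code. The sketch only shows how an image splits into one low-frequency band and three detail bands, and that the split is lossless.

```python
import numpy as np


def haar_dwt2(image):
    """Single-level 2D Haar transform of an (H, W) array with even H and W.

    Illustrative only: not the DiMSUM implementation.
    """
    # The four pixels of every non-overlapping 2x2 block.
    a = image[0::2, 0::2]  # top-left
    b = image[0::2, 1::2]  # top-right
    c = image[1::2, 0::2]  # bottom-left
    d = image[1::2, 1::2]  # bottom-right

    low = (a + b + c + d) / 2.0        # coarse, low-frequency content
    col_diff = (a - b + c - d) / 2.0   # change between left and right columns
    row_diff = (a + b - c - d) / 2.0   # change between top and bottom rows
    diag_diff = (a - b - c + d) / 2.0  # diagonal change
    return low, col_diff, row_diff, diag_diff


def haar_idwt2(low, col_diff, row_diff, diag_diff):
    """Invert haar_dwt2, recovering the original (2H, 2W) array."""
    h, w = low.shape
    out = np.empty((2 * h, 2 * w), dtype=low.dtype)
    out[0::2, 0::2] = (low + col_diff + row_diff + diag_diff) / 2.0
    out[0::2, 1::2] = (low - col_diff + row_diff - diag_diff) / 2.0
    out[1::2, 0::2] = (low + col_diff - row_diff - diag_diff) / 2.0
    out[1::2, 1::2] = (low - col_diff - row_diff + diag_diff) / 2.0
    return out


if __name__ == "__main__":
    img = np.arange(16, dtype=np.float64).reshape(4, 4)
    bands = haar_dwt2(img)
    # Each band has half the height and width of the input.
    print([band.shape for band in bands])
    # The decomposition is invertible, so no information is lost.
    print(np.allclose(haar_idwt2(*bands), img))
```

Each band is a quarter the size of the input, so a model can process coarse structure and fine detail as separate, shorter token sequences before fusing them with the spatial features.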
By mcgrof