The September 19, 2023 Alibaba paper introduces Flash-LLM, a software framework designed to enable cost-effective and highly efficient inference for large generative models by supporting unstructured sparsity on high-performance tensor cores. The authors observe that the primary bottleneck in large language model (LLM) inference is memory bandwidth during the "skinny" matrix multiplications of token-by-token decoding, rather than the arithmetic throughput of the tensor cores themselves. Flash-LLM addresses this with a "Load-as-Sparse and Compute-as-Dense" methodology: weights are loaded from global memory in a compressed sparse format to minimize memory traffic, then expanded to a dense format in on-chip memory so the tensor cores can be utilized efficiently. Extensive evaluations show that Flash-LLM significantly outperforms state-of-the-art sparse kernels such as Sputnik and SparTA at the kernel level, and achieves substantial end-to-end throughput improvements and lower inference costs than frameworks like DeepSpeed and FasterTransformer on large OPT models. The paper also details the specialized techniques behind the framework, including a Tiled-CSL sparse storage format and a two-level overlapping computation pipeline. Source: https://arxiv.org/pdf/2309.10285
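The "Load-as-Sparse and Compute-as-Dense" idea can be sketched in plain Python. This is a hedged, simplified illustration, not the paper's CUDA implementation: the function names, the per-tile nonzero-list encoding (a loose stand-in for Tiled-CSL), and the `scratch` buffer (standing in for on-chip shared memory feeding the tensor cores) are all assumptions for the sake of the example.

```python
def tiled_compress(W, tile_m=4, tile_k=4):
    """Compress a sparse weight matrix into per-tile nonzero lists.

    Hypothetical, simplified stand-in for the paper's Tiled-CSL format:
    for each tile, keep only (local_row, local_col, value) entries.
    """
    m, k = len(W), len(W[0])
    tiles = {}
    for r0 in range(0, m, tile_m):
        for c0 in range(0, k, tile_k):
            nz = [(r, c, W[r0 + r][c0 + c])
                  for r in range(min(tile_m, m - r0))
                  for c in range(min(tile_k, k - c0))
                  if W[r0 + r][c0 + c] != 0.0]
            tiles[(r0, c0)] = nz
    return tiles

def spmv_load_sparse_compute_dense(tiles, m, k, x, tile_m=4, tile_k=4):
    """Compute y = W @ x from the compressed tiles.

    'Load-as-Sparse': only the nonzeros are read from memory.
    'Compute-as-Dense': each tile is first expanded into a dense scratch
    buffer (the analogue of on-chip shared memory), then multiplied
    densely, as a tensor core would.
    """
    y = [0.0] * m
    for (r0, c0), nz in tiles.items():
        scratch = [[0.0] * tile_k for _ in range(tile_m)]
        for r, c, v in nz:
            scratch[r][c] = v  # sparse-to-dense extraction on-chip
        for r in range(min(tile_m, m - r0)):
            acc = 0.0
            for c in range(min(tile_k, k - c0)):
                acc += scratch[r][c] * x[c0 + c]
            y[r0 + r] += acc  # dense per-tile accumulation
    return y
```

Only the compressed nonzeros cross the (simulated) memory boundary, while the inner multiply always operates on a dense tile; on a GPU that dense multiply would map to tensor-core MMA instructions.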