This episode explores a 2025 arXiv paper on CXL-based computational memory, focusing on how partial offloading should be structured so that applications actually run faster end to end, rather than merely showing lower kernel-launch overhead. It explains why the central challenge is coordination between the host CPU and near-memory compute on a remote CXL memory device, especially for memory-bound workloads such as graph analytics, sparse retrieval, database-style processing, and KV-cache-heavy inference. The discussion contrasts two existing offloading models, Remote Polling and Bulk Synchronous Flow, arguing that the former becomes too chatty while the latter introduces lockstep stalls, and that the choice of communication semantics (CXL.io versus CXL.mem) fundamentally shapes performance. Listeners should find this interesting because it reframes near-memory computing as a systems and scheduling problem, not just a kernel-selection problem, with direct implications for emerging disaggregated-memory and CXL deployments.
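The tension between the two offloading models can be illustrated with a toy cost model (our own sketch for this episode, not taken from the paper; every function name, parameter, and number below is a hypothetical assumption):

```python
import math

def remote_polling_time(n_items, link_rtt_us, item_compute_us):
    """Host submits and polls each work item individually over the link:
    one round trip of coordination traffic per item, so link latency
    dominates. This is the 'too chatty' failure mode."""
    return n_items * (link_rtt_us + item_compute_us)

def bulk_synchronous_time(n_items, batch_size, link_rtt_us,
                          item_compute_us, host_work_us_per_batch):
    """Host ships a whole batch, then waits at a barrier for the device.
    Coordination traffic is amortized across the batch, but host and
    device never overlap: each side idles while the other runs
    (the 'lockstep stall' failure mode)."""
    n_batches = math.ceil(n_items / batch_size)
    per_batch = (link_rtt_us
                 + batch_size * item_compute_us   # device-side compute
                 + host_work_us_per_batch)        # host-side serial phase
    return n_batches * per_batch

if __name__ == "__main__":
    n = 100_000      # work items, e.g. graph edges to scan near memory
    rtt = 2.0        # assumed per-message round trip over the link, in µs
    compute = 0.05   # assumed near-memory compute per item, in µs
    polling = remote_polling_time(n, rtt, compute)
    bulk = bulk_synchronous_time(n, batch_size=4096, link_rtt_us=rtt,
                                 item_compute_us=compute,
                                 host_work_us_per_batch=100.0)
    print(f"remote polling:   {polling / 1e3:.1f} ms")  # latency-bound
    print(f"bulk synchronous: {bulk / 1e3:.1f} ms")     # stall-bound
```

Under these made-up numbers the per-item round trips of polling dwarf the batched variant, yet the bulk model's total still grows with the serial host phase at every barrier, which is roughly the coordination gap the paper's partial-offloading design aims to close.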
Sources:
1. Offloading to CXL-based Computational Memory — Suyeon Lee, Kangkyu Park, Kwangsik Shin, Ada Gavrilovska, 2025
http://arxiv.org/abs/2512.04449
2. A Case for Memory-Centric HPC System Design — Dong Li, Jeffrey S. Vetter, et al., 2015
https://scholar.google.com/scholar?q=A+Case+for+Memory-Centric+HPC+System+Design
3. The Datacenter as a Computer: Designing Warehouse-Scale Machines, Third Edition — Luiz André Barroso, Urs Hölzle, Parthasarathy Ranganathan, 2018
https://scholar.google.com/scholar?q=The+Datacenter+as+a+Computer:+Designing+Warehouse-Scale+Machines,+Third+Edition
4. A Roofline Model of Energy — Samuel Williams, Andrew Waterman, David Patterson, et al., 2014
https://scholar.google.com/scholar?q=A+Roofline+Model+of+Energy
5. M2NDP: A Near-Memory Processing Architecture for CXL Memory Expansion — author list not specified in excerpt, 2024
https://scholar.google.com/scholar?q=M2NDP:+A+Near-Memory+Processing+Architecture+for+CXL+Memory+Expansion
6. M2NDP — author and year not specified in excerpt
https://scholar.google.com/scholar?q=M2NDP
7. Compute Express Link Specification / CXL 2.0 and 3.0 ecosystem references — Compute Express Link Consortium, 2020-2022
https://scholar.google.com/scholar?q=Compute+Express+Link+Specification+/+CXL+2.0+and+3.0+ecosystem+references
8. AIFM: High-Performance, Application-Integrated Far Memory — Zhenyuan Ruan et al., 2020
https://scholar.google.com/scholar?q=AIFM:+High-Performance,+Application-Integrated+Far+Memory
9. Infiniswap — Juncheng Gu et al., 2017
https://scholar.google.com/scholar?q=Infiniswap
10. LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation — Yizhou Shan et al., 2018
https://scholar.google.com/scholar?q=LegoOS:+A+Disseminated,+Distributed+OS+for+Hardware+Resource+Disaggregation
11. Prior offloading work on CXL and near-memory processing, cited in the paper as [8], [10]–[14], [16], [19], [26]–[28], [30] — various authors and years
https://scholar.google.com/scholar?q=CXL-+and+near-memory-processing-related+prior+offloading+works+cited+as+[8],+[19],+[11],+[28],+[14],+[10],+[30],+[13],+[12],+[27],+[26],+[16]
12. Dissecting CXL Memory Performance at Scale: Analysis, Modeling, and Optimization — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=Dissecting+CXL+Memory+Performance+at+Scale:+Analysis,+Modeling,+and+Optimization
13. TRACE: Unlocking Effective CXL Bandwidth via Lossless Compression and Precision Scaling — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=TRACE:+Unlocking+Effective+CXL+Bandwidth+via+Lossless+Compression+and+Precision+Scaling
14. Remote Memory Prefetching: Is Coarse-grained Fine? — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=Remote+Memory+Prefetching:+Is+Coarse-grained+Fine?
15. IBEX: Internal Bandwidth-Efficient Compression Architecture for Scalable CXL Memory Expansion — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=IBEX:+Internal+Bandwidth-Efficient+Compression+Architecture+for+Scalable+CXL+Memory+Expansion
16. A Near CXL Memory Processing Architecture for Distributed Graph Neural Network Inference and Training — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=A+Near+CXL+Memory+Processing+Architecture+for+Distributed+Graph+Neural+Network+Inference+and+Training
17. Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=Scalable+Processing-Near-Memory+for+1M-Token+LLM+Inference:+CXL-Enabled+KV-Cache+Management+Beyond+GPU+Limits
18. Enabling Efficient Large Recommendation Model Training with Near CXL Memory Processing — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=Enabling+Efficient+Large+Recommendation+Model+Training+with+Near+CXL+Memory+Processing
19. Towards Continuous Checkpointing for HPC Systems Using CXL — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=Towards+Continuous+Checkpointing+for+HPC+Systems+Using+CXL
20. System Suspend with Asynchronous Resume using CXL-Based Persistent Memory — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=System+Suspend+with+Asynchronous+Resume+using+CXL-Based+Persistent+Memory
21. AI Post Transformers: Xerxes: CXL 3.0 Simulation for Scalable Memory Systems — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-18-xerxes-cxl-30-simulation-for-scalable-me-fdc3f1.mp3
22. AI Post Transformers: CXL-SpecKV: Bridging the LLM Memory Wall with Speculative FPGA Disaggregation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/cxl-speckv-bridging-the-llm-memory-wall-with-speculative-fpga-disaggregation/
23. AI Post Transformers: ByteCheckpoint: A Unified LLM Checkpointing System — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/bytecheckpoint-a-unified-llm-checkpointing-system/
24. AI Post Transformers: Teraio: Cost-Efficient LLM Training via Lifetime-Aware Tensor Offloading — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/teraio-cost-efficient-llm-training-via-lifetime-aware-tensor-offloading/
Interactive Visualization: CXL Computational Memory Offloading for Lower Runtime