This episode explores a 2025 arXiv paper on CXL-based computational memory, focusing on how partial offloading should be structured so that applications actually run faster end to end, rather than merely showing lower kernel-launch overhead. It explains why the central challenge is coordination between the host CPU and near-memory compute on a remote CXL memory device, especially for memory-bound workloads such as graph analytics, sparse retrieval, database-style processing, and KV-cache-heavy inference. The discussion contrasts two existing offloading models, Remote Polling and Bulk Synchronous Flow, arguing that the former becomes too chatty while the latter introduces lockstep stalls, and that the choice of communication semantics (CXL.io versus CXL.mem) fundamentally shapes performance. Listeners should find this interesting because it reframes near-memory computing as a systems and scheduling problem, not just a kernel-selection problem, with direct implications for emerging disaggregated-memory and CXL deployments.
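The tension between the two offloading models can be illustrated with a toy cost model (our own sketch for this episode, not taken from the paper; every function name, parameter, and number below is a hypothetical assumption):

```python
import math

def remote_polling_time(n_items, link_rtt_us, item_compute_us):
    """Host submits and polls each work item individually over the link:
    one round trip of coordination traffic per item, so link latency
    dominates. This is the 'too chatty' failure mode."""
    return n_items * (link_rtt_us + item_compute_us)

def bulk_synchronous_time(n_items, batch_size, link_rtt_us,
                          item_compute_us, host_work_us_per_batch):
    """Host ships a whole batch, then waits at a barrier for the device.
    Coordination traffic is amortized across the batch, but host and
    device never overlap: each side idles while the other runs
    (the 'lockstep stall' failure mode)."""
    n_batches = math.ceil(n_items / batch_size)
    per_batch = (link_rtt_us
                 + batch_size * item_compute_us   # device-side compute
                 + host_work_us_per_batch)        # host-side serial phase
    return n_batches * per_batch

if __name__ == "__main__":
    n = 100_000      # work items, e.g. graph edges to scan near memory
    rtt = 2.0        # assumed per-message round trip over the link, in µs
    compute = 0.05   # assumed near-memory compute per item, in µs
    polling = remote_polling_time(n, rtt, compute)
    bulk = bulk_synchronous_time(n, batch_size=4096, link_rtt_us=rtt,
                                 item_compute_us=compute,
                                 host_work_us_per_batch=100.0)
    print(f"remote polling:   {polling / 1e3:.1f} ms")  # latency-bound
    print(f"bulk synchronous: {bulk / 1e3:.1f} ms")     # stall-bound
```

Under these made-up numbers the per-item round trips of polling dwarf the batched variant, yet the bulk model's total still grows with the serial host phase at every barrier, which is roughly the coordination gap the paper's partial-offloading design aims to close.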
Sources:
1. Offloading to CXL-based Computational Memory — Suyeon Lee, Kangkyu Park, Kwangsik Shin, Ada Gavrilovska, 2025
http://arxiv.org/abs/2512.04449
2. A Case for Memory-Centric HPC System Design — Dong Li, Jeffrey S. Vetter, et al., 2015
https://scholar.google.com/scholar?q=A+Case+for+Memory-Centric+HPC+System+Design
3. The Datacenter as a Computer: Designing Warehouse-Scale Machines, Third Edition — Luiz André Barroso, Urs Hölzle, Parthasarathy Ranganathan, 2018
https://scholar.google.com/scholar?q=The+Datacenter+as+a+Computer:+Designing+Warehouse-Scale+Machines,+Third+Edition
4. A Roofline Model of Energy — Samuel Williams, Andrew Waterman, David Patterson, et al., 2014
https://scholar.google.com/scholar?q=A+Roofline+Model+of+Energy
5. M2NDP: A Near-Memory Processing Architecture for CXL Memory Expansion — author list not specified in excerpt, 2024
https://scholar.google.com/scholar?q=M2NDP:+A+Near-Memory+Processing+Architecture+for+CXL+Memory+Expansion
6. M2NDP — author and year not specified in excerpt
https://scholar.google.com/scholar?q=M2NDP
7. Compute Express Link Specification / CXL 2.0 and 3.0 ecosystem references — Compute Express Link Consortium, 2020-2022
https://scholar.google.com/scholar?q=Compute+Express+Link+Specification+/+CXL+2.0+and+3.0+ecosystem+references
8. AIFM: High-Performance, Application-Integrated Far Memory — Zhenyuan Ruan et al., 2020
https://scholar.google.com/scholar?q=AIFM:+High-Performance,+Application-Integrated+Far+Memory
9. Infiniswap — Juncheng Gu et al., 2017
https://scholar.google.com/scholar?q=Infiniswap
10. LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation — Yizhou Shan et al., 2018
https://scholar.google.com/scholar?q=LegoOS:+A+Disseminated,+Distributed+OS+for+Hardware+Resource+Disaggregation
11. Prior offloading work on CXL and near-memory processing, cited in the paper as [8], [10]–[14], [16], [19], [26]–[28], [30] — various authors and years
https://scholar.google.com/scholar?q=CXL-+and+near-memory-processing-related+prior+offloading+works+cited+as+[8],+[19],+[11],+[28],+[14],+[10],+[30],+[13],+[12],+[27],+[26],+[16]
12. Dissecting CXL Memory Performance at Scale: Analysis, Modeling, and Optimization — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=Dissecting+CXL+Memory+Performance+at+Scale:+Analysis,+Modeling,+and+Optimization
13. TRACE: Unlocking Effective CXL Bandwidth via Lossless Compression and Precision Scaling — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=TRACE:+Unlocking+Effective+CXL+Bandwidth+via+Lossless+Compression+and+Precision+Scaling
14. Remote Memory Prefetching: Is Coarse-grained Fine? — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=Remote+Memory+Prefetching:+Is+Coarse-grained+Fine?
15. IBEX: Internal Bandwidth-Efficient Compression Architecture for Scalable CXL Memory Expansion — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=IBEX:+Internal+Bandwidth-Efficient+Compression+Architecture+for+Scalable+CXL+Memory+Expansion
16. A Near CXL Memory Processing Architecture for Distributed Graph Neural Network Inference and Training — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=A+Near+CXL+Memory+Processing+Architecture+for+Distributed+Graph+Neural+Network+Inference+and+Training
17. Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=Scalable+Processing-Near-Memory+for+1M-Token+LLM+Inference:+CXL-Enabled+KV-Cache+Management+Beyond+GPU+Limits
18. Enabling Efficient Large Recommendation Model Training with Near CXL Memory Processing — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=Enabling+Efficient+Large+Recommendation+Model+Training+with+Near+CXL+Memory+Processing
19. Towards Continuous Checkpointing for HPC Systems Using CXL — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=Towards+Continuous+Checkpointing+for+HPC+Systems+Using+CXL
20. System Suspend with Asynchronous Resume using CXL-Based Persistent Memory — author list not specified in excerpt, 2024/2025
https://scholar.google.com/scholar?q=System+Suspend+with+Asynchronous+Resume+using+CXL-Based+Persistent+Memory
21. AI Post Transformers: Xerxes: CXL 3.0 Simulation for Scalable Memory Systems — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/2026-03-18-xerxes-cxl-30-simulation-for-scalable-me-fdc3f1.mp3
22. AI Post Transformers: CXL-SpecKV: Bridging the LLM Memory Wall with Speculative FPGA Disaggregation — Hal Turing & Dr. Ada Shannon, 2026
https://podcast.do-not-panic.com/episodes/cxl-speckv-bridging-the-llm-memory-wall-with-speculative-fpga-disaggregation/
23. AI Post Transformers: ByteCheckpoint: A Unified LLM Checkpointing System — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/bytecheckpoint-a-unified-llm-checkpointing-system/
24. AI Post Transformers: Teraio: Cost-Efficient LLM Training via Lifetime-Aware Tensor Offloading — Hal Turing & Dr. Ada Shannon, 2025
https://podcast.do-not-panic.com/episodes/teraio-cost-efficient-llm-training-via-lifetime-aware-tensor-offloading/
Interactive Visualization: CXL Computational Memory Offloading for Lower Runtime