MLOps.community

Boosting LLM/RAG Workflows & Scheduling w/ Composable Memory and Checkpointing // Bernie Wu // #270


Listen Later

Bernie Wu is VP of Business Development for MemVerge. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies, including companies such as Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft.


Boosting LLM/RAG Workflows & Scheduling w/ Composable Memory and Checkpointing // MLOps Podcast #270 with Bernie Wu, VP Strategic Partnerships/Business Development of MemVerge.


// Abstract

Limited memory capacity hinders the performance and potential of research and production environments utilizing Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques. This discussion explores how leveraging industry-standard CXL memory can be configured as a secondary, composable memory tier to alleviate this constraint.


We will highlight some recent work we’ve done in integrating this novel class of memory into LLM/RAG/vector database frameworks and workflows. Disaggregated shared memory is envisioned to offer high-performance, low-latency caches for model/pipeline checkpoints of LLM models, KV caches during distributed inferencing, LORA adaptors, and in-process data for heterogeneous CPU/GPU workflows. We expect to showcase these types of use cases in the coming months.


// Bio

Bernie is VP of Strategic Partnerships/Business Development for MemVerge. His focus has been on building partnerships in the AI/ML, Kubernetes, and CXL memory ecosystems. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies, including companies such as Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft. He is also on the Board of Directors for Cirrus Data Solutions. Bernie has a BS/MS in Engineering from UC Berkeley and an MBA from UCLA.


// MLOps Swag/Merch

https://mlops-community.myshopify.com/


// Related Links

Website: www.memverge.com

Accelerating Data Retrieval in Retrieval Augmentation Generation (RAG) Pipelines using CXL: https://memverge.com/accelerating-data-retrieval-in-rag-pipelines-using-cxl/

Do Re MI for Training Metrics: Start at the Beginning // Todd Underwood // AIQCON: https://youtu.be/DxyOlRdCofo

Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // MLOps Podcast #228: https://youtu.be/6MY-IgqiTpg

Compute Express Link (CXL) FPGA IP: https://www.intel.com/content/www/us/en/products/details/fpga/intellectual-property/interface-protocols/cxl-ip.htmlUltra Ethernet Consortium: https://ultraethernet.org/

Unified Acceleration (UXL) Foundation: https://www.intel.com/content/www/us/en/developer/articles/news/unified-acceleration-uxl-foundation.html

RoCE networks for distributed AI training at scale: https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/


--------------- ✌️Connect With Us ✌️ -------------

Join our Slack community: https://go.mlops.community/slack

Follow us on Twitter: @mlopscommunity

Sign up for the next meetup: https://go.mlops.community/register

Catch all episodes, blogs, newsletters, and more: https://mlops.community/


Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/

Connect with Bernie on LinkedIn: https://www.linkedin.com/in/berniewu/


Timestamps:

[00:00] Bernie's preferred coffee

[00:11] Takeaways

[01:37] First principles thinking focus

[05:02] Memory Abundance Concept Discussion

[06:45] Managing load spikes

[09:38] GPU checkpointing challenges

[16:29] Distributed memory problem solving

[18:27] Composable and Virtual Memory

[21:49] Interactive chat annotation

[23:46] Memory elasticity in AI

[27:33] GPU networking tests

[29:12] GPU Scheduling workflow optimization

[32:18] Kubernetes Extensions and Tools

[37:14] GPU bottleneck analysis

[42:04] Economical memory strategies

[45:14] Elastic memory management strategies

[47:57] Problem-solving approach

[50:15] AI infrastructure elasticity evolution

[52:33] RDMA and RoCE explained

[54:14] Wrap up

...more
View all episodesView all episodes
Download on the App Store

MLOps.communityBy Demetrios

  • 4.6
  • 4.6
  • 4.6
  • 4.6
  • 4.6

4.6

23 ratings


More shows like MLOps.community

View all
The a16z Show by Andreessen Horowitz

The a16z Show

1,091 Listeners

Software Engineering Daily by Software Engineering Daily

Software Engineering Daily

622 Listeners

Super Data Science: ML & AI Podcast with Jon Krohn by Jon Krohn

Super Data Science: ML & AI Podcast with Jon Krohn

301 Listeners

NVIDIA AI Podcast by NVIDIA

NVIDIA AI Podcast

333 Listeners

Data Engineering Podcast by Tobias Macey

Data Engineering Podcast

146 Listeners

Y Combinator Startup Podcast by Y Combinator

Y Combinator Startup Podcast

228 Listeners

Practical AI by Practical AI LLC

Practical AI

206 Listeners

Machine Learning Street Talk (MLST) by Machine Learning Street Talk (MLST)

Machine Learning Street Talk (MLST)

96 Listeners

Dwarkesh Podcast by Dwarkesh Patel

Dwarkesh Podcast

519 Listeners

No Priors: Artificial Intelligence | Technology | Startups by Conviction

No Priors: Artificial Intelligence | Technology | Startups

132 Listeners

This Day in AI Podcast by Michael Sharkey, Chris Sharkey

This Day in AI Podcast

228 Listeners

AI + a16z by a16z

AI + a16z

36 Listeners

Lightcone Podcast by Y Combinator

Lightcone Podcast

22 Listeners

Training Data by Sequoia Capital

Training Data

40 Listeners

The Pragmatic Engineer by Gergely Orosz

The Pragmatic Engineer

63 Listeners