April 30, 2024

Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // #228

55 minutes

Join us at our first in-person conference on June 25, all about AI Quality: https://www.aiqualityconference.com

Simon Karasik is a proactive and curious ML Engineer with 5 years of experience. He has developed and deployed ML models at web and large scale for Ads and Tax.

Huge thank you to Nebius AI for sponsoring this episode. Nebius AI - https://nebius.ai/

MLOps podcast #228 with Simon Karasik, Machine Learning Engineer at Nebius AI, Handling Multi-Terabyte LLM Checkpoints.

// Abstract

The talk provides a gentle introduction to the topic of LLM checkpointing: why is it hard, and how big are the checkpoints? It covers various tips and tricks for saving and loading multi-terabyte checkpoints, as well as the selection of cloud storage options for checkpointing.

// Bio

Full-stack Machine Learning Engineer, currently working on infrastructure for LLM training, with previous experience in ML for Ads, Speech, and Tax.

// MLOps Jobs board

jobs.mlops.community

// MLOps Swag/Merch

https://mlops-community.myshopify.com/

// Related Links

--------------- ✌️Connect With Us ✌️ -------------

Join our Slack community: https://go.mlops.community/slack

Catch all episodes, blogs, newsletters, and more: https://mlops.community/

Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/

Connect with Simon on LinkedIn: https://www.linkedin.com/in/simon-karasik/

Timestamps:

[00:00] Simon's preferred beverage

[01:23] Takeaways

[04:22] Simon's tech background

[08:42] Zombie models garbage collection

[10:52] The road to LLMs

[15:09] Trained models Simon worked on

[16:26] LLM Checkpoints

[20:36] Confidence in AI Training

[22:07] Different Checkpoints

[25:06] Checkpoint parts

[29:05] Slurm vs Kubernetes

[30:43] Storage choices lessons

[36:02] Paramount components for setup

[37:13] Argo workflows

[39:49] Kubernetes node troubleshooting

[42:35] Cloud virtual machines have pre-installed monitoring

[45:41] Fine-tuning

[48:16] Storage, networking, and complexity in network design

[50:56] Start simple before advanced; consider model needs.

[53:58] Join us at our first in-person conference on June 25, all about AI Quality

By Demetrios
