
Episode 3 is a hardware-first guide to running AI at home. We break down what CPUs, GPUs, NPUs, and TPUs each do in the inference pipeline, why memory capacity isn't the same as performance (model loading, KV cache, and MoE), why backends and runtimes are real constraints (CUDA vs ROCm vs Metal/MLX vs CPU), and how to scale from one box to multi-GPU and multi-machine setups.
Keep your AI on a leash.
Links mentioned:
- GPU Glossary (Modal): https://modal.com/gpu-glossary
- CUDA → ROCm headline: https://wccftech.com/the-claude-code-has-managed-to-port-nvidia-cuda-backend-to-rocm-in-just-30-minutes/
- Unsloth PR: https://github.com/unslothai/unsloth/pull/3856
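
A rough sketch of the memory point from the episode description: the weights and the KV cache both have to fit in memory, and the cache grows with context length. The formula is the standard back-of-the-envelope estimate for a transformer with grouped KV heads. The model shape used here (32 layers, 8 KV heads, head size 128, 8B parameters) is a hypothetical example, not a model discussed in the episode.

```python
def weight_bytes(n_params: int, bits_per_weight: int) -> int:
    """Approximate memory for the model weights at a given quantization."""
    return n_params * bits_per_weight // 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Approximate KV cache size: one key and one value vector
    per layer, per KV head, per token (hence the factor of 2)."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_value * batch_size)


GIB = 1024 ** 3

# Hypothetical 8B-parameter model, 4-bit weights, fp16 cache (assumed values).
weights = weight_bytes(8_000_000_000, bits_per_weight=4)
cache_8k = kv_cache_bytes(32, 8, 128, context_len=8192)
cache_32k = kv_cache_bytes(32, 8, 128, context_len=32768)

print(f"weights:      {weights / GIB:.2f} GiB")
print(f"KV cache 8k:  {cache_8k / GIB:.2f} GiB")
print(f"KV cache 32k: {cache_32k / GIB:.2f} GiB")
```

This estimate covers capacity only. It says nothing about speed, which depends on memory bandwidth and the backend in use. For MoE models, the weights for all experts typically still need to be loaded even though only a few are active per token, so the weight figure should use the total parameter count.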
By SoyPete Tech