

OpenAI recently made waves by being the first big model lab to commit to a hyperscale AMD cluster (alongside their own Titan XPUs), giving AMD its first big-lab silicon win outside of Nvidia/Google. Returning guest Quentin Anthony, Head of Model Training at Zyphra and advisor at EleutherAI, recently made the same transition. In Part 1 of this pod, Quentin describes his journey from working on Oak Ridge National Lab's Frontier supercomputer to leading Zyphra's ambitious move to AMD MI300X GPUs, where they're achieving performance that beats NVIDIA H100s on certain workloads while dramatically reducing costs. The discussion dives deep into the technical challenges of kernel development, with Quentin explaining why he often bypasses high-level frameworks like Triton to write directly in ROCm or even GPU assembly when necessary. He reveals how Zyphra's hybrid transformer-Mamba models like Zamba2 can match Llama 3 8B performance at 7B parameters, optimized specifically for edge deployment across a spectrum from 1.2B models for phones to 7B for desktops.
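For a flavor of what "writing directly in ROCm" means in practice, here is a minimal HIP C++ sketch of a fused SiLU-gate elementwise kernel, the kind of gating op that appears in Mamba-style blocks. It is an illustrative toy under our own assumptions, not Zyphra's actual code; the kernel name, shapes, and values are made up.

```cpp
// Minimal HIP sketch: fused SiLU gate, out[i] = x[i] * silu(gate[i]),
// where silu(v) = v / (1 + exp(-v)). Illustrative only.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void silu_gate(const float* x, const float* gate, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float g = gate[i];
        out[i] = x[i] * (g / (1.0f + expf(-g)));  // fused: one read, one write per element
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hg(n, 2.0f), ho(n);

    float *dx, *dg, *dout;
    hipMalloc((void**)&dx, n * sizeof(float));
    hipMalloc((void**)&dg, n * sizeof(float));
    hipMalloc((void**)&dout, n * sizeof(float));
    hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dg, hg.data(), n * sizeof(float), hipMemcpyHostToDevice);

    const int block = 256;
    silu_gate<<<(n + block - 1) / block, block>>>(dx, dg, dout, n);
    hipDeviceSynchronize();

    hipMemcpy(ho.data(), dout, n * sizeof(float), hipMemcpyDeviceToHost);
    printf("out[0] = %f\n", ho[0]);  // expect 1 * silu(2) ~= 1.7616

    hipFree(dx); hipFree(dg); hipFree(dout);
    return 0;
}
```

Compiled with hipcc, this runs on MI300X-class hardware; going lower than this, as Quentin describes, means dropping to AMD GCN/CDNA assembly when the compiler's scheduling isn't good enough.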
In Part 2, Quentin candidly discusses his experience in the controversial METR software engineering productivity study, which found that developers felt about 20% faster while using AI coding tools but were in fact about 20% slower. Quentin was one of the few developers who showed a measurable speedup from AI tools. He shares practical insights on avoiding the "slot machine effect" of endlessly prompting models, the importance of staying aware of context rot, and why he prefers direct API access over tools like Cursor to maintain complete control over model context. The conversation also covers the state of open source AI research, with Quentin arguing that siloed, focused teams with guaranteed funding produce better results than grand collaborative efforts. He explains why kernel datasets alone won't solve the GPU programming problem, the challenges of evaluating kernel quality, and why companies should invest more in ecosystem development rather than traditional marketing.
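To make the "own your context" point concrete, here is a minimal sketch of the idea using only the C++ standard library: the caller assembles exactly the messages the model sees on each call, capping or resetting history instead of letting a tool accumulate it silently. Message, Conversation, and the turn cap are hypothetical names for illustration, not any real client library.

```cpp
// Sketch of caller-owned context: nothing reaches the model except what
// context() returns, so context rot is bounded by construction.
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Message { std::string role, content; };

class Conversation {
    std::vector<Message> history_;
    std::size_t max_turns_;  // hard cap before old turns fall off
public:
    explicit Conversation(std::size_t max_turns) : max_turns_(max_turns) {}

    // Build exactly the context for this call: a pinned system prompt
    // plus only the most recent turns.
    std::vector<Message> context(const std::string& system_prompt) const {
        std::vector<Message> ctx{{"system", system_prompt}};
        std::size_t start =
            history_.size() > max_turns_ ? history_.size() - max_turns_ : 0;
        ctx.insert(ctx.end(), history_.begin() + start, history_.end());
        return ctx;
    }

    void add(std::string role, std::string content) {
        history_.push_back({std::move(role), std::move(content)});
    }

    void reset() { history_.clear(); }  // start a fresh thread, not a long one
};
```

The serialized result of context() is what you would send over a raw model API; the design choice is simply that truncation and resets are explicit, visible operations rather than editor-internal behavior.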
https://www.linkedin.com/in/quentin-anthony/
https://www.zyphra.com/post/zamba2-7b
Key Topics:
• AMD MI300X advantages: 192GB VRAM, superior memory bandwidth
• Writing kernels from PTX/AMD GCN assembly up through CUDA/ROCm
• Hybrid attention-Mamba architectures and optimal sparsity ratios
• The METR productivity study: achieving positive AI speedup
• Context rot and why shorter conversations beat long threads
• Why physicists make great ML engineers ("embryonic stem cells")
• Edge deployment strategies from phones to local clusters
• The future of on-device vs cloud inference routing
• EleutherAI's focus on interpretability with fully open pipelines
• Building velocity-focused teams over position-based hiring
By swyx + Alessio
