Inside Open Networking by STORDIS – the podcast where tech meets real life

Accelerating AI with Next-Gen Networking: SONiC Innovations and Scalable Designs



Recorded live at the OCP Regional Summit Dublin 2025, this session features Kamran Naqvi (Broadcom) exploring how next-generation Ethernet networking is accelerating AI infrastructure — and what enterprises need to build practical, scalable GPU clusters with SONiC.


From AI traffic patterns (elephant flows, low entropy, tail latency) to real-world topology and cabling choices, Kamran breaks down what makes AI fabrics different from traditional data centers — and how Enterprise SONiC enhancements (smarter hashing, adaptive routing, better load distribution) help remove networking as the bottleneck.
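To illustrate the "smarter hashing" idea: with RDMA (RoCEv2), many flows between the same two hosts share one 5-tuple, so a classic 5-tuple ECMP hash maps them all to one link. Hashing over deeper header fields restores entropy. The sketch below is purely illustrative (the `qp_num` field and packet dict are hypothetical stand-ins, not SONiC's actual hashing implementation):

```python
import struct
import zlib

def ecmp_link(pkt, num_links):
    """Toy ECMP link selection: hash the classic 5-tuple plus a deeper
    header field (here, a hypothetical RoCEv2 queue-pair number) so that
    many RDMA flows between the same host pair still spread across links."""
    key = struct.pack(
        "!4s4sHHBI",
        pkt["src_ip"], pkt["dst_ip"],        # IPv4 addresses as 4-byte values
        pkt["src_port"], pkt["dst_port"],    # L4 ports
        pkt["proto"],                        # IP protocol number
        pkt.get("qp_num", 0),                # deeper field adds entropy
    )
    return zlib.crc32(key) % num_links

# Many RDMA flows sharing one 5-tuple but carrying different queue pairs
# now spread over the available links instead of colliding on one.
base = {"src_ip": b"\x0a\x00\x00\x01", "dst_ip": b"\x0a\x00\x00\x02",
        "src_port": 4791, "dst_port": 4791, "proto": 17}
links_used = {ecmp_link({**base, "qp_num": q}, 8) for q in range(64)}
```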


Key Takeaways

The unique networking demands of AI back-end fabrics: elephant flows, poor entropy, RDMA retransmission challenges, and tail latency

The four fabrics in AI infrastructure — and why the back-end fabric is where “the uniqueness” shows up

Scale-up vs. scale-out networking: what each does and where enterprise designs focus today

Enterprise-ready GPU cluster designs and scalable Ethernet fabrics for AI

SONiC improvements for AI workloads:

Advanced hashing (including deeper header visibility for better entropy)

Adaptive routing approaches, including flowlet-based spraying for better balancing with minimal reordering risk

Practical best practices for cable/optics selection, rack/topology layout, and failure recovery planning

Why Ethernet often beats InfiniBand in production AI deployments — including faster failover behavior and scalability
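The flowlet-based spraying mentioned above exploits natural idle gaps in a flow: packets within an active burst stick to one link (so they cannot be reordered), while a sufficiently long gap makes it safe to re-steer the flow to a less loaded link. A minimal sketch of that logic, with an illustrative timeout and a simple least-loaded link choice (not Broadcom's or SONiC's actual algorithm):

```python
import time

class FlowletBalancer:
    """Toy flowlet-based load balancer: packets within an active flowlet
    keep their link (no reordering risk); after an idle gap longer than
    the timeout, the flow may move to the currently least-loaded link."""

    def __init__(self, num_links, timeout=0.0005):  # 500 µs gap, illustrative
        self.num_links = num_links
        self.timeout = timeout
        self.link_bytes = [0] * num_links   # bytes sent per link so far
        self.flow_state = {}                # flow_key -> (link, last_seen)

    def route(self, flow_key, size, now=None):
        now = time.monotonic() if now is None else now
        state = self.flow_state.get(flow_key)
        if state is not None and now - state[1] < self.timeout:
            link = state[0]   # same flowlet: must stay on the same link
        else:
            # idle gap (or new flow): safe to pick the least-loaded link
            link = min(range(self.num_links), key=self.link_bytes.__getitem__)
        self.flow_state[flow_key] = (link, now)
        self.link_bytes[link] += size
        return link
```

Passing an explicit `now` makes the behavior easy to test; a real data plane would of course do this per-packet in hardware.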

Session outline

00:03 – Intro: Kamran’s role at Broadcom; session focus

00:21 – What you’ll learn: AI network needs, scalable enterprise designs

01:14 – AI infrastructure fabrics + scale-up vs scale-out (enterprise reality)

07:54 – AI as distributed compute: why networking drives job completion time

11:16 – AI traffic patterns: elephant flows, RDMA pain, tail latency

14:08 – Fabric challenges: collisions, failures, incast + required capabilities

16:04 – Broadcom approaches + why Ethernet often wins vs InfiniBand (incl. failover)

18:28 – Topologies: Clos vs. rail-optimized; when spines still matter

23:08 – Cabling/optics best practices: DAC (direct-attach copper) first, then linear pluggable optics; intro to co-packaged optics (CPO)

27:09 – Reference designs: rack layouts and cluster scaling examples

32:04 – SONiC for AI: advanced hashing + adaptive routing (flowlet spray)

36:07 – Wrap-up: QR code for reference architecture + materials; thanks


Download

Download the Broadcom AI Reference Architecture (via the QR code shown during the session).


Stay Connected

📬 Questions or support:

[email protected] | 🌐 www.stordis.com


Let’s get social

💻 Blog: https://stordis.com/blog/

📘 Facebook: https://www.facebook.com/people/STORDIS-GmbH/100057058555819/

📸 Instagram: https://www.instagram.com/stordis_open_networking/

👥 LinkedIn: https://www.linkedin.com/company/stordis/

🐦 X: https://twitter.com/STORDIS_GmbH/

By STORDIS GmbH