Recorded live at the OCP Regional Summit Dublin 2025, this session features Kamran Naqvi (Broadcom) exploring how next-generation Ethernet networking is accelerating AI infrastructure — and what enterprises need in order to build practical, scalable GPU clusters with SONiC.
From AI traffic patterns (elephant flows, low entropy, tail latency) to real-world topology and cabling choices, Kamran breaks down what makes AI fabrics different from traditional data centers — and how Enterprise SONiC enhancements (smarter hashing, adaptive routing, better load distribution) help remove networking as the bottleneck.
Key Takeaways
The unique networking demands of AI back-end fabrics: elephant flows, poor entropy, RDMA retransmission challenges, and tail latency
The four fabrics in AI infrastructure — and why the back-end fabric is where “the uniqueness” shows up
Scale-up vs. scale-out networking: what each does and where enterprise designs focus today
Enterprise-ready GPU cluster designs and scalable Ethernet fabrics for AI
SONiC improvements for AI workloads:
Advanced hashing (including deeper header visibility for better entropy)
Adaptive routing approaches, including flowlet-based spraying for better balancing with minimal reordering risk
Practical best practices for cable/optics selection, rack/topology layout, and failure recovery planning
Why Ethernet often beats InfiniBand in production AI deployments — including faster failover behavior and scalability
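To make the flowlet idea above concrete, here is a minimal, illustrative Python sketch (not Broadcom's or SONiC's actual implementation — the class name, gap value, and load metric are all assumptions): a flow is only rebound to the currently least-loaded path after an idle gap longer than the path-delay skew, which is what keeps reordering risk low while still spreading elephant flows.

```python
import time

# Hypothetical flowlet-based load balancer sketch. A flow may switch
# paths only after an idle gap, so in-flight packets cannot be reordered.

FLOWLET_GAP = 0.000050  # 50 µs idle threshold -- an assumed example value

class FlowletBalancer:
    def __init__(self, num_paths):
        self.num_paths = num_paths
        self.path_load = [0] * num_paths   # bytes sent per path
        self.flow_state = {}               # flow_id -> (path, last_seen)

    def select_path(self, flow_id, pkt_bytes, now=None):
        now = time.monotonic() if now is None else now
        path, last_seen = self.flow_state.get(flow_id, (None, 0.0))
        # New flow, or idle gap long enough to start a new flowlet:
        # rebind to the currently least-loaded path.
        if path is None or now - last_seen > FLOWLET_GAP:
            path = min(range(self.num_paths), key=self.path_load.__getitem__)
        self.path_load[path] += pkt_bytes
        self.flow_state[flow_id] = (path, now)
        return path
```

Back-to-back packets of a flow stay pinned to one path; only after the flow goes quiet can the next burst be sprayed onto a less-loaded link — the balance-vs-reordering trade-off the session discusses.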
Session outline
00:03 – Intro: Kamran’s role at Broadcom; session focus
00:21 – What you’ll learn: AI network needs, scalable enterprise designs
01:14 – AI infrastructure fabrics + scale-up vs scale-out (enterprise reality)
07:54 – AI as distributed compute: why networking drives job completion time
11:16 – AI traffic patterns: elephant flows, RDMA pain, tail latency
14:08 – Fabric challenges: collisions, failures, incast + required capabilities
16:04 – Broadcom approaches + why Ethernet often wins vs InfiniBand (incl. failover)
18:28 – Topologies: Clos vs rail-optimized; when spines still matter
23:08 – Cabling/optics best practices: DAC first, then linear pluggables; CPO intro
27:09 – Reference designs: rack layouts and cluster scaling examples
32:04 – SONiC for AI: advanced hashing + adaptive routing (flowlet spray)
36:07 – Wrap-up: QR code for reference architecture + materials; thanks
Download
Download the Broadcom AI Reference Architecture (via the QR code shown during the session).
Stay Connected
📬 Questions or support:
[email protected] | 🌐 www.stordis.com
Let’s get social
💻 Blog: https://stordis.com/blog/
📘 Facebook: https://www.facebook.com/people/STORDIS-GmbH/100057058555819/
📸 Instagram: https://www.instagram.com/stordis_open_networking/
👥 LinkedIn: https://www.linkedin.com/company/stordis/
🐦 X: https://twitter.com/STORDIS_GmbH/