Inside Open Networking by STORDIS – the podcast where tech meets real life

Ethernet Based AI Cluster Fabric - Performance Improvement - Tuning in SONiC | OCP Dublin 2025


Recorded live at the OCP Regional Summit Dublin 2025, this episode features Nanda Ravindran (VP of Technical Sales, Edgecore Networks) sharing hands-on, real-world insights into tuning AI-scale network fabrics with SONiC.

Learn how Edgecore benchmarks and optimizes 800G AI switches in SONiC — and why consistent, repeatable tuning (plus validation under realistic load) is critical for stable AI network performance.

  • AI workload characteristics and the fabric performance challenges they introduce

  • Step-by-step SONiC tuning: PFC, ECN, and DLB configuration fundamentals

  • Using Spirent test equipment to generate realistic AI traffic profiles and stress conditions

  • What changes performance: topology choices, link failures, VXLAN overlays, and traffic patterns

  • Flowlet mode vs. hash mode — which delivers better outcomes for AI use cases

  • Why automation, repeatable test methods, and community best practices matter at AI scale

  • Edgecore’s open networking approach: collaborating with Broadcom on Enterprise SONiC for next-gen AI deployments
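For listeners new to the hash vs. flowlet distinction, here is a minimal Python sketch of the idea. It is a toy model, not the switch ASIC's DLB implementation: the names `hash_mode_path`, `FlowletBalancer`, and `gap_us`, and the "least bytes sent" path choice, are all assumptions made for illustration.

```python
import zlib


def hash_mode_path(flow_id: str, num_paths: int) -> int:
    """Hash mode: every packet of a flow maps to the same path, always."""
    return zlib.crc32(flow_id.encode()) % num_paths


class FlowletBalancer:
    """Flowlet mode (toy model): a flow may move to another path, but only
    after an idle gap longer than gap_us, so packets inside one burst stay
    in order on a single path."""

    def __init__(self, num_paths: int, gap_us: int):
        self.num_paths = num_paths
        self.gap_us = gap_us
        self.path_load = [0] * num_paths  # bytes sent per path (hypothetical metric)
        self.state = {}                   # flow_id -> (last_seen_us, path)

    def route(self, flow_id: str, now_us: int, size_bytes: int) -> int:
        entry = self.state.get(flow_id)
        if entry is None or now_us - entry[0] > self.gap_us:
            # Start of a new flowlet: pick the currently least-loaded path.
            path = min(range(self.num_paths), key=self.path_load.__getitem__)
        else:
            # Same flowlet: keep the path to avoid reordering.
            path = entry[1]
        self.state[flow_id] = (now_us, path)
        self.path_load[path] += size_bytes
        return path


balancer = FlowletBalancer(num_paths=2, gap_us=50)
print(balancer.route("gpu0->gpu1", now_us=0, size_bytes=1000))    # 0
print(balancer.route("gpu0->gpu1", now_us=10, size_bytes=1000))   # 0 (same flowlet)
print(balancer.route("gpu0->gpu1", now_us=100, size_bytes=1000))  # 1 (new flowlet)
```

With a few long-lived elephant flows, hash mode can pin several of them to one link for their whole lifetime, while the flowlet model gets a chance to rebalance at every idle gap.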

Session outline:
00:00 Intro — Nanda Ravindran & session overview
01:00 Why AI fabric tuning matters — 800G benchmarking + recurring performance gaps
02:00 AI workload traits — elephant flows, low entropy, load-balancing pressure; goal: lossless + low latency
03:00 SONiC tuning focus — RoCEv2 mapping + PFC, ECN, DLB
04:00 Testbed overview — 6× Edgecore 800G (TH5), SONiC 202311-based, non-blocking fabric
05:00 Spirent methodology — AI workload emulation, collectives, measurements
06:00 PFC configuration — QoS profiles (DSCP→TC→Queue/PG), bindings, enablement
08:00 ECN configuration — WRED profile, thresholds, drop probability sweeps
09:00 DLB explained — hash vs flowlet; why flowlet tuning matters
10:00 Key findings — PFC-only best in lab; PFC+ECN required for deployments
12:00 ECN result highlight — example best setting (1% drop, 2MB/10MB thresholds)
13:00 800G vs 400G/breakout — native 800G performs better for AI workloads
14:00 Failure + VXLAN tests — link failures hurt; VXLAN shows minimal impact
15:00 Collectives + PXN — PXN best; flowlet recovers faster than hash
16:00 Call to action — automation + repeatable community best practices
18:00 Q&A — question on newer enhanced DLB/ECMP; plan to test on newer SONiC
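As a rough illustration of the ECN settings mentioned at 08:00 and 12:00, the sketch below shows the textbook WRED-style marking curve. It assumes that "2MB/10MB" are the minimum and maximum queue thresholds and that "1%" is the maximum marking probability; the function name, the MiB unit, and the mark-everything behavior above the maximum threshold are illustrative assumptions, not details confirmed in the session.

```python
MIB = 1024 * 1024  # assumed unit for the thresholds


def ecn_mark_probability(queue_bytes: int, min_th: int, max_th: int, max_p: float) -> float:
    """Textbook WRED/ECN curve: no marking below min_th, a linear ramp up
    to max_p between the thresholds, and marking of every ECN-capable
    packet once the queue reaches max_th."""
    if queue_bytes <= min_th:
        return 0.0
    if queue_bytes >= max_th:
        return 1.0
    return max_p * (queue_bytes - min_th) / (max_th - min_th)


# Example using the values quoted in the outline (interpretation assumed).
for depth_mib in (1, 2, 6, 9, 10):
    p = ecn_mark_probability(depth_mib * MIB, 2 * MIB, 10 * MIB, 0.01)
    print(f"queue={depth_mib} MiB -> mark probability {p:.4f}")
```

A low maximum probability with a wide threshold window makes senders back off gradually, which fits the outline's point that PFC and ECN are meant to work together in deployments rather than relying on PFC pauses alone.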

📬 Questions or support: [email protected] | 🌐 www.stordis.com

Let’s get social
💻 Blog: https://stordis.com/blog/
📘 Facebook: https://www.facebook.com/people/STORDIS-GmbH/100057058555819/
📸 Instagram: https://www.instagram.com/stordis_open_networking/
👥 LinkedIn: https://www.linkedin.com/company/stordis/
🐦 X: https://twitter.com/STORDIS_GmbH/


#SONiC #AIFabricTuning #Edgecore #800GSwitches #OCPDublin2025 #ECN #PFC #DLB #AIWorkloads #SONiCOptimization #OpenNetworking #EnterpriseSONiC #Broadcom #FlowletMode #NetworkAutomation #AIInfrastructure

By STORDIS GmbH