Testing Networks for AI: Spirent’s Perspective on Testing Challenges & Solutions

The Tolly Group
June 26, 2025
4 min read

Unlike traditional computing, AI workloads are highly sensitive to network performance. Even minor network bottlenecks can lead to costly delays, reduced operational efficiency, and suboptimal outcomes in AI model training. Traditional compute workloads are largely sequential and CPU-driven, whereas AI workloads rely heavily on GPUs, parallel processing, and intensive matrix operations. This shift in how data moves is profound: traditional switch test methodologies break down, leaving network architects blind to critical performance issues lurking beneath the surface.

To explore this new paradigm, The Tolly Group recently sat down with Asim Rasheed, who manages the high-speed Ethernet product line at Spirent Communications. Rasheed and his team are at the forefront of validating Ethernet fabrics specifically designed for AI workloads at 400G, 800G, and the soon-to-arrive 1.6T speeds. Here are the most significant insights from our conversation, distilled into practical guidance.

Why Legacy Benchmarks Miss the Mark

“You can run your old RFC throughput tests on an AI fabric,” Rasheed told us, “but the results won’t tell you anything meaningful in the AI context.”

Classic network measurements—such as raw throughput, latency, and jitter—remain important, but they are no longer sufficient to evaluate modern AI environments. AI fabrics experience unique traffic patterns, often characterized by bursts of synchronized, GPU-driven communication. As a result, job completion time (JCT) overtakes raw throughput and latency as the key performance indicator in AI contexts. JCT measures how quickly thousands of GPUs can complete a training job without being held up by disruptive tail latencies or microbursts.
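To make that distinction concrete, here is a minimal, illustrative Python sketch (not Spirent's methodology; the cluster size, step count, and slowdown factor are assumptions). Because each training step is a synchronized barrier, completion is gated by the slowest flow, so even a rare tail-latency event inflates JCT far more than average throughput numbers would suggest.

```python
import random

# Illustrative sketch: why JCT, not average throughput, captures AI fabric behavior.
# Each training step ends only when the SLOWEST of many synchronized flows finishes,
# so rare tail-latency events dominate the total job completion time.

random.seed(42)
NUM_GPUS = 1024          # hypothetical cluster size
NUM_STEPS = 100          # hypothetical number of training iterations
BASE_MS = 10.0           # assumed nominal per-step network transfer time

def step_time(tail_prob):
    """One synchronized step: completion is gated by the slowest flow."""
    flows = [BASE_MS * (1.0 + random.random() * 0.05) for _ in range(NUM_GPUS)]
    for i in range(NUM_GPUS):
        if random.random() < tail_prob:
            flows[i] *= 10           # a microburst or congestion hotspot hits this flow
    return max(flows)                # barrier: every GPU waits for the laggard

for tail_prob in (0.0, 0.001):       # 0% vs. 0.1% chance of a slow flow
    jct = sum(step_time(tail_prob) for _ in range(NUM_STEPS))
    print(f"tail_prob={tail_prob:.3f}  job completion time ≈ {jct:,.0f} ms")
```

Even though 99.9% of flows are unaffected, the second run's JCT is several times higher, which is exactly the effect that average-throughput benchmarks hide.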

Navigating AI’s Unique Traffic Challenges

Additionally, AI workloads generate unique and complex traffic patterns, such as collective all-reduce operations, synchronized parameter updates, and checkpoint bursts. These patterns stress switch buffers and congestion-control mechanisms far beyond anything seen in traditional traffic such as web or storage flows.
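As a rough illustration of the scale of these bursts, the sketch below models the traffic a ring all-reduce pushes onto the fabric. The gradient size, GPU count, and link speed are assumptions chosen only to show the shape of the problem, not measurements from any real cluster.

```python
# Illustrative sketch of why collective operations stress switch buffers:
# in a ring all-reduce, every GPU transmits simultaneously in lock-step phases,
# so the fabric sees synchronized bursts rather than the smoother, uncorrelated
# flows typical of web or storage traffic. All sizes below are assumptions.

GRADIENT_BYTES = 4 * 10**9     # hypothetical 4 GB of gradients per training step
NUM_GPUS = 512                 # hypothetical ring size
LINK_GBPS = 400                # per-port line rate

chunk = GRADIENT_BYTES / NUM_GPUS
phases = 2 * (NUM_GPUS - 1)                 # reduce-scatter + all-gather phases
bytes_per_gpu = phases * chunk              # ~2x the gradient size leaves each GPU
burst_seconds = chunk * 8 / (LINK_GBPS * 1e9)

print(f"bytes sent per GPU per step : {bytes_per_gpu / 1e9:.2f} GB")
print(f"synchronized phases per step: {phases}")
print(f"per-phase burst duration    : {burst_seconds * 1e6:.0f} µs on every link at once")
```

The point is not the exact numbers but the pattern: hundreds of synchronized phases per step, each lighting up every link in the fabric at the same instant.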

Beyond the benchmarks, the challenges in accurately validating AI fabrics intensify with scale. Large AI training clusters can involve tens of thousands of 400G and 800G ports. Replicating such infrastructure in a test lab is effectively impossible, and even a partial build-out would consume enormous budgets in GPUs alone.

Even if an organization could afford such infrastructure, analyzing the enormous amount of data generated would be exceedingly challenging. Detecting hidden congestion hotspots amid petabytes of data becomes impractical without specialized testing capabilities.

Spirent Solution: xPU Emulation Instead of Hardware Deployment

"Our emulation approach enables organizations to uncover hidden performance issues at scale, providing clarity and confidence without the massive expense of deploying physical GPUs”

Spirent addresses these challenges by emulating xPU traffic instead of relying on physical hardware. Realistic AI workloads are reproduced through software and dedicated test hardware, eliminating the need for large-scale GPU clusters. Spirent’s emulation solution delivers hyper-realistic workloads derived directly from genuine training scenarios. High-density test modules replicate the exact traffic profiles xPUs produce, enabling precise and repeatable testing.

The result is reliable, data-driven analysis and faster troubleshooting through detailed analytics. Metrics such as job completion time, tail latency, congestion mapping, packet latency, packet drops, and other relevant statistics are all consolidated into a user-friendly dashboard.
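The sketch below illustrates, in a purely hypothetical way, how per-flow records from such a test run might be rolled up into tail-latency and hotspot views. The record format and field names are assumptions for illustration and do not reflect Spirent's actual data model or APIs.

```python
# Illustrative post-processing sketch (field names and data source are assumed,
# not Spirent's actual API): roll per-flow records up into the kinds of AI-fabric
# metrics discussed above -- tail latency, drops, and per-link congestion hotspots.

from statistics import quantiles
from collections import defaultdict

# Hypothetical per-flow records: (link_id, latency_us, dropped)
records = [
    ("leaf1-spine2", 12.1, False),
    ("leaf1-spine2", 480.0, False),   # a microburst victim
    ("leaf3-spine1", 11.8, False),
    ("leaf3-spine1", 12.4, True),     # a dropped packet
    # ... in practice, millions of records from the test run
]

latencies = [lat for _, lat, _ in records]
qs = quantiles(latencies, n=100)      # 99 percentile cut points
p50, p99 = qs[49], qs[98]

drops_per_link = defaultdict(int)
for link, _, dropped in records:
    drops_per_link[link] += int(dropped)

print(f"p50 latency: {p50:.1f} µs   p99 (tail) latency: {p99:.1f} µs")
print("congestion hotspots (drops per link):", dict(drops_per_link))
```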

Case Study: De-Risking an AI Data Center Deployment

In our conversation, Rasheed highlighted a situation in which a global cloud provider faced a critical decision: choosing network switch and optical module vendors for a new AI-focused data center. Their key challenge was the inability of traditional testing tools to accurately simulate the scale and complexity of expected GPU traffic.

Spirent deployed xPU-emulation appliances that produced millions of concurrent flows across a layered leaf-spine fabric. Engineers rapidly discovered a mismatch between forward-error-correction settings and congestion-control timers – an issue that, if left unresolved, would have significantly slowed training jobs in production.

Within two weeks, the cloud provider confidently finalized its vendor selection, assured that the fabric would scale efficiently to meet future AI demands.

Roadmap: Accelerating Toward 800G, 1.6T, and Ultra-Ethernet

As AI model sizes continue to skyrocket, validation methods must evolve in parallel. Rasheed detailed Spirent’s forward-looking roadmap, including native 800G Ethernet support launching this year and 1.6T interfaces currently in development. Spirent is also actively collaborating with the Ultra Ethernet Consortium (UEC) to ensure robust validation of emerging congestion-control features purpose-built for AI workloads. Ongoing enhancements to its traffic model libraries will keep testing aligned with tomorrow’s demanding AI workloads rather than yesterday’s outdated benchmarks.

Implications for Network Architects and Operators

Relying on traditional throughput tests to validate AI fabrics puts network architects and operators at significant risk. Modern fabric validation must emulate xPU workloads at scale, accurately measure critical metrics such as job completion time and congestion hotspots, and iterate testing processes rapidly as link speeds and protocols evolve every 18–24 months.

Key Takeaways

  • Traditional CPU-driven benchmarking is inadequate for AI workloads due to their unique, GPU-driven traffic patterns

  • Accurate AI fabric testing demands realistic emulation of xPU traffic

  • Job completion time (JCT) and tail latency, along with throughput, packet latency, and out-of-sequence packets, are among the critical performance metrics in AI network testing

  • Validating AI fabrics without specialized test hardware is both costly and insufficiently accurate

  • Spirent xPU emulation solutions offer precise, repeatable testing that significantly reduces risk and accelerates the design process

  • Rapid evolution in Ethernet speeds and AI model complexity necessitates advanced validation methods capable of scaling alongside emerging technologies

Learn More

Visit Spirent’s AI network testing portal for additional resources: spirent.com/AI