
AI Model Training and Inference Performance with NVIDIA GPUs: Huawei Xinghe AI Data Center Network vs. Industry Ethernet Network

Sponsor: Huawei Technologies Co., Ltd.

Abstract

Huawei’s Xinghe AI Data Center Network is designed to improve the efficiency of large-scale AI clusters by optimizing communication between GPU servers during model training and inference. In Tolly’s evaluation, Huawei’s Ethernet-based AI fabric was compared with RoCE networks from other mainstream Ethernet vendors in the same AI computing environment across NCCL collective communication, Llama 2 training, DeepSeek inference, and integrated training-plus-inference workloads. The report attributes Huawei’s gains to its NSLB load-balancing algorithm for AI fabrics, which is intended to provide global load balancing and reduce network contention in distributed AI jobs.


The test bed used eight servers, each equipped with eight NVIDIA H100 80GB HBM3 GPUs and eight MCX75310AAS-NEAT NICs, connected through a spine-leaf fabric. The Huawei environment used CE9866-128DQ and XH9230-128DQ switches, while the comparison environment used 400GE Ethernet switches from other vendors. This setup was used to measure effective bandwidth and training throughput under both per-flow and per-packet load-balancing modes. 
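The “effective bandwidth” figures in the NCCL tests follow the standard Ring AllReduce accounting, in which each of the n GPUs sends and receives the data roughly twice, scaled by (n-1)/n. A minimal sketch of that conversion, assuming the nccl-tests busbw convention (the report does not name its measurement tool, and the buffer size and timing below are hypothetical):

```python
def ring_allreduce_bus_bw(data_bytes: float, time_s: float, n_gpus: int) -> float:
    """Convert a Ring AllReduce completion time into effective (bus) bandwidth.

    alg_bw is the naive data-size/time rate; the 2*(n-1)/n factor accounts
    for the reduce-scatter plus all-gather traffic each rank actually moves.
    """
    alg_bw = data_bytes / time_s
    return alg_bw * 2 * (n_gpus - 1) / n_gpus

# 64 GPUs as in the test bed (8 servers x 8 H100s); 8 GiB buffer and the
# timing value are illustrative placeholders, not measured data.
bw = ring_allreduce_bus_bw(8 * 2**30, 0.043, 64)
print(f"{bw / 1e9:.2f} GB/s effective bandwidth")
```

With 64 GPUs the correction factor is 2 * 63 / 64 ≈ 1.97, which is why effective bandwidth can approach twice the per-link line rate.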


In NCCL Ring AllReduce testing, Huawei delivered 389.06 GB/s effective bandwidth versus 253.32 GB/s for the industry comparison under per-flow load balancing, a 53.58% improvement. In the AllReduce-plus-background-task scenario under per-packet load balancing, Huawei reached 374.63 GB/s versus 334.05 GB/s, a 12.15% gain. For Llama2-13B model training, Huawei achieved 35.99 TFLOPS versus 32.96 TFLOPS under per-flow load balancing, a 9.19% improvement, and 36.89 TFLOPS versus 35.62 TFLOPS under per-packet load balancing with background tasks, a 3.57% gain.
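The percentage gains above follow directly from the raw measurements. A quick arithmetic check of each reported figure:

```python
def gain_pct(huawei: float, industry: float) -> float:
    """Relative improvement of the Huawei result over the industry baseline."""
    return (huawei - industry) / industry * 100

# Values taken from the report's measurements above.
print(round(gain_pct(389.06, 253.32), 2))  # AllReduce, per-flow: 53.58
print(round(gain_pct(374.63, 334.05), 2))  # AllReduce + background, per-packet: 12.15
print(round(gain_pct(35.99, 32.96), 2))    # Llama2-13B training, per-flow: 9.19
print(round(gain_pct(36.89, 35.62), 2))    # Llama2-13B + background, per-packet: 3.57
```

All four reported percentages are consistent with the underlying bandwidth and throughput numbers.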


Inference results were also stronger. In DeepSeek Prefill multi-task scenarios, Huawei improved throughput by roughly 31% to 33%. In Decode-plus-background-task testing, throughput improved by 13.6%. In a mixed inference-and-training scenario combining DeepSeek Prefill with NCCL AllReduce, Huawei improved Prefill throughput by up to 33.86% and AllReduce network throughput by 31.15%. Overall, the report presents Huawei’s network as a high-performance AI fabric built to sustain demanding distributed training and inference workloads with fewer communication bottlenecks.