Reports & Publications
F5 BIG-IP Next for Kubernetes (BNK) on DPU: Performance Benefits of Intelligent AI Load Balancing
Abstract
F5 BIG-IP Next for Kubernetes (BNK) on DPU is positioned as an AI inference load-balancing platform designed to improve both user response times and infrastructure efficiency in Kubernetes-based AI environments. In Tolly testing commissioned by F5, BNK was compared with HAProxy, Envoy, and another open-source load balancer in clusters running Meta Llama 3.1 70B, 3.1 8B, and 3.2 1B models. The core differentiator is GPU-aware traffic steering: instead of using simple round-robin distribution, BNK directs requests away from already busy GPUs, helping reduce contention and improve accelerator utilization.
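The contrast between round-robin and GPU-aware steering can be illustrated with a minimal sketch. This is not F5's implementation; the GPU names, utilization values, and selection policy are hypothetical, chosen only to show why a load-aware picker avoids pre-loaded accelerators while round-robin sends traffic to them anyway.

```python
# Illustrative sketch only -- hypothetical GPUs and utilization values,
# not BNK's actual steering logic.
from itertools import cycle

# Hypothetical current utilization per GPU (50% of them pre-loaded,
# mirroring the test setup described in the report).
gpu_load = {"gpu0": 0.95, "gpu1": 0.90, "gpu2": 0.10, "gpu3": 0.15}

def round_robin(backends):
    """Return a picker that cycles through backends regardless of load."""
    rr = cycle(backends)
    return lambda: next(rr)

def least_loaded(load):
    """Return a picker that steers each request to the least-busy GPU."""
    return lambda: min(load, key=load.get)

pick_rr = round_robin(list(gpu_load))
pick_ll = least_loaded(gpu_load)

print([pick_rr() for _ in range(4)])  # visits busy GPUs too
print(pick_ll())                      # avoids the pre-loaded GPUs
```

A real load balancer would refresh the load map from live GPU telemetry rather than a static dictionary; the point is only that the selection criterion changes from position in a rotation to observed busyness.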
The report’s test environment intentionally created uneven demand by artificially loading 50% of the GPUs before each run, then measuring output tokens per second, time to first token (TTFT), and end-to-end request latency over 60-minute tests. On the Llama 3.1 70B model, BNK delivered 879.82 output tokens per second, compared with 627.22 for HAProxy and 726.23 for Envoy. It also reduced TTFT to 13,970.80 ms, versus 35,778.08 ms for HAProxy and 29,483.65 ms for Envoy, while lowering request latency to 53,420.59 ms. This translated into up to 40% higher throughput, 61% better TTFT, and 34% lower latency versus competing approaches, depending on the comparison point.
Performance gains were even larger with smaller models. Against HAProxy, BNK produced 10,733.17 output tokens per second on Llama 3.1 8B versus 5,020.75, and 42,458.08 versus 8,397.93 on Llama 3.2 1B. The report cites throughput improvements of 114% and 406% for those models, along with major reductions in TTFT and latency.
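The percentage gains quoted above follow directly from the reported figures. A minimal arithmetic check, using only numbers stated in this abstract:

```python
# Reproduce the cited improvement percentages from the reported figures.
def pct_gain(new, old):
    """Percent increase of new over old (higher is better, e.g. throughput)."""
    return round((new - old) / old * 100)

def pct_drop(new, old):
    """Percent reduction of new versus old (lower is better, e.g. TTFT)."""
    return round((old - new) / old * 100)

print(pct_gain(879.82, 627.22))        # Llama 3.1 70B throughput vs HAProxy -> 40
print(pct_drop(13970.80, 35778.08))    # Llama 3.1 70B TTFT vs HAProxy -> 61
print(pct_gain(10733.17, 5020.75))     # Llama 3.1 8B throughput vs HAProxy -> 114
print(pct_gain(42458.08, 8397.93))     # Llama 3.2 1B throughput vs HAProxy -> 406
```

Each result matches the report's cited 40%, 61%, 114%, and 406% figures, confirming that those percentages are computed against the HAProxy baseline.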
The report also highlights host CPU savings from DPU offload. In the CPU-consumption test, BNK used about 2 CPU cores compared with roughly 12 cores for HAProxy, freeing approximately 10 additional cores for AI application processing. Testing used NVIDIA NIM with TensorRT-LLM, NVIDIA GH200 480GB GPUs, Kubernetes, Prometheus/Grafana monitoring, and NVIDIA AIPerf traffic generation.