REPS: Recycled Entropy Packet Spraying for Adaptive Load Balancing and Failure Mitigation
Next-generation datacenters require highly efficient network load balancing to manage the growing scale of artificial intelligence (AI) training and general datacenter traffic. However, existing Ethernet-based solutions, such as Equal Cost Multi-Path (ECMP) and oblivious packet spraying (OPS), struggle to maintain high network utilization due to both increasing traffic demands and the expanding scale of datacenter topologies, which also exacerbate network failures. To address these limitations, we propose REPS, a lightweight decentralized per-packet adaptive load balancing algorithm designed to optimize network utilization while ensuring rapid recovery from link failures. REPS adapts to network conditions by caching good-performing paths. In case of a network failure, REPS re-routes traffic away from it in less than 100 microseconds. REPS is designed to be deployed with next-generation out-of-order transports, such as Ultra Ethernet, and uses less than 25 bytes of per-connection state regardless of the topology size. We extensively evaluate REPS in large-scale simulations and FPGA-based NICs.
💡 Research Summary
The paper addresses two pressing challenges in modern large‑scale datacenter networks that support AI training workloads: (1) severe congestion caused by massive, bursty traffic and (2) rapid performance degradation when links fail. Existing Ethernet‑based solutions—Equal‑Cost Multi‑Path (ECMP) and Oblivious Packet Spraying (OPS)—either suffer from hash collisions that concentrate many flows on a single path or lack any awareness of network asymmetries and failures. To overcome these limitations, the authors propose REPS (Recycled Entropy Packet Spraying), a lightweight, decentralized per‑packet load‑balancing algorithm that works with out‑of‑order transports such as Ultra Ethernet.
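The hash-collision problem mentioned above can be illustrated with a toy model. This is a rough sketch under stated assumptions, not a reproduction of the paper's experiments: the switch hash is modeled as a uniformly random path choice, and the function name, flow count, path count, and packet count are made up for illustration.

```python
import random


def max_path_load(num_flows, num_paths, packets_per_flow, per_packet, rng):
    """Return the packet count on the busiest path.

    per_packet=False models ECMP: each flow is pinned to one path.
    per_packet=True models oblivious spraying: each packet picks a path.
    """
    load = [0] * num_paths
    for _ in range(num_flows):
        if per_packet:
            for _ in range(packets_per_flow):
                load[rng.randrange(num_paths)] += 1
        else:
            # The whole flow lands on whichever path its "hash" selects.
            load[rng.randrange(num_paths)] += packets_per_flow
    return max(load)


rng = random.Random(1)  # fixed seed so repeated runs give the same numbers
ecmp_max = max_path_load(16, 16, 1000, per_packet=False, rng=rng)
ops_max = max_path_load(16, 16, 1000, per_packet=True, rng=rng)
print("busiest path, per-flow hashing:", ecmp_max)
print("busiest path, per-packet spraying:", ops_max)
```

With per-flow hashing the busiest path carries a whole number of flows, so any collision at least doubles its load, whereas per-packet spraying stays close to the even share of 1000 packets. Neither model reacts to congestion or failures, which is the gap REPS targets.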
Key ideas of REPS:
- Entropy Value (EV) Caching – Each packet carries an entropy value (e.g., a source-port field) that influences the ECMP hash in switches. When an ACK returns without an ECN mark, the corresponding EV is stored in a small circular buffer (8 entries, under 25 bytes of state per connection). The buffer therefore holds "good" EVs that have recently traversed uncongested paths.
- Adaptive Spraying – For new connections, REPS initially behaves like OPS, randomly probing EVs during the first bandwidth-delay product of traffic. Afterwards, it preferentially reuses the oldest valid EV from the buffer, so that traffic is steered onto paths that were recently observed to be uncongested.
- Freezing Mode for Failure Mitigation – If consecutive ECN marks or ACK losses indicate a failing path, REPS enters a “freeze” state where it stops probing new EVs and continues to use only the cached safe EVs. This prevents the algorithm from accidentally selecting a broken path during the time required for the network to update ECMP groups (often several milliseconds).
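The three ideas above can be sketched as a small per-connection state machine. This is an illustrative Python sketch, not the authors' implementation: the class and method names, the 16-bit EV space, the freeze threshold of three consecutive bad signals, invalidating an EV after a single use, and the round-robin reuse while frozen are all assumptions. The summary does not say how freezing ends, so no exit is modeled.

```python
import random

BUFFER_SIZE = 8      # cached EVs per connection, per the summary
EV_SPACE = 1 << 16   # assumption: 16-bit entropy field such as a source port


class RepsSender:
    """Hypothetical per-connection REPS state (names are illustrative)."""

    def __init__(self, explore_packets, freeze_threshold=3, rng=None):
        self.explore_packets = explore_packets    # packets in the first BDP
        self.freeze_threshold = freeze_threshold  # assumed freeze trigger
        self.rng = rng or random.Random()
        self.evs = [0] * BUFFER_SIZE
        self.valid = [False] * BUFFER_SIZE
        self.head = 0        # next slot to overwrite, i.e. the oldest entry
        self.cursor = 0      # read position used only while frozen
        self.sent = 0
        self.bad_streak = 0
        self.frozen = False

    def on_ack(self, ev, ecn_marked):
        """Cache the EV of an unmarked ACK; count a marked ACK as bad."""
        if ecn_marked:
            self._on_bad_signal()
            return
        self.bad_streak = 0
        self.evs[self.head] = ev
        self.valid[self.head] = True
        self.head = (self.head + 1) % BUFFER_SIZE

    def on_loss(self):
        """Treat a missing ACK as a bad signal."""
        self._on_bad_signal()

    def _on_bad_signal(self):
        self.bad_streak += 1
        if not self.frozen and self.bad_streak >= self.freeze_threshold:
            self.frozen = True
            self.cursor = self.head  # start reuse from the oldest entry

    def next_ev(self):
        """Choose the EV for the next outgoing packet."""
        self.sent += 1
        if not self.frozen and self.sent <= self.explore_packets:
            return self.rng.randrange(EV_SPACE)   # OPS-like exploration
        start = self.cursor if self.frozen else self.head
        for offset in range(BUFFER_SIZE):
            i = (start + offset) % BUFFER_SIZE
            if self.valid[i]:
                if self.frozen:
                    self.cursor = (i + 1) % BUFFER_SIZE  # keep EV, rotate
                else:
                    self.valid[i] = False  # assumed: consume EV after use
                return self.evs[i]
        return self.rng.randrange(EV_SPACE)       # cache empty: probe


sender = RepsSender(explore_packets=2, rng=random.Random(0))
probes = [sender.next_ev(), sender.next_ev()]  # random exploration
sender.on_ack(111, ecn_marked=False)           # cached
sender.on_ack(222, ecn_marked=True)            # congested, not cached
sender.on_ack(333, ecn_marked=False)           # cached
recycled = [sender.next_ev(), sender.next_ev()]
print(recycled)
```

In this sketch, eight 16-bit EVs occupy 16 bytes, which leaves room for the valid bits and counters within the stated 25-byte budget; that accounting is an assumption rather than a figure from the paper.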
Implementation requires only ECMP hashing and ECN marking on switches—no special hardware support. The per‑connection state is minimal, making REPS suitable for NIC firmware or FPGA‑based NICs.
Evaluation:
- Large-scale simulations on symmetric Fat-Tree and asymmetric Jellyfish topologies (up to 128 K nodes) show REPS achieving up to 6× higher link utilization than ECMP and up to 1.25× over OPS in symmetric networks; in asymmetric networks the reported gains are up to 5× over ECMP and 2× over OPS.
- During short‑lived link failures (milliseconds), REPS recovers traffic up to 100× faster than OPS, effectively eliminating the loss of up to 120 k packets (≈0.5 GB) that would occur on a 400 Gbps link with a 4 KiB MTU.
- FPGA‑based NIC prototypes confirm that REPS can be realized with negligible area overhead and meet the sub‑100 µs failure‑recovery target.
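The packet-loss figure quoted above can be checked with simple arithmetic. The short calculation below assumes every lost packet is a full 4 KiB MTU and that the link runs at line rate; reading the result as the implied length of the failure window is an inference, not a number stated in the summary.

```python
PACKETS = 120_000        # packets lost, from the summary
MTU_BYTES = 4 * 1024     # 4 KiB MTU
LINK_BPS = 400e9         # 400 Gbps link

lost_bytes = PACKETS * MTU_BYTES              # total bytes lost
lost_gb = lost_bytes / 1e9                    # in decimal gigabytes
window_ms = lost_bytes * 8 / LINK_BPS * 1e3   # time to send them at line rate

print(f"{lost_bytes} bytes = {lost_gb:.3f} GB, {window_ms:.2f} ms at line rate")
```

The result is about 0.49 GB, matching the quoted ≈0.5 GB, and corresponds to roughly 10 ms of traffic at 400 Gbps, which is in line with the millisecond-scale failures described.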
The authors discuss limitations such as the finite size of the EV space (e.g., 16‑bit source ports) which may require header extensions for extremely large topologies, and the reliance on ECN—if ECN is disabled, REPS loses its primary congestion signal. Future work includes integrating in‑network telemetry for proactive congestion prediction, expanding the EV set, and testing with a broader range of transport protocols.
In conclusion, REPS provides a practical, low‑overhead solution that simultaneously improves load balancing and dramatically speeds up failure recovery in next‑generation datacenter networks, making it well‑suited for the demanding traffic patterns of AI training and other high‑performance workloads.