PROBE: Co-Balancing Computation and Communication in MoE Inference via Real-Time Predictive Prefetching
Mixture-of-Experts models have become a dominant architecture for scaling Large Language Models by activating only a sparse subset of experts per token. However, latency-critical MoE inference faces a fundamental tension: while expert parallelism improves memory efficiency, it also amplifies execution stragglers. In real-world serving, continuous batching and diverse concurrent requests induce rapid semantic shifts, causing expert hotspots to migrate abruptly across GPUs and triggering the ‘double penalty’ of coupled computational skew and network congestion. We propose PROBE, an inference system that co-balances computation and communication in real time. PROBE introduces Continuous Lookahead Pipelining, which proactively predicts, plans, and prefetches for upcoming layers while keeping all control overheads off the critical path. PROBE consists of: (1) a Gate-Initialized Lookahead Predictor that distills the target router to forecast next-layer expert activation with high fidelity; (2) a Hardware-Aware Balance Planning solver that jointly optimizes dynamic expert replication and token assignment under strict hiding-window constraints; and (3) a Phase-Locked Co-Scheduling policy that uses split-phase transmission to hide bandwidth-intensive expert transfers behind computation without contending with All-to-All collectives. Experiments show that PROBE reduces prefill latency by up to 1.32× and improves decoding throughput by up to 1.26× over state-of-the-art baselines, especially under extreme workload volatility.
💡 Research Summary
Mixture‑of‑Experts (MoE) models have become the de facto scaling strategy for large language models because they decouple parameter count from the amount of computation performed per token. While expert parallelism (EP) enables these massive models to fit into GPU memory, it also introduces a severe “double penalty” during inference: overloaded GPUs become stragglers both computationally and network‑wise, as the All‑to‑All communication required for token routing is congested on the same devices that host the hot experts. This problem is compounded by temporal volatility—continuous batching and heterogeneous request arrivals cause expert hotspots to shift abruptly from one layer to the next, especially during the prefill phase, when tens of thousands of tokens are processed simultaneously. Existing solutions such as static expert replication, history‑based load balancing, or training‑time router adjustments cannot keep up with these rapid shifts without incurring prohibitive memory overhead or latency penalties.
PROBE (Predictive Real‑time Optimized Balancing Engine) tackles the issue by proactively forecasting, planning, and prefetching expert data for the upcoming layer while keeping all control work off the critical path. Its architecture consists of three tightly integrated components:
- Gate‑Initialized Lookahead Predictor – The predictor freezes the target layer’s router weights as a prior and feeds the hidden states from the previous layer into a lightweight MLP. This design distills the routing logic without executing the full router, achieving roughly 90% accuracy in predicting which experts will be activated in the next layer, and does so with negligible compute overhead.
- Hardware‑Aware Balance Planning – Using the predictor’s output, PROBE formulates a resource‑assignment problem that jointly decides which experts to replicate and how to assign tokens to GPUs. Crucially, it respects a “hiding‑window” constraint that bounds the amount of data transferred for replication to the time window available for overlapping with the main compute pipeline. This ensures that prefetching never stalls the pipeline and that memory consumption stays within the limits imposed by KV‑cache requirements.
- Phase‑Locked Co‑Scheduling – PROBE runs prediction, planning, and prefetching on a dual‑track architecture separate from the main inference stream. It splits expert transfers into two phases that are scheduled orthogonally to the All‑to‑All collective operations, eliminating bandwidth contention. The scheduler adapts to the specific hardware profile (compute‑heavy vs. bandwidth‑heavy GPUs) and automatically tunes the hiding‑window size to maximize overlap.
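To make the first component concrete, the lookahead predictor can be sketched as a small MLP whose output projection reuses the frozen router (gate) matrix of the target layer as a prior. The sketch below is illustrative only: the dimensions, the single ReLU layer, and the function name `predict_next_layer_experts` are assumptions, not the paper’s exact design.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 4

# Frozen router (gate) weights of the *target* layer l: the predictor's
# output projection is initialized from — and kept at — these weights.
W_gate_l = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

# Lightweight MLP mapping layer l-1 hidden states toward layer l's input
# distribution before the frozen gate is applied (sizes are illustrative).
W1 = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
b1 = np.zeros(d_model)

def predict_next_layer_experts(h_prev: np.ndarray) -> np.ndarray:
    """Forecast top-k expert ids for the next layer from current hidden states."""
    z = np.maximum(h_prev @ W1 + b1, 0.0)   # one-ReLU-layer MLP
    logits = z @ W_gate_l                   # frozen target-layer router as prior
    # unordered top-k expert indices per token
    return np.argpartition(-logits, top_k, axis=-1)[:, :top_k]

tokens = rng.standard_normal((8, d_model))  # a micro-batch of 8 tokens
pred = predict_next_layer_experts(tokens)
print(pred.shape)  # (8, 4)
```

In a real system the MLP would be trained offline to distill the target router, so only a cheap matrix product runs on the critical path of each layer.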
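The balance-planning step can be approximated by a greedy heuristic: replicate the hottest predicted experts onto the least-loaded GPUs while the replication traffic stays within the hiding-window byte budget, then split each expert’s tokens evenly over its replicas. This is a toy sketch under stated assumptions (the paper describes a joint hardware-aware solver; `plan_balance` and its signature are hypothetical).

```python
def plan_balance(expert_load, home_gpu, n_gpus, expert_bytes, budget_bytes):
    """Greedy sketch of hiding-window-constrained replication + assignment.
    expert_load[e]: predicted token count for expert e.
    home_gpu[e]:    GPU that statically hosts expert e.
    Returns (replicas per expert, projected per-GPU load)."""
    replicas = {e: [home_gpu[e]] for e in range(len(expert_load))}

    def loads():
        # projected load if each expert's tokens split evenly over replicas
        l = [0.0] * n_gpus
        for e, tok in enumerate(expert_load):
            for g in replicas[e]:
                l[g] += tok / len(replicas[e])
        return l

    spent = 0
    # Hottest experts first, while the prefetch byte budget allows.
    for e in sorted(range(len(expert_load)), key=lambda e: -expert_load[e]):
        if spent + expert_bytes > budget_bytes:
            break  # hiding-window constraint: no more data can be overlapped
        cur = loads()
        candidates = [g for g in range(n_gpus) if g not in replicas[e]]
        if candidates:
            replicas[e].append(min(candidates, key=lambda g: cur[g]))
            spent += expert_bytes
    return replicas, loads()
```

For example, with one hot expert (`expert_load=[100, 10, 10, 10]`, one expert per GPU) and budget for a single replica, the maximum per-GPU load drops from 100 to 60 tokens.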
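The split-phase idea behind Phase-Locked Co-Scheduling can be shown with a toy timeline model: the inter-GPU link is free only during expert-compute windows (between the All-to-All dispatch and combine of each layer), and the expert transfer is cut into two chunks that are packed into those windows. The even two-way split and the function name `phase_locked_schedule` are assumptions for illustration.

```python
def phase_locked_schedule(windows, transfer_ms):
    """Fit the two halves of an expert-weight transfer into link-idle windows
    (while experts compute, i.e. between All-to-All dispatch and combine).
    windows: list of (start_ms, end_ms) intervals when the link is free.
    Returns the two placed intervals, or None if the transfer cannot be hidden."""
    half = transfer_ms / 2  # split-phase transmission: two equal chunks
    placed = []
    for start, end in windows:
        while len(placed) < 2 and end - start >= half:
            placed.append((start, start + half))
            start += half
    return placed if len(placed) == 2 else None

# A 10 ms transfer hidden across two 6 ms compute windows:
print(phase_locked_schedule([(0.0, 6.0), (10.0, 16.0)], 10.0))
# [(0.0, 5.0), (10.0, 15.0)]
```

Returning `None` corresponds to the case where the planner must shrink the replication set: a transfer that cannot be phase-locked into idle windows would otherwise contend with the All-to-All collectives.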
The authors evaluate PROBE on an 8‑GPU H800 cluster using two state‑of‑the‑art MoE models: GPT‑OSS‑120B (128 experts, top‑4 routing) and Qwen3‑235B (128 experts, top‑8 routing). Workloads emulate realistic serving conditions with continuous batching, varying request lengths, and both prefill (≈32 K tokens) and decoding (≈8 K tokens) phases. Compared to strong baselines such as Grace‑MoE, Libra, FasterMoE, and static expert replication, PROBE achieves up to 1.32× reduction in prefill latency and 1.26× increase in decoding throughput. The system consistently mitigates stragglers even when the imbalance ratio (IR) spikes above 2.5 during hot‑spot bursts. Memory overhead remains modest because replication is performed only for the most overloaded experts and is reclaimed when the hotspot dissipates. Overall control‑path overhead stays below 0.5 % of total inference time.
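The imbalance ratio (IR) mentioned above is commonly computed as the maximum per-GPU load divided by the mean per-GPU load, so IR = 1.0 means a perfectly balanced layer; the paper’s exact definition may differ slightly.

```python
def imbalance_ratio(gpu_loads):
    """Max per-GPU load divided by mean load; 1.0 means perfectly balanced."""
    return max(gpu_loads) * len(gpu_loads) / sum(gpu_loads)

print(imbalance_ratio([100, 40, 40, 20]))  # 2.0
```

Under this definition, an IR above 2.5 means the straggler GPU carries more than 2.5× the average work of the layer, which is why unmitigated hotspot bursts dominate end-to-end latency.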
In summary, PROBE demonstrates that a predictive‑lookahead pipeline can effectively neutralize both spatial load imbalance and temporal hotspot volatility in MoE inference. By co‑optimizing computation and communication in real time, it enables latency‑critical services to deploy large‑scale MoE models without sacrificing responsiveness or incurring excessive memory costs, establishing a new paradigm for serving next‑generation language models.