Accuracy-Delay Trade-Off in LLM Offloading via Token-Level Uncertainty
Large language models (LLMs) offer significant potential for intelligent mobile services but are computationally intensive for resource-constrained devices. Mobile edge computing (MEC) allows such devices to offload inference tasks to edge servers (ESs), yet introduces latency due to communication and server-side queuing, especially in multi-user environments. In this work, we propose an uncertainty-aware offloading framework that dynamically decides whether to perform inference locally or offload it to the ES, based on token-level uncertainty and resource constraints. We define a margin-based token-level uncertainty metric and demonstrate its correlation with model accuracy. Leveraging this metric, we design a greedy offloading algorithm (GOA) that minimizes delay while maintaining accuracy by prioritizing offloading for high-uncertainty queries. Our experiments show that GOA consistently achieves a favorable trade-off, outperforming baseline strategies in both accuracy and latency across varying user densities, and operates with practical computation time. These results establish GOA as a scalable and effective solution for LLM inference in MEC environments.
💡 Research Summary
The paper addresses the challenge of deploying large language models (LLMs) for mobile services in a mobile edge computing (MEC) environment where devices have limited computation and memory resources, and edge servers (ESs) face bandwidth and processing constraints under multi‑user contention. The authors propose an uncertainty‑aware offloading framework that dynamically decides, for each user query, whether to run inference locally on a lightweight model (SLM) or to offload it to an ES that hosts the full LLM.
A key contribution is the definition of a token‑level margin‑based uncertainty metric αᵢ = 1 − (p₁ − p₂), where p₁ and p₂ are the top‑1 and top‑2 probabilities obtained during top‑k sampling for the first predicted token. Empirical analysis on the bAbI dataset with LLaMA‑3.2‑1B‑Instruct shows a strong negative correlation between αᵢ and prediction accuracy, especially when αᵢ exceeds roughly 0.2. This metric is used as a weight in the objective function, emphasizing delay reduction for more uncertain queries.
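The metric above is straightforward to compute from the first decoding step. A minimal sketch (the function name and the use of NumPy are illustrative, not from the paper):

```python
import numpy as np

def margin_uncertainty(token_probs: np.ndarray) -> float:
    """Margin-based uncertainty for the first predicted token.

    token_probs: probabilities over the top-k candidate tokens,
    e.g., from the SLM's first decoding step.
    Returns alpha = 1 - (p1 - p2), where p1 and p2 are the
    top-1 and top-2 probabilities.
    """
    top2 = np.sort(token_probs)[::-1][:2]
    p1, p2 = top2[0], top2[1]
    return float(1.0 - (p1 - p2))

# A confident prediction yields low uncertainty (~0.15 here);
# a near-tie between the top two tokens pushes alpha toward 1.
alpha_confident = margin_uncertainty(np.array([0.90, 0.05, 0.05]))
alpha_uncertain = margin_uncertainty(np.array([0.35, 0.33, 0.32]))
```

A query would then be flagged for offloading when its alpha exceeds the threshold τ (roughly 0.2 per the paper's analysis).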
The system model includes: (1) a communication model where each ES's total bandwidth B is equally divided among its connected users, with a fixed uplink power P and Rayleigh fading channels; SINR determines the transmission rate Rᵢⱼ and communication delay t_commᵢⱼ = Dᵢ / Rᵢⱼ. (2) a computation model where each ES's FLOPS Cⱼ are equally shared among its users, yielding an effective compute capacity Cᵢⱼ,ES = min(C_max, Cⱼ / max(1, Σᵢ xᵢⱼ)). Local devices have a fixed compute capacity Cᵢ,L. Workloads for SLM and LLM are denoted Wᵢ,SLM and Wᵢ,LLM, leading to local delay t_compᵢ,L = Wᵢ,SLM / Cᵢ,L and edge delay t_compᵢⱼ,ES = Wᵢ,LLM / Cᵢⱼ,ES. The total end‑to‑end delay for user i under association with ES j is dᵢⱼ = xᵢⱼ (t_commᵢⱼ + t_compᵢⱼ,ES) + (1 − xᵢⱼ) t_compᵢ,L.
The optimization problem seeks to minimize the average weighted delay Σᵢ Σⱼ αᵢ dᵢⱼ subject to: (i) any query with uncertainty above a threshold τ must be offloaded (Σⱼ xᵢⱼ = 1 for αᵢ > τ); (ii) each user can be assigned to at most one ES; (iii) binary decision variables xᵢⱼ ∈ {0,1}. This problem is a non‑convex mixed‑integer program due to the SINR‑dependent rates and the binary variables, and is NP‑hard.
To obtain a tractable solution, the authors develop the Greedy Offloading Algorithm (GOA). GOA proceeds in two stages:
- High‑uncertainty assignment – Users with αᵢ > τ form set I_off. For each (i, j) pair, GOA computes a weighted delay gap Δᵢⱼ = αᵢ (t_commᵢⱼ + t_compᵢⱼ,ES – t_compᵢ,L). The algorithm iteratively selects the pair with the smallest Δᵢⱼ, sets xᵢⱼ = 1, removes the user from I_off, and updates Δ for the remaining users, until all high‑uncertainty users are assigned.
- Remaining users – For users with αᵢ ≤ τ (set I_rem), GOA recomputes Δᵢⱼ and continues to offload the pair with the most negative Δᵢⱼ, i.e., the offload that reduces the objective. The process stops when the minimum Δᵢⱼ becomes non‑negative or no users remain.
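The two stages above can be sketched as follows. This is a simplified illustration: for brevity it treats the delay terms as fixed inputs, whereas the full GOA recomputes the shared-bandwidth and shared-compute delays after each assignment.

```python
def goa(alpha, t_comm, t_es, t_local, tau):
    """Two-stage greedy offloading sketch.

    alpha[i]: uncertainty of user i; tau: offloading threshold.
    t_comm[i][j], t_es[i][j]: communication / ES compute delays.
    t_local[i]: local SLM compute delay.
    Returns the binary association matrix x[i][j].
    """
    N, M = len(alpha), len(t_comm[0])
    x = [[0] * M for _ in range(N)]

    def gap(i, j):
        # Weighted delay gap: negative means offloading reduces the objective.
        return alpha[i] * (t_comm[i][j] + t_es[i][j] - t_local[i])

    # Stage 1: high-uncertainty users must be offloaded; pick the
    # (user, ES) pair with the smallest weighted gap each iteration.
    I_off = [i for i in range(N) if alpha[i] > tau]
    while I_off:
        i, j = min(((i, j) for i in I_off for j in range(M)),
                   key=lambda p: gap(*p))
        x[i][j] = 1
        I_off.remove(i)

    # Stage 2: offload remaining users only while it helps, i.e.,
    # while the most negative gap is still below zero.
    I_rem = [i for i in range(N) if alpha[i] <= tau]
    while I_rem:
        i, j = min(((i, j) for i in I_rem for j in range(M)),
                   key=lambda p: gap(*p))
        if gap(i, j) >= 0:
            break
        x[i][j] = 1
        I_rem.remove(i)
    return x

# Two users, one ES: user 0 is high-uncertainty (forced offload);
# user 1 is confident and offloading would not reduce its weighted delay.
x = goa(alpha=[0.5, 0.1], t_comm=[[0.1], [0.1]], t_es=[[0.2], [0.2]],
        t_local=[0.5, 0.1], tau=0.2)
```

In this toy example user 0 is assigned to the ES while user 1 stays local, matching the stage-2 stopping rule.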
The algorithm’s complexity is O(N³M²), far lower than exhaustive search (O(2^{NM})), and empirical runtime is under one second for scenarios with up to 200 users and 5 edge servers.
Experimental evaluation uses Monte‑Carlo simulations (500 runs) with varying user densities (10–200) and edge server counts (1–5). Baselines include: (a) all‑local inference, (b) full offloading, (c) a static confidence‑threshold offloading, and (d) a deep‑reinforcement‑learning resource allocation scheme. Results show that GOA reduces average end‑to‑end latency by 15–30 % relative to the best baseline while keeping overall accuracy loss below 2 %. The benefit grows with user density because GOA efficiently balances communication congestion and compute queuing. Sensitivity analysis on τ indicates that values between 0.2 and 0.4 provide the best trade‑off, ensuring that most high‑uncertainty queries receive the higher‑accuracy LLM inference while low‑uncertainty queries stay local to save bandwidth and compute.
The paper’s contributions are threefold: (1) introducing a simple yet effective token‑level margin‑based uncertainty metric that correlates strongly with LLM accuracy; (2) formulating a joint communication‑computation‑uncertainty optimization problem for MEC‑based LLM inference; (3) designing a practical greedy algorithm that delivers near‑optimal performance with low computational overhead.
Limitations acknowledged include the focus on the first token’s uncertainty (which may be insufficient for longer contexts) and the use of a simplified Rayleigh fading model without mobility dynamics. Future work could extend the uncertainty metric to aggregate multi‑token confidence, incorporate more realistic channel models, and validate the framework on real‑world edge hardware and live mobile traffic.
Overall, the study demonstrates that leveraging token‑level uncertainty as a decision metric enables a scalable, accurate, and low‑latency offloading strategy for LLM services in next‑generation mobile edge networks.