Multi-Agentic AI for Fairness-Aware and Accelerated Multi-modal Large Model Inference in Real-world Mobile Edge Networks
Generative AI (GenAI) has transformed applications in natural language processing and content creation, yet centralized inference remains hindered by high latency, limited customizability, and privacy concerns. Deploying large models (LMs) in mobile edge networks emerges as a promising solution. However, it also poses new challenges, including heterogeneous multi-modal LMs with diverse resource demands and inference speeds, varied prompt/output modalities that complicate orchestration, and resource-limited infrastructure ill-suited for concurrent LM execution. In response, we propose a Multi-Agentic AI framework for latency- and fairness-aware multi-modal LM inference in mobile edge networks. Our solution includes a long-term planning agent, a short-term prompt scheduling agent, and multiple on-node LM deployment agents, all powered by foundation language models. These agents cooperatively optimize prompt routing and LM deployment through natural language reasoning over runtime telemetry and historical experience. To evaluate its performance, we further develop a city-wide testbed that supports network monitoring, containerized LM deployment, intra-server resource management, and inter-server communications. Experiments demonstrate that our solution reduces average latency by over 80% and raises fairness (Normalized Jain Index) to 0.90, outperforming the baselines. Moreover, our solution adapts quickly without fine-tuning, offering a generalizable approach to optimizing GenAI services in edge environments.
💡 Research Summary
The paper addresses the emerging need to run large multi‑modal generative AI models (LLMs, diffusion models, vision‑language models, etc.) directly on mobile edge computing (MEC) infrastructure. Centralized inference in cloud data centers suffers from high uplink latency, bandwidth costs, limited customisation, and privacy concerns, while edge deployment introduces new challenges: heterogeneous resource demands across modalities, diverse prompt and output types that affect network traffic, and severely constrained compute and memory on edge servers. To tackle these issues, the authors propose a Multi‑Agentic AI framework consisting of three cooperating agents, each powered by a foundation large language model (LLM) that performs natural‑language reasoning over telemetry and historical experience.
- Long‑Term Planning Agent (Tier‑1 Global Planning Agent) operates on a coarse time scale (hours to days). It aggregates long‑term telemetry (traffic patterns, server utilisation histories) and queries an episodic memory of past decisions. Using few‑shot prompting and chain‑of‑thought reasoning, it produces a macro‑level policy that specifies probabilistic prompt‑to‑server routing and high‑level LM deployment intents for each MEC node.
- Short‑Term Prompt Scheduling Agent works at the slot level (seconds). It continuously monitors real‑time queue lengths, wireless channel quality, CPU/GPU utilisation, and bandwidth availability. Given the macro‑policy, it asks the LLM “Which server and which model should handle the current batch of text and image prompts to minimise latency while respecting fairness constraints?” The LLM returns concrete routing decisions, which are enacted immediately. This replaces traditional reinforcement‑learning schedulers, eliminating the need for costly policy retraining when workloads shift.
- On‑Node LM Deployment Control Agents reside on each MEC server. They translate high‑level deployment intents into concrete container‑level actions: starting or stopping specific model containers, allocating GPU memory (VRAM), pinning GPUs, and adjusting Kubernetes resource quotas. The LLM generates textual commands such as “Swap out the image‑to‑text model if GPU memory falls below 2 GB and activate the text‑to‑text model.” These commands are executed via the Kubernetes API, enabling dynamic, fine‑grained resource management without manual intervention.
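The slot‑level scheduling step can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_llm` is a stub standing in for the foundation‑model call, and all field names (`server`, `model`, `allowed_servers`) are assumptions.

```python
import json

def query_llm(prompt: str) -> str:
    """Stub for the LLM call; returns a JSON routing decision."""
    return json.dumps({"server": "mec-3", "model": "text-to-text"})

def schedule_batch(telemetry: dict, macro_policy: dict) -> dict:
    """Build a natural-language prompt from runtime telemetry and the
    macro-policy, then parse the LLM's answer into a routing decision."""
    prompt = (
        f"Queues: {telemetry['queues']}. GPU utilisation: {telemetry['gpu_util']}. "
        f"Macro-policy: {macro_policy}. "
        "Which server and which model should handle the current batch to "
        "minimise latency while respecting fairness constraints? "
        "Answer as JSON with keys 'server' and 'model'."
    )
    decision = json.loads(query_llm(prompt))
    # Reject decisions outside the macro-policy's allowed servers.
    if decision["server"] not in macro_policy["allowed_servers"]:
        raise ValueError("LLM proposed a server outside the macro-policy")
    return decision

decision = schedule_batch(
    {"queues": {"mec-3": 2, "mec-5": 7}, "gpu_util": {"mec-3": 0.4, "mec-5": 0.9}},
    {"allowed_servers": ["mec-3", "mec-5"], "routing": {"mec-3": 0.7, "mec-5": 0.3}},
)
```

The key design point the paper emphasises survives even in this toy form: the decision logic lives in the prompt, so shifting workloads change the LLM's input text rather than requiring policy retraining.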
The framework thus creates a natural‑language‑driven decision‑to‑action pipeline: high‑level policies are expressed in prose, interpreted by LLMs, and automatically applied to the edge infrastructure. The agents share an episodic memory and a policy summary, allowing the system to adapt quickly to new models or workload patterns without any fine‑tuning of the underlying LLMs.
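The shared episodic memory could be sketched as below. The retrieval scheme (keyword overlap) is an assumption for illustration; the paper does not restate its exact similarity measure here.

```python
class EpisodicMemory:
    """Stores (situation, decision, outcome) triples shared by the agents."""

    def __init__(self):
        self.episodes = []

    def record(self, situation: str, decision: str, outcome: str):
        self.episodes.append((situation, decision, outcome))

    def recall(self, query: str, k: int = 2):
        """Return the k past episodes whose situation shares the most words
        with the query, for use as few-shot examples in the LLM prompt."""
        q = set(query.lower().split())
        scored = sorted(
            self.episodes,
            key=lambda ep: len(q & set(ep[0].lower().split())),
            reverse=True,
        )
        return scored[:k]

mem = EpisodicMemory()
mem.record("evening peak with gpu saturated on mec-5",
           "route text prompts to mec-3", "latency fell 40%")
mem.record("model update on mec-1",
           "drain queue before swap", "no dropped prompts")
best = mem.recall("gpu saturated during peak traffic")[0]
```

Retrieved episodes are then pasted into the planning agent's prompt as few‑shot examples, which is what lets the system adapt without fine‑tuning the underlying LLMs.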
To evaluate the approach, the authors built a city‑wide testbed in Bristol, UK, using OpenStack for the underlying infrastructure and Kubernetes for container orchestration. Eight heterogeneous MEC servers (varying CPU, GPU, and memory capacities) host five state‑of‑the‑art multi‑modal models: a 6.7 B parameter text‑to‑text model, Stable‑Diffusion‑v1.5 for text‑to‑image, BLIP‑2 for image‑to‑text, and two additional vision‑language models. Realistic workloads were generated for 24 hours, including peak traffic (three times the average load), model updates, and simulated server failures. Baselines comprised (1) naïve single‑server routing, (2) DRL‑based offloading decisions, and (3) model‑centric compression/splitting techniques.
Key results:
- Latency reduction: Average end‑to‑end response time dropped by >80 % compared with the best baseline (e.g., from 1.2 s to 0.2 s).
- Fairness improvement: The Normalized Jain Index rose from 0.51 to 0.90, indicating nearly equitable service across all user groups and modalities.
- Adaptability: When a new model was added or traffic patterns changed, the system re‑optimised its policy within 5 minutes without any additional model‑specific training.
- Overhead: The LLM‑driven agents consumed less than 3 % of CPU capacity and under 200 MB of memory on each node, confirming that the reasoning layer is lightweight relative to the inference workload.
The paper also discusses limitations. The LLM’s own inference latency, while modest, adds a small constant to the decision pipeline; in ultra‑low‑latency scenarios this may become non‑trivial. The cumulative resource cost of running multiple LLM‑based agents across a large fleet of edge nodes could grow, suggesting the need for hierarchical or shared‑agent designs in massive deployments. Because the agents rely on natural‑language generation, there is a risk of “hallucination” or invalid commands, so a rule‑based validator or safety filter is advisable. Finally, privacy considerations arise when telemetry and prompt content are fed to the LLM; encryption or on‑device LLMs could mitigate this.
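The rule-based validator suggested above could be as simple as a whitelist of command patterns checked before anything reaches the cluster. The command grammar and patterns below are illustrative assumptions, not the paper's actual safety filter.

```python
import re

# Only commands matching a known-safe pattern are allowed through;
# everything else (including hallucinated commands) is rejected.
ALLOWED_PATTERNS = [
    re.compile(r"^start model [\w\-]+ on [\w\-]+$"),
    re.compile(r"^stop model [\w\-]+ on [\w\-]+$"),
    re.compile(r"^set gpu-memory [\w\-]+ \d+Gi$"),
]

def validate(command: str) -> bool:
    """Accept an LLM-generated command only if it matches the whitelist."""
    return any(p.match(command) for p in ALLOWED_PATTERNS)

ok = validate("start model blip-2 on mec-4")
bad = validate("delete namespace kube-system")  # hallucinated / unsafe
```

A pattern whitelist is deliberately conservative: it cannot catch a semantically wrong but well-formed command, but it guarantees that free-form LLM output can never execute an action class the operator did not anticipate.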
In summary, the authors present a novel, practical, and scalable solution for multi‑modal generative AI inference at the edge. By leveraging foundation models as reasoning engines, they achieve simultaneous optimisation of latency, fairness, and resource utilisation, while maintaining rapid adaptability to evolving workloads. The extensive real‑world testbed and thorough evaluation substantiate the claim that multi‑agentic AI can become a cornerstone technology for future edge‑centric GenAI services.