Towards Governance-Oriented Low-Altitude Intelligence: A Management-Centric Multi-Modal Benchmark With Implicitly Coordinated Vision-Language Reasoning Framework
Low-altitude vision systems are becoming a critical infrastructure for smart city governance. However, existing object-centric perception paradigms and loosely coupled vision-language pipelines still struggle to support the management-oriented anomaly understanding required in real-world urban governance. To bridge this gap, we introduce GovLA-10K, the first management-oriented multi-modal benchmark for low-altitude intelligence, along with GovLA-Reasoner, a unified vision-language reasoning framework tailored for governance-aware aerial perception. Unlike existing studies that aim to exhaustively annotate all visible objects, GovLA-10K is deliberately designed around functionally salient targets that directly correspond to practical management needs, and further provides actionable management suggestions grounded in these observations. To effectively coordinate fine-grained visual grounding with high-level contextual language reasoning, GovLA-Reasoner introduces an efficient feature adapter that implicitly coordinates discriminative representation sharing between the visual detector and the large language model (LLM). Extensive experiments show that our method significantly improves performance without fine-tuning any task-specific components. We believe our work offers a new perspective and foundation for future studies on management-aware low-altitude vision-language systems.
💡 Research Summary
The paper addresses a critical gap in low‑altitude aerial intelligence for smart‑city governance: existing datasets and pipelines focus on exhaustive object detection, which does not align with the practical needs of urban management that prioritize abnormal, risky, or regulation‑violating situations. To bridge this gap, the authors introduce two contributions.
First, GovLA‑10K, a management‑oriented multimodal benchmark consisting of 10,572 high‑quality UAV images cropped to 512 × 512 pixels. The dataset is built around nine functionally salient categories (illegal parking, construction debris, fencing, brick piles, aggregate piles, construction workers, ground litter, overflowing trash bins, scaffolding) that directly map to common governance concerns. Data collection combines large‑scale web crawling and in‑house UAV flights over diverse urban environments. Annotation follows a semi‑automatic three‑stage pipeline: (1) expert manual bounding‑box labeling, (2) automatic verification with a strong grounding‑oriented detector (MM‑GroundingDINO) and IoU‑based secondary review for low‑confidence cases, and (3) generation of fine‑grained scene captions and management recommendations using a state‑of‑the‑art open‑source vision‑language model (Qwen3VL‑235B‑A22B), followed by human validation. Captions are explicitly split into an objective description and a concrete management recommendation, avoiding speculative language and ensuring actionable guidance.
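The IoU-based secondary review in stage (2) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the box format, the 0.5 IoU threshold, and the function names are all assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def flag_for_review(manual_boxes, detector_boxes, iou_thresh=0.5):
    """Flag manual boxes that no automatic detection matches well,
    routing only those low-agreement cases to secondary human review."""
    flagged = []
    for i, mbox in enumerate(manual_boxes):
        best = max((iou(mbox, dbox) for dbox in detector_boxes), default=0.0)
        if best < iou_thresh:
            flagged.append(i)
    return flagged
```

The key property is that high-agreement annotations pass automatically, so expert effort concentrates on the small fraction of boxes where the manual label and the grounding detector disagree.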
Second, GovLA‑Reasoner, a unified vision‑language reasoning framework that eliminates the fragile “detect‑then‑prompt” paradigm. Instead of converting detections into textual prompts, the model inserts a lightweight Feature Adapter between the visual detector and a frozen large language model (LLM). The adapter compresses and aggregates discriminative visual features (ROI‑pooled embeddings and global image tokens) via multi‑head attention, then injects the resulting representation directly into the LLM’s token embedding space. Only the adapter parameters are trained; the detector and LLM remain untouched, dramatically reducing computational cost and avoiding catastrophic forgetting. During inference, the system follows a two‑stage process: Stage 1 produces a concise scene description based solely on visible objects; Stage 2 is triggered only when a violation category is detected, prompting the LLM to generate a management recommendation grounded in the visual evidence.
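The adapter described above can be sketched in NumPy as learned query tokens that attend over the concatenated ROI and global features, with the result projected into the LLM embedding space. All dimensions, initializations, and class/parameter names here are illustrative assumptions, not the paper's architecture details.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class FeatureAdapter:
    """Minimal sketch: a fixed set of learned queries performs multi-head
    attention over detector ROI embeddings plus global image tokens, then a
    linear projection maps the fused result into the LLM token space.
    Only these adapter weights would be trained; detector and LLM stay frozen."""

    def __init__(self, vis_dim, llm_dim, num_queries=4, num_heads=2, seed=0):
        assert vis_dim % num_heads == 0
        rng = np.random.default_rng(seed)
        self.h = num_heads
        self.dk = vis_dim // num_heads
        self.queries = rng.standard_normal((num_queries, vis_dim)) * 0.02
        self.w_k = rng.standard_normal((vis_dim, vis_dim)) * 0.02
        self.w_v = rng.standard_normal((vis_dim, vis_dim)) * 0.02
        self.w_out = rng.standard_normal((vis_dim, llm_dim)) * 0.02

    def __call__(self, roi_feats, global_feats):
        # Concatenate ROI-pooled embeddings and global image tokens.
        tokens = np.concatenate([roi_feats, global_feats], axis=0)
        k = tokens @ self.w_k
        v = tokens @ self.w_v
        heads = []
        for head in range(self.h):
            sl = slice(head * self.dk, (head + 1) * self.dk)
            attn = softmax(self.queries[:, sl] @ k[:, sl].T / np.sqrt(self.dk))
            heads.append(attn @ v[:, sl])
        fused = np.concatenate(heads, axis=1)   # (num_queries, vis_dim)
        return fused @ self.w_out               # pseudo-tokens in LLM space
```

The output is a small, fixed number of pseudo-tokens that can be prepended to the LLM's input embeddings, which is what lets the detector's discriminative features reach the frozen LLM without any textual intermediate.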
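The two-stage inference logic can be expressed as a simple dispatch, sketched below. The violation-category subset and the `llm_generate` callable are hypothetical stand-ins for the paper's actual category list and frozen LLM.

```python
# Illustrative subset of violation categories; the paper's full set differs.
VIOLATION_CATEGORIES = {
    "illegal parking", "construction debris", "ground litter",
    "overflowing trash bin",
}

def two_stage_inference(detections, llm_generate):
    """detections: list of (category, score) pairs from the detector.
    llm_generate: any prompt -> text callable standing in for the LLM."""
    # Stage 1: always produce a concise, objective scene description
    # grounded only in the visible detected objects.
    labels = sorted({cat for cat, _ in detections})
    description = llm_generate(
        f"Describe the scene containing: {', '.join(labels)}.")

    # Stage 2: triggered only when a violation category is present,
    # prompting for a management recommendation tied to that evidence.
    violations = sorted({cat for cat, _ in detections
                         if cat in VIOLATION_CATEGORIES})
    recommendation = None
    if violations:
        recommendation = llm_generate(
            f"Suggest a management action for: {', '.join(violations)}.")
    return description, recommendation
```

Gating stage 2 on detected violations keeps recommendations grounded in visual evidence and avoids prompting the LLM to speculate about problems it cannot see.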
Extensive experiments on GovLA‑10K demonstrate that GovLA‑Reasoner outperforms conventional detector‑plus‑VLM pipelines. It achieves a 12 % absolute gain in vision‑language precision (measured by a combined mAP and textual alignment metric) and an 85 % human‑agreement rate on management recommendations, compared with 68 % for the baseline. Ablation studies confirm that the adapter is essential: removing it leads to severe performance drops, while varying the number of attention heads shows that 2–4 heads provide the best trade‑off between accuracy and efficiency. The approach runs in real‑time (≈30 FPS) on a single GPU, with a 30 % reduction in memory usage because only the adapter is updated.
The authors acknowledge limitations: the dataset is currently China‑centric and limited to nine categories, which may restrict generalization; the generated policy suggestions lack formal legal or ethical validation; and extending the framework to other domains will require new category definitions and possibly domain‑specific adapters. Future work is outlined to broaden geographic and categorical coverage, integrate regulatory compliance checks, and explore joint optimization of multiple LLMs with shared adapters.
In summary, GovLA‑10K and GovLA‑Reasoner together constitute a new paradigm for low‑altitude intelligence that aligns multimodal perception with concrete governance tasks, offering a cost‑effective, deployment‑friendly solution that can be directly incorporated into smart‑city management pipelines.