Do Open-Vocabulary Detectors Transfer to Aerial Imagery? A Comparative Evaluation


Open-vocabulary object detection (OVD) enables zero-shot recognition of novel categories through vision-language models, achieving strong performance on natural images. However, transferability to aerial imagery remains unexplored. We present the first systematic benchmark evaluating five state-of-the-art OVD models on the LAE-80C aerial dataset (3,592 images, 80 categories) under strict zero-shot conditions. Our experimental protocol isolates semantic confusion from visual localization through Global, Oracle, and Single-Category inference modes. Results reveal severe domain transfer failure: the best model (OWLv2) achieves only 27.6% F1-score with a 69% false-positive rate. Critically, reducing the vocabulary from 80 classes to an average of 3.2 per image yields a 15x improvement, demonstrating that semantic confusion is the primary bottleneck. Prompt-engineering strategies, such as domain-specific prefixing and synonym expansion, fail to provide meaningful performance gains. Performance varies dramatically across datasets (F1: 0.53 on DIOR, 0.12 on FAIR1M), exposing brittleness to imaging conditions. These findings establish baseline expectations and highlight the need for domain-adaptive approaches in aerial OVD.


💡 Research Summary

This paper investigates whether state‑of‑the‑art open‑vocabulary object detection (OVD) models, which have demonstrated strong zero‑shot performance on natural images, can be directly transferred to aerial imagery without any domain‑specific fine‑tuning. To this end, the authors construct a rigorous benchmark, LAE‑80C, by aggregating validation splits from four established remote‑sensing datasets (DOTA‑v2.0, DIOR, FAIR1M, xView). LAE‑80C comprises 3,592 images, 86,558 annotated instances, and 80 categories that exhibit hierarchical overlap, attribute dependence, and fine‑grained maritime clusters—characteristics that make semantic disambiguation especially challenging for vision‑language models.

Five representative OVD systems are evaluated: Grounding DINO (two‑stage transformer with Swin‑L backbone), OWLv2 (dense image‑text matching with ViT‑L/14, pretrained on WebLI), YOLO‑World (real‑time YOLOv8 variant with offline vocabulary encoding), YOLO‑E (YOLOv11‑L with a “Seeing Anything” prompt mechanism), and LLMDet (Grounding DINO augmented by a frozen large language model for richer textual descriptions). All models are used strictly zero‑shot; they have never seen aerial data during pre‑training.

The experimental protocol defines three inference modes to isolate sources of error: (1) Global Inference – the full 80‑class vocabulary is supplied simultaneously, reflecting the standard zero‑shot scenario; (2) Oracle Inference – only the ground‑truth classes present in each image are prompted, thereby removing distractor classes and exposing pure localization capability; (3) Single‑Category Oracle – each class is prompted individually for the whole image, testing whether long text sequences degrade attention. In addition, two prompt‑engineering strategies are explored: (a) adding a domain‑specific prefix (“Aerial view of {class}”) to bias the model toward top‑down features, and (b) synonym expansion, where each class is mapped to a list of lexical equivalents.
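The two prompt-engineering strategies above can be sketched as simple prompt-construction functions. This is an illustrative sketch only: the class names and synonym lists below are hypothetical, not the paper's actual vocabulary.

```python
def prefix_prompts(classes, prefix="Aerial view of"):
    """Domain-specific prefixing: prepend a top-down cue to each class name."""
    return [f"{prefix} {c}" for c in classes]

def expand_synonyms(classes, synonyms):
    """Synonym expansion: map each class to itself plus lexical equivalents."""
    return {c: [c] + synonyms.get(c, []) for c in classes}

# Hypothetical example vocabulary (not the paper's exact lists):
classes = ["airplane", "storage tank"]
syns = {"airplane": ["aircraft", "plane"], "storage tank": ["oil tank"]}

print(prefix_prompts(classes))
# ['Aerial view of airplane', 'Aerial view of storage tank']
print(expand_synonyms(classes, syns))
# {'airplane': ['airplane', 'aircraft', 'plane'], 'storage tank': ['storage tank', 'oil tank']}
```

In practice, the expanded prompt lists would be fed to each model's text encoder in place of the raw class names; a detection matching any synonym would count for the parent class.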

Detection outputs are first kept at a low confidence threshold (0.1) to capture the full recall curve, then filtered by box confidence (0.35), text‑alignment score (0.25), class‑wise NMS (IoU = 0.1), and a final score threshold (0.1). Intersection‑over‑Area (IoA) with a 0.7 threshold is used instead of IoU to accommodate the extreme scale variance of satellite imagery. Standard precision, recall, and F1‑score are reported, together with detailed TP/FP/FN counts.
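The IoA criterion can be sketched as follows. Note this assumes one common definition, normalizing the intersection by the ground-truth box area, so a small object fully covered by a larger predicted box still matches; the paper may normalize differently.

```python
def ioa(pred, gt):
    """Intersection-over-Area for axis-aligned boxes (x1, y1, x2, y2):
    intersection divided by the ground-truth area (assumed normalization),
    rather than by the union as in IoU."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / gt_area if gt_area > 0 else 0.0

# A large prediction fully containing a small ground-truth box scores
# IoA = 1.0, whereas its IoU would be only 0.01 -- the motivation for
# using IoA under extreme scale variance.
print(ioa((0, 0, 100, 100), (10, 10, 20, 20)))  # 1.0
```

With the 0.7 threshold, a prediction is counted as a true positive when its IoA with a same-class ground-truth box reaches 0.7.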

Results reveal a severe domain gap. In the Global mode, OWLv2 achieves the highest performance with an F1 of 27.6 % and recall of 24.7 %, but it also generates 47,058 false positives against 21,408 true positives—a 69 % false‑positive rate. The other models perform markedly worse: LLMDet reaches only 12.5 % F1, while Grounding DINO variants stay below 7 % F1. Oracle experiments dramatically increase recall for all models, confirming that the primary bottleneck is semantic confusion caused by the large vocabulary rather than pure localization. Reducing the effective vocabulary from 80 classes to roughly 3.2 classes yields a 15‑fold boost in F1, underscoring the sensitivity to label overlap.
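The reported counts are internally consistent, which can be checked by recomputing the metrics from them (FN is derived from the dataset's 86,558 total instances, since only TP and FP are quoted directly):

```python
# OWLv2, Global mode: reported TP and FP counts.
tp, fp = 21408, 47058
fn = 86558 - tp  # instances missed, given 86,558 annotated instances in total

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
fp_rate = fp / (tp + fp)  # share of all detections that are false positives

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} fp_rate={fp_rate:.2f}")
# precision=0.313 recall=0.247 f1=0.276 fp_rate=0.69
```

The recomputed values match the paper's figures: recall 24.7%, F1 27.6%, and a 69% false-positive rate.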

Performance varies widely across source datasets: DIOR yields an F1 of ~0.53, whereas FAIR1M drops to 0.12, indicating that imaging conditions (resolution, sensor angle, object density) heavily influence transferability. Prompt engineering (domain prefixes, synonym lists) provides only marginal gains (1–3 % absolute improvement), suggesting that simple lexical tweaks cannot bridge the domain‑lexical gap.

The authors conclude that current OVD foundations, trained exclusively on ground‑level photographs, are ill‑suited for aerial applications in their raw form. They attribute the failure to two intertwined gaps: (i) a visual domain shift (top‑down perspective, lack of side‑profile cues) and (ii) a lexical gap (generic natural‑image vocabularies vs. specialized remote‑sensing terminology). To make open‑vocabulary detection viable for UAV, satellite, and disaster‑response scenarios, the paper recommends (a) pre‑training on large-scale aerial image‑text pairs, (b) designing scale‑invariant feature extractors that can handle dense small objects, (c) employing hierarchical label normalization and attribute‑aware prompting, and (d) developing domain‑adaptive alignment techniques (e.g., cross‑view embedding alignment).

By providing the first systematic, zero‑shot benchmark for OVD on aerial data, detailed error analyses, and a clear exposition of failure modes, this work establishes a baseline for future research and highlights the urgent need for domain‑specific adaptations before open‑vocabulary detectors can be trusted in real‑world aerial perception tasks.

