A High-Level Survey of Optical Remote Sensing

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original paper on arXiv.

In recent years, significant advances in computer vision have also propelled progress in remote sensing. Concurrently, the use of drones has expanded, with many organizations incorporating them into their operations. Most drones are equipped by default with RGB cameras, which are both robust and among the easiest sensors to use and interpret. The body of literature on optical remote sensing is vast, encompassing diverse tasks, capabilities, and methodologies. Each task or methodology could warrant a dedicated survey. This work provides a comprehensive overview of the capabilities of the field, while also presenting key information, such as datasets and insights. It aims to serve as a guide for researchers entering the field, offering high-level insights and helping them focus on areas most relevant to their interests. To the best of our knowledge, no existing survey addresses this holistic perspective.


💡 Research Summary

This paper presents a comprehensive, modality‑centric survey of optical remote sensing (ORS) that focuses on the widely used RGB imagery captured by satellites, drones, and aircraft. The authors argue that, despite the abundance of existing surveys that concentrate on specific tasks, learning paradigms, or application domains, none provides a holistic overview of the capabilities, datasets, and emerging trends associated with the RGB modality. After a brief introduction that highlights the societal importance of Earth observation and the cost‑effectiveness of RGB sensors, the survey systematically categorises the field into eight major task families:

(1) Classification – image/scene‑level labeling, cross‑domain generalisation, and fine‑grained sub‑category recognition.
(2) Object Detection – horizontal bounding‑box detection (HOD) versus oriented bounding‑box detection (OOD), with emphasis on recent transformer‑enhanced YOLO variants and lightweight CNN designs.
(3) Segmentation – semantic and instance segmentation, reviewing pure CNNs, hybrid CNN‑Transformer models, the Segment Anything Model (SAM), and generative approaches (GANs, diffusion) for mask refinement.
(4) Change Detection – binary change detection (BCD) and semantic change detection (SCD), summarising lightweight MobileNet‑based networks, pure transformer solutions, and the novel Mamba architecture.
(5) Vision‑Language – image captioning, visual question answering, and visual grounding, illustrating how pre‑training on image‑text pairs and large language models enable natural‑language interaction with remote‑sensing imagery.
(6) Image/Video Editing – super‑resolution for still images and video, including hybrid CNN‑Transformer pipelines and diffusion‑based generators such as EDiffSR.
(7) Object Counting – single‑class and multi‑class counting via high‑resolution feature extraction and density‑map prediction.
(8) Other Tasks – geo‑localisation, accident risk prediction, canopy‑height estimation from RGB, and image compression.
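To make the HOD/OOD distinction in the object‑detection family concrete, here is a minimal, self‑contained sketch (not from the paper; the (cx, cy, w, h, angle) parameterisation and function names are illustrative) that rotates an oriented box into its four corner points and derives the smallest horizontal box enclosing it:

```python
import math

def obb_corners(cx, cy, w, h, angle_rad):
    """Corners of an oriented box given centre, size, and rotation angle."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each half-extent offset about the centre, then translate.
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]

def enclosing_hbb(corners):
    """Smallest axis-aligned (horizontal) box covering the given corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))
```

Rotating a 4×2 box by 90° yields a 2×4 enclosing horizontal box; this over‑coverage of elongated, rotated objects such as ships and aircraft is precisely why OOD methods exist alongside HOD.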

For each task family the authors list the most widely used public datasets. Classification datasets are the oldest and largest; detection datasets (e.g., DIOR, DOTA v1/v2, FAIR1M) are massive and support both HOD and OOD; segmentation datasets (Inria, WHU‑Buildings, LoveDA, iSAID) focus on built‑environment objects; and change‑detection datasets (WHU‑CD, LEVIR‑CD, S2Looking) are comparatively small owing to annotation difficulty. Vision‑language benchmarks are the newest, often built by augmenting existing detection datasets with textual captions or prompts.

The survey then analyses recent methodological trends. Transformer‑based backbones and foundation models (SAM, Mamba, large multimodal models) are rapidly supplanting or augmenting traditional CNN pipelines, especially in tasks requiring global context or cross‑modal reasoning. Self‑supervised pre‑training followed by task‑specific fine‑tuning is highlighted as a promising way to mitigate the scarcity of labeled RGB data. However, the authors stress that practical deployment still faces challenges: real‑time inference on edge devices, the high storage burden of ultra‑high‑resolution imagery, domain shift between satellite, aerial, and UAV platforms, and the limited diversity of publicly available datasets.

In the “Insights and Open Topics” section the paper identifies several research directions: (i) multimodal fusion of RGB with multispectral or LiDAR data to overcome spectral limitations; (ii) weak‑supervision and active learning to reduce annotation costs; (iii) model compression, quantisation, and efficient inference for on‑board UAV processing; (iv) standardisation of benchmark protocols and larger, more diverse datasets; and (v) ethical considerations such as privacy, bias, and responsible use of remote‑sensing AI.
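Direction (iii) above, model compression and quantisation for on‑board UAV processing, can be illustrated with a generic post‑training uniform (affine) quantisation sketch in plain Python. This is a textbook scheme, not a method from the paper, and the function names are illustrative:

```python
def quantize_uniform(weights, num_bits=8):
    """Map float weights onto a uniform integer grid (affine quantisation)."""
    qmax = 2 ** num_bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    # Each code is the weight's offset from `lo` in units of `scale`.
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize_uniform(codes, scale, zero_offset):
    """Recover approximate float weights from the integer codes."""
    return [q * scale + zero_offset for q in codes]
```

With 8 bits the round‑trip error is at most half a quantisation step (scale / 2), while storage drops from 32 to 8 bits per weight, which is the basic trade‑off behind on‑device deployment of detection and segmentation networks.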

The authors conclude that this modality‑centric survey provides a unified entry point for newcomers and a roadmap for seasoned researchers, encouraging the community to advance RGB‑based ORS through more generalisable, efficient, and ethically sound AI solutions.

