Edge-Optimized Vision-Language Models for Underground Infrastructure Assessment
Autonomous inspection of underground infrastructure, such as sewer and culvert systems, is critical to public safety and urban sustainability. Although robotic platforms equipped with visual sensors can efficiently detect structural deficiencies, the automated generation of human-readable summaries from these detections remains a significant challenge, especially on resource-constrained edge devices. This paper presents a novel two-stage pipeline for end-to-end summarization of underground deficiencies, combining our lightweight RAPID-SCAN segmentation model with a fine-tuned Vision-Language Model (VLM) deployed on an edge computing platform. The first stage employs RAPID-SCAN (Resource-Aware Pipeline Inspection and Defect Segmentation using Compact Adaptive Network), achieving 0.834 F1-score with only 0.64M parameters for efficient defect segmentation. The second stage utilizes a fine-tuned Phi-3.5 VLM that generates concise, domain-specific summaries in natural language from the segmentation outputs. We introduce a curated dataset of inspection images with manually verified descriptions for VLM fine-tuning and evaluation. To enable real-time performance, we employ post-training quantization with hardware-specific optimization, achieving significant reductions in model size and inference latency without compromising summarization quality. We deploy and evaluate our complete pipeline on a mobile robotic platform, demonstrating its effectiveness in real-world inspection scenarios. Our results show the potential of edge-deployable integrated AI systems to bridge the gap between automated defect detection and actionable insights for infrastructure maintenance, paving the way for more scalable and autonomous inspection solutions.
💡 Research Summary
The paper introduces a two‑stage, edge‑optimized pipeline for autonomous inspection of underground infrastructure such as sewers and culverts. The first stage employs RAPID‑SCAN, a lightweight semantic segmentation network designed with a Dynamic Feature Pyramid Network, adaptive routing, and Squeeze‑Excitation modules. Despite having only 0.64 M parameters (a 97 % reduction compared to conventional models), RAPID‑SCAN achieves an F1‑score of 0.834 and a mean IoU of 0.638 on the authors' Sewer‑Culvert Defect dataset, while requiring fewer than 4.52 GFLOPs per inference, making it suitable for real‑time deployment on mobile robots.
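The Squeeze‑Excitation mechanism mentioned above recalibrates channel responses by pooling each feature map to a scalar, passing the result through a small bottleneck, and gating the channels with the resulting weights. The sketch below is a minimal dependency‑free illustration of that idea, not the authors' RAPID‑SCAN implementation; the function name and the list‑of‑lists tensor layout are our own simplifications.

```python
import math

def squeeze_excitation(feature_maps, w1, w2):
    """Minimal Squeeze-and-Excitation channel attention (illustrative only).

    feature_maps: list of C channels, each an H x W list of floats.
    w1: reduction weights, shape (C_reduced x C).
    w2: expansion weights, shape (C x C_reduced).
    Returns channel-recalibrated feature maps.
    """
    # Squeeze: global average pool each channel to one descriptor.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
         for ch in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid produces one gate per channel.
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: reweight every spatial location of a channel by its gate.
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(feature_maps, gates)]
```

With zero expansion weights every gate is sigmoid(0) = 0.5, i.e. each channel is uniformly halved, which makes the gating behaviour easy to verify by hand.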
The second stage leverages a fine‑tuned Phi‑3.5 Vision‑Language Model (VLM) to convert the segmentation output and the original RGB image into concise, domain‑specific natural‑language summaries. To adapt the large pre‑trained model to the inspection domain under strict resource constraints, the authors apply QLoRA: the base model weights are quantized to 4‑bit NormalFloat4 (NF4) precision, and low‑rank LoRA adapters (rank = 16, scaling α = 32) are inserted into the attention layers. This reduces trainable parameters from 3.8 B to 67 M (a 98.2 % reduction) while preserving adaptation capacity. Training uses supervised fine‑tuning with a structured prompt that includes the image, segmentation mask, and defect labels, encouraging the model to generate summaries covering four key aspects: Condition, Location, Severity, and Implications. The training schedule consists of three epochs, a batch size of four, gradient accumulation of eight steps, a learning rate of 2 × 10⁻⁴, and cosine annealing.
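The LoRA component of QLoRA leaves the (quantized) base weight W frozen and learns only a low‑rank update, applied as y = Wx + (α/r)·B(Ax). The snippet below illustrates that forward pass in plain Python; the paper uses rank r = 16 and α = 32 inside the attention layers, whereas this sketch uses tiny matrices and a hypothetical `lora_forward` helper purely for clarity.

```python
def lora_forward(x, W, A, B, rank, alpha):
    """y = W x + (alpha / rank) * B (A x).

    W is the frozen base weight; only the low-rank factors A (r x d_in)
    and B (d_out x r) are trained, which is why trainable parameters
    drop so sharply (3.8 B -> 67 M in the paper's setting).
    """
    def matvec(M, v):
        return [sum(m * vv for m, vv in zip(row, v)) for row in M]

    base = matvec(W, x)               # frozen full-rank path
    delta = matvec(B, matvec(A, x))   # trainable low-rank path
    scale = alpha / rank              # LoRA scaling factor
    return [b + scale * d for b, d in zip(base, delta)]
```

Because A and B together hold r·(d_in + d_out) parameters instead of d_in·d_out, adaptation capacity is traded for a dramatic reduction in memory and optimizer state, which is what makes fine‑tuning feasible under the edge‑deployment budget described above.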
Post‑training, the model undergoes a three‑step optimization pipeline: (1) merging LoRA adapters with the quantized base model, (2) INT8 symmetric per‑channel weight quantization with dynamic activation quantization, and (3) TensorRT engine generation that uses mixed‑precision inference (FP16 for the vision encoder, INT8 for compatible language layers). Validation against a held‑out test set ensures that ROUGE‑L scores stay above 0.70, with any model falling below this threshold being re‑calibrated. The final optimized model runs on an NVIDIA Jetson AGX Orin (CUDA 11.8, TensorRT 8.6) with 6 GB reserved for model weights and 2 GB for dynamic tensors. Asynchronous pipelines handle preprocessing, inference, and post‑processing in parallel streams, achieving inference latency under 3 seconds per summary, GPU memory usage below 85 % of the device's capacity, and thermal stability under 75 °C.
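The ROUGE‑L quality gate applied after quantization can be reproduced with the standard LCS‑based formulation of ROUGE‑L. The sketch below implements that metric from scratch and wraps it in a threshold check; the function names and the β = 1.2 recall weighting are our illustrative choices, since the paper does not specify its ROUGE‑L configuration.

```python
def rouge_l_f1(reference, candidate, beta=1.2):
    """ROUGE-L F-score from the longest common subsequence of word tokens."""
    ref, cand = reference.split(), candidate.split()
    m, n = len(ref), len(cand)
    # LCS length via dynamic programming.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ref[i] == cand[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    recall, precision = lcs / m, lcs / n
    return ((1 + beta ** 2) * precision * recall) / (recall + beta ** 2 * precision)

def passes_quality_gate(reference, candidate, threshold=0.70):
    """Mirror the paper's gate: summaries scoring below 0.70 trigger re-calibration."""
    return rouge_l_f1(reference, candidate) >= threshold
```

In a deployment loop, a quantized model whose generated summaries fail this gate against the held‑out references would be sent back for re‑calibration rather than shipped to the robot.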
A curated dataset, the Sewer‑Culvert Defect (SCD) Natural Language Captioning Annotation set, underpins the VLM fine‑tuning. It comprises 5,051 RGB images annotated with eight defect classes (cracks, roots, holes, joint problems, deformation, fracture, erosion/deposits, loose gasket). Initial captions were generated automatically by the base Phi‑3.5 model, then refined by domain experts to ensure accuracy, proper terminology, and alignment with industry reporting standards. The annotation protocol enforces four pillars—Condition, Location, Severity, Implications—mirroring real inspection reports.
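An annotation protocol built on fixed pillars lends itself to a simple completeness check during dataset curation. The helper below is a hypothetical sketch of such a validator, assuming each refined caption is stored as a mapping from pillar name to text; the paper does not describe its tooling, only the four required fields.

```python
# The four pillars the SCD annotation protocol enforces.
REQUIRED_PILLARS = ("Condition", "Location", "Severity", "Implications")

def validate_caption(caption_fields):
    """Return the pillars that are missing or empty; an empty list means
    the expert-refined caption satisfies the protocol."""
    return [p for p in REQUIRED_PILLARS
            if not caption_fields.get(p, "").strip()]
```

Running such a check over all 5,051 captions before fine‑tuning would guarantee that every training example exposes the model to the full Condition/Location/Severity/Implications structure it is expected to generate.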
Experimental results demonstrate that the edge‑optimized pipeline delivers a 3× speedup over a full‑precision baseline while incurring less than 2 % degradation in summarization quality (ROUGE‑L 0.78 vs. 0.80, BLEU 0.71). RAPID‑SCAN’s segmentation performance remains competitive despite its tiny footprint, and the VLM’s summaries are both fluent and technically precise, as confirmed by expert evaluation. The complete system was deployed on a Clearpath Jackal robot equipped with an Axis RGB PTZ camera, IMU, GPS, and a Velodyne VLP‑16 LiDAR. Field trials in real sewer networks showed continuous defect detection, automatic consolidation of spatially redundant detections, and on‑the‑fly generation of actionable reports that operators could query in real time.
In conclusion, the authors provide a practical, scalable framework that bridges the gap between high‑accuracy visual defect detection and human‑readable reporting on resource‑constrained edge devices. The combination of an ultra‑lightweight segmentation model, parameter‑efficient VLM adaptation via QLoRA, and hardware‑specific optimizations (INT8 quantization, TensorRT) enables real‑time, autonomous infrastructure inspection. Future work will explore multimodal sensor fusion, long‑term temporal analysis, and extension to other critical infrastructure domains such as bridges and tunnels.