BLO-Inst: Bi-Level Optimization Based Alignment of YOLO and SAM for Robust Instance Segmentation
The Segment Anything Model has revolutionized image segmentation with its zero-shot capabilities, yet its reliance on manual prompts hinders fully automated deployment. While integrating object detectors as prompt generators offers a pathway to automation, existing pipelines suffer from two fundamental limitations: objective mismatch, where detectors optimized for geometric localization do not produce the optimal prompting context required by SAM, and alignment overfitting in standard joint training, where the detector simply memorizes specific prompt adjustments for training samples rather than learning a generalizable policy. To bridge this gap, we introduce BLO-Inst, a unified framework that aligns detection and segmentation objectives via bi-level optimization. We formulate the alignment as a nested optimization problem over disjoint data splits. In the lower level, SAM is fine-tuned to maximize segmentation fidelity given the current detection proposals on one subset ($D_1$). In the upper level, the detector is updated to generate bounding boxes that explicitly minimize the validation loss of the fine-tuned SAM on a separate subset ($D_2$). This effectively transforms the detector into a segmentation-aware prompt generator, optimizing the bounding boxes not just for localization accuracy but for downstream mask quality. Extensive experiments demonstrate that BLO-Inst achieves superior performance, outperforming standard baselines on tasks across general and biomedical domains.
💡 Research Summary
The paper introduces BLO‑Inst, a novel framework that unifies an object detector (YOLO) with the Segment Anything Model (SAM) to achieve fully automated instance segmentation. While SAM excels at zero‑shot mask generation given prompts such as points or boxes, its reliance on manual prompts limits deployment in real‑world pipelines where human input is unavailable. Existing solutions simply cascade a pretrained detector in front of SAM, but this creates two fundamental problems. First, an objective mismatch: detectors are optimized for geometric localization (tight bounding boxes), whereas the optimal prompt for SAM may be tighter, looser, or otherwise altered to improve mask quality. Second, alignment overfitting: standard joint training on the same data causes the detector to memorize specific box adjustments for training samples rather than learning a generalizable prompting policy, leading to poor performance on unseen images.
BLO‑Inst addresses both issues by formulating the detector‑SAM interaction as a bi‑level optimization (BLO) problem. The training set is split into two disjoint subsets, D₁ and D₂. In the lower‑level problem, the detector parameters Φ are fixed; SAM’s parameters Θ (including lightweight LoRA modules injected into the mask decoder while the heavy ViT encoder remains frozen) are fine‑tuned on D₁ to minimize a unified loss L_total that combines YOLO’s box, objectness, and classification terms with SAM’s segmentation loss, weighted by hyper‑parameters λ₁–λ₄. This yields Θ⁎(Φ), the optimal segmentation parameters conditioned on the current detector prompts.
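The lower-level objective described above can be sketched as a weighted sum of the detector's loss terms and SAM's segmentation loss. The function below is a minimal illustration; the individual loss computations and the λ values are placeholders, not the paper's actual settings.

```python
import torch

def total_loss(box_loss, obj_loss, cls_loss, seg_loss,
               lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Unified objective L_total: a weighted sum of YOLO's box,
    objectness, and classification terms and SAM's segmentation loss.
    The lambda weights here are illustrative defaults, not tuned values."""
    l1, l2, l3, l4 = lambdas
    return l1 * box_loss + l2 * obj_loss + l3 * cls_loss + l4 * seg_loss

# Toy scalars standing in for per-batch loss values.
loss = total_loss(torch.tensor(0.5), torch.tensor(0.2),
                  torch.tensor(0.3), torch.tensor(0.4))
```

In the lower-level step this quantity is minimized over SAM's trainable parameters Θ on D₁ with the detector parameters Φ held fixed.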
In the upper‑level problem, the fine‑tuned SAM Θ⁎(Φ) is evaluated on the separate validation split D₂. The detector parameters Φ are then updated to minimize the same L_total computed on D₂, effectively treating the detector as a set of meta‑parameters (dynamic prompts) that must generalize to unseen data. By optimizing Φ on a validation set rather than the training set, BLO‑Inst prevents the detector from overfitting to the specific prompt adjustments of the training examples, thereby learning a robust prompting policy. The optimization proceeds iteratively: each iteration alternates between a gradient step on Θ (lower level) and a gradient step on Φ (upper level), with gradients blocked appropriately so that only the intended parameters receive updates (Algorithm 1).
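The alternating update scheme can be sketched with a simplified first-order approximation, where each iteration takes one gradient step on Θ using D₁ and one on Φ using D₂. The tiny linear modules below are stand-ins for the detector and SAM's trainable decoder parameters (assumptions for illustration); the paper's Algorithm 1 operates on the full models.

```python
import torch
from torch import nn

# Toy stand-ins: `detector` plays the role of YOLO (parameters Phi),
# `sam_head` plays the role of SAM's LoRA-augmented decoder (parameters Theta).
detector = nn.Linear(4, 4)
sam_head = nn.Linear(4, 1)

opt_theta = torch.optim.SGD(sam_head.parameters(), lr=0.1)
opt_phi = torch.optim.SGD(detector.parameters(), lr=0.1)

# Disjoint splits D1 (lower level) and D2 (upper level); random toy data.
x1, y1 = torch.randn(8, 4), torch.randn(8, 1)
x2, y2 = torch.randn(8, 4), torch.randn(8, 1)

for step in range(20):
    # Lower level: update Theta on D1. Detaching the detector outputs
    # blocks gradients so only SAM's parameters receive updates.
    opt_theta.zero_grad()
    prompts = detector(x1).detach()
    ((sam_head(prompts) - y1) ** 2).mean().backward()
    opt_theta.step()

    # Upper level: update Phi on D2 through the fine-tuned sam_head.
    # Gradients also reach sam_head, but its optimizer is not stepped here.
    opt_phi.zero_grad()
    loss2 = ((sam_head(detector(x2)) - y2) ** 2).mean()
    loss2.backward()
    opt_phi.step()
```

This first-order alternation ignores the implicit dependence of Θ⁎(Φ) on Φ inside the lower-level solve; it is a common practical approximation for bi-level problems, not necessarily the exact scheme used in the paper.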
Key technical contributions include:
- Identification of alignment overfitting in automated segmentation pipelines and its impact on generalization.
- A bi‑level formulation that treats detector outputs as hyper‑parameters, enabling meta‑learning of prompt generation.
- Efficient adaptation of SAM via Parameter‑Efficient Fine‑Tuning (LoRA), keeping the large encoder frozen while only a small fraction of parameters are trained.
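The LoRA-based adaptation in the third contribution can be illustrated with a minimal wrapper around a linear layer: the pretrained weight is frozen and only two low-rank factors are trained. This is a generic LoRA sketch, not SAM's actual decoder implementation; the rank and scaling are illustrative.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: the base weight stays frozen and only the
    low-rank factors A and B are trained. Initializing B to zero makes
    the wrapped layer start out identical to the pretrained layer."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the low-rank update x A^T B^T, scaled.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(16, 16), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
# Only the low-rank factors (128 of 400 parameters here) are trainable.
```

Applied only to SAM's mask decoder while the ViT encoder stays frozen, this keeps the trainable parameter count a small fraction of the full model.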
Extensive experiments were conducted on both general‑purpose datasets (COCO‑style images) and biomedical domains (CT, MRI, microscopy). BLO‑Inst consistently outperformed baselines such as standard joint training, USIS‑SAM, and RSPrompter, achieving higher average IoU, APᵐ, and Dice scores. Ablation studies demonstrated the importance of the data split, the λ weighting scheme, and the LoRA dimension, confirming that the performance gains stem from the bi‑level alignment rather than merely larger model capacity. Notably, when evaluated under domain shift (training on natural images, testing on medical images), BLO‑Inst showed minimal degradation, highlighting its improved generalization.
In summary, BLO‑Inst provides a principled solution to the mismatch between detection and segmentation objectives and introduces a meta‑learning strategy that mitigates overfitting. By coupling a high‑speed one‑stage detector with a frozen SAM encoder and a lightweight LoRA‑augmented decoder, the framework delivers both efficiency and state‑of‑the‑art segmentation quality, paving the way for fully automated, robust instance segmentation across diverse visual domains.