TinyVLM: Zero-Shot Object Detection on Microcontrollers via Vision-Language Distillation with Matryoshka Embeddings
Zero-shot object detection enables recognizing novel objects without task-specific training, but current approaches rely on large vision-language models (VLMs) like CLIP that require hundreds of megabytes of memory, far exceeding the constraints of microcontroller units (MCUs). We present TinyVLM, the first framework enabling zero-shot object detection on resource-constrained MCUs with less than 1 MB of memory. Our approach introduces three key innovations: (1) a decoupled architecture that separates visual inference from text encoding, allowing precomputed class embeddings to be stored in flash memory; (2) Matryoshka distillation that trains nested embeddings at multiple dimensions (16-256), enabling flexible accuracy-memory trade-offs; and (3) quantized embedding storage that reduces class-prototype memory by 4x with minimal accuracy loss. Trained on Conceptual Captions 3M (CC3M), TinyVLM achieves competitive zero-shot accuracy on COCO, Flowers102, and Food101 while requiring only 285 KB of RAM and 892 KB of flash memory for the deployed vision encoder. We demonstrate real-time inference at 26 FPS on an STM32H7 and over 1,000 FPS on a MAX78000 with its CNN accelerator, enabling practical zero-shot detection on edge devices for the first time.
💡 Research Summary
TinyVLM addresses the long‑standing gap between the powerful zero‑shot capabilities of large vision‑language models (VLMs) such as CLIP and the severe resource constraints of microcontroller units (MCUs). The authors propose a three‑pronged solution that reduces the on‑device memory footprint to under 1 MB while preserving competitive zero‑shot detection performance on standard benchmarks.
1. Decoupled architecture. In a closed‑set zero‑shot scenario the set of target classes is known ahead of time, allowing all text embeddings to be pre‑computed offline and stored in flash memory. Consequently, only a lightweight visual encoder runs on the MCU at inference time. The visual backbone is a MobileNetV2‑based network with a width multiplier of 0.35, followed by global average pooling and a linear projection to a configurable embedding dimension. After INT8 quantization the visual encoder occupies 892 KB of flash and requires at most 285 KB of SRAM for activations.
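The decoupled flow can be sketched in a few lines of NumPy. Here `text_encoder` stands in for the CLIP teacher's text tower (run offline) and `visual_encoder` for the on-device student; the prompt template and function names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def build_prototypes(text_encoder, class_names, dim=128):
    """Offline: encode each class prompt once, truncate to the deployment
    dimension, L2-normalize, and store the result in flash.
    The prompt template is an illustrative assumption."""
    prompts = [f"a photo of a {name}" for name in class_names]
    protos = np.stack([np.asarray(text_encoder(p), dtype=np.float32)[:dim]
                       for p in prompts])                      # (C, dim)
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    return protos

def classify(visual_encoder, image, prototypes):
    """On-device: one forward pass of the visual encoder, then a
    cosine-similarity argmax over the precomputed prototypes."""
    z = np.asarray(visual_encoder(image), dtype=np.float32)[:prototypes.shape[1]]
    z /= np.linalg.norm(z)
    return int(np.argmax(prototypes @ z))
```

Because the prototypes are fixed constants, nothing text-related ever touches SRAM at inference time; the MCU only runs the visual forward pass and a dot-product lookup.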
2. Matryoshka distillation. To accommodate a wide range of MCU memory budgets with a single model, the authors train nested (prefix) embeddings of dimensions 16, 32, 64, 128, and 256. The student network is distilled from a CLIP ViT‑B/32 teacher (512‑dim embeddings) using a combination of contrastive InfoNCE loss, mean‑squared error on projected embeddings, and a novel “Matryoshka loss” that applies the contrastive objective to each prefix simultaneously. This forces the early dimensions to capture the most salient information, enabling graceful degradation when the embedding is truncated for deployment.
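A minimal NumPy sketch of the Matryoshka objective: a symmetric InfoNCE loss applied to every nested prefix with equal weights w_d = 1/|D|, as the training setup describes. The teacher MSE term is omitted here, and the exact loss composition in the paper may differ.

```python
import numpy as np

def _log_softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def info_nce(img, txt, tau=0.07):
    # Symmetric contrastive loss: matched image/text pairs sit on the diagonal.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau
    idx = np.arange(len(img))
    return -0.5 * (_log_softmax(logits)[idx, idx].mean()
                   + _log_softmax(logits.T)[idx, idx].mean())

def matryoshka_loss(img, txt, dims=(16, 32, 64, 128, 256), tau=0.07):
    # Contrastive objective on every prefix, equal weights w_d = 1/|D|,
    # so the earliest dimensions must carry the most discriminative signal.
    return sum(info_nce(img[:, :d], txt[:, :d], tau) for d in dims) / len(dims)
```

Since every prefix is penalized simultaneously, truncating the deployed embedding to 64 or even 16 dimensions degrades gracefully rather than catastrophically.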
3. Quantized embedding storage. Pre‑computed text prototypes are quantized from 32‑bit floating point to 8‑bit signed integers using symmetric per‑channel scaling. This reduces the storage requirement for class prototypes by a factor of four with less than 1 % drop in zero‑shot accuracy, as demonstrated on COCO, Flowers102, and Food101.
Training details. The model is trained on the Conceptual Captions 3M (CC3M) dataset for 100 epochs with Adam (lr = 1e‑3), cosine learning‑rate decay, and a batch size of 256 (via gradient accumulation). Temperature τ = 0.07 is used for the contrastive loss. All Matryoshka dimensions receive equal weight (w_d = 1/|D|).
Deployment and hardware results. On an STM32H7 (Cortex‑M7, 480 MHz, 2 MB flash, 1 MB SRAM) the full inference pipeline (image resize to 128 × 128, visual encoding, embedding truncation, cosine similarity with stored prototypes, and argmax) takes 38 ms (≈26 FPS). On the MAX78000, which includes a dedicated CNN accelerator, the same pipeline exceeds 1,000 FPS. Memory analysis shows 892 KB flash for the visual encoder, 285 KB peak SRAM for activations, and prototype storage well below the available flash (e.g., 80 COCO classes at 128 dimensions require only ~10 KB).
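The ~10 KB prototype figure checks out with back-of-envelope arithmetic; the extra float32 scale per class (for INT8 dequantization) is an assumption layered on top of the stated numbers.

```python
n_classes, dim = 80, 128          # COCO classes at the 128-dim operating point
proto_bytes = n_classes * dim     # 1 byte per INT8 value
scale_bytes = n_classes * 4       # assumed: one float32 scale per class
total_kb = (proto_bytes + scale_bytes) / 1024
print(f"{total_kb:.1f} KB")       # prints "10.3 KB", consistent with ~10 KB
```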
Accuracy trade‑offs. The 256‑dim model achieves zero‑shot classification scores close to CLIP ViT‑B/32 (e.g., 56.4 % on COCO, 83.7 % on Flowers102). Reducing to 64 dimensions retains about 82 % of the 256‑dim accuracy while cutting embedding storage and compute fourfold. Even the 16‑dim variant remains functional, demonstrating the flexibility of the Matryoshka approach.
Broader impact. By demonstrating that zero‑shot object detection can run on devices with sub‑megabyte memory, TinyVLM opens the door to truly intelligent edge sensors, low‑power robots, and IoT nodes that can recognize novel objects without any on‑device training. The paper also provides a benchmark suite across four MCU platforms (STM32H7, MAX78000, GAP9, ESP32‑S3), establishing a baseline for future research in ultra‑efficient VLM deployment. Potential extensions include dynamic class addition, on‑device fine‑tuning, and co‑design of hardware accelerators tailored to Matryoshka embeddings.
In summary, TinyVLM combines architectural decoupling, multi‑dimensional Matryoshka distillation, and aggressive quantization to deliver a practical, MCU‑compatible zero‑shot object detector, achieving real‑time performance and competitive accuracy while staying within a 1 MB memory envelope.