Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores
Multimodal Large Language Models (MLLMs) have recently achieved substantial progress in general-purpose perception and reasoning. Nevertheless, their deployment in Food-Service and Retail Stores (FSRS) scenarios encounters two major obstacles: (i) real-world FSRS data, collected from heterogeneous acquisition devices, are highly noisy and lack auditable, closed-loop data curation, which impedes the construction of high-quality, controllable, and reproducible training corpora; and (ii) existing evaluation protocols do not offer a unified, fine-grained and standardized benchmark spanning single-image, multi-image, and video inputs, making it challenging to objectively gauge model robustness. To address these challenges, we first develop Ostrakon-VL, an FSRS-oriented MLLM based on Qwen3-VL-8B. Second, we introduce ShopBench, the first public benchmark for FSRS. Third, we propose QUAD (Quality-aware Unbiased Automated Data-curation), a multi-stage multimodal instruction data curation pipeline. Leveraging a multi-stage training strategy, Ostrakon-VL achieves an average score of 60.1 on ShopBench, establishing a new state of the art among open-source MLLMs with comparable parameter scales and diverse architectures. Notably, it surpasses the substantially larger Qwen3-VL-235B-A22B (59.4) by +0.7, and exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8, demonstrating significantly improved parameter efficiency. These results indicate that Ostrakon-VL delivers more robust and reliable FSRS-centric perception and decision-making capabilities. To facilitate reproducible research, we will publicly release Ostrakon-VL and the ShopBench benchmark.
💡 Research Summary
The paper addresses the gap between general‑purpose multimodal large language models (MLLMs) and the demanding conditions of Food‑Service and Retail Stores (FSRS). It identifies three core obstacles:
- Capability‑level misalignment: off‑the‑shelf MLLMs lack the visual semantics needed for FSRS tasks, such as distinguishing operational signage from decorative elements and handling glare, low‑resolution multilingual text, motion blur, and occlusions.
- Data‑level noise and heterogeneity: real‑world FSRS data come from surveillance cameras, mobile devices, and regulatory inspections, resulting in severe visual corruption, inconsistent metadata, and temporal drift.
- Evaluation‑level misalignment: existing benchmarks focus on single‑image or text‑centric tasks and do not measure robustness to domain‑specific noise, fine‑grained evidence extraction, multi‑evidence composition, or rule‑based decision consistency across images and videos.
To overcome these challenges, the authors propose an integrated framework consisting of three contributions:
- Ostrakon‑VL – a domain‑expert MLLM built on Qwen3‑VL‑8B. The model is trained with a multi‑stage strategy: (a) domain knowledge injection via caption bootstrapping, (b) offline curriculum learning that gradually increases data difficulty, and (c) Mixed Preference Optimization to balance output stability and robustness. Despite its modest 8B parameter count, Ostrakon‑VL achieves an average score of 60.1 on the new benchmark, surpassing the much larger Qwen3‑VL‑235B‑A22B (59.4) and the same‑size baseline Qwen3‑VL‑8B (55.3), demonstrating superior parameter efficiency.
- ShopBench – the first public FSRS benchmark. It spans three input modalities (single‑image, multi‑image, video) and three sub‑domains (ShopFront, ShopInterior, Kitchen). The benchmark evaluates (i) robustness to acquisition noise, (ii) fine‑grained evidence extraction from cluttered scenes, (iii) multi‑evidence composition, and (iv) decision consistency under explicit operational rules. By providing a unified, fine‑grained metric suite, ShopBench enables reproducible comparisons and systematic failure‑mode analysis for FSRS‑oriented models.
- QUAD (Quality‑aware Unbiased Automated Data‑curation) – a four‑stage pipeline that transforms a raw candidate pool of 69.25M multimodal instruction triples into a high‑signal corpus of 3.40M (≈5%). The stages are:
  - Quality Filtering: a reward model (Skywork‑VL‑Reward) scores each (image, question, answer) triple on visual‑textual relevance, informativeness, linguistic quality, and credibility; samples below a threshold are discarded.
  - Vision‑Ablated Check: the same question is answered without visual input; the margin between the full‑vision score and the text‑only score isolates the visual contribution, filtering out answers that rely mainly on language priors.
  - Foundation‑Model‑Referenced Filtering: a strong foundation model generates a reference answer; the reward gap between the candidate answer and the reference quantifies learning potential. Low‑gap samples (already mastered by the foundation model) are removed, ensuring the remaining data provide meaningful gradient signals.
  - Multimodal Semantic Deduplication: joint vision‑language embeddings (e.g., GME‑Qwen2VL‑2B) detect near‑duplicate triples, which are pruned to increase information density and prevent over‑fitting to redundant patterns.
  - Capability Coverage Redistribution (implicit in the pipeline): rebalances the dataset to maintain coverage across tasks, sub‑domains, and difficulty levels.
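The three filtering stages reduce to simple threshold tests once the reward scores are in hand. The sketch below is a minimal, self-contained illustration: the `reward_*` fields stand in for outputs of the reward model (Skywork‑VL‑Reward) and the reference foundation model, and all threshold values are illustrative assumptions, not figures from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Triple:
    qid: str
    reward_full: float       # reward-model score with the image attached
    reward_text_only: float  # same question scored without visual input
    reward_reference: float  # score of the foundation model's reference answer

def quad_filter(pool: List[Triple],
                quality_thr: float = 0.5,
                vision_margin_thr: float = 0.1,
                learning_gap_thr: float = 0.05) -> List[Triple]:
    """Apply QUAD's first three stages as cascaded threshold tests.
    Thresholds here are made-up values for illustration."""
    kept = []
    for t in pool:
        # Stage 1: quality filtering — discard low-scoring samples outright.
        if t.reward_full < quality_thr:
            continue
        # Stage 2: vision-ablated check — keep only samples whose answer
        # genuinely depends on the image rather than language priors.
        if t.reward_full - t.reward_text_only < vision_margin_thr:
            continue
        # Stage 3: foundation-model-referenced filtering — drop samples the
        # reference model already answers as well (low learning potential).
        if t.reward_full - t.reward_reference < learning_gap_thr:
            continue
        kept.append(t)
    return kept
```

In a real pipeline each field would be populated by a batched inference pass over the candidate pool; the cascade order matters only for cost (cheap checks first), not for the final set.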
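Semantic deduplication can likewise be sketched as a greedy similarity prune over embedding vectors. The vectors below are placeholders for joint vision‑language embeddings such as those from GME‑Qwen2VL‑2B, and the 0.95 similarity threshold is an assumed value, not one reported in the paper.

```python
import math
from typing import List

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dedup(embeddings: List[List[float]], sim_thr: float = 0.95) -> List[int]:
    """Greedy pruning: keep a sample only if it is not too similar to any
    already-kept sample. Returns indices of retained samples."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < sim_thr for j in kept):
            kept.append(i)
    return kept
```

At the corpus scales quoted (tens of millions of triples), the pairwise loop would be replaced by an approximate nearest-neighbor index, but the keep/prune decision is the same.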
Experiments show that training Ostrakon‑VL on the QUAD‑curated corpus yields consistent gains across all ShopBench sub‑tasks, with improvements ranging from 3.2% to 5.6% over models trained on unfiltered data. Ablation studies confirm that each QUAD stage contributes positively, especially the vision‑ablated margin and foundation‑model filtering, which together remove over‑reliant language‑only samples and low‑utility easy cases.
The paper also discusses limitations. QUAD relies on proprietary reward and foundation models, which may hinder fully open‑source reproducibility. ShopBench, while comprehensive for front‑of‑house, interior, and kitchen scenarios, does not yet cover logistics, delivery, or back‑of‑house automation, leaving room for future expansion. Finally, processing multi‑image and video inputs at scale incurs significant computational cost, suggesting the need for efficiency‑focused research before real‑time deployment.
In summary, the work delivers a complete pipeline—from data synthesis and rigorous curation (QUAD) to a standardized evaluation suite (ShopBench) and a domain‑specialized MLLM (Ostrakon‑VL). By demonstrating that a modest‑size model can outperform larger general‑purpose counterparts when equipped with high‑quality, domain‑aligned data and a tailored training regimen, the paper sets a new baseline for FSRS‑centric multimodal AI and provides valuable resources for the research community.