Large-scale Benchmarks for Multimodal Recommendation with Ducho

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

The common multimodal recommendation pipeline involves (i) extracting multimodal features, (ii) refining their high-level representations to suit the recommendation task, (iii) optionally fusing all multimodal features, and (iv) predicting the user-item score. Although great effort has been put into designing optimal solutions for (ii-iv), to the best of our knowledge, very little attention has been devoted to exploring procedures for (i) in a rigorous way. In this respect, the existing literature outlines the large availability of multimodal datasets and the ever-growing number of large models accounting for multimodal-aware tasks, but (at the same time) an unjustified adoption of limited standardized solutions. As very recent works from the literature have begun to conduct empirical studies to assess the contribution of multimodality in recommendation, we decide to follow and complement this same research direction. To this end, this paper settles as the first attempt to offer a large-scale benchmarking for multimodal recommender systems, with a specific focus on multimodal extractors. Specifically, we take advantage of three popular and recent frameworks for multimodal feature extraction and reproducibility in recommendation, Ducho, and MMRec/Elliot, respectively, to offer a unified and ready-to-use experimental environment able to run extensive benchmarking analyses leveraging novel multimodal feature extractors. Results, largely validated under different extractors, hyper-parameters of the extractors, domains, and modalities, provide important insights on how to train and tune the next generation of multimodal recommendation algorithms.

💡 Research Summary

This paper addresses a largely overlooked component of multimodal recommender systems: the feature‑extraction stage (step (i) of the standard pipeline). While most prior work has focused on refining, fusing, and scoring multimodal representations, the authors argue that the quality of the extracted visual, textual, and audio embeddings fundamentally determines downstream performance. To systematically investigate this, they construct a large‑scale benchmarking environment that unifies three popular open‑source frameworks—Ducho (a dedicated multimodal feature‑extraction library), MMRec, and Elliot (both widely used for reproducible recommendation experiments). By integrating these tools into a single end‑to‑end pipeline, they overcome interoperability challenges such as mismatched data schemas, differing model APIs, and GPU memory constraints, and they release the full codebase for reproducibility.

The experimental protocol is extensive: eight publicly available multimodal datasets spanning fashion (Zappos50K, Polyvore, Taobao), music (Last.fm, MSD‑A, Lambda), recipes (Recipe1M+, FoodRec, Allrecipes), news (MIND, MM‑Rec), social media (Pinterest, Kwai, TikTok), and movies (Netflix Crawled). For each dataset the authors consider all available modalities—visual, textual, and, when present, audio. They evaluate eight feature‑extractors, ranging from classic domain‑specific CNNs and BERT‑style text encoders to large multimodal models such as CLIP, BLIP, and Flamingo that jointly process multiple modalities. Each extractor is tuned across a grid of hyper‑parameters (batch size, learning rate, layer freezing, etc.), yielding an average of ten configurations per extractor.

On the recommendation side, six classic collaborative‑filtering baselines (e.g., BPR, LightGCN) and nine state‑of‑the‑art multimodal recommenders (e.g., VisualBPR, MMGCN, Dual‑Stream, LMM4Rec) are trained using the same extracted embeddings. Standard ranking metrics (NDCG@10, Recall@10, HR@10) are reported.

Key findings include: (1) modern large‑scale multimodal models consistently outperform traditional, lightweight extractors, delivering 2–5 percentage‑point gains in NDCG across most domains; (2) careful hyper‑parameter tuning of the extractor (especially batch size and learning rate) provides an additional ~1.5 pp improvement, underscoring that extraction is not a “plug‑and‑play” component; (3) audio features, while generally contributing less than visual or textual cues, can still boost performance in music‑centric datasets when paired with dedicated audio encoders and appropriate fusion strategies; (4) compatibility between extractor output format and recommender input expectations matters—models that assume separate modality embeddings suffer when fed jointly‑encoded CLIP‑style vectors, highlighting the need for modality‑aware interface design; (5) computational trade‑offs are evident: large models achieve higher accuracy but demand larger batch sizes and more GPU memory, whereas lightweight extractors remain attractive for resource‑constrained settings.

Beyond performance, the paper documents the engineering effort required to harmonize Ducho, MMRec, and Elliot, providing modular wrappers, unified data schemas, and scripts for automated grid‑search. This contribution lowers the barrier for future researchers to conduct reproducible multimodal recommendation studies.

In conclusion, the study demonstrates that the feature‑extraction stage is a decisive factor for multimodal recommendation quality. It validates that leveraging recent large multimodal models, together with domain‑specific hyper‑parameter optimization, yields robust gains across diverse domains and modalities. The authors also outline future directions: end‑to‑end joint training of extractors and recommenders, incorporation of multimodal large language models for dynamic feature generation, and adaptive modality weighting based on user preferences. Their comprehensive benchmark and open‑source toolkit set a new standard for systematic evaluation in multimodal recommender research.

Large-scale Benchmarks for Multimodal Recommendation with Ducho

💡 Research Summary

Comments & Academic Discussion

Leave a Comment