Online Evaluations for Everyone: Mr. DLib's Living Lab for Scholarly Recommendations

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

We introduce the first "living lab" for scholarly recommender systems. This lab allows recommender-system researchers to conduct online evaluations of their novel algorithms for scholarly recommendations, i.e., recommendations for research papers, citations, conferences, research grants, etc. Recommendations are delivered through the living lab's API to platforms such as reference management software and digital libraries. The living lab is built on top of the recommender-system-as-a-service Mr. DLib. Current partners are the reference management software JabRef and the CORE research team. We present the architecture of Mr. DLib's living lab as well as usage statistics for the first sixteen months of operation. During this time, 1,826,643 recommendations were delivered with an average click-through rate of 0.21%.


💡 Research Summary

The paper presents the first "living lab" dedicated to scholarly recommender systems, built on top of the Mr. DLib recommendation-as-a-service platform. A living lab lets researchers evaluate new recommendation algorithms with real users in realistic settings, addressing the well-known limitation of offline evaluations, which often fail to predict real-world performance. The authors describe the architecture, operational workflow, usage statistics, and future directions of the service.

Two types of partners are supported. Platform partners (currently the reference-management tool JabRef and the open-access aggregator CORE) request related-article recommendations via a REST API. Research partners provide experimental recommender engines that can be plugged into the system. When a JabRef user selects a source article and clicks the "Related Articles" tab, the article title is sent to Mr. DLib. An internal A/B-testing engine randomly forwards the request either to Mr. DLib's own content-based recommender (using terms, key-phrases, embeddings, etc.) or to CORE's recommender. The chosen engine returns a list of related articles, which Mr. DLib forwards back to JabRef for display. User clicks are logged in real time and fed back to the researchers for algorithmic evaluation.
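The A/B routing step described above can be sketched as follows. This is a minimal illustration, not Mr. DLib's actual implementation: the engine names, their callable interface, and the returned record format are all assumptions made for the example.

```python
import random

# Hypothetical registry of recommender engines participating in the A/B test.
# In the living lab these would be Mr. DLib's internal content-based engine
# and CORE's engine; the names and signatures here are illustrative only.
ENGINES = {
    "mdl_content_based": lambda title: [f"related-to:{title} (MDL)"],
    "core": lambda title: [f"related-to:{title} (CORE)"],
}

def route_request(title: str, rng: random.Random = random.Random()) -> tuple[str, list[str]]:
    """Randomly forward a recommendation request to one of the registered
    engines and return the chosen engine's name with its result list."""
    engine_name = rng.choice(sorted(ENGINES))
    recommendations = ENGINES[engine_name](title)
    return engine_name, recommendations
```

Logging which engine produced each delivered recommendation, as in the returned tuple above, is what allows click-through rates to be attributed per engine later on.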

The service guarantees sub‑2‑second response times and prefers open‑access URLs for the recommended items. All recommendation logs are publicly released, enabling reproducibility.

During the first sixteen months of operation (starting June 2017), the lab delivered 1,826,643 recommendations, achieving an overall click-through rate (CTR) of 0.21%. In the beta phase (June–September 2017), about 4,200 requests per month were processed, with a CTR that fell from 0.76% to 0.34%. After the beta, monthly deliveries rose to 150,000–200,000 recommendations and the CTR stabilized around 0.18–0.22%. Notably, the CTRs of Mr. DLib's internal engine and CORE's engine were almost identical despite differences in implementation details, suggesting that the underlying Lucene-based retrieval backbone dominates performance. The CTR observed in JabRef (≈0.18%) mirrors that of the social-science repository Sowiport, indicating similar user behaviour across scholarly domains.
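For reference, click-through rate is simply clicks divided by delivered recommendations: at 0.21% of 1,826,643 deliveries, the overall figure corresponds to roughly 3,800 clicks. A minimal sketch of the calculation:

```python
def click_through_rate(clicks: int, deliveries: int) -> float:
    """Click-through rate as a percentage: 100 * clicks / deliveries."""
    return 100.0 * clicks / deliveries

# Roughly 3,836 clicks over 1,826,643 delivered recommendations
# reproduces the reported overall CTR of about 0.21%.
```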

Future work aims to expand both sides of the partnership network, introduce personalized recommendations, and broaden the scope to other scholarly items such as research grants and potential collaborators. The authors also plan to standardize protocols, automate partner onboarding, and explore meta‑learning techniques for automatically selecting the best algorithm per platform partner.

In sum, this work delivers a publicly accessible, real‑time evaluation infrastructure for scholarly recommendation research, bridging the gap between algorithm development and real‑world deployment, and laying the groundwork for more robust, reproducible, and user‑centric recommender system studies.

