Evaluation on Entity Matching in Recommender Systems

Entity matching is a crucial component of many recommender systems, including conversational recommender systems (CRS) and knowledge-based recommender systems. However, the lack of rigorous evaluation frameworks for cross-dataset entity matching impedes progress in areas such as LLM-driven conversational recommendation and knowledge-grounded dataset construction. In this paper, we introduce Reddit-Amazon-EM, a novel dataset comprising naturally occurring items from Reddit and the Amazon '23 dataset. Through careful manual annotation, we identify corresponding movies across Reddit-Movies and Amazon '23, two existing recommender-system datasets with inherently overlapping catalogs. Leveraging Reddit-Amazon-EM, we conduct a comprehensive evaluation of state-of-the-art entity matching methods, spanning rule-based, graph-based, lexical, embedding-based, and LLM-based approaches. For reproducible research, we release our manually annotated entity-matching gold set and provide the mapping between the two datasets produced by the best-performing method from our experiments, a valuable resource for future work on entity matching in recommender systems. Data and code are available at: https://github.com/huang-zihan/Reddit-Amazon-Entity-Matching.


💡 Research Summary

The paper addresses a critical gap in recommender‑system research: the lack of a rigorous, cross‑dataset evaluation framework for entity matching (EM), which is essential for both conversational recommender systems (CRS) and knowledge‑based recommendation pipelines. To fill this gap, the authors construct Reddit‑Amazon‑EM, a novel benchmark that aligns movie mentions extracted from Reddit conversations with canonical movie entries in the Amazon ‘23 dataset.

Data construction proceeds in two stages. First, about 1,000 frequent movie titles are harvested from a public Reddit dialogue corpus. For each title, the ten most relevant Amazon candidates are retrieved using a combination of minimum edit distance and dense embedding similarity (all‑MiniLM‑L6‑v2). Simple metadata filters (release year, TV‑season flags) prune obviously mismatched items. Second, a dedicated human‑annotation phase uses a Streamlit interface augmented with GPT‑3.5 suggestions. Annotators verify or reject each candidate, resulting in 869 Reddit titles with confirmed matches to 4,504 unique Amazon movies. The final annotated set contains 4,322 positive pairs and 42,748 negative pairs (a 1:10 positive‑to‑negative ratio), split into training (30,124), validation (7,532) and test (9,414) partitions. Each record stores Reddit ID, Amazon ID, the two textual titles, and a binary match label.
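The candidate-retrieval step above can be sketched in a few lines. This is an illustrative stand-in, not the paper's pipeline: `edit_similarity` uses `difflib` as a proxy for minimum edit distance, and `embed`/`cosine` use a toy bag-of-words vector in place of the all-MiniLM-L6-v2 dense embeddings; the 50/50 score blend and the example titles are assumptions.

```python
import difflib
from collections import Counter
from math import sqrt

def edit_similarity(a: str, b: str) -> float:
    """Normalized string similarity; a stand-in for minimum edit distance."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; the paper uses all-MiniLM-L6-v2 instead."""
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_candidates(reddit_title: str, amazon_titles: list[str], k: int = 10) -> list[str]:
    """Rank Amazon items by an even blend of lexical and 'embedding' similarity."""
    q = embed(reddit_title)
    scored = [(0.5 * edit_similarity(reddit_title, t) + 0.5 * cosine(q, embed(t)), t)
              for t in amazon_titles]
    return [t for _, t in sorted(scored, reverse=True)[:k]]
```

In the real pipeline, the top-ten candidates produced this way are then pruned with the metadata filters (release year, TV-season flags) before annotation.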

The benchmark is used to evaluate five families of EM methods:

  1. BM25 – a classic lexical retrieval model based on term frequency–inverse document frequency.
  2. Faiss – a dense‑vector nearest‑neighbor search engine using the same MiniLM embeddings as the candidate retrieval step.
  3. Embedding + Fuzzy – a hybrid that concatenates BERT embeddings with three fuzzy string similarity scores (Levenshtein ratio, Jaro‑Winkler, Jaccard token overlap).
  4. GNEM – Graph Neural Entity Matching, which builds a weighted record‑pair graph and applies a single‑layer gated graph convolution to propagate structural and semantic cues.
  5. ComEM – a recent LLM‑driven approach that first retrieves candidates (as Faiss does) and then uses a large language model (GPT‑3.5) to rank and select the best match.
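The BM25 baseline (method 1) scores each catalog title against a query by weighting term frequency with inverse document frequency and a length normalization. A minimal sketch of the standard Okapi BM25 formula follows; the `k1` and `b` values are conventional defaults, and none of this is the paper's own code:

```python
from collections import Counter
from math import log

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the classic Okapi BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency of each term, used in the IDF component.
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)  # smoothed IDF
            f = tf[term]
            # Saturating TF, normalized by document length relative to the average.
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores
```

Note how shorter documents containing the query term score higher, which matches BM25's length normalization and partly explains its behavior on terse movie titles.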

Performance is measured with Recall@k, Precision@k, F1, and overall accuracy. GNEM achieves the best results (F1 = 96.29 %, Accuracy = 96.74 %), substantially outperforming the traditional baselines (BM25 F1 = 78.43 %, Faiss F1 ≈ 60 % due to low precision). The superiority of GNEM stems from its ability to model fine‑grained metadata (release year, format tags) and to capture relational patterns across records, which are crucial for disambiguating titles like "Prisoners (Blu‑ray+DVD)" versus "Prisoners (2013)". ComEM follows closely (F1 = 94.02 %) and demonstrates that LLMs provide strong semantic understanding, yet it lags slightly in handling exact numeric identifiers compared with the graph model's relational reasoning. The Embedding + Fuzzy hybrid reaches a respectable F1 of 86.68 %, confirming that neural embeddings and symbolic fuzzy metrics complement each other.
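The pairwise precision, recall, and F1 figures above can be reproduced from binary match predictions in a few lines. This is a generic sketch of the standard definitions, not the paper's evaluation script:

```python
def precision_recall_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 over binary match labels (1 = match)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

With the benchmark's 1:10 positive-to-negative ratio, F1 over the positive class is far more informative than raw accuracy, which a trivial reject-everything classifier would already push above 90 %.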

Computational efficiency is also examined. The CPU‑bound BM25 and Faiss baselines have negligible initialization time but suffer from prohibitive inference latency (8–10 hours over the test set). GNEM requires GPU training (≈ 423 seconds per epoch over ten epochs) but then runs inference on the test split in about 60 seconds, making it fast once the upfront cost is paid. The Embedding + Fuzzy pipeline is the most lightweight (≈ 30 seconds per epoch, ≈ 10 seconds inference). ComEM needs no separate training; its inference costs ≈ 70 seconds on the same hardware.

To illustrate downstream impact, the authors conduct a case study on LLM‑based conversational recommendation. Four LLMs (GPT‑4, GPT‑3.5‑turbo, Qwen‑3‑4b, and an additional proprietary model) generate recommendation dialogues. The EM methods are then used to map the LLM‑mentioned movies to Amazon entries, and Recall@1/5 is reported. GNEM consistently yields the highest recall, confirming that superior entity matching directly translates into more accurate, user‑satisfying recommendations in a CRS setting.
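The Recall@k reported in this case study reduces to checking whether the gold Amazon item appears among a method's top-k candidates, averaged over queries. A generic sketch (the function name and example data are illustrative, not from the paper):

```python
def mean_recall_at_k(rankings: list[list[str]], golds: list[str], k: int) -> float:
    """Fraction of queries whose gold item appears in the top-k ranked candidates."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(rankings, golds))
    return hits / len(golds)
```

Because each Reddit mention has a single gold Amazon entry, Recall@1 here coincides with top-1 accuracy, and Recall@5 measures how often the correct item survives into a five-item shortlist.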

In summary, the paper makes three major contributions: (1) the release of Reddit‑Amazon‑EM, the largest publicly available cross‑platform movie matching benchmark with high‑quality human annotations; (2) a comprehensive empirical comparison showing that graph‑neural approaches currently dominate EM performance, while hybrid and LLM‑based methods are competitive but have distinct trade‑offs; and (3) empirical evidence that EM quality materially affects the effectiveness of LLM‑driven conversational recommenders. The authors release both the gold‑standard matching set and the code for reproducing all experiments, providing a solid foundation for future research on scalable, knowledge‑grounded recommendation systems. Potential next steps include extending the benchmark to other domains (e.g., music, books), incorporating multimodal signals (cover images, audio snippets), and exploring hybrid architectures that fuse graph reasoning with large‑language‑model semantics.

