Heterogeneity in Entity Matching: A Survey and Experimental Analysis


Entity matching (EM) is a fundamental task in data integration and analytics, essential for identifying records that refer to the same real-world entity across diverse sources. In practice, datasets often differ widely in structure, format, schema, and semantics, creating substantial challenges for EM. We refer to this setting as Heterogeneous EM (HEM). This survey offers a unified perspective on HEM by introducing a taxonomy, grounded in prior work, that distinguishes two primary categories – representation and semantic heterogeneity – and their subtypes. The taxonomy provides a systematic lens for understanding how variations in data form and meaning shape the complexity of matching tasks. We then connect this framework to the FAIR principles – Findability, Accessibility, Interoperability, and Reusability – demonstrating how they both reveal the challenges of HEM and suggest strategies for mitigating them. Building on this foundation, we critically review recent EM methods, examining their ability to address different heterogeneity types, and conduct targeted experiments on state-of-the-art models to evaluate their robustness and adaptability under semantic heterogeneity. Our analysis uncovers persistent limitations in current approaches and points to promising directions for future research, including multimodal matching, human-in-the-loop workflows, deeper integration with large language models and knowledge graphs, and fairness-aware evaluation in heterogeneous settings.


💡 Research Summary

The paper tackles the pervasive problem of heterogeneity in entity matching (EM), a core task for data integration, cleaning, and analytics. The authors first formalize EM as a binary matching function that decides whether two mentions from possibly different datasets refer to the same real‑world entity. They then define “heterogeneous EM” (HEM) as the scenario where the datasets differ in representation (format, schema, modality) or in meaning (terminology, language, granularity, data quality).
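
The binary matching formulation can be sketched as follows. This is a minimal illustration, not the paper's actual formalization: the `Mention` type, the bag-of-words tokenization, and the Jaccard threshold are all assumptions standing in for whatever scorer a real EM system uses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    """An entity mention: (attribute, value) pairs from some dataset.
    Stored as a generic tuple because the two datasets may use different schemas."""
    attrs: tuple

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def match(m1: Mention, m2: Mention, threshold: float = 0.5) -> bool:
    """Binary matching function M(m1, m2) -> {match, non-match}.
    Here: token-set Jaccard over all attribute values, a stand-in for a
    learned or rule-based scorer."""
    t1 = {w.lower() for _, v in m1.attrs for w in str(v).split()}
    t2 = {w.lower() for _, v in m2.attrs for w in str(v).split()}
    return jaccard(t1, t2) >= threshold
```

Note that the function deliberately ignores attribute names: under HEM the two schemas may not align, so even this toy matcher cannot assume a shared schema.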

A central contribution is a two‑level taxonomy that separates heterogeneity into representation heterogeneity and semantic heterogeneity, each further divided into concrete sub‑types. Representation heterogeneity includes (1) multimodality (text, images, video), (2) format differences (JSON, XML, CSV, JPEG/PNG, etc.), and (3) structural/schema mismatches (attribute naming, nesting, normalization). Semantic heterogeneity covers (1) terminology and language (synonyms, abbreviations, multilingual vocabularies), (2) contextual variability (domain‑specific meanings), (3) granularity and resolution (country vs. state vs. city), and (4) temporal and data‑quality issues (missing values, errors).
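
One way to encode the two-level taxonomy programmatically is as a pair of enumerations; the identifier names below are paraphrases of the sub-types listed above, not labels taken from the paper.

```python
from enum import Enum

class RepresentationHeterogeneity(Enum):
    """Differences in how entities are encoded."""
    MULTIMODALITY = "text vs. images vs. video"
    FORMAT = "JSON vs. XML vs. CSV vs. JPEG/PNG"
    STRUCTURE = "attribute naming, nesting, normalization"

class SemanticHeterogeneity(Enum):
    """Differences in what the encoded values mean."""
    TERMINOLOGY = "synonyms, abbreviations, multilingual vocabularies"
    CONTEXT = "domain-specific meanings"
    GRANULARITY = "country vs. state vs. city"
    QUALITY = "temporal drift, missing values, errors"
```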

The authors connect this taxonomy to the FAIR data principles (Findability, Accessibility, Interoperability, Reusability). They argue that each heterogeneity type directly undermines one or more FAIR aspects—for example, schema mismatches hurt Interoperability, while language variation reduces Findability. Conversely, EM systems explicitly designed to handle heterogeneity can become enablers of FAIR compliance by automating schema alignment, terminology mapping, and quality checks.

A systematic survey of recent EM approaches follows. Rule‑based and classical statistical methods mainly address schema alignment and duplicate detection but lack robustness to semantic variation. Neural methods, especially Transformer‑based models (BERT, RoBERTa, etc.), show resilience to lexical and contextual changes but still require substantial fine‑tuning for format or structural differences. Graph‑based models (GNNs, Graph Convolutional Networks) excel at capturing structural relationships across schemas but struggle with language‑level variation. The paper also highlights emerging work that leverages large language models (LLMs) and generative AI for zero‑ or few‑shot entity resolution, noting their potential to bridge semantic gaps without extensive feature engineering.

The experimental section evaluates four state‑of‑the‑art EM models (a Siamese Transformer, a BERT‑based classifier, a Graph Neural Network, and a multimodal transformer) under controlled semantic perturbations. The perturbations include synonym substitution, automatic translation between English and Spanish, granularity shifts (e.g., “USA” → “California”), and the injection of missing or noisy values. Results reveal that:

  • Transformer‑based models degrade modestly (≈ 5 % or less) under synonym or language changes, confirming their lexical robustness.
  • Graph‑based models maintain performance when schema structures are altered but suffer larger drops when terminology changes.
  • All models experience significant performance loss (10‑20 %) when granularity or data‑quality perturbations are introduced, indicating a shared weakness in handling fine‑grained or noisy semantics.
  • Existing public benchmarks (e.g., Abt‑Buy, DBLP‑Scholar) rarely contain such heterogeneity, leading to over‑optimistic estimates of real‑world robustness.
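
The perturbations above can be sketched as simple record transformations. The lookup tables and drop probability below are illustrative placeholders, not the paper's actual protocol (which also included automatic English-Spanish translation).

```python
import random

# Illustrative lookup tables -- a real study would use WordNet-style
# synonym sets and a geographic gazetteer for granularity shifts.
SYNONYMS = {"laptop": "notebook", "phone": "handset"}
GRANULARITY = {"USA": "California"}  # coarse location -> finer one

def perturb(record: dict, kind: str, drop_prob: float = 0.3,
            seed: int = 0) -> dict:
    """Return a semantically perturbed copy of `record`."""
    out = dict(record)
    if kind == "synonym":
        for k, v in out.items():
            out[k] = " ".join(SYNONYMS.get(w, w) for w in str(v).split())
    elif kind == "granularity":
        for k, v in out.items():
            out[k] = GRANULARITY.get(v, v)
    elif kind == "missing":
        rng = random.Random(seed)  # seeded so runs are repeatable
        for k in out:
            if rng.random() < drop_prob:
                out[k] = None
    else:
        raise ValueError(f"unknown perturbation kind: {kind}")
    return out
```

Evaluating a matcher on `(record, perturb(record, kind))` pairs then measures robustness to each heterogeneity type in isolation, which is the spirit of the controlled setup described above.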

From these findings the authors draw several research directions:

  1. Multimodal Matching – develop joint embedding spaces and cross‑modal attention mechanisms that can align textual, visual, and possibly audio representations of the same entity.
  2. Human‑in‑the‑Loop (HITL) Workflows – incorporate active learning and expert feedback to resolve ambiguous matches and to calibrate model uncertainty.
  3. LLM‑and‑Knowledge‑Graph Fusion – use pretrained large language models to generate semantic mappings and enrich them with structured knowledge graphs, thereby reducing reliance on domain‑specific training data.
  4. Fairness‑Aware Evaluation – design metrics that capture bias across languages, cultures, and domains, ensuring that heterogeneous EM systems do not systematically disadvantage particular groups.

In conclusion, the paper establishes heterogeneity as a first‑class concern in entity matching research. By providing a clear taxonomy, linking it to FAIR principles, surveying current methods, and empirically exposing their limits, the work offers a roadmap for building more robust, generalizable, and trustworthy EM systems capable of operating in today’s highly diverse data ecosystems.

