Merged ChemProt-DrugProt for Relation Extraction from Biomedical Literature

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

The extraction of chemical-gene relations plays a pivotal role in understanding the intricate interactions between chemical compounds and genes, with significant implications for drug discovery, disease understanding, and biomedical research. This paper presents a dataset created by merging the ChemProt and DrugProt corpora to increase sample counts and improve model accuracy. We evaluate the merged dataset using two state-of-the-art relation extraction approaches: Bidirectional Encoder Representations from Transformers (BERT), specifically BioBERT, and Graph Convolutional Networks (GCNs) combined with BioBERT. While BioBERT excels at capturing local context, it can benefit from the global information essential for understanding chemical-gene interactions; integrating GCNs with BioBERT harnesses both global and local context. Our results show that merging the ChemProt and DrugProt datasets yields significant improvements in model performance, particularly in CPR groups shared between the two datasets, and that incorporating global context via the GCN increases overall precision and recall in some CPR groups over using BioBERT alone.


💡 Research Summary

The paper addresses a critical bottleneck in biomedical relation extraction: the limited size and diversity of existing annotated corpora for chemical‑gene interactions. To overcome this, the authors merge two widely used benchmark datasets, ChemProt and DrugProt, into a single, larger resource. Both original datasets annotate chemical‑protein/gene relations in PubMed abstracts and share a common taxonomy of 22 fine‑grained relation types, which the authors map onto ten higher‑level ChemProt Relation (CPR) groups for easier analysis.

Merging is performed at the abstract level. Abstracts that appear in only one of the two sources are added directly. For abstracts present in both, entities are aligned (no textual conflicts were found) and relation conflicts are manually resolved, resulting in 63 conflicting relations in the training split and 7 in validation. The final merged training set contains 3,824 abstracts, 97,597 entities, and 20,401 relations; the validation set holds 1,184 abstracts, 29,763 entities, and 6,450 relations. To mitigate class imbalance, the authors augment the “No Relation” class (CPR‑10) by generating additional negative pairs from sentences lacking any annotated interaction. CPR groups 7 and 8 are excluded from experiments due to their extreme sparsity.
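The abstract-level merge and the negative-pair augmentation described above can be sketched as follows. The data structures and function names here are assumptions for illustration, not the authors' actual code; the paper's manual conflict resolution is represented simply by flagging conflicting relations for review.

```python
# Illustrative sketch of the abstract-level merge and CPR-10 augmentation.
# Corpus layout ({abstract_id: {"entities": ..., "relations": ...}}) is an assumption.
from itertools import product

def merge_corpora(corpus_a, corpus_b):
    """Merge two corpora keyed by abstract ID.

    Abstracts unique to one corpus are copied directly. For shared
    abstracts, entity sets are unioned and relations that disagree on
    the same entity pair are flagged for manual resolution.
    """
    merged, conflicts = {}, []
    for aid in corpus_a.keys() | corpus_b.keys():
        if aid not in corpus_b:
            merged[aid] = corpus_a[aid]
        elif aid not in corpus_a:
            merged[aid] = corpus_b[aid]
        else:
            a, b = corpus_a[aid], corpus_b[aid]
            rels = dict(a["relations"])
            for pair, label in b["relations"].items():
                if pair in rels and rels[pair] != label:
                    # Conflicting CPR labels: keep corpus A's label and
                    # record the conflict for manual review.
                    conflicts.append((aid, pair, rels[pair], label))
                else:
                    rels[pair] = label
            merged[aid] = {
                "entities": {**a["entities"], **b["entities"]},
                "relations": rels,
            }
    return merged, conflicts

def add_negative_pairs(doc):
    """Label every unannotated chemical-gene pair as CPR-10 (no relation)."""
    chems = [e for e, t in doc["entities"].items() if t == "CHEMICAL"]
    genes = [e for e, t in doc["entities"].items() if t == "GENE"]
    for pair in product(chems, genes):
        doc["relations"].setdefault(pair, "CPR-10")
    return doc
```

In this sketch the conflict count would correspond to the 63 training and 7 validation conflicts the authors resolved by hand.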

Two state‑of‑the‑art relation extraction approaches are evaluated on this new corpus. The first is a vanilla BioBERT model fine‑tuned for relation classification, which excels at capturing local sentence‑level context. The second combines BioBERT with a Graph Convolutional Network (GCN), adding global context to BioBERT's local representations.
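In the BioBERT setup, sentence-level relation extraction is typically framed as sequence classification over sentences whose candidate chemical and gene mentions are wrapped in marker tokens. A minimal sketch of that preprocessing step follows; the marker token names are assumptions, not necessarily those used by the authors.

```python
def mark_entities(text, chem_span, gene_span,
                  chem_tags=("[CHEM]", "[/CHEM]"),
                  gene_tags=("[GENE]", "[/GENE]")):
    """Wrap the chemical and gene mentions in marker tokens so a
    BERT-style classifier knows which candidate pair to label.

    Spans are (start, end) character offsets into `text`; the marker
    token names are illustrative.
    """
    # Insert markers from right to left so earlier offsets stay valid.
    spans = sorted([(chem_span, chem_tags), (gene_span, gene_tags)],
                   key=lambda s: s[0][0], reverse=True)
    for (start, end), (open_tag, close_tag) in spans:
        text = (text[:start] + open_tag + " " + text[start:end]
                + " " + close_tag + text[end:])
    return text
```

The marked sentence is then tokenized and fed to the classifier, which predicts one CPR group per chemical-gene pair.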

