miRNA and Gene Expression based Cancer Classification using Self-Learning and Co-Training Approaches

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

miRNA and gene expression profiles have proved useful for classifying cancer samples, and efficient classifiers have recently been sought and developed. A number of attempts to classify cancer samples using miRNA/gene expression profiles are known in the literature. Semi-supervised learning models have recently been used in bioinformatics to exploit the huge corpora of publicly available sets. However, using both labeled and unlabeled sets to train sample classifiers has not previously been considered for gene and miRNA expression sets. Moreover, there is a motivation to integrate both miRNA and gene expression for semi-supervised cancer classification, as that provides more information on the characteristics of cancer samples. In this paper, two semi-supervised machine learning approaches, namely self-learning and co-training, are adapted to enhance the quality of cancer sample classification. These approaches exploit the huge public corpora to enrich the training data. In self-learning, miRNA-based and gene-based classifiers are enhanced independently, while in co-training both miRNA and gene expression profiles are used simultaneously to provide different views of cancer samples. To our knowledge, this is the first attempt to apply these learning approaches to cancer classification. The approaches were evaluated using breast cancer, hepatocellular carcinoma (HCC) and lung cancer expression sets. Results show up to 20% improvement in F1-measure over Random Forest and SVM classifiers. Co-training also outperforms the Low Density Separation (LDS) approach by around 25% in F1-measure on breast cancer.


💡 Research Summary

The paper introduces two semi‑supervised learning frameworks—self‑learning and co‑training—to improve cancer subtype classification using miRNA and gene expression profiles. While numerous studies have demonstrated that miRNA or gene expression data can discriminate cancerous from normal tissues and even distinguish cancer subtypes, most of them rely solely on supervised learning with limited labeled samples. Public repositories, however, contain vast numbers of unlabeled expression profiles that remain largely untapped. This work leverages those resources by combining a small labeled set (L) with a much larger unlabeled set (U) (|U| ≫ |L|) to enrich the training data.

Self‑learning starts by training an initial classifier (Random Forest or SVM) on L. The classifier then predicts labels for U; samples whose prediction confidence exceeds a predefined threshold α are added to L with their predicted labels. The classifier is retrained on the expanded set, and the process repeats iteratively. The threshold α controls the trade‑off between precision (high α) and recall (low α) of the added samples.
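
The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nearest-centroid learner and its softmax-over-distances confidence are stand-ins chosen to keep the example dependency-free (the paper uses Random Forest or SVM), and the function names, the default α of 0.9, the iteration cap, and the toy samples are all hypothetical.

```python
import math

def fit_centroids(X, y):
    """Stand-in base learner (NOT the paper's RF/SVM): one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def predict_with_confidence(centroids, row):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    weights = {c: math.exp(-math.dist(row, mu)) for c, mu in centroids.items()}
    label = max(weights, key=weights.get)
    return label, weights[label] / sum(weights.values())

def self_learning(L_X, L_y, U_X, alpha=0.9, max_iter=10):
    """Grow the labeled set with unlabeled samples predicted above confidence alpha."""
    L_X, L_y, U_X = list(L_X), list(L_y), list(U_X)
    model = fit_centroids(L_X, L_y)
    for _ in range(max_iter):
        remaining, added = [], 0
        for row in U_X:
            label, conf = predict_with_confidence(model, row)
            if conf > alpha:
                # Confident prediction: move the sample into L with its predicted label.
                L_X.append(row)
                L_y.append(label)
                added += 1
            else:
                remaining.append(row)
        if added == 0:
            break  # nothing new was learned, so further rounds cannot change the model
        U_X = remaining
        model = fit_centroids(L_X, L_y)  # retrain on the expanded labeled set
    return model, L_X, L_y, U_X

# Toy two-feature samples (made up for illustration, not expression data).
labeled_X = [[0.0, 0.0], [10.0, 10.0]]
labeled_y = ["normal", "tumor"]
unlabeled_X = [[0.5, 0.2], [9.5, 10.1], [5.0, 5.0]]
model, new_X, new_y, still_unlabeled = self_learning(labeled_X, labeled_y, unlabeled_X)
```

In this toy run the two samples near a class centroid are absorbed into the labeled set, while the ambiguous midpoint sample stays unlabeled, which is the behavior a high α is meant to produce.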

Co‑training treats miRNA and gene expression as two distinct “views” of the same biological system. Two separate classifiers are built, one on miRNA data (L_miRNA) and one on gene data (L_gene). Each classifier labels its unlabeled view (U_miRNA, U_gene) and selects high‑confidence samples (U′_miRNA, U′_gene). To let the classifiers train each other, a mapping between miRNA and gene expression is required. The authors use the miRanda database to obtain miRNA‑target gene relationships, which are many‑to‑many. They simplify this mapping by aggregating expression values: the expression vector of a miRNA is the mean of its target genes’ expression vectors, and conversely the expression vector of a gene is the mean of the expression vectors of all miRNAs that target it. The mapped high‑confidence samples are then added to the opposite view’s labeled set, and both classifiers are retrained. This cycle is repeated, allowing the two views to mutually reinforce each other.
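
The mean-aggregation mapping between the two views might look roughly like the following Python sketch. The helper names `mean_map` and `invert`, the miRNA and gene identifiers, and the expression values are all hypothetical; the table here is a hand-written stand-in, not data taken from miRanda.

```python
def mean_map(expr, relation):
    """Project one sample from one view into the other by mean aggregation.

    expr:     {feature_name: expression_value} for a single sample
    relation: {target_feature: [source_features]} linking the two views
    """
    mapped = {}
    for target, sources in relation.items():
        values = [expr[s] for s in sources if s in expr]
        if values:  # skip features with no measured partner in the other view
            mapped[target] = sum(values) / len(values)
    return mapped

def invert(relation):
    """Turn {miRNA: [genes]} into {gene: [miRNAs]} (the relation is many-to-many)."""
    inverted = {}
    for key, members in relation.items():
        for member in members:
            inverted.setdefault(member, []).append(key)
    return inverted

# Hypothetical miRNA -> target-gene table (made-up identifiers).
mirna_targets = {"miR-A": ["GENE1", "GENE2"], "miR-B": ["GENE2", "GENE3"]}
gene_regulators = invert(mirna_targets)

# One made-up sample per view.
gene_sample = {"GENE1": 2.0, "GENE2": 4.0, "GENE3": 6.0}
mirna_sample = {"miR-A": 1.0, "miR-B": 3.0}

as_mirna_view = mean_map(gene_sample, mirna_targets)    # gene sample seen as miRNA features
as_gene_view = mean_map(mirna_sample, gene_regulators)  # miRNA sample seen as gene features
```

A high-confidence sample selected by the gene classifier would be passed through `mean_map` with the miRNA-to-gene table before joining the miRNA labeled set, and the reverse direction uses the inverted table.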

The methods were evaluated on three cancer types: breast cancer, hepatocellular carcinoma (HCC), and lung cancer. miRNA and gene expression datasets were downloaded from NCBI GEO; unlabeled datasets were constructed from additional public samples. Random Forests (10 trees) and SVMs served as base learners. For comparison, the Low‑Density Separation (LDS) semi‑supervised method, previously applied to cancer recurrence prediction, was also implemented.

Results show substantial gains over purely supervised baselines. Self‑learning improves F1‑score by up to 20 % for breast cancer, by about 10 % in precision for metastatic HCC, and by roughly 3 % for squamous lung cancer. Co‑training yields even larger improvements, especially in breast cancer where it outperforms LDS by approximately 25 % in F1‑score; similar gains (≈12 % for HCC and ≈5 % for lung cancer) are observed over LDS and the supervised baselines. The improvements demonstrate that (i) unlabeled data can be safely incorporated when high‑confidence samples are selected, and (ii) exploiting complementary biological views via co‑training provides a stronger signal than using a single view.
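
For reference, the F1 measure quoted in these results is the harmonic mean of precision and recall. A short Python sketch of the standard definition follows; the function name and the counts in the usage line are hypothetical and are not taken from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical confusion counts for one class, for illustration only.
example_f1 = f1_score(tp=8, fp=2, fn=2)
```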

Discussion and Limitations: The mapping strategy (mean aggregation) is deliberately simple; more sophisticated approaches (weighted averages, deep embeddings, or network‑based propagation) could capture the complex regulatory relationships more accurately. The choice of confidence threshold α is critical and currently requires empirical tuning per dataset. The quality of unlabeled data (batch effects, platform heterogeneity) is not explicitly addressed, which could affect robustness. Moreover, the study focuses on two views; extending the framework to incorporate additional omics layers (e.g., DNA methylation, proteomics) could further boost performance.

Conclusion: By integrating semi‑supervised learning with dual‑view co‑training, the authors present a practical solution to the pervasive problem of limited labeled samples in omics‑based cancer classification. The approach not only achieves notable performance gains over traditional supervised classifiers (Random Forest, SVM) but also surpasses existing semi‑supervised methods (LDS). This work paves the way for more effective utilization of the ever‑growing public omics repositories in precision oncology and highlights the value of leveraging complementary biological data sources within a unified learning framework.

