Identifying cancer subtypes in glioblastoma by combining genomic, transcriptomic and epigenomic data
We present a nonparametric Bayesian method for disease subtype discovery in multi-dimensional cancer data. Our method can simultaneously analyse a wide range of data types, allowing for both agreement and disagreement between their underlying clustering structure. It includes feature selection and infers the most likely number of disease subtypes, given the data. We apply the method to 277 glioblastoma samples from The Cancer Genome Atlas, for which there are gene expression, copy number variation, methylation and microRNA data. We identify 8 distinct consensus subtypes and study their prognostic value for death, new tumour events, progression and recurrence. The consensus subtypes are prognostic of tumour recurrence (log-rank p-value of $3.6 \times 10^{-4}$ after correction for multiple hypothesis tests). This is driven principally by the methylation data (log-rank p-value of $2.0 \times 10^{-3}$) but the effect is strengthened by the other 3 data types, demonstrating the value of integrating multiple data types. Of particular note is a subtype of 47 patients characterised by very low levels of methylation. This subtype has very low rates of tumour recurrence and no new events in 10 years of follow-up. We also identify a small gene expression subtype of 6 patients that shows particularly poor survival outcomes. Additionally, we note a consensus subtype that shows a highly distinctive data signature and suggest that it is therefore a biologically distinct subtype of glioblastoma. The code is available from https://sites.google.com/site/multipledatafusion/
Research Summary
This paper presents an extended version of Multiple Dataset Integration (MDI), a non-parametric Bayesian framework, and uses it to discover disease subtypes by jointly analyzing heterogeneous multi-omics data. The approach builds on the Dirichlet-process mixture (DPM) paradigm to handle several data types simultaneously, allowing both agreement and disagreement among their latent clustering structures. Each omics platform (gene expression, copy-number variation, DNA methylation, microRNA) is modeled with a Dirichlet-Multinomial Allocation (DMA) mixture, with Gaussian likelihoods for continuous data and multinomial likelihoods for binary or categorical data. Crucially, the model links the component-allocation variables across data types through a set of non-negative association parameters φ_{kℓ}, one for each pair of data types k and ℓ. When φ_{kℓ}=0 the two data types are independent; larger φ_{kℓ} values increase the prior probability that the same sample is assigned to the same cluster in both data sets. This construction enables the integration of data with different statistical properties without forcing a single shared clustering.
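The role of φ can be illustrated with a small calculation for two data types. The sketch below assumes that the joint prior weight for one sample is the product of the two mixture weights multiplied by (1 + φ) when the two cluster labels coincide, which matches the behaviour described above (independence at φ=0, more agreement as φ grows) but is not spelled out in this summary. The function names and the example weights are hypothetical.

```python
from itertools import product


def joint_allocation_prior(pi_a, pi_b, phi):
    """Prior over (label in data type A, label in data type B) for one sample.

    Assumed unnormalised weight: pi_a[i] * pi_b[j] * (1 + phi * [i == j]).
    """
    if phi < 0:
        raise ValueError("phi must be non-negative")
    if len(pi_a) != len(pi_b):
        raise ValueError("both data types need the same number of components")
    n = len(pi_a)
    weights = {
        (i, j): pi_a[i] * pi_b[j] * (1.0 + phi * (i == j))
        for i, j in product(range(n), repeat=2)
    }
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def agreement_probability(prior):
    """Prior probability that the sample gets the same label in both data types."""
    return sum(p for (i, j), p in prior.items() if i == j)


if __name__ == "__main__":
    pi = [0.5, 0.3, 0.2]  # illustrative mixture weights, not from the paper
    for phi in (0.0, 1.0, 10.0):
        prior = joint_allocation_prior(pi, pi, phi)
        print(phi, agreement_probability(prior))
```

At φ=0 the joint prior factorises into the two marginal weight vectors, so agreement happens only by chance; as φ increases, probability mass moves onto the matching label pairs.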
Technical enhancements over the original MDI include: (1) incorporation of both Gaussian and multinomial likelihoods, (2) automatic feature selection within each data type to focus on the most informative variables, and (3) a split‑merge Metropolis–Hastings step added to the Gibbs sampler to improve mixing and convergence in the high‑dimensional mixture space. The authors set an upper bound N on the number of mixture components (chosen as a function of the number of samples) but allow the posterior to infer the effective number of clusters.
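The idea of fixing an upper bound N while letting the data decide how many components are actually used can be illustrated with a prior simulation. The sketch below assumes a symmetric Dirichlet(α/N, ..., α/N) prior on the mixture weights, a common choice for Dirichlet-multinomial allocation that this summary does not confirm. The function name, α and the value of N are hypothetical.

```python
import random


def occupied_components(n_samples, n_max, alpha, rng):
    """Simulate one prior draw and count the non-empty mixture components.

    Weights are drawn from a symmetric Dirichlet(alpha / n_max) by
    normalising independent gamma variates (an assumed prior).
    """
    gammas = [rng.gammavariate(alpha / n_max, 1.0) for _ in range(n_max)]
    total = sum(gammas)
    weights = [g / total for g in gammas]
    labels = rng.choices(range(n_max), weights=weights, k=n_samples)
    return len(set(labels))


if __name__ == "__main__":
    rng = random.Random(1)
    # 277 samples as in the GBM data; the bound of 50 is illustrative only.
    counts = [occupied_components(277, 50, 1.0, rng) for _ in range(200)]
    print("mean occupied components:", sum(counts) / len(counts))
```

Because most of the N weights are close to zero under this prior, the number of occupied components is typically far below the bound, which is why the bound can be set generously without fixing the number of clusters in advance.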
The method was applied to 277 glioblastoma multiforme (GBM) cases from The Cancer Genome Atlas (TCGA) for which complete data were available across the four platforms. After preprocessing (level‑3 data, missing value imputation, Wilcoxon rank‑sum tests with Bonferroni correction), the authors retained 1,011 gene‑expression features, 1,000 copy‑number probes, 769 methylation sites (binarized β>0.95), and 104 microRNA features. Clinical follow‑up data were also incorporated, though 51 cases lacked complete outcome information.
MDI identified eight consensus subtypes. The consensus clustering reflects the structure shared among the four data types, and the model also provides separate cluster assignments for each individual omics layer. The most striking finding is a methylation-driven subtype comprising 47 patients with very low methylation levels; this group exhibited very low rates of tumor recurrence and no new events over ten years of follow-up. The methylation clustering on its own is prognostic of recurrence (log-rank p=2.0×10⁻³), and when the other three data types are added, the consensus subtypes achieve an even stronger association with recurrence (p=3.6×10⁻⁴ after multiple-testing correction). A second, much smaller gene-expression subtype of six patients showed markedly poor survival, underscoring the clinical relevance of even rare clusters.
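For readers unfamiliar with the log-rank test behind these p-values, the following is a minimal sketch of the standard two-group version. It is not the authors' analysis code: comparing eight subtypes requires the k-group generalisation, and the multiple-testing correction is not shown. The function name and the toy follow-up times in the usage lines are hypothetical.

```python
import math


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value), 1 d.o.f.

    times_*: follow-up times; events_*: 1 if the event was observed, 0 if censored.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    obs_minus_exp = 0.0  # observed minus expected events in group A
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for s, _, g in data if s >= t and g == 0)  # at risk in A
        n_b = sum(1 for s, _, g in data if s >= t and g == 1)  # at risk in B
        d_a = sum(1 for s, e, g in data if s == t and e and g == 0)
        d_b = sum(1 for s, e, g in data if s == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:  # hypergeometric variance is undefined for a single subject
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0.0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Toy data: group A has early events, group B has late events.
    chi2, p = logrank_two_groups([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
    print(chi2, p)
```

A small p-value indicates that the two groups differ in their event-time distributions more than chance alone would explain.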
The analysis also revealed partial overlap among data types: while many samples share the same cluster across platforms, a substantial fraction display discordant assignments, reflecting the biological complexity of GBM. The φ parameters quantify these inter-type relationships, offering a direct measure of how strongly each pair of data types shares clustering structure.
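Agreement between two data types can also be summarised per sample from the sampler output. The sketch below assumes that the MCMC output is available as lists of cluster labels, one list per draw and data type, and that labels are comparable across data types. The function name and the toy draws are hypothetical.

```python
def pairwise_agreement(samples_a, samples_b):
    """Per-sample fraction of MCMC draws in which two data types share a label.

    samples_a, samples_b: lists of draws; each draw is a list of cluster
    labels with one entry per patient.
    """
    if not samples_a or len(samples_a) != len(samples_b):
        raise ValueError("need the same non-zero number of draws for both data types")
    n_draws = len(samples_a)
    counts = [0] * len(samples_a[0])
    for draw_a, draw_b in zip(samples_a, samples_b):
        for i, (label_a, label_b) in enumerate(zip(draw_a, draw_b)):
            counts[i] += label_a == label_b
    return [c / n_draws for c in counts]


if __name__ == "__main__":
    # Three toy draws for four patients in two data types.
    expression = [[0, 0, 1, 2], [0, 0, 1, 1], [0, 1, 1, 2]]
    methylation = [[0, 0, 2, 2], [0, 1, 2, 1], [0, 1, 2, 0]]
    print(pairwise_agreement(expression, methylation))
```

Samples with agreement close to 1 are consistently placed in the same cluster by both data types, while values near 0 flag the discordant assignments mentioned above.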
In addition to the biological insights, the paper contributes methodological value: the split-merge MCMC step is intended to improve sampler mixing, and the feature-selection component focuses each data type on its most informative variables. The authors make their code available at https://sites.google.com/site/multipledatafusion/, facilitating reproducibility and extension to other cancers or additional omics modalities.
Overall, the study demonstrates that integrating multiple high‑throughput data types with a flexible Bayesian model can uncover clinically meaningful subtypes that would be missed by single‑omics analyses, and it provides a robust, open‑source tool for the broader cancer genomics community.