Bayesian Nonparametrics for Gene-Gene and Gene-Environment Interactions in Case-Control Studies: A Synthesis and Extension
Gene-gene and gene-environment interactions are widely believed to play significant roles in explaining the variability of complex traits. While substantial research exists in this area, a comprehensive statistical framework that addresses multiple s…
Authors: Durba Bhattacharya, Sourabh Bhattacharya
Ba y esian Nonparametrics for Gene-Gene and Gene-En vironmen t In teractions in Case-Con trol Studies: A Syn thesis and Extension Durba Bhattac harya 1 and Sourabh Bhattac harya 2 1 Departmen t of Statistics, St. Xa vier’s College (Autonomous), Kolk ata 2 In terdisciplinary Statistical Research Unit, Indian Statistical Institute, Kolk ata Abstract Gene-gene and gene-en vironmen t in teractions are widely b eliev ed to pla y significan t roles in explaining the v ariabilit y of complex traits. While substantial research exists in this area, a comprehensiv e statistical framew ork that addresses multiple sources of uncertain ty sim ulta- neously remains lacking. In this article, w e syn thesize and propose extension of a nov el class of Ba y esian nonparametric approaches that accoun t for interactions among genes, lo ci, and en vironmen tal factors while accommo dating uncertaint y about p opulation substructure. Our con tribution is threefold: (1) W e provide a unified exp osition of hierarc hical Bay esian mod- els driv en b y Dirichlet processes for genetic interactions, clarifying their conceptual adv an tages o v er traditional regression approaches; (2) W e shed light on new computational strategies that com bine transformation-based MCMC with parallel pro cessing for scalable inference; and (3) W e present enhanced hypothesis testing pro cedures for identifying disease- predisp osing loci. Through applications to my o cardial infarction data, we demonstrate ho w these metho ds offer biological insights not readily obtainable from standard approac hes. Our synthesis highlights the adv an tages of Bay esian nonparametric thinking in genetic epidemiology while pro viding practical guidance for implementation. Keyw ords: Diric hlet Pro cess; Disease Predisp osing Lo ci; Epistasis; Mixture Mo del; Parallel Com- puting; T ransformation based Mark ov Chain Monte Carlo. AMS Sub ject Classifications: 62K05, 62F15, 92D10 Con ten ts 1 In tro duction 3 1.1 The c hallenge of genetic interactions in complex diseases . . . . . . . . . . . . . . . . 3 1.2 Our con tribution: a Ba y esian nonparametric synthesis . . . . . . . . . . . . . . . . . 3 1.3 Article structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 A unified Bay esian nonparametric framew ork 4 2.1 Mo deling philosoph y: from regression to conditional genotype mo dels . . . . . . . . 4 2.2 Key adv an tages of our framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.3 Notation and data structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1 3 Gene-gene in teraction mo del 5 3.1 Mo del sp ecification: a roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.2 Detailed form ulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.3 Computational strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.4 Hyp othesis testing and DPL iden tification . . . . . . . . . . . . . . . . . . . . . . . . 7 4 Gene-en vironment interaction mo del 8 4.1 Extending the framew ork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 4.2 Enhanced h yp othesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 5 Hierarc hical Dirichlet pro cess mo del 10 5.1 Motiv ation and limitations of previous models . . . . . . . . . . . . . . . . . . . . . . 10 5.2 HDP form ulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 5.3 Chinese restauran t pro cess analogy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 5.4 Dep endence structure induced b y the HDP mo del . . . . . . . . . . . . . . . . . . . . 12 5.4.1 Dep endence among individuals . . . . . . . . . . . . . . . . . . . . . . . . . . 12 5.4.2 Dep endence among genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 5.4.3 Case-con trol dep endence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.5 Adv an tages of HDP approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.6 Computational implemen tation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.7 Hyp othesis testing in the HDP framew ork . . . . . . . . . . . . . . . . . . . . . . . . 14 6 Applications to Myocardial Infarction data 15 6.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 6.2 Implemen tation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 6.3 Results and biological insigh ts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 6.3.1 Effect of sex v ariable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 6.3.2 Roles of individual genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 6.3.3 Gene-gene in teractions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 6.3.4 P opulation structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 7 Sensitivit y analysis 16 7.1 Sensitivit y to mixture comp onent specifications . . . . . . . . . . . . . . . . . . . . . 16 7.2 Sensitivit y to priors on base measure parameters . . . . . . . . . . . . . . . . . . . . 17 7.3 Sensitivit y to priors on cov ariance matrices . . . . . . . . . . . . . . . . . . . . . . . 17 7.4 Sensitivit y to hypothesis testing thresholds . . . . . . . . . . . . . . . . . . . . . . . 17 7.5 Sensitivit y to environmen tal cov ariance structure . . . . . . . . . . . . . . . . . . . . 17 7.6 Sensitivit y to hierarchical Diric hlet pro cess hyperparameters . . . . . . . . . . . . . . 18 7.7 Sensitivit y to link age disequilibrium and lo cus lab el p ermutation . . . . . . . . . . . 18 7.8 Summary of sensitivit y analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 8 Discussion 18 8.1 Comparison with existing metho ds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 8.2 Limitations and robustness considerations . . . . . . . . . . . . . . . . . . . . . . . . 20 8.2.1 Fixed mixture w eights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 8.2.2 Mo del complexit y and in terpretability . . . . . . . . . . . . . . . . . . . . . . 20 8.3 F uture directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 8.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2 1 In tro duction 1.1 The challenge of genetic in teractions in complex diseases Complex diseases such as cardiov ascular disorders, diab etes, and psychiatric conditions result from in tricate netw orks of genetic and environmen tal factors. While genome-wide association studies (GW AS) hav e identified n umerous single nucleotide p olymorphisms (SNPs) asso ciated with disease risk, these explain only a small fraction of heritabilit y (Larson and Schaid, 2013). The “missing heritabilit y” problem has spurred interest in gene-gene (epistasis) and gene-environmen t interac- tions as p otential explanations. T raditional approaches to studying these interactions face several significan t limitations that our work aims to address. Most existing metho ds rely on linear or additive mo dels that may not adequately capture the complex biological pathw a ys through whic h genetic factors in teract (W ang et al. , 2010). These simplified mo deling assumptions often fail to represen t the intricate bio chemical net works that c haracterize many complex diseases. F urthermore, the failure to prop erly accoun t for p opulation stratification–the presence of genetic substructure within study p opulations–can lead to inflated false positive rates in asso ciation studies (Bhattac harjee et al. , 2010). This issue is particularly problematic in genetically div erse p opulations where different subgroups may hav e distinct allele frequencies unrelated to disease risk. The computational burden represents another substan tial challenge in studying genetic in terac- tions. T esting all p ossible pairwise SNP-SNP in teractions b ecomes infeasible for genome-wide data, leading researchers to adopt heuristic screening metho ds that may miss imp ortant interactions or iden tify spurious ones. Additionally , many curren t approaches pro vide p oint estimates without adequately characterizing the uncertain ty in mo del structure, particularly regarding the n umber of underlying sub-p opulations or the complexit y of in teraction netw orks. This lack of comprehen- siv e uncertain ty quan tification limits the reliabilit y and interpretabilit y of findings from genetic in teraction studies. 1.2 Our contribution: a Ba y esian nonparametric synthesis This article syn thesizes and extends a series of Ba yesian nonparametric models dev elop ed to address these c hallenges. Unlike previous w orks that presented these mo dels in isolation (Bhattac harya and Bhattac harya 2018, 2020, 2024), we pro vide a unified framew ork connecting gene-gene interaction mo dels, gene-en vironmen t extensions, and hierarchical Diric hlet pro cess form ulations. Our syn- thesis represen ts a comprehensive ov erview of this methodological approach, highlighting both theoretical foundations and practical implemen tation considerations. W e dev elop enhanced computational strategies that lev erage parallel pro cessing and transformation- based MCMC (Dutta and Bhattachary a, 2014) for practical implementation of these complex mod- els. These computational inno v ations mak e it feasible to apply Ba yesian nonparametric metho ds to realistic genetic datasets of me aningful size. Additionally , w e present new h yp othesis testing pro cedures for iden tifying disease-predisp osing lo ci in the presence of p opulation stratification, of- fering more robust alternatives to traditional asso ciation tests. Finally , we pro vide comprehensiv e applications demonstrating biological insights from m y o cardial infarction data, showing how these metho ds can unco v er relationships not readily apparent using standard approaches. Our approac h fundamen tally departs from standard logistic regression by mo deling genotypes conditional on disease status using Dirichlet pro cess mixtures. This inv ersion of the typical mo d- eling relationship allows us to capture several imp ortan t features simultaneously . W e can mo del uncertain ty in p opulation substructure nonparametrically , allowing the data to inform ab out the 3 n umber and characteristics of genetic subgroups. W e capture gene-gene interactions through co- v ariance structures rather than regression co efficients, providing a more flexible represen tation of genetic dep endencies. F urthermore, w e accommo date sub ject-sp ecific environmen tal effects through hierarc hical mo deling, enabling p ersonalized assessmen t of genetic risk factors. 1.3 Article structure The remainder of this article is organized as follo ws. Section 2 introduces our mo deling philosoph y and contrasts it with traditional approaches, explaining the conceptual shift from regression-based to conditional genotype mo deling. Section 3 presents the gene-gene in teraction mo del with compu- tational details, including our parallel implementation strategy . Section 4 extends this framework to incorporate gene-environmen t in teractions, discussing b oth mo deling extensions and enhanced h yp othesis testing pro cedures. Section 5 introduces the hierarchical Diric hlet pro cess form ulation, whic h addresses limitations of previous mo dels b y allowing more flexible sharing patterns. Section 6 presents comprehensiv e applications to my o cardial infarction data, demonstrating the practical utilit y of our metho ds. Section 7 pro vides extensiv e sensitivity analyses to assess the robustness of our results to k e y mo deling c hoices across all three framew orks. Finally , Section 8 discusses com- parisons with existing metho ds, addresses limitations and robustness considerations, and outlines future researc h directions. 2 A unified Ba y esian nonparametric framew ork 2.1 Mo deling philosoph y: from regression to conditional genot yp e models T raditional GW AS analyze disease status Y given genotypes X using logistic regression mo dels of the form P ( Y = 1 | X ) = logit − 1 ( β 0 + P β j X j + P β j k X j X k ). This approach, while interpretable and widely used, faces the “curse of dimensionalit y” when considering interactions and mak es strong parametric assumptions about the relationship b et ween genot yp es and disease risk. The logistic regression framework requires specifying whic h interactions to include, t ypically limiting consideration to pairwise effects due to computational constrain ts. F urthermore, these models assume that genetic effects are additiv e on the log-odds scale, which ma y not reflect the true biological mec hanisms underlying complex diseases. Our approach inv erts this relationship by mo deling genotypes conditional on disease status using finite mixture mo dels: [ X | Y = k ] = P M m =1 π m f ( X | θ mk ), where the mixture comp onents represen t laten t sub-p opulations. By placing Dirichlet process priors on the parameters { θ mk } , w e allow the num b er of sub-populations to b e inferred from the data rather than sp ecified in adv ance. This mo deling strategy represents a fundamen tal shift in persp ective: instead of asking ho w genotype affects disease risk, we ask how the distribution of genotypes differs b et w een cases and con trols. This approach naturally accommo dates p opulation structure, as mixture comp onents can corresp ond to genetic subgroups with distinct allele frequencies. The conditional mo deling approac h offers sev eral conceptual adv an tages. First, it directly ad- dresses the issue of p opulation stratification by explicitly mo deling genetic heterogeneity . Second, it provides a natural framework for identifying disease-asso ciated genetic v ariants through compar- ison of genot yp e distributions b etw een cases and controls. Third, it allo ws for flexible mo deling of dep endencies among genetic loci through the sp ecification of the mixture comp onen ts. Rather than assuming independence or simple correlation structures, we can incorp orate complex dependence patterns that ma y reflect biological relationships among genes. 4 2.2 Key adv an tages of our framew ork Our Ba y esian nonparametric framew ork offers sev eral k ey adv an tages ov er traditional approaches. First, it pro vides flexible mo deling of p opulation structure through Diric hlet pro cesses, which au- tomatically disco ver p opulation substructure without requiring pre-sp ecified ancestry information. This is particularly v aluable in studies of admixed p opulations or when detailed ancestry infor- mation is una v ailable. The nonparametric nature of Diric hlet pro cesses allo ws the num b er of sub-p opulations to b e determined by the data, a voiding b oth underfitting and o verfitting that can o ccur with fixed finite mixture mo dels. Second, our approach conceptualizes gene-gene interactions as statistical dep endence rather than additive effects on disease risk. W e model these in teractions through cov ariance structures in hierarc hical priors, capturing complex dep endencies that may not b e well re presen ted b y linear or additiv e mo dels. This represen tation aligns with biological understanding that genetic factors often act in netw orks rather than indep enden tly . By focusing on co v ariance structures, we can identify sets of genes that co-v ary in their effects on disease risk, p oten tially revealing functional pathw a ys or biological mo dules. Third, the conditional indep endence structure of our mo dels enables scalable computation through parallel implemen tation. Differen t comp onents of the mo del can b e updated indep en- den tly , allo wing us to lev erage mo dern parallel computing arc hitectures. This computational effi- ciency makes it feasible to apply our methods to datasets with large n umbers of genetic v ariants and samples, addressing a k ey limitation of many Bay esian nonparametric approaches. F ourth, our fully Bay esian approac h provides comprehensiv e uncertaint y quantification through p osterior distributions for all quan tities of in terest. This includes not only p oint estimates of genetic effects but also measures of uncertaint y ab out p opulation structure, interaction strengths, and mo del complexity . Prop er characterization of uncertaint y is particularly imp ortant in genetic studies, where sample sizes are often limited relativ e to the num b er of genetic v ariants considered. 2.3 Notation and data structure W e consider case-con trol data with N k sub jects in group k ∈ { 0 , 1 } , where k = 0 denotes controls and k = 1 denotes cases. F or each sub ject i , we observ e genotypes represented as x s ij r ∈ { 0 , 1 } for c hromosome s ∈ { 1 , 2 } , gene j ∈ { 1 , . . . , J } , and lo cus r ∈ { 1 , . . . , L j } . Here, x s ij r = 1 indicates the presence of the minor allele on chromosome s at locus r of gene j for individual i , while x s ij r = 0 indicates its absence. Additionally , we may observe environmen tal co v ariates E i for eac h individual, whic h could include factors such as sex, age, smoking status, or other exp osures. Disease status is denoted b y Y i ∈ { 0 , 1 } . The total n umber of genes is J , and L j represen ts the num b er of lo ci within gene j . This notation pro vides a comprehensive framework for represen ting the complex hierarc hical structure of genetic data, with v ariations at the chromosome, lo cus, gene, individual, and group lev els. 3 Gene-gene in teraction mo del 3.1 Mo del specification: a roadmap Our mo del for gene-gene in teractions comprises three k ey comp onen ts that w ork together to capture the complex structure of genetic data. First, w e emplo y sub ject-lev el mixture m o dels that represen t genot yp es as arising from finite mixtures, with eac h mixture comp onent corresp onding to a latent sub-p opulation. This approach allo ws us to account for p opulation stratification without requiring prior specification of ancestry groups. Second, w e place Diric hlet pro cess priors on the parameters 5 of these mixture mo dels, enabling flexible sharing of mixture comp onents across sub jects and auto- matic determination of the n umber of sub-p opulations. The Diric hlet pro cess framework pro vides a principled Bay esian nonparametric approach to mo deling p opulation structure, with the precision parameter con trolling the degree of sharing among sub jects. Third, w e introduce a hierarc hical in teraction structure through matrix-normal priors that capture dep endencies among genes. This hierarc hical structure allo ws us to model gene-gene in teractions at m ultiple levels, from pairwise correlations to higher-order dep endencies. The com bination of these three comp onents creates a comprehensiv e mo deling framework that addresses several limitations of traditional approac hes. The mixture mo del comp onent handles p opulation stratification, the Diric hlet process prior provides flexibility in mo deling the n um b er and characteristics of sub-populations, and the hierarc hical in teraction structure captures complex dep endencies among genetic factors. Imp ortan tly , these comp onents are in tegrated in a coherent Ba yesian framework that allows for uncertain ty quantification at all lev els, from individual genot yp e probabilities to o verall population structure. 3.2 Detailed formulation F or each gene j and group k , we mo del the genotype vector X ij k = ( x ij k 1 , . . . , x ij k L j ) using a finite mixture distribution: [ X ij k ] = P M m =1 π mj k Q L j r =1 Bernoulli( x ij k r | p mj k r ). Here, M represen ts the maximum n umber of mixture comp onents, whic h we set to a sufficiently large v alue to ensure adequate mo del flexibility . The mixing w eigh ts π mj k are fixed at 1 / M for all m , j , and k . This c hoice of fixed equal weigh ts, while somewhat uncon ven tional, has b een sho wn in previous work (Ma jumdar et al. , 2013) to yield b etter p erformance in estimating the true num b er of mixture comp onen ts compared to using Dirichlet priors. The robustness of this choice is discussed further in Section 8.2. The allele frequency parameters p mj k r follo w a hierarchical structure based on Dirichlet pro cess designed to capture b oth lo cus-sp ecific effects and dep endencies among genes. Sp ecifically , it holds that the marginal prior distribution of the allele frequency parameters is p mj k r ∼ Beta( ν 1 j k r , ν 2 j k r ), where the Beta distribution parameters are themselv es mo deled as ν 1 j k r = exp( u r + λ j k ) and ν 2 j k r = exp( v r + λ j k ). The parameters u r and v r are locus-sp ecific effects assumed to follow indep enden t standard normal distributions: u r , v r ∼ N (0 , 1). Allo wing u r and v r to differ ensures that the mean of the Beta distribution for p mj k r dep ends on the sp ecific lo cus r , capturing v ariation in allele frequencies across the genome. Gene-gene interactions are captured through a matrix-normal prior on the parameter matrix λ = { λ j k } , whic h represents gene- and group-sp ecific effects. W e assume λ ∼ N ( µ , A ⊗ Σ ), where A represen ts the cov ariance matrix capturing dep endencies among genes, and Σ represen ts the co v ariance matrix capturing dep endency b etw een the genot yp e distribution of case and con trol, giv en any gene. The Kroneck er pro duct structure A ⊗ Σ allo ws for separable cov ariance structures across genes and disease status, providing a computationally tractable y et flexible represen tation of dep endencies. This formulation enables us to mo del complex interaction patterns while maintaining computational feasibilit y through the separable cov ariance structure. 3.3 Computational strategy The conditional independence structure of our model enables efficient computational implemen- tation through a com bination of parallel pro cessing and sp ecialized MCMC tec hniques. The key insigh t is that, giv en the interaction parameters λ , the mixture mo dels for differen t gene-group pairs ( j, k ) are conditionally indep enden t. This conditional independence allows us to up date the 6 mixture parameters for each ( j, k ) pair in parallel across multiple processors. Each pro cessor han- dles a subset of the gene-group pairs, updating the allocation v ariables z ij k , the allele frequency parameters p mj k r , and the other mixture-related parameters indep enden tly of the others. F or updating the in teraction parameters λ , w e emplo y transformation-based MCMC (TM- CMC), which up dates all elements of λ simultaneously through deterministic transformations of a low-dimensional random v ariable. TMCMC is particularly w ell-suited for high-dimensional pa- rameter spaces, as it can prop ose mov es in directions that resp ect the correlation structure of the p osterior distribution. In our implemen tation, we use a single random v ariable to generate prop os- als for all elemen ts of λ , with the transformation designed to maintain reasonable acceptance rate and mixing. The Dirichlet pro cess mixture comp onents are up dated using a fast Gibbs sampling algorithm sp ecifically designed for finite mixtures with Dirichlet pro cess priors. This algorithm leverages the P olya urn represen tation of the Dirichlet pro cess to efficiently up date the allo cation of sub jects to mixture comp onen ts and the parameters asso ciated with eac h comp onen t. The algorithm alternates b et w een up dating the allo cation v ariables giv en the parameters and up dating the parameters given the allo cations, with b oth steps b eing computationally efficient due to conjugacy relationships. Figure 1 displa ys the sc hematic diagram for our gene-gene in teraction mo del. F or each gene and case-con trol status, genot yp e data are modeled using Dirichlet pro cess-based mixtures that cap- ture sub-p opulation structure. SNP-lev el dep endencies and gene-gene in teractions are in tro duced through a matrix-normal prior on laten t in teraction parameters. The mo dular design of the mo del allo ws efficient parallel computation: gene-sp ecific mixture comp onents are up dated indep enden tly across processors, while the in teraction parameters are up dated cen trally using TMCMC. Our parallel co des are written in C, in accordance with the Message P assing Interface (MPI) proto col. Case/Con trol ( k ) Gene j Mixture for ( j, k ): G j k ∼ Diric hlet Pro cess Minor allele at SNP r : x ij k r u r , v r ∼ N (0 , 1) (Lo cus-sp ecific) λ j k ∼ N ( µ , A ⊗ Σ ) (Gene-group) Figure 1: Schematic representation of the Bay esian mo del for gene-gene interactions. 3.4 Hyp othesis testing and DPL iden tification W e dev elop comprehensive Bay esian hypothesis testing pro cedures to ev aluate v arious asp ects of genetic asso ciations within our framew ork. F or asses sing gene effects, w e compare clustering pat- terns b etw een cases and controls using metrics based on p osterior distributions of the mixture comp onen ts. Significan t differences in clustering patterns indicate that the genetic structure differs 7 b et w een cases and con trols, suggesting an asso ciation betw een the gene and disease status. This approac h provides a more nuanced assessmen t than traditional asso ciation tests, as it considers the en tire distribution of genotypes rather than just summary statistics. T o test for gene-gene in teractions, w e examine elemen ts of the co v ariance matrix A in the matrix- normal prior on λ . Non-zero off-diagonal elements in A indicate dep endencies b etw een the effects of differen t genes, which w e interpret as statistical in teractions. W e compute p osterior probabilities that sp ecific elements of A are non-zero, providing a quantitativ e measure of interaction strength with asso ciated uncertaint y . This approach allo ws us to identify pairs or groups of genes that show co ordinated effects on disease risk, p oten tially revealing biological pathw ays or functional mo dules. Our k ey idea for identifying disease-predisposing lo ci (DPL) ma y b e lik ened to computing p osterior probabilities that sp ecific SNPs show differen tial distributions b et w een cases and con- trols. Sp ecifically , for each SNP r in gene j , our idea is analogous, in principle, to computing P (SNP r is DPL | Data) = P p case · j r − p control · j r > δ | Data , where δ is a clinically meaningful threshold for allele frequency differences. The probabilities p case · j r and p control · j r represen t the allele fre- quencies in cases and controls, resp ectiv ely . This approac h provides a principled Bay esian metho d for DPL identification that accoun ts for multiple sources of uncertain ty , including p opulation strat- ification and estimation error. Our h yp othesis testing framework extends b eyond simple significance testing to include mea- sures of effect size and uncertaint y . F or each test, we report p osterior probabilities along with credible interv als for relev ant parameters, providing a comprehensiv e picture of the evidence for v arious genetic asso ciations. This Bay esian approach naturally incorp orates multiple testing con- siderations through the prior distributions and p osterior probabilities, av oiding the need for ad ho c corrections that can b e problematic in high-dimensional settings. 4 Gene-en vironmen t in teraction mo del 4.1 Extending the framew ork T o incorp orate environmen tal co v ariates E i , we extend our mo deling framework to allow for sub ject- sp ecific mixture distributions that dep end on en vironmen tal factors. The extended mo del repre- sen ts genot yp es as [ X ij k | E i ] = P M m =1 π mij k Q L j r =1 Bernoulli( x ij k r | p mij k r ), where now the mixing w eights and comp onent parameters can v ary across individuals based on their environmen tal ex- p osures. As in the gene-gene in teraction model, w e fix π mij k = 1 / M for all ( i, j, k , m ), main taining computational simplicit y while allowing flexibilit y through the comp onent parameters. The k ey extension in the gene-environmen t in teraction mo del lies in the parameterization of the Beta distribution parameters for the allele frequencies. W e now model these as ν 1 ij k r = exp( u j r + λ ij k + µ j k + β ⊤ j k E i ) and ν 2 ij k r = exp( v j r + λ ij k + µ j k + β ⊤ j k E i ). This expanded pa- rameterization includes sev eral new terms: u j r and v j r are gene- and lo cus-sp ecific effects; λ ij k represen ts individual-sp ecific genetic effects that capture residual v ariation not explained by other factors; µ j k are gene- and group-sp ecific intercept terms; and β j k are v ectors of co efficients captur- ing the effects of en vironmental co v ariates on the genetic parameters for gene j in group k . The inclusion of environmen tal effects through the term β ⊤ j k E i allo ws the distribution of geno- t yp es to v ary systematically with environmen tal exp osures. When β j k = 0 , en vironmen tal factors mo dify the genetic parameters for gene j in group k , represen ting gene-environmen t interaction. Imp ortan tly , this formulation allo ws environmen tal factors to affect not only the mean genetic ef- fects but also the co v ariance structure among genes, as the en vironmental terms enter in to the hierarc hical mo del that generates the λ ij k parameters. This enables the mo del to capture ho w 8 en vironmental exp osures might mo dify not just individual genetic effects but also the patterns of in teraction among genes. The co v ariance structure in this extended mo del con tinues to capture how environmen t mo difies gene-gene in teractions. The individual-sp ecific effects λ ij k inherit dep endence structure from the hierarc hical model, with en vironmental factors p otentially influencing b oth the mean and cov ariance of these effects. This allows for rich patterns of gene-environmen t interaction, including scenarios where environmen tal exp osures alter the strength or pattern of genetic correlations. F or example, an environmen tal factor migh t strengthen the correlation betw een t wo genes in cases but w eaken it in controls, or it might induce correlations among genes that are indep endent in the absence of the exp osure. 4.2 Enhanced hypothesis testing W e extend our hypothesis testing framew ork to address questions sp ecific to gene-en vironment in teractions. After testing for ov erall genetic effect b y extending the previous metho d, we test for the presence of gene-en vironment interaction b y ev aluating the n ull hypothesis H 0 : β j k = 0 for sp ecific genes and en vironmental factors. W e compute posterior probabilities that β j k differs from zero, providing a direct measure of evidence for gene-en vironment in teraction. This approach allo ws us to identify genes whose effects on disease risk are mo dified by environmen tal exp osures, whic h is of particular interest for understanding disease etiology and targeted interv entions. W e also test for join t effects inv olving b oth gene-gene and gene-environmen t interactions. This in volv es ev aluating whether environmen tal factors mo dify the patterns of interaction among genes, whic h we assess by examining how environmen tal cov ariates affect the co v ariance structure in the mo del. Sp ecifically , we test whether parameters go verning the dep endence of the co v ariance structure on environmen tal factors differ from zero. Significan t findings indicate that environmen tal exp osures alter the net work of genetic interactions, potentially revealing mechanisms through whic h en vironment influences disease risk. Third, we develop metho ds for stratified DPL iden tification that accoun t for environmen tal het- erogeneit y . Rather than identifying DPL that show av erage differences b etw een cases and con trols, w e identify SNPs whose effects dep end on environmen tal exp osure. This inv olv es computing p os- terior probabilities that the difference in allele frequencies b etw een cases and controls v aries across lev els of environmen tal factors. F or example, we migh t identify SNPs that sho w strong asso ciations in exp osed individuals but weak or no asso ciations in unexp osed individuals, or vice v ersa. This stratified approach can reveal genetic factors that are imp ortant only under sp ecific environmen tal conditions, pro viding insights in to the context-dependence of genetic risk. Apart from these, we additionally dev elop visualization to ols to help in terpret complex inter- action patterns, including heatmaps of p osterior in teraction probabilities and netw ork diagrams sho wing how genes and en vironmental factors are connected through interaction effects. Figure 2 provides the schematic diagram for our gene-environmen tal interaction mo del. Envi- ronmen tal co v ariates influence individual-lev el Diric hlet process mixtures, allo wing the mo del to accoun t for p ersonalized effects of en vironmental exp osures on genot yp e distributions. The prior structure integrates lo cus-sp ecific, gene-specific, and environmen t-dep enden t parameters. Parallel computation is emplo yed by up dating gene-environmen t sp ecific comp onen ts in parallel and inter- action parameters centrally via TMCMC. As b efore, our parallel co des are written in C, leveraging the MPI proto col. 9 Individual i En vironmental co v ariates E i Gene j Mixture for ( i, j, k ): G ij k ∼ DP Minor allele at SNP r : x ij k r u j r , v j r (Lo cus-gene) λ ij k , µ j k , β ⊤ j k E i (Individual-gene) Figure 2: Diagram of the extended Bay esian framework incorporating gene-en vironmen t in terac- tions. 5 Hierarc hical Diric hlet pro cess mo del 5.1 Motiv ation and limitations of previous mo dels The gene-en vironmen t interaction mo del presen ted in Section 4 makes the simplifying assumption that environmen tal effects act uniformly on gene-gene interactions across all individuals. This as- sumption may not hold in many practical settings for several reasons. First, differen t individuals ma y exp erience different lev els or types of environmen tal exp osure, leading to heterogeneous effects on genetic in teractions. F or example, the effect of smoking on genetic risk factors for cardio v ascular disease ma y dep end on duration and intensit y of smoking, whic h v aries across individuals. Second, en vironmental effects may b e heterogeneous across the p opulation due to unmeasured factors or effect mo difiers. Genetic bac kground, other en vironmen tal exp osures, or lifest yle factors might mod- ify ho w a particular en vironmental factor influences genetic in teractions. Third, gene-en vironment in teractions ma y b e context-dependent, with effects manifesting only under sp ecific conditions or in sp ecific subgroups. The uniform effect assumption fails to capture this context-dependence, p oten tially missing imp ortan t biological relationships. These limitations motiv ate the dev elopment of a more flexible mo deling approach that can ac- commo date heterogeneous and context-dependent gene-environmen t interactions. The hierarchical Diric hlet pro cess (HDP) mo del provides a principled nonparametric framework for capturing com- plex sharing patterns among individuals, genes, and groups in a data-driven manner. Rather than imp osing a parametric structure on how environmen tal factors influence genetic dep endencies, the HDP model learns these relationships nonparametrically from the data. This approac h can disco v er complex patterns of interaction that might b e missed by more restrictive parametric mo dels, while still pro viding a coherent probabilistic framew ork for inference. 5.2 HDP formulation W e address the limitations of previous mo dels through a three-level hierarchical Diric hlet pro cess form ulation that introduces flexible, nonparametric dep endence structures among genes, en viron- men tal v ariables, and case-control status. Our mo del represents a significant extension of the 10 traditional HDP framework (T eh et al. , 2006) b y incorp orating an additional level of hierarc hy that sp ecifically captures case-con trol dep endence while allowing for sub ject-sp ecific gene-gene interac- tions influenced b y environmen tal factors. F or each individual i in group k , gene j , and mixture comp onent m , we assume that the minor allele frequency vector p mij k = ( p mij k 1 , p mij k 2 , . . . , p mij k L ) is generated from a hierarch y of Dirichlet pro cesses: p 1 ij k , p 2 ij k , . . . , p M ij k iid ∼ G ij k (1) G ij k ∼ DP( α G,ik G 0 ,j k ) (2) where DP( α G,ik G 0 ,j k ) denotes a Dirichlet pro cess with base measure G 0 ,j k and precision pa- rameter α G,ik . The environmen tal dep endence enters through the precision parameter: log( α G,ik ) = µ G + β ⊤ G E ik (3) where E ik is a d -dimensional vector of environmen tal v ariables for individual i in group k , β G is a d -dimensional v ector of regression co efficients, and µ G is an in tercept term. A t the second level of the hierarch y , we assume: G 0 ,j k iid ∼ DP( α G 0 ,k H k ); j = 1 , . . . , J (4) with log( α G 0 ,k ) = µ G 0 + β ⊤ G 0 ¯ E k (5) where ¯ E k = 1 N k P N k i =1 E ik is the a verage en vironmental v ariable in group k . The third lev el of hierarch y creates dep endence b etw een case and control groups: H k iid ∼ DP( α H ˜ H ); k = 0 , 1 (6) with log( α H ) = µ H + β ⊤ H ¯ ¯ E (7) where ¯ ¯ E = ( ¯ E 0 + ¯ E 1 ) / 2 is the o verall a verage environmen tal v ariable. The global base measure ˜ H is sp ecified as: p mij k r iid ∼ Beta( ν 1 , ν 2 ) under ˜ H (8) where ν 1 , ν 2 > 0 are fixed h yp erparameters. This hierarc hical structure has a natural interpretation in terms of genetic architecture. The sub ject-level distributions G ij k capture individual-specific genetic patterns, whic h ma y differ due to unique genetic backgrounds, environmen tal exp osures, or other individual factors. These sub ject- lev el distributions share comp onen ts through the gene-group lev el distributions G 0 ,j k , which rep- resen t common patterns for eac h gene in each disease group. The group-level distributions H k capture o verall genetic patterns for cases and controls separately , allowing for differences in ge- netic arc hitecture b etw een affected and unaffected individuals. Finally , the global distribution ˜ H represen ts the ov erall genetic background of the p opulation. 11 5.3 Chinese restaurant pro cess analogy The hierarc hical structure can be understo o d through an extended Chinese restauran t process anal- ogy (T eh et al. , 2006). F or eac h group k = 0 , 1, imagine J restauran ts (genes). Eac h individual i visits these restauran ts, where at the j -th restauran t, they are seated at tables (mixture com- p onen ts) that serve dishes (allele frequency parameters). The dishes at different tables within the same restauran t are drawn from G 0 ,j k , whic h itself is drawn from H k . This creates sharing of dishes across tables within a restauran t (within a gene across individuals). No w consider that all restauran ts share a global menu of dishes from ˜ H . Differen t restaurants (genes) may serv e different selections from this global men u, with the selections dra wn from H k . This creates sharing of dishes across restaurants (across genes), with the degree of sharing con trolled b y α H . The environmen tal v ariables influence ho w likely customers (individuals) are to sit at new tables through α G,ik , and ho w likely restauran ts are to offer new dishes through α G 0 ,k and α H . This analogy clarifies how our mo del creates dependence: individuals sharing the same table at a restauran t share the same dish (allele frequency parameters) for that gene, creating depen- dence among individuals. Different restaurants (genes) serving the same dish creates dep endence among genes. The sharing of dishes b etw een case and control restauran ts ( k = 0 and k = 1) cre- ates case-control dep endence. Environmen tal factors influence these sharing probabilities, thereby mo dulating the dep endence structure. 5.4 Dep endence structure induced b y the HDP mo del Our HDP model induces three k ey t yp es of dep endence that are crucial for understanding gene-gene and gene-en vironment in teractions: 5.4.1 Dep endence among individuals F rom the P olya urn represen tation, the joint distribution of p M ij k = { p 1 ij k , . . . , p M ij k } shows that individuals share dish parameters ϕ tij k dra wn from G 0 ,j k . The probability of sharing dep ends on α G,ik , whic h in turn dep ends on individual environmen tal v ariables E ik through equation (3). This creates environmen tal-mo dulated dep endence among individuals: individuals with similar en vironmental exposures are more likely to share genetic patterns. Imp ortan tly , the marginal distribution of p mij k is G 0 ,j k , whic h do es not dep end on E ik . This is biologically desirable: p opulation minor allele frequencies should not b e affected b y en vironmen tal v ariables, although environmen tal exp osure may influence how individuals cluster together in their genetic patterns. 5.4.2 Dep endence among genes Gene-gene dep endence arises through the sharing of dishes (parameters) across restauran ts (genes). The parameters ϕ tij k for different genes j share common v alues from H k , creating dep endence among genes. The degree of this dep endence is influenced by α G 0 ,k , which dep ends on the group a verage en vironmen t ¯ E k through equation (5), and also indirectly through α G,ik whic h affects the n umber of tables τ ij k . This structure ensures that gene-gene interactions are sp ecific to individuals and are influenced b y b oth individual environmen tal v ariables ( E ik ) and group av erages ( ¯ E k ), while the marginal distributions of individual genes remain unaffected b y environmen t. 12 5.4.3 Case-con trol dep endence The sharing of dishes b etw een case and con trol restaurants (through H 0 and H 1 sharing from ˜ H ) creates dep endence b etw een case and con trol groups. This dep endence captures factors that affect b oth cases and controls but may not b e explicitly measured, such as population stratification or unmeasured en vironmental factors. The degree of case-control sharing is controlled by α H , which dep ends on the o verall av erage en vironment ¯ ¯ E through equation (7). This allows en vironmen tal factors to mo dulate the similarity b et w een cases and controls in their genetic patterns. 5.5 Adv antages of HDP approac h The hierarc hical Dirichlet pro cess mo del offers sev eral adv antages ov er the parametric mo dels dis- cussed previously . First, it provides flexible, nonparametric dependence structure: unlike previous matrix-normal based mo dels that assume uniform environmen tal effects on cov ariance structures, our HDP mo del allo ws en vironmen t to influence dep endence structures in a flexible, nonparametric manner that v aries across individuals. Second, it enables sub ject-sp ecific gene-gene in teractions: eac h individual can ha ve their o wn pattern of gene-gene in teractions, with en vironmental factors in- fluencing these individual-sp ecific patterns. This is more realistic than assuming a single correlation structure for all individuals. Third, it main tains biological in terpretabilit y: the mo del preserv es the biologically imp ortant prop erty that en vironmental v ariables influence dependence structures but not marginal allele frequency distributions. Only dep endence structures (how genes interact) are affected by environmen t, not the genes themselv es. F ourth, it offers automatic complexity control: the Diric hlet process priors automatically determine the appropriate n umber of mixture comp o- nen ts at each level of the hierarch y , av oiding the need to pre-sp ecify the num b er of sub-p opulations or in teraction patterns. Fifth, it achiev es computational efficiency through parallelization: the con- ditional indep endence structure of the HDP mo del enables efficient parallel implemen tation, with differen t genes and individuals up dated indep enden tly given the higher-lev el parameters. 5.6 Computational implementation W e implement the HDP mo del using a no vel parallel MCMC algorithm that com bines Gibbs sam- pling steps with transformation-based MCMC (TMCMC). The algorithm leverages the conditional indep endence structure of the mo del: given the higher-level distributions ( H k and G 0 ,j k ), the sub ject-level parameters can be updated indep enden tly across individuals and genes. This allows for parallel computation across m ultiple processors. Key components of our computational strategy include retrosp ective sampling (Papaspiliopoulos and Rob erts, 2008) for up dating parameters from Diric hlet process mixtures, whic h av oids infinite-dimensional representations by generating param- eters as needed. The allocation v ariables z ij k and allele frequency parameters p mij k r are up dated in parallel across different ( i, j, k ) combinations using Gibbs steps that exploit conjugacy . The parameters µ G , β G , µ G 0 , β G 0 , µ H , β H are up dated in a single blo c k using a mixture of additive and m ultiplicative TMCMC (Dey and Bhattach ary a, 2016), whic h efficien tly explores correlated pa- rameter spaces. Our parallel implementation in C, as b efore, uses MPI for communication betw een pro cessors, with careful load balancing to ensure efficien t parallel scaling. The sch ematic diagram of our HDP model is presented in Figure 3. This fully nonparametric framew ork models dep endencies across individuals, genes, and groups through a three-level hier- arc hy of Dirichlet pro cesses. En vironmental cov ariates influence the precision parameters at each lev el, allowing flexible, individualized representation of in teraction structures. The base distribu- tion is a Beta prior on allele frequencies. This hierarc hy enables ric h mo deling of stratification and 13 in teraction while maintaining computational scalabilit y . En vironmental co v ariates E i Individual DP: G ij k log α G,ik = µ G + β ⊤ G E ik Gene DP: G 0 ,j k log α G 0 ,k = µ G 0 + β ⊤ G 0 ¯ E k Group DP: H k log α H = µ H + β ⊤ H ¯ ¯ E Base: p mij k r ∼ Beta( ν 1 , ν 2 ) Figure 3: Schematic of the hierarchical Diric hlet pro cess (HDP) mo del for gene-gene and gene- en vironment in teractions. 5.7 Hyp othesis testing in the HDP framew ork W e extend our h yp othesis testing procedures to the HDP framew ork with specific tests for: gene effects, testing H 0 : h 0 j = h 1 j for j = 1 , . . . , J , where h 0 j and h 1 j are the marginal distributions of genot yp es for controls and cases resp ectively for gene j ; environmen tal effects, testing H 0 : β G = 0 , H 0 : β G 0 = 0 , and H 0 : β H = 0 for the environmen tal coefficients at differen t lev els of the hierarc hy; gene-gene in teractions, defining sub ject-sp ecific gene-gene in teraction measures C ( i, j 1 , j 2 , k ) as co v ariances b etw een logit-transformed av erage allele frequencies for genes j 1 and j 2 for individual i in group k , and testing H 0 : C ( i, j 1 , j 2 , k ) = 0; and case-control dep endence, assessing whether the sharing b etw een cases and controls (through α H ) is significant, indicating common factors affecting b oth groups. The interpretation of these tests follows a logical framework: if genes hav e no effect but environmen t affects dep endence structures, then environmen t influences gene-gene interactions without affecting disease status. If genes hav e effects but en vironmen t do esn’t affect interactions, then purely genetic factors determine disease. If b oth genes and environmen t are significan t and affect in teractions, then gene-environmen t interactions influence disease risk. 14 6 Applications to My o cardial Infarction data 6.1 Data description W e apply our Ba yesian nonparametric framew ork to a m yocardial infarction (MI) case-control dataset to demonstrate its practical utility and biological insights. The dataset comprises genotype information from the MI Gen study , obtained from the dbGaP database. This dataset includes m ultiple sub-p opulations: Caucasian, Han Chinese, Japanese, and Y oruban. F or our analysis, we considered a set of SNPs that are found to b e individually asso ciated with different cardiov ascular endp oin ts in v arious GW AS, along with SNPs marginally associated with MI in the MIGen study . The data inv olve 32 genes cov ering 1251 lo ci, including 33 previously identified lo ci asso ciated with my o cardial infarction. The dataset includes b oth cases (individuals who exp erienced m yocar- dial infarction) and con trols. An imp ortant en vironmen tal cov ariate av ailable is sex (male/female), whic h is incorporated to inv estigate gene-environmen t interactions. This provides an opp ortunity to demonstrate ho w our framew ork can unco ver b oth genetic main effects and interactions with en vironmental factors. The genetic data consist of diploid genot yp es , whic h we con v ert to binary indicators for eac h c hromosome for compatibilit y with our mo deling framew ork. This represen tation allows us to mo del the tw o c hromosomes separately while main taining the diploid nature of the data in our likelihoo d function. 6.2 Implemen tation details W e implement our Ba y esian nonparametric mo dels using our MPI-based C co des and parallel com- puting resources. F or the HDP model, w e set the maxim um num b er of mixture components M = 30 based on preliminary analyses to provide sufficien t flexibility without excessiv e computational cost. W e set ν 1 = ν 2 = 1, making ˜ H a uniform distribution on [0 , 1]. The precision parameters are sp ecified as: α G,ik = 0 . 1 × exp(100 + µ G + β ⊤ G E ik ) α G 0 ,k = 0 . 1 × exp(100 + µ G 0 + β ⊤ G 0 ¯ E k ) α H = 0 . 1 × exp(100 + µ H + β ⊤ H ¯ ¯ E ) with µ G , µ G 0 , µ H iid ∼ U (0 , 1) and β G , β G 0 , β H iid ∼ U ( − 1 , 1) as priors. This structure ensures adequate n umbers of sub-p opulations and satisfactory MCMC mixing. W e p erform MCMC sampling with 30,000 iterations, discarding the first 10,000 as burn-in. Con vergence is assessed primarily using trace plots. The parallel implementation distributes com- putation across 50 cores, with total run time of approximately 7 days for the full analysis. 6.3 Results and biological insights Application of our HDP model to the m y o cardial infarction data yields sev eral imp ortan t insights: 6.3.1 Effect of sex v ariable W e find strong evidence that sex influences genetic patterns: P ( | β G | < ϵ β G | Data) ≈ 0 and P ( | β G 0 | < ϵ β G 0 | Data) ≈ 0, indicating that individual-lev el ( E ik ) and group-a v erage ( ¯ E k ) sex effects are highly significant. Ho wev er, P ( | β H | < ϵ β H | Data) ≈ 1, suggesting the ov erall av erage sex 15 effect ( ¯ ¯ E ) is not significan t. This pattern indicates that sex plays an imp ortant role in influencing sub ject-sp ecific and group-level genetic patterns, but not the ov erall case-con trol similarit y . 6.3.2 Roles of individual genes Our clustering-based hypothesis tests indicate significant o verall genetic effects. How ever, individual gene tests show that none of the 32 genes are individually significan t. This apparent paradox is explained by gene-gene in teractions: when genes are correlated, the maxim um of their individual distances can b e significan t even when individual distances are not, similar to ho w max( X 1 , X 2 ) from a biv ariate normal distribution can hav e a non-zero median ev en when X 1 and X 2 ha ve zero medians individually . 6.3.3 Gene-gene in teractions Our HDP mo del reveals sub ject-sp ecific gene-gene in teraction patterns not detected by previous mo dels. Tw o genes, AP006216.10 and C6orf106 , show significan t interactions with other genes in most sub jects. In terestingly , the only sub jects with no significan t gene-gene interactions inv olving these genes were male cases, suggesting that lac k of protective gene-gene interactions ma y contribute to MI risk in males. The gene-gene interactions appear to ha v e a protectiv e effect: in sub jects where AP006216.10 and/or C6orf106 in teract with other genes, the risk of MI seems reduced. This b eneficial effect of gene-gene in teractions con trasts with the traditional view that interactions primarily increase disease risk. 6.3.4 P opulation structure The p osterior distribution of the n um b er of sub-p opulations sho ws mo des at 3 and 4 comp o- nen ts, supp orting the kno wn four sub-p opulations in the data (Caucasian, Han Chinese, Japanese, Y oruban). The mo del correctly iden tifies that these p opulations cannot b e further subdivided genetically , consisten t with the biological understanding of these p opulation groups. 7 Sensitivit y analysis 7.1 Sensitivit y to mixture comp onen t sp ecifications F or the gene-gene in teraction mo del of Section 3 and the gene-environmen t interaction mo del of Section 4, we examine sensitivit y to the maxim um num b er of mixture comp onents M and the fixed mixing weigh ts π mj k = 1 / M . F ollowing the rule-of-thum b established in prior w ork (Ma jumdar et al. , 2013), our primary analyses set M = 30, which pro vides sufficien t capacity while main taining computational efficiency . Alternativ e sp ecifications with M = 20 and M = 50 yield qualitativ ely similar results; the p osterior distribution of the effective num b er of mixture comp onents remains stable once M exceeds a threshold (approximately 20 in our applications), as the Diric hlet pro cess prior automatically determines the num b er of distinct comp onen ts through the P´ oly a urn scheme. This robustness renders the exact v alue of M largely inconsequen tial beyond ensuring adequate mo del flexibilit y . 16 7.2 Sensitivit y to priors on base measure parameters F or the Beta base measure parameters u r and v r , which are sp ecified as indep enden t standard nor- mal distributions in our primary analyses, we inv estigate robustness using alternativ es: N (0 , 10 2 ), N (0 , 0 . 1), and Cauc h y(0 , 1). P osterior distributions of the interaction parameters λ j k and the re- sulting gene-gene correlation estimates exhibit remark able stability across these sp ecifications. As noted in Section 3, w e find that Gaussian priors on u r and v r with other means and v ariances do not yield significan tly different results, indicating inherent prior robustness in our mo deling strategy . The Cauch y prior pro duces slightly hea vier tails but do es not alter conclusions of an y h yp othesis tests regarding gene significance or gene-gene in teractions. 7.3 Sensitivit y to priors on cov ariance matrices F or the gene-gene interaction matrix A and case-con trol dependence matrix Σ in the matrix-normal prior λ ∼ N ( µ , A ⊗ Σ ), w e sp ecify inv erse-Wishart priors: A ∼ I W ( ξ , A 0 ) with ξ = J + 2, and Σ ∼ I W ( ζ , Σ 0 ) with ζ = 4. Sensitivity is assessed across degrees of freedom ( ξ = J + 1 , J + 2 , J + 5 , J + 10; ζ = 3 , 4 , 5 , 10) and scale matrix sp ecifications (estimated from data versus identit y matrices). Posterior inferences regarding which genes are marginally significant and which gene- gene in teractions are presen t pro ve highly robust to v ariations in degrees of freedom. How ever, the magnitudes of posterior correlations sho w some sensitivity: larger degrees of freedom, which corresp ond to stronger prior information, produce shrink age to w ard the prior mean and yield slightly atten uated correlation estimates. The choice of scale matrices has minimal impact when degrees of freedom are set to the minim um v alues ensuring prop er priors ( ξ = J + 2, ζ = 4). 7.4 Sensitivit y to hypothesis testing thresholds A critical component of our Bay esian testing procedure, described in Section 3, is the sp ecification of thresholds ϵ for distance measures. F ollowing our established approac h, we set ϵ = F − 1 (0 . 55) where F is the p osterior distribution function under the n ull model. The c hoice of the 55th percentile rather than the median is delib erate: for the median, the p osterior probabilit y of the true n ull h yp othesis is 0.5, whereas under zero-one loss the true null will b e accepted only if its p osterior probabilit y exceeds 1 / 2. Using the median results in b orderline decisions in null simulations, while the 55th p ercentile pro vides appropriate op erating characteristics. Sensitivit y analyses using the 50th, 60th, and 75th p ercentiles demonstrate that the 60th and 75th p ercen tiles pro duce more conserv ativ e tests with reduced false p ositives but increased false negatives, while the 50th p ercentile yields unacceptably high false p ositive rates. Our choice of the 55th percentile represents a balanced compromise that main tains p ow er while controlling Type I error. 7.5 Sensitivit y to environmen tal cov ariance structure In our gene-environmen t in teraction model of Section 4, w e examine sensitivity to the sp ecification of the environmen tal co v ariance structure. The p osterior probability P ( ϕ < ϵ ϕ | Data) is robust across alternative kernel sp ecifications, consistently indicating no differential effect of sex on ge- netic interactions in the my o cardial infarction application. P osterior estimates of ϕ sho w some sensitivit y to prior v ariance, with more diffuse priors yielding wider credible in terv als, but the 95% credible interv al consistently con tains zero. The smo othness parameter in the k ernel is difficult to iden tify from the data, with p osterior distributions largely reflecting the prior when sample sizes are mo derate. How ev er, this lack of identifiabilit y do es not affect primary conclusions regarding gene significance or gene-gene in teractions, as ϕ is consistently estimated to b e negligible. 17 7.6 Sensitivit y to hierarchical Diric hlet pro cess h yp erparameters F or our hierarchical Diric hlet pro cess mo del of Section 5, we assess sensitivit y to the sp ecification of precision parameters and their asso ciated regression co efficients. Our primary sp ecification, with α G,ik = 0 . 1 × exp(100 + µ G + β ⊤ G E ik ) and analogous structures for α G 0 ,k and α H , ensures adequate n umbers of sub-p opulations and satisfactory MCMC mixing. Remo ving the constant offset of 100 results in significantly smaller α v alues, leading to few er distinct mixture comp onents and p o orer mixing. Alternativ e prior distributions for µ G , µ G 0 , µ H and β G , β G 0 , β H (normal versus uniform) p erform similarly , with uniform priors showing sligh tly b etter conv ergence prop erties. The key conclusions regarding gene-gene interactions and the influence of environmen tal factors on dep endence structures are robust across all sp ecifications examined. 7.7 Sensitivit y to link age disequilibrium and lo cus lab el permutation A critical concern asso ciated with Section 3 is whether sharing the parameters u r and v r across all genes implies non-exc hangeability of lo cus lab els. T o address this, we conduct p erm utation exp erimen ts wherein the lab els of lo ci within eac h gene are randomly p erm uted prior to re-analysis. The results are en tirely consistent with the original analyses based on non-p erm uted data. F or the gene-gene interaction simulation, the p osterior probabilit y measures of genetic effect remain strongly suggestive of significance; for the null simulation, these measures correctly indicate no genetic influence. These results confirm that our mo del does not imp ose non-exc hangeability of lo cus lab els. 7.8 Summary of sensitivit y analysis Across all sensitivit y in v estigations, consisten t patterns emerge. Hyp othesis test conclusions regard- ing whic h genes are marginally significant, which gene-gene interactions are present, and whether en vironmental v ariables affect genetic interactions are highly robust to reasonable v ariations in prior sp ecifications and mo deling c hoices across all three modeling frameworks. Posterior magnitudes of correlation parameters and distance measures sho w some sensitivit y to prior informativeness, but these v ariations do not affect the binary decisions in our Bay esian testing framework. Con ver- gence and mixing of MCMC algorithms are sensitiv e to certain choices, particularly the Dirichlet pro cess precision parameters and the offset parameter in HDP sp ecifications, but our primary sp ecifications consistently provide adequate p erformance. The hierarchical nature of our Bay esian mo dels pro vides natural protection against prior misspecification, with data information dominat- ing prior information for key parameters of interest. These findings collectively demonstrate that our Ba yesian nonparametric approaches are not unduly sensitive to sp ecific prior choices and mo d- eling assumptions, lending credibilit y to the conclusions dra wn from both simulation studies and the real m yocardial infarction data analysis. 8 Discussion 8.1 Comparison with existing metho ds Our Ba y esian nonparametric approach differs fundamen tally from standard GW AS metho ds in m ul- tiple asp ects. T raditional GW AS typically emplo y logistic regression mo dels that examine disease status conditional on genotypes ( P ( Y | X )). In contrast, our approach mo dels genot yp es condi- tional on disease status ( P ( X | Y )), which provides a differen t p ersp ective on genetic asso ciations. This inv ersion of the mo deling relationship allows us to directly address p opulation stratification 18 through mixture mo dels while naturally accommo dating complex dep endence structures among genetic v arian ts. F or handling population structure, traditional metho ds often rely on principal comp onen t analy- sis (PCA) for adjustment or conduct stratified analyses by presumed ancestry groups. Our approach uses Dirichlet pro cess mixtures to nonparametrically mo del p opulation substructure, allowing the n umber and characteristics of sub-p opulations to b e inferred from the data rather than pre-sp ecified. This can pro vide more accurate adjustmen t for p opulation stratification, particularly in admixed p opulations where discrete ancestry categories may not adequately represen t the con tinuum of genetic bac kgrounds. In mo deling in teractions, traditional approaches t ypically use regression co efficients for sp ecific in teraction terms (e.g., pro duct terms in logistic regression). Our HDP approac h represen ts in- teractions through hierarc hical sharing of parameters, capturing dependencies without requiring sp ecification of particular in teraction forms. This allo ws us to disco ver complex interaction pat- terns that migh t not corresp ond to simple product terms, p otentially revealing ric her biological relationships. Compared to our previous matrix-normal based mo dels (Bhattachary a and Bhattachary a, 2020), the HDP mo del offers several adv an tages: sub ject-sp ecific rather than population-wide gene-gene in teractions, nonparametric rather than parametric dependence on environmen t, and more in ter- pretable sharing structures through the Chinese restauran t pro cess analogy . Uncertain ty quan tification differs substantially betw een the approac hes. T raditional metho ds pro vide confidence in terv als based on frequen tist inference, while our fully Ba y esian approach yields p osterior distributions for all quan tities of in terest. This includes not only p oint estimates but also complete c haracterizations of uncertaint y ab out p opulation structure, interaction patterns, and mo del complexit y . The Bay esian approac h naturally incorp orates multiple testing considerations through prior distributions and posterior probabilities, a voiding the need for arbitrary significance thresholds or correction pro cedures. Computational scaling presents different c hallenges. T raditional pairwise testing scales as O ( p 2 ) where p is the n umber of SNPs, b ecoming prohibitive for genome-wide data. Our gene-lev el analysis scales as O ( J 2 ) where J is the num b er of genes, which is t ypically m uc h smaller than the n umber of SNPs. While our HDP mo del is computationally intensiv e p er test, the reduction in dimensionality through gene-level mo deling and efficien t parallel implementation makes it feasible for realistically sized datasets. T able 1: Comparison of metho dological approaches for genetic asso ciation studies. Our Bay esian nonparametric framework offers complementary strengths to traditional metho ds, particularly for complex in teraction analysis and uncertaint y quantification. Asp ect T raditional GW AS Our Bay esian Nonparametric Approac h Mo deling framew ork Logistic regression: P ( Y | X ) Conditional genot yp e: P ( X | Y ) P opulation structure PCA adjustment or stratified anal- ysis Diric hlet pro cess mixtures In teraction mo deling Regression co efficien ts Co v ariance structures Uncertain ty quantifica- tion Confidence in terv als P osterior distributions Computational scaling O ( p 2 ) for pairwise testing O ( J 2 ) for gene-lev el analysis Key adv an tage In terpretable effect sizes Flexible dep endence structures 19 Our approac h complements rather than replaces traditional methods in several wa ys. Re- searc hers might use logistic regression for initial screening of marginal associations, providing in- terpretable effect sizes for individual v ariants. Our Ba yesian nonparametric metho ds could then b e applied to selected genes or path wa ys to unco ver complex in teraction patterns and popula- tion structure that might modify or explain the marginal asso ciations. The findings from b oth approac hes could b e in tegrated through meta-analysis or hierarchical mo deling frameworks that com bine estimates from different methodological p ersp ectiv es. F or genome-wide disco very , traditional metho ds remain essen tial due to their computational efficiency and interpretabilit y . F or fo cused inv estigation of sp ecific biological pathw ays or complex traits where interactions are suspected, our approac h offers additional insigh ts that might b e missed b y marginal asso ciation testing alone. The tw o approac hes can also inform eac h other: findings from our interaction analyses migh t suggest sp ecific interaction terms to test in traditional mo dels, while significant marginal asso ciations from traditional analyses might guide the selection of genes for more detailed in vestigation with our metho ds. 8.2 Limitations and robustness considerations 8.2.1 Fixed mixture w eights The c hoice of fixed mixture weigh ts π m = 1 / M w arrants careful consideration, as it represents a departure from the more common approach of placing a Diric hlet prior on the mixture weigh ts. Our c hoice is empirically justified by findings from previous work, which demonstrated that fixed equal weigh ts outp erform Diric hlet priors in estimating the true num b er of mixture comp onents in finite mixture models with Diric hlet process priors. The Diric hlet prior tends to pro duce man y very small weigh ts, effectively underestimating the num b er of comp onents that con tribute meaningfully to the mixture. Fixed equal weigh ts av oid this shrink age tow ard fewer comp onents, allo wing b etter reco very of the true mixture structure. F rom a computational p ersp ective, fixed weigh ts simplify the Gibbs sampling up dates by elim- inating the need to sample the w eight parameters. This reduces computational complexity and impro ves mixing of the Mark ov chains, particularly in high-dimensional settings with man y mix- ture components. The simplification comes with the cost of assuming equal prior weigh t for all comp onen ts, but in practice, the p osterior distribution adapts to the data through the allo cation of sub jects to comp onen ts, effectiv ely learning the relativ e importance of different comp onents despite the equal prior w eights. Theoretical support for fixed w eights comes from results on p osterior consistency of Dirichlet pro cess mixture mo dels (Mukhopadhy ay and Bhattachary a, 2021). Under mild conditions, both random w eights (from a Diric hlet prior) and fixed equal weigh ts lead to the same asymptotic p osterior inference as the sample size increases. 8.2.2 Mo del complexit y and interpretabilit y While our HDP mo del offers substan tial flexibility in capturing complex genetic architectures, this flexibilit y comes with challenges related to mo del complexity and interpretabilit y . The three-level hierarc hical structure with Diric hlet pro cesses at eac h lev el inv olv es many parameters with complex dep endencies. The Chinese restaurant pro cess analogy helps with conceptual understanding, but detailed in terpretation of p osterior results requires careful examination of sharing patterns across m ultiple levels. Substan tial computational resources are required to fit the HDP model, particularly for datasets with large num b ers of genes or samples. Our parallel implemen tation helps address this challenge, 20 but users still need access to computing clusters or high-performance workstations. The MCMC algorithms require careful tuning and con vergence assessmen t, with runtimes measured in days for full analyses. While these computational demands are substan tial, they are b ecoming increasingly feasible with mo dern computing infrastructure and algorithmic impro vemen ts. Careful prior sp ecification is essen tial for obtaining stable and meaningful results from the HDP mo del. The precision parameters α G,ik , α G 0 ,k , α H and their regression co efficients require thoughtful sp ecification based on domain kno wledge or sensitivity analysis. Inappropriate prior c hoices can lead to po or mixing, conv ergence issues, or biased inference. W e pro vide default settings based on our experience with genetic data, but users should assess sensitivit y to these choices in their sp ecific applications. Exp ertise in Bay esian nonparametrics is v aluable for prop erly implemen ting and in terpreting the HDP mo del. Concepts suc h as Dirichlet pro cesses, Chinese restaurant pro cesses, hierarchical mo deling, and MCMC diagnostics ma y be unfamiliar to researc hers trained in traditional genetic epidemiology . T o improv e accessibility , we shall pro vide softw are with user-friendly in terfaces, detailed do cumentation, and tutorial examples. W e shall also offer workshops and training materials to help researc hers develop the necessary skills to apply these metho ds effectiv ely . Despite these challenges, the in terpretability of the HDP mo del can b e enhanced through appro- priate visualization and summary measures. Posterior distributions of key quantities–suc h as the n umber of mixture comp onents at eac h level, gene-gene correlation patterns for individual sub jects, or sharing probabilities b etw een cases and controls–can be presented in in tuitive graphical formats. Comparativ e analyses showing ho w results differ from simpler models can highlight the v alue added b y the additional complexity . Biological interpretation can b e facilitated by connecting statistical findings to kno wn pathw ays and functional annotations. 8.3 F uture directions Sev eral promising directions exist for extending and improving our Bay esian nonparametric frame- w ork for genetic in teraction analysis. First, high-dimensional extensions are needed to make the metho ds scalable to whole-genome data. Current implementations fo cus on gene-level analysis with selected genes, but applications to genome-wide data would require further computational enhancemen ts. Developing sparse versions of our mo dels that encourage parsimon y in in teraction structures could also impro ve scalabilit y while maintaining interpretabilit y . Second, integration with functional genomic data could enhance biological interpretation and statistical p o w er. Incorp orating information from gene expression, meth ylation, chromatin ac- cessibilit y , or other omics la yers could help prioritize genes and interactions based on functional relev ance. Hierarc hical mo dels that join tly analyze m ultiple data t yp es could reveal connections b et w een genetic v ariation, regulatory mechanisms, and phenot ypic outcomes. Suc h integrativ e ap- proac hes align with systems biology p ersp ectives that view biological pro cesses as interconnected net works rather than isolated components. Third, longitudinal mo deling approaches could capture dynamics in genetic and environmen tal factors ov er time. Man y complex diseases dev elop through pro cesses that unfold ov er years or decades, with genetic risk factors p otentially in teracting differen tly with environmen tal exp osures at differen t life stages. Extending our framew ork to handle repeated measures or time-to-ev ent data w ould enable in v estigation of ho w genetic interactions influence disease progression and timing. This could pro vide insigh ts in to critical p erio ds for interv ention and personalized risk prediction o ver the life course. F ourth, causal inference extensions could help distinguish correlation from causation in gene- en vironment in teractions. While observ ational data alone cannot establish causation, incorp orating 21 instrumen tal v ariables, Mendelian randomization principles, or structural equation modeling ap- proac hes could strengthen causal interpretations. Ba yesian methods are particularly well-suited for causal inference b ecause they naturally incorp orate uncertain t y ab out causal structures and mec hanisms. Dev eloping causal versions of our mo dels would enhance their utilit y for informing in terven tions and public health decisions. Additional future directions include extending the models to handle more complex p edigree or family data, incorporating spatial information for geographically structured p opulations, dev eloping online learning algorithms for streaming genetic data, and creating user-friendly soft w are pac k ages with graphical in terfaces for non-statisticians. Cross-disciplinary collab oration b et w een statisti- cians, geneticists, and biologists will b e essential for adv ancing these metho dological developmen ts while ensuring their relev ance to substan tive scientific questions. 8.4 Conclusion W e hav e synthesized a comprehensive Bay esian nonparametric framew ork for studying genetic inter- actions in case-con trol studies. Our HDP mo del represents a significan t adv ance ov er previous ap- proac hes by pro viding a flexible, nonparametric represen tation of gene-gene and gene-en vironment in teractions that accommo dates sub ject-sp ecific effects, population structure, and complex dep en- dence patterns. The three-lev el hierarch y with Dirich let pro cesses at each lev el creates a ric h dep endence structure that can capture ho w environmen tal factors mo dulate genetic interactions at m ultiple levels. Key inno v ations of our approach include: sub ject-sp ecific rather than p opulation-wide gene- gene interactions, allo wing for heterogeneit y in how genes in teract across individuals; environmen tal mo dulation of dep endence structures at multiple lev els (individual, gene-group, and case-control), pro viding a nuanced representation of gene-en vironment interactions; automatic determination of p opulation structure through Dirichlet pro cess mixtures, av oiding the need to pre-sp ecify the num- b er of sub-p opulations; and efficient computational implementation through parallel MCMC algo- rithms that lev erage the conditional indep endence structure of the mo del. Applications to my o cardial infarction data demonstrate how these metho ds can uncov er biolog- ical insigh ts not readily obtainable from standard approaches. W e identified significan t gene-gene in teractions with a protective effect, disco v ered sub ject-sp ecific patterns of genetic risk, and re- v ealed how sex mo difies genetic architecture. These findings illustrate the v alue of mo ving b eyond marginal asso ciation testing to consider interactions and p opulation structure in genetic epidemi- ology . The hierarc hical Dirichlet pro cess mo del represen ts a particularly flexible extension that accom- mo dates heterogeneous and context-dependent gene-en vironment in teractions. By allowing sharing patterns to v ary across individuals, genes, and groups in a data-driv en manner, this mo del can disco ver complex relationships that migh t b e missed by parametric approaches. The application to m yocardial infarction data rev ealed differences in genetic clustering patterns and latent subgroups with distinct risk profiles, suggesting p oten tial etiological heterogeneity . While our metho ds require substan tial computational resources and statistical exp ertise, we b eliev e these barriers are diminishing as computing p o w er increases and statistical training im- pro ves. By making our implemen tation publicly av ailable and providing educational resources, w e hop e to facilitate wider adoption of Ba yesian nonparametric metho ds in genetic epidemiology . The increasing recognition of complexit y in genetic architectures–with interactions, heterogeneity , and con text-dep endence pla ying imp ortan t roles–creates a growing need for metho dological approaches that can address this complexit y in a principled manner. F uture metho dological developmen ts should fo cus on impro ving scalabilit y , integrating m ultiple 22 data types, extending to longitudinal settings, and strengthening causal interpretations. Substan- tiv e applications should explore diverse complex diseases and populations, p otentially revealing new biological insigh ts and therapeutic targets. As genetic data con tin ue to gro w in size and complexit y , Ba yesian nonparametric metho ds offer a flexible framework for disco very that resp ects the inheren t uncertain ty and in terdep endence in biological systems. Ac kno wledgmen ts W e thank the reviewer for constructive comments that improv ed this man uscript. W e also thank DeepSeek for some pro ofreading. Conflict of in terest The authors declare no conflicts of in terest. References Bhattac harjee, S., W ang, Z., Ciampa, J., Kraft, P ., Chano ck, S., Y u, K., and Chatterjee, N. (2010). Using Principal Comp onents of Genetic Variation for Robust and Pow erful Detection of Gene- Gene In teractions in Case-Con trol and Case-Only Studies. The Americ an Journal of Human Genetics , 86 , 331–342. Bhattac harya, D. and Bhattachary a, S. (2018). A Bay esian Semiparametric Approach to learning Ab out Gene-Gene Interactions in Case-Con trol Studies. Journal of Applie d Statistics , 45 , 1–23. Bhattac harya, D. and Bhattachary a, S. (2020). Effects of Gene-Eevironment and Gene-Gene Inter- actions in Case-Control Studies: A Nov el Ba y esian Semiparametric Approach. Br azilian Journal of Pr ob ability and Statistics , 34 , 71–89. Bhattac harya, D. and Bhattac harya, S. (2024). Gene-Gene and Gene-En vironment Interactions in Case-Con trol Studies Based on Hierarchies of Diric hlet Pro cesses. Statistics and Applic ations , 22 , 327–360. Dey , K. K. and Bhattac harya, S. (2016). On Geometric Ergo dicity of Additiv e and Multiplicative Transformation based Mark ov Chain Monte Carlo in High Dimensions. Br azilian Journal of Pr ob ability and Statistics , 30 , 570–613. Also av ailable at “http://arxiv.org/pdf/1312.0915.p df ”. Dutta, S. and Bhattac hary a, S. (2014). Mark ov Chain Monte Carlo Based on De- terministic T ransformations. Statistic al Metho dolo gy , 16 , 100–116. Also av ailable at h ttp://arxiv.org/abs/1106.5850. Supplement a v ailable at Larson, N. B. and Schaid, D. J. (2013). A Kernel Regression Approach to Gene-Gene In teraction Detection for Case-Con trol Studies. Genetic Epidemiolo gy , 37 , 695–703. Ma jumdar, A., Bhattachary a, S., Basu, A., and Ghosh, S. (2013). A Nov el Bay esian Semiparametric Algorithm for Inferring Population Structure and Adjusting for Case-control Association Tests. Biometrics , 69 , 164–173. 23 Mukhopadh ya y , S. and Bhattac hary a, S. (2021). Bay esian MISE Conv ergence Rates of Mixture Mo dels Based on the Poly a Urn Mo del: Asymptotic Comparisons and Choice of Prior Param- eters. Statistics:.A Journal of The or etic al nd Applie d Statistics , 55 , 120–151. Also av ailable at h P apaspiliop oulos, O. and Rob erts, G. O. (2008). Retrosp ective Marko v Chain Mon te Carlo metho ds for Diric hlet Pro ce sses Hierarc hical Mo dels. Biometrika , 95 , 169–186. T eh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet Pro cesses. Journal of the A meric an Statistic al Asso ciation , 101 , 1566–1581. W ang, X., Elston, R. C., and Zh u, X. (2010). The Meaning of In teraction. Human Her e dity , 70 , 269–277. 24
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment