REML implementations of kernel-based genomic prediction models for genotype x environment x management interactions

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

High-throughput pheno-, geno-, and envirotyping allows characterization of plant genotypes and the trials they are evaluated in, producing different types of data. These different data modalities can be integrated into statistical or machine learning models for genomic prediction in several ways. One commonly used approach within the analysis of multi-environment trial data in plant breeding is to create linear or nonlinear kernels which are subsequently used in linear mixed models (LMMs) to model genotype by environment (G$\times$E) interactions. Current implementations of these kernel-based LMMs present a number of opportunities in terms of methodological extensions. Here we show how these models can be implemented in standard software, allowing direct restricted maximum likelihood (REML) estimation of all parameters. We also further extend the models by combining the kernels with unstructured covariance matrices for three-way interactions in genotype by environment by management (G$\times$E$\times$M) datasets, while simultaneously allowing for environment-specific genetic variances. We show how the models incorporating nonlinear kernels and heterogeneous variances maximize the amount of genetic variance captured by environmental covariables and perform best in prediction settings. We discuss the opportunities regarding models with multiple kernels or kernels obtained after environmental feature selection, as well as the similarities to models regressing phenotypes on latent and observed environmental covariables. Finally, we discuss the flexibility provided by our implementation in terms of modeling complex plant breeding datasets, allowing for straightforward integration of phenomics, enviromics, and genomics.


💡 Research Summary

The paper addresses the challenge of modeling complex genotype‑by‑environment‑by‑management (G × E × M) interactions in plant breeding by integrating high‑throughput phenotyping, genotyping, and enviromics data into kernel‑based linear mixed models (LMMs). Traditional implementations of kernel‑based LMMs for multi‑environment trials (MET) typically rely on a single genetic variance component and linear kernels, limiting their ability to capture nonlinear relationships and heterogeneous variances across environments or management practices.

The authors propose a comprehensive framework that (i) constructs both linear and nonlinear (Gaussian radial basis function) kernels from environmental covariates, (ii) combines these kernels with the genomic relationship matrix (K_G) using Kronecker (⊗) or Hadamard (∘) products to form a flexible genetic covariance structure, and (iii) allows environment‑specific (and management‑specific) genetic variances, thereby modeling heteroscedasticity. Crucially, the bandwidth parameter of the Gaussian kernel, which controls the degree of nonlinearity, is estimated directly by restricted maximum likelihood (REML) rather than by computationally intensive cross‑validation or Bayesian approaches.
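The three ingredients above can be sketched on toy data. This is a minimal illustration in plain Python, not the authors' code: the function names, the parameterization `exp(-h * d^2)` of the Gaussian kernel, and all numbers are assumptions made for the example, and the paper's exact scaling of the kernel may differ.

```python
import math

def gaussian_kernel(W, h):
    """K[i][j] = exp(-h * squared Euclidean distance between rows i and j of W)."""
    n = len(W)
    return [[math.exp(-h * sum((a - b) ** 2 for a, b in zip(W[i], W[j])))
             for j in range(n)] for i in range(n)]

def scale_by_variances(K, variances):
    """Return D K D with D = diag(sqrt(variances)), giving each environment its own variance."""
    s = [math.sqrt(v) for v in variances]
    n = len(K)
    return [[s[i] * K[i][j] * s[j] for j in range(n)] for i in range(n)]

def kronecker(A, B):
    """Kronecker product of two matrices stored as nested lists."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

# Hypothetical inputs: 3 environments described by 2 standardized covariates,
# and a 2 x 2 genomic relationship matrix.
W = [[0.0, 1.0], [0.5, 0.8], [2.0, -1.0]]
K_G = [[1.0, 0.3], [0.3, 1.0]]

K_E = gaussian_kernel(W, h=0.5)                     # unit diagonal, acts like a correlation matrix
Sigma_E = scale_by_variances(K_E, [1.0, 2.0, 0.5])  # environment-specific genetic variances
V_u = kronecker(Sigma_E, K_G)                       # 6 x 6 covariance of genotype-in-environment effects
```

Because the Gaussian kernel has ones on its diagonal, scaling it by `D` on both sides puts the environment-specific variances on the diagonal of `Sigma_E` while the kernel continues to determine the correlations between environments.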

Methodologically, the analysis proceeds in two stages. First, best linear unbiased estimates (BLUEs) are obtained for each genotype‑environment‑management (GEM) combination, producing a response vector y. Second, the LMM y = Xβ + Zu + ε is fitted to these BLUEs, where the random effect u has covariance V_u = Σ_M ⊗ Σ_E ⊗ K_G (or an analogous formulation with Hadamard products). Σ_E and Σ_M can be either unstructured correlation matrices or kernels derived from environmental and management features, respectively. The residual ε is assumed i.i.d., but the framework can be extended to heterogeneous residual variances if needed.
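Written out, the second-stage model and the criterion that REML maximizes take the standard form below. This is the textbook restricted log-likelihood up to an additive constant. It assumes the rows of u are ordered by management, then environment, then genotype, so that they match the Kronecker product. The symbol $\theta$ (all variance and kernel parameters) and $\sigma^2_\varepsilon$ (residual variance) are notation introduced here, not taken from the paper.

$$
y = X\beta + Zu + \varepsilon, \qquad
u \sim N\!\left(0,\; \Sigma_M \otimes \Sigma_E \otimes K_G\right), \qquad
\varepsilon \sim N\!\left(0,\; \sigma^2_\varepsilon I\right)
$$

$$
V(\theta) = Z\left(\Sigma_M \otimes \Sigma_E \otimes K_G\right)Z^\top + \sigma^2_\varepsilon I
$$

$$
\ell_R(\theta) = -\tfrac{1}{2}\left[\,\log\lvert V\rvert + \log\lvert X^\top V^{-1} X\rvert + y^\top P\, y\,\right],
\qquad
P = V^{-1} - V^{-1}X\left(X^\top V^{-1}X\right)^{-1}X^\top V^{-1}
$$

When $\Sigma_E$ is a Gaussian kernel, its bandwidth is simply one more element of $\theta$, which is why it can be estimated alongside the variance components.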

Implementation is carried out in R, extending existing packages such as rrBLUP and lme4. New REML routines handle the joint optimization of variance components and kernel bandwidths, making the approach accessible without specialized software. Four model variants are compared on real maize and wheat MET datasets: (1) a baseline additive model (ADD) with a single genetic variance, (2) factor‑analytic (FA) models that estimate genetic correlations from the data, (3) a single‑variance linear‑kernel model (SV‑LK), and (4) the proposed nonlinear‑kernel model with heterogeneous variances.
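To make the idea of choosing a kernel bandwidth by REML concrete, here is a small self-contained sketch. It is written in plain Python for illustration only and is not the R implementation described above. It assumes a toy model with an intercept, one Gaussian kernel, and i.i.d. residuals, and it replaces a proper REML optimizer with a coarse grid search. All function names, grids, and data are hypothetical.

```python
import math

def gaussian_kernel(W, h):
    """K[i][j] = exp(-h * squared Euclidean distance between rows i and j of W)."""
    n = len(W)
    return [[math.exp(-h * sum((a - b) ** 2 for a, b in zip(W[i], W[j])))
             for j in range(n)] for i in range(n)]

def cholesky(A):
    """Lower-triangular L with A = L L^T (A must be positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def reml_loglik(y, K, s2_u, s2_e):
    """Restricted log-likelihood (up to a constant) for y = 1*mu + u + e,
    with u ~ N(0, s2_u * K) and e ~ N(0, s2_e * I)."""
    n = len(y)
    V = [[s2_u * K[i][j] + (s2_e if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(V)
    log_det_V = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    Vinv_y = chol_solve(L, y)
    Vinv_1 = chol_solve(L, [1.0] * n)
    xtVx = sum(Vinv_1)   # 1' V^-1 1
    xtVy = sum(Vinv_y)   # 1' V^-1 y
    yPy = sum(a * b for a, b in zip(y, Vinv_y)) - xtVy ** 2 / xtVx
    return -0.5 * (log_det_V + math.log(xtVx) + yPy)

def fit_by_grid(y, W, h_grid, s2_grid):
    """Return (loglik, h, s2_u, s2_e) maximizing the REML criterion over the grid."""
    best = None
    for h in h_grid:
        K = gaussian_kernel(W, h)
        for s2_u in s2_grid:
            for s2_e in s2_grid:
                ll = reml_loglik(y, K, s2_u, s2_e)
                if best is None or ll > best[0]:
                    best = (ll, h, s2_u, s2_e)
    return best

# Hypothetical toy data: one covariate and one response value per unit.
W = [[-1.2], [-0.4], [0.1], [0.7], [1.5]]
y = [2.1, 2.9, 3.4, 3.2, 1.8]
h_grid = [0.1, 0.5, 1.0, 2.0]
s2_grid = [0.1, 0.5, 1.0]
ll, h_hat, s2_u_hat, s2_e_hat = fit_by_grid(y, W, h_grid, s2_grid)
```

A grid search is used here only because it is short and transparent. A practical implementation would pass the same criterion to a numerical optimizer or rely on the REML routines of mixed-model software.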

Results show that the nonlinear kernel combined with environment‑specific variances captures substantially more genetic variance linked to environmental covariates (an increase of roughly 10–15 percentage points) and improves cross‑validation prediction correlations from ~0.68 to ~0.75. The FA models are computationally efficient but less accurate than the nonlinear kernel approach. Importantly, the kernel‑based models can partition the total G × E interaction into a component explained by the environmental kernel and a “lack‑of‑fit” residual component, highlighting situations where additional environmental measurements would be beneficial.

In conclusion, the study delivers (a) a practical REML‑based implementation of kernel‑based LMMs that accommodates both linear and nonlinear kernels, (b) a method for estimating kernel bandwidths within the REML framework, and (c) a flexible covariance structure that allows heterogeneous genetic variances across environments and management regimes. This advances the state of the art in genomic prediction for complex breeding trials and opens avenues for further extensions such as multi‑kernel integration, feature selection, and Bayesian priors.

