Introns and Templates Matter: Rethinking Linkage in GP-GOMEA
GP-GOMEA is among the state-of-the-art methods for symbolic regression, especially when it comes to finding small and potentially interpretable solutions. A key mechanism employed in any GOMEA variant is the exploitation of linkage, the dependencies between variables, to ensure efficient evolution. In GP-GOMEA, mutual information between node positions in GP trees has so far been used to learn linkage. For this, a fixed expression template is used. This, however, leads to introns for expressions smaller than the full template. As introns have no impact on fitness, their occurrences are not directly linked to selection. Consequently, introns can adversely affect the extent to which mutual information captures dependencies between tree nodes. To overcome this, we propose two new measures for linkage learning: one that explicitly considers introns in mutual information estimates, and one that revisits linkage learning in GP-GOMEA from a grey-box perspective, yielding a measure that need not be learned from the population but is derived directly from the template. Across five standard symbolic regression problems, GP-GOMEA achieves substantial improvements using both measures. We also find that the newly learned linkage structure closely reflects the template linkage structure, and that explicitly using the template structure yields the best performance overall.
💡 Research Summary
This paper investigates a subtle yet impactful limitation of GP-GOMEA, a state-of-the-art genetic programming algorithm for symbolic regression that excels at discovering compact, interpretable expressions. GP-GOMEA relies on a fixed tree template that maps each position in a fixed-length genotype string to a specific node in a GP tree. Linkage learning, that is, identifying dependencies between decision variables, is performed each generation by estimating pairwise mutual information (MI) among the genotype positions and constructing a linkage tree (a Family of Subsets, FOS) via hierarchical clustering. The GOM (Gene-pool Optimal Mixing) operator then uses the FOS to exchange groups of linked variables between individuals, so that building blocks tend to be preserved during variation.
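The linkage-learning step described above can be illustrated with a minimal sketch. It assumes a plug-in (frequency-count) MI estimate and average-linkage clustering; the function names, the toy population, and the decision to keep the root cluster in the FOS are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from itertools import combinations
from math import log


def entropy(symbols):
    """Shannon entropy (in nats) of a list of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * log(c / n) for c in Counter(symbols).values())


def mutual_information(population, i, j):
    """Plug-in MI estimate between genotype positions i and j."""
    xi = [g[i] for g in population]
    xj = [g[j] for g in population]
    return entropy(xi) + entropy(xj) - entropy(list(zip(xi, xj)))


def linkage_tree(similarity):
    """Average-linkage agglomerative clustering over a similarity matrix.

    Returns every cluster ever formed (singletons first, root last);
    this collection plays the role of the FOS.
    """
    clusters = [(i,) for i in range(len(similarity))]
    fos = list(clusters)

    def avg_sim(a, b):
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the two clusters with the highest average pairwise similarity.
        a, b = max(combinations(clusters, 2), key=lambda p: avg_sim(*p))
        clusters = [c for c in clusters if c not in (a, b)]
        merged = tuple(sorted(a + b))
        clusters.append(merged)
        fos.append(merged)
    return fos


# Toy population (hypothetical): positions 0 and 1 always change together,
# position 2 varies independently of both.
population = [("+", "x", "a"), ("+", "x", "b"), ("*", "y", "a"), ("*", "y", "b")]
n = len(population[0])
mi = [[mutual_information(population, i, j) for j in range(n)] for i in range(n)]
fos = linkage_tree(mi)
print(fos)
```

In this toy case the strongly dependent positions 0 and 1 are merged first, so a mixing step that copies the subset (0, 1) from a donor would move both symbols together.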
The authors point out that the fixed template inevitably creates conditionally inactive variables, known as introns, whenever a solution does not occupy the full template depth. Introns have no effect on the phenotype and thus receive no selection pressure; however, they are still present in the genotype and contribute random noise to the MI estimates. Consequently, the statistical signal that reflects true functional dependencies between active nodes is diluted, potentially leading to sub‑optimal linkage structures and poorer search performance.
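To make the notion of an intron concrete, the sketch below marks which template positions are actually reached by an expression. It assumes a full binary template stored in heap order (children of node i at 2i+1 and 2i+2) and a small hypothetical operator set; the real GP-GOMEA encoding and node ordering may differ.

```python
# Hypothetical operator set with arities; terminals have arity 0.
ARITY = {"+": 2, "*": 2, "sin": 1, "x": 0, "1": 0}


def active_mask(genotype, arity=ARITY):
    """Return a list of booleans: True where the template node is expressed.

    A node is active if it is the root, or if its parent is active and the
    parent's arity is large enough to use this child slot.
    """
    n = len(genotype)
    active = [False] * n
    active[0] = True
    for i in range(n):
        if not active[i]:
            continue
        for k in range(arity[genotype[i]]):
            child = 2 * i + 1 + k
            if child < n:
                active[child] = True
    return active


# Template of height 2 (7 nodes). The encoded expression is x + sin(x).
genotype = ["+", "x", "sin", "1", "x", "x", "1"]
mask = active_mask(genotype)
introns = [i for i, a in enumerate(mask) if not a]
print(mask)
print(introns)
```

Here positions 3 and 4 sit below a terminal and position 6 is the unused second argument of the unary `sin`, so all three are introns: changing their symbols leaves the expression, and hence the fitness, unchanged.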
To address this, two novel linkage‑measurement strategies are proposed. The first, “masked MI,” explicitly marks inactive variables with a special label (e.g., “masked”) before entropy and MI calculations. By treating all introns as a single distinct symbol, the MI computation becomes based solely on the distribution of active variables, eliminating the noise introduced by arbitrary intron values. This approach improves the fidelity of the learned linkage tree but is incompatible with the bias‑correction technique previously introduced for handling non‑uniform initialization, because a variable that is inactive in the entire population would cause division‑by‑zero issues.
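The masking idea can be sketched as follows, under the assumption that an activity mask is available for every individual. The sentinel name `MASKED`, the helper names, and the toy data are hypothetical; the estimator shown is a plain frequency-count MI, not necessarily the one used by the authors.

```python
from collections import Counter
from math import log

MASKED = "<masked>"  # hypothetical sentinel shared by all inactive positions


def entropy(symbols):
    """Shannon entropy (in nats) of a list of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * log(c / n) for c in Counter(symbols).values())


def mutual_information(population, i, j):
    """Plug-in MI estimate between genotype positions i and j."""
    xi = [g[i] for g in population]
    xj = [g[j] for g in population]
    return entropy(xi) + entropy(xj) - entropy(list(zip(xi, xj)))


def mask_introns(population, active_masks):
    """Replace the symbol at every inactive position with the sentinel."""
    return [
        tuple(s if a else MASKED for s, a in zip(genotype, mask))
        for genotype, mask in zip(population, active_masks)
    ]


# Toy case: position 0 is a terminal root, so position 1 is never expressed.
# Its leftover symbols happen to correlate with the root by chance.
population = [("x", "+"), ("x", "+"), ("1", "*"), ("1", "*")]
active_masks = [(True, False)] * len(population)

raw_mi = mutual_information(population, 0, 1)
masked_mi = mutual_information(mask_introns(population, active_masks), 0, 1)
print(raw_mi, masked_mi)
```

Without masking, the estimate reports a strong dependency between the root and a position that never influences fitness; with masking, that spurious signal vanishes because the inactive position carries a single constant symbol.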
The second strategy leverages the known template structure to construct a weighted Variable Interaction Graph (wVIG). Because the template defines a nested functional composition, the authors can analytically determine which variables co‑occur within the same sub‑function. They assign a weight to each pair of variables proportional to the number of shared sub‑functions (or, equivalently, the depth of their common ancestor in the template). This weight matrix directly serves as a similarity matrix for hierarchical clustering, producing a linkage tree without any statistical estimation from the population. In effect, the template itself provides a “grey‑box” model of variable interactions, removing the need for data‑driven learning of linkage.
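A minimal sketch of such a template-derived weight matrix is given below. It assumes a full binary template in heap order and takes the weight of a pair to be the number of template subtrees that contain both nodes, which equals the depth of their lowest common ancestor plus one. This weighting is one reading of the description above and may not match the paper's exact definition.

```python
def ancestors(i):
    """Indices on the path from node i up to the root, inclusive (heap order)."""
    path = [i]
    while i > 0:
        i = (i - 1) // 2
        path.append(i)
    return path


def template_wvig(n_nodes):
    """Weight matrix derived from the template alone (no population needed).

    w[i][j] counts the subtrees (identified by their root) that contain both
    node i and node j; deeper shared ancestors therefore give larger weights.
    """
    w = [[0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        anc_i = set(ancestors(i))
        for j in range(i + 1, n_nodes):
            shared = len(anc_i & set(ancestors(j)))
            w[i][j] = w[j][i] = shared
    return w


# Template of height 2: node 0 is the root, nodes 3..6 are the leaves.
w = template_wvig(7)
for row in w:
    print(row)
```

Sibling leaves such as 3 and 4 share two subtrees (rooted at 1 and at 0), whereas leaves from different halves of the template, such as 3 and 5, share only the full tree. Feeding this matrix to an agglomerative clustering routine in place of an MI matrix gives a linkage tree that is fixed for the whole run.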
The experimental evaluation uses five benchmark symbolic regression problems (including several Nguyen functions and Pagie‑1). For each benchmark, GP‑GOMEA is run under four linkage configurations: standard MI, bias‑corrected MI, masked MI, and template‑based wVIG, with 30 independent runs per configuration. Performance is measured in terms of final mean squared error and model size (node count). Both masked MI and wVIG achieve statistically significant improvements over the baseline methods. Notably, the wVIG approach consistently yields the best trade‑off between accuracy and compactness, often discovering the smallest expressions that still achieve the lowest error. Visual analysis of the resulting linkage trees shows that masked MI aligns closely with the true dependencies among active nodes, while the wVIG tree mirrors the original template structure, confirming that the template encodes meaningful prior knowledge about variable interactions.
The paper concludes that (1) introns, if left untreated, act as a source of noise that hampers linkage learning; (2) explicitly handling introns via masking restores the statistical quality of MI‑based linkage; (3) exploiting the template to build a deterministic weighted interaction graph provides an even more powerful, data‑free linkage model; and (4) incorporating such domain‑specific knowledge can substantially boost GP‑GOMEA’s ability to evolve small, high‑performing symbolic models. The authors suggest that these insights are broadly applicable to other evolutionary algorithms that employ fixed representations and that the approach is especially valuable in safety‑critical domains (e.g., healthcare, finance) where model interpretability and reliability are paramount.