Breaking the Correlation Plateau: On the Optimization and Capacity Limits of Attention-Based Regressors
Attention-based regression models are often trained by jointly optimizing a Mean Squared Error (MSE) loss and a Pearson correlation coefficient (PCC) loss, which emphasize the magnitude of errors and the order or shape of targets, respectively. A common but poorly understood phenomenon during training is the PCC plateau: PCC stops improving early in training, even as MSE continues to decrease. We provide the first rigorous theoretical analysis of this behavior, revealing fundamental limitations in both optimization dynamics and model capacity. First, regarding the flattened PCC curve, we uncover a critical conflict in which lowering MSE (magnitude matching) can paradoxically suppress the PCC gradient (shape matching). This issue is exacerbated by the softmax attention mechanism, particularly when the data to be aggregated is highly homogeneous. Second, we identify a limitation in model capacity: we derive a PCC improvement limit for any convex aggregator (including softmax attention), showing that the convex hull of the inputs strictly bounds the achievable PCC gain. We demonstrate that data homogeneity intensifies both limitations. Motivated by these insights, we propose Extrapolative Correlation Attention (ECA), which incorporates novel, theoretically motivated mechanisms to improve PCC optimization and extrapolate beyond the convex hull. Across diverse benchmarks, including challenging homogeneous data settings, ECA consistently breaks the PCC plateau, achieving significant improvements in correlation without compromising MSE performance.
💡 Research Summary
The paper investigates a puzzling phenomenon that frequently appears when training attention‑based regression models with a joint loss comprising Mean Squared Error (MSE) and Pearson Correlation Coefficient (PCC) terms: the PCC curve quickly flattens (the “PCC plateau”) while the MSE continues to decrease. The authors provide the first rigorous theoretical explanation, identifying two fundamental sources of the plateau.
First, they decompose the MSE into three components—mean mismatch, variance mismatch, and a weighted correlation term (Proposition 2.1). Because PCC is invariant to affine scaling (Lemma 2.1), minimizing MSE can largely be achieved by adjusting the prediction mean and standard deviation, leaving the correlation term essentially untouched. By deriving the gradients of both losses with respect to the pre‑softmax attention logits, they show that the two gradients share the same local factor (α_i wᵀ(h_i−v)) but differ in global scaling. The PCC gradient contains a factor 1/σ̂_y, which shrinks as the predicted standard deviation σ̂_y grows to match the target σ_y. Corollary 2.1 proves that the ratio of PCC‑to‑MSE gradient magnitude decays as O(1/σ̂_y^{3/2}). Empirically, σ̂_y indeed rises early in training, causing the PCC signal to become negligible compared with the MSE signal—this is the first bottleneck.
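The paper's exact statement of Proposition 2.1 is not reproduced here, but its three components match the classical identity MSE = (μ̂−μ)² + (σ̂−σ)² + 2σ̂σ(1−ρ), and Lemma 2.1's affine invariance is easy to confirm. A quick numerical check on synthetic data (not taken from the paper) illustrates both:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(2.0, 3.0, size=10_000)                    # targets
y_hat = 0.5 * y + rng.normal(1.0, 1.0, size=y.shape)     # noisy predictions

mse = np.mean((y_hat - y) ** 2)

# Population statistics (ddof=0) so the identity is exact.
mu_hat, mu = y_hat.mean(), y.mean()
sd_hat, sd = y_hat.std(), y.std()
rho = np.corrcoef(y_hat, y)[0, 1]

# Mean mismatch + variance mismatch + weighted correlation term.
decomposed = (mu_hat - mu) ** 2 + (sd_hat - sd) ** 2 + 2 * sd_hat * sd * (1 - rho)
assert np.isclose(mse, decomposed)

# PCC is invariant to positive affine rescaling of the predictions,
# so MSE can shrink via mean/std matching without changing rho.
assert np.isclose(np.corrcoef(3.0 * y_hat + 7.0, y)[0, 1], rho)
```

The check makes the conflict concrete: the first two terms can be driven to zero by adjusting μ̂ and σ̂ alone, while ρ is untouched by any such affine adjustment.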
Second, the authors analyze the effect of data homogeneity. The magnitude of the PCC gradient is proportional to the within‑sample dispersion σ_s (Corollary 2.2). When a sample's elements are highly similar, σ_s is tiny, the softmax attention weights become nearly uniform, and the shared local factor (α_i wᵀ(h_i−v)) collapses, further weakening the PCC gradient. Moreover, softmax attention is a convex combination of the input embeddings; Theorem 2.2 shows that any convex aggregator can only produce outputs inside the convex hull of the inputs. If the inputs are homogeneous, the hull radius R_s is small, which yields a strict upper bound on the achievable PCC improvement. This constitutes the second bottleneck—an intrinsic capacity limitation of convex attention.
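The hull bound is easy to see numerically: softmax weights are nonnegative and sum to one, so the output is a convex combination of the value vectors and therefore coordinate-wise bounded by them. A minimal check on synthetic homogeneous data (an illustration, not the paper's proof of Theorem 2.2):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 16, 8
H = 5.0 + 0.01 * rng.normal(size=(n, d))   # highly homogeneous value vectors
logits = rng.normal(size=n)

alpha = np.exp(logits) / np.exp(logits).sum()   # softmax: alpha >= 0, sum = 1
out = alpha @ H                                 # convex combination of the rows of H

# Hull membership implies coordinate-wise bounds by the inputs, so a tiny
# hull caps how far the aggregated output can move.
assert np.all(out >= H.min(axis=0) - 1e-12)
assert np.all(out <= H.max(axis=0) + 1e-12)
span = H.max(axis=0) - H.min(axis=0)  # tiny per-coordinate hull diameter
assert np.all(span < 0.2)
```

No choice of attention logits can escape this box, which is why the bound is a capacity limitation rather than an optimization issue.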
To overcome both bottlenecks, the authors propose Extrapolative Correlation Attention (ECA), which introduces three complementary mechanisms:
- Dispersion‑Normalized PCC Loss – rescales the PCC term by σ̂_y to counteract the 1/σ̂_y attenuation, keeping the correlation gradient sizable throughout training.
- Dispersion‑Aware Temperature Softmax – adapts the softmax temperature based on the within‑sample dispersion σ_s; lower temperature for homogeneous samples amplifies differences in logits, preventing attention collapse.
- Scaled Residual Aggregation – adds a scaled residual term to the standard attention output, allowing the aggregated representation to extrapolate beyond the convex hull of the inputs, thereby breaking the capacity ceiling.
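The paper's exact formulations are not reproduced here; the following NumPy sketch shows one plausible way the three mechanisms could fit together. All function names, the residual form, and the temperature schedule are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def eca_aggregate(H, logits, w_res, tau0=1.0, eps=1e-6):
    """Hypothetical sketch of an ECA-style forward pass.

    H: (n, d) value vectors; logits: (n,) pre-softmax scores;
    w_res: (d,) residual scale (learned in a real model).
    """
    # Dispersion-aware temperature: homogeneous samples (small sigma_s)
    # get a lower temperature, sharpening the attention distribution.
    sigma_s = H.std(axis=0).mean()
    tau = tau0 * max(sigma_s, eps)
    alpha = softmax(logits / tau)

    attn_out = alpha @ H  # convex part: stays inside the hull of H's rows
    # Scaled residual pushes the output along its deviation from the centroid,
    # allowing extrapolation beyond the convex hull.
    residual = attn_out - H.mean(axis=0)
    return attn_out + w_res * residual

def dispersion_normalized_pcc_loss(y_hat, y, eps=1e-6):
    """PCC loss rescaled by sigma_hat_y to offset the 1/sigma_hat_y attenuation."""
    rho = np.corrcoef(y_hat, y)[0, 1]
    return (y_hat.std() + eps) * (1.0 - rho)
```

The rescaling in the loss cancels the 1/σ̂_y factor identified in Corollary 2.1, and the temperature schedule counteracts the near-uniform attention that homogeneous inputs would otherwise induce.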
The paper validates ECA on eight UCI regression benchmarks, a video‑based sentiment analysis task, and a digital pathology dataset. Across all settings, ECA consistently surpasses vanilla softmax attention by 0.12–0.18 in PCC while maintaining or slightly improving MSE. Ablation studies confirm that each component independently contributes to plateau mitigation; in particular, the temperature adaptation is crucial for homogeneous samples where uniform attention would otherwise dominate.
In summary, the work clarifies why PCC plateaus arise: (i) MSE‑driven variance matching attenuates the PCC gradient, and (ii) convex attention limits the expressive space when inputs are homogeneous. By theoretically grounding these insights and designing ECA to address them, the authors deliver a practical solution that improves correlation learning without sacrificing predictive accuracy, offering a valuable contribution to any regression scenario where ranking or trend preservation is as important as absolute error minimization.