Learning Representations for Independence Testing


Many tools exist to detect dependence between random variables, a core question across a wide range of machine learning, statistical, and scientific endeavors. Although several statistical tests guarantee eventual detection of any dependence given enough samples, standard tests may require an exorbitant number of samples to detect subtle dependencies between high-dimensional random variables with complex distributions. In this work, we study two related ways to learn powerful independence tests. First, we show how to construct powerful statistical tests with finite-sample validity by using variational estimators of mutual information, such as the InfoNCE or NWJ estimators. Second, we establish a close connection between these variational mutual information-based tests and tests based on the Hilbert-Schmidt Independence Criterion (HSIC); in particular, learning a variational bound (typically parameterized by a deep network) for mutual information is closely related to learning a kernel for HSIC. Finally, rather than selecting a representation to maximize the statistic itself, we show how to select a representation which maximizes the power of a test, in either setting; we term the former case a Neural Dependency Statistic (NDS). While HSIC power optimization has recently been considered in the literature, we correct some important misconceptions and expand the approach to deep kernels. In our experiments, while all approaches can yield powerful tests with exact level control, optimized HSIC tests generally outperform the other approaches on difficult problems of detecting structured dependence.


💡 Research Summary

The paper tackles the fundamental statistical problem of testing independence between two random variables X and Y, a task that underlies many machine learning, scientific, and data‑analysis applications. Classical tests such as χ², Fisher’s exact test, or Pearson correlation work well for low‑dimensional, discrete or Gaussian data, but they either rely on strong parametric assumptions or become powerless when the data are high‑dimensional, continuous, and have complex dependence structures. The authors propose two learning‑based frameworks that construct powerful, finite‑sample valid independence tests by leveraging (i) variational lower bounds on mutual information (MI) and (ii) the Hilbert‑Schmidt Independence Criterion (HSIC).

Variational MI‑based tests (Neural Dependency Statistic, NDS).
The paper reviews a family of variational MI bounds such as InfoNCE, NWJ, DV, and MINE. Each bound is defined by a critic function f : X × Y → ℝ, typically parameterized by a deep neural network. During training, the bound Î_f is maximized on a training split, allowing f to learn representations that expose dependence. At test time, the authors observe that the second term of the bound (the log‑sum‑exp over negative samples) is invariant under permutations of Y, so the test statistic reduces to the simple average of f over the true paired samples:

  T̂ = (1/K) ∑_{i=1}^K f(x_i, y_i)

They call this the Neural Dependency Statistic (NDS). To obtain a level‑α test, they use a permutation test: randomly shuffle Y, recompute T̂ for each permutation, and set the rejection threshold at the (1‑α) quantile of the permuted values (including the original pairing). Because the critic is learned on a separate split, the permutation distribution is valid under the null hypothesis. The authors derive the asymptotic distribution of T̂ under both null and alternative, showing that test power is driven by the signal‑to‑noise ratio (SNR) (T_H1 − T_H0)/τ_H1, where T_H0 and T_H1 are the population expectations of f under independence and dependence, respectively, and τ_H1 is the standard deviation under the alternative. They provide an explicit asymptotic power formula and verify empirically that it matches finite‑sample power.
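The permutation-test recipe described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: `f` stands for a critic that was already trained on a *separate* data split, and it is applied pairwise for clarity rather than batched.

```python
import numpy as np

def nds_permutation_test(f, x, y, alpha=0.05, n_perms=500, seed=0):
    """Level-alpha permutation test for the Neural Dependency Statistic.

    f : critic f(x_i, y_i) -> scalar score, trained on a separate split
        (fitting f on the test data would invalidate the null distribution).
    """
    rng = np.random.default_rng(seed)
    k = len(x)
    # T-hat: average critic score over the true paired samples
    stat = np.mean([f(x[i], y[i]) for i in range(k)])
    perm_stats = []
    for _ in range(n_perms):
        pi = rng.permutation(k)  # shuffle Y only, breaking the pairing
        perm_stats.append(np.mean([f(x[i], y[pi[i]]) for i in range(k)]))
    # include the observed pairing so the test is exactly level alpha
    all_stats = np.append(perm_stats, stat)
    p_value = np.mean(all_stats >= stat)
    return stat, p_value, p_value <= alpha
```

Under independence, the observed pairing is exchangeable with the permuted ones, which is what gives the exact type-I error control mentioned above.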

HSIC‑based tests with kernel learning.
HSIC measures the squared cross‑covariance between feature maps of X and Y in reproducing kernel Hilbert spaces (RKHS). With a universal kernel, HSIC equals zero iff X ⟂ Y. Traditional HSIC tests use a fixed kernel (e.g., Gaussian with median bandwidth) and suffer when the kernel bandwidth is mismatched to the data scale. The authors propose to directly optimize test power by learning the kernel (or a deep kernel network) on a training split. They estimate the HSIC statistic’s mean and variance under both null and alternative, then maximize the ratio of mean difference to standard deviation—essentially the SNR of HSIC. This objective is differentiable, allowing gradient‑based learning of kernel parameters or deep networks that map raw data to a representation where HSIC is most discriminative. A uniform convergence argument guarantees that the learned representation generalizes to unseen data. After training, a permutation test (or a data‑splitting test with a calibrated threshold) yields an exact level‑α test.
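For reference, the fixed-kernel baseline that the learned kernels improve on can be estimated as follows. This is a minimal sketch using the biased V-statistic form of HSIC with Gaussian kernels and hand-picked bandwidths standing in for the learned deep kernel; inputs are arrays of shape (n, d).

```python
import numpy as np

def gaussian_gram(z, bandwidth):
    """Gram matrix K[i, j] = exp(-||z_i - z_j||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def hsic_biased(x, y, bw_x=1.0, bw_y=1.0):
    """Biased V-statistic estimate of HSIC: (1/n^2) tr(K H L H).

    x, y : paired samples of shape (n, d_x) and (n, d_y).
    Larger values indicate dependence; calibration (e.g. a permutation
    test, as in the paper) is still needed to turn this into a test.
    """
    n = x.shape[0]
    K = gaussian_gram(x, bw_x)
    L = gaussian_gram(y, bw_y)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H)) / n ** 2
```

The learned-kernel version replaces these fixed Gaussian kernels with kernels on deep representations and tunes their parameters to maximize the estimated SNR of the test rather than the statistic itself.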

Connections between the two approaches.
The authors prove that HSIC is a lower bound on MI; consequently, learning a variational MI bound and learning a kernel for HSIC are mathematically linked. However, maximizing the MI bound does not necessarily maximize the SNR of the resulting test statistic, whereas the HSIC power‑optimization explicitly targets SNR. This distinction explains why, in many experiments, the HSIC‑optimized test outperforms the NDS test, especially in high‑dimensional or structured dependence scenarios.
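To make the distinction concrete, here is an illustrative comparison (not taken from the paper) of the two kinds of objective, computed from a K×K matrix of critic scores f(x_i, y_j) whose diagonal holds the true pairs: the InfoNCE-style MI lower bound, and a simple SNR-style proxy of the form (T_H1 − T_H0)/τ_H1, with the off-diagonal entries standing in for the null. The paper's actual SNR estimator differs in its details.

```python
import numpy as np

def critic_objectives(scores):
    """Given scores[i, j] = f(x_i, y_j) with true pairs on the diagonal,
    return (InfoNCE-style bound, SNR-style power proxy)."""
    k = scores.shape[0]
    diag = np.diag(scores)
    # InfoNCE bound: E[f(x, y)] - E[log (1/K) sum_j exp f(x, y_j)]
    lse = np.logaddexp.reduce(scores, axis=1)  # row-wise log-sum-exp
    infonce = np.mean(diag - (lse - np.log(k)))
    # SNR proxy: (paired mean - unpaired mean) / paired std
    off_diag = scores[~np.eye(k, dtype=bool)]
    snr = (diag.mean() - off_diag.mean()) / (diag.std() + 1e-8)
    return infonce, snr
```

A critic can raise the MI bound while leaving the paired scores highly variable, which hurts the SNR; optimizing the two objectives therefore generally selects different representations.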

Experimental evaluation.
The paper evaluates several synthetic and real‑world tasks:

  • High‑dimensional Gaussian mixture (d = 4, 10, 15) – vanilla HSIC with median bandwidth fails as d grows, while the learned deep kernel (HSIC‑D) retains high power with far fewer samples.
  • “Decimal‑place swap” example – a subtle dependence that is invisible to Euclidean distance but becomes obvious when a representation extracts the swapped decimal digit. NDS learns such a representation and achieves strong power, whereas HSIC with generic kernels struggles.
  • Additional benchmarks (sinusoid, HDGM, RatInABox, image‑based data) – HSIC‑D consistently yields the highest empirical power, followed by MMD‑D and NDS.

The authors also compare against recent work (Ren et al., 2024) on HSIC power optimization, pointing out gaps in their theoretical justification and demonstrating that their own method corrects these issues.

Key contributions

  1. Construction of finite‑sample valid independence tests based on variational MI bounds, with a clear reduction to the Neural Dependency Statistic.
  2. Formal proof that HSIC is a lower bound on MI, establishing a theoretical bridge between MI‑based and kernel‑based testing.
  3. Introduction of a test‑power‑driven kernel learning objective for HSIC, including a differentiable estimator of the SNR and a uniform convergence guarantee.
  4. Comprehensive empirical study showing that HSIC‑power optimization generally dominates other methods on challenging, high‑dimensional dependence detection tasks.

Implications
By showing that learning representations specifically to maximize test power can dramatically reduce the sample complexity of independence testing, the paper opens a new design paradigm for statistical tests in modern high‑dimensional settings. Practitioners can now embed a learnable critic or kernel within a permutation‑test framework, obtain exact type‑I error control, and achieve substantially higher power than classical fixed‑kernel or fixed‑statistic approaches. This is especially valuable for scientific domains where subtle, structured dependencies (e.g., genomics, neuroscience, climate science) must be detected from limited data.

Overall, the work blends information‑theoretic bounds, kernel methods, and modern deep learning to deliver a principled, practical toolkit for independence testing that is both statistically rigorous and empirically powerful.

