C-kNN-LSH: A Nearest-Neighbor Algorithm for Sequential Counterfactual Inference
Estimating causal effects from longitudinal trajectories is central to understanding the progression of complex conditions, such as comorbidities and long COVID recovery, and to optimizing clinical decision-making. We introduce \emph{C-kNN-LSH}, a nearest-neighbor framework for sequential causal inference designed for such high-dimensional, confounded settings. By exploiting locality-sensitive hashing, it efficiently identifies ``clinical twins'' with similar covariate histories, enabling local estimation of conditional treatment effects across evolving disease states. To mitigate bias from irregular sampling and shifting patient recovery profiles, we integrate the neighborhood estimator with a doubly-robust correction. Theoretical analysis guarantees that our estimator is consistent and second-order robust to nuisance-estimation error. Evaluated on a real-world long COVID cohort of 13,511 participants, \emph{C-kNN-LSH} outperforms existing baselines in capturing recovery heterogeneity and estimating policy values.
💡 Research Summary
The paper introduces C‑kNN‑LSH, a novel nearest‑neighbor framework designed for sequential counterfactual inference in high‑dimensional, irregularly sampled longitudinal data such as that arising from long‑COVID studies. The authors identify three intertwined challenges: (1) the curse of dimensionality in patient histories that include static clinical variables, comorbidities, and dense wearable sensor streams; (2) time‑varying confounding where treatment timing (e.g., vaccination) depends on evolving health status that also predicts future outcomes; and (3) heterogeneous disease trajectories that can shift between distinct regimes (stable low severity, vaccine‑responsive improvement, persistent high severity).
To address these, C‑kNN‑LSH proceeds in three stages. First, it compresses each patient’s full history (H_{i,t}) into a low‑dimensional latent representation (Z_{i,t}). A large language model (LLM) (\Psi_{\text{LLM}}) encodes raw textual and structured data into a high‑capacity semantic embedding (E_{i,t}). A variational auto‑encoder (VAE) encoder (q_{\phi}) then maps (E_{i,t}) to a Gaussian latent vector ((\mu_{\phi},\sigma_{\phi})), yielding (Z_{i,t}). The training objective augments the standard ELBO with (i) a reconstruction term for the original history, (ii) a causal outcome prediction term, (iii) a KL‑regularizer, and (iv) a mutual‑information penalty (I(Z;A)) that discourages leakage of treatment information into the representation. This joint loss forces (Z) to act as an approximate sufficient statistic for both treatment assignment and potential outcomes, satisfying the authors’ “latent sufficiency” assumption.
Second, the method performs local non-parametric inference in the latent space. Using $p$-stable random projections, a family of locality-sensitive hash functions $h$ maps each $Z$ into multiple hash buckets. For a query state $Z_{i,t}$ and a target treatment $a$, the algorithm retrieves an approximate $k$-nearest-neighbor set $\mathcal{N}_k(i,t,a)$ among all past observations that received treatment $a$. The LSH-based ANN search runs in sub-linear time $O((NT)^{\rho}\log NT)$ with $\rho < 1$, making it scalable to hundreds of thousands of time points.
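A minimal sketch of the $p$-stable hashing step, $h(z) = \lfloor (a \cdot z + b)/w \rfloor$ with Gaussian (2-stable) projections, concatenated over several projections to form a bucket key. The bucket width `w`, the treatment-keyed single-table layout, and the exact-distance re-ranking are assumptions; a production index would use multiple tables to boost recall:

```python
import numpy as np
from collections import defaultdict

def make_pstable_hash(dim, n_hashes, w=4.0, seed=0):
    """p-stable LSH for Euclidean distance: h(z) = floor((a.z + b) / w)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n_hashes, dim))   # Gaussian = 2-stable projections
    b = rng.uniform(0.0, w, size=n_hashes)     # random offsets in [0, w)
    return lambda z: tuple(np.floor((A @ z + b) / w).astype(int))

def build_index(Z, treatments, h):
    """Bucket latent states keyed by (hash, treatment), mirroring N_k(i,t,a)."""
    index = defaultdict(list)
    for j, (z, a) in enumerate(zip(Z, treatments)):
        index[(h(z), a)].append(j)
    return index

def query_knn(z_query, a, k, Z, h, index):
    """Pull candidates from the query's bucket, then rank by exact distance."""
    candidates = list(index.get((h(z_query), a), []))
    candidates.sort(key=lambda j: np.linalg.norm(Z[j] - z_query))
    return candidates[:k]
```

Close latent states collide in the same bucket with high probability, so the exact distance computation only touches the candidate list rather than all $NT$ observations, which is the source of the sub-linear query time.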
Third, to correct for residual bias introduced by the representation or by imperfect matching, the authors apply a doubly-robust (DR) correction. Within each neighbor set they fit a local outcome model $\hat Q(Z,a)$ and a propensity model $\hat e(a \mid Z)$. The final estimator averages the outcome-model prediction plus an inverse-propensity-weighted residual over the neighbors:

$$\hat{Y}^{\mathrm{DR}}(i,t,a) \;=\; \frac{1}{k} \sum_{j \in \mathcal{N}_k(i,t,a)} \left[ \hat Q(Z_j, a) + \frac{Y_j - \hat Q(Z_j, a)}{\hat e(a \mid Z_j)} \right],$$

which remains consistent if either the outcome model or the propensity model is correctly specified.
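As a numeric sketch of the DR correction (assuming, per the retrieval step, that every neighbor in $\mathcal{N}_k(i,t,a)$ received the target treatment $a$):

```python
import numpy as np

def dr_neighborhood_estimate(y, q_hat, e_hat):
    """Doubly-robust estimate of the counterfactual outcome under treatment a,
    averaged over the k retrieved neighbors (all of whom received a):
    outcome-model prediction plus the propensity-weighted residual.
    """
    y, q_hat, e_hat = (np.asarray(v, dtype=float) for v in (y, q_hat, e_hat))
    return float(np.mean(q_hat + (y - q_hat) / e_hat))
```

If the outcome model is exact (`q_hat == y`), the residual term vanishes and the estimate reduces to the plain regression average regardless of the propensity estimates; conversely, a correct propensity model debiases an imperfect outcome model, which is the double-robustness property.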