Causal explanations of outliers in systems with lagged time-dependencies
Root-cause analysis in controlled, time-dependent systems poses a major challenge in applications. Energy systems are especially difficult to handle: they exhibit instantaneous as well as delayed effects and, if equipped with storage, have a memory. In this paper we adapt the causal root-cause analysis method of Budhathoki et al. [2022] to general time-dependent systems, since it can be regarded as a strictly causal definition of the term “root-cause”. In particular, we discuss two truncation approaches for handling the infinite dependency graphs present in time-dependent systems. While one leaves the causal mechanisms intact, the other approximates the mechanisms at the start nodes. The effectiveness of the different approaches is benchmarked using a challenging data generation process inspired by a problem in factory energy management: the avoidance of peaks in power consumption. We show that, given enough lags, our extension is able to localize the root-causes in the feature and time domain. Furthermore, the effect of mechanism approximation is discussed.
💡 Research Summary
The paper addresses the challenging problem of root‑cause analysis for outliers in systems that exhibit both instantaneous and delayed effects, a situation common in energy management where storage introduces memory. Building on the causal root‑cause analysis (CRCA) framework introduced by Budhathoki et al. (2022), the authors extend the method to handle infinite time‑lagged dependency graphs by imposing a finite maximum lag L and applying two distinct truncation strategies.
The first strategy, termed the “truncated‑L model,” retains the original causal mechanisms up to lag L while conditioning on the observed values of all parent variables whose lag exceeds L. This preserves the true functional relationships for the recent past and isolates the contribution of recent noise terms. The second strategy, the “non‑truncated‑L model,” approximates the mechanisms of the dangling parents (those beyond lag L) by fitting surrogate functions h_j,l that map the observed noisy parents to their values, thereby reconstructing a complete but approximate structural causal model (SCM). Both approaches enable the application of CRCA to a finite sub‑graph, but differ in how they treat the long‑range dependencies.
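To make the notion of dangling parents concrete, the following is a minimal sketch of unfolding a lagged dependency graph over a finite window of L lags. The edge list, the variable names (T, CL, B, P) and the function `unfold` are hypothetical illustrations, not the authors' implementation. Parents that fall beyond lag L are the ones that the first strategy would condition on and the second strategy would replace with surrogate mechanisms.

```python
from typing import List, Set, Tuple

Node = Tuple[str, int]        # (variable name, lag); lag 0 is the current time step
Edge = Tuple[str, str, int]   # (parent, child, delay in time steps)

def unfold(edges: List[Edge], max_lag: int):
    """Unfold a lagged summary graph over lags 0..max_lag.

    Returns the edges that stay inside the window and the sorted list of
    dangling parents, i.e. parent nodes whose lag exceeds max_lag.
    """
    internal: List[Tuple[Node, Node]] = []
    dangling: Set[Node] = set()
    for lag in range(max_lag + 1):
        for parent, child, delay in edges:
            src: Node = (parent, lag + delay)
            dst: Node = (child, lag)
            if src[1] <= max_lag:
                internal.append((src, dst))
            else:
                dangling.add(src)
    return internal, sorted(dangling)

# Hypothetical toy graph: temperature T drives cooling load CL, which drives
# grid power P; a battery state B depends on its own previous value.
toy_edges = [("T", "CL", 0), ("CL", "P", 0), ("B", "B", 1), ("B", "P", 0)]
internal_edges, dangling_parents = unfold(toy_edges, max_lag=2)
print(dangling_parents)
```

In this toy graph only the battery state has a delayed self-dependency, so it is the only variable that produces a dangling parent at the edge of the window.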
Root‑cause attribution is performed using an information‑theoretic calibration (IT‑Score) of an anomaly score g(·) (e.g., a z‑score) combined with a Shapley‑value decomposition. For each node‑lag pair (i,l) the method computes the probability q_t(I) that an outlier occurs when the noises of a subset I of nodes are replaced by draws from their nominal distributions, while the remaining noises are kept at their observed values. The contribution C_t(u|I) is defined as the log‑ratio of q_t with and without u added to the randomized set I, and the Shapley value ϕ_t(u) is the Shapley‑weighted average of C_t(u|I) over all subsets I that exclude u. The sum of all ϕ_t(u) equals the IT‑Score, so the anomaly’s information content is fully decomposed across variables and time steps.
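The decomposition can be illustrated on a toy model in which q(I) has a closed form. This sketch assumes a target Y that is the sum of independent standard-normal noises, the anomaly score g(y) = y, and the sign convention C(u|I) = log q(I) − log q(I ∪ {u}), chosen so that the contributions sum to the IT-Score −log q(all nodes). The model, the function names and the sign convention are assumptions made for illustration and are not taken from the paper.

```python
import math
from itertools import combinations

def tail_prob(randomized, noise_obs):
    """q(I) for the toy model Y = sum of independent N(0, 1) noises, g(y) = y.

    Noises in `randomized` are redrawn, all others stay at their observed values,
    so the event g(Y) >= g(y_obs) reduces to sum(N_i) >= sum(n_i) over i in I.
    """
    k = len(randomized)
    if k == 0:
        return 1.0  # nothing is randomized: the observed outlier is reproduced
    s = sum(noise_obs[i] for i in randomized)
    return 0.5 * math.erfc(s / math.sqrt(2.0 * k))  # Gaussian survival function

def shapley_attributions(noise_obs):
    """Exact Shapley values of the log tail probability (feasible for small n only)."""
    n = len(noise_obs)
    phi = [0.0] * n
    for u in range(n):
        others = [v for v in range(n) if v != u]
        for size in range(n):
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in combinations(others, size):
                gain = (math.log(tail_prob(subset, noise_obs))
                        - math.log(tail_prob(subset + (u,), noise_obs)))
                phi[u] += weight * gain
    return phi

# Hypothetical observed noises: node 1 carries an unusually large shock.
noise_obs = [0.2, 3.5, -0.1]
phi = shapley_attributions(noise_obs)
it_score = -math.log(tail_prob(tuple(range(len(noise_obs))), noise_obs))
print(phi, it_score)
```

Under these assumptions the attributions sum to the IT-Score by the efficiency property of Shapley values, and the node with the large shock receives the largest share. Individual attributions can be negative.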
Computationally, unfolding the SCM L times inflates the combinatorial space from n! to (L·n)!, making exact enumeration infeasible for large L. The authors mitigate this by exploiting the assumed independence of noise terms across time, which keeps memory requirements independent of L, and by employing Monte‑Carlo sampling to estimate q_t. Nevertheless, the non‑truncated‑L model incurs higher computational cost because it must also sample the approximated mechanisms of the dangling parents.
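A rough sketch of both points follows: the growth of the ordering count quoted in the text, and a plain Monte‑Carlo estimate of q(I). The helper names, the generic `score_fn` and `sample_noise` callables, and the toy sum-of-noises model in the example are assumptions for illustration; the paper's actual sampling scheme may differ.

```python
import math
import random

def ordering_count(n_nodes, max_lag):
    """Number of node orderings after unfolding, using the (L*n)! count from the text."""
    return math.factorial(max_lag * n_nodes)

def estimate_q(score_fn, noise_obs, randomized, sample_noise, n_samples=2000, seed=0):
    """Monte-Carlo estimate of q(I): the fraction of draws whose score is at least
    the observed score when the noises in `randomized` are redrawn and the rest
    are kept at their observed values."""
    rng = random.Random(seed)
    threshold = score_fn(noise_obs)
    hits = 0
    for _ in range(n_samples):
        noise = list(noise_obs)
        for i in randomized:
            noise[i] = sample_noise(i, rng)
        if score_fn(noise) >= threshold:
            hits += 1
    return hits / n_samples

# Toy example (assumed): the score is the sum of independent N(0, 1) noises.
noise_obs = [0.2, 3.5, -0.1]
draw = lambda i, rng: rng.gauss(0.0, 1.0)
print(ordering_count(3, 1), ordering_count(3, 5))
print(estimate_q(sum, noise_obs, [2], draw))
```

A caveat for any sampling-based variant: when the estimate is exactly zero, its logarithm is undefined, so extreme outliers need either more samples or some form of smoothing.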
The empirical evaluation uses a sophisticated data‑generating process (DGP) that mimics a manufacturing plant’s energy consumption. The DGP includes two tool parks, a battery storage system with a controller, a cooling system whose power demand depends on ambient temperature, and a temperature sensor that follows real‑world weather data. Power draw from the grid is the sum of tool‑park consumption, cooling load, battery activity, and Gaussian noise. Peaks above 1500 kW lasting at least two minutes are labeled as anomalies. Three distinct root‑cause injection scenarios are examined: (1) a temperature surge (setting T = 31 °C for ten minutes), (2) a cooling‑power surge (forcing CL = 265 kW), and (3) an increased cooling‑scale factor (raising the kW/°C coefficient). For each scenario, two simulations with identical seeds are run—one baseline and one with the injected fault—to provide ground‑truth causal pathways.
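The anomaly labeling rule described above can be sketched as follows, assuming the power series is sampled once per minute so that "at least two minutes" corresponds to at least two consecutive samples above the threshold. The sampling interval, the function name `label_peaks` and the example values are assumptions for illustration.

```python
from typing import List

def label_peaks(power_kw: List[float],
                threshold_kw: float = 1500.0,
                min_samples: int = 2) -> List[bool]:
    """Mark every sample that belongs to a run above `threshold_kw`
    lasting at least `min_samples` consecutive samples."""
    labels = [False] * len(power_kw)
    above = [p > threshold_kw for p in power_kw]
    start = None
    # The trailing False is a sentinel that closes a run reaching the end.
    for i, flag in enumerate(above + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_samples:
                for k in range(start, i):
                    labels[k] = True
            start = None
    return labels

# Hypothetical one-minute readings in kW: the single spike at index 1 is too
# short to count, while the three-sample run at indices 3..5 is labeled.
readings = [1400.0, 1600.0, 1400.0, 1550.0, 1600.0, 1700.0, 1200.0]
print(label_peaks(readings))
```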
Results show that when the maximum lag L is sufficiently large (e.g., L ≥ 5), both truncation approaches correctly identify the injected root cause in the feature‑time domain. The truncated‑L model, which preserves the original mechanisms, yields slightly higher attribution precision for variables with long lags (temperature, cooling), whereas the non‑truncated‑L model remains robust despite approximating the distant mechanisms. Variables with short lags, such as the battery controller, are detected reliably even with modest L values. The Shapley‑based attributions produce clear temporal signatures, allowing operators to pinpoint not only which component caused the peak but also when its influence manifested.
The study demonstrates that causal root‑cause analysis can be systematically extended to time‑dependent systems with memory, offering a principled alternative to correlation‑based or Granger‑causality methods that dominate the time‑series literature. By grounding explanations in counterfactual interventions on the SCM, the approach yields actionable insights for control‑oriented domains. The authors suggest future work on handling non‑Gaussian, non‑stationary noise, real‑time implementation via more efficient sampling, and integration with multi‑objective optimization for preventive control. Overall, the paper provides a solid methodological contribution and a compelling experimental validation that should interest researchers and practitioners dealing with complex, lag‑rich dynamical systems.