Revisiting Multivariate Time Series Forecasting with Missing Values

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Missing values are common in real-world time series, and multivariate time series forecasting with missing values (MTSF-M) has become a crucial area of research for ensuring reliable predictions. To address the challenge of missing data, current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. However, this framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. In this paper, we conduct a systematic empirical study and reveal that imputation without direct supervision can corrupt the underlying data distribution and actively degrade prediction accuracy. To address this, we propose a paradigm shift that moves away from imputation and directly predicts from the partially observed time series. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle. CRIB combines a unified-variate attention mechanism with a consistency regularization scheme to learn robust representations that filter out noise introduced by missing values while preserving essential predictive signals. Comprehensive experiments on four real-world datasets demonstrate the effectiveness of CRIB, which predicts accurately even under high missing rates. Our code is available at https://github.com/Muyiiiii/CRIB.


💡 Research Summary

The paper tackles the pervasive problem of missing values in multivariate time‑series forecasting (MTSF‑M). While most recent works follow a two‑stage “impute‑then‑predict” pipeline—or an end‑to‑end variant that progressively imputes during encoding—these approaches fundamentally suffer from the lack of ground truth for the missing entries. The authors empirically demonstrate that, when the imputation module is trained only with the downstream forecasting loss, it often produces values that deviate substantially from the true data distribution and fails to recover the original inter‑variate correlations. Visualizations (t‑SNE, correlation maps) on the PEMS‑BAY dataset with 40 % missing data reveal that the imputed data form clusters far from the original, and that the subsequent forecasts are less accurate than those of a model that directly consumes the partially observed series.

Motivated by these findings, the authors propose a paradigm shift: predict directly from the partially observed series without any imputation. Their solution, Consistency‑Regularized Information Bottleneck (CRIB), is built on the Information Bottleneck (IB) principle, which seeks a latent representation Z that minimizes mutual information with the input I(Z;Xₒ) (compressing away noise) while maximizing mutual information with the target I(Y;Z) (preserving predictive signal). The trade‑off is controlled by a Lagrange multiplier β.
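In standard IB notation, the trade‑off described above corresponds to minimizing a Lagrangian over the two mutual‑information terms. The following is one common formulation; the paper's exact sign convention and parameterization may differ:

```latex
\min_{\theta} \; \mathcal{L}_{\mathrm{IB}}
  \;=\; I(Z; X_{o}) \;-\; \beta \, I(Y; Z)
```

Here a larger β places more weight on preserving predictive information I(Y; Z), while the I(Z; Xₒ) term compresses the representation, discarding input detail—including the noise induced by missing entries.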

CRIB consists of four key components:

  1. Patching Embedding – The raw series Xₒ is split into non‑overlapping patches of length P. A Temporal Convolutional Network (TCN) processes each patch, producing dense local features H. This reduces the sequence length from T to T/P, cutting the quadratic attention cost by a factor of P² while enriching semantic information.

  2. Unified‑Variate Attention – All patch tokens (across all variates) are flattened and fed into a standard Q‑K‑V self‑attention block. Unlike prior works that separate intra‑ and inter‑variate attention, this unified mechanism can learn any correlation pattern without structural bias, which is crucial when missing values disrupt the usual correlation structure.

  3. Consistency Regularization – To enforce robustness to different missing‑mask patterns, the authors generate an augmented view Xᵃᵘᵍ of the same series via random mask perturbations and Gaussian noise. Both Xₒ and Xᵃᵘᵍ pass through the same encoder, yielding Z and Zᵃᵘᵍ. A consistency loss (e.g., L2 or cosine distance) aligns these representations, encouraging the encoder to focus on mask‑invariant features, especially under high missing rates.

  4. Simple Prediction Head – A two‑layer MLP maps Z to the future horizon S. By keeping the predictor simple, the authors attribute performance gains to the quality of the IB‑guided representations rather than to a powerful decoder.
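The four components above can be sketched end to end. The following is a minimal NumPy illustration, not the authors' implementation: the TCN is replaced by a single linear patch projection, the attention block omits learned Q/K/V weights, and all function names (`patch_embed`, `unified_attention`, `predict`) and shapes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_embed(x, P, W):
    # x: (C, T) multivariate series with NaNs marking missing values.
    # Simplified stand-in for the TCN: a linear projection of each length-P patch.
    C, T = x.shape
    x = np.nan_to_num(x)                 # consume the partially observed series; no imputation module
    patches = x.reshape(C, T // P, P)    # (C, N, P) non-overlapping patches, N = T / P
    return patches @ W                   # (C, N, d) dense local features H

def unified_attention(H):
    # Flatten all patch tokens across all variates into ONE token sequence,
    # then apply softmax(QK^T / sqrt(d)) V with identity Q/K/V projections.
    C, N, d = H.shape
    tokens = H.reshape(C * N, d)
    scores = tokens @ tokens.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ tokens).reshape(C, N, d)

def predict(Z, W1, W2):
    # Simple two-layer MLP head mapping representation Z to an S-step horizon.
    h = np.maximum(Z.reshape(Z.shape[0], -1) @ W1, 0.0)  # ReLU
    return h @ W2                                        # (C, S)

C, T, P, d, S = 3, 16, 4, 8, 6
x = rng.standard_normal((C, T))
x[rng.random((C, T)) < 0.4] = np.nan     # 40% missing, as in the PEMS-BAY study
W = rng.standard_normal((P, d)) * 0.1
W1 = rng.standard_normal(((T // P) * d, 16)) * 0.1
W2 = rng.standard_normal((16, S)) * 0.1

H = patch_embed(x, P, W)          # (3, 4, 8): T/P = 4 tokens per variate
Z = unified_attention(H)          # attention jointly over all 3 * 4 = 12 tokens
y_hat = predict(Z, W1, W2)
print(y_hat.shape)                # (3, 6)
```

Note how the unified attention operates on the full 12-token sequence, so a token from one variate can attend to any patch of any other variate—exactly the absence of structural bias the summary describes.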

Training approximates the IB objectives via variational bounds: the compactness term I(Z;Xₒ) is upper‑bounded by a KL divergence between the encoder’s posterior pθ(Z|Xₒ) and an isotropic Gaussian prior, while the informativeness term I(Y;Z) is linked to the forecasting loss (e.g., MSE). The overall loss combines the KL term, the forecasting loss, and the consistency regularizer, weighted by β and a consistency coefficient λ.
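The combined objective can be written down compactly. Below is a hedged NumPy sketch of that loss under the usual closed-form KL between a diagonal Gaussian posterior and a standard-normal prior; the function names (`gaussian_kl`, `crib_loss`) and the L2 choice for the consistency term are illustrative assumptions, not the paper's exact code.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ): the variational upper bound on I(Z; X_o).
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def crib_loss(y, y_hat, mu, log_var, z, z_aug, beta, lam):
    forecast = np.mean((y - y_hat) ** 2)      # informativeness term, linked to I(Y; Z)
    compact = gaussian_kl(mu, log_var)        # compactness term, bounding I(Z; X_o)
    consistency = np.mean((z - z_aug) ** 2)   # align representations of the two views
    return forecast + beta * compact + lam * consistency

# Toy check: with mu = 0 and log_var = 0 the KL term vanishes, identical views
# contribute no consistency penalty, and a perfect forecast gives zero total loss.
y = np.array([1.0, 2.0]); y_hat = np.array([1.0, 2.0])
mu = np.zeros(4); log_var = np.zeros(4)
z = np.ones(4); z_aug = np.ones(4)
print(crib_loss(y, y_hat, mu, log_var, z, z_aug, beta=1e-3, lam=0.1))  # 0.0
```

In this reading, β trades forecasting accuracy against compression and λ controls how strongly the encoder is pushed toward mask-invariant features.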

Extensive experiments on four real‑world benchmarks (PEMS‑BAY, traffic flow, electricity consumption, weather) cover missing rates from 10 % to 70 %. CRIB consistently outperforms state‑of‑the‑art impute‑then‑predict baselines (e.g., TimesNet + TimeXer, BiTGraph) by an average of 18 % reduction in MAE/RMSE, with the gap widening at higher missing ratios. Ablation studies confirm that each component—patching, unified attention, consistency regularization, and the IB loss—contributes positively; removing any leads to noticeable performance drops. Sensitivity analyses on patch size P, β, and λ illustrate stable behavior across a broad hyperparameter range.

In summary, the paper provides compelling evidence that imputation is not a prerequisite for accurate multivariate time‑series forecasting under missingness. By leveraging the Information Bottleneck to learn compact, noise‑robust representations and enforcing mask‑invariant consistency, CRIB achieves superior accuracy while simplifying the modeling pipeline. This work opens a new direction for handling incomplete sequential data in domains such as traffic prediction, energy demand forecasting, and meteorology, where missing observations are the norm rather than the exception.
