Overcoming the Modality Gap in Context-Aided Forecasting
Context-aided forecasting (CAF) holds promise for integrating domain knowledge and forward-looking information, enabling AI systems to surpass traditional statistical methods. However, recent empirical studies reveal a puzzling gap: multimodal models often fail to outperform their unimodal counterparts. We hypothesize that this underperformance stems from poor context quality in existing datasets, where verifying that a context is genuinely informative is difficult. To address these limitations, we introduce a semi-synthetic data augmentation method that generates contexts both descriptive of temporal dynamics and verifiably complementary to numerical histories. This approach enables massive-scale dataset creation, resulting in CAF-7M, a corpus of 7 million context-augmented time series windows, including a rigorously verified test set. We demonstrate that semi-synthetic pre-training transfers effectively to real-world evaluation, and show clear evidence of context utilization. Our results suggest that dataset quality, rather than architectural limitations, has been the primary bottleneck in context-aided forecasting.
💡 Research Summary
The paper tackles a puzzling phenomenon in context‑aided forecasting (CAF): multimodal models that combine time‑series data with textual context often fail to outperform strong unimodal baselines. The authors argue that the root cause lies not in model architecture but in the quality of the contextual information used for training and evaluation. Existing CAF datasets are typically built by scraping web articles or using template‑generated texts, which either do not describe the dynamics of the target series, leak future information, or are too noisy to be useful. Consequently, models trained on such data cannot reliably exploit the textual modality.
To address this, the authors propose a fully automated, two‑stage pipeline for generating and verifying high‑quality synthetic contexts. In the generation stage, a large language model (Llama‑3.3‑70B‑Instruct) is prompted with both the historical window and the future target values of a time series, together with a brief domain description, to produce a "scenario" text that explains the expected change in dynamics. The prompt is engineered to avoid repetitive outputs, encouraging diverse narratives. In the verification stage, the generated text is fed to a strong context‑aided forecaster (a Direct Prompt model built on GPT‑5.2) together with the same time series. The model's probabilistic forecasts are evaluated with the Continuous Ranked Probability Score (CRPS), and only those contexts that lead to a lower CRPS (i.e., improve the forecast) are retained. An additional LLM‑based sanity check confirms that the text mentions the correct variable and describes a plausible scenario. This automatic quality filter ensures that every retained context is both descriptive of the temporal dynamics and complementary to the numerical history.
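The verification stage can be sketched as a simple comparison of CRPS with and without the candidate context. The sketch below is illustrative, not the paper's implementation: it estimates CRPS from forecast samples via the standard energy form, E|X − y| − ½·E|X − X′|, and the function and variable names are my own.

```python
import numpy as np

def crps_from_samples(samples: np.ndarray, y: float) -> float:
    """Empirical CRPS from an ensemble of forecast samples,
    using the energy form: E|X - y| - 0.5 * E|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

def keep_context(samples_no_ctx, samples_with_ctx, targets) -> bool:
    """Retain a generated context only if conditioning on it lowers the
    mean CRPS over the forecast horizon (the pipeline's quality filter).
    Each argument is a list of per-step sample arrays / target values."""
    crps_no = np.mean([crps_from_samples(s, y)
                       for s, y in zip(samples_no_ctx, targets)])
    crps_with = np.mean([crps_from_samples(s, y)
                         for s, y in zip(samples_with_ctx, targets)])
    return bool(crps_with < crps_no)
```

In the actual pipeline the two sample sets would come from the context-aided forecaster run with and without the candidate text; here they are just arrays.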
Applying this pipeline to 65 real‑world time‑series datasets, the authors construct CAF‑7M, a corpus of 7,433,239 context‑augmented windows for training and a rigorously verified test set of 904 windows. The test set is split evenly into “HARD” (MASE > 1.5) and “EASY” (MASE ≤ 1.5) subsets to ensure that evaluation measures genuine context utilization rather than trivial gains.
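The HARD/EASY partition can be sketched directly from the MASE definition: the forecast's MAE scaled by the in-sample MAE of a seasonal-naive forecast. The paper does not specify which baseline forecast defines the split, so the `baseline_forecast` field and the window dictionary layout below are assumptions for illustration.

```python
import numpy as np

def mase(history, target, forecast, m: int = 1) -> float:
    """Mean Absolute Scaled Error: forecast MAE divided by the in-sample
    MAE of the seasonal-naive forecast with period m."""
    history, target, forecast = map(np.asarray, (history, target, forecast))
    scale = np.mean(np.abs(history[m:] - history[:-m]))
    return float(np.mean(np.abs(target - forecast)) / scale)

def split_hard_easy(windows, threshold: float = 1.5):
    """Assign each window to HARD (MASE > threshold) or EASY (MASE <= threshold),
    mirroring the test-set split; the baseline forecast per window is assumed."""
    hard, easy = [], []
    for w in windows:
        score = mase(w["history"], w["target"], w["baseline_forecast"], w.get("m", 1))
        (hard if score > threshold else easy).append(w)
    return hard, easy
```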
For the modeling side, the paper introduces DoubleCast, an alignment‑based multimodal architecture that fuses a Chronos time‑series encoder with a Dual‑T5 text decoder. The two modality embeddings are concatenated and processed through a stack of blocks containing masked self‑attention, Chronos‑to‑T5 cross‑attention, Dual‑T5 cross‑attention, and feed‑forward layers. This design allows the model to learn a joint representation where textual cues can dynamically influence the forecast.
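The block structure described above can be sketched in PyTorch. This is a minimal illustration of the stated layer order (masked self-attention, Chronos-to-T5 cross-attention, Dual-T5 cross-attention, feed-forward), not the paper's exact specification: dimensions, normalization placement, and the class name `FusionBlock` are all assumptions.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """One fusion block over the concatenated modality sequence:
    masked self-attention, cross-attention into the Chronos (time-series)
    embeddings, cross-attention into the Dual-T5 (text) embeddings,
    then a feed-forward layer. Illustrative, with residuals + post-norm."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_ts = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, x, ts_emb, txt_emb, causal_mask=None):
        h, _ = self.self_attn(x, x, x, attn_mask=causal_mask)  # masked self-attn
        x = self.norms[0](x + h)
        h, _ = self.cross_ts(x, ts_emb, ts_emb)    # attend into Chronos states
        x = self.norms[1](x + h)
        h, _ = self.cross_txt(x, txt_emb, txt_emb)  # attend into Dual-T5 states
        x = self.norms[2](x + h)
        return self.norms[3](x + self.ffn(x))
```

As described, the joint input `x` would be the concatenation of the two modality embeddings along the sequence axis, passed through a stack of such blocks.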
Experiments show that pre‑training DoubleCast on CAF‑7M yields substantial improvements over the unimodal Chronos baseline (7‑10 % lower MAE and CRPS) and outperforms state‑of‑the‑art CAF models on the real‑world ChatTime benchmark. Ablation studies confirm that the verification step is essential: using unfiltered or randomly generated texts degrades performance, demonstrating that low‑quality context is the primary bottleneck. Moreover, DoubleCast achieves these gains with roughly 30 % fewer parameters than competing multimodal models, making it more cost‑effective at inference time.
The authors conclude that dataset quality, not model capacity, has limited progress in context‑aided forecasting. Their semi‑synthetic data generation and automatic verification framework provides a scalable path to high‑quality CAF datasets across domains, and the success of DoubleCast validates that such data can be leveraged effectively. Limitations include reliance on powerful LLMs for verification (which may be costly) and the current focus on English‑language contexts. Future work is suggested on multilingual generation, hybrid human‑LLM validation, and richer causal reasoning between text and series. Overall, the paper makes a compelling case that improving context quality is the key to unlocking the promise of multimodal time‑series forecasting.