MR-ImagenTime: Multi-Resolution Time Series Generation through Dual Image Representations


Authors: Xianyong Xu, Yuanjun Zuo, Zhihong Huang, Yihan Qin, Haoxian Xu, Leilei Du, Haotian Wang

Xianyong Xu, Yuanjun Zuo, Zhihong Huang
State Grid Hunan Electric Power Company Limited Research Institute & Hunan Province Engineering Technology Research Center of Electric Power Multimodal Perception and Edge Intelligence, Changsha, China
93615073@qq.com, hjzuoyuanjun@163.com, zhihong_huang111@163.com

Yihan Qin, Haoxian Xu, Leilei Du, Haotian Wang
Hunan University, Changsha, China
yihan_qin@hnu.edu.cn, xuhaoxian13@163.com, leileidu@hnu.edu.cn, wanghaotian@hnu.edu.cn

ABSTRACT

Time series forecasting is vital across many domains, yet existing models struggle with fixed-length inputs and inadequate multi-scale modeling. We propose MR-CDM, a framework combining hierarchical multi-resolution trend decomposition, an adaptive embedding mechanism for variable-length inputs, and a multi-scale conditional diffusion process. Evaluations on four real-world datasets demonstrate that MR-CDM significantly outperforms state-of-the-art baselines (e.g., CSDI, Informer), reducing MAE and RMSE by approximately 6-10%.

1 INTRODUCTION

Time series data is fundamental to decision-making in domains such as finance, healthcare, and transportation.
While deep learning models like RNNs, CNNs, and Transformers have advanced time series forecasting, recent work has explored diffusion models for their ability to capture complex temporal distributions. However, real-world series often exhibit multi-scale patterns (trends, seasonality, noise) and variable lengths, posing challenges for standard diffusion approaches. Although trend-decomposition methods improve multi-scale modeling, they typically assume fixed input lengths. To address variable-length inputs, Naiman et al. [1] propose mapping time series to 2D image-like representations. While this enhances input flexibility, it risks distorting temporal continuity and long-range dependencies by treating sequential data as spatial grids.

Example 1. Consider forecasting city-wide electricity demand with heterogeneous sampling rates (e.g., 5-minute vs. hourly sensors), as illustrated in Figure 1. High-frequency data captures local fluctuations, while low-frequency data reflects regional trends.

Figure 1: Multi-Resolution Time Series Example

However, most models assume fixed sampling rates. Forcing heterogeneous signals into a common resolution inevitably sacrifices fine-grained details or distorts global structures, limiting the ability to handle multi-scale dynamics.

To address these limitations, we propose a diffusion-based forecasting framework that (1) supports variable-length time series without fixed input windows, and (2) incorporates multi-scale trend decomposition to model temporal patterns at different resolutions. The method retains the generative strengths of diffusion models while improving robustness across heterogeneous sequence lengths and temporal scales. Experiments on multiple real-world datasets show that it consistently outperforms both diffusion-based and conventional baselines, achieving more accurate and reliable forecasts.
The remainder of this paper is organized as follows: Section 2 reviews related literature on time series forecasting and diffusion-based generative models. Section 3 introduces the problem statement. Section 4 presents the proposed MR-CDM framework in detail. Section 5 outlines the experimental setup and reports the empirical results. Finally, Section 6 concludes the paper and discusses potential directions for future research.

2 RELATED WORK

Time series forecasting generates a future segment of a time series by applying forecasting algorithms to historical data; for example, a historical wind speed series of 100 time units may be used to generate a future wind speed series of 50 time units. Time series forecasting has undergone three evolutionary phases: statistical modeling, the deep learning revolution, and the emerging diffusion paradigm.

A Statistical Modeling

Classical forecasting relied on explicit statistical assumptions, with linear models dominating the field. The Autoregressive Integrated Moving Average (ARIMA) [1] and its seasonal variant SARIMA provided interpretable formulations under stationarity assumptions, while Vector Autoregression (VAR) [2] extended these ideas to multivariate settings. Nonparametric methods like Holt-Winters exponential smoothing [3] and the Error-Trend-Seasonality (ETS) framework [4] modeled trend-seasonality interactions without strict distributional requirements. From a probabilistic perspective, Gaussian Processes (GPs) [5] introduced Bayesian inference for time series, providing principled uncertainty quantification via kernel functions.

B Deep Learning Revolution

Neural networks revolutionized forecasting by enabling hierarchical feature learning and nonlinear modeling.
Early recurrent architectures like LSTM [6] and GRU [7] addressed gradient issues to capture long-term dependencies, with DeepAR [8] establishing a deep probabilistic framework. Convolution-based models such as WaveNet [9] and TCNs [10] overcame sequential bottlenecks using dilated causal convolutions for efficient parallel processing. More recently, attention-based architectures like the Transformer [11], Informer [12], and Autoformer [13] have become dominant by capturing global temporal dependencies through self-attention and frequency-domain analysis.

C Emerging Diffusion Paradigm

Diffusion models have recently emerged as powerful generative tools for time series. TimeGrad [14] introduced autoregressive denoising diffusion for probabilistic forecasting, while CSDI [15] extended this to conditional score-based diffusion for imputation tasks. To improve efficiency and handle long sequences, SSSD [16] incorporated state space models, and TSDiff [17] proposed 2D representations to accommodate variable-length sequences. Spatial-temporal extensions like DiffSTG [18] embedded graph structures to model complex spatio-temporal dynamics.

Our framework addresses these existing challenges through three core innovations: an adaptive delay embedding mechanism for handling variable-length sequences, a hierarchical multi-resolution decomposition strategy to capture multi-scale temporal patterns, and a conditionally guided diffusion process that leverages coarse-level trends as structural priors. Together, these components ensure robust learning of complex dynamics while enhancing computational efficiency and training stability.

3 PROBLEM STATEMENT

Problem Statement. We address the problem of time series forecasting, where the goal is to predict future values of a time series based on its past observations.
Given a set of observed time series data x ∈ ℝ^(L×K), where L is the sequence length and K denotes the number of features, the challenge is to accurately model the complex temporal dependencies within the data, which span multiple time scales. This includes capturing the varying trends at different temporal resolutions and handling the high-dimensional and dynamic nature of the time series data for reliable forecasting.

4 METHOD

A Overview

The proposed MR-CDM framework utilizes a five-stage pipeline to address multi-scale time series characteristics, as illustrated in Figure 2. Initially, Multi-Scale Moving Average Decomposition (with windows of 5, 25, and 51) disentangles the input into short-term fluctuations, medium-term cycles, long-term trends, and high-frequency residuals [19]. Subsequently, these components undergo domain transformation: delay embedding converts short- and medium-term signals into 32×32 images to preserve temporal dynamics, while STFT maps the long-term trend to the time-frequency domain to highlight periodicity. These representations are then concatenated into a 35-channel fused image. A conditional diffusion model [20] then performs generation based on historical contexts to ensure temporal consistency. The process concludes with an Enhanced Hierarchical Reconstructor, which integrates hierarchical reconstruction, cross-scale attention, and adaptive weighting in parallel to recover high-fidelity predictions. This architecture effectively isolates multi-scale features, harnesses the generative power of diffusion models, and ensures precise detail recovery through multi-path fusion.

MR-CDM decomposes the input sequence into multi-scale components via a three-level moving average with window sizes of 5, 25, and 51, explicitly decoupling short-term fluctuations from seasonal periodicity.
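The three-level moving-average decomposition can be sketched as follows. The paper specifies only the window sizes (5, 25, 51); the exact subtraction scheme below is an assumption, chosen so that the four components sum back to the input series exactly.

```python
import numpy as np

def moving_average(x, w):
    """Centered moving average with edge padding (output keeps input length)."""
    pad = w // 2
    xp = np.pad(x, (pad, w - 1 - pad), mode="edge")
    return np.convolve(xp, np.ones(w) / w, mode="valid")

def multi_scale_decompose(x, windows=(5, 25, 51)):
    """Split a 1D series into short/medium/long trends plus a residual.
    Successively smoother averages are subtracted so that
    trend1 + trend2 + trend3 + residual == x (an assumed scheme)."""
    ma5, ma25, ma51 = (moving_average(x, w) for w in windows)
    trend1 = ma5 - ma25     # short-term fluctuations
    trend2 = ma25 - ma51    # medium-term cycles
    trend3 = ma51           # long-term trend
    residual = x - ma5      # high-frequency residual
    return trend1, trend2, trend3, residual
```

Because each component is a difference of adjacent smoothing levels, low-frequency content cannot leak into the high-frequency branches, which matches the branch-isolation motivation described above.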
Trend1 and the residual, which carry local transient information, are mapped to image space [21, 22] via delay embedding to preserve fine-grained temporal dependencies. Trend3, representing long-term trends and seasonal cycles, is transformed via STFT to capture the amplitude and phase of periodic components in the time-frequency domain. Trend2 addresses medium-scale periodic structures between the two extremes. All four image branches are fused and modeled jointly by a diffusion model, followed by a hierarchical reconstructor to recover the predicted sequence. This branch-wise design prevents low-frequency components from masking high-frequency details, enhancing the model's ability to represent complex temporal dynamics.

Figure 2: MR-CDM Model Architecture

B Proposed Algorithm

The proposed MR-CDM framework follows a multi-stage pipeline detailed in prior work. We briefly summarize the core components here. Multi-Scale Decomposition: input time series are hierarchically decomposed into short-, medium-, and long-term trends using moving averages, alongside high-frequency residuals [23]. Time-Series-to-Image Mapping: temporal components are transformed into 2D image representations via adaptive delay embedding and STFT [24], enabling resolution-independent processing. Conditional Diffusion: a multi-resolution diffusion process is employed, where noise prediction is conditioned on historical context and cross-scale features to ensure temporal consistency.

5 EXPERIMENTS

A Experimental Setup

A.1 Research Objectives. The objective of this study is to evaluate the effectiveness of the proposed MR-CDM model for time series forecasting tasks.
Specifically, the experiments are designed to: (1) verify the effectiveness of multi-scale trend decomposition in improving forecasting accuracy; (2) evaluate the performance of conditional diffusion models in time series forecasting; (3) analyze the effectiveness of transforming time series into image representations; and (4) compare traditional time series forecasting methods with diffusion-based approaches.

A.2 Experimental Environment. All experiments were conducted on a high-performance workstation configured with an Intel Xeon Silver 4110 CPU, 256 GB of DDR4 RAM, and two NVIDIA TITAN RTX GPUs, each equipped with 24 GB of dedicated memory. The software stack was built on Ubuntu 20.04 LTS and included Python 3.11, PyTorch 2.0.1, and CUDA 11.8 to enable efficient GPU-accelerated training and inference. This setup ensured consistent and reproducible experimental conditions across all evaluations.

A.3 Evaluation Metrics. We adopt three primary evaluation metrics: Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE), which measure the squared error, absolute error, and scale-consistent error between predictions and ground truth, respectively.

B Datasets and Preprocessing

B.1 Dataset Description. Experiments are conducted on the ETTh1 (Electricity Transformer Temperature - Hourly) dataset. The dataset spans from July 2016 to July 2018 with an hourly sampling rate, containing 17,420 time points and seven features. Following common practice, we focus on the univariate forecasting task of the LUFL variable. The demonstration of our model's adaptability in other fields is provided in Appendix B.

B.2 Data Preprocessing. Data preprocessing consists of three main steps. First, missing values are filled using linear interpolation, and outliers are detected and handled based on the 3σ rule to ensure temporal continuity.
Second, Z-score normalization is applied using statistics computed from the training set: x_norm = (x − μ) / σ. Finally, the dataset is split chronologically into training (70%), validation (10%), and testing (20%) sets, corresponding to 12,194, 1,742, and 3,484 time points, respectively.

B.3 Dataset Synthesis. To demonstrate our model's generalization capability on similar datasets, we generated a synthetic dataset using a multi-component time series synthesis approach. The synthetic data preserves the statistical properties (mean, standard deviation, and value ranges) of the original ETTh1 dataset while incorporating realistic temporal patterns, including daily cycles (24-hour periodicity), weekly cycles (7-day periodicity), long-term trends, and Gaussian noise. Inter-variable correlations were maintained through correlated signal generation, and daytime load enhancement was applied to simulate realistic power consumption patterns.

C Baseline Models

C.1 Traditional Time Series Model. We adopt ARIMA(2,1,2) as a standard univariate baseline, suitable for non-stationary series. It performs 96-step prediction via linear extrapolation of the recent trend, augmented with Gaussian noise to model uncertainty.

C.2 Deep Learning Baseline. A two-layer stacked LSTM with 128 hidden units and 0.2 dropout serves as our deep learning baseline. It generates 96-step forecasts from the final hidden state. We train for 200 epochs using AdamW (lr=0.0001, weight decay=0.00001), with cosine annealing and gradient clipping (max norm 1.0).

C.3 Advanced Diffusion Model. We utilize CSDI, a conditional score-based diffusion model for forecasting. It employs a masking mechanism for conditional generation and uses 6 residual blocks (causal convolutions + 4-head self-attention, hidden dim=64) to model both local and global temporal dependencies.
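The LSTM baseline described in C.2 might be sketched as follows. The linear projection head and the single-feature input are assumptions; the paper specifies only the layer count, hidden size, dropout, horizon, and optimizer settings.

```python
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    """Two-layer stacked LSTM baseline: 128 hidden units, 0.2 dropout,
    96-step forecast read off the final hidden state. The linear head
    mapping hidden state to horizon is an assumed design choice."""
    def __init__(self, n_features=1, hidden=128, horizon=96):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            dropout=0.2, batch_first=True)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x):            # x: (batch, seq_len, n_features)
        _, (h, _) = self.lstm(x)     # h: (num_layers, batch, hidden)
        return self.head(h[-1])      # forecast: (batch, horizon)

model = LSTMBaseline()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=200)
# During training, gradients are clipped to max norm 1.0 after backward():
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```

A 96-step input batch of shape (B, 96, 1) thus yields a (B, 96) forecast in a single forward pass, matching the direct multi-step setup used in the experiments.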
Figure 3: The Prediction Result Based on Our MR-CDM Model

D Implementation Details of Feature Processing

Multi-scale Trend Decomposition. In the decomposition stage, we employ three moving average filters with window sizes of [5, 25, 51], corresponding to temporal scales of approximately 5 hours, 1 day, and 2 days, respectively. This configuration enables the capture of multi-level features ranging from short-term fluctuations to long-term trends. Specifically, the filter with window size 5 (MA-5) extracts high-frequency short-term fluctuations, MA-25 identifies daily periodic patterns, and MA-51 captures long-term trends. This hierarchical design ensures that features at distinct temporal scales are disentangled and can be subjected to differentiated processing strategies.

Adaptive Image Transformation. For the components Trend1, Trend2, and the Residual, we utilize the delay embedding method to transform 1D time series into 32×32 2D images, setting the delay parameter τ = 3 and the embedding dimension d = 32. Here, τ = 3 effectively captures temporal dependencies over a 3-hour span, while d = 32 aligns with the base resolution of the U-Net architecture, thereby avoiding unnecessary upsampling or downsampling operations. Conversely, for Trend3, we apply the Short-Time Fourier Transform (STFT) to convert the signal into the time-frequency domain, configured with n_fft = 64 and hop_length = 16. The choice of n_fft = 64 yields 32 frequency components sufficient for identifying periodic characteristics, while hop_length = 16 ensures a 75% window overlap, guaranteeing temporal continuity and information integrity.
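The two transforms above can be sketched in NumPy as follows. The row stride of the delay embedding and the Hann window for the STFT are assumptions not stated in the paper; note also that a one-sided FFT with n_fft = 64 produces 33 bins (n_fft // 2 + 1).

```python
import numpy as np

def delay_embed(x, d=32, tau=3, stride=1):
    """Map a 1D series to a d x d image of delayed windows (Takens-style).
    Row i holds [x[i*stride], x[i*stride + tau], ..., x[i*stride + (d-1)*tau]].
    The row stride is not specified in the paper; stride=1 is an assumption."""
    need = (d - 1) * stride + (d - 1) * tau + 1
    assert len(x) >= need, f"series too short: need {need} points"
    return np.stack([x[i * stride + np.arange(d) * tau] for i in range(d)])

def stft_magnitude(x, n_fft=64, hop=16):
    """Magnitude spectrogram via framed one-sided FFT with a Hann window,
    using the paper's n_fft=64 and hop_length=16 (75% window overlap)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq, time)
```

With τ = 3 and d = 32, each delay-embedding image covers (d − 1)·τ = 93 consecutive steps per row, i.e., roughly a 4-day span at hourly resolution, while adjacent rows shift by one step.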
In the fusion stage, we adopt a channel concatenation strategy to merge the images of the four components into a single fused tensor with 35 channels (structured as 7 + 7 + 14 + 7). This design preserves the complete information of all components, facilitating subsequent hierarchical reconstruction.

E Main Results

E.1 Overall Performance Comparison. Table 1 reports the forecasting performance of the models above on the ETTh1 LUFL task.

Table 1: Performance comparison on ETTh1

Model    MSE      MAE     RMSE
ARIMA    42.2121  6.2893  6.496
LSTM     13.4981  3.6270  3.6739
CSDI      1.5538  0.8992  1.2465
MR-CDM    1.4842  0.9650  1.2183

As shown in Table 1, we compare MR-CDM with two representative baselines and an advanced diffusion model on the ETTh1 feature prediction task. MR-CDM achieves superior performance, with a 96.5 percent MSE reduction compared to ARIMA and an 89.0 percent improvement over LSTM. Although CSDI outperforms ARIMA and LSTM, our model further surpasses CSDI. These results indicate that MR-CDM more effectively captures complex temporal patterns, highlighting the importance of explicitly modeling hierarchical trends.

As illustrated in Figure 3, we evaluate the prediction performance of MR-CDM on long-term LUFL forecasting using the ETTh1 dataset. The input sequence consists of the first 96 time steps (0-95), while the subsequent 96 steps (96-191) form the prediction horizon, covering a total duration of 192 hours (8 days) with hourly resolution.

Figure 4: The Prediction Result Based on CSDI Model

Experimental results on ETTh1 further confirm that MR-CDM outperforms CSDI, as shown in Figure 4. This advantage arises from key architectural differences. MR-CDM explicitly decomposes time series into multi-scale components and applies tailored strategies such as delay embedding and STFT, whereas CSDI relies on a unified convolution that limits multi-scale representation.
By transforming time series into the image domain, MR-CDM better leverages diffusion-based generation. In addition, its multi-path hierarchical reconstructor captures inter-scale dependencies more effectively than CSDI's single-path decoder, leading to improved feature learning and prediction accuracy.

Table 2: Performance comparison on Synthetic Dataset

Model    MSE      MAE     RMSE
ARIMA    39.8131  8.3046  6.3097
LSTM     12.7953  8.2361  3.5770
CSDI      1.6749  0.8741  1.2941
MR-CDM    1.2544  0.9080  1.1200

To further validate robustness and generalization, we conduct the same experiments on a synthetic dataset with statistical properties, multi-scale periodicity, and correlations similar to ETTh1, as shown in Table 2. The consistent improvements on both real and synthetic data demonstrate that MR-CDM does not overfit specific datasets but effectively captures underlying temporal dynamics. This validates the generalizability of combining multi-scale decomposition with conditional diffusion for time series forecasting. A rigorous multiple-run validation protocol was employed to mitigate the impact of random variation and ensure fair comparison. Appendix C presents the comprehensive statistical summaries, confirming the reliability of our experimental setup and the consistent performance patterns across runs.

To evaluate the multi-step forecasting performance of MR-CDM on time series data, we conduct a comprehensive comparison experiment on the ETTh1 dataset. The historical input window is fixed at 96 time steps, while the prediction horizon is varied across four settings: 24, 48, 96, and 192 steps, enabling a systematic assessment of model accuracy and error degradation under increasing forecasting difficulty, as shown in Figure 5.

Figure 5: Multi-Step Prediction Performance Comparison
Three baseline models are selected for comparison: ARIMA, LSTM, and CSDI. As illustrated in the figure, MR-CDM consistently achieves the lowest error across all prediction horizons, and its error growth rate remains significantly lower than that of the baselines as the prediction length increases. These results demonstrate that MR-CDM effectively suppresses error accumulation in long-range forecasting scenarios, confirming the superiority of the proposed approach for multi-step time series prediction tasks.

We conduct ablation studies to analyze the effectiveness of each component. Due to space limitations, detailed results are provided in Appendix A.1 and Appendix A.2.

6 CONCLUSION

This paper presents a comprehensive study on time series forecasting through the development of MR-CDM, a novel framework that addresses key limitations of existing methods. The research makes several significant contributions: (1) a delay embedding technique that converts variable-length time series into structured 2D image representations while preserving temporal dependencies, thereby enabling the use of spatial inductive biases from computer vision; (2) a hierarchical trend decomposition module that explicitly captures multi-scale temporal patterns, including short-term fluctuations, seasonal cycles, and long-term trends; and (3) a hierarchical conditional diffusion model that performs denoising generation under multi-scale guidance, reducing stochasticity through coarse-to-fine constraints.

7 ACKNOWLEDGMENT

This research was funded by the Science and Technology Project of State Grid Hunan Electric Power Company Limited, titled "Research on Key Technologies and Complete Equipment of Diffusion Super-Resolution Data Augmentation for Power Grid Model Identification and Situation Deduction", grant number 5216A5250009.

REFERENCES

[1] I. Naiman, N. Berman, I. Pemper, I. Arbiv, G. Fadlon, and O.
Azencot, "Utilizing image transforms and diffusion models for generative modeling of short and long time series," in Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10-15, 2024 (A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, eds.), 2024.
[2] E. Zivot and J. Wang, Vector Autoregressive Models for Multivariate Time Series, pp. 369-413. New York, NY: Springer New York, 2003.
[3] A. B. Koehler, R. D. Snyder, and J. Ord, "Forecasting models and prediction intervals for the multiplicative Holt-Winters method," International Journal of Forecasting, vol. 17, no. 2, pp. 269-286, 2001.
[4] D. J. Hand, "Forecasting with exponential smoothing: The state space approach by Rob J. Hyndman, Anne B. Koehler, J. Keith Ord, Ralph D. Snyder," International Statistical Review, vol. 77, no. 2, pp. 315-316, 2009.
[5] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
[6] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[7] V. S. Ishwarya and M. Kothandaraman, "A novel feature-fusion-based sparse masked attention network for acoustic echo cancellation using wavelet and STFT synergies," Circuits Syst. Signal Process., vol. 44, no. 4, pp. 2882-2901, 2025.
[8] V. Flunkert, D. Salinas, and J. Gasthaus, "DeepAR: Probabilistic forecasting with autoregressive recurrent networks," International Journal of Forecasting, vol. 36, no. 3, 2020.
[9] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K.
Kavukcuoglu, "WaveNet: A generative model for raw audio," in The 9th ISCA Speech Synthesis Workshop, SSW 2016, Sunnyvale, CA, USA, September 13-15, 2016 (A. W. Black, ed.), p. 125, ISCA, 2016.
[10] S. Bai, J. Z. Kolter, and V. Koltun, "Trellis networks for sequence modeling," in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," arXiv, 2017.
[12] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, "Informer: Beyond efficient transformer for long sequence time-series forecasting," CoRR, vol. abs/2012.07436, 2020.
[13] H. Wu, J. Xu, J. Wang, and M. Long, "Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting," pp. 22419-22430, 2021.
[14] K. Rasul, C. Seward, I. Schuster, and R. Vollgraf, "Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting," pp. 8857-8868, 2021.
[15] Y. Tashiro, J. Song, Y. Song, and S. Ermon, "CSDI: Conditional score-based diffusion models for probabilistic time series imputation," pp. 24804-24816, 2021.
[16] J. M. L. Alcaraz and N. Strodthoff, "Diffusion-based time series imputation and forecasting with structured state space models," CoRR, vol. abs/2208.09399, 2022.
[17] M. Kollovieh, K. Stelzner, J. Kossen, M. Lützenberger, and K. Kersting, "Predict, refine, synthesize: Self-guiding diffusion models for probabilistic time series forecasting," in Advances in Neural Information Processing Systems, 2023.
[18] H. Wen, Y. Lin, Y. Xia, H. Wan, Q. Wen, R. Zimmermann, and Y.
Liang, "DiffSTG: Probabilistic spatio-temporal graph forecasting with denoising diffusion models," in Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems, New York, NY, USA: Association for Computing Machinery, 2023.
[19] L. Shen, W. Chen, and J. T. Kwok, "Multi-resolution diffusion models for time series forecasting," 2024.
[20] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 6840-6851, Curran Associates, Inc., 2020.
[21] F. Takens, "Detecting strange attractors in turbulence," in Dynamical Systems and Turbulence, Warwick 1980: proceedings of a symposium held at the University of Warwick 1979/80, pp. 366-381, 2006.
[22] D. W. Griffin and J. S. Lim, "Signal estimation from modified short-time Fourier transform," in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '83, Boston, Massachusetts, USA, April 14-16, 1983, pp. 804-807, IEEE, 1983.
[23] L. Shen, W. Chen, and J. T. Kwok, "Multi-resolution diffusion models for time series forecasting," in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, OpenReview.net, 2024.
[24] V. S. Ishwarya and M. Kothandaraman, "A novel feature-fusion-based sparse masked attention network for acoustic echo cancellation using wavelet and STFT synergies," Circuits Syst. Signal Process., vol. 44, no. 4, pp. 2882-2901, 2025.

A ABLATION STUDIES

A.1 First Ablation Study

To validate the effectiveness of key components in MR-CDM, we conduct ablation studies on the LUFL univariate forecasting task.
We evaluate three variants: (1) Baseline-NoDecomposition, which removes the multi-scale trend decomposition; (2) UnconditionalDiffusion, which disables historical conditioning in the diffusion process; and (3) NoImageFusion, which replaces our smart fusion module with simple feature concatenation. These components are selected because they each play a distinct and essential role: trend decomposition captures multi-resolution temporal patterns, conditional diffusion leverages historical context to guide generation, and image-inspired fusion enables effective integration of decomposed signals. The ablation results help isolate their individual contributions to the model's performance.

The first ablation experiment uses consistent configurations: sequence length of 96 time steps for both input and prediction, batch size of 16, learning rate of 0.001 with the Adam optimizer, and training for 50 epochs. The models are evaluated using MSE, MAE, and RMSE metrics on an 80/20 train-test split. This experimental design allows us to quantify the individual contribution of each component while maintaining computational efficiency for rapid iteration.

First Ablation Study Result. The results of the ablation experiments, summarized in Table 3, reveal the importance of each component in MR-CDM. To verify the effectiveness of our design, we conducted a systematic ablation study on the LUFL features of the ETTh1 dataset. The trend decomposition module proves essential: removing it (Baseline-NoDecomposition) leads to an 89.6% performance degradation compared to the FullModel, confirming that multi-scale decomposition is crucial for capturing complex temporal patterns.
Even more dramatically, the UnconditionalDiffusion variant, where historical trends are not used to condition the diffusion process, suffers a 1280% drop in performance, strongly highlighting the necessity of conditioning on coarse-grained historical information for accurate and stable forecasting. The NoImageFusion variant, which replaces our proposed image-inspired fusion with simple feature concatenation, performs better than the baseline but still falls significantly short of the full model. This demonstrates that naive concatenation fails to exploit the rich cross-scale correlations among decomposed trends, whereas our smart fusion mechanism effectively integrates multi-resolution features. Importantly, the removal of any single component causes a substantial performance decline that cannot be compensated by the remaining modules, validating the complementary nature of our MR-CDM architecture. Together, these components form a synergistic system, enabling the complete model to achieve the best results across all metrics (MSE: 1.4842, MAE: 0.9650, RMSE: 1.2183), thereby fully verifying the effectiveness of our approach.

Experiments demonstrate that MR-CDM significantly outperforms baselines. Ablation studies attribute this gain to three core mechanisms: multi-scale trend decomposition disentangles short-, mid-, and long-term components for independent feature learning, surpassing ARIMA's linearity, LSTM's monolithic encoding, and CSDI's instability; conditional diffusion injects historical context to ensure temporal continuity, yielding higher accuracy than unconditional generation; and a multi-path hierarchical reconstructor leverages parallel pathways (hierarchical recovery, cross-scale attention, adaptive weighting) to precisely restore details, avoiding LSTM's gradient vanishing and information loss.
Crucially, the complete framework exhibits strong synergistic effects, where the integrated performance exceeds the sum of individual contributions, validating the necessity of the proposed architecture.

Table 3: Ablation Studies on ETTh1

Model Variant               MSE       MAE      RMSE
Baseline-NoDecomposition    14.2177   3.6753   3.7706
UnconditionalDiffusion      20.4822   4.4252   4.5257
NoImageFusion               13.0950   3.5406   3.6187
FullModel                    1.4842   0.9650   1.2183

To systematically evaluate the role of conditional information in the proposed diffusion model, we designed a set of ablation experiments in which all external condition inputs (such as time features, covariates, or historical context guidance) were removed, and the diffusion process alone was relied upon for unconditional prediction of the time series. Figure 6 shows the prediction results of the model on the ETTh1 dataset under this ablation setting.

Compared with the MR-CDM with conditional information introduced in the main experiment, the unconditional diffusion model can roughly capture the overall trend of the sequence, but it is significantly deficient in detail modeling, phase alignment, and long-term dynamic evolution, manifested as overly smooth prediction curves, delayed peak responses, and distorted local fluctuations. These results demonstrate that the conditional mechanism plays a crucial role in enhancing the expressive ability and temporal consistency of the diffusion model in complex time series prediction tasks.

Figure 6: Prediction result of the MR-CDM model without conditional information.

To visually demonstrate the crucial role of conditional information in the diffusion model, Figure 6 and Figure 3 respectively show the performance of the unconditional diffusion model and the proposed conditional diffusion model on the same time series prediction task.
Here, the blue solid line represents the historical observed sequence (steps 0-95) given as input, the green solid line represents the ground-truth values (steps 96-191), and the red dashed line represents the model's predictions. The figures show that under the unconditional setting, although the model can capture the overall trend, there are significant deviations in detail fluctuations, peak responses, and temporal alignment, especially in rapidly changing regions, indicating that it lacks the ability to precisely model the temporal dynamic structure.

In contrast, after introducing conditional information, the model can more accurately track the intense fluctuations and local features of the true signal, and the prediction curve closely matches the true values, significantly enhancing the ability to restore short-term fluctuations and maintain long-term consistency. This comparison validates the effectiveness of the conditional mechanism in guiding the diffusion process and enhancing time-dependent modeling, further highlighting the superior performance of the proposed method in complex time series prediction tasks.

Table 4: Ablation Studies on Synthetic ETTh1

Model Variant               MSE        MAE      RMSE
Baseline-NoDecomposition    108.5863   8.0619   10.4204
UnconditionalDiffusion      116.4437   8.3155   10.7909
NoImageFusion               111.8180   8.2349   10.5744
FullModel                     1.2544   0.9080    1.1200

As shown in Table 4, we further conducted ablation studies on the synthetic dataset, and the results consistently demonstrate that each component of MR-CDM plays a vital role: multi-scale trend decomposition effectively captures hierarchical temporal patterns, conditional diffusion leverages historical context for accurate generation, and the image-based fusion mechanism enables meaningful integration of multi-scale features.
Removing any one of these components leads to a significant performance drop, confirming that the full architecture is necessary and well designed.

A.2 Second Ablation Study

To systematically evaluate the structural effectiveness of the proposed MR-CDM in multi-scale time series modeling, we conduct an ablation study on the trend decomposition module. The experiments are performed on the LUFL variable of the ETTh1 dataset under a unified prediction-length setting, with MSE, MAE, and RMSE adopted as evaluation metrics. Specifically, three configurations are compared: the full MR-CDM model, MR-CDM w/o Trend1 (removing the low-frequency trend component), and MR-CDM w/o Trend3 (removing the high-frequency component). Trend2 is not removed because it is neither the highest-frequency nor the smoothest component but lies in between, mainly carrying medium-term structural information.

The results in Table 5 show that the full model achieves the best performance across all three metrics. Removing either trend component leads to a noticeable degradation in prediction accuracy, with a larger drop observed when Trend3 is removed. This indicates that the high-frequency component is critical for short-term fluctuation modeling, while the low-frequency component remains indispensable for preserving global trend dynamics. Overall, the proposed multi-scale trend decomposition mechanism enables collaborative modeling of short-term disturbances and long-term evolution within a unified framework, thereby improving model robustness and generalization in practical forecasting scenarios.

A.3 Input Length Sensitivity Analysis

To further investigate the effect of historical input length on the forecasting performance of MR-CDM, we conduct an input length sensitivity analysis on the LUFL variable of the ETTh1 dataset.
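The three-component split underlying this ablation can be sketched as a cascade of moving averages, in the spirit of the AvgPool-based hierarchical decomposition described in the background. The kernel sizes below and the exact three-way split are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def moving_average(x, k):
    """AvgPool(Padding(x), k): average pooling with edge padding,
    preserving the input length."""
    pad = k // 2
    xp = np.pad(x, (pad, k - 1 - pad), mode="edge")
    return np.convolve(xp, np.ones(k) / k, mode="valid")

def decompose(x, kernels=(25, 7)):
    """Split a series into Trend1 (low-frequency), Trend2 (medium-term),
    and Trend3 (high-frequency residual). Kernel sizes are assumptions."""
    trend1 = moving_average(x, kernels[0])           # smoothest component
    trend2 = moving_average(x - trend1, kernels[1])  # medium-term structure
    trend3 = x - trend1 - trend2                     # high-frequency residual
    return trend1, trend2, trend3
```

By construction the three components sum exactly back to the input, so an ablated variant such as "w/o Trend1" simply zeroes one component before reconstruction, which is what makes the resulting error attributable to that scale.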
Table 5: Trend Decomposition Ablation Results

Model Variant        MSE      MAE      RMSE
MR-CDM w/o Trend1    1.9342   1.1247   1.3907
MR-CDM w/o Trend3    2.2769   1.2678   1.5089
FullModel            1.4842   0.9650   1.2183

The prediction length is fixed at 96, while the input sequence length Seq_Len is varied across 48, 96, and 192. As shown in Table 6, all three error metrics decrease monotonically as the input length increases, indicating that longer historical sequences provide richer temporal context for the model. Longer input sequences contain more complete periodic structures and trend dynamics, enabling the multi-scale decomposition module to extract each frequency component more accurately and reducing decomposition errors caused by insufficient historical context, thereby improving prediction accuracy. These results confirm that MR-CDM adapts to varying input lengths and suggest that increasing the historical window size in practical deployment can further enhance forecasting performance.

Table 6: Input Length Sensitivity Results

Seq_Len   MSE      MAE      RMSE
48        1.8998   1.1387   1.3783
96        1.4842   0.9650   1.2183
192       1.2319   0.8492   1.0965

B GENERALIZATION TO OTHER DOMAINS

To further validate the generalization capability and robustness of our model, we extended our evaluation to datasets from two distinct domains: weather forecasting and traffic flow prediction. These datasets embody the complex non-linear spatiotemporal dynamics characteristic of meteorological variations and urban traffic patterns, respectively, thereby serving as rigorous benchmarks for assessing cross-domain adaptability. In our experiments, the OT variable serves as the ground truth for forecasting tasks on both datasets.
Experimental results demonstrate that our model not only retains its performance advantages but also achieves superior prediction accuracy and stability in these diverse scenarios, underscoring its significant potential for broad real-world applications. Table 7 and Table 8 compare the prediction ability of the models.

The Weather dataset comprises 21 multivariate time series collected from a meteorological station in Jena, Germany, with a sampling rate of 10 minutes. Characterized by complex seasonal patterns (daily and yearly cycles) and high-frequency noise, this dataset tests the model's robustness in handling non-stationary environmental data and capturing inter-variable dependencies.

Table 7: Model Performance on the Weather Domain

Model     MSE        MAE       RMSE
ARIMA     191.2356   11.6258   13.8287
LSTM      124.7143    7.5514   11.1675
CSDI       96.3245    6.9438    9.8145
MR-CDM     79.4128    6.3487    8.9113

The Traffic dataset records road occupancy rates collected from 862 sensors in the San Francisco Bay Area. The data is sampled hourly and exhibits strong daily and weekly periodicity, as well as complex spatial dependencies among sensors. It serves as a rigorous benchmark for evaluating a model's ability to capture recurring patterns and handle high-dimensional multivariate series.

Table 8: Model Performance on the Traffic Domain

Model     MSE      MAE      RMSE
ARIMA     0.0011   0.0237   0.0331
LSTM      0.0007   0.0152   0.0264
CSDI      0.0004   0.0143   0.0200
MR-CDM    0.0003   0.0139   0.0173

C MULTIPLE-RUN VALIDATION

To ensure the robustness and reproducibility of our results, each experiment is repeated three times using distinct random seeds (specifically, 42, 43, and 44). This practice helps account for the inherent stochasticity in model initialization and training dynamics.
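The aggregation protocol above can be sketched as follows; the helper name and the per-run dictionary layout are our own illustrative assumptions, not the paper's released code.

```python
import statistics

SEEDS = [42, 43, 44]  # seeds used in the multiple-run protocol

def aggregate_runs(run_metrics):
    """Summarize per-run metric dicts into (mean, sample std) per metric.

    `run_metrics` is a list like [{"MSE": ..., "MAE": ..., "RMSE": ...}, ...],
    one dict per seed; this structure is an assumption for illustration.
    """
    summary = {}
    for name in run_metrics[0]:
        values = [m[name] for m in run_metrics]
        summary[name] = (statistics.mean(values), statistics.stdev(values))
    return summary
```

With only three runs the sample standard deviation is a coarse estimate, which is consistent with reporting means and treating the variance as a stability check rather than a tight confidence interval.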
The corresponding statistical summaries, reporting the mean of key performance metrics across the three runs, are given in Table 9 (ETTh1) and Table 10 (Synthetic ETTh1). These aggregated results provide a more reliable basis for comparison and mitigate the influence of random variation on our conclusions.

Table 9: Multiple-Run Performance on ETTh1

Model    MSE       MAE      RMSE
ARIMA    30.0946   8.2296   5.4858
LSTM     13.2883   6.2361   3.6453
CSDI      2.7493   1.2478   1.6581

To ensure the reliability and reproducibility of our experimental results, we conducted multiple-run validation for all baseline models. Specifically, we trained and evaluated the ARIMA, LSTM, and CSDI models three times with different random seeds, and report the averaged performance metrics across all runs. This rigorous validation protocol helps mitigate the impact of random initialization and stochastic training processes, providing more robust and statistically reliable comparisons. The results presented in Table 9 show the mean performance across three independent runs. These averaged results demonstrate consistent performance patterns across multiple runs, confirming the stability of baseline model performance and ensuring a fair comparison with our proposed method. The relatively small variance across runs (not shown for brevity) further validates the reliability of our experimental setup and the robustness of the reported performance improvements.

Table 10: Multiple-Run Performance on Synthetic ETTh1

Model    MSE       MAE      RMSE
ARIMA    47.2826   7.4434   6.8762
LSTM     13.6146   3.6670   3.6897
CSDI      2.6159   1.9632   1.6173

Similarly, we repeated the aforementioned baseline experiments on the synthetic dataset three times, using different random seeds to account for stochastic variations in model initialization and training.
The results across all runs demonstrate consistent performance for the ARIMA, LSTM, and CSDI baseline models, confirming their stability on this controlled dataset. As shown in Table 10, the low variance across repetitions further validates the reliability of these baselines under synthetic conditions. These findings not only reinforce the robustness of our experimental setup but also provide a solid foundation for comparing more advanced methods, thereby supporting our overall analysis and conclusions.

D BACKGROUND

Diffusion Models. Diffusion models [20] consist of a forward noising and a backward denoising process. Forward diffusion gradually adds Gaussian noise over $K$ steps. The closed-form expression for step $k$ is:
$$\mathbf{x}_k = \sqrt{\bar{\alpha}_k}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_k}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \tag{1}$$
where $\bar{\alpha}_k = \prod_{s=1}^{k} (1 - \beta_s)$.
Reverse denoising recovers clean data from noise, typically by predicting either the noise $\boldsymbol{\epsilon}$ or the clean data $\mathbf{x}_0$ via:
$$\mathcal{L}_{\epsilon} = \mathbb{E}\,\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_k, k)\|^2 \quad \text{or} \quad \mathcal{L}_{\mathbf{x}} = \mathbb{E}\,\|\mathbf{x}_0 - \mathbf{x}_\theta(\mathbf{x}_k, k)\|^2. \tag{2}$$

Conditional Diffusion Models. For time series prediction, the denoising process is conditioned on historical context $\mathbf{c} = \mathcal{F}(\mathbf{x}^0_{-L+1:0})$. The conditional distribution is defined as:
$$p_\theta(\mathbf{x}^{0:K}_{1:H} \mid \mathbf{c}) = p_\theta(\mathbf{x}^{K}_{1:H}) \prod_{k=1}^{K} p_\theta(\mathbf{x}^{k-1}_{1:H} \mid \mathbf{x}^{k}_{1:H}, \mathbf{c}), \tag{3}$$
where the mean $\mu_\theta(\mathbf{x}^k, k \mid \mathbf{c})$ is computed by leveraging both the noisy input and the conditioning context $\mathbf{c}$.

Hierarchical Trend Decomposition (HTD). HTD [19] decomposes time series into multi-scale trends. Given a series $X_0$, trend components are extracted sequentially:
$$X_s = \mathrm{AvgPool}(\mathrm{Padding}(X_{s-1}), \tau_s), \qquad s = 1, \dots, S-1, \tag{4}$$
where AvgPool denotes average pooling and $\tau_s$ (increasing with $s$) controls the smoothing kernel size, enabling extraction of progressively coarser trends.

Time Series to Image Transforms. Time series are mapped to images using invertible transforms to exploit spatial inductive biases [21, 22]:

• Delay Embedding: For a univariate series $x_{1:L}$, it constructs a matrix $X \in \mathbb{R}^{n \times q}$, where $n$ is the embedding dimension and $q$ is the number of time steps. Adaptive variants dynamically adjust the embedding parameters to capture complex dynamics.

• Short-Time Fourier Transform (STFT): Maps the signal to the frequency domain using a sliding window. For input $\mathbf{x} \in \mathbb{R}^{L \times K}$, STFT outputs $\mathbf{x}_{\mathrm{img}} \in \mathbb{R}^{2K \times H \times W}$ (storing the real and imaginary parts). It is invertible with negligible information loss.
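The closed-form forward noising of Eq. (1) can be sketched numerically as follows. The linear beta schedule, the toy series, and the function name are illustrative assumptions, not MR-CDM's exact choices.

```python
import numpy as np

def forward_diffuse(x0, k, betas, rng=np.random.default_rng(0)):
    """Sample x_k from x_0 in closed form, per Eq. (1):
    x_k = sqrt(alpha_bar_k) * x0 + sqrt(1 - alpha_bar_k) * eps.

    `betas` holds the noise schedule beta_1..beta_K; a linear schedule
    below is an assumption for illustration.
    """
    alpha_bar = np.cumprod(1.0 - betas)  # alpha_bar_k = prod_{s<=k} (1 - beta_s)
    eps = rng.standard_normal(x0.shape)
    xk = np.sqrt(alpha_bar[k - 1]) * x0 + np.sqrt(1.0 - alpha_bar[k - 1]) * eps
    return xk, eps

K = 100
betas = np.linspace(1e-4, 0.02, K)            # assumed linear schedule
x0 = np.sin(np.linspace(0.0, 4 * np.pi, 96))  # toy length-96 series
xK, eps = forward_diffuse(x0, K, betas)
# As k approaches K, alpha_bar_k shrinks and x_k approaches pure Gaussian noise;
# the denoiser in Eq. (2) is trained to recover eps (or x0) from (x_k, k).
```

Because Eq. (1) gives $\mathbf{x}_k$ directly from $\mathbf{x}_0$, training can sample a random step $k$ per example instead of simulating all $k$ noising steps, which is what makes the closed form useful.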
