Reliable Grid Forecasting: State Space Models for Safety-Critical Energy Systems

Accurate grid load forecasting is safety-critical: under-predictions risk supply shortfalls, while symmetric error metrics can mask this operational asymmetry. We introduce an operator-legible evaluation framework-Under-Prediction Rate (UPR), tail Re…

Authors: Sunki Hong, Jisoo Lee

Sunki Hong and Jisoo Lee
Gramm AI
*Corresponding author(s). E-mail(s): jisoo@gramm.ai; Contributing authors: sunki@gramm.ai

Abstract

Accurate grid load forecasting is safety-critical: under-predictions risk supply shortfalls, while symmetric error metrics can mask this operational asymmetry. We introduce an operator-legible evaluation framework, consisting of Under-Prediction Rate (UPR), tail $\mathrm{Reserve}^{\%}_{99.5}$ requirements, and explicit inflation diagnostics ($\mathrm{Bias}_{24h}$/OPR), to quantify one-sided reliability risk beyond MAPE. Using this framework, we evaluate state space models (Mamba variants) and strong baselines on a weather-aligned California Independent System Operator (CAISO) dataset spanning Nov 2023–Nov 2025 (84,498 hourly records across 5 regional transmission areas) under a rolling-origin walk-forward backtest. We develop and evaluate thermal-lag-aligned weather fusion strategies for these architectures. Our results demonstrate that standard accuracy metrics are insufficient proxies for operational safety: models with comparable MAPE can imply materially different tail reserve requirements ($\mathrm{Reserve}^{\%}_{99.5}$). We show that explicit weather integration narrows error distributions, reducing the impact of temperature-driven demand spikes. Furthermore, while probabilistic calibration reduces large-error events, it can induce systematic schedule inflation. We introduce Bias/OPR-constrained objectives to enable auditable trade-offs between minimizing tail risk and preventing trivial over-forecasting.

Keywords: Load forecasting, state space models, Mamba, deep learning, probabilistic forecasting, California grid, CAISO

1 Introduction

Short-term load forecasting (STLF) underpins electricity grid operations, governing unit commitment, economic dispatch, and reserve scheduling [1].
The consequences are significant: under-prediction risks cascading blackouts, while over-prediction incurs unnecessary costs and emissions.

California's grid exemplifies the challenges facing high-renewable systems. Utility-scale solar and wind contribute roughly one quarter of CAISO system energy [2], with behind-the-meter (BTM) solar capacity exceeding 17 GW in California [3]. This invisible generation creates the "duck curve", a deep midday net-load trough followed by steep evening ramps, which evolves annually as BTM capacity expands. Compounding this non-stationarity, climate-driven extreme weather events increasingly trigger non-linear demand spikes that defy historical patterns [4].

Existing forecasting methods face fundamental limitations. Statistical approaches (ARIMA) cannot capture non-linear weather dependencies [5]. Recurrent networks (LSTMs) struggle with long-range dependencies due to vanishing gradients. Transformer architectures address this but introduce quadratic $O(n^2)$ complexity, limiting practical context lengths for capturing multi-week seasonal patterns [6].

State space models (SSMs) offer a compelling path to reliability. By achieving linear $O(n)$ scaling, Mamba [7] allows for extended historical contexts and lower inference latency, freeing computational budget for robust uncertainty quantification and ensemble methods. However, Mamba's ability to maintain safety margins in regional grid forecasting, a setting characterized by asymmetric error costs and heteroscedasticity, remains unexplored.

We present a systematic evaluation of Mamba architectures for reliable California grid load forecasting. Benchmarking against iTransformer and the foundation model Chronos, we shift the evaluation focus from pure accuracy to operational safety. Our walk-forward results highlight that accuracy alone is not a sufficient proxy for operational risk: $\mathrm{Reserve}^{\%}_{99.5}$ can vary substantially across models even when MAPE is comparable. Furthermore, we demonstrate that explicit weather integration is critical for safety by narrowing error distributions and reducing extreme error events (Section 6.3).

Contributions

To move beyond symmetric accuracy metrics toward operator-relevant reliability for California grid forecasting, we make the following contributions:

1. Grid-Specific Evaluation Framework. We formalize operational risk metrics (Under-Prediction Rate, tail $\mathrm{Reserve}^{\%}_{99.5}$ requirements, Bias/OPR diagnostics) that capture asymmetric operational costs and prevent "fake safety" via systematic forecast inflation (Section 2.3).
2. Weather Integration for SSMs. We develop (Section 4.2) and systematically evaluate (Section 6.3) thermal-lag-aligned weather fusion strategies for Mamba architectures, improving robustness by narrowing error distributions and reducing extreme errors.
3. Loss vs. Safety Analysis. We show that proper probabilistic calibration (multi-quantile pinball) can reduce large-error events on fixed evaluation splits, but can also introduce systematic inflation unless Bias/OPR are explicitly constrained and jointly reported with tail-risk metrics (Section 6).

2 Background

2.1 State Space Models

State space models (SSMs) provide a principled framework for modeling sequential data through continuous-time dynamical systems. The general linear SSM is defined by the following ordinary differential equations:

$$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t) \tag{1}$$

where $x(t) \in \mathbb{R}$ is the input signal, $h(t) \in \mathbb{R}^{N}$ is the latent state with dimension $N$, and $y(t) \in \mathbb{R}$ is the output. The matrices $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$, and $D \in \mathbb{R}$ are learnable parameters. For discrete-time computation on sampled data, the continuous SSM must be discretized.
Using zero-order hold discretization with time step $\Delta$, the discrete-time SSM becomes:

$$h_k = \bar{A} h_{k-1} + \bar{B} x_k, \qquad y_k = C h_k \tag{2}$$

where $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1} (\exp(\Delta A) - I) \cdot \Delta B$.

Selective state spaces. Traditional SSMs employ time-invariant (LTI) parameters. The Mamba architecture [7] introduces a selective mechanism that makes these parameters input-dependent:

$$B_k = s_B(x_k), \qquad C_k = s_C(x_k), \qquad \Delta_k = \mathrm{softplus}(s_\Delta(x_k)) \tag{3}$$

where $s_B$, $s_C$, and $s_\Delta$ are learned projections. This selectivity enables the model to filter relevant information while forgetting irrelevant context, addressing a fundamental limitation of prior SSMs.

Suitability for grid data. Recent analysis by Wang et al. [8] identifies conditions where Mamba outperforms Transformers: datasets with "numerous variates, most of which are periodic." Grid load data matches this characterization precisely: highly periodic signals (daily/weekly cycles) across multiple spatial nodes. Mamba's selective state mechanism encodes these rhythmic dependencies while filtering stochastic noise.

Computational complexity. SSMs achieve $O(n)$ complexity versus $O(n^2)$ for self-attention. Menati et al. [9] demonstrate that context windows of 240+ hours are optimal for capturing multi-scale grid dynamics, a length that is computationally prohibitive for standard Transformers.

2.2 Load Forecasting Fundamentals

Definition and Horizons. Short-term load forecasting (STLF) encompasses prediction of electricity demand from hours to days ahead [10]. Different forecast horizons serve distinct operational purposes: 1-hour for real-time dispatch, 4–6 hours for unit commitment, and 12–24 hours for day-ahead market operations.

Multi-Periodicity. Electricity load exhibits strong multi-periodic patterns (daily, weekly, annual). Our choice of 240-hour (10-day) context length ensures capture of at least one complete weekly cycle.
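
As a rough illustration of the recurrence in Eq. (2) at the 240-step context length discussed here, the following sketch discretizes a small diagonal SSM with zero-order hold and scans it over a constant input. The diagonal form of $A$, the state size, the parameter values, and the function names `zoh_discretize` and `ssm_scan` are assumptions made for illustration; none of them is taken from the evaluated models, and the time-invariant parameters correspond to Eq. (2) rather than to the selective variant in Eq. (3).

```python
import numpy as np


def zoh_discretize(a_diag, b, delta):
    """Zero-order hold for a diagonal state matrix (illustrative assumption).

    For diagonal A, Eq. (2) reduces elementwise to
    A_bar = exp(delta * a) and B_bar = (exp(delta * a) - 1) / a * b.
    """
    a_bar = np.exp(delta * a_diag)
    b_bar = (a_bar - 1.0) / a_diag * b
    return a_bar, b_bar


def ssm_scan(a_bar, b_bar, c, x):
    """Run h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k over a 1-D input."""
    h = np.zeros_like(a_bar)
    ys = np.empty(len(x))
    for k, x_k in enumerate(x):  # single pass over the sequence: O(n) in length
        h = a_bar * h + b_bar * x_k
        ys[k] = c @ h
    return ys, h


# Hypothetical stable system with three decay rates (not fitted to any data).
a_diag = np.array([-0.5, -1.0, -2.0])
b = np.ones(3)
c = np.array([1.0, 0.5, 0.25])

a_bar, b_bar = zoh_discretize(a_diag, b, delta=1.0)
ys, h_final = ssm_scan(a_bar, b_bar, c, x=np.ones(240))  # 240-step context

# For a constant input, the discrete state settles at the continuous-time
# steady state -b / a, which zero-order hold preserves exactly.
print(h_final, -b / a_diag)
```

The explicit loop is kept for readability; it makes the linear cost in sequence length visible but is not how an optimized SSM kernel would be written.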

Evaluation Metrics. For load forecasting with positive data, mean absolute percentage error (MAPE) is the standard metric:

$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \tag{4}$$

2.3 Grid-Specific Evaluation Metrics

Standard metrics like MAPE and RMSE treat forecast errors symmetrically: an over-prediction of 100 MW is penalized identically to an under-prediction of 100 MW. For grid utility operators, however, the economic and reliability consequences of these errors are fundamentally asymmetric [1].

2.3.1 Operational Cost Asymmetry

• Over-Prediction (False Positive): If the model predicts higher load than the actual demand, the utility commits excess generation. This results in financial loss (wasted fuel, curtailment costs) but rarely threatens system stability.
• Under-Prediction (False Negative): If the model predicts lower load than the actual demand, the utility may face a supply shortfall. This necessitates deploying expensive fast-ramping reserves (e.g., peaker plants), purchasing emergency power at spot market caps (often \$1,000+/MWh), or in extreme cases, initiating rolling blackouts.

Standard MAPE is therefore insufficient for operational validation. To address this, we introduce a suite of grid-specific metrics:

2.3.2 Proposed Metrics

Under-Prediction Rate (UPR). The frequency of under-estimation events represents the probability of needing real-time upward dispatch:

$$\mathrm{UPR} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(y_i > \hat{y}_i) \times 100\% \tag{5}$$

A naive model might achieve low MAPE by constantly under-predicting (taking the median); UPR exposes this risky behavior.

Over-Prediction Rate (OPR).
The frequency of over-forecasting events exposes systematic inflation in the scheduled (median) forecast:

$$\mathrm{OPR} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(\hat{y}_i > y_i) \times 100\% \tag{6}$$

In safety-critical settings, OPR provides a simple operator-legible diagnostic for whether improvements in one-sided tail risk are achieved by trivial over-scheduling.

Upward Reserve Requirement ($\mathrm{Reserve}^{\%}_{99.5}$). The additional upward capacity required to cover 99.5% of the under-forecast error distribution. We use the 99.5th percentile as a practical tail-risk proxy motivated by probabilistic adequacy practice (e.g., LOLE-style planning criteria [11]). We report this both in megawatts (MW) and as a percentage of the point forecast (scale-free):

$$\mathrm{Reserve}^{\mathrm{MW}}_{99.5} = \mathrm{Percentile}_{99.5}\left(\max(0,\; y - \hat{y})\right) \tag{7}$$

$$\mathrm{Reserve}^{\%}_{99.5} = 100 \times \mathrm{Percentile}_{99.5}\left(\max\left(0,\; \frac{y - \hat{y}}{\hat{y}}\right)\right) \tag{8}$$

Adding $\mathrm{Reserve}^{\mathrm{MW}}_{99.5}$ to the point forecast covers 99.5% of historical under-prediction events at the evaluated horizon. The MW form is normalization-free, while the percentage form is directly interpretable as an add-on to the scheduled point forecast. Because percentage-normalized reserve can be biased by systematic forecast inflation, we interpret $\mathrm{Reserve}^{\%}_{99.5}$ jointly with UPR and explicit bias diagnostics ($\mathrm{Bias}_{24h}$/OPR) to ensure improvements do not come from trivial over-forecasting.

2.3.3 Motivating Example

To illustrate the operational reality of these metrics, we examine empirical results (detailed later in Table 2) where tail-risk metrics and bias diagnostics move in different directions. For example, a model trained with a tail-focused loss might ostensibly reduce reserve requirements ($\mathrm{Reserve}^{\%}_{99.5}$), but do so by simply inflating the entire forecast, increasing $\mathrm{Bias}_{24h}$ and Over-Prediction Rate (OPR) to unacceptable levels.
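
The interaction described in this example can be reproduced on synthetic data. The sketch below computes the metrics of Eqs. (4)–(8), plus a mean bias defined as forecast minus actual (the sign convention of Eq. (16)), and compares a noisy forecast against the same forecast shifted upward by a constant. The load series, the noise level, the 500 MW shift, and the function name `risk_metrics` are illustrative assumptions and are not results from this study.

```python
import numpy as np


def risk_metrics(y, y_hat):
    """Accuracy, one-sided risk, and inflation diagnostics for one forecast."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    shortfall_mw = np.maximum(0.0, y - y_hat)             # under-forecast error
    shortfall_pct = np.maximum(0.0, (y - y_hat) / y_hat)  # relative to forecast
    return {
        "mape": 100.0 * np.mean(np.abs((y - y_hat) / y)),               # Eq. (4)
        "upr": 100.0 * np.mean(y > y_hat),                              # Eq. (5)
        "opr": 100.0 * np.mean(y_hat > y),                              # Eq. (6)
        "reserve_mw_995": np.percentile(shortfall_mw, 99.5),            # Eq. (7)
        "reserve_pct_995": 100.0 * np.percentile(shortfall_pct, 99.5),  # Eq. (8)
        "bias": np.mean(y_hat - y),  # forecast minus actual, as in Eq. (16)
    }


# Synthetic hourly load in MW (hypothetical values, not CAISO data).
rng = np.random.default_rng(0)
hours = np.arange(2000)
y = 25_000 + 5_000 * np.sin(2 * np.pi * hours / 24)
baseline = y + rng.normal(0.0, 500.0, size=y.size)  # roughly unbiased forecast
inflated = baseline + 500.0                          # same forecast, shifted up

for name, forecast in [("baseline", baseline), ("inflated", inflated)]:
    metrics = risk_metrics(y, forecast)
    print(name, {k: round(float(v), 2) for k, v in metrics.items()})
```

In this construction the shifted forecast lowers UPR and both reserve figures while raising OPR and bias, which is the pattern that joint reporting is meant to expose.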
This motivates our core reporting principle: any improvement in one-sided risk metrics must be interpreted jointly with bias diagnostics to avoid "fake safety" via systematic inflation.

2.3.4 Differentiable Surrogates for Risk Metrics

While UPR and tail reserve requirements are excellent for evaluation, they are non-differentiable and difficult to optimize directly. To bridge this gap, we employ a Bias-Constrained Probabilistic Objective (detailed in Section 4.3) as a differentiable surrogate during training. This objective combines a weighted multi-quantile pinball loss to learn a calibrated distribution with explicit hinge penalties on bias and over-prediction rate to prevent systematic forecast inflation.

In principle, emphasizing upper-tail quantiles increases the gradient penalty on under-predictions relative to over-predictions, aligning learning with asymmetric operational costs. However, as we demonstrate in Section 6, single-quantile training can be "gamed"; our constrained formulation (Section 4.3) ensures that improvements in tail risk do not come at the expense of acceptable schedule bias.

2.4 Value-Oriented Learning

Traditional forecasting objectives (MSE/MAPE) operate under the assumption that prediction accuracy proxies decision quality. However, in grid operations, the cost function is highly asymmetric and constrained. The "Value-Oriented" or "Smart Predict-then-Optimize" paradigm [12] challenges the separation of prediction and downstream optimization, proposing that models should be trained to minimize the final operational cost. While theoretically optimal, full end-to-end bilevel optimization is often computationally intractable for large-scale deep learning.
In this work, we adopt a scalable realization of this philosophy: instead of embedding the full extensive-form optimization solver, we encode the operational value function directly into a bias-constrained probabilistic objective (Section 4.3).

3 Related Work

3.1 Deep Learning for Time Series

Deep learning has systematically displaced statistical methods (ARIMA, SVR) for short-term load forecasting due to its superior ability to model non-linear relationships. Current research focuses on two primary paradigms for long-context multivariate modeling:

Transformer-based Approaches. Computing global attention allows Transformers to capture long-range dependencies, but with $O(n^2)$ complexity. Two main strategies have emerged to handle multivariate data:

• Channel Independence: Models like PatchTST [13] treat multivariate time series as independent channels and segment them into patches. This reduces complexity and represents the current state-of-the-art for Transformer accuracy, though it sacrifices explicit cross-variable correlation modeling.
• Multivariate Attention: Models like iTransformer [14] explicitly model correlations between variables by inverting tokenization (embedding time steps as features). We study iTransformer as a primary baseline because its explicit modeling of spatial correlations (e.g., between load and weather variables) offers a direct contrast to the implicit state mixing of Mamba.

Efficient Sequence Modeling (SSMs). State space models offer a fundamentally different approach. Instead of the quadratic global attention of Transformers, SSMs like Mamba [7] employ a recurrent mode with input-dependent selection, achieving linear $O(n)$ scaling. This efficiency allows for significantly longer context windows (240+ hours) on the same hardware, potentially capturing seasonal patterns that are computationally prohibitive for attention-based models.

Linear Models.
Recent work has challenged the necessity of deep learning for time series forecasting. DLinear [15] demonstrates that simple linear models often match or exceed Transformer performance on standard benchmarks, raising questions about architectural complexity. However, linear models cannot capture non-linear temperature-load relationships that dominate during extreme weather events, which is precisely when forecast accuracy is most critical for grid operations.

While recent "value-oriented" and end-to-end approaches propose embedding optimization layers directly into the training loop [12], these methods can be computationally prohibitive for large-scale deep learning sequence models. We instead propose a bias-constrained objective as a scalable, differentiable surrogate to align training with asymmetric operational costs. Our evaluation focuses on models capable of learning these non-linear interactions while maintaining computational tractability. This work systematically evaluates whether the theoretical efficiency advantage of Mamba translates to superior accuracy on complex, real-world grid data compared to the expressive power of multivariate Transformers.

3.2 State Space Models for Time Series

The Structured State Space (S4) model [16] introduced efficient long-range sequence modeling through careful parameterization of continuous-time state spaces. Mamba [7] extended this framework with input-dependent (selective) parameters, enabling content-aware filtering that improves performance on language and genomics tasks. For time series specifically, S-Mamba [8] demonstrated that Mamba architectures excel on datasets with "numerous variates, most of which are periodic", a characterization that matches grid load data.
PowerMamba [9] introduced bidirectional processing and series decomposition specifically for energy applications, showing that 240+ hour context windows are optimal for capturing multi-scale grid dynamics.

3.3 End-to-End and Value-Oriented Learning

Recent advancements in "Smart Predict-then-Optimize" (SPO) have challenged the traditional separation of forecasting and decision-making. Value-oriented approaches [12] propose bi-level optimization frameworks where the forecasting model is updated based on the quality of downstream decisions (e.g., unit commitment costs) rather than proxy metrics like MSE. Similarly, end-to-end structural learning [17] embeds differentiable optimization layers to disentangle unobservable components (like baseline demand vs. price response) from aggregate signals.

Our work builds on these philosophies but focuses on scalability. Instead of solving extensive optimization problems within the training loop, which is often prohibitive for large-scale transformers, we propose a bias-constrained probabilistic objective as a differentiable surrogate that aligns the model with asymmetric operational values while maintaining the efficiency of standard backpropagation.

3.4 Weather-Integrated Forecasting

The relationship between weather and electricity demand is well-established [1]. Temperature-based load models form the basis of many operational forecasting systems, with heating and cooling degree days serving as primary predictors. However, integrating weather into deep learning architectures remains challenging due to temporal misalignment: building thermal mass introduces response lags of 2–6 hours between temperature changes and load response [18]. Prior work has addressed this through feature engineering and attention-based fusion strategies [5], but systematic evaluation of weather integration strategies for state space models is lacking.

3.5 Foundation Models for Time Series

Pre-trained foundation models have recently emerged as alternatives to task-specific architectures. Chronos [19] adapts language model pre-training to time series through quantization, demonstrating strong zero-shot performance on diverse forecasting benchmarks. However, the effectiveness of foundation models on domain-specific tasks with strong exogenous dependencies (e.g., weather-driven load) remains an open question, as these models typically lack mechanisms for incorporating auxiliary covariates.

4 Methodology

4.1 Model Architectures

We evaluate three Mamba-based architectures representing different design philosophies within the state space model paradigm, alongside the iTransformer baseline.

4.1.1 S-Mamba: The Linear-Time Baseline

S-Mamba [8] represents a minimalist approach to state space modeling. It adapts the standard Mamba block for time series via a simple encoder-decoder structure: a linear projection embeds the input sequence into a hidden representation of dimension $D$, followed by stacked Mamba layers. The primary goal of S-Mamba is to test whether the core selective state space mechanism alone, without complex decomposition or attention, is sufficient to model grid dynamics. Its design prioritizes computational efficiency ($O(n)$ complexity) above all else, making it an ideal candidate for resource-constrained edge deployment (0.08 ms inference latency).

4.1.2 PowerMamba: Physics-Aware Decomposition

PowerMamba [9] addresses a specific limitation of standard SSMs: the difficulty of modeling disparate frequencies (e.g., long-term trend vs. daily seasonality) within a single state vector. PowerMamba introduces two critical innovations for energy data:

• Series Decomposition: It splits the input into "Trend" (low-frequency) and "Seasonal" (high-frequency) components, and processes them via independent Mamba encoders.
This explicitly separates the "duck curve" drift from daily load cycles.
• Bidirectionality: Unlike language modeling where causality is strict (future cannot affect past), time series analysis benefits from looking forward (e.g., smoothing). PowerMamba processes sequences in both forward and backward directions to capture a holistic temporal context.

4.1.3 Mamba-ProbTSF: Probabilistic Output Heads

Grid operators rarely make decisions based on a single deterministic number; they require confidence intervals to schedule reserves. Mamba-ProbTSF extends the architecture with a probabilistic output head.

The varying nature of renewable generation introduces heteroscedastic noise, that is, uncertainty that changes over time (e.g., higher uncertainty at sunset). Mamba-ProbTSF replaces the standard linear readout with a Gaussian head that predicts both mean $\mu$ and variance $\sigma^2$, trained with the Gaussian negative log-likelihood (up to an additive constant):

$$\mathcal{L} = \frac{1}{2} \left[ \log(\sigma^2) + \frac{(y - \mu)^2}{\sigma^2} \right] \tag{9}$$

This shift from minimizing MSE to maximizing likelihood enables the model to quantify "known unknowns," providing actionable risk metrics for uncertainty-aware dispatch.

Alignment with our evaluations. In our main walk-forward results we evaluate this Gaussian-head variant as a probabilistic baseline; in the fixed-split loss-function ablations we fine-tune weather-initialized checkpoints with multi-quantile heads to study calibration and Bias/OPR-constrained objectives (Section 4.3, Table 2).

4.1.4 iTransformer: The Strongest Multivariate Baseline

The iTransformer [14] challenges the standard Transformer paradigm. Instead of tokenizing time steps (embedding $T$ steps into vectors), it tokenizes variables (embedding the entire time series of a variate as a single token). Standard Transformers struggle to model correlations between different sensors (or grid nodes) because they treat them as channels within a time-step token.
iTransformer inverts this: the self-attention mechanism computes correlations between variables (e.g., Load vs. Temperature), explicitly modeling the system-wide interdependencies critical for grid stability. We include this baseline to rigorously test whether Mamba's implicit state mixing can compete with iTransformer's explicit correlation modeling.

4.2 Weather-Integrated Architectures

Motivated by the dependency of load on weather (analyzed in Section 6), we developed weather-integrated variants of each architecture. The key challenge is fusing meteorological covariates (temperature, humidity, solar radiation) with the load signal while respecting building thermal lag: HVAC systems respond to temperature changes with delays of 2–6 hours depending on building mass.

Fusion Strategies. S-Mamba uses Early Fusion (Fig. 1a): load and weather are concatenated at the input, requiring a wider embedding layer. PowerMamba uses Summation Fusion (Fig. 1b): weather features are added to both decomposed streams. Mamba-ProbTSF also uses summation fusion (Fig. 1c) with careful normalization to preserve uncertainty estimates. iTransformer uses Tokenization Fusion (Fig. 1d): each weather variable becomes a separate token, enabling cross-variable attention at the cost of increased complexity.

Fig. 1: Weather integration strategies for state space models and Transformers. Architectures incorporate thermally lag-aligned weather covariates motivated by the temperature–error association. (a) Early fusion by concatenation (S-Mamba). (b) Summation into decomposed streams (PowerMamba). (c) Summation fusion with Pre-Norm FFN and probabilistic head (Mamba-ProbTSF). (d) Tokenization of weather variables as additional tokens for cross-attention (iTransformer).

4.3 Bias-Constrained Probabilistic Objective

Grid operations induce an asymmetric cost for forecast errors: under-predictions can require expensive upward balancing and reserves, while over-predictions typically incur lower (but non-zero) costs from over-commit, curtailment, and redispatch. We represent this asymmetry using a cost ratio

$$\rho_{\mathrm{op}} = \frac{C_{\mathrm{under}}}{C_{\mathrm{over}}}, \tag{10}$$

where $C_{\mathrm{under}}$ and $C_{\mathrm{over}}$ are marginal costs per MWh associated with under- and over-forecasting, respectively. In practice, $C_{\mathrm{under}}$ includes reliability and scarcity costs that are not fully captured by average market spreads. We therefore use CAISO market data to estimate a bias-independent price-spread asymmetry anchor $\rho_{\mathrm{price}}$ (Appendix F), and map it to an operational stance via an explicit reliability premium $\kappa \ge 1$:

$$\rho_{\mathrm{op}} = \kappa \cdot \rho_{\mathrm{price}}. \tag{11}$$

This makes the risk-aversion choice auditable: $\rho_{\mathrm{price}}$ grounds asymmetry empirically, while $\kappa$ documents the operator policy preference.

From cost asymmetry to an operational target quantile. Under a piecewise-linear asymmetric cost model, the Bayes-optimal quantile is

$$q^{*} = \frac{C_{\mathrm{under}}}{C_{\mathrm{under}} + C_{\mathrm{over}}} = \frac{\rho}{1 + \rho}. \tag{12}$$

We set a conservative risk-averse training quantile by combining the market anchor with the operator stance:

$$q_{\mathrm{target}} = \max\left(0.5,\; \frac{\rho_{\mathrm{op}}}{1 + \rho_{\mathrm{op}}}\right), \tag{13}$$

ensuring we never train below the median when the objective is reliability-seeking. (Appendix F reports that $\rho_{\mathrm{price}}$ can imply $q^{*}_{\mathrm{price}} < 0.5$ in calendar-year averages; we interpret this as evidence that market spreads alone are not a complete proxy for reliability cost, motivating the explicit $\kappa$ premium.)

Proper probabilistic training (multi-quantile pinball).
Rather than training a single high quantile (which can induce systematic forecast inflation), we train a calibrated predictive distribution using multiple quantiles $\mathcal{Q}$ and a weighted proper scoring rule. For each $q \in \mathcal{Q}$, define the pinball loss

$$\ell_q(y, \hat{Q}_q) = \max\left( q\,(y - \hat{Q}_q),\; (q - 1)(y - \hat{Q}_q) \right), \tag{14}$$

and the weighted multi-quantile objective over horizon $H$:

$$\mathcal{L}_{\mathcal{Q}} = \sum_{q \in \mathcal{Q}} w_q \, \frac{1}{H} \sum_{h=1}^{H} \ell_q\left( y_{t+h}, \hat{Q}_{q,t+h} \right). \tag{15}$$

This objective is strictly proper for quantile forecasts; when $\mathcal{Q}$ is dense, it approximates distributional scores such as CRPS. Multi-quantile pinball is also the standard evaluation objective in benchmark competitions for probabilistic load forecasting (e.g., GEFCom), making it familiar and auditable for practitioners.

Why single-quantile pinball can be "gamed." Training only a high quantile (e.g., $q = 0.9$) can reduce under-prediction frequency by shifting the conditional location upward. This may improve one-sided risk metrics (e.g., UPR, $\mathrm{Reserve}^{\%}_{99.5}$) while introducing systematic positive bias that is operationally unacceptable (inflating schedules and masking true uncertainty). We therefore couple probabilistic training with explicit bias controls and report bias diagnostics alongside risk metrics.

Bias constraint to avoid trivial "safety" via over-forecasting. To prevent degeneracy via systematic inflation, we add an explicit positive-bias constraint on the median (scheduled) forecast at a target horizon (e.g., 24h). Let $h^{\star}$ be the index for the target horizon (24h $\Rightarrow h^{\star} = 24$), and define the mean bias at that horizon using the median quantile $q = 0.5$:

$$b_{h^{\star}} = \mathbb{E}\left[ \hat{Q}_{0.5,\,t+h^{\star}} - y_{t+h^{\star}} \right]. \tag{16}$$

We implement a hinge-penalized constraint:

$$\mathcal{L}_{\mathcal{Q}+\mathrm{Bias}} = \mathcal{L}_{\mathcal{Q}} + \lambda_{\mathrm{bias}} \max\left(0,\; b_{h^{\star}} - b_{\max}\right), \tag{17}$$

where $\lambda_{\mathrm{bias}}$ discourages systematic over-forecast (inflation) and $b_{\max}$ is an auditable bias budget (we use $b_{\max} = 0$ by default).
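
A minimal forward-pass sketch of Eqs. (13)–(17) follows. It is written in NumPy for readability; an actual training loop would express the same operations in an automatic-differentiation framework. The array layout (quantile, sample, horizon), the averaging over samples as a stand-in for the expectation in Eq. (16), the zero-based column lookup for the horizon, and all function names and numeric values are assumptions for illustration and are not taken from the training code used in this work.

```python
import numpy as np


def target_quantile(rho_price, kappa=1.0):
    """Eqs. (11) and (13): q_target = max(0.5, rho_op / (1 + rho_op))."""
    rho_op = kappa * rho_price
    return max(0.5, rho_op / (1.0 + rho_op))


def pinball(y, q_hat, q):
    """Eq. (14), applied elementwise."""
    diff = y - q_hat
    return np.maximum(q * diff, (q - 1.0) * diff)


def multi_quantile_loss(y, q_hats, quantiles, weights):
    """Eq. (15), additionally averaged over samples (an assumption).

    y has shape (n_samples, H); q_hats has shape (n_quantiles, n_samples, H).
    """
    return sum(
        w * pinball(y, q_hat, q).mean()
        for q, q_hat, w in zip(quantiles, q_hats, weights)
    )


def bias_constrained_loss(y, q_hats, quantiles, weights,
                          h_star=24, lam_bias=1.0, b_max=0.0):
    """Eq. (17): pinball objective plus a hinge penalty on median bias."""
    median = q_hats[list(quantiles).index(0.5)]
    # h_star is 1-based in the text, so it maps to column h_star - 1.
    bias = np.mean(median[:, h_star - 1] - y[:, h_star - 1])  # Eq. (16)
    base = multi_quantile_loss(y, q_hats, quantiles, weights)
    return base + lam_bias * max(0.0, bias - b_max)


# Toy check with hypothetical numbers: a median forecast inflated by 100 MW.
quantiles, weights = [0.1, 0.5, 0.9], [1.0, 1.0, 1.0]
y = np.full((4, 48), 20_000.0)
q_hats = np.stack([y - 400.0, y + 100.0, y + 600.0])

print(target_quantile(rho_price=0.5, kappa=6.0))  # rho_op = 3
print(multi_quantile_loss(y, q_hats, quantiles, weights))
print(bias_constrained_loss(y, q_hats, quantiles, weights, lam_bias=2.0))
```

With $b_{\max} = 0$, the penalty is active only when the median forecast is biased upward at the target horizon; a downward-biased median leaves the pinball objective unchanged.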

OPR-style constraint (frequency of over-forecasting). Bias controls the mean shift but does not directly bound the frequency of over-predictions. We therefore optionally constrain the over-prediction rate at horizon $h^{\star}$ by penalizing violations of an OPR budget $\pi_{\max}$:

$$\mathrm{OPR}_{h^{\star}} = \mathbb{E}\left[ \mathbb{I}\left( \hat{Q}_{0.5,\,t+h^{\star}} > y_{t+h^{\star}} \right) \right], \tag{18}$$

approximating the indicator smoothly with a sigmoid temperature $\tau$:

$$\widetilde{\mathrm{OPR}}_{h^{\star}} = \mathbb{E}\left[ \sigma\left( \frac{\hat{Q}_{0.5,\,t+h^{\star}} - y_{t+h^{\star}}}{\tau} \right) \right], \tag{19}$$

and adding a hinge penalty:

$$\mathcal{L}_{\mathcal{Q}+\mathrm{Bias}+\mathrm{OPR}} = \mathcal{L}_{\mathcal{Q}+\mathrm{Bias}} + \lambda_{\mathrm{opr}} \max\left(0,\; \widetilde{\mathrm{OPR}}_{h^{\star}} - \pi_{\max}\right). \tag{20}$$

Practically, we use these constrained objectives during fine-tuning to learn a calibrated distribution while preventing trivial "safety" via inflation. We then use $\rho_{\mathrm{op}}$ (via $q_{\mathrm{target}}$) to select an operationally conservative schedule or reserve policy from the learned distribution, and we report Bias/OPR jointly with tail-risk metrics to ensure improvements are not attributable to trivial inflation. Appendix F describes how we empirically estimate $\rho_{\mathrm{price}}$ from CAISO market data [20].

5 Experimental Setup

5.1 Dataset

We use hourly load data from CAISO (Nov 2023–Nov 2025), comprising 84,498 hourly records across 5 major transmission access charge (TAC) areas: CAISO-TAC (system aggregate), PGE-TAC, SCE-TAC, SDGE-TAC, and TIDC. Following PowerMamba [9], we adopt a 240-hour (10-day) context window to capture at least one complete weekly cycle, enabling models to learn both daily and weekly periodicity patterns.

Preprocessing. Load values are normalized using z-score standardization computed only on training data to prevent information leakage. Temporal features include hour-of-day (0–23) and day-of-week (0–6), encoded as sinusoidal embeddings following standard practice [6]. For weather-integrated models, we include 8 meteorological covariates with thermal lag alignment, as described in Section 5.3.
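
The preprocessing steps above can be sketched as follows. The sine/cosine pair is one common form of sinusoidal time encoding and is assumed here, since the exact embedding is not specified; `best_lag` follows the lag selection rule that Section 5.3 defines in Eq. (21). All function names, the 800-point training split, and the synthetic series are hypothetical.

```python
import numpy as np


def zscore_from_train(series, n_train):
    """Standardize using mean/std from the first n_train points only."""
    mu = series[:n_train].mean()
    sigma = series[:n_train].std()
    return (series - mu) / sigma, mu, sigma


def cyclical_encoding(values, period):
    """Map a cycle position to a (sin, cos) pair (assumed encoding form)."""
    angle = 2.0 * np.pi * np.asarray(values) / period
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)


def best_lag(weather, load, max_lag=12):
    """Return the lag in 0..max_lag that maximizes |Corr(w_{t-lag}, L_t)|."""
    corrs = []
    for lag in range(max_lag + 1):
        lagged_weather = weather[: len(weather) - lag]  # w_{t-lag}
        current_load = load[lag:]                       # L_t
        corrs.append(abs(np.corrcoef(lagged_weather, current_load)[0, 1]))
    return int(np.argmax(corrs))


# Synthetic example (not CAISO data): load responds to weather 3 steps later.
rng = np.random.default_rng(0)
weather = rng.normal(size=1000)
load = np.roll(weather, 3) + 0.1 * rng.normal(size=1000)

load_z, mu, sigma = zscore_from_train(load, n_train=800)
hour_features = cyclical_encoding(np.arange(24), period=24)  # shape (24, 2)
dow_features = cyclical_encoding(np.arange(7), period=7)     # shape (7, 2)
lag = best_lag(weather, load)
# Shift the covariate by the selected lag; the first `lag` entries wrap
# around and would be dropped in practice.
aligned_weather = np.roll(weather, lag)
print(lag, hour_features.shape, dow_features.shape)
```

Computing the statistics on the training slice and reusing them for later data is what keeps evaluation-period information out of the normalization.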

5.2 CAISO Market Context and Benchmarks

CAISO operations place a premium on day-ahead reliability: the Day-Ahead Market (DAM) requires hourly schedules submitted by 10:00 AM the prior day, while the Real-Time Market (RTM) clears at 5-minute granularity. A defining feature of the CAISO region is the rapid growth of behind-the-meter (BTM) solar; California BTM solar capacity exceeds 17 GW [3]. This "invisible" generation deepens the midday net-load trough and steepens evening ramps (the "duck curve"), increasing the importance of weather-aware forecasting.

Operational forecast performance provides a practical benchmark. During the July 2024 heat wave (July 4–12), third-party analysis reported CAISO's day-ahead forecast achieved 4.55% MAPE, while commercial forecasting services achieved 2.65% MAPE (41% improvement) [21]. The same analysis reported a CAISO peak load forecast error of 3,211 MW during the event, highlighting the operational importance of reducing tail errors during extreme weather.

5.3 Thermal Lag Parameter Determination

A critical design decision for weather-integrated forecasting is the temporal alignment between meteorological variables and load response. Buildings do not respond instantaneously to temperature changes; rather, the thermal mass of walls, floors, and furnishings creates a delayed response in HVAC demand.

Cross-Correlation Analysis. To determine the optimal lag parameters empirically, we computed the Pearson correlation coefficient between each weather covariate $w_t$ and load $L_t$ at lags $\tau \in \{0, 1, \ldots, 12\}$ hours:

$$\rho(\tau) = \mathrm{Corr}(w_{t-\tau}, L_t) \tag{21}$$

We selected the optimal lag $\tau^{*} = \arg\max_{\tau} |\rho(\tau)|$ for each covariate.

Identified Lag Parameters. Analysis of the dataset reveals distinct response patterns consistent with building physics [18]:

• Temperature (dry bulb): $\tau^{*} = 3$ hours ($\rho = 0.72$), reflecting the thermal time constants of commercial building envelopes (typically 2–4 hours).
• Solar radiation (GHI): $\tau^{*} = 1$ hour ($\rho = 0.45$), corresponding to rapid solar heat gain through fenestration.
• Humidity: $\tau^{*} = 3$ hours ($\rho = 0.38$), tracking latent cooling load dynamics.
• Wind speed: $\tau^{*} = 0$–$1$ hour ($\rho = 0.12$), indicating immediate ventilation effects.

Horizon-Adaptive Alignment. These empirically determined lags are applied during feature construction. When forecasting load at time $t + h$ for horizon $h$, we align weather features from time $t$ to predict load response at $t + \tau^{*} + h$, ensuring consistent temporal causality across all forecast horizons.

Evaluation Protocol (Walk-Forward Backtest). We evaluate using rolling-origin walk-forward backtesting over Nov 2023–Nov 2025. Starting with an initial 180-day training history, we refit every 90 days using an expanding training window; each refit uses a 30-day validation window immediately preceding the cutoff for early stopping. We then evaluate one forecast per day (stride 24h) over the subsequent block until the next cutoff (48h horizon). All reported metrics in Tables 1–2 are computed on the concatenated set of walk-forward timestamps (477 evaluation windows for CAISO-TAC).

5.4 Models Under Comparison

We evaluate three Mamba-based architectures against established baselines, enabling direct comparison across parameter efficiency and accuracy.

Proposed Mamba Architectures.

• S-Mamba (16.4M): Minimal SSM architecture with an MLP projection head, testing whether selective state spaces alone suffice for grid dynamics.
• PowerMamba (2.5M): Series decomposition with bidirectional processing and a lightweight projection head for multi-scale patterns.
• Mamba-ProbTSF (16.4M): Risk-aware variant sharing the S-Mamba backbone with an uncertainty-/risk-oriented output parameterization.

Baselines.
• LSTM [22]: 2-layer LSTM (0.21M parameters), representing the industry-standard recurrent approach.
• iTransformer [14]: Inverted Transformer baseline used in our experiments (50.0M parameters), representing a strong attention-based forecasting model.
• Chronos-T5-Small [19]: 8M-parameter foundation model evaluated in the zero-shot regime.

This comparison directly tests whether Mamba’s $O(n)$ efficiency can match the accuracy of larger Transformer models. PowerMamba (2.5M parameters) is particularly compact, while S-Mamba and Mamba-ProbTSF (16.4M parameters) trade parameter count for expressive projection heads.

5.5 Model Hyperparameters

Following the experimental protocol of PowerMamba [9], all Mamba variants share consistent architectural hyperparameters: $d_{\mathrm{model}} = 128$ (embedding dimension), $d_{\mathrm{state}} = 16$ (state dimension), $d_{\mathrm{conv}} = 4$ (convolution kernel), expansion factor 2, and 2 bidirectional encoder layers. Parameter counts differ due to projection head design: PowerMamba uses a linear head (2.5M total), while S-Mamba and Mamba-ProbTSF use MLP heads (16.4M total). Dropout rate is 0.1 across all models. Weather integration adds negligible parameters (∼1K for the weather embedding layer).

5.6 Training Protocol

Two evaluation regimes. We use two complementary training/evaluation regimes and clearly label which results come from which:
• Walk-forward backtest (architecture + weather). For the systematic architecture comparisons (Table 1 and related analyses), we use rolling-origin walk-forward evaluation (Section 5.3). Models are trained with AdamW and early stopping on a validation window per refit.
• Fixed-split fine-tuning (loss-function ablation).
For the loss-function ablation and bias/OPR constraints (Table 2, Table 3), we fine-tune weather-initialized checkpoints on a fixed split using the trainer framework: AdamW (lr $10^{-4}$) with OneCycleLR, up to 150 epochs with patience 60, and multi-quantile heads ($Q = [0.025, 0.5, 0.975]$, weights $[4, 1, 4]$). Bias/OPR constraints are applied at the 24h lead time ($h^\star = 24$; index 23 if using 0-based arrays for a 48h horizon forecast).

Reproducibility. All experiments run on NVIDIA RTX 5090 GPUs. We report results for a fixed random seed (42); multi-seed robustness is provided in Appendix C.

5.7 Evaluation Metrics

The primary metric is MAPE evaluated at multiple horizons (1h, 6h, 12h, 24h). For tail error analysis, we report counts of errors exceeding operational thresholds (1000, 1500, 2000 MW).

5.8 Statistical Analysis

We report point-estimate metrics on the walk-forward backtest timestamps. We confirm the significance of our findings using the Diebold-Mariano test (MSE, $h = 24$). PowerMamba significantly outperforms the LSTM baseline ($p < 0.001$) and performs comparably to S-Mamba ($p = 0.89$), verifying that its parameter-efficient design (85% fewer parameters) maintains accuracy. The iTransformer baseline retains a statistically significant advantage over PowerMamba ($p < 0.001$), reflecting the trade-off between Mamba’s linear scalability and the Transformer’s pairwise correlation modeling.

6 Results

6.1 Multi-Horizon Performance (walk-forward)

Evaluation regime. This subsection reports results from the rolling-origin walk-forward backtest described in Section 5.3.

Table 1 presents multi-horizon MAPE. PowerMamba is the strongest Mamba variant at the 1-hour horizon, while iTransformer achieves the lowest MAPE among trained models at longer horizons (6h–24h) in our walk-forward evaluation. These accuracy rankings, however, do not fully determine operational tail risk ($\mathrm{Reserve}^{\%}_{99.5}$), motivating the grid-specific metrics reported later.

Table 1: Multi-horizon accuracy on CAISO-TAC (walk-forward). MAPE (%) across forecast horizons aggregated over a rolling-origin walk-forward backtest (seed 42); lower is better. Trained models report mean ± std across validation folds. Chronos is evaluated in a deterministic zero-shot setting (single run, no std reported).

| Model | Params | 1h | 6h | 12h | 24h |
|---|---|---|---|---|---|
| S-Mamba | 16.4M | 5.49 ± 0.12 | 3.86 ± 0.02 | 3.40 ± 0.47 | 8.03 ± 0.18 |
| Mamba-ProbTSF | 16.4M | 5.23 ± 0.04 | 3.91 ± 0.19 | 3.30 ± 0.08 | 7.89 ± 0.17 |
| PowerMamba | 2.5M | 4.44 ± 0.20 | 4.10 ± 0.09 | 3.62 ± 0.30 | 7.31 ± 0.06 |
| LSTM | 0.21M | 7.85 ± 0.20 | 5.24 ± 0.13 | 9.46 ± 0.32 | 9.99 ± 0.13 |
| iTransformer | 50.0M | 4.64 ± 0.10 | 3.36 ± 0.05 | 3.10 ± 0.21 | 6.85 ± 0.08 |
| Chronos (zero-shot)† | 8.0M | 2.10 | 3.21 | 3.72 | 7.59 |

† Zero-shot evaluation; no training or uncertainty estimation.

6.2 Error Analysis: The Impact of Weather on Forecast Reliability (walk-forward)

Evaluation regime. This subsection uses the walk-forward backtest forecasts (all lead times within the 48h horizon) described in Section 5.3.

The integration of weather data into deep learning models for load forecasting has been extensively studied [1, 5]. Prior work has demonstrated improvements using LSTM-based architectures with temperature features [22], CNN-GRU hybrids for extreme weather scenarios, and attention-based models for cross-variable correlation. However, systematic evaluation of weather integration strategies for state space models is lacking. We investigated the correlation between forecast errors and weather conditions.

Temperature-Error Association. Using all horizons of the CAISO-TAC sliding-window forecasts (113,376 hourly prediction points), we find a statistically significant but modest association between temperature and absolute forecast error (Pearson $r = 0.16$). The fitted slope is 24.1 MW/°C (Fig. 2a).
Hot conditions (>30 °C) exhibit a modest increase in mean error (+3.4%).

Fig. 2: Forecast errors increase during temperature extremes (CAISO-TAC, walk-forward). Using all lead times within 48h horizon forecasts, (a) mean absolute error versus temperature (slope = 24.1 MW/°C; Pearson $r = 0.16$); error bars show 99.5% confidence intervals. (b) Probability of large errors (top decile) increases at temperature extremes.

6.3 Weather Integration Results (walk-forward)

Evaluation regime. This subsection reports walk-forward results on matched evaluation windows (Section 5.3).

Fig. 3: Weather integration narrows the error distribution across utilities (walk-forward, 48h). Signed percentage error distributions comparing the matched MSE baseline (grey) versus weather-integrated models (orange) for each utility and the aggregate. Whiskers indicate the 0.5th to 99.5th percentiles across walk-forward evaluation windows.

Based on the error analysis, we integrated thermal-lagged weather covariates into the architectures. As shown in Figure 3, weather integration changes the forecast error distribution; improvements are not uniform across architectures and regions and should be reported on matched evaluation windows. Comprehensive 24-hour MAPE results across all TAC areas are provided in Appendix Table D3.

Overall, weather integration materially narrows the error distribution in the regimes where temperature-driven load spikes dominate. Next, we test whether probabilistic calibration and explicit Bias/OPR constraints can further reduce tail events without allowing trivial “safety” via inflation.

6.4 Probabilistic Calibration and Bias/OPR-Constrained Objectives (fixed split)

Evaluation regime. This subsection reports fixed-split fine-tuning results for loss-function ablations (Section 5.6); it is not directly comparable to the walk-forward results above.

Loss-function hypothesis.
While weather integration improves robustness, error distributions still exhibit rare large deviations. We hypothesized that some tail events are loss-driven: standard MSE training treats all errors symmetrically and does not target calibrated upper-tail behavior.

Multi-quantile pinball experiment. We fine-tune with a weighted multi-quantile pinball objective (Eq. 15). On the CAISO-TAC fixed evaluation split (24h), this reduces large-error events (e.g., >1000 MW: 1,111 → 917; >2000 MW: 365 → 241) while also improving MAPE (Fig. 4). However, these gains can reflect forecast inflation; we therefore report and (when needed) constrain $\mathrm{Bias}_{24h}$ and Over-Prediction Rate (OPR) alongside tail metrics.

Fig. 4: Multi-quantile pinball reduces large-error events (CAISO-TAC, fixed split, 24h). (a) Error distribution under Weather MSE training (grey) versus Weather → multi-quantile pinball (orange). (b) Counts of large-error events by threshold. Because pinball objectives can shift bias, we interpret tail improvements jointly with Bias/OPR diagnostics in Table 2 and Table 3.

Table 2 presents an operator-legible breakdown (CAISO-TAC, 24h), explicitly reporting $\mathrm{Bias}_{24h}$ and OPR alongside tail-risk $\mathrm{Reserve}^{\%}_{99.5}$. Three high-signal observations:
• Tail metrics can improve via inflation. For iTransformer, Weather → Multi-Q reduces $\mathrm{Reserve}^{\%}_{99.5}$ (28.70% → 13.83%) but does so with large positive bias (+1,862 MW) and high OPR (78.8%), indicating an upward shift in the scheduled level rather than a tighter error distribution.
• Constraints enforce auditable trade-offs. Adding Bias/OPR control reduces inflation for iTransformer ($\mathrm{Bias}_{24h}$ +1,862 → +456 MW; OPR 78.8% → 61.6%) while partially trading off tail risk ($\mathrm{Reserve}^{\%}_{99.5}$ 13.83% → 15.18%).
• Constraint tuning is architecture-dependent.
Some architectures exhibit unstable behavior under the same constraint settings (e.g., PowerMamba inflates under BiasCtrl), reinforcing that risk-aversion should be posed as constrained optimization with auditable budgets rather than a single universal loss.

Table 2: Loss-function ablation with bias/OPR-constrained probabilistic training (fixed split, 24h, CAISO-TAC). All models use weather-integrated checkpoints. Comparison of (i) MSE baseline, (ii) multi-quantile calibration (Multi-Q, Eq. 15), and (iii) Multi-Q + bias/OPR constraint (+BiasCtrl). $\mathrm{Reserve}^{\%}_{99.5}$ is the 99.5th percentile one-sided under-forecast error. $\mathrm{Bias}_{24h}$ and OPR expose trivial “safety” via systematic inflation.

| Model | Loss | MAPE | UPR | Reserve% 99.5 | Bias 24h (MW) | OPR |
|---|---|---|---|---|---|---|
| S-Mamba | MSE | 4.42% | 43.5% | 16.28% | +243.6 | 56.5% |
| S-Mamba | Multi-Q | 3.63% | 34.0% | 13.65% | +405.4 | 66.0% |
| S-Mamba | +BiasCtrl | 3.86% | 31.2% | 12.75% | +507.6 | 68.8% |
| PowerMamba | MSE | 3.95% | 38.4% | 12.91% | +370.9 | 61.6% |
| PowerMamba | Multi-Q | 4.19% | 34.6% | 13.65% | +480.1 | 65.4% |
| PowerMamba | +BiasCtrl | 9.01% | 6.2% | 7.42% | +2027.1 | 93.8% |
| Mamba-ProbTSF | MSE | 4.26% | 44.9% | 14.86% | +176.5 | 55.1% |
| Mamba-ProbTSF | Multi-Q | 3.69% | 36.2% | 14.17% | +355.2 | 63.8% |
| Mamba-ProbTSF | +BiasCtrl | 3.84% | 64.3% | 17.45% | -325.9 | 35.7% |
| iTransformer | MSE | 9.34% | 50.6% | 28.70% | +109.3 | 49.4% |
| iTransformer | Multi-Q | 10.08% | 21.2% | 13.83% | +1862.3 | 78.8% |
| iTransformer | +BiasCtrl | 5.04% | 38.4% | 15.18% | +456.4 | 61.6% |

Effect of probabilistic calibration on tail events. Figure 5 compares (i) Weather MSE, (ii) Weather → Multi-Q, and (iii) Weather → Multi-Q+BiasCtrl on a single consistent evaluation split (same targets across variants). Multi-quantile calibration reduces large-error counts relative to Weather MSE, while adding Bias/OPR constraints partially trades off tail improvements for reduced inflation risk (Table 2).

6.5 NEMs (BTM Registry) Integration Results

Evaluation regime. This subsection reports walk-forward results on utilities where NEMs/registry features are available.
We next evaluate whether static NEMs/registry-derived BTM features (e.g., installed PV and storage capacity) provide incremental predictive value on top of the weather-aligned baseline. This analysis is reported only for utilities where NEMs/registry features were available in our pipeline (PGE, SCE, SDGE, TIDC).

Table 3: Constraint trade-offs (operator-legible) on CAISO-TAC (fixed split, 24h). All models use weather-integrated checkpoints. Reporting $\mathrm{Bias}_{24h}$ and OPR alongside $\mathrm{Reserve}^{\%}_{99.5}$ makes clear when apparent tail-risk gains coincide with systematic inflation.

| Model | Variant | Bias 24h (MW) | Reserve% 99.5 | UPR (%) | OPR (%) |
|---|---|---|---|---|---|
| S-Mamba | MSE | +243.6 | 16.28 | 43.5 | 56.5 |
| S-Mamba | Multi-Q | +405.4 | 13.65 | 34.0 | 66.0 |
| S-Mamba | +BiasCtrl | +507.6 | 12.75 | 31.2 | 68.8 |
| PowerMamba | MSE | +370.9 | 12.91 | 38.4 | 61.6 |
| PowerMamba | Multi-Q | +480.1 | 13.65 | 34.6 | 65.4 |
| PowerMamba | +BiasCtrl | +2027.1 | 7.42 | 6.2 | 93.8 |
| Mamba-ProbTSF | MSE | +176.5 | 14.86 | 44.9 | 55.1 |
| Mamba-ProbTSF | Multi-Q | +355.2 | 14.17 | 36.2 | 63.8 |
| Mamba-ProbTSF | +BiasCtrl | -325.9 | 17.45 | 64.3 | 35.7 |
| iTransformer | MSE | +109.3 | 28.70 | 50.6 | 49.4 |
| iTransformer | Multi-Q | +1862.3 | 13.83 | 21.2 | 78.8 |
| iTransformer | +BiasCtrl | +456.4 | 15.18 | 38.4 | 61.6 |

Fig. 5: Large-error counts under calibration and constraints (fixed split, 24h, CAISO-TAC). Counts by threshold comparing Weather MSE vs Weather → Multi-Q vs Weather → Multi-Q+BiasCtrl on the same fixed evaluation split.

How NEMs features are integrated. NEMs/registry features are static or slow-moving, so we integrate them as an additional BTM input stream that is fused into the weather-integrated forecaster. To avoid physically irrelevant feature injection (e.g., PV capacity “explaining” nighttime net load), we apply a simple daylight prior $d_t \in [0, 1]$ (implemented as a time-of-day gate). When weather inputs are present, we further apply a lightweight context-conditioned modulation so the BTM contribution can vary smoothly by regime (e.g., daylight/irradiance conditions) rather than acting as an unconditional bias.
Concretely, let $x_t$ denote the main embedded inputs (net load history with weather and time features), and let $b$ denote the NEMs/registry feature vector. We compute a BTM embedding $b_{\mathrm{emb}} = \phi(b)$ and modulate it using a context signal $c_t$ (weather+time) via FiLM-style parameters:

$$(\gamma_t, \beta_t) = g(c_t), \qquad b^{\mathrm{mod}}_t = d_t \cdot \left(\gamma_t \odot b_{\mathrm{emb}} + \beta_t\right), \tag{22}$$

then fuse additively before the sequence encoder:

$$z_t = x_t + b^{\mathrm{mod}}_t. \tag{23}$$

Figure 6 summarizes this “weather + NEMs” fusion path.

Fig. 6: NEMs/registry integration on top of weather integration (walk-forward). Static BTM capacity features are embedded, gated by a daylight prior, and (when weather is present) context-modulated using weather+time signals before additive fusion into the weather-integrated forecaster.

Figure 7 summarizes the incremental impact of NEMs integration relative to the weather baseline, measured as changes in MAPE and $\mathrm{Reserve}^{\%}_{99.5}$ (tail under-forecast requirement proxy). Improvements are heterogeneous: in the PV-dominant reference utility with NEMs availability (PGE), NEMs integration reduces both average error and tail-risk reserve requirement. In contrast, other territories show smaller changes and one (SCE) degrades slightly. This pattern is consistent with the mechanism that slowly varying capacity metadata cannot fully capture fast irradiance/cloud transients that dominate the hardest BTM-driven forecast errors.

Fig. 7: NEMs/BTM registry features have a small marginal effect beyond weather integration (walk-forward, 48h). Percent change in MAPE and $\mathrm{Reserve}^{\%}_{99.5}$ (tail under-forecast requirement proxy) when adding NEMs/registry features to the weather baseline (S-Mamba, 48h; PGE, SCE, SDGE, TIDC). The y-axis is constrained to ±5% to highlight the modest magnitude of the effect.
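The gated FiLM-style fusion of Eqs. (22)–(23) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the single tanh layer standing in for $\phi$, the linear maps standing in for $g$, and all weight names are assumptions, not the trained modules.

```python
import numpy as np

def film_fuse(x, b, c, d, W_phi, W_gamma, W_beta):
    """Gated FiLM fusion of static registry features into embedded inputs.

    x: (T, D) embedded main inputs x_t
    b: (F,)   static NEMs/registry feature vector
    c: (T, C) context signal c_t (weather + time)
    d: (T,)   daylight prior d_t in [0, 1]
    W_phi (F, D), W_gamma (C, D), W_beta (C, D): hypothetical weights.
    """
    b_emb = np.tanh(b @ W_phi)           # phi(b): assumed one-layer embedding, shape (D,)
    gamma = c @ W_gamma                  # g(c_t), scale part (assumed linear), shape (T, D)
    beta = c @ W_beta                    # g(c_t), shift part (assumed linear), shape (T, D)
    b_mod = d[:, None] * (gamma * b_emb[None, :] + beta)   # Eq. (22)
    return x + b_mod                     # Eq. (23): additive fusion

rng = np.random.default_rng(0)
T, D, F, C = 6, 4, 3, 5
x = rng.normal(size=(T, D))
b = rng.normal(size=F)
c = rng.normal(size=(T, C))
d = np.array([0.0, 0.0, 0.5, 1.0, 1.0, 0.0])   # night hours gated to zero
z = film_fuse(x, b, c, d,
              rng.normal(size=(F, D)),
              rng.normal(size=(C, D)),
              rng.normal(size=(C, D)))
```

Because the daylight prior multiplies the whole modulated term, rows with $d_t = 0$ pass through unchanged ($z_t = x_t$), which is the behavior intended to keep PV capacity from “explaining” nighttime net load.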
7 Discussion

7.1 Performance Comparison with CAISO Benchmarks

Table 4 provides contextual comparison with published CAISO operational forecasts and commercial alternatives.

Table 4: Contextual comparison to operational benchmarks (CAISO-TAC, 24h). Published CAISO and commercial day-ahead forecast MAPE values from the July 2024 heat wave [21] are shown alongside our models for reference. Our model rows report 24h MAPE from our walk-forward evaluation.

| Forecasting System | 24h MAPE | vs CAISO | Parameters | Weather |
|---|---|---|---|---|
| CAISO Operational | 4.55% | - | N/A | Yes |
| Commercial (Yes Energy) | 2.65% | -41.8% | N/A | Yes |
| LSTM Baseline | 6.49% | +42.7% | 0.21M | No |
| iTransformer | 5.37% | +18.0% | 50.0M | No |
| Chronos (zero-shot) | 7.59% | +66.8% | 8.0M | No |
| iTransformer + Weather | 4.15% | -8.8% | 50.0M | Yes |
| S-Mamba + Weather | 4.47% | -1.8% | 16.4M | Yes |
| Mamba-ProbTSF + Weather | 4.52% | -0.7% | 16.4M | Yes |
| PowerMamba + Weather | 3.68% | -19.1% | 2.5M | Yes |

The performance gains are particularly significant given the model efficiency: PowerMamba achieves competitive accuracy relative to published CAISO benchmarks with only 2.5M parameters, 69% fewer than Chronos and 95% fewer than the iTransformer baseline used in our experiments. This efficiency enables real-time deployment on edge devices without sacrificing forecast quality.

7.2 Temperature, Tail Risk, and Operator Implications

Our error analysis shows a statistically significant but modest association between temperature and forecast error magnitude ($r = 0.16$), supporting the operational intuition that weather is a relevant covariate and that loss-function modifications cannot substitute for missing exogenous information. More importantly for operations, the walk-forward results reinforce that reserve requirements are not determined by MAPE alone: models with similar average accuracy can imply very different one-sided tail requirements ($\mathrm{Reserve}^{\%}_{99.5}$) under under-forecast risk.
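To make the gap between average accuracy and one-sided tail requirements concrete, the following minimal sketch computes MAPE, bias, UPR, OPR, and a tail-reserve proxy. It rests on assumptions stated here rather than taken from the experiments: errors are signed as forecast minus actual, the 99.5th percentile is taken over all timestamps with over-forecasts counted as zero shortfall, and shortfall is normalized by actual load. The function and key names are illustrative.

```python
import numpy as np

def reliability_metrics(y_true, y_pred, q=99.5):
    """Operator-facing diagnostics for a set of point forecasts (MW)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true                    # positive = over-forecast
    shortfall = np.maximum(0.0, -err)        # one-sided under-forecast error
    return {
        "mape_pct": 100.0 * np.mean(np.abs(err) / np.abs(y_true)),
        "bias_mw": float(np.mean(err)),
        "upr_pct": 100.0 * np.mean(err < 0),     # under-prediction rate
        "opr_pct": 100.0 * np.mean(err > 0),     # over-prediction rate
        # Assumed normalization: shortfall as a percentage of actual load.
        "reserve_pct": float(np.percentile(100.0 * shortfall / y_true, q)),
    }

# Toy comparison on a constant 100 MW load.
y = np.full(200, 100.0)
forecast_a = y + np.where(np.arange(200) % 2 == 0, 4.0, -4.0)   # symmetric +/-4 MW
forecast_b = y + np.concatenate([np.full(190, 2.0), np.full(10, -42.0)])
metrics_a = reliability_metrics(y, forecast_a)
metrics_b = reliability_metrics(y, forecast_b)
```

In this toy case both forecasts have a MAPE of 4%, yet the tail proxy is 4% for the first and 42% for the second, even though the second under-predicts only 5% of the time.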
Weather integration narrows the error distribution across utilities (Fig. 3), which can translate directly into reduced upward reserve needs during temperature-driven demand spikes.

7.3 Computational Efficiency

The performance of Mamba-based models can be attributed to three factors: (1) selective state spaces capture long-range dependencies, (2) $O(n)$ complexity enables extended 240h context windows, and (3) parameter efficiency. In our implementations, PowerMamba is particularly compact (2.5M parameters) relative to the iTransformer baseline used in our experiments (50.0M), enabling lower-latency inference and reduced memory footprint.

7.4 Foundation Model Performance

Chronos-T5-Base (200M parameters) performed no better than Chronos-Small (8M) in zero-shot evaluation (e.g., 5.76% vs 5.23% MAPE at the 24-hour horizon). This suggests that foundation model scaling does not automatically translate to domain-specific performance. General-purpose pre-training lacks the structural priors needed to handle the specific physical constraints of the grid (such as the complex interaction between thermal inertia and behind-the-meter generation), reinforcing the value of specialized architectures and weather integration strategies.

7.5 Limitations from Unobserved BTM Generation

A recurring theme in our results is the ceiling imposed by data observability. Our “grey box” experiments with static NEMs registry features (Section 6.5) yielded only marginal gains. Figure 7 shows that even with capacity data, the model struggles to capture the dynamic, weather-dependent ramping of distributed solar. This points to a fundamental limitation: net load signals conflate grid demand with behind-the-meter (BTM) generation, and static capacity metadata is insufficient to disentangle them. This “visibility gap” suggests that future improvements will not come from larger black-box transformers, but from structural end-to-end learning.
As demonstrated by Shi et al. [17], differentiable optimization layers can explicitly model physical relationships (such as solar generation physics or price response) within the neural network. By adopting such end-to-end decomposition methods to dynamically learn latent BTM states from aggregate net load, future work could address the visibility gap that static feature engineering failed to close.

Real-Time Deployment. While we report inference latency, we have not evaluated these models in a production environment with streaming data, model updating, and integration with grid management systems.

Baseline Coverage. Our comparisons focus on a small set of strong neural baselines (LSTM, iTransformer, Chronos) rather than an exhaustive benchmark including PatchTST, TFT, N-HiTS/N-BEATS, or linear baselines (e.g., DLinear/NLinear). Future work should include autocorrelation-aware significance testing for a more complete statistical analysis.

8 Conclusion

Mamba-based state space models achieve competitive accuracy for California grid load forecasting. With weather integration, PowerMamba achieves 3.68% MAPE for 24-hour forecasts (Table 4), which compares favorably to published CAISO operational benchmarks (4.55% MAPE; contextual reference). Critically, our walk-forward evaluation demonstrates that operational tail risk ($\mathrm{Reserve}^{\%}_{99.5}$) can diverge materially from average accuracy, motivating grid-specific reliability metrics alongside MAPE for safety-critical deployment decisions.

Weather Integration Across Regions. Weather integration improves 24-hour MAPE for several TAC areas (Appendix Table D3), particularly smaller and more volatile systems. Future work should evaluate tail-risk improvements under a strictly matched evaluation set (same timestamps across models).

Probabilistic calibration must be paired with bias controls.
Multi-quantile objectives can reduce large-error events and tail reserve requirements, but can also be “gamed” by systematic inflation if Bias/OPR are not monitored and constrained. Our Bias/OPR-constrained objective provides an operator-legible way to enforce auditable trade-offs between tail risk and schedule inflation (Table 2, Table 3).

Operational Deployment Implications. Reduced tail under-forecast risk ($\mathrm{Reserve}^{\%}_{99.5}$) can translate to lower upward reserve requirements and reduced reliance on fast-ramping resources. The computational efficiency of Mamba architectures enables real-time deployment without significant infrastructure upgrades, while calibrated uncertainty quantification from Mamba-ProbTSF supports probabilistic reserve scheduling.

References

[1] Hong, T., Fan, S.: Probabilistic electric load forecasting: A tutorial review. International Journal of Forecasting 32(3), 914–938 (2016)
[2] California Independent System Operator: 2022 annual report on market issues and performance. Technical report, CAISO (July 2023). Reports CAISO system energy mix including solar and wind shares. https://www.caiso.com/Documents/2022-Annual-Report-on-Market-Issues-and-Performance-Jul-11-2023.pdf
[3] U.S. Energy Information Administration: Electric Power Monthly: Table 6.2.B. Net summer capacity using primarily renewable energy sources, by state. https://www.eia.gov/electricity/monthly/epm_table_grapher.php?quot=&t=table_6_02_b. Includes California behind-the-meter (small-scale) solar capacity figures (accessed 2026-01-12). (2024)
[4] California Independent System Operator: Final root cause analysis: Mid-August 2020 extreme heat wave. Technical report, CAISO (2021)
[5] Eren, Y., Kucukdemiral, I.B.: A comprehensive review on deep learning approaches for short-term load forecasting.
Renewable and Sustainable Energy Reviews 189, 114031 (2024)
[6] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
[7] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023)
[8] Wang, Z., Kong, F., Feng, S., Wang, M., Yang, X., Zhao, H., Wang, D., Zhang, Y.: Is Mamba effective for time series forecasting? arXiv preprint (2024)
[9] Menati, A., Doudi, F., Kalathil, D., Xie, L.: PowerMamba: A deep state space model and comprehensive benchmark for time series prediction in electric power systems. arXiv preprint arXiv:2412.06112 (2024)
[10] Dong, X., Cao, L., Shi, Y., Sun, S.: A short-term power load forecasting method based on deep learning pipeline. In: Proceedings of IEEE Sustainable Power and Energy Conference (2024). IEEE
[11] North American Electric Reliability Corporation: 2024 long-term reliability assessment. Technical report, NERC (2024). Defines planning reserve margin and probabilistic adequacy criteria such as LOLE (0.1 events/year) used in resource adequacy planning. https://www.nerc.com/globalassets/programs/rapa/ra/nerc_long-term-reliability-assessment_2024.pdf
[12] Zhang, Y., Wen, H., Shi, Y., et al.: Toward value-oriented renewable energy forecasting: An iterative learning approach. arXiv preprint (2023)
[13] Nie, Y., Nguyen, N.H., Sinthong, P., Kalagnanam, J.: A time series is worth 64 words: Long-term forecasting with transformers. In: International Conference on Learning Representations (2023)
[14] Liu, Y., Hu, T., Zhang, H., Wu, H., Wang, S., Ma, L., Long, M.: iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625 (2023)
[15] Zeng, A., Chen, M., Zhang, L., Xu, Q.: Are transformers effective for time series forecasting?
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 11022–11030 (2023)
[16] Gu, A., Goel, K., Ré, C.: Efficiently modeling long sequences with structured state spaces. In: International Conference on Learning Representations (2022)
[17] Shi, Y., Xu, B.: End-to-end demand response model identification and baseline estimation with deep learning. arXiv preprint arXiv:2109.00741 (2021)
[18] Seem, J.E.: Dynamic modeling of buildings with thermal mass. ASHRAE Transactions 113(1), 518–529 (2007)
[19] Ansari, A.F., Stella, L., Turkmen, C., Zhang, X., Mercado, P., Shen, H., Shchur, O., Rangapuram, S.S., Arber Pinilla, S., Kapoor, S., Zschiegner, J., Maddix, D.C., Wang, H., Mahoney, M.W., Torber, K., Wilson, A.G., Bohlke-Schneider, M., Gasthaus, J.: Chronos: Learning the language of time series. arXiv preprint arXiv:2403.07815 (2024)
[20] California Independent System Operator: OASIS interface specification v5.1.2 (fall 2017 release). Technical report, CAISO (2017). Documents the CAISO OASIS SingleZip API. https://www.caiso.com/documents/oasis-interfacespecification_v5_1_2clean_fall2017release.pdf
[21] Yes Energy: California Energy Demand Forecast Accuracy During Holiday and Heat Wave. https://blog.yesenergy.com/yeblog/tesla-california-energy-demand-forecast-accuracy-shines. Third-party analysis of CAISO forecast accuracy during the July 4–12, 2024 heat wave (2024)
[22] Bouktif, S., Fiaz, A., Ouni, A., Serhani, M.A.: Optimal deep learning LSTM model for electric load forecasting using feature selection and genetic algorithm: Comparison with machine learning approaches. Energies 13, 1633 (2020)

Appendix A Weather Covariates

Table A1 lists the 8 meteorological covariates used in weather-integrated models. All weather data are sourced from NOAA Integrated Surface Database (ISD) stations within each utility service territory, aggregated to hourly resolution.
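The lag-selection rule of Section 5.3 (Eq. 21), applied to hourly covariates such as these, can be sketched as follows. This is an illustrative sketch, assuming two equally long, gap-free hourly arrays; the helper names `best_lag` and `lag_feature` are hypothetical and not part of the experimental code.

```python
import numpy as np

def best_lag(weather, load, max_lag=12):
    """Return (tau*, rho(tau*)) maximizing |Corr(w_{t-tau}, L_t)| over 0..max_lag."""
    weather = np.asarray(weather, dtype=float)
    load = np.asarray(load, dtype=float)
    n = len(load)
    best_tau, best_rho = 0, 0.0
    for tau in range(max_lag + 1):
        w_lagged = weather[: n - tau]    # w_{t - tau}
        l_now = load[tau:]               # L_t on the overlapping timestamps
        rho = np.corrcoef(w_lagged, l_now)[0, 1]
        if abs(rho) > abs(best_rho):
            best_tau, best_rho = tau, float(rho)
    return best_tau, best_rho

def lag_feature(weather, tau):
    """Shift a covariate so row t holds w_{t - tau}; the first tau rows are NaN."""
    weather = np.asarray(weather, dtype=float)
    out = np.full(len(weather), np.nan)
    out[tau:] = weather[: len(weather) - tau]
    return out

# Synthetic check: load responds to temperature three hours later.
rng = np.random.default_rng(0)
temp = rng.normal(size=500)
load = np.zeros(500)
load[3:] = 2.0 * temp[:-3]
load += 0.1 * rng.normal(size=500)
tau_star, rho_star = best_lag(temp, load)
```

On real data the correlation profile is typically much flatter than in this synthetic case, so inspecting $\rho(\tau)$ across lags is advisable before fixing a single value.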
Thermal lag values are based on typical building thermal mass response characteristics [18]. Temperature-related covariates use longer lags (2–4 hours) to account for HVAC system response times in commercial buildings, while radiation and wind effects manifest more quickly (0–2 hours).

Table A1: Weather covariates used for integration. Meteorological inputs and their assumed thermal lag ranges for feature alignment, used by all weather-integrated models in the walk-forward experiments.

| Covariate | Unit | Thermal Lag |
|---|---|---|
| Temperature (dry bulb) | °C | 2–4 hours |
| Dew point temperature | °C | 2–4 hours |
| Relative humidity | % | 2–4 hours |
| Wind speed | m/s | 0–1 hours |
| Wind direction | degrees | 0–1 hours |
| Cloud cover | oktas | 1–2 hours |
| Solar radiation (GHI) | W/m² | 1–2 hours |
| Atmospheric pressure | hPa | 0 hours |

Appendix B Hyperparameter Configuration

Table B2 provides complete hyperparameter specifications for all models.

Table B2: Model hyperparameters. Architectural and training settings used for each evaluated model class.

| Parameter | S-Mamba | PowerMamba | Mamba-ProbTSF | iTransformer |
|---|---|---|---|---|
| Embedding dim ($d_{\mathrm{model}}$) | 128 | 128 | 128 | 512 |
| State dim ($d_{\mathrm{state}}$) | 16 | 16 | 16 | - |
| Conv kernel ($d_{\mathrm{conv}}$) | 4 | 4 | 4 | - |
| Expansion factor | 2 | 2 | 2 | - |
| Encoder layers | 2 | 2 | 2 | 10 |
| Attention heads | - | - | - | 8 |
| Dropout | 0.1 | 0.1 | 0.1 | 0.1 |
| Bidirectional | Yes | Yes | Yes | - |
| Total parameters | 16.4M | 2.5M | 16.4M | 50.0M |

Appendix C Multi-Seed Robustness

To assess robustness to random initialization, we trained PowerMamba + Weather with seeds {42, 123, 456}. The 24-hour MAPE on CAISO-TAC was 3.68% ± 0.09% (mean ± std), confirming that reported results are stable across initializations. Full multi-seed results for all architectures will be provided in the final version.

Appendix D Per-Utility Results

Table D3 presents 24-hour MAPE for each TAC area, demonstrating consistent performance across diverse grid regions.
Smaller utilities (SDGE-TAC, TIDC) exhibit higher MAPE due to increased load volatility relative to system size. Weather integration tends to improve accuracy, with the largest relative gains in smaller, more volatile territories; however, improvements are not uniform across all regions (Table D3).

Table D3: Per-utility accuracy with and without weather (walk-forward, 24h). 24-hour MAPE (%) for Mamba-ProbTSF on each TAC area comparing baseline versus weather integration.

| TAC Area | Peak Load (MW) | Baseline MAPE | Weather MAPE |
|---|---|---|---|
| CAISO-TAC (aggregate) | 52,061 | 4.29% | 4.52% |
| PGE-TAC | 21,847 | 4.66% | 4.18% |
| SCE-TAC | 24,156 | 6.59% | 6.08% |
| SDGE-TAC | 4,892 | 8.43% | 6.61% |
| TIDC | 1,166 | 5.18% | 3.59% |

Appendix E Sensitivity to Weather Forecast Errors

To ensure our results are robust to real-world conditions where perfect weather forecasts are unavailable, we conducted a sensitivity analysis by injecting Gaussian noise into the temperature inputs. We evaluated the pre-trained PowerMamba + Weather model on the CAISO-TAC test set with noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ added to the dry-bulb temperature feature, where $\sigma \in \{1\,^\circ\mathrm{C}, 2\,^\circ\mathrm{C}, 3\,^\circ\mathrm{C}\}$.

Table E4: Sensitivity to temperature forecast uncertainty (CAISO-TAC, 24h). Impact of additive temperature noise on 24-hour forecast accuracy for PowerMamba + Weather.

| Noise Level ($\sigma$) | MAPE (%) | Degradation |
|---|---|---|
| Baseline (no noise) | 3.68% | - |
| $\sigma = 1\,^\circ$C | 3.68% | +0.0% |
| $\sigma = 2\,^\circ$C | 3.69% | +0.3% |
| $\sigma = 3\,^\circ$C | 3.72% | +1.1% |

Table E4 demonstrates that the model retains its performance advantage even with moderate forecast errors. At $\sigma = 2\,^\circ$C (a typical error range for day-ahead weather forecasts), the MAPE degrades by only 0.3% in relative terms (3.68% → 3.69%), remaining competitive with the non-weather-integrated baselines. This confirms that the benefits of weather integration largely persist under operational weather uncertainty.
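The noise-injection procedure can be sketched as a small harness. This is a hedged illustration: `predict` stands for any fitted forecaster exposed as a function from a feature matrix to MW forecasts, and the toy linear predictor below is a stand-in for demonstration, not the PowerMamba + Weather model.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs(y_pred - y_true) / np.abs(y_true))

def noise_sensitivity(predict, X, y, temp_col, sigmas=(1.0, 2.0, 3.0), seed=0):
    """Return rows of (sigma, MAPE, relative degradation in %) under temperature noise.

    Gaussian noise N(0, sigma^2) is added only to column temp_col of a copy of X.
    """
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float, copy=True)
    base = mape(y, predict(X))
    rows = [(0.0, base, 0.0)]
    for sigma in sigmas:
        X_noisy = X.copy()
        X_noisy[:, temp_col] += rng.normal(0.0, sigma, size=len(X_noisy))
        m = mape(y, predict(X_noisy))
        rows.append((sigma, m, 100.0 * (m - base) / base))
    return rows

# Toy stand-in: load depends linearly on temperature (column 0), plus noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(20.0, 5.0, size=(2000, 2))
toy_predict = lambda A: 20000.0 + 300.0 * A[:, 0]
y_demo = toy_predict(X_demo) + rng.normal(0.0, 500.0, size=2000)
rows = noise_sensitivity(toy_predict, X_demo, y_demo, temp_col=0)
```

The toy predictor is far more temperature-sensitive than the degradations in Table E4 suggest for the trained model, so its numbers illustrate the procedure only.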
Appendix F Empirical Estimation of Asymmetry Ratio $\rho$ from CAISO Market Data

This appendix describes a reproducible procedure to estimate the operational cost asymmetry ratio $\rho$ for California grid operations from public CAISO market data. The goal is to connect risk-aversion in the loss function (Eq. 12) to market-based evidence.

F.1 Data Sources and Alignment (CAISO OASIS)

We use the CAISO OASIS "SingleZip" API [20] to download time-aligned series:

• Load actuals: queryname=SLD_FCST, market_run_id=ACTUAL.
• Day-ahead load forecast: queryname=SLD_FCST, market_run_id=DAM.
• Day-ahead LMP (hourly): queryname=PRC_LMP, market_run_id=DAM.
• Real-time LMP (5-minute RTM): queryname=PRC_INTVL_LMP, market_run_id=RTM.

We use a representative settlement point (e.g., trading hub TH_NP15_GEN-APND or a DLAP such as DLAP_PGAE-APND). Real-time LMP is averaged to hourly resolution to match the hourly load series.

F.2 Price-Spread Asymmetry (Robust Estimator)

Define the hourly DA–RT spread:

$$ s_t = \mathrm{LMP}^{\mathrm{RT}}_t - \mathrm{LMP}^{\mathrm{DA}}_t. \tag{F1} $$

We decompose the spread into positive and negative parts:

$$ s_t^{+} = \max(0, s_t), \qquad s_t^{-} = \max(0, -s_t). \tag{F2} $$

The price-only asymmetry ratio is then

$$ \rho_{\text{price}} = \frac{\mathbb{E}[s_t^{+}]}{\mathbb{E}[s_t^{-}]}, \qquad q^{*}_{\text{price}} = \frac{\rho_{\text{price}}}{1 + \rho_{\text{price}}}. \tag{F3} $$

This estimator is robust to forecast bias because it depends only on market spreads.

F.3 Calendar-2025 Empirical Estimates (Per Utility)

Table F5 reports empirical asymmetry ratios over calendar-2025 for each TAC area using representative settlement points (DLAPs or the NP15 trading hub). We report $\rho_{\text{price}}$ as the primary, bias-independent market-based estimator and include $\rho_{\text{event}}$ as an optional diagnostic when stable and available.

Interpretation for risk-averse training. The implied $q^{*}_{\text{price}} = \rho_{\text{price}} / (1 + \rho_{\text{price}})$ can fall below 0.5 in calendar-year averages; we do not interpret this as a recommendation to train below-median forecasts for reliability-critical operations. Instead, we treat $\rho_{\text{price}}$ as a market-grounded anchor and select an operationally conservative $q_{\text{target}} \geq 0.5$ using an explicit reliability premium $\kappa$ and bias-control constraints (Section 4.3).

Table F5: Empirical market-based asymmetry estimates for calendar-2025. We report both a robust price-spread asymmetry ($\rho_{\text{price}}$) and an optional event-conditional proxy ($\rho_{\text{event}}$). Implied optimal quantiles are $q^{*} = \rho / (1 + \rho)$ (Eq. 12).

| Utility | Price Node | Hours | $\rho_{\text{price}}$ | $q^{*}_{\text{price}}$ | $\rho_{\text{event}}$ | $q^{*}_{\text{event}}$ |
|---|---|---|---|---|---|---|
| CAISO-TAC (aggregate) | TH_NP15_GEN-APND | 8736 | 0.78 | 0.439 | N/A | N/A |
| PGE-TAC | DLAP_PGAE-APND | 8736 | 0.71 | 0.414 | 0.26 | 0.204 |
| SCE-TAC | DLAP_SCE-APND | 8736 | 0.74 | 0.426 | N/A | N/A |
| SDGE-TAC | DLAP_SDGE-APND | 8736 | 0.78 | 0.439 | N/A | N/A |
| TIDC (proxy) | TH_NP15_GEN-APND | 8736 | 0.78 | 0.439 | N/A | N/A |

Notes: $\rho_{\text{price}}$ depends only on the DA–RT price spread and is therefore robust to forecast bias. $\rho_{\text{event}}$ conditions on the sign of the DA load forecast error and can be unstable or unavailable (reported as N/A) depending on data coverage and estimator settings. For risk-averse training, we treat market-derived asymmetry as an anchor and select an operationally conservative $q^{*}$ (Eq. 12) jointly with an explicit bias-control term (Section 4.3).

F.4 Event-Conditional Settlement Proxy (Optional)

To connect directly to forecast errors, define the day-ahead forecast deviation

$$ \Delta_t = y_t - \hat{y}^{\mathrm{DA}}_t, \tag{F4} $$

where $y_t$ is actual load and $\hat{y}^{\mathrm{DA}}_t$ is the day-ahead forecast. A simple settlement-style proxy for marginal costs conditions on the sign of $\Delta_t$:

$$ C_{\text{under}} \approx \mathbb{E}[s_t^{+} \mid \Delta_t > 0], \qquad C_{\text{over}} \approx \mathbb{E}[s_t^{-} \mid \Delta_t < 0], \tag{F5} $$

yielding

$$ \rho_{\text{event}} = \frac{C_{\text{under}}}{C_{\text{over}}}, \qquad q^{*}_{\text{event}} = \frac{\rho_{\text{event}}}{1 + \rho_{\text{event}}}. \tag{F6} $$

Because $\rho_{\text{event}}$ can be unstable if $\Delta_t$ is highly one-sided over a short window, we recommend reporting $\rho_{\text{price}}$ as the primary market-derived estimator and using $\rho_{\text{event}}$ as a diagnostic.

F.5 Reproducible Implementation

We provide an implementation in the repository at experiments/core/robust_evaluation/caiso_empirical_rho.py, which downloads the above OASIS series in chunks (to respect acceptable-use limits), aligns them by timestamp, and writes a JSON report containing $\rho_{\text{price}}$ and (optionally) $\rho_{\text{event}}$ for a specified node and time range.
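For readers who want to check the estimators without the repository, the price-spread ratio of Eqs. (F1)–(F3) and the event-conditional proxy of Eqs. (F4)–(F6) can be sketched as below. This is an illustrative sketch, not the contents of caiso_empirical_rho.py: the function names are hypothetical, the inputs are assumed to be already downloaded and aligned by timestamp, and the 5-minute real-time series is assumed to contain exactly 12 intervals per hour with no gaps.

```python
import numpy as np


def to_hourly(five_minute_lmp):
    """Average 5-minute prices to hourly values (assumes 12 complete intervals per hour)."""
    x = np.asarray(five_minute_lmp, dtype=float)
    if x.size % 12 != 0:
        raise ValueError("expected a whole number of hours of 5-minute data")
    return x.reshape(-1, 12).mean(axis=1)


def spread_parts(lmp_rt, lmp_da):
    """Eqs. (F1)-(F2): spread s_t = RT - DA, split into positive and negative parts."""
    s = np.asarray(lmp_rt, dtype=float) - np.asarray(lmp_da, dtype=float)
    return np.maximum(0.0, s), np.maximum(0.0, -s)


def rho_price(lmp_rt, lmp_da):
    """Eq. (F3): return (rho_price, implied quantile q*_price)."""
    s_plus, s_minus = spread_parts(lmp_rt, lmp_da)
    denom = s_minus.mean()
    if denom == 0.0:
        raise ValueError("no negative spreads; rho_price is undefined")
    rho = float(s_plus.mean() / denom)
    return rho, rho / (1.0 + rho)


def rho_event(lmp_rt, lmp_da, load_actual, load_da_forecast):
    """Eqs. (F4)-(F6): return (rho_event, q*_event), or None if it cannot be computed."""
    s_plus, s_minus = spread_parts(lmp_rt, lmp_da)
    delta = np.asarray(load_actual, dtype=float) - np.asarray(load_da_forecast, dtype=float)
    under, over = delta > 0, delta < 0
    # A one-sided window leaves one conditional mean empty, so no ratio exists.
    if not under.any() or not over.any():
        return None
    c_under = s_plus[under].mean()
    c_over = s_minus[over].mean()
    if c_over == 0.0:
        return None
    rho = float(c_under / c_over)
    return rho, rho / (1.0 + rho)
```

As a small worked case, hourly real-time prices of 30, 50, 20, 40 against day-ahead prices of 35, 40, 30, 40 give spreads of −5, 10, −10, 0, so the mean positive part is 2.5, the mean negative part is 3.75, ρ_price = 2/3 and q*_price = 0.4. Returning None when the forecast deviation is one-sided is one way to represent an unavailable ρ_event; the choice is an assumption of this sketch.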
