Data-driven online control for real-time optimal economic dispatch and temperature regulation in district heating systems

Highlights

• Optimality conditions are embedded into augmented DHS dynamics.
• A data-driven controller is developed for real-time DHS operation.
• Adaptive online updates improve learning and closed-loop performance.
• The framework guarantees controller optimality and closed-loop convergence.
• Validation on an industrial-park DHS shows stable near-optimal operation.

Xinyi Yi, Ioannis Lestas∗
Department of Engineering, University of Cambridge, Trumpington Street, Cambridge, CB2 1PZ, United Kingdom

ARTICLE INFO
Keywords: District heating systems; Economic dispatch; Temperature regulation; Online data-driven control; Performance guarantees

ABSTRACT
District heating systems (DHSs) require coordinated economic dispatch and temperature regulation under uncertain operating conditions. Existing DHS operation strategies often rely on disturbance forecasts and nominal models, so their economic and thermal performance may degrade when predictive information or model knowledge is inaccurate. This paper develops a data-driven online control framework for DHS operation by embedding steady-state economic optimality conditions into the temperature dynamics, so that the closed-loop system converges to the economically optimal operating point without relying on disturbance forecasts. Based on this formulation, we develop a Data-Enabled Policy Optimization (DeePO)-based online learning controller and incorporate Adaptive Moment Estimation (ADAM) to improve closed-loop performance. We further establish convergence and performance guarantees for the resulting closed-loop system.
Simulations on an industrial-park DHS in Northern China show that the proposed method achieves stable near-optimal operation and strong empirical robustness to both static and time-varying model mismatch under practical disturbance conditions.

1. Introduction

Heating systems account for a significant share of global energy consumption and greenhouse gas emissions. Improving the operational efficiency and flexibility of district heating systems (DHSs) is therefore important for low-carbon energy transitions. Widely deployed in China, Russia, and Europe, DHSs distribute thermal energy through large-scale pipeline networks. Their increasing integration with renewable and waste-heat sources reduces reliance on fossil fuels, but also increases operational uncertainty and coordination complexity [15].

In practical DHS operation, coordinating economic dispatch and temperature regulation remains challenging under demand uncertainty and model mismatch. Existing forecast-based and model-based control strategies can perform well when disturbance predictions and nominal models are accurate, but their performance may degrade under uncertain heat demand and changing operating conditions. This motivates a closer examination of temperature regulation for DHSs under uncertainty.

Although temperature regulation has been widely studied in buildings [3], DHSs differ substantially in system structure, control objectives, and operating conditions. Compared with building heating control, DHS regulation requires coordinated heat generation, transport, and allocation over large-scale networks, where control performance depends on network interconnections and thermal transport dynamics [1, 16]. This motivates the development of control frameworks for DHSs that can maintain economically efficient and thermally stable operation under uncertainty.

∗ Corresponding author. xy343@cam.ac.uk (X. Yi); icl20@cam.ac.uk (I. Lestas). ORCID(s): 0000-0003-1797-6280 (X. Yi)

1.1. Forecast-reliant and model-based DHS operation

In practice, DHS operation often combines forecast-based setpoint scheduling with real-time setpoint tracking [1, 16]. However, forecast errors and model mismatch can degrade both thermal and economic performance under uncertain heat demand [6, 10]. Model predictive control (MPC) partially alleviates this issue through receding-horizon re-optimization [11]. Related DHS operation frameworks have been developed using improved load forecasting methods [8, 24] together with control-oriented or reduced-order thermal models [14, 19, 28] to improve thermal and economic performance. However, their practical performance remains sensitive to forecast quality and model accuracy, especially when heat demand varies and model mismatch is present.

1.2. Data-driven enhancements within MPC-based DHS operation

Building on these MPC frameworks, recent studies have further introduced data-driven components to mitigate the impact of model mismatch in DHS operation. However, rather than removing the reliance on model knowledge, these methods are still largely developed within MPC frameworks. For example, recent works have explored physics-informed neural network models for predictive control [7] and AI-based approaches for steam system modeling from a graph perspective [27]. Nevertheless, these approaches remain prediction-reliant and model-reliant: their performance still depends on the quality of the predictive model and its associated information, while formal closed-loop guarantees under prediction errors and model mismatch remain limited.
These limitations motivate a shift from prediction-reliant MPC frameworks towards online data-driven control frameworks that can support economically consistent and reliable DHS operation under uncertainty while providing stronger closed-loop guarantees. Feedback-based optimization controllers [9] provide an alternative, but their fast-optimization assumption may be restrictive for slow thermal dynamics. Our previous linear quadratic regulator (LQR) framework [26] guarantees convergence to the economically optimal operating point under unknown deterministic disturbances, but does not account for stochastic demand variations or renewable-heat fluctuations encountered in practical operation.

Xinyi Yi and Ioannis Lestas: Preprint submitted to Elsevier

1.3. Online data-driven control opportunities for real-time economic dispatch and temperature regulation in DHSs

Data-driven LQR provides a promising framework for online DHS control when accurate system models are difficult to maintain and strong closed-loop guarantees are desired. Existing approaches broadly fall into two categories: indirect methods, which first identify system dynamics and then solve a certainty-equivalent (CE) control problem, and direct methods, which learn the control policy directly from data. For large-scale DHSs under uncertainty and model mismatch, direct methods are particularly attractive because they reduce reliance on repeated model identification and can adapt more naturally to closed-loop operating data.

Representative direct data-driven LQR methods include gradient-based policy updates [17, 22] and Data-Enabled Policy Optimization (DeePO) [29].
Gradient-based methods often require relatively long data trajectories to estimate gradients or value functions accurately, which can limit online sample efficiency. By contrast, DeePO constructs a covariance-based surrogate objective directly from closed-loop data, making it well suited to online adaptation from a single operating trajectory. This is particularly relevant for DHS operation, where repeated restarts, multiple independent experiments, or extensive exploration are typically impractical.

Despite these advantages, existing DeePO studies have mainly been validated on low-dimensional benchmark systems, and its online implementation can still be sensitive to stochastic disturbances, time-correlated data, and surrogate bias under model mismatch. To improve performance in this setting, we incorporate Adaptive Moment Estimation (ADAM) [13] into the DeePO update. Although ADAM is widely used to accelerate gradient-based learning [4, 12], its convergence properties for LQR policy learning do not follow directly from existing general-purpose results, because the cost depends implicitly on Lyapunov equations. This motivates an ADAM-enhanced DeePO framework for optimal real-time economic dispatch and temperature regulation in DHSs under demand uncertainty and model mismatch.

1.4. Contributions and paper organization

The main contributions of this work are threefold:

• Economically consistent online control formulation for DHS operation. We develop a data-driven online control framework for real-time optimal economic dispatch and temperature regulation in DHSs. Steady-state economic optimality conditions are embedded into the DHS temperature dynamics, yielding an augmented regulation problem whose closed-loop equilibrium coincides with the economically optimal operating point, without relying on disturbance forecasts for control design or real-time operation.
• Online data-driven control with convergence and stability guarantees. Based on Data-Enabled Policy Optimization (DeePO), we develop an online data-driven controller for DHSs under stochastic disturbances and model mismatch, and establish convergence to an optimal control policy, together with closed-loop stability guarantees.
• ADAM-enhanced online policy learning with improved closed-loop performance. We incorporate Adaptive Moment Estimation (ADAM) into the DeePO update to improve closed-loop performance, and establish convergence guarantees for the resulting ADAM-enhanced scheme in large-scale DHS control.

The remainder of the paper is organized as follows. Section 2 presents the DHS model. Sections 3 and 4 develop the augmented LQR framework and the DeePO-based online controller design. Section 5 presents simulation results on an industrial-scale DHS with model mismatch and time-varying parameter perturbations. Section 6 concludes the paper.

2. District heating system (DHS) model

2.1. Temperature dynamics of DHSs

District heating system (DHS) temperature dynamics can be modeled at different fidelity levels. High-fidelity partial differential equation (PDE) models capture spatiotemporal heat transport in detail, but are often impractical for real-time optimization and learning-based control because of their high dimensionality and nonlinear structure [21]. Reduced-order models that emphasize aggregate heat transport and node-level mixing are therefore more suitable for scalable network-level control [1, 14, 16, 18]. Representative DHS pipeline models are summarized in Table 1. We therefore adopt a control-oriented DHS temperature model based on energy conservation and mass-flow mixing, which yields a low-dimensional state-space representation while preserving the network topology and interconnection structure.
Table 1. Comparison of representative DHS pipeline models. Here, $T_i^k$ denotes the outlet temperature of the $i$th pipe segment at time step $k$; $\boldsymbol{T}_{\mathrm{in}}$ and $\boldsymbol{T}$ denote the inlet and outlet temperature vectors, respectively; $\boldsymbol{v}$ is the diagonal flow-velocity matrix; $T_a$ is the ambient temperature; and $\rho$, $c_p$, $\boldsymbol{\lambda}$, $A_p$, and $\boldsymbol{V}$ denote the fluid density, specific heat capacity, heat-transfer coefficient, pipe cross-sectional area, and pipe volume, respectively. $Q_i^k$ is the heat loss in the $i$th pipe segment at time step $k$.

Reference | Model
[21] | $\dfrac{T_i^{k+1} - T_i^{k}}{\Delta t} = -v\,\dfrac{T_i^{k} - T_{i-1}^{k}}{\Delta x} - \dfrac{Q_i^{k}}{\rho c_p A_p}$
[1] | $\boldsymbol{V}\dot{\boldsymbol{T}} = \boldsymbol{v}(\boldsymbol{T}_{\mathrm{in}} - \boldsymbol{T}) + \boldsymbol{\lambda}(T_a - \boldsymbol{T})$
[14, 16, 18] | $\boldsymbol{V}\dot{\boldsymbol{T}} = \boldsymbol{v}(\boldsymbol{T}_{\mathrm{in}} - \boldsymbol{T})$

Let $\mathcal{E}$ denote the set of edges (heat exchangers and pipelines) and $\mathcal{N}$ the set of nodes (storage tanks). The edge dynamics in (1a) capture the dependence of outlet temperatures on inlet–outlet temperature differences, while the node dynamics in (1b) describe the balance between variations in stored thermal energy and the net inflow from adjacent edges, where $T^E_j$ is the outlet temperature of edge $j$ and $T^N_k$ is the temperature of node $k$:

$$\rho c_p V^E_j \dot{T}^E_j = \rho c_p q^E_j \left(T^N_k - T^E_j\right) + h^G_j - h^L_j, \quad j \in \mathcal{E},\ k \in \mathcal{N}, \tag{1a}$$

$$\rho c_p V^N_k \dot{T}^N_k = \sum_{j \in \mathcal{T}_k} \rho c_p q^E_j \left(T^E_j - T^N_k\right), \quad k \in \mathcal{N}. \tag{1b}$$

$h^G_j$ and $h^L_j$ denote the heat-source and heat-load powers at edge $j$, respectively. $\mathcal{T}_k$ is the set of edges whose outlets are node $k$. The parameters $V^E_j$ and $V^N_k$ represent edge and node volumes, $q^E_j$ is the mass-flow rate, and $c_p$ and $\rho$ are the specific heat capacity and density of water. The edge set is partitioned into the heat producer set $\mathcal{G}$, heat load set $\mathcal{L}$, and pipeline set $\mathcal{P}$. For an edge $j \in \mathcal{G}$, we have $h^L_j = 0$.
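For an edge $j \in \mathcal{L}$, we have $h^G_j = 0$ and $h^L_j$ is prescribed. For an edge $j \in \mathcal{P}$, we have $h^G_j = 0$ and $h^L_j = 0$.

Equations (1a)–(1b) can be compactly expressed in matrix form as¹

$$\boldsymbol{V}\begin{bmatrix}\dot{\boldsymbol{T}}^{G}\\ \dot{\boldsymbol{T}}^{L}\\ \dot{\boldsymbol{T}}^{R}\end{bmatrix} = -\boldsymbol{A}_h\begin{bmatrix}\boldsymbol{T}^{G}\\ \boldsymbol{T}^{L}\\ \boldsymbol{T}^{R}\end{bmatrix} + \begin{bmatrix}\boldsymbol{h}^{G}\\ -\boldsymbol{h}^{L}\\ \boldsymbol{0}\end{bmatrix}, \tag{2a}$$

$$\boldsymbol{V}\begin{bmatrix}\dot{\boldsymbol{T}}^{E}\\ \dot{\boldsymbol{T}}^{N}\end{bmatrix} = -\boldsymbol{A}_h\begin{bmatrix}\boldsymbol{T}^{E}\\ \boldsymbol{T}^{N}\end{bmatrix} + \begin{bmatrix}\boldsymbol{h}^{G}\\ -\boldsymbol{h}^{L}\\ \boldsymbol{0}\end{bmatrix}, \tag{2b}$$

$$\dot{\boldsymbol{T}} = -\boldsymbol{A}_1\boldsymbol{T} + \boldsymbol{B}_1\boldsymbol{h}^{G} - \boldsymbol{B}_2\boldsymbol{h}^{L}, \tag{2c}$$

where $\boldsymbol{T}^{G}$ and $\boldsymbol{T}^{L}$ are the producer-edge and load-edge temperatures, $\boldsymbol{T}^{R}$ collects the remaining temperatures, and $\boldsymbol{h}^{G}$ and $\boldsymbol{h}^{L}$ denote the production and load vectors, respectively, scaled by $1/(\rho c_p)$. The matrices in (2c) are

$$\boldsymbol{A}_1 = \boldsymbol{V}^{-1}\boldsymbol{A}_h, \qquad \boldsymbol{B}_1 = \boldsymbol{V}^{-1}\begin{bmatrix}\boldsymbol{I}\\ \boldsymbol{0}\\ \boldsymbol{0}\end{bmatrix}, \qquad \boldsymbol{B}_2 = \boldsymbol{V}^{-1}\begin{bmatrix}\boldsymbol{0}\\ \boldsymbol{I}\\ \boldsymbol{0}\end{bmatrix},$$

and $\boldsymbol{A}_h$ is defined as

$$\boldsymbol{A}_h = \begin{bmatrix}\operatorname{diag}(\boldsymbol{q}_E) & -\operatorname{diag}(\boldsymbol{q}_E)\boldsymbol{B}_{sh}\\ -\boldsymbol{B}_{th}\operatorname{diag}(\boldsymbol{q}_E) & \operatorname{diag}(\boldsymbol{B}_{th}\boldsymbol{q}_E)\end{bmatrix}.$$

$\boldsymbol{A}_h$ is a constant matrix for a given mass-flow vector $\boldsymbol{q}_E$, where $\boldsymbol{B}_{th} = \tfrac{1}{2}(|\boldsymbol{B}_h| + \boldsymbol{B}_h)$ and $\boldsymbol{B}_{sh} = \tfrac{1}{2}(|\boldsymbol{B}_h| - \boldsymbol{B}_h)$. Here $\boldsymbol{B}_h$ is the incidence matrix of the DHS, and $|\boldsymbol{B}_h|$ denotes the elementwise absolute value of $\boldsymbol{B}_h$ [18].

¹ $\boldsymbol{A}_h$ satisfies $\boldsymbol{A}_h\boldsymbol{1} = \boldsymbol{0}$, $\boldsymbol{1}^\top\boldsymbol{A}_h = \boldsymbol{0}$, and $\boldsymbol{A}_h + \boldsymbol{A}_h^\top \succeq \boldsymbol{0}$ with a simple zero eigenvalue, hence it can be regarded as a Kirchhoff matrix of the heating network. Therefore, the null space of $\boldsymbol{A}_h$ and $\boldsymbol{A}_1$ is $\alpha\boldsymbol{1}$, where $\alpha \in \mathbb{R}$.

2.2. Steady-state optimization of DHSs

Given the steady-state heat demand $\bar{\boldsymbol{h}}^{L}$, the steady-state DHS operation is characterized by the following two optimization problems. Problem E1 determines the economically optimal heat generation, while problem E2 selects the corresponding temperature profile by minimizing temperature deviation over the equilibrium set.

$$\textbf{E1:}\quad \min_{\boldsymbol{h}^{G} \in \mathbb{R}^{|\mathcal{G}|},\ \boldsymbol{T} \in \mathbb{R}^{n_T}} \ \tfrac{1}{2}\,\boldsymbol{h}^{G\top}\boldsymbol{F}_G\boldsymbol{h}^{G}, \tag{3a}$$

$$\text{s.t.}\quad \boldsymbol{A}_1\boldsymbol{T} = \boldsymbol{B}_1\boldsymbol{h}^{G} - \boldsymbol{B}_2\bar{\boldsymbol{h}}^{L}, \tag{3b}$$

where $\boldsymbol{F}_G = \operatorname{diag}(f^G_j) \succ \boldsymbol{0}$ collects the producer cost coefficients for $j \in \mathcal{G}$. E1 admits a unique optimizer $\boldsymbol{h}^{G\star}$. The associated equilibrium temperature set is $\bar{\boldsymbol{T}} = \boldsymbol{A}_1^{\dagger}(\boldsymbol{B}_1\boldsymbol{h}^{G\star} - \boldsymbol{B}_2\bar{\boldsymbol{h}}^{L}) + \alpha\boldsymbol{1}$, $\alpha \in \mathbb{R}$, where $\boldsymbol{A}_1^{\dagger}$ denotes the Moore–Penrose pseudoinverse.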
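As a rough illustration of what the optimizer of E1 looks like, the sketch below assumes that constraint (3b) is feasible in $\boldsymbol{T}$ exactly when total generation equals total demand (which is what the property $\boldsymbol{1}^\top\boldsymbol{A}_h = \boldsymbol{0}$ suggests). Under that assumption E1 reduces to a quadratic cost with a single balance constraint, whose minimizer equalizes the marginal costs $f^G_j h^G_j$. The function names, cost coefficients, and demand value are hypothetical and not taken from the paper.

```python
def economic_dispatch(cost_coeffs, total_demand):
    """Minimizer of 0.5 * sum(f_i * h_i**2) subject to sum(h_i) == total_demand."""
    inverse_costs = [1.0 / f for f in cost_coeffs]
    # The common marginal cost f_i * h_i shared by all producers at the optimum.
    scale = total_demand / sum(inverse_costs)
    return [scale * w for w in inverse_costs]


def marginal_cost_gaps(cost_coeffs, generation):
    """Differences between consecutive marginal costs f_i * h_i."""
    marginal = [f * h for f, h in zip(cost_coeffs, generation)]
    return [marginal[i] - marginal[i + 1] for i in range(len(marginal) - 1)]


# Hypothetical data: three producers and a total steady-state demand of 70 (scaled units).
cost_coeffs = [1.0, 2.0, 4.0]
h_star = economic_dispatch(cost_coeffs, 70.0)
gaps = marginal_cost_gaps(cost_coeffs, h_star)
print(h_star, gaps)
```

In this toy instance the cheapest producer takes the largest share, and all consecutive marginal-cost gaps vanish at the optimum.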
Problem E2 then determines $\boldsymbol{T}^{\star}$ by minimizing the temperature-deviation cost over this set.

$$\textbf{E2:}\quad \min_{\alpha \in \mathbb{R},\ \boldsymbol{T} \in \mathbb{R}^{n_T}} \ \tfrac{1}{2}\,\boldsymbol{T}^{\top}\boldsymbol{F}_D\boldsymbol{T}, \tag{4a}$$

$$\text{s.t.}\quad \boldsymbol{T} = \boldsymbol{A}_1^{\dagger}(\boldsymbol{B}_1\boldsymbol{h}^{G\star} - \boldsymbol{B}_2\bar{\boldsymbol{h}}^{L}) + \alpha\boldsymbol{1}, \tag{4b}$$

where $\boldsymbol{F}_D = \operatorname{diag}(f^D_i) \succ \boldsymbol{0}$, and $f^D_i$ represents the temperature deviation cost coefficient at node $i$.

The following result characterizes the steady-state optimality conditions that link the DHS equilibrium to problems E1 and E2.

Theorem 1. If the DHS (2c) achieves an equilibrium at $\boldsymbol{T}^{\star}$ and $\boldsymbol{h}^{G\star}$, and satisfies $\boldsymbol{F}_M\boldsymbol{h}^{G\star} = \boldsymbol{0}$ and $\boldsymbol{1}^\top\boldsymbol{F}_D\boldsymbol{T}^{\star} = 0$, where $\boldsymbol{F}_M \in \mathbb{R}^{(|\mathcal{G}|-1)\times|\mathcal{G}|}$ is defined by

$$\boldsymbol{F}_M = \begin{bmatrix} f^G_1 & -f^G_2 & 0 & \cdots & 0\\ 0 & f^G_2 & -f^G_3 & \cdots & 0\\ \vdots & \vdots & \ddots & \ddots & \vdots\\ 0 & 0 & \cdots & f^G_{|\mathcal{G}|-1} & -f^G_{|\mathcal{G}|}\end{bmatrix},$$

then it uniquely solves the optimization problems E1 and E2.

The proof is deferred to Appendix A.

3. LQR problem formulation

In this paper, $\|\cdot\|$ denotes the Euclidean norm, $\|\cdot\|_2$ the induced 2-norm, $\|\cdot\|_F$ the Frobenius norm, $\|\cdot\|_*$ the nuclear norm, and $\|\cdot\|_{\mathcal{H}_2}$ the $\mathcal{H}_2$ norm. The notation $\langle\boldsymbol{A}, \boldsymbol{B}\rangle := \operatorname{Tr}(\boldsymbol{A}^\top\boldsymbol{B})$ denotes the Frobenius inner product.

3.1. Discrete-time temperature dynamics

To facilitate controller design and simulation, we discretize the continuous-time dynamics in (2c). For simplicity, we use Euler discretization with sampling interval $\tau$. The controller developed in the remainder of the paper is not restricted to this choice and can be applied with any discretization method that yields discrete-time dynamics with a constant sampling interval.
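The resulting discrete-time DHS model is

$$\boldsymbol{T}_{k+1} = (\boldsymbol{I} - \tau\boldsymbol{A}_1)\boldsymbol{T}_k + \tau\boldsymbol{B}_1\boldsymbol{h}^{G}_k - \tau\boldsymbol{B}_2\left(\boldsymbol{h}^{L,\mathrm{con}} + \boldsymbol{h}^{L,\mathrm{sto}}_k\right) = \boldsymbol{A}_T\boldsymbol{T}_k + \boldsymbol{B}_T\boldsymbol{h}^{G}_k + \boldsymbol{B}_{LT}\boldsymbol{h}^{L,\mathrm{con}} + \boldsymbol{B}_{LT}\boldsymbol{h}^{L,\mathrm{sto}}_k, \tag{5}$$

where the heat load $\boldsymbol{h}^{L}_k$ is decomposed into a slow-varying component $\boldsymbol{h}^{L,\mathrm{con}}$ and a fast-varying stochastic component $\boldsymbol{h}^{L,\mathrm{sto}}_k$.

3.2. Output and error definition

We aim to design a temperature regulator that ensures convergence to the optimal equilibrium point $(\boldsymbol{T}^{\star}, \boldsymbol{h}^{G\star})$ defined by the solutions of E1 and E2. To achieve this, we introduce an error signal whose convergence to zero guarantees satisfaction of the corresponding optimality conditions. Specifically, the error definition is constructed directly from the E1–E2 optimality conditions in Theorem 1, with $\boldsymbol{e}_k$ having the same dimension as $\boldsymbol{h}^{G}_k$:

$$\boldsymbol{e}_k = \begin{bmatrix}\boldsymbol{0}\\ \boldsymbol{1}^\top\boldsymbol{F}_D\end{bmatrix}\boldsymbol{T}_k + \begin{bmatrix}\boldsymbol{F}_M\\ \boldsymbol{0}\end{bmatrix}\boldsymbol{h}^{G}_k = \boldsymbol{C}_T\boldsymbol{T}_k + \boldsymbol{D}_T\boldsymbol{h}^{G}_k. \tag{6}$$

3.3. Augmented dynamics

To achieve both temperature regulation and economically consistent steady-state operation, we design the controller so that the optimality error satisfies $\mathbb{E}[\boldsymbol{e}_k] \to \boldsymbol{0}$ and the heat-generation increment satisfies $\mathbb{E}[\boldsymbol{h}^{G}_k - \boldsymbol{h}^{G}_{k-1}] \to \boldsymbol{0}$ at equilibrium in expectation. This motivates the augmented state $\boldsymbol{x}_{k+1} = \begin{bmatrix}\boldsymbol{T}_{k+1} - \boldsymbol{T}_k\\ \boldsymbol{e}_k\end{bmatrix}$. The resulting augmented dynamics are

$$\boldsymbol{x}_{k+1} = \begin{bmatrix}\boldsymbol{A}_T & \boldsymbol{0}\\ \boldsymbol{C}_T & \boldsymbol{I}\end{bmatrix}\boldsymbol{x}_k + \begin{bmatrix}\boldsymbol{B}_T\\ \boldsymbol{D}_T\end{bmatrix}\boldsymbol{u}_k + \begin{bmatrix}\boldsymbol{B}_{LT}\\ \boldsymbol{0}\end{bmatrix}\boldsymbol{w}_k = \boldsymbol{A}\boldsymbol{x}_k + \boldsymbol{B}\boldsymbol{u}_k + \boldsymbol{B}_w\boldsymbol{w}_k, \tag{7a}$$

$$\boldsymbol{u}_k = \boldsymbol{h}^{G}_k - \boldsymbol{h}^{G}_{k-1}, \qquad \boldsymbol{w}_k = \boldsymbol{h}^{L,\mathrm{sto}}_k - \boldsymbol{h}^{L,\mathrm{sto}}_{k-1}, \tag{7b}$$

$$\boldsymbol{e}_k = \begin{bmatrix}\boldsymbol{C}_T & \boldsymbol{I}\end{bmatrix}\boldsymbol{x}_k + \boldsymbol{D}_T\boldsymbol{u}_k = \boldsymbol{C}\boldsymbol{x}_k + \boldsymbol{D}\boldsymbol{u}_k. \tag{7c}$$

Assumption 1. The pair $(\boldsymbol{A}, \boldsymbol{B})$ in (7) is controllable.

Assumption 2 (Bounded disturbances with finite covariance). The disturbance sequence $\{\boldsymbol{w}_k\}$ is zero-mean and uniformly bounded. Specifically, there exists a constant $\delta > 0$ such that $\|\boldsymbol{w}_k\| \le \delta$ for all $k$. Moreover, the disturbance has a finite, time-invariant covariance matrix, i.e., $\mathbb{E}[\boldsymbol{w}_k] = \boldsymbol{0}$ and $\mathbb{E}[\boldsymbol{w}_k\boldsymbol{w}_k^\top] = \boldsymbol{U}_{\mathrm{dis}} \succeq \boldsymbol{0}$, where $\boldsymbol{U}_{\mathrm{dis}}$ is finite.

Definition 1.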
The augmented system (7) has an input dimension of $m = |\mathcal{G}|$ and a state dimension of $n = n_T + |\mathcal{G}|$.

Proposition 1 (Convergence to the optimal operating point under a stabilizing feedback law). Consider the original DHS (2c) and its augmented representation in (7) under the state-feedback law $\boldsymbol{u}_k = \boldsymbol{K}\boldsymbol{x}_k$, where $\boldsymbol{K}$ is stabilizing. If $\mathbb{E}[\boldsymbol{x}_k] \to \boldsymbol{0}$, $\mathbb{E}[\boldsymbol{u}_k] \to \boldsymbol{0}$, and $\mathbb{E}[\boldsymbol{e}_k] \to \boldsymbol{0}$, then the original DHS converges in expectation to its optimal equilibrium $(\boldsymbol{T}^{\star}, \boldsymbol{h}^{G\star})$ defined in (3) and (4). Moreover, under Assumption 2, the closed-loop state is mean-square bounded, i.e., $\sup_k \mathbb{E}\|\boldsymbol{x}_k\|^2 < \infty$; in particular, $\mathbb{E}[\boldsymbol{x}_k\boldsymbol{x}_k^\top]$ converges to a unique stationary covariance that scales with the disturbance covariance level.

PROOF. From $\mathbb{E}[\boldsymbol{u}_k] = \mathbb{E}[\boldsymbol{h}^{G}_k - \boldsymbol{h}^{G}_{k-1}] \to \boldsymbol{0}$ it follows that $\mathbb{E}[\boldsymbol{h}^{G}_k]$ is asymptotically constant. Together with $\mathbb{E}[\boldsymbol{x}_k] \to \boldsymbol{0}$ and $\mathbb{E}[\boldsymbol{e}_k] \to \boldsymbol{0}$, the expected steady state of the original DHS satisfies the KKT-based optimality conditions characterized in Theorem 1; hence the original DHS converges in expectation to $(\boldsymbol{T}^{\star}, \boldsymbol{h}^{G\star})$. Since $\boldsymbol{K}$ is stabilizing, the closed-loop matrix satisfies $\rho(\boldsymbol{A} + \boldsymbol{B}\boldsymbol{K}) < 1$, where $\rho(\cdot)$ denotes the spectral radius. With zero-mean disturbances of finite covariance, the standard discrete-time Lyapunov argument implies mean-square boundedness and convergence of $\mathbb{E}[\boldsymbol{x}_k\boldsymbol{x}_k^\top]$ to the stationary covariance.

3.4. Model-based LQR temperature regulator

3.4.1. Stationary distribution of $\boldsymbol{x}_k$

To support online learning, we inject a bounded exploration signal with magnitude $\sigma > 0$. Let $\{\boldsymbol{w}_{e,k}\}$ be a zero-mean bounded sequence with $\mathbb{E}[\boldsymbol{w}_{e,k}] = \boldsymbol{0}$ and $\mathbb{E}[\boldsymbol{w}_{e,k}\boldsymbol{w}_{e,k}^\top] = \boldsymbol{I}_m$. The control input is $\boldsymbol{u}_k = \boldsymbol{K}\boldsymbol{x}_k + \sigma\boldsymbol{w}_{e,k}$, yielding the closed-loop dynamics $\boldsymbol{x}_{k+1} = (\boldsymbol{A} + \boldsymbol{B}\boldsymbol{K})\boldsymbol{x}_k + \boldsymbol{\epsilon}_k$, where $\boldsymbol{\epsilon}_k := \boldsymbol{B}_w\boldsymbol{w}_k + \sigma\boldsymbol{B}\boldsymbol{w}_{e,k}$.
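Under Assumption 2, $\{\boldsymbol{w}_k\}$ and $\{\boldsymbol{w}_{e,k}\}$ are zero-mean and independent, so $\boldsymbol{\epsilon}_k$ is zero-mean with covariance $\mathbb{E}[\boldsymbol{\epsilon}_k\boldsymbol{\epsilon}_k^\top] = \boldsymbol{B}_w\boldsymbol{U}_{\mathrm{dis}}\boldsymbol{B}_w^\top + \sigma^2\boldsymbol{B}\boldsymbol{B}^\top =: \boldsymbol{U}_{\epsilon} \succeq \boldsymbol{0}$. Since both sequences are bounded, $\{\boldsymbol{\epsilon}_k\}$ is also bounded. For $\rho(\boldsymbol{A} + \boldsymbol{B}\boldsymbol{K}) < 1$, the closed-loop state process admits a stationary covariance $\boldsymbol{U}_K$, which is the unique positive semidefinite solution to

$$\boldsymbol{U}_K = \boldsymbol{U}_{\epsilon} + (\boldsymbol{A} + \boldsymbol{B}\boldsymbol{K})\boldsymbol{U}_K(\boldsymbol{A} + \boldsymbol{B}\boldsymbol{K})^\top, \tag{8a}$$

$$\boldsymbol{U}_K = \sum_{i=0}^{\infty}(\boldsymbol{A} + \boldsymbol{B}\boldsymbol{K})^{i}\,\boldsymbol{U}_{\epsilon}\,\left[(\boldsymbol{A} + \boldsymbol{B}\boldsymbol{K})^{i}\right]^\top. \tag{8b}$$

This is a standard result for stable discrete-time stochastic linear systems [30].

3.4.2. Cost function

To quantify long-run regulation performance, we define the performance output as $\boldsymbol{z}_k = \begin{bmatrix}\boldsymbol{Q}^{1/2}\boldsymbol{x}_k\\ \boldsymbol{R}^{1/2}\boldsymbol{u}_k\end{bmatrix}$, where $\boldsymbol{Q} \succ \boldsymbol{0}$ and $\boldsymbol{R} \succ \boldsymbol{0}$ weight state deviations and control effort, respectively. We consider the feedback law $\boldsymbol{u}_k = \boldsymbol{K}\boldsymbol{x}_k$. Under Assumption 2 and a stabilizing gain $\boldsymbol{K}$ satisfying $\rho(\boldsymbol{A} + \boldsymbol{B}\boldsymbol{K}) < 1$, the closed-loop system admits a unique stationary covariance. The corresponding $\mathcal{H}_2$ cost of the closed-loop map $\mathcal{T}(\boldsymbol{K}): \boldsymbol{B}_w\boldsymbol{w} \mapsto \boldsymbol{z}$ is

$$J(\boldsymbol{K}) := \|\mathcal{T}(\boldsymbol{K})\|_{\mathcal{H}_2}^{2} = \operatorname{Tr}\left((\boldsymbol{Q} + \boldsymbol{K}^\top\boldsymbol{R}\boldsymbol{K})\boldsymbol{U}_K\right) \tag{9a}$$

$$\phantom{J(\boldsymbol{K})} = \operatorname{Tr}\left(\boldsymbol{P}_K\boldsymbol{U}_{\epsilon}\right), \tag{9b}$$

where $\boldsymbol{P}_K \succeq \boldsymbol{0}$ is the unique solution to

$$\boldsymbol{P}_K = \boldsymbol{Q} + \boldsymbol{K}^\top\boldsymbol{R}\boldsymbol{K} + (\boldsymbol{A} + \boldsymbol{B}\boldsymbol{K})^\top\boldsymbol{P}_K(\boldsymbol{A} + \boldsymbol{B}\boldsymbol{K}). \tag{10}$$

Equivalently, $J(\boldsymbol{K})$ is the infinite-horizon long-run average quadratic cost $J(\boldsymbol{K}) = \lim_{N \to \infty}\frac{1}{N}\sum_{k=0}^{N-1}\mathbb{E}[\boldsymbol{z}_k^\top\boldsymbol{z}_k]$.

3.4.3. Model-based LQR

When the system matrices $(\boldsymbol{A}, \boldsymbol{B})$ are known and the pair is stabilizable, the infinite-horizon discrete-time LQR problem admits the optimal state-feedback solution

$$\boldsymbol{K}^{\star} = -(\boldsymbol{R} + \boldsymbol{B}^\top\boldsymbol{P}^{\star}\boldsymbol{B})^{-1}\boldsymbol{B}^\top\boldsymbol{P}^{\star}\boldsymbol{A}, \tag{11}$$

where $\boldsymbol{P}^{\star} \succeq \boldsymbol{0}$ is the stabilizing solution to the discrete-time algebraic Riccati equation.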
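To make the relations (8)–(11) concrete, the following minimal sketch works through a scalar instance (one state, one input), where the Riccati equation, the gain (11), the stationary covariance (8a), and the two cost expressions (9a)–(9b) all reduce to elementary arithmetic. The plant value, weights, and noise variance are hypothetical, and the fixed-point Riccati iteration is one simple choice of solver rather than the method used in the paper.

```python
def scalar_riccati(a, b, q, r, iterations=500):
    """Fixed-point iteration on the scalar discrete-time algebraic Riccati equation."""
    p = q
    for _ in range(iterations):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return p


def scalar_lqr_gain(a, b, r, p):
    """Scalar version of K* = -(R + B'PB)^{-1} B'PA."""
    return -(b * p * a) / (r + b * b * p)


def stationary_variance(a_cl, u_eps):
    """Scalar version of U_K = U_eps + a_cl * U_K * a_cl, valid only for |a_cl| below 1."""
    if abs(a_cl) >= 1.0:
        raise ValueError("closed loop is not stable")
    return u_eps / (1.0 - a_cl * a_cl)


# Hypothetical scalar plant, weights, and aggregated noise variance.
a, b, q, r, u_eps = 1.01, 1.0, 1.0, 1.0, 0.04
p_star = scalar_riccati(a, b, q, r)
k_star = scalar_lqr_gain(a, b, r, p_star)
a_cl = a + b * k_star

# Cost evaluated through the state covariance (9a) and through the Lyapunov solution (9b).
cost_via_state_cov = (q + k_star * r * k_star) * stationary_variance(a_cl, u_eps)
cost_via_lyapunov = p_star * u_eps
print(k_star, a_cl, cost_via_state_cov, cost_via_lyapunov)
```

The two cost values agree because, at the optimal gain, the Riccati solution also satisfies the Lyapunov equation (10).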
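The standard model-based solution (11) serves as a benchmark for the subsequent data-driven controller design [2].

3.4.4. Indirect certainty-equivalence (CE) LQR

As a benchmark data-driven approach, CE first identifies a model from data and then solves the corresponding LQR problem as if the identified dynamics were exact. Consider a trajectory of length $t$ generated under a stabilizing policy $\boldsymbol{K}_0$, and define

$$\begin{aligned}\boldsymbol{X}_0 &:= [\boldsymbol{x}_0\ \boldsymbol{x}_1\ \cdots\ \boldsymbol{x}_{t-1}] \in \mathbb{R}^{n \times t}, & \boldsymbol{U}_0 &:= [\boldsymbol{u}_0\ \boldsymbol{u}_1\ \cdots\ \boldsymbol{u}_{t-1}] \in \mathbb{R}^{m \times t},\\ \boldsymbol{X}_1 &:= [\boldsymbol{x}_1\ \boldsymbol{x}_2\ \cdots\ \boldsymbol{x}_{t}] \in \mathbb{R}^{n \times t}, & \boldsymbol{W}_0 &:= [\boldsymbol{w}_0\ \boldsymbol{w}_1\ \cdots\ \boldsymbol{w}_{t-1}] \in \mathbb{R}^{n \times t}.\end{aligned} \tag{12}$$

Let $\boldsymbol{D}_0 := \begin{bmatrix}\boldsymbol{U}_0\\ \boldsymbol{X}_0\end{bmatrix} = [\boldsymbol{d}_0\ \boldsymbol{d}_1\ \cdots\ \boldsymbol{d}_{t-1}]$ and $\boldsymbol{\Phi} := \frac{1}{t}\boldsymbol{D}_0\boldsymbol{D}_0^\top$, so that $\boldsymbol{X}_1 = \boldsymbol{A}\boldsymbol{X}_0 + \boldsymbol{B}\boldsymbol{U}_0 + \boldsymbol{W}_0$.

Assumption 3 (Persistent excitation). The input sequence $\{\boldsymbol{u}_k\}$ is persistently exciting of order $n + m$, so that the data matrix $\boldsymbol{D}_0$ has full row rank.

Under Assumption 3, the least-squares estimate is

$$[\hat{\boldsymbol{B}},\ \hat{\boldsymbol{A}}] = \bar{\boldsymbol{X}}_1\boldsymbol{\Phi}^{-1}, \qquad \bar{\boldsymbol{X}}_1 := \tfrac{1}{t}\boldsymbol{X}_1\boldsymbol{D}_0^\top. \tag{13}$$

The aggregated-disturbance covariance is estimated by

$$\hat{\boldsymbol{\varepsilon}}_0 = \boldsymbol{X}_1 - (\hat{\boldsymbol{A}} + \hat{\boldsymbol{B}}\boldsymbol{K}_0)\boldsymbol{X}_0, \qquad \hat{\boldsymbol{U}}_{\epsilon} = \tfrac{1}{t}\hat{\boldsymbol{\varepsilon}}_0\hat{\boldsymbol{\varepsilon}}_0^\top. \tag{14}$$

Fixing $\hat{\boldsymbol{U}}_{\epsilon}$ from the initial batch, the CE LQR problem is

$$\min_{\boldsymbol{K},\ \hat{\boldsymbol{U}}_K}\ \hat{J}(\boldsymbol{K}) = \operatorname{Tr}\left((\boldsymbol{Q} + \boldsymbol{K}^\top\boldsymbol{R}\boldsymbol{K})\hat{\boldsymbol{U}}_K\right), \tag{15a}$$

$$\text{s.t.}\quad \hat{\boldsymbol{U}}_K = \hat{\boldsymbol{U}}_{\epsilon} + (\hat{\boldsymbol{A}} + \hat{\boldsymbol{B}}\boldsymbol{K})\hat{\boldsymbol{U}}_K(\hat{\boldsymbol{A}} + \hat{\boldsymbol{B}}\boldsymbol{K})^\top. \tag{15b}$$

Remark 1 (Role of estimating $\boldsymbol{U}_{\epsilon}$). Although the optimal controller is theoretically independent of $\boldsymbol{U}_{\epsilon}$, estimating $\boldsymbol{U}_{\epsilon}$ improves the physical relevance of covariance-weighted learning in DHS applications, where disturbances may be anisotropic and state-correlated.

4. Data-enabled policy optimization (DeePO)

4.1. Covariance parameterization

Assumption 4 (Uniform boundedness of signals). There exist positive constants $\bar{x} > 0$, $\bar{u} > 0$, and $\bar{d} > 0$ such that $\|\boldsymbol{x}_t\| \le \bar{x}$, $\|\boldsymbol{u}_t\| \le \bar{u}$, and $\|\boldsymbol{d}_t\| \le \bar{d}$ for all $t \in \mathbb{N}$.

To obtain a data-enabled parameterization of the LQR problem, define $\bar{\boldsymbol{U}}_0 := \frac{1}{t}\boldsymbol{U}_0\boldsymbol{D}_0^\top$, $\bar{\boldsymbol{X}}_0 := \frac{1}{t}\boldsymbol{X}_0\boldsymbol{D}_0^\top$, $\bar{\boldsymbol{X}}_1 := \frac{1}{t}\boldsymbol{X}_1\boldsymbol{D}_0^\top$, and $\bar{\boldsymbol{W}}_0 := \frac{1}{t}\boldsymbol{W}_0\boldsymbol{D}_0^\top$.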
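Then $\bar{\boldsymbol{X}}_1 = \boldsymbol{A}\bar{\boldsymbol{X}}_0 + \boldsymbol{B}\bar{\boldsymbol{U}}_0 + \bar{\boldsymbol{W}}_0$. Under Assumption 3, there exists a unique matrix $\boldsymbol{V} \in \mathbb{R}^{(n+m) \times n}$ such that

$$\begin{bmatrix}\boldsymbol{K}\\ \boldsymbol{I}_n\end{bmatrix} = \boldsymbol{\Phi}\boldsymbol{V} = \begin{bmatrix}\bar{\boldsymbol{U}}_0\boldsymbol{V}\\ \bar{\boldsymbol{X}}_0\boldsymbol{V}\end{bmatrix}.$$

Hence, the closed-loop matrix can be written as $\boldsymbol{A} + \boldsymbol{B}\boldsymbol{K} = (\bar{\boldsymbol{X}}_1 - \bar{\boldsymbol{W}}_0)\boldsymbol{V}$. Since $\{\boldsymbol{d}_t\}$ is uniformly bounded and $\{\boldsymbol{w}_t\}$ is zero-mean with finite covariance, $\|\bar{\boldsymbol{W}}_0\|_2 \to 0$ almost surely as $t \to \infty$. Thus, for large samples, $\boldsymbol{A} + \boldsymbol{B}\boldsymbol{K} \approx \bar{\boldsymbol{X}}_1\boldsymbol{V}$. The covariance estimate is

$$\hat{\boldsymbol{\varepsilon}}_0 = \boldsymbol{X}_1 - \bar{\boldsymbol{X}}_1\boldsymbol{V}\boldsymbol{X}_0, \qquad \hat{\boldsymbol{U}}_{\epsilon} = \tfrac{1}{t}\hat{\boldsymbol{\varepsilon}}_0\hat{\boldsymbol{\varepsilon}}_0^\top. \tag{16}$$

Using $\begin{bmatrix}\boldsymbol{K}\\ \boldsymbol{I}_n\end{bmatrix} = \boldsymbol{\Phi}\boldsymbol{V}$ and $\boldsymbol{K} = \bar{\boldsymbol{U}}_0\boldsymbol{V}$, the CE-LQR problem in (15) can be rewritten in terms of $\boldsymbol{V}$ as

$$\min_{\boldsymbol{V},\ \hat{\boldsymbol{U}}_V}\ J(\boldsymbol{V}) = \operatorname{Tr}\left((\boldsymbol{Q} + \boldsymbol{V}^\top\bar{\boldsymbol{U}}_0^\top\boldsymbol{R}\bar{\boldsymbol{U}}_0\boldsymbol{V})\hat{\boldsymbol{U}}_V\right), \tag{17a}$$

$$\text{s.t.}\quad \hat{\boldsymbol{U}}_V = \hat{\boldsymbol{U}}_{\epsilon} + (\bar{\boldsymbol{X}}_1\boldsymbol{V})\hat{\boldsymbol{U}}_V(\bar{\boldsymbol{X}}_1\boldsymbol{V})^\top, \tag{17b}$$

$$\phantom{\text{s.t.}}\quad \bar{\boldsymbol{X}}_0\boldsymbol{V} = \boldsymbol{I}. \tag{17c}$$

Let $\boldsymbol{V} := \boldsymbol{\Phi}^{-1}\begin{bmatrix}\boldsymbol{K}\\ \boldsymbol{I}_n\end{bmatrix}$. Then (15) and (17) are equivalent, and the optimal gain is recovered as $\boldsymbol{K}^{\star} = \bar{\boldsymbol{U}}_0\boldsymbol{V}^{\star}$ [29].

4.2. DeePO implementation

Assumption 5 (Feasibility of covariance parameterization). The covariance parameterization of the LQR problem in (17) admits a nonempty feasible set. Specifically, there exist constants $\mu \in (0, 1)$ and $\bar{k} > 0$ such that

$$\mathcal{S} := \left\{\boldsymbol{V}\ \middle|\ \bar{\boldsymbol{X}}_0\boldsymbol{V} = \boldsymbol{I}_n,\ \rho(\bar{\boldsymbol{X}}_1\boldsymbol{V}) \le 1 - \mu,\ \|\bar{\boldsymbol{U}}_0\boldsymbol{V}\|_2 \le \bar{k}\right\}$$

is nonempty.

Theorem 2. Consider (17). Assume that $\boldsymbol{V}$ satisfies $\bar{\boldsymbol{X}}_0\boldsymbol{V} = \boldsymbol{I}$ and that $\hat{\boldsymbol{U}}_V$ is the unique solution of (17b) associated with $\boldsymbol{V}$. Then the gradient of $J(\boldsymbol{V})$ in (17a) is

$$\nabla_{\boldsymbol{V}}J(\boldsymbol{V}) = 2\left(\bar{\boldsymbol{U}}_0^\top\boldsymbol{R}\bar{\boldsymbol{U}}_0 + \bar{\boldsymbol{X}}_1^\top\boldsymbol{P}_V\bar{\boldsymbol{X}}_1\right)\boldsymbol{V}\hat{\boldsymbol{U}}_V. \tag{18}$$

The proof is deferred to Appendix B.

4.2.1. Rank-1 gradient-descent implementation of DeePO

In the adaptive control setting, we collect online closed-loop data $(\boldsymbol{x}_t, \boldsymbol{u}_t, \boldsymbol{x}_{t+1})$ at each time step $t$, which are used to form the data matrices $(\boldsymbol{X}_{0,t+1}, \boldsymbol{U}_{0,t+1}, \boldsymbol{X}_{1,t+1})$. These data enable a single projected gradient-descent update of the parameterized policy at time $t$. The updated policy is then applied to the system, and the procedure is repeated iteratively.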
The method is summarized in Algorithm 1.

Algorithm 1: Rank-1 GD DeePO
Input: Initial stabilizing policy $\boldsymbol{K}_{t_0}$, stepsize $\eta$, and offline data $(\boldsymbol{X}_{0,t_0}, \boldsymbol{U}_{0,t_0}, \boldsymbol{X}_{1,t_0})$.
1. for $t = t_0, t_0 + 1, \ldots$ do
2.   Apply $\boldsymbol{u}_t = \boldsymbol{K}_t\boldsymbol{x}_t$ and observe $\boldsymbol{x}_{t+1}$;
3.   Form $\boldsymbol{\phi}_t := [\boldsymbol{u}_t^\top\ \boldsymbol{x}_t^\top]^\top$;
4.   Update $\bar{\boldsymbol{X}}_{0,t+1}$, $\bar{\boldsymbol{U}}_{0,t+1}$, $\bar{\boldsymbol{X}}_{1,t+1}$ recursively;
5.   Update $\boldsymbol{\Phi}_{t+1} = \dfrac{t\boldsymbol{\Phi}_t + \boldsymbol{\phi}_t\boldsymbol{\phi}_t^\top}{t+1}$ and compute $\boldsymbol{\Phi}_{t+1}^{-1}$;
6.   Compute $\boldsymbol{V}_{t+1} = \boldsymbol{\Phi}_{t+1}^{-1}\begin{bmatrix}\boldsymbol{K}_t\\ \boldsymbol{I}_n\end{bmatrix}$;
7.   Perform one-step projected GD:
     $$\boldsymbol{V}'_{t+1} = \boldsymbol{V}_{t+1} - \eta\,\boldsymbol{\Pi}_{\bar{\boldsymbol{X}}_{0,t+1}}\nabla J_{t+1}(\boldsymbol{V}_{t+1}); \tag{19}$$
     Update the control gain:
     $$\boldsymbol{K}_{t+1} = \bar{\boldsymbol{U}}_{0,t+1}\boldsymbol{V}'_{t+1}. \tag{20}$$

Assumption 6 (Persistency of excitation for online DeePO). For all $t \ge t_0$, the input sequence $\{\boldsymbol{u}_i\}_{i=0}^{t-1}$ provides a uniform level of excitation. Specifically, the input matrix $\boldsymbol{U}_{0,t} := [\boldsymbol{u}_0, \ldots, \boldsymbol{u}_{t-1}]$ is persistently exciting of order $n + 1$ with excitation level $\gamma\sqrt{t(n+1)}$, in the sense that the associated Hankel matrix

$$\mathcal{H}_{n+1}(\boldsymbol{U}_{0,t}) = \begin{bmatrix}\boldsymbol{u}_0 & \boldsymbol{u}_1 & \cdots & \boldsymbol{u}_{t-n-1}\\ \boldsymbol{u}_1 & \boldsymbol{u}_2 & \cdots & \boldsymbol{u}_{t-n}\\ \vdots & \vdots & \ddots & \vdots\\ \boldsymbol{u}_n & \boldsymbol{u}_{n+1} & \cdots & \boldsymbol{u}_{t-1}\end{bmatrix}$$

satisfies $\sigma_{\min}\left(\mathcal{H}_{n+1}(\boldsymbol{U}_{0,t})\right) \ge \gamma\sqrt{t(n+1)}$ for some constant $\gamma > 0$ independent of $t$.

Remark 2. Assumption 6 ensures that sufficiently informative data are available for covariance estimation and online policy updates, and together with Assumption 2 determines the SNR used in the subsequent analysis [29]. The signal-to-noise ratio (SNR) is defined as $\gamma/\delta$, where $\gamma$ and $\delta$ are defined in Assumptions 6 and 2, respectively.

The sample covariance matrices are updated recursively. Let $\boldsymbol{\phi}_t := [\boldsymbol{u}_t^\top\ \boldsymbol{x}_t^\top]^\top$. Then, for example, $\bar{\boldsymbol{X}}_{0,t+1} = \frac{t}{t+1}\bar{\boldsymbol{X}}_{0,t} + \frac{1}{t+1}\boldsymbol{x}_t\boldsymbol{\phi}_t^\top$, and similarly for $\bar{\boldsymbol{U}}_{0,t+1}$ and $\bar{\boldsymbol{X}}_{1,t+1}$.
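The recursive averaging in steps 4 and 5 of Algorithm 1 can be sketched as follows. The example uses plain Python lists and a scalar state and input purely for illustration. The helper names and the data values are hypothetical, and the check against a batch average is only there to show that the recursion reproduces the full-record sample covariance.

```python
def outer(left, right):
    """Outer product left * right^T for two plain lists."""
    return [[l * r for r in right] for l in left]


def recursive_average(previous, t, left, phi):
    """Return (t * previous + left * phi^T) / (t + 1), the running sample-covariance update."""
    increment = outer(left, phi)
    return [
        [(t * previous[i][j] + increment[i][j]) / (t + 1) for j in range(len(phi))]
        for i in range(len(left))
    ]


def batch_average(lefts, phis):
    """Return (1/t) * sum_i left_i * phi_i^T computed from the full data record."""
    t = len(phis)
    total = [[0.0] * len(phis[0]) for _ in lefts[0]]
    for left, phi in zip(lefts, phis):
        for i, row in enumerate(outer(left, phi)):
            for j, value in enumerate(row):
                total[i][j] += value
    return [[value / t for value in row] for row in total]


# Hypothetical scalar-state, scalar-input samples; phi_t stacks (u_t, x_t).
states = [[1.0], [0.5], [-0.2], [0.3]]
inputs = [[0.1], [-0.4], [0.2], [0.0]]
phis = [u + x for u, x in zip(inputs, states)]

x0_bar = outer(states[0], phis[0])   # sample covariance after one sample
phi_bar = outer(phis[0], phis[0])
for t in range(1, len(phis)):
    x0_bar = recursive_average(x0_bar, t, states[t], phis[t])
    phi_bar = recursive_average(phi_bar, t, phis[t], phis[t])

x0_batch = batch_average(states, phis)
phi_batch = batch_average(phis, phis)
print(x0_bar, phi_bar)
```

Each update costs one outer product, so the per-step work does not grow with the length of the data record.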
The sample covariance matrix satisfies $\boldsymbol{\Phi}_{t+1} = \dfrac{t\boldsymbol{\Phi}_t + \boldsymbol{\phi}_t\boldsymbol{\phi}_t^\top}{t+1}$. By the Sherman–Morrison formula [20], $\boldsymbol{\Phi}_{t+1}^{-1}$ is calculated as

$$\boldsymbol{\Phi}_{t+1}^{-1} = \frac{t+1}{t}\left(\boldsymbol{\Phi}_t^{-1} - \frac{\boldsymbol{\Phi}_t^{-1}\boldsymbol{\phi}_t\boldsymbol{\phi}_t^\top\boldsymbol{\Phi}_t^{-1}}{t + \boldsymbol{\phi}_t^\top\boldsymbol{\Phi}_t^{-1}\boldsymbol{\phi}_t}\right).$$

The projection operator $\boldsymbol{\Pi}_{\bar{\boldsymbol{X}}_{0,t+1}}$ in (19) denotes the orthogonal projection onto the tangent space of the affine constraint $\bar{\boldsymbol{X}}_{0,t+1}\boldsymbol{V} = \boldsymbol{I}$. It preserves the equality constraint (17c) to first order and can be implemented either by the closed-form projector $\boldsymbol{I} - \bar{\boldsymbol{X}}_{0,t+1}^\top(\bar{\boldsymbol{X}}_{0,t+1}\bar{\boldsymbol{X}}_{0,t+1}^\top)^{-1}\bar{\boldsymbol{X}}_{0,t+1}$ or via a null-space parameterization.

For any gain $\boldsymbol{K}$, let $J(\boldsymbol{K})$ denote the true steady-state cost and $\hat{J}(\boldsymbol{K})$ its covariance-estimated counterpart. In the lifted formulation, let $J(\boldsymbol{V})$ denote the corresponding objective in (17). We use the optimality gap $J(\boldsymbol{K}_t) - J^{\star}$ to quantify the convergence of the controller. Accordingly, define the regret

$$\mathrm{Regret}_T := \frac{1}{T}\sum_{t=t_0}^{t_0+T-1}\left(J(\boldsymbol{K}_t) - J^{\star}\right). \tag{21}$$

This regret measures the average steady-state performance gap between the online controller and the optimal steady-state policy. Let $\boldsymbol{e}_{\mathrm{noise}} := \boldsymbol{U}_{\epsilon} - \hat{\boldsymbol{U}}_{\epsilon}$ denote the covariance estimation error induced by finite data. The optimality gap in (21) can be decomposed as

$$J(\boldsymbol{K}_t) - J^{\star} = \underbrace{J(\boldsymbol{K}_t) - \hat{J}(\boldsymbol{K}_t)}_{\text{noise mismatch}} + \underbrace{\hat{J}(\boldsymbol{K}_t) - \hat{J}^{\star}}_{\text{CE regret}} + \underbrace{\hat{J}^{\star} - J^{\star}}_{\text{optimal bias}}, \tag{22}$$

where the noise mismatch is $J(\boldsymbol{K}_t) - \hat{J}(\boldsymbol{K}_t) = \operatorname{Tr}(\boldsymbol{P}_{K_t}\boldsymbol{e}_{\mathrm{noise}})$ and the optimal bias is $\hat{J}^{\star} - J^{\star} = -\operatorname{Tr}(\boldsymbol{P}_{K^{\star}}\boldsymbol{e}_{\mathrm{noise}})$, the sign following from the definition of $\boldsymbol{e}_{\mathrm{noise}}$.

Lemma 1. Under Assumptions 1–6, let Algorithm 1 run for $T := t - t_0 + 1$ steps. There exist positive constants $\nu_i > 0$, $i \in \{1, 2, 3, 4\}$, depending on problem constants (including $\|\boldsymbol{R}\|_2$ and spectral bounds on $\boldsymbol{Q}$ and $\boldsymbol{R}$), such that, if the stepsize satisfies $\eta \in (0, \nu_1]$ and $\mathrm{SNR} \ge \nu_2$, then

$$\frac{1}{T}\sum_{t=t_0}^{t_0+T-1}\text{CE regret} \le \frac{\nu_3}{\sqrt{T}} + \nu_4\,\mathrm{SNR}^{-1/2}. \tag{23}$$

The proof follows the framework of Theorem 2 in [29], which relies on projected gradient dominance and local smoothness of the objective over the feasible set. In our setting, replacing the unit disturbance covariance with a bounded covariance estimate $\hat{\boldsymbol{U}}_{\epsilon}$ does not change the proof structure; it only rescales the associated constants. Since our focus is on DHS modeling and covariance-aware robustification, the full derivation is omitted for brevity.

Lemma 2. Under Assumption 5, there exists a constant $p_{\max,2} > 0$ such that for all $\boldsymbol{V} \in \mathcal{S}$ with $\boldsymbol{K} = \bar{\boldsymbol{U}}_0\boldsymbol{V}$, the Lyapunov solution $\boldsymbol{P}_K$ in (10) satisfies $\|\boldsymbol{P}_K\|_2 \le p_{\max,2}$. Consequently, $\|\boldsymbol{P}_K\|_* \le n\|\boldsymbol{P}_K\|_2 \le n\,p_{\max,2} =: p_{\max}$, uniformly for all $\boldsymbol{V} \in \mathcal{S}$.

The proof is deferred to Appendix C.

Assumption 7 (Covariance estimation mismatch). We assume that the mismatch $\boldsymbol{e}_{\mathrm{noise}}$ is bounded in the induced 2-norm: $\|\boldsymbol{e}_{\mathrm{noise}}\|_2 \le \bar{e}_{\mathrm{noise}}$.

Theorem 3. Suppose Assumptions 1–7 hold, and let Algorithm 1 run for $T := t - t_0 + 1$ iterations given offline data $(\boldsymbol{X}_{0,t_0}, \boldsymbol{U}_{0,t_0}, \boldsymbol{X}_{1,t_0})$. There exist positive constants $\nu_i > 0$, $i \in \{1, 2, 3, 4\}$, depending on problem constants (including $\|\boldsymbol{R}\|_2$ and spectral bounds on $\boldsymbol{Q}$ and $\boldsymbol{R}$), such that, if the stepsize satisfies $\eta \in (0, \nu_1]$ and $\mathrm{SNR} \ge \nu_2$, the regret satisfies

$$\mathrm{Regret}_T \le \frac{\nu_3}{\sqrt{T}} + \nu_4\,\mathrm{SNR}^{-1/2} + 2\,p_{\max}\,\bar{e}_{\mathrm{noise}}. \tag{24}$$

The proof is deferred to Appendix D.

4.2.2. ADAM–GD implementation of DeePO

To improve the performance of online policy updates, we replace the standard gradient step on the lifted variable in (19) with an ADAM-style preconditioned update (25)–(26), while keeping the same online recursive covariance updates as in DeePO. The resulting ADAM–DeePO procedure is summarized in Algorithm 2.
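For orientation, the sketch below shows one step of the textbook ADAM rule of [13] acting elementwise on a flat parameter list. It is not the paper's update (25)–(26), which is applied to the lifted variable inside Algorithm 2 and combined with the projection; the quadratic objective, the hyperparameter values, and the function names are assumptions chosen only to make the example self-contained.

```python
import math


def adam_step(params, grads, m, v, step, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook ADAM update; every operation acts elementwise. `step` starts at 1."""
    new_params, new_m, new_v = [], [], []
    for x, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** step)          # bias-corrected first moment
        v_hat = v_i / (1.0 - beta2 ** step)          # bias-corrected second moment
        new_params.append(x - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Hypothetical ill-conditioned quadratic 0.5 * sum(c_i * x_i**2) standing in for the objective.
curvatures = [100.0, 1.0]


def loss(x):
    return 0.5 * sum(c * xi * xi for c, xi in zip(curvatures, x))


def grad(x):
    return [c * xi for c, xi in zip(curvatures, x)]


x = [1.0, 1.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
initial_loss = loss(x)
x_first, m, v = adam_step(x, grad(x), m, v, step=1)
x = x_first
for step in range(2, 1001):
    x, m, v = adam_step(x, grad(x), m, v, step)
final_loss = loss(x)
print(x_first, final_loss)
```

The first step moves both coordinates by roughly the stepsize even though their gradients differ by a factor of 100, which is the elementwise rescaling that motivates using ADAM on badly scaled objectives.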
In Algorithm 2, the ADAM moments $\mathbf{m}_t$, $\mathbf{v}_t$, $\hat{\mathbf{m}}_t$, $\hat{\mathbf{v}}_t$ have the same dimension as $\mathbf{V}_t$, namely $\mathbb{R}^{(n+m)\times n}$. In (25e), all nonlinearities act elementwise: $\sqrt{\hat{\mathbf{v}}_{t+1}}$ is taken entrywise, and $\hat{\mathbf{m}}_{t+1}/(\sqrt{\hat{\mathbf{v}}_{t+1}} + \epsilon)$ denotes elementwise division. Accordingly, $\mathbf{D}_{t+1} := \sqrt{\hat{\mathbf{v}}_{t+1}} + \epsilon$ is used only for elementwise scaling.

Assumption 8. Let $\mathbf{D}_t := \sqrt{\hat{\mathbf{v}}_t} + \epsilon$ denote the elementwise ADAM scaling matrix. There exist constants $0 < d_{\min} \le d_{\max} < \infty$ such that, for all entries $(i,j)$ and all $t$, $d_{\min} \le (\mathbf{D}_t)^{-1}_{ij} \le d_{\max}$.

Assumption 9 (Effective stepsize schedule). Let $\tilde{\eta}_t := \eta_t/(1 - \beta_1^{t+1})$ denote the bias-corrected effective stepsize. Assume that $\{\tilde{\eta}_t\}$ is positive, nonincreasing, and uniformly bounded, i.e., $0 < \tilde{\eta}_t \le \tilde{\eta}_{\max}$ for all $t$. Moreover, the stepsizes are chosen such that the cumulative stepsize grows sublinearly with the time horizon, satisfying
$$\kappa_1\sqrt{T} \le \sum_{t=t_0}^{t_0+T-1}\tilde{\eta}_t \le \kappa_2\sqrt{T}$$
for some constants $\kappa_1, \kappa_2 > 0$ and all $T \ge 1$.

Theorem 4. Under Assumptions 1–9, let Algorithm 2 run for $T := t - t_0 + 1$ steps. Then there exist constants $\nu_1, \nu_2, \nu_3, \nu_4 > 0$, depending on the problem constants (including $\|\mathbf{R}\|_2$ and the spectral bounds of $\mathbf{Q}$ and $\mathbf{R}$), such that, if $\tilde{\eta}_{\max} \in (0, \nu_1]$ and $\mathrm{SNR} \ge \nu_2$, then
$$\mathrm{Regret}_T \le \frac{\nu_3}{\sqrt{T}} + \nu_4\,\mathrm{SNR}^{-1/2} + 2\,p_{\max}\,\bar{e}_{\text{noise}}. \tag{27}$$
The proof is deferred to Appendix E.

In practice, the proposed update improves online learning performance without sacrificing performance guarantees. Moreover, Assumption 2 is adopted for analytical clarity; extending the result to light-tailed stochastic disturbances is left for future work.

5. Simulation
This section evaluates the proposed method from both methodological and application-oriented perspectives. We first use a standard three-dimensional benchmark [29] to isolate the effects of disturbance-covariance estimation and ADAM-based updates under stochastic excitation.
We then assess the method on an industrial-park DHS in Northern China [25]. In the DHS study, the system model is further subjected to static and time-varying parameter perturbations to emulate practical uncertainty in mass flow rates and heat-transfer characteristics. Overall, the results show that the proposed method achieves stable near-optimal DHS operation and improved robustness under realistic model mismatch and stochastic disturbances.

5.1. Mechanism validation on a three-dimensional benchmark system
This benchmark is not intended to represent the scale of industrial DHSs. Rather, it serves as a low-dimensional testbed for isolating two key mechanisms of the proposed approach, namely disturbance-covariance estimation and adaptive gradient updates, before moving to the industrial DHS case. We consider the marginally unstable Laplacian system from [5, 29]:
$$\mathbf{A} = \begin{bmatrix} 1.01 & 0.01 & 0 \\ 0.01 & 1.01 & 0.01 \\ 0 & 0.01 & 1.01 \end{bmatrix}.$$
The state and control weighting matrices are $\mathbf{Q} = \mathbf{R} = \mathbf{I}_3$. To evaluate DeePO under stochastic disturbances, we consider Gaussian process noise $\mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \tfrac{1}{100}\mathbf{I}_3)$ and apply a persistently exciting (PE) input $\mathbf{u}_t = \mathbf{K}_t\mathbf{x}_t + \mathbf{v}_t$, where $\mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_3)$, yielding $\mathrm{SNR} \in [0, 5]$ as in [29]. The DeePO algorithm is initialized using an LQR controller designed for a perturbed model $\mathbf{A}_{\text{design}} = (1 + \delta_A)\mathbf{A}$, where $\delta_A$ denotes the percentage model mismatch, and a 50-step warm-start dataset is collected.

Algorithm 2: ADAM–DeePO for Direct Adaptive LQR Policy Learning
Input: Initial stabilizing gain $\mathbf{K}_{t_0}$, $\mathbf{m}_{t_0} = \mathbf{0}$, $\mathbf{v}_{t_0} = \mathbf{0}$; initial stepsize $\eta_0 > 0$ (and a stepsize schedule $\{\eta_t\}_{t \ge t_0}$); ADAM hyperparameters $\beta_1, \beta_2 \in (0,1)$ and $\epsilon > 0$; offline data $(\mathbf{X}_{0,t_0}, \mathbf{U}_{0,t_0}, \mathbf{X}_{1,t_0})$.
1   for $t = t_0, t_0 + 1, \ldots$ do
2     Apply control $\mathbf{u}_t = \mathbf{K}_t\mathbf{x}_t$ and observe $\mathbf{x}_{t+1}$;
3     Form $\boldsymbol{\phi}_t := [\mathbf{u}_t^\top\ \mathbf{x}_t^\top]^\top$;
4     Update $\bar{\mathbf{X}}_{0,t+1}$, $\bar{\mathbf{U}}_{0,t+1}$, $\bar{\mathbf{X}}_{1,t+1}$ recursively;
5     Update $\boldsymbol{\Phi}_{t+1} = (t\,\boldsymbol{\Phi}_t + \boldsymbol{\phi}_t\boldsymbol{\phi}_t^\top)/(t+1)$ and compute $\boldsymbol{\Phi}_{t+1}^{-1}$;
6     Given $\mathbf{K}_t$, compute the lifted representation $\mathbf{V}_{t+1} = \boldsymbol{\Phi}_{t+1}^{-1}\begin{bmatrix}\mathbf{K}_t \\ \mathbf{I}_n\end{bmatrix}$;
7     Compute the stochastic gradient $\mathbf{g}_{t+1} := \nabla J_{t+1}(\mathbf{V}_{t+1})$;
8     ADAM moment updates:
        $\mathbf{m}_{t+1} = \beta_1\mathbf{m}_t + (1 - \beta_1)\,\mathbf{g}_{t+1}$,   (25a)
        $\mathbf{v}_{t+1} = \beta_2\mathbf{v}_t + (1 - \beta_2)\,(\mathbf{g}_{t+1} \odot \mathbf{g}_{t+1})$,   (25b)
        $\hat{\mathbf{m}}_{t+1} = \mathbf{m}_{t+1}/(1 - \beta_1^{\,t-t_0+1})$,   (25c)
        $\hat{\mathbf{v}}_{t+1} = \mathbf{v}_{t+1}/(1 - \beta_2^{\,t-t_0+1})$,   (25d)
        $\tilde{\mathbf{V}}_{t+1} = \mathbf{V}_{t+1} - \tilde{\eta}_t\,\dfrac{\hat{\mathbf{m}}_{t+1}}{\sqrt{\hat{\mathbf{v}}_{t+1}} + \epsilon} = \mathbf{V}_{t+1} - \tilde{\eta}_t\,\dfrac{\hat{\mathbf{m}}_{t+1}}{\mathbf{D}_{t+1}}$,   (25e)
      where $\tilde{\eta}_t := \eta_t/(1 - \beta_1^{\,t-t_0+1})$ and $\mathbf{D}_{t+1} := \sqrt{\hat{\mathbf{v}}_{t+1}} + \epsilon$ (elementwise);
9     Affine projection to satisfy $\bar{\mathbf{X}}_{0,t+1}\mathbf{V} = \mathbf{I}$:
        $\mathbf{V}'_{t+1} = \tilde{\mathbf{V}}_{t+1} + \bar{\mathbf{X}}_{0,t+1}^\top\left(\bar{\mathbf{X}}_{0,t+1}\bar{\mathbf{X}}_{0,t+1}^\top\right)^{-1}\left(\mathbf{I} - \bar{\mathbf{X}}_{0,t+1}\tilde{\mathbf{V}}_{t+1}\right)$;   (26)
10    Update the control gain: $\mathbf{K}_{t+1} = \bar{\mathbf{U}}_{0,t+1}\mathbf{V}'_{t+1}$.

5.1.1. Effect of disturbance–covariance estimation
To assess the effect of disturbance–covariance estimation, we test four input matrices with increasing coupling levels: decoupled
$$\mathbf{B}_{w1} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},$$
moderately coupled
$$\mathbf{B}_{w2} = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},$$
strongly coupled
$$\mathbf{B}_{w3} = \begin{bmatrix} 1 & 1 & 0.5 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},$$
and highly coupled
$$\mathbf{B}_{w4} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$
A 10% model perturbation is used to initialize a stabilizing gain $\mathbf{K}$. We compare DeePO with $\mathbf{U}_w = \mathbf{I}$ against DeePO using the estimated disturbance covariance $\hat{\mathbf{U}}_w$ across all four input matrices, as shown in Figure 1.
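Steps 8–9 of Algorithm 2 (the ADAM-preconditioned step followed by the affine projection onto $\bar{\mathbf{X}}_{0,t+1}\mathbf{V} = \mathbf{I}$) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `adam_projected_step`, the default hyperparameters (the usual ADAM defaults), and the random stand-ins for the gradient and for $\bar{\mathbf{X}}_0$ are all hypothetical; `eta_eff` plays the role of the effective stepsize $\tilde{\eta}_t$ and `k` the role of $t - t_0 + 1$.

```python
import numpy as np


def adam_projected_step(V, grad, m, v, X0_bar, k, eta_eff,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM-style update of the lifted variable V, then affine projection.

    V, grad, m, v : arrays of shape (n + m_inputs, n)
    X0_bar        : array of shape (n, n + m_inputs), assumed full row rank
    k             : iteration counter (>= 1) used for bias correction
    """
    # Moment updates (25a)-(25b); all operations are elementwise.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias-corrected moments (25c)-(25d).
    m_hat = m / (1.0 - beta1 ** k)
    v_hat = v / (1.0 - beta2 ** k)
    # Preconditioned step (25e) with elementwise scaling D = sqrt(v_hat) + eps.
    V_tilde = V - eta_eff * m_hat / (np.sqrt(v_hat) + eps)
    # Affine projection (26): restore X0_bar @ V = I exactly.
    n = X0_bar.shape[0]
    residual = np.eye(n) - X0_bar @ V_tilde
    V_proj = V_tilde + X0_bar.T @ np.linalg.solve(X0_bar @ X0_bar.T, residual)
    return V_proj, m, v


# Synthetic example with n = 3 states and 3 inputs.
rng = np.random.default_rng(1)
X0_bar = rng.standard_normal((3, 6))
V0 = rng.standard_normal((6, 3))
grad0 = rng.standard_normal((6, 3))      # stand-in for the gradient of J at V0
V1, m1, v1 = adam_projected_step(V0, grad0, np.zeros((6, 3)),
                                 np.zeros((6, 3)), X0_bar, k=1, eta_eff=1e-2)
constraint_residual = np.linalg.norm(X0_bar @ V1 - np.eye(3))
```

Using `np.linalg.solve` instead of forming the inverse of $\bar{\mathbf{X}}_0\bar{\mathbf{X}}_0^\top$ explicitly is a numerical convenience; the result is the same projection as in (26).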
Although the optimal LQR gain $\mathbf{K}^\star$ is theoretically independent of $\mathbf{U}_w$, the surrogate gradient used by DeePO depends on the closed-loop covariance. Consequently, assuming $\mathbf{U}_w = \mathbf{I}$ distorts the effective cost landscape when disturbances are anisotropic. In contrast, using $\hat{\mathbf{U}}_w$ provides more accurate gradient directions and leads to smoother and more stable convergence. This advantage becomes increasingly pronounced from $\mathbf{B}_{w1}$ to $\mathbf{B}_{w4}$ as the disturbance coupling strength increases.

Figure 1: Effect of disturbance-covariance estimation on DeePO convergence.

5.1.2. Comparison with zeroth-order policy optimization (ZO-PO)
We further compare DeePO with classical zeroth-order policy optimization (ZO-PO) on the highly coupled case $\mathbf{B}_{w4}$, using the same initial stabilizing controller and disturbance level ($\sigma_w^2 = 0.01$). ZO-PO estimates gradients using two-point finite differences, whereas DeePO updates the policy directly from covariance information in closed-loop data. As shown in Figure 2(a), DeePO converges smoothly and rapidly, whereas ZO-PO converges more slowly and exhibits larger fluctuations due to noisy gradient estimates. The sample-efficiency comparison in Figure 2(b) further highlights this difference: each ZO-PO update requires roughly 3000 samples, whereas DeePO uses only one. Under the same sample budget, DeePO reaches near-optimal performance several orders of magnitude sooner, indicating substantially higher sample efficiency and stronger robustness to stochastic noise.

Figure 2: Comparison of convergence behavior and sample efficiency between DeePO and ZO-PO.

5.2. Industrial DHS in Northern China
To evaluate the proposed method in a realistic setting, we consider an industrial-park DHS in Northern China with three producers and eight loads [25], yielding an augmented system with $n = 22$ states and $m = 11$ inputs. The data are collected with a sampling interval of $\tau = 0.1$ s.
We apply Gaussian process noise $\mathbf{w}_t = 0.0042\,\mathbf{v}^{(w)}_t$ (kW) and exploration noise $\mathbf{w}_{e,t} = 0.1\,\mathbf{v}^{(e)}_t$ (kW), where $\mathbf{v}^{(w)}_t, \mathbf{v}^{(e)}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_{11})$ are independent. The former captures stochastic demand fluctuations and unmodeled thermal effects, while the latter provides excitation for online learning.

We compare GD-DeePO and ADAM-DeePO under model mismatch. The nominal system is
$$\mathbf{A}_{\text{nominal}} = \begin{bmatrix} \mathbf{I} - \tau\mathbf{A}_1 & \mathbf{0} \\ \mathbf{C} & \mathbf{I} \end{bmatrix},$$
whereas the perturbed real model is
$$\mathbf{A}_{\text{real}} = \begin{bmatrix} \mathbf{I} - (1 + \delta_A)\,\tau\mathbf{A}_1 & \mathbf{0} \\ \mathbf{C} & \mathbf{I} \end{bmatrix},$$
with $\delta_A$ denoting the percentage model mismatch. A positive $\delta_A$ corresponds to faster thermal dynamics, whereas a negative $\delta_A$ corresponds to slower thermal dynamics. The controller is initialized using the LQR solution of the nominal model, after which DeePO updates the feedback gain $\mathbf{K}$ online from closed-loop data. Both algorithms are run for 10,000 iterations.

Beyond static model mismatch, we also consider time-varying dynamics. Specifically, the thermal time-scale perturbation $\delta_A$ is augmented by a zero-mean bounded time-varying component $\tilde{\delta}_A(t)$, yielding the effective perturbation $\delta_A(1 + \tilde{\delta}_A(t))$. This emulates variations in operating conditions such as changes in mass flow rates and heat-transfer characteristics, and results in a linear time-varying system whose dynamics deviate persistently from the nominal model used for controller initialization.

5.2.1. Near-optimal and stable DHS operation
With a model mismatch of 20% and a constant heat demand of 10 MW, the final relative cost error $(C(\mathbf{K}_T) - C^\star)/C^\star$ achieved by ADAM-DeePO is as low as $7.543 \times 10^{-3}$, indicating near-optimal closed-loop operation with the learned feedback gain $\mathbf{K}$. This result shows that the proposed data-driven controller can maintain near-optimal operating performance in a high-dimensional industrial DHS without requiring an exact system model.

Figure 3: Temperature evolution under ADAM-DeePO.
Figure 4: Optimality error evolution under ADAM-DeePO.

Figure 5: Heat generation under ADAM-DeePO.

The closed-loop trajectories in Figures 3–5 further illustrate the operating behavior of the learned controller. The augmented formulation drives the optimality error $\mathbf{e}_t$ to zero, implying convergence to the economically optimal operating point. Meanwhile, the temperature state $\mathbf{T}_t$ and the heat-generation input $\mathbf{h}^G_t$ remain bounded and fluctuate only mildly around their steady-state values under stochastic disturbances and persistent excitation. These results show that the proposed method can simultaneously maintain stable temperature regulation and economically efficient heat allocation in a large-scale DHS under uncertainty.

5.2.2. Value of online data-driven control under model mismatch and stochastic disturbances
To illustrate the value of online data-driven control, we compare ADAM-DeePO with a nominal forecast-based MPC baseline under the same 20% model mismatch and the same stochastic heat-demand disturbance. The MPC controller is implemented in receding-horizon fashion using the nominal model, whereas the physical process evolves according to the mismatched model $\mathbf{A}_{\text{real}}$. Figures 6 and 7 show that nominal MPC yields smooth heat-generation trajectories but does not drive the optimality error to zero under model mismatch. At least one error component settles at a nonzero steady-state value, indicating convergence to a biased operating point rather than the economically optimal one. By contrast, Figures 4 and 5 show that ADAM-DeePO drives the optimality error close to zero while maintaining bounded heat-generation trajectories.
These results suggest that the proposed online data-driven controller is better able to recover the economically optimal operating point under model mismatch, because it updates the control policy directly from closed-loop data rather than relying solely on nominal predictions in real-time operation.

Figure 6: Heat generation trajectories under forecast-based nominal MPC with 20% model mismatch.

Figure 7: Optimality error evolution under forecast-based nominal MPC with 20% model mismatch.

5.2.3. Comparison of ADAM-DeePO and GD-DeePO
Tables 2 and 3 report the relative operating-cost errors under different levels of static model mismatch. Both GD-DeePO and ADAM-DeePO converge to near-optimal solutions, but ADAM-DeePO consistently achieves lower operating-cost error, with improvements (IMP) of up to 48.71%. This indicates that adaptive moment-based scaling improves online learning and leads to better closed-loop performance under model uncertainty.

Table 2
Relative operating-cost error under static model mismatch (moderate model variations).
Mismatch   -15%       15%        -20%       20%
GD         3.423e-2   1.673e-2   3.849e-2   1.470e-2
ADAM       2.695e-2   9.563e-3   3.118e-2   7.543e-3
IMP        21.27%     42.84%     18.99%     48.71%

Table 3
Relative operating-cost error under static model mismatch (mild model variations).
Mismatch   -1%        1%         -2%        2%
GD         2.468e-2   2.354e-2   2.527e-2   2.299e-2
ADAM       1.746e-2   1.633e-2   1.805e-2   1.579e-2
IMP        29.25%     30.63%     28.57%     31.32%

Table 4
Relative operating-cost error under time-varying thermal-model perturbations (static mismatch of 20%).
Time-varying magnitude   10%        30%        50%        80%
GD                       1.780e-3   3.302e-3   1.190e-2   3.795e-2
ADAM                     3.334e-4   1.857e-3   1.046e-2   3.651e-2
IMP                      81.27%     43.76%     12.10%     3.79%

5.2.4. Robustness to time-varying thermal dynamics
Comparing Tables 2 and 4, we observe that mild time-varying perturbations around a fixed 20% model mismatch improve the operating-cost performance of both GD-DeePO and ADAM-DeePO. Small zero-mean variations provide additional excitation, partially mitigating the bias introduced by static model mismatch and improving online adaptation. Under these conditions, ADAM-DeePO achieves better performance through more stable policy updates. As the perturbation magnitude increases, however, the effective dynamics move further away from the nominal linear model, reducing the benefit of adaptive gradient scaling. As a result, the performance gap between ADAM-DeePO and GD-DeePO becomes smaller at larger perturbation levels.

6. Conclusion
This paper develops a data-driven online control framework for DHSs by embedding steady-state economic optimality conditions into the system dynamics. Based on DeePO, the resulting controller enables online learning of near-optimal regulation policies under stochastic disturbances and model mismatch, while providing convergence to an optimal control policy together with closed-loop stability guarantees. An ADAM-enhanced variant is further developed to improve the performance of online policy updates. Simulations on an industrial-scale DHS show that the proposed method achieves stable near-optimal operation and strong empirical robustness to both static and time-varying model perturbations under different heat-demand disturbance conditions. These results suggest that the proposed data-driven online control is a promising approach for practical DHS operation, especially in large-scale systems where accurate models are difficult to maintain, disturbance forecasts may be unreliable, and strong closed-loop guarantees are desired.

CRediT authorship contribution statement
Xinyi Yi: Conceptualization, Methodology, Software, Validation, Formal analysis, Visualization, Writing – original draft, Writing – review & editing. Ioannis Lestas: Supervision, Formal analysis, Writing – review & editing.
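Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

A. Proof of Theorem 1
The Lagrangian for E1 is given by
$$\mathcal{L} = \tfrac{1}{2}\mathbf{h}_G^\top\mathbf{F}_G\mathbf{h}_G + \boldsymbol{\mu}^\top\left(-\mathbf{A}_1\mathbf{T} + \mathbf{B}_1\mathbf{h}_G - \mathbf{B}_2\bar{\mathbf{h}}_L\right),$$
where $\boldsymbol{\mu}$ is the dual variable associated with (3b). The KKT conditions yield
$$\frac{\partial\mathcal{L}}{\partial\mathbf{h}_G} = \mathbf{F}_G\mathbf{h}_G + \mathbf{B}_1^\top\boldsymbol{\mu} = \mathbf{0}, \qquad \frac{\partial\mathcal{L}}{\partial\mathbf{T}} = -\mathbf{A}_1^\top\boldsymbol{\mu} = \mathbf{0},$$
where the second condition forces $\boldsymbol{\mu}$ to be a scalar multiple of $\mathbf{1}$. The conditions can therefore be rewritten as $\mathbf{F}_G\mathbf{h}_G^\star$ being a scalar multiple of $\mathbf{1}$, so that $[\mathbf{F}_G\mathbf{h}_G^\star]_i - [\mathbf{F}_G\mathbf{h}_G^\star]_j = 0$ for any pair of producers $i, j$, establishing $\mathbf{F}_M\mathbf{h}_G^\star = \mathbf{0}$ as the optimality condition for the unique solution $\mathbf{h}_G^\star$.

The Lagrangian for E2 is
$$\mathcal{L}(\mathbf{T}, \alpha, \boldsymbol{\lambda}) = \tfrac{1}{2}\mathbf{T}^\top\mathbf{F}_D\mathbf{T} + \boldsymbol{\lambda}^\top\left(\mathbf{T} - \mathbf{A}_1^\dagger(\mathbf{B}_1\mathbf{h}_G^\star - \mathbf{B}_2\bar{\mathbf{h}}_L) - \alpha\mathbf{1}\right).$$
The KKT conditions for E2 are (4b) together with
$$\frac{\partial\mathcal{L}}{\partial\mathbf{T}} = \mathbf{F}_D\mathbf{T} + \boldsymbol{\lambda} = \mathbf{0}, \qquad \frac{\partial\mathcal{L}}{\partial\alpha} = -\boldsymbol{\lambda}^\top\mathbf{1} = 0,$$
which gives $\mathbf{1}^\top\mathbf{F}_D\mathbf{T}^\star = 0$. The optimality condition (4b) ensures that $\mathbf{T}^\star$ satisfies the optimality condition of E1. Given that $\mathbf{F}_D$ is positive definite, E2 is a convex optimization problem over $\mathbf{T}$, which confirms that $\mathbf{T}^\star$ is unique and that these optimality conditions are both necessary and sufficient.

B. Proof of Theorem 2
The proof follows the same differential argument as Lemma 2 in DeePO [29]. In particular, using
$$J(\mathbf{V}) = \mathrm{Tr}\left((\mathbf{Q} + \mathbf{V}^\top\bar{\mathbf{U}}_0^\top\mathbf{R}\bar{\mathbf{U}}_0\mathbf{V})\,\hat{\mathbf{U}}_w\right) = \mathrm{Tr}(\mathbf{P}_{\mathbf{V}}\hat{\mathbf{U}}_w),$$
the same recursive differentiation of $\mathbf{P}_{\mathbf{V}}$ as in the DeePO proof yields
$$\mathrm{d}J = 2\,\mathrm{Tr}\left(\hat{\mathbf{U}}_w\mathbf{E}_{\mathbf{V}}^\top\,\mathrm{d}\mathbf{V}\right),$$
where $\mathbf{E}_{\mathbf{V}} = (\bar{\mathbf{U}}_0^\top\mathbf{R}\bar{\mathbf{U}}_0 + \bar{\mathbf{X}}_1^\top\mathbf{P}_{\mathbf{V}}\bar{\mathbf{X}}_1)\mathbf{V}$. Hence,
$$\nabla_{\mathbf{V}}J(\mathbf{V}) = 2\,\mathbf{E}_{\mathbf{V}}\hat{\mathbf{U}}_w = 2\,(\bar{\mathbf{U}}_0^\top\mathbf{R}\bar{\mathbf{U}}_0 + \bar{\mathbf{X}}_1^\top\mathbf{P}_{\mathbf{V}}\bar{\mathbf{X}}_1)\mathbf{V}\hat{\mathbf{U}}_w.$$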
The linear constraint (17c) is enforced separately by projection in the algorithmic update and therefore does not appear explicitly in the gradient expression.

C. Proof of Lemma 2
Fix $\mathbf{V} \in \mathcal{S}$ and let $\mathbf{A}_{\mathbf{K}} := \bar{\mathbf{X}}_1\mathbf{V}$. By Assumption 5, $\rho(\mathbf{A}_{\mathbf{K}}) \le 1 - \gamma$ and $\|\mathbf{K}\|_2 \le \bar{k}$. The Lyapunov equation solution
$$\mathbf{P}_{\mathbf{K}} = \sum_{k=0}^{\infty}(\mathbf{A}_{\mathbf{K}}^\top)^k(\mathbf{Q} + \mathbf{K}^\top\mathbf{R}\mathbf{K})(\mathbf{A}_{\mathbf{K}})^k$$
implies
$$\|\mathbf{P}_{\mathbf{K}}\|_2 \le \sum_{k=0}^{\infty}\|\mathbf{A}_{\mathbf{K}}^k\|_2^2\left(\|\mathbf{Q}\|_2 + \|\mathbf{K}\|_2^2\|\mathbf{R}\|_2\right) \le \left(\sum_{k=0}^{\infty}\|\mathbf{A}_{\mathbf{K}}^k\|_2^2\right)\left(\|\mathbf{Q}\|_2 + \bar{k}^2\|\mathbf{R}\|_2\right).$$
Since $\rho(\mathbf{A}_{\mathbf{K}}) \le 1 - \gamma$ uniformly on $\mathcal{S}$, the series $\sum_{k \ge 0}\|\mathbf{A}_{\mathbf{K}}^k\|_2^2$ is uniformly bounded, hence $\|\mathbf{P}_{\mathbf{K}}\|_2 \le p_{\max,2}$ for all $\mathbf{V} \in \mathcal{S}$.

D. Proof of Theorem 3
By Assumption 7, we have $\|\mathbf{e}_{\text{noise}}\|_2 \le \bar{e}_{\text{noise}}$. Moreover, with the definition of $p_{\max}$ and Hölder's inequality for Schatten norms [23, (1.174)], it holds that
$$\left|C(\mathbf{K}_t) - \hat{C}(\mathbf{K}_t)\right| = \left|\mathrm{Tr}(\mathbf{P}_t\mathbf{e}_{\text{noise}})\right| \le \|\mathbf{P}_t\|_*\,\|\mathbf{e}_{\text{noise}}\|_2 \le p_{\max}\,\bar{e}_{\text{noise}}.$$
Similarly,
$$\left|\hat{C}^\star - C^\star\right| = \left|\mathrm{Tr}(\mathbf{P}^\star\mathbf{e}_{\text{noise}})\right| \le \|\mathbf{P}^\star\|_*\,\|\mathbf{e}_{\text{noise}}\|_2 \le p_{\max}\,\bar{e}_{\text{noise}}.$$
Thus,
$$\frac{1}{T}\sum_{t=t_0}^{t_0+T-1}\left(\left|C(\mathbf{K}_t) - \hat{C}(\mathbf{K}_t)\right| + \left|\hat{C}^\star - C^\star\right|\right) \le 2\,p_{\max}\,\bar{e}_{\text{noise}}.$$
Combined with (22) and (23), we obtain (24).

E. Proof of Theorem 4
The proof follows the same descent-based argument as the GD-DeePO regret analysis in Theorem 3, with the standard gradient step replaced by the ADAM-preconditioned update. By the local smoothness property of $J_t$ in DeePO [29], the one-step change of $J_t$ along the ADAM update direction satisfies
$$J_t(\mathbf{V}_{t+1}) - J_t(\mathbf{V}_t) \le -\tilde{\eta}_t\left\langle\nabla J_t(\mathbf{V}_t),\ \mathbf{D}_t^{-1}\odot\hat{\mathbf{m}}_t\right\rangle + \frac{\ell(J_t(\mathbf{V}_t))}{2}\,\tilde{\eta}_t^2\left\|\mathbf{D}_t^{-1}\odot\hat{\mathbf{m}}_t\right\|_F^2. \tag{28}$$
Under Assumptions 8–9, Lemma 22 in [31] implies that the preconditioned ADAM direction remains sufficiently aligned with the true gradient and has bounded magnitude relative to it. Hence, there exist constants $c_1, c_2 > 0$ such that
$$\left\langle\nabla J_t(\mathbf{V}_t),\ \mathbf{D}_t^{-1}\odot\hat{\mathbf{m}}_t\right\rangle \ge c_1\left\|\nabla J_t(\mathbf{V}_t)\right\|_F^2, \qquad \left\|\mathbf{D}_t^{-1}\odot\hat{\mathbf{m}}_t\right\|_F^2 \le c_2\left\|\nabla J_t(\mathbf{V}_t)\right\|_F^2. \tag{29}$$
Substituting (29) into (28) gives
$$J_t(\mathbf{V}_{t+1}) - J_t(\mathbf{V}_t) \le -\tilde{\eta}_t\left(c_1 - \frac{\ell(J_t(\mathbf{V}_t))}{2}\,c_2\,\tilde{\eta}_t\right)\left\|\nabla J_t(\mathbf{V}_t)\right\|_F^2. \tag{30}$$
By absorbing $c_1$ and $c_2$ into the projected gradient-dominance modulus and smoothness constant, (30) yields the ADAM analogue of the one-step descent inequality used in the GD-DeePO analysis. The remainder of the regret proof then follows exactly as in the proof of Theorem 3, together with the covariance-mismatch term $2\,p_{\max}\,\bar{e}_{\text{noise}}$. This gives (27).

Data availability
The data and code that support the findings of this study are available from the corresponding author upon reasonable request.

References
[1] Ahmed, S., Machado, J.E., Cucuzzella, M., Scherpen, J.M., 2023. Control-oriented modeling and passivity analysis of thermal dynamics in a multi-producer district heating system. IFAC-PapersOnLine 56, 175–180.
[2] Anderson, B.D., Moore, J.B., 2007. Optimal control: linear quadratic methods. Courier Corporation.
[3] Cholewa, T., Siuta-Olcha, A., Smolarz, A., Muryjas, P., Wolszczak, P., Guz, Ł., Bocian, M., Balaras, C.A., 2022. An easy and widely applicable forecast control for heating systems in existing and new buildings: First field experiences. Journal of Cleaner Production 352, 131605.
[4] Cui, L., Jiang, Z.P., Kolm, P.N., Macqueron, G.G., 2025. A fully data-driven value iteration for stochastic LQR: Convergence, robustness and stability. arXiv preprint arXiv:2505.02970.
[5] Dean, S., Mania, H., Matni, N., Recht, B., Tu, S., 2020. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics 20, 633–679.
[6] Frison, L., Gölzhäuser, S., Bitterling, M., Kramer, W., 2024. Evaluating different artificial neural network forecasting approaches for optimizing district heating network operation. Energy 307, 132745.
[7] de Giuli, L.B., La Bella, A., Scattolini, R., 2024. Physics-informed neural network modeling and predictive control of district heating systems. IEEE Transactions on Control Systems Technology 32, 1182–1195.
[8] Guo, C., Zhang, J., Yuan, H., Yuan, Y., Wang, H., Mei, N., 2024. Informer-based model predictive control framework considering group controlled hydraulic balance model to improve the precision of client heat load control in district heating system. Applied Energy 373, 123951.
[9] Hauswirth, A., He, Z., Bolognani, S., Hug, G., Dörfler, F., 2024. Optimization algorithms as robust feedback controllers. Annual Reviews in Control 57, 100941.
[10] Jansen, J., Jorissen, F., Helsen, L., 2024a. Effect of prediction uncertainties on the performance of a white-box model predictive controller for district heating networks. Energy and Buildings 319, 114520.
[11] Jansen, J., Jorissen, F., Helsen, L., 2024b. Mixed-integer non-linear model predictive control of district heating networks. Applied Energy 361, 122874.
[12] Kim, Y., Kim, Y., Kim, M., Cho, N., 2025. Neural policy iteration for stochastic optimal control: A physics-informed approach. arXiv preprint arXiv:2508.01718.
[13] Kingma, D.P., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[14] La Bella, A., Del Corno, A., 2023. Optimal management and data-based predictive control of district heating systems: The Novate Milanese experimental case-study. Control Engineering Practice 132, 105429.
[15] Liu, S., Guo, Y., Wagner, F., Liu, H., Cui, R.Y., Mauzerall, D.L., 2024. Diversifying heat sources in China's urban district heating systems will reduce risk of carbon lock-in. Nature Energy 9, 1021–1031.
[16] Machado, J.E., Ferguson, J., Cucuzzella, M., Scherpen, J.M., 2022. Decentralized temperature and storage volume control in multiproducer district heating. IEEE Control Systems Letters 7, 413–418.
[17] Mohammadi, H., Zare, A., Soltanolkotabi, M., Jovanović, M.R., 2021. Convergence and sample complexity of gradient methods for the model-free linear–quadratic regulator problem. IEEE Transactions on Automatic Control 67, 2435–2450.
[18] Qin, X., Lestas, I., 2024. Frequency control and power sharing in combined heat and power networks, in: 2024 IEEE 63rd Conference on Decision and Control (CDC), IEEE. pp. 5771–5776.
[19] Saloux, E., Candanedo, J.A., 2021. Model-based predictive control to minimize primary energy use in a solar district heating system with seasonal thermal energy storage. Applied Energy 291, 116840.
[20] Sherman, J., Morrison, W.J., 1950. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics 21, 124–127.
[21] Simonsson, J., 2021. Towards efficient modeling and simulation of district energy systems. Ph.D. thesis. Luleå University of Technology.
[22] Tu, S.L., 2019. Sample complexity bounds for the linear quadratic regulator. Ph.D. thesis. University of California, Berkeley.
[23] Watrous, J., 2018. The theory of quantum information. Cambridge University Press.
[24] Wei, Z., Tien, P.W., Calautit, J., Darkwa, J., Worall, M., Boukhanouf, R., 2024. Investigation of a model predictive control (MPC) strategy for seasonal thermochemical energy storage systems in district heating networks. Applied Energy 376, 124164.
[25] Yi, X., Guo, Y., Sun, H., Qin, X., Wu, Q., 2023. Energy-grade double pricing for combined heat and power systems. IEEE Transactions on Power Systems.
[26] Yi, X., Lestas, I., 2025. Optimal energy-sharing and temperature regulation in district heating systems. IFAC-PapersOnLine 59, 67–72.
[27] Yuan, C., Lin, X., 2025. Graph-temporal convolutional network for steam heating network simulation considering dynamic characteristics. Energy, 137567.
[28] Zhang, Z., Zhou, X., Du, H., Cui, P., 2023. A new model predictive control approach integrating physical and data-driven modelling for improved energy performance of district heating substations. Energy and Buildings 301, 113688.
[29] Zhao, F., Dörfler, F., Chiuso, A., You, K., 2025. Data-enabled policy optimization for direct adaptive learning of the LQR. IEEE Transactions on Automatic Control.
[30] Zhou, K., Doyle, J.C., Glover, K., et al., 1996. Robust and optimal control. volume 40. Prentice Hall, New Jersey.
[31] Zou, F., Shen, L., Jie, Z., Zhang, W., Liu, W., 2019. A sufficient condition for convergences of Adam and RMSProp, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11127–11135.
