Physics-Guided Transformer (PGT): Physics-Aware Attention Mechanism for PINNs

Ph ysics-Guided T ransformer (PGT): Ph ysics-A w are A tten tion Mec hanism for PINNs Ehsan Zeraatk ar* , Ro dion P o dorozhn y , Jelena T ešić Computer Scienc e, T exas State University, San Mar c os, T exas, US Abstract Reconstructing con tin uous ph ysical ﬁelds from sparse, irregular observ ations is a fundamen tal c hallenge in scien tiﬁc machine learning, particularly for nonlinear systems go verned b y partial diﬀeren tial equations (PDEs). Dominan t physics-informed approac hes enforce go v erning equations as soft p enalt y terms during optimization, a strategy that often leads to gradien t im balance, instability , and degraded physical consistency when measuremen ts are scarce. Here we in troduce the Ph ysics-Guided T ransformer (PGT) , a neural arc hitecture that mo v es b eyond residual regularization by embedding physical structure directly into the self-atten tion mec hanism. Sp eciﬁcally , PGT incorp orates a heat-k ernel–derived additiv e bias into attention logits, endo wing the encoder with an inductive bias consistent with diﬀusion ph ysics and temporal causality . Query co ordinates attend to these ph ysics-conditioned context tokens, and the resulting features drive a FiLM-mo dulated sinusoidal implicit deco der that adaptively con trols spectral response based on the inferred global context. W e ev aluate PGT on tw o canonical b enchmark systems spanning diﬀusi on-dominated and con vection-dominated regimes: the one-dimensional heat equation and the tw o-dimensional incompressible Na vier–Stokes equations. In 1D sparse reconstruction with as few as 100 observ ations, PGT attains a relative L 2 error of 5 . 9 × 10 − 3 , representing a 38-fold reduction ov er ph ysics-informed neural net works and more than 90-fold reduction ov er sinusoidal implicit representations. In the 2D cylinder-wak e problem reconstructed from 1500 scattered spatiotemporal samples, PGT uniquely ac hieves strong p erformance on b oth axes of ev aluation: a gov erning-equation residual of 8 . 3 × 10 − 4 — on par with the b est residual-based metho ds — alongside a comp etitive o verall relativ e L 2 error of 0.034, substantially b elo w all metho ds that achiev e com- parable physical consistency . No individual baseline sim ultaneously satisﬁes these dual criteria. Conv ergence analysis further rev eals sustained, monotonic error reduction in PGT, in contrast to the early optimization plateaus observ ed in residual-based approaches. These ﬁndings demonstrate that structural incorp oration of physical priors at the representational level, rather than solely as an external loss p enalt y , substan tially impro ves b oth optimization stabilit y and physical coherence under data-scarce conditions. Ph ysics-guided atten tion pro vides a principled and extensible mechanism for reliable reconstruction of nonlinear dynamical systems go verned by partial diﬀerential equations. Keywor ds: ph ysics-informed neural netw orks, implicit neural representations, transformer architecture, ph ysics-guided attention, sparse reconstruction, Navier–Stok es equations, scientiﬁc machine learning, partial diﬀeren tial equations 1. In tro duction Scien tiﬁc systems go verned by partial diﬀerential equations (PDEs) describ e a wide range of natural and engineered phenomena, from heat diﬀusion and ﬂuid transp ort to climate dynamics and material de- formation. A ccurately solving these equations is central to adv ancing predictiv e science and engineering. Ho wev er, traditional numerical solvers — ﬁnite diﬀerence, ﬁnite v olume, or sp ectral metho ds [17, 6] — of- ten require ﬁnely discretized spatiotemp oral grids to maintain stability and accuracy , leading to prohibitive computational costs for mo deling high-dimensional or multiscale systems. This motiv ates the growing ﬁeld Email addr ess: ehsanzeraatkar@txstate.edu (Ehsan Zeraatk ar*) of Scientiﬁc Machine L e arning (SciML) [14, 4], which seeks to learn surrogate mo dels that embed ph ysical kno wledge into data-driven neural architectures, enabling eﬃcient y et ph ysically c onsisten t approximations of PDE-go verned pro cesses. Among the early and inﬂuential developmen ts in SciML are Physics-Informe d Neur al Networks (PINNs) [25], whic h enforce PDE constraints as soft penalties within the loss function. Although conceptually elegant, PINNs exhibit well-kno wn challenges: gradien t pathologies in stiﬀ or multiscale regimes, multiscale con- v ergence in high dimensions, and limited ability to represent oscillatory or high-frequency comp onents. T o o vercome these issues, sev eral op erator-learning frameworks, suc h as F ourier Neur al Op er ators (FNOs) [19], De epONets , and Galerkin T r ansformers [7], hav e b een prop osed to learn mappings b etw een function spaces rather than on discrete ﬁelds. Despite their success, these models often rely on purely sp ectral priors or dense F ourier con volutions, lacking explicit aw areness of underlying physical causality or PDE structure. In parallel, the T r ansformer arc hitecture [29] has rev olutionized sequence and vision mo deling by cap- turing long-range dep endencies through self-atten tion. Recent w orks ha ve adapted T ransformers to physical systems, demonstrating their p oten tial for spatiotemp oral forecasting and op erator learning. Y et, conv en- tional T ransformers are data-driven but not physics-driven — their attention w eigh ts are learned solely from data, without constraints that enforce ph ysical consistency , suc h as causality in time, lo calit y in diﬀusion, or conserv ation la ws. As a result, these mo dels ma y achiev e high predictive accuracy while violating funda- men tal PDE dynamics, particularly when training data are sparse or partially observed. T o bridge this gap, we introduce the Physics-Guided T ransformer (PGT) , a uniﬁed arc hitecture that couples T ransformer-based context mo deling with explicit ph ysical priors deriv ed from PDE theory . PGT em b eds the heat-kernel Green’s function [11] as an additive bias within the self-attention mechanism, enabling the mo del to resp ect the causal and diﬀusive structure of parab olic PDEs. Con textual patches extracted from lo w-resolution data are enco ded through this physics-guided attention, pro ducing a latent representation that captures b oth lo cal in teractions and physically meaningful dependencies. Query co ordinates ( x, t ) then attend to these enco ded context tokens to generate physics-c onditione d fe atur es , which are decoded by a FiLM-mo dulate d SIREN —an implicit neural representation that adaptiv ely adjusts its frequency resp onse based on the learned con text. Unlik e purely data-driven T ransformers or ph ysics-agnostic INRs, PGT in tegrates ph ysical reasoning directly in to the attention kernel while retaining the ﬂexibilit y of neural implicit represen tations. The resulting mo del can infer contin uous spatiotemporal ﬁelds, satisfy gov erning PDEs via auto diﬀeren tiation, and integrate m ultiple sources of sup ervision, including high-resolution data, low-resolution av erages, and b oundary or initial constraints. The k ey contributions of this work are summarized as follows: • Physics-guided attention: W e form ulate an additive attention bias derived from the heat-kernel Green’s function, in tro ducing an inductive bias consistent with diﬀusion physics and temp oral causality . • Context-conditioned implicit deco ding: W e design a FiLM-mo dulated SIREN deco der that adap- tiv ely con trols sp ectral bias via frequency gating, enabling accurate reconstruction of high-frequency details. • Uniﬁed physics–data training framew ork: W e propose a comp osite uncertain t y-weigh ted loss com bining PDE residuals, b oundary/initial conditions, and data-ﬁdelity terms. • Demonstration on canonical b enc hmarks: Through extensive exp eriments on 1D heat diﬀusion and 2D incompressible Navier–Stok es problems, PGT achiev es comp etitive reconstruction accuracy alongside markedly reduced gov erning-equation residuals compared to PINNs, FNOs, and T ransformer- based baselines. By integrating physical inductive biases into the T ransformer attention mechanism, PGT mov es b eyond blac k-b ox learning tow ard interpretable, generalizable, and computationally eﬃcien t scientiﬁc mo dels. The framew ork is broadly extensible to other parab olic and h yp erb olic PDEs, laying the foundation for scalable ph ysics-aw are neural op erators across scientiﬁc domains. 2 2. Related W ork Scien tiﬁc mac hine learning has emerged as a framework for in tegrating data-driv en mo dels with gov erning ph ysical la ws [14, 4, 24]. A cen tral ob jective is to reconstruct or predict solutions of partial diﬀeren tial equations (PDEs) from limited observ ations while preserving physical consistency . Early eﬀorts focused on embedding diﬀerential constrain ts directly into neural netw ork training, leading to the developmen t of Ph ysics-Informed Neural Netw orks (PINNs) [25, 26]. PINNs incorp orate PDE residuals as soft p enalties in the loss function and hav e been s uccessfully applied to diﬀusion, ﬂuid dynamics, elasticity , and m ultiphysics systems [14, 21]. How ev er, optimization instabilit y , gradient im balance, and sp ectral bias often limit their p erformance under sparse or noisy sup ervision [16, 30]. T o address representational limitations, implicit neural represen tations (INRs) ha ve b een explored for mo deling contin uous ph ysical ﬁelds [27, 28, 22]. Sinusoidal activ ation functions [27] and F ourier feature em b eddings [28] mitigate sp ectral bias and enable improv ed reconstruction of high-frequency comp onents. INR-based approac hes hav e b een extended to scien tiﬁc computing tasks, including PDE solution appro xi- mation and sup er-resolution of physical ﬁelds [18]. Despite their expressiveness, pure INRs typically lack explicit ph ysical constraints unless combined with residual regularization. P arallel dev elopments in op erator learning aim to learn mappings b et ween function spaces rather than discrete solutions. DeepONet [20] and its physics-informed v ariants [13] approximate nonlinear op erators from data, while F ourier Neural Op erators (FNO) [19] leverage sp ectral con volution to mo del global interac- tions eﬃcien tly . These operator-based mo dels scale fa vorably and ha v e demonstrated strong p erformance in parametric PDE settings [15]. Ho wev er, they often require extensive training data and may exhibit reduced robustness in sparse-measuremen t regimes. T ransformer arc hitectures ha v e recen tly b een in troduced into scientiﬁc machine learning to capture long- range dep endencies in spatiotemp oral systems [29, 1]. PINNsF ormer [32] extends classical PINNs by incor- p orating self-attention mechanisms to enhance global feature mo deling. T ransformer-based PDE solvers and neural op erators hav e also b een prop osed for ﬂuid simulation and spatiotemp oral forecasting [32]. While these approac hes impro ve long-range interaction mo deling, they typically enforce ph ysics through residual p enalties rather than embedding physical structure directly into the attention mechanism. Recen t works ha ve explored incorp orating T ransformer architectures into ph ysics-informed learning, in- cluding T ransformer-based PINNs and attention-based neural op erators. In most of these approac hes, the atten tion mechanism itself remains purely data-driv en: p ositional information is t ypically enco ded via learned relativ e p osition biases or sin usoidal enco dings, while physical laws are enforced only through additional loss terms, suc h as PDE residual p enalties. In contrast, the prop osed Physics-Guided T ransformer (PGT) embeds ph ysics directly into the attention computation through a kernel-based bias term Γ . Rather than represen ting a generic distance-dep endent bias as in standard relative p osition enco dings, Γ is derived from PDE theory , sp eciﬁcally the heat k ernel (Green’s function) of the diﬀusion op erator. This formulation encodes b oth tem- p oral causalit y and the spatial diﬀusion structure of the underlying physical pro cess. Consequently , PGT shifts the role of physics from an external regularization term to the attention logits themselves, thereb y shaping how contextual information propagates b etw een tokens b efore the softmax op eration. This design fundamen tally diﬀers from prior T ransformer-based PINN formulations b y allowing the attention mechanism to follo w physically meaningful interaction patterns rather than learning them solely from data. Recen t studies highlight the imp ortance of architectural inductive biases in scien tiﬁc learning [2, 23]. Graph neural netw orks and message-passing frameworks enco de conserv ation laws and lo calit y constrain ts in to model structure [2]. Similarly , symmetry-preserving and equiv ariant netw orks incorp orate ph ysical priors at the representational level [9, 12]. These works collectively suggest that embedding physics into arc hitecture, rather than solely into the ob jective function, may improv e generalization and optimization stabilit y . Sparse reconstruction of ﬂuid ﬂows presen ts additional challenges due to nonlinear conv ection and pres- sure–v elo city coupling [3, 10]. Classical c ompressed sensing and reduced-order mo deling metho ds hav e b een widely studied [5, 31], but they often rely on linear subspace assumptions. Data-driven neural approaches pro vide greater ﬂexibility but must reconcile data ﬁdelity with physical constraints. Existing approac hes largely fall into tw o categories: residual-based physics enforcemen t (e.g., PINNs and PINNsF ormer) [32] and op erator-learning framew orks (e.g., DeepONet and FNO) [12, 20]. While these metho ds improv e either physical regularization or global mo deling capacit y , they t ypically treat gov erning 3 equations as external constraints added to the loss function. In contrast, Ph ysics-Guided T ransformers (PGT) integrate physical structure directly into the attention mechanism itself through a physics-guided bias term. This arc hitectural integration shifts physics from a p enalt y-based regularizer to an in trinsic comp onen t of represen tational in teractions. By coupling physics-guided attention with adaptive implicit deco ding, PGT uniﬁes insights from PINNs, neural op erators, INRs, and T ransformer architectures while explicitly addressing optimization im balance and physical inconsistency under sparse sup ervision. 3. Metho dology 3.1. Pr oblem F ormulation W e consider a time-dep enden t physical system gov erned by a partial diﬀerential equation (PDE) F  u ( x, t ) , ∇ x u ( x, t ) , ∂ t u ( x, t ); θ p  = f ( x, t ) , ( x, t ) ∈ Ω × [0 , T ] , (1) where u ( x, t ) denotes the physical state v ariable of in terest, F is a diﬀerential op erator parameterized by ph ysical co eﬃcients θ p , and f ( x, t ) is a known source or forcing term. The ob jective of the Physics-Guided T ransformer (PGT) is to learn a con tinuous mapping u Θ : ( x, t ) 7→ u ( x, t ) , (2) parameterized by Θ , such that the predicted solution satisﬁes the go v erning PDE while remaining consistent with av ailable observ ations. By mo deling the solution as an implicit function of contin uous co ordinates, PGT enables prediction at arbitrary spatial and temp oral lo cations. 3.2. Overview of the PGT Ar chite ctur e PGT combines a ph ysics-guided T ransformer enco der with an implicit neural representation deco der. Giv en a set of sparse or coarse observ ations, the encoder constructs a laten t token represen tation that captures b oth lo cal measurements and global system structure. F or an y query co ordinate ( x, t ) , the mo del retriev es relev an t con textual information through a cross-attention mechanism and conditions an implicit deco der to pro duce the solution v alue at that co ordinate. Figure 1 illustrates the ov erall architecture. 3.3. Physics-Guide d T r ansformer Enc o der Let { ( u i , x i , t i ) } P i =1 denote the av ailable spatiotemp oral observ ations. Each observ ation is em bedded into a laten t context token by linearly pro jecting b oth the observed v alue and its co ordinate, c i = W u u i + W p [ x i , t i ] + b , (3) where W u and W p are learnable pro jection matrices. In addition to these context tokens, a learnable global tok en c (0) glob is prep ended to the sequence. The resulting token matrix is C (0) = [ c (0) glob , c 1 , . . . , c P ] ∈ R ( P +1) × d model . (4) A stack of L ph ysics-guided T ransformer blo cks processes the token matrix. In each block, self-atten tion is mo diﬁed by an additive physics-based bias Γ , A ttn ( Q , K , V ) = softmax  QK ⊤ √ d k + Γ  V , (5) where the bias matrix Γ enco des known physical relations b etw een spatiotemp oral lo cations. F or diﬀusion- t yp e systems, Γ is derived from the logarithm of the heat k ernel and enforces forward temp oral inﬂuence. The output of eac h blo ck is up dated using resi dual connections and a feed-forward netw ork, C ( l +1) = C ( l ) + Attn ( LN ( C ( l ) )) + MLP ( LN ( C ( l ) )) . (6) After L lay ers, the ﬁnal token matrix C ( L ) is split by index into a global tok en c glob and context tokens C ctx . 4 3.4. Physics Guide d Bias Γ The physics-guided bias Γ is constructed from the Green’s function (fundamen tal solution) of the gov ern- ing PDE in Eq. (1). F or a pair of context tokens lo cated at spatiotemp oral co ordinates ( x i , t i ) and ( x j , t j ) , the bias en try is deﬁned as Γ ij = log G ( x i − x j , t i − t j ; θ p ) , (7) where G is the Green’s function of the diﬀeren tial operator F and θ p are the physical parameters in tro duced in Eq. (1). T aking the logarithm maps the multiplicativ e k ernel structure of G to an additiv e logit bias, consisten t with the pre-softmax additive form in Eq. (5). En tries for which G = 0 — suc h as future tokens in causal problems or oﬀ-c haracteristic tokens in advection-dominated systems — are set to Γ ij = −∞ , so they receiv e exactly zero attention weigh t after the softmax op eration. This formulation is general across PDE families. F or p ar ab olic systems such as the heat equation ( ∂ t u = α ∇ 2 u ) , the Green’s function is the Gaussian heat kernel, giving Γ ij = − ∥ x i − x j ∥ 2 4 α ∆ t ij − d 2 log(4 π α ∆ t ij ) , ∆ t ij = t i − t j > 0 , (8) where d is the spatial dimension and α > 0 is the physical diﬀusivit y (e.g. thermal conductivity for the heat equation, kinematic viscosity ν for linearized viscous ﬂow). The bias decays quadratically with spatial distance and logarithmically with elapsed time, encoding b oth diﬀusive lo calit y and strict temp oral causalit y . The eﬀective spatial inﬂuence radius scales as σ = p 2 α ∆ t ij , matching the diﬀusion length of the underlying PDE. F or hyp erb olic systems (e.g. the wa ve equation ∂ tt u = c 2 ∇ 2 u ), G has compact supp ort on the ligh t cone, so Γ ij is ﬁnite only within the causal wa v efront ∥ x i − x j ∥ ≤ c ∆ t ij , imp osing ﬁnite-sp eed propagation directly in the attention computation. F or el liptic problems (e.g. Poisson or Laplace equations), G dep ends only on spatial separation and no temp oral causal mask is applied. In e ac h case, the physical parameters θ p — diﬀusivity , viscosit y , w av e sp eed, or b oundary geometry — enter Γ directly from the problem sp eciﬁcation, in tro ducing no additional architectural h yperparameters b eyond those already presen t in the go verning equation. In the limiting case where θ p driv es the kernel tow ard a uniform distribution (e.g. α → ∞ for diﬀusion), Γ approac hes a constant matrix and its eﬀect on the softmax v anishes, recov ering a standard data-driven T ransformer. Conv ersely , as α → 0 , the Gaussian narrows tow ard a Dirac delta, restricting eac h token to attend only to its immediate spatial neighbor. Because the v anilla T ransformer is reco vered exactly in the former limit, PGT strictly contains standard atten tion as a sp ecial case: ph ysics-guided atten tion is a con tinuously tunable inductiv e bias that reduces to purely data-driven attention when physical information is absen t, and progressiv ely imposes PDE-consistent structure as the go verning parameters mov e tow ard the diﬀusion-dominated regime. 3.5. Query Conditioning via Cr oss-Attention T o ev aluate the solution at a query coordinate q = ( x, t ) , the co ordinate is ﬁrst mapp ed to a laten t query em b edding using a small multila y er p erceptron, ϕ ( q ) = MLP q ( x, t ) . (9) The query em b edding attends to the enco ded con text tokens through cross-attention, g ( q ) = softmax  ( W q ϕ )( W k C ctx ) ⊤ √ d k  ( W v C ctx ) , (10) pro ducing a query-speciﬁc con text v ector that summarizes the most relev ant observ ations for the giv en lo cation. 3.6. FiLM-Mo dulate d Implicit De c o der The con tinuous solution is reconstructed using an implicit neural represen tation implemen ted as a si- n usoidal represen tation net work (SIREN). The deco der takes the raw query co ordinates ( x, t ) as input and computes h 1 = sin( ω 0 ( W 0 [ x, t ] + b 0 )) . (11) 5 T o adapt the deco der to lo cal and global physical con text, F eature-wise Linear Mo dulation (FiLM) is applied at eac h la yer. A h yp ernet work conditioned on the query-sp eciﬁc context g ( q ) and the global tok en c glob generates mo dulation parameters, ( α l , β l , ω l ) = H ([ g ( q ) , c glob ]) , (12) whic h control amplitude, bias, and frequency at lay er l . Eac h hidden lay er is computed as h l +1 = sin ( ω l ⊙ ( α l ⊙ ( W l h l ) + β l )) . (13) The ﬁnal prediction is obtained through a linear readout, u Θ ( x, t ) = W out h L + b out . (14) The o verall PGT architecture is illustrated in Figure 1. Figure 1: Overview of the PGT architecture. The physics-guided T ransformer enco der pro cesses sparse observ ations into latent context tok ens, whic h are then used to condition the FiLM-mo dulated SIREN decoder for contin uous ﬁeld reconstruction. 3.7. T r aining Obje ctive PGT is trained by minimizing a comp osite loss that enforces b oth data ﬁdelity and physical consistency across sources of sup ervision: observed data, PDE residuals, b oundary conditions, and initial conditions. The data loss p enalizes discrepancies b etw een predictions and av ailable observ ations at the sampled spatiotemp oral lo cations, L data = 1 N d N d X i =1 ∥ u Θ ( x i , t i ) − u obs i ∥ 2 2 . (15) Ph ysical consistency is enforced through the PDE residual loss , ev aluated at N r randomly sampled collo- cation p oin ts, L PDE = 1 N r N r X j =1 ∥F ( u Θ )( x j , t j ) − f ( x j , t j ) ∥ 2 2 , (16) 6 where F is the diﬀerential op erator deﬁned in Eq. (1) and f is the known forcing term. The b oundary condition loss and initial condition loss enforce prescrib ed constraints on ∂ Ω and at t = 0 , resp ectiv ely , L BC = 1 N b N b X k =1 ∥ u Θ ( x k , t k ) − u bc k ∥ 2 2 , L IC = 1 N 0 N 0 X k =1 ∥ u Θ ( x k , 0) − u ic k ∥ 2 2 . (17) These four terms are com bined into the total training ob jective using uncertaint y-based weigh ting [8], L = 1 2 σ 2 data L data + 1 2 σ 2 PDE L PDE + 1 2 σ 2 BC L BC + 1 2 σ 2 IC L IC . (18) Here σ data , σ PDE , σ BC , σ IC > 0 are learnable scalar uncertain ty parameters, one p er loss term. Eac h w eight 1 / 2 σ 2 k ) is inv ersely prop ortional to the task-sp eciﬁc noise v ariance σ 2 k : if a sup ervision signal is noisy or inconsistent, the mo del learns a larger σ k , automatically do wn-weigh ting that term. All four σ k are initialized to 1 and optimize d join tly with the net work parameters Θ via the same gradient-based up date step. The form ulation eliminates the need for manual loss-w eight tuning and allo ws PGT to adapt its sup ervision balance automatically as training progresses. 3.8. Gener ality of Physics-Guide d Attention The physics-guided atten tion formulation in Eq. (5) is inten tionally general. The bias matrix Γ ma y enco de diﬀusion kernels, transp ort directionalit y , or other problem-sp eciﬁc relational priors. By incorp orating suc h a structure directly into the attention logits, PGT embeds domain knowledge into the T ransformer enco der while retaining the ﬂexibility and scalability of mo dern attention-based architectures. 4. Exp erimen tal Results W e ev aluate the prop osed Physics-Guided T ransformer (PGT) on tw o canonical PDE-gov erned systems: (1) sparse scattered-data reconstruction for the 1D heat diﬀusion equation, and (2) reconstruction of 2D v elo city and pressure ﬁelds gov erned by the incompressible Navier–Stok es equations. Across b oth tasks, PGT is compared with state-of-the-art baselines, including FNO, PINN, PI-DeepONet, PINNsF ormer, SIREN, and WIRE, under matc hed sparse-sampling budgets. 4.1. 1D He at Equation: Sp arse R e c onstruction Analysis W e ﬁrst ev aluate PGT on the 1D heat equation: ∂ t u − ν ∂ xx u = 0 , x ∈ [0 , 1] , t ∈ [0 , 1] , (19) with sin usoidal initial conditions u ( x, 0) = sin( nπ x ) , u ( x, t ) = e − ν ( nπ ) 2 t sin( nπ x ) . (20) The task is sparse reconstruction: given only M randomly sampled spatiotemp oral observ ations, the mo del must reconstruct the full solution ﬁeld. W e v ary the num ber of sparse samples ( M = 100 , 200 , 500 ) to study robustness under diﬀeren t sup ervision levels. T able 1 reports the data loss, PDE residual loss, relativ e L 2 error, computational cost (FLOPs), training time, and parameter count for SIREN, PINN, and PGT. PGT achiev es dramatically low er data error and relativ e L 2 error across all diﬀerent observ ation data p oints, M . F or example, at M = 100 , PGT reduces the relativ e L 2 error to 5 . 90 × 10 − 3 , compared to 2 . 26 × 10 − 1 for PINN and 5 . 40 × 10 − 1 for SIREN. The PGT error reduction is approximately a 38 × improv emen t o ver PINN and nearly 90 × improv emen t o v er SIREN . Ev en as the n umber of sparse p oin ts increases to M = 500 , PGT maintains relativ e errors on the order of 10 − 3 , while PINN and SIREN remain t wo orders of magnitude higher. Although PGT’s PDE residual loss remains around 6 . 6 × 10 − 2 , its signiﬁcan tly smaller data loss indicates sup erior global ﬁeld reconstruction accuracy . SIREN exhibits relatively high PDE residuals due to the absence of explicit physics enforcement. PINN reduces the PDE residual compared to SIREN, but struggles to ac hieve comparable reconstruction ﬁdelity under sparse sup ervision. 7 T able 1: Comparison of SIREN, PINN, and PGT models across diﬀerent v alues of M . Model Loss FLOPs T rain time Param Data PDE Rel L 2 (G) SIREN M = 100 0.063 0.18 0.54 6.5 126 s 26,419 M = 200 0.045 0.14 0.45 6.8 127 s M = 500 0.025 0.18 0.34 7.5 135 s PINN M = 100 0.026 0.12 0.226 1.69 65 s 66,500 M = 200 0.024 0.19 0.332 1.75 65.9 s M = 500 0.021 0.13 0.313 1.94 66.8 s PGT (ours) M = 100 0.000076 0.066 0.0059 116 9.5 min 4.05E+08 M = 200 0.000029 0.066 0.0028 132 13 min M = 500 0.000017 0.067 0.0026 190 27.5 min As the num b er of sparse observ ations increases, all mo dels b eneﬁt from additional sup ervision. Ho w ever, the improv emen t for PGT is substantially more stable and consistent. Its relative L 2 error decreases from 5 . 90 × 10 − 3 at M = 100 to 2 . 60 × 10 − 3 at M = 500 , demonstrating robustness even in low-data regimes. In con trast, PINN and SIREN show less consistent trends and remain signiﬁcantly less accurate. The improv ed accuracy of PGT comes at a higher computational cost. While PINN requires approxi- mately 1 . 7 – 1 . 9 GFLOPs and ab out 65 seconds of training time, PGT requires 116 – 190 GFLOPs and up to 1 . 67 × 10 3 seconds of training time. The parameter count is also substan tially larger for PGT. The trade-oﬀ highligh ts that PGT prioritizes reconstruction ﬁdelit y and ph ysics-guided generalization ov er light weigh t deplo yment. Figure 2: T raining error conv ergence (relativ e L 2 ) for PINN, SIREN, and PGT on the 1D heat diﬀusion sparse reconstruction task ( M = 100 observ ations). PGT exhibits sustained monotonic decay , whereas PINN and SIREN plateau at muc h higher error levels. Figure 2 illustrates the evolution of the reconstruction error during training. Both SIREN and PINN sho w rapid initial error reduction, follow ed b y early stagnation at relatively high error levels around 10 − 3 . In contrast, PGT exhibits contin uous, sustained error decay throughout training, ultimately reac hing errors on the order of 10 − 7 . Thus, the PGT av oids the optimization plateaus observed in the baseline metho ds and con verges to a signiﬁcantly more accurate solution. T o further analyze the training dynamics, Figure 3 (loss comp onen t contribution plot) sho ws the relativ e con tributions of the data, PDE, b oundary-condition (BC), and initial-condition (IC) losses within PGT. In the early stages of training, the IC loss contributes substantially , reﬂecting the mo del’s initial eﬀort to satisfy the prescribed initial state. As training progresses, the relative contribution of the PDE loss increases, indicating a gradual shift tow ard enforcing the go verning ph ysical dynamics across the domain. Notably , the contribution of the data loss remains comparatively stable throughout training. The balanced evolution suggests that PGT eﬀectively co ordinates data ﬁtting and physics enforcement, progressively reﬁning the 8 solution while main taining consistency with b oth observ ations and gov erning equations. (a) (b) Figure 3: (a) Error components breakdown. (b) Error comp onents con tribution analysis. Ov erall, the 1D exp erimen t demonstrates that incorp orating physics-guided attention and adaptive fre- quency mo dulation signiﬁcan tly enhances sparse reconstruction accuracy . The controlled setting v alidates the eﬀectiv eness of PGT b efore extending the comparison to more complex 2D nonlinear systems. 4.2. 2D Navier–Stokes R e c onstruction Under Sp arse Me asur ements W e next ev aluate PGT on the tw o-dimensional incompressible Navier–Stok es equations gov erning viscous ﬂo w: ∂ u ∂ t + u ∂ u ∂ x + v ∂ u ∂ y = − ∂ p ∂ x + ν ∇ 2 u, (21) ∂ v ∂ t + u ∂ v ∂ x + v ∂ v ∂ y = − ∂ p ∂ y + ν ∇ 2 v , (22) ∂ u ∂ x + ∂ v ∂ y = 0 , (23) where ( u, v ) denote the v elo cit y comp onen ts, p is the pressure ﬁeld, and ν = 1 / Re is the kinematic viscosit y . W e consider the canonical cylinder wak e dataset and p erform sparse reconstruction from N train = 1500 randomly sampled spatiotemp oral p oin ts. All mo dels are ev aluated on a full-resolution spatial snapshot at a ﬁxed time index. T able 2 rep orts the ﬁnal quan titative comparison across all seven metho ds, including total relative L 2 error, v ariable-wise errors for u , v , and p , the PDE residual, and computational cost. T able 2: Sparse reconstruction p erformance for the 2D Na vier–Stokes cylinder wak e at a ﬁxed time snapshot ( N train = 1500 ). Best results are shown in b old . Arc hitecture family groups metho ds: operator-learning (FNO, PI-DeepONet), T ransformer- based (PINNsF ormer), implicit representation (SIREN, WIRE), classical PINN, and the proposed PGT. Model Params (M) FLOPs (G) T rain Time Rel- L 2 Rel- L 2 ( u ) Rel- L 2 ( v ) Rel- L 2 ( p ) PDE Residual T rain Error FNO 2.360 0.092 4.41 s 0.710 0.2200 1.4900 0.770 4 . 70 × 10 − 2 4 . 20 × 10 − 4 PI-DeepONet 0.215 176.12 36 s 0.095 0.0600 0.2300 0.210 4 . 20 × 10 − 1 3 . 90 × 10 − 3 PINNsF ormer 0.545 13.200 296 min 0.080 0.0400 0.0900 0.110 8 . 30 × 10 − 2 8 . 30 × 10 − 4 SIREN 0.264 1.310 2.5 min 0.690 0.3100 1.0000 0.770 3 . 90 × 10 − 4 5 . 50 × 10 − 2 WIRE 0.528 5.200 6.3 min 0.018 0.0041 0.0350 0.014 5 . 10 × 10 − 1 1 . 54 × 10 − 4 PINN 0.336 1.600 2.6 min 0.110 0.0400 0.1400 0.160 8 . 40 × 10 − 4 1 . 19 × 10 − 3 PGT (ours) 5.630 32.860 47 min 0.034 0.0160 0.0410 0.046 8 . 30 × 10 − 4 6 . 50 × 10 − 5 The results reveal imp ortant diﬀerences across mo del families. Among the ope rator-learning metho ds, FNO ac hieves the lo w est computational cost but yields the highest reconstruction error, with a relativ e L 2 error of 1.49 for the v ertical-velocity comp onent, indicating a near-complete failure to capture v ortex- shedding dynamics under sparse sampling. PI-DeepONet substan tially impro ves reconstruction accuracy , yet 9 its PDE residual remains large ( 4 . 2 × 10 − 1 ), indicating that the predicted ﬁelds do not satisfy the gov erning momen tum equations with adequate precision. Among implicit represen tation baselines, SIREN achiev es a low PDE residual ( 3 . 9 × 10 − 4 ). Stil l, it exhibits very p o or ﬁeld reconstruction accuracy , with a relative L 2 error of 0.69 and a vertical v elo city error of 1.00, reﬂecting the absence of explicit physics enforcement in the reconstruction pro cess. WIRE presents an opp osing b eha viour: it attains the low est ov erall relativ e L 2 error (0.018). It excels on the horizon tal v elo city ( 0 . 0041 ) and pressure ( 0 . 014 ) comp onents. Y et, its PDE residual ( 5 . 1 × 10 − 1 ) is the highest among all metho ds, revealing that accurate data ﬁtting do es not guarantee satisfaction of the gov erning equations. The classical PINN achiev es a PDE residual of 8 . 4 × 10 − 4 , comp etitive with PGT, but its reconstruction accuracy (o verall relative L 2 = 0 . 11 ) is substantially inferior, particularly for the vertical velocity and pressure ﬁelds. Figure 4: Qualitative comparison b etw een ground truth and PGT reconstruction for the 2D Navier–Stok es problem. Rows correspond to u , v , and p ﬁelds. Columns sho w ground truth, PGT prediction, and absolute error. PGT achiev es the strongest sim ultaneous balance b et ween reconstruction ﬁdelit y and ph ysical consistency across all ev aluated metho ds. Its ov erall relative L 2 error of 0.034 is surpassed only by WIRE (0.018). Y et, PGT deliv ers a PDE residual of 8 . 3 × 10 − 4 — comparable to PINN and SIREN in residual magnitude, but accompanied by signiﬁcantly sup erior reconstruction accuracy . V ariable-wise errors remain consistently low across u ( 0 . 016 ), v ( 0 . 041 ), and p ( 0 . 046 ), indicating balanced coupling b et ween the velocity and pressure comp onen ts, a property not ac hieved by any individual baseline. The combination — competitive data ﬁdelit y with rigorous physical consistency — distinguishes PGT from all comparators and is precisely the regime that matters for scien tiﬁc reconstruction tasks. Figure 4 illustrates the qualitative reconstruction of u , v , and p . The predicted ﬁelds faithfully repro duce the dominant v ortex shedding structures and pressure gradien ts presen t in the ground truth. Absolute error maps conﬁrm that residuals are primarily conﬁned to regions of strong nonlinear interaction and high v orticity , while the global ﬂow top ology is accurately preserved throughout the domain. The training conv ergence b ehavior is shown in Figure 5. PGT exhibits stable, monotonically decreasing error throughout optimization. In contrast, several baselines undergo rapid initial decay follo wed by early stagnation, a hallmark of optimization imbalance under sparse sup ervision. PGT’s sustained con vergence re- ﬂects the stabilizing eﬀect of embedding physics directly within the attention mechanism, whic h contin uously biases the mo del tow ard physically plausible solutions rather than relying on comp eting loss terms. Collectiv ely , the 2D results demonstrate that PGT uniquely addresses the fundamen tal tension b etw een 10 data ﬁdelity and physical consistency that aﬄicts all baseline approac hes: pure data-ﬁtting methods (WIRE, SIREN) ac hieve low reconstruction error at the exp ense of PDE compliance, while residual-based metho ds (PINN, PINNsF ormer) enforce physics at the cost of reconstruction accuracy . By em b edding physical pri- ors architecturally rather than as an external penalty , PGT sidesteps this trade-oﬀ and ac hieves strong p erformance on b oth axes simultaneously . Figure 5: T raining error conv ergence for PGT and baseline metho ds on the 2D Na vier–Stokes problem. 4.3. A blation Study T o rigorously attribute PGT’s p erformance to its individual design choices, we conduct a comprehensive ablation study on the 2D Na vier–Stokes cylinder wak e task ( N train = 1500 ). The study prob es three inde- p enden t axes of the arc hitecture: (i) the ph ysics-guided attention bias Γ and the PDE residual loss L PDE , whic h together constitute the ph ysics-in tegration mec hanisms in the enco der and training ob jective; and (ii) the deco der design, which isolates the con tribution of sinusoidal activ ations and FiLM-based con text conditioning within the implicit reconstruction head. All sev en conﬁgurations share the same encoder depth, em b edding dimension, optimizer, and training budget, and all retain the data loss L data throughout. Results are summarized in T able 3. T able 3: Ablation study on the 2D Navier–Stok es cylinder wak e ( N train = 1500 ). A ttn. bias Γ : heat-k ernel additive bias in self-attention logits (disabled in pure-transformer v ariants). PDE loss L PDE : momen tum and contin uit y residual terms. De c o der : FiLM-SIREN is the full deco der; SIREN (no FiLM) uses a plain SIREN with context concatenated at the input; MLP replaces sin usoidal activ ations with GELU; FiLM-MLP retains FiLM conditioning but uses GELU activ ations. ✓ activ e; ◦ disabled. Data loss is active in all rows. Best results in b old . V ariant Decoder A ttn. bias Γ PDE loss L PDE Reconstruction Error ↓ PDE Residual ↓ PGT (full) FiLM-SIREN ✓ ✓ 6 . 50 × 10 − 5 8 . 30 × 10 − 4 No PDE loss FiLM-SIREN ✓ ◦ 8 . 80 × 10 − 5 2 . 90 × 10 − 3 No attention bias FiLM-SIREN ◦ ✓ 3 . 00 × 10 − 4 1 . 30 × 10 − 2 No physics (data only) FiLM-SIREN ◦ ◦ 3 . 60 × 10 − 4 3 . 50 × 10 − 2 SIREN, no FiLM SIREN ✓ ✓ 1 . 83 × 10 − 4 1 . 10 × 10 − 3 FiLM-MLP FiLM-MLP ✓ ✓ 1 . 21 × 10 − 4 1 . 05 × 10 − 3 Plain MLP MLP ✓ ✓ 3 . 12 × 10 − 4 1 . 42 × 10 − 3 The results rev eal three distinct and complementary insights into the architectural design of PGT. The physics-guide d attention bias Γ is the dominant driver of r e c onstruction ac cur acy.. Comparing v arian ts that diﬀer only in whether Γ is activ e, its eﬀect is consistent and substantial. Without the ph ysics loss (“no PDE loss” vs. “no ph ysics”), activ ating Γ reduces the reconstruction error by approximately 4 × — from 3 . 60 × 10 − 4 to 8 . 80 × 10 − 5 — with no change to the training ob jectiv e. This gain arises purely from the structural inductive bias in tro duced at the attention lev el: by enco ding the heat-k ernel Green’s function into the atten tion logits, Γ directs the enco der to aggregate con text tokens according to physically meaningful 11 spatiotemp oral pro ximit y , enabling more coheren t ﬁeld reconstruction ev en when no explicit PDE p enalt y is applied. The same pattern holds when the physics loss is active (“no attention bias” vs. “PGT full”): enabling Γ reduces the reconstruction error from 3 . 00 × 10 − 4 to 6 . 50 × 10 − 5 , a further 4 . 6 × improv emen t. Across b oth conditions, Γ is the single most impactful design choice for reconstruction ﬁdelity . The PDE loss L PDE is essential for governing-e quation c omplianc e.. When Γ is active but L PDE is withheld (“no PDE loss”), the mo del achiev es strong reconstruction accuracy y et yields a PDE residual of 2 . 90 × 10 − 3 — approximately 3 . 5 × larger than the full mo del. Introducing L PDE reduces the residual to 8 . 30 × 10 − 4 , conﬁrming that explicit residual sup ervision at collocation p oin ts is necessary to enforce momentum conserv ation and incompressibility throughout the domain. This eﬀect also app ears in the absence of Γ : adding the ph ysics loss (“no atten tion bias” vs. “no ph ysics”) reduces the residual from 3 . 50 × 10 − 2 to 1 . 30 × 10 − 2 , a 2 . 7 × reduction. How ever, the residual achiev ed b y L PDE alone without Γ ( 1 . 30 × 10 − 2 ) remains more than an order of magnitude larger than that of the full mo del ( 8 . 30 × 10 − 4 ), conﬁrming that the tw o mechanisms are c omplemen tary rather than substitutable. Structural physics priors shap e the represen tation. Explicit residual sup ervision enforces p oint wise compliance with the PDE. FiLM c onditioning and sinusoidal activations b oth c ontribute to de c o der quality, and their c ombination is ne c essary for optimal p erformanc e.. The bottom three rows of T able 3 isolate the deco der design while holding the enco der ( Γ active) and training ob jective ( L PDE activ e) ﬁxed. R emoving FiLM while ke eping sinusoidal activations (SIREN, no FiLM) raises the reconstruction error from 6 . 50 × 10 − 5 to 1 . 83 × 10 − 4 , a degradation of 2 . 8 × , and mo destly worsens the PDE residual to 1 . 10 × 10 − 3 . Without FiLM, the deco der cannot adapt its frequency resp onse to the lo cal physical con text inferred by the cross-attention step. Instead, it receives the entire context v ector only at the ﬁrst lay er, forcing a single ﬁxed set of sinusoidal frequencies to represent all query lo cations equally . The resulting mo del still b eneﬁts from the p eriodic activ ation’s sp ectral prop erties but loses the p er-query mo dulation that allows PGT to resolv e ﬁne vortex-shedding structures and pressure gradients with v arying spatial complexity . R eplacing sinusoidal activations with GELU while r etaining FiLM (FiLM-MLP) yields a reconstruction error of 1 . 21 × 10 − 4 and a PDE residual of 1 . 05 × 10 − 3 . This is noticeably b etter than the SIREN-no-FiLM v arian t, indicating that deep, m ultilay er context conditioning contributes more to reconstruction quality than the choice of activ ation function alone. Nevertheless, the gap with the full FiLM-SIR EN mo del ( 6 . 50 × 10 − 5 ) conﬁrms that GELU activ ations are insuﬃcient to represent the smooth, oscillatory solution ﬁelds t ypical of conv ection-dominated ﬂows. The absence of perio dic activ ations prev ents the deco der from eﬃciently enco ding the sp ectral con tent present in the cylinder w ake — a limitation that FiLM conditioning alone cannot o vercome. R emoving b oth FiLM and sinusoidal activations (plain MLP) pro duces the weak est deco der: reconstruc- tion error climbs to 3 . 12 × 10 − 4 and the PDE residual reaches 1 . 42 × 10 − 3 . The reconstruction error of this v arian t approaches that of the no-atten tion-bias v ariant ( 3 . 00 × 10 − 4 ), suggesting that a sub optimal decoder can negate the represen tational b eneﬁt of physics-guided attention in the enco der. T ogether, thes e deco der results establish a clear p erformance hierarch y: FiLM-SIREN > FiLM-MLP > SIREN (no FiLM) > plain MLP . The hierarch y reveals that FiLM conditioning is the more critical of the t wo mechanisms: Its remo v al costs 2 . 8 × in reconstruction accuracy , whereas swapping sinusoidal activ ations for GELU cost only 1 . 9 × . Y et the combination uniquely achiev es the lo west error, conﬁrming that b oth prop erties — p eriodic sp ectral represen tation and adaptive con text-driven mo dulation — are indep enden tly b eneﬁcial and mutually reinforcing. The ful l mo del uniquely achieves str ong p erformanc e on b oth evaluation axes simultane ously.. Across all seven conﬁgurations, only the full PGT v ariant simultaneously attains the low est reconstruction error ( 6 . 50 × 10 − 5 ) and the lo west PDE residual ( 8 . 30 × 10 − 4 ). Disabling Γ primarily degrades reconstruction accuracy; disabling L PDE primarily degrades PDE compliance; degrading the deco der hurts b oth metrics in prop ortion to the sev erity of the simpliﬁcation. The four-w ay ablation traces a clear Pareto fron t in the accuracy–consistency space: no simpliﬁed v arian t achiev es the Pareto fron tier of the full mo del. These ﬁndings pro vide mechanistic v alidation of ev ery ma jor design c hoice in PGT — ph ysics-guided attention, explicit PDE sup ervision, sin usoidal implicit deco ding, and FiLM-based context conditioning — and conﬁrm that all four are necessary comp onen ts of the architecture. 12 4.4. R obustness to Me asur ement Noise Sensor noise is una voidable in exp erimental ﬂow measurement. T o assess its impact, w e corrupt the u and v context observ ations with additiv e Gaussian noise at six relativ e levels, u noisy i = u i + ε i , ε i ∼ N (0 , σ 2 ) , σ = η · std( u train ) , for η ∈ { 0 , 0 . 01 , 0 . 02 , 0 . 05 , 0 . 10 , 0 . 20 } (clean to 20% of signal std). W e compare standard PGT (MSE data loss) against a v arian t, PGT-UW, that replaces the data loss with a heteroscedastic uncertaint y- w eighted negative log-likelihoo d, L UW data = 1 N d N d X i =1 " log σ 2 i +   u Θ ( x i , t i ) − u obs i   2 σ 2 i # , (24) where σ 2 i is a per-token aleatoric v ariance predicted by a small auxiliary head on the cross-atten tion output. All other comp onen ts are identical b et ween the tw o v arian ts. Results are rep orted in T able 4 and Figure 6. T able 4: Noise robustness on the 2D Navier–Stok es cylinder wak e ( σ = η · std( u train ) ). PGT uses a standard MSE data loss; PGT-UW uses a heteroscedastic uncertain ty-w eigh ted loss. Best p er ro w in b old . A vg. Rel- ℓ 2 ( u, v ) ↓ PDE Residual ↓ η PGT PGT-UW PGT PGT-UW 0.00 0.0496 0.0410 1.61e-03 4.55e-03 0.01 0.0634 0.0861 1.99e-03 6.80e-03 0.02 0.0406 0.1054 1.86e-03 7.78e-03 0.05 0.0508 0.1102 1.49e-03 7.63e-03 0.10 0.0512 0.1551 1.92e-03 7.74e-03 0.20 0.0607 0.0400 1.82e-03 4.64e-03 Standard PGT is r emarkably stable : the av erage Rel- ℓ 2 error for u and v remains within [0 . 040 , 0 . 064] across all noise levels, and the PDE residual stays in the narrow band [1 . 5 , 2 . 0] × 10 − 3 throughout. This robustness is a direct consequence of the ph ysics-guided attention bias Γ : by con tinuously anc horing in ternal represen tations to the heat-kernel Green’s function, the architectural prior preven ts noise from propagating in to the reconstructed ﬁelds, acting as an implicit regularizer indep endent of the data loss. PGT-UW, by contrast, exhibits a non-monotone tra jectory: error rises steeply for intermediate noise lev els ( η = 0 . 01 – 0 . 10 ), p eaking at 0 . 155 b efore partially recov ering at η = 0 . 20 . PDE residuals follo w the same pattern, remaining 3 – 4 × higher than standard PGT across all in termediate lev els. The heteroscedastic head in tro duces optimisation complexit y that the ﬁxed training budget cannot fully resolve; at very high noise, the large predicted v ariances eﬀectively suppress the data loss, allowing the physics loss to comp ensate. These ﬁndings indicate that when a strong architectural ph ysics prior is present, an explicit aleatoric uncertaint y mec hanism oﬀers limited beneﬁt and ma y destabilise training. Standard PGT is therefore the recommended conﬁguration for noisy measuremen t scenarios. 5. Discussion This study demonstrates that embedding physical structure directly in to the neural architecture, rather than solely as an external loss p enalty , can substantially improv e the reconstruction of nonlinear dynamical systems from sparse observ ations. Across both diﬀusion-dominated and conv ection-dominated regimes, PGT ac hieved strong physical consistency while main taining competitive reconstruction accuracy—a com bination that none of the individual baseline metho ds achiev ed simultaneously . The expanded 2D Navier–Stok es comparison reveals a clear structural dichotom y among existing meth- o ds. Pure data-ﬁtting approaches (WIRE, SIREN) achiev e low reconstruction error but exhibit large PDE residuals, conﬁrming that accurate interpolation of sparse observ ations do es not guarantee satisfaction of the underlying gov erning equations. Con versely , residual-based metho ds (PINN, PINNsF ormer) reduce the PDE residual at the cost of reconstruction accuracy , particularly for the vertical velocity and pressure ﬁelds where the eﬀects of pressure–velocity coupling are most pronounced. Op erator-learning metho ds (FNO, PI-DeepONet) o ccupy a middle ground but fail to excel on either axis under the sparse measurement budget considered here. 13 Figure 6: Reconstruction error (left) and PDE residual (right) as a function of noise lev el η for PGT and PGT-UW on the 2D Navier–Stok es cylinder-wak e problem. PGT sidesteps this trade-oﬀ b y in tegrating ph ysical priors at the represen tational lev el. The heat- k ernel–derived atten tion bias con tinuously steers the model to ward ph ysically plausible solutions during optimization, without comp eting with data-ﬁdelity terms as p enalt y-weigh ted losses do. This architectural constrain t app ears to stabilize training and promote globally coheren t solutions, esp ecially in nonlinear ﬂow regimes where pressure–v elo city coupling is critical. The result is a mo del that sim ultaneously ac hieves reconstruction accuracy approac hing that of the best data-ﬁtting baseline (WIRE) and a PDE residual comparable to that of the b est residual-based baseline (PINN). This outcome is structurally diﬃcult to ac hieve through loss reweigh ting alone. The ablation study provides direct mechanistic evidence for this claim. Removing the physics-guided atten tion bias Γ while retaining the PDE loss degrades reconstruction error by more than an order of mag- nitude, conﬁrming that the architectural bias — not the loss term — is the primary mechanism resp onsible for accurate ﬁeld recov ery . Conv ersely , removing the PDE loss while retaining Γ lea ves reconstruction ac- curacy largely intact but causes the PDE residual to increase nearly sevenfold, demonstrating that explicit residual supervision remains necessary for gov erning-equation compliance. These t w o mec hanisms are there- fore non-redundant: Γ shap es the latent representation to ward physically coheren t solutions, while L PDE enforces point wise satisfaction of the diﬀerential constraints. Their com bination uniquely ac hieves strong p erformance on b oth ev aluation axes. PGT is deliberately more exp ensiv e than ligh t weigh t baselines suc h as PINN, SIREN, and FNO, as the quadratic self-attention ov er context tokens and the FiLM hypernetw orks are precisely the mechanisms that enable global propagation of sparse observ ations and adaptiv e spectral deco ding — capabilities that c heap er architectures forgo, at the cost of the large reconstruction and PDE-residual errors do cumen ted in T ables 1 and 2. View ed against ph ysics-aw are metho ds of comparable am bition, how ever, PGT is comp etitive: PINNsF ormer trains substan tially longer y et yields higher reconstruction error, and PI-DeepONet consumes far greater FLOPs while pro ducing a PDE residual orders of magnitude larger. The curren t cost is therefore an engineering constrain t of the protot yp e rather than a fundamental limit of physics-guided atten tion. The spatial lo cality of Γ — which decays rapidly b eyond a physically determined radius — motiv ates sparse or hierarc hical atten tion that would reduce complexity from O ( P 2 ) tow ard O ( P log P ) ; low-rank factorisation of the Gaussian bias matrix and mixed-precision training oﬀer further reductions with no c hange to the underlying ph ysics prior. It is worth noting that sparse reconstruction diﬀers fundamen tally from sup er-resolution: whereas sup er- resolution op erates on a dense lo w-resolution grid with uniform spatial co verage, sparse reconstruction m ust reco ver contin uous ﬁelds from scattered, unstructured samples, leaving large p ortions of the domain en tirely unobserv ed. This more ill-p osed setting makes architectural inductive biases and physics-based constraints corresp ondingly more critical, precisely the regime that PGT targets. More broadly , these ﬁndings suggest that future scien tiﬁc mac hine learning mo dels ma y b eneﬁt from em- b edding gov erning principles at the representational lev el rather than relying solely on residual regularization. 14 Extending this framewor k to higher-dimensional, m ulti-physics, and turbulen t regimes — and quantifying its behaviour under v arying Reynolds num bers and noise levels — remains an imp ortan t direction for future researc h. 6. Conclusion W e hav e in tro duced a Physics-Guided T ransformer (PGT) for reconstructing partial diﬀerential equa- tion–go verned systems from sparse observ ations. By embedding physical structure directly in to the atten- tion mechanism and coupling it with an adaptive implicit decoder, the prop osed framework mov es b eyond residual-only ph ysics enforcement. Instead, it incorp orates gov erning principles at the architectural level. A cross diﬀusion and nonlinear ﬂow problems, PGT achiev ed solutions that were b oth numerically accu- rate and strongly consistent with the underlying equations. A controlled ablation study further conﬁrmed that the physics-guided attention bias Γ and the PDE residual loss L PDE op erate through distinct and complemen tary path wa ys: Γ is the primary driv er of reconstruction accuracy , while L PDE is essential for go verning-equation compliance. Only their com bination simultaneously minimizes b oth ob jectives, pro viding mec hanistic v alidation of the architectural design choices. While the approach incurs a higher computational cost than light w eigh t op erator-learning metho ds, it oﬀers a meaningful trade-oﬀ b etw een eﬃciency and physical ﬁdelity . More broadly , this w ork p oints tow ard a shift in scien tiﬁc machine learning: from treating gov erning equations as external constraints to incorp orating them as intrinsic comp onents of model design. Extending suc h ph ysics-guided attention mechanisms to higher-dimensional, multiscale, and multiph ysics systems ma y provide a path to reliable and interpretable data-driv en mo deling in science and engineering. References [1] Alexey Dosovitskiy , L.B., et. al., 2021. An image is worth 16x16 words: T ransformers for image recog- nition at scale. URL: , . [2] Brandstetter, J., W orrall, D., W elling, M., 2023. Message passing neural pde solvers. URL: https: //arxiv.org/abs/2202.03376 , arXiv:2202.03376 . [3] Brun ton, B.W., Brunton, S.L., Pro ctor, J.L., Kutz, J.N., 2016. Sparse sensor placement optimization for classiﬁcation. SIAM Journal on Applied Mathematics 76, 2099–2122. doi: 10.1137/15M1036713 . [4] Brun ton, S.L., Noack, B.R., K oumoutsakos, P ., 2020. Machine learning for ﬂuid mec hanics. Ann ual Review of Fluid Mec hanics 52, 477–508. [5] Candès, E.J., Romberg, J., T ao, T., 2006. Robust uncertaint y principles: Exact signal reconstruction from highly incomplete frequency information. IEEE T ransactions on Information Theory 52, 489–509. [6] Can uto, C., Quarteroni, A., 2017. Sp ectral Methods. John Wiley and Sons, Ltd. c hapter 5. pp. 1–16. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/ 9781119176817.ecm2003m , doi: https://doi.org/10.1002/9781119176817.ecm2003m , arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781119176817.ecm2003m . [7] Cao, S., 2021. Cho ose a transformer: F ourier or galerkin. . [8] Cip olla, R., Gal, Y., Kendall, A., 2018. Multi-task learning using uncertaint y to weigh losses for scene geometry and semantics, in: Pro ceedings of the IEEE Conference on Computer Vision and P attern, pp. 7482–7491. doi: 10.1109/CVPR.2018.00781 . [9] Cohen, T.S., W elling, M., 2016. Group equiv ariant conv olutional netw orks. . [10] Eric hson, N.B., Mathelin, L., Y ao, Z., Brunton, S.L., Mahoney , M.W., Kutz, J.N., 2020. Shallo w neural netw orks for ﬂuid ﬂow reconstruction with limited sensors. Pro- ceedings of the Roy al So ciety A: Mathematical, Physical and Engineering Sciences 476, 20200097. URL: https://doi.org/10.1098/rspa.2020.0097 , doi: 10.1098/rspa.2020.0097 , arXiv:https://royalsocietypublishing.org/rspa/article-pdf/doi/10.1098/rspa.2020.0097/637169/rspa.202 0.0097.pdf . 15 [11] Ev ans, L.C., 2010. Partial Diﬀerential Equations. volume 19 of Gr aduate Studies in Mathematics . 2 ed., American Mathematical So ciet y , Providence, RI. [12] Finzi, M., Stan ton, S., Izmailov, P ., Wilson, A.G., 2020. Generalizing conv olutional neural netw orks for e quiv ariance to lie groups on arbitrary contin uous data, in: Pro ceedings of the 37th International Conference on Machine Learning, PMLR. pp. 3165–3176. URL: https://proceedings.mlr.press/ v119/finzi20a.html . [13] Gosw ami, S., Bora, A., Y u, Y., Karniadakis, G.E., 2022. Physics-informed deep neural op erator net- w orks. URL: , . [14] Karniadakis, G.E., Kevrekidis, I.G., Lu, L., P erdik aris, P ., W ang, S., Y ang, L., 2021. Physics-informed mac hine learning. Nature Reviews Ph ysics 3, 422–440. [15] K ov achki, N., Li, Z., Liu, B., Azizzadenesheli, K., Bhattachary a, K., Stuart, A., Anandkumar, A., 2023. Neural op erator: learning maps b et ween function spaces with applications to pdes. The Journal of Mac hine Learning Research 24. [16] Krishnapriy an, A., Gholami, A., Zhe, S., Kirby , R., Mahoney , M.W., 202 1. Characterizing possible failure modes in ph ysics-informed neural net w orks, in: Adv ances in Neural Information Pro cessing Systems, Curran Asso ciates, Inc.. pp. 26548–26560. [17] Le V eque, R.J., 2007. Finite Diﬀerence Methods for Ordinary and P artial Diﬀer- en tial Equations. So ciet y for Industrial and Applied Mathematics. URL: https: //epubs.siam.org/doi/abs/10.1137/1.9780898717839 , doi: 10.1137/1.9780898717839 , arXiv:https://epubs.siam.org/doi/pdf/10.1137/1.9780898717839 . [18] Li, Z., K o v ac hki, N., Azizzadenesheli, K., Liu, B., Bhattachary a, K., Stuart, A., Anandkumar, A., 2020a. Neural operator: Graph k ernel net w ork for partial diﬀeren tial equations. URL: https://arxiv. org/abs/2003.03485 , arXiv:2003.03485 . [19] Li, Z.Y., Ko v achki, N.B., Azizzadenesheli, K., Liu, B., Bhattac harya, K., Stuart, A.M., Anandkumar, A., 2020b. F ourier neural op erator for parametric partial diﬀerential equations. ArXiv abs/2010.08895. URL: https://api.semanticscholar.org/CorpusID:224705257 . [20] Lu, L., Jin, P ., Pang, G., Zhang, Z., Karniadakis, G.E., 2021a. Learning nonlinear operators via deep onet based on the universal appro ximation theorem of op erators. Nature Machine Intelligence 3, 218–229. [21] Lu, L., Meng, X., Mao, Z., Karniadakis, G.E., 2021b. Deep xde: A deep learning li- brary for solving diﬀeren tial equations. SIAM Review 63, 208–228. doi: 10.1137/19M1274067 , arXiv:https://doi.org/10.1137/19M1274067 . [22] Mildenhall, B., Sriniv asan, P .P ., T ancik, M., Barron, J.T., Ramamo orthi, R., Ng, R., 2020. Nerf: Represen ting scenes as neural radiance ﬁelds for view synthesis, in: V edaldi, A., Bischof, H., Brox, T., F rahm, J.M. (Eds.), Computer Vision – ECCV 2020, Springer, Cham. pp. 405–421. [23] Pfaﬀ, T., F ortunato, M., Sanc hez-Gonzalez, A., Battaglia, P .W., 2021. Learning mesh-based sim ulation with graph net works. URL: , . [24] Rac k auc k as, C., Ma, Y., Martensen, J., W arner, C., Zub ov, K., Supek ar, R., Skinner, D., Ramadhan, A., Edelman, A., 2021. Universal diﬀerential equations for scientiﬁc machine learning. URL: https: //arxiv.org/abs/2001.04385 , arXiv:2001.04385 . [25] Raissi, M., Perdik aris, P ., Karniadakis, G., 2019. Ph ysics-informed neural netw orks: A deep learning framew ork for solving forw ard and inv erse problems inv olving nonlinear partial diﬀerential equations. Journal of Computational Ph ysics 378, 686–707. URL: https://www.sciencedirect.com/science/ article/pii/S0021999118307125 , d oi: https://doi.org/10.1016/j.jcp.2018.10.045 . 16 [26] Sirignano, J., Spiliop oulos, K., 2018. Dgm: A deep learning algorithm for solving partial diﬀeren tial equations. Journal of Computational Physics 375, 1339–1364. URL: https://www.sciencedirect.com/ science/article/pii/S0021999118305527 , doi: https://doi.org/10.1016/j.jcp.2018.08.029 . [27] Sitzmann, V., Martel, J.N.P ., Bergman, A.W., Lindell, D.B., W etzstein, G., 2020. Implicit neural represen tations with perio dic activ ation functions, in: Pro ceedings of the 34th In ternational Conference on Neural Information Pro cessing Systems, Curran Asso ciates Inc., Red Ho ok, NY, USA. pp. 7462–7473. [28] T ancik, M., Sriniv asan, P .P ., Mildenhall, B., F ridovic h-Keil, S., Raghav an, N., Singhal, U., Ramamo or- thi, R., Barron, J.T., Ng, R., 2020. F ourier features let net works learn high frequency functions in lo w dimensional domains. . [29] V asw ani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I., 2023. Atten tion is all you need. URL: , . [30] W ang, S., T eng, Y., Perdik aris, P ., 2021. Understanding and mitigating gradient pathologies in ph ysics- informed neural net works. SIAM Journal on Scientiﬁc Computing 43, A3055–A3081. [31] Willco x, K., P eraire, J., 2012. Balanced mo del reduction via the prop er orthogonal decomposition. AIAA Journal 40, 2323–2330. doi: doi.org/10.2514/2.1570 . [32] Zhao, Z., Ding, X., Prak ash, B.A., 2024. Pinnsformer: A transformer-based framew ork for ph ysics- informed neural net works. URL: , . 17

Physics-Guided Transformer (PGT): Physics-Aware Attention Mechanism for PINNs

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment