Standard Transformers Achieve the Minimax Rate in Nonparametric Regression with $C^{s,λ}$ Targets


Authors: Yanming Lai, Defeng Sun

Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong, China
* Corresponding author: Yanming Lai (yanming.lai@polyu.edu.hk)

Abstract

The tremendous success of Transformer models in fields such as large language models and computer vision necessitates a rigorous theoretical investigation. To the best of our knowledge, this paper is the first work proving that standard Transformers can approximate Hölder functions in $C^{s,\lambda}\left([0,1]^{d\times n}\right)$ ($s \in \mathbb{N}_{\geq 0}$, $0 < \lambda \leq 1$) under the $L^t$ distance ($t \in [1,\infty]$) with arbitrary precision. Building upon this approximation result, we demonstrate that standard Transformers achieve the minimax optimal rate in nonparametric regression for Hölder target functions. It is worth mentioning that, by introducing two metrics, the size tuple and the dimension vector, we provide a fine-grained characterization of Transformer structures, which facilitates future research on the generalization and optimization errors of Transformers with different structures. As intermediate results, we also derive upper bounds for the Lipschitz constant of standard Transformers and their memorization capacity, which may be of independent interest. These findings provide theoretical justification for the powerful capabilities of Transformer models.

1 Introduction

1.1 Background

In recent years, the tremendous success of machine learning in various applications has sparked theoretical research in this field. As an important model in machine learning, neural networks are a key area of study. One of the reasons for the powerful success of neural networks is their strong expressive capability; in other words, neural networks can approximate a wide range of function classes with arbitrary precision. As the most fundamental and simplest neural network model, feedforward neural networks (FNNs) have received the most extensive and mature investigation in this area. This includes studies on both shallow and deep architectures, different activation functions, and the approximation of various function classes; see [57, 58, 46, 42, 31, 43, 40, 12, 34, 44, 55] and the references therein. In addition to FNNs, there is also a body of work exploring the approximation capabilities of other types of neural networks, such as convolutional neural networks (CNNs) [37, 62, 61] and recurrent neural networks (RNNs) [29, 15, 21]. Furthermore, the memorization capacity of neural networks is closely related to their approximation ability. Research in this direction can be found in [36, 50] and the references therein.

Since its introduction by [51], the Transformer model has achieved remarkable success across a wide variety of domains, including but not limited to large language models [6] and computer vision [7]. Compared to traditional FNNs, Transformers are more efficient due to their use of self-attention mechanisms, which enable parameter sharing and parallel token processing. However, this architecture also introduces challenges in studying the approximation properties of Transformers: they must capture the entire context of each input sequence rather than simply assigning a label to each token independently.
Theoretical research on the expressive power of Transformers began with [59], which introduced the concept of contextual mapping: the ability to distinguish tokens that are in the same sequence but at different positions, or tokens across different input sequences. By demonstrating that Transformers with biased self-attention layers can implement contextual mapping, the authors established a universal approximation theorem for such Transformers regarding continuous functions. This method was later extended to sparse Transformers [60]. [28] also applied this approach to study constrained Transformers. [25] investigated the closely related memorization problem. They refined the techniques in [59], enhancing the parameter efficiency of attention layers, and established an upper bound for Transformers solving memorization tasks. Subsequently, [23] improved the upper bounds of [25] and provided lower bounds. Other studies on the memorization capabilities of Transformers include [33, 32]. Leveraging the properties of the Boltzmann operator, [22] proved that a single-layer, single-head self-attention mechanism without bias terms is sufficient to achieve contextual mapping. However, this comes at the cost of a separation parameter that decays exponentially with the size of the dictionary. Building on this result, they also established a universal approximation theorem for continuous functions. Recently, there have been some quantitative works in this area. [18] provided a quantitative characterization of the convergence rate when Transformers approximate functions in the Hölder space $C^{0,\lambda}(\Omega)$, i.e., the dependence of Transformer size on approximation accuracy. Based on the Kolmogorov-Arnold Representation Theorem, [19] constructed several Transformers that can overcome the curse of dimensionality when approximating functions in the Hölder space $C^{0,\lambda}(\Omega)$ by replacing the activation functions in the feed-forward layers. [16] utilized an interpolation-based approach to show that self-attention can approximate the ReLU function. By combining this with classical ReLU FNN approximation results, they proved that two-layer multi-head attention can approximate any continuous function. Additional works regarding the expressive capability of Transformers can be found in [2, 24, 13, 8, 54, 47, 38, 17, 53, 4, 48, 30, 52] and the references therein. However, in these prior studies, there still does not exist any quantitative characterization result for the approximation of functions in the general Hölder space $C^{s,\lambda}(\Omega)$ with $s \in \mathbb{N}_{\geq 0}$ by Transformers, even though $C^{s,\lambda}(\Omega)$ is a central object of study in approximation theory. In other words, no existing work has considered the approximation rate of the standard Transformer when the target function possesses higher-order continuous derivatives. We therefore pose the first question:

At what rate can standard Transformers approximate functions in the general Hölder space $C^{s,\lambda}(\Omega)$?

As a class of non-parametric models, neural networks have been widely used in recent years to solve regression problems. A series of prior works [39, 35, 27, 10, 3, 20, 9, 56] has demonstrated that FNNs with various architectures can achieve the minimax optimal rate in regression tasks. In contrast, research on utilizing Transformers for regression remains relatively limited at present.
[47] investigated Transformers with infinite-dimensional inputs and derived error bounds for regression under the assumption of anisotropic smoothness in the target function. [13, 14] studied error estimation for Transformers using hardmax and ReLU, respectively, as activation functions in the attention layers. [18] analyzed the error bounds of standard Transformers for regression problems and obtained a sub-optimal rate under the assumptions of $C^{0,\lambda}$-continuous target functions and weakly dependent data. None of these studies have answered the second key question we raise:

Can standard Transformers achieve the minimax optimal rate when applied to regression problems with $C^{s,\lambda}$-continuous targets?

1.2 Main Results

Our main results answer the two questions posed in the previous section. Before stating these results, we first give a brief introduction to the standard Transformer architecture. A Transformer consists of three components: the embedding layer, the feedforward block and the self-attention layer.

An embedding layer $F_{EB}: \mathbb{R}^{d_{in}\times n} \to \mathbb{R}^{d_{EB}\times n}$ is an affine transformation, defined as
$$F_{EB}(X) := W_{EB}X + B_{EB},$$
where $W_{EB} \in \mathbb{R}^{d_{EB}\times d_{in}}$, $B_{EB} \in \mathbb{R}^{d_{EB}\times n}$.

A feedforward block $F_{FF}: \mathbb{R}^{d_{FF}^{(in)}\times n} \to \mathbb{R}^{d_{FF}^{(out)}\times n}$ of depth $L$ and width $W$ is recursively defined as
$$F_0 := X; \quad F_l := \sigma_R\left(W_l F_{l-1} + b_l \mathbf{1}_{1\times n}\right),\ l \in \{1, 2, \ldots, L-1\}; \quad F_{FF} := W_L F_{L-1} + b_L \mathbf{1}_{1\times n},$$
where $W_1 \in \mathbb{R}^{W\times d_{FF}^{(in)}}$, $W_l \in \mathbb{R}^{W\times W}$ ($l = 2, \cdots, L-1$), $W_L \in \mathbb{R}^{d_{FF}^{(out)}\times W}$, $b_l \in \mathbb{R}^{W}$ ($l = 1, \cdots, L-1$), $b_L \in \mathbb{R}^{d_{FF}^{(out)}}$, and $\sigma_R$ is the element-wise ReLU function $\sigma_R(x) = \max\{x, 0\}$. In this paper, we say that $F_{FF}$ is generated from its feedforward neural network (FNN) counterpart $f_{FF}: \mathbb{R}^{d_{FF}^{(in)}} \to \mathbb{R}^{d_{FF}^{(out)}}$, which acts on vectors:
$$f_0 := x; \quad f_l := \sigma_R(W_l f_{l-1} + b_l),\ l \in \{1, 2, \ldots, L-1\}; \quad f_{FF} := W_L f_{L-1} + b_L.$$

A self-attention layer $F_{SA}: \mathbb{R}^{d_{SA}\times n} \to \mathbb{R}^{d_{SA}\times n}$ of head number $H$ and head size $S$ is defined as
$$F_{SA}(X) := X + \sum_{h=1}^{H} W_O^{(h)} W_V^{(h)} X\, \sigma_S\!\left(X^\top W_K^{(h)\top} W_Q^{(h)} X\right),$$
where $W_O^{(h)} \in \mathbb{R}^{d_{SA}\times S}$, $W_V^{(h)}, W_K^{(h)}, W_Q^{(h)} \in \mathbb{R}^{S\times d_{SA}}$, and $\sigma_S$ is the column-wise softmax function: $[\sigma_S(x)]_i = e^{x_i}/\left(\sum_{j=1}^{n} e^{x_j}\right)$, $i \in [n]$, $x \in \mathbb{R}^n$.

A Transformer $T: \mathbb{R}^{d_{in}\times n} \to \mathbb{R}^{d_{out}\times n}$ of length $K$ is defined as an initial embedding layer $F_{EB}: \mathbb{R}^{d_{in}\times n} \to \mathbb{R}^{d_0\times n}$ followed by the alternating composition of $K+1$ feed-forward blocks $F_{FF}^{(k)}: \mathbb{R}^{d_k\times n} \to \mathbb{R}^{d_{k+1}\times n}$ and $K$ self-attention layers $F_{SA}^{(k)}: \mathbb{R}^{d_k\times n} \to \mathbb{R}^{d_k\times n}$:
$$T(X) := F_{FF}^{(K)} \circ F_{SA}^{(K)} \circ F_{FF}^{(K-1)} \circ \cdots \circ F_{FF}^{(1)} \circ F_{SA}^{(1)} \circ F_{FF}^{(0)} \circ F_{EB}(X). \tag{1}$$

We use two metrics to characterize the structure of $T$: the size tuple
$$\{(L_0, W_0), (H_1, S_1), (L_1, W_1), \cdots, (H_K, S_K), (L_K, W_K)\}$$
and the dimension vector
$$\mathbf{d} := \begin{pmatrix} d_{in} & d_0 & d_1 & \cdots & d_K & d_{out} \end{pmatrix}.$$
We use $B_{EB}$, $B_{FF}$ and $B_{SA}$ to denote the upper bounds of the parameters in the embedding layer, all feedforward blocks and all self-attention layers, respectively, and we denote by $M_{EB}$, $M_{FF}$ and $M_{SA}$ the total numbers of parameters in the embedding layer, all feedforward blocks and all self-attention layers, respectively.
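For readers who prefer code to notation, the following is a minimal NumPy sketch of the architecture just defined: an embedding layer, feedforward blocks generated from ReLU FNNs, and residual self-attention layers with column-wise softmax, composed as in (1). All shapes, the random weights, and the two-head, single-attention-layer configuration are illustrative assumptions; this is not the construction used in the proofs.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def col_softmax(a):
    # column-wise softmax sigma_S applied to an n x n score matrix
    e = np.exp(a - a.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def feedforward_block(X, Ws, bs):
    # F_0 = X; F_l = relu(W_l F_{l-1} + b_l 1_{1xn}); final layer is affine
    F = X
    for W, b in zip(Ws[:-1], bs[:-1]):
        F = relu(W @ F + b[:, None])
    return Ws[-1] @ F + bs[-1][:, None]

def self_attention_layer(X, heads):
    # F_SA(X) = X + sum_h W_O W_V X softmax(X^T W_K^T W_Q X)
    out = X.copy()
    for WO, WV, WK, WQ in heads:
        A = col_softmax(X.T @ WK.T @ WQ @ X)      # n x n attention matrix
        out += WO @ (WV @ X) @ A
    return out

def transformer(X, W_EB, B_EB, ff_blocks, sa_layers):
    # alternating composition (1): F_FF^(K) o F_SA^(K) o ... o F_FF^(0) o F_EB
    Z = W_EB @ X + B_EB
    Z = feedforward_block(Z, *ff_blocks[0])
    for heads, ff in zip(sa_layers, ff_blocks[1:]):
        Z = self_attention_layer(Z, heads)
        Z = feedforward_block(Z, *ff)
    return Z

# toy usage with hypothetical sizes: d_in = 3, n = 4, d_0 = 6, head size S = 2
rng = np.random.default_rng(0)
d_in, n, d0, S = 3, 4, 6, 2
W_EB, B_EB = rng.standard_normal((d0, d_in)), rng.standard_normal((d0, n))
ff0 = ([rng.standard_normal((8, d0)), rng.standard_normal((d0, 8))],
       [rng.standard_normal(8), rng.standard_normal(d0)])
ff1 = ([rng.standard_normal((8, d0)), rng.standard_normal((d0, 8))],
       [rng.standard_normal(8), rng.standard_normal(d0)])
heads = [(rng.standard_normal((d0, S)), rng.standard_normal((S, d0)),
          rng.standard_normal((S, d0)), rng.standard_normal((S, d0))) for _ in range(2)]
Y = transformer(rng.random((d_in, n)), W_EB, B_EB, [ff0, ff1], [heads])
print(Y.shape)   # (6, 4)
```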
We first consider the approximation of Hölder functions by Transformers. Let $\Omega$ be a set in $\mathbb{R}^d$. Let $\mu = (\mu_1, \ldots, \mu_d) \in \mathbb{N}_{\geq 0}^d$ be a multi-index and denote $|\mu| := \mu_1 + \cdots + \mu_d$. Denote by $D^\mu f := \frac{\partial^{|\mu|} f(x)}{\partial x_1^{\mu_1}\cdots\partial x_d^{\mu_d}}$ the partial derivatives of a function $f: \mathbb{R}^d \to \mathbb{R}$. Let $s \in \mathbb{N}_{\geq 0}$, $0 < \lambda \leq 1$, and denote $\gamma := s + \lambda$. A function $f$ belongs to the Hölder space $C^{s,\lambda}(\Omega)$ if it has finite $C^{s,\lambda}$ norm, which is defined as
$$\|f\|_{C^{s,\lambda}(\Omega)} := \max\left\{\|f\|_{C^s(\Omega)},\ \max_{|\mu|=s} |D^\mu f|_{C^{0,\lambda}(\Omega)}\right\},$$
where
$$\|f\|_{C^s(\Omega)} := \max_{|\mu|\leq s}\, \sup_{x\in\Omega} |D^\mu f(x)|, \qquad |f|_{C^{0,\lambda}(\Omega)} := \sup_{x\neq y\in\Omega} \frac{|f(x)-f(y)|}{\|x-y\|_2^\lambda}.$$

In Theorems 1 and 2, we respectively establish approximation results for Hölder functions by Transformers in the $L^t$ norm ($t \in [1,\infty)$) and in the $L^\infty$ norm. Our results include a quantitative characterization of the Transformer structure. Comparing the two theorems shows that achieving pointwise approximation ($L^\infty$ approximation) comes at the cost of a larger Transformer size.

Theorem 1. Let $1 \leq t < \infty$. For any $0 < \epsilon < 1$ and any $f: [0,1]^{d\times n} \to \mathbb{R}^{d\times n}$ with components in the Hölder space $C^{s,\lambda}\left([0,1]^{d\times n}\right)$, there exists a Transformer $T: \mathbb{R}^{d\times n} \to \mathbb{R}^{d\times n}$ with size
$$L_0 = 4,\ W_0 = O\!\left(\epsilon^{-1/\gamma}\right); \quad H_1 = (3d+1)C_{s+dn-1}^{dn-1},\ S_1 = \max\{dn, 3\};$$
$$L_l = 3,\ W_l = dC_{s+dn-1}^{dn-1}(4n+11),\ l = 1, \cdots, n-1; \quad H_l = 3dC_{s+dn-1}^{dn-1},\ S_l = 3,\ l = 2, \cdots, n;$$
$$L_n = O\!\left(\log\tfrac{1}{\epsilon}\right),\ W_n = O\!\left(\epsilon^{-dn/\gamma}\right)$$
and dimension vector
$$\begin{pmatrix} d & d+n+1 & (2dn+5d)C_{s+dn-1}^{dn-1}\cdot\mathbf{1}_{1\times n} & d \end{pmatrix}$$
such that for $p \in [d]$, $q \in [n]$,
$$\|T_{pq} - f_{pq}\|_{L^t([0,1]^{d\times n})} \leq \epsilon.$$
Furthermore, the weight bounds of $T$ are
$$B_{EB} = O(1),\quad B_{FF} = O\!\left(\epsilon^{-(7dn+\gamma+2)t/\gamma}\right),\quad B_{SA} = O\!\left(\log\tfrac{1}{\epsilon}\right)$$
and the numbers of parameters of $T$ are
$$M_{EB} = O(1),\quad M_{FF} = O\!\left(\epsilon^{-dn/\gamma}\right),\quad M_{SA} = O(1).$$

Theorem 2. For any $0 < \epsilon < 1$ and any $f: [0,1]^{d\times n} \to \mathbb{R}^{d\times n}$ with components in the Hölder space $C^{s,\lambda}\left([0,1]^{d\times n}\right)$, there exists a Transformer $T: \mathbb{R}^{d\times n} \to \mathbb{R}^{d\times n}$ with size
$$L_0 = 4,\ W_0 = O\!\left(\epsilon^{-1/\gamma}\right); \quad H_1 = (3d+1)3^{dn}C_{s+dn-1}^{dn-1},\ S_1 = \max\{dn, 3\};$$
$$L_l = 3,\ W_l = d\,3^{dn}C_{s+dn-1}^{dn-1}(4n+11),\ l = 1, \cdots, n-1; \quad H_l = d\,3^{dn+1}C_{s+dn-1}^{dn-1},\ S_l = 3,\ l = 2, \cdots, n;$$
$$L_n = O\!\left(\log\tfrac{1}{\epsilon}\right),\ W_n = O\!\left(\epsilon^{-dn/\gamma}\right)$$
and dimension vector
$$\begin{pmatrix} d & d+n+1+3^n & (2dn+5d)3^{dn}C_{s+dn-1}^{dn-1}\cdot\mathbf{1}_{1\times n} & d \end{pmatrix}$$
such that for $p \in [d]$, $q \in [n]$,
$$\|T_{pq} - f_{pq}\|_{L^\infty([0,1]^{d\times n})} \leq \epsilon.$$
Furthermore, the weight bounds of $T$ are
$$B_{EB} = O(1),\quad B_{FF} = O\!\left(\epsilon^{-\max\{(6dn+2)/\gamma,\ 1/\lambda\}}\right),\quad B_{SA} = O\!\left(\log\tfrac{1}{\epsilon}\right)$$
and the numbers of parameters of $T$ are
$$M_{EB} = O(1),\quad M_{FF} = O\!\left(\epsilon^{-dn/\gamma}\right),\quad M_{SA} = O(1).$$

We next consider the application of Transformers to nonparametric regression:
$$Y_i = f_0(X_i) + \xi_i, \quad i \in [m].$$
Here $f_0(x) = \mathbb{E}[Y \mid X = x]: [0,1]^{d\times n} \to \mathbb{R}$ is the unknown target function, $\{(X_i, Y_i)\}_{i=1}^m \subset [0,1]^{d\times n}\times\mathbb{R}$ are observation pairs, and $\{\xi_i\}_{i=1}^m$ are i.i.d. Gaussian noises with $\mathbb{E}\xi_i = 0$, $\mathrm{Var}(\xi_i) = \sigma^2$. Let $\mu$ be the marginal distribution of $X$. We assume that $|Y| \leq B_Y$ for some constant $B_Y > 0$. Our goal is to estimate $f_0$ based on the given observation pairs $\{(X_i, Y_i)\}_{i=1}^m$. Specifically, we consider the following least squares problem over a function class $\mathcal{F}$:
$$\widehat{f} = \arg\min_{f\in\mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} [f(X_i) - Y_i]^2.$$
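The following is a minimal sketch of this regression setting: synthetic observation pairs with Gaussian noise, the empirical least-squares risk, and the truncation $f_{tr,B_Y}$. The smooth target and the linear candidate class are stand-in assumptions used only for illustration; the paper's estimator instead minimizes over the truncated Transformer class defined in (2) below.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m, sigma, B_Y = 2, 3, 200, 0.1, 1.0

def f0(X):
    # hypothetical smooth target f_0 : [0,1]^{d x n} -> R (illustration only)
    return np.sin(X.sum())

# observation pairs (X_i, Y_i) with i.i.d. Gaussian noise
Xs = rng.random((m, d, n))
Ys = np.array([f0(X) for X in Xs]) + sigma * rng.standard_normal(m)

def truncate(y, B=B_Y):
    # f_{tr,B}(y) = max(min(y, B), -B)
    return float(np.clip(y, -B, B))

def empirical_risk(predict, Xs, Ys):
    # (1/m) sum_i (f(X_i) - Y_i)^2, the least-squares objective
    preds = np.array([predict(X) for X in Xs])
    return float(np.mean((preds - Ys) ** 2))

def fit_linear(Xs, Ys):
    # toy candidate class: truncated linear readouts f_theta(X) = <theta, X>;
    # the paper's class F consists of truncated Transformers instead
    A = Xs.reshape(Xs.shape[0], -1)
    theta, *_ = np.linalg.lstsq(A, Ys, rcond=None)
    return lambda X: truncate(X.reshape(-1) @ theta)

f_hat = fit_linear(Xs, Ys)
print(empirical_risk(f_hat, Xs, Ys))
```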
The function class $\mathcal{F}$ can be selected in various forms, such as reproducing kernel Hilbert spaces, polynomial spaces, spline spaces, and neural network function classes. In this paper, we choose
$$\mathcal{F} := f_{tr,B_Y} \circ \mathcal{F}_T$$
with $f_{tr,B_Y}(x) := \max\{\min\{x, B_Y\}, -B_Y\}$ being a truncation function and $\mathcal{F}_T$ being a function class generated by Transformers:
$$\mathcal{F}_T(K, L, W, H, S, M_{FF}, M_{SA}, B_{FF}, B_{SA}, \mathbf{d}) := \{f_T : f_T = \langle T, E_{11}\rangle,\ T \in \mathcal{T}(K, L, W, H, S, M_{FF}, M_{SA}, B_{FF}, B_{SA}, \mathbf{d})\}, \tag{2}$$
where
$$\mathcal{T}(K, L, W, H, S, M_{FF}, M_{SA}, B_{FF}, B_{SA}, \mathbf{d}) := \Big\{T : T \text{ is a Transformer defined by (1) with } L := \max_{k=0,1,\cdots,K} L_k,\ W := \max_{k=0,1,\cdots,K} W_k,\ H := \max_{k=1,\cdots,K} H_k,\ S := \max_{k=1,\cdots,K} S_k\Big\}.$$
Here $\langle\cdot,\cdot\rangle$ is the matrix inner product and $E_{11}$ is the matrix whose entry in the first row and first column is 1, while all other entries are 0. $E_{11}$ can be replaced by any other fixed matrix, as there is no essential difference here.

Our next theorem provides an upper bound on the excess risk
$$\left\|\widehat{f} - f_0\right\|_{L^2(\mu)}^2 := \int_{[0,1]^{d\times n}} \left|\widehat{f} - f_0\right|^2 d\mu,$$
which measures the distance between the estimator $\widehat{f}$ and the target $f_0$.

Theorem 3. Assume $f_0 \in C^{s,\lambda}\left([0,1]^{d\times n}\right)$. Let the parameters in (2) be
$$K = n,\ L = O(\log m),\ W = O\!\left(m^{dn/(2\gamma+dn)}\right),\ H = O(1),\ S = O(1),$$
$$M_{EB} = O(1),\ M_{FF} = O\!\left(m^{dn/(2\gamma+dn)}\right),\ M_{SA} = O(1),$$
$$B_{EB} = O(1),\ B_{FF} = O\!\left(m^{\max\{6dn+2,\ \gamma/\lambda\}/(2\gamma+dn)}\right),\ B_{SA} = O(\log m),$$
$$\mathbf{d} = \begin{pmatrix} d & d+n+1+3^n & (2dn+5d)3^{dn}C_{s+dn-1}^{dn-1}\cdot\mathbf{1}_{1\times n} & d \end{pmatrix}.$$
Then with probability at least $1 - 2\exp\left(-m^{dn/(2\gamma+dn)}\right)$, there holds
$$\left\|\widehat{f} - f_0\right\|_{L^2(\mu)}^2 \lesssim m^{-2\gamma/(2\gamma+dn)}(\log m)^2.$$

As shown in the classical work [45], the minimax optimal convergence rate for nonparametric regression with $C^{s,\lambda}$ targets is $\Theta\left(m^{-2\gamma/(2\gamma+dn)}\right)$. Therefore, ignoring logarithmic factors, our result shows that Transformers achieve the minimax optimal rate in nonparametric regression.

1.3 Our Contributions

Summarizing the content of the previous section, our contributions are as follows:

• To the best of our knowledge, this is the first work to derive approximation rates of standard Transformers for functions in Hölder spaces $C^{s,\lambda}\left([0,1]^{d\times n}\right)$ with $s \in \mathbb{N}_{\geq 0}$ (Theorems 1 and 2). Our results generalize those of [18], which studied approximation rates of standard Transformers for functions in $C^{0,\lambda}\left([0,1]^{d\times n}\right)$. We show that higher smoothness of the target function leads to improved approximation rates for the Transformer. Given the fundamental importance of Hölder spaces in the study of partial differential equations and dynamical systems, our results lay a theoretical foundation for investigating the application of Transformers in these areas.

• We prove that standard Transformers can achieve the minimax optimal rate under the classical nonparametric regression setting (Theorem 3). As an intermediate result, we estimate the Lipschitz constant of Transformers (Lemma 4), a result that may be of independent interest.

• As an intermediate step in the derivation of the approximation results, we obtain a memorization result for the standard Transformer (Lemma 7). This improves upon the results in [25], which investigated Transformers with bias terms in the attention layers. This finding may be of independent interest.
• Previous works either lacked a precise characterization of the Transformer architecture or provided only a coarse structural description. In contrast, by introducing two metrics, the size tuple and the dimension vector, we achieve a fine-grained characterization of Transformer structures. This facilitates the analysis of both generalization error and optimization error for Transformers with varying structures in future work.

1.4 Organization of This Paper

The remainder of the paper is organized as follows. In Section 2, we present the proofs of Theorems 1, 2 and 3 along with the intermediate results needed in the proofs. Sections 3-7 contain the proofs of these intermediate results. Finally, we summarize the paper in Section 8.

2 Proofs of Main Results

In this section, we present the proofs of Theorems 1, 2 and 3 in Sections 2.1, 2.2 and 2.3, respectively, along with the intermediate results required for these proofs.

2.1 Proof of Theorem 1

To prove Theorem 1, we first construct a Transformer that can approximate the target function to arbitrary precision over the regions
$$\Omega_\beta := \left\{X \in [0,1]^{d\times n} : X_{ik} \in \left[\frac{\beta_{ik}}{K},\ \frac{\beta_{ik}+1-\delta}{K}\right],\ i \in [d],\ k \in [n]\right\}, \quad \beta \in \{0, 1, \cdots, K-1\}^{d\times n},$$
where $K \in \mathbb{N}_{\geq 1}$ and $\delta \in \mathbb{R}_{>0}$. We sort the index set $\{0, 1, \cdots, K-1\}^{d\times n}$ as $\{\beta_1, \cdots, \beta_{K^{dn}}\}$ and denote $\Omega_{\beta_j}$ simply as $\Omega_j$ for $j \in [K^{dn}]$ hereafter. The complement of $\bigcup_{j\in[K^{dn}]}\Omega_j$ in $[0,1]^{d\times n}$ is
$$\Omega^{(flaw)} := \left\{X \in [0,1]^{d\times n} : \text{there exist } i \in [d],\ j \in [n],\ k \in [K] \text{ such that } X_{ij} \in \left(\frac{k-\delta}{K},\ \frac{k}{K}\right)\right\}.$$
This partition of the cube into $K^{dn}$ good cells plus a small flaw region is illustrated by the sketch below.
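As a sanity check on the partition, the following sketch locates the cell $\Omega_\beta$ containing a point and estimates the measure of $\Omega^{(flaw)}$ by Monte Carlo; the values of $d, n, K, \delta$ are arbitrary illustrative choices. The estimate matches $1-(1-\delta)^{dn}$ and stays below the Bernoulli bound $dn\delta$ used later in the proof of Theorem 1.

```python
import numpy as np

def in_good_cell(X, K, delta):
    # X lies in Omega_beta iff every entry lies in [beta/K, (beta + 1 - delta)/K]
    beta = np.floor(K * X).astype(int)
    beta = np.minimum(beta, K - 1)        # treat the right endpoint 1 as the last cell
    return bool(np.all(K * X - beta <= 1.0 - delta)), beta

# Monte Carlo check of |Omega^(flaw)| = 1 - (1 - delta)^{dn} <= dn * delta
rng = np.random.default_rng(2)
d, n, K, delta = 2, 2, 5, 0.05
samples = rng.random((20000, d, n))
flaw_frac = np.mean([not in_good_cell(X, K, delta)[0] for X in samples])
print(flaw_frac, 1 - (1 - delta) ** (d * n), d * n * delta)
```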
Such a Transformer is given in Proposition 1 below. We provide a detailed characterization of its structure and derive an upper bound on its magnitude on $\Omega^{(flaw)}$. The proof of Proposition 1 is deferred to Section 3.

Proposition 1. For any $0 < \epsilon < 1$ and any $f: [0,1]^{d\times n} \to \mathbb{R}^{d\times n}$ with components in the Hölder space $C^{s,\lambda}\left([0,1]^{d\times n}\right)$, there exists a Transformer $T: \mathbb{R}^{d\times n} \to \mathbb{R}^{d\times n}$ with size
$$L_0 = 4,\ W_0 = O\!\left(\epsilon^{-1/\gamma}\right); \quad H_1 = (3d+1)C_{s+dn-1}^{dn-1},\ S_1 = \max\{dn, 3\};$$
$$L_l = 3,\ W_l = dC_{s+dn-1}^{dn-1}(4n+11),\ l = 1, \cdots, n-1; \quad H_l = 3dC_{s+dn-1}^{dn-1},\ S_l = 3,\ l = 2, \cdots, n;$$
$$L_n = O\!\left(\log\tfrac{1}{\epsilon}\right),\ W_n = O\!\left(\epsilon^{-dn/\gamma}\right)$$
and dimension vector
$$\begin{pmatrix} d & d+n+1 & (2dn+5d)C_{s+dn-1}^{dn-1}\cdot\mathbf{1}_{1\times n} & d \end{pmatrix}$$
such that for $p \in [d]$, $q \in [n]$,
• for $X \in \bigcup_{j\in[K^{dn}]}\Omega_j$, $|T_{pq}(X) - f_{pq}(X)| \leq \epsilon$;
• for $X \in \Omega^{(flaw)}$, $|T_{pq}(X)| \lesssim \epsilon^{-(7dn+2)/\gamma}$.
Here the granularity is $K = \Theta\left(\epsilon^{-1/\gamma}\right)$. Furthermore, the weight bounds of $T$ are
$$B_{EB} = O(1),\quad B_{FF} = \max\left\{C(f, d, n, s)\,\epsilon^{-(6dn+2)/\gamma},\ 1/\delta\right\},\quad B_{SA} = O\!\left(\log\tfrac{1}{\epsilon}\right)$$
and the numbers of parameters of $T$ are
$$M_{EB} = O(1),\quad M_{FF} = O\!\left(\epsilon^{-dn/\gamma}\right),\quad M_{SA} = O(1).$$

With Proposition 1 in hand, we are able to prove Theorem 1.

Proof of Theorem 1. We divide the integral into two parts:
$$\|T_{pq} - f_{pq}\|_{L^t([0,1]^{d\times n})}^t = \int_{X\in\bigcup_{j\in[K^{dn}]}\Omega_j} |T_{pq}(X) - f_{pq}(X)|^t\, dX + \int_{X\in\Omega^{(flaw)}} |T_{pq}(X) - f_{pq}(X)|^t\, dX.$$
For the region $\bigcup_{j\in[K^{dn}]}\Omega_j$, we apply Proposition 1 with $\epsilon$ replaced by $\frac{\epsilon}{2^{1/t}(1-\delta)^{dn/t}}$ therein and obtain
$$\int_{X\in\bigcup_{j\in[K^{dn}]}\Omega_j} |T_{pq}(X) - f_{pq}(X)|^t\, dX \leq \frac{\epsilon^t}{2(1-\delta)^{dn}}\cdot\sum_{j=1}^{K^{dn}} |\Omega_j| = \frac{\epsilon^t}{2(1-\delta)^{dn}}\cdot K^{dn}\left(\frac{1-\delta}{K}\right)^{dn} = \frac{\epsilon^t}{2}.$$
According to Bernoulli's inequality, we have the following estimate for the measure of $\Omega^{(flaw)}$:
$$\left|\Omega^{(flaw)}\right| = 1 - \sum_{j=1}^{K^{dn}}|\Omega_j| = 1 - K^{dn}\left(\frac{1-\delta}{K}\right)^{dn} = 1 - (1-\delta)^{dn} \leq dn\delta.$$
Using the above estimate and Proposition 1, we derive that
$$\int_{X\in\Omega^{(flaw)}} |T_{pq}(X) - f_{pq}(X)|^t\, dX \lesssim \delta\,(1-\delta)^{dn(7dn+2)/\gamma}\,\epsilon^{-(7dn+2)t/\gamma} \lesssim \delta\,\epsilon^{-(7dn+2)t/\gamma}.$$
Choosing $\delta \asymp \epsilon^{(7dn+\gamma+2)t/\gamma}$, we can make
$$\int_{X\in\Omega^{(flaw)}} |T_{pq}(X) - f_{pq}(X)|^t\, dX \leq \frac{\epsilon^t}{2}.$$
It follows that $\|T_{pq} - f_{pq}\|_{L^t([0,1]^{d\times n})} \leq \epsilon$.

2.2 Proof of Theorem 2

The proof of Theorem 2 relies on the horizontal shift technique, first proposed by [31] to prove the approximation capabilities of ReLU feedforward neural networks. Specifically, by exploiting properties of the middle value function, this technique strengthens approximation results on $[0,1]^{d\times n}\setminus\Omega^{(flaw)}$ to the full domain $[0,1]^{d\times n}$, at the cost of a larger network size. The technique is presented in the two lemmas below, where $\omega_f(\delta)$ denotes the modulus of continuity of $f \in C([0,1]^{d\times n})$, defined as
$$\omega_f(\delta) := \sup\left\{|f(X) - f(Y)| : \|X - Y\|_F \leq \delta,\ X, Y \in [0,1]^{d\times n}\right\}, \quad \delta \geq 0.$$

Lemma 1 ([31], Lemma 3.1). There exists a ReLU FNN function $f_{FF}^{(mid)}: \mathbb{R}^3 \to \mathbb{R}$ with width 14, depth 2 and weight bound 1 that outputs the middle value of the components of $x$ for any $x \in \mathbb{R}^3$.

Lemma 2 ([31], Lemma 3.4). Let $K \in \mathbb{N}_{>0}$, $\epsilon, \delta \in \mathbb{R}_{>0}$. Suppose that $\delta \leq \frac{1}{3K}$. Assume the target function $f \in C([0,1]^{d\times n})$ and there exists $g: \mathbb{R}^{d\times n} \to \mathbb{R}$ such that for any $X \in [0,1]^{d\times n}\setminus\Omega^{(flaw)}$,
$$|g(X) - f(X)| \leq \epsilon.$$
Then for any $X \in [0,1]^{d\times n}$,
$$|\phi(X) - f(X)| \leq \epsilon + dn\cdot\omega_f(\delta),$$
where $\phi := \phi_{dn}$ is defined by induction through
$$\phi_i(X) := f_{FF}^{(mid)}\left(\phi_{i-1}(X - \delta E_i),\ \phi_{i-1}(X),\ \phi_{i-1}(X + \delta E_i)\right), \quad i \in [dn],$$
with $\phi_0 = g$ and $E_i \in \mathbb{R}^{d\times n}$ being the matrix whose entry is 1 at the position $(\lceil i/n\rceil,\ i - n(\lceil i/n\rceil - 1))$ and 0 elsewhere.
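A minimal NumPy sketch of the middle-value idea behind Lemmas 1 and 2: the middle value of three numbers can be written using ReLU operations (here as a short composition, not the exact width-14, depth-2 network of [31]), and one step of the shift recursion replaces $g$ by the middle value of three shifted evaluations. The one-dimensional helper below is an illustrative assumption, not the construction used in the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def middle_value(a, b, c):
    # middle value = a + b + c - max - min, with max/min expressed via ReLU:
    # max(x, y) = x + relu(y - x),  min(x, y) = x - relu(x - y)
    mx = a + relu(b - a)
    mx = mx + relu(c - mx)
    mn = a - relu(a - b)
    mn = mn - relu(mn - c)
    return a + b + c - mx - mn

def horizontal_shift_1d(g, delta, x):
    # one step of the recursion phi_1(x) = f_mid(g(x - delta), g(x), g(x + delta)):
    # if at most one of the three shifted points lands in the flaw region,
    # the middle value still agrees with the good approximations
    return middle_value(g(x - delta), g(x), g(x + delta))

print(middle_value(0.3, -1.2, 0.7))   # approx 0.3
```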
In this paper, we extend the horizontal shift technique from FNNs to Transformers, thereby establishing approximation in the $L^\infty$ norm. This extension relies on the parallelizability of Transformers, which is formalized in Lemma 3. This lemma will also be used frequently in the later technical proofs. Its proof is deferred to Section 5.

Lemma 3. (1) Let $F_{FF}^{(1)}: \mathbb{R}^{d^{(1)}\times n} \to \mathbb{R}^{\bar d^{(1)}\times n}$ and $F_{FF}^{(2)}: \mathbb{R}^{d^{(2)}\times n} \to \mathbb{R}^{\bar d^{(2)}\times n}$ be two feedforward blocks, both with depth $L$, widths $W^{(1)}$ and $W^{(2)}$, and weight bounds $B^{(1)}$ and $B^{(2)}$, respectively. There exists a feedforward block $F_{FF}^{(prl)}: \mathbb{R}^{(d^{(1)}+d^{(2)})\times n} \to \mathbb{R}^{(\bar d^{(1)}+\bar d^{(2)})\times n}$ with depth $L$, width not greater than $W^{(1)}+W^{(2)}$ and weight bound $\max\{B^{(1)}, B^{(2)}\}$ such that for any $X \in \mathbb{R}^{d^{(1)}\times n}$, $Y \in \mathbb{R}^{d^{(2)}\times n}$,
$$F_{FF}^{(prl)}\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} F_{FF}^{(1)}(X) \\ F_{FF}^{(2)}(Y) \end{pmatrix}.$$
(2) Let $F_{SA}^{(1)}: \mathbb{R}^{d^{(1)}\times n} \to \mathbb{R}^{d^{(1)}\times n}$ and $F_{SA}^{(2)}: \mathbb{R}^{d^{(2)}\times n} \to \mathbb{R}^{d^{(2)}\times n}$ be two self-attention layers with head numbers $H^{(1)}$ and $H^{(2)}$, head sizes $S^{(1)}$ and $S^{(2)}$, and weight bounds $B^{(1)}$ and $B^{(2)}$, respectively. There exists a self-attention layer $F_{SA}^{(prl)}: \mathbb{R}^{(d^{(1)}+d^{(2)})\times n} \to \mathbb{R}^{(d^{(1)}+d^{(2)})\times n}$ with head number $H^{(1)}+H^{(2)}$, head size $\max\{S^{(1)}, S^{(2)}\}$ and weight bound $\max\{B^{(1)}, B^{(2)}\}$ such that for any $X \in \mathbb{R}^{d^{(1)}\times n}$, $Y \in \mathbb{R}^{d^{(2)}\times n}$,
$$F_{SA}^{(prl)}\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} F_{SA}^{(1)}(X) \\ F_{SA}^{(2)}(Y) \end{pmatrix}.$$
(3) Let $T^{(1)}: \mathbb{R}^{d_{in}^{(1)}\times n} \to \mathbb{R}^{d_{out}^{(1)}\times n}$ and $T^{(2)}: \mathbb{R}^{d_{in}^{(2)}\times n} \to \mathbb{R}^{d_{out}^{(2)}\times n}$ be two Transformers with sizes
$$\left\{\left(L_0, W_0^{(1)}\right), \left(H_1^{(1)}, S_1^{(1)}\right), \left(L_1, W_1^{(1)}\right), \cdots, \left(H_K^{(1)}, S_K^{(1)}\right), \left(L_K, W_K^{(1)}\right)\right\},$$
$$\left\{\left(L_0, W_0^{(2)}\right), \left(H_1^{(2)}, S_1^{(2)}\right), \left(L_1, W_1^{(2)}\right), \cdots, \left(H_K^{(2)}, S_K^{(2)}\right), \left(L_K, W_K^{(2)}\right)\right\},$$
dimension vectors
$$\begin{pmatrix} d_{in}^{(1)} & d_0^{(1)} & d_1^{(1)} & \cdots & d_K^{(1)} & d_{out}^{(1)} \end{pmatrix}, \quad \begin{pmatrix} d_{in}^{(2)} & d_0^{(2)} & d_1^{(2)} & \cdots & d_K^{(2)} & d_{out}^{(2)} \end{pmatrix}$$
and weight bounds $\left(B_{FF}^{(1)}, B_{SA}^{(1)}\right)$, $\left(B_{FF}^{(2)}, B_{SA}^{(2)}\right)$, respectively. There exists a Transformer $T^{(prl)}: \mathbb{R}^{(d_{in}^{(1)}+d_{in}^{(2)})\times n} \to \mathbb{R}^{(d_{out}^{(1)}+d_{out}^{(2)})\times n}$ with size
$$\left\{\left(L_0, W_0^{(1)}+W_0^{(2)}\right), \left(H_1^{(1)}+H_1^{(2)}, \max\{S_1^{(1)}, S_1^{(2)}\}\right), \left(L_1, W_1^{(1)}+W_1^{(2)}\right), \cdots, \left(H_K^{(1)}+H_K^{(2)}, \max\{S_K^{(1)}, S_K^{(2)}\}\right), \left(L_K, W_K^{(1)}+W_K^{(2)}\right)\right\},$$
dimension vector
$$\begin{pmatrix} d_{in}^{(1)}+d_{in}^{(2)} & d_0^{(1)}+d_0^{(2)} & d_1^{(1)}+d_1^{(2)} & \cdots & d_K^{(1)}+d_K^{(2)} & d_{out}^{(1)}+d_{out}^{(2)} \end{pmatrix}$$
and weight bounds $B_{FF} = \max\{B_{FF}^{(1)}, B_{FF}^{(2)}\}$, $B_{SA} = \max\{B_{SA}^{(1)}, B_{SA}^{(2)}\}$ such that for any $X \in \mathbb{R}^{d_{in}^{(1)}\times n}$, $Y \in \mathbb{R}^{d_{in}^{(2)}\times n}$,
$$T^{(prl)}\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} T^{(1)}(X) \\ T^{(2)}(Y) \end{pmatrix}.$$

Proof of Theorem 2. Using Proposition 1 with $\epsilon$ replaced by $\frac{\epsilon}{2}$ therein, we obtain a Transformer $\widetilde T$ that can approximate $f$ entry-wise on $[0,1]^{d\times n}\setminus\Omega^{(flaw)}$. By shifting $\widetilde T$ in various ways, where each entry of the shift matrix takes values in $\{-\delta, 0, \delta\}$, we obtain $3^{dn}$ distinct shifted versions of $\widetilde T$. The shifting operation can be implemented via the embedding layer. Parallelizing these $3^{dn}$ shifted versions of $\widetilde T$ yields a new Transformer $T^{(prl)}$. Then, by composing $T^{(prl)}$ with $dn$ feedforward layers (each consisting of several parallel $F_{FF}^{(mid)}$ modules), we obtain the desired Transformer $T$. Note that $\omega_{f_{pq}}(\delta) \lesssim \delta^\lambda$. Setting $\delta \asymp \epsilon^{1/\lambda}$ and applying Proposition 1, Lemma 1, Lemma 2 and Lemma 3, we achieve the result.
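The proof above repeatedly parallelizes sub-networks via Lemma 3. The following is a small numerical check of the idea behind Lemma 3(1): stacking the weight matrices of two feedforward blocks block-diagonally (and concatenating the biases) yields a single block that computes both in parallel on vertically stacked inputs. The shapes and random weights are arbitrary illustrative choices.

```python
import numpy as np

def block_diag(A, B):
    # place A and B on the diagonal of a larger zero matrix
    out = np.zeros((A.shape[0] + B.shape[0], A.shape[1] + B.shape[1]))
    out[:A.shape[0], :A.shape[1]] = A
    out[A.shape[0]:, A.shape[1]:] = B
    return out

def ff_block(X, W1, b1, W2, b2):
    # one-hidden-layer feedforward block acting on a matrix of token columns
    return W2 @ np.maximum(W1 @ X + b1[:, None], 0.0) + b2[:, None]

rng = np.random.default_rng(3)
d1, d2, n, W = 2, 3, 4, 5
X, Y = rng.random((d1, n)), rng.random((d2, n))
p1 = [rng.standard_normal(s) for s in [(W, d1), (W,), (d1, W), (d1,)]]
p2 = [rng.standard_normal(s) for s in [(W, d2), (W,), (d2, W), (d2,)]]

# parallel block: block-diagonal weights, concatenated biases
W1p, b1p = block_diag(p1[0], p2[0]), np.concatenate([p1[1], p2[1]])
W2p, b2p = block_diag(p1[2], p2[2]), np.concatenate([p1[3], p2[3]])

parallel = ff_block(np.vstack([X, Y]), W1p, b1p, W2p, b2p)
stacked = np.vstack([ff_block(X, *p1), ff_block(Y, *p2)])
print(np.allclose(parallel, stacked))   # True
```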
2.3 Proof of Theorem 3

Based on classical empirical process theory, we can derive an upper bound on the excess risk in terms of the covering number of the function class. The proof of Proposition 2 is deferred to Section 6.

Definition 1 (covering number). Let $T$ be a set in a metric space $(\bar T, \tau)$. An $\epsilon$-cover of $T$ is a subset $T_c \subset \bar T$ such that for each $t \in T$, there exists a $t_c \in T_c$ such that $\tau(t, t_c) \leq \epsilon$. The $\epsilon$-covering number of $T$, denoted as $\mathcal{N}(\epsilon, T, \tau)$, is defined to be the minimum cardinality among all $\epsilon$-covers of $T$ with respect to the metric $\tau$.

Proposition 2. Let $B_{\mathcal F} \in \mathbb{R}_{>0}$. Let $\mathcal F$ be any given function class in $\mathbb{R}^d$ such that $|f| \leq B_{\mathcal F}$ for all $f \in \mathcal F$. There holds
$$\left\|\widehat f - f_0\right\|_{L^2(\mu)}^2 \leq \left(\frac{896 B_{\mathcal F}^2}{3} + 2^{17}\sigma^2 + 20\right)m^{-2\gamma/(2\gamma+d)} + \frac{896 B_{\mathcal F}^2\log\mathcal{N}\!\left(m^{-\gamma/(2\gamma+d)}, \mathcal F, \|\cdot\|_{L^\infty(\mu)}\right)}{3m} + 146\,\|f^* - f_0\|_{L^\infty(\mu)}^2 + \frac{2^{10}\sigma}{m^{(\gamma+d)/(2\gamma+d)}}\left(\int_0^{2^7\sigma m^{-\gamma/(2\gamma+d)}} \sqrt{\log\!\big(2\mathcal{N}(\varsigma, \mathcal F, \|\cdot\|_{L^\infty(\mu)})^2\big)}\, d\varsigma\right)^2$$
with probability at least $1 - 2\exp\left(-m^{d/(2s+d)}\right)$, where $f^*$ can be any function in $\mathcal F$.

It remains to upper-bound the covering number of the Transformer class, which is done in Lemma 4, whose proof is deferred to Section 7.

Lemma 4. For $\mathcal{F}_T$ defined in (2), there holds
$$\mathcal{N}(\varsigma, \mathcal{F}_T, \|\cdot\|_{L^\infty(\mu)}) \leq \left(\frac{2B_{EB}\mathcal{L}}{\varsigma}\right)^{M_{EB}} \left(\frac{2B_{FF}\mathcal{L}}{\varsigma}\right)^{M_{FF}} \left(\frac{2B_{SA}\mathcal{L}}{\varsigma}\right)^{M_{SA}}, \quad \forall\, \varsigma > 0,$$
where
$$\mathcal{L} = (2K+2)\,6^K 4^{K^2+K+4} n^{K^2+5K/2+3} d_{in}^{K+1/2} d_0^{2K+1} \left(\prod_{k'=1}^{K} d_{k'}^{4(K-k')+6}\right) d_{out}^{1/2}\, H^{K^2+K-1} S^{K^2+2K+1} L^{K^2+2K+3} W^{(L-1)(K^2+3K+3)} B_{EB}^{2K+1} B_{FF}^{L(K^2+3K+3)} B_{SA}^{2(K^2+2K+1)}. \tag{3}$$

We also need the Lipschitz continuity of the truncation function.

Lemma 5. Let $B \in \mathbb{R}_{>0}$. Let $U$ be a set. For any two functions $f_1, f_2: U \to \mathbb{R}$,
$$|f_{tr,B}\circ f_1(x) - f_{tr,B}\circ f_2(x)| \leq |f_1(x) - f_2(x)|, \quad x \in U.$$

Proof. For $x$ such that $|f_1(x)|, |f_2(x)| \leq B$, the conclusion is trivial. For $x$ such that $|f_1(x)|, |f_2(x)| > B$, we have $f_{tr,B}\circ f_1(x) - f_{tr,B}\circ f_2(x) = 0$ if $f_1(x)$ and $f_2(x)$ have the same sign, and $|f_{tr,B}\circ f_1(x) - f_{tr,B}\circ f_2(x)| = 2B \leq |f_1(x) - f_2(x)|$ otherwise. For $x$ such that $|f_1(x)| > B$, $|f_2(x)| \leq B$, without loss of generality we assume $f_1(x) > B$; then
$$0 \leq f_{tr,B}\circ f_1(x) - f_{tr,B}\circ f_2(x) = B - f_2(x) < f_1(x) - f_2(x).$$
For $x$ such that $f_1(x) \leq B$, $f_2(x) > B$, the conclusion can be proved similarly.

Proof of Theorem 3. By Theorem 2, for a given $0 < \epsilon < 1$, there exists $f: \mathbb{R}^{d\times n} \to \mathbb{R}$, which lies in
$$\mathcal{F}_T\Big(K = n,\ L = O\!\big(\log\tfrac{1}{\epsilon}\big),\ W = O(\epsilon^{-dn/\gamma}),\ H = O(1),\ S = O(1),\ M_{EB} = O(1),\ M_{FF} = O(\epsilon^{-dn/\gamma}),\ M_{SA} = O(1),\ B_{EB} = O(1),\ B_{FF} = O\!\big(\epsilon^{-\max\{(6dn+2)/\gamma,\,1/\lambda\}}\big),\ B_{SA} = O\!\big(\log\tfrac{1}{\epsilon}\big),\ \mathbf{d}\Big) \tag{4}$$
with
$$\mathbf{d} = \begin{pmatrix} d & d+n+1+3^n & (2dn+5d)3^{dn}C_{s+dn-1}^{dn-1}\cdot\mathbf{1}_{1\times n} & d \end{pmatrix},$$
such that
$$\|f - f_0\|_{L^\infty([0,1]^{d\times n})} \leq \epsilon.$$
It follows from Lemma 5 that
$$\|f_{tr,B_Y}\circ f - f_0\|_{L^\infty([0,1]^{d\times n})} \leq \epsilon. \tag{5}$$
By Lemma 4 and some calculations, the entropy of the function class (4) can be bounded as
$$\log\mathcal{N}(\varsigma, \mathcal{F}_T, \|\cdot\|_{L^\infty(\mu)}) \lesssim \epsilon^{-dn/\gamma}\left(\log\tfrac{1}{\epsilon}\right)^2 + \epsilon^{-dn/\gamma}\log\tfrac{1}{\varsigma}, \quad \forall\, \varsigma > 0.$$
Truncation does not increase entropy:
$$\log\mathcal{N}(\varsigma, f_{tr,B_Y}\circ\mathcal{F}_T, \|\cdot\|_{L^\infty(\mu)}) \leq \log\mathcal{N}(\varsigma, \mathcal{F}_T, \|\cdot\|_{L^\infty(\mu)}) \lesssim \epsilon^{-dn/\gamma}\left(\log\tfrac{1}{\epsilon}\right)^2 + \epsilon^{-dn/\gamma}\log\tfrac{1}{\varsigma}. \tag{6}$$
We can then apply Proposition 2 with $f^* = f_{tr,B_Y}\circ f$ to derive an upper bound for $\left\|\widehat f - f_0\right\|_{L^2(\mu)}$. To this end, we need to evaluate the integral in Proposition 2:
$$\left(\int_0^{2^7\sigma m^{-\gamma/(2\gamma+dn)}} \sqrt{\log\!\big(2\mathcal{N}(\varsigma, f_{tr,B_Y}\circ\mathcal{F}_T, \|\cdot\|_{L^\infty(\mu)})^2\big)}\, d\varsigma\right)^2 \leq 2^7\sigma m^{-\gamma/(2\gamma+dn)}\int_0^{2^7\sigma m^{-\gamma/(2\gamma+dn)}} \log\!\big(2\mathcal{N}(\varsigma, f_{tr,B_Y}\circ\mathcal{F}_T, \|\cdot\|_{L^\infty(\mu)})^2\big)\, d\varsigma$$
$$\lesssim m^{-\gamma/(2\gamma+dn)}\int_0^{2^7\sigma m^{-\gamma/(2\gamma+dn)}} \left(\epsilon^{-dn/\gamma}\left(\log\tfrac{1}{\epsilon}\right)^2 + \epsilon^{-dn/\gamma}\log\tfrac{1}{\varsigma}\right) d\varsigma \lesssim m^{-2\gamma/(2\gamma+dn)}\left(\epsilon^{-dn/\gamma}\left(\log\tfrac{1}{\epsilon}\right)^2 + \log m\right). \tag{7}$$
Here we use Hölder's inequality. Now, combining Proposition 2 and (5)-(7) yields
$$\left\|\widehat f - f_0\right\|_{L^2(\mu)}^2 \lesssim m^{-2\gamma/(2\gamma+dn)} + m^{-1}\epsilon^{-dn/\gamma}\left(\log\tfrac{1}{\epsilon}\right)^2 + \epsilon^{-dn/\gamma} m^{-1}\log m + \epsilon^2 + m^{-(3\gamma+dn)/(2\gamma+dn)}\left(\epsilon^{-dn/\gamma}\left(\log\tfrac{1}{\epsilon}\right)^2 + \log m\right)$$
with probability at least $1 - 2\exp\left(-m^{dn/(2\gamma+dn)}\right)$. We complete the proof by setting $\epsilon \asymp m^{-\gamma/(2\gamma+dn)}$.

3 Proof of Proposition 1: Construction of Transformers

In Section 3.1, we present the proof of Proposition 1 along with the technical lemmas required during the proof. Sections 3.2 and 3.3 contain the proofs of the technical lemmas.

3.1 Proof of Proposition 1

Inspired by the approximation of highly smooth functions using FNNs [57, 31], the basic idea of proving Proposition 1 is to construct Transformers that approximate Taylor polynomials. Consider the set $\left\{\alpha \in \mathbb{N}_{\geq 0}^{d\times n} : \sum_{u=1}^{d}\sum_{v=1}^{n} \alpha_{uv} \leq s\right\}$.
Given that its cardinalit y is C dn − 1 s + dn − 1 , w e can rewrite it as n α 1 , · · · , α C dn − 1 s + dn − 1 o . F or eac h p ∈ [ d ] and q ∈ [ n ], let P j ( f pq ) = X i ∈ [ C dn − 1 s + dn − 1 ] c i,j ( f pq )  X − X ( j )  α i b e the s − 1 order T a ylor p olynomial of f pq at the grid p oin t X ( j ) := β j /K . Here we use the notation X α := Q d u =1 Q n v =1 x α uv uv . F or an y X ∈ Ω j with j ∈  K dn  , the standard T aylor remainder estimate giv es        f pq ( X ) − X i ∈ [ C dn − 1 s + dn − 1 ] c i,j ( f pq )  X − X ( j )  α i        ≤ C ( f , s, d, n ) 1 K s + λ . (8) 13 Based on this estimation, we can ahiev e the appro ximation of f pq once w e ac hiev e the appro ximation of T aylor p o lynomials P j ( f pq ). T o this end, we divide our pro of in to sev en steps. In step 1, we use the follo wing lemma to contruct a feedforw ard blo c k that maps Ω j to the grid p oin t X ( j ) . Lemma 6. Ther e exists a R eLU FNN function f ( dsc ) F F : R → R with width K , depth 3 and weight b ound 1 /δ such that for any x ∈  k K , k +1 K  with k ∈ { 0 , 1 , · · · , K − 1 } , f ( dsc ) F F ( x ) =  k K , x ∈  k K , k +1 − δ K  ; k +1 K − k +1 − K x K δ , x ∈  k +1 − δ K , k +1 K  . Pr o of. W e first construct a FNN function that appro ximate the step function 1 x ≥ 1 : g ( x ) := σ R  1 − σ R  − x δ + 1 δ  =    0 , x < 1 − δ ; 1 − 1 − x δ , 1 − δ ≤ x ≤ 1; 1 , x > 1 . Then f ( dsc ) F F is con tructed through a summation of a series of g that ha ve undergone translation and scaling transformations: f ( dsc ) F F ( x ) := 1 K K − 1 X k =0 g ( K x − k ) . In step 2, w e construct a T ransformer to map the grid p oin t X j to the T aylor co effi- cien ts c i,j , whic h is exactly a memorization task. The follo wing tok en wise separatedness assumption on the input sequences is common in the literature of T ransformer memo- rization [25, 22, 23]. Definition 2 (T oken wise separatedness) . L et N ∈ N ≥ 1 and r , ϕ ∈ R > 0 . L et X (1) , · · · , X ( N ) ∈ R d × n b e a set of N input se quenc es. Then, we say that n X ( i ) o i ∈ [ N ] ar e tokenwise ( r, ϕ ) - sep ar ate d if the fol lowing two c onditions ar e satisfie d: • F or any i ∈ [ N ] and k ∈ [ n ] ,    X ( i ) : ,k    2 ≤ r holds. • F or any i, j ∈ [ N ] and k , l ∈ [ n ] , either X ( i ) : ,k = X ( j ) : ,l or    X ( i ) : ,k − X ( j ) : ,l    2 ≥ ϕ holds. With this assumption, w e are able to construct a T ransformer realizing memorization. W e provide a detailed c haracterization of its structure in Lemma 7. Since this result is of independent interest, its pro of is presen ted in a separate section (Section 4). Lemma 7. L et N ∈ N ≥ 2 and r, ϕ, B y ∈ R > 0 . Supp ose r > ϕ . F or any N data p airs n X ( i ) , Y ( i ) o i ∈ [ N ] ⊂ R d × n × [ − B y , B y ] 1 × n such that the input se quenc es n X ( i ) o i ∈ [ N ] ar e distinct and ar e tokenwise ( r, ϕ ) -sep ar ate d, ther e eixsts a T r ansformer T ( mmr ) : R d × n → R 1 × n with size { (2 , max { d, 5 } ) , ((3 , 3) , (3 , 11)) × ( n − 1) times , (3 , 3) , (3 , max { 5 , nN − 1 } ) } and dimension ve ctor  d d 5 · 1 1 × n 1  and a p ositional enc o ding matrix E := 3 r √ d  1 d × 1 2 1 d × 1 · · · n 1 d × 1  ∈ R d × n such that 14 • T ( mmr )  X ( i )  = Y ( i ) , i ∈ [ N ] . In this c ase, the lab els n Y ( i ) o i ∈ [ N ] ne e d to satisfy the c onsistency c ondition: for any i, j ∈ [ N ] , k , l ∈ [ n ] , y ( i ) 1 ,k = y ( j ) 1 ,l if X ( i ) : ,k = X ( j ) : ,l and X ( i ) = X ( j ) up to p ermutations. 
• T ( mmr )  X ( i ) + E  = Y ( i ) , i ∈ [ N ] . In this c ase, the lab els n Y ( i ) o i ∈ [ N ] ne e d not to satisfy the c onsistency c ondition. The weight b ounds of T ( mmr ) ar e B F F = max { R, 2 B y } , B S A = 1 2 log(3 √ dπ n 4 (3 n +1) N 4 r ϕ − 1 ) , wher e R := ( √ 2 π dn 2 N 2 (3 n + 1) r ϕ − 1 + 1)  3 π 4 √ dn 3 N 4 (3 n + 1) r ϕ − 1 + 3 2  . F urthermor e, if X ∈ R d × n satisfies ∥ X : ,k ∥ 2 ≤ r for al l k ∈ [ n ] , then    T ( mmr ) ( X ) 1 ,k    ,    T ( mmr ) ( X + E ) 1 ,k    ≤ [4( nN − 1) R + 1] B y for al l k ∈ [ n ] . In step 3, w e construct a T ransformer to appro ximate the monomials  X − X ( j )  α i with the aid of the follo wing t w o lemmas, whose pro ofs are deferred to Sections 3.2 and 3.3, respectively . Lemma 8. F or any R eLU FNN f F F : R dn → R with depth L , width W and weight b ound B , ther e exists a T r ansformer T ( F F ) : R ( d + n ) × n → R 1 × n with size { (2 , 2 dn ) , (1 , dn ) , ( L, max { W, 2 dn } ) } and dimension ve ctor  d + n d + n 2 dn 1  such that for any X ∈ [0 , 1] d × n , T ( F F )  X I n × n − 1 n × n  = f F F  X ( f lt )  1 1 × n , wher e X ( f lt ) ∈ R dn is the flatten of X , define d in the fol lowing way: X ( f lt ) :=  x 11 . . . x 1 n x 21 . . . x 2 n . . . x d 1 . . . x dn  ⊤ . F urthermor e, B F F = max { B , 1 } , B S A = n . Lemma 9. L et d, s ∈ N ≥ 1 . L et α ∈ N d ≥ 0 and P d i =1 α i = ¯ α . F or any 0 < ϵ ≤ 3 ⌈ log 2 ¯ α ⌉ − 1 3 ⌈ log 2 ¯ α ⌉− 1 − 1 , ther e exists a R eLU FNN function f ( mnm ) F F : R d → R with width 21 · 2 ⌈ log 2 ¯ α ⌉− 1 , depth C ln 3 ⌈ log 2 ¯ α ⌉ − 1 2 ϵ and weight b ound C such that for any x ∈ [0 , 1] d ,    f ( mnm ) F F ( x ) − x α 1 1 x α 2 2 · · · x α d d    ≤ ϵ. 15 In step 4, we con truct a T ransformer to parallel c i,j and the appromants of  X − X ( j )  α i , follo wed by con tructing a feedforward blo c k that approximates the multiplications of c ij and  X − X ( j )  α i and sums them o ver i ∈  C dn − 1 s + dn − 1  in step 5. W e need the following result on the appro ximation of m ultiplication b y ReLU FNNs in step 5. Lemma 10 ([57], Prop osition 3) . L et B , B ′ ∈ R ≥ 1 . F or any ϵ > 0 , ther e exists a R eLU FNN function e × : R 2 → R with width 21 , depth C ( B ) ln 1 ϵ and weight b ound B 2 such that for any x ∈ [ − B , B ] 2 ,   e × ( x ) − x 1 x 2   ≤ ϵ. F urthermor e, for any x ∈ [ − B ′ , B ′ ] 2 ,   e × ( x )   ≤ max { 12 B 2 , 4 B B ′ } . Remark 1. A lthough [57] do es not explicitly char acterize the width, weight b ound, and magnitude of e × , these c an b e r e adily derive d fr om its pr o of and ar guments. Finally , in step 6 and 7, b y prop erly setting the v alues of parameters, w e estimate the appro ximation error on S j ∈ [ K dn ] Ω j and the magnitude of constructed T ransformer on Ω ( f law ) , resepctively . Figure 1 illustrates the flow of the proof for Proposition 1. X X ( j ) n X − X ( j )  α i o i ∈ [ C dn − 1 s + dn − 1 ] { c i,j } i ∈ [ C dn − 1 s + dn − 1 ] P j = P i ∈ [ C dn − 1 s + dn − 1 ] c i,j  X − X ( j )  α i Lemma 6 Lemma 8, 9 Lemma 7 Lemma 10 Figure 1: Illustration of the pro of pro cess In the pro of of Prop osition 1, w e also require the follo wing tw o trivial lemmas, which state that b oth the feedforw ard blo c k and the self-atten tion la yer can realize the iden tit y mapping. W e ma y sometimes use these t w o lemmas without explicit men tion, particularly when w e emplo y Lemma 3 to parallel tw o feedforw ard blocks of differen t depths. Lemma 11. 
Ther e exists a R eLU FNN function f ( idt ) F F : R → R with width 2 , depth 2 and weight b ound 1 such that for any x ∈ R , f ( idt ) F F ( x ) = x. 16 Pr o of. The result follo ws directly from the property x = σ R ( x ) − σ R ( − x ) , x ∈ R . Lemma 12. Ther e exists a self-attention layer F ( idt ) S A : R 1 × n → R 1 × n with he ad numb er 1 , he ad size 1 and weight b ound 1 such that for any x ∈ R 1 × n , F ( idt ) S A ( x ) = x . Pr o of. Setting W O = 0 yields the result. W e no w presen t the formal pro of of Proposition 1. Pr o of of Pr op osition 1. Step 1: Map Ω j to the grid p oin t X ( j ) = β j /K . Denote W E B :=  I d × d 0 ( n +1) × d  ∈ R ( d + n +1) × d , B E B :=   0 d × n I n × n − 1 n × n e B E B   ∈ R ( d + n +1) × n , where e B E B :=  3 6 · · · 3 n  ∈ R 1 × n . The em b edding lay er F E B : R d × n → R ( d + n +1) × n is defined to b e F E B ( X ) := W E B X + B E B =   X I n × n − 1 n × n e B E B   ∈ R ( d + n +1) × n . Let F ( dsc ) F F : R 1 × n → R 1 × n b e the feedforward blo c k generated by the ReLU FNN function f ( dsc ) F F in Lemma 6. Let F ( prl − dsc ) F F : R ( d + n +1) × n → R d × n b e the parallelization of F ( dsc ) F F : F ( prl − dsc ) F F ( Y ) :=      F ( dsc ) F F ( Y 1 , : ) F ( dsc ) F F ( Y 2 , : ) . . . F ( dsc ) F F ( Y d, : )      , Y ∈ R ( d + n +1) × n . It follows that F ( prl − dsc ) F F is of width max { dK, d + n + 1 } , depth 3 and w eigh t b ound 1 /δ , and F ( prl − dsc ) F F ◦ F E B ( X ) = X ( j ) , X ∈ Ω j . Step 2: Map X j to the T a ylor co efficien ts c i,j . Applying Lemma 7 with param ters r = √ d, ϕ = 1 K , N = K dn , B y = C ( f ) therein, w e can find T ransformers n T ( mmr ) i,k o i ∈ [ C dn − 1 s + dn − 1 ] ,k ∈ [ d ] with size L 0 = 2 , W 0 = max { d, 5 } ; 17 H l = 3 , S l = 3; l = 1 , · · · , n ; L l = 3 , W l = 11; l = 1 , · · · , n − 1; L n = 3 , W n = max  5 , nK dn − 1  and dimension vector  d d 5 · 1 1 × n 1  suc h that for j ∈  K dn  , T ( mmr ) i,k  X ( j ) + 1 d × 1 e B E B  :=  c i,j ( f k 1 ) c i,j ( f k 2 ) · · · c i,j ( f kn )  , i ∈ [ C dn − 1 s + dn − 1 ] , k ∈ [ d ] . The w eight b ounds of T ( mmr ) i,k are B F F = C ( f ) d 2 n 7 K 6 dn +2 , B S A = 1 2 log(3 dπ n 4 (3 n + 1) K 4 dn +1 ). Based on Lemma 3, we can construct a T ransformer T ( prl − mmr ) : R d × n → R dC dn − 1 s + dn − 1 × n with with size L 0 = 2 , W 0 = dC dn − 1 s + dn − 1 max { d, 5 } ; H l = 3 dC dn − 1 s + dn − 1 , S l = 3; l = 1 , · · · , n ; L l = 3 , W l = 11 dC dn − 1 s + dn − 1 ; l = 1 , · · · , n − 1; L n = 3 , W n = dC dn − 1 s + dn − 1 max  5 , nK dn − 1  and dimension vector  d d 5 dC dn − 1 s + dn − 1 · 1 1 × n dC dn − 1 s + dn − 1  suc h that for j ∈  K dn  , T ( prl − mmr )  X ( j ) + 1 d × 1 e B E B  :=                          T ( mmr ) 1 , 1  X ( j ) + 1 d × 1 e B E B  . . . T ( mmr ) 1 ,d  X ( j ) + 1 d × 1 e B E B  T ( mmr ) 2 , 1  X ( j ) + 1 d × 1 e B E B  . . . T ( mmr ) 2 ,d  X ( j ) + 1 d × 1 e B E B  . . . T ( mmr ) C dn − 1 s + dn − 1 , 1  X ( j ) + 1 d × 1 e B E B  . . . T ( mmr ) C dn − 1 s + dn − 1 ,d  X ( j ) + 1 d × 1 e B E B                           =                     c 1 ,j ( f 11 ) c 1 ,j ( f 12 ) · · · c 1 ,j ( f 1 n ) . . . . . . . . . . . . c 1 ,j ( f d 1 ) c 1 ,j ( f d 2 ) · · · c 1 ,j ( f dn ) c 2 ,j ( f 11 ) c 2 ,j ( f 12 ) · · · c 2 ,j ( f 1 n ) . . . . . . . . . . . . c 2 ,j ( f d 1 ) c 2 ,j ( f d 2 ) · · · c 2 ,j ( f dn ) . . . . . . . . . 
c C dn − 1 s + dn − 1 ,j ( f 11 ) c C dn − 1 s + dn − 1 ,j ( f 12 ) · · · c C dn − 1 s + dn − 1 ,j ( f 1 n ) . . . . . . . . . . . . c C dn − 1 s + dn − 1 ,j ( f d 1 ) c C dn − 1 s + dn − 1 ,j ( f d 2 ) · · · c C dn − 1 s + dn − 1 ,j ( f dn )                     . 18 The w eight b ounds of T ( prl − mmr ) are B F F = C ( f ) d 2 n 7 K 6 dn +2 , B S A = 1 2 log(3 dπ n 4 (3 n + 1) K 4 dn +1 ). Step 3: Approximate the monomials  X − X ( j )  α i . F or X ∈ Ω j with some j ∈  K dn  , denote ¯ X := X − X ( j ) . Let 0 < ϵ 1 ≤ 3 ⌈ log 2 ( s − 1) ⌉ − 1 3 ⌈ log 2 ( s − 1) ⌉− 1 − 1 b e some accuracy to b e determined later. By Lemma 9, there exists ReLU FNN functions n f ( mnm ) F F ,i o i ∈ [ C dn − 1 s + dn − 1 ] with width 21 · 2 ⌈ log 2 ( s − 1) ⌉− 1 , depth C ln 3 ⌈ log 2 ( s − 1) ⌉ − 1 2 ϵ 1 and w eigh t bound C suc h that    f ( mnm ) F F ,i  ¯ X ( f lt )  − ¯ X α i    ≤ ϵ 1 , i ∈  C dn − 1 s + dn − 1  . (9) By Lemma 8, there exists T ransformers n T ( mnm ) i o i ∈ [ C dn − 1 s + dn − 1 ] with size  (2 , 2 dn ) , (1 , dn ) ,  C ln 3 ⌈ log 2 ( s − 1) ⌉ − 1 2 ϵ 1 , max  21 · 2 ⌈ log 2 ( s − 1) ⌉− 1 , 2 dn   and dimension vector  d + n d + n 2 dn 1  suc h that for i ∈  C dn − 1 s + dn − 1  , T ( mnm ) i  ¯ X I n × n − 1 n × n  = f ( mnm ) F F ,i  ¯ X ( f lt )  1 1 × n . The weigh t bounds of T ( mnm ) i are B F F = C , B S A = 1. Based on Lemma 3, there exists a T ransformer T ( prl − mnm ) : R ( d + n ) × n → R C dn − 1 s + dn − 1 × n with size   2 , 2 dnC dn − 1 s + dn − 1  ,  C dn − 1 s + dn − 1 , dn  ,  C ln 3 ⌈ log 2 ( s − 1) ⌉ − 1 2 ϵ 1 , max  21 · 2 ⌈ log 2 ( s − 1) ⌉− 1 , 2 dn  C dn − 1 s + dn − 1  and dimension vector  d + n d + n 2 dnC dn − 1 s + dn − 1 C dn − 1 s + dn − 1  suc h that T ( prl − mnm ) ( Y ) :=       T ( mnm ) 1 ( Y ) T ( mnm ) 2 ( Y ) . . . T ( mnm ) C dn − 1 s + dn − 1 ( Y )       , Y ∈ R ( d + n ) × n . It follo ws that T ( prl − mnm )  ¯ X I n × n − 1 n × n  =         f ( mnm ) F F , 1  ¯ X ( f lt )  f ( mnm ) F F , 2  ¯ X ( f lt )  . . . f ( mnm ) F F ,C dn − 1 s + dn − 1  ¯ X ( f lt )          1 1 × n . The w eigh t bounds of T ( prl − mnm ) are B F F = C , B S A = 1. 19 Step 4: Parallel c i,j and the approman ts of  X − X ( j )  α i . F rom the result of step 1, w e can construct a feedforward block F ( mid ) F F : R ( d + n +1) × n → R (2 d + n +1) × n as F ( mid ) F F ( Y ) :=      F ( idt ) F F ( Y 1: d, : ) − F ( prl − drc ) F F ( Y 1: d, : ) F ( idt ) F F ( Y ( d +1):( d + n ) , : ) F ( prl − drc ) F F ( Y 1: d, : ) F ( idt ) F F ( Y d + n +1 , : )      , Y ∈ R ( d + n +1) × n , where F ( idt ) F F is the identical mapping. It follo ws that for X ∈ Ω j with some j ∈ [ K dn ], F ( mid ) F F ◦ F E B ( X ) =     X − X ( j ) I n × n − 1 n × n X ( j ) e B E B     . Based on Lemma 11 and the structure of F ( drc ) F F constructed in step 1, it is not hard to see that F ( prl − drc ) F F is of depth 3, width ( K + 2) d + 2 n + 2 and weigh t b ound max { 1 /δ, 1 } . 
Lemma 3 ensures the existence of a T ransformer T ( prl ) : R (2 d + n +1) × n → R ( d +1) C dn − 1 s + dn − 1 × n with size L 0 = 2 , W 0 = dC dn − 1 s + dn − 1 [max { d + 1 , 5 } + 2 n ] ; H 1 = (3 d + 1) C dn − 1 s + dn − 1 , S 1 = max { dn, 3 } ; L l = 3 , W l = dC dn − 1 s + dn − 1 (4 n + 11); l = 1 , · · · , n − 1; H l = 3 dC dn − 1 s + dn − 1 , S l = 3; l = 2 , · · · , n ; L n = C ln 3 ⌈ log 2 ( s − 1) ⌉ − 1 2 ϵ 1 , W n = dC dn − 1 s + dn − 1 ( nK dn − 1) + max  21 · 2 ⌈ log 2 ( s − 1) ⌉− 1 , 2 dn  C dn − 1 s + dn − 1 and dimension vector  2 d + n + 1 2 d + n + 1 (2 dn + 5 d ) C dn − 1 s + dn − 1 · 1 1 × n ( d + 1) C dn − 1 s + dn − 1  suc h that T ( prl ) ( Y ) :=  T ( prl − mnm )  Y 1:( d + n ) , :  T ( prl − mmr )  W 1 Y ( d + n +1):(2 d + n +1) , :   , Y ∈ R (2 d + n +1) × n , where W 1 :=  I d × d 1 d × 1  . The w eight b ounds of T ( prl ) are B F F = C ( f ) d 2 n 7 K 6 dn +2 , B S A = 1 2 log(3 dπ n 4 (3 n + 1) K 4 dn +1 ). F or X ∈ Ω j with some j ∈  K dn  , T ( prl ) ◦ F ( mid ) F F ◦ F E B ( X ) =    T ( prl − mnm )  X − X ( j ) I n × n − 1 n × n  T ( prl − mmr )  X ( j ) + 1 d × 1 e B E B     . 20 It is w orth men tioning that the last feedforw ard blo c k of T ( prl ) has only dC dn − 1 s + dn − 1 ( nK dn − 1) + max  21 · 2 ⌈ log 2 ( s − 1) ⌉− 1 , 2 dn  C dn − 1 s + dn − 1 neurons in the final lay er, while the n umber of neurons in each of the other lay ers do es not exceed 10 dC dn − 1 s + dn − 1 +max  21 · 2 ⌈ log 2 ( s − 1) ⌉− 1 , 2 dn  C dn − 1 s + dn − 1 . Step 5: Approximate the m ultiplications of c ij and  X − X ( j )  α i and sum them ov er i . W e apply Lemma 10 to approximate the m ultiplications of c ij and f ( mnm ) F F ,i  ¯ X ( f lt )  (the appro ximan t of  X − X ( j )  α i ). Since | c ij | ≤ C ( f ) and    f ( mnm ) F F ,i  ¯ X ( f lt )     ≤ 2 (assuming ϵ 1 < 1), we choose B in Lemma 10 to b e C ( f ). Then according to Lemma 10, for i = j d + k ∈  dC dn − 1 s + dn − 1  with some 0 ≤ j ≤ C dn − 1 s + dn − 1 − 1 and 1 ≤ k ≤ d , there exists a ReLU FNN f ( mtp − two ) F F ,i : R ( d +1) C dn − 1 s + dn − 1 → R with width max  21 , ( d + 1) C dn s + dn  , depth C ( f ) ln 1 ϵ 2 and w eigh t bound C ( f ) suc h that for an y x ∈ [ − C ( f ) , C ( f )] ( d +1) C dn s + dn , f ( mtp − two ) F F ,i ( x ) := e ×  x j +1 , x C dn − 1 s + dn − 1 + i  and    f ( mtp − two ) F F ,i ( x ) − x j +1 x C dn − 1 s + dn − 1 + i    ≤ ϵ 2 , (10) where ϵ 2 will b e determined later. Let F ( mtp − two ) F F ,i : R ( d +1) C dn − 1 s + dn − 1 × n → R 1 × n b e the feedforw ard blo c k generated from f ( mtp − two ) F F ,i . By Lemma 3, there exists a feedforward blo c k F ( prl − mtp ) F F : R ( d +1) C dn − 1 s + dn − 1 × n → R dC dn − 1 s + dn − 1 × n with width 21 dC dn − 1 s + dn − 1 , depth C ( f ) ln 1 ϵ 2 and w eigh t bound C ( f ) suc h that F ( prl − mtp ) F F ( Y ) :=  F ( mtp − two ) F F , 1 ( Y ) · · · F ( mtp − two ) F F ,dC dn − 1 s + dn − 1 ( Y )  ⊤ , Y ∈ R ( d +1) C dn − 1 s + dn − 1 × n . Then for X ∈ Ω j with some j ∈  K dn  , F ( prl − mtp ) F F ◦ T ( prl ) ◦ F ( mid ) F F ◦ F E B ( X ) =                            e ×  c 1 ,j ( f 11 ) , f ( mnm ) 1  e ×  c 1 ,j ( f 12 ) , f ( mnm ) 1  · · · e ×  c 1 ,j ( f 1 n ) , f ( mnm ) 1  . . . . . . . . . . . . 
e ×  c 1 ,j ( f d 1 ) , f ( mnm ) 1  e ×  c 1 ,j ( f d 2 ) , f ( mnm ) 1  · · · e ×  c 1 ,j ( f dn ) , f ( mnm ) 1  e ×  c 2 ,j ( f 11 ) , f ( mnm ) 2  e ×  c 2 ,j ( f 12 ) , f ( mnm ) 2  · · · e ×  c 2 ,j ( f 1 n ) , f ( mnm ) 2  . . . . . . . . . . . . e ×  c 2 ,j ( f d 1 ) , f ( mnm ) 2  e ×  c 2 ,j ( f d 2 ) , f ( mnm ) 2  · · · e ×  c 2 ,j ( f dn ) , f ( mnm ) 2  . . . . . . . . . e ×  c C dn − 1 s + dn − 1 ,j ( f 11 ) , f ( mnm ) C dn s + dn  e ×  c C dn − 1 s + dn − 1 ,j ( f 12 ) , f ( mnm ) C dn − 1 s + dn − 1  · · · e ×  c C dn − 1 s + dn − 1 ,j ( f 1 n ) , f ( mnm ) C dn − 1 s + dn − 1  . . . . . . . . . . . . e ×  c C dn − 1 s + dn − 1 ,j ( f d 1 ) , f ( mnm ) C dn − 1 s + dn − 1  e ×  c C dn − 1 s + dn − 1 ,j ( f d 2 ) , f ( mnm ) C dn − 1 s + dn − 1  · · · e ×  c C dn − 1 s + dn − 1 ,j ( f dn ) , f ( mnm ) C dn − 1 s + dn − 1                             , where f ( mnm ) i is short for f ( mnm ) F F ,i  ¯ X ( f lt )  . Denoting W sum :=  I d × d I d × d · · · I d × d  ∈ R d × dC dn − 1 s + dn − 1 , 21 then for X ∈ Ω j with some j ∈  K dn  , w e ha ve W ( sum ) F ( prl − mtp ) F F ◦ T ( prl ) ◦ F ( mid ) F F ◦ F E B ( X ) =          P C dn − 1 s + dn − 1 i =1 e ×  c i,j ( f 11 ) , f ( mnm ) i  P C dn − 1 s + dn − 1 i =1 e ×  c i,j ( f 12 ) , f ( mnm ) i  · · · P C dn − 1 s + dn − 1 i =1 e ×  c i,j ( f 1 n ) , f ( mnm ) i  P C dn − 1 s + dn − 1 i =1 e ×  c i,j ( f 21 ) , f ( mnm ) i  P C dn − 1 s + dn − 1 i =1 e ×  c i,j ( f 22 ) , f ( mnm ) i  · · · P C dn − 1 s + dn − 1 i =1 e ×  c i,j ( f 2 n ) , f ( mnm ) i  . . . . . . . . . . . . P C dn − 1 s + dn − 1 i =1 e ×  c i,j ( f d 1 ) , f ( mnm ) i  P C dn − 1 s + dn − 1 i =1 e ×  c i,j ( f d 2 ) , f ( mnm ) i  · · · P C dn − 1 s + dn − 1 i =1 e ×  c i,j ( f dn ) , f ( mnm ) i           . The T ransformer T w e are going to find is exactly defined as T ( X ) := W ( sum ) F ( prl − mtp ) F F ◦ T ( prl ) ◦ F ( mid ) F F ◦ F E B ( X ) . Its size is L 0 = 4 , W 0 = max { ( K + 2) d + 2 n + 2 , dC dn − 1 s + dn − 1 [max { d + 1 , 5 } + 2 n ] } ; H 1 = (3 d + 1) C dn − 1 s + dn − 1 , S 1 = max { dn, 3 } ; L l = 3 , W l = dC dn − 1 s + dn − 1 (4 n + 11); l = 1 , · · · , n − 1; H l = 3 dC dn − 1 s + dn − 1 , S l = 3; l = 2 , · · · , n ; L n = C ln 3 ⌈ log 2 ( s − 1) ⌉ − 1 2 ϵ 1 + C ( f ) ln 1 ϵ 2 , W n = dC dn − 1 s + dn − 1 ( nK dn − 1) + max  21 · 2 ⌈ log 2 ( s − 1) ⌉− 1 , 2 dn, 21 d  C dn − 1 s + dn − 1 and its dimension v ector is  d d + n + 1 (2 dn + 5 d ) C dn − 1 s + dn − 1 · 1 1 × n d  . The w eigh t b ounds of T are B F F = max  C ( f ) d 2 n 7 K 6 dn +2 , 1 /δ  , B S A = 1 2 log(3 dπ n 4 (3 n + 1) K 4 dn +1 ). It is w orth mentioning that the last feedforw ard blo c k of T has only dC dn − 1 s + dn − 1 ( nK dn − 1) + max  21 · 2 ⌈ log 2 ( s − 1) ⌉− 1 , 2 dn, 21 d  C dn − 1 s + dn − 1 neurons in a certain lay er, while the n um b er of neurons in eac h of the other lay ers do es not exceed 10 dC dn − 1 s + dn − 1 + max  21 · 2 ⌈ log 2 ( s − 1) ⌉− 1 , 2 dn, 21 d  C dn − 1 s + dn − 1 . Step 6: Estimate the approximation error on S j ∈ [ K dn ] Ω j . Let p ∈ [ d ] , q ∈ [ n ]. 
F or X ∈ Ω j with some j ∈  K dn  , combining (8)(9)(10) and letting K = Θ  ϵ − 1 /γ  , ϵ 1 = Θ ( ϵ ) , ϵ 2 = Θ ( ϵ ) yields | T pq ( X ) − f pq ( X ) | =       C dn − 1 s + dn − 1 X i =1 e ×  c i,j ( f pq ) , f ( mnm ) i  ¯ X ( f lt )  − f pq ( X )       =       C dn − 1 s + dn − 1 X i =1 e ×  c i,j ( f pq ) , f ( mnm ) i  ¯ X ( f lt )  − C dn − 1 s + dn − 1 X i =1 c i,j ( f pq ) f ( mnm ) i  ¯ X ( f lt )        22 +       C dn − 1 s + dn − 1 X i =1 c i,j ( f pq ) f ( mnm ) i  ¯ X ( f lt )  − C dn − 1 s + dn − 1 X i =1 c i,j ( f pq ) ¯ X α i       +       C dn − 1 s + dn − 1 X i =1 c i,j ( f pq ) ¯ X α i − f pq ( X )       ≤ C dn − 1 s + dn − 1 X i =1    e ×  c i,j ( f pq ) , f ( mnm ) i  ¯ X ( f lt )  − c i,j ( f pq ) f ( mnm ) i  ¯ X ( f lt )     + C dn − 1 s + dn − 1 X i =1 | c i,j ( f pq ) |    f ( mnm ) i  ¯ X ( f lt )  − ¯ X α i    +       C dn s + dn X i =1 c i,j ( f pq ) ¯ X α i − f pq ( X )       ≤ ϵ 3 + ϵ 3 + ϵ 3 = ϵ. Step 7: Estimate the magnitude of T on Ω ( f law ) . F or p ∈ [ d ] , q ∈ [ n ] and X ∈ Ω ( f law ) , b y the ab o ve construction, w e ha v e T pq ( X ) = C dn − 1 s + dn − 1 X i =1 e ×  h T ( mmr ) i,p  F ( prl − dsc ) F F ( X ) i 1 ,q , h T ( mnm ) i  X − F ( prl − dsc ) F F ( X ) i 1 ,q  . (11) Applying Lemma 7 with param ters r = √ d, δ = 1 K , N = K dn , B y = C ( f ) therein, we ha ve     h T ( mmr ) i,p  F ( prl − dsc ) F F ( X ) i 1 ,q     ≤ C ( f ) d 2 n 8 K 7 dn +2 . Noting that     h T ( mnm ) i  X − F ( prl − dsc ) F F ( X ) i 1 ,q     ≤ 2 and hence applying Lemma 10 with B ′ = C ( f ) d 2 n 8 K 7 dn +2 , w e ha ve     e ×  h T ( mmr ) i,p  F ( prl − dsc ) F F ( X ) i 1 ,q , h T ( mnm ) i  X − F ( prl − dsc ) F F ( X ) i 1 ,q      ≤ C ( f ) d 2 n 8 K 7 dn +2 . Plugging the ab o ve estimate in to (11) yields | T pq ( X ) | ≤ C ( f ) C dn − 1 s + dn − 1 d 2 n 8 K 7 dn +2 . 23 3.2 Pro of of Lemma 8 Denote f X :=  X I n × n − 1 n × n  . Define F ( F F ) F F , 1 : R ( d + n ) × n → R 2 dn × n as F ( F F ) F F , 1 ( Y ) := σ R ( W 1 Y ) , Y ∈ R ( d + n ) × n with W 1 :=        1 n × 1 0 n × 1 · · · 0 n × 1 I n × n 0 n × 1 1 n × 1 · · · 0 n × 1 I n × n . . . . . . . . . . . . . . . 0 n × 1 0 n × 1 · · · 1 n × 1 I n × n 0 dn × ( d + n )        ∈ R 2 dn × ( d + n ) . It can b e c hec ked that F ( F F ) F F , 1  f X  = σ R  W 1 f X  =        diag( x 11 , x 12 , . . . , x 1 n ) diag( x 21 , x 22 , . . . , x 2 n ) . . . diag( x d 1 , x d 2 , . . . , x dn ) 0 dn × n        ∈ R 2 dn × n , where diag( x i 1 , x i 2 , . . . , x in ) :=      x i 1 0 · · · 0 0 x i 2 · · · 0 . . . . . . . . . . . . 0 0 · · · x in      . Define the matrices in the self-attention la y er to be W O :=  0 dn × dn n I dn × dn  , W V :=  I dn × dn 0 dn × dn  , W K := W Q := 0 1 × 2 dn . It follo ws that the output of the softmax function is 1 n 1 n × n and hence F ( F F ) S A ◦ F ( F F ) F F , 1  f X  =        diag( x 11 , x 12 , . . . , x 1 n ) diag( x 21 , x 22 , . . . , x 2 n ) . . . diag( x d 1 , x d 2 , . . . , x dn ) 0 dn × n        + 1 n  0 dn × dn n I dn × dn   I dn × dn 0 dn × dn         diag( x 11 , x 12 , . . . , x 1 n ) diag( x 21 , x 22 , . . . , x 2 n ) . . . diag( x d 1 , x d 2 , . . . , x dn ) 0 dn × n        1 n × n =        diag( x 11 , x 12 , . . . , x 1 n ) diag( x 21 , x 22 , . . . , x 2 n ) . . . diag( x d 1 , x d 2 , . . . 
, x dn ) X ( f lt ) 1 1 × n        ∈ R 2 dn × n . 24 Denote W 2 :=  0 dn × dn I dn × dn  and let F ( F F ) F F , 2 ( Y ) := F F F ( W 2 Y ) , Y ∈ R 2 dn × n , where F F F is the feedforward blo c k generated from f F F . Then the T ransformer T ( F F ) is defined as T ( F F )  f X  := F ( F F ) F F , 2 ◦ F ( F F ) S A ◦ F ( F F ) F F , 1  f X  = F F F         0 dn × dn I dn × dn         diag( x 11 , x 12 , . . . , x 1 n ) diag( x 21 , x 22 , . . . , x 2 n ) . . . diag( x d 1 , x d 2 , . . . , x dn ) X ( f lt ) 1 1 × n               = F F F  X ( f lt ) 1 1 × n  = f F F  X ( f lt )  1 1 × n . 3.3 Pro of of Lemma 9 Lemma 13. L et d ∈ N > 1 . F or any 0 < ϵ ≤ 3 ⌈ log 2 d ⌉ − 1 3 ⌈ log 2 d ⌉− 1 − 1 , ther e exists a R eLU FNN function f ( mtp ) F F ,d : R d → R with width 21 · 2 ⌈ log 2 d ⌉− 1 , depth C ln 3 ⌈ log 2 d ⌉ − 1 2 ϵ and weight b ound C such that for any x ∈ [0 , 1] d ,    f ( mtp ) F F ,d ( x ) − x 1 · · · x d    ≤ ϵ. Pr o of. Assume 2 e d − 1 < d ≤ 2 e d for some e d ∈ N ≥ 1 . W e first adopt a linear mapping to expand x into e x :=  x 1 2 e d − d  ∈ R 2 e d . Without loss of generality , in the following w e simply assume d = 2 e d . f ( mtp ) F F , 2 e d is recursviely defined b y f ( mtp ) F F , 2 e d ( x ) := e ×  f ( mtp ) F F , 2 e d − 1 ( x 1 ) , f ( mtp ) F F , 2 e d − 1 ( x 2 )  , x ∈ [0 , 1] 2 e d , (12) where e × is the FNN function in Lemma 10 that ac hiev es appro ximation accuracy e ϵ and x 1 =  x 1 , · · · , x 2 e d − 1  , x 2 =  x 2 e d − 1 +1 , · · · , x 2 e d  . W e show by induction on e d that the ReLU FNN function f ( mtp ) F F , 2 e d defined in (12) is of width 21 · 2 e d − 1 , depth C e d ln 1 e ϵ and w eigh t bound C , and    f ( mtp ) F F , 2 e d ( x ) − x 1 · · · x 2 e d    ≤ 1 2  3 e d − 1  e ϵ. 25 The case of e d = 1 is v erified by Lemma 10. No w w e assume f ( mtp ) F F , 2 e d − 1 is of width 21 · 2 e d − 2 , depth C ( e d − 1) ln 1 e ϵ and w eigh t bound C , and for an y y ∈ [0 , 1] 2 e d − 1 ,    f ( mtp ) F F , 2 e d − 1 ( y ) − y 1 · · · y 2 e d − 1    ≤ 1 2  3 e d − 1 − 1  e ϵ. Supp osing e ϵ ≤ 2 3 e d − 1 − 1 , w e ha ve    f ( mtp ) F F , 2 e d − 1 ( y )    ≤    f ( mtp ) F F , 2 e d − 1 ( y ) − y 1 · · · y 2 e d − 1    +   y 1 · · · y 2 e d − 1   ≤ 1 2  3 e d − 1 − 1  e ϵ + 1 ≤ 2 . By (12), f ( mtp ) F F , 2 e d is of width 21 · 2 e d − 1 , depth C e d ln 1 e ϵ and weigh t b ound C . F urthermore, for an y x ∈ [0 , 1] 2 e d ,    f ( mtp ) F F , 2 e d ( x ) − x 1 · · · x 2 e d    ≤    f ( mtp ) F F , 2  f ( mtp ) F F , 2 e d − 1 ( x 1 ) , f ( mtp ) F F , 2 e d − 1 ( x 2 )  − f ( mtp ) F F , 2 e d − 1 ( x 1 ) f ( mtp ) F F , 2 e d − 1 ( x 2 )    +    f ( mtp ) F F , 2 e d − 1 ( x 1 ) f ( mtp ) F F , 2 e d − 1 ( x 2 ) − x 1 · · · x 2 e d − 1 · f ( mtp ) F F , 2 e d − 1 ( x 2 )    +    x 1 · · · x 2 e d − 1 · f ( mtp ) F F , 2 e d − 1 ( x 2 ) − x 1 · · · x 2 e d    ≤    f ( mtp ) F F , 2  f ( mtp ) F F , 2 e d − 1 ( x 1 ) , f ( mtp ) F F , 2 e d − 1 ( x 2 )  − f ( mtp ) F F , 2 e d − 1 ( x 1 ) f ( mtp ) F F , 2 e d − 1 ( x 2 )    + 2    f ( mtp ) F F , 2 e d − 1 ( x 1 ) − x 1 · · · x 2 e d − 1    +    f ( mtp ) F F , 2 e d − 1 ( x 2 ) − x 2 e d − 1 +1 · · · x 2 e d    ≤ e ϵ + 2 · 1 2  3 e d − 1 − 1  e ϵ + 1 2  3 e d − 1 − 1  e ϵ = 1 2  3 e d − 1  e ϵ, where in the third step we use Lemma 10. The proof is completed by setting e ϵ = 2 3 e d − 1 ϵ . Pr o of of L emma 9. 
The first hidden layer is used to transform $x$ into
$$\tilde{x} := (\underbrace{x_1,\dots,x_1}_{\alpha_1},\underbrace{x_2,\dots,x_2}_{\alpha_2},\dots,\underbrace{x_d,\dots,x_d}_{\alpha_d})^{\top} \in \mathbb{R}^{s}.$$
According to Lemma 11, the number of neurons in this layer is $2\bar{\alpha}$. From Lemma 13, we can find a ReLU FNN function $f^{(\mathrm{mtp})}_{FF,s}:\mathbb{R}^{\bar{\alpha}}\to\mathbb{R}$ with width $21\cdot 2^{\lceil\log_2\bar{\alpha}\rceil-1}$, depth $C\ln\frac{3^{\lceil\log_2\bar{\alpha}\rceil}-1}{2\epsilon}$ and weight bound $C$ such that
$$\Bigl|f^{(\mathrm{mtp})}_{FF,s}(\tilde{x}) - x_1^{\alpha_1}x_2^{\alpha_2}\cdots x_d^{\alpha_d}\Bigr| \le \epsilon.$$

4 Proof of Lemma 7: Memorization of Transformers

Following the line of research on Transformer memorization [59, 25, 22, 23], we first prove that the Transformer can achieve a contextual mapping (Lemma 14), and then associate the resulting context ids with the corresponding labels via a ReLU FNN (Lemma 15), thereby realizing the memorization task. This section is divided into two subsections: in Section 4.1, we prove Lemma 7 based on Lemmas 14 and 15. Since the proof of Lemma 14 is lengthy, we defer it to Section 4.2.

4.1 Proof of Lemma 7

The contextual mapping defined below assigns a unique id to each token in the input sequence.

Definition 3 (Contextual mapping). Let $N\in\mathbb{N}_{\ge 1}$ and $r,\phi\in\mathbb{R}_{>0}$. Let $X^{(1)},\dots,X^{(N)}\in\mathbb{R}^{d\times n}$ be a set of $N$ input sequences. A map $A:\mathbb{R}^{d\times n}\to\mathbb{R}^{1\times n}$ is called an $(r,\phi)$-contextual mapping if the following two conditions hold:
- For any $i\in[N]$ and $k\in[n]$, $\bigl|A(X^{(i)})_{1,k}\bigr|\le r$.
- For any $i,j\in[N]$ and $k,l\in[n]$ such that $X^{(i)}_{:,k}\ne X^{(j)}_{:,l}$ or $X^{(i)}\ne X^{(j)}$ up to permutations, $\bigl|A(X^{(i)})_{1,k}-A(X^{(j)})_{1,l}\bigr|\ge\phi$.

In particular, $A(X^{(i)})_{1,k}$ is called a context id of the $k$-th token in $X^{(i)}$.

The following lemma shows that Transformers can realize contextual mappings (in fact, this is one of the key reasons for their tremendous success in natural language processing and other domains). Its proof is presented in Section 4.2.

Lemma 14. Let $d,n,N\in\mathbb{N}_{\ge 1}$ and $r,\phi\in\mathbb{R}_{>0}$. Let $X^{(1)},\dots,X^{(N)}\in\mathbb{R}^{d\times n}$ be a set of $N$ input sequences that are tokenwise $(r,\phi)$-separated. There exists a Transformer $T^{(\mathrm{cm})}:\mathbb{R}^{d\times n}\to\mathbb{R}^{1\times n}$ with size $\{(2,\max\{d,5\}),\ ((3,3),(3,11))\times(n-1)\ \text{times},\ (3,3),\ (2,5)\}$ and dimension vector $(d,\ d,\ 5\cdot 1_{1\times n},\ 1)$ such that it is an $(R,2)$-contextual mapping with
$$R := \bigl(\sqrt{2\pi d}\,n^2N^2r\phi^{-1}+1\bigr)\Bigl(\tfrac{3\pi}{4}\sqrt{d}\,n^3N^4r\phi^{-1}+\tfrac{3}{2}\Bigr).$$
The weight bounds of $T^{(\mathrm{cm})}$ are $B_{FF}=\max\bigl\{\sqrt{2}\,n^2N^2\sqrt{\pi d}\,r\phi^{-1}+1,\ \tfrac{3\sqrt{2}}{8}N^2\sqrt{\pi n}\bigr\}$ and $B_{SA}=\tfrac{1}{2}\log\bigl(3\sqrt{d\pi}\,n^4N^4r\phi^{-1}\bigr)$. Furthermore, if $X\in\mathbb{R}^{d\times n}$ satisfies $\|X_{:,k}\|_2\le r$ for all $k\in[n]$, then $\bigl|T^{(\mathrm{cm})}(X)_{1,k}\bigr|\le R$ for all $k\in[n]$.

Lemma 15. Let $N\in\mathbb{N}_{\ge 2}$ and $\phi,B_x,B_y\in\mathbb{R}_{>0}$. For any $N$ data pairs $\{(x_i,y_i)\}_{i\in[N]}\subset[-B_x,B_x]\times[-B_y,B_y]$ satisfying $|x_i-x_j|\ge\phi$ for any $i,j\in[N]$ with $i\ne j$, there exists a ReLU FNN function $f^{(\mathrm{mmr})}_{FF}:\mathbb{R}\to\mathbb{R}$ with width $N-1$, depth $2$ and weight bound $\max\{1,B_x,B_y,4B_y/\phi\}$ such that
$$f^{(\mathrm{mmr})}_{FF}(x_i)=y_i,\quad i\in[N]. \qquad (13)$$
Furthermore, $\bigl|f^{(\mathrm{mmr})}_{FF}(x)\bigr|\le \frac{8(N-1)B_xB_y}{\phi}+B_y$ for any $x\in[-B_x,B_x]$.

Proof. Without loss of generality, we assume $x_1<x_2<\cdots<x_N$.
f ( mmr ) F F is defined as f ( mmr ) F F ( x ) := W 2 σ R ( W 1 x + b 1 ) + b 2 with W 1 := 1 N − 1 ∈ R N − 1 , b 1 := ( − x 1 , − x 2 , · · · , − x N − 1 ) ⊤ ∈ R N − 1 , b 2 := y 1 ∈ R and W 2 :=  y 2 − y 1 x 2 − x 1 , y 3 − y 2 x 3 − x 2 − y 2 − y 1 x 2 − x 1 , · · · , y N − y N − 1 x N − x N − 1 − y N − 1 − y N − 2 x N − 1 − x N − 2  ∈ R 1 × ( N − 1) . It follo ws that f ( mmr ) F F ( x i ) = W 2 σ R ( W 1 x i + b 1 ) + b 2 = y 2 − y 1 x 2 − x 1 · ( x i − x 1 ) + i − 1 X j =2  y j +1 − y j x j +1 − x j − y j − y j − 1 x j − x j − 1  · ( x i − x j ) + y 1 . If we express the right-hand side of the ab o ve equation as a linear com bination of { y j } j ∈ [ N ] , w e can v erify that the coefficient of y i is 1 while the co efficien ts of all other y j are 0, whic h indicates that (13) holds. Pr o of of L emma 7. The T ransformer T ( mmr ) : R d × n → R 1 × n is defined as T ( mmr ) := F ( mmr ) F F ◦ T ( cm ) , where T ( cm ) : R d × n → R 1 × n is from Lemma 14 with size { (2 , max { d, 5 } ) , ((3 , 3) , (3 , 11)) × ( n − 1) times , (3 , 3) , (2 , 5) } and dimension vecto r  d d 5 · 1 1 × n 1  ; F ( mmr ) F F : R 1 × n → R 1 × n is generated from f ( mmr ) F F : R → R in Lemma 15 with width nN − 1 and depth 2. W e only pro v e the p ositional encoding case. F or i ∈ [ N ], denote f X ( i ) := X ( i ) + E . By using the triangle inequalit y , w e deriv e that for i ∈ [ N ] , k ∈ [ n ], (3 k − 1) r ≤ ∥ E : ,k ∥ 2 −    X ( i ) : ,k    2 ≤    f X ( i ) : ,k    2 =    X ( i ) : ,k + E : ,k    2 ≤    f X ( i ) : ,k    2 + ∥ E : ,k ∥ 2 ≤ (3 k + 1) r, (14) whic h implies that f X ( i ) : ,k and f X ( j ) : ,l with k  = l are imp ossible to b e identical. Moreo ver, the triangle inequlity also yields    f X ( i ) : ,k − f X ( j ) : ,l    2 =     X ( i ) : ,k − X ( j ) : ,l  + ( k − l ) p 1 d × 1    2 ≥        X ( i ) : ,k − X ( j ) : ,l     2 − ∥ ( k − l ) p 1 d × 1 ∥ 2 ≥ ϕ, when k = l ∥ ( k − l ) p 1 d × 1 ∥ 2 −     X ( i ) : ,k − X ( j ) : ,l     2 ≥ r , when k  = l ≥ ϕ. (15) W e conclude from (14)(15) that n f X ( i ) o i ∈ [ N ] are ((3 n + 1) r, ϕ )-sep erated. Applying Lemma 14 to n f X ( i ) o i ∈ [ N ] , w e ha ve 28 • F or an y i ∈ [ N ] and k ∈ [ n ],     T ( cm )  f X ( i )  1 ,k     ≤ R . • F or an y i, j ∈ [ N ] and k, l ∈ [ n ] with i  = j or k  = l ,     T ( cm )  f X ( i )  1 ,k − T ( cm )  f X ( j )  1 ,l     ≥ 2 . Applying Lemma 15 to  T ( cm )  f X ( i )  1 ,k , y ( i ) 1 ,k  i ∈ [ N ] ,k ∈ [ n ] , w e obtain F ( mmr ) F F  T ( cm )  f X ( i )  1 ,k  = y ( i ) 1 ,k , i ∈ [ N ] , k ∈ [ n ] . The weigh t bound of F ( mmr ) F F is max { R , 2 B y } . The weigh t b ounds of T ( cm ) are B F F = max n √ 2 n 2 (3 n + 1) N 2 √ π dr ϕ − 1 + 1 , 3 √ 2 8 N 2 √ π n o , B S A = 1 2 log(3 √ dπ n 4 (3 n +1) N 4 r ϕ − 1 ). The w eigh t bounds of T ( mmr ) can be obtained b y making a comparison betw een them. 4.2 Pro of of Lemma 14: Realizing Contextual Mapping b y T rans- formers Our pro of of Lemma 14 adapts the construction from [25]. Sp ecifically , we first construct a feedforward blo c k that maps the input sequence X ( i ) to the token id x ( i ) (Lemma 19), follow ed b y a T ransformer that computes the sequence id z ( i ) (Lemma 20). The T ransformer in Lemma 14 is then obtained via a linear com bination of the token id and the sequence id. The pro ofs of Lemmas 19 and 20 dep end on the following three technical lemmas (Lemmas 16 – 18). Lemma 16 ([36], Lemma 13) . L et d, N ∈ N ≥ 1 . 
L et  x ( i )  i ∈ [ N ] ⊂ R d . Ther e exists a unit ve ctor u ∈ R d such that for any i, j ∈ [ N ] , 1 N 2 r 8 π d   x ( i ) − x ( j )   2 ≤   u ⊤  x ( i ) − x ( j )    ≤   x ( i ) − x ( j )   2 . Lemma 17 ([25], Lemma E.3) . L et r ′ ∈ R . Ther e exists a R eLU FNN f ( elm ) F F : R 2 → R with depth 2 , width 4 and weight b ound max { r ′ , 2 } such that for any x ∈ R 2 , f ( elm ) F F ( x ) = ( r ′ , if | x 1 − x 2 | < 1 2 ; 0 , if | x 1 − x 2 | > 1 . The follo wing lemma shows that the self-attention la yer can appro ximate the max- im um o ver the input sequence. This is an adaptation of [25, Lemma E.2], where the self-atten tion la yer is defined without skip connections and the softmax includes a bias term. Lemma 18. L et n ∈ N ≥ 1 and r ′ , P ∈ R > 1 . Ther e exists a self-attention layer F ( max ) S A : R 3 × n → R 3 × n with he ad numb er 1 , he ad size 3 and weight b ound 1 2 log(8 n 3 / 2 r ′ P ) such that for any x =  x 1 x 2 · · · x n  ∈ R 1 × n satisfying 29 • | x i | ≤ 2 r ′ for i ∈ [ n ] ; • x i ≤ x max − 2 for i ∈ [ n ] with x i  = x max ( x max := max i ∈ [ n ] x i ), ther e holds F ( max ) S A     x 1 1 × n 0 1 × n     =   x 1 1 × n e x max 1 1 × n   , wher e x max − 1 2 P √ n ≤ e x max ≤ x max . In p articular, if x = 0 1 × n , ther e holds F ( max ) S A     0 1 × n 1 1 × n 0 1 × n     =   0 1 × n 1 1 × n 0 1 × n   . Pr o of. The pro of is a mo dification of the pro of of [25, Lemma E.2]. The matrices in F ( max ) S A are set as W O :=   0 0 1   , W V :=  1 0 0  , W K :=   t 0 0 0 0 0 0 0 0   , W Q :=   0 1 0 0 0 0 0 0 0   , where t > 0 is some parameter that will b e defined later. Denoting X =   x 1 1 × n 0 1 × n   , w e ha v e W O W V X =   0 0 1    1 0 0    x 1 1 × n 0 1 × n   =   0 1 × n 0 1 × n x   and ( W K X ) ⊤ ( W Q X ) =     t 0 0 0 0 0 0 0 0     x 1 1 × n 0 1 × n     ⊤     0 1 0 0 0 0 0 0 0     x 1 1 × n 0 1 × n     =  t x ⊤ 0 n × 1 0 n × 1    1 1 × n 0 1 × n 0 1 × n   = t x ⊤ 1 1 × n . Hence F ( max ) S A     x 1 1 × n 0 1 × n     = X + W O W V X σ S  ( W K X ) ⊤ ( W Q X )  =   x 1 1 × n 0 1 × n   +   0 1 × n 0 1 × n x   σ S  t x ⊤ 1 1 × n  30 =   x 1 1 × n 0 1 × n   +   0 1 × n 0 1 × n x   σ S  t x ⊤  1 1 × n =   x 1 1 × n  x σ S  t x ⊤  1 1 × n   . Define e x max := x σ S  t x ⊤  = P n i =1 x i exp( tx i ) P n i =1 exp( tx i ) . (16) Since e x max is a con v ex combination of { x i } i ∈ [ n ] , it is easy to see that x max upp er b ounds e x max . It suffices to find t that satisfies the low er b ound condition. W e lo wer b ound the softmax w eigh ts on x max as p max := P i : x i = x max exp( tx i ) P n i =1 exp( tx i ) = P i : x i = x max exp( tx i ) P i : x i = x max exp( tx i ) + P i : x i  = x max exp( tx i ) ≥ P i : x i = x max exp( tx max ) P i : x i = x max exp( tx max ) + P i : x i  = x max exp( t ( x max − 2)) = n max n max + ( n − n max ) exp( − 2 t ) = 1 1 + ( n n max − 1) exp( − 2 t ) , where n max := |{ i : x i = x max }| . Cho osing t = 1 2 log(8 n 3 / 2 r ′ P ), w e ha ve p max ≥ 1 1 + ( n n max − 1) 1 8 n 3 / 2 r ′ P ≥ 1 1 + 1 8 r ′ P √ n . No w, w e can low er bound e x max as e x max ≥ x max p max − 2 r ′ (1 − p max ) = x max − ( x max + 2 r ′ )(1 − p max ) ≥ x max − 4 r ′ (1 − p max ) ≥ x max − 4 r ′ 1 − 1 1 + 1 8 r ′ P √ n ! = x max − 1 2 P √ n 1 + 1 8 r ′ P √ n ≥ x max − 1 2 P √ n . When x = 0 1 × n , the definition (16) sho ws that e x max = 0. 31 Lemma 19. L et d, n, N ∈ N ≥ 1 . 
L et r, ϕ ∈ R > 0 . L et X (1) , · · · , X ( N ) ∈ R d × n b e a set of N input se quenc es that ar e tokenwise ( r , ϕ ) -sep ar ate d. Denote r ′ = √ 2 2 n 2 N 2 √ π dr ϕ − 1 . Ther e exists a fe e dforwar d blo ck F ( pj t ) F F : R d × n → R 4 × n with depth 2 , width max { d, 4 } and weight b ound √ 2 2 n 2 N 2 √ π dr ϕ − 1 such that for i ∈ [ N ] , F ( pj t ) F F  X ( i )  =     x ( i ) 1 1 × n 0 1 × n 0 1 × n     , wher e  x ( i )  i ∈ [ N ] ⊂ R 1 × n ar e non-ne gative and tokenwise (2 r ′ , 2) -sep ar ate d. Mor e over, for i, j ∈ [ N ] and k , l ∈ [ n ] , x ( i ) 1 ,k = x ( j ) 1 ,l if and only if X ( i ) : ,k = X ( j ) : ,l . F urthermor e, if X ∈ R d × n satisfies ∥ X : ,k ∥ 2 ≤ r for al l k ∈ [ n ] , then    F ( pj t ) F F ( X ) 1 ,k    ≤ 2 r ′ for al l k ∈ [ n ] . Pr o of. Recall the definition of the v o cabulary V = [ i ∈ [ N ] V ( i ) = n v ∈ R d : v = X ( i ) : ,k for some i ∈ [ N ] , k ∈ [ n ] o . Note that |V | ≤ nN . W e use Lemma 16 on V to find a unit v ector u ′ suc h that 1 n 2 N 2 r 8 π d ∥ v − v ′ ∥ 2 ≤ 1 |V | 2 r 8 π d ∥ v − v ′ ∥ 2 ≤   u ′⊤ ( v − v ′ )   ≤ ∥ v − v ′ ∥ 2 (17) for ev ery v , v ′ ∈ V . Let u := S u ′ with S = √ 2 2 n 2 N 2 √ π dϕ − 1 . Then F ( pj t ) F F : R d × n → R 4 × n is defined as F ( pj t ) F F ( X ) :=     σ R  u ⊤ X + r ′ 1 1 × n  1 1 × n 0 1 × n 0 1 × n     . Let x ( i ) := σ R  u ⊤ X ( i ) + r ′ 1 1 × n  ∈ R 1 × n , i ∈ [ N ] . (18) F or an y i ∈ [ N ] , k ∈ [ n ], since X ( i ) : ,k ∈ V , according to(17), w e ha v e    u ⊤ X ( i ) : ,k    =    S u ′⊤ X ( i ) : ,k    ≤ S    X ( i ) : ,k    2 ≤ S r ≤ r ′ . Hence w e can remov e the ReLU activ ation in (18): x ( i ) = u ⊤ X ( i ) + r ′ 1 1 × n . It follo ws that for an y i, j ∈ [ N ] , k , l ∈ [ n ],    x ( i ) 1 ,k    ≤    u ⊤ X ( i ) : ,k + r ′    ≤ 2 r ′ 32 and    x ( i ) 1 ,k − x ( j ) 1 ,l    =    u ⊤  X ( i ) : ,k − X ( j ) : ,l     = S    u ′⊤  X ( i ) : ,k − X ( j ) : ,l     ≥ S n 2 N 2 r 8 π d    X ( i ) : ,k − X ( j ) : ,l    2 ≥ S n 2 N 2 r 8 π d ϕ ≥ 2 , where in the third step we mak e use of (17). The abov e inequalit y also implies x ( i ) 1 ,k = x ( j ) 1 ,l if and only if X ( i ) : ,k = X ( j ) : ,l . Lemma 20. L et N , n ∈ N ≥ 1 and r ′ ∈ R > 0 . L et x (1) , · · · , x ( N ) ∈ R 1 × n b e a set of N input se quenc es that ar e non-ne gative and tokenwise (2 r ′ , 2) -sep ar ate d. Ther e exists a T r ans- former T ( sid ) : R 4 × n → R 1 × n with size { (2 , 4) , ((2 , 3) , (3 , 10)) × ( n − 1) times , (2 , 3) , (2 , 4) } and dimension ve ctor  4 · 1 1 × ( n +2) 1  such that T ( sid )         x ( i ) 1 1 × n 0 1 × n 0 1 × n         = z ( i ) 1 1 × n , i ∈ [ N ] , wher e  z ( i )  i ∈ [ N ] satisfies the fol lowing c onditions: • F or any i ∈ [ N ] ,   z ( i )   ≤ 3 √ 2 π 4 nN 2 r ′ + 1 2 . • F or any i, j ∈ [ N ] such that x ( i )  = x ( j ) up to p ermutations,   z ( i ) − z ( j )   ≥ 2 . The weight b ounds of T ( sid ) ar e B F F = max n 2 r ′ , 3 √ 2 8 N 2 √ π n o , B S A = 1 2 log(3 √ 2 π n 2 N 2 r ′ ) . F urthermor e, if the first c omp onent of the input ve ctor is r eplac e d with any x ∈ R 1 × n and the c omp onents of x ar e b ounde d by 2 r ′ , then the c omp onents of the output of T ( sid ) ar e b ounde d by 3 √ 2 π 4 nN 2 r ′ + 1 2 . Pr o of. F or i ∈ [ N ], let n i ∈ N ≥ 1 b e the n umber of components of x ( i ) that tak e differen t v alues. 
W e define a new sequence { ¯ x ( i ) } i ∈ [ N ] ⊂ R 1 × n constructed as follows: for 1 ≤ j ≤ n i , ¯ x ( i ) j is tak en as the j -th largest comp onen t of x ( i ) ; for n i < j ≤ n , ¯ x ( i ) j is set to 0. Then according to Lemma 16, w e can find a unit v ector w ′ ∈ R n suc h that for any i, j ∈ [ N ], 1 N 2 r 8 π n   ¯ x ( i ) − ¯ x ( j )   2 ≤    ¯ x ( i ) − ¯ x ( j )  w ′   ≤   ¯ x ( i ) − ¯ x ( j )   2 . (19) Define w := P w ′ (20) with P = 3 √ 2 8 N 2 √ π n . 33 By Lemma 18, there exists a self-atten tion la y er F ( max ) S A : R 3 × n → R 3 × n with head n umber 1, head size 3 and w eight b ound 1 2 log(8 n 3 / 2 r ′ P ) suc h that for any x ∈ R 1 × n satisfying the condition in Lemma 18, there holds F ( max ) S A     x 1 1 × n 0 1 × n     =   x 1 1 × n e x max 1 1 × n   , where x max − 1 2 P √ n ≤ e x max ≤ x max . It follo ws that, according to Lemma 3, there exists a self attention lay er F ( sid ) S A : R 4 × n → R 4 × n suc h that for any x , z ∈ R 1 × n with x satifsying the condition in Lemma 18, F ( sid ) S A         x 1 1 × n 0 1 × n z         :=     F ( max ) S A   x 1 1 × n 0 1 × n   F ( idt ) S A  z      =     x 1 1 × n e x max 1 1 × n z     , (21) where F ( idt ) S A : R 1 × n → R 1 × n is the self-atten tion la y er from Lemma 12 with head n um b er 1, head size 1 and weigh t b ound 1. According to Lemma 3, F ( sid ) S A is of head n um b er 2, head size 3 and w eigh t b ound 1 2 log(8 n 3 / 2 r ′ P ). F or l ∈ [ n − 1], F ( sid ) F F ,l : R 4 × n → R 4 × n are defined as F ( sid ) F F ,l         x 1 1 × n y z         :=       σ R  F ( idt ) F F ( x ) − 2 F ( elm ) F F  x y  1 1 × n 0 1 × n F ( idt ) F F ( z + w l y )       =     x ′ 1 1 × n 0 1 × n z + w l y     , x , y , z ∈ R 1 × n , (22) with x ′ :=  x ′ 1 · · · x ′ n  ha ving comp onen ts x ′ i = ( σ R ( x i − 2 r ′ ) , if | x i − y i | < 1 2 ; σ R ( x i ) , if | x i − y i | > 1 . (23) Here, F ( idt ) F F : R 1 × n → R 1 × n is generated from the ReLU FNN f ( idt ) F F in Lemma 11 that implemen ts the iden tity mapping and hence has depth 2, width 2 and weigh t b ound 1; F ( elm ) F F : R 2 × n → R 1 × n is generated from the ReLU FNN f ( elm ) F F in Lemma 17 and hence has depth 2, width 4 and weigh t b ound max { r ′ , 2 } ; w l is the l -th comp onen t of w defined in (20). Therefore, F ( sid ) F F ,l is of depth 3, width 10 and weigh t bound max { 2 r ′ , P } . F ( sid ) F F ,n : R 4 × n → R 1 × n is defined as F ( sid ) F F ,n         x 1 1 × n y z         := F ( idt ) F F ( z + w l y ) = z + w n y , x , y , z ∈ R 1 × n . It can b e seen that F ( sid ) F F ,n is of depth 2, width 4 and weigh t bound P . 34 No w, our T ransformer T ( sid ) : R 4 × n → R 1 × n is defined as T ( sid ) := F ( sid ) F F ,n ◦ F ( sid ) S A ◦ F ( sid ) F F ,n − 1 ◦ F ( sid ) S A ◦ · · · ◦ F ( sid ) F F , 1 ◦ F ( sid ) S A ◦ F ( idt ) F F . 
F or i ∈ [ N ] , l ∈ [ n − 1], letting Z ( i ) l := F ( sid ) F F ,l ◦ F ( sid ) S A ◦ F ( sid ) F F ,l − 1 ◦ F ( sid ) S A ◦ · · · ◦ F ( sid ) F F , 1 ◦ F ( sid ) S A ◦ F ( idt ) F F         x ( i ) 1 1 × n 0 1 × n 0 1 × n         ∈ R 4 × n b e the output of the l -th step of the i -th sample, w e sho w inductiv ely that Z ( i ) l =      x ( i ) l 1 1 × n 0 1 × n  P min { l,n i } j =1 w j e ¯ x ( i ) j  1 1 × n      , where e ¯ x ( i ) j is 1 2 P √ n -appro ximation of ¯ x ( i ) j and x ( i ) l is generated from x ( i ) in the following w ay: when l ≤ n i , the first l largest comp onen ts in x ( i ) (where comp onen ts with the same v alue are considered iden tical) are replaced by 0, while the remaining comp onen ts remain unc hanged; when l > n i , x ( i ) l := 0 1 × n . It is w orth to note that x ( i ) l satisfies the condition in Lemma 18. F or l = 0, the conclusion holds for the input obviously . Supp ose that the conclusion holds for l = l ′ − 1. When l ′ ≤ n i , since the largest v alue of x ( i ) l ′ − 1 is ¯ x ( i ) l ′ , b y the induction h yp othesis and (21)(22), w e ha ve Z ( i ) l ′ = F ( sid ) F F ,l ′ ◦ F ( sid ) S A  Z ( i ) l ′ − 1  = F ( sid ) F F ,l ′ ◦ F ( sid ) S A,l ′           x ( i ) l ′ − 1 1 1 × n 0 1 × n  P min { l ′ − 1 ,n i } j =1 w j e ¯ x ( i ) j  1 1 × n           = F ( sid ) F F ,l ′             x ( i ) l ′ − 1 1 1 × n e ¯ x ( i ) l ′ 1 1 × n  P min { l ′ − 1 ,n i } j =1 w j e ¯ x ( i ) j  1 1 × n             =        Z ( i ) l ′  1 , : 1 1 × n 0 1 × n  P min { l ′ ,n i } j =1 w j e ¯ x ( i ) j  1 1 × n       . Since  x ( i )  i ∈ [ N ] is (2 r ′ , 2)-separated and all comp onen ts are non-negativ e, w e know that n x ( i ) l ′ − 1 o i ∈ [ N ] is also (2 r ′ , 2)-separated with all comp onen ts b eing non-negativ e from its definition. Therefore, the 1 2 P √ n -appro ximations e ¯ x ( i ) l ′ differ from the maximum comp onen t 35 ¯ x ( i ) l ′ of x ( i ) l ′ − 1 b y less than 1 2 , while differing from all other comp onen ts of x ( i ) l ′ − 1 b y more than 1. According to (23), w e obtain  Z ( i ) l ′  1 , : = x ( i ) l ′ . When l ′ > n i , noticing x ( i ) l ′ − 1 = 0 1 × n , b y the induction h yp othesis and (21)(22), w e ha ve Z ( i ) l ′ = F ( sid ) F F ,l ′ ◦ F ( sid ) S A  Z ( i ) l ′ − 1  = F ( sid ) F F ,l ′ ◦ F ( sid ) S A,l ′           0 1 × n 1 1 × n 0 1 × n  P min { l ′ − 1 ,n i } j =1 w j e ¯ x ( i ) j  1 1 × n           = F ( sid ) F F ,l ′           0 1 × n 1 1 × n 0 1 × n  P min { l ′ − 1 ,n i } j =1 w j e ¯ x ( i ) j  1 1 × n           =      0 1 × n 1 1 × n 0 1 × n  P min { l ′ − 1 ,n i } j =1 w j e ¯ x ( i ) j  1 1 × n      =      x ( i ) l ′ 1 1 × n 0 1 × n  P min { l ′ ,n i } j =1 w j e ¯ x ( i ) j  1 1 × n      . Th us, the induction is completed and we hav e Z ( i ) n − 1 =      x ( i ) n − 1 1 1 × n 0 1 × n  P min { n − 1 ,n i } j =1 w j e ¯ x ( i ) j  1 1 × n      . F rom the definition of F ( sid ) F F ,n , it can b e seen that Z ( i ) n only outputs the aggregated sum- mation information. Therefore, through an analysis similar to the one ab o ve, we obtain Z ( i ) n = F ( sid ) F F ,n ◦ F ( sid ) S A  Z ( i ) n − 1  =   min { n,n i } X j =1 w j e ¯ x ( i ) j   1 1 × n = z ( i ) 1 1 × n , where z ( i ) := e ¯ x ( i ) w , with eac h component of e ¯ x ( i ) 1 2 P √ n -appro ximating the corresp onding comp onen t of ¯ x ( i ) . 
W e no w chec k that  z ( i )  i ∈ [ N ] are  3 √ 2 π 4 nN 2 r ′ + 1 2 , 2  -separated. Let i, j ∈ [ N ] with i  = j , noting that ∥ w ′ ∥ 2 = 1 and  ¯ x ( i )  i ∈ [ N ] are tok enwise (2 r ′ , 2)- sep erated, we hav e   z ( i )   =    e ¯ x ( i ) w    = P    e ¯ x ( i ) w ′    ≤ P    e ¯ x ( i )    2 36 ≤ P    e ¯ x ( i ) − ¯ x ( i )    2 + P   ¯ x ( i )   2 ≤ P · √ n 2 P √ n + P · 2 r ′ √ n ≤ 3 √ 2 π 4 nN 2 r ′ + 1 2 and   z ( i ) − z ( j )   =     e ¯ x ( i ) − e ¯ x ( j )  w    = P     e ¯ x ( i ) − e ¯ x ( j )  w ′    ≥ P    ¯ x ( i ) − ¯ x ( j )  w ′   − P     ¯ x ( i ) − ¯ x ( j )  w ′ −  e ¯ x ( i ) − e ¯ x ( j )  w ′    ≥ P    ¯ x ( i ) − ¯ x ( j )  w ′   − P     ¯ x ( i ) − e ¯ x ( i )  w ′    − P     ¯ x ( j ) − e ¯ x ( j )  w ′    ≥ P N 2 r 8 π n   ¯ x ( i ) − ¯ x ( j )   2 − P    ¯ x ( i ) − e ¯ x ( i )    2 − P    ¯ x ( j ) − e ¯ x ( j )    2 ≥ P N 2 r 8 π n · 2 − P · √ n 2 P √ n − P · √ n 2 P √ n ≥ 3 − 1 2 − 1 2 = 2 , where in the fifth step we mak e use of (19). Pr o of of L emma 14. Denote r ′ := √ 2 2 n 2 N 2 √ π dr ϕ − 1 . The T ransformer T ( cm ) is defined as T ( cm ) ( X ) : =  2 r ′ + 1 1  T ( sid ) ◦ F ( pj t ) F F ( X )  1 0 0 0  F ( pj t ) F F ( X ) ! , where F ( pj t ) F F : R d × n → R 4 × n is from Lemma 19 with depth 2, width max { d, 4 } and w eight b ound √ 2 2 n 2 N 2 √ π dr ϕ − 1 , T ( sid ) : R 4 × n → R 1 × n is from Lemma 20 with size { (2 , 4) , ((2 , 3) , (3 , 10)) × ( n − 1) times , (2 , 3) , (2 , 4) } , dimension v ector  4 · 1 1 × ( n +2) 1  and w eight b ounds B F F = max n 2 r ′ , 3 √ 2 8 N 2 √ π n o , B S A = 1 2 log(3 √ 2 π n 2 N 2 r ′ ). According to Lemma 3, T ( cm ) has size { (2 , max { d, 5 } ) , ((3 , 3) , (3 , 11)) × ( n − 1) times , (3 , 3) , (2 , 5) } , dimension v ector  d d 5 · 1 1 × n 1  and w eigh t bounds B F F = max ( √ 2 n 2 N 2 √ π dr ϕ − 1 + 1 , 3 √ 2 8 N 2 √ π n ) , B S A = 1 2 log(3 √ dπ n 4 N 4 r ϕ − 1 ) . Applying Lemma 19 and Lemma 20, w e ha ve a ( i ) := T ( cm )  X ( i )  =  2 r ′ + 1 1    T ( sid ) ◦ F ( pj t ) F F  X ( i )   1 0 0 0  F ( pj t ) F F  X ( i )    =  2 r ′ + 1 1   z ( i ) 1 1 × n x ( i )  = (2 r ′ + 1) z ( i ) 1 1 × n + x ( i ) . 37 It follo ws that for an y i ∈ [ N ] , k ∈ [ n ],    a ( i ) k    ≤ (2 r ′ + 1)   z ( i )   + x ( i ) 1 ,k ≤ (2 r ′ + 1) 3 √ 2 π 4 nN 2 r ′ + 1 2 ! + 2 r ′ ≤ (2 r ′ + 1) 3 √ 2 π 4 nN 2 r ′ + 3 2 ! . It remains to sho w the separatedness of  a ( i )  i ∈ [ N ] when X ( i ) : ,k  = X ( j ) : ,l or X ( i )  = X ( j ) up to p erm utations. According to Lemma 19 and Lemma 20, w e hav e following equiv alent conditions: • X ( i ) : ,k  = X ( j ) : ,l ⇐ ⇒ x ( i ) 1 ,k  = x ( j ) 1 ,l . • X ( i )  = X ( j ) up to p erm utations ⇐ ⇒ z ( i )  = z ( j ) . Therefore, in the follo wing we c heck the separatedness of  a ( i )  i ∈ [ N ] when x ( i ) 1 ,k  = x ( j ) 1 ,l or z ( i )  = z ( j ) . F rom definition, w e ha v e a ( i ) 1 ,k − a ( j ) 1 ,l = (2 r ′ + 1)  z ( i ) − z ( j )  +  x ( i ) 1 ,k − x ( j ) 1 ,l  . If z ( i ) = z ( j ) and x ( i ) 1 ,k  = x ( j ) 1 ,l , b y Lemma 19 there holds    a ( i ) 1 ,k − a ( j ) 1 ,l    =    x ( i ) 1 ,k − x ( j ) 1 ,l    ≥ 2 . If z ( i )  = z ( j ) and x ( i ) 1 ,k = x ( j ) 1 ,l , b y Lemma 20 there holds    a ( i ) 1 ,k − a ( j ) 1 ,l    = (2 r ′ + 1)   z ( i ) − z ( j )   ≥ 2(2 r ′ + 1) ≥ 2 . 
If z ( i )  = z ( j ) and x ( i ) 1 ,k  = x ( j ) 1 ,l , assuming without loss of generalit y that z ( i ) > z ( j ) , by Lemma 19 and Lemma 20 there holds    a ( i ) 1 ,k − a ( j ) 1 ,l    =    (2 r ′ + 1)  z ( i ) − z ( j )  +  x ( i ) 1 ,k − x ( j ) 1 ,l     = (2 r ′ + 1)  z ( i ) − z ( j )  +  x ( i ) 1 ,k − x ( j ) 1 ,l  ≥ 2(2 r ′ + 1) − 4 r ′ = 2 . 5 Pro of of Lemma 3: P arallelization of T ransformers (1) Since the argument used here is equally applicable to cases where L > 2, we only discuss the case where L = 2. In this case, the t w o feedforw ard blocks are in the form of F (1) F F ( X ) = W (1) 2 σ R  W (1) 1 X + B (1) 1  + B (1) 2 , 38 F (2) F F ( Y ) = W (2) 2 σ R  W (2) 1 Y + B (2) 1  + B (2) 2 , where W (1) 1 ∈ R r × d (1) , W (2) 1 ∈ R r × d (2) , B (1) 1 , B (2) 1 ∈ R r × n , W (1) 2 ∈ R ¯ d (1) × r , W (2) 2 ∈ R ¯ d (2) × r , B (1) 2 ∈ R ¯ d (1) × n , B (2) 2 ∈ R ¯ d (2) × n . Denote W ( prl ) 1 := W (1) 1 0 r × d (2) 0 r × d (1) W (2) 1 ! , B ( prl ) 1 := B (1) 1 B (2) 1 ! , W ( prl ) 2 := W (1) 2 0 ¯ d (1) × r 0 ¯ d (2) × r W (2) 2 ! , B ( prl ) 2 := B (1) 2 B (2) 2 ! , and define F ( prl ) F F  X Y  := W ( prl ) 2 σ R  W ( prl ) 1  X Y  + B ( prl ) 1  + B ( prl ) 2 = W (1) 2 0 ¯ d (1) × r 0 ¯ d (2) × r W (2) 2 ! σ R W (1) 1 0 r × d (2) 0 r × d (1) W (2) 1 !  X Y  + B (1) 1 B (2) 1 !! + B (1) 2 B (2) 2 ! =   W (1) 2 σ R  W (1) 1 X + B (1) 1  + B (1) 2 W (2) 2 σ R  W (2) 1 Y + B (2) 1  + B (2) 2   = F (1) F F ( X ) F (2) F F ( Y ) ! . (2) Consider the follo wing t wo self-attention lay ers: F (1) S A ( X ) = X + H (1) X h =1 W (1) O,h W (1) V ,h X σ S  X ⊤ W (1) ⊤ K,h W (1) Q,h X  , F (2) S A ( Y ) = Y + H (2) X h =1 W (2) O,h W (2) V ,h Y σ S  Y ⊤ W (2) ⊤ K,h W (2) Q,h Y  , where W (1) O,h ∈ R d (1) × S (1) , W (1) V ,h , W (1) K,h , W (1) Q,h ∈ R S (1) × d (1) , W (2) O,h ∈ R d (2) × S (2) , W (2) V ,h , W (2) K,h , W (2) Q,h ∈ R S (2) × d (2) . Let W ( prl ) O,h, 1 :=  W (1) O,h 0 d (2) × S (1)  , W ( prl ) V ,h, 1 :=  W (1) V ,h 0 S (1) × d (2)  , W ( prl ) K,h, 1 :=  W (1) K,h 0 S (1) × d (2)  , W ( prl ) Q,h, 1 :=  W (1) Q,h 0 S (1) × d (2)  . It follo ws that W ( prl ) O,h, 1 W ( prl ) V ,h, 1  X Y  σ S   X ⊤ Y ⊤  W ( prl ) T K,h, 1 W ( prl ) Q,h, 1  X Y  =  W (1) O,h 0 d (2) × S (1)   W (1) V ,h 0 S (1) × d (2)   X Y  σ S   X ⊤ Y ⊤   W (1) ⊤ K,h 0 S (1) × d (2)   W (1) Q,h 0 S (1) × d (2)   X Y  = W (1) O,h W (1) V ,h X σ S  X ⊤ W (1) ⊤ K,h W (1) Q,h X  0 d (2) × n ! . Similarly , letting W ( prl ) O,h, 2 :=  0 d 1 × S 2 W (2) O,h  , W ( prl ) V ,h, 2 :=  0 S (2) × d (1) W (2) V ,h  , 39 W ( prl ) K,h, 2 :=  0 S (2) × d (1) W (2) K,h  , W ( prl ) Q,h, 2 :=  0 S (2) × d (1) W (2) Q,h  and w e ha ve W ( prl ) O,h, 2 W ( prl ) V ,h, 2  X Y  σ S   X ⊤ Y ⊤  W ( prl ) ⊤ K,h, 2 W ( prl ) Q,h, 2  X Y  = 0 d (1) × n W (2) O,h W (2) V ,h Y σ S  Y ⊤ W (2) ⊤ K,h W (2) Q,h Y  ! . Therefore, F ( prl ) S A  X Y  :=  X Y  + H (1) X h =1 W ( prl ) O,h, 1 W ( prl ) V ,h, 1  X Y  σ S   X ⊤ Y ⊤  W ( prl ) ⊤ K,h, 1 W ( prl ) Q,h, 1  X Y  + H (2) X h =1 W ( prl ) O,h, 2 W ( prl ) V ,h, 2  X Y  σ S   X ⊤ Y ⊤  W ( prl ) ⊤ K,h, 2 W ( prl ) Q,h, 2  X Y  =  X Y  + H (1) X h =1 W (1) O,h W (1) V ,h X σ S  X ⊤ W (1) ⊤ K,h W (1) Q,h X  0 d (2) × n ! + H (2) X h =1 0 d (1) × n W (2) O,h W (2) V ,h Y σ S  Y ⊤ W (2) ⊤ K,h W (2) Q,h Y  ! = F (1) S A ( X ) F (2) S A ( Y ) ! . (3) A direct corollary from (1) and (2). 
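To make the block-diagonal construction in part (1) above concrete, the following NumPy sketch (our own illustration; the function names, shapes and random weights are not from the paper) stacks two depth-2 feedforward blocks into one parallel block and checks that it reproduces the two outputs simultaneously.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ff_block(X, W1, b1, W2, b2):
    # Depth-2 feedforward block W2 @ relu(W1 @ X + b1) + b2, applied columnwise;
    # the column biases stand in for the bias matrices B_l of the paper.
    return W2 @ relu(W1 @ X + b1) + b2

rng = np.random.default_rng(0)
n = 6                    # number of tokens
d1, d2, r = 4, 3, 5      # input/output dims of the two blocks and the hidden width

# Two independent feedforward blocks F^(1)_FF and F^(2)_FF with random weights.
W1_a, b1_a = rng.normal(size=(r, d1)), rng.normal(size=(r, 1))
W2_a, b2_a = rng.normal(size=(d1, r)), rng.normal(size=(d1, 1))
W1_b, b1_b = rng.normal(size=(r, d2)), rng.normal(size=(r, 1))
W2_b, b2_b = rng.normal(size=(d2, r)), rng.normal(size=(d2, 1))

X, Y = rng.normal(size=(d1, n)), rng.normal(size=(d2, n))

# Block-diagonal ("parallel") weights as in part (1) of the proof: each block
# only acts on its own coordinate group of the stacked input [X; Y].
W1_p = np.block([[W1_a, np.zeros((r, d2))], [np.zeros((r, d1)), W1_b]])
W2_p = np.block([[W2_a, np.zeros((d1, r))], [np.zeros((d2, r)), W2_b]])
b1_p, b2_p = np.vstack([b1_a, b1_b]), np.vstack([b2_a, b2_b])

out_parallel = ff_block(np.vstack([X, Y]), W1_p, b1_p, W2_p, b2_p)
out_stacked = np.vstack([ff_block(X, W1_a, b1_a, W2_a, b2_a),
                         ff_block(Y, W1_b, b1_b, W2_b, b2_b)])
print(np.allclose(out_parallel, out_stacked))  # True: the parallel block stacks the two outputs
```

The self-attention case in part (2) works in the same way: the parallel output/value/key/query matrices are padded with zero blocks, so each head only reads and writes its own coordinate group of the stacked input.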
6 Proof of Proposition 2

Our proof of Proposition 2 follows techniques developed in [49, 35, 56]. Before proving it, we first list the relevant concepts and results from probability theory that will be used in the proof. In the definitions and lemmas below, $T$ is a set in the metric space $(\bar{T},\tau)$.

Lemma 21 (Bernstein's inequality). For i.i.d. random variables $\{Z_i\}_{i=1}^{m}$ satisfying $|Z_i|\le c$, $\mathbb{E}[Z_i]=0$, $\mathrm{Var}(Z_i)=\sigma^2$, it holds that
$$\mathbb{P}\left(\left|\frac{1}{m}\sum_{i=1}^{m}Z_i\right|\ge u\right)\le\exp\left(-\frac{mu^2}{2\sigma^2+2cu/3}\right)$$
for any $u>0$.

Definition 4 (Gaussian process). A stochastic process $\{X(t)\}_{t\in T}$ is a Gaussian process if for all $n\in\mathbb{N}$, $a_i\in\mathbb{R}$ and $t_i\in T$, the random variable $\sum_{i=1}^{n}a_iX(t_i)$ is normal or, equivalently, if all the finite-dimensional marginals of $X$ are multivariate normal. $X$ is a centred Gaussian process if all these random variables are normal with mean zero.

Definition 5 (Sub-gaussian variable and sub-gaussian process). A square integrable random variable $\xi$ is said to be sub-gaussian with parameter $\sigma>0$ if for all $\lambda\in\mathbb{R}$, $\mathbb{E}e^{\lambda\xi}\le e^{\lambda^2\sigma^2/2}$. A centred stochastic process $\{X(t)\}_{t\in T}$ is sub-gaussian relative to $\tau$ if its increments satisfy the sub-gaussian inequality:
$$\mathbb{E}e^{\lambda(X(t)-X(s))}\le e^{\lambda^2\tau^2(s,t)/2},\quad \lambda\in\mathbb{R},\ s,t\in T.$$

Lemma 22 (Borell–Sudakov–Tsirelson concentration inequality). Let $\{X(t)\}_{t\in T}$ be a separable centred Gaussian process. Suppose $\mathbb{E}\sup_{t\in T}|X(t)|<\infty$ and $\sigma^2:=\sup_{t\in T}\mathbb{E}X^2(t)<\infty$. Then
$$\mathbb{P}\left(\sup_{t\in T}|X(t)|\ge \mathbb{E}\sup_{t\in T}|X(t)|+u\right)\le e^{-u^2/2\sigma^2},\qquad \mathbb{P}\left(\sup_{t\in T}|X(t)|\le \mathbb{E}\sup_{t\in T}|X(t)|-u\right)\le e^{-u^2/2\sigma^2}.$$

Proof. See [11, Theorem 2.5.8].

Lemma 23. Let $\{X(t)\}_{t\in T}$ be a sub-gaussian process relative to $\tau$. Assume that
$$\int_{0}^{\infty}\sqrt{\log N(\epsilon,T,\tau)}\,d\epsilon<\infty.$$
Then any separable version of $\{X(t)\}_{t\in T}$, which we keep denoting by $X(t)$, satisfies the inequality
$$\mathbb{E}\sup_{t\in T}|X(t)|\le \mathbb{E}|X(t_0)|+4\sqrt{2}\int_{0}^{D/2}\sqrt{\log 2N(\epsilon,T,\tau)}\,d\epsilon,$$
where $t_0\in T$ and $D$ is the diameter of $T$.

Proof. See [11, Theorem 2.3.7].

Proof of Proposition 2. By the triangle inequality, we have
$$\bigl\|\hat{f}-f_0\bigr\|^2_{L^2(\mu)}\le 2\bigl\|\hat{f}-f^*\bigr\|^2_{L^2(\mu)}+2\|f^*-f_0\|^2_{L^2(\mu)}, \qquad (24)$$
where $f^*$ can be any function in $\mathcal{F}$. For the remainder of the proof, we primarily focus on deriving an upper bound for $\|\hat{f}-f^*\|^2_{L^2(\mu)}$. To this end, we first control $\|\hat{f}-f^*\|^2_{L^2(\mu)}$ by its empirical counterpart $\|\hat{f}-f^*\|^2_m:=\frac{1}{m}\sum_{i=1}^{m}|\hat{f}(X_i)-f^*(X_i)|^2$, and then derive an upper bound for $\|\hat{f}-f^*\|^2_m$ by evaluating a variance term.

Step 1: Upper bound of $\|\hat{f}-f^*\|^2_{L^2(\mu)}$. Denote $N=N(m^{-\gamma/(2\gamma+d)},\mathcal{F},\|\cdot\|_{L^\infty(\mu)})$, and let $\{f_1,\dots,f_N\}$ be a set of centers of the minimal $m^{-\gamma/(2\gamma+d)}$-cover of $\mathcal{F}$ with respect to the $\|\cdot\|_{L^\infty(\mu)}$ norm. Suppose $f_{j'}\in\{f_1,\dots,f_N\}$ satisfies $\|\hat{f}-f_{j'}\|_{L^\infty(\mu)}\le m^{-\gamma/(2\gamma+d)}$. By the triangle inequality, we have
$$\bigl\|\hat{f}-f^*\bigr\|^2_{L^2(\mu)}\le 2\bigl\|\hat{f}-f_{j'}\bigr\|^2_{L^2(\mu)}+2\|f_{j'}-f^*\|^2_{L^2(\mu)}\le 2m^{-\gamma/(2\gamma+d)}+2\|f_{j'}-f^*\|^2_{L^2(\mu)}. \qquad (25)$$
We bound the term $\|f_j-f^*\|^2_{L^2(\mu)}$ uniformly for all $j\in[N]$ in order to bound the random quantity $\|f_{j'}-f^*\|^2_{L^2(\mu)}$.
Firstly , giv en j ∈ [ N ], we apply Bernstein’s inequalit y (Lemma 21) with Z i := ( f j ( X i ) − f ∗ ( X i )) 2 − E [( f j ( X i ) − f ∗ ( X i )) 2 ] . Since | Z i | =   ( f j ( X i ) − f ∗ ( X i )) 2 − E [( f j ( X i ) − f ∗ ( X i )) 2 ]   ≤ 8 B 2 F , V ar ( Z i ) = V ar  ( f j ( X i ) − f ∗ ( X i )) 2 − E [( f j ( X i ) − f ∗ ( X i )) 2 ]  = V ar  ( f j ( X i ) − f ∗ ( X i )) 2  = E  [ f j ( X i ) − f ∗ ( X i )] 4  −  E [ f j ( X i ) − f ∗ ( X i )] 2  2 ≤ 4 B 2 F E  [ f j ( X i ) − f ∗ ( X i )] 2  + 4 B 2 F  E [ f j ( X i ) − f ∗ ( X i )] 2  = 8 B 2 F ∥ f j − f ∗ ∥ 2 L 2 ( µ ) ≤ 16 B 2 F u, w e substitute u with max n v , ∥ f j − f ∗ ∥ 2 L 2 ( µ ) / 2 o , c with 8 B 2 F and τ 2 with 16 B 2 F u in Lemma 21 and obtain P  ∥ f j − f ∗ ∥ 2 L 2 ( µ ) ≥ ∥ f j − f ∗ ∥ 2 m + u  ≤ exp  − 3 mv 112 B 2 F  . (26) By the uniform b ound argumen t, ∥ f j − f ∗ ∥ 2 L 2 ( µ ) ≥ ∥ f j − f ∗ ∥ 2 m + u holds for all j ∈ [ N ] with probabilit y at most N exp ( − 3 mv / (112 B 2 F )). Substituting v with 112 B 2 F ( m d/ (2 γ + d ) + log N ) / (3 m ) leads to the follo wing inequalit y: ∥ f j − f ∗ ∥ 2 m + u ≤ ∥ f j − f ∗ ∥ 2 m + v + 1 2 ∥ f j − f ∗ ∥ 2 L 2 ( µ ) ≤ ∥ f j − f ∗ ∥ 2 m + 112 B 2 F m − 2 γ / (2 γ + d ) 3 + 112 B 2 F log N 3 m + 1 2 ∥ f j − f ∗ ∥ 2 L 2 ( µ ) . Com bining the abov e inequalit y and (26), w e deriv e that ∥ f j − f ∗ ∥ 2 L 2 ( µ ) ≤ 2 ∥ f j − f ∗ ∥ 2 m + 224 B 2 F m − 2 γ / (2 γ + d ) 3 + 224 B 2 F log N 3 m (27) holds for all j ∈ [ N ] with probability at least 1 − exp  − m d/ (2 γ + d )  . Plugging (27) in to (25) yields    b f − f ∗    2 L 2 ( µ ) ≤ 2 m − 2 s/ (2 s + d ) + 4 ∥ f j ′ − f ∗ ∥ 2 m + 448 B 2 F m − 2 γ / (2 γ + d ) 3 + 448 B 2 F log N 3 m ≤ 2 m − 2 γ / (2 γ + d ) + 8    b f − f j ′    2 m + 8    b f − f ∗    2 m + 448 B 2 F m − 2 γ / (2 γ + d ) 3 + 448 B 2 F log N 3 m ≤ 10 m − 2 γ / (2 γ + d ) + 8    b f − f ∗    2 m + 448 B 2 F m − 2 γ / (2 γ + d ) 3 + 448 B 2 F log N ( m − γ / (2 γ + d ) , F , ∥ · ∥ L ∞ ( µ ) ) 3 m (28) 42 with probabilit y at least 1 − exp  − m d/ (2 s + d )  . Step 2: Upp er Bound of    b f − f ∗    2 m . Denote δ = max n 2 8 σ m − γ / (2 γ + d ) , 2    b f − f 0    m o . Given the observed v ariables { X i } m i =1 , w e b ound    b f − f ∗    2 m b y considering t wo cases. In the first case, we supp ose that    b f − f ∗    m ≤ δ holds. By the definition of b f , w e ha ve for an y f ∈ F ,    Y − b f    2 m ≤ ∥ Y − f ∥ 2 m . By substituting Y i = f 0 ( X i ) + ξ i , w e obtain the base inequalit y as    b f − f 0    2 m ≤ ∥ f − f 0 ∥ 2 m + 2 m m X i =1 ξ i  ˆ f ( X i ) − f ( X i )  , f ∈ F . Setting f = f ∗ in the ab o ve inequality , w e ha ve    b f − f 0    2 m ≤ ∥ f ∗ − f 0 ∥ 2 m + 2 m m X i =1 ξ i  b f ( X i ) − f ∗ ( X i )  , from whic h w e deriv e that    b f − f ∗    2 m ≤ 2    b f − f 0    2 m + 2 ∥ f ∗ − f 0 ∥ 2 m ≤ 4 ∥ f ∗ − f 0 ∥ 2 m + 4 sup g ∈ G δ      1 n n X i =1 ξ i g ( X i )      , (29) where G δ :=  g : g = f − f ′ , ∥ g ∥ L ∞ ( µ ) ≤ δ, f , f ′ ∈ F  . F or g ∈ G δ , denote Z g := 1 n P n i =1 ξ i g ( X i ). It is easy to see that { Z g } g ∈G δ is a centred Gaussian process and E | Z g | 2 = V ar( Z g ) = σ 2 m 2 m X i =1 g 2 ( X i ) ≤ σ 2 δ 2 m . According to Lemma 22, P sup g ∈ G δ      1 m m X i =1 ξ i g ( X i )      ≥ E " sup g ∈ G δ      1 m m X i =1 ξ i g ( X i )      # + 2 − 7 δ 2 ! ≤ exp  − mδ 2 2 15 σ 2  ≤ exp  − 2 m d/ (2 γ + d )  . 
(30) F or g ∈ G δ , denote ¯ Z g := 1 √ m P m i =1 g ( X i ) ξ i σ . Since { ξ i } m i =1 are centred Gaussian v ari- ables with v ariances σ 2 , w e ha ve E e λ ( ¯ Z g − ¯ Z h ) = E e λ 1 √ m P m i =1 [ g ( X i ) − h ( X i )] ξ i σ = m Y i =1 E e λ 1 √ n [ g ( X i ) − h ( X i )] ξ i σ 43 ≤ m Y i =1 e λ 2 1 2 m [ g ( X i ) − h ( X i )] 2 ≤ e λ 2 ∥ g − h ∥ L ∞ ( µ ) / 2 , whic h implies that { ¯ Z g } g ∈G δ is a sub-gaussian pro cess relative to distance ∥ · ∥ L ∞ ( µ ) . Applying Lemma 23 to { ¯ Z g } g ∈G δ and noting that N ( ς , G δ , ∥ · ∥ L ∞ ( µ ) ) ≤ N ( ς / 2 , F , ∥ · ∥ L ∞ ( µ ) ) 2 , w e ha ve E sup g ∈G δ | ¯ Z g | ≤ 4 √ 2 Z δ 0 q log 2 N ( ς , G δ , ∥ · ∥ L ∞ ( µ ) ) dς ≤ 4 √ 2 Z δ 0 q log 2 N ( ς / 2 , F , ∥ · ∥ L ∞ ( µ ) ) 2 dς . It follo ws directly that E " sup g ∈G δ      1 m m X i =1 ξ i g ( X i )      # ≤ 4 √ 2 σ √ m Z δ 0 q log 2 N ( ς / 2 , F , ∥ · ∥ L ∞ ( µ ) ) 2 dς . (31) Com bining (30) and (31) yields that sup g ∈ G δ      1 m m X i =1 ξ i g ( X i )      ≤ 4 √ 2 σ √ m Z δ 0 q log 2 N ( ς / 2 , F , ∥ · ∥ L ∞ ( µ ) ) 2 dς + 2 − 7 δ 2 = 4 √ 2 σ δ √ m Z 1 0 q log 2 N ( δ ς / 2 , F , ∥ · ∥ L ∞ ( µ ) ) 2 dς + 2 − 7 δ 2 ≤ 2 10 σ 2 m  Z 1 0 q log 2 N ( δ ς / 2 , F , ∥ · ∥ L ∞ ( µ ) ) 2 dς  2 + 2 − 6 δ 2 ≤ 2 10 σ 2 m  Z 1 0 q log 2 N (2 7 σ m − γ / (2 γ + d ) ς , F , ∥ · ∥ L ∞ ( µ ) ) 2 dς  2 + 2 − 6 δ 2 = 8 σ m ( γ + d ) / (2 γ + d ) Z 2 7 σ m − γ / (2 γ + d ) 0 q log 2 N ( ς , F , ∥ · ∥ L ∞ ( µ ) ) 2 dς ! 2 + 2 − 6 δ 2 with probability at least 1 − exp  − 2 m d/ (2 γ + d )  . Here w e use the inequality ab ≤ a 2 2 + b 2 2 . Plugging the ab o ve inequality into (29) yields    b f − f ∗    2 m ≤ 4 ∥ f ∗ − f 0 ∥ 2 m + 32 σ m ( γ + d ) / (2 γ + d ) Z 2 7 σ m − γ / (2 γ + d ) 0 q log 2 N ( ς , F , ∥ · ∥ L ∞ ( µ ) ) 2 dς ! 2 + 1 16 δ 2 ≤ 4 ∥ f ∗ − f 0 ∥ 2 m + 32 σ m ( γ + d ) / (2 γ + d ) Z 2 7 σ m − γ / (2 γ + d ) 0 q log 2 N ( ς , F , ∥ · ∥ L ∞ ( µ ) ) 2 dς ! 2 + 2 12 σ 2 m − 2 γ / (2 γ + d ) + 1 4    b f − f 0    2 m ≤ 4 ∥ f ∗ − f 0 ∥ 2 m + 32 σ m ( γ + d ) / (2 γ + d ) Z 2 7 σ m − γ / (2 γ + d ) 0 q log 2 N ( ς , F , ∥ · ∥ L ∞ ( µ ) ) 2 dς ! 2 + 2 12 σ 2 m − 2 γ / (2 γ + d ) + 1 2    b f − f ∗    2 m + 1 2 ∥ f ∗ − f 0 ∥ 2 m 44 with probabilit y at least 1 − exp  − 2 m d/ (2 γ + d )  , from which we can immediately obtain    b f − f ∗    2 m ≤ 9 ∥ f ∗ − f 0 ∥ 2 m + 2 13 σ 2 m − 2 γ / (2 γ + d ) + 64 σ m ( γ + d ) / (2 γ + d ) Z 2 7 σ m − γ / (2 γ + d ) 0 q log 2 N ( ς , F , ∥ · ∥ L ∞ ( µ ) ) 2 dς ! 2 (32) with probabilit y at least 1 − exp  − 2 m d/ (2 γ + d )  . In the second case, w e supp ose that    b f − f ∗    m ≥ δ ≥ 2    b f − f 0    m holds. It follows that    b f − f ∗    2 m ≤ 2    b f − f 0    2 m + 2 ∥ f ∗ − f 0 ∥ 2 m ≤ 1 2    b f − f ∗    2 m + 2 ∥ f ∗ − f 0 ∥ 2 m . whic h implies    b f − f ∗    2 m ≤ 4 ∥ f ∗ − f 0 ∥ 2 m . Hence in this case, the inequalit y (32) still holds. Step 3: Combine the Results. F rom the conclusion of (28) in Step 1 and (32) in Step 2, w e obtain    b f − f ∗    2 L 2 ( µ ) ≤  448 B 2 F 3 + 2 16 σ 2 + 10  m − 2 γ / (2 γ + d ) + 448 B 2 F log N ( m − γ / (2 γ + d ) , F , ∥ · ∥ L ∞ ( µ ) ) 3 m + 72 ∥ f ∗ − f 0 ∥ 2 m + 2 9 σ m ( γ + d ) / (2 γ + d ) Z 2 7 σ m − γ / (2 γ + d ) 0 q log 2 N ( ς , F , ∥ · ∥ L ∞ ( µ ) ) 2 dς ! 2 with probabilit y at least 1 − 2 exp  − m d/ (2 γ + d )  . W e complete the pro of b y com bining this inequalit y and (24). 
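The localization argument in Step 1 rests on Bernstein's inequality (Lemma 21) applied to the bounded, centred variables $Z_i$. The short Monte Carlo sketch below (our own illustration, not part of the proof) compares the simulated tail of an empirical mean of bounded, centred variables with the bound of Lemma 21; the choice of distribution, sample size and thresholds is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, trials = 200, 50000
c, sigma2 = 1.0, 1.0 / 3.0        # |Z_i| <= c and Var(Z_i) = 1/3 for Z_i ~ Uniform[-1, 1]

Z = rng.uniform(-1.0, 1.0, size=(trials, m))   # centred, bounded i.i.d. samples
means = Z.mean(axis=1)

for u in (0.05, 0.10, 0.15):
    empirical = np.mean(np.abs(means) >= u)
    bound = np.exp(-m * u**2 / (2.0 * sigma2 + 2.0 * c * u / 3.0))  # Lemma 21
    print(f"u = {u:.2f}   simulated tail = {empirical:.4f}   Bernstein bound = {bound:.4f}")
```

The simulated tail stays below the bound in each case; the bound is loose for small thresholds but captures the exponential decay in $mu^2$ that drives the rates in Step 1.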
7 Proof of Lemma 4: Estimation of the Lipschitz Constant

In Lemmas 24, 25 and 27, we derive several key properties of the softmax function.

Lemma 24. For any $x\in\mathbb{R}^d$, there holds $\|\sigma_S(x)\|_2\le 1$.

Proof. Since $[\sigma_S(x)]_i>0$ for $i\in[d]$, we have
$$\|\sigma_S(x)\|_2^2=\sum_{i=1}^{d}[\sigma_S(x)]_i^2\le\left(\sum_{i=1}^{d}[\sigma_S(x)]_i\right)^2=1.$$

Lemma 25. Let $x\in\mathbb{R}^d$ and $y=\sigma_S(x)$. Let $\sigma_S'$ be the Jacobian matrix of $\sigma_S$. There holds $\sigma_S'(x)=\mathrm{diag}(y)-yy^{\top}$.

Proof. For $i,j\in[d]$, direct calculation yields
$$\frac{\partial y_i}{\partial x_j}=\frac{\partial}{\partial x_j}\frac{e^{x_i}}{\sum_{k=1}^{d}e^{x_k}}=\frac{e^{x_i}\delta_{ij}}{\sum_{k=1}^{d}e^{x_k}}-\frac{e^{x_i}}{\sum_{k=1}^{d}e^{x_k}}\cdot\frac{e^{x_j}}{\sum_{k=1}^{d}e^{x_k}}.$$

Lemma 26 (Mean-value theorem for vector-valued functions). Let $u,v\in\mathbb{N}_{\ge 1}$. Let $S$ be an open subset of $\mathbb{R}^u$ and assume that $f:S\to\mathbb{R}^v$ is differentiable at each point of $S$. Let $x$ and $y$ be two points in $S$ such that $L(x,y)\subseteq S$, where $L(x,y):=\{tx+(1-t)y:t\in[0,1]\}$. Then for every vector $a$ in $\mathbb{R}^v$, there is a point $z$ in $L(x,y)$ such that
$$a^{\top}[f(y)-f(x)]=a^{\top}f'(z)(y-x),$$
where $f'$ is the Jacobian matrix of $f$.

Proof. See, for example, [1, Theorem 12.9].

Lemma 27. For any $x,\tilde{x}\in\mathbb{R}^d$, there holds $\|\sigma_S(\tilde{x})-\sigma_S(x)\|_2\le 2\|\tilde{x}-x\|_2$.

Proof. Choosing $a=\sigma_S(\tilde{x})-\sigma_S(x)$ in Lemma 26, we obtain
$$[\sigma_S(\tilde{x})-\sigma_S(x)]^{\top}[\sigma_S(\tilde{x})-\sigma_S(x)]=[\sigma_S(\tilde{x})-\sigma_S(x)]^{\top}\sigma_S'(z)(\tilde{x}-x)$$
for some $z\in\mathbb{R}^d$. It follows that
$$\|\sigma_S(\tilde{x})-\sigma_S(x)\|_2^2\le\|\sigma_S(\tilde{x})-\sigma_S(x)\|_2\,\|\sigma_S'(z)\|_2\,\|\tilde{x}-x\|_2,$$
which implies $\|\sigma_S(\tilde{x})-\sigma_S(x)\|_2\le\|\sigma_S'(z)\|_2\,\|\tilde{x}-x\|_2$. Denote $y=\sigma_S(z)$. By Lemma 24 and Lemma 25, we have
$$\|\sigma_S'(z)\|_2=\|\mathrm{diag}(y)-yy^{\top}\|_2\le\|\mathrm{diag}(y)\|_2+\|y\|_2^2\le 1+1=2.$$

In the following three lemmas (Lemmas 28–30), we study properties of a single feedforward block $F_{FF}:\mathbb{R}^{d^{(\mathrm{in})}_{FF}\times n}\to\mathbb{R}^{d^{(\mathrm{out})}_{FF}\times n}$ with depth $L$, width $W$ and weight bound $B_{FF}$, taking the form
$$F_0=X;\qquad F_l=\sigma_R(W_lF_{l-1}+B_l),\ l\in\{1,2,\dots,L-1\};\qquad F_{FF}=W_LF_{L-1}+B_L,$$
of a single self-attention layer $F_{SA}:\mathbb{R}^{d_{SA}\times n}\to\mathbb{R}^{d_{SA}\times n}$ with head number $H$, head size $S$ and weight bound $B_{SA}$, taking the form
$$F_{SA}(Y)=Y+\sum_{h=1}^{H}W^{(h)}_OW^{(h)}_VY\,\sigma_S\bigl(Y^{\top}W^{(h)\top}_KW^{(h)}_QY\bigr),$$
and of an embedding layer $F_{EB}:\mathbb{R}^{d_{\mathrm{in}}\times n}\to\mathbb{R}^{d_{EB}\times n}$ with weight bound $B_{EB}$, taking the form $F_{EB}(Z)=W_{EB}Z+B_{EB}$. Without loss of generality, we assume $B_{FF},B_{SA},B_{EB}\ge 1$.

Lemma 28. For any $X\in\mathbb{R}^{d^{(\mathrm{in})}_{FF}\times n}$, $Y\in\mathbb{R}^{d_{SA}\times n}$, $Z\in\mathbb{R}^{d_{\mathrm{in}}\times n}$, there holds
$$\|F_{FF}(X)\|_F\le 2\sqrt{d^{(\mathrm{in})}_{FF}d^{(\mathrm{out})}_{FF}n}\,LW^{L-1}B^L_{FF}\|X\|_F,\qquad \|F_{SA}(Y)\|_F\le 2d_{SA}\sqrt{n}\,HSB^2_{SA}\|Y\|_F,\qquad \|F_{EB}(Z)\|_F\le 2\sqrt{d_{EB}d_{\mathrm{in}}n}\,B_{EB}\|Z\|_F.$$

Proof. For $F_{FF}$, by definition we have
$$\|F_{FF}(X)\|_F=\|W_LF_{L-1}+B_L\|_F\le\|W_L\|_F\|F_{L-1}\|_F+\|B_L\|_F\le\|W_L\|_F\|W_{L-1}F_{L-2}+B_{L-1}\|_F+\|B_L\|_F\le\|W_L\|_F\|W_{L-1}\|_F\|F_{L-2}\|_F+\|W_L\|_F\|B_{L-1}\|_F+\|B_L\|_F,$$
where we use the property $|\sigma_R(x)|\le|x|$ for any $x\in\mathbb{R}$. Repeating this process, we obtain
$$\|F_{FF}(X)\|_F\le\sum_{l=1}^{L}\left(\prod_{l'=l+1}^{L}\|W_{l'}\|_F\right)\|B_l\|_F+\left(\prod_{l=1}^{L}\|W_l\|_F\right)\|X\|_F\le\sqrt{d^{(\mathrm{out})}_{FF}n}\,LW^{L-1}B^L_{FF}+\sqrt{d^{(\mathrm{in})}_{FF}d^{(\mathrm{out})}_{FF}}\,W^{L-1}B^L_{FF}\|X\|_F\le 2\sqrt{d^{(\mathrm{in})}_{FF}d^{(\mathrm{out})}_{FF}n}\,LW^{L-1}B^L_{FF}\|X\|_F.$$
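Before turning to the self-attention and embedding parts of the argument, here is a brief numerical sanity check (our own illustration, not part of the proof) of the softmax properties above: the Jacobian formula of Lemma 25 and the Lipschitz bound of Lemma 27. The observed ratios stay well below the constant $2$, which is all the proof of Lemma 4 requires.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())             # shift-invariant, numerically stable
    return e / e.sum()

def softmax_jacobian(x):
    y = softmax(x)
    return np.diag(y) - np.outer(y, y)  # Lemma 25: sigma_S'(x) = diag(y) - y y^T

rng = np.random.default_rng(2)
d, worst_ratio = 8, 0.0
for _ in range(10000):
    x, x_tilde = rng.normal(scale=3.0, size=d), rng.normal(scale=3.0, size=d)
    ratio = np.linalg.norm(softmax(x_tilde) - softmax(x)) / np.linalg.norm(x_tilde - x)
    worst_ratio = max(worst_ratio, ratio)

J = softmax_jacobian(rng.normal(size=d))
print(f"largest observed Lipschitz ratio: {worst_ratio:.3f}  (Lemma 27 bound: 2)")
print(f"Jacobian spectral norm at a random point: {np.linalg.norm(J, 2):.3f}  (bound used in the proof: 2)")
```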
F or F S A , b y definition we hav e ∥ F S A ( Y ) ∥ F ≤ ∥ Y ∥ F + H X h =1    W ( h ) O    F    W ( h ) V    F ∥ Y ∥ F    σ S  Y ⊤ W ( h ) ⊤ K W ( h ) Q Y     F ≤ ( d S A √ nH S B 2 S A + 1) ∥ Y ∥ F ≤ 2 d S A √ nH S B 2 S A ∥ Y ∥ F , where in the second step we use Lemma 24. The b ound of F E B can b e obtained directly from the definition. Lemma 29. L et ς ∈ R > 0 . L et e F F F , F F F b e two fe e dforwar d blo cks with e ach tr ainable p ar ameter differing by at most ς . L et e F S A , F S A b e two self-attention layers with e ach tr ainable p ar ameter also differing by at most ς . L et e F E B , F E B b e two emb e dding layers with e ach tr ainable p ar ameter also differing by at most ς . F or any X ∈ R d ( in ) F F × n , Y ∈ R d S A × n , Z ∈ R d in × n , ther e holds    e F F F ( X ) − F F F ( X )    F ≤ 4 d ( in ) F F  d ( out ) F F  3 / 2 nL 2 W 2 L − 3 / 2 B 2 L − 1 F F ∥ X ∥ F ς ,    e F S A ( Y ) − F S A ( Y )    F ≤ 3 d 2 S A √ nH S 2 B 3 S A ∥ Y ∥ 3 F ς ,    e F E B ( Z ) − F E B ( Z )    F ≤ 2 p d E B d in n ∥ Z ∥ F ς . Pr o of. F or F F F , b y definition we hav e    e F F F ( X ) − F F F ( X )    F 47 =    f W L e F L − 1 + e B L − W L F L − 1 − B L    F ≤    f W L    F    e F L − 1 − F L − 1    F +    f W L − W L    F ∥ F L − 1 ∥ F +    e B L − B L    F ≤    f W L    F    f W L − 1 e F L − 2 + e B L − 1 − W L − 1 F L − 2 − B L − 1    F +    f W L − W L    F ∥ F L − 1 ∥ F +    e B L − B L    F ≤    f W L    F    f W L − 1    F    e F L − 2 − F L − 2    F +    f W L    F    f W L − 1 − W L − 1    F ∥ F L − 2 ∥ F +    f W L    F    e B L − 1 − B L − 1    F +    f W L − W L    F ∥ F L − 1 ∥ F +    e B L − B L    F , where in the third step we use the fact that σ R is 1-Lipschitz. Rep eating this pro cess, we obtain    e F F F ( X ) − F F F ( X )    F ≤ L X l =1 L Y l ′ = l +1    f W l ′    F !     f W l − W l    F ∥ F l − 1 ∥ F +    e B l − B l    F  . (33) F rom the deriv ation of Lemma 24, w e can find that for l ∈ [ L ], ∥ F l ∥ F ≤ 2 q d ( in ) F F d ( out ) F F nLW L − 1 B L F F ∥ X ∥ F . (34) Plugging (34) in to (33), we obtain    e F F F ( X ) − F F F ( X )    F ≤ 4 d ( in ) F F  d ( out ) F F  3 / 2 nL 2 W 2 L − 3 / 2 B 2 L − 1 F F ∥ X ∥ F ς . 
F or F S A , b y definition we hav e    e F S A ( Y ) − F S A ( Y )    F ≤ H X h =1    f W ( h ) O f W ( h ) V Y σ S  Y ⊤ f W ( h ) ⊤ K f W ( h ) Q Y  − W ( h ) O W ( h ) V Y σ S  Y ⊤ W ( h ) ⊤ K W ( h ) Q Y     F ≤ H X h =1    f W ( h ) O f W ( h ) V Y σ S  Y ⊤ f W ( h ) ⊤ K f W ( h ) Q Y  − f W ( h ) O f W ( h ) V Y σ S  Y ⊤ W ( h ) ⊤ K W ( h ) Q Y     F + H X h =1    f W ( h ) O f W ( h ) V Y σ S  Y ⊤ W ( h ) ⊤ K W ( h ) Q Y  − W ( h ) O W ( h ) V Y σ S  Y ⊤ W ( h ) ⊤ K W ( h ) Q Y     F ≤ H X h =1    f W ( h ) O    F    f W ( h ) V    F ∥ Y ∥ F    σ S  Y ⊤ f W ( h ) ⊤ K f W ( h ) Q Y  − σ S  Y ⊤ W ( h ) ⊤ K W ( h ) Q Y     F + H X h =1    f W ( h ) O f W ( h ) V − W ( h ) O W ( h ) V    F ∥ Y ∥ F    σ S  Y ⊤ W ( h ) ⊤ K W ( h ) Q Y     F ≤ 2 H X h =1    f W ( h ) O    F    f W ( h ) V    F ∥ Y ∥ 3 F    f W ( h ) ⊤ K f W ( h ) Q − W ( h ) ⊤ K W ( h ) Q    F + √ n H X h =1    f W ( h ) O f W ( h ) V − W ( h ) O W ( h ) V    F ∥ Y ∥ F 48 ≤ 2 H X h =1    f W ( h ) O    F    f W ( h ) V    F ∥ Y ∥ 3 F     f W ( h ) K    F    f W ( h ) Q − W ( h ) Q    F +    f W ( h ) K − W ( h ) K    F    W ( h ) Q    F  + √ n H X h =1     f W ( h ) O    F    f W ( h ) V − W ( h ) V    F +    f W ( h ) O − W ( h ) O    F    W ( h ) V    F  ∥ Y ∥ F ≤ 3 d 2 S A √ nH S 2 B 3 S A ∥ Y ∥ 3 F ς , where in the fourth step w e use Lemma 24 and Lemma 27. F or F E B , by definition w e ha ve    e F E B ( Z ) − F E B ( Z )    F =    f W E B Z + e B E B − W E B Z − B E B    F ≤    f W E B − W E B    F ∥ Z ∥ F +    e B E B − B E B    F ≤ 2 p d E B d in n ∥ Z ∥ F ς . Lemma 30. L et E ∈ R > 0 . L et X , f X ∈ R d ( in ) F F × n , Y , e Y ∈ R d S A × n . Supp ose ∥ Y ∥ F ,    e Y    F ≤ E . Ther e holds    F F F  f X  − F F F ( X )    F ≤ q d ( in ) F F d ( out ) F F W L − 1 B L F F    f X − X    F ,    F S A  e Y  − F S A ( Y )    F ≤ 6 d 2 S A √ nE 2 H S 2 B 4 S A    e Y − Y    F . Pr o of. F or F F F , b y definition we hav e    F F F  f X  − F F F ( X )    F ≤    W L F L − 1  f X  − W L F L − 1 ( X )    F ≤ ∥ W L ∥ F    F L − 1  f X  − F L − 1 ( X )    F = ∥ W L ∥ F    σ R  W L − 1 F L − 2  f X  + B L − 1  − σ R ( W L − 1 F L − 2 ( X ) + B L − 1 )    F ≤ ∥ W L ∥ F ∥ W L − 1 ∥ F    F L − 2  f X  − F L − 2 ( X )    F , where in the final step we use the fact that σ R is 1-Lipschitz. Rep eating this process, w e obtain    F F F  f X  − F F F ( X )    F ≤ L Y l =1 ∥ W l ∥ F !    f X − X    F ≤ q d ( in ) F F d ( out ) F F W L − 1 B L F F    f X − X    F . F or F S A , b y definition we hav e    F S A  e Y  − F S A ( Y )    F ≤    e Y − Y    F + H X h =1    W ( h ) O W ( h ) V e Y σ S  e Y ⊤ W ( h ) ⊤ K W ( h ) Q e Y  − W ( h ) O W ( h ) V Y σ S  Y ⊤ W ( h ) ⊤ K W ( h ) Q Y     F . 
49 The term to b e summed can b e bounded b y    W ( h ) O W ( h ) V e Y σ S  e Y ⊤ W ( h ) ⊤ K W ( h ) Q e Y  − W ( h ) O W ( h ) V Y σ S  Y ⊤ W ( h ) ⊤ K W ( h ) Q Y     F ≤    W ( h ) O W ( h ) V e Y σ S  e Y ⊤ W ( h ) ⊤ K W ( h ) Q e Y  − W ( h ) O W ( h ) V e Y σ S  Y ⊤ W ( h ) ⊤ K W ( h ) Q Y     F +    W ( h ) O W ( h ) V e Y σ S  Y ⊤ W ( h ) ⊤ K W ( h ) Q Y  − W ( h ) O W ( h ) V Y σ S  Y ⊤ W ( h ) ⊤ K W ( h ) Q Y     F ≤    W ( h ) O    F    W ( h ) V    F    e Y    F    σ S  e Y ⊤ W ( h ) ⊤ K W ( h ) Q e Y  − σ S  Y ⊤ W ( h ) ⊤ K W ( h ) Q Y     F +    W ( h ) O    F    W ( h ) V    F    e Y − Y    F    σ S  Y ⊤ W ( h ) ⊤ K W ( h ) Q Y     F ≤ 2    W ( h ) O    F    W ( h ) V    F    e Y    F    e Y ⊤ W ( h ) ⊤ K W ( h ) Q e Y − Y ⊤ W ( h ) ⊤ K W ( h ) Q Y    F + √ n    W ( h ) O    F    W ( h ) V    F    e Y − Y    F ≤ 2    W ( h ) O    F    W ( h ) V    F    e Y    2 F    W ( h ) K    F    W ( h ) Q    F    e Y − Y    F + 2    W ( h ) O    F    W ( h ) V    F    e Y    F ∥ Y ∥ F    W ( h ) K    F    W ( h ) Q    F    e Y − Y    F + √ n    W ( h ) O    F    W ( h ) V    F    e Y − Y    F ≤ (4 d 2 S A E 2 S 2 B 4 S A + d S A √ nS B 2 S A )    e Y − Y    F ≤ 5 d 2 S A √ nE 2 S 2 B 4 S A    e Y − Y    F , where in the third step we use Lemma 24 and Lemma 27. Hence    F S A  e Y  − F S A ( Y )    F ≤ 6 d 2 S A √ nE 2 H S 2 B 4 S A    e Y − Y    F . W e pro ve Lemma 4 b y emplo ying Lemmas 28 - 30. Pr o of of L emma 4. W e examine the difference of f e T , f T ∈ F T , where each trainable pa- rameter in f e T and f T differs b y at most ς . Since   f e T ( X ) − f T ( X )   =    D e T ( X ) − T ( X ) , E 11 E    ≤    e T ( X ) − T ( X )    F , (35) what we need is an upp er b ound of    e T ( X ) − T ( X )    F . W e split    e T ( X ) − T ( X )    F in the follo wing w ay:    e T ( X ) − T ( X )    F =    e F ( K ) F F ◦ e F ( K ) S A ◦ e F ( K − 1) F F ◦ · · · ◦ e F (1) F F ◦ e F (1) S A ◦ e F (0) F F ◦ e F E B ( X ) − F ( K ) F F ◦ F ( K ) S A ◦ F ( K − 1) F F ◦ · · · ◦ F (1) F F ◦ F (1) S A ◦ F (0) F F ◦ F E B ( X )    F ≤    e F ( K ) F F ◦ e F ( K ) S A ◦ e F ( K − 1) F F ◦ · · · ◦ e F (1) F F ◦ e F (1) S A ◦ e F (0) F F ◦ e F E B ( X ) − e F ( K ) F F ◦ e F ( K ) S A ◦ e F ( K − 1) F F ◦ · · · ◦ e F (1) F F ◦ e F (1) S A ◦ e F (0) F F ◦ F E B ( X )    F 50 +    e F ( K ) F F ◦ e F ( K ) S A ◦ e F ( K − 1) F F ◦ · · · ◦ e F (1) F F ◦ e F (1) S A ◦ e F (0) F F ◦ F E B ( X ) − e F ( K ) F F ◦ e F ( K ) S A ◦ e F ( K − 1) F F ◦ · · · ◦ e F (1) F F ◦ e F (1) S A ◦ F (0) F F ◦ F E B ( X )    F +    e F ( K ) F F ◦ e F ( K ) S A ◦ e F ( K − 1) F F ◦ · · · ◦ e F (1) F F ◦ e F (1) S A ◦ F (0) F F ◦ F E B ( X ) − e F ( K ) F F ◦ e F ( K ) S A ◦ e F ( K − 1) F F ◦ · · · ◦ e F (1) F F ◦ F (1) S A ◦ F (0) F F ◦ F E B ( X )    F + · · · +    e F ( K ) F F ◦ e F ( K ) S A ◦ F ( K − 1) F F ◦ · · · ◦ F (1) F F ◦ F (1) S A ◦ F (0) F F ◦ F E B ( X ) − e F ( K ) F F ◦ F ( K ) S A ◦ F ( K − 1) F F ◦ · · · ◦ F (1) F F ◦ F (1) S A ◦ F (0) F F ◦ F E B ( X )    F +    e F ( K ) F F ◦ F ( K ) S A ◦ F ( K − 1) F F ◦ · · · ◦ F (1) F F ◦ F (1) S A ◦ F (0) F F ◦ F E B ( X ) − F ( K ) F F ◦ F ( K ) S A ◦ F ( K − 1) F F ◦ · · · ◦ F (1) F F ◦ F (1) S A ◦ F (0) F F ◦ F E B ( X )    F . 
(36) W e first handle the term that difference app ears in the k -th feedforw ard blo c k:    e F ( K ) F F ◦ · · · ◦ e F ( k +1) S A ◦ e F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X ) − e F ( K ) F F ◦ · · · ◦ e F ( k +1) S A ◦ F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X )    F , where k ∈ { 0 , 1 , · · · , K } . Applying Lemma 28 rep eatedly , w e can deriv e that    e F ( K − 1) F F ◦ · · · ◦ e F ( k +1) S A ◦ e F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X )    F    e F ( K − 1) F F ◦ · · · ◦ e F ( k +1) S A ◦ F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X )    F    ≤ 4 K n K +1 / 2 d 1 / 2 in d 0 K − 1 Y k =1 d 2 k ! d 1 / 2 K H K − 1 S K − 1 L K W ( L − 1) K B E B B LK F F B 2( K − 1) S A . Using the ab o ve estimates and Lemma 30, w e ha ve    e F ( K ) F F ◦ · · · ◦ e F ( k +1) S A ◦ e F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X ) − e F ( K ) F F ◦ · · · ◦ e F ( k +1) S A ◦ F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X )    F ≤ p d K d out M L − 1 B L F F    e F ( K ) S A ◦ · · · ◦ e F ( k +1) S A ◦ e F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X ) − e F ( K ) S A ◦ · · · ◦ e F ( k +1) S A ◦ F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X )    F ≤ 6 · 4 2 K n 2 K +3 / 2 d in d 2 0 K − 1 Y k =1 d 4 k ! d 7 / 2 K d 1 / 2 out H 2 K − 1 S 2 K L 2 K W ( L − 1)(2 K +1) B 2 E B B L (2 K +1) F F B 4 K S A    e F ( K − 1) F F ◦ · · · ◦ e F ( k +1) S A ◦ e F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X ) − e F ( K − 1) F F ◦ · · · ◦ e F ( k +1) S A ◦ F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X )    F . 51 Rep eating this pro cess and making use of the follo wing estimates (obtained by applying Lemma 28 rep eatedly):    e F ( k ′ ) F F ◦ · · · ◦ e F ( k +1) S A ◦ e F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X )    F    e F ( k ′ ) F F ◦ · · · ◦ e F ( k +1) S A ◦ F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X )    F    ≤ 4 k ′ +1 n k ′ +3 / 2 d 1 / 2 in d 0 k ′ Y k =1 d 2 k ! d 1 / 2 k ′ +1 H k ′ S k ′ L k ′ +1 W ( L − 1)( k ′ +1) B E B B L ( k ′ +1) F F B 2 k ′ S A , where k ′ ∈ { k + 1 , · · · , K − 1 } , w e deriv e that    e F ( K ) F F ◦ · · · ◦ e F ( k +1) S A ◦ e F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X ) − e F ( K ) F F ◦ · · · ◦ e F ( k +1) S A ◦ F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X )    F ≤ 6 K − k 4 ( K − k )( K − k +1) n ( K − k )( K − k +5 / 2) d K − k in d 2( K − k ) 0 K − 1 Y k ′ =1 k ′ Y k ′′ =1 d 4 k ′′ ! K Y k ′ = k +1 d k ′ ! d 5 / 2 k +1 K Y k ′ = k +2 d 3 k ′ ! d 1 / 2 out H ( K − k ) 2 S ( K − k )( K − k +1) L ( K − k )( K − k +1) W ( L − 1)( K − k )( K − k +2) B 2( K − k ) E B B L ( K − k )( K − k +2) F F B 2( K − k )( K − k +1) S A    e F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X ) − F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X )    F . (37) Applying Lemma 28 rep eatedly , w e ha v e    F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X )    F ≤ 4 k +1 / 2 n k +1 d 1 / 2 in d 0 k − 1 Y k ′ =1 d 2 k ′ ! d 3 / 2 k H k S k L k W ( L − 1) k B E B B Lk F F B 2 k S A . Using the ab o ve estimate and Lemma 29, w e ha ve    e F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X ) − F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X )    F ≤ 4 k +3 / 2 n k +2 d 1 / 2 in d 0 k − 1 Y k ′ =1 d 2 k ′ ! 
d 5 / 2 k d 3 / 2 k +1 H k S k L k +2 W ( k +2) L − k − 3 / 2 B E B B ( k +2) L − 1 F F B 2 k S A ς . (38) Plugging (38) in to (37) and simplifying the expression, w e deriv e that    e F ( K ) F F ◦ · · · ◦ e F ( k +1) S A ◦ e F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X ) − e F ( K ) F F ◦ · · · ◦ e F ( k +1) S A ◦ F ( k ) F F ◦ F ( k ) S A ◦ · · · ◦ F (0) F F ◦ F E B ( X )    F ≤ 6 K 4 K 2 + K +3 / 2 n K 2 +5 K/ 2+2 d K +1 / 2 in d 2 K +1 0 K Y k ′ =1 d 4( K − k ′ )+5 k ′ ! d 1 / 2 out 52 H K 2 S K 2 + K L K 2 + K +2 W ( L − 1) K ( K +2)+2 L − 3 / 2 B 2 K +1 E B B LK ( K +2)+2 L − 1 F F B 2 K ( K +1) S A ς . (39) In a similar manner, w e can deriv e an upp er b ound for the term that difference app ears in the k -th self-atten tion la yer ( k ∈ { 1 , 2 , · · · , K } ):    e F ( K ) F F ◦ · · · ◦ e F ( k ) F F ◦ e F ( k ) S A ◦ F ( k − 1) F F ◦ · · · ◦ F (0) F F ◦ F E B ( X ) − e F ( K ) F F ◦ · · · ◦ e F ( k ) F F ◦ F ( k ) S A ◦ F ( k − 1) F F ◦ · · · ◦ F (0) F F ◦ F E B ( X )    F ≤ 3 · 6 K − 1 4 K 2 + K +4 n K 2 +5 K/ 2+3 d K +1 / 2 in d 2 K +1 0 K Y k ′ =1 d 4( K − k ′ )+6 k ′ ! d 1 / 2 out H K 2 + K − 1 S K 2 +2 K +1 L K 2 +2 K +3 W ( L − 1)( K 2 +3 K +3) B 2 K +1 E B B L ( K 2 +3 K +3) F F B 2( K 2 +2 K +1) S A ς , (40) and the term that difference app ears in the em b edding lay er:    e F ( K ) F F ◦ e F ( K ) S A ◦ e F ( K − 1) F F ◦ · · · ◦ e F (1) F F ◦ e F (1) S A ◦ e F (0) F F ◦ e F E B ( X ) − e F ( K ) F F ◦ e F ( K ) S A ◦ e F ( K − 1) F F ◦ · · · ◦ e F (1) F F ◦ e F (1) S A ◦ e F (0) F F ◦ F E B ( X )    F ≤ 3 · 6 K − 1 4 K 2 + K +4 n K 2 +5 K/ 2+3 d K +1 / 2 in d 2 K +1 0 K Y k ′ =1 d 4( K − k ′ )+6 k ′ ! d 1 / 2 out H K 2 + K − 1 S K 2 +2 K +1 L K 2 +2 K +3 W ( L − 1)( K 2 +3 K +3) B 2 K +1 E B B L ( K 2 +3 K +3) F F B 2( K 2 +2 K +1) S A ς . (41) Plugging (39)-(41) in to (36) yields    e T ( X ) − T ( X )    F ≤ (2 K + 2)6 K 4 K 2 + K +4 n K 2 +5 K/ 2+3 d K +1 / 2 in d 2 K +1 0 K Y k ′ =1 d 4( K − k ′ )+6 k ′ ! d 1 / 2 out H K 2 + K − 1 S K 2 +2 K +1 L K 2 +2 K +3 W ( L − 1)( K 2 +3 K +3) B 2 K +1 E B B L ( K 2 +3 K +3) F F B 2( K 2 +2 K +1) S A ς . Plugging this result in to (35), w e finally obtain   f e T ( X ) − f T ( X )   ≤ L ς with L defined in (3). W e obtain the desired co vering num b er b ound b y discretizing the trainable parameters in f T with ς / L grid size. 8 Conclusions In this work, w e sho w that standard T ransformers can appro ximate H¨ older functions C s,λ  [0 , 1] d × n  under the L t distance with arbitrary precision. Building up on this ap- pro ximation result, we demonstrate that standard T ransformers ac hieve the minimax optimal rate in nonparametric regression for H¨ older target functions. By in tro ducing the size tuple and the dimension v ector, w e provide a fine-grained c haracterization of T rans- former structures. These findings demonstrate the p o werful abilit y of T ransformers at the theoretical level. 53 There are several promising directions for future researc h. F or example, it is crucial to establish a theoretical foundation for T ransformers in broader applications, such as pre- training in large language mo dels (LLMs) and vision T ransformers (ViT) in computer vision tasks. Recen t theoretical studies ha v e in v estigated the approximation and gener- alization errors of T ransformers in the setting of in-con text learning (ICL) [26, 41, 5]. 
References

[1] Tom Apostol. Mathematical Analysis. Pearson, 2nd edition, 1974.

[2] Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Low-rank bottleneck in multi-head attention models. In International Conference on Machine Learning, pages 864-873. PMLR, 2020.

[3] Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Nonparametric regression on low-dimensional manifolds using deep ReLU networks: Function approximation and statistical recovery. Information and Inference: A Journal of the IMA, 11(4):1203-1253, 2022.

[4] Jingpu Cheng, Ting Lin, Zuowei Shen, and Qianxiao Li. A unified framework for establishing the universal approximation of transformer-type architectures. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[5] Michelle Ching, Ioana Popescu, Nico Smith, Tianyi Ma, William G Underwood, and Richard J Samworth. Efficient and minimax-optimal in-context nonparametric regression with transformers. arXiv preprint arXiv:2601.15014, 2026.

[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, 2019.

[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

[8] Benjamin L Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, pages 5793-5831. PMLR, 2022.

[9] Jianqing Fan, Yihong Gu, and Wen-Xin Zhou. How do noise tails impact on deep ReLU networks? The Annals of Statistics, 52(4):1845-1871, 2024.

[10] Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference. Econometrica, 89(1):181-213, 2021.

[11] Evarist Giné and Richard Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2021.

[12] Ingo Gühring and Mones Raslan. Approximation rates for neural networks with encodable weights in smoothness spaces. Neural Networks, 134:107-130, 2021.

[13] Iryna Gurevych, Michael Kohler, and Gözde Gül Şahin. On the rate of convergence of a classifier based on a transformer encoder. IEEE Transactions on Information Theory, 68(12):8139-8155, 2022.
[14] Alexander Havrilla and Wenjing Liao. Understanding scaling laws with statistical and approximation theory for transformer neural networks on intrinsically low-dimensional data. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[15] Changhoon Song, Geonho Hwang, Junho Lee, and Myungjoo Kang. Minimal width for universal property of deep RNN. Journal of Machine Learning Research, 24(121):1-41, 2023.

[16] Jerry Yao-Chieh Hu, Hude Liu, Hong-Yu Chen, Weimin Wu, and Han Liu. Universal approximation with softmax attention. arXiv preprint arXiv:2504.15956, 2025.

[17] Haotian Jiang and Qianxiao Li. Approximation rate of the transformer architecture for sequence modeling. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[18] Yuling Jiao, Yanming Lai, Defeng Sun, Yang Wang, and Bokai Yan. Approximation bounds for transformer networks with application to regression. arXiv preprint arXiv:2504.12175, 2025.

[19] Yuling Jiao, Yanming Lai, Yang Wang, and Bokai Yan. Transformers can overcome the curse of dimensionality: A theoretical study from an approximation perspective. arXiv preprint arXiv:2504.13558, 2025.

[20] Yuling Jiao, Guohao Shen, Yuanyuan Lin, and Jian Huang. Deep nonparametric regression on approximate manifolds: Nonasymptotic error bounds with polynomial prefactors. The Annals of Statistics, 51(2):691-716, 2023.

[21] Yuling Jiao, Yang Wang, and Bokai Yan. Approximation bounds for recurrent neural networks with application to regression. arXiv preprint arXiv:2409.05577, 2024.

[22] Tokio Kajitsuka and Issei Sato. Are transformers with one layer self-attention using low-rank weight matrices universal approximators? In The Twelfth International Conference on Learning Representations, 2024.

[23] Tokio Kajitsuka and Issei Sato. On the optimal memorization capacity of transformers. In The Thirteenth International Conference on Learning Representations, 2025.

[24] Hyunjik Kim, George Papamakarios, and Andriy Mnih. The Lipschitz constant of self-attention. In International Conference on Machine Learning, pages 5562-5571. PMLR, 2021.

[25] Junghwan Kim, Michelle Kim, and Barzan Mozafari. Provable memorization capacity of transformers. In The Eleventh International Conference on Learning Representations, 2023.

[26] Juno Kim, Tai Nakamaki, and Taiji Suzuki. Transformers are minimax optimal nonparametric in-context learners. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[27] Michael Kohler and Sophie Langer. On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4):2231-2249, 2021.

[28] Anastasis Kratsios, Behnoosh Zamanlooy, Tianlin Liu, and Ivan Dokmanić. Universal approximation under constraints is possible with transformers. In International Conference on Learning Representations, 2022.

[29] Zhong Li, Jiequn Han, Qianxiao Li, et al. Approximation and optimization theory for linear continuous-time recurrent neural networks. Journal of Machine Learning Research, 23(42):1-85, 2022.

[30] Peilin Liu and Ding-Xuan Zhou. Generalization analysis of transformers in distribution regression. Neural Computation, 37(2):260-293, 2025.
[31] Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465-5506, 2021.

[32] Liam Madden. Upper and lower memory capacity bounds of transformers for next-token prediction. arXiv preprint arXiv:2405.13718, 2024.

[33] Sadegh Mahdavi, Renjie Liao, and Christos Thrampoulidis. Memorization capacity of multi-head attention in transformers. In The Twelfth International Conference on Learning Representations, 2024.

[34] Tong Mao and Ding-Xuan Zhou. Rates of approximation by ReLU shallow neural networks. Journal of Complexity, 79:101784, 2023.

[35] Ryumei Nakada and Masaaki Imaizumi. Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1-38, 2020.

[36] Sejun Park, Jaeho Lee, Chulhee Yun, and Jinwoo Shin. Provable memorization via deep neural networks using sub-linear parameters. In Conference on Learning Theory, pages 3627-3661. PMLR, 2021.

[37] Philipp Petersen and Felix Voigtlaender. Equivalence of approximation by convolutional neural networks and fully-connected networks. Proceedings of the American Mathematical Society, 148(4):1567-1581, 2020.

[38] Clayton Sanford, Daniel Hsu, and Matus Telgarsky. Representational strengths and limitations of transformers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[39] Anselm Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. Annals of Statistics, 48(4):1875-1897, 2020.

[40] Johannes Schmidt-Hieber. The Kolmogorov-Arnold representation theorem revisited. Neural Networks, 137:119-126, 2021.

[41] Zhaiming Shen, Alexander Hsu, Rongjie Lai, and Wenjing Liao. Understanding in-context learning on structured manifolds: Bridging attention to kernel methods. In The Fourteenth International Conference on Learning Representations, 2026.

[42] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons. Communications in Computational Physics, 28(5), 2020.

[43] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Optimal approximation rate of ReLU networks in terms of width and depth. Journal de Mathématiques Pures et Appliquées, 157:101-135, 2022.

[44] Jonathan W Siegel. Optimal approximation rates for deep ReLU neural networks on Sobolev and Besov spaces. Journal of Machine Learning Research, 24(357):1-52, 2023.

[45] Charles J Stone. Optimal rates of convergence for nonparametric estimators. The Annals of Statistics, pages 1348-1360, 1980.

[46] Taiji Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. In International Conference on Learning Representations, 2019.

[47] Shokichi Takakura and Taiji Suzuki. Approximation and estimation ability of transformers for sequence-to-sequence functions with infinite dimensional input. In International Conference on Machine Learning, pages 33416-33447. PMLR, 2023.

[48] Naoki Takeshita and Masaaki Imaizumi. Approximation of permutation invariant polynomials by transformers: Efficient construction in column-size. arXiv preprint arXiv:2502.11467, 2025.

[49] Aad W Van Der Vaart and Jon A Wellner. Weak convergence. In Weak Convergence and Empirical Processes: With Applications to Statistics, pages 16-28. Springer, 1996.
[50] Gal Vardi, Gilad Yehudai, and Ohad Shamir. On the optimal memorization power of ReLU neural networks. In International Conference on Learning Representations, 2022.

[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[52] Haiyu Wang and Yuanyuan Lin. Prompt tuning transformers for data memorization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[53] Zixuan Wang, Stanley Wei, Daniel Hsu, and Jason D. Lee. Transformers provably learn sparse token selection while fully-connected nets cannot. In Forty-first International Conference on Machine Learning, 2024.

[54] Colin Wei, Yining Chen, and Tengyu Ma. Statistically meaningful approximation: a case study on approximating Turing machines with transformers. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.

[55] Yunfei Yang. On the optimal approximation of Sobolev and Besov functions using deep ReLU neural networks. arXiv preprint arXiv:2409.00901, 2024.

[56] Yunfei Yang and Ding-Xuan Zhou. Nonparametric regression using over-parameterized shallow ReLU neural networks. Journal of Machine Learning Research, 25(165):1-35, 2024.

[57] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103-114, 2017.

[58] Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory, pages 639-649. PMLR, 2018.

[59] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020.

[60] Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. O(n) connections are expressive enough: Universal approximability of sparse transformers. Advances in Neural Information Processing Systems, 33:13783-13794, 2020.

[61] Ding-Xuan Zhou. Theory of deep convolutional neural networks: Downsampling. Neural Networks, 124:319-327, 2020.

[62] Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis, 48(2):787-794, 2020.
