Approximation Bounds for Random Neural Networks and Reservoir Systems


Authors: Lukas Gonon, Lyudmila Grigoryeva, Juan-Pablo Ortega

Abstract This work studies approximation based on single-hidden-layer feedforward and recurrent neural networks with randomly generated internal weights. These methods, in which only the last layer of weights and a few hyperparameters are optimized, have been successfully applied in a wide range of static and dynamic learning problems. Despite the popularity of this approach in empirical tasks, important theoretical questions regarding the relation between the unknown function, the weight distribution, and the approximation rate have remained open. In this work it is proved that, as long as the unknown function, functional, or dynamical system is sufficiently regular, it is possible to draw the internal weights of the random (recurrent) neural network from a generic distribution (not depending on the unknown object) and quantify the error in terms of the number of neurons and the hyperparameters. In particular, this proves that echo state networks with randomly generated weights are capable of approximating a wide class of dynamical systems arbitrarily well and thus provides the first mathematical explanation for their empirically observed success at learning dynamical systems.

Keywords Neural Networks · Approximation Error · Reservoir Computing · Echo State Networks · Random Function Approximation

Mathematics Subject Classification (2010) 60-08 · 60H25 · 41A30 · 93E35

L. Gonon, Faculty of Mathematics and Statistics, Universität Sankt Gallen, Switzerland. E-mail: lukas.gonon@unisg.ch
L. Grigoryeva, Department of Mathematics and Statistics, Graduate School of Decision Sciences, Universität Konstanz, Germany. E-mail: Lyudmila.Grigoryeva@uni-konstanz.de
J.-P. Ortega, Faculty of Mathematics and Statistics, Universität Sankt Gallen, Switzerland, and CNRS, France. E-mail: Juan-Pablo.Ortega@unisg.ch

1 Introduction

This article studies the approximation of an unknown map H*: X → R^m by a random (recurrent) neural network. More specifically, when X = R^q we study approximations of the function H* by single-hidden-layer feedforward neural networks H_W^{A,ζ}(z) = W σ(Az + ζ) with A ∈ M_{N,q}, ζ ∈ R^N randomly drawn (not using any knowledge about H*), σ: R^N → R^N a given activation function (obtained as the componentwise application of a map σ: R → R) and W ∈ M_{m,N} a matrix that can be trained in order to approximate H* as well as possible. Random neural networks of this type have been applied very successfully in a variety of settings; we refer in particular to the seminal works on random feature models [32] and Extreme Learning Machines [17]. We refer to this case as the static situation and will come back to it later on. In contrast, we speak about the dynamic situation when H* takes as inputs sequences, i.e. X ⊂ (R^d)^{Z_-}.
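The static procedure just described can be summarized in a few lines: draw the hidden weights once, keep them fixed, and fit only the readout W by least squares. The following Python sketch is purely illustrative and not taken from the paper; the uniform sampling of the rows of A on a ball and of ζ on an interval mirrors the constructive scheme analyzed later (Proposition 3), but all concrete constants, function names and the toy target are assumptions made here for demonstration.

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def sample_hidden_weights(N, q, R=1.0, zeta_bound=1.0, rng=None):
        # Rows of A uniform on the ball B_R in R^q, entries of zeta uniform on an interval.
        rng = np.random.default_rng(rng)
        directions = rng.standard_normal((N, q))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        radii = R * rng.uniform(0.0, 1.0, size=(N, 1)) ** (1.0 / q)
        A = directions * radii
        zeta = rng.uniform(-zeta_bound, zeta_bound, size=N)
        return A, zeta

    def fit_readout(A, zeta, inputs, targets):
        # Only the last layer W in H(z) = W sigma(Az + zeta) is trained (least squares).
        features = relu(inputs @ A.T + zeta)
        W, *_ = np.linalg.lstsq(features, targets, rcond=None)
        return W

    # Toy usage: approximate a smooth map on inputs with norm at most M = 1.
    rng = np.random.default_rng(0)
    Z = rng.uniform(-1, 1, size=(2000, 3)) / np.sqrt(3)
    y = np.sin(Z.sum(axis=1))                       # stands in for the unknown H*
    A, zeta = sample_hidden_weights(N=500, q=3, R=5.0, zeta_bound=5.0, rng=1)
    W = fit_readout(A, zeta, Z, y)
    print("empirical L2 error:", np.sqrt(np.mean((relu(Z @ A.T + zeta) @ W - y) ** 2)))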
A particularly important family of approximants that we study in the dynamic situation are reservoir systems, that is, H(z) = y_0 for z ∈ X ⊂ (R^d)^{Z_-}, where y_0 is the solution (which exists and is unique under suitable hypotheses) of the state-space system

    x_t = F(x_{t-1}, z_t),
    y_t = h(x_t),    t ∈ Z_-,    (1)

where the state or reservoir map F is (for the most part) randomly generated and only the static observation or readout map h is trained in specific learning tasks. An important particular case of (1) are echo state networks (ESNs) [26], [27], [25], [18]. These are recurrent neural networks that map the input z ∈ (R^d)^{Z_-} to the value H_W^{A,C,ζ}(z) = Y_0 ∈ R^m determined by

    X_t = σ(A X_{t-1} + C z_t + ζ),    t ∈ Z_-,
    Y_t = W X_t,    t ∈ Z_-.    (2)

Here A, C, ζ are randomly drawn (from a distribution that does not use any knowledge about H*), σ is a given activation function as above, and W is optimized at the time of training in order to approximate H* as well as possible. This technique has been successful in a wide range of applications (see, for example, [18], [30], [29], [23]). Based on these empirical results, ESNs with randomly generated A, C, ζ are thought to be capable of approximating arbitrary dynamical and input/output systems. However, a rigorous mathematical result proving this statement does not exist yet in the literature. It is only in the context of invertible and differentiable dynamical systems on a compact manifold that a result of this type has been recently established. Indeed, the results in [15] show that randomly drawn ESNs like (2) can be trained by optimizing W using generic one-dimensional observations of a given invertible and differentiable dynamical system to produce dynamics that are topologically conjugate to that given system.

In this article we place ourselves in the more general setup of input/output systems and provide a first mathematical result that proves the approximation capabilities of ESNs in a discrete-time setting and quantifies them by providing approximation bounds in terms of their architecture parameters. In more detail, we propose a constructive sampling procedure for A, C, ζ (depending only on three hyperparameters) so that by training W, the associated system (2) can be used to approximate any H* satisfying mild regularity assumptions. The L²-error between H* and its echo state approximation H_W^{A,C,ζ} can be bounded explicitly, and the approximation result can also be extended to a universality result for general H* (not satisfying the regularity conditions). For full details we refer to Theorem 2 and Corollary 5 below.

We complement these results by analyzing a popular modification of (2), in which the hidden state X is updated according to X_t = σ(A W X_{t-1} + C z_t + ζ). These systems are called echo state networks with output feedback (or Jordan recurrent neural networks with random internal weights) and are also widely used in the literature, even though in this case a more sophisticated training algorithm is needed (for instance a stochastic gradient-type optimization algorithm combined with backpropagation through time).
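To make the training procedure for (2) concrete, the sketch below builds a small ESN in which A, C, ζ are drawn randomly, the recursion is run over a finite input window, and only the readout W is obtained by least squares. This is an illustrative sketch rather than the paper's construction: tanh is used as a common practical activation (the results below are stated for the ReLU), the rescaling of A to spectral norm 0.9 is just one standard way to enforce a contraction in the spirit of the echo state property recalled in Section 2, and the toy input/output system and all constants are assumptions made for the example.

    import numpy as np

    def run_esn_states(A, C, zeta, inputs):
        # Iterate X_t = sigma(A X_{t-1} + C z_t + zeta) over a finite input window.
        X = np.zeros(A.shape[0])
        states = []
        for z_t in inputs:
            X = np.tanh(A @ X + C @ z_t + zeta)
            states.append(X)
        return np.array(states)

    rng = np.random.default_rng(0)
    N, d, T = 200, 1, 3000
    A = rng.standard_normal((N, N))
    A *= 0.9 / np.linalg.norm(A, 2)              # contraction: Lipschitz constant < 1
    C = rng.uniform(-1, 1, size=(N, d))
    zeta = rng.uniform(-0.5, 0.5, size=N)

    # Toy input/output system: the target depends on the recent input history.
    z = rng.uniform(-1, 1, size=(T, d))
    y = np.array([np.sin(z[t, 0] + 0.5 * z[t - 1, 0]) for t in range(1, T)])

    states = run_esn_states(A, C, zeta, z)[1:]       # align states with targets
    W, *_ = np.linalg.lstsq(states, y, rcond=None)   # only the readout is trained
    print("training RMSE:", np.sqrt(np.mean((states @ W - y) ** 2)))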
By applying similar tools as in the case of (2) we provide an approximation result for such systems in situations when the unknown functional is itself given by a sufficiently regular reservoir system of type (1). In this case, only one hyperparameter N appears (proportional to the number of neurons, i.e. the dimension of X) and the approximation error is of order O(1/√N). We refer to Theorem 3 below for full details.

To prove these results we rely mainly on probabilistic arguments involving concentration inequalities, an importance sampling procedure and techniques from empirical process theory (in particular the Ledoux-Talagrand inequality [22]). A further crucial ingredient is an integral representation for sufficiently regular functions related to the integral representations appearing in the proofs in [3], [24], [21]. In continuous time, an alternative approach based on randomized signature is presented in [5] and [6].

We emphasize that the proof of these dynamic statements crucially relies on our novel results for the static case. To understand these better, we briefly elaborate on the literature (we refer to the introduction of [34] for a detailed overview). The seminal work by Barron [3] shows that any function H*: R^q → R of a certain regularity can be approximated up to an error of order O(1/√N) using a neural network with one hidden layer and N hidden nodes. The hidden weights can be generated randomly, but the distribution from which they need to be drawn depends on H*. Thus, the randomly drawn weights are only used to guarantee the existence of tunable weights. Subsequently, the important contributions by Rahimi and Recht [32], [33], [34] analyze random weights generated from a known probability distribution p. In their argument the optimal output layer weights (which are tuned) implement an importance sampling procedure. The function class F_p for which error bounds can be derived (see Theorems 3.1 and 3.2 in [34]) and for which an approximation error of order O(1/√N) is guaranteed is defined in terms of p, and it is shown that F_p is dense. However, for a given function H* it may be challenging to decide whether H* ∈ F_p (and hence the error bound applies) or not. In this paper we show that under mild regularity assumptions on H* one automatically has H* ∈ F_p for a wide class of distributions p, including the most commonly used case when p is a uniform distribution. This is formulated abstractly in Theorem 1 and then specialized to the uniform distribution in Proposition 3 and Corollary 1. We also make the dependence of the resulting bounds on the input dimension explicit. This can be used to decide whether approximations by (shallow) random neural networks for classes of functions (parametrized by the input dimension) suffer from the curse of dimensionality or not. We emphasize that although all these results use shallow neural networks, which are, from an approximation theory perspective, less flexible than deep neural networks (see, for instance, [24], [31]), here the hidden weights are generated randomly and so the neural network training does not require gradient descent-type optimization techniques.
Finally, let us point out that Theorem 2 entails a constructive sampling scheme for the weights that may be readily used by practitioners and provides a learning procedure in which only W and three hyperparameters need to be optimized.

The remainder of this paper is organized as follows. In Section 2 we introduce some key concepts on reservoir systems. Section 3 then proves an integral representation for sufficiently regular functions, which is at the core of the subsequent approximation results. In Section 4 we then treat the static case and prove the random neural network approximation results Theorem 1, Proposition 3 and Corollary 1. Section 5 is concerned with the dynamic case and contains the echo state network approximation results Theorem 2 and Corollary 5. Finally, in Section 6 we prove the approximation result for echo state networks with output feedback, Theorem 3.

Notation

We use the notations Z_- = {0, -1, -2, ...}, N = {0, 1, 2, ...}, N_+ = N \ {0}. Throughout the article d, m, N, q ∈ N_+ denote positive integers and M is a positive constant. For any R > 0 we denote by B_R the Euclidean ball of radius R around 0 in the appropriate dimension (which will always be either mentioned explicitly or obvious from the context). Furthermore, λ_q(B_R) denotes the volume of the ball B_R ⊂ R^q. Unless mentioned otherwise, ‖·‖ denotes the Euclidean norm. We denote by M_{m,n} the set of real m × n matrices. We fix a probability space (Ω, F, P) on which all random elements are defined.

2 Preliminaries on the dynamic setting

The goal of this section is to present some preliminaries on the dynamic case, that is, when X ⊂ (R^d)^{Z_-}. In this case, it is customary to refer to maps H*: X → R^m as functionals. While this article is mainly concerned with approximating functionals, let us point out that these are in one-to-one correspondence with so-called causal and time-invariant filters, see for instance [11, 12, 13]. An important class of functionals is given by those satisfying H* = arg inf_{H ∈ H} R(H) for some class H of functionals and a risk map R: H → [0, ∞) that satisfies certain customary properties (see [9] and references therein for details). Another important class is given by the reservoir functionals that we recall in the next paragraphs.

2.1 Reservoir systems and associated functionals

Let d, N ∈ N_+, D_d ⊂ R^d, D_N ⊂ R^N and F: D_N × D_d → D_N, and for z ∈ (D_d)^{Z_-} consider the system

    x_t = F(x_{t-1}, z_t),    t ∈ Z_-.    (3)

We say that (3) satisfies the echo state property if for any z ∈ (D_d)^{Z_-} there exists a unique x ∈ (D_N)^{Z_-} such that (3) holds. As the following proposition shows, a sufficient condition guaranteeing this property is that D_N is a closed ball and F is contractive in the first argument.

Proposition 1 (Proposition 1 in [9]) Let R > 0, write B_R = {u ∈ R^N : ‖u‖ ≤ R} and suppose that F: B_R × D_d → B_R is continuous. Assume that F is a contraction in the first argument, that is, there exists 0 < r < 1 such that for all u, v ∈ B_R, w ∈ D_d it holds that ‖F(u, w) − F(v, w)‖ ≤ r ‖u − v‖. Then the system (3) has the echo state property. Furthermore, we can associate to it a unique mapping H^F: (D_d)^{Z_-} → R^N that is continuous (where (D_d)^{Z_-} is equipped with the product topology) and satisfies H^F(z_{·+t}) = x_t for all t ∈ Z_- (the symbol z_{·+t} stands for the shifted semi-infinite sequence (..., z_{-2+t}, z_{-1+t}, z_t) ∈ (D_d)^{Z_-}).
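The sufficient condition in Proposition 1 is easy to enforce and to check numerically for the kind of reservoir maps considered later on. The short sketch below is illustrative and not part of the paper: it rescales a random matrix A so that its spectral norm is r < 1; since the ReLU is 1-Lipschitz componentwise, the resulting F(x, z) = σ(Ax + Cz + ζ) is then a contraction in its first argument (the self-map condition F(B_R × D_d) ⊆ B_R, which Proposition 1 also requires, would have to be arranged separately, e.g. by bounding C and ζ). All constants are assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d, r = 50, 2, 0.8
    A = rng.standard_normal((N, N))
    A *= r / np.linalg.norm(A, 2)          # enforce spectral norm r < 1
    C = rng.standard_normal((N, d))
    zeta = rng.standard_normal(N)

    def F(x, z):
        # ReLU reservoir map; 1-Lipschitz activation and ||A||_2 = r give a contraction in x.
        return np.maximum(A @ x + C @ z + zeta, 0.0)

    ratios = []
    for _ in range(1000):
        u, v = rng.standard_normal(N), rng.standard_normal(N)
        z = rng.standard_normal(d)
        ratios.append(np.linalg.norm(F(u, z) - F(v, z)) / np.linalg.norm(u - v))
    print("max observed contraction ratio:", max(ratios), "<= r =", r)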
The functional H^F in Proposition 1 will be referred to as the reservoir functional associated to F. In many situations one is also interested in considering the input/output system generated by (3) together with a readout or observation map, that is,

    y_t = h(x_t),    t ∈ Z_-,    (4)

for some h: D_N → R^m. The reservoir functional associated to (3)-(4) is given as h ∘ H^F. In the dynamic case the functionals H that we use in this article to approximate a given (unknown) functional H* are always of the form H = h ∘ H^F for h linear and F suitably constructed.

3 Integral representations of sufficiently regular functions

A key ingredient in the proofs of the approximation results in this paper are certain integral representations of sufficiently regular functions. We provide a first result in Proposition 2 below. Variations of this result under weaker conditions will be developed later on in the article. In probabilistic terms, Proposition 2 shows that for all R > 0, any sufficiently regular function f can be represented on B_M as the difference of two functions of type v ↦ c E[max(v · U + ζ, 0)] for some constant c > 0 and some random variables U and ζ admitting a Lebesgue density with certain integrability properties and satisfying ‖U‖ ≤ R and |ζ| ≤ max(MR, 1), P-a.s.

The integral representation below is related to the Radon-wavelet integral representation as used in [24] and to representations appearing in [3, 21] and [2, Theorem 2]. This integral representation will be crucial to obtain random neural network approximation results with weights sampled from a uniform distribution, see Proposition 3 below. We will also formulate similar results for more general sampling distributions and under weaker integrability conditions (see Theorem 1 and Corollary 1 below).

Proposition 2 Let σ: R → R be given as σ(x) = max(x, 0). Suppose that f: R^q → R satisfies for all v ∈ R^q with ‖v‖ ≤ M that

    f(v) = ∫_{R^q} e^{i v·w} g(w) dw

for some g: R^q → C satisfying

    v* = ∫_{R^q} max(1, ‖w‖^{2q+6}) |g(w)|² dw < ∞.    (5)

Then, for any R > 0 there exists a measurable function π: R^{q+1} → R such that

(i) π(ω) = 0 for all ω = (w, u) ∈ R^q × R satisfying ‖w‖ > R or |u| > max(MR, 1),

(ii) ∫_{R^{q+1}} max(1, ‖ω‖) |π(ω)| dω < ∞,

(iii) for all v ∈ R^q with ‖v‖ ≤ M,

    f(v) = ∫_{R^{q+1}} π(ω) σ((v, 1) · ω) dω,    (6)

(iv) ∫_{R^{q+1}} ‖ω‖² π(ω)² dω ≤ 8 (M³ + M + 2) [ ∫_{B_R} max(1, ‖w‖³) |g(w)|² dw + ∫_{R^q \ B_R} ( max(1, ‖w‖^{2q+5}) / R^{2q+2} ) |g(w)|² dw ],

and thus in particular if R ≥ 1 then ∫_{R^{q+1}} ‖ω‖² π(ω)² dω ≤ 8 (M³ + M + 2) v*.

Remark 1 A sufficient condition for (5) to be satisfied is that f ∈ L¹(R^q) has an integrable Fourier transform and belongs to the Sobolev space W^{q+3,2}(R^q), see for instance [7, Theorem 6.1] or Corollary 2 below.

Remark 2 We now emphasize two points concerning the condition (5).
First, this condition (5) is stronger than the condition ∫_{R^q} ‖w‖ |g(w)| dw < ∞ appearing in the well-known work by Barron [3] (see e.g. (7) below for an argument). However, this stronger condition (5) also allows us to obtain a stronger conclusion. More specifically, whereas [3] proves that there exist neural network weights ensuring a certain approximation accuracy, we will see how Proposition 3 below provides, under condition (5), a constructive procedure for the neural network weights.

Second, we now discuss why the condition (5) is necessary. The properties of π derived in Proposition 2 are required to guarantee that the neural network weights can be sampled from a uniform distribution in Proposition 3 below. This is ensured, on the one hand, by the compact support of π (see Proposition 2(i)), which is achieved by a change of variables in the proof of Proposition 2. On the other hand, to carry out the importance sampling procedure in the proof of Proposition 3, the square integrability condition on π (see Proposition 2(iv)) is needed. To obtain square integrability of ‖w‖π(w) we need condition (5) (see (11), (13) in the proof below), since the Jacobian determinant appearing in the change of variables mentioned above makes the term ‖w‖^{2q+2} appear.

Proof The proof consists of two steps. In a first step, we use a modification of the argument in [21] to obtain a representation of type (6), but with corresponding π not necessarily satisfying (i). Then a suitable change of variables allows us to obtain a representation with the desired properties (i)-(iv).

Beforehand, let us verify that (5) implies that

    ∫_{R^q} |g(w)| dw < ∞  and  ∫_{R^q} ‖w‖³ |g(w)| dw < ∞.    (7)

Indeed, by first splitting the integral into an integral over B_1 ⊂ R^q and R^q \ B_1 and then applying Hölder's inequality one obtains

    ∫_{R^q} (1 + ‖w‖³) |g(w)| dw ≤ 2 ( ∫_{B_1} |g(w)|² dw )^{1/2} λ_q(B_1)^{1/2} + 2 ∫_{R^q \ B_1} ‖w‖³ |g(w)| dw,

where the last term can be estimated by applying Hölder's inequality once more to obtain

    ∫_{R^q \ B_1} ‖w‖³ |g(w)| dw ≤ ( ∫_{R^q \ B_1} ‖w‖^{6+2q} |g(w)|² dw )^{1/2} ( ∫_{R^q \ B_1} ‖w‖^{-2q} dw )^{1/2}

and the integrals are finite thanks to the hypothesis (5).

Step 1: Firstly, note that for any z ∈ R one may write

    − ∫_0^∞ [ (z − u)_+ e^{iu} + (−z − u)_+ e^{−iu} ] du = e^{iz} − iz − 1,    (8)

since for z > 0 one has

    ∫_0^z (z − u) e^{iu} du = −(1/i) z + (1/i) ∫_0^z e^{iu} du = iz − e^{iz} + 1

and for z < 0 one calculates

    ∫_0^{−z} (−z − u) e^{−iu} du = −(1/i) z − (1/i) ∫_0^{−z} e^{−iu} du = iz − e^{iz} + 1.

Secondly, for any v ∈ R^q one obtains by Tonelli's theorem and (7) that

    ∫_{R^q × [0,∞)} | (v·w − u)_+ e^{iu} + (−v·w − u)_+ e^{−iu} | |g(w)| dw du ≤ ∫_{R^q} ∫_0^{|v·w|} (|v·w| − u) |g(w)| du dw ≤ (‖v‖²/2) ∫_{R^q} ‖w‖² |g(w)| dw < ∞.

Hence one may combine Fubini's theorem, (7) and (8) to obtain for any v ∈ R^q

    − ∫_{R^q × [0,∞)} [ (v·w − u)_+ e^{iu} + (−v·w − u)_+ e^{−iu} ] g(w) dw du = ∫_{R^q} ( e^{i v·w} − i v·w − 1 ) g(w) dw = f(v) − (∇f)(0)·v − f(0).

Based on this integral representation of f we will now define α appropriately to obtain f(v) = ∫_{R^{q+1}} σ((v,1)·(w,u)) α(w,u) dw du for all v ∈ R^q with ‖v‖ ≤ M.
To do this, first note that for all v ∈ R^q with ‖v‖ ≤ M and all (w, u) ∈ R^{q+1} with u ≤ −M‖w‖ we have v·w + u ≤ 0 and therefore σ((v,1)·(w,u)) = 0. Setting

    α_1(w, u) = −[ Re(e^{−iu} g(w)) + Re(e^{iu} g(−w)) ] 1_{(−M‖w‖, 0]}(u)

and changing variables we thus obtain

    f(v) − (∇f)(0)·v − f(0) = ∫_{R^{q+1}} σ((v,1)·(w,u)) α_1(w,u) dw du.    (9)

In addition f(0) and (∇f)(0)·v are real, and therefore one has that ∫_{R^q} Im[g(w)] dw = 0 and ∫_{R^q} (v·w) Re[g(w)] dw = 0. This yields

    (∇f)(0)·v + f(0) = ∫_{R^q} [ (v·w)(−Im[g(w)]) + Re[g(w)] ] dw = ∫_{R^q} ∫_0^1 (v·w + u)( Re[g(w)] − Im[g(w)] ) du dw = ∫_{R^q} ∫_0^1 [ (v·w + u)_+ − (−v·w − u)_+ ]( Re[g(w)] − Im[g(w)] ) du dw.    (10)

Defining g̃(w) = Re[g(w)] − Im[g(w)] and

    α_2(w, u) = 1_{[0,1]}(u) g̃(w) − 1_{[−1,0]}(u) g̃(−w)

we may rewrite (10) as

    (∇f)(0)·v + f(0) = ∫_{R^{q+1}} σ((v,1)·(w,u)) α_2(w,u) dw du.

Combining this with (9) and setting α = α_1 + α_2 thus yields

    f(v) = ∫_{R^{q+1}} σ((v,1)·(w,u)) α(w,u) dw du.

Step 2: For ω = (w, u) ∈ R^q × R define

    π(w, u) = 1_{B_R \ {0}}(w) [ α(ω) + ( R^{2(q+2)} / ‖w‖^{2(q+2)} ) α( R² ω / ‖w‖² ) ].

Then clearly π(w, u) = 0 if ‖w‖ > R. If |u| > max(MR, 1) and ‖w‖ ≤ R, then it follows that |u| > M‖w‖ and |u| R²/‖w‖² > 1 and hence α_1(w,u) = α_2(w,u) = α_1(R²w/‖w‖², R²u/‖w‖²) = α_2(R²w/‖w‖², R²u/‖w‖²) = 0. This shows (i).

Next, define the mapping ϕ: B_R \ {0} → R^q \ B_R, ϕ(w) = R² w / ‖w‖², and note that ϕ is a diffeomorphism satisfying

    |det(ϕ'(w))| = R^{2q} | det( 1_{q×q} / ‖w‖² − 2 w wᵀ / ‖w‖⁴ ) | = R^{2q} / ‖w‖^{2q}.

The change of variables formula hence implies for any measurable function h: R^q → R that

    ∫_{R^q \ B_R} h(w) dw = ∫_{B_R} h(ϕ(w)) R^{2q} dw / ‖w‖^{2q}.

Applying this and the substitution R² ũ = u ‖w‖² one obtains that

    ∫_{R^{q+1}} max(1, ‖ω‖) |π(ω)| dω
    ≤ ∫_{B_R} ∫_R ( R² max(1, ‖(w, ũ)‖) / ‖w‖² ) | α( R²(w, ũ) / ‖w‖² ) | R^{2(q+1)} dũ dw / ‖w‖^{2(q+1)} + ∫_{B_R × R} (1 + ‖ω‖²) |α(ω)| dω
    = ∫_{R^q \ B_R} ∫_R max( ‖w‖²/R², ‖(w, u)‖ ) |α(w, u)| du dw + ∫_{B_R × R} (1 + ‖ω‖²) |α(ω)| dω
    ≤ 12 max(1, R^{−2}) ∫_{R^q} ∫_0^{max(1, M‖w‖)} (1 + ‖w‖² + u²) |g(w)| du dw
    ≤ 12 (M³ + M + 1) max(1, R^{−2}) ∫_{R^q} (1 + ‖w‖³) |g(w)| dw < ∞.

This shows (ii). To deduce the representation (iii) one may now use Step 1 and apply the same substitution as above to the first term to obtain for any v ∈ R^q with ‖v‖ ≤ M that

    f(v) = ∫_{R^q \ B_R} ∫_R σ((v,1)·(w,u)) α(w,u) du dw + ∫_{B_R × R} σ((v,1)·(w,u)) α(w,u) dw du
         = ∫_{B_R} ∫_R σ( (v,1)·( ϕ(w), R²ũ/‖w‖² ) ) α( ϕ(w), R²ũ/‖w‖² ) R^{2(q+1)} dũ dw / ‖w‖^{2(q+1)} + ∫_{B_R × R} σ((v,1)·ω) α(ω) dω
         = ∫_{B_R × R} σ((v,1)·ω) π(ω) dω.

It remains to prove (iv).
Applying again the change of variables formula and using ‖ϕ(w)‖ = R² ‖w‖^{−1} yields

    ∫_{R^{q+1}} ‖ω‖² π(ω)² dω ≤ 2 ∫_{B_R × R} ‖ω‖² α(ω)² dω + 2 ∫_{B_R} ∫_R ( R^{2q+2} / ‖w‖^{2q+2} ) ‖ R²(w, ũ)/‖w‖² ‖² α( R²(w, ũ)/‖w‖² )² R^{2(q+1)} dũ dw / ‖w‖^{2(q+1)}
    = 2 ∫_{B_R × R} ‖ω‖² α(ω)² dω + 2 R^{−(2q+2)} ∫_{R^q \ B_R} ∫_R ( u² ‖w‖^{2q+2} + ‖w‖^{2q+4} ) α(w, u)² du dw.    (11)

To estimate the first term, we note that |α(ω)|² ≤ 2|α_1(ω)|² + 2|α_2(ω)|² and thus

    ∫_{B_R × R} ‖ω‖² α(ω)² dω ≤ 4 ∫_{B_R} ∫_R (u² + ‖w‖²) [ |g(w)|² 1_{[0, M‖w‖]}(u) + |g̃(w)|² 1_{[0,1]}(u) ] du dw ≤ 4 (M³ + M) ∫_{B_R} ‖w‖³ |g(w)|² dw + 4 ∫_{B_R} (1 + ‖w‖²) |g(w)|² dw.    (12)

Furthermore, one estimates the integral in the second term in (11) as

    (1/4) ∫_{R^q \ B_R} ∫_R ( u² ‖w‖^{2q+2} + ‖w‖^{2q+4} ) α(w, u)² du dw
    ≤ ∫_{R^q \ B_R} ∫_R ( u² ‖w‖^{2q+2} + ‖w‖^{2q+4} ) [ |g(w)|² 1_{[0, M‖w‖]}(u) + |g̃(w)|² 1_{[0,1]}(u) ] du dw
    ≤ (M³ + M) ∫_{R^q \ B_R} ‖w‖^{2q+5} |g(w)|² dw + 4 ∫_{R^q \ B_R} ( ‖w‖^{2q+2} + ‖w‖^{2q+4} ) |g(w)|² dw.    (13)

Combining (11), (12), and (13) one obtains

    ∫_{R^{q+1}} ‖ω‖² π(ω)² dω ≤ 8 (M³ + M + 2) [ ∫_{B_R} max(1, ‖w‖³) |g(w)|² dw + ∫_{R^q \ B_R} ( max(1, ‖w‖^{2q+5}) / R^{2q+2} ) |g(w)|² dw ],

as claimed. ⊓⊔

4 Approximation Error Estimates for Random Neural Networks

In this section we derive random neural network approximation bounds for sufficiently regular functions. We first introduce the setting and prove a result for separable Hilbert spaces X and general sampling distributions (Theorem 1 below). In Section 4.2 we then consider the special case X = R^q and derive results for weights sampled from a uniform distribution (see Proposition 3 and Corollary 1). The dependence of the approximation bounds on the input dimension is explicit and thus these results may be used to decide when the approximation by random neural networks for classes of functions (parametrized by the input dimension) suffers from the curse of dimensionality. Finally, in Section 4.3 we deduce as a corollary of the results in Section 4.2 that neural networks with randomly generated inner weights and in which only the last layer is trained possess universal approximation capabilities. This is a new version of the L²-universal approximation theorem for neural networks from [16].

4.1 Setting and result for separable Hilbert spaces

Suppose X is a separable Hilbert space with inner product ⟨·,·⟩ and associated norm ‖·‖. Let (A_1, ζ_1), ..., (A_N, ζ_N) be i.i.d. X × R-valued random variables with distribution π, a probability measure on B(X × R) = B(X) ⊗ B(R) (see [19, Lemma 1.2]). Denote by A: X → R^N the random linear map with Az = (⟨A_1, z⟩, ..., ⟨A_N, z⟩) and set ζ = (ζ_1, ..., ζ_N). Then for any M_{m,N}-valued random matrix W we may define a random function H_W^{A,ζ}: X → R^m by

    H_W^{A,ζ}(z) = W σ(Az + ζ),    z ∈ X.    (14)

Such a function will be called a random neural network with N hidden nodes and inputs in X.
Clearly, if X = R^d, then this is a classical single-hidden-layer feedforward neural network with inputs in R^d. When σ: R^N → R^N is obtained as the componentwise application of the rectifier function σ: R → R given by σ(x) := max(x, 0) we say that (14) is a ReLU neural network.

We will be interested in using random neural networks to approximate an (unknown) function H*: X → R^m. In applications, the procedure is typically as follows: in a first step the network parameters A, ζ are generated randomly. Then these are considered as fixed and the matrix W is trained (given the realizations of A, ζ) in order to approximate H* as well as possible. With this in mind, in what follows we will be mainly interested in measuring the approximation error between H_W^{A,ζ} and H* conditional on A, ζ and with respect to the L²(X, µ_Z)-norm for a probability measure µ_Z on (X, B(X)). Thus, throughout this section, Z is an arbitrary X-valued random variable. We denote by µ_Z its distribution. The only assumptions we impose are that ‖Z‖ ≤ M, P-a.s., and that Z is independent of (A_1, ζ_1), ..., (A_N, ζ_N). The following lemma guarantees in particular that H_W^{A,ζ}(Z) is a random variable, that is, F-measurable.

Lemma 1 H_W^{A,ζ} is product-measurable, that is, the mapping (ω, z) ∈ Ω × X ↦ H_{W(ω)}^{A(ω),ζ(ω)}(z) ∈ R^m is F ⊗ B(X)-measurable.

Proof On the one hand, the Cauchy-Schwarz inequality implies that for any z ∈ X the mapping X ∋ v ↦ ⟨v, z⟩ is continuous and thus B(X)-measurable. This shows that ⟨A_i, z⟩ is a random variable for all i = 1, ..., N. Therefore, for any z ∈ X the mapping Ω ∋ ω ↦ H_{W(ω)}^{A(ω),ζ(ω)}(z) = W(ω) σ(A(ω)z + ζ(ω)) ∈ R^m is F-measurable. On the other hand, for any ω ∈ Ω the linear map A(ω): X → R^N is continuous (again by the Cauchy-Schwarz inequality) and thus also H_{W(ω)}^{A(ω),ζ(ω)}: X → R^m is continuous. The claimed product-measurability therefore follows for instance from Aliprantis & Border [1, Lemma 4.51]. ⊓⊔

We now present our random neural network approximation result, see also Remark 3 below for a discussion. We use the following notation: for any measure ν we write ν_− for the measure ν_−(·) = ν(−·) and for a complex measure ν we denote by |ν| its total variation measure, see [35, Chapter 6].

Theorem 1 Suppose that H*: X → R^m can be represented as

    H*_j(z) = ∫_X e^{i⟨w,z⟩} µ̂_j(dw)

for some complex measures µ̂_j, j = 1, ..., m, on (X, B(X)) and all z ∈ X with ‖z‖ ≤ M. Assume that

    ∫_X max(1, ‖w‖²) |µ̂_j|(dw) < ∞,    (15)

π = π_X ⊗ (π_R(x) dx), |µ̂_j| + |µ̂_j|_− ≪ π_X and, with F_π(x) = 2 ∫_{−x}^0 (1/π_R(u)) du, either (i) or (ii) holds:

(i) π_R is strictly positive and F_π(x) < ∞ for all x ∈ R,
(ii) for some R > 0, π_X({w ∈ X : ‖w‖ > R}) = 0 and π_R(x) > 0, F_π(x) < ∞ for |x| ≤ max(MR, 1).

Furthermore set g_j = d(|µ̂_j| + |µ̂_j|_−)/dπ_X and assume that

    ∫_X F_π(M‖w‖) ‖w‖² g_j(w)² π_X(dw) < ∞,    ∫_X max(‖w‖², 1) g_j(w)² π_X(dw) < ∞    (16)

and let σ: R → R be the rectifier function given by σ(x) := max(x, 0).
Then there exists W (an M_{m,N}-valued random variable) and C* > 0 such that the random ReLU neural network H_W^{A,ζ} satisfies

    E[ ‖H_W^{A,ζ}(Z) − H*(Z)‖² ] ≤ C*/N

and for any δ ∈ (0,1), with probability 1 − δ the random neural network H_W^{A,ζ} satisfies

    ( ∫_X ‖H_W^{A,ζ}(z) − H*(z)‖² µ_Z(dz) )^{1/2} ≤ √C* / (δ √N).

Moreover, the constant C* is explicit and given by C* = Σ_{j=1}^m C*_j with

    C*_j = M² ∫_X F_π(M‖w‖) ‖w‖² g_j(w)² π_X(dw) + 8 M² ( F_π(1) − F_π(−1) ) ∫_X max(‖w‖², 1) g_j(w)² π_X(dw).

Remark 3 At first glance Theorem 1 may appear to be merely an existence statement. However, an optimal W can in fact be computed explicitly by solving the least-squares minimization problem

    min_W E[ ‖W σ(AZ + ζ) − H*(Z)‖² | A, ζ ],    (17)

where the minimization is taken with respect to M_{m,N}-valued random variables which are measurable with respect to the sigma-algebra generated by A, ζ. We will show in (22) and (23) below that the matrix W constructed in the proof of Theorem 1 is measurable with respect to the sigma-algebra generated by A, ζ. Consequently, Theorem 1 shows that

    E[ min_W { E[ ‖H_W^{A,ζ}(Z) − H*(Z)‖² | A, ζ ] } ] ≤ C*/N.

Remark 4 A first attempt at proving Theorem 1 might be to work directly with the solution to the least-squares minimization problem (17), i.e. the explicit minimizer W*. However, evaluating the approximation error E[‖W* σ(AZ + ζ) − H*(Z)‖²] directly is very challenging due to the dependence between W* and σ(AZ + ζ). This is further complicated by the fact that the explicit expression of W* involves the inverse of the covariance matrix of σ(AZ + ζ) conditional on A, ζ. Therefore, evaluating the expectation with respect to A, ζ or providing an upper bound for it is for the time being out of reach. This is the reason why we do not work with (17) in the proof of Theorem 1, but rather explicitly construct a W for which the approximation error can be bounded more easily. As pointed out in Remark 3, we thereby obtain also an upper bound for the optimal W. Whether or not one can also obtain a lower bound

    E[ min_W { E[ ‖H_W^{A,ζ}(Z) − H*(Z)‖² | A, ζ ] } ] ≥ C̃/N

for some C̃ > 0 is still not clear due to the difficulties mentioned above.

Proof First note that, writing W_j for the j-th row of W, one has

    E[ ‖H_W^{A,ζ}(Z) − H*(Z)‖² ] = Σ_{j=1}^m E[ |H_{W_j}^{A,ζ}(Z) − H*_j(Z)|² ].

Thus, it is sufficient to prove the claimed result for each component j individually and sum up the resulting constants. Without loss of generality, we may therefore assume m = 1. To simplify notation we will write H* = H*_1, µ̂ = µ̂_1, g = g_1 and C* = C*_1. The proof now proceeds in two steps. In a first step we derive an integral representation for H* similar to Proposition 2. In the second step we then choose W in such a way that H_W^{A,ζ} is a sample average of N i.i.d. random functions with expectation H* and deduce the claimed error bound based on this.

Step 1: Integral representation. Firstly, recall that by [35, Theorem 6.12] there exists a measurable function h: X → C satisfying |h(w)| = 1 for all w ∈ X and µ̂(dw) = h(w)|µ̂|(dw).
Next, note that proceeding precisely as in the proof of Step 1 in Proposition 2 and using (15) yields for any v ∈ X that

    − ∫_{X × [0,∞)} [ (⟨v,w⟩ − u)_+ e^{iu} + (−⟨v,w⟩ − u)_+ e^{−iu} ] µ̂(dw) du = ∫_X ( e^{i⟨v,w⟩} − i⟨v,w⟩ − 1 ) µ̂(dw) = H*(v) − ∫_X i⟨v,w⟩ µ̂(dw) − H*(0).    (18)

We claim that the last integral is a real number. To see this, one uses Im(H*(λv)) = 0 and Im(H*(0)) = 0 to estimate for any λ > 0

    | Im( ∫_X i⟨v,w⟩ µ̂(dw) ) | = (1/λ) | Im( H*(λv) − H*(0) − ∫_X i⟨λv,w⟩ µ̂(dw) ) | ≤ (1/λ) | ∫_X ( e^{i⟨λv,w⟩} − 1 − i⟨λv,w⟩ ) h(w) |µ̂|(dw) | ≤ (1/(2λ)) ∫_X |⟨λv,w⟩|² |µ̂|(dw) ≤ (λ‖v‖²/2) ∫_X ‖w‖² |µ̂|(dw)

and note that the last expression converges to 0 as λ → 0 due to (15). This shows that

    ∫_X i⟨v,w⟩ µ̂(dw) + H*(0) = ∫_X ( ⟨v,w⟩(−Im[h(w)]) + Re[h(w)] ) |µ̂|(dw) = ∫_X ∫_0^1 ( ⟨v,w⟩ + u )( Re[h(w)] − Im[h(w)] ) du |µ̂|(dw) = ∫_X ∫_0^1 [ (⟨v,w⟩ + u)_+ − (−⟨v,w⟩ − u)_+ ]( Re[h(w)] − Im[h(w)] ) du |µ̂|(dw),    (19)

which is the analogue to (10) in the proof of Proposition 2. We now combine the representations (18) and (19) to arrive at the claimed integral representation. To this end define the function h̄: X → R by h̄(w) = Re[h(w)] − Im[h(w)] for w ∈ X and define the measures

    µ̃_1(dw, du) = Re[e^{−iu} h(w)] |µ̂|(dw) du,    µ̃_2(dw, du) = Re[e^{iu} h(−w)] |µ̂|_−(dw) du

on X × R. With these notations we may define the measures α_1 and α_2 on X × R by

    α_1(dw, du) = −1_{(−M‖w‖, 0]}(u) [ µ̃_1(dw, du) + µ̃_2(dw, du) ],
    α_2(dw, du) = 1_{[0,1]}(u) h̄(w) |µ̂|(dw) du − 1_{[−1,0]}(u) h̄(−w) |µ̂|_−(dw) du.

As shown above, the right-hand side in (18) is real and hence so is the left-hand side. Thus, by setting α = α_1 + α_2, rearranging (18) and using (19) one obtains for any v ∈ X with ‖v‖ ≤ M that

    H*(v) = ∫_{X × R} σ( ⟨v,w⟩ + u ) α(dw, du).    (20)

Finally, let A ∈ B(X × R) satisfy π(A) = 0 and for u ∈ R denote A_u = {w ∈ X : (w, u) ∈ A}. If (i) holds, then the assumptions that π = π_X ⊗ (π_R(x) dx) and π_R > 0 imply that π_X(A_u) = 0 for Lebesgue-a.e. u ∈ R. Consequently, |µ̂|(A_u) + |µ̂|_−(A_u) = 0 and α(A) = 0. In case (ii) one may proceed similarly to obtain in either case that α ≪ π. Writing

    g(w) = d(|µ̂| + |µ̂|_−)/dπ_X (w),    w ∈ X,

one uses |h̄(w)| ≤ √2 to estimate for any (w, u) ∈ X × R that

    | dα/dπ (w, u) | ≤ ( 1_{(−M‖w‖, 0]}(u) + √2 · 1_{[−1,1]}(u) ) (1/π_R(u)) g(w).    (21)

Step 2: Importance sampling. Next, write U_i = (A_i, ζ_i), define the random variables

    V_i = dα/dπ (U_i)    (22)

and set

    W = (1/N) ( V_1 ··· V_N ).    (23)

By first inserting the definitions and then using independence, conditioning (see for instance [19, Lemma 2.11]) and the assumption that X is separable we obtain

    E[ ‖H_W^{A,ζ}(Z) − H*(Z)‖² ] = E[ |W σ(AZ + ζ) − H*(Z)|² ] = E[ E[ | (1/N) Σ_{i=1}^N V_i σ(⟨A_i, z⟩ + ζ_i) − H*(z) |² ]_{z=Z} ].    (24)
However, by construction each of the summands V_i σ(⟨A_i, z⟩ + ζ_i) in (24) is a random variable with expectation H*(z), as one sees by using the representation (20) to calculate for each i = 1, ..., N and any z ∈ X with ‖z‖ ≤ M

    E[ V_i σ(⟨A_i, z⟩ + ζ_i) ] = ∫_{X × R} (dα/dπ)(w, u) σ(⟨w, z⟩ + u) π(dw, du) = H*(z).

Using independence one thus obtains

    E[ | (1/N) Σ_{i=1}^N V_i σ(⟨A_i, z⟩ + ζ_i) − H*(z) |² ] = Var( (1/N) Σ_{i=1}^N V_i σ(⟨A_i, z⟩ + ζ_i) ) = (1/N) Var( V_1 σ(⟨A_1, z⟩ + ζ_1) ) ≤ (1/N) E[ V_1² σ(⟨A_1, z⟩ + ζ_1)² ].    (25)

To estimate the last expectation, one notes that (21) and (16) yield for any z ∈ X with ‖z‖ ≤ M

    E[ V_1² σ(⟨A_1, z⟩ + ζ_1)² ] = ∫_{X × R} ( (dα/dπ)(w,u) )² σ(⟨w,z⟩ + u)² π(dw, du)
    ≤ 2 ∫_{X × R} ( 1_{(−M‖w‖,0]}(u) + 2 · 1_{[−1,1]}(u) ) ( g(w)/π_R(u) )² σ(⟨w,z⟩ + u)² π(dw, du)
    ≤ 2 ∫_X ∫_R ( 1_{(−M‖w‖,0]}(u) |⟨w,z⟩|² + 4 · 1_{[−1,1]}(u)( |⟨w,z⟩|² + 1 ) ) ( g(w)² / π_R(u) ) π_X(dw) du
    ≤ M² ∫_X F_π(M‖w‖) ‖w‖² g(w)² π_X(dw) + 8 M² ( F_π(1) − F_π(−1) ) ∫_X max(‖w‖², 1) g(w)² π_X(dw) = C* < ∞.    (26)

Combining (24), (25) and (26) thus yields

    E[ ( ∫_X ‖H_W^{A,ζ}(z) − H*(z)‖² µ_Z(dz) )^{1/2} ] ≤ E[ ‖H_W^{A,ζ}(Z) − H*(Z)‖² ]^{1/2} ≤ √C* / √N.

Thus, for any given δ ∈ (0,1) one may set η = √C*/(δ√N) and apply Markov's inequality to obtain

    P( ( ∫_X ‖H_W^{A,ζ}(z) − H*(z)‖² µ_Z(dz) )^{1/2} > η ) ≤ (1/η) √C*/√N = δ.  ⊓⊔

4.2 Results in the finite-dimensional case

Let us now specialize to the case X = R^q. We work in the setting and notation as introduced in Section 4.1 and, in particular, consider random neural networks

    H_W^{A,ζ}(z) = W σ(Az + ζ),    z ∈ R^q.    (27)

Thus, Theorem 1 provides a random neural network approximation result for a wide range of sampling distributions π for the weights. However, these assumptions may not allow us to sample the weights from a uniform distribution, unless the Fourier representation of H* is compactly supported. In this section we prove that this case can still be covered by applying the representation from Proposition 2. To simplify the statements we choose m = 1 here, but all the results can be directly generalized to m ∈ N_+. In line with Remark 3, the "existence" statement in the next proposition also directly yields approximation error bounds for the random neural network with readout W trained by least-squares minimization.

Proposition 3 Suppose H*: R^q → R can be represented as

    H*(z) = ∫_{R^q} e^{i⟨w,z⟩} g(w) dw

for some complex-valued function g on R^q and all z ∈ R^q with ‖z‖ ≤ M. Assume that

    ∫_{R^q} max(1, ‖w‖^{2q+6}) |g(w)|² dw < ∞.    (28)

Let R > 0, suppose the rows of the M_{N,q}-valued random matrix A are i.i.d. random variables with uniform distribution on B_R ⊂ R^q, suppose the entries of the R^N-valued random vector ζ are i.i.d. random variables uniformly distributed on [−max(MR,1), max(MR,1)], assume that A and ζ are independent and let σ: R → R be given as σ(x) = max(x, 0).
Then, there exists W (an M_{1,N}-valued random variable) and C* > 0 such that

    E[ ‖H_W^{A,ζ}(Z) − H*(Z)‖² ] ≤ C*/N    (29)

and for any δ ∈ (0,1), with probability 1 − δ the random neural network H_W^{A,ζ} satisfies

    ( ∫_{R^q} ‖H_W^{A,ζ}(z) − H*(z)‖² µ_Z(dz) )^{1/2} ≤ √C* / (δ√N).

Moreover, the constant C* is explicit (see (33) below).

Proof Firstly, the function H* satisfies the hypotheses of Proposition 2. Thus, there exists an integrable function π*: R^{q+1} → R such that for z ∈ R^q with ‖z‖ ≤ M the function H* can be represented as

    H*(z) = ∫_{R^{q+1}} σ( z·w + u ) π*(w, u) dw du

and π*(w, u) = 0 for all (w, u) ∈ R^q × R satisfying ‖w‖ > R or |u| > max(MR, 1). Moreover,

    ∫_{R^{q+1}} ‖ω‖² π*(ω)² dω ≤ 8 (M³ + M + 2) [ ∫_{B_R} max(1, ‖w‖³) |g(w)|² dw + ∫_{R^q \ B_R} ( max(1, ‖w‖^{2q+5}) / R^{2q+2} ) |g(w)|² dw ].    (30)

Recall that by assumption π = π_X ⊗ π_R, where π_X is the uniform distribution on B_R and π_R is the uniform distribution on [−max(MR,1), max(MR,1)]. Hence, setting α(dω) = π*(ω) dω one has that (20) holds, α ≪ π and

    dα/dπ = 2 max(MR, 1) Vol_q(B_R) π*.

Thus, one may now mimic Step 2 in the proof of Theorem 1, i.e. (22)-(25), to obtain

    E[ ‖H_W^{A,ζ}(Z) − H*(Z)‖² ] ≤ (1/N) E[ V_1² σ(⟨A_1, z⟩ + ζ_1)² ].    (31)

Furthermore, for any z ∈ X with ‖z‖ ≤ M

    E[ V_1² σ(⟨A_1, z⟩ + ζ_1)² ] = ∫_{R^q × R} ( (dα/dπ)(w,u) )² σ(⟨w,z⟩ + u)² π(dw, du) = 2 max(MR,1) Vol_q(B_R) ∫_{R^q × R} ( π*(w,u) )² σ(⟨w,z⟩ + u)² dw du ≤ 2 max(MR,1) Vol_q(B_R) (M + 1)² ∫_{R^{q+1}} ‖ω‖² π*(ω)² dω.    (32)

Combining (30), (31) and (32) thus yields (29), as desired, with

    C* = 16 max(MR,1) Vol_q(B_R) (M+1)² (M³ + M + 2) · [ ∫_{B_R} max(1, ‖w‖³) |g(w)|² dw + ∫_{R^q \ B_R} ( max(1, ‖w‖^{2q+5}) / R^{2q+2} ) |g(w)|² dw ].    (33)

The high-probability statement then follows from (29) precisely as in the proof of Theorem 1. ⊓⊔

In the next result we present an alternative error estimate, for which the integrability condition on g does not depend on the input dimension q (compare (28) to (35)). The estimate can be deduced from the error estimate in Proposition 3 by truncating g and estimating the difference between the truncation and the original H*. Recall that Z is an R^q-valued random variable satisfying ‖Z‖ ≤ M, P-a.s. We emphasize that the "existence" statement in the following corollary also yields approximation error bounds for the random neural network with readout W trained by least-squares minimization, see Remark 3.

Corollary 1 Suppose H*: R^q → R can be represented as

    H*(z) = ∫_{R^q} e^{i⟨w,z⟩} g(w) dw    (34)

for some complex-valued function g ∈ L¹(R^q) and all z ∈ R^q with ‖z‖ ≤ M. Assume that

    C*_g = ( ∫_{R^q} max(1, ‖w‖³) |g(w)|² dw )^{1/2} < ∞.    (35)

Let R > 0, suppose the rows of the M_{N,q}-valued random matrix A are i.i.d. random variables with uniform distribution on B_R ⊂ R^q, suppose the entries of the R^N-valued random vector ζ are i.i.d.
random variables uniformly distributed on [−max(MR,1), max(MR,1)], assume that A and ζ are independent and let σ: R → R be given as σ(x) = max(x, 0). Then there exists W (an M_{1,N}-valued random variable) such that

    E[ ‖H_W^{A,ζ}(Z) − H*(Z)‖² ]^{1/2} ≤ √(C*_R) / √N + ∫_{R^q \ B_R} |g(w)| dw,    (36)

where

    C*_R = 16 max(MR,1) Vol_q(B_R) (M+1)² (M³ + M + 2) ∫_{B_R} max(1, ‖w‖³) |g(w)|² dw.    (37)

In particular, writing c²_{M,q} = 16 max(M,1) Vol_q(B_1) (M+1)² (M³ + M + 2), it follows that:

(i) if I_k = ∫_{R^q} ‖w‖^k |g(w)| dw < ∞ for some k ∈ N_+, then R = N^{1/(2k+q+1)} yields

    E[ ‖H_W^{A,ζ}(Z) − H*(Z)‖² ]^{1/2} ≤ N^{−[2 + (q+1)/k]^{−1}} ( c_{M,q} C*_g + I_k ),    (38)

(ii) if I_k = ∫_{R^q} exp(C‖w‖^k) |g(w)| dw < ∞ for some k ∈ N_+ and C > 0, then R = (log(√N)/C)^{1/k} yields

    E[ ‖H_W^{A,ζ}(Z) − H*(Z)‖² ]^{1/2} ≤ ( [log(√N)]^{(q+1)/(2k)} / √N ) [ C^{−(q+1)/(2k)} c_{M,q} C*_g + I_k ].    (39)

Remark 5 A sufficient condition for (34)-(35) to be satisfied is that H* ∈ L¹(R^q) has an integrable Fourier transform and belongs to the Sobolev space W^{2,2}(R^q), see for instance [7, Theorem 6.1]. The integrability conditions formulated in parts (i) and (ii) are related to additional smoothness properties of H*, where a higher degree of smoothness means that the Fourier transform of H* decays more quickly and, consequently, the expressions I_k in (i) or (ii) are finite for larger k ∈ N_+. This results in a faster rate of convergence in the bounds (38) and (39). For instance, if the condition in part (i) is satisfied, then the error in (38) is of order O(N^{−[2+(q+1)/k]^{−1}}), which is close to O(1/√N) when (q+1)/k is small. Thus, as in classical works (see for instance [28]) the approximation rate depends on the ratio of the input dimension and the smoothness of the function to be approximated. A similar result for functions in W^{k,2}(R^q) for k > q/2 + 1 is formulated in Corollary 2 below.

Proof Define ḡ(w) = 1_{B_R}(w) g(w) and

    H̄*(z) = ∫_{R^q} e^{i⟨w,z⟩} ḡ(w) dw.

Then

    ∫_{R^q} max(1, ‖w‖^{2q+6}) |ḡ(w)|² dw ≤ max(1, R^{2q+3}) ∫_{R^q} max(1, ‖w‖³) |g(w)|² dw < ∞

and so Proposition 3 (applied to H̄*) shows that there exists W such that

    E[ ‖H_W^{A,ζ}(Z) − H̄*(Z)‖² ] ≤ C*_R / N    (40)

with C*_R given in (37). Furthermore, the triangle inequality yields

    E[ ‖H*(Z) − H̄*(Z)‖² ]^{1/2} = E[ | ∫_{R^q \ B_R} e^{i⟨w,Z⟩} g(w) dw |² ]^{1/2} ≤ ∫_{R^q \ B_R} |g(w)| dw.

Combining this with (40) and the triangle inequality then yields (36).

Finally, let us show that the assumptions in (i) and (ii) guarantee a certain decay of the last term in (36).

(i) Suppose I_k = ∫_{R^q} ‖w‖^k |g(w)| dw < ∞ for some k ∈ N_+. Then

    ∫_{R^q \ B_R} |g(w)| dw ≤ ∫_{R^q \ B_R} ( ‖w‖/R )^k |g(w)| dw ≤ I_k / R^k.

Thus, the right-hand side in (36) is bounded by

    √(C*_R)/√N + I_k/R^k ≤ R^{(q+1)/2} c_{M,q} C*_g / √N + I_k / R^k,

which becomes the right-hand side of (38) if we take R = N^α and choose α to make both terms of the same order, i.e. α(q+1)/2 − 1/2 = −αk.

(ii) Suppose I_k = ∫_{R^q} exp(C‖w‖^k) |g(w)| dw < ∞ for some k ∈ N_+ and C > 0.
Then

    ∫_{R^q \ B_R} |g(w)| dw ≤ ∫_{R^q \ B_R} exp( C[ ‖w‖^k − R^k ] ) |g(w)| dw ≤ I_k / exp(C R^k).

Thus, taking R = (log(√N)/C)^{1/k}, the right-hand side in (36) is bounded by

    √(C*_R)/√N + I_k/exp(C R^k) ≤ R^{(q+1)/2} c_{M,q} C*_g / √N + I_k / √N.  ⊓⊔

Recall that W^{k,2}(R^q) denotes for k ∈ N_+ the Sobolev space consisting of all functions u: R^q → R whose mixed partial derivatives D^α u of order α ∈ N^q with α_1 + ··· + α_q ≤ k satisfy D^α u ∈ L²(R^q). For u ∈ L¹(R^q) we denote by û(ξ) = ∫_{R^q} e^{−i⟨ξ,z⟩} u(z) dz, ξ ∈ R^q, the Fourier transform of u. By [7, Theorem 6.1] the space W^{k,2}(R^q) consists of precisely those u ∈ L²(R^q) for which the norm

    ‖u‖_k := ( ∫_{R^q} |û(ξ)|² (1 + ‖ξ‖²)^k dξ )^{1/2}

is finite. The next corollary specializes Corollary 1 to functions in W^{k,2}(R^q) for sufficiently large k. Analogous results could be derived for the Sobolev spaces W^{k,p}(R^q) for p > 1 and for generalized Sobolev spaces (see [4, Section 6.2]) with p = 1 even without restrictions on k.

Corollary 2 Let k ∈ N with k ≥ q/2 + 1 + ε for some ε > 0 and suppose H* ∈ W^{k,2}(R^q) ∩ L¹(R^q). Let R = N^{1/(2k − 2ε + 1)}, suppose the rows of the M_{N,q}-valued random matrix A are i.i.d. random variables with uniform distribution on B_R ⊂ R^q, suppose the entries of the R^N-valued random vector ζ are i.i.d. random variables uniformly distributed on [−max(MR,1), max(MR,1)], assume that A and ζ are independent and let σ: R → R be given as σ(x) = max(x, 0). Then, there exists W (an M_{1,N}-valued random variable) and a constant C > 0 (depending on q and M, but independent of H*, N) such that

    E[ ‖H_W^{A,ζ}(Z) − H*(Z)‖² ]^{1/2} ≤ C ‖H*‖_k N^{−1/α}    (41)

with α = 2 + (q+1)/(k − q/2 − ε). C is explicitly given in (43).

Remark 6 In the neural network H_W^{A,ζ} only the weights W are trainable, whereas A, ζ are generated randomly. Therefore, it is clear that the approximation capabilities of these networks are smaller than those of neural networks in which all parameters can be trained. This intuition is confirmed when comparing the error rate 1/α ≤ 1/2 in (41) to the rate k/q obtained in [28], [24], since k/q ≥ 1/2 for k ≥ q/2. However, the advantage of the result in Corollary 2 is that training the random neural network H_W^{A,ζ} is straightforward: it only requires solving the (convex) optimization problem over W, which is mathematically very well understood. In contrast, in the case of fully trainable neural networks one typically uses stochastic gradient descent type algorithms for parameter optimization, for which a rigorous mathematical error analysis for general shallow neural networks is challenging.

Proof Firstly, using the Cauchy-Schwarz inequality and the assumptions on H* we obtain that Ĥ* ∈ L¹(R^q), see (42). Hence, the Fourier inversion theorem yields the representation (34) with g = (2π)^{−q} Ĥ* ∈ L¹(R^q) and all z ∈ R^q. Thus, the constant C*_g in (35) is bounded by C*_g ≤ (2π)^{−q} ‖H*‖_2.
Furthermore,

    ( ∫_{R^q} ‖w‖^{k−s} |g(w)| dw )² ≤ ∫_{R^q} (1 + ‖w‖²)^{−s} dw ∫_{R^q} (1 + ‖w‖²)^k |Ĥ*(w)|² dw,    (42)

which is finite for s > q/2 (see e.g. [7, p. 193]). Choosing s = q/2 + ε yields k − q/2 − ε ≥ 1 and we may therefore apply Corollary 1 to obtain that there exists W (an M_{1,N}-valued random variable) such that

    E[ ‖H_W^{A,ζ}(Z) − H*(Z)‖² ]^{1/2} ≤ N^{−[2 + (q+1)/(k−s)]^{−1}} ( c_{M,q} C*_g + I_{k−s} ) ≤ N^{−[2 + (q+1)/(k − q/2 − ε)]^{−1}} C ‖H*‖_k

with c²_{M,q} = 16 max(M,1) Vol_q(B_1) (M+1)² (M³ + M + 2) and

    C = c_{M,q} (2π)^{−q} + ( ∫_{R^q} (1 + ‖w‖²)^{−s} dw )^{1/2}.    (43)

This completes the proof. ⊓⊔

Finally, we prove a further consequence of Theorem 1. The result in Proposition 4 below allows for a larger class of functions H* (possibly defined in terms of essentially lower-dimensional functions, for instance as a sum of univariate functions) and shows in particular how the sampling scheme in the previous results can be modified in order to cover this more general case; while the rows of A were sampled from the uniform distribution on the ball B_R ⊂ R^q in Corollary 1 above, in Proposition 4 the matrix A is in general a sparse random matrix with entries drawn from lower-dimensional balls B_R^k ⊂ R^k, k = 1, ..., q.

Proposition 4 Suppose H*: R^q → R can be represented as

    H*(z) = ∫_{R^q} e^{i⟨w,z⟩} µ̂(dw)

for some complex measure µ̂ on (R^q, B(R^q)) and all z ∈ R^q with ‖z‖ ≤ M. Assume that

    ∫_{R^q} max(1, ‖w‖²) |µ̂|(dw) < ∞.

Suppose K_1, ..., K_N are i.i.d. random variables with values in {1, ..., q} and for i = 1, ..., N, conditional on K_i = k, the i-th row A_i of A is sampled as follows:

– select (uniformly at random among the q positions) k non-zero entries,
– draw these entries from the uniform distribution on B_R ⊂ R^k,
– set the remaining q − k entries to 0,

and ζ_i is sampled uniformly on [−max(MR,1), max(MR,1)]. For k = 1, ..., q denote by λ_1 the Lebesgue measure on R, let p_k = P(K_1 = k) and assume that

    µ̂ ≪ Σ_{k=1}^q p_k Σ_{µ_1,...,µ_q ∈ {δ_0, λ_1}, #{j : µ_j = λ_1} = k} µ_1 ⊗ ··· ⊗ µ_q.    (44)

Let σ: R → R be given as σ(x) = max(x, 0). Then

(i) 1_{B_R}|µ̂| + 1_{B_R}|µ̂|_− ≪ π_X, where π_X denotes the distribution of A_i,

(ii) if g = d( 1_{B_R}|µ̂| + 1_{B_R}|µ̂|_− )/dπ_X satisfies

    ∫_{B_R} max(‖w‖³, 1) g(w)² π_X(dw) < ∞,    (45)

then there exists W (an M_{1,N}-valued random variable) such that

    E[ ‖H_W^{A,ζ}(Z) − H*(Z)‖² ]^{1/2} ≤ √C / √N + ∫_{R^q \ B_R} |µ̂|(dw),

where

    C = 8 M² max(MR,1) max(M,4) ∫_{B_R} max(‖w‖³, 1) g(w)² π_X(dw).

Proof To prove (i), suppose B ∈ B(R^q) satisfies π_X(B) = 0. Let U_k ∼ π_k, where π_k denotes the uniform distribution on B_R^k = B_R ⊂ R^k. By construction, for all k = 1, ..., q with p_k > 0 and all j_1, ..., j_k ∈ {1, ..., q} we have (with T_{j_1,...,j_k} denoting the map that embeds R^k in R^q by inserting 0 at each component j ∉ {j_1, ..., j_k}) that

    0 = P( A_1 ∈ B | K_1 = k, A_{1,j_1} ≠ 0, ..., A_{1,j_k} ≠ 0 ) = P( T_{j_1,...,j_k}(U_k) ∈ B ) = π_k( T_{j_1,...,j_k}^{−1}(B) ).
Using that π_k has a strictly positive Lebesgue density on B_R^k ⊂ R^k, this implies that T_{j_1,...,j_k}^{−1}(B) ∩ B_R^k is a Lebesgue-nullset in R^k. Therefore, for µ_{j_1,...,j_k} = µ_1 ⊗ ··· ⊗ µ_q with µ_{j_i} = λ_1 and µ_j = δ_0 for j ∉ {j_1, ..., j_k} it follows that

    µ_{j_1,...,j_k}( B ∩ B_R^q ) = ∫_{R^k} 1_B( T_{j_1,...,j_k}(w_1, ..., w_k) ) 1_{B_R^k}(w_1, ..., w_k) dw_1 ··· dw_k = 0.

This shows that B ∩ B_R^q is a nullset for each of the measures on the right-hand side of (44) and so, by (44), also for µ̂ (and consequently for |µ̂| and |µ̂|_−).

To show (ii) note that P-a.s. ‖A_i‖ ≤ R and so we may apply Theorem 1 to X = R^q and the function H̄*(z) = ∫_{R^q} e^{i⟨w,z⟩} 1_{B_R}(w) µ̂(dw). By the assumption on ζ_i, the function F_π appearing in Theorem 1 is given for |x| ≤ max(MR,1) as F_π(x) = 2 ∫_{−x}^0 2 max(MR,1) du = 4 x max(MR,1) and so C* = C*_1 in Theorem 1 becomes

    C* = 4 M² max(MR,1) [ M ∫_{B_R} ‖w‖³ g(w)² π_X(dw) + 4 ∫_{B_R} max(‖w‖², 1) g(w)² π_X(dw) ] ≤ 8 M² max(MR,1) max(M,4) ∫_{B_R} max(‖w‖³, 1) g(w)² π_X(dw)

and (16) is indeed satisfied by (45). The statement then follows precisely as in the proof of Corollary 1 by estimating the difference |H̄*(z) − H*(z)| ≤ ∫_{R^q \ B_R} |µ̂|(dw) for z ∈ B_M ⊂ R^q and applying the triangle inequality. ⊓⊔

4.3 Universal approximation by random ReLU networks

In this subsection we present a further corollary, which proves that feedforward neural networks with randomly generated inner weights are universal approximators in L²(R^q, µ) for any probability measure µ on (R^q, B(R^q)). To formulate the result let us first introduce the scheme according to which the weights are sampled. For any ρ > 1, R > 0 consider the following scheme to randomly generate weights:

(i) Let A_1, A_2, ... be i.i.d. random vectors drawn from the uniform distribution on the ball B_R ⊂ R^q,
(ii) let ζ_1, ζ_2, ... be i.i.d. uniformly distributed on [−ρ, ρ], independent of {A_i}_{i ∈ N_+}.

Note that the only parameters that need to be trained for the neural networks in Corollary 3 are the outer weights W_1, ..., W_N (once N is fixed and the inner weights A_1, A_2, ..., ζ_1, ζ_2, ... are sampled randomly). These outer weights can be trained using least-squares minimization.

Corollary 3 Let µ be a probability measure on R^q, G ∈ L²(R^q, µ) and let σ: R → R be given as σ(x) = max(x, 0). Then for any ε > 0, δ ∈ (0,1) there exist N ∈ N_+, R > 0, ρ > 1 and real-valued random variables W_1, ..., W_N ("outer weights") such that the random feedforward neural network (with "inner weights" (A_1, ζ_1), (A_2, ζ_2), ... sampled as in (i)-(ii)) specified by

    G_N(z) = Σ_{i=1}^N W_i σ( A_i · z + ζ_i ),    z ∈ R^q,

approximates G in L²(R^q, µ) up to precision ε with probability 1 − δ, that is,

    ∫_{R^q} |G(z) − G_N(z)|² µ(dz) < ε².

Proof Firstly, by using [19, Lemma 1.33] and the fact that the set of compactly supported, infinitely often differentiable functions C_c^∞(R^q) is dense in the space of continuous functions with compact support C_c(R^q) in the supremum norm, we find H* ∈ C_c^∞(R^q) satisfying

    ( ∫_{R^q} |H*(z) − G(z)|² µ(dz) )^{1/2} < ε√δ / 2.    (46)
(46)

Denoting by $\widehat{H^*}(w) = \int_{\mathbb{R}^q} e^{-i\langle w,z\rangle} H^*(z)\,dz$ the Fourier transform of $H^*$ and setting $g = (2\pi)^{-q}\widehat{H^*}$, it follows that $H^*$ can be represented as (34) for all $z \in \mathbb{R}^q$, that $g \in L^1(\mathbb{R}^q)$ and that (35) holds. Choose $M > 0$ large enough to guarantee that the support of $H^*$ is contained in $B_M$, denote by $\tilde Z$ a random variable with distribution $\mu$ and set $Z = \tilde Z\,\mathbf{1}_{B_M}(\tilde Z) + z_0\,\mathbf{1}_{\mathbb{R}^q \setminus B_M}(\tilde Z)$ for an arbitrary $z_0 \in B_M \setminus \mathring{B}_M$. Then $\|Z\| \le M$ and $H^*(Z) = H^*(\tilde Z)$, and all the assumptions of Corollary 1 are satisfied. We now select the hyperparameters as follows: choose $R > 0$ large enough to guarantee $\int_{\mathbb{R}^q \setminus B_R} |g(w)|\,dw < \frac{\varepsilon\sqrt{\delta}}{4}$ and then take $N \in \mathbb{N}_+$ to guarantee $\frac{\sqrt{C^*_R}}{\sqrt{N}} < \frac{\varepsilon\sqrt{\delta}}{4}$ (with $C^*_R$ given in (37)). Furthermore, let $\rho = \max(MR,1)$. Inserting these estimates in the right-hand side of (36) and applying Corollary 1 shows that there exists $W = (W_1 \cdots W_N)$ (an $M_{1,N}$-valued random variable) such that
\[
E\big[|H^{A,\zeta}_W(Z) - H^*(Z)|^2\big]^{1/2} < \frac{\varepsilon\sqrt{\delta}}{2}.
\]
Combining this with (46), $H^{A,\zeta}_W(z) = W\sigma(Az+\zeta) = G_N(z)$, $H^*(Z) = H^*(\tilde Z)$ and the triangle inequality yields
\[
E\Big[\int_{\mathbb{R}^q} |G(z) - G_N(z)|^2\,\mu(dz)\Big]^{1/2} < \varepsilon\sqrt{\delta}.
\]
Applying Markov's inequality then shows that
\[
P\Big(\Big(\int_{\mathbb{R}^q} |G(z) - G_N(z)|^2\,\mu(dz)\Big)^{1/2} > \varepsilon\Big) \le \frac{1}{\varepsilon^2}\,E\Big[\int_{\mathbb{R}^q} |G(z) - G_N(z)|^2\,\mu(dz)\Big] < \delta,
\]
as claimed. ⊓⊔

5 Approximation Error Estimates For Echo State Networks

In the results formulated above in Section 4 we were concerned with the static situation and with approximations based on random neural networks. We now turn to the dynamic case. Thus, we consider $D_d \subset \mathbb{R}^d$ and inputs given by semi-infinite sequences in $X = (D_d)^{\mathbb{Z}_-}$. The unknown mapping that needs to be approximated is denoted by $H^*\colon (D_d)^{\mathbb{Z}_-} \to \mathbb{R}^m$ and is called a functional (see also Section 2 for further preliminaries on the dynamic situation). In applications, $H^*$ is typically approximated by reservoir functionals. Recall that a reservoir functional is a mapping $H_{RC}$ defined as the input-to-solution map $X \ni z \mapsto y_0 \in \mathbb{R}^m$ of the state-space system (3)–(4). The goal of this section is to derive bounds for the error that arises when approximating the functional $H^*$ by such reservoir functionals. We focus on two of the most prominent families of reservoir systems, namely linear systems with neural network readouts (Section 5.2) and echo state networks (Section 5.3). Beforehand, in Section 5.1 we introduce the setting in more detail, describe the regularity assumption that is imposed on $H^*$ in both cases and characterize a general class of examples in which it is satisfied. As a corollary of the approximation error bounds derived in Section 5.3, we prove in Section 5.4 that echo state networks with randomly generated recurrent weights are universal approximators. This shows, in particular, that echo state networks with randomly generated weights are capable of approximating a large class of input/output systems arbitrarily well and, in conjunction with the error estimates in Theorem 2, thus provides the first mathematical explanation for the empirically observed success of echo state networks in learning this kind of system.
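Since reservoir functionals are the central objects in the remainder of the paper, the following minimal sketch may help fix ideas: it evaluates the input-to-solution map $z \mapsto y_0$ by iterating a state equation of the form (3) along a truncated input history and applying a readout as in (4). The particular reservoir map `F`, readout `h`, truncation length and all numerical values are toy choices made up for illustration; they are not objects from the paper.

```python
import numpy as np

# Toy reservoir map F (an r-contraction in its first argument) and readout h.
r, n_state = 0.5, 3
B = np.array([[0.3], [0.1], [-0.2]])

def F(x, z):
    # r * tanh is an r-contraction in x because tanh is 1-Lipschitz
    return r * np.tanh(x) + B @ z

def h(x):
    return float(np.sum(x ** 2))

def reservoir_functional(z_path):
    """Approximate H_RC(z) = y_0 for a left-infinite input by truncation:
    z_path holds (z_{-T}, ..., z_{-1}, z_0) and the state is started from 0
    at time -T-1; the neglected inputs contribute an error of order r^T."""
    x = np.zeros(n_state)
    for z_t in z_path:
        x = F(x, np.atleast_1d(z_t))
    return h(x)

rng = np.random.default_rng(1)
z_trunc = rng.uniform(-1.0, 1.0, size=201)   # (z_{-200}, ..., z_0), input dimension d = 1
print(reservoir_functional(z_trunc))
```

Because this toy `F` is an $r$-contraction in its first argument, the influence of the discarded inputs decays geometrically in the truncation length; the $w$-Lipschitz condition introduced in the next subsection makes this forgetting mechanism quantitative.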
5.1 Setting and regular functionals

In order to approximate the unknown functional $H^*\colon (D_d)^{\mathbb{Z}_-} \to \mathbb{R}^m$, in applications the procedure is typically as follows. In a first step, the reservoir map $F$ in (3) is fixed (often generated randomly). Then the readout function $h$ in (4) is trained by minimizing a prefixed loss function in order to approximate $H^*$ as well as possible. In what follows we will be interested in quantifying the error committed when using an approximating reservoir functional for $H^*$, conditional on the random elements used to generate it and with respect to the $L^2((D_d)^{\mathbb{Z}_-},\mu_Z)$-norm for a probability measure $\mu_Z$ on the space of inputs $((D_d)^{\mathbb{Z}_-}, \mathcal{B}((D_d)^{\mathbb{Z}_-}))$. More specifically, throughout this section $Z$ is a $(D_d)^{\mathbb{Z}_-}$-valued random variable, that is, a discrete-time stochastic process; we denote by $\mu_Z$ its law on $(D_d)^{\mathbb{Z}_-}$ and we assume that $0 \in D_d \subset B_M \subset \mathbb{R}^d$. To simplify the statements we choose $m = 1$ here, but all the results can be directly generalized to $m \in \mathbb{N}_+$.

The functionals $H^*$ for which the approximation bounds in Sections 5.2 and 5.3 can be derived are required to satisfy certain regularity assumptions. These will be stated in Assumption 1 below. Beforehand, we introduce a Lipschitz-continuity condition which quantifies how quickly $H^*$ forgets past inputs and is thus linked to its memory; see also [14] for a thorough discussion.

Definition 1 Consider a sequence $w \in (0,\infty)^{\mathbb{Z}_-}$ with $\sum_{j \in \mathbb{Z}_-} |j|\, w_j < \infty$. We say that $H^*$ is $w$-Lipschitz continuous if there exists $L > 0$ such that
\[
|H^*(u) - H^*(v)| \le L\, \|u - v\|_{1,w} \qquad (47)
\]
for all $u = (u_t)_{t \in \mathbb{Z}_-} \in (D_d)^{\mathbb{Z}_-}$, $v = (v_t)_{t \in \mathbb{Z}_-} \in (D_d)^{\mathbb{Z}_-}$, where $\|u - v\|_{1,w} := \sum_{i=0}^{\infty} w_{-i}\, \|u_{-i} - v_{-i}\|$.

Assumption 1 Suppose that $H^*\colon (D_d)^{\mathbb{Z}_-} \to \mathbb{R}$ is $w$-Lipschitz continuous for some $w \in (0,\infty)^{\mathbb{Z}_-}$ with $\sum_{j \in \mathbb{Z}_-} |j|\, w_j < \infty$ and assume that for any $T \in \mathbb{N}_+$:

(i) The restriction of $H^*$ to sequences of length $T$, which is given by the function $H^*_T\colon (D_d)^{T+1} \to \mathbb{R}$ defined by $H^*_T(z_0,\dots,z_{-T}) := H^*(\dots,0,z_{-T},\dots,z_0)$, can be represented as
\[
H^*_T(u) = \int_{\mathbb{R}^q} e^{i\langle w,u\rangle}\, g_T(w)\,dw
\]
for a $\mathbb{C}$-valued function $g_T \in L^1(\mathbb{R}^q)$ and all $u = (z_0,\dots,z_{-T}) \in (D_d)^{T+1} \subset \mathbb{R}^q$, with $q := d(T+1)$.

(ii)
\[
\int_{\mathbb{R}^q} \max(1,\|w\|^3)\, |g_T(w)|^2\,dw < \infty. \qquad (48)
\]

We now provide a general class of examples that satisfy Assumption 1. This class includes, for example, state-affine systems, linear systems with polynomial readouts, and trigonometric state-affine systems, as long as the matrix coefficients in these systems fulfill certain conditions that guarantee that condition (i) in the next proposition is satisfied. We refer to [9, 10, 11, 12, 13, 14] for a detailed discussion of these systems.

Proposition 5 Let $\rho > 0$ and suppose $H^*$ is the reservoir functional associated to the reservoir system (3)–(4) determined by the restriction to $B_\rho \times D_d$ and $B_\rho$ of the maps $F\colon \mathbb{R}^{N^*} \times \mathbb{R}^d \to \mathbb{R}^{N^*}$ and $h\colon \mathbb{R}^{N^*} \to \mathbb{R}$, respectively, and that satisfy the following hypotheses.
F irs tly , F ( B ρ × D d ) ⊂ B ρ and, addition ally , ther e exis t r ∈ ( 0 , 1 ) , L F , L h > 0 , such tha t (i) for any z ∈ D d , F | B ρ × D d ( · , z ) is an r-co ntraction, (ii) for any x ∈ B ρ , F | B ρ × D d ( x , · ) is L F -Lipschitz, (iii) F and h ar e both infinitely differ entiable. Then H ∗ = h ( H F | B ρ × D d ) satisfies A ssumption 1. Pr oof Firstly , (iii) and the mean value theor em imply that h | B ρ : B ρ → R is Lipschitz continuo us. In what f ollows we d enote by L h the best Lipschitz con stant of h | B ρ . Secondly , n ote that Propo sition 1 gu arantees that H F | B ρ × D d is ind eed well-defined . For n o tational simplicity write H F = H F | B ρ × D d . Then for any u , v ∈ ( D d ) Z − k H F ( u ) − H F ( v ) k = k F ( H F ( u ·− 1 ) , u 0 ) − F ( H F ( v ·− 1 ) , v 0 ) k ≤ k F ( H F ( u ·− 1 ) , u 0 ) − F ( H F ( v ·− 1 ) , u 0 ) k + k F ( H F ( v ·− 1 ) , u 0 ) − F ( H F ( v ·− 1 ) , v 0 ) k ≤ r k H F ( u ·− 1 ) − H F ( v ·− 1 ) k + L F k u 0 − v 0 k , where we used the echo state proper ty in the first step, then the triangle inequality and finally hypo th eses (i)-(ii). Iterating this estimate w e obtain | H ∗ ( u ) − H ∗ ( v ) | ≤ L h L F ∞ ∑ k = 0 r k k u − k − v − k k = L k u − v k 1 , w for L = L h L F and w − j = r j , j ∈ N . This proves that H ∗ is w -Lipschitz continuo us. Let T ∈ N + . By the echo state pr operty we can write H ∗ T as H ∗ T ( z 0 , . . . , z − T ) = h ◦ F ( · , z 0 ) ◦ . . . ◦ F ( H ∗ ( . . . , 0 , 0 ) , z − T ) (49) for ( z 0 , . . . , z − T ) ∈ ( D d ) T + 1 . The expr e ssion on the right h and side of (49) can b e used to extend H ∗ T to ( R d ) T + 1 = R q and hypo thesis (iii) implies that H ∗ T is infinitely often differentiable. Let χ : R → R be a co mpactly suppo rted C ∞ function that satisfies χ ( x ) = 1 for x ∈ [ − M 2 , M 2 ] . Define G : ( R d ) T + 1 → R by G ( u 0 , . . . , u T ) = H ∗ T ( u 0 , . . . , u T ) χ ( k u 0 k 2 ) · · · χ ( k u T k 2 ) . Then f or ( z 0 , . . . , z − T ) ∈ ( D d ) T + 1 one h as k z − i k ≤ M and thu s χ ( k z − i k 2 ) = 1 fo r i = 0 , . . . , T . Consequen tly , G = H ∗ T on ( D d ) T + 1 . Therefor e, the claim will fo llow if we prove that G can be represented a s G ( u ) = Z R q e i h w , u i g T ( w ) d w (50) for some g T ∈ L 1 ( R q ) satisfyin g (48) and f or all u ∈ R q . However , G is a smoo th function with co mpact suppo rt and therefo re a Schwartz fu nction. Thus, its Four ie r transform ˆ G ( w ) = R R q e − i h w , u i G ( u ) d u is also a Schwartz f unction. The Fourier in- version the o rem thus y ields (50) with g T = ( 2 π ) − q ˆ G a n d the integrability cond itions g T ∈ L 1 ( R q ) and (48) hold because g T is a Schwartz function. ⊓ ⊔ 28 Lukas Gonon et al. 5.2 Approx imation based on Linear Reservoir Systems with Rando m Neur al Network Reado uts In this section we study app r oximation s of the unknown functional H ∗ based on reser- voir fu nctionals H R C determined by (random ) linear reservoir sy stems with random neural network reado uts. More precisely , for q , N ∈ N + let S ∈ M q , c ∈ M q , d and let A and ζ ζ ζ be M N , q and M N , 1 -valued rando m matrices and vectors, respectiv ely . For any read out m atrix W ∈ M 1 , N consider the reservoir system giv en by ( X t = SX t − 1 + cZ t , t ∈ Z − , Y t = W σ σ σ ( AX t + ζ ζ ζ ) , t ∈ Z − . 
(51) Clearly , when the associated system with deterministic inp u ts z ∈ ( D d ) Z − (which is a linear system with random neural network rea d out, see (2 7)) gi ven b y x t = Sx t − 1 + cz t , t ∈ Z − , (52) y t = H A , ζ ζ ζ W ( x t ) , t ∈ Z − , (53) has th e ech o state prop erty , then the solution to (51) can be obtained b y ev aluating the filter associated to (52)-(53) at the sto chastic input Z . Remark 7 For notational simplicity we take S , c determ inistic here. Howev er, Propo- sition 6 direc tly extends to random ly drawn S , c satisfying P -a.s. th e h ypotheses of Proposition 6. The expectation in (54) is then cond itional on S , c . Proposition 6 Let N , T ∈ N + , R , M T > 0 and q = d ( T + 1 ) . S uppo se the r ows of A ar e sampled fr om the uniform distribution on B R ⊂ R q and the entries of ζ ζ ζ ar e uniformly distributed on [ − max ( M T R , 1 ) , max ( M T R , 1 )] , let σ : R → R be given as σ ( x ) = max ( x , 0 ) , a ssume th at (52) satisfies the echo state p r operty , the matrix K =  c Sc · · · S T c  is invertible , k X 0 k ≤ M T and K − 1 X 0 ∈ ( D d ) T + 1 . Then fo r any H ∗ : ( D d ) Z − → R satisfying Assumption 1 ther e e xists W (a M 1 , N -valued random va riable) such that E [ | Y 0 − H ∗ ( Z ) | 2 ] 1 / 2 ≤ p C T , R √ N + | det ( K ) | Z R q \ B R | g T ( K ⊤ w ) | d w + LM ∞ ∑ i = T + 1 w − i ! + L T ∑ i = 0 w 2 − i ! 1 / 2 k K − 1 S T + 1 X − T − 1 k (54) wher e C T , R is given in (59) . Remark 8 Th e bound in Prop osition 6 shows, in p articular, that for suitable cho ices of S (f or instan c e as given in Remark 9 below) the app roximation erro r can be made arbitra r ily small. I ndeed, if ε > 0 is g iv en, T is large enoug h and S T + 1 = 0 , then the last term in (54) vanishes and the third term satisfies LM ∑ ∞ i = T + 1 w − i < ε 3 , Approximati on Bounds for Random Neural Network s and Reserv oir Systems 29 since the weigh ting sequence w is summable. Next, one choo ses R > 0 to make | det ( K ) | R R q \ B R | g T ( K ⊤ w ) | d w < ε 3 (this is possible, since g is integrable) an d finally (with R , T now fixed) N so that √ C T , R √ N < ε 3 . Altogether, o ne o b tains that E [ | Y 0 − H ∗ ( Z ) | 2 ] 1 / 2 < ε . Pr oof Firstly , the hypo thesis that H ∗ is w -Lipschitz continuous yield s ( see (47 )) for any z ∈ ( D d ) Z − | H ∗ T ( z 0 , . . . , z − T ) − H ∗ ( z ) | ≤ L ∞ ∑ i = T + 1 w − i k z − i k ! ≤ LM ∞ ∑ i = T + 1 w − i ! . (55) Secondly , using once mo re th e w -Lipschitz pro perty ( 4 7) and H ¨ older’ s inequality show for any u = ( u t ) t = 0 ,..., T , v = ( v t ) t = 0 ,..., T ∈ ( D d ) T + 1 that | H ∗ T ( u ) − H ∗ T ( v ) | ≤ L T ∑ i = 0 w − i k u i − v i k ! ≤ L T ∑ i = 0 w 2 − i ! 1 / 2 T ∑ i = 0 k u i − v i k 2 ! 1 / 2 and therefo r e | H ∗ T ( Z 0 , . . . , Z − T ) − H ∗ T ( K − 1 X 0 ) | ≤ L T ∑ i = 0 w 2 − i ! 1 / 2 k ( Z 0 , . . . , Z − T ) − K − 1 X 0 k . (56) Iterating (52) yields the r epresentation X 0 = T ∑ i = 0 S i cZ − i + S T + 1 X − T − 1 = K    Z 0 . . . Z − T    + S T + 1 X − T − 1 , which we insert in (56) to ob tain | H ∗ T ( Z 0 , . . . , Z − T ) − H ∗ T ( K − 1 X 0 ) | ≤ L T ∑ i = 0 w 2 − i ! 1 / 2 k K − 1 S T + 1 X − T − 1 k . (57) Thirdly , consider the function G : B M T → R defined for v ∈ B M T ⊂ R q by G ( v ) = | de t ( K ) | Z R q e i h w , v i g T ( K ⊤ w ) d w , which is ind eed we ll- defined b ecause g T is integrable. 
Then the change of variables formu la and Assumption 1 yield H ∗ T ( K − 1 X 0 ) = Z R q e i h K −⊤ w , X 0 i g T ( w ) d w = | det ( K ) | Z R q e i h w , X 0 i g T ( K ⊤ w ) d w = G ( X 0 ) . 30 Lukas Gonon et al. Therefo re, the function G satisfies the hypotheses of Coro llar y 1 (integrability again follows by the change of variables for mula) an d so by Corollary 1 there exists W (a M 1 , N -valued rand om variable) su c h that E [ | H A , ζ ζ ζ W ( X 0 ) − [ H ∗ T ◦ K − 1 ]( X 0 ) | 2 ] 1 / 2 ≤ p C T , R √ N + | det ( K ) | Z R q \ B R | g T ( K ⊤ w ) | d w , (58) where C T , R = 16 max ( M T R , 1 ) V ol q ( B R )( M T + 1 ) 2 ([ M T ] 3 + M T + 2 ) | det ( K ) | 2 Z B R max ( 1 , k w k 3 ) | g T ( K ⊤ w ) | 2 d w . (59) By using the trian g le inequality and inserting the bound s o btained in (55), (57) and (58) one thus obtains th e appro ximation bou nd (54), as claimed . ⊓ ⊔ Remark 9 An impo rtant sp e cial c a se is S = ρ  0 0 0 d , d T 0 0 0 d , d I I I d T 0 0 0 d , d  and c =  I I I d 0 0 0 d T , d  (60) for ρ ∈ ( 0 , 1 ] . In this case one calculates S T + 1 = 0 and for k = 1 , . . . , T S k c = ρ k   0 0 0 d k , d I I I d 0 0 0 d ( T − k ) , d   . Thus, e.g . for ρ = 1 o ne obtain s K = I I I d ( T + 1 ) and so in particu lar K is inv ertible and k K − 1 k = 1 . In ad d ition, the system (52) satisfies the ech o state pro p erty and the solution is gi ven b y x t =  z ⊤ t , ρ z ⊤ t − 1 , . . . , ρ T z ⊤ t − T  ⊤ , t ∈ Z − . 5.3 Approx imation based on Echo State Networks In th is section we use an echo state network with random ly generated para m eters as an approximatio n to the unknown target functional H ∗ . More precisely , for ¯ N ∈ N + let A , C and ζ ζ ζ be M ¯ N , M ¯ N , d and M ¯ N , 1 -valued ran dom matrices/vectors, respectively , and for any re adout matr ix W ∈ M 1 , ¯ N consider the reservoir system given b y ( x t = σ σ σ ( Ax t − 1 + Cz t + ζ ζ ζ ) , t ∈ Z − , y t = Wx t , t ∈ Z − for z ∈ ( D d ) Z − . Such a system is called an echo state network. If this RC system has the echo state proper ty (see Section 2), then th e reservoir fun c tio nal H A , C , ζ ζ ζ W ( z ) = y 0 (that is, the input-to- so lution map ( D d ) Z − ∋ z 7→ y 0 ) is well-defined and measurable. Evaluating H A , C , ζ ζ ζ W at the stochastic in put signal Z then amounts to solving th e asso- ciated system with stochastic input ( X t = σ σ σ ( AX t − 1 + CZ t + ζ ζ ζ ) , t ∈ Z − , Y t = WX t , t ∈ Z − . (61) Approximati on Bounds for Random Neural Network s and Reserv oir Systems 31 The next r esult shows that it is possible to gene rate A , C and ζ ζ ζ from a gene ric distribution (not depend in g on H ∗ ) and use this gener ic echo state network to appr oxi- mate H ∗ arbitrarily well. Th us, X is universal and to appr oximate H ∗ only the read out matrix W ∈ M 1 , ¯ N needs to be trained, a task which amoun ts to a linear regression. Theorem 2 Let σ : R → R be given as σ ( x ) = max ( x , 0 ) . Let T , N ∈ N + , R > 0 , assume that k ( Z 0 , . . . , Z − T ) k R d ( T + 1 ) ≤ M T and generate A , C , ζ ζ ζ according to the fol- lowing pr ocedur e: (i) d raw N i.i.d . samples A 1 , . . . , A N fr o m the uniform distribution on B R ⊂ R d ( T + 1 ) and N i.i.d. samples ζ 1 , . . . , ζ N (also indepen dent o f { A i } i = 1 ,..., N ) fr o m the unifo rm distribution on [ − max ( M T R , 1 ) , max ( M T R , 1 )] , (ii) let S , c be the shift matrices defined in (60) with ρ = 1 and set a =    A ⊤ 1 . . . 
A ⊤ N    , ¯ A =  S 0 0 0 q , N aS 0 0 0 N , N  , ¯ C =  c ac  , ¯ ζ ζ ζ =      0 0 0 q ζ 1 . . . ζ N      , A =  ¯ A − ¯ A − ¯ A ¯ A  , C =  ¯ C − ¯ C  , ζ ζ ζ =  ¯ ζ ζ ζ − ¯ ζ ζ ζ  . Then for a ny H ∗ : ( D d ) Z − → R satisfying Assumption 1 there exists a readout W (a M 1 , 2 ( N + d ( T + 1 )) -valued r ando m variable) such that the system (6 1) satisfies the echo state pr operty and E [ | Y 0 − H ∗ ( Z ) | 2 ] 1 / 2 ≤ p C T , R √ N + Z R q \ B R | g T ( u ) | d u + LM ∞ ∑ i = T + 1 w − i ! (62) with C T , R = 16 max ( M T R , 1 ) V ol q ( B R )( M T + 1 ) 2 ([ M T ] 3 + M T + 2 ) · Z B R max ( 1 , k u k 3 ) | g T ( u ) | 2 d u . (63) Remark 10 The pro cess X in (61) is n ot related in any way to the un known fu nc- tional H ∗ . X is gen eric an d can b e vie wed as a “reservoir” that ef ficiently stores the informa tio n ab out the history of the input process Z . Theorem 2 shows that f or any “sufficiently regular” fun ctional H ∗ one can appro ximate H ∗ ( Z ) by WX 0 for an ap- propr iately chosen W , i.e . by apply ing a linear mapping to X 0 . This ph enomen o n is analogo u s to the situation encoun tered in continuo us-time stochastic proce sses satis- fying ce r tain stochastic dif ferential equ ations, which can be app roximated by ap ply- ing a lin ear fu n ctional to th e signature of the driving path, see e.g . [20, Ch a pter 5], [8, Chapter 18]. See also [ 5] an d [6]. Remark 11 For simplicity (and to give a fully constructive sampling proce d ure) we have ch osen h e r e for S , c the shift m atrices define d in (60) with ρ = 1. Howev er, Theorem 2 can b e directly gen eralized to ρ ∈ ( 0 , 1 ) and ar bitrary S , c satisfying the hypoth eses stated in Prop o sition 6. The boun d (62) is then replac ed by the boun d (54) and the constant C T , R giv en in (63) is replaced by (59). 32 Lukas Gonon et al. Remark 12 By using Markov’ s in equality the bo u nd (62) immed iately yie ld s a high - probab ility b ound on the approx imation erro r conditiona l on the reservoir parameter s: for any δ ∈ ( 0 , 1 ) it holds with pro bability 1 − δ that the (ran dom) e c ho state network H A , C , ζ ζ ζ W satisfies  Z ( D d ) Z − | H A , C , ζ ζ ζ W ( z ) − H ∗ ( z ) | 2 µ Z ( d z )  1 / 2 ≤ φ ( T , R , N ) δ , where φ ( T , R , N ) is the righ t han d side in (6 2). Pr oof Firstly , Pr oposition 6 an d Remark 9 sho w that fo r any H ∗ satisfying Assump- tion 1 ther e exists w (a M 1 , N -valued random variable) such that th e b ound (62) hold s with Y 0 = Y Lin 0 satisfying ( X Lin t = SX Lin t − 1 + cZ t , t ∈ Z − , Y Lin t = w σ σ σ ( aX Lin t + b ) , t ∈ Z − , and b ⊤ =  ζ 1 · · · ζ N  . Now set ¯ W =  0 0 0 1 , q w  and W =  ¯ W 0 0 0 1 , q + N  . W e first sho w that (61) has a solution. T o d o th is we defin e ¯ X t =  X Lin t aX Lin t + b  and claim that X t =  σ σ σ ( ¯ X t ) σ σ σ ( − ¯ X t )  is a solution to the first equation in (61). Indeed, we first calcu late ¯ A ¯ X t − 1 + ¯ CZ t + ¯ ζ ζ ζ =  SX Lin t − 1 + cZ t aSX Lin t − 1 + acZ t + b  =  X Lin t aX Lin t + b  = ¯ X t and then insert this to obtain σ σ σ ( AX t − 1 + CZ t + ζ ζ ζ ) = σ σ σ (  ¯ A − ¯ A  ( σ σ σ ( ¯ X t − 1 ) − σ σ σ ( − ¯ X t − 1 )) +  ¯ C − ¯ C  Z t +  ¯ ζ ζ ζ − ¯ ζ ζ ζ  ) = σ σ σ  ¯ X t − ¯ X t  = X t , as claimed. In addition, Y t = WX t = ¯ W σ σ σ ( ¯ X t ) = w σ σ σ ( aX Lin t + b ) = Y Lin t and so we hav e c o nstructed a solution to (61) and p roved that (6 2) holds. 
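The sampling procedure (i)–(ii) and the block structure of $\bar A$, $\bar C$, $\bar\zeta$ used in the construction above are fully explicit, so they can be written out directly. The following minimal sketch (an illustration only, not part of the proof) builds the matrices of Theorem 2 for toy values of $d$, $T$, $N$, $R$ and $M_T$, uses the standard normalized-Gaussian trick to sample uniformly from a ball, and then runs the state recursion of (61); obtaining the readout $W$ would then amount to the linear regression mentioned before the statement of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_ball(n, q, R, rng):
    """n i.i.d. samples from the uniform distribution on the ball B_R in R^q."""
    g = rng.standard_normal((n, q))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = R * rng.random(n) ** (1.0 / q)
    return directions * radii[:, None]

d, T, N, R, M_T = 1, 5, 50, 2.0, 1.0      # toy hyperparameters (hypothetical values)
q = d * (T + 1)

# (i) rows A_i uniform on B_R in R^{d(T+1)}, biases zeta_i uniform
a = uniform_ball(N, q, R, rng)                                # a = (A_1^T; ...; A_N^T)
zeta = rng.uniform(-max(M_T * R, 1), max(M_T * R, 1), N)

# (ii) shift matrices S, c from (60) with rho = 1, then the block matrices
S = np.zeros((q, q)); S[d:, :q - d] = np.eye(q - d)
c = np.zeros((q, d)); c[:d, :] = np.eye(d)

A_bar = np.block([[S, np.zeros((q, N))], [a @ S, np.zeros((N, N))]])
C_bar = np.vstack([c, a @ c])
zeta_bar = np.concatenate([np.zeros(q), zeta])

A_esn = np.block([[A_bar, -A_bar], [-A_bar, A_bar]])
C_esn = np.vstack([C_bar, -C_bar])
zeta_esn = np.concatenate([zeta_bar, -zeta_bar])

def esn_state(Z_path):
    """State X_0 of the echo state network (61), started from the zero state
    and driven by a truncated input path Z_path of shape (length, d)."""
    x = np.zeros(2 * (q + N))
    for z_t in Z_path:
        x = np.maximum(A_esn @ x + C_esn @ z_t + zeta_esn, 0.0)   # ReLU activation
    return x

Z_path = rng.uniform(-M_T / np.sqrt(T + 1), M_T / np.sqrt(T + 1), size=(100, d))
X0 = esn_state(Z_path)          # W @ X0 would then approximate H*(Z)
```

The state dimension is $2(N + d(T+1))$, which matches the size of the readout $W$ in the statement of the theorem.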
It remains to be pr oved that the system ( 61) satisfies the echo state property . T o do so, consider an arbitrary solution ( U , ˜ Y ) to (6 1), i.e. ( U , ˜ Y ) satisfyin g ( U t = σ σ σ ( A U t − 1 + CZ t + ζ ζ ζ ) , t ∈ Z − , ˜ Y t = WU t , t ∈ Z − . Approximati on Bounds for Random Neural Network s and Reserv oir Systems 33 Partitioning U t = U [ 1 ] t U [ 2 ] t ! (with U [ i ] t valued in R d ( T + 1 )+ N ) an d setting ¯ U t = U [ 1 ] t − U [ 2 ] t one calculates U t = σ σ σ (  ¯ A − ¯ A  ( U [ 1 ] t − 1 − U [ 2 ] t − 1 ) +  ¯ C − ¯ C  Z t +  ¯ ζ ζ ζ − ¯ ζ ζ ζ  ) = σ σ σ  ¯ A ¯ U t − 1 + ¯ CZ t + ¯ ζ ζ ζ − ( ¯ A ¯ U t − 1 + ¯ CZ t + ¯ ζ ζ ζ )  (64) and therefo r e ¯ U t = σ σ σ ( ¯ A ¯ U t − 1 + ¯ CZ t + ¯ ζ ζ ζ ) − σ σ σ ( − ( ¯ A ¯ U t − 1 + ¯ CZ t + ¯ ζ ζ ζ )) = ¯ A ¯ U t − 1 + ¯ CZ t + ¯ ζ ζ ζ . (65) By fu r ther partitionin g ¯ U t = ¯ U [ 1 ] t ¯ U [ 2 ] t ! (with ¯ U [ 1 ] t valued in R d ( T + 1 ) and ¯ U [ 2 ] t valued in R N ) one obtains f rom (65) that ¯ U [ 1 ] t ¯ U [ 2 ] t ! = S ¯ U [ 1 ] t − 1 + cZ t aS ¯ U [ 1 ] t − 1 + acZ t + b ! . (66) Howe ver , the line a r system (52) satisfies th e echo state pr operty a nd so ¯ U [ 1 ] t = X Lin t . Inserting this in ( 66) sh ows that ¯ U [ 2 ] t = aX Lin t + b . This p r oves that ¯ U t = ¯ X t . Using this in the second step and inserting ( 65) into (64) in th e first step shows that U t =  σ σ σ ( ¯ U t ) σ σ σ ( − ¯ U t )  =  σ σ σ ( ¯ X t ) σ σ σ ( − ¯ X t )  = X t and hence also ˜ Y = Y , as claimed. ⊓ ⊔ Remark 13 As explain ed in Remark 10 the state pr ocess X can be v iewed as a “reser- voir” that stores the histor y o f the in put p rocess Z . Cho o sing X as an echo state network, i.e. evolving accordin g to the dyn a mics specified in (61), is the most com- monly used choice in p ractical applications in reserv oir computing, see f o r instance [18], [29]. Fr o m a pur e ly mathematical point of view it could also be in teresting to look for o th er cho ices of upd ate functions G so that for X t = G ( X t − 1 , Z t ) a similar result to T h eorem 2 can be proved. Howev er, proving such a re su lt would require different techniques than those used in the proo f of Th eorem 2 (which, due to its re- liance on Corollary 1 v ia Proposition 6, is specific to the neu ral network choice ma d e here) and G can not be chosen arbitrarily . For instan c e, if we c h oose σ σ σ ( x ) = x in (61), then WX 0 is a linear function al of Z , which can not be used to appro x imate th e (in general non-line a r ) functio nal H ∗ . Remark 14 Let us be mor e spec ific about how e cho state networks are u sed in ap- plications. I n m any situations, the go al is to learn an unk nown input/outpu t system from data. For example, in [18], [29] the consider ed task is to p redict the evolu- tion o f chao tic dynam ical systems based on observational data. In general such prob- lems can be phrased using a target pro c ess Y = ( Y t ) t ∈ Z and an observation proce ss 34 Lukas Gonon et al. Z = ( Z t ) t ∈ Z . The go al is to p r edict Y t based on ( Z s ) s ≤ t . For instance, the target pro- cess is Y t = H ∗ (( Z s ) s ≤ t ) o r Y t = Z t + h for some h > 0 (which co r respond s to lear ning the function al H ∗ (( Z s ) s ≤ t ) = E [ Z t + h | ( Z s ) s ≤ t ] ). T o achieve this goa l, ech o state net- works as introdu ced in (61) are used. First, th e parameters A , C and ζ ζ ζ are gener ated accordin g to some g iv en distrib ution (for instance, all e ntries are drawn from a nor- mal distribution). 
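For concreteness, the workflow just described (drawing $A$, $C$, $\zeta$ from a fixed distribution and then fitting only the readout on past data) can be sketched as follows on a toy one-step-ahead prediction task, i.e. $Y_t = Z_{t+h}$ with horizon $h = 1$ in the notation of this remark. The Gaussian initialization follows the remark; the rescaling of $A$, the $\tanh$ activation and the initial washout period are common practical choices rather than ingredients of the results in this section (which are stated for ReLU), and all names and numbers are hypothetical. The `lstsq` call at the end is the linear regression for $W$ written out below.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy observation process; the target is the next value of Z.
Z = np.sin(0.1 * np.arange(2000)) + 0.05 * rng.standard_normal(2000)

n_res = 200                                      # reservoir size, input dimension d = 1
A = rng.standard_normal((n_res, n_res))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))  # rescale for stable states (practical choice)
C = rng.standard_normal(n_res)
zeta = rng.standard_normal(n_res)

# Run the echo state network states X_t = sigma(A X_{t-1} + C Z_t + zeta).
X = np.zeros((len(Z), n_res))
x = np.zeros(n_res)
for t, z_t in enumerate(Z):
    x = np.tanh(A @ x + C * z_t + zeta)
    X[t] = x

# Least-squares readout on past data, discarding an initial washout period (common practice).
X_train, Y_train = X[200:-1], Z[201:]            # training pairs (X_t, Y_t) with Y_t = Z_{t+1}
W, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)

print("prediction of the next value:", W @ X[-1])
```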
T h en the readout matrix W is tr ained by a linear regression using past data, i.e. by solving W ∗ = arg m in W 1 T T ∑ k = 1 k WX t − k − Y t − k k 2 and W ∗ X t is then the predictio n of Y t . This is in p ractice repeated for dif feren t ran- dom samples A , C , a nd ζ ζ ζ and an optim ization over some hyper parameter s is car ried out. This procedu re has been successful at learning input/outp ut systems in a wide range of ap plications, in the sense that echo state networks hav e be e n able to achieve a low mean squared prediction error k W ∗ X t − Y t k in comparison to o th er m ethods. In view of Remark 3, Theore m 2 directly provides er r or bound s for this p rocedu r e in the case Y t = H ∗ (( Z s ) s ≤ t ) . In the case wh en Y t is a gen eral rando m vector not necessarily measurable with respe c t to the sigm a-algebra g enerated by ( Z s ) s ≤ t (for instance, if Y t = Z t + h ) th en the a p pr oximation err o r bo unds in Theorem 2 ca n be combined with the generalization err or b ound s in [9] to o btain an err or analysis for echo state network-based lear n ing also in th is case. In order to use the boun d in Theor em 2 in practice on e c an now prescribe an approx imation accuracy ε > 0 and subsequently select the hyperpar ameters R , T , N so that the righ t han d side o f (62) is smaller than ε . The next result provides a sp ecial case of Theore m 2 w h en H ∗ T is in the So b olev space W k , 2 ( R q ) . Corollary 4 Let σ : R → R be given a s σ ( x ) = max ( x , 0 ) and let w ∈ ( 0 , ∞ ) Z − with ∑ j ∈ Z − | j | w j < ∞ . Let T , N ∈ N + , let q = d ( T + 1 ) , let k ∈ N with k ≥ q 2 + 1 + ε for some ε > 0 and let R = N 1 / ( 2 k − 2 ε + 1 ) . Assume th a t k ( Z 0 , . . . , Z − T ) k R d ( T + 1 ) ≤ M T and generate A , C , ζ ζ ζ according to th e pr ocedu re described in (i)-(ii) in Theor em 2. Then for a ny H ∗ : ( D d ) Z − → R th at is w-Lipschitz contin uous with Lipschitz con stant L and satisfies H ∗ T ∈ W k , 2 ( R q ) ∩ L 1 ( R q ) ther e exists a readout W (a M 1 , 2 ( N + d ( T + 1 )) - valued random variable) such tha t the system (61) satisfies th e echo state pr operty and E [ | Y 0 − H ∗ ( Z ) | 2 ] 1 / 2 ≤ C k H ∗ T k k N − 1 / α + LM ∞ ∑ i = T + 1 w − i ! with α = 2 + ( q + 1 ) k − q / 2 − ε and C = [ 1 6 max ( M T , 1 ) V ol q ( B 1 )( M T + 1 ) 2 ([ M T ] 3 + M T + 2 )] 1 / 2 ( 2 π ) − q +  Z R q ( 1 + k w k 2 ) − q / 2 − ε d w  1 / 2 . Pr oof The cor ollary is a consequ ence of Theorem 2 and Corollary 2. More spec ifi- cally , to d educe th e d esired result from Theorem 2 it suffices to p rove that the first Approximati on Bounds for Random Neural Network s and Reserv oir Systems 35 two erro r terms in (62) are bound e d by C k H ∗ T k k N − 1 / α . T o this end, no te that these error ter ms arise when apply ing Corollary 1 in (58). Our hyp otheses allow us to ap p ly Corollary 2 instead of Coro llary 1, which dire ctly yields the desired expression for the upper bound and the constant. ⊓ ⊔ W e now provide an example in which, for each N , goo d choices of the hyperpa- rameters T and R can be given explicitly as a fun ction of N and thus also the bound (62) depends only on N . Example 1 Let d = 1, D d = [ − M , M ] , λ ∈ ( 0 , 1 ) and c onsider the fun ctional H ∗ ( z ) = exp ( − 1 2 ∑ ∞ i = 0 λ i ( z − i ) 2 ) . 
Then H ∗ satisfies the hypotheses of Theorem 2 an d we may choose R , T app ropriately to o btain f or any N ∈ N + E [ | Y 0 − H ∗ ( Z ) | 2 ] 1 / 2 ≤ p ( N ) N γ for some slowly growing fun ction p (a power of logar ith ms of N ) and som e γ > 0. W e car efully prove th is in the next Lemma. Lemma 2 Let β > α > 0 satisfy 1 > α 2 ( 1 − lo g ( 2 ) + log ( β / α )) . Then for a ny N ∈ N + the ESN appr oximation co nstructed in Theorem 2 with T + 1 = α log ( √ N ) , R = β log ( √ N ) , satisfies E [ | Y 0 − H ∗ ( Z ) | 2 ] 1 / 2 ≤ p ( N ) N γ with p : ( 0 , ∞ ) → R and γ > 0 given in (72) and (73) , respectively . Pr oof Firstly , using that f e : [ 0 , ∞ ) → [ 0 , ∞ ) , f e ( x ) = exp ( − x / 2 ) is 1 / 2-Lipschitz, one estimates | H ∗ ( u ) − H ∗ ( v ) | ≤ 1 2 | ∞ ∑ i = 0 λ i [( u − i ) 2 − ( v − i ) 2 ] | ≤ M ∞ ∑ i = 0 λ i | ( u − i ) − ( v − i ) | and so H ∗ is w -Lip schitz fo r w = ( λ k ) k ∈ N . Secon dly , let Σ = diag ( 1 , λ , . . . , λ T ) . Not- ing that H ∗ T is the cha r acteristic function o f a N ( 0 , Σ ) -distributed rando m variable one has for any u = ( z 0 , . . . , z − T ) H ∗ T ( u ) = exp − 1 2 T ∑ i = 0 λ i ( z − i ) 2 ! = Z R T + 1 e i h w , u i g T ( w ) d w where g T is the density o f a N ( 0 , Σ ) -distribution. In particula r, g T is in tegrable an d (48) is satisfied. Choosing ρ = √ λ in the shift matrix (6 0) we note that K = Σ 1 / 2 is in vertible. By Theor em 2 and Remark 11 it follows that the appr oximation bound (5 4) holds with C T , R giv en in (59). The last term in the boun d (5 4) is 0, since S T + 1 = 0. For o u r ch oice T + 1 = α log ( √ N ) th e seco nd to last term in th e boun d (54) equals LM ∞ ∑ i = T + 1 w − i = λ T + 1 LM / ( 1 − λ ) = 1 √ N α log ( 1 / λ ) LM / ( 1 − λ ) . (67) 36 Lukas Gonon et al. Denoting by V a N ( 0 , I I I T + 1 ) -distributed rando m variable, the second term in the right hand side of (54) can b e written as | det ( K ) | Z R T + 1 \ B R | g T ( K ⊤ w ) | d w = ( 2 π ) − ( T + 1 ) / 2 Z R T + 1 \ B R e − k w k 2 2 d w = P ( k V k > R ) . Recall that k V k 2 has a chi-square distribution with T + 1 degrees of freed om. Using this and the fact tha t R 2 > T + 1 (bec ause β > α ) one estimates P ( k V k > R ) ≤ P ( k V k 2 > R 2 ) ≤  R 2 T + 1 e 1 − R 2 / ( T + 1 )  ( T + 1 ) / 2 = 1 √ N β / 2 − α / 2 − α log ( β / α ) / 2 . (68) Finally , one calculates | det ( K ) | 2 Z B R max ( 1 , k w k 3 ) | g T ( K ⊤ w ) | 2 d w ≤ R 3 ( 2 π ) − ( T + 1 ) Z R T + 1 e −k w k 2 d w = R 3 ( 2 π ) − ( T + 1 ) / 2 2 − ( T + 1 ) / 2 . (69) Recall the following standard estimate for the volume of the ball B R ⊂ R q : V ol q ( B R ) ≤ 1 √ q π  2 π e q  q / 2 R q . (70) Inserting (69), M T ≤ p ( T + 1 ) M an d (7 0) in (59) yield s ( for M T > 1, R > 1) C T , R ≤ 2 8 π M 7 ( T + 1 ) 3 R 4  eR 2 2 ( T + 1 )  ( T + 1 ) / 2 . (71) W e m ay now pu t together all the terms that we e stimated separately: inserting (67), (68) and (71) in the a p proxim ation bound (54) yields E [ | Y 0 − H ∗ ( Z ) | 2 ] 1 / 2 ≤ p ( N ) √ N , where γ = 1 2 min { α log ( λ − 1 ) , β 2 − α 2 ( 1 + log ( β / α )) , 1 − α 2 ( 1 − log ( 2 ) + lo g ( β / α )) } , (72) p ( N ) = 2 8 π M 7 α 3 β 4 ( log ( √ N )) 7 + 1 + LM 1 − λ . (73) Note that the seco nd ter m in (72) is positive, since 1 + log ( x ) ≤ x for x > 0 and since α < β . The last te r m in ( 72) is positi ve by assumption on α , β and so indeed γ > 0. 
⊓⊔

5.4 Universal approximation by echo state networks

As a corollary of the echo state network approximation error bounds in Theorem 2, we also obtain a constructive ESN universality result, see Corollary 5 below. This complements the ESN universality result in [11, Theorem III.10]. The key novelty of Corollary 5 is that a constructive approximation procedure (up to tuning the hyperparameters $N$, $T$, $R$ and carrying out a regression to estimate $W$) is given, whereas [11, Theorem III.10] is an existence result. Note also that the setting is slightly different (the activation function here is ReLU and the inputs are uniformly bounded).

Corollary 5 Let $H^*\colon (D_d)^{\mathbb{Z}_-} \to \mathbb{R}$ be measurable with $E[|H^*(Z)|^2] < \infty$ and let $\sigma\colon \mathbb{R} \to \mathbb{R}$ be given as $\sigma(x) = \max(x,0)$. Then for any $\varepsilon > 0$, $\delta \in (0,1)$ there exist $N, T \in \mathbb{N}_+$, $R > 0$ and a readout $W$ (an $M_{1,2(N+d(T+1))}$-valued random variable) such that the system (61) (with $A$, $C$, $\zeta$ generated according to (i)–(ii) in Theorem 2 for $M_T = M\sqrt{T}$) satisfies the echo state property and, denoting by $H^{A,C,\zeta}_W$ the associated random ESN functional, the approximation error satisfies with probability $1-\delta$ that
\[
\Big(\int_{(\mathbb{R}^d)^{\mathbb{Z}_-}} |H^{A,C,\zeta}_W(z) - H^*(z)|^2\,\mu_Z(dz)\Big)^{1/2} = E\big[|H^{A,C,\zeta}_W(Z) - H^*(Z)|^2 \,\big|\, A, C, \zeta\big]^{1/2} < \varepsilon.
\]

Proof Firstly, by standard properties of the conditional expectation (see for instance [11, Lemma A.1]) we may find $T^* \in \mathbb{N}_+$ satisfying
\[
E\big[|H^*(Z) - E[H^*(Z)\,|\,\mathcal{F}_{-T^*}]|^2\big]^{1/2} < \frac{\varepsilon\sqrt{\delta}}{3}, \qquad (74)
\]
where $\mathcal{F}_{-T^*} := \sigma(Z_0,\dots,Z_{-T^*})$. Let $q := d(T^*+1)$. By definition, $E[H^*(Z)\,|\,\mathcal{F}_{-T^*}]$ is $\mathcal{F}_{-T^*}$-measurable and so there exists a measurable function $H^{(1)}\colon \mathbb{R}^q \to \mathbb{R}$ such that $E[H^*(Z)\,|\,\mathcal{F}_{-T^*}] = H^{(1)}(Z_0,\dots,Z_{-T^*})$ and $E[|H^{(1)}(Z_0,\dots,Z_{-T^*})|^2] < \infty$ (see, e.g., [19, Lemma 1.13]). By combining [19, Lemma 1.33] and the fact that $C^\infty_c(\mathbb{R}^q)$ is dense in $C_c(\mathbb{R}^q)$ in the supremum norm, we find $H^{(2)} \in C^\infty_c(\mathbb{R}^q)$ satisfying
\[
E\big[|H^{(1)}(Z_0,\dots,Z_{-T^*}) - H^{(2)}(Z_0,\dots,Z_{-T^*})|^2\big]^{1/2} < \frac{\varepsilon\sqrt{\delta}}{3}. \qquad (75)
\]
We claim that $H^{(2)}$ satisfies Assumption 1. Indeed, $H^{(2)}$ is Lipschitz continuous on $\mathbb{R}^q$ and thus also $w$-Lipschitz with $w = (\mathbf{1}_{\{t \le T^*\}})_{t \in \mathbb{N}}$. In addition, for any $T \in \mathbb{N}_+$ one has that $H^{(2)}_T$ is a Schwartz function, hence its Fourier transform is also a Schwartz function, and so the Fourier inversion theorem applies and (48) indeed holds. Now set $T = T^*+1$, choose $R$ so that the second-to-last term on the right-hand side of (54) is smaller than $\varepsilon\sqrt{\delta}/6$ and then choose $N$ such that $\frac{\sqrt{C_{T,R}}}{\sqrt{N}} < \varepsilon\sqrt{\delta}/6$. Applying Theorem 2 then yields
\[
E\big[|Y_0 - H^{(2)}(Z_0,\dots,Z_{-T^*})|^2\big]^{1/2} < \frac{\varepsilon\sqrt{\delta}}{3}. \qquad (76)
\]
Applying the triangle inequality and using (74), (75), (76) we then obtain $E[|Y_0 - H^*(Z)|^2]^{1/2} < \varepsilon\sqrt{\delta}$. Thus, Markov's inequality gives
\[
P\Big(\Big(\int_{(\mathbb{R}^d)^{\mathbb{Z}_-}} |H^{A,C,\zeta}_W(z) - H^*(z)|^2\,\mu_Z(dz)\Big)^{1/2} > \varepsilon\Big) \le \frac{1}{\varepsilon^2}\,E\big[E[|Y_0 - H^*(Z)|^2\,|\,A,C,\zeta]\big] < \delta,
\]
as claimed.
⊓⊔

6 Approximation Error Estimates For Echo State Networks with Output Feedback

In this section we continue our study of the dynamic situation, but we now focus on approximations based on a slightly different type of reservoir computing system: echo state networks with output feedback, that is, systems given for $z \in (D_d)^{\mathbb{Z}_-}$ and $t \in \mathbb{Z}_-$ by
\[
x_t = \sigma(A y_{t-1} + C z_t + \zeta), \qquad y_t = W x_t. \qquad (77)
\]
These systems are a popular modification of the echo state networks considered in Section 5. They are also referred to as Jordan recurrent neural networks (with random internal weights) and are widely used in the literature. The advantage of these systems is that they can be used to directly approximate the reservoir function in case the functional $H^*\colon (D_d)^{\mathbb{Z}_-} \to \mathbb{R}^m$ is itself induced by a reservoir system. More precisely, consider $H^*$ defined via $(D_d)^{\mathbb{Z}_-} \ni z \mapsto H^*(z) = y^*_0 \in \mathbb{R}^m$ with $y^*$ determined by
\[
x^*_t = F^*(x^*_{t-1}, z_t), \qquad y^*_t = h^*(x^*_t), \qquad t \in \mathbb{Z}_-. \qquad (78)
\]
For functionals $H^*$ of this type the system (77) can be used to directly approximate the state updating function, that is, the function $F^*$ in (78). The disadvantage of the system (77) is that the training procedure is more involved, since the readout $W$ is fed back into the state equation of the echo state network in (77). Nevertheless, these systems are used frequently in reservoir computing applications and so we also provide a detailed approximation analysis here.

This section is structured as follows. In Theorem 3 in Section 6.2 we present our approximation result for functionals induced by sufficiently regular reservoir systems. Remarkably, in this case only one hyperparameter $N$ appears (proportional to the number of neurons, i.e. the dimension of $x$ in (77)), and the approximation error is of order $O(1/\sqrt{N})$. Theorem 3 follows from our more general approximation result, Theorem 4 below, and Proposition 2. Beforehand we introduce the setting and regularity assumptions in Section 6.1.

6.1 Setting and regular reservoir functionals

As in Section 5 we study systems (77) in which first $A$, $C$, $\zeta$ are generated randomly (and then considered fixed) and subsequently $W$ is trained in order to approximate $H^*$ as well as possible. We now specify the objects involved in more detail. Firstly, note that in practice, instead of the infinite-history system (77), one always uses a system that satisfies (77) for $t \ge -T$ and is initialized at $t = -T-1$ with $y_{-T-1} = \Xi$ for some $T \in \mathbb{N}_+$ and some $\Xi \in \mathbb{R}^m$ satisfying $\|\Xi\| \le M$. Thus, these are also the systems we consider here. Next, throughout this section $Z$ is a $(D_d)^{\mathbb{Z}_-}$-valued random variable (a discrete-time stochastic process) independent of $A$, $C$, $\zeta$. As in the previous sections, the approximation error is measured conditional on the randomly generated parameters $A$, $C$, $\zeta$. However, in order to provide an alternative viewpoint, we formulate the approximation results in this section in terms of statistical risk. Thus, for some integrable random variable $Y_0$ we consider the risk defined by $R(H) := E[L(H(Z), Y_0)]$ for a loss function $L\colon \mathbb{R}^m \times \mathbb{R}^m \to [0,\infty)$ satisfying the Lipschitz condition
\[
|L(x,y) - L(\bar x,\bar y)| \le L_L\big(\|x - \bar x\|_2 + \|y - \bar y\|_2\big), \qquad x, \bar x, y, \bar y \in \mathbb{R}^m. \qquad (79)
\]
In order to state our approximation result for echo state networks with output feedback, let us now make precise which kinds of functionals we aim to approximate. Let $N^* \in \mathbb{N}_+$. We consider functions $f\colon \mathbb{R}^{N^*} \times D_d \to \mathbb{R}$ whose restriction to $B_{M+1} \times D_d$ satisfies the following smoothness condition:

Definition 2 A function $f\colon \mathbb{R}^{N^*} \times D_d \to \mathbb{R}$ is sufficiently smooth if for $(x,z) \in B_{M+1} \times D_d$ one has $f(x,z) = \int_{\mathbb{R}^{N^*+d}} \hat f(w)\, e^{i(x,z)\cdot w}\,dw$, where $\hat f\colon \mathbb{R}^{N^*+d} \to \mathbb{C}$ is a function satisfying
\[
C_f = \Big( \mathrm{Vol}_{N^*+d}(B_1) \int_{\mathbb{R}^{N^*+d}} \max\big(1, \|w\|^{2(N^*+d+3)}\big)\, |\hat f(w)|^2\,dw \Big)^{1/2} < \infty. \qquad (80)
\]

Remark 15 For instance, if $D_d = \mathbb{R}^d$, $f \in L^1(\mathbb{R}^{N^*+d}) \cap L^2(\mathbb{R}^{N^*+d})$, $\hat f$ denotes the Fourier transform of $f$ and $\hat f$ is integrable, then condition (80) is equivalent to the requirement that $f$ belongs to the Sobolev space $W^{N^*+d+3,2}(\mathbb{R}^{N^*+d})$, see e.g. [7, Theorem 6.1].

Remark 16 In this section we consider the dimensions $d$ and $N^*$ as fixed. The behaviour of (80) as a function of $N^*+d$ depends on the function $f$ (or rather the family of functions indexed by $N^*+d$) under consideration. Recalling the estimate (70) for the volume of the unit ball, one observes that the factor $\mathrm{Vol}_{N^*+d}(B_1)$ in (80) decreases to $0$ exponentially as $N^*+d \to \infty$.

With this definition at hand, we now state the regularity assumption imposed on the functionals under consideration. Note that we focus on approximating the state equation here and so we set $m = N^*$ and take $h^*$ to be the identity in (78). To approximate systems with general $h^*$ one may either combine the results presented here with any static approximation technique or proceed as explained in Remark 17 below. Note that under Assumption 2 the system (78) satisfies the echo state property, see Proposition 1.

Assumption 2 Suppose $H^*\colon (D_d)^{\mathbb{Z}_-} \to \mathbb{R}^m$ satisfies $H^*(z) = x^*_0$, where $x^*$ satisfies (78) for some continuous function $F^*\colon \mathbb{R}^{N^*} \times D_d \to B_M \subset \mathbb{R}^{N^*}$ such that
– for each $z \in D_d$, $F^*(\cdot, z)$ is an $r$-contraction,
– for each $j = 1,\dots,N^*$, $F^*_j$ is sufficiently smooth (see Definition 2).
We denote $C_{H^*} = \sum_{j=1}^{N^*} C_{F^*_j}$ (with $C_{F^*_j}$ as in (80)).

6.2 Approximation results for Echo State Networks with Output Feedback

We now derive bounds on the error arising when echo state networks with output feedback (see (77)) are employed to approximate functionals induced by sufficiently regular reservoir systems, that is, functionals satisfying Assumption 2. When all parameters are trainable, the networks (77) are also called Jordan networks. Here we consider an echo state network (77) with $A$, $C$, $\zeta$ generated randomly from a generic distribution. The following theorem shows that such echo state networks with ReLU activation function and randomly generated parameters exhibit rather strong universal approximation properties: the same family of systems can be used to approximate any functional satisfying a mild smoothness condition (expressed in terms of the Fourier transform as in [3, 21]) and the approximation error is of order $O(1/\sqrt{N})$. In particular, only $W$ needs to be tuned.

Theorem 3 Let $N \in \mathbb{N}_+$ and denote $\bar N = N N^*$. Suppose $\sigma\colon \mathbb{R} \to \mathbb{R}$ is given as $\sigma(x) = \max(x,0)$, the rows of $[A, C]$ are i.i.d.
random va riables distributed uniformly on B 1 ⊂ R N ∗ + d and the entries of ζ ζ ζ ar e i.i.d. random variables distributed un iformly on [ − M − 1 , M + 1 ] . Assume tha t D d ⊂ B M + 1 . Then for any functional H ∗ satisfying Assumption 2 there exists a r eadou t W (a M m , ¯ N -valued random variable) such that for an y δ ∈ ( 0 , 1 ) , with pr ob ability max ( 1 − δ − 4 C ∗ ( M + 1 ) √ N , 0 ) the system (77) initial- ized at t = − T − 1 fr om any Ξ ∈ R m with k Ξ k ≤ M satisfies the echo state pr operty and the associated functiona l H A , C , ζ ζ ζ W : ( D d ) Z − → R m satisfies | R ( H A , C , ζ ζ ζ W ) − R ( H ∗ ) | ≤ L L δ  2 ( M + 1 ) C ∗ ( 1 − r ) √ N + 2 ( M + 1 ) r T + 1  , (81) wher e C ∗ = 16 q 3 (( M + 1 ) 3 + M + 3 )( M + 1 ) C H ∗ . (82) Theorem 3 follows f rom com bining the representation in Proposition 2 with our gen- eral reser voir appro ximation result, Theor em 4 below . Note that in Theo rem 4 b elow also the bound edness assumption D d ⊂ B M + 1 is not required . Pr oof (Pr oof of Theor em 3) Firstly , for any j = 1 , . . . , N ∗ the function F ∗ j satis- fies the hy p otheses of Prop osition 2. Theref ore, there exists an integrable functio n π ∗ j : R N ∗ + d + 1 → R such that for x ∈ B M + 1 , z ∈ D d , th e f unction F ∗ j can be repre- sented as F ∗ j ( x , z ) = Z R N ∗ + d + 1 σ (( x , z , 1 ) · ω ω ω ) π ∗ j ( ω ω ω ) d ω ω ω , Approximati on Bounds for Random Neural Network s and Reserv oir Systems 41 and π ∗ j ( ω ω ω ) = 0 for all ω ω ω = ( w , u ) ∈ R N ∗ + d × R satisfyin g k w k > 1 or | u | > M + 1, and Z R N ∗ + d + 1 k ω ω ω k 2 π ∗ j ( ω ω ω ) 2 d ω ω ω ≤ 8 (( M + 1 ) 3 + M + 3 ) C F ∗ j . (83) Recall that the entries of ζ ζ ζ are un if ormly distributed on [ − ( M + 1 ) , M + 1 ] . Setting π k j ( d ω ω ω ) = π ∗ j ( ω ω ω ) d ω ω ω for all k ∈ N + , denoting by π 1 and π 2 the uniform distrib ution on B 1 ⊂ R N ∗ + d and [ − ( M + 1 ) , ( M + 1 )] , respectiv ely , and setting π = π 1 ⊗ π 2 , one has tha t π k j ≪ π and d π k j d π = 2V ol N ∗ + d ( B 1 )( M + 1 ) π ∗ j . Using (83) o ne theref ore obtains 4 √ 3 N ∗ ∑ j = 1   Z R N ∗ + d + 1 k ω ω ω k 2 d π k j d π ( ω ω ω ) ! 2 π ( d ω ω ω )   1 / 2 = 4 p 6 ( M + 1 ) V o l N ∗ + d ( B 1 ) N ∗ ∑ j = 1  Z R N ∗ + d + 1 k ω ω ω k 2 π ∗ j ( ω ω ω ) 2 d ω ω ω  1 / 2 ≤ 16 q 3 (( M + 1 ) 3 + M + 3 )( M + 1 ) C H ∗ and so the co nstant C ∗ in ( 82) is larger o r equ a l than th e constan t C ∗ in Theo rem 4 below . Further more, s k = 0 fo r all k ∈ N + and th us the statement follows from Theo- rem 4 below . ⊓ ⊔ Remark 17 As po inted o ut above, here we fo c us on system s (7 8) in which h ∗ is the identity . Howe ver , Theor em 3 c o uld also be exten ded to more g eneral h ∗ , namely those satisfying that F ∗ j = h j − N ∗ ◦ F ∗ is sufficiently smooth (see Defin itio n 2) f or j = N ∗ + 1 , . . . , N ∗ + m . The matrix A in (7 7) would then be replaced by A P with P =  I I I N ∗ 0 0 0 N ∗ , m  . Finally , we prove a more general echo state network app roximatio n resu lt valid for function als in duced by reservoir systems with reservoir function F ∗ that can be ap- proxim a ted well by functions of the form (84). Theorem 4 Let r ∈ ( 0 , 1 ) , L σ > 0 , N ∈ N + and denote ¯ N = N N ∗ . Supp o se σ : R → R is L σ -Lipschitz continuou s and the r ows of [ A , C , ζ ζ ζ ] are i.i.d. random variables with distribution π . 
Suppose H ∗ : ( D d ) Z − → R m is the r eservoir fu nctional associated to some F ∗ : R N ∗ × D d → B M ⊂ R N ∗ , i.e., for any z ∈ ( D d ) Z − it is giv e n as H ∗ ( z ) = x ∗ 0 , wher e x ∗ satisfies (78) . A ssume tha t fo r each v ∈ D d , F ∗ ( · , v ) is a n r -co n traction. Furthermore , fo r an y k ∈ N + , j = 1 , . . . , N ∗ , let π k j be a sig n ed Bor el-measure on R N ∗ + d + 1 such that π k j ≪ π , R R N ∗ + d + 1 k ω ω ω k| π k j | ( d ω ω ω ) < ∞ and C ∗ = 4 √ 3 L σ sup k ∈ N + N ∗ ∑ j = 1   Z R N ∗ + d + 1 k ω ω ω k 2 d π k j d π ( ω ω ω ) ! 2 π ( d ω ω ω )   1 / 2 < ∞ . Denote for each j = 1 , . . . , N ∗ F ∗ , N j ( x , v ) = Z R N ∗ + d + 1 σ (( x , v , 1 ) · ω ω ω ) π N j ( d ω ω ω ) , x ∈ R N ∗ , v ∈ D d (84) 42 Lukas Gonon et al. and assume s N = E [ max t ∈{ 0 ,..., − T } sup x ∈ B M + 1 k F ∗ , N ( x , Z t ) − F ∗ ( x , Z t ) k ] < 1 . Assume that E  max t ∈{ 0 ,..., − T } k Z t k  < ∞ . Then ther e exists a r eado ut W (a M m , ¯ N -valued random v a riable) such that for an y δ ∈ ( 0 , 1 ) , with pr obability at least max ( 1 − δ − 2 C ∗ ( M + 2 + E [ max t ∈{ 0 ,..., − T } k Z t k ] ) √ N − 2 s N , 0 ) the system (7 7) initialized at t = − T − 1 fr o m an y Ξ ∈ R m with k Ξ k ≤ M satisfies the echo state pr operty and the associated functiona l H A , C , ζ ζ ζ W : ( D d ) Z − → R m satisfies | R ( H A , C , ζ ζ ζ W ) − R ( H ∗ ) | ≤ L L δ  ( M + 2 + max t ∈{ 0 ,..., − T } E [ k Z t k ]) C ∗ ( 1 − r ) √ N + s N 1 − r + 2 ( M + 1 ) r T + 1  . Remark 18 Let us discuss the assumption E  max t ∈{ 0 ,..., − T } k Z t k  < ∞ . Firstly , sup- pose that the input signal satisfies k Z t k ≤ B , P -a. s. for all t ∈ Z − . T h en clearly also E  max t ∈{ 0 ,..., − T } k Z t k  ≤ B and so o ne may initialize the system at any T ≥ log ( √ N ) − log ( r ) − 1 in or der to achieve an appr oximation error bound (81) of order 1 √ N with high pr obability 1 − O ( 1 √ N ) . Howe ver , o ur result also covers mo re gener al situations. For instance, su ppose that d = 1 and for each t ∈ Z − , Z t is standard normally dis- tributed (not necessarily indepen dent). Then on e can show th at E  max t ∈{ 0 ,..., − T } k Z t k  ≤ p 2 log ( 2 T ) , and c onsequen tly , choosing T as in the fir st case, one obtains an error b ound of order 1 √ N with high p robability 1 − O ( √ log ( log ( N )) √ N ) . Pr oof (Pr oof of Th eor em 4 ) Recall th at N ∗ = m , h ∗ ( y ) = y and let us write A , C , ζ ζ ζ as block matrices A =    A ( 1 ) . . . A ( N )    ∈ R N N ∗ × N ∗ , C =    C ( 1 ) . . . C ( N )    ∈ M N N ∗ , d , and ζ ζ ζ =     ζ ζ ζ ( 1 ) . . . ζ ζ ζ ( N )     ∈ R N N ∗ , where A ( i ) , C ( i ) and ζ ζ ζ ( i ) are r andom matrices ( resp. vectors) valued in M N ∗ , N ∗ , M N ∗ , d and R N ∗ , respectively , for each i = 1 , . . . , N . Define the r eadout W = 1 N  W 1 · · · W N  , W i =     V ( i ) 1 . . . V ( i ) N ∗     , where U ( i ) j = ( A ( i ) j , C ( i ) j , ζ ( i ) j ) deno tes the j -th row of ( A ( i ) , C ( i ) , ζ ( i ) ) a nd V ( i ) j is given as V ( i ) j = d π N j d π ( U ( i ) j ) . Approximati on Bounds for Random Neural Network s and Reserv oir Systems 43 By our choice of V ( i ) j one calculates for each i = 1 , . . . 
, N and any y ∈ B M + 1 , z ∈ D d , E [( W i σ ( A ( i ) y + C ( i ) z + ζ ζ ζ ( i ) )) j ] = E [ V ( i ) j σ (( y , z , 1 ) · U ( i ) j )] = Z R N ∗ + d + 1 σ (( y , z , 1 ) · ω ω ω ) d π N j d π ( ω ω ω ) π ( d ω ω ω ) = F ∗ , N j ( y , z ) (85) and E [( V ( 1 ) j ) 2 ( k A ( 1 ) j k 2 + k C ( 1 ) j k 2 + | ζ ( 1 ) j | 2 )] = Z R N ∗ + d + 1 k ω ω ω k 2 d π N j d π ( ω ω ω ) ! 2 π ( d ω ω ω ) . This shows that 4 √ 3 L σ N ∗ ∑ j = 1 E [( V ( 1 ) j ) 2 ( k A ( 1 ) j k 2 + k C ( 1 ) j k 2 + | ζ ( 1 ) j | 2 )] 1 / 2 ≤ C ∗ . (86) Measurability an d echo state pr op erty: Consider Ω E SP = { ω ∈ Ω : ¯ M ( ω ) ≤ M + 1 } , ¯ M = sup x ∈ B M + 1 t ∈{ 0 ,..., − T } k W F A , C , ζ ζ ζ ( x , Z t ) k . (87) By con tinuity the supremum in (87) is finite and can also be taken over a cou nt- able set. T his shows that Ω E SP ∈ F . Further more, consider the system (77) in i- tialized at t = − T − 1 fro m a gi ven Ξ ∈ R m with k Ξ k ≤ M . Clearly , for any z ∈ ( D d ) Z − there is a unique ( y t ) t = 0 ,..., − T satisfying (77) and for any ω ∈ Ω the func- tion H A ( ω ) , C ( ω ) , ζ ζ ζ ( ω ) W ( ω ) : ( D d ) Z − → R N ∗ mapping z ∈ ( D d ) Z − to y 0 ( ω ) is contin uous. On the other hand, fo r any z ∈ ( D d ) Z − the map p ing ω 7→ H A ( ω ) , C ( ω ) , ζ ζ ζ ( ω ) W ( ω ) ( z ) is F - measurable an d thus [1, Lemma 4 .51] implies th at H A , C , ζ ζ ζ W is pr oduct-m easurable, i.e. the f unction ( ω , z ) ∋ Ω × ( D d ) Z − 7→ H A ( ω ) , C ( ω ) , ζ ζ ζ ( ω ) W ( ω ) ( z ) ∈ R m is F ⊗ B (( D d ) Z − ) - measurable. Writing Y for the associated process y with input z = Z , we note that for ω ∈ Ω E SP and t ≥ − T Y t ( ω ) = W ( ω ) F A ( ω ) , C ( ω ) , ζ ζ ζ ( ω ) ( Y t − 1 ( ω ) , Z t ( ω )) and consequ ently , by (87), k Y t ( ω ) k ≤ M + 1 fo r all t ≥ − T − 1. Risk e stima tio n on Ω E SP : Firstly , by (79) one has for any measurable H : ( D d ) Z − → R m | R ( H ) − R ( H ∗ ) | ≤ L L E [ k H ( Z ) − H ∗ ( Z ) k ] = L L Z ( D d ) Z − k H ( z ) − H ∗ ( z ) k µ Z ( d z ) . Thus E [ | R ( H A , C , ζ ζ ζ W ) − R ( H ∗ ) | 1 Ω E SP ] ≤ L L E [ k H A , C , ζ ζ ζ W ( Z ) − H ∗ ( Z ) k 1 Ω E SP ] . (88) 44 Lukas Gonon et al. For each t ≥ − T on e estimates E [ k Y t − X ∗ t k 1 Ω E SP ] = E [ k 1 N N ∑ i = 1 W i σ ( A ( i ) Y t − 1 + C ( i ) Z t + ζ ζ ζ ( i ) ) − F ∗ ( X ∗ t − 1 , Z t ) k 1 Ω E SP ] ≤ E [ sup y ∈ B M + 1 k 1 N N ∑ i = 1 W i σ ( A ( i ) y + C ( i ) Z t + ζ ζ ζ ( i ) ) − F ∗ , N ( y , Z t ) k 1 Ω E SP ] + E [ k F ∗ , N ( Y t − 1 , Z t ) − F ∗ ( Y t − 1 , Z t ) k 1 Ω E SP ] + E [ k F ∗ ( Y t − 1 , Z t ) − F ∗ ( X ∗ t − 1 , Z t ) k 1 Ω E SP ] ≤ E [ sup y ∈ B M + 1 k 1 N N ∑ i = 1 W i σ ( A ( i ) y + C ( i ) Z t + ζ ζ ζ ( i ) ) − F ∗ , N ( y , Z t ) k ] + s N + r E [ k Y t − 1 − X ∗ t − 1 k 1 Ω E SP ] . (89) Denoting by ε 1 , . . . , ε N indepen d ent Rademacher ran dom variables, we th u s ob tain by (85), indepen dence and symmetriza tion that for any z ∈ D d E [ sup y ∈ B M + 1 k 1 N N ∑ i = 1 W i σ ( A ( i ) y + C ( i ) z + ζ ζ ζ ( i ) ) − F ∗ , N ( y , z ) k ] ≤ N ∗ ∑ j = 1 E [ sup y ∈ B M + 1      1 N N ∑ i = 1 V ( i ) j σ (( y , z , 1 ) · U ( i ) j ) − F ∗ , N j ( y , z )      ] ≤ 2 N ∗ ∑ j = 1 E [ sup y ∈ B M + 1      1 N N ∑ i = 1 V ( i ) j ε i σ (( y , z , 1 ) · U ( i ) j )      ] . (90) Furthermo re, for any v i ∈ R , u i = ( a i , c i , ζ i ) ∈ S , i = 1 , . . . 
, N the co ntraction p rinciple [22, Theorem 4.12] (applied to th e contractio ns σ i ( x ) = 1 { v i 6 = 0 } v i σ ( x 1 L σ v i ) ) yields E [ sup y ∈ B M + 1      N ∑ i = 1 v i ε i σ (( y , z , 1 ) · u i )      ] = E [ sup y ∈ B M + 1      N ∑ i = 1 ε i σ i ( L σ v i ( y , z , 1 ) · u i )      ] ≤ 2 L σ E [ sup y ∈ B M + 1      N ∑ i = 1 v i ε i (( y , z , 1 ) · u i )      ] ≤ 2 L σ ( M + 1 ) E [ k N ∑ i = 1 v i ε i a i k ] + E [ | N ∑ i = 1 v i ε i ( c i · z + ζ i ) | ] ! ≤ 2 L σ   ( M + 1 )( N ∑ i = 1 v 2 i k a i k 2 ) 1 / 2 + k z k N ∑ i = 1 v 2 i k c i k 2 ! 1 / 2 + ( N ∑ i = 1 v 2 i | ζ i | 2 ) 1 / 2   . (91) Approximati on Bounds for Random Neural Network s and Reserv oir Systems 45 By conditio n ing, using indep e ndence and combinin g this with (90) one thus obtains E [ sup y ∈ B M + 1 k 1 N N ∑ i = 1 W i σ ( A ( i ) y + C ( i ) z + ζ ζ ζ ( i ) ) − F ∗ , N ( y , z ) k ] ≤ 4 L σ N N ∗ ∑ j = 1 E [( M + 1 )( N ∑ i = 1 ( V ( i ) j ) 2 k A ( i ) j k 2 ) 1 / 2 + k z k N ∑ i = 1 ( V ( i ) j ) 2 k C ( i ) j k 2 ! 1 / 2 + ( N ∑ i = 1 ( V ( i ) j ) 2 | ζ ( i ) j | 2 ) 1 / 2 ] ≤ 4 L σ √ N N ∗ ∑ j = 1 h ( M + 1 ) E [( V ( 1 ) j ) 2 k A ( 1 ) j k 2 ] 1 / 2 + k z k E [( V ( 1 ) j ) 2 k C ( 1 ) j k 2 ] 1 / 2 + E [( V ( 1 ) j ) 2 | ζ ( 1 ) j | 2 ] 1 / 2 i . (92) Inserting (86) thus yields E [ sup y ∈ B M + 1 k 1 N N ∑ i = 1 W i σ ( A ( i ) y + C ( i ) Z t + ζ ζ ζ ( i ) ) − F ∗ , N ( y , Z t ) k ] ≤ ( M + 2 + E [ k Z t k ]) C ∗ √ N . (93) Iterating (89) ( T + 1 ) - tim es and inserting (93) yields E [ k Y 0 − X ∗ 0 k 1 Ω E SP ] ≤ T ∑ k = 0 r k  ( M + 2 + E [ k Z − k k ]) C ∗ √ N + s N  + r T + 1 E [ k Y − T − 1 − X ∗ − T − 1 k 1 Ω E SP ] ≤ ( M + 2 + max t ∈{ 0 ,..., − T } E [ k Z t k ]) C ∗ ( 1 − r ) √ N + s N 1 − r + 2 ( M + 1 ) r T + 1 . Noting that Y 0 = H A , C , ζ ζ ζ W ( Z ) , (93) and (88) hence prove that E [ | R ( H A , C , ζ ζ ζ W ) − R ( H ∗ ) | 1 Ω E SP ] ≤ L L  ( M + 2 + max t ∈{ 0 ,..., − T } E [ k Z t k ]) C ∗ ( 1 − r ) √ N + s N 1 − r + 2 ( M + 1 ) r T + 1  . (94) Estimating P ( Ω \ Ω E SP ) : It thus remains to prove that the prob ability that the ran dom ESN parameter s lie in Ω E SP increases to 1 at r ate 1 / √ N . T o this end , first no te that for any x ∈ R N ∗ , z ∈ D d k W F A , C , ζ ζ ζ ( x , z ) k ≤ k W F A , C , ζ ζ ζ ( x , z ) − F ∗ , N ( x , z ) k + k F ∗ , N ( x , z ) − F ∗ ( x , z ) k + M 46 Lukas Gonon et al. and therefo r e P ( Ω \ Ω E SP ) ≤ P ( ¯ M ≥ M + 1 ) ≤ P    sup x ∈ B M + 1 t ∈{ 0 ,..., − T } k W F A , C , ζ ζ ζ ( x , Z t ) − F ∗ , N ( x , Z t ) k + k F ∗ , N ( x , Z t ) − F ∗ ( x , Z t ) k ≥ 1    ≤ P    sup x ∈ B M + 1 t ∈{ 0 ,..., − T } k W F A , C , ζ ζ ζ ( x , Z t ) − F ∗ , N ( x , Z t ) k ≥ 1 2    + P    sup x ∈ B M + 1 t ∈{ 0 ,..., − T } k F ∗ , N ( x , Z t ) − F ∗ ( x , Z t ) k ≥ 1 2    ≤ 2 E    sup x ∈ B M + 1 t ∈{ 0 ,..., − T } k W F A , C , ζ ζ ζ ( x , Z t ) − F ∗ , N ( x , Z t ) k    + 2 s N . = 2 E    E    sup x ∈ B M + 1 v ∈{ z 0 ,..., z − T } k W F A , C , ζ ζ ζ ( x , v ) − F ∗ , N ( x , v ) k    z = Z    + 2 s N . 
(95) The inn er expectation can now be estimated using precisely th e same argumen ts as in (90), (91), (92) yielding for any z ∈ ( D d ) Z − E    sup x ∈ B M + 1 v ∈{ z 0 ,..., z − T } k W F A , C , ζ ζ ζ ( x , v ) − F ∗ , N ( x , v ) k    ≤ 2 L σ N ∗ ∑ j = 1 E    sup y ∈ B M + 1 v ∈{ z 0 ,..., z − T }      1 N N ∑ i = 1 V ( i ) j ε i σ (( y , v , 1 ) · U ( i ) j )         ≤ 4 L σ N ∗ ∑ j = 1 E    sup y ∈ B M + 1 v ∈{ z 0 ,..., z − T }      1 N N ∑ i = 1 V ( i ) j ε i (( y , v , 1 ) · U ( i ) j )         ≤ 4 L σ √ N N ∗ ∑ j = 1 ( M + 1 ) E [( V ( 1 ) j ) 2 k A ( 1 ) j k 2 ] 1 / 2 +  max t ∈{ 0 ,..., − T } k z t k  E [( V ( 1 ) j ) 2 k C ( 1 ) j k 2 ] 1 / 2 + E [( V ( 1 ) j ) 2 | ζ ( 1 ) j | 2 ] 1 / 2 . Combining this with (95) yields P ( Ω \ Ω E SP ) ≤ 2 ( M + 2 + E  max t ∈{ 0 ,..., − T } k Z t k  ) C ∗ √ N + 2 s N . (96) Approximati on Bounds for Random Neural Network s and Reserv oir Systems 47 Putting together the ingredients: Altogether, setting η = L L δ  ( M + 2 + max t ∈{ 0 ,..., − T } E [ k Z t k ]) C ∗ ( 1 − r ) √ N + s N 1 − r + 2 ( M + 1 ) r T + 1  and combin ing (94) and (96) yields P  | R ( H A , C , ζ ζ ζ W ) − R ( H ∗ ) | > η  ≤ P  | R ( H A , C , ζ ζ ζ W ) − R ( H ∗ ) | 1 Ω E SP > η  + P ( Ω \ Ω E SP ) ≤ δ + 2 ( M + 2 + E  max t ∈{ 0 ,..., − T } k Z t k  ) C ∗ √ N + 2 s N . ⊓ ⊔ Acknowle dgements W e thank Josef T eichmann for fruitful discussions that helped in improvi ng the pa- per . Lukas G and JPO ackno wledge partial financia l support coming from the Researc h Commission of the Uni versit¨ at Sankt Gallen and the Swiss Nat ional Scien ce Fou ndation (gran t number 200021 175801/1). L yudmila G ackno wledges partia l financia l support of the Graduate School of Deci sion Scienc es of the Uni versit¨ at Konsta nz. JPO ackno wledges partial financial support of the French ANR “BIPHOPR OC” project (ANR-14-OHRI-0002-02). The three authors thank the hospital ity and the generosity of the FIM at ET H Zurich where a significa nt portion of the results in this paper were obtained. References 1. Aliprant is, C.D., Border , K.C.: Infinite dimensional analysis: A hitchhik er’ s guide (2006) 2. Barron, A.R.: Neural Net A pproximation. Proceedings of the 7th Y ale W orkshop on Adapti ve and Learning Systems, 69–72 (1992) 3. Barron, A.R.: Uni ve rsal approximation bounds for superposit ions of a sigmoidal functi on. IEEE Tra nsactions on Information Theory 39 (3), 930–945 (1993) 4. Bergh , J., L ¨ ofstr ¨ om, J. : Interpol ation Spaces. Springer (1976) 5. Cuchier o, C., Gonon, L., Grigorye va , L., Ortega, J.P ., T eichma nn, J. : Approximation of dynamics by randomize d signature. In preparat ion (2020) 6. Cuchier o, C., Gonon, L., Grigorye va, L., Ortega, J. P . , T eichman n, J.: Discrete-time signatures and randomness in reservoir computing. Preprint arXi v:2010.14615 (2020) 7. Folla nd, G.B.: Introduction to Partial Dif ferenti al Equations, second edn. Princeton Univ ersity Press (1995) 8. Friz, P .K., V ictoir , N.B.: Multidimensiona l stochastic proce sses as rough paths. Cambridge Unive rsity Press, Cambridge (2010) 9. Gonon, L., Grigoryev a, L., Orteg a, J.P .: Risk Bounds for Reserv oir Computing. Journal of Machine Learning Research, 21 (240), 1–61 (2020) 10. Gonon, L., Ortega, J.P . : Fading memory echo state network s are uni versal . T o appea r in Neural Networ ks (2021) 11. Gonon, L. , Ortega , J .P .: Reservoir computing univ ersality with stochast ic inputs. 
Acknowledgements We thank Josef Teichmann for fruitful discussions that helped in improving the paper. Lukas G and JPO acknowledge partial financial support coming from the Research Commission of the Universität Sankt Gallen and the Swiss National Science Foundation (grant number 200021 175801/1). Lyudmila G acknowledges partial financial support of the Graduate School of Decision Sciences of the Universität Konstanz. JPO acknowledges partial financial support of the French ANR "BIPHOPROC" project (ANR-14-OHRI-0002-02). The three authors thank the hospitality and the generosity of the FIM at ETH Zurich where a significant portion of the results in this paper were obtained.

References

1. Aliprantis, C.D., Border, K.C.: Infinite Dimensional Analysis: A Hitchhiker's Guide (2006)
2. Barron, A.R.: Neural net approximation. Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems, 69–72 (1992)
3. Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39(3), 930–945 (1993)
4. Bergh, J., Löfström, J.: Interpolation Spaces. Springer (1976)
5. Cuchiero, C., Gonon, L., Grigoryeva, L., Ortega, J.P., Teichmann, J.: Approximation of dynamics by randomized signature. In preparation (2020)
6. Cuchiero, C., Gonon, L., Grigoryeva, L., Ortega, J.P., Teichmann, J.: Discrete-time signatures and randomness in reservoir computing. Preprint arXiv:2010.14615 (2020)
7. Folland, G.B.: Introduction to Partial Differential Equations, second edn. Princeton University Press (1995)
8. Friz, P.K., Victoir, N.B.: Multidimensional Stochastic Processes as Rough Paths. Cambridge University Press, Cambridge (2010)
9. Gonon, L., Grigoryeva, L., Ortega, J.P.: Risk bounds for reservoir computing. Journal of Machine Learning Research 21(240), 1–61 (2020)
10. Gonon, L., Ortega, J.P.: Fading memory echo state networks are universal. To appear in Neural Networks (2021)
11. Gonon, L., Ortega, J.P.: Reservoir computing universality with stochastic inputs. IEEE Transactions on Neural Networks and Learning Systems 31(1), 100–112 (2020)
12. Grigoryeva, L., Ortega, J.P.: Echo state networks are universal. Neural Networks 108, 495–508 (2018)
13. Grigoryeva, L., Ortega, J.P.: Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems. Journal of Machine Learning Research 19(24), 1–40 (2018)
14. Grigoryeva, L., Ortega, J.P.: Differentiable reservoir computing. Journal of Machine Learning Research 20(179), 1–62 (2019)
15. Hart, A.G., Hook, J.L., Dawes, J.H.P.: Embedding and approximation theorems for echo state networks. Preprint arXiv:1908.05202 (2019)
16. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 251–257 (1991)
17. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications. Neurocomputing 70(1-3), 489–501 (2006)
18. Jaeger, H., Haas, H.: Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science 304(5667), 78–80 (2004)
19. Kallenberg, O.: Foundations of Modern Probability, second edn. Probability and Its Applications. Springer, New York (2002)
20. Kloeden, P.E., Platen, E.: Numerical Solution of Stochastic Differential Equations. Springer-Verlag, Berlin (1992)
21. Klusowski, J.M., Barron, A.R.: Approximation by combinations of ReLU and squared ReLU ridge functions with ℓ1 and ℓ0 controls. IEEE Transactions on Information Theory 64(12), 7649–7656 (2018)
22. Ledoux, M., Talagrand, M.: Probability in Banach Spaces. Springer, Berlin Heidelberg (2013)
23. Lu, Z., Hunt, B.R., Ott, E.: Attractor reconstruction by machine learning. Chaos 28(6) (2018)
24. Maiorov, V., Meir, R.: On the near optimality of the stochastic approximation of smooth functions by neural networks. Advances in Computational Mathematics 13(1), 79–103 (2000)
25. Matthews, M., Moschytz, G.: The identification of nonlinear discrete-time fading-memory systems using neural network models. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 41(11), 740–751 (1994)
26. Matthews, M.B.: On the Uniform Approximation of Nonlinear Discrete-Time Fading-Memory Systems Using Neural Network Models. Ph.D. thesis, ETH Zürich (1992). DOI 10.3929/ETHZ-A-000625223
27. Matthews, M.B.: Approximating nonlinear fading-memory operators using neural network models. Circuits, Systems, and Signal Processing 12(2), 279–307 (1993)
28. Mhaskar, H.N.: Neural networks for optimal approximation of smooth and analytic functions. Neural Computation 8(1), 164–177 (1996)
29. Pathak, J., Hunt, B., Girvan, M., Lu, Z., Ott, E.: Model-free prediction of large spatiotemporally chaotic systems from data: a reservoir computing approach. Physical Review Letters 120(2), 024102 (2018)
30. Pathak, J., Lu, Z., Hunt, B.R., Girvan, M., Ott, E.: Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data. Chaos 27(12) (2017)
31. Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., Liao, Q.: Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing 14(5), 503–519 (2017)
32. Rahimi, A., Recht, B.: Random features for large-scale kernel machines. Advances in Neural Information Processing Systems (2007)
33. Rahimi, A., Recht, B.: Uniform approximation of functions with random bases. In: 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pp. 555–561 (2008)
34. Rahimi, A., Recht, B.: Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. Advances in Neural Information Processing Systems (2009)
35. Rudin, W.: Real and Complex Analysis, third edn. McGraw-Hill (1987)
