On the Bounds of Function Approximations


Authors: Adrian de Wynter

On the Bounds of Function Approximations

Adrian de Wynter [0000-0003-2679-7241]
Amazon Alexa, 300 Pine St., Seattle, Washington, USA 98101
dwynter@amazon.com

Abstract. Within machine learning, the subfield of Neural Architecture Search (NAS) has recently garnered research attention due to its ability to improve upon human-designed models. However, the computational requirements for finding an exact solution to this problem are often intractable, and the design of the search space still requires manual intervention. In this paper we attempt to establish a formalized framework from which we can better understand the computational bounds of NAS in relation to its search space. For this, we first reformulate the function approximation problem in terms of sequences of functions, and we call it the Function Approximation (FA) problem; then we show that it is computationally infeasible to devise a procedure that solves FA for all functions to zero error, regardless of the search space. We also show that such error will be minimal if a specific class of functions is present in the search space. Subsequently, we show that machine learning as a mathematical problem is a solution strategy for FA, albeit not an effective one, and further describe a stronger version of this approach: the Approximate Architecture Search Problem (a-ASP), which is the mathematical equivalent of NAS. We leverage the framework from this paper and results from the literature to describe the conditions under which a-ASP can potentially solve FA as well as an exhaustive search, but in polynomial time.

Keywords: neural networks · learning theory · neural architecture search

Footnote 1 (citation details): de Wynter, Adrian. On the Bounds of Function Approximations. In: Tetko, I. V. et al. (eds.) ICANN 2019. LNCS, vol. 11727. Springer, Heidelberg, pp. 117. https://doi.org/10.1007/978-3-030-30487-4_32. The final authenticated publication is available online at https://doi.org/10.1007/978-3-030-30487-4_32.

1 Introduction

The typical machine learning task can be abstracted out as the problem of finding the set of parameters of a computable function, such that it approximates an underlying probability distribution over seen and unseen examples [19]. Said function is often hand-designed, and is the subject of the great majority of current machine learning research. It is well established that the choice of function heavily influences its approximation capability [5,55,59], and considerable work has gone into automating the process of finding such a function for a given task [9,10,18]. In the context of neural networks, this task is known as Neural Architecture Search (NAS), and it involves searching for the best performing combination of neural network components and parameters from a set, also known as the search space. Although promising, little work has been done on the analysis of its viability with respect to its computation-theoretical bounds [14]. Since NAS strategies tend to be expensive in terms of their hardware requirements [23,40], research emphasis has been placed on optimizing search algorithms [14,32], even though the search space is still manually designed [14,26,27,60].
Without a better understanding of the mathematical confines governing NAS, it is unlikely that these strategies will efficiently solve new problems, or present reliably high performance, thus leading to complex systems that still rely on manually engineered architectures and search spaces.

Theoretically, learning has been formulated as a function approximation problem where the approximation is done through the optimization of the parameters of a given function [12,19,37,38,52], with strong results in the area of neural networks in particular [12,16,21,42]. On the other hand, NAS is often regarded as a search problem with an optimality criterion [10,14,40,50,59], within a given search space. The choice of such search space is critical, yet strongly heuristic [14]. Since we aim to obtain a better insight into how the process of finding an optimal architecture can be improved in relation to the search space, we hypothesize that NAS can be enunciated as a function approximation problem.

The key observation that motivates our work is that all computable functions can be expressed in terms of combinations of members of certain sets, better known as models of computation. Examples of this are the µ-recursive functions, Turing machines, and, of relevance to this paper, a particular set of neural network architectures [31].

Thus, in this study we reformulate the function approximation problem as the task of, for a given search space, finding the procedure that outputs the computable sequence of functions, along with their parameters, that best approximates any given input function. We refer to this reformulation as the Function Approximation (FA) problem, and regard it as a very general computational problem, akin to building a fully automated machine learning pipeline where the user provides a series of tasks, and the algorithm returns trained models for each input. This approach yields promising results in terms of the conditions under which the FA problem has optimal solutions, and about the ability of both machine learning and NAS to solve the FA problem.

Footnote 2: Throughout this paper, the problem of data selection is not considered, and is simply assumed to be an input to our solution strategies.

1.1 Technical Contributions

The main contribution of this paper is a reformulation of the function approximation problem in terms of sequences of functions, and a framework within the context of the theory of computation to analyze it. Said framework is quite flexible, as it does not rely on a particular model of computation and can be applied to any Turing-equivalent model. We leverage its results, along with well-known results of computer science, to prove that it is not possible to devise a procedure that approximates all functions everywhere to zero error. However, we also show that, if the smallest class of functions along with the operators for the chosen model of computation are present in the search space, it is possible to attain an error that is globally minimal.

Additionally, we tie said framework to the field of machine learning, and analyze in a formal manner three solution strategies for FA: the Machine Learning (ML) problem, the Architecture Search Problem (ASP), and the less strict version of ASP, the Approximate Architecture Search Problem (a-ASP).
We analyze the feasibility of all three approaches in terms of the bounds described for FA, and their ability to solve it. In particular, we demonstrate that ML is an ineffective solution strategy for FA, and point out that ASP is the best approach in terms of generalizability, although it is intractable in terms of time complexity. Finally, by relating the results from this paper with the existing work in the literature, we describe the conditions under which a-ASP is able to solve the FA problem as well as ASP.

1.2 Outline

We begin by reviewing the existing literature in Section 2. In Section 3 we introduce FA, and analyze the general properties of this problem in terms of its search space. Then, in Section 4 we relate the framework to machine learning as a mathematical problem, and show that it is a weak solution strategy for FA, before defining a stronger approach (ASP) and its computationally tractable version (a-ASP). We conclude in Section 5 with a discussion of our work.

2 Related Work

The problem of approximating functions and its relation to neural networks can be found formulated explicitly in [38], and it is also mentioned often when defining machine learning as a task, for example in [2,4,5,19,52]. However, there it is defined as a parameter optimization problem for a predetermined function. This perspective is also covered in our paper, yet it is much closer to the ML approach than to FA. For FA, as defined in this paper, it is central to find the sequence of functions which minimizes the approximation error.

Neural networks as function approximators are well understood, and there is a trove of literature available on the subject. An inexhaustive list of examples are the studies found in [12,16,21,22,25,35,36,38,42,44,50]. It is important to point out that the objective of this paper is not to prove that neural networks are function approximators, but rather to provide a theoretical framework from which to understand NAS in the contexts of machine learning, and computation in general. However, neural networks were shown to be Turing-equivalent in [31,45,46], and thus they are extremely relevant to this study.

NAS as a metaheuristic is also well explored in the literature, and its application to deep learning has been booming lately thanks to the widespread availability of powerful computers, and interest in end-to-end machine learning pipelines. There is, however, a long-standing body of research in this area, and the list of works presented here is by no means complete. Some papers that deal with NAS in an applied fashion are the works found in [1,9,10,29,43,48,49,51], while formal explorations of NAS and metaheuristics can be found in [3,10,44,58,59]. There is also interest in the problem of creating an end-to-end machine learning pipeline, also known as AutoML. Some examples are studies such as the ones in [15,20,23,57]. The FA problem is similar to AutoML, but it does not include the data preprocessing step commonly associated with such systems. Additionally, the formal analysis of NAS tends to treat it as a search, rather than a function approximation, problem.

The complexity theory of learning and neural networks has been explored as well. The reader is referred to the recent survey from [33], and to [2,7,13,17,53].
Leveraging the group-like structure of models of computation is done in [39], and the Blum axioms [6] are a well-known framework for the theory of computation in a model-agnostic setting. It was also shown in [8] that, under certain conditions, it is possible to compose some learning algorithms to obtain more complex procedures. Bounds in terms of the generalization error were proven for convolutional neural networks in [28]. None of the papers mentioned, however, apply directly to FA and NAS in a setting agnostic to models of computation, and the key insights of our work, drawn from the analysis of FA and its solution strategies, are, to the best of our knowledge, not covered in the literature. Finally, the Probably Approximately Correct (PAC) learning framework [52] is a powerful theory for the study of learning problems. It addresses a slightly different problem than FA: the former has the search space abstracted out, while the latter concerns itself with finding a sequence that minimizes the error by searching through combinations of explicitly defined members of the search space.

3 A Formulation of the Function Approximation Problem

In this section we define the FA problem as a mathematical task whose goal is, informally, to find a sequence of functions whose behavior is closest to an input function. We then perform a short analysis of the computational bounds of FA, and show that it is computationally infeasible to design a solution strategy that approximates all functions everywhere to zero error.

3.1 Preliminaries on Notation

Let R be the set of all total computable functions. Across this paper we will refer to the finite set of elementary functions E = {ψ_1, ..., ψ_m} as the smallest class of functions, along with their operators, of some Turing-equivalent model of computation. Let S = {φ_j : dom(φ_j) → img(φ_j)}_{j ∈ J} be a set of functions defined over some sets dom(φ_j), img(φ_j), such that S is indexed by a set J, and that S ⊂ R. Also let f(x) = (φ_{i_1}, φ_{i_2}, ..., φ_{i_k})(x) be a sequence of elements of S applied successively and such that i_1, ..., i_k ∈ I for some I ⊂ J. We will utilize the abbreviated notation f = (φ_i)_{i=1}^{k} to denote such a sequence, and we will use S^{⋆,n} = {(φ_i)_{i=1}^{k} | φ_i ∈ S, k ≤ n} to describe the set of all possible sequences of length n or less drawn from said S, such that f ∈ S^{⋆,n} ⇔ f ∈ R. For consistency purposes, throughout this paper we will be using Zermelo-Fraenkel set theory with the Axiom of Choice (ZFC). Finally, for simplicity of our analysis we will only consider continuous, real-valued functions, and beginning in Section 3.3, only computable functions.

3.2 The FA Problem

Prior to formally defining the FA problem, we must be able to quantify the behavioral similarity of two functions. This is done through the approximation error of a function:

Definition 1 (The approximation error). Let f and g be two functions. Given a nonempty subset σ ⊂ dom(g), the approximation error of a function f to a function g is a procedure which outputs 0 if f is equal to g with respect to some metric d : R × R → R_{≥0} across all of σ, and a positive number otherwise:

    ε_σ(f, g) = (1 / |σ|) Σ_{x ∈ σ} d(f(x), g(x))        (1)

where we assume that, for the case where x ∉ dom(f), d(f(x), g(x)) = g(x).
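To make Definition 1 concrete, the following is a minimal Python sketch of the approximation error, assuming real-valued functions represented as callables, a finite sample σ, and the absolute difference as the metric d; points outside dom(f) are signalled here by exceptions. The names (approximation_error, d, sigma) are illustrative only and not part of the paper.

```python
import math
from typing import Callable, Iterable

def approximation_error(f: Callable[[float], float],
                        g: Callable[[float], float],
                        sigma: Iterable[float],
                        d: Callable[[float, float], float] = lambda a, b: abs(a - b)) -> float:
    """Approximation error of f to g over a finite subset sigma of dom(g) (Eq. 1)."""
    sigma = list(sigma)
    total = 0.0
    for x in sigma:
        try:
            total += d(f(x), g(x))
        except (ValueError, ZeroDivisionError):
            # Convention from Definition 1: when x is not in dom(f), use g(x).
            total += g(x)
    return total / len(sigma)

# Example: how well does the identity approximate the square root on [0, 4]?
if __name__ == "__main__":
    sigma = [i / 10 for i in range(41)]
    print(approximation_error(lambda x: x, math.sqrt, sigma))
```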
Definition 2 (The FA problem). For any input function F, given a function set (the search space) S, an integer n ∈ N_{>0}, and a nonempty set σ ⊂ dom(F), find the sequence of functions f = (φ_i)_{i=1}^{k}, φ_i ∈ S, k ≤ n, such that ε_σ(f, F) is minimal among all members of S^{⋆,n} over σ.

The FA problem, as stated in Definition 2, makes no assumptions regarding the characterization of the search space, and follows closely the definition in terms of optimization of parameters from [37,38]. However, it emphasizes that the approximation of a function should be given by a sequence of functions.

If the input function were to be continuous and multivariate, we know from [24,34] that there exists at least one exact (i.e., zero approximation error) representation in terms of a sequence of single-variable, continuous functions. If such single-variable, continuous functions were to be present in S, one would expect that the FA problem could be solved to zero error for all continuous multivariate inputs, by simply comparing and returning the right representation (with the possible exception of the results from [54]). However, it is infeasible to devise a generalized algorithmic procedure that outputs such a representation:

Theorem 1. There is no computable procedure for FA that approximates all continuous, real-valued functions to zero error, across their entire domain.

Proof. Solution strategies for FA are parametrized by the sequence length n, the subset of the domain σ, and the search space S. Assume S is infinite. The input function F may be either computable or uncomputable. If the input F is uncomputable, by definition it can only be estimated to within its computable range, and hence its approximation error is nonzero. If F is a computable function, we have guaranteed the existence of at least one function within S^{⋆,n} which has zero approximation error: F itself. Nonetheless, determining the existence of such a function is an undecidable problem. To show this, it suffices to note that it reduces to the problem of determining the equivalence of two halting Turing machines by asking whether they accept the same language, which is undecidable. When n or σ are infinite, there is no guarantee that a procedure solving FA will terminate for all inputs. When n, σ, or S are finite, there will always be functions outside of the scope of the procedure that can only be approximated to a nonzero error. Therefore, there cannot be a procedure for FA that approximates all functions, let alone all computable functions, to zero error over their entire domain. ⊓⊔

It is a well-known result of computer science that neural networks [12,16,19,21,22], and PAC learning algorithms [52], are able to approximate a large class of functions to an arbitrary, nonzero error. However, Theorem 1 does not make any assumptions regarding the model of computation used, and thus it works as a more generalized statement of these results. For the rest of this paper we will limit ourselves to the case where n, σ, and S are finite, and the elements of S are computable functions.
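With n, σ, and S finite, the notation of Section 3.1 can be made concrete: a sequence is a tuple of functions applied successively, and S^{⋆,n} can in principle be enumerated directly. The Python sketch below illustrates exactly this, under the assumption that each member of S is an already-instantiated callable; note the |S|^k blow-up of the candidate set, which becomes relevant in Section 4. All names are illustrative.

```python
from itertools import product
from typing import Callable, Iterable, Sequence

def compose(seq: Sequence[Callable[[float], float]]) -> Callable[[float], float]:
    """Apply the members of a sequence f = (phi_1, ..., phi_k) successively."""
    def f(x: float) -> float:
        for phi in seq:
            x = phi(x)
        return x
    return f

def sequences_up_to(search_space: Sequence[Callable[[float], float]],
                    n: int) -> Iterable[Sequence[Callable[[float], float]]]:
    """Enumerate S^{*,n}: every sequence of length at most n drawn from S."""
    for k in range(1, n + 1):
        yield from product(search_space, repeat=k)

# Example: a toy search space and all of its sequences up to length 2.
if __name__ == "__main__":
    S = [lambda x: x + 1, lambda x: 2 * x]
    print(sum(1 for _ in sequences_up_to(S, 2)))  # 2 + 4 = 6 sequences
    f = compose((S[0], S[1]))
    print(f(3))  # (3 + 1) * 2 = 8
```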
3.3 A Brief Analysis of the Search Space

It has been shown that the solutions to FA can only be found in terms of finite sequences built from a finite search space, whose error with respect to the input function is nonzero. It is worth analyzing under which conditions these sequences will present the smallest possible error.

For this, we note that any solution strategy for FA will have to first construct at least one sequence f ∈ S^{⋆,n}, and then compute its error against the input function F. It could be argued that this "bottom-up" approach is not the most efficient, and one could attempt to "factor" a function in a given model of computation that has explicit reduction formulas, such as the lambda calculus. This, unfortunately, is not possible, as the problem of determining the reduction of a function in terms of its elementary functions is well known to be undecidable [11]. However, the idea of "factoring" a function can still be leveraged to show that, if the set of elementary functions E is present in the search space S, any sufficiently clever procedure will be able to attain the smallest possible theoretical error for S, for any given input function F:

Theorem 2. Let S be a search space such that it contains the set of elementary functions, E ⊂ S. Then, for any input function F, there exists at least one sequence f_o ∈ S^{⋆,n} with the smallest approximation error among all possible computable functions of sequence length up to and including n.

Proof. By definition, E can generate all possible computable functions. If E ⊄ S, then |S^{⋆,n}| < |E^{⋆,n}|, and so there exist input functions whose sequence with the smallest approximation error, f_o, is not contained in S^{⋆,n}. Conversely, if E ⊂ S, then E^{⋆,n} ⊆ S^{⋆,n}, and so the sequence with the smallest approximation error among all computable functions of length up to and including n is contained in S^{⋆,n}. ⊓⊔

In practice, constructing a space that contains E, and subsequently performing a search over it, can become a time-consuming task given that the number of possible members of S^{⋆,n} grows exponentially with n. On the other hand, constructing a more "efficient" space that already contains the best possible sequence requires prior knowledge of the structure of a function relating S to F, which is the problem we are trying to solve in the first place.

That being said, Theorem 2 implies that there must be a way to quantify the ability of a search space to generalize to any given function, without the need of explicitly including E. To achieve this, we first look at the ability of every sequence to approximate a function, by defining the information capacity of a sequence:

Definition 3 (The information capacity). Let f = (φ_i)_{i=1}^{n} be a finite sequence, where every φ_i has associated a finite set of possible parameters π_i, and a restriction set ρ_i in its domain: φ_i : dom(φ_i) × π_i → img(φ_i) \ ρ_i, so that the next element in the sequence is a function φ_{i+1} with dom(φ_{i+1}) = img(φ_i) \ ρ_i. Then the information capacity of a sequence f is given by the Cartesian product of the domain, parameters, and range of each φ_i:

    C(f) = dom(φ_1) × ( ∏_{i=1}^{n-1} π_i × (img(φ_i) \ ρ_i) ) × π_n × img(φ_n)        (2)

Note that the information capacity of a function is quite similar to its graph, but it makes an explicit relationship with its parameters. Specifically, in the case where π_i ⊂ Π for every π_i in some f, C(f) = dom(φ_1) × Π × img(φ_n). At a first glance, Definition 3 could be seen as a variant of the VC dimension [7,53], since both quantities attempt to measure the ability of a given function to generalize.
However, the latter is designed to work on a fixed function, and our focus is on the problem of building such a function. A more in-depth discussion of this distinction, along with its application to the framework from this paper, is given in Section 4.1 and in Appendix B.

A search space is comprised of one or more functions, and algorithmically we are more interested in the quantifiable ability of the search space to approximate any input function. Therefore, we define the information potential of a search space as follows:

Definition 4 (The information potential). The information potential of a search space S is given by all the possible values its members can take for a given sequence length n:

    U(S, n) = ⋃_{f ∈ S^{⋆,n}} C(f)        (3)

The definition of the information potential allows us to make the important distinction between comparing two search spaces S_1, S_2 containing the same function f, but defined over different parameters π_1, π_2 ⊂ Π, and comparing S_1 and S_2 with another space, S_3, containing a different function g: the information potentials will be equivalent in the first case, U(S_1, n) = U(S_2, n), but not in the second: U(S_3, n) ≠ U(S_1, n).

For a given space S, as the sequence length n grows to infinity, and if the search space includes the set of elementary functions, E ⊂ S, its information potential encompasses all computable functions:

    lim_{n→∞} U(S, n) = R        (4)

In other words, the information potential of such an S approaches the information capacity of a universal approximator, which, depending on the model of computation chosen, might be a universal Turing machine, or the universal function from [41], to name a few.
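As an illustration of Definitions 3 and 4, the sketch below computes C(f) and U(S, n) for toy functions described by finite domain, image, parameter, and restriction sets. It assumes, per Definition 3, that chaining φ_{i+1} after φ_i is only valid when dom(φ_{i+1}) = img(φ_i) \ ρ_i; the dictionary encoding of a function is our own illustrative device, not something prescribed by the paper.

```python
from itertools import product
from typing import Dict, Sequence

Fn = Dict[str, set]  # a toy function described by its finite 'dom', 'img', 'params', 'rho'

def capacity(seq: Sequence[Fn]) -> set:
    """Information capacity C(f) of a sequence of finite-set functions (Eq. 2)."""
    factors = [seq[0]['dom']]
    for phi in seq[:-1]:
        factors += [phi['params'], phi['img'] - phi['rho']]
    factors += [seq[-1]['params'], seq[-1]['img']]
    return set(product(*factors))

def information_potential(S: Sequence[Fn], n: int) -> set:
    """Information potential U(S, n): the union of C(f) over all f in S^{*,n} (Eq. 3)."""
    U = set()
    for k in range(1, n + 1):
        for seq in product(S, repeat=k):
            # Only chain phi_{i+1} after phi_i when dom(phi_{i+1}) = img(phi_i) \ rho_i.
            if all(b['dom'] == a['img'] - a['rho'] for a, b in zip(seq, seq[1:])):
                U |= capacity(seq)
    return U

# Example: two toy functions over {0, 1} with different parameter sets.
if __name__ == "__main__":
    phi1 = {'dom': {0, 1}, 'img': {0, 1}, 'params': {'a'}, 'rho': set()}
    phi2 = {'dom': {0, 1}, 'img': {0, 1}, 'params': {'b', 'c'}, 'rho': set()}
    print(len(information_potential([phi1, phi2], n=2)))  # 84 distinct tuples
```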
In the next section, we leverage the results shown so far to evaluate three different procedures to solve FA, and show that there exists a best possible solution strategy.

4 The FA Problem in the Context of Machine Learning

In this section we relate the results from analyzing FA to the field of machine learning. First, we show that the machine learning task can be seen as a solution strategy for FA. We then introduce the Architecture Search Problem (ASP) as a theoretical procedure, and note that it is the best possible solution strategy for FA. Finally, we note that ASP is unviable in an applied setting, and define a more relaxed version of this approach: the Approximate Architecture Search Problem (a-ASP), which is the analogue of the NAS task commonly seen in the literature.

4.1 Machine Learning as a Solver for FA

The Machine Learning (ML) problem, informally, is the task of approximating an input function F through repeated sampling and the parameter search of a predetermined function. This definition is a simplified, abstracted-out version of the typical machine learning task. It is, however, not new, and a brief search in the literature ([4,5,19,37]) can attest to the existence of several equivalent formulations. We reproduce it here for notational purposes, and constrain it to computable functions:

Definition 5 (The ML problem). For an unknown, continuous function F defined over some domain dom(F), given a finite subset σ ⊂ dom(F), a function f with parameters from some finite set Π, and a function m : R × R → R_{≥0}, find a π_o ∈ Π such that m(f(x, π_o), F(x)) is minimal for all x ∈ σ.

As stated in Definition 2, any procedure solving FA is required to return the sequence that best approximates any given function. In the ML problem, however, such a sequence f is already given to us. Even so, we can still reformulate ML as a solution strategy for FA. For this, let the search space be a singleton of the form S_ML = {f}; set m to be the metric function d in the approximation error; and leave σ as it is. We then carry out a "search" over this space by simply picking f, and then optimizing the parameters of f with respect to the approximation error ε_σ(f, F). We then return the function along with the parameters π_o that minimize the error.

Given that the search is performed over a single element of the search space, this is not an effective procedure in terms of generalizability. To see this, note that the procedure acts as intended, and "finds" the function within the search space S_ML that minimizes the approximation error ε_σ(f, F) for any given F. However, being able to approximate an input function F in a single-element search space tells us nothing about the ability of ML to approximate other input functions, or even whether such f ∈ S_ML is the best function approximation for F in the first place. In fact, we know by Theorem 2 that for a given sequence length n, for every F there exists an optimal sequence f_o in E^{⋆,n}, which may not be present in S_ML.

Since we are constrained to a singleton search space, one could be tempted to build a search space with one single function that maximizes the information potential, such as the one described in Equation 4, say, by choosing f to be a universal Turing machine. There is one problem with this approach: it would mean that we need to take in as an input the encoding of the input function F, along with the subset of the domain σ. If we were able to take the encoding of F as part of the input, we would already know the function and this would not be a function approximation problem in the first place. Additionally, we would only be able to evaluate the set of computable functions which take in as an argument their own encoding, as it, by definition, needs to be present in σ.

In terms of the framework from this paper we can see that, no matter how we optimize the parameters of f to fit new input functions, the information potential U(S_ML, n) remains unchanged, and the error will remain bounded. This leads us to conclude that measuring a function's ability to learn through its number of parameters [19,47,53] is a good approach for a fixed f and single input F, but incomplete in terms of describing its ability to generalize to other problems. This is of critical importance because, in an applied setting, even though nobody would attempt to use the same architecture for all possible learning problems, the choice of f remains a crucial, and mostly heuristic, step in the machine learning pipeline.
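The reformulation of ML as a solution strategy for FA can be sketched as follows: the search space is the singleton {f}, and the "search" reduces to optimizing the parameters of f against the approximation error. In this sketch a grid search over a finite parameter set Π stands in for whatever optimizer would be used in practice; the function names and the example are illustrative assumptions, not part of the paper.

```python
from typing import Callable, Sequence, Tuple

def ml_solver(f: Callable[[float, object], float],
              Pi: Sequence[object],
              F: Callable[[float], float],
              sigma: Sequence[float],
              d: Callable[[float, float], float] = lambda a, b: abs(a - b)
              ) -> Tuple[object, float]:
    """ML as a solver for FA: the search space is the singleton {f};
    only the parameters pi in the finite set Pi are searched (Definition 5)."""
    def error(pi):
        return sum(d(f(x, pi), F(x)) for x in sigma) / len(sigma)
    best_pi = min(Pi, key=error)
    return best_pi, error(best_pi)

# Example: fit the slope of f(x, a) = a * x to F(x) = 3x over a small sample.
if __name__ == "__main__":
    f = lambda x, a: a * x
    F = lambda x: 3.0 * x
    Pi = [i / 2 for i in range(13)]          # {0.0, 0.5, ..., 6.0}
    sigma = [0.0, 1.0, 2.0, 3.0]
    print(ml_solver(f, Pi, F, sigma))        # (3.0, 0.0)
```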
The statements regarding the information potential of the search space are in accordance with the results in [55], where it was shown that (in the terminology of this paper) two predetermined sequences f and f′, when averaging their approximation error across all possible input functions, will have equivalent performance.

We have seen that ML is unable to generalize well to any other possible input function, and is unable to determine whether the given sequence f is the best for the given input. This leads us to conclude that, although ML is a computationally tractable solution strategy for FA, it is a weak approach in terms of generalizability.

4.2 The Architecture Search Problem (ASP)

We have shown that ML is a solution strategy for FA, although the nature of its search space makes it ineffective in a generalized setting. It is only natural to assume that a stronger formulation of a procedure to solve FA would involve a more complex search space. Similar to Definition 5, we are given the task of approximating an unknown function F through repeated sampling. Unlike ML, however, we are now able to select the sequence of functions (i.e., the architecture) that best fits a given input function F:

Definition 6 (The Architecture Search Problem (ASP)). For an unknown, continuous function F defined over some domain dom(F), given a finite subset σ ⊂ dom(F), a sequence length n, a search space S_ASP, and a function m : R × R → R_{≥0}, find the sequence f = (φ_i)_{i=1}^{k}, φ_i ∈ S_ASP, k ≤ n, such that m(f(x), F(x)) is minimal for all x ∈ σ, and all f ∈ S_ASP^{⋆,n}.

Note that we have left the parameter optimization problem implicit in this formulation, since, as pointed out in Section 4.1, a single-function search space {f} would be ineffective for dealing with multiple input functions F, no matter how well the optimizer performed for a given subset of these inputs.

At a first glance, ASP looks similar to the PAC learning framework [52]. However, FA is the task of finding the right sequence of computable functions for all possible functions, while PAC is a generalized, tractable formulation of learning problems, with the search space abstracted out. A more precise analysis of the relationship between FA and PAC is given in Appendix A.

As a solution strategy for FA, ASP is also subject to the results from Section 3. The key difference between ML and ASP is that ASP has access to a richer search space, which allows it to have a better approximation capability. In particular, ASP can be seen as a generalized version of the former, since for any n-sized sequence present in S_ML, one could construct a space with bigger information potential in ASP, but with the same constraints on sequence length. For example, we could use E as our search space, choose a sequence length n, and so U(S_ML, n) ⊂ U(E, n). Since ASP has no explicit constraints on time and space, this procedure is essentially performing an exhaustive search. Theorem 2 implies that, for fixed n and any input F, ASP will always return the best possible sequence within that space, as long as the search space contains the set of elementary functions, E ⊂ S.
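An exhaustive-search reading of Definition 6 is sketched below: every sequence in S^{⋆,n} is scored against the sample σ and the minimizer is returned. Parameters are omitted for brevity (each φ is taken as already instantiated), consistent with Definition 6 leaving parameter optimization implicit; the exponential growth of the candidate set is exactly what makes ASP intractable. All names are illustrative.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Fn = Callable[[float], float]

def asp(search_space: Sequence[Fn],
        n: int,
        F: Fn,
        sigma: Sequence[float],
        d: Callable[[float, float], float] = lambda a, b: abs(a - b)
        ) -> Tuple[Tuple[Fn, ...], float]:
    """ASP as exhaustive search (Definition 6): test every sequence in S^{*,n}
    and return the one with the smallest approximation error on sigma."""
    def error(seq):
        total = 0.0
        for x in sigma:
            y = x
            for phi in seq:          # apply the sequence successively
                y = phi(y)
            total += d(y, F(x))
        return total / len(sigma)

    candidates = (seq for k in range(1, n + 1)
                  for seq in product(search_space, repeat=k))
    best = min(candidates, key=error)
    return best, error(best)

# Example: recover F(x) = 2x + 2 from a tiny search space with n = 2.
if __name__ == "__main__":
    S = [lambda x: x + 1, lambda x: 2 * x]
    best, err = asp(S, n=2, F=lambda x: 2 * x + 2, sigma=[0.0, 1.0, 2.0])
    print(err)   # 0.0, achieved by the sequence (x+1, 2x)
```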
On the other hand, it is a cornerstone of the theory and practice of machine learning that learning algorithms must be tractable, that is, they must run in polynomial time. Given that the search space for ASP grows exponentially with the sequence length, this approach is an interesting theoretical tool, but not very practical. We will still use ASP as a performance target for the evaluation of more applicable procedures. However, it is desirable to formulate a solution strategy for FA that can be used in an applied setting, but can also be analyzed within the framework of this paper.

To achieve this, first we note that any other solution strategy for FA which terminates in polynomial time will have to be able to avoid verifying every possible function in the search space. In other words, such a procedure would require a function that is able to choose a nonempty subset of the search space. We denote such a function as B, such that for a search space S, B(S) ⊂ S^{⋆,n}. We can now define the Approximate Architecture Search Problem (a-ASP) as the formulation of NAS in terms of the FA framework:

Definition 7 (The Approximate ASP (a-ASP)). For an unknown, continuous function F defined over some domain dom(F), given a finite subset σ ⊂ dom(F), a sequence length n, a search space S_ASP, a function m : R × R → R_{≥0}, and a set builder function B(S_ASP) ⊂ S_ASP^{⋆,n}, find the sequence f = (φ_i)_{i=1}^{k}, φ_i ∈ B(S_ASP), k ≤ n, such that m(f(x), F(x)) is minimal for all x ∈ σ and f ∈ B(S_ASP).

Just as the previous two procedures we defined, a-ASP is also a solution strategy for FA. The only difference between Definition 6 and Definition 7 is the inclusion of the set builder function to traverse the space in a more efficient manner. Due to the inclusion of this function, however, a-ASP is weaker than ASP, since it is not guaranteed to find the function f_o that globally minimizes ε_σ(f_o, F) for every given F. Additionally, the fact that this function must be included in the parameters for a-ASP implies that such a procedure requires some design choices.

Given that everything else in the definition of a-ASP is equivalent to ASP, it can be stated that the set builder function is the only deciding factor when attempting to match the performance of ASP with a-ASP. It has been shown [56] that certain set builder functions perform better than others in a generalized setting. This can also be seen from the perspective of the FA framework, where we have available at our disposal the sequences that make up a given function. In particular, if S = {φ_1, ..., φ_m} is a search space, and B is a function that selects elements from S^{⋆,n}, a-ASP not only has access to the performance of all the k sequences chosen so far, {ε_σ(f_i, F), f_i ∈ B(S^{⋆,n})}_{i ∈ {1,...,k}}, but also to the encoding (the configurations from [56]) of their composition. This means that, given enough samples, when testing against a subset of the input, σ′ ⊂ σ, such an algorithm would be able to learn the expected output φ(s) of the functions φ ∈ S, and their behavior if included in the current sequence f_{k+1} = (f_k, φ)(s), for s ∈ σ′. Including such information in a set builder function could allow the procedure to make better decisions at every step, and this approach has been used in applied settings with success [26,30].
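A possible reading of a-ASP with a concrete set builder B is sketched below: instead of scoring all of S^{⋆,n}, a greedy builder proposes only the one-step extensions of the current best prefix, so only on the order of |S|·n sequences are ever evaluated. This particular builder is an illustrative choice of ours, not one prescribed by the paper, and, as discussed above, it is not guaranteed to find the global minimizer.

```python
from typing import Callable, List, Sequence, Tuple

Fn = Callable[[float], float]

def error(seq: Sequence[Fn], F: Fn, sigma: Sequence[float]) -> float:
    """Approximation error of a sequence applied successively (Eq. 1, absolute difference)."""
    total = 0.0
    for x in sigma:
        y = x
        for phi in seq:
            y = phi(y)
        total += abs(y - F(x))
    return total / len(sigma)

def greedy_builder(S: Sequence[Fn], prefix: Sequence[Fn], F: Fn,
                   sigma: Sequence[float]) -> List[Tuple[Fn, ...]]:
    """A set builder B: propose only the one-step extensions of the current best prefix,
    rather than all of S^{*,n}; F and sigma are available as the 'past information' hook."""
    return [tuple(prefix) + (phi,) for phi in S]

def a_asp(S: Sequence[Fn], n: int, F: Fn, sigma: Sequence[float],
          builder=greedy_builder) -> Tuple[Sequence[Fn], float]:
    """a-ASP (Definition 7): search only the subset of S^{*,n} produced by the builder."""
    best: Sequence[Fn] = ()
    best_err = float("inf")
    prefix: Sequence[Fn] = ()
    for _ in range(n):
        candidates = builder(S, prefix, F, sigma)
        prefix = min(candidates, key=lambda seq: error(seq, F, sigma))
        prefix_err = error(prefix, F, sigma)
        if prefix_err < best_err:
            best, best_err = prefix, prefix_err
    return best, best_err

# Example: the greedy builder visits |S| * n sequences instead of |S|^n.
if __name__ == "__main__":
    S = [lambda x: x + 1, lambda x: 2 * x]
    print(a_asp(S, n=3, F=lambda x: 2 * x + 2, sigma=[0.0, 1.0, 2.0])[1])  # 0.0
```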
It can be seen that these design choices are not necessarily problem-dependent, and, from the results of Theorem 2, they can be made in a theoretically motivated manner. Specifically, we note that the information potential of the search space remains unchanged between a-ASP and ASP, and so, by including E, a-ASP could have the ability to perform as well as ASP.

5 Conclusion

The FA problem is a reformulation of the problem of approximating any given function, but with finding a sequence of functions as a central aspect of the task. In this paper, we analyzed its properties in terms of the search space, and its applications to machine learning and NAS. In particular, we showed that it is impossible to write a procedure that solves FA for any given function and domain with zero error, but described the conditions under which such error can be minimal.

We leveraged the results from this paper to analyze three solution strategies for FA: ML, ASP, and a-ASP. Specifically, we showed that ML is a weak solution strategy for FA, as it is unable to generalize or determine whether the sequence used is the best fit for the input function. We also pointed out that ASP, although the best possible algorithm to solve FA, is intractable in an applied setting. We finished by formulating a solution strategy that merges the best of both ML and ASP, a-ASP, and pointed out, through existing work in the literature complemented with the results from this framework, that it has the ability to solve FA as well as ASP in terms of approximation error.

One area that was not discussed in this paper is whether it would be possible to select a priori a good subset σ of the input function's domain. This problem is important since a good representative of the input will greatly influence a procedure's capability to solve FA. It is tied to the data selection process, which was not dealt with in this paper. Further research on this topic is likely to bear great influence on machine learning as a whole.

Acknowledgments. The author is grateful to the anonymous reviewers for their helpful feedback on this paper, and also thanks Y. Goren, Q. Wang, N. Strom, C. Bejjani, Y. Xu, and B. d'Iverno for their comments and suggestions on the early stages of this project.

References

1. Angeline, P.J., Saunders, G.M., Pollack, J.B.: An evolutionary algorithm that constructs recurrent neural networks. Trans. Neur. Netw. 5(1), 54–65 (1994). https://doi.org/10.1109/72.265960
2. Bartlett, P.L., Ben-David, S.: Hardness results for neural network approximation problems. In: Proceedings of the 4th European Conference on Computational Learning Theory, pp. 50–62. EuroCOLT '99, Springer-Verlag, London, UK (1999). https://doi.org/10.1016/S0304-3975(01)00057-3
3. Baxter, J.: A model of inductive bias learning. Journal of Artificial Intelligence Research 12, 149–198 (2000). https://doi.org/10.1613/jair.731
4. Ben-David, S., Hrubes, P., Moran, S., Shpilka, A., Yehudayoff, A.: A learning problem that is independent of the set theory ZFC axioms. CoRR abs/1711.05195 (2017), http://arxiv.org/abs/1711.05195
5. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1), 1–127 (2009). https://doi.org/10.1561/2200000006
6. Blum, M.: A machine-independent theory of the complexity of recursive functions. Journal of the ACM 14(2), 322–336 (1967). https://doi.org/10.1145/321386.321395
7. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.K.: Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery 36, 929–965 (1989). https://doi.org/10.1145/76359.76371
8. Bshouty, N.H.: A new composition theorem for learning algorithms. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 583–589. STOC '98, ACM, New York, NY, USA (1998). https://doi.org/10.1145/258533.258614
9. Carpenter, G.A., Grossberg, S.: A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics and Image Processing 37, 54–115 (1987). https://doi.org/10.1016/S0734-189X(87)80014-2
10. Carvalho, A.R., Ramos, F.M., Chaves, A.A.: Metaheuristics for the feedforward artificial neural network (ANN) architecture optimization problem. Neural Comput. & Applic. (2010). https://doi.org/10.1007/s00521-010-0504-3
11. Church, A.: An unsolvable problem of elementary number theory. American Journal of Mathematics 58, 345–363 (1936)
12. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems 2, 303–314 (1989). https://doi.org/10.1007/BF02551274
13. Cybenko, G.: Complexity theory of neural networks and classification problems. In: Proceedings of the EURASIP Workshop 1990 on Neural Networks, pp. 26–44. Springer-Verlag (1990). https://doi.org/10.1007/3-540-52255-7_25
14. Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: A survey (2019). https://doi.org/10.1007/978-3-030-05318-5_3
15. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, pp. 2962–2970. Curran Associates, Inc. (2015)
16. Funahashi, K.: On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183–192 (1989). https://doi.org/10.1016/0893-6080(89)90003-8
17. Girosi, F., Jones, M., Poggio, T.: Regularization theory and neural networks architectures. Neural Computation 7, 219–269 (1995). https://doi.org/10.1162/neco.1995.7.2.219
18. Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., Sculley, D.: Google Vizier: A service for black-box optimization (2017). https://doi.org/10.1145/3097983.3098043
19. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge, MA (2016), http://www.deeplearningbook.org
20. He, Y., Lin, J., Liu, Z., Wang, H., Li, L.J., Han, S.: AMC: AutoML for model compression and acceleration on mobile devices. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800 (2018)
21. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 251–257 (1991). https://doi.org/10.1016/0893-6080(91)90009-T
22. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2, 359–366 (1989). https://doi.org/10.1016/0893-6080(89)90020-8
23. Jin, H., Song, Q., Hu, X.: Auto-Keras: Efficient neural architecture search with network morphism (2018)
24. Kolmogorov, A.N.: On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR 114, 953–956 (1957)
25. Leshno, M., Lin, V.Y., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6, 861–867 (1993). https://doi.org/10.1016/S0893-6080(05)80131-5
26. Liu, H., Simonyan, K., Yang, Y.: Hierarchical representations for efficient architecture search. International Conference on Learning Representations (2018)
27. Liu, H., Simonyan, K., Yang, Y.: DARTS: Differentiable architecture search. International Conference on Learning Representations (2019)
28. Long, P.M., Sedghi, H.: Size-free generalization bounds for convolutional neural networks. CoRR abs/1905.12600 (2019), https://arxiv.org/pdf/1905.12600v1.pdf
29. Luo, R., Tian, F., Qin, T., Liu, T.Y.: Neural architecture optimization. In: NeurIPS (2018)
30. Miller, G.F., Todd, P.M., Hegde, S.U.: Designing neural networks using genetic algorithms. Proc. 3rd Intl. Conf. Genetic Algorithms and Their Applications, pp. 379–384 (1989)
31. Neto, J.P., Siegelmann, H.T., Costa, J.F., Araujo, C.P.S.: Turing universality of neural nets (revisited). In: Pichler, F., Moreno-Díaz, R. (eds.) Computer Aided Systems Theory — EUROCAST'97, pp. 361–366. Springer Berlin Heidelberg, Berlin, Heidelberg (1997). https://doi.org/10.1007/BFb0025058
32. Ojha, V.K., Abraham, A., Snášel, V.: Metaheuristic design of feedforward neural networks: A review of two decades of research. Eng. Appl. Artif. Intell. 60(C), 97–116 (2017). https://doi.org/10.1016/j.engappai.2017.01.013
33. Orponen, P.: Computational complexity of neural networks: A survey. Nordic J. of Computing 1(1), 94–110 (1994)
34. Ostrand, P.A.: Dimension of metric spaces and Hilbert's problem 13. Bulletin of the American Mathematical Society 71, 619–622 (1965). https://doi.org/10.1090/S0002-9904-1965-11363-5
35. Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural Computation 3, 246–257 (1991). https://doi.org/10.1162/neco.1991.3.2.246
36. Pham, H., Guan, M., Zoph, B., Le, Q., Dean, J.: Efficient neural architecture search via parameters sharing. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 4095–4104. PMLR (10–15 Jul 2018)
37. Poggio, T., Girosi, F.: A theory of networks for approximation and learning. A.I. Memo No. 1140 (1989)
38. Poggio, T., Girosi, F.: Networks for approximation and learning. Proceedings of the IEEE 78(9) (1990). https://doi.org/10.1109/5.58326
39. Rabin, M.O.: Computable algebra, general theory and theory of computable fields. Trans. Amer. Math. Soc. 95, 341–360 (1960). https://doi.org/10.1090/S0002-9947-1960-0113807-4
40. Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Le, Q.V., Kurakin, A.: Large-scale evolution of image classifiers. In: Proceedings of the 34th International Conference on Machine Learning (2017)
41. Rogers, Jr., H.: The Theory of Recursive Functions and Effective Computability. MIT Press, Cambridge, MA (1987)
42. Schäfer, A.M., Zimmermann, H.G.: Recurrent neural networks are universal approximators. In: Proceedings of the 16th International Conference on Artificial Neural Networks - Volume Part I. ICANN '06, vol. 27, pp. 632–640. Springer-Verlag, Berlin, Heidelberg (2006). https://doi.org/10.1007/11840817_66
43. Schaffer, J.D., Caruana, R.A., Eshelman, L.J.: Using genetic search to exploit the emergent behavior of neural networks. Physica D 42, 244–248 (1990). https://doi.org/10.1016/0167-2789(90)90078-4
44. Siegel, J.W., Xu, J.: On the approximation properties of neural networks. arXiv e-prints arXiv:1904.02311 (2019)
45. Siegelmann, H.T., Sontag, E.D.: Turing computability with neural nets. vol. 4, pp. 77–80 (1991). https://doi.org/10.1016/0893-9659(91)90080-F
46. Siegelmann, H.T., Sontag, E.D.: On the computational power of neural nets. J. Comput. Syst. Sci. 50, 132–150 (1995). https://doi.org/10.1006/jcss.1995.1013
47. Sontag, E.D.: VC dimension of neural networks. Neural Networks and Machine Learning, pp. 69–95 (1998)
48. Stanley, K.O., Clune, J., Lehman, J., Miikkulainen, R.: Designing neural networks through evolutionary algorithms. Nature Machine Intelligence 1, 24–35 (2019)
49. Stanley, K.O., Miikkulainen, R.: Evolving neural networks through augmenting topologies. Evol. Comput. 10(2), 99–127 (Jun 2002). https://doi.org/10.1162/106365602320169811
50. Sun, Y., Yen, G.G., Yi, Z.: Evolving unsupervised deep neural networks for learning meaningful representations. IEEE Transactions on Evolutionary Computation 23, 89–103 (2019). https://doi.org/10.1109/TEVC.2018.2808689
51. Tenorio, M.F., Lee, W.T.: Self organizing neural networks for the identification problem. In: Touretzky, D.S. (ed.) Advances in Neural Information Processing Systems 1, pp. 57–64. Morgan-Kaufmann (1989)
52. Valiant, L.G.: A theory of the learnable. Commun. ACM 27, 1134–1142 (1984). https://doi.org/10.1145/1968.1972
53. Vapnik, V., Chervonenkis, A.Y.: On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications 16, 264–280 (1971). https://doi.org/10.1007/978-3-319-21852-6_3, translated by B. Seckler
54. Vitushkin, A.: Some properties of linear superpositions of smooth functions. Dokl. Akad. Nauk SSSR 156, 1258–1261 (1964)
55. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1(1), 67–87 (1997). https://doi.org/10.1109/4235.585893
56. Wolpert, D.H., Macready, W.G.: Coevolutionary free lunches. IEEE Transactions on Evolutionary Computation 9, 721–735 (2005). https://doi.org/10.1109/TEVC.2005.856205
57. Wong, C., Houlsby, N., Lu, Y., Gesmundo, A.: Transfer learning with neural AutoML. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 8366–8375. NIPS'18 (2018)
58. Yang, X.S.: Metaheuristic optimization: Algorithm analysis and open problems. Proceedings of the 10th International Symposium on Experimental Algorithms 6630, 21–32 (2011). https://doi.org/10.1007/978-3-642-20662-7_2
59. Yao, X.: Evolving artificial neural networks. Proceedings of the IEEE 87(9) (1999). https://doi.org/10.1109/5.784219
60. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. CoRR abs/1611.01578 (2016)

Appendices

A PAC Is a Solver for FA

PAC learning, as defined by Valiant [52], is a slightly different problem than FA, as it concerns itself with whether a concept class C can be described with high probability by a member of a hypothesis class H. It also establishes bounds in terms of the number of samples from members c ∈ C that are needed to learn C. On the other hand, FA and its solution strategies concern themselves with finding a solution that minimizes the error, by searching through sequences of explicitly defined members drawn from a search space.

Regardless of these differences, PAC learning as a procedure can still be formulated as a solution strategy for FA. To do this, let H be our search space. Then note that the PAC error function e_pac(h, c) = Pr_{x∼P}[h(x) ≠ c(x)], c ∈ C, h ∈ H, is equivalent to computing ε_σ(h, c) for some subset σ ⊂ dom(c), and choosing the frequentist difference between the images of the functions as the metric d. Our objective would be to return the h ∈ H that minimizes the approximation error for a given subset σ ⊂ dom(c). Note that we do not search through the expanded search space H^{⋆,n}.

Finding the right distribution for a specific class may be NP-hard [7], and so e_pac requires us to make certain assumptions about the distribution of the input values. Additionally, any optimizer for PAC is required to run in polynomial time. Due to all of this, PAC is a weaker approach to solving FA when compared to ASP, but stronger than ML, since this solution strategy is fixed to the design of the search space, and not to the choice of a single function. Nonetheless, it must be stressed that the bounds and paradigms provided by PAC and FA are not mutually exclusive, either: the most prominent example being that PAC learning provides conditions under which the choice of subset σ is optimal.

With the polynomial constraint for PAC learning lifted, and letting the sample and search space sizes grow infinitely, PAC is effectively equivalent to ASP. However, that defies the purpose of the PAC framework, as its success relies on being a tractable learning theory.
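A small sketch of the correspondence described above, under the assumptions of a finite hypothesis class of threshold functions and a uniform sampling distribution P: the empirical PAC error is exactly ε_σ with the 0/1 metric, and PAC read as an FA solver searches H, not the expanded space H^{⋆,n}. All names are illustrative.

```python
import random
from typing import Callable, Sequence

def pac_error(h: Callable[[float], int], c: Callable[[float], int],
              sigma: Sequence[float]) -> float:
    """Empirical PAC error over a sample sigma drawn from P: the fraction of
    disagreements, i.e. epsilon_sigma(h, c) with d(a, b) = [a != b]."""
    return sum(1 for x in sigma if h(x) != c(x)) / len(sigma)

def pac_as_fa_solver(H: Sequence[Callable[[float], int]],
                     c: Callable[[float], int],
                     sigma: Sequence[float]) -> Callable[[float], int]:
    """PAC learning read as an FA solution strategy: search the hypothesis class H
    (not the expanded space H^{*,n}) for the member minimizing the error on sigma."""
    return min(H, key=lambda h: pac_error(h, c, sigma))

# Example: threshold hypotheses against the concept c(x) = [x >= 0.6].
if __name__ == "__main__":
    random.seed(0)
    c = lambda x: int(x >= 0.6)
    H = [(lambda t: (lambda x: int(x >= t)))(t / 10) for t in range(11)]
    sigma = [random.random() for _ in range(200)]
    h = pac_as_fa_solver(H, c, sigma)
    print(pac_error(h, c, sigma))   # close to 0 for the threshold nearest 0.6
```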
B The VC Dimension and the Information Potential

There is a natural correspondence between the VC dimension [7,53] of a hypothesis space and the information capacity of a sequence. To see this, note that the VC dimension is usually defined in terms of the set of concepts (i.e., the input function F) that can be shattered by a predetermined function f with img(f) = {0, 1}. It is frequently used to quantify the ability of a procedure to learn the input function F. In the FA framework we are more interested in whether the search space (also a set) of a given solution strategy is able to generalize well to multiple, unseen input functions. Therefore, for fixed F and f, the VC dimension and its variants provide a powerful insight into the ability of an algorithm to learn. When f is not fixed, it is still possible to utilize this quantity to measure the capacity of a search space S, by simply taking the union over all possible f ∈ S^{⋆,n} for a given n. However, when the input functions are not fixed either, we are unable to use the definition of the VC dimension in this context, as the set of input concepts is unknown to us. We thus need a more flexible way to model generalizability, and that is where we leverage the information potential U(S, n) of a search space.
