JMLR: Workshop and Conference Proceedings vol 49:1-23, 2016

Benefits of depth in neural networks

Matus Telgarsky (MTELGARS@CS.UCSD.EDU)
University of Michigan, Ann Arbor

Abstract
For any positive integer $k$, there exist neural networks with $\Theta(k^3)$ layers, $\Theta(1)$ nodes per layer, and $\Theta(1)$ distinct parameters which can not be approximated by networks with $O(k)$ layers unless they are exponentially large: they must possess $\Omega(2^k)$ nodes. This result is proved here for a class of nodes termed semi-algebraic gates, which includes the common choices of ReLU, maximum, indicator, and piecewise polynomial functions, therefore establishing benefits of depth against not just standard networks with ReLU gates, but also convolutional networks with ReLU and maximization gates, sum-product networks, and boosted decision trees (in this last case with a stronger separation: $\Omega(2^{k^3})$ total tree nodes are required).
Keywords: Neural networks, representation, approximation, depth hierarchy.

1. Setting and main results

A neural network is a model of real-valued computation defined by a connected directed graph as follows. Nodes await real numbers on their incoming edges, thereafter computing a function of these reals and transmitting it along their outgoing edges. Root nodes apply their computation to a vector provided as input to the network, whereas internal nodes apply their computation to the output of other nodes. Different nodes may compute different functions, two common choices being the maximization gate $v \mapsto \max_i v_i$ (where $v$ is the vector of values on incoming edges) and the standard ReLU gate $v \mapsto \sigma_R(\langle a, v\rangle + b)$, where $\sigma_R(z) := \max\{0, z\}$ is called the ReLU (rectified linear unit), and the parameters $a$ and $b$ may vary from node to node. Graphs in the present work are acyclic, and there is exactly one node with no outgoing edges, whose computation is the output of the network.
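The computational model just described can be sketched in a few lines of Python; the weights below are purely illustrative (not from the paper), with root nodes reading the input vector, each internal node applying a standard ReLU gate to its parents, and the single sink node producing the output.

```python
# A tiny feedforward network matching the model above. All weights are
# illustrative; the network happens to compute |x1 - x2| in two layers.

def relu_gate(a, b, v):
    # standard ReLU gate: v -> sigma_R(<a, v> + b)
    return max(0.0, sum(ai * vi for ai, vi in zip(a, v)) + b)

def network(x):
    # layer 1: two ReLU gates reading the input x in R^2
    h1 = relu_gate([1.0, -1.0], 0.0, x)
    h2 = relu_gate([-1.0, 1.0], 0.0, x)
    # output node: one ReLU gate reading the previous layer
    return relu_gate([1.0, 1.0], 0.0, [h1, h2])

print(network([3.0, 1.0]))  # -> 2.0
```

Note that the two layer-1 gates share (negated) weights, a small instance of the parameter reuse that keeps the parameter counts in the results below so small.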
Neural networks distinguish themselves from many other function classes used in machine learning by possessing multiple layers, meaning the output is the result of composing together an arbitrary number of (potentially complicated) nonlinear operations; by contrast, the functions computed by boosted decision stumps and SVMs can be written as neural networks with a constant number of layers. The purpose of the present work is to show that standard types of networks always gain in representation power with the addition of layers. Concretely: it is shown that for every positive integer $k$, there exist neural networks with $\Theta(k^3)$ layers, $\Theta(1)$ nodes per layer, and $\Theta(1)$ distinct parameters which can not be approximated by networks with $O(k)$ layers and $o(2^k)$ nodes.

1.1. Main result

Before stating the main result, a few choices and pieces of notation deserve explanation. First, the target many-layered function uses standard ReLU gates; this is by no means necessary, and a more general statement can be found in Theorem 3.12. Secondly, the notion of approximation is the $L_1$ distance: given two functions $f$ and $g$, their pointwise disagreement $|f(x) - g(x)|$ is averaged over the cube $[0,1]^d$. Here as well, the same proofs allow flexibility (cf. Theorem 3.12). Lastly, the shallower networks used for approximation use semi-algebraic gates, which generalize the earlier maximization and standard ReLU gates, and allow for analysis of not just standard networks with ReLU gates, but convolutional networks with ReLU and maximization gates (Krizhevsky et al., 2012), sum-product networks (where nodes compute polynomials) (Poon and Domingos, 2011), and boosted decision trees; the full definition of semi-algebraic gates appears in Section 2.

Theorem 1.1 Let any integer $k \geq 1$ and any dimension $d \geq 1$ be given.
There exists $f : \mathbb{R}^d \to \mathbb{R}$ computed by a neural network with standard ReLU gates in $2k^3 + 8$ layers, $3k^3 + 12$ total nodes, and $4 + d$ distinct parameters so that
$$\inf_{g \in \mathcal{C}} \int_{[0,1]^d} |f(x) - g(x)|\,dx \geq \frac{1}{64},$$
where $\mathcal{C}$ is the union of the following two sets of functions.

• Functions computed by networks of $(t,\alpha,\beta)$-semi-algebraic gates in $\leq k$ layers and $\leq 2^k/(t\alpha\beta)$ nodes. (E.g., as with standard ReLU networks or with convolutional neural networks with standard ReLU and maximization gates; cf. Section 2.)

• Functions computed by linear combinations of $\leq t$ decision trees each with $\leq 2^{k^3}/t$ nodes. (E.g., the function class used by boosted decision trees; cf. Section 2.)

Analogs to Theorem 1.1 for boolean circuits (which have boolean inputs routed through and, or, and not gates) have been studied extensively by the circuit complexity community, where they are called depth hierarchy theorems. The seminal result, due to Håstad (1986), establishes the inapproximability of the parity function by shallow circuits (unless their size is exponential). Standard neural networks appear to have received less study; closest to the present work is an investigation by Eldan and Shamir (2015) analyzing the case $k = 2$ when the dimension $d$ is large, showing an exponential separation between 2- and 3-layer networks, a regime not handled by Theorem 1.1. Further bibliographic notes and open problems may be found in Section 5.

The proof of Theorem 1.1 (and of the more general Theorem 3.12) occupies Section 3. The key idea is that just a few function compositions (layers) suffice to construct a highly oscillatory function, whereas function addition (adding nodes but keeping depth fixed) gives a function with few oscillations. Thereafter, an elementary counting argument suffices to show that low-oscillation functions can not approximate high-oscillation functions.

1.2.
Companion results

Theorem 1.1 only provides the existence of one network (for each $k$) which can not be approximated by a network with many fewer layers. It is natural to wonder if there are many such special functions. The following bound indicates their population is in fact quite modest. Specifically, the construction behind Theorem 1.1, as elaborated in Theorem 3.12, can be seen as exhibiting $O(2^{k^3})$ points, and a fixed labeling of these points, upon which a shallow network hardly improves upon random guessing. The forthcoming Theorem 1.2 similarly shows that even on the simpler task of fitting $O(k^9)$ points, the earlier class of networks is useless on most random labelings.

In order to state the result, a few more definitions are in order. Firstly, for this result, the notion of neural network is more restrictive. Let a neural net graph $G$ denote not only the graph structure (nodes and edges), but also an assignment of gate functions to nodes, of edges to the inputs of gates, and an assignment of free parameters $w \in \mathbb{R}^p$ to the parameters of the gates. Let $\mathcal{N}(G)$ denote the class of functions obtained by varying the free parameters; this definition is fairly standard, and is discussed in more detail in Section 2. As a final piece of notation, given a function $f : \mathbb{R}^d \to \mathbb{R}$, let $\tilde{f} : \mathbb{R}^d \to \{0,1\}$ denote the corresponding classifier $\tilde{f}(x) := \mathbb{1}[f(x) \geq 1/2]$.

Theorem 1.2 Let any neural net graph $G$ be given with $\leq p$ parameters in $\leq l$ layers and $\leq m$ total $(t,\alpha,\beta)$-semi-algebraic nodes. Then for any $\delta > 0$ and any
$$n \geq 8pl^2 \ln(8emt\alpha\beta\, p(l+1)) + 4\ln(1/\delta)$$
points $(x_i)_{i=1}^n$, with probability $\geq 1 - \delta$ over uniform random labels $(y_i)_{i=1}^n$,
$$\inf_{f \in \mathcal{N}(G)} \frac{1}{n} \sum_{i=1}^n \mathbb{1}[\tilde{f}(x_i) \neq y_i] \geq \frac{1}{4}.$$
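The flavor of this random-label phenomenon can be reproduced in a small simulation with an illustrative and far simpler function class than the networks of Theorem 1.2: one-dimensional threshold classifiers $x \mapsto \mathbb{1}[x \geq \theta]$ together with their complements, a class whose growth function is at most $2(n+1)$. On uniformly random labels, even the best classifier in the class misclassifies well over a quarter of the points. All details of this sketch (the class, $n$, the seed) are assumptions for illustration.

```python
import random

# Best-in-class error of 1-d threshold classifiers (and their complements)
# on uniformly random labels: since the class realizes at most 2(n + 1)
# labelings, no member can fit random labels much better than guessing.

random.seed(0)
n = 400
xs = sorted(random.random() for _ in range(n))
ys = [random.randint(0, 1) for _ in range(n)]

best_err = 1.0
for cut in range(n + 1):
    # classifier predicting 1 on the top n - cut points; also try its complement
    err = sum(1 for i, y in enumerate(ys) if (1 if i >= cut else 0) != y) / n
    best_err = min(best_err, err, 1.0 - err)

assert best_err >= 0.25  # cf. the 1/4 lower bound in Theorem 1.2
print(round(best_err, 3))
```

The printed value sits near $1/2$ minus a term of order $\sqrt{\ln(n)/n}$, matching the shape of the bound in Lemma 4.1 below.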
The proof of Theorem 1.2 is a direct corollary of the VC dimension of semi-algebraic networks, which in turn can be proved by a small modification of the VC dimension proof for piecewise polynomial networks (Anthony and Bartlett, 1999, Theorem 8.8). Moreover, the core methodology for VC dimension bounds of neural networks is due to Warren, whose goal was an analog of Theorem 1.2 for polynomials (Warren, 1968, Theorem 7).

Lemma 1.3 (Simplification of Lemma 4.2) Let any neural net graph $G$ be given with $\leq p$ parameters in $\leq l$ layers and $\leq m$ total nodes, each of which is $(t,\alpha,\beta)$-semi-algebraic. Then
$$\mathrm{VC}(\mathcal{N}(G)) \leq 6p(l+1)\Big(\ln(2p(l+1)) + \ln(8emt\alpha) + l\ln(\beta)\Big).$$

The proof of Theorem 1.2 and Lemma 1.3 may be found in Section 4. The argument for the VC dimension is very close to the argument for Theorem 1.1 that a network with few layers has few oscillations; see Section 4 for further discussion of this relationship.

2. Semi-algebraic gates and assorted network notation

The definition of a semi-algebraic gate is unfortunately complicated; it is designed to capture a few standard nodes in a single abstraction without degrading the bounds. Note that the name semi-algebraic set is standard (Bochnak et al., 1998, Definition 2.1.4), and refers to a set defined by unions and intersections of polynomial inequalities (and thus the name is somewhat abused here).

Definition 2.1 A function $f : \mathbb{R}^k \to \mathbb{R}$ is $(t,\alpha,\beta)$-sa ($(t,\alpha,\beta)$-semi-algebraic) if there exist $t$ polynomials $(q_i)_{i=1}^t$ of degree $\leq \alpha$, and $m$ triples $(U_j, L_j, p_j)_{j=1}^m$ where $U_j$ and $L_j$ are subsets of $[t]$ (where $[t] := \{1, \ldots, t\}$) and $p_j$ is a polynomial of degree $\leq \beta$, such that
$$f(v) = \sum_{j=1}^m p_j(v) \prod_{i \in L_j} \mathbb{1}[q_i(v) < 0] \prod_{i \in U_j} \mathbb{1}[q_i(v) \geq 0].$$
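As a concrete instance of this definition (with illustrative, hypothetical weights): the standard ReLU gate $\sigma_R(\langle a, v\rangle + b)$ is $(1,1,1)$-sa, via the single predicate polynomial $q_1(v) = \langle a, v\rangle + b$ of degree 1 and the single triple $(U_1, L_1, p_1) = (\{1\}, \emptyset, q_1)$ with $p_1$ of degree 1. A quick numeric check of the encoding:

```python
# The standard ReLU gate written in the semi-algebraic form of Definition 2.1:
# one predicate polynomial q_1 (degree 1), one triple with U_1 = {1}, L_1 = {},
# and p_1 = q_1 (degree 1). Weights are illustrative.

a, b = [2.0, -1.0], 0.5

def q1(v):
    # the single predicate polynomial (degree 1)
    return a[0] * v[0] + a[1] * v[1] + b

def relu_sa(v):
    # f(v) = sum_j p_j(v) prod_{i in L_j} 1[q_i(v) < 0] prod_{i in U_j} 1[q_i(v) >= 0]
    return q1(v) * (1 if q1(v) >= 0 else 0)

def relu_direct(v):
    return max(0.0, q1(v))

for v in [[1.0, 1.0], [0.0, 2.0], [-1.0, 0.0], [0.3, 1.1]]:
    assert relu_sa(v) == relu_direct(v)
print("encodings agree")
```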
A notable trait of the definition is that the number of terms $m$ does not need to enter the name, as it does not affect any of the complexity estimates herein (e.g., Theorem 1.1 or Theorem 1.2).

Distinguished special cases of semi-algebraic gates are as follows in Lemma 2.3. The standard piecewise polynomial gates generalize the ReLU and have received a fair bit of attention in the theoretical community (Anthony and Bartlett, 1999, Chapter 8); here a function $\sigma : \mathbb{R} \to \mathbb{R}$ is $(t,\alpha)$-poly if $\mathbb{R}$ can be partitioned into $\leq t$ intervals so that $\sigma$ is a polynomial of degree $\leq \alpha$ within each piece. The maximization and minimization gates have become popular due to their use in convolutional networks (Krizhevsky et al., 2012), which will be discussed more in Section 2.1. Lastly, decision trees and boosted decision trees are practically successful classes usually viewed as competitors to neural networks (Caruana and Niculescu-Mizil, 2006), and have the following structure.

Definition 2.2 A $k$-dt (decision tree with $k$ nodes) is defined recursively as follows. If $k = 1$, it is a constant function. If $k > 1$, it first evaluates $x \mapsto \mathbb{1}[\langle a, x\rangle - b \geq 0]$, and thereafter conditionally evaluates either a left $l$-dt or a right $r$-dt, where $l + r < k$. A $(t,k)$-bdt (boosted decision tree) evaluates $x \mapsto \sum_{i=1}^t c_i g_i(x)$, where each $c_i \in \mathbb{R}$ and each $g_i$ is a $k$-dt.

Lemma 2.3 (Example semi-algebraic gates)

1. If $\sigma : \mathbb{R} \to \mathbb{R}$ is $(t,\beta)$-poly and $q : \mathbb{R}^d \to \mathbb{R}$ is a polynomial of degree $\alpha$, then the standard piecewise polynomial gate $\sigma \circ q$ is $(t, \alpha, \alpha\beta)$-sa. In particular, the standard ReLU gate $v \mapsto \sigma_R(\langle a, v\rangle + b)$ is $(1,1,1)$-sa.

2. Given polynomials $(p_i)_{i=1}^r$ of degree $\leq \alpha$, the standard $(r,\alpha)$-min and -max gates $\phi_{\min}(v) := \min_{i \in [r]} p_i(v)$ and $\phi_{\max}(v) := \max_{i \in [r]} p_i(v)$ are $(r(r-1), \alpha, \alpha)$-sa.

3. Every $k$-dt is $(k, 1, 0)$-sa, and every $(t,k)$-bdt is $(tk, 1, 0)$-sa.
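A minimal sketch of the recursive evaluation in Definition 2.2 (tree parameters are illustrative, not from the paper); restricting the tree to a line also previews the later bound $\mathrm{Cr}(f \circ h) \leq k$ of Lemma 3.2, since a $k$-node tree is piecewise constant along any line.

```python
# A decision tree per Definition 2.2: each internal node tests
# 1[<a, x> - b >= 0] and recurses left or right; leaves are constants.

def dt_eval(node, x):
    if "leaf" in node:
        return node["leaf"]
    a, b = node["a"], node["b"]
    test = sum(ai * xi for ai, xi in zip(a, x)) - b >= 0
    return dt_eval(node["right"] if test else node["left"], x)

# an illustrative 5-node tree (2 internal tests, 3 leaves) on R^2
tree = {"a": [1.0, 0.0], "b": 0.3,
        "left": {"leaf": 0.0},
        "right": {"a": [0.0, 1.0], "b": 0.6,
                  "left": {"leaf": 1.0},
                  "right": {"leaf": 2.0}}}

# restrict to the line z -> (z, z) and count constant pieces
vals = [dt_eval(tree, (z / 1000.0, z / 1000.0)) for z in range(1001)]
pieces = 1 + sum(1 for u, w in zip(vals, vals[1:]) if u != w)
assert pieces <= 5  # at most k pieces for a k-node tree along a line
print(pieces)  # -> 3
```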
The proof of Lemma 2.3 is mostly a matter of unwrapping definitions, and is deferred to Appendix A. Perhaps the only interesting encoding is for the maximization gate (and similarly the minimization gate), which uses
$$\max_i v_i = \sum_i v_i \Big(\prod_{j < i} \mathbb{1}[v_i > v_j]\Big)\Big(\prod_{j > i} \mathbb{1}[v_i \geq v_j]\Big).$$

2.1. Notation for neural networks

A semi-algebraic gate is simply a function from some domain to $\mathbb{R}$, but its role in a neural network is more complicated, as the domain of the function must be partitioned into arguments of three types: the input $x \in \mathbb{R}^d$ to the network, the parameter vector $w \in \mathbb{R}^p$, and a vector of real numbers coming from parent nodes. As a convention, the input $x \in \mathbb{R}^d$ is only accessed by the root nodes (otherwise "layer" has no meaning). For convenience, let layer 0 denote the input itself: $d$ nodes where node $i$ is the map $x \mapsto x_i$. The parameter vector $w \in \mathbb{R}^p$ will be made available to all nodes in layers above 0, though they might only use a subset of it. Specifically, an internal node computes a function $f : \mathbb{R}^p \times \mathbb{R}^d \to \mathbb{R}$ using parents $(f_1, \ldots, f_k)$ and a semi-algebraic gate $\phi : \mathbb{R}^p \times \mathbb{R}^k \to \mathbb{R}$, meaning
$$f(w, x) := \phi(w_1, \ldots, w_p, f_1(w, x), \ldots, f_k(w, x)).$$
Another common practice is to have nodes apply a univariate activation function to an affine mapping of their parents (as with piecewise polynomial gates in Lemma 2.3), where the weights in the affine combination are the parameters to the network, and additionally correspond to edges in the graph. It is permitted for the same parameter to appear multiple times in a network, which explains how the number of parameters in Theorem 1.1 can be less than the number of edges and nodes. The entire network computes some function $F_G : \mathbb{R}^p \times \mathbb{R}^d \to \mathbb{R}$, which is equivalent to the function computed by the single node with no outgoing edges.
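Returning to the maximization-gate encoding from the proof sketch of Lemma 2.3, a quick numerical check confirms why the asymmetry of the predicates matters: the strict tests against earlier coordinates and non-strict tests against later ones ensure exactly one summand survives, even in the presence of ties.

```python
import random

# Numeric check of the maximization-gate encoding:
#   max_i v_i = sum_i v_i * prod_{j<i} 1[v_i > v_j] * prod_{j>i} 1[v_i >= v_j].

def max_via_sa(v):
    total = 0
    for i, vi in enumerate(v):
        keep = all(vi > vj for j, vj in enumerate(v) if j < i) and \
               all(vi >= vj for j, vj in enumerate(v) if j > i)
        total += vi if keep else 0
    return total

random.seed(0)
for _ in range(1000):
    v = [random.randint(-3, 3) for _ in range(4)]  # small range forces ties
    assert max_via_sa(v) == max(v)
print("encoding matches max on all samples")
```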
As stated previously, $G$ will denote not just the graph (nodes and edges) underlying a network, but also an assignment of gates to nodes, and how parameters and parent outputs are plugged into the gates (i.e., in the preceding paragraph, how to write $f$ via $\phi$). $\mathcal{N}(G)$ is the set of functions obtained by varying $w \in \mathbb{R}^p$: thus $\mathcal{N}(G) := \{F_G(w, \cdot) : w \in \mathbb{R}^p\}$, where $F_G$ is the function defined as above, corresponding to the computation performed by $G$. The results related to VC dimension, meaning Theorem 1.2 and Lemma 1.3, will use the class $\mathcal{N}(G)$. Some of the results, for instance Theorem 1.1 and its generalization Theorem 3.12, will let not only the parameters but also the network graph $G$ vary. Let $\mathcal{N}_d((m_i, t_i, \alpha_i, \beta_i)_{i=1}^l)$ denote a network where layer $i$ has $\leq m_i$ nodes, each $(t_i, \alpha_i, \beta_i)$-sa, and the input has dimension $d$. As a simplification, let $\mathcal{N}_d(m, l, t, \alpha, \beta)$ denote networks of $(t,\alpha,\beta)$-sa gates in $\leq l$ layers (not including layer 0), each with $\leq m$ nodes. There are various empirical prescriptions on how to vary the number of nodes per layer; for instance, convolutional networks typically have an increase between layer 0 and layer 1, followed by exponential decrease for a few layers, and finally a few layers with the same number of nodes (Fukushima, 1980; LeCun et al., 1998; Krizhevsky et al., 2012).

3. Benefits of depth

The purpose of this section is to prove Theorem 1.1 and its generalization Theorem 3.12 in the following three steps.

1. Functions with few oscillations poorly approximate functions with many oscillations.
2. Functions computed by networks with few layers must have few oscillations.
3. Functions computed by networks with many layers can have many oscillations.

3.1. Approximation via oscillation counting

[Figure 1: f crosses more than g.]

The idea behind this first step is depicted in Figure 1.
Given functions $f : \mathbb{R} \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$ (the multivariate case will come soon), let $I_f$ and $I_g$ denote partitions of $\mathbb{R}$ into intervals so that the classifiers $\tilde{f}(x) = \mathbb{1}[f(x) \geq 1/2]$ and $\tilde{g}$ are constant within each interval. To formally count oscillations, define the crossing number $\mathrm{Cr}(f)$ of $f$ as $\mathrm{Cr}(f) = |I_f|$ (thus $\mathrm{Cr}(\sigma_R) = 2$). If $\mathrm{Cr}(f)$ is much larger than $\mathrm{Cr}(g)$, then most piecewise constant regions of $\tilde{g}$ will exhibit many oscillations of $f$, and thus $g$ poorly approximates $f$.

Lemma 3.1 Let $f : \mathbb{R} \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$ be given, and take $I_f$ to denote the partition of $\mathbb{R}$ given by the pieces of $\tilde{f}$ (meaning $|I_f| = \mathrm{Cr}(f)$). Then
$$\frac{1}{\mathrm{Cr}(f)} \sum_{U \in I_f} \mathbb{1}[\forall x \in U\;\; \tilde{f}(x) \neq \tilde{g}(x)] \geq \frac{1}{2}\left(1 - \frac{2\,\mathrm{Cr}(g)}{\mathrm{Cr}(f)}\right).$$

The arguably strange form of the left hand side of the bound in Lemma 3.1 is to accommodate different notions of distance. For the $L_1$ distance with the Lebesgue measure as in Theorem 1.1, it does not suffice for $f$ to cross 1/2: it must be regular, meaning it must cross by an appreciable distance, and the crossings must be evenly spaced. (It is worth highlighting that the ReLU easily gives rise to a regular $f$.) However, to merely show that $f$ and $g$ give very different classifiers $\tilde{f}$ and $\tilde{g}$ over an arbitrary measure (as in part of Theorem 3.12), no additional regularity is needed.

Proof (of Lemma 3.1) Let $I_f$ and $I_g$ respectively denote the sets of intervals corresponding to $\tilde{f}$ and $\tilde{g}$, and set $s_f := \mathrm{Cr}(f) = |I_f|$ and $s_g := \mathrm{Cr}(g) = |I_g|$. For every $J \in I_g$, set $X_J := \{U \in I_f : U \subseteq J\}$. Fixing any $J \in I_g$, since $\tilde{g}$ is constant on $J$ whereas $\tilde{f}$ alternates, the number of elements of $X_J$ where $\tilde{g}$ disagrees everywhere with $\tilde{f}$ is $|X_J|/2$ when $|X_J|$ is even and at least $(|X_J| - 1)/2$ when $|X_J|$ is odd, thus at least $(|X_J| - 1)/2$ in general.
As such,
$$\frac{1}{s_f} \sum_{U \in I_f} \mathbb{1}[\forall x \in U\;\; \tilde{f}(x) \neq \tilde{g}(x)] \geq \frac{1}{s_f} \sum_{J \in I_g} \sum_{U \in X_J} \mathbb{1}[\forall x \in U\;\; \tilde{f}(x) \neq \tilde{g}(x)] \geq \frac{1}{s_f} \sum_{J \in I_g} \frac{|X_J| - 1}{2}. \tag{3.1}$$
To control this expression, note that the sets $X_J$ are disjoint; however, $X := \cup_{J \in I_g} X_J$ can be smaller than $I_f$: in particular, it misses intervals $U \in I_f$ whose interior intersects the boundary of an interval in $I_g$. Since there are at most $s_g - 1$ such boundaries,
$$s_f = |I_f| \leq s_g - 1 + |X| \leq s_g + \sum_{J \in I_g} |X_J|,$$
which rearranges to give $\sum_{J \in I_g} |X_J| \geq s_f - s_g$. Combining this with eq. (3.1),
$$\frac{1}{s_f} \sum_{U \in I_f} \mathbb{1}[\forall x \in U\;\; \tilde{f}(x) \neq \tilde{g}(x)] \geq \frac{1}{2s_f}(s_f - s_g - s_g) = \frac{1}{2}\left(1 - \frac{2s_g}{s_f}\right).$$

3.2. Few layers, few oscillations

As in the preceding section, oscillations of a function $f$ will be counted via the crossing number $\mathrm{Cr}(f)$. Since $\mathrm{Cr}(\cdot)$ only handles univariate functions, the multivariate case is handled by first choosing an affine map $h : \mathbb{R} \to \mathbb{R}^d$ (meaning $h(z) = az + b$) and considering $\mathrm{Cr}(f \circ h)$.

Before giving the central upper bounds and sketching their proofs, notice by analogy to polynomials how compositions and additions vary in their impact upon oscillations. By adding together two polynomials, the resulting polynomial has at most twice as many terms and does not exceed the maximum degree of either polynomial. On the other hand, composing polynomials, the result has the product of the degrees and can have more than the product of the terms. As both of these can impact the number of roots or crossings (e.g., by the Bezout Theorem or Descartes' Rule of Signs), composition wins the race to higher oscillations.

Lemma 3.2 Let $h : \mathbb{R} \to \mathbb{R}^d$ be affine.

1. Suppose $f \in \mathcal{N}_d((m_i, t_i, \alpha_i, \beta_i)_{i=1}^l)$ with $\min_i \min\{\alpha_i, \beta_i\} \geq 1$. Setting $\alpha := \max_i \alpha_i$, $\beta := \max_i \beta_i$, $t := \max_i t_i$, and $m := \sum_i m_i$, then $\mathrm{Cr}(f \circ h) \leq 2(2tm\alpha/l)^l \beta^{l^2}$.

2.
Let $k$-dt $f : \mathbb{R}^d \to \mathbb{R}$ and $(t,k)$-bdt $g : \mathbb{R}^d \to \mathbb{R}$ be given. Then $\mathrm{Cr}(f \circ h) \leq k$ and $\mathrm{Cr}(g \circ h) \leq 2tk$.

Lemma 3.2 shows the key tradeoff: the number of layers is in the exponent, while the number of nodes is in the base. Rather than directly controlling $\mathrm{Cr}(f \circ h)$, the proofs will first show $f \circ h$ is $(t,\alpha)$-poly, which immediately bounds $\mathrm{Cr}(f \circ h)$ as follows.

Lemma 3.3 If $f : \mathbb{R} \to \mathbb{R}$ is $(t,\alpha)$-poly, then $\mathrm{Cr}(f) \leq t(1 + \alpha)$.

Proof The polynomial in each piece has at most $\alpha$ roots, which thus divide each piece into $\leq 1 + \alpha$ further pieces within which $\tilde{f}$ is constant.

A second technical lemma is needed to reason about combinations of partitions defined by $(t,\alpha,\beta)$-sa and $(t,\alpha)$-poly functions.

Lemma 3.4 Let $k$ partitions $(A_i)_{i=1}^k$ of $\mathbb{R}$, each into at most $t$ intervals, be given, and set $A := \cup_i A_i$. Then there exists a partition $B$ of $\mathbb{R}$ of size at most $kt$ so that every interval expressible as a union of intersections of elements of $A$ is a union of elements of $B$.

[Figure 2: Three partitions.]

The proof is somewhat painful owing to the fact that there is no convention on the structure of the intervals in the partitions, namely which ends are closed and which are open, and is thus deferred to Appendix A. The principle of the proof is elementary, and is depicted in Figure 2: given a collection of partitions, an intersection of constituent intervals must share endpoints with intervals in the intersection, thus the total number of intervals bounds the total number of possible intersections. Arguably, this failure to increase complexity in the face of arbitrary intersections is why semi-algebraic gates do not care about the number of terms in their definition.

Recall that $(t,\alpha,\beta)$-sa means there is a set of $t$ polynomials of degree at most $\alpha$ which form the regions defining the function by intersecting simpler regions $x \mapsto \mathbb{1}[q(x) \geq 0]$ and $x \mapsto \mathbb{1}[q(x) < 0]$.
As such, in order to analyze semi-algebraic gates composed with piecewise polynomial gates, consider first the behavior of these predicate polynomials.

Lemma 3.5 Suppose $f : \mathbb{R}^k \to \mathbb{R}$ is polynomial with degree $\leq \alpha$ and $(g_i)_{i=1}^k$ are each $(t,\gamma)$-poly. Then $h(x) := f(g_1(x), \ldots, g_k(x))$ is $(tk, \alpha\gamma)$-poly, and the partition defining $h$ is a refinement of the partitions for each $g_i$ (in particular, each $g_i$ is a fixed polynomial (of degree $\leq \gamma$) within the $\leq tk$ pieces defining $h$).

Proof By Lemma 3.4, there exists a partition of $\mathbb{R}$ into $\leq tk$ intervals which refines the partitions defining each $g_i$. Since $f$ is a polynomial with degree $\leq \alpha$, within each of these intervals its composition with $(g_1, \ldots, g_k)$ gives a polynomial of degree $\leq \alpha\gamma$.

This gives the following complexity bound for composing $(s,\alpha,\beta)$-sa and $(t,\gamma)$-poly gates.

Lemma 3.6 Suppose $f : \mathbb{R}^k \to \mathbb{R}$ is $(s,\alpha,\beta)$-sa and $(g_1, \ldots, g_k)$ are $(t,\gamma)$-poly. Then $h(x) := f(g_1(x), \ldots, g_k(x))$ is $(stk(1 + \alpha\gamma), \beta\gamma)$-poly.

Proof By definition, $f$ is polynomial in regions defined by intersections of the predicates $U_i(x) = \mathbb{1}[q_i(x) \geq 0]$ and $L_i(x) = \mathbb{1}[q_i(x) < 0]$. By Lemma 3.5, $q_i(g_1, \ldots, g_k)$ is $(tk, \alpha\gamma)$-poly, thus $U_i$ and $L_i$ together define a partition of $\mathbb{R}$ which has $\mathrm{Cr}(x \mapsto q_i(g_1(x), \ldots, g_k(x)))$ pieces, which by Lemma 3.3 has cardinality at most $tk(1 + \alpha\gamma)$ and refines the partitions for each $g_i$. By Lemma 3.4, these partitions across all predicate polynomials $(q_i)_{i=1}^s$ can be refined into a single partition of size $\leq stk(1 + \alpha\gamma)$, which thus also refines the partitions defined by $(g_1, \ldots, g_k)$. Thanks to these refinements, $h$ over any element $U$ of this final partition is a fixed polynomial $p_U(g_1, \ldots, g_k)$ of degree $\leq \beta\gamma$, meaning $h$ is $(stk(1 + \alpha\gamma), \beta\gamma)$-poly.

The proof of Lemma 3.2 now follows by Lemma 3.6.
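Numerically, Lemma 3.3 (the workhorse behind these piece counts) can be sanity-checked with an illustrative $(2,2)$-poly function: $t = 2$ intervals with a polynomial of degree $\alpha = 2$ on each, so the classifier $\mathbb{1}[f \geq 1/2]$ should have at most $t(1+\alpha) = 6$ constant pieces.

```python
# Sanity check of Lemma 3.3 with an illustrative (2, 2)-poly function.

def f(x):
    # piece 1 on x < 0 and piece 2 on x >= 0, each a degree-2 polynomial
    return x * x - 0.3 if x < 0 else -2.0 * (x - 0.4) * (x - 0.8)

def crossing_pieces(fn, lo, hi, grid=100000):
    # count constant pieces of the classifier 1[fn >= 1/2] on a fine grid
    labels = [1 if fn(lo + (hi - lo) * i / grid) >= 0.5 else 0
              for i in range(grid + 1)]
    return 1 + sum(1 for u, w in zip(labels, labels[1:]) if u != w)

pieces = crossing_pieces(f, -2.0, 2.0)
assert pieces <= 6  # the Lemma 3.3 bound t(1 + alpha)
print(pieces)  # -> 2 for this particular f
```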
In particular, for semi-algebraic networks, the proof of Lemma 3.2 is an induction over layers, establishing that node $j$ is $(t_j, \alpha_j)$-poly (for appropriate $(t_j, \alpha_j)$).

3.3. Many layers, many oscillations

The idea behind this construction is as follows. Consider any continuous function $f : [0,1] \to [0,1]$ which is a generalization of a triangle wave with a single peak: $f(0) = f(1) = 0$, there is some $a \in (0,1)$ with $f(a) = 1$, and additionally $f$ strictly increases along $[0,a]$ and strictly decreases along $[a,1]$. Now consider the effect of the composition $f \circ f = f^2$. Along $[0,a]$, this is a stretched copy of $f$, since $f(f(a)) = f(1) = 0 = f(0) = f(f(0))$ and moreover $f$ is a bijection between $[0,a]$ and $[0,1]$ (when restricted to $[0,a]$). The same reasoning applies to $f^2$ along $[a,1]$, meaning $f^2$ is a function with two peaks. Iterating this argument implies $f^k$ is a function with $2^{k-1}$ peaks; the following definition and lemmas formalize this reasoning.

Definition 3.7 $f$ is $(t, [a,b])$-triangle when it is continuous along $[a,b]$, and $[a,b]$ may be divided into $2t$ intervals $[a_i, a_{i+1}]$ with $a_1 = a$ and $a_{2t+1} = b$, $f(a_i) = f(a_{i+2})$ whenever $1 \leq i \leq 2t - 1$, $f(a_1) = 0$, $f(a_2) = 1$, $f$ is strictly increasing along odd-numbered intervals (those starting from $a_i$ with $i$ odd), and strictly decreasing along even-numbered intervals.

Lemma 3.8 If $f$ is $(s, [0,1])$-triangle and $g$ is $(t, [0,1])$-triangle, then $f \circ g$ is $(2st, [0,1])$-triangle.

Proof Since $g([0,1]) = [0,1]$ and $f$ and $g$ are continuous along $[0,1]$, $f \circ g$ is continuous along $[0,1]$. In the remaining analysis, let $(a_1, \ldots, a_{2s+1})$ and $(c_1, \ldots, c_{2t+1})$ respectively denote the interval boundaries for $f$ and $g$. Now consider any interval $[c_j, c_{j+1}]$ where $j$ is odd, meaning the restriction $g_j : [c_j, c_{j+1}] \to [0,1]$ of $g$ to $[c_j, c_{j+1}]$ is strictly increasing.
It will be shown that $f \circ g_j$ is $(s, [c_j, c_{j+1}])$-triangle, and an analogous proof holds for the strictly decreasing restriction $g_{j+1} : [c_{j+1}, c_{j+2}] \to [0,1]$, whereby it follows that $f \circ g$ is $(2st, [0,1])$-triangle by considering all choices of $j$. To this end, note for any $i \in \{1, \ldots, 2s+1\}$ that $g_j^{-1}(a_i)$ exists and is unique, thus set $a_i' := g_j^{-1}(a_i)$. By this choice, for odd $i$ it holds that $f(g_j(a_i')) = f(g_j(g_j^{-1}(a_i))) = f(a_i) = f(a_1) = 0$ and $f \circ g_j$ is strictly increasing along $[a_i', a_{i+1}']$ (since $g_j$ is strictly increasing everywhere and $f$ is strictly increasing along $[g_j(a_i'), g_j(a_{i+1}')] = [a_i, a_{i+1}]$), and similarly even $i$ has $f(g_j(a_i')) = f(a_2) = 1$ and $f \circ g_j$ strictly decreasing along $[a_i', a_{i+1}']$.

Corollary 3.9 If $f \in \mathcal{N}_1(m, l, t, \alpha, \beta)$ is $(t, [0,1])$-triangle with $p$ distinct parameters, then $f^k \in \mathcal{N}_1(m, kl, t, \alpha, \beta)$ is $(2^{k-1}t^k, [0,1])$-triangle with $p$ distinct parameters and $\mathrm{Cr}(f^k) = (2t)^k + 1$.

Proof It suffices to perform $k - 1$ applications of Lemma 3.8.

Next, note the following examples of triangle functions.

Lemma 3.10 The following functions are $(1, [0,1])$-triangle.

1. $f(z) := \sigma_R(2\sigma_R(z) - 4\sigma_R(z - 1/2)) \in \mathcal{N}_1(2, 1, 1, 1, 1)$.
2. $g(z) := \min\{\sigma_R(2z), \sigma_R(2 - 2z)\} \in \mathcal{N}_1(2, 1, 2, 1, 1)$.
3. $h(z) := 4z(1 - z) \in \mathcal{N}_1(1, 1, 0, 2, 0)$. Cf. Schmitt (2000).

Lastly, consider the first example
$$f(z) = \sigma_R(2\sigma_R(z) - 4\sigma_R(z - 1/2)) = \min\{\sigma_R(2z), \sigma_R(2 - 2z)\},$$
whose graph linearly interpolates (in $\mathbb{R}^2$) between $(0,0)$, $(1/2, 1)$, and $(1, 0)$.
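This map and its compositions are easy to check numerically; a small sketch (with $t = 1$, so Corollary 3.9 predicts $\mathrm{Cr}(f^k) = 2^k + 1$, and Lemma 3.11 below supplies a closed form):

```python
# The triangle map f(z) = relu(2*relu(z) - 4*relu(z - 1/2)) and its k-fold
# composition f^k, compared against the closed form of Lemma 3.11, plus a
# count of the constant pieces of the classifier 1[f^k >= 1/2] on [0, 1].

def relu(z):
    return max(0.0, z)

def f(z):
    return relu(2.0 * relu(z) - 4.0 * relu(z - 0.5))

def f_iter(z, k):
    for _ in range(k):
        z = f(z)
    return z

def closed_form(z, k):
    # Lemma 3.11: write z = (i_k + z_k) * 2^(1-k) with z_k in [0, 1)
    _, zk = divmod(z * 2.0 ** (k - 1), 1.0)
    return 2.0 * zk if zk <= 0.5 else 2.0 * (1.0 - zk)

k = 3
grid = [j / 1024.0 for j in range(1025)]  # dyadic grid: float-exact here
assert all(f_iter(z, k) == closed_form(z, k) for z in grid)

labels = [1 if f_iter(z, k) >= 0.5 else 0 for z in grid]
pieces = 1 + sum(1 for u, w in zip(labels, labels[1:]) if u != w)
print(pieces)  # -> 9, i.e. 2^k + 1
```

The piece count matches the evenly spaced copies of $f$ produced by the interpolation just described, which is exactly the regularity exploited in the proof of Theorem 1.1.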
Consequently, $f \circ f$ along $[0, 1/2]$ linearly interpolates between $(0,0)$, $(1/4, 1)$, and $(1/2, 0)$, and $f \circ f$ is analogous on $[1/2, 1]$, meaning it has produced two copies of $f$ shrunken horizontally by a factor of 2. This process repeats, meaning $f^k$ has $2^{k-1}$ copies of $f$, and grants the regularity needed to use the Lebesgue measure in Theorem 1.1.

Lemma 3.11 Set $f(z) := \sigma_R(2\sigma_R(z) - 4\sigma_R(z - 1/2)) \in \mathcal{N}_1(2, 1, 1, 1, 1)$ (cf. Lemma 3.10). Let real $z \in [0,1]$ and positive integer $k$ be given, and choose the unique nonnegative integer $i_k \in \{0, \ldots, 2^{k-1}\}$ and real $z_k \in [0,1)$ so that $z = (i_k + z_k)2^{1-k}$. Then
$$f^k(z) = \begin{cases} 2z_k & \text{when } 0 \leq z_k \leq 1/2, \\ 2(1 - z_k) & \text{when } 1/2 < z_k < 1. \end{cases}$$

3.4. Proof of Theorem 1.1

The proof of Theorem 1.1 now follows: Lemma 3.11 shows that a many-layered ReLU network can give rise to a highly oscillatory and regular function $f^k$, Lemma 3.2 shows that few-layered networks and (boosted) decision trees give rise to functions with few oscillations, and lastly Lemma 3.1 shows how to combine these into an inapproximability result.

In this last piece, the proof averages over the possible offsets $y \in \mathbb{R}^{d-1}$ and considers univariate problems after composing networks with the affine map $h_y(z) := (z, y)$. In this way, the result carries some resemblance to the random projection technique used in depth hierarchy theorems for boolean functions (Håstad, 1986; Rossman et al., 2015), as well as earlier techniques on complexities of multivariate sets (Vitushkin, 1955, 1959), albeit in an extremely primitive form (considering variations along only one dimension).

Proof (of Theorem 1.1) Set $h(z) := \sigma_R(2\sigma_R(z) - 4\sigma_R(z - 1/2))$ (cf. Lemma 3.10), and define $f_0(z) := h^{k^3+4}(z)$ and $f : \mathbb{R}^d \to \mathbb{R}$ as $f(x) = f_0(x_1)$.
Let $I_f$ denote the pieces of $\tilde{f_0}$, meaning $|I_f| = \mathrm{Cr}(f_0)$; Corollary 3.9 grants $\mathrm{Cr}(f_0) = 2^{k^3+4} + 1$. Moreover, by Lemma 3.11, for any $U \in I_f$, $f_0 - 1/2$ is a triangle with height 1/2 and base either $2^{-k^3-5}$ (when $0 \in U$ or $1 \in U$) or $2^{-k^3-4}$, whereby $\int_U |f_0(x) - 1/2|\,dx \geq 2^{-k^3-5}/4 \geq 1/(16|I_f|)$ (which has thus made use of the special regularity of $h$).

Now for any $y \in \mathbb{R}^{d-1}$ define the map $p_y : \mathbb{R} \to \mathbb{R}^d$ as $p_y(z) := (z, y)$. If $g$ is a semi-algebraic network with $\leq k$ layers and $m \leq 2^k/(t\alpha\beta)$ total nodes, then Lemma 3.2 grants
$$\mathrm{Cr}(g \circ p_y) \leq 2(2tm\alpha/k)^k \beta^{k^2} \leq 4(tm\alpha\beta)^{k^2} \leq 2^{k^3+2}.$$
Otherwise, $g$ is a $(t, 2^{k^3}/t)$-bdt, whereby Lemma 3.2 gives $\mathrm{Cr}(g \circ p_y) \leq 2t \cdot 2^{k^3}/t \leq 2^{k^3+2}$ once again. Since $f \circ p_y = f_0$ for every $y \in \mathbb{R}^{d-1}$, it holds that $\mathrm{Cr}(f \circ p_y) = \mathrm{Cr}(f_0)$, and by Lemma 3.1,
$$\begin{aligned}
\int_{[0,1]} |f(p_y(z)) - g(p_y(z))|\,dz &= \sum_{U \in I_f} \int_U |(f \circ p_y)(z) - (g \circ p_y)(z)|\,dz \\
&\geq \sum_{U \in I_f} \int_U |(f \circ p_y)(z) - 1/2|\, \mathbb{1}[\forall z \in U\;\; \widetilde{(f \circ p_y)}(z) \neq \widetilde{(g \circ p_y)}(z)]\,dz \\
&\geq \frac{1}{16|I_f|} \sum_{U \in I_f} \mathbb{1}[\forall z \in U\;\; \widetilde{(f \circ p_y)}(z) \neq \widetilde{(g \circ p_y)}(z)] \\
&\geq \frac{1}{32}\left(1 - \frac{2\,\mathrm{Cr}(g \circ p_y)}{\mathrm{Cr}(f \circ p_y)}\right) \geq \frac{1}{32}\left(1 - \frac{2 \cdot 2^{k^3+2}}{2^{k^3+4}}\right) \geq \frac{1}{64}.
\end{aligned}$$
To finish,
$$\int_{[0,1]^d} |f(x) - g(x)|\,dx = \int_{[0,1]^{d-1}} \int_{[0,1]} |(f \circ p_y)(z) - (g \circ p_y)(z)|\,dz\,dy \geq \frac{1}{64}.$$

Using nearly the same proof, but giving up on the continuous uniform measure, it is possible to handle other distances and more flexible target functions.

Theorem 3.12 Let integer $k \geq 1$ and function $f : \mathbb{R} \to \mathbb{R}$ be given where $f$ is $(1, [0,1])$-triangle, and define $h : \mathbb{R}^d \to \mathbb{R}$ as $h(x) := f^k(x_1)$. For every $y \in \mathbb{R}^{d-1}$, define the affine function $p_y(z) := (z, y)$.
Then there exist Borel probability measures $\mu$ and $\nu$ over $[0,1]^d$, where $\nu$ is discrete uniform on $2^k + 1$ points and $\mu$ is continuous and positive on exactly $[0,1]^d$, so that every $g : \mathbb{R}^d \to \mathbb{R}$ with $\mathrm{Cr}(g \circ p_y) \leq 2^{k-2}$ for every $y \in \mathbb{R}^{d-1}$ satisfies
$$\int |h - g|\,d\mu \geq \frac{1}{32}, \qquad \int |\tilde{h} - \tilde{g}|\,d\mu \geq \frac{1}{8}, \qquad \int |h - g|\,d\nu \geq \frac{1}{8}, \qquad \int |\tilde{h} - \tilde{g}|\,d\nu \geq \frac{1}{4}.$$

4. Limitations of depth

Theorem 3.12 can be taken to say: there exists a labeling of $\Theta(2^{k^3})$ points which is realizable by a network of depth and size $\Theta(k^3)$, but can not be approximated by networks with depth $k$ and size $o(2^k)$. On the other hand, this section will sketch the proof of Theorem 1.2, which implies that these depth-$\Theta(k^3)$ networks realize relatively few different labelings. The proof is a quick consequence of the VC dimension of semi-algebraic networks (cf. Lemma 1.3) and the following fact, where $\mathrm{Sh}(\cdot)$ is used to denote the growth function (Anthony and Bartlett, 1999, Chapter 3).

Lemma 4.1 Let any function class $\mathcal{F}$ and any distinct points $(x_i)_{i=1}^n$ be given. Then with probability at least $1 - \delta$ over a uniform random draw of labels $(y_i)_{i=1}^n$ (with $y_i \in \{-1, +1\}$),
$$\inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \mathbb{1}[\tilde{f}(x_i) \neq y_i] \geq \frac{1}{2}\left(1 - \sqrt{\frac{\ln(\mathrm{Sh}(\mathcal{F}; n)) + \ln(1/\delta)}{2n}}\right).$$

The proof of the preceding result is similar to proofs of the Gilbert-Varshamov packing bound via Hoeffding's inequality (Duchi, 2016, Lemma 13.5). Note that a similar result was used by Warren to prove rates of approximation of continuous functions by polynomials, but without invoking Hoeffding's inequality (Warren, 1968, Theorem 7).

The remaining task is to control the VC dimension of semi-algebraic networks. To this end, note the following generalization of Lemma 1.3, which further provides that semi-algebraic networks compute functions which are polynomial when restricted to certain polynomial regions.
Lemma 4.2 Let neural network graph $G$ be given with $\le p$ parameters, $\le l$ layers, and $\le m$ total nodes, and suppose every gate is $(t, \alpha, \beta)$-sa. Then
$$\mathrm{VC}(\mathcal{N}(G)) \le 6p(l+1)\bigl(\ln(2p(l+1)) + \ln(8emt\alpha) + l\ln(\beta)\bigr).$$
Additionally, given any $n \ge p$ data points, there exists a partition $\mathcal{S}$ of $\mathbb{R}^p$, where each $S \in \mathcal{S}$ is an intersection of predicates $\mathbf{1}[q \diamond 0]$ with $\diamond \in \{<, \ge\}$ and $q$ of degree $\le \alpha\beta^{l-1}$, such that $F_G(x_i, \cdot)$ restricted to each $S \in \mathcal{S}$ is a fixed polynomial of degree $\le \beta^l$ for every example $x_i$, with
$$|\mathcal{S}| \le \left(8enmt\alpha\beta^l\right)^{pl} \qquad\text{and}\qquad \mathrm{Sh}(\mathcal{N}(G); n) \le \left(8enmt\alpha\beta^l\right)^{p(l+1)}.$$

The proof follows the same basic structure as the VC bound for networks with piecewise polynomial activation functions (Anthony and Bartlett, 1999, Theorem 8.8). The slightly modified proof here is also very similar to the proof of Lemma 3.2, performing an induction up through the layers of the network, arguing that each node computes a polynomial after restricting attention to some range of parameters. The proof of Lemma 4.2 manages to be multivariate (unlike Lemma 3.2), though this requires arguments due to Warren (1968) which are significantly more complicated than those of Lemma 3.2 (without leading to a strengthening of Theorem 1.1).

One minor departure from the VC dimension proof for piecewise polynomial networks (cf. (Anthony and Bartlett, 1999, Theorem 8.8)) is the following lemma, which is used to track the number of regions with the more complicated semi-algebraic networks. Despite this generalization, the VC dimension bound is essentially the same as for piecewise polynomial networks.

Lemma 4.3 Let a set of polynomials $Q$ be given where each $q \in Q$, $q : \mathbb{R}^p \to \mathbb{R}$, has degree $\le \alpha$. Define an initial family $\mathcal{S}_0$ of subsets of $\mathbb{R}^p$ as
$$\mathcal{S}_0 := \Bigl\{ \{a \in \mathbb{R}^p : q(a) \diamond 0\} : q \in Q,\ \diamond \in \{<, \ge\} \Bigr\}.$$
Then the collection $\mathcal{S}$ of all nonempty intersections of elements of $\mathcal{S}_0$ satisfies $|\mathcal{S}| \le 2\left(4e|Q|\alpha/p\right)^p$.

5.
Bibliographic notes and open problems

Arguably the first approximation theorem of a big class by a smaller one is the Weierstrass Approximation Theorem, which states that polynomials uniformly approximate continuous functions over compact sets (Weierstrass, 1885). Refining this, Kolmogorov (1936) gave a bound on how well subspaces of functions can approximate continuous functions, and Vitushkin (1955, 1959) showed a similar bound for approximation by polynomials in terms of the polynomial degrees, dimension, and modulus of continuity of the target function. Warren (1968) then gave an alternate proof and generalization of this result, in the process effectively proving the VC dimension of polynomials (developing tools still used to prove the VC dimension of neural networks (Anthony and Bartlett, 1999, Chapters 7-8)), and producing an analog of Theorem 1.2 for polynomials.

The preceding results, however, focused on separating large classes (e.g., continuous functions of bounded modulus) from small classes (polynomials of bounded degree). Aiming to refine this, depth hierarchy theorems in circuit complexity separated circuits of a certain depth from circuits of a slightly smaller depth. As mentioned in Section 1, the seminal result here is due to Håstad (1986).

For architectures closer to neural networks, sum-product networks (summation and product nodes) have been analyzed by Bengio and Delalleau (2011) and more recently by Martens and Medabalimi (2015), and networks of linear threshold functions in 2 and 3 layers by Kane and Williams (2015); note that both polynomial gates (as in sum-product networks) and linear threshold gates are semi-algebraic gates.
Most closely related to the present work (excluding (Telgarsky, 2015), which is a vastly simplified account), Eldan and Shamir (2015) analyze 2- and 3-layer networks with general activation functions composed with affine mappings, showing separations which are exponential in the input dimension. Due to this result and also recent advances in circuit complexity (Rossman et al., 2015), it is natural to suppose Theorem 1.1 can be strengthened to separating $k$- and $(k+1)$-layer networks when the dimension $d$ is large; however, none of the earlier works give a tight sense of the behavior as $d \downarrow 1$.

The triangle wave target functions considered here (e.g., cf. Lemma 3.10) have appeared in various forms throughout the literature. General properties of piecewise affine highly oscillating functions were investigated by Szymanski and McCane (2014) and Montúfar et al. (2014). Also, Schmitt (2000) investigated the map $z \mapsto 4z(1-z)$ (as in Lemma 3.10) to show that sigmoidal networks cannot approximate high degree polynomials via an analysis similar to the one here; however, looseness in the VC bounds for sigmoidal networks prevented exponential separations and depth hierarchies.

A tantalizing direction for future work is to characterize not just one difficult function (e.g., triangle functions as in Lemma 3.10), but many, or even all functions which are not well-approximated by smaller depths. Arguably, this direction could have value in machine learning, as discovery of such underlying structure could lead to algorithms to recover it. As a trivial example of the sort of structure which could arise, consider the following proposition, stating that any symmetric signal may be repeated by pre-composing it with the ReLU triangle function.

Proposition 5.1 Set $f(z) := \sigma_R(2\sigma_R(z) - 4\sigma_R(z - 1/2))$ (cf. Lemma 3.10), and let any $g : [0,1] \to [0,1]$ be given with $g(z) = g(1-z)$.
Then $h := g \circ f^k$ satisfies $h(x) = h(x + i2^{-k}) = g(2^k x)$ for every real $x \in [0, 2^{-k}]$ and integer $i \in \{0, \ldots, 2^k - 1\}$; in other words, $h$ is $2^k$ repetitions of $g$, with graph scaled horizontally and uniformly to fit within $[0,1]^2$.

Acknowledgments

The author is indebted to Joshua Zahl for help navigating semi-algebraic geometry and for a simplification of the multivariate case in Theorem 1.1, and to Rastislav Telgársky for an introduction to this general topic via Kolmogorov's Superposition Theorem (Kolmogorov, 1957). The author further thanks Jacob Abernethy, Peter Bartlett, Sébastien Bubeck, and Alex Kulesza for valuable discussions.

References

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

Yoshua Bengio and Olivier Delalleau. Shallow vs. deep sum-product networks. In NIPS, 2011.

Jacek Bochnak, Michel Coste, and Marie-Françoise Roy. Real Algebraic Geometry. Springer, 1998.

Rich Caruana and Alexandru Niculescu-Mizil. An empirical comparison of supervised learning algorithms. pages 161–168, 2006.

John Duchi. Statistics 311/Electrical Engineering 377: Information theory and statistics. Stanford University, 2016.

Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. 2015. arXiv:1512.03965 [cs.LG].

Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36:193–202, 1980.

Johan Håstad. Computational Limitations of Small Depth Circuits. PhD thesis, Massachusetts Institute of Technology, 1986.

Daniel Kane and Ryan Williams. Super-linear gate and super-quadratic wire lower bounds for depth-two and depth-three threshold circuits. 2015. arXiv:1511.07860v1 [cs.CC].

Andrei Kolmogorov.
Über die beste Annäherung von Funktionen einer gegebenen Funktionenklasse. Annals of Mathematics, 37(1):107–110, 1936.

Andrey Nikolaevich Kolmogorov. On the representation of continuous functions of several variables by superpositions of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR, 114:953–956, 1957.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

James Martens and Venkatesh Medabalimi. On the expressive efficiency of sum product networks. 2015. arXiv:1411.7717v3 [cs.LG].

Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.

Hoifung Poon and Pedro M. Domingos. Sum-product networks: A new deep architecture. In UAI, pages 337–346, 2011.

Benjamin Rossman, Rocco A. Servedio, and Li-Yang Tan. An average-case depth hierarchy theorem for boolean circuits. In FOCS, 2015.

Michael Schmitt. Lower bounds on the complexity of approximating continuous functions by sigmoidal neural networks. In NIPS, 2000.

Lech Szymanski and Brendan McCane. Deep networks are effective encoders of periodicity. IEEE Transactions on Neural Networks and Learning Systems, 25(10):1816–1827, 2014.

Matus Telgarsky. Representation benefits of deep feedforward networks. 2015. arXiv:1509.08101v2 [cs.LG].

Anatoli Vitushkin. On multidimensional variations. GITTL, 1955. In Russian.

Anatoli Vitushkin. Estimation of the complexity of the tabulation problem. Fizmatgiz, 1959. In Russian.

Hugh E. Warren. Lower bounds for approximation by nonlinear manifolds. Transactions of the American Mathematical Society, 133(1):167–178, 1968.

Karl Weierstrass.
Über die analytische Darstellbarkeit sogenannter willkürlicher Functionen einer reellen Veränderlichen. Sitzungsberichte der Akademie zu Berlin, pages 633–639, 789–805, 1885.

Appendix A. Deferred proofs

This appendix collects various proofs omitted from the main text.

A.1. Deferred proofs from Section 2

The following mechanical proof shows that standard piecewise polynomial gates, maximization/minimization gates, and decision trees are all semi-algebraic gates.

Proof (of Lemma 2.3) 1. To start, since $\sigma : \mathbb{R} \to \mathbb{R}$ is piecewise polynomial, $\sigma \circ q$ can be written
$$\sigma(q(z)) := p_1(q(z))\mathbf{1}[q(z) \diamond_1 b_1] + \sum_{i=2}^{t-1} p_i(q(z))\mathbf{1}[-q(z) \diamond^*_{i-1} {-b_{i-1}}]\,\mathbf{1}[q(z) \diamond_i b_i] + p_t(q(z))\mathbf{1}[-q(z) \diamond^*_{t-1} {-b_{t-1}}],$$
where for each $i \in [t]$, $p_i$ has degree $\le \beta$, $\diamond_i \in \{<, \le\}$, $\diamond^*_i$ is "$<$" when $\diamond_i$ is "$\le$" and otherwise $\diamond^*_i$ is "$\le$", and $b_i \in \mathbb{R}$. As such, setting $q_i(z) := q(z) - b_i$ whenever $\diamond_i$ is "$<$" and $q_i(z) := b_i - q(z)$ otherwise, it follows that $\sigma \circ q$ is $(t, \alpha, \alpha\beta)$-sa.

2. Since $\min_{i \in [r]} x_i = -\max_{i \in [r]} -x_i$, it suffices to handle the maximum case, which has the form
$$\phi_{\max}(v) = \sum_{i=1}^{r} p_i(v) \prod_{j < i} \mathbf{1}[p_i(v) > p_j(v)] \prod_{j > i} \mathbf{1}[p_i(v) \ge p_j(v)].$$
Constructing polynomials $q_{i,j} = p_j - p_i$ when $j < i$ and $q_{i,j} = p_i - p_j$ when $j > i$, it follows that $\phi_{\max}$ is $(r(r-1), \alpha, \alpha)$-sa.

3. First consider a $k$-dt $f$, wherein the proof follows by induction on tree size. In the base case $k = 1$, $f$ is constant.
Otherwise, there exist functions $f_l$ and $f_r$ which are respectively $l$- and $r$-dt with $l + r < k$, and additionally an affine function $q_f$ so that
$$\begin{aligned} f(x) &= f_l(x)\mathbf{1}[q_f(x) < 0] + f_r(x)\mathbf{1}[q_f(x) \ge 0] \\ &= \sum_{j=1}^{m_l} p^{(l)}_j(v)\,\mathbf{1}[q_f(x) < 0] \prod_{i \in L^{(l)}_j} \mathbf{1}[q^{(l)}_i(v) < 0] \prod_{i \in U^{(l)}_j} \mathbf{1}[q^{(l)}_i(v) \ge 0] \\ &\qquad + \sum_{j=1}^{m_r} p^{(r)}_j(v)\,\mathbf{1}[q_f(x) \ge 0] \prod_{i \in L^{(r)}_j} \mathbf{1}[q^{(r)}_i(v) < 0] \prod_{i \in U^{(r)}_j} \mathbf{1}[q^{(r)}_i(v) \ge 0], \end{aligned}$$
where the last step expanded the semi-algebraic forms of $f_l$ and $f_r$. As such, by combining the sets of predicate polynomials for $f_l$ and $f_r$ together with $\{q_f\}$ (where the former two have cardinalities $\le l$ and $\le r$ by the inductive hypothesis), and unioning together the triples for $f_l$ and $f_r$ but extending the triples to include $\mathbf{1}[q_f < 0]$ for triples in $f_l$ and $\mathbf{1}[q_f \ge 0]$ for triples in $f_r$, it follows by construction that $f$ is $(k, 1, 0)$-semi-algebraic.

Now consider a $(t, k)$-bdt $g$. By the preceding expansion, each individual tree $f_i$ is $(k, 1, 0)$-sa, thus their sum is $(tk, 1, 0)$-sa by unioning together the sets of polynomials and triples, and adding together the expansions.

A.2. Deferred proofs from Section 3

The first proof shows that a collection of partitions may be refined into a single partition whose size is at most the total number of intervals across all partitions. As discussed in the text, while the proof has a simple idea (one need only consider boundaries of intervals across all partitions), it is somewhat painful since there is no consistent rule for whether specific endpoints of intervals are open or closed.

Proof (of Lemma 3.4) If $k = 1$, then the result follows with $\mathcal{B} = \mathcal{A} = \mathcal{A}_1$ (since all intersections are empty), thus suppose $k \ge 2$. Let $\{a_1, \ldots
, a_q\}$ denote the set of distinct boundaries of intervals of $\mathcal{A}$, and iteratively construct the partition $\mathcal{B}$ as follows, where the construction will maintain that $\mathcal{B}_j$ is a partition whose boundary points are $\{a_1, \ldots, a_j\}$. For the base case, set $\mathcal{B}_0 := \{\mathbb{R}\}$. Thereafter, for every $i \in [q]$, consider boundary point $a_i$; since the boundary points are distinct, there must exist a single interval $U \in \mathcal{B}_{i-1}$ with $a_i \in U$. $\mathcal{B}_i$ will be formed from $\mathcal{B}_{i-1}$ by refining $U$ in one of the following two ways.

• Consider the case that each partition $\mathcal{A}_l$ which contains the boundary point $a_i$ has exactly two intervals meeting at $a_i$, and moreover the closedness properties are the same, meaning either $a_i$ is contained in the interval which ends at $a_i$, or it is contained in the interval which starts at $a_i$. In this case, partition $U$ into two intervals so that the treatment of the boundary is the same as in those $\mathcal{A}_l$'s with a boundary at $a_i$.

• Otherwise, it is either the case that some $\mathcal{A}_l$ have $a_i$ contained in the interval ending at $a_i$ whereas others have it contained in the interval starting at $a_i$, or simply some $\mathcal{A}_l$ have three intervals meeting at $a_i$: namely, the singleton interval $[a_i, a_i]$ as well as two intervals not containing $a_i$. In this case, partition $U$ into three intervals: one ending at $a_i$ (but not containing it), the singleton interval $[a_i, a_i]$, and an interval starting at $a_i$ (but not containing it).

(These cases may also be described in a unified way: consider all intervals of $\mathcal{A}$ which have $a_i$ as an endpoint, extend such intervals of positive length to have infinite length while keeping endpoint $a_i$ and the side it falls on, and then refine $U$ by intersecting it with all of these intervals, which as above results in either 2 or 3 intervals.)

Note that the construction never introduces more intervals at a boundary point than exist in $\mathcal{A}$, thus $|\mathcal{B}| \le |\mathcal{A}| = kt$.
It remains to be shown that a union of intersections of elements of $\mathcal{A}$ is a union of elements of $\mathcal{B}$. Note that it suffices to show that intersections of elements of $\mathcal{A}$ are unions of elements of $\mathcal{B}$, since thereafter these encodings can be used to express unions of intersections of $\mathcal{A}$ as unions of $\mathcal{B}$. As such, consider any intersection $U$ of elements of $\mathcal{A}$; there is nothing to show if $U$ is empty, thus suppose it is nonempty. In this case, it must also be an interval (e.g., since intersections of convex sets are convex), and its endpoints must coincide with endpoints of $\mathcal{A}$. Moreover, if the left endpoint of $U$ is open, then $U$ must be formed from an intersection which includes an interval with the same open left endpoint, thus there exists such an interval in $\mathcal{A}$, and by the above construction of $\mathcal{B}$, there also exists an interval with such an open left endpoint in $\mathcal{B}$; the same argument similarly handles the case of closed left endpoints, as well as open and closed right endpoints, namely giving elements in $\mathcal{B}$ which match these traits. Let $a_r$ and $a_s$ denote these endpoints. By the above construction of $\mathcal{B}$, intervals with endpoints $\{a_j, a_{j+1}\}$ for $j \in \{r, \ldots, s-1\}$ will be included in $\mathcal{B}$, and since $\mathcal{B}$ is a partition, the union of these elements will be exactly $U$. Since $U$ was an arbitrary intersection of elements of $\mathcal{A}$, the proof is complete.

Next, the tools of Section 3.2 (culminating in the composition rule for semi-algebraic gates (Lemma 3.6)) are used to show crossing number bounds on semi-algebraic networks and boosted decision trees.

Proof (of Lemma 3.2) 1. This proof first shows that $f \circ h$ is $\bigl(2^l \prod_{j \le l} t_j \alpha_j \prod_{j \le l-1} m_j \beta_j^{\,l-j},\ \prod_{j \le l} \beta_j\bigr)$-poly, and then relaxes this expression and applies Lemma 3.3 to obtain the desired bound. First consider the case $d = 1$ and $h$ is the identity map, thus $f \circ h = f$.
For convenience, set
$$A_i := \prod_{j \le i} \alpha_j, \qquad B_i := \prod_{j \le i} \beta_j, \qquad C_i := \prod_{j \le i} \beta_j^{\,i-j+1} = \prod_{j \le i} B_j, \qquad M_i := \prod_{j \le i} m_j, \qquad T_i := \prod_{j \le i} t_j.$$
The proof proceeds by induction on the layers of $f$, showing that each node in layer $i$ is $(2^i T_i A_i C_{i-1} M_{i-1}, B_i)$-poly. For convenience, first consider layer $i = 0$ of the inputs themselves: here, node $i$ outputs the $i$th coordinate of the input, and is thus affine and $(1, 1)$-poly. Next consider layer $i > 0$, where the inductive hypothesis grants that each node in layer $i - 1$ is $(2^{i-1} T_{i-1} A_{i-1} C_{i-2} M_{i-2}, B_{i-1})$-poly. Consequently, since any node in layer $i$ is $(t_i, \alpha_i, \beta_i)$-sa, Lemma 3.6 grants that it is also $(2^{i-1} t_i T_{i-1} A_{i-1} C_{i-2} M_{i-2} m_{i-1} (1 + \alpha_i B_{i-1}), \beta_i B_{i-1})$-poly as desired (since $1 + \alpha_i B_{i-1} \le 2\alpha_i B_{i-1}$).

Next, consider the general case $d \ge 1$ where $h : \mathbb{R} \to \mathbb{R}^d$ is an affine map. Since every coordinate of $h$ is affine (and thus $(1, 1)$-poly), composing $h$ with every polynomial in the semi-algebraic gates of layer 1 gives a function $g \in \mathcal{N}_1((m_i, t_i, \alpha_i, \beta_i)_{i=1}^l)$ which is equal to $f \circ h$ everywhere and whose gates are of the same semi-algebraic complexity. As such, the result follows by applying the preceding analysis to $g$.

Lastly, the simplified terms give that $f \circ h$ is $\bigl((2t\alpha)^l \beta^{l(l-1)/2} \prod_{j \le l-1} m_j,\ \beta^{l(l+1)/2}\bigr)$-poly. Since $\ln(\cdot)$ is strictly increasing and concave and $m_l = 1$,
$$\ln \prod_{j \le l-1} m_j = \ln \prod_{j \le l} m_j = \sum_{j \le l} \ln(m_j) \le l \ln(m/l) = \ln((m/l)^l).$$
It follows that $f \circ h$ is $\bigl((2tm\alpha/l)^l \beta^{l(l-1)/2},\ \beta^{l(l+1)/2}\bigr)$-poly, whereby the crossing number bound follows by Lemma 3.3.

2. Given any $k$-dt $f$, the affine function evaluated at each predicate may be composed with $h$ to yield another affine function, thus $f \circ h : \mathbb{R} \to \mathbb{R}$ is still a $k$-dt, and thus $(k, 1, 0)$-sa by Lemma 2.3.
As such, by Lemma 3.6 (with $g_1(z) = z$ as the identity map), $f \circ h$ is $(k, 0)$-poly. (Invoking Lemma 3.6 without massaging in $h$ introduces a factor of $d$.) Similarly, for a $(t, k)$-bdt $g$, $g \circ h : \mathbb{R} \to \mathbb{R}$ is another $(t, k)$-bdt after pushing $h$ into the predicates of the constituent trees, thus Lemma 2.3 grants that $g \circ h$ is $(tk, 1, 0)$-sa, and Lemma 3.6 grants that it is $(tk(1+1), 0)$-poly. The desired crossing number bounds follow by applying Lemma 3.3.

Next, elementary computations verify that the three functions listed in Lemma 3.10 are indeed $(1, [0,1])$-triangle.

Proof (of Lemma 3.10) 1-2. By inspection, $f(0) = f(1) = 0$ and $f(1/2) = 1$. Moreover, for $x \in [0, 1/2]$, $f(x) = 2x$, meaning $f$ is increasing, and $x \in [1/2, 1]$ means $f(x) = 2(1-x)$, meaning $f$ is decreasing. Lastly, the properties of $g$ follow since $f = g$.

3. By inspection, $h(0) = h(1) = 0$ and $h(1/2) = 1$. Moreover, $h$ is a quadratic, thus can cross 0 at most twice, and moreover $1/2$ is the unique critical point (since $h'$ has degree 1), thus $h$ is increasing on $[0, 1/2]$ and decreasing on $[1/2, 1]$.

In the case of the ReLU $(1, [0,1])$-triangle function $f$ given in Lemma 3.10, the exact form of $f^k$ may be established as follows. (Recall that this refined form allows for the use of Lebesgue measure in Theorem 1.1, and also the repetition statement in Proposition 5.1.)

Proof (of Lemma 3.11) The proof proceeds by induction on the number of compositions $l$. For the base case $l = 1$,
$$f^1(z) = f(z) = \begin{cases} 2z & \text{when } z \in [0, 1/2], \\ 2(1-z) & \text{when } z \in (1/2, 1], \\ 0 & \text{otherwise}. \end{cases}$$
For the inductive step, first note for any $z \in [0, 1/2]$, by symmetry of $f^l$ around $1/2$ (i.e., $f^l(z) = f^l(1-z)$ by the inductive hypothesis), and by the above explicit form of $f^1$,
$$f^{l+1}(z) = f^l(f(z)) = f^l(2z) = f^l(1 - 2z) = f^l(f(1/2 - z)) = f^l(f(z + 1/2)) = f^{l+1}(z + 1/2),$$
meaning the case $z \in (1/2, 1]$ is implied by the case $z \in [0, 1/2]$. Since the unique nonnegative integer $i_{l+1}$ and real $z_{l+1} \in [0, 1)$ satisfy $2z = 2(i_{l+1} + z_{l+1})2^{-l-1} = (i_{l+1} + z_{l+1})2^{-l}$, the inductive hypothesis grants
$$(f^l \circ f)(z) = f^l(2z) = \begin{cases} 2z_{l+1} & \text{when } 0 \le z_{l+1} \le 1/2, \\ 2(1 - z_{l+1}) & \text{when } 1/2 < z_{l+1} < 1, \end{cases}$$
which completes the proof.

The proof of the slightly more general form of Theorem 1.1 is as follows; it does not quite imply Theorem 1.1, since the constructed measure is not the Lebesgue measure, even for the ReLU-based $(1, [0,1])$-triangle function from Lemma 3.10.

Proof (of Theorem 3.12) First note some general properties of $f^k$. By Corollary 3.9, $f^k$ is $(2^k - 1, [0,1])$-triangle, which means there exist $s := 2^k + 1$ points $(z_i)_{i=1}^s$ so that $f^k(z_i) = \mathbf{1}[i \text{ is odd}]$, and moreover $f^k$ is continuous and equal to $1/2$ at exactly $2^k$ points (by the strict increasing/decreasing part of the triangle wave definition), which is a finite set of points and thus has Lebesgue measure zero. Taking $p_y : \mathbb{R} \to \mathbb{R}^d$ to be the map $p_y(z) = (z, y)$ where $y \in \mathbb{R}^{d-1}$, then $(h \circ p_y)(z) = h((z, y)) = f^k(z)$; thus, letting $\mathcal{I}$ denote the pieces within which $\widetilde{f^k}$ is constant, it follows that $\widetilde{h \circ p_y}$ is constant within the same set of pieces, and thus $\mathrm{Cr}(h \circ p_y) = s$.

Now consider the discrete case, where $\nu$ denotes the uniform measure over the $s$ points $(x_i)_{i=1}^s$ defined as $x_i := p_0(z_i) \in \mathbb{R}^d$. Further consider the two types of distance.
• Since $z_i < z_{i+1}$ and $\widetilde{f^k}(z_i) \ne \widetilde{f^k}(z_{i+1})$, then taking $(U_i)_{i=1}^s$ to denote the intervals of $\mathcal{I}$ sorted by their left endpoint, $z_i \in U_i$ for $i \in [s]$. By Lemma 3.1,
$$\int |\tilde h - \tilde g|\,d\nu = \frac{1}{s} \sum_{i=1}^s |\tilde h(x_i) - \tilde g(x_i)| = \frac{1}{s} \sum_{i=1}^s |\widetilde{f^k}(z_i) - \widetilde{g \circ p_0}(z_i)| \ge \frac{1}{s} \sum_{i=1}^s \mathbf{1}[\forall z \in U_i\ \widetilde{f^k}(z) \ne \widetilde{g \circ p_0}(z)] \ge \frac{1}{2}\left(1 - \frac{2 \cdot 2^{k-2}}{s}\right) \ge \frac{1}{4}.$$

• Since $f^k(z_i) \in \{0, 1\}$, then $\widetilde{f^k}(z_i) \ne \tilde g(x_i)$ implies $|f^k(z_i) - g(x_i)| \ge 1/2$, thus $\int_{[0,1]^d} |h - g|\,d\nu \ge \frac{1}{2}\int_{[0,1]^d} |\tilde h - \tilde g|\,d\nu \ge 1/8$.

Construct the continuous measure $\mu$ as follows, starting with the construction of a univariate measure $\mu_0$. Since $f^k$ is continuous, there exists $\delta \in (0, \min_{i \in [s-1]} |z_i - z_{i+1}|/2)$ so that $|f^k(z) - f^k(z_i)| \le 1/4$ for any $i \in [s]$ and $z$ with $|z - z_i| \le \delta$. As such, let $\mu_0$ denote the probability measure which places half of its mass uniformly on these $s$ balls of radius $\delta$ (which must be disjoint since $f^k$ alternates between 0 and 1 along $(z_i)_{i=1}^s$), and half of its mass uniformly on the remaining subset of $[0, 1]$. Finally, extend this to a probability measure $\mu$ on $[0,1]^d$ uniformly, meaning $\mu$ is the product of $\mu_0$ and the measure $\mu_1$ which is uniform over $[0,1]^{d-1}$. Now consider the two types of distances.

• By Lemma 3.1,
$$\begin{aligned} \int |\tilde h - \tilde g|\,d\mu &= \int\!\!\int |\tilde h(p_y(z)) - \tilde g(p_y(z))|\,d\mu_0(z)\,d\mu_1(y) = \int \sum_{U \in \mathcal{I}} \int \mathbf{1}[z \in U \wedge \widetilde{f^k}(z) \ne \tilde g(p_y(z))]\,d\mu_0(z)\,d\mu_1(y) \\ &\ge \int \frac{1}{2s} \sum_{U \in \mathcal{I}} \mathbf{1}[\forall z \in U\ \widetilde{f^k}(z) \ne \widetilde{g \circ p_y}(z)]\,d\mu_1(y) \ge \frac{1}{4}\left(1 - \frac{2 \cdot 2^{k-2}}{s}\right) \ge \frac{1}{8}. \end{aligned}$$

• For any $y \in \mathbb{R}^{d-1}$ and $U_i \in \mathcal{I}$ (with corresponding $z_i \in U_i$), if $\widetilde{f^k}(z) \ne \widetilde{g \circ p_y}(z)$ for every $z \in U_i$, then
$$\int_{U_i} |f^k(z) - g(p_y(z))|\,d\mu_0(z) \ge \int_{|z - z_i| \le \delta} |f^k(z) - 1/2|\,d\mu_0(z) \ge \frac{1}{4}\mu_0(\{z \in U_i : |z - z_i| \le \delta\}) \ge \frac{1}{8s}.$$
By Lemma 3.1,
$$\begin{aligned} \int |h - g|\,d\mu &= \int\!\!\int |h(p_y(z)) - g(p_y(z))|\,d\mu_0(z)\,d\mu_1(y) \\ &\ge \int \sum_{U \in \mathcal{I}} \mathbf{1}[\forall z \in U\ \widetilde{f^k}(z) \ne \tilde g(p_y(z))] \int_U |f^k(z) - g(p_y(z))|\,d\mu_0(z)\,d\mu_1(y) \\ &\ge \int \frac{1}{8s} \sum_{U \in \mathcal{I}} \mathbf{1}[\forall z \in U\ \widetilde{f^k}(z) \ne \widetilde{g \circ p_y}(z)]\,d\mu_1(y) \ge \frac{1}{16}\left(1 - \frac{2 \cdot 2^{k-2}}{s}\right) \ge \frac{1}{32}. \end{aligned}$$

As a closing curiosity, Theorem 3.12 implies the following statement regarding polynomials.

Corollary A.1 For any integer $k \ge 1$, there exists a polynomial $h : \mathbb{R}^d \to \mathbb{R}$ with degree $2^k$ and a corresponding continuous measure $\mu$ which is positive everywhere over $[0,1]^d$, so that every polynomial $g : \mathbb{R}^d \to \mathbb{R}$ of degree $\le 2^{k-3}$ satisfies $\int |h - g|\,d\mu \ge 1/32$.

Proof Set $f(z) = 4z(1-z)$, which by Lemma 3.10 is $(1, [0,1])$-triangle; thus $f^k$ is $(2^k - 1, [0,1])$-triangle with $\mathrm{Cr}(f^k) = 2^k + 1$ by Corollary 3.9, and $f^k$ has degree $2^k$ directly; thus set $h(x) = f^k(x_1)$. Next, for any polynomial $g : \mathbb{R}^d \to \mathbb{R}$ of degree $\le 2^{k-3}$, $g \circ p_y : \mathbb{R} \to \mathbb{R}$ is still a polynomial of degree $\le 2^{k-3}$ for every $y \in \mathbb{R}^{d-1}$ (where $p_y(z) = (z, y)$ as in Theorem 3.12), and so Lemma 3.3 grants $\mathrm{Cr}(g \circ p_y) \le 1 + 2^{k-3} \le 2^{k-2}$. The result follows by Theorem 3.12.

A.3. Deferred proofs from Section 4

First, the proof of a certain VC lower bound which mimics the Gilbert-Varshamov bound; the proof is little more than a consequence of Hoeffding's inequality.

Proof (of Lemma 4.1) For convenience, set $m := \mathrm{Sh}(\mathcal{F}; n)$, let $(a_1, \ldots, a_m)$ denote these dichotomies (meaning $a_j \in \{0, 1\}^n$), and with foresight set $\epsilon := \sqrt{\ln(m/\delta)/(2n)}$. Let $(Y_i)_{i=1}^n$ denote fair Bernoulli random labelings for each point, and note by symmetry of the fair coin that for any fixed dichotomy $a_j$,
$$\Pr\left[\frac{1}{n}\sum_{i=1}^n |(a_j)_i - Y_i| < \frac{1}{2} - \epsilon\right] = \Pr\left[\frac{1}{n}\sum_{i=1}^n Y_i < \frac{1}{2} - \epsilon\right].$$
Consequently, by a union bound over all dichotomies and lastly by Hoeffding's inequality,
$$\Pr\left[\exists f \in \mathcal{F}\ \frac{1}{n}\sum_{i=1}^n |\tilde f(x_i) - Y_i| < \frac{1}{2} - \epsilon\right] \le \sum_{j=1}^m \Pr\left[\frac{1}{n}\sum_{i=1}^n |(a_j)_i - Y_i| < \frac{1}{2} - \epsilon\right] = m\Pr\left[\frac{1}{n}\sum_{i=1}^n Y_i < \frac{1}{2} - \epsilon\right] \le m\exp(-2n\epsilon^2) \le \delta,$$
where the last step used the choice of $\epsilon$.

The remaining deferred proofs do not exactly follow the order of Section 4, but instead the order of dependencies in the proofs. In particular, to control the VC dimension, it is first useful to prove Lemma 4.3, which is used to control the growth of the number of regions as semi-algebraic gates are combined.

Proof (of Lemma 4.3) Fix some ordering $(q_1, q_2, \ldots, q_{|Q|})$ of the elements of $Q$, and for each $i \in [|Q|]$ define two functions $l_i(a) := \mathbf{1}[q_i(a) < 0]$ and $u_i(a) := \mathbf{1}[q_i(a) \ge 0]$, as well as two sets $L_i := \{a \in \mathbb{R}^p : l_i(a) = 1\}$ and $U_i := \{a \in \mathbb{R}^p : u_i(a) = 1\}$. Note that
$$\mathcal{S} = \Bigl\{ (\cap_{i \in A} L_i) \cap (\cap_{i \in B} U_i) : A \subseteq [|Q|],\ B \subseteq [|Q|] \Bigr\} \setminus \{\emptyset\}.$$
Additionally consider the set of sign patterns
$$V := \Bigl\{ \bigl(l_1(a), u_1(a), \ldots, l_{|Q|}(a), u_{|Q|}(a)\bigr) : a \in \mathbb{R}^p \Bigr\}.$$
Distinct elements of $\mathcal{S}$ correspond to distinct sign patterns in $V$: namely, for any $C \in \mathcal{S}$, using the ordering of $Q$ to encode $A$ and $B$ as binary vectors of length $|Q|$, the corresponding interleaved binary vector of length $2|Q|$ is distinct for distinct choices of $(A, B)$. (For each $i$ that appears in neither $A$ nor $B$, there are two possible encodings in $V$: having both coordinates corresponding to $i$ set to 1, and having them set to 0. On the other hand, a more succinct encoding based just on $(l_i)_{i=1}^{|Q|}$ fails to capture those sets arising from intersections of proper subsets of $Q$.) As such, making use of growth function bounds for sets of polynomials (Anthony and Bartlett, 1999, Theorem 8.3),
$$|\mathcal{S}| \le |V| \le 2\left(\frac{4e\alpha|Q|}{p}\right)^p.$$
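The interleaved sign-pattern encoding from this proof can be observed on a toy instance (a sketch, not from the paper): sample many parameter vectors $a \in \mathbb{R}^p$, record $(l_1(a), u_1(a), \ldots, l_{|Q|}(a), u_{|Q|}(a))$, and compare the number of distinct patterns against the bound $2(4e|Q|\alpha/p)^p$. The three polynomials below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# three arbitrary degree-2 polynomials on R^2, so |Q| = 3, alpha = 2, p = 2
Q = [
    lambda a: a[0] ** 2 + a[1] ** 2 - 1.0,   # circle
    lambda a: a[0] * a[1] - 0.25,            # hyperbola
    lambda a: a[0] ** 2 - a[1],              # parabola
]
alpha, p = 2, 2

def sign_pattern(a):
    # interleaved (l_i, u_i) encoding from the proof of Lemma 4.3
    bits = []
    for q in Q:
        v = q(a)
        bits += [int(v < 0), int(v >= 0)]
    return tuple(bits)

samples = rng.uniform(-3, 3, size=(50_000, 2))
patterns = {sign_pattern(a) for a in samples}

bound = 2 * (4 * np.e * len(Q) * alpha / p) ** p
print(len(patterns), bound)
```

Sampled points realize at most one of $l_i, u_i$ per polynomial, so at most $2^{|Q|}$ patterns appear here; the lemma's bound is far larger since it must also cover intersections over proper subsets of $Q$.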
Thanks to Lemma 4.3, the proof of the VC dimension bound Lemma 4.2 follows by induction over layers, effectively keeping track of a piecewise (regionwise?) polynomial function as with the proof of Lemma 3.2 (but now in the multivariate case).

Proof (of Lemma 4.2) First note that this proof follows the scheme of a VC dimension proof for networks with piecewise polynomial activation functions (Anthony and Bartlett, 1999, Theorem 8.8), but with Lemma 4.3 allowing for the more complicated semi-algebraic gates, and some additional bookkeeping for the (semi-algebraic) shapes of the regions of the partition $\mathcal{S}$.

Let examples $(x_j)_{j=1}^n$ be given with $n \ge p$, let $m_i$ denote the number of nodes in layer $i$ (whereby $m_1 + \cdots + m_l = m$), and let $f := F_G : \mathbb{R}^p \times \mathbb{R}^d \to \mathbb{R}$ denote the function evaluating the neural network (as in Section 2.1), where the two arguments are the parameters $w \in \mathbb{R}^p$ and the input example $x \in \mathbb{R}^d$. The goal is to upper bound the number of dichotomies
$$K := \mathrm{Sh}(\mathcal{N}(G); n) = |\{(\mathrm{sgn}(f(w, x_1)), \ldots, \mathrm{sgn}(f(w, x_n))) : w \in \mathbb{R}^p\}|.$$
The proof will proceed by producing a sequence of partitions $(\mathcal{S}_i)_{i=0}^l$ of $\mathbb{R}^p$ and two corresponding sequences of sets of polynomials $(P_i)_{i=0}^l$ and $(Q_i)_{i=0}^l$ so that for each $i$: $P_i$ has polynomials of degree at most $\beta^i$; $Q_i$ has polynomials of degree at most $\alpha\beta^{i-1}$; over any parameters $S \in \mathcal{S}_i$, there is an assignment of elements of $P_i$ to nodes of layer $i$ so that for each example $x_j$, every node in layer $i$ evaluates the corresponding fixed polynomial in $P_i$; lastly, the elements of $\mathcal{S}_i$ are intersections of sets of the form $\{w \in \mathbb{R}^p : q(w) \diamond 0\}$ where $q \in Q_i$ and $\diamond \in \{<, \ge\}$, and the partition $\mathcal{S}_{i+1}$ refines $\mathcal{S}_i$ for each $i$ (meaning for each $U \in \mathcal{S}_{i+1}$ there exists $S \supseteq U$ with $S \in \mathcal{S}_i$).
Setting the final partition $\mathcal{S} := \mathcal{S}_l$, this in turn will give an upper bound on $K$, since the final output within each element of $\mathcal{S}$ is a fixed polynomial of degree at most $\beta^l$, whereby the VC dimension of polynomials (Anthony and Bartlett, 1999, Theorem 8.3) grants
$$K \le \sum_{S \in \mathcal{S}} |\{(\mathrm{sgn}(f(w, x_1)), \ldots, \mathrm{sgn}(f(w, x_n))) : w \in S\}| \le 2|\mathcal{S}|\left(\frac{2en\beta^l}{p}\right)^p. \tag{A.1}$$

To start, consider layer 0 of the input coordinates themselves, a collection of $d$ affine maps. Consequently, it suffices to set $\mathcal{S}_0 := \{\mathbb{R}^p\}$, $Q_0 := \emptyset$, and $P_0$ to be the $nd$ possible coordinate maps corresponding to all $d$ coordinates of all $n$ examples.

For the inductive step, consider some layer $i + 1$. Restricted to any $S \in \mathcal{S}_i$, the nodes of the previous layer $i$ compute fixed polynomials of degree $\beta^i$. Each node in layer $i + 1$ is $(t, \alpha, \beta)$-sa, meaning there are $t$ predicates, defined by polynomials of degree $\le \alpha$, which define regions wherein this node is a fixed polynomial. Let $Q_S$ denote this set of predicates, where $|Q_S| \le tnm_{i+1}$ by considering the $n$ possible input examples and the $t$ possible predicates encountered in each of the $m_{i+1}$ nodes in layer $i + 1$, and set $Q_{i+1} := Q_i \cup (\cup_{S \in \mathcal{S}_i} Q_S)$. By the definition of semi-algebraic gate, each node in layer $i + 1$ computes a fixed polynomial when restricted to a region defined by an intersection of predicates which moreover are defined by $Q_{i+1}$. As such, defining $\mathcal{S}_{i+1}$ as the refinement of $\mathcal{S}_i$ which partitions each $S \in \mathcal{S}_i$ according to the intersections of predicates encountered in each node, Lemma 4.3 applied to each $Q_S$ grants
$$|\mathcal{S}_{i+1}| \le \sum_{S \in \mathcal{S}_i} |\{\text{all nonempty intersections of } Q_S\}| \le 2|\mathcal{S}_i|\left(\frac{4enm_{i+1}t\alpha\beta^i}{p}\right)^p, \tag{A.2}$$
which completes the inductive construction.

The upper bound on $K$ may now be estimated. First, $|\mathcal{S}|$ may be upper bounded by applying eq. (A.2) recursively:
$$|\mathcal{S}| \le |\mathcal{S}_0| \prod_{i=1}^l \left(\frac{8enm_i t\alpha\beta^{i-1}}{p}\right)^p \le \left(8enmt\alpha\beta^{l-1}\right)^{pl}.$$
Continuing from eq. (A.1),
$$K \le 2|\mathcal{S}|\left(\frac{2en\beta^l}{p}\right)^p \le \left(8enmt\alpha\beta^l\right)^{p(l+1)}.$$
To compute $\mathrm{VC}(\mathcal{N}(G))$, it suffices to find $N$ such that $\mathrm{Sh}(\mathcal{N}(G); N) < 2^N$, which in turn is implied by
$$p(l+1)\ln(N) + p(l+1)\ln(8emt\alpha\beta^l) < N\ln(2).$$
Since $\ln(N) = \ln(N/(2p(l+1))) + \ln(2p(l+1)) \le N/(2p(l+1)) - 1 + \ln(2p(l+1))$ and $\ln(2) - 1/2 > 1/6$, it suffices to show
$$6p(l+1)\bigl(\ln(2p(l+1)) + \ln(8emt\alpha\beta^l)\bigr) \le N.$$
As such, the left hand side of this expression is an upper bound on $\mathrm{VC}(\mathcal{N}(G))$.

The proofs of Lemma 1.3 and Theorem 1.2 from Section 1 are now direct from Lemma 4.2 and Lemma 4.1.

Proof (of Lemma 1.3) This statement is the same as Lemma 4.2 with some details removed.

Proof (of Theorem 1.2) By the bound on $\mathrm{Sh}(\mathcal{N}(G); n)$ from Lemma 4.2,
$$\begin{aligned} n = \frac{n}{2} + \frac{n}{2} &\ge 2\ln(1/\delta) + 4pl^2\ln(8emt\alpha\beta p(l+1)) + \frac{n}{2} \\ &\ge 2\ln(1/\delta) + 2p(l+1)\ln(8emt\alpha\beta^l) + 2p(l+1)\ln(p(l+1)) + \frac{n}{2p(l+1)} \\ &\ge 2\ln(1/\delta) + 2p(l+1)\ln(8emt\alpha\beta^l) + 2p(l+1)\ln(n) \\ &\ge 2\ln(1/\delta) + 2\ln(\mathrm{Sh}(\mathcal{N}(G); n)). \end{aligned}$$
The result follows by plugging this into Lemma 4.1.

A.4. Deferred proofs from Section 5

Proof (of Proposition 5.1) Immediate from Lemma 3.11.
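Proposition 5.1 is easy to check numerically. The sketch below uses the symmetric signal $g(z) = 4z(1-z)$, an arbitrary choice satisfying $g(z) = g(1-z)$ (the proposition allows any such $g$), and verifies the repetition identity $h(x + i2^{-k}) = g(2^k x)$ for $k = 3$:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def f(z):
    # the ReLU triangle map from Proposition 5.1: f(z) = relu(2 relu(z) - 4 relu(z - 1/2))
    return relu(2 * relu(z) - 4 * relu(z - 0.5))

def fk(z, k):
    # k-fold composition f^k
    for _ in range(k):
        z = f(z)
    return z

def g(z):
    # an arbitrary symmetric signal on [0,1] with g(z) = g(1 - z)
    return 4 * z * (1 - z)

k = 3
h = lambda x: g(fk(x, k))

x = np.linspace(0.0, 2.0 ** -k, 101)       # x in [0, 2^{-k}]
for i in range(2 ** k):                    # i in {0, ..., 2^k - 1}
    assert np.allclose(h(x + i * 2.0 ** -k), g(2 ** k * x)), i
print("repetition identity verified")
```

That is, $h$ traverses one scaled copy of $g$ on each interval $[i2^{-k}, (i+1)2^{-k}]$, alternating between the rising and falling halves of $f^k$, with the falling halves absorbed by the symmetry of $g$.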