Canonical dual solutions to nonconvex radial basis neural network optimization problem

Vittorio Latorre* and David Yang Gao+

* University "Sapienza" of Rome, Rome, Italy
+ University of Ballarat and Australian National University, Australia

Abstract

Radial Basis Function Neural Networks (RBFNNs) are tools widely used in regression problems. One of their principal drawbacks is that the formulation corresponding to training with supervision of both the centers and the weights is a highly nonconvex optimization problem, which leads to some fundamental difficulties for traditional optimization theory and methods. This paper presents a generalized canonical duality theory for solving this challenging problem. We demonstrate that by sequential canonical dual transformations, the nonconvex optimization problem of the RBFNN can be reformulated as a canonical dual problem (without duality gap). Both the global optimal solution and the local extrema can be classified. Several applications to one of the most used radial basis functions, the Gaussian function, are illustrated. Our results show that even for the one-dimensional case, the global minimizer of the nonconvex problem may not be the best solution to the RBFNN, and that canonical duality theory is a promising tool for solving general neural network training problems.

1. Introduction

Radial Basis Function Neural Networks (RBFNNs) were introduced in the field of function interpolation [1] and were then adapted to the problem of regression [2]. During the last two decades RBFNNs have been applied in several fields.
The problem of regression consists in trying to approximate a function $f : \mathbb{R}^n \to \mathbb{R}$ by means of an approximating function $g(\cdot)$ that uses a set of samples defined as

$$T = \{ (x_p, y_p),\ x_p \in \mathbb{R}^n,\ y_p \in \mathbb{R},\ p = 1, \dots, P \}, \tag{1}$$

where $(x_p, y_p)$ are respectively the arguments and the values of the given function $f(x)$. In general the approximating function $g(\cdot)$ obtained by an RBFNN with radial basis function $\phi(\cdot)$ has the following form:

$$g(x) = \sum_{i=1}^{N} w_i \phi(\| x - c_i \|), \tag{2}$$

where $N$ is the number of units used to approximate the function, that is the neurons of the network, $w$ is the vector of the weights associated with the connections between the units, with components $w_i$ for $i = 1, \dots, N$, and $c_i \in \mathbb{R}^n$ for $i = 1, \dots, N$ are the centers of the RBFNN.

Generally speaking, there are two main optimization strategies to train an RBFNN. The first consists in optimizing only the weights of the neural network. In this case the centers are generally chosen by using clustering strategies [3]. This problem is convex in the variable $w$ and has the form

$$E(w) = \frac{1}{2} \sum_{p=1}^{P} \sum_{i=1}^{N} \left( w_i \phi(c_i) - y_p \right)^2 + \frac{1}{2} \beta_w \| w \|^2, \tag{3}$$

where $\beta_w$ is the regularization parameter for the weights. The second strategy is to consider both the weights $w$ and the centers $c$ of the radial basis functions as variables. This strategy can be performed by solving the following unconstrained optimization problem:

$$E(w, c) = \frac{1}{2} \sum_{p=1}^{P} \sum_{i=1}^{N} \left( w_i \phi(c_i) - y_p \right)^2 + \frac{1}{2} \beta_w \| w \|^2 + \frac{1}{2} \beta \sum_{i=1}^{N} \sum_{j=1}^{n} c_{ji}^2. \tag{4}$$

This problem is nonconvex, but empirical experiments [4] show that it generally yields neural networks with a higher precision than those trained with strategy (3). One of the most used strategies to solve this optimization problem is to apply decomposition algorithms [5].
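The network output (2) and the regularized error (4) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the residual of each sample is taken to be $g(x_p) - y_p$ (the usual least-squares reading of (3)-(4)), the radial basis function is the Gaussian of Section 4 applied to the squared distance, and the function names `gaussian_rbf`, `rbf_output` and `regularized_error` are hypothetical, not taken from the paper.

```python
import math


def gaussian_rbf(r2, alpha=1.0):
    """Gaussian radial basis function evaluated at a squared distance r2."""
    return math.exp(-r2 / (2.0 * alpha ** 2))


def rbf_output(x, weights, centers, alpha=1.0):
    """Network output g(x) = sum_i w_i * phi(||x - c_i||^2), cf. (2)."""
    total = 0.0
    for w_i, c_i in zip(weights, centers):
        r2 = sum((xj - cj) ** 2 for xj, cj in zip(x, c_i))
        total += w_i * gaussian_rbf(r2, alpha)
    return total


def regularized_error(samples, weights, centers, beta_w, beta, alpha=1.0):
    """Least-squares error plus the two regularization terms of (4).

    Assumption: the residual of sample p is g(x_p) - y_p.
    """
    data = 0.5 * sum((rbf_output(x, weights, centers, alpha) - y) ** 2
                     for x, y in samples)
    reg_w = 0.5 * beta_w * sum(w * w for w in weights)
    reg_c = 0.5 * beta * sum(cj * cj for c in centers for cj in c)
    return data + reg_w + reg_c


# One sample and one neuron centered on the sample: the data term vanishes,
# so only the weight penalty 0.5 * beta_w * w^2 remains.
samples = [((0.0,), 1.0)]
print(regularized_error(samples, [1.0], [(0.0,)], beta_w=0.2, beta=0.1))
```

Fixing `centers` and minimizing over `weights` corresponds to the first (convex) strategy; minimizing over both corresponds to the nonconvex problem (4).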
However, due to the nonconvexity of problem (4), there are fundamental difficulties in finding the global minimum of the problem and in characterizing its local minima. Indeed, problem (4) is considered to be NP-hard even if the radial basis function $\phi(c)$ is a quadratic function and $n = 1$ [6, 7]. Another issue that characterizes this problem is the choice of the regularization parameters $\beta_w$ and $\beta$. In general a cross-validation strategy is applied in order to find these regularization parameters. Cross-validation consists in trying different values of the parameters in order to find the ones that yield the neural network with the best prediction. Until now it has not been possible to find a closed form for the optimal values of these parameters in the general case. If it were possible to find at least an upper bound for these parameters, the time needed to perform a cross-validation would greatly decrease.

Canonical duality theory, developed from nonconvex analysis and global optimization [8, 9], is a potentially powerful methodology, which has been used successfully for solving a large class of challenging problems in biology, engineering and the sciences [10, 14, 15], and recently in network communications [11, 13]. In this paper we study canonical duality theory for solving the general radial basis neural network optimization problem (4), and mainly analyze the one-dimensional case in order to find properties and intuitions that can be useful for the multidimensional cases.

Preprint submitted to Neurocomputing, October 31, 2018

The rest of this paper is arranged as follows. In Section 2, we first demonstrate how to rewrite the nonconvex primal problem as a dual problem by using the sequential canonical dual transformation developed in [8, 12].
In Section 3 we prove the complementarity-dual principle, showing that the obtained formulation is canonically dual to the original problem in the sense that there is no duality gap. In Section 4, we analyze the problem with the Gaussian function as radial basis in the neurons and show some examples. The last section presents some conclusions.

2. Primal problem for general Radial Basis Functions (RBF)

The general one-dimensional nonconvex function addressed in this paper can be written in the following form:

$$P(c) = W(c) + \frac{1}{2} \beta c^2 - f c, \tag{5}$$

where $\beta$ is the regularization coefficient and $f$ is a positive scalar close to zero. The term $-fc$ is not part of the original radial basis neural network formulation, but we consider it for the general mathematical case. The nonconvex function $W(c)$ depends on the choice of the radial basis function $\phi(\cdot)$:

$$W(c) = \frac{1}{2} \left( w \phi(\| x - c \|^2) - y \right)^2, \tag{6}$$

where $x$, $y$ and $w$ belong to $\mathbb{R}$. In applications the parameter $w$ is also a variable, but the original problem (4) is convex in $w$ while it is nonconvex with respect to the center $c$ of the radial basis function. Therefore, the one-dimensional nonconvex primal problem can be formulated as

$$(\mathcal{P}): \quad \min \left\{ P(c) = \frac{1}{2} \left( w \phi(\| x - c \|^2) - y \right)^2 + \frac{1}{2} \beta c^2 - f c \;\middle|\; c \in \mathbb{R} \right\}. \tag{7}$$

In order to apply canonical duality theory to solve this problem, we need to choose the following geometrically nonlinear operator:

$$\xi = \Lambda(c) = w \phi(\| x - c \|^2) : \mathbb{R} \to \mathcal{E}_a. \tag{8}$$

Clearly, this is a nonlinear map from $\mathbb{R}$ to a subspace $\mathcal{E}_a \subset \mathbb{R}$, which depends on the choice of the radial basis function $\phi(\cdot)$. The canonical function associated with this geometrical operator is

$$V(\xi(c)) = \frac{1}{2} (\xi(c) - y)^2 = W(c). \tag{9}$$

By the definition introduced in canonical duality theory [9], $V : \mathcal{E}_a \to \mathbb{R}$ is said to be a canonical function on $\mathcal{E}_a$ if for any given $\xi \in \mathcal{E}_a$ the duality relation

$$\sigma = V'(\xi) = \xi - y : \mathcal{E}_a \to \mathcal{S}_a \tag{10}$$

is invertible, where $\mathcal{S}_a$ is the range of the duality mapping $\sigma = \partial V(\xi)$, which depends on the choice of the radial basis function $\phi(\cdot)$. The couple $(\xi, \sigma)$ forms a canonical duality pair on $\mathcal{E}_a \times \mathcal{S}_a$ with the Legendre conjugate $V^*(\sigma)$ defined by

$$V^*(\sigma) = \{ \xi \sigma - V(\xi) \mid \sigma = V'(\xi) \} = \frac{1}{2} \sigma^2 + y \sigma. \tag{11}$$

By considering that $W(c) = \Lambda(c) \sigma - V^*(\sigma)$, the primal function $P(c)$ can be reformulated as the so-called total complementarity function defined by

$$\Xi(c, \sigma) = \Lambda(c) \sigma - V^*(\sigma) + \frac{1}{2} \beta c^2 - f c = w \phi(\| x - c \|^2) \sigma - \left( \frac{1}{2} \sigma^2 + \sigma y \right) + \frac{1}{2} \beta c^2 - f c. \tag{12}$$

The function $\phi(\cdot)$ can be a nonconvex function just like $W(c)$. For this reason we have to perform a sequential canonical dual transformation for the nonlinear operator $\Lambda(c)$. To this aim we choose a second nonlinear operator

$$\epsilon = \Lambda_2(c) = \| x - c \|^2, \tag{13}$$

which is a map from $\mathbb{R}$ to $\mathcal{E}_b = \{ \epsilon \in \mathbb{R} \mid \epsilon \geq 0 \}$. In terms of $\epsilon$, the first-level operator $\xi = \Lambda(c)$ can be written as

$$\xi = U(\epsilon) = w \phi(\epsilon). \tag{14}$$

We assume that $U(\epsilon)$ is a convex function on $\mathcal{E}_b$ such that the second-level duality relation

$$\tau = U'(\epsilon) = w \phi'(\epsilon) \tag{15}$$

is invertible, i.e.,

$$\epsilon = (\phi')^{-1}\left( \frac{\tau}{w} \right), \tag{16}$$

where $(\phi')^{-1}$ is the inverse of the function $\phi'(\epsilon)$. Thus, the Legendre conjugate of $U$ can be obtained uniquely by

$$U^*(\tau) = \tau \, (\phi')^{-1}\left( \frac{\tau}{w} \right) - w \phi\left( (\phi')^{-1}\left( \frac{\tau}{w} \right) \right). \tag{17}$$

We notice that $\xi = w \phi(\epsilon)$. By substituting the value of $\epsilon$ given by (16) we find a relation that connects the first-level primal variable $\xi$ with the second-level dual variable $\tau$:

$$\xi = w \phi\left( (\phi')^{-1}\left( \frac{\tau}{w} \right) \right). \tag{18}$$

By plugging this into (10) we obtain

$$\sigma = w \phi\left( (\phi')^{-1}\left( \frac{\tau}{w} \right) \right) - y. \tag{19}$$

Generally speaking, it is possible, for certain functions $\phi$, to use the canonical dual transformation to find the relation between the first-level dual variable $\sigma$ and the second-level dual variable $\tau$ by means of the derivatives of $\phi(\cdot)$ and the first primal variable $\xi$. In general this relation is

$$\tau = w \phi'\left( \phi^{-1}\left( \frac{\sigma + y}{w} \right) \right). \tag{20}$$

Therefore, replacing $U(\epsilon) = \Lambda(c)$ by its Legendre conjugate $U^*$, the total complementarity function becomes

$$\Xi(c, \sigma, \tau) = \left( \| x - c \|^2 \tau - U^*(\tau) \right) \sigma - V^*(\sigma) + \frac{1}{2} \beta c^2 - f c. \tag{21}$$

It is also possible to rewrite the total complementarity function (21) in the following form:

$$\Xi(c, \sigma, \tau) = \frac{1}{2} c^2 (2 \tau \sigma + \beta) - c (2 \tau \sigma x + f) - U^*(\tau) \sigma - V^*(\sigma) + x^2 \tau \sigma. \tag{22}$$

By the criticality condition $\partial \Xi(c, \sigma, \tau) / \partial c = 0$ we obtain

$$c(\tau, \sigma) = \frac{2 \tau x \sigma + f}{2 \tau \sigma + \beta}. \tag{23}$$

Clearly, if $2 \tau \sigma + \beta \neq 0$, the general solution of (23) is

$$c = \frac{2 \tau x \sigma + f}{2 \tau \sigma + \beta} \quad \forall (\sigma, \tau) \in \mathcal{S}_a = \{ \sigma, \tau \mid 2 \tau \sigma + \beta \neq 0 \} \tag{24}$$

and the canonical dual function of $P(c)$ can be presented as

$$P^d(\sigma, \tau) = -\frac{1}{2} \frac{(2 \tau x \sigma + f)^2}{2 \tau \sigma + \beta} - U^*(\tau) \sigma - V^*(\sigma) + x^2 \tau \sigma. \tag{25}$$

By considering the dual relation given in (20), and by setting $s(\sigma) = \frac{\sigma + y}{w}$, we can write the total complementarity function in terms of $c$ and $\sigma$ only:

$$\Xi(c, \sigma) = \frac{1}{2} c^2 G(\sigma) - c F(\sigma) - U^*(\sigma) \sigma - V^*(\sigma) + x^2 w \phi'\left( \phi^{-1}(s(\sigma)) \right) \sigma, \tag{26}$$

where

$$G(\sigma) = 2 w \phi'\left( \phi^{-1}(s(\sigma)) \right) \sigma + \beta,$$
$$F(\sigma) = 2 w \phi'\left( \phi^{-1}(s(\sigma)) \right) x \sigma + f,$$
$$U^*(\sigma) = w \phi'\left( \phi^{-1}(s(\sigma)) \right) \phi^{-1}(s(\sigma)) - (\sigma + y).$$

Therefore, in terms of $\sigma$ only, the canonical dual function can be written as

$$P^d(\sigma) = -\frac{1}{2} \frac{F(\sigma)^2}{G(\sigma)} - U^*(\sigma) \sigma - V^*(\sigma) + x^2 w \phi'\left( \phi^{-1}(s(\sigma)) \right) \sigma. \tag{27}$$

3. Complementary-Dual Principle

Theorem 3.1.
If $\bar\sigma$ is a critical point of $(\mathcal{P}^d)$ and the term

$$G'(\bar\sigma) = \bar\sigma \phi''\left( \phi^{-1}(s(\bar\sigma)) \right) \left( \phi^{-1}(s(\bar\sigma)) \right)' + w \phi'\left( \phi^{-1}(s(\bar\sigma)) \right) \neq 0, \tag{28}$$

then the point

$$\bar c = \frac{F(\bar\sigma)}{G(\bar\sigma)} \tag{29}$$

is a critical point of $P(c)$ and $P(\bar c) = P^d(\bar\sigma)$.

Proof 3.1. Suppose that $\bar\sigma$ is a critical point of $P^d$; then we have

$$P^d(\bar\sigma)' = \left[ \bar c^2 - 2 x \bar c + x^2 - \phi^{-1}(s(\bar\sigma)) \right] G'(\bar\sigma) - \bar\sigma \left[ \phi'\left( \phi^{-1}(s(\bar\sigma)) \right) \left( \phi^{-1}(s(\bar\sigma)) \right)' - 1 \right] = 0. \tag{30}$$

Notice that

$$\left( \phi^{-1}(s(\bar\sigma)) \right)' = \frac{1}{\phi'(\bar\epsilon)} = \frac{1}{\phi'\left( \phi^{-1}(s(\bar\sigma)) \right)}, \tag{31}$$

so the last term in (30) is zero. The term $G'(\bar\sigma)$ is nonzero by hypothesis, so we obtain

$$(x - \bar c)^2 - \phi^{-1}(s(\bar\sigma)) = 0, \tag{32}$$

that is

$$\bar\sigma = w \phi\left( \| x - \bar c \|^2 \right) - y. \tag{33}$$

The critical point condition for the primal problem, $P'(c) = 0$, is

$$-2 w (x - c) \phi'(\| x - c \|^2) \left( w \phi(\| x - c \|^2) - y \right) + \beta c - f = 0. \tag{34}$$

By considering that $\phi'(\| x - c \|^2) = \phi'\left( \phi^{-1}(s(\sigma)) \right)$ and $\sigma = w \phi\left( (x - c)^2 \right) - y$ we obtain

$$-2 w (x - c) \phi'\left( \phi^{-1}(s(\sigma)) \right) \sigma + \beta c - f = 0, \tag{35}$$

that is

$$c = \frac{2 w \phi'\left( \phi^{-1}(s(\sigma)) \right) x \sigma + f}{2 w \phi'\left( \phi^{-1}(s(\sigma)) \right) \sigma + \beta}. \tag{36}$$

By setting $\sigma = \bar\sigma$ in (36) we obtain (24), proving that $\bar c$ is a critical point of $P(c)$.
For the correspondence of the function values we start from the dual function

$$P^d(\bar\sigma) = -\frac{1}{2} \frac{F^2(\bar\sigma)}{G(\bar\sigma)} - U^*(\bar\sigma) \bar\sigma - V^*(\bar\sigma) + x^2 w \phi'\left( \phi^{-1}(s(\bar\sigma)) \right) \bar\sigma, \tag{37}$$

add and subtract the term $\frac{1}{2} \frac{F^2(\bar\sigma)}{G(\bar\sigma)}$ and substitute the value of $\bar c$:

$$\frac{1}{2} \bar c^2 G(\bar\sigma) - \bar c F(\bar\sigma) - U^*(\bar\sigma) \bar\sigma - V^*(\bar\sigma) + x^2 w \phi'\left( \phi^{-1}(s(\bar\sigma)) \right) \bar\sigma. \tag{38}$$

By reordering the terms we obtain

$$\left( \| x - \bar c \|^2 w \phi'\left( \phi^{-1}(s(\bar\sigma)) \right) - U^*(\bar\sigma) \right) \bar\sigma - V^*(\bar\sigma) + \frac{1}{2} \beta \bar c^2 - f \bar c. \tag{39}$$

Considering (10), and setting $\bar\epsilon = \| x - \bar c \|^2$ and $\phi'\left( \phi^{-1}(s(\bar\sigma)) \right) = \phi'(\bar\epsilon)$, we obtain

$$\begin{aligned} &\left( w \phi'(\bar\epsilon) \bar\epsilon - w \phi'(\bar\epsilon) \bar\epsilon + w \phi(\bar\epsilon) \right) \left( w \phi(\bar\epsilon) - y \right) - \frac{1}{2} (w \phi(\bar\epsilon) - y)^2 - y (w \phi(\bar\epsilon) - y) + \frac{1}{2} \beta \bar c^2 - f \bar c \\ &\quad = w^2 \phi(\bar\epsilon)^2 - y w \phi(\bar\epsilon) - \frac{1}{2} (w \phi(\bar\epsilon) - y)^2 - y w \phi(\bar\epsilon) + y^2 + \frac{1}{2} \beta \bar c^2 - f \bar c. \end{aligned} \tag{40}$$

By collecting the terms we obtain

$$(w \phi(\bar\epsilon) - y)^2 - \frac{1}{2} (w \phi(\bar\epsilon) - y)^2 + \frac{1}{2} \beta \bar c^2 - f \bar c, \tag{41}$$

that is

$$\frac{1}{2} \left( w \phi(\| x - \bar c \|^2) - y \right)^2 + \frac{1}{2} \beta \bar c^2 - f \bar c = P(\bar c), \tag{42}$$

which proves the theorem. □

Theorem 3.1 shows that the problem $(\mathcal{P}^d)$ is canonically dual to the primal $(\mathcal{P})$ in the sense that the duality gap is zero.

4. Gaussian function

One of the most used RBFs is the Gaussian function. In this section we analyze the problem with $\phi(\| x - c \|^2) = \exp\left\{ -\frac{\| x - c \|^2}{2 \alpha^2} \right\}$, where $\alpha$ is a parameter that represents the standard deviation of the Gaussian function. In the RBFNN formulation there is normally no linear term $fc$. The primal problem is

$$\min P(c) = \frac{1}{2} \left( w \exp\left\{ -\frac{\| x - c \|^2}{2 \alpha^2} \right\} - y \right)^2 + \frac{1}{2} \beta c^2. \tag{43}$$

If we define the quantity $d(c) = \frac{\| x - c \|^2}{2 \alpha^2}$, the nonlinear operator $\xi : \mathbb{R} \to \mathcal{E}_a$ from (8) becomes

$$\xi = w \exp\{ -d(c) \}. \tag{44}$$

The expressions that define $\sigma$, $V$ and $V^*$ are the same as in the general problem, that is:

• $V(\xi(c)) = \frac{1}{2} (\xi - y)^2$;
• $\sigma = \xi - y$;
• $V^*(\sigma) = \frac{1}{2} \sigma^2 + y \sigma$.

The second-level operator $\Lambda_2(c) : \mathbb{R} \to \mathcal{E}_b$ is

$$\epsilon = \Lambda_2(c) = \| x - c \|^2. \tag{45}$$

The second-level canonical function becomes

$$U(\epsilon) = w \exp\left\{ -\frac{\epsilon}{2 \alpha^2} \right\}, \tag{46}$$

and the second-level duality mapping $\tau$ is

$$\tau = w \phi'(\epsilon) = -\frac{w}{2 \alpha^2} \exp\left\{ -\frac{\epsilon}{2 \alpha^2} \right\}. \tag{47}$$

So the Legendre conjugate $U^* : \mathcal{S}_b \to \mathbb{R}$ is

$$U^*(\tau) = \tau \, (\phi')^{-1}\left( \frac{\tau}{w} \right) - w \phi\left( (\phi')^{-1}\left( \frac{\tau}{w} \right) \right) = -2 \alpha^2 \tau \left( \ln\left( -\frac{2 \alpha^2 \tau}{w} \right) - 1 \right). \tag{48}$$

The derivative of the exponential function is the exponential function itself. This simplifies the relation (18) between $\xi$ and $\tau$, making it linear, that is $\xi = -2 \alpha^2 \tau$. The relation between $\sigma$ and $\tau$ is

$$\tau = -\frac{\sigma + y}{2 \alpha^2}, \tag{49}$$

which is also linear. The total complementarity function becomes

$$\Xi(c, \sigma) = \frac{1}{2} c^2 G(\sigma) - c F(\sigma) - U^*(\sigma) \sigma - V^*(\sigma) - \frac{x^2 (\sigma^2 + y \sigma)}{2 \alpha^2}, \tag{50}$$

where

$$G(\sigma) = \beta - \frac{\sigma^2 + y \sigma}{\alpha^2}, \qquad F(\sigma) = -\frac{x \sigma^2 + x y \sigma}{\alpha^2},$$
$$U^*(\sigma) = (\sigma + y) \left( \ln(s(\sigma)) - 1 \right), \qquad s(\sigma) = \frac{\sigma + y}{w}.$$

The dual problem is

$$P^d(\sigma) = -\frac{1}{2} \frac{F(\sigma)^2}{G(\sigma)} - \ln(s(\sigma)) \left( \sigma^2 + y \sigma \right) + \frac{1}{2} \sigma^2 - \frac{x^2 (\sigma^2 + y \sigma)}{2 \alpha^2}. \tag{51}$$

The domains of the variables in the primal and dual problems are:

• $\mathcal{E}_b = \{ \epsilon \in \mathbb{R} \mid \epsilon \geq 0 \}$
• $\mathcal{S}_b = \{ \tau \in \mathbb{R} \mid -\infty < \tau < 0 \}$ if $w > 0$, $\mathcal{S}_b = \{ \tau \in \mathbb{R} \mid 0 < \tau < +\infty \}$ if $w < 0$
• $\mathcal{E}_a = \{ \xi \in \mathbb{R} \mid 0 \leq \xi \leq w \}$
• $\mathcal{S}_a = \{ \sigma \in \mathbb{R} \mid -y \leq \sigma \leq w - y \}$ if $w > 0$, $\mathcal{S}_a = \{ \sigma \in \mathbb{R} \mid w - y \leq \sigma \leq -y \}$ if $w < 0$

Remark 1. The parameters $\beta$, $x$, $y$ and $w$ play important roles in solving the nonconvex problem $(\mathcal{P})$. In the original problem (7) one searches for the value of $c$ that brings the term $w \exp\{ -d(c) \}$ as close as possible to $y$, that is $\sigma = w \exp\{ -d(c) \} - y = 0$. If $y < 0$ and $w > 0$, or $y > 0$ and $w < 0$, we will have $|\sigma| > 0$.
This means that in the case of the exponential function it would be better to choose $c$ as large as possible in order to make the exponential go to zero, but the result would never be satisfactory, as the error committed by the approximation would approach $-y$ as $c$ goes to infinity. The value $-y$ is not a good value for the error, as it is far from zero. On the other hand, if $y$ and $w$ have the same sign and $|y| > |w|$, the value of $c$ will be $x$, in order to have the exponential equal to 1 and the lowest value for $\sigma = w \exp\{ -d(c) \} - y$. In order to have a realistic problem, we will consider the case in which $y$ and $w$ have the same sign and $|y| < |w|$. The cases $y, w > 0$ and $y, w < 0$ are equivalent, so we will suppose without loss of generality that both $y$ and $w$ are positive.

Theorem 4.1. Suppose that $\bar\sigma \in \mathcal{S}_a$ is a critical point of the dual problem (51) with the corresponding $\bar c = \frac{F(\bar\sigma)}{G(\bar\sigma)} \in \mathbb{R}$, and that $\bar\sigma \neq -\frac{y}{2}$. Then $\bar c$ is a critical point of the primal problem and

$$P^d(\bar\sigma) = P(\bar c). \tag{52}$$

Moreover, there are the following relations between the critical points of the primal problem and the dual problem:

1. If $(2 \bar\sigma + y) > 0$ and $G(\bar\sigma) \geq 0$, or $(2 \bar\sigma + y) < 0$ and $G(\bar\sigma) \leq 0$, then: if $\bar\sigma$ is a local minimum of the dual problem, the corresponding $\bar c$ is a local maximum of the primal problem; if $\bar\sigma$ is a local maximum of the dual problem, the corresponding $\bar c$ is a local minimum of the primal problem.
2. If $(2 \bar\sigma + y) > 0$ and $G(\bar\sigma) \leq 0$, or $(2 \bar\sigma + y) < 0$ and $G(\bar\sigma) \geq 0$, then: if $\bar\sigma$ is a local minimum of the dual problem, the corresponding $\bar c$ is a local minimum of the primal problem; if $\bar\sigma$ is a local maximum of the dual problem, the corresponding $\bar c$ is a local maximum of the primal problem.

Let $x_o = \sqrt{ -2 \alpha^2 \ln\left( \frac{y}{2w} \right) }$.
If $\bar\sigma = -\frac{y}{2}$, then there is a critical point of the primal problem corresponding to $\bar\sigma$ if and only if the parameters $x$, $y$, $\beta$ and $w$ satisfy one of the two following conditions:

$$\beta x + \left( \beta + \frac{y^2}{4 \alpha^2} \right) x_o = 0, \qquad \beta x - \left( \beta + \frac{y^2}{4 \alpha^2} \right) x_o = 0, \tag{53}$$

and the corresponding critical point $\bar c$ of the primal problem is always a local minimum. If neither of conditions (53) is satisfied, $\bar\sigma = -\frac{y}{2}$ is still a critical point of the dual problem, but it does not have any corresponding critical point in the primal problem.

Proof 4.1. The first-order derivative of the dual problem is

$$P^d(\sigma)' = -\left[ \left( x - \frac{F(\sigma)}{G(\sigma)} \right)^2 \frac{1}{2 \alpha^2} + \ln(s(\sigma)) \right] (2 \sigma + y), \tag{54}$$

so the term (28) is equal to $2 \bar\sigma + y$. If $\bar\sigma \neq -\frac{y}{2}$, the critical point equivalence and condition (52) are consequences of Theorem 3.1. To prove statements 1 and 2 we use the second-order derivatives of $P(c)$ and $P^d(\sigma)$:

$$P(c)'' = \frac{(x - c)^2}{\alpha^4} w \exp\{ -d(c) \} \left( 2 w \exp\{ -d(c) \} - y \right) + \beta - \frac{1}{\alpha^2} w \exp\{ -d(c) \} \left( w \exp\{ -d(c) \} - y \right), \tag{55}$$

$$P^d(\sigma)'' = -\frac{1}{\alpha^2} \left( x - \frac{F(\sigma)}{G(\sigma)} \right)^2 \left( 1 + \frac{(2 \sigma + y)^2}{\alpha^2 G(\sigma)} \right) - \frac{2 \sigma + y}{\sigma + y} - 2 \ln(s(\sigma)). \tag{56}$$

Since $\bar\sigma$ is a critical point of the dual, we have $P^d(\bar\sigma)' = 0$. Therefore, when $\bar\sigma \neq -\frac{y}{2}$:

$$\left( x - \frac{F(\bar\sigma)}{G(\bar\sigma)} \right)^2 = -2 \alpha^2 \ln(s(\bar\sigma)). \tag{57}$$

By using condition (57) in (56) we obtain

$$P^d(\bar\sigma)'' = (2 \bar\sigma + y) \left( \frac{2 \ln(s(\bar\sigma)) (2 \bar\sigma + y)}{\alpha^2 G(\bar\sigma)} - \frac{1}{\bar\sigma + y} \right). \tag{58}$$

Noticing that $\sigma = w \exp\{ -d(c) \} - y$, it is possible to rewrite $P(\bar c)''$ in terms of $\bar\sigma$, i.e.:

$$P(c(\bar\sigma))'' = G(\bar\sigma) + \frac{1}{\alpha^4} (\bar\sigma + y)(2 \bar\sigma + y) \left( x - \frac{F(\bar\sigma)}{G(\bar\sigma)} \right)^2. \tag{59}$$

By using condition (57) again we obtain

$$P(c(\bar\sigma))'' = \frac{1}{\alpha^2} \left[ \alpha^2 G(\bar\sigma) - 2 (\bar\sigma + y)(2 \bar\sigma + y) \ln(s(\bar\sigma)) \right], \tag{60}$$

so it is possible to rewrite equation (58) in the following form:

$$P^d(\bar\sigma)'' = -\frac{2 \bar\sigma + y}{G(\bar\sigma)(\bar\sigma + y)} P(c(\bar\sigma))'', \tag{61}$$

and to find the relations reported in Table 1. From these relations we obtain:

• If $(2 \sigma + y) > 0$ and $G(\sigma) \geq 0$, or $(2 \sigma + y) < 0$ and $G(\sigma) \leq 0$, then the second-order derivative of the primal problem and the second-order derivative of the dual problem have opposite signs at their critical points;
• If $(2 \sigma + y) > 0$ and $G(\sigma) \leq 0$, or $(2 \sigma + y) < 0$ and $G(\sigma) \geq 0$, then the second-order derivative of the primal problem and the second-order derivative of the dual problem have the same sign at their critical points.

This proves statements 1 and 2.

Table 1: Relations between the second-order derivatives of the primal problem and the dual problem

    $(2\bar\sigma + y)$    $G(\bar\sigma)$    $P(c(\bar\sigma))''$    $P^d(\bar\sigma)''$
    > 0                    > 0                ±                       ∓
    > 0                    < 0                ±                       ±
    < 0                    < 0                ±                       ∓
    < 0                    > 0                ±                       ±

The point $\bar\sigma = -\frac{y}{2}$ is a critical point of $P^d$ because of the second factor of (54). The point $\bar c$ corresponding to $\bar\sigma = -\frac{y}{2}$ is a critical point of the primal problem if and only if $P'(\bar c) = 0$. We can use (10) to find the relation between $\bar\sigma$ and $\bar c$, that is:

$$\bar\sigma = \bar\xi - y \;\to\; \bar\sigma = w \exp\{ -d(\bar c) \} - y, \tag{62}$$

$$\bar c = x \pm \sqrt{ -2 \alpha^2 \ln(s(\bar\sigma)) }. \tag{63}$$

For $\bar\sigma = -\frac{y}{2}$ we obtain

$$\bar c = x \pm x_o. \tag{64}$$

Substituting these values into the first-order derivative of the primal problem,

$$P'(\bar c) = \frac{x - \bar c}{\alpha^2} w \exp\{ -d(\bar c) \} \left( w \exp\{ -d(\bar c) \} - y \right) + \beta \bar c, \tag{65}$$

and considering that $w \exp\{ -d(\bar c) \} = \bar\sigma + y = \frac{y}{2}$ and $w \exp\{ -d(\bar c) \} - y = \bar\sigma = -\frac{y}{2}$, we obtain that the primal problem has a critical point at $\bar c$ corresponding to the critical point $\bar\sigma = -\frac{y}{2}$ if and only if

$$\beta x \pm \left( \beta + \frac{y^2}{4 \alpha^2} \right) x_o = 0. \tag{66}$$

This happens only for a particular configuration of the parameters $w$, $\beta$, $x$ and $y$ that makes one of the roots of the first factor of the derivative (54),

$$-\left[ \left( x - \frac{F(\bar\sigma)}{G(\bar\sigma)} \right)^2 \frac{1}{2 \alpha^2} + \ln(s(\bar\sigma)) \right] = 0, \tag{67}$$

lie at $\bar\sigma = -\frac{y}{2}$. To prove that at $\bar\sigma = -\frac{y}{2}$ the critical point of the dual problem corresponds to a minimum point of the primal problem, we plug the value $\bar\sigma = -\frac{y}{2}$ into (59) and obtain

$$P(c(\bar\sigma))'' = \beta + \frac{y^2}{4 \alpha^2}, \tag{68}$$

which is always positive. □

Remark 2. From now on we will refer to the critical point $\sigma_f = -\frac{y}{2}$ as the pseudo dual critical point, as it is a critical point of the dual problem that generally does not have a corresponding critical point in the primal problem.

4.1. Choice of the critical point

In order to find the best solution among the critical points of problem (43) we introduce the following feasible spaces:

$$\mathcal{S}_a^+ = \{ \sigma \in \mathcal{S}_a \mid G(\sigma) > 0 \}, \tag{69}$$

$$\mathcal{S}_a^- = \{ \sigma \in \mathcal{S}_a \mid G(\sigma) < 0 \}. \tag{70}$$

The following theorem explains the relations between the critical points.

Theorem 4.2. Suppose that the points $\bar\sigma_1 \in \mathcal{S}_a^+$ and $\bar\sigma_2 \in \mathcal{S}_a^-$ are critical points of the dual problem, that $\bar\sigma_i \neq -\frac{y}{2}$ for $i = 1, 2$, and that $\bar c_1$ and $\bar c_2$ are the corresponding critical points of the primal problem. Then, if both $\bar c_1$ and $\bar c_2$ are local minima or both are local maxima of the primal problem, the following relation always holds:

$$P(\bar c_1) = P^d(\bar\sigma_1) < P(\bar c_2) = P^d(\bar\sigma_2). \tag{71}$$

Proof 4.2. This theorem is a consequence of the first theorem in triality theory [8]. □

Remark 3. The pseudo critical point $\sigma_f = -\frac{y}{2}$ is always in $\mathcal{S}_a^+$.

From the results in Theorem 4.2 it is always better to search for the dual critical point in $\mathcal{S}_a^+$ that corresponds to a minimum in the primal problem.
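The relations $\bar c = F(\bar\sigma)/G(\bar\sigma)$ and $P(\bar c) = P^d(\bar\sigma)$, together with the split of the critical points between $\mathcal{S}_a^+$ and $\mathcal{S}_a^-$, can be checked numerically. The sketch below is not the canonical duality method itself: it locates the critical points of the primal (43) by a plain grid scan followed by bisection on $P'(c)$, maps each one to $\sigma = w \exp\{-d(c)\} - y$, and compares the primal value with the dual function (51). The parameter values are those quoted for Case 1 in this section; the search interval, the grid resolution and all function names are assumptions made for the illustration.

```python
import math

# Illustrative parameter values (those quoted for Case 1 in this section).
X, Y, W, ALPHA, BETA = 1.0, 1.0, 2.0, math.sqrt(2.0) / 2.0, 0.1


def sigma_of(c):
    """First-level dual variable sigma = w * exp(-d(c)) - y."""
    return W * math.exp(-(X - c) ** 2 / (2.0 * ALPHA ** 2)) - Y


def primal(c):
    """Primal objective (43)."""
    return 0.5 * sigma_of(c) ** 2 + 0.5 * BETA * c ** 2


def primal_grad(c):
    """First derivative of (43) with respect to c."""
    s = sigma_of(c)
    return s * (s + Y) * (X - c) / ALPHA ** 2 + BETA * c


def G(s):
    return BETA - (s * s + Y * s) / ALPHA ** 2


def F(s):
    return -X * (s * s + Y * s) / ALPHA ** 2


def dual(s):
    """Canonical dual function (51)."""
    q = s * s + Y * s
    return (-0.5 * F(s) ** 2 / G(s) - math.log((s + Y) / W) * q
            + 0.5 * s * s - X ** 2 * q / (2.0 * ALPHA ** 2))


def primal_critical_points(lo=-10.0, hi=10.0, steps=4000):
    """Bracket sign changes of P'(c) on a grid, then refine by bisection."""
    roots = []
    h = (hi - lo) / steps
    for k in range(steps):
        a, b = lo + k * h, lo + (k + 1) * h
        fa, fb = primal_grad(a), primal_grad(b)
        if fa * fb < 0.0:
            for _ in range(100):
                m = 0.5 * (a + b)
                fm = primal_grad(m)
                if fa * fm <= 0.0:
                    b = m
                else:
                    a, fa = m, fm
            roots.append(0.5 * (a + b))
    return roots


for c_bar in primal_critical_points():
    s_bar = sigma_of(c_bar)
    region = "S_a^+" if G(s_bar) > 0 else "S_a^-"
    print(f"c = {c_bar:.6f}  sigma = {s_bar:.6f}  {region}  "
          f"P = {primal(c_bar):.6f}  P_d = {dual(s_bar):.6f}")
```

For each critical point found, the last two printed columns should agree up to rounding, and the point with the lowest primal value should be the one whose $\sigma$ lies in $\mathcal{S}_a^+$, in line with Theorem 4.2.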
In order to characterize the solutions in $\mathcal{S}_a^+$ and the domains in which to search for the best solution, two theorems are proposed in the following.

Theorem 4.3. Let $\sigma_f = -\frac{y}{2}$ be the pseudo critical point of the dual problem, let $x_o = \sqrt{ -2 \alpha^2 \ln\left( \frac{y}{2w} \right) }$, and let $x$ be positive. Then:

• if $x \in (0, x_o)$ then $\sigma_f$ is always a local minimum of $P^d(\sigma)$;
• if $x > x_o$ then:
  1. if $\beta > 0$ and $\beta < \frac{y^2 x_o}{4 \alpha^2 (x - x_o)}$, $\sigma_f$ is a local minimum for the dual problem;
  2. if $\beta > 0$ and $\beta > \frac{y^2 x_o}{4 \alpha^2 (x - x_o)}$, $\sigma_f$ is a local maximum for the dual problem;
  3. if $\beta > 0$ and $\beta = \frac{y^2 x_o}{4 \alpha^2 (x - x_o)}$, $\sigma_f$ is an inflection point at which the first-order derivative is zero and that corresponds to a local minimum of the primal problem.

Proof 4.3. In order to determine whether $\sigma_f = -\frac{y}{2}$ is a minimum or a maximum for the dual, we plug its value into the second-order derivative of $P^d(\sigma)$, that is equation (56), and analyze its sign. After the substitution we obtain

$$P^d(\sigma_f)'' = -\left[ 2 \ln\left( \frac{y}{2w} \right) + \frac{1}{\alpha^2} \left( \frac{x \beta}{\beta + \frac{y^2}{4 \alpha^2}} \right)^2 \right]. \tag{72}$$

The first-order derivative of (72) with respect to $\beta$ is $-\frac{x^2 y^2 \beta}{2 \alpha^4 \left( \beta + \frac{y^2}{4 \alpha^2} \right)^3}$, which is negative for $\beta > 0$; that is, the function is monotonically decreasing in $\beta$. The value of (72) at $\beta = 0$ is $-2 \ln\left( \frac{y}{2w} \right)$, which is positive. If we let $\beta$ go to $+\infty$ we obtain

$$\lim_{\beta \to +\infty} -\left[ 2 \ln\left( \frac{y}{2w} \right) + \frac{1}{\alpha^2} \left( \frac{x \beta}{\beta + \frac{y^2}{4 \alpha^2}} \right)^2 \right] = -2 \ln\left( \frac{y}{2w} \right) - \frac{x^2}{\alpha^2}, \tag{73}$$

that is, the second-order derivative of $P^d(\sigma)$ at $\sigma_f$ is non-negative for any value of $\beta > 0$ if

$$x \in [-x_o, x_o]. \tag{74}$$

If $x$ does not satisfy this condition, from (72) we have that the second-order derivative of the dual problem is positive at $\sigma_f$ if $\beta$ satisfies

$$\beta > -\frac{y^2 x_o}{4 \alpha^2 (x + x_o)} \quad \text{and} \quad \beta < \frac{y^2 x_o}{4 \alpha^2 (x - x_o)}. \tag{75}$$

On the other hand, if

$$\beta < -\frac{y^2 x_o}{4 \alpha^2 (x + x_o)} \quad \text{or} \quad \beta > \frac{y^2 x_o}{4 \alpha^2 (x - x_o)}, \tag{76}$$

there will be a local maximum at $\sigma_f$. As $x$ is assumed positive, the term $-\frac{y^2 x_o}{4 \alpha^2 (x + x_o)}$ is always negative, so $\beta$ will always be greater than it. If the condition $\beta = \frac{y^2 x_o}{4 \alpha^2 (x - x_o)}$ is satisfied, the critical point $\sigma_f$ is an inflection point that also satisfies the first-order condition, and it has a corresponding minimum point in the primal problem by Theorem 4.1. □

Remark 4. In the case of negative $x$, the conditions change in the following way:

• if $x \in (-x_o, 0)$ then $\sigma_f$ is always a local minimum of $P^d(\sigma)$;
• if $x < -x_o$ then:
  1. if $\beta > 0$ and $\beta < -\frac{y^2 x_o}{4 \alpha^2 (x + x_o)}$, $\sigma_f$ is a local minimum for the dual problem;
  2. if $\beta > 0$ and $\beta > -\frac{y^2 x_o}{4 \alpha^2 (x + x_o)}$, $\sigma_f$ is a local maximum for the dual problem;
  3. if $\beta > 0$ and $\beta = -\frac{y^2 x_o}{4 \alpha^2 (x + x_o)}$, $\sigma_f$ is an inflection point at which the first-order derivative is zero and that corresponds to a local minimum of the primal problem.

The proof of these statements is similar to that of Theorem 4.3 and is omitted.

Remark 5. Theorem 4.3 shows the effects of the parameter $\beta$ on the pseudo critical point $\sigma_f$. Similar effects can also be obtained with respect to $y$, $x$, $\alpha$ and $w$. We choose $\beta$ because it is a hyper-parameter that can be set by the practitioner before performing the optimization.

For the next theorem, we introduce the two following subsets of $\mathcal{S}_a^+$:

$$\mathcal{S}_\sharp^+ = \left\{ \sigma \in \mathcal{S}_a^+ \mid \sigma > -\frac{y}{2} \right\}, \tag{77}$$

$$\mathcal{S}_\flat^+ = \left\{ \sigma \in \mathcal{S}_a^+ \mid \sigma < -\frac{y}{2} \right\}. \tag{78}$$

Theorem 4.4. Let $\sigma_f = -\frac{y}{2}$ be the pseudo critical point of the dual problem and let the primal problem have a maximum of five critical points. Then:

• if $\sigma_f$ is a local minimum for the dual function, there will be a local maximum in $\mathcal{S}_\sharp^+$ that corresponds to a minimum of the primal problem;
• if $\sigma_f$ is a local maximum then:
  1. there are no critical points in $\mathcal{S}_\sharp^+$;
  2. there is at least one critical point in $\mathcal{S}_\flat^+$.

Proof 4.4. In the dual problem there must be a singularity at $G(\sigma) = 0$ where the function goes to $-\infty$, so if $\sigma_f$ is a local minimum, there must be a local maximum in $\mathcal{S}_\sharp^+$. If $\sigma_f$ is a local maximum, we prove condition 1 by contradiction: suppose that there is at least one critical point in $\mathcal{S}_\sharp^+$. As $P^d(\sigma)$ goes to $-\infty$ when $G(\sigma) \to 0$, there will be not one but two critical points in $\mathcal{S}_\sharp^+$, a local minimum $\sigma_1$ and a local maximum $\sigma_2$, with the relation $P^d(\sigma_1) < P^d(\sigma_2)$. By Theorems 4.1 and 4.2, $\sigma_1$ corresponds to the second highest local maximum $c_1$ of the primal function, and $\sigma_2$ corresponds to the lowest or second lowest local minimum $c_2$ of the primal function, that is, the relation $P(c_2) < P(c_1)$ is satisfied. By Theorem 3.1 we have

$$P^d(\sigma_1) < P^d(\sigma_2) = P(c_2) < P(c_1) = P^d(\sigma_1), \tag{79}$$

which is a contradiction. To prove condition 2, it is sufficient to notice that if there are no critical points in $\mathcal{S}_\sharp^+$, by triality theory there must be at least one critical point corresponding to the global minimum in $\mathcal{S}_a^+$, and this point will be in $\mathcal{S}_\flat^+$. □

Figure 1: Dual algebraic curves with $y = 1$, $w = 2$, $\alpha = \frac{\sqrt{2}}{2}$ and $\beta = 0.1$ with respect to the internal input $x$.

Depending on the parameters, the primal problem (43) can have at most five critical points. There are several cases.

Case 1: Three critical points for $P(c)$ and four critical points for $P^d(\sigma)$, two critical points in $\mathcal{S}_a^+$ and two critical points in $\mathcal{S}_a^-$, with $\sigma_f$ as a local minimum. The values of the parameters are $y = 1$, $x = 1$, $w = 2$, $\alpha = \frac{\sqrt{2}}{2}$, $\beta = 0.1$ (see Figure 2).
This case can be easily solved with the general canonical duality framework [8], as the local maximum in $\mathcal{S}_a^+$ corresponds to the global minimum of the problem, and the local minimum and maximum in $\mathcal{S}_a^-$ correspond to the local minimum and maximum in the primal problem.

Figure 2: Primal (in blue) and dual (in red) functions for Case 1 with three critical points.

Figure 3: Primal (in blue) and dual (in red) functions for Case 2 with five critical points in the primal and six critical points in the dual.

Case 2: Five critical points for $P(c)$, six critical points for $P^d(\sigma)$. The values of the parameters are $y = 1$, $x = 4$, $w = 2$, $\alpha = \frac{\sqrt{2}}{2}$ and $\beta = 0.1$ (see Figure 3). Notice that the only parameter that changed with respect to Case 1 is $x$. With these parameters the problem becomes multi-welled. The two critical points with the lowest value of the objective function belong to the same double well, and their corresponding dual critical points are in $\mathcal{S}_a^+$. The critical point $\sigma = -0.999999$ of $P^d(\sigma)$ corresponds to the second best minimizer $c = 0.00002$ of the primal problem, and this $\sigma$ is situated near the boundary of $\mathcal{S}_\flat^+$, which is visible in Figure 4. It is also possible, for certain values of the parameters, that the local minimum on the boundary of $\mathcal{S}_a$ corresponds to the global minimum of the problem (see Figure 5). In this case the choice of the value for $\sigma$ should be the critical point near the boundary. This critical point corresponds to a critical point in the primal with the value of $c$ near zero. It is generated by the term $\frac{1}{2} \beta c^2$, that is the regularization term used to make the objective function coercive and more regular. On the other hand, this term has nothing to do with the original aim of the problem.
This point near zero in the primal function will always have its corresponding dual critical point near the boundary, because as $c$ gets close to zero, $\sigma = w \exp\{ -d(c) \} - y$ gets close to $-y$. We also consider that $\sigma = w \exp\{ -d(c) \} - y$ is the error that we originally want to minimize in problem (6), and that the critical point on the boundary will always have a $\sigma$ with an absolute value bigger than that of the other critical point, which is closer to $\sigma = 0$. In other words, the local minimum on the boundary has nothing to do with the original problem, has a high value of the error, and should not be considered a good solution. In order to find the optimal solution for the original problem, the local minimum of the primal problem corresponding to the critical point closer to zero in $\mathcal{S}_a^+$ is preferable. By reducing the value of $\beta$ it is possible not only to make the critical point near $c = 0$ into a local minimum, but also to ensure that $\sigma_f$ is a local minimum. In this way there is a critical point in $\mathcal{S}_\sharp^+$ and the domain of the solution is well defined. Basically, if the critical point near the boundary of $\mathcal{S}_a^+$ is the global minimum, too large a value of $\beta$ has been chosen.

Figure 4: Critical point on the boundary of the dual function feasible set for Case 2.

Figure 5: $\mathcal{S}_a^+$ of the dual problem in the case of $\beta = 0.12$. The minimum near the boundary, $\sigma_1$, is a global minimum.

Case 3: Three critical points for $P(c)$ and four critical points for $P^d(\sigma)$, all belonging to $\mathcal{S}_a^+$. The values of the parameters are $y = 1$, $x = 4$, $w = 2$, $\alpha = \frac{\sqrt{2}}{2}$ and $\beta = 0.22$ (see Figure 6). This case is similar to the previous one, and the solution of the dual problem should be the critical point that corresponds to a minimum in the primal problem with the value of $\sigma$ closer to zero.

Figure 6: Primal (in blue) and dual (in red) functions for Case 3 with three critical points in the primal and four critical points in $\mathcal{S}_a^+$.

Figure 7: Primal (in blue) and dual (in red) functions for Case 4 with three critical points in the primal, two critical points in $\mathcal{S}_a^+$, two critical points in $\mathcal{S}_a^-$, and $\sigma_f$ as a local maximum.

Case 4: Three critical points in the primal and four critical points in the dual, but with two critical points in $\mathcal{S}_a^+$, two critical points in $\mathcal{S}_a^-$ and $\sigma_f$ as a local maximum. The values of the parameters are $y = 1$, $x = 8$, $w = 2$, $\alpha = \frac{\sqrt{2}}{2}$ and $\beta = 0.25$ (see Figure 7). If the value of the hyper-parameter $\beta$ is reduced, it is possible to make $\sigma_f$ into a local minimum and return to one of the previous cases.

Case 5: One critical point in the primal problem and two critical points in the dual problem. This case occurs when the quadratic term with $\beta$ dominates the error function $W(c)$. If this case occurs, it means that the value of $\beta$ is too big and the problem is no longer related to the original one, so one should choose a smaller value of $\beta$.

Based on the study of these cases, we can obtain the general idea for finding the best solution: the hyper-parameter $\beta$ should be set to a value that satisfies condition (75) in order to have $\sigma_f$ as a local minimum, and then the critical point should be searched for in the domain $\mathcal{S}_\sharp^+$.

5. Conclusions

In this paper we have presented an application of canonical duality theory to function approximation using Radial Basis Functions. By using the sequential canonical dual transformation, the nonconvex problem with a general RBF $\phi(\cdot)$ is reformulated in a canonical dual form. An associated strong duality theorem is also proposed. Applications to one of the most used RBFs, the exponential function, are illustrated.
Due to the particular properties of the exponential function, we are able to find a linear relation between the dual variables, which leads to an explicit form of the canonical dual problem. We also found conditions on the hyperparameter $\beta$ that yield a reliable domain in which to search for the best solution. This research reveals an important phenomenon in complex systems, i.e. the global optimal solution may not be the best solution to the problem considered.

There are still several open topics on the application of the canonical duality theory to Radial Basis Error functions. For example, there are other kinds of RBF that can be analyzed, like the multiquadric and the inverse multiquadric functions. A further development for future research is to extend the one-dimensional case to the multidimensional case, while also considering $w$ as a variable and not as a parameter. When this case is analyzed, we will be able to realize RBF neural networks based on canonical duality theory.

References

[1] M. J. D. Powell, "Radial basis functions in 1990," Adv. Numer. Anal., 2, 105-210 (1992).
[2] S. Haykin, "Neural Networks, a Comprehensive Foundation," Prentice-Hall (1999).
[3] L. Bruzzone, D. Prieto, "Supervised training techniques for radial basis function neural networks," Electronic Letters, 34(11), 1115-1116 (1998).
[4] D. Wettschereck, T. Dietterich, "Improving the Performances of Radial Basis Functions Networks by Learning Center Locations," Advances in Neural Information Processing Systems (1992).
[5] C. Buzzi, L. Grippo, M. Sciandrone, "Convergent decomposition techniques for training RBF neural networks," Neural Computation, 13, 1891-1920 (2001).
[6] J. J. Moré, Z. J. Wu, "Global continuation for distance geometry problems," SIAM Journal on Optimization, 7(3), 814-836 (1997).
[7] J. Saxe, "Embeddability of weighted graphs in k-space is strongly NP-hard," in Proc. 17th Allerton Conference in Communications, Control, and Computing, Monticello, IL, 480-489 (1979).
[8] D.Y. Gao, "Duality Principles in Nonconvex Systems: Theory, Methods, and Applications," Nonconvex Optimization and Its Applications, Kluwer Academic Publishers (2000).
[9] D.Y. Gao, "Canonical dual transformation method and generalized triality theory in nonsmooth global optimization," J. Glob. Optim., 17(1/4), 127-160 (2000).
[10] D.Y. Gao, "Canonical duality theory: theory, method, and applications in global optimization," Comput. Chem., 33, 1964-1972 (2009).
[11] D.Y. Gao, N. Ruan and P.M. Pardalos, "Canonical dual solutions to sum of fourth-order polynomials minimization problems with applications to sensor network localization," in Sensors: Theory, Algorithms and Applications, P.M. Pardalos, Y.Y. Ye, V. Boginski, and C. Commander (eds), Springer (2010).
[12] T.K. Gao, "Complete solutions to a class of 8th order polynomial optimization problems," to appear in IMA J. Applied Mathematics, published online at http://arxiv.org/abs/1205.6886, arXiv:1205.6886 (2012).
[13] N. Ruan and D.Y. Gao, "Global optimal solutions to a general sensor network localization problem," to appear in Performance Evaluation, published online at http://arxiv.org/submit/654731 (2013).
[14] Z.B. Wang, S.C. Fang, D.Y. Gao, W.X. Xing, "Canonical dual approach to solving the maximum cut problem," J. Glob. Optim., 54, 341-352 (2012).
[15] J. Zhang, D.Y. Gao, J. Yearwood, "A novel canonical dual computational approach for prion AGAAAAGA amyloid fibril molecular modeling," Journal of Theoretical Biology, 284, 149-157 (2011).
