D-optimal Bayesian Interrogation for Parameter and Noise Identification of Recurrent Neural Networks
Authors: Barnabás Póczos, András Lőrincz
Department of Information Systems, Eötvös Loránd University, Pázmány P. sétány 1/C, Budapest H-1117, Hungary. WWW home page: http://nipg.info. E-mail: poczos@cs.ualberta.ca, lorincz@inf.elte.hu. (Present address of B. Póczos: Department of Computing Science, University of Alberta, Athabasca Hall, Edmonton, Canada, T6G 2E8.)

Abstract. We introduce a novel online Bayesian method for the identification of a family of noisy recurrent neural networks (RNNs). We develop a Bayesian active learning technique in order to optimize the interrogating stimuli given past experiences. In particular, we consider the unknown parameters as stochastic variables and use the D-optimality principle, also known as the 'infomax method', to choose optimal stimuli. We apply a greedy technique to maximize the information gain concerning network parameters at each time step. We also derive the D-optimal estimation of the additive noise that perturbs the dynamical system of the RNN. Our analytical results are approximation-free. The analytic derivation gives rise to attractive quadratic update rules.

1 Introduction

When studying online systems, it is of high relevance to facilitate fast information gain concerning the system [1,2]. As an example, consider research on real neurons. In one of the experimental paradigms, researchers look for the stimulus that maximizes the response of the neuron [3,4]. Another approach searches for the stimulus distribution that maximizes the mutual information between stimulus and response [5]. A recent technique assumes that the unknown system belongs to the family of generalized linear models [6] and treats the parameters as probabilistic variables. The goal is then to find the optimal stimuli by maximizing the mutual information between the parameter set and the response of the system.

We are interested in the active learning [7,8,9,10] of noisy recurrent artificial neural networks (RNNs), when we have the freedom to interrogate the network and to measure its responses. Our framework is similar to the generalized linear model (GLM) approach used by [6]: we would like to choose interrogating, or 'control', inputs in order to (i) identify the parameters of the network and (ii) estimate the additive noise efficiently. From now on, we use the terms control and interrogation interchangeably; control is the conventional expression, whereas the word interrogation expresses our aims better.

We apply online Bayesian learning [11,12,13,14] to accomplish our task. For Bayesian methods, prior updates often lead to intractable posterior distributions, such as a mixture of exponentially many distributions. Here we show that in our model the computations are both tractable and approximation-free. Further, the emerging learning rules are simple. We also show that different stimuli are needed for the same RNN model depending on whether the goal is to estimate the weights of the RNN or the additive noise that perturbs the RNN. Hereafter we will refer to this noise as the 'driving noise' of the RNN. Our approach, which optimizes the control online in order to gain maximum information concerning the parameters, falls into the realm of Optimal Experimental Design, or Optimal Bayesian Design [15,1,16,17,18].
Several optimality principles have been worked out, and their efficiencies have been studied extensively in the literature; for a review see, e.g., [19]. Our approach corresponds to so-called D-optimality [20,21], which is equivalent to the information maximization (infomax) principle applied by [6]. We use both terms, D-optimality and infomax, to designate our approach. To the best of our knowledge, D-optimality has not previously been applied to the typical non-spiking stochastic artificial recurrent neural network model that we treat here.

The contribution of this paper can be summarized as follows. We use the D-optimality (infomax) principle and derive cost functions and algorithms for (i) the parameter learning of the stochastic RNN and (ii) the estimation of its driving noise. We show that (iii), using the D-optimality interrogation technique, these two tasks are not compatible with each other: greedy control signals derived from the D-optimality principle for parameter estimation are suboptimal (basically the worst possible) for the estimation of the driving noise, and vice versa. We show that (iv) D-optimal cost functions lead to simple greedy optimization rules both for the parameter estimation and for the noise estimation. Investigation of non-greedy multiple-step optimizations, which may achieve more efficient estimation of the network parameters and the noise, seems difficult and is beyond the scope of the present paper. However, (v) for the task of estimating the driving noise we introduce a non-greedy multiple-step look-ahead heuristic.

The paper is structured as follows. In Section 2 we introduce our model. Section 3 concerns the Bayesian equations of the RNN model. Section 4 derives the optimal control for parameter identification starting from the D-optimality (infomax) principle. Section 5 deals with our second task, in which the goal is the estimation of the driving noise of the RNN. The paper ends with a short discussion and some conclusions (Section 6).

2 The Model

We introduce our model here. Let $P(\mathbf{e}) = \mathcal{N}_{\mathbf{e}}(\mathbf{m}, \mathbf{V})$ denote the probability density of a normally distributed stochastic variable $\mathbf{e}$ with mean $\mathbf{m}$ and covariance matrix $\mathbf{V}$. Let us assume that we have $d$ simple computational units, called 'neurons', in a recurrent neural network:

$$\mathbf{r}_{t+1} = g\left(\sum_{i=0}^{I} \mathbf{F}_i \mathbf{r}_{t-i} + \sum_{j=0}^{J} \mathbf{B}_j \mathbf{u}_{t+1-j} + \mathbf{e}_{t+1}\right), \qquad (1)$$

where $\{\mathbf{e}_t\}$, the driving noise of the RNN, denotes temporally independent and identically distributed (i.i.d.) stochastic variables with $P(\mathbf{e}_t) = \mathcal{N}_{\mathbf{e}_t}(\mathbf{0}, \mathbf{V})$; $\mathbf{r}_t \in \mathbb{R}^d$ represents the observed activities of the neurons at time $t$, and $\mathbf{u}_t \in \mathbb{R}^c$ denotes the control signal at time $t$. The neural network is formed by the weighted delays represented by the matrices $\mathbf{F}_i$ ($i = 0, \ldots, I$) and $\mathbf{B}_j$ ($j = 0, \ldots, J$), which connect neurons to each other and the control components to the neurons, respectively. Control can also be seen as the means of interrogation, or as the stimulus to the network [6]. We assume that the function $g: \mathbb{R}^d \to \mathbb{R}^d$ in (1) is known and invertible. The computational units, the neurons, sum up weighted previous neural activities as well as weighted control inputs. These sums are then passed through identical non-linearities according to Eq. (1). Our goal is to estimate the parameters $\mathbf{F}_i \in \mathbb{R}^{d \times d}$ ($i = 0, \ldots, I$) and $\mathbf{B}_j \in \mathbb{R}^{d \times c}$ ($j = 0, \ldots, J$), the covariance matrix $\mathbf{V}$, and the driving noise $\mathbf{e}_t$, by means of the control signals. In artificial neural network terms, (1) has the form of rate code models. In our rate code model, noise, control, and the recurrent activities influence the firing rates similarly. We show that analytic cost functions emerge for this model that are free of approximations.
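Before turning to the Bayesian machinery, a minimal simulation sketch of model (1) may help fix notation. Everything concrete below (the dimensions, the choice $g = \tanh$, the weight scaling, and the random interrogation) is an illustrative assumption, not a choice made in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): d neurons, c control components,
# delay orders I and J.
d, c, I, J = 3, 2, 1, 1
F = [0.3 * rng.standard_normal((d, d)) for _ in range(I + 1)]  # F_0, ..., F_I
B = [rng.standard_normal((d, c)) for _ in range(J + 1)]        # B_0, ..., B_J
V = 0.01 * np.eye(d)     # covariance of the driving noise e_t

g = np.tanh              # a known, invertible nonlinearity (assumed choice)

def step(r_hist, u_hist):
    """One step of Eq. (1):
    r_{t+1} = g(sum_i F_i r_{t-i} + sum_j B_j u_{t+1-j} + e_{t+1})."""
    e = rng.multivariate_normal(np.zeros(d), V)
    z = sum(F[i] @ r_hist[-1 - i] for i in range(I + 1)) \
      + sum(B[j] @ u_hist[-1 - j] for j in range(J + 1)) + e
    return g(z)

# Roll the network forward under random (uninformed) interrogation.
r_hist = [np.zeros(d) for _ in range(I + 1)]
u_hist = [np.zeros(c) for _ in range(J + 1)]
for t in range(100):
    u_hist.append(rng.uniform(-1, 1, size=c))   # u_{t+1}
    r_hist.append(step(r_hist, u_hist))
```

The identification task of the following sections is to recover $\mathbf{F}$, $\mathbf{B}$, $\mathbf{V}$, and the realized $\mathbf{e}_t$ from the observed $(\mathbf{r}_t, \mathbf{u}_t)$ pairs, with the interrogation chosen actively rather than at random.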
3 Bayesian Approach

Here we embed the estimation task into the Bayesian framework. First, we introduce the following notations: $\mathbf{x}_{t+1} = [\mathbf{r}_{t-I}; \ldots; \mathbf{r}_t; \mathbf{u}_{t-J+1}; \ldots; \mathbf{u}_{t+1}]$, $\mathbf{y}_{t+1} = g^{-1}(\mathbf{r}_{t+1})$, and $\mathbf{A} = [\mathbf{F}_I, \ldots, \mathbf{F}_0, \mathbf{B}_J, \ldots, \mathbf{B}_0] \in \mathbb{R}^{d \times m}$. With these notations, model (1) reduces to a linear equation:

$$\mathbf{y}_t = \mathbf{A}\mathbf{x}_t + \mathbf{e}_t. \qquad (2)$$

To fulfill our goal, the online estimation of the unknown quantities (the parameter matrix $\mathbf{A}$, the noise $\mathbf{e}_t$, and its covariance matrix $\mathbf{V}$), we rely on Bayes' method. We assume that prior knowledge is available and we update our posterior knowledge on the basis of the observations. The control will be chosen at each instant to provide maximal expected information concerning the quantities we have to estimate.

Starting from an arbitrary prior distribution of the parameters, the posterior distribution needs to be computed. This can be highly complex, however, so approximations are common in the literature. For example, assumed density filtering, in which the computed posterior is projected to simpler distributions, has been suggested [22,23,11]. We shall use the method of conjugate priors [24] instead. For the matrix $\mathbf{A}$ we assume a matrix-valued normal distribution as prior. For the covariance matrix $\mathbf{V}$, an inverted Wishart (IW) distribution will be our prior. One can show for these choices that the functional form of the posterior distributions is not affected.

We define the normally distributed matrix-valued stochastic variable $\mathbf{A} \in \mathbb{R}^{d \times m}$ by means of the following quantities: $\mathbf{M} \in \mathbb{R}^{d \times m}$ is the expected value of $\mathbf{A}$; $\mathbf{V} \in \mathbb{R}^{d \times d}$ is the covariance matrix of the rows; and $\mathbf{K} \in \mathbb{R}^{m \times m}$ is the so-called precision parameter matrix, which we shall modify in accordance with the Bayesian update. Both $\mathbf{V}$ and $\mathbf{K}$ are positive semi-definite matrices. The density function of the stochastic variable $\mathbf{A}$ is defined as

$$\mathcal{N}_{\mathbf{A}}(\mathbf{M}, \mathbf{V}, \mathbf{K}) = \frac{|\mathbf{K}|^{d/2}}{|2\pi\mathbf{V}|^{m/2}} \exp\left(-\frac{1}{2}\mathrm{tr}\left((\mathbf{A}-\mathbf{M})^T \mathbf{V}^{-1} (\mathbf{A}-\mathbf{M})\mathbf{K}\right)\right),$$

where $\mathrm{tr}$, $|\cdot|$, and superscript $T$ denote the trace operation, the determinant, and transposition, respectively; see, e.g., [25,26]. We assume that $\mathbf{Q} \in \mathbb{R}^{d \times d}$ is a positive definite matrix and $n > 0$. Using these notations, the density of the inverted Wishart distribution with parameters $\mathbf{Q}$ and $n$ is as follows [25]:

$$\mathcal{IW}_{\mathbf{V}}(\mathbf{Q}, n) = \frac{1}{Z_{n,d}} \frac{1}{|\mathbf{V}|^{(d+1)/2}} \left|\frac{\mathbf{V}^{-1}\mathbf{Q}}{2}\right|^{n/2} \exp\left(-\frac{1}{2}\mathrm{tr}(\mathbf{V}^{-1}\mathbf{Q})\right),$$

where $Z_{n,d} = \pi^{d(d-1)/4} \prod_{i=1}^{d} \Gamma\left((n+1-i)/2\right)$ and $\Gamma(\cdot)$ denotes the gamma function. Now, one can rewrite model (2) as follows:

$$P(\mathbf{A}|\mathbf{V}) = \mathcal{N}_{\mathbf{A}}(\mathbf{M}, \mathbf{V}, \mathbf{K}), \qquad (3)$$
$$P(\mathbf{V}) = \mathcal{IW}_{\mathbf{V}}(\mathbf{Q}, n), \qquad (4)$$
$$P(\mathbf{e}_t|\mathbf{V}) = \mathcal{N}_{\mathbf{e}_t}(\mathbf{0}, \mathbf{V}), \qquad (5)$$
$$P(\mathbf{y}_t|\mathbf{A}, \mathbf{x}_t, \mathbf{V}) = \mathcal{N}_{\mathbf{y}_t}(\mathbf{A}\mathbf{x}_t, \mathbf{V}). \qquad (6)$$

4 The Infomax Approach for Parameter Learning

Let us compute the parameter estimation strategy for task (1) (i.e., for task (3)-(6)) as prescribed by the infomax principle. Let us introduce two shorthands: $\theta = \{\mathbf{A}, \mathbf{V}\}$ and $\{\mathbf{x}\}_i^j = \{\mathbf{x}_i, \ldots, \mathbf{x}_j\}$. We choose the control value in (1) at each instant such that it provides the most expected information concerning the unknown parameters.
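For concreteness, the conjugate structure (3)-(6) can be sampled directly. The following is a minimal sketch under illustrative dimensions and hyperparameters; note that SciPy's inverse-Wishart degrees-of-freedom convention differs slightly from the $(\mathbf{Q}, n)$ parametrization above, so this should be read as an illustration of the prior's shape rather than a faithful transcription:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(1)

# Illustrative sizes: A has d rows and m = d(I+1) + c(J+1) columns.
d, m = 3, 8
M0 = np.zeros((d, m))    # prior mean of A
K0 = np.eye(m)           # precision parameter matrix K
Q0 = np.eye(d)           # inverse-Wishart scale Q
n0 = d + 2               # inverse-Wishart degrees of freedom n (assumed value)

def sample_prior():
    """Draw (A, V) with P(V) = IW(Q0, n0) and P(A | V) = N_A(M0, V, K0),
    i.e. vec(A) ~ N(vec(M0), V kron K0^{-1})."""
    V = invwishart.rvs(df=n0, scale=Q0, random_state=rng)
    Lv = np.linalg.cholesky(V)                    # row-covariance factor
    Lk = np.linalg.cholesky(np.linalg.inv(K0))    # column-covariance factor
    A = M0 + Lv @ rng.standard_normal((d, m)) @ Lk.T
    return A, V

A, V = sample_prior()
x = rng.standard_normal(m)   # x_t stacks the delayed r's and u's
y = A @ x + rng.multivariate_normal(np.zeros(d), V)   # Eq. (2): y_t = A x_t + e_t
print(y)
```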
Assuming that $\{\mathbf{x}\}_1^t$ and $\{\mathbf{y}\}_1^t$ are given, according to the infomax principle our goal is to compute

$$\arg\max_{\mathbf{u}_{t+1}} I\left(\theta, \mathbf{y}_{t+1}; \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t\right), \qquad (7)$$

where $I(a, b; c)$ denotes the mutual information of the stochastic variables $a$ and $b$ for fixed parameters $c$. Let $H(a|b; c)$ denote the conditional entropy of variable $a$ conditioned on variable $b$ for fixed parameter $c$. Note that

$$I\left(\theta, \mathbf{y}_{t+1}; \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t\right) = H\left(\theta; \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t\right) - H\left(\theta|\mathbf{y}_{t+1}; \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t\right)$$

holds [27], and $H(\theta; \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t) = H(\theta; \{\mathbf{x}\}_1^t, \{\mathbf{y}\}_1^t)$ is independent of $\mathbf{u}_{t+1}$; hence our task reduces to the evaluation of the following quantity:

$$\arg\min_{\mathbf{u}_{t+1}} H\left(\theta|\mathbf{y}_{t+1}; \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t\right) = \arg\min_{\mathbf{u}_{t+1}} \left(-\int d\mathbf{y}_{t+1}\, P\left(\mathbf{y}_{t+1}|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t\right) \int d\theta\, P\left(\theta|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^{t+1}\right) \log P\left(\theta|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^{t+1}\right)\right). \qquad (8)$$

In order to solve this minimization problem we need to evaluate $P(\mathbf{y}_{t+1}|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t)$, the posterior $P(\theta|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^{t+1})$, and the entropy of the posterior, that is, $\int d\theta\, P(\theta|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^{t+1}) \log P(\theta|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^{t+1})$, where $P(a|b)$ denotes the conditional probability of variable $a$ given condition $b$. The main steps of these computations are provided below.

Assume that the a priori distributions $P(\mathbf{A}|\mathbf{V}, \{\mathbf{x}\}_1^t, \{\mathbf{y}\}_1^t) = \mathcal{N}_{\mathbf{A}}(\mathbf{M}_t, \mathbf{V}, \mathbf{K}_t)$ and $P(\mathbf{V}|\{\mathbf{x}\}_1^t, \{\mathbf{y}\}_1^t) = \mathcal{IW}_{\mathbf{V}}(\mathbf{Q}_t, n_t)$ are known. Then the posterior distribution of $\theta$ is

$$P\left(\mathbf{A}, \mathbf{V}|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^{t+1}\right) = \frac{P(\mathbf{y}_{t+1}|\mathbf{A}, \mathbf{V}, \mathbf{x}_{t+1})\, P(\mathbf{A}|\mathbf{V}, \{\mathbf{x}\}_1^t, \{\mathbf{y}\}_1^t)\, P(\mathbf{V}|\{\mathbf{x}\}_1^t, \{\mathbf{y}\}_1^t)}{P(\mathbf{y}_{t+1}|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t)} = \frac{\mathcal{N}_{\mathbf{y}_{t+1}}(\mathbf{A}\mathbf{x}_{t+1}, \mathbf{V})\, \mathcal{N}_{\mathbf{A}}(\mathbf{M}_t, \mathbf{V}, \mathbf{K}_t)\, \mathcal{IW}_{\mathbf{V}}(\mathbf{Q}_t, n_t)}{\int_{\mathbf{A}}\int_{\mathbf{V}} \mathcal{N}_{\mathbf{y}_{t+1}}(\mathbf{A}\mathbf{x}_{t+1}, \mathbf{V})\, \mathcal{N}_{\mathbf{A}}(\mathbf{M}_t, \mathbf{V}, \mathbf{K}_t)\, \mathcal{IW}_{\mathbf{V}}(\mathbf{Q}_t, n_t)}.$$

This expression can be rewritten in a more useful form. Let $\mathbf{K} \in \mathbb{R}^{m \times m}$ and $\mathbf{Q} \in \mathbb{R}^{d \times d}$ be positive definite matrices, let $\mathbf{A} \in \mathbb{R}^{d \times m}$, and let us introduce the density function of the matrix-valued Student-t distribution [28,26] as follows:

$$\mathcal{T}_{\mathbf{A}}(\mathbf{Q}, n, \mathbf{M}, \mathbf{K}) = \frac{|\mathbf{K}|^{d/2}}{\pi^{dm/2}} \frac{Z_{n+m,d}}{Z_{n,d}} \frac{|\mathbf{Q}|^{n/2}}{\left|\mathbf{Q} + (\mathbf{A}-\mathbf{M})\mathbf{K}(\mathbf{A}-\mathbf{M})^T\right|^{(m+n)/2}}.$$

Now we need the following lemma.

Lemma 4.1.

$$\mathcal{N}_{\mathbf{y}}(\mathbf{A}\mathbf{x}, \mathbf{V})\, \mathcal{N}_{\mathbf{A}}(\mathbf{M}, \mathbf{V}, \mathbf{K})\, \mathcal{IW}_{\mathbf{V}}(\mathbf{Q}, n) = \mathcal{N}_{\mathbf{A}}\left((\mathbf{M}\mathbf{K} + \mathbf{y}\mathbf{x}^T)(\mathbf{x}\mathbf{x}^T + \mathbf{K})^{-1}, \mathbf{V}, \mathbf{x}\mathbf{x}^T + \mathbf{K}\right) \times \mathcal{IW}_{\mathbf{V}}\left(\mathbf{Q} + (\mathbf{y} - \mathbf{M}\mathbf{x})\left(1 - \mathbf{x}^T(\mathbf{x}\mathbf{x}^T + \mathbf{K})^{-1}\mathbf{x}\right)(\mathbf{y} - \mathbf{M}\mathbf{x})^T, n + 1\right) \times \mathcal{T}_{\mathbf{y}}\left(\mathbf{Q}, n, \mathbf{M}\mathbf{x}, 1 - \mathbf{x}^T(\mathbf{x}\mathbf{x}^T + \mathbf{K})^{-1}\mathbf{x}\right).$$

Proof. It is easy to show that the following equations hold:

$$\mathcal{N}_{\mathbf{y}}(\mathbf{A}\mathbf{x}, \mathbf{V})\, \mathcal{N}_{\mathbf{A}}(\mathbf{M}, \mathbf{V}, \mathbf{K}) = \mathcal{N}_{\mathbf{A}}(\mathbf{M}^+, \mathbf{V}, \mathbf{x}\mathbf{x}^T + \mathbf{K})\, \mathcal{N}_{\mathbf{y}}(\mathbf{M}\mathbf{x}, \mathbf{V}, \gamma),$$
$$\mathcal{N}_{\mathbf{A}}(\mathbf{M}, \mathbf{V}, \mathbf{K})\, \mathcal{IW}_{\mathbf{V}}(\mathbf{Q}, n) = \mathcal{IW}_{\mathbf{V}}(\mathbf{Q} + \mathbf{H}, n + 1)\, \mathcal{T}_{\mathbf{A}}(\mathbf{Q}, n, \mathbf{M}, \mathbf{K}), \qquad (9)$$

where, for the sake of brevity, $\mathbf{M}^+ = (\mathbf{M}\mathbf{K} + \mathbf{y}\mathbf{x}^T)(\mathbf{x}\mathbf{x}^T + \mathbf{K})^{-1}$, $\gamma = 1 - \mathbf{x}^T(\mathbf{x}\mathbf{x}^T + \mathbf{K})^{-1}\mathbf{x}$, and $\mathbf{H} = (\mathbf{A}-\mathbf{M})\mathbf{K}(\mathbf{A}-\mathbf{M})^T$. Then we have

$$\mathcal{N}_{\mathbf{y}}(\mathbf{M}\mathbf{x}, \mathbf{V}, \gamma)\, \mathcal{IW}_{\mathbf{V}}(\mathbf{Q}, n) = \mathcal{IW}_{\mathbf{V}}\left(\mathbf{Q} + (\mathbf{y} - \mathbf{M}\mathbf{x})\gamma(\mathbf{y} - \mathbf{M}\mathbf{x})^T, n + 1\right)\, \mathcal{T}_{\mathbf{y}}(\mathbf{Q}, n, \mathbf{M}\mathbf{x}, \gamma),$$

and the statement of the lemma follows.

Using this lemma, we can compute the posterior probabilities. Let us introduce the following quantities:

$$\gamma_{t+1} = 1 - \mathbf{x}_{t+1}^T(\mathbf{x}_{t+1}\mathbf{x}_{t+1}^T + \mathbf{K}_t)^{-1}\mathbf{x}_{t+1},$$
$$n_{t+1} = n_t + 1,$$
$$\mathbf{M}_{t+1} = (\mathbf{M}_t\mathbf{K}_t + \mathbf{y}_{t+1}\mathbf{x}_{t+1}^T)(\mathbf{x}_{t+1}\mathbf{x}_{t+1}^T + \mathbf{K}_t)^{-1},$$
$$\mathbf{Q}_{t+1} = \mathbf{Q}_t + (\mathbf{y}_{t+1} - \mathbf{M}_t\mathbf{x}_{t+1})\gamma_{t+1}(\mathbf{y}_{t+1} - \mathbf{M}_t\mathbf{x}_{t+1})^T. \qquad (10)$$
For the posterior probabilities we have determined that

$$P\left(\mathbf{A}|\mathbf{V}, \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^{t+1}\right) = \mathcal{N}_{\mathbf{A}}\left(\mathbf{M}_{t+1}, \mathbf{V}, \mathbf{x}_{t+1}\mathbf{x}_{t+1}^T + \mathbf{K}_t\right), \qquad (11)$$
$$P\left(\mathbf{V}|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^{t+1}\right) = \mathcal{IW}_{\mathbf{V}}(\mathbf{Q}_{t+1}, n_{t+1}), \qquad (12)$$
$$P\left(\mathbf{y}_{t+1}|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t\right) = \mathcal{T}_{\mathbf{y}_{t+1}}(\mathbf{Q}_t, n_t, \mathbf{M}_t\mathbf{x}_{t+1}, \gamma_{t+1}).$$

Having done so, we can compute the entropy of the posterior distribution of $\theta = \{\mathbf{A}, \mathbf{V}\}$ by means of the following lemma.

Lemma 4.2. The entropy of a stochastic variable with density function $P(\mathbf{A}, \mathbf{V}) = \mathcal{N}_{\mathbf{A}}(\mathbf{M}, \mathbf{V}, \mathbf{K})\, \mathcal{IW}_{\mathbf{V}}(\mathbf{Q}, n)$ assumes the form $-\frac{d}{2}\ln|\mathbf{K}| + \frac{m+d+1}{2}\ln|\mathbf{Q}| + f_1(d, n)$, where $f_1(d, n)$ depends only on $d$ and $n$.

Proof. Let $\mathrm{vec}(\mathbf{A})$ denote the $dm$-dimensional vector whose $(d(i-1)+1)$-th, ..., $(id)$-th elements ($1 \le i \le m$) are the elements of the $i$-th column of the matrix $\mathbf{A} \in \mathbb{R}^{d \times m}$, in the appropriate order. Let $\otimes$ denote the Kronecker product. It is known that for $P(\mathbf{A}) = \mathcal{N}_{\mathbf{A}}(\mathbf{M}, \mathbf{V}, \mathbf{K})$, $P(\mathrm{vec}(\mathbf{A})) = \mathcal{N}_{\mathrm{vec}(\mathbf{A})}(\mathrm{vec}(\mathbf{M}), \mathbf{V} \otimes \mathbf{K}^{-1})$ holds [26]. Using the well-known formula for the entropy of a multivariate normally distributed variable [27] and applying the relation $|\mathbf{V} \otimes \mathbf{K}^{-1}| = |\mathbf{V}|^m / |\mathbf{K}|^d$, we have

$$H(\mathbf{A}; \mathbf{V}) = \frac{1}{2}\ln|\mathbf{V} \otimes \mathbf{K}^{-1}| + \frac{dm}{2}\ln(2\pi e) = \frac{m}{2}\ln|\mathbf{V}| - \frac{d}{2}\ln|\mathbf{K}| + \frac{dm}{2}\ln(2\pi e).$$

Exploiting certain properties of the Wishart distribution, we compute the entropy of the distribution $\mathcal{IW}_{\mathbf{V}}(\mathbf{Q}, n)$. The density of the Wishart distribution is defined by

$$\mathcal{W}_{\mathbf{V}}(\mathbf{Q}, n) = \frac{1}{Z_{n,d}}\, |\mathbf{V}|^{(n-d-1)/2}\left|\frac{\mathbf{Q}^{-1}}{2}\right|^{n/2}\exp\left(-\frac{1}{2}\mathrm{tr}(\mathbf{V}\mathbf{Q}^{-1})\right).$$

Let $\Psi$ denote the digamma function, and let $f_2(d, n) = -\sum_{i=1}^{d} \Psi\left(\frac{n+1-i}{2}\right) - d\ln 2$. Replacing $\mathbf{V}^{-1}$ with $\mathbf{S}$, we have for the Jacobian that $\left|\frac{d\mathbf{V}}{d\mathbf{S}}\right| = \left|\frac{d\mathbf{S}^{-1}}{d\mathbf{S}}\right| = |\mathbf{S}|^{-(d+1)}$ [25]. To proceed, we use that $E_{\mathcal{W}_{\mathbf{S}}(\mathbf{Q},n)}\mathbf{S} = n\mathbf{Q}$ and $E_{\mathcal{W}_{\mathbf{S}}(\mathbf{Q},n)}\ln|\mathbf{S}| = \ln|\mathbf{Q}| - f_2(d, n)$ [29], and substitute them into $E_{\mathcal{IW}_{\mathbf{V}}(\mathbf{Q},n)}\ln|\mathbf{V}|$ and $E_{\mathcal{IW}_{\mathbf{V}}(\mathbf{Q},n)}\mathrm{tr}(\mathbf{Q}\mathbf{V}^{-1})$:

$$E_{\mathcal{IW}_{\mathbf{V}}(\mathbf{Q},n)}\ln|\mathbf{V}| = \int \frac{1}{Z_{n,d}}\frac{1}{|\mathbf{V}|^{(d+1)/2}}\left|\frac{\mathbf{V}^{-1}\mathbf{Q}}{2}\right|^{n/2} e^{-\frac{1}{2}\mathrm{tr}(\mathbf{V}^{-1}\mathbf{Q})}\ln|\mathbf{V}|\, d\mathbf{V} = -\int \frac{1}{Z_{n,d}}|\mathbf{S}|^{(n-d-1)/2}\left|\frac{\mathbf{Q}}{2}\right|^{n/2} e^{-\frac{1}{2}\mathrm{tr}(\mathbf{S}\mathbf{Q})}\ln|\mathbf{S}|\, d\mathbf{S} = -E_{\mathcal{W}_{\mathbf{S}}(\mathbf{Q}^{-1},n)}\ln|\mathbf{S}| = \ln|\mathbf{Q}| + f_2(d, n). \qquad (13)$$

One can also show that

$$E_{\mathcal{IW}_{\mathbf{V}}(\mathbf{Q},n)}\mathrm{tr}(\mathbf{Q}\mathbf{V}^{-1}) = \int \frac{1}{Z_{n,d}}|\mathbf{S}|^{(n-d-1)/2}\left|\frac{\mathbf{Q}}{2}\right|^{n/2} e^{-\frac{1}{2}\mathrm{tr}(\mathbf{S}\mathbf{Q})}\,\mathrm{tr}(\mathbf{Q}\mathbf{S})\, d\mathbf{S} = E_{\mathcal{W}_{\mathbf{S}}(\mathbf{Q}^{-1},n)}\mathrm{tr}(\mathbf{Q}\mathbf{S}) = \mathrm{tr}(\mathbf{Q}\, n\mathbf{Q}^{-1}) = nd. \qquad (14)$$

We now calculate the entropy of the stochastic variable $\mathbf{V}$ with distribution $\mathcal{IW}_{\mathbf{V}}(\mathbf{Q}, n)$. It follows from Eq. (13) and Eq. (14) that

$$H(\mathbf{V}) = -E_{\mathcal{IW}_{\mathbf{V}}(\mathbf{Q},n)}\left[-\ln Z_{n,d} + \frac{n}{2}\ln\left|\frac{\mathbf{Q}}{2}\right| - \frac{n+d+1}{2}\ln|\mathbf{V}| - \frac{1}{2}\mathrm{tr}(\mathbf{V}^{-1}\mathbf{Q})\right] = \ln Z_{n,d} - \frac{n}{2}\ln\left|\frac{\mathbf{Q}}{2}\right| + \frac{n+d+1}{2}\left[\ln|\mathbf{Q}| + f_2(d, n)\right] + \frac{nd}{2} = \frac{d+1}{2}\ln|\mathbf{Q}| + f_3(d, n),$$

where $f_3(d, n)$ depends only on $d$ and $n$. Given the results above, we complete the computation of the entropy $H(\mathbf{A}, \mathbf{V})$ as follows:

$$H(\mathbf{A}, \mathbf{V}) = H(\mathbf{A}|\mathbf{V}) + H(\mathbf{V}) = H(\mathbf{V}) + \int d\mathbf{V}\, \mathcal{IW}_{\mathbf{V}}(\mathbf{Q}, n)\, H(\mathbf{A}; \mathbf{V}) = \int d\mathbf{V}\, \mathcal{IW}_{\mathbf{V}}(\mathbf{Q}, n)\left[\frac{m}{2}\ln|\mathbf{V}| - \frac{d}{2}\ln|\mathbf{K}| + \frac{dm}{2}\ln(2\pi e)\right] + H(\mathbf{V}) = -\frac{d}{2}\ln|\mathbf{K}| + \frac{dm}{2}\ln(2\pi e) + \frac{m}{2}\left[\ln|\mathbf{Q}| + f_2(d, n)\right] + \frac{d+1}{2}\ln|\mathbf{Q}| + f_3(d, n) = -\frac{d}{2}\ln|\mathbf{K}| + \frac{m+d+1}{2}\ln|\mathbf{Q}| + f_1(d, n).$$

This is exactly what was claimed in Lemma 4.2.
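Lemma 4.2 already shows what the control can and cannot influence: at a fixed time step, $f_1(d, n)$ is common to every candidate control, $\mathbf{K}$ is updated deterministically to $\mathbf{x}\mathbf{x}^T + \mathbf{K}$, and (as Lemma 4.4 below shows) the expected contribution of the $\mathbf{Q}$-update is control-independent. The following hypothetical sketch evaluates the entropy expression without the $f_1$ term and illustrates, via the matrix determinant lemma, why a larger $\mathbf{x}^T\mathbf{K}^{-1}\mathbf{x}$ means a larger entropy reduction; all matrices and dimensions are assumed for illustration:

```python
import numpy as np

def entropy_term(K, Q, d, m):
    """Posterior entropy of (A, V) from Lemma 4.2 without the f1(d, n) term,
    which is shared by all candidate controls at a fixed time step:
        H = -(d/2) ln|K| + ((m + d + 1)/2) ln|Q| + f1(d, n)."""
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_Q = np.linalg.slogdet(Q)
    return -0.5 * d * logdet_K + 0.5 * (m + d + 1) * logdet_Q

rng = np.random.default_rng(2)
d, m = 3, 8
S = rng.standard_normal((m, m))
K = S @ S.T + m * np.eye(m)   # a positive definite precision parameter matrix
Q = np.eye(d)

# By the matrix determinant lemma, |x x^T + K| = |K| (1 + x^T K^{-1} x), so the
# candidate x with the largest x^T K^{-1} x shrinks the entropy term the most;
# this is exactly the criterion that appears in (15) below.
for scale in (1.0, 3.0):
    x = scale * rng.standard_normal(m)
    gain = x @ np.linalg.solve(K, x)                 # x^T K^{-1} x
    H_after = entropy_term(K + np.outer(x, x), Q, d, m)
    print(f"x^T K^-1 x = {gain:9.3f}  ->  entropy term = {H_after:9.3f}")
```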
Lemmas 4.1 and 4.2 lead to the following corollary.

Corollary 4.3. For the entropy of a stochastic variable with posterior distribution $P(\mathbf{A}, \mathbf{V}|\mathbf{x}, \mathbf{y})$ it holds that

$$H(\mathbf{A}, \mathbf{V}; \mathbf{x}, \mathbf{y}) = -\frac{d}{2}\ln\left|\mathbf{x}\mathbf{x}^T + \mathbf{K}\right| + \frac{m+d+1}{2}\ln\left|\mathbf{Q} + (\mathbf{y} - \mathbf{M}\mathbf{x})\gamma(\mathbf{y} - \mathbf{M}\mathbf{x})^T\right| + f_1(d, n).$$

We note that the following lemma also holds.

Lemma 4.4. $\int \mathcal{T}_{\mathbf{y}}(\mathbf{Q}, n, \boldsymbol{\mu}, \gamma)\,\ln\left|\mathbf{Q} + (\mathbf{y} - \boldsymbol{\mu})\gamma(\mathbf{y} - \boldsymbol{\mu})^T\right| d\mathbf{y}$ is independent of both $\boldsymbol{\mu}$ and $\gamma$.

Thus we can compute the conditional entropy expressed in (8):

Lemma 4.5.

$$H(\mathbf{A}, \mathbf{V}|\mathbf{y}; \mathbf{x}) = \int p(\mathbf{y}|\mathbf{x})\, H(\mathbf{A}, \mathbf{V}; \mathbf{x}, \mathbf{y})\, d\mathbf{y} = -\frac{d}{2}\ln\left|\mathbf{x}\mathbf{x}^T + \mathbf{K}\right| + g_1(\mathbf{Q}, d, n),$$

where $g_1(\mathbf{Q}, d, n)$ depends only on $\mathbf{Q}$, $d$, and $n$.

Collecting all the terms, we arrive at the following intriguingly simple expression:

$$\mathbf{u}_{t+1}^{opt} = \arg\min_{\mathbf{u}_{t+1}} \int p\left(\mathbf{y}_{t+1}|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t\right)\, H\left(\mathbf{A}, \mathbf{V}|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t, \mathbf{y}_{t+1}\right) d\mathbf{y}_{t+1} = \arg\min_{\mathbf{u}_{t+1}} \left(-\frac{d}{2}\ln\left|\mathbf{x}_{t+1}\mathbf{x}_{t+1}^T + \mathbf{K}_t\right|\right) = \arg\max_{\mathbf{u}_{t+1}} \mathbf{x}_{t+1}^T \mathbf{K}_t^{-1} \mathbf{x}_{t+1}, \qquad (15)$$

where $\mathbf{x}_{t+1} \doteq [\mathbf{r}_{t-I}; \ldots; \mathbf{r}_t; \mathbf{u}_{t-J+1}; \ldots; \mathbf{u}_{t+1}]$, and we used that $|\mathbf{x}\mathbf{x}^T + \mathbf{K}| = |\mathbf{K}|(1 + \mathbf{x}^T\mathbf{K}^{-1}\mathbf{x})$ according to the Matrix Determinant Lemma [30].

We assume a bounded domain $\mathcal{U}$ for the control, which is necessary to keep the maximization procedure of (15) finite. This is, however, a reasonable condition for all practical applications. So,

$$\mathbf{u}_{t+1}^{opt} = \arg\max_{\mathbf{u} \in \mathcal{U}} \mathbf{x}_{t+1}^T \mathbf{K}_t^{-1} \mathbf{x}_{t+1}. \qquad (16)$$

In what follows, D-optimal control will be referred to as the 'infomax interrogation scheme'. The steps of our algorithm are summarized in Table 1.

Table 1: Pseudocode of the algorithm

Control calculation
  $\mathbf{u}_{t+1} = \arg\max_{\mathbf{u} \in \mathcal{U}} \hat{\mathbf{x}}_{t+1}^T \mathbf{K}_t^{-1} \hat{\mathbf{x}}_{t+1}$, where $\hat{\mathbf{x}}_{t+1} = [\mathbf{r}_{t-I}; \ldots; \mathbf{r}_t; \mathbf{u}_{t-J+1}; \ldots; \mathbf{u}_t; \mathbf{u}]$;
  set $\mathbf{x}_{t+1} = [\mathbf{r}_{t-I}; \ldots; \mathbf{r}_t; \mathbf{u}_{t-J+1}; \ldots; \mathbf{u}_t; \mathbf{u}_{t+1}]$
Observation
  observe $\mathbf{r}_{t+1}$, and let $\mathbf{y}_{t+1} = g^{-1}(\mathbf{r}_{t+1})$
Bayesian update
  $\mathbf{M}_{t+1} = (\mathbf{M}_t\mathbf{K}_t + \mathbf{y}_{t+1}\mathbf{x}_{t+1}^T)(\mathbf{x}_{t+1}\mathbf{x}_{t+1}^T + \mathbf{K}_t)^{-1}$
  $\mathbf{K}_{t+1} = \mathbf{x}_{t+1}\mathbf{x}_{t+1}^T + \mathbf{K}_t$
  $n_{t+1} = n_t + 1$
  $\gamma_{t+1} = 1 - \mathbf{x}_{t+1}^T(\mathbf{x}_{t+1}\mathbf{x}_{t+1}^T + \mathbf{K}_t)^{-1}\mathbf{x}_{t+1}$
  $\mathbf{Q}_{t+1} = \mathbf{Q}_t + (\mathbf{y}_{t+1} - \mathbf{M}_t\mathbf{x}_{t+1})\gamma_{t+1}(\mathbf{y}_{t+1} - \mathbf{M}_t\mathbf{x}_{t+1})^T$

The computation of the inverse $(\mathbf{x}_{t+1}\mathbf{x}_{t+1}^T + \mathbf{K}_t)^{-1}$ in Table 1 can be simplified considerably by the following recursion: let $\mathbf{P}_t = \mathbf{K}_t^{-1}$; then, according to the Sherman-Morrison formula [31],

$$\mathbf{P}_{t+1} = (\mathbf{x}_{t+1}\mathbf{x}_{t+1}^T + \mathbf{K}_t)^{-1} = \mathbf{P}_t - \frac{\mathbf{P}_t\mathbf{x}_{t+1}\mathbf{x}_{t+1}^T\mathbf{P}_t}{1 + \mathbf{x}_{t+1}^T\mathbf{P}_t\mathbf{x}_{t+1}}. \qquad (17)$$

In this expression the matrix inversion disappears; only a real number is inverted instead.
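As a sanity check of Table 1, here is a toy transcription for the special case $I = J = 0$ with $g$ the identity map, where the control is chosen from a finite grid standing in for the bounded domain $\mathcal{U}$. The true parameters, noise level, priors, and horizon are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy instance of Table 1 with I = J = 0 and g = identity: x_{t+1} = [r_t; u_{t+1}].
d, c = 3, 2
m = d + c
A_true = 0.4 * rng.standard_normal((d, m))   # unknown A = [F_0, B_0]
V_true = 0.01 * np.eye(d)

M = np.zeros((d, m))     # posterior mean of A
P = np.eye(m)            # P_t = K_t^{-1}, maintained via recursion (17)
Q = np.eye(d)
n = d + 2

# Bounded control domain U: a finite grid over [-1, 1]^c.
grid = np.linspace(-1.0, 1.0, 5)
U = [np.array([a, b]) for a in grid for b in grid]

r = np.zeros(d)
for t in range(200):
    # Control calculation (16): maximize x^T K^{-1} x over the candidates.
    u = max(U, key=lambda u_: np.concatenate([r, u_]) @ P @ np.concatenate([r, u_]))
    x = np.concatenate([r, u])

    # Observation: with g = identity, y_{t+1} = r_{t+1}.
    y = A_true @ x + rng.multivariate_normal(np.zeros(d), V_true)
    r = y

    # Bayesian update (10), using Sherman-Morrison (17) to avoid matrix inversion.
    Px = P @ x
    denom = 1.0 + x @ Px
    gamma = 1.0 / denom                     # equals 1 - x^T (x x^T + K)^{-1} x
    P = P - np.outer(Px, Px) / denom        # P_{t+1} = (x x^T + K_t)^{-1}
    resid = y - M @ x                       # y_{t+1} - M_t x_{t+1}
    M = M + np.outer(resid, P @ x)          # rank-one form of M_{t+1} in (10)
    Q = Q + gamma * np.outer(resid, resid)
    n += 1

print("||M - A_true||_F =", np.linalg.norm(M - A_true))
```

Since $g$ is assumed invertible throughout, $\mathbf{y} = g^{-1}(\mathbf{r})$ recovers the linear relation (2); taking $g$ as the identity keeps the sketch short without changing the algebra.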
5 Estimating the Noise

One might wish to compute the optimal control for estimating the noise $\mathbf{e}_t$ in (1), instead of solving the identification problem above. Based on (1), and because

$$\mathbf{e}_{t+1} = \mathbf{y}_{t+1} - \sum_{i=0}^{I}\mathbf{F}_i\mathbf{r}_{t-i} - \sum_{j=0}^{J}\mathbf{B}_j\mathbf{u}_{t+1-j}, \qquad (18)$$

one might think that the best strategy is to use the optimal infomax control of Table 1, since it provides good estimates of the parameters $\mathbf{A} = [\mathbf{F}_I, \ldots, \mathbf{F}_0, \mathbf{B}_J, \ldots, \mathbf{B}_0]$ and thus of the noise $\mathbf{e}_t$. Another, different thought is the following. At time $t$, let us denote our estimates by $\hat{\mathbf{e}}_t$, $\hat{\mathbf{F}}_i^t$ ($i = 0, \ldots, I$), and $\hat{\mathbf{B}}_j^t$ ($j = 0, \ldots, J$). Using (18), we have

$$\mathbf{e}_{t+1} - \hat{\mathbf{e}}_{t+1} = \sum_{i=0}^{I}\left(\mathbf{F}_i - \hat{\mathbf{F}}_i^t\right)\mathbf{r}_{t-i} + \sum_{j=0}^{J}\left(\mathbf{B}_j - \hat{\mathbf{B}}_j^t\right)\mathbf{u}_{t+1-j}. \qquad (19)$$

This hints that the control should be $\mathbf{u}_t = \mathbf{0}$ at all times, in order to remove the error contribution of the matrices $\mathbf{B}_j$ in (19). Straightforward utilization of D-optimality considerations, as opposed to the objective of (7), suggests the optimization of the following quantity:

$$\arg\max_{\mathbf{u}_{t+1}} I\left(\mathbf{e}_{t+1}, \mathbf{y}_{t+1}; \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t\right).$$

That is, for the estimation of the noise we want to design a control signal $\mathbf{u}_{t+1}$ such that the next output is the best from the point of view of greedy optimization of the mutual information between the next output $\mathbf{y}_{t+1}$ and the noise $\mathbf{e}_{t+1}$. It is easy to show that this task is equivalent to the following optimization problem:

$$\arg\min_{\mathbf{u}_{t+1}} \int d\mathbf{y}_{t+1}\, P\left(\mathbf{y}_{t+1}|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t\right)\, H\left(\mathbf{e}_{t+1}; \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^{t+1}\right), \qquad (20)$$

where $H(\mathbf{e}_{t+1}; \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^{t+1}) = H(\mathbf{A}\mathbf{x}_{t+1}; \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^{t+1})$, because $\mathbf{e}_{t+1} = \mathbf{y}_{t+1} - \mathbf{A}\mathbf{x}_{t+1}$. To compute this quantity we need the following lemma [26].

Lemma 5.1. If $P(\mathbf{A}) = \mathcal{N}_{\mathbf{A}}(\mathbf{M}, \mathbf{V}, \mathbf{K})$, then $P(\mathbf{A}\mathbf{x}) = \mathcal{N}_{\mathbf{A}\mathbf{x}}\left(\mathbf{M}\mathbf{x}, \mathbf{V}, (\mathbf{x}^T\mathbf{K}^{-1}\mathbf{x})^{-1}\right)$.

Applying this lemma and using (11), one has

$$P\left(\mathbf{A}\mathbf{x}_{t+1}|\mathbf{V}, \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t\right) = \mathcal{N}_{\mathbf{A}\mathbf{x}_{t+1}}\left(\mathbf{M}_{t+1}\mathbf{x}_{t+1}, \mathbf{V}, \left(\mathbf{x}_{t+1}^T\mathbf{K}_{t+1}^{-1}\mathbf{x}_{t+1}\right)^{-1}\right). \qquad (21)$$

We introduce the notations

$$\tilde{K}_{t+1} = \left(\mathbf{x}_{t+1}^T\mathbf{K}_{t+1}^{-1}\mathbf{x}_{t+1}\right)^{-1} \in \mathbb{R}, \qquad (22)$$
$$\lambda_{t+1} = 1 + (\mathbf{A}\mathbf{x}_{t+1} - \mathbf{M}_{t+1}\mathbf{x}_{t+1})^T\left(\tilde{K}_{t+1}\mathbf{Q}_{t+1}^{-1}\right)(\mathbf{A}\mathbf{x}_{t+1} - \mathbf{M}_{t+1}\mathbf{x}_{t+1}) \in \mathbb{R},$$

and use (9) and (12) for the posterior distribution (21). Then we arrive at

$$P\left(\mathbf{A}\mathbf{x}_{t+1}|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t\right) = \mathcal{T}_{\mathbf{A}\mathbf{x}_{t+1}}\left(\mathbf{Q}_{t+1}, n_{t+1}, \mathbf{M}_{t+1}\mathbf{x}_{t+1}, \tilde{K}_{t+1}\right) = \pi^{-d/2}\left|\tilde{K}_{t+1}^{-1}\mathbf{Q}_{t+1}\right|^{-1/2}\frac{\Gamma\left(\frac{n_{t+1}+1}{2}\right)}{\Gamma\left(\frac{n_{t+1}+1-d}{2}\right)}\,\lambda_{t+1}^{-\frac{n_{t+1}+1}{2}}.$$

The Shannon entropy of this distribution, according to [32], equals

$$H\left(\mathbf{A}\mathbf{x}_{t+1}; \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^{t+1}\right) = f_4(d, n_{t+1}) + \frac{d}{2}\log\left|\tilde{K}_{t+1}^{-1}\right| + \log|\mathbf{Q}_{t+1}|,$$

where

$$f_4(d, n_{t+1}) = -\log\frac{\Gamma\left(\frac{n_{t+1}+1}{2}\right)}{\pi^{d/2}\,\Gamma\left(\frac{n_{t+1}+1-d}{2}\right)} + \frac{n_{t+1}+1}{2}\left[\Psi\left(\frac{n_{t+1}+1}{2}\right) - \Psi\left(\frac{n_{t+1}+1-d}{2}\right)\right].$$

Using the notations introduced in (10) and (22), the above expression can be transcribed as follows:

$$H\left(\mathbf{A}\mathbf{x}_{t+1}; \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^{t+1}\right) = f_4(d, n_{t+1}) - \frac{d}{2}\log\tilde{K}_{t+1} + \log|\mathbf{Q}_{t+1}| = f_4(d, n_{t+1}) + \frac{d}{2}\log\left(\mathbf{x}_{t+1}^T(\mathbf{K}_t + \mathbf{x}_{t+1}\mathbf{x}_{t+1}^T)^{-1}\mathbf{x}_{t+1}\right) + \log\left|\mathbf{Q}_t + (\mathbf{y}_{t+1} - \mathbf{M}_t\mathbf{x}_{t+1})\gamma_{t+1}(\mathbf{y}_{t+1} - \mathbf{M}_t\mathbf{x}_{t+1})^T\right|.$$

Now we are in a position to calculate (20) by applying Lemma 4.4 as before. We get

$$\int d\mathbf{y}_{t+1}\, P\left(\mathbf{y}_{t+1}|\{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t\right)\, H\left(\mathbf{e}_{t+1}; \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^{t+1}\right) = g_2(\mathbf{Q}_t, d, n_{t+1}) + \frac{d}{2}\log\left(\mathbf{x}_{t+1}^T(\mathbf{K}_t + \mathbf{x}_{t+1}\mathbf{x}_{t+1}^T)^{-1}\mathbf{x}_{t+1}\right),$$

where $g_2(\mathbf{Q}_t, d, n_{t+1})$ depends only on $\mathbf{Q}_t$, $d$, and $n_{t+1}$. Thus we have

$$\arg\max_{\mathbf{u}_{t+1}} I\left(\mathbf{e}_{t+1}, \mathbf{y}_{t+1}; \{\mathbf{x}\}_1^{t+1}, \{\mathbf{y}\}_1^t\right) = \arg\min_{\mathbf{u}_{t+1}} \log\left(\mathbf{x}_{t+1}^T(\mathbf{K}_t + \mathbf{x}_{t+1}\mathbf{x}_{t+1}^T)^{-1}\mathbf{x}_{t+1}\right) = \arg\min_{\mathbf{u}_{t+1}} \log\left(\mathbf{x}_{t+1}^T\left(\mathbf{K}_t^{-1} - \frac{\mathbf{K}_t^{-1}\mathbf{x}_{t+1}\mathbf{x}_{t+1}^T\mathbf{K}_t^{-1}}{1 + \mathbf{x}_{t+1}^T\mathbf{K}_t^{-1}\mathbf{x}_{t+1}}\right)\mathbf{x}_{t+1}\right) = \arg\min_{\mathbf{u}_{t+1}} \log\frac{\mathbf{x}_{t+1}^T\mathbf{K}_t^{-1}\mathbf{x}_{t+1}}{1 + \mathbf{x}_{t+1}^T\mathbf{K}_t^{-1}\mathbf{x}_{t+1}} = \arg\min_{\mathbf{u}_{t+1}} \mathbf{x}_{t+1}^T\mathbf{K}_t^{-1}\mathbf{x}_{t+1}.$$

In practice, we perform this optimization over an appropriate domain $\mathcal{U}$.
Thus, the D-optimal interrogation scheme for noise estimation is as follows:

$$\mathbf{u}_{t+1}^{opt} = \arg\min_{\mathbf{u} \in \mathcal{U}} \mathbf{x}_{t+1}^T\mathbf{K}_t^{-1}\mathbf{x}_{t+1}. \qquad (23)$$

It is worth noting that this D-optimal cost function for noise estimation and the D-optimal cost function derived for parameter estimation in (15) are not compatible with each other: pursuing either one quickly will necessarily delay the estimation of the other. In Section 5.1 we will see that for large enough $t$, expression (23) gives rise to control values close to $\mathbf{u}_t = \mathbf{0}$.

5.1 Greedy and Non-Greedy Optimization

Greedy optimization of (23) is simple, provided that $\mathbf{K}_t$ is fixed during the optimization of $\mathbf{u}_{t+1}$. If so, then the optimization task is quadratic. To see this, let us partition the matrix $\mathbf{K}_t$ as follows:

$$\mathbf{K}_t = \begin{pmatrix} \mathbf{K}_t^{11} & \mathbf{K}_t^{12} \\ \mathbf{K}_t^{21} & \mathbf{K}_t^{22} \end{pmatrix},$$

where $\mathbf{K}_t^{11} \in \mathbb{R}^{d \times d}$, $\mathbf{K}_t^{21} \in \mathbb{R}^{(m-d) \times d}$, and $\mathbf{K}_t^{22} \in \mathbb{R}^{(m-d) \times (m-d)}$. It is easy to see that if the domain $\mathcal{U}$ in (23) is large enough, then

$$\mathbf{u}_{t+1}^{opt} = \left(\mathbf{K}_t^{22}\right)^{-1}\mathbf{K}_t^{21}\mathbf{r}_t. \qquad (24)$$

However, for non-greedy solutions, the matrix $\mathbf{K}_t$ in $\mathbf{x}_{t+1}^T\mathbf{K}_t^{-1}\mathbf{x}_{t+1}$ changes, because it may depend on the previous control inputs $\mathbf{u}_1, \ldots, \mathbf{u}_t$, the subject of the previous optimization steps. The optimal strategy for long-term non-greedy optimization falls outside the scope of the present work. Here we propose the following heuristic for this problem: use the strategy of Table 1 for the first $\tau$ steps; it increases $|\mathbf{K}_t|$ in (23) quickly. After $\tau$ steps, switch to the control described in (24). This decreases the cost function (23) further. We will call this non-greedy interrogation heuristic, introduced for noise estimation, '$\tau$-infomax noise interrogation'; a sketch appears below.

It is worth noting that in the $\tau$-infomax noise interrogation, if the switching time $\tau$ is large enough, then for large $t$ the determinant $|\mathbf{K}_t^{22}|$ will be large, and hence, according to (24), the optimal interrogation $\mathbf{u}_t$ will be close to $\mathbf{0}$. The approximation of the $\tau$-infomax noise interrogation in which we use the interrogation described in Table 1 for $\tau$ steps and then switch to zero interrogation will be called the '$\tau$-zero interrogation' scheme.
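The following sketch of the $\tau$-infomax noise interrogation reuses the toy setting from the previous sketch ($I = J = 0$, $g$ the identity, illustrative parameters). Rather than transcribing the closed form (24) directly, the noise-estimation phase solves the first-order condition of the quadratic form $\mathbf{x}^T\mathbf{K}_t^{-1}\mathbf{x}$ in the free $\mathbf{u}$-block, which plays the role of (24) when the domain $\mathcal{U}$ is large enough:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy setting as before: I = J = 0, g = identity, x_{t+1} = [r_t; u_{t+1}].
d, c = 3, 2
m = d + c
A_true = 0.4 * rng.standard_normal((d, m))
V_true = 0.01 * np.eye(d)

P = np.eye(m)     # P_t = K_t^{-1}
tau = 50          # switching time of the tau-infomax heuristic (assumed value)
U = [rng.uniform(-1.0, 1.0, size=c) for _ in range(64)]   # bounded candidate set

def greedy_noise_control(P, r):
    """Unconstrained minimizer of x^T P x over the u-block of x = [r; u]:
    setting the gradient to zero gives u* = -P22^{-1} P21 r, the first-order
    condition of the quadratic form in (23); cf. the closed form (24)."""
    return -np.linalg.solve(P[d:, d:], P[d:, :d] @ r)

r = np.zeros(d)
for t in range(200):
    if t < tau:
        # Explorative phase: infomax interrogation of Table 1, maximize x^T K^{-1} x.
        u = max(U, key=lambda u_: np.concatenate([r, u_]) @ P @ np.concatenate([r, u_]))
    else:
        # Noise-estimation phase: minimize x^T K^{-1} x, as in (23).
        u = greedy_noise_control(P, r)
    x = np.concatenate([r, u])
    r = A_true @ x + rng.multivariate_normal(np.zeros(d), V_true)
    Px = P @ x
    P = P - np.outer(Px, Px) / (1.0 + x @ Px)   # Sherman-Morrison step (17)

print("late-phase control:", np.round(u, 4))
```

Running this lets one observe the behavior claimed above: after a long explorative phase the noise-estimation controls tend to shrink toward zero, which is why the $\tau$-zero scheme approximates the $\tau$-infomax scheme for large $\tau$.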
6 Discussion and Conclusions

We have treated the identification problem of recurrent neural networks described by model (1). We applied active learning to solve this task. In particular, the online D-optimality principle was applied, and we investigated the learning properties for parameter and noise estimation.

We note that the D-optimal interrogation scheme is also called infomax control in the literature [6]. This name originates from the cost function, which optimizes the mutual information. The GLM model used by [6] is as follows:

$$\mathbf{r}_{t+1} = g\left(\sum_{i=0}^{I}\mathbf{F}_i\mathbf{r}_{t-i} + \sum_{j=0}^{J}\mathbf{B}_j\mathbf{u}_{t+1-j}\right) + \mathbf{e}_{t+1}, \qquad (25)$$

where $\{\mathbf{e}_t\}$ is i.i.d. noise with mean $\mathbf{0} \in \mathbb{R}^d$. The authors model spiking neurons and assume that the main source of the noise is the spiking itself, which appears at the output of the neurons and adds linearly to the neural activity. They investigated the case in which the observed quantity $\mathbf{r}_t$ has a Poisson distribution. Unfortunately, in this model the Bayesian equations become intractable, and the estimation of the posterior may be spoiled, because the distribution is projected to the family of normal distributions at each instant. A serious problem with this approach is that the extent of the information loss caused by this approximation is not known.

Our stochastic RNN model (1), in which the driving noise enters inside the nonlinearity, differs only slightly from the GLM model of (25), but it has considerable advantages, as we discuss below.

Bayesian designs of different kinds were derived for the linear regression problem in [35]:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \mathbf{e}, \qquad (26)$$
$$P(\mathbf{e}) = \mathcal{N}_{\mathbf{e}}(\mathbf{0}, \sigma^2\mathbf{I}). \qquad (27)$$

This problem is similar to ours ((3)-(6)), but while the goal of [35] was to find an optimal design for the explanatory variables $\boldsymbol{\theta}$, we were concerned with the estimation of the parameters ($\mathbf{X}$ in (26)) and of the noise ($\mathbf{e}$). In Verdinelli's paper, an inverted gamma prior and a vector-valued normal distribution were assumed on the isotropic noise and on the explanatory variables, respectively. By contrast, we were interested in matrix-valued coefficients and in general, non-isotropic noise. We used a matrix-valued normal distribution for the coefficients and an inverted Wishart distribution for the covariance matrix as conjugate priors. Owing to the inverted Wishart prior, the covariance matrix of the noise is not restricted to the isotropic form but can be general in our case.

The Bayesian online learning framework allowed us to derive analytic results for the greedy optimization of the parameters as well as of the driving noise. The optimal interrogation strategies (16) and (23) appeared in attractive, intriguingly simple quadratic forms. We have shown that these two tasks are incompatible with each other: parameter and noise estimation require the maximization and the minimization of the expression $\mathbf{x}_{t+1}^T\mathbf{K}_t^{-1}\mathbf{x}_{t+1}$, respectively. The problem of non-greedy optimization of the full task has been left open. However, we put forth a heuristic solution for the estimation of the driving noise, which we called $\tau$-infomax noise interrogation. It uses the D-optimal interrogation of Table 1 up to $\tau$ steps and applies the noise-estimation control of (23) afterwards. This heuristic decreases the estimation error of the coefficients of the matrices $\mathbf{F}$ and $\mathbf{B}$ up to time $\tau$, and thus, upon turning off the explorative D-optimization, tries to minimize the estimation error of the value of the noise at time $\tau + 1$. We introduced the $\tau$-zero interrogation scheme and showed that it is a good approximation of the $\tau$-infomax noise scheme for large $\tau$ values.

Finally, it seems desirable to determine the conditions under which the algorithms derived from the D-optimal (infomax) principle are both consistent and efficient. The tractable form of our approximation-free results is promising in this respect.

7 Acknowledgments

This research has been supported by the EC NEST 'Perceptual Consciousness: Explication and Testing' grant under contract 043261. Opinions and errors in this manuscript are the authors' responsibility; they do not necessarily reflect the opinions of the EC or other project members.

References

1. Fedorov, V.V.: Theory of Optimal Experiments. Academic Press, New York (1972)
2. Cohn, D.A.: Neural network exploration using optimal experiment design. In: Advances in Neural Information Processing Systems. Volume 6. (1994) 679-686
3. deCharms, R.C., Blake, D.T., Merzenich, M.M.: Optimizing sound features for cortical neurons. Science 280 (1998) 1439-1444
4. Földiák, P.: Stimulus optimization in primary visual cortex. Neurocomputing 38-40 (2001) 1217-1222
5. Machens, C.K., Gollisch, T., Kolesnikova, O., Herz, A.V.M.: Testing the efficiency of sensory coding with optimal stimulus ensembles. Neuron 47 (2005) 447-456
6. Lewi, J., Butera, R., Paninski, L.: Real-time adaptive information-theoretic optimization of neurophysiology experiments. In: Advances in Neural Information Processing Systems. Volume 19. (2007)
7. MacKay, D.J.C.: Information-based objective functions for active data selection. Neural Computation 4 (1992) 590-604
8. Cohn, D.A., Ghahramani, Z., Jordan, M.I.: Active learning with statistical models. Journal of Artificial Intelligence Research 4 (1996) 129-145
9. Fukumizu, K.: Statistical active learning in multilayer perceptrons. IEEE Transactions on Neural Networks 11 (2000) 17-26
10. Sugiyama, M.: Active learning in approximately linear regression based on conditional expectation of generalization error. The Journal of Machine Learning Research 7 (2006) 141-166
11. Opper, M., Winther, O.: A Bayesian approach to online learning. In: Online Learning in Neural Networks. Cambridge University Press (1999)
12. Solla, S., Winther, O.: Optimal perceptron learning: An online Bayesian approach. In: Online Learning in Neural Networks. Cambridge University Press (1999)
13. Honkela, A., Valpola, H.: On-line variational Bayesian learning. In: 4th International Symposium on Independent Component Analysis and Blind Signal Separation. (2003) 803-808
14. Ghahramani, Z.: Online variational Bayesian learning (2000). Slides from talk presented at the NIPS 2000 workshop on Online Learning
15. Kiefer, J.: Optimum experimental designs. Journal of the Royal Statistical Society, Series B 21 (1959) 272-304
16. Steinberg, D.M., Hunter, W.: Experimental design: review and comment. Technometrics 26 (1984) 71-97
17. Toman, B., Gastwirth, J.L.: Robust Bayesian experimental design and estimation for analysis of variance models using a class of normal mixtures. Journal of Statistical Planning and Inference 35 (1993) 383-398
18. Pukelsheim, F.: Optimal Design of Experiments. John Wiley & Sons (1993)
19. Chaloner, K., Verdinelli, I.: Bayesian experimental design: A review. Statistical Science 10 (1995) 273-304
20. Bernardo, J.M.: Expected information as expected utility. The Annals of Statistics 7 (1979) 686-690
21. Stone, M.: Application of a measure of information to the design and comparison of regression experiments. Annals of Mathematical Statistics 30 (1959) 55-70
22. Boyen, X., Koller, D.: Tractable inference for complex stochastic processes. In: Fourteenth Conference on Uncertainty in Artificial Intelligence. (1998) 33-42
23. Minka, T.: A family of algorithms for approximate Bayesian inference. PhD thesis, MIT Media Lab, MIT (2001)
24. Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B.: Bayesian Data Analysis. CRC Press, 2nd edition (2003)
25. Gupta, A.K., Nagar, D.K.: Matrix Variate Distributions. Volume 104 of Monographs and Surveys in Pure and Applied Mathematics. Chapman and Hall/CRC (1999)
26. Minka, T.: Bayesian linear regression (2000). MIT Media Lab note
27. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley-Interscience (1991)
28. Kotz, S., Nadarajah, S.: Multivariate t-Distributions and Their Applications. Cambridge University Press (2004)
29. Beal, M.J.: Variational algorithms for approximate Bayesian inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London (2003)
30. Harville, D.A.: Matrix Algebra From a Statistician's Perspective. Springer-Verlag (1997)
31. Golub, G.H., Van Loan, C.F.: Matrix Computations. 3rd edn. Johns Hopkins, Baltimore, MD (1996)
32. Zografos, K., Nadarajah, S.: Expressions for Rényi and Shannon entropies for multivariate distributions. Statistics and Probability Letters 71 (2005) 71-84
33. Yamakita, M., Iwashiro, M., Sugahara, Y., Furuta, K.: Robust swing-up control of double pendulum (1995)
34. Gäfvert, M.: Modelling the Furuta pendulum. Technical report ISRN LUTFD2/TFRT-7574-SE, Department of Automatic Control, Lund University, Sweden (1998)
35. Verdinelli, I.: A note on Bayesian design for the normal linear model with unknown error variance. Biometrika 87 (2000) 222-227