Nonextensive Generalizations of the Jensen-Shannon Divergence
Authors: André F. T. Martins, Pedro M. Q. Aguiar, Mário A. T. Figueiredo
Abstract—Convexity is a key concept in information theory, namely via the many implications of Jensen's inequality, such as the non-negativity of the Kullback-Leibler divergence (KLD). Jensen's inequality also underlies the concept of Jensen-Shannon divergence (JSD), which is a symmetrized and smoothed version of the KLD. This paper introduces new JSD-type divergences, by extending its two building blocks: convexity and Shannon's entropy. In particular, a new concept of q-convexity is introduced and shown to satisfy a Jensen q-inequality. Based on this Jensen q-inequality, the Jensen-Tsallis q-difference is built, which is a nonextensive generalization of the JSD, based on Tsallis entropies. Finally, the Jensen-Tsallis q-difference is characterized in terms of convexity and extrema.

Index Terms—Convexity, Tsallis entropy, nonextensive entropies, Jensen-Shannon divergence, mutual information.

This work was partially supported by Fundação para a Ciência e Tecnologia (FCT), Portuguese Ministry of Science and Higher Education, under project PTDC/EEA-TEL/72572/2006.

A. Martins is with the Department of Electrical and Computer Engineering, Instituto Superior Técnico, 1049-001 Lisboa, Portugal, and with the Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Email: afm@cs.cmu.edu

P. Aguiar is with the Instituto de Sistemas e Robótica and the Department of Electrical and Computer Engineering, Instituto Superior Técnico, 1049-001 Lisboa, Portugal; email: aguiar@isr.ist.utl.pt

M. Figueiredo is with the Instituto de Telecomunicações and the Department of Electrical and Computer Engineering, Instituto Superior Técnico, 1049-001 Lisboa, Portugal; email: mario.figueiredo@lx.it.pt

I. INTRODUCTION

The central role played by the Shannon entropy in information theory has stimulated the proposal of several generalizations and extensions during the last decades (see, e.g., [1], [2], [3], [4], [5], [6], [7]). One of the best known of these generalizations is the family of Rényi entropies, which has the Shannon entropy as a limit case [1] and has been used in several applications (e.g., [8], [9]). The Rényi and Shannon entropies share the well-known additivity property, under which the joint entropy of a pair of independent random variables is simply the sum of the individual entropies. In other generalizations, namely those introduced by Havrda-Charvát [2], Daróczy [3], and Tsallis [7], the additivity property is abandoned, yielding the so-called nonextensive entropies. These nonextensive entropies have raised great interest among physicists in modeling certain physical phenomena (such as those exhibiting long-range interactions and multifractal behavior) and as a framework for nonextensive generalizations of the classical Boltzmann-Gibbs statistical mechanics [10], [11]. Nonextensive entropies have also recently been used in signal/image processing (e.g., [12], [13], [14]) and many other areas [15].
Convexity is a key concept in information theory, namely via the many important corollaries of Jensen's inequality [16], such as the non-negativity of the relative Shannon entropy, or Kullback-Leibler divergence (KLD) [17]. Jensen's inequality is also at the basis of the concept of Jensen-Shannon divergence (JSD), which is a symmetrized and smoothed version of the KLD [18], [19]. The JSD is widely used in areas such as statistics, machine learning, image and signal processing, and physics.

The goal of this paper is to introduce new extensions of JSD-type divergences, by extending its two building blocks: convexity and the Shannon entropy. In previous work [?], we investigated how these extensions may be applied in kernel-based machine learning. More specifically, the main contributions of this paper are:

• The concept of q-convexity, as a generalization of convexity, for which we prove a Jensen q-inequality. The related concept of Jensen q-differences, which generalize Jensen differences, is also proposed. Based on these concepts, we introduce the Jensen-Tsallis q-difference, a nonextensive generalization of the JSD, which is also a "mutual information" in the sense of Furuichi [20].

• Characterization of the Jensen-Tsallis q-difference with respect to convexity and its extrema, extending results obtained by Burbea and Rao [21] and by Lin [19] for the JSD.

The rest of the paper is organized as follows. Section II reviews the concepts of nonextensive entropies, with emphasis on the Tsallis case. Section III discusses Jensen differences and divergences. The concepts of q-differences and q-convexity are introduced in Section IV, where they are used to define and characterize some new divergence-type quantities. Section V defines the Jensen-Tsallis q-difference and derives some of its properties. Finally, Section VI contains concluding remarks and mentions directions for future research.

II. NONEXTENSIVE ENTROPIES

A. Suyari's Axiomatization

Inspired by the Shannon-Khinchin axiomatic formulation of the Shannon entropy [22], [23], Suyari proposed an axiomatic framework for nonextensive entropies and a uniqueness theorem [24]. Let

\[
\Delta_{n-1} := \Big\{ (p_1, \ldots, p_n) \in \mathbb{R}^n : p_i \ge 0, \; \sum_{i=1}^{n} p_i = 1 \Big\} \tag{1}
\]

denote the $(n-1)$-dimensional simplex. The Suyari axioms (see the Appendix) determine the function $S_{q,\phi} : \Delta_{n-1} \to \mathbb{R}$ given by

\[
S_{q,\phi}(p_1, \ldots, p_n) = \frac{k}{\phi(q)} \Big( 1 - \sum_{i=1}^{n} p_i^q \Big), \tag{2}
\]

where $q, k \in \mathbb{R}_+$, $S_{1,\phi} := \lim_{q \to 1} S_{q,\phi}$, and $\phi : \mathbb{R}_+ \to \mathbb{R}$ is a continuous function satisfying the following three conditions: (i) $\phi(q)$ has the same sign as $q - 1$; (ii) $\phi(q)$ vanishes if and only if $q = 1$; (iii) $\phi$ is differentiable in a neighborhood of 1 and $\phi'(1) = 1$.

For any $\phi$ satisfying these conditions, $S_{q,\phi}$ has the pseudoadditivity property: for any two independent random variables $A$ and $B$, with probability mass functions $p_A \in \Delta_{n_A - 1}$ and $p_B \in \Delta_{n_B - 1}$, respectively,

\[
S_{q,\phi}(A \times B) = S_{q,\phi}(A) + S_{q,\phi}(B) - \frac{\phi(q)}{k}\, S_{q,\phi}(A)\, S_{q,\phi}(B),
\]

where we denote (as usual) $S_{q,\phi}(A) := S_{q,\phi}(p_A)$. For $q = 1$, we recover the Shannon entropy,

\[
S_{1,\phi}(p_1, \ldots, p_n) = H(p_1, \ldots, p_n) = -k \sum_{i=1}^{n} p_i \ln p_i, \tag{3}
\]

thus pseudoadditivity turns into additivity.
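To make the pseudoadditivity property concrete, here is a short Python sketch (an illustration added for this presentation, not part of the formal development; the function name `tsallis_entropy` is ours) that checks the identity numerically for the Tsallis choice $\phi(q) = q - 1$ adopted in Section II-B:

```python
import numpy as np

def tsallis_entropy(p, q, k=1.0):
    """S_q(p) = k/(q-1) * (1 - sum_i p_i^q), with the Shannon entropy as q -> 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero entries contribute nothing
    if np.isclose(q, 1.0):
        return -k * np.sum(p * np.log(p))
    return k / (q - 1.0) * (1.0 - np.sum(p ** q))

# Two independent distributions and the resulting product (joint) distribution.
pA = np.array([0.2, 0.5, 0.3])
pB = np.array([0.6, 0.4])
pAB = np.outer(pA, pB).ravel()

q, k = 1.5, 1.0
lhs = tsallis_entropy(pAB, q, k)
rhs = (tsallis_entropy(pA, q, k) + tsallis_entropy(pB, q, k)
       - ((q - 1) / k) * tsallis_entropy(pA, q, k) * tsallis_entropy(pB, q, k))
assert np.isclose(lhs, rhs)  # pseudoadditivity with phi(q) = q - 1
```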
B. Tsallis Entropies

Several proposals for $\phi$ have appeared [2], [3], [7]. In the rest of the paper, we set $\phi(q) = q - 1$, which yields the Tsallis entropy:

\[
S_q(p_1, \ldots, p_n) = \frac{k}{q-1} \Big( 1 - \sum_{i=1}^{n} p_i^q \Big). \tag{4}
\]

To simplify, we let $k = 1$ and write the Tsallis entropy as

\[
S_q(X) := S_q(p_1, \ldots, p_n) = -\sum_{x \in \mathcal{X}} p(x)^q \ln_q p(x), \tag{5}
\]

where $\ln_q(x) := (x^{1-q} - 1)/(1-q)$ is the q-logarithm function, which satisfies $\ln_q(xy) = \ln_q(x) + x^{1-q} \ln_q(y)$ and $\ln_q(1/x) = -x^{q-1} \ln_q(x)$.

Furuichi derived some information theoretic properties of Tsallis entropies [20]. Tsallis joint and conditional entropies are defined, respectively, as

\[
S_q(X, Y) := -\sum_{x,y} p(x,y)^q \ln_q p(x,y) \tag{6}
\]

and

\[
S_q(X|Y) := -\sum_{x,y} p(x,y)^q \ln_q p(x|y) = \sum_{y} p(y)^q\, S_q(X|y), \tag{7}
\]

and the chain rule $S_q(X,Y) = S_q(X) + S_q(Y|X)$ holds. For two probability mass functions $p_X, p_Y \in \Delta_{n-1}$, the Tsallis relative entropy, generalizing the KLD, is defined as

\[
D_q(p_X \,\|\, p_Y) := -\sum_{x} p_X(x) \ln_q \frac{p_Y(x)}{p_X(x)}. \tag{8}
\]

Finally, the Tsallis mutual entropy is defined as

\[
I_q(X;Y) := S_q(X) - S_q(X|Y) = S_q(Y) - S_q(Y|X), \tag{9}
\]

generalizing (for $q > 1$) Shannon's mutual information [20]. In Section V, we establish a relationship between the Tsallis mutual entropy and a quantity called the Jensen-Tsallis q-difference, generalizing the one between mutual information and the JSD [25].

Furuichi considers an alternative generalization of Shannon's mutual information,

\[
\tilde{I}_q(X;Y) := D_q(p_{X,Y} \,\|\, p_X \otimes p_Y), \tag{10}
\]

where $p_{X,Y}$ is the true joint probability mass function of $(X,Y)$ and $p_X \otimes p_Y$ denotes their joint probability if they were independent [20]. This alternative definition has also been used as a "Tsallis mutual entropy" [26]; notice that $I_q(X;Y) \neq \tilde{I}_q(X;Y)$ in general, the case $q = 1$ being a notable exception. In Section V, we show that this alternative definition also leads to a nonextensive analogue of the JSD.
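The q-logarithm and the Tsallis relative entropy (8) are straightforward to compute. The following sketch (illustrative; it assumes NumPy and uses our own function names) verifies the deformed product rule of $\ln_q$ stated above and checks that $D_q$ reduces to the KLD at $q = 1$:

```python
import numpy as np

def ln_q(x, q):
    """q-logarithm ln_q(x) = (x^(1-q) - 1)/(1 - q); ordinary log in the limit q -> 1."""
    x = np.asarray(x, dtype=float)
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_relative_entropy(px, py, q):
    """D_q(px || py) = -sum_x px(x) ln_q(py(x)/px(x)), eq. (8)."""
    px, py = np.asarray(px, dtype=float), np.asarray(py, dtype=float)
    mask = px > 0
    return -np.sum(px[mask] * ln_q(py[mask] / px[mask], q))

x, y, q = 2.0, 3.0, 1.7
# Deformed product rule: ln_q(xy) = ln_q(x) + x^(1-q) ln_q(y).
assert np.isclose(ln_q(x * y, q), ln_q(x, q) + x ** (1 - q) * ln_q(y, q))
# At q = 1, the Tsallis relative entropy is the ordinary KLD.
px, py = np.array([0.3, 0.7]), np.array([0.5, 0.5])
assert np.isclose(tsallis_relative_entropy(px, py, 1.0),
                  np.sum(px * np.log(px / py)))
```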
C. Denormalization of Tsallis Entropies

In the sequel, we extend the domain of Tsallis entropies from $\Delta_{n-1}$ to the set of unnormalized measures, $\mathbb{R}^n_+ := \{ (x_1, \ldots, x_n) \in \mathbb{R}^n \;|\; \forall i,\ x_i \ge 0 \}$. The Tsallis entropy of a measure is defined as

\[
S_q(x_1, \ldots, x_n) := -\sum_{i=1}^{n} x_i^q \ln_q x_i = \sum_{i=1}^{n} \varphi_q(x_i), \tag{11}
\]

where $\varphi_q : \mathbb{R}_+ \to \mathbb{R}$ is given by

\[
\varphi_q(y) = -y^q \ln_q y = \begin{cases} -y \ln y, & \text{if } q = 1, \\ (y - y^q)/(q-1), & \text{if } q \neq 1. \end{cases} \tag{12}
\]

III. JENSEN DIFFERENCES AND DIVERGENCES

A. The Jensen Difference

Jensen's inequality states that, if $f$ is a concave function and $X$ is an integrable real-valued random variable,

\[
f(E[X]) - E[f(X)] \ge 0. \tag{13}
\]

Burbea and Rao studied the difference in the left hand side of (13), with $f := H_\varphi$, where $H_\varphi : [a,b]^n \to \mathbb{R}$ is a concave function, called a $\varphi$-entropy, defined as

\[
H_\varphi(x) := -\sum_{i=1}^{n} \varphi(x_i), \tag{14}
\]

where $\varphi : [a,b] \to \mathbb{R}$ is convex [21]. The result is called the Jensen difference, as formalized in the following definition.

Definition 1: The Jensen difference $J^\pi_\Psi : \mathbb{R}^{nm}_+ \to \mathbb{R}$ induced by a (concave) generalized entropy $\Psi : \mathbb{R}^n_+ \to \mathbb{R}$ and weighted by $(\pi_1, \ldots, \pi_m) \in \Delta_{m-1}$ is

\[
J^\pi_\Psi(x_1, \ldots, x_m) := \Psi\Big( \sum_{j=1}^{m} \pi_j x_j \Big) - \sum_{j=1}^{m} \pi_j \Psi(x_j) = \Psi(E[X]) - E[\Psi(X)], \tag{15}
\]

where both expectations are with respect to $(\pi_1, \ldots, \pi_m)$.

In the following subsections, we consider several instances of Definition 1, leading to several Jensen-type divergences.

B. The Jensen-Shannon Divergence

Let $P$ be a random probability distribution taking values in $\{p_y\}_{y=1,\ldots,m} \subseteq \Delta_{n-1}$ according to a distribution $\pi = (\pi_1, \ldots, \pi_m) \in \Delta_{m-1}$. (In classification/estimation theory parlance, $\pi$ is called the prior distribution and $p_y := p(\cdot|y)$ the likelihood function.) Then, (15) becomes

\[
J^\pi_\Psi(p_1, \ldots, p_m) = \Psi(E[P]) - E[\Psi(P)], \tag{16}
\]

where the expectations are with respect to $\pi$.

Let now $\Psi = H$, the Shannon entropy. Consider the random variables $Y$ and $X$, taking values respectively in $\mathcal{Y} = \{1, \ldots, m\}$ and $\mathcal{X} = \{1, \ldots, n\}$, with probability mass functions $\pi(y) := \pi_y$ and $p(x) := \sum_{y=1}^{m} p(x|y)\, \pi(y)$. Using standard notation of information theory [17],

\[
J^\pi(P) := J^\pi_H(p_1, \ldots, p_m) = H(X) - H(X|Y) = I(X;Y), \tag{17}
\]

where $I(X;Y)$ is the mutual information between $X$ and $Y$. Since $I(X;Y)$ is also equal to the KLD between the joint distribution and the product of the marginals [17], we have

\[
J^\pi(P) = H(E[P]) - E[H(P)] = E[D(P \,\|\, E[P])]. \tag{18}
\]

The quantity $J^\pi_H(p_1, \ldots, p_m)$ is called the Jensen-Shannon divergence (JSD) of $p_1, \ldots, p_m$, with weights $\pi_1, \ldots, \pi_m$ [21], [19]. Equality (18) allows two interpretations of the JSD: (i) the Jensen difference of the Shannon entropy of $P$; or (ii) the expected KLD between $P$ and the expectation of $P$.

A remarkable fact is that $J^\pi(P) = \min_Q E[D(P \,\|\, Q)]$, i.e., $Q^* = E[P]$ is a minimizer of $E[D(P \,\|\, Q)]$ with respect to $Q$. It has been shown that this property, together with equality (18), characterizes the so-called Bregman divergences: they hold not only for $\Psi = H$, but for any concave $\Psi$ and the corresponding Bregman divergence, in which case $J^\pi_\Psi$ is the Bregman information (see [27] for details).

When $m = 2$ and $\pi = (1/2, 1/2)$, $P$ may be seen as a random distribution whose value on $\{p_1, p_2\}$ is chosen by tossing a fair coin. In this case, $J^{(1/2,1/2)}(P) = JS(p_1, p_2)$, where

\[
JS(p_1, p_2) = H\Big( \frac{p_1 + p_2}{2} \Big) - \frac{H(p_1) + H(p_2)}{2} = \frac{1}{2} D\Big( p_1 \,\Big\|\, \frac{p_1 + p_2}{2} \Big) + \frac{1}{2} D\Big( p_2 \,\Big\|\, \frac{p_1 + p_2}{2} \Big), \tag{19}
\]

as introduced in [19]. It has been shown that $\sqrt{JS}$ satisfies the triangle inequality (hence being a metric) and that, moreover, it is a Hilbertian metric [28], [29].
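As a concrete illustration of (18)-(19), the sketch below (ours, not part of the paper's formal development) computes the JSD both as the Jensen difference of the Shannon entropy and as the expected KLD to the mixture, and checks that the two coincide:

```python
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kld(p, r):
    p, r = np.asarray(p, dtype=float), np.asarray(r, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / r[mask]))

def jsd(p1, p2):
    """JS(p1, p2) as the Jensen difference of the Shannon entropy, eq. (19)."""
    m = 0.5 * (np.asarray(p1, dtype=float) + np.asarray(p2, dtype=float))
    return shannon_entropy(m) - 0.5 * (shannon_entropy(p1) + shannon_entropy(p2))

p1 = np.array([0.1, 0.4, 0.5])
p2 = np.array([0.3, 0.3, 0.4])
m = 0.5 * (p1 + p2)
# The two readings of (18) agree: Jensen difference == expected KLD to the mixture.
assert np.isclose(jsd(p1, p2), 0.5 * kld(p1, m) + 0.5 * kld(p2, m))
```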
C. The Jensen-Rényi Divergence

Consider again the scenario above (Subsection III-B), now with the Rényi q-entropy

\[
R_q(p) = \frac{1}{1-q} \ln \sum_{i=1}^{n} p_i^q \tag{20}
\]

replacing the Shannon entropy. The Rényi q-entropy is concave for $q \in [0,1)$ and has the Shannon entropy as the limit when $q \to 1$ [1]. Letting $\Psi = R_q$, (16) becomes

\[
J^\pi_{R_q}(p_1, \ldots, p_m) = R_q(E[P]) - E[R_q(P)]. \tag{21}
\]

Unlike in the JSD case, there is no counterpart of equality (18) based on the Rényi q-divergence

\[
D_{R_q}(p_1 \,\|\, p_2) = \frac{1}{q-1} \ln \sum_{i=1}^{n} p_{1i}^q\, p_{2i}^{1-q}. \tag{22}
\]

The quantity $J^\pi_{R_q}$ in (21) is called the Jensen-Rényi divergence (JRD). Furthermore, when $m = 2$ and $\pi = (1/2, 1/2)$, we write $J^\pi_{R_q}(P) = JR_q(p_1, p_2)$, where

\[
JR_q(p_1, p_2) = R_q\Big( \frac{p_1 + p_2}{2} \Big) - \frac{R_q(p_1) + R_q(p_2)}{2}. \tag{23}
\]

The JRD has been used in several signal/image processing applications, such as registration, segmentation, denoising, and classification [30], [31], [32].

D. The Jensen-Tsallis Divergence

Burbea and Rao have defined divergences of the form (16) based on the Tsallis q-entropy $S_q$, defined in (11) [21]. Like the Shannon entropy, but unlike the Rényi entropies, the Tsallis q-entropy is an instance of a $\varphi$-entropy (see (14)). Letting $\Psi = S_q$, (16) becomes

\[
J^\pi_{S_q}(p_1, \ldots, p_m) = S_q(E[P]) - E[S_q(P)]. \tag{24}
\]

Again, as in Subsection III-C, if we consider the Tsallis q-divergence,

\[
D_q(p_1 \,\|\, p_2) = \frac{1}{1-q} \Big( 1 - \sum_{i=1}^{n} p_{1i}^q\, p_{2i}^{1-q} \Big), \tag{25}
\]

there is no counterpart of the equality (18).

The quantity $J^\pi_{S_q}$ in (24) is called the Jensen-Tsallis divergence (JTD) and it has also been applied in image processing [33]. Unlike the JSD, the JTD lacks an interpretation as a mutual information. In spite of this, for $q \in [1,2]$, the JTD exhibits joint convexity [21]. In the next section, we propose an alternative to the JTD which, amongst other features, is interpretable as a nonextensive mutual information (in the sense of Furuichi [20]) and is jointly convex, for $q \in [0,1]$.

IV. q-CONVEXITY AND q-DIFFERENCES

A. Introduction

This section introduces a novel class of functions, termed Jensen q-differences (JqD), that generalizes Jensen differences. We will later (Section V) use the JqD to define the Jensen-Tsallis q-difference (JTqD), which we propose as an alternative nonextensive generalization of the JSD, instead of the JTD discussed in Subsection III-D.

We begin by recalling the concept of q-expectation, which is used in nonextensive thermodynamics [7].

Definition 2: The unnormalized q-expectation of a finite random variable $X \in \mathcal{X}$, with probability mass function $P_X(x)$, is

\[
E_q[X] := \sum_{x \in \mathcal{X}} x\, P_X(x)^q. \tag{26}
\]

Of course, $q = 1$ corresponds to the standard notion of expectation. For $q \neq 1$, the q-expectation does not correspond to the intuitive meaning of average/expectation (e.g., $E_q[1] \neq 1$ in general). Nonetheless, it has been used in the construction of nonextensive information theoretic concepts such as the Tsallis entropy, which can be written compactly as $S_q(X) = -E_q[\ln_q p(X)]$.
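A minimal sketch of the q-expectation (reusing `ln_q` and `tsallis_entropy` from the earlier sketches) illustrates both remarks: $E_q[1] \neq 1$ for $q \neq 1$, and the compact form $S_q(X) = -E_q[\ln_q p(X)]$:

```python
import numpy as np

def q_expectation(values, probs, q):
    """Unnormalized q-expectation E_q[X] = sum_x x * P(x)^q (Definition 2)."""
    values, probs = np.asarray(values, dtype=float), np.asarray(probs, dtype=float)
    return np.sum(values * probs ** q)

p = np.array([0.2, 0.3, 0.5])
q = 1.8
print(q_expectation(np.ones_like(p), p, q))   # E_q[1] = sum_i p_i^q != 1 here
# Compact form of the Tsallis entropy: S_q(X) = -E_q[ln_q p(X)].
assert np.isclose(-q_expectation(ln_q(p, q), p, q), tsallis_entropy(p, q))
```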
B. q-Convexity

We now introduce the novel concept of q-convexity and use it to derive a set of results, among which we emphasize a Jensen q-inequality.

Definition 3: Let $q \in \mathbb{R}$ and $\mathcal{X}$ be a convex set. A function $f : \mathcal{X} \to \mathbb{R}$ is q-convex if for any $x, y \in \mathcal{X}$ and $\lambda \in [0,1]$,

\[
f(\lambda x + (1-\lambda) y) \le \lambda^q f(x) + (1-\lambda)^q f(y). \tag{27}
\]

Naturally, $f$ is q-concave if $-f$ is q-convex.

Of course, 1-convexity is the usual notion of convexity. The next proposition states the Jensen q-inequality.

Proposition 4: If $f : \mathcal{X} \to \mathbb{R}$ is q-convex, then for any $n \in \mathbb{N}$, $x_1, \ldots, x_n \in \mathcal{X}$ and $\pi = (\pi_1, \ldots, \pi_n) \in \Delta_{n-1}$,

\[
f\Big( \sum_i \pi_i x_i \Big) \le \sum_i \pi_i^q f(x_i). \tag{28}
\]

Proof: Use induction, exactly as in the proof of the standard Jensen inequality (e.g., [17]).

Proposition 5: Let $f \ge 0$ and $q \ge q' \ge 0$; then,

\[
f \text{ is } q\text{-convex} \;\Rightarrow\; f \text{ is } q'\text{-convex}, \tag{29}
\]
\[
-f \text{ is } q'\text{-convex} \;\Rightarrow\; -f \text{ is } q\text{-convex}. \tag{30}
\]

Proof: Implication (29) results from

\[
f(\lambda x + (1-\lambda) y) \le \lambda^q f(x) + (1-\lambda)^q f(y) \le \lambda^{q'} f(x) + (1-\lambda)^{q'} f(y),
\]

where the first inequality states the q-convexity of $f$ and the second one is valid because $f(x), f(y) \ge 0$ and $t^{q'} \ge t^{q} \ge 0$, for any $t \in [0,1]$ and $q \ge q'$. The proof of (30) is analogous.
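Before proceeding, a quick numerical spot-check of the q-Jensen inequality (28) may be helpful. We apply it to $f = -S_q$, which is q-convex for $q \ge 1$ ($S_q$ is concave and nonnegative, so Proposition 5 applies; this is exactly the argument used in the proof of Proposition 11 below). The sketch is ours and reuses `tsallis_entropy` from the earlier sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
q, m, n = 1.5, 4, 6
pi = rng.dirichlet(np.ones(m))           # weights (pi_1, ..., pi_m) in the simplex
ps = rng.dirichlet(np.ones(n), size=m)   # m random distributions on n points

lhs = -tsallis_entropy(pi @ ps, q)       # f(sum_i pi_i x_i), with f = -S_q
rhs = -np.sum(pi ** q * np.array([tsallis_entropy(p, q) for p in ps]))
assert lhs <= rhs + 1e-12                # the q-Jensen inequality (28)
```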
C. Jensen q-Differences

We now generalize Jensen differences, formalized in Definition 1, by introducing the concept of Jensen q-differences.

Definition 6: For $q \ge 0$, the Jensen q-difference induced by a (concave) generalized entropy $\Psi : \mathbb{R}^n_+ \to \mathbb{R}$ and weighted by $(\pi_1, \ldots, \pi_m) \in \Delta_{m-1}$ is

\[
T^\pi_{q,\Psi}(x_1, \ldots, x_m) := \Psi\Big( \sum_{j=1}^{m} \pi_j x_j \Big) - \sum_{j=1}^{m} \pi_j^q \Psi(x_j) = \Psi(E[X]) - E_q[\Psi(X)], \tag{31}
\]

where the expectation and the q-expectation are with respect to $(\pi_1, \ldots, \pi_m)$.

Burbea and Rao established necessary and sufficient conditions for the Jensen difference of a $\varphi$-entropy to be convex [21]. The following proposition generalizes that result, extending it to Jensen q-differences.

Proposition 7: Let $\varphi : [0,1] \to \mathbb{R}$ be a function of class $C^2$ and consider the ($\varphi$-entropy [21]) function $\Psi : [0,1]^n \to \mathbb{R}$ defined by $\Psi(z) := -\sum_{i=1}^{n} \varphi(z_i)$. Then, the Jensen q-difference $T^\pi_{q,\Psi} : [0,1]^{nm} \to \mathbb{R}$ is convex if and only if $\varphi$ is convex and $-1/\varphi''$ is $(2-q)$-convex.

Proof: The case $q = 1$ corresponds to the Jensen difference and was proved by Burbea and Rao (Theorem 1 in [21]). Our proof extends that of Burbea and Rao to $q \neq 1$. In general, $y = \{y_1, \ldots, y_m\}$, where $y_t = \{y_{t1}, \ldots, y_{tn}\}$, thus

\[
T^\pi_{q,\Psi}(y) = \Psi\Big( \sum_{t=1}^{m} \pi_t y_t \Big) - \sum_{t=1}^{m} \pi_t^q \Psi(y_t) = \sum_{i=1}^{n} \Big[ \sum_{t=1}^{m} \pi_t^q \varphi(y_{ti}) - \varphi\Big( \sum_{t=1}^{m} \pi_t y_{ti} \Big) \Big],
\]

showing that it suffices to consider $n = 1$, i.e.,

\[
T^\pi_{q,\Psi}(y_1, \ldots, y_m) = \sum_{t=1}^{m} \pi_t^q \varphi(y_t) - \varphi\Big( \sum_{t=1}^{m} \pi_t y_t \Big); \tag{32}
\]

this function is convex on $[0,1]^m$ if and only if, for every fixed $a_1, \ldots, a_m \in [0,1]$ and $b_1, \ldots, b_m \in \mathbb{R}$, the function

\[
f(x) = T^\pi_{q,\Psi}(a_1 + b_1 x, \ldots, a_m + b_m x) \tag{33}
\]

is convex in $\{ x \in \mathbb{R} : a_t + b_t x \in [0,1],\ t = 1, \ldots, m \}$. Since $f$ is $C^2$, it is convex if and only if $f''(x) \ge 0$.

We first show that convexity of $f$ (equivalently of $T^\pi_{q,\Psi}$) implies convexity of $\varphi$. Letting $c_t = a_t + b_t x$,

\[
f''(x) = \sum_{t=1}^{m} \pi_t^q b_t^2 \varphi''(c_t) - \Big( \sum_{t=1}^{m} \pi_t b_t \Big)^2 \varphi''\Big( \sum_{t=1}^{m} \pi_t c_t \Big). \tag{34}
\]

By choosing $x = 0$, $a_t = a \in [0,1]$ for $t = 1, \ldots, m$, and $b_1, \ldots, b_m$ satisfying $\sum_t \pi_t b_t = 0$ in (34), we get

\[
f''(0) = \varphi''(a) \sum_{t=1}^{m} \pi_t^q b_t^2,
\]

hence, if $f$ is convex, $\varphi''(a) \ge 0$, thus $\varphi$ is convex.

Next, we show that convexity of $f$ also implies $(2-q)$-convexity of $-1/\varphi''$. By choosing $x = 0$ (thus $c_t = a_t$) and $b_t = \pi_t^{1-q} (\varphi''(a_t))^{-1}$, we get

\[
f''(0) = \sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(a_t)} - \Big( \sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(a_t)} \Big)^2 \varphi''\Big( \sum_{t=1}^{m} \pi_t a_t \Big)
= \Big( \sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(a_t)} \Big) \varphi''\Big( \sum_{t=1}^{m} \pi_t a_t \Big) \Bigg[ \frac{1}{\varphi''\big( \sum_{t=1}^{m} \pi_t a_t \big)} - \sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(a_t)} \Bigg],
\]

where the expression inside the square brackets is the Jensen $(2-q)$-difference of $1/\varphi''$ (see Definition 6). Since $\varphi''(x) \ge 0$, the factor outside the square brackets is non-negative; thus, if $f$ is convex, the Jensen $(2-q)$-difference of $1/\varphi''$ is also nonnegative and $-1/\varphi''$ is $(2-q)$-convex.

Finally, we show that if $\varphi$ is convex and $-1/\varphi''$ is $(2-q)$-convex, then $f'' \ge 0$, thus $T^\pi_{q,\Psi}$ is convex. Let $r_t = \big( q \pi_t^{2-q} / \varphi''(c_t) \big)^{1/2}$ and $s_t = b_t \big( \pi_t^q \varphi''(c_t) / q \big)^{1/2}$; then, non-negativity of $f''$ results from the following chain of inequalities/equalities:

\[
0 \le \Big( \sum_{t=1}^{m} r_t^2 \Big) \Big( \sum_{t=1}^{m} s_t^2 \Big) - \Big( \sum_{t=1}^{m} r_t s_t \Big)^2 \tag{35}
\]
\[
= \Big( \sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(c_t)} \Big) \Big( \sum_{t=1}^{m} b_t^2 \pi_t^q \varphi''(c_t) \Big) - \Big( \sum_{t=1}^{m} b_t \pi_t \Big)^2 \tag{36}
\]
\[
\le \frac{1}{\varphi''\big( \sum_{t=1}^{m} \pi_t c_t \big)} \sum_{t=1}^{m} b_t^2 \pi_t^q \varphi''(c_t) - \Big( \sum_{t=1}^{m} b_t \pi_t \Big)^2 \tag{37}
\]
\[
= \frac{1}{\varphi''\big( \sum_{t=1}^{m} \pi_t c_t \big)} \cdot f''(x), \tag{38}
\]

where: (35) is the Cauchy-Schwarz inequality; equality (36) results from the definitions of $r_t$ and $s_t$ and from the fact that $r_t s_t = b_t \pi_t$; inequality (37) states the $(2-q)$-convexity of $-1/\varphi''$; equality (38) results from (34).

V. THE JENSEN-TSALLIS q-DIFFERENCE

A. Definition

As in Subsection III-B, let $P$ be a random probability distribution taking values in $\{p_y\}_{y=1,\ldots,m}$ according to a distribution $\pi = (\pi_1, \ldots, \pi_m) \in \Delta_{m-1}$. Then, we may write

\[
T^\pi_{q,\Psi}(p_1, \ldots, p_m) = \Psi(E[P]) - E_q[\Psi(P)], \tag{39}
\]

where the expectations are with respect to $\pi$. Hence Jensen q-differences may be seen as deformations of the standard Jensen differences (16), in which the second expectation is replaced by a q-expectation.

Let now $\Psi = S_q$, the nonextensive Tsallis q-entropy. Introducing the random variables $Y$ and $X$, with values respectively in $\mathcal{Y} = \{1, \ldots, m\}$ and $\mathcal{X} = \{1, \ldots, n\}$, and probability mass functions $\pi(y) := \pi_y$ and $p(x) := \sum_{y=1}^{m} p(x|y)\, \pi(y)$, we have (writing $T^\pi_{q,S_q}$ simply as $T^\pi_q$)

\[
T^\pi_q(p_1, \ldots, p_m) = S_q(X) - S_q(X|Y) = I_q(X;Y), \tag{40}
\]

where $S_q(X|Y)$ is the Tsallis conditional q-entropy and $I_q(X;Y)$ is the Tsallis mutual q-entropy, as defined by Furuichi [20]. Observe that (40) is a nonextensive analogue of (17).

Since, in general, $I_q \neq \tilde{I}_q$ (see (10)), unless $q = 1$ ($I_1 = \tilde{I}_1 = I$), there is no counterpart of (18) in terms of q-differences. Nevertheless, Lamberti and Majtey have proposed a non-logarithmic version of the JSD, which corresponds to using $\tilde{I}_q$ for the Tsallis mutual q-entropy (although this interpretation is not explicitly mentioned by those authors) [26].

We call the quantity $T^\pi_q(p_1, \ldots, p_m)$ the Jensen-Tsallis q-difference (JTqD) of $p_1, \ldots, p_m$ with weights $\pi_1, \ldots, \pi_m$. Although the JTqD is a generalization of the Jensen-Shannon divergence, for $q \neq 1$ the term "divergence" would be misleading in this case, since $T^\pi_q$ may take negative values (if $q < 1$) and does not vanish in general if $P$ is deterministic.

When $m = 2$ and $\pi = (1/2, 1/2)$, define $T_q := T^{(1/2,1/2)}_q$,

\[
T_q(p_1, p_2) = S_q\Big( \frac{p_1 + p_2}{2} \Big) - \frac{S_q(p_1) + S_q(p_2)}{2^q}. \tag{41}
\]

Notable cases arise for particular values of $q$ (see the sketch after this list):

• For $q = 0$, $S_0(p) = -1 + \|p\|_0$, where $\|x\|_0$ denotes the so-called 0-norm (although it is not a norm) of vector $x$, i.e., its number of nonzero components. The Jensen-Tsallis 0-difference is thus
\[
T_0(p_1, p_2) = 1 - \|p_1 \odot p_2\|_0, \tag{42}
\]
where $\odot$ denotes the Hadamard-Schur (i.e., elementwise) product. We call $T_0$ the Boolean difference.

• For $q = 1$, since $S_1(p) = H(p)$, $T_1$ is the JSD,
\[
T_1(p_1, p_2) = JS(p_1, p_2). \tag{43}
\]

• For $q = 2$, $S_2(p) = 1 - \langle p, p \rangle$, where $\langle x, y \rangle = \sum_i x_i y_i$ is the usual inner product between $x$ and $y$. Consequently, the Tsallis 2-difference is
\[
T_2(p_1, p_2) = \frac{1}{2} - \frac{1}{2} \langle p_1, p_2 \rangle, \tag{44}
\]
which we call the linear difference.
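The sketch below (ours; it reuses `tsallis_entropy` and `jsd` from the earlier sketches) implements (41) and confirms the three special cases above:

```python
import numpy as np

def jt_q_difference(p1, p2, q):
    """T_q(p1, p2) = S_q((p1+p2)/2) - (S_q(p1) + S_q(p2)) / 2^q, eq. (41)."""
    m = 0.5 * (np.asarray(p1, dtype=float) + np.asarray(p2, dtype=float))
    return tsallis_entropy(m, q) - (tsallis_entropy(p1, q)
                                    + tsallis_entropy(p2, q)) / 2.0 ** q

p1 = np.array([0.0, 0.5, 0.5])
p2 = np.array([0.4, 0.0, 0.6])
# q = 1: the ordinary Jensen-Shannon divergence, eq. (43).
assert np.isclose(jt_q_difference(p1, p2, 1.0), jsd(p1, p2))
# q = 2: the linear difference 1/2 - <p1, p2>/2, eq. (44).
assert np.isclose(jt_q_difference(p1, p2, 2.0), 0.5 - 0.5 * np.dot(p1, p2))
# q = 0: the Boolean difference 1 - ||p1 (elementwise) p2||_0, eq. (42).
assert np.isclose(jt_q_difference(p1, p2, 0.0), 1 - np.count_nonzero(p1 * p2))
```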
B. Properties of the JTqD

This subsection presents results regarding convexity and extrema of the JTqD, for several values of $q$, extending known properties of the JSD ($q = 1$).

Some properties of the JSD are lost in the transition to nonextensivity. For example, while the former is nonnegative and vanishes if and only if all the distributions are identical, this is not true in general for the JTqD. Nonnegativity of the JTqD is only guaranteed if $q \ge 1$, which explains why some authors (e.g., [20]) only consider values of $q \ge 1$ when looking for nonextensive analogues of Shannon's information theory. Moreover, unless $q = 1$, it is not generally true that $T^\pi_q(p, \ldots, p) = 0$, or even that $T^\pi_q(p, \ldots, p, p') \ge T^\pi_q(p, \ldots, p, p)$. For example, the solution of the optimization problem

\[
\min_{p_1 \in \Delta_{n-1}} T_q(p_1, p_2) \tag{45}
\]

is, in general, different from $p_2$, unless $q = 1$. Instead, this minimizer is closer to the uniform distribution if $q \in [0,1)$, and closer to a degenerate distribution for $q \in (1,2]$. This is not so surprising: recall that $T_2(p_1, p_2) = \frac{1}{2} - \frac{1}{2} \langle p_1, p_2 \rangle$; in this case, (45) becomes a linear program, whose solution is not $p_2$ but $p_1^* = \delta_j$, where $j = \arg\max_i p_{2i}$ (this is illustrated numerically after Proposition 8 below).

We start by recalling a basic result, which essentially confirms that Tsallis entropies satisfy one of the Suyari axioms (see Axiom A2 in the Appendix), which states that entropies should be maximized by uniform distributions.

Proposition 8: The uniform distribution maximizes the Tsallis entropy for any $q \ge 0$.

Proof: Consider the problem $\max_p S_q(p)$, subject to $\sum_i p_i = 1$ and $p_i \ge 0$. Equating the gradient of the Lagrangian to zero yields

\[
\frac{\partial}{\partial p_i} \Big( S_q(p) + \lambda \Big( \sum_i p_i - 1 \Big) \Big) = -\frac{q}{q-1}\, p_i^{q-1} + \lambda = 0,
\]

for all $i$. Since all these equations are identical, the solution is the uniform distribution, which is a maximum due to the concavity of $S_q$.
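The following small experiment (ours, reusing `jt_q_difference` from the sketch above) illustrates the remark around (45) for $q = 2$: since $T_2$ is linear in $p_1$, its minimizer over the simplex is the vertex $\delta_j$ with $j = \arg\max_i p_{2i}$, not $p_2$ itself:

```python
import numpy as np

rng = np.random.default_rng(1)
p2 = rng.dirichlet(np.ones(5))
best = np.eye(5)[np.argmax(p2)]   # degenerate distribution at argmax_i p2_i
for _ in range(1000):             # random candidates never beat the vertex
    cand = rng.dirichlet(np.ones(5))
    assert (jt_q_difference(best, p2, 2.0)
            <= jt_q_difference(cand, p2, 2.0) + 1e-12)
```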
The following corollary of Proposition 7 establishes the joint convexity of the JTqD, for $q \in [0,1]$. This complements the joint convexity of the JTD, for $q \in [1,2]$, which was proved by Burbea and Rao [21].

Corollary 9: For $q \in [0,1]$, the JTqD is a jointly convex function on $\Delta_{n-1}$. Formally, let $\{p^{(i)}_y\}_{i=1,\ldots,l;\, y=1,\ldots,m}$ be a collection of $l$ sets of $m$ probability distributions on $\mathcal{X} = \{1, \ldots, n\}$; then, for any $(\lambda_1, \ldots, \lambda_l) \in \Delta_{l-1}$,

\[
T^\pi_q\Big( \sum_{i=1}^{l} \lambda_i p^{(i)}_1, \ldots, \sum_{i=1}^{l} \lambda_i p^{(i)}_m \Big) \le \sum_{i=1}^{l} \lambda_i\, T^\pi_q(p^{(i)}_1, \ldots, p^{(i)}_m).
\]

Proof: Observe that the Tsallis entropy (5) of a probability distribution $p_t = (p_{t1}, \ldots, p_{tn})$ can be written as $S_q(p_t) = -\sum_{i=1}^{n} \varphi_q(p_{ti})$, where $\varphi_q(x) = (x - x^q)/(1-q)$; thus, from Proposition 7, $T^\pi_q$ is convex if and only if $\varphi_q$ is convex and $-1/\varphi''_q$ is $(2-q)$-convex. Since $\varphi''_q(x) = q\, x^{q-2}$, $\varphi_q$ is convex for $x \ge 0$ and $q \ge 0$. To show the $(2-q)$-convexity of $-1/\varphi''_q(x) = -(1/q)\, x^{2-q}$, for $x \ge 0$ and $q \in [0,1]$, we use a version of the power mean inequality [34],

\[
-\Big( \sum_{i=1}^{l} \lambda_i x_i \Big)^{2-q} \le -\sum_{i=1}^{l} (\lambda_i x_i)^{2-q} = -\sum_{i=1}^{l} \lambda_i^{2-q} x_i^{2-q},
\]

thus concluding that $-1/\varphi''_q$ is in fact $(2-q)$-convex.

The next corollary, which results from the previous one, provides an upper bound for the JTqD, for $q \in [0,1]$. Although this result is weaker than that of Proposition 11 below, we include it since it provides insight about the upper extrema of the JTqD.

Corollary 10: Let $q \in [0,1]$. Then, $T^\pi_q(p_1, \ldots, p_m) \le S_q(\pi)$.

Proof: From Corollary 9, for $q \in [0,1]$, $T^\pi_q(p_1, \ldots, p_m)$ is convex. Since its domain is a convex polytope (the Cartesian product of $m$ simplices), its maximum occurs at a vertex, i.e., when each argument $p_t$ is a degenerate distribution at some $x_t$, denoted $\delta_{x_t}$. In particular, if $n \ge m$, this maximum occurs at the vertex corresponding to disjoint degenerate distributions, i.e., such that $x_i \neq x_j$ if $i \neq j$. At this maximum,

\[
T^\pi_q(\delta_{x_1}, \ldots, \delta_{x_m}) = S_q\Big( \sum_{t=1}^{m} \pi_t \delta_{x_t} \Big) - \sum_{t=1}^{m} \pi_t^q\, S_q(\delta_{x_t}) = S_q\Big( \sum_{t=1}^{m} \pi_t \delta_{x_t} \Big) \tag{46}
\]
\[
= S_q(\pi), \tag{47}
\]

where the equality in (46) results from $S_q(\delta_{x_t}) = 0$. Notice that this maximum may not be achieved if $n < m$.

The next proposition establishes (upper and lower) bounds for the JTqD, extending Corollary 10 to any non-negative $q$.

Proposition 11: For $q \ge 0$,

\[
T^\pi_q(p_1, \ldots, p_m) \le S_q(\pi), \tag{48}
\]

and, if $n \ge m$, the maximum is reached for a set of disjoint degenerate distributions. As in Corollary 10, this maximum may not be attained if $n < m$. For $q \ge 1$,

\[
T^\pi_q(p_1, \ldots, p_m) \ge 0, \tag{49}
\]

and the minimum is attained in the pure deterministic case, i.e., when all distributions are equal to the same degenerate distribution. Results (48) and (49) still hold when $\mathcal{X}$ and $\mathcal{Y}$ are countable sets. For $q \in [0,1]$,

\[
T^\pi_q(p_1, \ldots, p_m) \ge S_q(\pi)\, [1 - n^{1-q}]. \tag{50}
\]

This lower bound (which is zero or negative) is attained when all distributions are uniform.

Proof: The proof of (48), for $q \ge 0$, results from

\[
T^\pi_q(p_1, \ldots, p_m) = S_q(\pi) + \frac{1}{q-1} \sum_{i} \Big[ \sum_{t} (\pi_t\, p_{ti})^q - \Big( \sum_{t} \pi_t\, p_{ti} \Big)^q \Big] \le S_q(\pi), \tag{51}
\]

where the inequality holds since, for $y_t \ge 0$: if $q \ge 1$, then $\sum_t y_t^q \le \big( \sum_t y_t \big)^q$; if $q \in [0,1]$, then $\sum_t y_t^q \ge \big( \sum_t y_t \big)^q$; in both cases the second term in (51) is non-positive.

The proof that $T^\pi_q \ge 0$ for $q \ge 1$ uses the notion of q-convexity. For countable $\mathcal{X}$, the Tsallis entropy (4) is nonnegative. Since $-S_q$ is 1-convex, then, by Proposition 5, it is also q-convex for $q \ge 1$. Consequently, from the q-Jensen inequality (Proposition 4), for finite $\mathcal{Y}$ with $|\mathcal{Y}| = m$,

\[
T^\pi_q(p_1, \ldots, p_m) = S_q\Big( \sum_{t=1}^{m} \pi_t p_t \Big) - \sum_{t=1}^{m} \pi_t^q\, S_q(p_t) \ge 0.
\]

Since $S_q$ is continuous, so is $T^\pi_q$, thus the inequality is valid in the limit as $m \to \infty$, which proves the assertion for countable $\mathcal{Y}$. Finally, $T^\pi_q(\delta_1, \ldots, \delta_1) = 0$, where $\delta_1$ is some degenerate distribution.

Finally, to prove (50), for $q \in [0,1]$ and finite $\mathcal{X}$,

\[
T^\pi_q(p_1, \ldots, p_m) = S_q\Big( \sum_{t=1}^{m} \pi_t p_t \Big) - \sum_{t=1}^{m} \pi_t^q\, S_q(p_t) \ge \sum_{t=1}^{m} \pi_t\, S_q(p_t) - \sum_{t=1}^{m} \pi_t^q\, S_q(p_t) \tag{52}
\]
\[
= \sum_{t=1}^{m} (\pi_t - \pi_t^q)\, S_q(p_t) \ge S_q(U) \sum_{t=1}^{m} (\pi_t - \pi_t^q) \tag{53}
\]
\[
= S_q(\pi)\, [1 - n^{1-q}], \tag{54}
\]

where inequality (52) results from $S_q$ being concave, and inequality (53) holds since $\pi_t - \pi_t^q \le 0$, for $q \in [0,1]$, and the uniform distribution $U$ maximizes $S_q$ (Proposition 8), with $S_q(U) = (1 - n^{1-q})/(q-1)$.
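The bounds (48)-(50) are easy to probe numerically for $m = 2$ and $\pi = (1/2, 1/2)$. The sketch below (ours, reusing earlier helpers) checks them on random distributions and verifies that disjoint degenerate distributions attain the upper bound:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
s_q_pi = lambda q: tsallis_entropy(np.array([0.5, 0.5]), q)  # S_q of the weights

for q in [0.3, 0.7, 1.0, 1.5, 2.5]:
    for _ in range(200):
        p1, p2 = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))
        t = jt_q_difference(p1, p2, q)
        assert t <= s_q_pi(q) + 1e-12                          # upper bound (48)
        if q >= 1:
            assert t >= -1e-12                                 # lower bound (49)
        else:
            assert t >= s_q_pi(q) * (1 - n ** (1 - q)) - 1e-9  # lower bound (50)
    # Disjoint degenerate distributions attain the upper bound (here n >= m = 2).
    d1, d2 = np.eye(n)[0], np.eye(n)[1]
    assert np.isclose(jt_q_difference(d1, d2, q), s_q_pi(q))
```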
Finally, the next proposition characterizes the convexity/concavity of the JTqD in each argument. As before, it holds more generally when $\mathcal{Y}$ and $\mathcal{X}$ are countable sets.

Proposition 12: The JTqD is convex in each argument, for $q \in [0,2]$, and concave in each argument, for $q \ge 2$.

Proof: Notice that the JTqD can be written as $T^\pi_q(p_1, \ldots, p_m) = \sum_j \psi(p_{1j}, \ldots, p_{mj})$, with

\[
\psi(y_1, \ldots, y_m) = \frac{1}{q-1} \Big[ \sum_i (\pi_i - \pi_i^q)\, y_i + \sum_i \pi_i^q\, y_i^q - \Big( \sum_i \pi_i y_i \Big)^q \Big]. \tag{55}
\]

It suffices to consider the second derivative of $\psi$ with respect to $y_1$. Introducing $z = \sum_{i=2}^{m} \pi_i y_i$,

\[
\frac{\partial^2 \psi}{\partial y_1^2} = q \Big[ \pi_1^q\, y_1^{q-2} - \pi_1^2\, (\pi_1 y_1 + z)^{q-2} \Big] = q\, \pi_1^2 \Big[ (\pi_1 y_1)^{q-2} - (\pi_1 y_1 + z)^{q-2} \Big]. \tag{56}
\]

Since $\pi_1 y_1 \le \pi_1 y_1 + z \le 1$, the quantity in (56) is nonnegative for $q \in [0,2]$ and non-positive for $q \ge 2$.

VI. CONCLUSION

In this paper we have introduced new Jensen-Shannon-type divergences, by extending the JSD's two building blocks: convexity and entropy. We have introduced the concept of q-convexity, for which we have stated and proved a Jensen q-inequality. Based on this concept, we have introduced the Jensen-Tsallis q-difference, a nonextensive generalization of the Jensen-Shannon divergence. We have characterized the Jensen-Tsallis q-difference with respect to convexity and extrema, extending previous results obtained in [21], [19] for the Jensen-Shannon divergence.

ACKNOWLEDGMENTS

The authors would like to thank Prof. Noah Smith for valuable comments and discussions.

APPENDIX

In [24], Suyari proposed the following set of axioms (referred to above as Suyari's axioms) that determine nonextensive entropies $S_{q,\phi} : \Delta_{n-1} \to \mathbb{R}$ of the form stated in (2). In what follows, $q$ is fixed and $f_q$ is a function defined on $\Delta_{n-1}$.

(A1) Continuity: $f_q$ is continuous in $\Delta_{n-1}$, for $q \ge 0$;

(A2) Maximality: for any $q \ge 0$, $n \in \mathbb{N}$, and $(p_1, \ldots, p_n) \in \Delta_{n-1}$, $f_q(p_1, \ldots, p_n) \le f_q(1/n, \ldots, 1/n)$;

(A3) Generalized additivity: for $i = 1, \ldots, n$, $j = 1, \ldots, m_i$, $p_{ij} \ge 0$, and $p_i = \sum_{j=1}^{m_i} p_{ij}$,

\[
f_q(p_{11}, \ldots, p_{n m_n}) = f_q(p_1, \ldots, p_n) + \sum_{i=1}^{n} p_i^q\, f_q\Big( \frac{p_{i1}}{p_i}, \ldots, \frac{p_{i m_i}}{p_i} \Big);
\]

(A4) Expandability: $f_q(p_1, \ldots, p_n, 0) = f_q(p_1, \ldots, p_n)$.

REFERENCES

[1] A. Rényi, "On measures of entropy and information," in Proc. 4th Berkeley Symp. Math. Statist. and Prob., vol. 1. Berkeley: Univ. Calif. Press, 1961, pp. 547–561.
[2] M. E. Havrda and F. Charvát, "Quantification method of classification processes: concept of structural α-entropy," Kybernetika, vol. 3, pp. 30–35, 1967.
[3] Z. Daróczy, "Generalized information functions," Information and Control, vol. 16, no. 1, pp. 36–51, Mar. 1970.
[4] M. Basseville, "Distance measures for signal processing and pattern recognition," Signal Processing, vol. 18, pp. 349–369, 1989.
[5] J. N. Kapur, Measures of Entropy and Their Applications. New Delhi: John Wiley & Sons, 1994.
[6] C. Arndt, Information Measures. Berlin: Springer, 2004.
[7] C. Tsallis, "Possible generalization of Boltzmann-Gibbs statistics," Journal of Statistical Physics, vol. 52, pp. 479–487, 1988.
[8] R. Baraniuk, P. Flandrin, A. Jensen, and O. Michel, "Measuring time-frequency information content using the Rényi entropies," IEEE Transactions on Information Theory, vol. 47, pp. ?–?, 2001.
[9] D. Erdogmus, K. Hild II, M. Lazaro, I. Santamaria, and J. Principe, "Adaptive blind deconvolution of linear channels using Rényi's entropy with Parzen estimation," IEEE Transactions on Signal Processing, vol. 52, pp. 1489–1498, 2004.
[10] S. Abe, "Foundations of nonextensive statistical mechanics," in Chaos, Nonlinearity, Complexity. Springer, 2006.
[11] S. Abe and Y. Okamoto, Nonextensive Statistical Mechanics and Its Applications. Berlin: Springer, 2001.
[12] J. Hu, W. Tung, and J. Gao, "Modeling sea clutter as a nonstationary and nonextensive random process," in IEEE International Conference on Radar, Gainesville, FL, 2006.
[13] Y. Li, X. Fan, and G. Li, "Image segmentation based on Tsallis-entropy and Renyi-entropy and their comparison," in IEEE International Conference on Industrial Informatics, 2006, pp. 943–948.
[14] S. Martin, G. Morison, W. Nailon, and T. Durrani, "Fast and accurate image registration using Tsallis entropy and simultaneous perturbation stochastic approximation," Electronics Letters, vol. 40, pp. 595–597, 2004.
[15] M. Gell-Mann and C. Tsallis, Nonextensive Entropy: Interdisciplinary Applications. Oxford University Press, 2004.
[16] J. Jensen, "Sur les fonctions convexes et les inégalités entre les valeurs moyennes," Acta Mathematica, vol. 30, pp. 175–193, 1906.
[17] T. Cover and J. Thomas, Elements of Information Theory. Wiley, 1991.
[18] J. Lin and S. Wong, "A new directed divergence measure and its characterization," International Journal of General Systems, vol. 17, pp. 73–81, 1990.
[19] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 145–151, 1991.
[20] S. Furuichi, "Information theoretical properties of Tsallis entropies," Journal of Mathematical Physics, vol. 47, no. 2, p. 023302, 2006. [Online]. Available: http://link.aip.org/link/?JMP/47/023302/1
[21] J. Burbea and C. R. Rao, "On the convexity of some divergence measures based on entropy functions," IEEE Transactions on Information Theory, vol. 28, no. 3, pp. 489–495, 1982.
[22] A. Y. Khinchin, Mathematical Foundations of Information Theory. New York: Dover, 1957.
[23] C. Shannon and W. Weaver, The Mathematical Theory of Communication. Urbana, IL: University of Illinois Press, 1949.
[24] H. Suyari, "Generalization of Shannon-Khinchin axioms to nonextensive systems and the uniqueness theorem for the nonextensive entropy," IEEE Transactions on Information Theory, vol. 50, no. 8, pp. 1783–1787, 2004.
[25] I. Grosse, P. Bernaola-Galván, P. Carpena, R. Román-Roldán, J. Oliver, and H. E. Stanley, "Analysis of symbolic sequences using the Jensen-Shannon divergence," Physical Review E, vol. 65, 2002.
[26] P. W. Lamberti and A. P. Majtey, "Non-logarithmic Jensen-Shannon divergence," Physica A: Statistical Mechanics and its Applications, vol. 329, pp. 81–90, Nov. 2003.
[27] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, pp. 1705–1749, 2005. [Online]. Available: http://www.jmlr.org/papers/v6/banerjee05b.html
[28] D. M. Endres and J. E. Schindelin, "A new metric for probability distributions," IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1858–1860, 2003.
[29] F. Topsøe, "Some inequalities for information divergence and related measures of discrimination," IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1602–1609, 2000.
[30] A. Hamza and H. Krim, "Image registration and segmentation by maximizing the Jensen-Rényi divergence," in Proc. of International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition. Lisbon, Portugal: Springer, 2003, pp. 147–163.
[31] Y. He, A. Hamza, and H. Krim, "A generalized divergence measure for robust image registration," IEEE Transactions on Signal Processing, vol. 51, no. 5, pp. 1211–1220, 2003.
[32] D. Karakos, S. Khudanpur, J. Eisner, and C. Priebe, "Iterative denoising using Jensen-Rényi divergences with an application to unsupervised document categorization," in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, vol. II, Baltimore, MD, 2007, pp. 509–512.
[33] A. B. Hamza, "A nonextensive information-theoretic measure for image edge detection," Journal of Electronic Imaging, vol. 15, no. 1, pp. 13011.1–13011.8, 2006.
[34] J. Steele, The Cauchy-Schwarz Master Class. Cambridge: Cambridge University Press, 2006.