Basic univariate and bivariate statistics for symbolic data: a critical review

Some proofs of the problems with the basic statistics proposed for numeric symbolic data.

Authors: Antonio Irpino

Abstract

These few lines show the main problems with the state of the art of basic statistics for histogram and interval data.

1 Numerical symbolic modal data [Histogram-valued description]

We assume that $S(i)=[\underline{y}_i;\overline{y}_i]$ (the support is bounded in $\mathbb{R}$). The support is partitioned into a set of $n_i$ intervals $S(i)=\{I_{1i},\dots,I_{n_i i}\}$, where $I_{hi}=\left[\underline{y}_{hi},\overline{y}_{hi}\right]$ and $h=1,\dots,n_i$, i.e.

i. $I_{hi}\cap I_{mi}=\emptyset$, $h\neq m$;
ii. $\bigcup_{h=1,\dots,n_i} I_{hi}=S(i)$.

For histograms it is supposed that each interval is uniformly dense. It is possible to define the modal description of $i$ as follows:
$$Y(i)=\left\{(I_{hi},\pi_{hi})\ \middle|\ \forall I_{hi}\in S(i);\ \pi_{hi}=\Phi_i(\underline{y}_{hi}\le y\le\overline{y}_{hi})=\int_{I_{hi}}\varphi_i(z)\,dz\ge 0\right\},$$
where $\int_{S(i)}\varphi_i(z)\,dz=1$. In this case, the modal-numeric description is:
$$Y(i)=\{(I_{1i},\pi_{1i}),\dots,(I_{n_i i},\pi_{n_i i})\}.$$
With $Y(i)$ it is possible to associate a distribution function $\Phi_i(y)$ as follows:
$$\Phi_i(y)=\sum_{h<\ell}\pi_{hi}+\pi_{\ell i}\cdot\frac{y-\underline{y}_{\ell i}}{\overline{y}_{\ell i}-\underline{y}_{\ell i}},\quad\text{where}\ \left(\ell\in 1,\dots,n_i:\ \underline{y}_{\ell i}\le y\le\overline{y}_{\ell i}\right).$$
According to [9], the corresponding quantile function (the inverse of $\Phi_i(y)$) is defined as:
$$\Phi_i^{-1}(t)=\underline{y}_{\ell i}+\frac{t-\Phi_i(\underline{y}_{\ell i})}{\pi_{\ell i}}\cdot\left(\overline{y}_{\ell i}-\underline{y}_{\ell i}\right),\qquad(1)$$
where $\left(\ell\in 1,\dots,n_i:\ \Phi_i(\underline{y}_{\ell i})\le t\le\Phi_i(\overline{y}_{\ell i})\right)$.

Example. The pulse rate of the $i$-th patient in a day is described through a histogram with support $S(i)=[80,120]$. The empirical frequency distribution of the observed pulse rate is described as a histogram as follows:
$$Y(i)=\{([80,90),0.1),\ ([90,100),0.3),\ ([100,110),0.4),\ ([110,120],0.2)\}.$$
It follows that $\Phi_i(y)$ is:

$I_{hi}\in S(i)$:                                   $[80,90)$   $[90,100)$   $[100,110)$   $[110,120]$
$[\Phi_i(\underline{y}_{hi});\Phi_i(\overline{y}_{hi})]$:   $[0,0.1)$   $[0.1,0.4)$   $[0.4,0.8)$   $[0.8,1]$

In this case, if we want to compute the quantile at level $t=0.5$ (i.e., the median of the distribution) according to Eq. (1), we obtain:
$$\Phi_i^{-1}(0.5)=100+\frac{0.5-0.4}{0.4}\cdot(110-100)=102.5.$$

2 Basic univariate statistics for numerical symbolic data

The first to propose a set of univariate and bivariate statistics for symbolic data were Bertrand and Goupil [1], and subsequently Billard and Diday [5] improved them. The Bertrand and Goupil [1] approach relies on the so-called two-level paradigm presented in SDA in [6]: the set-valued description of a statistical unit of a higher order is the generalization of the values observed for a class of the lower order units. For example, the income distribution of a nation (the higher order unit) is the empirical distribution of the incomes of each citizen (the lower order units) of that nation. Naturally, other generalization or grouping criteria can be taken into consideration. The generalization process from lower to higher order units considered by Bertrand and Goupil [1] and by Billard and Diday [5] implies the following assumptions: given two symbolic data $y(1)$ and $y(2)$ described by the frequency distributions $f_1(y)$ and $f_2(y)$, a lower order unit can be described by a single value $y_0$ that has a probability of occurring equal to $\frac{f_1(y_0)+f_2(y_0)}{2}$. The univariate statistics proposed by Bertrand and Goupil [1] and by Billard and Diday [5] for a symbolic variable (namely, a variable describing higher order units, or a class of units) correspond to those of the classic variable used for describing the (unknown) lower order units.
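The quantile computation in Eq. (1) and the pulse-rate example can be checked numerically. The following is a minimal Python sketch (the function and variable names are ours, not from the paper):

```python
# Sketch of the histogram quantile function in Eq. (1).
# A histogram is a list of ((lower, upper), weight) pairs whose weights sum to 1.

def histogram_quantile(hist, t):
    """Return Phi^{-1}(t) by inverting the piecewise-linear cdf of the histogram."""
    cum = 0.0  # value of Phi_i at the lower bound of the current bin
    for (lo, hi), w in hist:
        if cum <= t <= cum + w:
            # linear interpolation inside the bin (uniform density within bins)
            return lo + (t - cum) / w * (hi - lo)
        cum += w
    raise ValueError("t must lie in [0, 1]")

# Pulse-rate histogram of the example
Y_i = [((80, 90), 0.1), ((90, 100), 0.3), ((100, 110), 0.4), ((110, 120), 0.2)]
print(histogram_quantile(Y_i, 0.5))  # median: 102.5
```

Here $t=0.5$ falls in the bin $[100,110)$, whose cumulative range is $[0.4,0.8)$, reproducing the value $102.5$ obtained above.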
Thus, given a set $E$ of $n$ higher order units described by the numerical symbolic variable $Y$, the mean, the variance and the standard deviation proposed by Bertrand and Goupil [1] and extended by Billard and Diday [5] correspond to those of a finite mixture of $n$ density (or frequency) functions with mixing weights equal to $\frac{1}{n}$. Given $n$ density functions denoted by $\varphi_i(y)$, with respective means $\mu_i=E(Y_i)$ and variances $\sigma_i^2=E[(Y_i-\mu_i)^2]$, and given the finite mixture density $\varphi(y)$ as follows:
$$\varphi(y)=\sum_{i=1}^{n}\frac{1}{n}\varphi_i(y)=\frac{1}{n}\sum_{i=1}^{n}\varphi_i(y),\qquad(2)$$
Frühwirth-Schnatter [8] shows that the mean $\mu=E(Y)$ and the variance $\sigma^2=E[(Y-\mu)^2]$ of $\varphi(y)$ are the following:
$$\mu=E(Y)=\frac{1}{n}\sum_{i=1}^{n}\mu_i;\qquad(3)$$
$$\sigma^2=E[(Y-\mu)^2]=\frac{1}{n}\sum_{i=1}^{n}\left(\mu_i^2+\sigma_i^2\right)-\mu^2.\qquad(4)$$
The Frühwirth-Schnatter [8] mean and variance lead to the Billard and Diday [5] mean and variance for an interval or a histogram symbolic variable. For the sake of simplicity, we show that this is true for interval-valued data. Indeed, histogram-valued data are treated as a particular case of weighted interval descriptions.

Let $Y$ be an interval-valued variable; thus, the generic symbolic datum is $y(i)=[a_i;b_i]$ with $a_i\le b_i$ belonging to $\mathbb{R}$. According to [1], $y(i)$ is considered as a uniform distribution on $[a_i;b_i]$, with mean equal to $\mu_i=\frac{a_i+b_i}{2}$ and variance equal to $\sigma_i^2=\frac{(b_i-a_i)^2}{12}$. Given a set of $n$ units described by an interval-valued variable, the symbolic sample mean $\bar{Y}$ [5, eq. (3.22)] is:
$$\bar{Y}=\frac{1}{2n}\sum_{i=1}^{n}(b_i+a_i).\qquad(5)$$
It is straightforward to show its equivalence with $\mu$ in Eq. (3); indeed:
$$\bar{Y}=\frac{1}{n}\sum_{i=1}^{n}\frac{(b_i+a_i)}{2}=\frac{1}{n}\sum_{i=1}^{n}\mu_i=\mu.$$
In [5, eq. (3.22)] the symbolic sample variance is also proposed, as follows:
$$S^2=\underbrace{\frac{1}{3n}\sum_{i=1}^{n}\left(b_i^2+b_i a_i+a_i^2\right)}_{(I)}-\underbrace{\frac{1}{4n^2}\left[\sum_{i=1}^{n}(a_i+b_i)\right]^2}_{(II)}.\qquad(6)$$
Considering that:
$$\mu_i^2+\sigma_i^2=\left(\frac{b_i+a_i}{2}\right)^2+\frac{(b_i-a_i)^2}{12}=\frac{(b_i+a_i)^2}{4}+\frac{(b_i-a_i)^2}{12}=\frac{3b_i^2+3a_i^2+6b_ia_i+b_i^2+a_i^2-2b_ia_i}{12}=\frac{4b_i^2+4a_i^2+4b_ia_i}{12}=\frac{b_i^2+b_ia_i+a_i^2}{3},$$
the term $(I)$ of Eq. (6) can be expressed as follows:
$$\frac{1}{3n}\sum_{i=1}^{n}\left(b_i^2+b_ia_i+a_i^2\right)=\frac{1}{n}\sum_{i=1}^{n}\left(\mu_i^2+\sigma_i^2\right).$$
The term $(II)$ is clearly $\mu^2$; indeed:
$$\frac{1}{4n^2}\left[\sum_{i=1}^{n}(a_i+b_i)\right]^2=\left[\frac{1}{n}\sum_{i=1}^{n}\frac{(a_i+b_i)}{2}\right]^2=\left[\frac{1}{n}\sum_{i=1}^{n}\mu_i\right]^2=\mu^2.$$
Thus, $S^2$ in Eq. (6) corresponds to Eq. (4); indeed:
$$S^2=(I)-(II)=\frac{1}{n}\sum_{i=1}^{n}\left(\mu_i^2+\sigma_i^2\right)-\mu^2=\sigma^2.\qquad(7)$$
The same correspondences also hold for the mean and the variance of the other numerical modal symbolic variables.

3 Bertrand and Goupil [1] approach to basic statistics

A bit of notation.

Interval data: $Y_1$ and $Y_2$ are interval-valued variables, with $Y_1(i)=[a_{i1},b_{i1}]$ and $Y_2(i)=[a_{i2},b_{i2}]$.

Histogram data: $Y_1$ and $Y_2$ are histogram-valued variables, with
$$Y_1(i)=\{([a_{i1,1},b_{i1,1}],\pi_{i1,1}),\dots,([a_{i1,h_{i1}},b_{i1,h_{i1}}],\pi_{i1,h_{i1}})\}\quad\text{such that}\quad\sum_{r=1}^{h_{i1}}\pi_{i1,r}=1$$
and
$$Y_2(i)=\{([a_{i2,1},b_{i2,1}],\pi_{i2,1}),\dots,([a_{i2,h_{i2}},b_{i2,h_{i2}}],\pi_{i2,h_{i2}})\}\quad\text{such that}\quad\sum_{r=1}^{h_{i2}}\pi_{i2,r}=1.$$
$h_{i1}$ is not necessarily equal to $h_{i2}$.

Bertrand and Goupil [1] and [2]

Univariate statistics for intervals. Bertrand and Goupil [1] consider each interval as a uniform distribution.
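The equivalence between the mixture statistics in Eqs. (3)-(4) and the symbolic sample mean and variance in Eqs. (5)-(6) can also be verified numerically. A short Python sketch (our own names, applied to a few hypothetical intervals):

```python
def interval_stats(intervals):
    """Billard-Diday symbolic sample mean and variance for interval data, Eqs. (5)-(6)."""
    n = len(intervals)
    mean = sum(a + b for a, b in intervals) / (2 * n)
    var = (sum(b * b + a * b + a * a for a, b in intervals) / (3 * n)
           - (sum(a + b for a, b in intervals)) ** 2 / (4 * n * n))
    return mean, var

def mixture_stats(intervals):
    """Mean and variance of the uniform finite mixture, Eqs. (3)-(4)."""
    n = len(intervals)
    mus = [(a + b) / 2 for a, b in intervals]        # uniform means
    sig2s = [(b - a) ** 2 / 12 for a, b in intervals]  # uniform variances
    mu = sum(mus) / n
    sigma2 = sum(m * m + s for m, s in zip(mus, sig2s)) / n - mu * mu
    return mu, sigma2

data = [(1.0, 3.0), (2.0, 6.0), (0.0, 10.0)]  # hypothetical interval data
print(interval_stats(data))
print(mixture_stats(data))  # same mean and variance
```

The two routines implement the two sides of the correspondence proved above, so they return identical values on any interval dataset.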
Under this hypothesis the mean is
$$\bar{Y}=\frac{1}{n}\sum_{i=1}^{n}\frac{b_i+a_i}{2};$$
being $\mu_i=\frac{b_i+a_i}{2}$, then $\bar{Y}=\frac{1}{n}\sum_{i=1}^{n}\mu_i$. The variance is
$$S^2=\frac{1}{3n}\sum_{i=1}^{n}\left(b_i^2+b_ia_i+a_i^2\right)-\frac{1}{4n^2}\left[\sum_{i=1}^{n}(b_i+a_i)\right]^2;$$
being $\sigma_i=\sqrt{\frac{(b_i-a_i)^2}{12}}$ and being
$$\mu_i^2+\sigma_i^2=\frac{(b_i+a_i)^2}{4}+\frac{(b_i-a_i)^2}{12}=\frac{4b_i^2+4a_i^2+4a_ib_i}{12}=\frac{1}{3}\left(b_i^2+a_ib_i+a_i^2\right),$$
then
$$S^2=\frac{1}{n}\sum_{i=1}^{n}\left(\mu_i^2+\sigma_i^2\right)-\bar{Y}^2.$$

Bivariate statistics for intervals. In the bivariate case, Bertrand and Goupil [1] assume that, if the individual is observed for two interval-valued variables, the joint distribution whose marginals are the two uniforms is derived under an (implicit) independence assumption, i.e., given $Y_1(i)\sim U(a_{i1},b_{i1})$ and $Y_2(i)\sim U(a_{i2},b_{i2})$, the within covariance is $COV(Y_1(i),Y_2(i))=0$; thus
$$(Y_1\cap Y_2)(i)=Y_1(i)\cdot Y_2(i)=U(a_{i1},b_{i1})\cdot U(a_{i2},b_{i2}).$$
Using the same approach of Billard [4], we can decompose the cross-variation into a within component,
$$CSW=\sum_{i=1}^{n}COV(Y_1(i),Y_2(i)),$$
which is equal to $0$ because each $COV(Y_1(i),Y_2(i))=0$, and a between component, which is equal to
$$CSB=\sum_{i=1}^{n}(\mu_{i1}-\bar{Y}_1)(\mu_{i2}-\bar{Y}_2)=\sum_{i=1}^{n}\mu_{i1}\mu_{i2}-n\bar{Y}_1\bar{Y}_2;$$
thus
$$CST=CSW+CSB=0+CSB=\sum_{i=1}^{n}\mu_{i1}\mu_{i2}-n\bar{Y}_1\bar{Y}_2.\qquad(8)$$
Considering the independence and the uniformity assumptions, the covariance of a set of bivariate intervals is
$$COV=\frac{CST}{n}=\frac{1}{n}\sum_{i=1}^{n}\mu_{i1}\mu_{i2}-\bar{Y}_1\bar{Y}_2.$$

Univariate statistics for histograms. They are analogous to those for intervals. The mean is
$$\bar{Y}=\frac{1}{2n}\sum_{i=1}^{n}\underbrace{\sum_{h=1}^{H_i}(b_{ih}+a_{ih})\pi_{ih}}_{2\mu_i}=\frac{1}{n}\sum_{i=1}^{n}\mu_i,$$
while the variance is
$$S^2=\int_{-\infty}^{+\infty}\left(y-\bar{Y}\right)^2 f(y)\,dy.$$
Using the approach suggested by Billard [4], we can divide $nS^2$ into a within and a between part.
The within sum of squares $SSW$ is the sum of the internal variabilities:
$$SSW=\sum_{i=1}^{n}\sigma_i^2=\sum_{i=1}^{n}\int_{-\infty}^{+\infty}(y-\mu_i)^2 f_i(y)\,dy=\sum_{i=1}^{n}\left[\int_{-\infty}^{+\infty}y^2 f_i(y)\,dy-\mu_i^2\right],$$
where, considering that $f_i(y)=\frac{\pi_{ih}}{b_{ih}-a_{ih}}$ if $a_{ih}\le y\le b_{ih}$, we have
$$\sum_{h=1}^{H_i}\frac{\pi_{ih}}{b_{ih}-a_{ih}}\int_{a_{ih}}^{b_{ih}}y^2\,dy=\sum_{h=1}^{H_i}\frac{\pi_{ih}}{b_{ih}-a_{ih}}\cdot\frac{1}{3}\left(b_{ih}^3-a_{ih}^3\right)=\frac{1}{3}\sum_{h=1}^{H_i}\pi_{ih}\left(a_{ih}^2+b_{ih}a_{ih}+b_{ih}^2\right);$$
thus
$$SSW=\sum_{i=1}^{n}\sigma_i^2=\frac{1}{3}\sum_{i=1}^{n}\sum_{h=1}^{H_i}\pi_{ih}\left(a_{ih}^2+b_{ih}a_{ih}+b_{ih}^2\right)-\sum_{i=1}^{n}\mu_i^2.$$
The between sum of squares is
$$SSB=\sum_{i=1}^{n}\left(\mu_i-\bar{Y}\right)^2=\sum_{i=1}^{n}\mu_i^2-n\bar{Y}^2;$$
thus
$$SST=SSW+SSB=\frac{1}{3}\sum_{i=1}^{n}\sum_{h=1}^{H_i}\pi_{ih}\left(a_{ih}^2+b_{ih}a_{ih}+b_{ih}^2\right)-\sum_{i=1}^{n}\mu_i^2+\sum_{i=1}^{n}\mu_i^2-n\bar{Y}^2,$$
and thus the variance is
$$S^2=\frac{SST}{n}=\frac{1}{3n}\sum_{i=1}^{n}\sum_{h=1}^{H_i}\pi_{ih}\left(a_{ih}^2+b_{ih}a_{ih}+b_{ih}^2\right)-\bar{Y}^2.$$

Bivariate statistics for histograms. Also in this case, Bertrand and Goupil [1] assume internal independence for the bivariate histogram-valued description, i.e., they assume that $COV(Y_1(i),Y_2(i))=0$. Therefore, using the Billard [4] approach, we can write a within component,
$$CSW=\sum_{i=1}^{n}COV(Y_1(i),Y_2(i)),$$
which is equal to $0$ because each $COV(Y_1(i),Y_2(i))=0$, and a between component, which is equal to
$$CSB=\sum_{i=1}^{n}(\mu_{i1}-\bar{Y}_1)(\mu_{i2}-\bar{Y}_2)=\sum_{i=1}^{n}\mu_{i1}\mu_{i2}-n\bar{Y}_1\bar{Y}_2;$$
thus
$$CST=CSW+CSB=0+CSB=\sum_{i=1}^{n}\mu_{i1}\mu_{i2}-n\bar{Y}_1\bar{Y}_2.$$
A general drawback is that, given two variables $Y_1$ and $Y_2$ such that $Y_1(i)=Y_2(i)$ for each $i=1,\dots,n$, looking at the formulas it is easy to observe that, even if $S^2(Y_1)=S^2(Y_2)$,
$$CST(Y_1,Y_2)\neq n\cdot S^2(Y_1)\quad\text{and}\quad CST(Y_1,Y_2)\neq n\cdot S^2(Y_2).\qquad(9)$$

4 Billard and Diday [5] 2006 approach to basic statistics

Univariate statistics for intervals. They are the same as those of Bertrand and Goupil [1] under the hypothesis of a uniform distribution in each interval-valued description.

Bivariate statistics for intervals. They are a particular modification of the Bertrand and Goupil [1] ones, but without solving some drawbacks (the covariance of two identical interval variables is different from the variance of the two variables).

Univariate statistics for histograms. They are the same as those of Bertrand and Goupil [1] under the hypothesis of a uniform distribution in each histogram-valued description.

Bivariate statistics for histograms. They are a particular modification of the Bertrand and Goupil [1] ones, but without solving some drawbacks (the covariance of two identical histogram variables is different from the variance of the two variables). Another problem arises if the histograms are the same but have a different bin partition.

5 Billard [4] 2008 approach to basic statistics

Billard [4] defines interval descriptions and histogram descriptions.

Univariate statistics for intervals. They are the same as those of Bertrand and Goupil [1] under the hypothesis of a uniform distribution in each interval-valued description.

Bivariate statistics for intervals. The main novelty of the Billard [4] approach is related to the definition of the cross-variation between two interval-valued variables. Indeed, it is assumed that
$$CSW=\sum_{i=1}^{n}COV(Y_1(i),Y_2(i))=\sum_{i=1}^{n}\frac{(b_{i1}-a_{i1})(b_{i2}-a_{i2})}{12}.\qquad(10)$$
The assumption that $COV(Y_1(i),Y_2(i))=(b_{i1}-a_{i1})(b_{i2}-a_{i2})/12$ corresponds to an assumption (not declared in the paper) of perfect positive internal correlation between the two uniforms describing the $i$-th individual.
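The undeclared comonotonicity behind Eq. (10) can be illustrated by simulation: if the two uniforms share the same underlying uniform draw $U$, their sample covariance approaches $(b_1-a_1)(b_2-a_2)/12$. A Python sketch (names and the chosen interval bounds are ours, not from the paper):

```python
import random

def comonotone_cov(a1, b1, a2, b2, n_draws=200_000, seed=7):
    """Monte Carlo covariance of Y1 = a1 + (b1-a1)U and Y2 = a2 + (b2-a2)U."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n_draws):
        u = rng.random()                # one shared uniform draw
        xs.append(a1 + (b1 - a1) * u)   # perfectly positively correlated
        ys.append(a2 + (b2 - a2) * u)
    mx = sum(xs) / n_draws
    my = sum(ys) / n_draws
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n_draws

a1, b1, a2, b2 = 10.0, 20.0, 30.0, 70.0  # hypothetical intervals
print(comonotone_cov(a1, b1, a2, b2))    # close to (b1-a1)(b2-a2)/12
print((b1 - a1) * (b2 - a2) / 12)
```

Under independent draws the same simulation would instead approach zero, which is exactly the Bertrand and Goupil [1] within covariance.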
In fact, being $\sigma_{i1}^2=\frac{(b_{i1}-a_{i1})^2}{12}$ and $\sigma_{i2}^2=\frac{(b_{i2}-a_{i2})^2}{12}$, and denoting by $CORR(Y_1(i),Y_2(i))$ the correlation between $Y_1(i)$ and $Y_2(i)$, we can write:
$$CORR(Y_1(i),Y_2(i))=\frac{COV(Y_1(i),Y_2(i))}{\sigma_{i1}\sigma_{i2}}=1\ \Rightarrow\ COV(Y_1(i),Y_2(i))=\sigma_{i1}\sigma_{i2}=\frac{(b_{i1}-a_{i1})(b_{i2}-a_{i2})}{12}.$$
Naturally, the between component remains the same, i.e.
$$CSB=\sum_{i=1}^{n}(\mu_{i1}-\bar{Y}_1)(\mu_{i2}-\bar{Y}_2)=\sum_{i=1}^{n}\mu_{i1}\mu_{i2}-n\bar{Y}_1\bar{Y}_2;$$
thus
$$CST=CSW+CSB=\sum_{i=1}^{n}\frac{(b_{i1}-a_{i1})(b_{i2}-a_{i2})}{12}+\sum_{i=1}^{n}\mu_{i1}\mu_{i2}-n\bar{Y}_1\bar{Y}_2,$$
and, being
$$\sum_{i=1}^{n}\frac{(b_{i1}-a_{i1})(b_{i2}-a_{i2})}{12}+\sum_{i=1}^{n}\frac{(b_{i1}+a_{i1})(b_{i2}+a_{i2})}{4}-n\bar{Y}_1\bar{Y}_2=\sum_{i=1}^{n}\left[\frac{(b_{i1}-a_{i1})(b_{i2}-a_{i2})}{12}+\frac{(b_{i1}+a_{i1})(b_{i2}+a_{i2})}{4}-\bar{Y}_1\bar{Y}_2\right]=\sum_{i=1}^{n}\left[\frac{1}{6}\left(2b_{i1}b_{i2}+b_{i1}a_{i2}+a_{i1}b_{i2}+2a_{i1}a_{i2}\right)-\bar{Y}_1\bar{Y}_2\right],$$
we have that
$$CST=CSW+CSB=\sum_{i=1}^{n}\left[\frac{1}{6}\left(2b_{i1}b_{i2}+b_{i1}a_{i2}+a_{i1}b_{i2}+2a_{i1}a_{i2}\right)-\bar{Y}_1\bar{Y}_2\right],$$
which, being invariant under translation, can be expressed as in [4] as follows:
$$CST=\frac{1}{6}\sum_{i=1}^{n}\left[2(b_{i1}-\bar{Y}_1)(b_{i2}-\bar{Y}_2)+(b_{i1}-\bar{Y}_1)(a_{i2}-\bar{Y}_2)+(a_{i1}-\bar{Y}_1)(b_{i2}-\bar{Y}_2)+2(a_{i1}-\bar{Y}_1)(a_{i2}-\bar{Y}_2)\right].$$
Even if it is not discussed in the paper [4], this formulation seems a numerical trick for solving the problem in Eq. (9). In fact, in this case it is easy to show that, given two interval-valued variables $Y_1$ and $Y_2$ such that $Y_1(i)=Y_2(i)$ for each $i=1,\dots,n$,
$$CST(Y_1,Y_2)=n\cdot S^2(Y_1)=n\cdot S^2(Y_2).$$
However, this is true only for interval-valued data (as we will see in a while).

Univariate statistics for histograms. They are the same as those of Bertrand and Goupil [1] under the hypothesis of a uniform distribution in each bin of the histogram-valued description.
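That Billard's interval cross-variation repairs Eq. (9) for interval data can be checked directly: when $Y_1(i)=Y_2(i)$ for all $i$, $CST$ equals $n\cdot S^2$. A short Python sketch (names and the test intervals are ours):

```python
def billard_cst(Y1, Y2):
    """Billard [4] cross-variation for interval-valued variables (translated form)."""
    n = len(Y1)
    m1 = sum(a + b for a, b in Y1) / (2 * n)
    m2 = sum(a + b for a, b in Y2) / (2 * n)
    cst = 0.0
    for (a1, b1), (a2, b2) in zip(Y1, Y2):
        cst += (2 * (b1 - m1) * (b2 - m2) + (b1 - m1) * (a2 - m2)
                + (a1 - m1) * (b2 - m2) + 2 * (a1 - m1) * (a2 - m2)) / 6
    return cst

def interval_variance(Y):
    """Symbolic sample variance for interval data, Eq. (6)."""
    n = len(Y)
    return (sum(b * b + a * b + a * a for a, b in Y) / (3 * n)
            - (sum(a + b for a, b in Y)) ** 2 / (4 * n * n))

Y = [(1.0, 4.0), (2.0, 8.0), (5.0, 9.0)]  # hypothetical intervals
print(billard_cst(Y, Y), len(Y) * interval_variance(Y))  # equal values
```

With $Y_1=Y_2$ the summand collapses to $(b_i-\bar{Y})^2+(a_i-\bar{Y})(b_i-\bar{Y})+(a_i-\bar{Y})^2$ divided by 3, i.e. exactly $n$ times the variance.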
Bivariate statistics for histograms. In this case Billard [4], without any justification, extends the covariance using a weighted formulation of the previous equation over each couple of bins of the two histograms that describe the $i$-th individual, as follows:
$$COV(Y_1,Y_2)=\frac{1}{6n}\sum_{i=1}^{n}\sum_{r=1}^{h_{i1}}\sum_{s=1}^{h_{i2}}\left[2\left(b_{i1,r}-\bar{Y}_1\right)\left(b_{i2,s}-\bar{Y}_2\right)+\left(b_{i1,r}-\bar{Y}_1\right)\left(a_{i2,s}-\bar{Y}_2\right)+\left(a_{i1,r}-\bar{Y}_1\right)\left(b_{i2,s}-\bar{Y}_2\right)+2\left(a_{i1,r}-\bar{Y}_1\right)\left(a_{i2,s}-\bar{Y}_2\right)\right]\pi_{i1,r}\pi_{i2,s}.$$
However, this formulation does not solve the problem in Eq. (9), and further it is sensitive to a recodification of the histograms that does not change the density. We show this with two examples. Further, when the number of bins increases ($h_{..}\to+\infty$ and, in general, $\pi_{..}\to 0$), the consequence is that
$$COV\to\frac{1}{n}\sum_{i=1}^{n}\mu_{i1}\mu_{i2}-\bar{Y}_1\bar{Y}_2.$$
The proof is intuitive (because the bivariate histogram tends to coincide with the density of a bivariate distribution under the independence assumption).

Example: the problem in Eq. (9) persists. We have the following dataset of 2 individuals described by 2 histogram variables (Tab. 1):

i = 1:  Y_1 = {([10,20], 0.4), ([20,30], 0.6)}    Y_2 = {([10,20], 0.4), ([20,30], 0.6)}
i = 2:  Y_1 = {([50,60], 0.2), ([60,70], 0.8)}    Y_2 = {([50,60], 0.2), ([60,70], 0.8)}

Basic statistics:
$$\bar{Y}_1=\frac{1}{2}\left(\frac{20+10}{2}0.4+\frac{30+20}{2}0.6+\frac{60+50}{2}0.2+\frac{70+60}{2}0.8\right)=42,\qquad\bar{Y}_2=42,$$
$$S_1^2=\frac{1}{3\cdot 2}\left[\left(10^2+10\cdot 20+20^2\right)0.4+\left(20^2+20\cdot 30+30^2\right)0.6+\left(50^2+50\cdot 60+60^2\right)0.2+\left(60^2+60\cdot 70+70^2\right)0.8\right]-42^2=469.\bar{3},$$
$$S_2^2=469.\bar{3},$$
$$COV(Y_1,Y_2)=\frac{1}{6\cdot 2}\left[\begin{array}{l}\left(2(10-42)(10-42)+(10-42)(20-42)+(20-42)(10-42)+2(20-42)(20-42)\right)0.4\cdot 0.4\,+\\+\left(2(10-42)(20-42)+(10-42)(30-42)+(20-42)(20-42)+2(20-42)(30-42)\right)0.4\cdot 0.6\,+\\+\ \dots\end{array}\right]=449.\bar{3}.$$

Table 1: EX1. Even though $Y_1(i)=Y_2(i)$ for each $i$, $COV(Y_1,Y_2)=449.\bar{3}\neq 469.\bar{3}=S_1^2=S_2^2$.

Example: COV is not invariant to a recodification of the same histogram into different bins. If we take $Y_2$ and rewrite it by splitting its bins into two parts, we do not change the density or the cumulative function associated with each histogram. However, $COV(Y_1,Y_2)$ changes. In order to show this, in Tab. 2 we have recodified the second variable as follows:
$$Y_2(1)=\{([10,20],0.4),([20,30],0.6)\}=\{([10,15],0.2),([15,20],0.2),([20,25],0.3),([25,30],0.3)\}$$
and
$$Y_2(2)=\{([50,60],0.2),([60,70],0.8)\}=\{([50,55],0.1),([55,60],0.1),([60,65],0.4),([65,70],0.4)\}.$$
Thus, rewriting the histogram without changing its distribution or its density, $COV$ changes, but the univariate statistics do not. Continuing to bisect the bins, it is easy to show that $COV(Y_1,Y_2)$ tends to the covariance of the bivariate distribution of the means of the histograms, which is equal to 441. The Matlab code of the numerical proof is in the appendix.

i = 1:  Y_1 = {([10,20], 0.4), ([20,30], 0.6)}    Y_2 = {([10,15], 0.2), ([15,20], 0.2), ([20,25], 0.3), ([25,30], 0.3)}
i = 2:  Y_1 = {([50,60], 0.2), ([60,70], 0.8)}    Y_2 = {([50,55], 0.1), ([55,60], 0.1), ([60,65], 0.4), ([65,70], 0.4)}

Basic statistics: $\bar{Y}_1=42$, $\bar{Y}_2=42$, $S_1^2=469.\bar{3}$, $S_2^2=469.\bar{3}$, $COV(Y_1,Y_2)=445.1\bar{6}$.

Table 2: EX2, the second variable is split, but the histograms always have the same density function.

Appendix

Code for the proofs related to Examples 1 and 2. The function covstat_billard(HM) computes the basic statistics for two histogram variables. HM is a structure array where each element is a histogram.

function res=covstat_billard(HM)
% this function computes the basic statistics of Billard 2008 IASC
n=size(HM,1);
ALL1=[]; ALL2=[]; meds=[];
for i=1:size(HM,1)
    ALL1=[ALL1; HM(i,1).h];
    ALL2=[ALL2; HM(i,2).h];
    meds(i,1)=sum((HM(i,1).h(:,2)+HM(i,1).h(:,1))/2.*HM(i,1).h(:,3));
    meds(i,2)=sum((HM(i,2).h(:,2)+HM(i,2).h(:,1))/2.*HM(i,2).h(:,3));
end
m1=1/n*sum((ALL1(:,2)+ALL1(:,1))/2.*ALL1(:,3));
m2=1/n*sum((ALL2(:,2)+ALL2(:,1))/2.*ALL2(:,3));
res.m1=m1; res.m2=m2;
tmp1=0;
for i=1:n
    for r=1:size(HM(i,1).h,1)
        tmp=(HM(i,1).h(r,1)-m1)^2+ ...
            (HM(i,1).h(r,1)-m1)*(HM(i,1).h(r,2)-m1)+ ...
            (HM(i,1).h(r,2)-m1)^2;
        tmp1=tmp1+1/(3*n)*tmp*HM(i,1).h(r,3);
    end
end
var1=tmp1;
res.var1=var1;
tmp2=0;
for i=1:n
    for r=1:size(HM(i,2).h,1)
        tmp=(HM(i,2).h(r,1)-m2)^2+ ...
            (HM(i,2).h(r,1)-m2)*(HM(i,2).h(r,2)-m2)+ ...  % fixed: the cross term used m1 instead of m2
            (HM(i,2).h(r,2)-m2)^2;
        tmp2=tmp2+1/(3*n)*tmp*HM(i,2).h(r,3);
    end
end
var2=tmp2;
res.var2=var2;
tmp=0;
for i=1:n
    for r=1:size(HM(i,1).h,1)
        for t=1:size(HM(i,2).h,1)
            tmp2=2*(HM(i,1).h(r,1)-m1)*(HM(i,2).h(t,1)-m2)+ ...
                 (HM(i,1).h(r,1)-m1)*(HM(i,2).h(t,2)-m2)+ ...
                 (HM(i,1).h(r,2)-m1)*(HM(i,2).h(t,1)-m2)+ ...
                 2*(HM(i,1).h(r,2)-m1)*(HM(i,2).h(t,2)-m2);
            tmp=tmp+1/(6*n)*tmp2*HM(i,1).h(r,3)*HM(i,2).h(t,3);
        end
    end
end
covar=tmp;
res.covar=covar;
res.covCE=meds(:,1)'*meds(:,2)*1/n-mean(meds(:,1))*mean(meds(:,2));

The next script starts from a configuration of histograms, then splits each bin into two equal parts and recomputes the basic statistics.
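As a cross-check of the Matlab code in the appendix, the same computation can be reproduced in Python (a sketch with our own names); it reproduces the covariance values of Tab. 1 and Tab. 2 and the effect of splitting the bins of Y2:

```python
def hist_mean(col):
    """Grand mean of a histogram-valued variable; col is a list of [(a, b, pi), ...]."""
    n = len(col)
    return sum(p * (a + b) / 2 for h in col for a, b, p in h) / n

def billard_hist_cov(col1, col2):
    """Billard [4] bivariate covariance for histogram-valued variables."""
    n = len(col1)
    m1, m2 = hist_mean(col1), hist_mean(col2)
    cov = 0.0
    for h1, h2 in zip(col1, col2):
        for a1, b1, p1 in h1:          # every couple of bins of the two histograms
            for a2, b2, p2 in h2:
                cov += (2 * (b1 - m1) * (b2 - m2) + (b1 - m1) * (a2 - m2)
                        + (a1 - m1) * (b2 - m2)
                        + 2 * (a1 - m1) * (a2 - m2)) * p1 * p2 / (6 * n)
    return cov

def split_bins(h):
    """Halve every bin: same density, different coding."""
    out = []
    for a, b, p in h:
        mid = (a + b) / 2
        out += [(a, mid, p / 2), (mid, b, p / 2)]
    return out

Y1 = [[(10, 20, 0.4), (20, 30, 0.6)], [(50, 60, 0.2), (60, 70, 0.8)]]
Y2 = [list(h) for h in Y1]             # identical variable
print(billard_hist_cov(Y1, Y2))        # 449.33... as in Tab. 1
Y2s = [split_bins(h) for h in Y2]      # recodified, same density
print(billard_hist_cov(Y1, Y2s))       # 445.1666... as in Tab. 2
```

Iterating split_bins on both variables drives the covariance toward 441, the covariance of the means, as claimed above.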
% define a 2x2 histogram-valued table
H11=[10 20 0.4; ...
     20 30 0.6];
H21=[50 60 0.2; ...
     60 70 0.8];
H12=[10 20 0.4; ...
     20 30 0.6];
H22=[50 60 0.2; ...
     60 70 0.8];
% set up the structure HM
HM(1,1).h=H11; HM(1,2).h=H12;
HM(2,1).h=H21; HM(2,2).h=H22;
res=covstat_billard(HM);
res.covar
% now we do 5 splits and recompute the basic statistics
for i=1:5
    for r=1:size(HM,1)
        for s=1:size(HM,2)
            tmp2=[];
            for k=1:size(HM(r,s).h,1)
                uno=HM(r,s).h(k,1);
                tre=HM(r,s).h(k,2);
                due=(tre+uno)/2;
                pp=HM(r,s).h(k,3)/2;
                tmp2=[tmp2; [uno due pp; due tre pp]];
            end
            HM(r,s).h=tmp2;
        end
    end
    res=covstat_billard(HM);
    res.covar
    res.covCE
end

References

[1] Bertrand, P. and Goupil, F. (2000): Descriptive statistics for symbolic data. In: H.H. Bock and E. Diday (Eds.): Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer, Berlin, 103-124.
[2] Billard, L. and Diday, E. (2003): From the Statistics of Data to the Statistics of Knowledge: Symbolic Data Analysis. Journal of the American Statistical Association, 98, 462, 470-487.
[3] Billard, L. (2007): Dependencies and Variation Components of Symbolic Interval-Valued Data. In: P. Brito, P. Bertrand, G. Cucumel, F. de Carvalho (Eds.): Selected Contributions in Data Analysis and Classification. Springer, Berlin, 3-12.
[4] Billard, L. (2008): Sample Covariance Function for Complex Quantitative Data. Proceedings of IASC 2008, Yokohama, Japan, 157-163.
[5] Billard, L. and Diday, E. (2006): Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester.
[6] Bock, H.H. and Diday, E. (2000): Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Studies in Classification, Data Analysis and Knowledge Organisation, Springer-Verlag.
[7] Chisini, O. (1929): Sul concetto di media. Periodico di Matematiche, 4, 106-116.
[8] Frühwirth-Schnatter, S. (2006): Finite Mixture and Markov Switching Models. Springer.
[9] Irpino, A., Lechevallier, Y. and Verde, R. (2006): Dynamic clustering of histograms using Wasserstein metric. In: Rizzi, A., Vichi, M. (Eds.): COMPSTAT 2006. Physica-Verlag, Berlin, 869-876.
