On central tendency and dispersion measures for intervals and hypercubes

Marie Chavent (1) and Jérôme Saracco (1,2)*

(1) Université Bordeaux 1, Institut de Mathématiques de Bordeaux, UMR CNRS 5251, 351 cours de la libération, 33405 Talence Cedex, France. e-mail: {Marie.chavent,Jerome.Saracco}@math.u-bordeaux1.fr
(2) GREThA, UMR CNRS 5113, Université Montesquieu - Bordeaux IV, Avenue Léon Duguit, 33608 Pessac Cedex, France

* preprint submitted to Communications in Statistics - Theory and Methods

Abstract

The uncertainty or the variability of the data may be treated by considering, rather than a single value for each observation, the interval of values in which it may fall. This paper studies the derivation of basic descriptive statistics for interval-valued datasets. We propose a geometrical approach to the determination of summary statistics (central tendency and dispersion measures) for interval-valued variables.

Keywords: Clustering, Hausdorff Distance, Multidimensional Interval Data.

1 Introduction

In descriptive statistics, summary statistics are used to synthesize a set of real observations. They usually involve:
- a measure of location or central tendency, such as the arithmetic mean, median, interquartile mean or midrange,
- a measure of dispersion, like the standard deviation, range, interquartile range or absolute deviation.

In this paper, we focus on obtaining basic descriptive statistics, namely central tendency and dispersion measures, for interval-valued data. Such data are often met in practice; they typically reflect the variability and/or uncertainty that underlie the observed measurement. Interval data is a special case of 'symbolic data', which also comprises set-valued categorical and quantitative variables as described, e.g., in Bock and Diday (2000).
Empirical extensions of summary statistics to the calculation of the mean and variance for interval-valued data have been given by Bertrand and Goupil (2000), and for histogram-valued data by Billard and Diday (2003).

In this paper, we propose a geometrical determination of summary statistics (mean, median, variance, absolute deviation, ...) for interval-valued variables. This approach mimics the case of real-valued variables, with the absolute value of the difference between two real numbers being replaced by a distance between two intervals.

For real-valued variables, a geometrical way of defining a central value c of a set {x_1, x_2, ..., x_n} of n real observations is to choose c ∈ R as close as possible to all the x_i's. Let us define the function S_p:

  S_p(c) = ‖x − c‖_p = (Σ_{i=1}^n |x_i − c|^p)^{1/p} for p < ∞,
           max_{i=1,...,n} |x_i − c| for p = ∞,   (1)

where x ∈ R^n is the vector of the n observations x_i, ‖·‖_p is the L_p norm on R^n, and c = c I_n with I_n the unit vector. Then one can use

  ĉ = arg min_{c ∈ R} S_p(c),   (2)

as a central value and S_p(ĉ) as the associated dispersion measure. The above minimization problem has an explicit solution for p = 1, 2, ∞.

- When p = 1, the central value is ĉ = x_M (the sample median) and the corresponding dispersion is S_1(x_M) = Σ_{i=1}^n |x_i − x_M| = n s_M, where s_M is the average absolute deviation from the median.
- When p = 2, the central value is ĉ = x̄ (the sample mean) and the corresponding dispersion is S_2(x̄) = sqrt(Σ_{i=1}^n (x_i − x̄)²) = sqrt(n − 1) s, where s is the sample standard deviation.
- When p = ∞, the central value is ĉ = x_R (the midrange) and the corresponding dispersion is S_∞(x_R) = max_{i=1,...,n} |x_i − x_R| = w/2, where w is the sample range.
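The three explicit solutions above can be sketched numerically; a minimal Python illustration (the function name `central_value` is ours, not from the paper):

```python
# Central value c_hat and dispersion S_p(c_hat) for p = 1, 2, infinity,
# using the explicit solutions: median, mean and midrange.

def central_value(x, p):
    """Return (c_hat, S_p(c_hat)) for a list of real observations x."""
    xs = sorted(x)
    n = len(xs)
    if p == 1:
        # The sample median minimizes sum_i |x_i - c|.
        mid = n // 2
        c = xs[mid] if n % 2 == 1 else (xs[mid - 1] + xs[mid]) / 2
        disp = sum(abs(xi - c) for xi in xs)
    elif p == 2:
        # The sample mean minimizes sum_i (x_i - c)^2.
        c = sum(xs) / n
        disp = sum((xi - c) ** 2 for xi in xs) ** 0.5
    else:
        # p = infinity: the midrange minimizes max_i |x_i - c|.
        c = (xs[0] + xs[-1]) / 2
        disp = max(abs(xi - c) for xi in xs)
    return c, disp

x = [1.0, 2.0, 4.0, 10.0]
print(central_value(x, 1))              # (3.0, 11.0): median and n * s_M
print(central_value(x, 2))              # mean and sqrt(n - 1) * s
print(central_value(x, float("inf")))   # (5.5, 4.5): midrange and w / 2
```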
The pairs (x̄, s²), (x_M, s_M) and (x_R, w) are then consistent with the use of, respectively, the L_2, L_1 and L_∞ norms in the function S_p.

For interval-valued variables, we will use the above geometrical approach to define coherent measures of central tendency and dispersion of a set {x̃_1, x̃_2, ..., x̃_n} of n intervals x̃_i = [a_i, b_i] ∈ I = {[a, b] | a, b ∈ R, a ≤ b}. A measure of central tendency c̃ is now an interval c̃ = [α, β], defined so as to be as close as possible to all the x̃_i's. Replacing in (1) the terms |x_i − c| by a distance d(x̃_i, c̃) between two intervals leads to the function S̃_p defined by:

  S̃_p(c̃) = (Σ_{i=1}^n d(x̃_i, c̃)^p)^{1/p} for p < ∞,
            max_{i=1,...,n} d(x̃_i, c̃) for p = ∞.   (3)

The central interval c̃̂ = [α̂, β̂] is then defined as

  c̃̂ = arg min_{c̃ ∈ I} S̃_p(c̃),   (4)

and the corresponding dispersion measure is S̃_p(c̃̂).

In the following, after a brief recall of some definitions of distances between intervals (Section 2), we exhibit in Section 3 particular cases of the value p and the distance d for which explicit formulae for the lower and upper bounds of the central interval c̃̂ have already been developed. Then we solve in Section 4 the case where p = 2 and d is the Hausdorff distance, and we show how the corresponding central interval can be computed in a finite number of operations proportional to n³. We generalize in Section 5 all these results to hypercubes. Finally, concluding remarks are given in Section 6.

2 Distances between intervals

Many distances between intervals have been proposed, varying from simple ones to more elaborate ones.
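Whatever distance d is chosen, the minimization (4) can always be approximated numerically by brute force; the following sketch (the grid resolution and the helper name are our assumptions, not part of the paper) scans candidate intervals on a grid:

```python
# Naive numerical approximation of the central interval (4): scan a grid of
# candidate intervals [alpha, beta] and keep the one minimizing criterion (3).

def central_interval_grid(intervals, d, p=2.0, steps=60):
    """Brute-force arg-min of the criterion over grid candidates.

    intervals: list of (a_i, b_i) pairs; d: any distance between two intervals.
    """
    lo = min(a for a, _ in intervals)
    hi = max(b for _, b in intervals)
    grid = [lo + t * (hi - lo) / steps for t in range(steps + 1)]
    best, best_val = None, float("inf")
    for alpha in grid:
        for beta in grid:
            if alpha > beta:  # a candidate must be a genuine interval
                continue
            dists = [d((a, b), (alpha, beta)) for a, b in intervals]
            val = max(dists) if p == float("inf") else sum(t ** p for t in dists) ** (1 / p)
            if val < best_val:
                best, best_val = (alpha, beta), val
    return best, best_val

# With the Hausdorff distance of (5) and p = 1, the result matches the medians
# of midpoints and half-lengths (Theorem 1 below), here [1, 3]:
hausdorff = lambda u, v: max(abs(u[0] - v[0]), abs(u[1] - v[1]))
print(central_interval_grid([(0, 2), (1, 3), (2, 6)], hausdorff, p=1))
```

The explicit formulae of the next sections replace this O(steps²) scan with closed forms.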
Elaborate distances taking into account both range and position have been proposed in the framework of symbolic data analysis (see, for instance, Chapter 8 and Section 11.2.2 of Bock and Diday, 2000; De Carvalho, 1998; Ichino and Yaguchi, 1994). Simple distances commonly used to compare x̃_1 = [a_1, b_1] and x̃_2 = [a_2, b_2] are the L_p distances between:

- the two vectors (a_1, b_1) and (a_2, b_2) of the lower and upper bounds,
- or the two vectors (m_1, l_1) and (m_2, l_2) of the midpoints m_i = (a_i + b_i)/2 and the half-lengths l_i = (b_i − a_i)/2.

General distances between sets, like the Hausdorff distance (see Nadler, 1978), can also be used to compare two intervals. In the case of two intervals x̃_1 = [a_1, b_1] and x̃_2 = [a_2, b_2], the Hausdorff distance has the property of simplifying to:

  d(x̃_1, x̃_2) = max(|a_1 − a_2|, |b_1 − b_2|).   (5)

By replacing in (5) the lower bound a_i by (m_i − l_i) and the upper bound b_i by (m_i + l_i), and according to the following property, valid for x and y in R,

  max(|x − y|, |x + y|) = |x| + |y|,

one can show that the Hausdorff distance can be written as:

  d([a_1, b_1], [a_2, b_2]) = |m_1 − m_2| + |l_1 − l_2|.   (6)

The Hausdorff distance between intervals thus has the interesting property of being, at the same time,
- a distance between sets,
- equal to the L_∞ distance between the vectors (a_1, b_1) and (a_2, b_2),
- equal to the L_1 distance between the vectors (m_1, l_1) and (m_2, l_2).

3 Existing results on central intervals

Explicit formulae for the central interval c̃̂ = [α̂, β̂] = arg min_{c̃ ∈ I} S̃_p(c̃) can be found in some particular cases. We recall here results already obtained and used in previous works (see, for instance, Chavent and Lechevallier, 2002; Chavent, 2004; De Carvalho et al., 2006).
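Most of the cases recalled below rely on the Hausdorff distance; the equivalence of its two forms (5) and (6) can be checked with a small sketch (function names are ours):

```python
# Hausdorff distance between intervals:
# bound form (5) versus midpoint/half-length form (6).

def hausdorff_bounds(i1, i2):
    """Form (5): max of the gaps between lower and between upper bounds."""
    (a1, b1), (a2, b2) = i1, i2
    return max(abs(a1 - a2), abs(b1 - b2))

def hausdorff_midlen(i1, i2):
    """Form (6): L_1 distance between (midpoint, half-length) vectors."""
    (a1, b1), (a2, b2) = i1, i2
    m1, l1 = (a1 + b1) / 2, (b1 - a1) / 2
    m2, l2 = (a2 + b2) / 2, (b2 - a2) / 2
    return abs(m1 - m2) + abs(l1 - l2)

x1, x2 = (1, 4), (2, 8)
print(hausdorff_bounds(x1, x2))   # max(1, 4) = 4
print(hausdorff_midlen(x1, x2))   # |2.5 - 5.0| + |1.5 - 3.0| = 4.0
```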
3.1 L_1 combination of Hausdorff distances

When p = 1 and d is the Hausdorff distance, S̃_p(c̃) reads:

  S̃_1(c̃) = Σ_{i=1}^n (|m_i − μ| + |l_i − λ|),   (7)

where μ and λ are the midpoint and the half-length of c̃ = [α, β]. Minimization of S̃_1(c̃) boils down to the two minimization problems:

  min_{μ ∈ R} Σ_{i=1}^n |m_i − μ|  and  min_{λ ∈ R} Σ_{i=1}^n |l_i − λ|.

Theorem 1. In case of an L_1 combination of Hausdorff distances, the midpoint μ̂ and the half-length λ̂ of the central interval c̃̂ are:

  μ̂ = median{m_i | i = 1, ..., n},  λ̂ = median{l_i | i = 1, ..., n}.   (8)

3.2 L_∞ combination of Hausdorff distances

When p = ∞ and d is the Hausdorff distance, S̃_p(c̃) reads:

  S̃_∞(c̃) = max_{i=1,...,n} max{|a_i − α|, |b_i − β|},   (9)

i.e.

  S̃_∞(c̃) = max{max_{i=1,...,n} |a_i − α|, max_{i=1,...,n} |b_i − β|}.

Minimization of S̃_∞(c̃) boils down to the two minimization problems:

  min_{α ∈ R} max_{i=1,...,n} |a_i − α|  and  min_{β ∈ R} max_{i=1,...,n} |b_i − β|.

Theorem 2. In case of an L_∞ combination of Hausdorff distances, the lower bound α̂ and the upper bound β̂ of the central interval c̃̂ are:

  α̂ = (a_(1) + a_(n))/2,  β̂ = (b_(1) + b_(n))/2,   (10)

where a_(n) (resp. b_(n)) is the largest lower bound (resp. upper bound) and a_(1) (resp. b_(1)) is the smallest lower bound (resp. upper bound).

3.3 L_2 combination of L_2 distances

For p = 2, an explicit solution is easily obtained when d is the L_2 distance either between the midpoints and half-lengths of the intervals or between their lower and upper bounds. For instance, in the first case, S̃_p(c̃) reads:

  S̃_2(c̃) = sqrt(Σ_{i=1}^n d(x̃_i, c̃)²) = sqrt(Σ_{i=1}^n (m_i − μ)² + (l_i − λ)²).   (11)
Theorem 3. In case of an L_2 combination of L_2 distances between midpoints and half-lengths, the midpoint μ̂ and the half-length λ̂ of the central interval c̃̂ are:

  μ̂ = (1/n) Σ_{i=1}^n m_i  and  λ̂ = (1/n) Σ_{i=1}^n l_i.

In case of an L_2 combination of L_2 distances between lower and upper bounds, the lower and upper bounds of the central interval c̃̂ are:

  α̂ = (1/n) Σ_{i=1}^n a_i  and  β̂ = (1/n) Σ_{i=1}^n b_i.

4 Main result

We study here the case of an L_2 combination of Hausdorff distances. When p = 2 and d is the Hausdorff distance, S̃_p(c̃) reads:

  S̃_2(c̃)² = Σ_{i=1}^n max(|a_i − α|, |b_i − β|)².   (12)

Theorem 4. In case of an L_2 combination of Hausdorff distances, the central interval c̃̂ which minimizes (12) can be computed in a finite number of operations proportional to n³.

Proof: The square is an increasing function over positive numbers, so formula (12) can be rewritten:

  S̃_2(c̃)² = Σ_{i=1}^n max((a_i − α)², (b_i − β)²).   (13)

On the other hand, using midpoints and half-lengths, one obtains:

  (a_i − α)² − (b_i − β)² = −4(m_i − μ)(l_i − λ).

So we see that the maximum in (13) is (a_i − α)² if (m_i − μ)(l_i − λ) ≤ 0, and (b_i − β)² if (m_i − μ)(l_i − λ) ≥ 0. Let us denote by (m_(1), ..., m_(n)), resp. (l_(1), ..., l_(n)), the sample of the midpoints, resp. the half-lengths, arranged in increasing order, and let us define the intervals:

  M_j = [m_(j), m_(j+1)], j = 0, ..., n,
  L_k = [l_(k), l_(k+1)], k = 0, ..., n,   (14)

with m_(0) = l_(0) = −∞ and m_(n+1) = l_(n+1) = +∞. For all (μ, λ) in any rectangle Q_{j,k} = M_j × L_k, the product (m_i − μ)(l_i − λ) has a given sign, for each i = 1, ..., n.
So formula (13) for S̃_2(c̃)² simplifies over such a rectangle to:

  S̃_{j,k}(c̃) = Σ_{i ∈ I_{a,j,k}} (a_i − α)² + Σ_{i ∈ I_{b,j,k}} (b_i − β)²,   (15)

where:

  I_{a,j,k} = { i ∈ {1, ..., n} | (m_i − (m_(j) + m_(j+1))/2)(l_i − (l_(k) + l_(k+1))/2) ≤ 0 },   (16)
  I_{b,j,k} = { i ∈ {1, ..., n} | (m_i − (m_(j) + m_(j+1))/2)(l_i − (l_(k) + l_(k+1))/2) > 0 }.   (17)

Hence the minimization of S̃_2(c̃)² over R² is equivalent to the resolution, for j, k = 0, 1, ..., n, of the (n + 1)² constrained quadratic problems:

  (P_{j,k}) find (α, β) = (α̂_{j,k}, β̂_{j,k}) which minimizes S̃_{j,k}(α, β) under the constraints:

  2 m_(j) ≤ α + β ≤ 2 m_(j+1)  and  2 l_(k) ≤ β − α ≤ 2 l_(k+1),   (18)

whose resolution is described in the Appendix. The central interval c̃̂ = [α̂, β̂] is then given by:

  (α̂, β̂) = arg min_{j,k = 0, 1, ..., n} S̃_{j,k}(α̂_{j,k}, β̂_{j,k}).   (19)

Because the number of operations in the resolution of (18) is proportional to n, the number of operations for the calculation of (α̂, β̂) is proportional to n³.

5 The multidimensional case

We now consider a set of n k-dimensional intervals {x̃_1, ..., x̃_n} with x̃_i = [a_i, b_i] and a_i, b_i ∈ R^k. A k-dimensional interval x̃_i can also be viewed as a regular hyperparallelepiped x̃_i = Π_{j=1}^k x̃_i^j with x̃_i^j = [a_i^j, b_i^j], where a_i^j (resp. b_i^j) is the j-th coordinate of a_i (resp. b_i). By abuse of language, the x̃_i's will be called hypercubes in the rest of the paper. The above geometrical approach can then be used to define a central hypercube (also called centrocube or prototype) of a set of n hypercubes {x̃_1, ..., x̃_n}, which is now a k-dimensional interval c̃ = [α, β] with α and β in R^k.
Replacing in (3) the terms d(x̃_i, c̃) by a distance D(x̃_i, c̃) between two hypercubes leads to the function S̃̃_p defined by:

  S̃̃_p(c̃) = (Σ_{i=1}^n D(x̃_i, c̃)^p)^{1/p} for p < ∞,
            max_{i=1,...,n} D(x̃_i, c̃) for p = ∞.   (20)

The centrocube c̃̂ = [α̂, β̂] is then defined by

  c̃̂ = arg min_{c̃ ∈ I} S̃̃_p(c̃).   (21)

There exist many possible distances between hypercubes (see, for instance, Bock, 2002). Once again, depending on the distance D and on the value of p in S̃̃_p(c̃), the centrocube is more or less difficult to calculate.

A first distance D that could be used is the Hausdorff distance between two hypercubes:

  D(x̃_1, x̃_2) = max(h(x̃_1, x̃_2), h(x̃_2, x̃_1))   (22)

with

  h(x̃_1, x̃_2) = sup_{a ∈ x̃_1} inf_{b ∈ x̃_2} δ(a, b),   (23)

where δ is an arbitrary metric on R^k. We have seen that in the one-dimensional case the Hausdorff distance simplifies to (5), but the calculation of this distance in higher dimensions is more involved and depends on the choice of the metric δ. If δ is the Euclidean metric, for instance, there exist algorithms that compute the Hausdorff distance between two hypercubes in a finite number of steps (see, e.g., Bock, 2005), but as far as we know, there exists no algorithm to compute the centrocube. If δ is the L_∞ metric, an explicit solution for the centrocube exists when p = ∞ (see Chavent, 2004). In the other cases, the definition of centrocubes for the original Hausdorff distance between hypercubes still remains a subject to investigate.

Another approach, which makes explicit definitions of centrocubes easier to find, is to use a distance D that is a combination of coordinate-wise one-dimensional interval distances d:

  D(x̃_1, x̃_2) = (Σ_{j=1}^k d(x̃_1^j, x̃_2^j)^q)^{1/q} for q < ∞,
                max_{j=1,...,k} d(x̃_1^j, x̃_2^j) for q = ∞.   (24)
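Taking for d the one-dimensional Hausdorff distance (5), the coordinate-wise combination (24) can be sketched as follows (a minimal illustration; function names are ours):

```python
# Coordinate-wise combination (24) of one-dimensional interval distances
# between two k-dimensional intervals, each given as a list of (a_j, b_j) pairs.

def hausdorff_1d(u, v):
    """One-dimensional Hausdorff distance (5) between intervals u and v."""
    return max(abs(u[0] - v[0]), abs(u[1] - v[1]))

def coordinatewise_distance(x1, x2, q=2.0):
    """L_q combination (24) of the coordinate-wise distances."""
    per_coord = [hausdorff_1d(u, v) for u, v in zip(x1, x2)]
    if q == float("inf"):
        return max(per_coord)
    return sum(t ** q for t in per_coord) ** (1 / q)

# Two 2-dimensional intervals (rectangles):
x1 = [(0, 2), (1, 5)]
x2 = [(1, 2), (0, 2)]
print(coordinatewise_distance(x1, x2, q=1))             # 1 + 3 = 4.0
print(coordinatewise_distance(x1, x2, q=float("inf")))  # max(1, 3) = 3
```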
When p = q, S̃̃_p(c̃)^p reads:

  S̃̃_p(c̃)^p = Σ_{i=1}^n Σ_{j=1}^k d(x̃_i^j, c̃^j)^p.   (25)

Because d(x̃_i^j, c̃^j) ≥ 0, it is sufficient to find, for each component j, the central interval c̃̂^j which minimizes Σ_{i=1}^n d(x̃_i^j, c̃^j)^p, so that the centrocube is the product of the central intervals of each variable. The results presented in Sections 3 and 4 concerning central intervals can then be applied directly to define this 'coordinate-wise' centrocube.

6 Concluding remarks

In this paper, we proposed different solutions for the determination of central intervals and hypercubes. These results have applications in clustering. Indeed, the existence of an explicit formula for the computation of the centrocube is useful in dynamic clustering (see Diday and Simon, 1976), because it ensures the decrease at each iteration of the criterion S̃̃_p. 'Coordinate-wise' centrocubes have been defined as prototypes in several dynamical clustering algorithms for interval data. The 'coordinate-wise' centrocube for p = q = 1 is used with the Hausdorff distance in Chavent and Lechevallier (2002), and with the L_1 distance between the lower and upper bounds in De Souza and De Carvalho (2004). The case p = q = 2 is used by De Carvalho et al. (2006) with the L_2 distance between the lower and upper bounds. The algorithm proposed in Section 4 for the determination of the central interval in the case of an L_2 combination of Hausdorff distances gives a solution for the case p = q = 2 and the Hausdorff distance.

Another application of these results concerns data scaling. Dealing with scalar variables measured on very different scales is already a problem when comparing two objects globally on all the variables.
For instance, the Euclidean distance, or more generally the L_q distance, will give more importance to variables with strong dispersion, and the comparison between objects will only reflect their differences on those variables. A natural way to avoid this effect is to use a normalized distance. An L_q normalized component-wise distance between hypercubes could then be:

  D(x̃_1, x̃_2) = (Σ_{j=1}^k (d(x̃_1^j, x̃_2^j) / S̃(c̃̂^j))^q)^{1/q} for q < ∞,
                max_{j=1,...,k} d(x̃_1^j, x̃_2^j) / S̃(c̃̂^j) for q = ∞,   (26)

where S̃(c̃̂^j) is the dispersion measure associated with a central interval c̃̂^j. For coherency reasons, it seems reasonable to use the same exponent (q = p):
- to aggregate the intervals in the search for the central interval and the evaluation of the dispersion for each variable (exponent p in (3)),
- and to evaluate the distance between objects (exponent q in (26)).

To conclude, a natural extension of these results concerns weighted central tendency and dispersion measures. This point is currently under investigation.

Acknowledgments

The authors thank G. Chavent for his helpful contribution to the resolution of problem (P_{j,k}) in the Appendix. They would also like to thank the associate editor and the reviewers for their useful comments.

Appendix: Resolution of problem (P_{j,k})

We describe here the resolution of one of the minimization problems (P_{j,k}) of equation (18). We drop the subscripts j, k, and we write m⁻ instead of m_(j), m⁺ instead of m_(j+1), l⁻ instead of l_(k) and l⁺ instead of l_(k+1). We use the midpoint and half-length variables μ = (α + β)/2 and λ = (β − α)/2, and we denote by Q the rectangle

  Q = {(μ, λ) such that m⁻ ≤ μ ≤ m⁺ and l⁻ ≤ λ ≤ l⁺}.   (27)
With these notations, the problem to solve is now:

  (P) find (μ̂, λ̂) which minimizes S̃(μ, λ) over Q,   (28)

where the objective function is:

  S̃(μ, λ) = Σ_{i ∈ I_a} (a_i − μ + λ)² + Σ_{i ∈ I_b} (b_i − μ − λ)²,   (29)

with I_a and I_b defined respectively in (16) and (17). This objective function is convex and quadratic (the level lines of S̃ are, possibly degenerate, ellipses with axes parallel to the directions λ = μ and λ = −μ), and the constraints in (27) are linear, so that the resolution of (P) is equivalent to that of the associated Kuhn-Tucker system of necessary conditions.

We now describe the corresponding algorithm. We have eliminated the consideration of some dead-end cases by taking advantage of the convexity of the problem: when the solution (μ̂, λ̂) of (P) is on one edge of Q (possibly at a corner of Q), the unconstrained minimizer (μ̌, λ̌) of S̃ and the center of Q are necessarily on different sides of the line containing this edge. Hence the edges of Q which can possibly contain the solution (μ̂, λ̂) are those which contain the L_2-projection of (μ̌, λ̌) on Q.

We suppose for simplicity that the midpoints and half-lengths of all intervals are distinct:

  m_(1) < m_(2) < ... < m_(n),  l_(1) < l_(2) < ... < l_(n).   (30)

One computes first, in a loop from i = 1 to n over the sample:

  n_a = Σ_{i ∈ I_a} 1,  n_b = Σ_{i ∈ I_b} 1,
  A = Σ_{i ∈ I_a} a_i,  B = Σ_{i ∈ I_b} b_i,
  A_2 = Σ_{i ∈ I_a} a_i²,  B_2 = Σ_{i ∈ I_b} b_i²,   (31)

with the convention that a sum is zero if the index set I_a or I_b is empty. Notice that n_a is the number of indices in I_a and n_b the number of indices in I_b, so that n = n_a + n_b. With these notations, the gradient of S̃:

  ∇S̃(μ, λ) = 2 ( −Σ_{i ∈ I_a} (a_i − μ + λ) − Σ_{i ∈ I_b} (b_i − μ − λ),
                 +Σ_{i ∈ I_a} (a_i − μ + λ) − Σ_{i ∈ I_b} (b_i − μ − λ) )
simplifies to:

  ∇S̃(μ, λ) = 2 ( −A − B + (n_a + n_b)μ − (n_a − n_b)λ,
                 A − B − (n_a − n_b)μ + (n_a + n_b)λ ).   (32)

The minimizer (μ̂, λ̂) of problem (P) can then be computed as follows:

1. If n_a = 0 (a similar reasoning can be done if n_b = 0), then the function S̃ reduces over Q to:

  S̃(μ, λ) = Σ_{i=1,...,n} (b_i − μ − λ)²,

and the level lines of S̃ degenerate to the straight lines μ + λ = constant. The unconstrained minimizers (μ̌, λ̌) of S̃ are then on the line:

  (L) n(μ + λ) = B.

If the line (L) goes through Q, problem (P) has an infinite number of solutions, with at least one of them (in general two) being on the boundary of Q. If (L) does not hit Q, the unique solution of (P) is located at the corner of Q closest to (L). In both cases, (P) admits at least one solution (μ̂, λ̂) on one edge of Q. If we denote by Q* the rectangle on the other side of this edge (for which the corresponding n_a is nonzero), one sees that (μ̂, λ̂) ∈ Q*, so that the minimum S̃*_min of S̃ over Q* will necessarily be no larger than S̃_min, the minimum of S̃ over Q (as (μ̂, λ̂) ∈ Q*). So there is no point in computing S̃_min, and we can skip the resolution of problem (P).

2. If n_a > 0 and n_b > 0, the unconstrained minimizer (μ̌, λ̌) of S̃ is unique. It is given by:

  Σ_{i ∈ I_a} a_i = n_a (μ̌ − λ̌) = n_a α̌,
  Σ_{i ∈ I_b} b_i = n_b (μ̌ + λ̌) = n_b β̌.   (33)

If (μ̌, λ̌) ∈ Q, then set μ̂ = μ̌, λ̂ = λ̌, and the problem is solved. If not, go to the next step.

3. Compute the L_2-projection (μ̌̌, λ̌̌) of (μ̌, λ̌) on Q:

  μ̌̌ = m⁻ if μ̌ ≤ m⁻; μ̌ if m⁻ ≤ μ̌ ≤ m⁺; m⁺ if m⁺ ≤ μ̌,
  λ̌̌ = l⁻ if λ̌ ≤ l⁻; λ̌ if l⁻ ≤ λ̌ ≤ l⁺; l⁺ if l⁺ ≤ λ̌.   (34)

4.
If the projection is on an edge of Q, say for example μ̌̌ = m⁻, l⁻ < λ̌̌ < l⁺ (left edge), determine λ̌̌̌ which zeroes the component of ∇S̃ along this edge (here the second component, as the edge is parallel to the second axis μ = 0):

  +Σ_{i ∈ I_a} a_i − n_a (m⁻ − λ̌̌̌) − Σ_{i ∈ I_b} b_i + n_b (m⁻ + λ̌̌̌) = 0.   (35)

Then set:

  μ̂ = m⁻,  λ̂ = l⁻ if λ̌̌̌ ≤ l⁻; λ̌̌̌ if l⁻ ≤ λ̌̌̌ ≤ l⁺; l⁺ if l⁺ ≤ λ̌̌̌,   (36)

and the problem is solved.

5. If the projection is at a corner of Q, say for example μ̌̌ = m⁻, λ̌̌ = l⁻ (lower-left corner), evaluate the gradient ∇S̃ = (g_μ, g_λ) at the corner.

- If g_μ ≥ 0 and g_λ ≥ 0, set μ̂ = m⁻, λ̂ = l⁻, and the problem is solved.

- If g_μ < 0 and g_λ ≥ 0 (the objective function is decreasing when one leaves the lower-left corner to the right on the lower edge of Q), determine μ̌̌̌ which zeroes the component of ∇S̃ along this edge (here the first component, as the edge is parallel to the first axis λ = 0):

  −Σ_{i ∈ I_a} a_i + n_a (μ̌̌̌ − l⁻) − Σ_{i ∈ I_b} b_i + n_b (μ̌̌̌ + l⁻) = 0.   (37)

Then set:

  μ̂ = μ̌̌̌ if m⁻ < μ̌̌̌ ≤ m⁺; m⁺ if m⁺ ≤ μ̌̌̌,  λ̂ = l⁻,   (38)

and the problem is solved.

- If g_μ ≥ 0 and g_λ < 0, similarly determine λ̌̌̌ which zeroes the component of ∇S̃ along the left edge of Q:

  +Σ_{i ∈ I_a} a_i − n_a (m⁻ − λ̌̌̌) − Σ_{i ∈ I_b} b_i + n_b (m⁻ + λ̌̌̌) = 0.   (39)

Then set:

  μ̂ = m⁻,  λ̂ = λ̌̌̌ if l⁻ < λ̌̌̌ ≤ l⁺; l⁺ if l⁺ ≤ λ̌̌̌,   (40)

and the problem is solved.

- The case g_μ < 0 and g_λ < 0 cannot happen.

The minimum value S̃_min of S̃ over Q is then:

  S̃_min = A_2 − 2A α̂ + n_a α̂² + B_2 − 2B β̂ + n_b β̂²,   (41)

where α̂ = μ̂ − λ̂ and β̂ = μ̂ + λ̂.

References

BERTRAND, P. and GOUPIL, F. (2000), "Descriptive statistics for symbolic data", In: H.-H. Bock and E. Diday (eds.): Analysis of symbolic data.
Exploratory methods for extracting statistical information from complex data, Springer Verlag, 103-124.

BILLARD, L. and DIDAY, E. (2003), "From the statistics of data to the statistics of knowledge: Symbolic data analysis", Journal of the American Statistical Association, 98, 470-487.

BOCK, H.-H. (2002), "Clustering methods and Kohonen maps for symbolic data", Journal of the Japanese Society of Computational Statistics, 15, 1-13.

BOCK, H.-H. (2005), "Optimization in Symbolic Data Analysis: Dissimilarities, Class Centers, and Clustering", In: D. Baier and L. Schmidt-Thieme (eds.): Data Analysis and Decision Support, Springer Verlag, 1-10.

BOCK, H.-H. and DIDAY, E. (eds.) (2000), Analysis of symbolic data. Exploratory methods for extracting statistical information from complex data, Springer Verlag, Heidelberg.

CHAVENT, M. (2004), "A Hausdorff distance between hyper-rectangles for clustering interval data", In: D. Banks et al. (eds.): Classification, Clustering and Data Mining Applications, Springer, 333-340.

CHAVENT, M. and LECHEVALLIER, Y. (2002), "Dynamical Clustering of interval data. Optimization of an adequacy criterion based on Hausdorff distance", In: K. Jajuga et al. (eds.): Classification, Clustering and Data Analysis, Springer Verlag, 53-60.

CHAVENT, M., CARVALHO, F. de A.T., LECHEVALLIER, Y. and VERDE, R. (2006), "New Clustering methods for interval data", Computational Statistics, 26, 211-229.

DE CARVALHO, F. de A.T. (1998), "Extension based proximities coefficients between boolean symbolic objects", In: C. Hayashi et al. (eds.): Data Science, Classification and Related Methods, Springer Verlag, 370-378.

DE CARVALHO, F. de A.T., BRITO, P. and BOCK, H.-H. (2006), "Dynamic Clustering for Interval Data Based on L_2 Distance", Computational Statistics, 21, 231-250.

DE SOUZA, R.M.C.R. and DE CARVALHO, F.
de A.T. (2004), "Clustering of interval data based on city-block distances", Pattern Recognition Letters, 25, 353-365.

DIDAY, E. and SIMON, J.J. (1976), "Clustering Analysis", In: K.S. Fu (ed.): Digital Pattern Recognition, Springer, 47-94.

ICHINO, M. and YAGUCHI, H. (1994), "Generalized Minkowski metrics for mixed feature type data analysis", IEEE Transactions on Systems, Man and Cybernetics, 24, 698-708.

NADLER, S.B.J. (1978), Hyperspaces of sets, Marcel Dekker, Inc., New York.