Symmetry in Data Mining and Analysis: A Unifying View based on Hierarchy

Fionn Murtagh
Science Foundation Ireland, Wilton Park House, Wilton Place, Dublin 2, Ireland
and Department of Computer Science, Royal Holloway, University of London, Egham TW20 0EX, UK
fmurtagh@acm.org

October 22, 2018

Abstract

Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. “Structure” has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Symmetries directly point to invariants, that pinpoint intrinsic properties of the data and of the background empirical domain of interest. As our data models change so too do our perspectives on analyzing data. The structures in data surveyed here are based on hierarchy, represented as p-adic numbers or an ultrametric topology.

Keywords: Data analytics, multivariate data analysis, pattern recognition, information storage and retrieval, clustering, hierarchy, p-adic, ultrametric topology, complexity

1 Introduction

Herbert A. Simon, Nobel Laureate in Economics, originator of “bounded rationality” and of “satisficing”, believed in hierarchy at the basis of the human and social sciences, as the following quotation shows: “... my central theme is that complexity frequently takes the form of hierarchy and that hierarchic systems have some common properties independent of their specific content. Hierarchy, I shall argue, is one of the central structural schemes that the architect of complexity uses.” ([74], p. 184.)

Partitioning a set of observations [76, 77, 53] leads to some very simple symmetries. This is one approach to clustering and data mining.
But such approaches, often based on optimization, are really not of direct interest to us here. Instead we will pursue the theme pointed to by Simon, namely that the notion of hierarchy is fundamental for interpreting data and the complex reality which the data expresses. Our work is very different too from the marvelous view of the development of mathematical group theory (viewed in its own right as a complex, evolving system) presented by Foote [20].

1.1 Structure in Observed or Measured Data

Weyl [83] makes the case for the fundamental importance of symmetry in science, engineering, architecture, art and other areas. As a “guiding principle”, “Whenever you have to do with a structure-endowed entity ... try to determine its group of automorphisms, the group of those element-wise transformations which leave all structural relations undisturbed. You can expect to gain a deep insight in the constitution of [the structure-endowed entity] in this way. After that you may start to investigate symmetric configurations of elements, i.e. configurations which are invariant under a certain subgroup of the group of all automorphisms; ...” ([83], p. 144).

“Symmetry is a vast subject, significant in art and nature.”, Weyl states (p. 145), and he finds no better example of the “mathematical intellect” at work. “Although the mathematics of group theory and the physics of symmetries were not fully developed simultaneously – as in the case of calculus and mechanics by Newton – the intimate relationship between the two was fully realized and clearly formulated by Wigner and Weyl, among others, before 1930.” ([78], p. 1.) Powerful impetus was given to this (mathematical) group view of study and exploration of symmetry in art and nature by Felix Klein’s 1872 Erlangen Program [41], which proposed that geometry was at heart group theory: geometry is the study of groups of transformations, and their invariants.
Klein’s Erlangen Program is at the cross-roads of mathematics and physics. The purpose of this article is to locate symmetry and group theory at the cross-roads of data mining and data analytics too.

1.2 About this Article

In section 2, we describe ultrametric topology as an expression of hierarchy.

In section 3, p-adic encoding, providing a number theory vantage point on ultrametric topology, gives rise to additional symmetries and ways to capture invariants in data.

Section 4 deals with symmetries that are part and parcel of a tree, representing a partial order on data, or equally a set of subsets of the data, some of which are embedded.

In section 5 permutations are at issue, including permutations that have the property of representing hierarchy.

Section 6 deals with new and recent results relating to the remarkable symmetries of massive, and especially high dimensional data sets.

1.3 A Brief Introduction to Hierarchical Clustering

For the reader new to analysis of data a very short introduction is now provided on hierarchical clustering. As with other families of algorithm, the objective is automatic classification, for the purposes of data mining, or knowledge discovery. Classification, after all, is fundamental in human thinking, and machine-based decision making. But we draw attention to the fact that our objective is unsupervised, as opposed to supervised classification, also known as discriminant analysis or (in a general way) machine learning. So here we are not concerned with generalizing the decision making capability of training data, nor are we concerned with fitting statistical models to data so that these models can play a role in generalizing and predicting. Instead we are concerned with having “data speak for themselves”. That this unsupervised objective of classifying data (observations, objects, events, phenomena, etc.) is a huge task in our society is unquestionably true.
One may think of situations when precedents are very limited, for instance.

Among families of clustering, or unsupervised classification, algorithms, we can distinguish the following: (i) array permuting and other visualization approaches; (ii) partitioning to form (discrete or overlapping) clusters through optimization, including graph-based approaches; and, of interest to us in this article, (iii) embedded clusters interrelated in a tree-based way.

For the last-mentioned family of algorithm, agglomerative building of the hierarchy from consideration of object pairwise distances has been the most common approach adopted. As comprehensive background texts, see [52, 29, 84, 30].

1.4 A Brief Introduction to p-Adic Numbers

The real number system, and a p-adic number system for given prime, p, are potentially equally useful alternatives. p-Adic numbers were introduced by Kurt Hensel in 1898.

Whether we deal with Euclidean or with non-Euclidean geometry, we are (nearly) always dealing with reals. But the reals start with the natural numbers, and from associating observational facts and details with such numbers we begin the process of measurement. From the natural numbers, we proceed to the rationals, allowing fractions to be taken into consideration.

The following view of how we do science or carry out other quantitative study was proposed by Volovich in 1987 [80, 81]. See also Freund [23]. We can always use rationals to make measurements. But they will be approximate, in general. It is better therefore to allow for observables being “continuous, i.e. endow them with a topology”. Therefore we need a completion of the field $\mathbb{Q}$ of rationals. To complete the field $\mathbb{Q}$ of rationals, we need Cauchy sequences and this requires a norm on $\mathbb{Q}$ (because the Cauchy sequence must converge, and a norm is the tool used to show this).
There is the Archimedean norm such that: for any $x, y \in \mathbb{Q}$ with $0 < |x| < |y|$, there exists an integer $N$ such that $|Nx| > |y|$. For convenience here, we write $|x|_\infty$ for this norm. So if this completion is Archimedean, then we have $\mathbb{R} = \mathbb{Q}_\infty$, the reals. That is fine if space is taken as commutative and Euclidean.

What of alternatives? Remarkably all norms are known. Besides the $\mathbb{Q}_\infty$ norm, we have an infinity of norms, $|x|_p$, labeled by primes, p. By Ostrowski’s theorem [66] these are all the possible norms on $\mathbb{Q}$. So we have an unambiguous labeling, via p, of the infinite set of non-Archimedean completions of $\mathbb{Q}$ to a field endowed with a topology.

In all cases, we obtain locally compact completions, $\mathbb{Q}_p$, of $\mathbb{Q}$. They are the fields of p-adic numbers. All these $\mathbb{Q}_p$ are continua. Being locally compact, they have additive and multiplicative Haar measures. As such we can integrate over them, as for the reals.

1.5 Brief Discussion of p-Adic and m-Adic Numbers

We will use p to denote a prime, and m to denote a non-zero positive integer. A p-adic number is such that any set of p integers which are in distinct residue classes modulo p may be used as p-adic digits. (Cf. remark below, at the end of section 3.1, quoting from [26]. It makes the point that this opens up a range of alternative notation options in practice.)

Recall that a ring does not in general allow division, while a field does. m-Adic numbers form a ring; but p-adic numbers form a field. So a priori, 10-adic numbers form a ring. This provides us with a reason for preferring p-adic over m-adic numbers.

We can consider various p-adic expansions:

1. $\sum_{i=0}^{n} a_i p^i$, which defines positive integers. For a p-adic number, we require $a_i \in \{0, 1, \ldots, p-1\}$. (In practice: just write the integer in base p, i.e. in binary form when p = 2.)

2. $\sum_{i=-\infty}^{n} a_i p^i$ defines rationals.

3. $\sum_{i=k}^{\infty} a_i p^i$, where k is an integer, not necessarily positive, defines the field $\mathbb{Q}_p$ of p-adic numbers.

$\mathbb{Q}_p$, the field of p-adic numbers, is (as seen in these definitions) the field of p-adic expansions.

The choice of p is a practical issue. Indeed, adelic numbers use all possible values of p (see [9] for extensive use and discussion of the adelic number framework). Consider [17, 40]. DNA (desoxyribonucleic acid) is encoded using four nucleotides: A, adenine; G, guanine; C, cytosine; and T, thymine. In RNA (ribonucleic acid) T is replaced by U, uracil. In [17] a 5-adic encoding is used, since 5 is a prime and thereby offers uniqueness. In [40] a 4-adic encoding is used, and a 2-adic encoding, with the latter based on 2-digit boolean expressions for the four nucleotides (00, 01, 10, 11). A default norm is used, based on a longest common prefix, with p-adic digits from the start or left of the sequence (see section 3.2 below where this longest common prefix norm or distance is used).

2 Ultrametric Topology

In this section we mainly explore symmetries related to: geometric shape; matrix structure; and lattice structures.

2.1 Ultrametric Space for Representing Hierarchy

Consider Figure 1, illustrating the ultrametric distance and its role in defining a hierarchy. An early, influential paper is Johnson [34] and an important survey is that of Rammal et al. [68]. Discussion of how a hierarchy expresses the semantics of change and distinction can be found in [63].

The ultrametric topology was introduced by Marc Krasner [44], the ultrametric inequality having been formulated by Hausdorff in 1934. Essential motivation for the study of this area is provided by [71] as follows. Real and complex fields gave rise to the idea of studying any field K with a complete valuation $|\cdot|$ comparable to the absolute value function. Such fields satisfy the “strong triangle inequality” $|x + y| \le \max(|x|, |y|)$.
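The strong triangle inequality can be checked numerically for the p-adic norm $|x|_p = p^{-v_p(x)}$ on the rationals, where $v_p(x)$ is the exponent of p in the factorization of x. The following is a minimal sketch using exact rational arithmetic; the function names (`valuation`, `padic_norm`, `strong_triangle_holds`) and the sample grid are illustrative assumptions, not part of the original article.

```python
from fractions import Fraction
from itertools import product


def valuation(x: Fraction, p: int) -> int:
    """Exponent of the prime p in the factorization of a non-zero rational x."""
    if x == 0:
        raise ValueError("the valuation of 0 is +infinity")
    num, den, v = abs(x.numerator), x.denominator, 0
    while num % p == 0:   # powers of p in the numerator count positively
        num //= p
        v += 1
    while den % p == 0:   # powers of p in the denominator count negatively
        den //= p
        v -= 1
    return v


def padic_norm(x: Fraction, p: int) -> Fraction:
    """|x|_p = p^(-v_p(x)), with |0|_p = 0 by convention."""
    if x == 0:
        return Fraction(0)
    return Fraction(p) ** (-valuation(x, p))


def strong_triangle_holds(values, p: int) -> bool:
    """Check |x + y|_p <= max(|x|_p, |y|_p) for every pair drawn from `values`."""
    return all(
        padic_norm(x + y, p) <= max(padic_norm(x, p), padic_norm(y, p))
        for x, y in product(values, repeat=2)
    )


# An arbitrary small grid of rationals used only for illustration.
SAMPLE = [Fraction(n, d) for n in range(-6, 7) for d in (1, 2, 3, 5)]
```

For instance, `padic_norm(Fraction(12), 2)` is 1/4 because 12 = 2^2 · 3, and `strong_triangle_holds(SAMPLE, p)` can be evaluated for any prime p of interest.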
Given a valued field, defining a totally ordered Abelian (i.e. commutative) group, an ultrametric space is induced through $|x - y| = d(x, y)$. Various terms are used interchangeably for analysis in and over such fields such as p-adic, ultrametric, non-Archimedean, and isosceles. The natural geometric ordering of metric valuations is on the real line, whereas in the ultrametric case the natural ordering is a hierarchical tree.

2.2 Some Geometrical Properties of Ultrametric Spaces

We see from the following, based on [46] (chapter 0, part IV), that an ultrametric space is quite different from a metric one. In an ultrametric space everything “lives” on a tree.

In an ultrametric space, all triangles are either isosceles with small base, or equilateral. We have here very clear symmetries of shape in an ultrametric topology. These symmetry “patterns” can be used to fingerprint data sets and time series: see [59, 60] for many examples of this.

Some further properties that are studied in [46] are: (i) Every point of a circle in an ultrametric space is a center of the circle. (ii) In an ultrametric topology, every ball is both open and closed (termed clopen). (iii) An ultrametric space is 0-dimensional (see [10, 70]). It is clear that an ultrametric topology is very different from our intuitive, or Euclidean, notions. The most important point to keep in mind is that in an ultrametric space everything “lives” in a hierarchy expressed by a tree.

[Figure 1 plot: dendrogram on the three terminals x, y, z; vertical axis: Height, from 1.0 to 3.5.]

Figure 1: The strong triangular inequality defines an ultrametric: every triplet of points satisfies the relationship $d(x, z) \le \max\{d(x, y), d(y, z)\}$ for distance d. Cf., by reading off the hierarchy, how this is verified for all x, y, z: d(x, z) = 3.5; d(x, y) = 3.5; d(y, z) = 1.0. In addition the symmetry and positive definiteness conditions hold for any pair of points.
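The statement that every ultrametric triangle is isosceles with small base, or equilateral, can be sketched as a small classifier applied to the three distances read off Figure 1. The function name `classify_triangle` and the tolerance parameter are assumptions made for illustration.

```python
def classify_triangle(dxy: float, dyz: float, dxz: float, tol: float = 1e-12) -> str:
    """Classify a triangle by its three side lengths.

    In an ultrametric space the two largest sides are always equal, so the
    triangle is either equilateral or isosceles with the odd side shortest.
    """
    a, b, c = sorted([dxy, dyz, dxz])   # a <= b <= c
    if abs(c - a) <= tol:
        return "equilateral"
    if abs(c - b) <= tol:
        return "isosceles with small base"
    return "not ultrametric"


# Distances read off Figure 1: d(x, y) = 3.5, d(y, z) = 1.0, d(x, z) = 3.5.
FIGURE_1 = classify_triangle(3.5, 1.0, 3.5)

# A Euclidean 3-4-5 right triangle, for contrast.
EUCLIDEAN = classify_triangle(3.0, 4.0, 5.0)
```

Here `FIGURE_1` evaluates to "isosceles with small base", while the 3-4-5 triangle violates the strong triangular inequality and is reported as "not ultrametric".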
2.3 Ultrametric Matrices and Their Properties

For an n × n matrix of positive reals, symmetric with respect to the principal diagonal, to be a matrix of distances associated with an ultrametric distance on X, a sufficient and necessary condition is that a permutation of rows and columns satisfies the following form of the matrix:

1. Above the diagonal term, equal to 0, the elements of the same row are non-decreasing.

2. For every index k, if
   $d(k, k+1) = d(k, k+2) = \cdots = d(k, k+\ell+1)$
   then
   $d(k+1, j) \le d(k, j)$ for $k+1 < j \le k+\ell+1$
   and
   $d(k+1, j) = d(k, j)$ for $j > k+\ell+1$.

Under these circumstances, $\ell \ge 0$ is the length of the section beginning, beyond the principal diagonal, the interval of columns of equal terms in row k.

To illustrate the ultrametric matrix format, consider the small data set shown in Table 1. A dendrogram produced from this is in Figure 2. The ultrametric matrix that can be read off this dendrogram is shown in Table 2.

        Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
iris1       5.1           3.5          1.4           0.2
iris2       4.9           3.0          1.4           0.2
iris3       4.7           3.2          1.3           0.2
iris4       4.6           3.1          1.5           0.2
iris5       5.0           3.6          1.4           0.2
iris6       5.4           3.9          1.7           0.4
iris7       4.6           3.4          1.4           0.3

Table 1: Input data: 7 iris flowers characterized by sepal and petal widths and lengths. From Fisher’s iris data [18].

        iris1      iris2      iris3      iris4      iris5      iris6      iris7
iris1   0          0.6480741  0.6480741  0.6480741  1.1661904  1.1661904  1.1661904
iris2   0.6480741  0          0.3316625  0.3316625  1.1661904  1.1661904  1.1661904
iris3   0.6480741  0.3316625  0          0.2449490  1.1661904  1.1661904  1.1661904
iris4   0.6480741  0.3316625  0.2449490  0          1.1661904  1.1661904  1.1661904
iris5   1.1661904  1.1661904  1.1661904  1.1661904  0          0.6164414  0.9949874
iris6   1.1661904  1.1661904  1.1661904  1.1661904  0.6164414  0          0.9949874
iris7   1.1661904  1.1661904  1.1661904  1.1661904  0.9949874  0.9949874  0

Table 2: Ultrametric matrix derived from the dendrogram in Figure 2.
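As a sketch of how the two matrix conditions can be tested, the following code checks them on the values of Table 2, together with the strong triangular inequality over all triplets. It only tests the row and column order as given and does not search over permutations; the function names and the tolerance are illustrative assumptions.

```python
from itertools import product
from math import isclose

# Values transcribed from Table 2 (rows and columns iris1 .. iris7).
A, B, C2, T, E, F = 0.6480741, 0.3316625, 0.2449490, 1.1661904, 0.6164414, 0.9949874
TABLE_2 = [
    [0, A, A, A, T, T, T],
    [A, 0, B, B, T, T, T],
    [A, B, 0, C2, T, T, T],
    [A, B, C2, 0, T, T, T],
    [T, T, T, T, 0, E, F],
    [T, T, T, T, E, 0, F],
    [T, T, T, T, F, F, 0],
]


def is_ultrametric(D, tol=1e-9):
    """Strong triangular inequality d(i,k) <= max(d(i,j), d(j,k)) for all triplets."""
    n = range(len(D))
    return all(D[i][k] <= max(D[i][j], D[j][k]) + tol for i, j, k in product(n, repeat=3))


def has_ultrametric_form(D, tol=1e-9):
    """Check conditions 1 and 2 for the given ordering (0-based indices)."""
    n = len(D)
    for k in range(n - 1):
        row = D[k]
        # Condition 1: entries to the right of the diagonal are non-decreasing.
        if any(row[j] > row[j + 1] + tol for j in range(k + 1, n - 1)):
            return False
        # Last column of the run of equal terms that starts at column k + 1.
        end = k + 1
        while end + 1 < n and isclose(row[end + 1], row[k + 1], abs_tol=tol):
            end += 1
        # Condition 2: compare row k + 1 with row k inside and beyond that run.
        for j in range(k + 2, n):
            if j <= end:
                if D[k + 1][j] > row[j] + tol:
                    return False
            elif not isclose(D[k + 1][j], row[j], abs_tol=tol):
                return False
    return True


# Three equally spaced points on a line: a metric that is not an ultrametric.
LINE = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
```

With these definitions, Table 2 passes both `is_ultrametric` and `has_ultrametric_form`, whereas the `LINE` example fails both in the ordering shown.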
Finally a visualization of this matrix, illustrating the ultrametric matrix properties discussed above, is in Figure 3.

2.4 Clustering Through Matrix Row and Column Permutation

Figure 3 shows how an ultrametric distance allows a certain structure to be visible (quite possibly, in practice, subject to an appropriate row and column permuting), in a matrix defined from the set of all distances. For set X, then, this matrix expresses the distance mapping of the Cartesian product, $d : X \times X \longrightarrow \mathbb{R}^+$. $\mathbb{R}^+$ denotes the non-negative reals. A priori the rows and columns of the function of the Cartesian product set X with itself could be in any order. The ultrametric matrix properties establish what is possible when the distance is an ultrametric one. Because the matrix (a 2-way data object) involves one mode (due to set X being crossed with itself; as opposed to the 2-mode case where an observation set is crossed by an attribute set) it is clear that both rows and columns can be permuted to yield the same order on X. A property of the form of the matrix is that small values are at or near the principal diagonal.

[Figure 2 plot: dendrogram with terminals in the order 1, 3, 4, 2, 5, 6, 7; vertical axis: Height, from 0.2 to 1.2.]

Figure 2: Hierarchical clustering of 7 iris flowers using data from Table 1. No data normalization was used. The agglomerative clustering criterion was the minimum variance or Ward one.

A generalization opens up for this sort of clustering by visualization scheme. Firstly, we can directly apply row and column permuting to 2-mode data, i.e. to the rows and columns of a matrix crossing indices I by attributes J, $a : I \times J \longrightarrow \mathbb{R}$. A matrix of values, a(i, j), is furnished by the function a acting on the sets I and J. Here, each such term is real-valued. We can also generalize the principle of permuting such that small values are on or near the principal diagonal to instead allow similar values to be near one another, and thereby to facilitate visualization.
An optimized way to do this was pursued in [50, 49]. Comprehensive surveys of clustering algorithms in this area, including objective functions, visualization schemes, optimization approaches, presence of constraints, and applications, can be found in [51, 48]. See too [15, 57].

All these approaches are underpinned by row and column permutations, which can be expressed in terms of the permutation group, $S_n$, on n elements.

Figure 3: A visualization of the ultrametric matrix of Table 2, where bright or white = highest value, and black = lowest value.

2.5 Other Miscellaneous Symmetries

As examples of various other local symmetries worthy of consideration in data sets consider subsets of data comprising clusters, and reciprocal nearest neighbor pairs.

Given an observation set, X, we define dissimilarities as the mapping $d : X \times X \longrightarrow \mathbb{R}^+$. A dissimilarity is a positive, definite, symmetric measure (i.e., $d(x, y) \ge 0$; $d(x, y) = 0$ if $x = y$; $d(x, y) = d(y, x)$). If in addition the triangular inequality is satisfied (i.e., $d(x, y) \le d(x, z) + d(z, y)$, $\forall x, y, z \in X$) then the dissimilarity is a distance.

If X is endowed with a metric, then this metric is mapped onto an ultrametric. In practice, there is no need for X to be endowed with a metric. Instead a dissimilarity is satisfactory.

A hierarchy, H, is defined as a binary, rooted, node-ranked tree, also termed a dendrogram [7, 34, 46, 57]. A hierarchy defines a set of embedded subsets of a given set of objects X, indexed by the set I. That is to say, object i in the object set X is denoted $x_i$, and $i \in I$. These subsets are totally ordered by an index function ν, which is a stronger condition than the partial order required by the subset relation. The index function ν is represented by the ordinate in Figure 2 (the “height” or “level”). A bijection exists between a hierarchy and an ultrametric space.
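One direction of this bijection can be sketched directly: in an ultrametric space the relation $d(x, y) \le h$ is an equivalence relation, so cutting at each height h yields a partition, and the partitions are nested as h grows. The example uses the three points of Figure 1; the function name `clusters_at_level` and the dictionary layout are illustrative assumptions.

```python
# Pairwise distances read off Figure 1.
DIST = {
    frozenset("xy"): 3.5,
    frozenset("xz"): 3.5,
    frozenset("yz"): 1.0,
}


def d(a: str, b: str) -> float:
    """Ultrametric distance between two of the points x, y, z."""
    return 0.0 if a == b else DIST[frozenset((a, b))]


def clusters_at_level(points, dist, h):
    """Partition `points` into balls of radius h.

    Because the distance is ultrametric, comparing a point with a single
    representative of each cluster is enough to decide membership.
    """
    clusters = []
    for x in points:
        for c in clusters:
            if dist(x, c[0]) <= h:
                c.append(x)
                break
        else:
            clusters.append([x])
    return clusters


# Nested partitions obtained at increasing heights: the hierarchy itself.
HIERARCHY = [clusters_at_level("xyz", d, h) for h in (0.5, 1.0, 3.5)]
```

`HIERARCHY` goes from three singletons, to {y, z} merged at height 1.0, to the full set at height 3.5, which is the dendrogram of Figure 1 read level by level.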
Often in this article we will refer interchangeably to the object set, X, and the associated set of indices, I.

Usually a constructive approach is used to induce H on a set I. The most efficient algorithms are based on nearest neighbor chains, which by definition end in a pair of agglomerable reciprocal nearest neighbors. Further information can be found in [54, 55, 57, 58].

2.6 Generalized Ultrametric

In this subsection, we consider an ultrametric defined on the power set or join semilattice. Comprehensive background on ordered sets and lattices can be found in [13]. A review of generalized distances and ultrametrics can be found in [72].

2.6.1 Link with Formal Concept Analysis

Typically hierarchical clustering is based on a distance (which can be relaxed often to a dissimilarity, not respecting the triangular inequality, and mutatis mutandis to a similarity), defined on all pairs of the object set: $d : X \times X \rightarrow \mathbb{R}^+$. I.e., a distance is a positive real value. Usually we require that a distance cannot be 0-valued unless the objects are identical. That is the traditional approach.

A different form of ultrametrization is achieved from a dissimilarity defined on the power set of attributes characterizing the observations (objects, individuals, etc.) X. Here we have: $d : X \times X \longrightarrow 2^J$, where J indexes the attribute (variables, characteristics, properties, etc.) set.

This gives rise to a different notion of distance, that maps pairs of objects onto elements of a join semilattice. The latter can represent all subsets of the attribute set, J. That is to say, it can represent the power set, commonly denoted $2^J$, of J.

As an example, consider, say, n = 5 objects characterized by 3 boolean (presence/absence) attributes, shown in Figure 4 (top).
Define dissimilarity between a pair of objects in this table as a set of 3 components, corresponding to the 3 attributes, such that for each attribute: if both objects have value 0, the component is 1; if one object has value 1 and the other 0, the component is 1; and if both objects have value 1, the component is 0. This is the simple matching coefficient [32]. We could use, e.g., Euclidean distance for each of the values sought; but we prefer to treat 0 values in both components as signaling a 1 contribution. We get then d(a, b) = 1, 1, 0 which we will call d1,d2. Then, d(a, c) = 0, 1, 0 which we will call d2. Etc. With the latter we create lattice nodes as shown in the middle part of Figure 4.

        v1   v2   v3
   a     1    0    1
   b     0    1    1
   c     1    0    1
   e     1    0    0
   f     0    0    1

   Potential lattice vertices      Lattice vertices found      Level

          d1,d2,d3                       d1,d2,d3                 3
           /    \                         /    \
   d1,d2  d2,d3  d1,d3               d1,d2    d2,d3               2
           \    /                         \    /
     d1     d2     d3                       d2                    1

   The set d1,d2,d3 corresponds to: d(b, e) and d(e, f)
   The subset d1,d2 corresponds to: d(a, b), d(a, f), d(b, c), d(b, f), and d(c, f)
   The subset d2,d3 corresponds to: d(a, e) and d(c, e)
   The subset d2 corresponds to: d(a, c)

   Clusters defined by all pairwise linkage at level ≤ 2:
      a, b, c, f
      a, c, e
   Clusters defined by all pairwise linkage at level ≤ 3:
      a, b, c, e, f

Figure 4: Top: example data set consisting of 5 objects, characterized by 3 boolean attributes. Then: lattice corresponding to this data and its interpretation.

In Formal Concept Analysis [13, 25], it is the lattice itself which is of primary interest. In [32] there is discussion of, and a range of examples on, the close relationship between the traditional hierarchical cluster analysis based on $d : I \times I \rightarrow \mathbb{R}^+$, and hierarchical cluster analysis “based on abstract posets” (a poset is a partially ordered set), based on $d : I \times I \rightarrow 2^J$. The latter, leading to clustering based on dissimilarities, was developed initially in [31].
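The set-valued dissimilarity of this example, and the clusters listed in Figure 4, can be reproduced with a short sketch. The names `set_dissimilarity`, `PAIRS` and `clusters_at_level` are hypothetical, and the clusters are found by brute-force enumeration of subsets, which is only reasonable for a toy data set of this size.

```python
from itertools import combinations

# The 5 objects and 3 boolean attributes (v1, v2, v3) of Figure 4.
DATA = {
    "a": (1, 0, 1),
    "b": (0, 1, 1),
    "c": (1, 0, 1),
    "e": (1, 0, 0),
    "f": (0, 0, 1),
}


def set_dissimilarity(u, v):
    """Attribute j contributes d_j unless both objects have value 1 for it."""
    return frozenset(
        f"d{j + 1}" for j, (s, t) in enumerate(zip(u, v)) if not (s == 1 and t == 1)
    )


# Lattice element attached to every pair of objects.
PAIRS = {
    (x, y): set_dissimilarity(DATA[x], DATA[y]) for x, y in combinations(DATA, 2)
}


def clusters_at_level(level):
    """Maximal subsets whose pairwise dissimilarities have at most `level` elements."""
    names = list(DATA)
    linked = [
        set(subset)
        for r in range(2, len(names) + 1)
        for subset in combinations(names, r)
        if all(len(PAIRS[pair]) <= level for pair in combinations(subset, 2))
    ]
    # Keep only subsets not strictly contained in another linked subset.
    return [s for s in linked if not any(s < t for t in linked)]
```

Calling `clusters_at_level(2)` returns the two overlapping clusters {a, b, c, f} and {a, c, e}, and `clusters_at_level(3)` returns the single cluster of all five objects, in agreement with Figure 4.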
2.6.2 Applications of Generalized Ultrametrics

As noted in the previous subsection, the usual ultrametric is an ultrametric distance, i.e. for a set I, $d : I \times I \longrightarrow \mathbb{R}^+$. The generalized ultrametric is: $d : I \times I \longrightarrow \Gamma$, where Γ is a partially ordered set. In other words, the generalized ultrametric distance is a set. Some areas of application of generalized ultrametrics will now be discussed.

In the theory of reasoning, a monotonic operator is rigorous application of a succession of conditionals (sometimes called consequence relations). However negation or multiple valued logic (i.e. encompassing intermediate truth and falsehood) require support for non-monotonic reasoning.

Thus [28]: “Once one introduces negation ... then certain of the important operators are not monotonic (and therefore not continuous), and in consequence the Knaster-Tarski theorem [i.e. for fixed points; see [13]] is no longer applicable to them. Various ways have been proposed to overcome this problem. One such [approach is to use] syntactic conditions on programs ... Another is to consider different operators ... The third main solution is to introduce techniques from topology and analysis to augment arguments based on order ... [the latter include:] methods based on metrics ... on quasi-metrics ... and finally ... on ultrametric spaces.”

The convergence to fixed points that are based on a generalized ultrametric system is precisely the study of spherically complete systems and expansive automorphisms discussed in section 3.3 below. As expansive automorphisms we see here again an example of symmetry at work.

A direct application of generalized ultrametrics to data mining is the following. The potentially huge advantage of the generalized ultrametric is that it allows a hierarchy to be read directly off the I × J input data, and bypasses the $O(n^2)$ consideration of all pairwise distances in agglomerative hierarchical clustering.
In [64] we study application to chemoinformatics. Proximity and best match finding is an essential operation in this field. Typically we have one million chemicals upwards, characterized by an approximate 1000-valued attribute encoding.

3 Hierarchy in a p-Adic Number System

A dendrogram is widely used in hierarchical, agglomerative clustering, and is induced from observed data. In this article, one of our important goals is to show how it lays bare many diverse symmetries in the observed phenomenon represented by the data. By expressing a dendrogram in p-adic terms, we open up a wide range of possibilities for seeing symmetries and attendant invariants.

3.1 p-Adic Encoding of a Dendrogram

We will introduce now the one-to-one mapping of clusters (including singletons) in a dendrogram H into a set of p-adically expressed integers (a fortiori, rationals, or $\mathbb{Q}_p$). The field of p-adic numbers is the most important example of ultrametric spaces. Addition and multiplication of p-adic integers, $\mathbb{Z}_p$ (cf. expression in subsection 1.5), are well-defined. Inverses exist and no zero-divisors exist.

A terminal-to-root traversal in a dendrogram or binary rooted tree is defined as follows. We use the path $x \subset q \subset q' \subset q'' \subset \ldots q_{n-1}$, where x is a given object specifying a given terminal, and $q, q', q'', \ldots$ are the embedded classes along this path, specifying nodes in the dendrogram. The root node is specified by the class $q_{n-1}$ comprising all objects.

A terminal-to-root traversal is the shortest path between the given terminal node and the root node, assuming we preclude repeated traversal (backtrack) of the same path between any two nodes.

By means of terminal-to-root traversals, we define the following p-adic encoding of terminal nodes, and hence objects, in Figure 5.
\[
\begin{aligned}
x_1 &: +1 \cdot p^1 + 1 \cdot p^2 + 1 \cdot p^5 + 1 \cdot p^7 \\
x_2 &: -1 \cdot p^1 + 1 \cdot p^2 + 1 \cdot p^5 + 1 \cdot p^7 \\
x_3 &: -1 \cdot p^2 + 1 \cdot p^5 + 1 \cdot p^7 \\
x_4 &: +1 \cdot p^3 + 1 \cdot p^4 - 1 \cdot p^5 + 1 \cdot p^7 \\
x_5 &: -1 \cdot p^3 + 1 \cdot p^4 - 1 \cdot p^5 + 1 \cdot p^7 \\
x_6 &: -1 \cdot p^4 - 1 \cdot p^5 + 1 \cdot p^7 \\
x_7 &: +1 \cdot p^6 - 1 \cdot p^7 \\
x_8 &: -1 \cdot p^6 - 1 \cdot p^7
\end{aligned}
\qquad (1)
\]

If we choose p = 2 the resulting decimal equivalents could be the same: cf. contributions based on $+1 \cdot p^1$ and $-1 \cdot p^1 + 1 \cdot p^2$. Given that the coefficients of the $p^j$ terms ($1 \le j \le 7$) are in the set $\{-1, 0, +1\}$ (implying for $x_1$ the additional terms: $+0 \cdot p^3 + 0 \cdot p^4 + 0 \cdot p^6$), the coding based on p = 3 is required to avoid ambiguity among decimal equivalents.

A few general remarks on this encoding follow. For the labeled ranked binary trees that we are considering, we require the labels +1 and −1 for the two branches at any node. Of course we could interchange these labels, and have these +1 and −1 labels reversed at any node. By doing so we will have different p-adic codes for the objects, $x_i$.

The following properties hold: (i) Unique encoding: the decimal codes for each $x_i$ (lexicographically ordered) are unique for $p \ge 3$; and (ii) Reversibility: the dendrogram can be uniquely reconstructed from any such set of unique codes.

[Figure 5 plot: dendrogram on terminals x1, ..., x8; vertical axis: level, from 0 to 7; each branch labeled +1 or −1.]

Figure 5: Labeled, ranked dendrogram on 8 terminal nodes, $x_1, x_2, \ldots, x_8$. Branches are labeled +1 and −1. Clusters are: $q_1 = \{x_1, x_2\}$, $q_2 = \{x_1, x_2, x_3\}$, $q_3 = \{x_4, x_5\}$, $q_4 = \{x_4, x_5, x_6\}$, $q_5 = \{x_1, x_2, x_3, x_4, x_5, x_6\}$, $q_6 = \{x_7, x_8\}$, $q_7 = \{x_1, x_2, \ldots, x_7, x_8\}$.

The p-adic encoding defined for any object set can be expressed as follows for any object x associated with a terminal node:
\[
x = \sum_{j=1}^{n-1} c_j p^j \quad \text{where } c_j \in \{-1, 0, +1\} \qquad (2)
\]

In greater detail we have:

\[
x_i = \sum_{j=1}^{n-1} c_{ij} p^j \quad \text{where } c_{ij} \in \{-1, 0, +1\} \qquad (3)
\]

Here j is the level or rank (root: n − 1; terminal: 1), and i is an object index.

In our example we have used: $c_j = +1$ for a left branch (in the sense of Figure 5), $= -1$ for a right branch, and $= 0$ when the node is not on the path from that particular terminal to the root.

A matrix form of this encoding is as follows, where $\{\cdot\}^t$ denotes the transpose of the vector.

Let $\mathbf{x}$ be the column vector $\{x_1\ x_2\ \ldots\ x_n\}^t$.

Let $\mathbf{p}$ be the column vector $\{p^1\ p^2\ \ldots\ p^{n-1}\}^t$.

Define a characteristic matrix C of the branching codes, +1 and −1, and an absent or non-existent branching given by 0, as a set of values $c_{ij}$ where $i \in I$, the indices of the object set; and $j \in \{1, 2, \ldots, n-1\}$, the indices of the dendrogram levels or nodes ordered increasingly. For Figure 5 we therefore have:

\[
C = \{c_{ij}\} =
\begin{pmatrix}
 1 &  1 &  0 &  0 &  1 &  0 &  1 \\
-1 &  1 &  0 &  0 &  1 &  0 &  1 \\
 0 & -1 &  0 &  0 &  1 &  0 &  1 \\
 0 &  0 &  1 &  1 & -1 &  0 &  1 \\
 0 &  0 & -1 &  1 & -1 &  0 &  1 \\
 0 &  0 &  0 & -1 & -1 &  0 &  1 \\
 0 &  0 &  0 &  0 &  0 &  1 & -1 \\
 0 &  0 &  0 &  0 &  0 & -1 & -1
\end{pmatrix}
\qquad (4)
\]

For given level j, $\forall i$, the absolute values $|c_{ij}|$ give the membership function either by node, j, which is therefore read off columnwise; or by object index, i, which is therefore read off rowwise.

The matrix form of the p-adic encoding used in equations (2) or (3) is:

\[
\mathbf{x} = C \mathbf{p} \qquad (5)
\]

Here, $\mathbf{x}$ is the decimal encoding, C is the matrix with dendrogram branching codes (cf. example shown in expression (4)), and $\mathbf{p}$ is the vector of powers of a fixed integer (usually, more restrictively, fixed prime) p.

The tree encoding exemplified in Figure 5, and defined with coefficients in equations (2) or (3), (4) or (5), with labels +1 and −1 was required (as opposed to the choice of 0 and 1, which might have been our first thought) to fully cater for the ranked nodes (i.e.
the total order, as opposed to a partial order, on the nodes).

We can consider the objects that we are dealing with to have equivalent integer values. To show that, all we must do is work out decimal equivalents of the p-adic expressions used above for $x_1, x_2, \ldots$. As noted in [26], we have equivalence between: a p-adic number; a p-adic expansion; and an element of $\mathbb{Z}_p$ (the p-adic integers). The coefficients used to specify a p-adic number, [26] notes (p. 69), “must be taken in a set of representatives of the class modulo p. The numbers between 0 and p − 1 are only the most obvious choice for these representatives. There are situations, however, where other choices are expedient.”

We note that the matrix C is used in [12]. A somewhat trivial view of how “hierarchical trees can be perfectly scaled in one dimension” (the title and theme of [12]) is that p-adic numbering is feasible, and hence a one dimensional representation of terminal nodes is easily arranged through expressing each p-adic number with a real number equivalent.

3.2 p-Adic Distance on a Dendrogram

We will now induce a metric topology on the p-adically encoded dendrogram, H. It leads to various symmetries relative to identical norms, for instance, or identical tree distances. For convenience, we will use a similarity which we can convert to a distance.

To find the p-adic similarity, we look for the term $p^r$ in the p-adic codes of the two objects, where r is the lowest level such that the coefficients of $p^r$ are equal in absolute value and non-zero in both codes, i.e. the lowest node lying on both terminal-to-root paths. Let us look at the set of p-adic codes for $x_1, x_2, \ldots$ above (Figure 5 and relations (1)), to give some examples of this.

For $x_1$ and $x_2$, we find the term we are looking for to be $p^1$, and so r = 1.

For $x_1$ and $x_5$, we find the term we are looking for to be $p^5$, and so r = 5.

For $x_5$ and $x_8$, we find the term we are looking for to be $p^7$, and so r = 7.
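The matrix form $\mathbf{x} = C\mathbf{p}$ of section 3.1 and the level r used in the three examples just given can be sketched as follows. The matrix is transcribed from expression (4); the function names `encode` and `merge_level`, and the reading of r as the lowest level at which both codes have a non-zero coefficient, are assumptions made for this illustration.

```python
# Characteristic matrix C of expression (4): rows x1..x8, columns levels 1..7.
C = [
    [ 1,  1,  0,  0,  1,  0,  1],
    [-1,  1,  0,  0,  1,  0,  1],
    [ 0, -1,  0,  0,  1,  0,  1],
    [ 0,  0,  1,  1, -1,  0,  1],
    [ 0,  0, -1,  1, -1,  0,  1],
    [ 0,  0,  0, -1, -1,  0,  1],
    [ 0,  0,  0,  0,  0,  1, -1],
    [ 0,  0,  0,  0,  0, -1, -1],
]


def encode(matrix, p):
    """Decimal equivalents x = C p, with p the vector (p^1, ..., p^(n-1))."""
    return [sum(c * p ** (j + 1) for j, c in enumerate(row)) for row in matrix]


def merge_level(matrix, i, k):
    """Lowest level r (1-based) whose node lies on the paths of both objects.

    Objects are given by 0-based row indices and are assumed to be distinct.
    """
    for level, (a, b) in enumerate(zip(matrix[i], matrix[k]), start=1):
        if a != 0 and b != 0:
            return level
    raise ValueError("the two codes share no node")


CODES_P3 = encode(C, 3)

# The ambiguity noted for p = 2: +1*p^1 and -1*p^1 + 1*p^2 coincide.
AMBIGUOUS_P2 = (1 * 2**1, -1 * 2**1 + 1 * 2**2)
SEPARATED_P3 = (1 * 3**1, -1 * 3**1 + 1 * 3**2)
```

With p = 3 the eight decimal codes in `CODES_P3` are pairwise distinct, and `merge_level(C, 0, 1)`, `merge_level(C, 0, 4)` and `merge_level(C, 4, 7)` give the levels of the three examples for the pairs ($x_1$, $x_2$), ($x_1$, $x_5$) and ($x_5$, $x_8$).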
Having found the value $r$, the similarity is defined as $p^{-r}$ [7, 26]. We take for a singleton object $r = 0$, and so a similarity $s$ has the property that $s(x, y) \leq 1$, $x \neq y$, and $s(x, x) = 1$. This leads naturally to an associated distance $d(x, y) = 1 - s(x, y)$, which is furthermore a 1-bounded ultrametric.

An alternative way of looking at the p-adic similarity (or distance) introduced, from the p-adic expansions listed in relations (2), is as follows. Consider the longest common sequence of coefficients using terms of the expansion from the start of the sequence. We will ensure that the start of the sequence corresponds to the root of the tree representation. Determine the $p^r$ term before which the values of the coefficients first differ. Then the similarity is defined as $p^{-r}$ and the distance as $1 - p^{-r}$.

This longest common prefix metric is also known as the Baire distance. In topology the Baire metric is defined on infinite strings [47]. It is more than just a distance: it is an ultrametric bounded from above by 1, and its infimum is 0, which is relevant for very long sequences, or in the limit for infinite-length sequences. The use of this Baire metric is pursued in [64] based on random projections [79], and providing computational benefits over the classical $O(n^2)$ hierarchical clustering based on all pairwise distances.

The longest common prefix metric leads directly to a p-adic hierarchical classification (cf. [8]). This is a special case of the “fast” hierarchical clustering discussed in section 2.6.2.

Compared to the longest common prefix metric, there are other closely related forms of metric, and simultaneously ultrametric. In [24], the metric is defined via the integer part of a real number. In [7], for integers $x, y$ we have: $d(x, y) = 2^{-\mathrm{order}_p(x - y)}$ where $p$ is prime, and $\mathrm{order}_p(i)$ is the exponent (non-negative integer) of $p$ in the prime decomposition of the integer $i$.
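As a small illustration of the two distances just described, the sketch below evaluates $d = 1 - p^{-r}$ for the levels $r = 1, 5, 7$ found in the examples, and implements the integer distance $2^{-\mathrm{order}_p(x - y)}$ of [7] together with a brute-force check of the ultrametric (strong triangle) inequality. The function names are hypothetical, and exact rational arithmetic is a convenience choice rather than anything prescribed by the text.

```python
from fractions import Fraction
from itertools import product


def tree_distance(r, p):
    """d = 1 - p^(-r), with r the level at which two objects first meet."""
    return 1 - Fraction(1, p ** r)


def order_p(i, p):
    """Exponent of p in the prime decomposition of the nonzero integer i."""
    if i == 0:
        raise ValueError("the order of 0 is infinite")
    i, k = abs(i), 0
    while i % p == 0:
        i //= p
        k += 1
    return k


def benzecri_distance(x, y, p):
    """d(x, y) = 2^(-order_p(x - y)), taken as 0 when x == y."""
    if x == y:
        return Fraction(0)
    return Fraction(1, 2 ** order_p(x - y, p))


def is_ultrametric(points, d):
    """Check d(x, z) <= max(d(x, y), d(y, z)) over all triples."""
    return all(d(x, z) <= max(d(x, y), d(y, z))
               for x, y, z in product(points, repeat=3))


print(tree_distance(1, 2), tree_distance(5, 2), tree_distance(7, 2))
print(is_ultrametric(range(-6, 7), lambda a, b: benzecri_distance(a, b, 3)))
```

With $p = 2$ the triple $x_1, x_5, x_8$ gives distances $31/32$, $127/128$ and $127/128$: the two largest are equal, which is the isosceles-with-small-base property of an ultrametric.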
Furthermore let $S(x)$ be a series: $S(x) = \sum_{i \in \mathbb{N}} a_i x^i$. ($\mathbb{N}$ are the natural numbers.) The order of $S$ is the rank of its first non-zero term: $\mathrm{order}(S) = \inf\{i : i \in \mathbb{N};\ a_i \neq 0\}$. (The series that is all zero is of order infinity.) Then the ultrametric distance between series is: $d(S, S') = 2^{-\mathrm{order}(S - S')}$.

3.3 Scale-Related Symmetry

Scale-related symmetry is very important in practice. In this subsection we introduce an operator that provides this symmetry. We also term it a dilation operator, because of its role in the wavelet transform on trees (see [61] for discussion and examples). This operator is p-adic multiplication by $1/p$.

Consider the set of objects $\{x_i \mid i \in I\}$ with its p-adic coding considered above. Take $p = 2$. (Non-uniqueness of corresponding decimal codes is not of concern to us now, and taking this value for $p$ is without any loss of generality.) Multiplication of $x_1 = +1 \cdot 2^1 + 1 \cdot 2^2 + 1 \cdot 2^5 + 1 \cdot 2^7$ by $1/p = 1/2$ gives: $+1 \cdot 2^1 + 1 \cdot 2^4 + 1 \cdot 2^6$. Each level has decreased by one, and the lowest level has been lost. Subject to the lowest level of the tree being lost, the form of the tree remains the same. By carrying out the multiplication-by-$1/p$ operation on all objects, it is seen that the effect is to rise in the hierarchy by one level.

Let us call product with $1/p$ the operator $A$. The effect of losing the bottom level of the dendrogram means that either (i) each cluster (possibly singleton) remains the same; or (ii) two clusters are merged. Therefore the application of $A$ to all $q$ implies a subset relationship between the set of clusters $\{q\}$ and the result of applying $A$, $\{Aq\}$.

Repeated application of the operator $A$ gives $Aq, A^2 q, A^3 q, \ldots$. Starting with any singleton, $i \in I$, this gives a path from the terminal to the root node in the tree.
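A minimal sketch of the operator $A$ follows, assuming each object is stored as its list of coefficients for levels $1, \ldots, n-1$ (the rows of the characteristic matrix for $x_1$ and $x_2$). The names `dilate` and `value` are illustrative, not the paper's.

```python
def dilate(coeffs):
    """Apply A: multiply sum_j c_j p^j by 1/p and drop the p^0 term.

    coeffs[j - 1] holds the coefficient of p^j for j = 1..n-1.
    """
    return coeffs[1:] + [0]


def value(coeffs, p):
    """Decimal equivalent of a coefficient list for a given p."""
    return sum(c * p ** j for j, c in enumerate(coeffs, start=1))


x1 = [1, 1, 0, 0, 1, 0, 1]    # +2^1 + 2^2 + 2^5 + 2^7
x2 = [-1, 1, 0, 0, 1, 0, 1]   # -2^1 + 2^2 + 2^5 + 2^7

# One application: the levels drop by one and the lowest level is lost.
print(value(x1, 2), value(dilate(x1), 2))

# x1 and x2 differ only at level 1, so A merges them into one cluster.
print(dilate(x1) == dilate(x2))

# Repeated application traces the path from the terminal up the tree
# until the all-zero code is reached.
path = [x1]
while any(path[-1]):
    path.append(dilate(path[-1]))
print(len(path) - 1)          # number of applications needed
```

Here one application maps $x_1$ to $2^1 + 2^4 + 2^6$, as in the worked multiplication above, and $n - 1 = 7$ applications exhaust the code.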
Each such path ends with the null element, which we define to be the p-adic encoding corresponding to the root node of the tree. Therefore the intersection of the paths equals the null element.

Benedetto and Benedetto [5, 6] discuss $A$ as an expansive automorphism of $I$, i.e. form-preserving, and locally expansive. Some implications [5] of the expansive automorphism follow. For any $q$, let us take $q, Aq, A^2 q, \ldots$ as a sequence of open subgroups of $I$, with $q \subset Aq \subset A^2 q \subset \ldots$, and $I = \bigcup \{q, Aq, A^2 q, \ldots\}$. This is termed an inductive sequence of $I$, and $I$ itself is the inductive limit ([69], p. 131).

Each path defined by application of the expansive automorphism defines a spherically complete system [71, 24, 70], which is a formalization of well-defined subset embeddedness. Such a methodological framework finds application in multi-valued and non-monotonic reasoning, as noted in section 2.6.2.

4 Tree Symmetries through the Wreath Product Group

In this section the wreath product group, used up to now in the literature as a framework for tree structuring of image or other signal data, is here used on a 2-way tree or dendrogram data structure. An example of wreath product invariance is provided by the wavelet transform of such a tree.

4.1 Wreath Product Group Corresponding to a Hierarchical Clustering

A dendrogram like that shown in Figure 5 is invariant as a representation or structuring of a data set relative to rotation (alternatively, here: permutation) of left and right child nodes. These rotation (or permutation) symmetries are defined by the wreath product group (see [21, 22, 19] for an introduction and applications in signal and image processing), and can be used with any m-ary tree, although we will treat the binary or 2-way case here.
For the group actions, with respect to which we will seek invariance, we consider independent cyclic shifts of the subnodes of a given node (hence, at each level). Equivalently these actions are adjacency preserving permutations of subnodes of a given node (i.e., for given $q$, with $q = q' \cup q''$, the permutations of $\{q', q''\}$). We have therefore cyclic group actions at each node, where the cyclic group is of order 2.

The symmetries of $H$ are given by structured permutations of the terminals. The terminals will be denoted here by $\mathrm{Term}\, H$. The full group of symmetries is summarized by the following generative algorithm:

1. For level $l = n - 1$ down to 1 do:
2. Selected node, $\nu \longleftarrow$ node at level $l$.
3. And permute subnodes of $\nu$.

Subnode $\nu$ is the root of subtree $H_\nu$. We denote $H_{n-1}$ simply by $H$. For a subnode $\nu'$ undergoing a relocation action in step 3, the internal structure of subtree $H_{\nu'}$ is not altered.

The algorithm described defines the automorphism group which is a wreath product of the symmetric group. Denote the permutation at level $\nu$ by $P_\nu$. Then the automorphism group is given by:

$$G = P_{n-1} \ \mathrm{wr}\ P_{n-2} \ \mathrm{wr}\ \ldots \ \mathrm{wr}\ P_2 \ \mathrm{wr}\ P_1$$

where wr denotes the wreath product.

4.2 Wreath Product Invariance

Call $\mathrm{Term}\, H_\nu$ the terminals that descend from the node at level $\nu$. So these are the terminals of the subtree $H_\nu$ with its root node at level $\nu$. We can alternatively call $\mathrm{Term}\, H_\nu$ the cluster associated with level $\nu$.

We will now look at shift invariance under the group action. This amounts to the requirement for a constant function defined on $\mathrm{Term}\, H_\nu$, $\forall \nu$. A convenient way to do this is to define such a function on the set $\mathrm{Term}\, H_\nu$ via the root node alone, $\nu$. By definition then we have a constant function on the set $\mathrm{Term}\, H_\nu$.

Let us call $V_\nu$ a space of functions that are constant on $\mathrm{Term}\, H_\nu$.
That is to say, the functions are constant in clusters that are defined by the subset of $n$ objects. Possibilities for $V_\nu$ that were considered in [61] are:

1. Basis vector with $|\mathrm{Term}\, H_{n-1}|$ components, with 0 values except for value 1 for component $i$.
2. Set (of cardinality $n = |\mathrm{Term}\, H_{n-1}|$) of $m$-dimensional observation vectors.

Consider the resolution scheme arising from moving from $\{\mathrm{Term}\, H_{\nu'}, \mathrm{Term}\, H_{\nu''}\}$ to $\mathrm{Term}\, H_\nu$. From the hierarchical clustering point of view it is clear what this represents: simply, an agglomeration of two clusters called $\mathrm{Term}\, H_{\nu'}$ and $\mathrm{Term}\, H_{\nu''}$, replacing them with a new cluster, $\mathrm{Term}\, H_\nu$.

Let the spaces of functions that are constant on subsets corresponding to the two cluster agglomerands be denoted $V_{\nu'}$ and $V_{\nu''}$. These two clusters are disjoint initially, which motivates us taking the two spaces as a couple: $(V_{\nu'}, V_{\nu''})$.

4.3 Example of Wreath Product Invariance: Haar Wavelet Transform of a Dendrogram

Let us exemplify a case that satisfies all that has been defined in the context of the wreath product invariance that we are targeting. It is the algorithm discussed in depth in [61].

Take the constant function from $V_{\nu'}$ to be $f_{\nu'}$. Take the constant function from $V_{\nu''}$ to be $f_{\nu''}$. Then define the constant function, the scaling function, in $V_\nu$ to be $(f_{\nu'} + f_{\nu''})/2$. Next define the zero mean function, $(w_{\nu'} + w_{\nu''})/2 = 0$, the wavelet function, as follows:

$w_{\nu'} = (f_{\nu'} + f_{\nu''})/2 - f_{\nu'}$ in the support interval of $V_{\nu'}$, i.e. $\mathrm{Term}\, H_{\nu'}$, and

$w_{\nu''} = (f_{\nu'} + f_{\nu''})/2 - f_{\nu''}$ in the support interval of $V_{\nu''}$, i.e. $\mathrm{Term}\, H_{\nu''}$.

Since $w_{\nu'} = -w_{\nu''}$ we have the zero mean requirement.

We now illustrate the Haar wavelet transform of a dendrogram with a case study. The discrete wavelet transform is a decomposition of data into spatial and frequency components.
In terms of a dendrogram these components are with respect to, respectively, within and between clusters of successive partitions. We show how this works taking the data of Table 3.

The hierarchy built on the 8 observations of Table 3 is shown in Figure 6. Here we note the associations of irises 1 through 8 as, respectively: $x_1, x_3, x_4, x_6, x_8, x_2, x_5, x_7$. Something more is shown in Figure 6, namely the detail signals (denoted $\pm d$) and overall smooth (denoted $s$), which are determined in carrying out the wavelet transform, the so-called forward transform.

The inverse transform is then determined from Figure 6 in the following way. Consider the observation vector $x_2$. Then this vector is reconstructed exactly by reading the tree from the root: $s_7 + d_7 = x_2$. Similarly a path from root to terminal is used to reconstruct any other observation. If $x_2$ is a vector of dimensionality $m$, then so also are $s_7$ and $d_7$, as well as all other detail signals.

     Sepal.L   Sepal.W   Petal.L   Petal.W
1    5.1       3.5       1.4       0.2
2    4.9       3.0       1.4       0.2
3    4.7       3.2       1.3       0.2
4    4.6       3.1       1.5       0.2
5    5.0       3.6       1.4       0.2
6    5.4       3.9       1.7       0.4
7    4.6       3.4       1.4       0.3
8    5.0       3.4       1.5       0.2

Table 3: First 8 observations of Fisher's iris data. L and W refer to length and width.

          s7         d7         d6         d5        d4       d3      d2       d1
Sepal.L   5.146875   0.253125    0.13125    0.1375   -0.025    0.05   -0.025    0.05
Sepal.W   3.603125   0.296875    0.16875   -0.1375    0.125    0.05   -0.075   -0.05
Petal.L   1.562500   0.137500    0.02500    0.0000    0.000   -0.10    0.050    0.00
Petal.W   0.306250   0.093750   -0.01250   -0.0250    0.050    0.00    0.000    0.00

Table 4: The hierarchical Haar wavelet transform resulting from use of the first 8 observations of Fisher's iris data shown in Table 3. Wavelet coefficient levels are denoted d1 through d7, and the continuum or smooth component is denoted s7.
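The forward and inverse transforms can be sketched in a few lines. An important caveat: the merge list below is not read from Figure 6 but was inferred by working backwards from the values in Table 4, so it should be treated as an assumption, as should the sign convention (the first child of each pair carries $+d$, the second $-d$) and the hypothetical node names `n1` to `n7` and observation names `i1` to `i8`.

```python
# First 8 observations of Fisher's iris data (Table 3):
# Sepal.L, Sepal.W, Petal.L, Petal.W.
IRIS = {
    "i1": [5.1, 3.5, 1.4, 0.2],
    "i2": [4.9, 3.0, 1.4, 0.2],
    "i3": [4.7, 3.2, 1.3, 0.2],
    "i4": [4.6, 3.1, 1.5, 0.2],
    "i5": [5.0, 3.6, 1.4, 0.2],
    "i6": [5.4, 3.9, 1.7, 0.4],
    "i7": [4.6, 3.4, 1.4, 0.3],
    "i8": [5.0, 3.4, 1.5, 0.2],
}

# ASSUMPTION: agglomerations at levels 1..7, inferred from Table 4.
# In each pair the first child carries +d and the second carries -d.
MERGES = [("i1", "i5"), ("i8", "n1"), ("i3", "i4"), ("i7", "n3"),
          ("i2", "n4"), ("n2", "n5"), ("i6", "n6")]


def haar_forward(data, merges):
    """Return (root smooth, {level: detail vector})."""
    smooth = dict(data)
    detail = {}
    for level, (a, b) in enumerate(merges, start=1):
        # Scaling function: unweighted mean of the two child smooths.
        s = [(u + v) / 2 for u, v in zip(smooth[a], smooth[b])]
        # Detail: first child minus the new smooth.
        detail[level] = [u - m for u, m in zip(smooth[a], s)]
        smooth["n%d" % level] = s
    return smooth["n%d" % len(merges)], detail


def haar_inverse(terminal, root_smooth, detail, merges):
    """Rebuild one observation from the root smooth and the details
    met on the path between that terminal and the root."""
    parent = {}
    for level, (a, b) in enumerate(merges, start=1):
        parent[a] = (level, +1)
        parent[b] = (level, -1)
    value = list(root_smooth)
    node = terminal
    while node in parent:
        level, sign = parent[node]
        value = [v + sign * d for v, d in zip(value, detail[level])]
        node = "n%d" % level
    return value


s7, d = haar_forward(IRIS, MERGES)
x2 = haar_inverse("i6", s7, d, MERGES)   # x2 is iris 6 in the text's labeling
print(s7, d[7], x2)
```

Under these assumptions the computed `s7` and `d[1]` to `d[7]` agree with the columns of Table 4, and every observation is recovered exactly (up to floating-point rounding) by the inverse, for example $s_7 + d_7 = x_2$.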
Figure 6: Dendrogram on 8 terminal nodes constructed from first 8 values of Fisher iris data. (Median agglomerative method used in this case.) Detail or wavelet coefficients are denoted by $d$, and data smooths are denoted by $s$. The observation vectors are denoted by $x$ and are associated with the terminal nodes. Each signal smooth, $s$, is a vector. The (positive or negative) detail signals, $d$, are also vectors. All these vectors are of the same dimensionality.

This procedure is the same as the Haar wavelet transform, only applied to the dendrogram and using the input data.

This wavelet transform for the data in Table 3, based on the “key” or intermediary hierarchy of Figure 6, is shown in Table 4.

Wavelet regression entails setting small and hence unimportant detail coefficients to 0 before applying the inverse wavelet transform. More discussion can be found in [61].

Early work on p-adic and ultrametric wavelets can be found in Kozyrev [42, 43]. Recent applications of wavelets to general graphs are in [65, 33].

5 Tree and Data Stream Symmetries from Permutation Groups

In this section we show how data streams, firstly, and hierarchies, secondly, can be represented as permutations. There are restrictions on permitted permutations. Furthermore, sets of data streams, or trees, when expressed as permutations constitute particular permutation groups.

5.1 Permutation Representation of a Data Stream

In symbolic dynamics, we seek to extract symmetries in the data based on topology alone, before considering metric properties. For example, instead of listing a sequence of iterates, $\{x_i\}$, we may symbolically encode the sequence in terms of up or down, or north, south, east and west moves.
This provides a sequence of symbols, and their patterns in a phase space, where the interest of the data analyst lies in a partition of the phase space. Patterns or templates are sought in this topology. Sequence analysis is tantamount to a sort of topological time series analysis.

Thus, in symbolic dynamics, the data values in a stream or sequence are replaced by symbols to facilitate pattern-finding, in the first instance, through topology of the symbol sequence. This can be very helpful for analysis of a range of dynamical systems, including chaotic, stochastic, and deterministic-regular time series. Through measure-theoretic or Kolmogorov-Sinai entropy of the dynamical system, it can be shown that the maximum entropy conditional on past values is consistent with the requirement that the symbol sequence retains as much of the original data information as possible. Alternative approaches to quantifying complexity of the data, expressing the dynamical system, are through Lyapunov exponents and fractal dimensions, and there are close relationships between all of these approaches [45].

From the viewpoint of practical and real-world data analysis, however, many problems and open issues remain. Firstly, noise in the data stream means that reproducibility of results can break down [2]. Secondly, the symbol sequence, and derived partitions that are the basis for the study of the symbolic dynamic topology, are not easy to determine. Hence [2] enunciate a pragmatic principle, whereby the symbol sequence should come as naturally as possible from the data, with as little as possible by way of further model assumptions. Their approach is to define the symbol sequence through (i) comparison of neighboring data values, and (ii) up-down or down-up movements in the data stream.

Taking into account all up-down and down-up movements in a signal allows a permutation representation.
Examples of such symbol sequences from [2] follow. They consider the data stream $(x_1, x_2, \ldots, x_7) = (4, 7, 9, 10, 6, 11, 3)$. Take the order as 3, i.e. consider the up-down and down-up properties of successive triplets: $(4, 7, 9) \longrightarrow 012$; $(7, 9, 10) \longrightarrow 012$; $(9, 10, 6) \longrightarrow 201$; $(10, 6, 11) \longrightarrow 102$; $(6, 11, 3) \longrightarrow 201$. (For $(10, 6, 11)$, for instance, we have $x_{t+1} < x_t < x_{t+2}$, yielding the symbolic sequence 102.) In addition to the order, here 3, we may also consider the delay, here 1. In general, for delay $\tau$, the neighborhood consists of data values indexed by $t, t - \tau, t - 2\tau, t - 3\tau, \ldots, t - d\tau$ where $d$ is the order. Thus, in the example used here, we have the symbolic representation 012012201102201. The symbol sequence (or “itinerary”) defines a partition, that is, a separation of phase space into disjoint regions (here, with three equivalence classes, 012, 201, and 102), which facilitates finding an “organizing template” or set of topological relationships [82]. The problem is described in [35] as one of studying the qualitative behavior of the dynamical system, through use of a “very coarse-grained” description, that divides the state space (or phase space) into a small number of regions, and codes each by a different symbol.

Different encodings are feasible and [38, 37] use the following. Again consider the data stream $(x_1, x_2, \ldots, x_7) = (4, 7, 9, 10, 6, 11, 3)$. Now given a delay, $\tau = 1$, we can represent the above by $(x_{6\tau}, x_{5\tau}, x_{4\tau}, x_{3\tau}, x_{2\tau}, x_{\tau}, x_0)$. Now look at rank order and note that: $x_{\tau} > x_{3\tau} > x_{4\tau} > x_{5\tau} > x_{2\tau} > x_{6\tau} > x_0$. We read off the final permutation representation as (1345260). There are many ways of defining such a permutation, none of them best, as [38] acknowledge. We see too that our $m$-valued input stream is a point in $\mathbb{R}^m$, and our output is a permutation $\pi \in S_m$, i.e. a member of the permutation group.
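Both encodings of the example stream can be reproduced with the following sketch. The function names are illustrative; the first function uses the convention of the worked triplets (each window is replaced by the within-window positions listed from smallest to largest value), and the second handles only the delay $\tau = 1$ case used in the text. Ties between values are not treated specially.

```python
def ordinal_patterns(stream, order=3, delay=1):
    """Symbol for each window: positions 0..order-1 sorted by value."""
    patterns = []
    span = (order - 1) * delay
    for t in range(len(stream) - span):
        window = [stream[t + k * delay] for k in range(order)]
        ranks = sorted(range(order), key=lambda k: window[k])
        patterns.append("".join(str(k) for k in ranks))
    return patterns


def lag_rank_permutation(stream):
    """Lags k (x_0 is the most recent value, delay 1) listed from the
    largest value x_k down to the smallest."""
    m = len(stream)
    return sorted(range(m), key=lambda k: stream[m - 1 - k], reverse=True)


stream = [4, 7, 9, 10, 6, 11, 3]
print(ordinal_patterns(stream))       # one symbol per successive triplet
print(lag_rank_permutation(stream))   # the permutation read off rank order
```

Concatenating the five triplet symbols gives the itinerary 012012201102201, and the second function returns the lags in the order 1, 3, 4, 5, 2, 6, 0.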
Keller and Sinn [38] explore invariance properties of the permutations expressing the ordinal, symbolic coding. Resolution scale is introduced through the delay, $\tau$. (An alternative approach to incorporating resolution scale is used in [11], where consecutive, sliding-window based, binned or averaged versions of the time series are used. This is not entirely satisfactory: it is not robust and is very dependent on data properties such as dynamic range.) Application is to EEG (univariate) signals (with some discussion of magnetic resonance imaging data) [36]. Statistical properties of the ordinal transformed data are studied in [3], in particular through the $S_3$ symmetry group. We have noted the symbolic dynamics motivation for this work; in [1] and other work, motivation is provided in terms of rank order time series analysis, in turn motivated by the need for robustness in time series data analysis.

5.2 Permutation Representation of a Hierarchy

There is an isomorphism between the class of hierarchic structures, termed unlabeled, ranked, binary, rooted trees, and the class of permutations used in symbolic dynamics. Each non-terminal node in the tree shown in Figure 7 has two child nodes. This is a dendrogram, representing a set of $n - 1$ agglomerations based on $n$ initial data vectors.

Figure 7: Hierarchical clustering of 8 terms. Data on which this was based: frequencies of occurrence of the 8 nouns in 24 successive, non-overlapping segments of Aristotle's Categories. [Dendrogram; terminal labels: name, definition, position, existence, object, motion, fact, disposition.]

Figure 8: Dendrogram on 8 terms, isomorphic to the previous figure, Figure 7, but now with successively later agglomerations always represented by right child node. Apart from the labels of the initial pairwise agglomerations, this is otherwise a unique representation of the dendrogram (hence: “existence” and “object” can be interchanged; so can “disposition” and “fact”; and finally “name” and “definition”). In the discussion we refer to this representation, with later agglomerations always parked to the right, as our canonical representation of the dendrogram. [Dendrogram; terminal labels: existence, object, position, disposition, fact, motion, name, definition.]

Figure 9: Dendrogram on 8 terms, as previous figure, Figure 8, with non-terminal nodes numbered in sequence (1 to 7). These will form the nodes of the oriented binary tree. We may consider one further node for completeness, 8 or $\infty$, located at an arbitrary location in the upper right.

Figure 10: Oriented binary tree is superimposed on the dendrogram. The node at the arbitrary upper right location is not shown. The oriented binary tree defines an inorder or depth-first tree traversal.

Figure 7 shows a hierarchical clustering. Figure 8 shows a unique representation of the tree, termed a dendrogram, subject only to terminals being permutable in position relative to the first non-terminal cluster node.

A packed representation [73] or permutation representation of a dendrogram is derived as follows. Put a lower ranked subtree always to the left; and read off the oriented binary tree on non-terminal nodes. Then for any terminal node indexed by $i$, with the exception of the rightmost which will always be $n$, define $p(i)$ as the rank at which the terminal node is first united with some terminal node to its right. For the dendrogram shown in Figure 10 (or Figures 8 or 9), the packed representation is: (13625748). This is also an inorder traversal of the oriented binary tree (seen in Figure 10). The packed representation is a uniquely defined permutation of $1 \ldots n$.
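The derivation of a packed representation can be sketched as an inorder traversal of the ranks of the non-terminal nodes. The tree used below is a hypothetical reconstruction: its shape and ranks were chosen to be consistent with the terminal order of Figure 8 and with the packed representation quoted above, and were not read off the figure itself. The tuple layout `(rank, left, right)` and the function name are likewise assumptions made for illustration.

```python
# ASSUMPTION: dendrogram as nested tuples (rank, left, right), with
# terminals as strings; reconstructed, not taken from the figure.
n1 = (1, "existence", "object")
n2 = (2, "disposition", "fact")
n3 = (3, n1, "position")
n4 = (4, "name", "definition")
n5 = (5, n2, "motion")
n6 = (6, n3, n5)
n7 = (7, n6, n4)


def packed_representation(tree):
    """Ranks of non-terminal nodes in inorder, followed by n.

    Entry i is the rank at which terminal i is first united with a
    terminal to its right; the rightmost terminal receives n.
    """
    def inorder(node):
        if isinstance(node, str):          # terminal: contributes no rank
            return []
        rank, left, right = node
        return inorder(left) + [rank] + inorder(right)

    ranks = inorder(tree)
    return ranks + [len(ranks) + 1]


def terminals(node):
    """Terminal labels, left to right."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return terminals(left) + terminals(right)


print(terminals(n7))
print(packed_representation(n7))
```

For this reconstructed tree the result is the permutation 1, 3, 6, 2, 5, 7, 4, 8 of $1 \ldots 8$.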
Dendrograms (on $n$ terminals) of the sort shown in Figures 7–10 are labeled (see terminal node labels, “existence”, “object”, etc.) and ranked (ranks indicated in Figure 9). Consider when tree structure alone is of interest and we ignore the labels. Such dendrograms, called non-labeled, ranked (NL-R) in [56], are particularly interesting. They are isomorphic to either down-up permutations, or up-down permutations (both on $n - 1$ elements). For the combinatorial properties of these permutations, and NL-R dendrograms, see the combinatorial sequence encyclopedia entry, A000111, at [75]. We see therefore how we are dealing with the group of up-down or down-up permutations.

6 Remarkable Symmetries in Very High Dimensional Spaces

In the work of [67, 68] it was shown how, as ambient dimensionality increased, distances became more and more ultrametric. That is to say, a hierarchical embedding becomes more and more immediate and direct as dimensionality increases. A better way of quantifying this phenomenon was developed in [59]. What this means is that there is inherent hierarchical structure in high dimensional data spaces.

It was shown experimentally in [67, 68, 59] how points in high dimensional spaces become increasingly equidistant with increase in dimensionality. Both [27] and [16] study Gaussian clouds in very high dimensions. The latter finds that “not only are the points [of a Gaussian cloud in very high dimensional space] on the convex hull, but all reasonable-sized subsets span faces of the convex hull. This is wildly different than the behavior that would be expected by traditional low-dimensional thinking”.

That very simple structures come about in very high dimensions is not as trivial as it might appear at first sight. Firstly, even very simple structures (hence with many symmetries) can be used to support fast and perhaps even constant time worst case proximity search [59].
Secondly, as shown in the machine learning framework by [27], there are important implications ensuing from the simple high dimensional structures. Thirdly, [62] shows that very high dimensional clustered data contain symmetries that in fact can be exploited to “read off” the clusters in a computationally efficient way. Fourthly, following [14], what we might want to look for in contexts of considerable symmetry are the “impurities” or small irregularities that detract from the overall dominant picture.

7 Conclusions

“My thesis has been that one path to the construction of a nontrivial theory of complex systems is by way of a theory of hierarchy.” ([74], p. 216.) Or again: “Human thinking (as well as many other information processes) is fundamentally a hierarchical process. ... In our information modeling the main distinguishing feature of p-adic numbers is the treelike hierarchical structure. ... [the work] is devoted to classical and quantum models of flows of hierarchically ordered information.” ([39], pp. xiii, xv.)

We have noted symmetry in many guises in the representations used, in the transformations applied, and in the transformed outputs. These symmetries are non-trivial too, in a way that would not be the case were we simply to look at classes of a partition and claim that cluster members were mutually similar in some way. We have seen how the p-adic or ultrametric framework provides significant focus and commonality of viewpoint.

In seeking (in a general way) and in determining (in a focused way) structure and regularity in data, we see that, in line with the insights and achievements of Klein, Weyl and Wigner, in data mining and data analysis we seek and determine symmetries in the data that express observed and measured reality.
A very fundamental principle in much of statistics, signal processing and data analysis is that of sparsity but, as [4] show, by “codifying the inter-dependency structure” in the data new perspectives are opened up above and beyond sparsity.

References

[1] C. Bandt. Ordinal time series analysis. Ecological Modelling, 182:229–238, 2005.
[2] C. Bandt and B. Pompe. Permutation entropy: a natural complexity measure for time series. Physical Review Letters, 88:174102(4), 2002.
[3] C. Bandt and F. Shiha. Order patterns in time series. Technical report, 2005. Preprint 3/2005, Institute of Mathematics, Greifswald, www.math-inf.uni-greifswald.de/~bandt/pub.html.
[4] R.G. Baraniuk, V. Cevher, M.F. Duarte, and C. Hegde. Model-based compressive sensing. 2008. h
[5] J.J. Benedetto and R.L. Benedetto. A wavelet theory for local fields and related groups. The Journal of Geometric Analysis, 14:423–456, 2004.
[6] R.L. Benedetto. Examples of wavelets for local fields. In D. Larson, C. Heil, P. Jorgensen, editors, Wavelets, Frames, and Operator Theory, Contemporary Mathematics Vol. 345, pages 27–47. 2004.
[7] J.-P. Benzécri. La Taxinomie. Dunod, Paris, 2nd edition, 1979.
[8] P.E. Bradley. Mumford dendrograms. Computer Journal, 2008. Forthcoming, Advance Access online doi:10.1093/comjnl/bxm088.
[9] L. Brekke and P.G.O. Freund. p-Adic numbers in physics. Physics Reports, 233:1–66, 1993.
[10] P. Chakraborty. Looking through newly to the amazing irrationals. Technical report, 2005. arXiv: math.HO/0502049v1.
[11] M. Costa, A.L. Goldberger, and C.-K. Peng. Multiscale entropy analysis of biological signals. Physical Review E, 71:021906(18), 2005.
[12] F. Critchley and W. Heiser. Hierarchical trees can be perfectly scaled in one dimension. Journal of Classification, 5:5–20, 1988.
[13] B.A. Davey and H.A. Priestley. Introduction to Lattices and Order. Cambridge University Press, 2nd edition, 2002.
[14] F. Delon. Espaces ultramétriques. Journal of Symbolic Logic, 49:405–502, 1984.
[15] S.B. Deutsch and J.J. Martin. An ordering algorithm for analysis of data arrays. Operations Research, 19:1350–1362, 1971.
[16] D.L. Donoho and J. Tanner. Neighborliness of randomly-projected simplices in high dimensions. Proceedings of the National Academy of Sciences, 102:9452–9457, 2005.
[17] B. Dragovich and A. Dragovich. p-Adic modelling of the genome and the genetic code. Computer Journal, 2007. Forthcoming, Advance Access doi:10.1093/comjnl/bxm083.
[18] R.A. Fisher. The use of multiple measurements in taxonomic problems. The Annals of Eugenics, pages 179–188, 1936.
[19] R. Foote. An algebraic approach to multiresolution analysis. Transactions of the American Mathematical Society, 357:5031–5050, 2005.
[20] R. Foote. Mathematics and complex systems. Science, 318:410–412, 2007.
[21] R. Foote, G. Mirchandani, D. Rockmore, D. Healy, and T. Olson. A wreath product group approach to signal and image processing: Part I – multiresolution analysis. IEEE Transactions on Signal Processing, 48:102–132, 2000.
[22] R. Foote, G. Mirchandani, D. Rockmore, D. Healy, and T. Olson. A wreath product group approach to signal and image processing: Part II – convolution, correlations and applications. IEEE Transactions on Signal Processing, 48:749–767, 2000.
[23] P.G.O. Freund. p-Adic strings and their applications. In Z. Rakic, B. Dragovich, A. Khrennikov and I. Volovich, editors, Proc. 2nd International Conference on p-Adic Mathematical Physics, pages 65–73. American Institute of Physics, 2006.
[24] L. Gajić. On ultrametric space. Novi Sad Journal of Mathematics, 31:69–71, 2001.
[25] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. Springer, 1999. Formale Begriffsanalyse. Mathematische Grundlagen, Springer, 1996.
[26] F.Q. Gouvêa. p-Adic Numbers: An Introduction. Springer, 2003.
[27] P. Hall, J.S. Marron, and A. Neeman. Geometric representation of high dimensional, low sample size data. Journal of the Royal Statistical Society B, 67:427–444, 2005.
[28] P. Hitzler and A.K. Seda. The fixed-point theorems of Priess-Crampe and Ribenboim in logic programming. Fields Institute Communications, 32:219–235, 2002.
[29] A.K. Jain and R.C. Dubes. Algorithms For Clustering Data. Prentice-Hall, 1988.
[30] A.K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: a review. ACM Computing Surveys, 31:264–323, 1999.
[31] M.F. Janowitz. An order theoretic model for cluster analysis. SIAM Journal on Applied Mathematics, 34:55–72, 1978.
[32] M.F. Janowitz. Cluster analysis based on abstract posets. Technical report, 2005–2006. http://dimax.rutgers.edu/~melj.
[33] M. Jansen, G.P. Nason, and B.W. Silverman. Multiscale methods for data on graphs and irregular multidimensional situations. Journal of the Royal Statistical Society B, 71:97–126, 2009.
[34] S.C. Johnson. Hierarchical clustering schemes. Psychometrika, 32:241–254, 1967.
[35] K. Keller and H. Lauffer. Symbolic analysis of high-dimensional time series. International Journal of Bifurcation and Chaos, 13:2657–2668, 2003.
[36] K. Keller, H. Lauffer, and M. Sinn. Ordinal analysis of EEG time series. Chaos and Complexity Letters, 2:247–258, 2007.
[37] K. Keller and M. Sinn. Ordinal analysis of time series. Physica A, 356:114–120, 2005.
[38] K. Keller and M. Sinn. Ordinal symbolic dynamics. 2005. Technical Report A-05-14, www.math.mu-luebeck.de/publikationen/pub2005.shtml.
[39] A.Yu. Khrennikov. Information Dynamics in Cognitive, Psychological, Social and Anomalous Phenomena. Kluwer, 2004.
[40] A.Yu. Khrennikov. Gene expression from polynomial dynamics in the 2-adic information space. Technical report, 2006. arXiv:q-bio/06110682v2.
[41] F. Klein. A comparative review of recent researches in geometry. Bull. New York Math. Soc., 2:215–249, 1892–1893. Vergleichende Betrachtungen über neuere geometrische Forschungen, 1872, translated by M.W. Haskell.
[42] S.V. Kozyrev. Wavelet theory as p-adic spectral analysis. Izvestiya: Mathematics, 66:367–376, 2002.
[43] S.V. Kozyrev. Wavelets and spectral analysis of ultrametric pseudodifferential operators. Sbornik: Mathematics, 198:97–116, 2007.
[44] M. Krasner. Nombres semi-réels et espaces ultramétriques. Comptes-Rendus de l'Académie des Sciences, Tome II, 219:433, 1944.
[45] V. Latora and M. Baranger. Kolmogorov-Sinai entropy rate versus physical entropy. Physical Review Letters, 82:520, 1999.
[46] I.C. Lerman. Classification et Analyse Ordinale des Données. Dunod, Paris, 1981.
[47] A. Levy. Basic Set Theory. Dover, Mineola, NY, 2002. (Springer, 1979).
[48] S.C. Madeira and A.L. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1:24–45, 2004.
[49] S.T. March. Techniques for structuring database records. Computing Surveys, 15:45–79, 1983.
[50] W.T. McCormick, P.J. Schweitzer, and T.J. White. Problem decomposition and data reorganization by a clustering technique. Operations Research, 20:993–1009, 1982.
[51] I. Van Mechelen, H.-H. Bock, and P. De Boeck. Two-mode clustering methods: a structured overview. Statistical Methods in Medical Research, 13:363–394, 2004.
[52] B. Mirkin. Mathematical Classification and Clustering. Kluwer, 1996.
[53] B. Mirkin. Clustering for Data Mining. Chapman and Hall/CRC, Boca Raton, FL, 2005.
[54] F. Murtagh. A survey of recent advances in hierarchical clustering algorithms. Computer Journal, 26:354–359, 1983.
[55] F. Murtagh. Complexities of hierarchic clustering algorithms: state of the art. Computational Statistics Quarterly, 1:101–113, 1984.
[56] F. Murtagh. Counting dendrograms: a survey. Discrete Applied Mathematics, 7:191–199, 1984.
[57] F. Murtagh. Multidimensional Clustering Algorithms. Physica-Verlag, Heidelberg and Vienna, 1985.
[58] F. Murtagh. Comments on: Parallel algorithms for hierarchical clustering and cluster validity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14:1056–1057, 1992.
[59] F. Murtagh. On ultrametricity, data coding, and computation. Journal of Classification, 21:167–184, 2004.
[60] F. Murtagh. Identifying the ultrametricity of time series. European Physical Journal B, 43:573–579, 2005.
[61] F. Murtagh. The Haar wavelet transform of a dendrogram. Journal of Classification, 24:3–32, 2007.
[62] F. Murtagh. The remarkable simplicity of very high dimensional data: application to model-based clustering. Journal of Classification, 2007. Submitted.
[63] F. Murtagh. The correspondence analysis platform for uncovering deep structure in data and information (sixth Annual Boole Lecture). Computer Journal, 2008. Forthcoming, Advance Access doi:10.1093/comjnl/bxn045.
[64] F. Murtagh, G. Downs, and P. Contreras. Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding. SIAM Journal on Scientific Computing, 30:707–730, 2008.
[65] F. Murtagh, J.-L. Starck, and M. Berry. Overcoming the curse of dimensionality in clustering by means of the wavelet transform. Computer Journal, 43:107–120, 2000.
[66] A. Ostrowski. Über einige Lösungen der Funktionalgleichung φ(x) · φ(y) = φ(xy). Acta Mathematica, 41:271–284, 1918.
[67] R. Rammal, J.C. Angles d'Auriac, and B. Doucot. On the degree of ultrametricity. Le Journal de Physique – Lettres, 46:L–945–L–952, 1985.
[68] R. Rammal, G. Toulouse, and M.A. Virasoro. Ultrametricity for physicists. Reviews of Modern Physics, 58:765–788, 1986.
[69] H. Reiter and J.D. Stegeman. Classical Harmonic Analysis and Locally Compact Groups. Oxford University Press, Oxford, 2nd edition, 2000.
[70] A.C.M. Van Rooij. Non-Archimedean Functional Analysis. Marcel Dekker, 1978.
[71] W.H. Schikhof. Ultrametric Calculus. Cambridge University Press, Cambridge, 1984. (Chapters 18, 19, 20, 21).
[72] A.K. Seda and P. Hitzler. Generalized distance functions in the theory of computation. Computer Journal, 2008. Forthcoming, Advance Access doi:10.1093/comjnl/bxm108.
[73] R. Sibson. Slink: an optimally efficient algorithm for the single-link cluster method. Computer Journal, 16:30–34, 1980.
[74] H.A. Simon. The Sciences of the Artificial. MIT Press, Cambridge, MA, 1996.
[75] N.J.A. Sloane. OEIS – On-Line Encyclopedia of Integer Sequences. Technical report, 2006. http://www.research.att.com/~njas/sequences/Seis.html, Sequence A000111: http://www.research.att.com/~njas/sequences/A000111.
[76] D. Steinley. K-means clustering: a half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59:1–3, 2006.
[77] D. Steinley and M.J. Brusco. Initializing K-means batch clustering: a critical evaluation of several techniques. Journal of Classification, 24:99–121, 2007.
[78] Wu-Ki Tung. Group Theory in Physics. World Scientific, 1985.
[79] S.S. Vempala. The Random Projection Method. American Mathematical Society, 2004. Vol. 65, DIMACS Series in Discrete Mathematics and Theoretical Computer Science.
[80] I.V. Volovich. Number theory as the ultimate physical theory. Technical report, 1987. Preprint No. TH 4781/87, CERN, Geneva.
[81] I.V. Volovich. p-Adic string. Classical Quantum Gravity, 4:L83–L87, 1987.
[82] W. Weckesser. Symbolic dynamics in mathematics, physics, and engineering, based on a talk by N. Tufillaro. Technical report, 1997. http://www.ima.umn.edu/~weck/nbt/nbt.ps.
[83] H. Weyl. Symmetry. Princeton University Press, 1983.
[84] Rui Xu and D. W unsch. Survey of clustering algorithms. IEEE T r ansac- tions on Neur al Networks , 16:645–678, 2005. 35
