Data Smashing


Authors: Ishanu Chattopadhyay, Hod Lipson

Ishanu Chattopadhyay (ic99@cornell.edu), Hod Lipson (hod.lipson@cornell.edu)

Abstract—Investigation of the underlying physics or biology from empirical data requires a quantifiable notion of similarity: when do two observed data sets indicate nearly identical generating processes, and when do they not? The discriminating characteristics to look for in data are often determined by heuristics designed by experts, e.g., distinct shapes of "folded" lightcurves may be used as "features" to classify variable stars, while determination of pathological brain states might require a Fourier analysis of brainwave activity. Finding good features is non-trivial. Here, we propose a universal solution to this problem: we delineate a principle for quantifying similarity between sources of arbitrary data streams, without a priori knowledge, features, or training. We uncover an algebraic structure on a space of symbolic models for quantized data, and show that such stochastic generators may be added and uniquely inverted, and that a model and its inverse always sum to the generator of flat white noise. Therefore, every data stream has an anti-stream: data generated by the inverse model. Similarity between two streams, then, is the degree to which one, when summed to the other's anti-stream, mutually annihilates all statistical structure to noise. We call this data smashing. We present diverse applications, including disambiguation of brainwaves pertaining to epileptic seizures, detection of anomalous cardiac rhythms, and classification of astronomical objects from raw photometry. In our examples, the data smashing principle, without access to any domain knowledge, meets or exceeds the performance of specialized algorithms tuned by domain experts.

Index Terms—feature-free classification, universal metric, probabilistic automata

I. Motivation & Contribution

The term "data smashing" might conjure up images of erasing information or destroying hard drives. But just as smashing atoms can reveal their composition, "colliding" quantitative data streams can reveal their hidden structure. We describe here a new principle, whereby quantitative data streams have corresponding anti-streams which, in spite of being non-unique, are tied to the stream's unique statistical structure. We then describe "data smashing", a process by which streams and anti-streams can be algorithmically collided to reveal differences that are difficult to detect using conventional techniques. We establish this principle formally, describe how we implemented it in practice, and report its performance on a number of real-world cases. The results show that without access to any domain knowledge, data smashing meets or exceeds the accuracy achieved by specialized algorithms and heuristics devised by domain experts.

Nearly all automated discovery systems today rely, at their core, on the ability to compare data: from automatic image recognition to discovering new astronomical objects, such systems must be able to compare and contrast data records in order to group them, classify them, or identify the odd one out. Despite rapid growth in the amount of data collected and the increasing rate at which it can be processed, analysis of quantitative data streams still relies heavily on knowing what to look for. Any time a data mining algorithm searches beyond simple correlations, a human expert must help define a notion of similarity, by specifying important distinguishing "features" of the data to compare, or by training learning algorithms using copious amounts of examples. The data smashing principle removes the reliance on expert-defined features or examples, and in many cases does so faster and with better accuracy than traditional methods.
This paper is organized as follows: Sections I-VI describe the key concepts, along with a brief but complete description of the approach. The mathematical details, including proofs of correctness, are presented in Sections VII-IX. Quantization schemes are discussed in Section X. Comparisons with some standard notions of statistical dependence are carried out in Section XI, and the paper is concluded in Section XII.

Fig. 1. Data smashing: (A) Determining the similarity between two data streams is key to any data mining process, but relies heavily on human-prescribed criteria. (B) Data smashing first encodes each data stream, then collides one with the inverse of the other. The randomness of the resulting stream reflects the similarity of the original streams, leading to a cascade of downstream applications involving classification, decision, and optimization.

II. Anti-streams

The notion of data smashing applies only to data in the form of an ordered series of digits or symbols, such as acoustic waves from a microphone, light intensity over time from a telescope, traffic density along a road, or network activity from a router. The anti-stream contains the "opposite" information from the original data stream, and is produced by algorithmically inverting the statistical distribution of symbol sequences appearing in the original stream. For example, sequences of digits that were common in the original stream will be rare in the anti-stream, and vice versa. Streams and anti-streams can then be algorithmically "collided" in a way that systematically cancels any common statistical structure in the original streams, leaving only information relating to their statistically significant differences. We call this the principle of Information Annihilation (see Fig. 1).

Data smashing involves two data streams and proceeds in three steps: raw data streams are first quantized, by converting continuous values to a string of characters or symbols.
The simplest example of such quantization is where all positive values are mapped to the symbol "1" and all negative values to "0", thus generating a string of bits. Next, we select one of the quantized input streams and generate its anti-stream. Finally, we annihilate this anti-stream against the remaining quantized input stream and measure what information remains. The remaining information is estimated from the deviation of the resultant stream from flat white noise (FWN). Since a data stream is perfectly annihilated by a correct realization of its anti-stream, any deviation of the collision product from noise quantifies statistical dissimilarity.

Using this causal similarity metric, we can cluster streams, classify them, or identify stream segments that are unusual or different. The algorithms are linear in the input data, implying they can be applied efficiently to streams in near-real time. Importantly, data smashing can be applied without understanding where the streams were generated, how they are encoded, and what they represent. Ultimately, from a collection of data streams and their pairwise similarities, it is possible to automatically "back out" the underlying metric embedding of the data, revealing its hidden structure for use with traditional machine learning methods.

Dependence across data streams is often quantified using mutual information (1).
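To make the quantization step concrete, here is a minimal Python sketch of the coarsest scheme just described: positive values map to "1", all others to "0". The function name is ours, not the paper's.

```python
import math

def quantize_binary(series):
    """Coarsest quantization: '1' for positive values, '0' otherwise.

    Finer schemes partition the data range into more slices,
    yielding larger symbol alphabets (see Section X).
    """
    return ''.join('1' if v > 0 else '0' for v in series)

# A slowly varying signal becomes a symbolic stream of bits.
signal = [math.sin(0.7 * t) for t in range(12)]
bits = quantize_binary(signal)
```

Both input streams are quantized in this manner before any inversion or collision takes place.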
Fig. 2. Calculation of causal similarity using information annihilation. (A) We quantize raw signals to symbolic sequences over the chosen alphabet, and compute a causal similarity between such sequences. The underlying theory is established assuming the existence of generative probabilistic automata for these sequences, but our algorithms do not require explicit model construction, or a priori knowledge of their structures. (B) Concept of stream inversion; while we can find the group inverse of a given PFSA algebraically, we can also transform a generated sequence directly to one that represents the inverse model, without constructing the model itself. (C) Summing a PFSA G and its inverse -G yields the zero PFSA W. We can carry out this annihilation purely at the sequence level to get flat white noise. (D) Circuit that allows us to measure the similarity distance between streams s1, s2 via computation of ε11, ε22, and ε12 (see Table I). Given a threshold ε⋆ > 0, if εkk < ε⋆, then we have sufficient data for stream sk (k = 1, 2). Additionally, if ε12 ≤ ε⋆, then we conclude that s1, s2 have the same stochastic source with high probability (which converges exponentially fast to 1 with the length of input).

However, mutual information and data smashing are distinct concepts. The former measures dependence between streams; the latter computes a distance between the generative processes themselves. Two sequences of independent coin flips necessarily have zero mutual information, but data smashing will identify the streams as similar, being generated by the same stochastic process. Moreover, smashing only works correctly if the streams are independent or nearly so (see Section XI-A).

Similarity computed via data smashing is clearly a function of the statistical information buried in the input streams. However, it might not be easy to find the right statistical tool that reveals this hidden information, particularly without domain knowledge, or without first constructing a good system model (see Section XI-B for an example where smashing reveals non-trivial categories missed by simple statistical measures). We describe in detail the process of computing anti-streams, and the process of comparing information. In Sections VII-IX we provide theoretical bounds on the confidence levels, minimal data lengths required for reliable analysis, and scalability of the process as a function of the signal encodings.

We have limitations.
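The distinction between dependence and shared generators can be seen numerically. The sketch below is our own toy construction for a binary alphabet, not the paper's implementation: it compares a plug-in mutual-information estimate with a crude single-symbol smashing distance. Two independent streams from the same biased source have near-zero mutual information yet collide to near-perfect noise, while a biased stream and a fair stream, also with near-zero mutual information, do not.

```python
import math
import random
from collections import Counter

def mutual_info(s1, s2):
    """Plug-in estimate of mutual information (in bits) between
    symbol pairs read off the two streams at the same positions."""
    n = min(len(s1), len(s2))
    joint = Counter(zip(s1[:n], s2[:n]))
    m1, m2 = Counter(s1[:n]), Counter(s2[:n])
    return sum((c / n) * math.log2(c * n / (m1[a] * m2[b]))
               for (a, b), c in joint.items())

def smash_distance(s1, s2, rng):
    """Crude smashing distance (binary alphabet): collide an independent
    copy of s1 with an inverted independent copy of s2, then measure the
    single-symbol deviation of the collision product from uniform."""
    flip = {'0': '1', '1': '0'}
    copy1 = [c for c in s1 if c == rng.choice('01')]       # stream copy
    inv2 = [flip[c] for c in s2 if c == rng.choice('01')]  # stream inversion
    collided = [a for a, b in zip(copy1, inv2) if a == b]  # stream summation
    freq = Counter(collided)
    n = len(collided)
    return sum(abs(freq[sym] / n - 0.5) for sym in '01')

rng = random.Random(1)

def biased_stream(n):  # hidden source A: emits '0' with probability 0.8
    return ''.join('0' if rng.random() < 0.8 else '1' for _ in range(n))

def fair_stream(n):    # hidden source B: a fair coin
    return ''.join(rng.choice('01') for _ in range(n))

a, b = biased_stream(20000), biased_stream(20000)  # same source, independent
f = fair_stream(20000)                             # different source

mi_ab = mutual_info(a, b)           # near zero: streams are independent
d_same = smash_distance(a, b, rng)  # small: same hidden generator
d_diff = smash_distance(a, f, rng)  # large: different generators
```

Here near-zero mutual information says nothing about whether the generators coincide; the smashing distance separates the two cases.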
Data smashing is not directly applicable to learning tasks that do not depend on or require a notion of similarity, e.g., identifying a specific time instant at which some event of interest transpired within a data set, or predicting the next step in a time series. Even for the problems to which smashing is applicable, we do not claim strictly superior quantitative performance to the state of the art in any and all applications; carefully chosen approaches tuned to specific problems can certainly do as well, or better. Our claim is not that we uniformly outperform existing methods, but that we are on par, as evidenced in multiple example applications, yet do so without requiring expert knowledge or a training set. Additionally, technical reasons preclude applicability to data from strictly deterministic systems (see the section on Limitations & Assumptions).

III. The Hidden Models

The notion of a universal comparison metric makes sense only in the context of a featureless approach, where one considers pairwise similarity (or dissimilarity) between individual measurement sets. However, while the advantage of considering the notion of similarity between data sets instead of between feature vectors has been recognized (2), (3), (4), the definition of similarity measures has remained intrinsically heuristic and application dependent, with the possibility of a universal metric being summarily rejected. We show that such universal comparison is indeed realizable, at least under some general assumptions on the nature of the generating process.

We consider sequential observations, e.g., time series of sensor data. The first step is mapping the possibly continuous-valued sensory observations to discrete symbols via pre-specified quantization of the data range (see Section X and Fig. 11). Each symbol represents a slice of the data range, and the total number of slices defines the symbol alphabet Σ (where |Σ| denotes the alphabet size).
The coarsest quantization has a binary alphabet consisting of say 0 and 1 (it is not important what symbols we use, we can as well represent the letters of the alphabet with a and b ), but finer quantizations with larger alphabets are also possible. An observed data stream is thus mapped to a symbolic sequence over this pre-specified alphabet. W e 3 T ABLE I A lgorithms F or S tream O pera tions (Procedures belo w are used to assemble the annihilation circuit shown in Fig. 2D, whic h carries out data smashing) Stream Operation Algorithmic Procedur e (Pseudocode)  Independent Stream Copy y C s s 0 Generate an independent sample path from the same hidden stochastic sour ce. 1 Generate stream ! 0 from FWN 2 Read current symbol  1 from s 1 , and  2 from ! 0 3 If  1 =  2 , then write  1 to output s 0 4 Mov e read positions one step to right, and go to step 1 This operation is required internally in stream inv ersion.  Stream Inversion y  1 s s 0 Generate sample path from in verse model of hidden source. 1 Generate j  j  1 independent copies of s 1 : s 1 ;    ; s j  j 1 2 Read current symbols  i from s i ( i = 1 ;    ; j  j  1 ) 3 If  i ,  j for all distinct i; j , then write  n S j  j 1 i =1  i to output s 0 4 Mov e read positions one step to right, and go to step 1  Stream Summation y s 1 s 2 s 0 Generating sample path from sum of hidden sources. 1 Read current symbols  i from s i ( i = 1 ; 2 ) 2 If  1 =  2 , then write to output s 0 3 Mov e read positions one step to right, and go to step 1  Deviation fr om FWN z ^  s Real Number output in [0 ; 1] Estimating the deviation of a symbolic stream from FWN. (Symbolic derivatives (Definition 9) in Section VII formal- izes  s (  ) . If s is generated by a FWN process, then  s ( x ) ! U  for an y x 2  ? , and hence ^  ( s; ` ) ! 0 .) 
^  ( s; ` ) = j  j  1 j  j X x : j x j 5 ` jj  s ( x )  U  jj 1 j  j 2 j x j ; where  j  j is the alphabet size, j x j is the length of the string x  ` is the maxim um length of str ings upto which the sum is e v aluated. F or a given  ? , we choose ` = ln(1 = ? ) = ln( j  j ) (See Proposition 14)  U  is the uniform probability vector of length j  j  For  i 2  ,  s ( x )   i = # of occurrences of x i in string s # of occurrences of x in string s y See Section IX for proof of correctness z See Definition 22 and Propositions 13 and 14 in Section IX § Symbolic derivativ es underlie the r igorous proofs. Ho we ver , for the actual implementation, they are only needed in the final step to compute de viation from FWN assume that the symbol alphabet and its interpretation is fixed for a particular task. Quantization inv olves some information loss which can be reduced with finer alphabets at the expense of increased computational com- plexity (See Section X). W e use quantization schemes (See Fig. 11) which require no domain expertise. A. In verting and combining hidden models Quantized Stochastic Processes (QSPs) which capture the statistical structure of symbolic streams can be modeled using probabilistic automata, provided the processes are ergodic and stationary (5], [6], [7). For the purpose of computing our similarity metric, we require that the number of states in the automata be finite ( i:e: we only assume the existence of a generative Probabilistic Finite State Automata (PFSA)); we do not attempt to construct explicit models or require knowledge of either the exact number of states or any explicit bound thereof (See Fig. 2). A slightly restricted subset of the space of all PFSA over a fixed alphabet admits an Abelian group structure (See Section VIII); wherein the operations of commutati ve addition and in version are well-defined. 
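Read literally, the procedures in Table I admit a compact implementation. The sketch below is our illustrative Python rendering for a binary alphabet (helper names and the random-number plumbing are ours), not the authors' code; for |Σ| = 2, stream inversion reduces to emitting the complementary symbol of an independent copy.

```python
import random
from collections import Counter
from itertools import product

ALPHABET = '01'

def stream_copy(s, rng):
    """Independent stream copy (Table I, row 1): keep each symbol of s
    only where it matches an independently generated FWN stream."""
    return ''.join(c for c in s if c == rng.choice(ALPHABET))

def stream_invert(s, rng):
    """Stream inversion (Table I, row 2), binary case: with |Σ| = 2 we
    need |Σ| - 1 = 1 independent copy, and we emit the one symbol
    not read from it."""
    flip = {'0': '1', '1': '0'}
    return ''.join(flip[c] for c in stream_copy(s, rng))

def stream_sum(s1, s2):
    """Stream summation (Table I, row 3): selective erasure keeping
    only positions where the two streams agree."""
    return ''.join(a for a, b in zip(s1, s2) if a == b)

def deviation_from_fwn(s, ell):
    """ε̂(s, ℓ) (Table I, row 4): weighted L1 deviation of the empirical
    next-symbol distribution from uniform, over contexts x with |x| ≤ ℓ."""
    k = len(ALPHABET)
    total = 0.0
    for L in range(ell + 1):
        for x in map(''.join, product(ALPHABET, repeat=L)):
            nxt = Counter(s[i + L] for i in range(len(s) - L)
                          if s[i:i + L] == x)
            n = sum(nxt.values())
            if n:
                dev = sum(abs(nxt[a] / n - 1 / k) for a in ALPHABET)
                total += dev / k ** (2 * L)
    return (k - 1) / k * total

# Demo: a biased source collided with its own anti-stream yields a stream
# close to flat white noise, while two copies of the biased source
# collided together remain far from noise.
rng = random.Random(7)
s = ''.join('0' if rng.random() < 0.7 else '1' for _ in range(8000))
annihilated = stream_sum(stream_copy(s, rng), stream_invert(s, rng))
not_annihilated = stream_sum(stream_copy(s, rng), stream_copy(s, rng))
dev_a = deviation_from_fwn(annihilated, 2)
dev_n = deviation_from_fwn(not_annihilated, 2)
```

As expected, dev_a is much smaller than dev_n: collision with the anti-stream erases the bias, while collision with another copy of the same stream amplifies it.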
A trivial example of an Abelian group is the set of reals with the usual addition operation; addition of real numbers is commutative, and each real number a has a unique inverse -a, which when summed produce the unique identity 0. We have previously discussed the Abelian group structure on PFSAs in the context of model selection (8). Here, we show that key group operations, necessary for classification, can be carried out on the observed sequences alone, without any state synchronization or reference to the hidden generators of the sequences.

Existence of a group structure implies that given PFSAs G and H, the sums G + H and G - H, and the unique inverses -G and -H, are well-defined. Individual symbols have no notion of a "sign", and hence the models G and -G are not generators of sign-inverted sequences, which would not make sense, as our generated sequences are symbol streams. For example, the anti-stream of a sequence 10111 is not -1 0 -1 -1 -1, but a fragment that has inverted statistical properties in terms of the occurrence patterns of the symbols 0 and 1 (see Table I). For a PFSA G, the unique inverse -G is the PFSA which when added to G yields the group identity W = G + (-G), i.e., the zero model. Note that the zero model W is characterized by the property that for any arbitrary PFSA H in the group, we have H + W = W + H = H. For any fixed alphabet size, the zero model is the unique single-state PFSA (up to minimal description (9)) that generates symbols as consecutive realizations of independent random variables with uniform distribution over the symbol alphabet. Thus W generates flat white noise (FWN), and the entropy rate of FWN achieves the theoretical upper bound among the sequences generated by arbitrary PFSA in the model space. Two PFSAs G, H are identical if and only if G + (-H) = W.

B. Metric Structure on Model Space

In addition to the Abelian group, the PFSA space admits a metric structure (see Section VII).
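For intuition about these group operations, consider the simplest possible case: single-state (memoryless) PFSAs over a binary alphabet, each described by its symbol-emission probabilities. The sketch below is our reading of the structure illustrated in Fig. 2, with addition acting as an entrywise product of probabilities followed by normalization, and inversion as an entrywise reciprocal; it reproduces the figure's example, where the inverse of a model emitting 0 with probability 0.7 emits 0 with probability 0.3. The general multi-state construction is developed in Section VIII.

```python
def pfsa_sum(p, q):
    """'Addition' of two memoryless models: entrywise product of the
    emission probabilities, renormalized (our single-state sketch)."""
    raw = [a * b for a, b in zip(p, q)]
    z = sum(raw)
    return [r / z for r in raw]

def pfsa_inverse(p):
    """Group inverse: entrywise reciprocal, renormalized."""
    raw = [1.0 / a for a in p]
    z = sum(raw)
    return [r / z for r in raw]

G = [0.7, 0.3]            # P(0) = 0.7, P(1) = 0.3
neg_G = pfsa_inverse(G)   # the inverse model, [0.3, 0.7]
W = pfsa_sum(G, neg_G)    # the zero model: uniform, flat white noise
```

This mirrors the stream-level summation rule: when two independent streams are collided by keeping matching symbols, the probability of each surviving symbol is proportional to the product of its probabilities under the two sources.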
The distance between two models can thus be interpreted as the deviation of their group-theoretic difference from a FWN process. Information annihilation exploits the possibility of estimating causal similarity between observed data streams by estimating this distance from the observed sequences alone, without requiring the models themselves.

We can estimate the distance of the hidden generative model from FWN given only an observed stream s. This is achieved by the function ε̂ (see Table I, row 4). Intuitively, given an observed sequence fragment x, we first compute the deviation of the distribution of the next symbol from the uniform distribution over the alphabet. ε̂(s, ℓ) is the sum of these deviations for all historical fragments x with length up to ℓ, weighted by 1/|Σ|^(2|x|). The weighting ensures that deviations of the distributions for longer x have smaller contributions to ε̂(s, ℓ), which addresses the issue that the occurrence frequencies of longer sequences are more variable.

TABLE II
Application Problems & Results

1. Identify epileptic pathology (10). Input: 495 EEG excerpts, each 23.6 s sampled at 173.61 Hz; signal derivative used as input; 3-letter quantization. IA accuracy: 98.9%. State of the art: N/A; no comparable result is available in the literature. However, IA reveals a 1D manifold structure in the dataset, while (10), with additional assumptions on the nature of the hidden processes, fails to yield such insight.

2. Identify heart murmur (11). Input: 65 .wav files sampled at 44.1 kHz (about 10 s each); 2-letter quantization. IA precision (murmur): 75.2%. State of the art: 67%, achieved in supervised learning with task-specific features (11).
3. Classify variable stars (Cepheid variable vs. RR Lyrae) from photometry (OGLE II) (12). Input: 10699 photometric series; differentiated folded/raw photometry used as input; 3-letter quantization. Folded photometry: IA accuracy 99.8%, versus the 99.6% state of the art (12), achieved with task-specific features and multiple hand-optimized classification steps. Unfolded photometry: IA accuracy 94.3%; state of the art N/A (this capability is beyond the state of the art).

4. EEG-based biometric authentication (13) with visually evoked potentials (VEP). Input: 122 subjects, multivariate data from 61 standard electrodes; 256 data points for each trial for each electrode; total # of data series: 5477 (each with 61 variables); 2-letter quantization. IA accuracy: 97.96% (kNN), 99.65% (SVM). State of the art: 95.6% (kNN) and 98.96% (SVM), achieved with task-specific features, and after eliminating 2 subjects from consideration (14).

5. Text-independent speaker identification using the ELSDSR database (15). Input: 23 speakers (9 female, 14 male), 16 kHz recording; about 100 s of recording per speaker; 2 s snippets used as time-series excerpts; total # of time series: 1270; 2-letter quantization. IA accuracy: 80.2%. State of the art: 73.73%, achieved with task-specific features and multiple hand-optimized classification steps (16).

(See Section X for details on choosing quantization schemes.)

IV. Key Insight: The Information Annihilation Principle

Our key insight is the following: two sets of sequential observations have the same generative process if the inverted copy of one can annihilate the statistical information contained in the other. We claim that, given two symbol streams s1 and s2, we can check whether the underlying PFSAs (say G1, G2) satisfy the annihilation equality

    G1 + (-G2) = W

without explicitly knowing or constructing the models themselves.
Data smashing is predicated on being able to invert and sum streams, and to compare streams to noise. Inversion generates a stream s′ given a stream s, such that if the PFSA G is the source for s, then -G is the source for s′. Summation collides two streams: given streams s1 and s2, generate a new stream s′ which is a realization of FWN if and only if the hidden models G1, G2 satisfy G1 + G2 = W. Finally, the deviation of a stream s from that generated by a FWN process can be calculated directly.

Importantly, for a stream s (with generator G), the inverted stream s′ is not unique. Any symbol stream generated from the inverse model -G qualifies as an inverse for s; thus anti-streams are non-unique. What is indeed unique is the generating inverse PFSA model. Since our technique compares the hidden stochastic processes and not their possibly non-unique realizations, the non-uniqueness of anti-streams is not problematic. Despite the possibility of mis-synchronization between hidden model states, the algorithms shown in Table I remain valid for disambiguation of hidden dynamics. We show in Section IX that the algorithms evaluate distinct models to be distinct, and nearly identical hidden models to be nearly identical.

Estimating the deviation of a stream from FWN is straightforward (as specified by ε̂(s, ℓ) in Table I, row 4). All subsequences of a given length must necessarily occur with the same frequency for a FWN process; we simply estimate the deviation from this behavior in the observed sequence. The other two tasks are carried out via selective erasure of symbols from the input stream(s) (see Table I, rows 1-3). For example, summation of streams is realized as follows: given two streams s1, s2, we read a symbol from each stream; if they match, we copy it to our output, and we ignore the symbols read when they do not match.
Thus, data smashing allows us to manipulate streams via selective erasure, to estimate a distance between the hidden stochastic sources. Specifically, we estimate the degree to which the sum of a stream and its anti-stream brings the entropy rate of the resultant stream close to its theoretical upper bound.

A. Contrast with Feature-based State of the Art

Contemporary research in machine learning is dominated by the search for good "features" (17), which are typically understood to be heuristically chosen discriminative attributes characterizing objects or phenomena of interest. Finding such attributes is not easy (18), (19). Moreover, the number of characterizing features, i.e., the size of the feature set, needs to be relatively small to avoid intractability of the subsequent learning algorithms. Additionally, their heuristic definition precludes any notion of optimality; it is impossible to quantify the quality of a given feature set in any absolute terms; we can only compare how it performs in the context of a specific task against a few selected variations.

In addition to the heuristic nature of feature selection, machine learning algorithms typically necessitate the choice of a distance metric in the feature space. For example, the classic "nearest neighbor" k-NN classifier (20) requires a definition of proximity, and the k-means algorithm (21) depends on pairwise distances in the feature space for clustering.

Fig. 3. Computational complexity and convergence rates for information annihilation. (A) Illustrates exponential convergence of the self-annihilation error for a small set of data series for different applications (plate (i) for EEG data, plate (ii) for heart sound recordings, and plate (iii) for photometry). (B) Computation times for carrying out annihilation using the circuit shown in Fig. 2D as a function of the length of input streams, for different alphabet sizes (and for different numbers of states in the hidden models). Note that the asymptotic time complexity of obtaining the similarity distances scales as O(|Σ| n), where n is the length of the shorter of the two input streams.

To side-step the heuristic metric problem, recent approaches often learn appropriate metrics directly from data, attempting to "back out" a metric from side information or labeled constraints (22). Unsupervised approaches use dimensionality reduction and embedding strategies to uncover the geometric structure of geodesics in the feature space (e.g., see manifold learning (23), (24), (25)). However, automatically inferred data geometry in the feature space is, again, strongly dependent on the initial choice of features. Since Euclidean distances between feature vectors are often misleading (23), heuristic features make it impossible to conceive of a task-independent universal metric.

In contrast, smashing is based on an application-independent notion of similarity between quantized sample paths observed from hidden stochastic processes. Our universal metric quantifies the degree to which the summation of the inverted copy of any one stream to the other annihilates the existing statistical dependencies, leaving behind flat white noise. We circumvent the need for features altogether (see Fig. 1B) and do not require training.
Despite the fact that the estimation of similarities between two data streams is performed in the absence of any knowledge of the underlying source structure or its parameters, we establish that this universal metric is causal, i.e., with sufficient data it converges to a well-defined distance between the hidden stochastic sources themselves, without ever knowing them explicitly.

B. Self-annihilation Test for Data-sufficiency Check

Statistical process characteristics dictate the amount of data required for estimation of the proposed distance. With no access to the hidden models, we cannot estimate the required data length a priori; however, it is possible to check for data sufficiency for a specified error threshold via self-annihilation. Since the proposed metric is causal, the distance between two independent samples from the same source always converges to zero. We estimate the degree of self-annihilation achieved in order to determine data sufficiency; i.e., a stream is sufficiently long if it can sufficiently annihilate an inverted self-copy to FWN.

The self-annihilation based data-sufficiency test consists of two steps: given an observed symbolic sequence s, we first generate an independent copy (say s′). This is the independent stream copy operation (see Table I, row 1), which can be carried out via selective symbol erasure without any knowledge of the source itself. Once we have s and s′, we check whether the inverted version of one annihilates the other to a pre-specified degree. In particular, we generate s″ from s via stream inversion, use stream summation of s′ and s″ to produce the final output stream s‴, and check whether ε̂(s‴, ℓ) is less than some specified threshold ε⋆ > 0. We show that considering only histories up to a length ℓ = ln(1/ε⋆)/ln(|Σ|) in the computation of ε̂(s‴, ℓ) is sufficient (see Section IX).
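The two-step test above can be wired together as follows. This is a self-contained sketch for a binary alphabet (the function names are ours); note how the history depth ℓ is derived from the threshold ε⋆ as ln(1/ε⋆)/ln(|Σ|).

```python
import math
import random
from collections import Counter
from itertools import product

def deviation(s, ell, alphabet='01'):
    """ε̂(s, ℓ): weighted deviation of next-symbol statistics from uniform."""
    k = len(alphabet)
    total = 0.0
    for L in range(ell + 1):
        for x in map(''.join, product(alphabet, repeat=L)):
            nxt = Counter(s[i + L] for i in range(len(s) - L)
                          if s[i:i + L] == x)
            n = sum(nxt.values())
            if n:
                total += sum(abs(nxt[a] / n - 1 / k)
                             for a in alphabet) / k ** (2 * L)
    return (k - 1) / k * total

def self_annihilation_error(s, eps_star, seed=0):
    """Data-sufficiency check: collide an independent copy of s with an
    inverted copy of s, and return ε̂ of the collision product, using
    history depth ℓ = ln(1/ε⋆)/ln(|Σ|). Stream s passes the test if the
    returned error is below eps_star."""
    rng = random.Random(seed)
    flip = {'0': '1', '1': '0'}
    s_copy = [c for c in s if c == rng.choice('01')]             # s'
    s_inv = [flip[c] for c in s if c == rng.choice('01')]        # s''
    s_out = ''.join(a for a, b in zip(s_copy, s_inv) if a == b)  # s'''
    ell = int(math.log(1 / eps_star) / math.log(2))
    return deviation(s_out, ell)

rng = random.Random(42)
s = ''.join('0' if rng.random() < 0.75 else '1' for _ in range(20000))
err = self_annihilation_error(s, eps_star=0.05)
```

For this memoryless 0.75-biased source, the collision product is close to flat white noise, so err is far below the raw stream's own deviation from FWN; a much shorter excerpt of the same stream would typically produce a larger error and fail the test.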
The self-annihilation error is also useful to rank the effectiveness of different quantization schemes. Better quantization schemes (e.g., ternary instead of binary) will be able to produce better self-annihilation while maintaining the ability to discriminate different streams (see Section X).

V. Feature-free Classification and Clustering

Given n data streams s1, ..., sn, we construct a matrix E such that Eij represents the estimated distance between the streams si, sj. Thus, the diagonal elements of E are the self-annihilation errors, while the off-diagonal elements represent inter-stream similarity estimates (see Fig. 2D for the basic annihilation circuit). Given a positive threshold ε⋆ > 0, the self-annihilation tests are passed if εkk ≤ ε⋆ (k = i, j), and for sufficient data the streams si, sj have identical sources with high probability if and only if εij ≤ ε⋆.

Once E is constructed, we can determine clusters by rearranging E into prominent diagonal blocks. Any standard technique (26) can be used for such clustering; information annihilation is only used to find the causal distances between observed data streams, and the resultant distance matrix can then be used as input to state-of-the-art clustering methodologies, or for finding geometric structures (such as lower dimensional embedding manifolds (23)) induced by the similarity metric on the data sources. The matrix H, obtained from E by setting the diagonal entries to zero, estimates a distance matrix. A Euclidean embedding (27) of H then leads to deeper insight into the geometry of the space of the hidden generators; e.g., in the case of the EEG data, the time series describe a one-dimensional manifold (a curve), with data from similar phenomena clustered together along the curve (see Fig. 4A(ii)).
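Given the matrix E, the thresholding and block-diagonal grouping described above reduces to finding connected components of the graph whose edges are the pairs with εij ≤ ε⋆. The sketch below (the 4-stream matrix is invented for illustration) stands in for the off-the-shelf clustering methods the text refers to.

```python
def clusters_from_distances(E, eps_star):
    """Group stream indices i, j into one cluster whenever E[i][j] <= eps_star,
    after verifying each stream passes its self-annihilation test
    (diagonal entry <= eps_star). Simple union-find over the threshold graph."""
    n = len(E)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        assert E[i][i] <= eps_star, f"stream {i} fails data-sufficiency check"
        for j in range(i + 1, n):
            if E[i][j] <= eps_star:
                ri, rj = find(i), find(j)
                parent[ri] = rj
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical 4-stream distance matrix: streams 0, 1 share a source,
# as do streams 2, 3; diagonal entries are self-annihilation errors.
E = [[0.01, 0.03, 0.40, 0.38],
     [0.03, 0.02, 0.41, 0.37],
     [0.40, 0.41, 0.01, 0.04],
     [0.38, 0.37, 0.04, 0.02]]
```

With ε⋆ = 0.05 this recovers the two diagonal blocks {0, 1} and {2, 3}; a more permissive threshold merges everything into one cluster.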
A. Computational Complexity & Data Requirements

The asymptotic time complexity of carrying out the stream operations scales linearly with input length and with the granularity of the alphabet (See Section IX and Fig. 3B for an illustration of the linear time complexity of estimating inter-stream similarity). To pass the self-annihilation test, a data stream must be sufficiently long, and the required length |s| of the input s for a specified threshold ε⋆ is dictated by the characteristics of the generating process. Selective erasure in annihilation (See Table I) implies that the output tested for being FWN is shorter than the input stream, and the expected shortening ratio γ can be explicitly computed (See Section IX). We refer to γ as the annihilation efficiency, since the convergence rate of the self-annihilation error scales as 1/√(γ|s|). In other words, the required length |s| of the data stream to achieve a self-annihilation error of ε⋆ scales as 1/(ε⋆)². It is important to note that our analysis shows that the annihilation efficiency is independent of the descriptional complexity of the process; e.g., in Fig. 10 the self-annihilation error for a simpler two-state process converges no faster than that of a four-state process. However, the convergence rate always scales as O(1/√|s|), as dictated by the Central Limit Theorem (CLT) (28).

B. Limitations & Assumptions

Data smashing is not useful in problems which do not require a notion of similarity, e.g., predicting the future course of a time series, or analyzing a data set to pinpoint the occurrence time of an event of interest.
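The CLT-style rate above fixes how data requirements scale. A toy calculation, with a process-dependent constant that is hypothetical here:

```python
def required_length(eps_star, gamma, c=1.0):
    """Rough data-length estimate from the rate error ~ c/sqrt(gamma*|s|):
    solve for |s| at error eps_star.  The constant c is process-dependent
    and purely illustrative."""
    return c**2 / (gamma * eps_star**2)

# halving the error threshold quadruples the required stream length
len_coarse = required_length(0.10, 0.5)
len_fine = required_length(0.05, 0.5)
```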
For problems to which smashing is applicable, we implicitly assume the existence of PFSA generators, although we never find these models explicitly. It follows that what we actually assume is not any particular modeling framework, but that the systems of interest satisfy the properties of ergodicity and stationarity, and have a finite (but not a priori bounded) number of states (See Section VII).

[Fig. 4. Data smashing applications. Pairwise distance matrices, identified clusters, and 3D projections of Euclidean embeddings for epileptic pathology identification (A), heart murmur detection from a digital stethoscope (B), and classification of variable stars (Cepheid vs RR Lyrae) from photometry (C). In these applications, the relevant clusters are found unsupervised.]
In practice, our technique performs well even if these properties are only approximately satisfied (e.g., quasi-stationarity instead of stationarity; see the example in Section XI-B). The algebraic structure of the space of PFSAs (in particular, the existence of unique group inverses) is key to the information annihilation principle; however, we argue that any quantized ergodic stationary stochastic process is indeed representable as a probabilistic automaton (See Section VII). Data smashing is not applicable to data from strictly deterministic systems. Such systems are representable by probabilistic automata; however, transitions occur with probabilities which are either 0 or 1. PFSAs with zero-probability transitions are non-invertible, which invalidates the underlying theoretical guarantees (See Section VIII). Similarly, data streams in which some alphabet symbol is exceedingly rare would be difficult to invert (See Section IX for the notion of annihilation efficiency). Symbolization invariably introduces quantization error. This can be made small by using larger alphabets. However, larger alphabet sizes demand longer observed sequences (See Section IX, Fig. 9), implying that the length of observation limits the quantization granularity, and in the process limits the degree to which the quantization error can be mitigated. Importantly, with coarse quantizations, distinct processes may evaluate to be similar. However, identical processes will still evaluate to be identical (or nearly so), provided the streams pass the self-annihilation test. The self-annihilation test thus offers an application-independent way to compare and rank quantization schemes (See Section X). The algorithmic steps (See Table I) require no synchronization (we can start reading the streams anywhere), implying that non-equal lengths of time series and phase mismatches are of no consequence. VI.
Application Examples

Data smashing begins with quantizing streams to symbolic sequences, followed by the use of the annihilation circuit (Fig. 2D) to compute pairwise causal similarities. Details of the quantization schemes, computed distance matrices, and identified clusters and Euclidean embeddings are summarized in Table II and Fig. 4. Our first application is classification of brain electrical activity from different physiological and pathological brain states (10). We used sets of electroencephalographic (EEG) data series consisting of surface EEG recordings from healthy volunteers with eyes closed and open, and intracranial recordings from epilepsy patients during seizure-free intervals from within and from outside the seizure generating area, as well as intracranial recordings of seizures. Starting with the data series of electric potentials, we generated sequences of relative changes between consecutive values before quantization. This step allows a common alphabet for sequences with wide variability in the sequence mean values. The distance matrix from pairwise smashing yielded clear clusters corresponding to seizure, normal eyes open (EO), normal eyes closed (EC), and epileptic pathology in non-seizure conditions (See Fig. 4A; seizures not shown due to large differences from the rest). Embedding the distance matrix (See Fig. 4A, plate (i)) yields a one-dimensional manifold (a curve), with contiguous segments corresponding to different brain states; e.g., the right hand side of plane A corresponds to epileptic pathology. This provides a particularly insightful picture, which eludes complex non-linear modeling (10). Next we classify cardiac rhythms from noisy heart-sound data recorded using a digital stethoscope (11). We analyzed 65 data series (ignoring the labels) corresponding to healthy rhythms and murmur, to verify if we could identify clusters without supervision that correspond to the expert-assigned labels.
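The preprocessing step described above (relative changes between consecutive values, then quantization into a common alphabet) can be sketched as follows. The sign-based binary cut below is an illustrative choice; the paper's actual partition boundaries may differ:

```python
def relative_changes(series):
    """Relative change between consecutive values (zeros are skipped
    to avoid division by zero, which shifts alignment slightly)."""
    return [(b - a) / abs(a) for a, b in zip(series, series[1:]) if a != 0]

def quantize(values, cuts, alphabet):
    """Map each value to a symbol by the inter-cut bin it falls in;
    len(alphabet) must equal len(cuts) + 1."""
    return ''.join(alphabet[sum(v > c for c in cuts)] for v in values)

r = relative_changes([100.0, 110.0, 99.0, 99.0])
s = quantize(r, [0.0], '01')   # binary: '0' = fall/flat, '1' = rise
```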
We found 11 clusters in the distance matrix (See Fig. 4B), 4 of which consisted mainly of data with murmur (as determined by the expert labels), and the rest consisting mainly of healthy rhythms (See Fig. 4B, plate (iv)). Classification precision for murmur is noted in Table II (75.2%). Embedding of the distance matrix revealed a two-dimensional manifold (See Fig. 4B, plate (iii)). Our next problem is the classification of variable stars using light intensity series (photometry) from the Optical Gravitational Lensing Experiment (OGLE) survey (12). Supervised classification of photometry proceeds by first "folding" each light-curve to its known period to correct phase mismatches. In our first analysis, we started with folded light-curves, and generated data series of the relative changes between consecutive brightness values in the curves before quantization, which allows for the use of a common alphabet for light curves with wide variability in the mean brightness values. Using data for Cepheids and RRLs (3426 Cepheids, 7273 RRLs), we obtained a classification accuracy of 99.8%, which marginally outperforms the state of the art (See Table II). Clear clusters (obtained unsupervised) corresponding to the two classes can be seen in the computed distance matrix (See Fig. 4C, plate (i)), and in the 3D projection of its Euclidean embedding (See Fig. 4C, plate (ii)). The 3D embedding was very nearly constrained within a 2D manifold (See Fig. 4C, plate (ii)). Additionally, in our second analysis, we asked if data smashing can work without knowledge of the period of the variable star, skipping the folding step. Smashing raw photometry data yielded a classification accuracy of 94.3% for the two classes (See Table II). This direct approach is beyond state of the art techniques. Our fourth application is biometric authentication using visually evoked EEG potentials (VEP). The public database used (13)
considered 122 subjects, each of whom was exposed to pictures of objects chosen from the standardized Snodgrass set (29). Note that while this application is supervised (since we are not attempting to find clusters unsupervised), no actual training is involved; we merely mark the randomly chosen subject-specific set of data series as the library set representing each individual subject. If an "unknown" test data series is smashed against each element of each of the libraries corresponding to the individual subjects, we expect that data series from the same subject will annihilate each other correctly, while those from different subjects will fail to do so to the same extent. We outperformed the state of the art for both kNN and SVM based approaches (See Table II). Our fifth application is text-independent speaker identification using the ELSDSR database (15), which includes recordings from 23 speakers (9 female and 14 male, with possibly non-native accents). As before, training involved specifying the library series for each speaker. We computed the distance matrix by smashing the library data series against each other, and trained a simple kNN on the Euclidean embedding of the distance matrix. The test data then yielded a classification accuracy of 80.2%, which beats the state of the art figure of 73.73% for 2 s snippets of recording data (16) (See Table II). In the succeeding sections, we develop the mathematical details of the information annihilation principle, and establish the correctness of the data smashing algorithm. Section VII presents the theory of probabilistic automata as a modeling framework for ergodic stationary quantized stochastic processes. Section VIII describes the relevant algebraic structures, including that of an Abelian group, definable on the space of probabilistic automata. This is central to the notion of anti-streams. Section IX then establishes that the stream operations delineated in Table I are indeed correct.
Section X discusses quantization schemes, specifically describing how to choose the granularity of the quantization. Section XI expounds the differences between the data smashing approach and some specific standard notions often used to quantify statistical dependencies, e.g., mutual information between data streams. We also discuss a specific example to illustrate that simple statistical features may miss important dynamical artifacts in data, which are easily revealed via data smashing. The paper is summarized and concluded in Section XII.

VII. Stochastic Processes & Probabilistic Automata

To establish the correctness of the data smashing algorithm, we first establish the possibility of using probabilistic automata to model stationary, ergodic processes. Our automata models (5) are distinct from those reported in the literature (30), (31). We include a brief overview here for the sake of completeness.

Notation 1. Σ denotes a finite alphabet of symbols. The set of all finite but possibly unbounded strings on Σ is denoted by Σ⋆ (32). The set of finite strings over Σ forms a concatenative monoid, with the empty word λ as identity. The set of strictly infinite strings on Σ is denoted as Σ^ω, where ω denotes the first transfinite cardinal. For a string x, |x| denotes its length, and for a set A, |A| denotes its cardinality.

Definition 1 (QSP). A QSP H is a discrete time Σ-valued strictly stationary, ergodic stochastic process, i.e.,

H = { X_t : X_t is a Σ-valued random variable, t ∈ ℕ ∪ {0} }    (1)

A process is ergodic if moments may be calculated from a sufficiently long realization, and strictly stationary if moments are time-invariant.

We next formalize the connection of QSPs to PFSA generators. We develop the theory assuming multiple realizations of the QSP H, and fixed initial conditions.
Using ergodicity, we will then be able to apply our construction to a single sufficiently long realization, where initial conditions cease to matter.

Definition 2 (σ-Algebra On Infinite Strings). For the set of infinite strings on Σ, we define B to be the smallest σ-algebra generated by the family of sets { xΣ^ω : x ∈ Σ⋆ }.

Lemma 1. Every QSP induces a probability space (Σ^ω, B, μ).

Proof: Assuming stationarity, we can construct a probability measure μ : B → [0, 1] by defining, for any sequence x ∈ Σ⋆ \ {λ} and a sufficiently large number of realizations N_R (assuming ergodicity):

μ(xΣ^ω) = lim_{N_R → ∞} (# of initial occurrences of x) / (# of initial occurrences of all sequences of length |x|)

and extending the measure to the remaining elements of B via at most countable sums. Thus μ(Σ^ω) = Σ_{σ ∈ Σ} μ(σΣ^ω) = 1, and for the null word, μ(λΣ^ω) = μ(Σ^ω) = 1.

Notation 2. For notational brevity, we denote μ(xΣ^ω) as Pr(x).

Classically, automaton states are equivalence classes for the Nerode relation; two strings are equivalent if and only if any finite extension of the strings is either both in the language under consideration, or neither is (32). We use a probabilistic extension (9).

Definition 3 (Probabilistic Nerode Equivalence Relation). (Σ^ω, B, μ) induces an equivalence relation ∼_N on the set of finite strings Σ⋆ as:

∀x, y ∈ Σ⋆, x ∼_N y ⟺ ∀z ∈ Σ⋆, ( Pr(xz) = Pr(yz) = 0 ) ∨ ( | Pr(xz)/Pr(x) − Pr(yz)/Pr(y) | = 0 )    (2)

Notation 3. For x ∈ Σ⋆, the equivalence class of x is denoted [x].

It is easy to see that ∼_N is right invariant, i.e.,

x ∼_N y ⟹ ∀z ∈ Σ⋆, xz ∼_N yz    (3)

A right-invariant equivalence on Σ⋆ always induces an automaton structure; hence the probabilistic Nerode relation induces a probabilistic automaton: states are equivalence classes of ∼_N, and the transition structure arises as follows: for states q, q′, a string x ∈ Σ⋆, and a symbol σ ∈ Σ,
([x] = q) ∧ ([xσ] = q′) ⟹ q →^σ q′    (4)

Before formalizing the above construction, we introduce the notion of probabilistic automata with initial, but no final, states.

Definition 4 (Initial-Marked PFSA). An initial-marked probabilistic finite state automaton (an Initial-Marked PFSA) is a quintuple (Q, Σ, δ, π̃, q_0), where Q is a finite state set, Σ is the alphabet, δ : Q × Σ → Q is the state transition function, π̃ : Q × Σ → [0, 1] specifies the conditional symbol-generation probabilities, and q_0 ∈ Q is the initial state. δ and π̃ are recursively extended to arbitrary y = σx ∈ Σ⋆ as follows:

∀q ∈ Q, δ(q, λ) = q    (5)
δ(q, σx) = δ(δ(q, σ), x)    (6)
∀q ∈ Q, π̃(q, λ) = 1    (7)
π̃(q, σx) = π̃(q, σ) π̃(δ(q, σ), x)    (8)

Additionally, we impose that for distinct states q_i, q_j ∈ Q, there exists a string x ∈ Σ⋆ such that δ(q_i, x) = q_j and π̃(q_i, x) > 0.

Note that the probability of the null word is unity from each state. If the current state and the next symbol are specified, the next state is fixed, as in Probabilistic Deterministic Automata (33); however, unlike the latter, we lack final states in the model. Additionally, we assume our graphs to be strongly connected. Later we will remove initial-state dependence using ergodicity. Next we formalize how a PFSA arises from a QSP.

Lemma 2 (PFSA Generator). Every Initial-Marked PFSA G = (Q, Σ, δ, π̃, q_0) induces a unique probability measure μ_G on the measurable space (Σ^ω, B).

Proof: Define the set function μ_G on the measurable space (Σ^ω, B) as:

μ_G(∅) ≜ 0    (9)
∀x ∈ Σ⋆, μ_G(xΣ^ω) ≜ π̃(q_0, x)    (10)
∀x, y ∈ Σ⋆, μ_G({x, y}Σ^ω) ≜ μ_G(xΣ^ω) + μ_G(yΣ^ω)    (11)

Countable additivity of μ_G is immediate, and (See Definition 4):

μ_G(Σ^ω) = μ_G(λΣ^ω) = π̃(q_0, λ) = 1    (12)

implying that (Σ^ω, B, μ_G) is a probability space. We refer to (Σ^ω
; B ;  G ) as the probability space generated by the Initial-Marked PFSA G . Lemma 3 (Probability Space T o PFSA) . If the pr obabilistic Ner ode r elation corr esponding to a pr obability space ( ! ; B ;  ) has a finite index, then the latter has an initial-marked PFSA generator . Pr oof: Let Q be the set of equiv alence classes of the probabilistic Nerode relation (Definition 3), and define functions  : Q   ! Q , e  : Q   ! [0 ; 1] as:  ([ x ] ;  ) = [ x ] (13) e  ([ x ] ;  ) = P r ( x 0  ) P r ( x 0 ) for any choice of x 0 2 [ x ] (14) where we extend  ; e  recursiv ely to y =  x 2  ? as  ( q ;  x ) =  (  ( q ;  ) ; x ) (15) e  ( q ;  x ) = e  ( q ;  ) e  (  ( q ;  ) ; x ) (16) For verifying the null-word probability , choose a x 2  ? such that [ x ] = q for some q 2 Q . Then, from Eq. (14), we have: e  ( q ;  ) = P r ( x 0  ) P r ( x 0 ) for any x 0 2 [ x ] ) e  ( q ;  ) = P r ( x 0 ) P r ( x 0 ) = 1 (17) Finite index of  N implies j Q j < 1 , and hence denoting [  ] as q 0 , we conclude: G = ( Q;  ;  ; e  ; q 0 ) is an Initial-Marked PFSA. Lemma 2 implies that G generates ( ! ; B ;  ) , which completes the proof. The above construction yields a minimal realization for the Initial- Marked PFSA, unique up to state renaming. Lemma 4 (QSP to PFSA) . Any QSP with a finite index Ner ode equivalence is gener ated by an Initial-Marked PFSA. Pr oof: Follo ws immediately from Lemma 1 (QSP to Probability Space) and Lemma 3 (Probability Space to PFSA generator). A. Canonical Repr esentations W e have defined a QSP as both ergodic and stationary , whereas the Initial-Marked PFSAs have a designated initial state. Next we intro- duce canonical representations to remove initial-state dependence. W e use e  to denote the matrix representation of e  , i:e: , e  ij = e  ( q i ;  j ) , q i 2 Q;  j 2  . W e need the notion of transformation matrices   . Definition 5 (T ransformation Matrices) . 
Definition 5 (Transformation Matrices). For an initial-marked PFSA G = (Q, Σ, δ, π̃, q_0), the symbol-specific transformation matrices Γ_σ ∈ [0, 1]^{|Q|×|Q|} are:

[Γ_σ]_ij = π̃(q_i, σ) if δ(q_i, σ) = q_j, and 0 otherwise    (18)

Transformation matrices have a single non-zero entry per row, reflecting our generation rule that, given a state and a generated symbol, the next state is fixed. First, we note that, given an initial-marked PFSA G, we can associate a probability distribution ℘_x over the states of G with each x ∈ Σ⋆ in the following sense: if x = σ_{r_1} ⋯ σ_{r_m} ∈ Σ⋆, then we have:

℘_x = ℘_{σ_{r_1} ⋯ σ_{r_m}} = (1 / ||℘_λ ∏_{j=1}^{m} Γ_{σ_{r_j}}||_1) ℘_λ ∏_{j=1}^{m} Γ_{σ_{r_j}}    (19)

where ℘_λ is the stationary distribution over the states of G, and the leading factor is a normalization. Note that there may exist more than one string that leads to a distribution ℘_x, beginning from the stationary distribution ℘_λ. Thus, ℘_x corresponds to an equivalence class of strings, i.e., x is not unique.

Definition 6 (Canonical Representation). An initial-marked PFSA G = (Q, Σ, δ, π̃, q_0) uniquely induces a canonical representation (Q_C, Σ, δ_C, π̃_C), where Q_C is a subset of the set of probability distributions over Q, and δ_C : Q_C × Σ → Q_C, π̃_C : Q_C × Σ → [0, 1] are constructed as follows:

1) Construct the stationary distribution on Q using the transition probabilities of the Markov Chain induced by G, and include this as the first element ℘_λ of Q_C. Note that the transition matrix for G is the row-stochastic matrix M ∈ [0, 1]^{|Q|×|Q|}, with M_ij = Σ_{σ : δ(q_i, σ) = q_j} π̃(q_i, σ), and hence ℘_λ satisfies:

℘_λ M = ℘_λ    (20)

2) Define δ_C and π̃_C recursively:

δ_C(℘_x, σ) = (1 / ||℘_x Γ_σ||_1) ℘_x Γ_σ ≜ ℘_{xσ}    (21)
π̃_C(℘_x, σ) = [℘_x Π̃]_σ    (22)

For a QSP H, the canonical representation is denoted as C_H.
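Eqs. (18)-(21) reduce to a few matrix operations: build the Γ_σ matrices, find ℘_λ by fixed-point (power) iteration on the chain matrix, and propagate a state distribution through observed symbols with renormalization. A sketch in pure Python, with a hypothetical two-state binary machine:

```python
def gamma_matrices(states, alphabet, delta, pi):
    """Symbol-specific transformation matrices of Eq. (18):
    [Gamma_s]_ij = pi(q_i, s) if delta(q_i, s) = q_j, else 0."""
    idx = {q: i for i, q in enumerate(states)}
    n = len(states)
    mats = {}
    for s in alphabet:
        M = [[0.0] * n for _ in range(n)]
        for q in states:
            M[idx[q]][idx[delta[(q, s)]]] = pi[(q, s)]
        mats[s] = M
    return mats

def propagate(dist, M):
    """Row vector times matrix, renormalized to a distribution (Eq. 21)."""
    out = [sum(dist[i] * M[i][j] for i in range(len(dist)))
           for j in range(len(M[0]))]
    z = sum(out)
    return [v / z for v in out]

def stationary(mats, n, iters=500):
    """Power iteration on the chain matrix M = sum_s Gamma_s (Eq. 20)."""
    M = [[sum(mats[s][i][j] for s in mats) for j in range(n)]
         for i in range(n)]
    d = [1.0 / n] * n
    for _ in range(iters):
        d = propagate(d, M)
    return d

states, alphabet = ['a', 'b'], '01'
delta = {('a', '0'): 'a', ('a', '1'): 'b', ('b', '0'): 'a', ('b', '1'): 'b'}
pi = {('a', '0'): 0.85, ('a', '1'): 0.15, ('b', '0'): 0.25, ('b', '1'): 0.75}
mats = gamma_matrices(states, alphabet, delta, pi)
p_lambda = stationary(mats, 2)         # stationary distribution over states
p_1 = propagate(p_lambda, mats['1'])   # distribution after observing '1'
```

For this machine the symbol '1' always leads to state b, so the propagated distribution collapses onto that state, previewing the synchronization discussion below.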
Lemma 5 (Properties of Canonical Representation). Given an initial-marked PFSA G = (Q, Σ, δ, π̃, q_0):

1) The canonical representation is independent of the initial state.
2) The canonical representation (Q_C, Σ, δ_C, π̃_C) contains a copy of G in the sense that there exists a set of states Q′ ⊆ Q_C and a one-to-one map ζ : Q → Q′, with:

∀q ∈ Q, ∀σ ∈ Σ, π̃(q, σ) = π̃_C(ζ(q), σ) and ζ(δ(q, σ)) = δ_C(ζ(q), σ)    (23)

3) If during the construction (beginning with ℘_λ) we encounter ℘_x = ζ(q) for some x ∈ Σ⋆, q ∈ Q and any map ζ as defined in (2), then we stay within the graph of the copy of the initial-marked PFSA for all right extensions of x.

[Fig. 5. Synchronizable and non-synchronizable machines. Synchronization is determination of the current state from observed past symbols. Not all PFSAs are synchronizable: while the top machine is synchronizable, the bottom one is not. A history of just one symbol suffices to determine the current state in the synchronizable machine (top), while no finite history can do the same in the non-synchronizable machine (bottom). An ε-synchronizing string always exists (5) for a PFSA, which is not true for deterministic automata (34), (35).]

Proof: (1) follows from the ergodicity of QSPs, which makes ℘_λ independent of the initial state in the initial-marked PFSA.

(2) The canonical representation subsumes the initial-marked representation in the sense that the states of the latter may themselves be seen as degenerate distributions over Q, i.e., by letting

E = { e_i ∈ [0, 1]^{|Q|} : i = 1, …, |Q| }    (24)

denote the set of distributions satisfying:

[e_i]_j = 1 if i = j, and 0 otherwise    (25)

(3) follows from the strong connectivity of G.
Lemma 5 implies that initial states are unimportant; we may denote the initial-marked PFSA induced by a QSP H, with the initial marking removed, as P_H, and refer to it simply as a "PFSA". States in P_H are representable as states in C_H via elements of E. Next we show that, in the canonical construction starting from the stationary distribution ℘_λ on the states of P_H, we always encounter a state arbitrarily close to some element of E (See Eq. (24)). To this end, we introduce the notion of ε-synchronization of probabilistic automata (See Figure 5). Synchronization of an automaton is fixing or determining its current state. Not all PFSAs are synchronizable, but all are ε-synchronizable (5).

Definition 7 (ε-synchronizing Strings). A string x ∈ Σ⋆ is ε-synchronizing for a PFSA if:

∃ϑ ∈ E, ||℘_x − ϑ||_1 ≤ ε    (26)

We next introduce the notion of symbolic derivatives. Note that PFSA states are not observable; we observe symbols generated from hidden states. A symbolic derivative at a given string specifies the distribution of the next symbol over the alphabet.

Notation 4. We denote the set of probability distributions over a finite set of cardinality k as D(k).

Definition 8 (Symbolic Count Function). For a string s over Σ, the count function #^s : Σ⋆ → ℕ ∪ {0} counts the number of times a particular substring occurs in s. The count is overlapping, i.e., in the string s = 0001, the substring 00 occurs at both the first and second positions, implying #^s(00) = 2.

Definition 9 (Symbolic Derivative). For a string s generated by a QSP over Σ, the symbolic derivative ϕ^s : Σ⋆ → D(|Σ| − 1) is defined as:

[ϕ^s(x)]_i = #^s(xσ_i) / Σ_{σ_j ∈ Σ} #^s(xσ_j)    (27)

Thus, ∀x ∈ Σ⋆, ϕ^s(x) is a probability distribution over Σ, referred to as the symbolic derivative at x.

Note that ∀q_i ∈ Q, π̃ induces a probability distribution over Σ as [π̃(q_i, σ_1), …, π̃(q_i, σ_{|Σ|})]. We denote this as π̃(q_i, ·).
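Definitions 8 and 9 combine into a short routine: scan the stream for overlapping occurrences of the history x and tally the symbol that follows each occurrence. A sketch (the function name is ours):

```python
from collections import Counter

def symbolic_derivative(s, x, alphabet):
    """phi_s(x): empirical next-symbol distribution after history x,
    using overlapping occurrence counts (Definitions 8 and 9).
    Returns None when x never occurs with a following symbol."""
    counts = Counter()
    i = s.find(x)
    while i != -1 and i + len(x) < len(s):   # overlapping scan
        counts[s[i + len(x)]] += 1
        i = s.find(x, i + 1)
    total = sum(counts.values())
    return [counts[a] / total for a in alphabet] if total else None
```

For example, in s = 0001 the history '0' occurs three times with a successor ('0', '0', then '1'), matching the overlapping-count convention of Definition 8.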
We next show that the symbolic derivative at x can be used to estimate this distribution for q_i = [x], provided x is ε-synchronizing.

Proposition 1 (ε-Convergence). If x ∈ Σ⋆ is ε-synchronizing, then:

∀ε > 0, lim_{|s| → ∞} ||ϕ^s(x) − π̃([x], ·)||_1 ≤ ε a.s.    (28)

Proof: We use the Glivenko-Cantelli theorem (36) on uniform convergence of empirical distributions. Since x is ε-synchronizing:

∀ε > 0, ∃ϑ ∈ E, ||℘_x − ϑ||_1 ≤ ε    (29)

Recall that E = { e_i ∈ [0, 1]^{|Q|} : i = 1, …, |Q| } denotes the set of distributions over Q satisfying:

[e_i]_j = 1 if i = j, and 0 otherwise    (30)

Let x ε-synchronize to q ∈ Q. Thus, when we encounter x while reading s, we are guaranteed to be distributed over Q as ℘_x, where:

||℘_x − ϑ||_1 ≤ ε ⟹ ℘_x = αϑ + (1 − α)u    (31)

with α ∈ [0, 1], α ≥ 1 − ε, and u an unknown distribution over Q. Defining A_σ = α π̃(q, σ) + (1 − α) Σ_{j=1}^{|Q|} u_j π̃(q_j, σ), we note that ϕ^s(x) is an empirical distribution for A = [A_σ]_{σ ∈ Σ}, implying:

lim_{|s| → ∞} ||ϕ^s(x) − π̃(q, ·)||_1
  = lim_{|s| → ∞} ||ϕ^s(x) − A + A − π̃(q, ·)||_1
  ≤ lim_{|s| → ∞} ||ϕ^s(x) − A||_1 + lim_{|s| → ∞} ||A − π̃(q, ·)||_1
  ≤ (1 − α) || π̃(q, ·) − Σ_j u_j π̃(q_j, ·) ||_1 a.s.
  ≤ ε a.s.

where the first term vanishes almost surely by Glivenko-Cantelli. This completes the proof.

The notion of canonical representations, along with that of symbolic derivatives, will be used to establish the correctness of the stream operations in Section IX. Note that the canonical representation is free from the notion of initial states; intuitively, this translates to our ability to carry out the stream operations (Table I) without knowledge of the initial states of the hidden models. The notion of symbolic derivatives, along with Proposition 1, establishes that if the derivatives computed from two sufficiently long observed sequences s_1, s_2 match up closely, then the underlying generative PFSAs are also close.
The detailed formulation in (5) proves that we can conclude that the distance between these underlying models is small with high probability (in the PAC sense). We also briefly describe the concept of a metric on the space of probabilistic automata established in (5).

Proposition 2 (Metric For Probabilistic Automata). For two strongly connected PFSAs G_1, G_2, generating streams s_1, s_2, denote the symbolic derivatives at x ∈ Σ⋆ as ϕ^{s_1}(x) and ϕ^{s_2}(x) respectively. Then

Θ(G_1, G_2) = (|Σ| − 1)/|Σ| · lim_{|s_1| → ∞, |s_2| → ∞} Σ_{x ∈ Σ⋆} ( ||ϕ^{s_1}(x) − ϕ^{s_2}(x)||_1 / |Σ|^{2|x|} )

defines a metric on the space of probabilistic automata on Σ.

Proof: The above metric is slightly different from the one introduced in (5); however, the proof of the metric properties follows almost identically.

The following result is immediate, and justifies the expression given in Table I (Row 4).

Corollary 1 (For Proposition 2). For any two PFSAs G_1, G_2:

0 ≤ Θ(G_1, G_2) ≤ 1    (32)

Proof: The lower bound is immediate by setting G_1 = G_2. For the upper bound, we note:

Θ(G_1, G_2) = (|Σ| − 1)/|Σ| · lim Σ_{x ∈ Σ⋆} ( ||ϕ^{s_1}(x) − ϕ^{s_2}(x)||_1 / |Σ|^{2|x|} )
  ≤ (|Σ| − 1)/|Σ| · Σ_{x ∈ Σ⋆} ( max_x ||ϕ^{s_1}(x) − ϕ^{s_2}(x)||_1 / |Σ|^{2|x|} )
  = (|Σ| − 1)/|Σ| · Σ_{x ∈ Σ⋆} 1/|Σ|^{2|x|}
  = (|Σ| − 1)/|Σ| · Σ_{k=0}^{∞} |Σ|^k / |Σ|^{2k}
  = (|Σ| − 1)/|Σ| · Σ_{k=0}^{∞} 1/|Σ|^k

where the last two steps follow from the fact that there are |Σ|^{|x|} strings of length |x|, which allows us to replace the sum over x ∈ Σ⋆ by a sum over k = |x|. Finally, noting that (|Σ| − 1)/|Σ| · Σ_{k=0}^{∞} 1/|Σ|^k = 1 completes the proof.

Next, we elucidate the relevant algebraic structures on the space of PFSA.
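The weighted sum defining Θ is absolutely convergent, so a truncation at a maximum history length gives a practical estimate. A sketch, assuming the derivative estimators are supplied as callables that return next-symbol distributions (or None for unseen histories); the truncation depth and names are our own:

```python
from itertools import product

def pfsa_metric(phi1, phi2, alphabet, max_len=6):
    """Truncated estimate of the metric in Proposition 2:
    Theta = (|S|-1)/|S| * sum_x ||phi1(x) - phi2(x)||_1 / |S|^(2|x|),
    summing over histories x of length at most max_len."""
    k = len(alphabet)
    total = 0.0
    for L in range(max_len + 1):
        for x in map(''.join, product(alphabet, repeat=L)):
            d1, d2 = phi1(x), phi2(x)
            if d1 is None or d2 is None:   # skip unobserved histories
                continue
            l1 = sum(abs(a - b) for a, b in zip(d1, d2))
            total += l1 / k ** (2 * L)
    return (k - 1) / k * total
```

With flat-white-noise derivatives against a fully deterministic generator on a binary alphabet, the truncated value approaches but never exceeds the unit bound of Corollary 1.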
VIII. Algebraic Structures On PFSA Space

The material presented in this section is reproduced from the first author's previous work (8), and is included here for the sake of completeness. The formulation in Section VII indicates that a symbolic dynamical process has a probabilistic finite state description if and only if the corresponding Nerode equivalence has a finite index.

Definition 10 (Space of PFSA). The space of all PFSA over a given symbol alphabet Σ is denoted by A, and the space of all probability measures p inducing a finite-index probabilistic Nerode equivalence on the corresponding measure space (Σ^ω, B_Σ, p) is denoted by P.

As expected, there is a close relationship between A and P, which is made explicit in the sequel.

Definition 11 (PFSA Map H). Let p ∈ P and G = (Q, Σ, δ, q_0, π̃) ∈ A. The map H : A → P is defined as H(G) = p such that the following condition is satisfied:

∀x = σ_1 ⋯ σ_r ∈ Σ⋆,    (33)
p(x) = π̃(q_0, σ_1) ∏_{k=1}^{r−1} π̃(δ⋆(q_0, σ_1 ⋯ σ_k), σ_{k+1})    (34)

where r ∈ ℕ, the set of positive integers.

Definition 12 (Right Inverse H^{−1}). The right inverse of the map H is denoted by H^{−1} : P → A such that

∀p ∈ P, H(H^{−1}(p)) = p    (35)

An explicit construction of H^{−1} is reported in (9); we only require that such a map exists.

Definition 13 (Perfect Encoding). Given an alphabet Σ, a PFSA G = (Q, Σ, δ, q_0, π̃) is said to be a perfect encoding of the measure space (Σ^ω, B_Σ, p) if p = H(G).

There are possibly many PFSA realizations that encode the same probability measure on B_Σ, due to the existence of non-minimal realizations and state relabeling; neither affects the underlying encoded measure. From this perspective, a notion of PFSA equivalence is introduced as follows:

Definition 14 (PFSA Equivalence). Two PFSA G_1 and G_2 are defined to be equivalent if H(G_1) = H(G_2). In this case, we say G_1 = G_2.
In the sequel, a PFSA G stands for the equivalence class of G, i.e., { P ∈ A : H(P) = H(G) }.

Definition 15 (Structural Equivalence). Two PFSA G_i = (Q_i, Σ, δ_i, q_0^i, π̃_i) ∈ A, i = 1, 2, are defined to have equivalent (or identical) structure if Q_1 = Q_2, q_0^1 = q_0^2, and δ_1(q, σ) = δ_2(q, σ) for all q ∈ Q_1 and all σ ∈ Σ.

Definition 16 (Synchronous Composition of PFSA). The binary operation of synchronous composition of two PFSA G_i = (Q_i, Σ, δ_i, q_0^{(i)}, π̃_i) ∈ A, i = 1, 2, denoted ⊗ : A × A → A, is defined as

G_1 ⊗ G_2 = ( Q_1 × Q_2, Σ, δ′, (q_0^{(1)}, q_0^{(2)}), π̃′ )    (36)

where δ′ and π̃′ are computed as follows: ∀q_i ∈ Q_1, q_j ∈ Q_2, σ ∈ Σ,

δ′((q_i, q_j), σ) = (δ_1(q_i, σ), δ_2(q_j, σ))
π̃′((q_i, q_j), σ) = π̃_1(q_i, σ)    (37)

In general, synchronous composition is non-commutative.

Proposition 3 (Synchronous Composition of PFSA). Let G_1, G_2 ∈ A. Then H(G_1) = H(G_1 ⊗ G_2), and therefore G_1 = G_1 ⊗ G_2 in the sense of Definition 14.

Proof: See Theorem 4.5 in (9).

Synchronous composition of PFSA allows transformation of PFSA with disparate structures into non-minimal descriptions that have the same underlying graphs. This assertion is crucial for the development in the sequel, since any binary operation defined for two PFSA with identical structure can be extended to the general case on account of Definition 16 and Proposition 3. Next we show that a restricted PFSA subspace can be assigned the algebraic structure of an Abelian group. We first construct the Abelian group on a subspace of probability measures, and then induce the group structure on this subspace of PFSA via the isomorphism between the two spaces.

Definition 17 (Restricted PFSA Space). Let A⁺ = { G = (Q, Σ, δ, q_0, π̃) : π̃(q, σ) > 0, ∀q ∈ Q, ∀σ ∈ Σ }, which is a proper subset of A.
It follows that the transition map of any PFSA in the subset A+ is a total function. We restrict the map H : A → P to the smaller domain A+, that is, H+ : A+ → P+, i.e., H+ = H|A+.

Definition 18 (Restricted Probability Measure Space). Let P+ ≜ {p ∈ P : p(x) ≠ 0 ∀x ∈ Σ*}, which is a proper subset of P. Each element of P+ is a probability measure that assigns a non-zero probability to every string. Similar to Definition 17, we restrict H⁻¹ to P+, i.e., H+⁻¹ = H⁻¹|P+.

Since we do not distinguish PFSA in the same equivalence class (see Definition 14), we have the following result.

Proposition 4 (Isomorphism of H+). The map H+ is an isomorphism between the spaces A+ and P+, and its inverse is H+⁻¹.

Proof: Immediate from the preceding discussion.

Definition 19 (Abelian Operation on P+). The addition operation ⊕ : P+ × P+ → P+ is defined by p3 ≜ p1 ⊕ p2, ∀p1, p2 ∈ P+, such that:
  1) p3(λ) = 1
  2) ∀x ∈ Σ* and σ ∈ Σ,
       p3(xσ)/p3(x) = p1(xσ)p2(xσ) / Σ_{σ'∈Σ} p1(xσ')p2(xσ')

p3 is a well-defined probability measure, since ∀x ∈ Σ*:

  Σ_{σ∈Σ} p3(xσ) = Σ_{σ∈Σ} [p1(xσ)p2(xσ) / Σ_{σ'∈Σ} p1(xσ')p2(xσ')] p3(x) = p3(x)   (38)

Proposition 5 (Abelian Group on P+). The algebra (P+, ⊕) forms an Abelian group.

Proof: Closure and commutativity of (P+, ⊕) are obvious. Associativity, existence of identity, and existence of inverse elements are established next.

(1) Associativity, i.e., (p1 ⊕ p2) ⊕ p3 = p1 ⊕ (p2 ⊕ p3). Now, ∀x ∈ Σ*, σ ∈ Σ, we have:

  ((p1 ⊕ p2) ⊕ p3)(xσ) / ((p1 ⊕ p2) ⊕ p3)(x)
    = (p1 ⊕ p2)(xσ)p3(xσ) / Σ_{σ'∈Σ} (p1 ⊕ p2)(xσ')p3(xσ')
    = p1(xσ)(p2 ⊕ p3)(xσ) / Σ_{σ'∈Σ} p1(xσ')(p2 ⊕ p3)(xσ')
    = (p1 ⊕ (p2 ⊕ p3))(xσ) / (p1 ⊕ (p2 ⊕ p3))(x)   (39)

(2) Existence of identity: Let us introduce a probability measure i_Σ on symbol strings such that ∀x ∈ Σ*
:

  i_Σ(x) = (1/|Σ|)^{|x|}   (40)

where |x| denotes the length of the string x. Then ∀σ ∈ Σ, i_Σ(xσ)/i_Σ(x) = 1/|Σ|. For a measure p ∈ P+ and ∀σ ∈ Σ,

  (p ⊕ i_Σ)(xσ)/(p ⊕ i_Σ)(x) = p(xσ)i_Σ(xσ) / Σ_{σ'∈Σ} p(xσ')i_Σ(xσ') = p(xσ)/p(x)

This implies that p ⊕ i_Σ = i_Σ ⊕ p = p, by Definition 19 and by commutativity. Therefore i_Σ is the identity of the monoid (P+, ⊕).

(3) Existence of inverse: ∀p ∈ P+, ∀x ∈ Σ*, and ∀σ ∈ Σ, let −p be defined by the following relations:

  (−p)(λ) = 1   (41)
  (−p)(xσ)/(−p)(x) = p⁻¹(xσ) / Σ_{σ'∈Σ} p⁻¹(xσ')   (42)

[Figure: panel (a) shows summing of arbitrary PFSA models via non-minimal realizations — machines G and H, their synchronous compositions G ⊗ H and H ⊗ G, and the sum G + H; panel (b) shows the annihilation identity for a PFSA on a binary alphabet, where a machine and its inverse sum to flat white noise.] Fig. 6.
Addition of arbitrary PFSAs with the same alphabet, using non-minimal realizations to equate structures (via synchronous composition).

[Figure: single-state zero PFSAs — (a) binary alphabet, each symbol emitted with probability 0.5; (b) trinary alphabet, each symbol emitted with probability 1/3.] Fig. 7. Zero PFSAs for different alphabet sizes.

Then, we have:

  (p ⊕ (−p))(xσ)/(p ⊕ (−p))(x) = p(xσ)(−p)(xσ) / Σ_{σ'∈Σ} p(xσ')(−p)(xσ') = 1/|Σ|   (43)

This gives p ⊕ (−p) = i_Σ, which completes the proof.

We denote the zero element i_Σ of the Abelian group (P+, ⊕) as flat white noise (FWN).

A. Explicit Computation of the Abelian Operation ⊕

The isomorphism between P+ and A+ (see Proposition 4) induces the following Abelian operation on A+.

Definition 20 (Addition Operation on PFSA). Given any G1, G2 ∈ A+, the addition operation + : A+ × A+ → A+ is defined as:

  G1 + G2 = H+⁻¹( H+(G1) ⊕ H+(G2) )

If the summand PFSA have identical structure (i.e., their underlying graphs are identical), then the explicit computation of this sum is stated as follows.

Proposition 6 (PFSA Addition). If two PFSA G1, G2 ∈ A+ are of the same structure, i.e., Gi = (Q, Σ, δ, q0, π̃i), i = 1, 2, then G1 + G2 = (Q, Σ, δ, q0, π̃), where:

  π̃(q, σ) = π̃1(q, σ)π̃2(q, σ) / Σ_{σ'∈Σ} π̃1(q, σ')π̃2(q, σ')   (44)

Proof: Let pi = H+(Gi), i = 1, 2. Since G1, G2 have the same structure, we have from Eq. (34): ∀σ ∈ Σ, ∀x s.t. δ*(q0, x) = q ∈ Q,

  pi(xσ)/pi(x) = π̃i(δ*(q0, x), σ) = π̃i(q, σ)   (45)

Now, by Definition 19 and Definition 11,

  π̃(q, σ) = (p1 ⊕ p2)(xσ)/(p1 ⊕ p2)(x)
           = p1(xσ)p2(xσ) / Σ_{σ'∈Σ} p1(xσ')p2(xσ')
           = [p1(xσ)/p1(x)][p2(xσ)/p2(x)] / Σ_{σ'∈Σ} [p1(xσ')/p1(x)][p2(xσ')/p2(x)]
           = π̃1(q, σ)π̃2(q, σ) / Σ_{σ'∈Σ} π̃1(q, σ')π̃2(q, σ')

The extension to the general case is achieved by using synchronous composition of probabilistic machines.

Proposition 7 (PFSA Addition, General Case). Given two PFSA G1, G2 ∈ A+, the sum G1 + G2 is computed via Proposition 6 and Definition 16 as follows:

  G1 + G2 = (G1 ⊗ G2) + (G2 ⊗ G1)   (46)

Proof: Noting that G1 ⊗ G2 and G2 ⊗ G1 have the same structure up to state relabeling, it follows from Proposition 3 that:

  H+(G1 + G2) = H+(G1) ⊕ H+(G2)   (see Definition 20)
              = H+(G1 ⊗ G2) ⊕ H+(G2 ⊗ G1)
              = H+( (G1 ⊗ G2) + (G2 ⊗ G1) )

which completes the proof.

Example 1. Let G1 and G2 be two PFSA with identical structures, with probability morph matrices:

  π̃1 = [0.2 0.8; 0.4 0.6]   and   π̃2 = [0.1 0.9; 0.6 0.4]   (47)

Then the π̃-matrix for the sum G1 + G2, denoted π̃12, is obtained by elementwise multiplication followed by row normalization:

  π̃12 = [0.1×0.2  0.9×0.8; 0.6×0.4  0.4×0.6] → (normalize rows) → [0.027 0.973; 0.5 0.5]

IX. Correctness of Stream Operations

In this section we prove that the stream operations described in Table I of the main text are indeed correct.

A. Independent Stream Copy

We show that the "Independent Stream Copy" operation produces an independent realization from a pseudo-copy of the PFSA model generating the input stream. First, we formalize the notion of pseudo-copies.

Definition 21.
Given a PFSA G = (Q, Σ, δ, π̃) in the canonical representation, a pseudo-copy is a canonical PFSA ℘_γ(G) = (Q, Σ, δ, ℘_γ(π̃)), where:

  ℘_γ(Π) = γΠ[I − (1−γ)Π]⁻¹   (48)

for some scalar γ ∈ (0, 1). We note that while the row-stochastic matrix Π may not be invertible, and [I − Π] is definitely singular (since Π has an eigenvalue at 1), the matrix γΠ[I − (1−γ)Π]⁻¹ is always well-defined for γ ∈ (0, 1), and additionally is a non-negative row-stochastic matrix (37).

We use the following notation:

Notation 5. For a given string s, the underlying PFSA generator is denoted G_s, and for a given PFSA G, G → s denotes a realization s generated by G. Note that G_s automatically implies that we are referring to the PFSA generator in the limit |s| → ∞, since one cannot have a unique generator for bounded strings.

Proposition 8 (Independent Stream Copy). Given a symbol stream s with a hidden PFSA generator G, let stream s' be generated via:

  1. Generate stream ω' from FWN
  2. Read current symbol σ from s, and σ' from ω'
  3. If σ = σ', then write σ to output s'
  4. Move read positions one step to the right, and go to step 1

Then, we have:

  1) Well-defined convergence of the underlying models:

       lim_{|s'|→∞} G'_{s'} = lim_{|s|→∞} ℘_{1/|Σ|}(G_s)   (49)

  2) If s', s'' are generated by the above algorithm from the same input stream s, then, in the limit of infinite length, s' and s'' are independent realizations of ℘_{1/|Σ|}(G_s).

  3) If s'_1, s'_2 are generated by the algorithm for input streams s1, s2 respectively, then:

       ∀ε > 0, Θ(G_{s1}, G_{s2}) ≤ ε ⇒ Θ(G'_{s'_1}, G'_{s'_2}) ≤ |Σ|ε   (50)
       ∀ε > 0, Θ(G_{s1}, G_{s2}) ≥ ε ⇒ Θ(G'_{s'_1}, G'_{s'_2}) ≥ [|Σ|/(2|Σ|−1)²]ε   (51)

Proof: (1) Let α be the probability that the first symbol in the input stream s is recorded in the output. Since the stream ω'
is FWN, we conclude that α = 1/|Σ|, and that α is also the constant probability that any symbol in s is recorded. Thus, assuming that the symbolic derivatives computed are exact (i.e., the input stream is infinite), the transition matrix M of a realization of the PFSA G'_{s'} can be expressed as a function of the transition matrix Π of G_s as:

  M = α[Π + (1−α)Π² + (1−α)²Π³ + ⋯]   (52)
  ⇒ M = αΠ[I − (1−α)Π]⁻¹ = ℘_α(Π)   (53)

Since the transformation [I − (1−α)Π]⁻¹ is invertible, the rank of M is the same as that of Π, which implies that states of G_s cannot collapse when we pass to ℘_α(G); this in turn implies that G'_{s'} has the same minimal structure as G_s, which establishes claim (1).

(2) Claim (1) implies:

  lim_{|s'|→∞} G'_{s'} = lim_{|s''|→∞} G''_{s''}   (54)

The independence claim then follows immediately from noting that random erasure, as executed by the stated algorithm, eliminates any possibility of synchronization between the states of the same underlying model G'_{s'} in the limit of infinite string lengths.

(3) Consider the PFSAs G_{s1}, G_{s2} as |s1|, |s2| → ∞. Let us bring them to the same structure, via the transformations (G_{s1}) ⊗ (G_{s2}) and (G_{s2}) ⊗ (G_{s1}) respectively (see [REF]). Let us denote the transition matrices of the PFSAs in their transformed representations as Π1, Π2 respectively. Then, denoting Δ = Π1 − Π2 and Δ' = ℘_γ(Π1) − ℘_γ(Π2), we claim:

  [γ/(2−γ)²] ||Δ||∞ ≤ ||Δ'||∞ ≤ (1/γ) ||Δ||∞   (55)

To establish this claim, we first note that for any stochastic matrix A, we have:

  γ[I − (1−γ)A]⁻¹A = [1/(1−γ)] ( γ[I − (1−γ)A]⁻¹ − γI )   (56)

which implies the upper bound in Eq.
(55) as follows:

  (1−γ)(℘_γ(Π1) − ℘_γ(Π2)) = γ[I − (1−γ)Π1]⁻¹ − γ[I − (1−γ)Π2]⁻¹
                            = γ(1−γ)[I − (1−γ)Π2]⁻¹ (Π1 − Π2) [I − (1−γ)Π1]⁻¹

This implies:

  Δ' = γ[I − (1−γ)Π2]⁻¹ Δ [I − (1−γ)Π1]⁻¹
  ⇒ ||Δ'||∞ ≤ γ ||[I − (1−γ)Π2]⁻¹||∞ ||[I − (1−γ)Π1]⁻¹||∞ ||Δ||∞
  ⇒ ||Δ'||∞ ≤ γ (1/γ)(1/γ) ||Δ||∞ = (1/γ) ||Δ||∞

And the lower bound follows from noting:

  Δ' = γ[I − (1−γ)Π2]⁻¹ Δ [I − (1−γ)Π1]⁻¹
  ⇒ γΔ = [I − (1−γ)Π2] Δ' [I − (1−γ)Π1]
  ⇒ γ ||Δ||∞ ≤ ||[I − (1−γ)Π2]||∞ ||[I − (1−γ)Π1]||∞ ||Δ'||∞
  ⇒ ||Δ'||∞ ≥ γ / ( ||[I − (1−γ)Π2]||∞ ||[I − (1−γ)Π1]||∞ ) ||Δ||∞ ≥ [γ/(2−γ)²] ||Δ||∞   (57)

Next, we compute bounds on the probability morph matrices. We denote the morph matrices of the relevant PFSAs as follows (PFSAs on the left, morph matrices on the right):

  (G_{s1}) ⊗ (G_{s2})         π̃1
  (G_{s2}) ⊗ (G_{s1})         π̃2
  ℘_γ((G_{s1}) ⊗ (G_{s2}))    ℘_γ(π̃1)
  ℘_γ((G_{s2}) ⊗ (G_{s1}))    ℘_γ(π̃2)

Additionally, we use the notation:

  Δ_π̃ = π̃1 − π̃2   (58)
  Δ'_π̃ = ℘_γ(π̃1) − ℘_γ(π̃2)   (59)

Without loss of generality, we assume that for each PFSA, given a state, each symbol leads to a distinct state. This can be arranged via state splitting if necessary, and implies:

  ||Δ||∞ = ||π̃1 − π̃2||∞   (60)
  ||Δ'||∞ = ||℘_γ(π̃1) − ℘_γ(π̃2)||∞   (61)

which therefore leads to the bounds:

  [γ/(2−γ)²] ||π̃1 − π̃2||∞ ≤ ||℘_γ(π̃1) − ℘_γ(π̃2)||∞ ≤ (1/γ) ||π̃1 − π̃2||∞   (62)

Recall the definition of the PFSA metric (see Proposition 2):

  Θ(G_{s1}, G_{s2}) = [(|Σ|−1)/|Σ|] lim_{|s1|,|s2|→∞} Σ_{x∈Σ*} ( ||φ^{s1}_{G1}(x) − φ^{s2}_{G2}(x)||∞ / |Σ|^{2|x|} )   (63)

and note that:

  Θ(G_{s1}, G_{s2}) ≤ ε ⇒ ∀x ∈ Σ*, ||φ^{s1}_{G1}(x) − φ^{s2}_{G2}(x)||∞ ≤ ε   (64)
  Θ(G_{s1}, G_{s2}) ≥ ε ⇒ ∀x ∈ Σ*
, ||φ^{s1}_{G1}(x) − φ^{s2}_{G2}(x)||∞ ≥ ε   (65)

Since the bounds established in Eq. (62) apply to any non-minimal realization of the PFSAs (G_{s1}) ⊗ (G_{s2}) and (G_{s2}) ⊗ (G_{s1}), considering the full Σ-ary tree as the limiting "unfolded" realization, we conclude from Eq. (62) that:

  Θ(G_{s1}, G_{s2}) ≤ ε ⇒ Θ(G'_{s'_1}, G'_{s'_2}) ≤ (1/γ)ε   (66)

and also:

  Θ(G_{s1}, G_{s2}) ≥ ε ⇒ Θ(G'_{s'_1}, G'_{s'_2}) ≥ [γ/(2−γ)²]ε   (67)

The desired bounds then follow from noting that in the stated algorithm we have γ = 1/|Σ|. This completes the proof.

Remark 1. Proposition 8 establishes that the stream s' obtained from an input stream s may not be a realization from the hidden generator of s, but is a realization from a PFSA which is a pseudo-copy of that generator. Also, note that it is not true in general that a pseudo-copy is close to the original PFSA in the sense of our metric. Nevertheless, Proposition 8 shows that if the distance between two PFSAs is small, then so is the distance between their pseudo-copies; and if the distance between two PFSAs is large, then so is the distance between their pseudo-copies. Thus, if we determine the distance between pseudo-copies, we have a good estimate of the distance between the original machines.

B. Stream Summation

Proposition 9 (Stream Summation). Given symbol streams s1, s2 with hidden PFSA generators G1, G2, let stream s' be generated via:

  1. Read current symbols σi from si (i = 1, 2)
  2. If σ1 = σ2, then write σ1 to output s'
  3. Move read positions one step to the right, and go to step 1

Then, denoting the FWN generator as W, we have:

  1) If (G_{s1}) + (G_{s2}) has a single state in its minimal realization, then:

       G'_{s'} = (G_{s1}) + (G_{s2})   (68)

     i.e., s' is an exact realization of (G_{s1}) + (G_{s2}) in the limit |s1|, |s2| → ∞.
  2) If (G_{s1}) + (G_{s2}) has N > 1 states in its minimal realization, we have the lower bound:

       Θ((G_{s1}) + (G_{s2}), W) ≥ ε ⇒ Θ(G'_{s'}, W) ≥ ε / (|Σ|^N + |Σ|^{N−1})   (69)

  3) We have the upper bound:

       Θ((G_{s1}) + (G_{s2}), W) ≤ ε ⇒ Θ(G'_{s'}, W) ≤ ε   (70)

Proof: (1) If (G_{s1}) + (G_{s2}) has a single causal state:

  ∀x ∈ Σ*, lim_{|s'|→∞} φ^{s'}(x) = v, where Σ_i v_i = 1, v_i ≥ 0   (71)

We assume without loss of generality that G_{s1}, G_{s2} are in their canonical representations (see Definition 6). Thus, the streams s1, s2 can be assumed to start at states of the canonical representation which map to the corresponding stationary distributions over the states of the corresponding initial-marked PFSAs (see Definition 6, and the discussion immediately after). Also, note that we can delete arbitrary prefixes from s1 and s2, and still assume that they start at these states. Thus, we delete prefixes of s1, s2 up to the point where the next symbols match, and we see our first output symbol. Since we see a symbol in the output if there is a match in both s1 and s2, it follows that the probability of seeing the first symbol in s' as σi is given by:

  lim_{|s1|,|s2|→∞} φ^{s1}(λ)|_i φ^{s2}(λ)|_i / Σ_i φ^{s1}(λ)|_i φ^{s2}(λ)|_i = v_i   (72)

Also, since G_{s1} and G_{s2} can be assumed to have the same graph without loss of generality (by considering non-minimal realizations if necessary), it follows that the hidden states q, q' reached in s1, s2 after seeing the first output symbol are still synchronized. Thus, we conclude that if the first observed symbol in s' is σ, then the probability that the next symbol is σj is also given by:

  lim_{|s1|,|s2|→∞} φ^{s1}(σ)|_j φ^{s2}(σ)|_j / Σ_i φ^{s1}(σ)|_i φ^{s2}(σ)|_i = v_j   (73)

It follows from straightforward induction that at any point in s', the distribution of the next symbol is given by v; i.e., in the limit |s1| →
∞, |s2| → ∞, s' is an exact realization from the sum (G_{s1}) + (G_{s2}) (if the latter has a single causal state).

(2) Let H = (G_{s1}) + (G_{s2}) and assume Θ(H, W) ≥ ε. Also, let the set of states in the minimal realization of the canonical representation of H be Q, with Card(Q) = N > 1. It follows from the definition of our metric (and the fact that ε'-synchronizing strings must occur for all ε' > 0) that:

  ∀q ∈ Q_H, ||π̃(q, ·) − U_Σ||∞ ≥ ε   (74)

where, as before, U_Σ = (1/|Σ|, …, 1/|Σ|).

We observe that the stream summation algorithm can be thought of as producing the output sequence by traversing the arcs in the canonical representation of H, augmented with "jumps" (see Fig. 8): from each state there are unreported and unlabeled transitions back to the state corresponding to the equivalence class [λ] — whenever we have a mismatch, we jump back to [λ]. The probabilities of these back transitions can be easily computed, but are not important here. However, this implies that if q' is a state in the canonical representation of G'_{s'}, then (with π̃' the morph function of G'_{s'}):

  π̃'(q', ·) = p(q') π̃([λ], ·) + (1 − p(q')) π̃(q, ·), for some q ∈ Q, where p(q') ∈ [0, 1]   (75)

Since π̃'(q', ·) is a weighted average of π̃([λ], ·) and π̃(q, ·) (both of which satisfy Eq. (74)), it is possible that:

  ||π̃'(q', ·) − U_Σ||∞ ≤ ε   (76)

Assume, if possible, that (with Q' the state set of the minimal realization of the canonical representation of G'_{s'}):

  ∀q' ∈ Q', ||π̃'(q', ·) − U_Σ||∞ ≤ ε   (77)

We note that the same argument as in claim (1) implies that:

  π̃'([λ], ·) = π̃([λ], ·), with ||π̃([λ], ·) − U_Σ||∞ ≥ ε   (78)

where the inequality follows from Eq. (74). Since π̃'([λ], ·) is some weighted average of all the π̃'(q', ·) vectors, and by assumption Eq.
(77), the norm of the difference of each of these vectors from U_Σ is bounded above by ε, it follows that:

  ||π̃'([λ], ·) − U_Σ||∞ ≤ ε   (79)

which is a contradiction.

[Figure: illustration for Proposition 9 — panels (a) G_{s1} and (b) G_{s2}, and (c) the construct with jumps, in which mismatches trigger unlabeled back-transitions to the state [λ].] Fig. 8. Illustration for Proposition 9. Note that using the same structure for G_{s1} and G_{s2} causes no loss of generality, since arbitrary PFSAs over the same alphabet can be brought to the same structure via possibly non-minimal realizations.

Thus, there exists at least one state q⋆ ∈ Q' such that:

  ||π̃'(q⋆, ·) − U_Σ||∞ ≥ ε   (80)

We also note that G'_{s'} has at most N states, since, if none of the π̃' rows are equal, we can represent G'_{s'} using the same graph as H, with the rows of π̃ replaced by those of π̃'. Now we compute Θ(G'_{s'}, W). Since G'_{s'} has at most N states, lim_{|s'|→∞} φ^{s'}(x) equals π̃'(q⋆, ·) at least once every N levels, which implies:

  Θ(G'_{s'}, W) ≥ [(|Σ|−1)/|Σ|] Σ_{i=0}^∞ (1/|Σ|^{2i+N}) ε = ε / (|Σ|^N + |Σ|^{N−1})   (81)

(3) Assume Θ(H, W) ≤ ε. It follows from the definition of our metric (and the fact that ε'-synchronizing strings must occur for all ε' > 0) that:

  ∀q ∈ Q_H, ||π̃(q, ·) − U_Σ||∞ ≤ ε   (82)

where, as before, U_Σ = (1/|Σ|, …, 1/|Σ|).
Since (using the notation of claim (2) above) each π̃'(q', ·) is a weighted average of π̃([λ], ·) and some π̃(q, ·), it follows immediately that:

  ∀q' ∈ Q', ||π̃'(q', ·) − U_Σ||∞ ≤ ε   (83)

and also that the norm of the difference between any weighted average (with the weights positive and summing to unity) of the rows of the π̃ matrix and U_Σ is bounded above by ε. This implies that each term in Θ(G'_{s'}, W) is bounded above by ε, which establishes the desired bound:

  Θ(G'_{s'}, W) ≤ ε   (84)

This completes the proof.

Remark 2. Note that the lower bound established in claim (2) is obviously not tight: if N = 1 we have exact summation, whereas the bound is off by a factor of |Σ|.

Remark 3. Proposition 9 establishes that the stream summation algorithm works perfectly if the summands sum to a one-state machine, which includes FWN. For arbitrary inputs, the deviation of the realized sum from FWN is small if the deviation of the sum of the original models is small; conversely, the deviation of the realized sum from FWN is large if the deviation of the sum of the original models is large.

Corollary 2 (Contrapositives to Proposition 9). Using the notation of Proposition 9, we have:

  Lower bound: Θ(G'_{s'}, W) < ε ⇒ Θ((G_{s1}) + (G_{s2}), W) < (|Σ|^N + |Σ|^{N−1})ε   (85)
  Equality:    Θ(G'_{s'}, W) = 0 ⇒ Θ((G_{s1}) + (G_{s2}), W) = 0   (86)
  Upper bound: Θ(G'_{s'}, W) > ε ⇒ Θ((G_{s1}) + (G_{s2}), W) > ε   (87)

Proof: (Equality) Note that G'_{s'} = W, together with the fact that every state of G'_{s'} is a convex combination of [λ] and some state q ∈ Q_H, implies:

  ∀x ∈ Σ*, lim_{|s'|→∞} φ^{s'}(x) = p π̃([λ], ·) + (1−p) π̃(q, ·) = U_Σ   (88)

Also, since π̃'([λ], ·) = lim_{|s'|→∞} φ^{s'}(λ) for arbitrary input streams, it follows that:

  ∀q ∈ Q_H, p U_Σ + (1−p) π̃(q, ·) = U_Σ ⇒ ∀q ∈ Q_H, π̃(q, ·) = U_Σ   (89)

which establishes Eq. (86).
The other bounds follow by taking the contrapositives of the inequalities established in Proposition 9.

Remark 4. Corollary 2 implies that if the output sequence of the stream summation algorithm is FWN, then the summands are exact inverses of each other.

C. Stream Inversion

Lemma 6 (Stream Summation to FWN). Let streams si, i = 1, …, |Σ| be |Σ| independent realizations from a PFSA G defined over the alphabet Σ, and let s' be generated as follows:

  1. Read current symbols σi from si (i = 1, …, |Σ|)
  2. If σi ≠ σj for all distinct i, j, then write σ1 to output s'
  3. Move read positions one step to the right, and go to step 1

Then, we have:

  Θ(G'_{s'}, W) = 0   (90)

where W is the FWN generator for alphabet size |Σ|.

Proof: Let the set of states of the minimal realization of the canonical representation of G be Q, with morph function π̃. Similarly, let the state set of G'_{s'} be Q', with probability morph function π̃'. Let s'_i be the sequence obtained by copying the current symbol from the input stream si in step 2 of the above scheme (thus s'_1 = s'). It is obvious from the symmetry of the scheme that:

  ∀i, j ∈ {1, …, |Σ|}, G'_{s'_i} = G'_{s'_j}   (91)

It follows that, if f^j_i is the frequency of the j-th alphabet symbol in the stream s'_i, then:

  ∀i, lim_{|s'_i|→∞} f^j_i / |s'_i| = π̃'([λ], ·)|_j   (92)

  ⇒ lim_{|s'_i|→∞} Σ_i f^j_i / |s'_i| = lim_{|s'_i|→∞} |Σ| f^j_i / |s'_i| = |Σ| π̃'([λ], ·)|_j   (93)

where we also used the fact that the streams s'_i all have the same length.
Next, noting that we have an output symbol in each s'_i only when the new symbols are all distinct, we conclude:

  ∀j, k, Σ_i f^j_i = Σ_i f^k_i   (94)

(each step that produces output contributes exactly one occurrence of every alphabet symbol across the |Σ| streams s'_i), which in turn implies for the j-th and k-th alphabet symbols:

  |Σ| π̃'([λ], ·)|_j = lim Σ_i f^j_i / |s'_i| = lim Σ_i f^k_i / |s'_i| = |Σ| π̃'([λ], ·)|_k   (95)

and hence we conclude:

  π̃'([λ], ·) = lim_{|s'|→∞} φ^{s'}(λ) = U_Σ   (96)

Next, denote the r-th symbol in the stream s' as s'(r). If we assume that the streams si were all synchronized to the same state q of G just prior to the generation of s'(r), we have:

  ∀σi, σk ∈ Σ, Prob(s'(r) = σi) = c ∏_{σj∈Σ} π̃(q, σj) = Prob(s'(r) = σk)   (97)

which implies:

  ∀σk ∈ Σ, Prob(s'(r) = σk) = 1/|Σ|   (98)

Next, we consider the following construction: run the PFSA G with the streams si traversing the transitions via the symbol-labeled arcs, with each si initialized to the state [λ]. Note that we have a new symbol in the output s' if all current symbols in the |Σ| input streams are distinct, which can occur in two possible ways:

  1) all si streams are synchronized to some state q ∈ Q, and a distinct symbol is generated for each si;
  2) no such synchronization, but the symbols generated are distinct.

In the second case, we assume that a re-initialization occurs; i.e., all the streams jump back to state [λ] before the distinct symbols are generated causing the new output symbol. Note that this construction causes no loss of generality, as we are simply defining a path, with jumps, for the given input streams through the PFSA G. Denote by v^q the probability distribution of the output symbol when the streams are synchronized at some state q ∈ Q; i.e., v^q_i is the probability of seeing the i-th symbol, given that we indeed have a new output symbol.
Next, we observe that the streams si can be assumed to be synchronized at [λ] when the first symbol appears in the output s' (since deletion of arbitrary leading prefixes has no effect in the limit of infinite data). Thus, we have (from Eq. (96)):

  lim_{|s'|→∞} φ^{s'}(λ) = U_Σ   (99)

Note that the next symbol may be produced after a "silent" jump to some state q ∈ Q. Additionally, the probability that the jump occurs to a specific state q is an explicit function of the parameters (morph probabilities and transition structure) of the PFSA G. However, we do not need to compute these probabilities; we simply conclude that:

  ∀x ∈ Σ*, lim_{|s'|→∞} φ^{s'}(x) = Σ_{q∈Q} p(q, x) v^q   (100)
  where p(q, x) ∈ [0, 1], Σ_q p(q, x) = 1   (101)

Noting that Eq. (98) establishes that ∀q ∈ Q, v^q = U_Σ, we conclude:

  ∀x ∈ Σ*, lim_{|s'|→∞} φ^{s'}(x) = U_Σ   (102)

which establishes that G'_{s'} is the FWN generator. This completes the proof.

[Figure: semilog plot of the upper bound on annihilation efficiency, (|Σ|−1)!/|Σ|^|Σ|, versus alphabet size |Σ| = 2, …, 10; the bound falls from 0.25 at |Σ| = 2, to 0.074 at |Σ| = 3, to 0.023 at |Σ| = 4.] Fig. 9. Upper bound on annihilation efficiency (|Σ|−1)!/|Σ|^|Σ| vs alphabet size |Σ|. This illustrates why a fine quantization would necessitate large amounts of data to pass the self-annihilation test.

Proposition 10 (Stream Inversion). Given a stream s generated by some hidden PFSA G, let stream s' be generated via:

  1. Generate |Σ| − 1 independent copies of s: s1, …, s_{|Σ|−1}
  2. Read current symbols σi from si (i = 1, …, |Σ| − 1)
  3. If σi ≠ σj for all distinct i, j, then write Σ \ ∪_{i=1}^{|Σ|−1} {σi} to output s'
  4. Move read positions one step to the right, and go to step 1

Then, we have:

  Θ(−G, G'_{s'}) = 0   (103)

Proof: Follows immediately from Lemma 6.

Proposition 11 (Asymptotic Complexity).
The asymptotic time complexity of carrying out the stream operations is O(|s||Σ|).

Proof: The algorithmic steps in each of the operations of stream copy and stream summation proceed in a symbol-by-symbol fashion, with no memory of previous symbols, and each step involves a constant number of integer comparisons. Assuming that each new symbol from the FWN processes involved can be generated with constant complexity, we conclude that the asymptotic time complexity of both stream summation and stream copy is O(|s|). The stream inversion operation needs to generate |Σ| − 1 stream copies, implying that its asymptotic time complexity is O(|s||Σ|).

D. Annihilation Efficiency

To pass the self-annihilation test, a data stream must be sufficiently long; the required length |s| of the input s for a specified threshold ε⋆ is dictated by the characteristics of the generating process. Thus the rate of convergence of the self-annihilation error as a function of |s| quantifies the sample complexity of information annihilation. Let s' be obtained from s via stream inversion, and s'' be obtained via stream summation of s and s'. Then s'' is always a realization of the FWN process, which has a uniform probability of generating any symbol at any point. Thus, for any x ∈ Σ*, the vectors φ^{s''}(x) (see Table I, row 4) are empirical distributions which converge to the flat distribution as |s''| → ∞. Additionally, the Central Limit Theorem (CLT) (28) dictates that the convergence rate scales as 1/√|s''|, irrespective of the generating process for the input s. However, selective erasure in annihilation (see Table I) implies that |s''| < |s|, and the expected shortening ratio η = E(|s''|/|s|) does depend on the generating process. We refer to η as the annihilation efficiency, since the convergence rate of the self-annihilation error scales as 1/√(η|s|).
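The shortening ratio η can also be checked empirically. Below is a minimal binary-alphabet simulation of the copy, inversion, and summation operations of Table I (the list-based stream representation and helper names are assumptions of this sketch, not the paper's implementation). For an FWN input, the measured |s''|/|s| should approach (|Σ|−1)!/|Σ|^|Σ| = 0.25, the upper bound shown in Fig. 9.

```python
import random

random.seed(0)
SIGMA = ['0', '1']

def fwn(n):
    """A flat-white-noise stream of length n."""
    return [random.choice(SIGMA) for _ in range(n)]

def stream_copy(s):
    """Independent stream copy: keep a symbol iff it matches a fresh FWN symbol."""
    return [a for a in s if a == random.choice(SIGMA)]

def stream_invert(s):
    """Binary-alphabet stream inversion: one independent copy (|Sigma|-1 = 1),
    then emit the complementary symbol at each position."""
    return [('1' if a == '0' else '0') for a in stream_copy(s)]

def stream_sum(s, t):
    """Stream summation: advance both streams together, keep matching symbols."""
    return [a for a, b in zip(s, t) if a == b]

s = fwn(200_000)
s2 = stream_sum(s, stream_invert(s))   # self-annihilation: s + (-s)
ratio = len(s2) / len(s)               # empirical eta, ~0.25 for FWN input
```

The erasures compound: the copy retains roughly 1/|Σ| of the input, and the final summation erases mismatches again, which is why the efficiency decays so quickly with alphabet size.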
Next, we compute η in terms of the symbol frequencies:

Proposition 12. Given an input stream s, let stream s' be produced via stream inversion from s, and let s'' be produced via stream summation of s and s'. Let pi be the probability of observing symbol σi ∈ Σ. Then:

  η = E(|s''|/|s|) = (|Σ| − 1)! ∏_i pi   (104)

Proof: To generate s' from s, we first need to generate |Σ| − 1 independent stream copies of s. It is clear from the stated algorithm for independent stream copy (see Table I in the main text and Proposition 8) that the expected length of each of these copies is (1/|Σ|)|s|. The probability of obtaining a symbol in the output by comparing these |Σ| − 1 streams (to get s') is simply the probability of seeing a different symbol in each of the copied streams (as stated in the algorithm for stream inversion in Table I of the main text). Denoting this probability as β, we have:

  β = (|Σ| − 1)! Σ_{j=1}^{|Σ|} ∏_{i≠j} pi = (|Σ| − 1)! (∏_i pi)(Σ_i 1/pi) = (1/H̄) |Σ|! ∏_i pi   (105)

where H̄ is the harmonic mean of the probability vector p. Thus, since each copy is shorter than s by a factor of |Σ|, the expected length of s' is (β/|Σ|)|s|. The final step is stream summation of s and s' to obtain s''. We note that the probability of seeing symbol σi in the inverted stream s' is k/pi, where k = 1/Σ_i(1/pi) = H̄/|Σ|. It follows that stream summation of s and s' results in an expected length of (Σ_i pi k/pi)|s'| = H̄|s'|, which when combined with Eq. (105) completes the proof.

Corollary 3. The annihilation efficiency satisfies:

  (|Σ| − 1)! μ^{|Σ|−1}(1 − (|Σ| − 1)μ) ≤ η ≤ (|Σ| − 1)! / |Σ|^{|Σ|}   (106)

where μ is the probability of occurrence of the rarest symbol in the input stream s.

Proof: The upper bound follows from noting that the product ∏_i pi is maximized when pi = 1/|Σ| for all i.
The lower bound is obtained by assuming μ = min_i pi, upon which the minimum value of the product ∏_i pi is given by:

  μ^{|Σ|−1}(1 − (|Σ| − 1)μ)   (107)

Remark 5. The upper bound on the annihilation efficiency is realized if the input s is FWN.

E. Distance Between Hidden Generators

Definition 22 (FWN Deviation Estimators). For a string s, the complete white-noise deviation estimator ε(s) is defined as:

  ε(s) = [(|Σ|−1)/|Σ|] Σ_{x∈Σ*} (1/|Σ|^{2|x|}) ||φ^s(x) − U_Σ||∞   (108)

and the partial white-noise deviation estimator ε̂(s, ℓ) is defined as:

  ε̂(s, ℓ) = [(|Σ|−1)/|Σ|] Σ_{x∈Σ^{≤ℓ}} (1/|Σ|^{2|x|}) ||φ^s(x) − U_Σ||∞   (109)

which carries out the summation only over strings of length up to ℓ.

Proposition 13 (Causality Claim 1). Given a stream s, denoting the hidden generator of s as G_s and the zero model as W, we have:

  lim_{|s|→∞} |Θ(G_s, W) − ε(s)| = 0   (110)

Proof: For a string s' generated by the model G_s, we denote lim_{|s'|→∞} φ^{s'}(x) as φ⋆(x). Then:

  Θ(G_s, W) = [(|Σ|−1)/|Σ|] Σ_{x∈Σ*} (1/|Σ|^{2|x|}) ||φ⋆(x) − U_Σ||∞
            ≤ [(|Σ|−1)/|Σ|] Σ_{x∈Σ*} (1/|Σ|^{2|x|}) ||φ⋆(x) − φ^s(x)||∞
              + [(|Σ|−1)/|Σ|] Σ_{x∈Σ*} (1/|Σ|^{2|x|}) ||φ^s(x) − U_Σ||∞

[Figure: log–log plot of the self-annihilation error versus symbol-stream length for a two-state and a four-state binary PFSA; both curves decay as O(n^{−1/2}).] Fig. 10. Convergence rate of the self-annihilation error, shown to scale as O(1/√n) as dictated by the Central Limit Theorem.
The convergence rates do not depend directly on the descriptional complexity of the generating processes: note in Fig. 10 that the data from the two-state process has a slower convergence rate compared to that from the four-state process. As discussed in the section on computational complexity, the convergence rate of the self-annihilation error scales as $O(1/\sqrt{\Gamma n})$, where $\Gamma$ is the expected shortening of the input stream due to the selective erasure of symbols in the different steps of the annihilation process. We establish in Proposition 12 that if $p_i$ is the occurrence probability of the symbol $\sigma_i$ in the input stream, then $\Gamma = (|\Sigma|-1)!\prod_i p_i$.

Returning to the proof of Proposition 13: denoting the first summation above by $B$, we have $\theta(G(s), W) \le B + \epsilon(s)$. Starting with $\epsilon(s)$ on the RHS, we end up with:

$$\epsilon(s) \le \theta(G(s), W) + \frac{|\Sigma|-1}{|\Sigma|}\sum_{x\in\Sigma^\star}\frac{1}{|\Sigma|^{2|x|}}\big\|\phi^\star(x) - \phi_s(x)\big\|_1 \qquad (111)$$

which then implies:

$$\big|\theta(G_s, W) - \epsilon(s)\big| \le \frac{|\Sigma|-1}{|\Sigma|}\sum_{x\in\Sigma^\star}\frac{1}{|\Sigma|^{2|x|}}\big\|\phi^\star(x) - \phi_s(x)\big\|_1$$

We note that $\phi_s(x)$ is an empirical estimate of $\phi^\star(x)$, which then implies via the Glivenko-Cantelli theorem (36) that

$$\forall x \in \Sigma^\star,\quad \big\|\phi_s(x) - \phi^\star(x)\big\|_1 \xrightarrow{\;a.s.\;} 0 \qquad (112)$$

which completes the proof.

Finally, we establish our causality claim: while the deviation from FWN is estimated by the function $\hat{\epsilon}(s,\ell)$ from a finite observed string $s$ and consideration of finite histories of length bounded by $\ell$, it converges to the deviation of the underlying process from FWN in the limit of infinite data (see the next proposition). It thus follows that the distance calculated by annihilating a stream $s$ against a second stream $s'$ converges to the absolute deviation of the sum of $G_s$ and the inverse of $G_{s'}$ from the FWN generator $W$.

Proposition 14 (Causality Claim 2).
Given a stream $s$, and denoting the hidden generator for $s$ as $G_s$ and the zero model as $W$, we have:

$$\lim_{|s|\to\infty}\big|\theta(G_s, W) - \hat{\epsilon}(s,\ell)\big| \le \epsilon \qquad (113)$$

where the partial estimator is evaluated up to length $\ell = \frac{\ln(1/\epsilon)}{\ln|\Sigma|}$.

Proof: Noting that we have:

$$\big|\theta(G_s, W) - \hat{\epsilon}(s,\ell)\big| \le \big|\theta(G_s, W) - \epsilon(s)\big| + \left|\frac{|\Sigma|-1}{|\Sigma|}\sum_{x\in\Sigma^\star\setminus\Sigma^{\le\ell}}\frac{1}{|\Sigma|^{2|x|}}\big\|\phi_s(x) - U_\Sigma\big\|_1\right| \le \big|\theta(G_s, W) - \epsilon(s)\big| + \frac{1}{|\Sigma|^{\ell}}$$

the result follows using Proposition 13.

X. Quantization Techniques

Information annihilation operates on symbolic sequences. Thus, we need to specify a quantization scheme that maps possibly continuous-valued data streams to symbolic sequences. This is accomplished by the choice of a symbol alphabet, where each letter in the alphabet denotes a slice of the data range. Given a particular quantization scheme, we map each continuous-valued observation to the symbol representing the slice of the data range to which the observation belongs. Any chosen quantization scheme thus incurs error, which can be made small by using a fine quantization, i.e., a large alphabet. However, the length of the observed data limits the size of the alphabet that we can use. This is a direct consequence of the fact that the annihilation efficiency falls rapidly with the alphabet size (see Proposition 12 and Fig. 9). Thus, if $s$ is the input stream, $s'$ is obtained via stream inversion from $s$, and $s''$ is the output of the stream summation of $s$ and $s'$, then the expected ratio of the lengths $|s''|/|s|$ falls rapidly as the alphabet size $|\Sigma|$ is increased, making the estimation of the deviation of $s''$ from FWN more and more difficult. Since the convergence rate of the self-annihilation error scales as $1/\sqrt{\Gamma|s|}$, it follows that the self-annihilation error increases rapidly with finer quantization (see Fig. 11 for an illustration on the EEG dataset).
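The slicing described above can be sketched in a few lines: partition the observed range into equal-width slices and emit one alphabet letter per slice. The function name is ours, and digit symbols are used for illustration (so this sketch assumes an alphabet of at most ten letters):

```python
def quantize(stream, n_symbols):
    """Map a continuous-valued stream to a symbol string by slicing its
    observed range into n_symbols equal-width intervals."""
    lo, hi = min(stream), max(stream)
    width = (hi - lo) / n_symbols or 1.0   # guard against a constant stream
    out = []
    for v in stream:
        idx = min(int((v - lo) / width), n_symbols - 1)  # clamp the maximum value
        out.append(str(idx))
    return ''.join(out)

# a 4-letter alphabet over the range [-1, 1]
symbols = quantize([-1.0, -0.4, 0.1, 0.9, 1.0], 4)  # -> '01233'
```

Each observation is thus replaced by the index of the slice it falls in; the resulting string is what the annihilation machinery consumes.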
A. Desired Properties of Quantization Schemes

It follows that a good quantization scheme is defined by the following properties:

1) The frequency of the rarest symbol in the quantized data streams is not too small. This ensures that symbols are represented faithfully according to their generation probabilities from each state in the hidden model; too few occurrences of a particular symbol may reflect statistical fluctuations rather than the generation probabilities.

2) The average self-annihilation error for the observed data streams is small, i.e., if $\epsilon_{ii}$ is the self-annihilation error for the observed data stream $s_i$, then we require that $\frac{1}{T}\sum_{i=1}^{T}\epsilon_{ii}$ is small (where $T$ is the total number of observed data streams).

3) The average discrimination between data streams is high, i.e., if for two streams $s_i, s_j$ the similarity computed by information annihilation is $\epsilon_{ij}$, then we require that $\frac{1}{T(T-1)}\sum_{i=1}^{T}\sum_{j=1, j\neq i}^{T}\epsilon_{ij}$ is large (where $T$ is the total number of observed data streams).

One approach to choosing a quantization that satisfies the stated properties is the following. We restrict ourselves to maximum-entropy quantizations, i.e., schemes in which each symbol occurs with the same frequency in the data set. In Fig. 11, plates (a)-(c), we show three such maximum-entropy schemes for the EEG dataset. The alphabet size is increased from 2 to 4, and we choose the slices of the data range such that each slice contains an approximately equal number of data points. For example, in plate (c) of Fig. 11, each of the four slices contains approximately 25% of the total number of observations in the data set. Such maximum-entropy schemes guarantee that property 1 (see above) is satisfied. For the remaining properties, we plot the mean self-annihilation error and the mean discrimination for each alphabet size.
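An equal-frequency ("maximum-entropy") scheme, as described above, can be sketched by assigning symbols by rank rather than by value, so that each letter covers an approximately equal share of the observations. The helper name is ours:

```python
def max_entropy_quantize(stream, n_symbols):
    """Equal-frequency quantization: rank the observations and split the
    ranks into n_symbols equally populated groups, one letter per group."""
    ranked = sorted(range(len(stream)), key=lambda i: stream[i])
    out = [None] * len(stream)
    for rank, i in enumerate(ranked):
        out[i] = str(rank * n_symbols // len(stream))
    return ''.join(out)

# with 10 distinct values and a binary alphabet, each symbol covers 5 points
s = max_entropy_quantize([5, 1, 9, 3, 7, 2, 8, 4, 6, 0], 2)
```

By construction every symbol occurs with (nearly) the same frequency, so property 1 above is satisfied regardless of how skewed the raw data distribution is.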
As expected, we see that finer alphabets lead to higher average discrimination, while at the same time incurring higher average self-annihilation errors (see Fig. 11(d)). The ratio of the two quantities is more useful, and in Fig. 11(e) we note that the trinary maximum-entropy quantization minimizes this ratio, implying high discrimination and low self-annihilation error.

Fig. 11. Plates (a)-(c): Maximum-entropy quantization schemes for the EEG dataset with alphabet sizes 2, 3, and 4 respectively (1.5 s of a single time series shown for clarity). As explained in the text, "maximum-entropy" in this context refers to the fact that each data slice contains approximately an equal number of observations. Plate (d) shows how the average self-annihilation error, as well as the average discrimination between different data streams, increases exponentially with alphabet size. Plate (e) shows that the ratio of the average self-annihilation error to the average discrimination has a minimum at an alphabet size of 3.
Fig. 12. Two distinct PFSA models (alphabet $\Sigma = \{\sigma_0, \sigma_1\}$) and initial sections of generated sample streams ($\sigma_0$ shown as 0, and $\sigma_1$ shown as 1). Model A emits $\sigma_1$ with probability 0.3; Model B emits $\sigma_1$ with probability 0.1. While streams generated in independent runs of the same model have near-zero mutual information, they are correctly evaluated as having similar generators via data smashing (see Tables III and IV). Also, runs from the different models have near-zero mutual information, while smashing them correctly reveals a significant difference in the generators.

We note that if our chosen quantization is too coarse, then distinct processes may evaluate to be similar. However, too coarse an alphabet produces errors in only one direction; identical processes will still evaluate to be identical (or nearly so), provided the streams pass the self-annihilation test.

XI. Comparison Against Simple Statistical Approaches to Similarity

A. Smashing & Mutual Information

Smashing two finite quantized data streams manipulates the statistical information contained in them. Notions of information-theoretic interdependence of sequential data have been investigated extensively in the literature; one such concept is the mutual information between streams. For discrete random variables, mutual information quantifies the amount of information one random variable contains about another.
Formally, let $X, Y$ be discrete random variables with alphabets $\Sigma_X, \Sigma_Y$ and probability mass functions $p(x) = Prob\{X = x : x \in \Sigma_X\}$, $p(y) = Prob\{Y = y : y \in \Sigma_Y\}$. Also, considering $(X, Y)$ as a single vector-valued random variable, we have the mass function $p(x, y)$. Then the mutual information between the discrete random variables $X, Y$ is defined as:

$$I(X;Y) = \sum_{y\in\Sigma_Y}\sum_{x\in\Sigma_X} p(x,y)\log\left(\frac{p(x,y)}{p(x)\,p(y)}\right) \qquad (114)$$

Mutual information is related to the notion of entropy: the entropy of a random variable is a measure of the amount of information required on average to describe the random variable, while mutual information is the amount of information one variable contains about the other; or, more precisely, the degree to which the uncertainty in one can be reduced by knowing the other.

TABLE III. Distance matrix obtained by smashing streams from models A and B (note the clear clusters corresponding to runs from the same model)

Data Smashing | A1    | A2    | B1    | B2
A1            | 0.005 | 0.019 | 0.264 | 0.269
A2            | 0.021 | 0.006 | 0.246 | 0.253
B1            | 0.262 | 0.251 | 0.005 | 0.009
B2            | 0.264 | 0.254 | 0.011 | 0.006

TABLE IV. Pairwise mutual information of streams from models A and B (no indication of generative difference)

Mutual Inf. | A1         | A2         | B1         | B2
A1          | 0.89       | 0.00000476 | 0.00017155 | 0.00002713
A2          | 0.0000047  | 0.87       | 0.00001186 | 0.00003927
B1          | 0.00017155 | 0.00001186 | 0.48       | 0.00000996
B2          | 0.00002713 | 0.00003927 | 0.00000996 | 0.47
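Eq. (114) can be evaluated directly from paired samples of two symbol streams; a minimal sketch, using plug-in (empirical) probabilities:

```python
from math import log2
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y), Eq. (114), from two equal-length
    symbol sequences treated as paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))          # joint counts
    px, py = Counter(xs), Counter(ys)   # marginal counts
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

a = '0110100110110100' * 4   # a balanced binary stream
# a paired with itself: I(X;Y) equals the entropy H(X)
# a paired with an unrelated stream: I(X;Y) collapses toward zero
```

Note that this measures dependence between the paired samples, not similarity of their generators, which is exactly the distinction drawn below.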
Needless to say, if two data streams $X, Y$ are generated independently from the same underlying generator, then we have:

$$I(X;Y) = \sum_{y\in\Sigma_Y}\sum_{x\in\Sigma_X} p(x,y)\log\left(\frac{p(x,y)}{p(x)p(y)}\right) = \sum_{y\in\Sigma_Y}\sum_{x\in\Sigma_X} p(x)p(y)\log\left(\frac{p(x)p(y)}{p(x)p(y)}\right) = 0 \qquad (115)$$

Thus, sharing a common generative process does not imply high mutual information; and conversely, high mutual information is indicative of some sort of statistical synchronization between the generative processes, which may themselves be very different.

Thus, the concepts of mutual information and data smashing are "orthogonal" in the sense that while we measure statistical dependence to compute the former, the streams need to be statistically independent (or very nearly so) for the latter to work. Note that in the computation of the anti-stream, we generated streams that approximate independent copies of the input stream, which are then manipulated to yield the inverse. The algorithm requires this independence, in the absence of which Proposition 10 falls apart.

We can illustrate these points with a simple example (see Fig. 12). We consider two simple one-state PFSAs (A and B) with different event probabilities, and generate 10000-bit streams $A1, A2$ and $B1, B2$. Note that simply "running" a given PFSA twice, i.e., choosing a start state randomly and generating symbols in accordance with the event probabilities, implies that the generated streams are independent. We smash the streams $A1, A2, B1, B2$ against each other and compute the pairwise distance matrix shown in Table III. Note that streams $(A1, A2)$ annihilate nearly perfectly, as do the streams $(B1, B2)$, while streams $(A1, B1)$, $(A1, B2)$, $(A2, B1)$ and $(A2, B2)$ fail to do so.
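A one-state PFSA over a binary alphabet reduces to a biased coin flip, so the independent runs described above can be sketched as follows (the seeds and helper name are ours):

```python
import random

def run_pfsa(p1, n, seed):
    """Sample n symbols from a one-state PFSA that emits '1' with
    probability p1 and '0' otherwise; each seed gives an independent run."""
    rng = random.Random(seed)
    return ''.join('1' if rng.random() < p1 else '0' for _ in range(n))

a1, a2 = run_pfsa(0.3, 10000, 1), run_pfsa(0.3, 10000, 2)  # model A, two runs
b1, b2 = run_pfsa(0.1, 10000, 3), run_pfsa(0.1, 10000, 4)  # model B, two runs

freq = lambda s: s.count('1') / len(s)
```

Independent runs of the same model agree closely in their symbol statistics even though, bit for bit, they share essentially no mutual information; runs of different models disagree.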
This results in clearly clustered values in Table III, which correctly indicate that streams $(A1, A2)$ and $(B1, B2)$ have identical generators, which differ significantly from each other.

Pairwise computation of mutual information between the streams $A1, A2, B1, B2$ is not expected to reveal this generative structure. Since the streams are generated independently, the mutual information between any two distinct streams would be zero (or nearly so for finitely generated streams). This is illustrated in Table IV. Note that while the diagonal terms (which represent the self-information, or entropy) are high, all off-diagonal terms are very nearly zero, and no clusters are discernible. Thus, we can summarize:

- Mutual information measures the degree of statistical dependence between data streams; data smashing computes the distance between the generative processes, provided the data streams are independent or nearly so.
- We proved that maximizing the entropy of a single stream maximizes the annihilation efficiency (see Proposition 12 and its corollary).
- Thus, data smashing is conceptually orthogonal to the notion of mutual information.

Fig. 13. Model system. The Lotka-Volterra system of reactions is a stochastic model that captures a simple two-species predator-prey dynamical system. We parameterize the system with $r$ (the propensity of the predator-death reaction), and generate time series data using Gillespie's stochastic simulation algorithm. Plates (a)-(c) show prey numbers varying with time in three different runs, with $r = 0.209$, $0.257$, and $0.299$ respectively.

B. Smashing vs. Simple Statistical Measures

The pairwise distances computed via data smashing are clearly a function of the statistical information buried in the streams.
However, it might not be easy to find the right statistical tool to mine this information for a particular problem. In this section we provide an example of a dynamical system in which data smashing is able to recover meaningful nontrivial structure that is missed by simple statistical measures.

We consider the Lotka-Volterra system of stochastic "reactions", modeling a simple closed ecosystem of two species, one of which preys on the other. While deterministic differential equation models for this system exist (and are widely studied), a more realistic model is the following set of three simple reactions, primarily due to its ability to capture the stochastic component. The generally accepted method to solve such systems, producing time traces of the population numbers (see Fig. 13), is Gillespie's stochastic simulation algorithm. (Note: while the preceding theoretical development assumes ergodicity and stationarity, the theoretical considerations degrade gracefully as we deviate from these idealizations.) Our simple model is:

$$X \xrightarrow{\;1.0\;} 2X \qquad (R1)$$
$$X + Y \xrightarrow{\;0.005\;} 2Y \qquad (R2)$$
$$Y \xrightarrow{\;r\;} \varnothing \qquad (R3)$$

We consider the propensity of one of the reactions to be parameterized by $r$, which ranges from 0.2 to 0.3 in steps of 0.001. For each set of reaction parameters, we simulate the system 1000 times for a maximum of 200 s using Gillespie's algorithm. In each simulation run, we initialize the system with $X = 128$, $Y = 256$.
Fig. 14. We compute distances between the simulated time series for the population numbers of the species $X$. In each row, the first column is obtained via pairwise smashing, the second as the absolute difference of means, and the third as the absolute difference of variances. Row A uses the raw $\Delta X$ time series; row B shows the effect of normalizing each time series to zero mean; row C shows the effect of normalizing to zero mean and unit variance. Data smashing yields clear clusters, which are preserved through the normalizations, whereas the simple statistical measures do not.

We assume that we can make observations every 0.1 s from the simulated dynamics. A few sample paths for the change in the number of prey with time are shown in Fig. 13. Note that the probability of each reaction at any point in time is proportional to the number of combinatorial ways that particular reaction can transpire, as well as to the propensity of the reaction itself. This combinatorial number is a function of the current population count of each species; hence the reaction probabilities depend strongly on the current state vector. Since the simulation terminates when any one species becomes extinct, we cannot assume stationary behavior.
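A minimal sketch of Gillespie's algorithm for reactions R1-R3 follows, with mass-action propensities (rate constant times the number of ways the reaction can occur) and the initial counts given in the text; the helper name and seed are ours:

```python
import random

def gillespie_lv(r, x0=128, y0=256, t_max=200.0, seed=0):
    """Gillespie SSA for R1: X->2X (1.0), R2: X+Y->2Y (0.005), R3: Y->0 (r).
    Returns a trace of (time, prey X, predator Y), stopping at t_max or extinction."""
    rng = random.Random(seed)
    x, y, t = x0, y0, 0.0
    trace = [(t, x, y)]
    while t < t_max and x > 0 and y > 0:
        a1, a2, a3 = 1.0 * x, 0.005 * x * y, r * y   # propensities
        a0 = a1 + a2 + a3
        t += rng.expovariate(a0)        # exponential waiting time to next event
        u = rng.random() * a0           # choose the event proportionally to propensity
        if u < a1:
            x += 1                      # prey birth
        elif u < a1 + a2:
            x, y = x - 1, y + 1         # predation
        else:
            y -= 1                      # predator death
        trace.append((t, x, y))
    return trace

trace = gillespie_lv(r=0.25, t_max=5.0)
```

Sampling the resulting trace on a fixed 0.1 s grid reproduces the kind of observation stream analyzed in the text.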
Also, the initial state at least partially dictates the length of time the ecosystem survives, implying non-ergodicity. Note that removing the restriction of a strictly positive, integer-valued population count might result in a better-behaved system.

Given our parameterization, we actually have 101 distinct systems with different parameter sets, and for the system with index $i$ we set the propensity of the third reaction as:

$$r = 0.2 + 0.001\,i \qquad (116)$$

Since the third reaction models predator death, we expect that increasing its propensity will make predator degradation more probable. Thus, we can clearly expect a smaller number of predators and a larger number of prey on average as $r$ is increased. However, a truly interesting structure would be uncovered if the behaviors exhibit some sort of clustering, as opposed to simply a monotonic dependence on $r$. We analyzed our set of $r$-parameterized dynamical systems as follows:

1) We concatenated all 1000 time series for species $X$ generated by simulating each system. Thus, the $i$-th system generates the concatenated series $s_i$ for species $X$.

2) Next we generated $s'_i$ from $s_i$ by taking one-step differences of $s_i$, i.e., $s'_i$ is the time series of relative updates for the population of species $X$.

3) We mapped each sequential data series $s'_i$ to a symbol stream using a binary partition function, which maps negative entries in the series to the symbol 0 and positive entries to the symbol 1.

4) We collided the symbolic streams pairwise, and computed the smashing distance matrix $H$. Thus, the $ij$-th entry of $H$ is the deviation from flat white noise of the sum of $s'_i$ and an inverted copy of $s'_j$. The result is shown in plate A(i) of Fig. 14.

5) We also generated the pairwise absolute differences of the means and of the variances.
In each of these cases, the $ij$-th entry of the corresponding matrix is the absolute difference of the corresponding statistical measure between the data series $s'_i$ and $s'_j$ (see plates A(ii)-(iii) of Fig. 14).

Notably, the smashing matrix in plate A(i) of Fig. 14 shows clear clusters, whereas the matrices corresponding to mean and variance show trivial monotonic dependence on the parameter $r$. To ascertain whether the clustering obtained via data smashing depends on the mean or variance of the input data streams, we redid the analysis after:

1) normalization to zero-mean signals prior to symbolization;
2) normalization to zero-mean, unit-variance signals prior to symbolization.

In the first case, zeroing the mean makes the clusters appear more prominently (see plate B(i) of Fig. 14), while additionally normalizing the variance has little effect (see plate C(i) of Fig. 14). None of these changes allow the simple statistical measures to recover the clear clusters obtained via data smashing. The Lotka-Volterra system has a rich set of dynamical regimes, and it would not be surprising if such measures fail to capture this complexity.

Fig. 15. Mapping clusters recovered via data smashing to a meaningful dynamical feature. Plotting the minimum number of predators remaining in the system after a fixed simulation time of 100 s (minimization carried out over the 1000 simulation runs of each system) against the system index $i$, where $r = 0.2 + 0.001\,i$, illustrates that the clusters correspond almost perfectly to the monotonic domains of this function. (Plate A: minimum predator count versus system index; Plate B: smashing distances.)

To that effect, we plotted the minimum number of predators after 100 s of simulation
(minimum calculated over the 1000 simulation runs carried out for each parameter set, as discussed before). The result is shown in Fig. 15. The clusters identified via data smashing are now seen to correspond almost perfectly to the monotonic domains of this function. This illustrates that data smashing finds meaningful categorizations which simple statistical tools may miss. The differences discovered via smashing are obviously a function of the statistical structure of the observed data. However, the preceding example illustrates that it may not be easy to find the right statistical tool; the data smashing approach alleviates this challenge to a considerable degree.

XII. Conclusion

We introduced data smashing to measure causal similarity between series of sequential observations. We demonstrated that our insight allows feature-less, model-free classification in diverse applications, without the need for training or expert-tuned heuristics. Unequal time-series lengths, missing data, and possible phase mismatches are of no consequence.

While better classification algorithms may exist for specific problem domains, such algorithms are difficult to tune. The strength of data smashing lies in its ability to circumvent both the need for expert-defined heuristic features and expensive training, eliminating key bottlenecks in contemporary big-data challenges.

References

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: John Wiley, 1991.
[2] R. P. W. Duin, D. d. Ridder, and D. M. J. Tax, "Experiments with a featureless approach to pattern recognition," Pattern Recognition Letters, pp. 1159-1166, 1997.
[3] V. Mottl, S. Dvoenko, O. Seredin, C. Kulikowski, and I. Muchnik, "Featureless pattern recognition in an imaginary Hilbert space and its application to protein fold classification," Machine Learning and Data Mining in Pattern Recognition, pp. 322-336, 2001.
[4] E. Pekalska and R. P. W. Duin, "Dissimilarity representations allow for building good classifiers," Pattern Recognition Letters, vol. 23, no. 8, pp. 943-956, 2002.
[5] I. Chattopadhyay and H. Lipson, "Abductive learning of quantized stochastic processes using probabilistic finite automata," Phil. Trans. of the Roy. Soc. A, 2012, in press.
[6] J. P. Crutchfield and B. S. McNamara, "Equations of motion from a data series," Complex Systems, vol. 1, no. 3, pp. 417-452, 1987.
[7] J. Crutchfield, "The calculi of emergence: computation, dynamics and induction," Physica D: Nonlinear Phenomena, vol. 75, no. 1, pp. 11-54, 1994.
[8] I. Chattopadhyay, Y. Wen, and A. Ray, "Pattern classification in symbolic streams via semantic annihilation of information," in American Control Conference (ACC), 2010, pp. 492-497.
[9] I. Chattopadhyay and A. Ray, "Structural transformations of probabilistic finite state machines," International Journal of Control, vol. 81, no. 5, pp. 820-835, 2008.
[10] R. G. Andrzejak, K. Lehnertz, F. Mormann, C. Rieke, P. David, and C. E. Elger, "Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: dependence on recording region and brain state," Phys Rev E Stat Nonlin Soft Matter Phys, vol. 64, no. 6 Pt 1, p. 061907, Dec 2001.
[11] P. Bentley, G. Nordehn, M. Coimbra, and S. Mannor, "The PASCAL Classifying Heart Sounds Challenge 2011 (CHSC2011) Results," http://www.peterjbentley.com/heartchallenge/index.html.
[12] M. K. Szymanski, "The Optical Gravitational Lensing Experiment. Internet access to the OGLE photometry data set: OGLE-II BVI maps and I-band data," Acta Astron., vol. 55, pp. 43-57, 2005.
[13] H. Begleiter, "EEG database data set," Neurodynamics Laboratory, State University of New York Health Center, Brooklyn, New York, 1995. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/EEG+Database
[14] K. Brigham and B. Kumar, "Subject identification from electroencephalogram (EEG) signals during imagined speech," in Biometrics: Theory, Applications and Systems (BTAS), 2010 Fourth IEEE International Conference on, 2010, pp. 1-8.
[15] "English language speech database for speaker recognition," Department of Informatics and Mathematical Modelling, Technical University of Denmark, 2005. [Online]. Available: http://www2.imm.dtu.dk/~lfen/elsdsr/
[16] L. Feng and L. K. Hansen, "A new database for speaker recognition," Informatics and Mathematical Modelling, Technical University of Denmark, DTU, Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby, Tech. Rep., 2005. [Online]. Available: http://www2.imm.dtu.dk/pubdb/p.php?3662
[17] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd Edition). Wiley-Interscience, 2000.
[18] G. Brumfiel, "High-energy physics: Down the petabyte highway," Nature, vol. 469, no. 7330, pp. 282-283, Jan 2011.
[19] R. G. Baraniuk, "More is less: Signal processing and the data deluge," Science, vol. 331, no. 6018, pp. 717-719, 2011.
[20] T. Cover and P. Hart, "Nearest neighbor pattern classification," Information Theory, IEEE Transactions on, vol. 13, no. 1, pp. 21-27, January 1967.
[21] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.
[22] L. Yang, "An overview of distance metric learning," Proc. Computer Vision and Pattern Recognition, October, vol. 7, 2007.
[23] J. Tenenbaum, V. de Silva, and J. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science (New York, N.Y.), vol. 290, no. 5500, pp. 2319-2323, 2000.
[24] S. Roweis and L. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science (New York, N.Y.), vol. 290, no. 5500, pp. 2323-2326, 2000.
[25] H. Seung and D. Lee, "Cognition. The manifold ways of perception," Science (New York, N.Y.), vol. 290, no. 5500, pp. 2268-2269, 2000.
[26] J. H. Ward, "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, no. 301, pp. 236-244, 1963.
[27] M. J. Sippl and H. A. Scheraga, "Solution of the embedding problem and decomposition of symmetric matrices," Proc. Natl. Acad. Sci. U.S.A., vol. 82, no. 8, pp. 2197-2201, Apr 1985.
[28] W. Feller, The Fundamental Limit Theorems in Probability. MacMillan, 1945.
[29] J. G. Snodgrass and M. V, "A standardized set of 260 pictures: Norms for name agreement, image agreement, familiarity, and visual complexity," Journal of Experimental Psychology: Human Learning and Memory, 1980.
[30] A. Paz, Introduction to Probabilistic Automata (Computer Science and Applied Mathematics). Orlando, FL, USA: Academic Press, Inc., 1971.
[31] E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta, and R. Carrasco, "Probabilistic finite-state machines - part I," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, no. 7, pp. 1013-1025, July 2005.
[32] J. Hopcroft, R. Motwani, and J. Ullman, Introduction to Automata Theory, Languages, and Computation, 2nd ed. Boston: Addison Wesley, 2001.
[33] R. Gavaldà, P. W. Keller, J. Pineau, and D. Precup, "PAC-learning of Markov models with hidden state," in ECML, 2006, pp. 150-161.
[34] S. Bogdanovic, B. Imreh, M. Ciric, and T. Petkovic, "Directable automata and their generalizations - a survey," Novi Sad Journal of Mathematics, vol. 29, no. 2, pp. 31-74, 1999.
[35] M. Ito and J. Duske, "On cofinal and definite automata," Acta Cybern., vol. 6, pp. 181-189, 1984.
[36] F. Topsøe, "On the Glivenko-Cantelli theorem," Probability Theory and Related Fields, vol. 14, pp. 239-250, 1970. [Online]. Available: http://dx.doi.org/10.1007/BF01111419
[37] I. Chattopadhyay and A. Ray, "Language-measure-theoretic optimal control of probabilistic finite-state systems," Int. J. Control, vol. 80, no. 8, pp. 1271-1290, 2007.
