Multi Layer Analysis

This thesis presents a new methodology to analyze one-dimensional signals through a new approach called Multi Layer Analysis, for short MLA. It also provides some new insights on the relationship between one-dimensional signals processed by MLA and tree kernels, tests of randomness and signal processing techniques.

Author: Luca Pinello

Multi Layer Analysis

Author: Luca Pinello
Coordinator: Prof. Camillo Trapani
Thesis Advisor: Prof. Domenico Tegolo
Co-Advisor: Dott. Giosuè Lo Bosco
Settore Scientifico Disciplinare: INF/01

Abstract: This thesis presents a new methodology to analyze one-dimensional signals through a new approach called Multi Layer Analysis, for short MLA. It also provides some new insights on the relationship between one-dimensional signals processed by MLA and tree kernels, tests of randomness and signal processing techniques. The MLA approach has a wide range of applications in the fields of pattern discovery and matching, computational biology and many other areas of computer science and signal processing. This thesis also includes some applications of this approach to real problems in biology and seismology.

Keywords: multi layer analysis, machine learning, pattern discovery, classification, clustering, tree kernel, test of randomness.

Acknowledgments

I owe a great deal of thanks to many people for making this thesis possible. First of all, I dedicate this dissertation to Prof. Vito Di Gesù, who had been leading and supporting me and my research to be fruitful with his patience; I am very sad that unfortunately he can no longer follow me at this important step. I would like to express my gratitude to my current advisor, Prof. Domenico Tegolo, who has continued to lead and support me in this last year of research. A huge thanks goes to Giosuè Lo Bosco for his fundamental and precious collaboration in all aspects of my work. Without his skillful and infinite support my projects would not have been possible. I would especially like to thank Guocheng Yuan for his extremely valuable experience, support, insights and, most important, his friendship.
Thanks to my fellow PhD friends, in particular Filippo Utro, Fabio Bellavia, Marco Cipolla and Filippo Millonzi, for our broad-ranging discussions and for sharing the joys and worries of academic research. Furthermore, I am deeply indebted to my colleagues at the Department of Mathematics and Computer Science, who have provided the environment for sharing their experiences about the problem issues involved, as well as participated in stimulating team exercises developing solutions to the identified problems. Finally, I wish to express my gratitude to my family and friends who provided continuous understanding, patience, love and energy. In particular, I would like to express a heartfelt thanks to my parents and my girlfriend Valeria for their infinite support in my research endeavors. Thanks to all of you.

Originality Declaration

This work contains no material which has been accepted for the award of any other degree or diploma in any university or other tertiary institution and, to the best of my knowledge and belief, contains no material previously published or written by another person, except where due reference has been made in the text. I give consent to this copy of my thesis, when deposited in the University Library, being available for loan and photocopying.

Signed . . . . . . . . . . . . . . . . . . . . . . . . January 2011

Contents

Introduction
  What this thesis is about
  Our Contributions
1 Multi-resolution or multi-scale methodologies
  1.1 Motivation of Multi Layer Analysis
    1.1.1 Multi-resolution or multi-scale methodologies
    1.1.2 Discrete Fourier Transform
    1.1.3 Wavelet Analysis
    1.1.4 Scale Space Theory
    1.1.5 Quadtree Analysis
    1.1.6 String methods
    1.1.7 Level Set
  1.2 Pattern Discovery and Classification
    1.2.1 Pattern Discovery
    1.2.2 General schema of a Pattern Discovery method
    1.2.3 Classification
2 Multi Layer Analysis
  2.1 The Multi Layer Analysis
    2.1.1 The threshold operation
    2.1.2 The Horizontal Sampling, the Intervals Representation and the Aggregation Rule
  2.2 Choosing the right value for the number of thresholds
  2.3 Usage of the MLA as preprocessing step
3 Pattern Discovery and Classification by MLA
  3.1 MLA in Pattern Discovery and Classification
  3.2 Fundamentals of Molecular Biology
    3.2.1 DNA
    3.2.2 Genes and proteins
    3.2.3 Protein production and expression level of a gene
    3.2.4 Nucleosome and chromatin
    3.2.5 Microarray
  3.3 Case Study: Nucleosome Positioning
    3.3.1 The microarray and the signal
    3.3.2 Preprocessing
  3.4 First solution: Hidden Markov Model
    3.4.1 HMM as generators
    3.4.2 HMM as recognizers
    3.4.3 Problems related to HMM
    3.4.4 Forward procedures
    3.4.5 Viterbi algorithm
    3.4.6 Baum-Welch algorithm
    3.4.7 The proposed HMM for nucleosome positioning
  3.5 Second solution: MLA
    3.5.1 Preprocessing
    3.5.2 Creating the model
    3.5.3 Interval identification
    3.5.4 Aggregation rule and Pattern Definition
    3.5.5 Pattern selection
    3.5.6 Feature extraction
    3.5.7 Dissimilarity function
    3.5.8 Nucleosome Classification
    3.5.9 Parameter selection by calibration
    3.5.10 Synthetic generation of biological signals
  3.6 Results
    3.6.1 MLA vs HMM on Synthetic Nucleosome Positioning data
    3.6.2 MLA vs HMM on real data
    3.6.3 Scalability and computational time of MLA and HMM
  3.7 One-Class Classifier and MLA
    3.7.1 One-Class classifiers
    3.7.2 One-Class KNN
    3.7.3 Results on synthetic data
    3.7.4 Results on real data
4 Test of Randomness by MLA
  4.1 Test of Randomness
    4.1.1 State of the art
    4.1.2 Test based on runs
    4.1.3 Test based on entropy estimator
    4.1.4 Test based on ranking: Wilcoxon rank sum test
    4.1.5 Test based on goodness of fit: Kolmogorov-Smirnov goodness of fit Test
  4.2 MLA Test of Randomness
    4.2.1 Monte Carlo simulation
    4.2.2 Hypothesis test
    4.2.3 Probability density functions estimation
  4.3 Experimental Setup
    4.3.1 Assessment on synthetic data
    4.3.2 Assessment on real data
    4.3.3 Comparison with Wilcoxon rank sum test
5 MLA and Kernel methods
  5.1 Kernel methods
    5.1.1 Main ideas of kernel methods
    5.1.2 Formal definition and properties of kernels
    5.1.3 Kernels and distances
  5.2 Kernel methods for tree
    5.2.1 Convolution kernel
    5.2.2 Tree kernels
  5.3 MLA Kernels
    5.3.1 MLA Tree Kernel
    5.3.2 MLA Convolution Kernel
  5.4 Support Vector Machines
  5.5 Experimental Setup
    5.5.1 Synthetic data: discrimination power of MLA Tree Kernel on basic functions
    5.5.2 Synthetic data: MLA Tree Kernel on waveform dataset
    5.5.3 Assessment of induced distance of MLA Convolution Kernel for clustering of seismic signal
6 Conclusions and Future Directions
Bibliography

List of Figures

1.1 Convolution of a signal with a wavelet function. (Part of) this figure is taken from [1]
1.2 Scaling and translation of a mother wavelet. (Part of) this figure is taken from [1]
1.3 Haar wavelet.
1.4 Mexican hat wavelet.
1.5 Morlet wavelet.
1.6 Scale Space representation
1.7 Quadtree image segmentation
1.8 Level Set representation for a function depending on 2 variables.
1.9 Pattern Discovery parts
2.1 Schema of MLA processing
2.2 Threshold operation for three different values of φ
2.3 Equally spaced simple MLA
2.4 Interval representation of a signal
2.5 Original signal
2.6 Degradation of the signal for different values of K
2.7 MLA reconstruction of the simple sinusoidal signal with K = 8
2.8 MLA reconstruction of the rectangular pulse signal with K = 2
2.9 MLA "mother" function.
2.10 Different examples of signals (all of length 400)
2.11 (a) Odd worst case, (b) Even best and worst case, (c) Odd best case
2.12 Intervals increment: each point added can add no more than k − 1 intervals
3.1 Pattern Discovery by MLA and signal segmentation
3.2 DNA structure
3.3 Amino acids alphabet in terms of DNA alphabet
3.4 From a genomic sequence to a protein
3.5 From DNA to chromatin
3.6 Nucleosome structure: in blue the octamer, in orange the DNA
3.7 Microarray workflow
3.8 Microarray probes
3.9 From microarray to one-dimensional signal
3.10 Forward procedure
3.11 Backward procedure
3.12 Baum-Welch algorithm
3.13 HMM topology for nucleosome positioning
3.14 Patterns that meet the condition of convexity
3.15 Model of well-positioned nucleosome
3.16 Two different shapes of the input signal: (on the left) since at threshold level K + 1 the interval R_K = {I_K^1} has two subsets R_{K+1} = {I_{K+1}^1, I_{K+1}^2}, it is possible to set three patterns P_1 = {I_K^1}, P_2 = {I_{K+1}^1} and P_3 = {I_{K+1}^2}. (on the right) In this case, I_{K+1}^1 is the unique subset of I_K^1, thus it is possible to set a unique pattern P_1 = {I_K^1, I_{K+1}^1}.
3.17 (a) Input signal, smoothing, pattern identification and extraction: a Saccharomyces cerevisiae microarray data portion. Each x value represents a spot (probe) on the microarray and the corresponding y value is the logarithmic ratio of its Green and Red values. Nucleosome regions are around the signal peaks (one is marked by a black circle), while lower ratio values show linker regions (marked by dashed circles). The dashed lines represent the threshold levels; in this example 6 patterns are retrieved, identified by rhombus, circle, square, triangle down, triangle up, star. Each pattern identifier is replicated for each of its feature values and pointed in each one of its middle points. (b) An example of classification: in this portion 5 nucleosome regions are shown together with their range in base pairs. In particular, 1 out of the 5 regions is classified as delocalized while the remaining are well-positioned.
3.18 Shapes of the patterns: the three classes of nucleosomes it is possible to detect with the MLA very likely reflect different nucleosome mobility existing in vivo at specific chromatin loci. Delocalized nucleosomes probably represent single nucleosomes or arrays of nucleosomes with high mobility, while fused nucleosomes may reflect a single nucleosome that occupies two distinct close positions in different cells. On the left of the arrows, the particular nucleosome configuration which generates the resulting shape of the well-positioned (W), delocalized (D) and fused (F) nucleosome classes is shown.
3.19 Classification: the classification of a generic pattern P_i is performed in two phases. In the first phase the linker (L), the expected well-positioned (EW) and the expected delocalized (ED) patterns are established by using the classification rule defined by c_1. In the second phase, the expected regions A_i are defined by opportunely processing EW and ED patterns, and afterwards used by the classification rule c_2 in order to finally classify between well-positioned (W), delocalized (D) and fused (F) nucleosomes.
3.20 Calibration phase for the choice of m: recognition performance plots (group a) and percentage of minimum number of permanences plots (group b) for 3 different signal to noise ratios, SNR = 1, 2, 4 (first, second, third column respectively). The bar in each plot groups the results for 10 experiments occurring at several threshold values (i.e. number of cuts).
3.21 Calibration phase for the choice of K: the value for K is selected interactively by looking both at the plots of ε and MS.
3.22 An example of synthetic signal generation.
3.23 Results on synthetic data: the Recognition Accuracy of MLA and HMM on 6 synthetic signals generated at signal to noise ratios 1, 2, 4, 6, 8, 10.
3.24 A representative sample window spanning 13 nucleosomes where the agreement (disagreement) of the three methods is shown. The red draw represents the classification done by Pugh et al. (2007) in [2]
3.25 Computation time performances: the execution time ratio T_h/T_m of the MLA (T_m) and HMM (T_h) for 10 synthetic signals generated with different numbers of well-positioned nucleosomes. The dashed line shows the average execution time.
3.26 Two different representations of M: on the left (a) a 3d plot, on the right (b) an image representation showing the values of M using grayscale (0 is black, 1 is white). In this latter figure, the chosen pair (φ*, K*) is also shown.
3.27 Best Accuracy and FPR values versus SNR. The couples (φ, K) causing such results are also reported.
4.1 The general schema of the Monte Carlo Method
4.2 Examples of PIL_k (a) and PSKL_k (b) for k = 4
4.3 Examples of PIL_k (a) and PSKL_k (b) for k = 5
4.4 Examples of PIL_k (a) and PSKL_k (b) for k = 6
4.5 Examples of input signals: (a) input signal SNR = 1; (b) input signal SNR = 1.5; (c) input signal SNR = 10
4.6 Examples of hypothesis test at different SNR and thresholds.
4.7 Examples of hypothesis test at different SNR and thresholds.
4.8 Examples of hypothesis test at different SNR and thresholds.
4.9 PIL_k (a) and PSKL_k with hypothesis test results (b) of the real signal for k = 4
4.10 PIL_k (a) and PSKL_k with hypothesis test results (b) of the real signal for k = 5
4.11 PIL_k (a) and PSKL_k with hypothesis test results (b) of the real signal for k = 6
4.12 The gray strip indicates the useful part of the input signal in order to perform the test of randomness.
4.13 Mann-Whitney rank sum test results for different signal to noise ratios (a) and for the real signal (b)
5.1 General schema of kernel methods
5.2 Kernel mapping
5.3 General Schema of MLA Tree Kernel
5.4 General Schema of Kernel Methods
5.5 SVM margin and the separation hyperplane
5.6 Basic function
5.7 Basic function plus noise
5.8 Schema of the experiment

List of Tables

2.1 Degradation of the signal for different values of K.
2.2 Information loss on the signal for different values of K.
3.1 Confusion matrices of HMM on 6 different signal to noise ratios for nucleosome (N) and linker (L) regions.
3.2 Confusion matrices of MLA on 6 different signal to noise ratios for nucleosome (N) and linker (L) regions.
3.3 Agreement between the HMM and MLA (and vice versa) on the Saccharomyces cerevisiae data set for Nucleosome (N) and Linker (L) regions. The table on the left shows the RA results of HMM when considering MLA as the truth classification, while the opposite is shown on the right table.
3.4 Confusion matrices of MLA and HMM on deep sequencing approach (DS) data by Pugh et al. (2007).
3.5 Agreement between the HMM and MLA (and vice versa) on the Saccharomyces cerevisiae data set for Nucleosome (N) and Linker (L) regions. The table on the left shows the RA results of HMM when considering MLA as the truth classification, while the opposite is shown on the right table.
5.1 Classification accuracy on basic functions dataset.
5.2 Classification accuracy on waveforms dataset.
5.3 Distance optimality on geological signals

Introduction

What this thesis is about

This thesis presents a new methodology called Multi Layer Analysis (MLA) that acts as a transformation from the space of one-dimensional signals to a new space called the space of intervals. The main idea of this approach, shared by several other ones, is the decomposition of the input signal into basic features that allow its useful information to be better extracted. The main motivation of this study was to develop a new, highly scalable methodology to extract shape information from one-dimensional signals, because many real problems fall in this context.
In fact, several application domains such as Geology, Biomedicine and Biology require the analysis of one-dimensional signals whose features are encoded in the shapes of whole signals or in the shapes of their sub-fragments (e.g. seismic signals, ECG tracks, or ChIP-chip and ChIP-seq tracks). The kind of analysis obviously depends on the application domain but usually involves Pattern Discovery, Clustering or Classification methodologies. The main advantages of the MLA compared to other similar methods are its scalability and the possibility to represent a one-dimensional signal in terms of a tree of intervals, which permits any kind of shape to be expressed or characterized explicitly. Consequently, this has strong implications, since it establishes a connection between the class of algorithms that process one-dimensional signals, such as digital signal processing techniques, and algorithms on trees and graphs.

Contributions and Thesis Outline

The MLA methodology can be used as a preprocessing step in different fields of application, e.g. Classification, Clustering, Pattern Discovery and Tests of Randomness. Thus, it can be used as a tool in the field of data analysis. In more detail:

• This method has been applied to the biological problem of nucleosome positioning, providing performances similar to the state of the art method, but with better scalability and computational time. This is a fundamental point because it allows more complex organisms to be analyzed. It is also able to recover the positions of fuzzy nucleosomes.

• A new nonparametric test of randomness based on MLA, which exploits shape features that are rare in a random signal, was developed.

• MLA allows a one-dimensional signal to be mapped into a tree of intervals.
Consequently, some tree kernels, used in different contexts, have been adapted to this representation, providing new kernels that explicitly encode the shape information of a one-dimensional signal expressed as a tree of intervals.

• The mapping of a one-dimensional signal into a tree of intervals creates a new and important connection between two fundamental classes of algorithms: signal processing algorithms and algorithms on trees and graphs.

Chapter 1 presents the motivations of MLA, focusing on different methodologies that exploit and share the same idea. Some approaches, at first sight disjointed but actually exploiting the same idea of multi-resolution or multi-view analysis, are presented. Some aspects of these methods are related to the MLA analysis; in particular, similarities or advantages of one method with respect to the others are highlighted. In addition, all the basic definitions of the problems where the MLA can be productively applied are briefly given.

Chapter 2 provides a detailed and formal description of the MLA, explaining the MLA transformation step by step and highlighting its limits and properties. Finally, some general guidelines on how to use the MLA as a preprocessing step for several problems are provided.

Chapter 3 explains how MLA can be integrated in the context of Pattern Discovery and Classification. In addition, a case study regarding a particular biological problem in which the MLA was successfully used is introduced: nucleosome spacing. Moreover, an alternative approach to the same problem based on Hidden Markov Models, and a comparison of the two methods, are presented. Finally, the last section is devoted to the description of a new one-class classifier that was used as the new classifier module of the MLA.
Chapter 4 presents a new nonparametric test of randomness, applicable to a set of one-dimensional signals, that takes advantage of the MLA preprocessing step. In particular, this procedure is based on the probability density function of the symmetrized Kullback-Leibler distance, estimated via a Monte Carlo simulation on the interval lengths obtained by MLA. The main advantage of this new approach is to perform an exploratory analysis in order to directly verify the presence of several kinds of structures in an input signal. In particular, this test differs from the other approaches since it exploits shape features that are rare in a random signal.

Chapter 5 presents how the MLA can help in designing new kernel functions that explicitly take into account the shape information contained in a one-dimensional signal. The main idea of Kernel Methods is presented, giving more details on a particular subclass of kernel functions applicable to structured data, in particular trees. The MLA is used to define a mapping from the set of one-dimensional signals to the set of trees. Two new kernels that use the MLA representation are finally defined, and a case study regarding seismographic signals is presented.

Chapter 1

Multi-resolution or multi-scale methodologies

The proposed methodology is essentially a multi-level decomposition of a one-dimensional signal. The key point of this method is the multi-level analysis. The idea of "multi-level" or "multi-resolution" is shared by several apparently disjointed methodologies.

1.1 Motivation of Multi Layer Analysis

Recently, multi-scale and multi-resolution models have been research topics in rapid evolution, with great impact on Computer Science, Applied Mathematics, Image Analysis and Signal Processing.
The key idea of the MLA is to obtain several "views" or "features" of the same input data (at different scales, resolutions, or in a different domain) in order to perform a better and perhaps more understandable analysis. Using this approach it is possible to focus on the regions of interest with a finer resolution, with a consequent increase in precision. The regions of interest can be detected by views or features at lower resolution; in this way it is possible both to obtain better results and to improve computational time. The idea of multi-scale analysis comes from the fact that many real systems have different behaviors at different scales. For example, in physics there are different laws to describe a phenomenon at different scales or resolutions, e.g. classical mechanics for describing the motion of macroscopic objects as opposed to quantum mechanics, which describes atoms and molecules. It is not an exaggeration to say that many real problems can be handled using different scales or resolutions. For example, human beings organize their time using seconds, hours, days, weeks, months and years, reflecting the multi-scale dynamics of the solar system, and choosing the scale depending on the problem at hand. The folding of a protein can require a time on the scale of seconds, while the vibration of covalent bonds is on the order of 10^-15 seconds. In general, the more details of a system we want to model, the more complex the required laws to describe it become.

1.1.1 Multi-resolution or multi-scale methodologies

In the following sections some approaches will be presented, at first sight disjointed, but actually exploiting the same idea of multi-resolution or multi-view analysis.
In fact, the shared motivation of all these approaches is that in some cases it is easier, given an input signal, to extract and analyze a set of features or views that represent different information contained in it than to analyze the original signal directly. This is done by each methodology in different ways, but the main idea that connects them is to decompose a signal into simpler parts (in the frequency or time domain, or at another scale or resolution) and perform the analysis by combining the resulting information on each part. The MLA, as well as the other methods, exploits the same idea, in which the analysis is performed on several "parts" of the original signal obtained, as will be explained in the next chapter, by a simple operation called threshold. Some aspects of these methods will be related to the MLA analysis, in particular where there are strong similarities or advantages of one method with respect to the others.

1.1.2 Discrete Fourier Transform

One of the well-known methods that first exploited this idea is the Fourier Transform and, in particular, its variant for discrete signals called the Discrete Fourier Transform (DFT). This transformation is mainly adopted when the information of interest is encoded in the frequency domain of a signal. In fact, the Fourier Transform and its discrete version, i.e. the DFT, is an operation able to transform a discrete signal from the time domain into the frequency domain. This is done by decomposing it as a linear combination of sinusoidal components. Here the parts of the original signal are the pure sinusoids at different frequencies and phases. In more detail, the DFT decomposes a signal into a discrete spectrum composed of its frequency components, while the inverse transform synthesizes the original signal from the frequency components of its spectrum [78].
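The analysis/synthesis pair of the DFT described above can be checked numerically. The following is a minimal sketch (Python with NumPy is assumed here purely for illustration; the direct O(N^2) implementation and the function names are ours, not part of the thesis), which computes the coefficients c_k of a pure sinusoid and verifies that the synthesis equation reconstructs the original signal:

```python
import numpy as np

def dft_analysis(x):
    """Analysis equation: c_k = (1/N) * sum_n x(n) * e^{-j 2 pi k n / N}."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)                      # one row per coefficient k
    return (x * np.exp(-2j * np.pi * k * n / N)).sum(axis=1) / N

def dft_synthesis(c):
    """Synthesis equation: x(n) = sum_k c_k * e^{j 2 pi k n / N}."""
    N = len(c)
    k = np.arange(N)
    n = k.reshape(-1, 1)                      # one row per sample n
    return (c * np.exp(2j * np.pi * k * n / N)).sum(axis=1)

# Test signal: a single sinusoid completing 3 cycles over N samples.
N = 64
t = np.arange(N)
x = np.sin(2 * np.pi * 3 * t / N)

c = dft_analysis(x)
x_rec = dft_synthesis(c)

# The energy concentrates in the conjugate pair c_3 and c_{N-3},
# and the round trip recovers the signal up to numerical error.
assert np.argmax(np.abs(c[:N // 2])) == 3
assert np.allclose(x_rec.real, x, atol=1e-10)
```

The c_k are exactly the "parts" (pure sinusoids) the text refers to: each coefficient is one frequency view of the same input signal.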
More formally:

Definition (DFT) Given a discrete signal $x(n)$ of $N$ samples, its DFT and its inverse DFT are defined by the following equations:

• Synthesis equation:
$$x(n) = \sum_{k=0}^{N-1} c_k \, e^{2\pi j k n / N} \quad (1.1)$$

• Analysis equation:
$$c_k = \frac{1}{N} \sum_{n=0}^{N-1} x(n) \, e^{-2\pi j k n / N} \quad (1.2)$$

In more detail, the DFT allows one to extract the frequency, phase and amplitude information of the sinusoids coming from the decomposition of a signal. In addition, with the DFT it is possible to find the frequency response of a system from its impulse response and vice versa. In this way it is possible to analyze a system in the frequency domain, just as it is possible to use convolution to analyze a signal in the time domain. This approach in some sense extracts several views of the same input signal, corresponding to the frequency components it contains. However, one of the main limitations of this approach is that it does not perform well for non-stationary signals; in addition, it cannot directly characterize the shapes contained in a signal, as is instead possible with the MLA.

1.1.3 Wavelet Analysis

A method that overcomes some limitations of Fourier Analysis is Wavelet Analysis. A wavelet is a mathematical function used to decompose a signal into components with different frequencies, resolutions and positions [1]. The position component is particularly useful when the input signal is not stationary, i.e. it has been generated by a stochastic process whose joint probability distribution changes when shifted in time or space. For this reason wavelets have become popular and are nowadays widely used in multi-resolution analysis. The wavelet transform is the representation of a signal in terms of scaled and translated copies of the same function, called the mother wavelet.
In more detail, the wavelet transform is obtained by the convolution between a signal and a wavelet function, as illustrated in figure 1.1. An example of scaling and translating a mother wavelet is shown in figure 1.2. A mother wavelet needs to satisfy some properties, such as finite length and zero mean value. These properties make wavelet analysis more powerful than Fourier analysis, since a signal can be decomposed as a sum of the same wavelet properly translated and scaled, instead of using smooth and continuous functions like sinusoids. This leads to a good decomposition also in the case of signals showing discontinuities, or in the case of non-stationary processes. Figures 1.3, 1.4 and 1.5 show some possible mother wavelets. The Continuous Wavelet Transform and the Inverse Continuous Wavelet Transform will now be formally introduced.

[Figure 1.1: Convolution of a signal with a wavelet function. (Part of) this figure is taken from [1].]

Definition (Continuous Wavelet Transform) The continuous wavelet transform (CWT) of a continuous signal $x(t)$, with respect to the mother wavelet $\psi$, is defined as:
$$T(a, b) = w(a) \int_{-\infty}^{\infty} x(t) \, \psi^{*}\!\left(\frac{t - b}{a}\right) dt \quad (1.3)$$
where $\psi^{*}$ is the complex conjugate of the function $\psi$, $w(a)$ is a weighting function usually equal to $\frac{1}{\sqrt{a}}$ or $\frac{1}{a}$, $b$ controls the location of $\psi$ and $a$ its scale.

Definition (Inverse Continuous Wavelet Transform) The inverse continuous wavelet transform (ICWT) of the wavelet transform $T(a, b)$ of a continuous signal $x(t)$, with respect to the mother wavelet $\psi$, is defined as:
$$x(t) = \frac{1}{C_g} \int_{-\infty}^{\infty} \int_{0}^{\infty} T(a, b) \, \psi_{a,b}(t) \, \frac{da \, db}{a^2} \quad (1.4)$$
where, again, $b$ controls the location of $\psi$ and $a$ its scale.
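A minimal sketch of a discretized version of equation (1.3), with the Mexican hat of figure 1.4 as mother wavelet and $w(a) = 1/\sqrt{a}$ (the function names and the test signal are illustrative assumptions, not the thesis code):

```python
import math

def mexican_hat(t):
    # Mexican hat (Ricker) mother wavelet, proportional to (1 - t^2) * exp(-t^2/2);
    # it has finite effective support and zero mean, as required of a mother wavelet.
    return (1 - t * t) * math.exp(-t * t / 2)

def cwt(x, a, b, dt=1.0):
    # Riemann-sum discretization of T(a, b) = w(a) * integral x(t) psi*((t-b)/a) dt;
    # the wavelet is real, so the conjugate is the wavelet itself.
    w = 1.0 / math.sqrt(a)
    return w * sum(x[n] * mexican_hat((n * dt - b) / a) for n in range(len(x))) * dt

# A signal with a localized bump around t = 50: the transform responds most
# strongly when the translation b is aligned with the bump.
x = [math.exp(-((n - 50) ** 2) / 10.0) for n in range(100)]
responses = {b: cwt(x, a=2.0, b=float(b)) for b in (10, 50, 90)}
print(max(responses, key=responses.get))  # -> 50
```

This shows the role of the two parameters: sweeping $b$ localizes structures in time, while sweeping $a$ matches structures of different widths.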
1.1.4 Scale Space Theory

Another methodology that exploits the idea of decomposing a signal into simpler "parts" is Scale Space Theory, a framework for the multi-scale representation of signals developed in the fields of computer vision, image processing and signal processing [50]. It is a formal theory applied to manipulate signals of one or more dimensions at different scales. Here the "parts" of a signal are the structures or features at different scales contained in it and, as in the wavelet approach, the parts are obtained by convolving a base signal at different scales. The main differences lie in how the convolution is performed and in how the information from the parts is combined.

[Figure 1.2: Scaling and translation of a mother wavelet. (Part of) this figure is taken from [1].]

The concept of scale space is general and can be used in an arbitrary number of dimensions. For simplicity, the most widely used framework, the linear scale space in two dimensions, will be described here.

Definition (Linear Scale Space) Given a two-dimensional signal $f(x, y)$ (e.g. an image), its linear scale space is a family of derived signals $L(x, y, t)$ defined by the convolution of $f(x, y)$ with a Gaussian kernel $g$:
$$g(x, y, t) = \frac{1}{2\pi t} \, e^{-\frac{x^2 + y^2}{2t}} \quad (1.5)$$
such that:
$$L(x, y, t) = g(x, y, t) * f(x, y) \quad (1.6)$$
where $t = \sigma^2$ is the variance of the Gaussian.

[Figure 1.3: Haar wavelet.]
[Figure 1.4: Mexican hat wavelet.]

The reason for generating a scale space representation of an image, for example, derives from the consideration that real-world objects consist of different structures at different
[Figure 1.5: Morlet wavelet.]

scales. This implies that real-world objects differ from idealized mathematical entities, such as points or lines, and may appear differently depending on the scale we use to observe them. For example, the concept of a tree is appropriate if we think on the scale of meters, while the concept of a leaf requires a finer scale. Similarly, a machine vision system that has to analyze an unknown scene cannot know in advance which scales are appropriate to describe the data in the scene. For this reason, a reasonable approach is to consider descriptions of the scene at different scales simultaneously. An example of this approach is illustrated in figure 1.6.

1.1.5 Quadtree Analysis

Quadtree Analysis is another image analysis technique that consists in iteratively splitting an image into blocks that are more homogeneous than the image itself, using a particular data structure called a quadtree [20]. This technique, examining the image at different resolutions, allows one to obtain information about its structure. It is also used as the first step in adaptive algorithms for image compression. The technique consists in dividing a square image into four blocks of equal size, and then testing whether each block meets some homogeneity criterion (for example, whether the gray levels of all pixels belonging to a block fall within a specific range of values). If the block meets the criterion it will not be split further; otherwise it will again be divided into four blocks that will in turn be tested against the homogeneity criterion. This process is iterated until each block meets the criterion. The entire process will clearly split the image into blocks of different sizes.
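The recursive splitting just described can be sketched in a few lines (a minimal illustration with a hypothetical range-based homogeneity criterion, not the thesis implementation):

```python
def quadtree(img, x, y, size, max_range=0):
    """Recursively split the square block of side `size` at (x, y) until each
    block is homogeneous (max - min of its gray levels <= max_range).
    Returns the list of leaf blocks as (x, y, size) tuples."""
    values = [img[y + j][x + i] for j in range(size) for i in range(size)]
    if size == 1 or max(values) - min(values) <= max_range:
        return [(x, y, size)]
    half = size // 2
    blocks = []
    for dy in (0, half):
        for dx in (0, half):
            blocks += quadtree(img, x + dx, y + dy, half, max_range)
    return blocks

# An 8x8 image: uniform background with a single bright pixel in one corner.
img = [[0] * 8 for _ in range(8)]
img[1][1] = 255
blocks = quadtree(img, 0, 0, 8)
# Only the quadrant containing the bright pixel keeps splitting; the other
# three quadrants stay whole, so the decomposition is strongly unbalanced.
print(len(blocks))  # -> 10
```

The unbalanced decomposition is the point of the technique: homogeneous regions are represented by a few large blocks, while detail-rich regions are examined at finer resolution.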
An example of quadtree analysis, used to detect salient objects in an image, is shown in figure 1.7.

1.1.6 String methods

In many disciplines the input data comes naturally in the form of strings: bio-sequences, graphs and text documents. In this scenario there are several methodologies that exploit the "multi-view" approach in terms of subsequences or substrings of the input string. For example, there are several similarity measures between string objects in which the greater the number of factors two strings share, the more similar they are [53]. Another example, which will be presented in detail in chapter 5, is the family of convolution kernels [34].

[Figure 1.6: Scale Space representation: $L(x, y, t)$ at scales $t = 0$ (original image), $1$, $4$, $16$, $64$ and $256$.]
[Figure 1.7: Quadtree image segmentation.]

The basic idea of a convolution kernel is to decompose a data object into simpler parts and then define a kernel function in terms of such parts. A very common kernel for string classification (especially of protein sequences) that exploits this idea is the spectrum kernel. The main idea behind it is that the more substrings of a fixed length are shared by two strings, the more similar they are (see [49] for details). More formally, consider the following definition:

Definition (Spectrum Kernel) Let $\Sigma$ be a finite alphabet, $\Sigma^{*}$ denote all possible strings over $\Sigma$, and $\Sigma^{k}$ all the strings over $\Sigma$ of length $k$. Let $\#_x[w]$ denote the number of occurrences of $w$ in $x$, i.e. $\#_x[w] = |\{ (y, z) \mid x = y \cdot w \cdot z, \; y, z \in \Sigma^{*} \}|$, and let $G_k[x]$ be the k-gram vector of $x$ over all the strings in $\Sigma^{k}$, i.e. $G_k[x] = (\#_x[w])_{w \in \Sigma^{k}}$.
Given $k \in \mathbb{N}$, the spectrum kernel can be defined as:
$$S_k(s_1, s_2) = \sum_{w \in \Sigma^k} \#_{s_1}[w] \cdot \#_{s_2}[w] = \langle G_k[s_1], G_k[s_2] \rangle \quad (1.7)$$

1.1.7 Level Set

Another approach that decomposes a signal into parts, and that is very close to the MLA, is the Level Set method, a numerical technique for the recognition of shapes in a signal [74]. This method is based on the fact that it is usually easier to characterize a shape using a particular set of auxiliary functions, called level sets, than to use the shape directly. In fact, level sets allow one to characterize a shape by considering several of its levels or sub-views. Figure 1.8 shows a pictorial representation of this approach for a function of two variables. The formal definition of a level set is the following:

Definition (Level Set of a function) Given a function $f: \mathbb{R}^n \to \mathbb{R}$, a level set is a set of the form:
$$\{ (x_1, \ldots, x_n) \mid f(x_1, \ldots, x_n) = k \} \quad (1.8)$$
If $n = 2$ this set is called a level curve, if $n = 3$ a level surface and, more generally, if $n > 3$ a level hypersurface. In particular, using a level set it is possible to express a closed curve $\Gamma$ indirectly through the function $f$, considering the level set:
$$\Gamma = \{ (x_1, \ldots, x_n) \mid f(x_1, \ldots, x_n) = 0 \}$$

As will become clear later, the MLA idea is in some sense very close to this approach, since the information that characterizes the signal is similar. The main difference is the way the information is organized: with MLA it is possible to characterize any shape in a natural and elegant way, using a particular structure to store this information.

[Figure 1.8: Level Set representation for a function depending on 2 variables.]
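A level set of equation (1.8) can be computed directly on a sampled domain; a minimal sketch (the function name and the example $f$ are illustrative assumptions):

```python
def level_set(f, points, k, tol=1e-9):
    # The level set {p | f(p) = k} of equation (1.8), restricted to the given
    # sample points and evaluated up to a numerical tolerance.
    return [p for p in points if abs(f(*p) - k) <= tol]

# f(x, y) = x^2 + y^2: the level curve for k = 25 is the circle of radius 5,
# so on an integer grid we recover exactly the lattice points of that circle.
grid = [(x, y) for x in range(-6, 7) for y in range(-6, 7)]
circle = level_set(lambda x, y: x * x + y * y, grid, 25)
print(len(circle))  # -> 12  (e.g. (3, 4), (5, 0), (-4, -3), ...)
```

This is exactly the "characterize a shape through an auxiliary function" idea: the circle is never stored explicitly, only $f$ and the level $k$ are.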
1.2 Pattern Discovery and Classification

The next sections present two machine learning settings in which the MLA can be seamlessly integrated. For this reason the general problems of Pattern Discovery and Classification are introduced here, while chapter 3 will cover in detail how to integrate the MLA in these contexts.

1.2.1 Pattern Discovery

Pattern discovery is a general discipline whose main goal is to process large amounts of data in order to efficiently extract unknown, useful knowledge [87]. In other words, a pattern discovery method discovers subsets of the input data that are meaningful according to a formal criterion. More generally, pattern discovery is a research area that provides efficient methods to uncover, without using "a priori" knowledge of the data, patterns that are repetitive, unexpected or interesting according to a formal criterion. In order to better understand pattern discovery, it is first necessary to define the meaning of pattern. Informally, a pattern is any relation in the data that is of interest and that is not casual or random. In other words, it is necessary to answer the question: how meaningful is a pattern? This matters because the human mind has a tendency to see patterns everywhere; for this reason, it is necessary to assess whether a pattern is significant in a rigorous way. More formally, a pattern is a data vector serving to describe an anomalously high local density of data points [32]. This means that particular points have a behavior different from the points in the other regions, usually called the "background"; the background regions are not interesting, since their behavior is not related to the true process that generated the "anomalies".
In recent years a lot of attention has been paid to this problem, so that several tools are now available in Statistics and Computer Science to address it. In particular, these techniques can be fruitfully applied to several unconnected application domains, such as speech recognition, biology, finance and econometrics, biomedicine, text analysis and statistics. As a matter of fact, the data involved in pattern discovery methods are of different kinds, such as sequences, images, sounds and structured data such as trees and graphs [87, 15, 6, 61, 13, 83].

1.2.2 General schema of a Pattern Discovery method

A general pattern discovery method can be subdivided into three main parts [83], as shown in figure 1.9:

[Figure 1.9: Pattern Discovery parts]

• a language to describe the patterns;
• a score function to assess the interestingness of a pattern;
• an efficient algorithm that identifies the most interesting patterns using the score function.

Obviously, these three parts depend strongly on the particular application domain taken into consideration. This is especially true for the language used to describe the patterns; in fact, the data do not always come in the form of feature vectors or in terms of some formal language (or grammar). In this sense a language can be thought of as a transformation that encodes the information present in the data in a form suitable for a particular score function. Another important point is the choice of the most suitable score function for the particular process that generated the data, in order to discover the "anomalies". The last, but not least important, point is the scalability of the algorithm, which is fundamental in many practical application domains.
In particular, this last point usually depends on the complexity of the language used to express the patterns and on the computational efficiency of the score function. For this reason it is necessary to consider a compromise between the expressivity and the computational efficiency of languages and score functions.

1.2.3 Classification

In recent years several algorithms have been developed for classification; all of them, albeit with different techniques, match a set of elements defined over a space of features with a set of labels corresponding to different groups or classes [24]. This is equivalent to partitioning the feature space into regions, assigning to each region a specific label. In general, classification refers to the class of machine learning methodologies that, given a set of input data, assign subparts of the input data to a class taken from a finite number of categories. More formally, consider a set of observations $X \subseteq \mathbb{R}^n$, a set of elements $Y = \{y_1, \ldots, y_M\}$ called labels, and a function $f: X \to Y$ that defines the true mapping from the set $X$ of observations to the set of labels. A classification algorithm, given a set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ called the training set, produces as output a function $g: X \to Y$ that approximates the function $f$ as closely as possible. Classification can also be seen as a parameter estimation problem, where the goal is to estimate a set of functions of the form:
$$P(\mathrm{class} \mid x) = f\left(x; \vec{\theta}\right) \quad (1.9)$$
where $x \in X$ represents the vector of input features for each item to be classified, and $f$ is a function depending on a vector of parameters $\vec{\theta}$ related to the specific classification problem. This function represents the probability that the element represented by the feature vector belongs to a particular class.
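The abstract setting above (learn $g: X \to Y$ from a training set $D$) can be made concrete with the simplest instance, a 1-nearest-neighbor classifier; this is a minimal sketch with hypothetical names, not a method proposed in the thesis:

```python
import math

def nearest_neighbor_classifier(D):
    """Build g: X -> Y from a training set D = [(x_i, y_i), ...] by assigning
    to a new observation the label of its closest training observation (1-NN)."""
    def g(x):
        return min(D, key=lambda pair: math.dist(x, pair[0]))[1]
    return g

# Two toy classes in the plane: g partitions the feature space into two regions.
D = [((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((5.0, 5.0), "B"), ((5.1, 4.8), "B")]
g = nearest_neighbor_classifier(D)
print(g((0.1, 0.2)), g((4.9, 5.2)))  # -> A B
```

Note how 1-NN makes the "partition of the feature space" explicit: each region is the set of points closer to one training observation than to any other.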
In any case, the classification process generally follows these steps:

1. selection of the classes of interest;
2. selection of the training set;
3. statistical analysis of the training set, in order to assess whether it represents the problem being tackled well;
4. selection of the classification algorithm;
5. classification of the data using the chosen algorithm;
6. validation of the results and their interpretation.

The most common classification algorithms are: the Bayesian Classifier, K-Nearest Neighbors, Support Vector Machines, Decision Trees and Neural Networks. Interested readers can find a good survey of the principal classification algorithms in [24].

Chapter 2

Multi Layer Analysis

In this chapter a detailed and formal description of the Multi Layer Analysis (MLA) will be presented. The MLA is a general feature extraction method that can be adapted to discover patterns in one-dimensional signals, or used as a preprocessing step for classification, clustering and other data analysis techniques.

2.1 The Multi Layer Analysis

The MLA is a feature extraction method in which the processed input data can be used by a classifier or a clustering method in order to distinguish between several kinds of patterns. It is based on the generation of several sub-samples of the input signal, each one carried out by a particular threshold operation, chosen by respecting cut-set optimal conditions with respect to the input data. A flow chart of the whole methodology is shown in figure 2.1. As can be seen in that figure, the method starts from the input signal and, applying a set of simple operations called thresholds, extracts a set of intervals.
These intervals, suitably aggregated, can encode the shape information of the input signal, which can be used to characterize it or to discover structures contained in it. In the following, the formal definition of the threshold operation will be given, together with some generic applications of this transformation.

2.1.1 The threshold operation

Definition (Threshold operation) Given an input signal $f$, the threshold operation $\sigma_k$ is defined as follows:
$$\sigma_k(x) = \begin{cases} f(x) & \text{if } p(f(x)) \text{ is true} \\ k & \text{otherwise} \end{cases}$$
where $p$ is a generic condition defined on the elements of $f$. In the simplest case $f$ can be defined on $\mathbb{R}$ and it is possible to set:
$$p(f(x)) = \begin{cases} \text{true} & \text{if } f(x) \le \phi \\ \text{false} & \text{otherwise} \end{cases} \quad (2.1)$$

[Figure 2.1: Schema of MLA processing: the input signal $f$ is fed to parallel threshold operators with conditions $p_1, \ldots, p_k$, each followed by an intervals extractor; the resulting intervals are combined by an aggregation rule.]

This approach detects sub-samples, derived from threshold operations, that satisfy structural or shape properties. An example of a simple threshold operation with the condition expressed in equation 2.1 is depicted in figure 2.2. The key idea behind the MLA is to explore the input signal at different threshold levels, corresponding to its decomposition into several sub-signals, in order to discover the hidden patterns of interest.

Definition (General MLA) The MLA can be defined as a set of sub-samples of a one-dimensional signal $f$:
$$MLA(f) = \{ \sigma_1(x), \sigma_2(x), \cdots, \sigma_K(x) \} \quad (2.2)$$
where each threshold operation, indicated by the subscript of $\sigma$, can be characterized by a specific condition. The MLA is more accurate and robust than a naive methodology that, using a single threshold operation, could give inaccurate results, especially in the realistic case where the input data is affected by noise.
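A minimal sketch of the threshold operation of equation 2.1 and of the resulting set of sub-signals of equation 2.2 (hypothetical helper names; here, as in the equally spaced case introduced later, the default value $k$ is taken equal to the level $\phi$ itself, so values above the level are clipped to it):

```python
def threshold(f, phi):
    # sigma(x) = f(x) if f(x) <= phi, phi otherwise (condition 2.1 with k = phi)
    return [v if v <= phi else phi for v in f]

def mla(f, levels):
    # Equation (2.2): the MLA as the set of sub-signals, one per threshold level
    return [threshold(f, phi) for phi in levels]

f = [0.1, 0.8, 0.4, 0.9, 0.2]
layers = mla(f, [0.25, 0.5, 1.0])
for phi, layer in zip([0.25, 0.5, 1.0], layers):
    print(phi, layer)
# 0.25 [0.1, 0.25, 0.25, 0.25, 0.2]
# 0.5  [0.1, 0.5, 0.4, 0.5, 0.2]
# 1.0  [0.1, 0.8, 0.4, 0.9, 0.2]
```

Each layer is a coarser "view" of the signal: low levels keep only the valleys, while the highest level reproduces the signal itself.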
The accuracy and robustness are due to the fact that the MLA uses several conditions $p$ in order to validate the same hypothesis on the multiple sub-samples extracted from the input signal $f$. For this reason, this technique introduces a sort of "flexibility" into the analysis of a signal.

[Figure 2.2: Threshold operation for three different values of $\phi$.]

After the multiple threshold operations, called horizontal sampling, it is possible to extract a set of intervals from the original signal and define its interval representation; it is also possible to organize these intervals using a particular rule called an aggregation rule. A summary of the overall process is shown in figure 2.1. The next two subsections explain in detail the horizontal sampling, the interval representation and the aggregation rule of a signal.

2.1.2 The Horizontal Sampling, the Interval Representation and the Aggregation Rule

The core of the MLA is the interval identification obtained through the horizontal sampling procedure.

Definition (Horizontal sampling) Given a bounded signal $f: [\alpha, \beta] \to \mathbb{R}^{+}$ and $K \in \mathbb{N}$ threshold operations $\sigma_k$ ($k = 1, \ldots, K$), for each $k$ it is possible to build a set of intervals:
$$I_k = \left\{ i_k^1, i_k^2, \cdots, i_k^{n_k} \right\} \quad (2.3)$$
where $i_k^t = [a_k^t, b_k^t]$, with $t = 1, \cdots, n_k$ and $a_k^t, b_k^t \in \mathbb{R}$.

In the simple case in which the condition $p$ of the generic threshold operation $\sigma_k$ is the one expressed in equation 2.1, it is easy to prove that $f(a_k^t) = f(b_k^t) = \phi_k$, the threshold level of $\sigma_k$. After the horizontal sampling process, a different representation of the input signal, called the interval representation of $f$, is obtained; it will be denoted by $\Upsilon(f)$.

Definition (Disambiguation operation) To avoid ambiguities in the case where $f$ is discrete, i.e.
$f: \{1, 2, \cdots, L\} \to \mathbb{R}^{+}$, and $f(1) \ne \min(f)$ or $f(L) \ne \min(f)$, $f$ is transformed into a new signal $f': [a, b] \to \mathbb{R}^{+}$:
$$f'(x) = \begin{cases} \min(f) & \text{if } x = a \vee x = b \\ f(x) & \text{if } 1 \le x \le L \end{cases}$$
where
$$a = \begin{cases} 0 & \text{if } f(1) \ne \min(f) \\ 1 & \text{otherwise} \end{cases} \qquad b = \begin{cases} L + 1 & \text{if } f(L) \ne \min(f) \\ L & \text{otherwise} \end{cases}$$

Definition (Interval Representation) Given a signal $f$ and $K$ threshold operations $\sigma_k$ ($k = 1, \ldots, K$), let $I_k = \{ i_k^1, i_k^2, \cdots, i_k^{n_k} \}$ be the set of intervals corresponding to $\sigma_k$; then the interval representation of $f$, indicated as $\Upsilon(f)$, is:
$$\Upsilon(f) = \{ I_1, I_2, \cdots, I_K \} \quad (2.4)$$

Definition (Aggregation Rule) Given a signal $f$ and its interval representation $\Upsilon(f) = \{ I_1, I_2, \cdots, I_K \}$, an aggregation rule is a rule that constructs sets of intervals taken from $\Upsilon(f)$ in order to characterize or represent "interesting" subparts of $f$. In general it is possible to define several aggregation rules to express different shape properties present in a signal. In the next chapters several examples of aggregation rules applied to different application domains will be presented.

Definition (Equally spaced simple MLA) Without loss of generality, assume that $f: \mathbb{R} \to [0, 1]$ and $K \ge 2$. The equally spaced simple MLA is carried out by considering the thresholds $\sigma_k$, with $1 \le k \le K$, defined as follows:
$$\sigma_k(x, \phi_k) = \begin{cases} f(x) & \text{if } f(x) \le \phi_k \\ \phi_k & \text{otherwise} \end{cases}$$
with
$$\phi_k = \frac{k - 1}{K - 1}$$
By convention the first threshold operation corresponds to $\sigma_1(x, 0)$ and the last to $\sigma_K(x, 1)$. Note that all the intervals extracted by the last threshold operation $\sigma_K$ by convention encompass a single point, corresponding to the intersection of the signal with the straight line of equation $y = 1$. In other words, these intervals $I_K$ have the property that $a_K^t = b_K^t$ for all $1 \le t \le n_K$.
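Under the simple condition of equation 2.1, the intervals at a given level are the maximal runs where the signal exceeds that level (their endpoints sit where $f$ crosses $\phi_k$). A sketch for a discrete signal, where the endpoints are sample indices rather than exact crossing points (hypothetical helper names, not the thesis implementation):

```python
def intervals_at_level(f, phi):
    """Maximal index intervals [a, b] where the discrete signal stays above phi."""
    intervals, start = [], None
    for n, v in enumerate(f):
        if v > phi and start is None:
            start = n
        elif v <= phi and start is not None:
            intervals.append((start, n - 1))
            start = None
    if start is not None:
        intervals.append((start, len(f) - 1))
    return intervals

def interval_representation(f, K):
    # Upsilon(f) = {I_1, ..., I_K} with equally spaced levels phi_k = (k-1)/(K-1)
    return [intervals_at_level(f, (k - 1) / (K - 1)) for k in range(1, K + 1)]

f = [0.0, 0.4, 0.9, 0.5, 0.1, 0.6, 0.8, 0.0]
for phi_k, I_k in zip([0.0, 0.5, 1.0], interval_representation(f, 3)):
    print(phi_k, I_k)
# 0.0 [(1, 6)]
# 0.5 [(2, 2), (5, 6)]
# 1.0 []
```

Note how the number of intervals per level already encodes shape: the two bumps of $f$ appear as two disjoint intervals at the intermediate level.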
In addition, by definition, the first threshold operation collects only one interval $[1, L]$, where $L = \beta + 1$. An example of equally spaced simple MLA is depicted in figure 2.3.

[Figure 2.3: Equally spaced simple MLA]

In general the interval representation is lossy, because it can only keep the subset of points of $f$ that form the intervals in $\Upsilon(f)$ (see figure 2.4).

[Figure 2.4: Interval representation of a signal]

Notice that, as with many other transformations presented in chapter 1, using the MLA it is always possible to reconstruct a lossless version of the input signal if certain conditions hold; this will be discussed later. Obviously, the information loss of this representation decreases as the number $K$ of threshold operations increases. Of course, it is always possible to reconstruct a lossy version of the original signal using an interpolation algorithm applied only to the points of its interval representation. Given a generic signal $f$, it is also obvious that a lossless reconstruction of $f$ from its representation $\Upsilon(f)$ can be obtained as $K \to \infty$. If $f$ is a discrete signal, it is easy to prove that it is always possible to obtain a lossless representation by imposing that at least one of the threshold levels intersects each point of $f$; in particular, the following theorem gives a way to calculate the minimum number $K$ of threshold operations needed to build a lossless representation using equally spaced thresholds.

Theorem 2.1.1 Let $\varepsilon_{min}$ be the precision required, and let $f: [\alpha, \beta] \to [0, 1]$ be a discrete-time signal of length $L$ ($|[\alpha, \beta]| = L$). Then the lower bound on the number of threshold operations $K$ allowing a lossless representation $h$ of $f$ using the equally spaced simple MLA (i.e.
for each pair of adjacent points in $h$, $d_n = |h(n + 1) - h(n)| \ge c$ with $c \in \mathbb{R}$) is:
$$K = \frac{1}{g} \sum_{n=1}^{L-1} \left\lceil \frac{d_n}{\varepsilon_{min}} \right\rceil \approx \left\lceil \frac{1}{g \times \varepsilon_{min}} \right\rceil \quad (2.5)$$
where $g$ is the GCD (Greatest Common Divisor) of all the integers $F = \left\{ \left\lceil \frac{d_n}{\varepsilon_{min}} \right\rceil, \; n = 1, 2, \cdots, L - 1 \right\}$.

Proof. Using a precision of $\varepsilon_{min}$ it is possible to map the set of absolute differences $D = \{ d_n = |f(n + 1) - f(n)|, \; n = 1, 2, \cdots, L - 1 \}$ to the set of natural numbers $F = \left\{ \left\lceil \frac{d_n}{\varepsilon_{min}} \right\rceil, \; n = 1, 2, \cdots, L - 1 \right\}$, and let $g = \mathrm{GCD}(F)$. By the definition of $g$ it follows that $\left\lceil \frac{d_n}{\varepsilon_{min}} \right\rceil = g \times m_n$ with $m_n \in \mathbb{N}$, and
$$K = \sum_{n=1}^{L-1} m_n = \sum_{n=1}^{L-1} \frac{1}{g} \left\lceil \frac{d_n}{\varepsilon_{min}} \right\rceil$$

Lemma 2.1.2 Let $\varepsilon_{min}$ be the precision required, and let $f$ be a discrete signal of length $L$; without loss of generality, assume that $f$ has values in $[0, 1]$. Then
$$K = \sum_{n=1}^{L-1} \left\lceil \frac{d_n}{\varepsilon_{min}} \right\rceil \quad (2.6)$$
is the upper bound on the number of threshold operations $K$ needed to obtain a lossless representation of $f$ using an equally spaced subdivision of $f$.

Proof. The proof is straightforward: the largest $K$ is obtained when the GCD $g$ assumes its minimum value, which is 1, because one property of the GCD is that $g \ge 1$.

Table 2.1: Degradation of the signal for different values of K

    K (threshold operations) | Kendall correlation | Length of representation
    2                        | 0.6900              | 4
    4                        | 0.2846              | 68
    8                        | 0.9420              | 130
    16                       | 0.9973              | 280
    32                       | 0.9987              | 566
    64                       | 0.9999              | 1440

Although the previous theorem and lemma give a lower and an upper bound on the $K$ allowing a lossless representation of a discrete signal $f$, it is usually convenient, for several reasons, to search for the smallest $K$ allowing a reasonable lossy representation of $f$. It is obvious that the number of threshold operations strongly depends on the signal shape.
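The two bounds can be computed directly from the adjacent differences of the signal; a small sketch for a toy discrete signal (hypothetical function name; the values are chosen so that the mapping to integers is exact):

```python
import math
from functools import reduce

def k_bounds(f, eps_min):
    """Lower bound (2.5) and upper bound (2.6) on the number of equally spaced
    threshold operations for a lossless representation of the discrete signal f."""
    F = [math.ceil(abs(f[n + 1] - f[n]) / eps_min) for n in range(len(f) - 1)]
    g = reduce(math.gcd, F)
    upper = sum(F)        # lemma 2.1.2 (worst case, g = 1)
    lower = upper // g    # theorem 2.1.1
    return lower, upper

f = [0.0, 0.25, 0.75, 0.25, 0.5]   # differences: 0.25, 0.5, 0.5, 0.25
print(k_bounds(f, 0.125))          # F = [2, 4, 4, 2], g = 2 -> (6, 12)
```

Here the common factor $g = 2$ halves the number of levels needed: every jump is a multiple of $2\varepsilon_{min}$, so a grid twice as coarse still intersects every point.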
For this reason, this representation is suggested when the information of the signal is encoded in the time domain, because it characterizes the shape information well (alternatively, it is possible to apply the Fourier Transform first and then use this methodology on the spectrum of the signal). Figure 2.6 shows the progressive degradation of a signal as the number of threshold operations decreases, and table 2.1 reports the number of points required to represent a signal for a fixed number $K$ of thresholds, together with the correlation coefficient between the original and the reconstructed signal. In section 2.2 a calibration procedure to select the proper value of $K$ will be described.

Note that this transformation cannot simply be related to the theory of sampling, and in particular to the Sampling Theorem [64], because of the non-trivial distortion that the MLA can introduce in the spectral components of the original signal.

Theorem 2.1.3 (Sampling Theorem [64]) If the highest frequency contained in an analog signal $x_a(t)$ is $F_{max} = B$ and the signal is sampled at a rate $F_s > 2 F_{max} = 2B$, then $x_a(t)$ can be exactly recovered from its sample values using the interpolation function:
$$g(t) = \frac{\sin(2 \pi B t)}{2 \pi B t} \quad (2.7)$$

In other words, there is no simple mathematical relation linking the two transformations, because they extract different information from the signal: frequency and shape information respectively, as stressed before. As an enlightening example, consider two simple but opposite cases: a sinusoidal signal and a rectangular pulse signal.
Looking at figures 2.7 and 2.8, it is clear that this transformation introduces artifacts in the spectrum of the simple sinusoidal signal, which can be represented by a single component with the Fourier Transform, while it introduces no artifact for the rectangular pulse signal which, in the continuous case, requires infinitely many components to be represented properly in the frequency domain.

[Figure 2.5: Original signal]
[Figure 2.6: Degradation of the signal for different values of K: reconstructions with K = 3, 4, 8, 16, 32 and 64.]

In other words, the number of threshold operations does not depend directly on the frequency content of the input signal, but only on the quantization levels needed to properly represent it. The quantization levels are obviously proportional to the smallest variation $\varepsilon_{min}$ that it is necessary to capture in the signal. If an equally spaced "horizontal sampling" in terms of threshold operations is required, as in the case of the equally spaced simple MLA, it is possible to use lemma 2.1.2.

[Figure 2.7: MLA reconstruction of the simple sinusoidal signal with K = 8]

In some sense the MLA representation is related to the wavelet representation.
In fact, it is possible to think of a signal as composed of scaled and shifted components (in the sense of wavelet components) in which the mother wavelet is a single rectangular pulse, as depicted in Figure 2.9. The main difference from the wavelet approach is that in the MLA transformation the data are represented in a different way, and the MLA "mother" does not need to have zero mean, although it has finite duration.

Figure 2.9: MLA "mother" function.

Figure 2.8: MLA reconstruction of the rectangular pulse signal with K = 2.

2.2 Choosing the right value for the number of thresholds

The bounds on the value of K, given a quantization precision of ε_min in the case of N equally spaced thresholds, have been previously stated. An interesting question is: is it necessary to use all the levels stated by the upper bound of Theorem 2.1.2? The short answer is no. A practical approach is to define a similarity measure between the original input signal and the reconstructed signal, in order to get an idea of the "amount" of information that the MLA representation retains. A natural set of similarity functions suitable for this scope belongs to the family of correlation functions. Among the correlation functions, the best known are the Pearson, Spearman and Kendall correlation indices.

Definition (Pearson, Spearman, and Kendall correlation) Given two signals x and y, the correlation indices are defined as:
• Pearson correlation:

r = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{m} (y_i - \bar{y})^2}}    (2.8)

• Spearman correlation:

\rho = 1 - \frac{6 \sum_{i=1}^{m} \Delta_i^2}{m(m^2 - 1)}    (2.9)

• Kendall correlation:

\tau = \frac{n_c - n_d}{\frac{1}{2} m(m - 1)}    (2.10)

where \bar{x} = \frac{1}{m}\sum_i x_i, \bar{y} = \frac{1}{m}\sum_i y_i, \Delta_i is the difference between the ranks of x_i and y_i, while n_c and n_d are the numbers of concordant and discordant pairs, respectively.

signal / K     5        10       50       100
earthquake     0.3856   0.6488   0.9399   0.9470
gaussian       0.9484   0.9890   0.9994   1
uniform        0.9916   0.9990   1        1
sin            0.9936   0.9937   0.9950   0.9950

Table 2.2: Information loss on the signal for different values of K

In Figure 2.10 it is possible to see four examples of real-world and synthetic signals: an earthquake signal, a Gaussian noise signal generated according to the Gaussian distribution of equation 2.12, a uniform random signal generated according to the uniform distribution of equation 2.11, and a sinusoidal signal.

Definition (Uniform distribution) The uniform distribution [27] has constant probability over an interval [a, b], and its probability density function p is:

p(x) = \begin{cases} 0 & \text{for } x < a \\ \frac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{for } x > b \end{cases}    (2.11)

Definition (Normal or Gaussian distribution) The Normal or Gaussian distribution [27] is a probability distribution with probability density function:

f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.    (2.12)

Table 2.2 shows the number of levels required to obtain a correlation value of at least 0.9 (using Kendall's correlation, equation 2.10) for the four examples. It is also important to take into account the length of the signal representation, which obviously depends strongly on the number of levels used. The following theorem gives an upper bound on the length of the representation of a signal using K threshold operations.
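The calibration just described — increasing K until the Kendall correlation between the original and the reconstructed signal exceeds a target such as 0.9 — relies on equation 2.10, which can be sketched with a minimal O(m²) implementation (the function name is illustrative, not from the thesis; ties are simply not counted in n_c or n_d):

```python
def kendall_tau(x, y):
    """Kendall correlation (eq. 2.10): (n_c - n_d) / (m(m-1)/2)."""
    m = len(x)
    n_c = n_d = 0
    for i in range(m):
        for j in range(i + 1, m):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                n_c += 1   # concordant pair: same ordering in x and y
            elif s < 0:
                n_d += 1   # discordant pair: opposite ordering
    return (n_c - n_d) / (0.5 * m * (m - 1))

# A reconstruction that preserves the ordering of the samples gives tau = 1,
# even when the reconstructed values differ from the originals.
print(kendall_tau([1.0, 2.5, 3.0, 4.2], [0.9, 2.4, 3.1, 4.0]))
```

Because τ depends only on the relative ordering of the samples, it is insensitive to the amplitude distortion introduced by the quantization, which makes it a natural choice for this calibration.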
Figure 2.10: Different examples of signals (all of length 400): (a) earthquake signal, (b) Gaussian noise, (c) uniform noise, (d) sinusoidal signal.

Theorem 2.2.1 Given a discrete signal f of length L ≥ 3, and let K ≥ 2 be the number of threshold levels in the equally spaced simple MLA transformation; then the upper bound I_max on the number of intervals of its representation Υ(f) is:

I_{max}(L) = \left\lceil \frac{L}{2} \right\rceil (K - 1) + 1    (2.13)

and the number of real numbers required to represent the intervals is:

n_{max}(L) = 2 \left\lceil \frac{L}{2} \right\rceil (K - 1) + 2    (2.14)

Proof. To avoid confusion, remember that by definition the equally spaced simple MLA adds at the beginning (or at the end) of the signal f a point equal to min(f) if f(1) ≠ min(f) (or if f(L) ≠ min(f)) through the disambiguation operation. It is possible to define two kinds of worst-case signal, one for L odd (see Figure 2.11 (a)) and one for L even (see Figure 2.11 (b)). The even worst-case signal always involves the addition of a single new point by disambiguation, while two points are added in the case of an odd worst-case signal. Moreover, the addition of a new point to the signal involves the introduction of at most K − 1 new intervals, as can be seen in Figure 2.12. Furthermore, a generic threshold operation σ_k with k ≠ 1 will be considered, because by definition the first threshold operation always extracts exactly one interval, independently of the length of the signal.
Note also that, in the case of the best-case signal with L odd points, the number of intervals is exactly I_{min}(L) = \lfloor L/2 \rfloor (K - 1) + 1 (see Figure 2.11 (c)).

Figure 2.11: (a) Odd worst case, (b) even best and worst case, (c) odd best case (original points vs. points added by the disambiguation operation).

Figure 2.12: Interval increment: each added point can add no more than K − 1 intervals.

Let us recall two simple properties of the ceiling and floor functions:

if L ∈ N is even, then \left\lceil \frac{L}{2} \right\rceil = \left\lceil \frac{L-1}{2} \right\rceil    (2.15)

if L ∈ N is odd, then \left\lfloor \frac{L}{2} \right\rfloor + 1 = \left\lceil \frac{L}{2} \right\rceil    (2.16)

Suppose we have a signal of length L, and consider two cases, L even or L odd:

• L even: since L is even, only one new point has to be added. The resulting signal can be seen as the extension of the best-case signal with L − 1 (odd) points by adding one new point; applying induction and properties 2.15 and 2.16, it results that I_{max}(L) = I_{min}(L-1) + (K-1) = \lfloor \frac{L-1}{2} \rfloor (K-1) + 1 + (K-1) = \left( \lfloor \frac{L-1}{2} \rfloor + 1 \right)(K-1) + 1 = \lceil \frac{L-1}{2} \rceil (K-1) + 1 = \lceil \frac{L}{2} \rceil (K-1) + 1.

• L odd: since L is odd, the worst-case signal involves the addition of two new points. The resulting signal can be seen as the extension of a best-case signal with L odd points; applying induction and property 2.16, it results that I_{max}(L) = I_{min}(L) + (K-1) = \lfloor \frac{L}{2} \rfloor (K-1) + 1 + (K-1) = \left( \lfloor \frac{L}{2} \rfloor + 1 \right)(K-1) + 1 = \lceil \frac{L}{2} \rceil (K-1) + 1.

Lemma 2.2.2 Given a discrete signal f of length L, and let K ≥ 2 be the number of threshold levels in the equally spaced simple MLA; then the complexity of this transformation is O(K · L).
Proof. Using the previous theorem, it is clear that in the worst case a generic threshold operation can produce \lceil L/2 \rceil intervals. Since the transformation uses K threshold operations in total, in the worst case it performs \lceil L/2 \rceil \cdot K interval extractions.

2.3 Usage of the MLA as a preprocessing step

In general, there are two principal problems in which MLA can be successfully used:

• given a family of signals and a signal in this family, characterize it in terms of the other signals in the family;

• given a signal, discover whether it contains interesting substructures in some formal sense.

In more detail, given a signal f and its MLA representation Υ(f), there are several ways to use it; the most trivial is to use the intervals "as they are", in a feature-vector fashion. It is important to note that these are not true feature vectors, since two signals of equal length do not always yield representations of the same size. In other words, it is not possible to have a positional representation of the features of a signal as in a classic feature vector. For this reason, in order to compare two or more signals using the MLA representation, special distances, or more generally dissimilarity functions, need to be defined. One way to overcome this problem is to use a set of probability distributions to model the output of the threshold operations. An example of this approach will be shown in Chapter 4, where a randomness test that exploits this idea will be presented. If instead we need to characterize subparts of a signal, it is necessary to define aggregation rules that reflect our notion of "interestingness". This approach will be presented in the next chapter, where a rule that well characterizes a biological structure (the nucleosome) will be defined.
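The interval extraction whose cost Lemma 2.2.2 bounds can be illustrated with a simplified sketch. This is an assumption-laden approximation: it assumes each threshold operation extracts the maximal runs where the signal lies at or above an equally spaced threshold, and it omits the disambiguation step described in the proof of Theorem 2.2.1; the function name is illustrative, not from the thesis.

```python
def count_intervals(f, K):
    """Count the intervals extracted by K equally spaced threshold
    operations, assuming each operation extracts the maximal runs where
    the signal is >= the threshold (disambiguation step omitted)."""
    lo, hi = min(f), max(f)
    thresholds = [lo + k * (hi - lo) / (K - 1) for k in range(K)]
    total = 0
    for t in thresholds:
        prev_above = False
        for v in f:
            above = v >= t
            if above and not prev_above:
                total += 1   # a new interval starts at this point
            prev_above = above
    return total

# An alternating signal of odd length L = 5 already meets the bound of
# Theorem 2.2.1 for K = 4: ceil(5/2) * (K - 1) + 1 = 10 intervals.
print(count_intervals([1.0, 0.0, 1.0, 0.0, 1.0], 4))
```

Note how the cost structure of Lemma 2.2.2 is visible directly in the code: K threshold passes, each scanning the L samples once, for O(K · L) work overall.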
An extension of this approach will be presented in Chapter 5, where a new structure built through a particular interval aggregation rule, called the Tree Interval Representation, will be introduced. It will also give the possibility to define a new kernel function, taking inspiration from the well-known tree kernels that have been successfully used in a completely different context: natural language processing and text categorization. In particular, each of these chapters will be organized in two parts: the first part will give the formal definitions, and the second part will present the real problem and the proposed solution, highlighting where the MLA takes place and, if possible, a comparison with state-of-the-art methodologies.

Chapter 3

Pattern Discovery and Classification by MLA

This chapter presents the MLA in the context of pattern discovery and classification; in particular, Section 3.1 explains how MLA can be integrated in these contexts. Then in Section 3.3 a case study is introduced: it regards a particular biological problem, nucleosome spacing, in which the MLA was successfully used (see Section 3.1). In addition, in Section 3.4 an alternative approach for this problem based on Hidden Markov Models is presented, while in Section 3.6 a comparison of the two methods is given. Finally, the last section is devoted to the description of a new one-class classifier that was used as the new classifier module of the MLA.

3.1 MLA in Pattern Discovery and Classification

This section explains how it is possible to apply the MLA in the context of pattern discovery. A general schema of pattern discovery that takes advantage of the MLA is presented in Figure 3.1. The important point here is that MLA plays the role of the language used to express the patterns, as explained in Chapter 1.
In particular, given a signal f, the patterns correspond to subregions of f that can be found using its interval representation Υ(f) together with an appropriate aggregation rule. As explained in Chapter 2, it is convenient to use the MLA in order to characterize or discover patterns in terms of their shapes. This means that a general criterion to assess whether a pattern is interesting in this context is to check how closely a subregion of a signal, expressed in terms of intervals, meets a particular aggregation-rule criterion or interval distribution. In the latter case, it is possible to define an expected interval distribution for a "background" that can be used to assess how interesting a pattern is. This approach, as will be shown in the case study described in the next section, is particularly useful and natural for signal segmentation. In the classification problem, since it is necessary to provide an explicit training set (i.e. some examples for each class to discriminate), the MLA can be used as a feature extractor, in the sense that each element can be expressed through MLA by its interval representation or, more generally, by a structure built on its interval representation using a particular aggregation rule. Here, an element of a class can be a whole signal or a subpart of a signal, possibly extracted with a pattern discovery approach.

Figure 3.1: Pattern discovery by MLA and signal segmentation.

In the next section, the basic biological notions will be provided in order to introduce the MLA in the context of pattern discovery and classification for a particular biological problem: nucleosome spacing.
3.2 Fundamentals of Molecular Biology

In this section some concepts and notions of biology will be described, in order to introduce the basic terminology useful for the comprehension of the matter.

3.2.1 DNA

DNA is a double-helix molecule formed by two chains (helices) oriented in opposite directions, as shown in Figure 3.2. DNA is present in every cell of the body and contains all the genetic information necessary for the organism. The major classes of organisms are eukaryotes and prokaryotes. In eukaryotes DNA is contained within the nucleus, separated from the cytoplasm; in prokaryotes, instead, it is contained in the cytoplasm. DNA is composed of four distinct types of bases, called nucleotides, that consist of three parts: a phosphate group, a sugar (deoxyribose) and a nitrogenous base (purine or pyrimidine). The four bases that form DNA are: adenine (indicated by A), cytosine (indicated by C), thymine (indicated by T) and guanine (indicated by G). The DNA bases are complementary: a C always pairs with a G, and an A with a T. The complementarity of the two chains allows a DNA sequence to be represented using only one of the two, because the other one is complementary and the information it contains is therefore redundant.

3.2.2 Genes and proteins

Genes correspond to particular subsequences of DNA. They belong to the genome of an organism, which can be composed of DNA or RNA; the genes in particular direct the physical and behavioral development of the body. Genes also determine the amino acid sequence of proteins, which are the macromolecules most involved in the biochemical and metabolic processes of the cell. Some other genes do not encode proteins but encode RNA, which plays a key role in gene expression. In a cell there are thousands of different proteins, each with a distinct amino acid sequence.

Figure 3.2: DNA structure.
In particular, each amino acid is encoded by exactly 3 nucleotides, as can be seen in Figure 3.3, and there are 20 amino acids in total. In general, a protein is a polymer composed of different combinations of amino acids that bind each other through interactions called peptide bonds. Proteins perform a variety of tasks in the cell. In fact, they transmit messages between cells, turn genes on and off, are essential in muscle contraction, and build structures such as hair. Proteins are characterized by a three-dimensional structure articulated on four structural levels, in relation to each other:

1. The primary structure is the one that identifies the specific sequence of amino acids of the peptide chain.

2. The secondary structure corresponds to several configurations such as the spiral shape (or alpha helix), the planar one (or beta sheet), the three intertwined filaments, and those belonging to the globular KEMF group (keratin, epidermin, myosin, fibrinogen).

3. The tertiary structure represents the three-dimensional configuration of the polypeptide chain. This configuration is permitted and maintained by different chemical bonds, including the sulfide bridges and the Van der Waals forces.

4. The quaternary structure determines the association of two or more polypeptide units, or of protein and non-protein units, joined together by weak bonds such as sulfide bridges, but in a very specific way, as occurs in the formation of the enzyme phosphorylase, consisting of four sub-units, or in hemoglobin, which is the molecule responsible for transporting oxygen in the body.

Figure 3.3: Amino acid alphabet in terms of the DNA alphabet.

3.2.3 Protein production and expression level of a gene

The production of a protein from a gene is called gene expression.
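Two of the facts above — strand complementarity (Section 3.2.1) and the 3-nucleotide encoding of amino acids — can be illustrated with a short sketch; the helper names are illustrative, not from the thesis:

```python
from itertools import product

COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def reverse_complement(strand):
    """Recover the opposite strand from one strand (A pairs with T, C with G),
    read in the conventional direction; this is why storing a single strand
    loses no information."""
    return ''.join(COMPLEMENT[b] for b in reversed(strand))

print(reverse_complement("AACG"))

# Codons are triplets over the 4-letter alphabet: 4**3 = 64 possible codons
# encode only 20 amino acids (plus stop signals), so the code is degenerate:
# several codons map to the same amino acid, as Figure 3.3 shows.
codons = [''.join(c) for c in product("ACGT", repeat=3)]
print(len(codons))
```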
To obtain a protein from a gene, the information in DNA is copied through a process called RNA transcription. RNA in the form of mRNA acts as a messenger and delivers information from the cell nucleus (where DNA is located) to the cytoplasm. Once in the cytoplasm, the mRNA is translated into its product, the protein, using the alphabet of amino acids. The protein is thus built starting from the original DNA sequence representing the gene, as can be seen in Figure 3.4. Each cell of an organism contains the same DNA, and so the same information; however, cells are specialized according to their function. This specialization arises because not all genes are expressed at the same time and within the same cell. In fact, gene expression is a controlled dynamic phenomenon, so that the processes of a cell are carried out in a controlled way. This phenomenon is regulated by several proteins that bind different regions of DNA. This adjustment may depend on the function that a cell has to perform, and it is regulated by both external factors and internal factors produced by the cell.

Figure 3.4: From a genomic sequence to a protein.

3.2.4 Nucleosome and chromatin

As said before, DNA contains all the information of an organism, and it is organized in a specific spatial configuration called chromatin, and in particular in chromosomes. In more detail, there are fundamental units called nucleosomes that package DNA into chromatin, and there are several levels of spatial organization from DNA to a chromosome, as can be seen in Figure 3.5. The nucleosome, whose discovery dates back to 1974, is the fundamental unit of chromatin structure and consists of a segment of about 150 bp of DNA associated with a quaternary structure of proteins called the histone octamer.
The nucleosome has a compact globular shape and plays the role of compacting DNA in a eukaryotic cell. In Figure 3.6 it is possible to see the stylized structure of a nucleosome. Nucleosomes have a diameter of about 11 nm and are spaced from each other by a stretch of linker DNA varying in length from a few to about 80 nucleotide pairs. The resulting structure has the characteristic appearance of a necklace of pearls and is the first level of compaction of chromatin. The formation of nucleosomes in fact converts a molecule of DNA into a strand of chromatin about a third of the original length. This structural organization was highlighted after isolating the nucleosomes from chromatin. Several factors can influence nucleosome organization [72] and therefore chromatin. Recent studies have shown that one of these factors is sequence specificity, which consists in the nucleosomes' preference for some sequences: in particular, in vitro studies have shown that nucleosomes have a strong preference for some DNA sequences [70] and instead "don't like" other sequences such as poly(dA:dT) tracts [71]. Another important factor is their statistical positioning [46]. This theory is based on the concept of barriers, which are regions on the DNA where nucleosomes cannot stay. Barriers, on average, regulate the positions of the nucleosomes around them. An important result is that it is possible to derive mathematically the probability function of the nucleosome preferences around a barrier. The last factor is the set of chromatin remodeler complexes that actively move the nucleosomes across DNA [66].
Figure 3.5: From DNA to chromatin.

3.2.5 Microarray

A DNA microarray (commonly known as gene chip, DNA chip, or biochip) is a collection of microscopic DNA probes attached to a solid surface such as glass, plastic or a silicon chip, forming an array [3]. These arrays are used to examine the expression profile of a gene, or to identify the presence of a gene or of a short sequence, across thousands of genes (often the entire genome of an organism). Each location corresponds to a specific gene (or a specific sequence) and contains multiple copies of a filament with a particular sequence of bases. These DNA strands are anchored to the surface of the substrate and are used as probes to measure the amount of other DNA molecules (which are also single-stranded) derived from mRNA transcripts and contained in a solution that is deposited on the surface of the microarray.

Figure 3.6: Nucleosome structure: in blue the octamer, in orange the DNA.

There are two main approaches used in the manufacturing process of microarrays. One process is to deposit, with the help of a robot, a solution containing the DNA probes on the surface of the solid support. The probes can be made of single-stranded cDNA (complementary DNA obtained from an mRNA transcript, with a length of 200-2400 bases) or of chemically pre-synthesized oligonucleotides (short sequences of nucleotides with a length of 50-100 bases). Microarrays made by this process are called "cDNA microarrays" [3]. The other process is to directly synthesize oligonucleotides on the surface of the microarray (in situ); this operation is carried out mainly with photolithographic techniques (typical of Affymetrix) and inkjet printing [3].
The advantage of using microarrays is the possibility to examine a large amount of data per experiment; for example, it is possible to monitor the expression levels of thousands of genes at a time. Figure 3.7 shows the workflow that is usually followed when using the microarray technique:

• preparation and labeling of the sample (different samples are labeled with different markers);

• hybridization and alignment;

• cleaning;

• image acquisition and data analysis.

Figure 3.7: Microarray workflow.

3.3 Case Study: Nucleosome Positioning

The biological problem under consideration concerns the positioning of nucleosomes on DNA. This problem is very interesting because the accurate and precise measurement of the nucleosome positions on a genomic scale could improve the understanding of the chromatin structure and its function. Alterations in chromatin, and hence in nucleosome organization, can result in a variety of diseases. In fact, the emergence of diseases is thought to be due to the fact that altered chromosome condensation leads to increased expression of certain genes, causing abnormal production of proteins in the cell. This motivates the use of a methodology capable of determining the position of nucleosomes, in order to study the implication of nucleosome spacing in chromatin condensation phenomena. This may be investigated by comparing the positions of nucleosomes in different contexts in which there are different amounts of the proteins that remodel chromatin by changing nucleosome positions. This would help figure out the molecular basis of chromosome condensation defects, or of defects in gene expression caused by the partial or total absence of these molecular machines.
In fact, it is possible that nucleosome spacing is at the basis of this, which would mean that in the absence of such molecular machines nucleosomes are not spaced properly, causing abnormalities in the cell. So it is very important to understand the processes that modulate chromatin dynamics, and in particular nucleosome positioning. Their positioning in fact plays a direct role in gene regulation [51]. While the packaging that they provide allows the cell to organize a large and complex genome in the nucleus, they can also block the access of transcription factors and other proteins to DNA [17]. For example, under normal conditions the Pho5 promoter in yeast is occupied by well-positioned nucleosomes, preventing the transcription factor Pho4 from binding to its target binding site. When induced by phosphate starvation, the nucleosomes are depleted from the promoter region so that Pho4 can bind to its target DNA binding sequence, thus activating Pho5 gene transcription [79]. However, nucleosome binding can sometimes enhance transcription by bringing distant DNA regulatory elements together [84]. Genome-wide studies have found that, in general, transcription activity is inversely proportional to nucleosome occupancy in promoter regions [5, 63, 47]. With the help of tiling arrays at 20 bp resolution, Yuan et al. [90] have looked at nucleosome occupancy relative to gene regulatory regions on 4% of the yeast genome by using a Hidden Markov Model (HMM) approach. The microarray-based method used allows the identification of nucleosomal and linker DNA sequences on the basis of the susceptibility of linker DNA to micrococcal nuclease. This method allows the representation of microarray data as a signal of green/red ratio values, showing nucleosomes as peaks about 150 base pairs long, surrounded by lower ratio values corresponding to linker regions.
Consistent with previous studies, Yuan et al. found that 87% of the transcription factor binding sites [33] are free of nucleosome binding. A substantial improvement over this work has recently been made by Lee et al. [48], where the genome-wide nucleosome positions in yeast have been mapped at 4 bp resolution. A similar approach has also been used to look at differences in nucleosome spacing occurring in the absence of a chromatin remodeler [86]. A number of other groups have developed analysis methods to detect nucleosomes as well as transcription factor binding sites [10, 40, 45, 91, 43, 44, 55, 88]. Compared to transcription factors, it is more challenging to detect nucleosome positions, since the majority of a eukaryotic genome is wrapped into nucleosomes. Another difficulty is that the raw data may contain complex trends that are unrelated to nucleosome binding [90]. An intuitive method to deconvolve the data trend is to define a peak-to-trough difference measure and to detect its local maxima. However, Yuan et al. [90] have found that although this method can detect local peaks, it suffers from amplifying observation noise. A similar approach has been adapted in [60] to map nucleosome positions in human. Although an intrinsic DNA code for nucleosome positioning has recently been reported [69], a significant technological development in the genome-wide location of nucleosomes has been made using "deep sequencing" approaches [2, 4, 56, 41], which differ from microarray-based approaches in that the isolated DNA of interest is mapped to the genome via direct DNA sequencing, instead of microarray hybridization. For this new technology, the input data correspond to peaks of DNA fragment counts instead of high hybridization ratios.
However, the task of peak detection remains a key problem for the statistical analysis of the input data. Unlike microarray-based approaches, where data collection is constrained to a regular grid, "deep sequencing" data are intrinsically at base-pair resolution and therefore less statistically stable. One solution to this problem is to first map the data onto a regular grid by binning. However, more sophisticated methods need to be developed to balance the resolution vs. variance dilemma. The analysis of stochastic signals aims both to extract significant patterns from a noisy background and to study their spatial relations (periodicity, long-term variation, bursts, etc.). The problem becomes more complex whenever the noise background is structured and unknown. Examples of such data are protein sequences in the study of folding [21] and the positioning of nucleosomes along chromatin in the study of gene expression [90]. The analysis carried out in both cases has been based on probabilistic networks [39] (for example, Hidden Markov Models [26] and Bayesian networks). Methods based on probabilistic networks are suitable for the analysis of this kind of signal data; however, they suffer from high computational complexity, and results can be biased by the locality that depends on the memory steps they use [90, 21]. The next section presents an approach that takes advantage of the MLA, and its comparison with the proposed HMM-based method. The main advantage of MLA over HMM is its scalability, which produces a significant reduction in computational time. In this case study, in particular, the performance of the two methods was evaluated on both synthetic and microarray-based nucleosome positioning data, together with their ability to recover distinct nucleosome configurations.
These configurations could underlie important regulatory roles, highlighting the impact of these methodologies on genome-wide nucleosome positioning studies in higher eukaryotes.

3.3.1 The microarray and the signal

The following describes the microarray structure designed and used at the Bauer Center for Genomics Research, Harvard University [90]. As mentioned before, a DNA microarray was used to extract the sequences corresponding to nucleosomes and those corresponding to the linkers, in order to identify the nucleosomes on a genomic scale. In particular, the microarray data S are organized in T contiguous fragments S_1, ..., S_T which represent DNA subsequences. In order to obtain the signal on which the subsequent processing is performed, the following procedure is needed. First, the DNA wrapped in the nucleosomes is isolated and labeled with a green fluorescent dye (the entire genomic DNA of the organism is labeled; chromatin is then digested with a particular enzyme that cuts in the linker regions between nucleosomes but leaves intact the DNA around the nucleosome). At the same time, the genomic DNA is labeled with a red fluorescent dye. At this point there is a competitive hybridization: if both probes are hybridized in equal proportions, a yellow spot is obtained; a red spot if the probe with the red marker is the more hybridized; otherwise a green spot. As a result, red or green spots are obtained, as can be seen in Figure 3.8.

Figure 3.8: Microarray probes.

In particular, in such data, each spot corresponds to a sequence of 50 base pairs. These sequences are overlapped by 30 base pairs in order to obtain a final resolution of 20 base pairs. With this resolution a nucleosome, which occupies about 150 base pairs, will correspond to about 6-8 probes on the microarray.
These nucleosomes are called well-positioned nucleosomes. There is also a class of delocalized nucleosomes, which can occupy multiple positions due to thermodynamic factors, or which can correspond to segments that may come from cells in different states.

The next step is to excite the two dyes with a laser scanner, using different wavelengths; in this way a separate scanning of the red and green channels is obtained. To see whether the sequences are hybridized or not, their logarithmic ratio has to be considered:

S = log_2(G/R)   (3.1)

This gives a signal whose pattern has peaks in the presence of nucleosomes. An overview of this method and a fragment of this signal are shown in figure 3.9.

Figure 3.9: From microarray to one-dimensional signal

3.3.2 Preprocessing

Before the analysis, the signal coming from the microarray is normalized in order to remove possible measurement errors (bias) and to reduce the influence of cross-hybridization. Normalization is a two-step process:

• the mean and variance of each group of spots is taken into account;
• the cross-hybridization and the entropy of the signal (base sequence) are taken into account.

Cross-hybridization is the hybridization of segments that do not have a perfect match but only a partial one, and which consequently should not be considered. By entropy here is intended the classic definition proposed by Shannon:

E_i = − Σ_{k=1}^{l_i} p_k log p_k   (3.2)

where p_k represents the probability of emission of the k-th symbol, defined over the alphabet of the bases that constitute the DNA (A, T, C, G), and l_i indicates the length of the segments in each spot. The first phase of standardization reduces the bias caused by the different groups in which the hybridization takes place.
In particular, this phase uses the following model:

y_ij = σ_j(μ_i + β_j) + ε   (3.3)

where y_ij represents the logarithmic ratio of the observed value of the i-th probe of the j-th group, μ_i is the desired normalized value, β_j and σ_j are respectively the mean and variance of the j-th group, and ε is an instrumental error term, which is assumed to be independent and to have zero mean.

In the second phase of standardization the objective is at least to reduce the effects of cross-hybridization, since it is considered unavoidable because of the large number of bases considered. In trying to reduce cross-hybridization two factors are considered:

• a specific component that measures the number of small sequences that cross-hybridize with long overlaps with the sequences of the probes;
• an unspecific component that measures the case in which a large number of sequences are weakly cross-hybridized, with small overlaps with the sequences of the probes.

The first component was modeled by a discrete value B_i, which is set to 1 if the sequence of a probe (which, as mentioned before, is 50 bases long) matches another sequence of equal length for at least 30 base pairs (a partial but not negligible match), which would introduce an unwanted positive contribution to the signal of the logarithmic ratio. Otherwise, the value of B_i is set to 0. The second component was modeled with E_i, i.e. the entropy of the i-th sequence present in a probe. The normalized value v_i of the probe i of the group j is then obtained as:

v_i = μ_i + (w_B μ_i + q_B) B_i + (w_E μ_i + q_E) E_i   (3.4)

where w_B, q_B and w_E, q_E are the linear coefficients estimated respectively for the first and second component, obtained by linear regression.
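As a small illustration of Eqs. 3.1, 3.2 and 3.4, the following Python sketch computes the log-ratio, the Shannon entropy of a probe sequence, and the cross-hybridization correction; all numeric values (intensities, regression coefficients) are invented for illustration and are not from the thesis.

```python
import math

def log_ratio(green, red):
    """Eq. 3.1: per-probe signal S = log2(G/R)."""
    return [math.log2(g / r) for g, r in zip(green, red)]

def shannon_entropy(seq):
    """Eq. 3.2: Shannon entropy of a DNA sequence over {A, T, C, G}."""
    counts = {b: seq.count(b) for b in set(seq)}
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def correct_probe(mu, B, E, w_B, q_B, w_E, q_E):
    """Eq. 3.4: add the specific (B) and unspecific (E)
    cross-hybridization components to the normalized value mu."""
    return mu + (w_B * mu + q_B) * B + (w_E * mu + q_E) * E

# Hypothetical probe intensities: equal hybridization gives S = 0.
S = log_ratio([200.0, 400.0], [200.0, 100.0])
E = shannon_entropy("ATCGATCG")  # uniform base usage -> 2 bits
# A probe with no 30-bp partial match (B = 0) keeps only the entropy term.
v = correct_probe(mu=1.0, B=0, E=2.0, w_B=0.1, q_B=0.05, w_E=0.2, q_E=0.1)
```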
3.4 First solution: Hidden Markov Model

In this section a formal definition of HMM is outlined, and then a model topology designed for the particular biological problem of nucleosome identification is given.

The HMM is a statistical signal modeling technique used in various disciplines, such as the alignment of gene sequences, acoustic modeling, speech recognition and OCR techniques [25, 65, 9]. In this model, once the alphabet of symbols that make up the signal is defined, a set of states is defined, each of which is associated with a particular probability distribution over the symbols of the alphabet. It is also necessary to define the probability of transition from one state to another, and the probability distribution of the initial states. In this way the model leads to a weighted graph, where the edge weights represent the probability of transiting from one vertex to an adjacent one. The modeling of the signal can then be seen as a visit on this graph, where every time a vertex is visited, a symbol is produced. A formal definition of HMM will now be given.

Definition (Hidden Markov Model). Let Σ be an alphabet of M symbols. A HMM is a quintuple λ = (N, M, A, B, π) where:

• N is the number of states of the model, indicated by the integers 1, 2, ..., N;
• M is the number of symbols of the alphabet that each state can produce or recognize;
• A = (a_ij) is a matrix called the transition matrix, where a_ij represents the probability of transition from state i to state j, with 1 ≤ i, j ≤ N. This matrix must also satisfy the condition Σ_j a_ij = 1 for every i;
• B is the probability distribution of the observations, where b_j(σ) represents the probability of recognizing or generating the symbol σ ∈ Σ while in state j.
In addition, the condition Σ_{σ∈Σ} b_j(σ) = 1 must hold for every j;

• π is the probability distribution of the initial states, where π_i denotes the probability of starting from state i. In addition, the condition Σ_i π_i = 1 must hold.

The transition matrix A induces a directed graph where nodes represent states and arcs are labeled with their corresponding transition probabilities. The term hidden refers to the fact that, given a sequence of symbols that composes the signal to be modeled and a fixed model, the sequence of states is hidden and not unique, unlike in other models such as Markov Chains [12]. The HMMs can be used, as will be shown in the following paragraphs, both as generators and as recognizers of signals.

3.4.1 HMM as generators

A HMM can be used to generate a sequence of Σ*. Let X = x_1 x_2 ... x_T ∈ Σ*. This sequence can be generated by a sequence of states Q = q_1 q_2 ... q_T as follows:

1. Set i ← 1 and choose the state q_i according to the probability distribution π of the initial states;
2. Assuming to be in the state q_i (having already generated x_1 x_2 ... x_{i−1}), produce in output x_i according to the probability distribution b_{q_i};
3. If i < T, then set i ← i + 1, move to the state q_{i+1} according to the transition probabilities of row q_i of A, and repeat step 2; otherwise end.

The probability of observing X = x_1 x_2 ... x_T together with the sequence of states Q = q_1 q_2 ... q_T is:

P(X, Q) = π_{q_1} b_{q_1}(x_1) ∏_{i=2}^{T} a_{q_{i−1} q_i} b_{q_i}(x_i)   (3.5)

This probability is often not very useful, because it is unknown which sequence of states has produced the string X (since multiple sequences of states can generate it). Algorithms that solve this problem will be shown later.
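The three generation steps above can be sketched as follows (a toy two-state model with invented probabilities; not the thesis' implementation):

```python
import random

def generate(pi, A, B, symbols, T, seed=0):
    """Generate a length-T sequence from an HMM (pi, A, B): draw the
    initial state from pi, emit a symbol from the current state's
    distribution, then transition according to the state's row of A."""
    rng = random.Random(seed)
    X = []
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(T):
        X.append(rng.choices(symbols, weights=B[state])[0])
        state = rng.choices(range(len(pi)), weights=A[state])[0]
    return "".join(X)

# Toy model over {a, b}: state 0 prefers 'a', state 1 prefers 'b'.
pi = [1.0, 0.0]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.9, 0.1], [0.1, 0.9]]
X = generate(pi, A, B, symbols=["a", "b"], T=10)
```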
3.4.2 HMM as recognizers

A HMM can be used as a probabilistic validator of a sequence of Σ*, because it returns a measure, in terms of probability mass, of how well a HMM recognizes or observes X. This probability is defined as:

P(X | λ) = ∏_{t=1}^{T} Σ_{i=1}^{N} P(q_t = i) b_i(x_t),
with P(q_t = j) = π_j if t = 1, and P(q_t = j) = Σ_{i=1}^{N} P(q_{t−1} = i) a_ij b_i(x_{t−1}) otherwise.   (3.6)

As mentioned earlier, the HMM, through the transition matrix A, induces a multipartite graph. This graph can be represented as a matrix with N rows, which correspond to the N states of λ, where for all t ≥ 1 the columns t and t + 1 form a complete bipartite graph, with arcs directed from vertices in column t to vertices in column t + 1 (1 ≤ t ≤ T − 1). The recognition consists of superimposing X over all possible paths of length T in this graph (which is called a trellis), starting from the vertices in column 1. For a given vertex i in column t on a given path, the measure of how well it is possible to recognize the symbol x_t consists of two parts: the probability of being in the state, P(q_t = i), and the probability that the state emits the symbol x_t, given by b_i(x_t).

3.4.3 Problems related to HMM

Given an HMM model λ, three main issues are considered:

1. Given a sequence of observations X = x_1 x_2 ... x_T ∈ Σ* and a model λ = (N, M, A, B, π), calculate the probability of observing the sequence X using the model λ, i.e. P(X | λ);
2. Given a sequence of observations X = x_1 x_2 ... x_T ∈ Σ* and a model λ = (N, M, A, B, π), choose the corresponding sequence of states Q = q_1 q_2 ... q_T that best explains the observations, using the model λ and an optimization criterion;
3. Calculate the values of the model parameters λ = (N, M, A, B, π) in order to maximize P(X | λ).
The first problem is solved efficiently by an algorithm called the forward procedure, the second by the Viterbi algorithm, and the third by the Baum-Welch algorithm.

3.4.4 Forward procedure

By using this algorithm, it is possible to calculate P(X | λ) in O(N × T × δ_max), where δ_max is the maximum degree among all HMM states. This algorithm uses dynamic programming and considers a variable α_t(i) defined as:

α_t(i) = P(x_1 x_2 ... x_t, q_t = i | λ)   (3.7)

that is, the probability of observing the partial sequence x_1 x_2 ... x_t and reaching the state i at time t. The procedure consists of three phases:

• Initialization:
α_1(i) = π_i b_i(x_1), 1 ≤ i ≤ N   (3.8)

• Induction:
α_{t+1}(j) = (Σ_{i=1}^{N} α_t(i) a_ij) b_j(x_{t+1}), 1 ≤ t ≤ T − 1, 1 ≤ j ≤ N   (3.9)

• Termination:
P(X | λ) = Σ_{i=1}^{N} α_T(i)   (3.10)

In figure 3.10 the single steps that allow the calculation of α_{t+1}(j) are shown.

Figure 3.10: Forward procedure

The number of possible paths grows exponentially with the length of the sequence, so it is not possible, in many applications, to consider all paths. For this reason a good approximation is to consider only the probability of the most likely path. There is also a variant of this algorithm that, at the end of the computation, calculates the same probability starting from the possible terminal states used to recognize (or generate) the sequence X. This variant, which is called the backward procedure, like the forward procedure uses a variable β_t(i) defined as:

β_t(i) = P(x_{t+1} x_{t+2} ... x_T | q_t = i, λ)   (3.11)

which represents the probability, at time t, of observing the partial sequence from time t + 1 until the end, being in the state i under the assumption of the model λ. In figure 3.11 the single steps that allow the calculation of β_t(i) are shown.
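The forward and backward procedures can be sketched as follows (a minimal Python illustration on an invented single-state model; not the thesis' code):

```python
def forward(pi, A, B, X):
    """Forward procedure (Eqs. 3.8-3.10): returns alpha_t(i) for every
    t, so that P(X | lambda) = sum_i alpha_T(i)."""
    N = len(pi)
    alphas = [[pi[i] * B[i][X[0]] for i in range(N)]]            # (3.8)
    for x in X[1:]:                                              # (3.9)
        prev = alphas[-1]
        alphas.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][x]
                       for j in range(N)])
    return alphas

def backward(pi, A, B, X):
    """Backward variant (Eq. 3.11): beta_t(i), with beta_T(i) = 1."""
    N, T = len(pi), len(X)
    betas = [[1.0] * N]
    for t in range(T - 2, -1, -1):
        nxt = betas[0]
        betas.insert(0, [sum(A[i][j] * B[j][X[t + 1]] * nxt[j]
                             for j in range(N)) for i in range(N)])
    return betas

# Toy model with a single state: P(X) is just the product of emissions.
alphas = forward([1.0], [[1.0]], [[0.5, 0.5]], [0, 1, 0])
p = sum(alphas[-1])                                              # (3.10)
```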
Figure 3.11: Backward procedure

3.4.5 Viterbi algorithm

The Viterbi algorithm provides an efficient solution to the second problem of HMM, i.e. computing the optimal sequence of states for the recognition of the sequence X with the model λ. The term "optimal" depends on the particular problem under examination. In any case, one of the most used criteria is to find the best sequence of states that generates X by maximizing P(Q | X, λ), or equivalently P(Q, X | λ). The Viterbi algorithm uses dynamic programming and computes:

• δ_t(i) = max_{q_1 q_2 ... q_{t−1}} P(q_1 q_2 ... q_{t−1}, q_t = i, x_1 x_2 ... x_t | λ), i.e. the probability of the most likely path that takes into account the first t observations and that ends in state i (denoted δ_t here to avoid a clash with the backward variable β_t);
• γ_t(i), which represents the state that leads to the state i at time t.

The procedure consists of four phases:

1. Initialization:
δ_1(i) = π_i b_i(x_1), γ_1(i) = 0, with 1 ≤ i ≤ N   (3.12)

2. Induction:
δ_t(j) = max_{1≤i≤N} {δ_{t−1}(i) a_ij} b_j(x_t), with 2 ≤ t ≤ T
γ_t(j) = argmax_{1≤i≤N} {δ_{t−1}(i) a_ij}, with 1 ≤ j ≤ N   (3.13)

3. Termination:
P(Q | X, λ) = max_{1≤i≤N} {δ_T(i)}
q_T = argmax_{1≤i≤N} {δ_T(i)}   (3.14)

4. Backtracking:
q_t = γ_{t+1}(q_{t+1}), t = T − 1, ..., 1   (3.15)

This algorithm has a computational cost of O(N × T × δ_max), where δ_max represents the maximum degree of the graph induced by the transition matrix of λ. Again, as in the forward procedure, the number of possible paths grows exponentially with the length of the sequence, making this method not always feasible in the case of large amounts of data.

3.4.6 Baum-Welch algorithm

The calculation of the values of the model parameters λ = (N, M, A, B, π) that maximize P(X | λ) is not an easy task.
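Stepping back to the Viterbi algorithm of the previous subsection, the recursion of Eqs. 3.12-3.15 can be sketched in Python (a toy two-state model with invented probabilities; not the thesis' implementation):

```python
def viterbi(pi, A, B, X):
    """Viterbi algorithm (Eqs. 3.12-3.15): most likely state sequence
    for the observations X (symbol indices) and its probability."""
    N = len(pi)
    delta = [pi[i] * B[i][X[0]] for i in range(N)]           # (3.12)
    back = []
    for x in X[1:]:                                          # (3.13)
        prev = delta
        delta, ptr = [], []
        for j in range(N):
            i_best = max(range(N), key=lambda i: prev[i] * A[i][j])
            ptr.append(i_best)
            delta.append(prev[i_best] * A[i_best][j] * B[j][x])
        back.append(ptr)
    q = [max(range(N), key=lambda i: delta[i])]              # (3.14)
    p_best = delta[q[0]]
    for ptr in reversed(back):                               # (3.15)
        q.append(ptr[q[-1]])
    return list(reversed(q)), p_best

# Two states: state 0 almost surely emits symbol 0, state 1 symbol 1.
pi = [0.5, 0.5]
A = [[0.8, 0.2], [0.2, 0.8]]
B = [[0.9, 0.1], [0.1, 0.9]]
path, p = viterbi(pi, A, B, [0, 0, 1, 1])
```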
In fact, there is no analytical method that solves the problem by maximizing the probability of observing the sequence: given a finite sequence as a training set, there is no perfect way to estimate the parameters of the model. However, it is possible to derive a model λ̄ = (N, M, Ā, B̄, π̄) so that P(X | λ̄) is locally maximized, using an iterative procedure. The best-known iterative procedure that solves this problem is the Baum-Welch algorithm. To describe how this algorithm works, first define the function:

ξ_t(i, j) = P(q_t = i, q_{t+1} = j | X, λ)   (3.16)

i.e. the probability of being in state i at time t and in state j at time t + 1, given the model and the sequence of observations X. The sequence of events leading to the conditions required by this variable is shown in figure 3.12.

Figure 3.12: Baum-Welch algorithm

Looking at the definition of the variables used in the forward and backward procedures, it is possible to rewrite:

ξ_t(i, j) = α_t(i) a_ij b_j(x_{t+1}) β_{t+1}(j) / P(X | λ)
          = α_t(i) a_ij b_j(x_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(x_{t+1}) β_{t+1}(j)   (3.17)

where the numerator is simply the probability P(q_t = i, q_{t+1} = j, X | λ). Previously α_t(i) was defined as the probability of being in state i at time t while observing the partial sequence x_1 x_2 ... x_t. Let us now see how the probability of being in state i at time t, given the whole observation sequence, can be expressed in terms of ξ_t(i, j):
γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j) = P(q_t = i | X, λ)   (3.18)

(This quantity is denoted γ_t(i) here to distinguish it from the forward variable α_t(i).) Summing the functions γ_t(i) and ξ_t(i, j) over t, it is possible to obtain:

Σ_{t=1}^{T−1} γ_t(i) = expected number of transitions from state i   (3.19)

Σ_{t=1}^{T−1} ξ_t(i, j) = expected number of transitions from state i to state j   (3.20)

Using the formulas defined above, the method for estimating the parameters of a HMM with the Baum-Welch procedure can now be shown. Reasonable estimates for the parameters are:

π̄_i = expected number of times of being in state i at time t = 1 = γ_1(i)   (3.21)

ā_ij = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i)
     = (expected number of transitions from state i to state j) / (expected number of transitions from state i)   (3.22)

b̄_j(k) = Σ_{t=1, x_t=v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
       = (expected number of times of being in state j and observing the symbol v_k) / (expected number of times of being in state j)   (3.23)

These equations can be used to develop an iterative process that, starting from a model λ = (N, M, A, B, π), estimates at each step a new model λ̄ = (N, M, Ā, B̄, π̄). In addition, it can be proven that:

• the model λ̄ represents a critical point of the likelihood function in the case λ̄ = λ;
• the model λ̄ is better than the model λ, meaning that the probability of observing X given the model λ̄ is at least the probability of observing X given the model λ, i.e. P(X | λ̄) ≥ P(X | λ).

These two statements tell us that this procedure converges to a critical point. This is done by iteratively using the model λ̄ instead of λ and repeating the parameter estimation process, gradually increasing the likelihood of the observations of the training sequence, until a critical point is reached.
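One full re-estimation step built from ξ and γ can be sketched as follows (the model and observations are invented toy values; not the thesis' code):

```python
def forward_backward(pi, A, B, X):
    """Compute the forward (alpha) and backward (beta) variables."""
    N, T = len(pi), len(X)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][X[0]]
        beta[T - 1][i] = 1.0
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][X[t]]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][X[t+1]] * beta[t+1][j] for j in range(N))
    return alpha, beta

def baum_welch_step(pi, A, B, X):
    """One Baum-Welch re-estimation (Eqs. 3.17-3.23): returns the
    updated (pi, A, B) computed from xi and gamma."""
    N, T, M = len(pi), len(X), len(B[0])
    alpha, beta = forward_backward(pi, A, B, X)
    pX = sum(alpha[T-1][i] for i in range(N))
    # gamma[t][i] = P(q_t = i | X, lambda), Eq. 3.18
    gamma = [[alpha[t][i] * beta[t][i] / pX for i in range(N)] for t in range(T)]
    # xi[t][i][j], Eq. 3.17
    xi = [[[alpha[t][i] * A[i][j] * B[j][X[t+1]] * beta[t+1][j] / pX
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]                                            # (3.21)
    new_A = [[sum(xi[t][i][j] for t in range(T-1)) /
              sum(gamma[t][i] for t in range(T-1))                  # (3.22)
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if X[t] == k) /
              sum(gamma[t][j] for t in range(T))                    # (3.23)
              for k in range(M)] for j in range(N)]
    return new_pi, new_A, new_B

# Toy two-state, two-symbol model.
pi0 = [0.6, 0.4]
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.5, 0.5], [0.1, 0.9]]
X = [0, 1, 0, 0, 1]
new_pi, new_A, new_B = baum_welch_step(pi0, A0, B0, X)
```

One iteration never decreases the likelihood of the training sequence, which is the convergence property stated above.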
The end result of this procedure is called the maximum likelihood estimate of the HMM. It is important to underline that this algorithm leads to a local maximum point, and in many real applications the surface to optimize is very complex and has many local maxima. The formulas to estimate the parameters can also be derived directly from Baum's auxiliary function Q(λ, λ̄) with respect to λ̄; this function is defined as:

Q(λ, λ̄) = Σ_Q P(Q | X, λ) log P(X, Q | λ̄)   (3.24)

It can also be proven that the maximization of this function increases the likelihood:

max_λ̄ Q(λ, λ̄) ⇒ P(X | λ̄) ≥ P(X | λ)   (3.25)

3.4.7 The proposed HMM for nucleosome positioning

As mentioned earlier, in [90] the problem of identifying nucleosomes was addressed using data from a process of microarray hybridization and modeling the observations with a particular HMM. This is because a simple thresholding technique does not have sufficient accuracy, due to noise and trend in the data. The model proposed for the detection of nucleosomes in chromatin regions is shown in figure 3.13. In this model, several different states for different types of nucleosomes, with special connections, are considered; in particular, the states model the sequences of chromosomes corresponding to a linker (state L), to well-positioned nucleosomes (states N_1, N_2, ..., N_8) and to delocalized nucleosomes (states DN_1, DN_2, ..., DN_9). The values of the measures that can be observed by each state correspond to the physical values that the system outputs, which in this case represent the logarithmic ratio between the intensity of red and green for each spot of the microarray. The transition matrix, which establishes the allowed transitions between states and their probabilities, is estimated with the Baum-Welch algorithm together with the other parameters.
In this model there is only one state that represents the class of probes corresponding to linker regions, and this state has a loop in order to model variable-length linker regions. The number of states for the class of well-positioned nucleosomes in this model is 8. This choice is justified by the length of a nucleosome in normal conditions (about 6-8 probes). In this way, the information about the expected length of a nucleosome is encoded in the model. Similarly, the number of states for the class of delocalized nucleosomes is 9, and the last state has a loop (similar to the linker state) in order to model the different lengths of nucleosome regions that cover a number of probes greater than 9.

Figure 3.13: HMM topology for nucleosome positioning

Finally, well-positioned nucleosomes in this model have a length between 6 and 8 probes, delocalized nucleosomes have a number of probes equal to or greater than 9, and linkers have a variable length greater than or equal to one.

3.5 Second solution: MLA

In this section the application of MLA to the problem of identifying and classifying nucleosomes is described. The following subsections show the various steps that allow the classification of the nucleosomes identified through the MLA and the construction of a model for well-positioned nucleosomes. Firstly, recall that the signal S is divided into segments in which probes may not be contiguous (due to data referring to different regions of chromosomes, or to missing data). In particular, S is organized in T contiguous fragments S_1, ..., S_T which represent DNA sub-sequences.

3.5.1 Preprocessing

In the first stage of processing, a convolution is applied in order to reduce the noise in the signal.
The smoothing is done for each probe segment corresponding to adjacent regions of the signal, i.e. each fragment S_t, 1 ≤ t ≤ T, of the input signal S is smoothed by a convolution operator that performs the weighted average of three consecutive signal values, where the weights are provided by the kernel window w = [1/4, 1/2, 1/4] [52].

3.5.2 Creating the model

The construction of the model represents a training phase, where it is possible to learn the shape of the pattern corresponding to the nucleosome, considering only the regions that correspond with high probability to well-positioned nucleosomes. Since well-positioned nucleosomes show up as peaks of a bell-shaped curve, in order to locate the position of a nucleosome all local maxima are automatically extracted from the convolved signal X of S. Then a subset of maxima is opportunely selected for the model definition. Each convolved fragment X_t is processed in order to find its L(X_t) local maxima M_t^(l), for l = 1, ..., L(X_t). The extraction of the sub-fragment for each M_t^(l) is performed by assigning all values in a window of radius os centered in M_t^(l) to a vector F_t^l of size 2 × os + 1:

F_t^l(j) = X_t(M_t^(l) − os + j − 1), for j = 1, 2, ..., 2 × os + 1.

The selection process extracts the significant sub-fragments to be used in the model definition. This is performed by satisfying the following rule:

F_t^l(j + 1) − F_t^l(j) > 0, j = 1, ..., os
F_t^l(j + 1) − F_t^l(j) < 0, j = os + 1, ..., 2 × os   (3.26)

This condition is equivalent to verifying that the signal in that sub-fragment is increasing to the left of the maximum and decreasing to the right (the convexity condition). If the pattern respects this condition, it will be used for the next phase of construction of the model of the well-positioned nucleosome.
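The smoothing kernel and the convexity test of Eq. 3.26 can be sketched as follows (boundary handling at the fragment ends is an assumption, as the thesis does not specify it):

```python
def smooth(signal, w=(0.25, 0.5, 0.25)):
    """Convolve with the kernel window w = [1/4, 1/2, 1/4]; the two
    endpoint values are kept unchanged (an assumed boundary choice)."""
    out = list(signal)
    for i in range(1, len(signal) - 1):
        out[i] = w[0] * signal[i-1] + w[1] * signal[i] + w[2] * signal[i+1]
    return out

def bell_subfragments(X_t, os):
    """Extract windows of radius os around local maxima and keep those
    satisfying Eq. 3.26: strictly increasing before the maximum,
    strictly decreasing after it."""
    kept = []
    for m in range(os, len(X_t) - os):
        if X_t[m-1] < X_t[m] > X_t[m+1]:            # local maximum
            F = X_t[m-os:m+os+1]                    # 2*os + 1 values
            up = all(F[j+1] - F[j] > 0 for j in range(os))
            down = all(F[j+1] - F[j] < 0 for j in range(os, 2 * os))
            if up and down:
                kept.append(F)
    return kept

# A clean bell followed by a flat stretch: only the bell passes Eq. 3.26.
sig = [0, 1, 2, 3, 2, 1, 0, 0, 0]
frags = bell_subfragments(sig, os=2)
```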
The process continues in a similar way for the other points of relative maximum (if present) in the considered segment, in descending order. After this selection process, G(X_t) sub-fragments remain for each X_t. The model of the interesting pattern is then defined by considering the following average:

F(j) = (1/T) Σ_{t=1}^{T} (1/G(X_t)) Σ_{k=1}^{G(X_t)} F_t^k(j), j = 1, ..., 2 × os + 1   (3.27)

that is, for each j, the average value of all the sub-fragments satisfying Eq. 3.26. The model then represents the average pattern of a well-positioned nucleosome through its expected shape. Applying this procedure, the model shown in figure 3.15(a) is obtained by averaging the patterns in figure 3.14(b).

Figure 3.14: Patterns that meet the condition of convexity

3.5.3 Interval identification

This step is the core of the method, i.e. the interval identification obtained by the Simply Equally Spaced MLA presented in chapter 2. In particular, by considering K threshold levels t_k (k = 1, ..., K) of the convolved signal X, for each t_k a set of intervals R_k = {I_k^1, I_k^2, ..., I_k^{n_k}} is obtained, where I_k^i = [b_k^i, e_k^i] and X(b_k^i) = X(e_k^i) = t_k. This set of intervals, as explained in chapter 2, constitutes the interval representation Υ(X) of the input signal X. In section 3.5.9 a calibration procedure to select the proper value of K is described.

Figure 3.15: Model of well-positioned nucleosome

3.5.4 Aggregation rule and pattern definition

This step is performed by taking into account that bell-shaped patterns must be extracted for the classification phase.
Such kinds of patterns are characterized by sequences of intervals {I_j^{i_j}, I_{j+1}^{i_{j+1}}, ..., I_{j+l}^{i_{j+l}}} such that I_k^{i_k} ⊇ I_{k+1}^{i_{k+1}}; more formally, a pattern P_i is defined using the following aggregation rule:

P_i = {I_j^{i_j}, I_{j+1}^{i_{j+1}}, ..., I_{j+l}^{i_{j+l}} | ∀ I_k^{i_k} ∃! I ∈ R_{k+1} : I = I_{k+1}^{i_{k+1}} ⊆ I_k^{i_k}}   (3.28)

where j defines the threshold t_j of the widest interval of the pattern. From the previous definition it follows that P_i is built by adding an interval I_{k+1}^{i_{k+1}} only if it is the unique interval in R_{k+1} that is included in I_k^{i_k}. Note that this criterion is inspired by the consideration that a nucleosome is identified by a bell-shaped fragment of the signal, and the intersection of such a fragment with horizontal threshold lines results in a sequence of nested intervals. In figure 3.16 two examples of shapes with the relative patterns are shown.

Figure 3.16: Two different shapes of the input signal. (Left) Since at threshold level K + 1 the interval set R_K = {I_K^1} has two subsets, R_{K+1} = {I_{K+1}^1, I_{K+1}^2}, three patterns are set: P_1 = {I_K^1}, P_2 = {I_{K+1}^1} and P_3 = {I_{K+1}^2}. (Right) In this case I_{K+1}^1 is the unique subset of I_K^1, thus a unique pattern P_1 = {I_K^1, I_{K+1}^1} is set.

3.5.5 Pattern selection

In this step the interesting patterns P(m) are selected following the criterion:

P(m) = {P_i : |P_i| > m}   (3.29)

i.e. patterns containing intervals that persist for at least m increasing thresholds. This further selection criterion is related to the height of the bell-shaped fragment: in fact, a small value of m could represent noise rather than nucleosomes.
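The aggregation rule of Eq. 3.28 and the selection of Eq. 3.29 can be sketched as follows, assuming the interval representation is given as lists of index pairs, one list per threshold level (the hard-coded example mimics the two shapes of figure 3.16; this is an illustrative reading of the rule, not the thesis' code):

```python
def contained(inner, outer):
    """True when the interval inner is nested in outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def aggregate(rep):
    """Aggregation rule of Eq. 3.28: a pattern grows across threshold
    levels as long as exactly one interval of the next level is nested
    in its current interval; otherwise it stops and the nested
    intervals seed new single-interval patterns (cf. figure 3.16)."""
    closed = []
    active = [[I] for I in rep[0]] if rep else []
    for R_next in rep[1:]:
        new_active, used = [], set()
        for pat in active:
            inside = [I for I in R_next if contained(I, pat[-1])]
            if len(inside) == 1:
                pat.append(inside[0])
                new_active.append(pat)
                used.add(inside[0])
            else:
                closed.append(pat)
                for I in inside:
                    new_active.append([I])
                    used.add(I)
        for I in R_next:                 # intervals not nested anywhere
            if I not in used:
                new_active.append([I])
        active = new_active
    return closed + active

def select(patterns, m):
    """Pattern selection (Eq. 3.29): keep patterns persisting for more
    than m thresholds."""
    return [p for p in patterns if len(p) > m]

# Level 1 holds one wide interval; level 2 splits it in two; level 3
# refines only the left one.
rep = [[(0, 9)], [(1, 3), (6, 8)], [(2, 3)]]
pats = aggregate(rep)
```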
The value m is called the minimum number of permanences; in subsection 3.5.9 a calibration procedure to estimate the best value of m is described.

3.5.6 Feature extraction

Each pattern P_i ∈ P(m) is identified by I_j^{i_j}, I_{j+1}^{i_{j+1}}, ..., I_{j+l}^{i_{j+l}}, with l ≥ m. Straightforwardly, the feature vector of P_i is a 2 × l matrix where each column represents the lower and upper limits of each interval, from the lower threshold j to the upper threshold j + l. The representation in this multi-dimensional feature space is used to characterize different types of patterns.

3.5.7 Dissimilarity function

A dissimilarity function between patterns is defined in order to characterize their shape:

δ(P_r, P_s) = (1 − α)(A_r − A_s) + α Σ_{i∈I} (a_i^{r_i} − a_i^{s_i})   (3.30)

where A_r and A_s are the surfaces of the two polygons bounded by the set of vertexes V = ∪_{i∈I} {(b_i^{r_i}, e_i^{r_i}), (b_i^{s_i}, e_i^{s_i})}, a_i^{r_i} = e_i^{r_i} − b_i^{r_i}, a_i^{s_i} = e_i^{s_i} − b_i^{s_i}, and α is a user parameter ranging in the interval [0, 1] that sets the weight of the two dissimilarity components. The first component of this dissimilarity allows us to consider patterns of close dimensions, while the second component has been introduced to include shape information, since it can be considered a correlation measure of the two bounding polygons. This dissimilarity can be used by a general classifier in order to distinguish the kind of pattern. An example of an input signal and the extracted interesting patterns is given in figure 3.17.

3.5.8 Nucleosome classification

With the MLA one is able to classify four "refined nucleosomal states": linkers, well-positioned, delocalized and fused nucleosomes (see figure 3.18). In the following, the classification rules which allow us to automatically discriminate such kinds of patterns are stated.
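The dissimilarity of Eq. 3.30, used by the classification rules below, can be sketched as follows; note that here the polygon surfaces A_r and A_s are approximated by summed interval widths, an assumption made for brevity rather than the thesis' exact geometric construction:

```python
def widths_diff(P_r, P_s):
    """Second component of Eq. 3.30: level-by-level difference of the
    interval widths a_i = e_i - b_i of the two patterns."""
    return sum((er - br) - (es - bs)
               for (br, er), (bs, es) in zip(P_r, P_s))

def dissimilarity(P_r, P_s, alpha=0.5):
    """Sketch of Eq. 3.30. The surfaces A_r, A_s of the bounding
    polygons are approximated (assumption) by the sum of the interval
    widths of each pattern."""
    A_r = sum(e - b for b, e in P_r)
    A_s = sum(e - b for b, e in P_s)
    return (1 - alpha) * (A_r - A_s) + alpha * widths_diff(P_r, P_s)

# Identical patterns have zero dissimilarity; a wider pattern compared
# against the same reference gives a positive value.
P = [(0, 10), (2, 8), (4, 6)]
d = dissimilarity(P, P)
```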
The classification is conducted in two steps: in the first step the linker patterns, the expected well-positioned patterns and the expected delocalized patterns are found. Afterwards, the ranges of the regions representing the expected well-positioned and delocalized nucleosomal patterns are set, defining the expected regions. Finally, the classification is performed by testing the intersection of such regions (see figure 3.19).

First phase: for each interesting pattern P_i, the dissimilarity δ(P_i, F) is evaluated (δ is defined in Eq. 3.30, F is the model); the rule to classify P_i is:

c_1(P_i) = L if δ(P_i, F) ≤ φ_1; EW if φ_1 < δ(P_i, F) ≤ φ_2; ED otherwise   (3.31)

where L means linker pattern, and EW and ED are nucleosomal patterns, in particular expected well-positioned patterns and expected delocalized patterns respectively.

Second phase: afterwards, for each expected well-positioned nucleosomal pattern P_i = {I_j^{i_j}, I_{j+1}^{i_{j+1}}, ..., I_{j+l}^{i_{j+l}}} (i.e. c_1(P_i) = EW), the center of the nucleosomal region C_i is calculated:

C_i = (1/l) Σ_{k=j}^{j+l} (e_k^i + b_k^i)/2   (3.32)

which represents the mean of the centers of the intervals defining the pattern P_i. Conversely, for each expected delocalized nucleosomal pattern (i.e. c_1(P_i) = ED), the delocalized interval [B_i, E_i] is defined such that:

B_i = (1/(l/2)) Σ_{k=j}^{j+l/2} b_k^i  and  E_i = (1/(l/2)) Σ_{k=j}^{j+l/2} e_k^i   (3.33)

Note that B_i and E_i represent respectively the mean of the first l/2 beginnings and endings of the intervals belonging to the pattern P_i. The expected region is then defined as:
A_i = [C_i − 3, C_i + 3] if c_1(P_i) = EW; [B_i, E_i] otherwise   (3.34)

In particular, each expected region A_i is, in the case P_i is an expected well-positioned pattern, an interval beginning 3 probes before and ending 3 probes after the center C_i; otherwise it is the interval [B_i, E_i]. Finally, the classification rule is:

c_2(P_i) = F if A_i ∩ A_j ≠ ∅ for some j ≠ i; otherwise W if c_1(P_i) = EW, or D if c_1(P_i) = ED   (3.35)

where F, W and D stand for fused, well-positioned and delocalized nucleosomes respectively (see figure 3.18). Informally, the classification rule in Equation 3.35 assigns the fused class if the expected nucleosomal regions overlap; otherwise it confirms the classification of the first phase.

3.5.9 Parameter selection by calibration

In order to set the proper values of K (the number of thresholds) and m (the minimum number of permanences), a calibration procedure has been used. In particular, such values have been estimated by studying the plots of particular functions able to measure the goodness of several values of K and m.

3.5.9.1 Estimation of m

The minimum number of permanences m has been estimated by using the synthetic signal generator described in subsection 3.5.10. This gives the opportunity to carry out a massive experimental study on the relation between K and m. In particular, c = 10 copies at different signal-to-noise ratios j = 1, 2, 4 have been generated, resulting in a total of 3 × 10 synthetic signals V_ij. Once a signal-to-noise ratio j is fixed, for each V_ij the value of m which maximizes the recognition performance for several numbers of thresholds k = 20, ..., 50 has been found. Figure 3.20 shows the results obtained by considering c = 10 copies, three signal-to-noise ratio values (1, 2, 4), and k = 20, ..., 50 thresholds.
In each plot, the x axis represents the number of thresholds $k$ (i.e. the number of cuts); the column bars group the best recognition and the percentage of minimum number of permanences yielding the best performance over all 10 experiments. From this experimental study it emerges that using a high number of thresholds can compromise the recognition process (a high value of $K$ can also capture the noise present in the signal). Moreover, the best value of $m$ appears not to depend on $K$: the value yielding the best recognition lies in the interval $[0.15 \times K, 0.30 \times K]$.

3.5.9.2 Estimation of K

The proper value of $K$ is estimated starting from the convolved input signal $X$. Given a convolved signal fragment $X_t$, it is resampled in the $y$ direction, yielding several samples $X^{(k)}_t$ for different numbers of thresholds $k = 1, \dots, K_{max}$. It is possible to measure the goodness of $k$ by the average normalized correlation $\epsilon(k)$ and the average missing probes $MS(k)$, defined as:

$$\epsilon(k) = \frac{1}{T} \sum_{t=1}^{T} \frac{1 + \rho(X_t, X^{(k)}_t)}{2} \qquad (3.36)$$

$$MS(k) = \frac{1}{T} \sum_{t=1}^{T} MS(k, t) \qquad (3.37)$$

In particular, $\epsilon(k)$ measures the average normalized correlation between each resample $X^{(k)}_t$ and the corresponding fragment $X_t$ ($\rho$ is the correlation coefficient), while $MS(k)$ is the average of the missing probe values $MS(k, t)$ caused by resampling $X_t$ with $k$ thresholds. Finally, the value of $K$ is selected interactively by looking at the plots of both $\epsilon$ and $MS$, searching for the best compromise between maximum $\epsilon$ and minimum $MS$ (see Figure 3.21). In this way the resampled signal has a high correlation with the original signal and a reasonable number of missing samples, so as not to capture the noise present in the signal.
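As a concrete illustration, the two calibration measures of Eqs. 3.36 and 3.37 can be sketched in a few lines of Python. This is a minimal sketch, not the thesis implementation: it assumes each fragment and its resample are equal-length one-dimensional arrays, and all names are hypothetical.

```python
import numpy as np

def avg_norm_corr(fragments, resamples):
    """epsilon(k): mean of (1 + rho(X_t, X_t^(k))) / 2 over fragments (Eq. 3.36).
    The (1 + rho) / 2 mapping normalizes the correlation from [-1, 1] to [0, 1]."""
    vals = []
    for x, xk in zip(fragments, resamples):
        rho = np.corrcoef(x, xk)[0, 1]
        vals.append((1.0 + rho) / 2.0)
    return float(np.mean(vals))

def avg_missing(ms_counts):
    """MS(k): average number of missing probes over the T fragments (Eq. 3.37)."""
    return float(np.mean(ms_counts))

# Toy check: a resample identical to its fragment gives epsilon = 1.
x = np.array([0.1, 0.4, 0.9, 0.3, 0.2])
print(round(avg_norm_corr([x], [x]), 6))  # 1.0
```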
3.5.10 Synthetic generation of biological signals

Before validating the MLA approach on biological data, a procedure to generate synthetic signals was developed in order to assess the feasibility of the method on controlled data. The generated signals emulate those coming from a tiling microarray, where each spot represents a probe $i$ of resolution $r$ base pairs overlapping by $o$ base pairs with probe $i+1$. In particular, the chromosome is spanned by moving a window (probe) $i$ of width $r$ base pairs from left to right, measuring both the percentage of mononucleosomal DNA $G_i$ (green channel) and whole genomic DNA $R_i$ (red channel) within the window, with two consecutive windows (probes) overlapping by $o$ base pairs. The resulting signal $V(i)$ for each probe $i$ is the logarithmic ratio of the green channel $G_i$ to the red channel $R_i$. Intuitively, the presence of nucleosomes is related to peaks of $V$, which correspond to higher logarithmic ratio values, while lower ratio values indicate nucleosome-free regions called linker regions. This genomic tiling microarray approach takes inspiration from the work of Yuan et al. [90], where the authors used the same methodology on Saccharomyces cerevisiae DNA. A model able to generate such signals is defined here, characterized by the following parameters:

• nn: the number of nucleosomes to add to the synthetic signal;
• nl: the length of a nucleosome (in the real case a nucleosome covers 146 base pairs);
• λ: the mean of the Poisson distribution used to model the expected distances between adjacent nucleosomes;
• r: the resolution of a single microarray probe;
• o: the length in base pairs of the overlapping zone between two consecutive probes;
• nr: the number of spotted copies (replicates) of nucleosomal and genomic DNA on each probe of the microarray;
• dp: the percentage of delocalized nucleosomes over the total number of nucleosomes;
• dr: the range, in base pairs, that limits the delocalization of a nucleosome in each of the nr copies;
• nsv: the variance of the green channel in each probe, present even in the absence of nucleosomes due to cross-hybridization; this noise follows a normal distribution with mean 0.1;
• pur: the percentage of DNA purification, i.e. the probability that each single DNA fragment of the nr copies appears in the microarray hybridization;
• ra: the relative abundance of nucleosomal over genomic DNA;
• SNR: the linear signal-to-noise ratio of the synthetic signal to generate; the noise is assumed to be Gaussian.

Initially, a binary mask signal $M$ is generated by marking with 1's all the base pairs representing a nucleosome (the nucleosomal regions) and with 0's the regions representing linkers (the linker regions). The beginning of each nucleosomal region is established by the Poisson distribution with mean $\lambda$. The mask signal $M$ will be used to validate the classification results. The red channel of the microarray (the genomic channel) results from the generation of $nr$ replicates $I^R_1, \dots, I^R_{nr}$, each one starting with an initial nucleosomal region of random size $b \sim U(0, r)$ (uniformly distributed in the range $[0, r]$), followed by contiguous nucleosomal regions of $r$ base pairs. Conversely, in order to simulate the green channel (the nucleosomal channel), $nr$ replicates $I^G_1, \dots, I^G_{nr}$ are considered, each one initially equal to $M$ and subsequently modified by perturbing each starting point $x^i_D$ of the nucleosomes to be considered delocalized, so that $x^i_D = x^i_D + \mu$ with random $\mu \sim U(dr)$.
Note that the percentage of nucleosomes to be considered delocalized is established by the parameter $dp$. Afterwards, each nucleosomal region of the generic replicates $I^R_i$ and $I^G_i$ can be switched off depending on the value of a random variable $\alpha \sim U(0, 1)$: each nucleosomal region verifying the test $\alpha < pur$ is kept and set to 1, otherwise it is discarded and set to 0. This results in new replicates $T^R_i$ and $T^G_i$. Finally, the generated synthetic signal $V$ for a probe $i$ is defined as:

$$V(i) = \left\{ \log_2\!\left( \sum_{j=1}^{nr} \frac{T^G_j(k) \cdot ra}{T^R_j(k)} + \varepsilon \right) \;\middle|\; (r - o)i - r + o + 1 \le k \le (r - o)i + o \right\} \qquad (3.38)$$

where $\varepsilon \sim N(0.1, nsv)$. Figure 3.22 shows the steps of this process.

3.6 Results

The following experiments have been carried out by measuring the correspondence between nucleosome and linker regions. In the case of the synthetic signal, the output of the classifier has been compared with a mask $M'$ derived from $M$, while in the case of the real data set it has been compared with the output of the HMM for nucleosome positioning (see Section 3.4), optimally converted into a binary string. In all the experiments the same pair of thresholds

$$(\varphi_1, \varphi_2) = \left( \operatorname{mean}(\delta(F_{l_t}, F)) - 3\,\operatorname{std}(\delta(F_{l_t}, F)),\; \operatorname{mean}(\delta(F_{l_t}, F)) + 3\,\operatorname{std}(\delta(F_{l_t}, F)) \right)$$

has been considered, where the $F_{l_t}$ are all the sub-fragments used in the construction of the model $F$. Moreover, based on biological considerations, the radius $os$ has been set to $os = 4$. The performance has been evaluated in terms of Recognition Accuracy (RA). The RA uses a new mask $M'$ obtained by converting $M$ into probe coordinates, such that a probe value is set to 1 (i.e. it shows a nucleosome portion) if the corresponding base pairs in $M$ include at least one 1.
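The per-probe computation of Eq. 3.38 can be sketched as follows. This is a simplified illustration, not the thesis implementation: it assumes the sum runs over per-replicate green/red ratios, that the Gaussian term is added inside the logarithm, and that the window indexing translates directly to 0-based arrays; all names are hypothetical.

```python
import numpy as np

def probe_signal(TG, TR, ra, nsv, r, o, n_probes, rng):
    """Sketch of Eq. 3.38: for probe i, the log2 of the summed green/red
    replicate ratios plus a Gaussian term eps ~ N(0.1, nsv), over the window
    of base pairs covered by the probe. TG, TR: (nr, length) arrays."""
    V = []
    for i in range(1, n_probes + 1):
        lo = (r - o) * i - r + o          # 0-based window start (width r)
        hi = (r - o) * i + o              # exclusive window end
        ratios = (TG[:, lo:hi] * ra / TR[:, lo:hi]).sum(axis=0)
        eps = rng.normal(0.1, np.sqrt(nsv), ratios.shape)
        V.append(np.log2(ratios + eps))
    return V

# Toy check with constant replicates and zero noise variance (eps = 0.1 exactly).
rng = np.random.default_rng(0)
TG = np.ones((2, 110)); TR = np.ones((2, 110))
V = probe_signal(TG, TR, ra=4, nsv=0.0, r=50, o=20, n_probes=3, rng=rng)
print(len(V), V[0].shape)  # 3 (50,)
```

With these toy inputs each window value is $\log_2(2 \cdot 4 + 0.1)$, so the sketch is easy to verify by hand.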
The real nucleosomal (linker) regions $RNR$ ($RLR$) are represented in $M'$ as contiguous sequences of 1's (0's). A nucleosomal (linker) region $CNR$ ($CLR$) is classified correctly if there is a match of at least $l = 0.7 \times L$ contiguous 1's (0's) between $CNR$ ($CLR$) and the corresponding $RNR$ ($RLR$) in $M'$, where $L$ is the length of $RNR$ ($RLR$). The value 0.7 has been chosen because a 70% overlap of regions is very unlikely to be due to chance.

3.6.1 MLA vs HMM on synthetic nucleosome positioning data

For MLA, $K = 20$ and $m = 5$ were chosen in the calibration phase; the value of $\alpha$ in Eq. 3.30 has been set to 0.5 to equally balance the two components of the dissimilarity. In particular, 6 signals of length ranging from 2337 probes (70130 base pairs) to 2361 probes (70850 base pairs) have been generated for the signal-to-noise ratio values $1, 2, 4, 6, 8, 10$. The confusion matrices of HMM and MLA for all the experiments are reported in Tables 3.1 and 3.2, and the total RA for all the experiments is summarized in Figure 3.23, which shows that the HMM is slightly more accurate in finding the bounds of the nucleosome regions. The synthetic results can be summarized in an overall RA of 0.96 for MLA and 0.98 for HMM.

          snr = 1               snr = 2
        L       N             L       N
  L     0.82    0.18    L     0.96    0.04
  N     0.03    0.97    N     0.01    0.99

          snr = 4               snr = 6
        L       N             L       N
  L     1       0       L     1       0
  N     0       1       N     0       1

          snr = 8               snr = 10
        L       N             L       N
  L     0.99    0.01    L     1       0
  N     0       1       N     0       1

Table 3.1: Confusion matrices of HMM at 6 different signal-to-noise ratios for nucleosome (N) and linker (L) regions.
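The region-matching criterion underlying the RA (a contiguous overlap of at least $0.7 \times L$ probes) can be sketched as follows; the interval representation is a simplifying assumption and all names are hypothetical.

```python
def region_match(classified, real, frac=0.7):
    """A classified region matches a real region of length L if their overlap
    covers at least frac * L contiguous probes (inclusive probe coordinates)."""
    b1, e1 = classified
    b2, e2 = real
    L = e2 - b2 + 1                          # length of the real region
    overlap = min(e1, e2) - max(b1, b2) + 1  # contiguous shared probes
    return overlap >= frac * L

print(region_match((10, 20), (12, 21)))  # True: 9 of the 10 real probes overlap
print(region_match((10, 13), (12, 21)))  # False: only 2 of 10 overlap
```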
          snr = 1               snr = 2
        L       N             L       N
  L     0.81    0.19    L     0.88    0.12
  N     0.04    0.96    N     0       1

          snr = 4               snr = 6
        L       N             L       N
  L     0.94    0.06    L     0.96    0.04
  N     0.01    0.99    N     0       1

          snr = 8               snr = 10
        L       N             L       N
  L     0.96    0.04    L     0.97    0.03
  N     0       1       N     0       1

Table 3.2: Confusion matrices of MLA at 6 different signal-to-noise ratios for nucleosome (N) and linker (L) regions.

3.6.2 MLA vs HMM on real data

In this experiment, the agreement of the two models has been compared on real Saccharomyces cerevisiae data. The input signal representing this data is composed of 215 contiguous fragments, for a total of 24167 base pairs. In this experiment, $K = 40$ and $m = 6$ were chosen by the calibration phase ($m = 0.15 \times 40$), and $\alpha = 0.5$ was used to equally balance the two components of the dissimilarity (see the definition in Eq. 3.30). The confusion matrices showing the RA of HMM when MLA is taken as the true classification, and the RA of MLA when HMM is taken as the true classification, are reported in Table 3.3. The results can be summarized in an overall RA of 0.83 for HMM (MLA true) and 0.69 for MLA (HMM true). In particular, from this study it is possible to conclude that MLA does not fully agree with HMM on the linker patterns. Remarkably, comparing MLA and HMM on data coming from the recently developed deep sequencing approach (DS) [2], a better agreement is observed with MLA (0.58) than with HMM (0.44) (Table 3.4 and Figure 3.24). These analyses indicate that integrating HMM and MLA could improve the overall classification.

      HMM (MLA true)        MLA (HMM true)
        L       N             L       N
  L     0.79    0.21    L     0.52    0.47
  N     0.13    0.87    N     0.12    0.87

Table 3.3: Agreement between HMM and MLA (and vice versa) on the Saccharomyces cerevisiae data set for nucleosome (N) and linker (L) regions.
The table on the left shows the RA results of HMM when MLA is considered the true classification; the opposite is shown in the right table.

          MLA                   HMM
        L       N             L       N
  L     0.40    0.60    L     0.40    0.60
  N     0.24    0.76    N     0.53    0.46

Table 3.4: Confusion matrices of MLA and HMM on the deep sequencing (DS) data by Pugh et al. (2007).

3.6.3 Scalability and computational time of MLA and HMM

This point is fundamental because the size of a problem can vary significantly in this application domain; if a method does not scale well it can become useless in practice. The computation times of MLA and HMM have been compared on 10 experiments. In particular, 10 synthetic signals have been generated, each one with a fixed number of well-positioned nucleosomes ranging from 10 to 100 in steps of 10. Figure 3.25 shows the ratio between the execution times of MLA ($T_m$) and HMM ($T_h$) for each experiment. From this study it results that, on average, $T_h = 1.7 \times 10^4 \times T_m$.

3.7 One-Class Classifier and MLA

One of the key points of the MLA methodology applied to the case of nucleosome positioning is the classification phase that follows the discovery phase. In this section a new classification scheme that takes advantage of MLA will be presented. As explained in Chapter 1, classification algorithms base the construction of their discriminating function on a training set containing several examples for each class (in the particular case of binary classification this means that both positive and negative examples are necessary). However, in many cases either only examples of a single class are available or the classes are very unbalanced. To address this problem, one-class classifiers have been introduced in order to discriminate a target class from the rest of the feature space [80].
The approach is based on finding the smallest-volume hypersphere (in the feature space) that encloses most of the training data. This approach is mandatory when only examples of the target class are available, or when the cardinality of the target class is so much greater than the other's that too few training examples of the smaller class are available to properly train a classifier. It is important to point out that the nucleosome positioning data considered here necessarily involve a one-class scheme, since only a training set of well-positioned nucleosomes is available. This section presents a one-class classifier scheme, in particular a one-class KNN ($OC$-$KNN$), to distinguish between nucleosomes and linkers. The performance of the one-class KNN embedded in the MLA analysis has been tested on the same kinds of data previously described. Results have shown, in both cases, a good recognition rate.

3.7.1 One-Class classifiers

The first algorithms for one-class classification were based on neural networks, such as those of Moya et al. [58, 57] and Japkowicz et al. [38]. More recently, one-class versions of the support vector machine have been proposed by Schölkopf et al. [68]. The aim is to find a binary function that takes the value +1 in a small region capturing most of the data, and -1 elsewhere. Data transformations are applied such that the origin represents the outliers; then the maximum-margin hyperplane separating the data from the origin is sought. The application of machine learning to classification problems that depend only on positive examples is gaining attention in the computational biology community. The following lists some applications of one-class classifiers to biological and biomedical data.
In [89] a study using one-class machine learning for microRNA (miRNA) discovery is presented. The authors compare a one-class KNN to two-class approaches using naive Bayes and Support Vector Machines. Using the EBV genome as an external validation of the method, they found one-class machine learning to work as well as or better than a two-class approach in identifying true miRNAs as well as in predicting new miRNAs. In [59] a general method for predicting protein-protein interactions is presented. The search for feasible interactions is carried out by a learning system based on experimentally validated protein-protein interactions in the human gastric bacterium Helicobacter pylori. The authors show that a linear combination of discriminant classifiers provides a low error rate. In [62] a one-class classification problem is applied to the detection of diseased mucosa in the oral cavity. The authors either combine several measures of dissimilarity of an element from a set of target examples into a single one-class classifier, or combine several one-class classifiers, each trained with a given measure of dissimilarity. Results show that both approaches achieve a significant improvement in performance.

3.7.2 One-Class KNN

Here, the one-class classifier named One-Class KNN will be described. A KNN classifier for an $M$-class problem is based on a training set $T$ for each class $m$, $1 \le m \le M$. The assignment rule for an unclassified element $x \in X$ is:

$$j = \arg\max_{1 \le m \le M} |T^{(m)}_K(x)| \qquad (3.39)$$

where $T^{(m)}_K(x)$ are the training elements of class $m$ among the $K$ nearest neighbors of $x$. One of the crucial points of the KNN is the choice of the best $K$, which is usually obtained by minimizing the misclassification rate on validation data.
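The assignment rule of Eq. 3.39 can be sketched as follows; this is a minimal illustration with hypothetical names, using a toy one-dimensional distance.

```python
from collections import Counter

def knn_classify(x, training, K, dist):
    """Sketch of Eq. 3.39: assign x to the class with the most representatives
    among its K nearest training elements. training: list of (element, label)."""
    neighbors = sorted(training, key=lambda t: dist(t[0], x))[:K]
    counts = Counter(label for _, label in neighbors)
    return counts.most_common(1)[0][0]

# Toy 1-D example with absolute difference as the distance.
train = [(0.1, "L"), (0.2, "L"), (0.9, "N"), (1.0, "N"), (1.1, "N")]
print(knn_classify(0.95, train, K=3, dist=lambda a, b: abs(a - b)))  # N
```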
In the case of binary classification ($M = 2$), one-class training means that only examples of one class can be used in the decision rule. Here, a one-class training KNN ($OC$-$KNN$) is proposed, which is a generalization of the classical KNN classifier [37]. Let $T_p$ be the training set for a generic pattern $p$ representing a positive instance, and $\delta$ a dissimilarity function between patterns. Then the membership for an unknown pattern $x$ is:

$$\chi_{\varphi,K}(x) = \begin{cases} 1 & \text{if } |\{ y \in T_p \text{ such that } \delta(y, x) \le \varphi \}| \ge K \\ 0 & \text{otherwise} \end{cases} \qquad (3.40)$$

Informally, the rule says that if there are at least $K$ patterns in $T_p$ whose dissimilarity from $x$ is at most $\varphi$, then $x$ is assumed to be a positive pattern, otherwise it is negative. It can easily be proved that the $OC$-$KNN$ has some interesting properties:

Proposition 3.7.1 Let $D$ be a dataset of patterns, $T_p \subseteq D$ the training set for the positives, and $S_{\varphi,K} = \{ x \in D \mid \chi_{\varphi,K}(x) = 1 \}$ the set with membership $\chi_{\varphi,K}$. Then:

a) $S_{\varphi,K'} \subseteq S_{\varphi,K}$ for all $K' \ge K$
b) $S_{\varphi,K} \subseteq S_{\varphi',K}$ for all $\varphi \le \varphi'$

The one-class KNN performance depends on the threshold $\varphi$ and the number of neighbors $K$ used in the classification phase. Both of them can be determined by a validation procedure applied to the training set of positives $T_p$. In the following, the procedure used to estimate the best pair $(\varphi^*, K^*)$ is described. Let us define the performance function $M$:

$$M(\varphi, K) = \frac{|S_{\varphi,K}|}{|T_p|} \qquad (3.41)$$

Note that in this validation procedure, for every $x \in T_p$, the assignment to $S_{\varphi,K}$ uses the membership $\chi_{\varphi,K}(x)$ defined on the training set $T_p - \{x\}$ (leave-one-out). By using $M$ it is possible to define the functions $P$ and $Q$:

$$P(\varphi) = \sum_{k \in \{K_m, \dots, K_M\}} M(\varphi, k) \quad \text{and} \quad Q(k) = \sum_{\varphi \in \{\varphi_m, \dots, \varphi_M\}} M(\varphi, k) \qquad (3.42)$$

where $\{\varphi_m, \dots, \varphi_M\}$ and $\{K_m, \dots, K_M\}$ are sets of increasing values of thresholds and numbers of neighbors, respectively.
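The membership rule of Eq. 3.40 can be sketched as follows; the training set and dissimilarity are toy placeholders.

```python
def oc_knn(x, T_p, delta, phi, K):
    """One-class KNN membership (Eq. 3.40): x is accepted as a positive if at
    least K training patterns in T_p are within dissimilarity phi of x."""
    close = sum(1 for y in T_p if delta(y, x) <= phi)
    return 1 if close >= K else 0

# Toy example with absolute difference as the dissimilarity.
T_p = [1.0, 1.2, 1.1, 5.0]
d = lambda a, b: abs(a - b)
print(oc_knn(1.05, T_p, d, phi=0.2, K=2))  # 1: three training points within 0.2
print(oc_knn(3.0,  T_p, d, phi=0.2, K=2))  # 0: no training point within 0.2
```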
By applying Proposition 3.7.1, it follows that the function $M$ increases as the threshold $\varphi$ increases, and decreases as the number of neighbors $K$ increases. Figure 3.26(a) shows a 3D plot of the function $M$ for the classification of nucleosome and linker regions on the Saccharomyces cerevisiae data set. Assigning the values $\varphi_m = \min_{x,y \in T_p} \delta(x, y)$, $\varphi_M = \max_{x,y \in T_p} \delta(x, y)$, $K_m = 1$ and $K_M = |T_p|$, the pair $(\varphi^*, K^*)$ to choose is:

$$\varphi^* = \min \{ \varphi \mid P(\varphi) = \max_{\varphi'} P(\varphi') \} \qquad (3.43)$$

$$K^* = \max \{ K \mid Q(K) \ne 0 \} \qquad (3.44)$$

Informally, this estimation methodology selects the smallest threshold $\varphi^*$ that yields the best performance on the validation data, mostly independently of the value of $K$; moreover, $K^*$ is chosen as the largest value yielding a nonzero performance. In this way a good compromise between the generalization ability of the classifier and its precision is obtained: the best value of $\varphi$ takes into account several values of $K$, and the chosen value of $K$ should guarantee a good generalization ability. Figure 3.26(b) shows an image representation of $M$, together with the chosen pair $(\varphi^*, K^*)$, for the classification of nucleosome and linker regions on the Saccharomyces cerevisiae data set. A fuzzy extension of the $OC$-$KNN$ has recently been tested on two public data sets [22], also studying the gain in classification performance obtained by combining several one-class classifiers defined by different dissimilarity functions.

3.7.3 Results on synthetic data

Also in this case, the performance has been evaluated in terms of Recognition Accuracy (RA, see Section 3.6 for details). The synthetic experiments allow testing the robustness of the $OC$-$KNN$ to signal noise.
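Before turning to the experiments, the selection rules of Eqs. 3.41 to 3.44 can be sketched on a precomputed grid of $M$ values; the grid and its values below are invented purely for illustration.

```python
import numpy as np

def select_params(M, phis, Ks):
    """Sketch of Eqs. 3.41-3.44. M[i, j] = |S_{phi_i, K_j}| / |T_p| on a grid
    of increasing thresholds phis and neighbor counts Ks."""
    P = M.sum(axis=1)                                  # P(phi): sum over K
    Q = M.sum(axis=0)                                  # Q(K): sum over phi
    phi_star = phis[np.flatnonzero(P == P.max())[0]]   # smallest phi maximizing P
    K_star = Ks[np.flatnonzero(Q != 0)[-1]]            # largest K with Q(K) != 0
    return float(phi_star), int(K_star)

# Toy grid: performance saturates at phi = 2; only K <= 2 ever accepts anything.
phis = np.array([1.0, 2.0, 3.0])
Ks = np.array([1, 2, 3])
M = np.array([[0.5, 0.2, 0.0],
              [1.0, 0.8, 0.0],
              [1.0, 0.8, 0.0]])
print(select_params(M, phis, Ks))  # (2.0, 2)
```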
All parameters used in the generation of the synthetic data have been inspired by biological considerations: $nn = 200$, $nl = 250$, $\lambda = 200$, $r = 50$, $o = 20$, $nr = 100$, $dp = 0$, $dr = 0$, $pur = 0.8$, $nsv = 0.01$, $SNR = \{1, 2, 4, 6, 8, 10\}$ and $ra = 4$, resulting in 6 synthetic signals at different $SNR$ values. The training set $T_p$ consists of all the WPN's that best fit the conditions in Eq. 3.26 with $os = 4$, because, from biological considerations, it is known that a nucleosome covers around 150 base pairs, which corresponds to 8 probes. Thus, the training set $T_p$, and consequently its size $T_L$, are automatically selected by MLA depending on the generated input signal; for the specific experiments reported here, $T_L = \{63, 98, 127, 142, 145, 147\}$ for $SNR = \{1, 2, 4, 6, 8, 10\}$, respectively. The optimal parameters for MLA were derived by the calibration phase described in [16] and resulted in $H = 20$ and $m = 5$. Here and in the next section, $H$ denotes the number of threshold operations of the MLA analysis, to avoid ambiguity with the $K$ of the $OC$-$KNN$, which denotes the number of neighbors. The performance has been evaluated by measuring the correspondence between the classified WPN or LN regions and the ones imposed in the generated signal. The parameters $(\varphi^*, K^*)$ of the $OC$-$KNN$ have been chosen by the validation procedure described in Section 3.7.2 for each $SNR \in \{1, 2, 4, 6, 8, 10\}$. Figure 3.27 reports the best Accuracy and FPR values versus $SNR$, also showing, for each $SNR$, the pair $(\varphi^*, K^*)$ yielding those values. From this study it results that the average accuracy and FPR over the 6 experiments are 94% and 9%, respectively.
3.7.4 Results on real data

In this experiment, the agreement with the Hidden Markov Model (HMM) for nucleosome positioning has again been assessed on the real Saccharomyces cerevisiae data. The training set $T_p$ has been selected in the same way as above. In this experiment, $H = 40$ and $m = 6$ were chosen by the calibration phase ($m = 0.15 \times 40$) fully described in [16]. The confusion matrices, which show the RA of HMM when MLA is considered the true classification and the RA of MLA when HMM is considered the true classification, are reported in Table 3.5. The results can be summarized in an overall RA of 0.76 for HMM (MLA true) and 0.65 for MLA (HMM true).

      HMM (MLA true)        MLA (HMM true)
        L       N             L       N
  L     0.66    0.33    L     0.65    0.34
  N     0.14    0.85    N     0.34    0.65

Table 3.5: Agreement between HMM and MLA (and vice versa) on the Saccharomyces cerevisiae data set for nucleosome (N) and linker (L) regions. The table on the left shows the RA results of HMM when MLA is considered the true classification; the opposite is shown in the right table.

In particular, from this study it is possible to conclude that MLA does not fully agree with HMM on the nucleosome patterns, as in the previous case. In addition, comparing Tables 3.5 and 3.3, it seems that this classifier does not introduce any significant improvement over the one used in Section 3.5.8.
Figure 3.17: (a) Input signal, smoothing, pattern identification and extraction: a portion of Saccharomyces cerevisiae microarray data. Each x value represents a spot (probe) on the microarray and the corresponding y value is the logarithmic ratio of its green and red values. Nucleosome regions lie around the signal peaks (one is marked by a black circle), while lower ratio values indicate linker regions (marked by dashed circles). The dashed lines represent the threshold levels; in this example 6 patterns are retrieved, identified by rhombus, circle, square, triangle-down, triangle-up and star markers. Each pattern identifier is replicated for each of its feature values and drawn at each of its middle points. (b) An example of classification: in this portion, 5 nucleosome regions are shown together with their ranges in base pairs. In particular, 1 out of the 5 regions is classified as delocalized while the remaining ones are well-positioned.

Figure 3.18: Shapes of the patterns: the three classes of nucleosomes detectable with MLA very likely reflect different nucleosome mobilities existing in vivo at specific chromatin loci. Delocalized nucleosomes probably represent single nucleosomes or arrays of nucleosomes with high mobility, while fused nucleosomes may reflect a single nucleosome that occupies two distinct close positions in different cells.
On the left of the arrows, the particular nucleosome configurations that generate the resulting shapes of the well-positioned (W), delocalized (D) and fused (F) nucleosome classes are shown.

Figure 3.19: Classification: the classification of a generic pattern $P_i$ is performed in two phases. In the first phase, the linker ($L$), expected well-positioned ($EW$) and expected delocalized ($ED$) patterns are established using the classification rule $c_1$. In the second phase, the expected regions $A_i$ are defined by suitably processing the $EW$ and $ED$ patterns, and are then used by the classification rule $c_2$ to finally classify the patterns as well-positioned ($W$), delocalized ($D$) or fused ($F$) nucleosomes.

Figure 3.20: Calibration phase for the choice of m: recognition performance plots (group a) and percentage-of-minimum-number-of-permanences plots (group b) for 3 different signal-to-noise ratios, SNR = 1, 2, 4 (first, second and third column, respectively). The bar in each plot groups the results of 10 experiments at several threshold values (i.e. numbers of cuts).

Figure 3.21: Calibration phase for the choice of K: the value of $K$ is selected interactively by looking at the plots of both $\epsilon$ and $MS$.

Figure 3.22: An example of synthetic signal generation.

Figure 3.23: Results on synthetic data: the Recognition Accuracy of MLA and HMM on 6 synthetic signals generated at signal-to-noise ratios 1, 2, 4, 6, 8, 10.
Figure 3.24: A representative sample window spanning 13 nucleosomes, where the agreement (disagreement) of the three methods is shown. The red trace represents the classification by Pugh et al. (2007) in [2].

Figure 3.25: Computation time performances: the execution time ratio $T_h / T_m$ of MLA ($T_m$) and HMM ($T_h$) for 10 synthetic signals generated with different numbers of well-positioned nucleosomes. The dashed line shows the average execution time ratio.

Figure 3.26: Two different representations of $M$: on the left (a) a 3D plot, on the right (b) an image representation showing the values of $M$ in grayscale (0 is black, 1 is white). The latter figure also shows the chosen pair $(\varphi^*, K^*)$.

Figure 3.27: Best Accuracy and FPR values versus SNR. The pairs $(\varphi, K)$ yielding these results are also reported.

Chapter 4

Test of Randomness by MLA

This chapter presents a new nonparametric test of randomness for a set of one-dimensional signals that takes advantage of the MLA preprocessing step. In particular, the procedure is based on the probability density function of the symmetrized Kullback-Leibler distance, estimated via a Monte Carlo simulation on the interval lengths obtained by MLA. The main advantage of this new approach is that it allows an exploratory analysis to directly verify the presence of structures in an input signal. In particular, this test differs from other approaches because it exploits shape features that are rare in a random signal.
4.1 Test of Randomness

Given a signal or a sequence of symbols, it is first necessary to define the meaning of "random". In fact, the term randomness has several meanings across different fields; a good literature survey about randomness tests can be found in [67]. In the statistical literature, the concept of randomness is closely related to a sequence of random variables. Non-randomness may be suggested by any tendency of the observations to exhibit regularities in the sequence. For example, if an observation in a sequence is influenced by the previous observations or, more generally, if the observed value is influenced by its position, the process is not truly random. More formally, a generic sequence is said to be random in the statistical sense if the process that generated it produces independent and identically distributed (i.i.d.) observations. In some contexts the observations are not truly random in the rigorous statistical sense (i.e. i.i.d.), but even though the sequence is not formally random, it can be of interest to measure, at a fixed degree of confidence, how close to random it is. The applications of these approaches are manifold: a test of randomness can be useful in exploratory analysis to verify the possible presence of structures in an input signal; in cryptography, to assess the performance of a pseudo-random generator (a fundamental building block of many algorithms); or to test the strength of a password [35, 30].

4.1.1 State of the art

This section does not pretend to be a detailed review of all the methodologies known in the literature; the main ideas and their references will be presented instead.
In particular, in the statistical literature there are several approaches to test whether a sequence is random, exploiting "non-randomness" in different ways:

• tests based on runs
• tests based on entropy estimators
• tests based on ranking
• tests based on goodness of fit to a given distribution

It will be shown that the test of randomness that uses MLA as a preprocessing step belongs to the last class.

4.1.2 Tests based on runs

These tests are all based on the central concept of run, given in the following definition:

Definition Given an ordered sequence of one or more types of symbols, a run is defined to be a succession of one or more identical symbols which are followed and preceded by different symbols, or by no symbol at all.

Once the runs in the signal are identified, the measure of randomness can depend on their number, their lengths, or both: in a truly random sequence it is very unusual to have too few or too many runs, or runs of considerable length, so this information can be used as a statistical criterion to assess whether a signal is truly random. Common approaches to define runs starting from a signal are to dichotomize it (e.g. considering the sign of each observation), to compare the amplitude of consecutive points with respect to a focal point (e.g. the mean or the median), or to look for trends. More information about these approaches can be found in [30].

4.1.3 Tests based on entropy estimators

These tests are based on the entropy of a signal or of related features. In general, the entropy is a measure of the uncertainty associated with a random variable [18]:

Definition Let X be a discrete random variable with alphabet Σ and probability mass function p(x) = Pr{X = x}, x ∈ Σ. The entropy H(X) of a discrete random variable X is defined by:

H(X) ≡ H(p) = −∑_{x∈Σ} p(x) log₂ p(x)   (4.1)
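As a minimal illustration of these two families of tests, one can count the runs of a dichotomized signal and compute the entropy (4.1) of its empirical symbol distribution (the median-based dichotomization is one of the common choices mentioned above, and the toy signal is illustrative):

```python
import numpy as np

def count_runs(bits):
    """Number of runs: maximal blocks of identical symbols."""
    bits = np.asarray(bits)
    return 1 + int(np.sum(bits[1:] != bits[:-1]))

def entropy(p):
    """Shannon entropy H(p) = -sum p log2 p (eq. 4.1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

signal = np.array([0.2, 1.3, -0.4, 0.8, -1.1, 2.0, -0.3, 0.5])
bits = (signal > np.median(signal)).astype(int)  # dichotomize at the median
print(count_runs(bits))  # 8: the bits alternate at every step

counts = np.bincount(bits, minlength=2)
print(entropy(counts / counts.sum()))  # 1.0: the two symbols are equiprobable
```

A run count far from what is expected for i.i.d. bits, or an entropy well below 1, would both point toward non-randomness.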
For example, if we consider the sign test [30] (a particular run test) on a binary vector, it should be expected that the sequence of signs (or bits) is i.i.d.; this follows from the fact that positive and negative signs are equiprobable, i.e. P(s(i) ≥ 0) = P(s(i) < 0). If this assumption does not hold, it is easy to prove that the entropy will be strictly less than 1. In general these tests use the null hypothesis:

H₀ : H(p) = 1   (4.2)

Usually, given a signal f, these tests start by approximating the probability distribution of f and then calculate its entropy. Further details can be found in [28] and [85].

4.1.4 Tests based on ranking: Wilcoxon rank sum test

These tests are based on the concept of ranking, where by ranking is meant a sorting of the observations in non-decreasing or non-increasing order. A very popular test that falls into this category, and that can be used to evaluate the randomness of a signal, is the Wilcoxon rank sum test. Given two vectors of observations X and Y, possibly of different lengths, it tests the null hypothesis that the data in the vectors are independent samples from identical continuous distributions with equal medians, against the alternative that they do not have equal medians [35]. More formally: given N = m + n observations X₁, ..., X_m and Y₁, ..., Y_n, the assumed model is:

X_i = e_i,  i = 1, ..., m   (4.3)
Y_j = e_{m+j} + Δ,  j = 1, ..., n   (4.4)

where e₁, ..., e_{m+n} are unobservable random variables and Δ is the shift between the samples. Here it is supposed that the N observations are mutually independent and that each e comes from the same continuous population. The test consists in evaluating the null hypothesis:

H₀ : Δ = 0   (4.5)

The first step is to sort the N observations in increasing order; let R_j denote the rank of Y_j in this ordering.
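The pooled ranking just described, and the rank-sum statistic W = ∑_j R_j built on it, can be sketched as follows (a minimal version assuming no ties; the sample vectors are illustrative):

```python
import numpy as np

def rank_sum_W(x, y):
    """Sum of the ranks of the Y observations in the joint
    (pooled) ordering of X and Y. Assumes no ties."""
    pooled = np.concatenate([x, y])
    # rank of each pooled value: its 1-based position in sorted order
    ranks = np.empty(len(pooled), dtype=int)
    ranks[np.argsort(pooled)] = np.arange(1, len(pooled) + 1)
    return int(ranks[len(x):].sum())  # ranks belonging to the y block

x = np.array([1.2, 0.4, 2.2])
y = np.array([3.1, 0.9, 2.7])
print(rank_sum_W(x, y))  # 13: y receives ranks 6, 2 and 5
```

A large W means the Y observations tend to sit high in the joint ordering, which is exactly the behaviour the one-sided test against Δ > 0 rejects on.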
Then the statistic W is calculated as:

W = ∑_{j=1}^{n} R_j   (4.6)

For a one-sided test of H₀ versus the alternative H₁ : Δ > 0, at significance level α:

reject H₀ if W ≥ w(α, m, n);  accept H₀ if W < w(α, m, n)

where the constant w(α, m, n) satisfies P₀[W ≥ w(α, m, n)] = α. Let R_(1) < ... < R_(n) be the ordered Y ranks in the joint ranking of X and Y; the null distribution of W = ∑_{j=1}^{n} R_j = ∑_{j=1}^{n} R_(j) can be obtained by considering that, under the hypothesis H₀, all possible C(N, n) assignments for [R_(1), ..., R_(n)] have probability 1/C(N, n). In this way it is possible to derive the null distribution without specifying the underlying distribution of the e's.

4.1.5 Tests based on goodness of fit: Kolmogorov-Smirnov goodness of fit test

These tests start from a statistical model and try to assess how well some observations fit the model. A very popular test in this category, which can be used to evaluate whether two samples are drawn from the same distribution, is the Kolmogorov-Smirnov goodness of fit test [73]. This distribution-free test is used to check whether one sample comes from a particular distribution, or whether two samples come from the same distribution. It is based on the comparison between the empirical cumulative distribution function and the theoretical cumulative distribution function. More formally: let X be a random variable with cumulative distribution function F(x); given another cumulative distribution function F_N(x), the test checks the hypothesis:

H₀ : F(x) = F_N(x), ∀x   (4.7)

Let D be the maximum absolute difference between the two cumulative distributions, i.e. D = sup_{−∞<x<∞} |F(x) − F_N(x)|.

A symmetric function K : X × X → R is a kernel function if and only if the matrix K = (K(x_i, x_j))_{i,j=1}^{n} is positive semi-definite, i.e.

∑_{i,j=1}^{n} c_i c_j K(x_i, x_j) ≥ 0

for all n > 0, x₁, ..., x_n ∈ X and c_i, c_j ∈ R.
Proof Since the matrix (K(x_i, x_j))_{i,j=1}^{n} is symmetric, there exists an orthogonal matrix V such that K = VΛV′, where Λ is the diagonal matrix containing the eigenvalues λ_t of K, and the columns of V are the corresponding eigenvectors v_t = (v_{ti})_{i=1}^{n}. By hypothesis the eigenvalues of K are non-negative, so it is possible to define the mapping φ:

φ : x_i ↦ (√λ_t v_{ti})_{t=1}^{n}   (5.8)

and express the inner product as:

⟨φ(x_i), φ(x_j)⟩ = ∑_{t=1}^{n} λ_t v_{ti} v_{tj} = (VΛV′)_{ij} = K(x_i, x_j)   (5.9)

This proves that K is a kernel function that calculates the inner product in the vector space given by the mapping function φ. Note that the condition of positive semi-definiteness is necessary: if there existed at least one negative eigenvalue λ_s with corresponding eigenvector v_s, the point

z = ∑_{i=1}^{n} v_{si} φ(x_i) = √Λ V′ v_s   (5.10)

would have a negative squared norm in that space, which is impossible:

‖z‖² = ⟨z, z⟩ = v′_s V √Λ √Λ V′ v_s = v′_s V Λ V′ v_s = v′_s K v_s = λ_s < 0   (5.11)

5.1.3 Kernels and distances

A simple property of the inner product is that it naturally induces a norm:

‖x‖₂ = √⟨x, x⟩   (5.12)

and thus a metric, or distance:

d(x, z) = ‖x − z‖₂   (5.13)

It follows immediately that a generic kernel function also induces a distance:

Definition Distance induced by a kernel function Given a kernel function k, consider the Gram matrix G_{ij} = k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩; it is possible to obtain a pairwise distance matrix D from G using the following relation:

D_{ij} = √(‖φ(x_i) − φ(x_j)‖²) = √(k(x_i, x_i) + k(x_j, x_j) − 2k(x_i, x_j))   (5.14)

As an example, let us consider the Euclidean distance:
Definition Euclidean Distance Given two signals x⃗ and y⃗, their Euclidean distance is defined as:

d(x⃗, y⃗) = √(∑_{i=1}^{m} (x_i − y_i)²)   (5.15)

where x⃗ = (x₁, ..., x_m) and y⃗ = (y₁, ..., y_m). It is straightforward that the Euclidean distance is induced by the linear kernel K(x, y) = xy′.

5.2 Kernel methods for trees

All the kernel functions in this category are based on the concept of tree, i.e. the input data are represented in a tree structure. The reader is assumed to be familiar with the general concepts of graph theory, in particular with the definition of a tree structure; for an appropriate background, the reader is referred to the standard literature [7]. As stressed in the introduction, it is possible to define kernel functions even when the input data do not have an explicit vector representation. This is the case for structured data, and in particular for tree structures. More generally, there exists a class of kernel functions, called Convolution Kernels, first introduced by Haussler [34] and later extended by Shin and Kuboyama [76][77], that decompose a data object into simpler parts and then define a kernel function in terms of such parts.

5.2.1 Convolution kernels

This class of kernels is particularly suited to problems involving the processing of structured data such as strings, trees and graphs. In fact, it provides a way to extract real-valued features and thus to map these data into a finite-dimensional real vector space (finite case) or into the Hilbert space of all square-summable sequences (infinite case). The main idea of this approach is that in some cases it is easier to compare two objects in terms of their simpler parts or features.
As with other kernels, it is not necessary to explicitly map an input datum into the feature space; the only requirement is the calculation of the inner product between two input data in the feature space. The name convolution comes from the fact that the value of the kernel is obtained from a sum of products of other kernels, similar to the idea of convolution between functions.

Definition Convolution Kernel Let x ∈ X be a structured datum, X₁, ..., X_D non-empty separable metric spaces, and x⃗ = (x₁, ..., x_D) the subparts of x (for example, in a string a subpart could be a substring), with each x_d ∈ X_d, 1 ≤ d ≤ D. Consider the relation R : X₁ × ... × X_D × X, where R(x⃗, x) is true if and only if x₁, ..., x_D are the subparts of x. Let R⁻¹(x) = {x⃗ : R(x⃗, x)}; R is said to be finite if R⁻¹(x) is finite for all x ∈ X. Given two elements x, y ∈ X and their decompositions x⃗ = (x₁, ..., x_D), y⃗ = (y₁, ..., y_D) in X₁, ..., X_D, suppose that for each X_d with 1 ≤ d ≤ D there exists a kernel K_d; then the Convolution Kernel is defined as:

K(x, y) = ∑_{x⃗ ∈ R⁻¹(x), y⃗ ∈ R⁻¹(y)} ∏_{d=1}^{D} K_d(x_d, y_d)   (5.16)

The proof that K is a valid kernel can be found in the original paper [34].

5.2.2 Tree kernels

In recent years a variety of convolution kernels have been proposed for different kinds of structured data, such as strings, trees and graphs [29], [31], [11]. Here only the main ideas of kernels for trees will be presented; the interested reader can find a good characterization of tree kernels in the PhD thesis by Kuboyama [77]. Tree kernels [14] can be applied to ordered trees; they compute the similarity between trees by considering their common subtrees.
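As a small, self-contained instance of equation 5.16 (an illustrative toy, not an example from the thesis): decompose a string into two parts (D = 2), let R⁻¹(s) contain every split of s, and use an exact-match kernel on each part:

```python
from itertools import product

def match_kernel(a, b):
    """Part kernel K_d: 1 if the parts are equal, 0 otherwise."""
    return 1 if a == b else 0

def splits(s):
    """R^{-1}(s): all decompositions of s into two parts (D = 2)."""
    return [(s[:i], s[i:]) for i in range(len(s) + 1)]

def convolution_kernel(x, y):
    """K(x, y): sum over all decomposition pairs of the product
    of the part kernels (eq. 5.16 with D = 2)."""
    return sum(match_kernel(x1, y1) * match_kernel(x2, y2)
               for (x1, x2), (y1, y2) in product(splits(x), splits(y)))

print(convolution_kernel("abc", "abc"))  # 4: all four splits match
print(convolution_kernel("abc", "abd"))  # 0: no split of "abc" matches one of "abd"
```

Even this toy shows the two ingredients of the framework: a finite decomposition relation R and per-part kernels K_d combined by sum-of-products.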
There are several kinds of tree kernels, but all of them share the same idea of decomposing, within the convolution kernel framework, a tree into different kinds of subtrees (for example simple subtrees or co-rooted subtrees). As an example, let us consider a particular convolution kernel: let x ∈ X be a rooted and ordered tree and X₁, ..., X_D the sets of all D-degree ordered and rooted trees. In this case the relation R defined before is: R(x⃗, x) ⇔ x₁, ..., x_D are the D subtrees of the tree x. In the following, a tree kernel used in the context of Natural Language Parsing, which exploits this idea and has inspired several works on tree kernels (including the MLA tree kernel), will be defined.

Definition Collins and Duffy Tree Kernel [14] Given a tree T, and considering the enumerable set of all possible trees T = {T₁, T₂, ..., T_n}, T can be represented by an n-dimensional vector whose i-th component contains the number of occurrences of the i-th tree T_i of T in T. This mapping uses the function h_i(T), which counts the number of occurrences of T_i in T; in this way it is possible to represent a tree T as h(T) = (h₁(T), h₂(T), ..., h_n(T)). Note that the number n can be huge, because the number of subtrees of a given tree T is exponential in its size. The kernel is then defined as:

K(T₁, T₂) = h(T₁) · h(T₂) = ∑_i h_i(T₁) h_i(T₂)   (5.17)
= ∑_{n₁∈N₁} ∑_{n₂∈N₂} ∑_i I_i(n₁) I_i(n₂) = ∑_{n₁∈N₁} ∑_{n₂∈N₂} C(n₁, n₂)   (5.18)
where N₁ is the set of nodes of T₁, N₂ is the set of nodes of T₂, and I_i(n) is the indicator function

I_i(n) = 1 if the subtree T_i is seen rooted at node n, 0 otherwise   (5.19)

and C(n₁, n₂) = ∑_i I_i(n₁) I_i(n₂). This kernel can be computed in polynomial time by expressing C(n₁, n₂) with the following recursive definition:

• if the productions at n₁ and n₂ are different: C(n₁, n₂) = 0
• if the productions at n₁ and n₂ are the same and n₁ and n₂ are pre-terminal nodes: C(n₁, n₂) = 1
• else, if the productions at n₁ and n₂ are the same and n₁ and n₂ are not pre-terminal nodes:

C(n₁, n₂) = ∏_{j=1}^{nc(n₁)} (1 + C(ch(n₁, j), ch(n₂, j)))   (5.20)

where nc(n₁) is the number of children of n₁ in the tree (note that nc(n₁) = nc(n₂) because the productions are the same) and ch(n_k, j) is the j-th child of node n_k.

In the original paper some variants of this kernel are proposed to address some issues:

• The value of the kernel K(T₁, T₂) depends strongly on the sizes of the trees T₁ and T₂. A possible solution is to use the normalized kernel:

K′(T₁, T₂) = K(T₁, T₂) / √(K(T₁, T₁) K(T₂, T₂))   (5.21)

Note that K′ is still a kernel function because it still satisfies Theorem 5.1.2.

• Since the number of subtrees increases with size and depth, it is necessary to scale the importance of each subtree according to its size:

C(n₁, n₂) = λ (pre-terminal case) and C(n₁, n₂) = λ ∏_{j=1}^{nc(n₁)} (1 + C(ch(n₁, j), ch(n₂, j))), with 0 ≤ λ ≤ 1   (5.22)

This corresponds to the kernel:

K(T₁, T₂) = ∑_i λ^{size_i} h_i(T₁) h_i(T₂)   (5.23)

In order to obtain this result, the parameter 0 ≤ λ ≤ 1 was introduced.
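A minimal sketch of this recursion, including the λ downweighting of equation 5.22; the (label, children) tuple encoding of trees is an illustrative choice, not prescribed by the original paper:

```python
def C(n1, n2, lam=1.0):
    """Collins-Duffy C(n1, n2) with downweighting parameter lambda.
    A node is a (label, children) pair; its production is the label
    together with the sequence of its children's labels."""
    label1, kids1 = n1
    label2, kids2 = n2
    prod1 = (label1, tuple(k[0] for k in kids1))
    prod2 = (label2, tuple(k[0] for k in kids2))
    if prod1 != prod2:
        return 0.0
    if not kids1:                        # terminal: no production to expand
        return 0.0
    if all(not k[1] for k in kids1):     # pre-terminal: children are leaves
        return lam
    result = lam
    for c1, c2 in zip(kids1, kids2):
        result *= 1 + C(c1, c2, lam)
    return result

def tree_kernel(t1, t2, lam=1.0):
    """K(T1, T2): sum of C over all node pairs (eq. 5.18)."""
    def nodes(t):
        yield t
        for k in t[1]:
            yield from nodes(k)
    return sum(C(a, b, lam) for a in nodes(t1) for b in nodes(t2))

# S -> A B with pre-terminals A -> a and B -> b
t = ("S", [("A", [("a", [])]), ("B", [("b", [])])])
print(tree_kernel(t, t))  # 6.0: the tree contains 6 fragments
```

The six shared fragments for this toy tree are [A a], [B b], [S A B], [S [A a] B], [S A [B b]] and [S [A a][B b]], matching the kernel value.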
In this way the kernel downweights the contributions of tree fragments exponentially with their size.

5.3 MLA Kernels

5.3.1 MLA Tree Kernel

The MLA Tree Kernel is based on the MLA; in particular, it is obtained using (1) the MLA on an input signal, (2) a particular aggregation rule that produces a tree from the intervals, and (3) a modified tree kernel adapted to the nature of the class of trees produced by the first two steps. A schematic view of the MLA Tree Kernel within the whole process of kernel methods is depicted in figure 5.3.

Figure 5.3: General Schema of MLA Tree Kernel

5.3.1.1 From signal to tree

Definition MLA tree aggregation rule Given a signal f defined on [a, b] and K threshold operations σ_k (k = 1, ..., K), after the application of Equally Spaced Simple MLA, where the condition on each σ is:

σ(x, φ) = f(x) if f(x) ≤ φ, φ otherwise

it is possible to obtain the interval representation Υ(f) of f, recalling that Υ(f) = {I₁, I₂, ..., I_K}, with I_k = {i¹_k, i²_k, ..., i^{n_k}_k} the set of intervals corresponding to σ_k. To obtain a tree from the signal f it is necessary to use its interval representation Υ(f) with a particular aggregation rule on intervals. First, introduce a relation R : I_k × I_{k+1}, with I_k, I_{k+1} ∈ Υ(f). Two intervals i^s_k and i^t_{k+1} are in relation, indicated as R(i^s_k, i^t_{k+1}), if and only if i^t_{k+1} ⊆ i^s_k. Now, define the undirected tree T = (V, E) such that:

V = I₀ ∪ ⋃_{k=1}^{K} I_k, with I₀ = {r = [a, b]}   (5.24)

and

E = {(i₁, i₂) with i₁, i₂ ∈ V : R(i₁, i₂)}   (5.25)

In this way it is possible to define a labeled and rooted tree T with root r, in which each node encodes the corresponding interval.
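A sketch of this aggregation rule, under the assumption (suggested by the form of σ above, but detailed only in Chapter 2) that the level-k intervals are the maximal runs where the signal exceeds the k-th of K equally spaced thresholds, so that intervals nest across levels; the toy signal is illustrative:

```python
import numpy as np

def intervals_above(f, phi):
    """Maximal index intervals (start, end) where f > phi."""
    above = np.concatenate([[False], f > phi, [False]])
    edges = np.flatnonzero(above[1:] != above[:-1])
    return [(int(s), int(e) - 1) for s, e in zip(edges[::2], edges[1::2])]

def mla_tree(f, K):
    """Rooted tree of nested intervals, as a child -> parent map.
    Node keys are (level, interval); the parent of an interval is
    the containing interval one threshold level below."""
    phis = np.linspace(f.min(), f.max(), K + 2)[1:-1]  # equally spaced
    root = (0, len(f) - 1)
    levels = [[root]] + [intervals_above(f, phi) for phi in phis]
    parent = {}
    for k in range(1, len(levels)):
        for child in levels[k]:
            for cand in levels[k - 1]:
                if cand[0] <= child[0] and child[1] <= cand[1]:
                    parent[(k, child)] = (k - 1, cand)
                    break
    return parent

f = np.array([0., 1., 3., 1., 0., 2., 5., 2., 0.])
tree = mla_tree(f, K=3)
for child, par in sorted(tree.items()):
    print(child, "->", par)
```

With this toy signal the two bumps appear as separate intervals at level 1, and only the taller bump survives at the higher thresholds, giving the nesting the aggregation rule relies on.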
The depth of the tree is exactly K + 1, as it is necessary to add the node r representing the interval [a, b] on which f is defined. An illustrative picture of the process is shown in figure 5.4.

5.3.1.2 Proposed Tree Kernel

This kernel is defined starting from the tree T previously defined in 5.24 and 5.25. The idea behind this kernel is similar to the tree kernel proposed by Collins and Duffy, introduced in section 5.2.2. In their original work they used the tree kernel to characterize parse trees; here it is shown how to adapt their approach to the set of trees obtained by MLA, representing the class of one-dimensional signals defined on some interval [a, b]. The main idea of this kernel is to compare two signals using their tree representations. In the original kernel of Collins and Duffy each node represents a production rule or a terminal symbol of some formal language; here the nodes represent intervals.

Definition MLA Tree Kernel Using the same conventions as the tree kernel presented in 5.2.2, the MLA tree kernel is defined as:

K(T₁, T₂) = h(T₁) · h(T₂) = ∑_{n₁∈N₁} ∑_{n₂∈N₂} C(n₁, n₂, δ)   (5.26)

where n₁ and n₂, for simplicity of notation, also denote the interval lengths associated with the nodes n₁ and n₂, δ ∈ R with 0 < δ < (b − a), and C(n₁, n₂, δ) is recursively defined as:

• if n₁ is a leaf and n₂ is not a leaf, or vice versa, then C(n₁, n₂, δ) = 0;
• if |n₁ − n₂| > δ and the intervals are pre-terminal (both fathers of a leaf), then C(n₁, n₂, δ) = 0 (n₁ and n₂ are considered different);
• if |n₁ − n₂| ≤ δ and the intervals n₁ and n₂ are two leaves, then C(n₁, n₂, δ) = 1 (n₁ and n₂ are considered equal);
• else, if |n₁ − n₂| ≤ δ and the intervals n₁ and n₂ are not both fathers of a leaf, then:

C(n₁, n₂, δ) = ∏_{j=1}^{nc(n₁)} (1 + C(ch(n₁, j), ch(n₂, j), δ))   (5.27)

Note that this kernel suffers from the same issues as the Collins and Duffy tree kernel; for this reason it may be useful to consider the variants proposed in 5.21, 5.22 and 5.23. Note also that here the nodes n₁ and n₂ denote interval lengths rather than production rules.

5.3.2 MLA Convolution Kernel

This kernel is defined starting from the interval representation of a signal through the Equally Spaced MLA defined in Chapter 2. In particular, given two signals x, y, let Υ(x) = {I^x_1, I^x_2, ..., I^x_K} and Υ(y) = {I^y_1, I^y_2, ..., I^y_K} be their interval representations with K threshold operations.

Definition MLA Convolution Kernel Let I be a generic set of intervals from some interval representation of a signal of length L, and define B_I as a signal of length L with:

B_I(j) = 1 if there exists an interval [a, b] ∈ I such that j ∈ [a, b], 0 otherwise   (5.28)

with 1 ≤ j ≤ L. In this way it is possible to associate a set of binary strings to a generic interval representation. Finally, the kernel is defined as:

S(x, y) = ∑_{k=1+hnp}^{K−hnp+1} (1/np) [∑_{j=k−hnp+1}^{k+hnp−1} B_{I^x_j}] [∑_{j=k−hnp+1}^{k+hnp−1} B_{I^y_j}]   (5.29)

where 0 ≤ γ ≤ 1, np = ⌊γ·K⌋ and hnp = np/2. This kernel function can be seen as a local correlation between corresponding internal portions of the signals, in which the size of each portion is controlled by the parameter γ.

5.4 Support Vector Machines

Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high-dimensional space, trained with a learning algorithm for optimization motivated by statistical learning theory [19].
SVM are binary classifiers; in particular, the discriminative function of the SVM represents a linear decision boundary, also called the margin. More formally, an SVM constructs a hyperplane in a high (eventually infinite) dimensional space, using the implicit projection given by the kernel functions, in order to obtain a good separation between positive and negative points. In particular, SVM selects the hyperplane that has the largest distance to the nearest training data points of any class, since in general the larger the margin, the lower the generalization error of the classifier. Figure 5.5 illustrates the concept of margin and the hyperplane (a straight line in 2 dimensions). The interested reader can find a good survey of SVM classifiers in [75].

5.5 Experimental Setup

In this section three experiments using the MLA Tree Kernel will be presented; the first two involve classification, while the third is related to clustering.

5.5.1 Synthetic data: discrimination power of the MLA Tree Kernel on basic functions

To validate the MLA Tree Kernel, three basic signals that can be characterized in terms of shape in the time domain have been considered (see figure 5.7):

• sinusoid signal
• rectangular pulse signal
• sawtooth signal

Kernel Function    Correctly Classified    Accuracy
MLA Tree           150/150                 100%
Linear             143/150                 95%
Polynomial(2)      131/150                 87%
RBF                130/150                 87%
Sigmoid            141/150                 94%

Table 5.1: Classification accuracy on the basic functions dataset.

As training set S, N signals have been generated for each of the three categories, with an increasing linear SNR noise value ranging from 0.1 to 1. In this way one disposes of a training set with 3 × N elements and 3 classes. Analogously, a test set T disjoint from S, with the same cardinality 3 × N, was taken into account.
To validate the performances, a Support Vector Machine with different kernel functions has been considered: linear, polynomial, RBF, sigmoid and MLA Tree. The results obtained with N = 50 and with the different kernels are shown in table 5.1. As can be seen, all the kernels obtain very good performances, although in the case of very noisy signals the MLA Tree Kernel can still recover the shape information, leading to a slightly better result. This makes the MLA Tree Kernel more robust to noise than the other kernels.

5.5.2 Synthetic data: MLA Tree Kernel on the waveform dataset

In this experiment the dataset from [8] was considered. It contains 5000 instances divided into 3 classes of waves with 21 attributes, all of which include Gaussian noise with mean 0 and variance 1. In particular, each class is generated from a combination of 2 of 3 "base" waves. The best accuracy obtained on this dataset has been reached by the Optimal Bayes classifier, with a value of 86%. Here the dataset was split into two balanced parts (training and test sets) of 1500 elements, equally distributed among the three classes, in order to evaluate the performances of the MLA Tree Kernel with an SVM classifier. As in the previous experiment, the linear, polynomial, RBF, sigmoid and MLA Tree kernel functions have been used. Results are shown in table 5.2. As can be seen, all the kernels obtain very good performances.

5.5.3 Assessment of the induced distance of the MLA Convolution Kernel for clustering of seismic signals

The dataset taken into exam for this experiment consists of n undersea explosions of an array of bombs at different distances from a ship. This dataset was built in order to have a well-characterized set of signals to use as a benchmark for problems involving
Kernel Function    Correctly Classified    Accuracy
MLA Tree           1364/1500               91%
Linear             1286/1500               86%
Polynomial(2)      1187/1500               80%
RBF                1286/1500               86%
Sigmoid            795/1500                53%

Table 5.2: Classification accuracy on the waveform dataset.

geological signals. In particular, the ship records for each explosion at time t_i a signal s_i that expresses the variation in pressure level. The explosions take place at regular intervals of 300 seconds and each signal is sampled at 100 Hz. A particularity of this dataset, as can be seen in figure 5.8, is that temporally close explosions occur at similar distances from the ship. This means that, given a signal s_i, with high probability the most similar signal in terms of shape is the signal s_{i+d} with d close to 1 or −1, i.e. a signal recorded in proximity of instant t_i. This property allows testing in a natural way the performance of a similarity or dissimilarity function by comparing the "order" that it induces on the set of signals. In particular, let s₁, ..., s_n be the set of signals recorded at starting times t₁, ..., t_n respectively; the natural order of the signals can be represented by the permutation P = (1, 2, ..., n). Given a generic distance d, let D be the n × n distance matrix containing all the pairwise distances between the signals, i.e. D_{i,j} = d(s_i, s_j) with 1 ≤ i, j ≤ n. A measure of goodness of a distance can then be defined by the following distance optimality function:

Definition Distance Optimality Given a distance d and a dataset S of size n, and letting D be the pairwise distance matrix with D_{i,j} = d(s_i, s_j), s_i, s_j ∈ S and 1 ≤ i, j ≤ n, the distance optimality of d is defined as:

do = (1/n) ∑_{i=1}^{n} ||i − j| − 1| / (n − 2), with j = argmin_{1≤k≤n, k≠i} D_{i,k}   (5.30)

What is expected, in the case of a good distance measure, is do ≈ 0.
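A sketch of the distance optimality computation, reading equation 5.30 as the average, over the n signals, of the normalized deviation of each signal's nearest neighbour from temporal adjacency (an interpretation of the definition above; the toy distance matrix is illustrative):

```python
import numpy as np

def distance_optimality(D):
    """do: average normalized deviation of each signal's nearest
    neighbour from temporal adjacency (0 = perfectly ordered).
    Requires n > 2."""
    n = D.shape[0]
    total = 0.0
    for i in range(n):
        row = D[i].astype(float).copy()
        row[i] = np.inf                 # exclude k = i from the argmin
        j = int(np.argmin(row))
        total += abs(abs(i - j) - 1) / (n - 2)
    return total / n

# Toy pairwise distances for 5 "signals" whose nearest neighbours
# are exactly their temporal neighbours.
n = 5
idx = np.arange(n)
D = np.abs(idx[:, None] - idx[None, :]).astype(float)
print(distance_optimality(D))  # 0.0
```

When every nearest neighbour sits at |i − j| = 1 the score is exactly 0, and it grows toward 1 as nearest neighbours drift away from their temporal position, which matches the values reported in table 5.3.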
The performance of the distance induced by the MLA Convolution Kernel (obtained using equation 5.14 on its Gram matrix) was assessed, and its results were compared, via the distance optimality function, with two common distances: the Euclidean distance and the Spearman correlation distance. Note that the Spearman correlation distance used here is defined as 1 − r, where r is the Spearman correlation index defined in Chapter 2 by equation 2.8. The results of this analysis are shown in table 5.3. As can be seen, the distance induced by the MLA Convolution Kernel exploits the natural similarity between signals better than the other classic measures.

Distance             Distance Optimality
MLA Convolution      0.2369
Euclidean            0.3889
Spearman Correlation 0.2813

Table 5.3: Distance optimality on geological signals.

This chapter has shown how the data extracted by MLA can be optimally organized into a tree of intervals, encoding the shape properties of a signal, by means of a particular aggregation rule. An example of tree kernel properly adapted to this tree representation induced by MLA was also shown. In addition, another convolution kernel, based on local correlations, was introduced. The first results are encouraging, although a more systematic study is necessary on the class of kernel functions that can be induced by the proposed aggregation rule on the interval representation, and on their properties and extensions. The major suggestion of the study carried out in this chapter is the connection between the class of algorithms on trees and graphs and the class of digital signal processing techniques. In fact, the MLA transformation can be useful to search for relations between operations on trees and graphs and signal manipulation in the time or frequency domain.

Figure 5.4: General Schema of Kernel Methods
Figure 5.5: SVM margin and the separating hyperplane

Figure 5.6: Basic functions

Figure 5.7: Basic functions plus noise

Figure 5.8: Schema of the experiment

Chapter 6
Conclusions and Future Directions

This thesis has introduced a new methodology called Multi Layer Analysis (MLA) and its use in several contexts, such as Pattern Discovery, Classification, Clustering and also Test of Randomness. In chapters 3, 4 and 5 several application domains related to these problems have been faced with the MLA approach. In some sense, the use of MLA can be considered as a general boosting step to improve classic algorithms in the fields of classification or clustering. The main idea behind MLA is the transformation from the space of one-dimensional signals into a new space, called the space of intervals, in which a more detailed analysis can be performed. In particular, in chapter 3 it has been shown that, by using particular aggregation rules on such a space, it is possible to characterize different signal shapes; this allows approaching some key problems in biology, i.e. the nucleosome spacing problem. Moreover, in chapter 5 another aggregation rule has been proposed that is capable of representing a one-dimensional signal in terms of a tree of intervals, and thus permits expressing or characterizing any kind of shape. This point has strong implications, since it establishes a connection between the class of algorithms that process one-dimensional signals, such as digital signal processing techniques, and algorithms on trees and graphs.
This result is really important because it makes possible the application of particular transformations to a one-dimensional signal by modifying its tree representation, and vice versa. In this sense, further investigation in this direction will be performed. The final consideration is that MLA can be fruitfully applied to problems involving the processing of one-dimensional signals, in Geology, Biomedicine, Biology and other disciplines. In some cases MLA achieves comparable, or sometimes superior, performance to other methodologies currently applied for the same purposes. Further investigation of MLA properties and its extension to multidimensional data will be carried out.

Bibliography

[1] N. Addison. The Illustrated Wavelet Transform Handbook. Taylor & Francis, 2002.
[2] I. Albert, T. N. Mavrich, L. P. Tomsho, J. Qi, S. J. Zanton, S. C. Schuster, and B. F. Pugh. Translational and rotational settings of H2A.Z nucleosomes across the Saccharomyces cerevisiae genome. Nature, 446(7135):572–576, March 2007.
[3] P. Baldi and G.W. Hatfield. DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling. Cambridge University Press, 1 edition, September 2002.
[4] A. Barski, S. Cuddapah, K. Cui, T. Y. Roh, D. E. Schones, Z. Wang, G. Wei, I. Chepelev, and K. Zhao. High-resolution profiling of histone methylations in the human genome. Cell, 129(4):823–837, May 2007.
[5] B. E. Bernstein, C. L. Liu, E. L. Humphrey, E. O. Perlstein, and S. L. Schreiber. Global nucleosome occupancy in yeast. Genome Biology, 5(9):R62+, 2004.
[6] R. J. Bolton and N. M. Adams. An iterative hypothesis-testing strategy for pattern discovery. In Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages 49–58, New York, NY, USA, 2003. ACM.
[7] J. A. Bondy and U. S. R. Murty. Graph Theory With Applications. Elsevier Science Ltd, 1976.
[8] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression Trees. Chapman and Hall/CRC, 1st edition, January 1984.
[9] A. L. Buchsbaum and R. Giancarlo. Algorithmic aspects in speech recognition: an introduction. J. Exp. Algorithmics, 2, 1997.
[10] M. Buck, A. Nobel, and J. Lieb. ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data. Genome Biology, 6(11):R97+, 2005.
[11] F. Camastra and A. Petrosino. Kernel methods for graphs: A comprehensive approach. In Ignac Lovrek, Robert Howlett, and Lakhmi Jain, editors, Knowledge-Based Intelligent Information and Engineering Systems, volume 5178 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2008.
[12] W-K. Ching and M. K. Ng. Markov Chains: Models, Algorithms and Applications (International Series in Operations Research & Management Science). Springer, 1st edition, December 2005.
[13] L. L. Chiung-hon, L. Alan, and C. Wen-sung. Pattern discovery of fuzzy time series for financial prediction. IEEE Transactions on Knowledge and Data Engineering, 18:613–625, 2006.
[14] M. Collins and N. Duffy. Convolution kernels for natural language. In Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, editors, NIPS. MIT Press, 2001.
[15] R. Cooley, B. Mobasher, and J. Srivastava. Web mining: Information and pattern discovery on the world wide web. Tools with Artificial Intelligence, IEEE International Conference on, 0:0558, 1997.
[16] D. F. V. Corona, V. Di Gesù, G. Lo Bosco, L. Pinello, and G-C. Yuan. A new multi-layers method to analyze gene expression. Volume 4694 of Lecture Notes in Artificial Intelligence, 2007.
[17] D. F. V. Corona and J. W. Tamkun.
Multiple roles for ISWI in transcription, chromosome organization and DNA replication. Biochim Biophys Acta, 2004.
[18] T. M. Cover and J. A. Thomas. Elements of Information Theory, 2nd Edition. Wiley-Interscience, 2nd edition, July 2006.
[19] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 1st edition, March 2000.
[20] M. De Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications, Second Edition. Springer, 2nd edition, 2000.
[21] A. L. Delcher, S. Kasif, H. R. Goldberg, and W. H. Hsu. Protein secondary structure modelling with probabilistic networks. In Int. Conf. on Intelligent Systems and Molecular Biology, 1993.
[22] V. Di Gesù and G. Lo Bosco. Combining one class fuzzy KNN's. In Proceedings of the 7th International Workshop on Fuzzy Logic and Applications: Applications of Fuzzy Sets Theory, WILF '07, pages 152–160, Berlin, Heidelberg, 2007. Springer-Verlag.
[23] V. Di Gesù, G. Lo Bosco, and L. Pinello. A one class classifier for signal identification: a biological case study. In I. Lovrek, R. J. Howlett, and L. C. Jain, editors, 12th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems KES-2008, volume LNAI-5179, Part III. Zagreb, Croatia, September 2008.
[24] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, 2nd edition, November 2001.
[25] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, July 1999.
[26] Y. Ephraim and N. Merhav. Hidden Markov processes. Information Theory, IEEE Transactions on, 48(6):1518–1569, 2002.
[27] W. Feller.
An Introduction to Probability Theory and Its Applications, Vol. 2. Wiley, 2nd edition, January 1971.
[28] Y. Gao, I. Kontoyiannis, and E. Bienenstock. Estimating the entropy of binary time series: Methodology, some theory and a simulation study. Entropy, 10, June 2008.
[29] T. Gärtner. A survey of kernels for structured data. SIGKDD Explor. Newsl., 5(1), July 2003.
[30] J. D. Gibbons and S. Chakraborti. Nonparametric Statistical Inference (Statistics: a Series of Textbooks and Monographs). CRC, 4th edition, May 2003.
[31] T. Gärtner, A. K. Fraunhofer, S. Birlinghoven, S. Augustin, V. L. Quoc, and A. J. Smola. A short tour of kernel methods for graphs. 2008.
[32] D. Hand. In Pattern Detection and Discovery, volume 2447 of Lecture Notes in Computer Science. Springer, 2002.
[33] C. T. Harbison, D. B. Gordon, T. I. Lee, N. J. Rinaldi, K. D. Macisaac, T. W. Danford, N. M. Hannett, J-B. Tagne, D. B. Reynolds, J. Yoo, E. G. Jennings, J. Zeitlinger, D. K. Pokholok, M. Kellis, P. A. Rolfe, K. T. Takusagawa, E. S. Lander, D. K. Gifford, E. Fraenkel, and R. A. Young. Transcriptional regulatory code of a eukaryotic genome. Nature, 431(7004):99–104, September 2004.
[34] D. Haussler. Convolution kernels on discrete structures. In Technical Report UCSC-CRL-99-10. UC, 1999.
[35] M. Hollander and D. A. Wolfe. Nonparametric Statistical Methods, 2nd Edition. Wiley-Interscience, 2nd edition, January 1999.
[36] A. Ikuo and H. Ohta. A test for normality based on Kullback-Leibler information. The American Statistician, 43(1).
[37] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.
[38] N. Japkowicz, C. Myers, and M. A. Gluck. A Novelty Detection Approach to Classification. In IJCAI, pages 518–523, 1995.
[39] F. V.
Jensen. Bayesian Networks and Decision Graphs. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2001.
[40] H. Ji and W. H. Wong. TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics (Oxford, England), 21(18):3629–3636, September 2005.
[41] D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold. Genome-Wide Mapping of in Vivo Protein-DNA Interactions. Science, 316(5830):1497–1502, June 2007.
[42] H. Johnson and S. Sinanovic. Symmetrizing the Kullback-Leibler distance. Technical report, IEEE Transactions on Information Theory, 2001.
[43] W. E. Johnson, W. Li, C. A. Meyer, R. Gottardo, J. S. Carroll, M. Brown, and X. S. Liu. Model-based analysis of tiling-arrays for ChIP-chip. Proceedings of the National Academy of Sciences of the United States of America, 103(33):12457–12462, August 2006.
[44] S. Keleş, M. J. van der Laan, S. Dudoit, and S. E. Cawley. Multiple testing methods for ChIP-Chip high density oligonucleotide array data. J Comput Biol, 13(3):579–613, April 2006.
[45] T. H. Kim and B. Ren. Genome-Wide Analysis of Protein-DNA Interactions. Annual Review of Genomics and Human Genetics, 7(1):81–102, 2006.
[46] R. D. Kornberg and L. Stryer. Statistical distributions of nucleosomes: nonrandom locations by a stochastic mechanism. Nucleic Acids Research, 16(14):6677–6690, 1988.
[47] C-K. K. Lee, Y. Shibata, B. Rao, B. D. Strahl, and J. D. Lieb. Evidence for nucleosome depletion at active regulatory regions genome-wide. Nature Genetics, 36(8):900–905, August 2004.
[48] W. Lee, D. Tillo, N. Bray, R. H. Morse, R. W. Davis, T. R. Hughes, and C. Nislow. A high-resolution atlas of nucleosome occupancy in yeast. Nature Genetics, 39(10):1235–1244, September 2007.
[49] C. Leslie, E. Eskin, and W. S. S. Noble. The spectrum kernel: a string kernel for SVM protein classification.
Pacific Symposium on Biocomputing, pages 564–575, 2002.
[50] T. Lindeberg. Scale space for discrete signals. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:234–254, 1990.
[51] K. Luger, A. W. Mader, R. K. Richmond, D. F. Sargent, and T. J. Richmond. Crystal structure of the nucleosome core particle at 2.8 A resolution. Nature, 389(6648):251–260, September 1997.
[52] Richard G. Lyons. Understanding Digital Signal Processing (2nd Edition). Prentice Hall PTR, 2nd edition, March 2004.
[53] S. Mantaci, A. Restivo, and M. Sciortino. Distance measures for biological sequences: Some recent approaches. International Journal of Approximate Reasoning, 47:109–124, 2008.
[54] N. Metropolis and S. Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44, 1949.
[55] V. Miele, C. Vaillant, Y. D'aubenton-Carafa, C. Thermes, and T. Grange. DNA physical properties determine nucleosome occupancy from yeast to fly. Nucl. Acids Res., pages gkn262+, May 2008.
[56] T. S. Mikkelsen, M. Ku, D. B. Jaffe, B. Issac, E. Lieberman, G. Giannoukos, P. Alvarez, W. Brockman, T. Kim, R. P. Koche, W. Lee, E. Mendenhall, A. O'donovan, A. Presser, C. Russ, X. Xie, A. Meissner, M. Wernig, R. Jaenisch, C. Nusbaum, E. S. Lander, and B. E. Bernstein. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature, 448(7153):553–560, July 2007.
[57] M. M. Moya and D. R. Hush. Network constraints and multi-objective optimization for one-class classification. Neural Netw., 9:463–474, April 1996.
[58] M. M. Moya, M. W. Koch, and L. D. Hostetler. One-class classifier networks for target recognition applications. NASA STI/Recon Technical Report N, 93:24043+, 1993.
[59] L. Nanni. Fusion of classifiers for predicting protein-protein interactions.
Neurocomputing, 68:289–296, 2005.
[60] F. Ozsolak, J. S. Song, X. S. Liu, and D. E. Fisher. High-throughput mapping of the chromatin structure of human promoters. Nature Biotechnology, 25(2):244–248, January 2007.
[61] A. S. Park and J. R. Glass. Unsupervised pattern discovery in speech: Applications to word acquisition and speaker segmentation, 1988.
[62] E. Pekalska, M. Skurichina, and R. P. W. Duin. Combining dissimilarity-based one-class classifiers. In Lecture Notes in Computer Science, volume 3077, 2004.
[63] D. K. Pokholok, C. T. Harbison, S. Levine, M. Cole, N. M. Hannett, T. I. I. Lee, G. W. Bell, K. Walker, P. A. Rolfe, E. Herbolsheimer, J. Zeitlinger, F. Lewitter, D. K. Gifford, and R. A. Young. Genome-wide map of nucleosome acetylation and methylation in yeast. Cell, 122(4):517–527, August 2005.
[64] J. G. Proakis and D. K. Manolakis. Digital Signal Processing (4th Edition). Prentice Hall, March 2006.
[65] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, Feb 1989.
[66] K. N. Rippe, A. Schrader, P. Riede, R. Strohner, E. Lehmann, and G. Langst. DNA sequence- and conformation-directed positioning of nucleosomes by chromatin-remodeling complexes. Proceedings of the National Academy of Sciences, pages 0702430104+, September 2007.
[67] T. Ritter. Randomness tests: A literature survey. http://www.ciphersbyritter.com/RES/RANDTEST.HTM.
[68] B. Scholkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the Support of a High-Dimensional Distribution. Neural Comp., 13(7):1443–1471, July 2001.
[69] E. Segal, Y. Fondufe-Mittendorf, L. Chen, A. Thåström, Y. Field, I. K. Moore, J-P. Z. Wang, and J. Widom. A genomic code for nucleosome positioning. Nature, 442(7104):772–778, July 2006.
[70] E.
Segal, Y. Fondufe-Mittendorf, L. Chen, A. Thåström, Y. Field, I. K. Moore, J. Z. Wang, and J. Widom. A genomic code for nucleosome positioning. Nature, 442(7104):772–778, July 2006.
[71] E. Segal and J. Widom. Poly(dA:dT) tracts: major determinants of nucleosome organization. Current Opinion in Structural Biology, 19(1):65–71, 2009. Folding and binding / Protein-nucleic acid interactions.
[72] E. Segal and J. Widom. What controls nucleosome positions? Trends in Genetics: TIG, 25(8):335–343, August 2009.
[73] B. Senoglu and B. Surucu. Goodness-of-fit tests based on Kullback-Leibler information. IEEE Transactions on Reliability, 53, September 2004.
[74] J. A. Sethian. Level Set Methods and Fast Marching Methods: Evolving Interfaces in Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science (Cambridge ... on Applied and Computational Mathematics). Cambridge University Press, 2nd edition, June 1999.
[75] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.
[76] K. Shin and T. Kuboyama. A generalization of Haussler's convolution kernel: mapping kernel. In Proceedings of the International Conference on Machine Learning, 2008.
[77] K. Shin and T. Kuboyama. A generalization of Haussler's convolution kernel: mapping kernel and its application to tree kernels. Journal of Computer Science and Technology, 25(5), September 2010.
[78] S. W. Smith. The Scientist & Engineer's Guide to Digital Signal Processing. California Technical Pub., 2007.
[79] J. Svaren and W. Horz. Transcription factors vs nucleosomes: regulation of the PHO5 promoter in yeast. Trends in Biochemical Sciences, 22(3):93–97, March 1997.
[80] D. M. J. Tax. One-class classification.
PhD thesis, Delft University of Technology, 2001.
[81] J. S. Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, illustrated edition, June 2004.
[82] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., 1995.
[83] J. Vilo. Pattern Discovery from Biosequences. PhD thesis, University of Helsinki, 2002.
[84] W. Stunkel, I. Kober, and K. H. Seifart. A nucleosome positioned in the distal promoter region activates transcription of the human U6 gene. Molecular and Cellular Biology, 1997.
[85] S. Wegenkittl. Entropy estimators and serial tests for ergodic chains. IEEE Transactions on Information Theory, 47(6):2480–2489, 2001.
[86] I. Whitehouse, O. J. Rando, J. Delrow, and T. Tsukiyama. Chromatin remodelling at promoters suppresses antisense transcription. Nature, 450(7172):1031–1035, December 2007.
[87] A. K. C. Wong and Y. Wang. Pattern discovery: a data driven approach to decision support. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 33(1):114–124, 2003.
[88] M. Yassour, T. Kaplan, A. Jaimovich, and N. Friedman. Nucleosome positioning from tiling microarray data. Bioinformatics, 2008.
[89] M. Yousef, S. Jung, L. C. Showe, and M. K. Showe. Learning from positive examples when the negative class is undetermined - microRNA gene identification. Algorithms for Molecular Biology, 3, 2008.
[90] G-C. Yuan, Y-J J. Liu, M. F. Dion, M. D. Slack, L. F. Wu, S. J. Altschuler, and O. J. Rando. Genome-scale identification of nucleosome positions in S. cerevisiae. Science (New York, N.Y.), 309(5734), July 2005.
[91] Z. D. Zhang, J. Rozowsky, H. Y. K. Lam, J. Du, M. Snyder, and M. Gerstein. Tilescope: online analysis pipeline for high-density tiling microarray data.
Genome Biology, 8:R81+, May 2007.
