A Machine Learning Perspective on Predictive Coding with PAQ

Byron Knoll & Nando de Freitas
University of British Columbia
Vancouver, Canada
{knoll,nando}@cs.ubc.ca

November 27, 2024

Abstract

PAQ8 is an open source lossless data compression algorithm that currently achieves the best compression rates on many benchmarks. This report presents a detailed description of PAQ8 from a statistical machine learning perspective. It shows that it is possible to understand some of the modules of PAQ8 and use this understanding to improve the method. However, intuitive statistical explanations of the behavior of other modules remain elusive. We hope the description in this report will be a starting point for discussions that will increase our understanding, lead to improvements to PAQ8, and facilitate a transfer of knowledge from PAQ8 to other machine learning methods, such as recurrent neural networks and stochastic memoizers. Finally, the report presents a broad range of new applications of PAQ to machine learning tasks including language modeling and adaptive text prediction, adaptive game playing, classification, and compression using features from the field of deep learning.

1 Introduction

Detecting temporal patterns and predicting into the future is a fundamental problem in machine learning. It has gained great interest recently in the areas of nonparametric Bayesian statistics (Wood et al., 2009) and deep learning (Sutskever et al., 2011), with applications to several domains including language modeling and unsupervised learning of audio and video sequences. Some researchers have argued that sequence prediction is key to understanding human intelligence (Hawkins and Blakeslee, 2005).

The close connections between sequence prediction and data compression are perhaps underappreciated within the machine learning community.
The goal of this report is to describe a state-of-the-art compression method called PAQ8 (Mahoney, 2005) from the perspective of machine learning. We show both how PAQ8 makes use of several simple, well known machine learning models and algorithms, and how it can be improved by exchanging these components for more sophisticated models and algorithms.

PAQ is a family of open-source compression algorithms closely related to the better known Prediction by Partial Matching (PPM) algorithm (Cleary and Witten, 1984). PPM-based data compression methods dominated many of the compression benchmarks (in terms of compression ratio) in the 1990s, but have since been eclipsed by PAQ-based methods. Compression algorithms typically need to make a trade-off between compression ratio, speed, and memory usage. PAQ8 is a version of PAQ which achieves record breaking compression ratios at the expense of increased time and memory usage. For example, all of the winning submissions in the Hutter Prize (Hutter, accessed April 15, 2011), a contest to losslessly compress the first 100 MB (10^8 bytes) of Wikipedia, have been specialized versions of PAQ8.

Table 1: Comparison of cross entropy rates of several compression algorithms on the Calgary corpus files. The cross entropy rate metric is defined in Section 2.3.

File     PPM-test  PPM*C  1PF   UKN   cPPMII-64  PAQ8L
bib      1.80      1.91   1.73  1.72  1.68       1.50
book1    2.21      2.40   2.17  2.20  2.14       2.01
book2    1.88      2.02   1.83  1.84  1.78       1.60
geo      4.65      4.83   4.40  4.40  4.16       3.43
news     2.28      2.42   2.20  2.20  2.14       1.91
obj1     3.87      4.00   3.64  3.65  3.50       2.77
obj2     2.37      2.43   2.21  2.19  2.11       1.45
paper1   2.26      2.37   2.21  2.20  2.14       1.97
paper2   2.22      2.36   2.18  2.18  2.12       1.99
pic      0.82      0.85   0.77  0.82  0.70       0.35
progc    2.30      2.40   2.23  2.21  2.16       1.92
progl    1.57      1.67   1.44  1.43  1.39       1.18
progp    1.61      1.62   1.44  1.42  1.39       1.15
trans    1.35      1.45   1.21  1.20  1.17       0.99
Average  2.23      2.34   2.12  2.12  2.04       1.73
Dozens of variations on the basic PAQ8 method can be found on the web: http://cs.fit.edu/~mmahoney/compression/paq.html. As stated on the Hutter Prize website, "This compression contest is motivated by the fact that being able to compress well is closely related to acting intelligently, thus reducing the slippery concept of intelligence to hard file size numbers."

The stochastic sequence memoizer (Gasthaus et al., 2010) is a language modeling technique recently developed in the field of Bayesian nonparametrics. Table 1 shows a comparison of several compression algorithms on the Calgary corpus (Bell et al., 1990), a widely-used compression benchmark. A summary of the Calgary corpus files appears in Table 2. PPM-test is our own PPM implementation used for testing different compression techniques. PPM*C is a PPM implementation that was state of the art in 1995 (Cleary et al., 1995). 1PF and UKN are implementations of the stochastic sequence memoizer (Gasthaus et al., 2010). cPPMII-64 (Shkarin, 2002) is currently among the best PPM implementations. paq8l outperforms all of these compression algorithms by what is considered to be a very large margin in this benchmark.

Despite the huge success of PAQ8, it is rarely mentioned or compared against in machine learning papers. There are reasons for this. A core difficulty is the lack of scientific publications on the inner workings of PAQ8. To the best of our knowledge, there exist only incomplete high-level descriptions of PAQ1 through 6 (Mahoney, 2005) and PAQ8 (Mahoney, accessed April 15, 2011). The C++ source code, although available, is very close to machine language (due to optimizations) and the underlying algorithms are difficult to extract. Many of the architectural details of PAQ8 in this report were understood by examining the source code and are presented here for the first time.

Table 2: File size and description of Calgary corpus files.
File    Bytes    Description
bib     111,261  ASCII text in UNIX "refer" format - 725 bibliographic references.
book1   768,771  Unformatted ASCII text - "Far from the Madding Crowd".
book2   610,856  ASCII text in UNIX "troff" format - "Principles of Computer Speech".
geo     102,400  32 bit numbers in IBM floating point format - seismic data.
news    377,109  ASCII text - USENET batch file on a variety of topics.
obj1    21,504   VAX executable program - compilation of PROGP.
obj2    246,814  Macintosh executable program - "Knowledge Support System".
paper1  53,161   "troff" format - Arithmetic Coding for Data Compression.
paper2  82,199   "troff" format - Computer (in)security.
pic     513,216  1728 x 2376 bitmap image (MSB first).
progc   39,611   Source code in C - UNIX compress v4.0.
progl   71,646   Source code in Lisp - system software.
progp   49,379   Source code in Pascal - program to evaluate PPM compression.
trans   93,695   ASCII and control characters - transcript of a terminal session.

1.1 Contributions

We provide a detailed explanation of how PAQ8 works. We believe this contribution will be of great value to the machine learning community. An understanding of PAQ8 could lead to the design of better algorithms. As stated in (Mahoney, accessed April 15, 2011), PAQ was inspired by research in neural networks: "Schmidhuber and Heil (1996) developed an experimental neural network data compressor. It used a 3 layer network trained by back propagation to predict characters from an 80 character alphabet in text. It used separate training and prediction phases. Compressing 10 KB of text required several days of computation on an HP 700 workstation." In 2000, Mahoney made several improvements that made neural network compression practical. His new algorithm ran 10^5 times faster. PAQ8 uses techniques (e.g. dynamic ensembles) that could lead to advances in machine learning.
As a second contribution, we demonstrate that an understanding of PAQ8 enables us to deploy machine learning techniques to achieve better compression rates. Specifically, we show that a second order adaptation scheme, the extended Kalman filter (EKF), results in improvements over PAQ8's first order adaptation scheme.

A third contribution is to present several novel applications of PAQ8. First, we demonstrate how PAQ8 can be applied to adaptive text prediction and game playing. Both of these tasks have been tackled before using other compression algorithms. Second, we show for the first time that PAQ8 can be adapted for classification. Previous works have explored using compression algorithms, such as RAR and ZIP, for classification (Marton et al., 2005). We show that our proposed classifier, PAQclass, can outperform these techniques on a text classification task. We also show that PAQclass achieves near state-of-the-art results on a shape recognition task. Finally, we develop a lossy image compression algorithm by combining PAQ8 with recently developed unsupervised feature learning techniques.

1.2 Organization of this Report

In Section 2 we provide general background information about the problem of lossless data compression, including a description of arithmetic coding and PPM. In Section 3 we present a detailed explanation of how PAQ8 works. This section also includes a description of our improvement to the compression rate of PAQ8 using EKF. We present several novel applications of PAQ8 in Section 4. Section 5 contains our conclusions and possible future work. Appendix A contains information on how to access demonstration programs we created using PAQ8.

2 Lossless Data Compression

An arbitrary data file can be considered as a sequence of characters in an alphabet. The characters could be bits, bytes, or some other set of characters (such as ASCII or Unicode characters).
Lossless data compression usually involves two stages. The first is creating a probability distribution for the prediction of every character in a sequence given the previous characters. The second is to encode these probability distributions into a file using a coding scheme such as arithmetic coding (Rissanen and Langdon, 1979) or Huffman coding (Huffman, 1952). Arithmetic coding is usually preferable to Huffman coding because arithmetic coders can produce near-optimal encodings for any set of symbols and probabilities (which is not true of Huffman coders). Since the problem of encoding predicted probability distributions to a file has been solved, the performance difference between compression algorithms is due to the way that they assign probabilities to these predictions. In the next section we give an overview of how arithmetic coding works.

2.1 Arithmetic Coding

Arithmetic coders perform two operations: encoding and decoding. The encoder takes as input a sequence of characters and a sequence of predicted probability distributions. It outputs a compressed representation of the sequence of characters. The decoder takes as input the compressed representation and a sequence of predicted probability distributions. It outputs the original sequence of characters (exactly equivalent to the encoder's input). The sequence of predicted probability distributions needs to be exactly the same for the encoder and the decoder in order for the decoder to reproduce the original character sequence. From the perspective of the component of a compression program which creates the predicted probability distributions, compression and decompression are equivalent. In order to generate a predicted probability distribution for a specific character, it takes as input all previous characters in the sequence. In the case of compression, it has access to these characters directly from the file it is trying to compress.
In the case of decompression, it has access to these characters from the output of the arithmetic decoder. Since the predictor has to output the exact same probability distributions for both compression and decompression, any randomized components of the algorithm need to be initialized with the same seed.

The process used to encode a sequence of characters by an arithmetic encoder is essentially equivalent to storing a single number (between 0 and 1). We will present how this process works through a small example. Suppose we have an alphabet of three characters: A, B, and C. Given the file "ABA" our goal is to compress this string using arithmetic encoding. For the first character prediction, assume that the predictor gives a uniform probability distribution. We can visualize this on a number line, as seen in the top of Figure 1.

Figure 1: An example of arithmetic coding. The alphabet consists of three characters: A, B, and C. The string being encoded is "ABA".

For the second character, assume that the predictor assigns a probability of 0.5 to A, 0.25 to B, and 0.25 to C. Since the first character in the sequence was A, the arithmetic encoder expands this region (0 to 1/3) and assigns the second predicted probability distribution according to this expanded region. This is visualized in the middle layer of Figure 1. For the final character in the sequence, assume that the predictor assigns a probability of 0.5 to A, 0.4 to B, and 0.1 to C. This is visualized in the bottom of Figure 1. Now all the arithmetic coder needs to do is store a single number between the values of 1/6 and 5/24. This number can be efficiently encoded using a binary search. The binary search ranges would be: "0 to 1", "0 to 0.5", "0 to 0.25", and finally "0.125 to 0.25". This represents the number 0.1875 (which falls in the desired range).
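The interval narrowing in this example can be sketched as follows. This is a minimal illustration using exact fractions; the function name `narrow` is ours, and this is not the coder used by PAQ8.

```python
# Minimal sketch of arithmetic-coding interval narrowing for the string "ABA".
# Each step shrinks [low, high) to the sub-interval of the predicted character.

from fractions import Fraction


def narrow(low, high, dist, symbol):
    """Shrink [low, high) to the sub-interval assigned to `symbol` by `dist`,
    where `dist` is an ordered list of (character, probability) pairs."""
    width = high - low
    for ch, p in dist:
        if ch == symbol:
            return low, low + width * p
        low += width * p
    raise ValueError(symbol)


third = Fraction(1, 3)
low, high = Fraction(0), Fraction(1)
# First character: uniform distribution over A, B, C.
low, high = narrow(low, high, [("A", third), ("B", third), ("C", third)], "A")
# Second character: A = 0.5, B = 0.25, C = 0.25.
low, high = narrow(low, high, [("A", Fraction(1, 2)), ("B", Fraction(1, 4)),
                               ("C", Fraction(1, 4))], "B")
# Third character: A = 0.5, B = 0.4, C = 0.1.
low, high = narrow(low, high, [("A", Fraction(1, 2)), ("B", Fraction(2, 5)),
                               ("C", Fraction(1, 10))], "A")
print(low, high)  # 1/6 5/24, the interval from the example above
```

Any number in the final interval identifies the string, which is why storing 0.1875 = 3/16 (inside [1/6, 5/24)) suffices.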
If we use "0" to encode the decision to use the lower half and "1" to encode the decision to use the upper half, this sequence can be represented in binary as "001".

Now consider the task of decoding the file. As input, the arithmetic decoder has the number 0.1875, and a sequence of predicted probability distributions. For the first character, the predictor gives a uniform probability distribution. The number 0.1875 falls into the 'A' sector, so the arithmetic decoder tells the predictor that the first character was 'A'. Similarly, for the next two characters the arithmetic decoder knows that the characters must be 'B' and 'A'. At this point, the arithmetic decoder needs some way to know that it has reached the end of the sequence. There are typically two techniques that are used to communicate the length of the sequence to the decoder. The first is to encode a special "end of sequence" character, so that when the decoder reaches this character it knows it has reached the end of the string. The second technique is to just store an additional integer along with the compressed file which represents the length of the sequence (this is usually more efficient in practice).

Although arithmetic coding can achieve optimal compression in theory, in practice there are two factors which prevent this. The first is the fact that files can only be stored to disk using a sequence of bytes, so this requires some overhead in comparison to storing the optimal number of bits. The second is the fact that precision limitations of floating point numbers prevent optimal encodings. In practice both of these factors result in relatively small overhead, so arithmetic coding still produces near-optimal encodings.

2.2 PPM

PPM (Cleary and Witten, 1984) is a lossless compression algorithm which consistently performs

Table 3: PPM model after processing the string "abracadabra" (up to the second order model).
This table is a recreation of a table from (Cleary et al., 1995).

Order k = 2          Order k = 1          Order k = 0          Order k = -1
Predictions  c  p    Predictions  c  p    Predictions  c  p    Predictions  c  p
ab → r    2  2/3     a → b    2  2/7      → a    5  5/16       → A    1  1/|A|
   → Esc  1  1/3       → c    1  1/7      → b    2  2/16
ac → a    1  1/2       → d    1  1/7      → c    1  1/16
   → Esc  1  1/2       → Esc  3  3/7      → d    1  1/16
ad → a    1  1/2     b → r    2  2/3      → r    2  2/16
   → Esc  1  1/2       → Esc  1  1/3      → Esc  5  5/16
br → a    2  2/3     c → a    1  1/2
   → Esc  1  1/3       → Esc  1  1/2
ca → d    1  1/2     d → a    1  1/2
   → Esc  1  1/2       → Esc  1  1/2
da → b    1  1/2     r → a    2  2/3
   → Esc  1  1/2       → Esc  1  1/3
ra → c    1  1/2
   → Esc  1  1/2

well on text compression benchmarks. It creates predicted probability distributions based on the history of characters in a sequence using a technique called context matching. Consider the alphabet of lower case English characters and the input sequence "abracadabra". For each character in this string, PPM needs to create a probability distribution representing how likely the character is to occur. For the first character in the sequence, there is no prior information about what character is likely to occur, so assigning a uniform distribution is the optimal strategy. For the second character in the sequence, 'a' can be assigned a slightly higher probability because it has been observed once in the input history.

Consider the task of predicting the character after the entire sequence. One way to go about this prediction is to find the longest match in the input history which matches the most recent input. In this case, the longest match is "abra" which occurs in the first and eighth positions. Based on the longest match, a good prediction for the next character in the sequence is simply the character immediately after the match in the input history. After the string "abra" was the character 'c' in the fifth position. Therefore 'c' is a good prediction for the next character. Longer context matches can result in better predictions than shorter ones.
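The longest-match prediction described above can be sketched as follows. This is an illustrative sketch of the matching step only; a real PPM implementation also blends multiple context lengths and handles escape events.

```python
def predict_longest_match(history):
    """Find the longest suffix of `history` that also occurs earlier in the
    history, and predict the character that followed that earlier occurrence."""
    for length in range(len(history) - 1, 0, -1):
        context = history[-length:]       # the most recent `length` characters
        pos = history.find(context)       # earliest occurrence of the context
        if pos != len(history) - length:  # it matched somewhere earlier
            return history[pos + length]  # character right after the match
    return None                           # no repeated context at all


print(predict_longest_match("abracadabra"))  # c  (longest match "abra" -> 'c')
```

For "abracadabra" the longest matching suffix is "abra", found at position one, and the character after that earlier match is 'c', exactly as in the example above.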
This is because longer matches are less likely to occur by chance or due to noise in the data.

PPM essentially creates probability distributions according to the method described above. Instead of generating the probability distribution entirely based on the longest context match, it blends the predictions of multiple context lengths and assigns a higher weight to longer matches. There are various techniques on how to go about blending different context lengths. The strategy used for combining different context lengths is partially responsible for the performance differences between various PPM implementations. One example of a technique used to generate the predicted probabilities is shown in Table 3 (Cleary et al., 1995). The table shows the state of the model after the string "abracadabra" has been processed. 'k' is the order of the context match, 'c' is the occurrence count for the context, and 'p' is the computed probability. 'Esc' refers to the event of an unexpected character and causes the algorithm to use a lower order model (weighted by the probability of the escape event). Note that the lowest order model (-1) has no escape event since it matches any possible character in the alphabet A.

PPM is a nonparametric model that adaptively changes based on the data it is compressing. It is not surprising that similar methods have been discovered in the field of Bayesian nonparametrics. The stochastic memoizer (Wood et al., 2009) is a nonparametric model based on an unbounded-depth hierarchical Pitman-Yor process. The stochastic memoizer shares several similarities with PPM implementations. The compression performance of the stochastic memoizer is currently comparable with some of the best PPM implementations.

2.3 Compression Metrics

One way of measuring compression performance is to use the file size of compressed data.
However, file size is dependent on a particular type of coding scheme (such as arithmetic coding or Huffman coding). There are two common metrics used to measure the performance directly based on the predicted probability distributions: cross entropy and perplexity. Cross entropy can be used to estimate the average number of bits needed to code each byte of the original data. For a sequence of N characters x_i, and a probability p(x_i) assigned to each character by the prediction algorithm, the cross entropy can be defined as:

  −(1/N) Σ_{i=1}^{N} log2 p(x_i)

This gives the expected number of bits needed to code each character of the string. Another common metric used to compare text prediction algorithms is perplexity, which can be defined as two to the power of cross entropy.

2.4 Lossless Image Compression

Compressing images represents a significantly different problem than compressing text. Lossless compression algorithms tend to work best on a sequence of characters which contain relatively little noise. They are well suited for natural language because the characters are highly redundant and contain less noise than individual pixels. The problem with noise is that it reduces the maximum context lengths which an algorithm like PPM can identify. It has been shown that a variant of PPM called prediction by partial approximate matching (PPAM) achieves competitive compression rates when compared to other lossless image compression algorithms (Yong and Adjeroh, 2008). PPAM uses approximate matches in order to find longer context lengths which can improve the compression ratio for images.

Another fundamental difference between text and images is that text is a single dimensional sequence of characters while images are two dimensional. Applying PPM to image compression requires mapping the pixels to a single sequential dimension.
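The cross entropy rate and perplexity defined in Section 2.3 above can be computed directly from the probabilities the predictor assigned to the characters that actually occurred; a minimal sketch:

```python
import math


def cross_entropy_rate(probs):
    """Average bits per character: -(1/N) * sum(log2 p(x_i)), where probs[i]
    is the probability the model assigned to the i-th observed character."""
    return -sum(math.log2(p) for p in probs) / len(probs)


def perplexity(probs):
    """Two to the power of the cross entropy rate."""
    return 2 ** cross_entropy_rate(probs)


# A model that assigns probability 0.5 to every observed character codes at
# 1 bit per character, i.e. perplexity 2.
print(cross_entropy_rate([0.5, 0.5, 0.5]))  # 1.0
print(perplexity([0.5, 0.5, 0.5]))          # 2.0
```

Note this metric depends only on the predicted probabilities, not on any particular coding scheme, which is exactly why it is preferred over compressed file size.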
A trivial way of performing this mapping is using a raster scan (i.e. scanning rows from top to bottom and pixels in each row from left to right). Another mapping known as the Hilbert curve (Hilbert, 1891) maximizes spatial locality (shown in Figure 2).

Figure 2: The fourth and fifth iteration of the Hilbert curve construction. Image courtesy of Zbigniew Fiedorowicz.

2.5 Distance Metrics

Keogh et al (Keogh et al., 2004) demonstrate how compression algorithms can be used as a distance metric between time series data. This distance metric can be used to solve several interesting problems such as clustering, anomaly detection, and classification. For example, the distance metric can be used to cluster a variety of types of files such as music, text documents, images, and genome sequences. Li et al (Li et al., 2001) propose the following metric to measure the distance between the strings x and y:

  d_k(x, y) = (K(x|y) + K(y|x)) / K(xy)

K(x) is defined to be the Kolmogorov complexity of a string x. That is the length of the shortest program capable of producing x on a universal computer. K(x|y) is the length of the shortest program that computes x when y is given as auxiliary input to the program. K(xy) is the length of the shortest program that outputs y concatenated to x. Kolmogorov complexity represents the best possible compression that can be achieved. Since the Kolmogorov complexity can not be directly computed, a compression algorithm can be used to approximate d_k(x, y):

  d_c(x, y) = (C(x|y) + C(y|x)) / C(xy)

C(x) is the compressed size of x and C(x|y) is the compressed size of x after the compressor has been trained on y. The better a compression algorithm, the closer it approaches the Kolmogorov complexity and the closer d_c(x, y) approximates d_k(x, y).
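As an illustration of how d_c can be approximated with an off-the-shelf compressor, the sketch below uses zlib as a stand-in (the report itself uses PAQ8) and estimates the conditional size C(x|y) by C(yx) − C(y); both choices are assumptions made for this example only.

```python
import zlib


def C(data):
    """Compressed size in bytes, using zlib as a stand-in compressor."""
    return len(zlib.compress(data, 9))


def d_c(x, y):
    """Approximate d_c(x, y) = (C(x|y) + C(y|x)) / C(xy), estimating the
    conditional compressed size C(x|y) by C(y + x) - C(y)."""
    cx_given_y = C(y + x) - C(y)  # stand-in for C(x|y)
    cy_given_x = C(x + y) - C(x)  # stand-in for C(y|x)
    return (cx_given_y + cy_given_x) / C(x + y)


x = b"abracadabra" * 50
y = b"abracadabra" * 50
z = bytes(range(256)) * 3
print(d_c(x, y) < d_c(x, z))  # True: similar strings are closer
```

A stronger compressor drives the estimate closer to the Kolmogorov quantities, which is the motivation for using PAQ8 here instead of zlib.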
Keogh et al (Keogh et al., 2004) use the following dissimilarity metric:

  d_CDM(x, y) = C(xy) / (C(x) + C(y))

The main justification made for using d_CDM(x, y) over d_c(x, y) is that it does not require the calculation of C(x|y) and C(y|x). These can not be calculated using most off-the-shelf compression algorithms without modifying their source code. Fortunately, PAQ8 is open source and these modifications can be easily implemented (so d_c can be used). For the purposes of classification, we investigated defining our own distance metrics. Using cross entropy, a more computationally efficient distance metric can be defined which requires only one pass through the data:

  d_e1(x, y) = E(x|y)

E(x|y) is the cross entropy of x after the compressor has been trained on y. We also investigated a symmetric version of this distance metric, in which d_e2(x, y) is always equal to d_e2(y, x):

  d_e2(x, y) = (E(x|y) + E(y|x)) / 2

Finally, Cilibrasi et al (Cilibrasi and Vitanyi, 2005) propose using the following distance metric:

  d_NDM(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y))

In section 4.2 of this report, we use a compression-based distance metric to perform classification. Keogh et al (Keogh et al., 2004) use the ZIP compression algorithm as a distance metric to perform experiments in clustering, anomaly detection, and classification. Since PAQ8 achieves better compression than ZIP, it should theoretically result in a better distance metric. Although we do not perform the experiments in this report, it would make interesting future work to compare ZIP and PAQ8 in the experiments by Keogh et al.

3 PAQ8

3.1 Architecture

PAQ8 uses a weighted combination of predictions from a large number of models. Most of the models are based on context matching. Unlike PPM, some of the models allow noncontiguous context matches.
Noncontiguous context matches improve noise robustness in comparison to PPM. This also enables PAQ8 to capture longer-term dependencies. Some of the models are specialized for particular types of data such as images or spreadsheets. Most PPM implementations make predictions on the byte level (given a sequence of bytes, they predict the next byte). However, all of the models used by PAQ8 make predictions on the bit level.

Some architectural details of PAQ8 depend on the version used. Even for a particular version of PAQ8, the algorithm changes based on the type of data detected. For example, fewer prediction models are used when image data is detected. We will provide a high-level overview of the architecture used by paq8l in the general case of when the file type is not recognized. paq8l is a stable version of PAQ8 released by Matt Mahoney in March 2007. The PAQ8 versions submitted to the Hutter prize include additional language modeling components not present in paq8l such as dictionary preprocessing and word-level modeling.

An overview of the paq8l architecture is shown in Figure 3. 552 prediction models are used. The model mixer combines the output of the 552 predictors into a single prediction. This prediction is then passed through an adaptive probability map (APM) before it is used by the arithmetic coder. In practice, APMs typically reduce prediction error by about 1%. APMs are also known as secondary symbol estimation (Mahoney, accessed April 15, 2011). APMs were originally developed by Serge Osnach for PAQ2. An APM is a two dimensional table which takes the model mixer prediction and a low order context as inputs and outputs a new prediction on a nonlinear scale (with finer resolution near 0 and 1). The table entries are adjusted according to prediction error after each bit is coded.

Figure 3: PAQ8 architecture.

3.2 Model Mixer

The paq8l model mixer architecture is shown in Figure 4.
The architecture closely resembles a neural network with one hidden layer. However, there are some subtle differences that distinguish it from a standard neural network. The first major difference is that the weights for the first and second layers are learned online and independently for each node. Unlike back propagation for a multi-layer network, each node is trained separately to minimize the predictive cross-entropy error, as outlined in Section 3.2.2. In this sense, PAQ8 is a type of ensemble method (Opitz and Maclin, 1999). Unlike typical ensembles, the parameters do not converge to fixed values unless the data is stationary. PAQ8 was designed for both stationary and non-stationary data. (We refer to "non-stationary data" as data in which the statistics change over time. For example, we would consider a novel to be non-stationary while a text document of some repeating string (e.g. "abababab...") to be stationary.)

Figure 4: PAQ8 model mixer architecture.

The second major difference between the model mixer and a standard neural network is the fact that the hidden nodes are partitioned into seven sets. For every bit of the data file, one node is selected from each set. The set sizes are shown in the rectangles of Figure 4. We refer to the leftmost rectangle as set 1 and the rightmost rectangle as set 7. Only the edges connected to these seven selected nodes are updated for each bit of the data. That means of the 552 × 3,080 = 1,700,160 weights in the first layer, only 552 × 7 = 3,864 of the weights are updated for each bit. This makes training the neural network several orders of magnitude faster.

Algorithm 1: paq8l node selection mechanism.
set1Index ← 8 + history(1)
set2Index ← history(0)
set3Index ← lowOrderMatches + 8 × ((lastFourBytes/32) mod 8)
if history(1) = history(2) then
    set3Index ← set3Index + 64
end if
set4Index ← history(2)
set5Index ← history(3)
set6Index ← round(log2(longestMatch) × 16)
if bitPosition = 0 then
    set7Index ← history(3)/128 + bitMask(history(1), 240) + 4 × (history(2)/64) + 2 × (lastFourBytes / 2^31)
else
    set7Index ← history(0) × 2^(8 − bitPosition)
    if bitPosition = 1 then
        set7Index ← set7Index + history(3)/2
    end if
    set7Index ← min(bitPosition, 5) × 256 + history(1)/32 + 8 × (history(2)/32) + bitMask(set7Index, 192)
end if

Each set uses a different selection mechanism to choose a node. Sets number 1, 2, 4, and 5 choose the node index based on a single byte in the input history. For example, if the byte for set 1 has a value of 4, the fifth node of set 1 would be selected. Set 1 uses the second most recent byte from the input history, set 2 uses the most recent byte, set 4 uses the third most recent byte, and set 5 uses the fourth most recent byte. Set 6 chooses the node based on the length of the longest context matched with the most recent input. Sets 3 and 7 use a combination of several bytes of the input history in order to choose a node index. The selection mechanism used by paq8l is shown in Algorithm 1. history(i) returns the i'th most recent byte, lowOrderMatches is the number of low-order contexts which have been observed at least once before (between 0 and 7), lastFourBytes is the four most recent bytes, longestMatch is the length of the longest context match (between 0 and 65534), bitMask(x, y) does a bitwise AND operation between x and y, and bitPosition is the bit index of the current byte (between 0 and 7).

3.2.1 Mixtures of Experts

In the previous section we compared the PAQ8 model mixer to a multilayer neural network.
The PAQ8 model mixer can also be compared to a technique known as "mixtures of experts" (Jacobs et al., 1991). Although PAQ8 does not use the standard mixtures of experts architecture, they do share some similarities. Jacobs et al (Jacobs et al., 1991) state: "If backpropagation is used to train a single, multilayer network to perform different subtasks on different occasions, there will generally be strong interference effects which lead to slow learning and poor generalization. If we know in advance that a set of training cases may be naturally divided into subsets that correspond to distinct subtasks, interference can be reduced by using a system composed of several different 'expert' networks plus a gating network that decides which of the experts should be used for each training case." Their architecture is shown in Figure 5.

Figure 5: Mixtures of experts architecture. This figure is a recreation of a figure in (Jacobs et al., 1991). All of the experts are feedforward networks and have the same input. The gating network acts as a switch to select a single expert. The output of the selected expert becomes the output of the system. Only the weights of the selected expert are trained.

PAQ8 and the mixtures of experts architecture both use a gating mechanism to choose expert models. The same problem-specific properties which led to the development of mixtures of experts also apply to compression - data can be naturally divided into subsets and separate 'experts' can be trained on each subset. Using a gating mechanism has the additional computational benefit that only one expert needs to be trained at a time, instead of training all experts simultaneously. Increasing the number of expert networks does not increase the time complexity of the algorithm.

One difference between the mixtures of experts model and the PAQ8 model mixture is the gating mechanism.
Jacobs et al use a feedforward network to learn the gating mechanism, while PAQ8 uses a deterministic algorithm (shown in Algorithm 1) which does not perform adaptive learning. The gating algorithm used by PAQ8 contains problem-specific knowledge which is specified a priori. One interesting area for future work would be to investigate the effect of adaptive learning in the PAQ8 gating mechanism. Adaptive learning could potentially lead to a better distribution of the data to each expert. Ideally, the data should be uniformly partitioned across all the experts; using a deterministic gating mechanism runs the risk of a particular expert being selected too often. The gating mechanism in PAQ is governed by the values of the input data. The idea of gating units in a network according to the value of the input has also been used in other recurrent neural network architectures. For example, in long short-term memory (LSTM) networks, it is used to keep hidden units switched on and hence avoid the problem of vanishing gradients in back-propagation; see e.g. Graves et al. (2009). However, the deterministic gating mechanism of PAQ is not intended to improve prediction performance or avoid vanishing gradients. Rather, its objective is to vastly reduce computation. We recommend that researchers working with RNNs, which take days to train, investigate ways of incorporating these ideas to speed up the training of RNNs. Adaptive training, instead of training to a fixed value, is another important aspect to keep in mind. There is a direct mapping between the mixtures of experts architecture and the PAQ8 model mixer architecture. Each "set" in the hidden layer of Figure 4 corresponds to a separate mixtures of experts model. The seven mixtures of experts are then combined using an additional feedforward layer.
The number of experts in each mixtures of experts model corresponds to the number of nodes in each set (e.g. there are 264 experts in set 1 and 1536 experts in set 7). As with the mixtures of experts architecture, only the weights of the expert chosen by the gating mechanism are trained for each bit of the data. Another difference between the standard mixtures of experts model and PAQ8 is that mixtures of experts models are typically optimized to converge towards a stationary objective function, while PAQ8 is designed to adaptively train on both stationary and non-stationary data.

3.2.2 Online Parameter Updating

Each node of the paq8l model mixer (both hidden and output) is a Bernoulli logistic model:

    p(y_t | x_t, w) = Ber(y_t | sigm(w^T x_t))

where w ∈ R^{n_p} is the vector of weights, x_t ∈ [0, 1]^{n_p} is the vector of predictors at time t, y_t ∈ {0, 1} is the next bit in the data being compressed, and sigm(η) = 1/(1 + e^{−η}) is the sigmoid or logistic function. n_p is the number of predictors; it is equal to 552 for the first layer of the neural network and 7 for the second layer. Let π_t = sigm(w^T x_t). The negative log-likelihood of the t-th bit is given by

    NLL(w) = −log[π_t^{I(y_t = 1)} × (1 − π_t)^{I(y_t = 0)}] = −[y_t log π_t + (1 − y_t) log(1 − π_t)]

where I(·) denotes the indicator function. The last expression is the cross-entropy error (also known as coding error) function term at time t. The logistic regression weights are updated online with first order updates:

    w_t = w_{t−1} − η ∇NLL(w_{t−1}) = w_{t−1} − η (π_t − y_t) x_t

The step size η is held constant to ensure ongoing adaptation.

3.2.3 Extended Kalman Filter

To improve the compression rate of paq8l, we applied an extended Kalman filter (EKF) to adapt the weights.
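The first-order update above can be sketched in a few lines of Python. The step size and the example inputs below are illustrative, not the fixed-point values used in the paq8l source.

```python
import math

def sigm(eta):
    """Logistic function: sigm(eta) = 1 / (1 + e^(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def online_update(w, x, y, step=0.02):
    """One online gradient step on the cross-entropy (coding) error:
    w_t = w_{t-1} - step * (pi_t - y_t) * x_t.
    The step size is held constant so adaptation never stops."""
    pi = sigm(sum(wi * xi for wi, xi in zip(w, x)))
    w_new = [wi - step * (pi - y) * xi for wi, xi in zip(w, x)]
    return w_new, pi
```

Starting from zero weights, observing the bit y = 1 nudges every weight with a positive predictor upward, by an amount proportional to that predictor's value.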
We assume a dynamic state-space model consisting of a Gaussian transition prior, w_{t+1} = w_t + N(0, Q), and a logistic observation model, y_t = π_t + N(0, r). The EKF, although based on local linearization of the observation model, is a second order adaptive method worthy of investigation. One of the earliest implementations of EKF to train multilayer perceptrons is due to Singhal and Wu (Singhal and Wu, 1989). Since EKF has O(n_p^2) time complexity, it would be unfeasibly slow to apply it to the first layer of the neural network. However, EKF can replace the method used by paq8l in the second layer of the network without significant computational cost, since there are only seven weights. Here we present the EKF algorithm for optimizing the second layer of the PAQ8 neural network. The following values were used to initialize EKF: Q = 0.15 × I_{7×7}, P_0 = 60 × I_{7×7}, w_0 = 150 × 1_{7×1}, and r = 5. The following are the EKF update equations for each bit of data:

    w_{t+1|t} = w_t
    P_{t+1|t} = P_t + Q
    K_{t+1} = P_{t+1|t} G'_{t+1} / (r + G_{t+1} P_{t+1|t} G'_{t+1})
    w_{t+1} = w_{t+1|t} + K_{t+1} (y_t − π_t)
    P_{t+1} = P_{t+1|t} − K_{t+1} G_{t+1} P_{t+1|t}

where G_{1×7} is the Jacobian matrix G = [∂y/∂w_1 ··· ∂y/∂w_7] with ∂y/∂w_i = y(1 − y) x_i. We compared the performance of EKF with other variants of paq8l. The results are shown in Table 4. The first three columns are paq8l with different settings of the level parameter. level is the only paq8l parameter that can be changed via the command line (without modifying the source code). It makes a tradeoff between speed, memory usage, and compression performance. It can be set to an integer value between zero and eight. Lower values of level are faster and use less memory but achieve worse compression performance.
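The update equations can be sketched in plain Python, using the initialization values given above. This is a self-contained illustration of the recursion only; how the resulting probability is fed back into paq8l's mixing pipeline is omitted.

```python
import math

N = 7  # number of second-layer weights

def ekf_init():
    """Initialization from the report: Q = 0.15 I, P0 = 60 I, w0 = 150*1, r = 5."""
    return {"w": [150.0] * N,
            "P": [[60.0 if i == j else 0.0 for j in range(N)] for i in range(N)],
            "r": 5.0, "q": 0.15}

def ekf_step(state, x, y_t):
    """One EKF update of the second-layer weights for a single bit y_t."""
    w, P, r, q = state["w"], state["P"], state["r"], state["q"]
    # Prediction: w_{t+1|t} = w_t and P_{t+1|t} = P_t + Q (Q diagonal)
    P_pred = [[P[i][j] + (q if i == j else 0.0) for j in range(N)]
              for i in range(N)]
    y = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))  # pi_t
    G = [y * (1.0 - y) * xi for xi in x]      # Jacobian: dy/dw_i = y(1-y) x_i
    PG = [sum(P_pred[i][j] * G[j] for j in range(N)) for i in range(N)]
    S = r + sum(G[i] * PG[i] for i in range(N))   # innovation variance (scalar)
    K = [PG[i] / S for i in range(N)]             # Kalman gain
    # Correction: w_{t+1} = w_{t+1|t} + K (y_t - pi_t), P_{t+1} = P_pred - K G P_pred
    state["w"] = [w[i] + K[i] * (y_t - y) for i in range(N)]
    state["P"] = [[P_pred[i][j] - K[i] * PG[j] for j in range(N)]
                  for i in range(N)]
    return y
```

Because there are only seven weights, each of these updates touches a 7 × 7 covariance matrix, which is why the O(n_p^2) cost is negligible here but prohibitive for the 552-predictor first layer.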
level=8 is the slowest setting and uses the most memory (up to 1643 MiB) but achieves the best compression performance. level=5 has a 233 MiB memory limit. paq8-8-tuned is a customized version of paq8l (with level=8) in which we changed the weight initialization value for the second layer of the neural network. We found that changing the initialization value from 32,767 to 128 improved compression performance. Finally, paq8-8-ekf refers to our modified version of paq8l with EKF used to update the weights in the second layer of the neural network. We find that using EKF slightly outperforms the first order updates. The improvement is about the same order of magnitude as the improvement between level=5 and level=8. However, changing level has a significant cost in memory usage, while using EKF has no significant computational cost. The initialization values for paq8-8-tuned and paq8-8-ekf were determined using manual parameter tuning on the first Calgary corpus file ('bib'). The performance difference between paq8l-8 and paq8-8-tuned is similar to the difference between paq8-8-tuned and paq8-8-ekf.

4 Applications

4.1 Adaptive Text Prediction and Game Playing

The fact that PAQ8 achieves state of the art compression results on text documents indicates that it can be used as a powerful model for natural language. PAQ8 can be used to find the string x that maximizes p(x | y) for some training string y. It can also be used to estimate the probability p(z | y) of a particular string z given some training string y. Both of these tasks are useful for several natural language applications. For example, many speech recognition systems are composed of an acoustic modeling component and a language modeling component. PAQ8 could be used to directly replace the language modeling component of any existing speech recognition system to achieve more accurate word predictions.
Text prediction can be used to minimize the number of keystrokes required to type a particular string (Garay-Vitoria and Abascal, 2006). These predictions can be used to improve the communication rate for people with disabilities and for people using slow input devices (such as mobile phones). We modified the source code of paq8l to create a program which predicts the next n characters while the user is typing a string. A new prediction is created after each input character is typed. The program calls fork() after each input character to create a process which generates the most likely next n characters. fork() is a system call on Unix-like operating systems which creates an exact copy of an existing process. The program can also be given a set of files to train on.

Table 4: PAQ8 compression rates on the Calgary corpus

    File      paq8l-1   paq8l-5   paq8l-8   paq8-8-tuned   paq8-8-ekf
    bib       1.64592   1.49697   1.49645   1.49486        1.49207
    book1     2.14158   2.00573   2.00078   2.00053        1.99603
    book2     1.73257   1.59531   1.5923    1.59198        1.58861
    geo       3.70451   3.43456   3.42725   3.42596        3.43444
    news      2.07839   1.90573   1.90284   1.90237        1.89887
    obj1      3.25932   2.77358   2.77407   2.76531        2.76852
    obj2      1.85614   1.45499   1.43815   1.43741        1.43584
    paper1    2.09455   1.96543   1.96542   1.96199        1.95753
    paper2    2.09389   1.99046   1.99053   1.9882         1.98358
    pic       0.6604    0.35088   0.35083   0.35073        0.3486
    progc     2.07449   1.91574   1.91469   1.91037        1.9071
    progl     1.31293   1.18313   1.18338   1.1813         1.18015
    progp     1.31669   1.1508    1.15065   1.14757        1.14614
    trans     1.1021    0.99169   0.99045   0.98857        0.98845
    Average   1.93382   1.72964   1.72698   1.7248         1.72328

Some preliminary observational studies of our text prediction system are shown in Figure 6. Note that PAQ8 continuously does online learning, even while making a prediction (as seen by the completion of "Byron Knoll" in the top example). The character predictions do capture some syntactic structures (as seen by the completion of LaTeX syntax) and even some semantic information implied by the training text.
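The fork()-based snapshot pattern described above can be sketched as follows. Here `generate` is a hypothetical stand-in for the PAQ8 routine that greedily emits the most likely next characters; the point is that the child process can freely consume and mutate its copy of the model state while the parent's state stays intact for the next keystroke.

```python
import os

def speculative_prediction(state, generate):
    """Fork a child that mutates (its copy of) `state` while generating a
    prediction; the parent's state is untouched. The prediction is sent
    back over a pipe. Unix-only, as in the modified paq8l."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                          # child: generate and exit
        os.close(r)
        os.write(w, generate(state).encode())
        os._exit(0)
    os.close(w)                           # parent: collect the prediction
    with os.fdopen(r, "rb") as f:
        prediction = f.read().decode()
    os.waitpid(pid, 0)
    return prediction
```

Copy-on-write fork makes this snapshot essentially free, even though PAQ8's model state is hundreds of megabytes.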
Sutskever et al (Sutskever et al., 2011) use Recurrent Neural Networks (RNNs) to perform text prediction. They also compare RNNs to the sequence memoizer and PAQ8 in terms of compression rate. They conclude that RNNs achieve better compression than the sequence memoizer but worse than PAQ8. They perform several text prediction tasks using RNNs with different training sets (similar to the examples in Figure 6). One difference between their method and ours is that PAQ8 continuously does online learning on the test data. This feature could be beneficial for text prediction applications because it allows the system to adapt to new users and to data that does not appear in the training set. We found that the PAQ8 text prediction program could be modified into a rock-paper-scissors AI that usually beats human players. Given a sequence of the opponent's rock-paper-scissors moves (such as "rpprssrps") it predicts the most likely next move for the opponent. In the next round, the AI plays the move that beats that prediction. The reason this strategy usually beats human players is that humans typically fall into predictable patterns after a large number of rock-paper-scissors rounds. The PAQ8 text prediction program and rock-paper-scissors AI are available for download (see Appendix A).

M —ay the contemplation of so many wonders extinguish the spirit of vengeance in him!
My — companions and I had decided to escape as soon as the vessel came close enough for us to be heard
My n —erves calmed a little, but with my brain so aroused, I did a swift review of my whole existence
My name i —n my ears and some enormous baleen whales
My name is B —ay of Bengal, the seas of the East Indies, the seas of China
My name is Byr —on and as if it was an insane idea.
But where the lounge. I stared at the ship bearing
My name is Byron K —eeling Island disappeared below the horizon,
My name is Byron Kn —ow how the skiff escaped from the Maelstrom's fearsome eddies,
My name is Byron Knoll. — It was an insane idea. Fortunately I controlled myself and stretched
My name is Byron Knoll. My name is B —yron Knoll. My name is Byron Knoll. My name is Byron Knoll.
F —or example, consider the form of the exponential family
Fi —gure ∼\ref{fig:betaPriorPost}(c) shows what happens as the number of heads in the past data.
Figure o —f the data, as follows: \bea \gauss(\mu|\gamma, \lambda(2\alpha-1))
Figure ou —r conclusions are a convex combination of the prior mean and the constraints
Figure out Bayesian theory we must. Jo —rdan conjugate prior
Figure out Bayesian theory we must. Jos —h Tenenbaum point of the posterior mean is and mode of the
Figure out Bayesian theory we must. Josh agrees. Long live P(\vtheta — |\data)

Figure 6: Two examples of PAQ8 interactive text prediction sessions. The user typed the text in boldface and PAQ8 generated the prediction after the "—" symbol. We shortened some of the predictions for presentation purposes. In the top example, PAQ8 was trained on "Twenty Thousand Leagues Under the Seas" (Jules Verne, 1869). In the bottom example, PAQ8 was trained on the LaTeX source of a machine learning book by Kevin P. Murphy.

4.2 Classification

In many classification settings of practical interest, the data appears in sequences (e.g. text). Text categorization has particular relevance for classification tasks in the web domain (such as spam filtering). Even when the data does not appear to be obviously sequential in nature (e.g. images), one can sometimes find ingenious ways of mapping the data to sequences. Compression-based classification was discovered independently by several researchers (Marton et al., 2005).
One of the main benefits of compression-based methods is that they are very easy to apply: they usually require no data preprocessing or parameter tuning. There are several standard procedures for performing compression-based classification. These procedures all take advantage of the fact that when compressing the concatenation of two pieces of data, compression programs tend to achieve better compression rates when the data share common patterns. If a data point in the test set compresses well with a particular class in the training set, it likely belongs to that class. Any of the distance metrics defined in Section 2.5 can be directly used to do classification (for example, using the k-nearest neighbor algorithm). We developed a classification algorithm using PAQ8 and show that it can be used to achieve competitive classification rates in two disparate domains: text categorization and shape recognition.

4.2.1 Techniques

Marton et al (Marton et al., 2005) describe three common compression-based classification procedures: standard minimum description length (SMDL), approximate minimum description length (AMDL), and best-compression neighbor (BCN). Suppose each data point in the training and test sets is stored in a separate file. Each file in the training set belongs to one of the classes C_1, ..., C_N. Let the file A_i be the concatenation of all training files in class C_i. SMDL runs a compression algorithm on each A_i to obtain a model (or dictionary) M_i. Each test file T is compressed using each M_i. T is assigned to the class C_i whose model M_i results in the best compression of T. While

Table 5: The number of times each bit of data gets compressed using different compression-based classification methods. N_X is the number of training files, N_Y is the number of test files, and N_Z is the number of classes.
    Method   Training data   Test data
    SMDL     1               N_Z
    AMDL     N_Y + 1         N_Z
    BCN      N_Y + 1         N_X

compressing T, the compression algorithm does not update the model M_i. For a file F, let f(F) be the file size of the compressed version of F. Also, let A_i T be the file A_i concatenated with T. For AMDL, T is assigned to the class C_i which minimizes the difference f(A_i T) − f(A_i). Let B be a file in the training set. BCN checks every pair B T and assigns T to the class which minimizes the difference f(B T) − f(B). A speed comparison between these methods can be made by considering how many times each bit of the data gets compressed, as shown in Table 5. It should be noted that the primary difference between SMDL and AMDL is that SMDL only processes the training data once, while AMDL reprocesses the training data for every file in the test set. For many datasets, the number of training and test files (N_X and N_Y) is much larger than the number of classes (N_Z). This means that SMDL can be orders of magnitude faster than AMDL (and AMDL faster than BCN). Although PAQ8 achieves state of the art compression rates, it is also extremely slow compared to the majority of compression algorithms. Using PAQ8 for classification on large datasets would be unfeasibly slow with AMDL or BCN.

4.2.2 PAQclass

AMDL and BCN both work with off-the-shelf compression programs. However, implementing SMDL usually requires access to a compression program's source code. Since PAQ8 is open source, we modified the source code of paq8l (a version of PAQ8) to implement SMDL. We call this classifier PAQclass. To the best of our knowledge, PAQ has never been modified to implement SMDL before. We changed the source code to call fork() when it finishes processing the training data for a particular class. One forked process is created for every file in the test set.
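For illustration, the AMDL procedure defined above can be sketched with zlib standing in for PAQ8 (any off-the-shelf compressor works for AMDL; the class names and corpora in the usage below are made up):

```python
import zlib

def f(data: bytes) -> int:
    """Compressed size of `data`; zlib stands in for the compressor."""
    return len(zlib.compress(data, 9))

def amdl_classify(test_doc: bytes, class_corpora: dict) -> str:
    """AMDL: assign T to the class C_i minimizing f(A_i T) - f(A_i),
    where A_i is the concatenation of class i's training files."""
    return min(class_corpora,
               key=lambda c: f(class_corpora[c] + test_doc)
                           - f(class_corpora[c]))
```

The cost structure in Table 5 is visible here: every call reprocesses an entire class corpus, which is exactly why AMDL (and BCN) are impractical with a compressor as slow as PAQ8, and why we implement SMDL instead.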
This essentially copies the state of the compressor after training and allows each test file to be compressed independently. Note that this procedure is slightly different from SMDL because the model M_i continues to be adaptively modified while it is processing test file T. However, it still has the same time complexity as SMDL. paq8l has one parameter to set the compression level. We used the default parameter setting of 5 during classification experiments. Compression performance for AMDL and BCN is measured using file size. The use of file size is fundamentally limited in two ways. The first is that it is only accurate to within a byte (due to the way files are stored on disk). The second is that it relies on the non-optimal arithmetic coding process to encode files to disk. Cross entropy is a better measurement of compression performance because it is subject to neither of these limitations. Since we had access to the paq8l source code, we used cross entropy as a measure of compression performance instead of file size.

Table 6: Number of documents in each category of the 20news dataset.

    Class                       Count
    alt.atheism                 799
    comp.graphics               973
    comp.os.ms-windows.misc     985
    comp.sys.ibm.pc.hardware    982
    comp.sys.mac.hardware       961
    comp.windows.x              980
    misc.forsale                972
    rec.autos                   990
    rec.motorcycles             994
    rec.sport.baseball          994
    rec.sport.hockey            999
    sci.crypt                   991
    sci.electronics             981
    sci.med                     990
    sci.space                   987
    soc.religion.christian      997
    talk.politics.guns          910
    talk.politics.mideast       940
    talk.politics.misc          775
    talk.religion.misc          628
    Total                       18,828

4.2.3 Text Categorization

Text categorization is the problem of assigning documents to categories based on their content. We evaluated PAQclass on the 20 Newsgroup (20news) dataset. This dataset contains 18,828 newsgroup documents partitioned (nearly) evenly across 20 categories. The number of documents in each category is shown in Table 6. We used J.
Rennie's version of the corpus, which is available at: http://people.csail.mit.edu/people/jrennie/20Newsgroups/20news-18828.tar.gz. In this version of the corpus, duplicate postings to more than one newsgroup were removed. Most message headers were also removed, while the "Subject" and "From" fields were retained. We evaluated PAQclass using randomized 80-20 train-test splits. The 80-20 split seems to be the most common evaluation protocol used on this dataset. No document preprocessing was performed. The results are shown in Table 7. Our result of 92.3526% correct is competitive with the best results published for this dataset. Table 8 shows some comparative results. It should be noted that there are several versions of the 20news dataset and many publications use different evaluation protocols, so some of these published results cannot be directly compared. For example, Zhang & Oles (Zhang and Oles, 2001) report a figure of 94.8% correct on a version of the dataset in which the "Newsgroup:" headers were not removed from messages. The four best results (including PAQclass) in Table 8 (Marton et al., 2005; Weinberger and Saul, 2009; Kibriya et al., 2005) all seem to use the same version of 20news. PAQclass outperforms classification using the RAR compression algorithm (Marton et al., 2005) on this dataset.

Table 7: Classification results on the 20news dataset. Each row shows one run of a randomized 80-20 train-test split.

    Correct classifications (out of 3766)   Percent correct
    3470                                    92.1402
    3482                                    92.4588
    3466                                    92.034
    3492                                    92.7244
    3480                                    92.4057
    Average                                 92.3526

Table 8: Comparative results on the 20news dataset. Our results are in boldface.

    Methodology                            Protocol                 Percent correct
    extended version of Naive Bayes        80-20 train-test split   86.2 (Rennie et al., 2003)
    SVM + error correcting output coding   80-20 train-test split   87.5 (Rennie, 2001)
    language modeling                      80-20 train-test split   89.23 (Peng et al., 2004)
    AMDL using RAR compression             80-20 train-test split   90.5 (Marton et al., 2005)
    multiclass SVM + linear kernel         70-30 train-test split   91.96 (Weinberger and Saul, 2009)
    PAQclass                               80-20 train-test split   92.35
    multinomial Naive Bayes + TFIDF        80-20 train-test split   93.65 (Kibriya et al., 2005)

4.2.4 Shape Recognition

Shape recognition is the problem of assigning images to categories based on the shape or contour of an object within the image. We evaluated PAQclass on the chicken dataset of (Andreu et al., 1997), available from http://algoval.essex.ac.uk/data/sequence/chicken. This dataset contains 446 binary images of chicken parts in five categories (see Table 9). Example images from this dataset are shown in Figure 7. The chicken pieces in the images are not set to a standard orientation. The images are square and vary in resolution from 556 × 556 to 874 × 874. As discussed in Section 2.4, compressing images poses a significantly different problem compared to compressing text. There does not seem to be a large body of research on using compression-based methods for image classification (in comparison to text categorization). This may be due to the fact that compression-based methods tend to be slow, and may be infeasible for the large datasets often used for object recognition tasks. Lossy compression algorithms can be used for performing compression-based classification.

Figure 7: Five example images from each class of the chicken dataset. The images have not been rotated.

Table 9: Number of images in each class of the chicken dataset.

    Class            Count
    back             76
    breast           96
    drumstick        96
    thigh and back   61
    wing             117
    Total            446

There are several options for creating one-dimensional lossy representations of images. For example, Watanabe et al (Watanabe et al., 2002) demonstrate a method of converting images to text. They show that their system is effective for image classification tasks.
Wei et al (Wei et al., 2008) describe a method of converting shape contours into time series data. They use this representation to achieve successful classification results on the chicken dataset. Based on these results, we decided to combine this representation with PAQ8 classification. Figure 8 demonstrates how we convert images in the chicken dataset into one-dimensional time series data. The first step is to calculate the centroid of the shape. We project a ray from the centroid point and measure the distance to the edge of the shape. If the ray intersects the shape edge at multiple points, we take the furthest intersection (as seen at point 5 in Figure 8). We rotate the ray around the entire shape and take measurements at uniform intervals. The number of measurements taken along the shape contour is a tunable parameter. Once the Euclidean distance is measured for a particular angle of the ray, it is converted into a single byte by rounding the result of the formula (100 × distance / width), where distance is the Euclidean distance measurement and width is the width of the image. PAQclass is then run on this binary data.

Figure 8: An example of converting a shape into one-dimensional time series data. The original image is shown on top (part of the "wing" class of the chicken dataset) and the time series data on the bottom. Points along the contour have been labeled and the corresponding points on the time series are shown.

Table 10: Leave-one-out classification results on the chicken dataset with different settings of the "number of measurements" parameter. There are a total of 446 images. The row with the best classification results is in boldface.
    Number of      Correct           Percent
    measurements   classifications   correct
    1              162               36.3229
    5              271               60.7623
    10             328               73.5426
    30             365               81.8386
    35             380               85.2018
    38             365               81.8386
    39             363               81.3901
    40             389               87.2197
    41             367               82.287
    42             367               82.287
    45             359               80.4933
    50             352               78.9238
    100            358               80.2691
    200            348               78.0269
    300            339               76.009

The classification results for different settings of the "number of measurements" parameter are shown in Table 10. We used leave-one-out cross-validation since this seems to be the most common evaluation protocol used on this dataset.

Table 11: Confusion matrix for the chicken dataset with the "number of measurements" parameter set to 40. C1=back, C2=breast, C3=drumstick, C4=thigh and back, C5=wing.

                 Predicted
    Actual   C1   C2   C3   C4   C5
    C1       55   10   2    2    7
    C2       0    93   0    3    0
    C3       0    5    84   0    7
    C4       0    8    0    48   5
    C5       3    1    3    1    109

Table 12: Comparative results on the chicken dataset. Our results are in boldface.

    Methodology                           Protocol                 Percent correct
    1-NN + Levenshtein edit distance      leave-one-out            ≈67 (Mollineda et al., 2002)
    1-NN + HMM-based distance             leave-one-out            73.77 (Bicego and Trudda, 2008)
    1-NN + mBm-based features             leave-one-out            76.5 (Bicego and Trudda, 2008)
    1-NN + approximated cyclic distance   leave-one-out            ≈78 (Mollineda et al., 2002)
    1-NN + convert to time series         leave-one-out            80.04 (Wei et al., 2008)
    SVM + HMM-based entropic features     leave-one-out            81.21 (Perina et al., 2009)
    SVM + HMM-based nonlinear kernel      50-50 train-test split   85.52 (Carli et al., 2009)
    SVM + HMM-based Fisher kernel         50-50 train-test split   85.8 (Bicego et al., 2009)
    PAQclass + convert to time series     leave-one-out            87.22

The "number of measurements" parameter had a surprisingly large effect on classification accuracy. Adjusting the parameter by a single measurement from the best value (40) resulted in a loss of roughly 5 to 6% in accuracy.
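The contour-to-time-series conversion described in Section 4.2.4 can be sketched as below. Instead of literally rotating a ray, this sketch bins every foreground pixel by its angle from the centroid and keeps the farthest pixel per angular bin, which approximates taking the furthest ray intersection; the clipping to 255 is our assumption for keeping each measurement in one byte.

```python
import math

def shape_to_series(mask, n_measurements=40):
    """Convert a binary shape mask (list of rows of 0/1) into time series
    data: for each of n_measurements uniform angles around the centroid,
    record the farthest shape pixel, quantized as round(100 * distance / width)."""
    pts = [(x, y) for y, row in enumerate(mask)
           for x, v in enumerate(row) if v]
    cx = sum(x for x, _ in pts) / len(pts)      # shape centroid
    cy = sum(y for _, y in pts) / len(pts)
    width = len(mask[0])
    farthest = [0.0] * n_measurements
    for x, y in pts:
        dx, dy = x - cx, y - cy
        sector = int((math.atan2(dy, dx) % (2 * math.pi))
                     / (2 * math.pi) * n_measurements) % n_measurements
        farthest[sector] = max(farthest[sector], math.hypot(dx, dy))
    # one byte per measurement: round(100 * distance / width)
    return bytes(min(255, round(100 * d / width)) for d in farthest)
```

The resulting byte string is exactly the binary data that PAQclass is then run on, and `n_measurements` is the tunable parameter studied in Table 10.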
Another unfortunate property of this parameter is that classification accuracy is not a convex function of its value (as seen at the parameter value 35). This means that finding the optimal value of the parameter would require an exhaustive search. Due to time constraints, we did not perform an exhaustive search (only the experiments in Table 10 were performed). Table 11 shows a confusion matrix at the best parameter setting. Our result of 87.2197% correct classifications is among the best results published for this dataset. Table 12 shows some comparative results. The classification procedure we used is not rotationally invariant. Since the chicken pieces in the dataset can be in arbitrary orientations, this could lead to a decrease in classification accuracy. Wei et al (Wei et al., 2008) use the same one-dimensional image representation as we do, but with a rotationally invariant classification procedure: a 1-nearest-neighbor classifier combined with a Euclidean distance metric. When comparing the distance between two images, they try all possible rotations and use the angle which results in the lowest Euclidean distance between the time series representations. This same procedure of trying all possible orientations could be used to make the PAQclass classification procedure rotationally invariant (although it would increase the algorithm's running time). Alternatively, we could use a single rotationally invariant representation, such as always setting the smallest sampled edge distance to be at angle = 0. The effect of rotational invariance on classification accuracy would make interesting future work.

4.3 Lossy Compression

Figure 9: 256 6 × 6 image filters trained using k-means on the CIFAR-10 dataset.

PAQ8 can also be used for performing lossy compression. Any lossy representation can potentially be passed through PAQ8 to achieve additional compression.
For example, paq8l can losslessly compress JPEG images by about 20% to 30%. paq8l contains a model specifically designed for JPEG images. It essentially undoes the lossless compression steps performed by JPEG (keeping the lossy representation) and then performs lossless compression more efficiently. To create a lossy image compression algorithm, we first created a set of filters based on a method described by Coates et al (Coates et al., 2011). We used the k-means algorithm to learn a set of 256 6 × 6 filters on the CIFAR-10 image dataset (Krizhevsky, 2009). The filters were trained using 400,000 randomly selected image patches. The filters are shown in Figure 9. In order to create a lossy image representation, we calculated the closest filter match to each image patch in the original image. These filter selections were encoded by performing a raster scan through the image and using one byte per patch to store the filter index. The filter selections were then losslessly compressed using paq8l. Some example images compressed using this method are shown in Figures 10, 11, 12, and 13.

Figure 10: top-left image: original (700 × 525 pixels), top-right image: our compression method (4083 bytes), bottom-left: JPEG (16783 bytes), bottom-right: JPEG2000 (4097 bytes)

At the maximum JPEG compression rate, the JPEG images were still larger than the images created using our method. Even at a larger file size, the JPEG images appeared to be of lower visual quality than the images compressed using our method. We also compared against the more advanced lossy compression algorithm JPEG2000. JPEG2000 has been designed to exploit limitations of human visual perception: the eye is less sensitive to color variation at high spatial frequencies and has different degrees of sensitivity to brightness variation depending on spatial frequency (Mahoney, accessed April 15, 2011).
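The patch-encoding step described above can be sketched as follows. Filters are flattened to length-36 vectors, a grayscale image and non-overlapping patches are assumed (the report does not specify overlap handling), and the k-means training itself is omitted.

```python
def encode_lossy(image, filters, patch=6):
    """Raster-scan the image in non-overlapping patch x patch blocks and
    emit one byte per block: the index of the closest (squared-error)
    filter. `image` is a 2-D list; each filter is a flat list of
    patch*patch values. The returned byte stream is what would then be
    compressed losslessly (e.g. by paq8l)."""
    rows, cols = len(image) // patch, len(image[0]) // patch
    out = bytearray()
    for bi in range(rows):
        for bj in range(cols):
            block = [image[bi * patch + i][bj * patch + j]
                     for i in range(patch) for j in range(patch)]
            best = min(range(len(filters)),
                       key=lambda k: sum((f - b) ** 2
                                         for f, b in zip(filters[k], block)))
            out.append(best)          # one byte per patch, raster order
    return bytes(out)
```

With 256 filters the index always fits in one byte, so a W × H image is reduced to (W/6)(H/6) bytes before the lossless stage; decoding simply tiles the chosen filters back into place.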
Our method was not designed to exploit these limitations (we leave this as future work); it simply uses the filters learned from data. Based on the set of test images, JPEG2000 appears to outperform our method in terms of visual quality (at the same compression rate).

Figure 11: top-left image: original (525 × 700 pixels), top-right image: our compression method (1493 bytes), bottom-left: JPEG (5995 bytes), bottom-right: JPEG2000 (1585 bytes)

5 Conclusion

We hope this technical exposition of PAQ8 will make the method more accessible and stir up new research in the area of temporal pattern learning and prediction. Casting the weight updates in a statistical setting already enabled us to make modest improvements to the technique. We tried several other techniques from the fields of stochastic approximation and nonlinear filtering, including the unscented Kalman filter, but did not observe significant improvements over the EKF implementation. One promising technique from the field of nonlinear filtering we have not yet implemented is Rao-Blackwellized particle filtering for online logistic regression (Andrieu et al., 2001). We leave this for future work. The way in which PAQ8 adaptively combines predictions from multiple models using context matching is different from what is typically done with mixtures of experts and ensemble methods such as boosting and random forests. A statistical perspective on this, which allows for a generalization of the technique, should be the focus of future efforts. Bridging the gap between the online learning framework (Cesa-Bianchi and Lugosi, 2006) and PAQ8 is a potentially fruitful research direction. Recent developments in RNNs seem to be synergistic with PAQ8, but this still requires methodical exploration.
Of particular relevance is the adoption of PAQ8's deterministic gating architecture so as to reduce the enormous computational cost of training RNNs. This should be done in conjunction with a move toward adaptive prediction.

Figure 12: top-left: original (700 × 525 pixels); top-right: our compression method (3239 bytes); bottom-left: JPEG (16077 bytes); bottom-right: JPEG2000 (2948 bytes).

On the application front, we found it remarkable that a single algorithm could be used to tackle such a broad range of tasks. In fact, many other tasks could have been tackled, including clustering, compression-based distance metrics, anomaly detection, speech recognition, and interactive interfaces. It is equally remarkable how the method achieves results comparable to the state of the art in text classification and image compression. There are challenges in deploying PAQ beyond this point. The first challenge is that the models are nonparametric and hence require enormous storage capacity; a better memory architecture, with some forgetting, is needed. The second challenge is that PAQ applies only to univariate sequences. A computationally efficient extension to multiple sequences does not seem trivial. In this sense, RNNs have an advantage over PAQ, PPM, and stochastic memoizers.

Acknowledgements

We would like to thank Matt Mahoney, who enthusiastically helped us understand important details of PAQ and provided us with many insights into predictive data compression. We would also like to thank Ben Marlin and Ilya Sutskever for discussions that helped improve this manuscript.

Figure 13: top-left: original (700 × 525 pixels); top-right: our compression method (3970 bytes); bottom-left: JPEG (6335 bytes); bottom-right: JPEG2000 (3863 bytes).

A PAQ8 Demonstrations

Two of the applications are available at: http://cs.ubc.ca/~knoll/PAQ8-demos.zip.
The first demonstration is text prediction and the second is a rock-paper-scissors AI. Instructions are provided on how to compile and run the programs in Linux.

References

G. Andreu, A. Crespo, and J. M. Valiente. Selecting the toroidal self-organizing feature maps (TSOFM) best organized to object recognition. In International Conference on Neural Networks, volume 2, pages 1341–1346, 1997.

C. Andrieu, N. de Freitas, and A. Doucet. Rao-Blackwellised particle filtering via data augmentation. Advances in Neural Information Processing Systems (NIPS), 2001.

T. Bell, J. Cleary, and I. Witten. Text Compression. Prentice-Hall, Inc., 1990.

M. Bicego and A. Trudda. 2D shape classification using multifractional Brownian motion. In Structural, Syntactic, and Statistical Pattern Recognition, volume 5342 of Lecture Notes in Computer Science, pages 906–916. Springer, 2008.

M. Bicego, M. Cristani, V. Murino, E. Pekalska, and R. Duin. Clustering-based construction of hidden Markov models for generative kernels. In Energy Minimization Methods in Computer Vision and Pattern Recognition, volume 5681 of Lecture Notes in Computer Science, pages 466–479. Springer, 2009.

A. Carli, M. Bicego, S. Baldo, and V. Murino. Non-linear generative embeddings for kernels on latent variable models. In IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pages 154–161, 2009.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

R. Cilibrasi and P. M. B. Vitanyi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2005.

J. Cleary and I. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396–402, 1984.

J. Cleary, W. Teahan, and I. Witten. Unbounded length contexts for PPM. Data Compression Conference, 1995.

A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. AISTATS 14, 2011.

N. Garay-Vitoria and J. Abascal. Text prediction systems: a survey. Universal Access in the Information Society, 4:188–203, 2006.

J. Gasthaus, F. Wood, and Y. W. Teh. Lossless compression based on the sequence memoizer. Data Compression Conference, pages 337–345, 2010.

A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868, 2009.

J. Hawkins and S. Blakeslee. On Intelligence. Owl Books, 2005.

D. Hilbert. Ueber die stetige Abbildung einer Linie auf ein Flächenstück. Mathematische Annalen, 38:459–460, 1891.

D. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.

M. Hutter. The human knowledge compression prize. http://prize.hutter1.net, accessed April 15, 2011.

R. Jacobs, M. Jordan, S. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79–87, 1991.

E. Keogh, S. Lonardi, and C. A. Ratanamahatana. Towards parameter-free data mining. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 206–215, 2004.

A. Kibriya, E. Frank, B. Pfahringer, and G. Holmes. Multinomial naive Bayes for text categorization revisited. In Advances in Artificial Intelligence, volume 3339 of Lecture Notes in Computer Science, pages 235–252. Springer, 2005.

A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.

M. Li, J. H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17(2):149–154, 2001.

M. Mahoney. Adaptive weighing of context models for lossless data compression. Florida Tech. Technical Report CS-2005-16, 2005.

M. Mahoney. Data compression explained. http://mattmahoney.net/dc/dce.html, accessed April 15, 2011.

Y. Marton, N. Wu, and L. Hellerstein. On compression-based text classification. In Advances in Information Retrieval, volume 3408 of Lecture Notes in Computer Science, pages 300–314. Springer, 2005.

R. A. Mollineda, E. Vidal, and F. Casacuberta. Cyclic sequence alignments: Approximate versus optimal techniques. International Journal of Pattern Recognition and Artificial Intelligence, 16:291–299, 2002.

D. Opitz and R. Maclin. Popular ensemble methods: an empirical study. Journal of Artificial Intelligence Research, 11:169–198, 1999.

F. Peng, D. Schuurmans, and S. Wang. Augmenting naive Bayes classifiers with statistical language models. Information Retrieval, 7:317–345, 2004.

A. Perina, M. Cristani, U. Castellani, and V. Murino. A new generative feature set based on entropy distance for discriminative classification. In Image Analysis and Processing – ICIAP 2009, volume 5716 of Lecture Notes in Computer Science, pages 199–208. Springer, 2009.

J. D. M. Rennie. Improving multi-class text classification with naive Bayes. Master's thesis, M.I.T., 2001.

J. D. M. Rennie, L. Shih, J. Teevan, and D. Karger. Tackling the poor assumptions of naive Bayes text classifiers. In International Conference on Machine Learning (ICML), pages 616–623, 2003.

J. Rissanen and G. G. Langdon. Arithmetic coding. IBM Journal of Research and Development, 23:149–162, 1979.

D. Shkarin. PPM: one step to practicality. In Data Compression Conference, pages 202–211, 2002.

S. Singhal and L. Wu. Training multilayer perceptrons with the extended Kalman algorithm. In Advances in Neural Information Processing Systems (NIPS), pages 133–140, 1989.

I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In International Conference on Machine Learning (ICML), 2011.

T. Watanabe, K. Sugawara, and H. Sugihara. A new pattern representation scheme using data compression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):579–590, 2002.

L. Wei, E. Keogh, X. Xi, and M. Yoder. Efficiently finding unusual shapes in large image databases. Data Mining and Knowledge Discovery, 17:343–376, 2008.

K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.

F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y. W. Teh. A stochastic memoizer for sequence data. In International Conference on Machine Learning (ICML), pages 1129–1136, 2009.

Z. Yong and D. A. Adjeroh. Prediction by partial approximate matching for lossless image compression. IEEE Transactions on Image Processing, 17(6):924–935, 2008.

T. Zhang and F. Oles. Text categorization based on regularized linear classification methods. Information Retrieval, 4:5–31, 2001.