Notes on a New Philosophy of Empirical Science
This book presents a methodology and philosophy of empirical science based on large scale lossless data compression. In this view a theory is scientific if it can be used to build a data compression program, and it is valuable if it can compress a st…
Authors: Daniel Burfoot
Notes on a New Philosophy of Empirical Science (Draft Version)

Daniel Burfoot

April 2011

Release Notes

This document is a draft version of the book, tentatively titled "Notes on a New Philosophy of Empirical Science". This book represents an attempt to build a philosophy of science on top of a large number of technical ideas related to information theory, machine learning, computer vision, and computational linguistics. It seemed necessary, in order to make the arguments convincing, to include brief summary descriptions of these ideas. It is probably inevitable that the technical summaries will contain a number of errors or misconceptions, related to, for example, the Shi-Malik image segmentation algorithm or the BLEU metric for evaluating machine translation results. While it seems unlikely that such errors could derail the central arguments of the book, it is not impossible. The reader is advised to exercise caution and consult the relevant literature directly.

The book has been influenced by a diverse set of authors and ideas. Especially influential references are cited with an asterisk, for example: [92]*. The final version of the book will probably include a brief description of how each key reference influenced the development of the book's ideas.

This draft version contains all the major ideas and themes of the book. However, it also contains no small number of blemishes, disfluencies, and other shortcomings. Two holes are particularly glaring. First, the chapter on computer vision includes an analysis of evaluation methods for the task of optical flow estimation, as well as a comperical reformulation of the task, but does not describe the task itself. Interested readers can repair this problem by a Google Scholar search for the term "optical flow". Second, there should be another thought experiment involving Sophie and Simon near the end of Chapter 2.
In this thought experiment, Sophie proposes to use a birdsong synthesizer to create virtual labels, thereby obviating the need for Simon to label audio clips by hand. This idea comes up again in the section on the evaluation of face detection algorithms, where a graphics program for face modeling is used instead.

It is a profound and necessary truth that the deep things in science are not found because they are useful, they are found because it was possible to find them.
-Oppenheimer

It is the mark of a higher culture to value the little unpretentious truths which have been discovered by means of rigorous method more highly than the errors handed down by metaphysical ages and men, which blind us and make us happy.
-Nietzsche

Go as far as you can see; when you get there you will be able to see farther.
-Carlyle

Contents

Release Notes
Nomenclature
1 Compression Rate Method
  1.1 Philosophical Foundations of Empirical Science
    1.1.1 Objectivity, Irrationality, and Progress
    1.1.2 Validation Methods and Taxonomy of Scientific Activity
    1.1.3 Toward a Scientific Method
    1.1.4 Occam's Razor
    1.1.5 Problem of Demarcation and Falsifiability Principle
    1.1.6 Science as a Search Through Theory-Space
    1.1.7 Circularity Commitment and Reusability Hypothesis
  1.2 Sophie's Method
    1.2.1 The Shaman
    1.2.2 The Dead Experimentalist
    1.2.3 The Rival Theory
  1.3 Compression Rate Method
    1.3.1 Data Compression is Empirical Science
    1.3.2 Comparison to Popperian Philosophy
    1.3.3 Circularity and Reusability in Context of Data Compression
    1.3.4 The Invisible Summit
    1.3.5 Objective Statistics
  1.4 Example Inquiries
    1.4.1 Roadside Video Camera
    1.4.2 English Text Corpus
    1.4.3 Visual Manhattan Project
  1.5 Sampling and Simulation
    1.5.1 Veridical Simulation Principle of Science
  1.6 Comparison to Physics
2 Compression and Learning
  2.1 Machine Learning
    2.1.1 Standard Formulation of Supervised Learning
    2.1.2 Simplified Description of Learning Algorithms
    2.1.3 Generalization View of Learning
    2.1.4 Compression View
    2.1.5 Equivalence of Views
    2.1.6 Limits of Model Complexity in Canonical Task
    2.1.7 Intrinsically Complex Phenomena
    2.1.8 Comperical Reformulation of Canonical Task
  2.2 Manual Overfitting
    2.2.1 The Stock Trading Robot
    2.2.2 Analysis of Manual Overfitting
    2.2.3 Train and Test Data
    2.2.4 Comperical Solution to Manual Overfitting
  2.3 Indirect Learning
    2.3.1 Dilbert's Challenge
    2.3.2 The Japanese Quiz
    2.3.3 Sophie's Self-Confidence
    2.3.4 Direct and Indirect Approaches to Learning
  2.4 Natural Setting of the Learning Problem
    2.4.1 Robotics and Machine Learning
    2.4.2 Reinforcement Learning
    2.4.3 Pedro Rabbit and the Poison Leaf
    2.4.4 Simon's New Hobby
    2.4.5 Foundational Assumptions of Supervised Learning
    2.4.6 Natural Form of Input Data
      2.4.6.1 Data is a Stream
      2.4.6.2 The Stream is Vast
      2.4.6.3 Supervision is Scarce
    2.4.7 Natural Output of Learning Process
      2.4.7.1 Dimensional Analysis
      2.4.7.2 Predicting the Stream
    2.4.8 Synthesis: Dual System View of Brain
3 Compression and Vision
  3.1 Representative Research in Computer Vision
    3.1.1 Edge Detection
    3.1.2 Image Segmentation
    3.1.3 Stereo Matching
    3.1.4 Object Recognition and Face Detection
  3.2 Evaluation Methodologies in Computer Vision
    3.2.1 Evaluation of Image Segmentation Algorithms
    3.2.2 Evaluation of Edge Detectors
    3.2.3 Evaluation of Object Recognition
    3.2.4 Evaluation of Stereo Matching and Optical Flow Estimation
  3.3 Critical Analysis of Field
    3.3.1 Weakness of Empirical Evaluation
    3.3.2 Ambiguity of Problem Definition and Replication of Effort
    3.3.3 Failure of Decomposition Strategy
    3.3.4 Computer Vision is not Empirical Science
    3.3.5 The Elemental Recognizer
  3.4 Comperical Formulation of Computer Vision
    3.4.1 Abstract Formulation of Computer Vision
    3.4.2 Stereo Correspondence
    3.4.3 Optical Flow Estimation
    3.4.4 Image Segmentation
    3.4.5 Face Detection and Modeling
4 Compression and Language
  4.1 Computational Linguistics
  4.2 Parsing
    4.2.1 Evaluation of Parsing Systems
    4.2.2 Critical Analysis
    4.2.3 Comperical Formulation
  4.3 Statistical Machine Translation
    4.3.1 Evaluation of Machine Translation Systems
    4.3.2 Critical Analysis
    4.3.3 Comperical Formulation of Machine Translation
  4.4 Statistical Language Modeling
    4.4.1 Comparison of Approaches to Language Modeling
  4.5 Additional Remarks
    4.5.1 Chomskyan Formulation of Linguistics
    4.5.2 Prediction of Progress
5 Compression as Paradigm
  5.1 Scientific Paradigms
    5.1.1 Requirements of Scientific Paradigms
    5.1.2 The Casimir Effect
    5.1.3 The Microprocessor Paradigm
    5.1.4 The Chess Paradigm
    5.1.5 Artificial Intelligence as Pre-paradigm Field
    5.1.6 The Brooksian Paradigm Candidate
  5.2 Comperical Paradigm
    5.2.1 Conceptual Clarity and Parsimonious Justification
    5.2.2 Methodological Efficiency
    5.2.3 Scalable Evaluation
    5.2.4 Systematic Progress
6 Meta-Theories and Unification
  6.1 Compression Formats and Meta-Formats
    6.1.1 Encoding Formats and Empirical Theories
  6.2 Local Scientific Theories
  6.3 Cartographic Meta-Theory
  6.4 Search for New Empirical Meta-Theories
  6.5 Unification
    6.5.1 The Form of Birds
    6.5.2 Unification in Comperical Science
    6.5.3 Universal Grammar and the Form of Language
    6.5.4 The Form of Forms
A Information Theory
  A.1 Basic Principle of Data Compression
    A.1.1 Encoding
B Related Work
  B.1 Hutter Prize
  B.2 Generative Model Philosophy
  B.3 Unsupervised and Semi-Supervised Learning
  B.4 Traditional Data Compression
References

Chapter 1
Compression Rate Method

1.1 Philosophical Foundations of Empirical Science

In a remarkable paper published in 1964, a biophysicist named John Platt pointed out the somewhat impolitic fact that some scientific fields made progress much more rapidly than others [89]*. Platt cited particle physics and molecular biology as exemplar fields in which progress was especially rapid.
To illustrate this speed, he relates the following anecdote:

  [Particle physicists asked the question]: Do the fundamental particles conserve mirror-symmetry or "parity" in certain reactions, or do they not? The crucial experiments were suggested: within a few months they were done, and conservation of parity was found to be excluded. Richard Garwin, Leon Lederman, and Marcel Weinrich did one of the crucial experiments. It was thought of one evening at suppertime: by midnight they had arranged the apparatus for it; and by 4 am they had picked up the predicted pulses showing the non-conservation of parity.

Platt attributed this rapid progress not to the superior intelligence of particle physicists and molecular biologists, but to the fact that they used a more rigorous scientific methodology, which he called Strong Inference. In Platt's view, the key requirement of rapid science is the ability to rapidly generate new theories, test them, and discard those that prove to be incompatible with the evidence.

Many observers of fields such as artificial intelligence (AI), computer vision, computational linguistics, and machine learning will agree that, in spite of the journalistic hype surrounding them, these fields do not make rapid progress. Research in artificial intelligence began over 50 years ago. In spite of the bold pronouncements made at the time, the field has failed to transform society. Robots do not walk the streets; intelligent systems are generally brittle and function only within narrow domains. This lack of progress is illustrated by a comment by Marvin Minsky, one of the founders of AI, in reference to David Marr, one of the founders of computer vision:

  After [David Marr] joined us, our team became the most famous vision group in the world, but the one with the fewest results. His idea was a disaster. The edge finders they have now using his theories, as far as I can see, are slightly worse than the ones we had just before taking him on. We've lost twenty years ([26], pg. 189).

This book argues that the lack of progress in artificial intelligence and related fields is caused by philosophical limitations, not by technical ones. Researchers in these fields have no scientific methodology of power comparable to Platt's concept of Strong Inference. They do not rapidly generate, test, and discard theories in the way that particle physicists do. This kind of critique has been uttered before, and would hardly justify a book-length exposition. Rather, the purpose of this book is to propose a scientific method that can be used, at least, for computer vision and computational linguistics, and probably for several other fields as well.

To set the stage for the proposal it is necessary to briefly examine the unique intellectual content of the scientific method. This uniqueness can be highlighted by comparing it to a theory of physics such as quantum mechanics. While quantum mechanics often seems mysterious and perplexing to beginning students, the scientific method appears obvious and inevitable. Physicists are constantly testing, examining, and searching for failures of quantum mechanics. The scientific method itself receives no comparable interrogation. Physicists are quite confident that quantum mechanics is wrong in some subtle way: one of their great goals is to find a unified theory that reconciles the conflicting predictions made by quantum mechanics and general relativity. In contrast, it is not even clear what it would mean for the scientific method to be wrong.

But consider the following chain of causation: the scientific method allowed humans to discover physics, physics allowed humans to develop technology, and technology allowed humans to reshape the world.
The fact that the scientific method succeeds must reveal some abstract truth about the nature of reality. Put another way, the scientific method depends implicitly on some assertions or propositions, and because those assertions happen to be true, the method works. But what is the content of those assertions? Can they be examined, modified, or generalized?

This chapter begins with an attempt to analyze and document the assertions and philosophical commitments upon which the scientific method depends. Then, a series of thought experiments illustrate how a slight change to one of the statements results in a modified version of the method. This new version is based on large scale lossless data compression, and it uses large databases instead of experimental observation as the necessary empirical ingredient. The remainder of the chapter argues that the new method retains all the crucial characteristics of the original. The significance of the new method is that it allows researchers to conduct investigations into aspects of empirical reality that have never before been systematically interrogated. For example, Chapter 3 shows that attempting to compress a database of natural images results in a field very similar to computer vision. Similarly, attempting to compress large text databases results in a field very similar to computational linguistics. The starting point in the development is a consideration of one of the most critical components of science: objectivity.

1.1.1 Objectivity, Irrationality, and Progress

The history of humanity clearly indicates that humans are prone to dangerous flights of irrationality. Psychologists have shown that humans suffer from a wide range of cognitive blind spots, with names like Scope Insensitivity and Availability Bias [56].
One special aspect of human irrationality of particular relevance to science is the human propensity to enshrine theories, abstractions, and explanations without sufficient evidence. Often, once a person decides that a certain theory is true, he begins to use that theory to interpret all new evidence. This distorting effect prevents him from seeing the flaws in the theory itself. Thus Ptolemy believed that the Sun rotated around the Earth, while Aristotle believed that all matter could be decomposed into the elements of fire, air, water, or earth.

Individual human fallibility is not the only obstacle to intellectual progress; another powerful barrier is group irrationality. Humans are fundamentally social creatures; no individual acting alone could ever obtain substantial knowledge about the world. Instead, humans must rely on a division of labor in which knowledge-acquisition tasks are delegated to groups of dedicated specialists. This division of labor is replicated even within the scientific community: physicists rely extensively on the experimental and theoretical work of other physicists. But groups are vulnerable to an additional set of perception-distorting effects involving issues such as status, signalling, politics, conformity pressure, and pluralistic ignorance. A low-ranking individual in a large group cannot comfortably disagree with the statements of a high-ranking individual, even if the former has truth on his side. Furthermore, scientists are naturally competitive and skeptical. A scientist proposing a new result must be prepared to defend it against inevitable criticism.

To overcome the problems of individual irrationality and group irrationality, a single principle is tremendously important: the principle of objectivity. Objectivity requires that new results be validated by a mechanistic procedure that cannot be influenced by individual perceptions or sociopolitical effects.
While humans may implement the validation procedure, it must be somehow independent of the particular oddities of the human mind. In the language of computer science, the procedure must be like an abstract algorithm that does not depend on the particular architecture of the machine it is running on. The validation procedure helps to prevent individual irrationality, by requiring scientists to hammer their ideas against a hard anvil. It also protects against group irrationality, by providing scientists with a strong shield against criticism and pressure from the group.

The objectivity principle is also an important requirement for a field to make progress. Researchers in all fields love to publish papers. If a field lacks an objective validation procedure, it is difficult to prevent people from publishing low-quality papers that contain incorrect results or meaningless observations. The so-called hard sciences such as mathematics, physics, and engineering employ highly objective evaluation procedures, which facilitates rapid progress. Fields such as psychology, economics, and medical science rely on statistical methods to validate their results. These methods are less rigorous, and this leads to significant problems in these fields, as illustrated by a recent paper entitled "Why most published research findings are false" [52]. Nonscientific fields such as literature and history rely on the qualitative judgments of practitioners for the purposes of evaluation. These examples illustrate a striking correlation between the objectivity of a field's evaluation methods and the degree of progress it achieves.

1.1.2 Validation Methods and Taxonomy of Scientific Activity

The idea of objectivity, and the mechanism by which various fields achieve objectivity, can be used to define a useful taxonomy of scientific fields.
Scientific activity, broadly considered, can be categorized into three parts: mathematics, empirical science, and engineering. These activities intersect at many levels, and often a single individual will make contributions in more than one area. But the three categories produce very distinct kinds of results, and utilize different mechanisms to validate the results and thereby achieve objectivity.

Mathematicians see the goal of their efforts as the discovery of new theorems. A theorem is fundamentally a statement of implication: if a certain set of assumptions is true, then some derived conclusion must hold. The legitimate mechanism for demonstrating the validity of a new result is a proof. Proofs can be examined by other mathematicians and verified to be correct, and this process provides the field with its objective validation mechanism. It is worthwhile to note that practical utility plays no essential role in the validation process. Mathematicians may hope that their results are useful to others, but this is not a requirement for a theorem to be considered correct.

Engineers, in contrast, take as their basic goal the development of practical devices. A device is a set of interoperating components that produce some useful effect. The word "useful" may be broadly interpreted: sometimes the utility of a new device may be speculative, or it may be useful only as a subcomponent of a larger device. Either way, the requirement for proclaiming success in engineering is a demonstration that the device works. It is very difficult to game this process: if the new airplane fails to take off or the new microprocessor fails to multiply numbers correctly, it is obvious that these results are low-quality. Thus, this public demonstration process provides engineering with its method of objective validation.

The third category, and the focus of this book, is empirical science.
Empirical scientists attempt to obtain theories of natural phenomena. A theory is a tool that enables the scientist to make predictions regarding a phenomenon. The value and quality of a theory depend entirely on how well it can predict the phenomenon to which it applies. Empirical scientists are similar to mathematicians in the purist attitude they take toward the product of their research: they may hope that a new theory will have practical applications, but this is not a requirement.

Mathematics and engineering both have long histories. Mathematics dates back at least to 500 BC, when Pythagoras proved the theorem that bears his name. Engineering is even older; perhaps the first engineers were the men who fashioned axes and spearheads out of flint and thus ushered in the Stone Age. Systematic empirical science, in contrast, started only relatively recently, building on the work of thinkers like Galileo, Descartes, and Bacon. It is worth asking why the ancient philosophers in civilizations like Greece, Babylon, India, and China, in spite of their general intellectual advancement, did not begin a systematic empirical investigation of various natural phenomena.

The delay could have been caused by the fact that, for a long time, no one realized that there could be, or needed to be, an area of intellectual inquiry that was distinct from mathematics and engineering. Even today, it is difficult for nonscientists to appreciate the difference between a statement of mathematics and a statement of empirical science. After all, physical laws are almost always expressed in mathematical terms. What is the difference between the Newtonian statement F = ma and the Pythagorean theorem a² + b² = c²? These statements, though they are expressed in a similar form, are in fact completely different constructs: one is an empirical theory, the other is a mathematical law. Several heuristics can be used to differentiate between the two types of statement.
One good technique is to ask if the statement could be invalidated by some new observation or evidence. One could draw a misshapen triangle that did not obey the Pythagorean theorem, but that would hardly mean anything about the truth of the theorem. In contrast, there are observations that could invalidate Newton's laws, and in fact such observations were made as a result of Einstein's theory of relativity. There are, in turn, also observations that could disprove relativity.

Ancient thinkers might also have failed to see how it could be meaningful to make statements about the world that were not essentially connected to the development of practical devices. An ancient might very well have believed that it was impossible or meaningless to find a unique optimal theory of gravity and mass. On this view, scientists should instead develop a toolbox of methods for treating these phenomena, and engineers should then select a tool that is well-suited to the task at hand. So an engineer might very well utilize one theory of gravity to design a bridge, and then use some other theory when designing a catapult. In this mindset, theories can only be evaluated by incorporating them into some practical device and then testing the device.

Empirical science is also unique in that it depends on a special process for obtaining new results. This process is called the scientific method; there is no analogous procedure in mathematics or engineering. Without the scientific method, empirical scientists cannot do much more than make catalogs of disconnected and uninterpretable observations. When equipped with the method, scientists begin to discern the structure and meaning of the observational data. But as explained in the next section, the scientific method is only obvious in hindsight. It is built upon deep philosophical commitments that would have seemed bizarre to an ancient thinker.
1.1.3 Toward a Scientific Method

To understand the philosophical commitments implicit in empirical science, and to see why those commitments were nonobvious to the ancients, it is helpful to look at some other plausible scientific procedures. To do so, it is convenient to introduce the following simplified abstract description of the goal of scientific reasoning. Let x be an experimental configuration, and y be the experimental outcome. The variables x and y should be thought of not as numbers but as large packets of information including descriptions of various objects and quantities. The goal of science is to find a function f(·) that predicts the outcome of the configuration: y = f(x).

A first approach to this problem, which can be called the pure theoretical approach, is to deduce the form of f(·) using logic alone. In this view, scientists should use the same mechanism for proving their statements that mathematicians use. Here there is no need to check the results of a prediction against the experimental outcome. Just as it is meaningless to check the Pythagorean theorem by drawing triangles and measuring their sides, it is meaningless to check the function f(·) against the actual outcomes y. Mathematicians can achieve perfect confidence in their theories without making any kind of appeal to experimental validation, so why shouldn't scientists be able to reason the same way? If Euclid can prove, based on purely logical and conceptual considerations, that the sum of the angles of a triangle adds up to 180 degrees, why cannot Aristotle use analogous considerations to conclude that all matter is composed of the four classical elements? A subtle critic of this approach might point out that mathematicians require the use of axioms, from which they deduce their results, and it is not clear what statements can play this role in the investigation of real-world phenomena.
But even this criticism can be answered; perhaps the existence of human reason is the only necessary axiom, or perhaps the axioms can be found in religious texts. Even if someone had proposed to check a prediction against the actual outcome, it is not at all clear what this means or how to go about doing it. What would it mean to check Aristotle's theory of the four elements? The ancients must have viewed the crisp proof-based validation method of mathematics as far more rigorous and intellectually satisfying than the tedious, error-prone, and conceptually murky process of observation and prediction-checking.

At the other extreme from the pure theoretical approach is the strategy of searching for f(·) using a purely experimental investigation of various phenomena. The plan here would be to conduct a large number of experiments, and compile the results into an enormous almanac. Then, to make a prediction in a given situation, one simply looks up a similar situation in the almanac, and uses the recorded value. For example, one might want to predict whether a bridge will collapse under a certain weight. Then one simply looks up the section marked "bridges" in the almanac, finds the bridge in the almanac that is most similar to the one in question, and notes how much weight it could bear. In other words, the researchers obtain a large number of data samples {x_i, y_i} and define f(·) as an enormous lookup table.

The pure experimental approach has an obvious drawback: it is immensely labor-intensive. The researchers given the task of compiling the section on bridges must construct several different kinds of bridges, and load them with weight until they collapse. Bridge building is not easy work, and the almanac section on bridges is only one among many. The pure experimental approach may also be inaccurate, if the almanac includes only a few examples relating to a certain topic.
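The almanac strategy described above amounts to a nearest-neighbor lookup table. A minimal sketch in Python makes this concrete; the bridge entries and the one-dimensional similarity measure are invented purely for illustration:

```python
# A toy version of the "almanac" strategy: record observed (configuration,
# outcome) pairs, then predict by looking up the most similar recorded
# configuration. The bridge data below is invented for illustration.

def almanac_predict(almanac, x):
    # Find the recorded configuration closest to x and return its outcome.
    nearest = min(almanac, key=lambda entry: abs(entry[0] - x))
    return nearest[1]

# (span in meters, weight borne before collapse in tonnes)
bridges = [(10, 80), (20, 55), (40, 30), (80, 12)]

# Predict the capacity of a 35 m bridge: the closest entry is the 40 m one.
print(almanac_predict(bridges, 35))  # 30
```

The drawbacks noted in the text appear directly: every entry in the table costs one real experiment, and the prediction is only as good as the nearest recorded case, so sparse sections of the almanac give poor answers.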
Obviously, neither the pure theoretical approach nor the pure experimental approach is very practical. The great insight of empirical science is that one can effectively combine experimental and theoretical investigation in the following way. First, a set of experiments corresponding to configurations {x_1, x_2, ..., x_N} are performed, leading to outcomes {y_1, y_2, ..., y_N}. The difference between this process and the pure experimental approach is that here the number of tested configurations is much smaller. Then, in the theoretical phase, one attempts to find a function f(·) that agrees with all of the data: y_i = f(x_i) for all i. If such a function is found, and it is in some sense simple, then one concludes that it will generalize and make correct predictions when applied to new configurations that have not yet been tested. This description of the scientific process should produce a healthy dose of sympathy for the ancient thinkers who failed to discover it. The idea of generalization, which is totally essential to the entire process, is completely nonobvious and raises a number of nearly intractable philosophical issues. The hybrid process assumes the existence of a finite number of observations x_i, but claims to produce a universal predictive rule f(·). Under what circumstances is this legitimate? Philosophers have been grappling with this question, called the Problem of Induction, since the time of David Hume. Also, a moment's reflection indicates that the problem considered in the theoretical phase does not have a unique solution. If the observed data set is finite, then there will be a large number of functions f(·) that agree with it. These functions must make the same predictions for the known data x_i, but may make very different predictions for other configurations.
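The hybrid strategy can be sketched in a few lines: fit a simple candidate function to a small experimental corpus, then use it to predict untested configurations. The linear family and the data below are purely illustrative:

```python
# Sketch of the hybrid approach: a few experiments, then a theoretical
# phase that fits a simple functional form, then generalization.

configurations = [1.0, 2.0, 3.0, 4.0]   # x_i, the tested configurations
outcomes = [2.1, 3.9, 6.0, 8.1]         # y_i, roughly y = 2x

# Theoretical phase: propose the simple family f(x) = a*x and choose a
# by least squares over the small experimental corpus.
a = sum(x * y for x, y in zip(configurations, outcomes)) \
    / sum(x * x for x in configurations)

def f(x):
    return a * x

# Generalization: predict a configuration that was never tested.
print(round(f(10.0), 1))
```

The philosophical burden sits entirely in the last line: nothing in the four measured points logically compels the prediction at x = 10.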
1.1.4 Occam's Razor

William of Occam famously articulated the principle that bears his name with the Latin phrase entia non sunt multiplicanda praeter necessitatem: entities must not be multiplied without necessity. In plainer English, this means that if a theory is adequate to explain a body of observations, then one should not add gratuitous embellishments or clauses to it. To wield Occam's Razor means to take a theory and cut away all of the inessential parts until only the core idea remains. Scientists use Occam's Razor to deal with the problem of theory degeneracy mentioned above. Given a finite set of experimental configurations {x_1, x_2, ..., x_N} and corresponding observed outcomes {y_1, y_2, ..., y_N}, there will always be an infinite number of functions f_1, f_2, ... that agree with all the observations. The number of compatible theories is infinite because one can always produce a new theory by adding a new clause or qualification to a previous theory. For example, one theory might be expressed in English as "General relativity holds everywhere in space". This theory agrees with all known experimental data. But one could then produce a new theory that says "General relativity holds everywhere in space except in the Alpha Centauri solar system, where Newton's laws hold." Since it is quite difficult to show the superiority of the theory of relativity over Newtonian mechanics even in our local solar system, it is probably almost impossible to show that relativity holds in some other, far-off star system. Furthermore, an impious philosopher could generate an effectively infinite number of variant theories of this kind, simply by replacing "Alpha Centauri" with the name of some other star. This produces a vast number of conflicting accounts of physical reality, each with about the same degree of empirical evidence.
Scientists use Occam's Razor to deal with this kind of crisis by justifying the disqualification of the variant theories mentioned above. Each of the variants has a gratuitous subclause that specifies a special region of space where relativity does not hold. The subclause does not improve the theory's descriptive accuracy; the theory would still agree with all observational data if it were removed. Thus, the basic theory that relativity holds everywhere stands out as the simplest theory that agrees with all the evidence. Occam's Razor instructs us to accept the basic theory as the current champion, and only revise it if some new contradictory evidence arrives. This idea sounds attractive in the abstract, but raises a thorny philosophical problem when put into practice. Formally, the razor requires one to construct a functional H[f] that rates the complexity of a theory. Then, given a set of theories F all of which agree with the empirical data, the champion theory is simply the least complex member of F:

f* = argmin_{f ∈ F} H[f]

The problem is: how does one obtain the complexity functional H? Given two candidate definitions for the functional, how does one decide which is superior? It may very well be that complexity is in the eye of the beholder, and that two observers can legitimately disagree about which of two theories is more complex. This disagreement would, in turn, cause them to disagree about which member of a set of candidate theories should be considered the champion on the basis of the currently available evidence. This kind of disagreement appears to undermine the objectivity of science. Fortunately, in practice, the issue is not insurmountable. Informal measures of theory complexity, such as the number of words required to describe a theory in English, seem to work well enough.
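The selection rule f* = argmin_{f ∈ F} H[f] can be sketched using the informal word-count measure of complexity just mentioned; the candidate list is illustrative:

```python
# Sketch of Occam's Razor as a selection rule. The complexity functional H
# here is simply the word count of a theory's English description, one of
# the rough informal measures mentioned in the text.

def H(theory_description):
    """Informal complexity: number of words in the description."""
    return len(theory_description.split())

# Candidate theories, all assumed to agree with every observation so far.
candidates = [
    "General relativity holds everywhere in space",
    "General relativity holds everywhere in space except in the "
    "Alpha Centauri solar system, where Newton's laws hold",
]

# The champion is the least complex member of the agreeing set.
champion = min(candidates, key=H)
print(champion)  # the variant with the gratuitous subclause loses
```

Of course, the entire philosophical difficulty is hidden in the choice of H: a different, equally defensible functional could crown a different champion.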
Most scientists would agree that "relativity holds everywhere" is simpler than "relativity holds everywhere except around Alpha Centauri". If a disagreement persists, then the disputants can, in most cases, settle the issue by running an actual experiment.

1.1.5 Problem of Demarcation and Falsifiability Principle

The great philosopher of science Karl Popper proposed a principle called falsifiability that substantially clarified the meaning and justification of scientific theorizing [92]*. Popper was motivated by a desire to rid the world of pseudosciences such as astrology and alchemy. The problem with this goal is that astrologers and alchemists may very well appear to be doing real science, especially to laypeople. Astrologers may employ mathematics, and alchemists may utilize much of the same equipment as chemists. Some people who promote creationist or religiously inspired accounts of the origins of life make plausible-sounding arguments and appear to be following the rules of logical inference. These kinds of surface similarities may make it impossible for nonspecialists to determine which fields are scientific and which are not. Indeed, even if everyone agreed that astronomy is science but astrology is not, it would be important from a philosophical perspective to justify this determination. Popper calls this the Problem of Demarcation: how to separate scientific theories from nonscientific ones. Popper answered this question by proposing the principle of falsifiability. He required that, in order for a theory to be scientific, it must make a prediction with enough confidence that, if the prediction disagreed with the actual outcome of an appropriate experiment or observation, the theory would be discarded. In other words, a scientist proposing a new theory must be willing to risk embarrassment if it turns out the theory does not agree with reality.
This rule prevents people from constructing grandiose theories that have no empirical consequences. It also prevents people from using a theory as a lens that distorts all observations so as to render them compatible with its abstractions. If Aristotle had been aware of the idea of falsifiability, he might have avoided developing his silly theory of the four elements, by realizing that it made no concrete predictions. In terms of the notation developed above, the falsifiability principle requires that a theory can be instantiated as a function f(·) that applies to some real-world configurations. Furthermore, the theory must designate a configuration x and a prediction f(x) with enough confidence that if the experiment is done, and the resulting y value does not agree with the prediction (y ≠ f(x)), then the theory is discarded. This condition is fairly weak, since it requires a prediction for only a single configuration. The point is that the falsifiability principle does not say anything about the value of a theory; it only states a requirement for the theory to be considered scientific. It is a sort of precondition that guarantees that the theory can be evaluated in relation to other theories. It is very possible for a theory to be scientific but wrong. In addition to marking a boundary between science and pseudoscience, the falsifiability principle also permits one to delineate between statements of mathematics and statements of empirical science. Mathematical statements are not falsifiable in the same way empirical statements are. Mathematicians do not and cannot use the falsifiability principle; their results are verified using an alternate criterion: the mathematical proof. No new empirical observation or experiment could falsify the Pythagorean theorem. A person who drew a right triangle and attempted to show that the lengths of its sides did not satisfy a^2 + b^2 = c^2 would just be ridiculed.
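The formal condition of the principle can be sketched as a small predicate; the tolerance and the example drop experiment are illustrative assumptions, not part of Popper's formulation:

```python
# Sketch of the falsifiability requirement: a scientific theory must name
# a configuration and a confident prediction, and is discarded when the
# observed outcome disagrees with that prediction.

def is_falsified(prediction, observed, tolerance=0.0):
    """A theory is falsified when its designated prediction fails."""
    return abs(prediction - observed) > tolerance

# Designated experiment: drop a ball 10 m and time the fall.
predicted_fall_time = 1.43   # seconds, the theory's confident prediction
observed_fall_time = 1.43    # seconds, the measured outcome

print(is_falsified(predicted_fall_time, observed_fall_time, 0.05))
```

Note that the predicate can only ever return "falsified" or "not yet falsified"; there is no value it can return that certifies the theory as true.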
Mathematical statements are fundamentally implications: if the axioms are satisfied, then the conclusions follow logically. The falsifiability principle is strong medicine, and comes, as it were, with a set of powerful side effects. Most prominently, the principle allows one to conclude that a theory is false, but provides no mechanism whatever to justify the conclusion that a theory is true. This fact is rooted in one of the most basic rules of logical inference: it is impossible to assert universal conclusions on the basis of existential premises. Consider the theory "all swans are white". The sighting of a black swan, and the resulting premise "some swans are black", leads one to conclude that the theory is false. But no matter how many white swans one may happen to observe, one cannot conclude with perfect confidence that the theory is true. According to Popper, the only way to establish a scientific theory is to falsify all of its competitors. But because the number of competitors is vast, they cannot all be disqualified. This promotes a stance of radical skepticism towards scientific knowledge.

1.1.6 Science as a Search Through Theory-Space

Though the scientific method is not monolithic or precisely defined, the following list describes it fairly well:

1. Through observation and experiment, amass an initial corpus of configuration-outcome pairs {x_i, y_i} relating to some phenomenon of interest.

2. Let f_C be the initial champion theory.

3. Through observation and analysis, develop a new theory, which may be either a refinement of the champion theory or something completely new. Prefer simpler candidate theories to more complex ones.

4. Instantiate the new theory in a predictive function f_N. If this cannot be done, the theory is not scientific.

5. Find a configuration x for which f_C(x) ≠ f_N(x), and run the indicated experiment.

6.
If the outcome agrees with the rival theory, y = f_N(x), then discard the old champion and set f_C = f_N. Otherwise discard f_N.

7. Return to step #3.

The scientific process described above makes a crucial assumption, which is that perfect agreement between theory and experiment can be observed, such that y = f(x). In practice, scientists never observe y = f(x) but rather y ≈ f(x). This fact does not break the process described above, because even if neither theory is perfectly correct, it is reasonable to assess one theory as "more correct" than another and thereby discard the less correct one. However, the fact that real experiments never agree perfectly with theoretical predictions has important philosophical consequences, because it means that scientists are searching not for perfect truth but for good approximations. Most physicists will admit that even their most refined theories are mere approximations, though they are spectacularly accurate approximations. In the light of this idea about approximation, the following conception of science becomes possible. Science is a search through a vast space F that contains all possible theories. There is some ideal theory f* ∈ F, which correctly predicts the outcome of all experimental configurations. However, this ideal theory can never be obtained. Instead, scientists proceed towards f* through a process of iterative refinement. At every moment, the current champion theory f_C is the best known approximation to f*, and each time a champion theory is unseated in favor of a new candidate, the new f_C is a bit closer to f*. This view of science as a search for good approximations brings up another nonobvious component of the philosophical foundations of empirical science. If perfect truth cannot be obtained, why is it worth expending so much effort to obtain mere approximations? Wouldn't one expect that using an approximation might cause problems at crucial moments?
If the theory that explains an airplane's ability to remain aloft is only an approximation, why is anyone willing to board an airplane? The answer is, of course, that the approximation is good enough. The fact that perfection is unachievable does not and should not dissuade scientists from reaching toward it. A serious runner considers it deeply meaningful to attempt to run faster, though it is impossible for him to complete a mile in less than a minute. In the same way, scientists consider it worthwhile to search for increasingly accurate approximations, though perfect truth is unreachable.

1.1.7 Circularity Commitment and Reusability Hypothesis

Empirical scientists follow a unique conceptual cycle in their work that begins and ends in the same place. Mathematicians start from axioms and move on to theorems. Engineers start from basic components and assemble them into more sophisticated devices. An empirical scientist begins with an experiment or set of observations that produce measurements. She then contemplates the data and attempts to understand the hidden structure of the measurements. If she is smart and lucky, she might discover a theory of the phenomenon. To test the theory, she uses it to make predictions regarding the original phenomenon. In other words, the same phenomenon acts as both the starting point and the ultimate justification for a theory. This dedication to the single, isolated goal of describing a particular phenomenon is called the Circularity Commitment. The nonobviousness of the Circularity Commitment can be understood by considering the alternative. Imagine a scientific community in which theories are not justified by their ability to make empirical predictions, but by their practical utility. For example, a candidate theory of thermodynamics might be evaluated based on whether it can be used to construct combustion engines. If the engine works, the theory must be good.
This reasoning is actually quite plausible, but science does not work this way. No serious scientist would suggest that because the theory of relativity is not relevant to or useful for the construction of airplanes, it is not an important or worthwhile theory. Modern physicists develop theories regarding a wide range of esoteric topics such as quantum superfluidity and the entropy of black holes without concerning themselves with the practicality of those theories. Empirical scientists are thus very similar to mathematicians in the purist attitude they adopt regarding their work. In a prescientific age, a researcher expressing this kind of dedication to pure empirical inquiry, especially given the effort required to carry out such an inquiry, might be viewed as an eccentric crank or religious zealot. In modern times no such stigma exists, because everyone can see that empirical science is eminently practical. This leads to another deeply surprising idea, here called the Reusability Hypothesis: in spite of the fact that scientists are explicitly unconcerned with the utility of their theories, it just so happens that those theories tend to be extraordinarily useful. Of course, no one can know in advance which areas of empirical inquiry will prove to be technologically relevant. But the history of science demonstrates that new empirical theories often catalyze the development of amazing new technologies. Thus Maxwell's unified theory of electrodynamics led to a wide array of electronic devices, and Einstein's theory of relativity led to the atomic bomb. The fact that large sums of public money are spent on constructing ever-larger particle colliders is evidence that the Reusability Hypothesis is well understood even by government officials and policy makers. The Circularity Commitment and the Reusability Hypothesis complement each other naturally.
Society would never be willing to fund scientific research if it did not produce some tangible benefits. But if society explicitly required scientists to produce practical results, the scope of scientific investigation would be drastically reduced. Einstein would not have been able to justify his research into relativity, since that theory had few obvious applications at the time it was invented. The two philosophical ideas justify a fruitful division of labor. Scientists aim with intent concentration at a single target: the development of good empirical theories. They can then hand off their theories to the engineers, who often find the theories to be useful in the development of new technologies.

1.2 Sophie's Method

This section develops a refined version of the scientific method, in which large databases are used instead of experimental observations as the necessary empirical ingredient. The necessary modifications are fairly minor, so the revised version includes all of the same conceptual apparatus of the standard version. At the same time, the modification is significant enough to considerably expand the scope of empirical science. The refined version is developed through a series of thought experiments relating to a fictional character named Sophie.

1.2.1 The Shaman

Sophie is an assistant professor of physics at a large American state university. She finds this job vexing for several reasons, one of which is that she has been chosen by the department to teach a physics class intended for students majoring in the humanities, for whom it serves to fill a breadth requirement. The students in this class, who major in subjects like literature, religious studies, and philosophy, tend to be intelligent but also querulous and somewhat disdainful of the "merely technical" intellectual achievements of physics.
In the current semester she has become aware of the presence in her class of a discalced student with a large beard and often bloodshot eyes. This student is surrounded by an entourage of similarly strange-looking followers. Sophie is on good terms with some of the more serious students in the class, and in conversation with them has found out that the odd student is attempting to start a new naturalistic religious movement and refers to himself as a "shaman". One day while delivering a simple lecture on Newtonian mechanics, Sophie is surprised when the shaman raises his hand. When Sophie calls on him, he proceeds to claim that physics is a propagandistic hoax designed by the elites as a way to control the population. Sophie blinks several times, and then responds that physics can't be a hoax because it makes real-world predictions that can be verified by independent observers. The shaman counters by claiming that the so-called "predictions" made by physics are in fact trivialities, and that he can obtain better forecasts by communing with the spirit world. He then proceeds to challenge Sophie to a predictive duel, in which the two of them will make forecasts regarding the outcome of a simple experiment, the winner being decided based on the accuracy of the forecasts. Sophie is taken aback by this but, hoping that by proving the shaman wrong she can break the spell he has cast on some of the other students, agrees to the challenge. During the next class, Sophie sets up the following experiment. She uses a spring mechanism to launch a ball into the air at an angle θ. The launch mechanism allows her to set the initial velocity of the ball to a value of v_i. She chooses as a predictive test the problem of predicting the time t_f at which the ball will fall back to the ground after being launched at t_i = 0.
Using a trivial Newtonian calculation she concludes that t_f = 2 v_i sin(θ) / g, sets v_i and θ to give a value of t_f = 2 seconds, and announces her prediction to the class. She then asks the shaman for his prediction. The shaman declares that he must consult with the wind spirits, and then spends a couple of minutes chanting and muttering. Then, dramatically flaring open his eyes as if to signify a moment of revelation, he grabs a piece of paper, writes his prediction on it, and hands it to another student. Sophie suspects some kind of trick, but is too exasperated to investigate and so launches the ball into the air. The ball is equipped with an electronic timer that starts and stops when an impact is detected, and so the number registered in the timer is just the time of flight t_f. A student picks up the ball and reports that the result is t_f = 2.134. The shaman gives a gleeful laugh, and the student holding his written prediction hands it to Sophie. On the paper is written 1 < t_f < 30. The shaman declares victory: his prediction turned out to be correct, while Sophie's was incorrect (it was off by 0.134 seconds). To counter the shaman's claim, and because it was on the syllabus anyway, in the next class Sophie begins a discussion of probability theory. She goes over the basic ideas, and then connects them to the experimental prediction made about the ball. She points out that technically, the Newtonian prediction t_f = 2 is not an assertion about the exact value of the outcome. Rather it should be interpreted as the mean of a probability distribution describing possible outcomes. For example, one might use a normal distribution with mean μ = t_f = 2 and σ = 0.3. The reason the shaman superficially seemed to win the contest is that he gave a probability distribution while Sophie gave a point prediction; these two types of forecast are not really comparable.
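Sophie's "trivial Newtonian calculation" can be reproduced directly; the launch angle below is an illustrative choice, with v_i then solved to give t_f = 2 seconds:

```python
import math

# Time of flight for a projectile launched from the ground:
# t_f = 2 * v_i * sin(theta) / g

g = 9.8  # m/s^2

def time_of_flight(v_i, theta):
    return 2.0 * v_i * math.sin(theta) / g

theta = math.radians(45.0)                 # illustrative launch angle
v_i = g * 2.0 / (2.0 * math.sin(theta))    # solve t_f = 2 s for v_i

print(round(time_of_flight(v_i, theta), 6))  # 2.0
```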
In the light of probability theory, the reason to prefer the Newtonian prediction over the shamanic one is that it assigns a higher probability to the outcome that actually occurred. Now, plausibly, if only a single trial is used then the Newtonian theory might simply have gotten lucky, so the reasonable thing to do is to combine the results over many trials, by multiplying the probabilities together. Therefore, the formal justification for preferring the Newtonian theory to the shamanic theory is that:

∏_k P_newton(t_{f,k}) > ∏_k P_shaman(t_{f,k})

where the index k runs over many trials of the experiment. Sophie then shows how the Newtonian probability predictions are both more confident and more correct than the shamanic predictions. The Newtonian predictions assign a very large amount of probability to the region around the outcome t_f = 2, and in fact it turns out that almost all of the real data outcomes fall in this range. In contrast, the shamanic prediction assigns a relatively small amount of probability to the t_f = 2 region, because he has predicted a very wide interval (1 < t_f < 30). Thus while the shamanic prediction is correct, it is not very confident. The Newtonian prediction is correct and highly confident, and so it should be preferred. Sophie tries to emphasize that the Newtonian probability prediction P_newton only works well for the real data. Because of the requirement that probability distributions be normalized, the Newtonian theory can only achieve good results by reassigning probability towards the region around t_f = 2 and away from other regions. A theory that does not perform this kind of reassignment cannot achieve superior performance. Sophie recalls that some of the students are studying computer science and for their benefit points out the following.
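The inequality above can be checked with a quick numeric sketch, taking the Newtonian forecast to be Normal(μ = 2, σ = 0.3) as above and the shamanic forecast to be flat over his predicted interval 1 < t_f < 30 (a charitable reading of his note); the trial outcomes are illustrative:

```python
import math

# Compare the two forecasts as probability densities over many trials.

def normal_pdf(x, mu=2.0, sigma=0.3):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) \
        / (sigma * math.sqrt(2.0 * math.pi))

def shaman_pdf(x):
    # Flat density over the predicted interval (1, 30).
    return 1.0 / 29.0 if 1.0 < x < 30.0 else 0.0

outcomes = [2.134, 1.95, 2.07, 1.88, 2.21]  # times of flight, five trials

p_newton = math.prod(normal_pdf(t) for t in outcomes)
p_shaman = math.prod(shaman_pdf(t) for t in outcomes)

print(p_newton > p_shaman)  # True: the Newtonian forecast is preferred
```

Both forecasts are "correct" on every trial, but the Newtonian density concentrates its probability mass where the outcomes actually land, which is exactly what the product rewards.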
The famous Shannon equation L(x) = -log_2 P(x) governs the relationship between the probability of an outcome and the length of the optimal code that should be used to represent it. Therefore, given a large data file containing the results of many trials of the ballistic motion experiment, the two predictions (Newtonian and shamanic) can both be used to build specialized programs to compress the data file. Using the Shannon equation, the above inequality can be rewritten as follows:

∑_k L_newton(t_{f,k}) < ∑_k L_shaman(t_{f,k})

This inequality indicates an alternative criterion that can be used to decide between two rival theories. Given a data file recording measurements related to a phenomenon of interest, a scientific theory can be used to write a compression program that will shrink the file to a small size. To decide between two rival theories of the same phenomenon, one invokes the corresponding compressors on a shared benchmark data set, and prefers the theory that achieves a smaller encoded file size. This criterion is equivalent to the probability-based one, but has the advantage of being more concrete, since the quantities of interest are file lengths instead of probabilities.

1.2.2 The Dead Experimentalist

Sophie is a theoretical physicist and, upon taking up her position as assistant professor, began a collaboration with a brilliant experimental physicist who had been working at the university for some time. The experimentalist had previously completed the development of an advanced apparatus that allowed the investigation of an exotic new kind of quantum phenomenon. Using data obtained from the new system, Sophie made rapid progress in developing a mathematical theory of the phenomenon. Tragically, just before Sophie was able to complete her theory, the experimentalist was killed in a laboratory explosion that also destroyed the special apparatus.
After grieving for a while, Sophie decided that the best way to honor her friend's memory would be to bring the research they had been working on to a successful conclusion. Unfortunately, there is a critical problem with Sophie's plan. The experimental apparatus had been completely destroyed, and Sophie's late partner was the only person in the world who could have rebuilt it. He had run many trials of the system before his death, so Sophie had a quite large quantity of data. But she had no way of generating any new data. Thus, no matter how beautiful and perfect her theory might be, she had no way of testing it by making predictions. One day while thinking about the problem Sophie recalls the incident with the shaman. She remembers the point she had made for the benefit of the software engineers, about how a scientific theory could be used to compress a real-world data set to a very small size. Inspired, she decides to apply the data compression principle as a way of testing her theory. She immediately returns to her office and spends the next several weeks writing Matlab code, converting her theory into a compression algorithm. The resulting compressor is successful: it shrinks the corpus of experimental data from an initial size of 8.7·10^11 bits to an encoded size of 3.3·10^9 bits. Satisfied, Sophie writes up the theory and submits it to a well-known physics journal. The journal editors like the theory, but are a bit skeptical of the compression-based method for testing it. Sophie argues that if the theory becomes widely known, one of the other experts in the field will develop a similar apparatus, which can then be used to test the theory in the traditional way. She also offers to release the experimental data, so that other researchers can test their own theories using the same compression principle. Finally she promises to release the source code of her program, to allow external verification of the compression result.
These arguments finally convince the journal editors to accept the paper.

1.2.3 The Rival Theory

After all the mathematics, software development, prose revisions, and persuasion necessary to complete her theory and have the paper accepted, Sophie decides to reward herself by living the good life for a while. She is confident that her theory is essentially correct, and will eventually be recognized as correct by her colleagues. So she spends her time reading novels and hanging out in coffee shops with her friends. A couple of months later, however, she receives an unpleasant shock in the form of an email from a colleague which is phrased in consolatory language, but does not contain any clue as to why such language might be in order. After some investigation she finds out that a new paper has been published about the same quantum phenomenon of interest to Sophie. The paper proposes an alternative theory of the phenomenon which bears no resemblance whatever to Sophie's. Furthermore, the paper reports a better compression rate than was achieved by Sophie, on the database that she released. Sophie reads the new paper and quickly realizes that it is worthless. The theory depends on the introduction of a large number of additional parameters, the values of which must be obtained from the data itself. In fact, a substantial portion of the paper involves a description of a statistical algorithm that estimates optimal parameter values from the data. In spite of these aesthetic flaws, she finds that many of her colleagues are quite taken with the new paper and some consider it to be "the next big thing". Sophie sends a message to the journal editors describing in detail what she sees as the many flaws of the upstart paper. The editors express sympathy, but point out that the new theory outperforms Sophie's theory using the performance metric she herself proposed. The beauty of a theory is important, but its correctness is ultimately more important.
Somewhat discouraged, Sophie sends a polite email to the authors of the new paper, congratulating them on their result and asking to see their source code. Their response, which arrives a week later, contains a vague excuse about how the source code is not properly documented and relies on proprietary third-party libraries. Annoyed, Sophie contacts the journal editors again and asks them for the program they used to verify the compression result. They reply with a link to a binary version of the program. When Sophie clicks on the link to download the program, she is annoyed to find it has a size of 800 megabytes. But her annoyance is quickly transformed into enlightenment, as she realizes what happened, and that her previous philosophy contained a serious flaw. The upstart theory is not better than hers; it has only succeeded in reducing the size of the encoded data by dramatically increasing the size of the compressor. Indeed, when dealing with specialized compressors, the distinction between "program" and "encoded data" becomes almost irrelevant. The critical number is not the size of the compressed file, but the net size of the encoded data plus the compressor itself. Sophie writes a response to the new paper which describes the refined compression rate principle. She begins the paper by reiterating the unfortunate circumstances which forced her to appeal to the principle, and expressing the hope that someday an experimental group will rebuild the apparatus developed by her late partner, so that the experimental predictions made by the two theories can be properly tested. Until that day arrives, standard scientific practice does not permit a decisive declaration of theoretical success. But surely there is some theoretical statement that can be made in the meantime, given the large quantity of data that is available.
Sophie's proposal is that the goal should be to find the theory that has the highest probability of predicting a new data set, when it can finally be obtained. If the theories are very simple in comparison to the data being modeled, then the size of the encoded data file is a good way of choosing the best theory. But if the theories are complex, then there is a risk of overfitting the data. To guard against overfitting, complex theories must be penalized; a simple way to do this is to take into account the codelength required for the compressor itself. The length of Sophie's compressor was negligible, so the net score of her theory is just the codelength of the encoded data file: 3.3 · 10^9 bits. The rival theory achieved a smaller size of 2.1 · 10^9 bits for the encoded data file, but required a compressor of 6.7 · 10^9 bits to do so, giving a total score of 8.8 · 10^9 bits. Since Sophie's net score is lower, her theory should be preferred.

1.3 Compression Rate Method

In the course of the thought experiments discussed above, the protagonist Sophie articulated a refined version of the scientific method. This procedure will be called the Compression Rate Method (CRM). The web of concepts related to the CRM will be called the comperical philosophy of science, for reasons that will become evident in the next section. The CRM consists of the following steps:

1. Obtain a vast database T relating to a phenomenon of interest.
2. Let f_C be the initial champion theory.
3. Through observation and analysis, develop a new theory f_N, which may be either a simple refinement of f_C or something radically new.
4. Instantiate f_N as a compression program. If this cannot be done, then the theory is not scientific.
5. Score the theory by calculating L(T | f_N) + H[f_N], the sum of the length of the encoded version of T and the length of the compressor.
6. If L(T | f_N) + H[f_N] < L(T | f_C) + H[f_C], then discard the old champion and set f_C = f_N.
Otherwise discard f_N.
7. Return to step #3.

It is worthwhile to compare the CRM to the version of the scientific method given in Section 1.1.6. One improvement is that in this version the Occam's Razor principle plays an explicit role, through the influence of the H term. A solution to the Problem of Demarcation is also built into the process in Step #4. The main difference is that the empirical ingredient in the CRM is a large database, while the traditional method employs experimental observation. The significance of the CRM can be seen by understanding the relationship between the target database T and the resulting theories. If T contains data related to the outcomes of physical experiments, then physical theories will be necessary to compress it. If T contains information related to interest rates, house prices, global trade flows, and so on, then economic theories will be necessary to compress it. One obvious choice for T is simply an enormous image database, such as the one hosted by the Facebook social networking site. In order to compress such a database one must develop theories of visual reality. The idea that there can be an empirical science of visual reality has never before been articulated, and is one of the central ideas of this book. A key argument, contained in Chapter 3, is that the research resulting from the application of the CRM to a large database of natural images will produce a field very similar to modern computer vision. Similarly, Chapter 4 argues that the application of the CRM to a large text corpus will result in a field very similar to computational linguistics. Furthermore, the reformulated versions of these fields will have far stronger philosophical foundations, due to the explicit connection between the CRM and the traditional scientific method. It is crucial to emphasize the deep connection between compression and prediction.
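The CRM loop is small enough to sketch in code. The sketch below is purely illustrative, not a real implementation: a "theory" is reduced to the pair of codelengths it produces, and the numbers are the ones from Sophie's thought experiment.

```python
# Illustrative sketch of the Compression Rate Method (CRM).
# A "theory" f is represented abstractly by the two codelengths it
# produces: L(T | f), the size of the encoded database, and H[f],
# the size of the compressor itself.  All names are hypothetical.

def net_score(encoded_bits, compressor_bits):
    """Net CRM score: L(T | f) + H[f], in bits."""
    return encoded_bits + compressor_bits

def update_champion(champion, challenger):
    """One CRM iteration: keep whichever theory has the smaller net score."""
    return challenger if net_score(*challenger) < net_score(*champion) else champion

# Sophie's theory: negligible compressor, 3.3e9-bit encoded file.
sophie = (3.3e9, 0.0)
# The rival theory: 2.1e9-bit encoded file, but a 6.7e9-bit compressor.
rival = (2.1e9, 6.7e9)

# The rival's net score is 8.8e9 bits, so Sophie remains champion.
champion = update_champion(sophie, rival)
```

Representing a theory by its codelengths alone elides steps 3 and 4 of the method, of course; the point of the sketch is only the comparison rule in step 6.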
The real goal of the CRM is to evaluate the predictive power of a theory, and the compression rate is just a way of quantifying that power. There are three advantages to using the compression rate instead of some measure of predictive accuracy. First, the compression rate naturally accommodates a model complexity penalty term. Second, the compression rate of a large database is an objective quantity, due to the ideas of Kolmogorov complexity and universal computation, discussed below. Third, the compression principle provides an important verificational benefit. To verify a claim made by an advocate of a new theory, a referee only needs to check the encoded file size, and ensure that the resulting decoded data matches exactly the original database T. Most people express skepticism as their first reaction to the plan of research embodied by the CRM. They generally admit that it may be possible to use the method to obtain increasingly short codes for the target databases. But they balk at accepting the idea that the method will produce anything else of value. The following sections argue that the philosophical commitments implied by the CRM are exactly analogous to those long accepted by scientists working in mainstream fields of empirical science. Comperical science is nonobvious in the year 2011 for exactly the same kinds of reasons that empirical science was nonobvious in the year 1511.

Figure 1.1: Histograms of differences between values of neighboring pixels in a natural image (left) and a random image (right). The clustering of the pixel difference values around 0 in the natural image is what allows compression formats like PNG to achieve compression. Note the larger scale of the image on the left; both histograms represent the same number of pixels.

1.3.1 Data Compression is Empirical Science

The following theorem is well known in data compression.
Let C be a program that losslessly compresses bit strings s, assigning each string a new code with length L_C(s). Let U_N(s) be the uniform distribution over N-bit strings. Then the following bound holds for all compression programs C:

E_{s ∼ U_N}[L_C(s)] ≥ N    (1.0)

In words, the theorem states that no lossless compression program can achieve average codelengths smaller than N bits, when averaged over all possible N-bit input strings. Below, this statement is referred to as the "No Free Lunch" (NFL) theorem of data compression, as it implies that one can achieve compression for some strings s only at the price of inflating other strings. At first glance, this theorem appears to turn the CRM proposal into nonsense. In fact, the theorem is the keystone of the comperical philosophy, because it shows how lossless, large-scale compression research must be essentially empirical in character. To see this point, consider the following apparent paradox. In spite of the NFL theorem, lossless image compression programs exist and have been in widespread use for years. As an example, the well-known Portable Network Graphics (PNG) compression algorithm seems to reliably produce encoded files that are 40-50% shorter than would be achieved by a uniform encoding. This apparent success seems to violate the No Free Lunch theorem. The paradox is resolved by noticing that the images used to evaluate image compression algorithms are not drawn from a uniform distribution U_N(s) over images. If lossless image formats were evaluated based on their ability to compress random images, no such format could ever be judged successful. Instead, the images used in the evaluation process belong to a very special subset of all possible images: those that arise as a result of everyday human photography. This "real world" image subset, though vast in absolute terms, is minuscule compared to the space of all possible images.
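This resolution of the paradox can be checked informally with any generic compressor. The sketch below, using the zlib module from the Python standard library, shows that highly structured data compresses dramatically, while uniformly random data does not compress at all (the container overhead actually inflates it slightly).

```python
# Informal illustration of the No Free Lunch intuition: a generic
# compressor shortens structured input but cannot shorten random input.
import os
import zlib

structured = b"abab" * 25_000       # a very regular 100,000-byte string
random_data = os.urandom(100_000)   # 100,000 bytes drawn uniformly at random

structured_size = len(zlib.compress(structured))
random_size = len(zlib.compress(random_data))

# The structured string shrinks by orders of magnitude; the random
# bytes come out at least as large as they went in.
```

The compressor has not violated the theorem: it has simply bet, correctly, that its input belongs to the tiny structured subset.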
So PNG is able to compress a certain image subset, while inflating all other images. And the subset that PNG is able to compress happens to overlap substantially with the real world image subset. The specific empirical regularity used by the PNG format is that in real world images, adjacent pixels tend to have very similar values. A compressor can exploit this property by encoding the differences between neighboring pixel values instead of the values themselves. The distribution of differences is very narrowly clustered around zero, so they can be encoded using shorter average codes (see Figure 1.1). Of course, this trick does not work for random images, in which there is no correlation between adjacent pixels. The NFL theorem indicates that in order to succeed, a comperical researcher must follow a strategy analogous to the procedure of physics. First, she must attempt to discover some structures or patterns present in real world images. Then she must develop a mathematical theory characterizing that structure, and build the theory into a compressor. Finally, she must demonstrate that the theory corresponds to reality, by showing that it achieves an improved compression rate. To make statements about the world, physicists need to combine mathematical and empirical reasoning; neither alone is sufficient. Consider the following statement of physics: when a ball is tossed into the air, its vertical position will be described by the equation y(t) = (1/2)g t^2 + v_0 t + y_0. That statement can be decomposed into a mathematical and an empirical component. The mathematical statement is: if a quantity's evolution in time is governed by the differential equation d^2y/dt^2 = k, where k is some constant, then its value is given by the function y(t) = (1/2)k t^2 + v_0 t + y_0, where v_0 and y_0 are determined by the initial conditions.
The empirical statement is: if a ball is thrown into the air, its vertical position will be governed by the differential equation d^2y/dt^2 = g, where g is the acceleration due to gravity. By combining these statements together, the physicist is able to make a variety of predictions. Just like physicists, comperical researchers must combine mathematical statements with empirical statements in order to make predictions. Because of the NFL theorem, pure mathematics is never sufficient to reach conclusions of the form: "Algorithm Y achieves good compression." Mathematical reasoning can only be used to make implications: "If the images exhibit property X, then algorithm Y will achieve good compression." In order to actually achieve compression, it is necessary to demonstrate the empirical fact that the images actually have property X. This shows why the comperical proposal is not fundamentally about saving disk space or bandwidth; it is fundamentally about characterizing the properties of images or other types of data.

1.3.2 Comparison to Popperian Philosophy

The comperical philosophy of science bears a strong family resemblance to the Popperian one, and inherits many of its conceptual advantages. First, the compression principle provides a clear answer to the Problem of Demarcation: a theory is scientific if and only if it can be used to build a compressor for an appropriate kind of database. Because of the intrinsic difficulty of lossless data compression, the only way to save bits is to explicitly reassign probability away from some outcomes and toward other outcomes. If the theory assigns very low probability to an outcome which then occurs, this suggests that the theory has low quality and should be discarded. Thus, the probability reassignment requirement is just a graduated or continuous version of the falsification requirement.
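The probability reassignment point can be made concrete. Under an ideal code, an outcome of probability p costs -log2(p) bits. The toy example below (the outcomes and numbers are illustrative) shows that a theory which shifts probability toward one outcome shortens its code only by lengthening the code of the outcome it judges unlikely:

```python
# Ideal codelength: an outcome x with probability P(x) costs
# -log2 P(x) bits.  Reassigning probability toward likely outcomes
# buys shorter codes for them, at the unavoidable cost of longer
# codes for the outcomes judged unlikely.
from math import log2

uniform = {"heads": 0.5, "tails": 0.5}
theory  = {"heads": 0.9, "tails": 0.1}   # a theory that favors heads

def codelength(model, outcome):
    """Bits needed to encode an outcome under an ideal code for the model."""
    return -log2(model[outcome])

# codelength(theory, "heads") ~ 0.15 bits, down from 1 bit;
# codelength(theory, "tails") ~ 3.32 bits, up from 1 bit.
```

If "tails" then occurs, the theory pays a measurable penalty in bits, which is exactly the graduated form of falsification described above.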
The falsifiability principle means that a researcher hoping to prove the value of his new theory must risk embarrassment if his predictions turn out to be incorrect. The compression principle requires a researcher to face the potential for embarrassment if his new theory ends up inflating the database. One difference between the Popperian view and the comperical view is that the former appears to justify stark binary assessments regarding the truth or falsehood of a theory, while the latter provides only a number which can be compared to other numbers. If theories are either true or false, then the compression principle is no more useful than the falsifiability principle. But if theories can exist on some middle ground between absolute truth and its opposite, then it makes sense to claim that one theory is relatively more true than another, even if both are imperfect. The compression principle can be used to justify such claims. Falsifiability consigns all imperfect theories to the same garbage bin; compression can be used to rescue the valuable theories from the bin, dust them off, and establish them as legitimate science. The falsifiability idea seems to imply that theories can be evaluated in isolation: a theory is either true or false, and this assessment does not depend on the content of rival theories. In contrast, while the compression idea assigns a score to an individual theory, this score is useful only for the purpose of comparison. This distinction may be conceptually significant to some people, but in practice it is unimportant. Science is a search for good approximations; science proceeds by incrementally improving the quality of the approximations. The power of the falsifiability requirement is that it enables a rapid search through the theory-space by ensuring that theories can be decisively compared. The compression requirement provides exactly the same benefit.
When a researcher proposes a new theory and shows that it can achieve a smaller compressed file size for the target database, this provides decisive evidence that the new theory is superior. Furthermore, both principles allow a research community to identify a champion theory. In the Popperian view, the champion theory is the one that has withstood all attempts at falsification. In the comperical view, the champion theory is the one that achieves the smallest codelength on the relevant benchmark database. One of the core elements of Popper's philosophy is the dedication to the continual testing, examination, and skepticism of scientific theories. A Popperian scientist is never content with the state of his knowledge. He never claims that a theory is true; he only accepts that there is currently no evidence that would falsify it. The comperical philosopher takes an entirely analogous stance. To her, a theory is never true or even optimal; it is only the best theory that has thus far been discovered. She will never claim, "the probability of event X is 35%". Instead, she would state that "according to the current champion theory, the probability of event X is 35%". She might even make decisions based on this probability assignment. But if a new theory arrives that provides a better codelength, she immediately replaces her probability estimates and updates her decision policy based on the new theory. The Popperian commitment to continual examination and criticism of theoretical knowledge is good discipline, but the radical skepticism it promotes is probably a bit too extreme. A strict Popperian would be unwilling to use Newtonian physics once it was falsified, in spite of the fact that it obviously still works for most problems of practical interest. The compression principle promotes a more nuanced view.
If a claim is made that a theory provides a good description of a certain phenomenon, and the claim is justified by demonstrating a strong compression result, then the claim is valid for all time. It is possible to develop a new theory that achieves a better compression rate, or to show that the previous theory does not do as well on another related database. These circumstances might suggest that the old theory should no longer be used. But if the old theory provided a good description of a particular database, no future developments will change that fact. This captures the intuition that Newtonian physics still provides a perfectly adequate description of a wide range of phenomena; Eddington's solar eclipse photographs simply showed that there are some phenomena to which it does not apply.

1.3.3 Circularity and Reusability in the Context of Data Compression

Just as empirical scientists do, comperical researchers adopt the Circularity Commitment to guide and focus their efforts. A comperical researcher evaluates a new theory based on one and only one criterion: its ability to compress the database for which it was developed. A community using a large collection of face images will be highly interested in various tools such as hair models, eyeglass detectors, and theories of lip color, and only secondarily interested in potential applications of face modeling technology. If the researchers chose to introduce some additional considerations to the theory comparison process, such as the relevance of a theory to a certain type of practical task, they would compromise their own ability to discard low quality theories and identify high quality ones. Some truly purist thinkers may consider large scale data compression to be an intrinsically interesting goal. Comperical researchers will face many challenging problems, involving mathematics, algorithm design, statistical inference, and knowledge representation.
Furthermore, researchers will receive a clear signal indicating when they have made progress, and how much. For a certain type of intellectual, these considerations are very significant, even if there is no reason to believe that the investigation will yield any practical results. In this light, it is worth comparing the proposed field of large scale lossless data compression with the established field of computer chess. Chess is an abstract symbolic game with very little connection to the real world. A computer chess advocate would find it quite difficult to convince a skeptical audience that constructing powerful chess programs would yield any tangible benefit. However, like the compression goal, the computer chess goal is attractive because it produces a variety of subproblems, and also provides a method for making decisive comparisons between rival solutions. For these reasons, computer scientists devoted a significant amount of effort to the field, leading some to claim that chess was "the drosophila of AI research". Furthermore, these efforts were incredibly successful, and led to the historic defeat of the top-ranked human grandmaster, Garry Kasparov, by IBM's Deep Blue in 1997. Most scientists would agree that this event was an important advance for human knowledge, even if it did not lead to any practical applications. Because of its similar methodological advantages, comperical research has a similar potential to advance human knowledge. For the reader who is unmoved by the argument about the intrinsic interest of compression science, it is essential to defend the validity of the Reusability Hypothesis in the context of data compression. The hypothesis really contains two separate pieces. First, theories employ abstractions, and good theories use abstractions that correspond to reality. So the abstraction called "mass" is not just a clever computational trick, but represents a fundamental aspect of reality.
These real abstractions are useful both for compression and for practical applications. The second piece of the Reusability Hypothesis is that, while theories based on naïve or simplistic characterizations of reality can achieve compression, the best codelengths will be achieved by theories that use real abstractions. So by vigorously pursuing the compression goal, researchers can identify the real abstractions governing a particular phenomenon, and those abstractions can be reused for practical applications. The following examples illustrate the idea of the Reusability Hypothesis. Consider constructing a target database by setting up a video camera next to a highway and recording the resulting image stream. One way to predict image frames (and thus compress the data) would be to identify batches of pixels corresponding to a car, and use an estimate of the car's velocity to interpolate the pixels forward. A compressor that uses this trick thus implicitly contains abstractions related to the concepts of "car" and "velocity". Since these are real abstractions, the Reusability Hypothesis states that the specialized compressor should achieve better compression rates than a more generic one. Another good example of this idea relates to text compression. Here, the Reusability Hypothesis states that a specialized compressor making use of abstractions such as verb conjugation patterns, parts of speech, and rules of grammar will perform better than a generic compressor. If the hypothesis is true, then the same division of labor between scientists and engineers that works for mainstream fields will work here as well. The comperical scientists obtain various abstractions by following the compression principle, and hand them off to the engineers, who will find them very useful for developing applications like automatic license plate readers and machine translation systems.
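A minimal version of the text-compression claim can be simulated with a very crude "abstraction" standing in for grammar: the letter frequencies of English. The sentence and alphabet below are illustrative, and a real CRM evaluation would also charge for the model's own codelength; the point is only that even weak knowledge of the data's structure shortens the total code relative to a structure-blind uniform code.

```python
# Toy version of the Reusability claim for text: a model that knows
# even a crude fact about English (its letter frequencies) assigns a
# shorter total ideal codelength than a uniform code over the alphabet.
from collections import Counter
from math import log2

text = "the cat sat on the mat and the dog sat on the log"
alphabet = sorted(set(text))

# Generic model: uniform distribution over the alphabet.
uniform_bits = len(text) * log2(len(alphabet))

# "Specialized" model: empirical letter frequencies of the text.
counts = Counter(text)
freq_bits = sum(-log2(counts[c] / len(text)) for c in text)

# freq_bits < uniform_bits, because the letter distribution is skewed.
```

A model using richer abstractions (word boundaries, parts of speech, grammar) would shorten the code further, which is exactly the hypothesis stated above.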
1.3.4 The Invisible Summit

An important concept related to the Compression Rate Method is called the Kolmogorov complexity. The Kolmogorov complexity K_A(s) of a string s is the length of the shortest program that will output s when run on a Turing machine A. The key property of the Kolmogorov complexity comes about as a consequence of the idea of universal computation. If a Turing machine (roughly equivalent to a programming language) is of sufficient complexity, it becomes universal: it can simulate any other Turing machine, if given the right simulator program. So given a string s and a short program P_A that outputs it when run on Turing machine A, one can easily obtain a program P_B that outputs s when run on (universal) Turing machine B, just by prepending a simulator program S_AB to P_A, so that |P_B| = |S_AB| + |P_A|. Now, the simulator program is fixed by the definition of the two Turing machines. Thus for very long and complex strings, the contribution of the simulator to the total program length becomes insignificant, so that |P_B| ≈ |P_A|, and thus the Kolmogorov complexity is effectively independent of the choice of Turing machine. Unfortunately or not, a brief proof shows that the Kolmogorov complexity is incomputable: a program attempting to compute K(s) cannot be guaranteed to terminate in finite time. This is not surprising, since if a method for computing the Kolmogorov complexity were found, it would be immensely powerful. Such a program would render theoretical physicists unnecessary. Experimental physicists could simply compile a large database of observations, and feed the database to the program. Since the optimal theory of physics provides the best explanation, and thus the shortest encoding, of the data, the program would automatically find the optimal theory of physics on its way to finding the Kolmogorov complexity.
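The simulator argument above is the standard invariance theorem, which can be stated compactly in the notation of this section:

```latex
% Prepending a fixed simulator S_{AB} turns any A-program into a
% B-program, and symmetrically with a simulator S_{BA}:
K_B(s) \le K_A(s) + |S_{AB}|,
\qquad
K_A(s) \le K_B(s) + |S_{BA}|,
% so the two complexities differ by at most a constant that depends
% only on the machines A and B, not on the string s:
\bigl| K_A(s) - K_B(s) \bigr| \le \max\bigl(|S_{AB}|, |S_{BA}|\bigr) = c_{AB}.
```

For strings much longer than c_{AB}, the choice of machine is therefore immaterial, which is the sense in which the complexity is objective.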
Another way of seeing the impossibility of finding K(s) is to imagine what it would mean to find the Kolmogorov complexity of the Facebook image database. To compress this database to the smallest possible size, one would have to know P*(I): the probability distribution generating the Facebook images. While P*(I) may look innocuous, in fact it is a mathematical object of vast complexity, containing an innumerable quantity of details. To begin with, it must contain a highly sophisticated model of the human face. It must contain knowledge of hair styles and facial expressions. It must capture the fact that lips are usually reddish in color, and that women are more likely to enhance this color using lipstick. Moving on from there, it would require knowledge about other things people like to photograph, such as pets, natural scenery, weddings, and boisterous parties. It would need to contain details about the appearance of babies, such as the fact that a baby usually has a pink face, and its head is large in proportion to the rest of its body. All this knowledge is necessary because, for example, P*(I) must assign higher probability, and shorter codelength, to an image featuring a woman with red lips than to an image that is identical in every way except that the woman has green lips. While calculating K(s) is impossible in general, one can find upper bounds for it. Indeed, the Compression Rate Method is just the process of finding a sequence of increasingly tight upper bounds on the Kolmogorov complexity of the target database. Each new champion theory corresponds to a tighter upper bound. In the case of images, a new champion theory corresponds to a new model P_C(I) of the probability of an image. Every iteration of theory refinement packages more realistic information into the model P_C(I), thereby bringing it closer to the unknowable P*(I).
This process is exactly analogous to the search through the theory space carried out by empirical scientists. Both empirical scientists and comperical scientists recognize that their theories are mere approximations. The fact that perfect truth cannot be obtained simply does not matter: it is still worthwhile to climb towards the invisible summit.

1.3.5 Objective Statistics

Due to the direct relationship between statistical modeling and data compression (see Appendix A), comperical research can be regarded as a subfield of statistics. A traditional problem in statistics starts with a set of N observations {x_1, x_2, ..., x_N} of some quantity, such as the physical height of a population. By analyzing the data set, the statistician attempts to obtain a good estimate P(x) of the probability of a given height. This model could be, for example, a Gaussian distribution with a given mean and variance. Comperical research involves an entirely analogous process. The difference is that instead of simple single-dimensional numbers, comperical statisticians analyze complex data objects such as images or sentences, and attempt to find good models of the probability of such objects. All statistical inference must face a deep conceptual issue that has been the subject of acrimonious debate and philosophical speculation since the time of David Hume, who first identified it. This is the Problem of Induction: when is it justified to jump from a limited set of specific observations (the data samples) to a universal rule describing the observations (the model)? This problem has divided statisticians into two camps, the Bayesians and the frequentists, who disagree fundamentally about the meaning and justification of statistical inference.
A full analysis of the nature of this disagreement would require its own book, but a very rough summary is that, while the Bayesian approach has a number of conceptual benefits, it is hobbled by its dependence on the use of prior distributions. A Bayesian performs inference by using Bayes' rule to update a prior distribution in response to evidence, thus producing a posterior distribution, which can be used for decision-making and other purposes. The critical problem is that there is no objective way to choose a prior. Furthermore, two Bayesians who start with different priors will reach different conclusions, in spite of observing the same evidence. The use of Bayesian techniques to justify scientific conclusions therefore deprives science of objectivity. Any data compressor must implement a mapping from data sets T to bit strings of length L(T). This mapping defines an implicit probability distribution P(T) = 2^(-L(T)). It appears, therefore, that comperical statisticians make the same commitment to the use of prior distributions as the Bayesians do. However, there is a crucial subtlety here. Because the length of the compressor itself is taken into account in the CRM, the prior distribution is actually defined by the choice of programming language used to write the compressor. Furthermore, comperical researchers use their models to describe vast datasets. Combined, these two facts imply that comperical statistical inference is objective. This idea is illustrated by the following thought experiment. Imagine a research subfield which has established a database T as its target for CRM-style investigation. The subfield makes slow but steady progress for several years. Then, out of the blue, an unemployed autodidact from a rural village in India appears with a bold new theory. He claims that his theory, instantiated in a program P_A, achieves a compression rate which is dramatically superior to the current best published results.
However, among his other eccentricities, this gentleman uses a programming language he himself developed, which corresponds to a Turing machine A. Now, the other researchers of the field are well-meaning but skeptical, since all the previously published results used a standard language corresponding to a Turing machine B. But it is easy for the Indian maverick to produce a compressor that will run on B: he simply appends P_A to a simulator program S_AB that simulates A when run on B. The length of the new compressor is |P_B| = |P_A| + |S_AB|, and all of the other researchers can confirm this. Now, assuming the data set T is large and complex enough that |P_A| ≫ |S_AB|, the codelength of the modified version is effectively the same as the original: |P_B| ≈ |P_A|. This shows that there can be no fundamental disagreement among comperical researchers regarding the quality of a new result.

1.4 Example Inquiries

This section makes the abstract discussion above tangible by describing several concrete proposals. These proposals begin with a method of constructing a target database, which defines a line of inquiry. In principle, researchers can use any large database that is not completely random as a starting point for a comperical investigation. In practice, unless some care is exercised in the construction of the target dataset, it will be difficult to make progress. In the beginning stages of research, it will be more productive to look at data sources which display relatively limited amounts of variation. Here are some example inquiries that might provide good starting points:

• Attempt to compress the immense image database hosted by the popular Facebook social networking web site. One obvious property of these images is that they contain many faces. To compress them well, it will be necessary to develop a computational understanding of the appearance of faces.
• Construct a target database by packaging together digital recordings of songs, concerts, symphonies, operas, and other pieces of music. This kind of inquiry will lead to theories of the structure of music, which must describe harmony, melody, pitch, rhythm, and the relationships between these variables in different musical cultures. It must also contain models of the sounds produced by different instruments, as well as the human singing voice.

• Build a target database by recording from microphones positioned in treetops. A major source of variation in the resulting data will be bird vocalizations. To compress the data well, it will be necessary to differentiate between bird songs and bird calls, to develop tools that can identify species-characteristic vocalizations, and to build maps showing the typical ranges of various species. In other words, this type of inquiry will be a computational version of the traditional study of bird vocalization carried out by ornithologists.

• Generate a huge database of economic data showing changes in home prices, interest and exchange rate fluctuations, business inventories, welfare and unemployment applications, and so on. To compress this database well, it will be necessary to develop economic theories that are capable of predicting, for example, the effect that changes in interest rates have on home purchases.

Since the above examples involve empirical inquiry into various aspects of reality, any reader who believes in the intrinsic value of science should regard them as at least potentially interesting. Skeptical readers, on the other hand, may doubt the applicability of the Reusability Hypothesis here, and so view an attempt to compress these databases as an eccentric philosophical quest. The following examples are more detailed, and give explicit analysis of what kinds of theories (or computational tools) will be needed, and how those theories will be more widely useful.
An important point, common to all of the investigations, is that a single target database can be used to develop and evaluate a large number of methods.

It should be clear that, if successful, these example inquiries should lead to practical applications. The study of music may help composers to write better music, allow listeners to find new music that suits their taste, and assist music publishing companies in determining the quality of a new piece. The investigation of bird vocalization, if successful, should be useful to environmentalists and bird-watchers who might want to monitor the migration and population fluctuations of various avian species. The study of economic data is more speculative, but if successful should be of obvious interest to policy makers and investors. In the case of the roadside video data described below, the result will be sophisticated visual systems that can be used in robotic cars. Also mentioned below is an inquiry into the structure of English text, which should prove useful for speech recognition as well as for machine translation.

1.4.1 Roadside Video Camera

Consider constructing a target database by setting up a video camera next to a highway, and recording video streams of the passing cars. Since the camera does not move, and there is usually not much activity on the sides of highways, the main source of variation in the resulting video will be the automobiles. Therefore, in order to compress the video stream well, it will be necessary to obtain a good computational understanding of the appearance of automobiles.

A simple first step would be to take advantage of the fact that cars are rigid bodies subject to Newtonian laws of physics. The position and velocity of a car must be continuous functions of time. Given a series of images at timesteps {t_0, t_1, t_2, . . .
t_n} it is possible to predict the image at timestep t_{n+1} simply by isolating the moving pixels in the series (these correspond to the car), and interpolating those pixels forward into the new image, using basic rules of camera geometry and calculus. Since neither the background nor the moving pixel blob changes much between frames, it should be possible to achieve a good compression rate using this simple trick.

Further improvements can be achieved by detecting and exploiting patterns in the blob of moving pixels. One observation is that the wheels of a moving car have a simple characteristic appearance: a dark outer ring corresponding to the tire, along with the off-white circle of the hubcap at the center. Because of this characteristic pattern, it should be straightforward to build a wheel detector using standard techniques of supervised learning. One could then save bits by representing the wheel pixels using a specialized model, akin to a graphics program, which draws a wheel of a given size and position. Since it takes fewer bits to encode the size and position parameters than to encode the raw pixels of the wheel, this trick should save codelength.

Further progress could be achieved by conducting a study of the characteristic appearance of the surfaces of cars. Since most cars are painted in a single color, it should be possible to develop a specialized algorithm to identify the frame of the car. Another graphics program could be used to draw the frame of the car, using a variety of parameters related to its shape. Extra attention would be required to handle the complex reflective appearance of the windshield, but the same general idea would apply. Note that the encoder always has the option of “backing off”; if attempts to apply more aggressive encoding methods fail (e.g., if the car is painted in multiple colors), then the simpler pixel-blob encoding method can be used instead.
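The claimed saving from the parametric wheel model can be made concrete with a back-of-the-envelope calculation. The patch size and parameter precision below are invented assumptions, not figures from the text:

```python
# Codelength for a wheel region, encoded two ways (all numbers assumed).
bits_raw = 40 * 40 * 8    # a 40x40 grayscale patch at 8 bits per pixel
bits_param = 3 * 16       # center x, center y, and radius at 16 bits each
print(bits_raw, bits_param)
```

The comparison ignores the one-time cost of the wheel-drawing model itself, which must also be counted in the compressor's total codelength.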
Additional progress could be achieved by recognizing that most automobiles can be categorized into a discrete set of categories (e.g., a 2009 Toyota Corolla). Since these categories have standardized dimensions, bits could be saved by encoding the category of a car instead of information related to its shape. Initially, the process of building category-specific modules for the appearance of a car might be difficult and time-consuming. But once one has developed modules for the Hyundai Sonata, Chevrolet Equinox, Honda Civic, and Nissan Altima, it should not require much additional work to construct a module for the Toyota Sienna. Indeed, it may be possible to develop a learning algorithm that, through some sort of clustering process, would automatically extract, from large quantities of roadside video data, appearance modules for the various car categories.

1.4.2 English Text Corpus

Books and other written materials constitute another interesting source of target data for comperical inquiry. Here one simply obtains a large quantity of text, and attempts to compress it. One tool that will be very useful for the compression of English text is an English dictionary. To see this, consider the following sentence:

John went to the liquor store and bought a bottle of ____.

Assume that the word in the blank space has N letters, and the compressor encodes this information separately. A naïve compressor would require log(26^N) = N log 26 bits to encode the word, since there are 26^N ways to form an N-letter word. A compressor equipped with a dictionary can do much better. First it looks up all the words of length N, and then it encodes the index of the actual word in this list. This costs log(W_N), where W_N is the number of words of length N in the dictionary. Since most combinations of letters such as “yttu” and “qwhg” are not real words, W_N < 26^N and bits are saved.
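The dictionary argument can be sketched numerically. The word count W_N below is an invented assumption; the point is only that log(W_N) is far smaller than N log 26:

```python
import math

N = 5                              # length of the hidden word
naive_bits = N * math.log2(26)     # log(26^N) = N log 26, about 23.5 bits
W_N = 8000                         # assumed number of 5-letter dictionary words
dict_bits = math.log2(W_N)         # index into the length-5 word list, about 13 bits
print(naive_bits, dict_bits)
```

The smarter compressors described next shrink the candidate list further (nouns only, liquids only), so their codelength log(W) drops correspondingly.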
By making the compressor smart, it is possible to do even better. A smart compressor should know that the word “of” is usually followed by a noun. So instead of looking up all the N-letter words, the compressor could restrict the search to only nouns. This cuts down the number of possibilities even further, saving more bits. An even smarter compressor would know that in the phrase “bottle of X”, the word X is usually a liquid. If it had an enhanced dictionary which contained information about various properties of nouns, it could restrict the search to N-letter nouns that represent liquids. Even better results could be obtained by noticing that the bottle is purchased at a liquor store, and so probably represents some kind of alcohol. This trick would require that the enhanced dictionary contain annotations indicating that words such as “wine”, “beer”, and “vodka” are types of alcoholic beverages. It may be possible to do even better by analyzing the surrounding text. The word list may be narrowed even further if the text indicates that John is fond of brandy, or that his wife is using a recipe that calls for vodka. Of course, these more advanced schemes are far beyond the current state of the art in natural language processing, but they indicate the wide array of techniques that can in theory be brought to bear on the problem.

1.4.3 Visual Manhattan Project

Consider constructing a target database by mounting video cameras on the dashboards of a number of New York City taxi cabs, and recording the resulting video streams. Owing to the vivid visual environment of New York City, such a database would exhibit an immense amount of complexity and variation. Several aspects of that complexity could then be analyzed and studied in depth. One interesting source of variation in the video would come from the pedestrians.
To achieve good compression rates for the pixels representing pedestrians, it would be necessary to develop theories describing the appearance of New Yorkers. These theories would need to include details about clothing, ethnicity, facial appearance, hair style, walking style, and the relationships between these variables. A truly sophisticated theory of pedestrians would need to take into account time and place: it is quite likely to observe a suited investment banker in the financial district on a weekday afternoon, but quite unlikely to observe such a person in the Bronx in the middle of the night.

Another source of variation would come from the buildings and storefronts of the city. A first step towards achieving a good compression rate for these pixels would be to construct a three-dimensional model of the city. Such a model could be used not only to determine the location from which an image frame was taken, but also to predict the next frame in the sequence. For example, the model could be used to predict that, if a picture is taken at the corner of 34th Street and Fifth Avenue, the Empire State Building will feature very prominently. Notice that a naïve representation of the 3D model will require a large number of bits to specify, and so even more savings can be achieved by compressing the model itself. This can be done by analyzing the appearance of typical building surfaces such as brick, concrete, and glass. This type of research might find common ground with the field of architecture, and lead to productive interdisciplinary investigations.

A third source of variation would come from the other cars. Analyzing this source of variation would lead to an investigation very similar to the roadside video camera inquiry mentioned above. Indeed, if the roadside video researchers are successful, it should be possible for the taxi cab video researchers to reuse many of their results.
In this way, researchers can proceed in a virtuous circle, where each new advance facilitates the next line of study.

1.5 Sampling and Simulation

Sampling is a technique whereby one uses a statistical model to generate a data set that is “typical” of it. For example, imagine one knows that the distribution of heights in a certain population is a Gaussian with a mean of 175 cm and a standard deviation of 10 cm. Then by sampling from a Gaussian distribution with these parameters, one obtains a set of numbers that are similar to what might be observed if some actual measurements were done. Most of the data would cluster in the 165–185 cm range, and it would be extremely rare to observe a sample larger than 205 cm.

The idea of sampling suggests a useful technique for determining the quality of a statistical model: one samples from the model, and compares the sample data to the real data. If the sample data looks nothing like the real data, then there is a flaw in the model. In the case of one-dimensional numerical data this trick is not very useful. But if the data is complex and high-dimensional, and humans have a good understanding of its real structure, the technique can be quite powerful.
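The height example can be reproduced in a few lines; this is a minimal sketch using Python's standard library:

```python
import random

# Sample 1000 "typical" heights from the model described above:
# a Gaussian with mean 175 cm and standard deviation 10 cm.
random.seed(0)
heights = [random.gauss(175, 10) for _ in range(1000)]

# Roughly 68% of the samples should fall within one standard deviation,
# and samples above 205 cm (three standard deviations) should be very rare.
in_range = sum(165 <= h <= 185 for h in heights) / len(heights)
tall = sum(h > 205 for h in heights)
print(in_range, tall)
```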
As an example of this, consider the following two batches of pseudo-words:

a abangivesery ad allars ambed amyorsagichou an and anendouathin anth ar as at ate atompasey averean cath ce d dea dr e ed eeaind eld enerd ens er evedof fod fre g gand gho gisponeshe greastoreta har has haspy he heico ho ig iginse ill ilyo in ind io is ite iter itwat ju k le lene lilollind lliche llkee ly mang me mee mpichmm n nd nder ng ngobou nif nl noved o ond onghe oounin oreengst otaserethe oua ptrathe r rd re reed reroved sern sinttlof suikngmm t tato tcho te th the toungsshes ver wit y ythe

a ally anctyough and andsaid anot as aslatay astect be beeany been bott bout but camed chave comuperain deas dooked even y fel filear firgut for fromed gat gin give givesed got ha hard he hef her heree hilpte hoce hof ierty imber in it jor like lo lome lost mader mare mise moread od of om ome onertelf our out over owd pass put qu rown says seectusier seeked she shim so soomereand sse such tail the thingse tite to tor tre tro uf ughe umily upeeperlyses upoid was wat we were wers whith wird wirt with wor

These words were created by sampling from two different models P(α_i | α_{i−1}, . . . , α_1) of the conditional probability of a letter given a history of preceding letters. The variable α_i stands for the i-th letter of the word. To produce a word, one obtains the first letter by sampling from the unconditional distribution P(α_1). Then one samples from P(α_2 | α_1) to produce the second letter, and so on. A special word-ending character is added to the alphabet, and when this character is drawn, the word is complete. The two models were both constructed using a large corpus of English text. The first model is a simplistic bigram model, where the probability of a letter depends only on the immediately preceding letter.
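A minimal version of the bigram sampling procedure can be sketched as follows. The tiny training list here is a stand-in for the large corpus used in the book; it exists only to make the sketch runnable:

```python
import random
from collections import defaultdict

corpus = ["the", "then", "they", "this", "that", "than", "there"]
END = "$"    # special word-ending character added to the alphabet
START = "^"  # start marker, so P(alpha_1) is the row counts[START]

# Count letter-bigram transitions, including transitions into END.
counts = defaultdict(lambda: defaultdict(int))
for word in corpus:
    prev = START
    for ch in word + END:
        counts[prev][ch] += 1
        prev = ch

def sample_word(rng):
    """Draw letters from P(alpha_i | alpha_{i-1}) until END is drawn."""
    prev, letters_out = START, []
    while True:
        letters, weights = zip(*counts[prev].items())
        ch = rng.choices(letters, weights=weights)[0]
        if ch == END:
            return "".join(letters_out)
        letters_out.append(ch)
        prev = ch

rng = random.Random(0)
samples = [sample_word(rng) for _ in range(5)]
print(samples)
```

With this toy corpus every sampled word begins with “t”, so the output is far less varied than the batches above; the structure of the procedure is the same.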
The second model is an enhanced version of the bigram model, which uses a refined statistical characterization of English words that incorporates, for example, the fact that it is very unlikely for a word to have no vowel. Most people will agree that the words from the second set are more similar to real English words (indeed, several of them are real words). This perceptual assessment justifies the conclusion that the second model is in some sense superior to the first model. Happily, it turns out that the second model also achieves a better compression rate than the first model, so the qualitative similarity principle agrees with the quantitative compression principle.

While the second model is better than the first, it still contains imperfections. One such imperfection relates to the word “sse”. The double-s pattern is common in English words, but it is never used to begin a word. It should be possible to achieve improved compression rates by correcting this deficiency in the model.

All compressors implicitly contain a statistical model, and it is easy to sample from this model. To do so one simply generates a random bit string and feeds it into the decoder. Unless the decoder is trivially suboptimal, it will map any string of bits to a legitimate outcome in the original data space. This perspective provides a nice interpretation of what compression means. An ideal encoder maps real data to perfectly random bit strings, and the corresponding decoder maps random bit strings to real data.

1.5.1 Veridical Simulation Principle of Science

Modern video games often attempt to illustrate scenes involving complex physical processes, such as explosions, light reflections, or collisions between nonrigid bodies (e.g. football players). In order to make these scenes look realistic, video game developers need to include “physics engines” in their games. A physics engine is a program that simulates various processes using the laws of physics.
If the physics used in the simulators did not correspond to real physics, the scenes would look unrealistic: the colliding players would fall too slowly, or the surface of a lake would not produce an appropriate reflection. This implies that there is a connection between scientific theories and veridical simulation.

Can this principle be generalized? Suspend disbelief for a moment and imagine that, perhaps as a result of patronage from an advanced alien race, humans had obtained computers before the development of physics. Then scientists could conduct a search for a good theory of mechanics using the following method. First, they would write down a new candidate theory. Then they would build a simulator based on the theory, and use the simulator to generate various scenes, such as athletes jumping, rocks colliding in mid-air, and water spurting from fountains. The new theory would be accepted and the old champion discarded if the former produced more realistic simulations than the latter.

As a more plausible example, consider using the simulation principle to guide an inquiry into the rules of grammar and linguistics. Here the researchers write down candidate theories of linguistics, and use the corresponding simulator to generate sentences. A new theory is accepted if the sentences it generates are more realistic and natural than those produced by the previous champion theory. This is actually very similar to Chomsky’s formulation of the goal of generative grammar; see Chapter 4 for further discussion.

This notion of science appears to meet many of the requirements of empirical science discussed previously in the chapter. It provides a solution to the Problem of Demarcation: a theory is scientific if it can be used to build a simulation program for a particular phenomenon. It gives scientists a way to make decisive theory comparisons, allowing them to search efficiently through the space of theories.
It involves a kind of Circularity Commitment: one develops theories of a certain phenomenon in order to be able to construct convincing simulations of the same phenomenon. Sophie could plausibly have answered the shaman’s critique of physics by demonstrating that a simulator based on Newtonian mechanics produces more realistic image sequences than one based on shamanic revelation.

In comparison to the compression principle, the veridical simulation principle has one obvious disadvantage: theory comparisons depend on qualitative human perception. If the human observers have no special ability to judge the authenticity of a particular simulation, the theory comparisons will become noisy and muddled. The method may work for things like basic physics, text, speech, and natural images, because humans have intimate knowledge of these things. But it probably will not work for phenomena which humans do not encounter in their everyday lives.

The advantage of the simulation principle compared to the compression principle is that it provides an indication of where and in what way a model fails to capture reality. The word sampling example above showed how the model failed to capture the fact that real English words do not start with a double-s. If a model of visual reality were used to generate images, the unrealistic aspects of the resulting images would indicate the shortcomings of the model. For example, if a certain model does not handle shadows correctly, this will become obvious when it produces an image of a tree that casts no shade. The compression principle does not provide this kind of indication. For this reason, the simulation principle can be thought of as a natural complement to the compression principle, which researchers can use to find out where to look for further progress.

Another interesting aspect of the veridical simulation principle is that it can be used to define a challenge similar to the Turing Test.
In this challenge, researchers attempt to build simulators that can produce samples that are veridical enough to fool humans into thinking they are real. The outcome of the contest is determined by showing a human judge two data objects, one real and one simulated. The designers of the system win if the human is unable to tell which object is real.

To see the difficulty and interest of this challenge, consider using videos obtained in the course of the Visual Manhattan Project inquiry of Section 1.4.3 as the real-world component. The statistical model of the video data would then need to produce samples that are indistinguishable from real footage of the streets of New York City. The model would thus need to contain all kinds of information and detail relating to the visual environment of the city, such as the layout and architecture of the buildings, and the fashion sense and walking style of the pedestrians. This is, of course, exactly the kind of information needed to compress the video data. This observation provides further support for the intuitive notion that while the simulation principle and the compression principle are not identical, they are at least strongly aligned.

It will require an enormous level of sophistication to win the simulation game, especially if the judges are long-term inhabitants of New York. A true New Yorker would be able to spot very minor deviations from veridicality, related to things like the color of the sidewalk carts used by the pretzel and hot dog vendors, or to subtle changes in the style of clothing worn by denizens of different parts of the city. A true New Yorker might also be able to spot a fake video if it failed to include an appropriate degree of strangeness. New York is no normal place, and a real video stream will reflect that by showing celebrities, business executives, beggars, transvestites, fashion models, inebriated artists, and so on.
In spite of this difficulty, the alignment between the compression and simulation principles suggests that there is a simple way to make systematic progress: get more and more video data, and improve the compression rate.

1.6 Comparison to Physics

Physics is the exemplar of empirical science, and many other fields attempt to imitate it. Some researchers have deplored the influence of so-called “physics envy” on fields like computer vision and artificial intelligence [12]. This book argues that there is nothing wrong with imitating physics. Instead, the problem is that previous researchers failed to understand the essential character of physics, and instead copied its superficial appearance. The superficial appearance of physics is its use of sophisticated mathematics; the essential character of physics is its obsession with reality. A physicist uses mathematics for one and only one reason: it is useful in describing empirical reality. Just as physicists do, comperical researchers adopt as their fundamental goal the search for simple and accurate descriptions of reality. They will use mathematics, but only to the extent that it is useful in achieving that goal.

Another key similarity between physics and comperical science involves the justification of research questions. Some skeptics may accept that CRM research is legitimate science, but believe that it will be confined to a narrow set of technical topics. After all, the CRM defines only one problem: large scale lossless data compression. But notice that physics also defines only one basic problem: given a particular physical configuration, predict its future evolution. Because there is a vast number of possible configurations of matter and energy, this single question is enormously productive, justifying research into such diverse topics as black holes, superconductivity, quantum dots, Bose-Einstein condensates, the Casimir effect, and so on.
Analogously, the single question of comperical science justifies a wide range of research, due to the enormous diversity of empirical regularities that can be found in databases of natural images, text, speech, music, etc. The fact that a single question provides a parsimonious justification for a wide range of research is actually a key advantage of the philosophy.

Both physics and comperical science require candidate theories to be tested against empirical observation using hard, quantitative evaluation methods. However, there is an important difference in the way the theory comparisons work. Physical theories are very specific. In physics, any new theory must agree with the current champion in a large number of cases, since the current champion has presumably been validated on many configurations. To adjudicate a theory contest, researchers must find a particular configuration in which the two theories make opposing predictions, and then run the appropriate experiment. In comperical science, the predictions made by the champion theory are neither correct nor incorrect; they are merely good. To unseat the champion theory, it is sufficient for a rival theory to make better predictions on average.

Chapter 2

Compression and Learning

2.1 Machine Learning

Humans have the ability to develop amazing skills relating to a very broad array of activities. However, almost without exception, this competence is not innate, and is achieved only as a result of extended learning. The field of machine learning takes this observation as its starting point. The goal of the field is to develop algorithms that improve their performance over time by adapting their behavior based on the data they observe. The field of machine learning appears to have achieved significant progress in recent years.
Researchers produce a steady stream of new learning systems that can recognize objects, analyze facial expressions, translate documents from one language to another, or understand speech. In spite of this stream of new results, learning systems still have frustrating limitations. Automatic translation systems often produce gibberish, and speech recognition systems often cause more annoyance than satisfaction. One particularly glaring illustration of the limits of machine learning came from a “racist” camera system that was supposed to detect faces, but worked only for white faces, failing to detect black ones [106]. The gap between the enormous ambitions of the field and its present limitations indicates that there is some mountainous conceptual barrier impeding progress. Two views can be articulated regarding the nature of this barrier.

According to the first view, the barrier is primarily technical in nature. Machine learning is on a promising trajectory that will ultimately allow it to achieve its long-sought goal. The field is asking the right questions; success will be achieved by improving the answers to those questions. The limited capabilities of current learning systems reflect limitations or inadequacies of modern theory and algorithms. While the modern mathematical theory of learning is advanced, it is not yet advanced enough. In time, new algorithms will be found that are far more powerful than current algorithms such as AdaBoost and the Support Vector Machine [37, 118]. The steady stream of new theoretical results and improved algorithms will eventually yield a sort of grand unified theory of learning, which will in turn guide the development of truly intelligent machines.

In the second view, the barrier is primarily philosophical in nature. In this view, progress in machine learning is tending toward a sort of asymptotic limit.
The modern theory of learning provides a comprehensive answer to the problem of learning as it is currently formulated. Algorithms solve the problems for which they are designed nearly as well as is theoretically possible. The demonstration of an algorithm that provides an improved convergence rate or a tighter generalization bound may be interesting from an intellectual perspective, and may provide slightly better performance on the standard problems. But no such incremental advance will produce true intelligence. To achieve intelligence, machine learning systems must make a discontinuous leap to an entirely new level of performance. The current mindset is analogous to that of researchers in the 1700s who attempted to expedite ground transportation by breeding faster horses, when they should actually have been searching for a qualitatively different mode of transportation. The problem, then, is in the philosophical foundations of the field, in the types of questions considered by its practitioners and their philosophical mindset. If this view is true, then to make further progress in machine learning, it is necessary to formulate the problem of learning in a new way. This chapter presents arguments in favor of the second view.

2.1.1 Standard Formulation of Supervised Learning

There are two primary modes of statistical learning: the supervised mode and the unsupervised mode. The present discussion will focus primarily on the former; the latter is discussed in Appendix B. The supervised version can be understood by considering a typical example of what it can do. Imagine one wanted to build a face detection system capable of determining if a digital photo contains an image of a face. To use a supervised learning method, the researcher must first construct a labeled dataset, which is made up of two parts. The first part is a set of N images X = {x_1, x_2, . . . , x_N}. The second part is a set of binary labels Y = {y_1, y_2, . . . ,
y_N}, which indicate whether or not a face is present in each image. Once this database has been built, the researcher invokes the learning algorithm, which attempts to obtain a predictive rule h(·) such that h(x) = y. Below, this procedure is referred to as the “canonical” form of the supervised learning problem. Many applications can be formulated in this way, as shown in the following list:

• Document classification: the x_i data are the documents, and the y_i data are category labels such as “sports”, “finance”, “political”, etc.

• Object recognition: the x_i data are images, and the y_i data are object categories such as “chair”, “tree”, “car”, etc.

• Electoral prediction: each x_i is a package of information relating to current political and economic conditions, and the y_i is a binary label which is true if the incumbent wins.

• Marital satisfaction: each x_i is a package of vital statistics relating to a particular marriage (frequency of sex, frequency of argument, religious involvement, education levels, etc.), and the corresponding y_i is a binary label which is true if the marriage ends in divorce.

• Stock market prediction: each x_i is a set of economic indicators such as interest rates, exchange rates, and stock prices for a given day; the y_i is the change in value of a particular stock on the next day.

2.1.2 Simplified Description of Learning Algorithms

For readers with no background in machine learning, the following highly simplified description should provide a basic understanding of the core ideas. One starts with a system S that performs some task, and a method for evaluating the performance S provides on the task. Let this evaluation function be denoted as E, and E[S] be the performance of the system S.
In terms of the canonical task mentioned above, the system is the predictive rule h(·), and the evaluation function is just the squared difference between the predictions and the real data:

E[h] = Σ_i (h(x_i) − y_i)²

A key property of the system is that it be mutable. If a system S is mutable, then a small perturbation will produce a new system S′ that behaves in nearly the same way as S. This mutability requirement prevents one from defining S to be, for example, the code of a computer program, since a slight random change to a program will usually break it completely. To construct systems that can withstand these minor mutations without suffering catastrophic failures, researchers often construct the system by introducing a set of numerical parameters θ. If the behavior of S(θ) changes smoothly with changes in θ, then small changes to the system can be made by making small changes to θ. There are, of course, other ways to construct mutable systems. Given a mutable system and an evaluation function, the following procedure can be used to search for a high-performance system:

1. Begin by setting S = S_0, where S_0 is some default setup (which can be naïve).

2. Introduce a small change to S, producing S′.

3. If S′ performs better than S (for the squared-error evaluation above, E[S′] < E[S]), keep the change by setting S = S′. Otherwise, discard the modified version.

4. Return to step #2.

Many machine learning algorithms can be understood as refined versions of the above process. For example, the backpropagation algorithm for the multilayer perceptron uses the chain rule of calculus to find the derivative of E(S(θ)) with respect to θ [98]. Many reinforcement learning algorithms work by making smart changes to a policy that depends on the parameters θ [111]. Genetic algorithms, which are inspired by the idea of natural selection, also roughly follow the process outlined above.
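The four-step loop above can be sketched on a toy problem: fitting a one-parameter rule h(x) = θx by random perturbation. The data and step size are invented for illustration:

```python
import random

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # made-up data, roughly y = 2x

def E(theta):
    """Squared-error evaluation function; lower is better."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys))

rng = random.Random(0)
theta = 0.0                                  # step 1: default setup S_0
for _ in range(2000):
    candidate = theta + rng.gauss(0, 0.1)    # step 2: small perturbation S'
    if E(candidate) < E(theta):              # step 3: keep only improvements
        theta = candidate                    # (then step 4: repeat)
print(round(theta, 2))
```

Backpropagation and reinforcement learning replace the blind perturbation in step 2 with directed updates, but the evaluate-and-keep skeleton is the same.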
2.1.3 Generalization View of Learning

Machine learning researchers have developed two conceptual perspectives by which to approach the canonical task. The first and more popular perspective is called the Generalization View. Here the goal is to obtain, on the basis of the limited N-sample data set, a model or predictive rule that works well for new, previously unseen data samples. The Generalization View is attractive for obvious practical purposes: in the case of the face detection task, for example, the model resulting from a successful learning process can be used in a system which requires the ability to detect faces in previously unobserved images (e.g. a surveillance application).

The key challenge of the Generalization View is that the real distribution generating the data is unknown. Instead, one has access to the empirical distribution defined by the observed data samples. In the early days of machine learning research, many practitioners thought that the sufficient condition for a model to achieve good performance on the real distribution was that it achieved good empirical performance: that it performed well on the observed data set. However, they often found that their models would perform very well on the observed data, but fail completely when applied to new samples. There were a variety of reasons for this failure, but the main cause was the phenomenon of overfitting. Overfitting occurs when a researcher applies a complex model to solve a problem with a small number of data samples. Figure 2.1 illustrates the problem of overfitting.

Figure 2.1: Illustration of the idea of model complexity and overfitting. In the limited-data regime depicted on the left, the line model should be preferred to the curve model, because it is simpler. In the large-data regime, however, the polynomial model can be justified.
Intuitively, it is easy to see that when there are only five data points, the complex curve model should not be used, since it will probably fail to generalize to any new points. The linear model will probably not describe new points exactly, but it is less likely to be wildly wrong. While intuition favors the line model, it is not immediately obvious how to formalize that intuition: after all, the curve model achieves better empirical performance (it goes through all the points).

The great conceptual achievement of statistical learning is the development of methods by which to overcome overfitting. These methods have been formulated in many different ways, but all articulations share a common theme: to avoid overfitting, one must penalize complex models. Instead of choosing a model solely on the basis of its empirical performance, one must optimize a tradeoff between the empirical performance and the model complexity. In terms of Figure 2.1, the curve model achieves excellent empirical performance, but only because it is highly complex. In contrast, the line model achieves a good balance of performance and simplicity. For that reason, the line model should be preferred in the limited-data regime.

In order to apply the complexity-penalty strategy, the key technical requirement is a method for quantifying the complexity of a model. Once a suitable expression for a model's complexity is obtained, some further derivations yield a type of expression called a generalization bound. A generalization bound is a statement of the following form: if the empirical performance of the model is good, and the model is not too complex, then with high probability its real performance will be only slightly worse. The caveat "with high probability" can never be done away with, because there is always some chance that the empirical data is simply a bizarre or unlucky sample of the real distribution.
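The tradeoff shown in Figure 2.1 can be reproduced numerically. The sketch below is a toy illustration, not an experiment from the book: the data-generating line, noise level, and polynomial degree are arbitrary assumptions. A line and a degree-4 polynomial are both fit to five noisy points; the polynomial achieves near-zero empirical error, but typically much worse error on fresh samples from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # The "real distribution": y = 2x + 1 plus Gaussian noise.
    x = rng.uniform(-1, 1, n)
    return x, 2 * x + 1 + rng.normal(0, 0.2, n)

x_train, y_train = sample(5)      # the limited-data regime of Figure 2.1
x_test, y_test = sample(1000)     # large fresh sample, standing in for the real distribution

results = {}
for degree in (1, 4):             # degree 1 = line model, degree 4 = curve model
    coeffs = np.polyfit(x_train, y_train, degree)
    emp = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    real = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    results[degree] = (emp, real)
    print(f"degree {degree}: empirical error {emp:.4f}, real error {real:.4f}")
```

The degree-4 polynomial interpolates the five training points exactly, so its empirical error is essentially zero regardless of how badly it behaves elsewhere.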
One might conclude with very high confidence that a coin is biased after observing 1000 heads in a row, but one could never be completely sure.

While most treatments of model complexity and generalization bounds require sophisticated mathematics, the following simple theorem can illustrate the basic ideas. The theorem can be stated in terms of the notation used for the canonical task of supervised learning mentioned above. Let C be a set of hypotheses or rules that take a raw data object x as an argument and output a prediction h(x) of its label y. In terms of the face detection problem, x would be an image, and y would be a binary flag indicating whether the image contains a face. Assume it is possible to find a hypothesis h* ∈ C that agrees with all the observed data:

h*(x_i) = y_i,  i = 1 … N

Now select some ε and δ such that the following inequality holds:

N ≥ (1/ε) log(|C|/δ)

Then with probability 1 − δ, the error rate of the hypothesis will be at most ε when measured against the real distribution. Abstractly, the theorem says that if the hypothesis class is not too large compared to the number of data samples, and some element achieves good empirical performance, then with high probability its performance on the real (full) distribution will be not too much worse. The following informal proof of the theorem may illuminate the core concept.

To understand the theorem, imagine you are searching through a barrel of apples (the hypotheses), looking for a good one. Most of the apples are "wormy": they have a high error rate on the real distribution. The goal is to find a ripe, tasty apple: one that has a low error rate on the real distribution. Fortunately, most of the wormy apples can be discarded because they are visibly old and rotten, meaning they make errors on the observed data. The problem is that there might be a "hidden worm" apple that looks tasty (it performs perfectly on the observed data) but is in fact wormy.
Define a wormy apple as one that has real error rate larger than ε. Now ask the question: if an apple is wormy, what is the probability that it looks tasty? It is easy to find an upper bound for this probability:

P(hidden worm) ≤ (1 − ε)^N

This is because, if the apple is wormy, the probability of not making a mistake on one sample is ≤ (1 − ε), so the probability of not making a single mistake on N samples is ≤ (1 − ε)^N. Now the question is: what is the probability that there are no hidden worms in the entire hypothesis class? Let HW_k(ε) be the event that the kth apple is a hidden worm. Then the probability that there are no hidden worms in the hypothesis class is:

P(no hidden worms) = P(¬[HW_1(ε) ∨ HW_2(ε) ∨ HW_3(ε) …])
                   = 1 − P(HW_1(ε) ∨ HW_2(ε) ∨ HW_3(ε) …)
                   ≥ 1 − Σ_k P(HW_k(ε))
                   = 1 − |C| P(HW(ε))
                   ≥ 1 − |C| (1 − ε)^N

The first step is true because P(¬A) = 1 − P(A), the second step is true because P(A ∨ B) ≤ P(A) + P(B), the third step is true because there are |C| hypotheses, and the final step substitutes the bound P(hidden worm) ≤ (1 − ε)^N derived above. The result then follows by letting δ = |C|(1 − ε)^N, noting that log(1 − ε) ≈ −ε, and rearranging terms.

A crucial point about the proof is that it makes no guarantee whatever that a good hypothesis (a tasty, worm-free apple) will actually appear. The proof merely says that, if the model class is small and the other values are reasonably chosen, then it is unlikely for a hidden-worm hypothesis to appear. If the probability of a hidden worm is low, and by chance a shiny apple is found, then it is probable that the shiny apple is actually worm-free.

A far more sophisticated development of the ideas of model complexity and generalization is due to the Russian mathematician Vladimir Vapnik [116]*. In Vapnik's formulation the goal is to minimize the real (generalization) risk R, which can be the error rate or some other function.
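The bound |C|(1 − ε)^N can be sanity-checked with a small Monte Carlo simulation. The sketch below uses made-up parameters (ε = 0.2, N = 20, |C| = 30): every hypothesis in the class is assumed wormy, and the simulation estimates how often at least one of them looks tasty on all N observed samples. The empirical frequency should fall below the union bound from the proof.

```python
import random

random.seed(1)
eps, N, C_size, trials = 0.2, 20, 30, 2000

def looks_tasty():
    # A wormy hypothesis errs on each sample independently with probability eps;
    # it is a hidden worm if it makes no error on any of the N observed samples.
    return all(random.random() > eps for _ in range(N))

hidden = sum(any(looks_tasty() for _ in range(C_size)) for _ in range(trials))
frequency = hidden / trials
bound = C_size * (1 - eps) ** N      # the union bound |C|(1 - eps)^N
print(frequency, bound)
```

With these parameters the bound is loose but informative: the simulated hidden-worm frequency stays below it, as the proof guarantees.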
Vapnik derived a sophisticated model complexity term called the VC dimension, and used it to prove several generalization bounds. A typical bound is:

R(h_i) ≤ R_emp(h_i) + [(log |C′| − log δ)/N] · (1 + √(1 + 2N·R_emp(h_i)/(log |C′| − log δ)))

where R(h_i) is the real risk of hypothesis h_i and R_emp(h_i) is the empirical risk, calculated from the observed data. The bound, which holds for all hypotheses simultaneously, indicates the conditions under which the real risk will not exceed the empirical risk by too much. As above, the bound holds with probability 1 − δ, and N is the number of data samples. The term log |C′| is the logarithm of an effective hypothesis class size derived from the VC dimension, and plays a role conceptually similar to the simple log |C| term in the previous theorem. Vapnik's complex inequality shows the same basic idea as the simple theorem above: the real performance will be good if the empirical performance is good and the log size of the hypothesis class is small in comparison with the number of data samples. Proofs of theorems in VC theory also use a similar strategy: show that if the model class is small, it is unlikely to include a "hidden worm" hypothesis which has low empirical risk but high real risk. Also, none of the VC theory bounds guarantee that a good hypothesis (low R_emp(h_i)) will actually be found.

The problem of overfitting is easily understood in the light of these generalization theorems. A naïve approach to learning attempts to minimize the empirical risk without reference to the complexity of the model. The theorems show that a low empirical risk, by itself, does not guarantee low real risk. If the model complexity terms log |C| and log |C′| are large compared to the number of samples N, then the bounds become too loose to be meaningful. In other words, even if the empirical risk is reduced to a very small quantity, the real risk may still be large.
The intuition here is that because such a large number of hypotheses was tested, the fact that one of them performs well on the empirical data is meaningless. If the hypothesis class is very large, then some hypotheses can be expected to perform well merely by chance.

The above discussion seems to indicate that complexity penalties actually apply to model classes, not to individual models. There is an important subtlety here. In both of the generalization theorems mentioned above, all elements of the model class were treated equally, and the penalty depended only on the size of the class. However, it is also reasonable to apply different penalties to different elements of a class. Say the class C contains two subclasses C_a and C_b. Then if |C_b| > |C_a|, hypotheses drawn from C_b must receive a larger penalty, and therefore require relatively better empirical performance in order to be selected. For example, in terms of Figure 2.1, one could easily construct an aggregate class that includes both lines and polynomials. Then the polynomials would receive a larger penalty, because there are more of them.

While more complex models must receive larger penalties, they are never prohibited outright. In some cases it may very well be worthwhile to use a complex model, if the model is justified by a large amount of data and achieves good empirical performance. This concept is illustrated in Figure 2.1: when there are hundreds of points that all fall on the complex curve, it is entirely reasonable to prefer the curve to the line model. The generalization bounds also express this idea, by allowing log |C| or log |C′| to be large if N is also large.

2.1.4 Compression View

The second perspective on the learning problem can be called the Compression View. The goal here is to compress a data set to the smallest possible size.
This view is founded upon the insight, drawn from information theory, that compressing a data set to the smallest possible size requires the best possible model of it. The difficulty of learning comes from the fact that the bit cost of the model used to encode the data must itself be accounted for. In the statistics and machine learning literature, this idea is known as the Minimum Description Length (MDL) principle [120, 95].

The motivation for the MDL idea can best be seen by contrasting it with the Maximum Likelihood Principle, one of the foundational ideas of statistical inference. Both principles apply to the problem of how to choose the best model M* out of a class 𝓜 to use to describe a given data set D. For example, the model class 𝓜 could be the set of all Gaussian distributions, so that an element M would be a single Gaussian, defined by a mean and variance. The Maximum Likelihood Principle suggests choosing M* so as to maximize the likelihood of the data given the model:

M* = argmax_{M ∈ 𝓜} P(D | M)

This principle is simple and effective in many cases, but it can lead to overfitting. To see how, imagine a data set made up of 100 numbers {x_1, x_2, … x_100}. Let the class 𝓜 be the set of Gaussian mixture models. A Gaussian mixture model is just a weighted sum of normal distributions with different means and variances. Now, one simple model for the data could be built by finding the mean and variance of the x_i data and using a single Gaussian with the given parameters. A much more complex model can be built by taking a sum of 100 Gaussians, each with mean equal to some x_i and near-zero variance. Obviously, this "comb" model is worthless: it has simply overfit the data and will fail badly when a new data sample is introduced. But it produces a higher likelihood than the single Gaussian model, and so the Maximum Likelihood Principle suggests it should be selected. This indicates that the principle contains a flaw.
The Minimum Description Length principle approaches the problem by imagining the following scenario. A sender wishes to transmit a data set {x_1, x_2, … x_100} to a receiver. The two parties have agreed in advance on the model class 𝓜. To do the transmission, the sender chooses some model M* ∈ 𝓜 and sends enough information to specify M* to the receiver. The sender then encodes the x_i data using a code based on M*. The best choice for M* minimizes the net codelength required:

M* = argmin_{M ∈ 𝓜} [L(M) + L(D | M)] = argmin_{M ∈ 𝓜} [L(M) − log P(D | M)]

where L(M) is the bit cost of specifying M to the receiver, and L(D | M) = − log P(D | M) is the cost of encoding the data given the model. If it were not for the L(M) term, the MDL principle would be exactly the same as the Maximum Likelihood Principle, since maximizing P(D | M) is the same as minimizing − log P(D | M). The L(M) term penalizes complex models, which allows users of the MDL principle to avoid overfitting the data. In the example mentioned above, the Gaussian mixture model with 100 components would be strongly penalized, since the sender would need to transmit a mean/variance parameter pair for each component.

The MDL principle can be applied to the canonical task by imagining the following scenario. A sender has the image database X and the label database Y, and wishes to transmit the latter to a receiver. A crucial and somewhat counterintuitive point is that the receiver already has the image database X. Because both parties have the image database, if the sender can discover a simple relationship between the images and the labels, he can exploit that relationship to save bits. If a rule can be found that accurately predicts y_i given x_i, that is to say if a good model P(Y | X) can be obtained, then the label data can be encoded using a short code.
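The comb-model flaw, and the way the L(M) term repairs it, can be illustrated numerically. The sketch below is a toy calculation under stated assumptions: 32 bits per transmitted parameter, deterministic stand-in data, and comb components with standard deviation 0.01. The comb wins on likelihood alone, but loses once its 200-parameter description cost is charged.

```python
import math

# Toy data: 100 values spread over roughly [-1, 1] (deterministic stand-ins).
xs = [math.sin(7.0 * i) for i in range(100)]

def gauss_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# Model 1: single Gaussian fit by maximum likelihood (2 parameters).
mu = sum(xs) / len(xs)
sigma = math.sqrt(sum((x - mu)**2 for x in xs) / len(xs))
ll_single = sum(gauss_logpdf(x, mu, sigma) for x in xs)

# Model 2: "comb" mixture, one narrow component per data point (200 parameters).
ll_comb = sum(
    math.log(sum(math.exp(gauss_logpdf(x, c, 0.01)) for c in xs) / len(xs))
    for x in xs
)

BITS_PER_PARAM = 32                      # assumed parameter encoding cost

def mdl_bits(loglik, n_params):
    # Two-part code: L(M) + L(D|M), with L(D|M) = -log2 P(D|M).
    return n_params * BITS_PER_PARAM - loglik / math.log(2)

print(ll_comb > ll_single)                               # comb wins on likelihood
print(mdl_bits(ll_single, 2) < mdl_bits(ll_comb, 200))   # single Gaussian wins on MDL
```

The 6400-bit cost of specifying 200 comb parameters swamps whatever encoding savings the narrow components provide, so MDL correctly rejects the overfit model.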
However, in order for the receiver to be able to perform the decoding, the sender must encode and transmit information about how to build the model. More complex models increase the total number of bits that must be sent. The best solution, therefore, comes from optimizing a tradeoff between empirical performance and model complexity.

2.1.5 Equivalence of Views

The Compression View and the Generalization View adopt very different approaches to the learning problem. Profoundly, however, when the two different goals are formulated quantitatively, the resulting optimization problems are quite similar. In both cases, the essence of the problem is to balance a tradeoff between model complexity and empirical performance. Similarly, both views justify the intuition relating to Figure 2.1 that the linear model should be preferred in the low-data regime, while the polynomial model should be preferred in the high-data regime.

The relationship between the two views can be further understood in the context of the simple hidden-worm theorem described above. As stated, the theorem belongs to the Generalization View. However, it is easy to convert it into a statement of the Compression View. A sender wishes to transmit to a receiver a database Y of labels which are related to a set X of raw data objects. The receiver already has the raw data X. The sender and receiver agree in advance on the hypothesis class C and an encoding format based on it that works as follows. The first bit is a flag that indicates whether a good hypothesis h* was found. If so, the sender then sends the index of that hypothesis in C, using log₂ |C| bits. The receiver can then look up h* and apply it to the images x_i to obtain the labels y_i. Otherwise, the sender encodes the labels y_i normally, at a cost of N bits.
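The codelength accounting for this scheme is simple enough to write out directly. The function below is a minimal sketch of the format just described (binary labels, a uniform code over C); the specific class size is an arbitrary choice for illustration.

```python
import math

def codelength(n_labels, class_size, good_hypothesis_found):
    """Bits needed to transmit N binary labels under the flag-bit scheme."""
    if good_hypothesis_found:
        # 1 flag bit + index of h* in C, using log2 |C| bits.
        return 1 + math.log2(class_size)
    # 1 flag bit + the raw labels, one bit each.
    return 1 + n_labels

N = 1000
raw = codelength(N, class_size=2**20, good_hypothesis_found=False)        # 1001 bits
compressed = codelength(N, class_size=2**20, good_hypothesis_found=True)  # 21 bits
print(raw, compressed)
```

Compression is achieved precisely when 1 + log₂|C| < N, which is the same smallness condition on the hypothesis class that the generalization theorem requires.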
This scheme achieves compression under two conditions: a good hypothesis h* is found, and log |C| is small compared to the number of samples N. These are exactly the conditions required for generalization to hold in the Generalization View approach to the problem.

This equivalence in the case of the hidden-worm theorem could be just a coincidence. But in fact there are a variety of theoretical statements in the statistical learning literature suggesting that the equivalence is actually quite deep. For example, Vapnik showed that if a model class C can be used to compress the label data, then the following inequality relates the achieved compression rate K(C) to the generalization risk R(C):

R(C) < 2(K(C) log 2 − N⁻¹ log δ)

The second term on the right, N⁻¹ log δ, is in all practical cases small compared to the K(C) term, so this inequality shows a very direct relationship between compression and generalization. The expression is strikingly simpler than any of the other VC generalization bounds.

There are many other theorems in the machine learning literature that suggest the equivalence of the Compression View and the Generalization View. For example, a simple result due to Blumer et al. relates the learnability of a hypothesis class to the existence of an Occam algorithm for it. In that paper, the key question of learning is whether a good approximation h* can be found of the true hypothesis h_T, when both functions are contained in a hypothesis class H. If a good approximation (low ε) can be found with high probability (low δ) using a limited number of data samples (small N), the class is called learnable. Functions in the hypothesis class H can be specified using some finite bit string; the length of this string is the complexity of the function. To define an Occam algorithm, let the unknown true function h_T have complexity W, and let there be N samples (x_i, h_T(x_i)).
The algorithm then produces a hypothesis h* of complexity W^c N^α that agrees with all the sample data, where c ≥ 1 and 0 ≤ α < 1 are constants. Because the complexity of h* grows sublinearly with N, a simple encoding scheme based on H and the Occam algorithm, such as the one mentioned above, is guaranteed to produce compression for large enough N. Blumer et al. show that if an Occam algorithm exists, then the class H is learnable. More complex results by the same authors are given in ?? (see section 3.2 in particular).

While the Generalization View and the Compression View may be equivalent, the latter approach has a variety of conceptual advantages. First of all, the No Free Lunch theorem of data compression indicates that no completely general compressor can ever succeed. This shows that all approaches to learning must discover and exploit special empirical structure in the problem of interest. This fact does not seem to be widely appreciated in the machine learning literature: many papers advertise methods without explicitly describing the conditions required for the methods to work. Also, because the model complexity penalty L(M) can be interpreted as a prior over hypotheses, the Compression View clarifies the relationship between learning and Bayesian inference. This relationship is obscure in the Generalization View, leading some researchers to claim that learning differs from Bayesian inference in some kind of deep philosophical way.

Another significant advantage of the Compression View is that it is simply easier to think up compression schemes than it is to prove generalization theorems. For example, the Generalization View version of the hidden-worm theorem requires a derivation and some modest level of mathematical sophistication to determine the conditions for success, to wit, that log |C| is small compared to N and a good hypothesis h* is found.
In contrast, in the Compression View version of the theorem, the requirements for success become obvious immediately after defining the encoding scheme. The equivalence between the two views suggests that a fruitful procedure for finding new generalization results is to develop new compression schemes, which will then automatically imply associated generalization bounds.

2.1.6 Limits of Model Complexity in Canonical Task

In the Compression View, an important implication regarding model complexity limits in the canonical task is immediately clear. The canonical task is approached by finding a short program that uses the image data set X to compress the label data Y. The goal is to minimize the net codelength of the compressor itself plus the encoded version of Y. This can be formalized mathematically as follows:

M* = argmin_{M ∈ 𝓜} [L(M) − log₂ P_M(Y | X)]

where M* is the optimal model, L(M) is the codelength required to specify model M, and 𝓜 is the model class. Now assume that 𝓜 contains some trivial model M₀, with L(M₀) = 0. The intuition behind M₀ is that it corresponds to just sending the data in a flat format, without compressing it at all. Then, in order to justify the choice of M* over M₀, it must be the case that:

L(M*) − log₂ P_{M*}(Y | X) < − log₂ P_{M₀}(Y | X)

The right hand side of this inequality is easy to estimate. Consider a typical supervised learning task where the goal is to predict a binary outcome, and there are N = 10³ data samples (many such problems are studied in the machine learning literature; see the review [30]). Then a dumb format for the labeled data simply uses a single bit for each outcome, for a total of N = 10³ bits. The inequality then immediately implies that:

L(M*) < 10³

This puts an absolute upper bound on the complexity of any model that can ever be used for this problem.
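The arithmetic behind this bound is easy to mechanize. The helper below is a trivial sketch (the particular bit counts are illustrative numbers, not measurements): it checks whether a candidate model is justified over the flat M₀ encoding of N binary labels.

```python
def model_justified(model_bits, encoded_data_bits, n_labels):
    # M* beats the trivial model M0 only if its total two-part codelength
    # L(M*) + L(Y|X, M*) is below the flat cost of one bit per binary label.
    return model_bits + encoded_data_bits < n_labels

N = 1000
print(model_justified(200, 500, N))    # True: 700 total bits beat the 1000-bit flat code
print(model_justified(1200, 0, N))     # False: the model alone exceeds the N-bit budget
```

Note the second case: even a model that encodes the labels for free is rejected if its own description exceeds N bits, which is exactly the absolute cap L(M*) < 10³ derived above.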
In practice, the model complexity must really be quite a bit lower to get good results. Perhaps the model requires 200 bits to specify and the encoded data requires 500 bits, resulting in a savings of 300 bits. But 200 bits corresponds to just 25 bytes. It should be obvious to anyone who has ever written a computer program that no model of any real complexity can be specified using only 25 bytes.

2.1.7 Intrinsically Complex Phenomena

Consider the following thought experiment. Let T = {t_1, t_2 … t_N} be some database of interest, made up of a set of raw data objects. Let M be the set of programs that can be used to losslessly encode T. A program m is an element of M; its length is |m|. Furthermore, let L_m(T) be the codelength of the encoded version of T produced by m. For technical reasons, assume also that the compressor is stateless, so that the t_i can be encoded in any order, and the codelength for each object will be the same regardless of the ordering. This simply means that any knowledge of the structure of T must be included in m at the outset, and not learned as a result of analyzing the t_i. Now define:

L*_T(x) = min_{m ∈ M} |m| […]

[…] e + f, we have saved bits. In fact, if we use the constraint that surfaces are mostly smooth, so that d(x, y) varies slowly, we can further encode d(x, y) by its average value d₀(y) on each horizontal line and its x-derivative d_x(x, y), which is mostly much smaller. The important point is that MDL coding leads you to introduce the third coordinate of space, i.e. to discover three-dimensional space! A further study of the discontinuities in d, and of the "non-matching" pixels visible to one eye only, goes further and leads you to invent a description of the image containing labels for distinct objects, i.e. to discover that the world is usually made up of discrete objects [80].
Note how a single principle (compression) leads to the rediscovery of structure in visual reality that is otherwise taken for granted (authors of object recognition papers do not typically feel obligated to justify the assumption that the world is made up of discrete objects). Mumford's thought experiment also emphasizes the intrinsic scalability of the compression problem: first one discovers the third dimension, and then that the world is made up of discrete objects.

3.4.3 Optical Flow Estimation

Another traditional task in computer vision is optical flow estimation. This is the problem of finding the apparent motion of the brightness patterns in an image sequence. The optical flow problem can be reformulated as a specialized compression technique that works as follows. Consider a high-frame-rate image sequence (say, 100 Hz). Because of the high frame rate, the scene does not change much between frames. Thus, a good way to save bits would be to encode and transmit full frames at a lower rate (say, 25 Hz), and use an interpolation scheme to predict the intermediate frames. The predicted pixel value would then be used as the mean of the distribution used to encode the real value; assuming the predictions were good, substantial bit savings would be achieved while maintaining losslessness.

Now, a "dumb" interpolation scheme could just linearly interpolate a pixel value between the start and end frames. But a smarter technique would be to infer the motion of the pixels (i.e. the optical flow) and use this information to do the interpolation. The simplest encoding distribution to use would be a Gaussian with unit variance and mean equal to the predicted value. In that case, the codelength required to encode a pixel would be simply the squared difference between the prediction and the real outcome, plus a constant corresponding to the normalization factor.
A smarter scheme might take into account the local intensity variation: if the intensity gradient is large, the prediction is likely to be inaccurate, and so a larger variance should be used for the encoding distribution. The resulting codelength for a single interpolated frame would be:

Σ_{x,y} [(I(x, y) − I_GT(x, y))² / (‖∇I_GT(x, y)‖² + ε) + k(I_GT(x, y))]

where I(x, y) is the real frame, I_GT(x, y) is the frame predicted from the optical flow, ε is a small constant, and k(·) is a term corresponding to the normalization factor. Since shorter codelengths are achieved by improving the accuracy with which I_GT predicts I, improvements in the optical flow estimation algorithm will lead to improvements in the compression rate. Indeed, with the exception of the k(·) terms, the above expression is equivalent to an evaluation metric for optical flow algorithms proposed by [4] (compare their Equation 1 to the above expression). The compression-equivalent scheme is much simpler than the other metric proposed by [4], which involves the use of a complicated experimental apparatus to obtain ground truth. The compression metric permits any image sequence to be used as empirical data.

3.4.4 Image Segmentation

As mentioned above, the idea of image segmentation is to partition an image into a small number of homogeneous regions with simple boundaries. Each of the italicized words is crucial for the problem to make sense at all. The regions must be homogeneous, otherwise one can simply draw arbitrary boundaries. The boundaries must be simple, because otherwise it would be easy to segment similar pixels together by drawing complex, jagged boundaries. Thirdly, the number of regions should be small, or one could solve the problem simply by creating thousands of mini-regions of four or five pixels each.
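As a concrete sketch, the gradient-normalized codelength above can be computed directly from arrays of pixel values. The example below uses NumPy on a tiny synthetic frame pair; the value of eps and the omission of the k(·) normalization terms are simplifying assumptions, and the "image" is a made-up bright square shifted by one pixel.

```python
import numpy as np

def interpolation_codelength(I, I_pred, eps=1.0):
    """Gradient-normalized squared prediction error, summed over pixels.

    I      : real intermediate frame (2-D array)
    I_pred : frame predicted from the estimated optical flow
    The k(.) normalization terms from the text are omitted.
    """
    gy, gx = np.gradient(I_pred)                 # local intensity variation
    grad_sq = gx**2 + gy**2
    return float(np.sum((I - I_pred) ** 2 / (grad_sq + eps)))

# Tiny synthetic sequence: a bright square moving one pixel per frame.
frame = np.zeros((8, 8)); frame[2:5, 2:5] = 1.0
true_mid = np.roll(frame, 1, axis=1)             # the actual intermediate frame

good_pred = np.roll(frame, 1, axis=1)            # prediction from the correct flow
dumb_pred = frame                                # "no motion" prediction

print(interpolation_codelength(true_mid, good_pred))
print(interpolation_codelength(true_mid, dumb_pred))
```

The correct-flow prediction yields zero excess codelength, while the no-motion prediction pays for every pixel it gets wrong; a better flow estimator always means a shorter code.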
The segmentation problem can be formulated as a compression problem by using a compressor that encodes an image in terms of a set of regions. A specialized model is fit to each region, and then used to encode the pixels of the region. Because the model is region-specific, it describes the distribution of the pixels in the region better than a generic model would, and can therefore achieve a substantial codelength savings. However, several conditions must be met for this to work. First, the pixels assigned to a region must be very similar to one another. Second, the format requires that the contours of the regions also be encoded, so that the decoder knows which model to use for each pixel. To reduce this cost, it is necessary to use simple region boundaries, so that they can be encoded with a short code. Thirdly, the sender is also obligated to send the parameters required to define each region-specific model, so the total number of regions should be kept low. These three considerations supply cleanly justified definitions for the problematic words (homogeneous, simple, small) mentioned above.

Several authors have adopted the compression-based approach to segmentation [58, 64]. The following is a brief discussion of a method proposed by Zhu and Yuille [128]. Zhu and Yuille formulate the segmentation problem as a minimization of a codelength functional:

Σ_{i=1}^{M} [λ − log P(I_{x,y} : (x, y) ∈ R_i | α_i) + (µ/2) ∮_{∂R_i} ds]

Here R_i denotes the ith region with boundary ∂R_i, α_i is a set of parameters specifying a region-specific model, λ is a per-region overhead cost, and there are M total regions. The goal is to find choices for M and the R_i that minimize the functional, which represents the cost that would be required to encode the image using a segmentation-based compression format. The overhead cost of specifying a region-specific model is λ. The cost of specifying the boundary ∂R_i is given by the contour integral term.
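A minimal sketch of this functional can be written for a 1-D "image", where a boundary is just a cut position and each region is scored by a Gaussian fit, as in Zhu and Yuille's choice of region model. The λ and µ values and the data are arbitrary assumptions; the point is only that the total codelength prefers the segmentation that packages similar pixels together.

```python
import math

LAMBDA, MU = 8.0, 2.0    # per-region overhead and per-unit boundary cost (arbitrary)

def region_cost(pixels):
    # -log P(pixels | alpha_i) under a Gaussian fit to the region (natural log).
    mu = sum(pixels) / len(pixels)
    var = max(sum((p - mu) ** 2 for p in pixels) / len(pixels), 1e-4)
    return sum(0.5 * math.log(2 * math.pi * var) + (p - mu) ** 2 / (2 * var)
               for p in pixels)

def codelength(image, boundaries):
    # Zhu-Yuille-style functional: sum over regions of
    # [lambda + encoding cost + (mu/2) * boundary length].
    cuts = [0] + boundaries + [len(image)]
    regions = [image[a:b] for a, b in zip(cuts, cuts[1:])]
    # In 1-D each region's "boundary" is its two endpoints, so length = 2.
    return sum(LAMBDA + region_cost(r) + (MU / 2) * 2 for r in regions)

# A 1-D image with two homogeneous areas and a sharp edge between them.
image = [0.1, 0.0, 0.2, 0.1, 5.0, 5.1, 4.9, 5.0]

print(codelength(image, [4]))    # split at the true edge
print(codelength(image, []))     # single region
print(codelength(image, [2]))    # split in the wrong place
```

The true split wins: its narrow region-specific Gaussians encode the pixels cheaply, and the per-region overhead λ prevents the degenerate solution of one region per pixel.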
The cost of encoding the pixels in region R_i using the specialized model defined by α_i is given by the − log P(…) term. The minimization problem is a competition between the need to package similar pixels together, so that narrow, region-specific model distributions will describe them well, and the need to use a small total number of regions with simple boundaries.

Zhu and Yuille focus most of their effort on the development of an algorithm for finding a good minimum of the codelength functional. The result is an algorithm in which the regions "compete" over boundary pixels; the winner of the competition is the region that can encode the pixel with the shortest code. Little effort is expended on examining the functional itself. For example, the cost of a region boundary in the functional is proportional to its length. But this seems inefficient; it should be possible to define a polygonal boundary by specifying a small number of vertices. Also, the region-specific models P(·|α_i) are chosen to be simple multivariate Gaussians. This choice is not supported by any kind of empirical evidence. Finally, the paper does not report any actual compression results, only segmentation results, and only on a few images.

The reason for these oversights is obvious: Zhu and Yuille view compression as merely a trick that can be used to obtain good segmentation. If the authors had adopted the compression principle as their primary goal, they would have had to ask an awkward question: is segmentation really a good tool for describing images? That is to say, is segmentation really a scientific idea? Is it empirically useful to describe images in terms of a set of regions? This question cannot be answered without empirical investigation.
At this point it is worth noting that the edge detection task discussed above has no direct reformulation as a compression problem. However, it seems very likely that a good edge detector will be useful for the image segmentation problem. Edge detection algorithms can therefore be justified in comperical research by showing how they contribute to the performance of segmentation-based compressors.

3.4.5 Face Detection and Modeling

Consider a comperical inquiry that uses the image database hosted by the popular internet social networking site Facebook. This enormous database contains many images of faces. Faces have a very consistent structure, and a computational understanding of that structure will be useful for compressing the database well. There is a significant literature on modeling faces [9, 27], and several techniques exist that can produce convincing reproductions of face images from models with a small number of parameters. Given a starting language L, by adding this kind of model-based face rendering technique, a new language L_f can be defined that contains the ability to describe scenes using face elements. Since the number of model parameters required is generally small and the reconstructions are quite accurate, it should be possible to significantly compress the Facebook database by encoding face elements instead of raw pixels when appropriate.

However, it is not enough just to add face components to the description language. In order to take advantage of the new face components to achieve compression, it is also necessary to be able to obtain good descriptions D_{L_f} of images that contain faces. If unlimited computational power were available, then it would be possible to test each image subwindow to determine whether it could be more efficiently encoded by using the face model.
But the procedure of extracting good parameters for the face model is relatively expensive, so this brute-force procedure is inefficient. A better scheme would be to use a fast classifier for face detection such as the Viola-Jones detector described above [119]. The detector scans each subwindow, and if it reports that a face is present, the subwindow is encoded using the face model component. Bits are saved only when the detector correctly predicts that the face-based encoder can be used to save bits for the subwindow. A false negative is a missed opportunity, while a false positive incurs a cost related to the inappropriate use of the face model to encode a subwindow. In other words, the face model implicitly defines a virtual label V(W) for each subwindow W:

V(W) = D(W) − F(W)

where D(W) is the default cost of encoding the window, and F(W) is the cost using the face model. These virtual labels depend only on the face model, the original encoder, and the image data, so they can be generated with no human effort. The face detector can be trained using the virtual labels, and the performance of the combined detector/modeler system can be evaluated using the overall compression rate. Since one important bottleneck in machine learning is the limitation on the amount of labeled data that is available, this technique should be very useful. The same basic strategy can be used to evaluate object recognition systems, though the state of the art in generic object modeling is less advanced. Indeed, the virtual label strategy can be used whenever a synthesizer for some kind of data can be found. For example, one could attempt to train a speech detection system using a large audio database and a voice synthesizer.

Chapter 4

Compression and Language

4.1 Computational Linguistics

Comperical research begins by finding a large pool of structure-rich data that can be obtained cheaply.
In addition to images, another obvious source of such data is text. The study of text is also interesting from a humanistic perspective, due to the crucial role language plays in human society. The line of inquiry resulting from applying the Compression Rate Method to large text corpora has much in common with the modern field known as computational linguistics (CL). Computational linguistics is the field dedicated to computer-based analysis and processing of text. In many ways, the field is similar to computer vision in mindset and philosophical approach. CL researchers employ many of the same mathematical tools as vision researchers, such as Hidden Markov Models, Markov Random Fields, and the Support Vector Machine algorithm. The two fields also share a mindset: they both produce hybrid math/engineering results, and are strongly influenced by the field of computer science. Both fields suffer from a "toolbox" mentality: researchers produce a large number of candidate solutions for each task, but lack convincing methods by which to evaluate those solutions.

Computational linguistics differs from computer vision in that it has a sister field - traditional linguistics - from which it borrows many ideas. Linguistic theories sometimes play a role in the development of systems such as parsers or machine translation systems. The parser developed by Collins, described below, includes ideas such as Wh-movement and the complement/adjunct distinction [24]. Another area in which traditional linguistic theory plays a role is in the development of parsed corpora such as the Penn Treebank. A treebank is a set of sentences with attached parse trees; such a resource allows researchers to use learning systems to construct parsers. To construct a treebank, developers must select a theory of grammar that defines the set of part-of-speech (POS) tags and syntactic tags (noun phrase, adverb phrase, Wh-adjective phrase, etc.).
However, the influence of standard linguistic theory on computational linguistics is not as great as one might expect. Often, CL researchers find it more effective to ignore linguistic knowledge and simply employ brute-force computational or statistical techniques. Thus, the machine translation system developed by Brown et al., discussed below, uses minimal linguistic content [15]. This is actually considered an advantage, because it makes the system easier to port to a new language pair. Similarly, researchers in the field of language modeling often attempt to deploy models based on sophisticated linguistic concepts, only to find that such models are outperformed by simple n-grams.

One of the major seismic shifts in natural language research was the transition from rule-based systems to statistical systems. Rule-based systems were constructed by assembling a group of linguists and transferring their knowledge, expressed in the form of rules, into a computer system. These packages were often fairly complex, because most linguistic rules have exceptions. Even a basic rule such as "all English sentences must have a verb" is often broken in practice. The complexity of the rule-based systems led to brittleness, fragility, and difficulty of maintenance. Furthermore, the rule-based approach was unattractive because the methods developed for one language did not often transfer to another language. This made it expensive to add new languages to a system.

The field of statistical natural language processing emerged as a response to the limitations of the rule-based systems. This area was substantially pioneered by a group of researchers at IBM in the 90s [15, 8, 28]. A statistical system attempts to avoid making explicit commitments to a certain type of structure. Instead, the system attempts to learn the structure of text by analyzing a corpus. Thus, a major ingredient for any learning-based natural language system is a corpus of text.
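The kind of n-gram model that so often beats linguistically sophisticated alternatives is almost embarrassingly simple. A minimal bigram sketch, with unsmoothed maximum-likelihood estimates over a toy corpus (real systems smooth the counts); its bits-per-token score is, ignoring model size, the compression rate the model achieves on the text:

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Unsmoothed maximum-likelihood bigram model; real systems smooth."""
    context = Counter(tokens[:-1])          # counts of conditioning words
    pairs = Counter(zip(tokens, tokens[1:]))
    return lambda word, prev: pairs[(prev, word)] / context[prev]

def cross_entropy(prob, tokens):
    """Bits per token under the model."""
    bits = sum(-math.log2(prob(w, prev)) for prev, w in zip(tokens, tokens[1:]))
    return bits / (len(tokens) - 1)

corpus = "the cat sat on the mat and the cat ran".split()
p = train_bigram(corpus)
assert abs(p("cat", "the") - 2 / 3) < 1e-9  # "the" is followed by "cat" 2/3 of the time
assert cross_entropy(p, corpus) < 1.0       # far below a naive per-word code
```

Even this tiny model captures enough local structure to beat a uniform guess over the vocabulary by a wide margin, which is the empirical pattern the text describes.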
In the case of statistical machine translation, it is essential to have a bilingual or parallel corpus: one which contains sentences from one language side by side with sentences from the other language. The system learns how to translate by analyzing the relationship between the two side-by-side sentences.

From the perspective of comperical research, a particularly interesting CL problem is language modeling. The goal here is to find good statistical models of language, which achieve a low cross-entropy score on a text database. Cross-entropy is just the negative log-likelihood of the text data under the model; in other words, it is just the compression rate without the model complexity penalty. Section 4.4 provides an analysis of this subfield and its relationship to comperical research. Two other standard CL topics, statistical parsing and machine translation, are discussed in Sections 4.2 and 4.3. Some other tasks, such as document classification and word sense disambiguation, are discussed briefly in Section 4.5. Again, the goal is not to provide a comprehensive survey of the field, or even to describe individual papers in any depth. Instead, the idea is to give nonspecialist readers an impression of the basic issues in a given area. Each section contains a description of the task, followed by an analysis of the mechanisms used to evaluate candidate solutions. A brief critique of the research is given, focusing primarily on the evaluation schemes. Then a comperical reformulation of the problem is given.

4.2 Parsing

An important part of the structure of natural language is grammar. Grammatical rules govern how different elements of a sentence can fit together. Consider the sentence "John loves Mary".
That sentence can be parsed as follows:

(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))

Here NNP and VBZ are part-of-speech (POS) tags for nouns and verbs, while S, NP, and VP are syntactic tags representing sentence, noun phrase, and verb phrase structures. The goal of parsing is to recover both the POS tags and the syntactic tags, given the words of the sentence. Thus, the parsing problem includes the POS tagging problem.

In principle, an important part of designing a parser is to determine the set of POS tags and syntactic tags one wishes to obtain. While there is agreement about the basics, different linguists endorse different theories of grammar, which in turn utilize different abstractions. However, this problem has been solved in practice by the appearance of the Penn Treebank [72]. This is a corpus of text for which parse trees have been provided by human annotators (see further discussion below). Thus, most parsers simply utilize the same set of tags as the Penn Treebank.

A very common approach to the parsing problem is based on an idea called the Probabilistic Context Free Grammar (PCFG). A PCFG is a list of symbols, and a set of rewrite rules that can be used to transform those symbols. For example, a PCFG could specify that an NP can transform into an adjective and a noun. Another rule might allow the NP to transform into a determiner-adjective-noun sequence. Each rewrite rule has a probability attached to it, allowing the system to express the fact that some grammatical structures are more common than others. The rewrite rules of a PCFG can include recursion. For example, an NP might be rewritten as a PP and an NP. Recursion allows the PCFG to produce an infinite number of sentences.

One key difficulty in parsing is that the number of possible parse trees is exponential in the length of the sentence.
This means that even if a parser could determine the validity of a particular parse tree with perfect accuracy, it would be unable to test all possible parse trees. Instead, the parser must employ some search strategy that narrows down the number of parses that actually get examined. This is a bit like reconstructing the details of a crime from the evidence left at the scene. Some explanations make more sense than others, but if one does not glimpse the possibility of a certain explanation, one might end up concluding that a less likely explanation is true.

One influential paper in this area is "Learning to Parse Natural Language with Maximum Entropy Models" by Adwait Ratnaparkhi [93]. Ratnaparkhi uses the idea of shift-reduce parsing, a standard technique used in computer science to compile software from human-readable form to machine code. A shift-reduce parser, as the name indicates, applies two basic operations. The shift operation pushes an input element onto the stack, where it awaits further processing. The reduce operation joins an input element together with the element on top of the stack, producing an element of a new type. In grammatical terms, this means taking two elements such as the (determiner) and dog (noun) and joining them to produce a noun phrase.

The main issue with applying this technique to natural language parsing, as opposed to software parsing, is that in the former there is ambiguity about the correct way of combining elements together. To deal with this ambiguity, Ratnaparkhi's parser does not use a deterministic rule to decide whether to combine two elements (reduce). Instead, it considers multiple options. In other words, when it comes to a fork in the road, it pursues both paths, at least for a little while. Each operation, or fork in the road, is assigned a probability. And a full derivation, or a path through the woods, has a net probability determined by the product of each individual operation.
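Each shift/reduce decision is scored by a Maximum Entropy model of the form p(a|b) ∝ exp(Σ_i λ_i f_i(a, b)). A minimal sketch, with two context functions and weights invented purely for illustration:

```python
import math

def maxent_prob(a, b, actions, features, lambdas):
    """p(a|b) = exp(sum_i lambda_i f_i(a, b)) / Z(b)."""
    score = lambda act: math.exp(sum(l * f(act, b) for l, f in zip(lambdas, features)))
    z = sum(score(act) for act in actions)      # normalizer Z(b)
    return score(a) / z

# Two invented context functions for a shift/reduce decision: f1 fires
# when we reduce with a determiner on top of the stack, f2 when we shift.
f1 = lambda a, b: 1.0 if a == "reduce" and b["top_is_determiner"] else 0.0
f2 = lambda a, b: 1.0 if a == "shift" else 0.0

ctx = {"top_is_determiner": True}
args = (["shift", "reduce"], [f1, f2], [2.0, 0.5])
p_reduce = maxent_prob("reduce", ctx, *args)
p_shift = maxent_prob("shift", ctx, *args)
assert abs(p_reduce + p_shift - 1.0) < 1e-9   # Z(b) normalizes
assert p_reduce > p_shift                     # lambda_1 favors reducing here
```

The flexibility the text attributes to MaxEnt is visible even here: any computable function of the action and context can be dropped in as an f_i, and training adjusts only the weights λ_i.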
The algorithm attempts to find the path with the highest net probability. The probabilities assigned to single operations are obtained using a Maximum Entropy (MaxEnt) model. MaxEnt models are of the form:

p(a|b) = (1/Z(b)) exp( Σ_i λ_i f_i(a, b) )    (4.1)

where p(a|b) is the probability of an operation a in a context b, the f_i are a set of context functions, the λ_i are coefficients related to the context functions, and Z(b) ensures normalization. The operations a represent either shift or reduce. The context b contains information related to the current group of words being examined and the nearby subtrees. The key property of MaxEnt is that it allows the user an enormous amount of flexibility in defining the context functions f_i. The researcher can test out many different kinds of context functions, and the training algorithm will automatically select the optimal parameters λ_i. Ratnaparkhi also presents an algorithm for searching for good parse trees. The algorithm is a type of breadth-first search (BFS) that prunes a potential parse candidate if its probability, assigned by the MaxEnt model, is too low. This pruning is necessary because the total number of parse trees is vast.

A more recent paper on parsing is "Head-Driven Statistical Models for Natural Language Parsing" by Michael Collins [24]. Abstractly, Collins' strategy is to define a joint probability model P(T, S), where T is a parse tree and S is a sentence. Then, given a particular sentence, his algorithm maximizes the conditional probability of the tree given the sentence:

T* = argmax_T P(T|S) = argmax_T P(T, S)/P(S) = argmax_T P(T, S)

Given this formulation, there are two basic problems: how to define the model P(T, S), and how to perform the actual maximization. Collins spends most of his attention on the former problem, noting that there are standard algorithms that can be applied to handle the latter.
Collins employs the PCFG framework to define P(T, S). A PCFG model defines the joint probability as simply the product of each of the expansions used in the tree. Each expansion is a transformation of the form α → β. For example, a verb phrase could transform into a verb and a noun phrase. If the parse tree includes n expansions, then the full joint probability is given by the expression:

P(T, S) = Π_{i=1..n} P(β_i | α_i)

where α_i and β_i are the pre- and post-expansion structures involved in the i-th step of the derivation embodied by the parse tree. It is worth noting the close relationship between this approach to parsing and the problem of modeling the standalone probability of a sentence, P(S), which can be obtained from P(T, S) by marginalizing over T. To complete the model, it is necessary to find the probability of a particular expansion, P(β|α). If a parsed corpus is available, this can be done by simple counting:

P(β|α) = Count(α → β) / Count(α)

While the procedure described above is almost complete, the basic PCFG model is too simplistic to provide good performance. Collins' main focus in the paper is on various strategies for making the model more realistic. One technique is to use a lexicalized PCFG, where each nonterminal (node in the parse tree) includes not only a syntactic tag (noun phrase, prepositional phrase, etc.) but also a head word and an associated POS tag. So for the sentence "Last week, IBM bought Lotus", the root node is S(bought, VBD). The root node then expands into NP(week, NN), NP(IBM, NNP), and VP(bought, VBD). The PCFG formalism can easily accommodate this; it simply means there are many more symbols and expansions. The point of this modification is that specific words often contain very useful information that can affect the probabilities of various derivations.
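The counting estimate P(β|α) = Count(α → β)/Count(α) can be read directly off a treebank. A sketch over a hypothetical handful of expansions:

```python
from collections import Counter

def estimate_pcfg(expansions):
    """P(beta|alpha) = Count(alpha -> beta) / Count(alpha)."""
    rule_counts = Counter(expansions)                 # (alpha, beta) pairs
    lhs_counts = Counter(alpha for alpha, _ in expansions)
    return {(a, b): c / lhs_counts[a] for (a, b), c in rule_counts.items()}

# Hypothetical expansions read off a tiny parsed corpus
expansions = [("NP", ("D", "N")), ("NP", ("D", "N")),
              ("NP", ("A", "N")), ("VP", ("V", "NP"))]
probs = estimate_pcfg(expansions)
assert abs(probs[("NP", ("D", "N"))] - 2 / 3) < 1e-9
assert probs[("VP", ("V", "NP"))] == 1.0
```

In the lexicalized setting the same counting applies, but the symbol inventory explodes, which is exactly the sparse-data problem discussed next.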
The major drawback of the lexicalized PCFG approach is that it vastly increases the number of derivational rules, leading to severe sparse-data issues. The specific derivation S(bought, VBD) → NP(week, NN) · NP(IBM, NNP) · VP(bought, VBD) probably occurs only once in the corpus. Collins makes a number of independence assumptions to alleviate the sparse-data problems and reduce the number of parameters in the model. To do this, he notes that all production rules are of the form:

α → { L_1 ... L_n, H, R_1 ... R_m }

where {L_i}, H, and {R_j} designate the left-side components, the head word component, and the right-side components respectively. So in the example given above, H is the node associated with bought, L_1 and L_2 are the components associated with week and IBM, and there are no components on the right side. The simplifying independence assumption is then that the probability for each node in L, H, R depends only on the parent P. This technique makes it much easier to estimate the relevant probabilities. However, it turns out that this assumption is actually too strong, and makes it impossible to capture certain linguistic structures.

Collins describes three increasingly sophisticated models, which capture increasingly complex information about the structure of sentences. The first model introduces the notion of linguistic distance, allowing it to take into account the history of previously applied expansions. Roughly, this means that when expanding a node, information about the node's parent node or sibling nodes can be used to modify the expansion probabilities. The second model introduced by Collins attempts to handle the distinction between adjuncts and complements. A naïve parse of the sentence "Last week IBM bought Lotus" would identify both Last week and IBM as noun phrases.
But these two elements actually have distinct roles: IBM is the subject of the verb bought, while Last week is an adjunct modifying the verb. Collins introduces a new set of variables into the model, which specify whether a nonterminal generates a left or right complement. Most verbs take one left complement, which represents the subject. So the verb bought will generate one left complement, and the parser can assign the word IBM to this role. Of course, all of these rules are expressed probabilistically: it is not impossible for a verb to have multiple left complements, just very unlikely.

Collins' third model attempts to handle the phenomena of Wh-movement and traces. The importance of traces can be seen in the following examples:

1) The company (SBAR that TRACE bought Lotus).
2) The company (SBAR that IBM bought TRACE).

In the first sentence, company refers to IBM, which fills the trace position. In the second sentence, company refers to Lotus. One way to handle Wh-movement is to use the notion of a gap. A gap starts with a trace, and propagates up through the parse tree until it finds something to resolve with. In sentence #2 above, the gap starts as the complement of bought and propagates up the tree until it resolves with company. The gap element is added as another variable in the now quite complex parsing model.

After defining the third model, Collins makes a variety of additional refinements to handle issues like punctuation, coordination, and sentences with empty subjects. The importance of coordination can be seen by considering the phrase "the man and his dog", which has head word NP(man). This component then expands as NP(man) → NP(man) CC(and) NP(dog). The issue is that, after the coordinator and appears, it becomes very likely that another element will be produced.
A naïve scheme, however, that does not take into account the presence of the coordinator will put high probability on the outcome where no additional elements are produced. The empty subject issue relates to sentences like "Mountain climbing is dangerous". The Penn Treebank tags this sentence as having no subject. This linguistic analysis of the structure is problematic, because it causes the model to assign high probability to sentences with no subject. An alternative analysis would conclude that it is actually very rare for English sentences to have no subject. To deal with this issue, Collins uses a preprocessing step to transform trees with these kinds of empty subjects into a simpler form.

4.2.1 Evaluation of Parsing Systems

As noted above, one of the most important developments in the history of statistical parsing research was the appearance of parsed corpora such as the Penn Treebank (Marcus et al. [72]) and the Penn Chinese Treebank (Xue et al. [125]). A treebank is a corpus of sentences with attached parse tree information, which has been produced by human annotators. These parsed corpora, which are analogous to labeled datasets in machine learning research, allowed researchers to apply statistical learning techniques to the problem of parsing.

A key issue in the development of a treebank is that there is no single, objectively correct method or ruleset for parsing. To parse a sentence, one must implicitly employ a theory of grammar which describes the rules of parsing and the elements included in a parse tree. Though there is widespread agreement regarding basic elements like "verb" and "noun", the precise content of grammatical theories is the subject of continuing research and debate in the field of linguistics. In order to construct a treebank, the developers must make choices about which theory of grammar to use.
This issue is acknowledged by the developers of the Penn Chinese Treebank:

When we design the treebank, we consider (user) preferences and try to accommodate them when possible. For instance, people who work on dependency parsers would prefer a treebank that contains dependency structures, while others might prefer a phrase structure treebank... It is common for people to disagree on the underlying linguistic theories and the particular analyses of certain linguistic phenomena in a treebank [125].

And later:

Another desired goal is theoretical neutrality. Clearly we prefer that this corpus survives ever changing linguistic theories. While absolute theoretical neutrality is an unattainable goal, we approach this by building the corpus on the "safe" assumptions of theoretical frameworks ... the influence of Government and Binding theory and X-bar theory is obvious in our corpus, we do not adopt the whole package [125].

A concrete example of this kind of issue relates to the choice of POS tags in the Penn Treebank [72]. Marcus et al. note that in the Brown Corpus, upon which the Penn Treebank is based, the contraction I'm is tagged as PPSS+BEM. PPSS indicates a "non-third person nominative personal pronoun", while BEM is a special tag reserved for am or its contracted form 'm. In contrast to the Brown Corpus, the Penn Treebank uses a much smaller number of tags. This raises the issue of whether the treebank used the "right" or "optimal" set of tags, or if it even makes sense to discuss such an issue. Marcus et al. also note that in some cases, the correct tag of a word simply cannot be conclusively determined from the sentence. In that case, the annotator marks the word with a double tag.

In addition to the conceptual issue of how to choose a linguistic theory to guide the annotation process, treebank developers must face the very practical issue of how to minimize the amount of human labor required.
Manually constructing a large number of parse trees is a highly time-intensive process. This fact puts strong constraints on the ultimate form the database takes. As Marcus et al. note,

Our approach to developing the syntactic tagset was highly pragmatic and strongly influenced by the need to create a large body of annotated material given limited human resources [72].

One way to make the annotation work easier is to use an automatic parsing tool as a preprocessing step. The human annotators then correct any errors the automatic tool may have made. While this scheme saves human labor, it also puts subtle constraints on the ultimate form of the annotation output. The developers of the Penn Treebank used a program called Fidditch to perform the initial parsing [45]. Fidditch makes certain grammatical assumptions and produces parse trees that reflect those assumptions. The human annotators can correct small errors made by Fidditch, but do not have time to make comprehensive revisions. Thus, the grammatical assumptions made by Fidditch are built into the structure of the annotations of the treebank.

Treebank developers rely on human annotators, who sometimes make mistakes. This is to be expected, since parsing is a cognitively demanding task, and the annotators are encouraged to perform as efficiently as possible, to maximize the total number of words in the corpus. Marcus et al. note that the median error rate for the human annotators was 3.4% [72]. This implies that there are a substantial number of errors in the treebank.

Given the human-annotated parse tree, it is fairly straightforward to define a score for the machine-generated tree. The basic quantities are N*, the number of correctly labeled constituents, N_p, the number of constituents in the machine parse, and N_t, the number of constituents in the human parse. For a constituent to be correct, it must span the same range of words as the corresponding constituent in the human parse, and have the same label.
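These counts yield the standard scores reported in the literature: precision N*/N_p, recall N*/N_t, and their harmonic mean, the F-score. A sketch treating constituents as (label, start, end) triples (assuming at least one constituent matches, so no division by zero):

```python
def parse_scores(machine, human):
    """Constituent-level precision, recall, and F-score. A constituent is
    a (label, start, end) triple; it is correct only if an identical
    triple appears in the human parse."""
    n_star = len(set(machine) & set(human))   # correctly labeled constituents
    precision = n_star / len(machine)         # N* / Np
    recall = n_star / len(human)              # N* / Nt
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

human = [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 5)]
machine = [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 3, 5)]
p, r, f = parse_scores(machine, human)
assert abs(f - 0.75) < 1e-9   # three of four constituents match exactly
```

Note how unforgiving the match criterion is: the machine's final NP spans words 3-5 instead of 2-5, so it earns no credit at all.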
Research papers report performance in terms of precision (N*/N_p), recall (N*/N_t), and F-score, which is the harmonic mean of precision and recall.

One notable aspect of research in statistical parsing is that the range of scores reported in the literature is quite narrow. A paper published in 1995, two years after the publication of the treebank, achieved an F-score of 85% [70]. Eleven years later, a paper published in 2006 achieved an F-score of 92% [77]; this score appears to be comparable to the state of the art at the time of this writing. It is not clear whether additional research will produce additional improvements, or whether there is some natural limit to the performance that can be achieved by statistical parsers on this problem.

4.2.2 Critical Analysis

The major shortcoming of research in statistical parsing is that it is totally dependent on the existence of a parsed treebank. Researchers rely on the treebank both to train their systems and to evaluate their systems. This has several negative implications.

The first implication is that because the treebanks drive the development of the parsers, the assumptions made by the treebank authors are "baked in" to the parsers. If the treebank developers use X-bar theory to guide the annotation process, then systems will learn to parse sentences using X-bar theory [53]. If it turns out that X-bar theory is incorrect, there is no way for researchers in statistical parsing to discover that fact. This point is exactly analogous to the Elemental Recognizer thought experiment of Chapter 3, except that researchers are building systems based on Chomsky's X-bar theory instead of Aristotle's theory of the four elements. In comparison to the process of building systems that learn to regurgitate the abstractions used by linguists, a far superior strategy would be for researchers to use learning systems to test and validate those abstractions.
For example, an important issue in the design of the Penn Chinese Treebank is the question of how to characterize the special ba-construction of Chinese. Different linguists have argued that ba is a verb, a preposition, a topic marker, and various other things. After discussing the issue with various linguists, Xue et al. finally decided to categorize ba as a verb [125]. As a result of this choice, all the statistical parsers trained on the Penn Chinese Treebank will learn to recognize ba as a verb. If that categorization choice was an error, the learning systems will be learning to duplicate an error. It would be infinitely preferable if the parsing systems could be used as tools to answer linguistic questions, such as how to categorize the ba-construction.

This general principle holds for many of the issues related to parsing. What is the optimal set of POS tags? What is the optimal syntactic tagset? Is a dependency structure representation of grammar superior to a phrase structure representation? Computational linguists can answer these questions on their own, using grammatical introspection and other tools of traditional linguistic analysis, and then train the systems to produce the same answers. Or they can attempt to use learning systems to automatically discover the optimal answer using some alternative principle, such as the compression rate.

A more concrete criticism of statistical parsing research relates to the amount of progress achieved compared to the amount of human effort required. As noted, treebanks require a substantial investment of human time to develop. Unfortunately, it is not clear how much of an improvement in the state of the art a treebank produces. Eleven years of research relating to the Penn Treebank seems to have produced about a 7% absolute increase in F-score. Does this improvement indicate a significant advance in the power of statistical parsers?
Given that parsers now achieve F-scores of around 92%, does that mean parsing is a solved problem? If not, will the research community need to construct a new and more difficult treebank in order to make further progress? If each small advance in parsing technology requires a massive expenditure of time and money, it would seem that this approach is hopeless.

4.2.3 Comperical Formulation

The fundamental principle of comperical philosophy is that in order to compress a dataset, one must understand its structure. Since grammar and part-of-speech information constitute a crucial component of the structure of text, the compression goal leads naturally to a study of those phenomena. Consider the following sentence:

John spoke _____.

To compress this sentence well, the compressor must predict what word will fill in the blank. Assuming it is known that the sentence contains only three words, it is clear that the word in the blank is going to be an adverb such as "quickly", "thoughtfully", or "angrily". The compressor can save bits by exploiting this fact. This shows how an analysis of parsing and grammar can be justified by the compression principle.

There is no reason to believe this idea cannot be scaled up to include more sophisticated techniques. Indeed, it is easy to see how a PCFG can be used as a tool for text compression. Consider the following highly simplified PCFG:

S
  1: P=1.0: S → NP · VP
VP
  1: P=0.5: VP → V · NP
  2: P=0.3: VP → V · NP · NP
  3: P=0.2: VP → V · NP · PP
NP
  1: P=0.4: NP → N
  2: P=0.4: NP → A · N
  3: P=0.2: NP → D · A · N

So a sentence (S) always transforms into a noun phrase (NP) plus a verb phrase (VP). With probability P=0.5 a verb phrase transforms into a verb (V) and a single noun phrase (NP), with probability P=0.3 it transforms into a verb and two noun phrases, and with probability P=0.2 it transforms into a verb, a noun phrase, and a prepositional phrase (PP). A noun phrase can transform into a single noun (P=0.4), an adjective and a noun (P=0.4), or a determiner, adjective, and noun (P=0.2). The verb, noun, adjective, and determiner categories correspond to actual words.

Consider how this highly simplified grammar can be used to parse the following sentence:

The black mouse ate the green cheese.

This sentence requires only three distinct derivational rules. First, the sentence splits into a noun phrase (the black mouse) and a verb phrase (ate the green cheese). The verb phrase then splits into a verb and a noun phrase. Both of the noun phrases split into the determiner, adjective, noun pattern. A compressor can encode the sentence using its parse tree as follows. The S → NP · VP derivation requires zero bits, since there are no other possibilities. To encode the derivation VP → V · NP, the compressor sends rule #1 in the VP list, at a cost of −log2(0.5) = 1 bit. To encode the two derivations NP → D · A · N, it sends rule #3 in the NP list, at a cost of −log2(0.2) ≈ 2.3 bits each. So the entire parse tree can be encoded at a cost of about 5.6 bits. This is quite reasonable, given that a naïve encoding for a single letter requires log2(26) ≈ 4.7 bits. Finally, the compressor transmits the information necessary to transform each terminal category (verb, noun, etc.) into an actual word. Knowing the category allows the compressor to save bits, since P(mouse | noun) is much greater than P(mouse).

Notice that using parsing techniques not only saves bits, it also sets the stage for more advanced analysis techniques that can save even more bits. Consider the sentence "John kissed Mary". Some bits can be saved by encoding this sentence using its parse tree. But more interestingly, finding the parse tree also allows a higher-level system to apply a more sophisticated analysis. Such a system might notice that when the verb is kissed, the subject is almost always a human, and the object is usually also human.
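Before moving on, the bit-cost arithmetic for the "black mouse" sentence can be checked mechanically; a sketch that sums −log2 of the probabilities of the rules used in its parse tree:

```python
import math

# Rule probabilities from the toy grammar above
RULE_P = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.5,
    ("NP", ("D", "A", "N")): 0.2,
}

def tree_bits(derivations):
    """Bits to encode a parse tree: the sum of -log2 P(rule) over the
    derivations it uses."""
    return sum(-math.log2(RULE_P[rule]) for rule in derivations)

# "The black mouse ate the green cheese": S -> NP VP (0 bits),
# VP -> V NP (1 bit), and two NP -> D A N (about 2.32 bits each)
bits = tree_bits([
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")),
    ("NP", ("D", "A", "N")),
    ("NP", ("D", "A", "N")),
])
assert 5.6 < bits < 5.7   # about 5.6 bits, as computed above
```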
This information can then be used to save additional bits when encoding the names John and Mary.

4.3 Statistical Machine Translation

Machine translation is one of the oldest subfields of artificial intelligence, dating back to a paper written by Warren Weaver in 1949 [121]. Initial expectations were high: various researchers predicted that success would be achieved within a couple of years. That prediction, of course, turned out to be wildly overoptimistic: machine translation is still not a solved problem, though some systems achieve acceptable performance in some cases. Originally, a major impetus for the field came from the American defense and intelligence establishment, which desired the ability to translate Russian scientific documents. Even today, US government agencies provide a major share of the funding for translation research, and also help to organize competitions and evaluations.

A major issue in machine translation is the problem of evaluation. For a long time, the only reliable method for evaluating a candidate solution to the translation problem was to assemble a team of human judges, and have the judges assign scores to the computer translations. More recently, a variety of automatic schemes for evaluating machine translations have appeared [86, 108, 5]. These schemes seem to have spurred a new burst of research in machine translation, though their use is still controversial.

A major paper that began the transition away from rules-based and towards statistical language processing is "The Mathematics of Statistical Machine Translation: Parameter Estimation" by Brown et al. [15]. The authors approach the problem by using what they call the Fundamental Equation of Machine Translation:

E* = arg max_E P(F|E) P(E)    (4.3)

where E* is the produced English translation of a French (or Foreign) sentence F. This approach decomposes the problem into two subproblems: modeling P(F|E) and P(E).
The authors focus most of their attention on the former, noting that other research exists that deals with the latter. This problem of conditional modeling is still extremely hard. One way to get a sense of the difficulty is to notice the immense size of the models. If the average sentence has W = 10 words, and there are T = 5·10^4 words in the language, then the size of P(E) is on the order of T^W ≈ 9·10^46, while the size of P(F|E) is T^2W ≈ 9·10^93.

The enormous sizes of the spaces involved cause one of the key problems of complex statistical modeling: sparse data. In principle, if one had enough data, one could estimate the probability P(F|E) by simply counting the number of times a French sentence F is used as a translation of an English sentence E. Unfortunately, that strategy is completely impractical due to the huge outcome spaces. Indeed, it is quite difficult even to estimate the probability P(f|e) of a French word f serving as the translation of an English word e; a naïve method for estimating this model would require T^2 ≈ 2.5·10^9 statistics.

In order to cut down on the size of the spaces, and thus the amount of data needed to estimate the model probabilities, a variety of simplifying assumptions are necessary. The simplification strategy reflects one of the fundamental insights of statistical modeling. To understand the idea, imagine one was trying to estimate a model of a distribution of heights. Say the measurements range from 130 cm to 230 cm, and are accurate to a precision of 0.1 mm. Then a naïve method would be to simply count the number of observations in every 0.1 mm-wide bin, and use the observed frequency as the probability estimate. For example, the probability of a height being between 150.05 cm and 150.06 cm would just be the number of observations in that range, divided by the total number of observations. This strategy requires 10,000 statistics: the counts for each bin.
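The contrast between the two estimation strategies can be illustrated with a quick sketch. The sample below is synthetic, drawn from an invented normal population; the point is simply that with few observations almost every fine-grained bin is empty, while the two parametric statistics are well determined:

```python
import random, statistics

random.seed(0)
# Hypothetical sample: 50 heights (cm) from an invented normal population.
sample = [random.gauss(170, 10) for _ in range(50)]

# Naive histogram estimate: one count per 0.1 mm bin over 130-230 cm.
n_bins = 10_000
counts = [0] * n_bins
for h in sample:
    counts[min(n_bins - 1, max(0, int((h - 130) / 0.01)))] += 1
empty = counts.count(0)

# Parametric estimate: just two statistics.
mu, sigma = statistics.mean(sample), statistics.stdev(sample)

print(f"{empty} of {n_bins} bins empty")  # nearly all bins are empty
print(f"mu={mu:.1f} cm, sigma={sigma:.1f} cm")
```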
It will work in the limit of large data, but if there are a small number of observations, the estimates obtained in this way will be extremely unreliable. To solve this problem one computes a different set of statistics: the mean and variance of the data set. This simplifying assumption requires far fewer parameters, and will produce a much more accurate model with a limited amount of data.

Brown et al. follow an analogous strategy to reduce the number of parameters required by their models. They introduce a number of simplifying concepts, one of which is the idea of an alignment. An alignment A is a matching between the French words and the English words of the two sentences. If the alignment can be found, the probability of an English word in a given position can be modeled as:

P(e_i | A, F) = P(e_i | f_{a_i})

where e_i is the i-th word in the English sentence, and f_{a_i} is the French word aligned with it. However, finding an alignment introduces its own challenges, because it is a hidden variable: it cannot be extracted directly from the corpus, but must be inferred from the sentence pair. In order to find the most likely alignment, various pieces of knowledge about the translation parameters must be known. Given a sentence pair (E, F) and a good model P(E, A | F), the most likely alignment can be found by:

A* = arg max_{A ∈ 𝒜} P(E, A | F)

where 𝒜 is the space of all possible alignments. However, to use this optimization, it is necessary to have a good model P(E, A | F). If the corpus contained the alignment data, it would be possible to estimate the parameters for such a model from the data. But since the alignment information is hidden, there is a chicken-and-egg problem: one requires alignment data to estimate a good model P(E, A | F), but alignment data can only be obtained using the model.
There is a standard strategy that can be used to solve this kind of chicken-and-egg problem, called the Expectation Maximization (EM) algorithm [29]. This algorithm works by first postulating a naïve initial model. This model is used to infer a set of alignments. These alignments are then used to estimate new model parameters. This process is repeated several times.

Brown et al. use a special modified version of the Expectation Maximization algorithm. Instead of using the same model for each step, they actually swap in a new model after a couple of iterations. The initial models contain a number of simplifying assumptions, which are relaxed in the later models. One example of a simplifying assumption is that each French word corresponds to exactly one English word. Another such assumption is that the probability of a French word in a certain position depends only on the English word that it is connected to in the alignment. Obviously these assumptions are not technically correct, but the simplified models can still serve as starting points in the EM algorithm. The purpose of using the simpler models first is that they provide initial parameter estimates for the more complicated models. For example, a good initial estimate of the word-to-word translation probability P(f|e) can be obtained using the simple initial models. This technique is useful because the more complex models are computationally expensive. Also, the initial estimates help to ensure that the Expectation Maximization algorithm finds a good maximum, as opposed to a poor local maximum.

The idea of alignment, mentioned above, is related to another classic problem in large scale statistical modeling. Consider the equation used to calculate the conditional probability P(F|E) using the alignment-based model:

P(F|E) = Σ_A P(F | A, E) P(A | E)

In other words, every possible alignment A contributes to the net probability.
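A minimal sketch of this EM loop, in the spirit of the simplest of the word-to-word models, may make the idea concrete. The corpus, uniform initialization, and iteration count below are invented for illustration; this is not the actual Brown et al. system:

```python
from collections import defaultdict

# Tiny illustrative parallel corpus (invented, not real data).
corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"],  ["le", "livre"]),
    (["a", "book"],    ["un", "livre"]),
]

# Initialize t(f|e) uniformly.
t = defaultdict(lambda: 0.25)

for _ in range(10):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for E, F in corpus:
        for f in F:
            # E-step: expected alignment counts under the current model.
            z = sum(t[(f, e)] for e in E)
            for e in E:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    # M-step: re-estimate translation probabilities from expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("livre", "book")], 2))  # "book" aligns strongly with "livre"
```

Even on this toy corpus, the loop discovers that "book" translates as "livre" without ever being told an alignment.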
However, computing the probability in this way is totally infeasible from a computational perspective, because there are a vast number of possible alignments. To grapple with this issue, it is necessary to approximate the sum by finding the alignments with the highest values of P(A|E) and summing over those alone.

A much simpler paper, but also an influential one, is "Minimum Error Rate Training in Statistical Machine Translation" by Och [84]. This paper describes a way of improving the output of another system. That is to say, it relies on an unspecified black box translation algorithm to produce a set of candidate translations. The method then attempts to find the best of the set of candidate translations. The point is that the initial translation algorithm doesn't have to be very good, because the reranking process will find the best translation from a list of candidates. The method employs a MaxEnt model, but trains the model in a new way. The standard way of choosing the optimal value for a parameter λ_i in MaxEnt is given by the equation:

λ*_i = arg max_{λ_i} Σ_s log P_{λ_i}(E_s | F_s)

where (E_s, F_s) is the s-th translation pair in the corpus and P_{λ_i} is the model given the parameter value λ_i. This scheme will produce a model that assigns high probability to the translation pairs in the corpus. But it might not produce the translations with the lowest error rate. To produce low-error translations, Och proposes to select parameters using the optimization:

λ*_i = arg min_{λ_i} Σ_s G(E_s, Ê(F_s; λ_i))

where G(·,·) is the particular error function being used, and E_s is the reference translation for the s-th sentence pair. The translation guess Ê is given by:

Ê(F_s; λ_i) = arg max_{E ∈ C_s} Σ_m λ_m h_m(E | F_s)

where h_m are the context functions, and C_s is a set of candidate translations of F_s produced by the initial system.
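The parameter search can be illustrated schematically. In the sketch below, the candidate lists, feature scores (h1, h2), and per-candidate error values are all invented, and a single weight is tuned by brute-force grid search, standing in for Och's more refined search procedure:

```python
# Schematic sketch of minimum-error-rate parameter selection.
# Candidate translations, feature scores (h1, h2), and per-candidate
# error values are all invented for illustration.

corpus = [
    [("cand A", (2.0, 0.1), 0.4), ("cand B", (0.5, 1.5), 0.1)],
    [("cand C", (0.2, 1.0), 0.2), ("cand D", (1.0, 0.1), 0.5)],
]

def total_error(lam):
    """Corpus error of the argmax candidate under weights (1.0, lam)."""
    err = 0.0
    for candidates in corpus:
        best = max(candidates, key=lambda c: c[1][0] + lam * c[1][1])
        err += best[2]  # error G(reference, chosen candidate)
    return err

# Grid search over the weight of the second feature.
best_lam = min((l / 10 for l in range(31)), key=total_error)
print(best_lam, round(total_error(best_lam), 2))
```

Note that the chosen weight is the one whose argmax candidates have the lowest total error, not the one assigning the highest probability to the references.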
This alternate scheme of finding the λ_i parameters will tend to minimize the error rate, instead of maximizing the probability. One issue that comes up when using the new optimization scheme is that, because of the arg max operation, the score is not continuous: a small change in λ_i can cause a large change in the score. This issue can be solved by using a "softened" version of the arg max operation which constructs a weighted combination of the sentences, with weights that depend on the model probabilities P(E_s | F_s).

4.3.1 Evaluation of Machine Translation Systems

In the most straightforward procedure for evaluating machine translation results, a bilingual human judge assigns a score to an algorithm-generated translation by comparing it to the original source sentence. Unfortunately, this process is far too time-consuming to be carried out very frequently. It also scales badly with the number of candidate systems: doubling the number of systems nearly doubles the amount of human time required. For this reason, the machine translation (MT) community has come to rely on a number of automatic scoring functions. A scoring function assigns a score to a machine-generated translation by comparing it to a set of reference translations contained in a bilingual corpus. Once a bilingual corpus has been constructed, the scoring functions can be used as often as desired. This provides crucial rapid feedback to MT developers.

The designers of translation scoring functions must face several conceptual challenges. One basic challenge is that there exist a large number of valid translations of any given source sentence. The scoring function only has access to the small number of valid translations that are contained in the bilingual corpus. It is very possible for a translation to be good but bear little resemblance to any of the human translations. Consider for example the following sentences:

H: The fish tasted delicious.
T1: The salmon was excellent.
T2: The fish tasted rotten.

In this case, the first machine translation T1 should receive a high score, even though it shares only one word with the human translation H. Conversely, the T2 translation has three words out of four in common with the human translation, but it should still receive a low score.

One of the most widely used translation scoring functions is called BLEU, which is an acronym for Bilingual Evaluation Understudy [86]. BLEU uses a scoring metric called modified n-gram precision. The idea is to count how many times a given word sequence from the machine translation occurs in the human translations. So if the human translation is "John flew to America on a plane", and the machine translation is "John went to America on plane", then the translation will get two unigram hits (John, plane) and one trigram hit (to America on). The designers of BLEU chose to take into account only precision: the number of sequences in the machine translation that also appear in the human translations. It does not explicitly penalize a machine translation for failing to include words from the human translations (i.e. it does not take recall into account). This is somewhat awkward, since the sentence "John did not commit the crime" means something very different if the word "not" is omitted. To partially repair this deficiency, BLEU penalizes a machine translation if it is substantially shorter than the human version.

The widespread use of BLEU scores has strongly influenced the field, but has also ignited some controversy. Various authors have pointed out shortcomings in BLEU [114, 17]. One problem with BLEU is that an n-gram is scored if it appears in the human translation, even if it appears in a very strange position. Callison-Burch et al. give the following example:

H: Orejuela appeared calm as he was being led to the American plane that was to carry him to Miami, Florida.
T1: Appeared calm | when | he was | taken | to the American plane | , | which will | to Miami, Florida | .
T2: which will | he was | , | when | taken | Appeared calm | to the American plane | | to Miami, Florida | .

The first sentence is the human translation, and the second two are machine translations, with the n-gram boundaries shown. Though the first machine translation T1 is not perfect, it is clearly better than T2. But the BLEU score assigns both of these sentences the same value, because it does not take into account the order in which the n-grams appear [17]. Callison-Burch et al. also noted that BLEU sometimes fails to agree with human judgments, and cite one case where the system that produced the best human-scored translations came in sixth out of seven in terms of BLEU score.

Several other translation scoring metrics have appeared in the last couple of years. Snover et al. developed a metric called TER, which is based on a concept called Translation Edit Rate: the number of changes an editor would need to make to convert the machine translation into one of the human sentences [108]. This system was designed to be used in conjunction with a human editor, who would find the correct translation that is closest to the machine translation in terms of TER. Impressively, the authors show that the human-targeted TER score correlates better with human assessment of translation quality than the assessment of another human. Another recent automatic scoring function is called METEOR, developed by Banerjee and Lavie [5]. The METEOR metric uses a more powerful word-matching technique than BLEU, which relies on explicit letter-by-letter equivalence. METEOR matches words in a translation pair if they have the same root, or if they appear to be synonyms (a database called WordNet is used for checking synonymy). METEOR also employs a more sophisticated word grouping method than the n-gram scheme used in BLEU.
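Returning to BLEU, the modified n-gram precision statistic at its core can be sketched as follows. This is a simplified version (one reference translation, no brevity penalty), using standard clipped counting, which scores each n-gram order separately rather than the maximal-match counting used in the informal illustration above:

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the statistic underlying BLEU
    (simplified: one reference, no brevity penalty)."""
    def ngrams(words):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    hits = sum(min(c, ref[g]) for g, c in cand.items())
    return hits / max(1, sum(cand.values()))

ref = "John flew to America on a plane".split()
cand = "John went to America on plane".split()
print(modified_ngram_precision(cand, ref, 1))  # 5 of 6 unigrams match
print(modified_ngram_precision(cand, ref, 3))  # only "to America on" matches
```

Because nothing in this computation looks at where a matching n-gram occurs, the n-gram rearrangement problem described above follows directly from the definition.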
In principle, this allows it to repair the problems related to n-gram rearrangement noted above.

4.3.2 Critical Analysis

One of the most obvious shortcomings of research in machine translation is that most methods do not use any actual linguistic knowledge. This fact is actually considered to be beneficial in some cases: Brown et al. note in their abstract that because their method uses "minimal linguistic content", it is easy to port it over to a new language pair [15]. This conceptual shortcoming is related to one of the central themes of this book: the failure on the part of machine intelligence researchers to formulate their work as empirical science. This failure prevents them from exploiting structure that may be present in a problem, and from building effectively on previous work.

This conceptual limitation has important implications for machine translation. As mentioned above, one of the key issues in machine translation is alignment: connecting words in the source sentence to words in the target sentence. Brown et al. show how their alignment scheme works for the following pair of sentences [15]:

What is the anticipated cost of administering and collecting fees under the new proposal?

En vertu de les nouvelles propositions, quel est le cout prevu de administration et de perception de les droits?

Impressively, the algorithm is able to correctly align the English word "proposal" with the French word "propositions", even though those words are located in very different positions in the two sentences. This alignment technique is an essential component of the full translation system. But the only reason the alignment problem is considered at all is that the algorithms do not depend on any special linguistic knowledge or on any previously developed systems. Researchers in computational linguistics have been working on parsers and part-of-speech taggers for a long time, but Brown et al. chose not to reuse any of that work.
It seems obvious that knowledge of part of speech would make the alignment problem dramatically easier, if for no other reason than that it would cut down the number of alignments that need to be considered. For example, in a 20-word sentence, there are in principle 20! ≈ 2.4·10^18 possible alignments. But if each sentence is known to consist of 5 nouns, 3 verbs, 5 adjectives, 4 prepositions, and 3 particles, then only 5!·3!·5!·4!·3! ≈ 1.2·10^7 alignments need to be checked. Furthermore, if it were possible to assign semantic roles to words such as "subject", "action", or "topic", it would be possible to cut down on the number of alignments even more.

The more abstract argument here is that it seems unreasonable to attempt to do machine translation without extensive background knowledge of both languages in the pair. To see this point, imagine giving a person with no knowledge of English or French a huge corpus of parallel text, and asking her to learn, on the basis of that corpus, how to translate between the two languages. This seems absurd. The reasonable approach is to first train the person in English and French, and only then to ask her to learn to translate.

It is not hard to criticize the various scores such as BLEU and METEOR used for evaluation. These methods seem somewhat arbitrary and ad hoc, and there is extensive controversy in the community regarding their use [114, 17]. To justify the automatic metrics, their developers needed to conduct meta-evaluations in which the automatic scores are correlated against human scores. But the correlation evidence is hardly overwhelming proof of the quality of an evaluator. In some cases the correlation was achieved on the basis of a small number of sentences; the METEOR metric was evaluated on the basis of 920 Chinese sentences and 664 Arabic sentences [5].
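The alignment-counting arithmetic can be verified directly:

```python
import math

# Number of permutation-style alignments of a 20-word sentence.
unconstrained = math.factorial(20)

# If part-of-speech tags are known (5 N, 3 V, 5 Adj, 4 Prep, 3 Part),
# words can only align within their own category.
constrained = (math.factorial(5) * math.factorial(3) * math.factorial(5)
               * math.factorial(4) * math.factorial(3))

print(f"{unconstrained:.1e}")  # ~2.4e18
print(f"{constrained:.1e}")    # ~1.2e7
```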
It could be that the scores will fail to correspond with human judgment when tested on larger databases, or on text that is written in a different style. Furthermore, the mere fact that an automatic scoring metric correlates with human judgment is not necessarily proof of its quality, since the correlation of human scores with one another is actually quite low [114].

Even if an initial evaluation shows that an automatic scoring metric correlates well with human scores, that correlation may evaporate over time. This point was mentioned in Chapter 3, in connection with the idea of Goodhart's Law, which says that if a statistical regularity is used for control purposes, it will disappear. Perhaps when they originally appeared, scores like BLEU and METEOR correlated well with human judgment. But as researchers begin to use a metric to design their systems, this pressure may destroy the correlation. This problem is particularly relevant because many researchers now use techniques such as Minimum Error Rate Training to optimize their systems with respect to a particular scoring metric [84].

In fairness to the developers of BLEU and other scoring metrics, these scores were not intended to fully replace human judgment for the machine translation task. The community periodically holds contests or workshops in which machine translation results are evaluated by a panel of human judges. This method cannot be criticized except for the fact that it is expensive and labor-intensive. In particular, the human evaluation strategy scales badly with the number of candidate systems. If there are N sentences and T systems, then the time requirement is proportional to N · T. This means the number of sentences used must be kept low, which may make the evaluations sensitive to random fluctuations: perhaps some systems work better on certain types of sentences, and by chance that type of sentence is over-represented in the small evaluation set.
4.3.3 Comperical Formulation of Machine Translation

The comperical philosophy suggests a clean and direct formulation of the machine translation problem: apply the Compression Rate Method to a large bilingual corpus. A specialized compressor can exploit the relationship between the sentences in a translation pair to save bits. In statistical terms, researchers will need to obtain a good model of P(E|F) - the probability of an English sentence given its French counterpart - in order to achieve good compression. Of course, obtaining such models is not at all an easy problem, and involves a wide range of research questions.

Once strong models of the form P(E|F) or P(F|E) are obtained, there are two ways to use them to generate actual translations. The first method is simply to sample from the P(E|F) distribution. The idea here is that if the model P(E|F) is good, then samples from it will be veridical or realistic simulations of what a human translator would produce. The second method of generating translations is to apply the Fundamental Equation of Machine Translation:

E* = arg max_{E ∈ ℰ} P(F|E) P(E)

This method is more difficult than sampling, because the optimization over the huge space ℰ may be hard to carry out. The advantage of this method is that the use of the P(E) model term can mitigate deficiencies in the P(F|E) model [15].

The presence of the standalone model P(E) in the above equation can be seen as simple evidence in favor of the Reusability Hypothesis. Comperical researchers studying a large database of raw English text will produce good models P(E). Improvements in the standalone models will then produce immediate improvements in machine translation results. While the importance of using a good model P(E) has been noticed in the MT literature, in practice this step seems to be overlooked.
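The role of the P(E) term can be made concrete with a toy instance of the Fundamental Equation. Every sentence and probability below is invented for illustration:

```python
# Toy instance of E* = arg max P(F|E) P(E) over a tiny candidate set.
# All sentences and probabilities are invented for illustration.

F = "le chat noir"
candidates = {
    # E: (P(F|E), P(E))
    "the cat black": (0.30, 0.001),  # faithful but disfluent
    "the black cat": (0.25, 0.020),  # faithful and fluent
    "a dark feline": (0.05, 0.010),
}

E_star = max(candidates, key=lambda E: candidates[E][0] * candidates[E][1])
print(E_star)  # the language model P(E) rescues fluency
```

Even though "the cat black" scores higher under the channel model P(F|E) alone, the language model term steers the argmax toward the fluent candidate.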
Here is an example translation produced by Google Translate:

If locally heavy snow fell in a short time, even life and logistics support under the direct control highway was closed to traffic early on that country. To prevent congestion and stuck it in a hurry to snow removal vehicles. MLIT has blocked the national highway has been avoiding the highway, unlike direct control, we changed course.

Evidently Google Translate does not use a very good model of P(E). Other automatic translation services routinely produce comparably unintelligible outputs. Note that if P(E) were highly sophisticated but P(F|E) were substandard, the output would read like fluent English, and a bilingual examiner would have to compare the meaning of the English output with the foreign original in order to tell whether the translation was good or not.

In addition to the simple reason given above, there is actually another reason why it will be useful to learn standalone models for machine translation. This is the fact that background knowledge of English and French will be highly valuable when attempting to do translation. The idea here is similar to the point of the Japanese Quiz thought experiment of Chapter 2. That example showed that it was impossible to construct good models for the category of a sentence without extensive background knowledge of Japanese. Analogously, in the context of machine translation, it may be impossible to construct good models of an English translation sentence without extensive background knowledge of French. This idea is highly intuitive. The current procedure of machine translation is a bit like taking a large bilingual corpus of English and French, giving it to a person with no knowledge of those languages, and asking her to learn to translate based solely on the corpus. A much more effective strategy would be for the person to first learn both English and French, and then to polish her translation skills by studying the bilingual corpus.
These arguments suggest an even better version of the problem formulation than the one mentioned above. This is to package the bilingual corpus together with a raw English corpus and a raw French corpus, and use the aggregate dataset as the target of a CRM investigation. This will require both a model P(E|F) and standalone models P(F) and P(E), which can be reused to produce good translations with the optimization mentioned above. Computational tools and statistical models can be justified on the basis of the large raw corpora, and then immediately redeployed to attack the conditional modeling problem. In concrete terms, the claim here is that knowledge of issues such as parsing, POS tagging, semantic role analysis, etc., will be useful not only for raw text compression but for conditional compression (and thus translation) as well.

As a metric for evaluating machine translation systems, the compression rate metric compares very favorably to the other metrics such as BLEU and METEOR proposed in the literature. As noted above, these metrics have serious shortcomings and have generated substantial controversy. BLEU can be gamed, in the sense that very poor translations can be found that will nonetheless receive high BLEU scores. In contrast the compression metric is, in some sense, above criticism. Certainly it cannot be gamed. It is ultimately based on the same principles of prediction and falsification that drive traditional empirical science. In order to criticize the use of the conditional compression rate metric, a critic would need to show that it is possible for a system to achieve a strong compression score, but still produce poor translations. Since a system that achieves a good compression rate must have good models of P(E) and P(F|E), this would effectively imply that the Fundamental Equation of Machine Translation (Eq. 4.3) does not work.
Finally, the compression metric is more interesting: comperical researchers can justify their work on the basis of pure intellectual curiosity, but no one would ever try to optimize BLEU scores if doing so did not supposedly produce good machine translations.

4.4 Statistical Language Modeling

Statistical language modeling is a subfield of computational linguistics where the goal is to build statistical models of the probability of a sentence, P(s) (denoted P(E) in the discussion of machine translation). This is almost always done by breaking the problem up into a series of word probabilities:

P(s) = Π_i P(w_i | w_{i−1} ... w_0)

where w_i is the i-th word in the sentence. These models are trained by analyzing the statistics of a large text corpus. Language modelers typically attempt to minimize the cross-entropy:

−Σ_s P*(s) log P(s) ≈ −(1/N) Σ_k log P(s_k)

where P*(s) is the real probability of a sentence. The expression on the left is the actual cross-entropy, and the sum is over all possible sentences. This quantity can never actually be computed, not only because performing the sum is computationally infeasible, but also because the real distribution P*(s) is unknown. The right hand side is an empirical estimate of the cross-entropy, involving a sum over the N sentences s_k in the corpus. This empirical cross-entropy is exactly equivalent to the compression rate except that it does not involve a model complexity term. As discussed below, this non-incorporation of a model complexity penalty has important ramifications; many language models use a large number of parameters and are therefore probably overfitting the data. Overall, though, language modeling is the area of current research that is most similar to the comperical proposal.
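The empirical cross-entropy computation is simple enough to sketch directly. The unigram model and corpus below are invented; the per-word figure is exactly the compression rate in bits per word (ignoring the model complexity term):

```python
import math

# Hypothetical unigram model and a tiny test corpus, for illustration only.
model = {"the": 0.3, "cat": 0.2, "sat": 0.2, "mat": 0.2, "on": 0.1}

corpus = [["the", "cat", "sat"], ["the", "cat", "on", "the", "mat"]]

def sentence_log2_prob(sentence):
    return sum(math.log2(model[w]) for w in sentence)

# Empirical cross-entropy: average -log2 P(s), reported here per word,
# which is the compression rate in bits per word.
n_words = sum(len(s) for s in corpus)
bits = -sum(sentence_log2_prob(s) for s in corpus)
print(round(bits / n_words, 3), "bits per word")
```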
As noted in Section 4.3, a good language model is an important component of many approaches to machine translation, due to the role P(E) plays in the Fundamental Equation of Machine Translation. It turns out that an entirely analogous formulation works for speech recognition:

E* = arg max_{E ∈ ℰ} P(A|E) P(E)

where E* is the guess of the spoken English words, and A is the audio data measured by the system. The only difference between this equation and the one used for machine translation is that here the audio data A takes the place of the French sentence. In fact, the connection to speech recognition is the primary motivation for language modeling research. Most papers on the topic describe a new language modeling technique, show how it can reduce the cross-entropy, and in the final section demonstrate the reductions in word error rate that can be achieved by connecting the model to a speech recognizer.

By far the simplest, and the most widely used, approach to language modeling is the n-gram. The conditional probabilities in an n-gram model are obtained by simple counting; for example, in a trigram model, the probabilities are:

P(w_i | w_{i−1}, w_{i−2}) = Count(w_i, w_{i−1}, w_{i−2}) / Count(w_{i−1}, w_{i−2})    (4.4)

where Count(·) denotes the count of a particular event in the training data. The most common values of n are 2 (bigram) and 3 (trigram). Language modelers are often frustrated by the effectiveness of n-grams, as they find it difficult to do better by using more sophisticated techniques.

An early paper on statistical language modeling, and an important influence on this book, is "A Maximum Entropy Approach to Adaptive Statistical Language Modeling" by Ronald Rosenfeld [96]*. Rosenfeld begins by identifying a number of potentially useful information sources (or predictors) in the language stream. The most basic predictors are based on n-gram models. A more interesting type of predictor is a trigger pair.
A pair of words C and D form a trigger pair C → D when observing C makes it more likely that D will be observed. For example, if the word history contains the word "stock", it is much more likely than normal that the subsequent text will include the word "bond". Trigger pairs can be automatically extracted from the corpus. Another interesting technique is called long-distance n-grams. These are like n-grams except that they match against words farther back in the history. Rosenfeld also experiments with a method for using word stems to augment the trigger pair technique. The idea here is that if the word "film" triggers the word "movie", then the word "filming" probably does as well.

Rosenfeld's basic strategy is to write down many predictors, and use algorithmic techniques to combine them together to achieve the best possible model. The issue is how exactly to do the combination. Most naïve strategies for information integration, such as using a linear combination of the predictions, lack theoretical justification and tend to underperform in practice. Rosenfeld adopts the Maximum Entropy approach to solve this problem. In this context, the MaxEnt model has the following form:

P(w | H) = (1/Z(H)) exp( Σ_i λ_i f_i(w, H) )

where the f_i(w, H) are a set of context functions that operate on the history H and the new word w, and Z(H) is a quantity that ensures normalization. As noted above, the power of the MaxEnt idea is that it allows the user great flexibility in defining whatever kinds of context functions he may consider useful. For example, one useful predictor might be a function that returns 1 if w = "then" and the word "if" appears in the history, and 0 otherwise. The training algorithm for MaxEnt, called Generalized Iterative Scaling, automatically finds the optimal parameter λ_i for each context function, maximizing the likelihood of the data.
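A toy version of this model may help. The two context functions below (a trigger-pair feature and a bigram-style feature) and the λ weights are invented; a real system would learn the weights by Generalized Iterative Scaling:

```python
import math

# Toy MaxEnt next-word model. The two context functions and the lambda
# weights are invented for illustration, not trained.

def f_trigger(w, history):
    """Trigger-pair feature: fires for "bond" if "stock" was seen."""
    return 1.0 if w == "bond" and "stock" in history else 0.0

def f_bigram(w, history):
    """Bigram-style feature: fires for "bank" right after "the"."""
    return 1.0 if w == "bank" and history and history[-1] == "the" else 0.0

features = [(f_trigger, 1.5), (f_bigram, 0.3)]
vocab = ["bond", "bank", "cat"]

def p(w, history):
    score = lambda x: math.exp(sum(lam * f(x, history) for f, lam in features))
    return score(w) / sum(score(v) for v in vocab)  # divide by Z(H)

h = ["stock", "prices", "fell", "and", "the"]
print(round(p("bond", h), 3))  # the trigger pair boosts "bond"
```

Note that adding a new feature with λ near zero leaves the distribution essentially unchanged, which is why trying out new predictors is cheap.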
An important feature is that there is no significant penalty for trying out new predictors: if a predictor has no value, the training algorithm will simply assign it a near-zero λ value so that it has no effect on the actual probability.

Rosenfeld's paper influenced the development of the ideas in this book because his approach seemed like a kind of empirical science. His procedure was to propose a new class of context functions, incorporate them into the MaxEnt model, and measure the resulting cross-entropy score. If the new functions produced a reduction in cross-entropy, it meant that they captured some real empirical regularity in the structure of the text. In other words, by using MaxEnt along with the cross-entropy metric, Rosenfeld was able to probe the structure of text by experimenting with different kinds of context functions.

Unfortunately, Rosenfeld did not carry this procedure as far as he could have. Two reasons may have limited his motivation to do so. First, new techniques tend to "cannibalize" one another: sometimes adding a new predictor is effective, but reduces the effectiveness of some other predictor, because they both exploit the same underlying regularity. Second, Rosenfeld's ultimate goal, like that of many other language modelers, was to improve the performance of a speech recognition application. At the end of the paper, he notes that improvements in the cross-entropy score tend to yield diminishing returns in terms of reductions in word error rates for speech recognition. These two effects, combined together, probably sapped his motivation to carry out a more systematic search for the best possible language model.

A more recent paper in this area is "A Bit of Progress in Language Modeling" by Joshua Goodman [40]. Most language modeling papers present new techniques for language modeling.
Goodman's paper is different in that it describes a systematic comparison and evaluation of an array of competing techniques. It also examines how different modeling tools can be used in conjunction with one another. Goodman starts off by showing that one of the most important components of an n-gram language model is a good smoothing method. Smoothing is important because of sparse data problems, which become increasingly severe as the value of n becomes large, making the counts in Equation 4.4 unreliable. To understand the sparse data issue, notice that if the number of words in the dictionary is T = 5 · 10^4, then there are T^3 ≈ 1.25 · 10^14 possible three-word sequences. This implies that most three-word sequences will never be observed in the corpus, so Count(w_i, w_{i-1}, w_{i-2}) = 0. This can be true even if a sequence has relatively high probability (say, p ≈ T^{-3}). Equation 4.4 would assign such a sequence zero probability, but clearly this behavior must be avoided. To deal with this situation, language modelers use smoothing techniques. One smoothing technique is to "back off" in cases where a sequence has never been observed, and use a lower-n model instead. Goodman describes several other methods, such as Katz smoothing and Kneser-Ney smoothing [59, 61].

Goodman also describes a variety of language modeling techniques that work because of specific properties of text. One such technique is called a skipping model. A skipping model is like an n-gram, except that it skips a word. So for example the model might be based on P(w_i | w_{i-2}, w_{i-3}), ignoring the influence of the immediately preceding word w_{i-1}. This method works because of phrases like "went to John's party".
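A minimal sketch of the skipping idea, simplified here to condition on w_{i-2} alone while skipping w_{i-1} (the function name and toy data are invented for illustration):

```python
from collections import Counter

def train_skip_bigram(words):
    """Skip-bigram counts: predict w_i from w_{i-2}, ignoring the word in between,
    so "to John's party" and "to Sarah's party" pool their evidence."""
    pairs = Counter((words[i - 2], words[i]) for i in range(2, len(words)))
    ctx = Counter(words[i - 2] for i in range(2, len(words)))
    return {p: c / ctx[p[0]] for p, c in pairs.items()}

words = "went to John's party then went to Sarah's party".split()
model = train_skip_bigram(words)
# model[("to", "party")] == 1.0: in this toy corpus "party" always follows
# "to X", whoever X happens to be.
```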
If the pattern "went to X party" has been observed several times in the corpus, then a skipping model will assign a high probability to the word "party" in the phrase "went to Sarah's party", even if that exact phrase has never been observed.

A second technique that works because it reflects the real structure of text is called clustering. The idea here is to model:

    P(w_i | w_{i-1}, w_{i-2}) = P(w_i | W_i) P(W_i | W_{i-1}, W_{i-2})

Here W_i, W_{i-1}, W_{i-2} are word categories instead of specific words. So the probabilistic effect of a word in the history is mediated through its category. This also means that w_i is conditionally independent of w_{i-1} given W_{i-1}. In theory, this technique can assign high probability to the phrase "Jim flew to Canada", even if only very approximately similar phrases such as "Bob drove to Spain" have been observed in the corpus. Furthermore, this technique can provide a decisive defense against sparse data problems, because there are far fewer word categories than words. If there are 10^2 categories, then a trigram-style model using categories will require only on the order of 10^6 parameters. The main challenge in using this technique is in obtaining the word categories; a variety of research deals with this problem [14]. Goodman's evaluation reports that the best-performing system is a model of this type.

The extended version of Goodman's paper ends on a very pessimistic note: one section is entitled "Abandon hope, ye who enter here". Goodman's frustration is due to the difficulty of finding techniques that offer better practical performance than n-gram models on the speech recognition task. Techniques that achieve lower cross-entropy usually require far more computational and memory resources. Furthermore, even if a technique achieves an improved cross-entropy, it often yields a relatively smaller reduction in word error rate.
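The class-based factorization above can be written out directly. In this sketch the categories and the probability tables are invented by hand purely for illustration; in practice both would be learned from the corpus:

```python
# Hypothetical hand-assigned word categories.
word2class = {"Jim": "NAME", "Bob": "NAME", "flew": "VERB", "drove": "VERB",
              "to": "PREP", "Canada": "PLACE", "Spain": "PLACE"}

def class_trigram_prob(w, h1, h2, emit, trans):
    """P(w_i | w_{i-1}, w_{i-2}) = P(w_i | W_i) * P(W_i | W_{i-1}, W_{i-2})."""
    W, W1, W2 = word2class[w], word2class[h1], word2class[h2]
    return emit[(w, W)] * trans[(W, W1, W2)]

emit = {("Canada", "PLACE"): 0.5, ("Spain", "PLACE"): 0.5}  # P(w | W), invented
trans = {("PLACE", "PREP", "VERB"): 0.9}                    # P(W_i | W_{i-1}, W_{i-2}), invented
# "Jim flew to Canada" is scored through the class sequence, even if only
# "Bob drove to Spain" was ever observed:
p = class_trigram_prob("Canada", "to", "flew", emit, trans)  # 0.5 * 0.9 = 0.45
```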
Goodman also complains that there is very little progress in the area, because people only compare their models to the trigram, not to other published research. This makes it difficult to determine which techniques actually represent the state of the art.

4.4.1 Comparison of Approaches to Language Modeling

Research in statistical language modeling is very similar to a comperical inquiry targeted at a large text corpus. The similarities between the two types of research offer ammunition to both critics and proponents of the comperical philosophy. Proponents can note that language modeling research has already demonstrated the basic validity of the Reusability Hypothesis: improvements in language models lead directly to reductions in the error rate of speech recognition systems. On the other hand, critics of the comperical philosophy can cite the fact that language modeling has not produced spectacular results or dramatic progress. Indeed, language modelers often express frustration, noting that error rate reductions are very difficult to achieve in practice. If language modeling and similar lines of research are really going to produce important breakthroughs, why haven't any of those breakthroughs already been achieved?

While traditional and comperical language modelers may appear to be working on the same kind of problem, they are guided by different philosophies. To begin the comparison of the two mindsets, consider the following comment by Rosenfeld:

    Ultimately, the quality of a language model must be measured by its effect on the specific application for which it was designed, namely by its effect on the error rate of that application [97].

Rosenfeld's comment illustrates what can be called the toolbox mindset of language modeling. The goal here is to create tools that produce performance improvements when connected to applications such as speech recognizers or machine translation systems.
This mindset has an important implied belief: there is no such thing as a uniquely optimal language model. Instead, there are a wide variety of tools that are useful in different situations. Perhaps tool A works best for speech recognition, while tool B is well suited to machine translation, and tool C can be used for information retrieval. The goal, therefore, is to create a toolbox that is as large and diverse as possible. A researcher creates a new tool starting more or less from scratch, and validates his result by comparing it to the standard n-gram model. If a tool is novel and interesting, and it performs well compared to the n-gram, then it must be a worthwhile contribution to the toolbox. Few researchers attempt to systematically compare the various tools, or to improve an already existing tool. No one wants to spend years of effort trying to refine someone else's tool to produce a minor performance improvement. In the light of this diagnosis, it is worth noting that one of the most-cited papers in the subfield is entitled "SRILM - an Extensible Language Modeling Toolkit" [109].

Empirical scientists completely reject all the views associated with the toolbox mentality. Physicists believe in the existence of a singular optimal theory of empirical reality. They view their work as a systematic search for this theory, and because they can make decisive theory-comparisons, the search can proceed rapidly. Physicists are not intrinsically interested in applications, and evaluate theories without regard for practical utility. The same basic theory of physics is used by all engineers, regardless of whether they are constructing airplanes, bridges, skyscrapers, or automobiles. A physicist who attempted to justify an alternative theory of mechanics by arguing that it would be more useful for bridge-building would be ridiculed and ignored, even by bridge-builders.
In physics it is extremely important and valuable to provide a modest improvement or refinement to the champion theory - for example, by showing how quantum mechanics can be used to explain superconductivity.

Comperical researchers adopt the mindset of physics in their approach to language modeling. They reject the toolbox mentality, and believe in the existence of a single optimal theory for describing a given text corpus. They view their work as a systematic search for that optimal theory, and carry out the search by conducting decisive theory-comparisons. They recognize the value of minor extensions or refinements of the champion theory, if those refinements produce codelength reductions. They adopt the Circularity Commitment, and believe the Reusability Hypothesis promises that their highly refined theories will have practical applications.

A good example of how this difference in mindset plays out in practice relates to the empirical regularities mentioned by Goodman. These regularities allow models based on techniques such as caching, skipping, and word categorization to achieve reductions in the cross-entropy. Traditional language modelers notice and sometimes exploit these regularities, but do not attempt to systematically document and characterize them. This is at least partially because, if the regularity is uncommon or difficult to describe, the amount of work required to exploit it is not justified by the minor potential improvement in speech recognition. In contrast, comperical language modelers view these regularities as the focus of their research. How can the regularities be categorized, organized, and modeled? Which regularities represent independent phenomena, and which are manifestations of the same underlying effect? Are the regularities present in English basically the same as those present in Japanese, or do they differ in some fundamental way?
To the comperical language researcher, all of these questions are deeply interesting.

In addition to the change of mindset described above, there is actually an important technical difference between the two formulations. The cross-entropy is actually not equivalent to the compression rate, because it does not include a model complexity term. This fact goes far in explaining the frustrating apparent superiority of the n-gram methods. These methods, though conceptually simple, are in fact inductively (or parametrically) complex. If a corpus has a vocabulary of T words, then the bigram model requires T^2 parameters to specify while the trigram model requires T^3. If T = 10^5, then the bigram model uses ten billion (10^10) parameters and the trigram uses a million billion (10^15) parameters. This indicates that while the n-gram methods may provide good performance under the cross-entropy score, they will not do well according to the complexity-penalized compression rate score. The parametric complexity of n-gram methods also contributes to the chronic sparse data problems faced by language modelers. Because of the large number of parameters used by n-gram models, the estimates of the parameter values will be unreliable unless there is a huge quantity of data. This leads language modelers to focus a substantial amount of attention on smoothing techniques [20]. Smoothing may be interesting from a mathematical perspective, but if the goal is to describe the structure of text, it is mostly a distraction. In contrast, comperical researchers will need to find models with fewer parameters in order to achieve compression. The use of more parsimonious models will naturally prevent sparse data problems, and also reveal interesting facets of the structure of language.

4.5 Additional Remarks

The discussion above treated three important areas of computational linguistics, and showed how those areas can be reformulated as large scale text compression problems.
In this view, there is no meaningful distinction between the three areas. Parsing is an integral part of language modeling, and both are necessary for machine translation.

Similar reformulations can also be established for other areas of computational linguistics. For example, document classification can be justified by adding document-level components to the language model. Consider a newspaper article with the headline "Red Sox Win Game Three of Playoffs". A smart compressor should be able to guess from the headline that the article is about baseball. When it comes to the text of the article, the compressor should assign higher probability to words like "games", "Yankees", and "home run". Thus, document classification research can be justified by the compression principle. Some reflection will reveal that several other standard tasks of computational linguistics, such as word sense disambiguation [81] and semantic role analysis [38], can be reformulated as specialized text compression techniques.

This chapter has briefly criticized some aspects of the philosophical foundations of computational linguistics. In general, many of the same conceptual failures described in relation to computer vision also exist in computational linguistics. Research is not formulated as empirical science; the computer science mindset has an unhealthy influence. Researchers attempt to build systems that replicate the perceptual processes of humans, but those perceptual processes and abstractions are too loosely defined. The problem of evaluation is not well thought out; evaluation schemes are risky and time-consuming to develop, and may actually not work very well. There is no parsimonious justification for research topics, allowing researchers too much liberty to define their own problems. A good example of the problem of esoteric justification is the newly-popular task of word sense disambiguation (WSD).
One aspect of English and other languages which is vexing for language processing systems is that sometimes the same word can have different meanings in different contexts. An example is the word "bank", which may signify a financial institution or the edge of a river. WSD is the problem of determining which meaning is actually being used in a given sentence. In a recent survey of the WSD subfield, Roberto Navigli mentioned a variety of problems with the task [81]. First, authors do not agree on the fundamental definition of WSD, and this leads to different and incompatible formalizations. Second, the WSD task relies heavily on background knowledge, but constructing knowledge databases is expensive and time-consuming. According to Navigli, the difficulty of the task is evidenced by the fact that it has not been applied to any real-world tasks. Given these conceptual difficulties, it is hard to understand why anyone would bother studying the problem. The reason is probably sociological: WSD is new. Research on topics like parsing and machine translation has started to yield diminishing returns (because of conceptual flaws in the formulation of those tasks), so the field's attention has wandered onto a new topic.

4.5.1 Chomskyan Formulation of Linguistics

In his famous early book "Syntactic Structures", Chomsky formalized the problem of linguistics in the following way:

    The fundamental aim in the linguistic analysis of a language L is to separate the grammatical sequences which are the sentences of L from the ungrammatical sequences which are not sentences of L and to study the structure of the grammatical sequences. The grammar of L will thus be a device that generates all of the grammatical sequences of L and none of the ungrammatical ones. One way to test the adequacy of a grammar proposed for L is to determine whether or not the sequences that it generates are actually grammatical, i.e., acceptable to a native speaker, etc.
    [22]

This passage describes the generative approach to the study of grammar, because it aims at constructing a "device that generates" grammatical sentences. This sentence-generating device must be very sophisticated, because the grammatical rules of natural language are complex. The construction of such a device will therefore require a wide range of linguistic investigations. Chomsky's definition is elegant because it simultaneously satisfies two important criteria: it is relatively concrete, and it provides a parsimonious justification for many research topics in linguistics.

Comperical language researchers aim at a goal that is only slightly different from the one given by Chomsky. To achieve good compression, researchers must construct a model P(s) of the probability of a sentence, which assigns low probability to rare sequences and high probability to common sequences. If a text corpus is constructed by compiling a number of legitimate publications such as books and newspaper articles, then ungrammatical sequences should be quite rare. So by identifying the rules of grammar, it should be possible to save bits by reserving short codes for the grammatical sentences.

Furthermore, Chomsky's method of testing a proposed grammar, by generating sentences and showing them to a native speaker, bears a striking resemblance to the veridical simulation principle of Chapter 1. A comperical researcher can generate sentences from the model P(s) by feeding random bit strings into the decoder. The simulation principle asserts that, if the compressor is very good, then the sentences generated in this way will be legitimate English sentences. If the sentences fail to observe some grammatical rule, then this indicates a deficiency in the model; if the researcher can correct this deficiency, the new model should achieve a better codelength.
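This generation procedure can be sketched for a trigram model. The sketch below is a toy illustration: sampling words with a pseudo-random generator stands in for feeding random bits into a real decoder, and the model used in the example is invented:

```python
import random

def generate(model, context, n_words, rng):
    """Sample a continuation from a trigram model {(w2, w1, w): P(w | w2, w1)}.
    Per the simulation principle, a very good model should emit mostly
    grammatical English; ungrammatical output reveals a model deficiency."""
    out = list(context)
    for _ in range(n_words):
        cands = [(w, p) for (a, b, w), p in model.items() if (a, b) == tuple(out[-2:])]
        if not cands:
            break
        words, probs = zip(*cands)
        out.append(rng.choices(words, probs)[0])
    return " ".join(out)

# With an invented deterministic model, generation just replays the chain:
model = {("the", "cat", "sat"): 1.0, ("cat", "sat", "down"): 1.0}
print(generate(model, ("the", "cat"), 2, random.Random(0)))  # the cat sat down
```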
The decoder component of a very good compressor is therefore very nearly equivalent to a "device that generates" the grammatical sentences of a language.

Despite the close similarity between Chomsky's stated goal and the comperical formulation of linguistics, Chomsky rejects the applicability of statistical methods to syntactic research. To justify this position, Chomsky posed the famous sentence "Colorless green ideas sleep furiously". This sentence is exceedingly strange, and one would expect that it should never occur in any normal corpus of English text. However, the sentence is perfectly acceptable from a grammatical point of view. To Chomsky, this implies that statistical methods, which rely on estimating the frequency of events, cannot be used to discover a theory of syntax. Statistical methods, by equating "probable" with "grammatical" and vice versa, will fail to correctly assess the grammaticality of the green ideas sentence.

The flaw in Chomsky's argument can be seen by transplanting it into another domain. Consider the event evoked by the sentence: "A sword fell out of the sky". This event is exceptionally rare, to the extent that most humans have never observed it, except perhaps in movies. However, the event is perfectly allowable according to the laws of physics. It could certainly occur, if for example someone threw a sword off the roof of a tall building. Transplanting Chomsky's argument to deal with this event, one would conclude that statistical analysis of observed events cannot be used to infer the laws of physics. But this is obviously incorrect; many crucial physical insights were obtained through observation.

To analyze this point further, consider the following procedure. A comperical researcher constructs a model P(s) of the probability of a sentence, using some basic scheme such as n-grams. The researcher then constructs a new model P'(s) by revising P(s) in the following way.
In P'(s), all of the sentences that do not contain a verb are assigned very low probability. The probability freed up in this way is reassigned to the other sentences. Because almost all English sentences have verbs, the model P'(s) will achieve a substantially lower codelength than the model P(s) when used to compress a text database. There is no reason to believe this principle cannot be scaled up to include new and more complex rules of grammar. This argument shows that comperical language research includes syntactic investigations, though it also includes other kinds of inquiry as well. The full extent of the overlap between comperical language research and traditional linguistics cannot be known until such research is far advanced. One possibility that might cause a schism between the two areas is if there are many grammatical structures that humans will accept as legitimate, but will never actually use in writing (or speech). Comperical methods will fail to identify such structures as legitimate.

This concern is not very discouraging. It is not obvious why the definition of grammaticality as "acceptable to a native speaker" should be superior to the definition as "used in practice". The latter definition is probably far more relevant for practical purposes such as machine translation and information retrieval. Overall, the fact that comperical linguistics aligns strongly with traditional linguistics, while also bringing a host of conceptual and methodological advantages, provides more than adequate motivation for interest in it.

As a final point, it seems very probable that in a real compressor, making a distinction between syntactic and semantic modeling will be very useful from a software engineering point of view. Probably it will make sense to decompose the compressor into a series of modules. One module will save bits by reserving short codelengths for grammatical sentences.
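The verb-based revision described earlier amounts to a simple reweighting, which illustrates how such a grammar module would save its bits. A toy sketch, in which the verb list, sentences, and probabilities are all invented:

```python
def revise(P, verbs, eps=1e-9):
    """Build P'(s): shrink the probability of verbless sentences and renormalize,
    so the freed-up mass goes to sentences that do contain a verb."""
    has_verb = lambda s: any(w in verbs for w in s.split())
    raw = {s: (p if has_verb(s) else eps * p) for s, p in P.items()}
    z = sum(raw.values())
    return {s: p / z for s, p in raw.items()}

P = {"the cat sat": 0.5, "the cat mat": 0.5}   # invented base model
P2 = revise(P, verbs={"sat", "ran"})
# P2["the cat sat"] is now nearly 1, so its codelength -log2 P'(s) is nearly 0 bits.
```

Since real text almost always contains verbs, this reweighting buys a shorter total codelength, which is exactly the sense in which grammar rules "save bits".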
Then a secondary module saves more bits by exploiting observations such as the fact that "ideas" rarely have colors.

4.5.2 Prediction of Progress

Some critics may dismiss the philosophy of this book as intangible and unverifiable. The following concrete prediction should counter this criticism and help to clarify the beliefs implicit in the comperical philosophy.

Consider some small community that exists on the margins of mainstream linguistics research. This community dedicates all of its efforts to the problem of large scale compression of English text. The researchers develop an extensive set of tools and methodologies to attack this problem. They make repeated decisive comparisons between candidate theories. The champion theory grows increasingly refined and complex, but this complexity is justified by improved descriptive accuracy. This theory incorporates a number of abstractions related to the structure of text, including but not limited to grammatical rules. The prediction, then, is as follows. Once this strong champion theory is obtained, the comperical researchers will reuse it to solve the standard problems of traditional linguistics such as parsing, machine translation, and word sense disambiguation. At that time, the results achieved by the comperical researchers will be dramatically superior to those achieved by more traditional approaches.

This prediction is just a restatement of the Reusability Hypothesis. The history of science indicates that the Reusability Hypothesis holds for fields like physics and chemistry. There are reasons to believe the hypothesis will hold for comperical science as well, but as yet no concrete evidence. If the prediction turns out to be true, it will vindicate the Reusability Hypothesis, and by extension the comperical philosophy. If, on the other hand, the champion theory proves to be useless for any other purpose, then the comperical idea should be abandoned.
Chapter 5

Compression as Paradigm

5.1 Scientific Paradigms

Thomas Kuhn, in his famous book "The Structure of Scientific Revolutions", identified several abstractions that are very useful in describing the patterns of scientific history [63]*. Kuhn began by analyzing what most scientists do in their day-to-day activities. Most of the time, a typical scientist is not attempting to overthrow the cornerstone theory of his field. Instead, he is attempting to articulate or enlarge the theory, by showing how it can be applied to a new situation, or how it can be used to explain a previously mysterious phenomenon. Kuhn called this kind of incremental research "normal science". A key property of normal science is that it is cumulative: while individuals may fail, once someone solves a problem, it is solved for all time and the field can move on. Because of this cumulative effect, and because there are a large number of scientists, normal science can produce rapid progress. However, several requirements must be met before normal science can take place. It is necessary for the scientific community to agree on the importance of various questions, and on the legitimacy of the solution methods used to answer them. In other words, the community must be bound together by a shared set of philosophical commitments and technical knowledge. Kuhn calls this unifying body of conceptual apparatus a scientific paradigm.

In spite of their crucial importance, it is rare for a paradigm to be explicitly articulated. This is because a paradigm is in many ways a set of abstract philosophical commitments, and scientists want to do science, not philosophy. Instead of explicit instruction, young scientists learn a paradigm by studying exemplary research within a field, and attempting to assimilate the rules and assumptions guiding the research. Because the necessity of a paradigm is not always acknowledged, people often attempt to do science without one.
This leads to incoherence, even if the individuals involved are disciplined and intelligent scientists.

Paradigms are never perfect, and sometimes collapse. This event, called a scientific revolution, occurs when the inconsistencies and ambiguities previously tolerated by the paradigm start to become so glaring and burdensome as to prevent further progress. As the paradigm breaks down, more and more researchers abandon it, and go in search of a replacement, or leave science altogether. Scientific revolutions tend to be traumatic, because normally reasonable academic discourse becomes acrimonious and intractable. The scientific community will splinter into opposing camps defined by new and mutually incompatible philosophical commitments. Since the discarded paradigm defined the rules of scientific debate, there is no reason why a proponent of one camp should accept the arguments made by another camp. Members of communities defined by opposing paradigms might not even agree on what research questions are legitimate topics within a given field. Thus, paradigm conflicts must be settled by methods that are essentially extrascientific. Perhaps one paradigm leads to more practical applications, or perhaps it is simply more appealing to the younger generation of researchers. In any event, further progress is impossible until a new scientific community coalesces around a replacement paradigm.

5.1.1 Requirements of Scientific Paradigms

In order to fulfill their purpose of providing coherence to a scientific community and enabling normal science, scientific paradigms must possess certain properties. The most basic function of a paradigm is that it serves as a body of shared background knowledge. This knowledge is set forth in the textbooks used by students, and any practitioner in the field can be expected to be familiar with it.
This is crucial, because it means that researchers within a paradigm do not have to waste time explaining basic ideas to one another. A paradigm ensures that researchers share a core of technical knowledge, but perhaps more importantly, it ensures that they share a set of philosophical commitments. A paradigm must give clear answers to the meta-questions: what are the legitimate topics of inquiry within a field, and what are the important open problems? In addition, a good paradigm should answer the question of evaluation: how can the quality of new results be determined? If scientists needed to spend large amounts of time pondering these abstract questions, they would have no time for concrete research.

The paradigm also provides an important concentration-enhancing role. Practitioners become highly focused on the relevant problems defined by the paradigm. This concentrating effect helps to insulate practitioners from distractions, such as social pressures from the outside world. Scientists operating within a paradigm commit themselves to a certain intellectual path, upon which they can stride forward confidently, without going in circles or getting lost in the woods.

In addition to solving philosophical problems, a paradigm also helps to solve sociological issues. Scientists tend to be intelligent, individualistic, and distrustful of authority. There is no central committee or chief executive telling scientists what to do. The paradigm is what allows this decentralized group of often eccentric persons to make systematic progress. If it were not for a paradigm, a scientific community would act more like a disorganized mob and less like a disciplined army. Science is also, in many ways, a competitive sport, where the prizes include things like grant money, tenure, and fame. The paradigm defines the rules of the game.
Without a paradigm, science would be like a baseball game in which the players do not agree about the location of the bases, the number of outs in an inning, or whether it is legal to bunt. It is not a coincidence that the person who determines whether to accept a paper submitted to a journal is called a "referee".

Another, less obvious requirement of a good paradigm is that it must define a large number of technical questions, or "puzzles". Ideally, these puzzles should not be exorbitantly difficult: an intelligent and hard-working practitioner should have a very good chance of solving the puzzle he or she chooses to work on. If this were not the case, morale in the field would evaporate and it would cease to attract talented young people. Since a scientific community commits to a paradigm, and the paradigm defines a set of puzzles, members of the community will recognize a new puzzle solution as a valuable research contribution. Also, the puzzles must be cumulative in nature. The puzzle solutions should be like bricks in a wall; each new brick both expands the wall and provides a space for another brick to be placed on top of it.

If a scientific community is able to both generate good puzzles and solve them, it will naturally become increasingly esoteric. The concepts, techniques, and phenomena it considers will become completely incomprehensible to a layperson. This esotericity is a characteristic property of a mature scientific field. A well-educated person can be expected to understand Newton's laws and perhaps even special relativity. But no layperson can be expected to understand an abstract such as the following:

    New torsional oscillator experiments with plastically deformed helium show that what was thought to be defect-controlled supersolidity at low temperature may in fact be high-temperature softening from nonsuperfluid defect motion in the crystalline structure [7].
5.1.2 The Casimir Effect

The following anecdote may help to illustrate the meaning of normal science. In 2003 the author was a graduate student in physics at the University of Connecticut. The physics department would often invite researchers from other universities to come and present their work. At one point, a visiting researcher came to talk about his experiments involving a phenomenon known as the Casimir effect. The Casimir effect is a fascinating implication of the theory of quantum electrodynamics. This theory states that there exist minute fluctuations in the electromagnetic field at every point in space, even if there is no matter or force at that point. If space is thought of as a string, then these fluctuations correspond to very low frequency excitations that arise spontaneously. Ordinarily, these fluctuations occur everywhere at more or less the same rate. But if two mirrors are placed very close to one another (on the order of one micron apart), they damp out certain frequencies of the spontaneous excitations. Since the excitations contain energy, this implies that the region outside the mirrors has a slightly higher energy density than the region between them. This causes a force, analogous to pressure, that pushes the mirrors together. This phenomenon is called the Casimir effect. The visiting researcher had performed a very sophisticated experiment to measure the Casimir effect. The effect is hard to measure because it is quite weak except at small scales, where several other forces can interfere. To measure the effect, one must make corrections relating to thermal fluctuations, and to the fact that real mirrors are neither perfectly smooth nor perfectly reflective. Furthermore, because of the weakness of the effect, a sophisticated and highly sensitive device called an atomic force microscope must be used to measure it.
Finally, since air particles can also interfere, the experimental apparatus must be placed within a nearly perfect vacuum. The researcher was able to obtain measurements of the Casimir effect that agreed with theoretical predictions to an accuracy of 1% [79], so the experiment was considered a great success. It is worth asking, however, what would have happened if such an agreement had not been achieved. The most likely outcome is that the experiment would have been interpreted as a failure, and not published. If the negative results were published, they would undoubtedly have very little effect on the field, except perhaps to inspire other experimentalists to try to tackle the problem. Almost certainly, the negative result would not cause anyone to conclude that quantum electrodynamics was wrong. This anecdote illustrates the philosophical commitments that guide research in physics. An intellectual with no understanding of the physics paradigm might reasonably conclude that the measurement was a colossal waste of time and money. The Casimir effect is vanishingly weak and hardly ever arises in practical situations, the visiting researcher was not the first to measure it, the experiment was enormously difficult and expensive to carry out, and if the measurements had not agreed with predictions, the results would have been ignored. In spite of all this, the experiment was widely recognized as a success, and in a congratulatory spirit the scientist was invited to go on a lecture tour to present his research to aspiring young physicists. The researcher's measurement was a small step forward, but it was a decisive one that clearly advanced the state of human knowledge.

5.1.3 The Microprocessor Paradigm

One of the most analyzed and commented-upon trends in recent technological history goes by the name Moore's Law.
There are various versions of Moore's Law, but the basic statement is that the number of transistors on a minimum-cost chip doubles every year. Because the number of transistors included in a microprocessor is an important factor in its performance, this law roughly states that the performance of digital computers increases exponentially. A remarkable aspect of Moore's Law is that it depends on the reliable appearance of ever more sophisticated technological insights. The microprocessor depends on a complex web of technologies that enable not only the thing itself, but also the processes that allow it to be manufactured efficiently. Consider photolithography, which is but a single strand of the web. Photolithography is a technique in which a geometric pattern is etched into a substrate using ultraviolet light. The pattern is selected by shining light through a special mask. Before exposure, a thin layer of a special chemical, called a photoresist, is deposited onto the chip. After the deposition, the chip is spun to ensure that the photoresist spreads out uniformly across the surface. The areas of the photoresist that are exposed to the UV light become soluble in another special chemical. This chemical is then used to remove the exposed areas of the photoresist, leaving behind the unexposed areas, which match the pattern of the mask. It is also necessary to use a special excimer laser that can produce deep ultraviolet light. The full photolithographic process relies upon the use of specialized chemicals, advanced lasers, fluid mechanical modeling techniques, and optical engineering methods. All of these moving parts need to work together seamlessly and with high reliability. The brief description given above mentions only a couple of the innovative techniques that played a role in Moore's Law; there are many more.
In spite of the almost magical nature of these technologies, they all appeared right on time to satisfy the next mandated doubling of processor speeds. Another remarkable aspect of Moore's Law is that it depends on a chaotic process of commercial competition to produce the predicted improvements. Dozens of companies, such as IBM, Intel, Motorola, and Texas Instruments, participated in and contributed to the microprocessor revolution. The history of the field records a large number of new design offerings, many of which failed commercially or led nowhere. For example, the Texas Instruments TMS 1000 was one of the first general-purpose computers on a chip. TI used the chip in its calculators and some other products, but it did not play an important role in later computers. At the same time as the TMS 1000, Intel released an all-purpose, 4-bit chip called the 4004. The 4004 was not especially successful, but it paved the way for later Intel chips called the 8008 and 8080. The 8080 was one of the first truly useful microprocessors. Other general-purpose chips of that era include the RCA 1802, the IBM 801, the Motorola 6800, and the MOS 6502. It is remarkable that, in spite of the fact that many individual products flopped and many companies went bankrupt, the field as a whole achieved rapid progress. A major factor contributing to Moore's Law was economics. In the early decades of research in computers and electrical engineering, it was unclear whether there was a market for ever-faster computer chips. But at a certain point, it became clear that demand did exist. Once computers came into widespread use, their basically generic nature meant that improvements in processor speed yielded improvements in a wide variety of applications (a faster CPU will speed up not only the spreadsheet application, but also the word processor, database, web browser, and so on).
This gave companies an incentive to spend huge amounts of money on new research and development efforts. While economics played an important role in microprocessor performance, philosophical factors played an equally important role. In many ways, microprocessor research served as a scientific paradigm. The paradigm provided a clear answer to the meta-question: a new result was valuable if it improved the speed of a chip, or if it could make chip fabrication less expensive, or if it helped with a related issue such as power consumption. Evaluation was straightforward and objective: did the chip compute the correct result? Can it be manufactured cheaply and reliably? Also, the paradigm posed a stream of puzzles for new researchers to attack. These puzzles were certainly not easy, but the existence of the paradigm gave people the determination necessary to attempt to solve them. Carver Mead, who did several early studies related to Moore's Law, remarks that:

[Moore's Law] is really about people's belief system, it's not a law of physics, it's about human belief, and when people believe in something, they'll put energy behind it to make it come to pass [60].

In other words, people knew what they were looking for, and they knew that it was possible to make rapid improvements, so the prospect of achieving yet another doubling of clock speed seemed like a routine challenge and not an impossible dream. This expectation of success produced the institutional determination necessary to mobilize large groups of people. Even if each individual researcher achieved only a modest improvement, this was sufficient for rapid aggregate progress.
The economic and philosophical factors, combined together, implied that a group of people could dedicate themselves to a highly specialized and esoteric research project with a strong degree of certainty that the fruit of their efforts would constitute a substantial contribution to human knowledge and civilization.

5.1.4 The Chess Paradigm

The problem of computer chess is nearly as old as the field of computer science itself. Even before the first computers were built, Alan Turing wrote a chess algorithm and used his own brain to "run" it. Claude Shannon suggested programming a computer to play chess, arguing that:

Although of no practical importance, the question is of theoretical interest, and it is hoped that... this problem will act as a wedge in attacking other problems... of greater significance [103].

The history of computer chess shows a record of gradual but systematic progress. Dietrich Prinz, a student of Einstein and colleague of Turing, wrote the first limited chess program in England in 1951. Alex Bernstein, a researcher at IBM, wrote the first program that could play a full game of chess in 1958. The Association for Computing Machinery started to hold biannual computer chess competitions in 1970. These competitions spurred a feeling of camaraderie and competitive spirit, and by the end of the 1970s the top programs were as good as upper-level human players. Interestingly, even as the field was making rapid progress, there was uncertainty about how far it could ultimately go. In 1978, David Levy, an International Master, won a bet he had made ten years earlier that no computer would beat him. At an ACM conference in 1984, a panel of experts could not agree about whether a computer would ever defeat the top-ranked human player. Of course, this event did eventually occur, when IBM's Deep Blue defeated Garry Kasparov in 1997.
The inexorable progress achieved in the field of computer chess can be explained by the fact that chess defines a paradigm. The chess paradigm provides a clean answer to the meta-question: a new idea is a valid contribution to the field if it can be used to enhance the performance of a chess program. The paradigm also supports clean, inexpensive, and objective evaluations of rival solutions. Furthermore, the paradigm defines a nice set of puzzles. Computer chess researchers developed a variety of techniques to attack the problem. One such technique is alpha-beta pruning, which is a way of searching a tree representing game states [62]. The idea of alpha-beta is to prune subtrees that are known to be suboptimal, given that every other branching decision in the tree is made by the opposing player. The chess problem even inspired two Bell Labs researchers to develop a chess computer with special hardware for evaluating board positions and generating lists of possible moves. Most people would agree that computer chess was a successful research program that made an important contribution to human knowledge. This fact is remarkable, given that chess is an abstract symbolic game with no connection to the real world. An analogy to the game of tennis may be useful here. A tennis game does not directly evaluate a player's fitness level. Nevertheless, in order to win at tennis one must be highly fit. So the discipline imposed by the rules of the game forces players to become fit, if they want to win. The chess paradigm has an analogous effect: the discipline imposed by the objective evaluation procedure forced researchers to innovate in order to win. The compression principle provides a similar level of discipline, which should likewise force researchers to innovate.
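The alpha-beta pruning idea mentioned above can be made concrete with a short sketch. The following Python fragment (a minimal illustration; the toy game tree and its leaf values are invented for this example, not taken from any chess program) skips subtrees that cannot affect the minimax value:

```python
def alphabeta(node, alpha, beta, maximizing):
    """Minimax search with alpha-beta pruning over a toy game tree.

    A node is either a number (a leaf evaluation) or a list of
    child nodes. alpha and beta bound the values the maximizing
    and minimizing players can already guarantee; a subtree that
    cannot improve on these bounds is skipped entirely.
    """
    if not isinstance(node, list):      # leaf: static evaluation
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:           # opponent will avoid this line
                break                   # prune remaining children
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value

# A tiny two-ply tree: maximizer chooses among three minimizer nodes.
tree = [[3, 5], [2, 9], [0, 1]]
print(alphabeta(tree, float("-inf"), float("inf"), True))  # → 3
```

In this example the leaves 9 and 1 are never examined: once the minimizer's running bound falls below the value the maximizer can already guarantee, the rest of that subtree is irrelevant, which is exactly the saving that made deep chess search feasible.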
Even if it turns out that comperical research has no practical importance, it may act, as Shannon put it, as a wedge in attacking other problems of greater significance.

5.1.5 Artificial Intelligence as a Pre-paradigm Field

This book has mostly avoided explicit discussion of the field of artificial intelligence, because it is too broad to be reasoned about as a whole. But in the light of the discussion of paradigms, a simple diagnosis can be made that explains the gap between the enormous ambitions of the field and the actual success it has achieved. AI is, quite simply, a field without a paradigm. Though many authors have attempted to define intelligence and thereby provide a direction to research, none of those definitions are widely accepted. This lack of shared commitments prevents the field from making systematic progress, except for short spurts of activity within narrowly defined domains. The field consists of a large number of individuals carrying out idiosyncratic and mutually incompatible research programs. Evidence in favor of this diagnosis can be obtained simply by perusing the standard AI textbook by Russell and Norvig [99]. In a textbook describing a mature field, one would find a clear relationship between the early sections of the book and the later sections. The early chapters would discuss fundamental ideas and theories, while the later ones would involve applications or specialized areas. But the topics treated in the first half of the AI textbook have nothing to do with the topics in the second half. An early section of the book is entitled "Problem Solving", which includes topics such as heuristic search, adversarial search, and constraint satisfaction. A later section is called "Communicating, Perceiving, and Acting", and discusses areas such as natural language processing, perception, and robotics.
A researcher in natural language processing or robotics has no special need to know anything about adversarial search or constraint satisfaction, and may very well be completely ignorant of these topics. The incoherence of the textbook is not to be blamed on the authors, who have provided clear and cogent descriptions of the standard topics of the field. This is not to say that AI researchers do not propose paradigms; in fact, they do quite often. The problem is simply that the paradigms do not work very well. One example of such a paradigm relates to the STRIPS planning formalism (STRIPS stands for Stanford Research Institute Problem Solver) [35]. This formalism represents the state of a system as a collection of propositions. So AT(TRUCKA, MIAMI) would assert that a certain truck is in Miami. The agent has the option of modifying the state by executing an action, which may have preconditions, add effects, and delete effects. An example action is MOVE(TRUCKA, MIAMI, HOUSTON). This action would require AT(TRUCKA, MIAMI) as a precondition, and would also delete it, while adding AT(TRUCKA, HOUSTON). This formalism is expressive and general enough to allow users to encode a wide variety of planning problems. This expressiveness implies that if a general-purpose STRIPS solver could be found, it would be an extremely useful tool. Motivated by the goal of obtaining such a tool, members of the automatic planning community have carried out several decades of research based on STRIPS. To evaluate progress, the community has defined a number of benchmark problems that involve things like logistics (how to deliver a set of packages using the smallest amount of time and fuel) and airport traffic control [78]. Unfortunately, the mere act of defining shared benchmarks has not allowed the community to make much progress.
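The STRIPS action semantics described above can be sketched in a few lines of Python. This is a minimal illustration mirroring the truck example; the set-based state representation and the function name are invented for this sketch, not taken from any planning system:

```python
# Minimal sketch of STRIPS-style state transitions.
# A state is a set of propositions; an action has preconditions,
# an add list, and a delete list.

def apply_action(state, pre, add, delete):
    """Apply a STRIPS action: check preconditions, then update state."""
    if not pre <= state:                # all preconditions must hold
        raise ValueError("preconditions not satisfied")
    return (state - delete) | add

state = {("AT", "TRUCKA", "MIAMI")}

# MOVE(TRUCKA, MIAMI, HOUSTON): requires the truck to be in Miami,
# deletes that proposition, and adds the one placing it in Houston.
state = apply_action(
    state,
    pre={("AT", "TRUCKA", "MIAMI")},
    add={("AT", "TRUCKA", "HOUSTON")},
    delete={("AT", "TRUCKA", "MIAMI")},
)
print(state)  # → {('AT', 'TRUCKA', 'HOUSTON')}
```

A planner's job is then to search for a sequence of such actions transforming an initial state into one satisfying a goal condition; the intractability results cited below concern exactly this search problem.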
Research in this area is hobbled by a variety of intractability theorems. One such theorem shows that the basic STRIPS problem is NP-hard [16]. Another shows that if the formalism is enhanced to include numeric quantities, the problem becomes undecidable [44]. If new planning systems appear to show improved performance, this is probably because they have been designed to exploit special properties of the benchmark domains, not because they have achieved truly general applicability. One of the purposes of a paradigm is to produce a set of puzzles that all members of a community will recognize as important research topics. Because there is no clear paradigm for artificial intelligence, researchers are generally at liberty to define their own research questions. This situation might not be so deplorable, if researchers were honest about their reasons for considering a certain question important. In practice, people have preferred solution methods, and define their research questions to be exactly those that can be solved, or at least attacked, by their methods.

5.1.6 The Brooksian Paradigm Candidate

In the early 1990s, an MIT researcher named Rodney Brooks wrote a series of papers articulating a critique of the philosophical foundations of artificial intelligence research and outlining a new paradigm for the field [11, 13]. Brooks believed that AI researchers suffered from a disease called "puzzlitis", which was caused by two factors. On the one hand, everyone saw and felt the necessity of using objective measures to evaluate new research results. But at the same time, the connection between AI research and the real world was tenuous at best. This meant that the typical strategy for producing new research was to invent some obscure logical puzzle that no one else had ever considered, and then show that a new system could solve the puzzle. Brooks also deplored the fact that much research in AI was ungrounded.
Researchers spent much time and effort on defining new abstractions, figuring out how to organize them, and deciding how they should relate to one another. But these abstractions were not, for the most part, grounded in empirical reality. To repair these problems, Brooks proposed a paradigm candidate for AI research based on the idea of real-world agents. Brooks defined intelligence to be the set of computational abilities that an agent would require to survive in the real world. This definition provided an answer to the meta-question for the paradigm: a new result was a valuable contribution if it helped an agent to navigate or operate in the real world. Since real-world agents encounter a wide variety of perceptual and computational challenges, this definition provided a parsimonious justification for a broad range of AI research. Brooks noted that many old problems disappeared and many new problems arose when an agent was placed in the real world. For example, the problem of building and manipulating a complex internal world model largely disappeared. Instead, agents could simply rely on extensive and constant sensing in order to keep track of the world state. Brooks' research plan was also partly motivated by an analogy to the path taken by evolution in producing intelligence. Evolution started with very simple creatures that did not do much other than wander around and try to find food. As evolution advanced, more complex sensory and cognitive abilities appeared. Crucially, evolution spent huge amounts of time developing low-level skills like perception, motor control, and basic navigation. In contrast, advanced human skills like logic, mathematics, and language appeared only recently in historical terms. This implied that the real challenge was to develop the low-level skills. Brooks argued that AI research should follow a similar path.
At the time of Brooks' original work, it seemed very plausible that the embodiment paradigm would satisfy many of the requirements for a good paradigm. It produced a large stream of puzzles that competent practitioners would have a good chance of solving. It provided a set of shared philosophical commitments that would lead all participants to recognize new puzzle solutions as legitimate contributions. An example of such a puzzle solution might be a robot that could navigate around an office environment and clean up all the discarded soda cans. At least in principle, it seemed like the research in this area could be cumulative in nature. Given a soda-can-cleaning robot, it might be a tractable problem to improve it so that it also has the ability to fetch coffee. The debate between Brooks and other researchers pursuing more traditional approaches to AI provides a good illustration of how paradigms can be incommensurate. To Brooks, topics such as theorem proving, automated planning, and computer chess were simply irrelevant distractions. An artificial agent operating in the real world has no need to prove mathematical theorems or play chess. Conversely, critics of Brooks' idea did not accept that computational tasks like wall-following or obstacle avoidance were relevant to intelligence. Neither group had any special reason to accept the other's assumptions and commitments. The debate cannot be settled by normal scientific methods such as experimental verification or proof-checking. Twenty years after its initiation, the Brooksian paradigm has not yet achieved its goals. Perhaps the key failure was the assumption that this kind of research could achieve cumulative progress. In reality, it is very difficult for a new researcher to pick up a robotics project begun by an earlier worker and successfully enhance its capabilities.
One reason for this is that problems in robotics are not often conclusively solved, in the way that mathematics problems are. Rather, a typical robot will succeed in some situations, but fail badly in others. This means that the foundation laid by earlier researchers is not solid enough for their successors to build upon. Another problem, somewhat foreign to the purist intellectual concerns of academic researchers, involves simple economics. Robots are very expensive and time-consuming to build. If a researcher devotes a million dollars and three years to the development of a soda-can-cleaning robot, and then the robot doesn't work for technical reasons (perhaps its can-pinching mechanism is not precise enough), this is a disaster for the researcher's career. This means that researchers must avoid risky projects, since so much effort is invested. But as any entrepreneur will agree, risk aversion is poisonous to innovation. While the Brooksian paradigm did not succeed in completely revolutionizing AI, it would be incorrect to label it a complete failure. One of the few truly useful commercial robotics products on the market today is the Roomba. This robot is not "intelligent" in the traditional sense, but is instead resilient and resourceful at navigating the real world. Brooks' research on navigation and obstacle avoidance techniques played a key role in the development of the Roomba. Many other intriguing devices have emerged from the laboratories dedicated to the embodiment paradigm, and it is possible that they will inspire other, more practical applications in the future [88].

5.2 The Comperical Paradigm

The preceding discussion mentioned some of the aspects of previous paradigms that enabled researchers operating within them to make progress. Several issues arose in each case. A paradigm requires an answer to the meta-question and to the problem of evaluation. In some cases, these two issues were combined.
So in the chess paradigm, any contribution was important if it allowed a computer to play better chess, and new solutions were evaluated using the victory rate they achieved. In physics, the answer to the meta-question is less explicit, but researchers can use the long history of success in the field to hone their intuitions about what kind of research is valuable. The embodiment idea is interesting because it came very close to meeting the criteria necessary for a good paradigm, but fell short for subtle reasons. This section argues that the compression principle can provide a solid paradigm for research in a variety of topics. The comperical philosophy meets all of the necessary criteria. It provides a definitive answer to the meta-question, as well as an objective evaluation process. The compression principle provides a parsimonious justification for many research topics. The following sections discuss these topics in greater detail. The text refers specifically to computer vision, but the remarks apply equally well to various related fields. A summary of the comparison between the comperical approach to vision and the traditional approach is given in Table 5.1.
Table 5.1: Comparison of Methodological Approaches

                           Traditional                       Comperical
  Meta-Question            No standard formulation           Clear, clean formulation
  Comparisons              Ambiguous                         Stark, decisive
  Bugs & Fraud             Hard to detect                    Cannot occur
  Manual Overfitting       Hard to prevent                   Cannot occur
  Type of Science          Math/Engineering                  Empirical science
  Product of Research      Suite of tools                    Single best theory
  Motivation               Practical utility                 Open-ended curiosity
  Data Source              Small, labeled                    Vast, unlabeled
  Evaluator Development    Expensive, risky                  Cheap
  Evaluator Scalability    One-to-one                        Many-to-one
  Justification of Topic   Esoteric                          Parsimonious
  Cooperation              Constant replication of effort    Standing on shoulders of giants

5.2.1 Conceptual Clarity and Parsimonious Justification

Any paradigm candidate for a scientific field must provide a clear answer to at least two questions. The first is the meta-question: what topics or research questions are contained within the field? Without a clear answer to this question, the field becomes incoherent. In some fields, the meta-question is answered only implicitly. Perhaps there is no standard definition of what exactly constitutes a legitimate question in physics research. But everyone can see the ridiculousness of trying to publish a historical analysis of presidential elections in a physics journal. The journal editors would reject such a paper out of hand, even though presidential elections are ultimately governed by physical law. However, the rejection would not automatically imply that the paper was not a legitimate piece of scholarship, only that it does not belong in a physics journal and, equivalently, that the editors cannot evaluate it. This equivalence immediately suggests the second function of a paradigm: it defines the legitimate methods of evaluating new research within a field.
If members of a community do not at least approximately agree on this point, academic civility will break down due to disputes regarding the value of new contributions. The modern field of computer vision provides only very weak answers to these fundamental philosophical questions. The field has a set of topics of interest, such as image segmentation, stereo matching, and object recognition. But these problems appear to lack deep justification. Topics such as edge detection are often justified by neuroscientific findings, such as the discovery by Hubel and Wiesel of cells in the visual cortex whose receptive fields suggest that their purpose is to detect edges [49]. Ideas from Gestalt psychology provide another commonly cited motivation [57]. While the basic idea of imitating the action of the human brain seems sound enough, in practice this strategy produces a large number of incompatible and disconnected research results. An orthogonal justification is practicality: certainly a solution to the face detection task would be useful. Unfortunately, this justification fails in almost all cases, because computer vision research is only very rarely useful. Certainly computer vision solutions are not widely deployed in real-world commercial systems (an important exception here is factory settings, where the visual environment can be tightly controlled). So modern computer vision offers a diverse range of research based on a diverse set of justifications. In stark contrast, the comperical approach leads to an extensive and varied set of research problems justified by a single principle: any concept that can be used to compress a database of natural images is part of comperical vision science. This includes a wide variety of contributions, including new mathematical theorems, algorithms, knowledge representation tools, and methods of statistical inference.
This many-to-one relationship between contributions and justification also exists in physics, where all research is ultimately justified by a single problem: predict the future. To see how these philosophical issues intersect with practical considerations, consider the problem of how a graduate student should choose a thesis topic. Finding a good topic is a nonobvious research problem in itself, and requires familiarity with the field and the state of the art. But the student, by virtue of being a student, does not yet possess a rich understanding of the field, and so is at a loss to decide what kinds of questions are important. Sometimes the student's supervisor steps in and makes the choice for the student, but this solution is not ideal. The comperical philosophy provides a very simple method for finding new research topics. One simply downloads a benchmark database and the corresponding champion theory, and looks for aspects of the data that the theory does not adequately capture. This can be done by looking for segments of the database that require an abnormally high amount of codelength to encode. The student can also find shortcomings of the champion theory by sampling from the model and examining the veridicality of the generated samples.

5.2.2 Methodological Efficiency

A major benefit of the comperical approach is the surprising degree of methodological efficiency it permits. To see this, consider a hypothetical Journal of Comperical Vision Research, which accepts two kinds of submissions. The first type includes reports of new shared image databases for use by the community. A paper of this type describes the content of a new database, the mechanism by which it was constructed, the extent of the variation that it contains, and any other details that might be relevant. These submissions are accompanied by the actual database. The journal editors briefly inspect the paper and the database.
Unless the database contains some glaring deficiency, these kinds of contributions should in most cases simply be accepted. The editors then publish the paper and post a link to the database on the journal's web site, where it can be downloaded by other interested members of the community. The second type of submission includes reports of new compression rates achieved on one of the shared databases. A submission of this type must be accompanied by the actual compressor used to achieve the reported result. As part of the review process, the journal editors run the compressor, verify that the real net codelength agrees with the reported result, and then check that the decoded version matches the original. In principle, the editors should accept any contribution that legitimately reports an improved compression rate. In practice, it may be necessary for the editors to exercise some degree of qualitative judgment: a paper that merely tweaks some settings of a previous result and thereby achieves a small reduction in codelength is probably not worth publishing, while an innovative new approach that doesn't quite manage to surpass the current champion probably is. In spite of the above caveat, the journal editors have a remarkably easy job. The simplified nature of the comperical review process contrasts strongly with the situation in most modern scientific fields, where peer review is a crucial and time-consuming activity. The comperical philosophy also considerably simplifies the process of database development. A major limiting factor in traditional computer vision research is the effort required to build ground-truth databases. The comperical approach mandates the use of unlabeled databases, which are much easier to construct. For example, a researcher wishing to pursue the roadside video proposal of Chapter 1 needs only to set up a video camera next to a highway and start recording.
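The referee's check described above (run the submitted compressor, measure the codelength, confirm that decoding reproduces the original database exactly) can be sketched as follows. This is a minimal illustration in Python, using zlib as a stand-in for a submitted compressor; the function names are invented for this sketch:

```python
import zlib

# Stand-ins for a submitted compressor/decompressor pair.
def compress(data: bytes) -> bytes:
    return zlib.compress(data, level=9)

def decompress(code: bytes) -> bytes:
    return zlib.decompress(code)

def referee_check(data: bytes, claimed_codelength: int) -> bool:
    """Verify a submission: the decoded output must match the
    original exactly, and the measured codelength must be at
    least as good as the claimed result."""
    code = compress(data)
    if decompress(code) != data:       # lossless round trip required
        return False
    return len(code) <= claimed_codelength

database = b"roadside video frame data " * 1000
code = compress(database)
print(referee_check(database, claimed_codelength=len(code)))  # → True
```

Note that in the book's terms the real net codelength also includes the size of the decompressor program itself; the sketch omits that bookkeeping for brevity.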
Even this relatively easy step can be bypassed, if an organization with video cameras already in place can be convinced to share its data. Of course, not all databases are equally useful for research purposes. It will be necessary to exercise some ingenuity and foresight when building target databases, especially in the early stages of research. It is probably impossible to make much progress by using an arbitrary collection of images downloaded from the internet. Such an image collection would probably contain an unapproachably large amount of variation.

The ease of data collection in comperical research is a consequence of its fundamentally new mindset. Most researchers, in computer vision and other fields, adopt a directed, utilitarian mindset. They begin with the question: "what would it be useful to know in this area?" This leads to specific questions like how to recognize faces, how to detect speech, how to perform part-of-speech tagging, whether a new drug will reduce the risk of heart disease, and whether a government stimulus will reduce unemployment. To answer these questions, the scientist must conduct a time-consuming empirical study. The difficulty of the study limits the amount of data that can be collected, which in turn limits the complexity of the model that can be used. In contrast to the mindset of targeted research, comperical researchers adopt a stance of open-minded curiosity. They begin with the question: "given this easily obtainable data source, what can be learned from it?" This mindset is motivated by the belief that such databases do contain valuable secrets.

As noted above, the comperical philosophy makes it easy in principle to verify the quality of a new contribution. The issue of verifiability is one of the main reasons why this book advocates the compression rate as a way of evaluating predictive power, as opposed to the log-likelihood or some other metric.
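The verification procedure sketched in the journal discussion above (run the compressor, measure the real net codelength, compare the decoded output byte for byte) is simple enough to state as code. In this sketch the submitted compressor and decompressor are modeled as plain bytes-to-bytes functions, which is a simplifying assumption of mine, not a convention from the book.

```python
import zlib

def referee_check(compress, decompress, original: bytes, claimed_bits: int) -> bool:
    """Sketch of the journal's review procedure, reduced to mechanical steps.

    A fuller accounting would also charge for the length of the decompressor
    program itself, as the net codelength discussion elsewhere in the book
    requires.
    """
    encoded = compress(original)
    actual_bits = 8 * len(encoded)                # the real net codelength
    lossless = decompress(encoded) == original    # any significant bug fails here
    return lossless and actual_bits <= claimed_bits

# With zlib standing in for a submission: an honest claim passes, while a
# corrupted decoder or an overstated result is rejected.
data = b"roadside video frame " * 1000
accepted = referee_check(zlib.compress, zlib.decompress, data, claimed_bits=8 * len(data))
```

The three rejection paths (lossy decode, buggy decode, dishonest claim) all reduce to the same two mechanical checks, which is the point of the argument above.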
Consider a Maximum Likelihood Contest where the goal is to find the model that assigns the highest possible likelihood to a data set T. The current champion theory has achieved a log-likelihood of −3.4 · 10^8 for T. Now a challenger submits a new model, implemented as a software program. The referee tests the program by invoking it on the data set. After running for a while, the program prints out:

Modeling complete: log-likelihood is −2.2 · 10^8.

If this claim is legitimate, then the new model is superior and should be declared the new champion theory. But it is very difficult for the judge to verify that the model follows the rules of probability theory and that the software does not contain any bugs. If a data compressor contains any significant bugs, these will be discovered when the decoded database fails to match the original one exactly.

The issue of software bugs and other kinds of mistakes is actually very relevant to machine intelligence research. Software produced in vision research is far too complex to be thoroughly examined by reviewers. Often researchers will not even release their source code for examination. Typically, reviewers and the community must rely on the results published by the authors in their papers. But if the authors produce those results using buggy software, it is very difficult for these errors to be discovered and repaired. Consider the task of trying to determine if a segmentation algorithm is correctly implemented according to the procedure set forth in a paper. In some places, incorrect lines of code will cause the program to crash or produce an error. But in other places, software bugs will simply cause the program to produce a different segmentation. It is likely that many published results in computer vision are affected by bugs of this kind. It is also possible that some errors are not entirely honest.
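The contrast between the two verification regimes can be made concrete. The toy models below are my own illustration: a buggy or dishonest likelihood model can print an arbitrarily good score, and nothing in the printed number reveals that its claimed probabilities violate normalization.

```python
import math

def reported_score(data, prob):
    """All the referee sees: the log-likelihood the submitted model prints."""
    return sum(math.log(prob(x)) for x in data)

def honest(x):
    return 0.5   # a legitimate model of a fair bit: P(0) = P(1) = 0.5

def cheater(x):
    return 0.9   # claims probability 0.9 for EVERY outcome; the claimed
                 # probabilities sum to 1.8, violating normalization, but
                 # this is invisible in the printed score

data = [0, 1, 1, 0, 1, 0, 0, 1] * 100
```

The cheater's printed score is strictly higher, yet no decodable code with the corresponding codelengths exists. A compressor making the same overclaim would emit codewords too short to decode, and the byte-for-byte comparison of the decoded output would expose it immediately.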
Given how difficult it is for a third party to check a result, and given the intense competitive pressure scientists operate under, it seems likely that some published results involve academic fraud. It is very easy for a researcher, when comparing his new algorithm for task X to some standard algorithm for X, to make an "accidental" mistake in implementing the latter. The comperical philosophy provides a strong defense against both honest mistakes and actual academic fraud.

5.2.3 Scalable Evaluation

The word scalability refers to the rate at which the cost of a system increases as increasing demands are placed on it. For example, a process in which widgets are built by hand is not very scalable, since doubling the number of widgets will approximately double the amount of labor required. A widget factory, in contrast, might be able to double its output using only a ten percent increase in labor, and is therefore highly scalable. A brief analysis shows that the current evaluation mindset of computer vision has very bad scaling properties, while the comperical approach to evaluation is highly scalable.

The current evaluation strategy in computer vision can be characterized as a one-to-one approach. Since each specific vision task, such as object recognition, image segmentation, or stereo matching, requires its own evaluator method, this approach scales badly. If it turns out that a comprehensive articulation of computer vision requires solutions for 1000 tasks, then the field will need to develop 1000 evaluators. This inefficiency is exacerbated by the fact that the process of developing an evaluator is difficult work, requiring both intellectual, and often manual, labor. Furthermore, this arduous labor may very well go unrewarded.
Sometimes, after substantial effort has been invested in developing an evaluator, it turns out the scheme suffers from some flaw (a possible example of this is the difficulty related to the ROC curve evaluation scheme mentioned in Section 3.2.2). Even if the evaluator is technically sound, it is not always obvious that it assigns consistently higher scores to higher-quality solutions. It may be necessary to conduct a meta-evaluation process in order to rate the quality of the evaluators. Combined, these considerations imply that the current one-to-one approach to evaluation is painfully inefficient.

The importance of scalability becomes even more obvious when one realizes that many vision tasks of current interest, such as image segmentation, edge detection, and optical flow estimation, are low-level tasks that are not directly useful. Instead, the idea is that once good solutions are obtained for the low-level tasks, they can be incorporated into higher-level systems. These more advanced systems, then, constitute the ultimate goal of vision research. Under the current paradigm, each high-level task will require its own evaluator method to compare candidate solutions. Furthermore, since the high-level tasks are by definition more abstract, performance on such tasks will be much harder to evaluate. So the pile of work required to complete the project of finding evaluators for current tasks is tiny compared to the mountain of work that will be required to develop evaluators for future tasks.

In contrast to the current evaluation methodology, the Compression Rate Method provides the ability to evaluate a large number of disparate techniques using a single principle. Chapter 3 showed that many tasks in computer vision can be reformulated as specialized image compression methods. This implies that the compression principle provides a many-to-one evaluation strategy.
As discussed in the proposal of Section 1.4.1, a single database of roadside video can be used to evaluate the performance of a large number of components, such as motion detectors, wheel detectors, specialized segmentation algorithms, and learning algorithms that infer car categories. The Visual Manhattan Project database of taxi cab video streams can be used to develop algorithms for detecting and analyzing the appearance of pedestrians, buildings, and other cars.

In addition to greatly reducing the number of man-hours required for evaluator development, the compression principle provides a clean answer to the question of what the high-level tasks are. In the compression approach, a low-level system is one that achieves compression by exploiting simple and relatively obvious regularities, such as the fact that cars are rigid bodies obeying Newtonian laws of motion. A higher-level system is built on top of lower-level systems and achieves an improved compression rate by exploiting more sophisticated abstractions, such as the fact that cars can be categorized by make and model. Other techniques, not specifically related to image modeling, can be justified by showing that they help to solve the problem of transforming from the raw pixel representation to the abstract description language.

5.2.4 Systematic Progress

Kuhn's book about scientific paradigms and revolutions contains a number of profound aphorisms, but one of the most relevant is the following rhetorical question:

"To a very great extent, the term science is reserved for fields that do progress in obvious ways. But does a field make progress because it is a science, or is it a science because it makes progress?" [63]*

As noted in Chapter 3, the field of computer vision has a terrible problem involving constant replication of effort.
For any given task, such as segmentation, edge detection, object recognition, or image reconstruction, there are hundreds of candidate solutions to be found in the literature. This endless rehashing and redevelopment of standard tasks, and the resulting lack of clear progress, is a clear sign of the immaturity of the field and the weakness of its philosophical foundations. Consider, in contrast, how rare it is for physicists to submit new theories purporting to provide better descriptions of standard phenomena. In and of itself, the proliferation of papers proposing rival solutions to the same task might not be so bad; any unexpected new experimental observation in physics might merit a large number of proposed theoretical explanations. The difference is that the physicists are able to ultimately find the right explanation, at which point people can move on to new areas of research.

The comperical approach to vision not only allows the field to be formulated as an empirical science, but also effectively requires it to make progress. Systematic progress is built into the very definition of the Compression Rate Method. In effect, an investigation guided by the compression principle must make clear, quantifiable progress, or grind to a complete halt. The contrast between the potential for progress resulting from the comperical approach and the lack of progress exhibited by the traditional approach seems, almost by itself, to be a sufficient argument in favor of the former.

The comperical principle demands progress, but does not guarantee that progress will be easy. However, there is a fairly clear path that a newcomer to the field can take to build upon her predecessors' work. Given a compressor that achieves a certain level of performance, the new researcher can simply add a specialized module that operates for data samples exhibiting a certain type of property, and lies dormant for others.
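This modular mode of research can be sketched as code. The module interface (a compress/decompress pair that declines data it is not specialized for) and the one-byte tag are illustrative assumptions of mine, not a design from the book; zlib stands in for the generic base coder.

```python
import zlib

def modular_compress(data: bytes, modules) -> bytes:
    """Try a generic base coder plus each specialized module; keep the shortest.

    Each module is a (compress, decompress) pair that may decline, by raising
    ValueError, on data it is not specialized for.  The one-byte tag recording
    the chosen module is the price paid for the added flexibility.
    """
    candidates = [(0, zlib.compress(data))]           # module 0: the generic coder
    for tag, (compress, _) in enumerate(modules, start=1):
        try:
            candidates.append((tag, compress(data)))  # specialized attempt
        except ValueError:
            pass                                      # the module lies dormant
    tag, encoded = min(candidates, key=lambda c: len(c[1]))
    return bytes([tag]) + encoded

def modular_decompress(blob: bytes, modules) -> bytes:
    tag, encoded = blob[0], blob[1:]
    return zlib.decompress(encoded) if tag == 0 else modules[tag - 1][1](encoded)
```

A newcomer's contribution is then a new entry in `modules`; the codelength either improves on data the module handles or is unchanged elsewhere, which is the cumulative-progress property claimed above.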
For example, a researcher could take a generic image compressor such as PNG, and add a special module for encoding faces (see Section 3.4.5). The enhanced compressor would then achieve much better codelengths for images that include faces. Another researcher could then add a module for encoding arms and hands (the pixels in these regions would be narrowly distributed around some mean skin color). The improvement achieved by any single contribution may not be huge, of course, but this mode of research is guaranteed to achieve cumulative progress. This description has glossed over some issues of software engineering; it is not, in practice, always easy to attach a new module to an existing system. But this is exactly the kind of interesting technical challenge that computer scientists are trained to handle.

The issue of software design brings up another subtle advantage of the comperical approach: it provides a software engineering principle to guide the integration of a large number of separate computational modules. Given a set of software modules providing implementations of segmentation, object recognition, edge detection, optical flow estimation, and so on, the traditional mindset of computer vision provides no guidance about how to package these modules together into an integrated system. An attempt to do so would likely result in little more than a package of disjoint libraries, rather than a real application with a clear purpose. Establishing compression as the function of the application provides a principle for binding the modules together. The input is the raw image, the output is the encoded version, and the software modules are employed in various ways to facilitate the transformation.

Chapter 6

Meta-Theories and Unification

6.1 Compression Formats and Meta-Formats

The last sequence of ideas dealt with by this book relates to the process of adaptation.
All modern compressors work by adapting to the statistics of the data to which they are applied. For example, a text compressor invoked on a corpus of Spanish text will adapt to the language, so that, for example, the suffix -dad (e.g. verdad, ciudad, soledad) will be assigned a relatively short code. Early comperical research will aim at the development of highly specialized, and thus less adaptive, compressors. A comperical theory-compressor of English will achieve superior codelengths for a corpus of English, but fail completely when invoked on a Spanish corpus. But the ultimate goal of comperical research is to achieve both adaptivity and superior performance. The following development describes how this can, in principle, be achieved.

To begin, consider a team of scientists who are planning an experiment, and wish to develop a specialized compressor for the data produced by the experiment. Suppose the experiment is known to produce data that follows an exponential distribution:

P(x) = k e^(−λx)    (6.1)

where x is a non-negative integer, λ is a parameter of the system, and k is a constant required to ensure normalization. A simple calculation shows that k = (1 − e^(−λ)). Now consider three scenarios involving different schemes for encoding the data generated by this distribution.

In the first scenario, the true value λ = 2 is known exactly in advance. This allows the scientists to find the optimal codelength function:

L(x) = − log P(x) = − log k + λx

Figure 6.1: Graph of codelength penalty per sample for various choices of λ. The real distribution has λ = 2.

The expected codelength achieved by using this codelength function can easily be calculated by plugging Equation 6.1 into Equation A.1. The result is 0.458 bits, and this is the shortest possible expected per-sample codelength that can be achieved; this quantity is also known as the entropy of the distribution P(x).
A codelength function that diverged from the one shown above could do better on some outcomes, but it would underperform on average. The ability of the scientists to achieve the best possible codelengths is based on their foreknowledge of λ, and thus of P(x).

In the second scenario, the scientists are sure the data will follow an exponential distribution, but do not know the value of λ. They decide to make a guess, and use the value λ_g = 3. This produces a model distribution Q(x) that is not the same as the real distribution P(x). Equation A.3 can be used to calculate the average resulting codelength, which is 0.521 bits per sample. Due to the disagreement between Q(x) and P(x), this scheme underperforms the optimal one by 0.063 bits per sample. This can be interpreted as a penalty that must be paid for using an imperfect model Q(x). Figure 6.1 shows the relationship between the guess λ_g and the codelength penalty. The greater the difference between λ and λ_g, the larger the penalty.

In the third scenario, the scientists decide not to try to guess the value of λ. To get around their ignorance of λ, they instead use a flexible meta-format, which works as follows. When the compressor is invoked on the data, it reads all the samples, and performs a calculation to find the optimal value of λ. It writes this value into the header of the format, and then encodes the rest of the data using it. The decoder reads the λ value in the header, and then uses it to decode the samples. The price to be paid for using this scheme is the extra bit cost of the header. Assume λ is encoded as a standard double precision floating point number, requiring 64 bits. Then a simple calculation indicates that if N > 1032, it is more efficient to use the adaptive encoding scheme than the "dumb guess" scheme. Of course, the superiority of the meta-format over the simple λ_g = 3 guess format can't be known in advance.
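The numbers in the three scenarios can be checked directly. Computing with natural logarithms reproduces the per-sample figures quoted above (0.458 and 0.521), so the sketch below uses natural logs; the closed-form mean and the breakeven arithmetic are my own reconstruction, not code from the book.

```python
import math

LAM_TRUE, LAM_GUESS = 2.0, 3.0

def expected_codelength(lam_code, lam_true=LAM_TRUE):
    """Mean of L(x) = -log k + lam_code * x over P(x) with rate lam_true.

    For P(x) = k q^x with q = exp(-lam_true), the mean of x is q / (1 - q),
    so the expectation is available in closed form.
    """
    mean_x = math.exp(-lam_true) / (1.0 - math.exp(-lam_true))
    return -math.log(1.0 - math.exp(-lam_code)) + lam_code * mean_x

entropy = expected_codelength(LAM_TRUE)    # scenario 1: about 0.458
guessed = expected_codelength(LAM_GUESS)   # scenario 2: about 0.521
penalty = guessed - entropy                # about 0.062 per sample
# Scenario 3: the 64-bit header pays for itself once N * penalty > 64.
breakeven_N = 64 / penalty
```

The computed breakeven is a little above N = 1000, in line with the figure of N > 1032 quoted above.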
If the scientists get lucky and the guess turns out to be right, then the simple format will provide a better codelength. The benefit of using a meta-format is flexibility: good performance can be achieved in spite of prior ignorance. There are two drawbacks to the meta-format scheme: the additional overhead cost required to transmit the parameters (e.g. λ) and the computational cost involved in finding the optimal parameter values. In the simple example discussed above, neither of these drawbacks seems very significant. But when the models become more complex, so that instead of a single λ value there are thousands or millions of parameters, the costs can increase dramatically.

This meta-format idea can be taken further. Imagine that it was not known in advance that the data would follow an exponential distribution. Instead, it was known that the data would be generated by either the exponential, Gaussian, Laplacian, Poisson, or Cauchy distribution. To find the optimal specific format, it is necessary to determine which distribution describes the data best, and then to find the optimal parameter values (such as the mean and variance for the Gaussian). This introduces no fundamental difficulties: one simply calculates the optimal parameters for each distribution, and then selects the distribution/parameter pair that provides the shortest code. This scheme requires a small additional codelength price to be paid in order to specify which distribution is being used, and a small additional amount of computation. In return it provides a significant increase in the flexibility of the format.

The scientists can choose what meta-level to use based on the amount of data they expect to encode and their level of confidence about what form the data will take. The more data there is to encode, and the more ignorant they are of its structure, the higher up the meta-ladder they would want to go. But there is a natural limit to this trend.
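The distribution-selection meta-format described above can be sketched in miniature. To stay with nonnegative integer data, this sketch uses two candidate families (a discretized exponential, i.e. geometric, and a Poisson) rather than the five listed above; the header accounting and the moment-matching fit are my illustrative choices.

```python
import math

def nll_geometric(data, mean):
    """Negative log-likelihood under P(x) = (1 - q) q^x with q = mean / (1 + mean)."""
    q = mean / (1.0 + mean)
    return sum(-math.log(1.0 - q) - x * math.log(q) for x in data)

def nll_poisson(data, mean):
    """Negative log-likelihood under a Poisson with rate equal to the sample mean."""
    return sum(mean - x * math.log(mean) + math.lgamma(x + 1) for x in data)

def best_specific_format(data):
    """Pick the family/parameter pair giving the shortest total code.

    Both families are fit by matching the sample mean (assumed positive).
    The header charges one bit for the family choice plus 64 bits for the
    parameter; the same header for both families, so it does not affect
    the comparison, but it is the codelength price mentioned above.
    """
    mean = sum(data) / len(data)
    header = 65 * math.log(2)    # family flag + double, in nats
    scores = {
        "geometric": header + nll_geometric(data, mean),
        "poisson": header + nll_poisson(data, mean),
    }
    return min(scores, key=scores.get)
```

Data with a heavy spike at zero and a long tail selects the geometric family; data concentrated around its mean selects the Poisson.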
The highest meta-level choice is simply to specify a programming language as the format. A programming language is the ultimate meta-format, because one can specify any possible specific format by transmitting the source code of a program that decodes the format. For example, imagine that for a given dataset, the bzip2 program achieves very short codes. Then the programming language format can transform itself into the bzip2 format, simply by sending the source code for bzip2 as a prefix. Furthermore, this prefix is a one-time cost that becomes insignificant as the volume of data being sent becomes large. The programming language format provides the largest possible degree of flexibility, and also the largest overhead cost. But in the limit of large data, the programming language format may very well be the best choice.

In the case of the simple exponential meta-format, it was a trivial task to find the optimal specific format, which was defined by a λ value. In contrast, the problem of finding the optimal specific format when using the programming language meta-format is incomputable. Finding the optimal specific format is equivalent to finding the Kolmogorov complexity of the data [67]. The impossibility of finding the optimal specific format when starting from the programming language meta-format does not, however, imply that the latter should not be used.

Figure 6.2: Performance of the meta-theory vs. the optimal specific theory. In this mode, the data is analyzed only once, and the meta-theory adapts as it is exposed to more and more data. The x-axis is total data observed, and the y-axis is the compression rate per unit of data. The area between the two curves is the amount by which the meta-theory underperforms the optimal specific theory.
It hardly matters that the optimal specific format cannot be found, when the programming language meta-format will provide codelengths only trivially worse, in the limit of large data, than any other, less flexible meta-format.

6.1.1 Encoding Formats and Empirical Theories

Chapter 1 discussed the No Free Lunch theorem, which states that no compressor can achieve lossless compression when averaged over all possible inputs. The implication was that in order to achieve lossless compression in practice, a compressor needs to make an empirical assertion about the structure of the data set it targets. If this assertion is correct, compression is achieved; if not, the resulting encoded data will be larger than the original. Thus a compression format can be thought of as an empirical theory of the phenomenon that generates the data.

The meta-format schemes mentioned above do not circumvent the No Free Lunch theorem, which holds for any valid compressor. Since a meta-format can be implemented as a valid compressor, by using header data and prefixes, the theorem must hold for meta-formats as well. Since a meta-format is still a format, it is also an empirical theory, or more precisely an empirical meta-theory. To understand what this means, consider again the discussion of the exponential formats. A specific format makes the very precise assertion that the data will follow an exponential distribution with a particular λ value. If the real data distribution departs substantially from this assertion, the format will fail to achieve compression, and may end up inflating the data. The meta-format, in contrast, makes the weaker assertion that the data will follow an exponential distribution with some λ value. Again, if this assertion turns out to be false, the meta-format will inflate the data.
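The header cost can in fact be avoided altogether by re-estimating the parameter online while coding, a refinement taken up next. A minimal sketch for the exponential example, in which the initial guess of 3.0 mirrors the "dumb guess" scenario and the moment-matching update rule is my own choice:

```python
import math

def exp_codelength(x, lam):
    """L(x) = -log(1 - e^-lam) + lam * x, the exponential code in natural-log units."""
    return -math.log(1.0 - math.exp(-lam)) + lam * x

def plugin_codelength(data, lam_init=3.0):
    """Total codelength of the plug-in scheme: code each sample with the
    current estimate of lambda, then re-fit lambda to everything seen so far.
    No header is transmitted; the decoder performs the same updates."""
    total, count, sum_x, lam = 0.0, 0, 0, lam_init
    for x in data:
        total += exp_codelength(x, lam)
        count += 1
        sum_x += x
        mean = max(sum_x / count, 1e-6)     # guard against an all-zero prefix
        lam = math.log(1.0 + 1.0 / mean)    # inverts mean = e^-lam / (1 - e^-lam)
    return total
```

On data drawn from the λ = 2 distribution, this scheme pays a little extra early on but ends up well below the fixed λ_g = 3 guess, which is exactly the qualitative trade-off described below.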
At this point, an important technical consideration must be mentioned: there is a smarter way of using a meta-format than relying on header data and prefix information. This idea is called universal coding in the MDL literature [42]. The idea is to calculate the optimal parameter values from the data as it is being processed. In the case of the exponential data, the scheme starts out identical to the dumb guess strategy, using some uninformed value of λ_g. The difference is that as each data point x is processed and encoded, the value of λ_g is updated. After many data points are sent, the updated value of λ_g becomes nearly optimal. This scheme achieves poor performance at the outset, but gets better and better as more data points are processed; this idea is shown in Figure 6.2. This strategy is technically superior to the header data approach, because it does not require any overhead costs for encoding the optimal parameters. However, it has the same qualitative characteristics: it sacrifices performance in the low-N regime for flexibility and good performance in the high-N regime.

6.2 Local Scientific Theories

For most people, the phrase "scientific theory" generally refers to an idea like quantum mechanics or biological evolution. These ideas seem to be universal, or at least highly general; physicists assume quantum mechanics holds everywhere in the universe, and biologists believe that all organisms are shaped by evolution. This book employs a definition of a scientific theory that differs slightly from ordinary usage. For a comperical researcher, a scientific theory is any kind of computational tool that can be used to predict and therefore compress an empirical dataset. It is not even necessary for the predictions produced by the theory to be perfectly accurate, as long as they are better than random guessing.
Since the number of aspects of empirical reality that can potentially be measured is enormous, the comperical definition includes a very broad array of tools and techniques.

Table 6.1: Example Bird Sighting Database

Date       County    Location                 Species
2/22/2011  Coconino  Coconino                 Rough-legged Hawk
2/19/2011  Navajo    Heber                    Pine Siskin
2/16/2011  Pinal     Coolidge                 Bald Eagle
2/12/2011  Navajo    Jacques Marsh            Tundra Swan
2/9/2011   Apache    Greer                    Wilson's Snipe
2/9/2011   Maricopa  Peoria                   Least Bittern
2/9/2011   Apache    Eagar                    Rough-legged Hawk
2/2/2011   Pinal     Santa Cruz Flats         Rufous-backed Robin
2/2/2011   Pinal     Santa Cruz Flats         Crested Caracara
2/1/2011   Maricopa  Scottsdale               Zone-tailed Hawk
1/23/2011  Pinal     Coolidge                 McCown's Longspur
1/23/2011  Maricopa  Arlington Valley         Red-breasted Merganser
1/19/2011  Pinal     Kearney Lake             Ross's Goose
1/17/2011  Navajo    Woodruff                 Brewer's Sparrow
1/17/2011  Navajo    Woodruff                 Brewer's Blackbird
1/16/2011  Maricopa  Arlington Valley         Harlan's Red-tail Hawk
1/16/2011  Navajo    Woodruff                 Sandhill Crane
1/15/2011  Maricopa  Glendale Recharge Ponds  Western Sandpiper

This more inclusive definition elevates to the status of science, and thereby identifies as worthy of examination, several simple tools that may otherwise have been overlooked. One such example is the humble map. A person using a map correctly can make reliable predictions about various observations such as street names and building locations. Similarly, a smart compressor could use a map to substantially compress a dataset of the form given in Table 6.2. Another example of a newly recognized scientific theory is a dictionary (word list), which can be used to predict word outcomes when scanning through a stream of text. The dictionary will not, of course, provide exact predictions, but it does cut down on the number of possibilities (there are far fewer real 10-letter words than there are combinations of 10 letters).
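The word-list idea is easy to make concrete: a dictionary can be extracted from a corpus and then used to code words as indices rather than letter sequences. The toy corpus and the tokenization below are illustrative stand-ins of mine.

```python
import math
import re

def extract_dictionary(corpus: str):
    """Extract a word list from a corpus: find every distinct letter-combination.

    A real extraction would run over a large text collection; a toy sentence
    suffices to show the mechanism.
    """
    return set(re.findall(r"[a-z]+", corpus.lower()))

corpus = ("the quick brown fox jumps over the lazy dog and the dog barks "
          "while the fox runs over the brown field and jumps again")
words = extract_dictionary(corpus)

# Coding a five-letter word as an index into the lexicon is far cheaper than
# spelling it out, because legitimate words are rare among letter-combinations.
bits_spelling = 5 * math.log2(26)        # about 23.5 bits to spell five letters
bits_dictionary = math.log2(len(words))  # an index into the extracted word list
```

The gap between the two codelengths is exactly the "cut down on the number of possibilities" effect described above, and it grows as the word length increases.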
A less common example of this kind of computational tool is a bird range map, as shown in Figure 6.3. An observer using this map can make predictions relating to the types of bird species that might be sighted at different locations and times of the year. Consider a data set of the form shown in Table 6.1, and imagine one is attempting to predict the fourth column (bird species) from the first three columns (date and location information). Because bird species have characteristic geographical distributions, information of the kind embodied in Figure 6.3 can be used to help with the prediction. For example, if the location corresponds to an area in Alberta, and the date is in January, then the range map predicts that the American Goldfinch will not be observed. The map also predicts that the goldfinch will not be sighted in Florida during the summer, and that it will never be sighted in Alaska. One can easily imagine embellishing the range map with information relating to population statistics, allowing more accurate predictions that reflect the relative prevalence of various species.

Figure 6.3: Range map for the American Goldfinch. This map is a computational tool that can be used to predict a database with the form shown in Table 6.1, and can also be extracted from the information contained in such a database.

The computational tools described above have several properties in common. First, they are local as opposed to universal theories: the map only works for one geographic area, while the dictionary only works for one language. Second, while they are like physical theories in that they enable predictions, they employ an opposing set of inferential tools. Physical theories combine a minimalist set of inductive steps (the natural laws) with complex deductive reasoning such as calculus, linear algebra, and group theory. Physical theories can therefore be characterized as inductively simple but deductively complex.
In contrast, the local scientific theories mentioned above are logically simple to use, but contain a large number of parameters. Therefore, they can be viewed as employing the inverse of the inferential strategy used in physics: complex induction combined with simple deduction.

A third, and crucial, property of a local scientific theory is that it can be extracted from an appropriate dataset. This is most obvious for the case of a word list, which can be extracted from a large text corpus. To do so, one simply programs an algorithm to find every unique combination of letters in the corpus. More sophisticated techniques will, of course, lead to a more refined dictionary; a smart extraction algorithm might use knowledge of verb conjugations so that "swim", "swims", "swimming", and "swam" are packaged together under the same heading. It is also quite easy to imagine constructing a set of bird range maps from a large database of the form shown in Table 6.1. A street map can be constructed from a large corpus of observational data of the form shown in Table 6.2, though this is more complex algorithmically. So not only can a local scientific theory be used to predict a particular database, it can also be extracted from the same database.

Table 6.2: Table of driving observation data. These were generated by an internet mapping service, but the same kind of data could be generated by exploration and observation.
Operation       Street Name        Distance  Time
Slight right    Garden St          0.2 mi    -
Turn right      Mason St           0.1 mi    -
Turn right      Brattle St         344 ft    -
First left      Hawthorn St        0.1 mi    1 min
Continue onto   Memorial Dr        390 ft    -
Turn left       Memorial Dr        1.0 mi    2 min
Turn right      Western Ave        0.1 mi    -
Turn left       Soldiers Field Rd  0.2 mi    1 min
Take            I-90 Ramp          0.3 mi    -
Keep left       I-90 East          3.4 mi    4 min
Take exit 24C   I-93 North         0.7 mi    1 min
Merge           I-93 South         15.3 mi   17 min
Take exit 1     I-95 South         80.0 mi   84 min
Take exit 90    CT-27              0.3 mi    -
Turn left       Whitehall Ave      1.5 mi    3 min
Turn right      Holmes St          0.3 mi    1 min
Turn right      East Main St       0.2 mi    1 min
Take 2nd left   Water St           0.2 mi    -
Take 2nd right  Noank Rd           459 ft    -

6.3 Cartographic Meta-Theory

There is something very interesting about the concept of a map, or more precisely the concept of mappability, which the following thought experiment may help reveal. Imagine an outbreak of polio has been observed in a remote Indian town. A team of health experts has been sent by the United Nations to administer the polio vaccine to everyone in the town. The doctors plan to conduct a systematic door-to-door sweep of the town, administering the vaccine to everyone they can find. However, they have encountered an obstacle in carrying out their plan: due to an odd prohibition promulgated by the local religious guru, there are no maps of the town (the guru believes that becoming lost is important for spiritual health). The lack of a good map presents no special difficulties, however, because the team can simply draw up a map on their own. The interesting aspect of this thought experiment is the epistemic state the doctors are in before they construct the map. They have never visited the town, and they possess only the most rudimentary information about it. Nevertheless, they are completely confident in their ability to construct a map of the town.
Furthermore, they are highly confident that once this map is written, it will help with the door-to-door vaccination effort. The doctors therefore make a meta-prediction: the map, once created, will make good predictions. They are highly confident in this meta-prediction, in spite of the fact that the map has not yet been written and that they have no idea what it will contain. They are, in effect, utilizing an empirical meta-theory, which can be called the cartographic meta-theory. To see that the cartographic idea must be an empirical theory, it is sufficient to show that it can be used to construct a compressor. Consider a large database of observations of the form of Table 6.2. Observe that a street map will be useful to predict and compress this data. To see this, notice that a certain section of the observations describes turning right onto Western Ave. from Memorial Dr. This specifies a precise location on the map. The observations then involve turning left after a distance of 0.1 miles. The map can now be used to predict that the name of this street will be "Soldiers Field Rd", because there is presumably only one road that intersects Western Ave. at that point. Now, when the compressor begins, it will not have a map, and so it won't be able to make any confident predictions. But if the corpus is large, then eventually the compressor will have enough data to infer a street map and use it to compress the subsequent observations. So perhaps the compressor will achieve no compression for the first 10 Mb of data, but 60% for the next 90 Mb. Since it can be used to build a compressor, the cartographic principle must be empirical, and since it actually achieves compression, it must correspond to reality. The success of the cartographic meta-theory reflects a certain aspect of reality: the fact that streets and buildings occupy fixed geographic positions.
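The inference step described here can be reduced to a toy sketch. The class name, the data representation, and the assumption that each (street, action) pair determines a unique next street are all invented for illustration; a real map-inference algorithm would be far more involved.

```python
# Toy sketch: a "map" inferred as a lookup from (current street, action)
# to the street reached, mirroring how the compressor learns to predict
# "Soldiers Field Rd" after observing enough driving data.
class StreetMapModel:
    def __init__(self):
        self.map = {}  # (current_street, action) -> next_street

    def predict(self, street, action):
        # Returns None until the intersection has been observed.
        return self.map.get((street, action))

    def observe(self, street, action, next_street):
        self.map[(street, action)] = next_street

model = StreetMapModel()
# Before any data, no confident prediction is possible (no compression).
assert model.predict("Western Ave", "turn left") is None
# One observation suffices to pin down the intersection; later occurrences
# of the same maneuver can then be encoded very cheaply.
model.observe("Western Ave", "turn left", "Soldiers Field Rd")
assert model.predict("Western Ave", "turn left") == "Soldiers Field Rd"
```

The early, unpredictable observations correspond to the first 10 Mb that achieve no compression; once the map is filled in, subsequent observations become highly predictable.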
One can imagine a floating city made up of a shifting patchwork of boats and barges in which the cartographic principle will fail. There is a strong similarity between the cartographic meta-theory and the adaptive-λ scheme mentioned above. In both cases, one starts with a flexible meta-format. Then one performs some calculation or computation on the data. This computation produces a set of parameters that define a specific format. The difference between the two schemes is only a matter of scale. The cartographic meta-format requires many more parameters, and much more computation to find the optimal ones. But the idea of using an algorithm to find a specific format based on a meta-format and a corpus of data is exactly the same.

6.4 Search for New Empirical Meta-Theories

It is rare for conceptual constructs to be unique; mathematics contains many distinct theorems and physics includes many natural laws. Therefore it seems unlikely that the cartographic meta-theory mentioned above is the unique example of its class. And indeed, a brief consideration will reveal several other examples. For many languages, a lexicographic meta-theory holds, which asserts that dictionaries can be constructed. Another meta-theory asserts that it is possible to build bird range maps. Notice, again, that these other meta-theories make statements that are abstract, but still empirical. The lexicographic meta-theory implies that languages are made up of words, and the set of legitimate words is far smaller than the set of possible letter-combinations. The avian meta-theory exploits the fact that bird species are not distributed randomly across the globe, but tend to inhabit characteristic regions that change from season to season, but not from year to year. Because meta-theories correspond to meta-formats, and meta-formats are still formats, the assertions embodied by the meta-theories can be tested using the compression principle.
If bird species did not cluster in characteristic regions, the corresponding meta-theory would fail to achieve compression. It is not, perhaps, very interesting to prove the meta-theories mentioned above, because they are already intuitively obvious. However, if there are other valid meta-theories, some presently unknown, then the compression principle can be used to validate them. Some reflection leads to new examples of other possible meta-theories. One example, related to the roadside video dataset proposed in Chapter 1, is a list of car types, such as the Toyota Corolla and the Honda Civic, that includes information about the dimensions of those types. When a Toyota Corolla appears in the video data, the encoder could then specify the car type only and save bits by obviating the need to transmit shape information. In the terminology of Chapter 3, this would correspond to defining a scene description language with a built-in function that draws a Toyota Corolla. By using this specialized function when appropriate, shorter codelengths could be achieved in comparison with using an element that draws a generic automobile. Another interesting possibility could be called a "texture dictionary". Different kinds of materials have different visual textures: asphalt, brick, tree leaves, and clouds all have very distinctive textures. Such a dictionary might be very powerful in combination with an image segmentation algorithm. The algorithm would attempt to identify the texture of a pixel region, and then package together pixels with similar textures. Using the dictionary would save bits by using a prespecified statistical model for a pixel region. Since textures are often associated with materials or objects, this technique could lead to further improvements relating to the understanding of characteristic object shapes.

6.5 Unification

A major theme in physics is the process of theoretical unification.
Unification occurs when two or more phenomena, which were previously described with completely unrelated theories, are shown to be describable by a single theory, which is more general and abstract. Since a single unified theory is simpler than multiple specialized theories, unification indicates that the new theory is more correct and powerful. The history of science records several examples of the process of unification; one of the most dramatic comes from the field of electromagnetics. Until the time of Maxwell, the physical theories of electricity and magnetism were not packaged together. Coulomb found that an electric field was produced by the existence of a concentration of charge at some point in space. The electric field points directly away from (or towards) the charged particles. Ampère found that magnetic fields could be created by electrical current, and that the magnetic field lines circle around the direction of the current. Faraday, in turn, found that a change in the magnetic field creates an electric field. These relationships were packaged together by Maxwell, who proposed the following compact set of equations:

    ∇ · E = ρ/ε₀
    ∇ · B = 0
    ∇ × E = −∂B/∂t
    ∇ × B = µ₀J + µ₀ε₀ ∂E/∂t

where E is the electric field, B is the magnetic field, ρ is a charge density, J is the current, and ε₀ and µ₀ are the permittivity and permeability of free space. These equations elegantly express the geometrical relationship of the various quantities to one another. The divergence operator, ∇·, indicates straight lines radiating outward from a source, while the curl operator, ∇×, indicates that the field lines rotate around a central axis. Maxwell's key contribution was the term µ₀ε₀ ∂E/∂t, which expresses an effect that had not been measured at that time. This term has crucial implications, because it shows that a magnetic field is produced not only by a current (J), but also by a change in the electric field. This explains the phenomenon of light.
Since a change in the electric field causes a magnetic field, and a change in the magnetic field causes an electric field, an initial disturbance in either field will propagate through space. This propagation is a light wave. Maxwell's equations therefore present a unified description of three phenomena: electricity, magnetism, and light. The following thought experiment shows how a similar process of unification can take place within comperical science.

6.5.1 The Form of Birds

After his discussions with Sophie about the idea of virtual labels, Simon began to make rapid progress in the development of theories of birdsong. He would use the synthesizer to create virtual labels for a large number of sound clips, then use the virtual labels to create a fast detector. When the detector worked for the first time, and he ran outside to find a Blue Jay chirping in a tree next to the microphone, he experienced a feeling of vertiginous elation. Unfortunately, the birdsong synthesizer program he found on the internet was able to produce vocalizations for only a couple of bird species. To go further, it was necessary for Simon to develop extensions to the synthesizer himself. At the outset, Simon felt excited about this work. Writing a synthesizer for a new species was an interesting challenge, requiring him to study the acoustic properties of each species' song and figure out how to express those properties in computer code. After the synthesizer was completed, he used it to create virtual labels and then train a recognizer for the bird. Eventually, though, his old impatience got the better of him again. The process of writing a synthesizer began to feel like a tedious and painstaking chore. It took Simon almost a month to write a synthesizer for a single species, and there are hundreds of bird species in North America. Simon began to despair of ever completing his task.
During a pause in his efforts, Simon was assigned to read Plato's "Republic" in his high school philosophy class. Many of Plato's ideas seemed incomprehensible to Simon, but when he started to read about the theory of Forms, he felt that Plato's ideas connected somehow with his own studies. Plato observed that all people agreed more or less on the meaning of the word "dog", and the properties attached to the dog-concept. A man from Sparta would mostly agree with a man from Athens about the concept, even if the two men had never observed the same dog. Plato explained this phenomenon by postulating the existence of Forms - abstract versions of each particular entity. Thus, all real dogs were simple imitations or manifestations of the Form of Dogs, which existed on an entirely higher plane. By observing many dogs, the Spartan and the Athenian both come to perceive the Form of Dogs, and this explains why they can agree on what the word "dog" means. The Forms also shape perception. When one observes a dog, one perceives the dog in terms of its Form. The Form specifies the range of variation allowed in each particular instance. Thus dogs may be small or large, with floppy ears or pointed ears, and so on. To understand a particular dog, one combines the relevant observed properties with the abstract invariants determined by the Form. By doing so, one obtains a good understanding of any newly observed dog, without needing to conduct a comprehensive study of it. Simon thought about the idea of Forms for a long time, while also trying to figure out how to expedite his project of detecting and recognizing birdsong. After several months, he came to believe that the idea of Forms held the secret to his problem. Each particular bird species had a unique song. But all of the songs shared general properties.
These general properties were dictated by the biomechanical properties of the birds' vocal system, and the range of frequencies to which their ears were sensitive. Simon did not know exactly what the precise characteristics of these general properties were. But that information was implicitly contained in the database; he just needed to extract it. So he only needed to write a learning algorithm that inferred the properties of each species' song from the data. After several months of hard intellectual effort, Simon succeeded. He obtained a program that, when invoked on the database of treetop recordings, would produce specific local theories describing the song of each species. As more and more data was obtained, the performance of the automatically generated local theories became just as good as Simon's original manually-developed theories (Figure 6.2). The local theory for a given species contained both a synthesizer (generative model) and a detector for the corresponding song. For that reason, Simon's research was of interest to bird-watchers and environmentalists. By working with a group of ecologists studying the Amazon rain forest, Simon obtained a large database of recordings from that ecosystem. By applying the meta-theory to that database, Simon was able to detect the existence of a previously unknown bird species, which he named after his aunt. After the meta-theory of birdsong was complete, a natural next step was to study the appearance of birds. The necessary data was easily obtained, by putting cameras in treetops and other promising locations. Simon was initially stunned by the difficulty of working with image data. Visual effects such as occlusion and lighting, in combination with the simple fact that a small change in pose could cause a substantial change in appearance, made the problem devilishly hard.
But by narrowing his attention down to a single species, and by assiduously following the compression rate, Simon was able to build a local theory for the American Goldfinch. Then he developed theories for the Blue Jay, Spotted Owl, Osprey, and the Raven. Finally, after writing several local theories, Simon was able to construct a meta-theory of the appearance of birds.

6.5.2 Unification in Comperical Science

The thought experiment given above illustrates the process of unification in comperical science. Simon was able to jump from a set of unrelated theories describing specific birdsongs to a unified and abstract theory describing birdsong in general. In addition to its practical time-saving value, this abstract theory is obviously much more satisfying from an intellectual perspective. Of course, the unification process is incomparably easier to describe as a thought experiment than to actually perform. To understand the nature of this difficulty, compare the hypothesized birdsong meta-theory to the well-understood cartographic meta-theory (the Form of Maps). When combined with a large quantity of observational data, the latter produces a specific local theory describing a geographical region. The basic idea of the cartographic meta-theory is understood by everyone. Nevertheless, it is still quite difficult to discover the precise algorithmic process by which to construct a specific theory out of a large corpus of data. The birdsong meta-theory performs an analogous role: given a large corpus of treetop recordings, it produces several specific local theories of the songs of individual bird species. However, unlike the cartographic meta-theory, the nature of the birdsong meta-theory is currently unknown. To complete it will require an extensive study of birdsong, and an effort to discover the invariants shared by all songs, as well as the abstract dimensions of variation between songs.
Then additional work will be required to find a learning algorithm that can automatically extract the relevant parameters from an appropriate dataset. If the ideas above are at all clear, it should be obvious that there is a wide range of categories susceptible to the process of unification. Consider the roadside video camera database proposed in Chapter 1. One component of that research might be the use of templates describing various car designs, such as the Toyota Corolla. Such templates will allow the compressor to save bits by encoding only the category of the car, instead of its precise geometry. At the outset, researchers will probably need to design these templates by hand. But after templates have been developed for several car types, researchers should begin to discern the basic pattern of the process. Once templates have been created for the Corolla, the Honda Civic, the Ford Explorer, and the Mazda Miata, it will not be much harder to construct the template for the Hyundai Sonata. At this point it should become possible to construct a meta-theory of the appearance of cars (the Form of Cars), and a corresponding learning algorithm. The learning algorithm uses the meta-theory and a large corpus of video to obtain specific theories for each car type. A similar process may work for trees, insects, flowers, motorcycles, clothing, and houses. The next section argues that the process may also hold for language.

6.5.3 Universal Grammar and the Form of Language

A concept called Universal Grammar (UG) is a major theme and goal in the field of linguistics [21]. At least two key observations motivate the search for UG. First, any normal human child growing up in any linguistic environment will be able to achieve fluency in its native tongue. A child of Japanese ethnicity growing up in Peru will become perfectly fluent in Peruvian Spanish.
Second, children are able to infer facts about grammar and syntax from a limited amount of data. In particular, children receive very little exposure to negative examples (ungrammatical utterances). In spite of this, they reliably learn to distinguish grammatical sentences from ungrammatical ones. These ideas imply that humans must have a powerful innate ability to learn language. Related to the idea of UG is the observation that human languages seem to obey a set of linguistic universals. All languages have grammar and use words that can be categorized into parts of speech such as "noun" and "verb". The linguist Joseph Greenberg proposed a set of 45 linguistic universals, of which the following is an example: in declarative sentences with nominal subject and object, the dominant order is almost always one in which the subject precedes the object [41]. From a learning-theoretic point of view, the existence of linguistic universals is not surprising, given the fact mentioned above that children can learn correct grammar from a small number of examples. If there were no such structure or constraints on the form of language, children (or any other learning system) would require much more data to infer the correct grammatical rules. The idea of Universal Grammar, and the search for its properties, can be formulated in terms of comperical philosophy as follows. Linguistic competence represents a specific local theory of some language. This local theory contains a variety of components, including but not limited to vocabulary lists, part-of-speech information, and grammatical rules. However, humans are not born with this specific theory. Instead, they are born with a meta-theory. This meta-theory, when combined with a reasonably sized corpus of data, produces a specific theory. The above claim can be transformed into a concrete computational statement.
Consider a set of linguistic databases of several different languages (English, Chinese, and so on). For each individual database, there is some specific optimal theory-compressor that can be used to compress the database to the smallest possible size. The comperical translation of the UG claim is that there exists a single meta-theory that can adapt to the structure of each language, and produce compression rates that are asymptotically equivalent to the rates produced by the language-specific theories (Figure 6.2). The key point that bears reiteration is that the meta-theory must still be empirical. It must make assumptions about the structure of the data to which it is applied, or else it cannot achieve compression. These assumptions will take the form of abstract statements, such as the one due to Greenberg mentioned above. Proposed theories of UG can therefore be tested using the compression principle. If the theories are too restrictive, they will fail to describe some languages well, and therefore fail to compress the corresponding databases. If the theories are too loose, they will fail to exploit structure that exists, and underperform relative to more aggressive theories. Therefore, the compression principle can be used to find the optimal meta-theory that describes linguistic universals while also acknowledging linguistic diversity.

6.5.4 The Form of Forms

After several years of studying birds, Simon's attention began to wander. There were still some small challenges related to birds, such as attempting to identify individuals, but he felt that he had solved all the major problems. One day he was studying an image of a Spotted Owl that was partially obscured by tree leaves. Because the bird was obscured, his compressor had difficulty with the image. He found himself studying the shapes of the leaves and the branches of the tree. He realized that trees had their own unique characteristics that were also worthy of study.
Working with roboticists, he designed a robot that drove through the forest and photographed trees. The robot was programmed to take close-up images of the bark, roots, and leaves of each tree, in addition to taking full-scale photographs of the entire tree. Several copies of the robot were built, and deployed in forests around the world. By studying the database of images sent back by the robots, Simon began an inquiry into the appearance of trees. He learned about different textures of bark and different types of leaves, and how those two features were correlated. He was able to deduce a cladogram of tree species by analyzing the similarities in their outer appearance. Finally, after building specific theories for oaks, pines, elms, birches, and palm trees, he was able to construct a meta-theory of trees. The tools Simon developed in his study were useful to park rangers and logging companies, who were interested in monitoring the number of trees of each type in a given region. While trees were an interesting challenge, they only held Simon's attention for a couple of years. He met another group of roboticists, who were working on an underwater robot, and he persuaded them to help him construct a large database of images of fish. By studying this database, Simon learned theories of the appearance of marlins, trout, catfish, sharks, and lampreys, and finally he obtained a meta-theory of fish. These theories helped conservationists to monitor the populations of different fish species, and helped fishermen locate their targets. Simon then moved on to a study of insects. He built specialized camera systems that were able to track insects as they crawled or flew, and take high-frame-rate image sequences of their motion. Based on the resulting database, he began to construct and refine theories of the appearance of insect species. He began with bees, and then moved on to ants, and then to cockroaches.
The difficulty he had with each new species caused him to develop a grudging respect for the tremendous biodiversity of insects. Finally, after years of study, he was able to obtain a unified meta-theory of insects that successfully described the entire class, including everything from wasps and spiders to ladybugs and butterflies. These tools helped farmers to protect their crops against insects, and so reduce their dependence on chemical pesticides. They also helped scientists in Africa to monitor the activity of mosquitoes, thereby contributing to efforts in the fight against malaria. After more than a decade of studying the biological world, Simon switched his attention to the social world, and he began a study of written language. He assembled large corpora of text, and applied Sophie's method to these databases. By doing so he learned about grammatical rules, parts of speech, and linguistic topics such as Wh-movement and the distinction between complements and adjuncts. After studying written language for a while, he broadened his inquiry to include spoken language as well. He learned that spoken language obeyed a different set of grammatical rules than written language. After long efforts spent developing local theories of English, Japanese, Arabic, and Gujarati, he began to see the abstract patterns common to all human languages, and he was able to discover a meta-theory of language. The tools he developed became useful for a variety of purposes such as information retrieval, machine translation, and speech recognition. Though he much preferred images of trees and animals to images of cars and people, he realized the next logical step was to study the human visual environment. He persuaded a taxi cab company to mount cameras on top of their cars, and collect the resulting video streams. From this data, he studied the appearance of pedestrians, buildings, cars, and many other aspects of the urban panorama.
He discovered meta-theories for each category of object. In each area Simon studied, his procedure was the same. First he would develop a series of local scientific theories that described a specific set of elements of a class very well. In studying trees, he first built theories of the appearance of oaks, then elms, then birches, then pines. As he constructed each new local theory, he would begin to see the underlying structure of the class. A local theory of a particular tree species required an analysis of its bark, its leaf pattern, its branching rate, and a couple of other components. Once he understood these parameters of variation, he would construct a meta-theory describing the entire class. The meta-theory, combined with the large database and a learning algorithm, would automatically construct local theories for each particular member of the class. At first the different categories, such as birds, trees, languages, and buildings, seemed to have almost nothing in common with one another. When starting in each new area, Simon would have to begin again almost from scratch. But as he studied more and more categories, he began to notice their similarities. Just as computer scientists constantly reuse the list, the map, and the for-loop, and physicists constantly reuse calculus and linear algebra, Simon found himself using the same set of tools over and over again. He began to sense the possibility of a still more abstract theory. Instead of producing local theories directly, this new meta-meta-theory would, when supplied with a sufficient quantity of data, produce meta-theories, which would in turn produce local theories. A meta-theory was like a factory, which instead of transforming raw material into cars, airplanes, and electronics, transformed data into local theories. The meta-meta-theory would be like a factory for constructing factories.
By combining all his previous datasets together, and liberally adding a wide variety of new ones, Simon constructed the largest database he had ever used. Then he began his ultimate quest: the search for the Form of Forms.

Appendix A

Information Theory

A.1 Basic Principle of Data Compression

This appendix presents a very brief overview of some of the core concepts of information theory required to understand the book. For a fuller treatment, see [25]. The motivating question of information theory is: how can a given data set {x₁, x₂ ... x_N} be encoded as a bit string using the smallest possible number of bits? In spite of the very technical nature of this question, the answer turns out to be quite deep. To motivate this problem, imagine trying to encode an immense database of DNA data. The DNA information is represented as a long sequence of the letters G, A, T, and C, as follows:

    AAAAGGTCTAGTCAAAGTCGAGAAAGCTCGGGAAAAATGCAATC...

Since there are four different outcomes, an obvious scheme would be to encode each outcome using 2 bits. For example, the following assignment would work: A → 00, G → 01, T → 10, C → 11. Here, each outcome requires 2 bits to encode, so if the database includes 10⁶ outcomes, then this scheme would require 2 · 10⁶ bits to encode it. However, suppose that it is found that some DNA outcomes are much more common than others. In particular, 50% of the outcomes are A, 25% are G, while C and T each represent 12.5% of the total. In this case, it turns out that a different encoding scheme will achieve better results. Let A → 1, G → 01, T → 001, C → 000. This scheme achieves the following codelength for the full database:

    L = N (P(A)L(A) + P(G)L(G) + P(T)L(T) + P(C)L(C))
      = 10⁶ (1/2 · 1 + 1/4 · 2 + 1/8 · 3 + 1/8 · 3)
      = 10⁶ · 1.75

where P(x) is the frequency of an outcome x, and L(x) is the number of bits used by the encoding scheme to represent x.
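The arithmetic above is easy to check directly. The following sketch compares the expected total codelength of the two schemes on the stated outcome frequencies (the function name is invented for the example):

```python
# Codelengths (bits per outcome) for the two schemes described above.
uniform = {"A": 2, "G": 2, "T": 2, "C": 2}
skewed  = {"A": 1, "G": 2, "T": 3, "C": 3}
freqs   = {"A": 0.50, "G": 0.25, "T": 0.125, "C": 0.125}
N = 10**6  # size of the database

def expected_bits(code, freqs, n):
    """Total bits to encode n outcomes drawn with the given frequencies."""
    return n * sum(freqs[x] * code[x] for x in freqs)

print(expected_bits(uniform, freqs, N))  # 2000000.0
print(expected_bits(skewed, freqs, N))   # 1750000.0
```

The skewed code saves a quarter of a million bits, exactly the 2.5 · 10⁵ figure computed in the text.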
Evidently, this scheme saves 2.5 · 10⁵ bits compared to the previous scheme. Since the A outcomes are very common, it is possible to save bits overall by assigning them a very short code, while reserving longer codes for the other outcomes. When rearranging codes like this, there is an important rule that must be obeyed, which can be illustrated by imagining that the encoding scheme used A → 1 and G → 11. The problem is that when the decoder sees a 1, it will think the outcome is an A, even if the encoder wants to send a G. In general, the rule is that no outcome can be assigned a code that is the prefix of some other outcome's code. This is called the prefix-free rule. This rule can be used to derive an important relationship, called Kraft's inequality, governing the codelengths L(x) that are used to encode a data set:

    Σ_x 2^(−L(x)) ≤ 1

This sum runs over all possible outcomes, so in the DNA example there would be four terms, corresponding to A, G, T, and C. Kraft's inequality is a mathematical expression of the intrinsic scarcity of short codes: if there are 16 possible data outcomes, they cannot all be assigned codelengths less than 4 bits, because the sum of the 2^(−L(x)) terms would be greater than 1. Kraft's inequality holds for any viable prefix-free code, whether or not it is any good. But if the sum is substantially less than one, it means that the code is actually wasting bits. For any codelength function L(x) that is not trivially suboptimal, Kraft's inequality becomes an equality. Because of this, the function P(x) = 2^(−L(x)) satisfies the criteria for a probability distribution: 0 ≤ P(x) ≤ 1, and the normalization requirement Σ_x P(x) = 1. This suggests a relationship between the probability distribution of a set of data outcomes and the codelength function one should use to encode them. The DNA example given above suggests the basic strategy of data compression.
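Kraft's inequality can be checked numerically for any proposed set of codelengths. A small sketch (the function name is invented for the example):

```python
def kraft_sum(codelengths):
    """Sum of 2^{-L(x)} over all outcomes; at most 1 for any prefix-free code."""
    return sum(2 ** -L for L in codelengths)

# The DNA code A -> 1, G -> 01, T -> 001, C -> 000 exactly saturates the bound,
# so it wastes no bits:
assert kraft_sum([1, 2, 3, 3]) == 1.0
# Sixteen outcomes cannot all receive 3-bit codes: the sum exceeds 1,
# so no prefix-free code with those lengths exists.
assert kraft_sum([3] * 16) == 2.0
```

A sum strictly below 1, for example kraft_sum([2, 2, 3, 3]) = 0.75, signals a code that is viable but wasteful.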
In order to save bits on average in spite of the intrinsic scarcity of short codes, it is essential to assign long codes to the uncommon outcomes, while reserving short codes for the common outcomes. This suggests a relationship between the probability P(x) of an outcome and the optimal codelength L(x) that should be used to encode it. But what exactly is the relationship between the two functions? In fact, the actual relationship was hinted at a moment ago. Claude Shannon proved that the optimal codelength L(x) is related to the probability P(x) as follows:

    L(x) = −log P(x)        (A.1)

Here and in the following, all logarithms are assumed to be in base 2. Take a moment to consider what this equation is saying. It does not dictate the choice of codelength functions; rather, it says that any other function will perform worse on average. The caveat "on average" is necessary because it is not impossible for the observed frequency of an outcome to be very different from its probability. Consider a case where the probability P(A) of A is 50%, but by chance it is observed at a frequency of only 10%; in such a case a codelength function that corresponds to the observed frequency will do better than one corresponding to the predicted probability. Also, the codelength-probability relationship only specifies the lengths of the codes; it gives no guidance about how to construct them. The relationship between codelength and probability specified by Equation A.1 indicates a deep relationship between statistical modeling and compression. In order to achieve the shortest possible codes, a sender wishing to transmit a data set {x₁, x₂ ...} must know the distribution P(x) that generated the data. It also indicates that there is a one-to-one relationship between a discrete statistical model and a compression program for the type of data specified by the model.
Given the codelength-probability equivalence, it is natural to define the entropy:

    H(P) = \sum_x P(x) L(x) = -\sum_x P(x) \log P(x)    (A.2)

The sums run over all possible outcomes x. The entropy of a distribution P(x) is the expected codelength achieved by the optimal code for P(x). There are several important facts to notice about the entropy. First, it is a function of the distribution P(x). Second, upper bounds for the entropy can easily be found. For example, if there are 16 possible outcomes, then the entropy is at most 4 bits. Generally speaking, the greater the number of outcomes in the distribution, the larger the entropy.

A question that arises often in statistical modeling is: what happens if an imperfect model Q(x) is used instead of the real distribution P(x), which may be unknown? Since Q(x) is imperfect, the codelengths L_Q(x) = -\log Q(x) specified by Q(x) will be suboptimal. The codelength penalty for using these imperfect codes instead of the perfect codes L_P(x) = -\log P(x) based on the real distribution is:

    D(P||Q) = \sum_x P(x) L_Q(x) - \sum_x P(x) L_P(x)    (A.3)
            = \sum_x P(x) \log \frac{P(x)}{Q(x)}         (A.4)

The quantity D(P||Q) is known as the Kullback-Leibler divergence (KLD). It is easy to show that the Kullback-Leibler divergence is always non-negative, and becomes zero only if Q(x) = P(x). This implies that no scheme for assigning codelengths to data outcomes can achieve better codelengths on average than L(x) = -\log P(x). It is possible to get better codes for certain outcomes, but by reducing the codelength of some outcomes, it is necessary to increase the codelength of others, and this will hurt the overall average. Though the entropy and the KLD are fascinating concepts, they are not very useful in practice, because they both depend on P(x), which is generally unknown.
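For small discrete distributions these quantities are easy to compute directly. The sketch below (with made-up distributions P and Q) evaluates H(P) and D(P||Q) and checks the non-negativity property:

```python
import math

def entropy(P):
    """H(P) = -sum_x P(x) log2 P(x): expected optimal codelength in bits."""
    return -sum(p * math.log2(p) for p in P.values() if p > 0)

def kl_divergence(P, Q):
    """D(P||Q) = sum_x P(x) log2(P(x)/Q(x)): per-outcome codelength
    penalty for encoding data from P with codes built from Q."""
    return sum(p * math.log2(p / Q[x]) for x, p in P.items() if p > 0)

P = {"A": 0.5, "G": 0.25, "T": 0.125, "C": 0.125}  # "true" distribution
Q = {"A": 0.25, "G": 0.25, "T": 0.25, "C": 0.25}   # imperfect uniform model

print(entropy(P))           # 1.75 bits
print(kl_divergence(P, Q))  # 0.25 bits of penalty per outcome
assert kl_divergence(P, Q) >= 0          # KLD is never negative
assert kl_divergence(P, P) == 0          # and zero only when Q = P
```

Here the uniform model Q wastes a quarter of a bit per outcome relative to the optimal code for P.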
A more useful quantity is the realized codelength:

    \sum_{k=1}^{N} L(x_k) = -\sum_{k=1}^{N} \log Q(x_k)    (A.5)

Here the sum goes over the N-sample data set {x_k}. Since this quantity depends only on the model Q(x) and the data set, it can be computed in practice. Also, if the data set is large, it should be related to the entropy and the KLD, since:

    -\frac{1}{N} \sum_{k=1}^{N} \log Q(x_k) \approx -\sum_x P(x) \log Q(x)
        = -\sum_x P(x) \log P(x) + \sum_x P(x) \log \frac{P(x)}{Q(x)}
        = H(P) + D(P||Q)

In words, the average realized per-sample codelength for the model Q(x) is about equal to the entropy H(P) plus the KLD D(P||Q). This provides one motivation for the goal of minimizing the codelength. H(P) is a constant of the distribution P, so it can never be reduced. Thus, reducing the codelength is equivalent to reducing the KLD. The only way to reduce the KLD is to obtain better and better models Q(x), since as Q(x) → P(x), D(P||Q) → 0.

A second motivation for attempting to minimize the codelength is the relationship to the Maximum Likelihood Principle. Given a model family Q(x; θ) that depends on parameters θ, this principle suggests choosing θ to maximize the likelihood of the data given the model:

    \theta^* = \arg\max_\theta \prod_k Q(x_k; \theta) = \arg\min_\theta \sum_k -\log Q(x_k; \theta)

In words, the codelength is minimized when the likelihood is maximized.

A.1.1 Encoding

The codelength-probability relationship of Equation A.1 specifies the lengths of the optimal codes. But how then can the code itself be constructed? That is to say, given a model Q(x), how do we transform a series of outcomes {x_1, x_2, ..., x_N} into a bit string S whose length satisfies Equation A.1? This is called the encoding problem, and it is non-trivial. Fortunately, however, very general solutions to the encoding problem exist. The first method proposed that was guaranteed to achieve near-optimal codelengths is called Huffman encoding [50].
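A minimal Huffman coder can be sketched with Python's heapq module (an illustrative sketch of the algorithm, not tied to any particular implementation; the distribution is made up). It repeatedly merges the two least probable subtrees, which yields a prefix-free code whose lengths are near-optimal:

```python
import heapq
import math

def huffman_codes(P):
    """Build a prefix-free code for distribution P via Huffman's algorithm."""
    # Heap entries: (probability, tiebreak, {symbol: code-so-far}).
    heap = [(p, i, {x: ""}) for i, (x, p) in enumerate(sorted(P.items()))]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)  # two least probable subtrees
        p2, i, c2 = heapq.heappop(heap)
        merged = {x: "0" + c for x, c in c1.items()}
        merged.update({x: "1" + c for x, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, i, merged))
    return heap[0][2]

P = {"A": 0.5, "G": 0.25, "T": 0.125, "C": 0.125}  # illustrative distribution
codes = huffman_codes(P)

# For dyadic probabilities the Huffman lengths match -log2 P(x) exactly.
for x, p in P.items():
    assert len(codes[x]) == -math.log2(p)

# Prefix-free check: no code is a prefix of another outcome's code.
vals = list(codes.values())
assert all(not b.startswith(a) for a in vals for b in vals if a != b)
```

For probabilities that are not powers of two, Huffman lengths can exceed the ideal -log2 P(x) by up to one bit per symbol, which is one reason arithmetic coding (discussed next) is often preferred.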
Another method, called arithmetic encoding [94, 124], is also very useful. These encoding techniques allow compression researchers to avoid a variety of technical problems. One such problem is: given that probabilities can take on arbitrary values between zero and one, but a bit string must have an integral length, how can L(x) = -\log P(x) produce codes of integral length? The solution is to chain together multiple outcomes, so that while the total codelength is not quite integral, not much is lost by simply rounding it up.

Appendix B

Related Work

This appendix describes research that is closely related to the comperical idea. In several cases, the related research has had a strong influence in shaping the ideas of the present book. This is particularly true for the Hutter Prize, Hinton's philosophy of generative models, and the research on statistical language modeling that was described in Chapter 4.

B.1 Hutter Prize

Matt Mahoney, a researcher at the Florida Institute of Technology, was the first to advocate the problem of large scale lossless data compression. Mahoney saw the deep interest of the compression problem and its methodological advantages. In particular, he argued that achieving prediction accuracy on text comparable to what a human can achieve is related to the Turing Test and therefore nearly equivalent to the problem of artificial intelligence. Mahoney's justification is included in the following paper:

• Matt Mahoney, "Rationale for a Large Text Compression Benchmark" [71]*

Based on this philosophy, Mahoney set up a text compression contest targeting a 100 MB chunk of the English Wikipedia. The contest was funded by and named after Marcus Hutter, a researcher who became interested in the compression problem for theoretical reasons [51]. Mahoney developed a compressor called PAQ which achieved the first strong result on the Wikipedia data.
Afterwards, a small group of researchers from around the world contributed improved techniques, leading to a gradual improvement in the compression results. Most of these contributions involved either preprocessors or modifications to PAQ.

While Mahoney's ideas and the Hutter Prize are a strong influence on the present book, there is a substantial gap between the two philosophies. Mahoney did not clearly articulate the essential component of the comperical philosophy, which is the equivalence between lossless data compression and empirical science. Mahoney and Hutter both seem to suffer from the computer science mentality, which suggests the existence of one single, general, all-purpose data compressor. There are two reasons why this belief had a negative influence on the research related to the Hutter Prize.

The first negative influence was that contributions to the contest were mostly general-purpose compressors. PAQ itself is a smart algorithm for integrating statistics (predictions) from many history-dependent context functions. Because of the linear nature of text, it is possible to make relatively good predictions based on the immediate history, and PAQ works well for that reason. PAQ would also work well for several other types of data, such as music and programming language source code. Few participants in the contest seemed interested in developing special-purpose compressors targeting the Wikipedia data. Most seemed to believe that a smart general-purpose compressor could be found that would give compression rates nearly as good as any specialized compressor could.

The second negative influence of the computer science mentality was that it led Mahoney to choose the Wikipedia data as the target. This data has an abundance of XML tags, image links, URLs, and related cruft intermixed with the text data.
Mahoney probably thought that a smart general-purpose compressor could be found that would automatically deal with the formatting information. If Mahoney had selected a cleaner batch of text data as the compression target, the contest participants would probably have realized the link between language modeling and text compression, and started to apply sophisticated language modeling techniques to the problem. Moreover, the cruft-laden nature of the target database probably deterred participants from attempting to develop specialized compressors. If the data were pure text, a specialized compressor would be equivalent to a language model, and would therefore probably be reusable for reasons discussed in the book. But a specialized compressor for the Wikipedia data would have to include thousands of lines of code to deal with the various forms of cruft.

B.2 Generative Model Philosophy

An important thread of research in machine learning deals with the construction of generative models. These models attempt to describe the observational data instead of, or in addition to, the quantity of interest. In other words, the generative approach attempts to build models P(X) or P(X, Y) instead of just P(Y|X). These kinds of models are called generative because they can produce data similar to the original X data by sampling (the degree of similarity depends on how good the model is). The Collins parser described in Chapter 4 was an example of a generative model, because it defined a joint distribution P(T, S) over sentences S and parse trees T. There is also a clear connection between the idea of generative models and Chomsky's idea of generative grammar.

The generative model approach can also be used to approach the standard task of supervised learning when the Y data comes in a small number of discrete labels. Consider the much-studied problem of handwritten digit recognition, which involves ten labels.
To use the generative approach, one builds ten models P_1(X), P_2(X), ..., P_10(X). Then, given a new sample x, the system outputs a guess y_{k*}, where k* is chosen using the following rule:

    k^* = \arg\max_k P_k(x)    (B.1)

While the generative approach can be used for the standard task, it usually provides worse performance in terms of error rate than the normal approach of learning P(Y|X) directly.

One major advocate of the generative approach is Geoff Hinton, a professor at the University of Toronto. Hinton's work, and its underlying philosophy, has had a significant influence on the present book. Unfortunately that philosophy is never articulated clearly and directly, but important aspects of it can be gleaned from the following three papers:

• Geoff Hinton, "To recognize shapes, first learn to generate images" [46]*.
• Geoff Hinton, Peter Dayan, Brendan Frey, Radford Neal, "The wake-sleep algorithm for unsupervised neural networks" [47]*.
• Geoff Hinton, Simon Osindero, Yee-Whye Teh, "A Fast Learning Algorithm for Deep Belief Nets" [48]*.

The title of the first paper sums up the approach quite well. Hinton advocates the indirect approach to learning, in which a generative model P(X) is learned during the first step. If a good model P(X) can be found, it should not be much harder to find a joint model P(X, Y), which can then be used to solve the supervised problem. Hinton emphasizes that the same features (or patterns) that are important in modeling P(X) are probably also important in modeling P(Y|X). Furthermore, the raw data models P(X) can be made far more complex than the conditional P(Y|X) models, because the raw data contains much more information. Section 2.1.8 contains a key passage from reference [48]* that articulates this idea.

The second paper, about the wake-sleep approach, describes an algorithm for using the generative model idea to train a neural network.
The key insight is that if the neural network model has adequately captured the statistics of the data set, the samples it generates should look like the real data. The authors present a special type of neural network that alternates between a "wake" phase, where it processes the input data using its current connection weights, and a "sleep" phase, where it samples from the current model and compares the samples to the real data. The authors then show how a simple learning rule, based on a statistical comparison of the sampled data to the real data, can be used to improve the network model. The design of the wake-sleep algorithm highlights the reciprocal roles of interpretation and generation, which are analogous to encoding and decoding in data compression.

These ideas can be understood easily in terms of the abstract formulation of computer vision described in Chapter 3. A generative model is a graphics program G that transforms scene descriptions D into an image G(D). It is easy to sample from the model: create a description D at random, and feed it into the graphics program. It is much more difficult, however, to obtain a good description D that reproduces a target image I such that I ≈ G(D); this is called the interpretation step. The wake-sleep algorithm uses a smart statistical interpretation scheme to produce descriptions. In terms of the neural network, a description is a vector describing the activity level of each node in the network.

The third paper, about deep belief nets, demonstrates the potential power of a two-phase (indirect) process in which the first phase learns highly complex structures by modeling the raw data. Hinton et al. propose to learn models of the raw data by using a stack of neural network layers. Each layer learns about the patterns and features in the output of the previous layer; the base layer learns directly from the input images.
By learning about the raw data instead of the image-label relationship, this strategy repairs a standard problem with training a multi-layer network, which is that there is a large distance between the input and the output. This distance makes it harder to determine what effect a small change in a low-level weight will have on the final output. Another problem with the standard approach is that multi-layer networks are by definition complex, so they will tend to overfit when trained on limited-data problems. Training on the raw data overcomes this problem as well, since by modeling raw data it is possible to use complex models without overfitting. Hinton et al. applied their stacked neural network algorithm to the MNIST database of handwritten digit images, a standard benchmark in machine learning. The results they achieved, using the indirect approach, were slightly better than the previous state of the art results, which were obtained using sophisticated direct modeling algorithms such as the Support Vector Machine. However, a more visually impressive feat was a demonstration of veridical simulation: the images produced by sampling from the learned model were almost indistinguishable from real handwritten digit images.

Hinton's philosophy has several elements in common with the comperical approach. Both philosophies advocate indirect learning as a way to circumvent the model complexity limitations inherent in the direct approach. Both philosophies emphasize the veridical simulation principle. The reciprocal relationship between interpretation (encoding) and generation (decoding) comes up in both approaches.

In spite of these similarities, there are several crucial differences between the two philosophies. For technical reasons, Hinton's models can be used to sample from the distribution P(X), but cannot be used to explicitly calculate the probability of a data sample.
This makes it impossible to compare one model to another using the log-likelihood, compression rate, or an analogous quantity. Perhaps this limitation can be repaired, but Hinton appears to have little motivation to do so. This is because of the toolbox mentality criticized in Chapter 4. Hinton and his coworkers conceive of their work as the development of a suite of general-purpose tools. Since different tools work well in different situations, it is difficult or meaningless to make rigorous comparisons between them. Because the tools are thought of as general purpose, they can be tested using whatever databases happen to be available, such as the MNIST handwritten digit database or the problems in the UCI machine learning repository [65, 2]. Thus, in this mindset, there is not much point in creating specialized databases, or in creating targeted models that exploit the empirical properties of such databases. This procedure, in contrast, is a core component of the comperical philosophy.

B.3 Unsupervised and Semi-Supervised Learning

Unsupervised learning is the problem of learning without a set of labels or other guidance provided by a human supervisor. Instead of attempting to learn the relationship between a set of images {x_i} and a set of labels {y_i}, unsupervised methods attempt to learn something about the images themselves. The (huge) advantage of unsupervised methods is that they do not necessitate the construction of a labeled dataset, which is almost always a tedious and time-consuming project. Furthermore, unsupervised methods can justify highly complex models without overfitting. These features of the problem formulation overlap with the comperical idea. The goal of a related subfield, called semi-supervised learning, is to improve the performance of supervised learning by performing unsupervised learning in advance [19].
This notion is related to the thought experiments of Chapter 2 about indirect learning, and the Reusability Hypothesis that forms a core component of the comperical philosophy.

Given a large set of raw data objects and the general goal of learning something about those objects, a critical question immediately arises: what quantity should the system optimize? The subfield mandates no specific response to this question, and different answers will lead to widely varying lines of research. Notice how in the case of supervised learning, the basic form of the answer is obvious: the goal is to learn the relationship between the data objects {x_i} and the labels {y_i}. Different researchers answer the optimization question by proposing quantities that reflect their expectations about what kind of structure the data set will exhibit. For example, a common expectation is that the raw data samples will be distributed in clusters, and so the computational goal should be to identify those structures. This reveals a general conceptual shortcoming of unsupervised learning: it is hard to know if the structure discovered by the algorithm is really present in the data, or if it is simply imposed on the data by the researcher's preconceived notions.

Many researchers have formulated the problem of unsupervised learning as one of dimensionality reduction. Most raw data objects (images, sound, text, etc.) are very high dimensional. For example, a typical image might have 9 · 10^5 pixels; so one image can be thought of as a point in a space with 9 · 10^5 dimensions. The goal of dimensionality reduction techniques is to transform this kind of data sample into a point in a lower-dimensional space, while preserving most of the "relevant" information. One common technique in this family is called Principal Components Analysis (PCA) [87].
Given a dataset of P-dimensional points, PCA first computes the P × P matrix of correlations between each pair of dimensions. It then constructs a new basis by calculating the top D eigenvectors of this matrix, where D < P. This new basis expresses most of the variation in the data set, while requiring a smaller number of dimensions to do so. Another popular dimensionality reduction technique is called Isomap, developed by Tenenbaum et al., which is a refinement of a procedure called Multi-Dimensional Scaling (MDS) [113]. Based on an N × N matrix A_ij of point-point distances, MDS attempts to find a set of vectors in D-dimensional space that preserves the distance mappings: ||x_i − x_j|| ≈ A_ij. Since D can be smaller than the original dimensionality of the points, MDS provides a way of projecting the points into a lower-dimensional manifold that preserves their similarity (distance) relationships. Tenenbaum et al. define a special graph to produce the distance matrix. In this graph, only nearby points have edges between them, with small weights for points that are very close together. The distance between two points is then defined as the length of the shortest path between them in the graph. Because this definition preserves the local structure of the relationships between the points, it can solve some complex manifold-discovery problems that other methods cannot.

Another important vein of unsupervised learning research is called automatic clustering. The idea here is that data is often naturally distributed in clusters. Data samples within a cluster often resemble one another much more than they resemble other samples. Zoology provides a good example of naturally occurring clusters. Categories like "fish" and "mammal" are in some sense real, and reflect legitimate order in the natural world. The key point, again, is that clusters can in principle be learned on the basis of the raw data (e.g. images).
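The core of the PCA procedure described earlier in this section (compute the covariance structure of the data, then extract its top eigenvectors) can be sketched in pure Python for the two-dimensional case, using power iteration to find the top component. This is a hypothetical illustration with synthetic data, not an implementation from any of the cited papers:

```python
import math
import random

random.seed(0)

def pca_top_component(data, iters=200):
    """Top principal component of 2-D points via power iteration on the
    covariance matrix (a minimal sketch of the PCA idea)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # Entries of the 2x2 covariance matrix.
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    # Power iteration converges to the eigenvector of largest eigenvalue.
    vx, vy = 1.0, 0.0
    for _ in range(iters):
        wx = cxx * vx + cxy * vy
        wy = cxy * vx + cyy * vy
        norm = math.hypot(wx, wy)
        vx, vy = wx / norm, wy / norm
    return vx, vy

# Points scattered along the line y = 2x plus a little noise: the top
# component should point roughly along that line.
pts = [(t, 2 * t + random.gauss(0, 0.01)) for t in [i / 10 for i in range(-10, 11)]]
vx, vy = pca_top_component(pts)
assert abs(abs(vy / vx) - 2.0) < 0.1  # recovered direction has slope ~2
```

A full PCA would compute all D top eigenvectors (e.g. by deflation or a library eigensolver); the single-component case is enough to show how the dominant direction of variation is recovered.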
Many clustering algorithms have been proposed in the literature; one immensely popular algorithm is called K-means [69]. The goal of this algorithm is to partition the data points into K sets in such a way as to minimize the distance between each point and the mean of its containing set. The algorithm operates by choosing some initial guesses for the mean points, and then assigning each data sample to a cluster based on its distance to the means. Then the means for each cluster are recomputed using the data points in the cluster. Ng et al. proposed a clustering algorithm based on spectral graph theory [83]. The basic idea here is to find the eigenvectors of the graph's Laplacian matrix, which is the difference between the degree matrix and the adjacency matrix. The second eigenvector is related to the solution of the minimum cut problem for partitioning the graph into two subcomponents.

The variety of answers to the optimization question is a key conceptual weakness of research in this area. Because there is no standard answer, there is no strong method for comparing candidate solutions. Researchers operate according to the toolbox mindset, already criticized in various places in this book. Perhaps the K-means algorithm will produce clusters that work well for a particular application; perhaps it won't. If it doesn't, that is not the fault of the algorithm; it just means that the tool is not well-suited to the particular task. It is the responsibility of the application developer to search through the toolbox to find the tool that works best for her particular project. To prove the potential usefulness of a new contribution, an unsupervised learning researcher need only show that it is useful for some task. This results in a proliferation of tools.

Another flaw in the philosophy of unsupervised learning research is that methods are thought to be "general": they can be applied to any data set that has the correct form.
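The K-means iteration described above (guess the means, assign each point to its nearest mean, recompute the means, repeat) can be sketched in pure Python for two one-dimensional clusters. The data is synthetic and, for simplicity, the sketch uses a deterministic initialization (the minimum and maximum points) rather than random guesses:

```python
import random

random.seed(1)

def kmeans_two(points, iters=10):
    """Lloyd's iteration for K-means with K = 2 on 1-D points (sketch)."""
    means = [min(points), max(points)]  # deterministic initial guesses
    for _ in range(iters):
        # Assignment step: attach each point to its nearest mean.
        clusters = [[], []]
        for p in points:
            nearest = 0 if abs(p - means[0]) <= abs(p - means[1]) else 1
            clusters[nearest].append(p)
        # Update step: recompute each mean from its assigned points.
        means = [sum(c) / len(c) if c else means[i]
                 for i, c in enumerate(clusters)]
    return sorted(means)

# Two well-separated 1-D clusters centered near 0 and 10.
data = [random.gauss(0, 0.5) for _ in range(50)] + \
       [random.gauss(10, 0.5) for _ in range(50)]
m1, m2 = kmeans_two(data)
assert abs(m1 - 0.0) < 1.0 and abs(m2 - 10.0) < 1.0
```

With well-separated clusters the recovered means land close to the true centers; with overlapping or oddly shaped clusters the result depends heavily on initialization, which is part of the critique above.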
The implicit belief is that there exists some ideal or optimal algorithm that can discover the correct categories or hidden structure in any data set. This belief is a holdover from traditional computer science, the discipline in which most machine learning researchers receive their early training. In traditional computer science, it is perfectly reasonable to search for general algorithms. Methods like QuickSort and Dijkstra's algorithm are fully general solutions to the problems to which they apply.

The comperical approach to learning is thus related to unsupervised learning, but quite different in emphasis. Comperical researchers agree on the compression rate as the singular quantity of interest. This enables them to rigorously compare candidate solutions. At the same time, the No Free Lunch theorem shows that no truly general compression algorithm can exist. To achieve improved compression rates, therefore, researchers must study the empirical properties of the data they are analyzing. Furthermore, comperical researchers study vast data sets and construct those data sets with some degree of care, so as to facilitate the study of particular aspects of the data (such as English grammar or the appearance of New York buildings).

B.4 Traditional Data Compression

Data compression itself is not a new idea; it dates back at least to the work of Claude Shannon in the 1940s. Researchers have been working on various problems in this area for decades. Indeed, many important applications of the "dot com" era of the 1990s and 2000s became possible due to innovations in this field. Perhaps the most important innovation was the development of the MP3 audio compression format, which rocked the music industry by making it possible to distribute music files over a modestly fast internet connection. Another important set of innovations relates to video compression, which allows people to watch movies online or on their smartphones.
It should be clear that the research proposed in this book is only superficially related to traditional data compression research. There are several crucial differences in emphasis and methodology that lead to wide divergences in practice. The most basic difference is that traditional data compression aims for real practicality, while comperical research aims for specialized understanding of various phenomena. This means that, for example, a traditional compression researcher would never bother to set up a roadside video camera and attempt to compress its output. No one else is interested in transmitting such a data set, so whatever he discovered would be of no interest to anyone else. Also, the specialized insights that would allow superior compression rates for the roadside video camera would not be especially useful for other types of data.

Because of their primary interest in practical data transmission, traditional compression researchers consider a variety of additional factors when evaluating a new system. One important consideration is speed; most users do not want to wait an hour for the decompressor to produce an image. Another consideration is simplicity of implementation. If a compression format is to become a standard, it must be relatively easy for different groups to implement it.

Traditional compression researchers also devote a significant amount of effort to the lossy version of the problem, where what goes into the encoder does not match exactly what comes out of the decoder. This is worthwhile because the human perceptual system simply cannot detect some of the minute details of an image, so bits can be saved by leaving those details out. The disadvantage of this approach is that it normally introduces an element of subjectivity. It means that a human being must observe the lossy compressed image and decide if it provides good quality for a low price in bits.
This removes a critical component of the philosophy articulated in this book, which is the methodological rigor associated with objective comparisons.

Another key difference between traditional compression research and comperical research is that the former deals with relatively small objects such as single images or text files, while the latter deals with databases that are orders of magnitude larger. This change in focus completely changes the character of the problem. To see why, consider the problem of building a car. If one wants to build only a single car, then one constructs each component of the car individually using machine tools, and fits the parts together by hand. The challenge is to minimize the time- and dollar-cost of the manual labor. This is an interesting challenge, but the challenge of building ten million cars is entirely different. If the goal is to build ten million cars, then it makes sense to start by building a factory. This will be a big up-front cost, but it will pay for itself by reducing the marginal cost of each additional car. Analogously, when attempting to compress huge databases, it becomes worthwhile to build sophisticated computational tools into the compressor. For example, it is worthwhile to use a specialized hubcap detector for the roadside video data (but not for a single car image). The development of these sophisticated computational tools is the real goal of the research.

References

[1] S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via a sparse, part-based representation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(11):1475–1490, 2004.
[2] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.
[3] J.J. Atick. Could information theory provide an ecological theory of sensory processing? Network, 3:213–251, 1992.
[4] S. Baker, D. Scharstein, J.P. Lewis, S. Roth, M.J. Black, and R. Szeliski.
A database and evaluation methodology for optical flow. In ICCV, volume 5, 2007.
[5] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, page 65, 2005.
[6] H.B. Barlow. The coding of sensory messages. In W.H. Thorpe and O.L. Zangwill, editors, Current Problems in Animal Behavior. Cambridge University Press, 1961.
[7] John Beamish. Supersolidity or quantum plasticity? Physics, 3:51, Jun 2010.
[8] Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.
[9] V. Blanz and T. Vetter. Face Recognition Based on Fitting a 3D Morphable Model. PAMI, pages 1063–1074, 2003.
[10] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In COLT '92: Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, New York, NY, USA, 1992. ACM.
[11] R.A. Brooks. Elephants don't play chess. Robotics and Autonomous Systems, 6:3–15, 1990.
[12] R.A. Brooks. Cambrian intelligence: the early history of the new AI. The MIT Press, 1999.
[13] Rodney A. Brooks. Intelligence without representation. Artificial Intelligence, 47:139–160, 1991.
[14] Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479, December 1992.
[15] P.F. Brown, S. Della Pietra, V.J. Della Pietra, and R.L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1994.
[16] Tom Bylander. The computational complexity of propositional STRIPS planning.
Artificial Intelligence, 69(1-2):165–204, 1994.
[17] C. Callison-Burch, M. Osborne, and P. Koehn. Re-evaluating the role of BLEU in machine translation research. Citeseer, 2006.
[18] J. Canny. A computational approach to edge detection. Readings in computer vision: issues, problems, principles, and paradigms, 184:87–116, 1987.
[19] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010.
[20] S.F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–393, 1999.
[21] N. Chomsky. The minimalist program. The MIT Press, 1995.
[22] Noam Chomsky. Syntactic structures. The Hague: Mouton, 1957.
[23] Noam Chomsky. A review of B.F. Skinner's Verbal Behavior. In Leon A. Jakobovits and Murray S. Miron, editors, Readings in the Psychology of Language, pages 142–143. Prentice-Hall, 1967.
[24] M. Collins. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637, 2003.
[25] T.M. Cover and J.A. Thomas. Elements of Information Theory. New York: Wiley, 1991.
[26] D. Crevier. AI: the tumultuous history of the search for artificial intelligence. Basic Books, 1993.
[27] Douglas DeCarlo, Dimitris Metaxas, and Matthew Stone. An anthropometric face model using variational techniques. In SIGGRAPH, pages 67–74, 1998.
[28] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.
[29] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.
[30] T.G. Dietterich.
An Experimental Comparison of Three Methods for Constructing En- sembles of Decision T rees: Bagging, Boosting, and Randomization. Machine Learning , 40(2):139–157, 2000. 51 [31] S. Dougherty and K.W . Bo wyer . A formal frame work for the objectiv e ev aluation of edge detectors. In International Confer ence on Imag e Pr ocessing , 1998. 96 [32] F .J. Estrada and A.D. Jepson. Quantitativ e ev aluation of a novel image segmentation algorithm. In CVPR , volume 2, pages 1132–1139, 2005. 95 [33] L. Fei-Fei, R. Fergus, and P . Perona. Learning generati ve visual models from few train- ing e xamples: An incremental Bayesian approach tested on 101 object cate gories. Com- puter V ision and Image Understanding , 106(1):59–70, 2007. 98 [34] P .F . Felzenszwalb and D.P . Huttenlocher . Ef ficient graph-based image segmentation. International J ournal of Computer V ision , 59(2):167–181, 2004. 88 [35] Richard Fikes and Nils J. Nilsson. STRIPS: A new approach to the application of theo- rem proving to problem solving. In IJCAI , pages 608–620, 1971. 159 [36] L.A. Forbes and B.A. Draper . Inconsistencies in edge detector ev aluation. CVPR , 2:398– 404, 2000. 97 205 [37] Y . Freund and R.E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J ournal of Computer and System Sciences , 55(1):119– 139, 1997. 40 , 92 [38] D. Gildea and D. Jurafsk y . Automatic labeling of semantic roles. Computational Lin- guistics , 28(3):245–288, 2002. 146 [39] C.A.E. Goodhart. Monetary relationships: a view from Threadneedle Street. P apers in monetary economics , 1, 1975. 103 [40] J.T . Goodman. A bit of progress in language modeling. Computer Speech & Langua ge , 15(4):403–434, 2001. 142 [41] J.H. Greenberg. Some univ ersals of grammar with particular reference to the order of meaningful elements. Universals of languag e , 73:113, 1963. 185 [42] P .D. Gr "unwald. The minimum description length principle . The MIT Press, 2007. 175 [43] R.M. 
Haralick. Computer vision theory: The lack thereof. CVGIP , 36(2-3):372–386, 1986. 102 [44] M. Helmert. Decidability and undecidability results for planning with numerical state v ariables. In Pr oceedings of AIPS-02 , pages 44–53, 2002. 159 [45] D. Hindle. Acquiring disambiguation rules from text. In Pr oceedings of the 27th annual meeting on Association for Computational Linguistics , pages 118–125. Association for Computational Linguistics, 1989. 126 [46] G.E. Hinton. T o recognize shapes, first learn to generate images. Pr ogr ess in Brain Resear ch , 165:535, 2007. 55 , 196 [47] G.E. Hinton, P . Dayan, B.J. Frey , and R.M. Neal. The" wake-sleep" algorithm for unsu- pervised neural networks. Science , 268(5214):1158, 1995. 196 [48] G.E. Hinton, S. Osindero, and Y . T eh. A fast learning algorithm for deep belief nets. Neural Computation , 18:1527–1554, 2006. 55 , 196 [49] D.H. Hubel and T .N. W iesel. Receptiv e fields, binocular interaction and functional ar- chitecture in the cat’ s visual cortex. The Journal of Physiology , 160(1):106, 1962. 164 206 [50] D.A. Huffman. A Method for the Construction of Minimum-Redundancy Codes. Pr o- ceedings of the IRE , 40(9):1098–1101, 1952. 193 [51] Marcus Hutter . Universal Artificial Intelligence: Sequential Decisions based on Algo- rithmic Pr obability . Springer , Berlin, 2004. 194 [52] John P . A. Ioannidis. Why most published research findings are false. PLoS Medicine , 2(8), 2005. 4 [53] R. Jackendof f. X-bar-Syntax: A study of phrase structure. 1977. 127 [54] R.C. Jain and T .O. Binford. Ignorance, myopia, and nai vete in computer vision systems. CVGIP: Image Under standing , 53(1):112–117, 1991. 102 [55] T . Joachims. Making lar ge-scale support vector machine learning practical. In A. Smola B. Schölkopf, C. Burges, editor , Advances in K ernel Methods: Support V ector Machines . MIT Press, 1998. 76 [56] D. Kahneman, P . Slovic, and A. Tversk y . J udgment under uncertainty: Heuristics and biases . 
Cambridge Uni versity Press, 1982. 3 [57] G. Kanizsa, P . Legrenzi, and P . Bozzi. Organization in vision: Essays on Gestalt per- ception . Ne w Y ork: Praeger , 1979. 164 [58] T . Kanungo, B. Dom, W . Niblack, and D. Steele. A fast algorithm for MDL-based multi-band image segmentation. In CVPR , pages 609–616, Jun 1994. 115 [59] S. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE T ransactions on Acoustics, Speec h and Signal Pr ocessing , 35(3):400–401, 1987. 142 [60] K. Kelly . What technolo gy wants . V iking Press, 2010. 156 [61] R. Kneser and H. Ney . Improved backing-off for m-gram language modeling. In ICASSP , pages 181–184. IEEE, 1995. 142 [62] D.E. Knuth and R.W . Moore. An analysis of alpha-beta pruning. Artificial intelligence , 6(4):293–326, 1975. 158 [63] T .S. Kuhn. The Structure of Scientific Revolutions . Univ ersity of Chicago Press, 1970. 151 , 169 207 [64] Y .G. Leclerc. Constructing simple stable descriptions for image partitioning. Interna- tional J ournal of Computer V ision , 3(1):73–102, 1989. 115 [65] Y ann LeCun and Corinna Cortes. The MNIST database of handwritten digits. http: //yann.lecun.com/exdb/mnist/ . 54 , 198 [66] B. Leibe and C. W allraven. Cogvis: The object database. http://cogvis.nada. kth.se/obj- def.html , 200106. 100 [67] Ming Li and Paul V itanyi. An Intr oduction to K olmogor ov Complexity and Its Applica- tions . Springer V erlag, 1997. 174 [68] B.D. Lucas and T . Kanade. An iterati ve image registration technique with an application to stereo vision. In International joint confer ence on artificial intelligence , v olume 3, pages 674–679. Citeseer , 1981. 90 [69] J. MacQueen. Some methods for classification and analysis of multiv ariate observa- tions. In Pr oceedings of the fifth Berkele y symposium on mathematical statistics and pr obability , volume 1, page 14, 1967. 200 [70] D.M. Magerman. Statistical decision-tree models for parsing. 
In Pr oceedings of the 33r d annual meeting on Association for Computational Linguistics , pages 276–283. As- sociation for Computational Linguistics, 1995. 126 [71] Matt Mahone y . Rationale for a lar ge te xt compression benchmark. http://www.cs. fit.edu/~mmahoney/compression/rationale.html , 2006. 194 [72] M.P . Marcus, M.A. Marcinkie wicz, and B. Santorini. Building a large annotated corpus of English: The Penn T reebank. Computational linguistics , 19(2), 1993. 120 , 124 , 125 , 126 [73] D. Marr and E. Hildreth. Theory of edge detection. Pr oceedings of the Royal Society of London. Series B. Biological Sciences , 207(1167):187, 1980. 85 [74] D. Martin. An Empirical Appr oach to Gr ouping and Se gmentation . PhD thesis, Univ er- sity of California, Berkele y , 2000. 95 [75] D. Martin, C. Fowlk es, D. T al, and J. Malik. The berkeley segmentation dataset and benchmark web site. http://www.eecs.berkeley.edu/Research/ Projects/CS/vision/bsds/ , 2001. 59 208 [76] D. Martin, C. Fo wlkes, D. T al, and J. Malik. A database of human segmented natural im- ages and its application to e v aluating se gmentation algorithms and measuring ecological statistics. In ICCV , volume 2, pages 416–423, 2001. 95 [77] D. McClosky , E. Charniak, and M. Johnson. Reranking and self-training for parser adaptation. In Pr oceedings of the 21st International Confer ence on Computational Lin- guistics and the 44th annual meeting of the Association for Computational Linguistics , pages 337–344. Association for Computational Linguistics, 2006. 126 [78] D.M. McDermott. The 1998 AI planning systems competition. AI magazine , 21(2):35, 2000. 159 [79] U. Mohideen and A. Roy . Precision measurement of the Casimir force from 0.1 to 0.9 µ m. Physical Revie w Letters , 81(21):4549–4552, 1998. 154 [80] D. Mumford. Neuronal architectures for pattern-theoretic problems. Lar ge-Scale Neu- r onal Theories of the Brain , pages 125–152, 1994. 113 [81] Roberto Na vigli. W ord sense disambiguation: A survey . 
ACM Comput. Surv . , 41:10:1– 10:69, February 2009. 146 , 147 [82] S.A. Nene, S.K. Nayar , and H. Murase. Columbia object image library (coil-100). T ech- nical Report CUCS-006-96, Columbia Uni versity , 1996. 100 [83] A. Ng, M. Jordan, and Y . W eiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Pr ocessing Systems 14: Pr oceeding of the 2001 Con- fer ence , pages 849–856, 2001. 200 [84] F .J. Och. Minimum error rate training in statistical machine translation. In Pr oceed- ings of the 41st Annual Meeting on Association for Computational Linguistics-V olume 1 , pages 160–167. Association for Computational Linguistics, 2003. 132 , 137 [85] R. Ohlander , K. Price, and D.R. Reddy . Picture segmentation using a recursiv e region splitting method*. Computer Graphics and Ima ge Pr ocessing , 8(3):313–333, 1978. 94 [86] K. Papineni, S. Roukos, T . W ard, and W .J. Zhu. BLEU: a method for automatic ev alu- ation of machine translation. In Pr oceedings of the 40th annual meeting on association for computational linguistics , pages 311–318. Association for Computational Linguis- tics, 2002. 130 , 134 209 [87] K. Pearson. On lines and planes of closest fit to points in space. Philosophical magazine , 2(6):559–572, 1901. 199 [88] Rolf Pfeifer , Max Lungarella, and Fumiya Iida. Self-organization, embodiment, and biologically inspired robotics. Science , 318:1088–1093, 2007. 162 [89] J.R. Platt. Strong inference. Science , 146(3642):347–353, 1964. 1 [90] T . Poggio, V . T orre, and C. K och. Computational vision and regularization theory. Image Understanding , 3:1–18, 1989. 111 [91] J. Ponce, T . Berg, M. Everingham, D. F orsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid, B. Russell, A. T orralba, et al. Dataset issues in object recognition. T owar d Cate gory-Level Object Recognition , pages 29–48, 2006. 99 , 100 [92] K.R. Popper . The Logic of Scientific Discovery . Basic Books, 1959. i , 9 [93] A. Ratnaparkhi. 
Learning to parse natural language with maximum entropy models. Machine Learning , 34(1):151–175, 1999. 121 [94] J. Rissanen. Generalized Kraft inequality and arithmetic coding. IBM Journal of Re- sear ch and De velopment , 20(3):198–203, 1976. 193 [95] J. Rissanen. Modeling by shortest data description. Automatica , 14:465–471, 1978. 47 [96] R. Rosenfeld. A maximum entropy approach to adaptiv e statistical language modeling. Computer , Speech and Langua ge , 10:187– 228, 1996. 141 [97] R. Rosenfeld. T wo decades of statistical language modeling: Where do we go from here? Pr oceedings of the IEEE , 88(8):1270–1278, 2000. 144 [98] D. E. Rumelhart, G. E. Hinton, and R. J. W illiams. Learning internal representations by error propagation. Natur e , 323:533–536, 1986. 42 [99] S.J. Russell and P . Norvig. Artificial intelligence: a modern appr oach . Prentice hall, 2009. 158 [100] D. Scharstein and R. Szeliski. Middlebury Stereo Vision Page. online web resource. 101 210 [101] D. Scharstein and R. Szeliski. A T axonomy and Evaluation of Dense T wo-Frame Stereo Correspondence Algorithms. International Journal of Computer V ision , 47(1):7–42, 2002. 101 [102] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. In CVPR , volume 1, pages 195–202, 2003. 101 [103] C.E. Shannon. A chess-playing machine. Scientific American , 182(2):48, 1950. 157 [104] J. Shi and J. Malik. Normalized cuts and image segmentation. P attern Analysis and Machine Intellig ence , 22(8):888–905, 2000. 88 [105] M.C. Shin, D. Goldgof, and K.W . Bowyer . An objectiv e comparison methodology of edge detection algorithms using a structure from motion task. In CVPR , pages 190–195, 1998. 94 [106] M. Simon. Hp looking into claim webcams can’t see black people. http://articles.cnn.com/2009- 12- 22/tech/hp.webcams_1_ webcams- video- youtube?_s=PM:TECH , December 2009. 39 [107] B.F . Skinner . V erbal behavior . 1957. 71 [108] M. Snover , B. Dorr , R. Schwartz, L. 
Micciulla, and J. Makhoul. A study of translation edit rate with tar geted human annotation. In Pr oceedings of Association for Machine T ranslation in the Americas , pages 223–231. Citeseer , 2006. 130 , 135 [109] A. Stolcke. SRILM-an extensible language modeling toolkit. In Pr oceedings of the inter- national confer ence on spoken language pr ocessing , volume 2, pages 901–904. Citeseer , 2002. 144 [110] J. Sun, N.N. Zheng, and H.Y . Shum. Stereo matching using belief propagation. IEEE T ransactions on P attern Analysis and Machine Intelligence , pages 787–800, 2003. 90 [111] R. Sutton and A. Barto. Reinfor cement Learning: An Intr oduction . Cambridge, MA: MIT Press, 1997. 42 , 71 [112] R. Szeliski. Prediction error as a quality metric for motion and stereo. In ICCV , vol- ume 2, pages 781–788, 1999. 101 [113] J.B. T enenbaum, V . Silv a, and J.C. Langford. A global geometric frame work for nonlin- ear dimensionality reduction. Science , 290(5500), 2000. 199 211 [114] J.P T urian, L. Shen, and I. D. Melamed. Evaluation of machine translation and its e val- uation. In Pr oceedings of the MT Summit IX , 2003. 134 , 136 [115] R. Unnikrishnan, C. Pantofaru, and M. Hebert. T o ward objectiv e ev aluation of image segmentation algorithms. P AMI , 29(6):929–944, 2007. 95 [116] V . V apnik. The Natur e of Statistical Learning Theory . Springer , 1998. 45 , 68 , 74 , 75 [117] V . V apnik and R. Gilad-Bachrach. Learning has just started: an intervie w with Professor Vladimir Vapnik. http://www.learningtheory.org/ , 2008. 52 [118] Vladimir V apnik, Stev en E. Golowich, and Alex J. Smola. Support vector method for function approximation, regression estimation and signal processing. In Advances in Neural Information Pr ocessing Systems , pages 281–287, 1996. 40 , 68 , 93 [119] P . V iola and M.J. Jones. Robust real-time face detection. International J ournal of Com- puter V ision , 57(2):137–154, 2004. 92 , 117 [120] C.S. W allace and D.M. Boulton. 
An information measure for classification. The Com- puter J ournal , 11(2):185, 1968. 47 [121] W . W eav er . T ranslation. Machine tr anslation of languages , 14:15–23, 1955. 129 [122] W ikipedia. Computer vision — Wikipedia, the free enc yclopedia, 2010. Online; ac- cessed 12-May-2010. 84 [123] W ikipedia. Flickr — wikipedia, the free enc yclopedia, 2011. [Online; accessed 2-April- 2011]. 54 [124] Ian H. W itten, Radford M. Neal, and John G. Cleary . Arithmetic coding for data com- pression. Commun. A CM , 30(6):520–540, 1987. 193 [125] N. Xue, F . Xia, F .D. Chiou, and M. Palmer . The Penn Chinese T reeBank: Phrase structure annotation of a lar ge corpus. Natural Langua ge Engineering , 11(02):207–238, 2005. 124 , 125 , 127 [126] H. Zhang, A.C. Berg, M. Maire, and J. Malik. SVM-KNN: Discriminativ e nearest neigh- bor classification for visual category recognition. In Computer V ision and P attern Recog- nition , volume 2, pages 2126–2136. IEEE, 2006. 93 212 [127] H. Zhang, J.E. Fritts, and S.A. Goldman. Image segmentation e v aluation: A surve y of unsupervised methods. Computer V ision and Image Understanding , 110(2):260–280, 2008. 95 [128] S.C. Zhu and A. Y uille. Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. P AMI , 18(9):884–900, 1996. 115 213