Notes on a New Philosophy of Empirical Science

This book presents a methodology and philosophy of empirical science based on large scale lossless data compression. In this view a theory is scientific if it can be used to build a data compression program, and it is valuable if it can compress a st…

Author: Daniel Burfoot

Notes on a New Philosophy of Empirical Science (Draft Version)
Daniel Burfoot
April 2011

Release Notes

This document is a draft version of the book, tentatively titled "Notes on a New Philosophy of Empirical Science". This book represents an attempt to build a philosophy of science on top of a large number of technical ideas related to information theory, machine learning, computer vision, and computational linguistics. It seemed necessary, in order to make the arguments convincing, to include brief summary descriptions of these ideas. It is probably inevitable that the technical summaries will contain a number of errors or misconceptions, related to, for example, the Shi-Malik image segmentation algorithm or the BLEU metric for evaluating machine translation results. While it seems unlikely that such errors could derail the central arguments of the book, it is not impossible. The reader is advised to exercise caution and consult the relevant literature directly.

The book has been influenced by a diverse set of authors and ideas. Especially influential references are cited with an asterisk, for example: [92]*. The final version of the book will probably include a brief description of how each key reference influenced the development of the book's ideas.

This draft version contains all the major ideas and themes of the book. However, it also contains no small number of blemishes, disfluencies, and other shortcomings. Two holes are particularly glaring. First, the chapter on computer vision includes an analysis of evaluation methods for the task of optical flow estimation, as well as a comperical reformulation of the task, but does not describe the task itself. Interested readers can repair this problem by a Google Scholar search for the term "optical flow". Second, there should be another thought experiment involving Sophie and Simon near the end of Chapter 2. In this thought experiment, Sophie proposes to use a birdsong synthesizer to create virtual labels, thereby obviating the need for Simon to label audio clips by hand. This idea comes up again in the section on the evaluation of face detection algorithms, where a graphics program for face modeling is used instead.

It is a profound and necessary truth that the deep things in science are not found because they are useful; they are found because it was possible to find them. -Oppenheimer

It is the mark of a higher culture to value the little unpretentious truths which have been discovered by means of rigorous method more highly than the errors handed down by metaphysical ages and men, which blind us and make us happy. -Nietzsche

Go as far as you can see; when you get there you will be able to see farther. -Carlyle

Contents

Release Notes
Nomenclature

1 Compression Rate Method
  1.1 Philosophical Foundations of Empirical Science
    1.1.1 Objectivity, Irrationality, and Progress
    1.1.2 Validation Methods and Taxonomy of Scientific Activity
    1.1.3 Toward a Scientific Method
    1.1.4 Occam's Razor
    1.1.5 Problem of Demarcation and Falsifiability Principle
    1.1.6 Science as a Search Through Theory-Space
    1.1.7 Circularity Commitment and Reusability Hypothesis
  1.2 Sophie's Method
    1.2.1 The Shaman
    1.2.2 The Dead Experimentalist
    1.2.3 The Rival Theory
  1.3 Compression Rate Method
    1.3.1 Data Compression is Empirical Science
    1.3.2 Comparison to Popperian Philosophy
    1.3.3 Circularity and Reusability in Context of Data Compression
    1.3.4 The Invisible Summit
    1.3.5 Objective Statistics
  1.4 Example Inquiries
    1.4.1 Roadside Video Camera
    1.4.2 English Text Corpus
    1.4.3 Visual Manhattan Project
  1.5 Sampling and Simulation
    1.5.1 Veridical Simulation Principle of Science
  1.6 Comparison to Physics

2 Compression and Learning
  2.1 Machine Learning
    2.1.1 Standard Formulation of Supervised Learning
    2.1.2 Simplified Description of Learning Algorithms
    2.1.3 Generalization View of Learning
    2.1.4 Compression View
    2.1.5 Equivalence of Views
    2.1.6 Limits of Model Complexity in Canonical Task
    2.1.7 Intrinsically Complex Phenomena
    2.1.8 Comperical Reformulation of Canonical Task
  2.2 Manual Overfitting
    2.2.1 The Stock Trading Robot
    2.2.2 Analysis of Manual Overfitting
    2.2.3 Train and Test Data
    2.2.4 Comperical Solution to Manual Overfitting
  2.3 Indirect Learning
    2.3.1 Dilbert's Challenge
    2.3.2 The Japanese Quiz
    2.3.3 Sophie's Self-Confidence
    2.3.4 Direct and Indirect Approaches to Learning
  2.4 Natural Setting of the Learning Problem
    2.4.1 Robotics and Machine Learning
    2.4.2 Reinforcement Learning
    2.4.3 Pedro Rabbit and the Poison Leaf
    2.4.4 Simon's New Hobby
    2.4.5 Foundational Assumptions of Supervised Learning
    2.4.6 Natural Form of Input Data
      2.4.6.1 Data is a Stream
      2.4.6.2 The Stream is Vast
      2.4.6.3 Supervision is Scarce
    2.4.7 Natural Output of Learning Process
      2.4.7.1 Dimensional Analysis
      2.4.7.2 Predicting the Stream
    2.4.8 Synthesis: Dual System View of Brain

3 Compression and Vision
  3.1 Representative Research in Computer Vision
    3.1.1 Edge Detection
    3.1.2 Image Segmentation
    3.1.3 Stereo Matching
    3.1.4 Object Recognition and Face Detection
  3.2 Evaluation Methodologies in Computer Vision
    3.2.1 Evaluation of Image Segmentation Algorithms
    3.2.2 Evaluation of Edge Detectors
    3.2.3 Evaluation of Object Recognition
    3.2.4 Evaluation of Stereo Matching and Optical Flow Estimation
  3.3 Critical Analysis of Field
    3.3.1 Weakness of Empirical Evaluation
    3.3.2 Ambiguity of Problem Definition and Replication of Effort
    3.3.3 Failure of Decomposition Strategy
    3.3.4 Computer Vision is not Empirical Science
    3.3.5 The Elemental Recognizer
  3.4 Comperical Formulation of Computer Vision
    3.4.1 Abstract Formulation of Computer Vision
    3.4.2 Stereo Correspondence
    3.4.3 Optical Flow Estimation
    3.4.4 Image Segmentation
    3.4.5 Face Detection and Modeling

4 Compression and Language
  4.1 Computational Linguistics
  4.2 Parsing
    4.2.1 Evaluation of Parsing Systems
    4.2.2 Critical Analysis
    4.2.3 Comperical Formulation
  4.3 Statistical Machine Translation
    4.3.1 Evaluation of Machine Translation Systems
    4.3.2 Critical Analysis
    4.3.3 Comperical Formulation of Machine Translation
  4.4 Statistical Language Modeling
    4.4.1 Comparison of Approaches to Language Modeling
  4.5 Additional Remarks
    4.5.1 Chomskyan Formulation of Linguistics
    4.5.2 Prediction of Progress

5 Compression as Paradigm
  5.1 Scientific Paradigms
    5.1.1 Requirements of Scientific Paradigms
    5.1.2 The Casimir Effect
    5.1.3 The Microprocessor Paradigm
    5.1.4 The Chess Paradigm
    5.1.5 Artificial Intelligence as Pre-paradigm Field
    5.1.6 The Brooksian Paradigm Candidate
  5.2 Comperical Paradigm
    5.2.1 Conceptual Clarity and Parsimonious Justification
    5.2.2 Methodological Efficiency
    5.2.3 Scalable Evaluation
    5.2.4 Systematic Progress

6 Meta-Theories and Unification
  6.1 Compression Formats and Meta-Formats
    6.1.1 Encoding Formats and Empirical Theories
  6.2 Local Scientific Theories
  6.3 Cartographic Meta-Theory
  6.4 Search for New Empirical Meta-Theories
  6.5 Unification
    6.5.1 The Form of Birds
    6.5.2 Unification in Comperical Science
    6.5.3 Universal Grammar and the Form of Language
    6.5.4 The Form of Forms

A Information Theory
  A.1 Basic Principle of Data Compression
    A.1.1 Encoding

B Related Work
  B.1 Hutter Prize
  B.2 Generative Model Philosophy
  B.3 Unsupervised and Semi-Supervised Learning
  B.4 Traditional Data Compression

References

Chapter 1: Compression Rate Method

1.1 Philosophical Foundations of Empirical Science

In a remarkable paper published in 1964, a biophysicist named John Platt pointed out the somewhat impolitic fact that some scientific fields made progress much more rapidly than others [89]*. Platt cited particle physics and molecular biology as exemplar fields in which progress was especially rapid. To illustrate this speed he relates the following anecdote:

[Particle physicists asked the question]: Do the fundamental particles conserve mirror-symmetry or "parity" in certain reactions, or do they not? The crucial experiments were suggested: within a few months they were done, and conservation of parity was found to be excluded. Richard Garwin, Leon Lederman, and Marcel Weinrich did one of the crucial experiments.
It was thought of one evening at suppertime: by midnight they had arranged the apparatus for it; and by 4 am they had picked up the predicted pulses showing the non-conservation of parity.

Platt attributed this rapid progress not to the superior intelligence of particle physicists and molecular biologists, but to the fact that they used a more rigorous scientific methodology, which he called Strong Inference. In Platt's view, the key requirement of rapid science is the ability to rapidly generate new theories, test them, and discard those that prove to be incompatible with evidence.

Many observers of fields such as artificial intelligence (AI), computer vision, computational linguistics, and machine learning will agree that, in spite of the journalistic hype surrounding them, these fields do not make rapid progress. Research in artificial intelligence began over 50 years ago. In spite of the bold pronouncements made at the time, the field has failed to transform society. Robots do not walk the streets; intelligent systems are generally brittle and function only within narrow domains. This lack of progress is illustrated by a comment by Marvin Minsky, one of the founders of AI, in reference to David Marr, one of the founders of computer vision:

After [David Marr] joined us, our team became the most famous vision group in the world, but the one with the fewest results. His idea was a disaster. The edge finders they have now using his theories, as far as I can see, are slightly worse than the ones we had just before taking him on. We've lost twenty years ([26], pg. 189).

This book argues that the lack of progress in artificial intelligence and related fields is caused by philosophical limitations, not by technical ones. Researchers in these fields have no scientific methodology of power comparable to Platt's concept of Strong Inference. They do not rapidly generate, test, and discard theories in the way that particle physicists do. This kind of critique has been uttered before, and would hardly justify a book-length exposition. Rather, the purpose of this book is to propose a scientific method that can be used, at least, for computer vision and computational linguistics, and probably for several other fields as well.

To set the stage for the proposal it is necessary to briefly examine the unique intellectual content of the scientific method. This uniqueness can be highlighted by comparing it to a theory of physics such as quantum mechanics. While quantum mechanics often seems mysterious and perplexing to beginning students, the scientific method appears obvious and inevitable. Physicists are constantly testing, examining, and searching for failures of quantum mechanics. The scientific method itself receives no comparable interrogation. Physicists are quite confident that quantum mechanics is wrong in some subtle way: one of their great goals is to find a unified theory that reconciles the conflicting predictions made by quantum mechanics and general relativity. In contrast, it is not even clear what it would mean for the scientific method to be wrong.

But consider the following chain of causation: the scientific method allowed humans to discover physics, physics allowed humans to develop technology, and technology allowed humans to reshape the world. The fact that the scientific method succeeds must reveal some abstract truth about the nature of reality.
Put another way, the scientific method depends implicitly on some assertions or propositions, and because those assertions happen to be true, the method works. But what is the content of those assertions? Can they be examined, modified, or generalized?

This chapter begins with an attempt to analyze and document the assertions and philosophical commitments upon which the scientific method depends. Then, a series of thought experiments illustrate how a slight change to one of the statements results in a modified version of the method. This new version is based on large scale lossless data compression, and it uses large databases instead of experimental observation as the necessary empirical ingredient. The remainder of the chapter argues that the new method retains all the crucial characteristics of the original. The significance of the new method is that it allows researchers to conduct investigations into aspects of empirical reality that have never before been systematically interrogated. For example, Chapter 3 shows that attempting to compress a database of natural images results in a field very similar to computer vision. Similarly, attempting to compress large text databases results in a field very similar to computational linguistics. The starting point in the development is a consideration of one of the most critical components of science: objectivity.

1.1.1 Objectivity, Irrationality, and Progress

The history of humanity clearly indicates that humans are prone to dangerous flights of irrationality. Psychologists have shown that humans suffer from a wide range of cognitive blind spots, with names like Scope Insensitivity and Availability Bias [56]. One special aspect of human irrationality of particular relevance to science is the human propensity to enshrine theories, abstractions, and explanations without sufficient evidence. Often, once a person decides that a certain theory is true, he begins to use that theory to interpret all new evidence. This distorting effect prevents him from seeing the flaws in the theory itself. Thus Ptolemy believed that the Sun rotated around the Earth, while Aristotle believed that all matter could be decomposed into the elements of fire, air, water, or earth.

Individual human fallibility is not the only obstacle to intellectual progress; another powerful barrier is group irrationality. Humans are fundamentally social creatures; no individual acting alone could ever obtain substantial knowledge about the world. Instead, humans must rely on a division of labor in which knowledge-acquisition tasks are delegated to groups of dedicated specialists. This division of labor is replicated even within the scientific community: physicists rely extensively on the experimental and theoretical work of other physicists. But groups are vulnerable to an additional set of perception-distorting effects involving issues such as status, signalling, politics, conformity pressure, and pluralistic ignorance. A low-ranking individual in a large group cannot comfortably disagree with the statements of a high-ranking individual, even if the former has truth on his side. Furthermore, scientists are naturally competitive and skeptical. A scientist proposing a new result must be prepared to defend it against inevitable criticism.

To overcome the problems of individual irrationality and group irrationality, a single principle is tremendously important: the principle of objectivity.
Objectivity requires that new results be validated by a mechanistic procedure that cannot be influenced by individual perceptions or sociopolitical effects. While humans may implement the validation procedure, it must be somehow independent of the particular oddities of the human mind. In the language of computer science, the procedure must be like an abstract algorithm that does not depend on the particular architecture of the machine it is running on. The validation procedure helps to prevent individual irrationality, by requiring scientists to hammer their ideas against a hard anvil. It also protects against group irrationality, by providing scientists with a strong shield against criticism and pressure from the group.

The objectivity principle is also an important requirement for a field to make progress. Researchers in all fields love to publish papers. If a field lacks an objective validation procedure, it is difficult to prevent people from publishing low quality papers that contain incorrect results or meaningless observations. The so-called hard sciences such as mathematics, physics, and engineering employ highly objective evaluation procedures, which facilitates rapid progress. Fields such as psychology, economics, and medical science rely on statistical methods to validate their results. These methods are less rigorous, and this leads to significant problems in these fields, as illustrated by a recent paper entitled "Why most published research findings are false" [52]. Nonscientific fields such as literature and history rely on the qualitative judgments of practitioners for the purposes of evaluation. These examples illustrate a striking correlation between the objectivity of a field's evaluation methods and the degree of progress it achieves.

1.1.2 Validation Methods and Taxonomy of Scientific Activity

The idea of objectivity, and the mechanism by which various fields achieve objectivity, can be used to define a useful taxonomy of scientific fields. Scientific activity, broadly considered, can be categorized into three parts: mathematics, empirical science, and engineering. These activities intersect at many levels, and often a single individual will make contributions in more than one area. But the three categories produce very distinct kinds of results, and utilize different mechanisms to validate the results and thereby achieve objectivity.

Mathematicians see the goal of their efforts as the discovery of new theorems. A theorem is fundamentally a statement of implication: if a certain set of assumptions is true, then some derived conclusion must hold. The legitimate mechanism for demonstrating the validity of a new result is a proof. Proofs can be examined by other mathematicians and verified to be correct, and this process provides the field with its objective validation mechanism. It is worthwhile to note that practical utility plays no essential role in the validation process. Mathematicians may hope that their results are useful to others, but this is not a requirement for a theorem to be considered correct.

Engineers, in contrast, take as their basic goal the development of practical devices. A device is a set of interoperating components that produce some useful effect. The word "useful" may be broadly interpreted: sometimes the utility of a new device may be speculative, or it may be useful only as a subcomponent of a larger device.
Either way, the requirement for proclaiming success in engineering is a demonstration that the device works. It is very difficult to game this process: if the new airplane fails to take off or the new microprocessor fails to multiply numbers correctly, it is obvious that these results are low-quality. Thus, this public demonstration process provides engineering with its method of objective validation.

The third category, and the focus of this book, is empirical science. Empirical scientists attempt to obtain theories of natural phenomena. A theory is a tool that enables the scientist to make predictions regarding a phenomenon. The value and quality of a theory depends entirely on how well it can predict the phenomenon to which it applies. Empirical scientists are similar to mathematicians in the purist attitude they take toward the product of their research: they may hope that a new theory will have practical applications, but this is not a requirement.

Mathematics and engineering both have long histories. Mathematics dates back at least to 500 BC, when Pythagoras proved the theorem that bears his name. Engineering is even older; perhaps the first engineers were the men who fashioned axes and spearheads out of flint and thus ushered in the Stone Age. Systematic empirical science, in contrast, started only relatively recently, building on the work of thinkers like Galileo, Descartes, and Bacon. It is worth asking why the ancient philosophers in civilizations like Greece, Babylon, India, and China, in spite of their general intellectual advancement, did not begin a systematic empirical investigation of various natural phenomena.

The delay could have been caused by the fact that, for a long time, no one realized that there could be, or needed to be, an area of intellectual inquiry that was distinct from mathematics and engineering. Even today, it is difficult for nonscientists to appreciate the difference between a statement of mathematics and a statement of empirical science. After all, physical laws are almost always expressed in mathematical terms. What is the difference between the Newtonian statement $F = ma$ and the Pythagorean theorem $a^2 + b^2 = c^2$? These statements, though they are expressed in a similar form, are in fact completely different constructs: one is an empirical theory, the other is a mathematical law. Several heuristics can be used to differentiate between the two types of statement. One good technique is to ask if the statement could be invalidated by some new observation or evidence. One could draw a misshapen triangle that did not obey the Pythagorean theorem, but that would hardly mean anything about the truth of the theorem. In contrast, there are observations that could invalidate Newton's laws, and in fact such observations were made as a result of Einstein's theory of relativity. There are, in turn, also observations that could disprove relativity.

Ancient thinkers might also have failed to see how it could be meaningful to make statements about the world that were not essentially connected to the development of practical devices. An ancient might very well have believed that it was impossible or meaningless to find a unique optimal theory of gravity and mass. Instead, on this view, scientists should develop a toolbox of methods for treating these phenomena. Engineers should then select a tool that is well-suited to the task at hand.
So an engineer might very well utilize one theory of gravity to design a bridge, and then use some other theory when designing a catapult. In this mindset, theories can only be evaluated by incorporating them into some practical device and then testing the device.

Empirical science is also unique in that it depends on a special process for obtaining new results. This process is called the scientific method; there is no analogous procedure in mathematics or engineering. Without the scientific method, empirical scientists cannot do much more than make catalogs of disconnected and uninterpretable observations. When equipped with the method, scientists begin to discern the structure and meaning of the observational data. But as explained in the next section, the scientific method is only obvious in hindsight. It is built upon deep philosophical commitments that would have seemed bizarre to an ancient thinker.

1.1.3 Toward a Scientific Method

To understand the philosophical commitments implicit in empirical science, and to see why those commitments were nonobvious to the ancients, it is helpful to look at some other plausible scientific procedures. To do so, it is convenient to introduce the following simplified abstract description of the goal of scientific reasoning. Let $x$ be an experimental configuration, and $y$ be the experimental outcome. The variables $x$ and $y$ should be thought of not as numbers but as large packets of information including descriptions of various objects and quantities. The goal of science is to find a function $f(\cdot)$ that predicts the outcome of the configuration: $y = f(x)$.

A first approach to this problem, which can be called the pure theoretical approach, is to deduce the form of $f(\cdot)$ using logic alone. In this view, scientists should use the same mechanism for proving their statements that mathematicians use. Here there is no need to check the results of a prediction against the experimental outcome. Just as it is meaningless to check the Pythagorean theorem by drawing triangles and measuring their sides, it is meaningless to check the function $f(\cdot)$ against the actual outcomes $y$. Mathematicians can achieve perfect confidence in their theories without making any kind of appeal to experimental validation, so why shouldn't scientists be able to reason the same way? If Euclid can prove, based on purely logical and conceptual considerations, that the angles of a triangle sum to 180 degrees, why cannot Aristotle use analogous considerations to conclude that all matter is composed of the four classical elements? A subtle critic of this approach might point out that mathematicians require the use of axioms, from which they deduce their results, and it is not clear what statements can play this role in the investigation of real-world phenomena. But even this criticism can be answered: perhaps the existence of human reason is the only necessary axiom, or perhaps the axioms can be found in religious texts. And even if someone had proposed to check a prediction against the actual outcome, it would not have been at all clear what this meant or how to go about doing it. What would it mean to check Aristotle's theory of the four elements? The ancients must have viewed the crisp proof-based validation method of mathematics as far more rigorous and intellectually satisfying than the tedious, error prone, and conceptually murky process of observation and prediction-checking.
At the other extreme from the pure theoretical approach is the strategy of searching for $f(\cdot)$ using a purely experimental investigation of various phenomena. The plan here would be to conduct a large number of experiments, and compile the results into an enormous almanac. Then to make a prediction in a given situation, one simply looks up a similar situation in the almanac, and uses the recorded value. For example, one might want to predict whether a bridge will collapse under a certain weight. Then one simply looks up the section marked "bridges" in the almanac, finds the bridge in the almanac that is most similar to the one in question, and notes how much weight it could bear. In other words, the researchers obtain a large number of data samples $\{x_i, y_i\}$ and define $f(\cdot)$ as an enormous lookup table. The pure experimental approach has an obvious drawback: it is immensely labor-intensive. The researchers given the task of compiling the section on bridges must construct several different kinds of bridges, and pile them up with weight until they collapse. Bridge building is not easy work, and the almanac section on bridges is only one among many. The pure experimental approach may also be inaccurate, if the almanac includes only a few examples relating to a certain topic.

Obviously, neither the pure theoretical approach nor the pure experimental approach is very practical. The great insight of empirical science is that one can effectively combine experimental and theoretical investigation in the following way. First, a set of experiments corresponding to configurations $\{x_1, x_2, \ldots, x_N\}$ are performed, leading to outcomes $\{y_1, y_2, \ldots, y_N\}$. The difference between this process and the pure experimental approach is that here the number of tested configurations is much smaller. Then, in the theoretical phase, one attempts to find a function $f(\cdot)$ that agrees with all of the data: $y_i = f(x_i)$ for all $i$. If such a function is found, and it is in some sense simple, then one concludes that it will generalize and make correct predictions when applied to new configurations that have not yet been tested.

This description of the scientific process should produce a healthy dose of sympathy for the ancient thinkers who failed to discover it. The idea of generalization, which is totally essential to the entire process, is completely nonobvious and raises a number of nearly intractable philosophical issues. The hybrid process assumes the existence of a finite number of observations $x_i$, but claims to produce a universal predictive rule $f(\cdot)$. Under what circumstances is this legitimate? Philosophers have been grappling with this question, called the Problem of Induction, since the time of David Hume. Also, a moment's reflection indicates that the problem considered in the theoretical phase does not have a unique solution. If the observed data set is finite, then there will be a large number of functions $f(\cdot)$ that agree with it. These functions must make the same predictions for the known data $x_i$, but may make very different predictions for other configurations.
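To make the hybrid procedure concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: a handful of configuration-outcome pairs are "measured", a simple candidate $f(\cdot)$ (here a degree-one polynomial) is fit to them, and the fitted function is then asked to generalize to configurations that were never tested.

```python
import numpy as np

# Hypothetical "experimental" phase: a small set of configuration-outcome
# pairs (x_i, y_i), generated here from an invented underlying law plus noise.
rng = np.random.default_rng(0)
x_obs = np.linspace(0.0, 5.0, 8)
y_obs = 3.0 * x_obs + 1.0 + rng.normal(scale=0.05, size=x_obs.size)

# Theoretical phase: find a simple f that agrees with all the data.
# The candidate family is low-degree polynomials; degree one suffices here.
f = np.poly1d(np.polyfit(x_obs, y_obs, deg=1))

# Generalization: apply f to configurations that have not yet been tested.
x_new = np.array([7.0, 11.0])
print("predictions for untested configurations:", f(x_new))
```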
1.1.4 Occam's Razor

William of Occam famously articulated the principle that bears his name with the Latin phrase entia non sunt multiplicanda praeter necessitatem: entities must not be multiplied without necessity. In plainer English, this means that if a theory is adequate to explain a body of observations, then one should not add gratuitous embellishments or clauses to it. To wield Occam's Razor means to take a theory and cut away all of the inessential parts until only the core idea remains.

Scientists use Occam's Razor to deal with the problem of theory degeneracy mentioned above. Given a finite set of experimental configurations $\{x_1, x_2, \ldots, x_N\}$ and corresponding observed outcomes $\{y_1, y_2, \ldots, y_N\}$, there will always be an infinite number of functions $f_1, f_2, \ldots$ that agree with all the observations. The number of compatible theories is infinite because one can always produce a new theory by adding a new clause or qualification to a previous theory. For example, one theory might be expressed in English as "General relativity holds everywhere in space". This theory agrees with all known experimental data. But one could then produce a new theory that says "General relativity holds everywhere in space except in the Alpha Centauri solar system, where Newton's laws hold." Since it is quite difficult to show the superiority of the theory of relativity over Newtonian mechanics even in our local solar system, it is probably almost impossible to show that relativity holds in some other, far-off star system. Furthermore, an impious philosopher could generate an effectively infinite number of variant theories of this kind, simply by replacing "Alpha Centauri" with the name of some other star. This produces a vast number of conflicting accounts of physical reality, each with about the same degree of empirical evidence.

Scientists use Occam's Razor to deal with this kind of crisis by justifying the disqualification of the variant theories mentioned above. Each of the variants has a gratuitous subclause that specifies a special region of space where relativity does not hold. The subclause does not improve the theory's descriptive accuracy; the theory would still agree with all observational data if it were removed. Thus, the basic theory that relativity holds everywhere stands out as the simplest theory that agrees with all the evidence. Occam's Razor instructs us to accept the basic theory as the current champion, and only revise it if some new contradictory evidence arrives.

This idea sounds attractive in the abstract, but raises a thorny philosophical problem when put into practice. Formally, the razor requires one to construct a functional $H[f]$ that rates the complexity of a theory. Then, given a set of theories $F$, all of which agree with the empirical data, the champion theory is simply the least complex member of $F$:

$$f^* = \arg\min_{f \in F} H[f]$$

The problem is: how does one obtain the complexity functional $H$? Given two candidate definitions for the functional, how does one decide which is superior? It may very well be that complexity is in the eye of the beholder, and that two observers can legitimately disagree about which of two theories is more complex. This disagreement would, in turn, cause them to disagree about which member of a set of candidate theories should be considered the champion on the basis of the currently available evidence. This kind of disagreement appears to undermine the objectivity of science. Fortunately, in practice, the issue is not insurmountable. Informal measures of theory complexity, such as the number of words required to describe a theory in English, seem to work well enough. Most scientists would agree that "relativity holds everywhere" is simpler than "relativity holds everywhere except around Alpha Centauri". If a disagreement persists, then the disputants can, in most cases, settle the issue by running an actual experiment.
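As a toy rendering of this selection rule, the sketch below uses the informal word-count measure as the complexity functional $H$; the candidate theories are the ones from the discussion above, all assumed to agree equally well with the evidence.

```python
# Candidate theories, all assumed to agree equally well with the evidence.
candidates = [
    "relativity holds everywhere",
    "relativity holds everywhere except around Alpha Centauri",
    "relativity holds everywhere except around Barnard's Star",
]

def H(theory: str) -> int:
    """Crude complexity functional: words needed to state the theory."""
    return len(theory.split())

# Occam's Razor: the champion is the least complex compatible theory.
champion = min(candidates, key=H)
print(champion)  # -> "relativity holds everywhere"
```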
1.1.5 Problem of Demarcation and Falsifiability Principle

The great philosopher of science Karl Popper proposed a principle called falsifiability that substantially clarified the meaning and justification of scientific theorizing [92]*. Popper was motivated by a desire to rid the world of pseudosciences such as astrology and alchemy. The problem with this goal is that astrologers and alchemists may very well appear to be doing real science, especially to laypeople. Astrologers may employ mathematics, and alchemists may utilize much of the same equipment as chemists. Some people who promote creationist or religiously inspired accounts of the origins of life make plausible sounding arguments and appear to be following the rules of logical inference. These kinds of surface similarities may make it impossible for nonspecialists to determine which fields are scientific and which fields are not. Indeed, even if everyone agreed that astronomy is science but astrology is not, it would be important from a philosophical perspective to justify this determination. Popper calls this the Problem of Demarcation: how to separate scientific theories from nonscientific ones.

Popper answered this question by proposing the principle of falsifiability. He required that, in order for a theory to be scientific, it must make a prediction with enough confidence that, if the prediction disagreed with the actual outcome of an appropriate experiment or observation, the theory would be discarded. In other words, a scientist proposing a new theory must be willing to risk embarrassment if it turns out the theory does not agree with reality. This rule prevents people from constructing grandiose theories that have no empirical consequences. It also prevents people from using a theory as a lens that distorts all observations so as to render them compatible with its abstractions. If Aristotle had been aware of the idea of falsifiability, he might have avoided developing his silly theory of the four elements, by realizing that it made no concrete predictions.

In terms of the notation developed above, the falsifiability principle requires that a theory can be instantiated as a function $f(\cdot)$ that applies to some real world configurations. Furthermore, the theory must designate a configuration $x$ and a prediction $f(x)$ with enough confidence that if the experiment is done, and the resulting $y$ value does not agree with the prediction, $y \neq f(x)$, then the theory is discarded. This condition is fairly weak, since it requires a prediction for only a single configuration. The point is that the falsifiability principle does not say anything about the value of a theory; it only states a requirement for the theory to be considered scientific. It is a sort of precondition that guarantees the theory can be evaluated in relation to other theories. It is very possible for a theory to be scientific but wrong.

In addition to marking a boundary between science and pseudoscience, the falsifiability principle also permits one to delineate between statements of mathematics and empirical science. Mathematical statements are not falsifiable in the same way empirical statements are.
Mathematicians do not and cannot use the falsifiability principle; their results are verified using an alternate criterion: the mathematical proof. No new empirical observation or experiment could falsify the Pythagorean theorem. A person who drew a right triangle and attempted to show that the lengths of its sides did not satisfy $a^2 + b^2 = c^2$ would just be ridiculed. Mathematical statements are fundamentally implications: if the axioms are satisfied, then the conclusion follows logically.

The falsifiability principle is strong medicine, and comes, as it were, with a set of powerful side-effects. Most prominently, the principle allows one to conclude that a theory is false, but provides no mechanism whatever to justify the conclusion that a theory is true. This fact is rooted in one of the most basic rules of logical inference: it is impossible to assert universal conclusions on the basis of existential premises. Consider the theory "all swans are white". The sighting of a black swan, and the resulting premise "some swans are black", leads one to conclude that the theory is false. But no matter how many white swans one may happen to observe, one cannot conclude with perfect confidence that the theory is true. According to Popper, the only way to establish a scientific theory is to falsify all of its competitors. But because the number of competitors is vast, they cannot all be disqualified. This promotes a stance of radical skepticism towards scientific knowledge.

1.1.6 Science as a Search Through Theory-Space

Though the scientific method is not monolithic or precisely defined, the following list describes it fairly well (a schematic rendering of this loop in code follows the list):

1. Through observation and experiment, amass an initial corpus of configuration-outcome pairs $\{x_i, y_i\}$ relating to some phenomenon of interest.
2. Let $f_C$ be the initial champion theory.
3. Through observation and analysis, develop a new theory, which may either be a refinement of the champion theory, or something completely new. Prefer simpler candidate theories to more complex ones.
4. Instantiate the new theory in a predictive function $f_N$. If this cannot be done, the theory is not scientific.
5. Find a configuration $x$ for which $f_C(x) \neq f_N(x)$, and run the indicated experiment.
6. If the outcome agrees with the rival theory, $y = f_N(x)$, then discard the old champion and set $f_C = f_N$. Otherwise discard $f_N$.
7. Return to step 3.
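Nothing below is a real system; the helper functions (propose_candidate, find_crucial_configuration, run_experiment) are placeholders for the human activities the list describes.

```python
def theory_space_search(f_C, propose_candidate, find_crucial_configuration,
                        run_experiment, n_rounds=100):
    """Schematic version of the seven-step method: propose a rival, find a
    configuration where champion and rival disagree, run the experiment,
    and keep whichever theory the outcome favors."""
    for _ in range(n_rounds):
        f_N = propose_candidate(f_C)              # step 3
        if f_N is None:                           # step 4: theory cannot be
            continue                              # instantiated -> unscientific
        x = find_crucial_configuration(f_C, f_N)  # step 5: f_C(x) != f_N(x)
        y = run_experiment(x)
        # Step 6. In practice the test is approximate (y is never exactly
        # f(x)); the next paragraph discusses this.
        if y == f_N(x):
            f_C = f_N                             # rival becomes the champion
    return f_C                                    # best approximation so far
```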
The scientific process described above makes a crucial assumption, which is that perfect agreement between theory and experiment can be observed, such that $y = f(x)$. In practice, scientists never observe $y = f(x)$ but rather $y \approx f(x)$. This fact does not break the process described above, because even if neither theory is perfectly correct, it is reasonable to assess one theory as "more correct" than another and thereby discard the less correct one. However, the fact that real experiments never agree perfectly with theoretical predictions has important philosophical consequences, because it means that scientists are searching not for perfect truth but for good approximations. Most physicists will admit that even their most refined theories are mere approximations, though they are spectacularly accurate approximations.

In the light of this idea about approximation, the following conception of science becomes possible. Science is a search through a vast space $F$ that contains all possible theories. There is some ideal theory $f^* \in F$, which correctly predicts the outcome of all experimental configurations. However, this ideal theory can never be obtained. Instead, scientists proceed towards $f^*$ through a process of iterative refinement. At every moment, the current champion theory $f_C$ is the best known approximation to $f^*$, and each time a champion theory is unseated in favor of a new candidate, the new $f_C$ is a bit closer to $f^*$.

This view of science as a search for good approximations brings up another nonobvious component of the philosophical foundations of empirical science. If perfect truth cannot be obtained, why is it worth expending so much effort to obtain mere approximations? Wouldn't one expect that using an approximation might cause problems at crucial moments? If the theory that explains an airplane's ability to remain aloft is only an approximation, why is anyone willing to board an airplane? The answer is, of course, that the approximation is good enough. The fact that perfection is unachievable does not and should not dissuade scientists from reaching toward it. A serious runner considers it deeply meaningful to attempt to run faster, though it is impossible for him to complete a mile in less than a minute. In the same way, scientists consider it worthwhile to search for increasingly accurate approximations, though perfect truth is unreachable.

1.1.7 Circularity Commitment and Reusability Hypothesis

Empirical scientists follow a unique conceptual cycle in their work that begins and ends in the same place. Mathematicians start from axioms and move on to theorems. Engineers start from basic components and assemble them into more sophisticated devices. An empirical scientist begins with an experiment or set of observations that produce measurements. She then contemplates the data and attempts to understand the hidden structure of the measurements. If she is smart and lucky, she might discover a theory of the phenomenon. To test the theory, she uses it to make predictions regarding the original phenomenon. In other words, the same phenomenon acts as both the starting point and the ultimate justification for a theory. This dedication to the single, isolated goal of describing a particular phenomenon is called the Circularity Commitment.

The nonobviousness of the Circularity Commitment can be understood by considering the alternative. Imagine a scientific community in which theories are not justified by their ability to make empirical predictions, but by their practical utility. For example, a candidate theory of thermodynamics might be evaluated based on whether it can be used to construct combustion engines. If the engine works, the theory must be good. This reasoning is actually quite plausible, but science does not work this way. No serious scientist would suggest that because the theory of relativity is not relevant to or useful for the construction of airplanes, it is not an important or worthwhile theory. Modern physicists develop theories regarding a wide range of esoteric topics such as quantum superfluidity and the entropy of black holes without concerning themselves with the practicality of those theories. Empirical scientists are thus very similar to mathematicians in the purist attitude they adopt regarding their work.
In a prescientific age, a researcher expressing this kind of dedication to pure empirical inquiry, especially given the effort required to carry out such an inquiry, might be viewed as an eccentric crank or religious zealot. In modern times no such stigma exists, because everyone can see that empirical science is eminently practical. This leads to another deeply surprising idea, here called the Reusability Hypothesis: in spite of the fact that scientists are explicitly unconcerned with the utility of their theories, it just so happens that those theories tend to be extraordinarily useful. Of course, no one can know in advance which areas of empirical inquiry will prove to be technologically relevant. But the history of science demonstrates that new empirical theories often catalyze the development of amazing new technologies. Thus Maxwell's unified theory of electrodynamics led to a wide array of electronic devices, and Einstein's theory of relativity led to the atomic bomb. The fact that large sums of public money are spent on constructing ever-larger particle colliders is evidence that the Reusability Hypothesis is well understood even by government officials and policy makers.

The Circularity Commitment and the Reusability Hypothesis complement each other naturally. Society would never be willing to fund scientific research if it did not produce some tangible benefits. But if society explicitly required scientists to produce practical results, the scope of scientific investigation would be drastically reduced. Einstein would not have been able to justify his research into relativity, since that theory had few obvious applications at the time it was invented. The two philosophical ideas justify a fruitful division of labor. Scientists aim with intent concentration at a single target: the development of good empirical theories. They can then hand off their theories to the engineers, who often find the theories to be useful in the development of new technologies.

1.2 Sophie's Method

This section develops a refined version of the scientific method, in which large databases are used instead of experimental observations as the necessary empirical ingredient. The necessary modifications are fairly minor, so the revised version includes all of the same conceptual apparatus as the standard version. At the same time, the modification is significant enough to considerably expand the scope of empirical science. The refined version is developed through a series of thought experiments relating to a fictional character named Sophie.

1.2.1 The Shaman

Sophie is an assistant professor of physics at a large American state university. She finds this job vexing for several reasons, one of which is that she has been chosen by the department to teach a physics class intended for students majoring in the humanities, for whom it serves to fill a breadth requirement. The students in this class, who major in subjects like literature, religious studies, and philosophy, tend to be intelligent but also querulous and somewhat disdainful of the "merely technical" intellectual achievements of physics.

In the current semester she has become aware of the presence in her class of a discalced student with a large beard and often bloodshot eyes. This student is surrounded by an entourage of similarly strange looking followers.
Sophie is on good terms with some of the more serious students in the class, and in conversation with them has found out that the odd student is attempting to start a new naturalistic religious movement and refers to himself as a "shaman".

One day while delivering a simple lecture on Newtonian mechanics, Sophie is surprised when the shaman raises his hand. When Sophie calls on him, he proceeds to claim that physics is a propagandistic hoax designed by the elites as a way to control the population. Sophie blinks several times, and then responds that physics can't be a hoax because it makes real-world predictions that can be verified by independent observers. The shaman counters by claiming that the so-called "predictions" made by physics are in fact trivialities, and that he can obtain better forecasts by communing with the spirit world. He then proceeds to challenge Sophie to a predictive duel, in which the two of them will make forecasts regarding the outcome of a simple experiment, the winner being decided based on the accuracy of the forecasts. Sophie is taken aback by this but, hoping that by proving the shaman wrong she can break the spell he has cast on some of the other students, agrees to the challenge.

During the next class, Sophie sets up the following experiment. She uses a spring mechanism to launch a ball into the air at an angle $\theta$. The launch mechanism allows her to set the initial velocity of the ball to a value of $v_i$. She chooses as a predictive test the problem of predicting the time $t_f$ at which the ball will fall back to the ground after being launched at $t_i = 0$. Using a trivial Newtonian calculation she concludes that $t_f = 2 g^{-1} v_i \sin(\theta)$, sets $v_i$ and $\theta$ to give a value of $t_f = 2$ seconds, and announces her prediction to the class.
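As a numerical check of the Newtonian calculation (the launch settings below are assumed for illustration; any $v_i$ and $\theta$ satisfying $2 v_i \sin(\theta) / g = 2$ would serve):

```python
import math

g = 9.81                                  # gravitational acceleration, m/s^2
theta = math.radians(90.0)                # launch angle: straight up, for simplicity
v_i = g * 2.0 / (2.0 * math.sin(theta))   # choose v_i so that t_f = 2 s

t_f = 2.0 * v_i * math.sin(theta) / g     # Newtonian time of flight
print(t_f)                                # -> 2.0
```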
She then asks the shaman for his prediction. The shaman declares that he must consult with the wind spirits, and then spends a couple of minutes chanting and muttering. Then, dramatically flaring open his eyes as if to signify a moment of revelation, he grabs a piece of paper, writes his prediction on it, and hands it to another student. Sophie suspects some kind of trick, but is too exasperated to investigate and so launches the ball into the air. The ball is equipped with an electronic timer that starts and stops when an impact is detected, so the number registered in the timer is just the time of flight $t_f$. A student picks up the ball and reports that the result is $t_f = 2.134$. The shaman gives a gleeful laugh, and the student holding his written prediction hands it to Sophie. On the paper is written $1 < t_f < 30$. The shaman declares victory: his prediction turned out to be correct, while Sophie's was incorrect (it was off by $0.134$ seconds).

To counter the shaman's claim, and because it was on the syllabus anyway, in the next class Sophie begins a discussion of probability theory. She goes over the basic ideas, and then connects them to the experimental prediction made about the ball. She points out that technically, the Newtonian prediction $t_f = 2$ is not an assertion about the exact value of the outcome. Rather it should be interpreted as the mean of a probability distribution describing possible outcomes. For example, one might use a normal distribution with mean $\mu = t_f = 2$ and $\sigma = 0.3$. The reason the shaman superficially seemed to win the contest is that he gave a probability distribution while Sophie gave a point prediction; these two types of forecast are not really comparable. In the light of probability theory, the reason to prefer the Newtonian prediction over the shamanic one is that it assigns a higher probability to the outcome that actually occurred. Now, plausibly, if only a single trial is used then the Newtonian theory might simply have gotten lucky, so the reasonable thing to do is combine the results over many trials, by multiplying the probabilities together. Therefore, the formal justification for preferring the Newtonian theory to the shamanic theory is that:

$$\prod_k P_{newton}(t_{f,k}) > \prod_k P_{shaman}(t_{f,k})$$

where the index $k$ runs over many trials of the experiment. Sophie then shows how the Newtonian probability predictions are both more confident and more correct than the shamanic predictions. The Newtonian predictions assign a very large amount of probability to the region around the outcome $t_f = 2$, and in fact it turns out that almost all of the real data outcomes fall in this range. In contrast, the shamanic prediction assigns a relatively small amount of probability to the $t_f = 2$ region, because he has predicted a very wide interval ($1 < t_f < 30$). Thus while the shamanic prediction is correct, it is not very confident. The Newtonian prediction is correct and highly confident, and so it should be preferred.

Sophie tries to emphasize that the Newtonian probability prediction $P_{newton}$ only works well for the real data. Because of the requirement that probability distributions be normalized, the Newtonian theory can only achieve good results by reassigning probability towards the region around $t_f = 2$ and away from other regions. A theory that does not perform this kind of reassignment cannot achieve superior performance.

Sophie recalls that some of the students are studying computer science, and for their benefit points out the following. The famous Shannon equation $L(x) = -\log_2 P(x)$ governs the relationship between the probability of an outcome and the length of the optimal code that should be used to represent it. Therefore, given a large data file containing the results of many trials of the ballistic motion experiment, the two predictions (Newtonian and shamanic) can both be used to build specialized programs to compress the data file. Using the Shannon equation, the above inequality can be rewritten as follows:

$$\sum_k L_{newton}(t_{f,k}) < \sum_k L_{shaman}(t_{f,k})$$

This inequality indicates an alternative criterion that can be used to decide between two rival theories. Given a data file recording measurements related to a phenomenon of interest, a scientific theory can be used to write a compression program that will shrink the file to a small size. To decide between two rival theories of the same phenomenon, one invokes the corresponding compressors on a shared benchmark data set, and prefers the theory that achieves a smaller encoded file size. This criterion is equivalent to the probability-based one, but has the advantage of being more concrete, since the quantities of interest are file lengths instead of probabilities.
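A minimal sketch of the codelength criterion, with all numbers invented in the spirit of the story: the Newtonian forecast is the normal distribution with $\mu = 2$, $\sigma = 0.3$, and the shamanic forecast is uniform on the interval $(1, 30)$. A real compressor would discretize $t_f$ to finite precision, which adds the same constant to both totals and so does not affect the comparison.

```python
import math

def newton_bits(t, mu=2.0, sigma=0.3):
    """Shannon codelength -log2 p(t) under the Newtonian (normal) forecast."""
    p = math.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    return -math.log2(p)

def shaman_bits(t, lo=1.0, hi=30.0):
    """Codelength under the shamanic forecast: uniform on (lo, hi)."""
    return -math.log2(1.0 / (hi - lo)) if lo < t < hi else math.inf

# Invented outcomes of repeated trials, clustered around t_f = 2.
trials = [2.134, 1.98, 2.05, 1.91, 2.22]

print(sum(newton_bits(t) for t in trials))  # small total (even negative here,
                                            # since the density exceeds 1)
print(sum(shaman_bits(t) for t in trials))  # about log2(29) bits per trial
```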
1.2.2 The Dead Experimentalist

Sophie is a theoretical physicist and, upon taking up her position as assistant professor, began a collaboration with a brilliant experimental physicist who had been working at the university for some time. The experimentalist had previously completed the development of an advanced apparatus that allowed the investigation of an exotic new kind of quantum phenomenon. Using data obtained from the new system, Sophie made rapid progress in developing a mathematical theory of the phenomenon. Tragically, just before Sophie was able to complete her theory, the experimentalist was killed in a laboratory explosion that also destroyed the special apparatus. After grieving for a while, Sophie decided that the best way to honor her friend's memory would be to bring the research they had been working on to a successful conclusion.

Unfortunately, there is a critical problem with Sophie's plan. The experimental apparatus had been completely destroyed, and Sophie's late partner was the only person in the world who could have rebuilt it. He had run many trials of the system before his death, so Sophie had quite a large quantity of data, but she had no way of generating any new data. Thus, no matter how beautiful and perfect her theory might be, she had no way of testing it by making predictions.

One day, while thinking about the problem, Sophie recalls the incident with the shaman. She remembers the point she had made for the benefit of the software engineers, about how a scientific theory could be used to compress a real-world data set to a very small size. Inspired, she decides to apply the data compression principle as a way of testing her theory. She immediately returns to her office and spends the next several weeks writing Matlab code, converting her theory into a compression algorithm. The resulting compressor is successful: it shrinks the corpus of experimental data from an initial size of $8.7 \cdot 10^{11}$ bits to an encoded size of $3.3 \cdot 10^{9}$ bits. Satisfied, Sophie writes up the theory and submits it to a well-known physics journal.

The journal editors like the theory, but are a bit skeptical of the compression-based method for testing it. Sophie argues that if the theory becomes widely known, one of the other experts in the field will develop a similar apparatus, which can then be used to test the theory in the traditional way. She also offers to release the experimental data, so that other researchers can test their own theories using the same compression principle. Finally, she promises to release the source code of her program, to allow external verification of the compression result. These arguments finally convince the journal editors to accept the paper.

1.2.3 The Rival Theory

After all the mathematics, software development, prose revisions, and persuasion necessary to complete her theory and have the paper accepted, Sophie decides to reward herself by living the good life for a while. She is confident that her theory is essentially correct, and will eventually be recognized as correct by her colleagues. So she spends her time reading novels and hanging out in coffee shops with her friends.

A couple of months later, however, she receives an unpleasant shock in the form of an email from a colleague which is phrased in consolatory language, but does not contain any clue as to why such language might be in order. After some investigation she finds out that a new paper has been published about the same quantum phenomenon of interest to Sophie. The paper proposes an alternative theory of the phenomenon which bears no resemblance whatever to Sophie's.
Furthermore, the paper reports a better compression rate than the one achieved by Sophie, on the very database that she released.

Sophie reads the new paper and quickly concludes that it is worthless. The theory depends on the introduction of a large number of additional parameters, the values of which must be obtained from the data itself. In fact, a substantial portion of the paper involves a description of a statistical algorithm that estimates optimal parameter values from the data. In spite of these aesthetic flaws, she finds that many of her colleagues are quite taken with the new paper, and some consider it to be the "next big thing". Sophie sends a message to the journal editors describing in detail what she sees as the many flaws of the upstart paper. The editors express sympathy, but point out that the new theory outperforms Sophie's theory according to the performance metric she herself proposed. The beauty of a theory is important, but its correctness is ultimately more important.

Somewhat discouraged, Sophie sends a polite email to the authors of the new paper, congratulating them on their result and asking to see their source code. Their response, which arrives a week later, contains a vague excuse about how the source code is not properly documented and relies on proprietary third-party libraries. Annoyed, Sophie contacts the journal editors again and asks them for the program they used to verify the compression result. They reply with a link to a binary version of the program.

When Sophie clicks on the link to download the program, she is annoyed to find that it has a size of 800 megabytes. But her annoyance is quickly transformed into enlightenment, as she realizes what happened, and that her previous philosophy contained a serious flaw. The upstart theory is not better than hers; it has only succeeded in reducing the size of the encoded data by dramatically increasing the size of the compressor. Indeed, when dealing with specialized compressors, the distinction between "program" and "encoded data" becomes almost irrelevant. The critical number is not the size of the compressed file, but the net size of the encoded data plus the compressor itself.

Sophie writes a response to the new paper which describes the refined compression rate principle. She begins the paper by reiterating the unfortunate circumstances which forced her to appeal to the principle, and by expressing the hope that someday an experimental group will rebuild the apparatus developed by her late partner, so that the experimental predictions made by the two theories can be properly tested. Until that day arrives, standard scientific practice does not permit a decisive declaration of theoretical success. But surely there is some theoretical statement that can be made in the meantime, given the large quantity of data that is available. Sophie's proposal is that the goal should be to find the theory that has the highest probability of predicting a new data set, when one can finally be obtained. If the theories are very simple in comparison to the data being modeled, then the size of the encoded data file is a good way of choosing the best theory. But if the theories are complex, then there is a risk of overfitting the data. To guard against overfitting, complex theories must be penalized; a simple way to do this is to take into account the codelength required for the compressor itself.
The length of Sophie's compressor was negligible, so the net score of her theory is just the codelength of the encoded data file: $3.3 \cdot 10^9$ bits. The rival theory achieved a smaller size of $2.1 \cdot 10^9$ bits for the encoded data file, but required a compressor of $6.7 \cdot 10^9$ bits to do so, giving a total score of $8.8 \cdot 10^9$ bits. Since Sophie's net score is lower, her theory should be preferred.

1.3 Compression Rate Method

In the course of the thought experiments discussed above, the protagonist Sophie articulated a refined version of the scientific method. This procedure will be called the Compression Rate Method (CRM). The web of concepts related to the CRM will be called the comperical philosophy of science, for reasons that will become evident in the next section. The CRM consists of the following steps (a schematic sketch of the resulting search loop is given after the list):

1. Obtain a vast database $T$ relating to a phenomenon of interest.

2. Let $f_C$ be the initial champion theory.

3. Through observation and analysis, develop a new theory $f_N$, which may be either a simple refinement of $f_C$ or something radically new.

4. Instantiate $f_N$ as a compression program. If this cannot be done, then the theory is not scientific.

5. Score the theory by calculating $L(T|f_N) + H[f_N]$: the sum of the length of the encoded version of $T$ and the length of the compressor itself.

6. If $L(T|f_N) + H[f_N] < L(T|f_C) + H[f_C]$, then discard the old champion and set $f_C = f_N$. Otherwise discard $f_N$.

7. Return to step #3.
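The scoring loop has a natural expression as pseudocode. The following Python sketch is purely schematic: `compress` and `program_length` are hypothetical stand-ins for a theory instantiated as a compression program and for the length of that program, since the method itself is agnostic about the domain and the implementation.

```python
# A schematic sketch of the CRM comparison step, not a real implementation.
def net_score(theory, database):
    """L(T|f) + H[f]: encoded size of the data plus the size of the compressor."""
    return len(theory.compress(database)) + theory.program_length

def crm_step(champion, challenger, database):
    """Step 6 of the method: keep whichever theory has the smaller net score."""
    if net_score(challenger, database) < net_score(champion, database):
        return challenger  # discard the old champion
    return champion        # discard the challenger
```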
It is worthwhile to compare the CRM to the version of the scientific method given in Section 1.1.6. One improvement is that in this version the Occam's Razor principle plays an explicit role, through the influence of the $H$ term. A solution to the Problem of Demarcation is also built into the process, in Step #4. The main difference is that the empirical ingredient in the CRM is a large database, while the traditional method employs experimental observation.

The significance of the CRM can be seen by understanding the relationship between the target database $T$ and the resulting theories. If $T$ contains data related to the outcomes of physical experiments, then physical theories will be necessary to compress it. If $T$ contains information related to interest rates, house prices, global trade flows, and so on, then economic theories will be necessary to compress it. One obvious choice for $T$ is simply an enormous image database, such as the one hosted by the Facebook social networking site. In order to compress such a database, one must develop theories of visual reality. The idea that there can be an empirical science of visual reality has never before been articulated, and is one of the central ideas of this book. A key argument, contained in Chapter 3, is that the research resulting from the application of the CRM to a large database of natural images will produce a field very similar to modern computer vision. Similarly, Chapter 4 argues that the application of the CRM to a large text corpus will result in a field very similar to computational linguistics. Furthermore, the reformulated versions of these fields will have far stronger philosophical foundations, due to the explicit connection between the CRM and the traditional scientific method.

It is crucial to emphasize the deep connection between compression and prediction. The real goal of the CRM is to evaluate the predictive power of a theory; the compression rate is just a way of quantifying that power. There are three advantages to using the compression rate instead of some other measure of predictive accuracy. First, the compression rate naturally accommodates a model complexity penalty term. Second, the compression rate of a large database is an objective quantity, due to the ideas of Kolmogorov complexity and universal computation, discussed below. Third, the compression principle provides an important verificational benefit. To verify a claim made by an advocate of a new theory, a referee only needs to check the encoded file size, and ensure that the resulting decoded data matches exactly the original database $T$.

Most people express skepticism as their first reaction to the plan of research embodied by the CRM. They generally admit that it may be possible to use the method to obtain increasingly short codes for the target databases, but they balk at accepting the idea that the method will produce anything else of value. The following sections argue that the philosophical commitments implied by the CRM are exactly analogous to those long accepted by scientists working in mainstream fields of empirical science. Comperical science is nonobvious in the year 2011 for exactly the same kinds of reasons that empirical science was nonobvious in the year 1511.

Figure 1.1: Histograms of differences between values of neighboring pixels in a natural image (left) and a random image (right). The clustering of the pixel difference values around 0 in the natural image is what allows compression formats like PNG to achieve compression. Note the larger scale of the image on the left; both histograms represent the same number of pixels.

1.3.1 Data Compression is Empirical Science

The following theorem is well known in data compression. Let $C$ be a program that losslessly compresses bit strings $s$, assigning each string a new code of length $L_C(s)$. Let $U_N(s)$ be the uniform distribution over $N$-bit strings. Then the following bound holds for all compression programs $C$:

$$E_{s \sim U_N}[L_C(s)] \geq N \qquad (1.0)$$

In words, the theorem states that no lossless compression program can achieve average codelengths smaller than $N$ bits, when averaged over all possible $N$-bit input strings. Below, this statement is referred to as the "No Free Lunch" (NFL) theorem of data compression, as it implies that one can achieve compression for some strings $s$ only at the price of inflating other strings. At first glance, this theorem appears to turn the CRM proposal into nonsense. In fact, the theorem is the keystone of the comperical philosophy, because it shows how lossless, large-scale compression research must be essentially empirical in character.

To see this point, consider the following apparent paradox. In spite of the NFL theorem, lossless image compression programs exist and have been in widespread use for years. As an example, the well-known Portable Network Graphics (PNG) compression algorithm seems to reliably produce encoded files that are 40-50% shorter than would be achieved by a uniform encoding. This apparent success seems to violate the No Free Lunch theorem.
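The asymmetry at the heart of the paradox is easy to observe directly with a general-purpose compressor. In this sketch, the zlib module from Python's standard library stands in for PNG: uniformly random bytes inflate slightly under compression, while highly structured bytes of the same length shrink dramatically.

```python
# A quick empirical illustration of the No Free Lunch asymmetry.
import os
import zlib

random_data = os.urandom(100_000)       # bytes drawn (nearly) uniformly at random
structured_data = b"abcd" * 25_000      # a highly regular string of the same length

print(len(zlib.compress(random_data)))      # slightly larger than 100000
print(len(zlib.compress(structured_data)))  # a few hundred bytes at most
```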
The paradox is resolved by noticing that the images used to evaluate image compression algorithms are not drawn from a uniform distribution $U_N(s)$ over images. If lossless image formats were evaluated based on their ability to compress random images, no such format could ever be judged successful. Instead, the images used in the evaluation process belong to a very special subset of all possible images: those that arise as a result of everyday human photography. This "real world" image subset, though vast in absolute terms, is minuscule compared to the space of all possible images. So PNG is able to compress a certain image subset, while inflating all other images; and the subset that PNG is able to compress happens to overlap substantially with the real-world image subset.

The specific empirical regularity used by the PNG format is that in real-world images, adjacent pixels tend to have very similar values. A compressor can exploit this property by encoding the differences between neighboring pixel values instead of the values themselves. The distribution of differences is very narrowly clustered around zero, so they can be encoded using shorter average codes (see Figure 1.1). Of course, this trick does not work for random images, in which there is no correlation between adjacent pixels.
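The difference-encoding trick can be sketched in a few lines. The example below, which assumes NumPy is available, compares the empirical entropy in bits per symbol of raw values and of neighboring differences, for a slowly varying "scanline" and for a random one; the smooth signal is a crude stand-in for a row of natural-image pixels.

```python
# Difference coding pays off exactly when adjacent values are correlated.
import numpy as np

def entropy_bits(values):
    """Empirical entropy of a sequence, in bits per symbol."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
smooth = np.cumsum(rng.integers(-2, 3, size=10_000)) % 256  # slowly varying "scanline"
noise = rng.integers(0, 256, size=10_000)                   # random "scanline"

for name, line in [("smooth", smooth), ("random", noise)]:
    raw = entropy_bits(line)
    diff = entropy_bits(np.diff(line) % 256)  # differences, clustered near 0 if smooth
    print(f"{name}: raw {raw:.2f} bits/px, differences {diff:.2f} bits/px")
```

For the smooth line the differences take only a handful of values, so their entropy falls to roughly two bits per pixel; for the random line the differences remain as costly as the raw values, just as the NFL theorem demands.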
The NFL theorem indicates that in order to succeed, a comperical researcher must follow a strategy analogous to the procedure of physics. First, she must attempt to discover some structure or pattern present in real-world images. Then she must develop a mathematical theory characterizing that structure, and build the theory into a compressor. Finally, she must demonstrate that the theory corresponds to reality, by showing that it achieves an improved compression rate.

To make statements about the world, physicists need to combine mathematical and empirical reasoning; neither alone is sufficient. Consider the following statement of physics: when a ball is tossed into the air, its vertical position will be described by the equation $y(t) = \frac{1}{2} g t^2 + v_0 t + y_0$. That statement can be decomposed into a mathematical component and an empirical component. The mathematical statement is: if a quantity's evolution in time is governed by the differential equation $\frac{d^2 y}{dt^2} = k$, where $k$ is some constant, then its value is given by the function $y(t) = \frac{1}{2} k t^2 + v_0 t + y_0$, where $v_0$ and $y_0$ are determined by the initial conditions. The empirical statement is: if a ball is thrown into the air, its vertical position will be governed by the differential equation $\frac{d^2 y}{dt^2} = g$, where $g$ is the acceleration due to gravity. By combining these statements, the physicist is able to make a variety of predictions.

Just like physicists, comperical researchers must combine mathematical statements with empirical statements in order to make predictions. Because of the NFL theorem, pure mathematics is never sufficient to reach conclusions of the form: "Algorithm Y achieves good compression." Mathematical reasoning can only be used to establish implications: "If the images exhibit property X, then algorithm Y will achieve good compression." In order to actually achieve compression, it is necessary to demonstrate the empirical fact that the images actually have property X. This shows why the comperical proposal is not fundamentally about saving disk space or bandwidth; it is fundamentally about characterizing the properties of images or other types of data.

1.3.2 Comparison to Popperian Philosophy

The comperical philosophy of science bears a strong family resemblance to the Popperian one, and inherits many of its conceptual advantages. First, the compression principle provides a clear answer to the Problem of Demarcation: a theory is scientific if and only if it can be used to build a compressor for an appropriate kind of database. Because of the intrinsic difficulty of lossless data compression, the only way to save bits is to explicitly reassign probability away from some outcomes and toward others. If the theory assigns very low probability to an outcome which then occurs, this suggests that the theory has low quality and should be discarded. Thus, the probability reassignment requirement is just a graduated or continuous version of the falsification requirement. The falsifiability principle means that a researcher hoping to prove the value of his new theory must risk embarrassment if his predictions turn out to be incorrect; the compression principle requires a researcher to face the potential for embarrassment if his new theory ends up inflating the database.

One difference between the Popperian view and the comperical view is that the former appears to justify stark binary assessments regarding the truth or falsehood of a theory, while the latter provides only a number which can be compared to other numbers. If theories are either true or false, then the compression principle is no more useful than the falsifiability principle. But if theories can exist on some middle ground between absolute truth and its opposite, then it makes sense to claim that one theory is relatively more true than another, even if both are imperfect. The compression principle can be used to justify such claims. Falsifiability consigns all imperfect theories to the same garbage bin; compression can be used to rescue the valuable theories from the bin, dust them off, and establish them as legitimate science.

The falsifiability idea seems to imply that theories can be evaluated in isolation: a theory is either true or false, and this assessment does not depend on the content of rival theories. In contrast, while the compression idea assigns a score to an individual theory, this score is useful only for the purpose of comparison. This distinction may be conceptually significant to some people, but in practice it is unimportant. Science is a search for good approximations, and it proceeds by incrementally improving the quality of those approximations. The power of the falsifiability requirement is that it enables a rapid search through the theory-space by ensuring that theories can be decisively compared. The compression requirement provides exactly the same benefit. When a researcher proposes a new theory and shows that it achieves a smaller compressed file size for the target database, this provides decisive evidence that the new theory is superior. Furthermore, both principles allow a research community to identify a champion theory. In the Popperian view, the champion theory is the one that has withstood all attempts at falsification. In the comperical view, the champion theory is the one that achieves the smallest codelength on the relevant benchmark database.

One of the core elements of Popper's philosophy is the dedication to the continual testing, examination, and skepticism of scientific theories. A Popperian scientist is never content with the state of his knowledge. He never claims that a theory is true; he only accepts that there is currently no evidence that would falsify it. The comperical philosopher takes an entirely analogous stance.
To her, a theory is never true or even optimal; it is only the best theory that has been discovered thus far. She will never claim, "the probability of event X is 35%". Instead, she will state that "according to the current champion theory, the probability of event X is 35%". She might even make decisions based on this probability assignment. But if a new theory arrives that provides a better codelength, she immediately replaces her probability estimates and updates her decision policy based on the new theory.

The Popperian commitment to continual examination and criticism of theoretical knowledge is good discipline, but the radical skepticism it promotes is probably a bit too extreme. A strict Popperian would be unwilling to use Newtonian physics once it was falsified, in spite of the fact that it obviously still works for most problems of practical interest. The compression principle promotes a more nuanced view. If a claim is made that a theory provides a good description of a certain phenomenon, and the claim is justified by demonstrating a strong compression result, then the claim is valid for all time. It is possible to develop a new theory that achieves a better compression rate, or to show that the previous theory does not do as well on another related database. These circumstances might suggest that the old theory should no longer be used. But if the old theory provided a good description of a particular database, no future developments will change that fact. This captures the intuition that Newtonian physics still provides a perfectly adequate description of a wide range of phenomena; Eddington's solar eclipse photographs simply showed that there are some phenomena to which it does not apply.

1.3.3 Circularity and Reusability in the Context of Data Compression

Just as empirical scientists do, comperical researchers adopt the Circularity Commitment to guide and focus their efforts. A comperical researcher evaluates a new theory based on one and only one criterion: its ability to compress the database for which it was developed. A community using a large collection of face images will be highly interested in various tools such as hair models, eyeglass detectors, and theories of lip color, and only secondarily interested in potential applications of face modeling technology. If the researchers chose to introduce additional considerations into the theory comparison process, such as the relevance of a theory to a certain type of practical task, they would compromise their own ability to discard low-quality theories and identify high-quality ones.

Some truly purist thinkers may consider large scale data compression to be an intrinsically interesting goal. Comperical researchers will face many challenging problems, involving mathematics, algorithm design, statistical inference, and knowledge representation. Furthermore, researchers will receive a clear signal indicating when they have made progress, and how much. For a certain type of intellectual, these considerations are very significant, even if there is no reason to believe that the investigation will yield any practical results.

In this light, it is worth comparing the proposed field of large scale lossless data compression with the established field of computer chess. Chess is an abstract symbolic game with very little connection to the real world.
A computer chess advocate would find it quite difficult to convince a skeptical audience that constructing powerful chess programs would yield any tangible benefit. However, like the compression goal, the computer chess goal is attractive because it produces a variety of subproblems, and also provides a method for making decisive comparisons between rival solutions. For these reasons, computer scientists devoted a significant amount of effort to the field, leading some to claim that chess was "the Drosophila of AI research". Furthermore, these efforts were incredibly successful, and led to the historic defeat of the top-ranked human grandmaster, Garry Kasparov, by IBM's Deep Blue in 1997. Most scientists would agree that this event was an important advance for human knowledge, even if it did not lead to any practical applications. Because of its similar methodological advantages, comperical research has a similar potential to advance human knowledge.

For the reader who is unmoved by the argument about the intrinsic interest of compression science, it is essential to defend the validity of the Reusability Hypothesis in the context of data compression. The hypothesis really contains two separate pieces. First, theories employ abstractions, and good theories use abstractions that correspond to reality. So the abstraction called "mass" is not just a clever computational trick, but represents a fundamental aspect of reality. These real abstractions are useful both for compression and for practical applications. The second piece of the Reusability Hypothesis is that, while theories based on naïve or simplistic characterizations of reality can achieve compression, the best codelengths will be achieved by theories that use real abstractions. So by vigorously pursuing the compression goal, researchers can identify the real abstractions governing a particular phenomenon, and those abstractions can then be reused for practical applications.

The following examples illustrate the idea of the Reusability Hypothesis. Consider constructing a target database by setting up a video camera next to a highway and recording the resulting image stream. One way to predict image frames (and thus compress the data) would be to identify batches of pixels corresponding to a car, and use an estimate of the car's velocity to interpolate the pixels forward. A compressor that uses this trick thus implicitly contains abstractions related to the concepts of "car" and "velocity". Since these are real abstractions, the Reusability Hypothesis states that the specialized compressor should achieve better compression rates than a more generic one. Another good example of this idea relates to text compression. Here, the Reusability Hypothesis states that a specialized compressor making use of abstractions such as verb conjugation patterns, parts of speech, and rules of grammar will perform better than a generic compressor. If the hypothesis is true, then the same division of labor between scientists and engineers that works for mainstream fields will work here as well. The comperical scientists obtain various abstractions by following the compression principle, and hand them off to the engineers, who will find them very useful for developing applications like automatic license plate readers and machine translation systems.

1.3.4 The Invisible Summit

An important concept related to the Compression Rate Method is the Kolmogorov complexity.
The Kolmogorov complexity $K_A(s)$ of a string $s$ is the length of the shortest program that outputs $s$ when run on a Turing machine $A$. The key property of the Kolmogorov complexity comes about as a consequence of the idea of universal computation. If a Turing machine (roughly equivalent to a programming language) is of sufficient complexity, it becomes universal: it can simulate any other Turing machine, if given the right simulator program. So given a string $s$ and a short program $P_A$ that outputs it when run on Turing machine $A$, one can easily obtain a program $P_B$ that outputs $s$ when run on a universal Turing machine $B$, just by prepending a simulator program $S_{AB}$ to $P_A$, so that $|P_B| = |S_{AB}| + |P_A|$. Now, the simulator program is fixed by the definition of the two Turing machines. Thus, for very long and complex strings, the contribution of the simulator to the total program length becomes insignificant, so that $|P_B| \approx |P_A|$, and the Kolmogorov complexity is effectively independent of the choice of Turing machine.

Unfortunately or not, a brief proof shows that the Kolmogorov complexity is incomputable: a program attempting to compute $K(s)$ cannot be guaranteed to terminate in finite time. This is not surprising, since if a method for computing the Kolmogorov complexity were found, it would be immensely powerful. Such a program would render theoretical physicists unnecessary. Experimental physicists could simply compile a large database of observations, and feed the database to the program. Since the optimal theory of physics provides the best explanation, and thus the shortest encoding, of the data, the program would automatically find the optimal theory of physics on its way to finding the Kolmogorov complexity.

Another way of seeing the impossibility of finding $K(s)$ is to imagine what it would mean to find the Kolmogorov complexity of the Facebook image database. To compress this database to the smallest possible size, one would have to know $P^*(I)$: the probability distribution generating the Facebook images. While $P^*(I)$ may look innocuous, it is in fact a mathematical object of vast complexity, containing an innumerable quantity of details. To begin with, it must contain a highly sophisticated model of the human face. It must contain knowledge of hair styles and facial expressions. It must capture the fact that lips are usually reddish in color, and that women are more likely to enhance this color using lipstick. Moving on from there, it would require knowledge about other things people like to photograph, such as pets, natural scenery, weddings, and boisterous parties. It would need to contain details about the appearance of babies, such as the fact that a baby usually has a pink face, and a head that is large in proportion to the rest of its body. All this knowledge is necessary because, for example, $P^*(I)$ must assign higher probability, and shorter codelength, to an image featuring a woman with red lips than to an image that is identical in every way except that the woman has green lips.

While calculating $K(s)$ is impossible in general, one can find upper bounds for it. Indeed, the Compression Rate Method is just the process of finding a sequence of increasingly tight upper bounds on the Kolmogorov complexity of the target database. Each new champion theory corresponds to a tighter upper bound. In the case of images, a new champion theory corresponds to a new model $P_C(I)$ of the probability of an image. Every iteration of theory refinement packages more realistic information into the model $P_C(I)$, thereby bringing it closer to the unknowable $P^*(I)$.
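Although $K(s)$ itself is incomputable, exhibiting an upper bound is trivial: any concrete compressor that reproduces $s$ exactly provides one. The sketch below uses zlib from the Python standard library as the "current champion theory"; the additive constant accounting for the decompressor's own length is left as a placeholder.

```python
# Any working compressor exhibits an upper bound on Kolmogorov complexity:
# K(s) <= compressed length + the (fixed) length of the decompressor.
import zlib

def k_upper_bound_bits(s: bytes, decompressor_bits: int = 0) -> int:
    """An upper bound on K(s), up to the additive decompressor constant."""
    return 8 * len(zlib.compress(s, 9)) + decompressor_bits

text = b"the rain in Spain stays mainly in the plain " * 200
print(k_upper_bound_bits(text))  # far below the literal 8 * len(text) bits
```

Each improvement in the champion compressor lowers this bound; the sequence of bounds is the visible trace of the otherwise invisible quantity $K(s)$.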
This process is exactly analogous to the search through theory-space carried out by empirical scientists. Both empirical scientists and comperical scientists recognize that their theories are mere approximations. The fact that perfect truth cannot be obtained simply does not matter: it is still worthwhile to climb toward the invisible summit.

1.3.5 Objective Statistics

Due to the direct relationship between statistical modeling and data compression (see Appendix A), comperical research can be regarded as a subfield of statistics. A traditional problem in statistics starts with a set of $N$ observations $\{x_1, x_2, \ldots, x_N\}$ of some quantity, such as the physical heights of a population. By analyzing the data set, the statistician attempts to obtain a good estimate $P(x)$ of the probability of a given height. This model could be, for example, a Gaussian distribution with a given mean and variance. Comperical research involves an entirely analogous process. The difference is that instead of simple one-dimensional numbers, comperical statisticians analyze complex data objects such as images or sentences, and attempt to find good models of the probability of such objects.

All statistical inference must face a deep conceptual issue that has been the subject of acrimonious debate and philosophical speculation since the time of David Hume, who first identified it. This is the Problem of Induction: when is it justified to jump from a limited set of specific observations (the data samples) to a universal rule describing the observations (the model)? This problem has divided statisticians into two camps, the Bayesians and the frequentists, who disagree fundamentally about the meaning and justification of statistical inference. A full analysis of the nature of this disagreement would require its own book, but a very rough summary is that, while the Bayesian approach has a number of conceptual benefits, it is hobbled by its dependence on the use of prior distributions. A Bayesian performs inference by using Bayes' rule to update a prior distribution in response to evidence, thus producing a posterior distribution, which can be used for decision-making and other purposes. The critical problem is that there is no objective way to choose a prior. Furthermore, two Bayesians who start with different priors will reach different conclusions, in spite of observing the same evidence. The use of Bayesian techniques to justify scientific conclusions therefore deprives science of objectivity.

Any data compressor must implement a mapping from data sets $T$ to bit strings of length $L(T)$. This mapping defines an implicit probability distribution $P(T) = 2^{-L(T)}$. It appears, therefore, that comperical statisticians make the same commitment to the use of prior distributions as the Bayesians do. However, there is a crucial subtlety here. Because the length of the compressor itself is taken into account in the CRM, the prior distribution is actually defined by the choice of programming language used to write the compressor. Furthermore, comperical researchers use their models to describe vast datasets. Combined, these two facts imply that comperical statistical inference is objective.
This idea is illustrated by the following thought experiment. Imagine a research subfield which has established a database $T$ as its target for CRM-style investigation. The subfield makes slow but steady progress for several years. Then, out of the blue, an unemployed autodidact from a rural village in India appears with a bold new theory. He claims that his theory, instantiated in a program $P_A$, achieves a compression rate dramatically superior to the current best published results. However, among his other eccentricities, this gentleman uses a programming language he himself developed, which corresponds to a Turing machine $A$. Now, the other researchers of the field are well-meaning but skeptical, since all the previously published results used a standard language corresponding to a Turing machine $B$. But it is easy for the Indian maverick to produce a compressor that will run on $B$: he simply appends $P_A$ to a simulator program $S_{AB}$ that simulates $A$ when run on $B$. The length of the new compressor is $|P_B| = |P_A| + |S_{AB}|$, and all of the other researchers can confirm this. Now, assuming the data set $T$ is large and complex enough that $|P_A| \gg |S_{AB}|$, the codelength of the modified version is effectively the same as the original: $|P_B| \approx |P_A|$. This shows that there can be no fundamental disagreement among comperical researchers regarding the quality of a new result.

1.4 Example Inquiries

This section makes the abstract discussion above tangible by describing several concrete proposals. These proposals begin with a method of constructing a target database, which defines a line of inquiry. In principle, researchers can use any large database that is not completely random as a starting point for a comperical investigation. In practice, unless some care is exercised in the construction of the target dataset, it will be difficult to make progress. In the beginning stages of research, it will be more productive to look at data sources which display relatively limited amounts of variation. Here are some example inquiries that might provide good starting points:

• Attempt to compress the immense image database hosted by the popular Facebook social networking web site. One obvious property of these images is that they contain many faces. To compress them well, it will be necessary to develop a computational understanding of the appearance of faces.

• Construct a target database by packaging together digital recordings of songs, concerts, symphonies, operas, and other pieces of music. This kind of inquiry will lead to theories of the structure of music, which must describe harmony, melody, pitch, rhythm, and the relationships between these variables in different musical cultures. The theories must also contain models of the sounds produced by different instruments, as well as of the human singing voice.

• Build a target database by recording from microphones positioned in treetops. A major source of variation in the resulting data will be bird vocalizations. To compress the data well, it will be necessary to differentiate between bird songs and bird calls, to develop tools that can identify species-characteristic vocalizations, and to build maps showing the typical ranges of various species. In other words, this type of inquiry will be a computational version of the traditional study of bird vocalization carried out by ornithologists.
• Generate a huge database of economic data showing changes in home prices, interest and exchange rate fluctuations, business inventories, welfare and unemployment applications, and so on. To compress this database well, it will be necessary to develop economic theories that are capable of predicting, for example, the effect that changes in interest rates have on home purchases.

Since the above examples involve empirical inquiry into various aspects of reality, any reader who believes in the intrinsic value of science should regard them as at least potentially interesting. Skeptical readers, on the other hand, may doubt the applicability of the Reusability Hypothesis here, and so view an attempt to compress these databases as an eccentric philosophical quest. The following examples are more detailed, and give explicit analysis of what kinds of theories (or computational tools) will be needed, and how those theories will be more widely useful. An important point, common to all of the investigations, is that a single target database can be used to develop and evaluate a large number of methods.

It should be clear that, if successful, these example inquiries will lead to practical applications. The study of music may help composers to write better music, allow listeners to find new music that suits their taste, and assist music publishing companies in determining the quality of a new piece. The investigation of bird vocalization, if successful, should be useful to environmentalists and bird-watchers who might want to monitor the migration and population fluctuations of various avian species. The study of economic data is more speculative, but if successful should be of obvious interest to policy makers and investors. In the case of the roadside video data described below, the result will be sophisticated visual systems that can be used in robotic cars. Also mentioned below is an inquiry into the structure of English text, which should prove useful for speech recognition as well as for machine translation.

1.4.1 Roadside Video Camera

Consider constructing a target database by setting up a video camera next to a highway, and recording video streams of the passing cars. Since the camera does not move, and there is usually not much activity on the sides of highways, the main source of variation in the resulting video will be the automobiles. Therefore, in order to compress the video stream well, it will be necessary to obtain a good computational understanding of the appearance of automobiles.

A simple first step would be to take advantage of the fact that cars are rigid bodies subject to the Newtonian laws of physics. The position and velocity of a car must be continuous functions of time. Given a series of images at timesteps $\{t_0, t_1, t_2, \ldots, t_n\}$, it is possible to predict the image at timestep $t_{n+1}$ simply by isolating the moving pixels in the series (these correspond to the car), and interpolating those pixels forward into the new image, using basic rules of camera geometry and calculus. Since neither the background nor the moving pixel blob changes much between frames, it should be possible to achieve a good compression rate using this simple trick; a crude sketch of the idea is given below.
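The following sketch is an illustrative simplification, assuming grayscale frames stored as NumPy arrays and a known static background frame; the threshold and the centroid-based displacement estimate are stand-ins for the camera-geometry calculation described above, not a realistic motion model.

```python
# A crude version of the moving-blob prediction trick.
import numpy as np

def predict_next_frame(background, prev, curr, threshold=10):
    """Predict frame t+1 by shifting the moving blob by its per-frame displacement."""
    def centroid(frame):
        mask = np.abs(frame.astype(int) - background.astype(int)) > threshold
        ys, xs = np.nonzero(mask)
        return (ys.mean(), xs.mean()) if xs.size else None

    c_prev, c_curr = centroid(prev), centroid(curr)
    if c_prev is None or c_curr is None:
        return curr.copy()  # no moving blob detected: predict that nothing changes
    dy = int(round(c_curr[0] - c_prev[0]))
    dx = int(round(c_curr[1] - c_prev[1]))

    prediction = background.copy()
    mask = np.abs(curr.astype(int) - background.astype(int)) > threshold
    ys, xs = np.nonzero(mask)
    ys2 = np.clip(ys + dy, 0, curr.shape[0] - 1)
    xs2 = np.clip(xs + dx, 0, curr.shape[1] - 1)
    prediction[ys2, xs2] = curr[ys, xs]  # paste the blob at its extrapolated position
    return prediction
```

An encoder built on such a predictor would need to transmit only the (small) residual between the predicted and actual frames, which is where the compression savings come from.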
Further improvements can be achieved by detecting and exploiting patterns in the blob of moving pixels. One observation is that the wheels of a moving car have a simple characteristic appearance: a dark outer ring corresponding to the tire, along with the off-white circle of the hubcap at the center. Because of this characteristic pattern, it should be straightforward to build a wheel detector using standard techniques of supervised learning. One could then save bits by representing the wheel pixels using a specialized model, akin to a graphics program, which draws a wheel of a given size and position. Since it takes fewer bits to encode the size and position parameters than to encode the raw pixels of the wheel, this trick should save codelength.

Further progress could be achieved by conducting a study of the characteristic appearance of the surfaces of cars. Since most cars are painted in a single color, it should be possible to develop a specialized algorithm to identify the frame of the car. Another graphics program could then be used to draw the frame of the car, using a variety of parameters related to its shape. Extra attention would be required to handle the complex reflective appearance of the windshield, but the same general idea would apply. Note that the encoder always has the option of "backing off": if attempts to apply more aggressive encoding methods fail (e.g., if the car is painted in multiple colors), then the simpler pixel-blob encoding method can be used instead.

Additional progress could be achieved by recognizing that most automobiles fall into a discrete set of categories (e.g., a 2009 Toyota Corolla). Since these categories have standardized dimensions, bits could be saved by encoding the category of a car instead of information related to its shape. Initially, the process of building category-specific modules for the appearance of a car might be difficult and time-consuming. But once one has developed modules for the Hyundai Sonata, Chevrolet Equinox, Honda Civic, and Nissan Altima, it should not require much additional work to construct a module for the Toyota Sienna. Indeed, it may be possible to develop a learning algorithm that, through some sort of clustering process, would automatically extract appearance modules for the various car categories from large quantities of roadside video data.

1.4.2 English Text Corpus

Books and other written materials constitute another interesting source of target data for comperical inquiry. Here one simply obtains a large quantity of text, and attempts to compress it. One tool that will be very useful for the compression of English text is an English dictionary. To see this, consider the following sentence:

John went to the liquor store and bought a bottle of ____.

Assume that the word in the blank space has $N$ letters, and that the compressor encodes this information separately. A naïve compressor would require $\log_2(26^N) = N \log_2 26$ bits to encode the word, since there are $26^N$ ways to form an $N$-letter word. A compressor equipped with a dictionary can do much better. First it looks up all the words of length $N$, and then it encodes the index of the actual word in this list. This costs $\log_2(W_N)$ bits, where $W_N$ is the number of words of length $N$ in the dictionary. Since most combinations of letters, such as "yttu" and "qwhg", are not real words, $W_N < 26^N$ and bits are saved.
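The arithmetic is easy to check. In the sketch below, a tiny hypothetical word list stands in for a real dictionary; the naive code pays $N \log_2 26$ bits for an $N$-letter word, while the dictionary code pays only $\log_2 W_N$ bits.

```python
# Codelength of a word under the naive letter code versus a dictionary index.
import math

# A toy stand-in for a real English dictionary.
dictionary = ["wine", "beer", "vodka", "milk", "rum", "gin", "cola", "mead"]

def naive_bits(word):
    """N * log2(26) bits: one uniform letter code per character."""
    return len(word) * math.log2(26)

def dictionary_bits(word, words):
    """log2(W_N) bits: an index into the list of same-length dictionary words."""
    w_n = sum(1 for w in words if len(w) == len(word))
    return math.log2(w_n)

word = "wine"
print(f"naive: {naive_bits(word):.1f} bits, "
      f"dictionary: {dictionary_bits(word, dictionary):.1f} bits")
```

With this toy dictionary, "wine" costs about 18.8 bits under the naive code but only about 2.3 bits as an index among the five four-letter words, and the gap grows as further constraints shrink the candidate list.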
By making the compressor smarter, it is possible to do even better. A smart compressor should know that the word "of" is usually followed by a noun. So instead of looking up all the $N$-letter words, the compressor could restrict the search to nouns only. This cuts down the number of possibilities even further, saving more bits. An even smarter compressor would know that in the phrase "bottle of X", the word X usually denotes a liquid. If it had an enhanced dictionary containing information about various properties of nouns, it could restrict the search to $N$-letter nouns that represent liquids. Even better results could be obtained by noticing that the bottle is purchased at a liquor store, and so probably contains some kind of alcohol. This trick would require the enhanced dictionary to contain annotations indicating that words such as "wine", "beer", and "vodka" are types of alcoholic beverages. It may be possible to do even better by analyzing the surrounding text: the word list may be narrowed even further if the text indicates that John is fond of brandy, or that his wife is using a recipe that calls for vodka. Of course, these more advanced schemes are far beyond the current state of the art in natural language processing, but they indicate the wide array of techniques that can, in theory, be brought to bear on the problem.

1.4.3 Visual Manhattan Project

Consider constructing a target database by mounting video cameras on the dashboards of a number of New York City taxi cabs, and recording the resulting video streams. Owing to the vivid visual environment of New York City, such a database would exhibit an immense amount of complexity and variation. Several aspects of that complexity could then be analyzed and studied in depth.

One interesting source of variation in the video would come from the pedestrians. To achieve good compression rates for the pixels representing pedestrians, it would be necessary to develop theories describing the appearance of New Yorkers. These theories would need to include details about clothing, ethnicity, facial appearance, hair style, walking style, and the relationships between these variables. A truly sophisticated theory of pedestrians would need to take into account time and place: one is quite likely to observe a suited investment banker in the financial district on a weekday afternoon, but quite unlikely to observe such a person in the Bronx in the middle of the night.

Another source of variation would come from the buildings and storefronts of the city. A first step toward achieving a good compression rate for these pixels would be to construct a three-dimensional model of the city. Such a model could be used not only to determine the location from which an image frame was taken, but also to predict the next frame in the sequence. For example, the model could be used to predict that, if a picture is taken at the corner of 34th Street and Fifth Avenue, the Empire State Building will feature very prominently. Notice that a naïve representation of the 3D model would itself require a large number of bits to specify, so even more savings can be achieved by compressing the model itself. This can be done by analyzing the appearance of typical building surfaces such as brick, concrete, and glass. This type of research might find common ground with the field of architecture, and lead to productive interdisciplinary investigations.

A third source of variation would come from the other cars. Analyzing this source of variation would lead to an investigation very similar to the roadside video camera inquiry mentioned above.
Indeed, if the roadside video researchers are successful, it should be possible for the taxi cab video researchers to reuse many of their results. In this way, researchers can proceed in a virtuous circle, where each new advance facilitates the next line of study.

1.5 Sampling and Simulation

Sampling is a technique whereby one uses a statistical model to generate a data set that is "typical" of it. For example, imagine one knows that the distribution of heights in a certain population is a Gaussian with a mean of 175 cm and a standard deviation of 10 cm. Then by sampling from a Gaussian distribution with these parameters, one obtains a set of numbers similar to what might be observed if some actual measurements were made. Most of the data would cluster in the 165-185 cm range, and it would be extremely rare to observe a sample larger than 205 cm.

The idea of sampling suggests a useful technique for determining the quality of a statistical model: one samples from the model, and compares the sample data to the real data. If the sample data looks nothing like the real data, then there is a flaw in the model. In the case of one-dimensional numerical data this trick is not very useful. But if the data is complex and high-dimensional, and humans have a good understanding of its real structure, the technique can be quite powerful. As an example, consider the following two batches of pseudo-words:

a abangivesery ad allars ambed amyorsagichou an and anendouathin anth ar as at ate atompasey averean cath ce d dea dr e ed eeaind eld enerd ens er evedof fod fre g gand gho gisponeshe greastoreta har has haspy he heico ho ig iginse ill ilyo in ind io is ite iter itwat ju k le lene lilollind lliche llkee ly mang me mee mpichmm n nd nder ng ngobou nif nl noved o ond onghe oounin oreengst otaserethe oua ptrathe r rd re reed reroved sern sinttlof suikngmm t tato tcho te th the toungsshes ver wit y ythe

a ally anctyough and andsaid anot as aslatay astect be beeany been bott bout but camed chave comuperain deas dook ed even y fel filear firgut for fromed gat gin give givesed got ha hard he hef her heree hilpte hoce hof ierty imber in it jor like lo lome lost mader mare mise moread od of om ome onertelf our out over owd pass put qu rown says seectusier seeked she shim so soomereand sse such tail the thingse tite to tor tre tro uf ughe umily upeeperlyses upoid was wat we were wers whith wird wirt with wor

These words were created by sampling from two different models $P(\alpha_i | \alpha_{i-1}, \ldots, \alpha_1)$ of the conditional probability of a letter given the history of preceding letters. The variable $\alpha_i$ stands for the $i$th letter of the word. To produce a word, one obtains the first letter by sampling from the unconditional distribution $P(\alpha_1)$. Then one samples from $P(\alpha_2 | \alpha_1)$ to produce the second letter, and so on. A special word-ending character is added to the alphabet, and when this character is drawn, the word is complete. The two models were both constructed using a large corpus of English text. The first model is a simplistic bigram model, in which the probability of a letter depends only on the immediately preceding letter. The second model is an enhanced version of the bigram model, which uses a refined statistical characterization of English words that incorporates, for example, the fact that it is very unlikely for a word to have no vowel.
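A sampler of this kind takes only a few lines to write. The sketch below uses a toy hand-written transition table in place of probabilities estimated from a real corpus; "$" plays the role of the special word-ending character, and "^" is an assumed start-of-word marker introduced for convenience.

```python
# A minimal bigram word sampler with a toy transition table.
import random

# P(next letter | current letter); "^" marks the word start, "$" the word end.
bigram = {
    "^": {"t": 0.5, "a": 0.5},
    "t": {"h": 0.6, "o": 0.2, "$": 0.2},
    "h": {"e": 0.7, "a": 0.3},
    "e": {"$": 0.6, "r": 0.4},
    "a": {"n": 0.5, "t": 0.3, "$": 0.2},
    "n": {"d": 0.5, "$": 0.5},
    "o": {"$": 1.0},
    "r": {"$": 1.0},
    "d": {"$": 1.0},
}

def sample_word():
    """Draw letters one at a time until the word-ending character appears."""
    word, letter = "", "^"
    while True:
        choices, weights = zip(*bigram[letter].items())
        letter = random.choices(choices, weights=weights)[0]
        if letter == "$":
            return word
        word += letter

print([sample_word() for _ in range(10)])
```

Replacing the toy table with counts estimated from a large corpus yields samplers like the two that produced the word batches above; the enhanced model differs only in conditioning on richer features of the prefix.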
Most people will agree that the words from the second set are more similar to real English words (indeed, several of them are real words). This perceptual assessment justifies the conclusion that the second model is in some sense superior to the first. Happily, it turns out that the second model also achieves a better compression rate than the first, so the qualitative similarity principle agrees with the quantitative compression principle. While the second model is better than the first, it still contains imperfections. One such imperfection relates to the word "sse". The double-s pattern is common in English words, but it is never used to begin a word. It should be possible to achieve improved compression rates by correcting this deficiency in the model.

All compressors implicitly contain a statistical model, and it is easy to sample from this model. To do so, one simply generates a random bit string and feeds it into the decoder. Unless the decoder is trivially suboptimal, it will map any string of bits to a legitimate outcome in the original data space. This perspective provides a nice interpretation of what compression means: an ideal encoder maps real data to perfectly random bit strings, and the corresponding decoder maps random bit strings to real data.

1.5.1 Veridical Simulation Principle of Science

Modern video games often attempt to illustrate scenes involving complex physical processes, such as explosions, light reflections, or collisions between nonrigid bodies (e.g., football players). In order to make these scenes look realistic, video game developers need to include "physics engines" in their games. A physics engine is a program that simulates various processes using the laws of physics. If the physics used in the simulators did not correspond to real physics, the scenes would look unrealistic: the colliding players would fall too slowly, or the surface of a lake would not produce an appropriate reflection. This implies that there is a connection between scientific theories and veridical simulation.

Can this principle be generalized? Suspend disbelief for a moment and imagine that, perhaps as a result of patronage from an advanced alien race, humans had obtained computers before the development of physics. Then scientists could conduct a search for a good theory of mechanics using the following method. First, they would write down a new candidate theory. Then they would build a simulator based on the theory, and use the simulator to generate various scenes, such as athletes jumping, rocks colliding in mid-air, and water spurting from fountains. The new theory would be accepted and the old champion discarded if the former produced more realistic simulations than the latter.

As a more plausible example, consider using the simulation principle to guide an inquiry into the rules of grammar and linguistics. Here the researchers write down candidate theories of linguistics, and use the corresponding simulators to generate sentences. A new theory is accepted if the sentences it generates are more realistic and natural than those produced by the previous champion theory. This is actually very similar to Chomsky's formulation of the goal of generative grammar; see Chapter 4 for further discussion.

This notion of science appears to meet many of the requirements of empirical science discussed previously in the chapter.
It provides a solution to the Problem of Demarcation: a theory is scientific if it can be used to build a simulation program for a particular phenomenon. It gives scientists a way to make decisive theory comparisons, allowing them to search efficiently through the space of theories. It involves a kind of Circularity Commitment: one develops theories of a certain phenomenon in order to be able to construct convincing simulations of that same phenomenon. Sophie could plausibly have answered the shaman's critique of physics by demonstrating that a simulator based on Newtonian mechanics produces more realistic image sequences than one based on shamanic revelation.

In comparison to the compression principle, the veridical simulation principle has one obvious disadvantage: theory comparisons depend on qualitative human perception. If the human observers have no special ability to judge the authenticity of a particular simulation, the theory comparisons will become noisy and muddled. The method may work for things like basic physics, text, speech, and natural images, because humans have intimate knowledge of these things. But it probably will not work for phenomena which humans do not encounter in their everyday lives.

The advantage of the simulation principle compared to the compression principle is that it provides an indication of where and in what way a model fails to capture reality. The word sampling example above showed how the model failed to capture the fact that real English words do not start with a double-s. If a model of visual reality were used to generate images, the unrealistic aspects of the resulting images would indicate the shortcomings of the model. For example, if a certain model does not handle shadows correctly, this will become obvious when it produces an image of a tree that casts no shade. The compression principle does not provide this kind of indication. For this reason, the simulation principle can be thought of as a natural complement to the compression principle, one that researchers can use to find out where to look for further progress.

Another interesting aspect of the veridical simulation principle is that it can be used to define a challenge similar to the Turing Test. In this challenge, researchers attempt to build simulators that can produce samples veridical enough to fool humans into thinking they are real. The outcome of the contest is determined by showing a human judge two data objects, one real and one simulated. The designers of the system win if the human is unable to tell which object is real. To see the difficulty and interest of this challenge, consider using videos obtained in the course of the Visual Manhattan Project inquiry of Section 1.4.3 as the real-world component. The statistical model of the video data would then need to produce samples that are indistinguishable from real footage of the streets of New York City. The model would thus need to contain all kinds of information and detail relating to the visual environment of the city, such as the layout and architecture of the buildings, and the fashion sense and walking styles of the pedestrians. This is, of course, exactly the kind of information needed to compress the video data. This observation provides further support for the intuitive notion that while the simulation principle and the compression principle are not identical, they are at least strongly aligned.
To see the difficulty and interest of this challenge, consider using videos obtained in the course of the Visual Manhattan Project inquiry of Section 1.4.3 as the real world component. The statistical model of the video data would then need to produce samples that are indistinguishable from real footage of the streets of New York City. The model would thus need to contain all kinds of information and detail relating to the visual environment of the city, such as the layout and architecture of the buildings, and the fashion sense and walking style of the pedestrians. This is, of course, exactly the kind of information needed to compress the video data. This observation provides further support for the intuitive notion that while the simulation principle and the compression principle are not identical, they are at least strongly aligned.

It will require an enormous level of sophistication to win the simulation game, especially if the judges are long term inhabitants of New York. A true New Yorker would be able to spot very minor deviations from veridicality, related to things like the color of the sidewalk carts used by the pretzel and hot dog vendors, or to subtle changes in the style of clothing worn by denizens of different parts of the city. A true New Yorker might also be able to spot a fake video if it failed to include an appropriate degree of strangeness. New York is no normal place, and a real video stream will reflect that by showing celebrities, business executives, beggars, transvestites, fashion models, inebriated artists, and so on. In spite of this difficulty, the alignment between the compression and simulation principles suggests that there is a simple way to make systematic progress: get more and more video data, and improve the compression rate.

1.6 Comparison to Physics

Physics is the exemplar of empirical science, and many other fields attempt to imitate it. Some researchers have deplored the influence of so-called "physics envy" on fields like computer vision and artificial intelligence [12]. This book argues that there is nothing wrong with imitating physics. Instead, the problem is that previous researchers failed to understand the essential character of physics, and instead copied its superficial appearance. The superficial appearance of physics is its use of sophisticated mathematics; the essential character of physics is its obsession with reality. A physicist uses mathematics for one and only one reason: it is useful in describing empirical reality. Just as physicists do, comperical researchers adopt as their fundamental goal the search for simple and accurate descriptions of reality. They will use mathematics, but only to the extent that it is useful in achieving the goal.

Another key similarity between physics and comperical science involves the justification of research questions. Some skeptics may accept that CRM research is legitimate science, but believe that it will be confined to a narrow set of technical topics. After all, the CRM defines only one problem: large scale lossless data compression. But notice that physics also defines only one basic problem: given a particular physical configuration, predict its future evolution. Because there is a vast number of possible configurations of matter and energy, this single question is enormously productive, justifying research into such diverse topics as black holes, superconductivity, quantum dots, Bose-Einstein condensates, the Casimir effect, and so on. Analogously, the single question of comperical science justifies a wide range of research, due to the enormous diversity of empirical regularities that can be found in databases of natural images, text, speech, music, etc. The fact that a single question provides a parsimonious justification for a wide range of research is actually a key advantage of the philosophy.

Both physics and comperical science require candidate theories to be tested against empirical observation using hard, quantitative evaluation methods. However, there is an important difference in the way the theory-comparisons work. Physical theories are very specific. In physics, any new theory must agree with the current champion in a large number of cases, since the current champion has presumably been validated on many configurations.
To adjudicate a theory contest, researchers must find a particular configuration in which the two theories make opposing predictions, and then run the appropriate experiment. In comperical science, the predictions made by the champion theory are neither correct nor incorrect; they are merely good. To unseat the champion theory, it is sufficient for a rival theory to make better predictions on average.

Chapter 2

Compression and Learning

2.1 Machine Learning

Humans have the ability to develop amazing skills relating to a very broad array of activities. However, almost without exception, this competence is not innate, and is achieved only as a result of extended learning. The field of machine learning takes this observation as its starting point. The goal of the field is to develop algorithms that improve their performance over time by adapting their behavior based on the data they observe.

The field of machine learning appears to have achieved significant progress in recent years. Researchers produce a steady stream of new learning systems that can recognize objects, analyze facial expressions, translate documents from one language to another, or understand speech. In spite of this stream of new results, learning systems still have frustrating limitations. Automatic translation systems often produce gibberish, and speech recognition systems often cause more annoyance than satisfaction. One particularly glaring illustration of the limits of machine learning came from a "racist" camera system that was supposed to detect faces, but worked only for white faces, failing to detect black ones [106]. The gap between the enormous ambitions of the field and its present limitations indicates that there is some mountainous conceptual barrier impeding progress. Two views can be articulated regarding the nature of this barrier.

According to the first view, the barrier is primarily technical in nature. Machine learning is on a promising trajectory that will ultimately allow it to achieve its long sought goal. The field is asking the right questions; success will be achieved by improving the answers to those questions. The limited capabilities of current learning systems reflect limitations or inadequacies of modern theory and algorithms. While the modern mathematical theory of learning is advanced, it is not yet advanced enough. In time, new algorithms will be found that are far more powerful than current algorithms such as AdaBoost and the Support Vector Machine [37, 118]. The steady stream of new theoretical results and improved algorithms will eventually yield a sort of grand unified theory of learning, which will in turn guide the development of truly intelligent machines.

In the second view, the barrier is primarily philosophical in nature. In this view, progress in machine learning is tending toward a sort of asymptotic limit. The modern theory of learning provides a comprehensive answer to the problem of learning as it is currently formulated. Algorithms solve the problems for which they are designed nearly as well as is theoretically possible. The demonstration of an algorithm that provides an improved convergence rate or a tighter generalization bound may be interesting from an intellectual perspective, and may provide slightly better performance on the standard problems. But such incremental advances will never produce true intelligence.
To achieve intelligence, machine learning systems must make a discontinuous leap to an entirely new level of performance. The current mindset is analogous to that of researchers in the 1700s who attempted to expedite ground transportation by breeding faster horses, when they should actually have been searching for a qualitatively different mode of transportation. The problem, then, is in the philosophical foundations of the field, in the types of questions considered by its practitioners and their philosophical mindset. If this view is true, then to make further progress in machine learning, it is necessary to formulate the problem of learning in a new way. This chapter presents arguments in favor of the second view.

2.1.1 Standard Formulation of Supervised Learning

There are two primary modes of statistical learning: the supervised mode and the unsupervised mode. The present discussion will focus primarily on the former; the latter is discussed in Appendix B. The supervised version can be understood by considering a typical example of what it can do. Imagine one wanted to build a face detection system capable of determining if a digital photo contains an image of a face. To use a supervised learning method, the researcher must first construct a labeled dataset, which is made up of two parts. The first part is a set of $N$ images $X = \{x_1, x_2 \ldots x_N\}$. The second part is a set of binary labels $Y = \{y_1, y_2 \ldots y_N\}$, which indicate whether or not a face is present in each image. Once this database has been built, the researcher invokes the learning algorithm, which attempts to obtain a predictive rule $h(\cdot)$ such that $h(x) = y$. Below, this procedure is referred to as the "canonical" form of the supervised learning problem. Many applications can be formulated in this way, as shown in the following list (a minimal code sketch of the setup follows the list):

• Document classification: the $x_i$ data are the documents, and the $y_i$ data are category labels such as "sports", "finance", "political", etc.

• Object recognition: the $x_i$ data are images, and the $y_i$ data are object categories such as "chair", "tree", "car", etc.

• Electoral prediction: each $x_i$ is a package of information relating to current political and economic conditions, and the $y_i$ is a binary label which is true if the incumbent wins.

• Marital satisfaction: each $x_i$ is a package of vital statistics relating to a particular marriage (frequency of sex, frequency of argument, religious involvement, education levels, etc.) and the corresponding $y_i$ is a binary label which is true if the marriage ends in divorce.

• Stock market prediction: each $x_i$ is a set of economic indicators such as interest rates, exchange rates, and stock prices for a given day; the $y_i$ is the change in value of a particular stock on the next day.
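The sketch below shows the canonical setup on toy data; the feature vectors, labels, and threshold-rule hypothesis class are all illustrative assumptions, not a serious learning algorithm.

    # A minimal sketch of the canonical setup (all data and the hypothesis
    # class are illustrative): X holds feature vectors, Y binary labels,
    # and "learning" picks the best threshold rule h(x) = [x[0] > t].
    X = [[0.2], [0.9], [0.4], [0.7], [0.1], [0.8]]
    Y = [0, 1, 0, 1, 0, 1]

    def learn_threshold(X, Y):
        best_t, best_err = None, float("inf")
        for t in sorted(x[0] for x in X):
            err = sum(int(x[0] > t) != y for x, y in zip(X, Y))
            if err < best_err:
                best_t, best_err = t, err
        return lambda x: int(x[0] > best_t)

    h = learn_threshold(X, Y)
    print([h(x) for x in X])  # reproduces Y on the training data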
2.1.2 Simplified Description of Learning Algorithms

For readers with no background in machine learning, the following highly simplified description should provide a basic understanding of the basic ideas. One starts with a system $S$ that performs some task, and a method for evaluating the performance $S$ provides on the task. Let this evaluation function be denoted as $E$, and $E[S]$ be the performance of the system $S$. In terms of the canonical task mentioned above, the system is the predictive rule $h(\cdot)$, and the evaluation function is just the squared difference between the predictions and the real data:

$$E[h] = \sum_i (h(x_i) - y_i)^2$$

A key property of the system is that it be mutable. If a system $S$ is mutable, then a small perturbation will produce a new system $S'$ that behaves in nearly the same way as $S$. This mutability requirement prevents one from defining $S$ to be, for example, the code of a computer program, since a slight random change to a program will usually break it completely. To construct systems that can withstand these minor mutations without suffering catastrophic failures, researchers often construct the system by introducing a set of numerical parameters $\theta$. If the behavior of $S(\theta)$ changes smoothly with changes in $\theta$, then small changes to the system can be made by making small changes to $\theta$. There are, of course, other ways to construct mutable systems. Given a mutable system and an evaluation function, the following procedure can be used to search for a high-performance system (a code sketch follows below):

1. Begin by setting $S = S_0$, where $S_0$ is some default setup (which can be naïve).

2. Introduce a small change to $S$, producing $S'$.

3. If $E[S'] > E[S]$, keep the change by setting $S = S'$. Otherwise, discard the modified version. (For an error measure like the squared difference above, the inequality is reversed.)

4. Return to step #2.

Many machine learning algorithms can be understood as refined versions of the above process. For example, the backpropagation algorithm for the multilayer perceptron uses the chain rule of calculus to find the derivative of $E(S(\theta))$ with respect to $\theta$ [98]. Many reinforcement learning algorithms work by making smart changes to a policy that depends on the parameters $\theta$ [111]. Genetic algorithms, which are inspired by the idea of natural selection, also roughly follow the process outlined above.
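The following sketch implements the four-step procedure for a parameterized system $S(\theta)$; the task (line fitting), step size, and iteration count are illustrative assumptions. Since the example evaluation function above is an error to be minimized, the sketch maximizes its negative.

    import random

    # A sketch of the four-step search procedure (illustrative names and
    # constants). The evaluation function is treated as a score to be
    # maximized, so the squared-error example above is negated.
    def hill_climb(evaluate, theta, steps=20_000, scale=0.05):
        best = evaluate(theta)
        for _ in range(steps):
            # Step 2: perturb the parameters slightly to produce S'.
            candidate = [t + random.gauss(0, scale) for t in theta]
            score = evaluate(candidate)
            # Step 3: keep the change only if it improves the score.
            if score > best:
                theta, best = candidate, score
            # Step 4: return to step 2.
        return theta

    # Example: fit y = 2x + 1 by hill climbing on the negated squared error.
    data = [(x, 2 * x + 1) for x in range(10)]
    score = lambda th: -sum((th[0] * x + th[1] - y) ** 2 for x, y in data)
    print(hill_climb(score, [0.0, 0.0]))  # approaches [2.0, 1.0]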
2.1.3 Generalization View of Learning

Machine learning researchers have developed two conceptual perspectives by which to approach the canonical task. The first and more popular perspective is called the Generalization View. Here the goal is to obtain, on the basis of the limited $N$-sample data set, a model or predictive rule that works well for new, previously unseen data samples. The Generalization View is attractive for obvious practical purposes: in the case of the face detection task, for example, the model resulting from a successful learning process can be used in a system which requires the ability to detect faces in previously unobserved images (e.g. a surveillance application).

The key challenge of the Generalization View is that the real distribution generating the data is unknown. Instead, one has access to the empirical distribution defined by the observed data samples. In the early days of machine learning research, many practitioners thought that the sufficient condition for a model to achieve good performance on the real distribution was that it achieved good empirical performance: it performed well on the observed data set. However, they often found that their models would perform very well on the observed data, but fail completely when applied to new samples. There were a variety of reasons for this failure, but the main cause was the phenomenon of overfitting. Overfitting occurs when a researcher applies a complex model to solve a problem with a small number of data samples. Figure 2.1 illustrates the problem of overfitting.

[Figure 2.1: Illustration of the idea of model complexity and overfitting. In the limited data regime depicted on the left, the line model should be preferred to the curve model, because it is simpler. In the large data regime, however, the polynomial model can be justified.]

Intuitively, it is easy to see that when there are only five data points, the complex curve model should not be used, since it will probably fail to generalize to any new points. The linear model will probably not describe new points exactly, but it is less likely to be wildly wrong. While intuition favors the line model, it is not immediately obvious how to formalize that intuition: after all, the curve model achieves better empirical performance (it goes through all the points).

The great conceptual achievement of statistical learning is the development of methods by which to overcome overfitting. These methods have been formulated in many different ways, but all articulations share a common theme: to avoid overfitting, one must penalize complex models. Instead of choosing a model solely on the basis of its empirical performance, one must optimize a tradeoff between the empirical performance and model complexity. In terms of Figure 2.1, the curve model achieves excellent empirical performance, but only because it is highly complex. In contrast, the line model achieves a good balance of performance and simplicity. For that reason, the line model should be preferred in the limited-data regime.

In order to apply the complexity penalty strategy, the key technical requirement is a method for quantifying the complexity of a model. Once a suitable expression for a model's complexity is obtained, some further derivations yield a type of expression called a generalization bound. A generalization bound is a statement of the following form: if the empirical performance of the model is good, and the model is not too complex, then with high probability its real performance will be only slightly worse. The caveat "with high probability" can never be done away with, because there is always some chance that the empirical data is simply a bizarre or unlucky sample of the real distribution. One might conclude with very high confidence that a coin is biased after observing 1000 heads in a row, but one could never be completely sure.

While most treatments of model complexity and generalization bounds require sophisticated mathematics, the following simple theorem can illustrate the basic ideas. The theorem can be stated in terms of the notation used for the canonical task of supervised learning mentioned above. Let $C$ be a set of hypotheses or rules that take a raw data object $x$ as an argument and output a prediction $h(x)$ of its label $y$. In terms of the face detection problem, $x$ would be an image, and $y$ would be a binary flag indicating whether the image contains a face. Assume it is possible to find a hypothesis $h^* \in C$ that agrees with all the observed data:

$$h^*(x_i) = y_i \quad i = 1 \ldots N$$

Now select some $\epsilon$ and $\delta$ such that the following inequality holds:

$$N \geq \frac{1}{\epsilon} \log\left(\frac{|C|}{\delta}\right)$$

Then with probability $1 - \delta$, the error rate of the hypothesis will be at most $\epsilon$ when measured against the real distribution. Abstractly, the theorem says that if the hypothesis class is not too large compared to the number of data samples, and some element achieves good empirical performance, then with high probability its performance on the real (full) distribution will be not too much worse.
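As a quick numeric illustration, the bound can be solved for the number of samples it demands; the class size, tolerance, and confidence below are arbitrary illustrative values.

    import math

    # Solve the bound for N: samples needed so that, with probability
    # 1 - delta, a hypothesis consistent with the data has real error
    # rate at most eps. All values are illustrative.
    def samples_needed(class_size, eps, delta):
        return math.ceil((1 / eps) * math.log(class_size / delta))

    print(samples_needed(10**6, eps=0.05, delta=0.01))  # 369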
The following informal proof of the theorem may illuminate the core concept. To understand the theorem, imagine you are searching through a barrel of apples (the hypotheses), looking for a good one. Most of the apples are "wormy": they have a high error rate on the real distribution. The goal is to find a ripe, tasty apple: one that has a low error rate on the real distribution. Fortunately, most of the wormy apples can be discarded because they are visibly old and rotten, meaning they make errors on the observed data. The problem is that there might be a "hidden worm" apple that looks tasty - it performs perfectly on the observed data - but is in fact wormy. Define a wormy apple as one that has real error rate larger than $\epsilon$. Now ask the question: if an apple is wormy, what is the probability it looks tasty? It's easy to find an upper bound for this probability:

$$P(\text{hidden worm}) \leq (1 - \epsilon)^N$$

This is because, if the apple is wormy, the probability of not making a mistake on one sample is $\leq (1 - \epsilon)$, so the probability of not making a single mistake on $N$ samples is $\leq (1 - \epsilon)^N$. Now the question is: what is the probability that there are no hidden worms in the entire hypothesis class? Let $HW_k(\epsilon)$ be the event that the $k$th apple is a hidden worm. Then the probability that there are no hidden worms in the hypothesis class is:

$$\begin{aligned}
P(\text{no hidden worms}) &= P(\neg[HW_1(\epsilon) \vee HW_2(\epsilon) \vee HW_3(\epsilon) \ldots]) \\
&= 1 - P([HW_1(\epsilon) \vee HW_2(\epsilon) \vee HW_3(\epsilon) \ldots]) \\
&\geq 1 - \sum_k P(HW_k(\epsilon)) \\
&= 1 - |C| \, P(HW(\epsilon)) \\
&\geq 1 - |C|(1 - \epsilon)^N
\end{aligned}$$

The first step is true because $P(\neg A) = 1 - P(A)$, the second step is true because $P(A \vee B) \leq P(A) + P(B)$, the third step is true because there are $|C|$ hypotheses, and the final step is just a substitution of the bound on $P(\text{hidden worm})$ above. Then the result follows by letting $\delta = |C|(1 - \epsilon)^N$, noting that $\log(1 - \epsilon) \approx -\epsilon$, and rearranging terms.

A crucial point about the proof is that it makes no guarantee whatever that a good hypothesis (tasty worm-free apple) will actually appear. The proof merely says that, if the model class is small and the other values are reasonably chosen, then it is unlikely for a hidden worm hypothesis to appear. If the probability of a hidden worm is low, and by chance a shiny apple is found, then it is probable that the shiny apple is actually worm-free.

A far more sophisticated development of the ideas of model complexity and generalization is due to the Russian mathematician Vladimir Vapnik [116]*. In Vapnik's formulation the goal is to minimize the real (generalization) risk $R$, which can be the error rate or some other function. Vapnik derived a sophisticated model complexity term called the VC dimension, and used it to prove several generalization bounds. A typical bound is:

$$R(h_i) \leq R_{emp}(h_i) + \frac{\log(|C'|) - \log(\delta)}{N}\left(1 + \sqrt{1 + \frac{2 N R_{emp}(h_i)}{\log(|C'|) - \log(\delta)}}\right)$$

where $R(h_i)$ is the real risk of hypothesis $h_i$ and $R_{emp}(h_i)$ is the empirical risk, calculated from the observed data. The bound, which holds for all hypotheses simultaneously, indicates the conditions under which the real risk will not exceed the empirical risk by too much. As above, the bound holds with probability $1 - \delta$, and $N$ is the number of data samples. The term $\log(|C'|)$ is a complexity measure derived from the VC dimension, and plays a conceptually similar role to the simple $\log(|C|)$ term in the previous theorem. Vapnik's complex inequality shows the same basic idea as the simple theorem above: the real performance will be good if the empirical performance is good and the log size of the hypothesis class is small in comparison with the number of data samples.
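To get a rough feel for the numbers, the following sketch evaluates the right-hand side of the bound as reconstructed above; the inputs are illustrative values, and the formula should be checked against Vapnik's original statement.

    import math

    # Evaluate the right-hand side of the bound as reconstructed above;
    # log_C stands for log(|C'|), and all inputs are illustrative.
    def risk_bound(r_emp, log_C, delta, N):
        eps = (log_C - math.log(delta)) / N
        return r_emp + eps * (1 + math.sqrt(1 + 2 * r_emp / eps))

    # Plentiful data gives a tight bound; scarce data gives a vacuous one.
    print(risk_bound(0.05, log_C=20.0, delta=0.01, N=10_000))  # about 0.07
    print(risk_bound(0.05, log_C=20.0, delta=0.01, N=100))     # about 0.59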
Proofs of theorems in VC theory also use a similar strategy: show that if the model class is small, it is unlikely to include a "hidden worm" hypothesis which has low empirical risk but high real risk. Also, none of the VC theory bounds guarantee that a good hypothesis (low $R_{emp}(h_i)$) will actually be found.

The problem of overfitting is easily understood in the light of these generalization theorems. A naïve approach to learning attempts to minimize the empirical risk without reference to the complexity of the model. The theorems show that a low empirical risk, by itself, does not guarantee low real risk. If the model complexity terms $\log(|C|)$ and $\log(|C'|)$ are large compared to the number of samples $N$, then the bounds will become too loose to be meaningful. In other words, even if the empirical risk is reduced to a very small quantity, the real risk may still be large. The intuition here is that because such a large number of hypotheses was tested, the fact that one of them performs well on the empirical data is meaningless. If the hypothesis class is very large, then some hypotheses can be expected to perform well merely by chance.

The above discussion seems to indicate that complexity penalties actually apply to model classes, not to individual models. There is an important subtlety here. In both of the generalization theorems mentioned above, all elements of the model class were treated equally, and the penalty depended only on the size of the class. However, it is also reasonable to apply different penalties to different elements of a class. Say the class $C$ contains two subclasses $C_a$ and $C_b$. Then if $|C_b| > |C_a|$, hypotheses drawn from $C_b$ must receive a larger penalty, and therefore require relatively better empirical performance in order to be selected. For example, in terms of Figure 2.1, one could easily construct an aggregate class that includes both lines and polynomials. Then the polynomials would receive a larger penalty, because there are more of them.

While more complex models must receive larger penalties, they are never prohibited outright. In some cases it very well may be worthwhile to use a complex model, if the model is justified by a large amount of data and achieves good empirical performance. This concept is illustrated in Figure 2.1: when there are hundreds of points that all fall on the complex curve, then it is entirely reasonable to prefer it to the line model. The generalization bounds also express this idea, by allowing $\log(|C|)$ or $\log(|C'|)$ to be large if $N$ is also large.
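One concrete way to realize such subclass-dependent penalties, offered as an illustrative sketch rather than the book's prescription, is to spend one bit naming the subclass and then $\log_2$ of its size naming the element within it:

    import math

    # Subclass-dependent penalties in bits (illustrative sizes): one bit
    # names the subclass, then log2(size) bits name the element within it,
    # so hypotheses from the larger subclass carry a larger penalty.
    size_a, size_b = 10, 10**6  # e.g. lines vs. polynomials

    penalty_a = 1 + math.log2(size_a)  # about 4.3 bits
    penalty_b = 1 + math.log2(size_b)  # about 20.9 bits
    print(penalty_a, penalty_b)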
2.1.4 Compression View

The second perspective on the learning problem can be called the Compression View. The goal here is to compress a data set to the smallest possible size. This view is founded upon the insight, drawn from information theory, that compressing a data set to the smallest possible size requires the best possible model of it. The difficulty of learning comes from the fact that the bit cost of the model used to encode the data must itself be accounted for. In the statistics and machine learning literature, this idea is known as the Minimum Description Length (MDL) principle [120, 95].

The motivation for the MDL idea can best be seen by contrasting it to the Maximum Likelihood Principle, one of the foundational ideas of statistical inference. Both principles apply to the problem of how to choose the best model $M^*$ out of a class $\mathcal{M}$ to use to describe a given data set $D$. For example, the model class $\mathcal{M}$ could be the set of all Gaussian distributions, so that an element $M$ would be a single Gaussian, defined by a mean and variance. The Maximum Likelihood Principle suggests choosing $M^*$ so as to maximize the likelihood of the data given the model:

$$M^* = \arg\max_{M \in \mathcal{M}} P(D \mid M)$$

This principle is simple and effective in many cases, but it can lead to overfitting. To see how, imagine a data set made up of 100 numbers $\{x_1, x_2, \ldots, x_{100}\}$. Let the class $\mathcal{M}$ be the set of Gaussian mixture models. A Gaussian mixture model is just a sum of normal distributions with different means and variances. Now, one simple model for the data could be built by finding the mean and variance of the $x_i$ data and using a single Gaussian with the given parameters. A much more complex model can be built by taking a sum of 100 Gaussians, each with mean equal to some $x_i$ and near-zero variance. Obviously, this "comb" model is worthless: it has simply overfit the data and will fail badly when a new data sample is introduced. But it produces a higher likelihood than the single Gaussian model, and so the Maximum Likelihood Principle suggests it should be selected. This indicates that the principle contains a flaw.

The Minimum Description Length principle approaches the problem by imagining the following scenario. A sender wishes to transmit a data set $\{x_1, x_2, \ldots, x_{100}\}$ to a receiver. The two parties have agreed in advance on the model class $\mathcal{M}$. To do the transmission, the sender chooses some model $M^* \in \mathcal{M}$ and sends enough information to specify $M^*$ to the receiver. The sender then encodes the $x_i$ data using a code based on $M^*$. The best choice for $M^*$ minimizes the net codelength required:

$$M^* = \arg\min_{M \in \mathcal{M}} \left[ L(M) + L(D \mid M) \right] = \arg\min_{M \in \mathcal{M}} \left[ L(M) - \log P(D \mid M) \right]$$

where $L(M)$ is the bit cost of specifying $M$ to the receiver, and $L(D \mid M) = -\log P(D \mid M)$ is the cost of encoding the data given the model. If it were not for the $L(M)$ term, the MDL principle would be exactly the same as the Maximum Likelihood Principle, since maximizing $P(D \mid M)$ is the same as minimizing $-\log P(D \mid M)$. The use of the $L(M)$ term penalizes complex models, which allows users of the MDL principle to avoid overfitting the data. In the example mentioned above, the Gaussian mixture model with 100 components would be strongly penalized, since the sender would need to transmit a mean/variance parameter pair for each component.

The MDL principle can be applied to the canonical task by imagining the following scenario. A sender has the image database $X$ and the label database $Y$, and wishes to transmit the latter to a receiver. A crucial and somewhat counterintuitive point is that the receiver already has the image database $X$. Because both parties have the image database, if the sender can discover a simple relationship between the images and the labels, he can exploit that relationship to save bits. If a rule can be found that accurately predicts $y_i$ given $x_i$, that is to say if a good model $P(Y \mid X)$ can be obtained, then the label data can be encoded using a short code. However, in order for the receiver to be able to perform the decoding, the sender must encode and transmit information about how to build the model. More complex models will increase the total number of bits that must be sent. The best solution, therefore, comes from optimizing a tradeoff between empirical performance and model complexity.
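The comb-model pathology and its MDL cure can be checked numerically. In the following sketch, the 32-bits-per-parameter cost and the near-zero comb variance are illustrative assumptions:

    import math, random

    # MDL comparison of the two models above on 100 samples from a standard
    # normal. The 32-bits-per-parameter cost and the comb variance are
    # illustrative assumptions.
    random.seed(0)
    data = [random.gauss(0, 1) for _ in range(100)]
    BITS_PER_PARAM = 32

    def gauss_pdf(x, mu, sigma):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    def nll_bits(xs, density):
        """Codelength L(D|M) = -log2 P(D|M), ignoring discretization."""
        return sum(-math.log2(density(x)) for x in xs)

    # Model 1: a single Gaussian fit to the data (2 parameters).
    mu = sum(data) / len(data)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))
    single = 2 * BITS_PER_PARAM + nll_bits(data, lambda x: gauss_pdf(x, mu, sigma))

    # Model 2: the "comb" of 100 near-zero-variance components (200 parameters).
    comb_density = lambda x: sum(gauss_pdf(x, c, 0.01) for c in data) / len(data)
    comb = 200 * BITS_PER_PARAM + nll_bits(data, comb_density)

    # The comb wins on likelihood alone but loses badly once L(M) is counted.
    print(f"single Gaussian: {single:.0f} bits, comb model: {comb:.0f} bits")

On a run like this the single Gaussian needs a few hundred bits in total, while the comb model needs thousands, almost all of them spent on its 200 parameters.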
2.1.5 Equivalence of Views

The Compression View and the Generalization View adopt very different approaches to the learning problem. Profoundly, however, when the two different goals are formulated quantitatively, the resulting optimization problems are quite similar. In both cases, the essence of the problem is to balance a tradeoff between model complexity and empirical performance. Similarly, both views justify the intuition relating to Figure 2.1 that the linear model should be preferred in the low-data regime, while the polynomial model should be preferred in the high-data regime.

The relationship between the two views can be further understood in the context of the simple hidden worm theorem described above. As stated, this theorem belongs to the Generalization View. However, it is easy to convert it into a statement of the Compression View. A sender wishes to transmit to a receiver a database $Y$ of labels which are related to a set $X$ of raw data objects. The receiver already has the raw data $X$. The sender and receiver agree in advance on the hypothesis class $C$ and an encoding format based on it that works as follows. The first bit is a flag that indicates whether a good hypothesis $h^*$ was found. If so, the sender then sends the index of the hypothesis in $C$, using $\log_2 |C|$ bits. The receiver can then look up $h^*$ and apply it to the images $x_i$ to obtain the labels $y_i$. Otherwise, the sender encodes the labels $y_i$ normally at a cost of $N$ bits. This scheme achieves compression under two conditions: a good hypothesis $h^*$ is found, and $\log |C|$ is small compared to the number of samples $N$. These are exactly the same conditions required for generalization to hold in the Generalization View approach to the problem.
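A minimal sketch of this encoding format follows; the function name and parameter values are illustrative.

    import math

    # Codelength of the flag-bit scheme: one flag bit, then either the index
    # of a perfect hypothesis (log2|C| bits) or the N labels sent flat.
    # The parameter values are illustrative.
    def codelength(found_good_hypothesis, class_size, n_labels):
        if found_good_hypothesis:
            return 1 + math.ceil(math.log2(class_size))
        return 1 + n_labels

    # Compression is achieved exactly when a good hypothesis exists and
    # log2|C| + 1 is small compared to N: 21 bits here instead of 1001.
    print(codelength(True, class_size=2**20, n_labels=1000))
    print(codelength(False, class_size=2**20, n_labels=1000))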
This equivalence in the case of the hidden worm theorem could be just a coincidence. But in fact there are a variety of theoretical statements in the statistical learning literature suggesting that the equivalence is actually quite deep. For example, Vapnik showed that if a model class $C$ can be used to compress the label data, then the following inequality relates the achieved compression rate $K(C)$ to the generalization risk $R(C)$:

$$R(C) < 2\left(K(C)\log 2 - \frac{\log \delta}{N}\right)$$

The second term on the right, $N^{-1}\log\delta$, is for all practical cases small compared to the $K(C)$ term, so this inequality shows a very direct relationship between compression and generalization. This expression is strikingly simpler than any of the other VC generalization bounds.

There are many other theorems in the machine learning literature that suggest the equivalence of the Compression View and the Generalization View. For example, a simple result due to Blumer et al. relates the learnability of a hypothesis class to the existence of an Occam algorithm for it. In this paper, the key question of learning is whether a good approximation $h^*$ can be found of the true hypothesis $h_T$, when both functions are contained in a hypothesis class $H$. If a good approximation (low $\epsilon$) can be found with high probability (low $\delta$) using a limited number of data samples (small $N$), the class is called learnable. Functions in the hypothesis class $H$ can be specified using some finite bit string; the length of this string is the complexity of the function. To define an Occam algorithm, let the unknown true function $h_T$ have complexity $W$, and let there be $N$ samples $(x_i, h_T(x_i))$. The algorithm then produces a hypothesis $h^*$ of complexity $W^c N^\alpha$ that agrees with all the sample data, where $c \geq 1$ and $0 \leq \alpha < 1$ are constants. Because the complexity of $h^*$ grows sublinearly with $N$, a simple encoding scheme such as the one mentioned above, based on $H$ and the Occam algorithm, is guaranteed to produce compression for large enough $N$. Blumer et al. show that if an Occam algorithm exists, then the class $H$ is learnable. More complex results by the same authors are given in ?? (see section 3.2 in particular).

While the Generalization View and the Compression View may be equivalent, the latter approach has a variety of conceptual advantages. First of all, the No Free Lunch theorem of data compression indicates that no completely general compressor can ever succeed. This shows that all approaches to learning must discover and exploit special empirical structure in the problem of interest. This fact does not seem to be widely appreciated in the machine learning literature: many papers advertise methods without explicitly describing the conditions required for the methods to work. Also, because the model complexity penalty $L(M)$ can be interpreted as a prior over hypotheses, the Compression View clarifies the relationship between learning and Bayesian inference. This relationship is obscure in the Generalization View, leading some researchers to claim that learning differs from Bayesian inference in some kind of deep philosophical way.

Another significant advantage of the Compression View is that it is simply easier to think up compression schemes than it is to prove generalization theorems. For example, the Generalization View version of the hidden worm theorem requires a derivation and some modest level of mathematical sophistication to determine the conditions for success, to wit, that $\log |C|$ is small compared to $N$ and a good hypothesis $h^*$ is found. In contrast, in the Compression View version of the theorem, the requirements for success become obvious immediately after defining the encoding scheme. The equivalence between the two views suggests that a fruitful procedure for finding new generalization results is to develop new compression schemes, which will then automatically imply associated generalization bounds.
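The sublinearity claim in the Occam-algorithm discussion above is easy to check numerically; the constants below are invented for illustration.

    # An Occam hypothesis of complexity W**c * N**alpha (alpha < 1)
    # eventually costs fewer bits than the N-bit flat encoding of the
    # labels, so the scheme compresses for large enough N. The constants
    # are invented for illustration.
    W, c, alpha = 10, 1, 0.5

    for N in [10**2, 10**4, 10**6]:
        occam_bits = W**c * N**alpha
        print(N, occam_bits, occam_bits < N)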
2.1.6 Limits of Model Complexity in Canonical Task

In the Compression View, an important implication regarding model complexity limits in the canonical task is immediately clear. The canonical task is approached by finding a short program that uses the image data set $X$ to compress the label data $Y$. The goal is to minimize the net codelength of the compressor itself plus the encoded version of $Y$. This can be formalized mathematically as follows:

$$M^* = \arg\min_{M \in \mathcal{M}} \left[ L(M) - \log_2 P_M(Y \mid X) \right]$$

where $M^*$ is the optimal model, $L(M)$ is the codelength required to specify model $M$, and $\mathcal{M}$ is the model class. Now assume that $\mathcal{M}$ contains some trivial model $M_0$, and assume that $L(M_0) = 0$. The intuition of $M_0$ is that it corresponds to just sending the data in a flat format, without compressing it at all. Then, in order to justify the choice of $M^*$ over $M_0$, it must be the case that:

$$L(M^*) - \log_2 P_{M^*}(Y \mid X) < -\log_2 P_{M_0}(Y \mid X)$$

The right hand side of this inequality is easy to estimate. Consider a typical supervised learning task where the goal is to predict a binary outcome, and there are $N = 10^3$ data samples (many such problems are studied in the machine learning literature; see the review [30]). Then a dumb format for the labeled data simply uses a single bit for each outcome, for a total of $N = 10^3$ bits. The inequality then immediately implies that:

$$L(M^*) < 10^3$$

This puts an absolute upper bound on the complexity of any model that can ever be used for this problem. In practice, the model complexity must really be quite a bit lower to get good results. Perhaps the model requires 200 bits to specify and the encoded data requires 500 bits, resulting in a savings of 300 bits. But 200 bits corresponds to only 25 bytes. It should be obvious to anyone who has ever written a computer program that no model of any complexity can be specified using only 25 bytes.

2.1.7 Intrinsically Complex Phenomena

Consider the following thought experiment. Let $T = \{t_1, t_2 \ldots t_N\}$ be some database of interest, made up of a set of raw data objects. Let $\mathcal{M}$ be the set of programs that can be used to losslessly encode $T$. A program $m$ is an element of $\mathcal{M}$; its length is $|m|$. Furthermore let $L_m(T)$ be the codelength of the encoded version of $T$ produced by $m$. For technical reasons, assume also that the compressor is stateless, so that the $t_i$ can be encoded in any order, and the codelength for each object will be the same regardless of the ordering. This simply means that any knowledge of the structure of $T$ must be included in $m$ at the outset, and not learned as a result of analyzing the $t_i$. Now define: $L^*_T(x) = \min |m|$
