2006: Celebrating 75 years of AI - History and Outlook: the Next 25 Years
Jürgen Schmidhuber
TU Munich, Boltzmannstr. 3, 85748 Garching bei München, Germany &
IDSIA, Galleria 2, 6928 Manno (Lugano), Switzerland
juergen@idsia.ch - http://www.idsia.ch/~juergen

Abstract

When Kurt Gödel laid the foundations of theoretical computer science in 1931, he also introduced essential concepts of the theory of Artificial Intelligence (AI). Although much of subsequent AI research has focused on heuristics, which still play a major role in many practical AI applications, in the new millennium AI theory has finally become a full-fledged formal science, with important optimality results for embodied agents living in unknown environments, obtained through a combination of theory à la Gödel and probability theory. Here we look back at important milestones of AI history, mention essential recent results, and speculate about what we may expect from the next 25 years, emphasizing the significance of the ongoing dramatic hardware speedups, and discussing Gödel-inspired, self-referential, self-improving universal problem solvers.

(Invited contribution to the Proceedings of the "50th Anniversary Summit of Artificial Intelligence" at Monte Verità, Ascona, Switzerland, 9-14 July 2006; variant accepted for Springer's LNAI series.)

1 Highlights of AI History—From Gödel to 2006

Gödel and Lilienfeld. In 1931, 75 years ago and just a few years after Julius Lilienfeld patented the transistor, Kurt Gödel laid the foundations of theoretical computer science (CS) with his work on universal formal languages and the limits of proof and computation [5]. He constructed formal systems allowing for self-referential statements that talk about themselves, in particular, about whether they can be derived from a set of given axioms through a computational theorem proving procedure. Gödel went on to construct statements that claim their own unprovability, to demonstrate that traditional math is either flawed in a certain algorithmic sense or contains unprovable but true statements.

Gödel's incompleteness result is widely regarded as the most remarkable achievement of 20th century mathematics, although some mathematicians say it is logic, not math, and others call it the fundamental result of theoretical computer science, a discipline that did not yet officially exist back then but was effectively created through Gödel's work. It had enormous impact not only on computer science but also on philosophy and other fields. In particular, since humans can "see" the truth of Gödel's unprovable statements, some researchers mistakenly thought that his results show that machines and Artificial Intelligences (AIs) will always be inferior to humans. Given the tremendous impact of Gödel's results on AI theory, it does make sense to date AI's beginnings back to his 1931 publication 75 years ago.

Zuse and Turing. In 1936 Alan Turing [37] introduced the Turing machine to reformulate Gödel's results and Alonzo Church's extensions thereof. TMs are often more convenient than Gödel's integer-based formal systems, and later became a central tool of CS theory.
Simultaneously Konrad Zuse built the first working program-controlled computers (1935-1941), using the binary arithmetic and the bits of Gottfried Wilhelm von Leibniz (1701) instead of the more cumbersome decimal system used by Charles Babbage, who pioneered the concept of program-controlled computers in the 1840s, and tried to build one, although without success. By 1941, all the main ingredients of 'modern' computer science were in place, a decade after Gödel's paper, a century after Babbage, and roughly three centuries after Wilhelm Schickard, who started the history of automatic computing hardware by constructing the first non-program-controlled computer in 1623.

In the 1940s Zuse went on to devise the first high-level programming language (Plankalkül), which he used to write the first chess program. Back then chess-playing was considered an intelligent activity, hence one might call this chess program the first design of an AI program, although Zuse did not really implement it back then. Soon afterwards, in 1948, Claude Shannon [33] published information theory, recycling several older ideas such as Ludwig Boltzmann's entropy from 19th century statistical mechanics, and the bit of information (Leibniz, 1701).

Relays, Tubes, Transistors. Alternative instances of transistors, the concept pioneered and patented by Julius Edgar Lilienfeld (1920s) and Oskar Heil (1935), were built by William Shockley, Walter H. Brattain & John Bardeen (1948: point contact transistor) as well as Herbert F. Mataré & Heinrich Walker (1948, exploiting transconductance effects of germanium diodes observed in the Luftwaffe during WW-II). Today most transistors are of the field-effect type à la Lilienfeld & Heil. In principle a switch remains a switch no matter whether it is implemented as a relay or a tube or a transistor, but transistors switch faster than relays (Zuse, 1941) and tubes (Colossus, 1943; ENIAC, 1946). This eventually led to significant speedups of computer hardware, which was essential for many subsequent AI applications.

The I in AI. In 1950, some 56 years ago, Turing invented a famous subjective test to decide whether a machine or something else is intelligent. 6 years later, and 25 years after Gödel's paper, John McCarthy finally coined the term "AI". 50 years later, in 2006, this prompted some to celebrate the 50th birthday of AI, but this chapter's title should make clear that its author cannot agree with this view—it is the thing that counts, not its name.

Roots of Probability-Based AI. In the 1960s and 1970s Ray Solomonoff combined theoretical CS and probability theory to establish a general theory of universal inductive inference and predictive AI [35] closely related to the concept of Kolmogorov complexity [14]. His theoretically optimal predictors and their Bayesian learning algorithms only assume that the observable reactions of the environment in response to certain action sequences are sampled from an unknown probability distribution contained in a set M of all enumerable distributions. That is, given an observation sequence we only assume there exists a computer program that can compute the probabilities of the next possible observations. This includes all scientific theories of physics, of course. Since we typically do not know this program, we predict using a weighted sum ξ of all distributions in M, where the sum of the weights does not exceed 1. It turns out that this is indeed the best one can possibly do, in a very general sense [11, 35]. Although the universal approach is practically infeasible since M contains infinitely many distributions, it does represent the first sound and general theory of optimal prediction based on experience, identifying the limits of both human and artificial predictors, and providing a yardstick for all prediction machines to come.
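To make the flavor of ξ-style mixture prediction concrete, here is a minimal sketch in Python, assuming a toy setting in which M is a small finite set of Bernoulli models rather than Solomonoff's infinite class of enumerable semimeasures; the identifiers (mixture_predict, update_weights) are illustrative only, and the weight update is plain Bayes' rule.

```python
# Toy Bayesian mixture predictor: a finite stand-in for Solomonoff's xi.
# Each "model" assigns a probability to the next bit given the history;
# the mixture prediction is the weight-averaged model prediction, and
# weights are rescaled by how well each model predicted the actual bit.

def mixture_predict(models, weights, history):
    """Probability the mixture xi assigns to the next bit being 1."""
    return sum(w * m(history) for m, w in zip(models, weights))

def update_weights(models, weights, history, bit):
    """Bayes' rule: multiply each weight by the model's likelihood of `bit`."""
    likelihoods = [m(history) if bit == 1 else 1.0 - m(history)
                   for m in models]
    new = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(new)
    return [w / total for w in new]

# Three candidate environments: biased coins with p(1) = 0.2, 0.5, 0.9.
models = [lambda h, p=p: p for p in (0.2, 0.5, 0.9)]
weights = [1.0 / 3] * 3            # prior; sum of weights <= 1

history = []
for bit in [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]:   # data from the 0.9 coin
    print(f"xi(next=1) = {mixture_predict(models, weights, history):.3f}")
    weights = update_weights(models, weights, history, bit)
    history.append(bit)
print("posterior weights:", [round(w, 3) for w in weights])
```

After a few observations the posterior mass concentrates on the best model, a miniature version of why predicting with ξ is asymptotically nearly as good as predicting with the unknown true distribution.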
AI vs Astrology? Unfortunately, failed prophecies of human-level AI with just a tiny fraction of the brain's computing power discredited some of the AI research in the 1960s and 70s. Many theoretical computer scientists actually regarded much of the field with contempt for its perceived lack of hard theoretical results. ETH Zurich's Turing award winner and creator of the PASCAL programming language, Niklaus Wirth, did not hesitate to link AI to astrology. Practical AI of that era was dominated by rule-based expert systems and Logic Programming. That is, despite Solomonoff's fundamental results, a main focus of that time was on logical, deterministic deduction of facts from previously known facts, as opposed to (probabilistic) induction of hypotheses from experience.

Evolution, Neurons, Ants. Largely unnoticed by mainstream AI gurus of that era, a biology-inspired type of AI emerged in the 1960s when Ingo Rechenberg pioneered the method of artificial evolution to solve complex optimization tasks [22], such as the design of optimal airplane wings or combustion chambers of rocket nozzles. Such methods (and later variants thereof, e.g., Holland [10], 1970s) often gave better results than classical approaches. In the following decades, other types of "sub-symbolic" AI also became popular, especially neural networks. Early neural net papers include those of McCulloch & Pitts, 1940s (linking certain simple neural nets to old and well-known, simple mathematical concepts such as linear regression); Minsky & Papert [17] (temporarily discouraging neural network research); Kohonen [12] and Amari, 1960s; Werbos [40], 1970s; and many others in the 1980s. Orthogonal approaches included fuzzy logic (Zadeh, 1960s), Rissanen's practical variants [23] of Solomonoff's universal method, "representation-free" AI (Brooks [2]), Artificial Ants (Dorigo & Gambardella [4], 1990s), and statistical learning theory (in less general settings than those studied by Solomonoff) & support vector machines (Vapnik [38] and others). As of 2006, this alternative type of AI research is receiving more attention than "Good Old-Fashioned AI" (GOFAI).

Mainstream AI Marries Statistics. A dominant theme of the 1980s and 90s was the marriage of mainstream AI and old concepts from probability theory. Bayes networks, Hidden Markov Models, and numerous other probabilistic models found wide applications ranging from pattern recognition and medical diagnosis to data mining, machine translation, and robotics.

Hardware Outshining Software: Humanoids, Robot Cars, Etc. In the 1990s and 2000s, much of the progress in practical AI was due to better hardware, getting roughly 1000 times faster per Euro per decade.
In 1995, a fast vision-based robot car by Ernst Dickmanns (whose team built the world's first reliable robot cars in the early 1980s with the help of Mercedes-Benz, e.g., [3]) autonomously drove 1000 miles from Munich to Denmark and back, in traffic at up to 120 mph, automatically passing other cars (a safety driver took over only rarely in critical situations). Japanese labs (Honda, Sony) and Pfeiffer's lab at TU Munich built famous humanoid walking robots. Engineering problems often seemed more challenging than AI-related problems.

Another source of progress was the dramatically improved access to all kinds of data through the WWW, created by Tim Berners-Lee at the European particle collider CERN (Switzerland) in 1990. This greatly facilitated and encouraged all kinds of "intelligent" data mining applications. However, there were few if any obvious fundamental algorithmic breakthroughs; improvements / extensions of already existing algorithms seemed less impressive and less crucial than hardware advances. For example, chess world champion Kasparov was beaten by a fast IBM computer running a fairly standard algorithm. Rather simple but computationally expensive probabilistic methods for speech recognition, statistical machine translation, computer vision, optimization, virtual realities etc. started to become feasible on PCs, mainly because PCs had become 1000 times more powerful within a decade or so.

2006. As noted by Stefan Artmann (personal communication, 2006), today's AI textbooks seem substantially more complex and less unified than those of several decades ago, e.g., [18], since they have to cover so many apparently quite different subjects. There seems to be a need for a new unifying view of intelligence. In the author's opinion this view already exists, as will be discussed below.

2 Subjective Selected Highlights of Present AI

The more recent some event, the harder it is to judge its long-term significance. But this biased author thinks that the most important thing that happened recently in AI is the beginning of a transition from a heuristics-dominated science (e.g., [24]) to a real formal science. Let us elaborate on this topic.

2.1 The Two Ways of Making a Dent in AI Research

There are at least two convincing ways of doing AI research: (1) construct a (possibly heuristic) machine or algorithm that somehow (it does not really matter how) solves a previously unsolved interesting problem, such as beating the best human player of Go (success will outshine any lack of theory). Or (2) prove that a particular novel algorithm is optimal for an important class of AI problems. It is the nature of heuristics (case (1)) that they lack staying power, as they may soon get replaced by next year's even better heuristics. Theorems (case (2)), however, are for eternity. That's why formal sciences prefer theorems. For example, probability theory became a formal science centuries ago, and totally formal in 1933 with Kolmogorov's axioms [13], shortly after Gödel's paper [5]. Old but provably optimal techniques of probability theory are still in everyday use, and in fact highly significant for modern AI, while many initially successful heuristic approaches eventually became unfashionable, of interest mainly to the historians of the field.
2.2 No Brain Without a Body / AI Becoming a Formal Science

Heuristic approaches will continue to play an important role in many AI applications, to the extent they empirically outperform competing methods. But as with all young sciences at the transition point between an early intuition-dominated and a later formal era, the importance of mathematical optimality theorems is growing quickly. Progress in the formal era, however, is and will be driven by a different breed of researchers, a fact that is not necessarily universally enjoyed and welcomed by all the earlier pioneers.

Today the importance of embodied, embedded AI is almost universally acknowledged (e.g., [20]), as obvious from frequently overheard remarks such as "let the physics compute" and "no brain without a body." Many present AI researchers focus on real robots living in real physical environments. To some of them the title of this subsection may seem oxymoronic: the extension of AI into the realm of the physical body seems to be a step away from formalism. But the new millennium's formal point of view is actually taking this step into account in a very general way, through the first mathematical theory of universal embedded AI, combining "old" theoretical computer science and "ancient" probability theory to derive optimal behavior for embedded, embodied rational agents living in unknown but learnable environments. More on this below.

2.3 What's the I in AI? What is Life? Etc.

Before we proceed, let us clarify what we are talking about. Shouldn't researchers on Artificial Intelligence (AI) and Artificial Life (AL) agree on basic questions such as: What is Intelligence? What is Life? Interestingly they don't.

Are Cars Alive? For example, AL researchers often offer definitions of life such as: it must reproduce, evolve, etc. Cars are alive, too, according to most of these definitions. For example, cars evolve and multiply. They need complex environments with car factories to do so, but living animals also need complex environments full of chemicals and other animals to reproduce — the DNA information by itself does not suffice. There is no obvious fundamental difference between an organism whose self-replication information is stored in its DNA, and a car whose self-replication information is stored in a car builder's manual in the glove compartment. To copy itself, the organism needs its mother's womb plus numerous other objects and living beings in its environment (such as trillions of bacteria inside and outside of the mother's body). The car needs iron mines and car part factories and human workers.

What is Intelligence? If we cannot agree on what's life, or, for that matter, love, or consciousness (another fashionable topic), how can there be any hope to define intelligence? Turing's definition (1950, 19 years after Gödel's paper) was totally subjective: intelligent is what convinces me that it is intelligent while I am interacting with it. Fortunately, however, there are more formal and less subjective definitions.

2.4 Formal AI Definitions

Popper said: all life is problem solving [21]. Instead of defining intelligence in Turing's rather vague and subjective way we define intelligence with respect to the abilities of universal optimal problem solvers.
Consider a learning robotic agent with a single life which consists of discrete cycles or time steps t = 1, 2, ..., T. Its total lifetime T may or may not be known in advance. In what follows, the value of any time-varying variable Q at time t (1 ≤ t ≤ T) will be denoted by Q(t), the ordered sequence of values Q(1), ..., Q(t) by Q(≤ t), and the (possibly empty) sequence Q(1), ..., Q(t − 1) by Q(< t).

At any given t the robot receives a real-valued input vector x(t) from the environment and executes a real-valued action y(t) which may affect future inputs; at times t < T its goal is to maximize future success or utility

$$u(t) = E_\mu\!\left[\,\sum_{\tau=t+1}^{T} r(\tau) \;\Big|\; h(\leq t)\right], \qquad (1)$$

where r(t) is an additional real-valued reward input at time t, h(t) the ordered triple [x(t), y(t), r(t)] (hence h(≤ t) is the known history up to t), and E_µ(· | ·) denotes the conditional expectation operator with respect to some possibly unknown distribution µ from a set M of possible distributions. Here M reflects whatever is known about the possibly probabilistic reactions of the environment. For example, M may contain all computable distributions [11, 35]. Note that unlike in most previous work by others [36], there is just one life, no need for predefined repeatable trials, no restriction to Markovian interfaces between sensors and environment, and the utility function implicitly takes into account the expected remaining lifespan E_µ(T | h(≤ t)) and thus the possibility to extend it through appropriate actions [29].

Any formal problem or sequence of problems can be encoded in the reward function. For example, the reward functions of many living or robotic beings cause occasional hunger or pain or pleasure signals etc. At time t an optimal AI will make the best possible use of experience h(≤ t) to maximize u(t). But how?

2.5 Universal, Mathematically Optimal, But Incomputable AI

Unbeknownst to many traditional AI researchers, there is indeed an extremely general "best" way of exploiting previous experience. At any time t, the recent theoretically optimal yet practically infeasible reinforcement learning (RL) algorithm AIXI [11] uses Solomonoff's above-mentioned universal prediction scheme to select those action sequences that promise maximal future reward up to some horizon, given the current data h(≤ t). Using a variant of Solomonoff's universal probability mixture ξ, in cycle t + 1, AIXI selects as its next action the first action of an action sequence maximizing ξ-predicted reward up to the horizon. Hutter's recent work [11] demonstrated AIXI's optimal use of observations as follows. The Bayes-optimal policy p_ξ based on the mixture ξ is self-optimizing in the sense that its average utility value converges asymptotically for all µ ∈ M to the optimal value achieved by the (infeasible) Bayes-optimal policy p_µ which knows µ in advance. The necessary condition that M admits self-optimizing policies is also sufficient.

Of course one cannot claim the old AI is devoid of formal research! The recent approach above, however, goes far beyond previous formally justified but very limited AI-related approaches ranging from linear perceptrons [17] to the A∗-algorithm [18]. It provides, for the first time, a mathematically sound theory of general AI and optimal decision making based on experience, identifying the limits of both human and artificial intelligence, and a yardstick for any future, scaled-down, practically feasible approach to general AI.
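AIXI itself is incomputable, but its decision rule has a simple finite shadow. The following sketch is not Hutter's algorithm, just a toy under strong assumptions (a handful of known candidate environments, rewards independent of history, no within-horizon belief updates): it performs exhaustive expectimax over a short horizon under a Bayesian mixture. All names (ToyEnvModel, plan) are hypothetical.

```python
from itertools import product

# Toy "AIXI-flavored" planner: exhaustive expectimax under a Bayesian
# mixture over a FINITE set of candidate environments (real AIXI mixes
# over ALL enumerable environments and is incomputable).

ACTIONS = (0, 1)

class ToyEnvModel:
    """Hypothetical environment model: maps an action to expected reward."""
    def __init__(self, reward_table):
        self.reward_table = reward_table
    def expected_reward(self, action):
        return self.reward_table[action]

models  = [ToyEnvModel({0: 1.0, 1: 0.0}),   # env A: action 0 pays
           ToyEnvModel({0: 0.0, 1: 1.0})]   # env B: action 1 pays
weights = [0.3, 0.7]                        # posterior after some history

def plan(horizon):
    """First action of the action sequence maximizing mixture-expected
    cumulative reward up to `horizon`."""
    best_seq, best_val = None, float("-inf")
    for seq in product(ACTIONS, repeat=horizon):
        val = sum(w * sum(m.expected_reward(a) for a in seq)
                  for m, w in zip(models, weights))
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq[0]

print(plan(horizon=3))   # -> 1, since the mixture favors env B
```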
2.6 Optimal Curiosity and Creativity

No theory of AI will be convincing if it does not explain curiosity and creativity, which many consider as important ingredients of intelligence. We can provide an explanation in the framework of optimal reward maximizers such as those from the previous subsection.

It is possible to come up with theoretically optimal ways of improving the predictive world model of a curious robotic agent [28], extending earlier ideas on how to implement artificial curiosity [25]: The rewards of an optimal reinforcement learner are the predictor's improvements on the observation history so far. They encourage the reinforcement learner to produce action sequences that cause the creation and the learning of new, previously unknown regularities in the sensory input stream. It turns out that art and creativity can be explained as by-products of such intrinsic curiosity rewards: good observer-dependent art deepens the observer's insights about this world or possible worlds, connecting previously disconnected patterns in an initially surprising way that eventually becomes known and boring. While previous attempts at describing what is satisfactory art or music were informal, this work permits the first technical, formal approach to understanding the nature of art and creativity [28].
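As a minimal illustration of the curiosity-reward idea (a sketch, not the exact formulation of [25, 28]): let the intrinsic reward be the improvement of the world model's prediction error caused by one learning step, so the agent is paid for learning progress rather than for low error itself. The class names below are hypothetical.

```python
class RunningMeanModel:
    """Tiny world model: predicts the next scalar as the mean seen so far."""
    def __init__(self):
        self.n, self.mean = 0, 0.0
    def prediction_error(self, x):
        return (x - self.mean) ** 2
    def train_on(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n

class CuriousAgent:
    """Intrinsic reward = drop in prediction error caused by learning."""
    def __init__(self, model):
        self.model = model
    def intrinsic_reward(self, x):
        before = self.model.prediction_error(x)
        self.model.train_on(x)                 # one model-improvement step
        after = self.model.prediction_error(x)
        return before - after                  # learning progress, not low error

agent = CuriousAgent(RunningMeanModel())
boring = [5.0] * 5                             # perfectly regular stream
novel  = [1.0, 9.0, 2.0, 8.0, 5.0]             # structure still left to learn
print([round(agent.intrinsic_reward(x), 3) for x in boring])
print([round(agent.intrinsic_reward(x), 3) for x in novel])
```

Note how the perfectly regular stream pays only while it is still new; once predicted, it becomes boring, matching the account above of why surprise fades.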
2.7 Computable, Asymptotically Optimal General Problem Solver

Using the Speed Prior [26] one can scale down the universal approach above such that it becomes computable. In what follows we will mention general methods whose optimality criteria explicitly take into account the computational costs of prediction and decision making—compare [15].

The recent asymptotically optimal search algorithm for all well-defined problems [11] allocates part of the total search time to searching the space of proofs for provably correct candidate programs with provable upper runtime bounds; at any given time it focuses resources on those programs with the currently best proven time bounds. The method is as fast as the initially unknown fastest problem solver for the given problem class, save for a constant slowdown factor of at most 1 + ε, ε > 0, and an additive constant that does not depend on the problem instance! Is this algorithm then the holy grail of computer science? Unfortunately not quite, since the additive constant (which disappears in the O()-notation of theoretical CS) may be huge, and practical applications may not ignore it. This motivates the next section, which addresses all kinds of formal optimality (not just asymptotic optimality).

2.8 Fully Self-Referential, Self-Improving Gödel Machine

We may use Gödel's self-reference trick to build a universal, fully self-referential, self-improving, optimally efficient problem solver [29]. A Gödel Machine is a computer whose original software includes axioms describing the hardware and the original software (this is possible without circularity) plus whatever is known about the (probabilistic) environment plus some formal goal in form of an arbitrary user-defined utility function, e.g., cumulative future expected reward in a sequence of optimization tasks - see equation (1). The original software also includes a proof searcher which uses the axioms (and possibly an online variant of Levin's universal search [15]) to systematically make pairs ("proof", "program") until it finds a proof that a rewrite of the original software through "program" will increase utility. The machine can be designed such that each self-rewrite is necessarily globally optimal in the sense of the utility function, even those rewrites that destroy the proof searcher [29].

2.9 Practical Algorithms for Program Learning

The theoretically optimal universal methods above are optimal in ways that do not (yet) immediately yield practically feasible general problem solvers, due to possibly large initial overhead costs. Which are today's practically most promising extensions of traditional machine learning?

Since virtually all realistic sensory inputs of robots and other cognitive systems are sequential by nature, the future of machine learning and AI in general depends on progress in sequence processing as opposed to the traditional processing of stationary input patterns. To narrow the gap between learning abilities of humans and machines, we will have to study how to learn general algorithms instead of such reactive mappings. Most traditional methods for learning time series and mappings from sequences to sequences, however, are based on simple time windows: one of the numerous feedforward ML techniques such as feedforward neural nets (NN) [1] or support vector machines [38] is used to map a restricted, fixed time window of sequential input values to desired target values. Of course such approaches are bound to fail if there are temporal dependencies exceeding the time window size. Large time windows, on the other hand, yield unacceptable numbers of free parameters.

Presently studied, rather general sequence learners include certain probabilistic approaches and especially recurrent neural networks (RNNs), e.g., [19]. RNNs have adaptive feedback connections that allow them to learn mappings from input sequences to output sequences. They can implement any sequential, algorithmic behavior implementable on a personal computer. In gradient-based RNNs, however, we can differentiate our wishes with respect to programs, to obtain a search direction in algorithm space. RNNs are biologically more plausible and computationally more powerful than other adaptive models such as Hidden Markov Models (HMMs - no continuous internal states), feedforward networks & Support Vector Machines (no internal states at all). For several reasons, however, the first RNNs could not learn to look far back into the past. This problem was overcome by RNNs of the Long Short-Term Memory type (LSTM), currently the most powerful and practical supervised RNN architecture for many applications, trainable either by gradient descent [9] or evolutionary methods [32], occasionally profiting from a marriage with probabilistic approaches [8].
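To see the key difference from time-window methods, here is a minimal vanilla-RNN forward pass in NumPy (a sketch with untrained random weights, not LSTM): the hidden state h is an internal memory updated at every step, so the output at step t can in principle depend on inputs far outside any fixed window.

```python
import numpy as np

# Minimal vanilla RNN forward pass: the hidden state is an internal
# memory updated at every step, unlike a fixed time-window model.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2
W_xh = rng.normal(scale=0.5, size=(n_hid, n_in))   # input  -> hidden
W_hh = rng.normal(scale=0.5, size=(n_hid, n_hid))  # hidden -> hidden (feedback)
W_hy = rng.normal(scale=0.5, size=(n_out, n_hid))  # hidden -> output

def rnn_forward(xs):
    h = np.zeros(n_hid)                  # internal state, initially empty
    ys = []
    for x in xs:                         # one step per sequence element
        h = np.tanh(W_xh @ x + W_hh @ h)
        ys.append(W_hy @ h)
    return ys                            # y(t) may depend on ALL of x(<= t)

sequence = [rng.normal(size=n_in) for _ in range(100)]
outputs = rnn_forward(sequence)
print(len(outputs), outputs[-1])         # last output reflects the whole history
```

Plain RNNs trained by gradient descent still struggle with very long dependencies (vanishing gradients), which is exactly the problem LSTM's gating architecture [9] was designed to overcome.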
Unsupervised RNNs that learn without a teacher to control physical processes or robots frequently use evolutionary algorithms [10, 22] to learn appropriate programs (RNN weight matrices) through trial and error [41]. Recent work brought progress through a focus on reducing search spaces by co-evolving the comparatively small weight vectors of individual recurrent neurons [7]. Such RNNs can learn to create memories of important events, solving numerous RL / optimization tasks unsolvable by traditional RL methods [6, 7]. They are among the most promising methods for practical program learning, and currently being applied to the control of sophisticated robots such as the walking biped of TU Munich [16].
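A bare-bones flavor of evolving RNN weights, assuming a toy fitness function as a stand-in for running the network on a real task; this is a simple (1+1) evolution strategy, far cruder than the neuron-level co-evolution of [7]:

```python
import numpy as np

# (1+1) evolution strategy on an RNN weight vector: mutate, keep if better.
# A bare-bones stand-in for the far more sophisticated methods of [7, 41].
rng = np.random.default_rng(1)
n_weights = 40                          # flattened RNN weight matrices

def fitness(w):
    """Placeholder reward; a real one would run the RNN on the robot or
    simulator and return cumulative reward."""
    return -np.sum((w - 0.5) ** 2)      # toy: best weights are all 0.5

w = rng.normal(size=n_weights)          # random initial "program"
best = fitness(w)
for generation in range(1000):
    child = w + rng.normal(scale=0.1, size=n_weights)   # mutate
    f = fitness(child)
    if f > best:                        # keep the child only if it improves
        w, best = child, f
print(round(best, 4))                   # approaches 0 as w -> 0.5
```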
3 The Next 25 Years

Where will AI research stand in 2031, 25 years from now, 100 years after Gödel's ground-breaking paper [5], some 200 years after Babbage's first designs, some 400 years after the first automatic calculator by Schickard (and some 2000 years after the crucifixion of the man whose birth year anchors the Western calendar)?

Trivial predictions are those that just naively extrapolate the current trends, such as: computers will continue to get faster by a factor of roughly 1000 per decade; hence they will be at least a million times faster by 2031. According to frequent estimates, current supercomputers achieve roughly 1 percent of the raw computational power of a human brain, hence those of 2031 will have 10,000 "brain powers"; and even cheap devices will achieve many brain powers. Many tasks that are hard for today's software on present machines will become easy without even fundamentally changing the algorithms. This includes numerous pattern recognition and control tasks arising in factories of many industries, currently still employing humans instead of robots.

Will theoretical advances and practical software keep up with the hardware development? We are convinced they will. As discussed above, the new millennium has already brought fundamental new insights into the problem of constructing theoretically optimal rational agents or universal AIs, even if those do not yet immediately translate into practically feasible methods. On the other hand, on a more practical level, there has been rapid progress in learning algorithms for agents interacting with a dynamic environment, autonomously discovering true sequence-processing, problem-solving programs, as opposed to the reactive mappings from stationary inputs to outputs studied in most of traditional machine learning research. In the author's opinion the above-mentioned theoretical and practical strands are going to converge. In conjunction with the ongoing hardware advances this will yield non-universal but nevertheless rather general artificial problem-solvers whose capabilities will exceed those of most if not all humans in many domains of commercial interest. This may seem like a bold prediction to some, but it is actually a trivial one as there are so many experts who would agree with it.

Nontrivial predictions are those that anticipate truly unexpected, revolutionary breakthroughs. By definition, these are hard to predict. For example, in 1985 only very few scientists and science fiction authors predicted the WWW revolution of the 1990s. The few who did were not influential enough to make a significant part of humanity believe in their predictions and prepare for their coming true. Similarly, after the latest stock market crash one can always find with high probability some "prophet in the desert" who predicted it in advance, but had few if any followers until the crash really occurred. Truly nontrivial predictions are those that most will not believe until they come true. We will mostly restrict ourselves to trivial predictions like those above and refrain from too much speculation in form of nontrivial ones. However, we may have a look at previous unexpected scientific breakthroughs and try to discern a pattern, a pattern that may not allow us to precisely predict the details of the next revolution but at least its timing.

3.1 A Pattern in the History of Revolutions?

Let us put the AI-oriented developments [27] discussed above in a broader context, and look at the history of major scientific revolutions and essential historic developments (that is, the subjects of the major chapters in history books) since the beginnings of modern man over 40,000 years ago [30, 31]. Amazingly, they seem to match a binary logarithmic scale marking exponentially declining temporal intervals [31], each half the size of the previous one, and measurable in terms of powers of 2 multiplied by a human lifetime (roughly 80 years—throughout recorded history many individuals have reached this age, although the average lifetime often was shorter, mostly due to high child mortality). It looks as if history itself will converge in a historic singularity or Omega point Ω around 2040 (the term historic singularity is apparently due to Stanislaw Ulam (1950s) and was popularized by Vernor Vinge [39] in the 1990s). To convince yourself of history's convergence, associate an error bar of not much more than 10 percent with each date below (the short computation after the list makes the date arithmetic explicit):

1. Ω − 2^9 lifetimes: modern humans start colonizing the world from Africa

2. Ω − 2^8 lifetimes: bow and arrow invented; hunting revolution

3. Ω − 2^7 lifetimes: invention of agriculture; first permanent settlements; beginnings of civilization

4. Ω − 2^6 lifetimes: first high civilizations (Sumeria, Egypt), and the most important invention of recorded history, namely, the one that made recorded history possible: writing

5. Ω − 2^5 lifetimes: the ancient Greeks invent democracy and lay the foundations of Western science and art and philosophy, from algorithmic procedures and formal proofs to anatomically perfect sculptures, harmonic music, and organized sports. Old Testament written (basis of Judaism, Christianity, Islam); major Asian religions founded. High civilizations in China, origin of the first calculation tools, and India, origin of alphabets and the zero

6. Ω − 2^4 lifetimes: bookprint (often called the most important invention of the past 2000 years) invented in China. Islamic science and culture start spreading across large parts of the known world (this has sometimes been called the most important event between Antiquity and the age of discoveries)

7. Ω − 2^3 lifetimes: the Mongolian Empire, the largest and most dominant empire ever (possibly including most of humanity and the world economy), stretches across Asia from Korea all the way to Germany. Chinese fleets and later also European vessels start exploring the world. Gunpowder and guns invented in China. Renaissance and Western bookprint (often called the most influential invention of the past 1000 years) and subsequent Reformation in Europe. Beginning of the Scientific Revolution

8. Ω − 2^2 lifetimes: Age of enlightenment and rational thought in Europe. Massive progress in the sciences; first flying machines; first steam engines prepare the industrial revolution

9. Ω − 2 lifetimes: Second industrial revolution based on combustion engines, cheap electricity, and modern chemistry. Birth of modern medicine through the germ theory of disease; genetic and evolution theory. European colonialism at its short-lived peak

10. Ω − 1 lifetime: modern post-World War II society and pop culture emerges; superpower stalemate based on nuclear deterrence. The 20th century super-exponential population explosion (from 1.6 billion to 6 billion people, mainly due to the Haber-Bosch process [34]) is at its peak. First spacecraft and commercial computers; DNA structure unveiled

11. Ω − 1/2 lifetime (now): for the first time in history most of the most destructive weapons are dismantled, after the Cold War's peaceful end. 3rd industrial revolution based on personal computers and the World Wide Web. A mathematical theory of universal AI emerges (see sections above) - will this be considered a milestone in the future?

12. Ω − 1/4 lifetime: This point will be reached around 2020. By then many computers will have substantially more raw computing power than human brains.

13. Ω − 1/8 lifetime (100 years after Gödel's paper): will practical variants of Gödel machines start a runaway evolution of continually self-improving superminds way beyond human imagination, causing far more unpredictable revolutions in the final decade before Ω than during all the millennia before?

14. ...
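For readers who want to check the claimed fit, a few lines suffice to turn each entry into a calendar date, assuming Ω = 2040 and one lifetime = 80 years as in the text:

```python
# Dates implied by the binary logarithmic scale: Omega - 2^k lifetimes,
# assuming Omega = 2040 and a lifetime of 80 years as in the text.
OMEGA, LIFETIME = 2040, 80

for k in [9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3]:
    year = OMEGA - (2 ** k) * LIFETIME   # 2 ** -1 == 0.5, etc.
    print(f"Omega - 2^{k:>2} lifetimes: year {year:8.0f}")
```

The k = 9 entry lands near 39,000 BC, consistent with "over 40,000 years ago" within the stated 10 percent error bars.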
The following disclosure should help the reader to take this list with a grain of salt though. The author, who admits being very interested in witnessing Ω, was born in 1963, and therefore perhaps should not expect to live long past 2040. This may motivate him to uncover certain historic patterns that fit his desires, while ignoring other patterns that do not.

Perhaps there even is a general rule for both the individual memory of single humans and the collective memory of entire societies and their history books: constant amounts of memory space get allocated to exponentially larger, adjacent time intervals further and further into the past. Maybe that's why there has never been a shortage of prophets predicting that the end is near - the important events according to one's own view of the past always seem to accelerate exponentially. See [31] for a more thorough discussion of this possibility.

References

[1] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[2] R. A. Brooks. Intelligence without reason. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pages 569-595, 1991.

[3] E. D. Dickmanns, R. Behringer, D. Dickmanns, T. Hildebrandt, M. Maurer, F. Thomanek, and J. Schiehlen. The seeing passenger car 'VaMoRs-P'. In Proc. Int. Symp. on Intelligent Vehicles '94, Paris, pages 68-73, 1994.

[4] M. Dorigo, G. Di Caro, and L. M. Gambardella. Ant algorithms for discrete optimization. Artificial Life, 5(2):137-172, 1999.
[5] K. Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38:173-198, 1931.

[6] F. Gomez, J. Schmidhuber, and R. Miikkulainen. Efficient non-linear control through neuroevolution. In ECML 2006: Proceedings of the 17th European Conference on Machine Learning. Springer, 2006.

[7] F. J. Gomez and R. Miikkulainen. Active guidance for a finless rocket using neuroevolution. In Proc. GECCO 2003, Chicago, 2003. Winner of Best Paper Award in Real World Applications. Gomez is working at IDSIA on a CSEM grant to J. Schmidhuber.

[8] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets. In ICML '06: Proceedings of the International Conference on Machine Learning, 2006.

[9] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

[10] J. H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975.

[11] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2004. (On J. Schmidhuber's SNF grant 20-61847).

[12] T. Kohonen. Self-Organization and Associative Memory. Springer, second edition, 1988.

[13] A. N. Kolmogorov. Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer, Berlin, 1933.

[14] A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1:1-11, 1965.

[15] L. A. Levin. Universal sequential search problems. Problems of Information Transmission, 9(3):265-266, 1973.

[16] S. Lohmeier, K. Loeffler, M. Gienger, H. Ulbrich, and F. Pfeiffer. Sensor system and trajectory control of a biped robot. In Proc. 8th IEEE International Workshop on Advanced Motion Control (AMC'04), Kawasaki, Japan, pages 393-398, 2004.

[17] M. Minsky and S. Papert. Perceptrons. Cambridge, MA: MIT Press, 1969.

[18] N. J. Nilsson. Principles of Artificial Intelligence. Morgan Kaufmann, San Francisco, CA, USA, 1980.

[19] B. A. Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5):1212-1228, 1995.

[20] R. Pfeifer and C. Scheier. Understanding Intelligence. MIT Press, 2001.

[21] K. R. Popper. All Life Is Problem Solving. Routledge, London, 1999.

[22] I. Rechenberg. Evolutionsstrategie - Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Dissertation, 1971. Published 1973 by Fromman-Holzboog.

[23] J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.

[24] P. S. Rosenbloom, J. E. Laird, and A. Newell. The SOAR Papers. MIT Press, 1993.

[25] J. Schmidhuber. Curious model-building control systems. In Proceedings of the International Joint Conference on Neural Networks, Singapore, volume 2, pages 1458-1463. IEEE Press, 1991.

[26] J. Schmidhuber. The Speed Prior: a new simplicity measure yielding near-optimal computable predictions. In J. Kivinen and R. H. Sloan, editors, Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Lecture Notes in Artificial Intelligence, pages 216-228. Springer, Sydney, Australia, 2002.
[27] J. Schmidhuber. Artificial Intelligence - history highlights and outlook: AI maturing and becoming a real formal science, 2006. http://www.idsia.ch/~juergen/ai.html.

[28] J. Schmidhuber. Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connection Science, 18(2):173-187, 2006.

[29] J. Schmidhuber. Gödel machines: fully self-referential optimal universal problem solvers. In B. Goertzel and C. Pennachin, editors, Artificial General Intelligence, pages 199-226. Springer Verlag, 2006.

[30] J. Schmidhuber. Is history converging? Again?, 2006. http://www.idsia.ch/~juergen/history.html.

[31] J. Schmidhuber. New millennium AI and the convergence of history. In W. Duch and J. Mandziuk, editors, Challenges to Computational Intelligence. Springer, in press, 2006. Also available as TR IDSIA-04-03, cs.AI/0302012.

[32] J. Schmidhuber, D. Wierstra, M. Gagliolo, and F. Gomez. Training recurrent networks by EVOLINO. Neural Computation, 19(3):757-779, 2007.

[33] C. E. Shannon. A mathematical theory of communication (parts I and II). Bell System Technical Journal, XXVII:379-423, 1948.

[34] V. Smil. Detonator of the population explosion. Nature, 400:415, 1999.

[35] R. J. Solomonoff. Complexity-based induction systems. IEEE Transactions on Information Theory, IT-24(5):422-432, 1978.

[36] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. Cambridge, MA, MIT Press, 1998.

[37] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 41:230-267, 1936.

[38] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[39] V. Vinge. The coming technological singularity, 1993. VISION-21 Symposium sponsored by NASA Lewis Research Center, and Whole Earth Review, Winter issue.

[40] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.

[41] Xin Yao. A review of evolutionary artificial neural networks. International Journal of Intelligent Systems, 4:203-222, 1993.