Idiotypic Immune Networks in Mobile Robot Control
Amanda M. Whitbrook, Uwe Aickelin, Member, IEEE, and Jonathan M. Garibaldi

Abstract—Jerne's idiotypic network theory postulates that the immune response involves inter-antibody stimulation and suppression as well as matching to antigens. The theory has proved the most popular Artificial Immune System (AIS) model for incorporation into behavior-based robotics, but guidelines for implementing idiotypic selection are scarce. Furthermore, the direct effects of employing the technique have not been demonstrated in the form of a comparison with non-idiotypic systems. This paper aims to address these issues. A method for integrating an idiotypic AIS network with a Reinforcement Learning (RL) based control system is described, and the mechanisms underlying antibody stimulation and suppression are explained in detail. Some hypotheses that account for the network advantage are put forward and tested using three systems with increasing idiotypic complexity: the basic RL, a simplified hybrid AIS-RL that implements idiotypic selection independently of derived concentration levels, and a full hybrid AIS-RL scheme. The test bed takes the form of a simulated Pioneer robot that is required to navigate through maze worlds detecting and tracking door markers.

Index Terms—Artificial immune system, behavior arbitration mechanism, idiotypic network theory, reinforcement learning.

I. INTRODUCTION

The main focus of mobile robot research has been behavior-based reactive control since the publication of Brooks' subsumption architecture in the mid-eighties [22]. This approach allows a degree of intelligence to emerge from competence module (individual behavior) interactions, but it is normally integrated with other AI methods, for example reinforcement learning [17] or neural networks [14], as these provide greater flexibility for dynamically changing environments. More recently, researchers have been exploiting the learning and adaptive properties of the vertebrate immune system in order to design effective sensory-response algorithms. In essence, the immune system matches antibodies (receptors on b-cells) to antigens (foreign material that invades the body), so that b-cells with suitable receptors undergo stimulation, increase in number (clonal selection) and destroy the invading cells. Artificial Immune Systems (AIS) have a matching function that determines the strength of the bond between the antibody and antigen, and they utilize a concentration parameter as an additional measure of antibody fitness. A comprehensive introduction to AIS systems and their applications is provided in [4].

Within mobile robotics, Farmer's [2] computational model of Jerne's idiotypic network theory [1] has been notable as a means of inducing flexible behavior mediation, and it has demonstrated some encouraging results. In these idiotypic networks, antibodies (competence modules) are linked both to antigens (environmental stimuli) and to each other, forming a dynamic chain of suppression and stimulation that affects their concentration levels globally. The system is balanced so that concentration levels also play a role in determining the degrees of stimulation and suppression that occur.
This "global perspective" differs from the more conventional AIS approach (clonal selection theory [3]), which considers that only antibody-antigen stimulation alters antibody concentrations. The success of the idiotypic systems has largely been attributed to the behavior arbitration capabilities of the communicating antibodies, but no attention has been directed towards proving that this is the case, or showing that other systems are inferior. In addition, there has been little attempt to explain the particular mechanisms by which antibodies stimulate and suppress each other and how this is able to improve robot performance.

This paper aims to address these issues by providing a comprehensive description of a hybrid robot control system that implements Reinforcement Learning (RL) with a Farmer-based idiotypic network for antibody selection. Although the system described does not attempt to evolve network connections and uses a fixed set of antibodies, additional details missing from earlier narratives are supplied. In particular, a rigorous account of the implementation of stimulation and suppression and some hypotheses that try to explain the idiotypic advantage are given. Most importantly, this paper seeks to test these hypotheses by undertaking a number of experiments that introduce idiotypic effects into the RL system gradually. In the first system (S1) an idiotypic network is not implemented and antibodies are selected on the basis of strength of match to antigens only; in effect this is a pure reinforcement learning system. The second system (S2) is a hybrid that couples reinforcement learning with a simplified idiotypic network: antibodies are selected by summing the effects of the network interactions to provide a global strength of match, but concentration levels do not influence the idiotypic process in any way. The third system (S3) is a full AIS that bases selection on a combination of the global strength of match and the concentration level, and also feeds the concentration levels back to the network. This step-wise approach is important in attempting to assess and explain the effects of introducing the idiotypic network into the system. In addition, idiotypic dynamics have not previously been uncoupled from antibody concentrations when implementing the Farmer equation, so this represents a novel investigation.

The paper is arranged as follows. Section II provides background information, including a brief account of the biological immune system that highlights the main differences between the more traditional clonal selection theory [3] and Jerne's idiotypic network theory [1]. The section also describes how the network theory has been applied to autonomous robot navigation, and a short review of recent work in this field is given. Section III discusses the motivation behind the research, relating the problems associated with reinforcement learning and introducing some hypotheses that attempt to explain the idiotypic network advantage. Section IV details the navigation problem and environments that have been used as the test bed for the hypotheses. Section V presents information on system architecture, Section VI focuses on the experimental methodology adopted and Section VII reports on the results and their interpretation. Section VIII concludes the paper.
II. BACKGROUND

A. Clonal Selection Theory

In the adaptive immune system of vertebrates, b-cells play an important role in the identification and removal of antigens. The clonal selection theory [3] states that division occurs for b-cells with receptors that have a high degree of match to a stimulating antigen's epitope pattern, and that these cells then mature into plasma cells that secrete the matching receptors, or antibodies, into the bloodstream. The reproduction of the b-cells also causes a high rate of mutation, so that weakly matching cells may mutate to produce antibodies with higher affinities for the stimulating antigen. Once in the bloodstream, the antibody combining sites or paratopes bind to the antigen epitopes, causing other cells to assist with elimination. Paratopes and epitopes are complementary and are analogous to keys and locks; paratopes can be viewed as master keys that may open a set of locks, and some locks can be opened by more than one key [2]. Some of the matching b-cells are retained in circulation for a long time, acting as memory cells. The efficiency of the immune response to a given antigen is hence governed by the dynamically changing concentration of matching b-cells, which in turn depends on previous exposure to the antigen. In this way, the immune system adapts by building up high concentrations of b-cells that have proved useful in the past. Diversity is maintained by replacement of the cells in the bone marrow at a rate of about 5% per day, during which time mutation can occur.

B. Idiotypic Network Theory

Jerne's idiotypic network theory [1] proposes that antibodies also possess a set of epitopes and so are capable of being recognized by other antibodies. Epitopes unique to an antibody type are termed idiotopes, and the group of antibodies sharing the same idiotope belongs to the same idiotype. When an antibody's idiotope is recognized by the paratopes of other antibodies, it is suppressed and its concentration is reduced. However, when an antibody's paratope recognizes the idiotopes of other antibodies or the epitopes of antigens, it is stimulated and its concentration increases. Jerne's theory hence views the immune system as a complex network of paratopes that recognize idiotopes and idiotopes that are recognized by paratopes, see Fig. 1. This implies that b-cells are not isolated, but are communicating with each other via collective dynamic network interactions [10]. The network is self-regulating and continually adapts itself, maintaining a steady state that reflects the global results of interacting with the environment [1]. This is in contrast to the clonal selection theory, which supports the view that change to immune memory is the result of antibody-antigen interactions only. In addition, Jerne's theory asserts that antibodies continue communicating even in the absence of antigens, which produces continual change of concentration levels. This can be interpreted as two forms of inter-antibody activity: "background" communication, which occurs perpetually, and "active" communication, which takes place only when antigens are present. In the latter case, a single antibody becomes more dominant, since the cell with the paratope that best fits the antigen epitope contributes more to the collective response [11].
It presents itself to the system as the antigenic antibody [7], which disturbs the network, inducing further inter-antibody suppression and stimulation.

C. Incorporation of the Network Theory into Mobile Robotics

Farmer et al. [2] propose that Jerne's hypothesis can be modeled as a differential equation simulating the changing concentrations of antibodies with respect to the stimulatory and suppressive effects and the natural death rate. Their model supposes that in a system with N antibodies [x_1, x_2 ... x_N] and L antigens [y_1, y_2 ... y_L], the differential equation governing the rate of change in concentration C of antibody x_i is given by (1).

$$\dot{C}(x_i) = b\left[\sum_{j=1}^{L} U_{ij}\,C(x_i)\,C(y_j) \;-\; k_1\sum_{m=1}^{N} V_{im}\,C(x_i)\,C(x_m) \;+\; \sum_{p=1}^{N} W_{ip}\,C(x_i)\,C(x_p)\right] - k_2\,C(x_i) \qquad (1)$$

The first sum in the square brackets expresses the stimulation of antibody x_i in response to all antigens. Here, U represents a matching function between antibodies and antigens, and the C(x_i)C(y_j) terms model the fact that the probability of a collision between them (and hence the probability of stimulation) is dependent on their relative concentrations. The second sum represents suppression of antibody x_i in response to all other antibodies; V is a function that models the degree of recognition for suppression, and C(x_i)C(x_m) is the collision factor. The third sum models the stimulation of antibody x_i in response to the other antibodies; the function W represents the degree of recognition for stimulation, and C(x_i)C(x_p) models the collisions. The variable k_1 allows possible inequalities between inter-antibody stimulation and suppression, but if k_1 = 1 these forces are equal. The k_2 term outside the brackets is a damping factor, which denotes the tendency of antibodies to die, at a constant rate, in the absence of interactions. Variable b is a rate constant that simulates both the number of collisions per unit time and the rate of antibody production when a collision occurs.

Equation (1) is based on the principle that antibody levels are dependent upon affinity between the antibody and the antigen, past use and the inter-antibody connections. The concentration levels are calculated dynamically in this way so that they can be used to determine fitness to the current environment. In addition, those with levels below a threshold can be eliminated from the system and replaced with new ones, as in nature.
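To make the roles of the terms in (1) concrete, the following minimal sketch (not the implementation used in this paper) performs one discrete Euler step of the Farmer dynamics. The matrices U, V, W, the time step and all parameter values are illustrative assumptions only.

```python
import numpy as np

def farmer_step(C_ab, C_ag, U, V, W, b=1.0, k1=1.0, k2=0.05, dt=0.1):
    """One Euler step of Farmer's equation (1).

    C_ab: (N,) antibody concentrations    C_ag: (L,) antigen concentrations
    U: (N, L) antibody-antigen matching   V, W: (N, N) suppression / stimulation
    """
    stim_ag = (U * C_ag).sum(axis=1) * C_ab   # sum_j U_ij C(x_i) C(y_j)
    supp_ab = (V * C_ab).sum(axis=1) * C_ab   # sum_m V_im C(x_i) C(x_m)
    stim_ab = (W * C_ab).sum(axis=1) * C_ab   # sum_p W_ip C(x_i) C(x_p)
    dC = b * (stim_ag - k1 * supp_ab + stim_ab) - k2 * C_ab
    return np.maximum(C_ab + dt * dC, 0.0)    # keep concentrations non-negative
```

Antibodies whose concentration falls below a chosen threshold could then be removed and replaced, mirroring the meta-dynamics mentioned above.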
Some robotics researchers construct communication networks without using the Farmer equation. For example, Sathyanath et al. [23], [24] and Opp et al. [25] implement mine detection in the multi-robot domain by modeling the locations of the mines and robots as the antigen epitopes and antibody paratopes respectively. A broadcast network that communicates antigen location information between the antibodies is analogous to the Jerne network. Robots are stimulated to move towards a mine to aid in defusing it when they receive its location, and are suppressed and move randomly otherwise. Idiotopes are not modeled and play no role in determining suppression and stimulation levels. In addition, the number of robots remains constant, meaning that variable antibody concentrations cannot be implemented.

However, most integrations of idiotypic selection and behavior-based robotics use the Farmer model, since it approximates the biology very closely. For instance, Watanabe et al. [5], [6], [15] use the approach for a garbage-collecting robot with conflicting objectives. They represent epitopes, paratopes and idiotopes as binary strings that model the sensor readings, pre-condition and disallowed condition of the antibody respectively, and use a roulette-wheel manner of selection based on antibody concentrations after idiotypic interactions. The work presented in [15] is concerned with using reinforcement signals to derive appropriate idiotopes that are initially random. References [5] and [6] use a genetic algorithm with devised crossover to evolve the idiotopes, the network connections and the number and types of antibodies. Michelan and Von Zuben [12] solve the same problem, proposing a similar evolutionary mechanism for determining the network connections, but they do not establish the antecedent and consequent parts of the antibodies automatically. Vargas et al. [7] also use the garbage example but evolve the network structure with a genetic algorithm and update the attributes that define their antibodies using a Learning Classifier System (LCS) [8]. Antibodies are selected based on activation, given by the product of concentration and strength of match to antigens after idiotypic effects have been calculated. Reinforcement learning is used both on the selected antibody and on those connected to it in the network. Krautmacher and Dilger [13] apply the idiotypic technique to navigation in a simulated maze world using the same basic approach as Watanabe et al. [5], but their antigens are variable, being composed of an object type and an object position, and they do not implement meta-dynamics, i.e. antibodies are not replaced. Luh and Liu [9] use an idiotypic system to overcome local minima problems, modeling their antibodies as steering directions and their antigens as a fusion data set consisting of the orientation of the goal, the distances between obstacles and sensors, and the positions of the sensors. They implement stimulation and suppression by defining trigonometric relations between the steering angles.

In all the Farmer-based systems described ([5], [6], [7], [9], [12], [13] and [15]), antigens represent environmental situations, antibodies represent competence modules and the dynamics are governed using (1) or variations of it. However, the idiotypic controllers are not compared with baseline systems to provide an indication of the idiotypic contribution to performance, and no alternative selection procedures are tested. Furthermore, each paper assumes that the idiotypic system is readily adaptable to environmental change via highly flexible behavior selection, but the underlying mechanisms by which the dynamics facilitate selection of efficient and adaptable solutions are not explained in any depth.

III. MOTIVATION

A. Problems with the Reinforcement Approach

When robots explore terrain they are forced to make generalizations about environmental information and respond to those conclusions.
For example, the message object to right could apply to a multitude of different situations, for instance where another object is also fairly close to the left, or a situation where turning away too much could lead the robot away from its target position. For this reason, a non-adaptive controller that prescribes a fixed course of action for each generalization will almost certainly lead the robot into a trap, i.e. into a position where it cannot free itself or repeats its behavior indefinitely. Reinforcement learning, for example [16], [17] and [18], is more adaptive, as it allows robots to score their performance and adjust their behavior accordingly, but it suffers from three main problems. First, the behaviors adopted and the speed of learning are too intimately linked with the reinforcement algorithms, which often need to be carefully engineered in order to yield a good solution; this compromises the system's autonomy. Second, the technique tends to undergo premature convergence, preventing certain behaviors from being selected; a score increase is immediately awarded to the first successful behavior, and other potentially better actions are hence perceived as inferior and subsequently ignored. Finally, when localized scoring structures are used, it can often take a long time for a robot to change its strategy when it gets caught in repeated behavior patterns that score positively in the local sense but do not contribute to achieving the overall goal. The delay is often caused by having to wait for an action's score to reduce sufficiently so that another is selected. If the reinforcement learning is not crafted carefully, robots can end up in never-ending loops of repeated behavior.

B. The Idiotypic Advantage

When Farmer-based idiotypic systems are implemented, behavior selection is a three-stage process. The first stage is the nomination of the antibody with the highest strength of match to the presenting antigens (the antigenic antibody, or stage 1 winner). In biological systems this degree of match is a physical attribute of the antibody's paratope, but in robotics, where antibodies represent actions, it is never accurately known and needs to be estimated, for example using current reinforcement learning scores. During the second stage, idiotypic suppression and stimulation occur. The antibodies with idiotopes that are recognized by the stage 1 winner's paratope are suppressed, and those with paratopes that recognize the stage 1 winner's idiotope are stimulated. Earlier works have hinted that antibodies of the same type or species (i.e. valid "alternatives") should be chosen for stimulation and that different species should be suppressed. For example, Watanabe et al. [5] suggest that stimulation and suppression chains work as a self and non-self recognizer. In addition, Jerne [1] maintains that when an antibody paratope recognizes a foreign idiotope the suppressive forces dominate. This is not to say that antibodies identical to the stage 1 winner should be stimulated and others suppressed, because this would exacerbate premature convergence. The main function of idiotypic communication is to promote those antibodies that demonstrate a balance between similarity with and difference from the first winner. Simplistically, this can be viewed as stimulation of antibodies of the same basic type (or species) but possessing different parameters.
For example, reversing backwards in a straight line and reversing with a left spin of 30° are both of type "reverse" but have different spin components. Stimulation increases strength of match and suppression reduces it, so that the antibody with the highest strength of match after these effects have been calculated has a high chance of being selected to execute its action. The actual antibody chosen also depends on the third stage, which considers current antibody concentration levels as well as the strength of match. In some cases the final elected antibody is the stage 1 winner; in others a different antibody may be called.

Theoretically, one should see improvement in a robot's performance when idiotypic suppression and stimulation are introduced into reinforcement learning based behavior selection. This is because the idiotypic system is potentially able to overcome the three main problems listed above. Although the system is still highly dependent on the structure of the reinforcement learning (since antibody-antigen matrices are updated according to the reinforcement scores awarded), the action with the highest stage 1 fitness is not always selected, and the concentrations of all antibodies are adjusted according to the degrees of stimulation and suppression. This should instigate a degree of detachment from the engineered learning, providing a more autonomous approach. In addition, the method should significantly reduce premature convergence, since antibodies with lower stage 1 fitness should also get a chance to demonstrate their abilities and increase their fitness. This offers increased flexibility to derive more creative solutions to problems. In addition, robots should be able to break out of repeated sets of behavior much faster, since they do not have to wait for fitness to reduce before another behavior is selected. The idiotypic network should provide a more dynamic system that demonstrates a higher rate of antibody change, potentially enabling the robot to break the cycle. Even if the cycle is not broken straight away, the dynamics should ensure that a suitable behavior is eventually chosen.

C. Hypotheses on the Performance of Idiotypic Networks

As stated above, an idiotypic network should be able to overcome the problems associated with reinforcement learning. To this end, three hypotheses are proposed as follows:

H1 The idiotypic network system shows a degree of de-coupling from engineered reinforcement learning and hence provides a more autonomous approach.

H2 The idiotypic network system significantly reduces the problem of premature convergence.

H3 The idiotypic network system allows rapid escape from repeated behavior patterns or prevents them from happening entirely.

Note that H3 is linked to H2, since reduced premature convergence facilitates a less greedy strategy, which encourages more varied behaviors. The following sections describe the problem, models, programs and experimental procedures that are used to test the above hypotheses.

IV. TEST ENVIRONMENT AND PROBLEM

The agents used in this research are virtual Pioneer P3DX robots that are required to navigate around maze worlds developed with Stage 2.0.1, a 2D simulator for the Player 2.0.1 interface [19].
For example, Maze World represents a fictitious building in which the robot must travel through six rooms A–F, avoiding obstacles and entrapment (see Fig. 2). Small square cyan markers are used to indicate the doorways, and competence modules for detecting them with a camera and tracking them are provided. Once the robot has passed a marker or doorway, the path back is manually blocked using the movable blocking lines shown in Fig. 2; the blocking positions are also indicated using dashed lines. This procedure effectively simulates automatic closure of the doors once the robot has passed through.

The robot carries a SICK LMS 200 laser with minimum range set at 0.0 m and maximum range set at 8.0 m. The device takes 361 readings covering the front 180° and measures the distances between the robot's centre and any obstacles ahead. When processing the data this area is divided into eight equal sectors 0–7, each 22.5° wide, with sectors 3 and 4 at the front of the robot, sectors 0, 1 and 2 to the left and sectors 5, 6 and 7 to the right. The minimum reading and its sector, the sector with the maximum reading and the average reading are determined. A Canon VCC4 pan-tilt-zoom camera and blob-finding software are used to search for the door markers. The blob-finding software enables translation of the camera data into groups of like-colored pixels, or blobs, distinguishable by their RGB value. The camera remains fixed ahead at 0° at all times, with field of view set to 60°. The internal odometry determines the distance traveled, and eight rear sonar sensors measure the average distance behind the robot.

The robot is started in room A in the position shown, with its final target mid-way through room F, i.e. it is allowed to stop when the blob area from the final line is greater than 1,000 pixels. The robot's performance is assessed according to how fast it completes the journey and by the number of collisions with the obstacles or walls. Additionally, Mirror World, a mirror image of Maze World, is used to test the robot's performance after initial training has been carried out.

The mazes are deliberately designed to facilitate the drawing of more general conclusions, i.e. the problems are entirely solvable but provide a level of difficulty suitable for differentiating between weak and strong methodologies. For example, the doors are wide enough for the robot to pass through, but small enough that very refined movements are required if the robot is to pass without collision. Obstacles are strategically placed so that doorways are not blocked, but also so that freedom of movement is restricted in some places. The course is kept fixed throughout all experiments to provide a fair comparison between the different approaches. Although it may be argued that variation of the environment is limited, there are several rooms in the world and each of these may be considered a sub-environment. Furthermore, the worlds used have proved extremely non-deterministic.

The control software uses the libplayerc++ client library developed for use with the Player server (version 2.0.1), and it is run on GNU/Linux 2.6.9 (CentOS distribution) with a 3.6 GHz Pentium 4 processor. All simulations are run in real time.
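As an illustration of the laser processing described in this section, the sketch below condenses a single scan into the quantities used by the controller. The function name, the use of NumPy and the assumption that the 361 readings are ordered so that the first values fall in sector 0 are illustrative only.

```python
import numpy as np

def summarise_scan(readings):
    """Reduce 361 SICK LMS 200 readings (front 180 degrees) to the minimum
    reading and its sector, the sector holding the maximum reading, and the
    average reading, using eight ~22.5 degree sectors numbered 0-7."""
    scan = np.asarray(readings, dtype=float)      # 361 range values, 0.0 - 8.0 m
    sectors = np.array_split(scan, 8)             # sectors 3 and 4 face forward
    min_sector = int(np.argmin([s.min() for s in sectors]))
    max_sector = int(np.argmax([s.max() for s in sectors]))
    return scan.min(), min_sector, max_sector, scan.mean()
```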
V. SYSTEM ARCHITECTURE

A. Equations Used to Model the Network

The functions V and W in (1) model both background antibody communication and active antibody communication, i.e. they compare each antibody with all of the others so that the levels of stimulation and suppression can be determined. Background communication is not simulated here, as active communication represents a stronger force and, since in this system all environmental situations are modeled as antigens (even the case where average sensor readings are high), active communication is of most interest. In addition, the removal of background communication produces a simpler system, as each antibody need only be compared with the antigenic antibody, denoted here as x_w1. The communication is mimicked by comparing the paratope of x_w1 with the idiotopes of the other competing antibodies and vice versa. This involves constructing a paratope matrix P that shows the strength of match between antibodies and antigens, and an idiotope matrix I that shows disallowed matches, so that desired combinations can be recognized against unwanted ones. The process of computing the internal network effects thus consists of summing P and I strength-of-match values between antigens and antibodies.

A variation of Farmer's equation, (2), that sums the inter-antibody suppressive and stimulatory effects over the number of antigens L rather than the number of antibodies is hence used. The idiotypic matching functions are termed V' and W' here to distinguish them from functions V and W in (1). Another important difference is that equation (2) uses the concentration of the antigenic antibody, C(x_w1), rather than that of every antibody in the system, C(x_m) and C(x_p) in (1). In addition, the antigen concentration term C(y_j) and the associated C(x_i) term in (1) have been removed from the first part of (2), since antigen concentrations are not modeled and their relative importance is already represented by weighting them according to a priority ranking. Other than this, the terms in (2) are identical to those used in (1), which are fully explained in Section II, Part C.

$$\dot{C}(x_i) = b\left[\sum_{j=1}^{L} U_{ij} \;-\; k_1\sum_{m=1}^{L} V'_{im}\,C(x_i)\,C(x_{w1}) \;+\; \sum_{p=1}^{L} W'_{ip}\,C(x_i)\,C(x_{w1})\right] - k_2\,C(x_i) \qquad (2)$$

Equation (2) must be evaluated in separate parts, since the antigenic antibody x_w1 is unknown until the first sum in the square brackets is used to determine it. It is therefore split into five separate equations, (3)–(7). Equation (3) represents the first sum in the square brackets, i.e. T_1(x_i) is the strength of match of antibody x_i to the set of presenting antigens and U is the antigen matching function. Once (3) is evaluated for all antibodies, the antibody with the highest T_1(x_i) value is selected as the antigenic antibody x_w1; equation (3) is therefore always processed first. Equation (4) represents the second sum in the square brackets, i.e. T_2(x_i) calculates the suppression of antibody x_i by using V' as a suppressive matching function and modeling the probability of collisions between antibodies x_i and x_w1. Similarly, T_3(x_i) in (5) sums the stimulation of x_i, using W' as a stimulatory matching function. Functions U, V' and W' are expressed in terms of P and I and are explained further in Part D.
T_g(x_i) in (6) represents the global strength of match, a strength metric that encompasses all molecular activity, i.e. T_g(x_i) equates to all of the terms in the square brackets in (2). The parameter k_1 in (6) is the same as in (1) and (2). Equation (7) is equation (2) re-written in terms of T_g(x_i), and it expresses the rate of change of concentration of antibody x_i with time. As the concentrations must be computed discretely, a difference-equation form of (7), given in (8), is used.

$$T_1(x_i) = \sum_{j=1}^{L} U_{ij} \qquad (3)$$

$$T_2(x_i) = \sum_{m=1}^{L} V'_{im}\,C(x_i)\,C(x_{w1}) \qquad (4)$$

$$T_3(x_i) = \sum_{p=1}^{L} W'_{ip}\,C(x_i)\,C(x_{w1}) \qquad (5)$$

$$T_g(x_i) = T_1(x_i) - k_1\,T_2(x_i) + T_3(x_i) \qquad (6)$$

$$\dot{C}(x_i) = b\,T_g(x_i) - k_2\,C(x_i) \qquad (7)$$

$$C(x_i)_{t+1} = C(x_i)_t + b\,T_g(x_i)_t - k_2\,C(x_i)_t \qquad (8)$$

In hybrid AIS systems antibody fitness F is often measured using a combination of metrics that represent the individual components of the scheme. For example, [7] uses activation A(x_i), defined as the product of the global strength of match T_g(x_i) and the concentration, see (9). This method of assessing fitness is adopted here, as it incorporates both the AIS and reinforcement learning aspects of the hybrid system.

$$A(x_i) = C(x_i)_{t+1}\,T_g(x_i) \qquad (9)$$
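Assuming T_1, T_2 and T_3 have already been evaluated, equations (6), (8) and (9) reduce to a few lines of code. The sketch below simply restates them; it is illustrative and not taken from the authors' software.

```python
def global_strength(T1, T2, T3, k1):
    """Equation (6): antigen matching minus weighted suppression plus stimulation."""
    return T1 - k1 * T2 + T3

def next_concentration(C_t, Tg, b, k2):
    """Equation (8): discrete update of an antibody's concentration."""
    return C_t + b * Tg - k2 * C_t

def activation(C_next, Tg):
    """Equation (9): the fitness measure used by the full hybrid system."""
    return C_next * Tg
```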
B. Choice of Antibodies and Antigens

As in many previous robot AIS systems ([5], [6], [7], [9], [12], [13] and [15]), environmental situations are modeled as antigens and competence modules are modeled as antibodies. For simplicity, fixed numbers are used (8 antigens and 16 antibodies) and antibody replacement is not implemented. The set of antigens (listed in Table I) is given a priority structure based on the principle that the needs of some situations outweigh those of others. For example, if the robot is stalled against a wall it must take action to free itself before it can deal with less urgent problems such as an object to the left. Since antigens 0–6 have a wide application to most robot navigation problems and antigen 7 would be useful for any problem involving tracking an object, this priority ranking is reasonably unspecific and means that more general conclusions may be drawn from the experiments. The condition parameters are selected from the results of conducting pre-trials that enabled the door-tracking task to be carried out efficiently using system S1.

The antibody repertoire, i.e. the set of possible behaviors listed in Table II, is selected on the basis of providing the robot with the ability to move in a number of different directions and at a number of different speeds, covering both the front and the rear. In addition, more intelligent actions such as wandering toward the maximum laser reading and tracking the door markers are provided. Apart from tracking the markers, all the actions may be considered generic to navigation problems that require a robot to avoid obstacles and traps. The maximum speed permitted, M, is 2.0 m/s.

C. The Paratope and Idiotope Matrices

Five different paratope matrices P1–P5 (antibody-antigen strength-of-match matrices) are used in the experiments. To help minimize any initial bias, these are prepared beforehand by generating random element values P[x_i, y_j] between 0.50 and 0.75, i.e. not too high and not too low. These values are then adjusted by adding a small number δ(x_i) (positive or negative) to each antibody's elements so that the mean across each row of P is 0.625. Variable δ(x_i) is given by (10), where the original mean for row i is denoted by µ_i and the desired mean by µ_d. Again, L is the number of antigens. The derivation is given below.

$$\mu_d = \frac{1}{L}\sum_{j=1}^{L}\left(P[x_i, y_j] + \delta(x_i)\right) = \frac{1}{L}\left(\sum_{j=1}^{L} P[x_i, y_j] + L\,\delta(x_i)\right) = \mu_i + \delta(x_i)$$

$$\delta(x_i) = \mu_d - \mu_i \qquad (10)$$

When the program executes, the elements of P are updated approximately once each second using reinforcement learning. However, it is worth noting that values are not allowed to fall below 0.00 or rise above 1.00. Only one fixed idiotope matrix I is used, i.e. it is not permitted to change, either within the duration of the robot's run or throughout the course of the experiments. This is deliberate, in order to make investigation and explanation of the idiotypic mechanisms easier. The matrix is hand-coded according to perceived disallowed antibody-antigen combinations, i.e. pairs that would produce non-sensible or unwanted actions are given positive element values, see Table IV. Numbers in the range 0.00 to 1.00 are possible, but the sum of elements for each antibody (across all antigens) is set to 1.00. This is to reduce the likelihood of any antibody becoming over-stimulated or suppressed. The positive values indicate the level of confidence that the combination is poor in some way. Null values do not necessarily indicate a good combination; they merely show complete uncertainty.

D. The Algorithm and Matching Functions

The random, variable paratope matrix P and the fixed idiotope matrix I are both imported from files, and the robot sensors are read in a continuous loop. The system checks the sensor data for the presence of antigens approximately once per second, i.e. every ten iterations of the continuous loop. Multiple antigens are allowed to present themselves simultaneously, but one is deemed dominant according to the priorities given in Table I. An antigen array G(x_i) is formed, which has value 0 for non-presenting antigens, value 2 for a dominant antigen y_d with P[x_i, y_d] > 0, value 0 for a dominant antigen with P[x_i, y_d] = 0, and value ¼ for all other presenting antigens, so that the dominant one receives greater weighting for all antibodies with positive P[x_i, y_d]. For example, if antigens 2, 4 and 7 present themselves, then G(x_i) = [0, 0, ¼, 0, 2, 0, 0, ¼], provided P[x_i, y_d] > 0.

An antibody is selected to have its action executed in response to the presenting antigens, and this is effectively a three-stage process. The first stage is selection of the antigenic antibody x_w1 by computing T_1(x_i), i.e. summing the strength of match to the antigen set using (3), where the matching function U is defined by (11) below. Here P is the paratope matrix and G is the antigen array.

$$U_{ij} = P[x_i, y_j]\,G_j(x_i) \qquad (11)$$

This definition of U uses the weighted current reinforcement scores for antigen matching and ensures that all antibodies with zero match to the dominant antigen are discounted, i.e. are assigned T_1 = 0. This is important since reinforcement learning operates on the degree of match to the dominant antigen; negative scoring would therefore have no effect on antibodies with zero match, since match values are not allowed to fall below zero.
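To make the set-up and the first selection stage concrete, the sketch below generates an initial paratope matrix as described in Part C (equation (10)) and then applies (11) and (3) to pick the antigenic antibody. It is a hedged illustration only: the function names, the use of NumPy and the way presenting antigens are passed in are assumptions rather than the authors' code.

```python
import numpy as np

def initial_paratope(n_antibodies=16, n_antigens=8, mu_d=0.625, seed=None):
    """Part C: random strengths in [0.50, 0.75], each row shifted by
    delta(x_i) = mu_d - mu_i so that its mean equals mu_d (equation (10))."""
    rng = np.random.default_rng(seed)
    P = rng.uniform(0.50, 0.75, size=(n_antibodies, n_antigens))
    return P + (mu_d - P.mean(axis=1, keepdims=True))

def stage1_winner(P, presenting, dominant):
    """Equations (11) and (3): weight the presenting antigens, discount
    antibodies with zero match to the dominant antigen, return T1 and x_w1."""
    G = np.zeros(P.shape[1])
    G[presenting] = 0.25            # minor presenting antigens
    G[dominant] = 2.0               # dominant antigen weighted highest
    T1 = np.where(P[:, dominant] > 0, P @ G, 0.0)
    return T1, int(np.argmax(T1))
```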
Once x_w1 is determined, it presents itself to the idiotypic network as the antigenic antibody.

The second stage is summing the stimulation and suppression that the antigenic antibody causes. In other words, each antibody's global strength value T_g(x_i) is calculated by computing the T_2(x_i) and T_3(x_i) values from (4) and (5), which represent the effects of suppression and stimulation respectively. In this algorithm, all other antibodies with T_1 > 0 must compete with x_w1 for selection via the idiotypic process. Note that x_w1 does not compete with itself, i.e. it is not permitted to stimulate or suppress itself; its strength remains unchanged throughout the entire second stage, i.e. T_g(x_w1) = T_1(x_w1) and T_2(x_w1) = T_3(x_w1) = 0. In addition, non-competing antibodies must have T_g = T_1 = T_2 = T_3 = 0. For this purpose, an antibody array H is formed that has value 1 for competing antibodies with T_1 > 0, but value 0 for antibodies with T_1 = 0 and for antibody x_w1. The function V' in (4) is given by (12) below, where I is the fixed idiotope matrix.

$$V'_{im} = P[x_{w1}, y_m]\,I[x_i, y_m]\,H_i \qquad (12)$$

Under this definition of V', equation (4) simulates suppression by comparing the stage 1 winner's paratope with the competing antibody's idiotope. Since the paratope constitutes approved antigen matches and the idiotope shows disallowance, the product of these elements provides a good indication of the level of suppression that should be induced in the competitor.

$$W'_{ip} = \left(1 - P[x_i, y_p]\right) I[x_{w1}, y_p]\,H_i \qquad (13)$$

The function W' in (5) is given by (13). This definition allows equation (5) to mimic stimulation, this time examining the stage 1 winner's idiotope and the competing antibody's paratope. A low paratope element coupled with a positive idiotope value indicates a possible similar species and that the antibody should be stimulated. Here, the paratope element is subtracted from 1 in order that the elemental product yields high values for a high level of stimulation.

Equations (4) and (5) show that the elemental products in (12) and (13) are summed over all the antigens and multiplied by the concentration terms. In this way an individual antibody may undergo multiple idiotypic suppressions and stimulations. The net result of these and the original antigen matching determines T_g(x_i) via (6).

$$C(x_i)_{t+1} = \frac{\left|C(x_i)_{t+1}\right|}{\sum_{j=1}^{N}\left|C(x_j)_{t+1}\right|} \qquad (14)$$

The third stage involves the use of (8) to calculate each new antibody concentration and (9) to calculate activation. Here, the term concentration is used to mean the proportional number of clones of an antibody type in circulation if the total number is held at N, where N is the number of antibodies. Therefore all 16 antibodies begin with equal concentrations of 1. Once the new concentration values are derived from (8) they are normalized using (14) and multiplied by N to keep the total number of clones at 16. This process prevents scaling problems that arise when the total number becomes too high, and is in keeping with many other investigations involving AIS networks, for example [6] and [21]. In addition, studies with mice have suggested that an almost constant number of b-cells are active, so it is likely that there is a mechanism in nature that controls this [20].
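Stage 2 and the normalization in (14) can be summarized in the same style. Concentrations are held at 1 here (as in system S2), so the collision terms drop out; this is an illustrative sketch rather than the authors' implementation.

```python
import numpy as np

def stage2_global_strength(P, I, T1, w1, k1):
    """Idiotypic suppression (12) and stimulation (13) by the stage 1 winner,
    combined into the global strength of match via (6)."""
    H = (T1 > 0).astype(float)
    H[w1] = 0.0                                  # the winner does not compete with itself
    T2 = (I * P[w1]).sum(axis=1) * H             # (4): winner's paratope vs competitor idiotopes
    T3 = ((1.0 - P) * I[w1]).sum(axis=1) * H     # (5): winner's idiotope vs competitor paratopes
    return T1 - k1 * T2 + T3                     # (6); non-competitors keep Tg = 0, winner keeps T1

def renormalise(C, n_clones=16):
    """Equation (14): rescale so the total number of clones stays constant."""
    C = np.abs(np.asarray(C, dtype=float))
    return n_clones * C / C.sum()
```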
In order to test the hypotheses H1–H3, an incremental approach is adopted, i.e. three experimental systems S1–S3, each with increasing levels of idiotypic complexity, are used to solve the navigation problem, and the performance of each is compared in terms of the speed and agility of the robot. S1's overall winning antibody is always x_w1, but S2's is the antibody with the highest global strength of match T_g(x_i), and S3's is the one with the highest activation A(x_i). S1 and S2 are therefore simply sub-programs of S3 that implement only stages 1 and 2 respectively. In addition, S1 and S2's antibody concentrations are held constant at 1 throughout, i.e. the terms C(x_i) and C(x_w1) in (4) and (5) are effectively ignored, to simplify the dynamics and provide an indication of the effect of introducing a network that is independent of concentration. However, in S3 the concentration levels are allowed to vary, i.e. the system implements the concentration terms in (4), (5), (8) and (9) as variables, which represents a full AIS system. Table III summarizes the three systems and describes how fitness is measured for each.

Since fitness in S1 does not consider idiotypic effects and ignores concentrations of antibodies, it is purely an RL system that uses P as a belief table for executing actions; it is in no sense an AIS. S2 is not a true AIS, as it does not base selection on a function of antibody concentration and molecular collisions are not modeled within the network. In addition, the system has no global feedback from the network, as the strength T_g(x_i) is used only to select the fittest antibody, and only the fittest undergoes a paratope adjustment from reinforcement learning; the T_g(x_i) values for the other antibodies are not used to adjust the paratope in any way. System S3 represents a true AIS because feedback from the network is global, through alteration of all antibody concentrations using (8), and there is also feedback from concentrations to the network, since collisions between molecules are modeled in (4) and (5).

To illustrate use of equations (3)–(6), Table VI shows the results of calculating T_1(x_i), T_2(x_i), T_3(x_i) and T_g(x_i) using the idiotope values from Table IV and the example paratope values given in Table V. In these calculations, all antibody concentrations are held constant at 1 for simplicity, as in the case of S2, and k_1 is set at 0.625. In the example, the antigens presented to the system are 1, 3 and 5, hence 5 is dominant. The table shows that the stage 1 winner is antibody 14, but that the idiotypic processes nominate antibody 10, i.e. an alternative reverse antibody, as the overall winner.

Once the fittest antibody has been chosen, it executes its action and its effect is assessed using reinforcement learning. The appropriate element of the paratope matrix P is either increased or decreased, depending on whether a reward or penalty is issued by the reinforcement algorithm.
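The difference between the three experimental systems therefore comes down to which quantity is maximized at the final selection step. A compact, illustrative sketch is given below.

```python
import numpy as np

def final_winner(system, T1, Tg, C_next):
    """S1: stage 1 winner only.  S2: highest global strength of match.
       S3: highest activation A = C * Tg, as in equation (9)."""
    if system == "S1":
        return int(np.argmax(T1))
    if system == "S2":
        return int(np.argmax(Tg))
    return int(np.argmax(C_next * Tg))   # S3, the full AIS
```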
E. Reinforcement Learning

Reinforcement learning occurs when knowledge is implicitly coded in a scalar reward or penalty function. There is no teacher and no instruction about the correct action, just a score that is yielded by the robot's interaction with its environment. Here, the technique is employed for dynamic estimation of the degree of match between antibodies (actions) and antigens (environmental situations). This paper seeks to compare a basic RL system (S1) with hybrids that utilize an idiotypic network (S2 and S3) and thus establish whether the idiotypic treatment enhances robot performance. It is therefore essential to construct a good reinforcement scheme, to test whether the network can add something to a design that already performs reasonably well. For this reason, a highly engineered scheme is used. This is based on a "brute force" approach that coerces the robot into behaving in a desirable way, e.g. by penalizing it for going backwards under certain conditions. The reward and penalty increments coded are ratio-oriented, e.g. the robot is rewarded more when it travels fast than when it travels slowly if the sensors show no danger. As well as testing a good RL algorithm, a weaker strategy is also employed as a direct test of H1: if idiotypic robots can produce good results despite using poor learning, then they will have demonstrated a degree of detachment from the structured reinforcement signals.

In both strategies the reinforcement value r_f is set to 0 at the start of the learning exercise, which is carried out once every ten loops but five loops out of synchronization with the completion of the actions. In other words, approximately half a second after acting, the selected antibody's performance is scored either negatively or positively by re-reading the sensor values and using one of the scoring algorithms described below. The algorithm used is largely dependent upon the dominant antigen. However, this does not render the scheme too problem-specific, because the antigens represent environmental situations that are reasonably universal to navigation and tracking problems. A brief description of the stronger reinforcement design is given below. Note that the absolute values of the scoring parameters are not presented, as they are somewhat arbitrary; the network system S3 should be able to work alongside any basic RL scheme to enhance performance.

For dominant antigens 0, 1 and 2 (obstacles present), the learning algorithm provides linear rewards for an increase in the distance between the robot and the obstacle and linear penalties for a decrease. However, if the sector producing the minimum reading changes, it is not immediately obvious whether the robot is encountering the same object, so the scores are reduced by a factor of 4. Reverse antibodies are awarded an additional penalty, since reversing away from obstacles does not contribute to the overall goal of moving forwards towards the target.

For dominant antigen 4 (low average laser reading), the algorithm scores in a linear fashion, rewarding an increase in the average laser reading and penalizing a decrease. As one of the objectives of robot navigation is to utilize space so that collisions are avoided, it also makes sense to reward any antibody that is able to move the robot forward from enclosed to more open areas. The change in average reading is hence also used as a global assessment metric for all cases, regardless of the dominant antigen.
In addition, reverse antibodies that have reduced the average reading are penalized further to discourage their general use; reversing is contrary to the overall goal and should only be necessary to escape from stall situations. When antigens 5 or 6 (stalled or blocked behind) are dominant, assessment is based on the distance traveled in the half second between reading the sensors and scoring. This scheme provides linear rewards for movement and a fixed penalty for failing to move. The algorithm for dominant antigens 3 and 7 (average reading above threshold and door marker seen) rewards faster antibodies, as the robot can afford to travel quickly when no obstacles are present and it is not trapped. Slower antibodies receive a small penalty. Additionally, antibodies that keep the door marker in sight are rewarded further, and the score is even greater for those moving directly toward it. In contrast, antibodies close to a door marker that then lose sight of it are penalized. Again, r_f is reduced for negative speeds.

The final reinforcement learning score r_f (either positive or negative) is added to the paratope matrix element P[x_w, y_d], i.e. that representing the affinity between the dominant antigen y_d and the overall winning antibody x_w. However, if P[x_w, y_d] becomes negative as a result of this, it is set to 0. The algorithm is summarized by (15) below.

$$P[x_w, y_d]_{t+1} = \mathrm{MAX}\left(0,\; P[x_w, y_d]_t + (r_f)_{t+0.5}\right) \qquad (15)$$

Any antibodies that are penalized also have their concentration increase removed, i.e. their concentration is set back to the figure from the previous iteration.

The weaker learning strategy is the same as that described above, except that it over-penalizes the obstacle-avoidance antibodies 1, 2, 4, 5, 6, 7, 8, 9 and 12 by applying the door-tracking part of the algorithm for dominant antigens 3 and 7 to all antigens. This is too tough a test, and as a result robots do not tend to develop very good obstacle-avoidance strategies. The program architecture also differs slightly, as all dominant antigens, even those with zero P[x_i, y_d], are given a value of 2 in the array G(x_i) in (11). The system is thus less robust, since antibodies with zero match to the dominant antigen may be selected, meaning that negative reinforcement scoring has no effect. This encourages repeated behavior.
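A minimal sketch of the paratope update in (15), including the zero floor, the 1.00 ceiling mentioned in Part C, and the concentration roll-back applied to a penalized antibody, is given below; the variable names and the assumption that only the overall winner is updated in this way are illustrative.

```python
def reinforce(P, C, C_prev, winner, dominant, r_f):
    """Equation (15): add the score r_f to P[x_w, y_d], clipped to [0, 1].
       A penalized antibody also has its concentration increase undone."""
    P[winner][dominant] = min(1.0, max(0.0, P[winner][dominant] + r_f))
    if r_f < 0.0:
        C[winner] = C_prev[winner]   # roll back to the previous iteration's level
```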
VI. EXPERIMENTAL PROCEDURES

A. Measuring Robot Performance and System Properties

The program recognizes a collision when the dominant antigen is either blocked behind or stalled. However, on its own the number of collisions or stalls n_s does not represent a good measure of task fitness, since it does not allow for robots that are highly cautious but too slow. Robots should be able to complete the course as rapidly and with as few stalls as possible. On the other hand, the time to complete the course t does not provide an indication of the robot's safety attributes; a fast robot is no good if it damages itself or the environment. For this reason a score metric S that combines the run time with the number of stalls is used to determine task fitness. S is defined by (16), where φ is the ratio of mean t to mean n_s over all experiments (17). The score thus gives equal weighting to n_s and t, and provides a linear combination of the two metrics that has the same mean as t over all runs, see the proof below. Throughout these experiments φ is set at 9.08, the figure computed in a series of pre-trials using all three systems, S1–S3.

$$S = \frac{\varphi\,n_s + t}{2} \qquad (16)$$

$$\varphi = \frac{\bar{t}}{\bar{n}_s} \qquad (17)$$

Proof (with N here denoting the number of runs):

$$\bar{S} = \frac{\varphi\,\Sigma n_s + \Sigma t}{2N} = \frac{\varphi\,\bar{n}_s + \bar{t}}{2} = \frac{\bar{t} + \bar{t}}{2} = \bar{t}$$

In addition to S, the reinforcement success rate, defined as the percentage of positive scores awarded, and the rate of antibody change (the percentage of selected antibodies that differ from the previous iteration) are measured for each system. For S2 and S3, the rate of idiotypic difference (the percentage of iterations where the stage 1 winner and final winner differ) and the reinforcement success rate when an idiotypic difference occurs are calculated, both over the entire run and during stall conditions only. These metrics may help in explaining any perceived differences between the systems. A set of data that shows the iteration number and antibody used during a collision is also preserved.
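For reference, the task fitness score of (16) is trivial to compute; the small example below (with invented run values) shows how run time and stall count combine.

```python
def task_score(t_seconds, n_stalls, phi=9.08):
    """Equation (16): equal-weight combination of run time and stall count."""
    return (phi * n_stalls + t_seconds) / 2.0

# e.g. a 300 s run with 10 stalls scores (9.08 * 10 + 300) / 2 = 195.4
```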
B. Selection of Parameters – Initial Investigation

Before exhaustive comparisons of the three systems are carried out, several preliminary investigations are undertaken to establish suitable parameters for b and k_1 in equations (6) and (8). However, since all antibodies are to be retained in the repertoire without replacement, it is not necessary to test so stringently for an acceptable value for k_2. This parameter serves mainly to determine how rapidly antibodies die out in system S3, so as long as it is kept low in comparison to b, no antibodies are removed from the system. Testing has shown that a value of 0.05 is effective for this purpose when the system is implemented with 8 ≤ b ≤ 800 and with 0.00 ≤ k_1 ≤ 1.00. If a meta-dynamic system were to be used, k_2 would become much more important, see [21]. As antibody replacement is not the focus of this work, the value of 0.05 is retained throughout all further experiments.

In order to establish satisfactory values for b and k_1, the Maze World environment is used as the test bed, with the robot using the initial paratope matrix P1. As a precursor to a more thorough treatment, b is first set at 8 for system S3 and a set of results for k_1 values between 0.0 and 1.0 is determined by finding the mean score S for six runs. Next, b values of 80 and 800 are trialed in the same way, and a set of mean scores for system S2 and a mean score for system S1 are found. For S1, 30 runs are completed and a 95% confidence interval is computed using a standard one-tailed t-test. The mean rate of idiotypic difference is also measured for systems S2 and S3 each time.

Fig. 3 shows a plot of mean score against k_1 for S2 and for S3 using the different b values, and Fig. 4 shows the mean idiotypic difference rate. Although system S1 does not use k_1, its performance is included in Fig. 3 as a straight line for comparison purposes. The chart in Fig. 4 shows that k_1 has a great effect on the degree of idiotypic influence for both S3 and S2, and strongly suggests that this is independent of the value used for b in S3. It also shows that k_1 values in the range 0.0 to 1.0 produce difference rates spanning almost 0% – 90%, although after k_1 = 0.8 this tails off to about 4%. In fact, the relationship between k_1 and idiotypic difference rate appears almost linear, which fits in with the theory, since from (6) a low k_1 should reduce the suppression of antibodies, producing a high number of iterations where the antibody selected first differs from the final winner. In contrast, a higher value should increase suppression and reduce the rate of difference. It is also notable that for k_1 values lower than about 0.8, S2's rates are slightly lower than those of S3.

The chart in Fig. 3 provides clear evidence that the performance of the robot is dependent upon selecting appropriate values for both k_1 and b. The 95% confidence interval for the mean score of S1 is between 265 and 402, which means that out of the four other systems trialed, only S3 with b set at 8 or 80 is likely to be able to produce results with significantly better scores. System S2 and system S3 with b = 800 both stay within these confidence limits for all values of k_1 tested. In addition, for b = 80 and b = 8 the region of the x-axis between 0.45 and 0.65 shows a trend towards a dip in mean score, which remains well below the lower limit for S1.

The evidence from Fig. 3 and Fig. 4 suggests that there is an optimal level for the idiotypic difference rate and hence for k_1; a low level of suppression (low k_1) produces a system that tries alternative behaviors too often, as it is not strict enough in its selection of them, but a high suppression level (high k_1) creates a system that does not try them often enough, as it is too rigid. Hence a system with a higher k_1 value has a lower idiotypic difference rate, and as the k_1 value becomes higher the system becomes less distinct from system S1, which has 0% difference since it does not use idiotypic selection. This theory is supported by Fig. 3, as the lower k_1 values remain within the bounds for the mean score of S1 for all the systems tested. At the other extreme, as k_1 approaches zero the robot will tend not to accept the winner from stage 1 and will hence miss out on striking a desirable balance between accepting and rejecting it; consequently, its performance deteriorates, as evidenced in the graph.

Fig. 4 shows a remarkably similar pattern of idiotypic difference for widely different b values, suggesting that b does not have much influence on idiotypic stimulation and suppression levels. However, it is clear from Fig. 3 that a b value of 800 is not likely to out-perform S1 for any given k_1 value, so how can this be explained? From (8) it is apparent that b plays an important role in determining the weighting that is given to the global strength of match T_g when calculating the new concentration values. It therefore provides an indication of the relative importance of T_g against the historic concentration value C(x_i)_t. Use of lower values favors antibodies that have been successful in the past, whereas a high value tends to produce a system that chooses those that best match current environmental information. In addition, a higher value gives rise to a faster rate of change of concentration, meaning that levels can build up and reduce rapidly between iterations, providing a more useful indication of antibody fitness.
It is likely that there is a range of values that strikes a good balance between using historical data and current environmental information, and it is probable that 800 is too high for this range when implementing this reinforcement structure with this particular environment and idiotope. There is an extremely high level of correlation between the score data, stall data and task completion time data, i.e. the patterns in the graphs are almost identical. For this reason it is sufficient to proceed by examining score data only.

C. Selection of Parameters – More Detailed Investigation

In order to gain a better understanding of the roles of b and k1, a wider range of b values (8, 15, 20, 40, 60, 80, 100, 120, 160, 200 and 800) is examined. This range is used for k1 between 0.45 and 0.65 in increments of 0.05, adopting the same experimental procedure, i.e. six runs in Maze World starting with the random paratope matrix P1 only. This region of k1 is selected because of its superior performance in the first set of trials, i.e. it is assumed that k1 values outside this range are unlikely to yield mean scores significantly different from those of S1. Fig. 5 shows mean scores against b for k1 values in the selected range.

It is readily apparent from the graph that there is a region of b where the best performance occurs with k1 set at 0.60, and where that best performance also represents a "good" performance with a mean score of less than 200. This range is approximately between 40 and 160. When k1 = 0.60 is used with higher values of b, performance drops considerably, with no best scores lower than 200. At the other end of the scale, lower b values still perform well, but the optimum k1 appears to shift towards a lower value. Fig. 6 is a plot of mean score against k1 for the region 40 ≤ b ≤ 160 only. Here, maximum performance occurs at k1 = 0.60 for each system, and the mean score values are well below the lower confidence limit for S1. The same is not true for any of the other values tested, although it is possible that values very close to 0.60 may also produce this phenomenon. In addition, one-tailed t-tests have shown that the mean score (173) at k1 = 0.60 in this range of b is significantly lower (i.e. better) than the mean scores for all the other values of k1 (295, 254, 248, 234) at the 99% level. It is reasonable to assume that the optimum value for k1 is very close to 0.60 for this scheme, environment and idiotope, as long as b remains in the approximate range 40 to 160.

An interpretation of the graphs in Figs. 5 and 6 is that the approximate region 40 ≤ b ≤ 160 represents a more stable form of equation (8), which shows optimum performance close to k1 = 0.60, i.e. when the idiotypic difference rate is at about 20%. The stability is attributed to the selection of b, which permits a good balance between the use of historical antibody information and reaction to the environment. For higher b values, however, the equation becomes too dependent on environmental information, i.e. it becomes more like S2, since concentration plays a less important role. It is not surprising that the performances of S2 and S3 with b = 800 are similar in Fig. 3. Conversely, for b less than 40 the system tends to rely more on historical information, placing less emphasis on which antibody has the higher T_g value and more on concentration.
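The procedure described above is essentially a grid search over (b, k1) with the mean score of (16) as the objective. A minimal sketch follows; run_trial stands in for the full Player/Stage simulation and is an assumed, hypothetical function rather than part of the original software.

```python
import itertools
import statistics

B_VALUES = [8, 15, 20, 40, 60, 80, 100, 120, 160, 200, 800]
K1_VALUES = [0.45, 0.50, 0.55, 0.60, 0.65]
RUNS_PER_SETTING = 6  # six Maze World runs per setting, from paratope matrix P1


def parameter_sweep(run_trial):
    """Return {(b, k1): mean score} over the detailed-investigation grid.

    run_trial(b, k1) is assumed to execute one simulated run and return its
    score S as defined in (16); its implementation is not shown here.
    """
    results = {}
    for b, k1 in itertools.product(B_VALUES, K1_VALUES):
        scores = [run_trial(b=b, k1=k1) for _ in range(RUNS_PER_SETTING)]
        results[(b, k1)] = statistics.mean(scores)
    return results
```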
These preliminary experiments provide a good gauge for parameter setting during more extensive testing of S3 and have also shown that b is far more robust than k1, since good performances can be achieved for widely differing b values, whereas k1 is much more sensitive to change. In addition, pre-trials with the weak learning strategy have revealed that the rate of idiotypic difference is intrinsically higher than with the strong learning strategy, because the robot is penalized by reinforcement more often. Consequently, the "optimum" k1 value increases for the stable region of b when learning is weaker, because more suppression is needed to yield an idiotypic difference rate equivalent to that of the stronger learning strategy.

D. More Rigorous Comparison of S1, S2 and S3

Here, the aim is to show that the AIS-RL hybrid S3 can perform significantly better than the RL system S1 or indeed the simpler hybrid S2. When investigating parameters, six runs are able to provide a strong indication of robot performance, but they are not sufficient for accurately testing significant differences between the systems. For this reason, each system S1 – S3 is tested 30 times in Maze World, six times with each of the five initially random paratope matrices P1 – P5. The 90 runs are then repeated in Mirror World, starting not with a random paratope matrix but with the appropriate matrix saved from the first run. This tests how well the robot learns from its first experience and distinguishes a "lucky" first run from a genuinely resourceful one, where a good set of behaviors develops. The parameter k1 is set at 0.625 for both S2 and S3, as this represents a value close to the empirically measured "optimum". In addition, the parameter b is set at 80 for S3, as this value lies well within the stable region. Results measure the mean, maximum and minimum number of stalls, task completion time and score, and also the mean antibody change rate, idiotypic difference rate and reinforcement success rate. Runs with scores below 200 are labeled as good since they show above-average attainment. Conversely, runs with scores greater than 400 are declared bad since this represents a performance ranked in the bottom 10%.

However, the above measurements provide information on all runs and do not indicate what is happening to system dynamics when robots get into difficulty, i.e. when they are spending a lot of time trying to escape entrapment. For this reason, a number of runs with an n_s value higher than average (i.e. a lower performance) are sampled from the S1 and S3 data, so that antibody information is available for approximately 80 long stall sequences, i.e. sequences that last more than one iteration.

It is also important to establish that S3 is able to out-perform S1 for other values of k1 and b within the established "optimum" regions. For this reason an additional two sets of 30 runs are conducted with S3 in Maze World using untrained robots. These two tests use k1 = 0.600, b = 60 and k1 = 0.585, b = 100. S2 is not tested further with different parameters, since it does not use b and its preliminary results already covered the region 0.0 ≤ k1 ≤ 1.0. In addition, S1 and S3 are the systems of most interest. Finally, S1 and S3 are tested with untrained robots in Maze World using the weak reinforcement learning strategy.
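The comparisons that follow rest on standard one-tailed t-tests of the mean S, t and n_s between systems (lower is better in each case). A hedged sketch of such a test is shown here; the paper does not state whether equal variances are assumed, so Welch's form is used as a conservative choice, and the function name is illustrative.

```python
from scipy import stats

def significantly_better(scores_a, scores_b, alpha=0.05):
    """One-tailed test of whether system A's mean score is lower than B's.

    scores_a, scores_b : per-run scores (or times, or stall counts) for the
    two systems being compared; alpha = 0.05 or 0.01 for the 95% / 99% levels.
    """
    t_stat, p_two_sided = stats.ttest_ind(scores_a, scores_b, equal_var=False)
    # Convert the two-sided p-value to one-sided for H1: mean(A) < mean(B).
    p_one_sided = p_two_sided / 2.0 if t_stat < 0 else 1.0 - p_two_sided / 2.0
    return p_one_sided < alpha
```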
Only one world is used, since only one comparison with the good (strong learning) scheme is necessary. Although the original value used for b (80) is preserved, k1 is raised to 0.800 to reduce the idiotypic difference rate to a level comparable to the rate for the "optimum" k1 value used in the strong learning experiments, i.e. to about 20%.

VII. RESULTS AND DISCUSSION

Table VII shows the means and standard deviations for S, t and n_s and the percentages of good and bad runs for each system used (with initial parameters) in Maze World. Table VIII presents the same data for Mirror World and Table IX averages the data from both worlds. Table X gives the results of conducting standard one-tailed t-tests on the means of S, t and n_s from these data sets. Throughout this research, differences are accepted as significant at the 95% level, but where a difference is also significant at the 99% level this is indicated. The percentages of good and bad runs from this set of experiments are also shown graphically in Fig. 7. Table XI presents mean and standard deviation data for the experiments using different k1 and b parameters for S3; it includes the performances of S1 and S3 with the original parameters to make comparison easier. Significant differences between these results and the initial S1 results are summarized in Table XII. The weaker learning results are provided in Tables XIII (means and standard deviations) and XIV (significant differences).

A. Initial Parameters – The Untrained Robots

In Maze World S3 shows the best performance in terms of all of the fitness measures and S1 shows the worst, with system S2 second best in each case. In addition, 53% of S3's runs are considered good and none are considered bad, compared with 27% good and 33% bad for S1 and 33% good and 7% bad for S2 (see Fig. 7). S3 is significantly better than S1 at the 99% level for S and n_s and at the 95% level for t. Also, S3 is significantly better than S2, and S2 is significantly better than S1, at the 95% level when comparing the mean n_s and S values.

Although S2 out-performs S1, system S3 demonstrates faster and safer results than S2. This indicates that a full implementation of the network is required to elicit a suitable idiotypic response. In S2 there is no global feedback to the system from the communicating antibodies, i.e. the adjusted strength-of-match values T_g make a difference only within the current iteration and are discarded once the stage 2 competition is finished. In S3, however, the T_g values are incorporated into the updated concentration level through (8) for every antibody in the system. The concentration levels also feed back into the network for the next iteration via (4) and (5), which renders a much more dynamic system. Comparison of S2 and S3 has thus shown that concentration levels have a vital role in mediating the suppressive and stimulatory responses of the idiotypic system, i.e. it is not sufficient just to nominate alternative behaviors on the basis of a fitness metric that is governed only by paratope and idiotope values. The paratope and idiotope comparisons also need to be weighted using a second fitness measure that is non-antigen specific (concentration). Further research into the complex dynamics is clearly needed, but it is apparent that concentration serves to enrich the process by which alternative antibodies are selected.
It is possible that S3 is able to discriminate between suitable and inappropriate alternatives in a more efficient manner. The above observations represent strong statistical evidence that the implementation of a full idiotypic network improves the performance of a reinforcement learning robot, influencing its behavior in a positive manner during the initial learning period. However, analysis of the maximum and minimum data reveals that all three systems are capable of fast and safe runs (the minimum t is between 152 and 156 and the minimum n_s is between 2 and 9 for each system) and all are liable to get into some kind of difficulty (maximum t values are all over 380 and maximum n_s values are all above 45). Nevertheless, the maximum values for t (485, 406 and 385 for S1 – S3 respectively), the maximum numbers of stalls (127, 111 and 50) and the lower standard deviations for S3 indicate that the idiotypic robots are somehow protected from executing disastrous runs. In order to investigate this further, the rates of reinforcement success, antibody change and idiotypic difference are examined.

Analysis of the mean rates of reinforcement success reveals an important difference between systems S1 and S3. In S1 the mean is 48%, but this falls to 46% in stall situations. In S2 the mean is the same (50%) in both cases. In S3, however, the overall success rate (49%) rises to 58% when the robot stalls. The difference between S1 and S3 is significant at the 95% level, i.e. S3 produces a significantly better success rate when it stalls than S1. In addition, the difference is significant at the 95% level within S3, i.e. it produces a significantly higher success rate during stalls than overall. In S3 the mean idiotypic difference rate is 21%, rising to 29% during stalls (although the difference is not significant). The mean idiotypic success rate is 20% overall, but increases to 49% when the robot stalls. This represents a significant difference at the 99% level and suggests that, when stalled, the full idiotypic robot is able to choose more successful antibodies by increasing the rate of idiotypic difference. This in turn suggests that a mechanism for recognizing and responding to dangerous situations is inherent in the idiotypic dynamics.

There are no significant differences between S2 and S3 in terms of idiotypic difference and idiotypic success rates. S2's mean idiotypic difference rate is 18%, rising to 27% during stalls. The mean idiotypic success rate is 21% overall, increasing to 46% when the robot stalls. This represents a significant difference within the system at the 99% level but is not significantly different from the value of 49% for S3. However, these figures may be misleading, because in S2 the process of selecting the stage 1 winner is not influenced by past idiotypic calculations, as there is never any global feedback from the network. In S3 the choice of antibody in stage 1 is directly affected by past T_g scores, which means that undetectable idiotypic differences are constantly occurring. Not surprisingly, the antibody change rates show significant differences between the systems when the robots are stalled: S3 demonstrates an overall rate of 65%, rising to 88% when stalled, compared with 66% rising to only 78% for S1 and 58% rising to 82% for S2. The change rate for S3 is significantly higher than both the others at the 95% level.
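These stall-conditioned rates, together with the long-stall-sequence statistics considered next, can be recovered from a per-iteration log of the controller. The sketch below shows one possible computation; the record layout, helper names and the reading of "repeated behaviors" are assumptions made for illustration and are not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List

@dataclass
class IterationRecord:
    stalled: bool          # robot was stalled (in collision) this iteration
    positive_reward: bool  # the reinforcement score awarded was positive
    antibody_id: int       # antibody finally selected and executed
    stage1_winner_id: int  # antibody chosen before idiotypic adjustment


def rate(flags: List[bool]) -> float:
    """Percentage of True values, or 0.0 for an empty list."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def summary_rates(log: List[IterationRecord]) -> Dict[str, float]:
    stalls = [r for r in log if r.stalled]
    changes = [a.antibody_id != b.antibody_id for a, b in zip(log, log[1:])]
    idio_success = [r.positive_reward for r in log
                    if r.stage1_winner_id != r.antibody_id]
    return {
        "reinforcement success %": rate([r.positive_reward for r in log]),
        "reinforcement success % (stalls)": rate([r.positive_reward for r in stalls]),
        "antibody change %": rate(changes),
        "idiotypic difference %": rate([r.stage1_winner_id != r.antibody_id for r in log]),
        "idiotypic difference % (stalls)": rate([r.stage1_winner_id != r.antibody_id
                                                 for r in stalls]),
        "idiotypic success %": rate(idio_success),
    }


def long_stall_sequences(log: List[IterationRecord]) -> Iterator[List[IterationRecord]]:
    """Yield maximal runs of two or more consecutive stalled iterations."""
    seq: List[IterationRecord] = []
    for rec in log + [IterationRecord(False, False, -1, -1)]:  # sentinel flushes last run
        if rec.stalled:
            seq.append(rec)
        else:
            if len(seq) > 1:
                yield seq
            seq = []


def long_stall_statistics(log: List[IterationRecord]) -> Dict[str, float]:
    seqs = list(long_stall_sequences(log))
    if not seqs:
        return {}
    durations = [len(s) for s in seqs]
    # One plausible reading of "repeated behaviors": re-selections of an
    # already-tried antibody within the same stall sequence.
    repeats = [len(s) - len({r.antibody_id for r in s}) for s in seqs]
    change = [rate([a.antibody_id != b.antibody_id for a, b in zip(s, s[1:])])
              for s in seqs]
    n = len(seqs)
    return {
        "number of long sequences": float(n),
        "mean duration (iterations)": sum(durations) / n,
        "mean repeated behaviors": sum(repeats) / n,
        "mean antibody change rate %": sum(change) / n,
    }
```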
However, the observed increase in change rate is significant at the 99% level within each system, which shows that there is a need for rapid antibody change during stall conditions and that all the systems are capable of delivering such changes. This interpretation may be deceptive, though, because these results cover both good and bad runs, and collisions lasting only one iteration are also counted as stalls. In order to gain a better understanding it is necessary to consider the sampled long-stall data.

Detailed analysis of the long stall sequence data reveals a significant difference at the 99% level for the mean duration of sequences. The idiotypic robots in S3 remain stuck in these sequences for an average of 4.78 iterations before freeing themselves, whereas the non-idiotypic robots in S1 remain trapped for an average of 8.53 iterations. In addition, examination of the antibodies used in these trap conditions shows that the mean number of repeated behaviors in S3 is 1.54 compared with 3.42 for S1, a difference that is significant at the 95% level. Furthermore, the antibody change rate during these sequences is 68% for S3 but only 19% for S1. However, there is no significant difference between the mean numbers of long stall sequences. These observations support the view that idiotypic robots are able to out-perform their non-idiotypic counterparts by freeing themselves from stalls more quickly, and they suggest that this rapid escape is accomplished by an ability to switch behaviors at a higher rate. This provides good evidence to support hypotheses H2 and H3 and is further substantiated by analysis of the idiotypic differences in S3 for this sub-set of the data. On average, 72% of S3's long stall sequences are terminated (i.e. the robot escapes) when an idiotypic difference occurs. Moreover, in 63% of the long sequences the idiotypic difference generates an untried antibody that proves successful while the stage 1 matching process is still suggesting the use of antibodies that have already failed. This analysis helps to explain the large differences in the standard deviations between S1 and S3 and fits with the other observations: all the systems are capable of performing well, but when stall problems occur S3 is able to resolve them more rapidly, which means that it is not inclined to produce disastrous runs.

B. Initial Parameters – The Trained Robots

In Mirror World, where the initial paratope matrix is taken as the output from Maze World, each system improves on the mean n_s, t and S from its Maze World trials, which demonstrates that all three systems allow a degree of learning to take place. In addition, standard deviations are generally lower, as there are far fewer bad performances, and an improvement in S is demonstrated on 77%, 70% and 80% of runs for S1 – S3 respectively. The percentage of good runs also increases from the first environment, to 60%, 57% and 80% for the three systems. However, although S3 produces no unacceptable runs, S1 and S2 still under-perform on approximately 7% of runs. S3 again shows the best performance in terms of all of the metrics and consistently demonstrates a lower standard deviation than S1. It is also significantly better than S1 at the 99% level for S and n_s, and significantly better than S2 at the 95% level for all the criteria.
S2 produces the second best results on all counts except time to complete the task, but its performance is not significantly better than that of S1. The observations and analysis of the Mirror World data suggest that the robots are less likely to get into difficulties when using a paratope matrix from a previous run, because they have developed good obstacle avoidance strategies from their earlier experiences. This means that stalls tend to happen less frequently and run times are generally faster. However, significant differences are still apparent between S1 and S3 and between S2 and S3, suggesting that the full idiotypic network still has an important role to play in assisting robots to escape from traps after initial learning has taken place. This supports hypothesis H1, that idiotypic systems permit a degree of de-coupling from an engineered learning system, since it shows that the idiotypic network is still able to influence a reinforcement system positively even after the robot has had ample time to complete the learning process.

C. Initial Parameters – The Combined Results

Analysis of the combined results, i.e. the means of the two n_s, t and S values from each run, shows that S3 is significantly better than S1 and S2 for all three fitness measures, even task time. These differences are at the 99% level for S and n_s and at the 95% level for t. The effect of combining the results in this way is to smooth out the data, reducing the standard deviations, which allows good comparison. From Table X it is readily apparent that S3 is superior both to S2 and to S1. In addition, the percentages of good and bad runs reflect the incremental nature of the systems, with S3 performing well in 67% of all runs and badly in none, S2 running well in 57% and badly in 3%, and S1 achieving a good run in 30% and a bad run in 20% (see Fig. 7). This analysis provides good evidence that the full idiotypic network can significantly improve robot performance during longer tasks, i.e. tasks that include both a learning phase and a mature phase, where a stable paratope matrix has developed.

D. Varying Parameters

Table XI shows that S3 achieves a remarkably similar performance to its first trial when different parameters are used. This is especially true for the mean and standard deviation of the score and the number of stalls. As in the first trial, these are both significantly better than S1's performance at the 99% level. Moreover, S3 improves on its original mean time of 237 seconds, yielding mean completion times of 215 and 225 seconds for b = 100 and b = 60 respectively. These increased levels of performance permit significant differences between S3 and S1 at the 99% level for time, rather than at the 95% level as in the original data. This provides even stronger evidence to support the case for the full idiotypic advantage and demonstrates that there is a degree of flexibility within the k1 and b parameters. These additional sets of results may also indicate that 0.625 is slightly too high to be the optimum for k1 in this region of b and with this reinforcement scheme.

E. The Weaker Learning Strategy

Comparison of Tables VII and XIII shows that the means and standard deviations of S, t and n_s increase for both systems when the weaker strategy is implemented.
In addition, the number of good scores reduces and the number of bad scores increases, reflecting the fact that obstacle avoidance is more difficult to achieve. Within the weak learning experiments, the differences between the mean S, t and n_s for S1 and S3 are significant at the 99% level. This is further evidence in support of implementing a full idiotypic network to accompany reinforcement learning, but the real value of this experiment lies in comparing S3's weak learning performance with S1's strong learning performance. S3 achieves a mean (n_s, t, S) of (27, 273, 257) with weak learning, compared with (45, 277, 342) for S1 using strong learning. The difference between the n_s values is significant at the 99% level and the difference between the S values is significant at the 95% level. This means that robots implementing the full idiotypic network with poor learning perform as well as (and possibly better than) robots with good learning but no idiotypic selection, which suggests that a full network may be able to offer a degree of compensation for poor learning. This supports hypothesis H1 and shows that the idiotypic robots may have been implementing more creative solutions to the problem.

VIII. CONCLUSION AND FUTURE AIMS

A. Conclusion

A computational method for simulating idiotypic effects is developed, based on Farmer's popular model of Jerne's idiotypic network. The scheme is incorporated into a reinforcement learning (RL) architecture and compares antibody idiotopes and paratopes in order to determine inter-antibody suppression and stimulation levels. The architecture is fully described and incrementally implemented with virtual robots that perform a color-tracking task in order to test three hypotheses H1 – H3. H1 asserts that idiotypic systems allow a degree of detachment from reinforcement learning, H2 proposes that they reduce premature convergence, and H3 postulates that they allow escape from repeated behavior patterns. The use of the full idiotypic network (S3) produces significantly better results than the partial implementations that use RL only (S1) or a simplified network without global feedback (S2), thus highlighting the benefits of introducing greater idiotypic complexity.

The faster and safer performance of S3 is chiefly attributed to its ability to recover from stall situations much more rapidly than the other systems, which is thought to be a direct result of idiotypic activity. Indeed, this paper provides evidence to suggest that during a sequence of stalls S3 is capable of increasing the rate of antibody change autonomously, so that repeated behaviors are discarded in favor of suitably selected alternatives. This may be a result of the system's ability to raise the rate of idiotypic communication during a stall, so that a much higher reinforcement success rate is achieved, implying that idiotypic networks have an inherent mechanism for detecting and responding to trap situations. These results support hypothesis H2, as an increased rate of antibody change implies a much less greedy strategy. In addition, this paper supplies evidence that during stall sequences the idiotypic process tends to generate previously untried, successful antibodies while the antigen matching process recommends repeated failures.
This is direct evidence in support of H2 and also upholds H3. However, since repeated loops of behavior may also occur in non-stall situations, further tests that isolate recurring behavior patterns are recommended in order to test H3 further. The simplified idiotypic system S2 is believed to have under-performed in comparison to S3 because of the lack of global feedback from the idiotypic network to the antibody concentrations and vice versa. Its inferior attainment demonstrates that concentration and feedback are extremely important components of an idiotypic system, possibly providing an additional memory feature that allows discrimination between suitable and inappropriate alternatives in a more efficient manner.

Evidence to support H1 is provided by comparing the performance of the systems after training, where S3 still proves superior to S1. This shows that the network retains its influence over the system once learning has taken place, i.e. that there is a sense of de-coupling from the reinforcement strategy. Furthermore, when S3 is implemented with a weaker learning strategy, its performance is still significantly better than that of S1 using stronger learning. This clearly suggests that a full idiotypic network permits robots greater scope for creating solutions to the task, as they are able to assert a degree of independence over the behaviors prescribed by the engineered reinforcement signals.

B. Future Aims

It may be argued that using a hand-designed idiotope is equivalent to providing the robot with a priori information about the behaviors, since it effectively shows which are of similar type. Indeed, the idiotypic selection algorithm may be regarded as somewhat redundant when using a contrived matrix such as this, because it is readily apparent which antibodies are of similar type and which are different. The next step in this research is therefore to investigate whether similar results can be obtained when an initially random, variable idiotope is used. The variable matrix would develop by incrementing antibody-antigen combinations that produce a high rate of negative reinforcement learning scores, as in [15]. Once a self-regulating and variable idiotope is in place, a meta-dynamic system with mutation may be designed, and concentration levels can be used to determine which antibodies are retained and which die. Future research will therefore focus on developing means for creating new antibodies and for testing and mutating them.

In addition, further investigation into the complex dynamics of the full idiotypic network is the logical extension of this work. In particular, more extensive research into the relationships between the parameters k1, k2 and b will be conducted, with testing taking place using a wide variety of environments, reinforcement schemes, problems, robots and antibody selection mechanisms. The effect of the idiotope matrix upon these parameters will also be studied by testing different fixed matrices and several variable schemes. Moreover, since it is always extremely difficult to know whether simulation results generalize to the real world, these systems will also be trialed using real robots that attempt to solve similar problems in dynamically changing environments.
It is possible that an idiotypic network may bring even more advantage to a real-world system, since such a system is less predictable and should therefore require a less pre-determined method of behavior selection. However, prior to this it is necessary to develop a method that can provide reasonably good starting paratopes, allowing close to zero stalls for the real robots. A major part of extending this work will therefore involve integrating the AIS system with a genetic algorithm that runs in highly accelerated simulations, evolving a strong set of base rules to initialize the real robot.

REFERENCES

[1] N. K. Jerne, "Towards a network theory of the immune system", Ann. Immunol. (Inst. Pasteur), 125 C, pp. 373-389, January 1974.
[2] J. D. Farmer, N. H. Packard, A. S. Perelson, "The immune system, adaptation, and machine learning", Physica D, Vol. 2, Issue 1-3, pp. 187-204, October – November 1986.
[3] F. M. Burnet, The Clonal Selection Theory of Acquired Immunity, Cambridge University Press, 1959.
[4] L. N. de Castro, J. Timmis, Artificial Immune Systems: A New Computational Intelligence Approach, London, Springer-Verlag, 2002.
[5] Y. Watanabe, A. Ishiguro, Y. Shirai, Y. Uchikawa, "Emergent construction of behavior arbitration mechanism based on the immune system", in Proceedings of the 1998 IEEE International Conference on Evolutionary Computation (ICEC), pp. 481-486, May 1998.
[6] T. Kondo, A. Ishiguro, Y. Watanabe, Y. Shirai, Y. Uchikawa, "Evolutionary construction of an immune network-based behavior arbitration mechanism for autonomous mobile robots", Electrical Engineering in Japan, Vol. 123, No. 3, pp. 1-10, December 1998.
[7] P. A. Vargas, L. N. de Castro, R. Michelan, "An immune learning classifier network for autonomous navigation", Lecture Notes in Computer Science, 2787, pp. 69-80, September 2003.
[8] P. A. Vargas, L. N. de Castro, F. J. Von Zuben, "Mapping artificial immune systems into learning classifier systems", Lecture Notes in Artificial Intelligence, 2661, pp. 163-186, (IWLCS September 2002), Lanzi, P. L. et al. (eds.), 2003.
[9] G. C. Luh, W. W. Liu, "Reactive immune network based mobile robot navigation", Lecture Notes in Computer Science, 3239, pp. 119-132, September 2004.
[10] J. Suzuki, Y. Yamamoto, "Building an artificial immune network for decentralized policy negotiation in a communication end system: Open webserver/iNexus study", in Proc. of the 4th World Conference on Systemics, Cybernetics and Informatics (SCI 2000), Orlando, FL, USA, July 2000.
[11] D. Chowdhury, "Immune network: an example of complex adaptive systems", Artificial Immune Systems and Their Applications, Dasgupta, D. (ed.), Springer, pp. 89-104, 1999.
[12] R. Michelan, F. J. Von Zuben, "Decentralized control system for autonomous navigation based on an evolved artificial immune network", in Proceedings of the 2002 Congress on Evolutionary Computation (CEC 2002), Vol. 2, pp. 1021-1026, Honolulu, Hawaii, May 12-17, 2002.
[13] M. Krautmacher, W. Dilger, "AIS based robot navigation in a rescue scenario", Lecture Notes in Computer Science, 3239, pp. 106-118, September 2004.
[14] D. Floreano, F. Mondada, "Evolution of homing navigation in a real mobile robot", IEEE Transactions on Systems, Man and Cybernetics – Part B: Cybernetics, 26 (3), pp. 396-407, June 1996.
[15] A. Ishiguro, T. Kondo, Y. Watanabe, Y. Uchikawa, "A reinforcement learning method for dynamic behavior arbitration of autonomous mobile robots based on the immunological information processing mechanisms", Trans. IEE of Japan, 117-C, No. 1, pp. 42-49, January 1997.
[16] G. Parker, "Co-evolving model parameters for anytime learning in evolutionary robotics", Robots and Autonomous Systems, 33, pp. 13-30, October 2000.
[17] V. Gullapalli, "Skillful control under uncertainty via direct reinforcement learning", Robots and Autonomous Systems, 15, pp. 237-246, August 1995.
[18] M. J. Matarić, "Reinforcement learning in the multi-robot domain", Autonomous Robots, 4 (1), pp. 73-83, March 1997.
[19] R. T. Vaughan, B. P. Gerkey, A. Howard, "The Player/Stage project: Tools for multi-robot and distributed sensor systems", in Proceedings of the International Conference on Advanced Robotics (ICAR 2003), pp. 317-323, Coimbra, Portugal, June 30 – July 3, 2003.
[20] P. F. Stadler, P. Schuster, A. S. Perelson, "Immune networks modeled by replicator equations", J. Math. Biol., 33 (2), pp. 111-137, 1994.
[21] S. Cayzer, U. Aickelin, "A recommender system based on idiotypic artificial immune networks", Journal of Mathematical Modelling and Algorithms, 4 (2), pp. 181-198, 2005.
[22] R. A. Brooks, "A robust layered control system for a mobile robot", IEEE Journal of Robotics and Automation, RA-2 (1), pp. 14-23, March 1986.
[23] S. Sathyanath, F. Sahin, "AISIMAM – An AIS based intelligent multi agent model and its application to a mine detection problem", in Proceedings of ICARIS 2002, 1st International Conference on Artificial Immune Systems, Canterbury, UK, September 9-11, 2002.
[24] S. Sathyanath, F. Sahin, "Application of artificial immune system based intelligent multi agent model to a mine detection problem", in Proceedings of SMC 2002, IEEE International Conference on Systems, Man, and Cybernetics, Vol. 3, Tunisia, October 2002.
[25] W. J. Opp, F. Sahin, "An artificial immune system approach to mobile sensor networks and mine detection", in Proceedings of SMC 2004, IEEE International Conference on Systems, Man, and Cybernetics, Vol. 1, pp. 947-952, October 10-13, 2004.

BIOGRAPHIES

Amanda M. Whitbrook received the B.Sc. (Hons) degree in Mathematics and Physics from The Nottingham Trent University, U.K. in 1993, the M.Sc. degree in Management of Information Technology from the University of Nottingham, U.K. in 2005 and the Ph.D. degree in numerical analysis from The Nottingham Trent University, U.K. in 1998. She has previously been employed as a programmer and data analyst and as an information systems designer. She currently holds the position of Research Fellow within the Automated Scheduling, Optimization and Planning (ASAP) Research Group in the School of Computer Science at the University of Nottingham, U.K. Her research interests include robotics, artificial intelligence, adaptive learning, evolutionary algorithms and artificial immune systems.

Uwe Aickelin (M'06) received a Management Science degree from the University of Mannheim, Germany, in 1996 and a European Master and Ph.D. in Management Science from the University of Wales, Swansea, U.K., in 1996 and 1999, respectively. Immediately following his Ph.D., he joined the University of the West of England in Bristol, U.K.,
where he worked for three years in the Mathematics Department as a lecturer in Operational Research. In 2002, he accepted a lectureship in Computer Science at the University of Bradford, U.K., mainly focusing on computer security. Since 2003 he has worked for the University of Nottingham, U.K., in the Automated Scheduling, Optimization and Planning (ASAP) Research Group in the School of Computer Science, where he is now a Reader in Computer Science and Director of the Inter-disciplinary Optimization Laboratory. He currently holds an EPSRC Advanced Fellowship focusing on artificial immune systems, anomaly detection and mathematical modeling. In total, he has been awarded over £2 million of EPSRC research funding as Principal Investigator (including an Adventure Grant and two IDEAS Factory projects) on topics including artificial immune systems, danger theory, computer security, robotics and agent-based simulation. Dr. Aickelin is an Associate Editor of the IEEE Transactions on Evolutionary Computation, the Assistant Editor of the Journal of the Operational Research Society and an Editorial Board member of Evolutionary Intelligence.

Jonathan M. Garibaldi received the B.Sc. (Hons) degree in Physics from Bristol University, U.K., and the M.Sc. degree in Intelligent Systems and the Ph.D. degree in uncertainty handling in immediate neonatal assessment from the University of Plymouth, U.K., in 1984, 1990 and 1997, respectively. He is currently an Associate Professor within the Automated Scheduling, Optimization and Planning (ASAP) Research Group in the School of Computer Science at the University of Nottingham, U.K. The ASAP research group tackles a wide range of decision-making problems with particular emphasis on heuristic and meta-heuristic approaches to combinatorial optimization. He has published over 40 papers on fuzzy expert systems and fuzzy modeling, including three book chapters, and has edited two books. His main research interests are modeling uncertainty in human reasoning and the intelligent analysis of large, multimodal, noisy data sets. He has created and implemented fuzzy expert systems and developed methods for fuzzy model optimization.

FIGURE CAPTIONS

Fig. 1. Antibody paratope and idiotope regions and inter-antibody stimulation and suppression [4].
Fig. 2. The Maze World used for conducting the door-tracking experiments with untrained robots.
Fig. 3. Mean score versus k1 for S2 and S3 with b values of 8, 80 and 800.
Fig. 4. Idiotypic difference rate versus k1 for S2 and S3 with b values of 8, 80 and 800.
Fig. 5. Mean score versus b for S3 with k1 between 0.45 and 0.65.
Fig. 6. Mean score versus k1 between 0.45 and 0.65 for S3 in the region 40 ≤ b ≤ 160.
Fig. 7. Histogram of good and bad runs for the three systems, S1 – S3.