Controlling Fish Schools via Reinforcement Learning of Virtual Fish Movement

Yusuke Nishii (1) and Hiroaki Kawashima (1,2)*

(1) Faculty of Engineering, Kyoto University, Kyoto, Japan
(2) Graduate School of Informatics, Kyoto University, Kyoto, Japan
* Present address: Graduate School of Information Science, University of Hyogo, Kobe, Japan. Corresponding author: kawashima@gsis.u-hyogo.ac.jp

Note: This report presents an English translation of the original Japanese bachelor's thesis submitted by Yusuke Nishii in 2018 to the Undergraduate School of Electrical and Electronic Engineering, Faculty of Engineering, Kyoto University, under the supervision of Hiroaki Kawashima. The research was conducted between 2017 and 2018 while the authors were affiliated with Kyoto University. This translation includes minor revisions for clarity. The original thesis was completed on February 9, 2018, and its content reflects the state of research at that time. The translation aims to preserve the meaning and structure of the original work while making it accessible to a broader audience. The literature review and the scientific context of this work reflect the state of the field at the time of the original submission.

Abstract

This study investigates a method to guide and control fish schools using virtual fish trained with reinforcement learning. We utilize 2D virtual fish displayed on a screen to overcome technical challenges, such as durability and movement constraints, inherent in physical robotic agents. To address the lack of detailed behavioral models for real fish, we adopt a model-free reinforcement learning approach. First, simulation results show that reinforcement learning can acquire effective movement policies even when simulated real fish frequently ignore the virtual stimulus. Second, real-world experiments with live fish confirm that the learned policy successfully guides fish schools toward specified target directions. Statistical analysis reveals that the proposed method significantly outperforms baseline conditions, including the absence of stimulus and a heuristic "stay-at-edge" strategy. This study provides an early demonstration of how reinforcement learning can be used to influence collective animal behavior through artificial agents.

1 Introduction

1.1 Research Background

The long-term objective of this research is to guide and control the movement of fish schools using externally controlled robotic agents. Potential applications range from choreographing fish school trajectories for aquarium exhibitions to navigating schools in aquaculture settings. In such practical scenarios, the target fish schools are expected to be relatively large, making it impractical to provide stimuli to every individual within the group. Consequently, it becomes necessary to control the entire school by stimulating only a limited number of individuals. To achieve this, it is crucial to provide stimuli tailored to the real-time state of the school, for instance, by identifying which individuals exert the strongest influence on the collective motion and determining how they should move to guide the group toward a desired objective. This suggests that feedback control, which provides appropriate stimuli based on the evolving state of the school, is indispensable.
A primary approach to such control involves the use of motion prediction models for fish schools. These models would allow for predicting the school's response to specific stimuli and pre-evaluating control strategies against desired targets. Regarding motion models, previous studies, such as the one by Couzin et al. [1], have successfully simulated natural schooling behaviors and contributed significantly to elucidating collective mechanisms. However, at present, these models have not yet reached the level of accuracy required to predict the real-time movements of actual fish schools.

1.2 Research Objectives

Potential stimuli for influencing fish behavior include predators, food, and artificial conspecifics. However, the use of predators is expected to cause significant stress to the target fish, while the continuous provision of food can lead to health and environmental issues. Therefore, this study focuses exclusively on the use of artificial conspecifics as stimuli.

To address the challenge of not having a predictive model for the target school's movement, we propose an approach that acquires movement policies through reinforcement learning (RL) [2], with the artificial conspecific serving as the RL agent. RL is a learning framework in which an agent observes the state of the environment, determines an action based on that state, and receives a reward as an evaluation of that action, thereby acquiring a near-optimal policy. This method offers the distinct advantage of not requiring an explicit model of the environment (here, the target schooling behavior).

While the long-term goal is to implement these artificial conspecifics as physical robots, controlling the precise motion of the robots themselves remains a significant challenge. To circumvent this, we utilize virtual fish displayed on a screen. Virtual fish not only allow for easier motion control but also offer superior durability compared to mechanical robots. Since RL requires the virtual fish to interact with the school over extended periods to acquire an effective policy, the durability of virtual agents is highly suitable for the objectives of this study.

Based on the above, this research aims to investigate a method for controlling fish schools by applying RL to the movements of virtual fish. For simplicity, the scale of the fish school is limited to a small number of individuals.

1.3 Related Work

Various models have been proposed to describe collective behavior in nature. Notable examples include the Kuramoto model [3, 4], originally developed to explain the synchronization of firefly flashing; the Boids model [5], which simulates the flocking behavior of birds; and the model by Couzin et al. [1], which characterizes the collective movements of fish schools.

Several studies have investigated schooling behavior by providing external stimuli to fish. For instance, Swain et al. observed the reactions of fish to predator-like stimuli [6]. Kopman et al. developed a robotic conspecific that adjusts its tail-fin motion based on the real-time positions of live fish to observe their behavioral changes [7]. Furthermore, research has also focused on using visual stimuli to influence behavior. Kawashima et al.
utilized a camera-display system to present virtual fish to a school, confirming that the real fish synchronized their movements with the reciprocating motion of the virtual fish [8].

A fundamental biological reaction of fish to visual stimuli is the optomotor response, a reflexive behavior in which fish follow a moving object or pattern. A well-known experimental demonstration involves placing a cylinder with vertical black-and-white stripes around a water tank; as the cylinder rotates, the fish swim along with the stripes [9]. Such reflexive behaviors suggest significant potential for controlling the movement of fish schools through precisely designed visual stimulation.

1.4 Structure of the Report

The remainder of this report is organized as follows. Section 2 describes the problem formulation of this study and the configuration of the interaction system between the virtual fish and the real fish. Section 3 explains how the interaction system is modeled as a Markov Decision Process (MDP), a standard framework for representing agent-environment interactions in RL, and details the Q-learning algorithm employed in this study. Section 4 presents a preliminary evaluation using simulations to verify whether RL remains effective even when simulated real fish frequently ignore the virtual fish. Section 5 discusses the implementation of RL in a real-world environment and evaluates the performance of the acquired movement policies. Finally, Section 6 concludes this study and suggests directions for future work.

2 Interaction System for Fish Schools and Virtual Fish

2.1 Problem Formulation

For simplicity, this study considers a fish school consisting of a small number of individuals of the same species. The target species should be easy to obtain and maintain, and should possess strong schooling tendencies. To satisfy these criteria, we use the Rummy-nose tetra (Hemigrammus bleheri; the species was reclassified as Petitella bleheri in 2020, but the original name is retained in this translation for consistency with the 2018 source text).

Generally, the positions of fish in a tank vary freely in three-dimensional (3D) space. However, in this study, the experiments are conducted in a narrow environment so that the fish positions can be treated as two-dimensional (2D) coordinates. The internal dimensions of the tank used for housing and experiments are 34.0 cm in width, 14.0 cm in length (front-to-back distance), and 28.3 cm in height. During experiments, partition plates restrict the swimming area to 30.4 cm in width, 5.0 cm in length, and 12.5 cm in height. A display is placed against the back of the tank to present the virtual fish, and the interaction is observed using a camera positioned in front of the tank. A schematic diagram of this system is shown in Fig. 1. Details regarding the presentation of the virtual fish and the camera-based observation are provided in Sections 2.2 and 2.3, respectively.

2.2 Rationale for Using Virtual Fish

First, we discuss the justification for using virtual fish on a screen. Fish schools are formed through reactions between individuals, categorized as follows:

- Attraction: Individuals recognize conspecifics visually and move toward each other.
- Alignment: Individuals follow the movements of neighbors through the optomotor response.
- Repulsion: Individuals maintain distance from neighbors based on lateral-line sensations to avoid collisions.
Among these, attraction leads to "aggregation," where fish gather in close proximity, and alignment leads to the formation of a "school," where they move in a unified direction. Therefore, we expect that the movement of the school can be controlled by inducing attraction and alignment, the reactions primarily based on visual stimuli, using virtual fish. Indeed, it has been reported that when a school of three Rummy-nose tetras was presented with virtual fish performing periodic reciprocating motions via a camera-display system, the real fish tended to synchronize with the virtual ones [8]. Thus, the use of virtual fish is considered valid.

Figure 1: Overview of the interaction system between a fish school and virtual fish. The camera captures the state of the real fish school, the controller decides the motion of the virtual fish and updates the control strategy via reinforcement learning, and the display presents the virtual fish, forming a real-time feedback system.

Next, we consider the design of the virtual fish. Since the front-to-back dimension of the swimming area is restricted, we place a display (ASUS MB168B: 15.6-inch, resolution 1366 × 768, pixel pitch 0.252 mm, response time 11 ms, USB 3.0 interface) against the back of the tank. As previously mentioned, attraction is triggered by visual recognition of conspecifics. Therefore, it is effective to use virtual fish that resemble the target species in shape, size, and texture. Consequently, we use an image created by photographing real fish, as shown in Fig. 2. The background color is kept uniform across the display.

Figure 2: Image and arrangement of the virtual fish, including the background color.

2.3 Camera-Based Measurement of Fish Schools

A camera (FLIR Flea3 FL3-U3-13E4C-C: resolution 1280 × 1080, color, maximum frame rate 60 fps, USB 3.0 interface; lens: FUJINON HF9HA-1B, aperture F1.4–F16, focal length 9 mm) is positioned in front of the tank to capture the interaction. The frame rate is set to 10 fps, which is sufficient to capture the movements of the fish. An example of a captured image is shown in Fig. 3.

Figure 3: Example of a captured image. Due to the light from the display, the real fish appear dark as silhouettes.

We define a "viewport" as a plane that collapses the front-to-back dimension within the restricted swimming area. Coordinates within the viewport are expressed as $(x_{\text{viewport}}, y_{\text{viewport}})$, normalized such that $x_{\text{viewport}}, y_{\text{viewport}} \in [0, 1]$.

The correspondence between the captured image coordinates $(x_{\text{camera}}, y_{\text{camera}})$ and the viewport coordinates $(x_{\text{viewport}}, y_{\text{viewport}})$ is approximately expressed as:

$$x_{\text{viewport}} = a\, x_{\text{camera}} + b, \qquad (1)$$
$$y_{\text{viewport}} = c\, y_{\text{camera}} + d. \qquad (2)$$

The constants $a, b, c, d$ are obtained by manually selecting the four boundaries corresponding to the viewport in an image captured before the experiment. Similarly, the relationship between the viewport coordinates and the display coordinates $(x_{\text{display}}, y_{\text{display}})$ is approximated as:

$$x_{\text{display}} = e\, x_{\text{viewport}} + f, \qquad (3)$$
$$y_{\text{display}} = g\, y_{\text{viewport}} + h. \qquad (4)$$

To determine the constants $e, f, g, h$, two points with known display coordinates are shown before the experiment. Their positions in the captured image are manually identified and converted to viewport coordinates using Eqs. (1) and (2). This yields corresponding coordinate pairs in the display and viewport coordinate systems, from which $e, f, g, h$ are calculated. Through this process, both the positions in the captured images and the positions on the display can be represented consistently within the viewport coordinate system.
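As a concrete illustration of this calibration, the following is a minimal sketch, with our own function names and hypothetical pixel values (the original system's implementation is not specified in the thesis). Each axis is a one-dimensional affine fit from two reference values, and the two mappings compose as described above:

```python
def fit_axis(src, dst):
    """Fit dst = k*src + m from two (src, dst) reference values per axis (Eqs. 1-4)."""
    k = (dst[1] - dst[0]) / (src[1] - src[0])
    m = dst[0] - k * src[0]
    return k, m

# Camera -> viewport: the pixel columns of the viewport's left/right boundaries
# map to viewport x = 0 and 1 (likewise for y). Pixel values here are hypothetical.
a, b = fit_axis(src=(112.0, 1180.0), dst=(0.0, 1.0))   # x-axis, Eq. (1)
c, d = fit_axis(src=(604.0, 1010.0), dst=(0.0, 1.0))   # y-axis, Eq. (2)

# Viewport -> display: two markers with known display x-coordinates are shown,
# located in the camera image, converted to viewport x via Eq. (1), and fitted.
vx = (a * 250.0 + b, a * 1000.0 + b)          # viewport x of the two markers
e, f = fit_axis(src=vx, dst=(100.0, 1266.0))  # known display x, Eq. (3)

def camera_to_viewport(x_cam, y_cam):
    """Convert a camera pixel position to normalized viewport coordinates."""
    return a * x_cam + b, c * y_cam + d
```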
3 Reinforcement Learning of Virtual Fish Movement

3.1 Markov Decision Process for Fish School Control

First, we outline the general framework of RL based on Sutton and Barto [2]. In RL, the entity that performs learning and decision-making is called the agent, and the entity it interacts with is called the environment. The agent and environment interact at each discrete time step $n = 0, 1, \ldots$. Specifically, the agent observes a state $s_n \in \mathcal{S}$ (where $\mathcal{S}$ is the state space) from the environment and, based on this, selects an action $a_n \in \mathcal{A}$ (where $\mathcal{A}$ is the action space). By performing action $a_n$, the agent influences the environment, resulting in the observation of the next state $s_{n+1}$ and a corresponding reward $r_{n+1} \in \mathbb{R}$. This interaction is illustrated in Fig. 4.

Figure 4: Interaction between the agent and the environment in reinforcement learning.

When the system satisfies the Markov property, that is, when $s_{n+1}$ and $r_{n+1}$ are determined probabilistically only by $s_n$ and $a_n$, the model is called a Markov Decision Process (MDP). An MDP with finite sets $\mathcal{S}$ and $\mathcal{A}$ is known as a finite MDP. In many cases, RL aims to address situations that can be approximated as finite MDPs. The goal of the agent is to learn actions $a_n$ that maximize the long-term cumulative reward $r_{n+1}, r_{n+2}, \ldots$, formulated as the maximization of the discounted return $R_n$ using a discount rate $\gamma \in [0, 1]$:

$$R_n = \sum_{k=0}^{\infty} \gamma^k r_{n+k+1}. \qquad (5)$$

Here, we model the fish school control problem for RL. For simplicity, we focus on controlling the $x$-coordinate, the primary direction of movement, while ignoring the $y$-coordinate. The control objective is to guide the centroid of the real fish school toward the edges of the tank. The target direction is switched between the left and right edges at appropriate intervals. By assuming left-right symmetry, we consider that a policy learned to guide the school toward one edge is equally effective for the opposite direction. The discrete time step $n$ corresponds to the time when the $(10n)$-th frame is captured by the camera; thus, the duration of each step is 1 s. Continuous time is denoted by $t$, and the continuous time corresponding to step $n$ is denoted by $t_n$.

Before defining $\mathcal{S}$ and $\mathcal{A}$, we define regions called cells within the viewport. Cells are defined by dividing the viewport horizontally into $W$ intervals. For a given number of divisions $W$, a cell $w \in \{0, 1, \ldots, W-1\}$ is defined in the viewport coordinate system as:

$$\begin{cases} \left\{ (x, y)^{\top} \,\middle|\, x \in \left[\dfrac{w}{W}, \dfrac{w+1}{W}\right],\ y \in [0, 1] \right\} & \text{when guiding toward the right edge,} \\[8pt] \left\{ (x, y)^{\top} \,\middle|\, x \in \left[\dfrac{W-w-1}{W}, \dfrac{W-w}{W}\right],\ y \in [0, 1] \right\} & \text{when guiding toward the left edge.} \end{cases} \qquad (6)$$

This cell definition is illustrated in Fig. 5. The state is defined as the pair $(w_{\text{real}}, w_{\text{virtual}})$, representing the cells containing the centroid of the real fish school and that of the virtual fish, respectively. Thus, the state space is:

$$\mathcal{S} = \{ (w_{\text{real}}, w_{\text{virtual}}) \mid w_{\text{real}}, w_{\text{virtual}} \in \{0, 1, 2, \ldots, W-1\} \}. \qquad (7)$$

Figure 5: Definition of cells. When the number of divisions is $W$, the cell at the target end is defined as $W-1$, and the cell at the opposite end is defined as $0$.
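In code, the cell index is simply the horizontal bin of a centroid, mirrored when the target is the left edge so that index $W-1$ always denotes the target end. A minimal sketch of Eq. (6), with our own function name:

```python
def cell_index(x_viewport: float, W: int, target_right: bool) -> int:
    """Map a normalized viewport x-coordinate to a cell index (Eq. 6).

    Cell W-1 is always the cell at the target edge and cell 0 the opposite
    end, so one learned policy serves both guidance directions by symmetry.
    """
    w = min(int(x_viewport * W), W - 1)   # plain horizontal bin, clamped at x = 1
    return w if target_right else W - 1 - w

# Example: with W = 10 and the school centroid at x = 0.97, the state
# component w_real is 9 (target cell) when guiding right, 0 when guiding left.
```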
Next, we define the actions. The movement of the virtual fish at each step is determined as follows:

Step 1: Determine the target point $\mathbf{x}_{\text{virtual-target}}$.
1. Obtain the current centroid positions $\mathbf{x}_{\text{real}}$ and $\mathbf{x}_{\text{virtual}}$.
2. Determine the cell $w_{\text{virtual}}$ containing $\mathbf{x}_{\text{virtual}}$.
3. Select a cell displacement $\Delta w \in \{0, \pm 1, \pm 2, \ldots, \pm \Delta w_{\max}\}$.
4. Set the target cell $w_{\text{virtual-target}}$:

$$\tilde{w}_{\text{virtual-target}} = w_{\text{virtual}} + \Delta w, \qquad (8)$$

$$w_{\text{virtual-target}} = \begin{cases} 0 & \text{if } \tilde{w}_{\text{virtual-target}} < 0, \\ \tilde{w}_{\text{virtual-target}} & \text{if } 0 \le \tilde{w}_{\text{virtual-target}} \le W-1, \\ W-1 & \text{if } W-1 < \tilde{w}_{\text{virtual-target}}. \end{cases} \qquad (9)$$

5. Using the $x$-coordinate of the center of cell $w_{\text{virtual-target}}$ (denoted $x_{\text{virtual-target}}$) and the $y$-coordinate of the real fish centroid ($y_{\text{real}}$), set the target point:

$$\mathbf{x}_{\text{virtual-target}} = \begin{pmatrix} x_{\text{virtual-target}} \\ y_{\text{real}} \end{pmatrix}. \qquad (10)$$

Step 2: Move toward the target point until the next step. The centroid follows a first-order lag motion:

$$\frac{d\mathbf{x}_{\text{virtual}}}{dt} = \frac{1}{\tau_{\text{virtual}}} (\mathbf{x}_{\text{virtual-target}} - \mathbf{x}_{\text{virtual}}), \qquad (11)$$

$$\frac{1}{\tau_{\text{virtual}}} > 0 \ (\text{constant}). \qquad (12)$$

Since the motion of the virtual fish is determined by $\Delta w$, we define $\Delta w$ as the action. Thus, the action space is:

$$\mathcal{A} = \{0, \pm 1, \pm 2, \ldots, \pm \Delta w_{\max}\}. \qquad (13)$$

Finally, the reward $r_n$ is defined based on the $x$-coordinate of the real fish school centroid, $x_{\text{real}}$, at step $n$:

$$r_n = \begin{cases} 2 \times (x_{\text{real}} - 0.5) & \text{when guiding toward the right edge,} \\ 2 \times (0.5 - x_{\text{real}}) & \text{when guiding toward the left edge.} \end{cases} \qquad (14)$$

Thus, $r_n \in [-1, +1]$, and $r_n$ approaches $+1$ as the centroid $x_{\text{real}}$ gets closer to the target edge.
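Putting Eqs. (8)–(14) together, one step of the virtual fish's motion can be sketched as follows. This is a minimal illustration with our own names and an assumed forward-Euler integration of the lag motion of Eq. (11) at the 10 fps frame interval; cells are written in plain (unmirrored) coordinates for the right-edge case:

```python
import numpy as np

W, DW_MAX, TAU_INV, DT = 10, 2, 3.0, 0.1  # Table 1 values; DT = frame period (s)

def step_virtual_fish(x_virtual, y_real, dw, frames=10):
    """Apply action dw (Eqs. 8-10) and integrate the lag motion (Eq. 11)."""
    w = min(int(x_virtual[0] * W), W - 1)                  # current cell
    w_target = int(np.clip(w + dw, 0, W - 1))              # Eqs. (8)-(9)
    target = np.array([(w_target + 0.5) / W, y_real])      # Eq. (10): cell center
    for _ in range(frames):                                # one RL step = 1 s = 10 frames
        x_virtual = x_virtual + DT * TAU_INV * (target - x_virtual)  # Eq. (11)
    return x_virtual

def reward(x_real, target_right=True):
    """Eq. (14): +1 at the target edge, -1 at the opposite edge."""
    return 2 * (x_real - 0.5) if target_right else 2 * (0.5 - x_real)
```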
3.2 Reinforcement Learning via Q-Learning

In this section, we describe the Q-learning algorithm, again based on Sutton and Barto [2]. The agent's actions are determined probabilistically according to a policy $\pi$. The probability distribution of action $a_n$ given state $s_n$ is:

$$a_n \sim \pi(\cdot \mid s_n), \qquad (15)$$

where $\pi(a \mid s)$ represents the probability of choosing action $a$ in state $s$. The expected discounted return when following policy $\pi$ from state $s$ is defined as the state-value function:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\{ R_n \mid s_n = s \}. \qquad (16)$$

The optimal policy $\pi^*$ is defined as a policy satisfying

$$\pi^* \in \arg\max_{\pi} V^{\pi}(s) \qquad (17)$$

for all $s \in \mathcal{S}$. The goal of RL is to acquire $\pi^*$. Given the optimal action-value function $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$, where $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\{ R_n \mid s_n = s, a_n = a \}$, the optimal policy can be obtained as:

$$\pi^*(a \mid s) = \begin{cases} \dfrac{1}{|\arg\max_{a'} Q^*(s, a')|} & \text{if } a \in \arg\max_{a'} Q^*(s, a'), \\ 0 & \text{otherwise.} \end{cases} \qquad (18)$$

Based on the Bellman optimality equation,

$$Q^*(s, a) = \mathbb{E}\left\{ r_{n+1} + \gamma \max_{a'} Q^*(s_{n+1}, a') \,\middle|\, s_n = s, a_n = a \right\}, \qquad (19)$$

the function $Q(s_n, a_n)$ can be updated iteratively to approximate $Q^*$ by making it converge toward $r_{n+1} + \gamma \max_{a'} Q(s_{n+1}, a')$. The learned policy $\pi$ corresponding to this approximated $Q$ is expressed as:

$$\pi(a \mid s) = \begin{cases} \dfrac{1}{|\arg\max_{a'} Q(s, a')|} & \text{if } a \in \arg\max_{a'} Q(s, a'), \\ 0 & \text{otherwise.} \end{cases} \qquad (20)$$

Q-learning implements this through the following procedure:

Step 1: Initialize the action-value function $Q$ and observe the initial state $s_0$.

Step 2: Repeat for each step $n = 0, 1, \ldots, N-1$:
1. Select action $a$ in state $s$ based on the policy $\pi_n$ derived from $Q$.
2. Execute action $a$ and observe reward $r$ and next state $s'$.
3. Update $Q$ according to:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right), \qquad (21)$$

$$\alpha \in (0, 1) \ (\text{constant, called the "step size"}). \qquad (22)$$

4. $s \leftarrow s'$.

Regarding $\pi_n$, it is effective to perform exploration (trying various actions) in the early stages of learning and to follow the best-known policy in the later stages. Thus, we use an $\epsilon$-greedy algorithm in which $\epsilon$ is decayed at each step:

$$\epsilon_n = 1 - \frac{n}{N-1}, \qquad (23)$$

$$\pi_n(a \mid s) = \begin{cases} \dfrac{1 - \epsilon_n}{|\arg\max_{a'} Q(s, a')|} + \dfrac{\epsilon_n}{|\mathcal{A}|} & \text{if } a \in \arg\max_{a'} Q(s, a'), \\[6pt] \dfrac{\epsilon_n}{|\mathcal{A}|} & \text{otherwise.} \end{cases} \qquad (24)$$
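A compact tabular implementation of this procedure might look as follows. This is a sketch under our own naming: `env_step` is a placeholder for one second of interaction that returns the next state and the reward of Eq. (14), and states are assumed to be pre-encoded as integers (e.g., $w_{\text{real}} W + w_{\text{virtual}}$). Sampling uniformly with probability $\epsilon_n$ and greedily otherwise realizes exactly the mixture of Eq. (24):

```python
import numpy as np

def q_learning(env_step, s0, n_states, actions, N, alpha=0.1, gamma=0.9, rng=None):
    """Tabular Q-learning with the linearly decaying epsilon-greedy policy
    of Eqs. (21)-(24). `env_step(s, a)` must return (s_next, reward)."""
    rng = rng if rng is not None else np.random.default_rng()
    Q = np.zeros((n_states, len(actions)))
    s = s0
    for n in range(N):
        eps = 1.0 - n / (N - 1)                      # Eq. (23): decaying exploration
        if rng.random() < eps:                       # explore: uniform over actions
            ai = rng.integers(len(actions))
        else:                                        # exploit, ties broken randomly
            best = np.flatnonzero(Q[s] == Q[s].max())
            ai = rng.choice(best)
        s_next, r = env_step(s, actions[ai])
        # Eq. (21): move Q(s,a) toward the bootstrapped target
        Q[s, ai] += alpha * (r + gamma * Q[s_next].max() - Q[s, ai])
        s = s_next
    return Q
```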
4 Preliminary Evaluation in a Simulation Environment

When interacting with a fish school using virtual fish, the frequency with which the real fish react to the virtual fish is expected to be limited. To investigate whether an effective policy can be acquired via RL under such conditions, we perform RL in a simulation environment that mimics the movement of real fish.

Furthermore, as will be discussed in Section 5, learning an effective policy in an actual interaction system is expected to be extremely time-consuming. By using the action-value function $Q$ obtained in the simulation environment, where each time step can be processed much faster than in the real environment, as an initial value, we expect to reduce the learning time required in the real-world experiments.

4.1 Modeling the Movement of Real Fish

We simulate the behavior of real fish using an action model in which the decision to react to the virtual fish is determined probabilistically. By varying this probability, we examine whether RL can acquire effective policies even when real fish react to virtual fish infrequently. Based on observations of the fish during housing, the following behavioral characteristics were identified:

- They show minimal reaction to other individuals that are far away.
- When they do react to another individual, they often begin swimming toward that individual.
- Most of their movements can be viewed as a repetition of linear motions lasting a few seconds.

We developed a model for real fish that reflects these properties. For simplicity, we assume that the real fish always act as a collective school and simulate the motion of their centroid. The following parameters are held constant within each simulation:

- $\Delta t_{\max} > 0$: Maximum duration before the real fish switch their target point.
- $\theta > 0$: Maximum distance at which real fish react to the virtual fish.
- $p \in [0, 1]$: Probability that the real fish ignore the virtual fish even when they are sufficiently close.
- $\delta x_{\max} > 0$, $\delta y_{\max} > 0$: Maximum displacement of the target point when the virtual fish are ignored.
- $1/\tau_{\text{real}} > 0$: Inverse of the time constant of the first-order lag system for the real fish.

Let $\mathbf{x}_{\text{real}} = (x_{\text{real}}, y_{\text{real}})^{\top}$ and $\mathbf{x}_{\text{virtual}}$ represent the centroid positions of the real fish school and the virtual fish, respectively. The motion of $\mathbf{x}_{\text{real}}$ is modeled as a series of movements toward a switching target point $\mathbf{x}_{\text{real-target}}$, each governed by a first-order lag system, and is simulated as follows (a code sketch follows this procedure):

Step 1: Randomly select a duration $\Delta t \in (0, \Delta t_{\max}]$ for which the fish move linearly.

Step 2: Determine the target point $\mathbf{x}_{\text{real-target}}$ as follows:
- If $\| \mathbf{x}_{\text{virtual}} - \mathbf{x}_{\text{real}} \| \le \theta$:
  - With probability $1 - p$ (reaction case), set:
    $$\mathbf{x}_{\text{real-target}} = \mathbf{x}_{\text{virtual}}. \qquad (25)$$
  - With probability $p$ (ignore case), select random relative displacements $\delta x \in [-\delta x_{\max}, +\delta x_{\max}]$ and $\delta y \in [-\delta y_{\max}, +\delta y_{\max}]$, and set:
    $$\mathbf{x}_{\text{real-target}} = \mathbf{x}_{\text{real}} + \begin{pmatrix} \delta x \\ \delta y \end{pmatrix}. \qquad (26)$$
- Otherwise (out of range): determine $\mathbf{x}_{\text{real-target}}$ using random displacements as in Eq. (26).

Step 3: During the interval $\Delta t$, move toward $\mathbf{x}_{\text{real-target}}$ based on

$$\frac{d\mathbf{x}_{\text{real}}}{dt} = \frac{1}{\tau_{\text{real}}} (\mathbf{x}_{\text{real-target}} - \mathbf{x}_{\text{real}}). \qquad (27)$$

Step 4: Repeat Steps 1 through 3.
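The target-switching logic of Steps 1–2 can be sketched as follows (our own naming; the parameter values follow Table 1, and the first-order lag of Step 3 would be integrated as in the virtual fish sketch above):

```python
import numpy as np

rng = np.random.default_rng(0)
DT_MAX, THETA, DX_MAX, DY_MAX = 3.0, 0.3, 0.2, 0.2  # Table 1 (real-fish model)

def pick_real_target(x_real, x_virtual, p_ignore):
    """Steps 1-2: choose how long to move linearly and toward which point."""
    dt = rng.uniform(0.0, DT_MAX)                       # Step 1: duration of linear motion
    in_range = np.linalg.norm(x_virtual - x_real) <= THETA
    if in_range and rng.random() > p_ignore:            # reaction case, Eq. (25)
        target = x_virtual.copy()
    else:                                               # ignore / out-of-range case, Eq. (26)
        delta = rng.uniform([-DX_MAX, -DY_MAX], [DX_MAX, DY_MAX])
        target = x_real + delta
    return target, dt
```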
4.2 Evaluation of Robustness Against Stochastic Behavior

4.2.1 Simulation Evaluation Conditions

We evaluate whether RL can acquire effective movement policies even in scenarios where real individuals respond to the virtual stimulus infrequently. To this end, we conduct RL for each ignoring probability $p$ defined in the behavioral model in Section 4.1. We then investigate whether the resulting policy approximates the optimal policy by analyzing the rewards obtained when following it.

For each ignoring probability $p = 0, 0.3, 0.6, 0.9$, Q-learning is performed with the number of training steps $N$ varied across $\{10^2, 10^3, 10^4, 10^5, 10^6, 10^7\}$. The other parameter settings are summarized in Table 1.

Table 1: Simulation parameters. $\theta$ and $\delta x_{\max}$ are relative to the viewport width, and $\delta y_{\max}$ to the viewport height.

  Parameter                  Description                                        Value
  $\gamma$                   Discount factor (Q-learning)                       0.9
  $\alpha$                   Step size (learning rate) (Q-learning)             0.1
  $W$                        Number of cells                                    10
  $\Delta w_{\max}$          Max cell displacement per step (virtual fish)      2
  $1/\tau_{\text{virtual}}$  Inverse time constant (virtual fish)               3 s$^{-1}$
  $\Delta t_{\max}$          Max duration for target switching (real fish)      3 s
  $\theta$                   Max reaction distance (real fish)                  0.3
  $\delta x_{\max}$          Max $x$-displacement in ignore cases (real fish)   0.2
  $\delta y_{\max}$          Max $y$-displacement in ignore cases (real fish)   0.2
  $1/\tau_{\text{real}}$     Inverse time constant (real fish)                  1 s$^{-1}$

After training, the agent follows the learned policy $\pi$ (as defined in Eq. (20)) for $M$ steps to calculate the average reward $R$:

$$R = \frac{1}{M} \sum_{n=0}^{M} r_n, \qquad (28)$$

where $M = 9000$. Since both the learned policy and the resulting reward $R$ are inherently stochastic, we expect the evaluation $R$ to vary across trials. Therefore, we repeat the training and evaluation process 10 times for each combination of $p$ and $N$, and calculate the mean of $R$ across these trials.

Note that the reward $R$ obtained under an optimal policy varies depending on the ignoring probability $p$. To provide a benchmark for evaluating the learned policies, we define a heuristic policy $\hat{\pi}$ that is expected to yield near-optimal rewards. In this simulation, the maximum interaction distance $\theta$ (the range within which real fish react to the virtual fish) is set to 0.3 times the viewport width, and the number of cell divisions is $W = 10$. Consequently, a policy $\hat{\pi}$ that targets a position three cells ahead of the real fish's current cell $w_{\text{real}}$ (i.e., $w_{\text{real}} + 3$) is considered near-optimal. Specifically, given a state $(w_{\text{real}}, w_{\text{virtual}})$, the baseline policy $\hat{\pi}$ is defined as follows:

$$\Delta \tilde{w} = w_{\text{real}} + 3 - w_{\text{virtual}}, \qquad (29)$$

$$\Delta w = \begin{cases} -\Delta w_{\max} & \text{if } \Delta \tilde{w} < -\Delta w_{\max}, \\ \Delta \tilde{w} & \text{if } -\Delta w_{\max} \le \Delta \tilde{w} \le \Delta w_{\max}, \\ +\Delta w_{\max} & \text{if } \Delta w_{\max} < \Delta \tilde{w}, \end{cases} \qquad (30)$$

$$\hat{\pi}(a \mid s) = \begin{cases} 1 & \text{if } a = \Delta w, \\ 0 & \text{otherwise.} \end{cases} \qquad (31)$$

Here, the maximum displacement of the virtual fish's cell per step is set to $\Delta w_{\max} = 2$, consistent with the constraints of the RL agent. We denote the reward obtained by this baseline policy $\hat{\pi}$ over $M = 9000$ steps as $\hat{R}$. Since $\hat{R}$ is also expected to vary across trials, we calculate $\hat{R}$ 10 times for each $p$ and take the mean. If the mean reward $R$ of a learned policy is sufficiently close to the mean benchmark $\hat{R}$, we consider the acquired policy to be near-optimal.
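In code, the benchmark policy of Eqs. (29)–(31) reduces to a single clipped expression (a sketch with our own function name):

```python
import numpy as np

def baseline_action(w_real: int, w_virtual: int, dw_max: int = 2) -> int:
    """Heuristic policy of Eqs. (29)-(31): aim three cells ahead of the
    school's cell, clipped to the agent's maximum per-step displacement."""
    return int(np.clip(w_real + 3 - w_virtual, -dw_max, dw_max))
```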
4.2.2 Results and Discussion of the Simulation Evaluation

The results of the evaluation described in Section 4.2.1 are shown in Fig. 6.

Figure 6: Average reward $R$ in simulation (mean over 10 training runs) plotted against the number of training steps $N$ for $p = 0, 0.3, 0.6, 0.9$. $p$ represents the probability of the simulated real fish ignoring the virtual fish. The baseline denotes the reward obtained using $\hat{\pi}$ as defined in Eq. (31).

Regardless of the ignoring probability $p$, it can be observed that as the number of training steps $N$ increases, the average reward $R$ obtained under the learned policy approaches the reward $\hat{R}$ obtained under the baseline policy $\hat{\pi}$.

Figure 7 shows the ratio of the mean reward $R$ under the learned policy to the mean reward $\hat{R}$ under the baseline policy.

Figure 7: Ratio of the average reward of the learned policy to that of the baseline policy in simulation, plotted against $N$. $p$ represents the probability of the simulated real fish ignoring the virtual fish.

For ignoring probabilities $p = 0, 0.3$, and $0.6$, the ratio reaches approximately 1.0 when the number of training steps $N$ is sufficiently large. This suggests that the learned policies in these cases are nearly optimal. Furthermore, even in the case of $p = 0.9$, where the real fish ignore the virtual stimulus with high frequency, the ratio reaches approximately 0.7 at $N = 10^7$, indicating that learning progresses to a certain extent even under such challenging conditions.

5 Reinforcement Learning in a Real-World Environment

5.1 Acquisition of Centroid Positions for Real and Virtual Fish

The centroid position of the real fish is determined as follows. For an RGB color image, let $\mathbf{p}_{u,v} = (r_{u,v}, g_{u,v}, b_{u,v})^{\top}$ represent the pixel at column $u$ and row $v$, where $r_{u,v}$, $g_{u,v}$, and $b_{u,v}$ are the red, green, and blue values at position $(u, v)$. The entire image is denoted as $\{\mathbf{p}_{u,v}\}$. For a grayscale image, the pixel value is denoted as $p_{u,v}$, and the image as $\{p_{u,v}\}$.

Step 1: A binary image is obtained using background subtraction. Given a pre-recorded background image $\{\tilde{\mathbf{p}}_{u,v}\}$ and a threshold $\theta$, we compute for each $(u, v)$:

$$p'_{u,v} = \begin{cases} 1 & \text{if } \| \mathbf{p}_{u,v} - \tilde{\mathbf{p}}_{u,v} \| > \theta, \\ 0 & \text{otherwise.} \end{cases} \qquad (32)$$

The resulting binary image $\{p'_{u,v}\}$ is expected to have a value of 1 where real or virtual fish are present, and 0 elsewhere.

Step 2: As shown in Fig. 3, real fish appear as dark silhouettes due to the display's backlighting. This color difference is used to distinguish between real and virtual fish. Using a decision function $f$ obtained via a Support Vector Machine (SVM), we compute:

$$p''_{u,v} = \begin{cases} 1 & \text{if } p'_{u,v} = 1 \text{ and } f(\mathbf{p}_{u,v}) > 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (33)$$

The SVM is pre-trained on a dataset where $f(\mathbf{p})$ is $+1$ for real fish pixels and $-1$ for virtual fish pixels.

Step 3: The binary image $\{p''_{u,v}\}$ is processed with an opening operation (an equal number of erosions and dilations) to obtain $\{p'''_{u,v}\}$. This ensures that pixels with a value of 1 correspond only to the real fish.

Step 4: The centroid $\tilde{\mathbf{x}}_{\text{real}}$ of the pixels with value 1 in $\{p'''_{u,v}\}$ is calculated as:

$$\tilde{\mathbf{x}}_{\text{real}} = \begin{pmatrix} m_{1,0} / m_{0,0} \\ m_{0,1} / m_{0,0} \end{pmatrix}, \quad \text{where } m_{i,j} = \sum_{u,v} u^i v^j p'''_{u,v}. \qquad (34)$$

Finally, $\tilde{\mathbf{x}}_{\text{real}}$ is converted to viewport coordinates $\mathbf{x}_{\text{real}}$ using Eqs. (1) and (2).

Step 3 is designed to eliminate misclassifications originating from the SVM in Step 2. Specifically, classification errors occasionally caused certain subregions to be marked as containing a real fish ($p''_{u,v} = 1$) even when no fish was present. By applying Step 3, we removed most of these false positives.

Regarding the position of the virtual fish centroid, we set the time constant $\tau_{\text{virtual}}$ of the first-order lag system in Eq. (11) to be sufficiently small so that the movement speed is high enough. This allows us to assume that the virtual fish reach a point sufficiently close to their target $\mathbf{x}_{\text{virtual-target}}$ within a single time step. Consequently, the position of the virtual fish centroid at any given time is treated as the target point selected in the preceding step.
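Steps 1–4 can be sketched with NumPy and SciPy as follows. This is a minimal illustration under our own naming: `svm` is assumed to be a pre-trained classifier (e.g., an `sklearn.svm.SVC`) whose `decision_function` is positive on real-fish (silhouette) pixels, and the opening of Step 3 is taken from `scipy.ndimage`; the thesis does not specify the actual implementation:

```python
import numpy as np
from scipy.ndimage import binary_opening  # erosions followed by dilations

def real_fish_centroid(img, bg, svm, theta=40.0):
    """Steps 1-4: background subtraction, SVM color gating, opening, moments.
    `img` and `bg` are HxWx3 float arrays; `theta` is the Eq. (32) threshold."""
    # Step 1, Eq. (32): pixels differing strongly from the background
    fg = np.linalg.norm(img - bg, axis=2) > theta
    # Step 2, Eq. (33): keep foreground pixels classified as real fish
    scores = svm.decision_function(img.reshape(-1, 3)).reshape(fg.shape)
    real = fg & (scores > 0)
    # Step 3: morphological opening to suppress small false positives
    real = binary_opening(real, iterations=2)
    # Step 4, Eq. (34): image moments -> centroid in pixel coordinates
    v, u = np.nonzero(real)            # rows (v) and columns (u) of 1-pixels
    if u.size == 0:
        return None                    # no real fish detected in this frame
    return np.array([u.mean(), v.mean()])  # (m10/m00, m01/m00)
```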
5.2 Evaluation of the Learned Policy

We investigate whether RL can yield an effective movement policy for the virtual fish to guide fish schools in a real-world environment. Our analysis is based on the behavior of both real and virtual individuals under the learned policy.

Since training in a real-world environment is expected to be extremely time-consuming, we initialize the action-value function $Q$ using the values obtained from simulation (with an ignoring probability $p = 0.6$ and $N = 10^6$ steps) to shorten the required training duration. Reinforcement learning was then conducted for $N = 1800$ steps. The parameter settings are summarized in Table 2. Three randomly selected real fish were used for the experiment. Subsequently, 900 steps of control were performed on the same individuals following the learned policy (hereafter referred to as the "proposed method").

Table 2: Parameters for RL and control in the real-world environment.

  Parameter                  Description                                      Value
  $\gamma$                   Discount factor (Q-learning)                     0.9
  $\alpha$                   Step size (learning rate) (Q-learning)           0.1
  $W$                        Number of cell divisions                         10
  $\Delta w_{\max}$          Max cell displacement per step (virtual fish)    2
  $1/\tau_{\text{virtual}}$  Inverse time constant (virtual fish)             1 s$^{-1}$

We verify whether the proposed method successfully guides the real individuals toward the target direction and whether it is more effective than a simple, manually defined policy. We therefore investigated the behavior of the real individuals for 900 steps under each of the following baselines:

Baseline 1 (stay at edge): The virtual fish move toward the target direction and stay at the edge of the viewport once they reach it.

Baseline 2 (w/o stimulus): No virtual fish are displayed.

Here, the policy for Baseline 1 was defined as follows, with the maximum displacement per step set to $\Delta w_{\max} = 2$ to remain consistent with the proposed method:

$$\pi(a \mid s) = \begin{cases} 1 & \text{if } a = +\Delta w_{\max}, \\ 0 & \text{otherwise.} \end{cases} \qquad (35)$$

The reactions of the real fish to the virtual fish are expected to be influenced by the preceding movements of the virtual stimuli. Therefore, to avoid carry-over effects between trials, a rest period of at least 10 minutes was provided between the training phase, the evaluation of the proposed method, and each baseline experiment.

The centroids of both real and virtual fish were recorded at each time step and are summarized in Fig. 8. In the proposed method, the virtual fish move within the area between the target position and the vicinity of the real fish, appearing to guide the real individuals toward the goal.

Figure 8: Time series of the $x$-coordinate of the real fish centroid during real-world control: (a) Proposed Method, (b) Baseline 1 (stay at edge), (c) Baseline 2 (w/o stimulus).

We then generated histograms of the positions (centroids) of the real individuals for each target guidance direction. Figures 9, 10, and 11 show the results for the Proposed Method, Baseline 1, and Baseline 2, respectively. Table 3 summarizes the mean positions and the differences between them for each target direction.

Figure 9: Histograms of real fish $x$-coordinates (Proposed Method): (a) guiding toward the left ($x = 0$, mean = 0.358); (b) guiding toward the right ($x = 1$, mean = 0.525). Orange bins indicate the interval containing the mean value.

Figure 10: Histograms of real fish $x$-coordinates (Baseline 1: stay at edge): (a) guiding toward the left (mean = 0.461); (b) guiding toward the right (mean = 0.562).

Figure 11: Histograms of real fish $x$-coordinates (Baseline 2: w/o stimulus): (a) guiding toward the left (mean = 0.501); (b) guiding toward the right (mean = 0.495).

Table 3: Mean real fish $x$-coordinates for each guidance direction and the difference between them for each control method.

  Condition                    (a) Target: Left   (b) Target: Right   Difference (b − a)
  Proposed Method              0.358              0.525               0.166
  Baseline 1 (stay at edge)    0.461              0.562               0.101
  Baseline 2 (w/o stimulus)    0.501              0.495               −0.006

A comparison reveals that the magnitude of the difference in mean positions follows the order: (a) the Proposed Method, (b) Baseline 1 (stay at edge), and (c) Baseline 2 (w/o stimulus). The histograms also appear sharper in the same order.

We conducted Welch's $t$-test (two-tailed) at a significance level of 5% to determine whether the target direction resulted in a significant difference in the mean positions of the real individuals. The sample was defined as

$$\{ x_{\text{real}}(t_n) \mid \text{guided to } x = 0 \ (\text{or } x = 1) \text{ at step } n \}, \qquad (36)$$

assuming an approximately normal distribution. The resulting $p$-values are shown in Table 4.

Table 4: Results of the significance tests for the difference in the mean $x$-coordinates of real fish based on the target guidance direction. A two-tailed Welch's $t$-test was performed for each virtual fish control method to evaluate the statistical significance of the difference.

  Condition                    $p$-value
  Proposed Method              $1.46 \times 10^{-51}$
  Baseline 1 (stay at edge)    $8.48 \times 10^{-15}$
  Baseline 2 (w/o stimulus)    $5.80 \times 10^{-1}$

These values indicate that while there was no significant difference in (c) Baseline 2, significant differences were observed in both (a) the Proposed Method and (b) Baseline 1. Thus, it can be concluded that the real individuals were significantly guided in the intended directions in both (a) and (b).

Furthermore, we evaluated the degree of difference between the histograms of the $x$-coordinates for each target direction (Figs. 9 to 11) using the Bhattacharyya distance. As shown in Fig. 12, the results confirm that the difference in the distributions of the real individuals' positions follows the order: (a) the Proposed Method, (b) Baseline 1 (stay at edge), and (c) Baseline 2 (w/o stimulus).

Figure 12: Bhattacharyya distances between the $x$-coordinate distributions of real fish for leftward and rightward guidance. For each pair of histograms in Figs. 9 to 11, the effectiveness of the guidance was quantified by calculating the Bhattacharyya distance between the distributions resulting from the two target directions.
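Both statistics are available off the shelf; the following sketch (our own naming) uses SciPy's Welch option and one common definition of the Bhattacharyya distance, $D_B = -\ln \mathrm{BC}$ over normalized histograms. The thesis does not state which exact variant was used, so this should be read as an assumption:

```python
import numpy as np
from scipy.stats import ttest_ind

def direction_difference(x_left, x_right, bins=25):
    """Welch's two-tailed t-test (Table 4) and Bhattacharyya distance (Fig. 12)
    between centroid samples collected under leftward vs. rightward guidance."""
    # equal_var=False selects the Welch variant of the two-sample t-test
    t_stat, p_value = ttest_ind(x_left, x_right, equal_var=False)
    # Bhattacharyya distance between the two normalized position histograms
    edges = np.linspace(0.0, 1.0, bins + 1)
    h_l, _ = np.histogram(x_left, bins=edges)
    h_r, _ = np.histogram(x_right, bins=edges)
    h_l = h_l / h_l.sum()
    h_r = h_r / h_r.sum()
    bc = np.sum(np.sqrt(h_l * h_r))      # Bhattacharyya coefficient
    return p_value, -np.log(bc)          # distance D_B = -ln(BC)
```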
9 to 11 , the effectiveness of the guidance was quan tified by calculating the Bhattac haryya distance b et w een the distributions resulting from the tw o target directions. the order: (a) the Prop osed Metho d, (b) Baseline 1 (stay at edge), and (c) Baseline 2 (w/o stim ulus). Consequently , it is evident that the Prop osed Metho d (a) guided the real individuals more prominently than Baseline 1 (b). 6 Conclusion In this study , we in vestigated whether effective mo vemen t p olicies for virtual fish to guide fish sc ho ols could b e acquired through reinforcemen t learning (RL). First, there was a concern regarding whether RL could yield effectiv e p olicies in scenarios where the frequency of real fish resp onding to virtual stim uli is not necessarily high. T o address this, we established a b eha vioral mo del in which real individuals sto c hastically decide whether to follo w or ignore virtual fish when they are sufficiently close. Our simulation results confirmed that reasonably effectiv e p olicies could b e obtained even when the probabilit y of real individuals ignoring the virtual stimuli w as high. Second, w e v erified the feasibility of fish guidance in a real-world environmen t using the learned p olicy . The results demonstrated that when the virtual fish mov ed based on the learned p olicy , the mean positions of the real fish (cen troids) differed significantly dep ending on the target guidance direction. This confirms that the real individuals were successfully guided in the in tended directions. F urthermore, we compared the learned p olicy with a man ual baseline p olicy where the virtual fish simply mov e to and sta y at the edge of the viewp ort in the target direction. This comparison revealed that the distribution of the real fish’s p ositions changed more prominen tly under the control of the learned p olicy than under the simple manual policy . These findings confirm that RL can acquire a more effective p olicy than a simple, human-defined one. While this study demonstrated the p oten tial of RL-based fish sc ho ol control, sev eral issues remain for future inv estigation. V alidation of image-based stim uli as virtual fish: In this study , we used textures of real individuals for the visual stimuli. Ho wev er, it remains unclear whether these stimuli are 19 recognized by real fish as actual members of their school. T o clarify this, it is necessary to in vestigate whether b eha vioral resp onses differ when presen ting abstract stimuli, such as strip ed patterns or random dots, compared to the texture-mapp ed virtual fish used in our exp erimen ts. In vestigation of the n umber of individuals In this study , the num b ers of b oth virtual and real fish were fixed throughout the exp erimen ts. How ev er, tow ard the future goal of guiding large-scale sc ho ols with a small n um b er of virtual or rob otic agents, it is necessary to examine how the in teraction dynamics c hange when the resp ectiv e num b ers of real and virtual individuals are v aried. Examination of state representation in RL The state defined in this study was based on the resp ectiv e centroids of the real fish sc ho ol and the virtual fish group. Ho wev er, ev en if the sc ho ol is at the same position, the optimal action is exp ected to differ dep ending on whether the school is mo ving to ward or a wa y from the target direction. Therefore, p oten tial improv emen ts include incorporating information regarding the school’s velocity in to the state represen tation. 
Furthermore, for the future goal of interacting with only a subset of individuals within a large-scale school, it will be necessary to target specific individuals that exert a significant influence on the collective motion. Defining the state in such scenarios would require incorporating more microscopic information, such as the spatial relationships among individuals and the status or strength of inter-individual interactions. On the other hand, there is a trade-off: the training duration for RL increases as the size of the state space grows. Therefore, it is necessary to devise methods that incorporate the aforementioned information while effectively managing the number of elements in the state set.

Examination of the action representation in RL: In the experiments conducted in this study, the movement of the four virtual fish was designed such that they moved as a single coordinated unit while maintaining their relative positions. However, to achieve more effective interactions with the same number of agents, it is expected that having each virtual fish move independently to interact with different individuals within the school would be more efficient. On the other hand, similar to the challenges faced with the state space, it is necessary to devise methods that enable independent movement for each virtual fish while limiting the size of the action space to a manageable level.

Ethical Approval

All animal experiments were conducted with the approval of the Graduate School of Informatics, Kyoto University (Approval No. Inf-K17007, March 21, 2017).

References

[1] I. D. Couzin, J. Krause, R. James, G. D. Ruxton, and N. R. Franks, Collective memory and spatial sorting in animal groups, Journal of Theoretical Biology, Vol. 218, No. 1, (2002), pp. 1–11.
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, (1998).
[3] Y. Kuramoto, Self-entrainment of a population of coupled nonlinear oscillators, International Symposium on Mathematical Problems in Theoretical Physics (ed. H. Araki), Vol. 39, (1975), pp. 420–422.
[4] F. A. Rodrigues, T. K. D. Peron, P. Ji, and J. Kurths, The Kuramoto model in complex networks, Physics Reports, Vol. 610, (2016), pp. 1–98.
[5] C. W. Reynolds, Flocks, herds, and schools: A distributed behavioral model, Computer Graphics, Vol. 21, No. 4, (1987), pp. 25–34.
[6] D. T. Swain, I. D. Couzin, and N. E. Leonard, Real-time feedback-controlled robotic fish for behavioral experiments with fish schools, Proceedings of the IEEE, Vol. 100, No. 1, (2012), pp. 150–163.
[7] V. Kopman, J. Laut, G. Polverino, and M. Porfiri, Closed-loop control of zebrafish response using a bioinspired robotic-fish in a preference test, Journal of The Royal Society Interface, Vol. 10, No. 78, (2013).
[8] H. Kawashima, Y. Kanechika, and T. Matsuyama, Camera-display system for the interaction analysis of live fish vs fish-like graphics, The 17th Meeting on Image Recognition and Understanding (MIRU 2014), (2014).
[9] G. P. Arnold, Rheotropism in fishes, Biological Reviews, Vol. 49, No. 4, (1974), pp. 515–576.
