Using Fast Weights to Attend to the Recent Past

Jimmy Ba, University of Toronto, jimmy@psi.toronto.edu
Geoffrey Hinton, University of Toronto and Google Brain, geoffhinton@google.com
Volodymyr Mnih, Google DeepMind, vmnih@google.com
Joel Z. Leibo, Google DeepMind, jzl@google.com
Catalin Ionescu, Google DeepMind, cdi@google.com

Abstract

Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: neural activities that represent the current or recent input, and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different time-scales and this suggests that artificial neural networks might benefit from variables that change slower than activities but much faster than the standard weights. These “fast weights” can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models. By using fast weights we can avoid the need to store copies of neural activity patterns.

1 Introduction

Ordinary recurrent neural networks typically have two types of memory that have very different time scales, very different capacities and very different computational roles. The history of the sequence currently being processed is stored in the hidden activity vector, which acts as a short-term memory that is updated at every time step. The capacity of this memory is $O(H)$ where $H$ is the number of hidden units. Long-term memory about how to convert the current input and hidden vectors into the next hidden vector and a predicted output vector is stored in the weight matrices connecting the hidden units to themselves and to the inputs and outputs.
These matrices are typically updated at the end of a sequence and their capacity is $O(H^2) + O(IH) + O(HO)$ where $I$ and $O$ are the numbers of input and output units.

Long short-term memory networks [Hochreiter and Schmidhuber, 1997] are a more complicated type of RNN that work better for discovering long-range structure in sequences for two main reasons. First, they compute increments to the hidden activity vector at each time step rather than recomputing the full vector (footnote 1). This encourages information in the hidden states to persist for much longer. Second, they allow the hidden activities to determine the states of gates that scale the effects of the weights. These multiplicative interactions allow the effective weights to be dynamically adjusted by the input or hidden activities via the gates. However, LSTMs are still limited to a short-term memory capacity of $O(H)$ for the history of the current sequence.

Footnote 1: This assumes the “remember gates” of the LSTM memory cells are set to one.

Until recently, there was surprisingly little practical investigation of other forms of memory in recurrent nets, despite strong psychological evidence that such memory exists and obvious computational reasons why it is needed. There were occasional suggestions that neural networks could benefit from a third form of memory that has much higher storage capacity than the neural activities but much faster dynamics than the standard slow weights. This memory could store information specific to the history of the current sequence so that this information is available to influence the ongoing processing without using up the memory capacity of the hidden activities. Hinton and Plaut [1987] suggested that fast weights could be used to allow true recursion in a neural network and Schmidhuber [1993] pointed out that a system of this kind could be trained end-to-end using backpropagation, but neither of these papers actually implemented this method of achieving recursion.

2 Evidence from physiology that temporary memory may not be stored as neural activities

Processes like working memory, attention, and priming operate on timescales of 100 ms to minutes. This is simultaneously too slow to be mediated by neural activations without dynamical attractor states (10 ms timescale) and too fast for long-term synaptic plasticity mechanisms to kick in (minutes to hours). While artificial neural network research has typically focused on methods to maintain temporary state in activation dynamics, that focus may be inconsistent with evidence that the brain also, or perhaps primarily, maintains temporary state information by short-term synaptic plasticity mechanisms [Tsodyks et al., 1998, Abbott and Regehr, 2004, Barak and Tsodyks, 2007].

The brain implements a variety of short-term plasticity mechanisms that operate on intermediate timescales. For example, short-term facilitation is implemented by leftover $[\mathrm{Ca}^{2+}]$ in the axon terminal after depolarization, while short-term depression is implemented by presynaptic neurotransmitter depletion [Zucker and Regehr, 2002]. Spike-time dependent plasticity can also be invoked on this timescale [Markram et al., 1997, Bi and Poo, 1998]. These plasticity mechanisms are all synapse-specific. Thus they are more accurately modeled by a memory with $O(H^2)$ capacity than by the $O(H)$ of standard recurrent artificial neural nets and LSTMs.

3 Fast Associative Memory

One of the main preoccupations of neural network research in the 1970s and early 1980s [Willshaw et al., 1969, Kohonen, 1972, Anderson and Hinton, 1981, Hopfield, 1982] was the idea that memories were not stored by somehow keeping copies of patterns of neural activity.
Instead, these patterns were reconstructed when needed from information stored in the weights of an associative network, and the very same weights could store many different memories.

An auto-associative memory that has $N^2$ weights cannot be expected to store more than $N$ real-valued vectors with $N$ components each. How close we can come to this upper bound depends on which storage rule we use. Hopfield nets use a simple, one-shot, outer-product storage rule and achieve a capacity of approximately $0.15N$ binary vectors using weights that require $\log(N)$ bits each. Much more efficient use can be made of the weights by using an iterative, error-correction storage rule to learn weights that can retrieve each bit of a pattern from all the other bits [Gardner, 1988], but for our purposes maximizing the capacity is less important than having a simple, non-iterative storage rule. We will therefore use an outer-product rule to store hidden activity vectors in fast weights that decay rapidly. The usual weights in an RNN will be called slow weights and they will learn by stochastic gradient descent in an objective function, taking into account the fact that changes in the slow weights will lead to changes in what gets stored automatically in the fast associative memory.

A fast associative memory has several advantages when compared with the type of memory assumed by a Neural Turing Machine (NTM) [Graves et al., 2014], Neural Stack [Grefenstette et al., 2015], or Memory Network [Weston et al., 2014]. First, it is not at all clear how a real brain would implement the more exotic structures in these models, e.g., the tape of the NTM, whereas it is clear that the brain could implement a fast associative memory in synapses with the appropriate dynamics. Second, in a fast associative memory there is no need to decide where or when to write to memory and where or when to read from memory.
The fast memory is updated all the time and the writes are all superimposed on the same fast-changing component of the strength of each synapse.

Every time the input changes there is a transition to a new hidden state which is determined by a combination of three sources of information: the new input via the slow input-to-hidden weights, $C$; the previous hidden state via the slow transition weights, $W$; and the recent history of hidden state vectors via the fast weights, $A$. The effect of the first two sources of information on the new hidden state can be computed once and then maintained as a sustained boundary condition for a brief iterative settling process which allows the fast weights to influence the new hidden state. Assuming that the fast weights decay exponentially, we now show that the effect of the fast weights on the hidden vector during an iterative settling phase is to provide an additional input that is proportional to the sum over all recent hidden activity vectors of the scalar product of that recent hidden vector with the current hidden activity vector, with each term in this sum being weighted by the decay rate raised to the power of how long ago that hidden vector occurred. So fast weights act like a kind of attention to the recent past, but with the strength of the attention being determined by the scalar product between the current hidden vector and the earlier hidden vector rather than being determined by a separate parameterized computation of the type used in neural machine translation models [Bahdanau et al., 2015].

[Figure 1: The fast associative memory model. Diagram labels: sustained boundary condition, slow transition weights, fast transition weights.]

The update rule for the fast memory weight matrix, $A$, is simply to multiply the current fast weights by a decay rate, $\lambda$, and add the outer product of the hidden state vector, $h(t)$, multiplied by a learning rate, $\eta$:

$$A(t) = \lambda A(t-1) + \eta\, h(t)\, h(t)^{\top} \tag{1}$$

The next vector of hidden activities, $h(t+1)$, is computed in two steps. The “preliminary” vector $h_0(t+1)$ is determined by the combined effects of the input vector $x(t)$ and the previous hidden vector: $h_0(t+1) = f(W h(t) + C x(t))$, where $W$ and $C$ are slow weight matrices and $f(\cdot)$ is the nonlinearity used by the hidden units. The preliminary vector is then used to initiate an “inner loop” iterative process which runs for $S$ steps and progressively changes the hidden state into $h(t+1) = h_S(t+1)$:

$$h_{s+1}(t+1) = f\big([W h(t) + C x(t)] + A(t)\, h_s(t+1)\big), \tag{2}$$

where the terms in square brackets are the sustained boundary conditions. In a real neural net, $A$ could be implemented by rapidly changing synapses, but in a computer simulation that uses sequences which have fewer time steps than the dimensionality of $h$, $A$ will be of less than full rank and it is more efficient to compute the term $A(t)\, h_s(t+1)$ without ever computing the full fast weight matrix, $A$. Assuming $A$ is 0 at the beginning of the sequence,

$$A(t) = \eta \sum_{\tau=1}^{t} \lambda^{t-\tau}\, h(\tau)\, h(\tau)^{\top} \tag{3}$$

$$A(t)\, h_s(t+1) = \eta \sum_{\tau=1}^{t} \lambda^{t-\tau}\, h(\tau)\, \big[h(\tau)^{\top} h_s(t+1)\big] \tag{4}$$

The term in square brackets is just the scalar product of an earlier hidden state vector, $h(\tau)$, with the current hidden state vector, $h_s(t+1)$, during the iterative inner loop. So at each iteration of the inner loop, the fast weight matrix is exactly equivalent to attending to past hidden vectors in proportion to their scalar product with the current hidden vector, weighted by a decay factor.
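
As an illustration of Equations (1), (3) and (4), the following is a minimal NumPy sketch, not the authors' implementation. It builds $A(t)$ with the recurrence of Equation (1) and, separately, computes $A(t)\,h_s(t+1)$ as a decayed, dot-product-weighted sum over stored hidden vectors as in Equation (4). The function names, the sequence length, the number of hidden units and the random hidden vectors are all assumptions made for the example; $\lambda = 0.95$ and $\eta = 0.5$ are the default values quoted in Section 3.1.

```python
import numpy as np


def fast_weights_matrix(hs, lam, eta):
    """Apply Eq. (1) over a sequence of hidden vectors, starting from A = 0.

    hs has shape (t, H); row tau - 1 holds h(tau).
    """
    num_hidden = hs.shape[1]
    A = np.zeros((num_hidden, num_hidden))
    for h in hs:
        A = lam * A + eta * np.outer(h, h)
    return A


def attention_readout(hs, query, lam, eta):
    """Compute A(t) @ query via Eq. (4), without ever forming A."""
    t = hs.shape[0]
    decay = lam ** np.arange(t - 1, -1, -1)  # lambda^(t - tau) for tau = 1..t
    scores = hs @ query                      # scalar products h(tau)^T query
    return eta * (decay * scores) @ hs       # weighted sum of stored h(tau)


# Hypothetical toy data: 5 time steps, 8 hidden units.
rng = np.random.default_rng(0)
hs = rng.standard_normal((5, 8))
query = rng.standard_normal(8)               # stands in for h_s(t + 1)

A = fast_weights_matrix(hs, lam=0.95, eta=0.5)
direct = A @ query
via_attention = attention_readout(hs, query, lam=0.95, eta=0.5)
print(np.allclose(direct, via_attention))
```

The two results should agree up to floating-point error. The second form only needs the stored hidden vectors, and it is this form that is compatible with mini-batches.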
During the inner loop iterations, attention will become more focussed on past hidden states that manage to attract the current hidden state.

The equivalence between using a fast weight matrix and comparing with a set of stored hidden state vectors is very helpful for computer simulations. It allows us to explore what can be done with fast weights without incurring the huge penalty of having to abandon the use of mini-batches during training. At first sight, mini-batches cannot be used because the fast weight matrix is different for every sequence, but comparing with a set of stored hidden vectors does allow mini-batches.

3.1 Layer normalized fast weights

A potential problem with fast associative memory is that the scalar product of two hidden vectors could vanish or explode depending on the norm of the hidden vectors. Recently, layer normalization [Ba et al., 2016] has been shown to be very effective at stabilizing the hidden state dynamics in RNNs and reducing training time. Layer normalization is applied to the vector of summed inputs to all the recurrent units at a particular time step. It uses the mean and variance of the components of this vector to re-center and re-scale those summed inputs. Then, before applying the nonlinearity, it includes a learned, neuron-specific bias and gain. We apply layer normalization to the fast associative memory as follows:

$$h_{s+1}(t+1) = f\big(\mathrm{LN}[W h(t) + C x(t) + A(t)\, h_s(t+1)]\big) \tag{5}$$

where $\mathrm{LN}[\cdot]$ denotes layer normalization. We found that applying layer normalization on each iteration of the inner loop makes the fast associative memory more robust to the choice of learning rate and decay hyper-parameters. For the rest of the paper, fast weight models are trained using layer normalization and the outer product learning rule with a fast learning rate of 0.5 and a decay rate of 0.95, unless otherwise noted.
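
The inner loop of Equation (5) can be sketched as follows. This is a hedged NumPy illustration rather than the code used for the experiments. The function names, the `eps` constant, the choice of ReLU for $f$, and the decision to leave the preliminary vector $h_0(t+1)$ un-normalized are assumptions; the text does not specify them at this point.

```python
import numpy as np


def layer_norm(z, gain, bias, eps=1e-5):
    """Re-center and re-scale z using its own mean and standard deviation,
    then apply a per-unit gain and bias (eps is an assumed stabilizer)."""
    return gain * (z - z.mean()) / (z.std() + eps) + bias


def inner_loop(boundary, A, gain, bias, steps=1):
    """Run S = steps iterations of Eq. (5).

    boundary is the sustained boundary condition W h(t) + C x(t).
    """
    relu = lambda z: np.maximum(z, 0.0)
    h_s = relu(boundary)  # preliminary vector h_0(t + 1), un-normalized here
    for _ in range(steps):
        h_s = relu(layer_norm(boundary + A @ h_s, gain, bias))
    return h_s


# Hypothetical toy setup: 8 hidden units, fast weights from 3 stored vectors.
rng = np.random.default_rng(0)
num_hidden = 8
boundary = rng.standard_normal(num_hidden)
stored = rng.standard_normal((3, num_hidden))
A = sum(0.5 * np.outer(h, h) for h in stored)
gain, bias = np.ones(num_hidden), np.zeros(num_hidden)

h_next = inner_loop(boundary, A, gain, bias, steps=1)
h_next_scaled = inner_loop(boundary, 1000.0 * A, gain, bias, steps=1)
print(h_next.max(), h_next_scaled.max())
```

With unit gain and zero bias, the normalized pre-activations have roughly unit variance whatever the scale of `A`, so the hidden activities stay bounded even when the fast weights are multiplied by 1000. This is one way to see why normalization reduces sensitivity to $\eta$ and $\lambda$.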

4 Experimental results

To demonstrate the effectiveness of the fast associative memory, we first investigated the problems of associative retrieval (section 4.1) and MNIST classification (section 4.2). We compared fast weight models to regular RNNs and LSTM variants. We then applied the proposed fast weights to a facial expression recognition task, using a fast associative memory model to store the results of processing at one level while examining a sequence of details at a finer level (section 4.3). The hyper-parameters of the experiments were selected through grid search on the validation set. All the models were trained using mini-batches of size 128 and the Adam optimizer [Kingma and Ba, 2014]. A description of the training protocols and the hyper-parameter settings we used can be found in the Appendix. Lastly, we show that fast weights can also be used effectively to implement reinforcement learning agents with memory (section 4.4).

4.1 Associative retrieval

We start by demonstrating that the method we propose for storing and retrieving temporary memories works effectively for a toy task to which it is very well suited. Consider a task where multiple key-value pairs are presented in a sequence. At the end of the sequence, one of the keys is presented and the model must predict the value that was temporarily associated with the key. We used strings that contained characters from the English alphabet, together with the digits 0 to 9. To construct a training sequence, we first randomly sample a character from the alphabet without replacement. This is the first key. Then a single digit is sampled as the associated value for that key. After generating a sequence of $K$ character-digit pairs, one of the $K$ different characters is selected at random as the query and the network must predict the associated digit.
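
The sampling procedure just described can be sketched in a few lines of standard-library Python. This is an illustrative reconstruction, not the authors' data pipeline. The function name, the seed, the use of lowercase letters, and the choice of four pairs with a “??” separator are assumptions, the last two read off the paper's example strings.

```python
import random
import string


def make_example(num_pairs, rng):
    """Return (input_string, target_digit) for one retrieval sequence."""
    # Keys are distinct letters (sampling without replacement).
    keys = rng.sample(string.ascii_lowercase, num_pairs)
    # Each key gets one digit as its value; digits may repeat across keys.
    values = [rng.choice(string.digits) for _ in keys]
    # One of the keys is chosen uniformly at random as the query.
    query_index = rng.randrange(num_pairs)
    pairs = "".join(key + value for key, value in zip(keys, values))
    return pairs + "??" + keys[query_index], values[query_index]


rng = random.Random(0)  # arbitrary seed for reproducibility
text, target = make_example(4, rng)
print(text, target)
```

Calling `make_example` repeatedly with the same `rng` would produce the training, validation and test sets; how the paper split or de-duplicated them is not stated.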
Some examples of such string sequences and their targets are shown below:

| Input string | Target |
|---|---|
| c9k8j3f1??c | 9 |
| j0a5s5z2??a | 5 |

where ‘?’ is the token to separate the query from the key-value pairs. We generated 100,000 training examples, 10,000 validation examples and 20,000 test examples. To solve this task, a standard RNN has to end up with hidden activities that somehow store all of the key-value pairs after the keys and values are presented sequentially. This makes it a significant challenge for models only using slow weights.

We used a neural network with a single recurrent layer for this experiment. The recurrent network processes the input sequence one character at a time. The input character is first converted into a learned 100-dimensional embedding vector which then provides input to the recurrent layer (footnote 2). The output of the recurrent layer at the end of the sequence is then processed by another hidden layer of 100 ReLUs before the final softmax layer.

| Model | R=20 | R=50 | R=100 |
|---|---|---|---|
| IRNN | 62.11% | 60.23% | 0.34% |
| LSTM | 60.81% | 1.85% | 0% |
| A-LSTM | 60.13% | 1.62% | 0% |
| Fast weights | 1.81% | 0% | 0% |

Table 1: Classification error rate comparison on the associative retrieval task (R is the number of recurrent units).

[Figure 2: Comparison of the test log likelihood on the associative retrieval task with 50 recurrent hidden units. Axes: negative log likelihood against updates (x 5000). Curves: A-LSTM 50, IRNN 50, LSTM 50, FW 50.]

We augment the ReLU RNN with a fast associative memory and compare it to an LSTM model with the same architecture. Although the original LSTMs do not have explicit long-term storage capacity, recent work from Danihelka et al. [2016] extended LSTMs by adding complex associative memory. In our experiments, we compared fast associative memory to both LSTM variants.

Figure 2 and Table 1 show that when the number of recurrent units is small, the fast associative memory significantly outperforms the LSTMs with the same number of recurrent units.
The result fits with our hypothesis that the fast associative memory allows the RNN to use its recurrent units more effectively. In addition to having higher retrieval accuracy, the model with fast weights also converges faster than the LSTM models.

4.2 Integrating glimpses in visual attention models

Despite their many successes, convolutional neural networks are computationally expensive and the representations they learn can be hard to interpret. Recently, visual attention models [Mnih et al., 2014, Ba et al., 2015, Xu et al., 2015] have been shown to overcome some of the limitations in ConvNets. One can understand what signals the algorithm is using by seeing where the model is looking. Also, the visual attention model is able to selectively focus on important parts of visual space and thus avoid any detailed processing of much of the background clutter. In this section, we show that visual attention models can use fast weights to store information about object parts, though we use a very restricted set of glimpses that do not correspond to natural parts of the objects.

Given an input image, a visual attention model computes a sequence of glimpses over regions of the image. The model not only has to determine where to look next, but also has to remember what it has seen so far in its working memory so that it can make the correct classification later. Visual attention models can learn to find multiple objects in a large static input image and classify them correctly, but the learnt glimpse policies are typically over-simplistic: they only use a single scale of glimpses and they tend to scan over the image in a rigid way. Human eye movements and fixations are far more complex.
The ability to focus on different parts of a whole object at different scales allows humans to apply the very same knowledge in the weights of the network at many different scales, but it requires some form of temporary memory to allow the network to integrate what it discovered in a set of glimpses. Improving the model’s ability to remember recent glimpses should help the visual attention model to discover non-trivial glimpse policies. Because the fast weights can store all the glimpse information in the sequence, the hidden activity vector is freed up to learn how to intelligently integrate visual information and retrieve the appropriate memory content for the final classifier.

To explicitly verify that larger memory capacity is beneficial to visual attention-based models, we simplify the learning process in the following way. First, we provide a pre-defined glimpse control signal so the model knows where to attend rather than having to learn the control policy through reinforcement learning. Second, we introduce an additional control signal to the memory cells so the attention model knows when to store the glimpse information. A typical visual attention model is complex and has high variance in its performance due to the need to learn the policy network and the classifier at the same time.

Footnote 2: To make the architecture for this task more similar to the architecture for the next task we first compute a 50-dimensional embedding vector and then expand this to a 100-dimensional embedding.

[Figure 3: The multi-level fast associative memory model. Diagram labels: integration transition weights, slow transition weights, fast transition weights, update fast weights and wipe out hidden state.]

| Model | 50 features | 100 features | 200 features |
|---|---|---|---|
| IRNN | 12.95% | 1.95% | 1.42% |
| LSTM | 12% | 1.55% | 1.10% |
| ConvNet | 1.81% | 1.00% | 0.9% |
| Fast weights | 7.21% | 1.30% | 0.85% |

Table 2: Classification error rates on MNIST.

Our simplified learning procedure enables us to discern the performance improvement contributed by using fast weights to remember the recent past.

We consider a simple recurrent visual attention model that has a similar architecture to the RNN from the previous experiment. It does not predict where to attend but rather is given a fixed sequence of locations: the static input image is broken down into four non-overlapping quadrants recursively with two scale levels. The four coarse regions, down-sampled to 7 × 7, along with their four 7 × 7 quadrants are presented in a single sequence as shown in Figure 1. Notice that the two glimpse scales form a two-level hierarchy in the visual space. In order to solve this task successfully, the attention model needs to integrate the glimpse information from different levels of the hierarchy. One solution is to use the model’s hidden states to both store and integrate the glimpses of different scales. A much more efficient solution is to use a temporary “cache” to store any of the unfinished glimpse computation when processing the glimpses from a finer scale in the hierarchy. Once the computation is finished at that scale, the results can be integrated with the partial results at the higher level by “popping” the previous result from the “cache”. Fast weights, therefore, can act as a neurally plausible “cache” for storing partial results. The slow weights of the same model can then specialize in integrating glimpses at the same scale. Because the slow weights are shared for all glimpse scales, the model should be able to store the partial results at several levels in the same set of fast weights, though we have only demonstrated the use of fast weights for storage at a single level.

We evaluated the multi-level visual attention model on the MNIST handwritten digit dataset. MNIST is a well-studied problem on which many other techniques have been benchmarked.
It contains the ten classes of handwritten digits, ranging from 0 to 9. The task is to predict the class label of an isolated and roughly normalized 28 × 28 image of a digit. The glimpse sequence, in this case, consists of 24 patches of 7 × 7 pixels.

Table 2 compares classification results for a ReLU RNN with a multi-level fast associative memory against an LSTM that gets the same sequence of glimpses. Again the result shows that when the number of hidden units is limited, fast weights give a significant improvement over the other models. As we increase the memory capacities, the multi-level fast associative memory consistently outperforms the LSTM in classification accuracy.

Unlike models that must integrate a sequence of glimpses, convolutional neural networks process all the glimpses in parallel and use layers of hidden units to hold all their intermediate computational results. We further demonstrate the effectiveness of the fast weights by comparing to a three-layer convolutional neural network that uses the same patches as the glimpses presented to the visual attention model. From Table 2, we see that the multi-level model with fast weights reaches a very similar performance to the ConvNet model without requiring any biologically implausible weight sharing.

4.3 Facial expression recognition

[Figure 4: Examples of the near frontal faces from the MultiPIE dataset.]

| | IRNN | LSTM | ConvNet | Fast weights |
|---|---|---|---|---|
| Test accuracy | 81.11 | 81.32 | 88.23 | 86.34 |

Table 3: Classification accuracy comparison on the facial expression recognition task.

To further investigate the benefits of using fast weights in the multi-level visual attention model, we performed facial expression recognition tasks on the CMU Multi-PIE face database [Gross et al., 2010]. The dataset was preprocessed to align each face by eyes and nose fiducial points. It was downsampled to 48 × 48 greyscale.
The full dataset contains 15 photos taken from cameras with different viewpoints for each illumination × expression × identity × session condition. We used only the images taken from the three central cameras corresponding to −15°, 0° and 15° views, since facial expressions were not discernible from the more extreme viewpoints. The resulting dataset contained more than 100,000 images. 317 identities appeared in the training set with the remaining 20 identities in the test set.

Given the input face image, the goal is to classify the subject’s facial expression into one of six categories: neutral, smile, surprise, squint, disgust and scream. The task is more realistic and challenging than the previous MNIST experiments. Not only does the dataset have unbalanced numbers of labels, but some of the expressions, for example squint and disgust, are very hard to distinguish. In order to perform well on this task, the models need to generalize over different lighting conditions and viewpoints. We used the same multi-level attention model as in the MNIST experiments with 200 recurrent hidden units. The model sequentially attends to non-overlapping 12 × 12 pixel patches at two different scales and there are, in total, 24 glimpses. Similarly, we designed a two-layer ConvNet that has 12 × 12 receptive fields.

From Table 3, we see that the multi-level fast weights model that knows when to store information outperforms the LSTM and the IRNN. The results are consistent with the previous MNIST experiments. However, the ConvNet is able to perform better than the multi-level attention model on this near frontal face dataset. We think the efficient weight-sharing and architectural engineering in the ConvNet, combined with the simultaneous availability of all the information at each level of processing, allows the ConvNet to generalize better in this task.
Our use of a rigid and predetermined policy for where to glimpse eliminates one of the main potential advantages of the multi-level attention model: it can process informative details at high resolution whilst ignoring most of the irrelevant details. To realize this advantage we will need to combine the use of fast weights with the learning of complicated policies.

[Figure 5: (a) Sample screen from the game “Catch”. (b) Performance curves for Catch with N = 16, M = 3. (c) Performance curves for Catch with N = 24, M = 5. Panels (b) and (c) plot average reward against steps for the RNN, RNN+FW and LSTM agents.]

4.4 Agents with memory

While different kinds of memory and attention have been studied extensively in the supervised learning setting [Graves, 2014, Mnih et al., 2014, Bahdanau et al., 2015], the use of such models for learning long range dependencies in reinforcement learning has received less attention.

We compare different memory architectures on a partially observable variant of the game “Catch” described in [Mnih et al., 2014]. The game is played on an N × N screen of binary pixels and each episode consists of N frames. Each trial begins with a single pixel, representing a ball, appearing somewhere in the first row, and a two-pixel “paddle” controlled by the agent in the bottom row. After observing a frame, the agent gets to either keep the paddle stationary or move it right or left by one pixel. The ball descends by a single pixel after each frame. The episode ends when the ball pixel reaches the bottom row and the agent receives a reward of +1 if the paddle touches the ball and a reward of −1 if it doesn’t. Solving the fully observable task is straightforward and requires the agent to move the paddle to the column with the ball.
We make the task partially observable by providing the agent blank observations after the M-th frame. Solving the partially observable version of the game requires remembering the position of the paddle and ball after M frames and moving the paddle to the correct position using the stored information.

We used the recently proposed asynchronous advantage actor-critic method [Mnih et al., 2016] to train agents with three types of memory on different sizes of the partially observable Catch task. The three agents included a ReLU RNN, an LSTM, and a fast weights RNN. Figure 5 shows learning progress of the different agents on two variants of the game, N = 16, M = 3 and N = 24, M = 5. The agent using the fast weights architecture as its policy representation (shown in green) is able to learn faster than the agents using ReLU RNN or LSTM to represent the policy. The improvement obtained by fast weights is also more significant on the larger version of the game, which requires more memory.

5 Conclusion

This paper contributes to machine learning by showing that the performance of RNNs on a variety of different tasks can be improved by introducing a mechanism that allows each new state of the hidden units to be attracted towards recent hidden states in proportion to their scalar products with the current state. Layer normalization makes this kind of attention work much better. This is a form of attention to the recent past that is somewhat similar to the attention mechanism that has recently been used to dramatically improve the sequence-to-sequence RNNs used in machine translation.

The paper has interesting implications for computational neuroscience and cognitive science.
The ability of people to recursi vely apply the very same knowledge and processing apparatus to a whole sentence and to an embedded clause within that sentence or to a complex object and to a major part of that object has long been used to ar gue that neural networks are not a good model of higher-le vel cognitiv e abilities. By using fast weights to implement an associative memory for the recent past, we hav e sho wn ho w the states of neurons could be freed up so that the knowledge in the connections of a neural network can be applied recursi vely . This ov ercomes the objection that these models can only do recursion by storing copies of neural activity v ectors, which is biologically implausible. 8 References Sepp Hochreiter and J ¨ urgen Schmidhuber . Long short-term memory . Neural computation , 9(8):1735–1780, 1997. Geoffre y E Hinton and David C Plaut. Using fast weights to deblur old memories. In Pr oceedings of the ninth annual conference of the Cognitive Science Society , pages 177–186. Erlbaum, 1987. J Schmidhuber . Reducing the ratio between learning complexity and number of time varying v ariables in fully recurrent nets. In ICANN93 , pages 460–463. Springer, 1993. Misha Tsodyks, Klaus Pawelzik, and Henry Markram. Neural networks with dynamic synapses. Neural computation , 10(4):821–835, 1998. LF Abbott and W ade G Regehr . Synaptic computation. Nature , 431(7010):796–803, 2004. Omri Barak and Misha Tsodyks. Persistent activity in neural networks with dynamic synapses. PLoS Comput Biol , 3(2):e35, 2007. Robert S Zucker and W ade G Regehr . Short-term synaptic plasticity . Annual revie w of physiology , 64(1): 355–405, 2002. Henry Markram, Joachim L ¨ ubke, Michael Frotscher, and Bert Sakmann. Regulation of synaptic efﬁcac y by coincidence of postsynaptic aps and epsps. Science , 275(5297):213–215, 1997. Guo-qiang Bi and Mu-ming Poo. 
Synaptic modiﬁcations in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. The Journal of neur oscience , 18(24):10464– 10472, 1998. David J W illshaw , O Peter Buneman, and Hugh Christopher Longuet-Higgins. Non-holographic associativ e memory . Natur e , 1969. T euvo K ohonen. Correlation matrix memories. Computers, IEEE T ransactions on , 100(4):353–359, 1972. James A Anderson and Geoffre y E Hinton. Models of information processing in the brain. P arallel models of associative memory , pages 9–48, 1981. John J Hopﬁeld. Neural networks and physical systems with emergent collecti ve computational abilities. Pr o- ceedings of the national academy of sciences , 79(8):2554–2558, 1982. Elizabeth Gardner . The space of interactions in neural network models. Journal of physics A: Mathematical and general , 21(1):257, 1988. Alex Graves, Greg W ayne, and Iv o Danihelka. Neural turing machines. arXiv preprint , 2014. Edward Grefenstette, Karl Moritz Hermann, Mustafa Sule yman, and Phil Blunsom. Learning to transduce with unbounded memory . In Advances in Neural Information Pr ocessing Systems , pages 1819–1827, 2015. Jason W eston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv pr eprint arXiv:1410.3916 , 2014. D. Bahdanau, K. Cho, and Y . Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Repr esentations , 2015. J. Ba, R. Kiros, and G. Hinton. Layer normalization. arXiv:1607.06450, 2016. D. Kingma and J. L. Ba. Adam: a method for stochastic optimization. arXiv:1412.6980, 2014. Ivo Danihelka, Greg W ayne, Benigno Uria, Nal Kalchbrenner , and Alex Grav es. Associative long short-term memory . arXiv preprint , 2016. V . Mnih, N. Heess, A. Graves, and K. Ka vukcuoglu. Recurrent models of visual attention. In Neural Informa- tion Pr ocessing Systems , 2014. J. Ba, V . Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. 
In International Conference on Learning Representations, 2015.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, 2015.
Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.
A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2014.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

Supplementary Material

A Experimental details

A.1 Associative retrieval

We used a recurrent neural network with a single hidden layer that takes a 100-dimensional embedding vector as its input. We compared the fast weights memory against three other RNN architectures: IRNN, standard LSTM and associative LSTM. The non-recurrent slow weights are initialized from a uniform distribution over (−1/√H, 1/√H), where H is the number of outgoing weights. The learning rate for the slow weights is tuned on the 10,000 validation examples. Below, we provide the specific hyper-parameter settings for the models used in the experiments:

Fast weights: The fast weights learning rate, η, is set to 0.5 and the fast weights decay rate, λ, is set to 0.9. The fast weights are updated once at every time step. We experimented with more iterations of the “inner loop” and the performance was similar. The recurrent slow weights are initialized to an identity matrix scaled by 0.05. We use the ReLU activation for f(·) in the recurrent layer.

IRNN: The recurrent slow weights are initialized to an identity matrix scaled by 0.5.
ReLU is used as the non-linearity in the recurrent layer.

Associative LSTM: We used 4 copies of memory cells for the associative LSTM. There are 3 read-write heads used for memory storage and retrieval.

A.2 Integrating glimpses in visual attention models: MNIST and facial expression recognition

Both tasks used parameter initialization and hyper-parameter settings similar to those of the associative retrieval task described above.

A.3 Agents with memory

All agents used recurrent networks to represent the policy. At each time step the input was passed through a hidden layer with 128 ReLU units and then passed to the recurrent core. All agents used 128 recurrent cells. The output at every step was a softmax over the valid actions and a single linear output for the estimate of the value function. We used random search to find hyperparameter values for the learning rate, the number of Hebbian steps, and, where applicable, the fast weight learning rate and decay. We averaged results over the top 5 models.

B Implementing the fast weights “inner loop” in biological neural networks

We considered two different ways of performing this inner loop settling. In method 1 (which is what we use) the inputs to the hidden units after an outer loop transition using W are stored and provide sustained boundary conditions during the inner loop settling. In method 2 (which is more biologically plausible) we simply add the identity matrix to the fast weight matrix so that the inner loop settling tends to sustain the hidden activity vector. For ReLUs, these two methods are equivalent when the fast weight matrix is zero. They are similar but not exactly equivalent when the fast weights are non-zero. Using layer normalization, we found that method 1 worked slightly better than method 2, but method 2 would be much easier to implement in a biological network.
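
The difference between the two settling methods can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes the fast weight update A ← λA + η h hᵀ from the main text, with η = 0.5 and λ = 0.9 as in A.1, and it omits layer normalization. It also reads method 2 as iterating h ← f((A + I) h), which is one interpretation of the description above. All function names are hypothetical, and b stands for the stored input to the hidden units after the outer loop transition.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def update_fast_weights(A, h, lam=0.9, eta=0.5):
    """Assumed update: decay the old fast weights, add the outer product of h."""
    n = len(h)
    return [[lam * A[i][j] + eta * h[i] * h[j] for j in range(n)]
            for i in range(n)]

def settle_method1(b, A, steps=1):
    """Method 1: the stored input b is re-applied at every inner step."""
    h = relu(b)
    for _ in range(steps):
        h = relu(add(b, matvec(A, h)))
    return h

def settle_method2(b, A, steps=1):
    """Method 2 (assumed reading): iterate with A + I, so h sustains itself."""
    h = relu(b)
    for _ in range(steps):
        h = relu(add(h, matvec(A, h)))
    return h

b = [1.0, -2.0, 0.5]                       # hypothetical boundary input
zero = [[0.0] * 3 for _ in range(3)]       # empty fast weight matrix
A = update_fast_weights(zero, [1.0, 1.0, 0.0])

print(settle_method1(b, zero), settle_method2(b, zero))  # identical when A is zero
print(settle_method1(b, A), settle_method2(b, A))        # differ when A is non-zero
```

With a zero fast weight matrix both functions return f(b), because applying ReLU twice changes nothing. With non-zero fast weights they differ in units whose stored input is negative: method 1 keeps applying the negative input, whereas method 2 only sees the rectified activity.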
