Learning to Reason With Adaptive Computation
Authors: Mark Neumann, Pontus Stenetorp, Sebastian Riedel
mark.neumann.15@ucl.ac.uk, {p.stenetorp,s.riedel}@cs.ucl.ac.uk
University College London, London, United Kingdom

Abstract

Multi-hop inference is necessary for machine learning systems to successfully solve tasks such as Recognising Textual Entailment and Machine Reading. In this work, we demonstrate the effectiveness of adaptive computation for learning the number of inference steps required for examples of different complexity, and that learning the correct number of inference steps is difficult. We introduce the first model involving Adaptive Computation Time which provides a small performance benefit on top of a similar model without an adaptive component, as well as enabling considerable insight into the reasoning process of the model.

1 Introduction

Recognising Textual Entailment (RTE) is the task of determining whether a hypothesis can be inferred from a premise. We argue that natural language inference requires the combination of inferences, and aim to provide a stepping stone towards the development of such a method. These steps can be compared to backtracking in a logic programming language, and by employing an attention mechanism we are able to visualise each inference step, allowing us to interpret the inner workings of our model. At the centre of our approach is Adaptive Computation Time (ACT) [1], the first example of a neural network defined using a static computational graph which can execute a variable number of inference steps conditional on the input. However, ACT was originally only applied to a vanilla Recurrent Neural Network, whereas we show that it can be applied to arbitrary computation for a problem not explicitly defined to benefit from ACT.
2 A Motivating Multi-hop Inference Example

When humans resolve entailment problems, we are adept not only at breaking down large problems into sub-problems, but also at re-combining the sub-problems in a structured way. Often in simple problems – such as negation resolution – only the first step is necessary. However, in more complicated cases – such as multiple co-reference resolution or lexical ambiguity – it can be necessary to both decompose and then reason. For instance, the contradicting statements:

• Premise: An elderly gentleman stands near a bus stop, using an umbrella for shelter because there is a thunderstorm.
• Hypothesis: An old man holding a closed umbrella is sheltering from bad weather under a bus stop.

require multi-step, temporally dependent reasoning to resolve correctly. On closer examination, the key to resolution here is the action relating entities and conditions in the scene, leading to an inference chain similar to:

• An old man, an umbrella and a bus stop are present ✓
• The weather is bad ✓
• He is sheltering from the weather ✓
• He is sheltering from the weather using a closed umbrella ✗

where the final true/false statement is built up from first observing facts about the scene and then combining and extending them. We use this idea of combining distinct low-level inferences, and show that Adaptive Computation and multi-step inference can be used to examine how incorporating additional inferences into the inference chain can influence the final classification. This visualisation provides an additional tool for the analysis of deep learning based models.

∗ Mark Neumann is currently at the Allen Institute for Artificial Intelligence.

3 Approach

Attention mechanisms for neural networks, first introduced for Machine Translation [2], have demonstrated state-of-the-art performance for RTE [3, 4].
Our model is an extension of the Iterative Alternating Attention model originally employed for Machine Reading [5], combined with the Decomposable Attention (DA) model previously proposed for RTE [4]. The original motivation behind the Decomposable Attention model was to bypass the bottleneck of generating a single document representation, as this can be prohibitively restrictive as the document grows in size. Instead, the model incorporates an "inference" step, in which the query and the document are iteratively attended over for a fixed number of iterations in order to generate a representation for classification. Our contribution is to generalise this inference step to run for an adaptive number of steps conditioned on the input, using a single step of the Adaptive Computation Time algorithm for Recurrent Neural Networks. We also show that ACT can be used to execute arbitrary functions by considering it as a (nearly) differentiable implementation of a while loop (Appendix C).

BOW Attention Encoders: As in the original DA formulation, un-normalised alignment weights are generated by the dot product of the outputs of a single feedforward network applied to the word vectors. Given hypothesis and premise representations $h_1 \ldots h_m$ and $p_1 \ldots p_n$, they are defined as:

$$e_{ij} = F(h_i)^\top F(p_j) \tag{1}$$

where $F : \mathbb{R}^d \to \mathbb{R}^d$ is a feedforward network with ReLU activation functions. The vector representations of the hypothesis and premise are then generated by taking the softmax over the respective dimensions of the $E$ matrix and concatenating with the original word vector representation:

$$\beta_j = \sum_{i=1}^{m} \mathrm{Softmax}_{1\ldots m}(e_{ij}) \cdot h_i \qquad \{\tilde{p}_j\}_{j=1}^{n} = [\beta_j, p_j]$$
$$\alpha_i = \sum_{j=1}^{n} \mathrm{Softmax}_{1\ldots n}(e_{ij}) \cdot p_j \qquad \{\tilde{h}_i\}_{i=1}^{m} = [\alpha_i, h_i] \tag{2}$$

Alternating Iterative Attention: Now that we have a representation of both the hypothesis and the premise, we iteratively attend over them.
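The alignment step of Eqs. (1)–(2) can be sketched in numpy as below. This is a minimal sketch, not the authors' implementation: $F$ is reduced to a single ReLU layer, and all dimensions are illustrative.

```python
import numpy as np

def relu_layer(x, W, b):
    """Single feedforward ReLU layer, standing in for F in Eq. (1)."""
    return np.maximum(0.0, x @ W + b)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decomposable_align(H, P, W, b):
    """H: (m, d) hypothesis vectors, P: (n, d) premise vectors.
    Returns the augmented representations [beta_j, p_j] and [alpha_i, h_i]."""
    E = relu_layer(H, W, b) @ relu_layer(P, W, b).T   # e_ij, shape (m, n)
    beta = softmax(E, axis=0).T @ H                   # (n, d): soft summary of H per p_j
    alpha = softmax(E, axis=1) @ P                    # (m, d): soft summary of P per h_i
    P_tilde = np.concatenate([beta, P], axis=1)       # (n, 2d)
    H_tilde = np.concatenate([alpha, H], axis=1)      # (m, 2d)
    return P_tilde, H_tilde

rng = np.random.default_rng(0)
d, m, n = 4, 3, 5
H, P = rng.normal(size=(m, d)), rng.normal(size=(n, d))
W, b = rng.normal(size=(d, d)), np.zeros(d)
P_tilde, H_tilde = decomposable_align(H, P, W, b)
print(P_tilde.shape, H_tilde.shape)   # (5, 8) (3, 8)
```

Note that the two softmaxes normalise over different axes of the same alignment matrix, so each side's summary is a convex combination of the other side's vectors.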
At inference iteration $t$, we generate an attention representation of the hypothesis:

$$q_{it} = \mathrm{Softmax}_{1\ldots m}(\tilde{h}_i^\top (W_h s_{t-1} + b_h)) \qquad q_t = \sum_i q_{it} \tilde{h}_i \tag{3}$$

where $q_{it}$ are the attention weights and $W_h \in \mathbb{R}^{d \times s}$, with $s$ the dimensionality of the inference GRU state. This attention representation of the hypothesis is then used to generate an attention mask over the premise:

$$d_{it} = \mathrm{Softmax}_{1\ldots n}(\tilde{p}_i^\top (W_p [s_{t-1}, q_t] + b_p)) \qquad d_t = \sum_i d_{it} \tilde{p}_i \tag{4}$$

where $d_{it}$ are the attention weights, $W_p \in \mathbb{R}^{d \times (d+s)}$, and $[x, y]$ denotes the concatenation of vectors.

Gating Mechanism: Although the attention representations could now be concatenated into an input for the inference GRU, we make one last addition proposed by [5] in order to allow attention representations to be forgotten or left unused:

$$r_t = G_p([s_{t-1}, d_t, q_t, d_t \cdot q_t]) \qquad s_t = G_h([s_{t-1}, d_t, q_t, d_t \cdot q_t]) \tag{5}$$

where $G_p, G_h$ are 2-layer feedforward networks $f : \mathbb{R}^{s+3d} \to \mathbb{R}^d$. The generated attention representations are multiplied element-wise with the result of the gating function and concatenated ($[r_t \cdot d_t, s_t \cdot q_t] \in \mathbb{R}^{2d}$), forming the input at time $t$ to the inference GRU. In [5], the number of inference GRU steps is a hyperparameter of the model. We instead learn the number of inference steps to take using a single step of the Adaptive Computation Time algorithm, described in Appendix C, where the input into the halting layer is the output $y_t$ of the inference GRU.

4 Results

Figure 1: Visualisation of the attention weights produced at each inference step, which are used to form the states fed into the inference GRU. At each time-step, the hypothesis is attended over using the inference GRU state as the query. This representation is then appended to the query and used to generate an attention mask over the premise.
On the right, we show the softmax classification if the model had used the inference GRU state at that timestep to classify. Here, taking multiple steps corrects the classification.

In Figure 1 and Appendix B, we have provided attention visualisations demonstrating the way the model uses its access to the premise and hypothesis representations. We found the following points particularly of note:

• Clear demonstrations of conditional attention, in the sense that the attention weights change over the course of the ACT steps and the model attends over different parts of the sentences.
• The model very rarely takes a single inference step, even in cases of high regularisation.
• There are several instances of a large ACT weight on the final loop in the inference GRU, demonstrating that when the model has found the key attention representation which provides the necessary insight, it can learn to halt immediately.

Additionally, the Ponder Cost parameter shows context-dependent behaviour – during preliminary experiments using RNN encoders, the optimal Ponder Cost settings were distinct. This further reinforces the argument that this parameter is difficult to set correctly. Below, we present results on the Stanford Natural Language Inference (SNLI) corpus [6].

Model | Test Accuracy | Parameters
Logistic Regression w/ lexical features [6] | 78.2% | n/a
Baseline LSTM [6] | 77.6% | 220k
1024D "Skip Thought" GRU [7] | 81.4% | 15m
SPINN-PI Recursive NN [8] | 83.2% | 3.7m
100D LSTM w/ word-by-word attention [3] | 83.5% | 250k
200D Decomposable Attention [4] | 86.3% | 380k
300D Full tree matching NTI-SLSTM-LSTM [9] | 87.3% | 3.3m
200D Decomposable Attention (our implementation) [4] | 83.8% | 380k
Adaptive Attention (ours) | 82.7% | 990k
Adaptive Attention (2 steps) | 76.2% | 990k
Adaptive Attention (4 steps) | 81.7% | 990k
Adaptive Attention (8 steps) | 81.0% | 990k

Table 1: Relative performance on the test set of the SNLI corpus.
5 Discussion

It should be noted that we were unable to replicate the performance originally reported for the DA model [4] with our implementation (https://github.com/DeNeutoy/Decomposable_Attn), despite correspondence with the authors, but continue to seek the reason for this discrepancy. However, given that our objective was to demonstrate the usefulness of adaptive computation applied to reasoning using attention, we feel this does not detract significantly from the purpose of the paper. Additionally, our model takes a mean of 5 steps at inference time, meaning it is both more efficient and more performant than using a large, fixed number of steps.

The performance plateau demonstrated in Appendix A is most likely an artefact of the way the SNLI corpus is constructed – in that Amazon Turkers have a financial incentive to provide the minimal hypothesis contradicting, agreeing with, or neutral to the premise. This construction naturally benefits bag-of-words approaches, as minimal changes often result in word overlap features being particularly discriminatory. Additionally, our approach involves combining conditional attention masks, which may have less benefit in a situation where there is little difference between the premise and hypothesis.

Overall, we have demonstrated that taking more steps in a multi-hop learning scenario does not always improve performance, and that it is desirable to adapt the number of steps to individual data examples. Further exploration into other regularisation methods to tune the ponder cost parameter should be considered, given the model's sensitivity in this regard. The assumption that a single state, rather than combinations or differences of states, is the best method to determine the halting probability should be investigated and will be the subject of future work, as halting promptly demonstrably aids performance.
Additionally, we believe that the investigation of attention mechanisms, particularly efficient implementations such as [4, 10], when combined with input-conditional computation, is an important avenue of future research.

Acknowledgments

This work was supported by a Marie Curie Career Integration Award and an Allen Distinguished Investigator Award. The authors would like to thank Tim Rocktäschel for helpful feedback and comments.

References

[1] Alex Graves. Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983, 2016.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[3] Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. Reasoning about entailment with neural attention. CoRR, abs/1509.06664, 2015.
[4] Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. CoRR, abs/1606.01933, 2016.
[5] Alessandro Sordoni, Philip Bachman, and Yoshua Bengio. Iterative alternating neural attention for machine reading. CoRR, abs/1606.02245, 2016.
[6] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In EMNLP, 2015.
[7] Ivan Vendrov, Jamie Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. CoRR, abs/1511.06361, 2015.
[8] Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. A fast unified model for parsing and sentence understanding. CoRR, abs/1603.06021, 2016.
[9] Tsendsuren Munkhdalai and Hong Yu. Neural tree indexers for text understanding. CoRR, abs/1607.04492, 2016.
[10] Marcin Andrychowicz and Karol Kurach. Learning efficient algorithms with hierarchical attentive memory. CoRR, abs/1602.03218, 2016.
[11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011.
[12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[13] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.

A Does taking more steps improve performance?

Below, we show accuracy as a function of the number of fixed inference steps, obtained by repeatedly evaluating the model whilst fixing the maximum number of inference steps it can take at test time. From the graph, it is clear that there is a difference between models using a single inference hop and those employing more than this. However, performance clearly does not scale linearly with the number of inference hops taken by the model, given the similar results for 4–20 hops. We argue that this indifference in performance between inference steps shows that choosing this value in a data-dependent way is preferable. Note that the uptick in final performance (when the model may use ACT to determine when to halt) is due to the way ACT combines the inferences using a weighted average over the individual attentions.

Figure 2: Validation accuracy for the Adaptive Attention model compared to the same model where the number of steps is fixed at inference time.

B Additional Diagrams

Figure 3: Here, we observe large final ACT weights once the model has asserted that not only is a bird present (steps 2 to 3), but specifically that it is standing on the line.

Figure 4: An example of incorrect classification where standard analysis would simply indicate a failure – in fact, the model is extremely uncertain whether this example falls into the contradiction or neutral category, and only chooses the wrong class by a very small margin.
This example additionally provides evidence to support the idea that steps should be adaptive – although this example is classified incorrectly, if only a single step were taken here, it would have been correctly classified.

C Adaptive Computation Time

In order to determine the number of steps taken, given a state $s_t^n$ at time $t$ and inner recurrence step $n$, define:

$$h_t^n = \sigma(W_p s_t^n + b_p) \qquad p_t^n = \begin{cases} R(t) & n = N(t) \\ h_t^n & \text{otherwise} \end{cases} \tag{6}$$

where $p_t^n$ is the halting probability at outer timestep $t$ and inner recurrence timestep $n$, $W_p \in \mathbb{R}^d$ and $b_p \in \mathbb{R}$ are learnt parameters, and $\sigma$ is the element-wise sigmoid function. $N(t)$ is the inner recurrence timestep at which the accumulated halting probabilities $p_t^{1\ldots n}$ reach $1 - \epsilon$, and $R(t)$ is the remainder at timestep $N(t)$:

$$N(t) = \min\Big\{n : \sum_{i=1}^{n} p_t^i \geq 1 - \epsilon\Big\} \qquad R(t) = 1 - \sum_{i=1}^{N(t)-1} p_t^i \tag{7}$$

Given the above, we define the next cell output and state to be weighted sums of the intermediate states:

$$s_t = \sum_{i=1}^{N(t)} p_t^i s_t^i \qquad y_t = \sum_{i=1}^{N(t)} p_t^i y_t^i \tag{8}$$

The final addition in [1] is the so-called Ponder Cost, which is used to regularise the number of computational steps taken. It combines the remainder function and the iteration counter, and is added to the original L2 loss function:

$$P(x) = \sum_{t=1}^{T} N(t) + R(t) \tag{9}$$

Adaptive computation can be seen as a differentiable implementation of a while loop, in that we can perform other operations within the body of a single ACT timestep. This idea generalises away from the context of ACT within a vanilla recurrent neural network, allowing more complex functionality such as modules including attention or extraneous inputs.
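Equations (6)–(8) can be sketched as a single ACT outer timestep in numpy. This is a minimal sketch under illustrative assumptions: `inner_cell` is an arbitrary recurrence (here a toy decay cell, not the inference GRU), and `max_steps` caps the loop as in practical implementations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def act_step(x, s0, inner_cell, w_p, b_p, eps=0.01, max_steps=50):
    """One ACT outer timestep: iterate inner_cell until the accumulated
    halting probability reaches 1 - eps (Eq. 7), then return the
    halting-weighted output/state (Eq. 8) and the ponder term N(t) + R(t)."""
    s, states, outputs, ps, total = s0, [], [], [], 0.0
    for n in range(1, max_steps + 1):
        y, s = inner_cell(x, s)
        h = sigmoid(float(w_p @ s) + b_p)        # Eq. 6: scalar halting unit
        halt = total + h >= 1.0 - eps or n == max_steps
        p = (1.0 - total) if halt else h         # remainder R(t) on the last step
        states.append(s); outputs.append(y); ps.append(p); total += p
        if halt:
            ponder = n + ps[-1]                  # N(t) + R(t), contribution to Eq. 9
            break
    s_t = sum(p * si for p, si in zip(ps, states))   # Eq. 8 weighted state
    y_t = sum(p * yi for p, yi in zip(ps, outputs))  # Eq. 8 weighted output
    return y_t, s_t, ponder

# Toy inner cell: state decays towards the input; output equals the state.
cell = lambda x, s: (0.5 * s + x, 0.5 * s + x)
y_t, s_t, ponder = act_step(np.ones(3), np.zeros(3), cell,
                            w_p=np.zeros(3), b_p=5.0)
print(ponder)   # halts after one inner step: N(t) + R(t) = 1 + 1.0 = 2.0
```

With a strongly positive halting bias the loop halts immediately and the remainder carries all the probability mass, which is exactly the "halt promptly once the key representation is found" behaviour discussed in the results.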
Algorithm 1: ACT as a while loop

1: procedure InnerACTStep
2:   while p ≤ 1 − ε do
3:     y_t^n, s_t^n = InnerCell(x^n, s_t^{n−1})
4:     p_t^n = σ(W_p s_t^n + b_p)
5:     p = p + p_t^n
6:   end while
7:   s_t = Σ_{i=1}^{N(t)} p_t^i s_t^i
8:   y_t = Σ_{i=1}^{N(t)} p_t^i y_t^i
9: end procedure

Note that this algorithmic implementation is for a single ACT timestep.

D Experimental Details

In this section, we describe results and insights from running the models described in the Methods section on the Stanford Natural Language Inference corpus [6]. For both the Adaptive Attention model and the Decomposable Attention model, we ran grid searches over the following hyperparameter ranges, for 15 epochs over the full training data.

Hyperparameter | Range
Ponder Cost (AA only) | 0.001, 0.0005, 0.0001, 0.00005
Learning Rate | AA: {0.01, 0.05, 0.001}; DA: {0.001, 0.0005, 0.0001}
Dropout | 0.1, 0.2
Embedding Dim | 200, 256, 300
Batch Size | 8, 16, 32
Epsilon (AA only) | 0.01, 0.2

Table 2: Hyperparameter ranges for grid searches. Bold and italicised text represent the best settings for the Adaptive Attention and Decomposable Attention models respectively.

Not included in the hyperparameter search were the vocabulary size, fixed to the 40,000 most frequent words (with an OOV token used for words not found in this vocabulary), and the inference GRU size, fixed to 200. We clipped gradients at an absolute value of 5. Models were trained using the Adagrad optimiser [11] for the AA model and the ADAM optimiser [12] for the DA model. Word embeddings are initialised using GloVe pre-trained vector representations [13] (which are normalised for the DA model, following correspondence with the authors), and words without a pre-trained representation are initialised to vectors drawn from a Normal distribution N(0, 0.01). Word embeddings are not updated during training, but the non-linear embedding projection parameters are trained.
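The grid search over the ranges in Table 2 can be sketched as below. This is a hypothetical illustration: `score_fn` is a placeholder for the real train-for-15-epochs-and-validate loop, and only the Adaptive Attention side of the ranges is shown.

```python
import itertools

# Ranges mirroring the Adaptive Attention column of Table 2.
GRID = {
    "ponder_cost": [0.001, 0.0005, 0.0001, 0.00005],
    "learning_rate": [0.01, 0.05, 0.001],
    "dropout": [0.1, 0.2],
    "embedding_dim": [200, 256, 300],
    "batch_size": [8, 16, 32],
    "epsilon": [0.01, 0.2],
}

def grid_search(grid, score_fn):
    """Exhaustively evaluate every configuration; score_fn stands in for
    training the model and returning validation accuracy."""
    keys = sorted(grid)
    best_cfg, best_score = None, float("-inf")
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy scorer just to exercise the search (not a real training run).
cfg, _ = grid_search(GRID, lambda c: -c["dropout"] - c["learning_rate"])
print(cfg["dropout"], cfg["learning_rate"])   # 0.1 0.001
```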
Following the approach taken by Bowman et al. in the original SNLI paper, we discard any examples with no gold label, leaving us with 549,367, 9,842 and 9,824 examples for training, validation and testing respectively. After running these hyperparameter searches, the hyperparameter setting with the best accuracy on the validation split is selected and trained until convergence. Only these models are evaluated on the test set.

To generate the comparison between the Adaptive Attention model and the models with fixed numbers of hops, we fix the hyperparameters to the best variant from the grid search and simply halt inference at different fixed numbers of memory hops.
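The gold-label filtering above can be sketched as follows; in the released SNLI files, examples without annotator consensus carry the placeholder label "-" in the `gold_label` column (the dict-of-rows representation here is an illustrative assumption).

```python
def filter_gold_labels(examples):
    """Keep only examples whose annotators reached a consensus label;
    SNLI marks the rest with the placeholder gold label '-'."""
    return [ex for ex in examples if ex["gold_label"] != "-"]

# Toy rows illustrating the filter.
rows = [
    {"gold_label": "entailment", "pairID": "a"},
    {"gold_label": "-", "pairID": "b"},           # no consensus: dropped
    {"gold_label": "contradiction", "pairID": "c"},
]
kept = filter_gold_labels(rows)
print([r["pairID"] for r in kept])   # ['a', 'c']
```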