SpaRCe: Improved Learning of Reservoir Computing Systems through Sparse Representations

"Sparse" neural networks, in which relatively few neurons or connections are active, are common in both machine learning and neuroscience. Whereas in machine learning, "sparsity" is related to a penalty term that leads to some connecting weights beco…


Luca Manneschi, Department of Computer Science, University of Sheffield, Sheffield, UK, lmanneschi1@sheffield.ac.uk
Andrew C. Lin, Department of Biomedical Science, University of Sheffield, Sheffield, UK, andrew.lin@sheffield.ac.uk
Eleni Vasilaki, Department of Computer Science, University of Sheffield, Sheffield, UK, e.vasilaki@sheffield.ac.uk

April 20, 2021

Abstract

"Sparse" neural networks, in which relatively few neurons or connections are active, are common in both machine learning and neuroscience. Whereas in machine learning, "sparsity" is related to a penalty term that leads to some connecting weights becoming small or zero, in biological brains, sparsity is often created when high spiking thresholds prevent neuronal activity. Here we introduce sparsity into a reservoir computing network via neuron-specific learnable thresholds of activity, allowing neurons with low thresholds to contribute to decision-making but suppressing information from neurons with high thresholds. This approach, which we term "SpaRCe", optimises the sparsity level of the reservoir without affecting the reservoir dynamics. The read-out weights and the thresholds are learned by an on-line gradient rule that minimises an error function on the outputs of the network. Threshold learning occurs by the balance of two opposing forces: reducing inter-neuronal correlations in the reservoir by deactivating redundant neurons, while increasing the activity of neurons participating in correct decisions. We test SpaRCe on classification problems and find that threshold learning improves performance compared to standard reservoir computing. SpaRCe alleviates the problem of catastrophic forgetting, a problem most evident in standard echo state networks and recurrent neural networks in general, by increasing the number of task-specialised neurons that are included in the network decisions.

1 Introduction

The performance of artificial neural networks is often improved by adopting "sparse" representations, in which relatively few neurons or connections are active. Previous research has studied the role of sparse connectivity, in terms of memory, in Hopfield networks, demonstrating how sparse connectivity increases storage capacity [1] [2] [3] [4]. Memory retrieval and associative learning have been studied in the context of neural network attractors, and the work in [5] has provided an abstract mathematical analysis of retrieval capacity. From the machine learning perspective, adopting sparse connectivity can lead to more interpretable models [6], a reduced computational cost [7], and can help solve overfitting problems [8]. Sparsity in machine learning is typically introduced into artificial networks through regularisation, in which a penalty term leads to the reduction of the connection weights. In this regard, the work in [7] demonstrated how structured sparsity can improve computational speed and accuracy in a convolutional neural network. Rasmussen et al. [9] showed how the choice of regularisation parameters can impact the interpretability and the reproducibility of a classifier of neuroimaging data, and showed the existence of a trade-off between pure classification accuracy and reproducibility.
Sparsity is also a well-known concept in neuroscience: biological neurons are highly selective in systems ranging from mammalian sensory cortex [10] to the insect mushroom body [11] [12]. However, unlike in typical machine learning approaches, biological sparsity is introduced not only by reducing connection weights between neurons, but also by the fact that neurons have spiking thresholds: they only fire when their summed inputs exceed a certain threshold. High spiking thresholds relative to the size of synaptic inputs can often contribute to high selectivity of neurons, as with Kenyon cells (KCs), the principal neurons of the insect mushroom body, which fire sparsely in response to odor stimuli [13] [14] [15] [16]. In the fruit fly Drosophila, this sparse odor coding enhances learned discrimination of similar odors [12]. Moreover, spiking thresholds vary across neurons [13] and over time for the same neuron [17] [18], and spiking thresholds for different neurons are adapted to the neurons' particular input statistics [17] and past activity [19]. Here we applied the concept of adaptable spiking thresholds to machine learning to create SpaRCe, a method for achieving sparse representations applicable to reservoir computing.

Reservoir computing takes inspiration from the complex non-linear behavior of recurrent neural networks and their ability to process temporal information. This computing paradigm exploits the inherent complexity of a dynamical system, called the reservoir, which can be virtually or physically defined, to represent a stream of data in a high-dimensional space useful for learning. In the classical learning framework of such systems, training occurs only on the read-out connections, which connect the activities of the non-linear reservoir components to the output neurons. The advantages of the reservoir computing framework are twofold: first, the system can be physically realised, as in [20, 21, 22, 23, 24], an aspect that can dramatically reduce the computational and energetic cost [25]; second, the response of the reservoir can exhibit memory over a range of timescales without training of the internal dynamics of the system. In this latter aspect, the reservoir computing framework is antithetical to standard recurrent neural networks, whose connectivity is trained through the computationally expensive backpropagation through time (BPTT) algorithm.

In this work, we model the reservoir as a recurrent network of leaky integrators [26] known as an echo state network (ESN). The connectivity between the nodes is represented through a random, sparse, fixed adjacency matrix, whose spectrum of eigenvalues allows the system to exhibit a wide range of timescales and to represent temporal information efficiently. Previous works have explored alternatives to the typical Erdos-Renyi connectivity of the reservoir, imposing a regular structure such as a delay line [27], where the nodes are connected unidirectionally to form a circular graph, or defining the graph connectivity through more complex topologies [28] such as scale-free or small-world graphs. Furthermore, complex structures composed of multiple, interconnected ESNs have recently been studied, and the works [29] [30] demonstrate how the use of different hyperparameters for different ESNs can expand the range of timescales accessible to the system and consequently increase the quality of the reservoir representation for tasks with complex temporal dynamics.
In this regard, the work [30] gives insights into the understanding of hierarchical systems from the point of view of accessible timescales, offering a theoretical and practical analysis of how to tune the hyperparameters of the system to the task considered. While ESNs are most traditionally adopted by the machine learning community for time-series prediction and forecasting [26], more recent works have studied their behavior for time-series classification [31]. In particular, [31] studies and compares different methods to exploit the representation of the reservoir as input to more complex machine learning algorithms such as multilayer perceptrons or SVMs. In this work, we formulate a different approach to improve the classification ability of reservoir systems in general while maintaining the low computational cost that is intrinsic to the concept of reservoir computing. Analogously to the concept of firing thresholds, SpaRCe exploits learnable thresholds to optimize the level of sparsity inside the network. Both the learnable thresholds and the read-out weights (but not the recurrent connections within the reservoir) are optimised by minimising an error function that does not include any normalisation term. We analysed the learning rule derived from this error minimisation and found that learning is driven by two antagonistic factors: the first raises the thresholds proportionally to the correlated activity of the nodes (thus silencing nodes that are correlated and therefore redundant), while the second lowers the thresholds of nodes that contribute to the correct classification (Fig. 2). The novelty of our work lies in the fact that a sparsity level is reached through "on-line" learnable firing thresholds, rather than penalty terms [32] [6] [33], as well as in the detailed analysis of how such thresholds enable ESNs to perform competitively in tasks where their performance was previously lacking.

2 Methods

2.1 Standard Echo State Network

An echo state network (ESN) is composed of randomly connected leaky integrators. The activity of such a system is defined through the following equation [26]:

V(t + δt) = (1 − α) V(t) + α f( γ W_in s(t) + ρ W V(t) )   (1)

where α = δt/τ defines the temporal scale of the neurons, V(t) is the activity vector of the integrators and s(t) is the input signal. W is the fixed, sparse, random matrix that describes the recurrency of the reservoir, corresponding to an Erdos-Renyi graph where the probability of a connection is p_ER. The eigenvalues of W are rescaled to be confined inside the unit circle of the complex plane, a necessary condition for the echo state property. The hyperparameter ρ (between 0 and 1) further controls the radius of the spectrum of the system's eigenvalues. Finally, γ is a gain factor of the input signal. The specific form of the input matrix W_in and the activation function f is task-dependent and will be specified in Sections 3.1 and 3.2. It is possible to control a priori the range of timescales that the reservoir exhibits by appropriately choosing α and ρ, as described in the methodology reported in Appendix A.1 and our earlier work [30]. In the standard ESN learning protocol, learning occurs exclusively on the read-out, defined from a vector Ṽ that represents an ensemble of the reservoir activities V.
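As a concrete illustration (not the authors' code), the following minimal NumPy sketch implements the leaky-integrator update of Eq. 1; the function names, the tanh nonlinearity and the spectral rescaling step are illustrative assumptions, since f and W_in are task-dependent as stated above.

```python
import numpy as np

def init_reservoir(n_in, n_res, p_er=0.01, seed=0):
    """Random Erdos-Renyi recurrent matrix W (spectrum rescaled into the unit circle) and input matrix W_in."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_res, n_res)) * (rng.random((n_res, n_res)) < p_er)
    W /= np.max(np.abs(np.linalg.eigvals(W)))      # confine the eigenvalues inside the unit circle
    W_in = rng.standard_normal((n_res, n_in))
    return W_in, W

def esn_step(V, s, W_in, W, alpha=0.1, gamma=1.0, rho=0.95, f=np.tanh):
    """Eq. 1: V(t+dt) = (1 - alpha) V(t) + alpha * f(gamma W_in s(t) + rho W V(t))."""
    return (1.0 - alpha) * V + alpha * f(gamma * (W_in @ s) + rho * (W @ V))
```

Iterating esn_step over the time steps of an input sequence and collecting the chosen states (for instance the final state V(T), or a concatenation of states) yields the vector Ṽ used for the read-out described next.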
The output of the system is then computed through

y = W_o^T Ṽ   (2)

E = ½ (ỹ − y)²   (3)

where ỹ denotes the desired output values, and the output connectivity W_o is optimised to minimise the cost function E through ridge regression [26] or through gradient descent methods [34]. We note that, depending on the task and on the learning framework, the Ṽ vector can be defined from different ensembles of activities of the reservoir across time. The idea of exploiting the dynamics of the ESN, instead of using its activities at a single time step, is particularly useful for classification tasks. Previous works have introduced different techniques to define a Ṽ vector from the ESN's dynamics: adopting PCA on the multidimensional response of the reservoir across time [31], introducing a learnable temporal kernel to define the read-out [35], or concatenating the past activities of the reservoir to define a higher-dimensional vector Ṽ [34] [36]. In this work, we will exploit the latter approach, while other techniques such as PCA, which reduces the dimensionality of the ESN's representation, are more desirable in the case of overfitting. In this regard, while the most natural choice for time-series prediction would be Ṽ = V(t), corresponding to the activity of the reservoir at the current time, it is also possible to expand the dimensionality of the system by including previous reservoir representations at times t_l and define a vector Ṽ = C[{V(t_l)}_{t_l ∈ T}], where C denotes the concatenation operator and T the ensemble of time steps t_l considered. This latter approach has also been adopted in physical reservoir models through the use of virtual nodes [37] [38]. Of course, such dimensionality expansion leads to an artificial increase of the memory of the system. However, the concatenation of previous activities does not guarantee the understanding of dependencies among events that are distant in time, because the memory of a past input signal could have faded away by the time a new stimulus s(t) arrives. A scheme of this procedure for time-series classification can be found in Fig. 1A.

2.2 Sparse Reservoir Computing

In contrast to previous models that define the output of the neural network through a read-out of the Ṽ vector, i.e. the activities of the reservoir considered for the learning process, we introduce another variable x_i for each dimension of Ṽ, defined as follows:

x_i = sign(Ṽ_i) relu( |Ṽ_i| − θ_i )   (4)

θ_i = P_n(|Ṽ_i|) + θ̃_i   (5)

where relu stands for the rectified linear unit, sign is the sign function (1 if Ṽ_i > 0, −1 if Ṽ_i < 0, 0 if Ṽ_i = 0), and θ_i is a threshold that enables x_i to be sparse. Thus, the variable x_i is zero if the absolute value of Ṽ_i is lower than the corresponding threshold. We term this variant SpaRCe, for Sparse Reservoir Computing. Of course, x = Ṽ if θ = 0, which is the case where the formulated learning procedure coincides with the standard ESN paradigm. Considering Eq. 5, each threshold is composed of two factors:

• P_n(|Ṽ_i|), defined as the n-th percentile of the distribution of activity of the i-th component of |Ṽ| across a dataset, where the same value of n is applied to all dimensions of Ṽ. Given that the response Ṽ of the reservoir to a specific input remains unchanged across learning, this term can be computed over the training dataset and kept constant throughout the simulation.
Thus, the role of P_n(|Ṽ_i|) is to initialise the thresholds from an initial condition where the sparsity level of all dimensions of x is n/100. The thresholding mechanism introduced is symmetric with respect to zero, as is visible from the examples in Fig. 1D-E. While panel D highlights the portions of the distributions that are positive after normalisation, panel E shows the resulting distributions after application of Eq. 5. In this formulation, the value of n needs to be considered as an additional, interpretable hyperparameter. However, we will show that all sensible choices of n correspond to faster convergence and better performance of the formulated algorithm in comparison to the standard ESN learning paradigm across all tasks considered. Throughout this paper, the same P_n(|Ṽ_i|) values computed over the training dataset will be used for the validation and test datasets.

• An adaptable component θ̃_i optimised through gradient descent learning rules. This parameter adapts the sparsity level to the considered task and can rediscover the standard learning paradigm in the case where θ̃_i = −P_n(|Ṽ_i|), ∀i.

Figure 1: Comparison between the basic reservoir computing framework and SpaRCe, which exploits the concept of adaptable thresholds to introduce sparsity in the representation of the ESN. A: The typical reservoir computing sampling paradigm for a time-series classification task. Time flows from top to bottom. The input signal s(t) is fed into the ESN through the input connectivity matrix γW_in. An ensemble of the activities V(t) across time is selected and concatenated to compose a vector Ṽ, which is then used to define the read-out. Typical choices of ESN activities used to define the read-out are Ṽ = V(T), where T denotes the final temporal step of the input sequence, or Ṽ = C[{V(t)} ∀t], denoting the concatenation of the whole reservoir dynamics across the temporal length of s(t). Of course, while the first approach relies on the ESN's intrinsic memory capacity, the latter corresponds to a dimensionality expansion that contributes artificially to the memory of the system. Both these approaches and intermediate cases, i.e. where the dynamics of the reservoir across time is sampled at low frequency to compose the read-out, will be exploited in the tasks studied. B-C: Comparison between the typical reservoir computing read-out (B) and the SpaRCe model (C). B: The ESN output y = W_o Ṽ is responsible for the classification of the example sequence s(t). In this paradigm, learning occurs exclusively on W_o. C: Scheme of the SpaRCe model. Thresholds are introduced at the level of the Ṽ vector, leaving the dynamics of the reservoir unaffected and making the approach applicable to any physical or virtual reservoir model. Each threshold value is composed of a normalisation term P_n(|Ṽ_i|), defined as the n-th percentile of the activity distribution of the i-th component across the data, plus an adaptable term θ̃_i (see text for more details). D-E: Activity distributions of Ṽ (D) and x (E) for three example nodes. The highlighted red region in D corresponds to the values for which the nodes would be active if the thresholding mechanism proposed in Eq. 4 were applied (percentile n = 50 in this case). From E, it is clear that Eq. 4 also shifts the activity distributions, acting as a normalisation mechanism.
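A minimal sketch of the thresholded read-out defined in Eqs. 4-5 is given below (NumPy; the function names, array shapes and the use of np.percentile for the fixed term P_n(|Ṽ_i|) are illustrative assumptions, not the authors' released code).

```python
import numpy as np

def init_thresholds(V_tilde_train, n=50):
    """Fixed term of Eq. 5: n-th percentile of |V_i| over the training set, one value per node."""
    return np.percentile(np.abs(V_tilde_train), n, axis=0)           # shape (N,)

def sparce_readout(V_tilde, P_n, theta_adapt, W_o):
    """Eqs. 4-5: x_i = sign(V_i) relu(|V_i| - (P_n_i + theta_adapt_i)); output y = W_o x."""
    theta = P_n + theta_adapt
    x = np.sign(V_tilde) * np.maximum(np.abs(V_tilde) - theta, 0.0)  # sparse representation
    return x, W_o @ x                                                # y has one entry per class
```

Here V_tilde_train is assumed to hold the reservoir responses Ṽ over the training set (one row per sample), W_o is the output matrix with one row per class, and theta_adapt is the learnable component θ̃.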
Figure 2: The proposed SpaRCe read-out leads to decreased variability in the ESN representation, to specialised responses, and to an interpretable learning rule that acts as a feature selection mechanism. The learning rule for the thresholds is driven by the imbalance between two antagonistic forces. A: Distributions of the average activities of nodes across an example dataset (MNIST, Section 2.3) for the standard ESN and for SpaRCe as the percentile of the normalisation mechanism changes, at the beginning of training (when θ̃ = 0). SpaRCe decreases the variability of the activity distributions, acting as a normalisation mechanism. B: Frequency of active nodes for different starting sparsity levels and different numbers of classes for a classification task with ten classes (MNIST, Section 2.3) before training (θ̃ = 0). As sparsity (P_n) increases, nodes respond to a smaller number of classes, becoming more specialised. C: Analysis of the two forces Δθ^(1) and Δθ^(2) involved in the learning rule for the thresholds. The positive y-axis shows a running average of ⟨Δθ^(1)⟩ with solid lines, while the negative y-axis shows a running average of ⟨Δθ^(2)⟩ with dashed lines; ⟨⟩ indicates averaging across all neurons. Δθ^(1) increases the threshold values of nodes that contribute equally to the classification process. Δθ^(2) decreases the threshold values thanks to the positive contribution of the output weights connected to the correct output. Colours correspond to initial conditions (P_10, P_60). D: Average cumulative change of a threshold. If the starting level of sparsity is suboptimal and low (high), the average threshold change is positive (negative).

The additional complexity arising from Eq. 4 is summarized in Fig. 1B, which depicts the difference between the read-out of a standard echo state network and our formulation. Eq. 4 acts as a normalisation operator that directly controls sparsity in the reservoir representation and that, thanks to the learnable components θ̃, works as a feature-selection mechanism. In this latter sense, we can now demonstrate the interpretability of the update equations on θ derived from a gradient descent approach. For the case of a mean square error function, the learning rule can be factorised into two terms:

Δθ_k = η_θ [ Δθ^(1)_k + Δθ^(2)_k ]   (6)

Δθ^(1)_k = Σ_{j=1}^{N_class} y_j W^o_{jk} sign(x_k) = Σ_{j=1}^{N_class} Σ_{l=1}^{N} W^o_{jl} W^o_{jk} x_l sign(x_k)   (7)

Δθ^(2)_k = −W^o_{j̃k} sign(x_k)   (8)

where η_θ is the learning rate of the thresholds, j̃ refers to the correct output class for the sample considered, and W^o is the output connectivity matrix. Eqs. 7 and 8 are derived by assuming one-hot encoding (Appendix B.2). Considering x_k > 0 (analogous considerations are valid for x_k < 0), the factor Δθ^(2) decreases (increases) the threshold value of nodes with W^o_{j̃k} > 0 (W^o_{j̃k} < 0) that help the network reach the right (wrong) classification. Thus, Δθ^(2) is driven by the output weight between the considered node (if it is active) and the desired class. In contrast, Δθ^(1) is a measure of correlation of activities between different nodes in the reservoir and increases the thresholds of neurons that have coherent synapses and that are simultaneously active (and are therefore redundant).
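Continuing the sketches above, the gradient update of Eqs. 6-8 for the mean square error and one-hot targets can be written compactly as follows; the variable names and shapes are the same illustrative assumptions as before.

```python
import numpy as np

def threshold_update(x, y, target_idx, W_o, eta_theta=2e-4):
    """Eqs. 6-8: increment for the adaptable thresholds theta_adapt under MSE and one-hot targets."""
    s = np.sign(x)                    # zero for silent nodes, so their thresholds do not move
    dtheta_1 = (y @ W_o) * s          # Eq. 7: correlation force, raises thresholds of redundant active nodes
    dtheta_2 = -W_o[target_idx] * s   # Eq. 8: lowers thresholds of active nodes wired to the correct class
    return eta_theta * (dtheta_1 + dtheta_2)
```

In a full training loop this increment would be added to theta_adapt, while the output weights W_o receive their own gradient step with the (larger) learning rate η_W.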
We examined the two factors Δθ^(1) and Δθ^(2) across learning for an example of sequence classification and for different initial sparsity levels. These are plotted in Fig. 2C, with ⟨Δθ^(1)⟩ in solid lines on the positive y-axis and ⟨Δθ^(2)⟩ in dashed lines on the negative y-axis. The two forces are almost symmetric, but their slight imbalance provides the direction in which the threshold values change. Indeed, if the starting sparsity level is high, the total force is negative and the factor Δθ^(2) dominates, while if the sparsity level is low, the correlation term Δθ^(1) wins and the thresholds increase on average (compare P_70 and P_10 in Fig. 2D, which shows the cumulative Δθ^(2) and Δθ^(1)). The reason that the magnitudes of the forces are larger for a higher starting sparsity (Fig. 2C) is that stimulus representations overlap less when the sparsity level is higher and neurons are more specialised, i.e. preferentially fire for one pattern over the others. This leads to a more coherent sign of the output weights of the nodes belonging to a cluster toward a specific class, which increases Δθ^(1) according to Eq. 7. A similar analysis of the learning rule for a cross-entropy cost function is reported in Appendix B.2.

Fig. 2A,B describes the benefits of the proposed normalisation mechanism: first, it directly controls sparsity and introduces specialised neurons (Fig. 2B); second, it shifts the activity distributions Ṽ of the network into a more stereotyped response x, avoiding the possibility that learning could be dominated by the nodes with the highest activities (Fig. 2A). Finally, it stabilises the learning process for a wide range of possible learning rates that would not be accessible without thresholds (Supplementary Materials). This specific formulation allows the model to use local information to learn the threshold values and optimize sparse representations. We note also that Eq. 4 does not affect the timescales of the network and consequently preserves the idea of reservoir computing as a fixed, dynamically rich representation. Furthermore, the method formulated constitutes a computationally inexpensive procedure that is easily applicable to any type of reservoir, virtually or physically defined.

2.3 Benchmarks

We tested the proposed model on the following benchmarks, comparing its performance to the standard read-out of an ESN:

• A biologically inspired task to test the storage capacity of the system in the memorisation of associations between sequences and desired output values. In this case, the model is tested on the same associations used for training, but with different realisations of multiplicative white noise added to the sequences. The dependence of SpaRCe on the starting sparsity level is analysed, along with comparisons of the performance achieved with other methods for reading out from the ESN representation, including deep feedforward networks.

• Three variants of classification tasks based on the MNIST dataset, where data are presented to the system as temporally varying sequences. In the first task, we studied the performance of the model when each image, composed of 28 × 28 pixels, is presented column by column as a 28-dimensional sequence of 28 time steps (MNIST [36]). In the second, the input is processed in the same way, but a random permutation of the pixels is applied to all data to make the task more challenging and to randomise the structure of the images (pMNIST).
In the third, we applied a random permutation of the data as in the second paradigm, but each image is presented pixel by pixel as a one-dimensional sequence of 784 temporal steps (psMNIST [39]). When the initial conditions of the model are studied, the performance is computed and shown on the test set as P_n varies for comparison, but to select the best P_n value and report the highest performance we used a validation dataset. Of course, we selected the hyperparameters corresponding to the highest accuracy on the validation set and then computed the performance on the test dataset.

• Two tasks that involve sequential learning. In the first case, we applied ten different permutations to the data and trained the model on the resulting tasks one after the other [40] [41]. In the second, the network is trained on the data belonging to different classes sequentially [40] [41]. In both cases, the network can access a specific dataset (corresponding to a permutation or a class) only once during training. As before, a validation dataset is adopted when selecting the best hyperparameters (in particular P_n) of the proposed model.

3 Results

3.1 Threshold learning increases storage capacity

We evaluated the performance of a standard ESN and SpaRCe in classifying an ensemble of sequences {S_i}_{i=1,..,N_seq} of three successive stimuli, where the dimensionality of the signal is N_In = 24. Each stimulus of a sequence is derived from the simulated response of N_In = 24 projection neurons (PNs, in the fly olfactory system) to 110 different odors [42] [43]. This simulated activity, which we call s_HO (HO for Hallem-Olsen), has previously been used in computational analyses of fly olfaction [44] [45] [46]. Sequences are generated to test the storage capacity of the system in memorising associations between input signals and desired output values. The procedure for building different sequences from single stimuli is described in the Appendix (section C.1, Fig. 1); essentially, it guarantees that each sequence has to be classified as independently as possible from other sequences, and that there are no correlations of elements among different inputs that can inform the classification process. Thus, the system can only memorise the associations between a specific succession of elements and the corresponding output value. There are three stimuli in a sequence, each presented for a time interval Δt = 0.1 s in order to allow the network to integrate the information. The total duration of an input sequence is T = 0.3 s. Given a sequence s_i(t), we added multiplicative white noise to each separate dimension to make the task more complex. Thus, the i-th dimension of the final sequence S_i(t) to be classified is S_i(t) = s_i(t) + σ_s ξ(t) s_i(t), where ξ(t) is a Gaussian-distributed random variable with zero mean and unit variance. For this specific task, the activation function f of Eq. 1 is a rectified linear unit and the connections of the input adjacency matrix W_in follow a lognormal distribution. This particular form of W_in is inspired by the biological results in [13] [47] [48]. In this case, Ṽ = V(T), meaning that only the activities at the last temporal step of a sequence are adopted for the read-out, and the memory of past events is left to the internal dynamics of the ESN. A schematic representation of the task can be found in the Appendix.
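For illustration, one possible way to realise the multiplicative noise S_i(t) = s_i(t) + σ_s ξ(t) s_i(t) described above is sketched below (NumPy, with illustrative names; the clean sequence s is assumed to be stored as an array of time steps by input dimensions).

```python
import numpy as np

def add_multiplicative_noise(s, sigma_s=0.3, rng=None):
    """S(t) = s(t) + sigma_s * xi(t) * s(t), with xi ~ N(0, 1) drawn independently per entry."""
    rng = np.random.default_rng() if rng is None else rng
    return s * (1.0 + sigma_s * rng.standard_normal(s.shape))
```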
We first investigated how the model depends on initial sparsity, and found an optimal sparsity level for the task. We initialized the network with different initial sparsity percentiles (i.e., thresholds at different percentiles of the Ṽ distribution, n = [10, 20, 30, 40, 50, 60, 70, 80, 90]), and tracked the mean square error over the learning process (Fig. 3A). Errors decreased as learning proceeded, but at each time point, the lowest error occurred for an initial sparsity of about 50%. Furthermore, models initialized with sparsity values other than 50% (an explanation of why 50% can be found in the Appendix) converged toward 50% as training proceeded, as shown by the black dashed lines connecting dots of training instances from the top to the bottom of the graph (Fig. 3A). This shows how the learning rule pushes the percentage of active nodes toward the optimal sparsity level. Notably, the error is smallest when specialisation is highest (for a definition of specialisation, see Appendix A.2, Eq. A5; conceptually, specialisation measures to what extent a node is active for stimuli of one class and not for stimuli of other classes), as shown in Fig. 3B, which reports the error as a function of sparsity and specialisation. For all training instances analysed, the lowest error corresponds to the highest specialisation value. Thus, specialisation provides a systematic way to choose the starting condition of the network. Indeed, it is possible to select the thresholds using the percentile value that yields the highest specialisation measure. However, there is no need to excessively fine-tune the initialization, since the learning rule will optimize the threshold values anyway. We note also that this simulation is performed through a simple gradient descent algorithm, and that the model's dependence on initial conditions can be ameliorated by using more complex optimizers, as will be shown in Section 3.2.

Figure 3: The learning process modulates the network's sparsity level toward an optimal percentage of active nodes. A: Performance as a function of sparsity for different training instances of the model (a colour represents a specific training time, which increases from top to bottom). For each instance the results are fitted with a second-degree polynomial that has a minimum around 0.5 on the x-axis, demonstrating the existence of an optimal percentage of active nodes. The dashed line connecting the results for diverse training times highlights the change in the sparsity level achieved through the learning rule. B: Performance as a function of sparsity and specialisation. The best performance corresponds to the highest specialisation values for all training instances, demonstrating the interpretability of the model.

Finally, we compared the performance of the SpaRCe model with: 1) An echo state network (ESN) without thresholds, where the same on-line learning is applied to the output weights W_out only. We note that SpaRCe learns N more parameters (the thresholds) in comparison to the echo state network without thresholds. 2) A hidden layer (HL) model, where we added a full hidden layer of N_h nodes on top of the reservoir. This approach learns an additional connectivity matrix between the reservoir and the hidden layer, dramatically increasing the number of parameters by approximately N_h × N additional weights.
3) An echo state network (ESN) with online learning and L1 or L2 regularization terms on the output weights.

The SpaRCe model outperforms the standard ESN with or without the regularization terms, based on classification accuracy and root mean square error (Fig. 4A,C,D). This advantage is consistent across different levels of external noise (σ_s) and different numbers of stimuli (Fig. 4C,D). Furthermore, SpaRCe performs comparably to a network with an additional full hidden layer with N_h ≈ 100 nodes, even though the hidden layer dramatically increases the number of learnable parameters compared to SpaRCe (Fig. 4A,B,D). In comparison to the addition of a hidden layer, SpaRCe provides a cheap way to achieve an optimal and reliable sparsity level (see Fig. 4B, where the numbers of learnable parameters for the models are reported alongside the corresponding performance). In general, when introducing a hidden layer of N_h neurons trained with backpropagation, for a network of N_o output neurons, we would learn N × N_h + N_h + N_h × N_o + N_o = N_h × (N + N_o + 1) + N_o additional parameters (weights + biases + output weights + output biases; 1000 × 100 + 100 + 100 × 2 + 2 ≈ 10^5 for the case with N_h = 100), while for a classical reservoir we learn N × N_o + N_o ≈ N × N_o parameters (1000 × 2 + 2 ≈ 2 × 10^3) and for SpaRCe N × N_o + N + N_o parameters (1000 × 2 + 1000 + 2 ≈ 3 × 10^3). While adding a hidden layer goes against the ESN principle of exploiting the network dynamics while using simple learning methods, it remains an interesting comparison for quantifying the efficiency of the proposed method. With N_h = 100 and N_o ≪ N_h, our proposed thresholded architecture is efficient in terms of the number of learnable parameters in comparison to adding a fully trained hidden layer to an ESN. We conclude that the SpaRCe model considerably improves the performance and the convergence time of a reservoir on this biologically inspired benchmark task, with the relatively small overhead of N additional parameters, one per reservoir node.

Figure 4: The SpaRCe algorithm increases the memory capacity of the ESN and the stability of the found solution. A: Classification accuracy and root mean square error of the models for a case where the number of sequences to be classified is 192. Each minibatch corresponds to the presentation of 20 training samples. In this case, the performance of SpaRCe corresponds to a starting P_n = 50, while the hidden layer has N_h = 100 nodes. Dots correspond to the training instances at which the considered models solve the task (for the models that can solve it), thus showing the training speed of the algorithms analysed. B: Performance of SpaRCe for three different starting sparsity levels (red and orange colours), of a standard ESN (black) and of HL (blue colours) for diverse numbers of nodes N_h. The x-axis reports the number of trainable parameters and is shown on a logarithmic scale. The graph shows how SpaRCe is able to reach good performance while maintaining a low number of trainable parameters. C: Comparison of the root mean square error for the ESN and the ESN with thresholds as the external noise σ_s and the number of stimuli vary. The introduction of thresholds leads to robust results.
D: Performance as the number of inputs to be classified increases, for the hidden layer model with N_h = 100 (blue), SpaRCe (red) with P_n = 50 and a standard ESN (black). SpaRCe and HL solve the tasks considered, but the latter uses a number of trainable parameters (reported alongside the performance in the colour scheme of the corresponding model) that is two orders of magnitude higher than the former. The accuracy of the standard ESN drops considerably as the number of sequences increases. The inset shows the root mean square error for SpaRCe and for HL, varying the starting condition P_n and the number of nodes N_h respectively. The errors shown correspond to the training instance at which the fastest model reaches perfect classification accuracy. Numbers reflect the number of trainable parameters for example cases.

3.2 Threshold learning increases performance

In this section we address three variants of classification tasks using the MNIST dataset. Each image is fed into the network sequentially, either one column at a time or one pixel at a time, to make the task temporally dependent. Thus, one written digit corresponds to a sequence of 28 time steps of a 28-dimensional input in the column-by-column paradigm, or to a sequence of 784 time steps of a one-dimensional input.

Table 1: Parameters defining the task, hyperparameters of the ESN and training hyperparameters for the storage capacity task: σ = 0.3, Δt = 0.1 s, T = 0.3 s, δt = 0.01 s, α = 0.1, ρ = 0.95, N = 1000, γ = 1, p_ER = 0.001, η_W = 2 × 10^−3, η_θ = 2 × 10^−4, minibatch size = 20.

MNIST, column by column. The application of ESNs to this specific task was previously analysed in [36], in which the original dataset was preprocessed and augmented by resizing and deforming the original images. Without such preprocessing, the ESN could not outperform a simple perceptron [36]. In contrast, in this work we use the original dataset, without any additional transformations. In this experiment, the pixels of each image are fed to the ESN sequentially column by column; thus, the input signal s(t) corresponding to an example image is a 28-dimensional sequence of temporal length 28. In this case, Ṽ = C[{V(t)} ∀t], meaning that the whole dynamics of the recurrent network across time is used to define the read-out of the system. The cost function adopted is a sigmoidal cross entropy

E = −⟨ Σ_j [ ỹ_j log σ(y_j) + (1 − ỹ_j) log(1 − σ(y_j)) ] ⟩   (9)

which is analysed in Appendix B.2. The optimizer used is Adam [49]. We first tested the SpaRCe model with various initial sparsity levels (Fig. 5A, different colours). Regardless of the initial condition, the final performance was similar, as was the final level of sparsity (the size of the dots in Fig. 5A shows the percentage of active nodes in the network). The average error achieved by SpaRCe is 1.9%. The hyperparameter values are given in Table 2. The model reaches performance levels comparable to those achieved using two-layer or three-layer neural networks with backpropagation [50]. We note that convolutional neural networks are the best performing networks for this task and for image classification problems in general. The best performance corresponds to an error rate of 0.21% on the MNIST dataset and is achieved through a pool of five convolutional neural networks [50].
However, the task faced here with SpaRCe is more challenging than the common approach used to train neural networks on the MNIST dataset, in which the whole image is fed into the network at once. The performance of convolutional neural networks and of multilayer perceptrons is reported in Table 2 as a benchmark for the results obtained with ESNs with SpaRCe. To demonstrate the importance of thresholds in the SpaRCe model, we compared the performance of SpaRCe to that of an ESN without thresholds trained online with the same optimizer, using a learning rate optimized through grid search. SpaRCe outperformed the ESN in both classification accuracy and convergence time (Fig. 5B, 'MNIST'). We also found that the normalisation mechanism introduced by SpaRCe stabilises the learning process at relatively higher learning rates than standard ESNs allow. As a second task, we applied the same model to the MNIST dataset where the pixels of each image are reordered through a permutation of the data. Each image is again fed one column at a time. Again, SpaRCe outperformed the ESN without thresholds (Fig. 5B, 'pMNIST').

Table 2: Hyperparameters and results for the three benchmark tasks in Section 3.2. The hyperparameters for psMNIST are listed for two reservoirs, since two ESNs are used for this task (as the subscript 2 on SpaRCe highlights): the suffix 1 corresponds to the first reservoir and the suffix 2 to the second reservoir. γ_21 indicates the input gain of the adjacency matrix from the first to the second ESN. The asterisks * and ** indicate that the results are taken from [50] and [39] respectively. NN, Conv., GRU and LSTM stand for multilayer perceptrons, convolutional networks, Gated Recurrent Units and Long Short-Term Memory respectively.
Hyperparameters, MNIST/sMNIST: α = 0.17, ρ = 0.97, N = 1000, γ = 0.1, p_ER = 0.01, η_W = 2 × 10^−3, η_θ = 2 × 10^−4, minibatch = 20.
Hyperparameters, psMNIST (first reservoir): α_1 = 1, ρ_1 = 1, N_1 = 300, γ_1 = 1, p_ER1 = 0.01.
Hyperparameters, psMNIST (second reservoir): α_2 = 0.017, ρ_2 = 0.99, N_2 = 500, γ_21 = 0.15, p_ER2 = 0.01, η_W = 2 × 10^−3, η_θ = 2 × 10^−4, minibatch = 20.
Results, MNIST: ESN 95.2, SpaRCe 98.1, NN 98.5*, Conv. 98.3/99.8*.
Results, psMNIST: ESN_2 94.7, SpaRCe_2 95.4, GRU 95.4**, LSTM 89.9**.

psMNIST. We next analysed the performance of the SpaRCe model and of ESNs in general on the psMNIST (permuted sequential MNIST) task, which has become a standard benchmark for recurrent neural networks [39]. In this case, each image is reordered through a fixed permutation and fed into the recurrent network one pixel at a time. The task is particularly challenging because temporal dependencies must be learned between widely separated time steps of an image. Since each sequence is a succession of 784 pixels, we decided to sample the dynamics of the reservoir across time at constant steps, multiples of 28, defining Ṽ = C[{V(t) : 28 | t}], where 28 | t indicates that t has to be divisible by 28. This sampling procedure at constant time steps could be suboptimal compared to an approach where the most informative time steps are selected. However, an analysis of optimal sampling procedures goes beyond the scope of this paper. Our first attempts to solve the task with a single ESN gave performance comparable to or worse than a standard perceptron model trained on the whole image, because the ESN could not simultaneously discover long time dependencies and quickly adapt to new inputs.
Indeed, in any network of randomly connected low-pass filters such as ours, it can be difficult to associate events that are distant in time, as it may be impossible to find a workable balance in the trade-off between keeping a memory of past events (which requires each node's activity to have a long time constant, i.e., slow decay) and allowing the network to evolve dynamically over time (which requires a short time constant, i.e., fast decay). To overcome this difficulty, we used an architecture composed of two reservoirs. The first has 300 nodes with fast time constants, and the second has 1000 nodes with slower time constants (the parameters for the two reservoirs are reported in Table 2). The faster reservoir signals unidirectionally to the slower one, which is used for learning. This type of structure, where the timescales of the first network are faster than those of the second, was found to be optimal in previous work [30]. The sparsity level of the connectivity of the faster reservoir is important: it regulates a trade-off between too much connectivity (the relation between the input information to the second reservoir and the input signal becomes too complex) and not enough connectivity. The best performance arises when the network has the shortest path lengths between two nodes that still permit a sufficient amount of memory. Using this architecture, SpaRCe performs comparably to published methods despite using a much simpler training algorithm. As with MNIST and pMNIST, on psMNIST SpaRCe outperformed the threshold-less ESN in both accuracy and speed (Fig. 5B). To compare SpaRCe with published models, we repeated the simulation with an ESN with N = 500 nodes (thus 154 × 10^3 parameters), which gave an accuracy > 0.95, comparable to the best, more complex recurrent neural networks that exploit backpropagation through time (BPTT), whose performances are 0.954 (GRU) and 0.899 (LSTM) [39] (Table 2), with a comparable number of parameters (≈ 167 × 10^3). While BPTT needs to unroll all the dependencies of the neural network activities across time, ESNs have the advantage of training a much simpler perceptron on top of the reservoir representation. In particular, considering that the dynamics of a reservoir across a dataset can be computed once and then used to train a high-dimensional perceptron, the computational cost of ESNs is greatly reduced compared to that of RNNs trained with BPTT. We emphasize that the procedure of concatenating previous temporal representations is not simply a shortcut, but is necessary to increase the dimensionality of the representation in order to solve complex machine learning tasks through a reservoir computing approach. Indeed, the idea behind reservoir computing is to exploit the temporal dynamics of a system as a fixed, higher-dimensional representation that allows the classes of a classification task to be separated by a hyperplane. This approach is antithetical to the learning process of a recurrent neural network with backpropagation through time, which trains the dynamics of the system and draws a nonlinear manifold to solve the classification task. Furthermore, the concatenation procedure does not guarantee any understanding of the long temporal dependencies among pixels that are necessary to effectively solve the problem.
It is therefore of interest that ESNs using SpaRCe could perform comparably to state-of-the-art recurrent networks, whose parameters are trained via a far more complex algorithm, backpropagation through time.

Figure 5: The SpaRCe model shows performance comparable to a two- or three-hidden-layer neural network on the MNIST dataset and accuracy comparable to more complex RNNs trained with BPTT on the psMNIST task. In contrast, adopting the standard ESN read-out leads to lower performance. A: The sizes of the dots reflect the percentage of active nodes (sparser network = smaller dots) in the network. Each minibatch corresponds to the presentation of 20 training samples. The abscissa of the inset figure is scaled logarithmically. B: Performance of the ESN with (red) and without (grey) threshold learning on the three tasks analysed, measured by accuracy and convergence time (C.T.). The network with the SpaRCe model outperforms the standard ESN read-out on all the benchmarks, but the contribution of the thresholds decreases as the task becomes more complex. This can be understood by considering that the increasing complexity of the tasks from left to right of the graph arises from a greater demand on the network's ability to understand long-term dependencies. This aspect depends on the system dynamics and is not strongly related to threshold learning. Furthermore, the SpaRCe model converges about 5 times faster than an ESN without thresholds.

3.3 SpaRCe alleviates catastrophic forgetting

Catastrophic forgetting refers to the inability of standard neural networks to learn different tasks sequentially. If a neural network is trained on a specific dataset and then retrained to perform a novel task, it will probably forget what it has learned before. This unsolved problem [40] is critical for the future development of neural networks in general. Previous research has formulated models that mitigate catastrophic forgetting, using a variety of techniques categorised by Kemker et al. [40]. These techniques include regularisation (imposing a cost on changing the weights that contribute to previous tasks, as in Elastic Weight Consolidation, or EWC), rehearsal (re-playing previously learned data during subsequent training, as in GeppNet), and sparse coding (reducing the fraction of active nodes, as in the Fixed Expansion Layer model, or FEL, and Hard Attention to the Task, or HAT [41]). Sparse coding is also the approach we use here, with SpaRCe. However, in contrast to these techniques, which exploit additional information, such as model awareness of the task identity, and the computation of ad-hoc quantities, such as the importance of specific parameters (EWC) or nodes (HAT) for a given task, we will demonstrate how the application of Eq. 4 alone, with a high starting sparsity level, can alleviate catastrophic forgetting. Notably, the same methodology applied in the previous simulations to improve convergence time and performance of reservoir computing can improve the ability of an ESN to learn different tasks sequentially. The difference between the application of SpaRCe in a single-task and in a sequential-task paradigm lies in the tuning of the hyperparameters of the proposed model, in particular the percentile value n and the learning rates η_W and η_θ.
We will demonstrate how sparsity regulates a trade-off between initial learning speed and memory retention: high sparsity decreases overlaps among representations, which prevents new learning from disrupting old memories, whereas low sparsity means more nodes are active, allowing memories to be formed faster on new tasks. This trade-off between initial learning and preventing forgetting is studied in detail for two paradigms that are commonly used to measure catastrophic forgetting in neural networks:

• Sequential data permutations. A different permutation is applied to the considered dataset N_task times. These N_task reshuffled datasets are then learnt sequentially by the system. In our simulation, N_task = 10. In such a scenario, the complexity of the different datasets is the same, and we trained SpaRCe for approximately two epochs for each task.

• Sequential classes. The first task is composed of the data corresponding to half of the possible classes, while the other classes are trained sequentially. Since the dataset considered has 10 classes, the number of tasks that the network has to learn sequentially is six. In this case, we trained the model for approximately one epoch for each task.

The dataset used is MNIST, where the reservoir processes every image column by column as in Section 3.2, adopting Ṽ = C[{V(t)} ∀t] as before. Of course, the model is able to learn from the data corresponding to a task only once. We learn from the training dataset and compute the accuracy on the test dataset, varying the initial starting sparsity levels through a grid search over the value of n in Eq. 4 and varying the number of tasks considered (Fig. 6C,D). These performances are used for comparison and demonstration only, and a validation dataset is used to select the best hyperparameter setting. In comparison to previous simulations, the learning rate for the thresholds is smaller (see Table 3), because it was crucial to prevent a dramatic change of sparsity levels during learning, as such an abrupt change in the percentage of active nodes would alter stimulus representations and thereby affect previously learned tasks. In both tasks, lower sparsity levels allowed better initial learning on novel data (accuracy at N_task = 1, Fig. 6C,D), while higher sparsity levels alleviated forgetting of previous tasks during subsequent training. In other words, low sparsity allows good performance when the number of tasks is low, but the accuracy decreases quickly when N_task increases. On the contrary, at optimal sparsity levels, accuracy remains high even as the number of tasks increases (highlighted path on the surface plot in Fig. 6C,D). These two conflicting trends combined make the total accuracy across all datasets an inverted U when plotted against sparsity level (Fig. 6A,B). Moreover, we computed the following quantities to measure the alleviation of catastrophic forgetting in more detail:

α_Overall = 1/(N_task − 1) Σ_{n=2}^{N_task} acc_n / acc(1, 1)   (10)

α_Memory = 1/N_task Σ_{n=1}^{N_task} [ acc(n, N_task) − acc(n, n) ]   (11)

α_New = 1/N_task Σ_{n=1}^{N_task} acc(n, n)   (12)

where acc_n is the accuracy computed on the datasets seen up to task number n (included), and acc(1, 1) is the accuracy on the first dataset immediately after its learning, which is considered as the ideal baseline. In general, we denote by acc(n, m) the accuracy on the n-th dataset after the presentation of m datasets.
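As an illustration of how Eqs. 10-12 can be evaluated, the sketch below (NumPy, illustrative names) takes a matrix acc[n, m] of per-task accuracies and a vector acc_seen[n] holding acc_n, the accuracy over all datasets seen up to task n measured right after training on task n.

```python
import numpy as np

def forgetting_metrics(acc, acc_seen):
    """Eqs. 10-12; acc[n, m] = accuracy on dataset n after training on m datasets (0-indexed here)."""
    n_task = acc.shape[0]
    alpha_overall = np.mean(acc_seen[1:]) / acc[0, 0]                        # Eq. 10
    alpha_memory = np.mean([acc[n, -1] - acc[n, n] for n in range(n_task)])  # Eq. 11
    alpha_new = np.mean([acc[n, n] for n in range(n_task)])                  # Eq. 12
    return alpha_overall, alpha_memory, alpha_new
```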
The performance metric α_Memory measures the model's ability to remember previous tasks, i.e., to alleviate catastrophic forgetting, while the metric α_New measures the model's ability to learn new tasks. In other terms, α_Memory is the average difference over tasks between the performance obtained after the whole training and that obtained immediately after the presentation of a specific dataset, while α_New is the average performance on a dataset right after its presentation. These metrics and α_Overall are taken from [40], where the performance of various models that alleviate catastrophic forgetting in multilayer perceptrons is compared. Finally, we repeated the simulation ten times for each of the two paradigms considered, averaged over different possible permutations (sequential data permutations) and over different ways of dividing the data into separate classes and corresponding tasks (sequential classes), selected the best performing algorithm based on the validation dataset and computed its performance α_Overall on the test dataset. The performance obtained is reported in Table 3 together with the results obtained by a standard ESN and by published models that use a variety of strategies to prevent catastrophic forgetting in multilayer perceptrons. Of course, we do not expect to compete with newer deep learning methods that exploit additional information and many more parameters. The results reported are obtained following, to the best of our knowledge, the same methodology as [40], and differ only in the necessity for an ESN to process the data temporally, rather than feeding the whole image to the network at once. In this respect, the temporal processing of the data can only make the task more challenging for our model, since catastrophic forgetting appears to be an even more difficult problem for recurrent neural networks [51]. We need to highlight, however, a drawback of our model: the tuning of an additional hyperparameter, namely the starting sparsity level P_n regulated by the proposed normalisation mechanism. However, we note that all other methods that alleviate catastrophic forgetting in neural networks usually have additional hyperparameters (set through heuristics or grid search), such as the learning rate of the penalty term in EWC. Even if the results on multilayer perceptrons can only be considered as a useful baseline, we note from Table 3 the complete inability of an ESN to solve sequential learning. Indeed, training the read-out weights of an ESN in the standard reservoir computing paradigm causes a complete forgetting of previous tasks, resulting in an overall performance that corresponds to the learning of the last task processed by the model. The success of SpaRCe on these catastrophic forgetting tasks arises both from the initial sparsity and from threshold learning. We analysed the relative importance of the starting sparsity level vs. the online threshold adaptation introduced by the learning rule in the Supplementary Materials.

Figure 6: SpaRCe helps to prevent catastrophic forgetting on the two analysed benchmarks. Different data points correspond to diverse repetitions of the experiment. A: Results on the MNIST dataset in the sequential classes paradigm (see text). B: Results in the permuted datasets paradigm.
C-D: Performance as a function of the starting percentage of active nodes and the number of tasks learned in the catastrophic forgetting simulations (C refers to sequential classes and D to permuted datasets). The surface is a cubic interpolation of the accuracy as the number of datasets and the starting sparsity level vary. The path shows the best performing sparsity levels across various numbers of tasks; its movement from right to left demonstrates the necessity of adopting increasing levels of sparsity as the number of datasets increases and the memory of previous tasks becomes more relevant.

Table 3: Parameters used in the catastrophic forgetting tasks and overall performance α_Overall. We note that the learning rates used for the thresholds are smaller than in the previous simulations, since we needed to avoid changing the sparsity level quickly, in order to keep the optimised sparsity level near the imposed starting value (P_n). SpaRCe performs comparably to the best of the ad-hoc models, which tend to perform well only on one of the two tasks analysed. The symbol + indicates that the results were taken from [40].
Hyperparameters, sequential classes: α = 0.17, ρ = 0.97, N = 1000, γ = 0.1, p_ER = 0.01, η_W = 5 × 10^−4, η_θ = 5 × 10^−5, minibatch = 20.
Hyperparameters, sequential permutations: α = 0.17, ρ = 0.97, N = 1000, γ = 0.1, p_ER = 0.01, η_W = 1 × 10^−3, η_θ = 1 × 10^−5, minibatch = 20.
Results (α_Overall), sequential classes: ESN 0.1, SpaRCe 0.870, EWC 0.133+, FEL 0.439+, GeppNet 0.922+.
Results (α_Overall), sequential permutations: ESN 0.1, SpaRCe 0.897, EWC 0.746+, FEL 0.279+, GeppNet 0.364+.

4 Discussion

It is customary in machine learning to introduce sparsity via regularisation: an ad-hoc penalty term is added to the network's error function to penalize the use of the weight parameters, leading to solutions with smaller (or sparser) weight values. In contrast, in neuroscience, sparsity is defined as the percentage of neurons that are active per stimulus, suggesting constraints not necessarily on the weights but rather directly on the neuronal activity. In this work, we take this latter approach. We learn a threshold per neuron via the minimisation of a standard error function, thereby associating sparsity directly with network performance. We demonstrate theoretically that such a rule reduces the usage of reservoir neurons that have correlated activities and are connected to the output in the same way, that is, with weights of the same sign. We also show theoretically, for the restricted case of positive weights (Supplementary Material), and in simulations, for the most general case (main text), that higher neuronal specialisation is a consequence of the standard error function applied to our network, where output neurons receive thresholded inputs from the reservoir. Due to threshold learning, neurons surviving the threshold will preferentially fire for one class over another, which improves performance, as the search for a satisfactory solution takes place in a smaller search space. Indeed, we compare learning with and without thresholds on the same network, and find that thresholds improve both the speed and the accuracy of learning, while standard regularisation terms do not benefit the echo state network. Notably, threshold learning and weight learning in our setup is a two-way interaction: the threshold changes depend proportionally on the size of the weights. In practice, we have found that, statistically, an increase in neuronal specialisation follows large weight changes.
We formulated the recurrent neural network by allowing one observable variable per neuron (the thresholded activity) and one hidden variable (the activity before the threshold). As such, learning the thresholds does not disrupt the dynamics of the neurons in the reservoir, which evolve as in a standard echo state network. Instead, the output neurons either receive input from a reservoir neuron or not, depending on that neuron's individual threshold, which means that to learn the thresholds we do not need to resort to backpropagation through time or other complex techniques. A biological interpretation of this structure could be that the recurrently connected neurons signal to each other based on subthreshold depolarisation rather than action potentials. Such signalling could occur through dendro-dendritic synapses, which have been observed in the fly mushroom body [52], the structure that inspired the task in Section 3.1. Most interestingly, the learning rule we derive is structurally identical to the update rule for a bias via backpropagation, since the learnable component of our formulation is effectively equivalent to adding a hidden layer containing as many neurons as the reservoir but with a very specific architecture: a one-to-one fixed connection to the neurons of the reservoir and a learnable threshold. The advantages of the proposed model (compared to adding a fixed-weight, fully connected layer) arise from the specific architecture, the number of trainable parameters, the normalisation mechanism, and the fact that weights and thresholds are learned on different time scales (learning rates). In our simulations, a hidden layer of 50 neurons was required to match SpaRCe, with approximately 47,000 additional parameters (compared to the 3,000 used for SpaRCe), and it required roughly twice the training time. Another advantage of threshold learning is that it helps stabilise the network if a large learning rate has been selected. In simulations we have observed that high learning rate values, which lead to instabilities in the learning process for the non-threshold model, lead to excellent performance for the threshold model: the thresholds act as a stabilisation mechanism by quickly decreasing the activity of the network through a faster deactivation of neurons. As the mathematical analysis confirms, the threshold updates are proportional to the output weights, which suggests that thresholds move faster for larger weights. Furthermore, the simulations also indicate that the sparser the initial conditions, the stronger the threshold changes, which again can be understood through the contradicting terms that a non-sparse network contributes to the threshold update. Notably, our model competes with feedforward and recurrent networks on standard benchmark problems (MNIST, sequential MNIST and permuted sequential MNIST), and is always best or close to the best alternative algorithm. This is an impressive result given the simplicity of our model and the relatively small overhead of the threshold process on the echo state network.
While an echo state network is unlikely to compete in general with more complex networks that have a significantly higher number of learnable parameters, SpaRCe allows echo state networks to achieve performances that were not possible before, at least not without augmenting the dataset before it is given to the network or combining the network with other, more complex machine learning algorithms. Perhaps less obviously, the threshold initialisation is key to the success of the rule and to its remarkably consistent performance across initialisation conditions. It is apparent from the mathematical formulae that the gradient rule for the threshold cannot activate silent neurons. Therefore, if the initialisation were entirely random, neurons with excessively high initial thresholds would never fire during stimulus presentation; effectively, such neurons would be removed from the network for the whole duration of the simulation. To prevent this issue, the total input is first presented to the recurrent network and we observe the operational activity range of each neuron. This allows us to set a threshold within this regime, ensuring that each neuron is active for a pre-decided percentage of time across all stimulus presentations. In fact, one does not need to use the exact input of the network: any signal with the same statistics as the actual input suffices. Similarly, it turns out that while some initial values may be better in terms of performance, in practice all that is needed is to give all neurons the same chance of being active during stimulus presentation, and threshold learning will take care of the rest. Within the context of catastrophic forgetting, the last statement does not entirely hold: there is a clear advantage for our model if the sparsity level is initially set high, which we can also understand in terms of threshold updates being larger in magnitude for sparser networks. Our model competes with published models across two standard tests for catastrophic forgetting, unsurprisingly, given that we have shown that a consequence of our method is an increase in neuronal specialisation. In a sparse network, different neurons are more likely to be used for different classes, and during the learning of new classes, new neurons are likely to be recruited. In fact, sparsity without threshold learning already helps significantly in the case of catastrophic forgetting, but threshold learning adds the ability to learn the newer sets better. In general, however, we do not expect to outperform complex new methods that exploit additional information: our model, following the principle of echo state networks and reservoir computing in general, uses inherent properties of the network (e.g. dynamics, sparsity) to boost performance in classification tasks and in catastrophic forgetting at minimal computational cost. In summary, our work leads to a reinterpretation of the traditional role of thresholds in neural networks. We have shown that by disentangling the learning of the thresholds from the learning of the weights, and by having a layer where learning takes place via threshold adaptation only, we were able to achieve sparse solutions and to explain mathematically how this sparsity arises. To the best of our knowledge, this is the first time that this analysis and interpretation has been provided for threshold learning, and we believe that this work may be applicable to network structures beyond echo state networks.
Finally, reservoir computing is of increasing interest to the neuromorphic computing community, particularly to those who aim to use material dynamics for computation. For instance, in the spintronics community, magnetic devices have been proposed as reservoir replacements, and more complex methods such as deep learning cannot be implemented at the material level. In the context of the echo state network, the reservoir serves only as a spatiotemporal kernel [53], i.e. it increases the dimensionality of the input signal in order to allow a linear model (a perceptron) to separate the classes. It can therefore be replaced by any highly non-linear but non-chaotic system that transforms its input into an appropriate higher-dimensional space; proof-of-concept systems of this kind can be found, for instance, in [54] [55]. Our algorithm does not impose any modification of the reservoir itself, which allows its use even when the recurrent network is replaced by a physical material.

Acknowledgement

We thank Paolo Del Giudice and Guido Gigante for their input on the analysis of the timescales in the reservoir model, and Jelmer Borst for suggesting the use of our method on catastrophic forgetting. EV and ACL would like to acknowledge support from a Google DeepMind Award. EV was partially funded by the Engineering and Physical Sciences Research Council (Grant Nos. EP/S009647/1, EP/S030964/1 and EP/P006094/1). ACL was partially funded by the European Research Council (639489) and the Biotechnology and Biological Sciences Research Council (BB/S016031/1).

5 Appendix

5.1 Reservoir initialization

The equation describing the dynamics of a reservoir of leaky integrators is

V(t + 1) = (1 − α) V(t) + α f( W_in s(t) + ρ W V(t) )    (A1)

where W is a random connectivity matrix whose eigenvalues are uniformly distributed inside the unit circle of the complex plane, and ρ < 1 is a constant. The rescaling factor ρ is called the spectral radius and is explicitly defined to control the maximum absolute value of the eigenvalues of the matrix ρW. The fact that the eigenvalues of ρW are constrained inside the unit circle is a necessary condition for the echo state property of the network. Given the eigenvalues λ_W of W, the eigenvalues λ of the linearised dynamical system associated with Eq. A1 are

λ = (1 − α) + α ρ λ_W    (A2)

and thus the λ_W are compressed by a factor α and translated by 1 − α in the complex plane. As a consequence, λ follows the probability distribution

p(x, y) = 1 / (π α² ρ²)  if (x − (1 − α))² + y² ≤ α² ρ²,  and 0 otherwise    (A3)

where x = Re(λ) and y = Im(λ) for simplicity of notation. Since the real part of the eigenvalues is related to the timescales τ of the dynamical system through Re(λ) = exp(−δt/τ) ≈ 1 − δt/τ, it is possible to compute the marginal distribution of p(x, y) over x, and from it the distribution of timescales.
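As a concrete illustration (not the authors' code), a minimal NumPy sketch of the leaky-integrator update of Eq. A1. It assumes a tanh nonlinearity and a dense Gaussian W rescaled to unit spectral radius; the paper instead draws the eigenvalues of W uniformly inside the unit circle, so the rescaling here is only a convenient approximation. Sizes and hyperparameters echo Table 3; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, N_in = 1000, 28            # reservoir size and input dimension (Table 3 uses N = 1000)
alpha, rho = 0.17, 0.97       # leakage term and spectral radius (values from Table 3)

# Dense Gaussian connectivity, rescaled so that its largest |eigenvalue| is 1;
# rho then sets the spectral radius of rho * W in Eq. A1.
W = rng.standard_normal((N, N))
W /= np.max(np.abs(np.linalg.eigvals(W)))
W_in = rng.standard_normal((N, N_in))

def reservoir_step(V, s, f=np.tanh):
    """One leaky-integrator update: V(t+1) = (1 - alpha) V(t) + alpha f(W_in s(t) + rho W V(t))."""
    return (1.0 - alpha) * V + alpha * f(W_in @ s + rho * (W @ V))

# e.g. drive the reservoir with a 28-step input sequence (one MNIST digit, column by column)
V = np.zeros(N)
for t in range(28):
    V = reservoir_step(V, rng.random(N_in))
```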
A simple strategy for choosing α and ρ, given the range of timescales [τ_m, τ_M] that the network should exhibit, is to note that the fastest (slowest) timescale τ_m (τ_M) is associated with the minimum (maximum) real eigenvalue of the dynamical system. Calling λ_m = min Re(λ) and λ_M = max Re(λ), we have

λ_m = 1 − α + αρ(−1) = 1 − α(1 + ρ) = exp(−δt/τ_m) ≈ 1 − δt/τ_m  →  α(1 + ρ) ≈ δt/τ_m

and

λ_M = 1 − α + αρ(+1) = 1 − α(1 − ρ) = exp(−δt/τ_M) ≈ 1 − δt/τ_M  →  α(1 − ρ) ≈ δt/τ_M

Solving the system above, we obtain

α ≈ δt/(2τ_m) + δt/(2τ_M)   and   ρ ≈ δt/(2ατ_m) − δt/(2ατ_M)

which relate α and ρ to the minimum and maximum timescales that the model can exhibit. In this way, it is possible to choose the hyperparameters α and ρ by selecting a priori the more interpretable parameters τ_m and τ_M. We emphasise that this procedure does not guarantee an optimal choice of the hyperparameters, but it can guide the search and it ensures a good choice in terms of the temporal memory of the reservoir.

5.2 Thresholds initialization

Imposing a democratic initialisation, where each node has the same probability of being active, the initial condition and sparsity level are defined by the choice of the starting percentile P_n. We define two approaches for choosing P_n:

• A simple grid search over n. Here, N_P reservoirs are trained in parallel for the first 10% of the time steps of the training instance, and the best performing reservoir is selected for the remainder of training. From our results, a small fraction of the training time is enough to choose the starting condition without any loss in performance.

• Select the sparse representation that leads to the highest value of specialisation, a measure of the quality of the sparse representations that is defined below.

The measure of specialisation (S_p) reflects how well a level of sparsity can facilitate the learning process in a classification task. The assumption behind the following formulation is that, for a good sparse representation, the ensembles of active nodes for different classes should overlap as little as possible. Let us consider two classes j and k and a neuron i. The node is specific if there is an asymmetry in the number of times it is active for one class with respect to the other. Generalising this idea, it is possible to build a measure spec_ijk for a node i, defined as

spec_ijk = | N_ij / M_j − N_ik / M_k |    (A4)

where N_ij (N_ik) is the number of times neuron i was active after the presentation of a stimulus of class j (k), and M_j (M_k) is the total number of presentations of stimuli belonging to class j (k). Since the denominators of Eq. A4 contain the total numbers of presentations, spec_ijk does not simply increase with the level of sparsity introduced. Let us focus on the particular case where M_j ≈ M_k. Too high a level of sparsity would lead the node to be almost silent, with a consequently poor specialisation value due to N_ij and N_ik both being close to zero. On the contrary, too low a sparsity level would lead the neuron to be excessively responsive, and spec_ijk would be poor because N_ij ≈ N_ik even if both counts are high. Given spec_ijk, it is possible to compute a measure of specialisation for each single neuron as

S_p^i = ⟨ spec_ijk ⟩^(>0)_jk    (A5)

where ⟨·⟩^(>0)_jk is the average over the positive elements across the indexes j, k. It is then possible to select the starting values of the thresholds as the n-th percentile of the distribution of V that leads to the highest specialisation measure.
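A sketch of the democratic threshold initialisation and of the specialisation measure of Eqs. A4-A5, assuming that the pre-threshold activities |Ṽ| have been recorded over a set of labelled presentations; the toy data and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def init_thresholds(V_abs, P_n):
    """Democratic initialisation: each neuron's offset is the P_n-th percentile of its own
    |V~| distribution, so every node starts active for the same fraction of presentations."""
    return np.percentile(V_abs, P_n, axis=0)      # V_abs: (n_presentations, N)

def specialisation(active, labels, n_classes):
    """Average specialisation S_p of Eqs. A4-A5.
    active: boolean (n_presentations, N); labels: class index per presentation."""
    rates = np.stack([active[labels == c].mean(axis=0) for c in range(n_classes)])   # N_ic / M_c
    pairs = [(j, k) for j in range(n_classes) for k in range(j + 1, n_classes)]
    spec = np.stack([np.abs(rates[j] - rates[k]) for j, k in pairs])                 # spec_ijk
    pos_sum = np.where(spec > 0, spec, 0.0).sum(axis=0)
    pos_cnt = (spec > 0).sum(axis=0)
    S_p_i = np.divide(pos_sum, pos_cnt, out=np.zeros(spec.shape[1]), where=pos_cnt > 0)
    return S_p_i.mean()                           # S_p averaged across neurons

# Toy example: pick the starting percentile P_n with the highest specialisation.
rng = np.random.default_rng(0)
V_abs = np.abs(rng.standard_normal((200, 50)))    # 200 presentations, 50 neurons (synthetic)
labels = rng.integers(0, 3, size=200)
scores = {P_n: specialisation(V_abs > init_thresholds(V_abs, P_n), labels, 3)
          for P_n in (25, 50, 75, 90)}
best_P_n = max(scores, key=scores.get)
```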
Figure 3 (Main text) shows how the best performing sparse representation corresponds approximately to the maximum value of the average specialisation across neurons, S_p = (1/N) Σ_i S_p^i.

5.3 Gradients on thresholds

The training procedure minimises a measure E of the distance between the output y = W^o x of the neural network and the desired value ỹ:

E = dist( ỹ, y )    (B1)

We now apply gradient-based optimisation to an example cost function and show how the resulting learning rule for the thresholds can be interpreted.

Gradient on θ, mean square error (MSE). Let us consider the mean square cost function, given by

E = dist( ỹ, y ) = Σ_j [ ỹ_j − y_j ]²
  = Σ_j [ ỹ_j − Σ_i W^o_ji x_i ]²
  = Σ_j [ ỹ_j − Σ_i W^o_ji sign(Ṽ_i) relu( |Ṽ_i| − P_n(|Ṽ_i|) − θ̃_i ) ]²    (B2)

where we have used the read-out of Eq. 2 (Main text) to define the output of the neural network. A gradient-based approach that minimises E leads to the following learning rule for the output weights:

ΔW^o_lk = − η_W ∂E/∂W_lk = η_W [ ỹ_l − y_l ] x_k
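As an illustration only, a minimal NumPy sketch of the read-out of Eq. 2 (Main text) and of the weight rule above, for a single presentation; `theta_offset` stands for the fixed percentile term P_n(|Ṽ|) set at initialisation and `theta_learn` for the learnable part θ̃. All names are assumptions made for the sketch.

```python
import numpy as np

def sparce_readout(V, theta_offset, theta_learn, W_out):
    """Thresholded read-out of Eq. 2 (Main text):
    x_i = sign(V~_i) relu(|V~_i| - P_n(|V~_i|) - theta~_i),   y = W^o x."""
    x = np.sign(V) * np.maximum(np.abs(V) - (theta_offset + theta_learn), 0.0)
    return W_out @ x, x

def weight_update(W_out, y, y_target, x, eta_W):
    """On-line MSE rule for the output weights: Delta W^o_lk = eta_W (y~_l - y_l) x_k."""
    return W_out + eta_W * np.outer(y_target - y, x)
```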
The same gradient descent yields the learning rule for the thresholds:

Δθ_k = − η_θ ∂E/∂θ_k = − η_θ ∂E/∂θ̃_k
     = η_θ Σ_{j=1}^{N_class} ( ỹ_j − y_j ) ∂/∂θ̃_k [ Σ_i W^o_ji x_i ]
     = − η_θ Σ_{j=1}^{N_class} ( ỹ_j − y_j ) W^o_jk sign(Ṽ_k) H( |Ṽ_k| − P_n(|Ṽ_k|) − θ̃_k )
     = − η_θ Σ_{j=1}^{N_class} ( ỹ_j − y_j ) W^o_jk sign(x_k)
     = − η_θ Σ_{j=1}^{N_class} ỹ_j W^o_jk sign(x_k) + η_θ Σ_{j=1}^{N_class} y_j W^o_jk sign(x_k)    (B3)

where H denotes the Heaviside step function. Considering the specific case of a classification task, where ỹ_j is positive for the j corresponding to the desired class and zero otherwise, Eq. B3 can be separated into two terms that uncover the meaning of threshold learning:

Δθ_k = − η_θ β W^o_j̃k sign(x_k) + η_θ Σ_{j=1}^{N_class} y_j W^o_jk sign(x_k)
     = − η_θ β W^o_j̃k sign(x_k) + η_θ Σ_{j=1}^{N_class} Σ_{l=1}^{N} W^o_jl W^o_jk x_l sign(x_k)    (B4)

where j̃ indicates the correct class for the considered input and β is the positive quantity equal to the desired output value ỹ_j̃ of the correct class. Eq. B4 contains two clearly interpretable factors:

Δθ^(1)_k = Σ_{j=1}^{N_class} Σ_{l=1}^{N} W^o_jl W^o_jk x_l sign(x_k)    (B5)

Δθ^(2)_k = − β W^o_j̃k sign(x_k)    (B6)

Gradient on θ, cross entropy. The error function has the form

E = − Σ_j [ ỹ_j log σ(y_j) + ( 1 − ỹ_j ) log( 1 − σ(y_j) ) ]

In this case, the learning rule for the thresholds is

Δθ_k = − η_θ Σ_j ỹ_j ( 1 − σ(y_j) ) W_jk sign(x_k) + η_θ Σ_j ( 1 − ỹ_j ) σ(y_j) W_jk sign(x_k)
     = − η_θ Σ_j ỹ_j W_jk sign(x_k) + η_θ Σ_j σ(y_j) W_jk sign(x_k)    (B7)

The two terms of Eq. B7 have meanings comparable to Δθ^(2) and Δθ^(1) of Eqs. B6 and B5 computed for the mean square error. To demonstrate this, consider a classification task where ỹ_j = 1 for the correct class and zero otherwise. Furthermore, since the network output is not in the saturating regime of the sigmoid when most of the learning happens, we can use the first terms of the Taylor series of the sigmoid to approximate the second term of Eq. B7:

Δθ_k = − η_θ W_j̃k sign(x_k) + η_θ Σ_j σ(y_j) W_jk sign(x_k)
     = − η_θ W_j̃k sign(x_k) + η_θ Σ_j ( 1/2 + y_j/4 + ... ) W_jk sign(x_k)
     = − (1/2) η_θ W_j̃k sign(x_k) + (1/2) η_θ Σ_{j≠j̃} W_jk sign(x_k) + η_θ Σ_j ( y_j/4 + ... ) W_jk sign(x_k)    (B8)

where we can define

Δθ^(2)_k = − (1/2) η_θ W_j̃k sign(x_k) + (1/2) η_θ Σ_{j≠j̃} W_jk sign(x_k)    (B9)

Δθ^(1)_k = η_θ Σ_j ( y_j/4 + ... ) W_jk sign(x_k)    (B10)

Eq. B9 has a meaning analogous to Eq. B6, with the additional term Σ_{j≠j̃} W_jk sign(x_k), which increases (decreases) the thresholds of nodes that are helping (impeding) the wrong classification. Considering only the linear term of the Taylor expansion of the sigmoid, Eq. B10 has the same form as Eq. B5. To demonstrate its role as a balancing term that deactivates nodes helping the classification in a similar way (their contributions have the same sign as y_j), we estimated Eq. B10 by subtracting Δθ^(2) from the value of the gradient; the result is shown in panel C of Fig. 7 (Supplementary).

Figure 7: Δθ^(1) and Δθ^(2) drive the learning of the thresholds. A, C: Values of Δθ^(1) and Δθ^(2) averaged across the population of nodes for two different starting sparsity levels; the two factors have an interpretable meaning (see Main text). B, D: Average of the cumulative threshold change. If the starting condition is suboptimal and low (high), this average is positive (negative), and the level of sparsity consequently increases (decreases). Panels A and B refer to the storage capacity task of Section 3.1, where the cost function is a mean square error, while panels C and D refer to the classification of the MNIST dataset in the column-by-column paradigm of Section 3.2, where a sigmoid cross-entropy function was adopted. In the latter case, the two factors are defined in Eqs. B10 and B9.
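To make the rule concrete, a minimal sketch of the on-line threshold update of Eq. B3 and of the two factors of Eqs. B5-B6 (MSE case), assuming the read-out variables of the previous sketch; the function and variable names are illustrative assumptions.

```python
import numpy as np

def threshold_update(theta, W_out, x, y, y_target, eta_theta):
    """General MSE rule of Eq. B3:
    Delta theta_k = -eta_theta * sum_j (y~_j - y_j) W^o_jk sign(x_k).
    Silent nodes have sign(x_k) = 0 and receive no update, so the rule cannot re-activate them."""
    return theta - eta_theta * (W_out.T @ (y_target - y)) * np.sign(x)

def threshold_factors(W_out, x, correct_class, beta=1.0):
    """Decomposition of Eq. B4 into the factors of Eqs. B5-B6:
    dtheta1 tends to raise the thresholds of nodes whose contributions push the outputs in the
    same direction (redundant, correlated nodes); dtheta2 lowers the thresholds of nodes that
    support the correct class, keeping them active."""
    s = np.sign(x)
    dtheta1 = (W_out.T @ (W_out @ x)) * s          # sum_{j,l} W^o_jl W^o_jk x_l sign(x_k)
    dtheta2 = -beta * W_out[correct_class] * s     # -beta W^o_{j~,k} sign(x_k)
    return dtheta1, dtheta2
```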
Finally, the sparsity level reached after optimisation depends on the values of the learning rates η_W and η_θ used. Indeed, threshold learning can act as a balancing force that, in the case of high learning rates, decreases the percentage of active nodes in the network and stabilises the training phase. In this regard, Fig. 8 (Supplementary) shows the accuracy and the corresponding sparsity level after optimisation, as η_W and η_θ vary, for the MNIST classification task of Section 3.2, where the input is given to the network column by column (as a 28-dimensional sequence of 28 time steps).

Figure 8: The proposed model acts as a stabilisation mechanism that permits the use of higher learning rates than those adopted in the standard ESN read-out. The results are obtained on the MNIST task of Section 3.2 in the column-by-column paradigm. For all the results shown, the model is initialised from a starting sparsity level P_n = 50. We plot the accuracy (A) and the sparsity level (B) as the learning rates η_W and η_θ vary. A: The best performance of the algorithm corresponds to a region where the learning rate on the thresholds is at least ten times lower than the learning rate on the output weights. This result is expected, considering that the learning rule on the thresholds depends explicitly on the output weights (Eq. B3); thresholds can therefore be learnt accurately only when the output weights carry information about the classification process, i.e. when weights are learnt faster than thresholds. B: The percentage of active nodes decreases as the learning rates η_W and η_θ increase. This can be understood by considering that, when the learning rates are high, the model avoids abrupt changes in the output y by decreasing the activities of the representation, i.e. by increasing the sparsity level of the network. The parameter space above the white horizontal line corresponds to a region where learning with the standard ESN read-out is unstable because of the excessively high value of η_W adopted; in that region, training is characterised by an undesirable increase of the cost function across learning for the ESN, but not for SpaRCe.

5.4 Initialisation and threshold adaptation in sequential learning

Figure 9: The initialisation procedure is the key element in alleviating catastrophic forgetting. A: Difference between the full SpaRCe model and a network where the thresholds are only initialised, for two metrics (Eqs. B11 and B12) that measure the ability of the network to remember (red) and to learn novel tasks (blue), respectively. B: Overall accuracy with and without threshold adaptation; adaptation leads to an increase in the maximum performance of approximately five percent. The high accuracy reached by the initialisation alone demonstrates its importance in alleviating catastrophic forgetting.

We used Eqs. 10, 11 and 12 of the Main Text to calculate the differences

Δα_Memory = α_Memory − α_Memory,θ0    (B11)

Δα_New = α_New − α_New,θ0    (B12)

between a model where the thresholds are initialised and then learned, and a model where the thresholds are only initialised (as marked by the subscript θ0). Up to high sparsity levels, threshold learning reduces the model's ability to remember previous tasks by changing, and thereby compromising, the parameters optimised on past datasets (seen as Δα_Memory < 0 in Fig. 9A, Supplementary). However, at initial sparsity above P_n = 94, the initial memory capacity of the network is limited by the small number of active nodes; threshold learning then helps to recruit inactive neurons and to increase the amount of available resources, leading to an increase of α_Memory at high sparsity levels. At the same time, threshold optimisation facilitates the learning of new datasets by adapting the parameters to the new incoming stimuli (Δα_New > 0, Fig. 9A, Supplementary, blue trend). Altogether, the trade-off between memory and adaptability to novel data results in the trend seen in Fig. 9B (Supplementary): at lower initial sparsity levels, threshold learning worsens overall performance (because it reduces α_Memory), while at higher initial sparsity levels, it improves overall performance (because high sparsity prevents threshold learning from worsening α_Memory while still allowing it to improve α_New). In summary, threshold learning modulates the starting condition and increases the maximum performance (α_overall) by about five percent.
5.5 Procedure for building sequences

Given an ensemble of elements E = {A, B, C, ...}, we formulated a systematic procedure to build sequences of N_t elements from E (in the case analysed, N_t = 3). Our goal is to test the storage capacity of the model, i.e. its ability to learn associations between sequences and desired outputs. To achieve this, we prevented correlations between similar sequences from helping classification, by placing similar sequences in different classes. As described in Fig. 10A (Supplementary), the procedure is based on the repetition of the following two steps:

1. Given a number of output classes N_class, we pick N_t · N_class random elements, with repetition, from E and compose the successions as in Fig. 10 (Supplementary), where the simple case of two output classes is considered.

2. We pick another N_t · N_class random elements, without repetition. We then replace the last temporal element of each of the previously generated sequences with N_class of the newly picked stimuli, associating the new sequences with different desired output values, as shown in the figure. Finally, we proceed in the same way for all the preceding temporal elements of the N_class successions from which we started (point 1).

Note that the similarities between sequences cannot be used to infer the right classification, since correlated elements are associated with different output classes. Each element is associated with a multidimensional signal that, with the addition of multiplicative white noise, is presented to the network (Fig. 10B, Supplementary).

Figure 10: Task diagrams. A: Schematic of the procedure used to define the sequences. Similarities among sequences are associated with different desired outputs, making the task a test of the storage capacity of the network. B: Schematic of the task on the biological data. The input example is a succession of three stimuli (A, S, N), each of duration Δt = 0.1 s. Multiplicative white noise σ_s N(0, 1) is added throughout the sequence, with a different realisation of N(0, 1) (a Gaussian with zero mean and unit variance) at each time step. Only the final activities of the nodes in the reservoir are used for the read-out, which is divided into three models (a: standard ESN, b: SpaRCe, c: additional hidden layer) of increasing complexity, whose performance is tested and reported in the main text.
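Figure 10A conveys the construction only schematically, so the sketch below is one possible reading of the two steps above (for two output classes and N_t = 3); exactly how many variants are generated per step, and how classes are reassigned, are assumptions made for illustration rather than the authors' exact procedure.

```python
import random

def build_sequences(elements, n_t=3, n_class=2, seed=0):
    """One possible reading of the two-step construction: start from n_class random base
    sequences (step 1), then, for every temporal position from last to first, create variants
    that differ only at that position and assign them to a *different* class (step 2), so that
    shared sub-sequences carry no information about the desired output."""
    rng = random.Random(seed)
    base = [[rng.choice(elements) for _ in range(n_t)] for _ in range(n_class)]
    dataset = [(tuple(seq), c) for c, seq in enumerate(base)]
    for pos in reversed(range(n_t)):
        fresh = rng.sample(elements, n_class)            # newly picked stimuli, no repetition
        for c, seq in enumerate(base):
            variant = list(seq)
            variant[pos] = fresh[(c + 1) % n_class]
            dataset.append((tuple(variant), (c + 1) % n_class))
    return dataset

# e.g. build_sequences(list("ABCDEFGHNPST")) returns (sequence, class) pairs
```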
5.6 Necessity of sparse representation for obtaining an optimal solution with positive output weights

The following section contains a formal proof that sparsity is necessary to find an optimal solution of a classification problem when the output weights are constrained to be positive. The cost function E in a classification task, over the entire dataset, can be written as

E = Σ_k Σ_{µ ∈ C_k} Σ_{j}^{N_class} ( ỹ^{µ_k}_j − y^{µ_k}_j )²    (D1)

where k indicates the correct class and the expression µ ∈ C_k reads as "all datapoints µ that belong to the cluster C_k, for which we desire output k". We assume that the target is 1 for the correct class and 0 otherwise. An ideal solution of this problem with an on-line learning algorithm can be defined as

∀ ε > 0, ∃ n | E_n ≤ ε    (D2)

where n is the minibatch (or batch) number, which is a measure of the training time. Thus, Eq. D2 means that the cost function can be made as small as desired by going through enough training instances. The fact that the sum of the quadratic terms of the cost function has to be smaller than or equal to the desired ε means that

[ ỹ^{µ_k}_j − y^{µ_k}_j ]² ≤ ε, ∀ µ, ∀ k  →  −√ε ≤ ỹ^{µ_k}_j − y^{µ_k}_j ≤ √ε, ∀ µ, ∀ k

Thus, for each datapoint µ_k and each output class j,

1 − √ε ≤ y^{µ_k}_j ≤ 1 + √ε  if j = k,   −√ε ≤ y^{µ_k}_j ≤ √ε  if j ≠ k

which, considering all the different classes separately, becomes

1 − √ε ≤ y^{µ_1}_1 ≤ 1 + √ε,   −√ε ≤ y^{µ_k}_1 ≤ √ε  if k ≠ 1
1 − √ε ≤ y^{µ_2}_2 ≤ 1 + √ε,   −√ε ≤ y^{µ_k}_2 ≤ √ε  if k ≠ 2
...
1 − √ε ≤ y^{µ_C}_C ≤ 1 + √ε,   −√ε ≤ y^{µ_k}_C ≤ √ε  if k ≠ C    (D3)

Trivially, the difficulty of the classification task lies in the requirement that the activity of the same output node has to be close to one for some datapoints and close to zero for others; this condition is highlighted explicitly by the above set of inequalities, which can be rewritten as

1 − √ε ≤ Σ_l W^out_1l relu( V^{µ_1}_l − θ_l ) ≤ 1 + √ε,   −√ε ≤ Σ_l W^out_1l relu( V^{µ_k}_l − θ_l ) ≤ √ε  if k ≠ 1
...

Since we required positive output weights, we find

−√ε ≤ Σ_l W^out_1l relu( V^{µ_k}_l − θ_l ) ≤ √ε    (D4)

→  W^out_1q ≈ O(√ε),  ∀ q and k ≠ 1 such that V^{µ_k}_q > θ_q    (D5)

1 − √ε ≤ Σ_l W^out_1l relu( V^{µ_1}_l − θ_l ) ≤ 1 + √ε    (D6)

→  W^out_1m ≫ O(√ε),  ∀ m such that V^{µ_1}_m > θ_m    (D7)

The conditions above can be satisfied only when the indexes q do not completely overlap with the indexes m, i.e. when the representations do not totally overlap and a vector of thresholds is introduced to separate the ensembles of nodes that are active for different classes. We note that this simple proof holds only in the case of positive output weights and positive reservoir activities.

5.7 Interpretation of the optimal fifty percent of active nodes for memory capacity

The aim of the task of Section 2.2 (Main text) is to measure the memory capacity of the model and the stability of the solution it finds. The results show a robust optimal sparsity level of 0.5, regardless of the specific noise level and number of output classes. This level of sparsity maximises the probability that different representations have at least one node not in common. Given N nodes and an arbitrary input s_i, the ensemble of active nodes for that signal can be imagined as a random sample of p × N nodes from the total ensemble of neurons. Thus, each representation can be seen as the extraction of p × N elements from an urn of N elements, where p is the imposed percentage of active nodes. In order to have representations that do not completely overlap, we need to maximise the number of possible outcomes of the extraction, which guarantees that at least one node differs among the various representations. The number of possible extractions of active nodes corresponds to the number of combinations without repetition:

Ñ = N! / ( (N − pN)! (pN)! )    (D8)

which has a maximum at p = 0.5. Thus, p = 0.5 is the sparsity level that maximises the probability that, given an arbitrary ensemble of input stimuli, the corresponding representations will have at least one non-overlapping neuron.
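A quick numerical check of Eq. D8, using Python's exact binomial coefficient: among a few candidate sparsity levels, the number of distinct ensembles of active nodes is largest at p = 0.5. Values and names are illustrative.

```python
from math import comb

N = 1000
# Number of distinct ensembles of p*N active nodes out of N (Eq. D8) for a few sparsity levels p.
counts = {p: comb(N, round(p * N)) for p in (0.1, 0.3, 0.5, 0.7, 0.9)}
assert max(counts, key=counts.get) == 0.5   # the count of possible representations peaks at p = 0.5
```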
References

[1] M. V. Tsodyks and M. V. Feigel'man, "The enhanced storage capacity in neural networks with low activity level," EPL (Europhysics Letters), vol. 6, no. 2, p. 101, 1988.
[2] M. Tsodyks, "Associative memory in asymmetric diluted network with low level of activity," EPL (Europhysics Letters), vol. 7, no. 3, p. 203, 1988.
[3] B. Derrida, E. Gardner, and A. Zippelius, "An exactly solvable asymmetric neural network model," EPL (Europhysics Letters), vol. 4, no. 2, p. 167, 1987.
[4] D. J. Amit, H. Gutfreund, and H. Sompolinsky, "Storing infinite numbers of patterns in a spin-glass model of neural networks," Physical Review Letters, vol. 55, no. 14, p. 1530, 1985.
[5] S. Romani, I. Pinkoviezky, A. Rubin, and M. Tsodyks, "Scaling laws of associative memory retrieval," Neural Computation, vol. 25, no. 10, pp. 2523–2544, 2013.
[6] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.
[7] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 2074–2082. [Online]. Available: http://papers.nips.cc/paper/6504-learning-structured-sparsity-in-deep-neural-networks.pdf
[8] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[9] P. M. Rasmussen, L. K. Hansen, K. H. Madsen, N. W. Churchill, and S. C. Strother, "Model sparsity and brain pattern interpretation of classification models in neuroimaging," Pattern Recognition, vol. 45, no. 6, pp. 2085–2100, 2012.
[10] E. T. Rolls and M. J. Tovee, "Sparseness of the neuronal representation of stimuli in the primate temporal visual cortex," Journal of Neurophysiology, vol. 73, no. 2, pp. 713–726, 1995.
[11] K. S. Honegger, R. A. Campbell, and G. C. Turner, "Cellular-resolution population imaging reveals robust sparse coding in the drosophila mushroom body," Journal of Neuroscience, vol. 31, no. 33, pp. 11772–11785, 2011.
[12] A. C. Lin, A. M. Bygrave, A. De Calignon, T. Lee, and G. Miesenböck, "Sparse, decorrelated odor coding in the mushroom body enhances learned odor discrimination," Nature Neuroscience, vol. 17, no. 4, p. 559, 2014.
[13] G. C. Turner, M. Bazhenov, and G. Laurent, "Olfactory representations by drosophila mushroom body neurons," Journal of Neurophysiology, vol. 99, no. 2, pp. 734–746, 2008.
[14] E. Gruntman and G. C. Turner, "Integration of the olfactory code across dendritic claws of single mushroom body neurons," Nature Neuroscience, vol. 16, no. 12, p. 1821, 2013.
[15] H. Li, Y. Li, Z. Lei, K. Wang, and A. Guo, "Transformation of odor selectivity from projection neurons to single mushroom body neurons mapped with dual-color calcium imaging," Proceedings of the National Academy of Sciences, vol. 110, no. 29, pp. 12084–12089, 2013.
[16] J. Perez-Orive, O. Mazor, G. C. Turner, S. Cassenaer, R. I. Wilson, and G. Laurent, "Oscillations and sparsening of odor representations in the mushroom body," Science, vol. 297, no. 5580, pp. 359–365, 2002.
[17] J. M. Jeanne and R. I. Wilson, "Convergence, divergence, and reconvergence in a feedforward network improves neural speed and accuracy," Neuron, vol. 88, no. 5, pp. 1014–1026, 2015.
[18] R. Azouz and C. M. Gray, "Dynamic spike threshold reveals a mechanism for synaptic coincidence detection in cortical neurons in vivo," Proceedings of the National Academy of Sciences, vol. 97, no. 14, pp. 8110–8115, 2000.
[19] M. S. Grubb and J. Burrone, "Activity-dependent relocation of the axon initial segment fine-tunes neuronal excitability," Nature, vol. 465, no. 7301, p. 1070, 2010.
[20] C. Du, F. Cai, M. A. Zidan, W. Ma, S. H. Lee, and W. D. Lu, "Reservoir computing using dynamic memristors for temporal information processing," Nature Communications, vol. 8, no. 1, pp. 1–10, 2017.
[21] M. S. Kulkarni and C. Teuscher, "Memristor-based reservoir computing," in 2012 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH). IEEE, 2012, pp. 226–232.
[22] X. Zhu, Q. Wang, and W. D. Lu, "Memristor networks for real-time neural activity analysis," Nature Communications, vol. 11, no. 1, pp. 1–9, 2020.
[23] K. Vandoorne, P. Mechet, T. Van Vaerenbergh, M. Fiers, G. Morthier, D. Verstraeten, B. Schrauwen, J. Dambre, and P. Bienstman, "Experimental demonstration of reservoir computing on a silicon photonics chip," Nature Communications, vol. 5, no. 1, pp. 1–6, 2014.
[24] Y. Paquot, F. Duport, A. Smerieri, J. Dambre, B. Schrauwen, M. Haelterman, and S. Massar, "Optoelectronic reservoir computing," Scientific Reports, vol. 2, p. 287, 2012.
[25] K. Nakajima, "Physical reservoir computing—an introductory perspective," Japanese Journal of Applied Physics, vol. 59, no. 6, p. 060501, 2020.
[26] H. Jaeger, M. Lukoševičius, D. Popovici, and U. Siewert, "Optimization and applications of echo state networks with leaky-integrator neurons," Neural Networks, vol. 20, no. 3, pp. 335–352, 2007.
[27] A. Rodan and P. Tino, "Minimum complexity echo state network," IEEE Transactions on Neural Networks, vol. 22, no. 1, pp. 131–144, 2010.
[28] H. Cui, X. Liu, and L. Li, "The architecture of dynamic reservoir in the echo state network," Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 22, no. 3, p. 033127, 2012.
[29] C. Gallicchio, A. Micheli, and L. Pedrelli, "Deep reservoir computing: A critical experimental analysis," Neurocomputing, vol. 268, pp. 87–99, 2017.
[30] L. Manneschi, M. O. A. Ellis, G. Gigante, A. C. Lin, P. Del Giudice, and E. Vasilaki, "Exploiting multiple timescales in hierarchical echo state networks," Frontiers, 2021.
[31] F. M. Bianchi, S. Scardapane, S. Løkse, and R. Jenssen, "Reservoir computing approaches for representation and classification of multivariate time series," IEEE Transactions on Neural Networks and Learning Systems, 2020.
[32] J. Huang, T. Zhang, and D. Metaxas, "Learning with structured sparsity," Journal of Machine Learning Research, vol. 12, no. Nov, pp. 3371–3412, 2011.
[33] E. J. Candes, M. B. Wakin, and S. P. Boyd, "Enhancing sparsity by reweighted l1 minimization," Journal of Fourier Analysis and Applications, vol. 14, no. 5-6, pp. 877–905, 2008.
[34] M. Lukoševičius, "A practical guide to applying echo state networks," in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 659–686.
[35] Q. Ma, L. Shen, W. Chen, J. Wang, J. Wei, and Z. Yu, "Functional echo state network for time series classification," Information Sciences, vol. 373, pp. 1–20, 2016.
[36] N. Schaetti, M. Salomon, and R. Couturier, "Echo state networks-based reservoir computing for MNIST handwritten digits recognition," in 2016 IEEE Intl Conference on Computational Science and Engineering (CSE) and IEEE Intl Conference on Embedded and Ubiquitous Computing (EUC) and 15th Intl Symposium on Distributed Computing and Applications for Business Engineering (DCABES). IEEE, 2016, pp. 484–491.
[37] J. Torrejon, M. Riou, F. A. Araujo, S. Tsunegi, G. Khalsa, D. Querlioz, P. Bortolotti, V. Cros, K. Yakushiji, A. Fukushima et al., "Neuromorphic computing with nanoscale spintronic oscillators," Nature, vol. 547, no. 7664, pp. 428–431, 2017.
[38] G. Tanaka, T. Yamane, J. B. Héroux, R. Nakane, N. Kanazawa, S. Takeda, H. Numata, D. Nakano, and A. Hirose, "Recent advances in physical reservoir computing: A review," Neural Networks, vol. 115, pp. 100–123, 2019.
[39] S. Chandar, C. Sankar, E. Vorontsov, S. E. Kahou, and Y. Bengio, "Towards non-saturating recurrent units for modelling long-term dependencies," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3280–3287.
[40] R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan, "Measuring catastrophic forgetting in neural networks," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[41] J. Serra, D. Suris, M. Miron, and A. Karatzoglou, "Overcoming catastrophic forgetting with hard attention to the task," arXiv preprint, 2018.
[42] E. A. Hallem and J. R. Carlson, "Coding of odors by a receptor repertoire," Cell, vol. 125, no. 1, pp. 143–160, 2006.
[43] S. R. Olsen, V. Bhandawat, and R. I. Wilson, "Divisive normalization in olfactory population codes," Neuron, vol. 66, no. 2, pp. 287–299, 2010.
[44] S. X. Luo, R. Axel, and L. Abbott, "Generating sparse and selective third-order responses in the olfactory system of the fly," Proceedings of the National Academy of Sciences, vol. 107, no. 23, pp. 10713–10718, 2010.
[45] M. Parnas, A. C. Lin, W. Huetteroth, and G. Miesenböck, "Odor discrimination in drosophila: from neural population codes to behavior," Neuron, vol. 79, no. 5, pp. 932–944, 2013.
[46] K. Krishnamurthy, A. M. Hermundstad, T. Mora, A. M. Walczak, and V. Balasubramanian, "Disorder and the neural representation of complex odors: smelling in the real world," arXiv preprint arXiv:1707.01962, 2017.
[47] S. J. Caron, V. Ruta, L. Abbott, and R. Axel, "Random convergence of olfactory inputs in the drosophila mushroom body," Nature, vol. 497, no. 7447, p. 113, 2013.
[48] S. Song, P. J. Sjöström, M. Reigl, S. Nelson, and D. B. Chklovskii, "Highly nonrandom features of synaptic connectivity in local cortical circuits," PLoS Biology, vol. 3, no. 3, p. e68, Mar. 2005.
[49] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[50] L. Deng, "The MNIST database of handwritten digit images for machine learning research [best of the web]," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
[51] G. Arora, A. Rahimi, and T. Baldwin, "Does an LSTM forget more than a CNN? An empirical study of catastrophic forgetting in NLP," in Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 2019, pp. 77–86.
[52] Z. Zheng, J. S. Lauritzen, E. Perlman, C. G. Robinson, M. Nichols, D. Milkie, O. Torrens, J. Price, C. B. Fisher, N. Sharifi, S. A. Calle-Schuler, L. Kmecova, I. J. Ali, B. Karsh, E. T. Trautman, J. A. Bogovic, P. Hanslovsky, G. S. X. E. Jefferis, M. Kazhdan, K. Khairy, S. Saalfeld, R. D. Fetter, and D. D. Bock, "A complete electron microscopy volume of the brain of adult Drosophila melanogaster," Cell, vol. 174, no. 3, pp. 730–743.e22, Jul. 2018.
[53] M. Hermans and B. Schrauwen, "Recurrent kernel machines: Computing with infinite echo state networks," Neural Computation, vol. 24, no. 1, pp. 104–133, 2012.
[54] D. Marković, N. Leroux, M. Riou, F. Abreu Araujo, J. Torrejon, D. Querlioz, A. Fukushima, S. Yuasa, J. Trastoy, P. Bortolotti et al., "Reservoir computing with the frequency, phase, and amplitude of spin-torque nano-oscillators," Applied Physics Letters, vol. 114, no. 1, p. 012409, 2019.
[55] M. Romera, P. Talatchian, S. Tsunegi, F. A. Araujo, V. Cros, P. Bortolotti, J. Trastoy, K. Yakushiji, A. Fukushima, H. Kubota et al., "Vowel recognition with four coupled spin-torque nano-oscillators," Nature, vol. 563, no. 7730, p. 230, 2018.
