Example Selection For Dictionary Learning

Authors: Tomoki Tsuchida, Garrison W. Cottrell
Department of Computer Science and Engineering
University of California, San Diego
9500 Gilman Drive, Mail Code 0404
La Jolla, CA 92093-0404, USA
{ttsuchida,gary}@ucsd.edu

Accepted as a workshop contribution at ICLR 2015

ABSTRACT

In unsupervised learning, an unbiased uniform sampling strategy is typically used, in order that the learned features faithfully encode the statistical structure of the training data. In this work, we explore whether active example selection strategies, algorithms that select which examples to use based on the current estimate of the features, can accelerate learning. Specifically, we investigate the effects of heuristic and saliency-inspired selection algorithms on the dictionary learning task with sparse activations. We show that some selection algorithms do improve the speed of learning, and we speculate on why they might work.

1 INTRODUCTION

The efficient coding hypothesis, proposed by Barlow (1961), posits that the goal of the perceptual system is to encode the sensory signal in such a way that it is efficiently represented. Based on this hypothesis, the past two decades have seen successful computational modeling of low-level perceptual features based on dictionary learning with sparse codes. The idea is to learn a set of dictionary elements that encode "naturalistic" signals efficiently; the learned dictionary might then model the features of early sensory processing. Starting with Olshausen and Field (1996), the dictionary learning task has thus been used extensively to explain early perceptual features. Because the objective of such a learning task is to capture the statistical structure of the observed signals faithfully and efficiently, it is an instance of unsupervised learning.
As such, dictionary learning is usually performed using unbiased sampling: the set of data used for learning is sampled uniformly from the training dataset. At the same time, the world contains an overabundance of sensory information, requiring organisms with limited processing resources to select and process only information relevant for survival (Tsotsos, 1990). This selection process can be expressed as perceptual action or attentional filtering mechanisms. This might at first appear at odds with the goal of the dictionary learning task, since the selection process necessarily biases the set of observed data for the organism. However, the converse is also true: as better (or different) features are learned over the course of learning, the mechanisms for selecting what is relevant may change, even if the selection objective stays the same. If a dictionary learning task is to serve as a realistic algorithmic model of the feature learning process in organisms capable of attentional filtering, this mutual dependency between dictionary learning and attentional sample selection bias must be taken into consideration.

In this work, we examine the effect of such sampling bias on the dictionary learning task. In particular, we explore interactions between learned dictionary elements and example selection algorithms. We investigate whether any selection algorithm can approach, or even improve upon, learning with an unbiased sampling strategy. Some of the heuristics we examine also have close relationships to models of attention, suggesting that they can be plausibly implemented by organisms evolving to effectively encode stimuli from their environment.
2 DICTIONARY LEARNING

Assume that a training set consisting of $N$ $P$-dimensional signals $X_N \triangleq \{x^{(i)}\}_{i=1}^N$ is generated from a $K$-element "ground-truth" dictionary $A^* = [a_1\ a_2 \cdots a_K]$ under the following model:

$$x^{(i)} = A^* s^{(i)} + \epsilon^{(i)}, \qquad \{s^{(i)}_j : s^{(i)}_j > 0\} \sim \mathrm{Exp}(\lambda)\ \text{iid}, \qquad \epsilon^{(i)} \sim \mathcal{N}(0, I\sigma_\epsilon^2)\ \text{iid}. \tag{1}$$

Each signal column vector $x^{(i)}$ is restricted to having exactly $k$ positive activations, $s^{(i)} \in \mathcal{C}_s \triangleq \{s \in \mathbb{R}^P_{\geq 0} : \|s\|_0 = k\}$, and each dictionary element is constrained to unit norm, $A^* \in \mathcal{C}_A \triangleq \{A : \|(A)_j\|_2 = 1\ \forall j\}$. The goal of dictionary learning is to recover $A^*$ from $X_N$, assuming $\lambda$ and $\sigma_\epsilon^2$ are known. To that end, we wish to calculate the maximum a posteriori (MAP) estimate of $A^*$:

$$\arg\min_{A \in \mathcal{C}_A} \frac{1}{N} \sum_{i=1}^{N} \min_{s^{(i)} \in \mathcal{C}_s} \left( \frac{1}{2\sigma_\epsilon^2} \|x^{(i)} - A s^{(i)}\|_2^2 + \lambda \|s^{(i)}\|_1 \right). \tag{2}$$

This is difficult to calculate, because $A$ and $\{s^{(i)}\}_{i=1}^N$ are optimized simultaneously. One practical scheme is to fix one variable and alternately optimize the other, leading to the subproblems

$$\hat{S} = \left[ \arg\min_{s^{(i)} \in \mathcal{C}_s} \left( \frac{1}{2\sigma_\epsilon^2} \|x^{(i)} - \hat{A} s^{(i)}\|_2^2 + \lambda \|s^{(i)}\|_1 \right) \right]_{i=1}^{N}, \tag{3}$$

$$\hat{A} = \arg\min_{A \in \mathcal{C}_A} \frac{1}{2N} \|X_N - A\hat{S}\|_F^2. \tag{4}$$

As in the Method of Optimal Directions (MOD; Engan et al., 1999), this alternating optimization scheme is guaranteed to converge to a locally optimal solution of the MAP estimation problem (2). The scheme is also attractive as an algorithmic model of low-level feature learning, since the two optimization processes can be related to the "analysis" and "synthesis" phases of an autoencoder network (Olshausen and Field, 1997). In this paper, we henceforth refer to problems (3) and (4) as the encoding and updating stages, and to their corresponding optimizers as $f_{\mathrm{enc}}$ and $f_{\mathrm{upd}}$.

2.1 ENCODING ALGORITHMS

The $L_0$-constrained encoding problem (3) is NP-hard (Elad, 2010), and various approximation methods have been studied extensively in the sparse coding literature. One approach is to ignore the $L_0$ constraint and solve the remaining nonnegative $L_1$-regularized least squares problem

$$\text{LARS}: \quad \hat{s}^{(i)} = \arg\min_{s \geq 0} \left( \frac{1}{2\sigma_\epsilon^2} \|x^{(i)} - \hat{A}s\|_2^2 + \lambda' \|s\|_1 \right), \tag{5}$$

with a larger sparsity penalty $\lambda' \triangleq \lambda P / k$ to compensate for the lack of the $L_0$ constraint. This works well in practice, since the distribution of $s^{(i)}_j$ (whose mean is $1/\lambda'$) is well approximated by $\mathrm{Exp}(\lambda')$. For our simulations, we use the Least Angle Regression (LARS) algorithm (Duchi et al., 2008) implemented by the SPAMS package (Mairal et al., 2010) to solve this.

Another approach is to greedily seek nonzero activations that minimize reconstruction errors. The matching pursuit family of algorithms operates on this idea, and it effectively approximates the encoding model

$$\text{OMP}: \quad \hat{s}^{(i)} = \arg\min_{s \geq 0} \frac{1}{2\sigma_\epsilon^2} \|x^{(i)} - \hat{A}s\|_2^2 \quad \text{s.t.} \quad \|s\|_0 \leq k. \tag{6}$$

This approximation ignores the $L_1$ penalty, but because nonzero activations are exponentially distributed and mostly small, it is also effective. We use the Orthogonal Matching Pursuit (OMP) algorithm (Mallat and Zhang, 1993), also implemented by the SPAMS package, for this problem.

An even simpler variant of the pursuit-type algorithms is thresholding (Elad, 2010), or the k-Sparse algorithm (Makhzani and Frey, 2013). This algorithm takes the $k$ largest values of $\hat{A}^\top x^{(i)}$ and sets every other component to zero:

$$\text{k-Sparse}: \quad \hat{s}^{(i)} = \mathrm{supp}_k\{\hat{A}^\top x^{(i)}\}. \tag{7}$$

This algorithm is plausibly implemented in the feedforward phase of an autoencoder with a hidden layer that competes horizontally and picks $k$ "winners".
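The k-Sparse rule of equation (7) is simple enough to sketch directly. The following is a minimal illustration assuming NumPy; the function name and the clipping of negative correlations to zero (our reading of the nonnegativity constraint) are our own additions, and the paper itself uses the SPAMS package for the LARS and OMP encoders.

```python
import numpy as np

def k_sparse_encode(A_hat, x, k):
    """k-Sparse thresholding encoder (eq. 7): keep the k largest
    entries of A^T x and zero out the rest. Negative correlations
    are clipped first, since activations are nonnegative."""
    s = np.maximum(A_hat.T @ x, 0.0)          # nonnegative correlations
    if np.count_nonzero(s) > k:
        keep = np.argpartition(s, -k)[-k:]    # indices of k largest values
        mask = np.zeros_like(s)
        mask[keep] = 1.0
        s *= mask
    return s
```

Because this encoder is a single matrix product followed by a top-k selection, it is cheap enough to run over the full training set before selection, which is exactly the property the next paragraph relies on.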
The simplicity of this algorithm is important for our purposes, because we allow the training examples to be selected after the encoding stage, and the encoding algorithm must therefore operate on a much larger number of examples than the updating algorithm. This view also motivated the nonnegativity constraint on $s^{(i)}$, because the activations of the hidden layers are likely to be conveyed by nonnegative firing rates.

2.2 DICTIONARY UPDATE ALGORITHM

For the updating stage, we only consider the stochastic gradient update, another simple algorithm for learning. For the reconstruction loss $L_{\mathrm{rec}}(A) \triangleq \frac{1}{2N}\|X_N - A\hat{S}\|_F^2$, the gradient is $\nabla L_{\mathrm{rec}} = (A\hat{S} - X_N)\hat{S}^\top / N$, yielding the update rule

$$\hat{A} \leftarrow \hat{A} - \eta_t (\hat{A}\hat{S} - X_N)\hat{S}^\top / N. \tag{8}$$

Here, $\eta_t$ is a learning rate that decays inversely with the update epoch $t$: $\eta_t \in \Theta(1/(t + c))$. After each update, $\hat{A}$ is projected back onto $\mathcal{C}_A$ by normalizing each column. Given a set of training examples, this encoding and updating procedure is repeated a small number of times (10 times in our simulations).

2.3 ACTIVITY EQUALIZATION

One practical issue with this task is that a small number of dictionary elements tend to be assigned a large share of the activations. This produces a "rich get richer" effect: regularly used elements are used more often, while unused elements are left near their initial states. To avoid this, an activity equalization procedure takes place after the encoding stage. The idea is to modulate all activities so that the mean activity for each element is closer to the across-element mean of the mean activities; this is done at the cost of increasing the reconstruction error. The equalization is modulated by $\gamma$, with $\gamma = 0$ corresponding to no equalization and $\gamma = 1$ to fully egalitarian equalization (i.e., all elements would have equal mean activities). We use $\gamma = 0.2$ for our simulations, which we found empirically to provide a good balance between equalization and reconstruction.

3 EXAMPLE SELECTION ALGORITHMS

To examine the effect of the example selection process on learning, we extend the alternating optimization scheme in equations (3, 4) to include an example selection stage. In this stage, a selection algorithm picks $n \ll N$ examples to use for the dictionary update (Figure 1). Ideally, the examples are chosen in such a way as to make the learned dictionary $\hat{A}$ closer to the ground-truth $A^*$ than uniform sampling would.

[Figure 1: The interaction among the encoding, selection, and updating algorithms. All signals $X_N$ are encoded into activations $S_N$; example selection picks the subset $X_n, S_n$; the updating stage produces the dictionary estimate $\hat{A}$.]

In the following, we describe a number of heuristic selection algorithms that were inspired by models of attention. We characterize example selection algorithms in two parts. First, there is a choice of goodness measure $g_j$, a function that maps $(s^{(i)}, x^{(i)})$ to a number reflecting the "goodness" of instance $i$ for dictionary element $j$. Applying $g_j$ to $\{s^{(i)}\}_{i=1}^N$ yields goodness values $G_N$ for all $K$ dictionary elements and all $N$ examples. Second, there is a choice of selector function $f_{\mathrm{sel}}$, which dictates the way a subset of $X_N$ is chosen using the $G_N$ values.

3.1 GOODNESS MEASURES

Of the various goodness measures, we first consider

$$\text{Err}: \quad g_j(s^{(i)}, x^{(i)}) = \|\hat{A}s^{(i)} - x^{(i)}\|_1. \tag{9}$$

Err is motivated by the idea of "critical examples" in Zhang (1994), and it favors examples with large reconstruction errors. In our paradigm, the criticality measured by Err may not correspond to ground-truth errors, since it is calculated using the current estimate $\hat{A}$ rather than the ground-truth $A^*$.
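The update rule (8) and the equalization step of Section 2.3 can be sketched as follows. This is a minimal sketch assuming NumPy; the function names and the small guard against division by zero are our additions.

```python
import numpy as np

def equalize(S, gamma=0.2):
    """Activity equalization (Sec. 2.3): scale each element's
    activations toward the across-element mean activity.
    gamma=0 leaves S unchanged; gamma=1 gives every element
    the same mean activity. S has shape (K, n)."""
    mean_act = S.mean(axis=1)                         # per-element mean activity
    target = mean_act.mean()                          # across-element mean
    scale = (target / np.maximum(mean_act, 1e-12)) ** gamma
    return S * scale[:, None]

def sgd_update(A_hat, X, S, eta):
    """One stochastic gradient step on the reconstruction loss
    (eq. 8), then projection back onto unit-norm columns."""
    n = X.shape[1]
    A_hat = A_hat - eta * (A_hat @ S - X) @ S.T / n
    return A_hat / np.linalg.norm(A_hat, axis=0, keepdims=True)
```

Note that with `gamma=1.0` every row of the equalized matrix has the same mean, matching the "fully egalitarian" limit described above.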
Another related idea is to select examples that would produce large gradients in the dictionary update equation (8), without regard to their directions. This results in

$$\text{Grad}: \quad g_j(s^{(i)}, x^{(i)}) = \|\hat{A}s^{(i)} - x^{(i)}\|_1 \cdot s^{(i)}_j. \tag{10}$$

We note that Grad extends Err by multiplying the reconstruction errors by the activations $s^{(i)}_j$. It therefore prefers examples that are both critical and produce large activations.

One observation is that the level of noise puts a fundamental limit on the recovery of the true dictionary: a better approximation bound is obtained when observation noise is low. It follows that, if we can somehow collect examples that happen to have low noise, learning from those examples might be beneficial. This motivated us to consider

$$\text{SNR}: \quad g_j(s^{(i)}, x^{(i)}) = \frac{\|x^{(i)}\|_2^2}{\|\hat{A}s^{(i)} - x^{(i)}\|_2^2} \cdot s^{(i)}_j. \tag{11}$$

This measure prefers examples with a large estimated signal-to-noise ratio (SNR).

Another idea focuses on the statistical properties of the activations $s^{(i)}$, inspired by a model of visual saliency proposed by Zhang et al. (2008). Their saliency model, called the SUN model, asserts that signals that result in rare feature activations are more salient. Specifically, the model defines the saliency of a particular visual location to be proportional to the self-information of the feature activation, $-\log P(F = f)$. Because we assume nonzero activations are exponentially distributed, this corresponds to

$$\text{SUN}: \quad g_j(s^{(i)}, x^{(i)}) = s^{(i)}_j \quad \left(\propto -\log P(s^{(i)}_j)\right). \tag{12}$$

We note that this model is not only simple, but also does not depend on $x^{(i)}$ directly. This makes SUN attractive as a neurally implementable goodness measure.

Another saliency-based goodness measure is inspired by the visual saliency map model of Itti et al. (2002):

$$\text{SalMap}: \quad g_j(s^{(i)}, x^{(i)}) = \mathrm{SaliencyMap}(x^{(i)}). \tag{13}$$

In contrast to the SUN measure, SalMap depends only on $x^{(i)}$. Consequently, SalMap is impervious to changes in $\hat{A}$. Since the signals in our simulations are small monochrome patches, the "saliency map" we use has only a single-scale intensity channel and an orientation channel with four directions.

3.2 SELECTOR FUNCTIONS

We consider two selector functions. The first chooses the top $n$ examples with high goodness values summed across dictionary elements:

$$\text{BySum}: \quad f_{\mathrm{sel}}(G_N) = \text{top } n \text{ elements of } \sum_{j=1}^{K} G^{(i)}_j. \tag{14}$$

The second selector function selects examples that are separately "good" for each dictionary element:

$$\text{ByElement}: \quad f_{\mathrm{sel}}(G_N) = \{\text{top } n/K \text{ elements of } G^{(i)}_j \mid j \in 1 \ldots K\}. \tag{15}$$

This is done by first sorting $G^{(i)}_j$ for each $j$ and then picking top examples in a round-robin fashion, until $n$ examples are selected. Barring duplicates, this yields a set consisting of the top $n/K$ elements of $G^{(i)}_j$ for each element $j$. Algorithm 1 describes how these operations take place within each learning epoch.

In our simulations, we consider all possible combinations of the goodness measures and selector functions for the example selection algorithm, except for Err and SalMap. Since these two goodness measures do not produce different values for different dictionary element activations $s^{(i)}_j$, the BySum and ByElement functions select equivalent example sets.

4 SIMULATIONS

In order to evaluate example selection algorithms, we present simulations across a variety of dictionaries and encoding algorithms. Specifically, we compare results using all three encoding models (the $L_1$-based LARS, the $L_0$-based OMP, and k-Sparse) with all eight selection algorithms.
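The goodness measures (9)-(12) and the two selector functions can be sketched as follows. This is a minimal sketch assuming NumPy; SalMap is omitted because it requires an external saliency-map model, and the round-robin loop is our reading of the ByElement description.

```python
import numpy as np

def goodness(A_hat, X, S, measure):
    """Per-element goodness values G (K x N) for a batch (Sec. 3.1).
    X: (P, N) signals, S: (K, N) activations."""
    err1 = np.abs(A_hat @ S - X).sum(axis=0)       # L1 reconstruction error per example
    if measure == "Err":                           # eq. (9): same value for every element
        return np.tile(err1, (S.shape[0], 1))
    if measure == "Grad":                          # eq. (10): error times activation
        return err1[None, :] * S
    if measure == "SNR":                           # eq. (11): estimated SNR times activation
        err2 = ((A_hat @ S - X) ** 2).sum(axis=0)
        return ((X ** 2).sum(axis=0) / np.maximum(err2, 1e-12))[None, :] * S
    if measure == "SUN":                           # eq. (12): activation itself
        return S
    raise ValueError(measure)

def select_by_sum(G, n):
    """BySum (eq. 14): top-n examples by goodness summed over elements."""
    return np.argsort(G.sum(axis=0))[-n:]

def select_by_element(G, n):
    """ByElement (eq. 15): round-robin over elements, taking each
    element's best not-yet-chosen example until n are selected."""
    K, N = G.shape
    order = np.argsort(G, axis=1)[:, ::-1]         # per-element ranking, best first
    chosen, seen, rank = [], set(), 0
    while len(chosen) < n and rank < N:
        for j in range(K):
            i = order[j, rank]
            if i not in seen:
                seen.add(i)
                chosen.append(i)
                if len(chosen) == n:
                    break
        rank += 1
    return np.array(chosen)
```

The `seen` set handles the duplicates mentioned above: an example that is best for several elements is only counted once, so further examples are drawn from deeper ranks.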
Because we generate the training examples from a known ground-truth dictionary $A^*$, we quantify the integrity of the learned dictionary $\hat{A}_t$ at each learning epoch $t$ using the minimal mean square distance

$$D^*(\hat{A}_t, A^*) \triangleq \min_{P_\pi} \frac{1}{K} \|\hat{A}_t P_\pi - A^*\|_F^2, \tag{16}$$

with $P_\pi$ spanning all possible permutation matrices.

[Figure 2: Ground-truth dictionaries and generated examples $X_N$. (a) Gabor dictionary; (b) alphanumeric dictionary. Each element / generated example is an 8x8 patch, displayed as a tiled image for ease of visualization. White is positive and black is negative.]

We also investigate the effect of $A^*$ on learning. One way to characterize a dictionary set $A$ is its mutual coherence $\mu(A) \triangleq \max_{i \neq j} |a_i^\top a_j|$ (Elad, 2010). This measure is useful in theoretical analyses of recovery bounds (Donoho et al., 2006). A more practical characterization is the average coherence $\bar{\mu}(A) \triangleq \frac{2}{K(K-1)} \sum_{i \neq j} |a_i^\top a_j|$. Regardless of the measure used, exact recovery of the dictionary is more challenging when the coherence is high.

The first dictionary set comprises 100 8x8 Gabor patches (Figure 2a). This dictionary set is inspired by the fact that dictionary learning on natural images leads to such a dictionary (Olshausen and Field, 1996), and its elements correspond to simple receptive fields in mammalian visual cortices (Jones and Palmer, 1987). With $\mu(A^*) = 0.97$ but $\bar{\mu}(A^*) = 0.13$, this dictionary set is relatively incoherent, and so the learning problem should be easier. The second dictionary set is composed of 64 8x8 alphanumeric letters with alternating rotations and signs (Figure 2b). This artificial dictionary set has $\mu(A^*) = 0.95$ with $\bar{\mu}(A^*) = 0.34$.¹

Within each epoch, 50,000 examples are generated with 5 nonzero activations per example ($k = 5$), whose magnitudes are sampled from $\mathrm{Exp}(1)$. $\sigma_\epsilon^2$ is set so that the examples have an SNR of approximately 6 dB.
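The distance (16) can be computed exactly for small dictionaries by exhausting the permutations. Below is a sketch assuming NumPy; for dictionary sizes like those used here, one would instead match columns with the Hungarian algorithm (e.g. SciPy's `linear_sum_assignment` on the pairwise column-distance matrix).

```python
import numpy as np
from itertools import permutations

def dictionary_distance(A_hat, A_star):
    """Minimal mean-square distance D* (eq. 16): minimum over
    column permutations of ||A_hat P - A*||_F^2 / K.
    Exhaustive search, feasible only for small K."""
    K = A_star.shape[1]
    best = np.inf
    for perm in permutations(range(K)):
        d = ((A_hat[:, perm] - A_star) ** 2).sum() / K
        best = min(best, d)
    return best
```

As a sanity check, a dictionary whose columns are merely a shuffled copy of the ground truth should have distance zero under this measure.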
Each selection algorithm then picks 1% ($n = 500$) of the training set for learning. For each experiment, $\hat{A}$ is initialized with random examples from the training set.

¹ Both dictionaries violate the recovery bound described in Donoho et al. (2006). Amiri and Haykin (2014) note that this bound is prone to be violated in practice; as such, we explicitly chose "realistic" parameters that violate the bounds in our simulations.

Algorithm 1: Learning with example selection

Initialize random $\hat{A}_0 \in \mathcal{C}_A$ from training examples.
For $t = 1$ to max. epochs:
1. Obtain training set $X_N = \{x^{(i)}\}_{i=1}^N$.
2. Encode $X_N$: $S_N = \{f_{\mathrm{enc}}(x^{(i)}; \hat{A})\}_{i=1}^N$.
3. Select $n$ "good" examples:
   - Calculate $G_N = \{[g_j(s^{(i)}, x^{(i)})]_{j=1 \ldots K}\}_{i=1}^N$.
   - Select $n$ indices: $\Gamma = f_{\mathrm{sel}}(G_N)$.
   - $S_n = \{s^{(i)}\}_{i \in \Gamma}$, $X_n = \{x^{(i)}\}_{i \in \Gamma}$.
4. Loop 10 times:
   (a) Encode $X_n$: $S_n \leftarrow \{f_{\mathrm{enc}}(x^{(i)}; \hat{A})\}_{i=1}^n$.
   (b) Equalize $S_n$: $\forall s^{(i)} \in S_n,\ s^{(i)}_j \leftarrow s^{(i)}_j \cdot \left(\frac{1}{K}\sum_{j'=1}^{K} \sum_{i=1}^{n} s^{(i)}_{j'} \,/\, \sum_{i=1}^{n} s^{(i)}_j\right)^{\gamma}$.
   (c) Update $\hat{A}$: $\hat{A} \leftarrow \hat{A} - \eta_t (\hat{A}S_n - X_n) S_n^\top / n$.
   (d) Normalize the columns of $\hat{A}$.

4.1 RESULTS

Figure 3 shows the average distance of $\hat{A}$ from $A^*$ at each learning epoch. We observe that ByElement selection policies generally work well, especially in conjunction with the Grad and SUN goodness measures. This trend is especially noticeable for the alphanumeric dictionary, where most of the BySum selectors perform worse than the baseline selector that chooses examples at random (Uniform). The ranking of the selector algorithms is roughly consistent across the learning epochs (Figure 3, left column), and it is also robust to the choice of encoding algorithm (Figure 3, right column).
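Putting the pieces together, one epoch of Algorithm 1 might look like the following self-contained sketch. It assumes NumPy and, for brevity, uses the k-Sparse encoder with a SUN-like BySum selection, a fixed learning rate, and no equalization step; the paper's full procedure also sweeps other encoders and selectors.

```python
import numpy as np

def ksparse(A, X, k):
    """k-Sparse encoding of a batch (eq. 7): keep each column's
    k largest nonnegative correlations A^T x."""
    S = np.maximum(A.T @ X, 0.0)
    thresh = -np.partition(-S, k - 1, axis=0)[k - 1]   # k-th largest per column
    return np.where(S >= thresh, S, 0.0)

def epoch(A, X, k, n, eta, inner=10):
    """One learning epoch of Algorithm 1 (simplified): encode all
    N examples, keep the n with the largest total activation
    (SUN goodness with the BySum selector), then run `inner`
    encode/update steps on the selected subset."""
    S = ksparse(A, X, k)
    idx = np.argsort(S.sum(axis=0))[-n:]               # selection stage
    Xn = X[:, idx]
    for _ in range(inner):
        Sn = ksparse(A, Xn, k)                         # re-encode selected examples
        A = A - eta * (A @ Sn - Xn) @ Sn.T / n         # gradient step (eq. 8)
        A /= np.linalg.norm(A, axis=0, keepdims=True)  # project back onto unit norm
    return A
```

The key structural point this sketch illustrates is that the encoder runs over all $N$ examples once per epoch, while the costlier inner loop touches only the $n \ll N$ selected examples.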
In particular, good selector algorithms are beneficial even at relatively early stages of learning (< 100 epochs, for instance), in contrast to the simulation in Amiri and Haykin (2014). This is surprising, because at early stages of learning, poor $\hat{A}$ estimates result in bad activation estimates as well. Nevertheless, good selector algorithms soon establish a positive feedback loop between the dictionary and activation estimates.

One interesting exception is the SalMap selector. It works relatively well for the Gabor dictionary (closely tracking the SUNBySum selector), but not for the alphanumeric dictionary. This is presumably due to the design of the SalMap model: because the model uses oriented Gabor filters as one of its feature maps, the overall effect is similar to the SUNBySum algorithm when the signals are generated from Gabor dictionaries.

[Figure 3: Distance from the true dictionaries, $D^*(\hat{A}, A^*)$, over 1000 epochs. (a) Gabor dictionary; (b) alphanumeric dictionary. Graphs in the left column show the time course of learning using the LARS encoding, with legends ordered from worst to best at the end of the simulation (1000 epochs). Graphs in the right column compare the performance of the different encoding models (LARS, OMP, k-Sparse); the ordinate is the distance at the end, on the same scale as the left graphs.]

4.2 ROBUSTNESS

In order to assess the robustness of the example selection algorithms, we repeated the Gabor dictionary simulation across a range of parameter values.
Specifically, we experimented with modifying the following parameters one at a time, starting from the original parameter values:

- The signal-to-noise ratio ($10 \log_{10}(2\lambda^2/\sigma_\epsilon^2)$ [dB])
- The number of nonzero elements in the generated examples ($k$)
- The ratio of selected examples to the original training set ($n/N$)
- The number of dictionary elements ($K$)

Figure 4 shows the results of these simulations. They show that good selector algorithms improve learning across a wide range of parameter values. Of note is the number of dictionary elements $K$, whose results suggest that the improvement is greatest for the "complete" dictionary learning cases; the advantage of selection appears to diminish for extremely over-complete (or under-complete) dictionary learning tasks.

[Figure 4: Distances from the true dictionaries for different model parameters (SNR, $k$, $n/N$, and $K$), using the LARS encoding.]

5 DISCUSSION

In this work, we examined the effect of selection algorithms on dictionary learning based on stochastic gradient descent. Simulations using training examples generated from known dictionaries revealed that some selection algorithms do indeed improve learning, in the sense that the learned dictionaries are closer to the known dictionaries throughout the learning epochs. Of special note is the success of the SUN selectors; since these selectors are very simple, they hold promise for more general learning applications.
Few studies have so far investigated example selection strategies for the dictionary learning task, although some learning algorithms contain such procedures implicitly. For instance, K-SVD (Aharon et al., 2006) relies upon identifying a group of examples that use a particular dictionary element during its update stage. The algorithm of Arora et al. (2013) also makes use of a sophisticated example grouping procedure to provably recover dictionaries. In both cases, though, the focus is on breaking the inter-dependency between $\hat{A}$ and $\hat{S}$, rather than on characterizing how some algorithms, notably those of perceptual systems, might improve learning despite this inter-dependency.

One recent paper that does consider example selection in its own right is Amiri and Haykin (2014), whose cognit algorithm is explicitly related to perceptual attention. What differentiates our work is the generative assumption: cognit relies on having additional information available to the learner, in their case the temporal contiguity of the generative process. With a spatially and temporally independent generation process, the generative model we consider here is simpler but more difficult to solve.

Why do selection algorithms improve learning at all?
At first glance, one may assume that any non-uniform sampling would skew the apparent distribution $\mathcal{D}(X_n)$ away from the true distribution of the training set, $\mathcal{D}(X_N)$, and thus lead to learning of an incorrect dictionary. However, as we have empirically shown, this is not the case. One intuitive reason, one that also underlies the design of the SNR selectors, is that "good" selection algorithms pick samples with high information content. For instance, samples with close-to-zero activation content provide little information about the dictionary elements that compose them, even though such samples abound under our generative model with exponentially distributed activations. It follows that such samples provide little benefit to the inference of the statistical structure of the training set, and the learner would be well advised to discard them. To validate this, we calculated the (true) SNR of $X_n$ at the last epoch of learning for each selection algorithm (Figure 5, left columns). This shows that all selection algorithms picked $X_n$ with much higher SNR than Uniform.

[Figure 5: Characterization of $X_n$ for (a) the Gabor dictionary and (b) the alphanumeric dictionary. Left columns: SNR of $X_n$ (higher is better). Right columns: $D(\mathcal{D}(X_n) \| \mathcal{D}(X_N))$ (lower is better).]
However, the correlation between the overall performance ranking and SNR is weak, suggesting that this is not the only factor driving good example selection.

Another factor that contributes to good learning is the spread of examples within $X_n$. Casual observation revealed that the BySum selectors are prone to picking similar examples, whereas ByElement selects a larger variety of examples and thus retains the distribution of $X_N$ more faithfully. To quantify this, we measured the distance of the distribution of selected examples, $\mathcal{D}(X_n)$, from that of all training examples, $\mathcal{D}(X_N)$, using the histogram intersection distance (Rubner et al., 2000). The right columns of Figure 5 show that this distance, $D(\mathcal{D}(X_n) \| \mathcal{D}(X_N))$, tends to be lower for ByElement selectors than for BySum selectors. Like the SNR measure, however, this quantity by itself is only weakly predictive of the overall performance, suggesting that it is important to pick a large variety of high-SNR examples for the dictionary learning task.

There are several directions in which we plan to extend this work. One is the theoretical analysis of the selection algorithms. For instance, we did not explore under what conditions learning with example selection leads to the same solutions as unbiased learning, although empirically we observed that to be the case. As in the curriculum learning paradigm (Bengio et al., 2009), it is also possible that different selection algorithms are better suited to different stages of learning. Another direction is to apply the active example selection processes to hierarchical architectures such as stacked autoencoders and Restricted Boltzmann Machines. In these cases, an interesting question arises as to how information from each layer should be combined to make the selection decision. We intend to explore some of these questions in the future using learning tasks similar to this work.
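A histogram intersection distance of the kind used here can be sketched as follows. This is a minimal sketch assuming NumPy; the binning of flattened sample values and the `1 - intersection` form are our assumptions about the exact construction, which the paper does not spell out.

```python
import numpy as np

def histogram_intersection_distance(x_samples, y_samples, bins=32, value_range=(-1.0, 1.0)):
    """Distance between two empirical distributions via histogram
    intersection: 1 - sum(min(p_i, q_i)) over normalized bins.
    Identical distributions give 0; disjoint ones give 1."""
    p, _ = np.histogram(x_samples, bins=bins, range=value_range)
    q, _ = np.histogram(y_samples, bins=bins, range=value_range)
    p = p / p.sum()
    q = q / q.sum()
    return 1.0 - np.minimum(p, q).sum()
```

Applied to the selected subset versus the full training set, a low value indicates that selection has preserved the overall distribution of the data, as the ByElement selectors tend to do.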
REFERENCES

Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311-4322, 2006.

Ashkan Amiri and Simon Haykin. Improved sparse coding under the influence of perceptual attention. Neural Computation, 26(2):377-420, February 2014.

Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. arXiv.org, August 2013.

Horace B. Barlow. Possible principles underlying the transformations of sensory messages. Sensory Communication, pages 217-234, 1961.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41-48. ACM, 2009.

David L. Donoho, Michael Elad, and Vladimir Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6-18, 2006.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the International Conference on Machine Learning (ICML), 2008.

Michael Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.

Kjersti Engan, Sven Ole Aase, and John Hakon Husoy. Method of optimal directions for frame design. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 2443-2446, 1999.

Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259, 2002.

Judson P. Jones and Larry A. Palmer. An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6):1233-1258, 1987.

Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19-60, 2010.

Alireza Makhzani and Brendan Frey. k-Sparse autoencoders. arXiv.org, December 2013.

Stéphane G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.

Bruno A. Olshausen and David J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607-609, 1996.

Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311-3325, 1997.

Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99-121, 2000.

John K. Tsotsos. Analyzing vision at the complexity level. Behavioral and Brain Sciences, 13(3):423-469, 1990.

Byoung-Tak Zhang. Accelerated learning by active example selection. International Journal of Neural Systems, 5(1):67-76, 1994.

Lingyun Zhang, Matthew H. Tong, Tim K. Marks, Honghao Shan, and Garrison W. Cottrell. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8(7), 2008.
