
Pylearn2: a machine learning research library

Ian J. Goodfellow 1, David Warde-Farley 1, Pascal Lamblin 1, Vincent Dumoulin 1, Mehdi Mirza 1, Razvan Pascanu 1, James Bergstra 2, Frédéric Bastien 1, and Yoshua Bengio 1

1 Département d'Informatique et de Recherche Opérationnelle, Université de Montréal
{goodfeli, wardefar, lamblinp, dumouliv, mirzamom, pascanur}@iro.umontreal.ca, nouiz@nouiz.org, yoshua.bengio@umontreal.ca
2 Center for Theoretical Neuroscience, University of Waterloo. james.bergstra@uwaterloo.ca

Abstract

Pylearn2 is a machine learning research library. This does not just mean that it is a collection of machine learning algorithms that share a common API; it means that it has been designed for flexibility and extensibility in order to facilitate research projects that involve new or unusual use cases. In this paper we give a brief history of the library, an overview of its basic philosophy, a summary of the library's architecture, and a description of how the Pylearn2 community functions socially.

1 Introduction

Pylearn2 is a machine learning research library developed by LISA at Université de Montréal. The goal of the library is to facilitate machine learning research. This means that the library has a focus on flexibility and extensibility, in order to make sure that nearly any research idea is feasible to implement in the library. The target user base is machine learning researchers. Being "user friendly" for a research user means that it should be easy to understand exactly what the code is doing and to configure it very precisely for any desired experiment. Sometimes this may come at the cost of requiring the user to be an expert practitioner, who must understand how the algorithm works in order to accomplish basic data analysis tasks.
This is different from other notable machine learning libraries, such as scikit-learn [39] or the learning algorithms provided as part of OpenCV [7], the STAIR Vision Library [23], etc. Such machine learning libraries aim to provide good performance to users who do not necessarily understand how the underlying algorithm works. Pylearn2 has a different user base, and thus different design goals.

In this paper, we give a general sense of the library's design and how the community functions. We begin with a brief history of the library. Finally, we give an overview of the library's philosophy, the architecture of the library itself, and the development workflow that the Pylearn2 community uses to improve the library.

GitHub repository:       https://github.com/lisa-lab/pylearn2
Documentation:           http://deeplearning.net/software/pylearn2
User mailing list:       pylearn-users@googlegroups.com
Developer mailing list:  pylearn-dev@googlegroups.com

Table 1: Other Pylearn2 resources

2 History

Pylearn2 is LISA's third major effort to design a flexible machine learning research library, the former two being PLearn and Pylearn. It is built on top of the lab's mathematical expression compiler, Theano [5, 3]. In late 2010, a series of committees of LISA lab members met to plan how to fulfill LISA's software development needs. These committees determined that no existing publicly available machine learning library had design goals that would satisfy the requirements imposed by the kind of research done at LISA. The committees decided to create Pylearn2, and drafted some basic design ideas and the guiding philosophy of the library. The first implementation work on Pylearn2 began as a class project in early 2011. The library was used for research work mostly within LISA over the next two years. In this time the structure of the library changed several times but eventually became stable.
The addition of Continuous Integration from Travis-CI [2] with the development workflow from GitHub [1] helped to greatly improve the stability of the library. GitHub provides a useful interface for reviewing code before it is merged, and Travis-CI tells reviewers whether the code passes the tests.

In late 2011 Pylearn2 was used to win a transfer learning contest [22]. After this, a handful of researchers outside LISA became interested in using it to reproduce the results from this challenge. However, the majority of Pylearn2 users were still LISA members. Pylearn2 first gained a significant user base outside LISA in the first half of 2013. This was in part due to the attention the library received after it was used to set the state of the art on several computer vision benchmark tasks [21], and in part due to many Kagglers starting to use the library after the baseline solutions to some Kaggle contests were provided in Pylearn2 format [20]. Today, over 250 GitHub users watch the repository, nearly 200 subscribe to the mailing list, and over 100 have made their own fork to work on new features. Over 30 GitHub users have contributed to the library.

3 License and citation information

Pylearn2 is released under the 3-clause BSD license, so it may be used for commercial purposes. The license does not require anyone to cite Pylearn2, but if you use Pylearn2 in published research work we encourage you to cite this article.

4 Philosophy

Development of Pylearn2 is guided by several principles:

• Pylearn2 is a machine learning research library: its users are researchers. This means the Pylearn2 framework should not impose many restrictions on what is possible to do with the library, and it is acceptable to assume that the user has some technical sophistication and knowledge of machine learning.
• Pylearn2 is built from re-usable parts that can be used in many combinations or independently. In particular, no user should be forced to learn all parts of the library. If a user wants only to use a Pylearn2 Model, it should be possible to do so without learning about Pylearn2 TrainingAlgorithms, Costs, etc.

• Pylearn2 avoids over-planning. Each feature is designed with an eye toward allowing more features to be developed in the future. We do enough planning to ensure that our designs are modular and easy to extend. We generally do not do much more planning than that. This avoids paralysis and over-engineering.

• Pylearn2 provides a domain-specific language that provides a compact way of specifying all hyperparameters for an experiment. Pylearn2 accomplishes this using YAML with a few extensions. A brief YAML file can instantiate a complex experiment without exposing any implementation-specific detail. This makes it easier for researchers who do not use Pylearn2 to read the specification of a Pylearn2 experiment and reproduce it using their own software.

5 Library overview

Pylearn2 consists of several components that can be combined to form complete learning algorithms. Most components do not actually execute any numerical code; they just provide symbolic expressions. This is possible because Pylearn2 is built on top of Theano [5, 3]. Theano provides a language for describing expressions independent of how they are actually implemented, so a single Pylearn2 class provides both CPU and GPU functionality. Another advantage of using symbolic representations as the main arguments to Pylearn2 methods is that it is possible to compute many functions of a symbolic expression that cannot be computed from a numerical value alone.
For example, it is possible to compute the derivative of a Theano expression, while it is not possible to compute the derivative of the process that generated a numerical value given only the value itself. This means that many interfaces are simpler: fewer expressions need to be passed between objects, because the recipient can create its own modifications of the expression it is passed, rather than needing an interface for each modified value it requires.

5.1 Core components

The main way that Pylearn2 achieves flexibility and extensibility is decomposition into reusable parts. The three key components used to implement most features are the Dataset, Model, and TrainingAlgorithm classes. A Dataset provides the data to be trained on. A Model stores parameters and can generate Theano expressions that perform some useful task given input data (e.g., estimate the probability density at a point in space, infer a class label given input features, etc.). A TrainingAlgorithm adapts a Model to a particular Dataset. Generally each of these objects is in turn modular (Datasets have modular preprocessing, many Model classes are organized into Layers, TrainingAlgorithms can minimize a modular Cost and can have their behavior modified by various modular callbacks and a modular TerminationCriterion, etc.). This modularity means that if a researcher has an innovative idea to test out, and that idea only affects one component, the researcher can simply replace or subclass the component in question. The vast majority of the learning system can still be used with the new idea.

This modularity is in contrast to most other machine learning libraries, where the Model generally does most of the work. A scikit-learn model is generally accompanied by a fit method that is a complete training algorithm and that can't be applied to any other model.
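The three-way decomposition just described can be sketched in a few lines of plain Python. This is an illustrative schematic only (the class and method names below are invented for the example, not the actual Pylearn2 API): the trainer depends only on a model's parameter and gradient, so either side can be replaced independently.

```python
# Schematic of the Dataset / Model / TrainingAlgorithm split.
# Names are illustrative only; this is not the real Pylearn2 API.

class ToyDataset:
    """Provides (x, y) examples; here y = 2 * x."""
    def __init__(self):
        self.examples = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]

    def iterator(self):
        return iter(self.examples)

class LinearModel:
    """Stores a single parameter w and defines a squared-error gradient."""
    def __init__(self, w=0.0):
        self.w = w

    def grad(self, x, y):
        # d/dw of 0.5 * (w*x - y)^2
        return (self.w * x - y) * x

class GradientDescent:
    """Adapts any model exposing .w and .grad() to any dataset
    exposing .iterator(); neither class knows the other's internals."""
    def __init__(self, learning_rate=0.05, epochs=200):
        self.learning_rate = learning_rate
        self.epochs = epochs

    def train(self, model, dataset):
        for _ in range(self.epochs):
            for x, y in dataset.iterator():
                model.w -= self.learning_rate * model.grad(x, y)

model = LinearModel()
GradientDescent().train(model, ToyDataset())
print(round(model.w, 3))  # → 2.0
```

In Pylearn2 itself, subclassing one of the three components plays the role that swapping one of these toy classes plays here: the other two are reused unchanged.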
Some libraries are more modular but don't divide the labor between models and training algorithms as sharply as Pylearn2 does. For example, in Torch [9] or DistBelief [14] the Models are modular, but to train a layer, the layer needs to implement at the very least a backpropagation method for computing derivatives. In Pylearn2, the Model is only responsible for creating symbolic expressions, which the TrainingAlgorithm may or may not symbolically differentiate at a later time. (Individual Theano ops must still implement a grad method, but a comparatively small number of basic ops can be used to implement the comparatively large number of more complex models that appear in most machine learning libraries.)

However, another aspect of Pylearn2's design philosophy is that no user should be forced to learn the entire framework. It's possible to just implement a train_batch method and have the DefaultTrainingAlgorithm do nothing but serve the Model batches of data from the Dataset, or to ignore TrainingAlgorithms altogether and just pass a Dataset to the Model's train_all method.

To facilitate code reuse, whenever possible, individual components that are shipped with the library aim to be as modular and orthogonal as possible, relying on other existing components; e.g., most Models that become part of the library will defer their learning functionality to an existing TrainingAlgorithm and/or Cost unless sufficiently specialized as to be infeasible.

5.2 The Dataset class

Datasets are essentially interfaces between sources of data in arbitrary format and the in-memory array formats that Pylearn2 expects. All Pylearn2 Datasets provide the same interface but can be implemented to use any back-end format. Currently, all Pylearn2 Datasets just read data from disk, but in principle a Dataset could access live streaming data from the network or a peripheral device like a webcam.
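The back-end independence described above can be sketched as follows. The class names are hypothetical, not the real Dataset API; the shape of the interface is the point: consumers see only an iterator, regardless of whether the data comes from memory or from a stream.

```python
# Schematic of the Dataset abstraction: one iteration interface,
# interchangeable back-ends. Illustrative names, not the Pylearn2 API.

class InMemoryDataset:
    """Back-end: a list already loaded into memory."""
    def __init__(self, rows):
        self.rows = rows

    def iterator(self, batch_size):
        for i in range(0, len(self.rows), batch_size):
            yield self.rows[i:i + batch_size]

class StreamingDataset:
    """Back-end: any (possibly unbounded) iterable source,
    standing in for a network feed or webcam."""
    def __init__(self, source):
        self.source = source

    def iterator(self, batch_size):
        batch = []
        for row in self.source:
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []

def first_batch(dataset, batch_size=2):
    # Consumer code relies only on .iterator(), never on storage details.
    return next(dataset.iterator(batch_size))

print(first_batch(InMemoryDataset([1, 2, 3, 4])))      # → [1, 2]
print(first_batch(StreamingDataset(iter(range(10)))))  # → [0, 1]
```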
Datasets allow the data to be presented in many formats, regardless of how it is stored. For example, a minibatch of 64 different 32 × 32 pixel RGB images could be presented as a 64 × 3072 element design matrix, or it could be presented as a 64 × 32 × 32 × 3 tensor. The choice of which attribute to put on which axis can change to support different pieces of software too (for example, Theano convolution prefers batch size × channels × rows × columns, while cuda-convnet [30] (which is wrapped in Pylearn2) prefers channels × rows × columns × batch size). The data can also be presented in many different orders, allowing iteration in sequential or different types of random order. Datasets can implement as many or as few iteration schemes as the implementor wants to; depending on the back end of the data, not all iteration schemes are efficient or even possible (for example, if the iterator needs to read from a network drive, random iteration may be very slow, and if the iterator reads live video from a webcam, there is no way for it to visit the future).

Most Datasets used in the deep learning community can be represented as a design matrix stored in dense matrix format. For these datasets, implementing an appropriate Dataset object is very easy. The implementer only needs to subclass the DenseDesignMatrix class and implement a constructor that loads the desired data. If the data is already stored in NumPy [37] or pickle format, it is not even necessary to implement any new Python code to use the dataset. Some datasets can be described as dense design matrices but are too big to fit into memory. Pylearn2 supports this use case via the DenseDesignMatrixPyTables class. To use this class, the data must be stored in HDF5 format on disk. Most Datasets also support some kind of preprocessing that can modify the data after it has been loaded.
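The design-matrix and topological presentations discussed above are two views of the same numbers. For a row-major ordering of (row, column, channel) within each example (an ordering assumed here purely for illustration), the correspondence is a small index calculation:

```python
# Map a (row, col, channel) position in a rows x cols x channels image
# to its column in the flattened design-matrix view.  The row-major
# axis ordering here is an illustrative assumption, not a statement
# about the layout Pylearn2 actually uses.

def topo_to_design_column(row, col, channel, cols, channels):
    return (row * cols + col) * channels + channel

# A 32x32 RGB image occupies 32 * 32 * 3 = 3072 design-matrix columns.
cols, channels = 32, 3
print(topo_to_design_column(0, 0, 0, cols, channels))    # → 0
print(topo_to_design_column(31, 31, 2, cols, channels))  # → 3071
```

Supporting a different axis ordering (such as the channels-first layout cuda-convnet prefers) only changes the multiplication order in this formula, which is why a Dataset can serve either view of the same stored data.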
5.2.1 Implemented Datasets

Pylearn2 currently contains wrappers for several datasets. These include the datasets used for DARPA's unsupervised and transfer learning challenge [25], the dataset used for the NIPS 2011 workshops challenge in representation learning [32], the CIFAR-10 and CIFAR-100 datasets [29], the MNIST dataset [34], some of the MNIST variations datasets [31], the NORB dataset [35], the Street View House Numbers dataset [36], the Toronto Faces Database [50], some of the UCI repository datasets [17, 16], and a dataset of 3D animations of fish [18]. Additionally, there are many kinds of preprocessing, such as PCA [38], ZCA [4], various kinds of local contrast normalization [46], as well as helper functions to set up the entire preprocessing pipeline from some well-known successful and documented systems [8, 45].

5.3 The Model class

A Model is any object that stores parameters (for the purposes of Pylearn2, a "non-parametric" model is just one with a variable number of parameters). The basic Model class has very few interface elements. Subclasses of the Model class define richer interfaces. These interfaces define different quantities that the Model can compute. For example, the MLP class provides an fprop method that provides a symbolic expression for forward propagation in a multilayer perceptron. If the final layer of the MLP is a softmax layer representing distributions over classes y, and the fprop method is passed a Theano variable representing inputs x, the output will be a Theano variable representing p(y | x). The Model class is not required to know how to train itself, though many models do.

5.3.1 Linear operators, spaces, and convolution

Linear operations are key parts of many machine learning models.
The distinction between many important classes of machine learning models is often nothing more than what specific structure of linear transformation they use. For example, both MLPs and convolutional networks apply linear operators followed by a nonlinearity to transform inputs into outputs. In the MLP, the linear operation is multiplication by a dense matrix. In a convolutional network, the linear operation is discrete convolution with finite support. This operation can be viewed as multiplication by a sparse matrix with several elements of the matrix constrained to be equal to each other. The point is that both use a linear transformation. Pylearn2's LinearTransform class provides a generic representation of linear operators. Pylearn2 functionality written using this class can thus be written once and then extended to do dense matrix multiplication, convolution, tiled convolution, local connections, etc. simply by providing different implementations of the linear operator. This idea grew out of James Bergstra's Theano-linear module, which has since been incorporated into Pylearn2.

Different linear operator implementations require their inputs to be formatted in different ways. For example, convolution applied to an image requires a format that indicates the 2D position of each element of the input, while dense matrix multiplication just requires a linearized vector representation of the image. In Pylearn2, classes called Spaces represent these different views of the same underlying data. Dense matrix multiplication acts on data that lives in a VectorSpace, while 2D convolution acts on data that lives in a Conv2DSpace. Spaces generally know how to convert between each other, when possible. For example, an image in a Conv2DSpace can be flattened into a vector in a VectorSpace.
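The LinearTransform idea above (write an algorithm once against an abstract linear operator, then swap in dense, convolutional, or other implementations) can be sketched in plain Python. The classes and the lmul method name here are illustrative assumptions, not the real LinearTransform API:

```python
# Schematic of a generic linear-operator interface: code written
# against lmul() works with any concrete operator implementation.
# Illustrative names only; not the Pylearn2 LinearTransform API.

class DenseLinear:
    """Operator implemented as multiplication by a dense matrix."""
    def __init__(self, matrix):
        self.matrix = matrix  # list of rows

    def lmul(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.matrix]

class Scale:
    """A different operator (elementwise scaling) behind the same interface,
    standing in for convolution, tiled convolution, local connections, etc."""
    def __init__(self, factor):
        self.factor = factor

    def lmul(self, x):
        return [self.factor * xi for xi in x]

def affine_layer(op, x, bias):
    # Written once against lmul(); reusable with every operator above.
    return [h + b for h, b in zip(op.lmul(x), bias)]

print(affine_layer(DenseLinear([[1, 0], [0, 2]]), [3, 4], [1, 1]))  # → [4, 9]
print(affine_layer(Scale(2), [3, 4], [1, 1]))                       # → [7, 9]
```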
Several linear operators (and related convolutional network operations like spatial max pooling) in Pylearn2 are implemented as wrappers that add Theano semantics on top of the extremely fast cuda-convnet library [30], making Pylearn2 a very practical library to use for convolutional network research.

5.3.2 Implemented models

Because of the philosophy that Pylearn2 developers should write features when they are needed, and because most Pylearn2 developers so far have been deep learning researchers, Pylearn2 mostly contains deep learning models or models that are used as building blocks for deep architectures. This includes autoencoders [6], RBMs [47] including RBMs with Gaussian visible units [54], DBMs [45], MLPs [44], convolutional networks [33], and local coordinate coding [56]. However, Pylearn2 is not restricted to deep learning functionality. We encourage submissions of other machine learning models. Often, LISA researchers work on problems whose scale exceeds that of the typical machine learning library user, so we occasionally implement features for simpler algorithms, such as SVMs [10] with reduced memory consumption or k-means [49] with fast multicore training.

Pylearn2 has implementations of several models that were developed at LISA, including denoising auto-encoders (DAEs) [53], contractive auto-encoders (CAEs) [42] including higher-order CAEs [43], spike-and-slab RBMs (ssRBMs) [11] including ssRBMs with pooling [12], reconstruction sampling autoencoders [13], and deep sparse rectifier nets [19]. Pylearn2 also contains models that were not just developed at LISA but originally developed using Pylearn2, such as spike-and-slab sparse coding with parallel variational inference [22] and maxout units for neural nets [21].
5.4 The TrainingAlgorithm class

The role of the TrainingAlgorithm class is to adjust the parameters stored in a Model in order to adapt the model to a given Dataset. The TrainingAlgorithm is also responsible for a few less important tasks, such as setting up the Monitor to record various values throughout training (to make learning curves, essentially). The TrainingAlgorithm is one of the very few Pylearn2 classes that actually performs numerical computation. It gathers Theano expressions assembled by the Model and other classes, synthesizes them into expressions for learning rules, compiles the learning rules into Theano functions that accomplish the learning, and executes the Theano expressions. In fact, Pylearn2 TrainingAlgorithms are not even required to use Theano at all. Some use a mixture of Theano and generic Python code; for example, to perform line searches with the control logic done with basic Python loops and branching but the numerical computation done by Theano.

Most TrainingAlgorithms support constrained optimization by asking the Model to project the result of each learning update back into an allowed region. Many Pylearn2 models impose non-negativity constraints on parameters that represent, for example, the conditional variance of some random variable, and most Pylearn2 neural network layers support max norm constraints on the weights [48].

5.4.1 The Cost class

Many training algorithms can be expressed as procedures for iteratively minimizing a cost function. This provides another opportunity for sharing code between algorithms. The Cost class represents a cost function independent of the algorithm used to minimize it, and each TrainingAlgorithm is free to use this representation or not, depending on what is most appropriate. A Cost is essentially just a class for generating a Theano expression describing the cost function, but it has a few extra pieces of functionality.
For example, it can add monitoring channels that are relevant to the cost being minimized (for example, one popular cost is the negative log likelihood of class labels under a softmax model; this cost automatically adds a monitoring channel that tracks the misclassification rate of the classifier).

One extremely important aspect of the Cost is that it has a get_gradients method. Unlike Theano's grad method, this method is not guaranteed to return accurate gradients. This allows many algorithms that use approximate gradients to be implemented using the same machinery as algorithms that use the exact gradient. For example, the persistent contrastive divergence [55, 51] algorithm minimizes an intractable cost with intractable gradients: the log likelihood of a Boltzmann machine. The Pylearn2 Cost for persistent contrastive divergence returns None for the value of the cost function itself, to express that the cost function can't be computed. However, the standard SGD class is still able to perform stochastic gradient descent on the cost, because the get_gradients method for the cost returns a sampling-based approximation to the gradient. No special optimization class is needed to handle this seemingly exotic case.

Other costs implemented in the library include dropout [27], contrastive divergence [26], noise contrastive estimation [24], score matching [28], denoising score matching [52], softmax log likelihood for classification with MLPs, and Gaussian log likelihood for regression with MLPs. Many additional simpler costs can be added together with the SumOfCosts class to combine these primary costs with secondary costs to add regularization, such as weight decay or sparsity regularization.

5.4.2 Implemented TrainingAlgorithms

Currently, Pylearn2 contains three main TrainingAlgorithm classes.
The DefaultTrainingAlgorithm does nothing but serve minibatches of data to the Model's default minibatch learning rule. The SGD class does stochastic gradient descent on a Cost. This class supports extensions including Polyak averaging [41], momentum [44], and early stopping. The BGD class does batch gradient descent (in practice, large minibatches), also known as the method of steepest descent [15]. The BGD class is able to accumulate contributions to the gradient from several minibatches before making an update, thereby enabling it to use batches that are too large to fit in memory. Optional flags enable the BGD class to implement other similar algorithms such as nonlinear conjugate gradient descent [40].

6 Development workflow and user community

Pylearn2 has many kinds of users and developers. One need not be a Pylearn2 developer to do research with Pylearn2. Pylearn2 is a valuable research tool even for people who do not need to develop any new algorithms. The wide array of reference implementations available in Pylearn2 makes it useful for studying how existing algorithms behave under various conditions, or for obtaining baseline results on new tasks.

Researchers who wish to implement new algorithms with Pylearn2 do not necessarily need to become Pylearn2 developers either. It's common to develop experimental features privately in an offline repository. It's also perfectly fine to share Pylearn2 classes as part of a 3rd party repository rather than having them merged to the main Pylearn2 repository.

For those who do wish to contribute to Pylearn2, thank you! The process is designed to make sure the library is as stable as possible. Developers should first write to pylearn-dev@googlegroups.com to plan how to implement their feature. If the feature requires a change to existing APIs, it's important to follow the best practices guide 1.
Once a plan is in place, developers should write the feature in their own fork of Pylearn2 on GitHub, then submit a pull request to the main repository. Our automated test suite will run on the pull request and indicate whether it is safe to merge. Pylearn2 developers will also review the pull request. When both the automatic tests and the reviewers are satisfied, one of us will merge the pull request. Be sure to write to pylearn-dev to find a reviewer for your pull request.

All kinds of pull requests are welcome: new features (provided that they have tests), config files for important results, bug fixes, and tests for existing features.

1 http://deeplearning.net/software/pylearn2/api_change.html

7 Conclusion

This article has described the Pylearn2 library, including its history, design philosophy and goals, basic architecture, and developer workflow. We hope you find Pylearn2 useful in your research and welcome your potential contributions to it.

References

[1] (2013). GitHub. http://github.com.
[2] (2013). Travis CI. http://travis-ci.org.
[3] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
[4] Bell, A. and Sejnowski, T. J. (1997). The independent components of natural scenes are edge filters. Vision Research, 37, 3327–3338.
[5] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
[6] Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59, 291–294.
[7] Bradski, G. (2000).
The OpenCV Library. Dr. Dobb's Journal of Software Tools.
[8] Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011).
[9] Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011). Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop.
[10] Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.
[11] Courville, A., Bergstra, J., and Bengio, Y. (2011a). A spike and slab restricted Boltzmann machine. In G. Gordon, D. Dunson, and M. Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR W&CP. Recipient of People's Choice Award.
[12] Courville, A., Bergstra, J., and Bengio, Y. (2011b). Unsupervised models of images by spike-and-slab RBMs. In Proceedings of the Twenty-eighth International Conference on Machine Learning (ICML '11).
[13] Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with reconstruction sampling. In Proceedings of the Twenty-eighth International Conference on Machine Learning (ICML '11).
[14] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. (2012). Large scale distributed deep networks. In NIPS'2012.
[15] Debye, P. (1954). The collected papers of Peter J. W. Debye. Ox Bow Press.
[16] Diaconis, P. and Efron, B. (1983). Computer-intensive methods in statistics. 248(5), 116–126, 128, 130.
[17] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
[18] Franzius, M., Wilbert, N., and Wiskott, L. (2008).
Invariant object recognition with slow feature analysis. In Proceedings of the 18th International Conference on Artificial Neural Networks, Part I, ICANN '08, pages 961–970, Berlin, Heidelberg. Springer-Verlag.
[19] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011).
[20] Goodfellow, I., Erhan, D., Carrier, P.-L., Courville, A., Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler, D., Lee, D.-H., Zhou, Y., Ramaiah, C., Feng, F., Li, R., Wang, X., Athanasakis, D., Shawe-Taylor, J., Milakov, M., Park, J., Ionescu, R., Popescu, M., Grozea, C., Bergstra, J., Xie, J., Romaszko, L., Xu, B., Chuang, Z., and Bengio, Y. (2013a). Challenges in representation learning: A report on three machine learning contests. In International Conference On Neural Information Processing.
[21] Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013b). Maxout networks. In S. Dasgupta and D. McAllester, editors, ICML '13, pages 1319–1327.
[22] Goodfellow, I., Courville, A., and Bengio, Y. (2013c). Scaling up spike-and-slab models for unsupervised feature learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1902–1914.
[23] Gould, S., Russakovsky, O., Goodfellow, I., Baumstarck, P., Ng, A. Y., and Koller, D. (2010). The STAIR Vision Library.
[24] Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS'10).
[25] Guyon, I., Dror, G., Lemaire, V., Taylor, G., and Aha, D. W. (2011). Unsupervised and transfer learning challenge. In Proc. Int. Joint Conf. on Neural Networks.
[26] Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence. Technical Report GCNU TR 2000-004, Gatsby Unit, University College London.
[27] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
[28] Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6, 695–709.
[29] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
[30] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS'2012).
[31] Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In ICML '07, pages 473–480. ACM.
[32] Le, Q. V., Ranzato, M., Salakhutdinov, R., Ng, A., and Tenenbaum, J. (2011). NIPS Workshop on Challenges in Learning Hierarchical Models: Transfer Learning and Optimization. https://sites.google.com/site/nips2011workshop.
[33] LeCun, Y. and Bengio, Y. (1995). Convolutional networks for images, speech, and time-series. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 255–257. MIT Press.
[34] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
[35] LeCun, Y., Huang, F.-J., and Bottou, L. (2004). Learning methods for generic object recognition with invariance to pose and lighting.
In Pr oceed ings o f the Computer V ision and P attern Recognition Confer e nce (CVPR’04) , volume 2, pages 97–104 , Los Alamitos, CA, USA. IEEE Computer Society . [36] Netzer , Y ., W ang, T ., Coates, A., Bissacco, A., W u, B., an d Ng , A. Y . (2 011). Readin g d igits in n atural image s with unsuper vised feature lear ning. Deep L earning and Unsuperv ised Feature Learning W orksh op, NIPS. [37] Oliphant, T . E. (20 07). Python f or scientific com puting. Compu ting in Science an d Engineer- ing , 9 , 10–20. 8 [38] Pearson, K. ( 1901) . On lines and planes of closest fi t to systems of points in space. Philo soph- ical Magazine , 2 (6), 559–572 . [39] Pedregosa, F ., V ar oquaux , G., Gr amfort, A., Michel, V ., Thir ion, B., Grisel, O ., Blond el, M., Prettenhof er , P ., W eiss, R., Dubo urg, V ., V anderplas, J., Passos, A., Cou rnapeau , D., Brucher, M., Perro t, M., and Duchesnay , E. (2011 ). Scik it-learn: Machine learn ing in Python . J ournal of Machine Learning Resear ch , 12 , 2825– 2830 . [40] Polak, E. and Ribiere, G. (1969 ). Note sur la co n vergence de m ´ ethod es de directions con- jugu ´ ees. Revue F ranc ¸ aise d’Informatiqu e et de Recher che Op ´ erationnelle , 16 , 35– 43. [41] Polyak, B. and Juditsky , A. (1992) . Acceleration of stochastic approximation by averaging. SIAM J. Contr ol and Optimization , 30(4) , 838–85 5. [42] Rifai, S., V incent, P ., Muller , X. , Glor ot, X. , and Bengio, Y . (20 11a). Contractive auto - encoder s: Explicit inv ariance during f eature extraction. In Pr oceedings of theT wenty-eight In- ternational Confer ence on Machine Learning (ICML ’11) . [43] Rifai, S., Mesnil, G., V ince nt, P ., Muller , X., Bengio, Y ., Daup hin, Y ., and Glorot, X. (2 011b ). Higher order contractiv e auto- encoder . I n Eur o pean Confer en ce on Machine Learning and Prin- ciples and Practice of Knowledge Discovery in Databases (ECML PKDD) . [44] Rumelhart, D. E., Hinton, G. E., and W illiams, R. J. (198 6). 
Le arning internal representation s by error propaga tion. In D. E. Rume lhart and J. L. McClelland , edito rs, P arallel Distributed Pr oce ssing , volume 1, chapter 8, pages 318 –362. MIT Press, Cambrid ge. [45] Salakhutdinov , R . and Hinton , G. (2009) . Deep Boltzmann machin es. In AIST ATS’2009 , pages 448–4 55. [46] Sermanet, P ., Chintala, S., and LeCun, Y . (2012) . Con volutional neural networks applied to house numbers d igit c lassification. In I nternationa l Confer ence o n P attern Recognitio n (ICPR 2012) . [47] Smolensky , P . (1 986). Info rmation proc essing in dynam ical s ystems: Found ations of harmony theory . In D. E. Rum elhart and J. L. McClelland , editor s, P a rallel Distrib u ted Pr o cessing , vol- ume 1, chapter 6, pages 194–28 1. MIT Press, Cambridg e. [48] Srebro, N. and Shraibma n, A. (2005 ). Rank, trace-norm and max-norm. In Pr oceedin gs of the 18th Annual Confer ence on Learning Theory , pages 545–5 60. Springer-V erlag. [49] Steinhaus, H. (1957). Sur la di vision des corps mat ´ er iels en parties. In Bull. Acad. P olon. Sci. , pages 801– 804. [50] Susskind, J., An derson, A., and Hinton, G. E. (2010) . The Toronto face dataset. T echnical Report UTML TR 2010-0 01, U. T oronto. [51] T ie leman, T . (20 08). Training restricted Boltzm ann machines using appro ximations to the likelihood g radient. In W . W . Cohen, A. McCallum, and S. T . Roweis, editor s, ICML 2008 , p ages 1064– 1071 . A CM. [52] V incent, P . (2011). A connection between score matching and denoising autoencoders. Neural Computation , 23 (7 ), 166 1–167 4. [53] V incent, P . , Laroch elle, H., Bengio, Y ., and Manzagol, P .- A. (2008) . Extracting and composin g robust feature s with denoising autoencoders. In ICML’08 , pages 1096–110 3. ACM. [54] W elling, M., Rosen-Zvi, M ., and Hin ton, G. E. (2005 ). Expon ential family har monium s with an application to informatio n retrieval. In NIPS’04 , v olume 17, Cambridge, MA. MIT Press. [55] Y ou nes, L. 
( 1999) . On the conver gence o f Markovian stochastic algorithm s with rapid ly de- creasing ergodicity rates. Stochastics and Stochastic Reports , 65 (3 ), 177–228 . [56] Y u, K., Zhang , T ., and Gong, Y . (2009 ). Nonlinear learnin g using local coordinate coding . In Y . Ben gio, D. Sch uurman s, J. Lafferty , C. K. I. W illiams, and A. Culo tta, editors, Advances in Neural Information Pr ocessing Systems 22 , pages 2223– 2231 . 9
