An Open Source Pattern Recognition Toolbox for MATLAB

Kenneth D. Morton, Jr.
CoVar Applied Technologies
kenny@covartech.com

Peter Torrione
CoVar Applied Technologies
pete@covartech.com

Leslie Collins
Duke University
lcollins@ee.duke.edu

Sam Keene
The Cooper Union
keene@cooper.edu

June 24, 2014

Abstract

Pattern recognition and machine learning are becoming integral parts of algorithms in a wide range of applications. Different algorithms and approaches for machine learning involve different tradeoffs between performance and computation, so during algorithm development it is often necessary to explore a variety of approaches to a given task. A toolbox with a unified framework across multiple pattern recognition techniques enables algorithm developers to rapidly evaluate different choices prior to deployment. MATLAB is a widely used environment for algorithm development and prototyping, and although several MATLAB toolboxes for pattern recognition are currently available, these are either incomplete, expensive, or restrictively licensed. In this work we describe a MATLAB toolbox for pattern recognition and machine learning known as the PRT (Pattern Recognition Toolbox), licensed under the permissive MIT license. The PRT includes many popular techniques for data preprocessing, supervised learning, clustering, regression and feature selection, as well as a methodology for combining these components using a simple, uniform syntax. The resulting algorithms can be evaluated using cross-validation and a variety of scoring metrics to ensure robust performance when the algorithm is deployed. This paper presents an overview of the PRT as well as an example of usage on Fisher's Iris dataset.
1 Introduction

In this work we describe the PRT, an object-oriented framework for pattern recognition within MATLAB, developed to enable rapid data and algorithm exploration and to enable practitioners to build complex machine learning techniques. The PRT is freely available and released under the MIT license.

The PRT defines standard interfaces for machine learning datasets (prtDataSets) and pattern recognition and machine learning tasks (prtActions). Since prtActions always provide a prtDataSet as an output, individual machine learning actions can be easily combined to form a machine learning algorithm (a prtAlgorithm). Since prtActions and prtAlgorithms provide a unified calling syntax for cross-validation and algorithm evaluation, algorithmic changes can be rapidly evaluated with few code modifications. Furthermore, the PRT provides extensive support for visualization of results, and provides an easy path for implementing new features.

The PRT also provides a number of other benefits which space restrictions prohibit us from discussing in detail, but which are well documented. These include k-folds and cross-validation techniques, data visualization methods, a wide array of standard classification and regression techniques (e.g., SVM [2], RVM [5], Random Forest [1], PLSDA [3]), as well as various pre-processing, feature selection, and decision rule techniques, and a suite of tools for modeling data distributions.

The remainder of this paper is organized as follows. Section 2 discusses some of the more important and novel features of the PRT. Section 3 shows a simple example illustrating the usage of the PRT. Section 4 offers some final points.

2 Key Features

In the PRT, features X and (optionally) targets Y are contained in and managed by the prtDataSet class, with subclasses that specify the nature of the targets or labels. For example, datasets that use integer labels, as is typical for classification, are known as prtDataSetClass objects, and those that have real-valued targets (as in regression) are known as prtDataSetRegress objects.

In contrast, algorithms, their parameters, and the techniques by which they are trained and evaluated are defined and managed by the class prtAction. Example prtActions include classifiers, data preprocessors, feature selection techniques and regression techniques. prtActions provide an interface including two methods: train and run. train takes a prtDataSet and outputs a prtAction of the same type with inferred parameters. run, on the other hand, maps a prtDataSet to another prtDataSet. Since all run methods output prtDataSets, the output of any run method can be used as the input to another prtAction's train method. Therefore, prtActions can be combined in complicated structures to enable rapid development of multi-stage algorithms.

In the parlance of the PRT, a machine learning algorithm is composed of several individual actions that operate sequentially or in parallel. For example, it is common to first perform data preprocessing followed by classification. In the PRT a prtAlgorithm is a subclass of prtAction which stores and manages a collection of prtActions and the connections between them. Because a prtAlgorithm is also a prtAction, methods such as train, run and kfolds operate on prtAlgorithms as they operate on prtActions.

A prtAlgorithm is constructed by using the + and / operators to combine prtAction objects. The + operator is used for sequential algorithm flow while the / operator is used for parallel algorithm flow. Using these operators, complex algorithm flows can be constructed to perform tasks such as classifier fusion. The following section illustrates the construction and use of prtAlgorithms to combine pre-processing and classification actions.
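
As a rough illustration of the train/run interface and the + and / operators described in this section, the following sketch may help. It assumes the PRT is installed and on the MATLAB path, and it uses only class and function names that appear in this paper. The particular fusion layout (two classifiers in parallel followed by a third that combines their outputs) is an assumption chosen for illustration, not a recommendation from the authors.

```matlab
% Sketch only: assumes the PRT is installed and on the MATLAB path.
ds = prtDataGenIris;   % prtDataSetClass containing Fisher's Iris data

% Sequential flow with +: normalize each feature, then classify.
algo = prtPreProcZmuv + prtClassMap;

% train returns an action of the same type with inferred parameters;
% run maps one dataset to another dataset.
trainedAlgo = train(algo, ds);
dsOut = run(trainedAlgo, ds);

% Parallel flow with /: MATLAB evaluates / before +, so this reads as
% zmuv + (map / rvm) + map. The final prtClassMap acting as the fusion
% stage is an illustrative assumption. The algorithm is only constructed
% here; in Section 3 prtClassRvm is applied to a binary dataset.
fusionAlgo = prtPreProcZmuv + prtClassMap/prtClassRvm + prtClassMap;
```

Because the combined objects are themselves prtActions, the same train, run and kfolds calls apply to them unchanged, which is what keeps comparisons between algorithm variants to a few lines of code.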

3 Example use of the PRT

This section illustrates example use of the PRT using Fisher's Iris dataset [4]. We will demonstrate how the PRT can be used to build, evaluate, and visualize machine learning algorithms and data. The PRT includes a function, prtDataGenIris, that generates a prtDataSetClass containing Fisher's Iris data. Fisher's Iris dataset contains 4 features, the length and width of the sepal and petal, measured on 3 different species of iris. This example will focus on binary classification (or detection) of one of these species, setosa, from the other two species. To create a binary classification dataset, a new prtDataSet, ds, is created that has the same observations but has binary targets that indicate whether an observation is of the setosa class.

For visualization, we elect to project the iris data onto its 2-dimensional principal components space. Prior to applying principal components analysis it is customary to remove the mean and perform standard deviation normalization of each feature dimension. This can be done by using the prtAction prtPreProcZmuv (zero mean, unit variance). Following this, principal components analysis can be applied to the data using prtPreProcPca.

Given this dataset, any number of classifiers can be constructed and trained to distinguish between the two classes. We will examine two classifiers: probabilistic maximum a posteriori classification (prtClassMap) and the relevance vector machine (prtClassRvm). Because each classifier is a prtAction, the syntax is the same for both classifiers. The following code illustrates how the normalization, PCA and classification operations are easily combined.

    algoMap = prtPreProcZmuv + prtPreProcPca('nComponents',2) + prtClassMap;
    algoRvm = prtPreProcZmuv + prtPreProcPca('nComponents',2) + prtClassRvm;

Both algorithms can then be evaluated with 5-fold random cross-validation using the kfolds method.

    crossValidatedOutputMap = kfolds(algoMap,ds,5);
    crossValidatedOutputRvm = kfolds(algoRvm,ds,5);

From the outputs of the above commands, receiver operating characteristic (ROC) curves can be plotted to compare the expected performance of the two classifiers as a function of potential thresholds.

    [pfMap, pdMap] = prtScoreRoc(crossValidatedOutputMap);
    [pfRvm, pdRvm] = prtScoreRoc(crossValidatedOutputRvm);

Figure 1 shows the results of projecting the data into 2 dimensions, the decision contours that result from both classifiers, and their corresponding ROC curves. For brevity, the plotting commands are omitted.

Although this example has focused on a fairly simple and well-studied classification problem, several of the most useful aspects of the PRT have been highlighted. The ability to quickly construct prtAlgorithms that share a unified syntax, combined with the ability to visualize datasets and classifier decision contours and to calculate cross-validated performance metrics, enables rapid algorithm design. During the design process a variety of techniques and algorithm flows can be explored and evaluated in the unified framework provided by the PRT.

Figure 1: A) Visualization of principal components projection. B) Decision contours from maximum a posteriori classification. C) Decision contours from using a relevance vector machine. D) Receiver operating characteristic curves for the detection of Iris setosa.

4 Conclusion

This document has discussed the PRT, an open source and permissively licensed pattern recognition toolbox for MATLAB. The PRT provides a framework for and implementation of many standard machine learning techniques and, most importantly, it provides a methodology for combining individual techniques to form pattern classification algorithms that can be rapidly modified and evaluated.
Using the PRT, a wide variety of algorithmic possibilities can be explored in a short amount of time, and robust operation can be ensured by taking advantage of built-in algorithm cross-validation.

The PRT includes full documentation of all functions, a quick start guide, and a unit test suite for all functionality. In addition, the developers maintain an active blog and discussion forum for support of the tool at the following URL: http://newfolder.github.io. It is compatible with MATLAB version 2008a and later, and therefore runs on Windows, Linux/Unix and Mac platforms. As MATLAB is typically available campus-wide at most academic institutions, and is widely used in industry, we feel this tool will be extremely useful to both researchers and practitioners worldwide.

The PRT provides a straightforward framework for algorithm and classifier evaluation, which enables researchers to implement their own algorithms and immediately compare them to standard approaches. The open-source and permissively licensed nature of the PRT enables other researchers and practitioners to expand the toolbox's capabilities. In addition, template objects are provided that allow users to define their own methods without having to learn a complicated API. As the product is hosted on the open source hosting site GitHub, the authors would like to encourage any and all contributions.

References

[1] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[2] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[3] Sijmen de Jong. SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18(3):251–263, March 1993.
[4] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.
[5] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.
