Class Probability Estimation via Differential Geometric Regularization

Qinxun Bai (qinxun@cs.bu.edu), Department of Computer Science, Boston University, Boston, MA 02215 USA
Steven Rosenberg (sr@math.bu.edu), Department of Mathematics and Statistics, Boston University, Boston, MA 02215 USA
Zheng Wu (wuzheng@bu.edu), The MathWorks Inc., Natick, MA 01760 USA
Stan Sclaroff (sclaroff@cs.bu.edu), Department of Computer Science, Boston University, Boston, MA 02215 USA

Abstract

We study the problem of supervised learning for both binary and multiclass classification from a unified geometric perspective. In particular, we propose a geometric regularization technique to find the submanifold corresponding to a robust estimator of the class probability $P(y \mid \mathbf{x})$. The regularization term measures the volume of this submanifold, based on the intuition that overfitting produces rapid local oscillations and hence large volume of the estimator. This technique can be applied to regularize any classification function that satisfies two requirements: first, an estimator of the class probability can be obtained; second, first and second derivatives of the class probability estimator can be calculated. In experiments, we apply our regularization technique to standard loss functions for classification; our RBF-based implementation compares favorably to widely used regularization methods for both binary and multiclass classification.

1. Introduction

In supervised learning for classification, the idea of regularization seeks a balance between a perfect description of the training data and the potential for generalization to unseen data. Most regularization techniques are defined in the form of penalizing some functional norm. For instance, one of the most successful classification methods, the support vector machine (SVM) (Vapnik, 1998; Schölkopf & Smola, 2002) and its variants (Bartlett et al., 2006; Steinwart, 2005), use an RKHS norm as a regularizer. While functional norm based regularization is widely used in machine learning, we feel that important local geometric information is overlooked by this approach.

In many real-world classification problems, if the feature space is meaningful, then all samples within a small enough neighborhood of a training sample should have class probability $P(y \mid \mathbf{x})$ similar to the training sample's. For instance, a small enough perturbation of RGB values at some pixels of a human face image should not dramatically change the likelihood of correct identification of this image during face recognition. However, such "small local oscillations" of the class probability are not explicitly penalized by commonly used functional norms. For instance, as reported by Goodfellow et al. (2014), linear models and their combinations can be easily fooled by hardly perceptible perturbations of a correctly predicted image, even when an $L^2$ regularizer is adopted.

Geometric regularization techniques have also been studied in machine learning. Belkin et al. (2006) employed geometric regularization in the form of the $L^2$ norm of the gradient magnitude supported on a manifold.
This approach exploits the geometry of the marginal distribution $P(\mathbf{x})$ for semi-supervised learning, rather than the geometry of the class probability $P(y \mid \mathbf{x})$. Other related geometric regularization methods are motivated by the success of level set methods in image segmentation (Cai & Sowmya, 2007; Varshney & Willsky, 2010) and Euler's Elastica in image processing (Lin et al., 2012; 2015). In particular, the Level Learning Set (Cai & Sowmya, 2007) combines a counting function of training samples with a geometric penalty on the surface area of the decision boundary. The Geometric Level Set (Varshney & Willsky, 2010) generalizes this idea to standard empirical risk minimization schemes with margin-based loss. Along this line, the Euler's Elastica Model (Lin et al., 2012; 2015) proposes a regularization technique that penalizes both the gradient oscillations and the curvature of the decision boundary. However, all three methods focus on the geometry of the decision boundary supported in the domain of the feature space, and the "small local oscillation" of the class probability is not explicitly addressed.

In this work, we argue that the "small local oscillation" of the class probability actually lives in the product space of the feature domain and the probabilistic output space, and can be characterized by the geometry of a submanifold in this product space corresponding to the class probability. Let $f : X \to \Delta^{L-1}$ be a class probability estimator, where $X$ is the feature space and $\Delta^{L-1}$ is the probability simplex for $L$ classes. From a geometric perspective, if we regard $\{(\mathbf{x}, f(\mathbf{x})) \mid \mathbf{x} \in X\}$, the functional graph (in the geometric sense) of $f$, as a submanifold in $X \times \Delta^{L-1}$, then "small local oscillations" can be measured by the local flatness of this submanifold.

In our approach, the learning process can be viewed as a submanifold fitting problem that is solved by a geometric flow method. In particular, our approach finds a submanifold by iteratively fitting the training samples in a curvature- or volume-decreasing manner, without any a priori assumptions on the geometry of the submanifold in $X \times \Delta^{L-1}$. We use gradient flow methods to find an optimal direction; i.e., at each step we find the vector field pointing in the optimal direction to move $f$. As we will see in the next section, this regularization approach naturally handles binary and multiclass classification in a unified way, while previous decision boundary based techniques (and most functional regularization approaches) are originally designed for binary classification, and rely on "one versus one", "one versus all", or, more efficiently, a binary coding strategy (Varshney & Willsky, 2010) to generalize to the multiclass case.

In experiments, a radial basis function (RBF) based implementation of our formulation compares favorably to widely used binary and multiclass classification methods on datasets from the UCI repository and on real-world datasets including the Flickr Material Database (FMD) and the MNIST Database of handwritten digits.
In summary, our contributions are:

• A geometric perspective on overfitting and a regularization approach that exploits the geometry of a robust class probability estimator for classification,
• A unified gradient flow based algorithm for both binary and multiclass classification that can be applied to standard loss functions, and
• An RBF-based implementation that achieves promising experimental results.

[Figure 1. Example of three-class learning, i.e., $L = 3$, where the input space $X$ is 2d. Training samples of the three classes are marked with red, green, and blue dots, respectively. The class label for each training sample corresponds to a vertex of the simplex $\Delta^{L-1}$. As a result, each mapped training point $(\mathbf{x}_i, \mathbf{z}_i)$ lies on one face (corresponding to its label $y_i$) of the space $X \times \Delta^2$.]

2. Method Overview

We propose a regularization scheme that exploits the geometry of a robust class probability estimator, and suggest a gradient flow based approach to solve for it. In what follows, we describe our approach; the related mathematical notation is summarized in Table 2.

Following the probabilistic setting of classification, given a sample (feature) space $X \subset \mathbb{R}^N$, a label space $Y = \{1, \ldots, L\}$, and a finite training set of labeled samples $T_m = \{(\mathbf{x}_i, y_i)\}_{i=1}^m$, where each training sample is generated i.i.d. from a distribution $P$ over $X \times Y$, our goal is to find an $h_{T_m} : X \to Y$ such that for any new sample $\mathbf{x} \in X$, $h_{T_m}$ predicts its label $\hat{y} = h_{T_m}(\mathbf{x})$. The optimal generalization risk (Bayes risk) is achieved by the classifier $h^*(\mathbf{x}) = \mathrm{argmax}\{\eta^\ell(\mathbf{x}), \ell \in Y\}$, where $\eta = (\eta^1, \ldots, \eta^L)$ with $\eta^\ell : X \to [0,1]$ being the $\ell$-th class probability, i.e., $\eta^\ell(\mathbf{x}) = P(y = \ell \mid \mathbf{x})$.

Our regularization approach exploits the geometry of the class probability estimator, and can be regarded as a "hybrid" plug-in/ERM scheme (Audibert & Tsybakov, 2007). A regularized loss minimization problem is set up to find an estimator $f : X \to \Delta^{L-1}$, where $\Delta^{L-1}$ is the standard $(L-1)$-simplex in $\mathbb{R}^L$, and $f = (f^1, \ldots, f^L)$ is an estimator of $\eta$ with $f^\ell : X \to [0,1]$. The estimator $f$ is then "plugged in" to get the classifier $h_f(\mathbf{x}) = \mathrm{argmax}\{f^\ell(\mathbf{x}), \ell \in Y\}$.

Figure 1 shows an example of the setup of our approach for a synthetic three-class classification problem. The submanifold corresponding to the estimator $f$ is the graph (in the geometric sense) of $f$:

  \mathrm{gr}(f) = \{(\mathbf{x}, f^1(\mathbf{x}), \ldots, f^L(\mathbf{x})) : \mathbf{x} \in X\} \subset X \times \Delta^{L-1}.

We denote a point in the space $X \times \Delta^{L-1}$ as $(\mathbf{x}, \mathbf{z}) = (x^1, \ldots, x^N, z^1, \ldots, z^L)$, where $\mathbf{x} \in X$ and $\mathbf{z} \in \Delta^{L-1}$.

[Figure 2. Example of binary learning via gradient flow. As shown in (a), the feature space $X$ is 2d, training points are sampled uniformly within the region $[-15, 15] \times [-15, 15]$ and labeled by the function $y = \mathrm{sign}(10 - \|\mathbf{x}\|^2)$ (the red circle). In the initialization step, shown in (b), positive and negative training points map to the two faces of the space $X \times \Delta^1$, respectively. Our gradient flow method starts from a neutral function $f_0 \equiv \frac{1}{2}$ and moves in the negative direction (red and blue arrows) of the penalty gradient $\nabla \mathcal{P}_{f_0}$. Figure (c) shows the submanifold $\mathrm{gr}(f_1)$ one step after (b). The submanifold then continues to evolve towards $-\nabla \mathcal{P}_{f_t}$ step by step, and the final output after convergence of the algorithm is shown in (d).]
In this product space, a training pair $(\mathbf{x}_i, y_i = \ell)$ naturally maps to the point $(\mathbf{x}_i, \mathbf{z}_i) = (\mathbf{x}_i, 0, \ldots, 1, \ldots, 0)$, with the one-hot vector $\mathbf{z}_i$ (with the 1 in its $\ell$-th slot) at the vertex of $\Delta^{L-1}$ corresponding to $P(y = y_i \mid \mathbf{x}) = 1$.

We point out two properties of this geometric setup. First, it inherently handles multiclass classification, with binary classification as a special case. Second, while the dimension of the ambient space $\mathbb{R}^{N+L}$ depends on both the feature dimension $N$ and the number of classes $L$, the intrinsic dimension of the submanifold $\mathrm{gr}(f)$ depends only on $N$.

2.1. Variational formulation

We want $\mathrm{gr}(f)$ to approach the mapped training points while remaining as flat as possible, so we impose a penalty on $f$ consisting of an empirical loss term $\mathcal{P}_{T_m}$ and a geometric regularization term $\mathcal{P}_G$. For $\mathcal{P}_{T_m}$, we can choose either the widely used cross-entropy loss function for multiclass classification or the simpler Euclidean distance function between the simplex coordinates of the graph point and the mapped training point. For $\mathcal{P}_G$, we would ideally consider an $L^2$ measure of the Riemann curvature of $\mathrm{gr}(f)$, as the vanishing of this term gives optimal (i.e., locally distortion-free) diffeomorphisms from $\mathrm{gr}(f)$ to $\mathbb{R}^N$. However, the Riemann curvature tensor takes the form of a combination of derivatives up to third order, and the corresponding gradient vector field is even more complicated and inefficient to compute in practice. As a result, we measure the graph's volume, $\mathcal{P}_G(f) = \int_{\mathrm{gr}(f)} \mathrm{dvol}$, where $\mathrm{dvol}$ is the volume induced from the Lebesgue measure on the ambient space $\mathbb{R}^{N+L}$.

More precisely, we find the function that minimizes the following penalty $\mathcal{P}$:

  \mathcal{P} = \mathcal{P}_{T_m} + \lambda \mathcal{P}_G : \mathcal{M} = \mathrm{Maps}(X, \Delta^{L-1}) \to \mathbb{R}   (1)

on the set $\mathcal{M}$ of smooth functions from $X$ to $\Delta^{L-1}$, where $\lambda$ is the tradeoff parameter between empirical loss and regularization. It is important to note that any relative scaling of the domain $X$ will not affect the estimate of the class probability $\eta$, as scaling will distort $\mathrm{gr}(f)$ but will not change the critical function estimating $\eta$.

2.2. Gradient flow and geometric foundation

The standard technique for solving variational formulations is the Euler–Lagrange PDE. However, due to our geometric term $\mathcal{P}_G$, finding the minimal solutions of the Euler–Lagrange equations for $\mathcal{P}$ is difficult; instead, we solve for $\mathrm{argmin}\,\mathcal{P}$ using gradient flow in the function space $\mathcal{M}$. A simple but intuitive simulated example of binary learning using gradient flow for our approach is given in Figure 2.

For explanation purposes only, we replace $\mathcal{M}$ with a finite-dimensional Riemannian manifold $M$. Without loss of generality, we also assume that $\mathcal{P}$ is smooth; then it has a differential $d\mathcal{P}_f : T_f M \to \mathbb{R}$ for each $f \in M$, where $T_f M$ is the tangent space to $M$ at $f$. Since $d\mathcal{P}_f$ is a linear functional on $T_f M$, there is a unique tangent vector, denoted $\nabla \mathcal{P}_f$, such that $d\mathcal{P}_f(v) = \langle v, \nabla \mathcal{P}_f \rangle$ for all $v \in T_f M$. $\nabla \mathcal{P}_f$ points in the direction of maximal increase of $\mathcal{P}$ at $f$. Thus, the solution of the negative gradient flow $df_t/dt = -\nabla \mathcal{P}_{f_t}$ is a flow line of steepest descent starting at an initial $f_0$. For a dense open set of initial points, flow lines approach a local minimum of $\mathcal{P}$ as $t \to \infty$. We always choose the initial function $f_0$ to be the "neutral" choice $f_0(\mathbf{x}) \equiv (\frac{1}{L}, \ldots, \frac{1}{L})$, which reasonably assigns equal conditional probability to all classes.
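To make this setup concrete, the following minimal NumPy sketch (our illustration, not the authors' code; the helper names are ours, and labels are taken as 0-based integers) maps labels to one-hot simplex vertices, builds the neutral initial estimator, and applies the plug-in classifier:

```python
import numpy as np

def to_simplex_vertex(y, L):
    """Map a class label y in {0, ..., L-1} to the one-hot vertex z_i of the simplex."""
    z = np.zeros(L)
    z[y] = 1.0
    return z

def neutral_estimator(X, L):
    """The 'neutral' initial estimator f0(x) = (1/L, ..., 1/L) at every sample."""
    return np.full((X.shape[0], L), 1.0 / L)

def plug_in_classifier(f_values):
    """Plug-in classifier h_f(x) = argmax_l f^l(x), applied row-wise."""
    return np.argmax(f_values, axis=1)

# Tiny usage example for three classes: labels 0, 2, 1 become simplex vertices.
Z = np.stack([to_simplex_vertex(y, L=3) for y in [0, 2, 1]])
```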
Similar gradient flow procedures are widely used in variational problems, such as level set methods (Osher & Sethian, 1988; Sethian, 1999), the Mumford–Shah functional (Mumford & Shah, 1989), etc. In the classification literature, Varshney & Willsky (2010) were the first to use gradient flow methods for solving level set based energy functions, followed by Lin et al. (2012; 2015) to solve Euler's Elastica models. In our case, we exploit the geometry of the space $X \times \Delta^{L-1}$, rather than standard vector spaces.

Since our gradient flow method is actually applied on the infinite-dimensional manifold $\mathcal{M}$, we have to understand both the topology and the Riemannian geometry of $\mathcal{M}$. For the topology, we put the Fréchet topology on $\mathcal{M}' = \mathrm{Maps}(X, \mathbb{R}^L)$, the set of smooth maps from $X$ to $\mathbb{R}^L$, and take the induced topology on $\mathcal{M}$. Intuitively speaking, two functions in $\mathcal{M}$ are close if the functions and all their partial derivatives are pointwise close. Since $\mathcal{M}$ is an open Fréchet submanifold with boundary inside the vector space $\mathcal{M}'$, as with an open set in Euclidean space, we can canonically identify $T_f \mathcal{M}$ with $\mathcal{M}'$. For the Riemannian metric, we take the $L^2$ metric on each tangent space $T_f \mathcal{M}$:

  \langle \phi_1, \phi_2 \rangle := \int_X \phi_1(\mathbf{x}) \cdot \phi_2(\mathbf{x})\, \mathrm{dvol}_{\mathbf{x}},

with $\phi_i \in \mathcal{M}'$ and $\mathrm{dvol}_{\mathbf{x}}$ the volume form of the induced Riemannian metric on the graph of $f$. (Strictly speaking, the volume form is pulled back to $X$ by $f$, usually denoted $f^* \mathrm{dvol}$.) The differential $d\mathcal{P}_f$ is linear as above, and by a direct calculation, there is a unique tangent vector $\nabla \mathcal{P}_f \in T_f \mathcal{M}$ such that $d\mathcal{P}_f(\phi) = \langle \nabla \mathcal{P}_f, \phi \rangle$ for all $\phi \in T_f \mathcal{M}$. Thus, we can construct the gradient flow equation. However, unlike the finite-dimensional case, the existence of flow lines is not automatic. Assuming the existence of flow lines, a generic initial point flows to a local minimum of $\mathcal{P}$. In any case, our RBF-based implementation in §3, which mimics gradient flow, is well defined.

Note that we think of $X$ as large enough that the training data is sampled well inside $X$. This allows us to treat $X$ as a closed manifold in our gradient calculations, so that boundary effects can be ignored. A similar natural boundary condition is also adopted by previous work (Varshney & Willsky, 2010; Lin et al., 2012; 2015).

2.3. More on related work

Some other works are related to aspects of our work. Most notably, Sobolev regularization involves functional norms of a certain number of derivatives of the prediction function. For instance, the manifold regularization (Belkin et al., 2006) mentioned in §1 uses a Sobolev regularization term,

  \int_{\mathbf{x} \in M} \|\nabla_M f\|^2 \, dP(\mathbf{x}),   (2)

where $f$ is a smooth function on the manifold $M$. A discrete version of (2) corresponds to graph Laplacian regularization (Zhou & Schölkopf, 2005). Lin et al. (2015) discussed in detail the difference between a Sobolev norm and a curvature-based norm for the purpose of exploiting the geometry of the decision boundary.
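As a point of comparison, the discrete graph Laplacian form of (2) can be sketched as follows. This is our illustration of the cited regularizer (Zhou & Schölkopf, 2005), not part of the proposed method; the Gaussian-affinity weights are an assumed, standard choice:

```python
import numpy as np

def graph_laplacian_penalty(X, f_vals, sigma=1.0):
    """Discrete analogue of (2): sum_{i,j} w_ij ||f(x_i) - f(x_j)||^2 = 2 tr(F^T L F),
    with Gaussian affinities w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W                       # unnormalized graph Laplacian
    return np.trace(f_vals.T @ L @ f_vals)
```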
For our purpose, while imposing, say, a high Sobolev norm¹ will also lead to a flattening of the hypersurface $\mathrm{gr}(f)$, these norms are not specifically tailored to measuring the flatness of $\mathrm{gr}(f)$. In other words, a high Sobolev norm bound will imply the volume bound we desire, but not vice versa. As a result, imposing high Sobolev norm constraints (regardless of computational difficulties) over-shrinks the hypothesis space from a learning theory point of view. In contrast, our regularization term (given in (11)) involves only the combination of first derivatives of $f$ that specifically addresses the geometry behind the "small local oscillation" prior observed in practice.

Our training procedure for finding the optimal graph of a function is, in a general sense, also related to the manifold learning problem (Tenenbaum et al., 2000; Roweis & Saul, 2000; Belkin & Niyogi, 2003; Donoho & Grimes, 2003; Zhang & Zha, 2005; Lin & Zha, 2008). The most closely related work is (Donoho & Grimes, 2003), which seeks a flat submanifold of Euclidean space that contains a dataset. Again, there are key differences. Since the goal of (Donoho & Grimes, 2003) is dimensionality reduction, their manifold has high codimension, while our functional graph has codimension $L - 1$, which may be as low as 1. More importantly, we do not assume that the graph of our target function is a flat (or volume minimizing) submanifold; instead, we flow towards a function whose graph is as flat (or volume minimizing) as possible. In this regard, our work is related to a large body of literature on Morse theory in finite and infinite dimensions, and on mean curvature flow (Chen et al., 1999; Mantegazza, 2011).

3. Example Formulation: RBFs

We now illustrate our approach using an RBF representation of our estimator $f$. RBFs are also used by previous geometric classification methods (Varshney & Willsky, 2010; Lin et al., 2012; 2015). Given that values of $f$ are probability vectors, it is common to represent $f$ as a "softmax" output of RBFs, i.e.,

  f^j = \frac{e^{h^j}}{\sum_{l=1}^L e^{h^l}}, \quad \text{where } h^j = \sum_{i=1}^m a^j_i \varphi_i(\mathbf{x}), \quad j = 1, \ldots, L,   (3)

where $\varphi_i(\mathbf{x}) = e^{-\frac{1}{c}\|\mathbf{x} - \mathbf{x}_i\|^2}$ is the RBF centered at training sample $\mathbf{x}_i$, with kernel width parameter $c$. Estimating $f$ becomes an optimization problem for the $m \times L$ coefficient matrix $A = (a^\ell_i)$. The following equation determines $A$:

  [h(\mathbf{x}_1), \ldots, h(\mathbf{x}_m)]^T = GA, \quad \text{where } G_{ij} = \varphi_j(\mathbf{x}_i).   (4)

To plug this RBF representation into our gradient flow scheme, the gradient vector field $\nabla \mathcal{P}_f$ is evaluated at each sample point $\mathbf{x}_i$, and $A$ is updated by

  A \leftarrow A - \tau G^{-1} [\nabla \mathcal{P}_h(\mathbf{x}_1), \ldots, \nabla \mathcal{P}_h(\mathbf{x}_m)]^T,   (5)

where $\tau$ is the step-size parameter, and

  \nabla \mathcal{P}_h(\mathbf{x}_i) = \left[\tfrac{\partial f}{\partial h}\right]^T_{\mathbf{x}_i} \nabla \mathcal{P}_f(\mathbf{x}_i).   (6)

Here $\nabla \mathcal{P}_h(\mathbf{x}_i)$ denotes the gradient vector field w.r.t. $h$ evaluated at $\mathbf{x}_i$, and the $L \times L$ Jacobian matrix $\left[\tfrac{\partial f}{\partial h}\right]_{\mathbf{x}_i}$ can be obtained in closed form from (3). In the following subsections, we give the exact forms of the empirical penalty $\mathcal{P}_{T_m}$ and the geometric penalty $\mathcal{P}_G$, and discuss the computation of $\nabla \mathcal{P}_h$ for both penalty terms.

¹ "High Sobolev norm" is the conventional term for a Sobolev norm with a high order of derivatives.
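To make (3)–(6) concrete, here is a minimal NumPy sketch (our illustration under the paper's definitions; variable and function names are ours) of the Gram matrix of (4), the softmax of (3), the closed-form softmax Jacobian $\partial f / \partial h = \mathrm{diag}(f) - f f^T$ used in (6), and the update (5):

```python
import numpy as np

def gram_matrix(X, c):
    """G_ij = exp(-||x_i - x_j||^2 / c), the RBF design matrix of (4)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

def softmax(H):
    """Row-wise softmax of (3): f^j = exp(h^j) / sum_l exp(h^l)."""
    H = H - H.max(axis=1, keepdims=True)  # subtract max for numerical stability
    E = np.exp(H)
    return E / E.sum(axis=1, keepdims=True)

def softmax_jacobian(f_row):
    """L x L Jacobian [df/dh] at one point: diag(f) - f f^T."""
    return np.diag(f_row) - np.outer(f_row, f_row)

def update_A(A, G, grad_h, tau):
    """The update (5): A <- A - tau * G^{-1} [grad_h stacked over training points]."""
    return A - tau * np.linalg.solve(G, grad_h)
```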
3.1. The empirical penalty $\mathcal{P}_{T_m}$

We consider two widely used loss functions for the empirical penalty term $\mathcal{P}_{T_m}$.

Quadratic loss. Since $\mathcal{P}_{T_m}$ measures the deviation of $\mathrm{gr}(f)$ from the mapped training points, it is natural to choose the quadratic function of the Euclidean distance in the simplex $\Delta^{L-1}$,

  \mathcal{P}_{T_m}(f) = \sum_{i=1}^m \|f(\mathbf{x}_i) - \mathbf{z}_i\|^2,   (7)

where $\mathbf{z}_i$ is the one-hot vector corresponding to the ground truth label of $\mathbf{x}_i$. The gradient vector w.r.t. $f$ evaluated at $\mathbf{x}_i$ is $\nabla \mathcal{P}_{T_m, f}(\mathbf{x}_i) = 2(f(\mathbf{x}_i) - \mathbf{z}_i)$. The gradient vector w.r.t. $h$ evaluated at $\mathbf{x}_i$ is

  \nabla \mathcal{P}_{T_m, h}(\mathbf{x}_i) = 2 \left[\tfrac{\partial f}{\partial h}\right]^T_{\mathbf{x}_i} (f(\mathbf{x}_i) - \mathbf{z}_i),   (8)

where the evaluation of $\left[\tfrac{\partial f}{\partial h}\right]^T_{\mathbf{x}_i}$ is the same as in (6).

Cross-entropy loss. The cross-entropy loss function is also widely used for probabilistic output in classification,

  \mathcal{P}_{T_m}(f) = -\sum_{i=1}^m \sum_{\ell=1}^L z^\ell_i \log f^\ell(\mathbf{x}_i),   (9)

whose gradient vector field w.r.t. $h$ evaluated at $\mathbf{x}_i$ is

  \nabla \mathcal{P}_{T_m, h}(\mathbf{x}_i) = f(\mathbf{x}_i) - \mathbf{z}_i.   (10)

3.2. The geometric penalty $\mathcal{P}_G$

As discussed in §2, we wish to penalize graphs for excessive curvature, and we use the following function, which measures the volume of $\mathrm{gr}(f)$:

  \mathcal{P}_G(f) = \int_{\mathrm{gr}(f)} \mathrm{dvol} = \int_{\mathrm{gr}(f)} \sqrt{\det(g)}\, dx^1 \ldots dx^N,   (11)

where $g = (g_{ij})$ with $g_{ij} = \delta_{ij} + f^a_i f^a_j$ is the Riemannian metric on $\mathrm{gr}(f)$ induced from the standard dot product on $\mathbb{R}^{N+L}$. We use the summation convention on repeated indices. Note that this regularization term is clearly very different from the standard Sobolev norm of any order.

It is standard that $\nabla \mathcal{P}_G = -\mathrm{Tr}\, II \in \mathbb{R}^{N+L}$ on the space of all embeddings of $X$ in $\mathbb{R}^{N+L}$. If we restrict to the submanifold of graphs of $f \in \mathcal{M}'$, it is easy to calculate that the gradient of the geometric penalty (11) is

  \nabla \mathcal{P}_{G, f} = V_{G, f} = -\mathrm{Tr}\, II^L,   (12)

where $\mathrm{Tr}\, II^L$ denotes the last $L$ components of $\mathrm{Tr}\, II$. Then the geometric gradient w.r.t. $h$ is

  \nabla \mathcal{P}_{G, h} = V_{G, h} = -\left[\tfrac{\partial f}{\partial h}\right]^T \mathrm{Tr}\, II^L.   (13)

Evaluation of $\left[\tfrac{\partial f}{\partial h}\right]$ and $\mathrm{Tr}\, II^L$ at $\mathbf{x}_i$ gives $\nabla \mathcal{P}_{G, h}(\mathbf{x}_i)$.

The formulation given above is general in that it encompasses both the binary and the multiclass cases. For both cases, the evaluation of $\left[\tfrac{\partial f}{\partial h}\right]$ at the training points is the same as that in (6), and the evaluation of $\mathrm{Tr}\, II^L$ at any point $\mathbf{x}$ can be performed explicitly by the following theorem.

Theorem 1. For $f : \mathbb{R}^N \to \Delta^{L-1}$, $\mathrm{Tr}\, II^L$ for $\mathrm{gr}(f)$ is given by

  \mathrm{Tr}\, II^L = (g^{-1})^{ij} \left( f^1_{ji} - (g^{-1})^{rs} f^a_{rs} f^a_i f^1_j, \; \ldots, \; f^L_{ji} - (g^{-1})^{rs} f^a_{rs} f^a_i f^L_j \right),   (14)

where $f^a_i, f^a_{ij}$ denote partial derivatives of $f^a$.

The proof is in Appendix A. Note that for our RBF representation (3), the partial derivatives $f^a_i, f^a_{ij}$ can be easily obtained in closed form.
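To illustrate the evaluation of (14), here is a small NumPy sketch (ours, not the authors' code) that computes $\mathrm{Tr}\, II^L$ at one point from the first derivatives $F1[i,a] = f^a_i$ and second derivatives $F2[r,s,a] = f^a_{rs}$:

```python
import numpy as np

def trace_II_L(F1, F2):
    """Evaluate (14) at one point.
    F1: (N, L) array of first derivatives f^a_i.
    F2: (N, N, L) array of second derivatives f^a_{rs}.
    Returns the L-vector Tr II^L."""
    N = F1.shape[0]
    g = np.eye(N) + F1 @ F1.T                     # g_ij = delta_ij + f^a_i f^a_j
    ginv = np.linalg.inv(g)
    # c_i = (g^{-1})^{rs} f^a_{rs} f^a_i, contracted over r, s, a
    c = np.einsum('rs,rsa,ia->i', ginv, F2, F1)
    # Tr II^L_b = (g^{-1})^{ij} ( f^b_{ji} - c_i f^b_j )
    return np.einsum('ij,jib->b', ginv, F2) - np.einsum('ij,i,jb->b', ginv, c, F1)
```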
Simplex constraint. The class probability estimator $f : X \to \Delta^{L-1}$ always takes values in $\Delta^{L-1} \subset \mathbb{R}^L$. While this constraint is automatically satisfied for the flow of the empirical gradient vector formulas (8) and (10), it may fail for the flow of the geometric gradient vector formula (12). There are two ways to enforce this constraint for the geometric gradient vector field. First, since our initial function $f_0$ takes values at the center of $\Delta^{L-1}$, we can orthogonally project the geometric gradient vector $V_{G,f}$ to $V'_{G,f}$ in the tangent space $Z = \{(y^1, \ldots, y^L) \in \mathbb{R}^L : \sum_{\ell=1}^L y^\ell = 0\}$ of the simplex, and then scale $\tau V'_{G,f}$ ($\tau$ is the step size) to ensure that the range of the new $f_1$ lies in $\Delta^{L-1}$; we then iterate. More simply, we can select $L-1$ of the $L$ components of $f(\mathbf{x})$, call the new function $f' : X \to \mathbb{R}^{L-1}$, and compute the $(L-1)$-dimensional gradient vector $V_{G,f'}$ following (12) and (14). The omitted component of the desired $L$-gradient vector is determined by $-\sum_{\ell=1}^{L-1} V^\ell_{G,f'}$, by the definition of $Z$. Our reported implementation follows this second approach, where we choose the $(L-1)$ components of $f$ by omitting the component corresponding to the class with the fewest training samples.

3.3. Algorithm summary

Algorithm 1 summarizes the classifier learning procedure. Input to the algorithm is the training set $T_m$, RBF kernel width $c$, trade-off parameter $\lambda$, and step-size parameter $\tau$. For initialization, our algorithm first initializes the function values of $h$ and $f$ for every training point, and then constructs the matrix $G$ and solves for $A$ by (4). In the subsequent steps, at each iteration, our algorithm first evaluates the gradient vector field $\nabla \mathcal{P}_h$ at every training point, then updates the coefficient matrix $A$ by (5). For the overall penalty function $\mathcal{P} = \mathcal{P}_{T_m} + \lambda \mathcal{P}_G$, the total gradient vector field $\nabla \mathcal{P}_h$ evaluated at $\mathbf{x}_i$ is $\nabla \mathcal{P}_h(\mathbf{x}_i) = \nabla \mathcal{P}_{T_m, h}(\mathbf{x}_i) + \lambda \nabla \mathcal{P}_{G, h}(\mathbf{x}_i)$, i.e.,

  \nabla \mathcal{P}_h(\mathbf{x}_i) =
  \begin{cases}
  \left[\tfrac{\partial f}{\partial h}\right]^T_{\mathbf{x}_i} \left( 2(f(\mathbf{x}_i) - \mathbf{z}_i) - \lambda\, \mathrm{Tr}\, II^L_{\mathbf{x}_i} \right), & \text{quadratic,} \\
  f(\mathbf{x}_i) - \mathbf{z}_i - \lambda \left[\tfrac{\partial f}{\partial h}\right]^T_{\mathbf{x}_i} \mathrm{Tr}\, II^L_{\mathbf{x}_i}, & \text{cross-entropy.}
  \end{cases}   (15)

Our algorithm iterates until it converges or reaches the maximum iteration number.

Algorithm 1: Geometric regularized classification
  Input: training data $T_m = \{(\mathbf{x}_i, y_i)\}_{i=1}^m$, RBF kernel width $c$, trade-off parameter $\lambda$, step size $\tau$
  Initialize: $h(\mathbf{x}_i) = (1, \ldots, 1)$, $f(\mathbf{x}_i) = (\frac{1}{L}, \ldots, \frac{1}{L})$ for all $i \in \{1, \ldots, m\}$; construct matrix $G$ and solve for $A$ by (4)
  for $t = 1$ to $T$ do
    – Evaluate the total gradient vector $\nabla \mathcal{P}_h(\mathbf{x}_i)$ at every training point according to (15).
    – Update $A$ by (5).
  end for
  Output: class probability estimator $f$ given by (3)

The same algorithm applies to both the quadratic loss and the cross-entropy loss. To evaluate the total gradient vectors $\nabla \mathcal{P}_h(\mathbf{x}_i)$ in each iteration, we use (8) and (13) for the quadratic loss, and (10) and (13) for the cross-entropy loss; the remaining steps of the procedure are exactly the same for both loss functions. The final predictor learned by our algorithm is given by

  F(\mathbf{x}) = \mathrm{argmax}\{f^\ell(\mathbf{x}), \ell \in \{1, 2, \ldots, L\}\}.   (16)
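Putting the pieces together, a minimal end-to-end sketch of Algorithm 1 for the cross-entropy case of (15) might look as follows. This is our illustration only: it reuses the hypothetical helpers gram_matrix, softmax, softmax_jacobian, update_A, and trace_II_L sketched above, introduces a hypothetical trace_II_L_at that evaluates the RBF derivatives at one training point, and omits the simplex-constraint handling of §3.2 for brevity:

```python
import numpy as np

def fit(X, y, L, c=1.0, lam=0.1, tau=0.1, T=5):
    """Sketch of Algorithm 1 with the cross-entropy total gradient of (15).
    X: (m, N) training features; y: (m,) integer labels assumed in {0, ..., L-1}."""
    m = X.shape[0]
    Z = np.eye(L)[y]                          # one-hot targets z_i
    G = gram_matrix(X, c)
    H = np.ones((m, L))                       # initialize h(x_i) = (1, ..., 1)
    A = np.linalg.solve(G, H)                 # solve (4) for the coefficient matrix A
    for _ in range(T):
        F = softmax(G @ A)                    # f at all training points via (3)
        grad_h = np.empty((m, L))
        for i in range(m):
            J = softmax_jacobian(F[i])        # [df/dh] at x_i
            tr2 = trace_II_L_at(X, A, c, i)   # hypothetical: Tr II^L at x_i from RBF derivatives
            grad_h[i] = (F[i] - Z[i]) - lam * (J.T @ tr2)
        A = update_A(A, G, grad_h, tau)       # the update (5)
    return A
```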
4. Experiments

To evaluate the effectiveness of the proposed regularization approach, we compare our RBF-based implementation with two groups of related classification methods. The first group consists of standard RBF-based methods that use different regularizers than ours; the second group consists of previous geometric regularization methods.

In particular, the first group includes the Radial Basis Function Network (RBN), the SVM with RBF kernel (SVM), and the Import Vector Machine (IVM) (Zhu & Hastie, 2005), a greedy-search variant of the standard RBF kernel logistic regression classifier. Note that both SVM and IVM use RKHS regularizers, and IVM also uses a cross-entropy loss similar to Ours-CE. The second group includes the Level Learning Set classifier (Cai & Sowmya, 2007) (LLS), the Geometric Level Set classifier (Varshney & Willsky, 2010) (GLS), and the Euler's Elastica classifier (Lin et al., 2012; 2015) (EE). Note that both GLS and EE use RBF representations, and EE also uses the same quadratic distance loss as Ours-Q. We test both the quadratic loss version (Ours-Q) and the cross-entropy loss version (Ours-CE) of our implementation.

4.1. UCI datasets

We tested our classification method on four binary classification datasets and four multiclass classification datasets. Given that Varshney & Willsky (2010) covered several methods on our comparison list and their implementation is publicly available, we use the same datasets as (Varshney & Willsky, 2010) and carefully follow the exact experimental setup. Tenfold cross-validation error is reported. For each of the ten folds, the kernel-width constant $c$ and tradeoff parameter $\lambda$ are found using fivefold cross-validation on the training folds. All dimensions of input sample points are normalized to a fixed range $[0, 1]$ throughout the experiments. We select $c$ from the set of values $\{1/2^5, 1/2^4, 1/2^3, 1/2^2, 1/2, 1, 2, 4, 8\}$ and $\lambda$ from the set of values $\{1/1.5^4, 1/1.5^3, 1/1.5^2, 1/1.5, 1, 1.5\}$ that minimize the fivefold cross-validation error. The step size $\tau = 0.1$ and iteration number $T = 5$ are fixed over all datasets. We used the same settings for both loss functions. Table 1 reports the results of this experiment.

Table 1. Tenfold cross-validation error rate (percent) on four binary and four multiclass classification datasets from the UCI machine learning repository. $(L, N)$ denote the number of classes and input feature dimensions, respectively. We compare both the quadratic loss version (Ours-Q) and the cross-entropy loss version (Ours-CE) of our method with 6 RBF-based classification methods and/or geometric regularization methods: SVM with RBF kernel (SVM), Radial basis function network (RBN), Level learning set classifier (Cai & Sowmya, 2007) (LLS), Geometric level set classifier (Varshney & Willsky, 2010) (GLS), Import Vector Machine (Zhu & Hastie, 2005) (IVM), Euler's Elastica classifier (Lin et al., 2012; 2015) (EE). The mean error rate averaged over all eight datasets is shown in the bottom row. Top performance for each dataset is marked with an asterisk.

DATASET  (L, N)   RBN    SVM    IVM     LLS    GLS    EE      OURS-Q  OURS-CE
PIMA     (2, 8)   24.60  24.12  24.11   29.94  25.94  23.33*  23.98   24.51
WDBC     (2, 30)  5.79   2.81   3.16    6.50   4.40   2.63*   2.63*   2.63*
LIVER    (2, 6)   35.65  28.66  29.25   37.39  37.61  26.33   25.74*  26.31
IONOS.   (2, 34)  7.38   3.99*  21.73   13.11  13.67  6.55    6.83    6.26
WINE     (3, 13)  1.70   1.11   1.67    5.03   3.92   0.56    0.00*   0.00*
IRIS     (3, 4)   4.67   2.67*  4.00    3.33   6.00   4.00    3.33    3.33
GLASS    (6, 9)   34.50  31.77  29.44*  38.77  36.95  32.28   29.87   29.44*
SEGM.    (7, 19)  13.07  3.81   3.64    14.40  4.03   8.80    2.47*   2.73
ALL-AVG           15.92  12.37  14.63   18.56  16.57  13.06   11.86*  11.90
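The hyperparameter grids above can be written down directly; a small sketch (ours) of the values used in the cross-validation search:

```python
# Hyperparameter grids for the UCI experiments (values from the text above).
c_grid   = [2.0 ** -k for k in range(5, 0, -1)] + [1, 2, 4, 8]   # 1/32, ..., 1/2, 1, 2, 4, 8
lam_grid = [1.5 ** -k for k in range(4, 0, -1)] + [1, 1.5]       # 1/1.5^4, ..., 1/1.5, 1, 1.5
tau, T = 0.1, 5                                                  # fixed over all datasets
```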
The top performer for each dataset is marked with an asterisk in Table 1, and the averaged performance of each method over all testing datasets is summarized in the bottom row. The numbers for RBN, LLS, and GLS are copied from Table 1 of (Varshney & Willsky, 2010). Results for SVM and IVM are obtained by running publicly available implementations for SVM (Chang & Lin, 2011) and IVM (Roscher et al., 2012). Results for EE are obtained by running an implementation provided by the authors of (Lin et al., 2012). When running these implementations, we followed the same experimental setup as described above and exhaustively searched for the optimal range of the kernel bandwidth and the trade-off parameter via cross-validation.

As shown in the last row of Table 1, the two versions of our approach are overall the top two performers among all reported methods. In particular, Ours-Q attains top performance on four of the eight benchmarks, and Ours-CE attains top performance on three of the eight benchmarks. The performance of the two versions of our method is very close, which shows the robustness of our geometric regularization approach across different loss functions for classification. Three pairs of comparisons, IVM vs. Ours-CE, GLS vs. Ours-Q/Ours-CE, and EE vs. Ours-Q, are of particular interest; we discuss each in turn.

The IVM method of kernel logistic regression uses the same RBF-based implementation and a very similar cross-entropy loss as our cross-entropy version Ours-CE, and both methods handle the multiclass case inherently. The main difference lies in regularization, i.e., the standard RKHS norm regularizer vs. our geometric regularizer. Ours-CE outperforms IVM on six of the eight benchmarks in Table 1, achieves equal performance on one of the remaining two, and is only slightly behind on PIMA. The overall superior performance of Ours-CE demonstrates the advantage of the proposed geometric regularization over the standard RKHS norm regularization.

The GLS method uses the same RBF-based implementation as ours and also exploits volume geometry for regularization. As described in §1, however, there are key differences between the two regularization techniques. GLS measures the volume of the decision boundary supported in $X$, while our approach measures the volume of a submanifold supported in $X \times \Delta^{L-1}$ that corresponds to the class probability estimator. Our regularization technique handles the binary and multiclass cases in a unified framework, while decision boundary based techniques, such as GLS (and EE), were inherently designed for the binary case and rely on a binary coding strategy that trains $\log_2 L$ decision boundaries to generalize to the multiclass case. In our experiments, both Ours-Q and Ours-CE outperform GLS on all the benchmarks we have tested. This demonstrates the effectiveness of exploiting the geometry of the class probability in addressing the "small local oscillation" for classification.

The EE method of the Euler's Elastica model uses the same RBF-based implementation and the same quadratic loss as our quadratic loss version Ours-Q.
The main difference, again, lies in regularization, i.e., a combination of a 1-Sobolev norm and a curvature penalty on the decision boundary vs. our volume penalty on the submanifold corresponding to the class probability estimator. Since EE adopts a combination of sophisticated geometric measures on the decision boundary, which fit specifically the binary case, it achieves top performance on binary datasets. However, as explained in §1, the geometry of the class probability for general classification, which is captured by our approach, cannot be captured by decision boundary based techniques. That is why Ours-Q, a general scheme for both the binary and multiclass cases, outperforms EE on all four multiclass datasets, while still achieving top performance on binary datasets. This again demonstrates our geometric perspective and regularization approach that exploits the geometry of the class probability.

Table 2. Notations
  $h_f(\mathbf{x}) = \mathrm{argmax}_{\ell \in Y} f^\ell(\mathbf{x})$: plug-in classifier of $f : X \to \Delta^{L-1}$
  $\Delta^{L-1}$: the standard $(L-1)$-simplex in $\mathbb{R}^L$
  $\eta(\mathbf{x}) = (\eta^1(\mathbf{x}), \ldots, \eta^L(\mathbf{x}))$: class probability, $\eta^\ell(\mathbf{x}) = P(y = \ell \mid \mathbf{x})$
  $\mathcal{M} = \{f : X \to \Delta^{L-1} : f \in C^\infty\}$
  $\mathcal{M}' = \{f : X \to \mathbb{R}^L : f \in C^\infty\}$
  $T_f \mathcal{M}$: the tangent space to $\mathcal{M}$ at some $f \in \mathcal{M}$; $T_f \mathcal{M} \simeq \mathcal{M}'$
  $\mathrm{gr}(f) = \{(\mathbf{x}, f(\mathbf{x})) : \mathbf{x} \in X\}$: the graph of $f \in \mathcal{M}$ (or $\mathcal{M}'$)
  $g_{ij} = \delta_{ij} + \frac{\partial f}{\partial x^i} \cdot \frac{\partial f}{\partial x^j}$: the Riemannian metric on $\mathrm{gr}(f)$ induced from the standard dot product on $\mathbb{R}^{N+L}$
  $(g^{ij}) = g^{-1}$, with $g = (g_{ij})_{i,j = 1, \ldots, N}$
  $\mathrm{dvol} = \sqrt{\det(g)}\, dx^1 \ldots dx^N$: the volume element on $\mathrm{gr}(f)$
  $\{e_i\}_{i=1}^N$: a smoothly varying orthonormal basis of the tangent spaces $T_{(\mathbf{x}, f(\mathbf{x}))}\, \mathrm{gr}(f)$ of the graph of $f$
  $\mathrm{Tr}\, II = \left(\sum_{i=1}^N D_{e_i} e_i\right)^\perp$: the trace of the second fundamental form of $\mathrm{gr}(f)$, $\mathrm{Tr}\, II \in \mathbb{R}^{N+L}$, with $\perp$ the orthogonal projection to the subspace perpendicular to the tangent space of $\mathrm{gr}(f)$ and $D_y w$ the directional derivative of $w$ in the direction $y$
  $\mathrm{Tr}\, II^L$: the projection of $\mathrm{Tr}\, II$ onto the last $L$ coordinates of $\mathbb{R}^{N+L}$
  $\nabla \mathcal{P}$: the gradient vector field of a function $\mathcal{P} : M \to \mathbb{R}$ on a possibly infinite-dimensional manifold $M$

4.2. Real-world datasets

To test the scalability of our method to high-dimensional and large-scale problems, we also conduct experiments on two real-world datasets: the Flickr Material Database (FMD) for image classification and the MNIST (MNIST) Database of handwritten digits.

FMD (4096 dimensional). The FMD dataset contains 10 categories of images with 100 images per category. We extract image features using the SIFT descriptor augmented by its feature coordinates, implemented with the VLFeat library (VLFeat). With this descriptor, the bag-of-visual-words representation uses 4096 vector-quantized visual words with histogram square rooting, followed by L2 normalization. We compare our method with an SVM classifier with RBF kernels, using exactly the same 4096-dimensional feature. Our method achieves a correct classification rate of 48.8%, while the SVM baseline achieves 46.4%. Note that while recent works (Qi et al., 2015; Cimpoi et al., 2015) report better performance on this dataset, their effort focuses on better feature design, not on the classifier itself; the features used in those works, such as local texture descriptors and CNN features, are more sophisticated.

MNIST (60,000 samples). The MNIST dataset contains 10 classes (0–9) of handwritten digits, with 60,000 samples for training and 10,000 samples for testing. Each sample is a 28 × 28 greyscale image. We use 1000 RBFs to represent our function $f$, with RBF centers obtained by applying K-means clustering on the training set. Note that our learning and regularization approach still handles all 60,000 training samples as described in Algorithm 1. Our method achieves an error rate of 2.74%. While many results have been reported on this dataset, we feel that the most comparable method to our representation is the Radial Basis Function Network with 1000 RBF units (LeCun et al., 1998), which achieves an error rate of 3.6%. This experiment shows the potential of our geometric regularization approach to scale to larger datasets.
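For the MNIST setup just described, selecting RBF centers by K-means can be sketched as follows (our illustration using scikit-learn; the paper does not specify an implementation, and the helper names are ours):

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_centers(X_train, n_centers=1000, seed=0):
    """Pick RBF centers by K-means clustering on the training set (cf. Section 4.2)."""
    km = KMeans(n_clusters=n_centers, random_state=seed, n_init=10).fit(X_train)
    return km.cluster_centers_

def design_matrix(X, centers, c):
    """G_ij = exp(-||x_i - mu_j||^2 / c) with fixed centers mu_j instead of one RBF
    per training sample; with a non-square G, the solve in (4) becomes least squares."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)
```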
5. Discussion

Our geometric regularization approach can also be viewed as a combination of common physical models. As illustrated in Figures 1 and 2, each training pair $(\mathbf{x}, y)$ corresponds to a point at one of the vertices of the simplex associated with $\mathbf{x}$. As a result, all training data lie on the boundary of the space $X \times \Delta^{L-1}$, while the functional graph of a class probability estimator $f$ is a hypersurface (submanifold) in $X \times \Delta^{L-1}$. An initial estimator without training information corresponds to the flat hyperplane in the neutral position. In response to the presence of the training data, this neutral hypersurface deforms towards the training data, as if attracted by a gravitational force due to point masses centered at the training points. Simultaneously, the regularization term forces the hypersurface to remain as flat (or as volume minimizing) as possible, as if in the presence of surface tension; thus this term follows the physics of soap films and minimal surfaces (Dierkes et al., 1992). Geometric flows like the one proposed here are often modeled on physical processes. In our case, the flow can be viewed as a mixed gravity and surface tension physical experiment.

6. Conclusion

We have introduced a new geometric perspective on regularization for classification that exploits the geometry of a robust class probability estimator. Under this perspective, we propose a general regularization approach that applies to both binary and multiclass cases in a unified way. In experiments with an example formulation based on RBFs, our implementation achieves favorable results compared with widely used RBF-based classification methods and previous geometric regularization methods. While the experimental results demonstrate the effectiveness of our geometric regularization technique, it is also important to study convergence properties of this approach from a learning theory perspective. As an initial attempt, we have established Bayes consistency for an easy case of the empirical penalty function; details are provided in Appendix B. We will continue this study in the future.

References

Audibert, Jean-Yves and Tsybakov, Alexandre. Fast learning rates for plug-in classifiers. Annals of Statistics, 35(2):608–633, 2007.

Bartlett, Peter L, Jordan, Michael I, and McAuliffe, Jon D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

Belkin, Mikhail, Niyogi, Partha, and Sindhwani, Vikas. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

Cai, Xiongcai and Sowmya, Arcot. Level learning set: A novel classifier based on active contour models. In Proc. European Conf. on Machine Learning (ECML), pp. 79–90, 2007.

Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chen, Yun-Gang, Giga, Yoshikazu, and Goto, Shun'ichi. Uniqueness and existence of viscosity solutions of generalized mean curvature flow equations. In Fundamental contributions to the continuum theory of evolving phase interfaces in solids, pp. 375–412. Springer, Berlin, 1999.

Devroye, Luc, Györfi, László, and Lugosi, Gábor. A probabilistic theory of pattern recognition. Springer, 1996.

Dierkes, Ulrich, Hildebrandt, Stefan, Küster, Albrecht, and Wohlrab, Ortwin. Minimal surfaces. Springer, 1992.

Donoho, David and Grimes, Carrie. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–5596, 2003.

FMD. http://people.csail.mit.edu/celiu/CVPR2010/FMD/. Accessed: 2015-06-01.

Goodfellow, Ian J, Shlens, Jonathon, and Szegedy, Christian. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Guckenheimer, John and Worfolk, Patrick. Dynamical systems: some computational problems. In Bifurcations and periodic orbits of vector fields (Montreal, PQ, 1992), volume 408 of NATO Adv. Sci. Inst. Ser. C Math. Phys. Sci., pp. 241–277. Kluwer Acad. Publ., Dordrecht, 1993.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lin, Tong and Zha, Hongbin. Riemannian manifold learning. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 30(5):796–809, 2008.

Lin, Tong, Xue, Hanlin, Wang, Ling, and Zha, Hongbin. Total variation and Euler's elastica for supervised learning. Proc. International Conf. on Machine Learning (ICML), 2012.

Lin, Tong, Xue, Hanlin, Wang, Ling, Huang, Bo, and Zha, Hongbin. Supervised learning via Euler's elastica models. Journal of Machine Learning Research, 16:3637–3686, 2015.

Mantegazza, Carlo. Lecture Notes on Mean Curvature Flow, volume 290 of Progress in Mathematics. Birkhäuser/Springer Basel AG, Basel, 2011.

MNIST. http://yann.lecun.com/exdb/mnist/. Accessed: 2015-06-01.

Mumford, David and Shah, Jayant. Optimal approximations by piecewise smooth functions and associated variational problems. Communications on Pure and Applied Mathematics, 42(5):577–685, 1989.

Osher, Stanley and Sethian, James. Fronts propagating with curvature-dependent speed: algorithms based on Hamilton–Jacobi formulations. Journal of Computational Physics, 79(1):12–49, 1988.
Roscher, Ribana, Förstner, Wolfgang, and Waske, Björn. I²VM: incremental import vector machines. Image and Vision Computing, 30(4):263–278, 2012.

Roweis, Sam and Saul, Lawrence. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

Schölkopf, Bernhard and Smola, Alexander. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2002.

Sethian, James Albert. Level set methods and fast marching methods: evolving interfaces in computational geometry, fluid mechanics, computer vision, and materials science, volume 3. Cambridge University Press, 1999.

Steinwart, Ingo. Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans. Information Theory, 51(1):128–142, 2005.

Stone, Charles. Consistent nonparametric regression. Annals of Statistics, pp. 595–620, 1977.

Tenenbaum, Joshua, De Silva, Vin, and Langford, John. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

Vapnik, Vladimir Naumovich. Statistical learning theory, volume 1. Wiley, New York, 1998.

Varshney, Kush and Willsky, Alan. Classification using geometric level sets. Journal of Machine Learning Research, 11:491–516, 2010.

VLFeat. http://www.vlfeat.org/applications/apps.html. Accessed: 2015-06-01.

Zhang, Zhenyue and Zha, Hongyuan. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal on Scientific Computing, 26(1):313–338, 2005.

Zhou, Dengyong and Schölkopf, Bernhard. Regularization on discrete spaces. In Pattern Recognition, pp. 361–368. Springer, 2005.

Zhu, Ji and Hastie, Trevor. Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 2005.

A. Proof of Theorem 1

Proof. For $f : \mathbb{R}^N \to \Delta^{L-1} \subset \mathbb{R}^L$,

  \{r_j = r_j(\mathbf{x}) = (0, \ldots, \underset{j}{1}, \ldots, 0, f^1_j, \ldots, f^L_j) : j = 1, \ldots, N\}

is a basis of the tangent space $T_{\mathbf{x}}\, \mathrm{gr}(f)$ to $\mathrm{gr}(f)$. Here $f^i_j = \partial_{x^j} f^i$. Let $\{e_i\}$ be an orthonormal frame of $T_{\mathbf{x}}\, \mathrm{gr}(f)$. We have $e_i = B^j_i r_j$ for some invertible matrix $(B^j_i)$. Define the metric matrix $g$ for the basis $\{r_j\}$ by $g = (g_{kj})$ with $g_{kj} = r_k \cdot r_j = \delta_{kj} + f^i_k f^i_j$. Then

  \delta_{ij} = e_i \cdot e_j = B^k_i B^t_j\, r_k \cdot r_t = B^k_i B^t_j g_{kt} \;\Rightarrow\; I = (BB^T) g \;\Rightarrow\; BB^T = g^{-1}.

Thus $BB^T$ is computable in terms of derivatives of $f$.

Let $D_u w$ be the $\mathbb{R}^{N+L}$ directional derivative of $w$ in the direction $u$, and let $P^\nu$ denote the orthogonal projection onto the normal space of $\mathrm{gr}(f)$. Then

  \mathrm{Tr}\, II = P^\nu D_{e_i} e_i = P^\nu D_{B^j_i r_j}(B^k_i r_k) = B^j_i P^\nu \left[ (D_{r_j} B^k_i)\, r_k + B^k_i D_{r_j} r_k \right] = B^j_i B^k_i P^\nu D_{r_j} r_k = (g^{-1})^{jk} P^\nu D_{r_j} r_k,

since $P^\nu r_k = 0$. We have

  r_k = (0, \ldots, 1, \ldots, 0, f^1_k(x^1, \ldots, x^N), \ldots, f^L_k(x^1, \ldots, x^N)) = \partial_k^{\mathbb{R}^{N+L}} + \sum_{i=1}^L f^i_k\, \partial_{N+i}^{\mathbb{R}^{N+L}},

so in particular $\partial_\ell^{\mathbb{R}^{N+L}} r_k = 0$ if $\ell > N$. Thus

  D_{r_j} r_k = (\underbrace{0, \ldots, 0}_{N}, f^1_{kj}, \ldots, f^L_{kj}).

So far, we have

  \mathrm{Tr}\, II = (g^{-1})^{jk} P^\nu (\underbrace{0, \ldots, 0}_{N}, f^1_{kj}, \ldots, f^L_{kj}).
Since $g$ is given in terms of derivatives of $f$, we need to write $P^\nu = I - P^T$ in terms of derivatives of $f$, where $P^T$ is the tangential projection. For any $u \in \mathbb{R}^{N+L}$, we have

  P^T u = (P^T u \cdot e_i)\, e_i = (u \cdot B^j_i r_j)\, B^k_i r_k = B^j_i B^k_i (u \cdot r_j)\, r_k = (g^{-1})^{jk} (u \cdot r_j)\, r_k.

Thus

  \mathrm{Tr}\, II = (g^{-1})^{jk} P^\nu (0, \ldots, 0, f^1_{kj}, \ldots, f^L_{kj})   (17)
  = (g^{-1})^{jk} (0, \ldots, 0, f^1_{kj}, \ldots, f^L_{kj}) - P^T \left[ (g^{-1})^{jk} (0, \ldots, 0, f^1_{kj}, \ldots, f^L_{kj}) \right]   (18)
  = (g^{-1})^{jk} (0, \ldots, 0, f^1_{kj}, \ldots, f^L_{kj}) - (g^{-1})^{jk} \left[ (g^{-1})^{rs} (0, \ldots, 0, f^1_{rs}, \ldots, f^L_{rs}) \cdot r_j \right] r_k
  = (g^{-1})^{jk} (0, \ldots, 0, f^1_{kj}, \ldots, f^L_{kj}) - (g^{-1})^{jk} (g^{-1})^{rs} \left( f^i_{rs} f^i_j \right) r_k
  = (g^{-1})^{ij} \Big( 0, \ldots, \underset{j}{-(g^{-1})^{rs} f^a_{rs} f^a_i}, \ldots, 0,\; f^1_{ji} - (g^{-1})^{rs} f^a_{rs} f^a_i f^1_j, \ldots, f^L_{ji} - (g^{-1})^{rs} f^a_{rs} f^a_i f^L_j \Big),   (19)

after a relabeling of indices. Therefore, the last $L$ components of $\mathrm{Tr}\, II$ are given by

  \mathrm{Tr}\, II^L = (g^{-1})^{ij} \left( f^1_{ji} - (g^{-1})^{rs} f^a_{rs} f^a_i f^1_j, \; \ldots, \; f^L_{ji} - (g^{-1})^{rs} f^a_{rs} f^a_i f^L_j \right). □

B. An Easy Example with Bayes Consistency

We now give an example with a loss function that enables an easy Bayes consistency proof under a mild initialization assumption. The related notation is summarized in Appendix C. For ease of reading, we change the notation for the empirical penalty $\mathcal{P}_{T_m}$ in the Appendix to $\mathcal{P}_D$, i.e., $\mathcal{P} = \mathcal{P}_D + \lambda \mathcal{P}_G$. Since $\mathcal{P}_D$ measures the deviation of $\mathrm{gr}(f)$ from the mapped training points, a natural geometric distance penalty term is an $L^2$ distance in $\mathbb{R}^L$ from $f(\mathbf{x})$ to the averaged $\mathbf{z}$-component of the $k$-nearest training points:

  \mathcal{P}_D(f) = R_{D,T_m,k}(f) = \int_X d^2\!\left( f(\mathbf{x}),\; \frac{1}{k} \sum_{i=1}^k \tilde{\mathbf{z}}_i \right) d\mathbf{x},   (20)

where $d$ is the Euclidean distance in $\mathbb{R}^L$, $\tilde{\mathbf{z}}_i$ is the vector of the last $L$ components of $(\tilde{\mathbf{x}}_i, \tilde{\mathbf{z}}_i) = (\tilde{x}^1_i, \ldots, \tilde{x}^N_i, \tilde{z}^1_i, \ldots, \tilde{z}^L_i)$, with $\tilde{\mathbf{x}}_i$ the $i$-th nearest neighbor of $\mathbf{x}$ in $T_m$, and $d\mathbf{x}$ is the Lebesgue measure. The gradient vector field is

  \nabla (R_{D,T_m,k})_f(\mathbf{x}, f(\mathbf{x})) = \frac{2}{k} \sum_{i=1}^k \left( f(\mathbf{x}) - \tilde{\mathbf{z}}_i \right).

However, $\nabla (R_{D,T_m,k})_f$ is discontinuous on the set $D$ of points $\mathbf{x}$ such that $\mathbf{x}$ has equidistant training points among its $k$ nearest neighbors. $D$ is a union of $(N-1)$-dimensional hyperplanes in $X$, so $D$ has measure zero. Such points will necessarily exist unless the last $L$ components of the mapped training points are all 1 or all 0. To rectify this, we can smooth out $\nabla (R_{D,T_m,k})_f$ to a vector field

  V_{D,f,\phi} = \frac{2\phi(\mathbf{x})}{k} \sum_{i=1}^k \left( f(\mathbf{x}) - \tilde{\mathbf{z}}_i \right).   (21)

Here $\phi(\mathbf{x})$ is a smooth damping function close to the singular function $\delta_D$, which has $\delta_D(\mathbf{x}) = 0$ for $\mathbf{x} \in D$ and $\delta_D(\mathbf{x}) = 1$ for $\mathbf{x} \notin D$. Outside any open neighborhood of $D$, $\nabla R_{D,T_m,k} = V_{D,f,\phi}$ for $\phi$ close enough to $\delta_D$. Recall the geometric penalty from the main text, i.e., $\mathcal{P}_G(f) = \int_{\mathrm{gr}(f)} \mathrm{dvol}$, with the geometric gradient vector field $V_{G,f} = -\mathrm{Tr}\, II^L$. Then the gradient vector field $V_{tot,\lambda,m,f,\phi}$ of this example penalty $\mathcal{P}$ is

  V_{tot,\lambda,m,f,\phi} = \nabla \mathcal{P}_f = V_{D,f,\phi} + \lambda V_{G,f} = \frac{2\phi(\mathbf{x})}{k} \sum_{i=1}^k \left( f(\mathbf{x}) - \tilde{\mathbf{z}}_i \right) - \lambda\, \mathrm{Tr}\, II^L.   (22)
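A small sketch (ours) of this $k$-nearest-neighbor penalty gradient of (20), evaluated at a query point $\mathbf{x}$, under the one-hot training coordinates defined above:

```python
import numpy as np

def knn_penalty_gradient(x, f_x, X_train, Z_train, k):
    """Gradient of (20) at x: (2/k) * sum_i (f(x) - z~_i) over the k nearest
    training points, where z~_i are the one-hot simplex coordinates of neighbor i.
    Note (2/k) * sum_i (f(x) - z~_i) = 2 * (f(x) - mean of the k neighbors' z~)."""
    d2 = ((X_train - x) ** 2).sum(axis=1)
    nn = np.argsort(d2)[:k]                   # indices of the k nearest neighbors
    return 2.0 * (f_x - Z_train[nn].mean(axis=0))
```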
B.1. Consistency analysis

For a training set $T_m$, let $f_{T_m} = (f^1_{T_m}, \ldots, f^L_{T_m})$ be the class probability estimator given by our approach. We denote the generalization risk of the corresponding plug-in classifier $h_{f_{T_m}}$ by $R_P(f_{T_m}) = E_P[\mathbf{1}_{h_{f_{T_m}}(\mathbf{x}) \neq y}]$. The Bayes risk is defined by $R^*_P = \inf_{h : X \to Y} R_P(h) = E_P[\mathbf{1}_{h_\eta(\mathbf{x}) \neq y}]$. Our algorithm is Bayes consistent if $\lim_{m \to \infty} R_P(f_{T_m}) = R^*_P$ holds in probability for all distributions $P$ on $X \times Y$.

Usually, gradient flow methods are applied to a convex functional, so that a flow line approaches the unique global minimum. If the domain of the functional is an infinite-dimensional manifold of (e.g., smooth) functions, we always assume that flow lines exist and that the actual minimum exists in this manifold. Because our functionals are not convex, and because we are, strictly speaking, not working with gradient vector fields, we can only hope to prove Bayes consistency for the set of initial estimators in the stable manifold of a stable fixed point (or sink) of the vector field (Guckenheimer & Worfolk, 1993). Recall that a stable fixed point $f_0$ has a maximal open neighborhood, the stable manifold $S_{f_0}$, on which flow lines tend towards $f_0$. For the manifold $\mathcal{M}$, the stable manifold for a stable critical point of the vector field $V_{tot,\lambda,m,f,\phi}$ is infinite dimensional.

The proof of Bayes consistency for multiclass (including binary) classification follows these steps:

Step 1: $\lim_{\lambda \to 0} R^*_{D,P,\lambda} = 0$.
Step 2: $\lim_{n \to \infty} R_{D,P}(f_n) = 0 \Rightarrow \lim_{n \to \infty} R_P(f_n) = R^*_P$.
Step 3: For all $f \in \mathcal{M} = \mathrm{Maps}(X, \Delta^{L-1})$, $|R_{D,T_m}(f) - R_{D,P}(f)| \to 0$ in probability as $m \to \infty$.

Proofs of these steps are provided in the following subsections; for the notation see Appendix C. $R^*_{D,P,\lambda}$ is the minimum of the regularized $D$-risk $R_{D,P,\lambda}(f) = R_{D,P}(f) + \lambda \mathcal{P}_G(f)$, with $R_{D,P}(f) = \int_X d^2(f(\mathbf{x}), \eta(\mathbf{x}))\, d\mathbf{x}$ the $D$-risk. Also, $R_{D,T_m,\lambda}(f) = R_{D,T_m}(f) + \lambda \mathcal{P}_G(f)$, with $R_{D,T_m}(f) = \int_X d^2\big(f(\mathbf{x}), \frac{1}{k}\sum_{i=1}^k \tilde{\mathbf{z}}_i\big)\, d\mathbf{x}$ the empirical $D$-risk.

Theorem 2 (Bayes Consistency). Let $m$ be the size of the training data set. Let $f_{1,\lambda,m} \in S_{f_{D,T_m,\lambda}}$, the stable manifold for the global minimum $f_{D,T_m,\lambda}$ of $R_{D,T_m,\lambda}$, and let $f_{n,\lambda,m,\phi}$ be a sequence of functions on the flow line of $V_{tot,\lambda,m,f,\phi}$ starting at $f_{1,\lambda,m}$, with flow times $t_n \to \infty$ as $n \to \infty$. Then

  R_P(f_{n,\lambda,m,\phi}) \longrightarrow R^*_P \quad \text{as } m, n \to \infty,\ \lambda \to 0,\ \phi \to \delta_D,

in probability for all distributions $P$ on $X \times Y$, provided $k/m \to 0$ as $m \to \infty$.

Proof. In the notation of Appendix C, if $f_{D,T_m,\lambda}$ is a global minimum for $R_{D,T_m,\lambda}$, then outside of $D$, $f_{D,T_m,\lambda}$ is the limit of critical points for the negative flow of $V_{tot,\lambda,m,f,\phi}$ as $\phi \to \delta_D$. To see this, fix an $\epsilon_i$-neighborhood $D_{\epsilon_i}$ of $D$. For a sequence $\phi_j \to \delta_D$, $V_{tot,\lambda,m,f,\phi_j}$ is independent of $j \geq j(\epsilon_i)$ on $X \setminus D_{\epsilon_i}$, so we find a function $f_i$, a critical point of $V_{tot,\lambda,m,f,\phi_{j(\epsilon_i)}}$, equal to $f_{D,T_m,\lambda}$ on $X \setminus D_{\epsilon_i}$. Since any $\mathbf{x} \notin D$ lies outside some $D_{\epsilon_i}$, the sequence $f_i$ converges at $\mathbf{x}$ as $\epsilon_i \to 0$. Thus we can ignore the choice of $\phi$ in our proof, and we drop $\phi$ from the notation.

For our algorithm, for fixed $\lambda, m$, we have as above $\lim_{n \to \infty} f_{n,\lambda,m} = f_{D,T_m,\lambda}$, so $\lim_{n \to \infty} R_{D,T_m,\lambda}(f_{n,\lambda,m}) = R_{D,T_m,\lambda}(f_{D,T_m,\lambda})$ for $f_1 \in S_{f_{D,T_m,\lambda}}$. By Step 2, it suffices to show that $R_{D,P}(f_{D,T_m,\lambda}) \to 0$ as $m \to \infty$ and $\lambda \to 0$.
In probability , we h a ve ∀ δ > Class Probability Estimation via Differ ential Geometric Regularization 0 , ∃ m > 0 suc h tha t R D,P ( f D, T m ,λ ) ≤ R D,P ( f D, T m ,λ ) + λ P G ( f D, T m ,λ ) ≤ R D, T m ( f D, T m ,λ ) + λ P G ( f D, T m ,λ ) + δ 3 (Step 3) = R D, T m ,λ ( f D, T m ,λ ) + δ 3 ≤ R D, T m ,λ ( f D,P ,λ ) + δ 3 (minimality of f D, T m ,λ ) = R D, T m ( f D,P ,λ ) + λ P G ( f D,P ,λ ) + δ 3 ≤ R D,P ( f D,P ,λ ) + λ P G ( f D,P ,λ ) + 2 δ 3 (Step 3) = R D,P ,λ ( f D,P ,λ ) + 2 δ 3 = R ∗ D,P ,λ + 2 δ 3 ≤ δ, ( Step 1) for λ close to zero. Since R D,P ( f D, T m ,λ ) ≥ 0 , we are done. B.2. Step 1 Lemma 3. (Step 1) lim λ → 0 R ∗ D,P ,λ = 0 . Pr oof. Af ter the smoothing pr ocedure in § 3.1 for the dis- tance p enalty term, the fun ction R D,P ,λ : M → R is co n- tinuous in the Fr ´ echet topology o n M . W e check that the function s R D,P ,λ : M → R are equico ntinuous in λ : fo r fixed f 0 ∈ M an d ǫ > 0 , there exists δ = δ ( f 0 , ǫ ) such that | λ − λ ′ | < δ ⇒ | R D,P ,λ ( f 0 ) − R D,P ,λ ′ ( f 0 ) | < ǫ. This is immediate: | R D,P ,λ ( f 0 ) − R D,P ,λ ′ ( f 0 ) | = | ( λ − λ ′ ) P G ( f 0 ) | < ǫ , if δ < ǫ/ |P G ( f 0 ) | . It is standard that th e in fi mum inf R λ of an eq uicontinuous family of functions is continuo us in λ , so lim λ → 0 R ∗ D,P ,λ = R ∗ D,P ,λ =0 = R D,P ( η ) = 0 . B.3. Step 2 W e assume that the class probab ilit y fun ction η ( x ) : R N → R L is smooth, and that the marginal distribution µ ( x ) is continuous. W e also let µ denote the correspond- ing measure on X . Notatio n: h f ( x ) = a rgmax { f ℓ ( x ) , ℓ ∈ Y } . Of course, 1 h f ( x ) 6 = y =  1 , h f ( x ) 6 = y , 0 , h f ( x ) = y . Lemma 4. (Step 2 for a subseque nce) lim n →∞ R D,P ( f n ) = 0 ⇒ lim i →∞ R P ( f n i ) = R ∗ P for some subsequen ce { f n i } ∞ i =1 of { f n } . Pr oof. T he left han d side of the Lem ma is Z X d 2 ( f n ( x ) , η ( x )) d x → 0 , which is equiv alent t o Z X d 2 ( f n ( x ) , η ( x )) µ ( x ) d x → 0 , (23) since X is co mpact and µ is contin uous. Therefo re, it suf- fices to show Z X d 2 ( f n ( x ) , η ( x )) µ ( x ) d x → 0 (24) = ⇒ E P [ 1 h f n ( x ) 6 = y ] → E P [ 1 h η ( x ) 6 = y ] . W e recall that L 2 conv ergence imp lies pointwise conver - gence a.e, so ( 23 ) implies th at a subsequence of f n , also denoted f n , has f n → η ( x ) pointwise a.e. on X . (By our assumptio n on µ ( x ) , these statements hold for either µ or Lebesgue measure.) By Ego ro v’ s theo rem, for any ǫ > 0 , th ere exists a set B ǫ ⊂ X w it h µ ( B ǫ ) < ǫ such that f n → η ( x ) unif ormly on X \ B ǫ . Fix δ > 0 and set Z δ = { x ∈ X : # { argma x ℓ ∈Y η ℓ ( x ) } = 1 , | ma x ℓ ∈Y η ℓ ( x ) − submax ℓ ∈Y η ℓ ( x ) | < δ } , where s ubm ax ℓ ∈Y denotes the secon d largest elemen t in { η 1 ( x ) , . . . , η L ( x ) } . For th e moment, assume th at Z 0 = { x ∈ X : # { a rgmax ℓ ∈Y η ℓ ( x ) } > 1 } has µ ( Z 0 ) = 0 . It follows easily 2 that µ ( Z δ ) → 0 as δ → 0 . On X \ ( Z δ ∪ B ǫ ) , we hav e 1 h f n ( x ) 6 = y = 1 h η ( x ) 6 = y for n > N δ . Thus E P [ 1 X \ ( Z δ ∪ B ǫ ) 1 h f n ( x ) 6 = y ] = E P [ 1 X \ ( Z δ ∪ B ǫ ) 1 h η ( x ) 6 = y ] . 2 Let A k be sets wi th A k +1 ⊂ A k and with µ ( ∩ ∞ k =1 A k ) = 0 . If µ ( A k ) 6→ 0 , then there exists a subsequence, also called A k , with µ ( A k ) > K > 0 for some K . W e claim µ ( ∩ A k ) ≥ K , a contradiction. For the claim, let Z = ∩ A k . If µ ( Z ) ≥ µ ( A k ) for all k , we are done. 
B.2. Step 1

Lemma 3 (Step 1). $\lim_{\lambda \to 0} R^*_{D,P,\lambda} = 0$.

Proof. After the smoothing procedure in §3.1 for the distance penalty term, the function $R_{D,P,\lambda} : M \to \mathbb{R}$ is continuous in the Fréchet topology on $M$. We check that the functions $R_{D,P,\lambda} : M \to \mathbb{R}$ are equicontinuous in $\lambda$: for fixed $f_0 \in M$ and $\epsilon > 0$, there exists $\delta = \delta(f_0, \epsilon)$ such that
$$|\lambda - \lambda'| < \delta \Rightarrow |R_{D,P,\lambda}(f_0) - R_{D,P,\lambda'}(f_0)| < \epsilon.$$
This is immediate: $|R_{D,P,\lambda}(f_0) - R_{D,P,\lambda'}(f_0)| = |(\lambda - \lambda') P_G(f_0)| < \epsilon$ if $\delta < \epsilon / |P_G(f_0)|$. It is standard that the infimum $\inf R_\lambda$ of an equicontinuous family of functions is continuous in $\lambda$, so
$$\lim_{\lambda \to 0} R^*_{D,P,\lambda} = R^*_{D,P,\lambda=0} = R_{D,P}(\eta) = 0.$$

B.3. Step 2

We assume that the class probability function $\eta(x) : \mathbb{R}^N \to \mathbb{R}^L$ is smooth, and that the marginal distribution $\mu(x)$ is continuous. We also let $\mu$ denote the corresponding measure on $\mathcal{X}$. Notation: $h_f(x) = \mathrm{argmax}\{f_\ell(x), \ell \in \mathcal{Y}\}$; of course, $\mathbf{1}_{h_f(x)\neq y}$ equals 1 if $h_f(x) \neq y$ and 0 if $h_f(x) = y$.

Lemma 4 (Step 2 for a subsequence). $\lim_{n\to\infty} R_{D,P}(f_n) = 0 \Rightarrow \lim_{i\to\infty} R_P(f_{n_i}) = R_P^*$ for some subsequence $\{f_{n_i}\}_{i=1}^\infty$ of $\{f_n\}$.

Proof. The left hand side of the Lemma is
$$\int_{\mathcal{X}} d^2(f_n(x), \eta(x))\,dx \to 0,$$
which is equivalent to
$$\int_{\mathcal{X}} d^2(f_n(x), \eta(x))\,\mu(x)\,dx \to 0, \qquad (23)$$
since $\mathcal{X}$ is compact and $\mu$ is continuous. Therefore, it suffices to show
$$\int_{\mathcal{X}} d^2(f_n(x), \eta(x))\,\mu(x)\,dx \to 0 \implies \mathbb{E}_P[\mathbf{1}_{h_{f_n}(x)\neq y}] \to \mathbb{E}_P[\mathbf{1}_{h_\eta(x)\neq y}]. \qquad (24)$$
We recall that $L^2$ convergence implies pointwise a.e. convergence along a subsequence, so (23) implies that a subsequence of $f_n$, also denoted $f_n$, has $f_n \to \eta(x)$ pointwise a.e. on $\mathcal{X}$. (By our assumption on $\mu(x)$, these statements hold for either $\mu$ or Lebesgue measure.) By Egorov's theorem, for any $\epsilon > 0$, there exists a set $B_\epsilon \subset \mathcal{X}$ with $\mu(B_\epsilon) < \epsilon$ such that $f_n \to \eta(x)$ uniformly on $\mathcal{X}\setminus B_\epsilon$.

Fix $\delta > 0$ and set
$$Z_\delta = \Big\{x \in \mathcal{X} : \#\{\mathrm{argmax}_{\ell\in\mathcal{Y}}\,\eta_\ell(x)\} = 1,\ \big|\max_{\ell\in\mathcal{Y}}\eta_\ell(x) - \mathrm{submax}_{\ell\in\mathcal{Y}}\,\eta_\ell(x)\big| < \delta\Big\},$$
where $\mathrm{submax}_{\ell\in\mathcal{Y}}$ denotes the second largest element of $\{\eta_1(x),\ldots,\eta_L(x)\}$. For the moment, assume that $Z_0 = \{x\in\mathcal{X} : \#\{\mathrm{argmax}_{\ell\in\mathcal{Y}}\,\eta_\ell(x)\} > 1\}$ has $\mu(Z_0) = 0$. It follows easily² that $\mu(Z_\delta)\to 0$ as $\delta\to 0$. On $\mathcal{X}\setminus(Z_\delta\cup B_\epsilon)$, uniform convergence gives $\|f_n - \eta\|_\infty < \delta/2$ for $n > N_\delta$, which forces $h_{f_n}(x) = h_\eta(x)$, and hence $\mathbf{1}_{h_{f_n}(x)\neq y} = \mathbf{1}_{h_\eta(x)\neq y}$, for $n > N_\delta$. Thus
$$\mathbb{E}_P[\mathbf{1}_{\mathcal{X}\setminus(Z_\delta\cup B_\epsilon)}\mathbf{1}_{h_{f_n}(x)\neq y}] = \mathbb{E}_P[\mathbf{1}_{\mathcal{X}\setminus(Z_\delta\cup B_\epsilon)}\mathbf{1}_{h_\eta(x)\neq y}].$$
(Here $\mathbf{1}_A$ is the characteristic function of a set $A$.) As $\delta\to 0$,
$$\mathbb{E}_P[\mathbf{1}_{\mathcal{X}\setminus(Z_\delta\cup B_\epsilon)}\mathbf{1}_{h_{f_n}(x)\neq y}] \to \mathbb{E}_P[\mathbf{1}_{\mathcal{X}\setminus B_\epsilon}\mathbf{1}_{h_{f_n}(x)\neq y}],$$
and similarly with $f_n$ replaced by $\eta(x)$. During this process $N_\delta$ may go to infinity, but that is precisely the statement
$$\lim_{n\to\infty}\mathbb{E}_P[\mathbf{1}_{\mathcal{X}\setminus B_\epsilon}\mathbf{1}_{h_{f_n}(x)\neq y}] = \mathbb{E}_P[\mathbf{1}_{\mathcal{X}\setminus B_\epsilon}\mathbf{1}_{h_\eta(x)\neq y}].$$
Since
$$\big|\mathbb{E}_P[\mathbf{1}_{\mathcal{X}\setminus B_\epsilon}\mathbf{1}_{h_{f_n}(x)\neq y}] - \mathbb{E}_P[\mathbf{1}_{h_{f_n}(x)\neq y}]\big| < \epsilon,$$
and similarly for $\eta(x)$, we get
$$\begin{aligned}
&\big|\lim_{n\to\infty}\mathbb{E}_P[\mathbf{1}_{h_{f_n}(x)\neq y}] - \mathbb{E}_P[\mathbf{1}_{h_\eta(x)\neq y}]\big|\\
&\quad\leq \big|\lim_{n\to\infty}\mathbb{E}_P[\mathbf{1}_{h_{f_n}(x)\neq y}] - \lim_{n\to\infty}\mathbb{E}_P[\mathbf{1}_{\mathcal{X}\setminus B_\epsilon}\mathbf{1}_{h_{f_n}(x)\neq y}]\big|\\
&\qquad + \big|\lim_{n\to\infty}\mathbb{E}_P[\mathbf{1}_{\mathcal{X}\setminus B_\epsilon}\mathbf{1}_{h_{f_n}(x)\neq y}] - \mathbb{E}_P[\mathbf{1}_{\mathcal{X}\setminus B_\epsilon}\mathbf{1}_{h_\eta(x)\neq y}]\big|\\
&\qquad + \big|\mathbb{E}_P[\mathbf{1}_{\mathcal{X}\setminus B_\epsilon}\mathbf{1}_{h_\eta(x)\neq y}] - \mathbb{E}_P[\mathbf{1}_{h_\eta(x)\neq y}]\big| \leq 3\epsilon.
\end{aligned}$$
(Strictly speaking, $\lim_{n\to\infty}\mathbb{E}_P[\mathbf{1}_{h_{f_n}(x)\neq y}]$ is first a $\limsup$ and then a $\liminf$, to show that the limit exists.) Since $\epsilon$ is arbitrary, the proof is complete if $\mu(Z_0) = 0$.

If $\mu(Z_0) > 0$, we rerun the proof with $\mathcal{X}$ replaced by $Z_0$. As above, $f_n|_{Z_0}$ converges uniformly to $\eta(x)$ off a set of measure $\epsilon$. The argument above, without the set $Z_\delta$, gives
$$\int_{Z_0}\mathbf{1}_{h_{f_n}(x)\neq y}\,\mu(x)\,dx \to \int_{Z_0}\mathbf{1}_{h_\eta(x)\neq y}\,\mu(x)\,dx.$$
We then proceed with the proof above on $\mathcal{X}\setminus Z_0$.

² Let $A_k$ be sets with $A_{k+1}\subset A_k$ and $\mu(\cap_{k=1}^\infty A_k) = 0$. If $\mu(A_k) \not\to 0$, then there exists a subsequence, also called $A_k$, with $\mu(A_k) > K > 0$ for some $K$. We claim $\mu(\cap A_k) \geq K$, a contradiction. For the claim, let $Z = \cap A_k$. If $\mu(Z) \geq \mu(A_k)$ for all $k$, we are done. If not, since the $A_k$ are nested, we can replace each $A_k$ by a subset, also called $A_k$, of measure exactly $K$, with the new $A_k$ still nested. For the relabeled $Z = \cap A_k$, we have $Z \subset A_k$ for all $k$, and we may assume $\mu(Z) < K$. Thus there exists $Z' \subset A_1$ with $Z' \cap Z = \emptyset$ and $\mu(Z') > 0$. Since $\mu(A_i) = K$, we must have $A_i \cap Z' \neq \emptyset$ for all $i$. Thus $\cap A_i$ is strictly larger than $Z$, a contradiction. In summary, the claim holds, contradicting the assumption $\mu(A_k) \not\to 0$.

Corollary 5 (Step 2 in general). For our algorithm, $\lim_{n\to\infty} R_{D,P}(f_{n,\lambda,m}) = 0 \Rightarrow \lim_{n\to\infty} R_P(f_{n,\lambda,m}) = R_P^*$.

Proof. Choose $f_{1,\lambda,m}$ as in Theorem 2. Since $V_{\mathrm{tot},\lambda,m,f_{n,\lambda,m}}$ has pointwise length going to zero as $n\to\infty$, $\{f_{n,\lambda,m}(x)\}$ is a Cauchy sequence for all $x$. This implies that $f_{n,\lambda,m}$, and not just a subsequence, converges pointwise to $\eta$.
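The set $Z_\delta$ isolates the near-tie region where a uniformly small perturbation of $\eta$ can still flip the argmax. The following is a minimal sketch of the margin test, with our own hypothetical helper names, assuming $\eta(x)$ is available as a length-$L$ probability vector:

```python
import numpy as np

def top_two_margin(eta_x):
    """Gap between the largest and second largest ("submax") entries of
    the class-probability vector eta(x)."""
    second, largest = np.sort(eta_x)[-2:]
    return largest - second

def in_Z_delta(eta_x, delta):
    """x lies in Z_delta iff eta(x) has a unique argmax (positive margin)
    whose margin is below delta."""
    margin = top_two_margin(eta_x)
    return 0.0 < margin < delta
```

Off $Z_\delta$, any estimator within $\delta/2$ of $\eta$ in sup norm shares the argmax of $\eta$, which is exactly how uniform convergence off $B_\epsilon$ is converted into agreement of the plug-in classifiers in the proof above.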
B.4. Step 3

Lemma 6 (Step 3). If $k\to\infty$ and $k/m\to 0$ as $m\to\infty$, then for $f\in\mathrm{Maps}(\mathcal{X},\Delta^{L-1})$, $|R_{D,T_m}(f) - R_{D,P}(f)| \xrightarrow{\,m\to\infty\,} 0$ in probability, for all distributions $P$ that generate $T_m$.

Proof. Since $R_{D,P}(f)$ is a constant for fixed $f$ and $P$, convergence in probability will follow from weak convergence, i.e.,
$$\mathbb{E}_{T_m}\big[|R_{D,T_m}(f) - R_{D,P}(f)|\big] \xrightarrow{\,m\to\infty\,} 0.$$
We have
$$|R_{D,T_m}(f) - R_{D,P}(f)| = \Big|\int_{\mathcal{X}}\Big[d^2\Big(f(x),\tfrac{1}{k}\textstyle\sum_{i=1}^k \tilde z_i\Big) - d^2(f(x),\eta(x))\Big]dx\Big| \leq \int_{\mathcal{X}}\Big|d^2\Big(f(x),\tfrac{1}{k}\textstyle\sum_{i=1}^k \tilde z_i\Big) - d^2(f(x),\eta(x))\Big|\,dx.$$
Set $a = f(x) - \tfrac{1}{k}\sum_{i=1}^k \tilde z_i$ and $b = f(x) - \eta(x)$. Then
$$\big|\|a\|_2^2 - \|b\|_2^2\big| = \Big|\sum_{\ell=1}^L a_\ell^2 - \sum_{\ell=1}^L b_\ell^2\Big| = \Big|\sum_{\ell=1}^L (a_\ell^2 - b_\ell^2)\Big| \leq \sum_{\ell=1}^L |a_\ell^2 - b_\ell^2| \leq 2\sum_{\ell=1}^L |a_\ell - b_\ell|\max\{|a_\ell|,|b_\ell|\} \leq 2\sum_{\ell=1}^L |a_\ell - b_\ell|,$$
since $f_\ell(x),\ \tfrac{1}{k}\sum_i \tilde z_i^\ell,\ \eta_\ell(x) \in [0,1]$. Therefore, it suffices to show that
$$\sum_{\ell=1}^L \mathbb{E}_{T_m}\Big[\int_{\mathcal{X}}\Big|\Big(f_\ell(x) - \tfrac{1}{k}\textstyle\sum_i \tilde z_i^\ell\Big) - \big(f_\ell(x) - \eta_\ell(x)\big)\Big|\,dx\Big] \xrightarrow{\,m\to\infty\,} 0,$$
so the result follows if
$$\lim_{m\to\infty}\mathbb{E}_{T_m,x}\Big[\Big|\eta_\ell(x) - \tfrac{1}{k}\textstyle\sum_i \tilde z_i^\ell\Big|\Big] = 0 \quad \text{for all } \ell. \qquad (25)$$
By Jensen's inequality $(\mathbb{E}[f])^2 \leq \mathbb{E}[f^2]$, (25) follows if
$$\lim_{m\to\infty}\mathbb{E}_{T_m,x}\Big[\Big(\eta_\ell(x) - \tfrac{1}{k}\textstyle\sum_i \tilde z_i^\ell\Big)^2\Big] = 0 \quad \text{for all } \ell. \qquad (26)$$
Let $\eta^\ell_{k,m}(x) = \tfrac{1}{k}\sum_i \tilde z_i^\ell$. Then $\eta^\ell_{k,m}$ is precisely the $k$-nearest neighbor estimate of the class probability $\eta_\ell(x)$. Following the proof of Stone's theorem (Stone, 1977; Devroye et al., 1996), if $k \xrightarrow{m\to\infty} \infty$ and $k/m \xrightarrow{m\to\infty} 0$, then (26) holds for all distributions $P$.

C. Notation

$h_f(x) = \mathrm{argmax}\{f_\ell(x), \ell\in\mathcal{Y}\}$: plug-in classifier of estimator $f:\mathcal{X}\to\Delta^{L-1}$

$\mathbf{1}_{h_f(x)\neq y}$: equals 1 if $h_f(x)\neq y$, and 0 if $h_f(x)=y$

$R_P(f) = \mathbb{E}_P[\mathbf{1}_{h_f(x)\neq y}]$: generalization risk of the estimator $f$

$\eta(x) = (\eta_1(x),\ldots,\eta_L(x))$: class probability function, $\eta_\ell(x) = P(y=\ell\,|\,x)$

$R_P^* = R_P(\eta)$: Bayes risk

$R_{D,P}(f) = \int_{\mathcal{X}} d^2(f(x),\eta(x))\,dx$: $D$-risk (for our $P_D$)

$R_{D,T_m}(f) = R_{D,T_m,k}(f) = \int_{\mathcal{X}} d^2\big(f(x), \tfrac{1}{k}\sum_{i=1}^k \tilde z_i\big)\,dx$: empirical $D$-risk, where $\tilde z_i$ is the vector of the last $L$ components of $(\tilde x_i, \tilde z_i)$, with $\tilde x_i$ the $i$th nearest neighbor of $x$ in $T_m$

$P_G(f) = \int \mathrm{gr}(f)\,d\mathrm{vol}$: volume penalty term

$R_{D,P,\lambda}(f) = R_{D,P}(f) + \lambda P_G(f)$: regularized $D$-risk for estimator $f$

$R_{D,T_m,\lambda}(f) = R_{D,T_m}(f) + \lambda P_G(f)$: regularized empirical $D$-risk for estimator $f$

$f_{D,P,\lambda}$: function attaining the global minimum of $R_{D,P,\lambda}$

$R^*_{D,P,\lambda} = R_{D,P,\lambda}(f_{D,P,\lambda})$: minimum value of $R_{D,P,\lambda}$

$f_{D,T_m,\lambda} = f_{D,T_m,k,\lambda}$: function attaining the global minimum of $R_{D,T_m,\lambda}(f)$

Note that we assume $f_{D,P,\lambda}$ and $f_{D,T_m,\lambda}$ exist.
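As a closing sanity check on (26), the following small simulation is our own construction, not part of the paper: with a one-dimensional feature space, two classes, and $k = \lceil\sqrt{m}\rceil$ (so $k\to\infty$ and $k/m\to 0$), the mean squared error of the $k$-NN class-probability estimate $\eta_{k,m}$ shrinks as $m$ grows, as Stone's theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = lambda x: 0.25 + 0.5 * x               # true P(y=1|x) on X = [0, 1]

for m in (100, 1000, 10000):
    k = int(np.ceil(np.sqrt(m)))             # k -> inf, k/m -> 0
    X = rng.uniform(0.0, 1.0, m)             # training features
    y = (rng.uniform(0.0, 1.0, m) < eta(X)).astype(float)  # sampled labels
    queries = rng.uniform(0.0, 1.0, 2000)
    mse = np.mean([
        (y[np.argsort(np.abs(X - x))[:k]].mean() - eta(x))**2
        for x in queries
    ])
    print(f"m={m:6d}  k={k:4d}  MSE={mse:.5f}")  # MSE decreases with m
```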
