Robust Nonparametric Regression via Sparsity Control with Application to Load Curve Data Cleansing
Gonzalo Mateos and Georgios B. Giannakis (contact author)*

Submitted: March 26, 2022

Abstract—Nonparametric methods are widely applicable to statistical inference problems, since they rely on a few modeling assumptions. In this context, the fresh look advocated here permeates benefits from variable selection and compressive sampling, to robustify nonparametric regression against outliers – that is, data markedly deviating from the postulated models. A variational counterpart to least-trimmed squares regression is shown closely related to an ℓ0-(pseudo)norm-regularized estimator, that encourages sparsity in a vector explicitly modeling the outliers. This connection suggests efficient solvers based on convex relaxation, which lead naturally to a variational M-type estimator equivalent to the least-absolute shrinkage and selection operator (Lasso). Outliers are identified by judiciously tuning regularization parameters, which amounts to controlling the sparsity of the outlier vector along the whole robustification path of Lasso solutions. Reduced bias and enhanced generalization capability are attractive features of an improved estimator obtained after replacing the ℓ0-(pseudo)norm with a nonconvex surrogate. The novel robust spline-based smoother is adopted to cleanse load curve data, a key task aiding operational decisions in the envisioned smart grid system. Computer simulations and tests on real load curve data corroborate the effectiveness of the novel sparsity-controlling robust estimators.

Index Terms—Nonparametric regression, outlier rejection, sparsity, Lasso, splines, load curve cleansing.

EDICS Category: SSP-NPAR, MLR-LEAR.

† Work in this paper was supported by the NSF grants CCF-0830480, 1016605 and ECCS-0824007, 1002180. Part of the paper will appear in the Intl. Conf. on Acoust., Speech, and Signal Proc., Prague, Czech Republic, May 22-27, 2011.
* The authors are with the Dept. of Electrical and Computer Engineering, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455. Tel/fax: (612)626-7781/625-4583; Emails: {mate0058,georgios}@umn.edu

I. INTRODUCTION

Consider the classical problem of function estimation, in which an input vector x := [x_1, ..., x_p]' ∈ R^p is given, and the goal is to predict the real-valued scalar response y = f(x). Function f is unknown, to be estimated from a training data set T := {y_i, x_i}_{i=1}^N. When f is assumed to be a member of a finitely-parameterized family of functions, standard (non-)linear regression techniques can be adopted. If, on the other hand, one is only willing to assume that f belongs to a (possibly infinite-dimensional) space of "smooth" functions H, then a nonparametric approach is in order, and this will be the focus of this work. Without further constraints beyond f ∈ H, functional estimation from finite data is an ill-posed problem. To bypass this challenge, the problem is typically solved by minimizing appropriately regularized criteria, allowing one to control model complexity; see, e.g., [12], [34].
It is then further assumed that H has the structure of a reproducing kernel Hilbert space (RKHS), with corresponding positive definite reproducing kernel function K(·,·): R^p × R^p → R, and norm denoted by ||·||_H. Under the formalism of regularization networks, one seeks f̂ as the solution to the variational problem

\min_{f \in \mathcal{H}} \left[ \sum_{i=1}^{N} V\left(y_i - f(\mathbf{x}_i)\right) + \mu \|f\|_{\mathcal{H}}^2 \right]   (1)

where V(·) is a convex loss function, and µ ≥ 0 controls complexity by weighting the effect of the smoothness functional ||f||²_H. Interestingly, the Representer Theorem asserts that the unique solution of (1) is finitely parametrized and has the form f̂(x) = Σ_{i=1}^N β_i K(x, x_i), where {β_i}_{i=1}^N can be obtained from T; see e.g., [29], [38]. Further details on RKHS, and in particular on the evaluation of ||f||_H, can be found in e.g., [38, Ch. 1]. A fundamental relationship between model complexity control and generalization capability, i.e., the predictive ability of f̂ beyond the training set, was formalized in [37].

The generalization error performance of approaches that minimize the sum of squared model residuals [that is, V(u) = u² in (1)] regularized by a term of the form ||f||²_H is degraded in the presence of outliers. This is because the least-squares (LS) part of the cost is not robust, and can result in severe overfitting of the (contaminated) training data [21]. Recent efforts have considered replacing the squared loss with a robust counterpart such as Huber's function, or its variants, but lack a data-driven means of selecting the proper threshold that determines which datum is considered an outlier [43]; see also [27]. Other approaches have instead relied on the so-termed ε-insensitive loss function, originally proposed to solve function approximation problems using support vector machines (SVMs) [37]. This family of estimators, often referred to as support vector regression (SVR), has been shown to enjoy robustness properties; see e.g., [26], [28], [32] and references therein. In [8], improved performance in the presence of outliers is achieved by refining the SVR solution through a subsequent robust learning phase.

The starting point here is a variational least-trimmed squares (VLTS) estimator, suitable for robust function approximation in H (Section II). It is established that VLTS is closely related to an (NP-hard) ℓ0-(pseudo)norm-regularized estimator, adopted to fit a regression model that explicitly incorporates an unknown sparse vector of outliers [17]. As in compressive sampling (CS) [35], efficient (approximate) solvers are obtained in Section III by replacing the outlier vector's ℓ0-norm with its closest convex approximant, the ℓ1-norm. This leads naturally to a variational M-type estimator of f, also shown equivalent to a least-absolute shrinkage and selection operator (Lasso) [33] on the vector of outliers (Section III-A). A tunable parameter in Lasso controls the sparsity of the estimated vector, and the number of outliers as a byproduct. Hence, effective methods to select this parameter are of paramount importance. The link between ℓ1-norm regularization and robustness was also exploited for parameter (but not function) estimation in [17] and [22]; see also [40] for related ideas in the context of face recognition, and error correction codes [4].
In [17] however, the selection of Lasso's tuning parameter is only justified for Gaussian training data, whereas a fixed value motivated by CS error bounds is adopted in the Bayesian formulation of [22]. Here instead, a more general and systematic approach is pursued in Section III-B, building on contemporary algorithms that can efficiently compute all robustification paths of Lasso solutions, i.e., for all values of the tuning parameter [11], [16], [41]. In this sense, the method here capitalizes on, but is not limited to, sparse settings, since one can examine all possible sparsity levels along the robustification path. An estimator with reduced bias and improved generalization capability is obtained in Section IV, after replacing the ℓ0-norm with a nonconvex surrogate, instead of the ℓ1-norm that introduces bias [33], [44]. Simulated tests demonstrate the effectiveness of the novel approaches in robustifying thin-plate smoothing splines [10] (Section V-A), and in estimating the sinc function (Section V-B) – a paradigm typically adopted to assess performance of robust function approximation approaches [8], [43].

The motivating application behind the robust nonparametric methods of this paper is load curve cleansing [6] – a critical task in power systems engineering and management. Load curve data (also known as load profiles) refers to the electric energy consumption periodically recorded by meters at specific points across the power grid, e.g., end user-points and substations. Accurate load profiles are critical assets aiding operational decisions in the envisioned smart grid system [20]; see also [1], [2], [6]. However, in the process of acquiring and transmitting such massive volumes of information to a central processing unit, data is often noisy, corrupted, or lost altogether. This could be due to several reasons including meter miscalibration or outright failure, as well as communication errors due to noise, network congestion, and connectivity outages; see Fig. 1 for an example. In addition, data significantly deviating from nominal load models (outliers) are not uncommon, and could be attributed to unscheduled maintenance leading to shutdown of heavy industrial loads, weather constraints, holidays, strikes, and major sporting events, just to name a few. In this context, it is critical to effectively reject outliers, and replace the contaminated data with 'healthy' load predictions, i.e., to cleanse the load data. While most utilities carry out this task manually based on their own personnel's know-how, a first scalable and principled approach to load profile cleansing based on statistical learning methods was recently proposed in [6], which also includes an extensive literature review on the related problem of outlier identification in time-series. After estimating the regression function f via either B-spline or kernel smoothing, pointwise confidence intervals are constructed based on f̂. A datum is deemed an outlier whenever it falls outside its associated confidence interval. To control the degree of smoothing effected by the estimator, [6] requires the user to label the outliers present in a training subset of data, and in this sense the approach therein is not fully automatic.
Here instead, a novel alternative to load curve cleansing is developed after specializing the robust estimators of Sections III and IV to the case of cubic smoothing splines (Section V-C). The smoothness- and outlier sparsity-controlling parameters are selected according to the guidelines in Section III-B; hence, no input is required from the data analyst. The proposed spline-based method is tested on real load curve data from a government building. Concluding remarks are given in Section VI, while some technical details are deferred to the Appendix.

Notation: Bold uppercase letters will denote matrices, whereas bold lowercase letters will stand for column vectors. Operators (·)', tr(·) and E[·] will denote transposition, matrix trace and expectation, respectively; |·| will be used for the cardinality of a set and the magnitude of a scalar. The ℓq norm of vector x ∈ R^p is ||x||_q := (Σ_{i=1}^p |x_i|^q)^{1/q} for q ≥ 1; and ||M||_F := √tr(MM') is the matrix Frobenius norm. Positive definite matrices will be denoted by M ≻ 0. The p × p identity matrix will be represented by I_p, while 0_p will denote the p × 1 vector of all zeros, and 0_{p×q} := 0_p 0'_q.

II. ROBUST ESTIMATION PROBLEM

The training data comprise N noisy samples of f taken at the input points {x_i}_{i=1}^N (also known as knots in the splines parlance), and in the present context they can be possibly contaminated with outliers. Building on the parametric least-trimmed squares (LTS) approach [31], the desired robust estimate f̂ can be obtained as the solution of the following variational (V)LTS minimization problem

\min_{f \in \mathcal{H}} \left[ \sum_{i=1}^{s} r_{[i]}^2(f) + \mu \|f\|_{\mathcal{H}}^2 \right]   (2)

where r²_[i](f) is the i-th order statistic among the squared residuals r²_1(f), ..., r²_N(f), and r_i(f) := y_i − f(x_i). In words, given a feasible f ∈ H, to evaluate the cost in (2) one: i) computes all N squared residuals {r²_i(f)}_{i=1}^N; ii) orders them to form the nondecreasing sequence r²_[1](f) ≤ ... ≤ r²_[N](f); and iii) sums up the smallest s terms. As in the parametric LTS [31], the so-termed trimming constant s (also known as coverage) determines the breakdown point of the VLTS estimator, since the largest N − s residuals do not participate in (2). Ideally, one would like to make N − s equal to the (typically unknown) number of outliers N_o in the training data. For most pragmatic scenarios where N_o is unknown, the LTS estimator is an attractive option due to its high breakdown point and desirable theoretical properties, namely √N-consistency and asymptotic normality [31]. The tuning parameter µ ≥ 0 in (2) controls the tradeoff between fidelity to the (trimmed) data, and the degree of "smoothness" measured by ||f||²_H. In particular, ||f||²_H can be interpreted as a generalized ridge regularization term penalizing more those functions with large coefficients in a basis expansion involving the eigenfunctions of the kernel K.
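For concreteness, the trimmed cost in (2) can be evaluated as in the short Python/NumPy sketch below (not part of the paper's own software); it assumes the candidate f has already been fitted, so that its values at the training inputs and its RKHS norm ||f||²_H are available as inputs.

```python
import numpy as np

def vlts_cost(y, f_vals, s, mu, f_norm_sq):
    """Evaluate the VLTS objective (2) for a candidate function f.

    y         : (N,) observed responses
    f_vals    : (N,) candidate values f(x_i) at the training inputs
    s         : trimming constant (number of residuals retained)
    mu        : smoothness weight
    f_norm_sq : precomputed RKHS norm ||f||_H^2 of the candidate
    """
    r2 = (y - f_vals) ** 2          # squared residuals r_i^2(f)
    r2_sorted = np.sort(r2)         # order statistics r_[1]^2 <= ... <= r_[N]^2
    return r2_sorted[:s].sum() + mu * f_norm_sq
```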
Given that the sum in (2) is a nonconvex functional, a nontrivial issue pertains to the existence of the proposed VLTS estimator, i.e., whether or not (2) attains a minimum in H. Fortunately, a (conceptually) simple solution procedure suffices to show that a minimizer does indeed exist. Consider specifically a given subsample of s training data points, say {y_i, x_i}_{i=1}^s, and solve

\min_{f \in \mathcal{H}} \left[ \sum_{i=1}^{s} r_i^2(f) + \mu \|f\|_{\mathcal{H}}^2 \right].

A unique minimizer of the form f̂^(j)(x) = Σ_{i=1}^s β^(j)_i K(x, x_i) is guaranteed to exist, where j is used here to denote the chosen subsample, and the coefficients {β^(j)_i}_{i=1}^s can be obtained by solving a particular linear system of equations [38, p. 11]. This procedure can be repeated for each subsample [there are J := (N choose s) of these], to obtain a collection {f̂^(j)(x)}_{j=1}^J of candidate solutions of (2). The winner(s) f̂ := f̂^(j*) yielding the minimum cost is the desired VLTS estimator. A remark is now in order.

Remark 1 (VLTS complexity): Even though conceptually simple, the solution procedure just described guarantees existence of (at least) one solution, but entails a combinatorial search over all J subsamples, which is intractable for moderate to large sample sizes N. In the context of linear regression, algorithms to obtain approximate LTS solutions are available; see e.g., [30].

A. Robust function approximation via ℓ0-norm regularization

Instead of discarding large residuals, the alternative approach proposed here explicitly accounts for outliers in the regression model. To this end, consider the scalar variables {o_i}_{i=1}^N, one per training datum, taking the value o_i = 0 whenever datum i adheres to the postulated nominal model, and o_i ≠ 0 otherwise. A regression model naturally accounting for the presence of outliers is

y_i = f(\mathbf{x}_i) + o_i + \varepsilon_i, \quad i = 1, \ldots, N   (3)

where {ε_i}_{i=1}^N are zero-mean independent and identically distributed (i.i.d.) random variables modeling the observation errors. A similar model was advocated under different assumptions in [17] and [22], in the context of robust parametric regression; see also [4] and [40]. For an outlier-free datum i, (3) reduces to y_i = f(x_i) + ε_i; hence, ε_i will be often referred to as the nominal noise. Note that in (3), both f ∈ H as well as the N × 1 vector o := [o_1, ..., o_N]' are unknown; thus, (3) is underdetermined. On the other hand, as outliers are expected to often comprise a small fraction of the training sample – say, not exceeding 20% – vector o is typically sparse, i.e., most of its entries are zero; see also Remark 3. Sparsity compensates for underdeterminacy and provides valuable side-information when it comes to efficiently estimating o, identifying outliers as a byproduct, and consequently performing robust estimation of the unknown function f.

A natural criterion for controlling outlier sparsity is to seek the desired estimate f̂ as the solution of

\min_{f \in \mathcal{H}, \, \mathbf{o} \in \mathbb{R}^N} \left[ \sum_{i=1}^{N} \left(y_i - f(\mathbf{x}_i) - o_i\right)^2 + \mu \|f\|_{\mathcal{H}}^2 \right], \quad \text{s.t. } \|\mathbf{o}\|_0 \le \tau   (4)

where τ is a preselected threshold, and ||o||_0 denotes the ℓ0-norm of o, which equals the number of nonzero entries of its vector argument. Sparsity is directly controlled by the selection of the tuning parameter τ ≥ 0. If the number of outliers N_o were known a priori, then τ should be selected equal to N_o. Unfortunately, analogously to related ℓ0-norm constrained formulations in compressive sampling and sparse signal representations, problem (4) is NP-hard.
In addition, (4) can be recast in an equivalent (unconstrained) Lagrangian form; see e.g., [3]

\min_{f \in \mathcal{H}, \, \mathbf{o} \in \mathbb{R}^N} \left[ \sum_{i=1}^{N} \left(y_i - f(\mathbf{x}_i) - o_i\right)^2 + \mu \|f\|_{\mathcal{H}}^2 + \lambda_0 \|\mathbf{o}\|_0 \right]   (5)

where the tuning Lagrange multiplier λ_0 ≥ 0 plays a role similar to τ in (4), and the ℓ0-norm sparsity-encouraging penalty is added to the cost.

To further motivate model (3) and the proposed criterion (5) for robust nonparametric regression, it is worth checking the structure of the minimizers {f̂, ô} of the cost in (5). Consider for the sake of argument that λ_0 is given, and its value is such that ||ô||_0 = ν, for some 0 ≤ ν ≤ N. The goal is to characterize f̂, as well as the positions and values of the nonzero entries of ô. Note that because ||ô||_0 = ν, the last term in (5) is constant, hence inconsequential to the minimization. Upon defining r̂_i := y_i − f̂(x_i), it is not hard to see that the entries of ô satisfy

\hat{o}_i = \begin{cases} 0, & |\hat{r}_i| \le \sqrt{\lambda_0} \\ \hat{r}_i, & |\hat{r}_i| > \sqrt{\lambda_0} \end{cases}, \quad i = 1, \ldots, N   (6)

at the optimum. This is intuitive, since for those ô_i ≠ 0 the best thing to do in terms of minimizing the overall cost is to set ô_i = r̂_i, and thus null the corresponding squared-residual terms in (5). In conclusion, for the chosen value of λ_0 it holds that ν squared residuals effectively do not contribute to the cost in (5).

To determine the support of ô and f̂, one alternative is to exhaustively test all (N choose ν) admissible support combinations. For each one of these combinations (indexed by j), let S_j ⊂ {1, ..., N} be the index set describing the support of ô^(j), i.e., ô^(j)_i ≠ 0 if and only if i ∈ S_j; and |S_j| = ν. By virtue of (6), the corresponding candidate f̂^(j) minimizes

\min_{f \in \mathcal{H}} \left[ \sum_{i \in \mathcal{S}_j} r_i^2(f) + \mu \|f\|_{\mathcal{H}}^2 \right]

while f̂ is the one among all {f̂^(j)} that yields the least cost. The previous discussion, in conjunction with the one preceding Remark 1, completes the argument required to establish the following result.

Proposition 1: If {f̂, ô} minimizes (5) with λ_0 chosen such that ||ô||_0 = N − s, then f̂ also solves the VLTS problem (2).

The importance of Proposition 1 is threefold. First, it formally justifies model (3) and its estimator (5) for robust function approximation, in light of the well documented merits of LTS regression [30]. Second, it further solidifies the connection between sparse linear regression and robust estimation. Third, the ℓ0-norm regularized formulation in (5) lends itself naturally to efficient solvers based on convex relaxation, the subject dealt with next.
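The entrywise rule (6) is simple to state in code. A minimal sketch (illustrative only), given residuals r̂_i = y_i − f̂(x_i) and the multiplier λ_0:

```python
import numpy as np

def l0_outlier_update(residuals, lam0):
    """Entrywise minimizer (6) of the l0-regularized cost (5): o_i = 0 for
    residuals no larger than sqrt(lam0) in magnitude, o_i = r_i otherwise."""
    r = np.asarray(residuals, dtype=float)
    return np.where(np.abs(r) <= np.sqrt(lam0), 0.0, r)
```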
III. SPARSITY CONTROLLING OUTLIER REJECTION

To overcome the complexity hurdle in solving the robust regression problem in (5), one can resort to a suitable relaxation of the objective function. The goal is to formulate an optimization problem which is tractable, and whose solution yields a satisfactory approximation to the minimizer of the original hard problem. To this end, it is useful to recall that the ℓ1-norm ||x||_1 of vector x is the closest convex approximation of ||x||_0. This property, also utilized in the context of compressive sampling [35], provides the motivation to relax the NP-hard problem (5) to

\min_{f \in \mathcal{H}, \, \mathbf{o} \in \mathbb{R}^N} \left[ \sum_{i=1}^{N} \left(y_i - f(\mathbf{x}_i) - o_i\right)^2 + \mu \|f\|_{\mathcal{H}}^2 + \lambda_1 \|\mathbf{o}\|_1 \right].   (7)

Being a convex optimization problem, (7) can be solved efficiently. The nondifferentiable ℓ1-norm regularization term controls sparsity on the estimator of o, a property that has been recently exploited in diverse problems in engineering, statistics and machine learning. A noteworthy representative is the least-absolute shrinkage and selection operator (Lasso) [33], a popular tool in statistics for joint estimation and continuous variable selection in linear regression problems. In its Lagrangian form, Lasso is also known as basis pursuit denoising in the signal processing literature, a term coined by [7] in the context of finding the best sparse signal expansion using an overcomplete basis.

It is pertinent to ponder on whether problem (7) has built-in ability to provide robust estimates f̂ in the presence of outliers. The answer is in the affirmative, since a straightforward argument (details are deferred to the Appendix) shows that (7) is equivalent to a variational M-type estimator found by

\min_{f \in \mathcal{H}} \left[ \sum_{i=1}^{N} \rho\left(y_i - f(\mathbf{x}_i)\right) + \mu \|f\|_{\mathcal{H}}^2 \right]   (8)

where ρ: R → R is a scaled version of Huber's convex loss function [21]

\rho(u) := \begin{cases} u^2, & |u| \le \lambda_1/2 \\ \lambda_1 |u| - \lambda_1^2/4, & |u| > \lambda_1/2. \end{cases}   (9)

Remark 2 (Regularized regression and robustness): Existing works on linear regression have pointed out the equivalence between ℓ1-norm regularized regression and M-type estimators, under specific assumptions on the distribution of the outliers (ε-contamination) [17], [23]. However, they have not recognized the link with LTS through the convex relaxation of (5), and the connection asserted by Proposition 1. Here, the treatment goes beyond linear regression by considering nonparametric functional approximation in RKHS. Linear regression is subsumed as a special case, when the linear kernel K(x, y) := x'y is adopted. In addition, no assumption is imposed on the outlier vector.

It is interesting to compare the ℓ0- and ℓ1-norm formulations [cf. (5) and (7), respectively] in terms of their equivalent purely variational counterparts in (2) and (8), that entail robust loss functions. While the VLTS estimator completely discards large residuals, ρ still retains them, but downweighs their effect through a linear penalty. Moreover, while (8) is convex, (2) is not, and this has a direct impact on the complexity of obtaining either estimator. Regarding the trimming constant s in (2), it controls the number of residuals retained and hence the breakdown point of VLTS. Considering instead the threshold λ_1/2 in Huber's function ρ, when the outliers' distribution is known a priori, its value is available in closed form so that the robust estimator is optimal in a well-defined sense [21]. Convergence in probability of M-type cubic smoothing spline estimators – a special problem subsumed by (8) – was studied in [9].

A. Solving the convex relaxation

Because (7) is jointly convex in f and o, an alternating minimization (AM) algorithm can be adopted to solve (7), for fixed values of µ and λ_1. Selection of these parameters is a critical issue that will be discussed in Section III-B.
AM solvers are iterative procedures that fix one of the variables to its most up-to-date value, and minimize the resulting cost with respect to the other one. Then the roles are reversed to complete one cycle, and the overall two-step minimization procedure is repeated for a prescribed number of iterations, or until a convergence criterion is met. Letting k = 0, 1, ... denote iterations, consider that o := o^(k−1) is fixed in (7). The update for f^(k) at the k-th iteration is given by

f^{(k)} := \arg\min_{f \in \mathcal{H}} \left[ \sum_{i=1}^{N} \left( \left(y_i - o_i^{(k-1)}\right) - f(\mathbf{x}_i) \right)^2 + \mu \|f\|_{\mathcal{H}}^2 \right]   (10)

which corresponds to a standard regularization problem for functional approximation in H [12], but with outlier-compensated data {y_i − o^(k−1)_i, x_i}_{i=1}^N. It is well known that the minimizer of the variational problem (10) is finitely parameterized, and given by the kernel expansion f^(k)(x) = Σ_{i=1}^N β^(k)_i K(x, x_i) [38]. The vector β := [β_1, ..., β_N]' is found by solving the linear system of equations

\left[ \mathbf{K} + \mu \mathbf{I}_N \right] \boldsymbol{\beta}^{(k)} = \mathbf{y} - \mathbf{o}^{(k-1)}   (11)

where y := [y_1, ..., y_N]', and the N × N matrix K ≻ 0 has entries [K]_ij := K(x_i, x_j). In a nutshell, updating f^(k) is equivalent to updating vector β^(k) as per (11), where only the independent vector variable y − o^(k−1) changes across iterations. Because the system matrix is positive definite, the per-iteration systems of linear equations (11) can be efficiently solved after computing once the Cholesky factorization of K + µI_N.

For fixed f := f^(k) in (7), the outlier vector update o^(k) at iteration k is obtained as

\mathbf{o}^{(k)} := \arg\min_{\mathbf{o} \in \mathbb{R}^N} \left[ \sum_{i=1}^{N} \left( r_i^{(k)} - o_i \right)^2 + \lambda_1 \|\mathbf{o}\|_1 \right]   (12)

where r^(k)_i := y_i − Σ_{j=1}^N β^(k)_j K(x_i, x_j). Problem (12) can be recognized as an instance of Lasso for the so-termed orthonormal case, in particular for an identity regression matrix. The solution of such Lasso problems is readily obtained via soft-thresholding [15], in the form of

o_i^{(k)} := \mathcal{S}\left( r_i^{(k)}, \lambda_1/2 \right), \quad i = 1, \ldots, N   (13)

where S(z, γ) := sign(z)(|z| − γ)_+ is the soft-thresholding operator, and (·)_+ := max(0, ·) denotes the projection onto the nonnegative reals. The coordinatewise updates in (13) are consistent with the sparsifying property of the ℓ1-norm, since for "small" residuals, i.e., |r^(k)_i| ≤ λ_1/2, it follows that o^(k)_i = 0, and the i-th training datum is deemed outlier free.

Algorithm 1: AM solver
  Initialize o^(−1) = 0, and run till convergence
  for k = 0, 1, ... do
    Update β^(k) by solving [K + µI_N] β^(k) = y − o^(k−1).
    Update o^(k) via o^(k)_i = S(y_i − Σ_{j=1}^N β^(k)_j K(x_i, x_j), λ_1/2), i = 1, ..., N.
  end for
  return f(x) = Σ_{i=1}^N β^(∞)_i K(x, x_i)

Updates (11) and (13) comprise the iterative AM solver of the ℓ1-norm regularized problem (7), which is tabulated as Algorithm 1. Convexity ensures convergence to the globally optimal solution regardless of the initial condition; see e.g., [3]. Algorithm 1 is also conceptually interesting, since it explicitly reveals the intertwining between the outlier identification process, and the estimation of the regression function with the appropriate outlier-compensated data.
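A minimal Python/NumPy sketch of Algorithm 1 is given below for reference (the paper itself does not prescribe an implementation); it assumes the kernel matrix K has been precomputed and is positive definite, and it reuses a single Cholesky factorization across iterations as suggested above.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def am_solver(K, y, mu, lam1, n_iter=100, tol=1e-8):
    """Alternating-minimization sketch of Algorithm 1 for problem (7).

    K    : (N, N) kernel matrix, K[i, j] = K(x_i, x_j), assumed positive definite
    y    : (N,) responses
    mu   : smoothness parameter
    lam1 : outlier-sparsity parameter
    Returns the kernel expansion coefficients beta and the outlier estimates o.
    """
    N = len(y)
    # Cholesky factor of K + mu*I is computed once and reused at every iteration
    c, low = cho_factor(K + mu * np.eye(N))
    o = np.zeros(N)
    for _ in range(n_iter):
        beta = cho_solve((c, low), y - o)        # update (11)
        r = y - K @ beta                         # residuals r_i^{(k)}
        o_new = soft_threshold(r, lam1 / 2.0)    # update (13)
        if np.max(np.abs(o_new - o)) < tol:
            o = o_new
            break
        o = o_new
    return beta, o
    # A fitted value at a new point x is then sum_i beta[i] * K(x, x_i).
```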
An additional point is worth mentioning after inspecting (13) in the limit as k → ∞. From the definition of the soft-thresholding operator S, for those "large" residuals r̂_i := lim_{k→∞} r^(k)_i exceeding λ_1/2 in magnitude, ô_i = r̂_i − λ_1/2 when r̂_i > 0, and ô_i = r̂_i + λ_1/2 otherwise. In other words, large residuals that the method identifies as corresponding to outlier-contaminated data are shrunk, but not completely discarded. By plugging ô back into (7), these "large" residuals cancel out in the squared error term, but still contribute linearly through the ℓ1-norm regularizer. This is exactly what one would expect, in light of the equivalence established with the variational M-type estimator in (8).

Next, it is established that an alternative to solving a sequence of linear systems and scalar Lasso problems is to solve a single instance of the Lasso with a specific response vector and (non-orthonormal) regression matrix.

Proposition 2: Consider ô_Lasso defined as

\hat{\mathbf{o}}_{\text{Lasso}} := \arg\min_{\mathbf{o} \in \mathbb{R}^N} \left\| \mathbf{X}_\mu \mathbf{y} - \mathbf{X}_\mu \mathbf{o} \right\|_2^2 + \lambda_1 \|\mathbf{o}\|_1   (14)

where

\mathbf{X}_\mu := \begin{bmatrix} \mathbf{I}_N - \mathbf{K}(\mathbf{K} + \mu \mathbf{I}_N)^{-1} \\ (\mu \mathbf{K})^{1/2} (\mathbf{K} + \mu \mathbf{I}_N)^{-1} \end{bmatrix}.   (15)

Then the minimizers {f̂, ô} of (7) are fully determined given ô_Lasso, as ô := ô_Lasso and f̂(x) = Σ_{i=1}^N β̂_i K(x, x_i), with β̂ = (K + µI_N)^{−1}(y − ô_Lasso).

Proof: For notational convenience introduce the N × 1 vectors f := [f(x_1), ..., f(x_N)]' and f̂ := [f̂(x_1), ..., f̂(x_N)]', where f̂ ∈ H is the minimizer of (7). Next, consider rewriting (7) as

\min_{\mathbf{o} \in \mathbb{R}^N} \left\{ \min_{f \in \mathcal{H}} \left[ \left\| (\mathbf{y} - \mathbf{o}) - \mathbf{f} \right\|_2^2 + \mu \|f\|_{\mathcal{H}}^2 \right] + \lambda_1 \|\mathbf{o}\|_1 \right\}.   (16)

The quantity inside the square brackets is a function of o, and can be written explicitly after carrying out the minimization with respect to f ∈ H. From the results in [38], it follows that the vector of optimum predicted values at the points {x_i}_{i=1}^N is given by f̂ = Kβ̂ = K(K + µI_N)^{−1}(y − o); see also the discussion after (10). Similarly, one finds that ||f̂||²_H = β̂'Kβ̂ = (y − o)'(K + µI_N)^{−1}K(K + µI_N)^{−1}(y − o). Having minimized (16) with respect to f, the quantity inside the square brackets is (with Γ_µ := (K + µI_N)^{−1})

\min_{f \in \mathcal{H}} \left[ \|(\mathbf{y}-\mathbf{o}) - \mathbf{f}\|_2^2 + \mu \|f\|_{\mathcal{H}}^2 \right]
  = \left\| (\mathbf{y}-\mathbf{o}) - \hat{\mathbf{f}} \right\|_2^2 + \mu \|\hat{f}\|_{\mathcal{H}}^2
  = \left\| (\mathbf{y}-\mathbf{o}) - \mathbf{K}\boldsymbol{\Gamma}_\mu(\mathbf{y}-\mathbf{o}) \right\|_2^2 + \mu (\mathbf{y}-\mathbf{o})' \boldsymbol{\Gamma}_\mu \mathbf{K} \boldsymbol{\Gamma}_\mu (\mathbf{y}-\mathbf{o})
  = \left\| (\mathbf{I}_N - \mathbf{K}\boldsymbol{\Gamma}_\mu)\mathbf{y} - (\mathbf{I}_N - \mathbf{K}\boldsymbol{\Gamma}_\mu)\mathbf{o} \right\|_2^2 + \mu (\mathbf{y}-\mathbf{o})' \boldsymbol{\Gamma}_\mu \mathbf{K} \boldsymbol{\Gamma}_\mu (\mathbf{y}-\mathbf{o}).   (17)

After expanding the quadratic form on the right-hand side of (17), and eliminating the term that does not depend on o, problem (16) becomes

\min_{\mathbf{o} \in \mathbb{R}^N} \left[ \left\| (\mathbf{I}_N - \mathbf{K}\boldsymbol{\Gamma}_\mu)\mathbf{y} - (\mathbf{I}_N - \mathbf{K}\boldsymbol{\Gamma}_\mu)\mathbf{o} \right\|_2^2 - 2\mu \mathbf{y}' \boldsymbol{\Gamma}_\mu \mathbf{K} \boldsymbol{\Gamma}_\mu \mathbf{o} + \mu \mathbf{o}' \boldsymbol{\Gamma}_\mu \mathbf{K} \boldsymbol{\Gamma}_\mu \mathbf{o} + \lambda_1 \|\mathbf{o}\|_1 \right].

Completing the square one arrives at

\min_{\mathbf{o} \in \mathbb{R}^N} \left\| \begin{bmatrix} \mathbf{I}_N - \mathbf{K}\boldsymbol{\Gamma}_\mu \\ (\mu \mathbf{K})^{1/2} \boldsymbol{\Gamma}_\mu \end{bmatrix} \mathbf{y} - \begin{bmatrix} \mathbf{I}_N - \mathbf{K}\boldsymbol{\Gamma}_\mu \\ (\mu \mathbf{K})^{1/2} \boldsymbol{\Gamma}_\mu \end{bmatrix} \mathbf{o} \right\|_2^2 + \lambda_1 \|\mathbf{o}\|_1

which completes the proof.

The result in Proposition 2 opens the possibility of effective methods to select λ_1. These methods, to be described in detail in the ensuing section, capitalize on recent algorithmic advances on Lasso solvers, which allow one to efficiently compute ô_Lasso for all values of the tuning parameter λ_1. This is crucial for obtaining satisfactory robust estimates f̂, since controlling the sparsity in o by tuning λ_1 is tantamount to controlling the number of outliers in model (3).
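To make Proposition 2 concrete, the sketch below (an illustrative stand-in, not the SpaRSA solver used later in the paper) builds X_µ as in (15), solves the Lasso (14) with a plain ISTA proximal-gradient loop, and recovers β̂; the symmetric square root of µK is obtained via an eigendecomposition.

```python
import numpy as np

def build_X_mu(K, mu):
    """Construct X_mu in (15): the vertical stack of (I - K*Gamma) and
    (mu*K)^{1/2} * Gamma, with Gamma = (K + mu*I)^{-1}."""
    N = K.shape[0]
    Gamma = np.linalg.inv(K + mu * np.eye(N))
    w, V = np.linalg.eigh(mu * K)                              # symmetric sqrt of mu*K
    sqrt_muK = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T
    return np.vstack([np.eye(N) - K @ Gamma, sqrt_muK @ Gamma]), Gamma

def lasso_ista(A, b, lam1, n_iter=2000):
    """Plain ISTA solver for min_o ||b - A o||_2^2 + lam1 * ||o||_1."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2                        # gradient Lipschitz constant
    o = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * A.T @ (b - A @ o)
        z = o - grad / L
        o = np.sign(z) * np.maximum(np.abs(z) - lam1 / L, 0.0)  # prox of (lam1/L)*||.||_1
    return o

def robust_fit_via_lasso(K, y, mu, lam1):
    """Proposition 2: solve the single Lasso (14) in o, then recover beta."""
    X_mu, Gamma = build_X_mu(K, mu)
    o_hat = lasso_ista(X_mu, X_mu @ y, lam1)
    beta_hat = Gamma @ (y - o_hat)                             # beta = (K + mu*I)^{-1}(y - o)
    return beta_hat, o_hat
```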
B. Selection of the tuning parameters: robustification paths

As argued before, the tuning parameters µ and λ_1 in (7) control the degree of smoothness in f̂ and the number of outliers (nonzero entries in ô_Lasso), respectively. From a statistical learning theory standpoint, µ and λ_1 control the amount of regularization and model complexity, thus capturing the so-termed effective degrees of freedom [19]. Complex models tend to have worse generalization capability, even though the prediction error over the training set T may be small (overfitting). In the contexts of regularization networks [12] and Lasso estimation for regression [33], the corresponding tuning parameters are typically selected via model selection techniques such as cross-validation, or by minimizing the prediction error over an independent test set, if available [19]. However, these simple methods are severely challenged in the presence of multiple outliers. For example, the swamping effect refers to a very large value of the residual r_i corresponding to a left-out clean datum {y_i, x_i}, because of an unsatisfactory model estimate based on all data except i – data which contain outliers. The idea here offers an alternative method to overcome the aforementioned challenges, and the possibility to efficiently compute ô_Lasso for all values of λ_1, given µ. A brief overview of the state of the art in Lasso solvers is given first. Several methods for selecting µ and λ_1 are then described, which differ in the assumptions on what is known regarding the outlier model (3).

Lasso amounts to solving a quadratic programming (QP) problem [33]; hence, an iterative procedure is required to determine ô_Lasso in (14) for a given value of λ_1. While standard QP solvers can certainly be invoked to this end, an increasing amount of effort has been put recently toward developing fast algorithms that capitalize on the unique properties of Lasso. The LARS algorithm [11] is an efficient scheme for computing the entire path of solutions (corresponding to all values of λ_1), sometimes referred to as regularization paths. LARS capitalizes on piecewise linearity of the Lasso path of solutions, while incurring the complexity of a single LS fit, i.e., when λ_1 = 0. Coordinate descent algorithms have been shown competitive, even outperforming LARS when p is large, as demonstrated in [16]; see also [15], [42], and the references therein. Coordinate descent solvers capitalize on the fact that Lasso affords a very simple solution in the scalar case, which is given in closed form in terms of a soft-thresholding operation [cf. (13)]. Further computational savings are attained through the use of warm starts [15], when computing the Lasso path of solutions over a grid of decreasing values of λ_1. An efficient solver capitalizing on variable separability has been proposed in [41].

Consider then a grid of G_µ values of µ in the interval [µ_min, µ_max], evenly spaced on a logarithmic scale. Likewise, for each µ consider a similar type of grid consisting of G_λ values of λ_1, where λ_max := 2 max_i |y'X'_µ x_{µ,i}| is the minimum λ_1 value such that ô_Lasso = 0_N [16], and X_µ := [x_{µ,1} ... x_{µ,N}] in (14). Typically, λ_min = ελ_max with ε = 10^{−4}, say. Note that each of the G_µ values of µ gives rise to a different λ grid, since λ_max depends on µ through X_µ.
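The grid construction just described can be sketched as follows; `build_X_mu` refers to the helper from the sketch after Proposition 2, and the default grid sizes mirror the values used later in Section V-A (all names here are illustrative).

```python
import numpy as np

def lambda_max(X_mu, y):
    """Smallest lambda_1 for which the Lasso (14) returns the all-zero outlier
    vector: 2 * max_i |x_mu_i' X_mu y| (standard Lasso zero-solution condition)."""
    return 2.0 * np.max(np.abs(X_mu.T @ (X_mu @ y)))

def tuning_grid(K, y, G_mu=200, G_lam=200, mu_min=1e-9, mu_max=1.0, eps=1e-4):
    """Logarithmically spaced (mu, lambda_1) grid of Section III-B; each mu gets
    its own lambda grid since lambda_max depends on mu through X_mu."""
    grid = []
    for mu in np.logspace(np.log10(mu_min), np.log10(mu_max), G_mu):
        X_mu, _ = build_X_mu(K, mu)          # helper from the Proposition 2 sketch
        lmax = lambda_max(X_mu, y)
        lams = np.logspace(np.log10(eps * lmax), np.log10(lmax), G_lam)
        grid.append((mu, lams))
    return grid
```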
Given the previously surveyed algorithmic alternatives to tackle the Lasso, it is safe to assume that (14) can be efficiently solved over the (nonuniform) G_µ × G_λ grid of values of the tuning parameters. This way, for each value of µ one obtains G_λ samples of the Lasso path of solutions, which in the present context can be referred to as the robustification path. As λ_1 decreases, more variables ô_{Lasso,i} enter the model, signifying that more of the training data are deemed to contain outliers. An example of the robustification path is given in Fig. 3. Based on the robustification paths and the prior knowledge available on the outlier model (3), several alternatives are given next to select the "best" pair {µ, λ_1} in the grid G_µ × G_λ.

Number of outliers is known: When N_o is known, by direct inspection of the robustification paths one can determine the range of values of λ_1 for which ô_Lasso has exactly N_o nonzero entries. Specializing to the interval of interest, and after discarding outliers which are now fixed and known, K-fold cross-validation methods can be applied to determine λ_1.

Variance of the nominal noise is known: Supposing that the variance σ²_ε of the i.i.d. nominal noise variables ε_i in (3) is known, one can proceed as follows. Using the solution f̂ obtained for each pair {µ_i, λ_j} on the grid, form the G_µ × G_λ sample variance matrix Σ̄ with ij-th entry

[\bar{\boldsymbol{\Sigma}}]_{ij} := \sum_{u \,|\, \hat{o}_{\text{Lasso},u} = 0} \hat{r}_u^2 \,/\, (N - \hat{N}_o) = \sum_{u \,|\, \hat{o}_{\text{Lasso},u} = 0} \left( y_u - \hat{f}(\mathbf{x}_u) \right)^2 / (N - \hat{N}_o)   (18)

where N̂_o stands for the number of nonzero entries in ô_Lasso, so that N − N̂_o is the number of training data deemed outlier free. Although not made explicit, the right-hand side of (18) depends on {µ_i, λ_j} through the estimates f̂, ô_Lasso and N̂_o. The entries [Σ̄]_ij correspond to a sample estimate of σ²_ε, computed without considering those training data {y_i, x_i} that the method determined to be contaminated with outliers, i.e., those indices i for which ô_{Lasso,i} ≠ 0. The "winner" tuning parameters {µ*, λ*_1} := {µ_{i*}, λ_{j*}} are such that

[i^*, j^*] := \arg\min_{i,j} \left| [\bar{\boldsymbol{\Sigma}}]_{ij} - \sigma_\varepsilon^2 \right|   (19)

which is an absolute variance deviation (AVD) criterion.

Variance of the nominal noise is unknown: If σ²_ε is unknown, one can still compute a robust estimate of the variance σ̂²_ε, and repeat the previous procedure (with known nominal noise variance) after replacing σ²_ε with σ̂²_ε in (19). One option is based on the median absolute deviation (MAD) estimator, namely

\hat{\sigma}_\varepsilon := 1.4826 \times \text{median}_i \left( \left| \hat{r}_i - \text{median}_j(\hat{r}_j) \right| \right)   (20)

where the residuals r̂_i = y_i − f̂(x_i) are formed based on a nonrobust estimate of f, obtained e.g., after solving (7) with λ_1 = 0 and using a small subset of the training dataset T. The factor 1.4826 provides an approximately unbiased estimate of the standard deviation when the nominal noise is Gaussian. Typically, σ̂_ε in (20) is used as an estimate of the scale of the errors in general M-type robust estimators; see e.g., [9] and [27].
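Both selection rules admit short implementations. The sketch below (illustrative names and data layout) assumes that, for each grid point, the residuals and the estimated outlier vector have already been computed, e.g., with the solvers sketched earlier; the variance in (18) is taken over the points declared outlier free.

```python
import numpy as np

def mad_scale(residuals):
    """Robust scale estimate (20): 1.4826 * median |r_i - median(r)|."""
    r = np.asarray(residuals, dtype=float)
    return 1.4826 * np.median(np.abs(r - np.median(r)))

def avd_select(candidates, sigma2_eps):
    """Absolute variance deviation rule (18)-(19).

    candidates : iterable of tuples (mu, lam1, residuals, o_hat), one per grid
                 point, with residuals[i] = y_i - f_hat(x_i) for that pair.
    sigma2_eps : known (or robustly estimated) nominal-noise variance.
    Returns the (mu, lam1) pair whose sample variance over the points declared
    outlier free is closest to sigma2_eps.
    """
    best, best_dev = None, np.inf
    for mu, lam1, r, o in candidates:
        clean = (o == 0)                        # indices deemed outlier free
        if not np.any(clean):
            continue
        dev = abs(np.mean(r[clean] ** 2) - sigma2_eps)
        if dev < best_dev:
            best, best_dev = (mu, lam1), dev
    return best
```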
Remark 3 (How sparse is sparse): Even though the very nature of outliers dictates that N_o is typically a small fraction of N – and thus o in (3) is sparse – the method here capitalizes on, but is not limited to, sparse settings. For instance, choosing λ_1 ∈ [λ_min ≈ 0, λ_max] along the robustification paths allows one to continuously control the sparsity level, and potentially select the right value of λ_1 for any given N_o ∈ {1, ..., N}. Admittedly, if N_o is large relative to N, then even if it is possible to identify and discard the outliers, the estimate f̂ may not be accurate due to the lack of outlier-free data.

IV. REFINEMENT VIA NONCONVEX REGULARIZATION

Instead of substituting ||o||_0 in (5) by its closest convex approximation, namely ||o||_1, allowing the surrogate function to be nonconvex can yield tighter approximations. For example, the ℓ0-norm of a vector x ∈ R^n was surrogated in [5] by the logarithm of the geometric mean of its elements, or by Σ_{i=1}^n log|x_i|. In rank minimization problems, apart from the nuclear norm relaxation, minimizing the logarithm of the determinant of the unknown matrix has been proposed as an alternative surrogate [14]. Adopting related ideas in the present nonparametric context, consider approximating (5) by

\min_{f \in \mathcal{H}, \, \mathbf{o} \in \mathbb{R}^N} \left[ \sum_{i=1}^{N} \left(y_i - f(\mathbf{x}_i) - o_i\right)^2 + \mu \|f\|_{\mathcal{H}}^2 + \lambda_0 \sum_{i=1}^{N} \log\left(|o_i| + \delta\right) \right]   (21)

where δ is a sufficiently small positive offset introduced to avoid numerical instability. Since the surrogate term in (21) is concave, the overall problem is nonconvex. Still, local methods based on iterative linearization of log(|o_i| + δ), around the current iterate o^(k)_i, can be adopted to minimize (21). From the concavity of the logarithm, its local linear approximation serves as a global overestimator. Standard majorization-minimization algorithms motivate minimizing the global linear overestimator instead. This leads to the following iteration for k = 0, 1, ... (see e.g., [25] for further details)

[f^{(k)}, \mathbf{o}^{(k)}] := \arg\min_{f \in \mathcal{H}, \, \mathbf{o} \in \mathbb{R}^N} \left[ \sum_{i=1}^{N} \left(y_i - f(\mathbf{x}_i) - o_i\right)^2 + \mu \|f\|_{\mathcal{H}}^2 + \lambda_0 \sum_{i=1}^{N} w_i^{(k)} |o_i| \right]   (22)

w_i^{(k)} := \left( \left|o_i^{(k-1)}\right| + \delta \right)^{-1}, \quad i = 1, \ldots, N.   (23)

It is possible to eliminate the optimization variable f ∈ H from (22), by direct application of the result in Proposition 2. The equivalent update for o at iteration k is then given by

\mathbf{o}^{(k)} := \arg\min_{\mathbf{o} \in \mathbb{R}^N} \left[ \left\| \mathbf{X}_\mu \mathbf{y} - \mathbf{X}_\mu \mathbf{o} \right\|_2^2 + \lambda_0 \sum_{i=1}^{N} w_i^{(k)} |o_i| \right]   (24)

which amounts to an iteratively reweighted version of (14). If the value of |o^(k−1)_i| is small, then in the next iteration the corresponding regularization term λ_0 w^(k)_i |o_i| has a large weight, thus promoting shrinkage of that coordinate to zero. On the other hand, when |o^(k−1)_i| is significant, the cost in the next iteration downweighs the regularization, and places more importance on the LS component of the fit. For small δ, analysis of the limiting point o* of (24) reveals that

\lambda_0 w_i^* |o_i^*| \approx \begin{cases} \lambda_0, & |o_i^*| \ne 0 \\ 0, & |o_i^*| = 0 \end{cases}

and hence, λ_0 Σ_{i=1}^N w*_i |o*_i| ≈ λ_0 ||o*||_0. A good initialization for the iteration in (24) and (23) is ô_Lasso, which corresponds to the solution of (14) [and (7)] for λ_0 = λ*_1 and µ = µ*. This is equivalent to a single iteration of (24) with all weights equal to unity. The numerical tests in Section V will indicate that even a single iteration of (24) suffices to obtain improved estimates f̂, in comparison to those obtained from (14).
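One convenient way to run the refinement is to re-solve the weighted problem (22) with the alternating scheme of Algorithm 1, replacing the common threshold λ_1/2 by per-coordinate thresholds λ_0 w_i^{(k)}/2; by the same argument as in Proposition 2, this is equivalent to the reweighted Lasso (24). A minimal sketch (illustrative only), warm-started at ô_Lasso:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def reweighted_am(K, y, mu, lam0, o_init, delta=1e-5, n_rw=4, n_inner=100):
    """Iteratively reweighted refinement (22)-(24): each outer pass linearizes
    the log penalty around the current outlier iterate, then re-solves the
    weighted problem with Algorithm-1-style alternating updates.

    o_init is typically the Lasso solution o_hat from (14)/(7)."""
    N = len(y)
    c, low = cho_factor(K + mu * np.eye(N))
    o = np.asarray(o_init, dtype=float).copy()
    for _ in range(n_rw):
        w = 1.0 / (np.abs(o) + delta)                # weights (23)
        for _ in range(n_inner):                     # weighted counterpart of (7)
            beta = cho_solve((c, low), y - o)
            r = y - K @ beta
            o = np.sign(r) * np.maximum(np.abs(r) - lam0 * w / 2.0, 0.0)
    beta = cho_solve((c, low), y - o)
    return beta, o
```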
The following remark sheds further light on why this should be expected.

Remark 4 (Refinement through bias reduction): Uniformly weighted ℓ1-norm regularized estimators such as (7) are biased [44], due to the shrinkage effected on the estimated coefficients. It will be argued next that the improvements due to (24) can be attributed to bias reduction. Several workarounds have been proposed to correct the bias in sparse regression, that could as well be applied here. A first possibility is to retain only the support of (14) and re-estimate the amplitudes via, e.g., the unbiased LS estimator [11]. An alternative approach to reducing bias is through nonconvex regularization using, e.g., the smoothly clipped absolute deviation (SCAD) scheme [13]. The SCAD penalty could replace the sum of logarithms in (21), still leading to a nonconvex problem. To retain the efficiency of convex optimization solvers while simultaneously limiting the bias, suitably weighted ℓ1-norm regularizers have been proposed instead [44]. The constant weights in [44] play a role similar to those in (23); hence, bias reduction is expected.

V. NUMERICAL EXPERIMENTS

A. Robust thin-plate smoothing splines

To validate the proposed approach to robust nonparametric regression, a simulated test is carried out here in the context of thin-plate smoothing spline approximation [10], [39]. Specializing (7) to this setup, the robust thin-plate splines estimator can be formulated as

\min_{f \in \mathcal{S}, \, \mathbf{o} \in \mathbb{R}^N} \left[ \sum_{i=1}^{N} \left(y_i - f(\mathbf{x}_i) - o_i\right)^2 + \mu \int_{\mathbb{R}^2} \left\| \nabla^2 f \right\|_F^2 \, d\mathbf{x} + \lambda_1 \|\mathbf{o}\|_1 \right]   (25)

where ||∇²f||_F denotes the Frobenius norm of the Hessian of f: R² → R. The penalty functional

J[f] := \int_{\mathbb{R}^2} \left\| \nabla^2 f \right\|_F^2 d\mathbf{x} = \int_{\mathbb{R}^2} \left[ \left( \frac{\partial^2 f}{\partial x_1^2} \right)^2 + 2\left( \frac{\partial^2 f}{\partial x_1 \partial x_2} \right)^2 + \left( \frac{\partial^2 f}{\partial x_2^2} \right)^2 \right] d\mathbf{x}   (26)

extends to R² the one-dimensional roughness regularization used in smoothing spline models. For µ = 0, the (non-unique) estimate in (25) corresponds to a rough function interpolating the outlier-compensated data; while as µ → ∞ the estimate is linear (cf. ∇²f̂(x) ≡ 0_{2×2}). The optimization is over S, the space of Sobolev functions, for which J[f] is well defined [10, p. 85]. Reproducing kernel Hilbert spaces such as S, with inner products (and norms) involving derivatives, are studied in detail in [38]. Different from the cases considered so far, the smoothing penalty in (26) is only a seminorm, since first-order polynomials vanish under J[·]. Omitting details that can be found in [38, p. 30], under fairly general conditions a unique minimizer of (25) exists. The solution admits the finitely parametrized form f̂(x) = Σ_{i=1}^N β_i K(x, x_i) + α'_1 x + α_0, where in this case K(x, y) := ||x − y||² log||x − y|| is a radial basis function. In simple terms, the solution as a kernel expansion is augmented with a member of the null space of J[·]. The unknown parameters {β, α_1, α_0} are obtained in closed form, as solutions to a constrained, regularized LS problem; see [38, p. 33]. As a result, Proposition 2 still holds with minor modifications on the structure of X_µ.
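For reference, the thin-plate radial basis appearing in the kernel expansion above can be evaluated as in the sketch below (with the usual convention K(x, x) = 0); recall that the full estimator also carries the affine term α'_1 x + α_0 spanning the null space of J[·], so this kernel alone does not specify the fit.

```python
import numpy as np

def thin_plate_kernel(X, Y):
    """Thin-plate radial basis K(x, y) = ||x - y||^2 * log ||x - y|| evaluated
    between the rows of X (n x 2) and Y (m x 2); the diagonal limit is 0."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared distances
    with np.errstate(divide="ignore", invalid="ignore"):
        K = 0.5 * d2 * np.log(d2)                          # = d^2 log d
    K[d2 == 0.0] = 0.0
    return K
```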
Remark 5 (Bayesian framework): Adopting a Bayesian perspective, one could model f(x) in (3) as a sample function of a zero-mean Gaussian stationary process, with covariance function K(x, y) = ||x − y||² log||x − y|| [24]. Consider as well that {f(x), {o_i, ε_i}_{i=1}^N} are mutually independent, while ε_i ∼ N(0, µ*/2) and o_i ∼ L(0, µ*/λ*_1) in (3) are i.i.d. Gaussian and Laplace distributed, respectively. From the results in [24] and a straightforward calculation, it follows that setting λ_1 = λ*_1 and µ = µ* in (25) yields estimates f̂ (and ô) which are optimal in a maximum a posteriori sense. This provides yet another means of selecting the parameters µ and λ_1, further expanding the options presented in Section III-B.

The simulation setup is as follows. Noisy samples of the true function f_o: R² → R comprise the training set T. Function f_o is generated as a Gaussian mixture with two components, with respective mean vectors and covariance matrices given by

\boldsymbol{\mu}_1 = \begin{bmatrix} 0.2295 \\ 0.4996 \end{bmatrix}, \quad \boldsymbol{\Sigma}_1 = \begin{bmatrix} 2.2431 & 0.4577 \\ 0.4577 & 1.0037 \end{bmatrix}, \quad \boldsymbol{\mu}_2 = \begin{bmatrix} 2.4566 \\ 2.9461 \end{bmatrix}, \quad \boldsymbol{\Sigma}_2 = \begin{bmatrix} 2.9069 & 0.5236 \\ 0.5236 & 1.7299 \end{bmatrix}.

Function f_o(x) is depicted in Fig. 4(a). The training data set comprises N = 200 examples, with inputs {x_i}_{i=1}^N drawn from a uniform distribution on the square [0, 3] × [0, 3]. Several values ranging from 5% to 25% of the data are generated contaminated with outliers. Without loss of generality, the corrupted data correspond to the first N_o training samples with N_o ∈ {10, 20, 30, 40, 50}, for which the response values {y_i}_{i=1}^{N_o} are independently drawn from a uniform distribution over [−4, 4]. Outlier-free data are generated according to the model y_i = f_o(x_i) + ε_i, where the independent additive noise terms ε_i ∼ N(0, 10^{−3}) are Gaussian distributed, for i = N_o + 1, ..., 200. For the case where N_o = 20, the data used in the experiment are shown in Fig. 2: superimposed on the true function f_o are 180 black points corresponding to data drawn from the nominal model, as well as 20 red outlier points.
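A sketch of generating synthetic data along these lines is given below; the mixture weights and normalization of f_o are not stated in the text, so equal weights and standard bivariate Gaussian densities are assumed here purely for illustration.

```python
import numpy as np

def gaussian_pdf2(X, mean, cov):
    """Bivariate Gaussian density evaluated at the rows of X."""
    d = X - mean
    P = np.linalg.inv(cov)
    q = np.einsum("ni,ij,nj->n", d, P, d)
    return np.exp(-0.5 * q) / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))

def make_training_data(N=200, N_o=20, sigma2=1e-3, seed=0):
    """Synthetic data in the spirit of the thin-plate experiment: a two-component
    Gaussian-mixture surface on [0,3]x[0,3]; the first N_o responses are replaced
    by uniform outliers on [-4, 4]."""
    rng = np.random.default_rng(seed)
    mu1, S1 = np.array([0.2295, 0.4996]), np.array([[2.2431, 0.4577], [0.4577, 1.0037]])
    mu2, S2 = np.array([2.4566, 2.9461]), np.array([[2.9069, 0.5236], [0.5236, 1.7299]])
    X = rng.uniform(0.0, 3.0, size=(N, 2))
    f_o = 0.5 * gaussian_pdf2(X, mu1, S1) + 0.5 * gaussian_pdf2(X, mu2, S2)
    y = f_o + rng.normal(0.0, np.sqrt(sigma2), size=N)
    y[:N_o] = rng.uniform(-4.0, 4.0, size=N_o)      # outlier-contaminated responses
    return X, y, f_o
```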
For this experiment, the nominal noise variance σ²_ε = 10^{−3} is assumed known. A nonuniform grid of µ and λ_1 values is constructed, as described in Section III-B. The relevant parameters are G_µ = G_λ = 200, µ_min = 10^{−9} and µ_max = 1. For each value of µ, the λ_1 grid spans the interval defined by λ_max := 2 max_i |y'X'_µ x_{µ,i}| and λ_min = ελ_max, where ε = 10^{−4}. Each of the G_µ robustification paths corresponding to the solution of (14) is obtained using the SpaRSA toolbox in [41], exploiting warm starts for faster convergence. Fig. 3 depicts an example with N_o = 20 and µ* = 1.55 × 10^{−2}. With the robustification paths at hand, it is possible to form the sample variance matrix Σ̄ [cf. (18)], and select the optimum tuning parameters {µ*, λ*_1} based on the criterion (19). Finally, the robust estimates are refined by running a single iteration of (24) as described in Section IV. The value δ = 10^{−5} was utilized, and several experiments indicated that the results are quite insensitive to the selection of this parameter.

The same experiment was conducted for a variable number of outliers N_o, and the results are listed in Table I. In all cases, a 100% outlier identification success rate was obtained for the chosen value of the tuning parameters. This even happened at the first stage of the method, i.e., ô_Lasso in (14) had the correct support in all cases. It has been observed in some other setups that (14) may select a larger support than [1, N_o], but after running a few iterations of (24) the true support was typically identified.

TABLE I
RESULTS FOR THE THIN-PLATE SPLINES SIMULATED TEST

  N_o   λ*_1            µ*              err̄ for (7)     err̄ for (21)    Err_T for (7)    Err_T for (21)
  10    3.87 × 10^−2    2.90 × 10^−3    1.00 × 10^−4     1.03 × 10^−4     2.37 × 10^−5     2.27 × 10^−5
  20    3.83 × 10^−2    1.55 × 10^−2    1.00 × 10^−4     9.16 × 10^−5     4.27 × 10^−5     2.39 × 10^−5
  30    2.28 × 10^−2    6.67 × 10^−2    1.22 × 10^−4     1.18 × 10^−4     2.89 × 10^−5     1.93 × 10^−5
  40    2.79 × 10^−2    6.10 × 10^−3    1.01 × 10^−4     1.14 × 10^−4     1.57 × 10^−5     1.32 × 10^−5
  50    2.49 × 10^−2    5.42 × 10^−2    1.01 × 10^−4     9.90 × 10^−5     1.19 × 10^−5     1.05 × 10^−5

To assess the quality of the estimated function f̂, two figures of merit were considered. First, the training error err̄ was evaluated as

\overline{\text{err}} = \frac{1}{N - N_o} \sum_{i=N_o+1}^{N} \left( y_i - \hat{f}(\mathbf{x}_i) \right)^2

i.e., the average loss over the training sample T after excluding outliers. Second, to assess the generalization capability of f̂, an approximation to the generalization error Err_T was computed as

\text{Err}_{\mathcal{T}} = E\left[ \left( y - \hat{f}(\mathbf{x}) \right)^2 \,\middle|\, \mathcal{T} \right] \approx \frac{1}{\tilde{N}} \sum_{i=1}^{\tilde{N}} \left( \tilde{y}_i - \hat{f}(\tilde{\mathbf{x}}_i) \right)^2   (27)

where {ỹ_i, x̃_i}_{i=1}^{Ñ} is an independent test set generated from the model ỹ_i = f_o(x̃_i) + ε_i. For the results in Table I, Ñ = 961 was adopted, corresponding to a uniform rectangular grid of 31 × 31 points x̃_i in [0, 3] × [0, 3]. Inspection of Table I reveals that the training errors err̄ are comparable for the function estimates obtained after solving (7) or its nonconvex refinement (21). Interestingly, when it comes to the more pragmatic generalization error Err_T, the refined estimator (21) has an edge for all values of N_o. As expected, the bias reduction effected by the iterative reweighting procedure of Section IV considerably improves the generalization capability of the method; see also Remark 4.

A pictorial summary of the results is given in Fig. 4, for N_o = 20 outliers. Fig. 4(a) depicts the true Gaussian mixture f_o(x), whereas Fig. 4(b) shows the nonrobust thin-plate splines estimate obtained after solving

\min_{f \in \mathcal{S}} \left[ \sum_{i=1}^{N} \left( y_i - f(\mathbf{x}_i) \right)^2 + \mu \int_{\mathbb{R}^2} \left\| \nabla^2 f \right\|_F^2 d\mathbf{x} \right].   (28)

Even though the thin-plate penalty enforces some degree of smoothness, the estimate is severely disrupted by the presence of outliers [cf. the difference in the z-axis ranges]. On the other hand, Figs. 4(c) and (d), respectively, show the robust estimate f̂ with λ*_1 = 3.83 × 10^{−2}, and its bias-reducing refinement. The improvement is apparent, corroborating the effectiveness of the proposed approach.
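The two figures of merit are straightforward to compute once f̂ can be evaluated at arbitrary inputs; a minimal sketch, exploiting that the contaminated samples are the first N_o by construction:

```python
import numpy as np

def training_error(y, f_hat_vals, N_o):
    """Average squared loss over the outlier-free part of the training set."""
    r = y[N_o:] - f_hat_vals[N_o:]
    return np.mean(r ** 2)

def generalization_error(y_test, f_hat_test_vals):
    """Monte Carlo approximation (27) of Err_T on an independent test set."""
    return np.mean((y_test - f_hat_test_vals) ** 2)
```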
B. Sinc function estimation

The univariate function sinc(x) := sin(πx)/(πx) is commonly adopted to evaluate the performance of nonparametric regression methods [8], [43]. Given noisy training examples with a small fraction of outliers, approximating sinc(x) over the interval [−5, 5] is considered in the present simulated test. The sparsity-controlling robust nonparametric regression methods of this paper are compared with the SVR [37] and robust SVR (RSVR) in [8], for the case of the ε-insensitive loss function with values ε = 0.1 and ε = 0.01. In order to implement (R)SVR, routines from a publicly available SVM Matlab toolbox were utilized [18]. Results for the nonrobust regularization network approach in (1) (with V(u) = u²) are reported as well, to assess the performance degradation incurred when compared to the aforementioned robust alternatives.

Because the fraction of outliers (N_o/N) in the training data is assumed known to the method of [8], the same will be assumed towards selecting the tuning parameters λ_1 and µ in (7), as described in Section III-B. The {µ, λ_1}-grid parameters selected for the experiment in Section V-A were used here as well, except for µ_min = 10^{−5}. Space H is chosen to be the RKHS induced by the positive definite Gaussian kernel function K(u, v) = exp(−(u − v)²/(2η²)), with parameter η = 0.1 for all cases. The training set comprises N = 50 examples, with scalar inputs {x_i}_{i=1}^N drawn from a uniform distribution over [−5, 5]. Uniformly distributed outliers {y_i}_{i=1}^{N_o} ∼ U[−5, 5] are artificially added in T, with N_o = 3 resulting in 6% contamination. Nominal data in T adhere to the model y_i = sinc(x_i) + ε_i for i = N_o + 1, ..., N, where the independent additive noise terms ε_i are zero-mean Gaussian distributed. Three different values are considered for the nominal noise variance, namely σ²_ε = 1 × 10^{−l} for l = 2, 3, 4. For the case where σ²_ε = 1 × 10^{−4}, the data used in the experiment are shown in Fig. 5(a). Superimposed on the true function sinc(x) (shown in blue) are 47 black points corresponding to the noisy data obeying the nominal model, as well as 3 outliers depicted as red points.

The results are summarized in Table II, which lists the generalization errors Err_T attained by the different methods tested, and for varying σ²_ε. The independent test set {ỹ_i, x̃_i}_{i=1}^{Ñ} used to evaluate (27) was generated from the model ỹ_i = sinc(x̃_i) + ε_i, where the x̃_i define an Ñ = 101-element uniform grid over [−5, 5].

TABLE II
GENERALIZATION ERROR (Err_T) RESULTS FOR THE SINC FUNCTION ESTIMATION EXPERIMENT

  Method                              σ²_ε = 1 × 10^−4    σ²_ε = 1 × 10^−3    σ²_ε = 1 × 10^−2
  Nonrobust [(1) with V(u) = u²]      5.67 × 10^−2        8.28 × 10^−2        1.13 × 10^−1
  SVR with ε = 0.1                    5.00 × 10^−3        6.42 × 10^−4        6.15 × 10^−3
  RSVR with ε = 0.1                   1.10 × 10^−3        5.10 × 10^−4        4.47 × 10^−3
  SVR with ε = 0.01                   8.24 × 10^−5        4.79 × 10^−4        5.60 × 10^−3
  RSVR with ε = 0.01                  7.75 × 10^−5        3.90 × 10^−4        3.32 × 10^−3
  Sparsity-controlling in (7)         1.47 × 10^−4        6.56 × 10^−4        4.60 × 10^−3
  Refinement in (21)                  7.46 × 10^−5        3.59 × 10^−4        3.21 × 10^−3

A first (expected) observation is that all robust alternatives markedly outperform the nonrobust regularization network approach in (1), by an order of magnitude or even more, regardless of the value of σ²_ε. As reported in [8], RSVR uniformly outperforms SVR. For the case ε = 0.01, RSVR also uniformly outperforms the sparsity-controlling method in (7).
Interestingly, after refining the estimate obtained via (7) through a couple of iterations of (24) (cf. Section IV), the lowest generalization errors are obtained, uniformly across all simulated values of the nominal noise variance. Results for the RSVR with ε = 0.01 come sufficiently close, and are equally satisfactory for all practical purposes; see also Fig. 5 for a pictorial summary of the results when σ²_ε = 1 × 10^{−4}. While specific error values or method rankings are arguably anecdotal, two conclusions stand out: (i) model (3) and its sparsity-controlling estimators (7) and (21) are effective approaches to nonparametric regression in the presence of outliers; and (ii) when initialized with ô_Lasso, the refined estimator (21) can considerably improve the performance of (7), at the price of a modest increase in computational complexity. While (7) endowed with the sparsity-controlling mechanisms of Section III-B tends to overestimate the "true" support of o, numerical results have consistently shown that the refinement in Section IV is more effective when it comes to support recovery.

C. Load curve data cleansing

In this section, the robust nonparametric methods described so far are applied to the problem of load curve cleansing outlined in Section I. Given load data T := {y_i, t_i}_{i=1}^N corresponding to a building's power consumption measurements y_i, acquired at time instants t_i, i = 1, ..., N, the proposed approach to load curve cleansing solves

\min_{f \in \mathcal{S}, \, \mathbf{o} \in \mathbb{R}^N} \left[ \sum_{i=1}^{N} \left( y_i - f(t_i) - o_i \right)^2 + \mu \int_{\mathbb{R}} \left[ f''(t) \right]^2 dt + \lambda_1 \|\mathbf{o}\|_1 \right]   (29)

where f''(t) denotes the second-order derivative of f: R → R. This way, the solution f̂ provides a cleansed estimate of the load profile, and the support of ô indicates the instants where significant load deviations, or meter failures, occurred. Estimator (29) specializes (7) to the so-termed cubic smoothing splines; see e.g., [19], [38]. It is also subsumed as a special case of the robust thin-plate splines estimator (25), when the target function f has domain in R [cf. how the smoothing penalty (26) simplifies to the one in (29) in the one-dimensional case].

In light of the aforementioned connection, it should not be surprising that (29) admits a unique, finite-dimensional minimizer, which corresponds to a natural spline with knots at {t_i}_{i=1}^N; see e.g., [19, p. 151]. Specifically, it follows that f̂(t) = Σ_{i=1}^N θ̂_i b_i(t), where {b_i(t)}_{i=1}^N is the basis set of natural spline functions, and the vector of expansion coefficients θ̂ := [θ̂_1, ..., θ̂_N]' is given by

\hat{\boldsymbol{\theta}} = \left( \mathbf{B}'\mathbf{B} + \mu \boldsymbol{\Psi} \right)^{-1} \mathbf{B}' (\mathbf{y} - \hat{\mathbf{o}})

where matrix B ∈ R^{N×N} has ij-th entry [B]_ij = b_j(t_i), while Ψ ∈ R^{N×N} has ij-th entry [Ψ]_ij = ∫ b''_i(t) b''_j(t) dt. Spline coefficients can be computed more efficiently if the basis of B-splines is adopted instead; details can be found in [19, p. 189] and [36].

Without considering the outlier variables in (29), a B-spline estimator for load curve cleansing was put forth in [6]. An alternative Nadaraya-Watson estimator from the kernel smoothing family was considered as well. In either case, outliers are identified during a post-processing stage, after the load curve has been estimated nonrobustly. Supposing for instance that the approach in [6] correctly identifies outliers most of the time, it still does not yield a cleansed estimate f̂. This should be contrasted with the estimator (29), which accounts for the outlier-compensated data to yield a cleansed estimate at once. Moreover, to select the "optimum" smoothing parameter µ, the approach of [6] requires the user to manually label the outliers present in a training subset of data, during a pre-processing stage. This subjective component makes it challenging to reproduce the results of [6], and for this reason comparisons with the aforementioned scheme are not included in the sequel.
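As an illustration only, the sketch below mimics (29) on a uniform hourly grid using a second-difference (Whittaker-type) roughness penalty in place of the exact natural-spline machinery (basis B and penalty Ψ) described above; the alternating updates parallel Algorithm 1, and all names are hypothetical.

```python
import numpy as np

def cleanse_load_curve(y, mu, lam1, n_iter=200):
    """Discrete stand-in for the load-curve cleanser (29) on a uniform time grid:
    minimize ||y - f - o||^2 + mu*||D2 f||^2 + lam1*||o||_1 by alternating a
    linear solve for f with soft-thresholding for o."""
    N = len(y)
    D2 = np.diff(np.eye(N), n=2, axis=0)      # (N-2) x N second-difference operator
    A = np.eye(N) + mu * D2.T @ D2            # system matrix for the f-update
    o = np.zeros(N)
    for _ in range(n_iter):
        f = np.linalg.solve(A, y - o)         # smooth fit to outlier-compensated data
        r = y - f
        o = np.sign(r) * np.maximum(np.abs(r) - lam1 / 2.0, 0.0)
    outliers = np.flatnonzero(o)              # support of o flags suspect readings
    return f, o, outliers
```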
Supposing for instance that the approach in [6] correctly identifies outliers most of the time, it still does not yield a cleansed estimate f̂. This should be contrasted with the estimator (29), which accounts for the outlier-compensated data to yield a cleansed estimate at once. Moreover, to select the "optimum" smoothing parameter µ, the approach of [6] requires the user to manually label the outliers present in a training subset of the data, during a pre-processing stage. This subjective component makes it challenging to reproduce the results of [6], and for this reason comparisons with the aforementioned scheme are not included in the sequel.

Next, estimator (29) is tested on real load curve data provided by the NorthWrite Energy Group. The dataset consists of power consumption measurements (in kWh) for a government building, collected every fifteen minutes over a period of more than five years, ranging from July 2005 to October 2010. The data are downsampled by a factor of four, to yield one measurement per hour. For the present experiment, only a subset of the whole dataset is utilized for concreteness, with N = 501 chosen to correspond to a 501-hour period. A snapshot of this training load curve data in T, spanning a particular three-week period, is shown in Fig. 6(a). Weekday activity patterns can be clearly discerned from those corresponding to weekends, as expected for most government buildings; this would differ, e.g., for the load profile of a grocery store. Fig. 6(b) shows the nonrobust smoothing spline fit to the training data in T (also shown for comparison purposes), obtained after solving

$$\min_{f \in \mathcal{S}} \left[ \sum_{i=1}^{N} \big(y_i - f(t_i)\big)^2 + \mu \int_{\mathbb{R}} \left[f''(t)\right]^2 dt \right] \tag{30}$$

using Matlab's built-in spline toolbox. Parameter µ was chosen based on leave-one-out cross-validation, and it is apparent that no cleansing of the load profile takes place. Indeed, the resulting fitted function follows the training data very closely, even during the abnormal energy peaks observed on the so-termed "building operational transition shoulder periods."

Because with real load curve data the nominal noise variance σ²_ε in (3) is unknown, selection of the tuning parameters {µ, λ_1} in (29) requires a robust estimate σ̂²_ε of the variance, such as the MAD [cf. Section III-B]. Similar to [6], it is assumed that the nominal errors are zero-mean Gaussian distributed, so that (20) can be applied, yielding the value σ̂²_ε = 0.6964. To form the residuals in (20), (30) is solved first using a small subset of T that comprises 126 measurements. A nonuniform grid of µ and λ_1 values is constructed, as described in Section III-B. Relevant parameters are G_µ = 100, G_λ = 200, µ_min = 10⁻³, µ_max = 10, and ǫ = 10⁻⁴. The robustification paths (one per µ value in the grid) were obtained using the SpaRSA toolbox [41], with the sample variance matrix Σ̄ formed as in (18). The optimum tuning parameters µ* = 1.637 and λ*_1 = 3.6841 are finally determined based on the criterion (19), where the unknown σ²_ε is replaced by σ̂²_ε. Finally, the cleansed load curve is refined by running four iterations of (24) as described in Section IV, with a value of δ = 10⁻⁵. Results are depicted in Fig. 7, where the cleansed load curves are superimposed on the training data in T.
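Since (18)-(20) are not reproduced in this excerpt, the following sketch only illustrates the generic MAD-based route to a robust variance estimate under the Gaussian nominal assumption; the constant 1.4826 is the standard Gaussian consistency factor, and the exact expression used in the paper's (20) may differ.

```python
# Sketch of a MAD-based robust estimate of the nominal noise variance from the
# residuals of a nonrobust pilot fit [cf. solving (30) on a small subset of T].
# The Gaussian consistency factor 1.4826 is assumed here; the paper's (20) is
# not shown in this excerpt and may take a slightly different form.
import numpy as np

def mad_variance(residuals):
    """Robust variance estimate sigma_hat^2 via the median absolute deviation."""
    r = np.asarray(residuals, dtype=float)
    mad = np.median(np.abs(r - np.median(r)))
    return (1.4826 * mad) ** 2

# Hypothetical usage with residuals from the smoothing-spline pilot fit on 126 samples:
# sigma2_hat = mad_variance(y_pilot - f_pilot)
```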
Red circles indicate those data points deemed outliers, information that is readily obtained from the support of ô. By inspection of Fig. 7, it is apparent that the proposed sparsity-controlling estimator has the desired cleansing capability. The cleansed load curves closely follow the training data, but are smooth enough to avoid overfitting the abnormal energy peaks on the "shoulders." Indeed, these peaks are in most cases identified as outliers. As seen from Fig. 7(a), the solution of (29) tends to overestimate the support of o, since one could argue that some of the red circles in Fig. 7(a) do not correspond to outliers. Again, the nonconvex regularization in Section IV prunes the outlier support obtained via (29), resulting in a more accurate result and reducing the number of identified outliers from 77 to 41.

VI. CONCLUDING SUMMARY

Outlier-robust nonparametric regression methods were developed in this paper for function approximation in RKHS. Building on a neat link between the seemingly unrelated fields of robust statistics and sparse regression, the novel estimators were found rooted at the crossroads of outlier-resilient estimation, the Lasso, and convex optimization. Estimators as fundamental as LS for linear regression, regularization networks, and (thin-plate) smoothing splines can be robustified under the proposed framework. Training samples from the (unknown) target function were assumed generated from a regression model which explicitly incorporates an unknown sparse vector of outliers. To fit such a model, the proposed variational estimator minimizes a tradeoff between fidelity to the training data, the degree of "smoothness" of the regression function, and the sparsity level of the vector of outliers. While model complexity control effected through a smoothing penalty has quite well understood ramifications in terms of generalization capability, the major innovative claim here is that sparsity control is tantamount to robustness control. This is indeed the case since a tunable parameter in a Lasso reformulation of the variational estimator controls the degree of sparsity in the estimated vector of model outliers. Selection of tuning parameters could at first be thought of as a mundane task. However, arguing for the importance of this task in the context of robust nonparametric regression, as well as devising principled methods to effectively carry out smoothness and sparsity control, are at the heart of this paper's novelty. Sparsity control can be carried out at affordable complexity, by capitalizing on state-of-the-art algorithms that can efficiently compute the whole path of Lasso solutions. In this sense, the method here capitalizes on, but is not limited to, sparse settings where few outliers are present, since one can efficiently examine the gamut of sparsity levels along the robustification path. Computer simulations have shown that the novel methods of this paper outperform existing alternatives including SVR and one of its robust variants. As an application domain relevant to robust nonparametric regression, the problem of load curve cleansing for power systems engineering was also considered, along with a proposed solution based on robust cubic spline smoothing.
Numerical tests on real load curve data demonstrated that the smoothness- and sparsity-controlling methods of this paper are effective in cleansing load profiles, without user intervention to aid the learning process.

ACKNOWLEDGMENT

The authors would like to thank NorthWrite Energy Group and Prof. Vladimir Cherkassky (Dept. of ECE, University of Minnesota) for providing the load curve data analyzed in Section V-C.

APPENDIX

Towards establishing the equivalence between problems (7) and (8), consider the pair {f̂, ô} that solves (7). Assume that f̂ is given, and the goal is to determine ô. Upon defining the residuals r̂_i := y_i − f̂(x_i), and because ||o||_1 = Σ_{i=1}^N |o_i|, the entries of ô are separately given by

$$\hat{o}_i := \arg\min_{o_i \in \mathbb{R}} \ (\hat{r}_i - o_i)^2 + \lambda_1 |o_i|, \qquad i = 1, \ldots, N, \tag{31}$$

where the term µ||f̂||²_H in (7) has been omitted, since it is inconsequential for the minimization with respect to o. For each i = 1, ..., N, because (31) is nondifferentiable at the origin, one should consider three cases: i) if ô_i = 0, it follows that the minimum cost in (31) is r̂_i²; ii) if ô_i > 0, the first-order condition for optimality gives ô_i = r̂_i − λ_1/2 provided r̂_i > λ_1/2, and the minimum cost is λ_1 r̂_i − λ_1²/4; otherwise, iii) if ô_i < 0, it follows that ô_i = r̂_i + λ_1/2 provided r̂_i < −λ_1/2, and the minimum cost is −λ_1 r̂_i − λ_1²/4. In other words,

$$\hat{o}_i = \begin{cases} \hat{r}_i - \lambda_1/2, & \hat{r}_i > \lambda_1/2 \\ 0, & |\hat{r}_i| \le \lambda_1/2 \\ \hat{r}_i + \lambda_1/2, & \hat{r}_i < -\lambda_1/2 \end{cases}, \qquad i = 1, \ldots, N. \tag{32}$$

Upon plugging (32) into (31), the minimum cost in (31) after minimizing with respect to o_i is ρ(r̂_i) [cf. (9) and the argument preceding (32)]. All in all, the conclusion is that f̂ is the minimizer of (8), in addition to being the solution of (7) by definition, which completes the proof.

REFERENCES

[1] Energy Independence and Security Act of 2007, an Act of the Congress of the United States of America, Publ. L. No. 110-140, H.R. 6, Dec. 2007.
[2] The Smart Grid: An Introduction, United States Department of Energy, Office of Electricity Delivery and Energy Reliability, Jan. 2010. [Online]. Available: http://www.oe.energy.gov/1165.htm
[3] D. P. Bertsekas, Nonlinear Programming, 2nd ed. Athena Scientific, 1999.
[4] E. J. Candes and P. A. Randall, "Highly robust error correction by convex programming," IEEE Trans. Inf. Theory, vol. 54, no. 7, pp. 2829–2840, 2008.
[5] E. J. Candes, M. B. Wakin, and S. Boyd, "Enhancing sparsity by reweighted ℓ1 minimization," Journal of Fourier Analysis and Applications, vol. 14, pp. 877–905, Dec. 2008.
[6] J. Chen, W. Li, A. Lau, J. Cao, and K. Wang, "Automated load curve data cleansing in power systems," IEEE Trans. Smart Grid, vol. 1, pp. 213–221, Sep. 2010.
[7] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal on Scientific Computing, vol. 20, pp. 33–61, 1998.
[8] C. C. Chuang, S. F. Fu, J. T. Jeng, and C. C. Hsiao, "Robust support vector regression networks for function approximation with outliers," IEEE Trans. Neural Netw., vol. 13, pp. 1322–1330, Jun. 2002.
[9] D. D. Cox, "Asymptotics for M-type smoothing splines," Ann. Statist., vol. 11, pp. 530–551, 1983.
[10] J. Duchon, Splines Minimizing Rotation-Invariant Semi-norms in Sobolev Spaces. Springer-Verlag, 1977.
[11] B. Efron, T. Hastie, I. M. Johnstone, and R. Tibshirani, "Least angle regression," Ann. Statist., vol. 32, pp. 407–499, 2004.
[12] T. Evgeniou, M. Pontil, and T. Poggio, "Regularization networks and support vector machines," Advances in Computational Mathematics, vol. 13, pp. 1–50, 2000.
[13] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," J. Amer. Stat. Assoc., vol. 96, pp. 1348–1360, 2001.
[14] M. Fazel, "Matrix rank minimization with applications," Ph.D. dissertation, Stanford University, 2002.
[15] J. Friedman, T. Hastie, H. Hofling, and R. Tibshirani, "Pathwise coordinate optimization," Ann. Appl. Stat., vol. 1, pp. 302–332, 2007.
[16] J. Friedman, T. Hastie, and R. Tibshirani, "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, vol. 33, 2010.
[17] J. J. Fuchs, "An inverse problem approach to robust regression," in Proc. of Intl. Conf. on Acoustics, Speech and Signal Processing, Phoenix, AZ, Mar. 1999, pp. 180–188.
[18] S. R. Gunn, Matlab SVM Toolbox, 1997. [Online]. Available: http://www.isis.ecs.soton.ac.uk/resources/svminfo/
[19] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed. Springer, 2009.
[20] S. G. Hauser, "Vision for the smart grid," presented at the U.S. Department of Energy Smart Grid R&D Roundtable Meeting, Dec. 9, 2009.
[21] P. J. Huber and E. M. Ronchetti, Robust Statistics. New York: Wiley, 2009.
[22] Y. Jin and B. D. Rao, "Algorithms for robust linear regression by exploiting the connection to sparse signal recovery," in Proc. of Intl. Conf. on Acoustics, Speech and Signal Processing, Dallas, TX, Mar. 2010, pp. 3830–3833.
[23] V. Kekatos and G. B. Giannakis, "From sparse signals to sparse residuals for robust sensing," IEEE Trans. Signal Process., vol. 59, 2011 (to appear); http://spincom.umn.edu/journal.html.
[24] G. Kimeldorf and G. Wahba, "A correspondence between Bayesian estimation on stochastic processes and smoothing by splines," Ann. Math. Statist., vol. 41, pp. 495–502, 1970.
[25] K. Lange, D. Hunter, and I. Yang, "Optimization transfer using surrogate objective functions (with discussion)," J. Computat. Graph. Statist., vol. 9, pp. 1–59, 2000.
[26] Y. J. Lee, W. F. Hsieh, and C. M. Huang, "ǫ-SSVR: A smooth support vector machine for ǫ-insensitive regression," IEEE Trans. Knowl. Data Eng., vol. 17, pp. 678–685, 2005.
[27] O. L. Mangasarian and D. R. Musicant, "Robust linear and support vector regression," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, pp. 950–955, Sep. 2000.
[28] S. Mukherjee, E. Osuna, and F. Girosi, "Nonlinear prediction of chaotic time series using a support vector machine," in Proc. of Wrkshp. Neural Networks for Signal Proces., Amelia Island, FL, 1997, pp. 24–26.
[29] T. Poggio and F. Girosi, "A theory of networks for approximation and learning," A.I. Memo No. 1140, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1989.
[30] P. J. Rousseeuw and K. V. Driessen, "Computing LTS regression for large data sets," Data Mining and Knowledge Discovery, vol. 12, pp. 29–45, 2006.
[31] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. New York: Wiley, 1987.
[32] A. J. Smola and B. Scholkopf, "A tutorial on support vector regression," NeuroCOLT Technical Report TR-1998-030, Royal Holloway College, London, 1998.
[33] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Royal Statist. Soc. B, vol. 58, pp. 267–288, 1996.
[34] A. N. Tikhonov and V. Y. Arsenin, Solutions of Ill-posed Problems. Washington, DC: W. H. Winston, 1977.
[35] J. Tropp, "Just relax: Convex programming methods for identifying sparse signals," IEEE Trans. Inf. Theory, vol. 51, pp. 1030–1051, Mar. 2006.
[36] M. Unser, "Splines: A perfect fit for signal and image processing," IEEE Signal Processing Magazine, vol. 16, pp. 22–38, Nov. 1999.
[37] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[38] G. Wahba, Spline Models for Observational Data. Philadelphia: SIAM, 1990.
[39] G. Wahba and J. Wendelberger, "Some new mathematical methods for variational objective analysis using splines and cross validation," Monthly Weather Review, vol. 108, pp. 1122–1145, 1980.
[40] J. Wright and Y. Ma, "Dense error correction via ℓ1-minimization," IEEE Trans. Inf. Theory, vol. 56, no. 7, pp. 3540–3560, 2010.
[41] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, "Sparse reconstruction by separable approximation," IEEE Trans. Signal Process., vol. 57, pp. 2479–2493, 2009.
[42] T. Wu and K. Lange, "Coordinate descent algorithms for lasso penalized regression," Ann. Appl. Stat., vol. 2, pp. 224–244, 2008.
[43] J. Zhu, S. C. H. Hoi, and M. R. T. Lyu, "Robust regularized kernel regression," IEEE Trans. Syst., Man, Cybern. B Cybern., vol. 38, pp. 1639–1644, Dec. 2008.
[44] H. Zou, "The adaptive Lasso and its oracle properties," J. Amer. Stat. Assoc., vol. 101, no. 476, pp. 1418–1429, 2006.

Fig. 1. Example of load curve data with outliers. [Axes: time (hours) vs. power consumption (kWh).]

Fig. 2. True Gaussian mixture function f_o(x), and its 180 noisy samples taken over [0, 3] × [0, 3] shown as black dots. The red dots indicate the N_o = 20 outliers in the training data set T. The green points indicate the predicted responses ŷ_i at the sampling points x_i, from the estimate f̂ obtained after solving (25). Note how all green points are close to the surface f_o.

Fig. 3. Robustification path with optimum smoothing parameter µ* = 1.55 × 10⁻². The data is corrupted with N_o = 20 outliers. The coefficients ô_i corresponding to the outliers are shown in red, while the rest are shown in blue. The vertical line indicates the selection of λ*_1 = 3.83 × 10⁻², and shows that the outliers were correctly identified. [Axes: λ_1 (log scale) vs. coefficients o_i.]

Fig. 4. Robust estimation of a Gaussian mixture using thin-plate splines. The data is corrupted with N_o = 20 outliers.
(a) True function f_o(x); (b) nonrobust predicted function obtained after solving (28); (c) predicted function after solving (25) with the optimum tuning parameters; (d) refined predicted function using the nonconvex regularization in (21).

Fig. 5. Robust estimation of the sinc function. The data is corrupted with N_o = 3 outliers, and the nominal noise variance is σ²_ε = 1 × 10⁻⁴. (a) Noisy training data and outliers; (b) predicted values obtained after solving (1) with V(u) = u²; (c) SVR predictions for ǫ = 0.1; (d) RSVR predictions for ǫ = 0.1; (e) SVR predictions for ǫ = 0.01; (f) RSVR predictions for ǫ = 0.01; (g) predicted values obtained after solving (7); (h) refined predictions using the nonconvex regularization in (21).

Fig. 6. Load curve data cleansing. (a) Noisy training data and outliers; (b) fitted load profile obtained after solving (30). [Axes: time (hours) vs. power consumption (kWh).]

Fig. 7. Load curve data cleansing. (a) Cleansed load profile obtained after solving (29); (b) refined load profile obtained after using the nonconvex regularization in (21). [Axes: time (hours) vs. power consumption (kWh).]