On the complexity of piecewise affine system identification

On the complexit y of piecewise aﬃne system iden tiﬁcation F abien Lauer Univ ersit ´ e de Lorraine, CNRS, LORIA, UMR 7503, F-54506 V andœuvre-l` es-Nancy , F rance July 23, 2018 Abstract The pap er provides results regarding the computational complexity of hybrid system iden- tiﬁcation. More precisely , w e focus on the estimation of piecewise aﬃne (PW A) maps from input-output data and analyze the complexit y of computing a global minimizer of the error. Previous w ork show ed that a global solution could be obtained for contin uous PW A maps with a worst-case complexity exponential in the n umber of data. In this pap er, we show how global optimalit y can b e reac hed for a slightly more general class of possibly discontin uous PW A maps with a complexity only p olynomial in the num b er of data, how ever with an exp onen tial complexit y with resp ect to the data dimension. This result is obtained via an analysis of the in trinsic classiﬁcation subproblem of asso ciating the data p oin ts to the diﬀeren t mo des. In addition, we prov e that the problem is NP-hard, and thus that the exp onen tial complexity in the dimension is a natural exp ectation for any exact algorithm. 1 In tro duction Hybrid system identiﬁcation aims at estimating a mo del of a system switching b et ween diﬀerent op erating mo des from input-output data. More precisely , most of the literature considers autore- gressiv e with external input (ARX) mo dels to cast the problem as a regression one [1]. Then, t w o cases can be distinguished: switc hing regression, where the system arbitrarily switc hes from one mo de to another, and piecewise aﬃne (PW A) regression, where the switc hes dep end on the regressors. A n um b er of metho ds with satisfactory performance in practice are now a v ailable for these problems [2]. Ho wev er, compared with linear system iden tiﬁcation, a ma jor w eakness of these metho ds is their lack of guaran tees. F or the particular case of noiseless data, the algebraic metho d [3] pro vides a solution to switching regression with a small n umber of mo des. How ever, the qualit y of the estimates quic kly degrades with the increase of the noise level. A few sparsity-based methods [4, 5] also oﬀer guarantees in the noiseless case, but these are sub ject to a condition on b oth the data and the sought solution. In the presence of noise, most methods consider the minimization of the error of the mo del ov er the data [1]. While this do es not necessarily yields the b est predictive mo del (due to issues like iden tiﬁability , p ersistence of excitation and access to a limited amoun t of data), obtaining statistical guaran tees with such an approach has a long history in statistics and system identiﬁcation [6]. How ever, such results are not av ailable for hybrid systems. This is probably due to the fact that minimizing the error of a h ybrid model is a diﬃcult noncon vex optimization problem inv olving the simultaneous classiﬁcation of the data p oin ts into mo des and the regression of a submo del for each mo de. Thus, theoretical guaran tees could only b e obtained under the rather strong assumption that this problem has been solv ed to global optimalit y and most of the literature [7, 8, 9, 10, 11, 12] fo cuses on this issue with heuristics of v arious degrees of accuracy and computational eﬃciency . Many recent w orks [4, 13, 14, 15, 16, 5] try to av oid lo cal minima b y considering conv ex formulations, but these only yield optimalit y with respect to a relaxation of the original problem. Global optimality in the presence of noise was only reac hed in [17] for a particular class of con tinuous PW A maps kno wn as hinging-hyperplanes by reformulating the problem as a mixed-in teger program solved by 1 branc h-and-b ound tec hniques. How ever, suc h optimization problems are NP-hard [18] and branch- and-b ound algorithms ha v e a w orst-case complexity exponential in the n um b er of in teger v ariables, here prop ortional to the n umber of data and the num b er of modes. Inspired b y related clustering problems, suc h as the minimization of the sum of squared distances b et ween p oin ts and their group centers, we could minimize the hybrid mo del error by en umerating all p ossible classiﬁcations of the p oin ts. But the n umber of classiﬁcations is exp onen tial in the n um b er of data. Con v ersely , the other approach enumerating a sample of v alues for the real v ariables of the problem is exp onen tial in the dimension and can only oﬀer an appro ximate solution. Ov erall, the literature do es not pro vide a metho d that can guaran tee both the optimality and the computabilit y of a global minimizer of the error, while the computational complexity of this problem remains unknown and cannot b e deduced from the NP-hardness of classical clustering problems [19] (see [18, 20] for an introduction to computational complexit y and its relev ance to con trol theory). Con tribution The pap er pro vides t wo results regarding the computational complexity of PW A regression, and more precisely for the problem of ﬁnding a global minimizer of the error of a PW A mo del, formalized in Sect. 2. First, w e sho w in Sect. 3 that the problem is NP-hard. Then, w e show in Sect. 4 that, for an y ﬁxed dimension of the data, an exact solution can b e computed in time p olynomial in the n umber of data via an en umeration of all possible classiﬁcations. T o obtain this result and av oid the exp onen tial gro wth of the n um b er of classiﬁcations with the n umber of data, w e sho w that, in PW A regression, the classiﬁcation of the data points is highly constrained and the num b er of classiﬁcations to test can b e limited. The price to pa y for this gain is an exp onen tial complexit y with resp ect to the data dimension and the n umber of mo des. F uture work is outlined in Sect. 5. Notations W e use the indicator function 1 E of an even t E that is 1 if the even t o ccurs and 0 otherwise. W e deﬁne sign( u ) = 1 if u ≥ 0 and − 1 otherwise. Given a set of lab els Q ⊂ Z and a set of N p oin ts, a lab eling of these p oints is any q ∈ Q N . W e use j = argmax k ∈Q u ( k ) as a shorthand for j = min { l ∈ argmax k ∈Q u ( k ) } . Given tw o sets, X and Y , Y X is the set of functions from X to Y . 2 Problem formulation As in most w orks, we concen trate on discrete-time PW ARX system iden tiﬁcation considered as a PW A regression problem with regression vectors x i = [ y i − 1 , . . . , y i − n y , u i , . . . , u i − n u ] T ∈ X built from past inputs u i − k and outputs y i − k . Since we are interested in computational complexity results, w e restrain the data to rational, digitally representable, v alues and set X ⊆ Q d . The outputs are assumed to b e generated b y a PW A system f as y i = f ( x i ) + v i , where v i is a noise term. More precisely , PW A mo dels can b e expressed via a set of n aﬃne submo dels and a function h : X → Q = { 1 , . . . , n } determining the activ e submo del: f ( x ) = w T h ( x ) x , where x = [ x T , 1] T . W e call the function h a classiﬁer as it classiﬁes the data p oin ts in the diﬀeren t mo des. Typically , PW A systems are deﬁned with h implementing a p olyhedral partition of X , with mo des possibly spanning unions of polyhedra. Ho wev er, in most of the literature on PW A system identiﬁcation [1, 7, 8, 9, 16], h is estimated within the family of linear classiﬁers H = { h ∈ Q X : h ( x ) = argmax k ∈Q h T k x + b k , h k ∈ Q d , b k ∈ Q } , (1) based on a set of n linear functions and for whic h a mo de spanning a union of polyhedra must b e mo deled as sev eral mo des with similar aﬃne submo dels. F or PW A maps with n = 2 modes, h is a binary classiﬁer for which it is common to consider its output in Q = {− 1 , +1 } instead of { 1 , 2 } . Such a binary classiﬁer can b e obtained b y taking the sign of a real-v alued function. If this function is linear (or aﬃne), then w e obtain a linear classiﬁer, whic h is equiv alent to a separating h yp erplane dividing the input space X in t wo half-spaces. In this case, the function class H can b e deﬁned as H = { h ∈ Q X : h ( x ) = sign( h T x + b ) , h ∈ Q d , b ∈ Q } (2) 2 with a single set of parameters ( h , b ) corresp onding to the normal to the hyperplane and the oﬀset from the origin. An equiv alence with the m ulti-class form ulation in (1) is obtained b y using h = h 1 − h 2 and b = b 1 − b 2 . In this pap er, we consider the common estimation approach of minimizing the error on N data pairs ( x i , y i ) ∈ X × Q , measured p oin twise by a loss function ` : Q → Q + as ` ( y i − f ( x i )) = X j ∈Q 1 h ( x i )= j ` ( y i − w T j x i ) . More precisely , we foc us on well-posed instances of the problem where N is signiﬁcan tly larger than the dimension d and the num b er of mo des n is giv en. Indeed, with free n the problem is ill-p osed as the solution is only deﬁned up to a trade-oﬀ b et ween the num b er of mo des and the mo del accuracy . F or the conv erse well-posed approach that minimizes n for a given error bound, a complexity analysis can be found in [21]. Under these assumptions, the problem is as follows. Problem 1 (Error-minimizing PW A regression) . Given a data set { ( x i , y i ) } N i =1 ∈ ( X × Q ) N with X ⊆ Q d and an inte ger n ∈ [2 , N / ( d + 1)] , ﬁnd a glob al solution to min w ∈ Q n ( d +1) ,h ∈H 1 N N X i =1 X j ∈Q 1 h ( x i )= j ` ( y i − w T j x i ) , (3) wher e w = ( w j ) j ∈Q is the c onc atenation of al l p ar ameter ve ctors and H ⊂ Q X is the set of n -c ate gory line ar classiﬁers as in (1) or (2) . The follo wing analyzes the time complexit y of Problem 1 under the classical mo del of com- putation kno wn as a T uring machine [18]. The time complexit y of a problem is the low est time complexit y of an algorithm solving any instance of that problem, where the time complexity of an algorithm is the maximal num b er of steps o ccuring in the computation of the corresp onding T uring machine program. The loss function ` is assumed to b e computable in p olynomial time throughout the pap er. 3 NP-hardness This section con tains the pro of of the following NP-hardness result, where an NP-hard problem is one that is at least as hard as any problem from the class NP of nondeterministic p olynomial time decision problems [18] (NP is the class of all decision problems for whic h a solution can b e certiﬁed in p olynomial time). Theorem 1. With a loss function ` such that ` ( e ) = 0 ⇔ e = 0 , Pr oblem 1 is NP-har d. The pro of uses a reduction from the partition problem, kno wn to b e NP-complete [18], i.e., a problem that is both NP-hard and in NP . Problem 2 (P artition) . Given a multiset (a set with p ossibly multiple instanc es of its elements) of d p ositive inte gers, S = { s 1 , . . . , s d } , de cide whether ther e is a multisubset S 1 ⊂ S such that X s i ∈ S 1 s i = X s i ∈ S \ S 1 s i . More precisely , we will reduce Problem 2 to the decision form of Problem 1. Problem 3 (Decision form of PW A regression) . Given a data set { ( x i , y i ) } N i =1 ∈ ( X × Q ) N , an inte ger n ∈ [2 , N / ( d + 1)] and a thr eshold  ≥ 0 , de cide whether ther e is a p air ( w , h ) ∈ Q n ( d +1) × H such that 1 N N X i =1 X j ∈Q 1 h ( x i )= j ` ( y i − w T j x i ) ≤ , (4) wher e H is the set of line ar classiﬁers as in (1) or (2) and the loss function ` is such that ` ( e ) = 0 ⇔ e = 0 . 3 Prop osition 1. Pr oblem 3 is NP-c omplete. Pr o of. Since given a candidate pair ( w , h ) the condition (4) can b e veriﬁed in p olynomial time, Problem 3 is in NP . Then, the pro of of its NP-completeness pro ceeds b y showing that the P artition Problem 2 has an aﬃrmative answer if and only if Problem 3 with  = 0 has an aﬃrmative answ er. Giv en an instance of Problem 2, let N = 2 d + 3, n = 2, Q = {− 1 , 1 } and build a data set with ( x i , y i ) =                ( s i e i , s i ) , if 1 ≤ i ≤ d ( − s i − d e i − d , s i − d ) , if d < i ≤ 2 d ( s , 0) , if i = 2 d + 1 ( − s , 0) , if i = 2 d + 2 ( 0 , 0) , if i = 2 d + 3 , where e k is the k th unit vector of the canonical basis for Q d and s = P d k =1 s k e k . If Problem 2 has an aﬃrmative answer, then, using the notations of (2), w e can set w 1 = X k ∈ I 1 e k − X k ∈ I − 1 e k , w − 1 = − w 1 , h = X k ∈ I 1 e k − X k ∈ I − 1 e k , b = 0 , where e k = [ e T k , 0] T , I 1 is the set of indexes of the elements of S in S 1 and I − 1 = { 1 , . . . , d } \ I 1 . This gives w T 1 x i =                          s i = y i , if i ≤ d and i ∈ I 1 − s i , if i ≤ d and i ∈ I − 1 s i − d = y i , if i > d and i − d ∈ I − 1 − s i − d , if i > d and i − d ∈ I 1 P k ∈ I 1 s k − P k ∈ I − 1 s k = 0 = y i , if i = 2 d + 1 P k ∈ I − 1 s k − P k ∈ I 1 s k = 0 = y i , if i = 2 d + 2 0 = y i , if i = 2 d + 3 and we can similarly sho w that w T − 1 x i = y i , if i ∈ I − 1 ∨ i − d ∈ I 1 ∨ i > 2 d, while h T x i is p ositiv e if i ∈ I 1 ∨ i − d ∈ I − 1 and negative if i ∈ I − 1 ∨ i − d ∈ I 1 . Therefore, for all p oin ts, w T h ( x i ) x i = y i , i = 1 , . . . , 2 d + 3, and the cost function of Problem 1 is zero, yielding an aﬃrmativ e answer for Problem 3. It remains to pro v e that if (4) holds with  = 0, then Problem 2 has an aﬃrmativ e answ er. T o see this, note that due to ` b eing p ositiv e, a zero cost implies a zero loss for all data p oin ts. Thus, b y ` ( e ) = 0 ⇔ e = 0, if (4) holds with  = 0, w T h ( x i ) x i = y i , i = 1 , . . . , 2 d + 3 . (5) Also note that if h ( x i ) = h ( x i + d ) = 1 for some i ≤ d , we ha ve s i w 1 ,i + w 1 ,d +1 = − s i w 1 ,i + w 1 ,d +1 = s i . This is only p ossible if s i = 0, which is not the case (otherwise w e can simply remov e s i without inﬂuencing the partition problem), or if w 1 ,d +1 = s i . The latter is imp ossible if h ( x i ) = h ( x i + d ) since h is a linear classiﬁer that m ust return the same category for all p oin ts on the line segment b et ween x i and x i + d , which includes the origin x 2 d +3 = 0 and thus would imply by (5) that w T 1 x 2 d +3 = w 1 ,d +1 = y 2 d +3 = 0. As a consequence, h ( x i ) = h ( x i + d ) = 1 cannot hold, and since w e can similarly show that h ( x i ) = h ( x i + d ) = − 1 cannot hold, we hav e h ( x i ) 6 = h ( x i + d ) for all i ≤ d . Hence, (5) leads to w T 1 x i 6 = y i ⇒ w T 1 x d + i = y d + i , i = 1 , . . . , d (6) w T − 1 x i 6 = y i ⇒ w T − 1 x d + i = y d + i , i = 1 , . . . , d. (7) 4 Let ˆ I 1 = { i ∈ { 1 , . . . , d } : w T 1 x i = y i } and ˆ I − 1 = { 1 , . . . , d } \ ˆ I 1 . Then, if h ( x 2 d +3 ) = +1, w 1 ,d +1 = 0 and for all i ≤ d , w T 1 x i = w i s i . Therefore, for all i ∈ ˆ I 1 , w i = 1, while for all i ∈ ˆ I − 1 , (6) gives w T 1 x d + i = y d + i , i.e., − w i s i = s i and w i = − 1. This leads to w T 1 x 2 d +1 = X i ∈ ˆ I 1 s i − X i ∈ ˆ I − 1 s i = − w T 1 x 2 d +2 . Th us, if w T 1 x 2 d +1 = y 2 d +1 = 0 or w T 1 x 2 d +2 = y 2 d +2 = 0, a v alid partition in the sense of Problem 2 is obtained with S 1 = { s i } i ∈ ˆ I 1 . In addition, if w T 1 x 2 d +1 6 = 0 and w T 1 x 2 d +2 6 = 0, then b y (5), w T − 1 x 2 d +1 = w T − 1 x 2 d +2 = 0, which by construction implies that w − 1 ,d +1 = 0. In this case, we redeﬁne ˆ I − 1 = { i ∈ { 1 , . . . , d } : w T − 1 x i = y i } and ˆ I 1 = { 1 , . . . , d } \ ˆ I − 1 to obtain w − 1 ,i = 1 for all i ∈ ˆ I − 1 and w − 1 ,i = − 1 for all i ∈ ˆ I 1 , resulting also in a v alid partition by the fact that w T − 1 x 2 d +2 = P i ∈ ˆ I 1 s i − P i ∈ ˆ I − 1 s i = 0. Since a similar reasoning applies to the case h ( x 2 d +3 ) = − 1 b y symmetry (substituting w − 1 for w 1 ), a zero cost, i.e., (4) with  = 0, alw ays implies an aﬃrmative answ er to Problem 2. Pr o of of The or em 1. Since the decision form of Problem 1 with ` ( e ) = 0 ⇔ e = 0, i.e., Problem 3, is NP-complete, Problem 1 with such a loss function is NP-hard (solving Problem 1 also yields the answ er to Problem 3 and th us it is at least as hard as Problem 3). 4 P olynomial complexit y in the num b er of data W e no w state the result regarding the p olynomial complexit y of Problem 1 with respect to N under the follo wing assumptions, the ﬁrst of which holds almost surely for randomly dra wn data p oin ts, while the second one holds for instance for ` ( e ) = e 2 with a linear time complexity T ( N ) = Ø( N ) [22]. Assumption 1. The p oints { x i } N i =1 ar e in gener al p osition, i.e., no hyp erplane of Q d c ontains mor e than d p oints. Assumption 2. Given { ( x i , y i ) } N i =1 ∈ ( X × Q ) N , the pr oblem min v ∈ Q d +1 P N i =1 ` ( y i − v T x i ) has a p olynomial time c omplexity T ( N ) for any ﬁxe d inte ger d ≥ 1 . Theorem 2. F or any ﬁxe d numb er of mo des n and dimension d , under Assumptions 1 – 2, the time c omplexity of Pr oblem 1 is no mor e than p olynomial in the numb er of data N and in the or der of T ( N )Ø  N dn ( n − 1) / 2  . The pro of of Theorem 2 relies on the existence of exact algorithms with complexity p olynomial in N for the binary case ( n = 2, Prop osition 4) and the multi-class case ( n ≥ 3, Corollary 1). These algorithms are based on a reduction of Problem 1 to a com binatorial search in tw o steps. The ﬁrst step reduces the problem to a classiﬁcation one. Indeed, Problem 1 can be reform ulated as the searc h for the classiﬁer h , since b y ﬁxing h , the optimal parameter vectors { w j } j ∈Q can b e obtained by solving n indep enden t linear regression problems on the subsets of data resulting from the classiﬁcation by h , which, b y Assumption 2, can b e p erformed in the p olynomial time T ( N ). This yields the follo wing reform ulation of the problem. Prop osition 2. Pr oblem 1 is e quivalent to min w ∈ Q n ( d +1) ,h ∈H 1 N N X i =1 X j ∈Q 1 h ( x i )= j `  y i − w T j x i  (8) s.t. ∀ j ∈ Q , w j ∈ argmin v ∈ Q d +1 N X i =1 1 h ( x i )= j ` ( y i − v T x i ) . The second step reduces the estim ation of h to a combinatorial problem solv ed in Ø  N dn ( n − 1) / 2  op erations, as detailed in Sect. 4.1 – 4.2 for n = 2 and in Sect. 4.3 for n ≥ 3. 5 4.1 Finding the optimal classiﬁcation W e reduce the complexit y of searching for the classiﬁer b y considering all possible linear classiﬁ- cations instead of all p ossible linear classiﬁers. In other words, we pro ject the class H of classiﬁers on to the set of p oin ts S = { x i } N i =1 to reduce a contin uous search to a combinatorial problem. This is in line with the tec hniques used in statistical learning theory [23] for the diﬀeren t purp ose of computing error b ounds for inﬁnite function classes. Thus, we in tro duce deﬁnitions from this ﬁeld. Deﬁnition 1 (Pro jection onto a set) . The pr oje ction of a set of classiﬁers H ⊂ Q X onto S = { x i } N i =1 , denote d H S , is the set of al l lab elings of S that c an b e pr o duc e d by a classiﬁer in H : H S = { ( h ( x 1 ) , . . . , h ( x N )) : h ∈ H} ⊆ Q N . Deﬁnition 2 (Gro wth function) . The growth function Π H ( N ) of H at N is the maximal numb er of lab elings of N p oints that c an b e pr o duc e d by classiﬁers fr om H : Π H ( N ) = sup S ∈X N |H S | . W e no w focus on binary PW A maps and th us on binary classiﬁers with output in Q = {− 1 , +1 } . F or suc h classiﬁers, we ob viously ha ve Π H ( N ) ≤ 2 N for all N . By further restricting H to aﬃne classiﬁers as in (2), results from statistical learning theory (see, e.g., [23]) provide the tigh ter b ound Π H ( N ) ≤  eN d +1  d +1 , whic h is p olynomial in N and thus promising from the viewpoint of global optimization. How ever, its pro of is not constructiv e and do es not pro vide an explicit algorithm for en umerating all the lab elings. The following theorem, though leading to a lo oser bound on the gro wth function, oﬀers a constructive scheme to compute the pro jection H S , which is what we need in order to test all the labelings in H S for global optimization. Theorem 3. The gr owth function of the class of binary aﬃne classiﬁers of Q d , H in (2) , is b ounde d for any N > d by Π H ( N ) ≤ 2 d +1  N d  = Ø( N d ) and, for any set S of N p oints in gener al p osition, an algorithm builds the pr oje ction H S in Ø( N d ) time. The pro of of Theorem 3 relies on the follo wing prop osition, whic h is illustrated b y Fig. 1. Prop osition 3. F or any binary aﬃne classiﬁer h in H (2) and any ﬁnite set of N > d p oints S = { x i } N i =1 in gener al p osition, ther e is a subset of p oints S h ⊂ S of c ar dinality | S h | = d and a sep ar ating hyp erplane of p ar ameters ( h S h , b S h ) p assing thr ough the p oints in S h , i.e., ∀ x ∈ S h , h T S h x + b S h = 0 , with k h S h k = 1 , (9) which yields the same classiﬁc ation of S in the sense that ∀ x i ∈ S \ S h , h ( x i ) = sign ( h T S h x i + b S h ) . (10) Pr o of sketch. F or all classiﬁers h with separating hyperplanes passing through d p oin ts of S , the statement is ob vious. F or the others passing through p p oin ts with 0 ≤ p < d , they can be transformed to pass through additional p oin ts without changing the classiﬁcation of the remaining p oin ts. If p = 0, it suﬃces to translate the hyperplane to the closest p oin t. If 0 < p < d , the h yp erplane can be rotated with a plane of rotation that leav es unchanged the subspace spanned by the p p oin ts and a minimal angle yielding a rotated h yp erplane passing through p 0 > p p oin ts, where p 0 ≤ d b y the general p osition assumption. Iterating this sc heme until p = d yields a hyperplane passing through the points in S h of parameters ( h S h , b S h ) satisfying (9) and (10). W e can now prov e Theorem 3. 6 −5 −4 −3 −2 −1 0 1 2 3 4 5 −10 −5 0 5 Figure 1: The h yp erplane H (plain line) pro duces the same classiﬁcation (in to + and ◦ ) as the h yp erplane H S (dashed line) obtained by a translation (dotted line) and a rotation of H such that it passes through exactly 2 points of S ( ∗ ). Pr o of of The or em 3. F or any lab eling q in H S , there is a classiﬁer h ∈ H that pro duces this lab eling. Applying Prop osition 3 to h , w e obtain another classiﬁer h S h of parameters ( h S h , b S h ) that passes through the p oin ts in S h and that agrees with h on S \ S h . Let ˆ q ∈ {− 1 , +1 } N b e deﬁned b y ˆ q i = h S h ( x i ), i = 1 , . . . , N . Then, w e generate 2 d lab elings by setting its entries ˆ q i with i ∈ S h to all the 2 d com binations of signs (recall that | S h | = d ). By construction, there is no lab eling of S that agrees with q on S \ S h other than these 2 d lab elings. Since this holds for any q ∈ H S , the cardinality of H S cannot be larger than 2 d times the num b er of hyperplanes passing through d points of S . Since eac h subset S h ⊂ S of cardinalit y d giv es rise to t wo h yp erplanes of opp osite orien tations, the n umber of such hyperplanes is 2  N d  and w e ha ve Π H ( N ) ≤ 2 d +1  N d  < 2 d +1 N d d ! = Ø( N d ). In addition, there is an algorithm that enumerates all the subsets S h in  N d  iterations and builds H S b y computing a hyperplane passing through the p oin ts 1 in S h and the corresp onding 2 d +1 lab elings at eac h iteration. Since these inner computations can b e p erformed in constan t time with resp ect to N , the algorithm has a time complexity in the order of  N d  = Ø( N d ). 4.2 Global optimization of binary PW A mo dels W e can use the results ab o ve to reduce the complexit y of Problem 1 in the binary case, considered in the following in its equiv alent form (8) from Prop osition 2. First, note that the cost function in (8) only dep ends on h , since all feasible v alues of w for a giv en h yield the same cost. F urthermore, the cost do es not dep end on the exact v alue of h , but only on the resulting classiﬁcation, i.e., on h ( x i ), i = 1 , . . . , N . Thus, given a global solution h ∗ to (8), any classiﬁer h pro ducing the same classiﬁcation yields the same cost function v alue and hence is also a global solution. Thus, the problem reduces to the search for the correct classiﬁcation q ∈ H S , whose complexit y is in Ø(Π H ( N )) and b ounded by Theorem 3. In addition, for the purp ose of binary PW A regression, opp osite lab elings q and − q are equiv alent and can b e pruned from H S . This is due to the symmetry of the cost function (8). Algorithm 1 pro vides a solution to Problem 1 for the binary case while taking this symmetry in to accoun t. 1 The normal h of a h yp erplane { x : h T x + b = 0 } passing through d p oin ts { x i } d i =1 in Q d can b e computed as a unit v ector in the n ull space of [ x 2 − x 1 , . . . , x d − x 1 ] T , while the oﬀset is given b y b = − h T x i for an y of the x i ’s. 7 Algorithm 1 Exact solution to Problem 1 for n = 2 Input: A data set { ( x i , y i ) } N i =1 ⊂ ( Q d × Q ) N . Initialize S ← { x i } N i =1 and J ∗ ← + ∞ . for all S h ⊂ S suc h that | S h | = d do Compute the parameters ( h S h , b S h ) of a h yp erplane passing through the p oin ts in S h . Classify the data points: S 1 = { x i ∈ S : h T S h x i + b S h > 0 } , S 2 = { x i ∈ S : h T S h x i + b S h < 0 } . for all classiﬁcation of S h in to S 1 h and S 2 h do Set w j ∈ argmin v ∈ Q d +1 X x i ∈ S j ∪ S j h ` ( y i − v T x i ) , j = 1 , 2 , J = 1 N 2 X j =1 X x i ∈ S j ∪ S j h ` ( y i − w T j x i ) , and up date the b est solution ( J ∗ ← J, w ∗ ← [ w T 1 , w T 2 ] T , h ∗ ← h S h , b ∗ ← b S h ) if J < J ∗ . end for end for return w ∗ , h ∗ , b ∗ . Prop osition 4. Under Assumptions 1 – 2, Algorithm 1 exactly solves Pr oblem 1 for n = 2 and any ﬁxe d d with a p olynomial c omplexity in the or der of T ( N )Ø( N d ) . Pr o of. By following a similar path as for Theorem 3, Algorithm 1 can b e pro ved to test all linear classiﬁcations of the data p oints up to symmetric ones. Since Algorithm 1 computes a solution in terms of w that is feasible for (8) for eac h of these classiﬁcations, the v alue of J coincides with the cost function of (8) for a particular h . By the symmetry of this cost function with respect to h and the fact that it only depends on h via its v alues at the data p oin ts, Algorithm 1 computes all p ossible v alues of the cost function, including the exact global optim um of (8), and returns a global minimizer. Thus, by Prop osition 2, it also solves Problem 1. The total n um b er of iterations of Algorithm 1 is 2 d  N d  = Ø( N d ) and, under Assumption 2, these iterations only in volv e operations computed in p olynomial time in the order of T ( N ), hence the ov erall time complexity in the order of T ( N )Ø( N d ). 4.3 Multi-class extension F or n > 2, the b oundary b et ween 2 modes j and k > j implemen ted by a linear classiﬁer from H in (1) is a hyperplane of equation h j k ( x ) = h j ( x ) − h k ( x ) = 0, i.e., based on the diﬀerence of the tw o functions h j ( x ) = h T j x + b j and h k ( x ) = h T k x + b k . Based on these h yp erplanes, the classiﬁcation rule can b e written as h ( x ) = argmax k ∈Q h k ( x ) = j, such that ( h j k ( x ) ≥ 0 , ∀ k > j, h kj ( x ) < 0 , ∀ k < j. Based on these facts, we can build an algorithm to recov er all p ossible classiﬁcations consistent with a linear classiﬁcation in the sense of (1). Theorem 4. F or the set of multi-class line ar classiﬁers of Q d , H in (1) , the gr owth function is b ounde d for any N > d by Π H ( N ) ≤  2 d +1  N d  n ( n − 1) / 2 = Ø( N dn ( n − 1) / 2 ) and, for any set S of N p oints of Q d in gener al p osition, an algorithm builds H S in Ø( N dn ( n − 1) / 2 ) time. 8 Pr o of. Any classiﬁcation produced b y a classiﬁer from (1) can b e computed from the signs of the n H = n ( n − 1) / 2 functions h j k = h j − h k , 1 ≤ j < k ≤ n , corresp onding to the pairwise separating h yp erplanes. F or an y S , for each of these hyperplanes, Prop osition 3 provides an equiv alent binary classiﬁer which must b e one from the 2  N d  h yp erplanes passing through d points S j k of S . The n um b er of sets of n H suc h h yp erplanes is 2 n H  N d  n H . Since these classiﬁers cannot pro duce all the 2 n H d classiﬁcations of the n H d p oin ts in the sets S j k , we must also take these into account so that the num b er of classiﬁcations of S is upp er b ounded b y |H S | ≤ 2 n H d 2 n H  N d  n H = Ø( N dn H ). This upp er b ound holds for any S , and thus also applies to the growth function. Finally , an algorithm that makes explicit all the classiﬁcations mentioned abov e to build H S can b e constructed in a recursiv e manner, with one classiﬁcation per iteration and thus with a similar n umber of iterations, eac h one including computations p erformed in constan t time. Theorem 4 implies the following for PW A regression. Corollary 1. Under Assumptions 1 – 2, a glob al solution to Pr oblem 1 with n ≥ 3 c an b e c ompute d with a p olynomial c omplexity in the or der of T ( N )Ø( N dn ( n − 1) / 2 ) . 5 Conclusions The pap er discussed complexit y issues for PW A regression and show ed that i) the global minimiza- tion of the error is NP-hard in general, and ii) for ﬁxed num b er of mo des and data dimension, an exact solution can b e obtained in time polynomial in the n umber of data. The pro of of NP-hard ness also implies that the problem remains NP-hard ev en when the n umber of mo des is ﬁxed to 2, which indicates that the complexit y is mostly due to the data dimension. An op en issue concerns the conditions under whic h a PW A system generates tra jectories satisfying the general p osition as- sumption used by the p olynomial-time algorithm. F uture work will also focus on the extension of the results to the case of arbitrarily switc hed systems and heuristics inspired b y the p olynomial- time algorithm, whose practical application remains limited b y an exp onen tial complexit y in the dimension. Ac kno wledgemen ts The author would like to thank the anon ymous reviewers for their comments and suggestions. Thanks are also due to Y ann Guermeur for carefully reading this manuscript. References [1] S. Paoletti, A. L. Juloski, G. F errari-T recate, R. Vidal, Identiﬁcation of hybrid systems: a tutorial, Europ ean Journal of Con trol 13 (2-3) (2007) 242–262. [2] A. Garulli, S. Paoletti, A. Vicino, A survey on switched and piecewise aﬃne system iden- tiﬁcation, in: Pro c. of the 16th IF A C Symp. on System Identiﬁcation (SYSID), 2012, pp. 344–355. [3] R. Vidal, S. Soatto, Y. Ma, S. Sastry , An algebraic geometric approach to the identiﬁcation of a class of linear hybrid systems, in: Pro c. of the 42nd IEEE Conf. on De cision and Control (CDC), Maui, Haw a ¨ ı, USA, 2003, pp. 167–172. [4] L. Bak o, Iden tiﬁcation of switc hed linear systems via sparse optimization, Automatica 47 (4) (2011) 668–677. [5] I. Maruta, H. Ohlsson, Compression based identiﬁcation of PW A systems, in: Pro c. of the 19th IF AC W orld Congress, Cape T own, South Africa, 2014, pp. 4985–4992. 9 [6] L. Ljung, System identiﬁcation: Theory for the User, 2nd Edition, Prentice Hall, 1999. [7] G. F errari-T recate, M. Muselli, D. Lib erati, M. Morari, A clustering technique for the identi- ﬁcation of piecewise aﬃne systems, Automatica 39 (2) (2003) 205–217. [8] A. Bemp orad, A. Garulli, S. Paoletti, A. Vicino, A b ounded-error approac h to piecewise aﬃne system iden tiﬁcation, IEEE T ransactions on Automatic Control 50 (10) (2005) 1567–1580. [9] A. L. Juloski, S. W eiland, W. Heemels, A Ba y esian approach to identiﬁcation of hybrid sys- tems, IEEE T ransactions on Automatic Control 50 (10) (2005) 1520–1533. [10] F. Lauer, G. Bloch, R. Vidal, A contin uous optimization framework for hybrid system iden ti- ﬁcation, Automatica 47 (3) (2011) 608–613. [11] F. Lauer, Estimating the probabilit y of success of a simple algorithm for switched linear regres- sion, Nonlinear Analysis: Hybrid Systems 8 (2013) 31–47, supplemen tary material av ailable at http://www.loria.fr/ ~ lauer/klinreg/ . [12] T. Pham Dinh, H. Le Thi, H. Le, F. Lauer, A diﬀerence of conv ex functions algorithm for switc hed linear regression, IEEE T ransactions on Automatic Control 59 (8) (2014) 2277–2282. [13] N. Oza y , M. Sznaier, C. Lagoa, O. Camps, A sparsiﬁcation approach to set mem b ership iden tiﬁcation of switched aﬃne syste ms, IEEE T ransactions on Automatic Con trol 57 (3) (2012) 634–648. [14] H. Ohlsson, L. Ljung, S. Boyd, Segmentation of ARX-models using sum-of-norms regulariza- tion, Automatica 46 (6) (2010) 1107–1111. [15] H. Ohlsson, L. Ljung, Identiﬁcation of switc hed linear regression mo dels using sum-of-norms regularization, Automatica 49 (4) (2013) 1045–1050. [16] F. Lauer, V. L. Le, G. Blo c h, Learning smo oth mo dels of nonsmo oth functions via conv ex optimization, in: Pro c. of the IEEE Int. W orkshop on Mac hine Learning for Signal Pro cessing (MLSP), Santander, Spain, 2012. [17] J. Roll, A. Bemporad, L. Ljung, Identiﬁcation of piecewise aﬃne systems via mixed-integer programming, Automatica 40 (1) (2004) 37–50. [18] M. Garey , D. Johnson, Computers and Intractabilit y: a Guide to the Theory of NP- Completeness, W.H. F reeman and Compan y , 1979. [19] D. Aloise, A. Deshpande, P . Hansen, P . Popat, NP-hardness of Euclidean sum-of-squares clustering, Machine Learning 75 (2) (2009) 245–248. [20] V. Blondel, J. Tsitsiklis, A survey of computational complexity results in systems and con trol, Automatica 36 (9) (2000) 1249–1274. [21] R. Alur, N. Singhania, Precise piecewise aﬃne mo dels from input-output data, in: Proc. of the 14th Int. Conf. on Embedded Soft ware (EMSOFT), 2014. [22] G. H. Golub, C. F. V an Loan, Matrix Computations, 4th Edition, John Hopkins Universit y Press, 2013. [23] V. V apnik, Statistical Learning Theory , John Wiley & Sons, 1998. 10

On the complexity of piecewise affine system identification

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment