Orthogonal Floating Search Algorithms: From The Perspective of Nonlinear System Identification

Orthogonal Floating Search Algorithms: From The Perspecti ve of Nonlinear System Identiﬁcation Faizal Haﬁz a, ∗ , Akshya Swain a , Eduardo Mendes b a Department of Electrical & Computer Engineering , The University of Auc kland, Auc kland, New Zealand b Department of Electr onics Engineering, F ederal University of Minas Ger ais, Belo Horizonte, Brazil Abstract The present study proposes a ne w Orthogonal Floating Search framew ork for structure selection of nonlinear systems by adapting the existing ﬂoating search algorithms for feature selection. The proposed framework inte- grates the concept of orthogonal space and consequent Error -Reduction-Ratio (ERR) metric with the existing ﬂoating search algorithms. On the basis of this framew ork, three well-known feature selection algorithms ha ve been adapted which include the classical Sequential F orward Floating Search (SFFS), Improv ed sequential F orward Floating Search (IFFS) and Oscillating Search (OS). This framework retains the simplicity of classical Orthogonal Forward Regression with ERR (OFR-ERR) and eliminates the nesting e ﬀ ect associated with OFR-ERR. The performance of the proposed framew ork has been demonstrated considering se veral benchmark non-linear systems. The results sho w that most of the existing feature selection methods can easily be tailored to identify the correct system structure of nonlinear systems. K e ywor ds: Forward model selection, Model structure selection, N ARX model, Nonlinear system, System identiﬁcation 1. Introduction Over the last sev eral decades, researchers from v arious ﬁelds of science and engineering have developed several methods to construct mathematical models from measured input-output data ( popularly known as system identiﬁca- tion) [1, 2]. Since most of the practical systems are nonlinear , the ﬁrst step in the process of ﬁtting the models is the choice of the model amongst v arious av ailable models (e.g., V olterra, W iener , Polynomial / Rational) which can e ﬀ ec- tiv ely capture the dynamics in input-output data. In this study , the focus is on the identiﬁcation of nonlinear systems represented by polynomial nonlinear auto-regressiv e with exogenous inputs (NARX) models [1]. Note that the iden- tiﬁcation of system is a two stage process which inv olves 1) Determination of the signiﬁcant terms and 2) Estimation of the corresponding parameters. While the parameters can be estimated using least-squares based algorithms, the determination of the signiﬁcant terms is a challenging task and often referred to as ‘structur e selection’ . A close examination of the structure selection problem in nonlinear system identiﬁcation re veals that this problem has signiﬁcant similarities with the featur e selection problem encountered in the pattern recognition. In essence, both problems belong to a broader class of combinatorial optimization problem which is referred here as the attribute selection problem. T o understand this further , consider a superset of ‘ n ’ number of ‘attrib utes’ , denoted by: X model = { x 1 , x 2 , . . . x n } , where any attribute (say ‘ x i ’), could represent either a featur e (in pattern recognition ) or a term of ∗ Corresponding author Email addr ess: faizalhafiz@ieee.org (Faizal Haﬁz) the NARX model (in system identiﬁcation ). The goal of the attribute selection is to determine a subset of signiﬁcant attributes, X ? ⊂ X model , by maximizing a suitable criterion function,‘ J ( · )’, as follows: J ( X ? ) = max X⊂X model J ( X ) (1) An exhausti ve search to solv e such combinatorial problem is often intractable even for a moderate number of attrib utes, ‘ n ’, as it requires the ev aluation of 2 n subsets. Over the years sev eral approaches have been proposed to address the structure selection problem of NARX models, e.g . , see [3 – 8]. Among these, the Orthogonal Forward Regression (OFR) [4, 9, 10] is, arguably , computationally the most e ﬀ ectiv e. Perhaps, for this reason, it has been extensiv ely applied in many applications [1, 11]. The central idea of this approach is to decouple the terms in the orthogonal space so that the contribution of each term to the output variance can indi vidually be determined. From the search perspecti ve, the OFR is a sequential gr eedy approach where the term with the highest performance metric, referred to as Err or-Reduction-Ratio (ERR), is included in each step and it has prov en to be very e ﬀ ective in various applications [3, 11]. Howe ver , it has been shown that under certain conditions the OFR may yield sub-optimal term subsets [12 – 16]. The reason for such suboptimal performance is often attributed to the performance metric, ERR. For example, Mao and Billings [12] have shown that for the given term, the value of ERR is dependent on the orthogonalization path , i.e. , the order in which the terms are orthogonalized. This may lead to the selection of spurious terms, when there is an information overlap among the non-orthogonal terms, especially in the earlier stages of the search [15]. T o alle viate these issues, several approaches hav e been proposed to extend / modify the classical OFR algorithm. These can be categorized mainly into two approaches: 1) T wo-step approach [12, 15, 17] and 2) Improv ed performance metric [13, 14, 16]. In the two-step approach, OFR is applied ﬁrst to identify the initial term subset which is further reﬁned by a secondary search procedure. For example, in [12], the initial term subset is reﬁned by applying Genetic Algorithm (GA) to identify the optimal orthogonalization path . Howe ver , the explosi ve growth of the search space limits the application of this approach, e.g. , for ‘ n ’ number of terms, the total number of possible orthogonalization paths becomes ‘ n !’. In comparison, if GA (or any suitable algorithm) is applied directly to select term subset the search space reduces from n ! to 2 n [18]. The ‘iterative OFR’ (iOFR) algorithm [15] uses each term in the initial term subset as a pivot to identify new , and possibly better , term subsets in the secondary search. In [17], a backwar d elimination approach is proposed as the secondary search to remove the spurious terms contained in the initial term subset. Most of the two-step approach, e.g. [17], are usually e ﬀ ective pro vided all the system / signiﬁcant terms are identiﬁed in the initial term subset. The focus of the second approach is to replace the performance metric, ‘ERR’ [13, 14, 16]. For this purpose, sev eral performance metrics hav e been proposed based on Simulation Error [13], Mutual Information [14] and Ultra- orthogonal least squares [16]. While these metrics improv e the search performance of OFR in man y cases, they often increase the computational complexity of OFR. It is worth to emphasize that the aforementioned approaches to improv e OFR are based on the pre vailing belief that the sub-optimal performance of OFR is due only to its performance metric, ERR. Howe ver , in this study , it is argued that this may be in part due to the uni-dir ectional sequential search nature of OFR where terms are progressively added into the term subset until pre-speciﬁed ERR threshold is reached. This leads to the ‘nesting e ﬀ ect’ , i.e. , once a term is included in the subset, it cannot be removed from the term subset. This imposes a hierarchy of terms, whereas in practice, all signiﬁcant terms are equally important. Note that the nesting e ﬀ ect is not new and it has been well-known among the feature selection community since the early se venties [19–21]. Over the years, state-of-the-art 2 ‘ﬂoating’ feature selection algorithms have speciﬁcally been developed to address the nesting e ﬀ ect [22 – 25]. Since there already exist a class of ﬂoating feature selection algorithms and gi ven that both structure selection and feature selection share a common search objectiv e, the question arises: Can the featur e selection algorithms be adapted to addr ess the structur e selection pr oblem? The present study aims to address this apropos question. In particular , the proposed Orthogonal Floating Search Framew ork integrates the ke y properties of the classical Sequential Forward Floating Search (SFFS) and its variants such as Improv ed F orward Floating Search (IFFS) and Oscillating Search (OS) with the Error-Reduction-Ratio (ERR) to determine the structure of nonlinear systems. In each step of these algorithms, after adding a ‘signiﬁcant’ term (dis- cussed in Section 3.1), the selected terms are re-examined to identify and remove the ‘least signiﬁcant’ term (discussed in Section 3.1). Thus, the spurious terms can e ventually be replaced by the system / signiﬁcant terms during the search process. The search performance is therefore signiﬁcantly improv ed without the need for a second search procedure or a complicated performance measure. The e ﬃ cacy of the proposed approach is demonstrated on se veral benchmark non-linear systems. The worst case scenario ( non-persistent e xcitation ) and the identiﬁcation of a discrete-time model of a continuous-time system are also considered. The results of this in vestigation convincingly demonstrate that the feature selection algorithms can indeed be tailored for the purpose of structure selection. The rest of the article is or ganized as follo ws: The polynomial N ARX model and the OFR approach are discussed brieﬂy in Section 2. Next, the proposed Orthogonal Floating Search Framework is discussed in detail in Section 3. The frame work of this in vestigation is described in Section 4. Finally , the results of this in vestigation are discussed at length in Section 5. 2. Preliminaries In the following, the polynomial N ARX model and the orthogonal regression approach are brieﬂy discussed. 2.1. The P olynomial N ARX Model The NARX model represents a non-linear system as a function of recursiv e lagged input and output terms as follows: y ( k ) = F n l { y ( k − 1) , . . . , y ( k − n y ) , u ( k − 1) , . . . , u ( k − n u ) } + e ( k ) where y ( k ) and u ( k ) respecti vely represent the output and input at time intervals k , n y and n u are corresponding lags and F n l {· } is some nonlinear function of degree n l . The total number of possible terms or model size ( n ) of the N ARX model is giv en by , n = n l X i = 0 n i , n 0 = 1 and n i = n i − 1 ( n y + n u + i − 1) i , i = 1 , . . . , n l (2) This model is essentially linear-in-parameters and can be e xpressed as: y ( k ) = n X i = 1 θ i x i ( k ) + e ( k ) , k = 1 , 2 , . . . N (3) where, x 1 ( k ) = 1 , and x i ( k ) = p y Y j = 1 y ( k − n y j ) q u Y k = 1 u ( k − n u k ) i = 2 , . . . , n , 3 p y , q u ≥ 0; 1 ≤ p y + q u ≤ n l ; 1 ≤ n y j ≤ n y ;1 ≤ n u k ≤ n u ; n l is the degree of polynomial expansion; ‘ N ’ denotes the total number of data points. 2.2. Orthogonal Re gression In this study , the orthogonalization is used to determine the signiﬁcance of a particular term, which will be dis- cussed in Section 3.1. It is, therefore, pertinent to brieﬂy discuss the concept of orthogonal regression introduced by Billings et al. [4, 9, 10]. In the orthogonal regression, as the name suggests, the original terms are replaced by the corresponding orthogonal terms. Gi ven that the orthogonal terms are decoupled, the signiﬁcance of each term can be determined independently in the orthogonal space. This could be achiev ed by a simple Gram-Schmidt orthogonaliza- tion: w 1 = x 1 , w i = x i − i − 1 X m = 1 h w m x i i w T m w m Consequently , the N ARX model (3) can be represented in the orthogonal space as follows: y ( k ) = n X i = 1 w i ( k ) g i + e ( k ) , k = 1 , 2 , . . . N Giv en that, in the orthogonal space, w T i w j = 0 holds for i , j , the signiﬁcance of the i th term, x i , can be easily be determined using the Error-Reduction-Ratio (ERR) metric as follo ws: E RR ( x i ) = g 2 i w T i w i y T y , where, g i = w T i y w T i w i , i ∈ [1 , n ] (4) 3. Proposed Orthogonal Floating Sear ch Framework This study aims to in vestigate whether the existing feature selection algorithms can be tailored to address the non- linear system identiﬁcation problem. For this purpose, a new search framew ork is being proposed which integrates the term orthogonalization into ‘ ﬂoating ’ feature selection algorithms and therefore referred to as ‘ Orthogonal Floating Sear ch F rame work ’. The term orthogonalization decouples the structure selection and the parameter estimation, which are crucial steps of the nonlinear system identiﬁcation. It is, therefore, a vital ﬁrst step to adapt the feature selection algorithms for system identiﬁcation. Further, the following ‘ ﬂoating ’ feature selection algorithms hav e been adapted: Sequential F or- war d Floating Sear ch (SFFS) [22], Impr oved F orwar d Floating Sear ch (IFFS) [25] and Oscillating Sear ch (OS) [24]. The moti vation for the selection of these methods is two-fold: 1) The term orthogonalization can be integrated with relativ e ease into the ﬂoating search methods. 2) The step-wise subset building approach of the sequential ﬂoating methods ( e.g . , SFFS and IFFS) is similar to that of the Orthogonal Forward Regression (OFR-ERR) [4]. Further , in OS, the depth of the search process can easily be controlled as per the prev ailing requirements, as will be discussed in Section 3.4. The concept of feature signiﬁcance is central to the ﬂoating search methods. In this study , we extend this concept in the context of system identiﬁcation and integrate the term orthogonalization to determine the signiﬁcance of a term / term subset. This is discussed in Section 3.1. Further , each search method and associated adaption are discussed in the conte xt of the system identiﬁcation in Section 3.2, Section 3.3 and Section 3.4. Note that the adapted v ersions of SFFS, IFFS and OS are referred to respectively as Orthogonal Sequential Floating search (OSF), Orthogonal Improv ed sequential Floating search (OIF) and Orthogonal Oscillating Search (O 2 S). 4 3.1. T erm Signiﬁcance Determination of feature signiﬁcance is the core of the ﬂoating search methods [22 – 25], where in each search step, the most signiﬁcant feature / features are selected from the pool of a vailable features and then the least signiﬁcant feature / features are removed, from the selected feature subset. If a similar approach is to be applied to address the structure selection problem, it is essential to determine ﬁrst the ‘ signiﬁcance ’ of a term / term subset. For this purpose, in this study , we extend the deﬁnitions of feature signiﬁcance developed by Pudil, Somol and co-workers [22–24] as follows: Let X k = { x 1 , x 2 , . . . x k } be the term subset selected by the search algorithm at particular search step from the superset of N ARX terms, X model = { x 1 , x 2 , . . . x n } . Consequently , the subset of a vailable terms is gi ven by X model \ X k . Let X i o denote i th subset containing ‘ o ’ number of terms. The criterion function corresponding to the term subset X k is denoted by J ( X k ) and is giv en by J ( X k ) = k X i = 1 E RR ( x i ) , ∀ x i ∈ X k (5) where, E RR ( x i ) is giv en by (4). For these subsets, the signiﬁcance of term / term subset is deﬁned as follo ws: Deﬁnition 1. The most signiﬁcant term with r espect to X k is denoted as ‘ x M S ’ and is deﬁned as: x M S = { x i | J ( X k ∪ x i ) = max i J ( X k ∪ x i ) , x i ∈ X model \ X k } , ∀ i ∈ [1 , n ] (6) Deﬁnition 2. The least signiﬁcant term in X k is denoted as ‘ x LS ’ and is deﬁned as: x LS = { x i | J ( X k \ x i ) = max i J ( X k \ x i ) , x i ∈ X k } , ∀ i ∈ [1 , n ] (7) Deﬁnition 3. The most signiﬁcant subset of ‘o’ number of terms with r espect to X k is denoted as ‘ X M S o ’ and is deﬁned as: X M S o = {X i o | J ( X k ∪ X i o ) = max i J ( X k ∪ X i o ) , X i o ∈ X model \ X k } (8) Deﬁnition 4. The least signiﬁcant subset of ‘o’ number of terms in X k is denoted as ‘ X LS o ’ and is deﬁned as: X LS o = {X i o | J ( X k \ X i o ) = max i J ( X k \ X i o ) , X i o ∈ X k } (9) 3.2. Orthogonal Sequential Floating Sear ch The principle of the generalized ﬂoating search was introduced by Pudil et al. in [22] which is essentially a sequential approach with a distinct ability to ‘ backtr ack ’, i.e. , the selected / discar ded term can be discar ded / selected in later stages. It is worth to note that the ﬂoating search can proceed either in forward (bottom-up) or backwar d (top-down) direction. In this study , we focus on the forward or bottom-up approach. This is referred to as Orthogonal Sequential Floating (OSF) search. The pseudo code for OSF is shown in Algorithm 1. Note that the OSF does not include ‘ term swapping ’ procedure (Line 21-31, Algorithm 1), in contrast to the improv ed ﬂoating search approach (which will be discussed in Section 3.3). 5 Algorithm 1: Orthogonal Floating Structure Selection. Input : Input-output Data, ( u , y ); Number of required terms, ‘ ξ ’ Output: Identiﬁed Structure, X 1 Begin with the empty subset: X ← ∅ , k = 0, J ( X i ) = 0 , ∀ i ∈ [1 , ξ ] 2 while k < ξ do */ Sequential Forward Selection 3 Add the most signiﬁcant term to X , i.e. , ˆ X ← {X ∪ x MS } 4 if J ( ˆ X ) > J ( X k + 1 ) then 5 X ← ˆ X ; J ( X ) ← J ( ˆ X ) 6 else 7 X ← X k + 1 ; J ( X ) ← J ( X k + 1 ) 8 end 9 k ← k + 1 */ Backwards Elimination 10 Set the ﬁrst e xclusion ﬂag , f 1 = 1 11 while k > 2 do 12 Identify the least signiﬁcant term, x LS , in X 13 if ( f 1 = 1 ∧ x LS = x MS ) OR J ( X\ x LS ) ≤ J ( X k − 1 ) then 14 Stop Elimination. 15 else 16 Remov e the least signiﬁcant term, i.e. , 17 X ← X\ x LS ; J ( X ) ← J ( X\ x LS ); k ← k − 1 18 end 19 f 1 = 0 20 end */ Term Swapping : Included only in OIF 21 for i = 1 to k do 22 Remov e the i th term, i.e. , ˆ X i ← X\ x i 23 Add the most signiﬁcant term to ˆ X i , i.e. , ˆ X i ← { ˆ X i ∪ x MS } 24 end 25 Select the best swapping subset, i.e. , ˆ X = { ˆ X i | J ( ˆ X i ) = max i = 1: k J ( ˆ X i ) } 26 if J ( ˆ X ) > J ( X k ) then 27 X ← ˆ X ; J ( X ) ← J ( ˆ X ) 28 if k > 2 then 29 Go T o Step 11 for the Backward Elimination 30 end 31 end 32 end Note that the OSF does not include the ‘ T erm Swapping ’ procedure sho wn in Line 21-31. The search process in OSF be gins with an empty term subset and is continued till the term subset with the desired cardinality ‘ ξ ’ is obtained. Each search step essentially in volv es two procedures: inclusion of the most signiﬁcant term followed by the removal of the least signiﬁcant term / terms through backtr acking . This procedure is discussed brieﬂy in the following: Let X k denote a term subset with k number of terms at a particular search step. First, follo wing Deﬁnition 1, the most signiﬁcant term ( x M S ) with respect to X k , is identiﬁed from the pool of av ailable terms. This term is included in the term subset provided it leads to a better criterion function, i.e. , J ( X k ∪ x M S ) > J ( X k + 1 ). This procedure is outlined in Lines 3-9, Algorithm 1. After including a term, the focus is on removing least-signiﬁcant terms present in the 6 Figure 1: Search Behavior of OSF for the numerical example with ξ = 6. Note that the ‘backtracking’ at Step 6 led to better subsets, e.g. , J ( X + 2 ) > J ( X 2 ), J ( X + 3 ) > J ( X 3 ) and so on. T able 1: Search Steps In volv ed in OSF † Step ξ s t e p Subset † J ( · ) 1 1 X 1 = { x 1 } 0.7454 2 2 X 2 = { x 1 , T 1 } 0.8609 3 3 X 3 = { x 1 , x 3 , T 1 } 0.9496 4 4 X 4 = { x 1 , x 3 , x 4 , T 1 } 0.9636 5 5 X 5 = { x 1 , x 3 , x 4 , x 5 , T 1 } 0.9708 6 2 X + 2 = { x 2 , x 5 } 0.8804 7 3 X + 3 = { x 2 , x 3 , x 5 } 0.9795 8 4 X + 4 = { x 2 , x 3 , x 5 , T 2 } 0.9937 9 5 X + 5 = { x 1 , x 2 , x 3 , x 4 , x 5 } 0.9986 10 6 X 6 = { x 1 , x 2 , x 3 , x 4 , x 5 , T 3 } 0.9986 † System T erms: x 1 → c , x 2 → y ( k − 1), x 3 → u ( k − 2), x 4 → y ( k − 2) 2 , x 5 → u ( k − 1) 2 Spurious T erms: T 1 → y ( k − 1) u ( k − 1) 2 , T 2 → y ( k − 1) y ( k − 2) 2 , T 3 → u ( k − 1) u ( k − 2) 2 selected term subset through adaptive backtracking, as sho wn in Lines 10-20, Algorithm 1. It is worth to emphasize that the backtracking continues to remov e the least signiﬁcant term until an improvement in the criterion function is obtained, i.e. , J ( X k \ x LS ) > J ( X k − 1 ). T o gain an insight into the search dynamics of the proposed OSF , consider the following system. S 1 : y ( k ) = 0 . 5 + 0 . 5 y ( k − 1) + 0 . 8 u ( k − 2) − 0 . 05 y ( k − 2) 2 + u ( k − 1) 2 + e ( k ) (10) For identiﬁcation purposes, 1000 input-output data-points are generated by exciting the system S 1 with zero-mean uniform white noise sequence with unit variance, u ∼ W U N (0 , 1) and Gaussian white noise, e ∼ W G N (0 , 0 . 05). The model set of 165 N ARX terms is obtained with the following speciﬁcations: [ n u , n y , n l ] = [4 , 4 , 3]. The OSF is applied to identify six signiﬁcant terms of S 1 , i.e . , ξ = 6. Note that the ﬂoating search methods require a priori speciﬁcation of subset size, ‘ ξ ’. A simple approach to select the appropriate subset size ξ and the associated issues will be discussed in Section 3.5. The search behavior of OSF is explained as follows: In each search step, the term subset, its cardinality and the corresponding criterion function are recorded, which are shown in T able 1 and Fig. 1. For the sake of clarity , the system terms and spurious terms are respecti vely denoted by ‘ x ’ and ‘ T ’. Further, a better subset for the giv en cardinality is annotated by ‘ + ’, e.g . , X + 2 indicates that J ( X + 2 ) > J ( X 2 ). The positi ve e ﬀ ects of backtracking are clearly evident in Fig. 1. The backtracking occurs after the subset X 5 is obtained. Note that the subsets obtained after backtracking ( X + 2 , X + 3 , X + 4 and X + 5 ) yield improved criterion function in comparison to the previous subsets with the similar cardinality ( X 2 , X 3 , X 4 and X 5 ). A closer examination of subsets given in T able 1 reveals that the backtracking could remov e the spurious term ‘ T 1 ’ which is ﬁrst included at Step 2 and remains in the selected subsets till Step 5. Further, the subset containing all the system terms, X + 5 = { x 1 , x 2 , x 3 , x 4 , x 5 } , is obtained during backtracking at Step 6. This subset is ev entually selected at Step 9 when the forward inclusion procedure failed to yield a better subset, i.e. , the search is restored to pre vious best subset (in this case X + 5 , as J ( X + 4 ∪ x M S ) < J ( X + 5 ); see Line 4-9, Algorithm 1). 7 Figure 2: Search Behavior of OIF for the numerical example with ξ = 5. Note the backtracking after X 4 leading to better subsets, X + 2 , X + 3 and X + 4 . T able 2: Search Steps In volv ed in OIF † Step ξ s t e p Subset J ( · ) 1 1 X 1 = { x 1 } 0.7454 2 2 X 2 = { x 1 , T 1 } 0.8609 3 3 X 3 = { x 1 , x 3 , T 1 } 0.9496 4 4 X 4 = { x 1 , x 3 , x 4 , T 1 } 0.9636 5 2 X + 2 = { x 2 , x 5 } 0.8804 6 3 X + 3 = { x 2 , x 3 , x 5 } 0.9795 7 4 X + 4 = { x 2 , x 3 , x 5 , T 2 } 0.9937 8 5 X 5 = { x 1 , x 2 , x 3 , x 4 , x 5 } 0.9986 † System T erms: x 1 → c , x 2 → y ( k − 1), x 3 → u ( k − 2), x 4 → y ( k − 2) 2 , x 5 → u ( k − 1) 2 Spurious T erms: T 1 → y ( k − 1) u ( k − 1) 2 , T 2 → y ( k − 1) y ( k − 2) 2 3.3. Orthogonal Impr oved Floating Sear ch The next algorithm considered in this study is based on the improved ﬂoating search proposed in [25]. This algo- rithm adds a new procedure referred to as ‘ swapping ’ to replace a weak feature, besides retaining both the procedures of the ﬂoating search. The impro ved ﬂoating search is adapted for the structure selection by including the ERR metric associated with orthogonal least square algorithms of [4]. This is referred to as Orthogonal Improved Floating (OIF) search and discussed brieﬂy in the following. The procedures in volv ed in OIF are shown in Algorithm 1. As discussed earlier , OIF retains the ‘ forward in- clusion ’ (Line 3-9, Algorithm 1) and ‘ bac ktrac king ’ (Line 10-20, Algorithm 1) procedures of the OSF and the same discussions are valid here. In addition, the swapping procedure is introduced after backtrac king as shown in Line 21- 31. The objective of this procedure is to replace a weak / non-signiﬁcant term which is e xplained as follo ws: Assume that X k = { x 1 , x 2 , . . . x k } is the term subset obtained after bac ktracking . T o replace a non-signiﬁcant term, sev eral ne w candidate term subsets are generated from X k . The i th candidate term subset (denoted as ˆ X i ) is generated from X k in the following tw o steps: 1. First, the i th term is remov ed from X k , i.e. , ˆ X i ← {X k \ x i } 2. Next, the most signiﬁcant term with respect to ˆ X i is included, i.e. , ˆ X i ← { ˆ X i ∪ x M S } Follo wing these steps, a total of ‘ k ’ candidate term subsets are obtained from X k and the subset with the highest criterion function, J ( · ), is referred to as ‘ swapping subset ’(denoted by ˆ X ). If the swapping subset yields an improved criterion function, then it replaces the current term subset X k and sent for backtracking, as outlined in Line 26-31, Algorithm 1. Thus, OIF takes a two-pronged approach to remove weak / non-signiﬁcant terms: the backtrac king removes the weak term / terms and the swapping replaces a weak term. Especially , the swapping procedure encourages the explo- ration of new term subsets which is likely to yield improv ed search performance in comparison to OSF . T o in vestigate this further, the OIF is applied to identify the signiﬁcant terms of the system S 1 considered in (10) under the similar test conditions. The ﬁnal subsets obtained in each step and the corresponding criterion function are shown in T able 2. The varia- tions in the criterion function are sho wn in Fig. 2. Though, the ov erall search dynamics of both OSF (Fig. 1) and OIF (Fig. 2) may appear similar in nature, it is interesting to note that the backtracking in OIF occurs one step earlier (Step 8 Algorithm 2: Orthogonal Oscillating Search (O 2 S) for Structure Selection Input : Input-output Data, ( u , y ); Number of required terms, ‘ ξ ’ Output: Identiﬁed Structure, X ξ */ Initialize 1 Select the maximum search depth, O ma x 2 Set the search depth, o = 1 and ‘ ﬂags ’, f 1 = 0, f 2 = 0 3 X ξ ← ∅ 4 for i = 1 to ξ do 5 Add the most signiﬁcant term to X ξ , i.e. , X ξ ← {X ξ ∪ x MS } 6 end 7 while o < O ma x do */ Down Swing 8 Remov e ‘ o ’ number of least signiﬁcant terms from X ξ , i.e. , X ξ − o ← X ξ \X LS o 9 Add ‘ o ’ number of most signiﬁcant terms to X ξ − o , i.e. , ˆ X ξ ← {X ξ − o ∪ X MS o } 10 if J ( ˆ X ξ ) > J ( X ξ ) then 11 X ξ ← ˆ X ξ ; J ( X ξ ) ← J ( ˆ X ξ ); f 1 ← 0 */ better subset is found 12 else 13 f 1 ← 1 */ down swing is unsuccessful 14 end 15 if f 1 = 1 ∧ f 2 = 1 then 16 o = o + 1 */ increase search depth 17 end */ Up Swing 18 Add ‘ o ’ number of most signiﬁcant terms to X ξ , i.e. , X ξ + o ← {X ξ ∪ X MS o } 19 Remov e ‘ o ’ number of least signiﬁcant terms from X ξ + o , i.e. , ˆ X ξ ← X ξ + o \X LS o 20 if J ( ˆ X ξ ) > J ( X ξ ) then 21 X ξ ← ˆ X ξ ; J ( X ξ ) ← J ( ˆ X ξ ); f 2 ← 0 */ better subset is found 22 else 23 f 2 ← 1 */ up swing is unsuccessful 24 end 25 if f 1 = 1 ∧ f 2 = 1 then 26 o = o + 1 */ increase search depth 27 end 28 end 5, T able 2) in comparison to OSF (Step 6, T able 1). This could be e xplained by the swapping procedure of OIF which enables the inclusion of missing system term ‘ x 2 ’ one step earlier . 3.4. Orthogonal Oscillating Sear ch The third algorithm considered in this study is the oscillating search introduced by Somol and Pudil in [24] which is the natural successor of the ﬂoating search principle. It retains one of the key properties of ﬂoating search, i.e. , the ability to backtr ack . Unlike the ﬂoating search principle, where the focus is on the individual feature, the oscillating search focuses on a subset of features. The basic search engine in this method is the ‘ swing ’ procedure where the subset under consideration is perturbed by successi ve addition / r emoval and r emoval / addition of multiple features. In this study , the oscillating search is also adapted for the structure selection by the introduction of ‘orthogonalization’ 9 T able 3: Search Steps In volv ed in O 2 S with ξ = 5 and O ma x = 2 Step Search Depth ‘ o ’ ξ s t e p Subset † J ( · ) f 1 f 2 Remark 1 - 5 X 5 = { x 1 , x 3 , x 4 , x 5 , T 1 } 0.97079 0 0 Initial Model 2 1 4 X 4 = { x 1 , x 3 , x 4 , T 1 } Down Swing Did Not Improv e 3 5 X 5 = { x 1 , x 3 , x 4 , x 5 , T 1 } 0.97079 0 0 4 1 6 X 6 = { x 1 , x 2 , x 3 , x 4 , x 5 , T 2 } Up Swing Improv ed 5 5 X + 5 = { x 1 , x 2 , x 3 , x 4 , x 5 } 0 . 99856 1 0 6 1 4 X 4 = { x 2 , x 3 , x 4 , x 5 } Down Swing Did Not Improv e 7 5 X 5 = { x 1 , x 2 , x 3 , x 4 , x 5 } 0.99856 1 0 8 1 6 X 6 = { x 1 , x 2 , x 3 , x 4 , x 5 , T 2 } Up Swing Did Not Improv e 9 5 X 5 = { x 1 , x 2 , x 3 , x 4 , x 5 } 0.99856 1 1 10 2 3 X 3 = { x 2 , x 3 , x 5 } Down Swing Did Not Improv e 11 5 X 5 = { x 2 , x 3 , x 5 , T 3 , T 4 } 0.99505 1 1 † System T erms: x 1 → c , x 2 → y ( k − 1), x 3 → u ( k − 2), x 4 → y ( k − 2) 2 , x 5 → u ( k − 1) 2 Spurious T erms: T 1 → y ( k − 1) u ( k − 1) 2 , T 2 → u ( k − 1) u ( k − 2) 2 , T 3 → y ( k − 1) y ( k − 2) 2 , T 4 → y ( k − 2) 3 and referred to as Orthogonal Oscillating Search (O 2 S). The steps in volved in O 2 S are outlined in Algorithm 2 and discussed brieﬂy in the following: Let ‘ ξ ’ denote the required number of terms. The search is initialized with a subset, X ξ , containing ξ number of terms. This initial subset is obtained by successi vely adding the most signiﬁcant terms (See Deﬁnition 1) as outlined Line 4-6, Algorithm 2. Subsequently , this subset is perturbed throughout the search process by the ‘ swing ’ procedures. The rationale behind the swing procedure is that a better subset could be obtained by replacing multiple (say , ‘ o ’) number of weak or non-signiﬁcant terms present in the current subset with the similar number of rele vant or signiﬁcant terms. This could be achiev ed by the following two ‘swing’ procedures: 1) Do wn Swing and 2) Up Swing. During the ‘ down swing ’, at ﬁrst ‘ o ’ number of least signiﬁcant terms in X ξ are remov ed which yields a subset of ( ξ − o ) terms, X ξ − o . Subsequently , ‘ o ’ number of most signiﬁcant terms are added to X ξ − o to yield a ne w , and possibly a better subset of ξ terms, ˆ X ξ . This new subset is retained provided the criterion function is improved. The down swing procedure is outlined in Line 8-14, Algorithm 2. The Up-swing procedure, as the name suggests, ﬁrst adds ‘ o ’ number of most signiﬁcant terms to X ξ to yield a subset X ξ + o . Next, a ne w subset ˆ X ξ is obtained by removing the o number of least signiﬁcant terms from X ξ + o . The other aspects are similar to the down swing. This procedure is outlined in Line 18-24, Algorithm 2. It is clear that both swing procedures require identiﬁcation of the follo wing two subsets with respect to the term subset under consideration: the subset of ‘ o ’ most signiﬁcant terms (denoted as X M S o ; see Deﬁnition 3) and the subset containing ‘ o ’ weak / non-signiﬁcant terms (denoted as X LS o ; see Deﬁnition 4). In this study , X M S o is identiﬁed using OSF search procedure described in Section 3.2. The set of weak terms X LS o is identiﬁed using the top-down or backwar d sear ch v ariant of OSF . 10 Figure 3: The v ariation in subset cardinality during the O 2 S Search Further , the ‘search depth’ (denoted by ‘ o ’) determines the number of terms which are either to be added or to be remov ed in a particular swing. The search depth is adaptiv e and its value ‘ o ’ is dependent on the search dynamics. The search begins with o = 1 and if two successi ve swings fail to identify a better term subset then it is incremented by 1. The search depth is reset to 1 whenev er swing leads to an improved subset. The search is terminated when ‘ o ’ reaches to a pre-speciﬁed maximum search depth, O ma x . Consequently , the depth of search can easily be controlled by varying O ma x . This is one of the distinct and important ability of O 2 S which regulates the search e ﬀ orts as per the prev ailing requirement. T o analyze the search behavior of O 2 S, it is applied to identify system terms of the system S 1 considered in the numerical example (10) with the following speciﬁcations: ξ = 5 and O ma x = 2. During this search for system terms, the term subset and the corresponding criterion function obtained in each step are recorded and shown here in T able 3. The change in subset cardinality during the search process is shown in Fig. 3. The search is initialized by forward inclusion of 5 terms, as per Line 4-6, Algorithm 2. This initial subset selects one spurious term, instead of the system term x 2 , as shown in Step 1, T able 3. The search begins by the down swing procedure which does not improve the subset, as seen in Step 3, T able 3. Ho wev er , in the subsequent up-swing, a better term subset, X + 5 , is identiﬁed, i.e. , J ( X + 5 ) > J ( X 5 ). Note that at this stage, all the system terms hav e been identiﬁed and the spurious term is discarded, as seen in Step 5, T able 3. The subsequent down swing (Step 7) and up swing (Step 9) identiﬁes the same subset with no improvement in the criterion function, J ( · ). Consequently , the search depth is increased by ‘1’, i.e. , o = 2. The search terminates at Step 11 as the down swing could not ﬁnd a better subset and any further increase in search depth will lead to o > O ma x . Note that the variation in subset cardinality during the aforementioned swing procedures are clearly visible in Fig. 3. Especially , see the drop in subset cardinality at Step 10. The increase in search depth at this step requires the search for the term subset with { ξ − o } = 3 number of terms. 3.5. Selection of Subset Car dinality (Model Or der Selection) The ﬂoating search algorithms require the speciﬁcation of ‘subset cardinality’ or number of terms to be identi- ﬁed. Giv en that this is not known a priori , in this study , the following procedure is followed to estimate the subset cardinality or ‘ model or der ’. For the system under consideration, a search algorithm is applied to identify several term subsets of increasing cardinality in a predeﬁned search interval, denoted by [ ξ min , ξ ma x ]. A set of subsets thus obtained can be used to estimate the model order using an appropriate Information Criterion (IC). The objecti ve here is to locate a ‘ plateau ’ 11 Figure 4: The selection of subset cardinality or ‘model order’. ξ ? denotes the known cardinality of the system. Note that the knee-point for all criteria is obtained at ξ = 5. or ‘ knee-point ’ in the information criteria which would indicate an acceptable compromise for the bias-variance dilemma. The term subsets corresponding to such knee-point or plateau can be selected as the system model. T o further understand this procedure, consider the system S 1 in numerical example (10). F or this system, OIF is applied to identify term subsets of increasing cardinality in the range of [ ξ min , ξ ma x ] = [2 , 20]. Thus, a family identiﬁed term subsets, denoted by ‘ Ω ’, is obtained as follows: Ω = {X ξ min , X ξ min + 1 , . . . , X ξ max } (11) Next, for each identiﬁed subset, X ξ ∈ Ω , the various information criteria given in Appendix A are determined. Fig. 4 shows the variation in information criteria as cardinality is varied from ξ min to ξ ma x . It is observed that the ‘ knee-point ’ for all the criteria is obtained at ξ = 5. This coincides with the actual cardinality of the system S 1 . In sev eral comparativ e in vestigations [18, 26], the Bayesian Information Criterion (BIC) was found to be com- parativ ely robust among the existing information criteria. Therefore, for the remainder of the study , the cardinality corresponding to the minimum BIC is selected as the model order . Although following this approach the correct model order could be obtained for all test system (as will be shown in Section 5.1), in practice, we recommend the closer inspection of se veral term subsets surrounding the plateau or knee-point; as the existing research indicated that the information criteria tend to ov er-ﬁt [26]. Since the existing research suggest that the dynamics of many nonlinear systems can be captured with 15 or fewer signiﬁcant terms [10], a pragmatic approach is followed in this study to ﬁx the search interval of the cardinality as follows: [ ξ min , ξ ma x ] = [2 , 20]. This can be considered as a thumb-rule. Howe ver , some exception may exist in certain non-linear systems. A simple procedure to determine the cardinality interv al in such a scenario is discussed in Section 5.2. 4. In vestigation Framework In the following, the frame work of this in vestigation is discussed. The o verall procedure follo wed for the structure selection is shown in Algorithm 3. The test nonlinear system considered in this study are discussed in Section 4.1. 12 Algorithm 3: Orthogonal Floating Structure Selection Input : Input-output Data, ( u , y ) Output: Identiﬁed Model, Structure and Coe ﬃ cients */ Data Pre-processing 1 Generate set of model terms by specifying n u , n y and n l of the N ARX model */ Search for the system structure 2 Select a orthogonal ﬂoating search algorithm: OSF , OIF (Algorithm 1) or O 2 S (Algorithm 2) 3 Select the search interval, [ ξ min , ξ ma x ] 4 Ω ← ∅ 5 for eac h ‘k’, k ∈ [ ξ min , ξ ma x ] do 6 Identify subset of ‘ k ’ signiﬁcant terms, X k 7 Ω ← { Ω ∪ X k } 8 end 9 Select the term-subset ( X alg ) with the minimum BIC, i.e. , X alg = {X i | BI C ( X i ) = arg min BI C ( X i ) , ∀ X i ∈ Ω } Further , the possible search outcomes are discussed in Section 4.2, to ev aluate the term subset found by the search algorithms qualitativ ely . 4.1. T est Non-linear Systems In this study , a total of 6 non-linear system have been selected from the existing in vestigations on structure se- lection [6, 7, 12, 13, 27, 28] and shown here in T able 4. The systems are excited by a white noise sequence having either uniform or Gaussian distribution as sho wn in T able 4. For identiﬁcation purposes, a total of 1000 input-output data points, ( u , y ), are generated from each system. The structure selection is performed following the principle of cr oss-validation ; where 700 data points are selected for the estimation purpose and the remaining data points are used for v alidation, i.e. , N v = 300. For each system, the N ARX model is generated by setting the input-output lags and the degree of non-linearity to: [ n u , n y , n l ] = [4 , 4 , 3]. This gives the N ARX model set, X model , with a total of 165 terms following (2), i.e . , n = 165. T able 4: T est Non-linear Systems System Known Structure Input ( u ) † Noise ( e ) † S 1 y ( k ) = 0 . 5 + 0 . 5 y ( k − 1) + 0 . 8 u ( k − 2) + u ( k − 1) 2 − 0 . 05 y ( k − 2) 2 + e ( k ) WUN(0 , 1) WGN(0 , 0 . 05) S 2 y ( k ) = 0 . 5 y ( k − 1) + 0 . 3 u ( k − 1) + 0 . 3 u ( k − 1) y ( k − 1) + 0 . 5 u ( k − 1) 2 + e ( k ) WUN(0 , 1) WGN(0 , 0 . 002) S 3 y ( k ) = 0 . 8 y ( k − 1) + 0 . 4 u ( k − 1) + 0 . 4 u ( k − 1) 2 + 0 . 4 u ( k − 1) 3 + e ( k ) WGN(0 , 1) WGN(0 , 0 . 33 2 ) S 4 y ( k ) = 0 . 1586 y ( k − 1) + 0 . 6777 u ( k − 1) + 0 . 3037 y ( k − 2) 2 − 0 . 2566 y ( k − 2) u ( k − 1) 2 − 0 . 0339 u ( k − 3) 3 + e ( k ) WUN(0 , 1) WGN(0 , 0 . 002) S 5 y ( k ) = 0 . 7 y ( k − 1) u ( k − 1) − 0 . 5 y ( k − 2) + 0 . 6 u ( k − 2) 2 − 0 . 7 y ( k − 2) u ( k − 2) 2 + e ( k ) WUN( − 1 , 1) WGN(0 , 0 . 004) S 6 y ( k ) = 0 . 2 y ( k − 1) 3 + 0 . 7 y ( k − 1) u ( k − 1) + 0 . 6 u ( k − 2) 2 − 0 . 7 y ( k − 2) u ( k − 2) 2 − 0 . 5 y ( k − 2) + e ( k ) WUN( − 1 , 1) WGN(0 , 0 . 004) † WUN( a , b ) denotes white uniform noise sequence in the interval [ a , b ]; WGN( µ, σ ) denotes white Gaussian noise sequence with the mean ‘ µ ’ and the variance ‘ σ ’. 13 4.2. Sear ch Outcomes For comparati ve ev aluation purposes, the term subset found by each search algorithm is qualitatively ev aluated. For this purpose, the follo wing term sets are deﬁned with reference to the N ARX model gi ven by (3), • X model : the set containing all terms of the NARX model • X ? : the optimum term subset or the set of system terms, X ? ⊂ X model • X alg : subset of terms identiﬁed by the search algorithm, X alg ⊂ X model • X s pur : set of spurious terms which are selected by the search algorithm, b ut are not present in the actual system, i.e. , X s pur = X alg \ X ? • ∅ : the null set On the basis of these deﬁnitions, one of the following four search outcome can be e xpected: 1. Identiﬁcation of the Correct Structur e ( Exact F itting ) : In this scenario the identiﬁed model contains all the system terms and does not include any spurious terms, i.e. , X alg = X ? and X s pur = ∅ 2. Over Fitting : The identiﬁed model contains all the system terms; however spurious terms are also selected , i.e. , X alg ⊃ X ? and X s pur , ∅ 3. Under F itting-1 : The algorithm fails to identify all the system terms; though it does not include any spurious terms , i.e. , X alg ⊂ X ? and X s pur = ∅ 4. Under F itting-2 : The algorithm fails to identify all the system terms; however spurious terms are selected , i.e. , X alg 2 X ? and X s pur , ∅ Thus, qualitatively , the search is successful when all the system / signiﬁcant terms are identiﬁed, i.e. , when the outcome is either Exact-F itting or Over-F itting . Note that at this stage, the structure identiﬁed by the algorithms are unaltered. The inclusion of few spurious terms can be tolerated provided that all signiﬁcant terms are included ( i.e. Over -Fitting scenario) as the spurious terms can easily be identiﬁed and removed through a simple null-hypothesis test on the corresponding coe ﬃ cients. 5. Results The goal of this study is to inv estigate the suitability of existing feature selection algorithms for the task of non- linear system identiﬁcation. For this purpose, 3-well known ﬂoating search algorithms hav e been adapted in the proposed orthogonal ﬂoating search frame work: OSF , OIF and O 2 S, which are discussed in Section 3. The adapted algorithms are initially applied to identify the signiﬁcant terms of 6 test non-linear systems described in Section 4.1. The search performance of the algorithms is compared on these systems from v arious perspecti ves in Section 5.1. The issues related to selection of cardinality interval [ ξ min , ξ ma x ] are discussed in Section 5.2. Further , it is w orth to note that while the proposed orthogonal ﬂoating search can work with any suitable criterion function, in this study , the Error-Reduction-Ratio (ERR) has been selected for this purpose due to its relati ve simplic- ity . Ho wev er , ERR is often criticized for its ‘ local ’ nature and blamed for the unsatisfactory search outcome. This issue is in vestigated via a numerical e xample in Section 5.3. 14 (a) S 1 (b) S 2 (c) S 3 (d) S 4 (e) S 5 (f) S 6 Figure 5: Model Order Selection. ‘ ξ ? ’ denotes the known cardinality of the system. T able 5: Search Outcomes † System ξ ? OSF OIF O 2 S ξ al g Outcome ξ al g Outcome ξ al g Outcome S 1 5 6 Over-Fitting 5 Exact-Fitting 5 Exact-Fitting S 2 4 4 Exact-Fitting 4 Exact-Fitting 4 Exact-Fitting S 3 4 4 Exact-Fitting 4 Exact-Fitting 4 Exact-Fitting S 4 5 5 Exact-Fitting 5 Exact-Fitting 5 Exact-Fitting S 5 4 4 Exact-Fitting 4 Exact-Fitting 4 Exact-Fitting S 6 5 5 Exact-Fitting 5 Exact-Fitting 5 Exact-Fitting † See Section 4.2 for the deﬁnitions of search outcomes Finally , the proposed orthogonal ﬂoating algorithms are applied to identify a discrete model of a continuous time system. The identiﬁed discrete model is v alidated through generalized frequency response functions. This part of the in vestigation is discussed in Section 5.4. 5.1. Comparative Evaluation The objectiv e of this part of the study is to compare the performance of orthogonal ﬂoating search algorithms. Follo wing the procedure outlined Algorithm 3, the signiﬁcant terms of the system shown in T able 4 are identiﬁed. It is worth to note that while OSF and OIF do not have any control parameter , O 2 S requires the selection of maximum search depth, O ma x . In this study , a maximum of 25% terms is allo wed to be exchanged, i.e. , O ma x = d 0 . 25 × min { ξ , n − ξ }e . The outcomes of model order selection are shown in Fig 5. Note that the ‘knee-point’ in BIC coincides with 15 (a) S 1 + OS F (b) S 1 + O I F (c) S 1 + O 2 S (d) S 2 + OS F (e) S 2 + O I F (f) S 2 + O 2 S Figure 6: T erm Selection Beha vior-I. ξ ? denotes actual cardinality of the system. The ‘dotted’ vertical line indicate the v alue ξ when all the system terms are identiﬁed. the known cardinality (actual number of terms) of the systems. The selected cardinality (number of terms) for each system is shown in T able 5 along with the qualitativ e search outcomes. It is interesting to see that the best possible search outcome, ‘ Exact-F itting ’, is obtained for all combinations of system and search algorithm, except one. For the system S 1 , an ov er-ﬁtted model (with one spurious term) is selected by OSF as seen in Fig. 5a and T able 5. Note that this spurious term can easily be removed via a simple null-hypothesis test. Nev ertheless, it is pertinent to identify the cause of this ‘ov er-ﬁtting’ which could be ascribed either to the information criterion or to the search algorithm. T o in vestigate this further , the term selection behavior of the algorithms is determined. In particular , as the car- dinality , ξ , is increased progressively from ξ min to ξ ma x , the selection frequency of system terms is observed over the subsets identiﬁed by the algorithm, as follows: τ i ,ξ =          1 , if x i ∈ X ξ 0 , otherwise , ∀ X ξ ∈ Ω (12) where, ‘ τ i ,ξ ’ encodes whether the i th term, ‘ x i ’, is included in the term subset X ξ identiﬁed by the algorithm. Following this procedure, the term selection behavior is determined for all algorithms and sho wn here in Fig. 6, 7 and 8. W e ﬁrst focus on the ‘ o ver-ﬁtting ’ outcome by OSF on S 1 . The term selection by OSF for S 1 is depicted in Fig. 6a. It is observed that all the system terms are ﬁrst identiﬁed at ξ = 6 which is slightly higher than ξ ? . This explains the ‘ov er-ﬁtting’ outcome. For the same system, OIF (Fig. 6b) and O 2 S (Fig. 6c) could identify all system terms from ξ = ξ ? . This further underlines the e ﬃ cacy of the ‘ term swapping ’ procedure introduced in OIF . Further , although the BIC could identify correct model order for all the systems in this study , the possibility 16 (a) S 3 + OS F (b) S 3 + O I F (c) S 3 + O 2 S (d) S 4 + OS F (e) S 4 + O I F (f) S 4 + O 2 S Figure 7: T erm Selection Beha vior-II. ξ ? denotes actual cardinality of the system. The ‘dotted’ vertical line indicate the v alue ξ when all the system terms are identiﬁed. of ‘over -ﬁtting’ cannot be ruled out. W e therefore focus on the search behavior for ξ > ξ ? . It is easier to follow that all the system terms must be included in the term subsets with cardinality ξ ≥ ξ ? . Thus it is desirable to hav e τ i ,ξ = 1 , ∀ ξ ≥ ξ ? . If this condition is satisﬁed then the correct system structure can still be obtained from the over - ﬁtted structures through an appropriate secondary reﬁnement procedure, e.g. , null-hypothesis test on the coe ﬃ cients. The results show that this condition is satisﬁed for all systems by OIF and O 2 S, as seen in Fig. 6, 7 and 8. OSF could also satisfy this condition for all the systems except S 1 . 5.2. Comments on the Interval Selection The objective of this part of the study is to discuss the issues related to the selection of the cardinality interval. Giv en that the cardinality of the system under consideration is not kno wn a priori , a careful selection of the cardinality interval is crucial. In the pre vious subsections, a pragmatic approach is followed to select the cardinality interv al, i.e. , [ ξ min , ξ ma x ] = [2 , 20]; since in practice, ov er-parameterized models with higher cardinality are usually not desirable. Howe ver , the selection of cardinality interval is system dependent. Therefore, the cardinality interval should be selected judiciously , especially for the higher order systems. In order to gain further insight into this issue, consider 17 (a) S 5 + OS F (b) S 5 + O I F (c) S 5 + O 2 S (d) S 6 + OS F (e) S 6 + O I F (f) S 6 + O 2 S Figure 8: T erm Selection Behavior-III. ξ ? denotes actual cardinality of the system. The ‘dotted’ vertical line indicate the v alue ξ when all the system terms are identiﬁed. the following nonlinear system with 23 terms, S 7 : y ( k ) = 0 . 8833 u ( k − 1) + 0 . 0393 u ( k − 2) + 0 . 8546 u ( k − 3) + 0 . 8528 u ( k − 1) 2 (13) + 0 . 7582 u ( k − 1) u ( k − 2) + 0 . 1750 u ( k − 1) u ( k − 3) + 0 . 0864 u ( k − 2) 2 + 0 . 4916 u ( k − 2) u ( k − 3) + 0 . 0711 u ( k − 3) 2 − 0 . 0375 y ( k − 1) − 0 . 0598 y ( k − 2) − 0 . 0370 y ( k − 3) − 0 . 0468 y ( k − 4) − 0 . 0476 y ( k − 1) 2 − 0 . 0781 y ( k − 1) y ( k − 2) − 0 . 0189 y ( k − 1) y ( k − 3) − 0 . 0626 y ( k − 1) y ( k − 4) − 0 . 0221 y ( k − 2) 2 − 0 . 0617 y ( k − 2) y ( k − 3) − 0 . 0378 y ( k − 2) y ( k − 4) − 0 . 0041 y ( k − 3) 2 − 0 . 0543 y ( k − 3) y ( k − 4) − 0 . 0603 y ( k − 4) 2 + e ( k ) For the identiﬁcation purposes, 1000 input-output data points are gathered with u ∼ W U N (0 , 1) and e ∼ W G N (0 , 0 . 01 2 ). A total of 286 candidate terms are obtained with the following speciﬁcation for the NARX model: [ n u , n y , n l ] = [5 , 5 , 3]. Initially , the system model is identiﬁed follo wing the similar procedure, where the cardinality interval is speciﬁed as: [ ξ min , ξ ma x ] = [2 , 20]. The outcome of the cardinality selection for this interval is shown in Fig. 9a which sho ws that the BIC is minimum at ξ = 20. Howe ver , it is not certain whether this is the desired ‘knee-point’. Therefore, the cardinality interval is increased to ξ ma x = 50. The consequent outcome is sho wn in Fig. 9b. This result indicates a continual increase in BIC for ξ ≥ 30 and thereby conﬁrms that the plateau around ξ ∈ [22 , 29] does contain the desired knee-point. As illustrated by this numerical example, the cardinality interval can easily be adjusted by observing the con- 18 (a) ξ ma x = 20 (b) ξ ma x = 50 Figure 9: Determination of Cardinality Interv al sequent ‘ ξ vs. BIC’ relationship. W e recommend the OSF to determine / adjust the cardinality interv al due to its inexpensi ve computational requirements. 5.3. Orthogonal Floating Sear ch Under Slow V arying Input As discussed in Section 3.1, the present inv estigation uses the Error-Reduction-Ratio (ERR), instead of the other performance metrics, such as simulation error, ultra least square criterion. Howe ver , in recent years, the OFR-ERR has been criticized for leading to suboptimal search performance under certain scenarios [6, 7, 13, 15], where it is argued that this is due to the ERR criterion. Howe ver , the results of this inv estigation on se veral benchmark nonlinear systems con vincingly demonstrate that ERR could be an e ﬀ ectiv e metric, pro vided it is paired with an appropriate search strate gy . Thus, it can be ar gued that the suboptimal performance of OFR-ERR may be the consequence of the search strategy and not of the performance metric. Note that OFR-ERR is essentially a unidirectional sequential search approach which leads to the ‘ nesting e ﬀ ect ’, i.e . , once a term is included it cannot be discarded. Thus the search performance may signiﬁcantly be improved if the nesting e ﬀ ect could be alleviated. The proposed orthogonal search framework e ﬀ ecti vely addresses this issue and therefore it can yield better search performance ev en with a relati vely simple metric such as ERR. T o demonstrate this, we consider the follo wing numerical example, which has been used by various researchers for comparativ e in vestigation [13, 15, 16]: S 8 : w ( k ) = u ( k − 1) + 0 . 5 u ( k − 2) + 0 . 25 u ( k − 1) u ( k − 2) − 0 . 3 u ( k − 1) 3 (14) y ( k ) = w ( k ) + 1 1 − 0 . 8 z − 1 e ( k ) , e ( · ) ∼ W G N (0 . 02) with u ( k ) = 0 . 3 1 − 1 . 6 z − 1 + 0 . 64 z − 2 v ( k ) , v ( · ) ∼ W G N (0 , 1) where, ‘ y ( · )’ is the observ ed value of ‘ w ( · )’. The system is excited with the slow-v arying input u ( · ) which is an AR process with a repeat pole at 0.8. A total of 165 candidate terms are obtained by the following N ARX speciﬁcations: [ n u , n y , n l ] = [4 , 4 , 3]. The system is identiﬁed by the follo wing three algorithms: OFR-ERR, OIF and O 2 S. The search dynamics of the compared algorithms are recorded and are sho wn in T able 6 (OFR-ERR), T able 7 (OIF) and T able 8 (O 2 S). Note that, initially , all the algorithms selected the autoregressiv e terms ‘ y ( k − 1)’ and ‘ y ( k − 2)’. As seen in T able 6, OFR-ERR could not remove the autoregressiv e terms due to the nesting e ﬀ ect . Furthermore, sev eral other spurious terms have 19 T able 6: OFR-ERR Search Dynamics for S † 8 ξ Subset ‡ J ( · ) 1 { y ( k − 1) } 0.87426 2 { y ( k − 1) , y ( k − 2) } 0.92403 3 { y ( k − 1) , y ( k − 2) , u ( k − 1) } 0.94207 4 { y ( k − 1) , y ( k − 2) , u ( k − 1) , u ( k − 1) 2 } 0.98738 5 { y ( k − 1) , y ( k − 2) , u ( k − 1) , u ( k − 1) 2 , u ( k − 1) 3 } 0.99383 6 { y ( k − 1) , y ( k − 2) , u ( k − 1) , u ( k − 1) 2 , u ( k − 1) 3 , u ( k − 2) 3 } 0.99428 7 { y ( k − 1) , y ( k − 2) , u ( k − 1) , u ( k − 1) 2 , u ( k − 1) 3 , u ( k − 2) 3 , u ( k − 3) 3 } 0.99871 † spurious terms are shown in bold face ‡ The following system terms ha ve not been included: u ( k − 2), u ( k − 1) u ( k − 2) T able 7: OIF Search Dynamics for S † 8 Step ξ s t e p Subset J ( · ) 1 1 { y ( k − 1) } 0.87418 2 2 { y ( k − 1) , y ( k − 2) } 0.92385 3 3 { y ( k − 1) , u ( k − 1) 3 , u ( k − 2) 3 } 0.98635 4 3 { u ( k − 1) , u ( k − 1) u ( k − 2) , u ( k − 1) 3 } 0.99632 5 4 { u ( k − 1) , u ( k − 2) , u ( k − 1) u ( k − 2) , u ( k − 1) 3 } 0.99757 † spurious terms are shown in bold face T able 8: O 2 S Search Dynamics for S † 8 Step Search Depth ‘ o ’ ξ s t e p Subset J ( · ) Remarks 1 - 4 { y ( k − 1) , y ( k − 2) , u ( k − 1) 3 , u ( k − 2) 3 } 0.98717 Initial Model 2 1 3 { u ( k − 1) , u ( k − 1) u ( k − 2) , u ( k − 1) 3 } 0.99632 Down Swing Improv ed 3 4 { u ( k − 1) , u ( k − 2) , u ( k − 1) u ( k − 2) , u ( k − 1) 3 } 0.99757 4 1 5 { y ( k − 1) , u ( k − 1) , u ( k − 2) , u ( k − 1) u ( k − 2) , u ( k − 1) 3 } 0.99758 Up Swing Did Not Improv e 5 4 { u ( k − 1) , u ( k − 2) , u ( k − 1) u ( k − 2) , u ( k − 1) 3 } 0.99757 6 1 3 { u ( k − 1) , u ( k − 1) u ( k − 2) , u ( k − 1) 3 } 0.99632 Down Swing Did Not Improv e 7 4 { u ( k − 1) , u ( k − 2) , u ( k − 1) u ( k − 2) , u ( k − 1) 3 } 0.99757 † spurious terms are shown in bold face been selected instead of two system terms: ‘ u ( k − 2)’ and ‘ u ( k − 1) u ( k − 2)’. In contrast, OIF could easily remo ve the autoregressi ve terms through swapping procedure (see Step 3 and 4, T able 7). Similarly , the autoregressiv e terms are replaced by the system terms and the correct structure is identiﬁed in the ﬁrst down-swing procedure of O 2 S (see Step 3, T able 8). 20 It is obvious that the ﬂoating frame work enabled OIF and O 2 S to identify a better subset for the giv en cardinality . For example, all the system terms are identiﬁed by OIF and O 2 S at ξ = 4 with the criterion function J ( · ) = 0 . 99757 (see T able 7-8). In contrast, OFR-ERR identiﬁed only one system term at ξ = 4. This is further reﬂected in the relativ ely lower value of the criterion function, J ( · ) = 0 . 98738 (see T able 6). This example thus clearly demonstrate that ERR could be an e ﬀ ectiv e performance metric when it is paired with a proper search strategy . It is worth to emphasize that, for identiﬁcation purposes, the input signal to the system should be persistently exciting in nature. The example considered in (14) is the e xception to this fundamental fact, and it represents a worst- case scenario. In practice, such data should not be used for identiﬁcation purposes. This example has been considered only for comparativ e analysis. 5.4. Discr ete models of Continuous-time Systems The e ﬀ ectiv eness of the proposed framew ork is further demonstrated by ﬁtting a polynomial N ARX model to a continuous-time system. Since a discrete model of a continuous time system is not unique, the identiﬁed discrete-time model is validated by comparing Generalized Frequency Response Functions (GFRFs). Note that if the identiﬁed model has captured the system dynamics, the corresponding GFRFs should match. For this purpose, the Du ﬃ ng ’ s Oscillator is considered, which is gi ven by , ¨ y ( t ) + 2 ζ ω n ˙ y ( t ) + ω 2 n y ( t ) + ω 2 n ε y ( t ) 3 − u ( t ) = 0 (15) where, ω n = 45 π , ζ = 0 . 01 and ε = 3. The input-output data are generated by exciting the system with u ( · ) ∼ W U N (0 , 1) and the output is sampled at 500 H z . A total of 1000 data points are generated. The discrete NARX model is ﬁtted using 700 data points, and the remaining 300 data points are used for v alidation. The super-set containing a total of 286 N ARX terms is obtained by the following speciﬁcations: [ n u , n y , n l ] = [5 , 5 , 3]. The following model is identiﬁed by all the proposed algorithms OIS, OIF and O 2 S: y ( k ) = 1 . 9152 y ( k − 1) − 0 . 99436 y ( k − 2) + 1 . 983 × 10 − 6 u ( k − 1) (16) + 1 . 9792 × 10 − 6 u ( k − 2) − 0 . 34289 y ( k − 1) 3 + 0 . 22813 y ( k − 2) y ( k − 1) 2 − 1 . 079 × 10 − 7 u ( k − 1) y ( k − 1) 2 − 0 . 11847 y ( k − 1) y ( k − 2) 2 The ﬁtting error obtained ov er the validation data points is 3 . 96 × 10 − 6 . The linear and the third order GFRFs of the continuous system are shown in Fig. 10a, 10c and 10e. As indicated by these results, the frequency response of the Du ﬃ ng’ s oscillator is characterized by the presence of se veral ‘ridges’ and ‘peaks’. The ridges occur when the system is excited by the linear resonant frequency ( i.e . , f 1 , f 2 or f 3 = ± 22 . 5 Hz) or when the follo wing condition is satisﬁed: f 1 + f 2 + f 3 = ± 22 . 5. These ridges are sho wn by ‘dotted’ lines in the contour plot shown in Fig. 10e. Further , the peaks occur whenev er all frequencies are equal to ± 22 . 5. Note that these observations are in line with the earlier in vestigation [29]. The GFRFs obtained with the discrete identiﬁed model are shown in Fig. 10b, 10d and 10f. It is clear that the identiﬁed model could capture all the characteristics features of the Du ﬃ ng’ s oscillator . 6. Discussion & Conclusions A ne w Orthogonal Floating Search Framework has been proposed to identify signiﬁcant terms of polynomial N ARX models. The proposed framework integrates the classical ERR metric with ﬂoating search algorithms which retains the simplicity of OFR while eliminating its nesting e ﬀ ects. Follo wing this framework, three well-known 21 (a) Linear GFRF (continuous) (b) Linear GFRF (discrete) (c) Third Order GFRF (continuous) (d) Third Order GFRF (discrete) (e) Third Order GFRF Contour (continuous) (f) Third Order GFRF Contour (discrete) Figure 10: Linear and Third Order Frequency Response of Du ﬃ ng’ s oscillator . For the third order GFRF f 3 = f 1 . Note that in the contour plots ridges align at f 1 + f 2 + f 3 = ± 22 . 5. feature selection algorithms, such as SFFS, IFFS and OS, hav e been adapted to solve the structure selection problem. The e ﬀ ecti veness of this approach has been demonstrated considering se veral benchmark nonlinear systems including a system excited by a slowly varying input. The later system is often considered as a worst-case scenario. Further , a discrete-time model is ﬁtted to Du ﬃ ng’ s oscillator and it is validated by comparing generalized frequency response functions. These results convincingly demonstrate that the ﬂoating feature selection algorithms can easily be tailored 22 for the purpose of structure selection in nonlinear system identiﬁcation. For example, relati vely simple procedures of the ﬂoating search, such as bac ktrac king , term swapping and swing , can e ﬀ ecti vely counter the nesting e ﬀ ect and thereby yield signiﬁcant improvement in the search performance (as shown in the results). The results of this in vestigation support the argument that the suboptimal performance of OFR-ERR is due in major part to its poor search strategy . Appendix A. Information Criteria Let X ξ , ξ ∈ [ ξ min , ξ ma x ] denote a term subset identiﬁed the search algorithm. For this subset and the corresponding model- pr edicted output ( ˆ y ), the information criteria are giv en by , • Akaike Information Criterion (AIC): AI C ( X ξ ) = N v ln( E ) + %ξ, with % = 2 (A.1) • Bayesian Information Criterion (BIC): BI C ( X ξ ) = N v ln( E ) + ln( N v ) ξ (A.2) • Final Prediction Error (FPE): F PE ( X ξ ) = N v ln( E ) + N v ln( N v + ξ N v − ξ ) (A.3) • Khundrin’ s Law of Iterated Logarithm Criterion (LILC): L I L C ( X ξ ) = N v ln( E ) + 2 ξ ln ln N v (A.4) where,‘ N v ’ denotes the length of the validation data; E = 1 N v N v P k = 1 [ y k − ˆ y k ] 2 References [1] S. A. Billings, Nonlinear system identiﬁcation: NARMAX methods in the time, frequency , and spatio-temporal domains, John Wile y & Sons, 2013. [2] L. Ljung, System Identiﬁcation : Theory for the User , Prentice Hall PTR, Upper Saddle Riv er , NJ, USA, 1999. [3] X. Hong, R. J. Mitchell, S. Chen, C. J. Harris, K. Li, G. W . Irwin, Model selection approaches for non-linear system identiﬁcation: a revie w , International journal of systems science 39 (10) (2008) 925–946. [4] S. A. Billings, S. Chen, M. J. K orenberg, Identiﬁcation of MIMO non-linear systems using a forward-regression orthogonal estimator , International Journal of Control 49 (6) (1989) 2157–2189. [5] T . Baldacchino, S. R. Anderson, V . Kadirkamanathan, Structure detection and parameter estimation for NARX models in a uniﬁed EM framew ork, Automatica 48 (5) (2012) 857–865. [6] T . Baldacchino, S. R. Anderson, V . Kadirkamanathan, Computational system identiﬁcation for bayesian NARMAX modelling, Automatica 49 (9) (2013) 2641–2651. [7] A. Falsone, L. Piroddi, M. Prandini, A randomized algorithm for nonlinear model structure selection, Automatica 60 (2015) 227–238. [8] X. T ang, L. Zhang, X. W ang, Sparse augmented lagrangian algorithm for system identiﬁcation, Neurocomputing. [9] M. Korenber g, S. Billings, Y . Liu, P . McIlroy , Orthogonal parameter estimation algorithm for non-linear stochastic systems, International Journal of Control 48 (1) (1988) 193–210. [10] S. Chen, S. A. Billings, W . Luo, Orthogonal least squares methods and their application to non-linear system identiﬁcation, International Journal of control 50 (5) (1989) 1873–1896. [11] N. Chiras, C. Evans, D. Rees, Nonlinear gas turbine modeling using NARMAX structures, IEEE Transactions on Instrumentation and Measurement 50 (4) (2001) 893–898. [12] K. Mao, S. Billings, Algorithms for minimal model structure detection in nonlinear dynamic system identiﬁcation, International journal of control 68 (2) (1997) 311–330. [13] L. Piroddi, W . Spinelli, An identiﬁcation algorithm for polynomial N ARX models based on simulation error minimization, International Journal of Control 76 (17) (2003) 1767–1781. [14] H. L. W ei, S. A. Billings, Model structure selection using an integrated forward orthogonal search algorithm interfered with squared correla- tion and mutual information, Automatic Control and Systems Engineering, Univ ersity of She ﬃ eld, 2006. [15] Y . Guo, L. Guo, S. A. Billings, H.-L. W ei, An iterative orthogonal forward regression algorithm, International Journal of Systems Science 46 (5) (2015) 776–789. 23 [16] Y . Guo, L. Guo, S. Billings, H.-L. W ei, Ultra-orthogonal forward regression algorithms for the identiﬁcation of non-linear dynamic systems, Neurocomputing 173 (2016) 715 – 723. [17] K. Li, J.-X. Peng, E.-W . Bai, A two-stage algorithm for identiﬁcation of nonlinear dynamic systems, Automatica 42 (7) (2006) 1189 – 1197. [18] F . Haﬁz, A. Swain, E. M. Mendes, N. Patel, Structure selection of polynomial NARX models using two dimensional (2D) particle swarms, in: 2018 IEEE Congress on Evolutionary Computation (CEC), 2018, pp. 1–8. [19] M. Michael, W . Lin, Experimental study of information measure and inter-intra class distance ratios on feature selection and orderings, IEEE T ransactions on Systems, Man, and Cybernetics SMC-3 (2) (1973) 172–181. [20] S. D. Stearns, On selecting features for pattern classiﬁers, in: Proceedings of the 3rd International Joint Conference on Pattern Recognition, 1976, pp. 71–75. [21] J. Kittler , Feature set search algorithms, Pattern recognition and signal processing (1978) 41–60. [22] P . Pudil, J. Novo vi ˇ cov ´ a, J. Kittler , Floating search methods in feature selection, Pattern Recognition Letters 15 (11) (1994) 1119–1125. [23] P . Somol, P . Pudil, J. Novovi ˇ cov ´ a, P . Paclık, Adaptive ﬂoating search methods in feature selection, Pattern Recognition Letters 20 (11) (1999) 1157–1163. [24] P . Somol, P . Pudil, Oscillating search algorithms for feature selection, in: Proceedings 15th International Conference on Pattern Recognition. ICPR-2000, V ol. 2, 2000, pp. 406–409 vol.2. [25] S. Nakariyakul, D. P . Casasent, An impro vement on ﬂoating search algorithms for feature subset selection, Pattern Recognition 42 (9) (2009) 1932–1940. [26] P . Stoica, Y . Selen, Model-order selection: a revie w of information criterion rules, IEEE Signal Processing Magazine 21 (4) (2004) 36–47. [27] E. M. A. M. Mendes, Identiﬁcation of nonlinear discrete systems with intelligent structure detection, Ph.D. thesis, Uni versity of She ﬃ eld, England, UK (1995). [28] M. Bonin, V . Seghezza, L. Piroddi, NARX model selection based on simulation error minimisation and LASSO, IET control theory & applications 4 (7) (2010) 1157–1168. [29] S. A. Billings, J. C. P . Jones, Mapping non-linear integro-di ﬀ erential equations into the frequency domain, International Journal of Control 52 (4) (1990) 863–879. 24

Orthogonal Floating Search Algorithms: From The Perspective of Nonlinear System Identification

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment