Hierarchical Clustering for Smart Meter Electricity Loads based on Quantile Autocovariances

Andrés M. Alonso, F. Javier Nogales and Carlos Ruiz

Abstract

In order to improve the efficiency and sustainability of electricity systems, most countries worldwide are deploying advanced metering infrastructures, and in particular household smart meters, in the residential sector. This technology is able to record electricity load time series at very high frequency, information that can be exploited to develop new clustering models that group individual households by similar consumption patterns. To this end, in this work we propose three hierarchical clustering methodologies that capture different characteristics of the time series. These are based on a set of "dissimilarity" measures computed over different features: quantile autocovariances, and simple and partial autocorrelations. The main advantage is that they summarize each time series in a few representative features, so that they are computationally efficient, robust against outliers, easy to automate, and scalable to hundreds of thousands of smart meter series. We evaluate the performance of each clustering model on a real-world smart meter dataset with thousands of half-hourly time series. The results show how the obtained clusters identify relevant consumption behaviors of households and capture part of their geo-demographic segmentation. Moreover, we apply a supervised classification procedure to explore which features are most relevant in defining each cluster.

Index Terms

Quantile autocovariances, massive time series, hierarchical clustering, smart meters.

I. Introduction

A. Background and Aim

Moved by the need to improve the efficiency and sustainability of aging electrical systems, many countries worldwide are adopting new information and communication technologies, with special emphasis on the residential sector [1].
These technologies imply a new paradigm in the economic and technical operation of distribution networks, and create new business opportunities for all the companies that take part in the electricity supply chain. Particularly relevant is the widespread integration of advanced metering infrastructures (AMI) [2], with a special role played by household "smart meters". These devices record electricity consumption data at a very high frequency and instantly transmit this information to the retailing and/or distribution companies. Furthermore, as many electricity markets worldwide are open to competition in both the generation and retailing sectors, there is a growing interest among electrical companies in using these data to increase their profit, their market share or the consumers' welfare. In this vein, the treatment of these new datasets requires the research and implementation of novel data science techniques, with practical applications in energy fraud detection, outlier identification, consumer profiling, demand response, tariff design, load forecasting, etc. [3].

The special characteristics of the data stored by smart meters (hundreds of thousands, or even millions, of high-frequency time series), and their combination with exogenous variables (meteorological, calendar, economic, etc.), open the possibility of designing specific clustering models for household consumers. Furthermore, these models can help to better understand the behavior of both aggregated and disaggregated electrical loads [4], and how this knowledge can be exploited to improve electrical systems.

A. M. Alonso, F. J. Nogales and C. Ruiz are with the Department of Statistics and UC3M-BS Institute for Financial Big Data (IFiBiD), Universidad Carlos III de Madrid, Avda. de la Universidad 30, 28911 Leganés, Spain. (e-mails: andres.alonso@uc3m.es; fco javier.nogales@uc3m.es; carlos.ruiz@uc3m.es)
In particular, clustering households with similar consumption patterns has many potential applications. Retailing companies may be interested in grouping clients by consumption profile to offer tailor-made tariffs. This may increase consumers' utility while ensuring revenue adequacy for the company. Moreover, clustering may help to identify the best candidate group of consumers for implementing demand response policies. In this vein, system operators and distribution companies can benefit from clustering techniques to improve their load forecasting accuracy [5], with a direct impact on system reliability or predictive maintenance.

In this work we propose different hierarchical clustering strategies based on a set of "dissimilarity" measures: quantile autocovariances, and simple and partial autocorrelations. These strategies summarize each consumption time series in only a few representative features, so that they are highly efficient, easy to automate and scalable to hundreds of thousands of series, i.e., they can be successfully implemented in large-scale applications that make use of smart meter datasets. We test the performance of our clustering models on a real-world dataset with thousands of electricity consumption time series. The results are promising, as the obtained clusters not only identify relevant consumption patterns but also capture part of the geo-demographic segmentation of the consumers. Furthermore, we implement a multiclass supervised classification algorithm, based on decision trees, in order to characterize the most important features conditioning each cluster.

B. Literature Review

A review of several clustering techniques to group similar electricity consumers is presented in [6]. It is shown that the overall performance of the different techniques is related to their ability to isolate outliers.
Reference [7] proposes a clustering method for household consumers based on K-means and Principal Components Analysis (PCA). The resulting clusters are subject to a multiple regression analysis to identify relevant explanatory variables. The work in [8] addresses the consumer segmentation problem by normalizing the daily load shapes for each consumer, together with their total consumption, to apply an adaptive K-means algorithm. Focusing on commercial and industrial electricity consumers, [9] proposes a clustering model based on K-means to identify candidate users for energy efficiency policies, as well as the opening and closing hours of their businesses. Reference [10] evaluates and compares three clustering techniques for smart meter data, namely k-medoid, K-means and Self Organizing Maps (SOM), and shows that the latter presents the best overall performance. Traditional time series methods, like wavelets or autocorrelation analysis, are applied in [11] to the raw smart meter data to enrich the input of a K-means-based clustering algorithm for consumer segmentation. Reference [12] proposes to use dynamic information, in terms of transitions between adjacent time periods, for consumer segmentation. The resulting clusters are used to evaluate their potential for demand response policies.

Several works seek to identify relevant features that condition the dynamic patterns of electricity consumers. For instance, a supervised ML model based on individual household consumption time series is proposed in [13]. With the same aim, [14] proposes a methodology to examine smart meter data and identify important determinants of consumers' electricity load. To extend the number of features that can potentially be used for profiling consumers, [15] complements the smart meter data with door-to-door surveys. It is shown how this new dataset improves the performance of a Ward's hierarchical clustering algorithm.
A detailed analysis of household consumption data is presented in [16] to identify those time periods from which relevant consumption features can be extracted. Based on these features, a mixture-based clustering algorithm is proposed and evaluated by bootstrap techniques. Another mixture model framework, based on linear Gaussian approximations, is used by [17] to derive relevant load profiles from individual consumption patterns.

To improve computational performance, [18] presents a two-level clustering methodology to derive representative consumption profiles based on K-means. The first level is used to obtain local profiles that are generalized in the second level. With a similar aim, reference [19] proposes a feature construction model for time series to cluster similar consumers. The model reduces the dimensionality of the problem by using conditional filters and profile errors. An efficient frequency-domain hierarchical clustering model is proposed in [20] to derive adequate load profiles. Moreover, [21] studies how the temporal resolution of the consumption time series may have a strong impact on both the quality and the computational performance of the clustering techniques.

Clustering techniques have also been used to improve the accuracy of forecasting models. In this vein, a K-means-based algorithm is employed in [5] to derive consumption estimates and impute missing data. The cross-similarities between consumption series are used by [22] to enhance the performance of a forecasting model based on Long Short-Term Memory (LSTM) networks. Similarly, [23] implements a K-means-based clustering algorithm to group similar consumers and then fits a Neural Network (NN) forecasting model for aggregated loads. Another K-means-based clustering algorithm is applied in [24] to household load curves to group similar consumers and enhance the performance of a nonparametric functional wavelet-kernel approach.
Reference [25] also makes use of consumer segmentation through PCA and K-means clustering to identify typical daily consumption profiles that can improve the accuracy of an ML forecasting tool.

C. Contributions

We build part of our research on the original methodology presented in [26], which proposes to cluster time series based on quantile autocovariance distances. An extensive simulation analysis and a real-world application to daily financial time series show the ability of this approach to identify different dependence models among the series.

In the present work, and for the first time to the authors' knowledge, we adapt and extend part of the methodology in [26] to identify relevant clusters from massive and high-frequency smart meter time series. In particular, considering the state of the art presented in Section I-B, the main contributions of this work are fivefold:

1) To summarize each smart meter time series in a small set of meaningful features: autocorrelation coefficients, partial autocorrelation coefficients and quantile autocovariances.
2) To propose three hierarchical clustering models, based on Euclidean dissimilarity measures computed over the previous features. The models are computationally efficient and robust against outlier observations.
3) To test the proposed methodology on a real dataset, including thousands of half-hourly load time series, to characterize relevant electricity consumption profiles.
4) To make use of a supervised classification procedure (decision trees) to identify those variables (features) that have been most relevant in forming the resulting clusters.
5) To verify that the resulting clusters are able to capture, to some extent, the geo-demographic segmentation of household consumers.

D. Paper Organization

This paper is organized as follows. In Section II the proposed hierarchical clustering methodology for smart meter time series is presented.
The numerical results, based on a real-world dataset, are presented in Section III. Finally, Section IV presents the main conclusions derived from this work.

II. The Clustering Methodology

Let us assume that we observe $N$ time series, $\{\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_N\}$, where $\mathbf{X}_i = (X_{i,t_i}, X_{i,t_i+1}, \ldots, X_{i,T_i})$ and $(t_i, T_i)$ denote the first and the last times at which the $i$-th time series is observed, respectively. In our dataset, the $(t_i, T_i)$ are the same for all time series, but our procedures do not require this condition since they are based on features extracted from the time series.

As mentioned in the previous section, there are many interesting features to consider as "clustering" variables instead of using raw data. In our case, we consider three sets of features that capture different aspects of the dynamic behavior of the time series:

• The set of autocorrelation coefficients of orders $(1, 2, \ldots, K)$, that is, we calculate the correlation coefficient between the variables $X_{i,t}$ and $X_{i,t+j}$ for $j = 1, 2, \ldots, K$, defined by

$$\rho_i(t, t+j) = \frac{\mathrm{Cov}(X_{i,t}, X_{i,t+j})}{\left(\mathrm{Var}(X_{i,t})\,\mathrm{Var}(X_{i,t+j})\right)^{1/2}}. \tag{1}$$

• The set of partial autocorrelation coefficients of orders $(1, 2, \ldots, K)$, that is, we calculate the correlation coefficient between observations separated by $j$ periods, $X_{i,t}$ and $X_{i,t+j}$, once the linear dependence due to the intermediate values has been eliminated. The partial autocorrelation coefficient will be denoted by $\pi_i(t, t+j)$.

• The set of quantile autocovariances of order $j$ at quantile levels $(\tau, \tau') \in [0, 1]^2$, defined by

$$\gamma_{i,(\tau,\tau')}(t, t+j) = \mathrm{Cov}\left(I(X_{i,t} \le q_{\tau,i}),\, I(X_{i,t+j} \le q_{\tau',i})\right), \tag{2}$$

where $I(\cdot)$ denotes the indicator function and $q_{\tau,i}$ and $q_{\tau',i}$ are the $\tau$- and $\tau'$-quantiles of $X_{i,t}$ and $X_{i,t+j}$, respectively.
It is interesting to note the differences between features (1) and (2), since both involve the calculation of a covariance between observations separated by $j$ periods. In (1), the covariance term is estimated by

$$\frac{1}{T_i - j} \sum_{t=t_i}^{T_i - j} X_{i,t} X_{i,t+j} \;-\; \left(\frac{1}{T_i - j} \sum_{t=t_i}^{T_i - j} X_{i,t}\right) \left(\frac{1}{T_i - j} \sum_{t=t_i}^{T_i - j} X_{i,t+j}\right),$$

which involves the products $X_{i,t} X_{i,t+j}$ that can be distorted by extreme or outlier observations. For example, two very high loads observed at a distance of $j$ periods would spuriously increase the correlation at lag $j$. On the other hand, the quantile autocovariance (2) is estimated by

$$\hat{\gamma}_{i,(\tau,\tau')}(t, t+j) = \frac{1}{T_i - j} \sum_{t=t_i}^{T_i - j} I(X_{i,t} \le \hat{q}_{\tau,i})\, I(X_{i,t+j} \le \hat{q}_{\tau',i}) \;-\; \tau\tau'. \tag{3}$$

The products $I(X_{i,t} \le \hat{q}_{\tau,i})\, I(X_{i,t+j} \le \hat{q}_{\tau',i})$ involved are bounded, which implies a negligible effect of outliers. The first term of expression (3) can be interpreted as the proportion of times that values at $t$ below $\hat{q}_{\tau,i}$ coincide with values at $t+j$ below $\hat{q}_{\tau',i}$. The term $\tau\tau'$ is the proportion of coincidences that would occur purely by chance. Therefore, a positive (negative) $\hat{\gamma}_{i,(\tau,\tau')}$ means that the number of matches is greater (smaller) than expected by chance.

It should be noticed that the above characteristics, in general, depend on $t$ and $j$, but if the time series are stationary, then they do not depend on $t$, which simplifies their analysis. For this reason, we consider the (daily) seasonal difference of the logarithmically transformed smart meter load time series. That is, as the time series used in this paper have a half-hourly frequency, the series to be clustered are $X_{i,t} = \ell_{i,t} - \ell_{i,t-48}$, where $\ell_{i,t} = \log L_{i,t}$ denotes the logarithm of the load time series of the $i$-th smart meter.

We should fix the largest lag, $K$, in the sets of autocorrelation and partial autocorrelation coefficients.
We can fit autoregressive models to all the univariate time series, selecting the order by the BIC criterion, and take $K = \max_{1 \le i \le N}(p_i)$, where $p_i$ is the selected order for the $i$-th time series. It is shown in [27] that this procedure provides an upper bound on the memory of $N$ stationary linear time series. The selected $K$ was 96. This selection allows us to capture the main linear dependencies in all time series. Also, for the set of quantile autocovariances, we should fix the lag and the quantile levels. In this case, following the suggestions of [26], we use $j = 1$ and $\tau \in \{0.1, 0.5, 0.9\}$, since these values have been shown to be capable of capturing and differentiating different types of nonlinearities.

Finally, the three clustering analyses will be based on the following sets of features, for $i \in \{1, 2, \ldots, N\}$:

a) $\boldsymbol{\rho}_i = \{\rho_i(1), \rho_i(2), \ldots, \rho_i(96)\}$
b) $\boldsymbol{\pi}_i = \{\pi_i(1), \pi_i(2), \ldots, \pi_i(96)\}$
c) $\boldsymbol{\gamma}_i = \{\gamma_{i,(0.1,0.1)}, \gamma_{i,(0.1,0.5)}, \gamma_{i,(0.1,0.9)}, \gamma_{i,(0.5,0.1)}, \gamma_{i,(0.5,0.5)}, \gamma_{i,(0.5,0.9)}, \gamma_{i,(0.9,0.1)}, \gamma_{i,(0.9,0.5)}, \gamma_{i,(0.9,0.9)}\}$

Thus, the analysis will be based on $96 \times 1$ vectors of features for the autocorrelation and partial autocorrelation coefficients, and on $9 \times 1$ vectors of features for the quantile autocovariances.

Once we have the vectors of features, we define a dissimilarity measure between time series $\mathbf{X}_i$ and $\mathbf{X}_j$ by the Euclidean distance between the corresponding vectors. That is:

a) $d_{AC}(\mathbf{X}_i, \mathbf{X}_j) = \lVert \boldsymbol{\rho}_i - \boldsymbol{\rho}_j \rVert_2$
b) $d_{PAC}(\mathbf{X}_i, \mathbf{X}_j) = \lVert \boldsymbol{\pi}_i - \boldsymbol{\pi}_j \rVert_2$
c) $d_{QC}(\mathbf{X}_i, \mathbf{X}_j) = \lVert \boldsymbol{\gamma}_i - \boldsymbol{\gamma}_j \rVert_2$

where $\lVert \cdot \rVert_2$ denotes the Euclidean norm.
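As a rough illustration of how the quantile autocovariance features and the distance d_QC described above could be computed, the following Python sketch implements the seasonal log-difference, estimator (3) with j = 1 and τ ∈ {0.1, 0.5, 0.9}, and the Euclidean distance. It is not the authors' code: all function names are hypothetical, the series are assumed to be plain lists of positive loads starting at t = 0, and the empirical quantile uses linear interpolation between order statistics, which is only one of several possible conventions.

```python
import math


def seasonal_log_difference(load, period=48):
    """X_t = log(L_t) - log(L_{t-period}); assumes strictly positive loads."""
    logs = [math.log(v) for v in load]
    return [logs[t] - logs[t - period] for t in range(period, len(logs))]


def empirical_quantile(x, tau):
    """Empirical tau-quantile by linear interpolation (an assumed convention)."""
    s = sorted(x)
    pos = tau * (len(s) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def quantile_autocovariance(x, lag, tau, tau_prime):
    """Estimator (3): proportion of joint indicator hits minus tau * tau_prime."""
    q_tau = empirical_quantile(x, tau)
    q_tau_prime = empirical_quantile(x, tau_prime)
    n_pairs = len(x) - lag
    hits = sum(
        1 for t in range(n_pairs)
        if x[t] <= q_tau and x[t + lag] <= q_tau_prime
    )
    return hits / n_pairs - tau * tau_prime


def qc_features(x, lag=1, levels=(0.1, 0.5, 0.9)):
    """Nine-dimensional feature vector over all pairs of quantile levels."""
    return [quantile_autocovariance(x, lag, a, b) for a in levels for b in levels]


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two toy series: one alternates every period, one is persistent.
alternating = [0.0, 1.0] * 50
persistent = [0.0] * 50 + [1.0] * 50
d_qc = euclidean(qc_features(alternating), qc_features(persistent))
```

On the toy series, the alternating one never stays below its median for two consecutive periods, so its (0.5, 0.5) feature is negative, while the persistent one gives a positive value; the distance d_qc reflects that difference in dynamics rather than in load level.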
The distances $d_M(\mathbf{X}_i, \mathbf{X}_j)$ will be obtained for all pairs $(i, j)$ with $i \ne j$ to construct the following $N \times N$ dissimilarity matrix:

$$\mathbf{D}_M = \begin{pmatrix} 0 & d_M(\mathbf{X}_1, \mathbf{X}_2) & \cdots & d_M(\mathbf{X}_1, \mathbf{X}_N) \\ d_M(\mathbf{X}_2, \mathbf{X}_1) & 0 & \cdots & d_M(\mathbf{X}_2, \mathbf{X}_N) \\ \vdots & \vdots & \ddots & \vdots \\ d_M(\mathbf{X}_N, \mathbf{X}_1) & d_M(\mathbf{X}_N, \mathbf{X}_2) & \cdots & 0 \end{pmatrix} \tag{4}$$

where $M \in \{AC, PAC, QC\}$.

The dissimilarity matrix (4) can be used in any clustering procedure that requires this kind of input. In particular, we can apply hierarchical clustering, since it allows us to identify clusters as well as hierarchies among the clusters. In hierarchical clustering procedures, to decide which groups should be combined, it is necessary to choose a measure of dissimilarity (linkage criterion) between sets. It is important to emphasize that this choice will influence the shape of the groups, since some sets could be close according to one distance and far apart according to another. The three best-known measures are minimum or single linkage ($d_s$), maximum or complete linkage ($d_c$) and average linkage ($d_a$), defined by:

$$d_s(A, B) = \min\{d(\mathbf{X}_i, \mathbf{X}_j) : i \in A,\, j \in B\}$$
$$d_c(A, B) = \max\{d(\mathbf{X}_i, \mathbf{X}_j) : i \in A,\, j \in B\}$$
$$d_a(A, B) = \frac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} d(\mathbf{X}_i, \mathbf{X}_j),$$

where $A$ and $B$ are two sets of observations having $n_A$ and $n_B$ elements, respectively. In this work, we prefer to use complete linkage, as it ensures that the observations in a group are similar to all other observations of the same group, in the sense that once the cut-off point in the dendrogram has been set, all the distances within a cluster are smaller than this cut-off point.

Once we obtain the groups, an interesting question is which variables have been the most relevant in forming these groups. This question can be addressed through the use of a supervised classification procedure where the labels of the observations are the result of the clustering methodology.
That is, if we have $k$ clusters, we will assign the labels $\{1, 2, \ldots, k\}$ to the observations of the respective clusters. These labels and the features will be the input of the supervised classification procedure. In this work, we will use decision trees [28] for the multiclass classification problem, since unbiased estimates of the predictor (feature) importance [29] are available for this procedure.

III. Numerical Results

In this section we use the public energy consumption dataset from [30]. It includes a sample of 5,567 households in London with their individual electricity consumption time series during 2013, in kWh (per half hour), date and time, and CACI ACORN segmentation (6 geo-demographic categories) [31]. In particular, to validate the clustering methodology of this work, we will compare the resulting clusters with the aggregated geo-demographic categories coded as "ACORN GROUPED", which classify households into three main groups: "Affluent", "Comfortable" and "Adversity".

Moreover, the dataset is also divided into two subgroups of consumers:

i) std tariff: Consumers whose electricity tariff is fixed (standard) at a constant price during the time of the study.
ii) tou tariff: Consumers with a "time of use" tariff, for which the electricity price is different for each hour.

In order to better characterize the inherent consumption behavior of individual households, we have focused the following study on the std tariff consumers, as these are not influenced by a variable price signal. This initial group includes approximately 4500 time series, some of which are discarded due to a high proportion of missing observations, rendering a final subsample of around 3200 time series (households).

The following three dendrograms, Figs. 1-3, are obtained using the QC, AC and PAC features and the complete linkage introduced in Section II.
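The complete-linkage agglomeration behind dendrograms of this kind can be sketched in a few lines of Python. This is only a naive illustration under stated assumptions, not the implementation used for the figures: the function name is hypothetical, the input is a full symmetric dissimilarity matrix given as a list of lists, and the pairwise search is far too slow for thousands of series, where a dedicated library routine would normally be preferred.

```python
def complete_linkage_clusters(dist, cutoff):
    """Merge clusters greedily by complete linkage until the closest pair
    of clusters is farther apart than `cutoff`.

    dist   : symmetric matrix (list of lists) of pairwise dissimilarities
    cutoff : height at which the dendrogram is cut
    Returns a list of clusters, each a list of observation indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def d_c(a, b):
        # Complete linkage: largest pairwise dissimilarity between two sets.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = d_c(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        if best[0] > cutoff:
            break  # every remaining merge would exceed the cut-off height
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Toy example: six points on a line, dissimilarity = absolute difference.
points = [0, 1, 2, 10, 11, 30]
toy_dist = [[abs(a - b) for b in points] for a in points]
toy_clusters = complete_linkage_clusters(toy_dist, cutoff=3)
```

In the toy example the isolated point at 30 remains a singleton, mirroring the way atypical series join the hierarchy only at large heights, and every pair of points inside a cluster is at distance at most the cut-off.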
In the three graphs we can observe some clear groups of observations (time series) and also observations that join the hierarchical structure only at large heights. Those observations have an atypical dynamic behavior and are grouped in clusters with less than 1% of the total number of time series. Once we discard the atypical observations, we find eight, six and seven large clusters for QC, AC and PAC, respectively. Moreover, the degree of coincidence among these three partitions is low, as indicated by the adjusted Rand indexes (0.0941 when comparing QC and AC; 0.1432 when comparing QC and PAC; and 0.2687 when comparing AC and PAC). This implies that the three approaches look at different characteristics of the time series.

Figure 1. Dendrogram obtained with quantile autocovariance and complete linkage.

Figure 2. Dendrogram obtained with autocorrelation coefficients and complete linkage.

Figures 4-6 illustrate the large clusters obtained with QC, AC and PAC, respectively. In the figures, we represent the mean of the features used to obtain the clusters. There are nine features in the case of QC, corresponding to the covariances of quantiles 10%, 50% and 90%. In the case of AC and PAC, we use the first 96 simple and partial autocorrelations, respectively. The clusters based on QC reveal differences in the median consumptions (.5 versus .5) and in the highest versus median consumptions (.9 versus .5). For instance, the difference between c3 and c7 versus c1, c2, c4, c6 and c8 at the median consumptions is remarkable. Clusters c3 and c7 have negative covariances, and c1, c2, c4, c6 and c8 have positive ones. That is, in the first group, a consumption below the median tends to be followed by a consumption above the median, while the second group tends to maintain its consumption below the median.
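For readers unfamiliar with the adjusted Rand index quoted above, the following Python sketch computes it from two label vectors using the usual pair-counting formula. The function name and the toy labelings are illustrative assumptions; the sketch expects at least two observations and treats the degenerate case of two trivial, identical partitions as perfect agreement.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same observations."""
    n = len(labels_a)
    # Pairs of observations placed together in both partitions.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions have a single cluster
    return (together_both - expected) / (maximum - expected)


# Same grouping with swapped label names, and a maximally crossed grouping.
same_up_to_names = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index equals 1 for partitions that coincide up to a renaming of the labels and is close to 0 (possibly negative) when the agreement is no better than chance, which is how the low values reported for QC, AC and PAC should be read.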
The groups obtained with AC and PAC show differences in the short-range dependencies but also in their behavior around lag 48 (one day). We can focus on the first correlation coefficients, which show different degrees of persistence in the consumptions. For instance, in the AC clusters, there is a clear order from high dependency at c4, c1 and c2, through medium at c3 and c5, to low dependency at c6. In the PAC clusters, we can differentiate between clusters with negative second partial autocorrelation (c1, c2 and c6), medium (c3, c4 and c5) and high positive (c7). That is, once we eliminate the first-order correlation, there are negative (or positive) direct effects on the consumption at the 2-step-ahead period.

Figure 3. Dendrogram obtained with partial autocorrelation coefficients and complete linkage.

Figure 4. Main clusters obtained with quantile autocovariances and complete linkage. (Horizontal axis: quantile levels, from .1 vs .1 to .9 vs .9; vertical axis: quantile autocovariance; clusters c1 to c8.)

Figure 5. Main clusters obtained with autocorrelation coefficients and complete linkage. (Horizontal axis: mid-hour lag; vertical axis: autocorrelation; clusters c1 to c6.)

Figure 6. Main clusters obtained with partial autocorrelation coefficients and complete linkage. (Horizontal axis: mid-hour lag; vertical axis: partial autocorrelation; clusters c1 to c7.)

Figures 7-9 provide the estimates of the predictor importance. It is clear that all features are relevant in the clustering based on QC, but we can make a selection of features in the clusterings based on AC and PAC. In particular, for AC, the first fifteen lags and the four lags around lag 48 appear to be relevant; for PAC, the relevant ones are the first four lags, the four lags before and the two lags after lags 48 and 96, and those daily "seasonal" lags themselves. It is interesting to notice that lag 48 is not relevant for AC, which is due to the (daily) seasonal difference. However, there is still stationary seasonal behavior, as reflected by the relevant predictors (lags) around lag 48. For PAC, the daily lags are highly relevant. The misclassification rates estimated by cross-validation for the three trained decision trees were 9.9%, 23.3% and 21.4% when using QC, AC and PAC, respectively. These low rates indicate that the obtained trees are good approximations to the clustering mechanism. Of course, other supervised classification procedures, such as random forests or neural networks, can be used in order to obtain better approximations.

Figure 7. Predictor importance estimates for clusters based on quantile autocovariances. (Horizontal axis: predictors, from .1 vs .1 to .9 vs .9; vertical axis: estimates, scale 10^-3.)

Figure 8. Predictor importance estimates for clusters based on autocorrelation coefficients. (Horizontal axis: predictors, lags 1 to 96; vertical axis: estimates, scale 10^-3.)

Figure 9. Predictor importance estimates for clusters based on partial autocorrelation coefficients. (Horizontal axis: predictors, lags 1 to 96; vertical axis: estimates, scale 10^-3.)

Tables I-III show the number of households in each cluster that are classified in each of the three ACORN GROUPED categories. Note that they are unevenly distributed across clusters.
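Whether such uneven counts are compatible with independence between cluster membership and ACORN GROUPED category can be examined with a Pearson chi-squared test. The sketch below computes the test statistic and its degrees of freedom for the counts reported in Table I; the function name is hypothetical and the code is an illustration, not the procedure actually run by the authors. The p-value is deliberately not computed, since that would require a chi-squared distribution function from a statistics library.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for an
    r x c contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Counts from Table I (QC clusters c1..c8; columns: Adversity, Comfortable, Affluent).
table_qc = [
    [24, 24, 44],
    [293, 278, 360],
    [42, 23, 36],
    [52, 40, 37],
    [28, 22, 55],
    [482, 343, 358],
    [18, 9, 16],
    [146, 160, 258],
]
stat_qc, dof_qc = chi_squared_statistic(table_qc)
```

With 8 clusters and 3 categories there are 14 degrees of freedom, for which the 5% critical value of the chi-squared distribution is about 23.68; a statistic above that value rejects independence at that level.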
Indeed, we have performed chi-squared tests on those tables, and the results are highly significant in the three cases, revealing that the clustering is related to the ACORN GROUPED classification. This shows that the proposed clustering methodology is able, to some extent, to provide insights into the geo-demographic characteristics of a household (Acorn groups) just by studying the time series dependencies.

Cluster   Adversity   Comfortable   Affluent
c1        24          24            44
c2        293         278           360
c3        42          23            36
c4        52          40            37
c5        28          22            55
c6        482         343           358
c7        18          9             16
c8        146         160           258
Table I. Clusters by quantile autocovariance versus ACORN GROUPED

Cluster   Adversity   Comfortable   Affluent
c1        70          83            161
c2        426         404           544
c3        252         171           179
c4        16          23            59
c5        120         79            99
c6        214         140           120
Table II. Clusters by autocorrelation coefficients versus ACORN GROUPED

Figures 10-12 show the prototype's hourly profile for each cluster. The prototype is the medoid of each cluster, that is, the element in the cluster with minimal average dissimilarity to all objects in the cluster. We can observe different characteristic consumption patterns associated with different types of consumers. For instance, the eight clusters obtained with quantile autocovariance and complete linkage in Figure 10 allow distinguishing between consumers with morning (clusters 3, 6 and 7) and evening (clusters 1, 2, 5 and 8) peak loads, and those with a more constant consumption pattern (cluster 4). This can also be appreciated in the six clusters obtained with autocorrelation coefficients and complete linkage in Figure 11. In this case, clusters 1 and 4 capture those consumers with two intermediate peak loads, in the morning and in the evening. Cluster 5 represents consumers with a single peak consumption in the afternoon, and clusters 2, 3 and 5 represent consumers with less volatility. Similarly, Figure 12 shows the clusters obtained with partial autocorrelation coefficients and complete linkage.
Clusters 1, 2 and 5 characterize consumers with a steadily increasing load that reaches its maximum at midnight, while clusters 3, 4, 6 and 7 represent consumers with two intermediate peaks, in the morning and in the evening. In this case, clusters of each type are mainly differentiated by their average load consumption levels.

Cluster   Adversity   Comfortable   Affluent
c1        27          16            61
c2        66          67            111
c3        397         376           501
c4        454         305           261
c5        39          51            129
c6        83          54            68
c7        18          29            39
Table III. Clusters by partial autocorrelation coefficients versus ACORN GROUPED

Figure 10. Prototype's hourly profile for clusters obtained with quantile autocovariance and complete linkage.

Figure 11. Prototype's hourly profile for clusters obtained with autocorrelation coefficients and complete linkage.

IV. Conclusions

In this work we have presented three different hierarchical clustering strategies based on a set of "dissimilarity" measures computed over quantile autocovariances, and simple and partial autocorrelations. The main advantage of this approach is that each series is summarized in only a small set of representative features, which makes the strategies highly efficient, easy to implement and automate, and scalable to hundreds of thousands of series, i.e., valid for real-world applications with large datasets of time series, such as the ones obtained from smart meters. We evaluate the performance of these clustering models on thousands of electricity consumption time series. The results are promising: we are able to obtain highly representative clusters that capture different electricity load consumption patterns, and to identify the level of influence of each of the models' features. Moreover, we have seen how the proposed clustering scheme can provide meaningful insights into the geo-demographic level of a household (Acorn groups) just by analyzing its time series dependencies (autocorrelations).
Figure 12. Prototype's hourly profile for clusters obtained with partial autocorrelation coefficients and complete linkage.

Acknowledgment

The authors gratefully acknowledge the financial support from the Spanish government through project MTM2017-88979-P and from Fundación Iberdrola through “Ayudas a la Investigación en Energía y Medio Ambiente 2018”.

References

[1] R. Hierzinger, M. Albu, H. Van Elburg, A. J. Scott, A. Lazicki, L. Penttinen, F. Puente, and H. Sæle, “European smart metering landscape report 2012,” SmartRegions Deliverable, vol. 2, 2012.
[2] S. S. S. R. Depuru, L. Wang, V. Devabhaktuni, and N. Gudi, “Smart meters for power grid - challenges, issues, advantages and status,” in 2011 IEEE/PES Power Systems Conference and Exposition. IEEE, 2011, pp. 1–7.
[3] B. Yildiz, J. Bilbao, J. Dore, and A. Sproul, “Recent advances in the analysis of residential electricity consumption and applications of smart meter data,” Applied Energy, vol. 208, pp. 402–427, 2017.
[4] Y. Wang, Q. Chen, T. Hong, and C. Kang, “Review of smart meter data analytics: Applications, methodologies, and challenges,” IEEE Transactions on Smart Grid, vol. 10, no. 3, pp. 3125–3148, 2018.
[5] A. Al-Wakeel, J. Wu, and N. Jenkins, “K-means based load estimation of domestic smart meter measurements,” Applied Energy, vol. 194, pp. 333–342, 2017.
[6] G. Chicco, “Overview and performance assessment of the clustering methods for electrical load pattern grouping,” Energy, vol. 42, no. 1, pp. 68–80, 2012.
[7] M. Koivisto, P. Heine, I. Mellin, and M. Lehtonen, “Clustering of connection points and load modeling in distribution systems,” IEEE Transactions on Power Systems, vol. 28, no. 2, pp. 1255–1265, 2012.
[8] J. Kwac, J. Flora, and R. Rajagopal, “Household energy consumption segmentation using hourly data,” IEEE Transactions on Smart Grid, vol. 5, no. 1, pp. 420–430, 2014.
[9] A. Lavin and D. Klabjan, “Clustering time-series energy data from smart meters,” Energy Efficiency, vol. 8, no. 4, pp. 681–689, 2015.
[10] F. McLoughlin, A. Duffy, and M. Conlon, “A clustering approach to domestic electricity load profile characterisation using smart metering data,” Applied Energy, vol. 141, pp. 190–199, 2015.
[11] A. Tureczek, P. Nielsen, and H. Madsen, “Electricity consumption clustering using smart meter data,” Energies, vol. 11, no. 4, p. 859, 2018.
[12] Y. Wang, Q. Chen, C. Kang, and Q. Xia, “Clustering of electricity consumption behavior dynamics toward big data applications,” IEEE Transactions on Smart Grid, vol. 7, no. 5, pp. 2437–2447, 2016.
[13] C. Beckel, L. Sadamori, T. Staake, and S. Santini, “Revealing household characteristics from smart meter data,” Energy, vol. 78, pp. 397–410, 2014.
[14] A. Kavousian, R. Rajagopal, and M. Fischer, “Determinants of residential electricity consumption: Using smart meter data to examine the effect of climate, building characteristics, appliance stock, and occupants' behavior,” Energy, vol. 55, pp. 184–194, 2013.
[15] J. P. Gouveia and J. Seixas, “Unraveling electricity consumption profiles in households through clusters: Combining smart meters and door-to-door surveys,” Energy and Buildings, vol. 116, pp. 666–676, 2016.
[16] S. Haben, C. Singleton, and P. Grindrod, “Analysis and clustering of residential customers energy behavioral demand using smart meter data,” IEEE Transactions on Smart Grid, vol. 7, no. 1, pp. 136–144, 2015.
[17] B. Stephen, A. J. Mutanen, S. Galloway, G. Burt, and P. Järventausta, “Enhanced load profiling for residential network customers,” IEEE Transactions on Power Delivery, vol. 29, no. 1, pp. 88–96, 2013.
[18] O. Y. Al-Jarrah, Y. Al-Hammadi, P. D. Yoo, and S. Muhaidat, “Multi-layered clustering for power consumption profiling in smart grids,” IEEE Access, vol. 5, pp. 18459–18468, 2017.
[19] R. Al-Otaibi, N. Jin, T. Wilcox, and P. Flach, “Feature construction and calibration for clustering daily load curves from smart-meter data,” IEEE Transactions on Industrial Informatics, vol. 12, no. 2, pp. 645–654, 2016.
[20] S. Zhong and K.-S. Tam, “Hierarchical classification of load profiles based on their characteristic attributes in frequency domain,” IEEE Transactions on Power Systems, vol. 30, no. 5, pp. 2434–2441, 2014.
[21] R. Granell, C. J. Axon, and D. C. Wallom, “Impacts of raw data temporal resolution using selected clustering methods on residential electricity load profiles,” IEEE Transactions on Power Systems, vol. 30, no. 6, pp. 3217–3224, 2014.
[22] K. Bandara, C. Bergmeir, and S. Smyl, “Forecasting across time series databases using recurrent neural networks on groups of similar series: A clustering approach,” arXiv preprint arXiv:1710.03222, 2017.
[23] F. L. Quilumba, W.-J. Lee, H. Huang, D. Y. Wang, and R. L. Szabados, “Using smart meter data to improve the accuracy of intraday load forecasting considering customer behavior similarities,” IEEE Transactions on Smart Grid, vol. 6, no. 2, pp. 911–918, 2014.
[24] M. Chaouch, “Clustering-based improvement of nonparametric functional time series forecasting: Application to intra-day household-level load curves,” IEEE Transactions on Smart Grid, vol. 5, no. 1, pp. 411–419, 2013.
[25] B. Yildiz, J. I. Bilbao, J. Dore, and A. Sproul, “Household electricity load forecasting using historical smart meter data with clustering and classification techniques,” in 2018 IEEE Innovative Smart Grid Technologies-Asia (ISGT Asia). IEEE, 2018, pp. 873–879.
[26] B. Lafuente-Rego and J. A. Vilar, “Clustering of time series using quantile autocovariances,” Advances in Data Analysis and Classification, vol. 10, no. 3, pp. 391–415, 2016.
[27] A. M. Alonso and D. Peña, “Clustering time series by linear dependency,” Statistics and Computing, vol. 29, no. 4, pp. 655–676, 2019.
[28] L. Breiman, J. Friedman, C. Stone, and R. Olshen, Classification and Regression Trees, ser. The Wadsworth and Brooks-Cole statistics-probability series. Taylor & Francis, 1984.
[29] H. Ishwaran, “Variable importance in binary regression trees and forests,” Electronic Journal of Statistics, vol. 1, pp. 519–537, 2007.
[30] [Online]. Available: https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households
[31] [Online]. Available: https://acorn.caci.co.uk/downloads/Acorn-User-guide.pdf
