Horvitz-Thompson estimators for functional data: asymptotic confidence bands and optimal allocation for stratified sampling

When dealing with very large datasets of functional data, survey sampling approaches are useful in order to obtain estimators of simple functional quantities, without being obliged to store all the data. We propose here a Horvitz--Thompson estimator …

Authors: Herve Cardot, Etienne Josser

Horvitz–Thompson estimators for functional data: asymptotic confidence bands and optimal allocation for stratified sampling

Hervé Cardot, Etienne Josserand
email: herve.cardot@u-bourgogne.fr, etienne.josserand@u-bourgogne.fr
Institut de Mathématiques de Bourgogne, UMR CNRS 5584, Université de Bourgogne
9 Avenue Alain Savary - B.P. 47870, 21078 DIJON Cedex - France

April 2, 2019

Abstract

When dealing with very large datasets of functional data, survey sampling approaches are useful in order to obtain estimators of simple functional quantities, without being obliged to store all the data. We propose here a Horvitz–Thompson estimator of the mean trajectory. In the context of a superpopulation framework, we prove under mild regularity conditions that we obtain uniformly consistent estimators of the mean function and of its variance function. With additional assumptions on the sampling design we state a functional Central Limit Theorem and deduce asymptotic confidence bands. Stratified sampling is studied in detail, and we also obtain a functional version of the usual optimal allocation rule considering a mean variance criterion. These techniques are illustrated by means of a test population of N = 18902 electricity meters for which we have individual electricity consumption measures every 30 minutes over one week. We show that stratification can substantially improve the accuracy of the estimators and reduce the width of the global confidence bands compared to simple random sampling without replacement.

Keywords. Asymptotic variance; Functional Central Limit Theorem; Superpopulation model; Supremum of Gaussian processes; Survey sampling.

1 Introduction

The development of distributed sensors has enabled access to potentially huge databases of signals evolving along time and observed on very fine scales.
Exhaustive collection of such data would require major investments, both for transmission of the signals through networks and for storage. As noted in Chiky & Hébrail (2008), survey sampling of the sensors, which entails randomly selecting only a part of the curves of the population and which represents a trade-off between limited storage capacities and the accuracy of the data, may be relevant compared to signal compression in order to obtain accurate approximations to simple functional quantities such as mean trajectories.

Our study is motivated by the estimation, in a fixed time interval, of the mean electricity consumption curve of a large number of consumers. The French electricity operator EDF, Électricité De France, intends over the next few years to install over 30 million electricity meters, in each firm and household, which will be able to send individual electricity consumption measures on very fine time scales. Collecting, saving and analyzing all this information, which may be considered as functional, would be very expensive. As an illustrative example, a sample of 20 individual curves, selected among a test population of N = 18902 electricity meters, is plotted in Figure 1. The curves consist, for each company selected, of the electricity consumption measured every 30 minutes over a period of one week. The target is the mean population curve, and we note the high variability between individuals.

Using survey sampling strategies is one way to get accurate estimates at reasonable cost. The main questions addressed in this paper are to determine the precision of a survey sampling strategy and the strategies likely to improve the sampling selection process in order to obtain estimators that are as accurate as possible and to derive global confidence bands that are as sharp as possible for stratified sampling.
There is a vast literature in survey sampling theory; see for example Fuller (2009). However, as far as we know, the convergence issue with such sampling strategies in finite population has not yet been studied in the functional data analysis literature (Ramsay & Silverman, 2005, Müller, 2005) except by Cardot et al. (2010), where the objective was to reduce the dimension of the data through functional principal components in the Hilbert space of square integrable functions. Here we adopt a different point of view and consider the sampled trajectories as elements of the space of continuous functions equipped with the usual sup norm in order to get uniform consistency results through maximal inequalities. Then, it is possible to build global confidence bands with the help of properties of suprema of Gaussian processes and the functional central limit theorem.

Figure 1: A sample of 20 individual electricity consumption curves (horizontal axis: hours; vertical axis: electricity consumption, kW-h). The mean profile is plotted as a bold line.

2 Notation, estimators and basic properties

Let us consider a finite population $U_N = \{1, \ldots, k, \ldots, N\}$ of size $N$, and suppose that to each unit $k$ in $U_N$ we can associate a unique function $Y_k(t)$, for $t \in [0, T]$, with $T < \infty$. Our target is the mean trajectory

$$\mu_N(t) = \frac{1}{N} \sum_{k \in U_N} Y_k(t), \quad t \in [0, T]. \tag{1}$$

We consider a sample $s$ drawn from $U_N$ according to a fixed-size sampling design $p_N(s)$, where $p_N(s)$ is the probability of drawing the sample $s$. The size $n_N$ of $s$ is nonrandom and we suppose that the first and second order inclusion probabilities satisfy $\pi_k = P(k \in s) > 0$ for all $k \in U_N$, and $\pi_{kl} = P(k \,\&\, l \in s) > 0$ for all $k, l \in U_N$, $k \neq l$, so that each unit and each pair of units can be drawn with a non-null probability from the population.
It is now possible to write the classical Horvitz–Thompson estimator of the mean curve,

$$\hat{\mu}_N(t) = \frac{1}{N} \sum_{k \in U_N} \frac{Y_k(t)}{\pi_k} I_k, \quad t \in [0, T], \tag{2}$$

where $I_k$ is the sample membership indicator, $I_k = 1$ if $k \in s$ and $I_k = 0$ otherwise. We clearly have $E(I_k) = \pi_k$ and $E(I_k I_l) = \pi_{kl}$. It is easy to check (Fuller, 2009) that this estimator is unbiased, i.e. for all $t \in [0, T]$, $E\{\hat{\mu}_N(t)\} = \mu_N(t)$. Its covariance function $\gamma_N(s, t) = \mathrm{cov}\{\hat{\mu}_N(s), \hat{\mu}_N(t)\}$ satisfies, for all $(s, t) \in [0, T] \times [0, T]$,

$$\gamma_N(s, t) = \frac{1}{N^2} \sum_{k \in U_N} \sum_{l \in U_N} \frac{Y_k(s)}{\pi_k} \frac{Y_l(t)}{\pi_l} \Delta_{kl},$$

with $\Delta_{kl} = \pi_{kl} - \pi_k \pi_l$ if $k \neq l$ and $\Delta_{kk} = \pi_k (1 - \pi_k)$. An unbiased estimator of $\gamma_N(s, t)$, for all $(s, t) \in [0, T] \times [0, T]$, is

$$\hat{\gamma}_N(s, t) = \frac{1}{N^2} \sum_{k \in s} \sum_{l \in s} \frac{Y_k(s)}{\pi_k} \frac{Y_l(t)}{\pi_l} \frac{\Delta_{kl}}{\pi_{kl}}.$$

With real data, such as the electricity consumption trajectories presented in Fig. 1, we do not observe $Y_k(t)$ at every instant $t$ in $[0, T]$ but only have an evaluation of $Y_k$ at $d$ discretization points $0 = t_1 < \cdots < t_d = T$. Assuming that there are no measurement errors, which seems realistic in the case of electricity consumption curves, and that the trajectories are regular enough, linear interpolation is a robust and simple way to obtain accurate approximations of the trajectories at every instant $t$. For each unit $k$ in the sample $s$, the interpolated trajectory is defined by

$$\tilde{Y}_k(t) = Y_k(t_i) + \frac{Y_k(t_{i+1}) - Y_k(t_i)}{t_{i+1} - t_i} (t - t_i), \quad t \in [t_i, t_{i+1}]. \tag{3}$$

It is then possible to define the Horvitz–Thompson estimator of the mean curve based on the discretized observations as

$$\hat{\mu}_d(t) = \frac{1}{N} \sum_{k \in s} \frac{\tilde{Y}_k(t)}{\pi_k}, \quad t \in [0, T]. \tag{4}$$

The covariance function of $\hat{\mu}_d$, denoted by $\gamma_d(s, t) = \mathrm{cov}\{\hat{\mu}_d(s), \hat{\mu}_d(t)\}$, also satisfies, for all $(s, t) \in [0, T] \times [0, T]$,

$$\gamma_d(s, t) = \frac{1}{N^2} \sum_{k \in U_N} \sum_{l \in U_N} \frac{\tilde{Y}_k(s)}{\pi_k} \frac{\tilde{Y}_l(t)}{\pi_l} \Delta_{kl}, \tag{5}$$

and, as above, an unbiased estimator of $\gamma_d(s, t)$ is

$$\hat{\gamma}_d(s, t) = \frac{1}{N^2} \sum_{k \in s} \sum_{l \in s} \frac{\tilde{Y}_k(s)}{\pi_k} \frac{\tilde{Y}_l(t)}{\pi_l} \frac{\Delta_{kl}}{\pi_{kl}}. \tag{6}$$

To go further we must adopt an asymptotic point of view, assuming that the size $N$ of the population grows to infinity.

3 Asymptotic Properties

3.1 Assumptions

Let us consider the superpopulation asymptotic framework introduced by Isaki & Fuller (1982) and discussed in detail in Fuller (2009). We consider a sequence of growing and nested populations $U_N$ with size $N$ tending to infinity and a sequence of samples $s_N$ of size $n_N$ drawn from $U_N$ according to the fixed-size sampling designs $p_N(s_N)$. Let us denote by $\pi_{kN}$ and $\pi_{klN}$ their first and second order inclusion probabilities. The sequence of sub-populations is an increasing nested one while the sample sequence is not. For simplicity of notation, we drop the subscript $N$ in the following when there is no ambiguity. To prove our asymptotic results, we make the following assumptions.

Assumption 1. We assume that $\lim_{N \to \infty} n/N = \pi \in \,]0, 1[$.

Assumption 2. We assume that $\min_k \pi_k \geq \lambda > 0$, $\min_{k \neq l} \pi_{kl} \geq \lambda^* > 0$, and $\limsup_{N \to \infty} n \max_{k \neq l} |\pi_{kl} - \pi_k \pi_l| < C_1 < \infty$.

Assumption 3. For all $k \in U$, $Y_k \in C[0, T]$, the space of continuous functions on $[0, T]$, and $\lim_{N \to \infty} \mu_N = \mu$ in $C[0, T]$.

Assumption 4. There are two positive constants $C_2$ and $C_3$ and $\beta > 1/2$ such that, for all $N$, $N^{-1} \sum_{k \in U} (Y_k(0))^2 < C_2$ and $N^{-1} \sum_{k \in U} (Y_k(t) - Y_k(s))^2 \leq C_3 |t - s|^{2\beta}$ for all $(s, t) \in [0, T] \times [0, T]$.
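Before discussing these assumptions, the estimators of Section 2 can be illustrated numerically. The following Python sketch computes the mean curve estimator (4) and the diagonal of the variance estimator (6) at the grid points. It is only a minimal sketch under stated assumptions: the function names `ht_mean_curve` and `ht_variance_curve`, the array layout (one row per sampled curve) and the use of NumPy are illustrative choices made here and are not taken from the study itself.

```python
import numpy as np

def ht_mean_curve(y_sample, pi_sample, n_population, grid, t_eval):
    """Horvitz-Thompson mean curve, equation (4), evaluated at t_eval.

    y_sample     : (n, d) array, sampled curves observed on `grid`
    pi_sample    : (n,) array of first order inclusion probabilities
    n_population : population size N
    """
    # (1/N) * sum over the sample of Y_k(t_i) / pi_k at each grid point
    weighted = (y_sample / pi_sample[:, None]).sum(axis=0) / n_population
    # Linear interpolation is linear in the data, so interpolating the
    # weighted sum equals the weighted sum of the interpolated curves (3).
    return np.interp(t_eval, grid, weighted)

def ht_variance_curve(y_sample, pi_sample, pi_joint, n_population):
    """Diagonal of estimator (6) at the grid points.

    pi_joint : (n, n) array of second order inclusion probabilities,
               with pi_k on the diagonal (hypothetical input layout).
    """
    delta = pi_joint - np.outer(pi_sample, pi_sample)  # Delta_kl
    weights = delta / pi_joint                         # Delta_kl / pi_kl
    z = y_sample / pi_sample[:, None]                  # Y_k(t_i) / pi_k
    # sum_k sum_l z_k(t) * weights_kl * z_l(t), for each grid point t
    return np.einsum("kt,kl,lt->t", z, weights, z) / n_population**2
```

For simple random sampling without replacement, where $\pi_k = n/N$ and $\pi_{kl} = n(n-1)/\{N(N-1)\}$, these functions reduce to the sample mean curve and to $(1 - n/N)$ times the empirical variance divided by $n$.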
Assumptions 1 and 2 concern the moment properties of the sampling designs and are fulfilled for sampling plans such as simple random sampling without replacement or stratified sampling (Robinson & Särndal, 1983, Breidt & Opsomer, 2000). Assumptions 3 and 4 are of a functional nature and seem to be rather weak. Assumption 3 imposes only that the limit of the mean function exists and is continuous, and Assumption 4 states that the trajectories have a uniformly bounded second moment and their mean squared increments satisfy a Hölder condition.

3.2 Consistency

We can now state the first consistency results, assuming that the grid of the $d_N$ discretization points becomes finer and finer in $[0, T]$ as the population size $N$ tends to infinity.

Proposition 3.1. Let Assumptions 1-4 hold. If the discretization scheme satisfies $\lim_{N \to \infty} \max_{i = 1, \ldots, d_N - 1} |t_{i+1} - t_i|^{2\beta} = o(n^{-1})$, then for some constant $C$,

$$\sqrt{n}\, E\left( \sup_{t \in [0, T]} |\hat{\mu}_d(t) - \mu_N(t)| \right) < C.$$

Proposition 3.1 states that if the grid is fine enough then classical parametric rates of convergence can be attained uniformly, the additional hypothesis meaning that for smoother trajectories, i.e. larger $\beta$, fewer discretization points are needed. We would also like to obtain that $\hat{\gamma}_d(t, t)$ is a consistent estimator of the variance function $\gamma_N(t, t)$. To do so, we need to introduce additional assumptions concerning the higher-order inclusion probabilities and the fourth order moments of the trajectories.

Assumption 5. We assume that

$$\lim_{N \to \infty} \max_{(i_1, i_2, i_3, i_4) \in D_{4,N}} \left| E\{(I_{i_1} I_{i_2} - \pi_{i_1 i_2})(I_{i_3} I_{i_4} - \pi_{i_3 i_4})\} \right| = 0,$$

where $D_{t,N}$ denotes the set of all distinct $t$-tuples $(i_1, \ldots, i_t)$ from $U_N$.
We also suppose that there are two positive constants $C_4$ and $C_5$, such that $N^{-1} \sum_{k \in U_N} Y_k(0)^4 < C_4$, and $N^{-1} \sum_{k \in U_N} \{Y_k(t) - Y_k(s)\}^4 < C_5 |t - s|^{4\beta}$, for all $(s, t) \in [0, T] \times [0, T]$.

The first part of Assumption 5 is more restrictive than Assumption 2 and is assumed, for example, in Breidt & Opsomer (2000, part of assumption (A7)). It holds, for instance, in simple random sampling without replacement and stratified sampling.

Proposition 3.2. Let Assumptions 1-5 hold. If the discretization scheme satisfies $\lim_{N \to \infty} \max_{i = 1, \ldots, d_N - 1} |t_{i+1} - t_i| = o(1)$, then

$$n\, E\left( \sup_{t \in [0, T]} |\hat{\gamma}_d(t, t) - \gamma_N(t, t)| \right) \to 0, \quad N \to \infty.$$

The multiplier $n$ that appears in Proposition 3.2 is due to the fact that $n \gamma_N(t, t)$ is a bounded function. Proposition 3.2 only states that we can obtain a uniformly consistent estimator of the variance function of the estimated mean trajectory. More restrictive conditions concerning the sampling design would be needed to get rates of convergence.

3.3 Asymptotic normality and confidence bands

Proceeding further, we would now like to derive the asymptotic distribution of our estimator $\hat{\mu}_d$ in order to build asymptotic confidence intervals and bands. Obtaining the asymptotic normality of estimators in survey sampling is a technical and difficult issue even for simple quantities such as means or totals of real numbers. Although confidence intervals are commonly used in the survey sampling community, the Central Limit Theorem has only been checked rigorously, as far as we know, for a few sampling designs. Erdős & Rényi (1959) and Hájek (1960) proved that the Horvitz–Thompson estimator is asymptotically Gaussian for simple random sampling without replacement. These results were extended more recently to stratified sampling by Bickel & Freedman (1984) and to some particular cases of two-phase sampling designs by Chen & Rao (2007).
Fuller (2009, § 1.3) provides a recent review. Let us assume that the Horvitz–Thompson estimator satisfies a Central Limit Theorem for real valued quantities with new moment conditions.

Assumption 6. There is some $\delta > 0$ such that $N^{-1} \sum_{k \in U_N} |Y_k(t)|^{2+\delta} < \infty$ for all $t \in [0, T]$, and $\{\gamma_N(t, t)\}^{-1/2} \{\hat{\mu}_N(t) - \mu_N(t)\} \to N(0, 1)$ in distribution when $N$ tends to infinity.

We can now formulate the following proposition, which tells us that if the sampling design is such that the Horvitz–Thompson estimator of the total of real quantities is asymptotically Gaussian, then our estimator $\hat{\mu}_d$ is also asymptotically Gaussian in the space of continuous functions equipped with the sup norm. This means that point-wise normality can be transposed, under regularity assumptions on the trajectories and the asymptotic distance between adjacent discretization points, to a functional Central Limit Theorem.

Proposition 3.3. Let Assumptions 1-4 and 6 hold and suppose that the discretization points satisfy $\lim_{N \to \infty} \max_{i = 1, \ldots, d_N - 1} |t_{i+1} - t_i|^{2\beta} = o(n^{-1})$. We then have that

$$\sqrt{n}\, (\hat{\mu}_d - \mu_N) \to X \quad \text{in distribution in } C[0, T],$$

where $X$ is a Gaussian random function taking values in $C[0, T]$ with mean 0 and covariance function $\breve{\gamma}(s, t) = \lim_{N \to \infty} n \gamma_N(s, t)$.

The proof, given in the Appendix, is based on the Cramér–Wold device, which gives access to multivariate normality when considering discretized trajectories. Tightness arguments are then invoked in order to obtain the functional version of the Central Limit Theorem. Using heuristic arguments similar to those of Degras (2009), we can also build asymptotic confidence bands in order to evaluate the global accuracy of our estimator.
To do so, we make use of an asymptotic result from Landau & Shepp (1970), which states that the supremum of a centred Gaussian random function $Z$ taking values in $C[0, T]$, with covariance function $\rho(s, t)$, satisfies

$$\lim_{\lambda \to \infty} \lambda^{-2} \log P\left( \sup_{t \in [0, T]} Z(t) > \lambda \right) = -\left\{ 2 \sup_{t \in [0, T]} \rho(t, t) \right\}^{-1}. \tag{7}$$

Assuming that $\inf_t \breve{\gamma}(t, t) > 0$, it is easy to prove, with Slutsky's Lemma and Propositions 3.2 and 3.3, that the sequence of random functions $Z_n(t) = \{\hat{\gamma}_d(t, t)\}^{-1/2} \{\hat{\mu}_d(t) - \mu_N(t)\}$ satisfies the Central Limit Theorem in $C[0, T]$ and converges in distribution to $Z(t)$. Then, the continuous mapping theorem tells us that, for each $\lambda > 0$, $P\{\sup_t |Z_n(t)| > \lambda\}$ converges to $P\{\sup_t |Z(t)| > \lambda\}$. Applying (7) to $Z_n$, a direct computation yields that, for a given risk $\alpha > 0$,

$$P\left[ |\hat{\mu}_d(t) - \mu_N(t)| < \{2 \log(2/\alpha)\, \hat{\gamma}_d(t, t)\}^{1/2},\ t \in [0, T] \right] \simeq 1 - \alpha. \tag{8}$$

Equation (8) indicates that, compared to point-wise confidence intervals, global ones can be obtained simply by replacing the scaling given by the quantile of a standard Gaussian variable by the factor $\{2 \log(2/\alpha)\}^{1/2}$. For example, if $\alpha = 0.05$, respectively $\alpha = 0.01$, then $\{2 \log(2/\alpha)\}^{1/2} = 2.716$, respectively $3.255$, instead of $1.960$, respectively $2.576$, for a point-wise confidence interval with 0.95 confidence, respectively 0.99. The result presented in equation (7) is asymptotic and is therefore more reliable when $\alpha$ is close to zero, as seen in our simulation study.

4 Stratified sampling designs

We now consider the particular case of stratified sampling with simple random sampling without replacement in all strata, assuming the population $U$ is divided into a fixed number $H$ of strata. This means that there is a partitioning of $U$ into $H$ sub-populations denoted by $U_h$, $(h = 1, \ldots, H)$.
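Returning briefly to the band of equation (8) in Section 3.3, its scaling factor and limits at the grid points can be sketched in a few lines of Python. This is a hedged illustration only: the function names `band_factor` and `confidence_band` are hypothetical, and `mu_hat` and `var_hat` stand for the values of $\hat{\mu}_d(t_i)$ and $\hat{\gamma}_d(t_i, t_i)$ computed beforehand.

```python
import math
import numpy as np

def band_factor(alpha):
    """Scaling factor {2 log(2 / alpha)}^{1/2} of equation (8)."""
    return math.sqrt(2.0 * math.log(2.0 / alpha))

def confidence_band(mu_hat, var_hat, alpha):
    """Lower and upper limits of the global band at the grid points."""
    mu_hat = np.asarray(mu_hat, dtype=float)
    # Half-width: factor times the estimated standard deviation at each t_i
    half_width = band_factor(alpha) * np.sqrt(np.asarray(var_hat, dtype=float))
    return mu_hat - half_width, mu_hat + half_width

# Reproduces the factors quoted in the text for alpha = 0.05 and 0.01
print(round(band_factor(0.05), 3), round(band_factor(0.01), 3))  # 2.716 3.255
```

The half-width varies with $t$ only through the estimated standard deviation, which is why the band is wider at instants where the variance is larger.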
We can define the mean curve $\mu_h$ within each stratum $h$ as $\mu_h(t) = N_h^{-1} \sum_{k \in U_h} Y_k(t)$, $t \in [0, T]$, where $N_h$ is the number of units in stratum $h$. The covariance function, $\gamma_h(s, t)$, within stratum $h$ is defined by

$$\gamma_h(s, t) = N_h^{-1} \sum_{k \in U_h} \{Y_k(s) - \mu_h(s)\}\{Y_k(t) - \mu_h(t)\}, \quad (s, t) \in [0, T] \times [0, T].$$

In stratified sampling with simple random sampling without replacement in all strata, the first and second order inclusion probabilities are explicitly known, and the mean curve estimator of $\mu_N(t)$ is

$$\hat{\mu}_{strat}(t) = N^{-1} \sum_{h=1}^{H} n_h^{-1} N_h \sum_{k \in s_h} Y_k(t), \quad t \in [0, T],$$

where $s_h$ is a sample of size $n_h$, with $n_h < N_h$, obtained by simple random sampling without replacement in stratum $U_h$. The covariance function of $\hat{\mu}_{strat}$ can be expressed as

$$\gamma_{strat}(s, t) = \frac{1}{N^2} \sum_{h=1}^{H} N_h \frac{N_h - n_h}{n_h} \tilde{\gamma}_h(s, t), \quad (s, t) \in [0, T] \times [0, T],$$

with $(N_h - 1)\, \tilde{\gamma}_h(s, t) = N_h\, \gamma_h(s, t)$.

For real valued quantities, optimal allocation rules, which determine the sizes $n_h$ of the samples in all the strata, are generally defined in order to obtain an estimator whose variance is as small as possible. In our functional context, and as in the multivariate case (Cochran, 1977, § 5A.2), determining an optimal allocation clearly depends on the criterion to be minimized. Indeed, one could consider many different optimization criteria which would lead to different optimal allocation rules. The width of the global confidence bands derived in equation (8) depends only on the standard deviation of the estimator at each instant $t$, and minimising the width at the worst instant of time or minimising the average width along time are natural criteria. Nevertheless, finding the solution of such optimization problems is not trivial and is not investigated further in this paper.
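The stratified estimator and its variance function given above can be sketched as follows for curves observed on a common grid. This is an illustrative sketch, not the code used in the study: the function names and the representation of each stratum as a NumPy array with one row per unit are assumptions made here.

```python
import numpy as np

def stratified_mean_curve(samples, stratum_sizes):
    """Estimator mu_hat_strat at the grid points.

    samples       : list of (n_h, d) arrays, one sample per stratum
    stratum_sizes : list of the stratum sizes N_h
    """
    total_size = sum(stratum_sizes)
    # (N_h / n_h) * sum over s_h of Y_k(t) equals N_h times the sample mean
    weighted = sum(N_h * y.mean(axis=0) for y, N_h in zip(samples, stratum_sizes))
    return weighted / total_size

def stratified_variance_curve(strata, sample_sizes):
    """gamma_strat(t, t) at the grid points, from the full strata.

    strata       : list of (N_h, d) arrays containing all curves of U_h
    sample_sizes : list of the sample sizes n_h
    """
    total_size = sum(len(y) for y in strata)
    total = 0.0
    for y, n_h in zip(strata, sample_sizes):
        N_h = len(y)
        # gamma_tilde_h(t, t): within-stratum variance with divisor N_h - 1
        gamma_tilde = y.var(axis=0, ddof=1)
        total = total + N_h * (N_h - n_h) / n_h * gamma_tilde
    return total / total_size**2
```

With a single stratum the variance reduces to the usual $(1 - n/N)\, \tilde{\gamma}(t, t)/n$ of simple random sampling without replacement, and it vanishes when every stratum is fully observed.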
If we consider the optimal allocation based on minimising the mean variance instead of the mean standard deviation, we can then find explicit and simple solutions to

$$\min_{(n_1, \ldots, n_H)} \int_0^T \gamma_{strat}(t, t)\, dt \quad \text{subject to} \quad \sum_{h=1}^{H} n_h = n \ \text{ and } \ n_h > 0,\ h = 1, \ldots, H. \tag{9}$$

The solution is

$$n_h^* = n\, \frac{N_h S_h}{\sum_{i=1}^{H} N_i S_i}, \tag{10}$$

with $S_h^2 = \int_0^T \tilde{\gamma}_h(t, t)\, dt$, $h = 1, \ldots, H$, similar to that of the multivariate case when considering a total variance criterion (Cochran, 1977). This means that a stratum with higher variance than the others should be sampled at a higher sampling rate $n_h / N_h$. The gain when considering optimal allocation compared to proportional allocation, i.e. $n_h = n N_h / N$, can also be derived easily.

5 An illustration with electricity consumption

Over the next few years Électricité De France plans to install millions of sophisticated electricity meters that will be able to send, on request, electricity consumption measurements every second. Empirical studies have shown that even the simplest survey sampling strategies, such as simple random sampling without replacement, are very competitive with signal processing approaches such as wavelet expansions, when the aim is to estimate the mean consumption curve. To test and compare the different possible strategies, a test population of N = 18902 electricity meters has been installed in small and large companies. These electricity meters have read electricity consumption every half an hour over a period of two weeks. We split the temporal observations and considered only the second week for estimation. The readings from the first week were used to build the strata. Thus, our population of curves is a set of N = 18902 vectors $Y_k = \{Y_k(t_1), \ldots, Y_k(t_d)\}$ of size $d = 336$.
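For discretized curves such as the vectors just described, the allocation rule (10) of Section 4 can be sketched as follows. The sketch rests on assumptions that should be stated plainly: the integral defining $S_h^2$ is approximated by the trapezoidal rule, the allocation is returned as real numbers (rounding to integers and capping at $N_h$ are left out), and the function names are hypothetical.

```python
import numpy as np

def optimal_allocation(strata, grid, n):
    """Real-valued allocation n_h* = n * N_h * S_h / sum_i N_i * S_i.

    strata : list of (N_h, d) arrays containing the curves of each stratum
    grid   : (d,) array of discretization points
    n      : total sample size
    """
    sizes = np.array([len(y) for y in strata], dtype=float)
    spreads = []
    for y in strata:
        gamma_tilde = y.var(axis=0, ddof=1)  # gamma_tilde_h(t_i, t_i)
        # S_h^2: integral of gamma_tilde_h(t, t) over [0, T], trapezoidal rule
        s2 = np.sum(0.5 * (gamma_tilde[1:] + gamma_tilde[:-1]) * np.diff(grid))
        spreads.append(np.sqrt(s2))
    weights = sizes * np.array(spreads)
    return n * weights / weights.sum()

def proportional_allocation(stratum_sizes, n):
    """Proportional allocation n_h = n * N_h / N, for comparison."""
    sizes = np.asarray(stratum_sizes, dtype=float)
    return n * sizes / sizes.sum()
```

For two strata of equal size whose curves differ only by a scale factor of 3, the rule assigns three quarters of the sample to the more variable stratum, whereas proportional allocation splits the sample evenly.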
Identifying each unit $k$ of the population with its trajectory $Y_k$, we consider now a particular case of stratified sampling which consists in clustering the space $C[0, T]$ of all possible trajectories into a fixed number $H$ of strata.

Figure 2: (a) Mean curve in each stratum (horizontal axis: hours; vertical axis: electricity consumption, kW-h). (b) Theoretical standard deviation function $\sqrt{\gamma(t, t)}$ (kW-h, against hours) for simple random sampling without replacement (solid line), stratified sampling with proportional allocation (dashed line) and stratified sampling with optimal allocation (dash-dotted line).

The strata were built by clustering the population according to the maximum level of consumption during the first week. We decided to retain $H = 4$ different clusters based on the quartiles so that all the strata have the same size. The mean trajectories during the first week in the clusters, drawn in Figure 2 (a), show a clear size effect. The strata have been numbered according to global mean consumption. Stratum 4, at the top of Figure 2 (a), corresponds to consumers with high global levels of consumption whereas stratum 1, at the bottom of Figure 2 (a), corresponds to consumers with low global levels of consumption.

We compared three sampling strategies, with the same sample size $n = 2000$, to estimate the mean population curve $\mu(t)$ and build confidence bands during the second week.
In order to evaluate these estimators, we drew 1000 samples using the following sampling designs.

SRSWR: simple random sampling without replacement, which was first tested by Électricité de France.

Proportional: stratified sampling with proportional allocation, in which the allocation in each stratum is defined by $n_h = n N_h / N$; the sample size in each stratum is 500.

Optimal: stratified sampling with optimal allocation according to the rule defined in (10). The sample sizes in the strata are 126 (stratum 1), 212 (stratum 2), 333 (stratum 3) and 1329 (stratum 4).

To evaluate the accuracy of the estimators, we considered the following loss criteria, evaluated with discretized data using quadrature rules, for the estimator $\hat{\mu}$ of the mean trajectory and the estimator $\hat{\gamma}$ of the variance function,

$$R(\hat{\mu}) = \int_0^T |\hat{\mu}(t) - \mu(t)|\, dt, \qquad R(\hat{\gamma}) = \int_0^T |\hat{\gamma}(t, t) - \gamma(t, t)|\, dt. \tag{11}$$

Basic statistics for the estimation errors of the mean function are given in Table 1. First, we observe that clustering the space of functions by means of stratified sampling leads to a large gain in terms of the accuracy of the estimators. In addition, there is a substantial difference between the proportional and the optimal allocation rules.

Table 1: Estimation errors for $\mu$ and $\gamma(t, t)$ for the different sampling designs.

| Design | Mean function: mean | 1st quartile | median | 3rd quartile | Variance function: mean | 1st quartile | median | 3rd quartile |
|---|---|---|---|---|---|---|---|---|
| SRSWR | 4.46 | 2.37 | 3.75 | 5.68 | 5.26 | 2.42 | 4.04 | 6.64 |
| Proportional | 3.48 | 2.03 | 2.87 | 4.43 | 4.77 | 2.07 | 3.51 | 5.79 |
| Optimal | 2.43 | 1.55 | 2.10 | 3.04 | 1.02 | 0.56 | 0.88 | 1.30 |

We now examine the true standard deviation functions $\sqrt{\gamma(t, t)}$, which are proportional to the width of the confidence bands. They depend on the sampling design and are drawn in Figure 2 (b).
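The loss criteria (11) are evaluated on discretized data with a quadrature rule. A minimal sketch, assuming the trapezoidal rule on the observation grid (the text does not specify which rule was used) and a hypothetical function name `l1_loss`, is:

```python
import numpy as np

def l1_loss(estimate, target, grid):
    """Trapezoidal approximation of the integral of |estimate - target|.

    estimate, target : (d,) arrays of values at the points of `grid`
    grid             : (d,) increasing array of discretization points
    """
    err = np.abs(np.asarray(estimate, dtype=float) - np.asarray(target, dtype=float))
    # Average of the error at adjacent grid points times the interval length
    return float(np.sum(0.5 * (err[1:] + err[:-1]) * np.diff(grid)))
```

The same function serves for $R(\hat{\mu})$ and for $R(\hat{\gamma})$, the latter being obtained by passing the diagonals $\hat{\gamma}(t_i, t_i)$ and $\gamma(t_i, t_i)$.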
The theoretical standard deviation is much smaller, at all instants $t$, for the optimal allocation rule: it is about half that obtained with simple random sampling without replacement. There is also a strong periodicity effect in simple random sampling without replacement, due to the lack of control over the units with high levels of consumption (stratum 4). Estimation errors, according to criterion (11), of the true covariance functions are reported in Table 1. The error is much smaller for stratified sampling with optimal allocation than for the other estimators; optimal allocation provides better estimates as well as better estimation of their variance.

Finally, we computed the global confidence bands to check that formula (8), which relies on asymptotic properties of the supremum of Gaussian processes, remains valid when considering confidence levels 0.95 and 0.99. The empirical coverage is close to the nominal one for simple random sampling without replacement, 93.8% and 98.3%, whereas it is somewhat liberal, especially for smaller levels, for the stratified sampling designs, 88.7% and 96.8% for proportional allocation, and 88.1% and 96.8% for optimal allocation.

6 Concluding remarks

The experimental results on a test population of electricity consumption curves confirm that stratification, in conjunction with the optimal allocation rule, can lead, in cases of such high dimensional data, to important gains in terms of the accuracy of the estimation and width of the global confidence bands compared to more basic approaches. We have proposed a simple rule to get confidence bands that could certainly be improved, in terms of empirical coverage, by computing more realistic scaling factors with bootstrap procedures (Faraway, 1997) or Gaussian process simulations (Degras, 2010).

Choosing appropriate strata is also an important aspect of such improvement.
Nevertheless, it will generally be impossible to determine for all units to which cluster they belong. Borrowing ideas from Breidt & Opsomer (2008), one possible strategy is to perform clustering on the observed sample and then try to predict to which stratum the units that are not in the sample belong, using auxiliary information and supervised classification.

We have assumed that the observed trajectories are not corrupted by noise at the discretization points. Although this assumption seems quite reasonable in the case of electricity consumption measurements, it is not true in general. Thus, linear interpolation may not always be effective and linear smoother estimators, such as kernels or smoothing splines, would probably be more appropriate ways to obtain functional versions of the discretized observations.

Finally, another direction for future research is to combine optimal allocation for stratification with model-assisted estimation when auxiliary information is available. There are close relationships between the shape of electricity consumption curves and variables such as past consumption, temperature, household area or type of electricity contract. Such an estimation procedure relies, as noted in Cardot et al. (2010), on a parsimonious representation of the trajectories in order to reduce the dimension of the data. One way to achieve this is to first perform a functional principal components analysis and then to model the relationship between the principal components and the auxiliary information.

Acknowledgement

The authors are grateful to the engineers, and particularly to Alain Dessertaine, of the Département Recherche et Développement at Électricité de France for fruitful discussions and for allowing us to illustrate this research with the electricity consumption data.
This article was also improved by comments and suggestions of the referees as well as discussions with Dr. Camelia Goga and Pauline Lardin. Etienne Josserand thanks the Conseil Régional de Bourgogne, France for its financial support (FABER PhD grant).

Appendix: proofs

Proof of Proposition 3.1. We study approximation and sampling errors separately:

$$\sup_{t \in [0, T]} |\hat{\mu}_d(t) - \mu_N(t)| \leq \sup_{t \in [0, T]} |\hat{\mu}_d(t) - \hat{\mu}_N(t)| + \sup_{t \in [0, T]} |\hat{\mu}_N(t) - \mu_N(t)|. \tag{12}$$

Suppose $t \in [t_i, t_{i+1}[$; then $|Y_k(t) - \tilde{Y}_k(t)| \leq |Y_k(t_i) - Y_k(t_{i+1})| + |Y_k(t) - Y_k(t_i)|$. By Assumptions 1-2 and an application of the Cauchy–Schwarz inequality,

$$|\hat{\mu}_d(t) - \hat{\mu}_N(t)| \leq \frac{1}{N} \sum_{k \in s} \frac{|Y_k(t) - \tilde{Y}_k(t)|}{\pi_k} \leq \frac{1}{\min_{k \in U_N} \pi_k} \left[ \frac{1}{N} \sum_{k \in U} \left\{ Y_k(t_i) - \tilde{Y}_k(t) \right\}^2 \right]^{1/2} \leq \frac{1}{\lambda}\, C_6\, |t_{i+1} - t_i|^{\beta},$$

for some positive constant $C_6$ which does not depend on $t$. Consequently,

$$\sqrt{n} \sup_{t \in [0, T]} |\hat{\mu}_d(t) - \hat{\mu}_N(t)| \leq \sqrt{n}\, \frac{C_6}{\lambda} \max_{i \in \{1, \ldots, d_N - 1\}} |t_{i+1} - t_i|^{\beta}. \tag{13}$$

We now study the sampling error. Consider the pseudo-metric $d_N^2(s, t) = n E\{\hat{\mu}_N(t) - \mu_N(t) - \hat{\mu}_N(s) + \mu_N(s)\}^2$ for all $(s, t) \in [0, T] \times [0, T]$. We have, for some constant $C_7$,

$$d_N^2(s, t) \leq \frac{n}{N^2} \sum_{k, \ell \in U_N} \left| \frac{\Delta_{k\ell}}{\pi_k \pi_\ell} \right| |Y_k(t) - Y_k(s)|\, |Y_\ell(t) - Y_\ell(s)| \leq \frac{n}{N} \frac{C_3}{\lambda} |t - s|^{2\beta} + \frac{n}{\lambda^2} \max_{k \neq \ell} |\Delta_{k\ell}| \left[ \frac{1}{N} \sum_{k \in U_N} \{Y_k(t) - Y_k(s)\}^2 \right] \leq C_7 |t - s|^{2\beta}. \tag{14}$$

We apply a result of van der Vaart and Wellner (2000, § 2.2) based on maximal inequalities to get the uniform convergence and consider the packing number $D(\epsilon, d_N)$, which is the maximum number of points in $[0, T]$ whose distance between each pair is strictly larger than $\epsilon$. It is clear from (14) that $D(\epsilon, d_N) = O(\epsilon^{-1/\beta})$.
Considering now the particular Orlicz norm with $\psi(x) = x^2$ in Theorem 2.2.4 of van der Vaart and Wellner (2000), we directly find that $\int_0^T \psi^{-1}(\epsilon^{-1/\beta})\, d\epsilon < \infty$ when $\beta > 1/2$, and consequently there is a constant $C_8$ such that

$$E\left( \sqrt{n} \sup_{s, t} |\hat{\mu}_N(t) - \mu_N(t) - \hat{\mu}_N(s) + \mu_N(s)| \right) \leq C_8. \tag{15}$$

Since $\sup_t |\hat{\mu}_N(t) - \mu_N(t)| \leq |\hat{\mu}_N(0) - \mu_N(0)| + \sup_{s, t} |\hat{\mu}_N(t) - \mu_N(t) - \hat{\mu}_N(s) + \mu_N(s)|$, we get the announced result with (12), (13) and (15).

Proof of Proposition 3.2. The proof follows the same lines as the proof of Proposition 3.1. Let us first write

$$\sup_{t \in [0, T]} |\hat{\gamma}_d(t, t) - \gamma_N(t, t)| \leq \sup_{t \in [0, T]} |\hat{\gamma}_d(t, t) - \hat{\gamma}_N(t, t)| + \sup_{t \in [0, T]} |\hat{\gamma}_N(t, t) - \gamma_N(t, t)|. \tag{16}$$

Suppose $t \in [t_i, t_{i+1}[$ and define $\delta_{kl}(t) = |\tilde{Y}_l(t) - Y_l(t)|\, |Y_k(t)|$. With Assumptions 1-3, we have, for some constants $C_9$ and $C_{10}$,

$$|\hat{\gamma}_d(t, t) - \hat{\gamma}_N(t, t)| \leq \frac{C_9}{N^2} \left[ \sum_{k \in s} \left| \tilde{Y}_k^2(t) - Y_k^2(t) \right| + \max_{k \neq l} |\Delta_{kl}| \sum_{k \in s} \sum_{l \neq k} \{\delta_{kl}(t) + \delta_{lk}(t)\} \right] \leq \frac{C_{10}}{N} |t_{i+1} - t_i|^{\beta}.$$

Thus, using Assumption 1,

$$n \sup_{t \in [0, T]} |\hat{\gamma}_d(t, t) - \hat{\gamma}_N(t, t)| \leq C_{10} \max_{i \in \{1, \ldots, d_N - 1\}} |t_{i+1} - t_i|^{\beta}. \tag{17}$$

Consider now the sampling error and define, for $(s, t) \in [0, T] \times [0, T]$, $d_N^2(s, t) = n^2 E\{\hat{\gamma}_N(t, t) - \gamma_N(t, t) - \hat{\gamma}_N(s, s) + \gamma_N(s, s)\}^2$ and $\phi_{kl}(s, t) = Y_k(t) Y_l(t) - Y_k(s) Y_l(s)$. We have

$$d_N^2(s, t) = \frac{n^2}{N^4} \sum_{k, l \in U_N} \sum_{k', l' \in U_N} \phi_{kl}(s, t)\, \phi_{k'l'}(s, t)\, \frac{\Delta_{kl}}{\pi_k \pi_l} \frac{\Delta_{k'l'}}{\pi_{k'} \pi_{l'}}\, E\left\{ \left( \frac{I_k I_l}{\pi_{kl}} - 1 \right) \left( \frac{I_{k'} I_{l'}}{\pi_{k'l'}} - 1 \right) \right\}.$$

Following the same lines as the proof of Theorem 3 in Breidt & Opsomer (2000), we get after some algebra that, for some constant $C_{11}$,

$$d_N^2(s, t) \leq C_{11} \left[ n^{-1} + \max_{(k, l, k', l') \in D_{4,N}} \left| E\{(I_k I_l - \pi_{kl})(I_{k'} I_{l'} - \pi_{k'l'})\} \right| \right] |t - s|^{2\beta}. \tag{18}$$

Applying again a maximal inequality as in the proof of Proposition 3.1, we get the announced result.

Proof of Proposition 3.3. Noting that, with (13), $\sqrt{n}\{\hat{\mu}_d(t) - \mu_N(t)\} = \sqrt{n}\{\hat{\mu}_N(t) - \mu_N(t)\} + o(1)$, uniformly in $t$, we only need to study the asymptotic distribution of the random function $X_n(t) = \sqrt{n}\{\hat{\mu}_N(t) - \mu_N(t)\}$, for $t \in [0, T]$.

We first consider an $m$-tuple $(t_1, \ldots, t_m) \in [0, T]^m$, a vector $c^T = (c_1, \ldots, c_m) \in \mathbb{R}^m$ and prove that $\sum_{i=1}^m c_i X_n(t_i)$ is asymptotically Gaussian for all $c \in \mathbb{R}^m$. Considering $Y_{kc} = \sum_{i=1}^m c_i Y_k(t_i)$, it is clear, with Assumption 6, that $N^{-1} \sum_{k \in U} |Y_{kc}|^{2+\delta} < \infty$ and we have

$$\sum_{i=1}^m c_i X_n(t_i) = \sqrt{n} \left( \frac{1}{N} \sum_{k \in s} \frac{Y_{kc}}{\pi_k} - \sum_{i=1}^m c_i\, \mu_N(t_i) \right).$$

Denoting by $\hat{\mu}_c = N^{-1} \sum_{k \in s} \pi_k^{-1} Y_{kc}$ the Horvitz–Thompson estimator of $\mu_c = N^{-1} \sum_{k \in U} Y_{kc}$, it is clear that $\mu_c = \sum_{i=1}^m c_i \mu_N(t_i)$, $E(\hat{\mu}_c) = \mu_c$, and with Assumption 6, $\sqrt{n}\{\hat{\mu}_c - E(\hat{\mu}_c)\}$ converges in distribution to $N(0, c^T M c)$, where $M$ is a covariance matrix with generic elements $[M]_{ij} = \breve{\gamma}(t_i, t_j)$. The Cramér–Wold device tells us that the vector $(X_n(t_1), \ldots, X_n(t_m))$ is asymptotically multivariate normal.

Secondly, we need to check that $X_n$ satisfies a tightness property in order to get the asymptotic convergence in distribution in the space of continuous functions $C[0, T]$. We have with (14), for all $(s, t) \in [0, T] \times [0, T]$,

$$E\{|X_n(t) - X_n(s)|^2\} \leq C_7 |t - s|^{2\beta},$$

and the sequence $X_n$ is tight, when $\beta > 1/2$, according to Theorem 12.3 of Billingsley (1968).

References

Bickel, P. J. & Freedman, D. A. (1984). Asymptotic normality and the bootstrap in stratified sampling. Annals of Statistics, 12, 470-482.
Billingsley, P. (1968). Convergence of Probability Measures. John Wiley, New York.
Breidt, F. J. & Opsomer, J. D. (2000).
Lo cal p olynomial regression estimators in surv ey sampling. A nnals of Statistics , 28 , 1026-1053. Breidt, F. J. & Opsomer, J. D. (2008). Endogeneous p ost-stratification in surveys: classifying with a sample-fitted mo del. A nnals of Statistics , 36 , 403-427. Cardot, H., Chaouch, M., Goga, C. & C. Labru ` ere (2010). Prop erties of design- based functional principal comp onen ts analysis. J. of Statistic al Planning and Infer enc e , 140 , 75-91. Chen, J. & Rao, J. N. K. (2007). Asymptotic normality under tw o-phase sampling designs. Statistic a Sinic a , 17 , 1047-1064. Chiky, R. & H ´ ebrail, G. (2008). Summarizing distributed data streams for storage in data w arehouses. in DaW aK 2008, I-Y. Song, J. Eder and T. M. Nguy en, eds. L e ctur e Notes in Computer Scienc e , Springer, 65-74. Cochran, W.G. (1977). Sampling te chniques . 3rd Edition, Wiley , New Y ork. Degras, D. (2009). Nonparametric estimation of a trend based up on sampled con tin uous pro cesses. C. R. Math. A c ad. Sci. Paris , 347 , 191-194. 14 Degras, D. (2010). Simultaneous confidence bands for nonparametric regression with functional data. Statistic a Sinic a , to app ear. Erd ¨ os, P. & R ´ enyi, A. (1959). On the cen tral limit theorem for samples from a finite p opulation. Publ. Math. Inst. Hungar. A c ad. Sci. 4 , 49-61. F ara w a y, J.T. (1997). Regression analysis for a functional resp onse. T e chnometrics , 39 , 254-261. Fuller, W.A. (2009). Sampling Statistics. John Wiley and Sons. H ` ajek, J. (1960). Limiting distributions in simple random sampling from a finite p opula- tion. Publ. Math. Inst. Hungar. A c ad. Sci. 5 , 361-374. Isaki, C.T. & Fuller, W.A. (1982). Survey design under the regression sup erp opulation mo del. J. Am. Statist. Ass. 77 , 89-96. Landa u, H. & Shepp, L.A. (1970). On the supremum of a Gaussian process. Sankhy˜ a , 32 , 369-378 M ¨ uller, H.G. (2005). F unctional mo delling and classification of longitudinal data (with discussions). Sc and. J. Statist. 
, 32 , 223-246. Ramsa y, J. O. & Sil verman, B.W. (2005). F unctional Data Analysis . Springer-V erlag, 2nd ed. R obinson, P. M. & S ¨ arndal, C. E. (1983). Asymptotic prop erties of the generalized regression estimator in probability sampling. Sankhya : The Indian Journal of Statistics , 45 , 240-248. V an der V aar t, A.W. & Wellner, J.A. (2000). We ak Conver genc e and Empiric al Pr o c esses . Springer-V erlag, New Y ork. 15
