Comment: Monitoring Networked Applications With Incremental Quantile Estimation
Comment: Monitoring Networked Applications With Incremental Quantile Estimation [arXiv:0708.0302]
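Authors: Bin Yu (author of this comment). The authors of the main article are not listed here; this comment offers an extension and critique based on the algorithm of Chambers et al. (For exact author information, see the main article or the publication page of this comment.)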
Statistical Science 2006, Vol. 21, No. 4, 483–484. DOI: 10.1214/088342306000000628. Main article DOI: 10.1214/088342306000000583. © Institute of Mathematical Statistics, 2006

Bin Yu

First of all, we would like to thank the authors for their timely paper focusing on the important area of streaming data. Recently, much attention has been paid by the statistics community to the high dimensionality or massiveness of data in the information technology age. However, streaming data represent the other important feature of the IT age, the high rate of data. Both high dimensionality and high rate require fast computation, but the real-time constraint on streaming data forces its computation to be a magnitude faster than that of the off-line or batch mode of massive data. As a result, in the absence of supercomputers, the algorithms for streaming data have to be very simple to be effective.

Chambers et al. deal with streaming data for computer system monitoring. Streaming data arise also in many other fields of science and engineering, such as astronomy, geoscience and sensor networks. Chambers et al. devise a simple and practical algorithm for updating quantiles to be used to monitor the reliability of a large system based on streamed data. Stationarity is implicitly assumed since one could argue that a good computer system should be more or less stable over time until the system is updated.

A desirable add-on to the estimated quantile of Chambers et al. is a measure of uncertainty which in the i.i.d. case is trivial because of the relationship between the variance and mean of a binomial random variable. However, it is hard to imagine that a computer system follows an i.i.d. process.
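To make the i.i.d. remark concrete, the following worked equation is a sketch under the assumption that $X_1, \dots, X_n$ are i.i.d. with CDF $F$. The notation $\hat F_n$ for the empirical CDF and the numbers in the example are illustrative and are not taken from Chambers et al.

```latex
\begin{gather*}
n \hat F_n(x) = \sum_{t=1}^{n} I\{X_t \le x\} \sim \mathrm{Binomial}\bigl(n, F(x)\bigr), \\
\mathrm{E}\bigl[n \hat F_n(x)\bigr] = n F(x), \qquad
\mathrm{Var}\bigl[n \hat F_n(x)\bigr] = n F(x)\bigl(1 - F(x)\bigr), \\
\widehat{\mathrm{se}}\bigl(\hat F_n(x)\bigr) = \sqrt{\frac{\hat F_n(x)\bigl(1 - \hat F_n(x)\bigr)}{n}}, \\
\text{e.g. } n = 10000,\ \hat F_n(x) = 0.99: \quad
\widehat{\mathrm{se}} = \sqrt{\frac{0.99 \times 0.01}{10000}} \approx 0.000995 .
\end{gather*}
```

Under serial dependence the binomial variance no longer applies in general, which is the difficulty raised for real computer systems.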
The real-time constraint could make the pursuit of an uncertainty measure harder than the quantile estimation itself.

Bin Yu is Professor, Department of Statistics, University of California, Berkeley, Berkeley, California 94720, USA (e-mail: binyu@stat.berkeley.edu).

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2006, Vol. 21, No. 4, 483–484. This reprint differs from the original in pagination and typographic detail.

For a natural environment to be monitored by a sensor network, the variable of interest (say, temperature) is most likely to be changing over time and hence nonstationary. Fortunately, there is an easy extension of the Chambers et al. algorithm to the nonstationary case. Because we can build the CDF and therefore the quantiles based on a moving window of data, it is applicable to nonstationary data streams. However, in this case, the data have to be kept over a duration of the size of the moving window $W$, in addition to the current estimate of the CDF.

Formally, let $W$ denote the size of the moving time window which is application-specific to guarantee some stationarity of the variable within the window. Let $O$ denote the initial block of (old) data to be removed when new data come in, $K$ the data block kept and $N$ the new block to be taken into account: $|W| = |O| + |K|$ and $|O| = |N|$.
Since the current empirical count of observations less than any $x$ is a summation of the indicator function of the interval $(-\infty, x]$ over the current block of data (over $K$ and $N$), it can be obtained by using the last empirical count and the summation over the old block:

\begin{align*}
\sum_{t \in \text{current block}} I\{X_t \le x\}
&= \sum_{t \in K} I\{X_t \le x\} + \sum_{t \in N} I\{X_t \le x\} \\
&= \sum_{t \in K} I\{X_t \le x\} + \sum_{t \in N} I\{X_t \le x\} + \sum_{t \in O} I\{X_t \le x\} - \sum_{t \in O} I\{X_t \le x\} \\
&= \sum_{t \in O} I\{X_t \le x\} + \sum_{t \in K} I\{X_t \le x\} + \sum_{t \in N} I\{X_t \le x\} - \sum_{t \in O} I\{X_t \le x\} \\
&= \sum_{t \in \text{previous block}} I\{X_t \le x\} + \sum_{t \in N} I\{X_t \le x\} - \sum_{t \in O} I\{X_t \le x\}.
\end{align*}

With proper scaling and weighting, the empirical CDF for the current block can be easily updated based on the empirical CDF of the previous block, provided that the $O$-block data are kept and made available to the updating algorithm. Thus this obvious modification makes Chambers et al.'s algorithm applicable to the nonstationary case.

When the data stream is stationary, the proposed method keeps only a CDF; hence it is a form of compression. The updating data block in both the stationary and nonstationary cases might still be too expensive to communicate in situations such as sensor networks where communication is more costly than computation in terms of battery consumption. For nonstationary data, data are kept in a moving window in addition to the current CDF. If the data rate and volume are large, even a moving window of data might be too much. Hence one should compress them. It would be best if the updating could be done directly on compressed data without decompressing. This calls for an interaction of statistical analysis with data compression algorithms.
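The moving-window count update derived earlier (previous count, plus the count over the new block, minus the count over the old block) can be sketched in a few lines of Python. This is a minimal illustration, not Chambers et al.'s algorithm: it assumes the CDF is tracked only on a fixed grid of thresholds, and the names `counts_leq`, `MovingWindowCDF`, `grid` and `initial_blocks` are hypothetical.

```python
from bisect import bisect_right
from collections import deque


def counts_leq(block, grid):
    """For each threshold x in grid, count the observations in block that are <= x."""
    ordered = sorted(block)
    return [bisect_right(ordered, x) for x in grid]


class MovingWindowCDF:
    """Empirical CDF on a fixed grid (an assumption) over a moving window of blocks."""

    def __init__(self, grid, initial_blocks):
        self.grid = sorted(grid)
        # The window is stored block by block so the oldest block O stays available.
        self.blocks = deque(list(b) for b in initial_blocks)
        self.size = sum(len(b) for b in self.blocks)  # |W|
        self.counts = [0] * len(self.grid)
        for block in self.blocks:
            for i, c in enumerate(counts_leq(block, self.grid)):
                self.counts[i] += c

    def update(self, new_block):
        """Slide the window: previous count + count over N - count over O."""
        new_block = list(new_block)
        if len(new_block) != len(self.blocks[0]):
            raise ValueError("new block must match the oldest block in size (|O| = |N|)")
        old_block = self.blocks.popleft()
        self.blocks.append(new_block)
        added = counts_leq(new_block, self.grid)
        removed = counts_leq(old_block, self.grid)
        self.counts = [c + a - r for c, a, r in zip(self.counts, added, removed)]

    def cdf(self):
        """Scale the counts by the window size to get the empirical CDF on the grid."""
        return [c / self.size for c in self.counts]


window = MovingWindowCDF(grid=[1, 2, 3], initial_blocks=[[1, 2], [3, 4]])
window.update([0, 5])  # drops the block [1, 2] and takes in [0, 5]
print(window.cdf())
```

Only the oldest and newest blocks are touched on each update, so the cost per update depends on the block size and grid size rather than on the full window, but the window itself still has to be stored, as noted for the nonstationary case.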
Moreover, if lossy compression has to be carried out, one should allocate more bits to the tails of the distribution because the extreme quantiles are monitored for potential system anomalies. It would be interesting further research to design a bit allocation algorithm and a compression scheme to go with the quantile updating method in Chambers et al. Natural questions are: What is the objective function for bit allocation? How should it be combined with the goal of statistical estimation? What forms of codes should be designed for data compression so that they can easily interact with the CDF or quantile updating algorithm?

We would like to finish with the message that the interaction of statistical analysis with data compression algorithms is indispensable for successful and timely information extraction from high-dimensional and high-rate IT data. Although there are works to address this issue (e.g., Braverman et al., 2003, and Jörnsten et al., 2003), much more needs to be done and especially so in the streaming data context.

REFERENCES

Braverman, A., Fetzer, E., Eldering, A., Nittel, S. and Leung, K. (2003). Semi-streaming quantization for remote sensing data. J. Comput. Graph. Statist. 12 759–780. MR2040396

Jörnsten, R., Wang, W., Yu, B. and Ramchandran, K. (2003). Microarray image compression: SLOCO and the effect of information loss. Signal Processing 83 859–869.