Precise Request Tracing and Performance Debugging for Multi-tier Services of Black Boxes
Zhihong Zhang 1,2, Jianfeng Zhan 1, Yong Li 1, Lei Wang 1, Dan Meng 1, Bo Sang 1
1 Institute of Computing Technology, Chinese Academy of Sciences
2 The Research Institution of China Mobile
Contact Person: jfzhan@ncic.ac.cn

Abstract

As more and more multi-tier services are developed from commercial components or heterogeneous middleware without the source code available, both developers and administrators need a precise request tracing tool to help understand and debug performance problems of large concurrent services of black boxes. Previous work fails to resolve this issue in several ways: it either accepts the imprecision of probabilistic correlation methods, or relies on knowledge of protocols to isolate requests in pursuit of tracing accuracy. This paper introduces a tool named PreciseTracer to help debug performance problems of multi-tier services of black boxes. Our contributions are two-fold: first, we propose a precise request tracing algorithm for multi-tier services of black boxes, which only uses application-independent knowledge; second, we present a component activity graph abstraction to represent causal paths of requests and facilitate end-to-end performance debugging. The low overhead and tolerance of noise make PreciseTracer a promising tracing tool for use on production systems.

1. Introduction

As more and more multi-tier services are developed from commercial off-the-shelf components or heterogeneous middleware and deployed on data centers without the source code available, both developers and administrators need a precise request tracing tool to understand and debug performance problems of large concurrent services of black boxes. Previous work fails to resolve this issue in several ways.
Most related work requires the source code of applications or middleware and relies upon the developer's instrumentation of them, and thus cannot be used for tracing requests of services of black boxes. For example, Magpie [1] requires the source code of applications and specific platforms with the appropriate logging capabilities [2]. The work [2][3] of HP Labs accepts the imprecision of probabilistic correlation methods: it uses time series analysis to offline infer causal paths from the logged messages of distributed systems of black boxes. Probably closest to our work is BorderPatrol [4], which relies on knowledge of protocols to isolate events or requests and proposes a precise request tracing method. When a multi-tier service is developed from commercial off-the-shelf components or heterogeneous middleware, BorderPatrol [4] requires writing many protocol processors and has to know more specialized knowledge than pure black-box approaches.

In this paper, we present a new precise request tracing method and make the following contributions:

(1) We design a novel precise tracing algorithm to deduce causal paths of requests from the interaction activities of components of black boxes. Our algorithm only uses application-independent knowledge, such as timestamps, end-to-end communication channels and process contexts, which are available from the operating system.

(2) We present a component activity graph (CAG) abstraction to represent causal paths of requests and facilitate end-to-end performance debugging of a multi-tier service. Our experiments show we can pinpoint performance problems by using CAGs to calculate changes in observed latency percentages of components.
The paper is organized as follows: Section 2 formulates the problem; Section 3 explains the system observation, presentation and analysis; Section 4 introduces the precise tracing algorithm; Section 5 evaluates the system; Section 6 summarizes related work; Section 7 draws a conclusion.

2. Problem statement

We treat each component in a multi-tier service as a black box. In Fig.1, we observe that a request will cause a series of interaction activities in the operating system kernel (in short, activities), e.g. sending or receiving messages. This happens under the contexts (process or kernel thread) of different components. We record an activity of sending a message as S_i,j, which means a process i sends a message to a process j. We record an activity of receiving a message as R_i,j, which means a process j receives a message from a process i.

Fig.1. Activities with causal relations in the kernel.

When an individual request is serviced, a series of activities have causal relations, or happened-before relationships, as defined by Lamport [7]. We define a sequence of activities with causal relations caused by an individual request as a causal path. For example in Fig.1, the activity sequence {R_c,1, S_1,2, R_1,2, S_2,3, R_2,3, S_3,x} constitutes a causal path. For each individual request, there is a causal path.

Our project develops a tracing tool to help developers and administrators in the following ways: 1) Precisely trace each request and correlate activities of components into causal paths. 2) Identify causal path patterns and obtain latency percentages of components for typical causal path patterns. 3) Debug performance problems of a multi-tier service with the help of 1) and 2).

The application limits of our system are as follows: 1) We treat each component in a multi-tier service as a black box.
We do not require the source code of the application, do not deploy instrumented middleware, and have no knowledge of the high-level protocols used by the service, like HTTP. 2) A single execution entity (process or kernel thread) of each component can only service one request in a certain period. For servicing each individual request, execution entities of components cooperate through sending or receiving messages using a reliable communication protocol, like TCP.

Though not all multi-tier services fall within the scope of our target applications, fortunately many popular services satisfy these assumptions. For example, our method can be used to analyze concurrent servers following the nine design patterns introduced in the book Unix Network Programming by Stevens [5], including the iteration model, the concurrent process model and the concurrent thread model.

3. The observation, presentation and analysis of system activities

3.1. The observation of system activities

We independently observe interaction activities of components of black boxes on different nodes. Concentrating on our main focus, we are only concerned with when servicing a request starts and finishes, and when components receive or send messages within the confines of a data center. The activity types we are concerned with are BEGIN, END, SEND, and RECEIVE. SEND and RECEIVE activities are the ones of sending and receiving messages. A BEGIN activity marks the start of servicing a new request, while an END activity marks the end of servicing a request. For each activity, we log four attributes: activity type, timestamp, context identifier and message identifier. For each activity, we use the (hostname, program name, ID of process, ID of thread) tuple to describe its context identifier, and the (IP of sender, port of sender, IP of receiver, port of receiver, message size) tuple to describe its message identifier.
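As a concrete illustration, the four logged attributes can be modeled as a small record; the class and field names below are our own, not part of PreciseTracer:

```python
from dataclasses import dataclass

# A sketch of one logged activity; field names are hypothetical.
@dataclass(frozen=True)
class Activity:
    kind: str        # "BEGIN", "END", "SEND" or "RECEIVE"
    timestamp: float # local timestamp on the observing node
    context: tuple   # (hostname, program name, process ID, thread ID)
    message: tuple   # (sender IP, sender port, receiver IP, receiver port, size)

a = Activity("SEND", 12.5,
             ("node1", "httpd", 4711, 4711),
             ("10.0.0.1", 3490, "10.0.0.2", 8009, 512))
```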
The instrumentation mechanism of our PreciseTracer depends on an open-source software package named SystemTap [8], which extends the capabilities of Kprobe [9], the tracing tool on a single Linux node. Using SystemTap, we have written a module named TCP_TRACE, which obtains context information of processes and threads from the operating system kernel and inserts probe points into the tcp_sendmsg and tcp_recvmsg functions of the kernel communication stack to log SEND and RECEIVE activities. When an application sends or receives a message, a probe point will be trapped and an activity is logged. The original format of an activity produced by TCP_TRACE is "timestamp hostname program_name ProcessID ThreadID SEND/RECEIVE sender_ip:port-receiver_ip:port message_size". PreciseTracer transforms the original format of an activity into more understandable n-ary tuples to describe the context and message identifiers of activities. Distinguishing the activity type is straightforward. SEND and RECEIVE activities are transformed directly. BEGIN and END activities are distinguished according to the ports of the communication channels. For example, a RECEIVE activity from a client to the web server's port 80 means the START of a request, and the SEND activity in the same connection with the opposite direction means the STOP of a request in our algorithm.

3.2. The presentation and analysis of system activities

Formally, a causal path can be described as a directed acyclic graph G(V, E), where the vertices V are the set of activities of components, and the edges E represent causal relations between activities. We define this abstraction as a component activity graph (CAG). For an individual request, a corresponding CAG represents all activities with causal relations in the life cycle of servicing that request. CAGs include two kinds of relations: the adjacent context relation and the message relation.
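Returning to the transformation step of Section 3.1, the raw-to-tuple conversion and the port-based BEGIN/END classification might be sketched as follows (a sketch, assuming the web server listens on port 80 and the raw record is whitespace-separated as in the quoted format; for brevity we simply relabel the activity type):

```python
def parse_record(line):
    """Turn one raw TCP_TRACE record into (type, timestamp, context, message)."""
    ts, host, prog, pid, tid, kind, channel, size = line.split()
    src, dst = channel.split("-")
    s_ip, s_port = src.split(":")
    r_ip, r_port = dst.split(":")
    context = (host, prog, int(pid), int(tid))
    message = (s_ip, int(s_port), r_ip, int(r_port), int(size))
    # Port heuristic from the text: a RECEIVE arriving at the web server's
    # port 80 marks the BEGIN of a request; the SEND on the same connection
    # in the opposite direction marks its END.
    if kind == "RECEIVE" and message[3] == 80:
        kind = "BEGIN"
    elif kind == "SEND" and message[1] == 80:
        kind = "END"
    return kind, float(ts), context, message

rec = parse_record("12.5 node1 httpd 4711 4711 RECEIVE 10.0.0.9:5000-10.0.0.1:80 217")
```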
We formally define the two relations based on the happened-before relation [7], which is denoted as →, as follows:

Adjacent Context Relation: if x and y are activities observed in the same context c (process or kernel thread) and caused by the same request r, x → y holds true, and no activity z is observed in the same context which satisfies the relations x → z and z → y, then we say an adjacent context relation exists between x and y, denoted as x →c y. So the adjacent context relation means that x has happened right before y in the same execution entity.

Message Relation: for servicing a request r, if x is a SEND activity which sends a message m, and y is a RECEIVE activity which receives the same message m, then we say a message relation exists between x and y, denoted as x →m y. So the message relation means that x, the sending of the message, has happened right before y, the receiving of the message, in two different execution entities.

If there is an edge from activity x to activity y in a CAG, for which x →c y or x →m y holds true, then x is the parent of y. In a CAG, every activity vertex must satisfy the property: each activity vertex has no more than two parents, and only a RECEIVE activity vertex can have two parents: one parent having an adjacent context relation and the other having a message relation with it. Fig.1 shows a causal path for an individual request, where the red solid arrows represent adjacent context relations and the blue dashed arrows represent message relations. For an individual request, it is clear that correlating a causal path is the course of building a CAG from the inputted interaction activities. According to a CAG, we can calculate the latencies of components in servicing an individual request.
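The two edge types just defined can be phrased as predicates over activity tuples; a minimal sketch with our own naming, using (type, timestamp, context, message) tuples:

```python
def message_relation(x, y):
    """x -m-> y: x is the SEND and y the RECEIVE of the same message."""
    return x[0] == "SEND" and y[0] == "RECEIVE" and x[3] == y[3]

def adjacent_context_relation(x, y, context_log):
    """x -c-> y: x happened right before y in the same execution entity."""
    if x[2] != y[2] or not x[1] < y[1]:
        return False
    # no activity z in the same context falls strictly between x and y
    return not any(z[2] == x[2] and x[1] < z[1] < y[1] for z in context_log)

msg = ("10.0.0.1", 3490, "10.0.0.2", 8009, 512)
s = ("SEND", 1.0, ("node1", "httpd", 4711, 4711), msg)
r = ("RECEIVE", 1.2, ("node2", "java", 999, 1000), msg)
e = ("END", 2.0, ("node1", "httpd", 4711, 4711), msg)
```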
For example, for the request in Fig.1, the latency of process 2 is t(S_2,3) - t(R_1,2), and the latency of the interaction from process 1 to process 2 is t(R_1,2) - t(S_1,2), where t is the local timestamp of each activity. The latency of process 2 is accurate, since all timestamps are from the same node. The latency of the interaction from process 1 to process 2 is inaccurate, since we do not remedy the clock skew between the two nodes.

We can classify CAGs into different causal path patterns according to the shapes of CAGs, since each CAG is a directed acyclic graph. Each causal path pattern is composed of a series of isomorphic CAGs, where similar vertices represent activities of the same type with the same context information. For a causal path pattern, we aggregate and average n isomorphic CAGs to compute an average causal path. Furthermore, we obtain the latencies of components for an average causal path.

4. The precise tracing algorithm

Fig.2. The architecture of PreciseTracer.

Fig.2 shows the architecture of PreciseTracer, which is composed of two major components: TCP_Trace and Correlator. Our precise tracing algorithm includes three main steps:

Step 1: When activities of components are logged by TCP_Trace on different nodes, they are sorted according to local timestamps in the first round.

Step 2: A module of the Correlator, named the ranker, is responsible for choosing candidate activities for composing CAGs.

Step 3: A module of the Correlator, named the engine, constructs CAGs from the outputs of the ranker, and then outputs CAGs.

Before we proceed to introduce the algorithm of the ranker, we explain how the engine stores unfinished CAGs. All unfinished CAGs are indexed with index map data structures.
An index map maps a key to a value, and supports basic searching, inserting and deleting operations. One index map, named mmap, is used to match message relations, and another one, named cmap, is used to match adjacent context relations. For the mmap, the key is the message identifier of an activity, and the value is an unmatched SEND activity with the same message identifier. The key in the cmap is the context identifier of an activity, and the value is the latest activity with the same context identifier.

4.1. Choosing candidate activities for composing CAGs

We choose the minimal local timestamp of activities on different nodes as the initial time, and set a sliding time window for processing the activity stream. Activities logged on different nodes will be fetched into the buffer of the ranker if their timestamps are within the sliding time window. The size of the sliding time window is independent of the clock skews, and it could be any value larger than 0, since our tracing algorithm does not depend on highly precise clock synchronization across distributed nodes. The ranker puts activities into several different queues according to the IP addresses of the context identifiers of activities. Naturally, the activities in the same queue are sorted by the same local clock, so the ranker only needs to compare the head activities of each queue and select candidate activities for composing CAGs according to the following rules:

Rule 1: If a head activity A in a queue has RECEIVE type and the ranker has found an activity X in the mmap for which X →m A holds true, then A is the candidate. If the key is the message identifier of an activity A and the value of the mmap points to a SEND activity X with the same message identifier, we can say X →m A.
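A minimal dict-based sketch of the two index maps and the Rule 1 test (helper names are our own; activities are (type, timestamp, context, message) tuples):

```python
mmap = {}  # message identifier -> unmatched SEND activity
cmap = {}  # context identifier -> latest activity in that context

def register(activity):
    """Update the index maps when an activity is consumed."""
    kind, _, context, message = activity
    if kind == "SEND":
        mmap[message] = activity
    cmap[context] = activity

def rule1_candidate(head):
    """Rule 1: a head RECEIVE whose matching SEND is already in the mmap."""
    return head[0] == "RECEIVE" and head[3] in mmap

msg = ("10.0.0.1", 3490, "10.0.0.2", 8009, 512)
send = ("SEND", 1.0, ("node1", "httpd", 4711, 4711), msg)
recv = ("RECEIVE", 1.2, ("node2", "java", 999, 1000), msg)
register(send)
```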
Rule 1 ensures that when a SEND activity has become a candidate and been delivered to the engine, the RECEIVE activity having a message relation with it will also become a candidate once it becomes a head activity in its queue.

Rule 2: If no head activity is qualified by Rule 1, then the ranker compares the types of the head activities of each queue according to the priority of the activity types, among which BEGIN has the highest priority, and chooses the candidate accordingly.

The engine constructs CAGs from the candidate activities as follows:

 1: correlate() {
 2:  while (get_next(current)) { switch (get_activity_type(current)) {
 3:   case BEGIN:
 4:    create a CAG with current activity as its root.
 5:   case END:
 6:    find the matched parent where parent →c current,
 7:    if (the match is found) {
 8:     add current into the matched CAG.
 9:     add a context edge from parent to current.
10:     output CAG.
11:    }
12:   case SEND:
13:    find matched parent_msg where parent_msg →c current,
14:    if (the match is found) {
15:     if (parent_msg.type == SEND)
16:      parent_msg.size += current.size.
17:     else {
18:      add current into the matched CAG.
19:      add a context edge from parent_msg to current.
20:     }
21:    }
22:   case RECEIVE:
23:    find matched parent_msg where parent_msg →m current,
24:    if (the match is found) {
25:     parent_msg.size -= current.size.
26:     if (parent_msg.size == 0) {
27:      add current into the matched CAG.
28:      add a message edge from parent_msg to current.
29:      find matched parent_cntx where parent_cntx →c current,
30:      if (the match is found)
31:       if (parent_msg and parent_cntx are in the same CAG)
32:        add a context edge from parent_cntx to current.
33:     }
34:    }
35:   } //switch
36:  } //while
37: } //correlate

Fig.3. The pseudo code of the algorithm.

Lines 12-34 handle SEND and RECEIVE activities. The activities are inherently asymmetric between the sender and the receiver because of the underlying buffer sizes and delivery mechanism, so the match between SEND and RECEIVE activities is not always one-to-one, but an n-to-n relation. Fig.4 shows a case in which the sender consecutively sends a message in two parts and the receiver receives it in three parts.
Our algorithm correlates and merges these activities according to the message sizes in the message identifier tuples.

Fig.4. Merging multiple SEND and RECEIVE activities.

A situation may happen in which an activity is wrongly correlated into two causal paths because of the reuse of threads in some concurrent programming paradigms. For example, in a thread-pool implementation, one thread may serve one request at a time; when the work is done, the thread is recycled into the thread pool. Lines 29-32 check whether the two parents are in the same CAG. If the check returns true, the engine will add an edge of context relation; otherwise it will not.

4.3. Disturbance tolerance

In a clean environment without disturbance, Rule 1 and Rule 2 in Section 4.1 can produce correct causal paths. But in a practical environment, there may be two disturbances: noise activities and concurrency disturbance.

Noise activities are caused by other applications coexisting with the target service on the same computer. Their activities through the kernel's TCP stack will also be logged and gathered by our tool. The ranker handles noise activities in two ways: 1) it filters noise activities according to their attributes, including program name, IP and port; 2) if activities cannot be filtered with the attributes, the ranker checks them with the is_noise function. If it returns true, the ranker will discard them. The pseudo code of the is_noise function is illustrated in Fig.5.

bool is_noise(Activity *E) {
  return ((E->type == RECEIVE)
    && (no matched SEND activity X in the mmap with X →m E)
    && (no matched SEND activity Y in the buffer of the ranker with Y →m E));
}

Fig.5. The pseudo code of the is_noise() function.

The second disturbance is called concurrency disturbance, which only exists in multi-processor computers.
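The size bookkeeping behind this merging (lines 15-16 and 25-28 of Fig.3) can be sketched in isolation: consecutive SENDs on a channel accumulate their sizes, each RECEIVE decrements the outstanding byte count, and the message edge is added only when the count reaches zero. A toy model:

```python
outstanding = 0  # bytes sent on one channel but not yet fully received

def on_send(size):
    """Merge consecutive SEND activities by accumulating their sizes."""
    global outstanding
    outstanding += size

def on_receive(size):
    """Return True when the merged message has been fully received."""
    global outstanding
    outstanding -= size
    return outstanding == 0

# Fig.4 scenario: sent in two parts, received in three parts.
on_send(100)
on_send(50)
partial1 = on_receive(60)
partial2 = on_receive(60)
complete = on_receive(30)
```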
Fig.6-a illustrates a possible case in which two concurrent requests are serviced concurrently by two multi-processor computers and four activities are observed. S^1_{1,2} denotes a SEND activity produced on CPU1 of Node1, and R^0_{1,2} is its matched RECEIVE activity produced on CPU0 of Node2. When these four activities are fetched into the buffer of the ranker, they are put into two queues as shown in Fig.6-a. The head activities of both queues are RECEIVE activities and block the matched SEND activities of each other. The ranker handles this case by swapping the head activity and the following activity in the two queues. Fig.6-b illustrates our solution.

Fig.6. Example of concurrency disturbance: a. an example of concurrency disturbance; b. solution of the concurrency disturbance.

4.4. The complexity of the algorithm

For a multi-tier service, the time complexity of our algorithm is approximately O(g * p * Δn), where g measures the structure complexity of a service, p is the number of requests in the fixed duration, and Δn is the size of the activity sequence per request in the sliding time window. Furthermore, the time complexity of our algorithm can be expressed as O(g * n), where n is the size of the activity sequence in the sliding time window. The space complexity of our algorithm is approximately O(2 * g * p * Δn) or O(2 * g * n).

5. Evaluation

5.1. Experimental environments and setup

Fig.7. The deployment diagram of RUBiS.

We choose RUBiS [10] as the target application.
Developed by Rice University, RUBiS is a three-tier auction site prototype modeled after eBay.com that is used to evaluate application servers' performance scalability. The experiment platform is an 8-node Linux cluster connected by a 100 Mbps Ethernet switch. Each node is an SMP with two PIII processors and 2 GB memory. Every node runs Red Hat Fedora Core 6 Linux with the kprobe [9] feature enabled. The deployment of RUBiS is shown in Fig.7. In the following experiments, all analyses are done offline. Client nodes emulate two kinds of workload: a read-only workload (Browse_only) and a read-write mixed workload (Default). According to the user guide of RUBiS, every workload includes 3 stages: up ramp with a duration of 2 minutes 9 milliseconds, runtime session with a duration of 7 minutes 30 seconds 9 milliseconds, and down ramp with a duration of 1 minute 10 milliseconds.

5.2. Evaluating the accuracy of the algorithm

For RUBiS, we modify the code to tag and propagate a globally unique request ID with each request, and the following attributes are logged for the Apache web server, the JBoss server and the MySQL database: (1) the request ID; (2) the start time and end time of servicing a request; (3) the ID of the process or thread. At the same time, with only application-independent knowledge, such as timestamps and end-to-end communication channels, we use PreciseTracer to obtain causal paths. From each causal path, we independently obtain information like (2) and (3). If all attributes of a causal path are consistent with the ones obtained from the logs of RUBiS, we confirm that the causal path is correct. So we define the path accuracy as follows:

Path accuracy = correct paths / all logged requests.
We test the accuracy of our algorithm for both the Browse_Only and Default workloads of RUBiS in many experiments with different configurations: (1) concurrent clients increase from 100 to 1000, and at the same time the size of the sliding time window varies from 1 millisecond to 10 seconds; (2) the clock skew changes from 1 millisecond to 500 milliseconds; (3) we run rlogin, ssh and the MySQL client to produce noise activities. In all these tests, the accuracy of our algorithm is 100%, with no false positives and no false negatives. This is because all activities of components are logged using SystemTap [8], so our algorithm correlates all activities into causal paths. However, under network congestion, the loss of activities may happen, though we have not observed it up to now. The loss of activities will result in deformed CAGs. When the probability of loss of activities is low, we can distinguish normal CAGs from deformed CAGs according to the difference in their quantities.

5.3. Evaluating the efficiency of the algorithm

5.3.1. Evaluating the complexity. We set the sliding time window to 10 milliseconds. When concurrent clients vary from 100 to 1000, we record the number of serviced requests and the correlation time. For different numbers of concurrent clients, the test duration is fixed for the Browse_Only workload defined in Section 5.1.

Fig.8. Requests v.s. concurrent clients.

Fig.9. Correlation time v.s. requests. The unit on the x-axis is 1000 requests.

From Fig.8 and Fig.9, we observe that the number of requests is linear in the number of concurrent clients, and the correlation time is linear in the number of serviced requests in the fixed test duration before RUBiS is saturated at the point of 800 clients.
In Section 4.4, we conclude that the time complexity of our algorithm is approximately O(g * p * Δn). Our experimental result in Fig.9 is consistent with the analysis: since g is a constant for RUBiS and Δn is unchanged in the fixed sliding time window, the correlation time is linear in the number of requests in the fixed test duration before RUBiS is saturated.

Fig.10. Correlation time v.s. sliding time window.

Fig.10 shows the effect of the size of the sliding time window on the correlation time for different numbers of concurrent clients. Our analysis in Section 4.4 shows that, when the number of requests in the fixed duration is unchanged, the time complexity of the algorithm is linear in Δn for RUBiS. Δn is the size of the activity sequence per request in the sliding time window, which is determined by the size of the sliding time window.

Fig.11 shows the effect of the size of the sliding time window on the memory consumed by the Correlator for different numbers of concurrent clients. When the size of the sliding time window increases from 10 seconds to 100 seconds, the number of logged activities fetched into the buffer of the Correlator increases dramatically and results in a dramatic increase in the consumed memory.

Fig.11. The memory consumed by the Correlator.

5.3.2. Evaluating the overhead. We compare the throughput and average response time of RUBiS for the Browse_Only workload when the instrumentation mechanism of PreciseTracer is enabled and disabled.

Fig.12. The effect on the throughput of RUBiS.

Fig.13. The effect on the average response time.
In Fig.12 and Fig.13, we observe that when the number of concurrent clients is less than 500, PreciseTracer has little effect on the throughput and average response time of RUBiS. As the number of concurrent clients increases, PreciseTracer has a small effect on both of them. According to our statistics, the max overhead in terms of the decrease of throughput is 3.7%, and the max overhead in terms of the increase of average response time is less than 30%.

5.3.3. Evaluating noise tolerance. We test the performance of PreciseTracer in an environment with noise activities produced by rlogin, ssh and the MySQL client. We set the size of the sliding time window to 2 milliseconds. In our algorithm, noise activities produced by rlogin and ssh can be filtered by the program name attribute, but noise activities produced by the MySQL client cannot be filtered by the program name attribute, since the MySQL client shares the same database with RUBiS. Our algorithm can discard these noise activities. Fig.14 measures the effects of noise activities on the correlation time when about 200K noise activities related to the MySQL database are produced in the fixed duration for the Browse_Only workload. We observe that our algorithm is effective in tolerating noise.

Fig.14. The overhead of noise tolerance.

5.4. Identifying performance bottlenecks

5.4.1. Misconfiguration shooting. For the experiments in Section 5.3.2, we observe that when the number of concurrent clients increases from 700 to 800, the throughput of RUBiS decreases, while the average response time increases.
An interesting question is: what is wrong with RUBiS? Generally, we would observe the resource utilization rate of each tier and the metrics of quality of service to pinpoint the bottlenecks. Using the monitoring tool of RUBiS, we notice that the CPU usage of each node is less than 80% and the I/O usage rate is not high. Obviously, the traditional method does not help. To answer this question, we use our tool to analyze the most frequent request, ViewItem, for RUBiS, compute the average causal path and visualize the view of latency percentages of components. We identify the problem quickly. From Fig.15, we observe that when the number of concurrent clients increases from 500 upwards, the latency percentage of httpd2java (from the first tier to the second tier) changes dramatically: the value is 46%, 80%, 71% and 60% respectively for 500, 600, 700 and 800 concurrent clients. In Fig.15, the latency percentage of httpd2java is 46% for 500 clients, which means that the processing time of the interaction from httpd to Java takes up 46% of the whole time of servicing a request.

Fig.15. The latency percentages of components.

At the same time, the latency percentage of httpd2httpd (first tier) increases dramatically from 17% (700 clients) to 31% (800 clients). We observe that the CPU usage of the JBoss node is less than 60% and the I/O usage rate is not high. When servicing a request, httpd2httpd comes before httpd2java in a causal path. So we can confirm that there is something wrong with the interaction between httpd and JBoss. Through reading the reference manual of RUBiS, we conclude that the problem is most likely related to the configuration of the thread pool in JBoss. According to the manual of JBoss, a parameter named MaxThreads controls the maximum available thread number, and one thread services one connection. The default value of MaxThreads is 40.
We set MaxThreads to 250 and run the experiments again. In Fig.16, we observe that our work is effective. When concurrent clients increase from 500 to 800, the throughput is increased and the average response time is decreased in comparison with those of the default configuration. However, for 900 concurrent clients, the resource limit of the hardware platform results in a new bottleneck. In Fig.16, TP_MT40 is the throughput when MaxThreads is 40, and RT_MT250 is the average response time when MaxThreads is 250.

Fig.16. Performance for different MaxThreads.

5.4.2. Injected performance problems. To further validate the accuracy of locating performance problems using PreciseTracer, we have injected several performance problems into RUBiS components and the host nodes: for abnormal case 1, we modify the RUBiS code to inject a random delay into the second tier; for abnormal case 2, we lock the items table of the RUBiS database to inject a delay into the third tier; for abnormal case 3, we change the configuration of the Ethernet driver from 100 Mbps to 10 Mbps on the node running JBoss.

Fig.17. Latency percentages of components for abnormal cases.

We use PreciseTracer to locate the component in question where the different performance problems are injected. Fig.17 shows the latency percentages of components for the normal case and the three abnormal cases. For abnormal case 1 (EJB_Delay), the latency percentage of java2java (JBoss, second tier) increases from less than 10% for the normal case to more than 40%, and the latency percentages of other components decrease by different amounts.
So we can confirm that JBoss is in question. For abnormal case 2 (DataBase_Lock), the latency percentage of mysqld2mysqld (third tier) increases from 12% for the normal case to more than 20%, and the latency percentage of java2mysqld (the interaction from the second tier to the third tier) increases from 26% for the normal case to more than 35%. The latency percentages of other components stay unchanged or decrease. So we confirm that MySQL is in question. For abnormal case 3 (EJB_Network), the latency percentage of java2mysqld (from the second tier to the third tier) increases from 26% for the normal case to 47%, while mysqld2java (from the third tier to the second tier) stays at about 37%. The latency percentage of httpd2java (from the first tier to the second tier) increases from 1% to 2%, and that of java2httpd (from the second tier to the first tier) increases from 4% to 8%. We observe that most of the time of servicing a request is spent on the interactions between the second and third tiers, and that three of the four latency percentages of interactions involving the second tier increase. We confirm that the second tier is in question. Further observation shows that the latency percentage of java2java strangely decreases from 9% to almost 0%, so we confirm that something is wrong with the network of the second tier.

6. Related work

6.1. Black box model

A much earlier project, DPM [11], instruments the operating system kernel and tracks the causality between pairs of messages to trace unmodified applications. DPM is not precise, since the existence of a path in the resulting graph does not necessarily mean that any real causal path followed all of those edges in that sequence [3]. Project5 [3] and WAP5 [2] accept the imprecision of probabilistic correlation methods. Project5 [3] uses time-series analysis to infer causal paths in a distributed system of black boxes from the relative timestamps of network traffic.
More recently, WAP5 [2] infers causal paths for wide-area systems from traces of streams, collected at per-process granularity using library interposition. E2EProf [6] proposes a pathmap algorithm, similar to the convolution algorithm of Project5, but uses compact trace representations and a series of optimizations that make it suitable for online performance diagnosis. Using knowledge of protocols, BorderPatrol [4] isolates and schedules events or requests at the protocol level to precisely trace requests. BorderPatrol requires writing many protocol processors and demands more specialized knowledge than a pure black-box approach.

6.2. Non-black box model

The most invasive systems, such as NetLogger [12] and ETE [13], require programmers to add event logging at carefully chosen points to find causal paths rather than infer them from passive traces. Pip [15] inserts annotations into the source code to log actual system behavior, but can extract causal path information with no false positives or false negatives. Magpie [1] collects events at different points in a system and uses an application-specific event schema to correlate these events into causal paths. Stardust [14], a system used as an online monitoring tool in a distributed storage system, is implemented in a similar way. Whodunit [16] annotates profile data with a transaction context synopsis and provides finer-grained knowledge of transactions within each box.
To avoid modifying applications' source code, some work enforces middleware or infrastructure changes. Pinpoint [17] locates component faults in the J2EE platform by tagging and propagating a globally unique request ID with each request. Causeway [18] enforces changes to network protocols so as to tag metadata onto existing module communication. X-Trace [19] modifies each network layer to carry X-Trace metadata that enables causal path reconstruction, and focuses on debugging paths through many network layers.

7. Conclusion

We have developed the PreciseTracer tool to help users understand and debug performance problems of a multi-tier service of black boxes. Our contributions are two-fold: (1) we have designed a precise tracing algorithm that derives causal paths for each individual request, using interaction activities of components of black boxes. Our algorithm only uses application-independent knowledge, such as timestamps and end-to-end communication channels, which are available from the operating system; (2) we have presented a component activity graph (CAG) abstraction to represent causal paths of requests and facilitate end-to-end performance debugging of a multi-tier service. Our experiments show that one can successfully pinpoint performance problems from observed changes of latency percentages of components, calculated from CAGs. In the near future, we will propose a mathematical foundation for automatic performance debugging.

8. Acknowledgements

We are very grateful to the anonymous reviewers, the DSN shepherd and Prof. Weisong Shi for their helpful suggestions. This work is supported by the NSFC (Grant No. 60703020) and the 863 programs (Grant No. 2009AA01Z139 and Grant No. 2006AA01A102).

9. References

[1] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier, Using Magpie for Request Extraction and Workload Modelling, in Proc. 6th OSDI, 2004, pp. 18-18.
[2] P. Reynolds, J.
L. Wiener, J. C. Mogul, M. K. Aguilera and A. Vahdat, WAP5: Black-box Performance Debugging for Wide-area Systems, in Proc. 15th WWW, 2006, pp. 347-356.
[3] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen, Performance Debugging for Distributed Systems of Black Boxes, in Proc. 19th SOSP, 2003, pp. 74-89.
[4] E. Koskinen and J. Jannotti, BorderPatrol: Isolating Events for Black-box Tracing, SIGOPS Oper. Syst. Rev., 42(4), 2008, pp. 191-203.
[5] W. Richard Stevens, UNIX Network Programming, Volume 1: Networking APIs: Sockets and XTI, Prentice Hall, 1998.
[6] S. Agarwala, F. Alegre, K. Schwan, and J. Mehalingham, E2EProf: Automated End-to-End Performance Management for Enterprise Systems, in Proc. 37th DSN, 2007, pp. 749-758.
[7] L. Lamport, Time, Clocks, and the Ordering of Events in a Distributed System, Comms. ACM, 21(7), 1978, pp. 558-565.
[8] SystemTap, http://sourceware.org/systemtap/.
[9] KProbes, http://sourceware.org/systemtap/kprobes/.
[10] RUBiS, http://rubis.objectweb.org.
[11] B. P. Miller, DPM: A Measurement System for Distributed Programs, IEEE Trans. on Computers, 37(2), 1988, pp. 243-248.
[12] B. Tierney, W. Johnston, B. Crowley, G. Hoo, C. Brooks and D. Gunter, The NetLogger Methodology for High Performance Distributed Systems Performance Analysis, in Proc. 7th HPDC, 1998, pp. 260-267.
[13] J. L. Hellerstein, M. M. Maccabee, W. N. Mills III, and J. J. Turek, ETE: A Customizable Approach to Measuring End-to-End Response Times and Their Components in Distributed Systems, in Proc. 19th ICDCS, 1999, pp. 152-162.
[14] E. Thereska, B. Salmon, J. Strunk, M. Wachs, M. Abd-El-Malek, J. Lopez, and G. R. Ganger, Stardust: Tracking Activity in a Distributed Storage System, in Proc. SIGMETRICS, 2006, pp. 3-14.
[15] P. Reynolds, C. Killian, J. L. Wiener, J. C. Mogul, M. A. Shah and A. Vahdat, Pip: Detecting the Unexpected in Distributed Systems, in Proc. 3rd NSDI, 2006, pp. 115-128.
[16] A. Chanda, A. L. Cox and W.
Zwaenepoel, Whodunit: Transactional Profiling for Multi-tier Applications, SIGOPS Oper. Syst. Rev., 41(3), 2007, pp. 17-30.
[17] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer, Pinpoint: Problem Determination in Large, Dynamic Internet Services, in Proc. 32nd DSN, 2002, pp. 595-604.
[18] A. Chanda, K. Elmeleegy, A. L. Cox, and W. Zwaenepoel, Causeway: Operating System Support for Controlling and Analyzing the Execution of Distributed Programs, in Proc. 10th HotOS, 2005, pp. 18-18.
[19] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica, X-Trace: A Pervasive Network Tracing Framework, in Proc. 4th NSDI, 2007, pp. 271-284.