A General Theory of Computational Scalability Based on Rational Functions

A General Theory of Computational Scalabilit y Based on Rational F unctions Neil J. Gun ther ∗ No vem ber 26, 2024 Abstract The universal scalability law of computational capacit y is a rational function C p = P ( p ) /Q ( p ) with P ( p ) a linear polynomial and Q ( p ) a second-degree p olynomial in the n umber of physical pro cessors p , that has b een long used for statistical mo deling and prediction of computer system p erformance. W e prov e that C p is equiv alen t to the synchronous throughput b ound for a mac hine- repairman with state-dependent service rate. Simpler rational functions, suc h as Amdahl’s la w and Gustafson sp eedup, are corollaries of this queue-theoretic b ound. C p is further sho wn to b e b oth necessary and suﬃcien t for mo deling all practical c haracteristics of computational scalability . 1 In tro duction F or sev eral decades, a class of real functions called r ational functions [1], has b een used to represen t throughput scalabilit y as a function of physical pro cessor conﬁguration. In particular, Amdahl’s la w [2], its mo diﬁcation due to Gustafson [3] and the Universal Scalability La w (USL) [4] hav e found ubiquitous application. In this context, the relative computing capacit y , C p , is a rational function of the num b er of ph ysical pro cessors p . It is deﬁned as the quotient of a p olynomial P ( p ) in the n umerator and Q ( p ) in the denominator, i.e., C p = P ( p ) /Q ( p ). Eac h of the ab o ve-men tioned scalabilit y models is distinguished b y the num b er of coeﬃcients or ﬁtting parameters asso ciated with the p olynomials in P ( p ) and Q ( p ). F or example, Amdahl’s law and Gustafson’s mo diﬁcation are single parameter mo dels, whereas the USL model contains tw o parameters. Despite their historical utilit y , these mo dels hav e stoo d in isolation without an y deeper ph ysical in terpretation. It has ev en b een suggested that Amdahl’s law is not fundamen tal [5]. More imp ortan tly , the lac k of a uniﬁed physical interpretation has led to the use of certain ﬂaw ed scalabilit y mo dels [6]. In this note, we demonstrate that the aforemen tioned class of rational functions corresponds to certain p erformance bounds b elonging to a queue-theoretic mo del. The idea that Amdahl’s la w, which has most frequen tly been associated with the scalabilit y of massively parallel systems, can b e considered from a queue-theoretic standp oin t, is not entirely new [See e.g., 7, 8]. Ho wev er, quite apart from motiv ations entirely diﬀerent from our own, those previous works emplo yed op en queueing mo dels with an unbounded num b er of requests (See App endix C), whereas we shall use a close d queueing model with a ﬁnite num b er of requests p corresp onding to the n umber of physical pro cessors. The USL function is asso ciated with a state-dep enden t generalization of the machine repairman [9]. The organization of this pap er is as follows. W e brieﬂy review the scalability mo dels of interest in Sect. 2. The appropriate queueing metrics asso ciated with the standard mac hine repairman and its state-dep enden t extension are discussed in Sect. 3. The p erformance characteristics asso ciated with synchr onous queueing are also presen ted there. The main theorem (Theorem 2) is established in Sect. 4. Amdahl’s la w and Gustafson’s linear speedup are sho wn to b e corollaries of this theorem. Finally , in Sect. 5 w e pro ve an earlier conjecture that a rational function with Q ( p ) a second-degree p olynomial is b oth necessary and suﬃcien t to mo del all practical cases of computational scalabilit y . ∗ Performance Dynamics Company , 4061 East Castro V alley Blvd., Suite 110, Castro V alley , CA 94552, USA. Email: nj gunther @ perfdynamics . com 1 2 P arametric Mo dels Although technically , we are discussing rational functions, we shall hereafter refer to them as parametric mo dels, and the co eﬃcients as parameters, since the primary application of these mo dels is nonlinear statistical regression of performance data [See e.g., 4, 10, 11, 12, and references therein]. 0 20 40 60 80 100 p 5 10 15 20 C p Figure 1: P arametric mo dels: USL (red), Amdahl (green), Gustafson (blue), with parameter v alues exaggerated to distinguish their typical characteristic relativ e to ideal linear scaling (dashed). The horizon tal line is the Amdahl asymptote at σ − 1 . Deﬁnition 1 (Sp eedup) . If an amount of w ork N is completed in time T 1 on a unipro cessor, the same amount of work can be completed in time T p < T 1 on a p -wa y m ultipro cessor. The sp eedup S p = T 1 /T p is one measure of scalability . 2.1 Amdahl’s law F or a single task that takes time T 1 to execute on a unipro cessor ( p = 1), Amdahl’s la w [2] states that if the task can b e e quip artitione d onto p processors, but contains an irreducible fraction of sequential w ork σ ∈ [0 , 1], then only the remaining portion of the execution time (1 − σ ) T 1 can b e executed as p parallel subtasks on p physical processors. The b ound on the achiev able equipartitioned speedup [13] is given by the ratio S p ( σ ) = T 1 σ T 1 + „ 1 − σ p « T 1 (1) whic h simpliﬁes to S p ( σ ) = p 1 + σ ( p − 1) ; (2) a rational function with P ( p ) = p and Q ( p ) a ﬁrst-degree p olynomial. As the pro cessor conﬁgu- ration is increased, i.e., p → ∞ , the num ber of concomitan tly smaller subtasks also increases and the speedup (1) approaches an asymptote, S p ( σ ) ∼ σ − 1 , (3) sho wn as the horizontal in Fig. 1. 2 2.2 Gustafson’s sp eedup Amdahl’s law assumes the size of the work is ﬁxed. Gustafson’s mo diﬁcation is based on the idea of scaling up the size of the work to match p . This assumption results in the theoretical recov ery of linear speedup S G p ( σ ) = σ + (1 − σ ) p (4) Equation (4) is a rational function with Q ( p ) = 1 and P ( p ) a ﬁrst-degree p olynomial in p . Although (4) has inspired v arious eﬀorts for impro ving parallel pro cessing eﬃciencies, achieving truly linear sp eedup has turned out to be diﬃcult in practice. Most recently , (4) has b een prop osed as a w ay to optimize the throughput of m ulticore pro cessors [14]. Deﬁnition 2 (Relativ e Capacity) . As an alternative to the sp eedup, scalabilit y can also b e deﬁned as the relative capacity , C p = X ( p ) /X (1), where X ( p ) is the throughput with p pro cessors, and X (1) the throughput of a single pro cessor. 2.3 Univ ersal Scalabilit y La w (USL) The USL mo del [4, 10, 11] is a rational function with P ( p ) = p and Q ( p ) a second-degree p olyno- mial: C p ( σ, κ ) = p 1 + σ ( p − 1) + κ p ( p − 1) (5) where the coeﬃcients b elonging to the terms in the denominator hav e b een regroup ed into three terms with t wo parameters ( σ, κ ). These terms can b e interpreted as representing: 1. Ideal concurrency associated with linear scalability ( σ, κ = 0) 2. Conten tion-limited scalabilit y due to serialization or queueing ( σ > 0 , κ = 0) 3. Coherency-limited scalability due to inconsisten t copies of data ( σ, κ > 0) T able 1 summarized ho w these parameter v alues can b e used to classify the scalability of diﬀerent t yp es of applications. Equation (5) subsumes (2) and (4). In particular, (2) is identical to (5) with κ = 0. The key distinction is that, unlike (2), (5) p ossesses a maxim um at p ∗ = r 1 − σ κ (6) whic h is controlled by the parameter v alues according to: (a) p ∗ → 0 as κ → ∞ (b) p ∗ → ∞ as κ → 0 (c) p ∗ → κ − 1 / 2 as σ → 0 (d) p ∗ → 0 as σ → 1 The imp ortant implication is that b ey ond p ∗ the throughput b ecomes r etr o gr ade . See Fig. 1. This eﬀect is commonly observed in applications that inv olve shared-writable data (Case D in T able 1). In the subsequent sections, we attempt to provide deep er insight into the physical signiﬁcance of (5) b y recognizing its asso ciation with queueing theory . 3 Queueing Mo dels The machine r ep airman (Fig. 2) is a w ell-known queueing mo del [15] whic h represents an assembly line comprising a ﬁnite num b er of mac hines p which break down after a mean lifetime Z . A repair- man takes a mean time S to repair a broken machine and if multiple machines fail, the additional mac hines must queue for service in FIFO order. The queue-theoretic notation, M / M / 1 //p , implies exp onen tially distributed lifetimes and service p erio ds with a ﬁnite p opulation p of requests and buﬀering. 3 T able 1: Application domains for the USL mo del A: Ideal concurrency ( σ, κ = 0) B: Con ten tion-limited ( σ > 0 , κ = 0) Single-threaded tasks T asks requiring lo cking or sequencing P arallel text searc h Message-passing proto cols Read-only queries P olling proto cols (e.g., hypervisors) C: Coherency-limited ( σ = 0 , κ > 0) D: W orst case ( σ, κ > 0) SMP cache pinging T asks acting on shared-writable data Incoheren t application state b etw een Online reserv ation systems cluster no des Up dating database records p, Z S R(p) X(p) Figure 2: Conv entional machine repairman queueing mo del comprising p mac hines with mean uptime Z ( top ) and a repair queue ( b ottom ) with mean service time S . 3.1 Queueing Metrics The p erformance characteristics of interest for the subsequent discussion are the throughput X ( p ) and residence time R ( p ). Deﬁnition 3 (Throughput) . The throughput, X = N /T , is the num b er of tasks N completed in time T . Deﬁnition 4 (Residence Time) . The mean residence time R = W + S is the sum of the time spent w aiting for service W plus the actual repair time S when the repairman services the mac hine. On a verage, the n umber of machines that are “up” is Z X , while Q are “down” (for repairs), suc h that the total num b er of machines in either state is giv en by p = Q + Z X . Rearranging this expression produces: Q = p − Z X and applying Little’s law [15] ( Q = X R ) [15] X R = p − Z X giv es R ( p ) = p X ( p ) − Z (7) for the mean residence time at the repair station. Rearranging (7) provides an expression for the mean throughput of the repairman as a function of p : X ( p ) = p R ( p ) + Z (8) Deﬁnition 5 (Mean R TT) . The denominator in (8) , R ( p ) + Z , is the mean r ound-trip time (R TT) for M / M / 1 //p . 4 T able 2: Interpretation of the queueing metrics in Fig. 2 In terpretation Metric Repairman Multipro cessor Time share p mac hines pro cessors users Z up time execution p erio d think time S service time transmission time CPU time R(p) residence time interconnect latency run-queue time X(p) failure rate bandwidth throughput Since M / M / 1 //p is an abstraction, it can b e applied to diﬀerent computational con texts. In the subsequent sections, the p machines will b e taken to represent ph ysical pro cessors and the time spent at the repair station is taken to represen t the interconnect latency b etw een the pro cessors [16, 17]. See T able 2. This c hoice is merely to conform to the conv entions most commonly used in discussions of parallel scalability [11], but the generic nature of queueing mo del means that an y conclusions also hold for soft ware scalability [See e.g., 10, Chap. 6]. 3.2 Sync hronous Queueing W e consider the sp ecial case of synchr onous queueing in M / M / 1 //p . The queue-theoretic p erfor- mance metrics deﬁned in Sect. 3.1 are for the steady-state case and therefore each corresp onds to the statistical mean of the respective random v ariable. Moreov er, as already mentioned , we cannot giv e an explicit expression for R ( p ) since its v alue is dep enden t on the v alue of X ( p ), which is also unkno wn in steady state. Deﬁnition 6 (Synchronized Requests) . If all the machines in M / M / 1 //p break down simultane- ously , the queue length at the repairman is maximized such that the residence time in deﬁnition 4 b ecomes R ( p ) = pS . This situation corresponds to one machine in service and ( p − 1) waiting for service and pro vides a low er bound on (8) [18, 19, 20]: p pS + Z ≤ X ( p ) (9) In the context of multiprocessor scalabilit y (T able 2), it is tan tamount to all p pro cessors simulta- neously issuing requests across the interconnect. R emark 1 (A Parado x) . Consider the case where all p processors hav e the same deterministic Z p eriod. A t the end of the ﬁrst Z perio d, all p requests will enqueue at the interconnect (low er p ortion of Fig. 2) sim ultaneously . By deﬁnition, how ever, the requests are serviced serially , so they will return to the parallel execution phase (top portion of Fig. 2) separately and thereafter will alw ays return to the interconnect at diﬀerent times. In other words, ev en if the queueing system starts with sync hronized visits to the interconnect, that sync hronization is immediately lost after the ﬁrst tour b ecause it is destroy ed by the serial queueing pro cess. The resolution of this paradox is discussed in App endix B. Deﬁnition 7 (Synchronous R TT) . In the presence of synchronous queueing, the mean R TT of deﬁnition 5 becomes pS + Z . Lemma 1. The relative c ap acity C p and the sp e e dup S p give identic al values for the same pr o c essor c onﬁgur ation p . Pr o of. Let the unipro cessor throughput is deﬁned as X 1 = N/T 1 and the multiprocessor through- put is X p = N/T p . Hence C p = X ( p ) X (1) = N T p T 1 N ≡ S p follo ws from deﬁnition 1. 5 Lemma 2 (Serial F raction) . The serial fr action (2.1) c an b e expr esse d in terms of M / M / 1 //p metrics by the identity [6, 10, 11]: σ = S S + Z → ( 0 as S → 0 , Z = c onst. , 1 as Z → 0 , S = c onst . (10) If σ = 0 , then ther e is no communic ation b etwe en pr o c essors and the inter c onne ct latency ther e- for e vanishes (maximal exe cution time). Conversely, if σ = 1 , then the exe cution time vanishes (maximal c ommunic ation latency). Pr o of. The R TT for a single (unpartitioned) task in Fig. 2 is T 1 = S + Z . The R TT for a p equipartitioned subtasks is T p = p ( S/p ) + Z /p . F rom deﬁnition 1, the corresp onding sp eedup is S p = S + Z S + Z/p (11) Equating (11) with (1), we ﬁnd S = σ T 1 and Z = (1 − σ ) T 1 . (12) Eliminating T 1 pro duces S = „ σ 1 − σ « Z (13) whic h, up on solving for σ , pro duces (10). See Appendix B for another p erspective. Deﬁnition 8. The quantit y Z S = 1 − σ σ (14) is the servic e r atio for the M / M / 1 //p model. Theorem 1 (Speedup Dualit y) . Let ( σ, π ) b e a continuous dual-p ar ameter p air with σ is the serial fr action (10) and π = 1 − σ . The A mdahl sp e e dup (2) is invariant under sc alings of ( σ, π ) by p . Pr o of. Using deﬁnition 8, theorem 1 can b e represen ted diagrammatically as: π /σ = Z/S . & Z 7→ Z /p Z 7→ Z S 7→ S S 7→ pS & . S p ( σ ) (15) The path on the left hand side of (15) corresp onds to reducing the single task execution time by p (subtasks) while the interconnect service time remains unchanged. This follows from deﬁnition 6, R = pS , but the service time for each subtask is also reduced to S/p . Hence, R = S . Conv ersely , the right hand path of (15) corresp onds to p tasks, eac h with unc hanged execution time Z , but scaled service time R = pS . Both paths result in Amdahl’s la w (2), whic h can b e seen b y ﬁrst rewriting (11) in terms of the service ratio Z/S (deﬁnition 8): S p = 1 + Z /S 1 + 1 p „ Z S « . (16) (a) Interpreting the denominator in (16) as b elonging to the left hand path of (15) leads to the expansion S p = S + Z S + Z p + S − S p = ( S + Z ) /S 1 p „ S + Z S « + p − 1 p 6 Collecting terms and simplifying pro duces: S p = p 1 + „ S S + Z « ( p − 1) (17) whic h is identical to (2) up on substituting (10). (b) F ollowing the right hand path in (15) leads to the expansion S p = pS (1 + Z/S ) pS + Z = p p „ S S + Z « + Z S + Z + S S + Z − S S + Z Collecting terms in the denominator pro duces: S p = p p „ S S + Z « + 1 − S S + Z (18) whic h also yields (2) via (10). R emark 2 . Theorem 1 anticipates the interpretation of Gustafson’s la w as a consequence of scaling the w ork size Z 7→ pZ in M / M / 1 //p . See corollary 2. 3.3 State-Dep enden t Service W e no w consider a generalization of this machine repairman model in which the residence time R ( p ) includes an additional time that is prop ortional to the load on the server, expressed as the n umber of enqueued requests. Since the queue-length is a canonical measure of the state of the system, the repairman b ecomes a state-dep endent serv er [9, 15], denoted M /G/ 1 //p . Let the additional service time b e S 0 in the state-dependent progression: p = 1 : R (1) = 1 S p = 2 : R (2) = 2 ( S + S 0 ) p = 3 : R (3) = 3 ( S + 2 S 0 ) p = 4 : R (4) = 4 ( S + 3 S 0 ) (19) · · · · · · p = p : R ( p ) = p ( S + ( p − 1) S 0 ) The extra time sp en t by each machine at the repair station increases linearly with the additional n umber “down” mac hines. There is no stretching of the mean service time, S , when repairing a single mac hine. R emark 3 . In general, it is exp ected that S 0 < S . It could, how ever, b e a m ultiple of S , but that is clearly undesirable. Some example applications of the state-dep enden t M /G/ 1 //p model in a computational con text include: (a) Pairwise Exchange: Mo deling the performance degradation due to combinatoric pairwise exc hange of data b et ween p multiprocessor caches or cluster no des. See Sect. 5. (b) Broadcast Proto col: If any pro cessor broadcasts a request for data, the other ( p − 1) processors m ust stop and resp ond b efore computation can con tinue [11]. (c) Virtual Memory: Each task is a program with its own working set of memory pages. Page replacemen t relies on a higher latency device, such as a disk. As the num b er of programs p increases, page replacemen t latency causes the system to “thrash” such that the throughput to become retrograde [9]. Cf. Fig. 2. 7 4 P arametric Mo dels as Queueing Bounds In this section, we show that the parametric scalability mo dels in Section 2 correspond to certain throughput bounds on the queueing mo dels in Section 3. Theorem 2 (Main Result) . The universal sc alability law (5) is e quivalent to synchronous r elative thr oughput in M /G/ 1 //p . Pr o of. Let S 0 = cS in (19), with c a p ositiv e constan t of prop ortionalit y . The residence time for state-dep enden t, synchronous-requests b ecomes R ( p ) = pS + c p ( p − 1) S (20) Substituting (20) in to deﬁnition 2: C p = p ( S + Z ) pS + c p ( p − 1) S + Z (21) = p ( S + Z ) ( p − 1) S + ( S + Z ) + c p ( p − 1) S = p ( S + Z ) ( S + Z ) [1 + ( p − 1) S ( S + Z ) − 1 + c p ( p − 1) S ( S + Z ) − 1 ] Collecting terms and simplifying pro duces: C p = p 1 + σ ( p − 1) + c σp ( p − 1) , (22) where we ha ve applied the identit y for the serial fraction in lemma 2. Combining the co eﬃcien ts of the third term in the denominator of (22) as κ = c σ , yields (5). R emark 4 . Since c is an arbitrary constant, c > 0 implies that the parameter κ = c σ in (5) can b e unbounded, whereas σ < 1 alw ays. R emark 5 . The state-dep endence of R ( p ) in (20) does not change lemma 2 since σ is determined b y S and Z only and b oth of those queueing metrics are constants. Corollary 1 (Amdahl’s la w) . Amdahl’s law (2) is the synchr onous b ound on r elative thr oughput in M / M / 1 //p . Pr o of. F ollows immediately from the pro of of theorem 2 with c = 0 in (21). R emark 6 . Elsewhere [6, 10, 11], corollary 1 was prov en as a separate theorem. Corollary 2 (Gustafson’s law) . Gustafson ’s law (4) c orr esp onds to the r esc aling Z 7→ pZ in M / M / 1 //p . Pr o of. Since Gustafson’s result is a modiﬁcation of Amdahl’s la w, we start with (21) and let c = 0. Under Z 7→ pZ , the scalability function b ecomes: C p = p ( S + pZ ) pS + pZ = S + pZ S + Z = S + p ( Z + S − S ) S + Z Once again, after application of lemma 2, this simpliﬁes to C p = σ + p − σp (23) whic h is identical to the linear sp eedup S G p in (4). R emark 7 . Rewriting (23) as C 0 p = p + σ/ (1 − σ ), w e note that the additiv e constant σ / (1 − σ ) = S/ Z is the inv erse of the service ratio in deﬁnition 8. In the context of M / M / 1 //p , rescaling the execution time, Z 7→ pZ , prior to partitioning, adds a ﬁxe d o verhead ( pS/pZ ) to an otherwise linear function of p , whereas the ov erhead in Amdahl’s la w is an incr e asing function of p . The results of this section hav e also b een conﬁrmed using even t-based sim ulations [12]. 8 5 Extrema and Univ ersalit y Ideal linear scalabilit y ( C p ∼ p, for p > 0) has a p ositiv e and constant deriv ative. More realistically , large pro cessor conﬁgurations ( p → ∞ ) are exp ected to approac h saturation, i.e., the asymptote C sat ∼ σ − 1 . An y physical computational system that develops a scalability maximum at p ∗ in the positive quadran t ( p > 0), means that C p m ust hav e a negative deriv ative for p > p ∗ and therefore relative computational capacit y falls below the saturation v alue or C p < C sat , i.e., throughput p erformance b ecomes r etr o gr ade . Since this behavior is undesirable, there is little virtue in characterizing the maxim um beyond the abilit y to quantify its location ( p ∗ ) using a giv en scalabilit y mo del. This observ ation leads to the following conjecture [See also 10, p. 65], which we no w prov e. Conjecture 1 (Universalit y) . F or a r ational function C p = P ( p ) /Q ( p ) with P ( p ) = p , a nec essary and suﬃcient c ondition for C p to b e a mo del of c omputational sc alability is, Q ( p ) = 1 + a 1 p + a 2 p 2 with c o eﬃcients a 1 , a 2 > 0 . The simplest line of proof comes from considering latency rather than throughput. See Sec- tion D for further discussion. Pr o of. Ideal latency reduction, T p = T 1 /p , is a hyperb olic function and therefore has no extrema. The additional latency due to pairwise interprocessor comm unication introduces the combinatoric term, ` p 2 ´ = p ( p − 1) / 2, such that the total latency b ecomes T p ( κ ) = T 1 p + κ T 1 2 ( p − 1) (24) with constant κ > 0. Equation(24) has a unique minimum for p > 0 (Fig. 3). Substituting (24) in to the sp eedup deﬁnition 1, pro duces S p ( κ ) = p 1 + κp ( p − 1) (25) where we hav e absorbed the factor of 2 in κ . S p ( κ ) now p ossesses a unique maxim um in the p ositiv e quadrant ( p > 0). Thus, the quadratic term in the denominator of (25) is necessary for the existence of a maximum but it is not suﬃcien t b ecause S p ( κ ) does not exhibit the Amdahl asymptote σ − 1 when κ = 0. How ever, the tw o-parameter latency T p ( σ, κ ) = T 1 p + σT 1 „ p − 1 p « + κ T 1 2 ( p − 1) (26) do es introduce the required term into (25) and, b y lemma 1, is identical to (5). The second term in (26) can b e in terpreted as the ﬁxed time it tak es any one processor to broadcast a request for data and wait for the remaining fraction of processors, ( p − 1) /p , to resp ond simultaneously . This is also another w ay to view synchronous queueing in M / M / 1 //p . It simply in tro duces a lo wer b ound, σ T 1 , on the latency reduction. The third term in (26) is analogous to Bro ok’s la w [21]: “Adding more manpow er to a late soft ware pro ject mak es it later,” with p in terpreted as p eople rather than pro cessors. Here, for example, it can b e interpreted as the latency due to the pairwise exchange of data to maintain cac he coherency in a multiprocessor. 6 Conclusion Sev eral ubiquitous scalability models, viz., Amdahl’s la w, Gustafson’s law and the universal scal- abilit y law (USL), b elong to a class of rational functions. T reated as parametric mo dels, they are neither ad ho c nor unphysical. Rather, they correspond to certain b ounds on the relative through- put of the mac hine repairman queueing mo del. In the most general case, the main theorem 2 states that the USL mo del corresp onds to the synchronous throughput b ound of a load-dep enden t mac hine repairman. USL subsumes b oth Amdahl’s law and Gustafson’s law as corollaries of theo- rem 2. As well as pro viding a more physical basis for these scalabilit y mo dels, the queue-theoretic in terpretation has practical signiﬁcance in that it facilitates prediction of resp onse time scalability using (7) and provides deeper insight into p otential p erformance tuning opp ortunities. 9 0.5 1.0 1.5 2.0 2.5 3.0 p 1 2 3 4 5 6 T ! p " Figure 3: A minimum occurs in the total latency T ( p ) due to an increasing pairwise-exc hange time b eing added to the initial latency reduction. App endices A Wh y Univ ersal? The term “universal” is intended to conv ey the notion that the USL (deﬁned in Sect. 2.3) can b e applied to any computer arc hitecture; from m ulti-core to m ulti-tier. This follows from the fact that there is nothing in (5) that explicitly represents an y particular system architecture or interconnect top ology . That information is presen t but it is encoded in the n umeric v alue of the parameters σ and κ . The same could b e said for Amdahl’s la w but the diﬀerence is that, being a rational function with linear Q ( p ), Amdahl’s la w cannot predict the retrograde scalabilit y commonly observed in p erformance ev aluation measuremen ts [4, 10]. As prov en in Sect. 5, the USL is b oth necessary and suﬃcien t to mo del all these eﬀects. The USL do es not exclude deﬁning a more general or more complex scalabilit y mo del to accoun t for such details as, heterogeneous pro cessors or the functional form of degradation beyond p ∗ , but an y suc h model m ust include the USL as a limiting case. The best analogy migh t be to regard the USL as b eing akin to Newton’s universal law of gr avitation . Here, “universal” means generally applicable to an y gra vitating b odies. Newton’s theory has b een sup erseded by a more sophisticated theory of gravitation; Einstein’s general theory of relativity . Einstein’s theory , how ever, do es not negate Newton’s theory but rather, contains it as a limiting case, when space-time is ﬂat. Since space-time is ﬂat in all practical applications, NASA uses Newton’s equations to calculate the ﬂigh t paths of all its missions. B Sync hronous Queueing The pro ofs of theorem 2 (USL) and corollary 1 (Amdahl) employ mean v alue equations for metrics whic h characterize steady state conditions. As noted in remark 1, synchronized queueing cannot b e maintained in steady state. Sync hronization and steady state are not compatible concepts b ecause the former is an instantane ous eﬀect, whereas the analytic solutions we seek are only v alid in long-run equilibrium. Elsewhere [10, 12], we hav e shown that a necessary requirement for main taining synchronous queueing is to introduce another buﬀer in addition to the waiting line ( W 1 ) at the repairman. If the extra buﬀer represents a p ost -repair collection p oin t, suc h that eac h repaired machine (com- pleted request) is held “oﬀ-line” until al l p machines are repaired then, sync hronous queueing is main tained pro vided the Z p erio ds are i.i.d. deterministic. The extra buﬀer acts as a b arrier syn- 10 chr onizer . Unfortunately , this is a D / M / 1 //p queue, whereas the pro of of corollary 1 is based on M / M / 1 //p . Moreov er, the repairman performance metrics are robust [22], so our results should also hold for G/ M / 1 //p . In more recen t simulation experiments [12], we hav e shown that this restriction on Z p eriods can b e lifted b y p ositioning the buﬀer as a pr e -repair waiting room. Instead of requiring all p mac hines to break do wn and enqueue simultaneously , we allow an y n umber, less than p , to fail but with the added constraint that when any mac hine inv okes service at the repairman, all other machines (or executing pro cessors) must susp end operations as well, i.e., visit the susp ension buﬀer. Under these conditions, Z can b e G -distributed. Because this intermittent synchr onization o ccurs with m uch higher frequency and for muc h shorter a verage time perio ds than barrier sync hronization, the potential impact of the G -distributed tails on the Z perio ds is truncated. Sync hronization can b e treated as a tw o-state Marko v pro cess, e.g., A : p ar al lel and B : serial , where the B state includes those pro cesses that are susp ended as well as waiting for service. If λ A is the transition rate for A → B and λ B for B → A , then the probability of b eing in state B is giv en by Pr( B ) = λ B λ A + λ B (27) In the previous scenario it only tak es a single machine to fail to suspended all other machines. The failure rate is therefore λ A = 1 / Z and the service rate is λ B = 1 /S . Substituting these into (27) pro duces and expression identical to the serial fraction σ in (10). In state B , some fraction p 1 are enqueued and the remainder p 2 = p − p 1 are susp ended. On av erage, any machine can exp ect to sp end time R = ( p 1 + p 2 ) S to complete repairs. Hence, the total serial time is R = pS , which is the quan tity that app ears in the pro ofs. C Queueing Mo dels of Amdahl’s La w Others hav e also considered Amdahl’s la w from a queue-theoretic standp oin t [See, e.g., 7, 8]. Of these, [8] is closest to our discussion, so we brieﬂy summarize the diﬀerences. First, the motiv ations are quite diﬀerent. The author of [8], lik e man y other authors, seeks clev er wa ys to defe at Amdahl’s law; in the sense of Gustafson (Sect. 2.2), whereas we are trying to understand Amdahl’s la w by providing it with a more fundamental physical interpretation. Ironically , both in vestigations inv oke queue-theoretic mo dels to gain more insight in to the pertinent issues; an op en ( M/ M /m ) queue in [8], a close d queue (Fig. 2) in this paper. Second, t wo steps are undertaken to deﬁne an alternative sp eedup function: 1. An attempt to unify b oth the Amdahl and Gustafson equations in to a single speedup function. 2. Extend that uniﬁed speedup function to include waiting times. The ov erarching goal is to ﬁnd waiting-time optima for this uniﬁed sp eedup function. The uniﬁca- tion step is achiev ed by purely algebraic manipulations and do es not rest on any queue-theoretic argumen ts. The op en queueing mo del provides an ad ho c means for incorp orating waiting times as a function of queue length. The subsequent analysis is based entirely on simulation results and thereafter departs signiﬁcan tly from the analytic approac h of this pap er. By virtue of our approach, we hav e shown that b oth Amdahl and Gustafson scaling laws are uniﬁed b y the same queueing mo del, viz., the mac hine-repairman model. Moreov er, corollary 1 is a lo wer b ound on throughput; sync hronous throughput, and therefore represents worst-case scalabilit y . With this ph ysical interpretation, it follo ws immediately that Amdahl’s la w can b e “defeated” more conv eniently than proposed in [8] b y simply requiring that all requests be issued asynchr onously [12]. D Remarks on the Pro of of Conjecture 1 The proof in Section 5, is simpliﬁed by using the additive prop erties of the latency function T p rather than in verting the rational function f ( p ) = R ( p ) /Q ( p ) directly . Since Q ( p ) is quadratic in p , the temptation is to consider the inv erse of f − 1 and use the fact that a general quadratic function has a minimum. 11 T o see why this approach runs into diﬃculties, consider the simpliﬁed represen tation of (5): f ( p ) = p 1 + p + p 2 (28) with all unit co eﬃcients. Equation (28) is not in v ertible in the formal sense b ecause f − 1 is not a one-to-one mapping, even in the p ositiv e quadrant. W e can, ho w ever, consider the full inv erse with its branches, as shown in Fig. 4. The principal branch is shown in light blue with the other branc hes o ccurring at the extrema of f − 1 . This corresp onds to the extrema of (28) o ccurring at p = ± 1 ⇒ f (1) = ± 1 / 3. Hence, f − 1 ∈ C , ∀ p > 1 / 3. ! 2 ! 1 1 2 ! 2 ! 1 1 2 Figure 4: The complete rational function (28) ( r e d ) and its inv erse ( blue ). Alternativ ely , choosing the denominator b e a p erfect square: p 1 + 2 p + p 2 (29) (29) can be expressed either as a pr o duct of line ar factors : p 1 + 2 p + p 2 = „ p 1 + p «„ 1 1 + p « or a p artial fr action exp ansion : p 1 + 2 p + p 2 = 1 1 + p − 1 (1 + p ) 2 Although such decompositions are suggestive of the need for tw o parameters (Fig. 5), they w ould seem to obscure the proof of Theorem 1 rather than illuminate it. Using the latency function T p and then “inv erting” it to pro duce the corresp onding throughput scaling using lemma 1, av oids these problems. E Ac kno wledgmen ts I thank W en Chen for ﬁnding several t yp os in the original manuscript. 12 1 2 3 4 5 x 0.2 0.4 0.6 0.8 1.0 f ! x " Figure 5: Amdahl scaling ( r e d ), en velope function (1 + p ) − 1 ( gr e en ), their conv olution ( solid blue ) and equation (28) ( dashe d blue ). The small diﬀerence in the latter tw o curv es arises from the factor of 2 in the denominator of equation (29). References [1] Wikip edia. Rational function. en.wikipedia.org/wiki/Rational_function , April 2008. [2] G. Amdahl. Validity of the single processor approach to ac hieving large scale computing capabilities. Pr o c. AFIPS Conf. , 30:483–485, Apr. 18-20 1967. [3] J. L. Gustafson. “Reev aluating Amdahl’s law”. Comm. ACM , 31(5):532–533, 1988. [4] N. J. Gun ther. “A simple capacit y model for massiv ely parallel transaction systems”. In Pr o c. CMG Conf. , pages 1035–1044, San Diego, California, Decem b er 1993. [5] F. P . Preparata. Should Amdahl’s law b e rep ealed? In ISAAC ’95: Pro c e e dings of the 6th International Symp osium on A lgorithms and Computation , page 311, London, UK, Dec 4-6 1995. Springer-V erlag. [6] N. J. Gun ther. A new in terpretation of Amdahl’s law and Geometric scalability . abs/cs/0210017 , Oct 2002. [7] L. Kleinro ck and J-H. Huang. On parallel pro cessing systems: Amdahl’s law generalized and some results on optimal design. IEEE T r ans. Softwar e Eng. , 18(5):434–447, 1992. [8] R. D. Nelson. Including queueing eﬀects in Amdahl’s law. Comm. ACM , 39(12es):231–238, 1996. [9] N. J. Gunther. “Path integral methods for computer p erformance analysis”. Information Pr o c essing L etters , 32(1):7–13, 1989. [10] N. J. Gun ther. Guerril la Cap acity Planning . Springer-V erlag, Heidelb erg, Germany , 2007. [11] N. J. Gun ther. Uniﬁcation of Amdahl’s law, LogP and other performance mo dels for message- passing architectures. In PDCS 2005, International Conferenc e on Parallel and Distribute d Computing Systems , pages 569–576, Nov ember 14–16, 2005, Pho enix, AZ, 2005. [12] J. Holtman and N. J. Gunther. Getting in the zone for successful scalability . In Pr o c. CMG Conf. , page T o app e ar , Las V egas, NV, 7–12 December 2008. [13] W. W are. The ultimate computer. IEEE Spe ctrum , 9:89–91, March 1972. [14] H. Sutter. Break Amdahl’s law! www.ddj.com/hpc- high- performance- computing/ 205900309 , Jan 2008. [15] D. Gross and C. Harris. F undamentals of Queueing The ory . Wiley , 1998. [16] D. A. Reed and R. M. F ujimoto. Multic omputer Networks: Message Base d Parallel Pr o c essing . MIT Press, Boston, Mass., 1987. [17] M. Ajmone-Marsan, G. Balbo, and G. Conte. Performanc e Mo dels of Multipr o c essor Systems . MIT Press, Boston, Mass., 1990. 13 [18] R. R. Muntz and J. W. W ong. Asymptotic prop erties of closed queueing netw ork mo dels. In Pr o c. Ann. Princ eton Conf. on Inf. Sci. and Sys. , pages 348–352, 1974. [19] J. C. S. Lui and R. R. Muntz. Computing b ounds on steady state av ailability of repairable computer systems. Journal of the ACM , 41(4):676–707, 1994. [20] J. Zahorjan, K. C. Sev cik, D. L. Eager, and B. Galler. Balanced job b ound analysis of queueing net works. Comm. ACM , 25(2):134–141, F eb 1982. [21] F. P . Bro oks. The Mythic al Man-Month . Addison-W esley , Reading, MA, anniversary edition, 1995. [22] B. D. Bunday and R. E. Scraton. The G/ M /r mac hine interference model. Eur op e an Journal of Op er ations R ese ar ch , 4:399–402, 1980. 14

A General Theory of Computational Scalability Based on Rational Functions

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment