Multi-criteria scheduling of pipeline workflows
Authors: Anne Benoit, Veronika Rehn-Sonigo, Yves Robert
INRIA Research Report n° 6232 (ISSN 0249-6399) — Thème NUM, Projet GRAAL — Unité de recherche INRIA Rhône-Alpes — June 2007 — 19 pages.

Abstract: Mapping workflow applications onto parallel platforms is a challenging problem, even for simple application patterns such as pipeline graphs. Several antagonistic criteria should be optimized, such as throughput and latency (or a combination of both). In this paper, we study the complexity of the bi-criteria mapping problem for pipeline graphs on communication-homogeneous platforms. In particular, we assess the complexity of the well-known chains-to-chains problem for different-speed processors, which turns out to be NP-hard. We provide several efficient polynomial bi-criteria heuristics, and their relative performance is evaluated through extensive simulations.

Keywords: algorithmic skeletons, pipeline, multi-criteria optimization, complexity results, heuristics, heterogeneous platforms.

This text is also available as a research report of the Laboratoire de l'Informatique du Parallélisme, http://www.ens-lyon.fr/LIP.

1 Introduction

Mapping applications onto parallel platforms is a difficult challenge. Several scheduling and load-balancing techniques have been developed for homogeneous architectures (see [17] for a survey), but the advent of heterogeneous clusters has rendered the mapping problem even more difficult. Typically, such clusters are composed of different-speed processors interconnected either by plain Ethernet (the low-end version) or by a high-speed switch (the high-end counterpart), and they constitute the experimental platform of choice in most academic or industry research departments. In this context of heterogeneous platforms, a structured programming approach rules out many of the problems which the low-level parallel application developer usually faces, such as deadlocks or process starvation. Moreover, many real applications draw from a range of well-known solution paradigms, such as pipelined or farmed computations.
High-level approaches based on algorithmic skeletons [7, 15] identify such patterns and seek to make it easy for an application developer to tailor such a paradigm to a specific problem. A library of skeletons is provided to the programmer, who can rely on these already coded patterns to express the communication scheme within their own application. Moreover, the use of a particular skeleton carries with it considerable information about implied scheduling dependencies, which we believe can help address the complex problem of mapping a distributed application onto a heterogeneous platform. In this paper, we therefore consider applications that can be expressed as algorithmic skeletons, and we focus on the pipeline skeleton, which is one of the most widely used. In such workflow applications, a series of data sets (tasks) enter the input stage and progress from stage to stage until the final result is computed. Each stage has its own communication and computation requirements: it reads an input file from the previous stage, processes the data and outputs a result to the next stage. For each data set, initial data is input to the first stage, and final results are output from the last stage. The pipeline workflow operates in synchronous mode: after some latency due to the initialization delay, a new task is completed every period. The period is defined as the longest cycle-time to operate a stage. Key metrics for a given workflow are the throughput and the latency. The throughput measures the aggregate rate of processing of data, and it is the rate at which data sets can enter the system. Equivalently, the inverse of the throughput, defined as the period, is the time interval required between the beginning of the execution of two consecutive data sets.
The latency is the time elapsed between the beginning and the end of the execution of a given data set; hence it measures the response time of the system to process the data set entirely. Note that it may well be the case that different data sets have different latencies (because they are mapped onto different processor sets), hence the latency is defined as the maximum response time over all data sets. Minimizing the latency is antagonistic to minimizing the period, and tradeoffs should be found between these criteria. In this paper, we focus on bi-criteria approaches, i.e., minimizing the latency under period constraints, or the converse. The problem of mapping pipeline skeletons onto parallel platforms has received some attention, and we survey related work in Section 6. In this paper, we target heterogeneous clusters, and aim at deriving optimal mappings for a bi-criteria objective function, i.e., mappings which minimize the period for a fixed maximum latency, or which minimize the latency for a fixed maximum period. Each pipeline stage can be seen as a sequential procedure which may perform disk accesses or write data in the memory for each task. This data may be reused from one task to another, and thus the rule of the game is always to process the tasks in a sequential order within a stage. Moreover, due to the possible local memory accesses, a given stage must be mapped onto a single processor: we cannot process half of the tasks on one processor and the remaining tasks on another without exchanging intra-stage information, which might be costly and difficult to implement. In other words, a processor that is assigned a stage will execute the operations required by this stage (input, computation and output) for all the tasks fed into the pipeline.
The optimization problem can be stated informally as follows: which stage to assign to which processor? We require the mapping to be interval-based, i.e., a processor is assigned an interval of consecutive stages. We target Communication Homogeneous platforms, with identical links but different-speed processors, which introduce a first degree of heterogeneity. Such platforms correspond to networks of workstations interconnected by a LAN, which constitute the typical experimental platforms in most academic or research departments. The main objective of this paper is to assess the complexity of the bi-criteria mapping problem onto Communication Homogeneous platforms. An interesting consequence of one of the new complexity results is the following. Given an array of n elements a_1, a_2, ..., a_n, the well-known chains-to-chains problem is to partition the array into p intervals whose element sums are well balanced (technically, the aim is to minimize the largest sum of the elements of any interval). This problem has been extensively studied in the literature (see the pioneering papers [6, 10, 13] and the survey [14]). It amounts to load-balancing n computations whose ordering must be preserved (hence the restriction to intervals) onto p identical processors. The advent of heterogeneous clusters naturally leads to the following generalization: can we partition the n elements into p intervals whose element sums match p prescribed values (the processor speeds) as closely as possible? The NP-hardness of this important extension of the chains-to-chains problem is established in Section 3. Thus the bi-criteria mapping problem is NP-hard, and we derive efficient polynomial bi-criteria heuristics, which are compared through simulation. The rest of the paper is organized as follows. Section 2 is devoted to the presentation of the target optimization problems.
Next, in Section 3, we proceed to the complexity results. In Section 4 we introduce several polynomial heuristics to solve the mapping problem. These heuristics are compared through simulations, whose results are analyzed in Section 5. Section 6 is devoted to an overview of related work. Finally, we state some concluding remarks in Section 7.

2 Framework

Applicative framework. We consider a pipeline of n stages S_k, 1 ≤ k ≤ n, as illustrated in Figure 1. Tasks are fed into the pipeline and processed from stage to stage, until they exit the pipeline after the last stage. The k-th stage S_k receives an input from the previous stage, of size δ_{k-1}, performs a number w_k of computations, and outputs data of size δ_k to the next stage. The first stage S_1 receives an input of size δ_0 from the outside world, while the last stage S_n returns the result, of size δ_n, to the outside world.

Figure 1: The application pipeline (stages S_1, ..., S_n, with w_k computations per stage and inter-stage data sizes δ_0, ..., δ_n).

Target platform. We target a platform with p processors P_u, 1 ≤ u ≤ p, fully interconnected as a (virtual) clique. There is a bidirectional link link_{u,v}: P_u → P_v between any processor pair P_u and P_v, of bandwidth b_{u,v}. Note that we do not need to have a physical link between any processor pair. Instead, we may have a switch, or even a path composed of several physical links, to interconnect P_u and P_v; in the latter case we would retain the bandwidth of the slowest link in the path for the value of b_{u,v}. In the most general case, we have fully heterogeneous platforms, with different processor speeds and link capacities, but we restrict in this paper to Communication Homogeneous platforms with different-speed processors (s_u ≠ s_v) interconnected by links of the same capacity (b_{u,v} = b).
They correspond to networks of different-speed processors or workstations interconnected by either plain Ethernet or a high-speed switch, and they constitute the typical experimental platforms in most academic or industry research departments. The speed of processor P_u is denoted as s_u, and it takes X/s_u time-units for P_u to execute X floating point operations. We also enforce a linear cost model for communications; hence it takes X/b time-units to send (resp. receive) a message of size X to (resp. from) P_v. Communication contention is taken care of by enforcing the one-port model [4, 5]. In this model, a given processor can be involved in a single communication at any time-step, either a send or a receive. However, independent communications between distinct processor pairs can take place simultaneously. The one-port model seems to fit the performance of some current MPI implementations, which serialize asynchronous MPI sends as soon as message sizes exceed a few megabytes [16].

Bi-criteria mapping problem. The general mapping problem consists in assigning application stages to platform processors. For the sake of simplicity, we can assume that each stage S_k of the application pipeline is mapped onto a distinct processor (which is possible only if n ≤ p). However, such one-to-one mappings may be unduly restrictive, and a natural extension is to search for interval mappings, i.e., allocation functions where each participating processor is assigned an interval of consecutive stages. Intuitively, assigning several consecutive stages to the same processor will increase its computational load, but may well dramatically decrease communication requirements.
In fact, the best interval mapping may turn out to be a one-to-one mapping, or instead may enroll only a very small number of fast computing processors interconnected by high-speed links. Interval mappings constitute a natural and useful generalization of one-to-one mappings (not to speak of situations where p < n, where interval mappings are mandatory), and such mappings have been studied by Subhlok et al. [19, 20]. The cost model associated to interval mappings is the following. We search for a partition of [1..n] into m ≤ p intervals I_j = [d_j, e_j] such that d_j ≤ e_j for 1 ≤ j ≤ m, d_1 = 1, d_{j+1} = e_j + 1 for 1 ≤ j ≤ m − 1, and e_m = n. Interval I_j is mapped onto processor P_{alloc(j)}, and the period is expressed as

$$T_{\mathrm{period}} = \max_{1 \le j \le m} \left\{ \frac{\delta_{d_j - 1}}{b} + \frac{\sum_{i=d_j}^{e_j} w_i}{s_{\mathrm{alloc}(j)}} + \frac{\delta_{e_j}}{b} \right\} \qquad (1)$$

The latency is obtained by the following expression (data sets traverse all stages, and only interprocessor communications need be paid for):

$$T_{\mathrm{latency}} = \sum_{1 \le j \le m} \left\{ \frac{\delta_{d_j - 1}}{b} + \frac{\sum_{i=d_j}^{e_j} w_i}{s_{\mathrm{alloc}(j)}} \right\} + \frac{\delta_n}{b} \qquad (2)$$

The optimization problem is to determine the best mapping, over all possible partitions into intervals, and over all processor assignments. The objective can be to minimize either the period, or the latency, or a combination: given a threshold period, what is the minimum latency that can be achieved? And the counterpart: given a threshold latency, what is the minimum period that can be achieved?

3 Complexity results

To the best of our knowledge, this work is the first to study the complexity of the bi-criteria optimization problem for an interval-based mapping of pipeline applications onto Communication Homogeneous platforms. Minimizing the latency is trivial, while minimizing the period is NP-hard.
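To make the cost model concrete, equations (1) and (2) can be evaluated directly for a given interval mapping. The following is a minimal sketch; the function names and the encoding of intervals and allocations are illustrative, not taken from the paper.

```python
# Evaluate the period (eq. 1) and latency (eq. 2) of an interval mapping.
# intervals[j] = (d_j, e_j) with 1-based stage indices; alloc[j] is the index
# of the processor assigned to interval j; delta has entries delta_0..delta_n.

def period(delta, w, speeds, intervals, alloc, b):
    return max(
        delta[d - 1] / b + sum(w[d - 1:e]) / speeds[alloc[j]] + delta[e] / b
        for j, (d, e) in enumerate(intervals)
    )

def latency(delta, w, speeds, intervals, alloc, b):
    n = len(w)
    return sum(
        delta[d - 1] / b + sum(w[d - 1:e]) / speeds[alloc[j]]
        for j, (d, e) in enumerate(intervals)
    ) + delta[n] / b
```

For a single interval on a single processor, both formulas collapse to δ_0/b + (Σ w_i)/s + δ_n/b, so period and latency coincide.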
Quite interestingly, this last result is a consequence of the fact that the natural extension of the chains-to-chains problem [14] to different-speed processors is NP-hard.

Lemma 1. The optimal pipeline mapping which minimizes the latency can be determined in polynomial time.

Proof. The minimum latency can be achieved by mapping all stages as a single interval onto the fastest processor j, resulting in the latency δ_0/b + (Σ_{i=1}^{n} w_i)/s_j + δ_n/b (the terms δ_0/b and δ_n/b are paid by every mapping). If a slower processor is involved in the mapping, the latency increases, following equation (2), since part of the computations will take longer, and communications may occur. Thus, minimizing the latency can be done in polynomial time.

However, it is not so easy to minimize the period, and we study the heterogeneous 1D partitioning problem in order to assess the complexity of the period minimization problem. Given an array of n elements a_1, a_2, ..., a_n, the 1D partitioning problem, also known as the chains-to-chains problem, is to partition the array into p intervals whose element sums are almost identical. More precisely, we search for a partition of [1..n] into p consecutive intervals I_1, I_2, ..., I_p, where I_k = [d_k, e_k] and d_k ≤ e_k for 1 ≤ k ≤ p, d_1 = 1, d_{k+1} = e_k + 1 for 1 ≤ k ≤ p − 1, and e_p = n. The objective is to minimize

$$\max_{1 \le k \le p} \sum_{i \in I_k} a_i = \max_{1 \le k \le p} \sum_{i = d_k}^{e_k} a_i.$$

This problem has been extensively studied in the literature because it has various applications. In particular, it amounts to load-balancing n computations whose ordering must be preserved (hence the restriction to intervals) onto p identical processors. Then each a_i corresponds to the execution time of the i-th task, and the sum of the elements in interval I_k is the load of the processor which I_k is assigned to.
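A classical way to solve this homogeneous version is a greedy feasibility probe wrapped in a binary search over the bottleneck value. The sketch below is illustrative and not taken from the paper; it assumes integer weights.

```python
# Probe-based solution to the homogeneous chains-to-chains problem.

def fits(a, p, K):
    """Can a be split into at most p consecutive intervals, each of sum <= K?"""
    parts, current = 1, 0
    for x in a:
        if x > K:                 # a single element already exceeds the bound
            return False
        if current + x > K:       # close the current interval, open a new one
            parts += 1
            current = x
        else:
            current += x
    return parts <= p

def chains_to_chains(a, p):
    """Smallest achievable bottleneck, by binary search on K."""
    lo, hi = max(a), sum(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if fits(a, p, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

The probe runs in O(n) time, and the binary search adds a factor logarithmic in the total weight.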
Several algorithms and heuristics have been proposed to solve this load-balancing problem, including [6, 11, 10, 12, 13]. We refer the reader to the survey paper by Pinar and Aykanat [14] for a detailed overview and comparison of the literature. The advent of heterogeneous clusters leads to the following generalization of the 1D partitioning problem: the goal is to partition the n elements into p intervals whose element sums match p prescribed values (the processor speeds) as closely as possible. Let s_1, s_2, ..., s_p denote these values. We search for a partition of [1..n] into p intervals I_k = [d_k, e_k] and for a permutation σ of {1, 2, ..., p}, with the objective to minimize

$$\max_{1 \le k \le p} \frac{\sum_{i \in I_k} a_i}{s_{\sigma(k)}}.$$

Another way to express the problem is that intervals are now weighted by the s_i values, while we had s_i = 1 for the homogeneous version. Can we extend the efficient algorithms described in [14] to solve the heterogeneous 1D partitioning problem, Hetero-1D-Partition for short? In fact, the problem seems combinatorial, because of the search over all possible permutations to weight the intervals. Indeed, we prove the NP-completeness of (the decision problem associated to) Hetero-1D-Partition.

Definition 1 (Hetero-1D-Partition-Dec). Given n elements a_1, a_2, ..., a_n, p values s_1, s_2, ..., s_p and a bound K, can we find a partition of [1..n] into p intervals I_1, I_2, ..., I_p, with I_k = [d_k, e_k] and d_k ≤ e_k for 1 ≤ k ≤ p, d_1 = 1, d_{k+1} = e_k + 1 for 1 ≤ k ≤ p − 1 and e_p = n, and a permutation σ of {1, 2, ..., p}, such that

$$\max_{1 \le k \le p} \frac{\sum_{i \in I_k} a_i}{s_{\sigma(k)}} \le K?$$

Theorem 1. The Hetero-1D-Partition-Dec problem is NP-complete.

Proof.
The Hetero-1D-Partition-Dec problem clearly belongs to the class NP: given a solution, it is easy to verify in polynomial time that the partition into p intervals is valid and that the maximum sum of the elements in a given interval, divided by the corresponding s value, does not exceed the bound K. To establish the completeness, we use a reduction from NUMERICAL MATCHING WITH TARGET SUMS (NMWTS), which is NP-complete in the strong sense [9]. We consider an instance I_1 of NMWTS: given 3m numbers x_1, x_2, ..., x_m, y_1, y_2, ..., y_m and z_1, z_2, ..., z_m, do there exist two permutations σ_1 and σ_2 of {1, 2, ..., m} such that x_i + y_{σ_1(i)} = z_{σ_2(i)} for 1 ≤ i ≤ m? Because NMWTS is NP-complete in the strong sense, we can encode the 3m numbers in unary and assume that the size of I_1 is O(m + M), where M = max_i {x_i, y_i, z_i}. We also assume that Σ_{i=1}^{m} x_i + Σ_{i=1}^{m} y_i = Σ_{i=1}^{m} z_i, otherwise I_1 cannot have a solution.

We build the following instance I_2 of Hetero-1D-Partition-Dec (we use the formulation in terms of task weights and processor speeds, which is more intuitive). We define n = (M + 3)m tasks, whose weights form m consecutive blocks, one per index i:

A_1, 1, 1, ..., 1 (M ones), C, D | A_2, 1, 1, ..., 1 (M ones), C, D | ... | A_m, 1, 1, ..., 1 (M ones), C, D

Here, B = 2M, C = 5M, D = 7M, and A_i = B + x_i for 1 ≤ i ≤ m. To define the a_i formally for 1 ≤ i ≤ n, let N = M + 3. We have, for 1 ≤ i ≤ m:

- a_{(i−1)N+1} = A_i = B + x_i
- a_{(i−1)N+j} = 1 for 2 ≤ j ≤ M + 1
- a_{iN−1} = C
- a_{iN} = D

For the number of processors (and intervals), we choose p = 3m. As for the speeds, we let s_i be the speed of processor P_i where, for 1 ≤ i ≤ m:

- s_i = B + z_i
- s_{m+i} = C + M − y_i
- s_{2m+i} = D

Finally, we ask whether there exists a solution matching the bound K = 1.
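Before turning to the equivalence argument, the construction can be checked numerically on a toy instance. The following sketch (illustrative only, not part of the proof) builds I_2 for m = 2, M = 3 and verifies that the mapping used in the forward direction of the proof meets the bound K = 1.

```python
# Sanity check of the reduction on a tiny NMWTS instance whose solution is
# the identity permutation on both sides.
m, M = 2, 3
x, y, z = [1, 2], [2, 1], [3, 3]   # x_i + y_i = z_i, so sigma1 = sigma2 = id
B, C, D = 2 * M, 5 * M, 7 * M
A = [B + xi for xi in x]

# Task weights: one block "A_i, M ones, C, D" per index i.
tasks = []
for Ai in A:
    tasks += [Ai] + [1] * M + [C, D]
assert len(tasks) == (M + 3) * m

# Processor speeds s_1 .. s_{3m}.
speeds = [B + zi for zi in z] + [C + M - yi for yi in y] + [D] * m

# Check the intended mapping block by block (sigma1 = sigma2 = identity):
for i in range(m):
    block = tasks[(M + 3) * i:(M + 3) * (i + 1)]
    assert sum(block[:1 + y[i]]) <= speeds[i]           # P_i: A_i + y_i ones
    assert sum(block[1 + y[i]:M + 2]) <= speeds[m + i]  # P_{m+i}: rest + C
    assert block[M + 2] <= speeds[2 * m + i]            # P_{2m+i}: the D task
```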
Clearly, the size of I_2 is polynomial in the size of I_1. We now show that instance I_1 has a solution if and only if instance I_2 does. Suppose first that I_1 has a solution, with permutations σ_1 and σ_2 such that x_i + y_{σ_1(i)} = z_{σ_2(i)}. For 1 ≤ i ≤ m:

- We map each task A_i and the following y_{σ_1(i)} tasks of weight 1 onto processor P_{σ_2(i)}.
- We map the following M − y_{σ_1(i)} tasks of weight 1 and the next task, of weight C, onto processor P_{m+σ_1(i)}.
- We map the next task, of weight D, onto processor P_{2m+i}.

We do have a valid partition of all the tasks into p = 3m intervals. For 1 ≤ i ≤ m, the load and speed of the processors are indeed equal:

- The load of P_{σ_2(i)} is A_i + y_{σ_1(i)} = B + x_i + y_{σ_1(i)}, and its speed is B + z_{σ_2(i)}.
- The load of P_{m+σ_1(i)} is M − y_{σ_1(i)} + C, which is equal to its speed.
- The load and speed of P_{2m+i} are both D.

The mapping does achieve the bound K = 1, hence a solution to I_1.

Suppose now that I_2 has a solution, i.e., a mapping matching the bound K = 1. We first observe that s_i < s_{m+j} < s_{2m+k} = D for 1 ≤ i, j, k ≤ m. Indeed, s_i = B + z_i ≤ B + M = 3M, 5M ≤ s_{m+j} = C + M − y_j ≤ 6M, and D = 7M. Hence each of the m tasks of weight D must be assigned to a processor of speed D, and it is the only task assigned to this processor. These m singleton assignments divide the set of tasks into m intervals, namely the set of tasks before the first task of weight D, and the m − 1 sets of tasks lying between two consecutive tasks of weight D. The total weight of each of these m intervals is A_i + M + C ≥ B + M + C = 8M, while the largest speed of the 2m remaining processors is 6M. Therefore each of these intervals must be assigned to at least 2 processors. However, there remain only 2m available processors, hence each interval is assigned exactly 2 processors.
Consider such an interval A_i 1 1 ... 1 C, with M tasks of weight 1, and let P_{i_1} and P_{i_2} be the two processors assigned to this interval. Tasks A_i and C are not assigned to the same processor (otherwise the whole interval would be assigned to a single processor). So P_{i_1} receives task A_i and h_i tasks of weight 1, while P_{i_2} receives M − h_i tasks of weight 1 and task C. The load of P_{i_2} is M − h_i + C ≥ C = 5M, while s_i ≤ 3M for 1 ≤ i ≤ m. Hence P_{i_1} must be some P_i, 1 ≤ i ≤ m, while P_{i_2} must be some P_{m+j}, 1 ≤ j ≤ m. Because this holds true for each interval, this defines two permutations σ_2(i) and σ_1(i) such that P_{i_1} = P_{σ_2(i)} and P_{i_2} = P_{m+σ_1(i)}. Because the bound K = 1 is achieved, we have:

- A_i + h_i = B + x_i + h_i ≤ B + z_{σ_2(i)}
- M − h_i + C ≤ C + M − y_{σ_1(i)}

Therefore y_{σ_1(i)} ≤ h_i and x_i + h_i ≤ z_{σ_2(i)}, and

$$\sum_{i=1}^{m} x_i + \sum_{i=1}^{m} y_i \le \sum_{i=1}^{m} x_i + \sum_{i=1}^{m} h_i \le \sum_{i=1}^{m} z_i.$$

By hypothesis, Σ_{i=1}^{m} x_i + Σ_{i=1}^{m} y_i = Σ_{i=1}^{m} z_i, hence all the previous inequalities are tight, and in particular Σ_{i=1}^{m} x_i + Σ_{i=1}^{m} h_i = Σ_{i=1}^{m} z_i. We can deduce that Σ_{i=1}^{m} y_i = Σ_{i=1}^{m} h_i = Σ_{i=1}^{m} z_i − Σ_{i=1}^{m} x_i, and since y_{σ_1(i)} ≤ h_i for all i, we have y_{σ_1(i)} = h_i for all i. Similarly, we deduce that x_i + h_i = z_{σ_2(i)} for all i, and therefore x_i + y_{σ_1(i)} = z_{σ_2(i)}. Altogether, we have found a solution for I_1, which concludes the proof.

This important result leads to the NP-completeness of the period minimization problem.

Theorem 2. The period minimization problem for pipeline graphs is NP-complete.

Proof. Obviously, the optimization problem belongs to the class NP.
Any instance of the Hetero-1D-Partition problem with n tasks a_i, p processor speeds s_i and bound K can be converted into an instance of the mapping problem with n stages of weight w_i = a_i, letting all communication costs δ_i = 0, targeting a Communication Homogeneous platform with the same p processors and homogeneous links of bandwidth b = 1, and trying to achieve a period not greater than K. This concludes the proof.

Since the period minimization problem is NP-hard, all bi-criteria problems are NP-hard.

4 Heuristics

Since the bi-criteria optimization problem is NP-hard, we propose in this section several polynomial heuristics to tackle it. In the following, we denote by n the number of stages, and by p the number of processors.

4.1 Minimizing latency for a fixed period

In the first set of heuristics, the period is fixed a priori, and we aim at minimizing the latency while respecting the prescribed period. All the following heuristics sort processors by non-increasing speed, and start by assigning all the stages to the first (fastest) processor in the list. This processor becomes used.

H1-Sp mono P: Splitting mono-criterion – At each step, we select the used processor j with the largest period and we try to split its stage interval, giving some stages to the next fastest processor j′ in the list (not yet used). This can be done by splitting the interval at any place, and either placing the first part of the interval on j and the remainder on j′, or the other way round. The solution which minimizes max(period(j), period(j′)) is chosen if it is better than the original solution. Splitting is performed as long as we have not reached the fixed period, or until we cannot improve the period anymore.
H2a-3-Explo mono: 3-Exploration mono-criterion – At each step we select the used processor j with the largest period and we split its interval into three parts. For this purpose we try to map two parts of the interval onto the next pair of fastest processors in the list, j′ and j′′, and to keep the third part on processor j. Testing all possible permutations and all possible positions where to cut, we choose the solution that minimizes max(period(j), period(j′), period(j′′)).

H2b-3-Explo bi: 3-Exploration bi-criteria – In this heuristic the choice of where to split is more elaborate: it depends not only on the period improvement, but also on the latency increase. Using the same splitting mechanism as in 3-Explo mono, we select the solution that minimizes max_{i∈{j,j′,j′′}}(Δlatency/Δperiod(i)). Here Δlatency denotes the difference between the global latency of the solution before the split and after the split. In the same manner, Δperiod(i) denotes the difference between the period before the split (achieved by processor j) and the new period of processor i.

H3-Sp bi P: Splitting bi-criteria – This heuristic uses a binary search over the latency. For this purpose, at each iteration we fix an authorized increase of the optimal latency (which is obtained by mapping all stages on the fastest processor), and we test whether we get a feasible solution via splitting. The splitting mechanism itself is quite similar to that of H1-Sp mono P, except that we choose the solution that minimizes max_{i∈{j,j′}}(Δlatency/Δperiod(i)) within the authorized latency increase to decide where to split. As long as we get a feasible solution, we reduce the authorized latency increase for the next iteration of the binary search, thereby aiming at minimizing the global latency of the mapping.
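As an illustration of the splitting mechanism underlying these heuristics, here is a simplified sketch of H1: it ignores communication costs (all δ_i = 0), so a processor's period is just its computational load divided by its speed; the function name and the condensed stopping rule are illustrative, not the authors' implementation.

```python
def h1_split(w, speeds, target_period):
    """Greedy splitting toward a target period; returns the achieved period."""
    def period(iv, s):
        return sum(iv) / s

    speeds = sorted(speeds, reverse=True)   # fastest processor first
    mapping = [(list(w), speeds[0])]        # all stages on the fastest one
    next_proc = 1
    while next_proc < len(speeds):
        # used processor with the largest period
        j = max(range(len(mapping)), key=lambda i: period(*mapping[i]))
        iv, s = mapping[j]
        if period(iv, s) <= target_period or len(iv) < 2:
            break
        s2 = speeds[next_proc]              # next fastest unused processor
        # try every cut point and both processor orders, keep the best split
        best = None
        for cut in range(1, len(iv)):
            for left_s, right_s in ((s, s2), (s2, s)):
                cost = max(period(iv[:cut], left_s), period(iv[cut:], right_s))
                if best is None or cost < best[0]:
                    best = (cost, cut, left_s, right_s)
        if best[0] >= period(iv, s):        # splitting no longer helps
            break
        _, cut, left_s, right_s = best
        mapping[j:j + 1] = [(iv[:cut], left_s), (iv[cut:], right_s)]
        next_proc += 1
    return max(period(iv, s) for iv, s in mapping)
```

H4 below follows the same skeleton, with the latency bound replacing the period as the stopping condition.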
4.2 Minimizing period for a fixed latency

In this second set of heuristics, the latency is fixed, and we try to achieve a minimum period while respecting the latency constraint. As in the heuristics described above, we first sort processors according to their speed and map all stages onto the fastest processor. The approach used here is the converse of the heuristics where we fix the period, as we start with an optimal solution concerning latency. Indeed, at each step we downgrade the solution with respect to its latency, but improve it regarding its period.

H4-Sp mono L: Splitting mono-criterion – This heuristic uses the same method as H1-Sp mono P with a different break condition. Here splitting is performed as long as we do not exceed the fixed latency, still choosing the solution that minimizes max(period(j), period(j′)).

H5-Sp bi L: Splitting bi-criteria – This variant of the splitting heuristic works similarly to H4-Sp mono L, but at each step it chooses the solution which minimizes max_{i∈{j,j′}}(Δlatency/Δperiod(i)) while the fixed latency is not exceeded.

The code for all these heuristics can be found on the Web at http://graal.ens-lyon.fr/~vrehn/code/multicriteria/.

5 Experiments

Several experiments have been conducted in order to assess the performance of the heuristics described in Section 4. First we describe the experimental setting, then we report the results, and finally we provide a summary.

5.1 Experimental setting

We have generated a set of random applications with n ∈ {5, 10, 20, 40} stages and a set of random Communication Homogeneous platforms with p = 10 or p = 100 processors. In all the experiments, we fix b = 10 for the link bandwidths. Moreover, the speed of each processor is randomly chosen as an integer between 1 and 20.
We keep the latter range of variation throughout the experiments, while we vary the range of the application parameters from one set of experiments to the other. Indeed, although there are four categories of parameters to play with, i.e., the values of δ, w, s and b, we can see from equations (1) and (2) that only the relative ratios δ/b and w/s have an impact on the performance. Each experimental value reported in the following has been calculated as an average over 50 randomly chosen application/platform pairs. For each of these pairs, we report the performance of the six heuristics described in Section 4. We report four main sets of experiments, conducted both for p = 10 and p = 100 processors. For each experiment, we vary some key application/platform parameter to assess the impact of this parameter on the performance of the heuristics. The first two experiments deal with applications where communications and computations have the same order of magnitude, and we study the impact of the degree of heterogeneity of the communications, i.e., of the variation range of the δ parameter:

(E1): balanced communications/computations, and homogeneous communications. In the first set of experiments, the application communications are homogeneous: we fix δ_i = 10 for i = 0..n. The computation time required by each stage is randomly chosen between 1 and 20. Thus, communications and computations are balanced within the application.

(E2): balanced communications/computations, and heterogeneous communications. In the second set of experiments, the application communications are heterogeneous, chosen randomly between 1 and 100. Similarly to Experiment 1, the computation time required by each stage is randomly chosen between 1 and 20. Thus, communications and computations are still relatively balanced within the application.
The last two experiments deal with imbalanced applications: the third experiment assumes large computations (large value of the w to δ ratio), and the fourth one reports results for small computations (small value of the w to δ ratio):

(E3): large computations. In this experiment, the applications are much more demanding on computations than on communications, making communications negligible with respect to computation requirements. We choose the communication time between 1 and 20, while the computation time of each stage is chosen between 10 and 1000.

(E4): small computations. The last experiment is the opposite of Experiment 3, since computations are now negligible compared to communications. The communication time is still chosen between 1 and 20, but the computation time is now chosen between 0.01 and 10.

Figure 2: (E1) Balanced communications/computations, and homogeneous communications. (a) 10 stages; (b) 40 stages.

5.2 Results

Results for the entire set of experiments can be found on the Web at http://graal.ens-lyon.fr/~vrehn/code/multicriteria/. In the following we only present the most significant plots.

5.2.1 With p = 10 processors

For (E1) we see that all heuristics follow the same curve shape, with the exception of heuristic Sp bi P (cf. Figure 2), which has a different behavior. We observe this general behavior of the different heuristics in all the experiments. The heuristic Sp bi P initially finds a solution with relatively small period and latency, and then tends to increase both.
The other five heuristics achieve small period times at the price of long latencies, and then seem to converge to a somewhat shorter latency. We notice that the simplest splitting heuristics perform very well: Sp mono P and Sp mono L achieve the best period, and Sp mono P has the lower latency. Sp bi P minimizes the latency with competitive period sizes. Its counterpart Sp bi L performs poorly in comparison. 3-Explo mono and Sp bi L cannot keep up with the other heuristics (but the latter achieves better results than the former). In the middle range of period values, 3-Explo bi achieves latency values comparable with those of Sp mono P and Sp bi P.

For (E2), if we leave aside Sp bi P, we see that Sp mono P outperforms the other heuristics almost everywhere, with the following exception: with 40 stages and a large fixed period, 3-Explo bi obtains the better results. Sp bi P achieves by far the best latency times, but its period times are not as good as those of Sp mono P and 3-Explo bi. We observe that the competitiveness of 3-Explo bi increases with the number of stages. Sp mono L achieves period values just as small as Sp mono P, but the corresponding latency is higher, and once again it performs better than its bi-criteria counterpart Sp bi L. The poorest results are obtained by 3-Explo mono.

Figure 3: (E2) Balanced communications/computations, and heterogeneous communications. (a) 10 stages; (b) 40 stages.
Figure 4: (E3) Large computations. (a) 5 stages; (b) 20 stages.

The results of (E3) are much more scattered than in the other experiments (E1, E2 and E4), and this difference even increases with rising n. When n = 5, the results of the different heuristics are almost parallel, so that we can state the following hierarchy: Sp mono P, 3-Explo bi, Sp mono L, Sp bi L and finally 3-Explo mono. For this experiment Sp bi P achieves rather poor results. With the increase of the number of stages n, the performance of Sp bi P improves and this heuristic achieves the best latency, but its period values cannot compete with Sp mono P and 3-Explo bi. These latter heuristics achieve very good results concerning period durations. On the contrary, the period and latency times of 3-Explo mono blow up. 3-Explo bi loses its second position for small period times compared to Sp mono L, but when period times are higher it recovers its position in the hierarchy.

In (E4), 3-Explo mono performs the poorest. Nevertheless the gap is smaller than in (E3), and for high period times and n ≥ 20, its latency is comparable to those of the other heuristics. For n ≥ 20, 3-Explo bi achieves for the first time the best results, and only the latency of Sp bi P is lower. When n = 5, Sp bi L achieves the best latency, but its period values are not competitive with Sp mono P and Sp mono L, which obtain the smallest periods (for slightly higher latency times). In Table 1 the failure thresholds of the different heuristics are shown.
We denote by failure threshold the largest value of the fixed period or latency for which the heuristic was not able to find a solution.

Figure 5: (E4) Small computations. (a) 5 stages; (b) 20 stages.

                   Number of stages                        Number of stages
Exp.  Heur.     5     10     20     40    Exp.  Heur.     5     10     20     40
E1    H1       3.0    3.3    5.0    5.0   E2    H1       9.7   10.0   11.0   11.0
      H2       3.0    4.7    9.0   18.0         H2      10.3   10.0   12.0   19.0
      H3       3.0    4.0    5.0    5.0         H3      10.0   10.0   11.0   11.0
      H4       3.3    3.3    6.0   10.0         H4      11.3   11.0   13.0   15.0
      H5       4.5    6.0   13.0   25.0         H5      11.7   15.0   22.0   32.0
      H6       4.5    6.0   13.0   25.0         H6      11.7   15.0   22.0   32.0
E3    H1      50.0   70.0  100.0  250.0   E4    H1       2.2    2.3    2.3    2.3
      H2      50.0  140.0  450.0  950.0         H2       2.4    2.7    3.7    7.0
      H3      50.0   90.0  250.0  400.0         H3       2.4    2.7    3.0    4.0
      H4     100.0  140.0  300.0  650.0         H4       2.8    2.7    3.0    4.0
      H5     140.0  270.0  500.0 1000.0         H5       3.0    4.0    7.0   11.0
      H6     140.0  270.0  500.0 1000.0         H6       3.0    4.0    7.0   11.0

Table 1: Failure thresholds of the different heuristics in the different experiments.

We observe that Sp mono P has the smallest failure thresholds, whereas 3-Explo mono has the highest values. Surprisingly, the failure thresholds (for fixed latencies) of the heuristics Sp mono L and Sp bi L are the same, but their performance differs enormously, as stated in the different experiments.

5.2.2 With p = 100 processors

Many results are similar with p = 10 and p = 100 processors, thus we only report the main differences. First we observe that both periods and latencies are lower with the increasing number of processors. This is easy to explain, as all heuristics always choose fastest processors first, and there is much more choice with p = 100.
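The failure thresholds of Table 1 could also be located programmatically, e.g. by bisection on the fixed bound. The sketch below assumes a hypothetical interface in which heuristic(bound) returns a mapping when it succeeds under the fixed period (or latency) bound and None when it fails, and assumes success is monotone in the bound; the report instead reads the thresholds off its fixed experimental grid.

```python
def failure_threshold(heuristic, lo, hi, tol=1e-3):
    """Largest bound for which `heuristic` fails, found by bisection.

    heuristic(bound) -> a solution (truthy) on success, None on failure.
    Assumes failure for every bound below the threshold and success above
    it (monotone behavior).  Hypothetical interface, for illustration.
    """
    if heuristic(hi) is None:
        return hi                 # fails on the whole probed range
    if heuristic(lo) is not None:
        return lo                 # never fails above lo
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if heuristic(mid) is None:
            lo = mid              # still failing: threshold is above mid
        else:
            hi = mid              # succeeding: threshold is below mid
    return hi
```

For instance, a heuristic that succeeds exactly for bounds of at least 7.5 yields a threshold within tol of 7.5.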
All heuristics keep their general behavior, i.e. their curve characteristics. But the relative performance of some heuristics changes dramatically. The results of 3-Explo mono are much better, and we do get adequate latency times (compare Figures 2(b) and 6(a)). Furthermore, the multi-criteria heuristics turn out to perform much better. An interesting example can be seen in Figure 6(b): all multi-criteria heuristics outperform their mono-criterion counterparts, even Sp bi L, which never had a better performance than Sp mono L when p = 10.

Figure 6: Extension to 100 processors, balanced communications/computations. (a) (E1) 40 stages, homogeneous communications; (b) (E2) 40 stages, heterogeneous communications.

Figure 7: Extension to 100 processors, imbalanced communications/computations. (a) (E3) 10 stages, large computations; (b) (E4) 40 stages, small computations.

In the case of imbalanced communications/computations, we observe that all heuristics achieve almost the same results. The only exception is the binary-search heuristic Sp bi P, which shows a slightly superior performance, as can be seen in Figure 7(a). The performance of 3-Explo bi depends on the number of stages.
In general it is superseded by Sp mono P when n ≤ 10, but for n ≥ 20 3-Explo bi holds the second position after Sp bi L, and it even performs best in the small computations configuration with n = 40 (see Figure 7(b)).

5.2.3 Summary

Overall we conclude that the performance of bi-criteria heuristics versus mono-criterion heuristics highly depends on the number of available processors. For a small number of processors, the simple splitting technique used in Sp mono P and Sp mono L is very competitive, as it almost always minimizes the period with acceptable latency values. The bi-criteria splitting Sp bi P mainly minimizes latency values at the price of longer periods. Nevertheless, depending upon the application, this heuristic seems to be very interesting whenever small latencies are demanded. On the contrary, its counterpart Sp bi L does not provide convincing results. Finally, both 3-Exploration heuristics fail to achieve the expected performance.

However, when increasing the number of available processors, we observe a significant improvement of the behavior of the bi-criteria heuristics. Sp bi L turns out to outperform the mono-criterion version, and Sp bi P improves its period times such that it outplays its competitors. Finally, both 3-Exploration heuristics perform much better, and 3-Explo bi finds its slot.

6 Related work

As already mentioned, this work is an extension of the work of Subhlok and Vondran [19, 20] for pipeline applications on homogeneous platforms. We extend the complexity results to heterogeneous platforms. We have also discussed the relationship with the chains-to-chains problem [6, 10, 11, 12, 13, 14] in Section 1. Several papers consider the problem of mapping communicating tasks onto heterogeneous platforms, but for a different applicative framework.
In [21], Taura and Chien consider applications composed of several copies of the same task graph, expressed as a DAG (directed acyclic graph). These copies are to be executed in pipeline fashion. Taura and Chien also restrict all instances of a given task type (which corresponds to a stage in our framework) to be mapped onto the same processor. Their problem is shown NP-complete, and they provide an iterative heuristic to determine a good mapping. At each step, the heuristic refines the current clustering of the DAG.

Beaumont et al. [1] consider the same problem as Taura and Chien, i.e. with a general DAG, but they allow a given task type to be mapped onto several processors, each executing a fraction of the total number of tasks. The problem remains NP-complete, but becomes polynomial for special classes of DAGs, such as series-parallel graphs. For such graphs, it is possible to determine the optimal mapping owing to an approach based upon a linear programming formulation. The drawback of the approach of [1] is that the optimal throughput can only be achieved through very long periods, so that the simplicity and regularity of the schedule are lost, while the latency is severely increased.

Another important series of papers comes from the DataCutter project [8]. One goal of this project is to schedule multiple data analysis operations onto clusters and grids, deciding where to place and/or replicate various components [2, 3, 18]. A typical application is a chain of consecutive filtering operations, to be executed on a very large data set. The task graphs targeted by DataCutter are more general than linear pipelines or forks, but still more regular than arbitrary DAGs, which makes it possible to design efficient heuristics to solve the previous placement and replication optimization problems.
However, we point out that a recent paper [22] targets workflows structured as arbitrary DAGs and considers bi-criteria optimization problems on homogeneous platforms. The paper provides many interesting ideas and several heuristics to solve the general mapping problem. It would be very interesting to experiment with these heuristics on the simple pipeline mapping problem, and to compare them to our own heuristics designed specifically for pipeline workflows.

7 Conclusion

In this paper, we have studied a difficult bi-criteria mapping problem onto Communication Homogeneous platforms. We restricted ourselves to the class of applications which have a pipeline structure, and studied the complexity of the problem. To the best of our knowledge, this is the first time that a multi-criteria pipeline mapping is studied from a theoretical perspective, while it is quite a standard and widely used pattern in many real-life applications. While minimizing the latency is trivial, the problem of minimizing the pipeline period is NP-hard, and thus the bi-criteria problem is NP-hard. We provided several efficient polynomial heuristics, either to minimize the period for a fixed latency, or to minimize the latency for a fixed period.

These heuristics have been extensively compared through simulation. Results highly depend on application and platform parameters such as the number of stages and the number of available processors. Simple mono-criterion splitting heuristics perform very well when there is a limited number of processors, whereas bi-criteria heuristics perform much better when increasing the number of processors. Overall, the introduction of bi-criteria heuristics was not fully successful for small clusters, but turned out to be mandatory to achieve good performance on larger platforms.

There remains much work to extend the results of this paper.
We designed heuristics for Communication Homogeneous platforms, and finding efficient bi-criteria heuristics was already a challenge. It would be interesting to deal with fully heterogeneous platforms, but it seems to be a difficult problem, even for a mono-criterion optimization problem. In the longer term, we plan to perform real experiments on heterogeneous platforms, using an already implemented skeleton library, in order to compare the effective performance of the application for a given mapping (obtained with our heuristics) against the theoretical performance of this mapping.

A natural extension of this work would be to consider other widely used skeletons. For example, when there is a bottleneck in the pipeline operation due to a stage which is both computationally demanding and not constrained by internal dependencies, we can nest another skeleton in place of the stage. For instance, a farm or deal skeleton would allow the workload of the initial stage to be split among several processors. Using such deal skeletons may be either the programmer's decision (explicit nesting in the application code) or the result of the mapping procedure. Extending our mapping strategies to automatically identify opportunities for deal skeletons, and to implement these, is a difficult but very interesting perspective.

References

[1] Olivier Beaumont, Arnaud Legrand, Loris Marchal, and Yves Robert. Assessing the impact and limits of steady-state scheduling for mixed task and data parallelism on heterogeneous platforms. In HeteroPar'2004: International Conference on Heterogeneous Computing, jointly published with ISPDC'2004: International Symposium on Parallel and Distributed Computing, pages 296–302. IEEE Computer Society Press, 2004.

[2] M. D. Beynon, T. Kurc, A. Sussman, and J. Saltz. Optimizing execution of component-based applications using group instances.
Future Generation Computer Systems, 18(4):435–448, 2002.

[3] Michael Beynon, Alan Sussman, Umit Catalyurek, Tahsin Kurc, and Joel Saltz. Performance optimization for data intensive grid applications. In Proceedings of the Third Annual International Workshop on Active Middleware Services (AMS'01). IEEE Computer Society Press, 2001.

[4] P.B. Bhat, C.S. Raghavendra, and V.K. Prasanna. Efficient collective communication in distributed heterogeneous systems. In ICDCS'99, 19th International Conference on Distributed Computing Systems, pages 15–24. IEEE Computer Society Press, 1999.

[5] P.B. Bhat, C.S. Raghavendra, and V.K. Prasanna. Efficient collective communication in distributed heterogeneous systems. Journal of Parallel and Distributed Computing, 63:251–263, 2003.

[6] S. H. Bokhari. Partitioning problems in parallel, pipeline, and distributed computing. IEEE Trans. Computers, 37(1):48–57, 1988.

[7] M. Cole. Bringing Skeletons out of the Closet: A Pragmatic Manifesto for Skeletal Parallel Programming. Parallel Computing, 30(3):389–406, 2004.

[8] DataCutter Project: Middleware for Filtering Large Archival Scientific Datasets in a Grid Environment. http://www.cs.umd.edu/projects/hpsl/ResearchAreas/DataCutter.htm.

[9] M. R. Garey and D. S. Johnson. Computers and Intractability, a Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.

[10] Pierre Hansen and Keh-Wei Lih. Improved algorithms for partitioning problems in parallel, pipeline, and distributed computing. IEEE Trans. Computers, 41(6):769–771, 1992.

[11] M. A. Iqbal. Approximate algorithms for partitioning problems. Int. J. Parallel Programming, 20(5):341–361, 1991.

[12] M. A. Iqbal and S. H. Bokhari. Efficient algorithms for a class of partitioning problems. IEEE Trans. Parallel and Distributed Systems, 6(2):170–175, 1995.
[13] Bjorn Olstad and Fredrik Manne. Efficient partitioning of sequences. IEEE Transactions on Computers, 44(11):1322–1326, 1995.

[14] Ali Pinar and Cevdet Aykanat. Fast optimal load balancing algorithms for 1D partitioning. J. Parallel Distributed Computing, 64(8):974–996, 2004.

[15] F.A. Rabhi and S. Gorlatch. Patterns and Skeletons for Parallel and Distributed Computing. Springer Verlag, 2002.

[16] T. Saif and M. Parashar. Understanding the behavior and performance of non-blocking communications in MPI. In Proceedings of Euro-Par 2004: Parallel Processing, LNCS 3149, pages 173–182. Springer, 2004.

[17] B. A. Shirazi, A. R. Hurson, and K. M. Kavi. Scheduling and load balancing in parallel and distributed systems. IEEE Computer Science Press, 1995.

[18] M. Spencer, R. Ferreira, M. Beynon, T. Kurc, U. Catalyurek, A. Sussman, and J. Saltz. Executing multiple pipelined data analysis operations in the grid. In 2002 ACM/IEEE Supercomputing Conference. ACM Press, 2002.

[19] Jaspal Subhlok and Gary Vondran. Optimal mapping of sequences of data parallel tasks. In Proc. 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP'95, pages 134–143. ACM Press, 1995.

[20] Jaspal Subhlok and Gary Vondran. Optimal latency-throughput tradeoffs for data parallel pipelines. In ACM Symposium on Parallel Algorithms and Architectures SPAA'96, pages 62–71. ACM Press, 1996.

[21] K. Taura and A. A. Chien. A heuristic algorithm for mapping communicating tasks on heterogeneous resources. In Heterogeneous Computing Workshop, pages 102–115. IEEE Computer Society Press, 2000.

[22] N. Vydyanathan, U. Catalyurek, T. Kurc, P. Saddayappan, and J. Saltz. An approach for optimizing latency under throughput constraints for application workflows on clusters.
Research Report OSU-CISRC-1/07-TR03, Ohio State University, Columbus, OH, January 2007. Available at ftp://ftp.cse.ohio-state.edu/pub/tech-report/2007.