A Bit-Compatible Shared Memory Parallelization for ILU(k) Preconditioning and a Bit-Compatible Generalization to Distributed Memory


Authors: Xin Dong, Gene Cooperman

Xin Dong and Gene Cooperman
College of Computer Science, Northeastern University, Boston, MA 02115, USA
{xindong, gene}@ccs.neu.edu

Abstract. ILU(k) is a commonly used preconditioner for iterative linear solvers for sparse, non-symmetric systems. It is often preferred for the sake of its stability. We present TPILU(k), the first efficiently parallelized ILU(k) preconditioner that maintains this important stability property. Even better, TPILU(k) preconditioning produces an answer that is bit-compatible with the sequential ILU(k) preconditioning. In terms of performance, the TPILU(k) preconditioning is shown to run faster whenever more cores are made available to it, while continuing to be as stable as sequential ILU(k). This is in contrast to some competing methods that may become unstable if the degree of thread parallelism is raised too far. Where Block Jacobi ILU(k) fails in an application, it can be replaced by TPILU(k) in order to maintain good performance while also achieving full stability. As a further optimization, TPILU(k) offers an optional level-based incomplete inverse method as a fast approximation for the original ILU(k) preconditioned matrix. Although this enhancement is not bit-compatible with classical ILU(k), it is bit-compatible with the output from the single-threaded version of the same algorithm. In experiments on a 16-core computer, the enhanced TPILU(k)-based iterative linear solver performed up to 9 times faster. As we approach an era of many-core computing, the ability to efficiently take advantage of many cores will become ever more important. TPILU(k) also demonstrates good performance on clusters and Grids.
For example, the new algorithm achieves a 50 times speedup with 80 nodes for general sparse matrices of dimension 160,000 that are diagonally dominant.

Keywords: ILU(k), bit-compatible parallelization, preconditioning, Gaussian elimination, task-oriented parallelism

This work was partially supported by the National Science Foundation under Grants CCF 09-16133 and CNS-06-19616.

1 Introduction

This work introduces a parallel preconditioner, TPILU(k), with good stability and performance across a range of sparse, non-symmetric linear systems. For a large sparse linear system Ax = b, parallel iterative solvers based on ILU(k) [1, 2] often suffer from instability or performance degradation. In particular, most of today's commonly used algorithms are domain decomposition preconditioners, which become slow or unstable with greater parallelism. This happens as they attempt to approximate a linear system by more and smaller subdomains in order to provide parallel work for an increasing number of threads. The restriction to subdomains of ever smaller dimension must either ignore more of the off-diagonal matrix elements, or must raise the complexity by including the off-diagonals in the computation for an optimal decomposition. The former tends to create instability for large numbers of threads (i.e., for small subdomains), and the latter is slow.

Consider the parallel preconditioner PILU [3, 4] as an example. PILU experiences performance degradation unless the matrix A is well-partitionable into subdomains. This condition is violated by linear systems generating many fill-ins (as occurs with higher initial density or higher level k) or by linear solvers employing many threads. Another parallel preconditioner, BJILU [5] (Block Jacobi ILU(k)), fails to converge as the number of threads grows.
This is especially true for linear systems that are not diagonally dominant, for which the solver might become invalid by ignoring significant off-diagonal entries. This kind of performance degradation or instability is inconsistent with the widespread acceptance of parallel ILU(k) for varying k to provide efficient preconditioners.

In contrast, TPILU(k) is as stable as sequential ILU(k), and its performance increases with the number of cores. TPILU(k) can capture both properties simultaneously, precisely because it is not based on domain decomposition. In the rest of this paper, we will simply write that TPILU(k) is stable as a shortened version of the statement that TPILU(k) is stable for any number of threads whenever sequential ILU(k) is stable.

TPILU(k) uses a task-oriented parallel ILU(k) preconditioner as the base algorithm. However, it optionally first tries a different, level-based incomplete inverse submethod (TPIILU(k)). The term level-based incomplete inverse is used to distinguish it from previous methods such as "threshold-based" incomplete inverses [6]. The level-based submethod either succeeds or else fails to converge. If it does not converge fast, TPILU(k) quickly reverts to the stable, base task-oriented parallel ILU(k) algorithm.

A central point of novelty of this work concerns bit-compatibility. The base task-oriented parallel component of TPILU(k) is bit-compatible with classical sequential ILU(k), and the level-based optimization produces a new algorithm that is also bit-compatible with the single-threaded version of that same algorithm. Few numerical parallel implementations can guarantee this stringent standard. The order of operations is precisely maintained so that the low-order bits due to round-off do not change under parallelization.
Further, the output remains bit-compatible as the number of threads increases, thus eliminating worries about whether scaling a computation will bring increased round-off error.

In practice, bit-compatible algorithms are well received in the workplace. A new bit-compatible version of code may be substituted with little discussion. In contrast, new versions of code that result in output with modified low-order bits must be validated by a numerical analyst. New versions of code that claim to produce more accurate output must be validated by a domain expert.

A prerequisite for an efficient implementation in this work was the use of thread-private memory allocation arenas. The implementation derives from [7], where we first noted the issue. The essence of the issue is that any implementation of POSIX-standard "malloc" libraries must be prepared for the case that a second thread frees memory originally allocated by a first thread. This requires a centralized data structure, which is slow on many-core architectures. Where it is known that memory allocated by a thread will be freed by that same thread, one can use a thread-private (per-thread) memory allocation arena. The issue arises in the memory allocations for "fill-ins" during symbolic factorization. In LU-factorization-based algorithms, the issue is still more serious than in incomplete LU, since symbolic factorization is a relatively larger part of the overall algorithm.

TPILU(k) is generalized to computer clusters, supporting a hybrid memory model that uses multiple computers, each with multiple cores. This helps to further improve the performance of the TPILU(k) algorithm by aggregating cores from many computers when the computation is highly intensive.
A pipeline communication model is employed for efficient local network usage, which successfully overlaps communication and computation provided that the number of computers is not huge.

The rest of this paper is organized as follows. Section 2 reviews LU factorization and the sequential ILU(k) algorithm. Section 3 presents task-oriented parallel TPILU(k), including the base algorithm (Sections 3.1 through 3.3) and the level-based incomplete inverse submethod (Section 3.5). Section 4 analyzes the experimental results. We review related work in Section 5.

2 Review of the Sequential ILU(k) Algorithm

A brief sketch is provided here. See [8] for a detailed review of ILU(k). LU factorization decomposes a matrix A into the product of a lower triangular matrix L and an upper triangular matrix U. From L and U, one efficiently computes A^{-1} as U^{-1} L^{-1}. While the computation of L and U requires O(n^3) steps, once it is done, the computation of the inverses of the triangular matrices proceeds in O(n^2) steps. For sparse matrices, one contents oneself with solving for x in Ax = b for vectors x and b, since A^{-1}, L and U would all be hopelessly dense. Iterative solvers are often used for this purpose. An ILU(k) algorithm finds sparse approximations L̃ ≈ L and Ũ ≈ U. The preconditioned iterative solver then implicitly solves A Ũ^{-1} L̃^{-1}, which is close to the identity. For this purpose, triangular solve operations are integrated into each iteration to obtain a solution y such that

  L̃ Ũ y = p    (1)

where p varies from iteration to iteration. This yields faster convergence and better numerical stability. Here, the level limit k controls how many elements are computed in the process of incomplete LU factorization. A level limit of k = ∞ yields full LU factorization. Similarly to LU factorization, ILU(k) factorization can be implemented by the same procedure as Gaussian elimination.
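The two-stage triangular solve behind Equation (1) can be sketched as follows. This is a minimal dense Python sketch with 0-based indices; the function names are illustrative and not taken from the TPILU(k) implementation.

```python
def forward_substitution(L, p):
    """Solve L x = p for a unit lower-triangular L (implicit l_ii = 1)."""
    n = len(p)
    x = [0.0] * n
    for i in range(n):
        x[i] = p[i] - sum(L[i][t] * x[t] for t in range(i))
    return x

def backward_substitution(U, x):
    """Solve U y = x for an upper-triangular U."""
    n = len(x)
    y = [0.0] * n
    for i in reversed(range(n)):
        y[i] = (x[i] - sum(U[i][t] * y[t] for t in range(i + 1, n))) / U[i][i]
    return y

# One triangular solve of Equation (1): solve L U y = p in two stages.
L = [[1.0, 0.0],
     [0.5, 1.0]]
U = [[2.0, 1.0],
     [0.0, 3.0]]
p = [2.0, 4.0]
y = backward_substitution(U, forward_substitution(L, p))  # y = [0.5, 1.0]
```

Applied to the sparse factors L̃ and Ũ, this is exactly the per-iteration solve that the preconditioned iterative solver performs.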
Moreover, it also records the elements of a lower triangular matrix L̃. Because the diagonal elements of L̃ are defined to be 1, we do not need to store them. Therefore, a single filled matrix F is sufficient to store both L̃ and Ũ.

2.1 Terminology for ILU(k)

For a huge sparse matrix, a standard dense format would be wasteful. Instead, we store just the positions and the values of the non-zero elements. Similarly, incomplete LU factorization does not insert all elements that are generated in the process of factorization. Instead, it employs some mechanisms to control how many elements are stored. ILU(k) [1, 2] uses the level limit k as the parameter to implement a more flexible mechanism. We next review some definitions.

Definition 2.1: A fill entry, or entry for short, is an element stored in memory. (Elements that are not stored are called zero elements.)

Definition 2.2: Fill-in: Consider Figure 1a. If there exists h such that i, j > h and both f_ih and f_hj are fill entries, then the ILU(k) factorization algorithm may fill in a non-zero value when considering rows i and j. Hence, this element f_ij is called a fill-in, i.e., an entry candidate. We say the fill-in f_ij is caused by the existence of the two entries f_ih and f_hj. The entries f_ih and f_hj are the causative entries of f_ij. The causality will be made clearer in the next subsection.

Definition 2.3: Level: Each entry f_ij is associated with a level, denoted level(i, j) and defined recursively by

  level(i, j) = 0,  if a_ij ≠ 0
  level(i, j) = min_{1 ≤ h < min(i, j)} ( level(i, h) + level(h, j) + 1 ),  otherwise

The level limit k is used to control how many fill-ins are inserted into the filled matrix during ILU(k) factorization. Those fill-ins with a level smaller than or equal to k are inserted into the filled matrix F. Other fill-ins are ignored.
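Definition 2.3 can be turned into a small working sketch. The Python below uses 0-based indices and dense loops for clarity; a real implementation works on sparse row structures, and the helper name fill_levels is ours, not the paper's.

```python
INF = float("inf")

def fill_levels(A, k):
    """Return a dict {(i, j): level} of the entries kept by ILU(k):
    level 0 for structural non-zeros of A; a zero position (i, j) becomes a
    fill entry with level level(i,h) + level(h,j) + 1 minimized over its
    causative pairs, and is kept only if that level is <= k."""
    n = len(A)
    level = {(i, j): 0 for i in range(n) for j in range(n) if A[i][j] != 0}
    for h in range(n):                      # causative row/column, in order
        for i in range(h + 1, n):
            if (i, h) not in level:
                continue
            for j in range(h + 1, n):
                if (h, j) not in level:
                    continue
                w = level[(i, h)] + level[(h, j)] + 1
                if w <= min(level.get((i, j), INF), k):
                    level[(i, j)] = w
    return level

# Arrow-shaped 4x4 example: eliminating row/column 0 causes level-1 fill-ins
# such as position (1, 2); with k = 0 they are ignored, with k = 1 kept.
A = [[1, 1, 1, 1],
     [1, 1, 0, 0],
     [1, 0, 1, 0],
     [1, 0, 0, 1]]
print(fill_levels(A, 1)[(1, 2)])   # 1
print((1, 2) in fill_levels(A, 0)) # False
```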
By limiting fill-ins to level k or less, ILU(k) maintains a sparse filled matrix.

2.2 ILU(k) Algorithm and its Parallelization

For LU factorization, the defining equation A = LU is expanded into a_ij = Σ_{h=1}^{min(i,j)} l_ih u_hj, since l_ih = 0 for h > i and u_hj = 0 for h > j. When i > j, f_ij = l_ij and we can write a_ij = Σ_{h=1}^{j−1} l_ih u_hj + f_ij u_jj. When i ≤ j, f_ij = u_ij and we can write a_ij = ( Σ_{h=1}^{i−1} l_ih u_hj ) + l_ii f_ij = ( Σ_{h=1}^{i−1} l_ih u_hj ) + f_ij. Rewriting these yields the equations for LU factorization:

  f_ij = ( a_ij − Σ_{h=1}^{j−1} l_ih u_hj ) / u_jj,  if i > j
  f_ij = a_ij − Σ_{h=1}^{i−1} l_ih u_hj,             if i ≤ j    (2)

Fig. 1: Parallel Incomplete LU Factorization. (a) Causative relationship: in the incomplete case, the fill-in f_ij (row i, column j) is a candidate entry caused by its causative entries f_ih and f_hj. (b) View of the matrix as bands: four bands of 3 rows; after band 0 is reduced, the first block of 3 further bands is reduced in parallel.

The computation for incomplete LU factorization follows a definition similar to Equations (2), except that it skips zero elements. In our implementation, the matrix F is initialized to A and stored in row-major order prior to any computation. The computation can be re-organized to use the above equations in the forward direction. As each term l_ih u_hj for h < j is determined, it can immediately be subtracted from f_ij.
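With k = ∞ every fill-in is kept and Equations (2) compute an exact LU factorization. The following sketch (plain Python, 0-based indices, dense storage; illustrative only) applies Equations (2) row by row, in the forward direction:

```python
def lu_filled_matrix(A):
    """Compute the filled matrix F of Equations (2) for k = infinity:
    the strict lower triangle of F holds l_ij, the upper triangle and
    diagonal hold u_ij; the unit diagonal of L is implicit."""
    n = len(A)
    F = [row[:] for row in A]               # F is initialized to A
    for i in range(n):
        for j in range(n):
            # subtract the accumulated products l_ih * u_hj, h < min(i, j)
            s = sum(F[i][h] * F[h][j] for h in range(min(i, j)))
            F[i][j] = (F[i][j] - s) / F[j][j] if i > j else F[i][j] - s
    return F

# 2x2 check: A = L U with L = [[1, 0], [1.5, 1]] and U = [[4, 3], [0, -1.5]].
F = lu_filled_matrix([[4.0, 3.0],
                      [6.0, 3.0]])
print(F)  # [[4.0, 3.0], [1.5, -1.5]]
```

The incomplete variant differs only in skipping the positions that symbolic factorization did not admit into F.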
Just as all of the remaining rows can be reduced simultaneously by the first row in Gaussian elimination, a row-major order for ILU(k) factorization leads to a natural parallel algorithm. Following the defining equations, the ILU(k) algorithm maintains two rows in memory: row h and row i, where h < i. Row h is used to partially reduce row i. For each possible j, the product l_ih u_hj is used to reduce the entry f_ij. Once we have accumulated all products l_ih u_hj for h < min(i, j), we are done.

ILU(k) is separated into two passes: symbolic factorization (Phase I), which computes the levels and inserts all fill-ins with level less than or equal to the level limit k into the filled matrix; and numeric factorization (Phase II), which computes the values of all entries in the filled matrix. Both passes follow a procedure similar to that described above. Algorithm 1 illustrates the symbolic factorization phase. It determines, for each row j, the set of permitted entries, permitted(j). These are the entries for which the computed entry level, or weight, is less than or equal to k. Numeric factorization is simpler, but similar in spirit to the row-merge update pass of Algorithm 1. Lines 14 through 17 control the entries to be updated, and the update in line 19 is replaced by an update of the numeric value. The details are omitted.

Algorithm 1 Symbolic factorization: Phase I of ILU(k) preconditioning
1: //Calculate levels and permitted entry positions
2: //Loop over rows
3: for j = 1 to n do
4:   //Initialization: admit entries in A, and assign them level zero.
5:   permitted(j) ← empty set //permitted entries in row j
6:   for t = 1 to n do //non-zero entries in row j
7:     if A_{j,t} ≠ 0 then
8:       level(j, t) ← 0
9:       insert t into permitted(j)
10:    end if
11:  end for
12: end for
13: //Row-merge update pass
14: for each unprocessed i ∈ permitted(j) with i < j, in ascending order do
15:   for t ∈ permitted(i) with t > i do
16:     weight = level(j, i) + level(i, t) + 1
17:     if t ∈ permitted(j) then
18:       //already non-zero in F_{j,t}
19:       level(j, t) ← min{ level(j, t), weight }
20:     else
21:       //zero in F_{j,t}
22:       if weight ≤ k then //level control
23:         insert t into permitted(j)
24:         level(j, t) ← weight
25:       end if
26:     end if
27:   end for
28: end for
29: return permitted

The algorithm has some of the same spirit as Gaussian elimination, if one thinks of ILU(k) as using the earlier row h to reduce the later row i. This is the crucial insight in the parallel ILU(k) algorithm of this paper. One splits the rows of F into bands, and reduces the rows of a later band by the rows of an earlier band. Distinct threads can reduce distinct bands simultaneously, as illustrated in Figure 1b.

3 TPILU(k): Task-oriented Parallel ILU(k) Algorithm

3.1 Parallel Tasks and Static Load Balancing

We introduce the following definition to describe a general parallel model, which is valid for Gaussian elimination as well as for ILU(k) and ILUT [5].

Definition 3.1: The frontier is the maximum number of rows that are currently factored completely.

According to this definition, the frontier i is the limit up to which the remaining rows can be partially factored, except for the (i + 1)-th row. The (i + 1)-th row can be factored completely. That changes the frontier to i + 1. Threads synchronize on the frontier.
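The band-and-frontier dependency structure just described can be written out as a toy task list. This is our own illustration in plain Python (the function name band_tasks is not from the paper); the real TPILU(k) assigns such tasks to threads dynamically.

```python
def band_tasks(num_bands):
    """List tasks (band, against) in a valid order: once band `against` is
    completely factored (the frontier has passed it), every later band can
    be partially reduced by it, and those tasks are mutually independent."""
    tasks = []
    for against in range(num_bands):       # frontier advances band by band
        for band in range(against + 1, num_bands):
            tasks.append((band, against))  # reduce `band` using `against`
    return tasks

# With 3 bands: bands 1 and 2 are reduced by band 0 in parallel; once band 1
# completes, band 2 is reduced by band 1 and then completes as well.
print(band_tasks(3))  # [(1, 0), (2, 0), (2, 1)]
```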
To balance and overlap computation and synchronization, the matrix is organized as bands so as to make the granularity of the computation adjustable, as demonstrated in Figure 1b. A task is associated with a band and is defined as the computation that partially factors the band up to the current frontier. For each band, the program must remember up to which column the band has been partially factored. We call this column the current position; it is the starting point of factorization for the next task attached to this band. In addition, it is important to use a variable to remember the first band that has not been factored completely. After the first unfinished band is completely factored, the global frontier value is increased by the number of rows in the band. This completely factored band is broadcast to all machines in the distributed-memory case, or shared by all threads in the multi-core case.

The smaller the band size, the larger the number of synchronization points. Nevertheless, TPILU(k) prefers a smaller band size, which leads to more parallel tasks. Moreover, the lower bound on the factorization time is the time to factor the last band, which should not be very large. Luckily, shared memory allows for a smaller band size, because the synchronization there is just a read/write of the frontier, which has a small cost.

While the strategy of bands is well known to be efficient for dense matrices (e.g., see [9]), researchers hesitate to use this strategy for sparse matrices, because they may find only a small number of relatively dense bands, while all other bands are close to trivial. The TPILU(k) algorithm works well on sparse matrices because successive factoring of bands produces many somewhat dense bands (with more fill-ins) near the end of the matrix.
TPILU(k) uses static load balancing, whereby each worker is assigned a fixed group of bands chosen round robin, so that each thread is also responsible for some of the denser bands.

3.2 Communication in the Distributed Memory Case

Only the completely factored bands are useful for the factorization of other bands. The intermediate results are not truly needed by other "workers". This observation, together with static load balancing, helps to decrease the communication overhead to a minimum by sending an update message only for each completely factored band.

The static load balancing also produces a more regular communication pattern, which fits well with the pipelining communication of the next section. A further virtue of this strategy is that it uses a fixed number of message buffers and a fixed buffer size. This avoids dynamically allocating memory for message handling. Under the strategy of static load balancing, the computations on all processors are coordinated so as to guarantee that no processor sends two update messages simultaneously. In other words, a processor must finish broadcasting an update message before it factors another band completely.

Pipeline Communication for Efficient Local Network Usage. Although the bands can be factored simultaneously, their completion follows a strict top-down order. When one band is completely factored, it is best for the node that holds the first unfinished band to receive the result first. This strategy is implemented by a "pipeline" model. Following this model, all nodes are organized to form a directed ring. A message is transferred along the directed edges. Every node sends the message to its unique successor until each node has received a copy of the message. After forwarding the message, each node uses it to update its own data in memory. Figure 2 shows that this model achieves the aggregate bandwidth.
In this figure, the horizontal axis is the time step, while the vertical axis is the sender ID (the rank number of each node). Note that at most time steps several nodes are participating, each either sending a message to its successor or receiving a message from its predecessor.

Fig. 2: "Pipeline" model. The horizontal axis is the time step. The vertical axis is the sender ID. The lines represent when the algorithm sends a message; the time step and the sender ID of the source are indicated. The receiver is always the successor of the source. Each message is marked by the corresponding band number. Only the first several messages are shown.

Algorithm 2 describes the implementation that overlaps the computation with the communication based on the "pipeline" model. Both symbolic factorization and numeric factorization can use this model, except for the case of k = 1, whose symbolic factorization reduces to trivial parallelism as discussed in Section 3.3.
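The ring broadcast of the pipeline model can be simulated step by step. The bookkeeping below is our own illustration in Python, not the paper's implementation: one band is forwarded one hop per time step until every node holds a copy.

```python
def ring_broadcast_schedule(num_nodes, owner, start_step):
    """Events (step, sender, receiver) for broadcasting one factored band
    along the directed ring 0 -> 1 -> ... -> num_nodes-1 -> 0, one hop per
    time step, starting when the owner finishes the band."""
    events = []
    sender = owner
    for hop in range(num_nodes - 1):       # every node except the owner
        receiver = (sender + 1) % num_nodes
        events.append((start_step + hop, sender, receiver))
        sender = receiver
    return events

# Two bands finishing one step apart pipeline behind each other: at step 1,
# node 1 forwards the first band while node 0 already sends the second one.
print(ring_broadcast_schedule(4, 0, 0))  # [(0, 0, 1), (1, 1, 2), (2, 2, 3)]
print(ring_broadcast_schedule(4, 0, 1))  # [(1, 0, 1), (2, 1, 2), (3, 2, 3)]
```

Since distinct hops of distinct bands use distinct links of the ring, the broadcasts overlap, which is the aggregate-bandwidth effect shown in Figure 2.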
Algorithm 2 Parallel ILU(k) algorithm with the "pipeline" model
1: receive from predecessor //non-blocking receive
2: //Loop until all bands are factored completely
3: while firstUnfactoredBand < numberOfBands do
4:   get new task (band ID) from the "master" to work on
5:   if there was a band to work on then
6:     doTask(band) //factor band using all previous bands
7:     if band is not factored completely then
8:       //not factored completely, then non-blocking test
9:       try to receive a message for some band
10:      if a newly factored band is received then
11:        send band to successor //non-blocking send
12:        update our copy of the newly factored band
13:        continue to receive and update until our band is completely factored
14:      end if
15:    else
16:      send our factored band to successor //non-blocking send
17:    end if
18:  else
19:    wait until a new band is available, while in the background continuing to receive other factored bands from our predecessor, updating our copy, and sending the factored bands to our successor
20:  end if
21: end while

3.3 Optimized Symbolic Factorization

Static Load Balancing and TPMalloc. Simultaneous memory allocation for fill-ins is a performance bottleneck for shared-memory parallel computing. TPILU(k) takes advantage of a thread-private malloc library to solve this issue, as discussed in [7]. TPMalloc is a non-standard extension of a standard allocator implementation, which associates a thread-private memory allocation arena with each thread. A thread-local global variable is also provided, so that the modified behavior can be turned on or off on a per-thread basis. By default, threads use thread-private memory allocation arenas. The static load balancing strategy guarantees that if a thread allocates memory, then the same thread will free it, which is consistent with the use of a thread-private allocation arena.

Optimization for the Case k = 1.
When k = 1, it is possible to symbolically factor the bands, and the rows within each band, in any desired order. This is because if either f_ih or f_hj is an entry of level 1, the resulting fill-in f_ij must be an element of level 2 or level 3, so f_ij is not inserted into the filled matrix F. As a first observation, the symbolic factorization now becomes pleasingly parallel, since the processing of each band is independent of that of any other band. Second, since the order can be arbitrary, even the purely sequential processing within one band by a single thread can be made more efficient. Processing rows in reverse order, from last to first, is the most efficient, while the more natural first-to-last order is the least efficient. First-to-last is inefficient because we add level 1 fill-ins to the sparse representation of earlier rows, and we must then skip over those earlier level 1 fill-ins when determining the level 1 fill-ins of later rows. Processing from last to first avoids this inefficiency.

3.4 Other Optimizations

Optimization on Clusters with Multi-core Nodes. A hybrid memory model using multiple computers, each with multiple cores, helps to further improve the performance of the TPILU(k) algorithm. On each node, we start several threads as "workers" and one particular thread as a "communicator" to handle messages between nodes. This design leads to better performance than having each worker thread communicate directly with remote threads, because the "MPI_THREAD_MULTIPLE" option of MPI can degrade performance.

Optimization for Efficient Matrix Storage. The compressed sparse row (CSR) format is used in the iteration phase due to its efficiency for arithmetic operations. However, the CSR format does not support enlarging the memory space for several rows simultaneously.
Therefore, TPILU(k) initializes the matrix F in row-major format during the symbolic factorization phase. After the matrix non-zero pattern has been determined by symbolic factorization, TPILU(k) converts the output matrix F from row-major format back to CSR format. The format transformations happen during the factorization phase at a negligible cost.

3.5 Optional Level-Based Incomplete Inverse Method

The goal of this section is to describe the level-based incomplete inverse method for solving L̃x = p by matrix-vector multiplication: x = (L̃⁻¹)~ p, where (L̃⁻¹)~ denotes the incomplete inverse of L̃. This avoids the sequential bottleneck of using forward substitution on L̃x = p. We produce incomplete inverses (L̃⁻¹)~ and (Ũ⁻¹)~ so that the triangular solve stage of the linear solver (i.e., solving for y in L̃Ũy = p, as described in Equation (1) of Section 2) can be trivially parallelized, y = (Ũ⁻¹)~ (L̃⁻¹)~ p, while also enforcing bit-compatibility. Although the details are omitted here, the same ideas are then used in a second stage: using the solution x to solve for y in Ũy = x.

Below, denote the lower triangular matrix L̃⁻¹ by (−β_it)_{t≤i}. Recall that β_ii = 1, just as for L̃. First, we have Equation (3a), i.e., x = L̃⁻¹ p. Second, we have Equation (3b), i.e., the equation for solving L̃x = p by forward substitution. Obviously, Equation (3a) and Equation (3b) define the same x:

  x_i = p_i − Σ_{t<i} β_it p_t    (3a)
  x_i = p_i − Σ_{t<i} l_it x_t    (3b)

The inverse computation is carried out by in-place updates on the filled matrix F:

  f_ij ← f_ij − f_ih f_hj            (6b)
  ∀ t < h:  f_it ← f_it − f_ih f_ht  (6c)
  ∀ t < i:  f_it ← −f_it             (6d)

The matrix L̃⁻¹ is in danger of becoming dense. To maintain sparsity, we compute the level-based incomplete inverse matrix (L̃⁻¹)~ following the same non-zero pattern as L̃. The computation for (L̃⁻¹)~ can be combined with the original numeric factorization phase.
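To see why a precomputed inverse removes the sequential bottleneck, compare forward substitution with a matrix-vector product against the exact inverse of the unit lower-triangular factor. This is a dense Python sketch of ours; the level-based method additionally restricts the inverse to a sparse non-zero pattern, which this illustration omits.

```python
def unit_lower_inverse(L):
    """Exact inverse of a unit lower-triangular matrix, computed row by row:
    inv[i][t] = -sum_{t <= h < i} L[i][h] * inv[h][t] for t < i."""
    n = len(L)
    inv = [[1.0 if i == t else 0.0 for t in range(n)] for i in range(n)]
    for i in range(n):
        for t in range(i):
            inv[i][t] = -sum(L[i][h] * inv[h][t] for h in range(t, i))
    return inv

def matvec(M, p):
    # Each output row is an independent inner product, so the rows can be
    # computed in parallel without changing any row's order of operations,
    # which preserves bit-compatibility.
    return [sum(M[i][t] * p[t] for t in range(len(p))) for i in range(len(p))]

L = [[1.0, 0.0],
     [0.5, 1.0]]
p = [2.0, 4.0]
x = matvec(unit_lower_inverse(L), p)
print(x)  # [2.0, 3.0], identical to forward substitution on L x = p
```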
A further factorization phase is added to compute (Ũ⁻¹)~ by computing matrix entries in reverse order, from the last row to the first, and from right to left within a given row. Given the above algorithm for (L̃⁻¹)~ and a similar algorithm for (Ũ⁻¹)~, the triangular solve stage is reduced to matrix-vector multiplication, which can be trivially parallelized. Inner product operations are not parallelized, for two reasons: first, even when sequential, they are fast; second, parallelization of inner products would violate bit-compatibility by changing the order of operations.

4 Experimental Results

We evaluate the performance of the bit-compatible parallel ILU(k) algorithm, TPILU(k), by comparing it with two commonly used parallel preconditioners, PILU [3] and BJILU [5] (Block Jacobi ILU(k)). Both PILU and BJILU are based on domain decomposition. Under the framework of Euclid [10, Section 6.12], both preconditioners appear in Hypre [10], a popular linear solver package under development at Lawrence Livermore National Laboratory since 2001.

The primary test platform is a computer with four Intel Xeon E5520 quad-core CPUs (16 cores in total). Figure 4 demonstrates the scalability of TPILU(k) both on this primary platform and on a cluster of two nodes connected by InfiniBand, where each node has a single quad-core AMD Opteron 2378 CPU. The operating system is CentOS 5.3 (Linux 2.6.18) and the compiler is gcc-4.1.2 with the "-O2" option. The MPI library is OpenMPI 1.4. Within Hypre, the same choice of iterative solver is used to test both Euclid (PILU and BJILU) and TPILU(k). The chosen iterative solver is preconditioned stabilized bi-conjugate gradients with the default tolerance rtol = 10⁻⁸. Note that the Euclid framework employs multiple MPI processes communicating via MPI's shared-memory architecture, instead of directly implementing a single multi-threaded process.
4.1 Driven Cavity Problem

This set of test cases [11] consists of some difficult problems from the modeling of the incompressible Navier-Stokes equations. These test cases are considered here for the sake of comparability; they had previously been chosen to demonstrate the features of PILU in [4]. Here, we test on three representatives: e20r3000, e30r3000 and e40r3000.

Figure 3 shows that both Euclid PILU and Euclid BJILU are influenced by the number of processes and the level k when solving driven cavity problems. With more processes or larger k, both the PILU and BJILU preconditioners tend to slow down, break down or diverge.

Fig. 3: Euclid PILU and BJILU for the Driven Cavity Problem using a Single AMD Opteron (4 Cores). (a) e20r3000 (matrix dimension 4,241; 131,556 non-zeros); (b) e30r3000 (dimension 9,661; 306,356 non-zeros); (c) e40r3000 (dimension 17,281; 553,956 non-zeros). "X" means failure, and the time is arbitrarily shown as an interpolated value or the same as for the preceding number of threads.

Note that in Figure 3(a), PILU(k) actually breaks down for 3 threads, while then succeeding for 4 threads. Euclid registers its best solution time for e20r3000 by using PILU(2) with 1 process, for e30r3000 by using BJILU with 2 processes, and for e40r3000 by using PILU(1) with 2 processes.
The reason that Euclid PILU obtains only a small speedup for these problems is that PILU requires the matrix to be well-partitionable, a condition that is violated when using a larger level k or when employing more processes. Similarly, Euclid BJILU must approximate the original matrix by a number of subdomains equal to the number of processes. Therefore, higher parallelism forces BJILU to ignore even more off-diagonal matrix entries, with more blocks of smaller block dimension, and eventually the BJILU computation simply breaks down.

In contrast, TPILU(k) is bit-compatible. Greater parallelization only accelerates the computation, while never introducing instabilities or other negative side effects. Figure 4a illustrates that for the e20r3000 case, TPILU with level k = 4 and 4 threads leads to better performance (0.55 s) than Euclid's 0.78 s (Figure 3a). For the e30r3000 case, TPILU(k) finishes in 1.16 s (Figure 4b), as compared to 1.47 s for BJILU and 1.64 s for PILU (Figure 3b). For the e40r3000 case, TPILU(k) with k = 3 finishes in 2.14 s (Figure 4c), as compared to 3.15 s for PILU and 3.52 s for BJILU (Figure 3c).

[Figure 4: TPILU(k) for the Driven Cavity Problem using 2 AMD Opteron (2 × 4 cores). Panels: (a) e20r3000 with TPILU(4); (b) e30r3000 with TPILU(3); (c) e40r3000 with TPILU(3). Each panel plots total, symbolic and numeric factorization times, for cores on 1 node and on 2 nodes. The experimental runs for 1, 2, 3, 4 threads are all for a 4-core shared-memory CPU. The experimental runs for 2, 4, 6 threads are all for two nodes with 4 cores per node, while an additional thread per node is reserved for communication between nodes in order to replicate bands.]

Figure 4c demonstrates the potential of TPILU(k) for further performance improvements when a hybrid architecture is used to provide additional cores: the hybrid architecture with 6 CPU cores over two nodes connected by InfiniBand is even better (2.14 s) than the shared-memory model with a single quad-core CPU (2.20 s).

4.2 3D 27-point Central Differencing

As pointed out in [4], ILU(k) preconditioning is amenable to performance analysis, since the non-zero patterns of the resulting ILU(k) preconditioned matrices are identical for any partial differential equation (PDE) that has been discretized on a grid with a given stencil. However, a parallelization based on domain decomposition may eradicate this feature, since it generally relies on re-ordering to maximize the independence among subdomains. The re-ordering is required for domain decomposition, since it would otherwise face a higher cost dominated by the resulting denser matrix. As Figure 5a shows, Euclid PILU degrades with more processes when solving a linear system generated by 3D 27-point central differencing for Poisson's equation. The performance degradation also increases rapidly as the level k grows. This performance degradation is not an accident. The domain-decomposition computation dominates when the number of non-zeros per row is larger (about 27 in this case). Therefore, the sequential algorithm with the level k = 0 wins over the parallelized PILU in the contest for the best solution time.
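The figure of about 27 non-zeros per row comes straight from the stencil geometry. As a self-contained illustration (our own sketch, not the paper's code), the sparsity pattern produced by a 27-point stencil on an n × n × n grid can be generated as follows:

```python
def stencil27_pattern(n):
    """Sparsity pattern of a 3D 27-point stencil on an n x n x n grid:
    node (x, y, z) couples to every grid neighbor within one step
    along each axis (including itself), giving up to 27 per row."""
    def idx(x, y, z):
        return (x * n + y) * n + z
    pattern = {}
    for x in range(n):
        for y in range(n):
            for z in range(n):
                cols = [idx(x + dx, y + dy, z + dz)
                        for dx in (-1, 0, 1)
                        for dy in (-1, 0, 1)
                        for dz in (-1, 0, 1)
                        if 0 <= x + dx < n
                        and 0 <= y + dy < n
                        and 0 <= z + dz < n]
                pattern[idx(x, y, z)] = sorted(cols)
    return pattern

# An interior node has all 27 neighbors; a corner node has only 8.
p = stencil27_pattern(4)
assert len(p[(1 * 4 + 1) * 4 + 1]) == 27   # interior node (1,1,1)
assert len(p[0]) == 8                      # corner node (0,0,0)
```

Because this pattern depends only on the stencil and the grid, the ILU(k) fill pattern, and hence the preconditioning cost, is the same for any PDE discretized this way, which is what makes the performance analysis above possible.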
This observation holds true for all grid sizes tested, from 50 × 50 × 50 to 90 × 90 × 90. In contrast, for all of these test cases, TPIILU (the level-based incomplete inverse submethod of TPILU(k)) leads to improved performance using 4 cores, as seen in Figure 5b.

[Figure 5: Solving the linear system from 3D 27-point central differencing on a grid, using a single AMD quad-core Opteron. Panel (a): Euclid PILU for a system from a 40 × 40 × 40 grid (total and preconditioning times for levels 0 and 1). Panel (b): comparison of Euclid PILU and TPIILU(0) for grids from 40 × 40 × 40 through 90 × 90 × 90; lines with arrows indicate the respective PILU optima. Focusing on the algorithm only, the comparison ignores reusing the domain decomposition over multiple linear-system solutions.]

4.3 Model for DNA Electrophoresis: cage15

The cage model of DNA electrophoresis [12] describes the drift, induced by a constant electric field, of homogeneously charged polymers through a gel. We test on the largest case in this problem set: cage15. For cage15, TPIILU(0) obtains a speedup of 2.93 using 8 threads (Figure 6a). The ratio of the number of FLoating point arithmetic OPerations (FLOPs) to the number of non-zero entries is less than 5. This implies that ILU(k) preconditioning just passes through matrices with few FLOPs. In other words, the computation is too "easy" to be sped up further.

4.4 Computational Fluid Dynamics Problem: ns3Da

The problem ns3Da [12] is used as a test case in FEMLAB, developed by Comsol, Inc. Because there are zero diagonal elements in the matrix, we use TPIILU with level k = 1 as the preconditioner.
Figure 6b shows a speedup of 8.93 with 16 threads, since the preconditioning is floating-point intensive.

4.5 TPMalloc Performance

For a large level k, the symbolic factorization time will dominate. To squeeze greater performance from this first phase, glibc's standard malloc is replaced by a thread-private malloc (TPMalloc).

[Figure 6: TPIILU(k)/TPILU(k) using 4 Intel Xeon E5520 (4 × 4 cores) with a thread-private malloc (TPMalloc). Panel (a): TPIILU(0) for cage15 (matrix dimension 5,154,859; 99,199,551 non-zeros; 476,940,712 preconditioning floating-point arithmetic operations). Panel (b): TPIILU(1) for ns3Da (speedup of 8.93 with 16 cores). Panel (c): TPMalloc performance for TPILU(3) on e40r3000, comparing symbolic factorization with a standard allocator against the thread-private malloc library; numeric factorization is independent of symbolic factorization.]

Figure 6c demonstrates that the improvement provided by TPMalloc is significant whenever the number of cores is greater than 2.

4.6 Results for Distributed Memory

ns3Da for distributed memory. We also test the ns3Da case on a cluster consisting of 33 nodes. Each node of this cluster has 4 CPU cores (dual-processor, dual-core) on 2.0 GHz Intel Xeon EM64T processors, with either 8 GB or 16 GB of memory per node. The nodes are connected by a Gigabit Ethernet network. Each node is configured with Linux 2.6.9, gcc 3.4.5 and MPICH2 1.0 as the MPI library. On this cluster, TPILU(k) obtains a speedup of 7.22 times (Table 1) for distributed memory and 7.82 times (Table 2) for hybrid memory.
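As a reading aid for the tables that follow: the Speedup column is computed on total time, preconditioning plus iterative solve, relative to the single-process run. A minimal check of that arithmetic (our own sketch, not the authors' code):

```python
def total_speedup(pc1, solve1, pcp, solvep):
    """Speedup on total time (preconditioning + iterative solve)
    relative to the 1-process baseline."""
    return (pc1 + solve1) / (pcp + solvep)

# Reproduce the 16-process row of Table 1 for ns3Da:
s = total_speedup(82.26, 6.39, 9.19, 3.08)
assert round(s, 2) == 7.22
```

The same formula reproduces every Speedup entry in Tables 1 and 2 from the PC-time and solve-time columns.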
  Procs   PC time (s)   # Iters   Time (s)   Speedup
   1      82.26         34        6.39       1
   2      42.57         34        5.27       1.85
   4      22.90         34        4.00       3.30
   8      13.69         34        3.30       5.22
  16       9.19         34        3.08       7.22

Table 1: TPILU(k) for ns3Da with 16 nodes on a cluster.

  Procs   Threads   PC time (s)   # Iters   Time (s)   Speedup
   1       1        82.26         34        6.39       1
   2       6        15.28         34        2.83       4.90
   4      12        11.39         34        2.63       6.32
   8      24         8.59         34        2.75       7.82

Table 2: TPILU(k) for ns3Da with 8 nodes on a cluster, with 1 dedicated communication thread and 3 worker threads per process, and one process per node.

Further distributed-memory experiments based on matgen. This section highlights TPILU(k) on distributed memory for the factorization of sparse matrices generated by matgen [13]. A copy of the input matrix A is assumed to reside on each node of the cluster before the computation. After the computation, the result matrix F also resides on each node of the cluster. See [14] for a more detailed discussion concerning TPILU(k) on clusters.

Symbolic factorization versus numeric factorization. This set of experiments illustrates how the ratio of symbolic factorization to numeric factorization changes as k increases gradually. Figure 7 shows the computation time of four matrices, measured for both symbolic factorization and numeric factorization. The initial matrix densities are 0.073, 0.036, 0.009 and 0.002, respectively. In all cases, the ratio of symbolic factorization to numeric factorization does not decrease when the level k grows from 1 to 5. When k is large enough, the ratio goes beyond 1. Therefore, the time for symbolic factorization is almost the same as, or even a little greater than, that of numeric factorization if no entry is skipped in the first phase (turning off the optimization described in Section 3.4). Although the non-zero elements are inserted dynamically, and this results in fewer comparisons, the insertion of an entry is costly.
To insert an entry, we move the remaining entries in order to open enough space for the new fill entry.

Remark: The symbolic factorization for the case k = 1 is lightweight, due to the optimization mentioned in Section 3.4. Consider a matrix of dimension 20,000 and a level limit of 1 in one of our experiments. After symbolic factorization, the number of entries is 1,239,058, and all of them are involved in numeric factorization. On the other hand, only 265,563 level-0 elements cause a new entry in the symbolic factorization phase.

Parallel ILU(k) for larger k. For the situations k = 2 and k = 3, the speedup is good, as demonstrated by the experimental results in Figure 8. For the 24K matrix with initial density 0.00061, k = 3 creates 79,811,023 final entries. For the 30K matrix with initial density 0.00089, k = 2 results in 90,170,722 final entries. For both situations, we achieve nearly linear speedup through the largest example: 60 CPU cores. This result is reasonable, since raising k to 2 or 3 causes more fill-ins and a denser result matrix. The added floating-point arithmetic per task leads to a relatively larger computation, sufficient to overlap communication.

[Figure 7: Comparison of symbolic factorization and numeric factorization (sequential algorithm); level k = 1, 2, 3, 4 and 5. Panel 1: a 1K × 1K matrix; panel 2: a 2K × 2K matrix; panel 3: a 4K × 4K matrix; panel 4: an 8K × 8K matrix.]

[Figure 8: Computation for higher levels; the matrix densities are 0.00061 and 0.00089.]

Case k = 1: TPILU(1). This is the most commonly used case. The results of the sequential ILU(k) are collected in Table 3, while TPILU(1) is presented in Table 4. For k = 1, all causative entries are initial entries with level 0.
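For concreteness, the level rule that drives symbolic factorization, lev(i,j) = min(lev(i,j), lev(i,p) + lev(p,j) + 1) with initial entries at level 0, can be sketched as follows. This is an illustrative pure-Python version of the standard rule, not the TPILU(k) implementation, and it explains why only level-0 entries are causative when k = 1: a kept fill of level 1 can only arise as 0 + 0 + 1.

```python
def symbolic_ilu_k(rows, k):
    """rows[i]: iterable of column indices of initial non-zeros in row i.
    Returns lev, where lev[i][j] is the fill level of each kept entry
    (level <= k); initial entries have level 0."""
    n = len(rows)
    lev = [{j: 0 for j in r} for r in rows]
    for i in range(n):
        p = 0
        while True:
            # next pivot column in row i, left of the diagonal
            # (fill-ins created earlier in this row are included)
            pivots = [j for j in lev[i] if p <= j < i]
            if not pivots:
                break
            p = min(pivots)
            lip = lev[i][p]
            for j, lpj in list(lev[p].items()):
                if j > p:                        # update part of row p
                    new = lip + lpj + 1          # the ILU(k) level rule
                    if new <= k and new < lev[i].get(j, k + 1):
                        lev[i][j] = new
            p += 1
    return lev

# With k = 1, eliminating column 0 from row 1 creates fill (1,2) at level 1:
lev = symbolic_ilu_k([{0, 2}, {0, 1}, {1, 2}], k=1)
assert lev[1] == {0: 0, 1: 0, 2: 1}
# With k = 0, that fill entry is dropped:
assert 2 not in symbolic_ilu_k([{0, 2}, {0, 1}, {1, 2}], k=0)[1]
```

A dict per row sidesteps the insertion-shifting cost discussed above; the production code's packed-array layout pays that cost in exchange for cache-friendly numeric factorization.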
So in the first pass, all computation that enlarges the size of the filled matrix or updates the level involves merely initial entries. There is no communication in the first pass. The final number of entries determines the communication overhead of a single machine for the second pass. The final entry number, density and non-zero pattern determine the amount of computation in the second pass.

  n      #Initial entries   #Final entries   Time (s)
  40K     5,120,950         196,223,519      445.2 + 8938.2
  80K     6,960,983         195,202,037      234.0 + 4120.8
  160K    9,832,794         198,969,083      140.2 + 2112.7
  320K   14,090,553         206,489,590       93.4 + 1162.0

Table 3: Computation of the sequential algorithm; level k = 1. (The matrix densities are 0.003, 0.001, 0.00037 and 0.00013.)

  n      #CPU Cores   #Bands   Time (s)      Speedup
  40K    50           20480    9.5 + 217.4   41.4
  40K    60           20480    7.8 + 191.6   47.1
  80K    40           20480    6.7 + 142.9   29.1
  80K    60           20480    5.0 +  98.7   42.0
  160K   30           40960    4.7 + 101.8   21.2
  160K   60           40960    2.5 +  64.4   33.7
  320K   30           81920    3.3 +  89.8   13.5
  320K   40           81920    2.4 +  71.3   17.0
  320K   60           81920    1.6 +  59.0   20.7

Table 4: Computation of TPILU(k); level k = 1. (The matrix densities are 0.003, 0.001, 0.00037 and 0.00013.)

From Tables 3 and 4, one can see the following: for a matrix of dimension 40,000 with density 0.003 and a matrix of dimension 80,000 with density 0.001, TPILU(k) obtains a sub-linear speedup for 60 CPU cores. For a matrix of dimension 160,000 with density 0.00037 and a matrix of dimension 320,000 with density 0.00013, TPILU(k) achieves a maximum speedup at 60 CPU cores.

The symbolic factorization phase always obtains a linear speedup, because there is no communication overhead. However, the numeric factorization phase does not achieve a linear speedup in all cases. The decreasing speedup in the numeric factorization phase is the major influence on the total speedup. It is explained as follows. Suppose the matrix finally has n_f entries.
Following the "pipeline" model (Section 3.2), both the column number and the value of each entry in each band are sent to a unique child node once, except for the bands that are handled by the child node itself. Therefore, the communication overhead is about 8 n_f bytes per node. Considering the four matrices in the above experiments, the number of final entries is about 200M in all cases. So the communication overhead is about 200M × 8 B per node in the second phase. As the matrix dimension increases, the density decreases, as does the amount of floating-point arithmetic and the computation-communication ratio. The optimal number of CPU cores then also decreases.

Increasing the number of CPU cores decreases the computation-communication ratio, by increasing the total communication overhead while decreasing the computation burden of each machine. In order to improve the speedup further, it is necessary to decrease the total communication time, so as to overlap communication and computation well. One solution is to increase bandwidth by using a high-performance cluster such as lonestar.tacc.utexas.edu.

The Lonestar cluster is configured with 5,200 compute-node processors connected by a 10-Gigabit network. Each node has two Intel Xeon dual-core 64-bit processors (4 cores in total) and 8 GB of memory. The core frequency is 2.66 GHz. The operating system is Linux 2.6 and the compiler is Intel 9.1. The MPI library is MVAPICH. Figure 9 presents experimental results on Lonestar using the same matrices of dimension 40K, 80K and 160K as in the previous experiment. It demonstrates the scalability of TPILU(k) toward 80-100 CPU cores, given sufficient bandwidth.

Scalability to the Grid. TPILU(k) also has reasonable latency tolerance. This is important for architectures such as the Computational Grid.
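The bandwidth arithmetic above can be made concrete. The figures below are our own illustration, assuming Gigabit Ethernet's theoretical peak of about 125 MB/s:

```python
def comm_seconds(n_final_entries, bytes_per_entry=8,
                 bandwidth_bytes_per_s=125e6):
    """Estimated per-node communication time in the numeric phase:
    roughly bytes_per_entry bytes per final entry, sent once per node."""
    return n_final_entries * bytes_per_entry / bandwidth_bytes_per_s

# With ~200M final entries, each node moves ~1.6 GB, i.e. ~12.8 s
# at Gigabit Ethernet speed -- a floor that computation must overlap.
t = comm_seconds(200_000_000)
assert round(t, 1) == 12.8
```

On a 10-Gigabit interconnect such as Lonestar's, the same volume takes roughly a tenth of that time, which is consistent with the improved scaling reported in Figure 9.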
TPILU(k) manages to overcome the large inter-cluster latency, which is generally a few ms, as compared to an intra-cluster latency of a few µs. To test TPILU(k) on the Grid, we simulate the communication latency of the Grid on a cluster by adding some delay (≥ 17.5 ms, this number being chosen because our tests of the round-trip time on the Internet are always less than 35 ms) before message sending for each gateway node.

A 32K × 32K matrix with initial density 0.00458 is generated for this experiment. The number of final entries is 210,212,433 after symbolic factorization. The results are collected in Table 5. In this table, the number of CPU cores is expressed as the number of clusters times the number of CPU cores of each cluster. Table 5 demonstrates a single cluster of 100 nodes exhibiting a 64-times speedup. Two clusters of 50 nodes each cause the speedup to degrade to only 45 times (for a typical 17 ms delay over the Internet), and to a 42-times speedup for a 24 ms delay. Adding a third cluster of 50 nodes contributes almost no additional speedup.

[Figure 9: Computation on Lonestar; level k = 1.]

Table 5 illustrates the following features of TPILU(k) on the Grid:

1. The average bandwidth of communication on the Grid is not as good as that on a cluster, due to the communication latency. This always makes the speedup decrease more or less. The maximal speedup is at 100 CPU cores without latency.
2. The speedup decreases if the latency increases, given the same number of clusters and CPU cores. The algorithm performs better with 17 ms latency than with 24 ms latency, assuming the other parameters are the same.
3. The performance of TPILU(k) does not deteriorate. In both the 17 ms and 24 ms cases, the maximal speedup is obtained with the most CPU cores.
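The latency injection used in this simulation can be sketched as follows. This is a minimal stand-in, not the authors' MPI code; `send_fn` is a hypothetical transport function, and only gateway-node sends would be wrapped.

```python
import time

def with_grid_latency(send_fn, delay_s=0.0175):
    """Wrap a message-send function so that every call first sleeps for
    delay_s seconds, mimicking inter-cluster (Grid) latency when run on
    a gateway node. Intra-cluster sends would not be wrapped."""
    def send(msg, dest):
        time.sleep(delay_s)
        return send_fn(msg, dest)
    return send

# A fake transport that just records what was "sent":
log = []
send = with_grid_latency(lambda msg, dest: log.append((dest, msg)),
                         delay_s=0.001)   # short delay to keep the demo fast
send(b"band-0", dest=1)
assert log == [(1, b"band-0")]
```

Because the delay is added only at gateway nodes, intra-cluster traffic proceeds at full speed, matching the experimental setup described above.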
The reason that TPILU(k) maintains good performance on the Grid is that this algorithm supports a strategy to mitigate the influence of latency: using as small a number of bands as possible, given that there are still enough bands for load balancing. The fewer-bands policy reduces the number of messages required for band replication.

  Delay (ms)   #CPU Cores   #Bands   Time (s)      Speedup
   0           1            —        7821.3         1
   0           60           4096     7.9 + 203.2   37.1
   0           100          8192     7.1 + 115.1   64
  17           2 × 50       2048     5.3 + 169.2   44.8
  24           2 × 50       2048     5.0 + 182.8   41.6
  17           2 × 60       2048     4.8 + 158.0   48.0
  24           2 × 60       2048     4.8 + 169.4   44.9
  17           3 × 50       1024     3.8 + 159.7   47.8
  24           3 × 50       1024     3.7 + 163.7   46.7
  17           3 × 60       4096     3.1 + 159.6   48.1

Table 5: Simulation of the performance on the Grid; 32K × 32K; level k = 1. (The matrix density is 0.00458.)

4.7 Experimental Analysis

Given a denser matrix, a higher level k or more CPU cores, the time for domain-decomposition-based parallel preconditioning using Euclid's PILU(k) can dominate over the time for the iterative solving phase. This degrades the overall performance, as seen both in Figure 5a and in Figures 3(a,b,c). A second domain-decomposition-based parallel preconditioner, Euclid's BJILU, generally produces a preconditioned matrix of lower quality than ILU(k), as seen in Figures 3(a,b,c). This happens because it ignores off-diagonal non-zero elements. Therefore, where Euclid PILU(k) degrades the performance, it is not reasonable to resort to Euclid BJILU.

Figures 3a and 3c show that the lower-quality BJILU-based solvers often performed worse than PILU(k). Figure 4 shows TPILU(k) to perform better than either, while maintaining the good scalability expected of a bit-compatible algorithm. TPILU(k) is also robust enough to perform reasonably even in a configuration with two quad-core nodes.
Additionally, Figures 5b and 6 demonstrate very good scalability on a variety of applications when using the optional level-based incomplete inverse optimization.

The crucial issue for TPILU(k) on distributed memory is how well the algorithm is able to overlap computation and communication. For k = 2 and k = 3, TPILU(k) achieves nearly linear speedup, since computation dominates communication due to the larger amount of floating-point arithmetic per task, as Figure 8 shows. For k = 1, Tables 3 and 4 illustrate that the optimized version of TPILU(1) produces a 21-fold speedup on a departmental cluster (Gigabit Ethernet) over 30 nodes, operating on a matrix of dimension 160,000 and density 0.00037. With a high-performance cluster (InfiniBand interconnect), TPILU(k) registers a 58-fold speedup with 60 nodes, operating on a matrix of dimension 80,000 and density 0.001, as Figure 9 demonstrates. Table 5 highlights the potential of TPILU(k) on the Grid, even with a large communication delay for inter-cluster message passing: although the TPILU(k) speedup degrades, we still observe an overall performance improvement with two and three clusters participating in a computation.

5 Related Work

There are many sequential preconditioners based on incomplete LU factorization. Two typical sequential preconditioners are ILUT [5] and ILU(k) [1]. ILU(k) was formalized to solve the systems of linear equations arising from finite difference discretizations in 1978. In 1981, ILU(k) was extended to apply to more general problems [2]. A more recent sequential preconditioner is ILUC [15, 16]. In [17], a parallel version of ILUT is given for distributed-memory parallel computers. However, the parallelism in that paper comes from the analysis of a special non-zero pattern for a sparse matrix, and it does not have high scalability for a general sparse matrix.
In the process of parallelizing ILU(k) preconditioners, we are faced with a natural question: why is it so difficult to speed up ILUT or ILU(k) when k is small? We observe that, for a huge sparse matrix of low density with k = 1 or k = 2, ILU(k) preconditioning is the kind of computation that accesses a lot of memory while performing relatively little floating-point arithmetic. Therefore, when parallelizing an ILU preconditioner across more CPU cores, it is limited either by memory bandwidth in the shared-memory case or by network bandwidth in the distributed-memory case. Many discussions in [18-20] contribute valuable ideas that helped us to handle this problem and design a scalable algorithm.

In [21-23], an LU factorization algorithm for distributed-memory machines is implemented. However, this implementation needs a special API to update and synchronize the distributed memory. This is evidence that communication in the distributed-memory model is a bottleneck even for LU factorization when huge sparse matrices are considered. It implies that the parallelization of an ILU(k) preconditioner is challenging on clusters. Nevertheless, it is important, because cluster systems are the mainstream of supercomputers: more than 70% of all supercomputers in the 2007 TOP500 list [24] are cluster systems.

In [25], the supernode data structure is used to reorganize a sparse matrix, and those supernodes can be processed in parallel. Observing that many rows have a similar non-zero pattern in the result matrix of LU factorization, rows with a similar non-zero pattern can be organized as one supernode.

The parallel ILU preconditioner of [26] targets distributed sparse matrices. PMILU [27] presents a new technique to reorder a linear system in order to expose greater parallelism. These represent a category of parallel algorithms based on reordering.
In [28], pivoting is employed to reduce the number of fill-ins for LU factorization. Similarly, pivoting is used to improve the robustness of the ILU preconditioner in [29]. The work in [30] provides an algorithm to make the diagonal elements of a sparse matrix large. The methodology is to compute the optimal pivoting and preprocess a sparse matrix. If the preprocessing makes the sparse matrix break-down free for ILU(k) preconditioning, then it is possible to relax the diagonal dominance condition required by TPILU(k).

Some other previous parallel ILU(k) preconditioners include [3, 31, 32]. The latter two methods, whose parallelism comes from level/backward scheduling, are stable; they were studied in the 1980s and achieved a speedup of about 4 or 5 on an Alliant FX-8 [5, 1st edition, page 351] and a speedup of 2 or 3 on a Cray Y-MP. The more recent work [3] is directly compared with in the current work, and is not stable.

6 Conclusion

[Figure 10: Three-dimensional comparison of BJILU, PILU and TPILU(k). The horizontal axes are matrix density and off-diagonal dominance; the vertical axis is speedup, with labeled regions for BJILU, PILU, TPILU (the base algorithm) and TPIILU (the incomplete inverse). Higher points represent greater speedup. Off-diagonal dominance is intended to express a qualitative rather than a quantitative concept: how many off-diagonal elements are significantly far from zero. TPILU(k)/TPIILU(k) is generally faster, but BJILU is faster for matrices with few or small off-diagonal elements. PILU(k) is also faster in that case, but only for very sparse matrices.]

This work can be graphically summarized via Figure 10.
While there are regions in which each of BJILU, PILU and TPILU(k) is faster, TPILU(k) remains competitive and stable in most cases. TPILU(k) is more stable than BJILU, as demonstrated in Figure 10: BJILU has excellent performance on some types of linear systems, while failing to converge on other linear systems. Figure 10 also shows that TPILU(k) is more stable than PILU, whose performance degrades along two dimensions: density and off-diagonal dominance.

Even though it is omitted from Figure 10, the performance degradation increases for both BJILU and PILU when they employ more threads. In contrast, the bit-compatibility of TPILU(k) guarantees performance improvement for solving most classes of large sparse linear systems, while maintaining stability.

7 Acknowledgment

We acknowledge helpful discussions with Ye Wang at an early stage of this work.

References

1. Gustafsson, I.: A Class of First Order Factorization Methods. BIT Numerical Mathematics, Springer Netherlands, vol. 18(2), pp. 142-156 (1978)
2. Watts III, J.W.: A Conjugate Gradient-Truncated Direct Method for the Iterative Solution of the Reservoir Simulation Pressure Equation. SPE Journal, vol. 21(3), pp. 345-353 (1981)
3. Hysom, D., Pothen, A.: A Scalable Parallel Algorithm for Incomplete Factor Preconditioning. SIAM J. Sci. Comput., vol. 22, pp. 2194-2215 (2000)
4. Hysom, D., Pothen, A.: Efficient Parallel Computation of ILU(k) Preconditioners. In: Supercomputing '99 (1999)
5. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd ed. SIAM (2003)
6. Bollhöfer, M., Saad, Y.: On the Relations between ILUs and Factored Approximate Inverses. SIAM J. Matrix Anal. Appl., vol. 24(1), pp. 219-237 (2002)
7. Dong, X., Cooperman, G., Apostolakis, J.: Multithreaded Geant4: Semi-Automatic Transformation into Scalable Thread-Parallel Software. In: Euro-Par 2010 (2010)
8.
Saad, Y., van der Vorst, H.A.: Iterative Solution of Linear Systems in the 20th Century. J. Comput. Appl. Math., vol. 123(1-2), pp. 1-33 (2000)
9. Cooperman, G.: Practical Task-Oriented Parallelism for Gaussian Elimination in Distributed Memory. Linear Algebra and Its Applications, vol. 275-276, pp. 107-120 (1998)
10. hypre: High Performance Preconditioners. User's Manual, version 2.6.0b. https://computation.llnl.gov/casc/hypre/download/hypre-2.6.0b_usr_manual.pdf
11. Matrix Market, Driven Cavity from the SPARSKIT Collection. http://math.nist.gov/MatrixMarket/data/SPARSKIT/drivcav/drivcav.html
12. UF Sparse Matrix Collection. http://www.cise.ufl.edu/research/sparse/matrices/
13. Matgen: Command-line random matrix generator on Linux. http://matgen.sourceforge.net/
14. Dong, X., Cooperman, G.: Scalable Task-Oriented Parallelism for Structure Based Incomplete LU Factorization. http://arxiv.org/abs/0803.0048v1 (2008)
15. Li, N., Saad, Y.: Crout Versions of the ILU Factorization with Pivoting for Sparse Symmetric Matrices. Electronic Transactions on Numerical Analysis, vol. 20, pp. 75-85 (2005)
16. Mayer, J.: Some New Developments in ILU Preconditioners. GAMM Annual Meeting, vol. 6(1), pp. 719-720 (2006)
17. Karypis, G., Kumar, V.: Parallel Threshold-based ILU Factorization. In: Supercomputing '97 (1997)
18. Benzi, M.: Preconditioning Techniques for Large Linear Systems: A Survey. Journal of Computational Physics, vol. 182, pp. 418-477 (2002)
19. Duff, I.S., van der Vorst, H.A.: Developments and Trends in the Parallel Solution of Linear Systems. Parallel Computing, vol. 25, pp. 1931-1971 (1999)
20. Heath, M.T., Ng, E., Peyton, B.W.: Parallel Algorithms for Sparse Linear Systems. SIAM Review, vol. 33, pp. 420-460 (1991)
21. Fu, C., Jiao, X., Yang, T.: Efficient Sparse LU Factorization with Partial Pivoting on Distributed Memory Architectures. IEEE Transactions on Parallel and Distributed Systems, vol. 9 (1998)
22.
Fu, C., Yang, T.: Sparse LU Factorization with Partial Pivoting on Distributed Memory Machines. In: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (CD-ROM), vol. 31 (1996)
23. Jiang, B., Richman, S., Shen, K., Yang, T.: Efficient Sparse LU Factorization with Lazy Space Allocation. In: Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing (1999)
24. TOP500 List. http://www.top500.org (2007)
25. Li, X.: Sparse Gaussian Elimination on High Performance Computers. PhD thesis, Computer Science Division, EECS, U. of California, Berkeley (1996)
26. Shen, C., Zhang, J., Wang, K.: Parallel Multilevel Block ILU Preconditioning Techniques for Large Sparse Linear Systems. In: IPDPS '03 (2003)
27. Hénon, P., Saad, Y.: A Parallel Multistage ILU Factorization Based on a Hierarchical Graph Decomposition. SIAM J. Scientific Computing, vol. 28(6), pp. 2266-2293 (2006)
28. Grigori, L., Demmel, J., Li, X.S.: Parallel Symbolic Factorization for Sparse LU with Static Pivoting. SIAM J. Scientific Computing, vol. 29(3), pp. 1289-1314 (2007)
29. Zhang, J.: A Multilevel Dual Reordering Strategy for Robust Incomplete LU Factorization of Indefinite Matrices. SIAM J. Matrix Anal. Appl., vol. 22, p. 925 (2001)
30. Duff, I.S., Koster, J.: On Algorithms for Permuting Large Entries to the Diagonal of a Sparse Matrix. SIAM Journal on Matrix Analysis and Applications, vol. 22(4), pp. 973-996 (2001)
31. Anderson, E.: Parallel Implementation of Preconditioned Conjugate Gradient Methods for Solving Sparse Systems of Linear Equations. Master's Thesis, Center for Supercomputing Research and Development, University of Illinois (1988)
32. Heroux, M.A., Vu, P., Yang, C.: A Parallel Preconditioned Conjugate Gradient Package for Solving Sparse Linear Systems on a Cray Y-MP. Appl. Num. Math., vol. 8, pp. 93-115 (1991)
