Pushing the limits for medical image reconstruction on recent standard multicore processors


Authors: Jan Treibig, Georg Hager, Hannes G. Hofmann, Joachim Hornegger, Gerhard Wellein

Jan Treibig (jan.treibig@rrze.uni-erlangen.de), Georg Hager (georg.hager@rrze.uni-erlangen.de), and Gerhard Wellein (gerhard.wellein@rrze.uni-erlangen.de), Erlangen Regional Computing Center, Martensstr. 1, 91058 Erlangen, Germany.
Hannes G. Hofmann (hannes.hofmann@informatik.uni-erlangen.de) and Joachim Hornegger (jh@informatik.uni-erlangen.de), Pattern Recognition Lab, Martensstr. 3, 91058 Erlangen, Germany.

ABSTRACT

Volume reconstruction by backprojection is the computational bottleneck in many interventional clinical computed tomography (CT) applications. Today, vendors in this field are replacing special-purpose hardware accelerators with standard hardware such as multicore chips and GPGPUs. Medical imaging algorithms are on the verge of employing High Performance Computing (HPC) technology, and are therefore an interesting new candidate for optimization. This paper presents low-level optimizations for the backprojection algorithm, guided by a thorough performance analysis on four generations of Intel multicore processors (Harpertown, Westmere, Westmere EX, and Sandy Bridge). We choose the RabbitCT benchmark, a standardized test case well supported in industry, to ensure transparent and comparable results. Our aim is not only to provide the fastest possible implementation, but also to compare against performance models and hardware counter data in order to fully understand the results. We separate the influence of algorithmic optimizations, parallelization, SIMD vectorization, and microarchitectural issues, and pinpoint problems with current SIMD instruction set extensions on standard CPUs (SSE, AVX). The use of assembly language is mandatory for best performance. Finally, we compare our results to the best GPGPU implementations available for this open-competition benchmark.

1. INTRODUCTION AND RELATED WORK

1.1 Computed tomography

Computed tomography (CT) [1] is nowadays an established technology for non-invasively determining the three-dimensional (3D) structure of an object from a series of 2D projections. Beyond its classic application area of static analysis in clinical environments, the use of CT has accelerated substantially in recent years, e.g., towards materials science or time-resolved scans supporting interventional cardiology. The numerical volume reconstruction scheme is a key component of modern CT systems and is known to be very compute-intensive. Acceleration through special-purpose hardware such as FPGAs is a typical approach to meet the constraints of real-time processing. Integrating nonstandard hardware into commercial CT systems adds considerable cost in terms of both hardware and software development, as well as system complexity. From an economic point of view, the use of standard x86 processors would thus be preferable. Driven by Moore's law, the compute capabilities of standard CPUs now have the potential to meet the requested CT time constraints.

Figure 1: C-arm system illustration (Axiom Artis Zeego, Siemens Healthcare, Forchheim, Germany).
Figure 2: Volume renderings based on the reconstruction of 2D X-ray projections of a rabbit: (a) skin, (b) bones.

The volume reconstruction step for recent C-arm systems with flat-panel detectors can be considered a prototype for modern clinical CT systems. Interventional C-arm CTs, such as the one sketched in Fig. 1, perform a rotational acquisition of 496 high-resolution X-ray projection images (1248 × 960 pixels) in 20 seconds [2]. This acquisition phase sets the constraint on the maximum reconstruction time needed to attain real-time reconstruction. In practice, filtered backprojection (FBP) methods such as the Feldkamp algorithm [3] are widely used for performance reasons. The algorithm consists of 2D preprocessing steps, backprojection, and 3D postprocessing. Volume data is reconstructed in the backprojection step, making it by far the most time-consuming part [4]. It is characterized by high computational intensity, nontrivial data dependencies, and complex numerical evaluations, but it also offers an inherently embarrassingly parallel structure. In recent years, hardware-specific optimization of the Feldkamp algorithm has focused on GPUs [5, 6] and the IBM Cell processor [7]. For GPUs in particular, large performance gains compared to CPUs were reported [6] or documented by the standardized RabbitCT benchmark [8, 9]. Available studies with standard CPUs indicate that large servers are required to match GPU performance [10]. In this report we also use the RabbitCT environment, which defines a clinically relevant test case and is supported by industry. RabbitCT is an open-competition benchmark based on C-arm CT images of a rabbit (see Fig. 2). It makes it possible to compare the manifold of existing hardware technologies and implementation alternatives for reconstruction scenarios by applying them to a fixed, well-defined problem.

At the research level, recent reconstruction methods use more advanced iterative techniques, which can provide superior image quality in special cases such as sparse or irregular data [11]. Other algorithms are used to reconstruct time-resolved volumes ("3D+t") [12], e.g., for cardiac imaging. However, both approaches incorporate several backprojection steps, making performance improvements on the RabbitCT benchmark valuable for them as well. The same holds for industrial CTs in materials science used in nondestructive testing (NDT), which additionally run at higher resolution and thus further increase the computational requirements [13]. The high computational demand of backprojection algorithms, together with the growing acceptance of parallel computing in medical applications, makes them interesting new candidates at the interface to High Performance Computing.

1.2 Modern processors

The steadily growing transistor budget continues to push forward the compute capabilities of commodity processors at roughly constant price and clock speed. Improvement and scaling of established architectural features in the core (like, e.g., SIMD and SMT; see below for details), in addition to increasing the number of cores per chip, lead to peak performance levels that enable standard CPUs to meet the time constraints of interventional CT systems. Optimized single-core implementations are thus mandatory for reconstruction algorithms.
Complemented by standard optimization techniques, a highly efficient SIMD (Single Instruction Multiple Data) code is the key performance component. Owing to the nontrivial data parallelism in the reconstruction scheme, SIMD optimizations need to be done at a low level, i.e., down to assembly language. However, these efforts will pay off for future computer architectures, since an efficient SIMD implementation will become of major importance for further benefiting from increasing peak performance numbers.

Scaling the SIMD width is a safe bet for increasing raw core performance and optimizing energy efficiency (i.e., maximizing the flop/Watt ratio), which is known to be the critical component in future HPC systems. Recently, Intel doubled the SIMD width from the SSE instruction set (128-bit registers) to AVX [14] (256-bit registers, with larger widths planned for future designs), which is implemented in the "Sandy Bridge" architecture. More ongoing projects point in the same direction, like Intel's Many Integrated Core (MIC) [15] architecture or the Chinese Godson-3B chip [16]. Wider SIMD units do not change core complexity substantially, since the optimal instruction throughput does not depend on the SIMD width. However, the benefit in the application strongly depends on the level and structure of data parallelism, as well as on the capability of the compiler to detect and exploit it. Compilers are very effective for simple code patterns like streaming kernels [17], while more complex loop structures require manual intervention, at least at the level of compiler intrinsics. This is the only safe way to utilize SIMD capabilities to their full potential. The impact of wider SIMD units on programming style is still unclear, since this trend is only now starting to accelerate. Of course, wide SIMD execution is most efficient for in-cache codes because of the large "DRAM gap."

Simultaneous Multi-Threading (SMT) is another technology that improves core utilization at low architectural impact and energy cost. It is obvious that SMT should be most beneficial for those "in-cache codes" that have limited single-thread efficiency due to bubbles in the arithmetic/logic pipelines, and where cache bandwidth is not the dominating factor. Naively, one would not expect improvements from SMT if a code is fully SIMD-vectorized, since SIMD is typically applied to simple data-parallel structures, which are a paradigm for efficient pipeline use. Since many programmers do not care about pipeline occupancy in their applications, the benefit of SMT is often tested in an experimental way, without arriving at a clear explanation for why it does or does not improve performance.

This paper is organized as follows. In Sect. 3 we perform a first analysis of the backprojection algorithm as implemented in the RabbitCT framework, using simple metrics like arithmetic throughput and memory bandwidth as guidelines for estimating performance. We address processors from four generations of Intel's x86 family (Harpertown, Westmere, Westmere EX, and Sandy Bridge). Basic optimization rules such as minimizing overall work are applied. In Sect. 4 we show how to efficiently vectorize the inner loop kernel using SSE and AVX instructions, and discuss the possible benefit of multithreading with SMT. Sect. 5 provides an in-depth performance analysis, which will show that simple bandwidth or arithmetic throughput models are inadequate for estimating the performance of this algorithm. OpenMP parallelization and related optimizations like ccNUMA placement and bandwidth reduction are discussed in Sect. 6. Performance results for cores, sockets, and nodes on all four platforms are given in Sect. 7, where we also interpret the effect of the different optimizations discussed earlier and validate our performance model. Finally, we compare our results with current GPGPU implementations in Sect. 8.
2. EXPERIMENTAL TESTBED

A selection of modern Intel x86-based multicore processors (see Table 1) was chosen to test the performance potential of our optimizations. All of these chips feature a large outer-level cache, which is shared by two (Core 2 Quad "Harpertown"), four (Sandy Bridge), six (Westmere EP), or ten cores (Westmere EX). We refer to the maximum number of cores sharing an outer-level L2/L3 cache as an "L2/L3 group."

With the introduction of the Core i7 architecture, the memory subsystem of Intel processors was redesigned to allow for a substantial increase in memory bandwidth, at the price of introducing ccNUMA on multisocket servers. At the same time, Intel also relaunched simultaneous multithreading (SMT, a.k.a. "Hyper-Threading") with two SMT threads per physical core. The most recent processor, Sandy Bridge, is equipped with a new instruction scheduler, supports the new AVX SIMD instruction set extension, and has a new last-level cache subsystem (which was already present in Nehalem EX). The 10-core Intel Westmere EX is targeted not mainly at HPC clusters, but at large mission-critical servers. It reflects the current performance maximum for x86 shared-memory nodes. A comprehensive summary of the most important processor features is presented in Table 1. Note that the Sandy Bridge model used here is a desktop variant, while the other processors are of the server ("Xeon") type. Table 1 also contains bandwidth measurements for a simple update benchmark:

  for (int i = 0; i < N; ++i)
      a[i] = s * a[i];

This benchmark reflects the data streaming properties of the reconstruction algorithm and is thus better suited than STREAM [17] as a baseline for a quantitative performance model.

We use the Intel C/C++ compiler in version 12.0; since most of our performance-critical code is written in assembly language, this choice is of minor importance, however. Thread affinity, hardware performance monitoring, and low-level benchmarking were implemented via the LIKWID tool suite [18, 19], using the tools likwid-pin, likwid-perfctr, and likwid-bench, respectively.

Figure 3: Setup geometry for generating the CT projection images (X-ray source, detector with pixel coordinates u and v, and the volume with coordinates x, y, z). The size of the volume is always 256³ mm³, but the number of voxels may vary.
3. THE ALGORITHM

3.1 Code analysis

The backprojection algorithm (as provided by the RabbitCT framework [9], see Listing 1) is usually implemented in single precision (SP) and exhibits a streaming access pattern for most of its data traffic. One volume reconstruction uses 496 CT images (denoted by I) of 1248 × 960 pixels each (ISX × ISY). The volume size is 256³ mm³. MM is the voxel size and changes depending on the number of voxels. The most common resolution in present clinical applications is 512 voxels in each direction (denoted by the problem size L). Each CT image is accompanied by a 3 × 4 projection matrix A, which projects an arbitrary point in 3D space onto the CT image. The algorithm computes the contributions to each voxel across all projection images. The reconstructed volume is stored in array VOL. Voxel coordinates (indices) are denoted by x, y, and z, while pixel coordinates are called u and v. See Fig. 3 for the geometric setup.

The aggregated size of all projection images is ≈ 2.4 GB. The total data transfer volume of one voxel sweep comprises the loads from the projection image and an update operation (VOL[i] += s, see line 41 in Listing 1) on the voxel array. The latter incurs 8 bytes of traffic per voxel and results (for problem size 512³) in a data volume of 1 GB, or 496 GB for all projections. The traffic caused by the projection images is not easy to quantify, since it is not a simple stream; it is defined by a "beam" of locations slowly moving over the projection pixels as the voxel update loop nest progresses. It exhibits some temporal locality, since neighboring voxels are projected onto proximate pixels of the image, but there may also be multiple streams with large strides. On the computational side, the basic version of this algorithm performs 13 additions, 5 subtractions, 17 multiplications, and 3 divides per voxel update.

Table 1: Test machine specifications. The cacheline size is 64 bytes for all processors and cache levels. The update benchmark results were obtained with the likwid-bench tool.

  Microarchitecture            Intel Harpertown   Intel Westmere      Intel Westmere EX    Intel Sandy Bridge
  Model                        Xeon X5482         Xeon X5670          Xeon E7-4870         Core i7-2600
  Label                        HPT                WEM                 WEX                  SNB
  Clock [GHz]                  3.2                2.66 (2.93 turbo)   2.40                 3.4 (3.5 turbo)
  Node sockets/cores/threads   2/8/-              2/12/24             4/40/80              1/4/8
  Socket L1/L2/L3 cache        4x32k/2x6M/-       6x32k/6x256k/12M    10x32k/10x256k/30M   4x32k/4x256k/8M
  Bandwidths [GB/s]:
    Theoretical socket BW      12.8               32.0                34.2                 21.3
    Update (1 thread)          5.9                15.2                8.3                  16.5
    Update (socket)            6.2                20.3                24.6                 17.3
    Update (node)              8.4                39.1                98.7                 -

Listing 1: Voxel update loop nest for the plain backprojection algorithm. This kernel is executed for each projection I. All variables are of type float unless indicated otherwise. The division into parts (see text) is only approximate, since there is no 1:1 correspondence to the SIMD-vectorized code.

   1  wz = offset_z;
   2  for (int z = 0; z < L; z++, wz += MM) {
   3    wy = offset_y;
   4
   5    for (int y = 0; y < L; y++, wy += MM) {
   6      wx = offset_x;
   7      valtl = 0.0f; valtr = 0.0f;
   8      valbl = 0.0f; valbr = 0.0f;
   9
  10      // Part 1 ---------------------------------
  11      for (int x = 0; x < L; x++, wx += MM) {
  12        uw = (A[0]*wx + A[3]*wy + A[6]*wz + A[9]);
  13        vw = (A[1]*wx + A[4]*wy + A[7]*wz + A[10]);
  14        w  = (A[2]*wx + A[5]*wy + A[8]*wz + A[11]);
  15
  16        u = uw * 1.0f/w;  v = vw * 1.0f/w;
  17
  18        int iu = (int)u, iv = (int)v;
  19
  20        scalx = u - (float)iu;
  21        scaly = v - (float)iv;
  22        // Part 2 ---------------------------------
  23        if (iv >= 0 && iv < ISY) {
  24          if (iu >= 0 && iu < ISX)
  25            valtl = I[iv*ISX + iu];
  26          if (iu >= -1 && iu < ISX-1)
  27            valtr = I[iv*ISX + iu + 1];
  28        }
  29
  30        if (iv >= -1 && iv < ISY-1) {
  31          if (iu >= 0 && iu < ISX)
  32            valbl = I[(iv+1)*ISX + iu];
  33          if (iu >= -1 && iu < ISX-1)
  34            valbr = I[(iv+1)*ISX + iu + 1];
  35        }
  36        // Part 3 ---------------------------------
  37        vall = scaly*valbl + (1.0f - scaly)*valtl;
  38        valr = scaly*valbr + (1.0f - scaly)*valtr;
  39        fx   = scalx*valr + (1.0f - scalx)*vall;
  40
  41        VOL[z*L*L + y*L + x] += 1.0f/(w*w) * fx;
  42      } // x
  43    } // y
  44  } // z
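For concreteness, the traffic figures quoted above can be verified with a little arithmetic of our own: the update in line 41 reads and writes one 4-byte float, i.e., 8 bytes per voxel update, so

\[
512^3 \cdot 8\,\mathrm{B} = 2^{30}\,\mathrm{B} \approx 1\,\mathrm{GB}\ \text{per sweep}, \qquad
496 \cdot 2^{30}\,\mathrm{B} \approx 496\,\mathrm{GB}\ \text{in total},
\]

and the aggregated image size is \(496 \cdot 1248 \cdot 960 \cdot 4\,\mathrm{B} \approx 2.4\,\mathrm{GB}\).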
3.2 Simple performance models

Based on this knowledge about data transfers and arithmetic operations, we can derive a first rough estimate of the expected upper performance bounds. The arithmetic limitation results in 21 cycles per vectorized update (covering 4 and 8 inner loop iterations for SSE and AVX, respectively), assuming full vectorization and a throughput of one divide per cycle. This takes into account that all architectures under consideration can execute one addition and one multiplication per cycle, neglects the slight imbalance of additions versus multiplications, and assumes that the pipelined rcpps instruction can be employed for the divisions (see Sect. 4.1 for details). On the other hand, runtime can also be estimated based on the data transfer volume and the maximum data transfer capabilities of the nodes as measured with the synthetic update benchmark described in Sect. 2. The following table shows upper performance bounds for a full 512³ reconstruction based on arithmetic and bandwidth limitations on the four systems in the testbed (full nodes):

                          HPT    WEM    WEX    SNB
  Arithm. lim. [GUP/s]    4.86   6.75   13.9   5.31
  BW lim. [GUP/s]         1.06   4.90   11.1   2.15

Performance is given in billions of voxel updates per second (GUP/s),¹ where one "update" represents the reconstruction step of one voxel using a single image. Note that the single-socket Sandy Bridge reaches an arithmetic bound comparable to the dual-socket systems, owing to its wide AVX vector size and its faster clock. While the above predictions seem to indicate a strongly memory-bound situation, they are far from accurate: the runtime is governed by the number of instructions and the cycles it takes to execute them. A reduction to the purely "useful" work, i.e., to arithmetic operations, cannot be made, since this algorithm is nontrivial to vectorize due to the scattered loads from the projection image; it therefore involves many more non-arithmetic instructions (see Sect. 4.1 for details). We will show later that a more careful analysis leads to a completely different picture, and that further optimizations can change the bottleneck analysis considerably.

¹ We use SI prefixes, i.e., 1 GUP/s means 10⁹ updates per second. This is inconsistent with a large part of the literature on medical image reconstruction, where "G" is used as a binary prefix for 2³⁰ ≈ 1.074 · 10⁹ [20].
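To illustrate how these bounds arise (our arithmetic, not taken from the paper; we use the turbo clock and the measured node update bandwidth from Table 1), take Sandy Bridge as an example:

\[
P_{\mathrm{arith}} = \frac{3.5\,\mathrm{GHz}\cdot 4\ \mathrm{cores}\cdot 8\ \mathrm{updates}}{21\ \mathrm{cycles}} \approx 5.3\ \mathrm{GUP/s}, \qquad
P_{\mathrm{BW}} = \frac{17.3\,\mathrm{GB/s}}{8\,\mathrm{B/update}} \approx 2.2\ \mathrm{GUP/s},
\]

in agreement with the 5.31 and 2.15 GUP/s entries in the table.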
In order to have a b etter view on lo w-lev el optimizations we divide the algorithm into three p arts: 1. Geometry computation: Calculate the index of the pro jection of a vo xel in pixel co ordinates 2. Load four c orner pixel v alues from the pro jection im- age 3. Interp olate linearly for the up date of the voxel data 3.3 Algorithmic optimizations The first optimiza tions for a give n al gorithm must b e on a hardwa re-indep endent leve l. Beyond elementary measures lik e moving in v arian t comput ations out of the inner lo op b ody and reducing the d ivides to one recipro cal (thereby reducing th e fl op count to 31), a main optimization is to minimize the w orkload. V oxels l ocated at the corners and edges of the vol ume are not visible on every pro jection, and can thus b e “clipped off ” and skipp ed in the inner loop. This is not a new idea, but the approach presen ted here impro v es the wo rk redu ction from 24% [21] to nearly 39%. The basic b uilding block for all further steps is th e up date of a consecutive line of vo xels in x d irection, cov ered by the inn er lo op level in Listing 1. W e refer to this as the “line up date kernel.” The geometry , i.e., the p osition of the first a nd the l ast relev an t vo xel for each pro jection im- age and line of vo xels is p recomputed. This information is sp ecific for a given geometric setup, so it can b e stored and used later d uring the backpro jection lo op. Reading the data fro m memory incurs an additional transfe r v olume o f 512 2 × 496 × 4 bytes=496 MB/s (assuming 16-bit indexing), whic h is negligible compared to the other traffic. The ad- v an tage of line-wise clipping is that the shap e of the clipp ed vo xel volume is muc h more accurately trac k ed th an with the blocking approach d escribed in [21]. The conditionals (lines 23 and 30 in Listing 1), which en- sure correct access to the pro jectio n image, inv olv e no mea- surable ov erhead for the scalar case due to the hardware branch prediction. How ever, for vectorized cod e they are p oten tially costly since an appropriate mask m ust be con- structed whenever th ere is the p ossibilit y that a SI MD vec- tor instruction accesses data outside the pro jection [21]. T o remo ve this complication, separate buffers are used to hold suitably zero-padded copies of the pro jection images, so that there is no need for vector masks. The additional o verhead is far outw eighed by th e p erformance adva ntage for vector- ized code execution. The conditionals are also effectively remo ved by the clipping optimiza tion d escribed ab ov e, but w e need a co de v ersion without clipping for val idating our p erformance mod el later. Note that a simila r effect could b e ac h iev ed by p eeling off scalar loop iterations to make the length of the inn er lo op b ody a m ultiple of the SI MD vector size and ensure aligned memory access. How ev er, this ma y in troduce a significant scalar comp onent esp ecially for small problem sizes and large vector lengths. 4. SINGLE CORE OPTIMIZA TIONS F or all further optimizations we choose an implementation of the line up date kernel in C as the baseline. All algorithmic optimizations from Sect. 3.3 hav e already b een applied. The p erformance of present p rocessors on the core level re- lies on instruction-level p arallelism (ILP) by pipelined and sup erscalar execu t ion, and data-p arall el operations (SIMD). 
4. SINGLE CORE OPTIMIZATIONS

For all further optimizations, we choose an implementation of the line update kernel in C as the baseline. All algorithmic optimizations from Sect. 3.3 have already been applied. The performance of present processors at the core level relies on instruction-level parallelism (ILP) through pipelined and superscalar execution, and on data-parallel operations (SIMD). We also regard simultaneous multithreading (SMT) as a single-core optimization, since it is a hardware feature that increases the efficiency of the execution units by filling pipeline bubbles with useful work: the idea is to duplicate parts of the hardware resources (control logic, registers) in order to allow a quasi-simultaneous execution of different threads while sharing other parts like, e.g., the floating-point pipelines. In a sense, this is an alternative to outer loop unrolling, which can also provide multiple independent instruction streams. We elaborate on the SIMD vectorization here and comment on the benefit of SMT in Sect. 5.2.

4.1 SIMD vectorization

No current compiler is able to efficiently vectorize the backprojection algorithm, so we have implemented the code directly in x86 assembly language. Using SIMD intrinsics could ease the vectorization, but adds some uncertainties with regard to register scheduling and hence does not allow full control over the instruction code. All data is aligned to enable packed and aligned loads/stores of vector registers (16 (SSE) or 32 (AVX) bytes with one instruction). The line update kernel operates on consecutive voxels.

Part 1 of the algorithm (see Sect. 3.2) is straightforward to vectorize, since it is arithmetically limited and fully benefits from the increased register width. The division is replaced by a reciprocal. SSE provides the fully pipelined rcpps instruction for an approximate reciprocal with reduced accuracy compared to a full divide. This approximation is sufficient for this algorithm, and results in an accuracy similar to GPGPU implementations. An analysis of the impact of the approximate reciprocal on performance and accuracy is presented in Sect. 7.2. The integer cast (line 18) is implemented via the vectorized hardware rounding instruction roundps, which was introduced with SSE4.

Part 2 of the algorithm cannot be vectorized directly. Since the pixel coordinates from step 1 are already in a vector register, the index calculations for, e.g., iv*ISX+iu and (iv+1)*ISX+iu (lines 25, 27, 32, and 34 in Listing 1) are done using the SIMD floating-point units. There are pairs of values which can be loaded in one step because they are consecutive in memory: valtl/valtr and valbl/valbr, respectively. Fig. 4 shows the steps involved in vectorizing part 2 and the first linear interpolation. The conversion of the index into a general-purpose register, which is needed for addressing the load of the data and for the scattered pairwise loads, is costly in terms of necessary instructions. Moreover, the runtime increases linearly with the width of the register, and the whole operation is limited by instruction throughput. There are different implementation options with the instructions available (see below). Finally, the bilinear interpolation in part 3 is again straightforward to vectorize and fully benefits from wider SIMD registers.

Figure 4: Vectorization of part 2 of the algorithm: (1) duplicate scaly, (2) load valt/valb pairwise (left/right) into the vector registers, so that two iterations share one register, (3) perform the linear interpolation, with iterations 0,2 and 1,3 computed simultaneously, and (4) recover the initial ordering by duplicate and blend for the second interpolation step.

We consider two SSE implementations, which differ only in part 2 of the algorithm. Version 1 (V1) converts the floating-point values in the vector registers to four quadwords and stores the result back to memory (cache, actually). Single index values are then loaded into general-purpose registers one by one. Version 2 (V2) does not store to memory, but instead shifts each value in turn to the lowest position in the SSE register, from where it is moved directly to a general-purpose register using the cvtss2si instruction (see the intrinsics sketch below). Note that any further inner loop unrolling beyond what is required for SIMD vectorization would not show any benefit, due to register shortage; however, as will be shown later, SMT can be used to achieve a similar effect.
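For readers who prefer intrinsics over assembly, here is a minimal sketch of the V2 idea (ours; the production kernel is hand-written assembly, and its exact shuffle sequence may differ). The register idx holds the four pixel indices as already-rounded floats:

  #include <xmmintrin.h>

  /* Extract four packed indices from an SSE register without a round
     trip through memory: shuffle each lane to position 0 and convert
     it directly with cvtss2si. */
  static void extract_indices(__m128 idx, int out[4])
  {
      out[0] = _mm_cvtss_si32(idx);
      out[1] = _mm_cvtss_si32(_mm_shuffle_ps(idx, idx, _MM_SHUFFLE(1, 1, 1, 1)));
      out[2] = _mm_cvtss_si32(_mm_shuffle_ps(idx, idx, _MM_SHUFFLE(2, 2, 2, 2)));
      out[3] = _mm_cvtss_si32(_mm_shuffle_ps(idx, idx, _MM_SHUFFLE(3, 3, 3, 3)));
  }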
4.2 AVX implementation

In theory, the new AVX instruction set extension doubles the performance per core. The backprojection cannot fully benefit from this advantage, because the number of required instructions in part 2 of the algorithm increases linearly with the register width. For arbitrary SIMD vector lengths, a hardware gather operation would be required to prevent this part from becoming a severe bottleneck. Also, the limited number of AVX instructions that natively operate on 256-bit registers impedes more sophisticated variants of part 2; only the simple version V1 could be ported. An implementation of V2 would be possible only at the price of a much larger instruction count, so this alternative was not considered. Still, an improvement of 25% could be achieved with the AVX kernel on Sandy Bridge (see Sect. 7 for detailed performance results).

Intel has already announced the successor AVX2, which will be implemented in the Intel Haswell processor in 2012. AVX2 will fix the issues preventing better performance of the backprojection algorithm, namely the missing hardware gather instruction and the incomplete instruction set operating on the full SIMD register width.
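For illustration, the announced gather operation would collapse the whole extract-and-load sequence of part 2 into a single instruction. A sketch using the published AVX2 intrinsic (hypothetical here, since none of the CPUs tested in this paper implements it):

  #include <immintrin.h>

  /* AVX2: fetch eight pixels I[idx[0..7]] with one vgatherdps
     instruction; the scale argument 4 is sizeof(float). */
  static __m256 gather_pixels(const float *I, __m256i idx)
  {
      return _mm256_i32gather_ps(I, idx, 4);
  }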
5. IN-DEPTH PERFORMANCE ANALYSIS

5.1 Analytic performance model

Popular performance models for bandwidth-limited algorithms reduce the influence on the runtime to the algorithmic balance, taking into account the sustained main memory performance [22]. This assumption works well in many cases, especially when there is a large gap between arithmetic capabilities and main memory bandwidth. Recently it was shown [23] that this simplification is problematic on newer architectures like Intel Core i7, because of their exceptionally large memory bandwidth per socket; e.g., a single Nehalem core cannot saturate the memory interface [24]. Moreover, the additional L3 cache decouples core instruction execution from main memory transfers and generally provides a better opportunity to overlap data traffic between different levels of the memory hierarchy. Note that in a multithreaded scenario the simple bandwidth model can indeed work, as long as multiple cores are able to saturate the socket bandwidth. This property depends crucially on the algorithm, of course.

In [23] we introduced a more detailed analytic method for cases where the balance model is not sufficient to predict performance with the required accuracy, and where the in-cache data transfers account for a significant fraction of the overall runtime. This model is based on an instruction analysis of the innermost loop body and on runtime contributions from cacheline transfer volumes through the whole memory hierarchy. We first cover single-threaded execution only, and then generalize to multithreading on the socket once the basic performance limitations are understood.

A useful starting point for all further analysis is the single-thread runtime spent executing instructions with data loaded from the L1 cache. The Intel Architecture Code Analyzer (IACA) [25] was used to analytically determine the runtime of the loop body (a marker-based usage sketch follows below). This tool calculates the raw throughput according to the architectural properties of the processor, under the assumption that all data resides in the L1 cache. It supports Westmere and Sandy Bridge (including AVX) as target architectures. The results for Westmere are shown in the following table for the two SSE kernel variants described above (all entries except µOPs are in core cycles):

                 Issue port
  Variant     0    1    2    3    4    5    TP   µOPs   CP
  V1         15   21   24    3    3   19    24    85    54
  V2         20   27   16    1    1   20    27    85    71

Execution times are calculated separately for all six issue ports (0...5). (A µOP is a RISC-like "micro-instruction"; x86 processors perform an on-the-fly translation of machine instructions to µOPs, which are the "real" instructions that get executed by the core.) Apart from the raw throughput (TP) and the total number of µOPs, the tool also reports a runtime prediction that takes into account latencies on the critical path (CP). Based on this prediction, V1 should be faster than V2 on Westmere. However, the measurements in Table 2 show the opposite result. The high pressure on the load issue port (2), together with an overall high pressure on all ALU issue ports (0, 1, and 5), seems to be decisive. In V2, the pressure on port 2 is much lower, although the overall pressure on all issue ports is slightly larger.
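To make the methodology reproducible, here is a minimal sketch of how a kernel can be prepared for IACA (our example; the iacaMarks.h header and the IACA_START/IACA_END markers ship with the tool). The instrumented binary is only analyzed statically, not executed:

  #include "iacaMarks.h"   /* shipped with IACA */

  void update(float *a, float s, int n)
  {
      for (int i = 0; i < n; i++) {
          IACA_START        /* analysis window begins at top of loop body */
          a[i] = s * a[i];
      }
      IACA_END              /* analysis window ends after the loop */
  }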
Figure 5: Performance analysis: runtime contributions from instruction execution and the necessary cacheline transfers, for (a) Harpertown, (b) Westmere, (c) Sandy Bridge (SSE), and (d) Westmere EX. Each transfer in the diagram is a 64-byte cacheline moving between adjacent levels (registers, L1, L2, L3 where present, and memory). The in-core execution times per cacheline update are the measured values from Table 2, scaled to a complete cacheline (ca. 230, 206, 178, and 219 cycles, respectively), while the main memory transfers amount to ca. 32, 12, 21, and 9 core cycles, respectively. The transfer time is assumed to be solely determined by bandwidth capabilities; any latency influence is ignored. For data transfers from main memory, the contributions in memory/FSB cycles are translated into core cycles.

Table 2: Measured execution times (one core) in cycles for one iteration of the SIMD-vectorized kernel (i.e., 4 or 8 voxel updates) with all operands residing in the L1 cache.

            HPT    WEM    WEX    SNB
  V1 SSE    62.6   61.6   59.6   44.4
  V2 SSE    57.4   51.5   54.7   50.0
  V1 AVX    -      -      -      76.2

Below we report the IACA results for the Sandy Bridge architecture with SSE and AVX. The pressure on the ALU ports is similar, but due to the doubled SSE load performance, Sandy Bridge needs only half the cycles for the loads in kernel V1. V1 is therefore faster than V2 on Sandy Bridge (see Table 2).

                 Issue port
  Variant     0    1    2    3    4    5    TP   µOPs   CP
  V1 SSE     16   20   14   13    3   19    20    85    56
  V2 SSE     20   26    9    8    1   21    26    85    72
  V1 AVX     18   20   22   21    6   30    30   114    90

So far we have assumed that all data resides in the L1 cache. The data transfers required to bring cachelines into L1 and back to memory are modeled separately. We assume that there is no overlap between data transfers and instruction execution. This is true at least for the L1 cache: it can either communicate with L2 to load or evict a cacheline, or it can deliver data to the registers, but not both at the same time. As a first approximation we also (pessimistically) assume that this "no-overlap" condition holds for all caches, and that a data transfer between any two adjacent levels in the memory hierarchy does not overlap with anything else. Since the smallest transfer unit is a 64-byte cacheline, the analysis will from now on be based on a full "cacheline update" (16 four-byte voxels), which corresponds to four (two) inner loop iterations when using SSE (AVX). We only consider the data traffic for voxel updates; the image data traffic is negligible in comparison, hence we assume that all image data comes from the L1 cache. It takes two cycles to transfer one cacheline between adjacent cache levels over the 256-bit unidirectional data path. Every modified line must eventually be evicted, which takes another two cycles. Figure 5 shows the full analysis, in which the core execution time for a complete cacheline update is based on the measured cycles from Table 2. On the three architectures with an L3 cache, the simplification is made that the "Uncore" part (L3 cache, memory interface, and QuickPath interconnect) runs at the same frequency as the core, which is not strictly true but does not change the results significantly.

It was shown for the Nehalem-based architectures (Westmere and Westmere EX) that they can overlap instruction execution with reloading data from memory into the last-level cache [24]. Hence, the model predicts that the in-core execution time is much larger than all other contributions, which makes this algorithm limited by instruction throughput for single-core execution. On Sandy Bridge, the AVX kernel requires 76.2 cycles for one vectorized loop iteration (eight updates). This results in 152 cycles instead of 178 cycles (SSE) for one cacheline update.

Based on the runtime of the loop kernel, we can now estimate the total required memory bandwidth for multithreaded execution if all cores on a socket are utilized, and also derive the expected performance (we consider the full volume without clipping):

                       HPT    WEM    WEX    SNB    SNB (AVX)
  BW/core [GB/s]       1.7    1.9    1.5    2.5    3.0
  BW/socket [GB/s]     6.8    11.2   11.6   10.0   12.0
  Perf. [GUP/s]        0.85   1.42   1.45   1.25   1.51
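The accounting behind these numbers can be illustrated for the Sandy Bridge AVX case (our arithmetic, using the 3.5 GHz turbo clock): one cacheline update takes 2 · 76.2 ≈ 152 core cycles and moves 128 B of voxel traffic (a 64 B load plus a 64 B evict), hence

\[
\frac{128\,\mathrm{B} \cdot 3.5\,\mathrm{GHz}}{152\ \mathrm{cycles}} \approx 2.9\ \mathrm{GB/s\ per\ core}, \qquad
4\ \mathrm{cores} \cdot \frac{16\ \mathrm{updates} \cdot 3.5\,\mathrm{GHz}}{152\ \mathrm{cycles}} \approx 1.5\ \mathrm{GUP/s},
\]

consistent with the table entries.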
We conclude that the multithreaded code is bandwidth-limited only on Harpertown, since the required socket bandwidth is above the practical limit given by the update benchmark (see Table 1). All other architectures stay below their data transfer capabilities for this operation and should show no benefit from further bandwidth-reducing optimizations (see Sect. 6.2).

5.2 ILP optimization and SMT

At this point the analysis still neglects the possible benefit of SMT. SMT can significantly improve the efficiency of the floating-point units for codes that are limited by instruction throughput and suffer from underutilization of the arithmetic units due to dependencies or instruction scheduling issues. This is definitely the case here, as indicated by the discrepancy between the "throughput" and "critical path" predictions in the previous section. Due to the complex loop body, register dependencies are unavoidable, resulting in many pipeline bubbles. Outer loop unroll and jam (interleaving two outer loop iterations in the inner body) is out of the question due to register shortage, but SMT can do a similar job and provide independent instruction streams using independent register sets. Since memory bandwidth is not a limitation on any of the SMT-enabled processors considered here, running two threads on the two virtual cores of each physical core is expected to reduce the cycles taken for a cacheline update. However, the effect of using SMT is difficult to estimate quantitatively. See Sect. 7 below for complete parallel results.

6. OPENMP PARALLELIZATION

OpenMP parallelization of the algorithm is straightforward and works with all optimizations discussed so far. For the thread counts and problem sizes under consideration here, it is sufficient to parallelize the loop that iterates over all voxel volume slices (loop variable z in Listing 1). However, due to the clipped-off voxels at the edges and corners of the volume, simple static loop scheduling with the default chunksize leads to a strong load imbalance. This can easily be corrected by using block-cyclic scheduling with a small chunksize (e.g., static,1); a sketch follows below. Images are produced one by one during the C-arm rotation, and could at best be delivered to the application in batches. Since the reconstruction should start as soon as images become available, a parallelization across images was not considered.

As shown in Sect. 5, the socket-level performance analysis does not predict strong benefits from bandwidth-reducing optimizations except on the Harpertown platform. However, since one can expect to see more bandwidth-starved processor designs with a more unbalanced ratio of peak performance to memory bandwidth in the future, we still consider bandwidth optimizations important for this algorithm. Furthermore, ccNUMA architectures have become omnipresent even in the commodity market, making locality and bandwidth awareness mandatory. In the following sections we describe a proper ccNUMA page placement strategy for voxel and image data, and a blocking optimization for bandwidth reduction. The reason why we present those optimizations in the context of shared-memory parallelization is that they become relevant only in the parallel case, since bandwidth is not a problem for serial execution on any of these architectures (see Sect. 5.1).
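A minimal sketch of this parallelization (our code, not the paper's; backproject_line is a hypothetical stand-in for the line update kernel). The same loop and schedule are also used to first-touch the voxel array, anticipating the ccNUMA placement of Sect. 6.1:

  #include <stddef.h>

  void backproject_line(float *VOL, const float *I, const float *A,
                        int y, int z);

  void reconstruct(float *VOL, const float *I, const float *A, int L)
  {
      /* First-touch initialization with the same schedule as the
         compute loop, so each thread's voxel slices land in its own
         ccNUMA domain. */
      #pragma omp parallel for schedule(static,1)
      for (int z = 0; z < L; z++)
          for (size_t i = 0; i < (size_t)L * L; i++)
              VOL[(size_t)z * L * L + i] = 0.0f;

      /* Block-cyclic distribution of slices balances the clipped
         workload. */
      #pragma omp parallel for schedule(static,1)
      for (int z = 0; z < L; z++)
          for (int y = 0; y < L; y++)
              backproject_line(VOL, I, A, y, z);
  }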
6.1 ccNUMA placement

The reconstruction algorithm uses essentially two relevant data structures: the voxel array and the image data arrays. Upon voxel initialization, one can easily employ first-touch initialization, using the same OpenMP loop schedule (i.e., access pattern) as in the main program loop. This way, each thread has local access (i.e., within its own ccNUMA domain) to its assigned voxel layers, and the full aggregate bandwidth of a ccNUMA node can be utilized.

Although access to the projection image data is much less bandwidth-intensive than the memory traffic incurred by the voxel updates, ccNUMA page placement was implemented here as well. As mentioned in Sect. 3.3, the padded projection buffers are explicitly allocated and initialized in each locality domain, and a local copy is shared by all threads within a domain. Since the additional overhead for the duplication is negligible, this ensures conflict-free local access to all image data. The time taken to copy the images to the local buffers is included in the runtime measurements.

6.2 Blocking/unrolling

In order to reduce the pressure on the memory interface, we use a simple blocking scheme for the outer loop over all images: projections are loaded and copied to the padded projection buffers in small chunks, i.e., b images at a time. The line update kernel (see Sect. 4) for a certain pair of (y, z) coordinates is then executed b times, once for each projection; a sketch of this loop structure follows below. This corresponds to a b-way unrolling of the image loop and a subsequent jam into the next-to-innermost voxel loop (across the y voxel coordinate). At the problem sizes studied here, all the voxel data for one line can be kept in the L1 cache and reused b − 1 times. Hence, the complete volume is updated in memory only 496/b times instead of 496 times. Relatively small unrolling factors b between 2 and 8 are thus sufficient to reduce the bandwidth requirements to uncritical levels, even on "starved" processors like the Intel Harpertown. This optimization is so effective that it renders proper ccNUMA placement all but obsolete; we will thus not report the benefit of ccNUMA placement in our performance results, although it is certainly performed in the code.
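A structural sketch of the b-way blocking (ours; copy_to_padded_buffers and line_update are hypothetical names, and clipping and bounds handling are omitted):

  /* Process the projections in chunks of b images; a voxel line stays
     in L1 while it is updated once per image in the chunk, so the
     volume is streamed through memory only nproj/b times. */
  void copy_to_padded_buffers(int first, int count);
  void line_update(int proj, int y, int z);

  void backproject_blocked(int L, int nproj, int b)   /* nproj = 496 */
  {
      for (int p = 0; p < nproj; p += b) {
          copy_to_padded_buffers(p, b);
          #pragma omp parallel for schedule(static,1)
          for (int z = 0; z < L; z++)
              for (int y = 0; y < L; y++)
                  for (int k = 0; k < b; k++)   /* unroll and jam over images */
                      line_update(p + k, y, z);
      }
  }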
7. RESULTS

In order to evaluate the benefit of our optimizations, we have benchmarked different code versions with the 512³ case on all test machines. RabbitCT includes a benchmarking application which takes care of timing and error checking. It reports the total runtime in seconds for the complete backprojection. We performed additional hardware performance counter measurements using the likwid-perfctr tool. likwid-perfctr can produce high-resolution timelines of counter data and useful derived metrics at the core and node level without changes to the source code. Unless stated otherwise, we always report results using two SMT threads per core. For all architectures apart from Sandy Bridge, the line update kernel version V2 was used. For Sandy Bridge, results for the SSE kernel V1 as well as for the AVX port of the V1 kernel are presented.

7.1 Validation of analytical predictions

To validate the predicted performance of the analytic model (see Sect. 5), single-socket runs were performed without the clipping optimization and without SMT. Blocking was used on the Harpertown platform only, to ensure that execution is not dominated by memory access. The following table shows the measured performance and the deviation from the model prediction:

                   HPT      WEM      WEX      SNB      SNB (AVX)
  Perf. [GUP/s]    0.75     1.20     1.30     1.11     1.28
  Deviation        -13.3%   -18.3%   -11.5%   -12.6%   -18.0%

This demonstrates that the model has reasonable predictive power. It has been confirmed that the contribution of the data transfers indeed vanishes against the core runtime, despite the fact that the total transfer volume is high and a first rough estimate based on data transfers and arithmetic throughput alone (Sect. 3) predicted a bandwidth limitation of this algorithm on all machines. As a general rule, the IACA tool can provide a rough estimate of the innermost loop kernel runtime via static code analysis. Still, it is necessary to further refine the machine model to improve the accuracy of the predictions. Especially the ability of the out-of-order scheduler to exploit superscalar execution was overestimated, which led to qualitatively wrong predictions. Note that this example is an extreme case, with all data transfers vanishing against the core runtime. However, the approach also works for bandwidth-limited codes, as was shown in [23].

7.2 Impact of optimizations on accuracy

One of the costly operations in this algorithm is the divide involved. A possible optimization is to use the fully pipelined reciprocal (rcpps), which provides an approximation with 12-bit accuracy, compared to the 24-bit accuracy of the regular divide instruction. The accuracy of the reciprocal can be improved to at least 21 bits using a Newton-Raphson iteration [26]; see the sketch below. This requires four additional arithmetic instructions in the implementation. The following table shows the peak signal-to-noise ratio (PSNR) and the performance of the three alternatives (regular divide instruction, reciprocal, and reciprocal with Newton-Raphson iteration) on the Westmere test machine for the fully optimized case (full node):

                    Performance [GUP/s]   PSNR [dB]
  divps             3.79                  103
  rcpps             4.21                  67.7
  Newton-Raphson    3.79                  103

As usual in image processing, the PSNR was determined from the logarithm of the mean quadratic deviation from a reference image:

\[
\mathrm{PSNR} = 10\cdot\log_{10}\frac{M^2}{L^{-3}\sum_{x,y,z}\left[V(x,y,z)-R(x,y,z)\right]^2} \qquad (1)
\]

Here V(x,y,z) is the reconstructed voxel grayscale value at coordinates (x, y, z), scaled to the interval [0, M], and R(x,y,z) is a reference voxel with the "correct" result. The higher the PSNR, the more accurate the reconstruction. There is no significant difference, either in performance or in accuracy, between the regular divide and rcpps with Newton-Raphson. The performance improvement from using rcpps alone is 10%, at an accuracy which is still better than that of the published GPGPU implementations (63–65 dB). In the following section we discuss performance results using this latter version.
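A sketch of the refinement in SSE intrinsics (ours; the production kernel does this in assembly): one Newton-Raphson step x1 = x0(2 − w·x0) applied to the 12-bit rcpps seed:

  #include <xmmintrin.h>

  /* Approximate 1/w with rcpps, then refine with one Newton-Raphson
     step to recover most of the lost accuracy. */
  static __m128 rcp_nr(__m128 w)
  {
      __m128 x0 = _mm_rcp_ps(w);                    /* ~12-bit seed */
      __m128 t  = _mm_sub_ps(_mm_set1_ps(2.0f),
                             _mm_mul_ps(w, x0));    /* 2 - w*x0 */
      return _mm_mul_ps(x0, t);                     /* x1 = x0*(2 - w*x0) */
  }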
7.3 Parallel results

Figures 6(a)–(d) display a summary of all performance results at the node and socket levels, and the parallel scaling inside one socket for the best version on each architecture. All machines show nearly ideal scaling inside one socket when using physical cores only. With SMT, the benefit is considerable on Sandy Bridge (33%) and Westmere (31%), and a little smaller on Westmere EX (25%). The large effect on Sandy Bridge may be attributed to a higher number of bubbles in the pipeline, as indicated by the larger discrepancy between the "throughput" and "critical path" cycles in the AVX loop kernel (see Sect. 5.2). Scalability from one to all sockets of the node is also close to perfect for the multisocket machines, with the exception of Westmere EX, on which there is a slight load imbalance due to 80 threads working on only 512 slices.

Depending on the architecture, SSE vectorization boosts performance by a factor of 2–3 at the socket level. As explained earlier (see Sect. 4), part 2 of the algorithm prevents the optimal speedup of 4, because its runtime is linear in the SIMD vector length. Work reduction through clipping alone shows only a limited effect due to load imbalance, but this can be remedied by an appropriate cyclic OpenMP scheduling, as described in Sect. 6. This kind of load balancing not only improves the work distribution but also leads to a more similar access pattern to the projection images across all threads. This can be seen in Fig. 7(a), which shows the cacheline traffic between the L2 and L1 caches during a 6-thread run on one Westmere socket with all optimizations except clipping (only 3 cores are shown for clarity). Although the amount of work, i.e., the number of voxels, is perfectly balanced across all threads, there is a strong discrepancy in cacheline traffic between threads when standard static scheduling is used. The reason for this is that the projections of voxel lines onto the detector are quite different for lines that are far apart in the volume, which leads to vastly different access patterns to the image data, and hence very dissimilar locality properties. Cyclic scheduling removes this variation (see Fig. 7(b)).

Figure 6: Scalability and performance results for the 512³ test case on all platforms: (a) Harpertown, (b) Sandy Bridge, (c) Westmere, (d) Westmere EX. Each panel shows performance in GUP/s versus thread count for the plain C, SSE (plus AVX on Sandy Bridge), clipping, load balancing, and blocking versions. In-socket scalability was tested using the best version of the SIMD-vectorized line update kernel on each system (AVX-V1 on Sandy Bridge, SSE-V2 on all others). The best node-level results correspond to 64.0 GFlop/s (HPT), 69.4 GFlop/s (SNB), 130 GFlop/s (WEM), and 286 GFlop/s (WEX). The practical performance goal for complete reconstruction (20 seconds runtime, corresponding to 3.33 GUP/s) is indicated as a dashed line. GFlop/s numbers have been computed assuming 31 flops per optimized (scalar) inner loop iteration. Note the scale change between the left and right pairs of graphs.

Figure 7: Timeline view of the L2-to-L1 cacheline traffic for three cores of a 6-thread run without clipping on one Westmere socket, using (a) standard static OpenMP scheduling and (b) cyclic "static,1" scheduling.
Cache blocking has little to no effect on all architectures except Harpertown, as predicted by our analysis. Fig. 8 shows timelines of socket floating-point performance and bandwidth on one Westmere socket, comparing blocked/nonblocked and SMT/non-SMT variants. Fig. 8(a) clearly demonstrates the performance boost from SMT, in contrast to the very limited effect of blocking. On the other hand, the blocked implementation significantly reduces the bandwidth requirements (Fig. 8(b)). The blocked variants show a noticeable amplitude of variation, while the nonblocked versions appear smooth. In the inset of Fig. 8(b) we show a zoomed-in view with finer time resolution, which indicates that the frequency of bandwidth variations is much higher without blocking; this is plausible, since the bandwidth is dominated by the voxel volume in this case. With blocking, individual voxel lines are transferred in "bursts," with phases of inactivity in between, during which image data is read at low bandwidth.

The benefit of AVX on Sandy Bridge falls short of expectations for the same reason as in the SSE case. Still, it is remarkable that the desktop Sandy Bridge system outperforms the dual-socket Harpertown server node, which features twice the number of cores at a similar clock speed. Both Westmere and Westmere EX meet the performance requirement of at most 20 sec for a complete volume reconstruction. The Westmere EX node is, however, not competitive due to its unfavorable price/performance ratio. It is an option if absolute performance is the only criterion.

Figure 8: Performance counter timeline monitoring of floating-point performance (a) and memory bandwidth (b), comparing blocked/nonblocked and SMT/non-SMT variants of the best implementation on one Westmere socket at 100 ms resolution. The inset in (b) shows a zoomed-in view with 2 ms resolution.

8. CPU VS. GPGPU

Since the backprojection algorithm is well suited for GPGPUs, the performance leaders of the open-competition benchmark have traditionally been GPGPU implementations in OpenCL and CUDA. The gap to the fastest CPU version reported on the RabbitCT Web site [9] before this work was started was very large. For the reasons given in the derivation of the performance model, simple bandwidth or peak performance comparisons are inadequate for estimating the expected reconstruction speed advantage of GPGPUs, although it should certainly lie inside the usual corridor of 4–10× when comparing with a full CPU socket. An example of a well-optimized GPU code was published recently [27]. Our implementation shows that current x86 multicore chips are truly competitive (see Fig. 9), even when considering the price/performance ratio. Beyond the clinically relevant 512³ case, industrial applications need higher resolutions.
This is a problem for GPGPU implementations, because the local memory on the card is often too small to hold the complete voxel volume, causing extra overhead for moving partial volumes in and out of the local memory and leading to a more complex implementation. If the price of the hardware is unimportant, a Westmere EX node is an option that can easily outperform GPGPUs. Note that the unusually large gap between the performance levels at 512³ and 1024³ on this architecture may be attributed to better load balancing when 80 threads work on 1024 instead of 512 slices. It also lifts the performance efficiency (fraction of node peak) on this system to about 50%, matching the level achieved by the other multicore nodes already at the smaller problem size. This compares to 26% for the CUDA code on the GTX480 card, assuming the same number of flops per voxel update. The results for the Sandy Bridge desktop system, and the good scalability even of the optimized algorithm, promise even better performance levels on commodity multicore systems in the future. Note that it would also be possible to provide a simple and efficient distributed-memory parallelization of the algorithm for even faster reconstruction. "Microclusters" based on cheap desktop technology could then easily meet any time constraint at extremely low cost. However, a comparison based on hardware cost alone is certainly too simplistic, especially when taking into account the overall cost of a complete CT scanner, including maintenance.

Figure 9: Performance comparison between the best reported GPGPU implementations in OpenCL and CUDA (GTX260 OpenCL, FX5800 CUDA 2.1, GTX480 CUDA 3.0) and our CPU versions on the systems in the testbed, for problem sizes 512³ and 1024³. There is currently no working CUDA implementation for the latter case. The practical performance goal for complete reconstruction of a 512³ volume (3.33 GUP/s) is indicated by a dashed line.

9. CONCLUSIONS AND OUTLOOK

We have demonstrated several algorithmic and low-level optimizations for a CT backprojection algorithm on current Intel x86 multicore processors. Highly optimizing compilers were not able to deliver useful SIMD-vectorized code. Our implementation is thus based on assembly language, vectorized using the standard instruction set extensions SSE and AVX. The results show that commodity hardware can be competitive with highly tuned GPU implementations for the clinically relevant 512³ voxel case at the same level of accuracy. The nonpipelined divide instruction (divps), or the fast pipelined reciprocal (rcpps) with a subsequent Newton-Raphson iteration, provides better accuracy at a 10% performance penalty. Our results showed that it is necessary to consider all aspects of processor and system architecture in order to reach best performance, and that the effects of different optimizations are closely connected to each other. The benefit of the AVX instruction set on Sandy Bridge was limited due to the lack of a gather load and the small number of instructions that natively operate on the full SIMD register width. This relevant algorithm can achieve very good efficiencies on commodity processors, and it would be a natural step to further improve performance with a distributed-memory implementation.
Our results showed that it is necessary to consider all aspects of processor and system architecture in order to reach the best performance, and that the effects of different optimizations are closely connected to each other. The benefit of the AVX instruction set on Sandy Bridge was limited due to the lack of a gathered load and the small number of instructions that natively operate on the full SIMD register width. This relevant algorithm can achieve very good efficiencies on commodity processors, and it would be a natural step to further improve performance with a distributed-memory implementation. At higher resolutions, which are used in industrial applications, multicore systems are frequently the only choice (apart from expensive custom solutions).

Future work includes a more thorough analysis and optimization of the AVX line update kernel, and an inclusion of the new AVX2 gather operations once they become available.
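To illustrate the gather issue: with AVX, the eight projection-image samples needed for a register-wide update must be assembled from scalar loads, whereas AVX2 provides a hardware gather instruction. The sketch below is our own illustration under stated assumptions; the function names and calling convention are hypothetical, not taken from the paper.

    #include <immintrin.h>

    /* AVX (no hardware gather): assemble 8 floats via scalar loads. */
    static inline __m256 gather8_avx(const float *base, const int idx[8])
    {
        return _mm256_setr_ps(base[idx[0]], base[idx[1]],
                              base[idx[2]], base[idx[3]],
                              base[idx[4]], base[idx[5]],
                              base[idx[6]], base[idx[7]]);
    }

    /* AVX2: the same gather as a single instruction (compile with -mavx2).
       The scale factor (4 = sizeof(float)) must be a compile-time constant. */
    static inline __m256 gather8_avx2(const float *base, __m256i idx)
    {
        return _mm256_i32gather_ps(base, idx, 4);
    }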
10. ACKNOWLEDGMENTS

We are indebted to Intel Germany for providing test systems and early-access hardware for benchmarking. This work was supported by the Competence Network for Scientific High Performance Computing in Bavaria (KONWIHR) under project OMI4papps.

11. REFERENCES

[1] A. Kak and M. Slaney. Principles of Computerized Tomographic Imaging (SIAM), 2001.
[2] N. Strobel et al. 3D Imaging with Flat-Detector C-Arm Systems. In: Multislice CT (Springer, Berlin/Heidelberg), 3rd ed. ISBN 978-3-540-33125-4, 33-51.
[3] L. Feldkamp, L. Davis and J. Kress. Practical Cone-Beam Algorithm. Journal of the Optical Society of America A 1(6), (1984) 612-619.
[4] B. Heigl and M. Kowarschik. High-speed reconstruction for C-arm computed tomography. In: 9th International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine (www.fully3d.org, Lindau), 25-28.
[5] K. Mueller and R. Yagel. Rapid 3D cone-beam reconstruction with the Algebraic Reconstruction Technique (ART) by utilizing texture mapping graphics hardware. Nuclear Science Symposium 1998, Conference Record, vol. 3, (1998) 1552-1559.
[6] K. Mueller, F. Xu and N. Neophytou. Why do Commodity Graphics Hardware Boards (GPUs) work so well for acceleration of Computed Tomography? In: SPIE Electronic Imaging Conference, vol. 6498 (San Diego), 64980N.1-64980N.12.
[7] H. Scherl, M. Koerner, H. Hofmann, W. Eckert, M. Kowarschik and J. Hornegger. Implementation of the FDK algorithm for cone-beam CT on the Cell Broadband Engine architecture. In: J. Hsieh and M. J. Flynn (eds.), SPIE Medical Imaging Conference Proc., vol. 6510 (SPIE), 651058.
[8] C. Rohkohl, B. Keck, H. G. Hofmann and J. Hornegger. RabbitCT -- An Open Platform for Benchmarking 3-D Cone-beam Reconstruction Algorithms. Medical Physics 36(9), (2009) 3940-3944. http://link.aip.org/link/?MPH/36/3940/1
[9] RabbitCT Benchmark. http://www.rabbitct.com/
[10] H. G. Hofmann, B. Keck, C. Rohkohl and J. Hornegger. Comparing Performance of Many-core CPUs and GPUs for Static and Motion Compensated Reconstruction of C-arm CT Data. Medical Physics 38(1), (2011) 3940-3944.
[11] H. Kunze, W. Härer and K. Stierstorfer. Iterative extended field of view reconstruction. In: J. Hsieh and M. J. Flynn (eds.), Medical Imaging 2007: Physics of Medical Imaging, vol. 6510 of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series (San Diego), 65105X.
[12] C. Rohkohl, G. Lauritsch, M. Prümmer and J. Hornegger. Interventional 4-D Motion Estimation and Reconstruction of Cardiac Vasculature without Motion Periodicity Assumption. In: G.-Z. Yang, D. Hawkes, D. Rueckert, A. Noble and C. Taylor (eds.), Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2009, vol. 5761 of Lecture Notes in Computer Science (Springer, Berlin/Heidelberg). ISBN 978-3-642-04267-6, 132-139.
[13] B. Keck, H. Hofmann, H. Scherl, M. Kowarschik and J. Hornegger. High Resolution Iterative CT Reconstruction using Graphics Hardware. In: B. Yu (ed.), 2009 IEEE Nuclear Science Symposium Conference Record, 4035-4040. http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2009/Keck09-HRI.pdf
[14] Intel. Intel Advanced Vector Extensions Programming Reference, Apr 2011.
[15] Intel Many Integrated Core architecture. http://www.intel.com/technology/architecture-silicon/mic/index.htm
[16] Godson 3B architecture. http://www.theregister.co.uk/2011/02/25/ict_godson_3b_chip/page2.html, Feb 2011.
[17] The STREAM Benchmark. http://www.streambench.org/
[18] J. Treibig, G. Hager and G. Wellein. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In: Proceedings of PSTI2010, the First International Workshop on Parallel Software Tools and Tool Infrastructures (San Diego, CA).
[19] LIKWID performance tools. http://code.google.com/p/likwid
[20] I. Goddard, A. Berman, O. Bockenbach, F. Lauginiger, S. Schuberth and S. Thieret. Evolution of computer technology for fast cone beam backprojection. In: Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, vol. 6498.
[21] H. G. Hofmann, B. Keck and J. Hornegger. Accelerated C-arm Reconstruction by Out-of-Projection Prediction. In: T. M. Deserno, H. Handels, H.-P. Meinzer and T. Tolxdorff (eds.), Bildverarbeitung für die Medizin 2010 (Berlin). ISBN 978-3-642-11967-5, 380-384. http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2010/Hofmann10-ACR.pdf
[22] G. Hager and G. Wellein. Introduction to High Performance Computing for Scientists and Engineers (CRC Press), 2010. ISBN 978-1-4398-1192-4.
[23] J. Treibig and G. Hager. Introducing a performance model for bandwidth-limited loop kernels. In: Proceedings of the Workshop on Memory Issues on Multi- and Manycore Platforms at PPAM 2009, the 8th International Conference on Parallel Processing and Applied Mathematics (Wroclaw, Poland). http://arxiv.org/abs/0905.0792
[24] J. Treibig, G. Hager and G. Wellein. Complexities of performance prediction for bandwidth-limited loop kernels on multi-core architectures. In: High Performance Computing in Science and Engineering (Springer, Berlin/Heidelberg, Garching/Munich). ISBN 978-3-642-13871-3.
[25] Intel. Intel Architecture Code Analyzer, Apr 2011. http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
[26] Intel. AP-803: Increasing the Accuracy of the Results from the Reciprocal and Reciprocal Square Root Instructions using the Newton-Raphson Method, Apr 1999.
[27] E. Papenhausen, Z. Zheng and K. Mueller. GPU-Accelerated Back-Projection Revisited: Squeezing Performance by Careful Tuning. Workshop on High Performance Image Reconstruction (HPIR), 2011.
