PVM-Distributed Implementation of the Radiance Code


Authors: Francisco R. Villatoro, Antonio J. Nebro, José E. Fernández-Martín

F. R. Villatoro (corresponding author)
Departamento de Lenguajes y Ciencias de la Computación, E.T.S. Ingenieros Industriales, Universidad de Málaga. villa@lcc.uma.es

A. J. Nebro and J. E. Fernández-Martín
Departamento de Lenguajes y Ciencias de la Computación, E.T.S. Ingenieros en Informática, Universidad de Málaga. antonio@lcc.uma.es

Abstract. The Parallel Virtual Machine (PVM) tool has been used for a distributed implementation of Greg Ward's Radiance code. In order to generate exactly the same primary rays with both the sequential and the parallel codes, the quincunx sampling technique used in Radiance to reduce the number of primary rays by interpolation must be left untouched in the parallel implementation. The octree of local ambient values used in Radiance for the indirect illumination has been shared among all the processors. Both static and dynamic image partitioning techniques, which replicate the octree of the complete scene in all the processors and provide load balancing, have been developed for single-frame rendering. Speedups larger than 7.5 have been achieved in a network of 8 workstations. For animation sequences, a new dynamic partitioning distribution technique with superlinear speedups has also been developed.

Keywords: Parallel Virtual Machine; Radiance; photorealistic rendering; parallel ray tracing.

1 Introduction

Radiance [1, 2] is a global illumination package based on a hybrid approach that uses deterministic ray tracing, path tracing and Monte Carlo techniques [3]. It is widely used in architectural lighting design and visualization. However, Radiance requires a considerable amount of computation before an image can be generated, so its parallel implementation has to be addressed.
The standard distribution of Radiance provides the possibility of parallel execution of the rendering phase of a single image using a memory-sharing algorithm, and of parallel/distributed execution during the rendering of animations with many frames, where the frames to be rendered are distributed between the processors, each one running the sequential code. However, better techniques have to be developed.

The kernel for the path tracing of Radiance is a ray tracer, and the techniques developed for parallel ray tracing can be applied straightforwardly. A ray tracer can be distributed by partitioning the image (pixel) space or the scene (object) space [4, 5]. Image partitioning techniques yield better results, but have the problem that the complete octree of the scene has to be replicated in all the processors, so the size of the image which can be rendered is limited by the amount of memory available in the computers. Object partitioning techniques have the problem that the ray/intersection process must test every ray against all the objects of the scene, and this may require that a ray be traced by all the processors [6].

A hybrid image/object-space technique combining demand-driven and data-parallel execution for the parallelization of Radiance using PVM has been developed by Reinhard and Chalmers [7]. These authors do not preserve the quincunx algorithm, do not distribute the octree of local ambient values and do not obtain superlinear speedups.

The purpose of this work is to distribute the Radiance code in a network of heterogeneous UNIX workstations using the Parallel Virtual Machine (PVM) tool. In this paper, several image partitioning techniques for the distributed implementation of Radiance are studied.
These techniques are based on a client/server architecture, where the server receives the image/animation to be rendered, distributes the work to be done between the clients, manages the load-balancing techniques, collects statistical data on the rendering phase and generates the output image. In this paper several techniques for the partitioning of the image space are developed and tested using the Parallel Virtual Machine (PVM) tool [8]. In Section 2, the basic structure of Radiance is recalled and the quincunx sampling algorithm is presented in more detail. In Section 3, the basic techniques for the distribution of a ray tracer are reviewed with emphasis on image partitioning techniques and, in Section 4, they are applied to Radiance with emphasis on the preservation of its acceleration techniques. Section 5 is devoted to a static image partitioning technique with an estimator for the best partition, and a new technique for load balancing is developed. Dynamic image partitioning techniques based on both scanbars and windows are developed in Section 6, including an application to the rendering of animation sequences. The main results and conclusions are presented in Sections 7 and 8, respectively.

2 Radiance basics

Radiance [1, 2] is a global illumination package which solves the radiance or rendering equation, including participating media, by path tracing with Russian roulette, and selects new rays based on a hybrid approach using both deterministic ray tracing and Monte Carlo techniques with importance sampling depending on the Bidirectional Reflectance Distribution Function (BRDF) [3]. Radiance uses a hybrid local/global illumination model and allows for the specification of isotropic and anisotropic BRDFs. Every object in a Radiance scene is a collection of modifiers, e.g., surfaces and volumes, which define the geometry; materials, which determine the illumination-model parameters; textures; patterns; and mixtures of several modifiers. The materials in Radiance include light sources, specular, Lambertian, isotropic and anisotropic general BRDF material models, and participating media.

The Radiance ray-tracing kernel includes three basic acceleration techniques: the scene is stored in an octree in order to accelerate the ray/object intersection process; the indirect illumination is calculated during the path tracing and stored as local ambient illumination in an octree; and the number of primary rays is reduced using an adaptive partitioning of the image space, including interpolation of pixel colors by a quincunx sampling technique [2].

The octree used by Radiance to subdivide the object space is created by the program oconv of the Radiance package and is independent of the renderer rpict. In all the results and comparisons made during this research, the time for the creation and distribution of the octree is considered part of the total computational time.

The octree that stores the local ambient values used for the calculation of the indirect illumination in Radiance has exactly the same structure as the one used for the scene. Each subcube of this octree contains an unbounded list with the local ambient illumination values calculated during the path tracing of the objects inside it. To determine the indirect illumination at one point in the scene, a hemisphere around this point is sampled using adaptive supersampling and, for each subdivision, the local ambient values are collected. A measure of the expected error is calculated and used for the adaptive supersampling process.
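As an illustration, the structure just described, subcubes holding unbounded lists of cached ambient values which are collected along the path down to the point being shaded, can be sketched as follows. This is a minimal model, not Radiance's actual C implementation; the class name, splitting threshold and fixed maximum depth are assumptions of the sketch, and the error-driven adaptive supersampling is not modeled.

```python
class AmbientOctree:
    """Minimal model of an octree whose subcubes hold lists of cached
    ambient (indirect illumination) values, as described in the text."""

    def __init__(self, center, half, depth=4):
        self.center, self.half, self.depth = center, half, depth
        self.values = []        # (point, value) pairs stored in this subcube
        self.children = None    # eight sub-octrees after splitting

    def _child_index(self, p):
        # one bit per axis: which side of the center the point lies on
        return sum((p[i] >= self.center[i]) << i for i in range(3))

    def insert(self, p, value, max_per_cube=4):
        node = self
        while node.children is not None:
            node = node.children[node._child_index(p)]
        node.values.append((p, value))
        if len(node.values) > max_per_cube and node.depth > 0:
            node._split()

    def _split(self):
        h = self.half / 2
        self.children = [
            AmbientOctree(
                tuple(self.center[k] + (h if (i >> k) & 1 else -h)
                      for k in range(3)),
                h, self.depth - 1)
            for i in range(8)
        ]
        for p, v in self.values:   # push stored values down one level
            self.children[self._child_index(p)].values.append((p, v))
        self.values = []

    def collect(self, p):
        """Collect the ambient values cached along the path from the
        root down to the subcube containing point p."""
        node, found = self, []
        while node is not None:
            found.extend(node.values)
            node = (node.children[node._child_index(p)]
                    if node.children else None)
        return found
```

In the distributed implementation discussed later, each client would insert both its own ambient values and those broadcast by the other clients into such a structure.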
Radiance divides the image space (hres × yres pixels) into samples (xstep × ystep pixels) and uses a sampling technique, referred to as quincunx sampling, to calculate the colors of the pixels in each sample [1]. A scanbar is a collection of ystep consecutive scanlines, each one divided into samples of xstep pixels. Note that both the last scanbar and the last sample may have fewer pixels than the rest. Two consecutive scanbars share a common scanline (except for the first one), which is calculated only once. There are pixels in a scanline for which no primary ray is traced and an interpolation is made; the number of these pixels is determined by a ray density vector which is updated during the quincunx sampling.

Each scanline is divided into samples of xstep + 1 pixels, sharing one pixel between consecutive samples. Odd and even scanlines are sampled in slightly different manners. Primary rays are traced through the pixels at the extremes of the samples, and through other pixels inside the sample depending on the values of the ray density vector; these values can change as a function of the number of rays traced during the color filling of the sample. The rest of the pixels in the sample are calculated by interpolation.

The color interpolation between pixels used in the horizontal or vertical direction is based on a local density of rays (the ray density vector) and proceeds as follows. Given the two extreme pixels in the sample, if the difference between their colors exceeds a given tolerance, or the local density of rays for this sample is greater than zero, a ray is traced through the center of the sample and the local density of this sample is changed to 1. Otherwise, the color of the center pixel is calculated by interpolation and the local density of this sample is changed to 0.
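A minimal sketch of this trace-or-interpolate rule, together with the recursive subdivision of the sample described next in the text, is given below. Grayscale colors are used for simplicity; trace(i) is a hypothetical callback that casts a primary ray through pixel i, and the tolerance value and the exact halving of the density are assumptions of the sketch, not Radiance's actual parameters.

```python
def fill_sample(colors, lo, hi, density, trace, tol=0.25):
    """Fill pixels lo..hi of a sample, given that colors[lo] and
    colors[hi] were already traced. trace(i) casts a primary ray
    through pixel i; other pixels may be interpolated."""
    if hi - lo < 2:
        return
    mid = (lo + hi) // 2
    if abs(colors[lo] - colors[hi]) > tol or density > 0:
        colors[mid] = trace(mid)        # colors too different: trace a ray
        new_density = 1.0               # local density of this sample -> 1
    else:
        colors[mid] = (colors[lo] + colors[hi]) / 2   # interpolate
        new_density = 0.0               # local density of this sample -> 0
    # remaining pixels: recurse on the two sub-samples defined by the
    # center pixel, with half the local density of rays
    fill_sample(colors, lo, mid, new_density / 2, trace, tol)
    fill_sample(colors, mid, hi, new_density / 2, trace, tol)
```

With a smooth sample (extreme colors within tolerance and zero density) every interior pixel is interpolated and no primary ray is traced, which is exactly the saving the quincunx technique is designed to provide.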
If there are pixels remaining in the sample without color, the same procedure is applied recursively to the two sub-samples defined by the center pixel, with a new local density of rays equal to half of the original one.

3 Radiance distribution basics

Radiance provides the possibility of parallel execution of the rendering phase of a single image using a memory-sharing algorithm (rpiece). The shared memory stores the octree of local ambient values used to accelerate the indirect illumination calculation; these values are written as they are calculated and are shared among all the processors. Radiance also provides a simple mechanism for parallel/distributed execution during the rendering of animations with many frames, where the frames to be rendered are distributed among the processors, each one running the sequential code.

The Parallel Virtual Machine (PVM) is a standard tool to distribute a sequential code, e.g., Radiance, in a network of heterogeneous UNIX workstations. This tool uses message passing as the basic communication mechanism among a system of distributed tasks which collaborate inside a virtual machine appearing as a single computer with a common distributed memory. The main characteristics of message passing in PVM are the following: the sender of a message is asynchronous and can continue execution; the receiver can act either synchronously, stopping execution to wait for a message, or asynchronously, checking whether any message has been received and continuing execution if not; a multicasting service is provided for sending a message to all the processes; and the message size is not limited.

The heart of Radiance is a ray tracer and, therefore, it can be implemented in a distributed shared memory system by partitioning either the image (pixel) space or the scene (object) space [4].
In object-space partitioning, the scene is divided amongst all the processors. Since the most demanding task in a ray tracer is the ray/object intersection, object partitioning accelerates this process because it reduces the number of objects in each processor. Two main schemes can be used [6]: ray data flow, where rays are transferred through all the processors in order to obtain the nearest intersection with the scene, and object data flow, where objects are moved through the processors, a virtual memory stores all the objects and every processor uses a local cache with the most frequently used objects. The success of both schemes depends on the coherence of the scene and of the rays.

In image partitioning, the pixels of the image are distributed amongst the processors and the scene is completely replicated. The partitioning can be static or dynamic. The performance of static partitioning depends on the technique used to distribute the pixels among the processors. In order to avoid idle processors, the parts of the image which are computationally more expensive to render must be divided into smaller pieces or sent to more powerful processors. Load balance can be improved during the first distribution of the pixels amongst the processors, based on estimators of the computational cost, or by redistribution of the load when some processors become idle.

In dynamic partitioning of the pixels of the image, a pool of pieces or samples of the image is built; the processors ask for samples of the image as they become idle, render each sample sequentially, and then request another one. The size of these samples of the image depends on the distribution technique used and on the nature of the algorithm used by Radiance to sample the image space. Load balancing can also be applied when using dynamic partitioning.
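The dynamic scheme just described, a server-side pool from which idle processors request samples, can be sketched with threads standing in for client processes. The render callback and the piece granularity are placeholders; the actual implementation in this paper uses PVM message passing between separate processes rather than shared-memory threads.

```python
import queue
import threading

def serve_pool(pieces, n_clients, render):
    """Server holds a pool of image pieces; each client takes a piece
    when idle, renders it sequentially, then requests another one,
    until the pool is exhausted."""
    pool = queue.Queue()
    for piece in pieces:
        pool.put(piece)
    results, lock = {}, threading.Lock()

    def client():
        while True:
            try:
                piece = pool.get_nowait()   # idle client asks for work
            except queue.Empty:
                return                      # no pieces left: client stops
            out = render(piece)             # render this piece sequentially
            with lock:
                results[piece] = out

    clients = [threading.Thread(target=client) for _ in range(n_clients)]
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    return results
```

Because a client only requests work when it becomes idle, faster processors automatically render more pieces, which is the source of the load balancing this scheme provides.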
Both static and dynamic partitioning techniques can obtain good speedups, but with the drawback that the complete scene has to be replicated in all the processors, so the processor with the smallest memory capacity limits the size of the scene which can be processed by these techniques.

4 Distributed implementation of Radiance

Several image partitioning techniques for the distributed implementation of Radiance are studied in the next sections. These techniques are based on a client/server architecture where the server receives the image/animation to be rendered, distributes the work to be done among the clients, manages the load-balancing techniques, collects statistical data on the rendering phase, and generates the output image.

Any parallel/distributed implementation must deal with the three basic acceleration techniques used by Radiance: the quincunx sampling technique, the octree representation of the scene, and the octree of local ambient values used for the indirect illumination. The most characteristic feature of Radiance is the quincunx sampling algorithm, and this must be preserved by any parallel implementation in order to generate exactly the same number of primary rays in both the sequential and parallel codes. This can be achieved by partitioning the image into scanbars and sharing the common scanline between consecutive scanbars among the corresponding processors.

In the image partitioning techniques developed in this paper, the octree for the scene is generated by the server and completely replicated among all the clients. Each client has its own octree of ambient values and broadcasts to the other clients all the new ambient values as they are calculated. Every client has a parallel task to update the local ambient values as they are received, without interference with the render task.
This guarantees that all the clients have a consistent copy of the same octree of ambient values. Moreover, this parallel updating of the ambient values may yield a reduction in the total number of rays required for the indirect illumination, since values in some parts of the scene are updated before the sequential code would update them.

5 Static partitioning of the image

The image space is partitioned into scanbars which are distributed among the processors. These scanbars share their last scanline, and attention had to be paid to avoid its recalculation by more than one processor. However, this partition of the image cannot be uniform, i.e., every processor cannot receive the same number of scanbars, because the computational cost of each scanbar changes considerably depending on the complexity of the scene visible inside it. To obtain better partitions, an estimator of the scene complexity, i.e., of the cost of each scanbar, must be developed.

In order to estimate the computational cost associated with a scanbar, several randomly selected primary rays are traced through the scanbar. To measure the cost of these rays, it is possible to count either the total number of rays generated in the rendering or the total number of ray/object intersections. The performance of both measures has been compared with the total computational time for several images, and both yield significant information on the complexity of the scanbar.

After the estimation of the computational cost of every scanbar, the scanbars are grouped into partitions of approximately the same computational cost and distributed amongst the processors. If the i-th partition has n_i scanbars and its cost is C_i(n_i), then the static distribution technique tries to obtain C_1(n_1) ≈ C_2(n_2) ≈ · · · ≈ C_p(n_p), where p is the number of processors.
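The grouping step can be sketched as a single greedy pass over the estimated costs c_j, closing a partition once it reaches the average target cost. This is a simplification of the iterative equalization procedure used in the paper, and the cost values in the usage example are illustrative only.

```python
def equalize(costs, p):
    """Group consecutive scanbars (with estimated costs `costs`) into p
    contiguous partitions of approximately equal total cost; returns a
    list of p lists of scanbar indices."""
    target = sum(costs) / p
    partitions, current, acc = [], [], 0.0
    for j, c in enumerate(costs):
        current.append(j)
        acc += c
        remaining_bars = len(costs) - j - 1
        remaining_parts = p - len(partitions) - 1
        # close this partition once the target cost is reached, but keep
        # at least one scanbar for every remaining processor
        if (acc >= target and remaining_parts > 0
                and remaining_bars >= remaining_parts):
            partitions.append(current)
            current, acc = [], 0.0
    partitions.append(current)
    return partitions
```

For example, with costs [1, 1, 1, 1, 2, 2] and 4 processors the greedy pass produces four partitions of cost 2 each, whereas a uniform split into equal scanbar counts would be unbalanced.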
If c_j is the estimated cost of the j-th scanbar, the above expression yields

  ∑_{j=1}^{n_1} c_j ≈ ∑_{j=n_1+1}^{n_1+n_2} c_j ≈ · · · ≈ ∑_{j=n_1+···+n_{p-1}+1}^{n_1+···+n_p} c_j,

which can be solved iteratively by an equalization procedure.

The clients, in the client/server model used for the static distribution of the image among the processors, execute the Radiance code sequentially on the partition of the image, i.e., the set of scanbars, that they have received from the server. In order to share the last scanline of its partition with another processor, every client starts its execution by calculating this scanline and transferring it to the corresponding processor, and then proceeds with the rest of the partition as usual.

In order to obtain larger speedups, a load-balancing technique controlled by the server has been implemented. The server receives requests from clients which become idle, collects information on the work remaining to be done in every client, and selects the client with the largest remaining work; if this work is larger than one scanbar, the server demands that this client transfer half of its scanbars remaining to be rendered to the client which has become idle, while the first client continues its execution. The last scanline shared between both processors is rendered by the client which has just received the new partition, which automatically resends it to the first client. It is possible that the first client receives another request to transfer part of its work to a third client before it receives the common scanline; in that case, the second client is notified to send the previously common scanline to this new client.

6 Dynamic partitioning of the image

In dynamic partitioning, the pixels are redistributed among the processors so that none is idle.
In a client/server architecture, the server holds a pool of pieces of the image and distributes this work to the clients as they make requests when they become idle. Two solutions have been developed, depending on the pieces used for the partition of the image: scanbars, which preserve exactly the quincunx algorithm, or windows, which maintain the quincunx algorithm only within each window. A dynamic partitioning technique for animation sequences has also been developed.

6.1 Scanbar partition of the image

In this implementation, every client renders a scanbar locally. In order to share the last scanline among the processors, each client calculates its first scanline and transfers it to another processor, and receives the last scanline from another client. Automatic load balancing is obtained.

The server receives the requests from the clients and assigns scanbars to them. With p clients, the first p scanbars that the server assigns are not consecutive. Later, when sending a new scanbar, the server checks whether it can assign a scanbar consecutive to the one the client has just finished rendering, so that the last scanline of the previous scanbar is shared locally, without communication through the network. If no consecutive scanbar can be assigned, another one is selected, and the client currently processing the preceding scanbar is informed of the processor holding its consecutive scanbar.

Every client renders a scanbar by the following steps: 1) It calculates the first scanline; as soon as it knows which processor is processing the scanbar sharing this scanline, it sends it automatically; this first scanline is sent incrementally, as every xstep pixels are calculated. 2) It renders the last scanline if the scanbar is the first one of the image.
3) It applies the quincunx sampling algorithm to fill the rest of the scanlines of the scanbar by interpolation; this process is done incrementally as the last scanline is received, if required. This procedure avoids that shared scanlines of high cost in one processor excessively delay the processor sharing this scanline.

6.2 Window partition of the image

In all the distribution solutions for the Radiance code presented above, the quincunx sampling algorithm for the selection of the primary rays was left untouched. Another possibility is to divide the image into uniformly sized windows. These windows may be distributed among the processors, and a client/server architecture may be adopted. The server holds a pool of windows which are sent out when a processor becomes idle. No sharing of scanlines is required. The client renders each window as the sequential code does, adjusting the camera parameters appropriately. The only information shared among processors is the values of indirect illumination that may be calculated on the other machines.

With this partition, the number of generated primary rays exceeds that of the sequential code and grows with the number of windows. The reasons for the increase in the number of rays are that both the total number of scanlines (although they have shorter length) and the number of primary rays for the image sampling are increased; moreover, an increase in the number of primary rays increases the ray density vector values, and further rays are traced. Note that for a few exceptional windows, the number of primary rays traced is even smaller.

The algorithm that the server uses to select the next window to be transferred to an idle process can affect the speedup.
The best strategy for the selection of the new window is to send first the windows with higher cost and then those of lower cost, but it is impossible to determine this order exactly a priori. Three strategies have been implemented: backward, forward and random selection. An experimental study indicates that the differences in the speedups achieved with these three strategies are small, even when the scenes used in the tests show clear differences in computational cost between the first or the last windows and the rest. Of course, the difference comes from the rendering of the last windows, where some processors can become idle without further work to be done. As the number of windows is increased, these differences decrease.

6.3 Dynamic partitioning of an animation sequence

Usually Radiance is employed to obtain an animation sequence, e.g., a virtual walk through a building. Such an animation is usually rendered frame by frame, which requires that the global illumination information be recalculated for every frame. The dynamic partitioning technique based on windows presented in the previous section, where all the frames of the animation are divided into windows, has been used in this paper. When an idle client asks the server for a new window, the server selects a window of the present frame (if any remains) or one of the next frame (otherwise). Both client and server must know the actual frame being rendered.

7 Presentation of results

The techniques developed in this paper for the distributed implementation of the Radiance code have been tested on an ATM network of 8 Ultra Sparc-2 workstations. It is important to note that the network used for the tests presented in this section has a low load and the processors run only the present application; therefore, the communications were very fast. Shareware versions of the Radiance 3.0 code and the PVM 3.3.1 tool have been used.
All the scenes used in the tests presented in this section can be obtained from the standard distribution of Radiance. Table 1 presents their main characteristics.

Table 1. Scenes used in the presentation of results, from the standard distribution of Radiance. The CPU time is in hours on an Ultra Sparc-2 workstation.

  Scene  Filename         Resolution   Rays traced  Time
  1      Mmack Leftbalc   2000 × 1352  97,128,720   25.598
  2      Conference room  1000 × 676   27,146,574    0.914

In all the algorithms presented in this paper, a consistent copy of the octree of local ambient values was distributed among all the processors. Table 2 shows the importance of this distribution in order to reduce both the number of rays and the real computational time. Although Table 2 shows the results for the dynamic partitioning technique based on windows, similar results have been obtained for the rest of the techniques presented in this paper. This table shows that the number of rays falls to about 55%–65%, and the computational time to about 45%–55%, of the values without broadcast, thanks to the sharing of the local ambient values.

Table 2. Number of rays (in millions) and real computational time (in hours) for scene 1 with the dynamic partitioning based on windows, with 4 clients and several window numbers, both with and without the replication (broadcast) of the indirect illumination local ambient values.

  Windows                           64      1024    4096
  Without broadcast  Number Rays    146.96  174.40  177.72
                     Real Time       13.14   11.88   12.30
  With broadcast     Number Rays     97.961  96.956  97.969
                     Real Time        6.535   6.407   6.706

In order to assess the behaviour of the estimator of the scanbar cost used in the static partitioning technique, comparisons between the partition based on the estimators and the optimum partition, calculated after the completion of the rendering of the image, have been carried out.
The corresponding results indicate that the performance of the estimator for a scanbar increases as the cost of that scanbar increases. When a large number of estimation rays is generated, there is no guarantee that the estimation improves, because there are estimation rays which are not traced in the standard rendering due to the quincunx sampling algorithm, and this can degrade the estimation. These results also indicate that there is a clear correspondence between the estimated complexity of the scanbars of the image and the visually apparent complexity of these scanbars.

For an image such as scene 2, whose cost is nearly a smooth function of the number of scanbars, Table 3 shows that for these scanbars of low complexity the estimator with 5 rays behaves better than the estimator with 20 rays, and that the estimator based on ray/object intersections is better than the one based on the total number of rays generated; moreover, as the number of processors increases, these differences also increase. For an image such as scene 1, whose cost function has 7 peaks of high cost and a series of plateaus of low cost, the performance of the estimator with 5 rays is worse than that with 20 rays because of the large complexity of this scene; but, in any case, the resulting partitions are not very good, especially because of the high cost of the first scanbar. In summary, the estimator with 20 rays is worse than the one with 5 rays (counting either the number of rays or the ray/object intersections); for a low degree of partitioning, the estimator works better, and the ray/object intersection measure of cost is better than that of generated rays.

Table 3. Speedup for scene 2 for the static image partitioning without load-balancing using the following partitions of the image: uniform, optimum, estimated with 5 and with 20 rays based on the number of total rays (Estim.5a, Estim.20a) and estimated with 5 and 20 rays based on the number of ray/object intersections (Estim.5b, Estim.20b).

  Processors  2     4     8
  Uniform     1.30  2.45  4.66
  Estim.5a    1.99  3.96  6.01
  Estim.5b    2.08  3.96  7.25
  Estim.20a   2.09  3.60  6.22
  Estim.20b   2.02  3.89  6.98
  Optimum     2.07  4.01  7.37

Table 4. Number of rays (in millions) and speedup for scenes 1 and 2 for the static image partitioning with load-balancing, using as initial partition of the image the partition with estimation of ray/object intersections with 5 rays. The cost of this estimation is included in the speedup calculation.

  Processors             1     2     4     8
  Scene 1  Number Rays   97.1  96.1  95.1  96.1
           Speed-up      1     2.05  4.02  7.26
  Scene 2  Number Rays   27.2  27.1  27.2  27.2
           Speed-up      1     2.12  4.00  7.43

Table 4 shows the speedup obtained with the static image partitioning with load-balancing. For 2 and 4 processors, slightly superlinear speedups are obtained because of the distribution of the octree of local ambient values; this result was to be expected. This table also shows that as the initial partition is improved, higher speedups are obtained, since the communication overload during the last phase of the rendering is reduced.

Table 5 shows the number of rays and speedup for the dynamic image partitioning based on scanbars. This table indicates that the preservation of the quincunx algorithm by means of this solution yields a number of rays nearly constant and equal to that of the sequential code; it also shows very good speedups of about 7.5 for 8 processors. The number of rays and speedup for the dynamic image partitioning based on windows are shown in Table 6.
This table shows that superlinear speedup has been obtained for some partitions of scene 2 because of the reduction in the number of indirect illumination rays due to the sharing of the octree of ambient values. For example, for scene 2, the sequential code needs 27.23 million rays, but with 64 windows and 2 clients only 27.06 million are needed, and with 4 clients only 27.02 million. However, the number of primary rays increases as the number of windows increases. For example, scene 2 with 1024 windows needs 374,424 rays more than with 64 windows, while scene 1 with 4096 windows exceeds that for 256 windows by 505,000 rays.

Table 5. Number of rays (in millions) and speedup for scenes 1 and 2 for the dynamic image partitioning based on scanbars.

  Processors             1     2     4     8
  Scene 1  Number Rays   97.1  97.1  97.1  97.1
           Speed-up      1     1.70  3.84  7.46
  Scene 2  Number Rays   27.2  27.2  27.2  27.2
           Speed-up      1     1.76  3.96  7.62

Table 6. Speedup for scenes 1 and 2 for the dynamic image partitioning based on windows, as a function of the number of windows and clients.

                 scene 2                       scene 1
  Windows    4     16    64    256   1024  64    256   1024  4096
  2 clients  2.02  2.02  2.05  2.02  1.99  1.89  1.91  1.99  1.82
  4 clients  2.44  3.81  4.03  3.96  3.54  3.92  3.98  3.99  3.82
  8 clients  2.64  6.44  6.87  7.14  6.48  7.36  7.79  7.87  7.76

Table 6 also shows that the speedup with 8 clients for scene 2 is low, due to the low cost of every window. For the more complex scene 1, this is not the case, and better performance is obtained with a larger number of windows. This behaviour is due to the trade-off between computational and communication costs: for high computational cost, the communication cost appears to be less significant.
The technique of dynamic partitioning based on windows for animation sequences has been tested with two "disconnected" animations, in which the camera position changes greatly from frame to frame; this is the worst case found in an animation. A sequence based on the Mmack scene, with 12 frames and using 8 clients, requires 15.42 hours of CPU time if all the frames are calculated independently using a dynamic partitioning with 25 windows, and only 5.98 hours using the dynamic partitioning of the complete animation. Another animation, based on the Townhouse scene with 17 frames, needs 109.45 CPU hours with the sequential code, 15.4 hours partitioning frame by frame, and only 5.80 hours with the partitioning of the complete animation; this corresponds to an extremely superlinear speedup of 18.3. The reason for this superlinear behaviour is twofold: firstly, the use of the same octree for the scene throughout the execution saves the time required to load and initialize the octree for each frame; secondly, the octree of ambient values is calculated once and shared among the processors.

8 Conclusions

This paper gives a general description and performance results for a distributed version of Greg Ward's Radiance code built with the Parallel Virtual Machine (PVM) tool. The implementation uses the same primary rays as the sequential code, preserving the quincunx sampling algorithm, and explores several strategies for static and dynamic load-balancing. Client/server distributed implementations for both single images and animation sequences have been developed. In all these techniques, a copy of the complete octree of the scene is replicated in all the processors; therefore the smallest memory size of any processor limits the size of the largest image which can be rendered.
The results presented in this paper show that trying to preserve the quincunx sampling algorithm by using scanbars as the distribution quantum has a drawback: for some images a scanbar is a very complex, high-cost quantum, so a window partitioning with windows smaller than the scanbars yields better results in practice.

In order to render large images, an object partitioning technique which distributes the octree among all the processors has to be developed. However, this technique may suffer load-imbalance problems and intense communication. Further research on object partitioning of Radiance is required. The development of better estimators which detect parts of the image with high or low cost is also important, although difficult in practice.

References

1. G. J. Ward. The RADIANCE 3.0 Synthetic Imaging System. Lighting Systems Research Group, Lawrence Berkeley Laboratory, 1996.
2. G. J. Ward. The RADIANCE Lighting Simulation and Rendering System. Computer Graphics, 28:459-472, 1994.
3. A. S. Glassner. Principles of Digital Image Synthesis. Morgan Kaufmann Publishers, 1995.
4. D. W. Jensen and D. A. Reed. A Performance Analysis Exemplar: Parallel Ray Tracing. Department of Computer Science, University of Illinois, 1992.
5. K. Sung, J. L. J. Shiuan and A. L. Ananda. Ray Tracing in a Distributed Environment. Computers & Graphics, 20:41-49, 1996.
6. H.-J. Kim and Ch.-M. Kyung. A New Parallel Ray-Tracing System Based on Object Decomposition. The Visual Computer, 5:244-253, 1996.
7. E. Reinhard and A. Chalmers. Message Handling in Parallel Radiance. In D. Kranzlmüller, P. Kacsuk, and J. Dongarra, eds., Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, Vol. 3241, pp. 486-493, Springer-Verlag, 1997.
8. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek and V. Sunderam. PVM 3 User's Guide and Reference Manual. Oak Ridge National Laboratory, Oak Ridge, Tennessee, 1993.
