High-Performance Pseudo-Random Number Generation on Graphics Processing Units

Nimalan Nandapalan (1), Richard P. Brent (1,2), Lawrence M. Murray (3), and Alistair Rendell (1)

(1) Research School of Computer Science, The Australian National University
(2) Mathematical Sciences Institute, The Australian National University
(3) CSIRO Mathematics, Informatics and Statistics

Abstract. This work considers the deployment of pseudo-random number generators (PRNGs) on graphics processing units (GPUs), developing an approach based on the xorgens generator to rapidly produce pseudo-random numbers of high statistical quality. The chosen algorithm has configurable state size and period, making it ideal for tuning to the GPU architecture. We present a comparison of both speed and statistical quality with other common parallel, GPU-based PRNGs, demonstrating favourable performance of the xorgens-based approach.

Keywords: Pseudo-random number generation, graphics processing units, Monte Carlo

1 Introduction

Motivated by compute-intense Monte Carlo methods, this work considers the tailoring of pseudo-random number generation (PRNG) algorithms to graphics processing units (GPUs). Monte Carlo methods of interest include Markov chain Monte Carlo (MCMC) [5], sequential Monte Carlo [4] and, most recently, particle MCMC [1], with numerous applications across the physical, biological and environmental sciences. These methods demand large numbers of random variates of high statistical quality. We have observed in our own work that, after acceleration of other components of a Monte Carlo program on GPU [13, 14], the PRNG component, still executing on the CPU, can bottleneck the whole procedure, failing to produce numbers as fast as the GPU can consume them.
The aim, then, is to also accelerate the PRNG component on the GPU, without compromising the statistical quality of the random number sequence, as demanded by the target Monte Carlo applications.

Performance of a PRNG involves both speed and quality. A metric for the former is the number of random numbers produced per second (RN/s). Measurement of the latter is more difficult. Intuitively, for a given sequence of numbers, an inability to discriminate their source from a truly random source is indicative of high quality. Assessment may be made by a battery of tests which attempt to identify flaws in the sequence that are not expected in a truly random sequence. These might include, for example, tests of autocorrelation and linear dependence. Commonly used packages for performing such tests are the DIEHARD [10] and TestU01 [8] suites.

The trade-off between speed and quality can take many forms. Critical parameters are the period of the generator (the length of the sequence before repeating) and its state size (the amount of working memory required). Typically, a generator with a larger state size will have a larger period. In a GPU computing context, where the available memory per processor is small, the state size may be critical. Also, a conventional PRNG produces a single sequence of numbers; an added challenge in the GPU context is to produce many uncorrelated streams of numbers concurrently.

Existing work includes the recent release of NVIDIA's CURAND [15] library, simple random generators in the Thrust C++ library [6], and early work for graphics applications [7]. Much of this work uses simple generators with small state sizes and commensurately short periods, in order not to exceed the limited resources that a GPU provides to individual threads.
The statistical quality of numbers produced by these algorithms is not necessarily adequate for Monte Carlo applications, and in some cases can undermine the procedure enough to cause convergence to the wrong result. The Mersenne Twister [12] is the de facto standard for statistical applications and is used by default in packages such as MATLAB. It features a large state size and long period, and has recently been ported to GPUs [17]. However, it has a fixed and perhaps over-large state size, and is difficult to tune for optimal performance on GPUs.

In this work we adapt the xorgens algorithm of [2, 3]. The attraction of this approach is the flexible choice of period and state size, facilitating the optimisation of speed and statistical quality within the resource constraints of a particular GPU architecture.

We begin with a brief overview of CUDA, then discuss qualitative testing of PRNGs, including the Mersenne Twister for Graphic Processors (MTGP), CURAND and xorgens generators. We then describe our adaptation of the xorgens algorithm for GPUs. Finally, the results of testing these generators are presented and some conclusions drawn.

1.1 The NVIDIA Compute Unified Device Architecture (CUDA) and the Graphics Processing Unit (GPU)

The Compute Unified Device Architecture (CUDA) was introduced by the NVIDIA Corporation in November 2006 [16]. This architecture provides a complete solution for general-purpose GPU programming (GPGPU), including new hardware, instruction sets, and programming models. The CUDA API allows communication between the CPU and GPU, allowing the user to control the execution of code on the GPU to the same degree as on the CPU. A GPU resides on a device, which usually consists of many multiprocessors (MPs), each containing some processors.
Each CUDA-compatible GPU device has a globally accessible memory address space that is physically separate from the MPs. The MPs have a local shared memory space for each of the processors associated with the MP. Finally, each processor has its own set of registers and processing units for performing computations.

There are three abstractions central to the CUDA software programming model, provided by the API as simple language extensions:

– A hierarchy of thread groupings – a thread being the smallest unit of processing that can be scheduled by the device.
– Shared memory – fast sections of memory common to the threads of a group.
– Barrier synchronisation – a means of synchronising thread operations by halting threads within a group until all threads have met the barrier.

Threads are organised into small groups of 32 called warps for execution on the processors, which are Single-Instruction Multiple-Data (SIMD) and implicitly synchronous. These are organised for scheduling across the MPs in blocks. Thus, each block of threads has access to the same shared memory space. Finally, each block is part of a grid of blocks that represents all the threads launched to solve a problem. These are specified at the invocation of kernels – functions executed on the GPU device – which are managed by ordinary CPU programs, known as host code.

As a consequence of the number of in-flight threads supported by a device, and the memory requirements of each thread, not all of a given GPU device's computational capacity can be used at once. The fraction of a device's capacity that can be used by a given kernel is known as its occupancy.
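The idea can be illustrated with a CPU-side toy model (the resource limits below are invented for illustration, and this is not NVIDIA's actual occupancy calculation): the number of warps resident on an MP is bounded by whichever per-block resource runs out first.

```c
#include <assert.h>

/* Toy occupancy model: how many warps fit on one MP at once?
   All resource figures are illustrative assumptions, not real GPU limits. */
int active_warps(int regs_per_thread, int smem_per_block,
                 int threads_per_block,
                 int mp_regs, int mp_smem, int mp_max_warps)
{
    int blocks_by_regs = mp_regs / (regs_per_thread * threads_per_block);
    int blocks_by_smem = smem_per_block > 0 ? mp_smem / smem_per_block
                                            : blocks_by_regs;
    int blocks = blocks_by_regs < blocks_by_smem ? blocks_by_regs
                                                 : blocks_by_smem;
    int warps = blocks * (threads_per_block / 32);   /* 32 threads per warp */
    return warps < mp_max_warps ? warps : mp_max_warps;
}
```

For example, with 32 registers per thread, 4096 bytes of shared memory per block and 256 threads per block on a hypothetical MP offering 32768 registers, 49152 bytes of shared memory and 48 warp slots, only 32 warps fit, i.e. the kernel runs at two-thirds occupancy. This is why the per-thread memory footprint of a PRNG matters.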
1.2 Statistical Testing: TestU01

Theoretically, the performance of some PRNGs on certain statistical tests can be predicted, but usually this only applies if the test is performed over a complete period of the PRNG. In practice, statistical testing of PRNGs over realistic subsets of their periods requires empirical methods [8, 10].

For a given statistical test and PRNG to be tested, a test statistic is computed using a finite number of outputs from the PRNG. It is required that the distribution of the test statistic for a sequence of uniform, independently distributed random numbers is known, or at least that a sufficiently good approximation is computable [9]. Typically, a p-value is computed, which gives the probability that the test statistic exceeds the observed value. The p-value can be thought of as the probability that the test statistic or a larger value would be observed for perfectly uniform and independent input. Thus the p-value itself should be distributed uniformly on (0, 1). If the p-value is extremely small, for example of the order 10^{-10}, then the PRNG definitely fails the test. Similarly if 1 − p is extremely small. If the p-value is not close to 0 or 1, then the PRNG is said to pass the test, although this only says that the test failed to detect any problem with the PRNG.

Typically, a whole battery of tests is applied, so that there are many p-values, not just one. We need to be cautious in interpreting the results of many such tests; if performing N tests, it is not exceptional to observe that a p-value is smaller than 1/N or larger than 1 − 1/N.

The TestU01 library presented by L'Ecuyer and Simard [8] provides a thorough suite of tests to evaluate the statistical quality of the sequence produced by a PRNG. It includes and improves on all of the tests in the earlier DIEHARD package of Marsaglia [10].
1.3 The Mersenne Twister for Graphic Processors

The MTGP generator is a recently-released variant of the well-known Mersenne Twister [12, 17]. As its name suggests, it was designed for GPGPU applications. In particular, it was designed with parallel Monte Carlo simulations in mind. It is released with a parameter generator for the Mersenne Twister algorithm to supply users with distinct generators on request (MTGPs with different sequences). The MTGP is implemented in NVIDIA CUDA [16] in both 32-bit and 64-bit versions. Following the popularity of the original Mersenne Twister PRNG, this generator is a suitable standard against which to compare GPU-based PRNGs.

The approach taken by the MTGP to make the Mersenne Twister parallel can be explained as follows. The next element of the sequence, x_i, is expressed as some function, h, of a number of previous elements in the sequence, say

    x_i = h(x_{i-N}, x_{i-N+1}, x_{i-N+M}).

The parallelism that can be exploited in this algorithm becomes apparent when we consider the pattern of dependency between further elements of the sequence:

    x_i         = h(x_{i-N},   x_{i-N+1}, x_{i-N+M})
    x_{i+1}     = h(x_{i-N+1}, x_{i-N+2}, x_{i-N+M+1})
    ...
    x_{i+N-M-1} = h(x_{i-M-1}, x_{i-M},   x_{i-1})
    x_{i+N-M}   = h(x_{i-M},   x_{i-M+1}, x_i).

The last recurrence, which produces x_{i+N-M}, requires the value of x_i, which has not yet been calculated. Thus, only N − M elements of the sequence produced by a Mersenne Twister can be computed in parallel. As N is fixed by the Mersenne prime chosen for the algorithm, all that can be done to maximise the parallel efficiency of the MTGP is careful selection of the constant M. This constant, specific to each generator, determines the selection of one of the previous elements in the sequence in the recurrence that defines the MTGP.
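The independence of those first N − M recurrences can be seen in a toy version of the pattern (with h replaced by a plain xor and small illustrative N and M; the real MTGP uses a tempered GF(2)-linear h and a much larger N):

```c
#include <assert.h>
#include <stdint.h>

enum { N_ = 8, M_ = 3 };   /* toy sizes; real MTGPs use far larger N */

/* Fill x[i] .. x[i + N_ - M_ - 1]. Every index read is at most i - 1,
   so the N_ - M_ loop iterations are mutually independent and could be
   executed as one parallel batch of threads. */
void batch(uint32_t *x, int i)
{
    for (int j = 0; j < N_ - M_; j++)
        x[i + j] = x[i + j - N_] ^ x[i + j - N_ + 1] ^ x[i + j - N_ + M_];
}
```

Computing the next element, x[i + N_ - M_], would read x[i] itself, so a new batch (after a synchronisation) must begin there — which is exactly the N − M parallel window derived above.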
Thus, the choice of M has a direct impact on the quality of the random numbers generated, and on the distribution of the sequence.

1.4 CURAND

The CUDA CURAND library is NVIDIA's parallel PRNG framework and library. It is documented in [16]. The default generator for this library is based on the XORWOW algorithm introduced by Marsaglia [11]. The XORWOW algorithm is an example of the xorshift class of generators.

Generators of this class have a number of advantages. The algorithm behind them is particularly simple when compared to other generators such as the Mersenne Twister. This results in simple generators which are very fast but still perform well in statistical tests of randomness. The idea of the xorshift class of generators is to combine two terms in the pseudo-random sequence (integers represented in binary) using left/right shifts and "exclusive or" (xor) operations to produce the next term in the sequence. Shifts and xor operations can be performed quickly on computing architectures, typically faster than operations such as multiplication and division. Also, generators designed on this principle generally do not require a large number of values in the sequence to be retained (i.e. a large state space) in order to produce a sequence of satisfactory statistical quality.

1.5 xorgens

Marsaglia's original paper [11] only gave xorshift generators with periods up to 2^192 − 1. Brent [2] recently proposed the xorgens family of PRNGs that generalise the idea and have period 2^n − 1, where n can be chosen to be any convenient power of two up to 4096. The xorgens generator has been released as a free software package, in a C language implementation (most recently xorgens version 3.05 [3]).
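For concreteness, one of Marsaglia's original full-period xorshift generators — the 32-bit, three-shift variant with period 2^32 − 1 and shift constants (13, 17, 5) from [11] — takes only a few operations per output:

```c
#include <assert.h>
#include <stdint.h>

/* Marsaglia xorshift32: the entire state is a single nonzero 32-bit word. */
uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}
```

The one-word state is what makes this class attractive when per-thread resources are scarce; the price is the GF(2) linearity addressed in §1.5.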
Compared to previous xorshift generators, the xorgens family has several advantages:

– A family of generators with different periods and corresponding memory requirements, instead of just one.
– Parameters are chosen optimally, subject to certain criteria designed to give the best quality output.
– The defect of linearity over GF(2) is overcome efficiently by combining the output with that of a Weyl generator.
– Attention has been paid to the initialisation code (see comments in [2, 3] on proper initialisation), so the generators are suitable for use in a parallel environment.

For details of the design and implementation of the xorgens family, we refer to [2, 3]. Here we just comment on the combination with a Weyl generator. This step is performed to avoid the problem of linearity over GF(2) that is common to all generators of the linear-feedback shift register class (such as the Mersenne Twister and CURAND). A Weyl generator has the following simple form:

    w_k = w_{k-1} + ω mod 2^w,

where ω is some odd constant (a recommended choice is an odd integer close to 2^{w-1}(√5 − 1)). The final output of an xorgens generator is given by:

    w_k(I + R^γ) + x_k mod 2^w,                                    (1)

where x_k is the output before addition of the Weyl generator, γ is some integer constant close to w/2, and R is the right-shift operator. The inclusion of the term R^γ ensures that the least-significant bits have high linear complexity (if we omitted this term, the Weyl generator would do little to improve the quality of the least-significant bit, since (w_k mod 2) is periodic with period 2).

As addition mod 2^w is a non-linear operation over GF(2), the result is a mixture of operations from two different algebraic structures, allowing the sequence produced by this generator to pass all of the empirical tests in BigCrush, including those failed by the Mersenne Twister.
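In C, for w = 32 and γ = 16, the combination (1) amounts to two additions and a shift. The sketch below uses our own choice of constants (ω = 0x9E3779B9 is an odd integer close to 2^31(√5 − 1)); the released xorgens code may differ in detail:

```c
#include <assert.h>
#include <stdint.h>

#define OMEGA 0x9E3779B9u   /* odd constant close to 2^31 (sqrt(5) - 1) */
#define GAMMA 16            /* close to w/2 for w = 32 */

/* Combine the raw xorshift output x_k with a Weyl step, as in (1):
   output = w_k (I + R^gamma) + x_k  (mod 2^32).
   Unsigned 32-bit arithmetic performs the reduction mod 2^32 for free. */
uint32_t weyl_combine(uint32_t x_k, uint32_t *w)
{
    *w += OMEGA;                       /* w_k = w_{k-1} + omega mod 2^w */
    return *w + (*w >> GAMMA) + x_k;   /* w_k (I + R^gamma) + x_k */
}
```

Because addition carries between bit positions, the result is no longer GF(2)-linear, and the R^γ term folds high bits of w_k into its otherwise period-2 low bit.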
A bonus is that the period is increased by a factor 2^w (though this is not free, since the state size is increased by w bits).

2 xorgensGP

Extending the xorgens PRNG to the GPGPU domain is a nontrivial endeavour, with a number of design considerations required. We are essentially seeking to exploit some level of parallelism inherent in the flow of data. To realise this, we examine the recursion relation describing the xorgens algorithm:

    x_i = x_{i-r}(I + L^a)(I + R^b) + x_{i-s}(I + L^c)(I + R^d).

In this equation, the parameter r represents the degree of recurrence, and consequently the size of the state space (in words, and not counting a small constant for the Weyl generator and a circular array index). L and R represent left-shift and right-shift operators, respectively.

If we conceptualise this state-space array as a circular buffer of r elements we can reveal some structure in the flow of data. In a circular buffer, x, of r elements, where x[i] denotes the i-th element, x_i, the indices i and i + r would access the same position within the circular buffer. This means that as each new element x_i in the sequence is calculated from x[i − r] and x[i − s], the result replaces the r-th oldest element in the state space, which is no longer necessary for calculating future elements.

Now we can begin to consider the parallel computation of a subsequence of xorgens. Let us examine the dependencies of the data flow within the buffer x as a sequence is being produced, writing A = (I + L^a)(I + R^b) and B = (I + L^c)(I + R^d):

    x_i         = x_{i-r} A + x_{i-s} B
    x_{i+1}     = x_{i-r+1} A + x_{i-s+1} B
    ...
    x_{i+(r-s)} = x_{i-r+(r-s)} A + x_{i-s+(r-s)} B = x_{i-s} A + x_{i+r-2s} B
    ...
    x_{i+s}     = x_{i-r+s} A + x_{i-s+s} B = x_{i-r+s} A + x_i B.
If we consider the concurrent computation of the sequence, we observe that the maximum number of terms that can be computed in parallel is min(s, r − s). Here r is fixed by the period required, but we have some freedom in the choice of s. It is best to choose s ≈ r/2 to maximise the inherent parallelism. However, the constraint GCD(r, s) = 1 implies that the best we can do is s = r/2 ± 1, except in the case r = 2, s = 1. This provides one additional constraint, in the context of xorgensGP versus (serial) xorgens, on the parameter set {r, s, a, b, c, d} defining a generator. Thus, we find the thread-level parallelism inherent to the xorgens class of generators.

In the CUDA implementation of this generator we considered the approach of producing independent subsequences. With this approach the problem of creating one sequence of random numbers of arbitrary length, L, is made parallel by p processes by independently producing p subsequences of length L/p, and gathering the results. With the block-of-threads architecture of the CUDA interface and this technique, it is a logical and natural decision to allocate each subsequence to a block within the grid of blocks. This can be achieved by providing each block with its own local copy of a state space via the shared memory of an MP, and then using the thread-level parallelism for the threads within this block. Thus, the local state spaces will represent the same generator, but at different points within its period (which is sufficiently long that overlapping sequences are extremely improbable). Each generator is identical in that only one parameter set {r, s, a, b, c, d} is used.
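A serial sketch of one xorgens step over such a circular buffer, together with the min(s, r − s) parallel-width calculation under the GCD constraint, might look as follows. This is our own illustrative code, not Brent's reference implementation: the parameters are fixed at compile time as in xorgensGP, and the Weyl combination and proper seeding are omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Compile-time parameters, as used for the tests in Section 3. */
#define R_ 128
#define S_ 65
#define A_ 15
#define B_ 14
#define C_ 12
#define D_ 17

static uint32_t x[R_];   /* circular buffer: x[idx] holds the oldest element */
static int idx = 0;

/* One step of x_i = x_{i-r} A + x_{i-s} B, where "+" over GF(2) is xor. */
uint32_t xorgens_step(void)
{
    uint32_t t = x[idx];                   /* x_{i-r} */
    uint32_t v = x[(idx + R_ - S_) % R_];  /* x_{i-s} */
    t ^= t << A_;  t ^= t >> B_;           /* t := t (I + L^a)(I + R^b) */
    v ^= v << C_;  v ^= v >> D_;           /* v := v (I + L^c)(I + R^d) */
    v ^= t;
    x[idx] = v;                            /* overwrite the oldest element */
    idx = (idx + 1) % R_;
    return v;
}

/* min(s, r-s) terms can be computed concurrently, provided gcd(r, s) = 1. */
int parallel_width(int r, int s)
{
    int a = r, b = s;
    while (b) { int t2 = a % b; a = b; b = t2; }
    if (a != 1) return 0;                  /* parameters inadmissible */
    return s < r - s ? s : r - s;
}
```

For (r, s) = (128, 65), up to min(65, 63) = 63 successive elements are mutually independent, which is the window the threads of each xorgensGP block exploit.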
An advantage of this is that the parameters are known at compile time, allowing the compiler to make optimisations that would not be available if the parameters were dynamically allocated, and thus known only at run time. This results in fewer registers being required by each thread. For the generator whose test results are given in §3, we used the parameters (r, s, a, b, c, d) = (128, 65, 15, 14, 12, 17).

3 Results

We now present an evaluation of the results obtained in our comparison of the existing GPU PRNGs against our implementation of xorgensGP. All experiments were performed on an NVIDIA GeForce GTX 480 and a single GPU of the NVIDIA GeForce GTX 295 (which is a dual-GPU device), using the CUDA 3.2 toolkit and drivers. Performance results are presented in Table 1, and qualitative results in Table 2.

We first compared the memory footprint of each generator. This depends on the algorithm defining the generator. The CURAND generator was determined to have the smallest memory requirements of the three generators compared, and the MTGP was found to have the greatest. The MTGP has the longest period (2^11213 − 1), and the CURAND generator has the shortest period (2^192 − 2^32).

Table 1. Approximate memory footprints, periods and speed on two devices for 32-bit generators.

    Generator   State-Space   Period    GTX 480 RN/s   GTX 295 RN/s
    xorgensGP   129 words     2^4128    17.7 × 10^9     9.1 × 10^9
    MTGP        1024 words    2^11213   17.5 × 10^9    10.7 × 10^9
    CURAND      6 words       2^192     18.5 × 10^9     7.1 × 10^9

Next, we compared the random number throughput (RN/s) of each generator on the two different devices. This was obtained by repeatedly generating 10^8 random numbers and timing the duration to produce the sequence of that length.
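The measurement procedure can be sketched on the CPU as follows (a hypothetical harness with a stand-in generator and helper names of our own; the actual measurements timed the CUDA kernels):

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Stand-in generator for the harness: Marsaglia xorshift32. */
static uint32_t gen32(uint32_t *s)
{
    uint32_t x = *s;
    x ^= x << 13;  x ^= x >> 17;  x ^= x << 5;
    return *s = x;
}

/* Time the production of n variates and report random numbers per second. */
double measure_rns(long n)
{
    uint32_t s = 1, acc = 0;
    clock_t t0 = clock();
    for (long i = 0; i < n; i++)
        acc ^= gen32(&s);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0) secs = 1.0 / CLOCKS_PER_SEC;  /* guard clock granularity */
    (void)acc;   /* consume the output so the loop is not optimised away */
    return (double)n / secs;
}
```

On a typical CPU core a stand-in like this reaches somewhere on the order of 10^8–10^9 RN/s, well short of the roughly 10^10 RN/s GPU figures in Table 1 — precisely the bottleneck that motivates §1.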
We found that the performance of each generator was roughly the same, with no significant speed advantage for any generator. On the newer GTX 480, the CURAND generator was the fastest, and the MTGP was the slowest. On the older architecture of the GTX 295 the ordering was reversed: the CURAND generator was the slowest and the MTGP was the fastest. These results can be explained in part by the fact that the CURAND generator was designed with the current generation of "Fermi" cards like the GTX 480 in mind, and the MTGP was designed and tested initially on a card very similar to the GTX 295. In any event, the speed differences are small and implementation/platform-dependent.

Finally, to compare the quality of the sequences produced, each of the generators was subjected to the SmallCrush, Crush, and BigCrush batteries of tests from the TestU01 library. The xorgensGP generator did not fail any of the tests in any of the benchmarks. Only the MTGP failed in the Crush benchmark, where it failed two separate tests. This was expected, as the generator is based on the Mersenne Twister, and the tests are designed to expose the problem of linearity over GF(2). The MTGP failed the corresponding, more rigorous tests in BigCrush. Interestingly, the CURAND generator failed one of these two tests in BigCrush.

Table 2. Tests failed in each standard TestU01 benchmark.

    Generator   SmallCrush   Crush       BigCrush
    xorgensGP   None         None        None
    MTGP        None         #71, #72    #80, #81
    CURAND      None         None        #81

4 Discussion

We briefly discuss the results of the statistical tests, along with some design considerations for the xorgensGP generator. CURAND fails one of the TestU01 tests. This test checks for linearity and exposes this flaw in the Mersenne Twister.
However, like xorgensGP, CURAND combines the output of an xorshift generator with a Weyl generator to avoid linearity over GF(2), so it was expected to pass the test. The period 2^192 − 2^32 of the CURAND generator is much smaller than that of the other two generators, but the BigCrush test consumes approximately 2^38 random numbers, which is still only a small fraction of the period. A more probable explanation relates to the initialisation of the generators at the block level. In xorgensGP each block is provided with consecutive seed values (the id number of the block within the grid). Correlation between the resulting subsequences is avoided by the method xorgens uses to initialise the state space. It is unclear what steps CURAND takes in its initialisation.

The MTGP avoids this problem by providing each generator with different parameter sets for values such as the shift amounts. In developing xorgensGP this approach was also explored. However, it was found that the overhead of managing the parameters increased the memory footprint of each generator and consequently reduced the occupancy and performance of the generator, without any noticeable improvement in quality, and so it was not developed any further.

In conclusion, we presented a new PRNG, xorgensGP, for GPUs using CUDA. We showed that it performs with comparable speed to existing solutions and with better statistical qualities. The proposed generator has a period that is sufficiently large for statistical purposes while not requiring too much state space, allowing it to give good performance on different devices.

Bibliography

[1] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society Series B, 72:269–302, 2010.
[2] R. P. Brent. Some long-period random number generators using shifts and xors. ANZIAM Journal, 48, 2007.
[3] R. P. Brent. xorgens version 3.05, 2008. http://maths.anu.edu.au/~brent/random.html.
[4] A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer, 2001.
[5] W. Gilks, S. Richardson, and D. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman and Hall, 1995.
[6] J. Hoberock and N. Bell. Thrust: A parallel template library, 2010. URL http://thrust.googlecode.com/. Version 1.3.0.
[7] L. Howes and D. Thomas. GPU Gems 3, chapter Efficient Random Number Generation and Application Using CUDA. Addison-Wesley, 2007.
[8] P. L'Ecuyer and R. Simard. TestU01: A C library for empirical testing of random number generators. ACM Transactions on Mathematical Software, 33, 2007.
[9] P. Leopardi. Testing the tests: using random number generators to improve empirical tests. Monte Carlo and Quasi-Monte Carlo Methods 2008, pages 501–512, 2009.
[10] G. Marsaglia. DIEHARD: a battery of tests of randomness. http://stat.fsu.edu/~geo/diehard.html, 1996.
[11] G. Marsaglia. Xorshift RNGs. Journal of Statistical Software, 8(14):1–6, 2003.
[12] M. Matsumoto and T. Nishimura. Mersenne twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Transactions on Modeling and Computer Simulation, 8:3–30, 1998.
[13] L. M. Murray. GPU acceleration of Runge-Kutta integrators. IEEE Transactions on Parallel and Distributed Systems, to appear, 2011.
[14] L. M. Murray. GPU acceleration of the particle filter: The Metropolis resampler. In DMMD: Distributed machine learning and sparse representation with massive data sets, 2011.
[15] NVIDIA Corp. CUDA CURAND Library. NVIDIA Corporation, 2010.
[16] NVIDIA Corp. CUDA Compute Unified Device Architecture Programming Guide Version 3.2. NVIDIA Corp., Santa Clara, CA 95050, 2010.
[17] M. Saito. A variant of Mersenne Twister suitable for graphic processors. http://arxiv.org/abs/1005.4973, 2011.