Highly accelerated simulations of glassy dynamics using GPUs: caveats on limited floating-point precision


Authors: Peter H. Colberg (a, b, 1), Felix Höfling (a, c, 2)

(a) Arnold Sommerfeld Center for Theoretical Physics and Center for NanoScience (CeNS), Fakultät für Physik, Ludwig-Maximilians-Universität München, Theresienstraße 37, 80333 München, Germany
(b) Institut für Materialphysik im Weltraum, Deutsches Zentrum für Luft- und Raumfahrt (DLR), Linder Höhe, 51147 Köln, Germany
(c) Rudolf Peierls Centre for Theoretical Physics, 1 Keble Road, Oxford OX1 3NP, England, United Kingdom

(1) Present address: Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada
(2) Present address: Max-Planck-Institut für Metallforschung, Heisenbergstraße 3, 70569 Stuttgart, and Institut für Theoretische und Angewandte Physik, Universität Stuttgart, Pfaffenwaldring 57, 70569 Stuttgart, Germany
Email addresses: peter.colberg@utoronto.ca (Peter H. Colberg), hoefling@mf.mpg.de (Felix Höfling)

Abstract

Modern graphics processing units (GPUs) provide impressive computing resources, which can be accessed conveniently through the CUDA programming interface. We describe how GPUs can be used to considerably speed up molecular dynamics (MD) simulations for system sizes ranging up to about 1 million particles. Particular emphasis is put on the numerical long-time stability in terms of energy and momentum conservation, and caveats on limited floating-point precision are issued. Strict energy conservation over 10^8 MD steps is obtained by double-single emulation of the floating-point arithmetic in accuracy-critical parts of the algorithm. For the slow dynamics of a supercooled binary Lennard–Jones mixture, we demonstrate that the use of single floating-point precision may result in quantitatively and even physically wrong results. For simulations of a Lennard–Jones fluid, the described implementation shows speedup factors of up to 80 compared to a serial implementation for the CPU, and a single GPU was found to compare with a parallelised MD simulation using 64 distributed cores.

Keywords: GPU computing, molecular dynamics simulations, dynamics of supercooled liquids
PACS: 05.10.-a, 61.20.Ja, 64.70.qj

1. Introduction

In recent years, the computational power of graphics processing units (GPUs) has increased rapidly: the theoretical peak performance for single precision floating-point operations on a consumer GPU (NVIDIA GeForce GTX 280) reaches nearly 1 Tflop/s. Compared to a single core of a conventional processor, this gives rise to an expected performance jump of one or two orders of magnitude for many demanding computational problems, fueling the desire to exploit graphics processors for scientific applications. While conventional high-performance computing (HPC) depends on expensive central computing clusters, shared amongst many researchers, GPU computing makes local clusters acting as small HPC facilities conceivable for large-scale simulations at a fraction of the cost.
A modern GPU works as a streaming processor implementing the single-instruction-multiple-threads (SIMT) model. One device contains several hundred scalar processor cores executing a single instruction, or a small set of divergent instructions, in parallel in thousands of threads on a large data set. Parallel, coalesced access to the onboard device memory via a memory interface of up to 512 bits provides a remarkably high memory bandwidth.

Molecular dynamics (MD) simulations, a widespread tool for particle-based simulations in biological physics, chemistry, and materials science, are ideally suited to the GPU due to their inherent parallelism. The release of NVIDIA's compute unified device architecture (CUDA) as a convenient means of GPU programming has triggered a lot of activity in exploiting GPUs for scientific applications. Several MD implementations using GPUs have demonstrated significant speedups, with performance measurements based on relatively short test runs [1-5]. A critical review of current attempts to exploit GPUs in MD simulations may be found in Ref. 6; in essence, published results are not yet available, reflecting the "nascent nature" of the field. As a notable exception, complex MD simulations targeting protein folding were accelerated by GPUs, allowing the trajectories to reach into the millisecond regime [7, 8]. In the realm of physics, we are only aware of a Monte Carlo study of the critical behaviour of the Ising model that was performed on the GPU, reporting speedups between 35 and 60 for the mostly integer arithmetic-based algorithm [9]; a multi-GPU approach to this problem shows promising scalability [10].

In this article, we describe MD simulations that are fully implemented on the GPU and that faithfully reproduce the glassy dynamics of a supercooled binary mixture. If a glass-forming liquid is cooled or compressed, the structural relaxation slows down critically by several orders of magnitude, and a rapidly growing time scale emerges close to the glass transition line [11]. Small changes in the temperature may already have drastic effects on the dynamics, and a numerical study thus requires excellent long-time stability with respect to energy conservation over long simulation runs of 10^8 MD integration steps and more. The single precision floating-point arithmetic provided by recent GPUs turns out to be a major obstacle to this goal. But once this limitation is overcome, general-purpose computing on GPUs provides a useful tool to address current questions of the glass transition with minimal allocation of hardware and reasonable computing time.

The substantial enhancement of computing resources due to GPU computing facilitates the investigation of very large systems, which is desirable for studies of phenomena associated with a divergent length scale. Only recently, evidence for a divergent static correlation length in supercooled liquids was found in large-scale simulations of up to 64,000 particles [12] and 80,000 particles [13], respectively.
The article is organised as follows. We describe a high-precision implementation for the GPU in Section 2, followed by performance benchmarks in Section 3. A detailed analysis of the long-time stability regarding momentum and energy conservation is given in Section 4. Section 5 demonstrates the impact of numerical accuracy on the simulation results for the glassy dynamics of a binary mixture of soft spheres. A conclusion is given in Section 6.

2. Implementation for the GPU

2.1. Architecture of the GPU hardware

An MD implementation for the GPU needs to be adapted to its vastly different architecture. A modern GPU consists of hundreds of scalar cores, which map to execution threads in the CUDA architecture and require fine-grained parallelisation of the algorithm. The scalar cores are divided into multiprocessors (units of 8 processors for the NVIDIA G200 series GPUs) capable of executing an atomic warp of 32 threads in four clock cycles. Each multiprocessor is equipped with a total of 16,384 32-bit registers and a fast, tiny shared memory of 16 kB; all multiprocessors access a large global device memory of 1 to 4 GB, which is two orders of magnitude slower than local registers. To hide latencies when accessing the global device memory, a scheduler concurrently executes multiple warps on a multiprocessor. The equivalent of a multiprocessor in the CUDA architecture is a block of up to 512 threads, and communication and synchronisation are possible only within a block. For the execution of complex algorithms, the number of threads per block has to be lowered due to the limited number of registers available per multiprocessor. Global device memory enforces a strict memory access pattern, where threads have to address memory in a linear, coalesced order. In contrast to shared memory access, random read and write access to global device memory entails a performance penalty of an order of magnitude. This impairs GPU acceleration of many common computational primitives such as sorting algorithms. As a partial remedy, a texture cache of up to 8 kB per multiprocessor, which stems from the graphics heritage of the GPU, enables read-only random access to a small window of global device memory. A global cache of 64 kB of constant memory provides access to constant data at speeds comparable to register access.

2.2. The MD integration step

A soft-sphere molecular dynamics (MD) simulation solves Newton's equations in discretised time for N classical particles interacting via a C2-continuous, short-ranged potential. Every MD step, the force on each particle exerted by all interacting particles is calculated, and the particles are propagated by a symplectic integrator, the velocity-Verlet algorithm [14]. A naive implementation based on a doubly nested force loop would yield an algorithmic complexity of O(N^2). For short-ranged interparticle forces, however, linear scaling of the performance with the number of particles can be achieved with Verlet neighbour lists. For each particle, a list of particles located within a radius r_c, the cutoff radius of the potential, is kept in memory; the force algorithm then considers only particles in the neighbour list. A small "skin" of a fraction of r_c is added to the neighbour sphere to reduce the necessity of rebuilding the neighbour lists to every 10 to 100 steps. Particle binning, the decomposition of the system into spatial cells, avoids a doubly nested loop in the neighbour list algorithm by limiting the search for neighbours of a particle to its own and the adjacent cells.
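The binning step itself amounts to a simple index calculation per particle. The following self-contained C++ fragment illustrates one possible convention (cubic box, cell edge at least r_c plus skin, row-major 1D index with periodic wrapping); it is our illustration, not code from HAL's MD package.

    #include <cstdio>
    #include <cmath>

    // Map a particle position to a one-dimensional cell index for a cubic box of
    // edge L decomposed into ncell^3 cells. The ordering convention is illustrative.
    int cell_index(double x, double y, double z, double L, int ncell)
    {
        double edge = L / ncell;
        int ix = (int)std::floor(x / edge) % ncell; if (ix < 0) ix += ncell;
        int iy = (int)std::floor(y / edge) % ncell; if (iy < 0) iy += ncell;
        int iz = (int)std::floor(z / edge) % ncell; if (iz < 0) iz += ncell;
        return (iz * ncell + iy) * ncell + ix;     // row-major 1D index
    }

    int main()
    {
        const double L = 10.0, r_c = 2.5, skin = 0.5;
        const int ncell = (int)(L / (r_c + skin)); // cell edge >= r_c + skin
        std::printf("particle at (0.1, 4.2, 9.8) -> cell %d of %d\n",
                    cell_index(0.1, 4.2, 9.8, L, ncell), ncell * ncell * ncell);
        return 0;
    }

Sorting the particle indices by this cell index (on the GPU, by radix sort as described below) then yields the cell lists used to build the Verlet neighbour lists.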
Our implementation for the GPU combines and extends concepts described in detail in Refs. 1 and 2.

Parallelisation of the velocity-Verlet algorithm for the GPU is straightforward and naturally respects coalesced memory access: each CUDA thread is assigned a single particle with the task of updating the velocities and coordinates. The implemented force and neighbour list algorithms resemble the ones proposed in Ref. 1. Specifically, in the force algorithm each thread reads the indices of a particle's neighbours in a coalesced manner, and coordinates are then fetched using the texture cache, mitigating performance penalties due to random memory access.

For particle binning, we have implemented a cell list algorithm based on a parallel radix sort [15, 16]. Each particle is assigned a one-dimensional integer cell index. An array containing the particle numbers is sorted according to the cell indices, as proposed in the Particles example of the CUDA SDK [17]. A further pass determines the start index of each cell, followed by the assembly of fixed-size cell lists for the subsequent neighbour list algorithm. With this method we avoid the bottleneck of a host-to-device memory copy which is needed with the CPU-based cell list algorithm of Ref. 1. The radix sort algorithm performs worse on the GPU than on the CPU for small systems of considerably less than 10^4 particles, and further, particle binning is only necessary every 10 to 100 steps. Thus, the overall performance of the MD step is best for systems of 10^5 and more particles.

The neighbour list algorithm finally reads the particle indices from the fixed-size cell lists to gather coordinates and velocities of particles in neighbouring cells via texture fetches, temporarily stores them in shared memory, and builds a fixed-size neighbour list [1].

Figure 1: Hilbert's space-filling curve in three dimensions after 1, 2 and 3 recursions. It provides the mapping between three-dimensional space and one-dimensional memory, which ensures texture locality of neighbouring particles. Particles at vertices with similar colour will be close in memory after the sort.

A parallel reduction scheme (see the example "Reduction" in Ref. 17) on the GPU is used to calculate properties which are needed every MD step, such as the maximum absolute particle displacement and the potential energy sum, as well as less frequently evaluated properties such as temperature, centre-of-mass velocity, and the virial tensor.

In the force algorithm, we make efficient use of the texture cache by periodically sorting the particles in global device memory according to a three-dimensional space-filling Hilbert curve [1, 18]. Such a curve provides a mapping between 3D space and 1D memory that optimally conserves spatial locality. In our implementation, the mapping is recursively generated on the GPU using 8 vertex rules [19] as depicted in Fig. 1, based on a lattice spacing of σ with a maximum recursion depth of 10. The subsequent particle sort employs the radix sort algorithm [15, 16] to generate a permutation array; then read access to the texture cache and coalesced write access to global memory are used to efficiently permute particle coordinates and velocities.
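To illustrate the structure of such a neighbour-list force kernel, a stripped-down CUDA sketch is given below. The array layout, the fixed per-particle neighbour stride, and all identifiers are our own choices for the illustration; the texture fetch of coordinates, periodic boundary conditions (minimum image), and the double-single force accumulation of the actual implementation are omitted for brevity.

    #include <cstdio>
    #include <cuda_runtime.h>

    // One thread per particle: accumulate truncated 12-6 Lennard-Jones forces from
    // the particle's fixed-size neighbour list; nbr holds up to max_nbr neighbour
    // indices per particle, padded with -1.
    __global__ void force_kernel(float4 const* r, int const* nbr, int max_nbr,
                                 float rc2, float4* f, int npart)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= npart) return;
        float4 ri = r[i];
        float fx = 0.f, fy = 0.f, fz = 0.f, epot = 0.f;
        for (int k = 0; k < max_nbr; ++k) {
            int j = nbr[i * max_nbr + k];     // coalesced only if stored transposed; simplified here
            if (j < 0) break;                 // padding marks the end of the list
            float4 rj = r[j];                 // production code fetches coordinates via texture cache
            float dx = ri.x - rj.x, dy = ri.y - rj.y, dz = ri.z - rj.z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 > rc2) continue;           // neighbour lies in the skin region
            float rri = 1.f / r2;
            float r6i = rri * rri * rri;
            float fval = 48.f * r6i * (r6i - 0.5f) * rri;  // -(1/r) dV/dr for the 12-6 potential
            fx += fval * dx; fy += fval * dy; fz += fval * dz;
            epot += 4.f * r6i * (r6i - 1.f);
        }
        f[i] = make_float4(fx, fy, fz, 0.5f * epot);       // w: half the pair energy per particle
    }

    int main()
    {
        // Two particles at distance 1.5 sigma, each the other's only neighbour.
        const int npart = 2, max_nbr = 4;
        float4 h_r[npart] = { make_float4(0.f, 0.f, 0.f, 0.f), make_float4(1.5f, 0.f, 0.f, 0.f) };
        int h_nbr[npart * max_nbr] = { 1, -1, -1, -1,   0, -1, -1, -1 };
        float4 *d_r, *d_f; int* d_nbr;
        cudaMalloc((void**)&d_r, sizeof(h_r)); cudaMalloc((void**)&d_f, sizeof(h_r));
        cudaMalloc((void**)&d_nbr, sizeof(h_nbr));
        cudaMemcpy(d_r, h_r, sizeof(h_r), cudaMemcpyHostToDevice);
        cudaMemcpy(d_nbr, h_nbr, sizeof(h_nbr), cudaMemcpyHostToDevice);
        force_kernel<<<1, 32>>>(d_r, d_nbr, max_nbr, 2.5f * 2.5f, d_f, npart);
        float4 h_f[npart];
        cudaMemcpy(h_f, d_f, sizeof(h_f), cudaMemcpyDeviceToHost);
        std::printf("F(0) = (%g, %g, %g)\n", h_f[0].x, h_f[0].y, h_f[0].z);
        cudaFree(d_r); cudaFree(d_f); cudaFree(d_nbr);
        return 0;
    }

In the described implementation, the neighbour indices are laid out such that consecutive threads read consecutive addresses, which is what makes the loads coalesced; the flat per-particle layout above is merely the simplest way to write it down.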
2.3. Random number generator

A modified Andersen thermostat for pre-equilibration cooling is realised by assigning Boltzmann-distributed velocities to all particles every (µ δt)^-1 steps according to a fixed heat bath collision rate µ. Furthermore, the assigned velocities are rescaled by means of a parallel reduction to exactly match the temperature of the heat bath. We have implemented the parallel rand48 random number generator, which may be trivially parallelised by leap-frogging within the sequence [2]. The linear recurrence

    x_{t+T} = A_T x_t + C_T  mod 2^48

uses a leapfrog multiplier A_T = a^T mod 2^48 and a leapfrog addend C_T = c Σ_{n=0}^{T-1} a^n mod 2^48 to jump within the sequence depending on the total number of threads T. As an improvement compared to Ref. 2, we seed the parallel generator using the parallel prefix sum algorithm (see the example "Scan" in Ref. 17), which yields the initial state x_t for each thread t by binary-tree summation and multiplication. Thus, initialisation times of many seconds are avoided for large systems.

2.4. Double-single precision floating-point arithmetic

It will be demonstrated below that numerical long-time stability requires double precision arithmetic in critical parts of the MD step. Currently prevalent GPUs of the NVIDIA GT200 series feature only 1 double precision floating-point unit per every 8 single precision units, which prohibits the use of native double precision in performance-critical algorithms.

The limited native precision is overcome by algorithms which implement multi-precision arithmetic using two native floating-point registers [20, 21]. On hardware supporting the IEEE 754-1985 floating-point standard [22], proofs of numerical exactness have been given for multi-precision addition and multiplication [23]. For the GPU, which does not fully comply with IEEE 754-1985 in terms of rounding and division, these proofs have been adapted for double-single precision arithmetic using two native single precision floats [24]. The double-single precision algorithms for addition and multiplication on NVIDIA GPUs yield an effective precision of 44 bits.

For IEEE 754-1985 compatible hardware, the DSFUN90 package for Fortran [25] contains implementations of addition, subtraction, multiplication, division, and square root in double-single precision. We have ported these implementations from Fortran to CUDA, extending the functionality provided by the Mandelbrot example of the CUDA SDK [17]. As an example, the addition algorithm in double-single precision is displayed in Algorithm 1. To implement double-single multiplication on the GPU, care has to be taken to avoid fused multiply-add operations, as the GPU does not round the result of the comprised multiplication, in violation of the floating-point standard. As a remedy, CUDA provides the intrinsic operations __fmul_rn and __fadd_rn.

Algorithm 1: Addition of two floating-point numbers in double-single precision, based on the DSFUN90 package [25]. The subscripts 0 and 1 refer to the high and low word of a double-single float, respectively.

    { Compute a + b using Knuth's trick. }
    t0 <- a0 + b0
    e  <- t0 - a0
    t1 <- ((b0 - e) + (a0 - (t0 - e))) + a1 + b1
    { The result is t0 + t1, after normalisation. }
    c0 <- e <- t0 + t1
    c1 <- t1 - (e - t0)
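For illustration, a minimal C++ transcription of Algorithm 1 might look as follows. The struct and function names (dsfloat, ds_add) are ours and are not taken from HAL's MD package or DSFUN90; on the device, every addition would be written with __fadd_rn (and the products of the multiplication routine with __fmul_rn) to enforce round-to-nearest and to prevent fused multiply-add contraction, while the host version below relies on default IEEE-conforming compilation.

    #include <cstdio>

    // Double-single number: value = hi + lo, with lo carrying the rounding error of hi.
    struct dsfloat { float hi, lo; };

    // Algorithm 1: addition of two double-single numbers (Knuth's two-sum plus
    // renormalisation). Operations must not be reordered or contracted.
    dsfloat ds_add(dsfloat a, dsfloat b)
    {
        float t0 = a.hi + b.hi;
        float e  = t0 - a.hi;
        float t1 = ((b.hi - e) + (a.hi - (t0 - e))) + a.lo + b.lo;
        dsfloat c;
        c.hi = e = t0 + t1;    // normalisation: hi carries the leading bits
        c.lo = t1 - (e - t0);  // lo carries the residual error
        return c;
    }

    // Split a double into a double-single pair, for testing only.
    dsfloat ds_from_double(double x)
    {
        dsfloat r;
        r.hi = (float)x;
        r.lo = (float)(x - (double)r.hi);
        return r;
    }

    int main()
    {
        // Accumulate many tiny increments: plain float loses them, double-single keeps ~44 bits.
        dsfloat sum = {0.f, 0.f};
        float fsum = 0.f;
        for (int i = 0; i < 10000000; ++i) {
            sum = ds_add(sum, ds_from_double(1e-4));
            fsum += 1e-4f;
        }
        std::printf("double-single: %.6f  single: %.6f\n", (double)sum.hi + sum.lo, fsum);
        return 0;
    }

The toy accumulation mimics the situation in the force summation and the velocity-Verlet update, where a large value repeatedly absorbs small increments.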
In double-single precision, we have implemented the update of velocities and coordinates in the velocity-Verlet step as well as the summation over force contributions from neighbouring particles. These may be subject to accumulated summing errors, depending on the particular time step and potential, respectively. The evaluation of the force contributions itself, which accounts for a large fraction of the computational cost within the MD step, remains in single precision.

2.5. Evaluation of time-correlation functions

The dynamic properties of a molecular system are often quantified in terms of time-correlation functions [26]. For observables A and B, one defines their correlation at different times t and s as

    C_AB(t, s) = ⟨A(t) B*(s)⟩ = lim_{T→∞} (1/T) ∫_0^T dτ A(t + τ) B*(s + τ).    (1)

In equilibrium, time-correlation functions are stationary and depend only on the difference t − s, thus C_AB(t, s) = C_AB(t − s). The most important class of time-correlation functions is comprised by the auto-correlation functions C_AA(t). If the system is ergodic, the average may be evaluated alternatively as an ensemble average over initial conditions, i.e., over independent simulation runs. For tagged-particle observables, the statistical error is decreased additionally by averaging over all particles of the same species.

We have developed a blocking scheme which allows the online evaluation of time-correlation functions during the production run. The scheme was inspired by the "order-n algorithm" of Ref. 27. Complex relaxation processes often comprise several time scales and are usually discussed on a logarithmic time axis. The blocking scheme yields the correlation functions for fast and slow processes simultaneously and provides a time grid already suitable for a logarithmic representation. Introducing some short-time resolution Δt, we assume that the state of the system is stored at times mΔt for m = 0, 1, 2, ..., from which the observables A_m = A(mΔt) are calculated. For a long simulation run over a time span of T = MΔt, we approximate the integral in Eq. (1) by a sum,

    C_AA^(m) := C_AA(mΔt) ≈ (1/M') Σ_{j=0}^{M'−1} A_{m+j} A_j*,    (2)

where M' = M − m. The evaluation of C_AA^(m) for all m would require handling a huge number of M copies of the system state at different times (or at least of the derived observables) and would involve O(M^2) floating-point operations. Instead, we arrange the time grid in k blocks of size ℓ, where the time resolution between subsequent blocks is increased by a factor of ℓ, see Fig. 2. Within block n, correlations are calculated for time lags Δt_n, ..., (ℓ − 1)Δt_n only, where Δt_n = ℓ^n Δt. The blocks are continuously filled during the simulation. Whenever a block is full, the first entry is correlated with all other entries and is then discarded. Thus, each block contains a section of the trajectory, moving forward as the simulation progresses. The memory requirement to handle a trajectory of length ℓ^k Δt is merely k × ℓ snapshots of the system state.

Figure 2: Illustration of the blocking scheme for k = 4 blocks of size ℓ = 5. Each square represents the system state at a given time; the numbers refer to the finest time resolution Δt. Dashed arrows indicate the filling of the higher block levels with lower resolution. Solid lines indicate the correlations calculated between the left-most entry and all other entries of the same block. The thin squares to the left have already been processed and were discarded, while those to the right still have to be inserted into the blocks.
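A minimal host-side C++ sketch of such a blocking correlator for a scalar observable is given below; it is our illustration of the bookkeeping, not the code of HAL's MD package. It stores k blocks of ℓ samples each, feeds block n with every ℓ^n-th sample, and accumulates Eq. (2) whenever a block is full.

    #include <cstdio>
    #include <cmath>
    #include <vector>
    #include <deque>

    // Online autocorrelation on a quasi-logarithmic time grid: block n holds samples
    // spaced by l^n basic steps; when it is full, its oldest entry is correlated with
    // all entries of the block and then dropped.
    class blocking_correlator
    {
    public:
        blocking_correlator(int k, int l)
          : l_(l), blocks_(k), count_(k, std::vector<double>(l, 0.0)),
            corr_(k, std::vector<double>(l, 0.0)) {}

        void push(double a, long step)
        {
            long spacing = 1;
            for (std::size_t n = 0; n < blocks_.size(); ++n, spacing *= l_) {
                if (step % spacing != 0) break;   // block n is sampled every l^n steps
                std::deque<double>& b = blocks_[n];
                b.push_back(a);
                if ((int)b.size() == l_) {        // block full: correlate the oldest entry
                    for (int j = 0; j < l_; ++j) {
                        corr_[n][j]  += b.front() * b[j];
                        count_[n][j] += 1.0;
                    }
                    b.pop_front();
                }
            }
        }

        void print(double dt) const               // C(t) on the resulting time grid
        {
            for (std::size_t n = 0; n < blocks_.size(); ++n)
                for (int j = 0; j < l_; ++j)
                    if (count_[n][j] > 0)
                        std::printf("t = %-10g C = %g\n",
                                    j * std::pow((double)l_, (double)n) * dt,
                                    corr_[n][j] / count_[n][j]);
        }

    private:
        int l_;
        std::vector<std::deque<double> > blocks_;
        std::vector<std::vector<double> > count_, corr_;
    };

    int main()
    {
        blocking_correlator c(4, 5);               // k = 4 blocks of size l = 5, as in Fig. 2
        for (long step = 0; step < 1000; ++step)
            c.push(std::cos(0.05 * step), step);   // toy observable A(t)
        c.print(1.0);
        return 0;
    }

In the production run, the entries are full phase-space snapshots kept in device memory and the products in Eq. (2) are evaluated by parallel reductions on the GPU; the scalar version above only shows the time-grid bookkeeping.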
3. Performance measurements

A central argument for the use of GPUs in high-performance computing is their high theoretical peak performance compared to that of a single core of a conventional Opteron or Xeon CPU. We did extensive performance measurements of our MD implementation, which we compare first to our own serial reference implementation for the host processor. The comparison between a massively parallel implementation and a serial one is somewhat unfair, in particular in view of the multi-core nature of current CPUs. Thus, we secondly provide a comparison with the LAMMPS package [28], one of the established, parallelised MD packages widely used in the physics community. The test includes the parallel use of multi-core CPUs within a single compute node as well as distributed computing over several nodes, which is more realistic for production use and allows for a more reasonable comparison between a conventional cluster and a single GPU; a similar approach was taken recently by Harvey et al. [8]. LAMMPS offers some GPU acceleration as well, which we have not used on purpose and which we explicitly do not refer to.

Our measurements were done on Lennard–Jones liquids of varying size and density with the conventional 12–6 interaction potential,

    V_LJ(r; σ, ε) = 4ε [(r/σ)^−12 − (r/σ)^−6],

cut off at r_c = 2.5σ; in LAMMPS we used the potential implementation termed lj/cut. Units of length, mass, time, and energy are σ, m, (mσ^2/ε)^{1/2}, and ε as usual, and dimensionless quantities are indicated by an asterisk. We used a step size of δt* = 0.001 and a neighbour list skin of width 0.5σ for the GPU and 0.3σ for the serial CPU and parallel LAMMPS benchmarks. To generate a homogeneous system, an initial fcc lattice was equilibrated in the NVT ensemble at T* = k_B T/ε = 1.5 for 10^4 steps. Then, the wall time spent on the MD step was measured for the next 10^4 steps.

Figure 3: Results of the performance benchmarks for a Lennard–Jones liquid of various densities as a function of the number of particles N; the temperature was kept fixed at T* = 1.5. Large panels show the total time τ per MD step, while insets test the expected linear scaling and display τ/N. Panel a: average time required for a single MD step on the GPU (single precision); panel b: time per MD step of the serial host implementation; note the different scale of the y-axis compared to panels a and d; panel c: relative GPU performance versus a single CPU core; panel d: time per MD step using the parallelised LAMMPS package for jobs with various process numbers N_p. For each density and particle number, the fastest result is shown, where symbols indicate the selected N_p.
For the GPU and parallel CPU benchmarks, the results were averaged over 10 independent realisations. For the LAMMPS benchmarks, only a single system was generated and the results were averaged over two consecutive measurement runs of 10^4 steps each. The investigated system sizes ranged from N = 1,372 to 864,000 particles and the densities from ρ* = 0.2 to 1.2.

The GPU benchmarks were run on GPUs of type NVIDIA GeForce GTX 280, which contains 240 scalar processor cores clocked at 1.3 GHz, with GDDR3 memory at 1.1 GHz providing a device memory bandwidth of 140 GB/s (= 2 × 1.1 GHz × 512 bits). Its theoretical peak single precision floating-point performance, based on one fused multiply-add operation and one concurrent multiply operation per cycle, is 933 Gflop/s (= 240 × 3 flop × 1296 MHz). For comparison with our host reference implementation, we used an AMD Opteron 2216 HE processor at 2.4 GHz with PC2-4200 CL4 dual-channel memory and a theoretical memory bandwidth of 17 GB/s (= 2 × 2 × 533 MHz × 64 bits). The theoretical peak performance, assuming a single SSE instruction execution with four concurrent single precision floating-point operations per cycle, is 9.6 Gflop/s per CPU core. This roughly yields a factor of 100 for the theoretical limit on the speedup of a compute-bound algorithm, and a factor of 8 for an algorithm limited by memory bandwidth. The estimate does not take into account the memory latency on the GPU, which is 400–600 clock cycles for a thread accessing global memory [29] and may constrain the overall performance of an algorithm more substantially than floating-point performance or memory bandwidth. Further, recent CPUs contain several cores, and typical nodes in a computing centre are 4- to 16-way SMP machines; one has to put these numbers into perspective for a parallelised implementation making use of all available cores simultaneously.

The GPU benchmarks were performed with HAL's MD package [30, commit e611734] using single floating-point precision on the GPU for comparison with other work, i.e., without the implementation of double-single precision described above. The GPU-specific CUDA code was compiled with NVIDIA's CUDA compiler (version 2.2) targeting compute capability 1.0. The host-specific code was compiled for the x86_64 architecture with the GNU C/C++ compiler (version 4.3.4) with optimisation flag -O3. The serial CPU benchmarks were done with HAL's MD package [30, commit a628797]; it was compiled with single floating-point precision for x86_64 using the GNU compiler, again with optimisation flag -O3, which implies automatic vectorisation of loops. On the x86_64 architecture, GCC defaults to SSE floating-point arithmetic, which ensures that throughout a calculation, a floating-point value is stored with the precision mandated by its data type.
The compute times per MD step obtained on the GPU and the host are compared in Figs. 3a and b, respectively. They are proportional to the number of particles N, and the double-logarithmic representation nicely corroborates the linear complexity of the simulation algorithm. In the GPU case, the linear scaling is only obeyed for sufficiently large N ≳ 20,000; for smaller N, the compute times show a constant offset of 0.2–0.5 ms, reflecting the overhead of parallel sorting and reduction. The prefactor increases for denser systems, which we attribute to larger neighbour lists. At ρ* = 0.4, we measure a GPU time per MD step and particle of 24 ns; this value doubles to 53 ns for the highest density, ρ* = 1.2. For comparison, the timings reported in Refs. 1 and 2 at ρ* = 0.4 are 62 ns and 200 ns per MD step and particle using the older NVIDIA GeForce 8800 GTX hardware with only 128 CUDA cores at 1.5 GHz.

For the described double-single precision implementation, the execution times on the GPU increase modestly, by about 20%. The dependence on system size and density is similar to the results for the single precision implementation in Fig. 3a. Thus, the performance penalty for using double-single precision in critical parts of the algorithm amounts roughly to a global factor. A significant part of the additional execution time seems to be related to memory latency. Using double-single precision, twice the amount of data is read and written for velocity and position vectors. For the three-dimensional velocity vectors, this totals 3 × 2 = 6 32-bit words per particle, requiring at least two coalesced memory operations, e.g., by using two arrays of float4. In contrast, the velocity vector can be accessed with a single coalesced operation in single precision. We expect that these considerations hold also if native double precision on the GPU is used, dropping the additional floating-point operations for the double-single arithmetic. CUDA devices of compute capability 2.0 only support texture read operations of up to 128 bits, which would entail splitting position and velocity vectors into properly aligned double2 and double components, and would thus require twice the number of memory accesses per vector.

On the host processor, the MD step time of the serially executed programme is proportional to N for small system sizes too (Fig. 3b). With increasing system size, the execution times increase slightly, probably due to a growing number of cache misses. In particular, performance degraded substantially (by a factor of 2) for systems of more than 10^5 particles if the particles were randomly distributed in memory. Thereby, the obtained speedups were spoiled considerably, which we have solved by regularly sorting the particles in memory as in the GPU implementation. The prefactor spreads by about 60% around its average, which is somewhat larger than on the GPU (40%). At ρ* = 0.4, we have measured an MD step time per particle of 1.8 µs, which is comparable to the timings obtained below with the LAMMPS package on a single processor core.

The relative performance gain of the GPU over the CPU is displayed in Fig. 3c. It depends to some extent on the particle density, such that dense systems are favoured by the GPU. While the speedup factor is as small as 4 to 12 for small systems of just 10^3 particles, it goes up to values around 40 for 10^4 particles, and it reaches an approximate plateau between 70 and 80 for more than 10^5 particles, being close to the theoretical limit for a compute-bound algorithm.

The LAMMPS benchmarks were performed at the Leibniz-Rechenzentrum in Munich on 8-way nodes containing 4 dual-core processors of type AMD Opteron 8218 HE at 2.6 GHz. The tested version of the package was released on 9 January 2009; it was compiled with Intel's C++ compiler (version 10.1) using the optimisation flags -O2, -ipo, -unroll, and -fstrict-aliasing. The programme was run in parallel mode using the MPI interface, the nodes were connected via 10 Gigabit Ethernet, and the MPI library used was from ParaStation.
We have tested configurations with 8, 16, 24, 32, 48, and 64 processes. For each particle number and density, the fastest configuration was selected, and these smallest MD step times are shown in Fig. 3d; only configurations with 1, 2, 6, and 8 nodes are relevant for particle numbers between 10^3 and 10^6. For system sizes 10^3 ≲ N ≲ 10^4, a single node (8 processes) is favourable. Although the parallelisation overhead is significantly smaller than for the GPU, the measured times per MD step exceed those on the GPU for more than 4,000 particles. Restricting the configuration to a single 8-way node even for larger systems, the execution times for 10^5 particles are found to be 5 to 10 times slower than for the GPU, depending on density; for higher particle numbers, the single-node performance drops drastically, probably due to frequent cache misses. For large systems of 10^4 ≲ N ≲ 10^5 particles, the fastest execution is obtained with 48 processes, but the time per MD step and particle is about 2 times slower compared to the GPU, in particular at high densities. For huge systems, N ≳ 10^5, the absolute numbers and the spread of the execution times are comparable for the GPU and 64 processes. In conclusion, a single GPU outperforms a conventional cluster by a factor of 2 for large systems and performs similarly to 8 recent 8-way nodes for huge system sizes.

4. Numerical long-time stability

We have thoroughly tested the numerical long-time stability of our implementation with respect to momentum and energy conservation. In Section 5, we will show how a drift in these quantities may affect the dynamics of the simulated system. Numerical errors will be larger if only single precision arithmetic is used, but the pace at which numerical errors accumulate also depends on the form of the potential. We will compare the Lennard–Jones potential and its purely repulsive part, also known as the Weeks–Chandler–Andersen (WCA) potential [31],

    V_WCA(r; ε, σ) = V_LJ(r; ε, σ) + ε,

which is cut off at r_c = 2^{1/6} σ. This potential allows for smaller neighbour lists with fewer particles, speeding up the simulation by up to 40%. Further, we will discuss smoothed versions of the potentials, which possess a continuous second derivative at the cutoff; they are obtained by multiplying with the function

    g(x) = x^4 / (1 + x^4),    x = (r − r_c)/hσ.

The following scrutiny is based on a system of N = 10,000 particles at ρ* = 0.75 and T* = 1.12. For either the LJ or the WCA potential, all results share the same initial configuration, which was obtained by equilibrating an initial fcc lattice in the NVT ensemble with µ* = 1, δt* = 0.001, and h = 0.005 over 10^5 steps using the GPU (double-single precision) implementation.
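To make the smoothing procedure concrete, the following self-contained C++ sketch evaluates a truncated Lennard–Jones potential together with the smoothing function g(x). It assumes, for the purpose of illustration, that g multiplies the truncated and shifted potential so that the pair energy and force vanish smoothly at r_c; this is our reading of the construction, not a verbatim excerpt from HAL's MD package, and all identifiers are ours.

    #include <cstdio>
    #include <cmath>

    // Plain 12-6 Lennard-Jones potential and its derivative (sigma = epsilon = 1).
    static double lj(double r)   { double s6 = std::pow(r, -6.0); return 4.0 * (s6 * s6 - s6); }
    static double lj_d(double r) { double s6 = std::pow(r, -6.0); return (-48.0 * s6 * s6 + 24.0 * s6) / r; }

    // Smoothing function g(x) = x^4 / (1 + x^4) and its derivative.
    static double g(double x)   { double x4 = x * x * x * x; return x4 / (1.0 + x4); }
    static double g_d(double x) { double x4 = x * x * x * x; double d = 1.0 + x4; return 4.0 * x * x * x / (d * d); }

    int main()
    {
        const double rc = 2.5, h = 0.005;   // cutoff and smoothing width (sigma = 1)
        for (int i = 0; i <= 8; ++i) {
            double r  = rc - 0.02 + i * 0.0025;            // approach the cutoff from below
            double x  = (r - rc) / h;                      // scaled distance from the cutoff
            double vs = (lj(r) - lj(rc)) * g(x);           // smoothed, shifted potential
            double fs = -(lj_d(r) * g(x) + (lj(r) - lj(rc)) * g_d(x) / h);  // force -dV/dr
            std::printf("r = %.4f  V = % .4e  F = % .4e\n", r, vs, fs);
        }
        return 0;
    }

Since g and its first three derivatives vanish at x = 0, both the pair force and its derivative are continuous at r_c, which is the property required for the δt^2 scaling test discussed in Section 4.3 below.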
4.1. Momentum conservation

Momentum conservation implies that the interaction between particles does not generate additional momentum or, equivalently, a drift of the centre-of-mass (c.m.) velocity or mean particle velocity, v_cm = ⟨v_i⟩_N. In the summation of forces, a standard index loop over the neighbouring cells first yields a large force contribution in one direction, which is only later cancelled by the forces from particles on the opposite side. Such an implementation in single precision results in a clear drift of the c.m. velocity for the WCA potential, as illustrated in Fig. 4a; the drift is an order of magnitude smaller in the case of the LJ potential, Fig. 4d. Using double-single precision for the summation of forces and summing opposite cells together reduces the drift by factors of 20 and 2.7 for the WCA and LJ potentials, respectively (panels b and e). The drift of the c.m. velocity is suppressed by another factor of 100(!) in both cases if the velocity-Verlet integration is done in double-single precision, too (panels c and f). Modifying the time step of the integration or smoothing the potential does not have a significant effect on the velocity drift.

4.2. Energy conservation

Faithful simulations in the microcanonical ensemble crucially depend on the conservation of the total system energy, E = E_kin + E_pot = const. This condition is particularly sensitive to limited floating-point precision and requires particularly careful examination. For example, an energy drift of 2% was found after 10^6 MD steps using the single precision version of GROMACS (which is the default) [32].

For our GPU and host implementations, using both single and double precision, we have examined the evolution of the total system energy, displayed in Fig. 5. In addition, we compare the degree of energy conservation for the WCA and LJ potentials with and without smoothing and for two different time steps. The single precision GPU implementation shows a clear drift of about 0.1% per 10^7 MD steps for both potentials (panels a and b). There is a contribution from the c.m. velocity drift, which is, however, orders of magnitude smaller. Smoothing the potentials at the cutoff does not improve energy conservation here. For the WCA potential, the time step δt* = 0.003 is obviously too large; energy conservation is degraded considerably more if the potential is smoothed. Using double-single precision for the summation of forces and in the velocity-Verlet algorithm (panels d and e) reduces the energy drift by factors of about 3 (LJ potential) to 7 (WCA). Additional smoothing of the potentials, however, improves energy conservation significantly, by an extra order of magnitude. Thereby, the drift is almost eliminated in the case of δt* = 0.001, down to 5 × 10^−6 and 6 × 10^−5 per 10^7 steps for the WCA and the LJ potential, respectively. Such a tiny drift, however, is dominated and obscured by numerical fluctuations.

The host simulation results shown in panels c and f confirm that the degradation of energy conservation due to single precision is not specific to the GPU implementation. While the quantitative evolution of the energy in panels e and f differs due to the implementation of floating-point arithmetic in either 44-bit double-single precision (GPU) or native 53-bit double precision (CPU), the orders of magnitude of energy conservation are comparable.
Figure 4: Drift of the centre-of-mass velocity for simulation runs on the GPU if forces are accumulated in single precision and naive order (panels a, d), or in double-single precision and opposite cell order (b, e), and if double-single precision is used for both the summation of forces and the velocity-Verlet integration (c, f). The two rows compare the purely repulsive WCA potential (a–c) and the Lennard–Jones potential (d–f). Each panel contains data for two different time steps (δt* = 0.003 and 0.001) with and without smoothing of the potentials (h = 0.005 and 0).

Figure 5: Conservation of the total energy per particle during a long simulation run. The first two columns refer to GPU results (panels a, b, d, e) and the last column to the host implementation (c, f). Single floating-point precision was used for the top row and double-single or double precision for the bottom row. The sensitivity to the precision is compared for the WCA and LJ potentials with and without smoothing and for two different time steps as indicated. Note the different y-axis of panel a.
Figure 6: Scaling test for the numerical fluctuations of the total energy around the initial value on the GPU (left and centre columns) and the host (right column). Results using single precision are shown in the top row; double-single (or native double) precision was used in the bottom row. A smoothed LJ potential was used in the centre and right columns (h = 0.01).

4.3. Numerical energy fluctuations

The analysis of the numerical energy fluctuations yields a sensitive test of the numerical accuracy which requires only short simulation runs. Newton's differential equations are discretised by the velocity-Verlet algorithm to first order in the time step δt, introducing a discretisation error of order δt^2, provided the potential possesses a continuous second derivative. Thus, one expects the total energy to show numerical fluctuations around its initial value which scale as δt^2. In particular, [E(t) − E(0)]/δt^2 should be roughly independent of the time step.

For time steps between δt* = 0.001 and 0.005, these rescaled energy fluctuations are displayed in Fig. 6 for the LJ potential. Using double-single precision on the GPU, all five curves nicely collapse (panel d) over the range of 200 to 1000 integration steps. Merely the largest time step, δt* = 0.005, is slightly off, indicating that higher-order terms become relevant in this case. Smoothing the potential at the cutoff improves the collapse, from which small time steps benefit especially (panel e). If only single precision arithmetic is used on the GPU, the collapse is poor and smoothing the potential at the cutoff has essentially no effect (panels a and b). The two smallest time steps are completely off, which we attribute to the quick accumulation of rounding errors from the tiny increments.

For comparison, we have added the results from the host implementation in single and double precision (panels c and f). The almost identical behaviour as on the GPU corroborates that the discussed effects are due to the limited precision only, and that no other numerical artifacts are introduced by the GPU. An insignificant, but nevertheless interesting, difference between the GPU (double-single precision) and the CPU (double precision) results is the high-frequency fluctuations visible at the smallest time step δt* = 0.001 in the GPU case (panels d and e). These tiny fluctuations are caused by the evaluation of the individual pair force contributions in single precision. We verified that the high-frequency fluctuations disappear if the entire force algorithm on the GPU is implemented in double-single precision, at the cost of significantly reduced performance.

5. Application to glassy dynamics of a supercooled liquid

We employed the above GPU implementation of a molecular dynamics simulation to reproduce the slow dynamics of a supercooled liquid of soft spheres. We resorted to the Kob–Andersen (KA) binary mixture [33–35], which has proven to be a useful model system that reliably delays crystallisation.
Specifically, we have investigated a relatively large system of 40,000 A and 10,000 B particles of equal masses m, interacting with Lennard–Jones potentials, V_αβ(r) = V_LJ(r; ε_αβ, σ_αβ), with the parameters chosen as in Ref. 33: σ_AA = 1, σ_BB = 0.88, σ_AB = 0.8, ε_AA = 1, ε_BB = 0.5, and ε_AB = 1.5. The time step of the Verlet integrator was taken to be δt* = 0.001 in units of (m σ_AA^2/ε_AA)^{1/2}. Equilibration runs were done at fixed energy and covered about 10 times the structural relaxation time; all results were obtained for two independent systems with different initial conditions. (We could not find any signs of crystallisation in a very long thermostatted run of a small system of 1,500 particles at T* = 0.4 and ρ* = 1.2 over a time span of 10^8 LJ units, or 5 × 10^9 MD steps.) Production runs of 1 × 10^6 to 2 × 10^8 steps (for T* = 0.60 and 0.43, respectively), including the online evaluation of the correlation functions, merely took between 1.4 hours and 9.6 days of real time.

For the following discussion, we need some basic quantities central to the description of glassy dynamics. The simplest quantity to characterise the transport dynamics is the mean-square displacement (MSD),

    δr^2(t) = ⟨⟨[r_i(t) − r_i(0)]^2⟩⟩_N,

where ⟨·⟩ denotes a microcanonical time average and ⟨·⟩_N an average over all particles, i = 1, ..., N. Here, we restrict the discussion to the A particles. The diffusion constant D of a tagged particle is then obtained via δr^2(t) ≃ 6Dt for t → ∞. A more complex quantity is the time-correlation function of local density fluctuations, the intermediate scattering function (ISF). The self-part of the ISF is defined as

    F_s(q, t) = ⟨⟨ρ_q^(s)(t) ρ_{−q}^(s)(0)⟩⟩_N

using the Fourier components of the tagged-particle density, ρ_q^(s)(t) = exp(−i q · r^(s)(t)), with discrete wavevectors q ∈ (2π/L) Z^3 and r^(s) denoting any of the r_i. Rotational symmetry implies that F_s(q, t) merely depends on the magnitude of the wavevector, q = |q|, and additional averaging can be done over the orientation of q. To quantify the structural relaxation time τ_α, we adopt the usual definition F_s(q_max, τ_α) = 1/e, where q_max = 7.25 σ^−1 denotes the position of the maximum of the structure factor.

Figure 7: Simulation results for a Kob–Andersen mixture obtained on a GPU using double-single arithmetic. Mean-square displacements (a) and incoherent intermediate scattering functions (b) of the A particles are shown for different temperatures approaching the glass transition at fixed density ρ* = 1.2. Broken lines indicate results obtained with single floating-point precision.

The MSD develops a pronounced plateau with decreasing temperature, reflecting the caging by the arrested surroundings; see Fig. 7a. The diffusion coefficient is drastically suppressed as the temperature approaches an anticipated glass transition temperature T_g and spans more than 3 decades over the simulated temperature range. Correspondingly, the density correlators shown in Fig. 7b develop a plateau, which is a signature of the structural arrest. The slow dynamics is quantified by the diverging structural relaxation time τ_α, for which we observe an increase by 4 orders of magnitude.
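As an illustration of how the self-intermediate scattering function can be estimated from stored configurations, the following host-side C++ sketch averages exp(−i q · Δr) over particles and over three axis-aligned wavevectors of equal magnitude (the imaginary part averages to zero by symmetry). The variable names, the restriction to axis-aligned wavevectors, and the toy input data are our simplifications, not taken from the actual analysis code.

    #include <cstdio>
    #include <cmath>
    #include <vector>

    struct vec3 { double x, y, z; };

    // Self-intermediate scattering function F_s(q, t) for one time lag: average of
    // cos(q . [r_i(t) - r_i(0)]) over particles and over the wavevectors
    // (q,0,0), (0,q,0), (0,0,q), exploiting the isotropy of the liquid.
    double isf(const std::vector<vec3>& r0, const std::vector<vec3>& rt, double q)
    {
        double sum = 0.0;
        for (std::size_t i = 0; i < r0.size(); ++i) {
            double dx = rt[i].x - r0[i].x;
            double dy = rt[i].y - r0[i].y;
            double dz = rt[i].z - r0[i].z;
            sum += (std::cos(q * dx) + std::cos(q * dy) + std::cos(q * dz)) / 3.0;
        }
        return sum / r0.size();
    }

    int main()
    {
        // Toy data: displacements drawn uniformly from [-w, w] per component, for
        // which F_s(q) should approach sin(q w)/(q w) in each direction.
        const double q = 7.25, w = 0.1;
        std::vector<vec3> r0(1000), rt(1000);
        unsigned s = 12345u;
        for (std::size_t i = 0; i < r0.size(); ++i) {
            r0[i] = vec3{0.0, 0.0, 0.0};
            auto u = [&s]() { s = 1664525u * s + 1013904223u; return (s / 4294967296.0) * 2.0 - 1.0; };
            rt[i] = vec3{w * u(), w * u(), w * u()};
        }
        std::printf("F_s(q = %.2f) = %.4f\n", q, isf(r0, rt, q));
        return 0;
    }

In the production runs, these particle sums are evaluated online on the GPU by parallel reduction and, as described above, averaged over all wavevectors of magnitude q compatible with the periodic box.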
A clear observation of the glassy dynamics requires sufficient separation between short-time features and the slow structural relaxation. Hence, the temperature has to be fine-tuned close to the transition, presupposing a sharp value for T_g. Moreover, simulation runs become very long and the energy of the system must be extremely stable during a complete run; we could limit the energy drift to merely 3 × 10^−5 over 2 × 10^8 MD steps at T* = 0.43. It turns out that such long-time stability cannot be maintained with single floating-point precision. For the low temperatures T* = 0.43, 0.44, 0.46, 0.48, we have performed additional production runs with single precision, and both the MSD and the density correlators deviate significantly at long times, see Fig. 7 (broken lines). The system heats up during the simulation, which introduces physical artifacts in the form of a faster relaxation at the pretended temperature. The diffusion constants and structural relaxation times determined for both levels of precision are compared in Table 1, revealing quantitative differences of up to 41%. At the lowest investigated temperature, the system even appears to become super-diffusive at long times; note the crossing of the MSD curve at T* = 0.43 obtained with single precision and the one at T* = 0.44 using double-single precision in Fig. 7a.

Table 1: Dimensionless diffusion constants D* and relaxation times τ*_α of the A particles for temperatures close to the glass transition. The table compares simulation results obtained with double-single and single floating-point precision and gives the relative error due to the limited precision.

    T*      D* (d.-s.)      D* (single)     error    τ*_α (d.-s.)    τ*_α (single)   error
    0.48    1.02 × 10^−4    1.03 × 10^−4    1%       2.31 × 10^3     2.26 × 10^3     2%
    0.46    4.48 × 10^−5    5.12 × 10^−5    14%      6.43 × 10^3     5.69 × 10^3     12%
    0.44    1.73 × 10^−5    2.43 × 10^−5    40%      2.19 × 10^4     1.71 × 10^4     22%
    0.43    1.02 × 10^−5    ∞               —        4.24 × 10^4     2.50 × 10^4     41%

Further, the use of current graphics processors facilitates the investigation of much larger system sizes than were usually accessible before. Among other benefits, data quality is enhanced, as statistical fluctuations of tagged-particle quantities are expected to scale as 1/√N. First results for the velocity autocorrelation function display an excellent signal-to-noise ratio and shed light on novel power-law correlations at low temperatures [36].

6. Conclusion

We have shown how recent graphics processing units can be harnessed to carry out large-scale molecular dynamics simulations in the microcanonical ensemble with strict energy conservation even after 10^8 MD steps. Using GPU computing, we were able to reproduce the slow glassy dynamics of a binary Lennard–Jones mixture over 4 nontrivial decades in time. Single floating-point precision, however, is not sufficient for this purpose and may result in qualitatively and quantitatively wrong results; e.g., the diffusion coefficient was found to diverge at T* = 0.43 and to deviate by up to 40% at higher temperatures due to the limited energy conservation.
We have shown that the mediocre native double precision performance of recent GPUs can be overcome by implementing numerically critical parts of the MD algorithm with double-single precision floating-point arithmetic, which is based on single precision instructions.

The described MD simulation package is fully implemented on the GPU using CUDA, and it avoids costly memory transfers of trajectory data between host and GPU. In addition to earlier work [1, 2], the sorting of particles in memory and the evaluation of dynamic correlation functions are completely performed on the GPU. The number of simulated particles is solely limited by the amount of global device memory. On the NVIDIA GeForce GTX 280, providing 1 GB of memory, simulations of 864,000 particles are possible, but the barrier of one million particles is broken on the NVIDIA Tesla C1060 with 4 GB of (somewhat slower) memory. Our performance measurements show speedups of 70 to 80 compared to a serial simulation on the host processor, and we have shown that the GPUs we have deployed perform similarly to LAMMPS on a modern, conventional HPC system running in parallel on 64 processor cores. We have found that the use of double-single precision in specific parts of the algorithm increases the execution times by merely 20%, which we attribute mainly to a doubling of memory accesses. While the use of native double precision arithmetic in the next generation of GPUs will reduce the number of floating-point operations compared to double-single precision, we expect that the performance penalty due to the latency of global memory access will remain. In particular, the trade-off between performance optimisation and numerical accuracy in terms of floating-point precision will persist for GPUs as it does for conventional processors.

In summary, current graphics processors provide a powerful and robust means for state-of-the-art simulations of simple and complex liquids in general and for numerical studies of glass-forming liquids in particular. Substantial computing resources can already be delivered by local GPU clusters containing a few dozen high-end GPUs, which are affordable in terms of acquisition cost and maintenance for a single institute and which are likely to play a considerable role in future simulation-based research. In addition, some national computing centres have started to support GPU-accelerated computing on large dedicated GPU clusters, which are useful at the single-GPU level already. They will, however, be fully exploited only by the further development of simulation packages running on distributed GPUs (see Refs. 6-8, 10 for examples), enabling the routine investigation of large and complex systems that today can be studied only on exceptionally few supercomputers.

7. Acknowledgment

We thank Thomas Franosch and Jürgen Horbach for many helpful discussions and Erwin Frey for generous support. F.H. acknowledges financial support from the Nanosystems Initiative Munich (NIM). The described software, HAL's MD package, is licensed under the GNU General Public License and is freely available at the project's website [30].

References

[1] J. A. Anderson, C. D. Lorenz, A. Travesset, General purpose molecular dynamics simulations fully implemented on graphics processing units, J. Comp. Phys. 227 (10) (2008) 5342-5359.
[2] J. A. van Meel, A. Arnold, D. Frenkel, S. Portegies Zwart, R. G. Belleman, Harvesting graphics power for MD simulations, Mol. Simul. 34 (3) (2008) 259-266.
[3] W. Liu, B. Schmidt, G. Voss, W. Müller-Wittig, Accelerating molecular dynamics simulations using graphics processing units with CUDA, Comput. Phys. Commun. 179 (9) (2008) 634-641.
[4] J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco, K. Schulten, Accelerating molecular modeling applications with graphics processors, J. Comput. Chem. 28 (16) (2007) 2618-2640.
[5] J. Yang, Y. Wang, Y. Chen, GPU accelerated molecular dynamics simulation of thermal conductivities, J. Comp. Phys. 221 (2) (2007) 799-804.
[6] D. Xu, M. J. Williamson, R. C. Walker, Advancements in molecular dynamics simulations of biomolecules on graphical processing units, Vol. 6 of Annual Reports in Computational Chemistry, Elsevier, 2010, pp. 2-19.
[7] V. A. Voelz, G. R. Bowman, K. Beauchamp, V. S. Pande, Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1-39), J. Am. Chem. Soc. 132 (5) (2010) 1526-1528.
[8] M. J. Harvey, G. Giupponi, G. De Fabritiis, ACEMD: Accelerating biomolecular dynamics in the microsecond time scale, J. Chem. Theory Comput. 5 (6) (2009) 1632-1639.
[9] T. Preis, P. Virnau, W. Paul, J. J. Schneider, GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model, J. Comp. Phys. 228 (12) (2009) 4468-4477.
[10] B. Block, P. Virnau, T. Preis, Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model, Comput. Phys. Commun. 181 (9) (2010) 1549-1556.
[11] W. Götze, Complex Dynamics of Glass-Forming Liquids: A Mode-Coupling Theory, International Series of Monographs on Physics, Oxford University Press, Oxford, 2009.
[12] M. Mosayebi, E. Del Gado, P. Ilg, H. C. Öttinger, Probing a critical length scale at the glass transition, Phys. Rev. Lett. 104 (20) (2010) 205704.
[13] E. Flenner, G. Szamel, Dynamic heterogeneity in a glass forming fluid: Susceptibility, structure factor, and correlation length, Phys. Rev. Lett. 105 (21) (2010) 217801.
[14] D. C. Rapaport, The Art of Molecular Dynamics Simulation, 2nd Edition, Cambridge University Press, New York, 2004.
[15] D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, Addison-Wesley, 1998.
[16] M. Zagha, G. E. Blelloch, Radix sort for vector multiprocessors, in: Proceedings Supercomputing '91, 1991, pp. 712-721.
[17] NVIDIA CUDA SDK code samples (2009). http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html
[18] H. Sagan, A three-dimensional Hilbert curve, Int. J. Math. Educ. Sci. Technol. 24 (4) (1993) 541-545.
[19] J. Wang, J. Shan, Space-filling curve based point clouds index, in: Proceedings of the 8th International Conference on GeoComputation, University of Michigan, 2005.
[20] T. J. Dekker, A floating-point technique for extending the available precision, Numerische Mathematik 18 (3) (1971) 224-242.
[21] D. E. Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, 3rd Edition, Addison-Wesley, 1997.
[22] IEEE standard for binary floating-point arithmetic, ANSI/IEEE Std 754-1985.
[23] C. Q. Lauter, Basic building blocks for a triple-double intermediate format, Tech. Rep. RR-5702, INRIA (September 2005). http://hal.inria.fr/inria-00070314/en
[24] G. Da Graça, D. Defour, Implementation of float-float operators on graphics hardware, in: 7th Conference on Real Numbers and Computers, Nancy, France, 2006, arXiv:cs/0603115v1 [cs.AR].
[25] D. H. Bailey, Fortran-90 double-single package (2005).
http://crd.lbl.gov/~dhbailey/mpdist
[26] J.-P. Hansen, I. McDonald, Theory of Simple Liquids, 3rd Edition, Academic Press, Amsterdam, 2006.
[27] D. Frenkel, B. Smit, Understanding Molecular Simulation, 2nd Edition, Academic Press, London, 2001.
[28] S. J. Plimpton, Fast parallel algorithms for short-range molecular dynamics, J. Comp. Phys. 117 (1995) 1-19. http://lammps.sandia.gov
[29] NVIDIA CUDA programming guide, version 2.2. http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf
[30] P. H. Colberg, F. Höfling, HALMD: Highly Accelerated Large-scale Molecular Dynamics package. http://research.colberg.org/projects/halmd
[31] J. D. Weeks, D. Chandler, H. C. Andersen, Role of repulsive forces in determining the equilibrium structure of simple liquids, J. Chem. Phys. 54 (12) (1971) 5237-5247.
[32] R. A. Lippert, K. J. Bowers, R. O. Dror, M. P. Eastwood, B. A. Gregersen, J. L. Klepeis, I. Kolossvary, D. E. Shaw, A common, avoidable source of error in molecular dynamics integrators, J. Chem. Phys. 126 (4) (2007) 046101.
[33] W. Kob, H. C. Andersen, Scaling behavior in the β-relaxation regime of a supercooled Lennard-Jones mixture, Phys. Rev. Lett. 73 (10) (1994) 1376-1379.
[34] W. Kob, H. C. Andersen, Testing mode-coupling theory for a supercooled binary Lennard-Jones mixture. I: The van Hove correlation function, Phys. Rev. E 51 (5) (1995) 4626-4641.
[35] W. Kob, H. C. Andersen, Testing mode-coupling theory for a supercooled binary Lennard-Jones mixture. II. Intermediate scattering function and dynamic susceptibility, Phys. Rev. E 52 (4) (1995) 4134-4153.
[36] F. Höfling, P. H. Colberg, in preparation.
