A GPU-Accelerated Fast Multipole Method for GROMACS: Performance and Accuracy

An important and computationally demanding part of molecular dynamics simulations is the calculation of long-range electrostatic interactions. Today, the prevalent method to compute these interactions is particle mesh Ewald (PME). The PME implementation in the GROMACS molecular dynamics package is extremely fast on individual GPU nodes. However, for large scale multinode parallel simulations, PME becomes the main scaling bottleneck as it requires all-to-all communication between the nodes; as a consequence, the number of exchanged messages scales quadratically with the number of involved nodes in that communication step. To enable efficient and scalable biomolecular simulations on future exascale supercomputers, clearly a method with a better scaling property is required. The fast multipole method (FMM) is such a method. As a first step on the path to exascale, we have implemented a performance-optimized, highly efficient GPU FMM and integrated it into GROMACS as an alternative to PME. For a fair performance comparison between FMM and PME, we first assessed the accuracies of the methods for various sets of input parameters. With parameters yielding similar accuracies for both methods, we determined the performance of GROMACS with FMM and compared it to PME for exemplary benchmark systems. We found that FMM with a multipole order of 8 yields electrostatic forces that are as accurate as PME with standard parameters. Further, for typical mixed-precision simulation settings, FMM does not lead to an increased energy drift with multipole orders of 8 or larger. Whereas an ≈50 000 atom simulation system with our FMM reaches only about a third of the performance with PME, for systems with large dimensions and inhomogeneous particle distribution, e.g., aerosol systems with water droplets floating in a vacuum, FMM substantially outperforms PME already on a single node.


INTRODUCTION
The evaluation of mutual interactions in many-body systems is a crucial and limiting task in many scientific fields such as biomolecular simulations, 1 astronomy, 2 and plasma physics. 3 Here, we consider molecular dynamics (MD) simulations, where electrostatic forces acting on N atoms at positions x i with partial charges q i are calculated to determine new positions of the atoms in subsequent discrete time steps. ∥·∥ denotes the Euclidean norm. A direct calculation of the forces has N ( ) 2 complexity; thus only systems of limited size can be computed directly in equitable time. Additionally, a typical MD simulation employs periodic boundary conditions (PBC) to avoid surface artifacts, making the direct calculation unfeasible even for small systems. In contrast to cosmological calculations, which are usually limited by the available memory due to enormous particle numbers, 1 many interesting biomolecular systems consist of (10 10 ) 5 6 − particles. Recently, however, the demand to study increasingly large systems has grown markedly, and systems of 10 8 −10 9 particles could become routine soon. 4−7 Nevertheless, biomolecular systems, independent of their size, require long trajectories where the length of a time step can be no longer than a few femtoseconds for numerical stability reasons. Thus, the time required to finish one simulation step needs to be shortened to a millisecond or less so that long enough trajectories can be produced in reasonable time. To overcome these bottlenecks, the solution of eq 1 requires efficient approximation. The prevalent method for such approximation in the field is particle mesh Ewald (PME). 8 PME uses Ewald summation to split up the calculation into a short-range part, for which all interactions up to a cutoff radius r c are directly evaluated, and a long-range part, which is solved in reciprocal space. To take advantage of fast Fourier transforms (FFTs) for the conversions to and from reciprocal space, the charges are interpolated onto a uniform grid using cardinal B-splines. Higher interpolation orders and finer grids yield higher accuracy for the reciprocal part. PME scales with N N ( log ) and by construction provides a PBC solution, but does not allow for nonperiodic calculations.
MD packages like GROMACS 9−11 or NAMD 12 have PME implementations that are highly performance optimized. With GROMACS, typical MD systems reach iteration rates of (1000) steps per second currently; 11 hence all forces are computed in less than a millisecond. However, with increasing parallelization, as required for high performance applications, PME runs into a communication bottleneck. Because the FFTs require all-to-all communication, which implies quadratic scaling with the number of processes, PME scaling breaks down at an intermediate number of processes. 13−15 A further limitation is that the FFT grid becomes memory intensive, particularly if high accuracy is required or for highly inhomogeneous charge distributions.
An alternative way for rapid evaluation of Coulomb forces is the fast multipole method 16 (FMM), which is not impaired by the aforementioned limitations and even scales with N ( ). Therefore, while PME is fast for small to medium sized MD systems at moderate parallelization, FMM will be competitive for large number of particles, large simulation boxes, inhomogeneous charge distributions, and high parallelization. 11,13 Further, FMM can be used for both periodic and open boundaries.
FMM splits the calculation into a near f ield, which is directly evaluated, and into a far f ield. For the far field, groups of sufficiently separated point charges are combined and described as truncated multipole expansions. The grouping is accomplished by recursively subdividing the simulation box into sub-boxes in an octree fashion; i.e., each parent box is subdivided into eight equal child boxes when the tree depth d is increased. This yields 8 d boxes on the lowermost level. For d = 0, there is no subdivision. Interactions between particles residing in the same or in directly neighboring boxes at the lowest tree level are calculated directly as in eq 1, whereas interactions between particles in distant boxes are approximated via far field calculations. FMM can also allow for direct interactions between particles in boxes with a larger distance from each other. The distance is controlled by the wellseparateness criterion "ws". Larger ws improves the accuracy of the method but it impairs its performance markedly since eq 1 scales quadratically with respect to the number of particles. 17 In this work we exclusively consider ws = 1; hence, only particles of nearest neighbor boxes interact directly.
For the far field interactions, the inverse distance between charged particles with index i and j is approximated as 18 where Y and Y* are spherical harmonics and their complex conjugate, respectively. The multipole order p controls the accuracy of the approximation. FMM achieves linear scaling with respect to N by performing hierarchical far field operations on multipoles expanded in octree boxes. Computationally, the most demanding part of the far field evaluation is the multipole-to-local (M2L) transformation. It requires p ( ) 2 dot products with p ( ) 2 complexity, yielding an overall complexity of p ( ) 4 . The spherical harmonics based FMM (eq 2) was developed by Greengard and Rokhlin. 18 Following this, other approximations of the inverse distance have been developed, such as the plane wave expansion approach 19 to reduce operational costs of the M2L operator from p ( ) 4 or p ( ) 3 to p ( ) 2 or the black-box FMM, 20 which utilizes Chebyshev interpolation to minimize the far field representation of the multipoles.
One of the first parallel GPU implementations of the spherical harmonics based FMM 21 used p ( ) 3 operators and achieved accuracy dependent speedups of 30−70 relative to a serial run on a single CPU core. Recently, the p ( ) 3 M2L operator for a single GPU was optimized further. 22 GPUs were used to speed up the kernel independent FMM 20,23 and the black-box FMM. 24 A single-GPU implementation of the spherical harmonics FMM 25 was also parallelized over a cluster 26 with 32 GPUs where it reached parallel efficiencies of 44% for 10 6 particles and 66% for 10 7 particles. Larger, multinode, multi-GPU parallelization 27 for a 256 million particle system over 256 GPUs followed. Blanchard et al. 28 and Agullo et al. 24 presented task based parallelization strategies.
The FMM has been successfully used to compute Coulomb or gravitational interactions in a wide range of applications, 1−3 whereas its use for biomolecular simulations is still limited with a few exceptions. 29,30 We have therefore developed, implemented, and optimized an FMM for MD simulations with GROMACS.
As GROMACS usually runs in mixed precision, using double precision only for accumulation order sensitive tasks, consumer GPUs are extremely attractive for the force computation, as they offer a high single-precision FLOP rate at a low price, especially compared to CPUs. 31 Therefore, we implemented the complete FMM workflow on the GPU. Whereas rotational M2L operators with complexity p ( ) 3 have been proposed, 18 here we consider an p ( ) 4 approach for the M2L operator as it is better suited for GPU parallelization. 32 Our GPU version is based on the ScaFaCoS FMM, 2 which we fully parallelized with CUDA 33 and optimized for GROMACS. Here, we assess the performance of our GPU FMM implementation 34 and evaluate its accuracy in comparison to GROMACS' PME implementation.

BENCHMARK METHODS
In a first step, we verified that our CUDA FMM implementation yields accurate energies and forces by comparing against known reference solutions for several input systems. Subsequently, we used typical MD systems to compare FMM vs PME performance in GROMACS 2019.
2.1. Accuracy of FMM Results. The forces and energies computed with the FMM deviate from their exact values mainly due to truncation of the multipole expansion at finite order p, which for small p causes the main contribution to the numerical error. Additionally, the errors in the energies vary due to different accumulation orders in the parallelized reductions. To quantify the magnitude of these errors, we compared FMM derived forces, potentials, and energies with reference solutions.
Given a reference solution v i , i = 1, ..., N, with N values of the potential at the atomic positions, or the 3N individual Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article scalar values x, y, and z force components, we estimated the approximation error with the cumulative relative L 2 error norm: where vĩ are the approximated values.

Benchmark Systems.
To assess the correctness and performance of our FMM implementation, we created five benchmark systems, which were then used to check different aspects of our implementation. We first verified that the FMM forces and energies for open and periodic boundaries are correct; then we found the FMM parameters yielding the same accuracy as the existing GROMACS PME implementation. Finally, we compared the performance of both methods at the same accuracy.
GROMACS benchmark systems were set up with GROMACS 10 2019 using the AMBER03 force field, 35 the TIP3P 36 water model, and an integration time step of 4 fs. Note that this force field and water model are just an examplein fact, all force fields and water models supported by GROMACS can be combined with FMM electrostatics.
2.2.1. Infinite Ideal Crystal. The "ideal crystal" benchmark represents an infinite lattice of alternating positive and negative elementary charges. The charges were arranged as in an NaCl crystal in a 32 × 32 × 32 nm 3 large box containing alternating +1e and −1e charges at 0.5, 1.5, 2.5, ..., 30.5, 31.5 nm in each dimension, in total 32 3 = 32 768 point charges. The shortest distance between nearest charges is exactly 1.0 nm, allowing for direct comparison with an analytical solution.
Considering the PBC, such a system of size 2 × 2 × 2 nm 3 would be sufficient to compare against an analytical solution. However, the number of charges was chosen in a way that allows for flexibility during the tests regarding the choice of parameters. For instance, with PME a larger range of real-space cutoffs can be used, and with FMM various tree depths d = 1, 2, 3, 4 can be tested having a significant number of charges even on the lowest levels.
The potential energies at each charge center were calculated analytically with Madelung's constant M. 37 Its value was obtained by summing a specific, three-dimensional Epstein zeta function for the case of s = 1/2, where ∑′ excludes the origin sum to avoid singularity. The sum is absolutely convergent when summing over expanding cubes. 38 For comparison we used the value M = −1.74756459..., which is given with 60 digits by Crandall. 37 2.2.2. Salt Water. Our "salt water" benchmark consists of 16 861 water molecules with 46 Na + and 46 Cl − ions in an ≈8 × 8 × 8 nm 3 periodic simulation box, yielding 50 675 atoms in total. We used it to compare PME versus FMM errors and to determine which FMM parameters are needed to obtain a desired accuracy. Considering the Coulomb forces, we expect this system to reasonably well approximate the error behavior of typical MD systems of macromolecules embedded in water. However, setups with highly nonuniform charge distributions, e.g., membrane systems, could differ in their error distribution and magnitude.
An initial trajectory was generated with cutoffs set to 1 nm. PME was used for electrostatic interactions with a grid spacing of 0.135 nm and fourth order B-spline interpolation. 8 Temperature coupling to a heat bath of 300 K was done with the V-rescale algorithm, 39 while the pressure was kept at 1 bar with the use of Berendsen coupling. 40 2.2.3. Salt Water Droplet. The "salt water droplet," as shown in Figure 1, contains the same number of molecules as the periodic salt water system, but in open boundaries. It was built by centering a snapshot of the above system in a larger box of size 14 × 14 × 14 nm 3 . Apart from the fixed volume and therefore variable pressure, the simulation parameters are identical to those of the periodic case. With open boundaries, the system adopted an approximately spherical shape within ≈50 ps.
In principle, the box size is only relevant with PBC; technically, however, we used the box to treat the individual single water molecules that did occasionally evaporate from the droplet, as if they were in PBC, simply to keep them from flying too far away. A 130 ns long trajectory of the droplet was simulated, of which snapshots for later analysis were extracted.
The droplet system with open boundaries allows for computation of the reference Coulomb energy and forces by direct summation. It was, therefore, used assess the correctness of the complete FMM implementation apart from the periodic part, which is computed with an additional lattice operator.
2.2.4. Aerosol/Multidroplet Evaporation System. The "multidroplet system" (Figure 2) was built to demonstrate the advantages of the FMM for systems with highly nonuniform particle distributions, as occur in the atomistic simulation of, e.g., electrospray ionization as a prerequisite for mass spectrometric analysis, 41−43 ion mobility spectrometry, 44 laser-induced liquid beam ion desorption, 45,46 and various naturally occurring 47 or artificially produced 48 aerosols.
MD simulations can significantly complement these experiments by providing a detailed picture of the involved processes, e.g., the various aspects of droplet formation and evolution, charge migration, ion/lipid/protein desolvation, Figure 1. Salt water droplet test system. Water molecules are shown in surface representation (oxygens, red; hydrogens, white), with Na + ions in magenta, Cl − ions in green, simulation box in black.
Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article collisions with the background gas, and gas-phase unfolding.
Simulating proteins, lipids, ion, and waters in the gas phase 49 implies spatially extended simulation systems consisting mostly of vacuum.
In the gas phase, due to the lack of shielding, the correct treatment of long-range electrostatic forces is even more crucial than for fully solvated species to avoid artifacts 50 and to correctly describe experimental conditions. 51 With such extended systems, PME often reaches its limits, as memory requirements become prohibitive for the underlying large FFT grids. Sometimes the use of PME is precluded because, for optimal agreement with experiment, open boundaries may be more appropriate than PBC. 43 Being a prototype for such sparse systems, our multidroplet benchmark contains 75 small water droplets in a box of side length 135.6 nm with 108 663 atoms. Sixty-three Na + and 63 Cl − ions were distributed within the droplets. The system was run in the NVE ensemble with PBC. The van der Waals cutoff was set to 2 nm. For PME, to prevent a prohibitively large FFT grid, a Coulomb cutoff of 2.943 nm was used in combination with a grid spacing of 0.353 nm. This results in a Fourier grid of 384 3 points.
2.2.5. Water Boxes of Different Sizes. To assess how our FMM implementation scales with respect to the number of particles N, we have build cubic boxes of edge lengths 3.13− 67.4 nm containing 1000−10 000 000 TIP3P water molecules, 36 i.e., N = 3000−30 000 000 particles. Benchmarks were run in the NVT ensemble with the use of Berendsen temperature coupling 40 at a reference temperature of 300 K. Coulomb and van der Waals cutoffs were set to 1 nm. With PME, a mesh spacing of 0.135 nm was used with fourth order interpolation.
2.2.6. Random Charges. To assess the FMM performance and scaling in a standalone setting, i.e., without being coupled to GROMACS, we used 1000 < N < 286 000 000 randomly distributed charges in a box of a constant size of 100 nm. FMM standalone tests estimate the overhead introduced by integration of the FMM into GROMACS.
2.3. Benchmarking Procedure. All performance benchmarks were run on a node with Intel E5-2630v4 @ 2.2 GHz CPU, and NVIDIA RTX 2080Ti GPU running Scientific Linux 7.6 GROMACS 2019 was compiled with GCC 7.4.0, CUDA 10.0, thread-MPI, and AVX2_256 SIMD instructions, and with OpenMP and hwloc 52 1.11 support. For the runs with PME on the CPU, the FFTW 3.3.7 53 library was used.
For optimum performance in a single-CPU, single-GPU setting, 11,31 we used a single thread-MPI rank with either as many OpenMP threads as physical CPU cores (10) or as many OpenMP threads as CPU hardware threads (20). On modern Intel CPUs, using all available hardware threads can provide a performance benefit of up to ≈15% for cases with at least a few thousand atoms per core. We tested both settings in our benchmarks and report the performance of the fastest setting. It turned out that our benchmark systems with 50 000 atoms or more were faster with 20 threads instead of 10.
Additionally, the FMM vs PME scaling benchmarks were run on a node with 20-core Intel Xeon Gold 6148F CPU and NVIDIA V100-PCIE-32GB GPU running SLES 12.4. Here, GROMACS was compiled with GCC 8.4.0, CUDA 10.1, Intel MPI 2019, and AVX_512 SIMD instructions, and with OpenMP and hwloc 2.1.
Each benchmark ran for several minutes, i.e., several thousand time steps. Because the initial time steps often require long execution times due to memory allocations and load balancing effects, all times were recorded for the second half of each run.

RESULTS AND DISCUSSION
3.1. FMM Convergence and Correctness. In this section we quantify the errors resulting from the FMM evaluation of the Coulomb interactions. We will first show that, with increasing order p of the multipole expansion, FMM converges to the correct solution. This was done in two steps. First, we used a system with open boundaries, where the correct solution (within numerical limits) can be obtained by a direct summation. Second, a simple periodic crystal with analytically derived solution was used as a reference to verify the correctness of the FMM PBC solution.
The Coulomb potential V C for a system of N charges is with k = 1/(4πϵ 0 ) and ϵ 0 is the vacuum permittivity. Our FMM implementation uses dimensionless values with k set to unity, whereas in GROMACS, k ≈ 138.935 kJ nm/(mol e 2 ), with e the elementary charge. If an axis of a plot shows kilojoules per mole units for the potential energy and kilojoules per mole per nanometer for a force, the GROMACS unit system is used; otherwise energy and forces will be dimensionless. As can be seen, the force errors decrease exponentially with growing multipole order p and begin to saturate at p = 40 and  system. In single precision, increasing the multipole order to p > 12 does not reduce the error any further as the error reaches the limited machine representation.
In summary, for open boundaries we conclude that FMM forces are as accurate as forces from a direct summation for high multipole orders p. In double precision, p ≳ 40 yields as accurate forces as a direct summation, whereas for single precision, p ≳ 12 suffices to reach the numerical limits. The relative accuracy of the Coulomb potential energy is about 10 −7 for p ≥ 8 in single precision, whereas with double precision, accuracies of 10 −14 are reached for p > 40. For p < 50 in double precision and p < 12 in single precision, the errors in forces and energies are larger for higher tree depth d.
3.1.2. Comparison to Analytic Solution for Periodic Boundaries. Next, we compared the FMM electrostatic energy for the ideal crystal with the analytical results. Figure 7 shows the relative error in the energy for a double-precision computation. The energy error decays exponentially with increasing multipole order. Note that the decay of the energy (compare also Figures 6 and 5) is not strictly monotonic, which follows from the evaluation of the Coulomb integral on cuboids and has been described elsewhere. 17,55 Reaching the relative accuracies at the numerical limit for p ≳ 40 verifies that the treatment of the periodic boundaries in our FMM implementation is correct and that the FMM approximated    Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article energy with full PBC converges to the true value for growing multipole orders.

3.2.
Comparison of FMM to PME. After establishing the correctness of our FMM implementation, we compared it to PME by asking which FMM parameters p and d yield accuracies similar to those of several representative PME parameter settings, e.g., the spacing s of the Fourier grid and the B-splines interpolation order (also called PME order). In GROMACS' PME implementation, the ewald-rtol parameter controls the relative strength of the direct potential at the cutoff r c and thereby how accurate the real space part is in relation to the reciprocal space part. 56 Smaller values yield a more accurate real space contribution but a less accurate reciprocal space contribution. The default PME parameters use 10 −5 for ewald-rtol, which minimizes the error for typical MD settings with cutoffs of r c ≈ 1 nm and PME grid spacings of s ≈ 0.12 nm. To reach optimal PME accuracy, however, a much smaller value of the ewald-rtol parameter is required in combination with a very fine PME grid and a sufficiently large interpolation order.
3.2.1. PME and FMM for the Ideal Crystal. Figure 8 shows how much the energy of the ideal crystal computed using several different PME parameters deviates from the reference values. For the used very fine grids with spacing s = 0.05−0.1 nm (corresponding to 320 3 −640 3 grid points) combined with a high interpolation order of 12, PME accuracy mainly depends on the value of the ewald-rtol parameter. For ewald-rtol ≲ 10 −13 , the energy error achieves roughly 10 −14 , whereas FMM reaches this error bound for p ≥ 40. Hence, we have shown that both PME and FMM reach a relative accuracy of ≈10 −14 in double precision in a periodic system.
3.2.2. PME vs FMM for the Salt Water System. Having shown that FMM and PME yield the same numerical accuracy for the potential energy for a simple periodic system, we switch to a more typical MD setting, namely the periodic salt water box with 50 675 atoms. For this system, we used a reference solution computed with the FMM in double precision using p = 50 and d = 0.
The colored histograms in Figure 9 show the errors in the Coulomb forces for various FMM and PME parameters. For PME, we selected four different parameter sets, two of which are representative for typical MD settings, another which pushes the parameters toward maximum accuracy, and an intermediate one. The "default" set uses the GROMACS default values of PME parameters, which are typical settings for many biomolecular simulations, i.e., a Coulomb cutoff of r c = 1.0 nm with a PME grid spacing of s = 0.12 nm and a B-spline interpolation order of 4. The "high precision" set uses r c = 1.2 nm with s = 0.1 nm and an interpolation order of 6. We also test a "maximal" parameter set using the largest possible cutoff that still respects the minimum image convention (r c = 4.0 nm) with s = 0.04 nm and an interpolation order of 12, which is the highest order supported by GROMACS. The "extreme" parameter set yields a precision between the "high" and "maximal" settings; see the legend of Figure 10. For each of the four PME parameter sets we have selected the ewald-rtol  Circles show the relative deviation of the energy computed with PME from its correct value as a function of the ewald-rtol parameter for interpolation order 12 for four parameter sets (see legend, r c = real-space cutoff, s = PME grid spacing) For comparison, the corresponding FMM errors for p ≥ 40 are indicated by the shaded region (compare Figure 7). Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article parameter such that it yields the minimal error in the Coulomb energy in double precision, as summarized in Figure 10.
For the typical use case with single-precision forces, the accuracy of the Coulomb forces for "default" PME parameters is similar to that of FMM for p ≈ 7 at d = 3. The "high precision" PME parameters require an FMM with p = 14.
3.2.3. Periodic Boxes with Noncubic Geometry. With PBC, our implementation is currently limited to cubic box shapes. Noncubic simulation boxes require modified octree subdivision, 57 as eq 2 converges only if ∥x i ∥ < ∥x j ∥. As shown in Figure 11A, this condition is always fulfilled for cubic boxes. A slight deviation from cubicity, Figure 11B, does not violate this condition, but a larger ratio s ≔ ∥x i ∥/∥x j ∥ affects the convergence rate of the approximation. With decreasing s, the approximation error decreases with ∥x j − x i ∥ −1 s p+1 for given p. 19 As a rough guideline, a rectangular box with a 1.2:1 aspect ratio should achieve an accuracy similar to that of a cubic box with p = 8 (or p = 12), if the multipole order is raised to p = 10 (or p = 25). Slight deviations from cubicity, i.e., a few percent, should however not markedly affect the accuracy at constant p.

Energy Conservation with FMM.
For NVE simulations without temperature and pressure control, all employed algorithms must be energy conserving to prevent a gradual, unphysical heating (or cooling) of the simulation system. But even when a thermostat is in place to absorb excess heat, algorithms should in general not introduce or remove significant amounts of heat from the system as that could cause artifacts. In practice, however, slight deviations from perfect energy conservation may be tolerated and, in fact, many of the employed algorithms contribute (with positive or negative sign) to an overall energy drift. The drift is caused by accumulated numerical and integration errors due to, e.g., the finite integration step size, the finite numerical precision, the constraint algorithm(s), or the various approximations during force calculations.
One such approximation is the pair lists for the Coulomb and van der Waals interactions within the cutoff. For enhanced performance, these lists are constructed from the cutoff plus an added buffer region (called Verlet buffer) so that they do not need to be updated every step. However, with list lifetimes > 1 step, even with such a radial buffer, occasionally a distant nonbonded interaction may be missed, thus contributing to the overall energy drift. 58 In contrast, for FMM-computed Coulomb interactions, energy drift results from octree space discretization. Whereas PME uses a smooth switching function between interactions computed in direct versus reciprocal space, FMM particles contribute either completely or not at all to an octree box. Hence, particles crossing the octree box boundaries produce small discontinuities in the forces over time.
When substituting PME with FMM, we need to make sure that FMM does not increase the total energy drift. Therefore, we have determined the energy drift over time in a typical, mixed-precision simulation with PME and compared it to the same simulation with FMM for various FMM parameters. Figure 12 shows the drift of the total energy for the salt water benchmark with FMM in comparison to PME with "default" parameters (r c = 1.0 nm, s = 0.12 nm, fourth order interpolation, ewald-rtol = 10 −4 ). With FMM, at the depth that yields the highest performance (d = 3), the PME default drift level is met for multipole orders p ≥ 8. The values of the total drift smaller than the black dashed line are due to cancellation of the positive FMM contribution with negative contributions as, e.g., result from the water SETTLE constraints. 58 Whereas with double precision and a large enough Verlet buffer, the total drift can be reduced to <10 −7 kJ/mol/ps per atom for both PME and FMM (see Figure 11 in Kohnke et al. 32 ), typical mixed-precision MD settings yield drifts of (5−8) × 10 −5 kJ/mol/ps per atom. Regularizing the Figure 10. Coulomb energy error for various PME parameters, as in Figure 8but for a snapshot of the salt water system for double precision (solid lines with large circles) and single precision (dotted lines with darker small circles). For each combination of r c , s, and PME order, there is one value of the ewald-rtol parameter that minimizes the PME error. The reference energy was determined using a double-precision FMM calculation with p = 50 at d = 0. As almost all energy errors are ≥10 −6 for single precision, they were omitted from the graph for the "maximal" parameter set (brown). Figure 11. Chosen strict octree subdivision requires the simulation box to be approximately cubic; otherwise the convergence criterion is not fulfilled. (A) Exactly cubic box; (B) slightly noncubic box; (C) extremely noncubic box. A source particle x i and a target particle x j are positioned in a way that maximizes the ∥x i ∥/∥x j ∥ ratio reflecting the worst-case scenario.
Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article FMM could help to meet the energy conservation requirements of MD simulation at even lower p, as shown by Shamshirgar et al. 59 3.4. Performance of GPU FMM in GROMACS. 3.4.1. FMM vs PME Performance. With previous tests we have established that the FMM with p = 8 and d = 3 achieves the same approximation quality as the PME with "default" parameters (see Figure 9). Therefore, we compared the performances of the two methods at these parameters.
We first determined the FMM performance as a function of p and d for simulations in mixed precision (Figure 13). At p = 8, the salt water and multidroplet benchmarks achieve 153 and 72 ns/day, respectively. For both benchmarks d = 3 maximizes the performance. However, the scaling behaviors with respect to p notably differ when comparing both systems. The inhomogeneity of the particle distribution in the multidroplet system changes the near field to far field calculation intensity ratio. Clustered particles occupy only a few FMM boxes; hence, for the far field, the empty boxes are skipped to enhance performance. We can observe that performance dependency on p is significant only for higher multipoles as the calculation is dominated by a very large number of directly interacting particles clustered into only a few boxes. Figure 14 shows a performance comparison between FMM and PME for both systems. For the periodic salt water system, GROMACS with GPU FMM achieves about a third of the GPU PME performance. The situation reverses for the strongly inhomogeneous multidroplet system: here, FMM outperforms PME by more than a factor of 2.
We finally compared the FMM and PME scaling behaviors with respect to the number of particles for N = 3000− 30 000 000. To ensure optimal scaling, we determined the proper FMM depth for each system size at p = 8. As can be seen in Figure 15, for both methods we can identify two different slopes with polynomial scaling N ( ) α , where α describes the slope of the curve. For small systems (N < 30 000), α is approximately 0.5. This indicates that with growing N in this region the GPU utilization increases leading to a better scaling behavior than linear. For N > 30 000 both methods achieve α ≈ 1 on the Tesla V100 GPU with 32 GB memory in the entire tested particle range. However, when the RTX 2080Ti GPU with 11 GB memory is used, the scaling begins to worsen already by ≈300 000 particles. Here the FMM scales slightly better (α = 1.02) whereas the PME achieves α = 1.08, indicating a performance decrease due to higher memory requirement. Figure 12. Drift of total energy at typical mixed-precision settings for the periodic salt water system. Dashed black lines show the total (in this case negative) energy drift with PME (Δt = 4 fs, "default" PME parameters as given in Figure 9, default Verlet buffer tolerance of 0.005 kJ/mol/ps). (top) Evolution of total energy with FMM at depth 3 (red) compared to PME (black). (bottom) Absolute drift of total energy derived from a linear fit. At depth d = 3 (encircled numbers), which results in optimal FMM performance for this system, for p ≥ 8, the positive drift component from the FMM does not lead to an increased total drift. Figure 13. GROMACS performance with FMM electrostatics for the 50 675 atom periodic salt water system (left) and for the 108 663 atom aerosol/multidroplet benchmark (right). Encircled numbers indicate FMM tree depth. With p = 8 as indicated by the dashed vertical line, FMM offers an accuracy of the electrostatic interactions that is comparable to the "default" PME parameter set (i.e., r c = 1.0 nm, PME grid spacing s = 0.12 nm, fourth order interpolation, see Figure 9). Benchmarks were run with 20 OpenMP threads on the CPU.
Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article From the FMM standalone tests, also shown in Figure 15, we can clearly see that our FMM is tightly integrated into the GROMACS time stepping over the whole tested N range, as the runtimes of the FMM with GROMACS are not significantly longer than the FMM standalone runtimes.
Furthermore, on the Tesla V100 GPU, with the standalone FMM implementation we were able to run performance tests for an even larger number of particles as, in contrast to PME, where the simulation box size is limited by the available memory, FMM is limited only by the number of particles. Figure 15 shows that standalone FMM scales linearly up to ≈270 000 000 and ≈160 000 000 particles in single and double precision, respectively. The ability to perform efficient doubleprecision calculations on a GPU introduces a new asset as GROMACS is limited to run double-precision simulations only on a CPU.

Comparison to Other FMM Implementations.
Finally, we compared the performance of our implementation with that of GemsFMM, 25 which is another GPU FMM implementation written in CUDA. It uses spherical harmonics for the far field evaluation and p ( ) 4 operators to shift and transform the moments. Unfortunately, we were unable to find any other complete GPU FMM implementations (i.e., that compute both the far field and the near field) that can be tested and provide verifiable results. Figure 16 compares FMM runtimes for particle numbers N = 10 4 −10 7 . The optimal depth for each N was chosen separately for each implementation to ensure optimal performance. Both implementations show a linear scaling with respect to N. The FMM implementation described in this work outperforms GemsFMM by a factor of 5.5 to 13.

CONCLUSIONS AND OUTLOOK
Here we have assessed the accuracy and performance of our GPU FMM described in detail by Kohnke et al. 34 We demonstrated that our implementation provides correct electrostatic energies and forces for single and double numerical precisions by comparison to high-precision reference solutions for open and periodic systems. Using benchmark systems of various sizes and compositions, ranging from 3000 to 286 000 000 particles, we measured and compared Figure 14. FMM versus PME performance in GROMACS for salt water (top) and multidroplet (bottom) benchmarks. Settings were chosen such that PME and FMM yield similar accuracies of electrostatic forces as well as comparable energy drifts. FMM used p = 8 and d = 3, whereas PME used the "default" parameter set (r c = 1.0 nm, PME grid spacing s = 0.12 nm, fourth order interpolation, see Figure 9). For the multidroplet system, for optimal PME performance, both r c and s were scaled by a factor of 2.943, which leaves the PME accuracy essentially unchanged. Figure 15. FMM and PME scaling with respect to system size N for up to 268 million charges. Benchmarks were run on an NVIDIA Tesla V100 GPU with 32 GB RAM (solid lines) and on an RTX 2080Ti GPU (dashed lines). Blue (single precision) and orange (double precision) colors denote FMM standalone timings for the random charge benchmark (left scale) with depths d = 1−6 (encircled numbers) and multipole order p = 8, whereas the lower and upper boundaries of the shaded regions indicate timings for p = 7 and p = 9. Gray and dark blue lines show wall clock time per MD step (left scale) and resulting GROMACS performance (right scale) for PME (gray stars) and FMM (blue circles) for water boxes of different sizes. GROMACS benchmarks were run on a 10-core E5-2630v4 node with RTX 2080Ti GPU (dashed lines) and on a 20-core Xeon Gold 6148F with V100 GPU (solid lines) with all nonbonded interactions offloaded to the GPU. Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article FMM and PME performances in GROMACS on up-to-date GPU models. As a prerequisite to calculating Coulomb interactions in MD simulations with the FMM, as well as for a proper performance comparison, we have determined the FMM parameters that yield results as accurate as those with typical PME settings. For a representative biomolecular simulation system of about 50 000 particles in size, a multipole order of 7 yields a similar accuracy for the Coulomb energies and forces as standard PME parameters in a mixed-precision simulation. The error distribution for the Coulomb forces is comparable for both solvers. Limiting the energy drift to the level present in a standard PME simulation requires raising the multipole order to about 8.
For typical biomolecular systems (proteins in solution) of up to 30 million particles in size, the GROMACS 2019 performance with our CUDA FMM is about a third of that with PME on a single GPU node. However, for systems with larger dimensions and nonuniform particle distributions, such as our ≈100 000 atom aerosol/multidroplet example, FMM easily outperforms PME already at small particle numbers. Here, the huge memory requirements for the FFT grid become the limiting factor for PME.
GemsFMM is a completely independent FMM implementation, which also runs exclusively on the GPU and which uses the same operators for the far field evaluations as our implementation. Our GPU FMM outperforms GemsFMM by a factor of about 8. Unfortunately, further comparisons were not possible because we did not find additional ready-to-use FMM codes that provide verifiable results.
One of the drawbacks of the FMM is that it does not intrinsically allow for noncubic simulation boxes with periodic boundaries. For noncubic boxes the governing octree structure of the FMM would have to be redesigned, 60 requiring further optimizations if the level of achieved performance is to be maintained. Moreover, for typical biomolecular simulation systems of proteins in solution, the single-node GPU FMM is still slower than the highly optimized GROMACS GPU PME implementation; however, single-node GPU FMM can handle larger particle systems and larger simulation boxes.
One of the advantages of FMM electrostatics over PME is that also open boundaries can be handled; however, the FMM's main strength will become apparent on larger exascale clusters of GPU nodes, where PME scaling breaks down due to its inherent communication bottleneck. In combination with the demonstrated high single-GPU performance of this implementation, the performance of a parallelized FMM should eventually beat that of PME. For large sparse systems, FMM already outperforms PME on a single GPU. Additionally, due to FMM's flexible octree structure that allows one to easily evaluate local energy differences, λ-dynamics calculations, as needed for MD simulations at constant pH, can be implemented without much computational overhead. 32 The next step toward higher FMM performance will be a parallel implementation for multiple GPUs. As the FMM communication requirement is small compared to that of PME, we expect a parallelized FMM to scale significantly better than PME with the number of GPUs. Additionally, harnessing new CUDA programming features such as persistent threads and CUDA graphs should be beneficial also for single-node GPU performance. Considering large sparse systems, additional optimizations should yield even more speedup because the current implementation was only slightly adopted to handle nonuniform particle distributions.