ICSTCF2020 Poster format contribution

by Diego I. Garrido-Arguirre and José A. Moreno-Razo


Molecular dynamics

A representative method of this approach is the molecular dynamics simulation. In this method, the motion of the particles is calculated from Newton’s equations of motion within classical mechanics.
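For reference, the equations that a classical molecular dynamics simulation integrates can be written in the usual textbook form (added here for context, not taken from the poster):

    m_i \ddot{\mathbf{r}}_i = \mathbf{F}_i = \sum_{j \neq i} \mathbf{f}_{ij}, \qquad i = 1, \dots, N

where f_ij is the force exerted on particle i by particle j; these pairwise forces are exactly the entries of the force matrix described in the next section.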

[Interactive animation: select the number of particles (2, 4, 8, 16 or 32).]

Force matrix

Let’s call the force matrix a matrix with N rows and N columns, in which the element in row i and column j contains the value of the force fij due to the interaction between particle i and particle j of a system with N particles.

The calculation of the forces produced by the particle interactions is a crucial and computationally expensive step in any molecular simulation: as the number of particles increases, the number of interactions grows quadratically.
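To make the quadratic cost concrete, here is a minimal CPU sketch that fills the force matrix with a plain double loop; pair_force and the 1D positions are hypothetical placeholders for illustration only, not code from the poster:

    #include <vector>

    // Hypothetical pair interaction, used only to make the sketch self-contained.
    float pair_force(float xi, float xj) {
        float r = xj - xi;                         // 1D separation, for simplicity
        return 1.0f / (r * r);                     // toy interaction, no physical meaning
    }

    // Naive O(N^2) construction of the force matrix: N rows times N columns.
    void force_matrix(const std::vector<float>& x, std::vector<float>& f) {
        const int N = static_cast<int>(x.size());
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                f[i * N + j] = (i == j) ? 0.0f : pair_force(x[i], x[j]);
    }

Doubling N quadruples the number of pair evaluations, which is why the rest of the poster focuses on accelerating this step.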

About CUDA

The CUDA technology for NVIDIA graphics cards provides a development interface for writing programs that take advantage of the calculation capacity of the GPU. The CUDA architecture allows functions (called kernels) to be implemented so that they are executed by many threads simultaneously.
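As context for the sections that follow, a minimal generic kernel and its launch look roughly like this (the kernel name, sizes and the scaling operation are illustrative, not part of the poster):

    #include <cuda_runtime.h>

    // Kernel: every thread scales one element of the array.
    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 10;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        int threads = 256;
        int blocks = (n + threads - 1) / threads;        // enough blocks to cover n elements
        scale<<<blocks, threads>>>(d_data, 2.0f, n);     // many threads execute the kernel at once
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }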

CUDA Memory spaces

The CUDA programming model assumes a system composed of a host (the CPU) and a device (the GPU), each with its own separate memory.
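A small sketch of that host/device separation, with explicit copies between the two memories (the sizes and names are illustrative):

    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        const int n = 1024;
        std::vector<float> h_x(n, 1.0f);                 // host (CPU) memory
        float *d_x = nullptr;
        cudaMalloc(&d_x, n * sizeof(float));             // device (GPU) memory

        // Data must be moved explicitly between the two separate memories.
        cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        // ... kernels would operate on d_x here ...
        cudaMemcpy(h_x.data(), d_x, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_x);
        return 0;
    }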

CUDA threads may access data from multiple memory spaces during their execution, for example (a brief kernel sketch illustrating them follows this list):

Local Memory

Accessible only by the thread itself

Shared Memory

Accessible to all threads within the same block

Global Memory

Accessible to all threads across all blocks.
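A brief kernel sketch of how the three spaces listed above typically appear in code (a generic illustration, assuming a block size of TPB threads; it is not one of the poster’s kernels):

    #include <cuda_runtime.h>

    #define TPB 128   // threads per block; launch this kernel with TPB threads per block

    __global__ void memory_spaces(const float *g_in, float *g_out, int n) {
        __shared__ float s_buf[TPB];                     // shared memory: visible to the whole block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = 0.0f;                                  // local variable: private to this thread

        if (i < n) x = g_in[i];                          // global memory: readable by every block
        s_buf[threadIdx.x] = x;
        __syncthreads();                                 // make the shared data visible block-wide

        if (i < n) g_out[i] = s_buf[threadIdx.x];        // global memory: writable by every block
    }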

CUDA Grid

Threads

In CUDA, a kernel (a small program or function) is executed with the aid of threads. A thread is an abstract entity that represents one execution of the kernel. Every thread has an index, which is used for calculating memory address locations and also for taking control decisions.

Blocks

Threads are grouped into thread blocks. There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core.
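The exact limit depends on the device and can be queried at run time (on current NVIDIA GPUs it is typically 1024 threads per block); a minimal query:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);               // properties of device 0
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max block dimensions: %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        return 0;
    }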

Grid

Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks.
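For example, a two-dimensional grid of two-dimensional blocks, like the ones used by some of the algorithms below, can be configured with dim3 (the sizes here are illustrative only):

    #include <cuda_runtime.h>

    __global__ void clear_matrix(float *m, int n) {
        int col = blockIdx.x * blockDim.x + threadIdx.x; // column index within the grid
        int row = blockIdx.y * blockDim.y + threadIdx.y; // row index within the grid
        if (row < n && col < n) m[row * n + col] = 0.0f;
    }

    int main() {
        const int n = 1024;
        float *d_m;
        cudaMalloc(&d_m, n * n * sizeof(float));

        dim3 block(16, 16);                              // a 2D block of 256 threads
        dim3 grid((n + block.x - 1) / block.x,           // a 2D grid with enough blocks
                  (n + block.y - 1) / block.y);          // to cover the whole n x n matrix
        clear_matrix<<<grid, block>>>(d_m, n);
        cudaDeviceSynchronize();

        cudaFree(d_m);
        return 0;
    }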

Our work

We created a set of algorithms to calculate the force matrix on the GPU for systems with different numbers of particles. These algorithms were compared with each other and against another one that performs the same procedure but runs on the CPU.
Only two computational optimizations were implemented for the CPU algorithm: a cutoff radius (the distance beyond which the particles are far enough apart for the force between them to be neglected) and the use of Newton’s third law (which halves the number of forces that must be calculated).
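A minimal sketch of such a CPU reference with both optimizations, assuming a generic toy pair interaction in 1D (the names and the interaction are placeholders, not the actual code behind the poster):

    #include <algorithm>
    #include <vector>

    // CPU reference with the two optimizations mentioned above:
    //   - cutoff radius: pairs farther apart than rc are skipped
    //   - Newton's third law: only j > i is evaluated, and f_ji = -f_ij
    void force_matrix_cpu(const std::vector<float>& x, std::vector<float>& f, float rc) {
        const int N = static_cast<int>(x.size());
        const float rc2 = rc * rc;
        std::fill(f.begin(), f.end(), 0.0f);
        for (int i = 0; i < N; ++i) {
            for (int j = i + 1; j < N; ++j) {            // half of the pairs (third law)
                float r = x[j] - x[i];
                if (r * r > rc2) continue;               // cutoff radius
                float fij = 1.0f / (r * r);              // toy interaction, placeholder only
                f[i * N + j] =  fij;
                f[j * N + i] = -fij;                     // reaction force, no extra evaluation
            }
        }
    }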
All the algorithms that we created to run on the GPU apply the same strategy with respect to the memory spaces available in this technology: inside the kernel, each thread copies the data it needs from global memory to shared memory. Once the threads of a block have finished their work, the results are written back to global memory. Whenever possible, the input parameters of the functions are placed in the constant memory space.
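A sketch of that staging pattern for a generic pair kernel. For brevity it accumulates the total force on each particle instead of storing the full matrix, and the names, tile size and toy interaction are assumptions, not the poster’s kernels:

    #include <cuda_runtime.h>

    #define TPB 128   // tile size = threads per block (illustrative)

    __constant__ float c_rc2;   // cutoff radius squared, placed in constant memory
    // host side: cudaMemcpyToSymbol(c_rc2, &rc2, sizeof(float));

    __global__ void forces_tiled(const float *g_x, float *g_f, int n) {
        __shared__ float s_x[TPB];                       // tile of positions in shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // particle handled by this thread
        float xi = (i < n) ? g_x[i] : 0.0f;
        float fi = 0.0f;

        for (int tile = 0; tile < n; tile += TPB) {
            int j = tile + threadIdx.x;
            s_x[threadIdx.x] = (j < n) ? g_x[j] : 0.0f;  // each thread copies one value from
            __syncthreads();                             // global memory into shared memory

            int limit = min(TPB, n - tile);
            for (int k = 0; k < limit; ++k) {
                float r = s_x[k] - xi;
                if (i < n && tile + k != i && r * r <= c_rc2)
                    fi += 1.0f / (r * r);                // toy interaction, placeholder only
            }
            __syncthreads();                             // wait before the tile is overwritten
        }
        if (i < n) g_f[i] = fi;                          // results go back to global memory
    }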
The main difference between the GPU algorithms in this study lies in the format of the grid used by each kernel to calculate the force matrix.

Algorithm A

Algorithm A uses a 1D grid of 1D blocks, with a single thread reducing a complete row of the force matrix, and applies only the cutoff radius optimization.
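A minimal sketch of that idea (one thread reduces one row, cutoff only); the 1D toy interaction and names are illustrative, not the poster’s implementation:

    #include <cuda_runtime.h>

    // One thread reduces one complete row i of the force matrix: it loops over
    // every column j, applies the cutoff, and accumulates the row sum.
    __global__ void row_per_thread(const float *g_x, float *g_f, int n, float rc2) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // row handled by this thread
        if (i >= n) return;

        float xi = g_x[i];
        float fi = 0.0f;
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            float r = g_x[j] - xi;
            if (r * r > rc2) continue;                   // cutoff radius optimization
            fi += 1.0f / (r * r);                        // toy interaction, placeholder only
        }
        g_f[i] = fi;                                     // reduced row i
    }

    // Launch: a 1D grid of 1D blocks, one thread per particle/row, e.g.
    // row_per_thread<<<(n + 255) / 256, 256>>>(d_x, d_f, n, rc * rc);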

Algorithm A*

Algorithm A* uses a technique that consists of reusing the same grid as many times as necessary (the block-stride technique), which allows a smaller grid to be launched and avoids the problem of thread limits.
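A sketch of the stride idea applied to the kernel above (this is the generic pattern often called a grid-stride loop; the details of the authors’ kernel may differ):

    #include <cuda_runtime.h>

    // The same small grid is reused: each thread handles row i, then jumps ahead
    // by the total number of threads in the grid, until every row is covered.
    __global__ void row_per_thread_stride(const float *g_x, float *g_f, int n, float rc2) {
        int stride = gridDim.x * blockDim.x;             // total threads in the grid
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
            float xi = g_x[i];
            float fi = 0.0f;
            for (int j = 0; j < n; ++j) {
                if (j == i) continue;
                float r = g_x[j] - xi;
                if (r * r > rc2) continue;               // cutoff radius optimization
                fi += 1.0f / (r * r);                    // toy interaction, placeholder only
            }
            g_f[i] = fi;
        }
    }

    // A fixed, modest grid now covers any number of particles, e.g.
    // row_per_thread_stride<<<64, 256>>>(d_x, d_f, n, rc * rc);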

Algorithm B

Algorithm B uses a 2D grid of 2D blocks, with one row of threads of each block working on the reduction of two complete rows of the force matrix, and applies both the cutoff radius and Newton’s third law optimizations.
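The following is not Algorithm B itself; it only illustrates, in a heavily simplified form, how Newton’s third law can be exploited from a 2D grid of 2D blocks: each pair is evaluated once and both symmetric entries of the force matrix are written.

    #include <cuda_runtime.h>

    // Each thread evaluates one pair (i, j) with j > i and fills both the (i, j)
    // and (j, i) entries of the force matrix (Newton's third law).
    __global__ void pair_per_thread(const float *g_x, float *g_fm, int n, float rc2) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (i >= n || j >= n || j <= i) return;          // upper triangle only

        float r = g_x[j] - g_x[i];
        float fij = 0.0f;
        if (r * r <= rc2)                                // cutoff radius optimization
            fij = 1.0f / (r * r);                        // toy interaction, placeholder only
        g_fm[i * n + j] =  fij;
        g_fm[j * n + i] = -fij;                          // reaction force, no extra evaluation
    }

    // Launch with a 2D grid of 2D blocks, e.g.
    // dim3 block(16, 16);
    // dim3 grid((n + 15) / 16, (n + 15) / 16);
    // pair_per_thread<<<grid, block>>>(d_x, d_fm, n, rc * rc);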

Algorithm B*

Algorithm B* also reuses the same grid as many times as necessary (the block-stride technique), which allows a smaller grid to be launched and avoids the problem of thread limits.
In this case it was possible to apply both the cutoff radius and Newton’s third law optimizations.

Algorithm C

Algorithm C reuses the idea of Algorithm A*, but with a 1D grid of 2D blocks, where one row of threads of each block works on the reduction of a complete row of the force matrix.
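As a rough illustration of one block reducing one row (simplified here to a 1D block and without the stride over blocks; Algorithm C itself uses 2D blocks and the block-stride idea):

    #include <cuda_runtime.h>

    #define TPB 128   // threads per block; must be a power of two for this reduction

    // One block reduces one complete row i of the force matrix: each thread
    // accumulates a strided subset of the columns, and the partial sums are then
    // combined with a standard tree reduction in shared memory.
    __global__ void row_per_block(const float *g_x, float *g_f, int n, float rc2) {
        __shared__ float s_partial[TPB];
        int i = blockIdx.x;                              // row handled by this block
        float xi = g_x[i];
        float fi = 0.0f;

        for (int j = threadIdx.x; j < n; j += blockDim.x) {
            if (j == i) continue;
            float r = g_x[j] - xi;
            if (r * r > rc2) continue;                   // cutoff radius optimization
            fi += 1.0f / (r * r);                        // toy interaction, placeholder only
        }
        s_partial[threadIdx.x] = fi;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
            if (threadIdx.x < s) s_partial[threadIdx.x] += s_partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) g_f[i] = s_partial[0];     // reduced row i
    }

    // One block per row: row_per_block<<<n, TPB>>>(d_x, d_f, n, rc * rc);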

Algorithm D

Algorithm D, still at an experimental stage, uses a grid identical to that of Algorithm C, but its blocks were programmed to traverse the force matrix in a staggered way, with the aim of exploiting Newton’s third law as an optimization.

[Interactive figure: choose an algorithm to visualize it.]

Results and conclusions

The comparison showed that the algorithms that use the block-stride technique are more efficient than those that do not. Our implementation of Newton’s third law as an optimization for calculating the force matrix on the GPU achieves a clear advantage in systems with a large number of particles.