This is an interactive and animated version of the poster.
A representative method of this approach is the "molecular dynamics simulation". In this method, the motion of the particles is computed from Newton's equations of motion within classical theory.
Let us call the force matrix a matrix with N rows and N columns, where the element in row i and column j contains the value of the force f_ij due to the interaction between particle i and particle j, for a system with N particles.
The calculation of the forces arising from the particles' interactions is a crucial and time-expensive step in any molecular simulation: as the number of particles increases, the number of interactions grows quadratically.
The CUDA technology for NVIDIA graphics cards provides a development interface for writing programs that exploit the computing power of the GPU. The CUDA architecture makes it possible to implement functions (called kernels) that are executed by many threads simultaneously.
The CUDA programming model assumes a system composed of a host (CPU) and a device (GPU), each with its own separate memory.
CUDA threads may access data from multiple memory spaces during their execution, for example:
Accessible only by the thread itself (local memory and registers)
Accessible to all threads within the same block (shared memory)
Accessible to all blocks (global memory)
In CUDA, a kernel (a small program or function) is executed with the aid of threads. The thread is an abstract entity that represents one execution of the kernel. Every thread has an index, which is used for calculating memory address locations and for making control decisions.
Threads are grouped into thread blocks. There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core.
Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks.
We created a set of algorithms to calculate, on the GPU, the force matrix for systems with different numbers of particles. These algorithms were compared with each other and against one that performs the same procedure but is executed on the CPU.
Only two computational optimizations were implemented for the CPU algorithm: a cutoff radius (the distance beyond which the particles are far enough apart to neglect the force between them) and the use of Newton's third law (which halves the number of forces calculated).
All the algorithms that we created to run on the GPU apply the same strategy with respect to the memory spaces available in this technology: inside the kernel, each thread copies the data it needs from global memory to shared memory. Once the threads of a block have finished their work, the results are sent back to global memory. When possible, the input parameters of functions are placed in the constant memory space.
The main difference between the GPU algorithms in this study lies in the shape of the grid that each kernel uses to calculate the force matrix.
Algorithm A uses a 1D grid with 1D blocks, with a single thread reducing one complete row of the force matrix, and applies only the cutoff-radius optimization.
Algorithm A* uses a technique that consists in reusing the same grid as many times as necessary (block-stride technique), which allows launching a smaller grid and avoids the limit on the number of threads.
Algorithm B uses a 2D grid with 2D blocks, with one row of threads in each block reducing two complete rows of the force matrix, and applies both the cutoff-radius and Newton's-third-law optimizations.
Algorithm B* likewise uses a technique that consists in reusing the same grid as many times as necessary (block-stride technique), which allows launching a smaller grid and avoids the limit on the number of threads. In this case, it was possible to apply both the cutoff-radius and Newton's-third-law optimizations.
Algorithm C reuses the idea of algorithm A*, but with a 1D grid of 2D blocks, with one row of threads in each block reducing one complete row of the force matrix.
Algorithm D, still at an experimental stage, uses a grid identical to that of algorithm C, but its blocks were programmed to traverse the force matrix in a staggered way, with the aim of exploiting Newton's third law as an optimization.
The comparison showed that the algorithms that use the block-stride technique are more efficient than those that do not. Our implementation of Newton's third law as an optimization for calculating the force matrix on the GPU achieves a clear advantage in systems with a large number of particles.