Submitted in 11.7.07.
Elastic Constants Calculation Of Solid Argon:
Serial & Parallel Realization.

 Contents: • This site is best viewed in a resulation of 1280x1024 pixels.

1. Introduction:
Our main question is: what is the time improvement when making a simulation on many processors (parallel) compared with one processor (serial).
In this project a serial program was modified into a parallel one for NANCO - The new (June 2007) Russell Berrie Nanotechnology Institute's computer cluster. The original program was taken from Slava Sorkin's Msc. thesis and modified somewhat for describing 3D solid Argon with Lennard Jones (LJ) potential in a NVE/NVT ensemble, Argon has a stable state of a face centered cubic (FCC) with a lattice constant a=5.4 angstrom. During the simulation the program saves xyz coordinates of all the atoms in Aviz form - The Atomic Visualization software. The simulation technique in use is Molecular Dynamics (MD) with a predictor-corrector integrator coupled with Nose-Hoover thermostat. Periodic boundary conditions are assumed and the parallization is done with the Message Passing Interface (MPI) library. The programming language is C.
The program calculates the three independent elastic constants of the FCC lattice (C11, C12, C44) via their relationship with the stress fluctuations.
The main supplement to Slava's code is the parallel computing, the way the calculation gets divided between the given processors. The are 3 main decompositions techniques: atom, force and spatial. The latter was chosen becuase it is the most suitable one for simulations with a large number of atoms.
Due to a dependance of the serial program on Slava's Msc. Thesis, links will be giving to some topics summerized by him.

2. Molecular Dynamics:
 Molecular dynamics is a simulation technique where the time evolution of a set of interacting atoms is followed by integrating their equations of motion. In MD we follow the classical laws of mechanics, the most important one being Newton's second law: where V is the potential energy and i stands for a specific atom. This defines 3N equations of motion. This method is a very powerful tool to attack many-body problems in physics, also allowing the study of specific aspects of complex systems in more detail. It enables us to observe the behaviour of a system when an analytic approach is not possible. A successful MD simulation depends on engineering an interaction potential and giving the initial conditions (i.e initial coordinates and velocities). The potential is usually built from theoretical models or from experimental data which are consistent with parameters that characterize the system in question. After obtaining the set of equations the next step is to integrate them. For this purpose we used in the project the Predictor-Corrector method. A statistical ensemble has to be chosen for controlling thermodynamic quantities, in this project we worked with NVT ensemble coupled to Nose-Hoover thermostat1. Ms is the effective mass of the heat reservoir, it is worth mentioning if Q≡1/Ms=0 we are back to the NVE ensemble. To find the value of Q two steps were taken: first the average Einstein frequency, ωE, was calculated in the NVE ensemble. 2/3 of the Einstein frequency is the inital value of Q. Then the program was run again in the NVT ensemble and a spectrum analysis on the energy was made, if two frequencies were received (the zero frequency and some other frequency) than it meant we got the correct Q. These steps are the basic recipe for defining a MD simulation, the flow chart summarizes the basic MD algorithm. Taken from Amit Kumar's page
Using this technique, it is possible to measure static & dynamics properties such as temperature, pressure, heat transport, relaxation time, elastic constants, phonon spectra and more. Our program calculates the elastic constants of the system, the method is explained here2.

1At the end of the explanation, equation 9.14: the momentum is not divided by 2, it is only divided by mass.
2The difference in our program is that in our model Fi)=0 ⇒ B2=B3=0.

3. Lennard Jones:
 The LJ 12-6 potential for interaction between a pair of atoms is: rij is the distance between atom i and j and in the Argon model ε=0.0104[eV] and σ=3.4[Angstrom]. This potential has an attractive tail at large r, reaches a minimum around 1.122σ, and is strongly repulsive at shorter distance. After passing through 0 at r=σ it increases steeply as r is decreased further. For comparison, on the right, the graph presents the empirical potential of Argon with the LJ potential for Argon, the similarities being obvious. The LJ potential has an infinite range, but for practical applications it is customary to define a cutoff radius rc and disregard the interaction between atoms separated by more than rc. This simplifies the programs and reduces their running time. However this action creates a new problem: whenever a particle pair "crosses" the cutoff distance, the energy makes a little jump. When a many of these events occur it is likely to cause some problems when it comes to energy conservation in simulation. Taken from Wikipedia
The solution is to avoid the jump by shifting the LJ potential, such that it and its first derivative vanish at the cutoff distance:
The cutoff distance in this project was chosen to be rc=2.1σ.

4. NANCO and Message Passing Interface (MPI):
 NANCO is a super computer from SUN/EMET and consists of 64 dual processors, dual core compute nodes (4 cores per node) of 2.2GHz Opterons. It is a Linux cluster with a fast switch based on DDR infiniband Interconnect. Each 2 processors (4 cores) share the same memory - 8GB of RAM. The picture on the right shows an outline of NANCO's architecture. MPI is a library that is widely used for parallel program writing. Users don't have to study a new language, instead they only need to learn new subprograms of the MPI library that can be called from C and Fortran 77 programs. Its popularity and success comes from its ability to hide all the network definitions and communications protocols from normal users and leave them with a few simple subprograms. The foundation of this library is a small group of functions that can be used to achieve parallelism by message passing. A message passing function is simply a function that explicitly transmits data from one process to another, among those subprograms the two most widely popular are sending and recieving data from one processor to another. MPI is a rich and diversified library containing many many subprogram of different kind and function.

5. Spatial decomposition and some more:
Spatial decomposition means that our simulation box is divided by the number of available processors, each "mini-box" does its own calculations and each time step the neighbouring "mini-boxes" communicate to exchange relevant atoms' positions, forces and other summed up quantities like the sum of squares velocities of each slice and etc.
This project breaks down the simulation box into p equal slices. p is some even number which represent the number of processors given for the simulation run. The number of atoms along each direction must divide by p for the proper run of our spatial decomposition algorithm. But this is not enough for the proper execution of the simulation, we must also check that the width of a slice is at least 2rc. If the user's intention was to run the program with smaller number of atoms per slice per processor then the program won't run. It will suggest running the serial version, reducing the number of processors, increasing the number of atoms or lowering rc.
Each time step each processor computes the next coordinates or forces and then they updates their two neighbours (one above and one below) with them. The update stage has two parts, one updates the information clockwise and the second is counter clockwise. For example the clockwise: It checks if the number of the proccesor is even then if it is than it sends an update to the right neighbour while the odd processor receives it, then the processes are reversed, the odd sends and the even receives.
But this is not enough for a full parallel MPI program, at each time step the simulation requires a parameter which consists of a sum of all the velocities, forces etc. Therefore the program also includes an update of these parameters. These parameters divide into two groups: parameters needed only for the primary processor (marked as 0) and parameters needed for each of the processors. No matter the kind of the parameter, each processors sums only its atoms and sends it to the relevant processor (all or to processor 0) to be added to their own parameter. Through the entire program I've tried to avoid double updating (or more) for speeding up the simulation.
The primary processor which its definition is arbitrary, could be any one of the given processors. In this project, as mentioned before, it is the one marked as 0. Between it and the other processors exist one difference: it alone and only it does all the writing of all the data to the hard drive of the computer.
For simplification one can look at it as a circle made of discreet points, each point represents one slice, also the first slice is connected to the last one for mentaining periodical boundaries, as seen in the picture on the right.
A scheme below is an exmaple of a spatial decomposition of 2048 atoms between 4 processors.

6. Programs: Serial & Parallel:
Overall the program are similer to the MD schematic above except that the "Analyze the Data" bubble is inside the loop of "Till Termination Condition".
The program provided here in a zip format: serial version & MPI version.
Both version contains two important files one "initdata.h" and second "aaa_info", each change in the first file automatically imposes the user to compile the program again while any change in the second file does not.
Main choices in "initdata.h" (for this project, the rest is irrelevant):
1. step_EPT_meas: Number of steps for writing the measurements of kinetic energy, potential energy, Einstein frequency, current temperature and current pressure of the system.
2. step_cij_meast: Number of steps for writing the measurements of the three independent elastic constants.
3. size: Number of atoms in each side of our cubic system, for example the figure above has size=16.
4. Atoms: Total number of atoms which is for a FCC lattice 0.5(size)3.
5. lj_eps and lj_sig: The value of the two parameters of the Lennard Jones Potential.
6. cutoff: The cutoff of the Lennard Jones Potential.
7. dt: Time step in picoseconds.
Meaning of each line in "aaa_info" (line 1 is empty):
1. MODE [#]:
1. Start the entire simulation from zero meaning define all the coordinates and velocities.
2. Start the simulation using saved coordinates and velocities in "r.log" and "v.log".
3. Start the simulation from the measurement of elastic constants from the given step0 and "r.log" and "v.log" files.
2. Lattice constant of the FCC lattice.
3. Temperature of the heat bath (NVT ensemble).
4. Number of steps in MODE 1.
5. Number of steps in MODE 2.
6. Number of step, for MODE 0 its value is zero.
7. Percentage done of the simulation.
8. Number of aviz frames in MODE 0 and 1.
9. Number of aviz frames in MODE 2.
10. Value of the Nose Hoover parameter Q.
11. Refresh calculation of elastic constant, 1 [Yes] or 0 [No].
Both programs have basically the same output files:
• info_EPT_T=##K_Atoms=###.txt: The coloums are step number, Einstien frequency, potential energy, kinetic energy, current pressure and finally current temperature.
• info_order_T=##K_Atoms=###.txt: The coloums are step number and the lattice's order parameter.
• Cij_T=##K_Atoms=###.txt: The coloums are step number, C11, C12 and C44.
• lattice_T=##K_Atoms=###.txt: The coloums are step number and the current average lattice constants.
• conf_init.xyz: Initial configuration of the FCC lattice.
• ## and ### stands for the simulation's temperature and number of atoms.
Each version contains a batch file suited for running in the queuing system of NANCO:
• Serial: "qsub -q scalar ./batchserial.sh".
• Parallel: "mpisub N batchmpi.sh" where N is the number of processors.
Important stuff to pay attention to:
• Make sure that the directory written in both batch files is the working directory, default direcotories are "Argon_Serial" and "Argon_MPI" in the user's HOME directory.
• If the program is run on a computer other than NANCO then it is mandatory to compile the program from the beginning by simply executing "make" and in the MPI version in the library "lib", "make" must be also executed.
• The current MPI version has a memory limit on the number of atoms it supported, the maximum this project tried was 1048576 (size=128) when we have 2Gb memory per core.

7. Results:
 This project calculated the speedup of the simulation with the increase of number of processors. At the right and below the speedup and some other relevant parameters are shown. The other parameters are present to validate the similarities between all the different runs (due to the random initial velocities and number of processors). Each job was consisted of 256000 atoms, 20K temperature, Q=0.22, a=5.4, steps in MODE[1]=50000 and steps in MODE[2]=450000. The time step through in all the simulations was 0.005 picoseconds. Clicking on the following links will show the relevant graph at the right: 3 Here is the definition of the order parameter (only the upper part is relevent). It is important to note that becuase the MPI version is written only for parallel computing (at least 2 processors) and the serial version is not compatible with a large number of atoms, the speeedup graph start from 4 processors. For that reason two more simulations were done, similar to previous ones with just two changes: 16384 atoms at 30K temperature. One experiment is with a serial version and the second is with a MPI version with 4 processors. The links following have the same meaning as above: A short movie of Argon at 20K with 256000 atoms processed with 20 cores A short movie of Argon at 30K with 16384 atoms processed with 4 cores
Movies were made with the C-Point's Animator Pro which used the png files created by Aviz from the xyz files created during the simulation.

Here is the MATLAB code I've wrote to create the graphs.

8. Conclusions:
Our main goal was to test NANCO, was it really worth using NANCO for reducing the time needed to run a simulation compared with a regular
computer (serial one)?
This project succeeded in answering this question and more. The simulation of 256000 atoms clearly shows that increasing the number of cores brings down the simulation run time, the optimum number of cores is 16 cores because the ~97.5 minutes difference between it and 20 cores has a small significance for short simulation runs.
All the computed values at the end of the simulation converge to the same values despite the different initial velocities and number of processors.
The calculation of the elastic constant gave the same order of magnitude compare with the experimental one in unit of GPa:

The deviation is probably from the method this program uses to calculated the elastic constants which is not the best one.
The simulation of 16384 atoms shows the improvement in the running time between the two version of the program, one is the serial version and the other the MPI version running on 4 cores. Cleary it shows the effects of the boundaries of each core, the smaller the number of atoms the less speedup we get. This fact is becuase of all the network traffic each core produces to update itself from its nearest neighbours, while in the first simulation of 256000 atoms, each core mainly concentrates on its atoms alone and less on the updates and thus getting higher speedup.
A very important conclusion from this project is that a larger number of atoms gives faster convergence and the accuracy of physical values. This result is quite clear because the more atoms we have the less number of atoms which feel the effects of the periodic boundary conditions.
Although 256000 atoms is a significant achievement, it should be noted that this translates to a sample of only 21.6nm so it is still a nanoscale project. Finally it is worth mentioning that this project has succeded once to run a simulation with 1048576 atoms which is also only a sample of 34.56nm.