[gmx-users] Re: GROMACS 4.6.7 not running on more than 16 MPI threads

Agnivo Gosai agnivogromacs14 at gmail.com
Sat Feb 28 00:29:02 CET 2015


The following problem is still there:
Number of CPUs detected (16) does not match the number reported by OpenMP
(1).
Consider setting the launch configuration manually!

The above message always appears and I am not sure how to set the launch
configuration. It shows up even when I set the number of OpenMP threads
manually.
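For what it is worth, as far as I understand it, setting the launch
configuration manually means fixing both the number of MPI ranks and the
number of OpenMP threads per rank explicitly. On our cluster that would
look roughly like this (the mpirun wrapper and the tpr name are from my
setup and may differ elsewhere):

    # 16 MPI ranks with 1 OpenMP thread each on a 16-core node
    export OMP_NUM_THREADS=1
    mpirun -np 16 mdrun_mpi -ntomp 1 -s pull1.tpr -deffnm pull1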

I ran my system using only MPI with automatic PME node selection. I found
that for nodes = 16 and ppn = 16 there was an issue with domain decomposition.

Message from the standard output and error files of the job script:
There is no domain decomposition for 224 nodes that is compatible with the
given box and a minimum cell size of 1.08875 nm
Change the number of nodes or mdrun option -rcon or -dds or your LINCS
settings
Look in the log file for details on the domain decomposition

turning all bonds into constraints...
turning all bonds into constraints...
turning all bonds into constraints...
turning all bonds into constraints...
turning all bonds into constraints...
turning all bonds into constraints...
Largest charge group radii for Van der Waals: 0.039, 0.039 nm
Largest charge group radii for Coulomb:       0.078, 0.078 nm
Calculating fourier grid dimensions for X Y Z
Using a fourier grid of 64x64x128, spacing 0.117 0.117 0.117

The log file was incomplete as the simulation crashed.
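In case it matters, that job used all 256 cores (16 nodes x 16 cores per
node), of which mdrun apparently assigned 224 to particle-particle work
after picking the PME nodes automatically. The launch line in the job
script was essentially:

    mpirun -np 256 mdrun_mpi -s pull1.tpr -deffnm pull1

As I read the error, the alternative to changing the node count would be
to relax the decomposition limits with -rcon or -dds, for example (the
value here is only a guess and I have not checked whether it is safe with
my LINCS settings):

    mpirun -np 256 mdrun_mpi -s pull1.tpr -deffnm pull1 -rcon 0.9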

Then I fiddled with the node numbers and found that with nodes = 10 and
ppn = 16, mdrun_mpi ran successfully.
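For reference, the working job is requested and launched essentially like
this (a PBS/Torque-style sketch; the exact wrapper may differ on another
cluster):

    #PBS -l nodes=10:ppn=16
    mpirun -np 160 mdrun_mpi -s pull1.tpr -deffnm pull1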

This is part of the log file:
Average load imbalance: 36.5 %
 Part of the total run time spent waiting due to load imbalance: 8.7 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 3 % Z 17 %
 Average PME mesh/force load: 0.847
 Part of the total run time spent waiting due to PP/PME imbalance: 1.4 %

NOTE: 8.7 % of the available CPU time was lost due to load imbalance
      in the domain decomposition.


               Core t (s)   Wall t (s)        (%)
       Time:   212092.050     1347.167    15743.6
                 (ns/day)    (hour/ns)
Performance:       32.067        0.748

..................................................................................................................................

Then I set OpenMP threads = 8. (I had to switch to the Verlet cutoff
scheme.)
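Concretely, the change was roughly the following; the mpirun wrapper and
file names are again from my setup. I set cutoff-scheme = Verlet in the
.mdp and relaunched with:

    export OMP_NUM_THREADS=8
    mpirun -np 160 mdrun_mpi -ntomp 8 -s pull1.tpr -deffnm pull1

This is what happened: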

Number of CPUs detected (16) does not match the number reported by OpenMP
(1).
Consider setting the launch configuration manually!
Reading file pull1.tpr, VERSION 4.6.7 (double precision)
The number of OpenMP threads was set by environment variable
OMP_NUM_THREADS to 8 (and the command-line setting agreed with that)

Will use 144 particle-particle and 16 PME only nodes
This is a guess, check the performance at the end of the log file
Using 160 MPI processes
Using 8 OpenMP threads per MPI process
..........................................................................................................................................

NOTE: 9.9 % performance was lost because the PME nodes
      had more work to do than the PP nodes.
      You might want to increase the number of PME nodes
      or increase the cut-off and the grid spacing.


NOTE: 11 % of the run time was spent in domain decomposition,
      9 % of the run time was spent in pair search,
      you might want to increase nstlist (this has no effect on accuracy)


               Core t (s)   Wall t (s)        (%)
       Time:   319188.130     2019.349    15806.5
                         33:39
                 (ns/day)    (hour/ns)
Performance:       21.393        1.122

So the hybrid setup gives me worse performance.
For now I am running the jobs with MPI only.
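From the notes above, I suppose the things to try next with the hybrid
setup would be a larger nstlist in the .mdp (say nstlist = 40, which should
not affect accuracy with the Verlet scheme) and more PME nodes, along the
lines of (the values are guesses, not tested):

    mpirun -np 160 mdrun_mpi -ntomp 8 -npme 32 -s pull1.tpr -deffnm pull1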

Any suggestions on improving the performance?



Thanks & Regards
Agnivo Gosai
Grad Student, Iowa State University.

