[gmx-users] Poor GPU Performance with GROMACS 5.1.4

Mark Abraham mark.j.abraham at gmail.com
Thu May 25 09:51:03 CEST 2017


Hi,

Good. Remember that the job scheduler is a degree of freedom that matters,
so how you used it, and why, would have been worth mentioning the first time
;-) And don't just set your time step to an arbitrary value unless you know
why the integration is still stable at that step size.
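For example, a 3 fs step is usually justified by constraining bonds to
hydrogens (or by using virtual sites). A minimal sketch of the relevant
.mdp settings, with illustrative values rather than a recommendation for
any particular system:

    integrator           = md
    dt                   = 0.003      ; 3 fs
    constraints          = h-bonds    ; removes the fastest bond vibrations
    constraint-algorithm = lincs
    lincs-order          = 4
    lincs-iter           = 1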

Mark

On Thu, May 25, 2017 at 4:48 AM Daniel Kozuch <dkozuch at princeton.edu> wrote:

> I apologize for the confusion, but I found my error. I was not requesting
> the right number of cpus-per-task, so the scheduler was having trouble
> assigning the threads. Speed is now ~400 ns/day with a 3 fs timestep,
> which seems reasonable.
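>
> Roughly, the fix amounts to asking SLURM for the cores that the OpenMP
> threads need; a sketch of the relevant directives (resource numbers
> illustrative, not my exact job script):
>
>     #!/bin/bash
>     #SBATCH --ntasks=1              # one MPI rank
>     #SBATCH --cpus-per-task=6       # cores for the OpenMP threads
>     #SBATCH --gres=gpu:1            # one GPU for this rank
>
>     # let mdrun use exactly what SLURM allocated
>     gmx_gpu mdrun -deffnm my_tpr -ntomp $SLURM_CPUS_PER_TASK -pin on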
>
> Thanks for all the help,
> Dan
>
> On Wed, May 24, 2017 at 9:48 PM, Daniel Kozuch <dkozuch at princeton.edu>
> wrote:
>
> > Szilárd,
> >
> > I think I must be misunderstanding your advice. If I remove the domain
> > decomposition and enable pinning (-pin on) as suggested by Mark, using:
> >
> > gmx_gpu mdrun -deffnm my_tpr -dd 1 -pin on
> >
> > I then get very poor performance and the following note:
> >
> > NOTE: Affinity setting for 6/6 threads failed. This can cause performance
> > degradation.
> >       If you think your settings are correct, ask on the gmx-users list.
> >
> > I am running only one rank with 6 threads (I do not want to use all 28
> > cores on the node, because I hope to run several of these jobs per node
> > in the near future).
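> >
> > (When I do get to several jobs per node, my understanding is that the
> > usual approach is mdrun's -pinoffset/-pinstride so that the runs pin to
> > disjoint cores; an untested sketch, with illustrative offsets:
> >
> >     gmx_gpu mdrun -deffnm job1 -ntomp 6 -pin on -pinstride 1 -pinoffset 0 &
> >     gmx_gpu mdrun -deffnm job2 -ntomp 6 -pin on -pinstride 1 -pinoffset 6 &
> >     wait
> >
> > but that is a question for another day.)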
> >
> > Thanks for the help,
> > Dan
> >
> >
> > -----------------------------------------------------------------------------
> > Log File:
> >
> > GROMACS version:    VERSION 5.1.4
> > Precision:          single
> > Memory model:       64 bit
> > MPI library:        MPI
> > OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 32)
> > GPU support:        enabled
> > OpenCL support:     disabled
> > invsqrt routine:    gmx_software_invsqrt(x)
> > SIMD instructions:  AVX2_256
> > FFT library:        fftw-3.3.4-sse2-avx
> > RDTSCP usage:       enabled
> > C++11 compilation:  disabled
> > TNG support:        enabled
> > Tracing support:    disabled
> > Built on:           Mon May 22 18:29:21 EDT 2017
> > Built by:           dkozuch at tigergpu.princeton.edu [CMAKE]
> > Build OS/arch:      Linux 3.10.0-514.16.1.el7.x86_64 x86_64
> > Build CPU vendor:   GenuineIntel
> > Build CPU brand:    Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> > Build CPU family:   6   Model: 79   Stepping: 1
> > Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> > lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> > rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> > C compiler:         /usr/bin/cc GNU 4.8.5
> > C compiler flags:    -march=core-avx2    -Wextra
> > -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall
> > -Wno-unused -Wunused-value -Wunused-parameter  -O3 -DNDEBUG
> > -funroll-all-loops -fexcess-precision=fast  -Wno-array-bounds
> > C++ compiler:       /usr/bin/c++ GNU 4.8.5
> > C++ compiler flags:  -march=core-avx2    -Wextra
> > -Wno-missing-field-initializers -Wpointer-arith -Wall
> > -Wno-unused-function  -O3 -DNDEBUG -funroll-all-loops
> > -fexcess-precision=fast  -Wno-array-bounds
> > Boost version:      1.53.0 (external)
> > CUDA compiler:      /usr/local/cuda-8.0/bin/nvcc nvcc: NVIDIA (R) Cuda
> > compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on
> > Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44
> > CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_60,code=compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;;;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;-Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;-fexcess-precision=fast;-Wno-array-bounds;
> > CUDA driver:        8.0
> > CUDA runtime:       8.0
> >
> >
> > Number of logical cores detected (28) does not match the number reported
> > by OpenMP (1).
> > Consider setting the launch configuration manually!
> >
> > Running on 1 node with total 28 logical cores, 1 compatible GPU
> > Hardware detected on host tiger-i23g14 (the node of MPI rank 0):
> >   CPU info:
> >     Vendor: GenuineIntel
> >     Brand:  Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
> >     Family:  6  model: 79  stepping:  1
> >     CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt
> > lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd
> > rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> >     SIMD instructions most likely to fit this hardware: AVX2_256
> >     SIMD instructions selected at GROMACS compile time: AVX2_256
> >   GPU info:
> >     Number of GPUs detected: 1
> >     #0: NVIDIA Tesla P100-PCIE-16GB, compute cap.: 6.0, ECC: yes, stat:
> > compatible
> >
> > Using 1 MPI process
> > Using 6 OpenMP threads
> >
> > 1 compatible GPU is present, with ID 0
> > 1 GPU auto-selected for this run.
> > Mapping of GPU ID to the 1 PP rank in this node: 0
> >
> > NOTE: GROMACS was configured without NVML support hence it can not exploit
> >       application clocks of the detected Tesla P100-PCIE-16GB GPU to
> >       improve performance.
> >       Recompile with the NVML library (compatible with the driver used)
> >       or set application clocks manually.
> >
> >
> > Using GPU 8x8 non-bonded kernels
> >
> > Removing pbc first time
> >
> > Overriding thread affinity set outside gmx_514_gpu
> >
> > Pinning threads with an auto-selected logical core stride of 4
> >
> > NOTE: Affinity setting for 6/6 threads failed. This can cause performance
> > degradation.
> >       If you think your settings are correct, ask on the gmx-users list.
> >
> > Initializing LINear Constraint Solver
> >
> >  R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
> >
> > On 1 MPI rank, each using 6 OpenMP threads
> >
> >  Computing:          Num   Num      Call    Wall time         Giga-Cycles
> >                      Ranks Threads  Count      (s)         total sum    %
> > -----------------------------------------------------------------------------
> >  Neighbor search        1    6        201       1.402         20.185   9.4
> >  Launch GPU ops.        1    6       5001       0.216          3.116   1.5
> >  Force                  1    6       5001       1.070         15.402   7.2
> >  PME mesh               1    6       5001       5.538         79.745  37.1
> >  Wait GPU local         1    6       5001       0.072          1.043   0.5
> >  NB X/F buffer ops.     1    6       9801       0.396          5.706   2.7
> >  Write traj.            1    6          2       0.022          0.310   0.1
> >  Update                 1    6       5001       1.683         24.232  11.3
> >  Constraints            1    6       5001       2.488         35.833  16.7
> >  Rest                                           2.031         29.247  13.6
> > -----------------------------------------------------------------------------
> >  Total                                         14.918        214.819 100.0
> > -----------------------------------------------------------------------------
> >  Breakdown of PME mesh computation
> > -----------------------------------------------------------------------------
> >  PME spread/gather      1    6      10002       4.782         68.865  32.1
> >  PME 3D-FFT             1    6      10002       0.654          9.411   4.4
> >  PME solve Elec         1    6       5001       0.024          0.352   0.2
> > -----------------------------------------------------------------------------
> >
> >  GPU timings
> > -----------------------------------------------------------------------------
> >  Computing:                         Count  Wall t (s)      ms/step       %
> > -----------------------------------------------------------------------------
> >  Pair list H2D                        201       0.020        0.099     0.3
> >  X / q H2D                           5001       0.090        0.018     1.5
> >  Nonbonded F kernel                  4800       5.617        1.170    92.8
> >  Nonbonded F+prune k.                 150       0.186        1.240     3.1
> >  Nonbonded F+ene+prune k.              51       0.064        1.257     1.1
> >  F D2H                               5001       0.075        0.015     1.2
> > -----------------------------------------------------------------------------
> >  Total                                          6.052        1.210   100.0
> > -----------------------------------------------------------------------------
> >
> > Force evaluation time GPU/CPU: 1.210 ms/1.321 ms = 0.916
> > For optimal performance this ratio should be close to 1!
> >
> >                Core t (s)   Wall t (s)        (%)
> >        Time:       23.471       14.918      157.3
> >                  (ns/day)    (hour/ns)
> > Performance:       86.893        0.276
> > Finished mdrun on rank 0 Wed May 24 21:36:47 2017
> >

