[gmx-users] Excessive and gradually increasing memory usage with OpenCL

Mark Abraham mark.j.abraham at gmail.com
Thu Mar 29 01:01:57 CEST 2018


Hi,

Our own installation guide does advise against OpenCL on NVIDIA hardware,
and it also hints that host-compiler compatibility depends on the CUDA
version, but I think we could make the latter clearer.

Last time we looked at OpenCL performance on NVIDIA, the GPU kernels
seemed to always run synchronously, providing no overlap with CPU tasks,
so the advice Szilárd gave applies mainly to the CUDA case. By far the
best tuning opportunity is to arrange to use CUDA instead.
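
A minimal sketch of how such a CUDA build might be configured, assuming
the CUDA 8.0 toolkit from your log and an older GCC installed at a
hypothetical /path/to/gcc-5 (CUDA 8.0 tops out around GCC 5, if I recall
correctly); please check the install guide for the exact variable names
for your versions:

  # rough sketch, not tested on your cluster: point nvcc at an older GCC
  cmake .. \
    -DGMX_GPU=ON \
    -DCUDA_TOOLKIT_ROOT_DIR=/apps/lib-osver/cuda/8.0.61 \
    -DCUDA_HOST_COMPILER=/path/to/gcc-5/bin/gcc \
    -DCMAKE_INSTALL_PREFIX=/data/albertmaolab/software/gromacs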

Mark

On Thu, Mar 29, 2018, 00:17 Albert Mao <albert.mao at gmail.com> wrote:

> Thank you for this workaround!
>
> Just setting the GMX_DISABLE_GPU_TIMING environment variable has
> allowed mdrun to progress for several million steps. Memory usage is
> still high, at about 1 GB of resident memory and 26 GB of swap, but it
> no longer appears to increase as the simulation progresses.
>
> I tried 6 ranks x 2 threads as well, but performance was unchanged. I
> think that's because the CPUs are spending their time waiting for the
> GPUs; Mark's suggestion to switch to native CUDA would probably make a
> significant difference here. If this is an important recommendation,
> the GROMACS installation guide should probably link to
> http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html,
> which makes clear that even the latest CUDA release does not come
> close to supporting the latest GCC.
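>
> (For what it's worth, one rough way to see which GCC versions a given
> CUDA toolkit accepts is to look at the version check in its
> host_config.h header. This is just a sketch based on the CUDA 8.0
> layout from my mdrun log; the header sits under include/crt/ in newer
> toolkits, so adjust the path as needed:
>
>   grep -n "__GNUC__" /apps/lib-osver/cuda/8.0.61/include/host_config.h
> )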
>
> -Albert Mao
>
> On Tue, Mar 27, 2018 at 4:43 PM, Szilárd Páll <pall.szilard at gmail.com>
> wrote:
> > Hi,
> >
> > This is an issue I noticed recently, but I thought it was only
> > affecting some use cases (or some runtimes). However, it seems to be a
> > broader problem. It is under investigation, but for now it seems you
> > can eliminate it (or at least strongly diminish its effects) by turning
> > off GPU-side task timing. You can do that by setting the
> > GMX_DISABLE_GPU_TIMING environment variable.
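> >
> > For example, in a bash job script it would look something like this
> > (the mdrun command below is just the one from your original report):
> >
> >   export GMX_DISABLE_GPU_TIMING=1
> >   gmx mdrun -v -pforce 10000 -s blah.tpr -deffnm blah -cpi blah.cpt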
> >
> > Note that this is a workaround that may turn out not to be a complete
> > solution, so please report back once you've done longer runs.
> >
> > Regarding the thread count: the MPI and CUDA runtimes can spawn threads
> > of their own; GROMACS itself certainly used 3 x 4 threads in your case.
> > Note that you will likely get better performance by using 6 ranks x 2
> > threads (both because that avoids ranks spanning sockets and because it
> > allows GPU task/transfer overlap).
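> >
> > A sketch of what that launch could look like, assuming the -gputasks
> > string below (two PP ranks per GPU) is the mapping you want; please
> > double-check the option against gmx mdrun -h for your build:
> >
> >   gmx mdrun -ntmpi 6 -ntomp 2 -gputasks 001122 -v -s blah.tpr -deffnm blah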
> >
> > Cheers,
> > --
> > Szilárd
> >
> >
> > On Tue, Mar 27, 2018 at 4:09 PM, Albert Mao <albert.mao at gmail.com> wrote:
> >> Hello!
> >>
> >> I'm trying to run molecular dynamics on a fairly large system
> >> containing approximately 250000 atoms. The simulation runs well for
> >> about 100000 steps and then gets killed by the queueing engine due to
> >> exceeding the swap space usage limit. The compute node I'm using has
> >> 12 cores in two sockets, three GPUs, and 8 GB of memory. I'm using
> >> GROMACS 2018 and allowing mdrun to delegate the workload
> >> automatically, resulting in three thread-MPI ranks each with one GPU
> >> and four OpenMP threads. The queueing engine reports the following
> >> usage:
> >>
> >> TERM_SWAP: job killed after reaching LSF swap usage limit.
> >> Exited with exit code 131.
> >> Resource usage summary:
> >>     CPU time   :  50123.00 sec.
> >>     Max Memory :      4671 MB
> >>     Max Swap   :     30020 MB
> >>     Max Processes  :         5
> >>     Max Threads    :        35
> >>
> >> Even though it's a large system, by my rough estimate the simulation
> >> should not need much more than 0.5 GB of memory; 4.6 GB seems like too
> >> much, and 30 GB is completely ridiculous. Indeed, running the same
> >> system on a similar node without GPUs works well (but slowly),
> >> consuming about 0.65 GB of memory and 2 GB of swap.
> >>
> >> I also don't understand why 35 threads got created.
> >>
> >> Could there be a memory leak somewhere in the OpenCL code? Any
> >> suggestions on preventing this growth in memory usage would be greatly
> >> appreciated.
> >>
> >> I've included relevant output from mdrun with system and configuration
> >> information at the end of this message. I'm using OpenCL despite
> >> having Nvidia GPUs because of a sad problem where building with CUDA
> >> support fails due to the C compiler being "too new".
> >>
> >> Thanks!
> >> -Albert Mao
> >>
> >> GROMACS:      gmx mdrun, version 2018
> >> Executable:   /data/albertmaolab/software/gromacs/bin/gmx
> >> Data prefix:  /data/albertmaolab/software/gromacs
> >> Command line:
> >>
> >>   gmx mdrun -v -pforce 10000 -s blah.tpr -deffnm blah -cpi blah.cpt
> >>
> >> GROMACS version:    2018
> >> Precision:          single
> >> Memory model:       64 bit
> >> MPI library:        thread_mpi
> >> OpenMP support:     enabled (GMX_OPENMP_MAX_THREADS = 64)
> >> GPU support:        OpenCL
> >> SIMD instructions:  SSE4.1
> >> FFT library:        fftw-3.2.1
> >> RDTSCP usage:       disabled
> >> TNG support:        enabled
> >> Hwloc support:      hwloc-1.5.0
> >> Tracing support:    disabled
> >> Built on:           2018-02-22 07:25:43
> >> Built by:           ahm17 at eris1pm01.research.partners.org [CMAKE]
> >> Build OS/arch:      Linux 2.6.32-431.29.2.el6.x86_64 x86_64
> >> Build CPU vendor:   Intel
> >> Build CPU brand:    Common KVM processor
> >> Build CPU family:   15   Model: 6   Stepping: 1
> >> Build CPU features: aes apic clfsh cmov cx8 cx16 intel lahf mmx msr
> >> nonstop_tsc pcid pclmuldq pdpe1gb popcnt pse sse2 sse3 sse4.1 sse4.2
> >> ssse3
> >> C compiler:         /data/albertmaolab/software/gcc/bin/gcc GNU 7.3.0
> >> C compiler flags:    -msse4.1     -O3 -DNDEBUG -funroll-all-loops
> >> -fexcess-precision=fast
> >> C++ compiler:       /data/albertmaolab/software/gcc/bin/g++ GNU 7.3.0
> >> C++ compiler flags:  -msse4.1    -std=c++11   -O3 -DNDEBUG
> >> -funroll-all-loops -fexcess-precision=fast
> >> OpenCL include dir: /apps/lib-osver/cuda/8.0.61/include
> >> OpenCL library:     /apps/lib-osver/cuda/8.0.61/lib64/libOpenCL.so
> >> OpenCL version:     1.2
> >>
> >> Running on 1 node with total 12 cores, 12 logical cores, 3 compatible GPUs
> >> Hardware detected:
> >>   CPU info:
> >>     Vendor: Intel
> >>     Brand:  Intel(R) Xeon(R) CPU           X5675  @ 3.07GHz
> >>     Family: 6   Model: 44   Stepping: 2
> >>     Features: aes apic clfsh cmov cx8 cx16 htt intel lahf mmx msr
> >> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3
> >> sse4.1 sse4.2 ssse3
> >>   Hardware topology: Full, with devices
> >>     Sockets, cores, and logical processors:
> >>       Socket  0: [   0] [   2] [   4] [   6] [   8] [  10]
> >>       Socket  1: [   1] [   3] [   5] [   7] [   9] [  11]
> >>     Numa nodes:
> >>       Node  0 (25759080448 bytes mem):   0   2   4   6   8  10
> >>       Node  1 (25769799680 bytes mem):   1   3   5   7   9  11
> >>       Latency:
> >>                0     1
> >>          0  1.00  2.00
> >>          1  2.00  1.00
> >>     Caches:
> >>       L1: 32768 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
> >>       L2: 262144 bytes, linesize 64 bytes, assoc. 8, shared 1 ways
> >>       L3: 12582912 bytes, linesize 64 bytes, assoc. 16, shared 6 ways
> >>     PCI devices:
> >>       0000:04:00.0  Id: 8086:10c9  Class: 0x0200  Numa: -1
> >>       0000:04:00.1  Id: 8086:10c9  Class: 0x0200  Numa: -1
> >>       0000:05:00.0  Id: 15b3:6746  Class: 0x0280  Numa: -1
> >>       0000:06:00.0  Id: 10de:06d2  Class: 0x0302  Numa: -1
> >>       0000:01:03.0  Id: 1002:515e  Class: 0x0300  Numa: -1
> >>       0000:00:1f.2  Id: 8086:3a20  Class: 0x0101  Numa: -1
> >>       0000:00:1f.5  Id: 8086:3a26  Class: 0x0101  Numa: -1
> >>       0000:14:00.0  Id: 10de:06d2  Class: 0x0302  Numa: -1
> >>       0000:11:00.0  Id: 10de:06d2  Class: 0x0302  Numa: -1
> >>   GPU info:
> >>     Number of GPUs detected: 3
> >>     #0: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
> >> OpenCL 1.1 CUDA, stat: compatible
> >>     #1: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
> >> OpenCL 1.1 CUDA, stat: compatible
> >>     #2: name: Tesla M2070, vendor: NVIDIA Corporation, device version:
> >> OpenCL 1.1 CUDA, stat: compatible
> >>
> >> (later)
> >>
> >> Using 3 MPI threads
> >> Using 4 OpenMP threads per tMPI thread
> >> On host gpu004.research.partners.org 3 GPUs auto-selected for this run.
> >> Mapping of GPU IDs to the 3 GPU tasks in the 3 ranks on this node:
> >>   PP:0,PP:1,PP:2
> >> Pinning threads with an auto-selected logical core stride of 1
> >> System total charge: 0.000
> >> Will do PME sum in reciprocal space for electrostatic interactions.