[gmx-users] [gmx-developers] About dynamics loading balance

Mon Aug 25 06:55:51 CEST 2014

Please upload them to a file-sharing service on the web (there are lots
that are free-to-use), and paste the link here.

Mark

On Mon, Aug 25, 2014 at 6:07 AM, Yunlong Liu <yliu120 at jhmi.edu> wrote:

> Hi Szilard,
>
> I would like to send you the log file, and I really need your help. Please
> trust me that I have tested this many times: when I turn on DLB, the GPU
> nodes report a "cannot allocate memory" error and shut all MPI processes
> down. I have to tolerate the large load imbalance (50%) to run my
> simulations. I wish I could figure out some way to make my simulation run
> on GPUs with better performance.
>
> Where can I post the log file? If I paste it here, it will be really long.
>
> Yunlong
>
>
> > On Aug 24, 2014, at 2:20 PM, "Szilárd Páll" <pall.szilard at gmail.com> wrote:
> >
> >> On Thu, Aug 21, 2014 at 8:25 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
> >> Hi Roland,
> >>
> >> I just compiled the latest gromacs-5.0 version, released on Jun 29th. I
> >> will recompile it with those flags, as you suggested. It seems that the
> >> high load imbalance doesn't affect the performance, which is weird.
> >
> > How did you draw that conclusion? Please show us log files of the
> > respective runs; that will help to assess what is going on.
> >
> > --
> > Szilárd
> >
> >> Thank you.
> >> Yunlong
> >>
> >>> On 8/21/14, 2:13 PM, Roland Schulz wrote:
> >>>
> >>> Hi,
> >>>
> >>>
> >>>
> >>> On Thu, Aug 21, 2014 at 1:56 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
> >>>
> >>>    Hi Roland,
> >>>
> >>>    The problem I am posting is not caused by trivial errors (like not
> >>>    enough memory), and I think it should be a real bug inside the
> >>>    gromacs-GPU support code.
> >>>
> >>> It is unlikely to be a trivial error, because otherwise someone else
> >>> would have noticed. You could try the release-5-0 branch from git, but
> >>> I'm not aware of any bugfixes related to memory allocation.
> >>> The memory allocation that triggers the error isn't the problem: the
> >>> printed size is reasonable. You could recompile with PRINT_ALLOC_KB (add
> >>> -DPRINT_ALLOC_KB to CMAKE_C_FLAGS) and rerun the simulation. It might
> >>> tell you where the unusually large memory allocation happens.
> >>>
> >>> PS: Please don't reply to an individual Gromacs developer. Keep all
> >>> conversation on the gmx-users list.
> >>>
> >>> Roland
> >>>
> >>>    That is the reason why I post this problem to the developer
> >>>    mailing-list.
> >>>
> >>>    My system contains ~240,000 atoms. It is a rather big protein. The
> >>>    memory information of the node is :
> >>>
> >>>    top - 12:46:59 up 15 days, 22:18, 1 user,  load average: 1.13, 6.27, 11.28
> >>>    Tasks: 510 total,   2 running, 508 sleeping,   0 stopped,   0 zombie
> >>>    Cpu(s):  6.3%us,  0.0%sy,  0.0%ni, 93.7%id,  0.0%wa, 0.0%hi, 0.0%si,  0.0%st
> >>>    Mem:  32815324k total,  4983916k used, 27831408k free,     7984k buffers
> >>>    Swap:  4194296k total,        0k used,  4194296k free,   700588k cached
> >>>
> >>>    I am running the simulation on 2 nodes with 4 MPI ranks, each rank
> >>>    with 8 OpenMP threads. I list the information on their CPU and GPU
> >>>    here:
> >>>
> >>>    c442-702.stampede(1)$ nvidia-smi
> >>>    Thu Aug 21 12:46:17 2014
> >>>    +------------------------------------------------------+
> >>>    | NVIDIA-SMI 331.67     Driver Version: 331.67         |
> >>>    |-------------------------------+----------------------+----------------------+
> >>>    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> >>>    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> >>>    |===============================+======================+======================|
> >>>    |   0  Tesla K20m          Off  | 0000:03:00.0     Off |                    0 |
> >>>    | N/A   22C    P0    46W / 225W |    172MiB /  4799MiB |      0%      Default |
> >>>    +-------------------------------+----------------------+----------------------+
> >>>
> >>>    +-----------------------------------------------------------------------------+
> >>>    | Compute processes:                                               GPU Memory |
> >>>    |  GPU       PID  Process name                                     Usage      |
> >>>    |=============================================================================|
> >>>    |    0    113588 /work/03002/yliu120/gromacs-5/bin/mdrun_mpi            77MiB |
> >>>    |    0    113589 /work/03002/yliu120/gromacs-5/bin/mdrun_mpi            77MiB |
> >>>    +-----------------------------------------------------------------------------+
> >>>
> >>>    c442-702.stampede(4)$ lscpu
> >>>    Architecture:          x86_64
> >>>    CPU op-mode(s):        32-bit, 64-bit
> >>>    Byte Order:            Little Endian
> >>>    CPU(s):                16
> >>>    On-line CPU(s) list:   0-15
> >>>    Thread(s) per core:    1
> >>>    Core(s) per socket:    8
> >>>    Socket(s):             2
> >>>    NUMA node(s):          2
> >>>    Vendor ID:             GenuineIntel
> >>>    CPU family:            6
> >>>    Model:                 45
> >>>    Stepping:              7
> >>>    CPU MHz:               2701.000
> >>>    BogoMIPS:              5399.22
> >>>    Virtualization:        VT-x
> >>>    L1d cache:             32K
> >>>    L1i cache:             32K
> >>>    L2 cache:              256K
> >>>    L3 cache:              20480K
> >>>    NUMA node0 CPU(s):     0-7
> >>>    NUMA node1 CPU(s):     8-15
> >>>
> >>>    I hope this information will help. Thank you.
> >>>
> >>>    Yunlong
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>>    On 8/21/14, 1:38 PM, Roland Schulz wrote:
> >>>>
> >>>>    Hi,
> >>>>
> >>>>    please don't use gmx-developers for user questions. Feel free to
> >>>>    use it if you want to fix the problem, and have questions about
> >>>>    implementation details.
> >>>>
> >>>>    Please provide more details: How large is your system? How much
> >>>>    memory does a node have? On how many nodes do you try to run? How
> >>>>    many mpi-ranks do you have per node?
> >>>>
> >>>>    Roland
> >>>>
> >>>>    On Thu, Aug 21, 2014 at 12:21 PM, Yunlong Liu <yliu120 at jh.edu> wrote:
> >>>>
> >>>>        Hi Gromacs Developers,
> >>>>
> >>>>        I found something about dynamic load balancing really
> >>>>        interesting. I am running my simulation on the Stampede
> >>>>        supercomputer, whose nodes have 16 physical cores (really
> >>>>        16 Intel Xeon cores on one node) and an NVIDIA Tesla K20m
> >>>>        GPU attached.
> >>>>
> >>>>        When I use only the CPUs, I turn on dynamic load
> >>>>        balancing with -dlb yes. It seems to work really well: the
> >>>>        load imbalance is only 1~2%, which improves
> >>>>        the performance by 5~7%. But when I run my simulation on a
> >>>>        GPU-CPU hybrid node (16 CPU cores and 1 GPU), dynamic
> >>>>        load balancing kicks in because the imbalance goes up to ~50%
> >>>>        immediately after loading. Then the system reports a
> >>>>        fail-to-allocate-memory error:
> >>>>
> >>>>        NOTE: Turning on dynamic load balancing
> >>>>
> >>>>
> >>>>        -------------------------------------------------------
> >>>>        Program mdrun_mpi, VERSION 5.0
> >>>>        Source code file:
> >>>>        /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/utility/smalloc.c,
> >>>>        line: 226
> >>>>
> >>>>        Fatal error:
> >>>>        Not enough memory. Failed to realloc 1020720 bytes for
> >>>>        dest->a, dest->a=d5800030
> >>>>        (called from file
> >>>>        /home1/03002/yliu120/build/gromacs-5.0/src/gromacs/mdlib/domdec_top.c,
> >>>>        line 1061)
> >>>>        For more information and tips for troubleshooting, please
> >>>>        check the GROMACS
> >>>>        website at http://www.gromacs.org/Documentation/Errors
> >>>>        -------------------------------------------------------
> >>>>        : Cannot allocate memory
> >>>>        Error on rank 0, will try to stop all ranks
> >>>>        Halting parallel program mdrun_mpi on CPU 0 out of 4
> >>>>
> >>>>        gcq#274: "I Feel a Great Disturbance in the Force" (The
> >>>>        Emperor Strikes Back)
> >>>>
> >>>>        [cli_0]: aborting job:
> >>>>        application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
> >>>>        [c442-702.stampede.tacc.utexas.edu:mpispawn_0][readline]
> >>>>        Unexpected End-Of-File on file descriptor 6. MPI process died?
> >>>>        [c442-702.stampede.tacc.utexas.edu:mpispawn_0][mtpmi_processops]
> >>>>        Error while reading PMI socket. MPI process died?
> >>>>        [c442-702.stampede.tacc.utexas.edu:mpispawn_0][child_handler]
> >>>>        MPI process (rank: 0, pid: 112839) exited with status 255
> >>>>        TACC: MPI job exited with code: 1
> >>>>
> >>>>        TACC: Shutdown complete. Exiting.
> >>>>
> >>>>        So I manually turned off dynamic load balancing with -dlb
> >>>>        no. The simulation runs through with a very high load
> >>>>        imbalance, like:
> >>>>
> >>>>        DD  step 139999 load imb.: force 51.3%
> >>>>
> >>>>                   Step           Time         Lambda
> >>>>                 140000      280.00000        0.00000
> >>>>
> >>>>           Energies (kJ/mol)
> >>>>                    U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
> >>>>            4.88709e+04    1.21990e+04    2.99128e+03   -1.46719e+03    1.98569e+04
> >>>>             Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
> >>>>            2.54663e+05    4.05141e+05   -3.16020e+04   -3.75610e+06    2.24819e+04
> >>>>              Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
> >>>>           -3.02297e+06    6.15217e+05   -2.40775e+06    3.09312e+02   -2.17704e+02
> >>>>         Pressure (bar)   Constr. rmsd
> >>>>           -3.39003e+01    3.10750e-05
> >>>>
> >>>>        DD  step 149999 load imb.: force 60.8%
> >>>>
> >>>>                   Step           Time         Lambda
> >>>>                 150000      300.00000        0.00000
> >>>>
> >>>>           Energies (kJ/mol)
> >>>>                    U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
> >>>>            4.96380e+04    1.21010e+04    2.99986e+03   -1.51918e+03    1.97542e+04
> >>>>             Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
> >>>>            2.54305e+05    4.06024e+05   -3.15801e+04   -3.75534e+06    2.24001e+04
> >>>>              Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
> >>>>           -3.02121e+06    6.17009e+05   -2.40420e+06    3.10213e+02   -2.17403e+02
> >>>>         Pressure (bar)   Constr. rmsd
> >>>>           -1.40623e+00    3.16495e-05
> >>>>
> >>>>        I think this high load imbalance costs more than 20%
> >>>>        of the performance, but at least it lets the simulation
> >>>>        run. Therefore, the problem I would like to report is that
> >>>>        when running a simulation on GPU-CPU hybrid nodes with very few
> >>>>        GPUs, dynamic load balancing causes domain
> >>>>        decomposition problems (fail-to-allocate-memory). Is there
> >>>>        currently any solution to this problem, or is there anything
> >>>>        that could be improved?
> >>>>
> >>>>        Yunlong
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>        --
> >>>>        ========================================
> >>>>        Yunlong Liu, PhD Candidate
> >>>>        Computational Biology and Biophysics
> >>>>        Department of Biophysics and Biophysical Chemistry
> >>>>        School of Medicine, The Johns Hopkins University
> >>>>        Email: yliu120 at jhmi.edu <mailto:yliu120 at jhmi.edu>
> >>>>
> >>>>        Address: 725 N Wolfe St, WBSB RM 601, 21205
> >>>>        ========================================
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>    --     ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
> >>>>    <http://cmb.ornl.gov>
> >>>>    865-241-1537 <tel:865-241-1537>, ORNL PO BOX 2008 MS6309
> >>>
> >>>
> >>>    --
> >>>    ========================================
> >>>    Yunlong Liu, PhD Candidate
> >>>    Computational Biology and Biophysics
> >>>    Department of Biophysics and Biophysical Chemistry
> >>>    School of Medicine, The Johns Hopkins University
> >>>    Email: yliu120 at jhmi.edu <mailto:yliu120 at jhmi.edu>
> >>>
> >>>    Address: 725 N Wolfe St, WBSB RM 601, 21205
> >>>    ========================================
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> ORNL/UT Center for Molecular Biophysics cmb.ornl.gov <http://cmb.ornl.gov>
> >>> 865-241-1537 <tel:865-241-1537>, ORNL PO BOX 2008 MS6309
> >>
> >>
> >> --
> >>
> >> ========================================
> >> Yunlong Liu, PhD Candidate
> >> Computational Biology and Biophysics
> >> Department of Biophysics and Biophysical Chemistry
> >> School of Medicine, The Johns Hopkins University
> >> Email: yliu120 at jhmi.edu
> >> Address: 725 N Wolfe St, WBSB RM 601, 21205
> >> ========================================
> >>
> >> --
> >> Gromacs Users mailing list
> >>
> >> * Please search the archive at
> >> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!
> >>
> >> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
> >>
> >> * For (un)subscribe requests visit
> >> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> >> send a mail to gmx-users-request at gromacs.org.
>
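
For anyone who wants to summarize the load imbalance from a long md.log rather than paste the whole file, here is a minimal Python sketch. It assumes the log contains lines in the same form as the "DD  step ... load imb.: force ...%" lines quoted in this thread; the function name load_imbalance and the sample text are only illustrative and are not part of GROMACS.

```python
import re

# Matches lines such as "DD  step 139999 load imb.: force 51.3%".
# The pattern is not anchored, so quoted lines ("> >>>> DD  step ...") match too.
DD_LINE = re.compile(r"DD\s+step\s+(\d+)\s+load imb\.:\s+force\s+([\d.]+)%")


def load_imbalance(log_text):
    """Return a list of (step, force imbalance in percent) pairs."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in DD_LINE.finditer(log_text)]


# Sample text copied from the log excerpt quoted in this thread.
sample = """\
DD  step 139999 load imb.: force 51.3%
           Step           Time         Lambda
         140000      280.00000        0.00000
DD  step 149999 load imb.: force 60.8%
"""

records = load_imbalance(sample)
mean_imbalance = sum(pct for _, pct in records) / len(records)

for step, pct in records:
    print(f"step {step}: force imbalance {pct:.1f}%")
print(f"mean force imbalance over {len(records)} reports: {mean_imbalance:.2f}%")
```

To use it on a real run, read the log file into a string (for example with open("md.log").read()) and pass that to load_imbalance. Comparing the result for a -dlb no run against a CPU-only -dlb yes run gives a compact number to quote on the list alongside a link to the full log.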