[gmx-users] Production run error

Tsjerk Wassenaar tsjerkw at gmail.com
Wed May 18 09:18:58 CEST 2016


Hi Sanket,

The problem is that a charge group moved too far between two domain
decomposition steps.

Seriously, we can't say more than that, unless you tell us more about the
system and how you got to the point where you are.
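
As a rough first look, you could check what the system was doing just
before the crash. A minimal sketch, assuming the default file names
produced by the "-deffnm filename" command quoted below:

  # Plot energy terms up to the crash (select e.g. Temperature and
  # Pressure at the prompt); sudden jumps suggest an instability
  # rather than a one-off glitch.
  g_energy -f filename.edr -o energy.xvg

  # Verify the trajectory file is intact up to the last written frame.
  gmxcheck -f filename.xtc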

Cheers,

Tsjerk
On May 18, 2016 8:44 AM, "Sanket Ghawali" <sanket.ghawali at gmail.com> wrote:

> Dear all,
>
> I'm performing a 100 ns production run. It runs well up to 88 ns but
> then stops with the following error message:
>
> Program mdrun_mpi, VERSION 4.6.5
> Source code file: /root/data/gromacs-4.6.5/src/mdlib/domdec.c, line: 4412
>
> Fatal error:
> A charge group moved too far between two domain decomposition steps
> This usually means that your system is not well equilibrated
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
> "Come on boys, Let's push it hard" (P.J. Harvey)
>
> Error on node 25, will try to stop all the nodes
> Halting parallel program mdrun_mpi on CPU 25 out of 48
>
> [The same fatal error, quote, and halt messages were repeated for
> nodes 37 and 38.]
>
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 38 in communicator MPI_COMM_WORLD
> with errorcode -1.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> [compute-0-2.local][[20037,1],46][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [Equivalent "Connection reset by peer (104)" messages followed from
> ranks 26, 13, 33, 45, 24, and 36.]
> [bicamp.bicnirrh.res.in:13287] 2 more processes have sent help message
> help-mpi-api.txt / mpi-abort
> [bicamp.bicnirrh.res.in:13287] Set MCA parameter
> "orte_base_help_aggregate" to 0 to see all help / error messages
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 38 with PID 5651 on
> node compute-0-2 exiting improperly. There are two reasons this could
> occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here)
>
> The command used is:
>
>   mpirun -np 48 -hostfile host /share/apps/gromacs/bin/mdrun_mpi \
>       -v -deffnm filename
>
> I checked all the files and everything seems to be OK; I have used
> the same parameters for other simulations and they worked out well.
>
> Does anyone know what might be the problem?
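
If the inputs really are fine and this was a one-off instability or a
node/network hiccup (the "Connection reset by peer" messages just show
the other MPI ranks reacting to the abort), a minimal sketch of
continuing from the last checkpoint, assuming the -deffnm naming from
your command, is:

  # mdrun writes filename.cpt periodically; -cpi resumes from it and,
  # by default in 4.6, appends to the existing output files.
  mpirun -np 48 -hostfile host /share/apps/gromacs/bin/mdrun_mpi \
      -v -deffnm filename -cpi filename.cpt

If the domain decomposition error recurs at the same point, the system
itself is likely becoming unstable there, and the trajectory around
88 ns is worth inspecting.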