<html>
<head>
<style>
.hmmessage P
{
margin:0px;
padding:0px
}
body.hmmessage
{
font-size: 10pt;
font-family:Verdana
}
</style>
</head>
<body class='hmmessage'>
Hi,<br><br>This is strange.<br>You are running on 4 nodes and all processes hang in the same MPI call (MPI_Sendrecv inside gmx_sum_qgrid_dd).<br>I see no reason why they should hang if they are all at the correct call.<br><br>After how many steps does this happen?<br>If it is not many, I can check whether it also hangs on our system.<br>Otherwise, could you try to generate a checkpoint file from<br>which it hangs quickly?<br><br>What version of MPI are you using?<br><br>Berk<br><br><br>> Date: Tue, 13 Jan 2009 10:53:25 +0100<br>> From: patrick.fuchs@univ-paris-diderot.fr<br>> To: gmx-users@gromacs.org<br>> Subject: Re: Subject: Re: Re: [gmx-users] Gromacs 4 bug?<br>> <br>> Hi Berk,<br>> I did a test on gromacs-4.0.2 under Fedora 10 (with fftw-3.0.1 and <br>> lam-7.1.4), using a slightly upgraded version of gcc compared to my <br>> previous post (gcc version 4.3.2 20081105 (Red hat 4.3.2-7)) on the same <br>> hardware but it still hangs (so both FC9 and FC10 give the same problem, <br>> while FC8 does not). Finally I could test mdrun_mpi in the debugger and <br>> here are the results of my tests. You were right, it seems that mdrun <br>> hangs at an MPI call, here are the outputs of each xterm:<br>> <br>> XTERM1<br>> ===================================================================<br>> GNU gdb Fedora (6.8-29.fc10)<br>> Copyright (C) 2008 Free Software Foundation, Inc.<br>> License GPLv3+: GNU GPL version 3 or later <br>> <http://gnu.org/licenses/gpl.html><br>> This is free software: you are free to change and redistribute it.<br>> There is NO WARRANTY, to the extent permitted by law. 
Type "show copying"<br>> and "show warranty" for details.<br>> This GDB was configured as "x86_64-redhat-linux-gnu"...<br>> (gdb) run<br>> Starting program: /usr/local/gromacs-4.0.2/bin/mdrun_mpi<br>> [Thread debugging using libthread_db enabled]<br>> [New Thread 0x12df30 (LWP 8285)]<br>> NNODES=4, MYRANK=0, HOSTNAME=cumin.dsimb.inserm.fr<br>> NODEID=0 argc=1<br>> :-) G R O M A C S (-:<br>> <br>> Giant Rising Ordinary Mutants for A Clerical Setup<br>> <br>> :-) VERSION 4.0.2 (-:<br>> <br>> [snip]<br>> <br>> starting mdrun 'Pure DLPC bilayer with 128 lipids and 3655 SPC water'<br>> 5000000 steps, 10000.0 ps.<br>> ^C<br>> Program received signal SIGINT, Interrupt.<br>> 0x0000003b978cc087 in sched_yield () from /lib64/libc.so.6<br>> Missing separate debuginfos, use: debuginfo-install <br>> e2fsprogs-libs-1.41.3-2.fc10.x86_64 glibc-2.9-3.x86_64 <br>> libICE-1.0.4-4.fc10.x86_64 libSM-1.1.0-2.fc10.x86_64 <br>> libX11-1.1.4-6.fc10.x86_64 libXau-1.0.4-1.fc10.x86_64 <br>> libXdmcp-1.0.2-6.fc10.x86_64 libxcb-1.1.91-5.fc10.x86_64<br>> (gdb) where<br>> #0 0x0000003b978cc087 in sched_yield () from /lib64/libc.so.6<br>> #1 0x0000000000770c83 in lam_ssi_rpi_usysv_proc_read_env ()<br>> #2 0x0000000000784a39 in lam_ssi_rpi_usysv_advance_common ()<br>> #3 0x000000000074a1e0 in _mpi_req_advance ()<br>> #4 0x000000000073ced0 in lam_send ()<br>> #5 0x000000000075328e in MPI_Send ()<br>> #6 0x000000000074d7ec in MPI_Sendrecv ()<br>> #7 0x00000000004aebfd in gmx_sum_qgrid_dd ()<br>> #8 0x00000000004b40bb in gmx_pme_do ()<br>> #9 0x0000000000479a58 in do_force_lowlevel ()<br>> #10 0x00000000004d1d32 in do_force ()<br>> #11 0x00000000004214d2 in do_md ()<br>> #12 0x000000000041bea0 in mdrunner ()<br>> #13 0x0000000000422b94 in main ()<br>> (gdb)<br>> ===================================================================<br>> <br>> <br>> XTERM2<br>> ===================================================================<br>> GNU gdb Fedora (6.8-29.fc10)<br>> Copyright (C) 2008 Free Software 
Foundation, Inc.<br>> License GPLv3+: GNU GPL version 3 or later <br>> <http://gnu.org/licenses/gpl.html><br>> This is free software: you are free to change and redistribute it.<br>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"<br>> and "show warranty" for details.<br>> This GDB was configured as "x86_64-redhat-linux-gnu"...<br>> (gdb) run<br>> Starting program: /usr/local/gromacs-4.0.2/bin/mdrun_mpi<br>> [Thread debugging using libthread_db enabled]<br>> [New Thread 0x12df30 (LWP 8294)]<br>> NNODES=4, MYRANK=1, HOSTNAME=cumin.dsimb.inserm.fr<br>> NODEID=1 argc=1<br>> ^C<br>> Program received signal SIGINT, Interrupt.<br>> 0x0000003b978cc087 in sched_yield () from /lib64/libc.so.6<br>> Missing separate debuginfos, use: debuginfo-install <br>> e2fsprogs-libs-1.41.3-2.fc10.x86_64 glibc-2.9-3.x86_64 <br>> libICE-1.0.4-4.fc10.x86_64 libSM-1.1.0-2.fc10.x86_64 <br>> libX11-1.1.4-6.fc10.x86_64 libXau-1.0.4-1.fc10.x86_64 <br>> libXdmcp-1.0.2-6.fc10.x86_64 libxcb-1.1.91-5.fc10.x86_64<br>> (gdb) where<br>> #0 0x0000003b978cc087 in sched_yield () from /lib64/libc.so.6<br>> #1 0x0000000000770c83 in lam_ssi_rpi_usysv_proc_read_env ()<br>> #2 0x0000000000784a39 in lam_ssi_rpi_usysv_advance_common ()<br>> #3 0x000000000074a1e0 in _mpi_req_advance ()<br>> #4 0x000000000073ea90 in MPI_Wait ()<br>> #5 0x000000000074d800 in MPI_Sendrecv ()<br>> #6 0x00000000004aed44 in gmx_sum_qgrid_dd ()<br>> #7 0x00000000004b40bb in gmx_pme_do ()<br>> #8 0x0000000000479a58 in do_force_lowlevel ()<br>> #9 0x00000000004d1d32 in do_force ()<br>> #10 0x00000000004214d2 in do_md ()<br>> #11 0x000000000041bea0 in mdrunner ()<br>> #12 0x0000000000422b94 in main ()<br>> (gdb)<br>> ===================================================================<br>> <br>> <br>> XTERM3<br>> ===================================================================<br>> GNU gdb Fedora (6.8-29.fc10)<br>> Copyright (C) 2008 Free Software Foundation, Inc.<br>> License GPLv3+: GNU GPL version 3 or later 
<br>> <http://gnu.org/licenses/gpl.html><br>> This is free software: you are free to change and redistribute it.<br>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"<br>> and "show warranty" for details.<br>> This GDB was configured as "x86_64-redhat-linux-gnu"...<br>> (gdb) run<br>> Starting program: /usr/local/gromacs-4.0.2/bin/mdrun_mpi<br>> [Thread debugging using libthread_db enabled]<br>> [New Thread 0x12df30 (LWP 8276)]<br>> NNODES=4, MYRANK=2, HOSTNAME=cumin.dsimb.inserm.fr<br>> NODEID=2 argc=1<br>> ^C<br>> Program received signal SIGINT, Interrupt.<br>> 0x0000000000770c70 in lam_ssi_rpi_usysv_proc_read_env ()<br>> Missing separate debuginfos, use: debuginfo-install <br>> e2fsprogs-libs-1.41.3-2.fc10.x86_64 glibc-2.9-3.x86_64 <br>> libICE-1.0.4-4.fc10.x86_64 libSM-1.1.0-2.fc10.x86_64 <br>> libX11-1.1.4-6.fc10.x86_64 libXau-1.0.4-1.fc10.x86_64 <br>> libXdmcp-1.0.2-6.fc10.x86_64 libxcb-1.1.91-5.fc10.x86_64<br>> (gdb) where<br>> #0 0x0000000000770c70 in lam_ssi_rpi_usysv_proc_read_env ()<br>> #1 0x0000000000784a39 in lam_ssi_rpi_usysv_advance_common ()<br>> #2 0x000000000074a1e0 in _mpi_req_advance ()<br>> #3 0x000000000073ced0 in lam_send ()<br>> #4 0x000000000075328e in MPI_Send ()<br>> #5 0x000000000074d7ec in MPI_Sendrecv ()<br>> #6 0x00000000004aed44 in gmx_sum_qgrid_dd ()<br>> #7 0x00000000004b40bb in gmx_pme_do ()<br>> #8 0x0000000000479a58 in do_force_lowlevel ()<br>> #9 0x00000000004d1d32 in do_force ()<br>> #10 0x00000000004214d2 in do_md ()<br>> #11 0x000000000041bea0 in mdrunner ()<br>> #12 0x0000000000422b94 in main ()<br>> (gdb)<br>> ===================================================================<br>> <br>> <br>> XTERM4<br>> ===================================================================<br>> GNU gdb Fedora (6.8-29.fc10)<br>> Copyright (C) 2008 Free Software Foundation, Inc.<br>> License GPLv3+: GNU GPL version 3 or later <br>> <http://gnu.org/licenses/gpl.html><br>> This is free software: you are free to change 
and redistribute it.<br>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"<br>> and "show warranty" for details.<br>> This GDB was configured as "x86_64-redhat-linux-gnu"...<br>> (gdb) run<br>> Starting program: /usr/local/gromacs-4.0.2/bin/mdrun_mpi<br>> [Thread debugging using libthread_db enabled]<br>> [New Thread 0x12df30 (LWP 8267)]<br>> NNODES=4, MYRANK=3, HOSTNAME=cumin.dsimb.inserm.fr<br>> NODEID=3 argc=1<br>> ^C<br>> Program received signal SIGINT, Interrupt.<br>> 0x0000000000770c70 in lam_ssi_rpi_usysv_proc_read_env ()<br>> Missing separate debuginfos, use: debuginfo-install <br>> e2fsprogs-libs-1.41.3-2.fc10.x86_64 glibc-2.9-3.x86_64 <br>> libICE-1.0.4-4.fc10.x86_64 libSM-1.1.0-2.fc10.x86_64 <br>> libX11-1.1.4-6.fc10.x86_64 libXau-1.0.4-1.fc10.x86_64 <br>> libXdmcp-1.0.2-6.fc10.x86_64 libxcb-1.1.91-5.fc10.x86_64<br>> (gdb) where<br>> #0 0x0000000000770c70 in lam_ssi_rpi_usysv_proc_read_env ()<br>> #1 0x0000000000784a39 in lam_ssi_rpi_usysv_advance_common ()<br>> #2 0x000000000074a1e0 in _mpi_req_advance ()<br>> #3 0x000000000073ea90 in MPI_Wait ()<br>> #4 0x000000000074d800 in MPI_Sendrecv ()<br>> #5 0x00000000004aebfd in gmx_sum_qgrid_dd ()<br>> #6 0x00000000004b40bb in gmx_pme_do ()<br>> #7 0x0000000000479a58 in do_force_lowlevel ()<br>> #8 0x00000000004d1d32 in do_force ()<br>> #9 0x00000000004214d2 in do_md ()<br>> #10 0x000000000041bea0 in mdrunner ()<br>> #11 0x0000000000422b94 in main ()<br>> (gdb)<br>> ===================================================================<br>> <br>> <br>> Cheers,<br>> <br>> Patrick<br>> <br></body>
</html>