Dear GMX Users,<br><div class="gmail_quote">I have been running gromacs-4.0.2 and gromacs-4.0_rc3 in parallel on several 8-core-per-node and 16-core-per-node 64-bit Linux clusters. While MPI runs on a single node (8 or 16 processes, respectively) complete without any problems, larger jobs spanning more than one node invariably crash, either immediately or after several hours of correct simulation output. The nodes seem to stop communicating and stop producing output after some time, even if I use mpirun -q 0. These were mostly small simulations (a 40 angstrom cubic box containing peptide(s) and water). Errors and compile options/details are below. Thanks, Ron Hills<br>
<br><font size="-1">setenv CC icc<br>setenv CXX icc<br>setenv F77 ifort #Intel ifort 10.1.021 or 10.1.017<br>setenv MPICC "mpicc -cc=icc" #using PathScale/QLogic "InfiniPath" InfiniBand MPI, mvapich2-1.2-intel-ofed-1.2.5.5, or mvapich/1.0<br>
setenv MPIF77 "mpif77 -fc=ifort"<br><br></font>
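For completeness, the binaries were built with the stock GROMACS 4.0.x autoconf procedure. The lines below are a sketch of a typical MPI build using the environment above, not a verbatim copy of my configure line; in particular the --prefix path is illustrative:

```shell
# Sketch of a typical GROMACS 4.0.x MPI build (csh syntax, as above).
# --enable-mpi builds the parallel mdrun; "make mdrun" / "make install-mdrun"
# are the standard targets for installing only the MPI-enabled binary.
setenv CC icc
setenv CXX icc
setenv F77 ifort
setenv MPICC "mpicc -cc=icc"
setenv MPIF77 "mpif77 -fc=ifort"
./configure --enable-mpi --program-suffix=_mpi --prefix=$HOME/gromacs-4.0.2
make mdrun
make install-mdrun
```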
> ***immediate error after job submission:<br>
<div>> tr029:36.Hardware problem: {[RXE EAGERTID Memory Parity]}<br>
</div><div>> tr024:14.ips_proto_connect: Couldn't connect to<br>
> 172.17.19.29(LID=0x0025:2.0). Time elapased 00:00:30. Still trying...<br>
> tr025:16.MPID_Key_Init: rank 16 (tr025): Detected Connection timeout:<br>
> 172.17.19.29 (rank 32,33,34,35,36,37,38,39)<br>
<br>> ***termination after 30hrs running correctly on 8x8=64 cores:<br>
> tr019:33.PIO Send Stall after at least 2.10M failed send attempts<br>
> (elapsed=54232.018s, last=2119242.641s, pio_stall_count=1)<br>
> (TxPktCnt=21654082263,RxPktCnt=21662363713) PIO Send Bufs port 1 with 8 bufs<br>
> from<br>
> 8 to 15. PIO avail regs: <0>=(4145041114514105) <1>=(1010545410441100)<br>
> <2>=(15555554) <3>=(0) <4>=(0) <5>=(0) <6>=(0) <7>=(0) . PIO shadow<br>
> regs: <0>=(41505001ebae4050) (err=23)<br>
> mdrun_mpi:14064 terminated with signal 11 at PC=61329f SP=7fbfffcd00.<br>
> Backtrace:<br>
> /uufs/hec.utah.edu/common/vothfs/u0636784/gromacs-4.0_rc3/tlrd/bin/mdrun_mpi[0x61329f]<br>
> MPIRUN.tr012: 26 ranks have not yet exited 60 seconds after rank 37 (node<br>
> tr019) exited without reaching MPI_Finalize().<br>
> MPIRUN.tr012: Waiting at most another 60 seconds for the remaining ranks to<br>
> do a clean shutdown before terminating 26 node processes<br>
<br>> ***error from a coworker:<br>
</div>> mdrun_mpi:27203 terminated with signal 11 at PC=469daa SP=7fbfffe080.<br>
> Backtrace:<br>
<div>> /scratch/tr/zzhang/workgmx/T4L/job/mdrun_mpi(do_pme+0x2f8e)[0x469daa]<br>
> /scratch/tr/zzhang/workgmx/T4L/job/mdrun_mpi(force+0x6be)[0x443d4e]<br>
> /scratch/tr/zzhang/workgmx/T4L/job/mdrun_mpi(do_force+0xb7b)[0x47d3f1]<br>
> /scratch/tr/zzhang/workgmx/T4L/job/mdrun_mpi(do_md+0x19c4)[0x42b360]<br>
> /scratch/tr/zzhang/workgmx/T4L/job/mdrun_mpi(mdrunner+0xc15)[0x4297b5]<br>
> /scratch/tr/zzhang/workgmx/T4L/job/mdrun_mpi(main+0x2ad)[0x42ccd1]<br>
> /lib64/tls/libc.so.6(__libc_start_main+0xdb)[0x2a96a5e40b]<br>
> /scratch/tr/zzhang/workgmx/T4L/job/mdrun_mpi[0x41781a]<br>
> MPIRUN.tr082: 15 ranks have not yet exited 60 seconds after rank 12 (node<br>
> tr086) exited without reaching MPI_Finalize().<br>
> MPIRUN.tr082: Waiting at most another 60 seconds for the remaining ranks to<br>
> do a clean shutdown before terminating 15 node processes<br><font size="-1"><br>***Using mpirun -q 0, I get the following errors after completing 460,000 dynamics steps of error-free output:<br>
tr006:6.PIO Send Stall after at least 2.10M failed send attempts (elapsed=272.699s, last=2462705.468s, pio_stall_count=1) (TxPktCnt=5960586432,RxPktCnt=5963056955) PIO Send Bufs port 3 with 8 bufs from 32 to 39. PIO avail regs: <0>=(1455444101454155) <1>=(4100140514101400) <2>=(45100000) <3>=(0) <4>=(0) <5>=(0) <6>=(0) <7>=(0) . PIO shadow regs: <1>=(405541050145ebff) (err=23)<br>
tr037:39.PIO Send Stall after at least 2.10M failed send attempts (elapsed=278.602s, last=4999304.123s, pio_stall_count=1) (TxPktCnt=61756904051,RxPktCnt=61772810688) PIO Send Bufs port 1 with 8 bufs from 0 to 7. PIO avail regs: <0>=(504400541150401) <1>=(5044450510455544) <2>=(14155155) <3>=(0) <4>=(0) <5>=(0) <6>=(0) <7>=(0) . PIO shadow regs: <0>=(500415154014fbfe) (err=23)</font><br></div>
</div>