I am using GROMACS 3.3.1 and 3.3.2, and I get the following problem with both of them.<br><br>If I run replica exchange on more than 4 processors (2 and 4 are fine), the simulations finish, but MPI gives the following errors, so the job never terminates.
<br><br><br>this is the end of my log file<br><br>-----------------------------------------------------------------------<br><br> NODE (s) Real (s) (%)<br> Time: 158483.430 159636.000 99.3<br>
1d20h01:23<br> (Mnbf/s) (MFlops) (ns/day) (hour/ns)<br>Performance: 18.919 818.029 2.726 8.805<br>p13_15442: p4_error: Timeout in establishing connection to remote process: 0
<br>p12_15407: p4_error: Timeout in establishing connection to remote process: 0<br>Broken pipe<br>p11_2364: p4_error: Timeout in establishing connection to remote process: 0<br>p9_20588: p4_error: Timeout in establishing connection to remote process: 0
<br>p10_2329: p4_error: Timeout in establishing connection to remote process: 0<br>Broken pipe<br>Broken pipe<br>Broken pipe<br>Broken pipe<br>p6_24137: p4_error: Timeout in establishing connection to remote process: 0<br>
p7_24172: p4_error: Timeout in establishing connection to remote process: 0<br>Broken pipe<br>Broken pipe<br><br><br>I have tried installing on three different clusters, using different versions of MPICH, and they all do this. However, I do not get the error when I run a single simulation on 8 processors; I only get this problem when I run replica exchange. Any ideas what is going on? I'm also including my submission script. Perhaps I am missing something, but I'm just not seeing it:
<br><br>#!/bin/bash<br>#<br>#$ -N switch_less<br>#$ -pe mpich 8<br>#$ -cwd<br>#$ -j y<br>#$ -S /bin/bash<br>#<br>#$ -l h_rt=00:05:00<br><br>MPIDIR=/opt/mpich/intel/bin/<br>MDDIR=/soft/linux/pkg/gromacs-3.3.1/bin<br>SYSTEM=free
<br><br><br>INDEX=0<br>for T in 80 82 84 86 87 88 89 90<br>do<br>sed "s/TTTT/$T/g" MDRUN > mdrun.$INDEX.mdp<br><br>$MDDIR/grompp \<br> -f mdrun.$INDEX \<br> -c $SYSTEM.gro \<br> -p $SYSTEM.top \
<br> -po mdout.$INDEX \<br> -o $SYSTEM$INDEX.tpr<br>let "INDEX += 1"<br><br>done<br><br>if test $NSLOTS -eq $INDEX<br>then<br>$MPIDIR/mpirun -v -np $NSLOTS -machinefile $TMPDIR/machines \<br> -nolocal $MDDIR/mdrun-mpi -v \
<br> -np $NSLOTS \<br> -multi $NSLOTS \<br> -replex 50 \<br> -s $SYSTEM.tpr \<br> -o $SYSTEM \<br> -c $SYSTEM.out \<br> -g $SYSTEM \<br> -e $SYSTEM \<br> -x $SYSTEM
<br>else<br><br>echo 'wrong number of nodes for the number of replicas'<br>fi<br><br><br>I have tried using the -debug option when running GROMACS, but I can't tell from its output what is going on. Is there something I should look for in the debug log file?
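<br><br>In case it helps with narrowing this down, here is a minimal sketch for listing which processes reported a p4_error in the captured job output. It assumes the output lines look like the ones in the log above and that the number right after the leading `p` is the process index; the function name `p4_failed_ranks` and the file name in the usage line are hypothetical, not part of GROMACS or MPICH.

```shell
#!/bin/bash
# Hypothetical helper (not part of GROMACS or MPICH): print the distinct
# process indices that reported a p4_error, one per line, in numeric order.
# Assumes lines of the form
#   p13_15442: p4_error: Timeout in establishing connection to remote process: 0
# where the number after the leading "p" is taken to be the process index.
p4_failed_ranks() {
    # $1: path to the captured job output (hypothetical, e.g. the SGE -j y file)
    grep -o 'p[0-9]*_[0-9]*: p4_error' "$1" \
        | sed 's/^p\([0-9]*\)_.*/\1/' \
        | sort -n -u
}
```

Usage would be something like `p4_failed_ranks switch_less.out`, where `switch_less.out` stands in for whatever file the scheduler wrote the job output to. Comparing the printed indices against the machine file might show whether the failing processes sit on particular nodes.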
<br><br>thanks<br><br>-Paul<br>