<HTML>
<HEAD>
<META content="text/html; charset=big5" http-equiv=Content-Type>
<META content="OPENWEBMAIL" name=GENERATOR>
</HEAD>
<BODY bgColor=#ffffff>
<pre>Hi,
<br />
<br />I tried to use 4 and 8 CPUs.
<br />There are about 6000 atoms in my system.
<br />The interconnect between our nodes is 1 Gb Ethernet, not optical fiber.
<br />
<br />I'm sorry for my poor English; I didn't express my question well.
<br />Every time I submit a parallel job, the nodes assigned to me are already at 100% load,
<br />and the CPU share available to me is less than 10%.
<br />I think there is something wrong with my submit script or executable scripts,
<br />which I posted in my previous message.
<br />How should I correct my scripts?
<br />
<br />Hsin-Lin
<br /></pre>
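One likely reason for landing on already-loaded nodes is that nothing in the condor_mpi submit file below steers the match away from busy machines. As a hedged sketch (LoadAvg is a standard machine-ClassAd attribute, but verify the names on your pool with condor_status -long), constraints like these could be added to the submit file:

```
# Hypothetical additions to the condor_mpi submit file:
# refuse machines whose load average is already high, and
# prefer the least-loaded machines among those that remain.
Requirements = (LoadAvg < 0.3)
Rank         = -LoadAvg
```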
<br /><font size="2">>
Hi,
<br />>
<br />>
how many CPUs are you trying to use? How big is your system? What kind of
<br />>
interconnect? Since you use Condor, it is probably a fairly slow interconnect,
<br />>
so you can't expect it to scale to many CPUs. If you want to use many CPUs
<br />>
for MD, you need a faster interconnect.
<br />>
<br />>
Roland
<br />>
<br />>
2010/4/2 Hsin-Lin Chiang <jiangsl@phys.sinica.edu.tw>
<br />>
<br />>
>  Hi,
<br />>
>
<br />>
> Does anyone use GROMACS, LAM, and Condor together here?
<br />>
> I use GROMACS with LAM/MPI on a Condor system.
<br />>
> Every time I submit a parallel job,
<br />>
> I get nodes that are already occupied, and the performance of each CPU is
<br />>
> below 10%.
<br />>
> How should I change the script?
<br />>
> Below is one submit script and two executable scripts.
<br />>
>
<br />>
> condor_mpi:
<br />>
> ----
<br />>
> #!/bin/bash
<br />>
> Universe = parallel
<br />>
> Executable = ./lamscript
<br />>
> machine_count = 8
<br />>
> output = md_$(NODE).out
<br />>
> error = md_$(NODE).err
<br />>
> log = md.log
<br />>
> arguments = /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md.sh
<br />>
> +WantIOProxy = True
<br />>
> should_transfer_files = yes
<br />>
> when_to_transfer_output = on_exit
<br />>
> Queue
<br />>
> -------
<br />>
>
<br />>
> lamscript:
<br />>
> -------
<br />>
> #!/bin/sh
<br />>
>
<br />>
> _CONDOR_PROCNO=$_CONDOR_PROCNO
<br />>
> _CONDOR_NPROCS=$_CONDOR_NPROCS
<br />>
> _CONDOR_REMOTE_SPOOL_DIR=$_CONDOR_REMOTE_SPOOL_DIR
<br />>
>
<br />>
> SSHD_SH=`condor_config_val libexec`
<br />>
> SSHD_SH=$SSHD_SH/sshd.sh
<br />>
>
<br />>
> CONDOR_SSH=`condor_config_val libexec`
<br />>
> CONDOR_SSH=$CONDOR_SSH/condor_ssh
<br />>
>
<br />>
> # Set this to the bin directory of your lam installation
<br />>
> # This also must be in your .cshrc file, so the remote side
<br />>
> # can find it!
<br />>
> export LAMDIR=/stathome/jiangsl/soft/lam-7.1.4
<br />>
> export PATH=${LAMDIR}/bin:${PATH}
<br />>
> export LD_LIBRARY_PATH=/lib:/usr/lib:$LAMDIR/lib:.:/opt/intel/compilers/lib
<br />>
>
<br />>
>
<br />>
> . $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
<br />>
>
<br />>
> # If not the head node, just sleep forever, to let the
<br />>
> # sshds run
<br />>
> if [ $_CONDOR_PROCNO -ne 0 ]
<br />>
> then
<br />>
>                 wait
<br />>
>                 sshd_cleanup
<br />>
>                 exit 0
<br />>
> fi
<br />>
>
<br />>
> EXECUTABLE=$1
<br />>
> shift
<br />>
>
<br />>
> # The binary is copied, but the executable flag is cleared,
<br />>
> # so the script has to take care of this.
<br />>
> chmod +x $EXECUTABLE
<br />>
>
<br />>
> # to allow multiple lam jobs running on a single machine,
<br />>
> # we have to give it a somewhat unique value
<br />>
> export LAM_MPI_SESSION_SUFFIX=$$
<br />>
> export LAMRSH=$CONDOR_SSH
<br />>
> # When a job is killed by the user, this script will get SIGTERM.
<br />>
> # It has to catch it and clean up the
<br />>
> # LAM environment.
<br />>
> finalize()
<br />>
> {
<br />>
> sshd_cleanup
<br />>
> lamhalt
<br />>
> exit
<br />>
> }
<br />>
> trap finalize TERM
<br />>
>
<br />>
> CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
<br />>
> export CONDOR_CONTACT_FILE
<br />>
> # The second field in the contact file is the machine name
<br />>
> # that condor_ssh knows how to use. Note that this used to
<br />>
> # say "sort -n +0 ...", but -n option is now deprecated.
<br />>
> sort < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines
<br />>
>
<br />>
> # start the lam environment
<br />>
> # For older versions of lam you may need to remove the -ssi boot rsh line
<br />>
> lamboot -ssi boot rsh -ssi rsh_agent "$LAMRSH -x" machines
<br />>
>
<br />>
> if [ $? -ne 0 ]
<br />>
> then
<br />>
>         echo "lamscript error booting lam"
<br />>
>         exit 1
<br />>
> fi
<br />>
>
<br />>
> mpirun C -ssi rpi usysv -ssi coll_smp 1 $EXECUTABLE $@ &
<br />>
>
<br />>
> CHILD=$!
<br />>
> TMP=130
<br />>
> while [ $TMP -gt 128 ] ; do
<br />>
>         wait $CHILD
<br />>
>         TMP=$?;
<br />>
> done
<br />>
>
<br />>
> # clean up files
<br />>
> sshd_cleanup
<br />>
> /bin/rm -f machines
<br />>
>
<br />>
> # clean up lam
<br />>
> lamhalt
<br />>
>
<br />>
> exit $TMP
<br />>
> ----
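The while loop near the end of lamscript relies on the shell convention that a process killed by signal N exits with status 128+N, so any status above 128 means "interrupted by a signal, keep waiting". A minimal standalone illustration of that convention (not part of the original script):

```shell
#!/bin/sh
# Demonstrate the >128 exit-status convention the wait loop depends on:
# a process terminated by signal N exits with status 128+N.
sh -c 'kill -TERM $$'    # child sends itself SIGTERM (signal 15)
status=$?
echo "$status"           # 128 + 15 = 143 on bash/dash
[ "$status" -gt 128 ] && echo "killed by a signal"
```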
<br />>
>
<br />>
> md.sh
<br />>
> ----
<br />>
> #!/bin/sh
<br />>
> #running GROMACS
<br />>
> /stathome/jiangsl/soft/gromacs-4.0.5/bin/mdrun_mpi_d \
<br />>
> -s /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.tpr \
<br />>
> -e /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.edr \
<br />>
> -o /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.trr \
<br />>
> -g /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.log \
<br />>
> -c /stathome/jiangsl/simulation/gromacs/2OMP/2OMP_1_1/md/200ns.gro
<br />>
> -----
<br />>
>
<br />>
>
<br />>
> Hsin-Lin
<br />
</font>
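For reference, the machines file that lamboot consumes in the quoted lamscript is derived from Condor's contact file by keeping only the hostname column. A self-contained illustration with made-up hostnames:

```shell
#!/bin/sh
# How lamscript builds its machines file: Condor's contact file has
# lines like "<node#> <hostname> <port> ...", and only the hostname
# column (field 2) is kept. The hostnames below are hypothetical.
printf '1 node02.example.org 4050\n0 node01.example.org 4050\n' > contact
sort < contact | awk '{print $2}' > machines
cat machines    # node01.example.org, then node02.example.org
rm -f contact machines
```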
</BODY>
</HTML>