Hi,

The Cray XT4 has a torus network, but you don't get access to it as a torus.
You get assigned processors which can be anywhere in the machine; they are
practically never a nice compact block, and there are always some missing.
Therefore software such as Gromacs cannot make use of proper Cartesian (torus)
communication as one can, for instance, on a Blue Gene.

I have no clue about the wallclock issue.
Can you find out if the run took 1:35 or 4 hours?
The start time is somewhere at the beginning of the log file.
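For example, comparing the time stamps at the top and the bottom of the log
with the accumulated wallclock in the timing table should settle it. A minimal
sketch, assuming mdrun wrote its default md.log (adjust the name if -g or
-deffnm was used):
-----------------------------------------------
# The opening lines of the log carry the start time stamp, and the closing
# lines carry the finish time stamp and the wallclock timing table.
head -n 5 md.log
tail -n 25 md.log
-----------------------------------------------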
Berk

> Date: Wed, 1 Oct 2008 12:27:06 +0200
> From: Bjorn.Sathre@student.uib.no
> To: gmx3@hotmail.com
> Subject: RE: [gmx-users] Possible bug in parallelization, PME or load-balancing on Gromacs 4.0_rc1 ??
>
> On Wed, 1 Oct 2008, Berk Hess wrote:
>
> > Hi,
> >
> > Your PME nodes seem to be running one order of magnitude slower than they should.
> > This could be explained by a memory usage problem, which is indicated by the out
> > of memory error.
> >
> > I am running systems on 8 cores for 24 hours and the memory usage stays constant
> > after the first few steps.
> > I have no clue what the problem could be.
> > I am also looking into a dynamic load balancing problem which only seems to happen
> > on the Cray XT4 and for which I, up to now, also have no clue what could cause it.
> >
> > What compiler (and version) are you using?
>
> I am using gcc-4.2.0.quadcore (the first gcc optimized for the newest AMD
> Opterons (Barcelona)). This is the same compiler as employed on "Louhi"
> at CSC Finland. See:
>
> http://developer.amd.com/cpu/gnu/Pages/default.aspx
> http://www.csc.fi/english/pages/louhi_guide/program_development/compilers/gcc/index_html
>
> We have recently received an update to our MPI library, and I am now using
> Cray's MPT-3.0.3 MPI library (an MPICH2 adaptation).
>
> Do you have any comments on the wallclock issue I brought up in the
> previous post?
>
> One more thing:
> My current run has 7584 atoms, and the GCD of the number of fourier bins
> in direct space is 16. It should then run stably on 48 CPUs (12 nodes on
> our Cray XT4).
> I understand we have a 3D torus network on the machine.
> Why is the default DD the 2D form 4x8x1 PP nodes, instead of the 3D 4x4x2?
> Does it perhaps have to do with having 4 CPUs per node?
>
> Thanks for your efforts in clearing up the Cray XT4 issues.
> Bjørn
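On the 4x8x1 versus 4x4x2 question: mdrun chooses the decomposition grid
automatically, but a specific grid can be forced for comparison. A sketch
based on the aprun line from the run script quoted further down; with
-npme 16 out of 48 cores there are 32 PP nodes, so a 4x4x2 grid covers them
just as the 2D grid does:
-----------------------------------------------
# -dd 0 0 0 (the default) lets mdrun optimize the grid; explicit values
# request that grid instead.
aprun -n 48 parmdrun -s topol.tpr -maxh 4 -npme 16 -dd 4 4 2
-----------------------------------------------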
>
> > Berk
> >
> >> Date: Tue, 30 Sep 2008 18:22:44 +0200
> >> From: st01397@student.uib.no
> >> To: gmx-users@gromacs.org
> >> Subject: RE: [gmx-users] Possible bug in parallelization, PME or load-balancing on Gromacs 4.0_rc1 ??
> >> CC: gmx3@hotmail.com
> >>
> >> I have some (hopefully) clarifying comments to my previous post now.
> >>
> >> First, to answer your question regarding pme.c: my compilation was done
> >> from v. 1.125
> >> ------------
> >> Line 1037-
> >>     if ((kx>0) || (ky>0)) {
> >>         kzstart = 0;
> >>     } else {
> >>         kzstart = 1;
> >>         p0++;
> >>     }
> >> ------
> >> As you can see, the p0++; line is there.
> >>
> >> Now here are some additional points:
> >>
> >> On Mon, 29 Sep 2008, Bjørn Steen Sæthre wrote:
> >>
> >>> The only error message I can find is the rather cryptic:
> >>>
> >>> NOTE: Turning on dynamic load balancing
> >>>
> >>> _pmii_daemon(SIGCHLD): PE 4 exit signal Killed
> >>> [NID 1412]Apid 159787: initiated application termination
> >>>
> >>> There are no errors apart from that.
> >>>
> >>> Furthermore, I can now report that this error is endemic in all my sims
> >>> using harmonic position restraints in GROMACS 4.0_beta1 and GMX
> >>> 4.0_rc1.
> >>>
> >>> About core dumps: I will talk to our HPC staff and get back to you with
> >>> something more substantial, I hope.
> >>
> >> OK, I have gotten some info from our HPC staff; they checked another job of
> >> mine which crashed in exactly the same fashion, with exactly the same
> >> starting run topology and node configuration.
> >> They found some more info in the admin's log:
> >>
> >>> Hi,
> >>> this job got an OOM (out of memory), which is only recorded in the
> >>> system logs, not available directly to users:
> >>>
> >>> [2008-09-29 17:18:18][c11-0c0s1n0]Out of memory: Killed process 8888
> >>> (parmdrun).
> >>
> >> I can also add that I have been able to stabilize the engine by altering
> >> the cut-offs and lowering the total PME load of the run, at the expense of
> >> far greater computational inefficiency.
> >>
> >> That is, I went from unstable (<) to stable (>) as in the following diff
> >> of the mdp files:
> >> -----------------------------
> >> 21c21
> >> < rlist = 0.9
> >> ---
> >> > rlist = 1.0
> >> 24c24
> >> < rcoulomb = 0.9
> >> ---
> >> > rcoulomb = 1.0
> >> 26c26
> >> < rvdw = 0.9
> >> ---
> >> > rvdw = 1.0
> >> 28,30c28,31
> >> < fourier_nx = 60
> >> < fourier_ny = 40
> >> < fourier_nz = 40
> >> ---
> >> > fourier_nx = 48
> >> > fourier_ny = 32
> >> > fourier_nz = 32
> >> 35c36
> >> ------------------------------
> >> That is, the PME workload went from 1/2 of the nodes to 1/3 of them, since
> >> I was using exactly the same startup configuration.
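Related to picking the PP/PME split: grompp prints an estimate of how much of
the work ends up in the PME mesh part, which can be compared with npme/ntotal
before submitting. A sketch, assuming the wording of the estimate line matches
recent GROMACS versions; the input file names are placeholders for whatever
the run actually uses:
-----------------------------------------------
# grompp reports an estimated fraction of the work that lands on the PME
# mesh; comparing it with npme/ntotal (here 16/48 = 1/3) shows whether the
# chosen split is roughly balanced.
grompp -f run.mdp -c conf.gro -p topol.top -o topol.tpr 2>&1 | tee grompp.out
grep -i "relative computational load" grompp.out
-----------------------------------------------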
> >>
> >> This, however, while enhancing stability, slowed the output rate down
> >> appreciably. And as shown in the log output, the reason is clear:
> >> ------------------------------------------------------------
> >> Making 2D domain decomposition 8 x 4 x 1
> >> starting mdrun 'Propane-hydrate prism (2x2x3 UC)'
> >> 2000000 steps, 4000.0 ps.
> >> Step 726095: Run time exceeded 3.960 hours, will terminate the run
> >>
> >> Step 726100: Run time exceeded 3.960 hours, will terminate the run
> >>
> >> Average load imbalance: 26.7 %
> >> Part of the total run time spent waiting due to load imbalance: 1.5 %
> >> Average PME mesh/force load: 9.369
> >> Part of the total run time spent waiting due to PP/PME imbalance: 57.5 %
> >>
> >> NOTE: 57.5 % performance was lost because the PME nodes
> >> had more work to do than the PP nodes.
> >> You might want to increase the number of PME nodes
> >> or increase the cut-off and the grid spacing.
> >>
> >> Parallel run - timing based on wallclock.
> >>
> >>                  NODE (s)   Real (s)      (%)
> >> Time:            5703.000   5703.000    100.0
> >>                  1h35:03
> >>                  (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
> >> Performance:       29.593      8.566     60.600      0.396
> >>
> >> gcq#0: Thanx for Using GROMACS - Have a Nice Day
> >> -----------------------------------------------
> >>
> >> One thing more is odd here, though.
> >> In the startup script I allocated 4 hours and set -maxh 4:
> >>
> >> -----------------------------------------------
> >> #PBS -l walltime=4:00:00,mppwidth=48,mppnppn=4
> >> cd /work/bjornss/pmf/structII/hydrate_annealing/heatup_400K_2nd
> >> source $HOME/gmx_latest_290908/bin/GMXRC
> >> aprun -n 48 parmdrun -s topol.tpr -maxh 4 -npme 16
> >> exit $?
> >> -----------------------------------------------
> >>
> >> Why the wallclock inconsistency? That is, the wallclock is 1:35:03 (5703 s),
> >> which does not correspond to the note that 3.96 hours were exceeded
> >> (-maxh 4 terminates the run after 0.99 x 4 h = 3.96 h, i.e. about 14256 s).
> >>
> >> I hope this is helpful in resolving the issue brought up originally. (Might
> >> there be a possible memory leak somewhere?)
> >>
> >> Regards,
> >> Bjørn
> >>
> >> PhD student
> >> Institute of Physics & Tech., University of Bergen
> >> Allegt. 55,
> >> 5007 Bergen
> >> Norway
> >>
> >> Tel (office): +47 55582869
> >> Cell: +47 99253386
> >> _______________________________________________
> >> gmx-users mailing list    gmx-users@gromacs.org
> >> http://www.gromacs.org/mailman/listinfo/gmx-users
> >> Please search the archive at http://www.gromacs.org/search before posting!
> >> Please don't post (un)subscribe requests to the list. Use the
> >> www interface or send it to gmx-users-request@gromacs.org.
> >> Can't post? Read http://www.gromacs.org/mailing_lists/users.php