Hi,

This problem is not very strange at all. Different CPUs have different
efficiencies for different types of code. In this case the Opteron and the
new Xeon have measurably different (although probably not more than 20%)
relative performance for particle-particle and PME interactions.

Also, your results tell you exactly what the problem is. If PME takes 44.1%
of the time without separate PME nodes, you will not get good load balancing
by using 5 PP and 3 PME nodes (which gives only 37.5% PME processing power).

You can simply try different -npme values, including 0, and see where you get
the best performance; a small scan script is sketched below.

BTW, a cut-off of 1.0 nm with a fourierspacing of 0.16 gives pretty bad
accuracy; you should increase the cut-off or decrease the spacing. A
reasonable ratio of cut-off/fourier_spacing is 8. Increasing the cut-off will
reduce the relative PME load, which will make load balancing easier and PME
communication less costly.
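To make the ratio concrete: your current settings give 1.0/0.16 = 6.25.
Purely as an illustration (the numbers are only there to show the ratio;
check that the accuracy and the cut-offs still suit your system), a ratio of
8 would look like either of these in the mdp:

; option 1: keep the cut-offs and make the PME grid finer
rcoulomb        = 1.0
fourierspacing  = 0.125   ; 1.0 / 0.125 = 8

; option 2: keep the grid and enlarge the cut-offs, which also shifts work
; from PME to the PP nodes (note it lengthens the vdW cut-off as well)
rlist           = 1.28
rcoulomb        = 1.28
rvdw            = 1.28
fourierspacing  = 0.16    ; 1.28 / 0.16 = 8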
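And for the -npme scan mentioned above, a quick shell loop is enough. This is
only a sketch, assuming the paths from your message and that mdrun backs up
the previous output files between runs; adapt it to your queue setup:

MDRUN=/home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun
# rerun the same short benchmark with different numbers of separate PME nodes
for npme in 0 2 3 4; do
    srun -n8 --cpu_bind=rank $MDRUN -v -dlb yes -npme $npme -deffnm FULL01/full01
    # report the ns/day line from the log of the run that just finished
    echo "npme = $npme"
    grep 'Performance:' FULL01/full01.log | tail -n 1
done

With only 1000 steps the numbers will be noisy, but the trend should be clear
enough to pick the best PP/PME split.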
Berk

> Date: Wed, 2 Sep 2009 21:52:47 -0500
> From: dadriano@gmail.com
> To: gmx-users@gromacs.org
> Subject: [gmx-users] Scaling problems in 8-cores nodes with GROMACS 4.0x
>
> Dear Gromacs users, (all related to GROMACS ver 4.0.x)
>
> I am facing a very strange problem on recently acquired Supermicro nodes
> with 8 Xeon cores each (quad-core 2.5 GHz Xeon E5420, 4 GB of RAM per node
> with all four memory channels activated, 20 Gb/s Infiniband InfiniHost III
> Lx DDR). I have been testing these nodes with one of our most familiar
> protein models (49887 atoms: 2873 for the protein and the rest water, in a
> dodecahedron cell), which I know scales almost linearly up to 32 cores on a
> 2.4 GHz quad-core/node Opteron cluster. Now, with the new nodes, I see
> severe PME/PP imbalance (from 20% and up). At first I thought the problem
> was related to Infiniband latency, but recently I ran a test that gave me a
> big surprise: since my model scales very well to 8 cores, I spread it over
> 8 cores on four machines and the performance was the same as on a single
> node, which in turn suggests that the problem is caused by something other
> than latency. After several tests I realized that the problem arises
> whenever the run is split into PME and PP nodes, even within a single node!
> That is, if for a short job I do (it is exactly the same for a long run):
>
> srun -n8 /home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun -v -dlb yes -deffnm FULL01/full01
>
> Average load imbalance: 0.7 %
> Part of the total run time spent waiting due to load imbalance: 0.2 %
> Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 %
>
> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> Computing:            Nodes   Number   G-Cycles   Seconds      %
> -----------------------------------------------------------------
> Domain decomp.            8      101     19.123       7.6    2.5
> Vsite constr.             8     1001      2.189       0.9    0.3
> Comm. coord.              8     1001      5.810       2.3    0.8
> Neighbor search           8      101     51.432      20.4    6.7
> Force                     8     1001    250.938      99.5   32.7
> Wait + Comm. F            8     1001     15.064       6.0    2.0
> PME mesh                  8     1001    337.946     133.9   44.1
> Vsite spread              8     2002      2.991       1.2    0.4
> Write traj.               8        2      0.604       0.2    0.1
> Update                    8     1001     17.854       7.1    2.3
> Constraints               8     1001     35.782      14.2    4.7
> Comm. energies            8     1001      1.407       0.6    0.2
> Rest                      8              25.889      10.3    3.4
> -----------------------------------------------------------------
> Total                     8             767.030     304.0  100.0
> -----------------------------------------------------------------
>
>         Parallel run - timing based on wallclock.
>
>                NODE (s)   Real (s)      (%)
> Time:            38.000     38.000    100.0
>                (Mnbf/s)   (GFlops)   (ns/day)   (hour/ns)
> Performance:    254.161     14.534     11.380      2.109
>
> This in turn shows that there is no separation between PME and PP, and
> scaling is almost linear compared with 1 processor. But if I force separate
> PME nodes and use exactly the same number of processors:
>
> srun -n8 --cpu_bind=rank /home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun -v -dlb yes -npme 3 -deffnm FULL01/full01
>
> Average load imbalance: 0.5 %
> Part of the total run time spent waiting due to load imbalance: 0.2 %
> Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %
> Average PME mesh/force load: 1.901
> Part of the total run time spent waiting due to PP/PME imbalance: 23.9 %
>
> NOTE: 23.9 % performance was lost because the PME nodes
> had more work to do than the PP nodes.
> You might want to increase the number of PME nodes
> or increase the cut-off and the grid spacing.
>
> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> Computing:            Nodes   Number   G-Cycles   Seconds      %
> -----------------------------------------------------------------
> Domain decomp.            5      101     14.660       5.9    1.4
> Vsite constr.             5     1001      1.440       0.6    0.1
> Send X to PME             5     1001      4.601       1.9    0.5
> Comm. coord.              5     1001      3.229       1.3    0.3
> Neighbor search           5      101     48.143      19.4    4.8
> Force                     5     1001    252.340     101.8   25.0
> Wait + Comm. F            5     1001      8.845       3.6    0.9
> PME mesh                  3     1001    304.447     122.9   30.1
> Wait + Comm. X/F          3     1001     73.389      29.6    7.3
> Wait + Recv. PME F        5     1001    219.552      88.6   21.7
> Vsite spread              5     2002      3.828       1.5    0.4
> Write traj.               5        2      0.555       0.2    0.1
> Update                    5     1001     17.765       7.2    1.8
> Constraints               5     1001     31.203      12.6    3.1
> Comm. energies            5     1001      1.977       0.8    0.2
> Rest                      5              25.105      10.1    2.5
> -----------------------------------------------------------------
> Total                     8            1011.079     408.0  100.0
> -----------------------------------------------------------------
>
>         Parallel run - timing based on wallclock.
>
>                NODE (s)   Real (s)      (%)
> Time:            51.000     51.000    100.0
>                (Mnbf/s)   (GFlops)   (ns/day)   (hour/ns)
> Performance:    189.377     10.354      8.479      2.831
>
> As you can see I got very bad performance, and the same is true if I do not
> specify the number of PME nodes and spread the job over 11 processors (it
> gets worse with more processors), which gives me:
>
> srun -n11 --cpu_bind=rank /home/dsilva/PROGRAMAS/gromacs-4.0.5-mavapich2_gcc-1.2p1/bin/mdrun -v -dlb yes -deffnm FULL01/full01
>
> NOTE: 11.9 % performance was lost because the PME nodes
> had more work to do than the PP nodes.
> You might want to increase the number of PME nodes
> or increase the cut-off and the grid spacing.
>
> R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> Computing:            Nodes   Number   G-Cycles   Seconds      %
> -----------------------------------------------------------------
> Domain decomp.            6      101     15.450       6.2    1.6
> Vsite constr.             6     1001      1.486       0.6    0.2
> Send X to PME             6     1001      1.154       0.5    0.1
> Comm. coord.              6     1001      3.832       1.5    0.4
> Neighbor search           6      101     47.950      19.1    5.1
> Force                     6     1001    250.202      99.7   26.7
> Wait + Comm. F            6     1001     10.022       4.0    1.1
> PME mesh                  5     1001    314.841     125.5   33.6
> Wait + Comm. X/F          5     1001    111.565      44.5   11.9
> Wait + Recv. PME F        6     1001    102.240      40.8   10.9
> Vsite spread              6     2002      2.317       0.9    0.2
> Write traj.               6        2      0.567       0.2    0.1
> Update                    6     1001     17.849       7.1    1.9
> Constraints               6     1001     31.215      12.4    3.3
> Comm. energies            6     1001      2.274       0.9    0.2
> Rest                      6              25.283      10.1    2.7
> -----------------------------------------------------------------
> Total                    11             938.249     374.0  100.0
> -----------------------------------------------------------------
>
>         Parallel run - timing based on wallclock.
>
>                NODE (s)   Real (s)      (%)
> Time:            34.000     34.000    100.0
>                (Mnbf/s)   (GFlops)   (ns/day)   (hour/ns)
> Performance:    284.388     15.963     12.719      1.887
>
> I have tried everything that came to mind, from changing npme to CPU
> affinity, mdp cut-offs and fourierspacing, and also recompiling and trying
> different versions of FFTW. Please advise me of any ideas, things to test,
> or any tips. My mdp options for these runs were:
>
> integrator            = md
> dt                    = 0.005
> nsteps                = 1000
>
> pbc                   = xyz
> nstlist               = 10
> rlist                 = 1.0
> ns_type               = grid
>
> coulombtype           = pme
> rcoulomb              = 1.0
>
> vdwtype               = cut-off
> rvdw                  = 1.0
>
> tcoupl                = Berendsen
> tc-grps               = protein non-protein
> tau-t                 = 0.1 0.1
> ref-t                 = 318 318
>
> Pcoupl                = Berendsen
> pcoupltype            = isotropic
> tau-p                 = 1.0
> ref-p                 = 1.0
> compressibility       = 4.5e-5
>
> fourierspacing        = 0.16
> pme_order             = 4
> optimize_fft          = yes
> ewald_rtol            = 1e-5
>
> gen_vel               = yes
> gen_temp              = 318
> gen_seed              = 173529
>
> constraints           = all-bonds
> constraint_algorithm  = lincs
> lincs_order           = 4
>
> nstxout               = 5000
> nstvout               = 5000
> nstfout               = 0
> nstlog                = 5000
>
> nstenergy             = 5000
> energygrps            = Protein non-protein
>
> Thanks.
> Daniel Silva
> _______________________________________________
> gmx-users mailing list    gmx-users@gromacs.org
> http://lists.gromacs.org/mailman/listinfo/gmx-users
> Please search the archive at http://www.gromacs.org/search before posting!
> Please don't post (un)subscribe requests to the list. Use the
> www interface or send it to gmx-users-request@gromacs.org.
> Can't post? Read http://www.gromacs.org/mailing_lists/users.php