Dear gmx users,<br>
<br>
I have set up a Linux cluster of 8 nodes with the following specification:<br>
<br>
Node HW: two dual-core Opteron 2212 CPUs (2 GHz, 1 MB cache per core), i.e. 4 cores per node, plus 2 GB RAM and Gigabit Ethernet NICs.<br>
Network infrastructure: Gigabit Ethernet (Catalyst 2960) + Linux TCP/IP stack.<br>
OS: Fedora Core 5.<br>
<br>
I configured and compiled LAM (with default parameters), FFTW (with double-precision support) and Gromacs (MPI-enabled, with double-precision support) separately, without any problems, on our cluster.<br>
<br>
I then benchmarked the cluster with your gmxbench package, running the parallel benchmark of the DPPC system that is provided with it.<br>
<br>
Starting with a single node (four cores), I ran four processes with the following commands:<br>
<br>
grompp -np 4 -sort -shuffle -f grompp.mdp -p topol.top -c conf.gro -o grompp.tpr<br>
& <br>
mpirun -np 4 mdrun_d -v -deffnm grompp<br>
<br>
Everything seemed fine: all four cores were more than 90% utilized, and I obtained the following benchmark results:<br>
<br>
M E G A - F L O P S A C C O U N T I N G<br>
<br>
Parallel run - timing based on wallclock.<br>
RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy<br>
T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)<br>
NF=No Forces<br>
<br>
Computing: M-Number M-Flops % of Flops<br>
-----------------------------------------------------------------------<br>
LJ 13783.350611 454850.570163 10.7<br>
Coulomb 11511.123348 310800.330396 7.3<br>
Coulomb [W3] 1477.194071 118175.525680 2.8<br>
Coulomb [W3-W3] 2305.011660 539372.728440 12.7<br>
Coulomb + LJ 6733.896263 255888.057994 6.0<br>
Coulomb + LJ [W3] 2980.257052 271203.391732 6.4<br>
Coulomb + LJ [W3-W3] 5589.019105 1369309.680725 32.4<br>
Outer nonbonded loop 2892.716574 28927.165740 0.7<br>
1,4 nonbonded interactions 148.509696 13365.872640 0.3<br>
NS-Pairs 29597.265161 621542.568381 14.7<br>
Reset In Box 61.049856 549.448704 0.0<br>
Shift-X 1218.803712 7312.822272 0.2<br>
CG-CoM 30.268416 877.784064 0.0<br>
Sum Forces 1828.205568 1828.205568 0.0<br>
Angles 291.898368 47579.433984 1.1<br>
Propers 87.057408 19936.146432 0.5<br>
Impropers 15.363072 3195.518976 0.1<br>
RB-Dihedrals 122.904576 30357.430272 0.7<br>
Virial 609.941964 10978.955352 0.3<br>
Update 609.401856 18891.457536 0.4<br>
Stop-CM 609.280000 6092.800000 0.1<br>
Calc-Ekin 609.523712 16457.140224 0.4<br>
Lincs 251.030528 15061.831680 0.4<br>
Lincs-Mat 3504.181248 14016.724992 0.3<br>
Constraint-V 609.401856 3656.411136 0.1<br>
Constraint-Vir 604.522496 14508.539904 0.3<br>
Settle 117.830656 38059.301888 0.9<br>
-----------------------------------------------------------------------<br>
Total 4232795.844875 100.0<br>
-----------------------------------------------------------------------<br>
<br>
NODE (s) Real (s) (%)<br>
Time: 2799.000 2799.000 100.0<br>
46:39<br>
(Mnbf/s) (GFlops) (ns/day) (hour/ns)<br>
Performance: 15.856 1.512 0.309 77.750<br>
<br>
<br>
In the next step I repeated the same simulation on two nodes. For this I created a host file for lamboot, which starts the LAM daemons, as follows:<br>
<br>
Node-1 (repeated 4 times)<br>
Node-2 (repeated 4 times)<br>
<br>
and then executed the following commands:<br>
<br>
grompp -np 8 -sort -shuffle -f grompp.mdp -p topol.top -c conf.gro -o grompp.tpr<br>
& <br>
mpirun -np 8 mdrun_d -v -deffnm grompp<br>
<br>
In this run there were four mdrun processes on every node, each at about 60-70% utilization according to top. The benchmark results:<br>
<br>
M E G A - F L O P S A C C O U N T I N G<br>
<br>
Parallel run - timing based on wallclock.<br>
RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy<br>
T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)<br>
NF=No Forces<br>
<br>
Computing: M-Number M-Flops % of Flops<br>
-----------------------------------------------------------------------<br>
LJ 13784.875630 454900.895790 10.7<br>
Coulomb 16468.628499 444652.969473 10.5<br>
Coulomb [W3] 866.583623 69326.689840 1.6<br>
Coulomb [W3-W3] 2304.621876 539281.518984 12.7<br>
Coulomb + LJ 8301.510488 315457.398544 7.4<br>
Coulomb + LJ [W3] 1413.085299 128590.762209 3.0<br>
Coulomb + LJ [W3-W3] 5588.477053 1369176.877985 32.3<br>
Outer nonbonded loop 2958.469329 29584.693290 0.7<br>
1,4 nonbonded interactions 148.509696 13365.872640 0.3<br>
NS-Pairs 29582.701180 621236.724780 14.7<br>
Reset In Box 61.049856 549.448704 0.0<br>
Shift-X 1218.803712 7312.822272 0.2<br>
CG-CoM 30.268416 877.784064 0.0<br>
Sum Forces 2437.607424 2437.607424 0.1<br>
Angles 291.898368 47579.433984 1.1<br>
Propers 87.057408 19936.146432 0.5<br>
Impropers 15.363072 3195.518976 0.1<br>
RB-Dihedrals 122.904576 30357.430272 0.7<br>
Virial 610.482072 10988.677296 0.3<br>
Update 609.401856 18891.457536 0.4<br>
Stop-CM 609.280000 6092.800000 0.1<br>
Calc-Ekin 609.523712 16457.140224 0.4<br>
Lincs 251.030528 15061.831680 0.4<br>
Lincs-Mat 3504.181248 14016.724992 0.3<br>
Constraint-V 609.401856 3656.411136 0.1<br>
Constraint-Vir 604.522496 14508.539904 0.3<br>
Settle 117.830656 38059.301888 0.9<br>
-----------------------------------------------------------------------<br>
Total 4235553.480319 100.0<br>
-----------------------------------------------------------------------<br>
<br>
NODE (s) Real (s) (%)<br>
Time: 1337.000 1337.000 100.0<br>
22:17<br>
(Mnbf/s) (GFlops) (ns/day) (hour/ns)<br>
Performance: 36.446 3.168 0.646 37.139<br>
<br>
Not bad: going from one node to two gave close to 100% scaling, but the bad news is that the utilization of every core dropped to about 60%. Finally, I repeated all the above steps on three physical nodes. The lamboot host file:<br>
<br>
Node-1 (repeated 4 times because of four cores on every node)<br>
Node-2 (repeated 4 times)<br>
Node-3 (repeated 4 times)<br>
<br>
and then executed the following commands:<br>
<br>
grompp -np 12 -sort -shuffle -f grompp.mdp -p topol.top -c conf.gro -o grompp.tpr<br>
& <br>
mpirun -np 12 mdrun_d -v -deffnm grompp<br>
<br>
In this run there were four mdrun processes on every node, each at about 45-50% utilization. The benchmark results:<br>
<br>
<br>
M E G A - F L O P S A C C O U N T I N G<br>
<br>
Parallel run - timing based on wallclock.<br>
RF=Reaction-Field FE=Free Energy SCFE=Soft-Core/Free Energy<br>
T=Tabulated W3=SPC/TIP3p W4=TIP4p (single or pairs)<br>
NF=No Forces<br>
<br>
Computing: M-Number M-Flops % of Flops<br>
-----------------------------------------------------------------------<br>
LJ 13784.842940 454899.817020 10.7<br>
Coulomb 14582.358701 393723.684927 9.3<br>
Coulomb [W3] 1138.280373 91062.429840 2.1<br>
Coulomb [W3-W3] 2306.683307 539763.893838 12.7<br>
Coulomb + LJ 7768.301364 295195.451832 7.0<br>
Coulomb + LJ [W3] 1946.513725 177132.748975 4.2<br>
Coulomb + LJ [W3-W3] 5594.156904 1370568.441480 32.3<br>
Outer nonbonded loop 3059.386119 30593.861190 0.7<br>
1,4 nonbonded interactions 148.509696 13365.872640 0.3<br>
NS-Pairs 29577.291883 621123.129543 14.7<br>
Reset In Box 61.049856 549.448704 0.0<br>
Shift-X 1218.803712 7312.822272 0.2<br>
CG-CoM 30.268416 877.784064 0.0<br>
Sum Forces 4265.812992 4265.812992 0.1<br>
Angles 291.898368 47579.433984 1.1<br>
Propers 87.057408 19936.146432 0.5<br>
Impropers 15.363072 3195.518976 0.1<br>
RB-Dihedrals 122.904576 30357.430272 0.7<br>
Virial 611.022180 10998.399240 0.3<br>
Update 609.401856 18891.457536 0.4<br>
Stop-CM 609.280000 6092.800000 0.1<br>
Calc-Ekin 609.523712 16457.140224 0.4<br>
Lincs 251.030528 15061.831680 0.4<br>
Lincs-Mat 3504.181248 14016.724992 0.3<br>
Constraint-V 609.401856 3656.411136 0.1<br>
Constraint-Vir 604.522496 14508.539904 0.3<br>
Settle 117.830656 38059.301888 0.9<br>
-----------------------------------------------------------------------<br>
Total 4239246.335581 100.0<br>
-----------------------------------------------------------------------<br>
<br>
NODE (s) Real (s) (%)<br>
Time: 1272.000 1272.000 100.0<br>
21:12<br>
(Mnbf/s) (GFlops) (ns/day) (hour/ns)<br>
Performance: 37.045 3.333 0.679 35.333<br>
<br>
<br>
Very bad scalability!<br>
I expected about 4.5 GFlops, but the result is almost the same as the two-node run. In other words, the third node contributed almost nothing. I searched the gmx mailing list archives and found many threads on this subject. I think Gigabit Ethernet's latency is the performance killer here. Is there any solution to this problem, such as recompiling the kernel, tuning TCP/IP stack parameters, recompiling LAM, setting up the simulations differently, or anything else?<br>
<br>
Any help in this regard will be appreciated.<br>
<br>
Thanks.<br>
K. Jahanbakhsh<p>
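P.S. To put numbers on the scaling, here is a minimal Python sketch that computes speedup and parallel efficiency from the three wall-clock times reported above, taking the single-node (4-process) run as the baseline. The function name `scaling` and the dictionary layout are only illustrative and not part of Gromacs or gmxbench.<br>

```python
# Wall-clock times in seconds, copied from the three "Time:" lines above,
# keyed by the number of MPI processes used for the run.
wallclock_s = {4: 2799.0, 8: 1337.0, 12: 1272.0}


def scaling(times, base=4):
    """Return {procs: (speedup, efficiency)} relative to the `base` run.

    speedup    = t_base / t
    efficiency = speedup / (procs / base), i.e. 1.0 means ideal scaling.
    """
    t_base = times[base]
    result = {}
    for procs, t in sorted(times.items()):
        speedup = t_base / t
        ideal = procs / base
        result[procs] = (speedup, speedup / ideal)
    return result


table = scaling(wallclock_s)
for procs, (speedup, eff) in table.items():
    print(f"{procs:2d} processes: speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

With these times the two-node run comes out at about 2.09x (roughly 105% efficiency) and the three-node run at about 2.20x (roughly 73%), which is consistent with the third node adding very little.<br>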