<html>
<body>
Erik,<br>
I also have older systems which use Opteron 165 CPUs. I have run tests of
the AMD Opteron 165 CPUs (2.18GHz) against the Intel Core2 Duos (3GHz).
Twelve concurrent AutoDock jobs on each machine show the Core2 Duos
outperforming the Opterons by a factor of two. <br><br>
The data I posted showed inconsistencies which have nothing to do with
memory bandwidth, and I was rather hoping for an analysis based upon the
manner in which GROMACS mdrun distributes its computing tasks.<br><br>
I don't believe my data shows memory bandwidth-limiting effects. For
example, three 'local' cores on the quad-core machine are faster (6.65 Gflops)
than one core of the quad plus two remote cores from the cluster (5.02 Gflops).
How does that support the memory bandwidth hypothesis? <br><br>
I figured that the GAMMA MP software might be causing overhead, but when I
examined the distribution of tasks by GROMACS (in the log I provided), the
tasks which mdrun handed to GAMMA appeared to be distributed well. What
might be a bottleneck is the manner in which CPU0 hogged most of the mdrun
calculations. It was insight into GROMACS' mdrun distribution methodology
which I was seeking. Is there any quantitative data available for me to
review?<br><br>
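One rough way to check whether CPU0's share of the work is large enough to be a bottleneck is to weight each row of the load-balancing table by its M-Flops from the accounting table. The sketch below does this for the six nonbonded kernel rows only, using numbers copied from the log further down. It assumes the load figures are percentages of the per-node average for each row, as the table header states; the function name is purely illustrative.<br><br>

```python
# (M-Flops, load on nodes 0..4 as % of the per-node average), copied from the log.
KERNELS = {
    "LJ":                   (30626.224794,   [423,   0,   3,  41,  32]),
    "Coul(T)":              (37244.027436,   [500,   0,   0,   0,   0]),
    "Coul(T) [W3]":         (11610.267250,   [  0,   0,  32, 291, 176]),
    "Coul(T) + LJ":         (32945.241340,   [500,   0,   0,   0,   0]),
    "Coul(T) + LJ [W3]":    (33634.789680,   [  0,   0,  24, 296, 178]),
    "Coul(T) + LJ [W3-W3]": (1257610.086000, [ 60, 116, 108, 106, 107]),
}


def relative_node_load(rows):
    """Return each node's summed M-Flops as a percentage of the node average."""
    n_nodes = len(next(iter(rows.values()))[1])
    totals = [0.0] * n_nodes
    for mflops, loads in rows.values():
        per_node_avg = mflops / n_nodes  # what a perfectly balanced node would do
        for node, pct in enumerate(loads):
            totals[node] += per_node_avg * pct / 100.0
    mean = sum(totals) / n_nodes
    return [100.0 * t / mean for t in totals]


if __name__ == "__main__":
    for node, pct in enumerate(relative_node_load(KERNELS)):
        print(f"node {node}: {pct:5.1f}% of average")
```

If the assumption holds, node 0 ends up with somewhat less than the average share of these kernels even though it owns all of Coul(T) and Coul(T) + LJ, because the W3-W3 water loop dominates the flop count. This counts flops only; it says nothing about time spent waiting on communication.<br><br>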
Sincerely<br>
Trevor<br><br>
<br>
At 12:45 PM 5/27/2007, Erik Lindahl wrote:<br>
<blockquote type=cite class=cite cite>Hi Trevor,<br><br>
It's probably due to memory bandwidth limitations, as well as Intel's
design.<br><br>
Intel managed to get quad cores to market by gluing together two
dual-core chips. All communication between them has to go over the front
side bus though, and all eight cores in a system share the bandwidth to
memory.<br><br>
This can become a problem when you're running in parallel, since all
eight processes are communicating (=using the bus bandwidth) at once, and
have to share it. You will probably get much better performance by
running multiple (8) independent simulations.<br>
<br>
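A minimal sketch of what running independent simulations could look like: one serial mdrun per core, each in its own directory. It assumes a non-MPI <code>mdrun</code> binary on the PATH, the Linux <code>taskset</code> utility for pinning, and a hypothetical layout of directories <code>run0</code> ... <code>run7</code> that each hold a <code>topol.tpr</code>; none of these names come from this thread.<br><br>

```python
import subprocess


def build_commands(n_runs, tpr_name="topol.tpr"):
    """Return (working_dir, argv) pairs, one pinned mdrun per core."""
    jobs = []
    for core in range(n_runs):
        workdir = f"run{core}"  # hypothetical directory layout
        # taskset -c N pins the process to core N; mdrun -s names the run input file.
        argv = ["taskset", "-c", str(core), "mdrun", "-s", tpr_name]
        jobs.append((workdir, argv))
    return jobs


def launch(jobs):
    """Start all jobs at once, wait for them, and return their exit codes."""
    procs = [subprocess.Popen(argv, cwd=workdir) for workdir, argv in jobs]
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Print the commands instead of launching, so the sketch is safe to run.
    for workdir, argv in build_commands(8):
        print(workdir, " ".join(argv))
```

Calling <code>launch(build_commands(8))</code> would start all eight runs together. The runs still share the memory bus, but they no longer exchange data with each other.<br><br>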
Essentially, there's no such thing as a free lunch. Intel's quad-core
chips are cheap, but have the same drawback as their first generation
dual-core chips. AMD's solution with real quad-cores and on-chip memory
controllers in Barcelona is looking a whole lot better, but I also expect
it to be quite a bit more expensive.<br><br>
You might want to test the CVS version for better scaling. The smaller
amount of data communicated there might improve performance a bit for
you.<br><br>
Cheers,<br><br>
Erik<br><br>
<br>
On May 27, 2007, at 6:28 PM, Trevor Marshall wrote:<br><br>
<blockquote type=cite class=cite cite>Can anybody suggest how I might
optimize my new cluster so that speed increases more linearly as
I add computing cores? The new Intel Core2 CPUs are inherently very fast,
and my mdrun simulation performance is becoming asymptotic to a value
only about twice the speed I can get from a single core.<br><br>
I have included the log output from mdrun_mpi when using 5 cores at the
foot of this email. First, here is the system overview.<br>
<br>
My cluster comprises two computers running Fedora Core 6 and
MPI-GAMMA. Both have Intel Core2 CPUs running at 3GHz core speed
(overclocked). The main machine now has a sparkling new Core2 Quad
four-core CPU and the remote still has a Core2 Duo dual-core
CPU.<br><br>
Networking hardware is crossover CAT6 cables. The GAMMA software is
connected thru one Intel PRO/1000 board in each computer, with MTU 9000.
A Gigabit adapter with Realtek chipset is the primary Linux network in
each machine, with MTU 1500. For the common filesystem I am running NFS
on a mounted filesystem with "async" declared in the exports
file. The mount is /dev/hde1 to /media and then /media is exported via
NFS to the cluster machine. File I/O does not seem to be a
bottleneck.<br><br>
With mdrun_mpi I am calculating a 240aa protein and ligand for 10,000
time intervals. Here are the results for various combinations of one,
two, three, four and five cores.<br><br>
<table>
<tr><th>Cores used</th><th>hr/nsec</th><th>Gflops</th></tr>
<tr><td>One local core only running mdrun</td><td>18.3</td><td>2.61</td></tr>
<tr><td>Two local cores</td><td>9.98</td><td>4.83</td></tr>
<tr><td>Three local cores</td><td>7.35</td><td>6.65</td></tr>
<tr><td>Four local cores (one also controlling)</td><td>7.72</td><td>6.42</td></tr>
<tr><td>Three local cores and two remote cores</td><td>7.59</td><td>6.72</td></tr>
<tr><td>One local and 2 remote cores</td><td>9.76</td><td>5.02</td></tr>
</table><br>
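The timings above can be turned into speedup and parallel efficiency relative to the single-core run. This is a small sketch using only the hr/nsec figures listed; the function names are illustrative, and "cores" simply counts every core in the run, local or remote.<br><br>

```python
SINGLE_CORE_HR_PER_NS = 18.3

# (label, cores used, hr/nsec), copied from the results listed above
RUNS = [
    ("2 local", 2, 9.98),
    ("3 local", 3, 7.35),
    ("4 local", 4, 7.72),
    ("3 local + 2 remote", 5, 7.59),
    ("1 local + 2 remote", 3, 9.76),
]


def speedup(t_single, t_parallel):
    """Speedup = single-core time / parallel time (both in hr/nsec)."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, cores):
    """Parallel efficiency = speedup / number of cores (1.0 is ideal)."""
    return speedup(t_single, t_parallel) / cores


if __name__ == "__main__":
    for label, cores, hr_per_ns in RUNS:
        s = speedup(SINGLE_CORE_HR_PER_NS, hr_per_ns)
        e = efficiency(SINGLE_CORE_HR_PER_NS, hr_per_ns, cores)
        print(f"{label:20s} speedup {s:4.2f}  efficiency {e:4.2f}")
```

On these numbers, efficiency falls from roughly 0.9 on two cores to under 0.5 on five, which is the flattening described above.<br><br>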
I get good performance with one local core doing control, and three doing
calculations, giving 6.66 Gflops. However, adding two extra remote cores
only increases the speed a very small amount to 6.72 Gflops, even though
the log (below) shows good task distribution (I think).<br><br>
Is there some problem with scaling when using these new fast CPUs? Can I
tweak anything in mdrun_mpi to give better scaling?<br><br>
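One generic diagnostic for this kind of question is the Karp-Flatt metric, which estimates an effective serial fraction e from a measured speedup S on p cores: e = (1/S - 1/p) / (1 - 1/p). It is not something mdrun reports; the sketch below just applies the textbook formula to the local-core timings given earlier in this message.<br><br>

```python
def karp_flatt(speedup, cores):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    if cores < 2:
        raise ValueError("the metric is only defined for 2 or more cores")
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


T1 = 18.3                                  # hr/nsec on one local core
LOCAL_RUNS = {2: 9.98, 3: 7.35, 4: 7.72}   # cores -> hr/nsec, local cores only

if __name__ == "__main__":
    for cores, hr_per_ns in sorted(LOCAL_RUNS.items()):
        e = karp_flatt(T1 / hr_per_ns, cores)
        print(f"{cores} cores: effective serial fraction {e:.3f}")
```

A value of e that stays constant as cores are added would point to a fixed serial part of the code. A value that grows with the core count, as it does here, suggests overhead that increases with parallelism (communication, contention for shared resources, or load imbalance). The metric alone cannot tell these apart.<br><br>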
Sincerely<br>
Trevor<br>
------------------------------------------<br>
Trevor G Marshall, PhD<br>
School of Biological Sciences and Biotechnology, Murdoch University,
Western Australia<br>
Director, Autoimmunity Research Foundation, Thousand Oaks,
California<br>
Patron, Australian Autoimmunity Foundation.<br>
------------------------------------------<br>
<br>
<pre>
        M E G A - F L O P S   A C C O U N T I N G

        Parallel run - timing based on wallclock.
   RF=Reaction-Field    FE=Free Energy    SCFE=Soft-Core/Free Energy
   T=Tabulated          W3=SPC/TIP3p      W4=TIP4p (single or pairs)
   NF=No Forces

Computing:                        M-Number           M-Flops  % of Flops
-----------------------------------------------------------------------
LJ                              928.067418      30626.224794     1.1
Coul(T)                         886.762558      37244.027436     1.4
Coul(T) [W3]                     92.882138      11610.267250     0.4
Coul(T) + LJ                    599.004388      32945.241340     1.2
Coul(T) + LJ [W3]               243.730360      33634.789680     1.2
Coul(T) + LJ [W3-W3]           3292.173000    1257610.086000    45.6
Outer nonbonded loop            945.783063       9457.830630     0.3
1,4 nonbonded interactions       41.184118       3706.570620     0.1
Spread Q Bspline              51931.592640     103863.185280     3.8
Gather F Bspline              51931.592640     623179.111680    22.6
3D-FFT                        40498.449440     323987.595520    11.7
Solve PME                      3000.300000     192019.200000     7.0
NS-Pairs                       1044.424912      21932.923152     0.8
Reset In Box                     24.064040        216.576360     0.0
Shift-X                         961.696160       5770.176960     0.2
CG-CoM                            8.242234        239.024786     0.0
Sum Forces                      721.272120        721.272120     0.0
Bonds                            25.022502       1075.967586     0.0
Angles                           36.343634       5924.012342     0.2
Propers                          13.411341       3071.197089     0.1
Impropers                        12.171217       2531.613136     0.1
Virial                          241.774175       4351.935150     0.2
Ext.ens. Update                 240.424040      12982.898160     0.5
Stop-CM                         240.400000       2404.000000     0.1
Calc-Ekin                       240.448080       6492.098160     0.2
Constraint-V                    240.424040       1442.544240     0.1
Constraint-Vir                  215.884746       5181.233904     0.2
Settle                           71.961582      23243.590986     0.8
-----------------------------------------------------------------------
Total                                         2757465.194361   100.0
-----------------------------------------------------------------------

               NODE (s)   Real (s)      (%)
       Time:    408.000    408.000    100.0
                       6:48
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     14.810      6.758      3.176      7.556
</pre>
<br>
<pre>
Detailed load balancing info in percentage of average
                 Type  NODE:   0   1   2   3   4 Scaling
------------------------------------------------------
                         LJ: 423   0   3  41  32     23%
                    Coul(T): 500   0   0   0   0     20%
               Coul(T) [W3]:   0   0  32 291 176     34%
               Coul(T) + LJ: 500   0   0   0   0     20%
          Coul(T) + LJ [W3]:   0   0  24 296 178     33%
       Coul(T) + LJ [W3-W3]:  60 116 108 106 107     86%
       Outer nonbonded loop: 246  42  45  79  85     40%
 1,4 nonbonded interactions: 500   0   0   0   0     20%
           Spread Q Bspline:  98 100 102 100  97     97%
           Gather F Bspline:  98 100 102 100  97     97%
                     3D-FFT: 100 100 100 100 100    100%
                  Solve PME: 100 100 100 100 100    100%
                   NS-Pairs: 107  96  91 103 100     93%
               Reset In Box:  99 100 100 100  99     99%
                    Shift-X:  99 100 100 100  99     99%
                     CG-CoM: 110  97  97  97  97     90%
                 Sum Forces: 100 100 100  99  99     99%
                      Bonds: 499   0   0   0   0     20%
                     Angles: 500   0   0   0   0     20%
                    Propers: 499   0   0   0   0     20%
                  Impropers: 500   0   0   0   0     20%
                     Virial:  99 100 100 100  99     99%
            Ext.ens. Update:  99 100 100 100  99     99%
                    Stop-CM:  99 100 100 100  99     99%
                  Calc-Ekin:  99 100 100 100  99     99%
               Constraint-V:  99 100 100 100  99     99%
             Constraint-Vir:  54 111 111 111 111     89%
                     Settle:  54 111 111 111 111     89%

                Total Force:  93 102  97 104 102     95%

                Total Shake:  56 110 110 110 110     90%

Total Scaling: 95% of max performance
</pre>
<br>
Finished mdrun on node 0 Sun May 27 07:29:57 2007<br>
<br>
_______________________________________________<br>
gmx-users mailing list
<a href="mailto:gmx-users@gromacs.org">gmx-users@gromacs.org</a><br>
<a href="http://www.gromacs.org/mailman/listinfo/gmx-users" eudora="autourl">http://www.gromacs.org/mailman/listinfo/gmx-users</a><br>
Please search the archive at
<a href="http://www.gromacs.org/search">http://www.gromacs.org/search</a>
before posting!<br>
Please don't post (un)subscribe requests to the list. Use the <br>
www interface or send it to <a href="mailto:gmx-users-request@gromacs.org">gmx-users-request@gromacs.org</a>.<br>
Can't post? Read <a href="http://www.gromacs.org/mailing_lists/users.php">http://www.gromacs.org/mailing_lists/users.php</a></blockquote><br>
</blockquote></body>
</html>