Hi Trevor,

It's probably due to memory bandwidth limitations, as well as Intel's design.

Intel managed to get quad cores to market by gluing together two dual-core chips. All communication between them has to go over the front side bus, though, and all eight cores in a system share the bandwidth to memory.

This becomes a problem when you're running in parallel, since all eight processes are communicating (i.e., using bus bandwidth) at once and have to share it. You will probably get much better performance by running multiple (8) independent simulations, along the lines of the sketch below.
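For instance, something like this rough sketch (the sim0.tpr ... sim7.tpr inputs are placeholders for eight separately prepared runs, and taskset is only there as a hint to keep each job on its own core; the mdrun flags are the standard input/output options):

    # Eight independent single-core mdrun jobs, one per core.
    # sim0.tpr .. sim7.tpr are hypothetical, separately prepared inputs.
    for i in 0 1 2 3 4 5 6 7; do
        taskset -c $i mdrun -s sim$i.tpr -o sim$i.trr -e sim$i.edr -g sim$i.log &
    done
    wait    # return only when all eight runs have finished

Since the jobs never talk to each other, they scale essentially perfectly; the only resource they still compete for is memory bandwidth.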
Essentially, there's no such thing as a free lunch. Intel's quad-core chips are cheap, but they have the same drawback as their first-generation dual-core chips. AMD's Barcelona, with true quad cores and on-chip memory controllers, is looking a whole lot better, but I also expect it to be quite a bit more expensive.

You might also want to test the CVS version for better scaling; it communicates less data, which might improve performance a bit for you.

Cheers,

Erik


On May 27, 2007, at 6:28 PM, Trevor Marshall wrote:

Can anybody give me any ideas which might help me optimize my new cluster for a more linear speed increase as I add computing cores? The new Intel Core2 CPUs are inherently very fast, but my mdrun simulation performance is becoming asymptotic to a value only about twice the speed I can get from a single core.

I have included the log output from mdrun_mpi when using five cores at the foot of this email. But here is the system overview.

My cluster comprises two computers running Fedora Core 6 and MPI-GAMMA. Both have Intel Core2 CPUs running at a 3 GHz core speed (overclocked). The main machine now has a sparkling new quad-core Core2 Quad CPU and the remote machine still has a dual-core Core2 Duo CPU.

Networking hardware is crossover CAT6 cables. The GAMMA software is connected through one Intel PRO/1000 board in each computer, with MTU 9000. A Gigabit adapter with a Realtek chipset is the primary Linux network in each machine, with MTU 1500. For the common filesystem I am running NFS with "async" declared in the exports file: /dev/hde1 is mounted at /media, and /media is then exported via NFS to the cluster machine. File I/O does not seem to be a bottleneck.
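In concrete terms the setup is roughly the following (the eth1 interface name and the client subnet here are illustrative, not the exact values from my machines):

    # /etc/exports on the main machine -- "async" acknowledges writes
    # before they reach the disk, trading safety for speed
    /media  192.168.0.0/24(rw,async,no_subtree_check)

    # Jumbo frames on the Intel PRO/1000 link used by GAMMA
    # (the Realtek adapter stays at the default MTU of 1500)
    ifconfig eth1 mtu 9000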
With mdrun_mpi I am calculating a 240-residue protein and a ligand for 10,000 time steps. Here are the results for various combinations of one, two, three, four and five cores:

    One local core only (mdrun):               18.3 hr/ns    2.61 GFlops
    Two local cores:                            9.98 hr/ns   4.83 GFlops
    Three local cores:                          7.35 hr/ns   6.65 GFlops
    Four local cores (one also controlling):    7.72 hr/ns   6.42 GFlops
    Three local cores + two remote cores:       7.59 hr/ns   6.72 GFlops
    One local core + two remote cores:          9.76 hr/ns   5.02 GFlops

I get good performance with one local core doing control and three doing calculations, giving 6.66 GFlops. However, adding two extra remote cores only increases the speed a very small amount, to 6.72 GFlops, even though the log (below) shows good task distribution (I think).

Is there some problem with scaling when using these new fast CPUs? Can I tweak anything in mdrun_mpi to give better scaling?

Sincerely,
Trevor
------------------------------------------
Trevor G Marshall, PhD
School of Biological Sciences and Biotechnology, Murdoch University, Western Australia
Director, Autoimmunity Research Foundation, Thousand Oaks, California
Patron, Australian Autoimmunity Foundation
------------------------------------------

        M E G A - F L O P S   A C C O U N T I N G

   Parallel run - timing based on wallclock.
   RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
   T=Tabulated  W3=SPC/TIP3p  W4=TIP4p (single or pairs)
   NF=No Forces

 Computing:                         M-Number         M-Flops   % of Flops
 -----------------------------------------------------------------------
 LJ                               928.067418    30626.224794        1.1
 Coul(T)                          886.762558    37244.027436        1.4
 Coul(T) [W3]                      92.882138    11610.267250        0.4
 Coul(T) + LJ                     599.004388    32945.241340        1.2
 Coul(T) + LJ [W3]                243.730360    33634.789680        1.2
 Coul(T) + LJ [W3-W3]            3292.173000  1257610.086000       45.6
 Outer nonbonded loop             945.783063     9457.830630        0.3
 1,4 nonbonded interactions        41.184118     3706.570620        0.1
 Spread Q Bspline               51931.592640   103863.185280        3.8
 Gather F Bspline               51931.592640   623179.111680       22.6
 3D-FFT                         40498.449440   323987.595520       11.7
 Solve PME                       3000.300000   192019.200000        7.0
 NS-Pairs                        1044.424912    21932.923152        0.8
 Reset In Box                      24.064040      216.576360        0.0
 Shift-X                          961.696160     5770.176960        0.2
 CG-CoM                             8.242234      239.024786        0.0
 Sum Forces                       721.272120      721.272120        0.0
 Bonds                             25.022502     1075.967586        0.0
 Angles                            36.343634     5924.012342        0.2
 Propers                           13.411341     3071.197089        0.1
 Impropers                         12.171217     2531.613136        0.1
 Virial                           241.774175     4351.935150        0.2
 Ext.ens. Update                  240.424040    12982.898160        0.5
 Stop-CM                          240.400000     2404.000000        0.1
 Calc-Ekin                        240.448080     6492.098160        0.2
 Constraint-V                     240.424040     1442.544240        0.1
 Constraint-Vir                   215.884746     5181.233904        0.2
 Settle                            71.961582    23243.590986        0.8
 -----------------------------------------------------------------------
 Total                                        2757465.194361      100.0
 -----------------------------------------------------------------------

               NODE (s)   Real (s)      (%)
       Time:    408.000    408.000    100.0
                        6:48
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     14.810      6.758      3.176      7.556

   Detailed load balancing info in percentage of average

 Type                  NODE:    0    1    2    3    4   Scaling
 ---------------------------------------------------------------
                         LJ:  423    0    3   41   32     23%
                    Coul(T):  500    0    0    0    0     20%
               Coul(T) [W3]:    0    0   32  291  176     34%
               Coul(T) + LJ:  500    0    0    0    0     20%
          Coul(T) + LJ [W3]:    0    0   24  296  178     33%
       Coul(T) + LJ [W3-W3]:   60  116  108  106  107     86%
       Outer nonbonded loop:  246   42   45   79   85     40%
 1,4 nonbonded interactions:  500    0    0    0    0     20%
           Spread Q Bspline:   98  100  102  100   97     97%
           Gather F Bspline:   98  100  102  100   97     97%
                     3D-FFT:  100  100  100  100  100    100%
                  Solve PME:  100  100  100  100  100    100%
                   NS-Pairs:  107   96   91  103  100     93%
               Reset In Box:   99  100  100  100   99     99%
                    Shift-X:   99  100  100  100   99     99%
                     CG-CoM:  110   97   97   97   97     90%
                 Sum Forces:  100  100  100   99   99     99%
                      Bonds:  499    0    0    0    0     20%
                     Angles:  500    0    0    0    0     20%
                    Propers:  499    0    0    0    0     20%
                  Impropers:  500    0    0    0    0     20%
                     Virial:   99  100  100  100   99     99%
            Ext.ens. Update:   99  100  100  100   99     99%
                    Stop-CM:   99  100  100  100   99     99%
                  Calc-Ekin:   99  100  100  100   99     99%
               Constraint-V:   99  100  100  100   99     99%
             Constraint-Vir:   54  111  111  111  111     89%
                     Settle:   54  111  111  111  111     89%

                Total Force:   93  102   97  104  102     95%

                Total Shake:   56  110  110  110  110     90%

       Total Scaling: 95% of max performance

 Finished mdrun on node 0  Sun May 27 07:29:57 2007