<div dir="ltr">Hi,<div><br></div><div>You can use -nopin, but you will get slightly lower performance. You can also use -pinoffset and number the different GROMACS instances you are running any way you wish. Most programs don&#39;t pin, so pinning is unlikely to conflict with other programs if you are running both GROMACS and other programs.</div>
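<div><br></div><div>A minimal sketch of the numbering idea, assuming pinning is enabled and that 8 single-threaded instances share one node. The instance count, the run0..run7 file prefixes and the use of -deffnm are illustrative assumptions, not taken from the job scripts quoted below. The script only prints the commands it would run:</div>
<pre>
```shell
#!/bin/sh
# Sketch: give each single-threaded mdrun instance on a node its own pin offset.
# NINSTANCES and the run0..run7 prefixes are hypothetical placeholders.
NINSTANCES=8

# Print the command line for instance number $1 (0-based).
mdrun_cmd() {
    echo "mdrun_d -ntmpi 1 -ntomp 1 -pinoffset $1 -deffnm run$1"
}

i=0
while [ "$i" -lt "$NINSTANCES" ]; do
    # Dry run: only print the command. To launch for real, run the
    # printed command in the background instead of echoing it.
    mdrun_cmd "$i"
    i=$((i + 1))
done
```
</pre>
<div>Each instance gets a different offset, so no two instances should be pinned to the same core. This only helps when you control all the GROMACS jobs that land on the node.</div>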


<div><br></div><div>Roland </div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Dec 19, 2012 at 1:23 PM, Shirts, Michael (mrs5pt) <span dir="ltr">&lt;<a href="mailto:mrs5pt@eservices.virginia.edu" target="_blank">mrs5pt@eservices.virginia.edu</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi, all-<br>

<br>

I&#39;m trying to figure out the reason for a performance hit when running on a<br>

single core with the new code, which is specifically reflected in a core<br>

time that is significantly less than the wall time (about 1/3).  Apologies<br>

if this has already been discussed and I missed it!  I have some theories,<br>

but need some more help figuring this out.<br>

<br>

Executive summary -- is there something which causes all jobs sent to the<br>

same node to be pinned to the first core, so that if there are 8 jobs<br>

requesting 1 thread each on an 8-core node, they will just steal from each<br>

other on the first core rather than operating on different cores?  If so,<br>

how can this be avoided?  Looking at online docs, it seems that -pinoffset<br>

options might help, but there is no way to tell beforehand where the jobs<br>

will be sent, or what other users will be doing with THEIR programs.  Is<br>

there a way to make this simpler?  To &#39;just work&#39; and use the available<br>

cores like it did in 4.5.5?<br>

<br>

Details:<br>

<br>

In everything, I am using only group cutoffs as well as only thread_mpi<br>

(though only one thread, so thread_mpi shouldn&#39;t matter).<br>

<br>

When I run with a single core using a PBS script (including only the cpu<br>

selection line in the PBS script), for example:<br>

<br>

#PBS -l select=1:mpiprocs=1:ncpus=1<br>

<br>

Running command:<br>

<br>

mdrun_d -ntomp 1 -ntmpi 1<br>

<br>

I find that with 4.6 beta I get:<br>

<br>

&gt;                Core t (s)   Wall t (s)        (%)<br>

&gt;        Time:      526.200     1586.919       33.2<br>

&gt;                  (ns/day)    (hour/ns)<br>

&gt; Performance:        1.089       22.038<br>

<br>

Note that the core time is only about 1/3 of the wall time.<br>

<br>

This also occurs when running with simply:<br>

<br>

#PBS -l nodes=1:ppn=1<br>

mdrun_d -nt 1<br>

<br>

However, other runs with identical call parameters got up to 96%<br>

utilization.  Logging directly onto the compute nodes and running &#39;top&#39;, I<br>

found that the CPU use percent was somewhere between 10 and 40% for the 8<br>

jobs running (all of which used 1 thread). It should have been 100% for<br>

each, as far as I can tell.   When I was able to isolate a run that was<br>

going faster, I logged into its compute node and found that it was indeed<br>

running alone, with a CPU utilization determined by &#39;top&#39; of near 100%.<br>

<br>

So, is there something pinning 1 core jobs to the first thread?<br>
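<br>
One way to check this on a compute node is to ask Linux which cores each mdrun_d process is allowed to run on. A minimal sketch, assuming the taskset utility from util-linux is installed on the node; the helper name affinity_list is made up for illustration:<br>
<pre>
```shell
#!/bin/sh
# Sketch: show the CPU affinity list of a running process on Linux.
# Assumes the taskset utility (util-linux) is available.

# Print the cores that process $1 may run on, e.g. "0" or "0-7".
affinity_list() {
    # taskset prints "pid N's current affinity list: LIST"; keep only LIST.
    taskset -cp "$1" | sed 's/.*: //'
}

# Example: inspect this shell itself; on a compute node, pass the PID
# of an mdrun_d process instead.
affinity_list "$$"
```
</pre>
If every single-thread job on the node reports the same single core (for example 0), the jobs are pinned on top of each other; a range such as 0-7 means the process is not pinned.<br>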

<br>

When running (a different chemical system which is inherently faster, same<br>

4.6 code) with all 8 processors:<br>

<br>

#PBS -l select=1:mpiprocs=8:ncpus=8<br>

mdrun_d -ntmpi 8<br>

<br>

&gt;               Core t (s)   Wall t (s)        (%)<br>

&gt;       Time:   324684.500    40826.497      795.3<br>

&gt;                 (ns/day)    (hour/ns)<br>

&gt;Performance:      321.181        0.075<br>

<br>

Here, we get near full resources: utilization is 795.3/8 = 99.4%<br>
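<br>
The utilization figures can be recomputed from the core and wall times reported above. A small sketch, where the helper name utilization is made up for illustration:<br>
<pre>
```shell
#!/bin/sh
# Sketch: recompute utilization from the reported core and wall times.

# utilization CORE_SECONDS WALL_SECONDS NCORES
# Prints the percentage of the requested cores that were kept busy.
utilization() {
    awk -v core="$1" -v wall="$2" -v n="$3" \
        'BEGIN { printf "%.1f\n", 100 * core / wall / n }'
}

utilization 526.200 1586.919 1      # single-core 4.6 beta run: prints 33.2
utilization 324684.500 40826.497 8  # 8-thread 4.6 run: prints 99.4
```
</pre>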

<br>

With older code (modifications of 4.5.5), and the same system as the first<br>

example, running:<br>

<br>

#PBS -l nodes=1:ppn=1<br>

mdrun_d -nt 1<br>

<br>

then even though the core/node time drops by 10-15% in 4.6 (yay speed increases in<br>

4.6!), the old code&#39;s node time is close to 100% of the wall time, so its throughput is much better<br>

than the new single-process timing.  These results are very consistent.  They don&#39;t<br>

depend on what else is being run on the node.<br>

<br>

Old code:<br>

<br>

&gt;                NODE (s)   Real (s)      (%)<br>

&gt;        Time:    611.460    623.325     98.1<br>

&gt;                        10:11<br>

&gt;                (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)<br>

&gt; Performance:     36.451      1.823      2.826      8.492<br>

<br>

<br>

Some information from the new code setup (from the log)<br>

<br>

Host: lc5-compute-1-2.local  pid: 32227  nodeid: 0  nnodes:  1<br>

Gromacs version:    VERSION 4.6-beta2-dev-20121217-e233b32<br>

GIT SHA1 hash:      e233b3231ae94805ae489840133ffcc225263d3a<br>

Branched from:      c5706f32cc2363c50b61ec0a207bf93dc20220a1 (4 newer local<br>

commits)<br>

Precision:          double<br>

MPI library:        thread_mpi<br>

OpenMP support:     enabled<br>

GPU support:        disabled<br>

invsqrt routine:    gmx_software_invsqrt(x)<br>

CPU acceleration:   SSE2<br>

FFT library:        fftw-3.2.2<br>

Large file support: enabled<br>

RDTSCP usage:       disabled<br>

Built on:           Mon Dec  3 10:14:02 EST 2012<br>

Built by:           <a href="mailto:mrs5pt@fir-s.itc.virginia.edu">mrs5pt@fir-s.itc.virginia.edu</a> [CMAKE]<br>

Build OS/arch:      Linux 2.6.18-308.11.1.el5 x86_64<br>

Build CPU vendor:   GenuineIntel<br>

Build CPU brand:    Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz<br>

Build CPU family:   6   Model: 26   Stepping: 5<br>

Build CPU features: apic clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc<br>

pdcm popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3<br>

C compiler:         /usr/bin/gcc GNU gcc (GCC) 4.1.2 20080704 (Red Hat<br>

4.1.2-50)<br>

C compiler flags:   -msse2  -Wextra -Wno-missing-field-initializers<br>

-Wno-sign-compare -Wall -Wno-unused -Wunused-value   -fomit-frame-pointer<br>

-funroll-all-loops  -O3 -DNDEBUG<br>

<br>

. . .<br>

<br>

Using 1 MPI thread<br>

<br>

Detecting CPU-specific acceleration.<br>

Present hardware specification:<br>

Vendor: GenuineIntel<br>

Brand:  Intel(R) Xeon(R) CPU           L5430  @ 2.66GHz<br>

Family:  6  Model: 23  Stepping: 10<br>

Features: apic clfsh cmov cx8 cx16 lahf_lm mmx msr pdcm pse sse2 sse3 sse4.1<br>

ssse3<br>

Acceleration most likely to fit this hardware: SSE4.1<br>

Acceleration selected at GROMACS compile time: SSE2<br>

<br>

<br>

Binary not matching hardware - you might be losing performance.<br>

Acceleration most likely to fit this hardware: SSE4.1<br>

Acceleration selected at GROMACS compile time: SSE2<br>

<br>

<br>

Best,<br>

~~~~~~~~~~~~<br>

Michael Shirts<br>

Assistant Professor<br>

Department of Chemical Engineering<br>

University of Virginia<br>

<a href="mailto:michael.shirts@virginia.edu">michael.shirts@virginia.edu</a><br>

<a href="tel:%28434%29-243-1821" value="+14342431821">(434)-243-1821</a><br>

<span class="HOEnZb"><font color="#888888"><br>


</font></span></blockquote></div><br><br clear="all"><div><br></div>-- <br>ORNL/UT Center for Molecular Biophysics <a href="http://cmb.ornl.gov">cmb.ornl.gov</a><br>865-241-1537, ORNL PO BOX 2008 MS6309<br>

</div>