Hi,<div><br></div><div>Keep an eye on the release-4-6 git branch: the AVX 128 (+FMA4) and AVX 256 versions of the non-bonded kernels will be merged upstream very soon.</div><div><br></div><div>Cheers,<br clear="all">--<br>Szilárd<br>


<br><br><div class="gmail_quote">On Wed, May 30, 2012 at 4:31 PM, Shun Sakuraba <span dir="ltr">&lt;<a href="mailto:shun.sakuraba@gmail.com" target="_blank">shun.sakuraba@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Dear list,<br>

<br>

I would like to share my attempt at porting the GROMACS 4.5 SSE/SSE2 nonbonded kernels<br>

to AMD&#39;s new family 15h (&quot;Bulldozer&quot;) architecture, done to measure<br>

the benefit of the new instructions. AMD family 15h adds the FMA4 instructions,<br>

which perform fused multiply-add (and multiply-subtract) operations.<br>

FMA4 reduces the instruction count and latency, giving a marginal performance boost.<br>

There are also the XOP instructions, which are useful for implementing the table interpolation<br>

used in GROMACS.<br>
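<br>
To illustrate why fused multiply-add fits table interpolation, here is a minimal Python sketch (not the actual kernel code). It assumes a hypothetical table layout with four cubic coefficients (Y, F, G, H) per table point; the function name and arguments are illustrative only. Written in Horner form, the interpolation is a chain of three multiply-add steps, each of which a single FMA instruction could perform.<br>

```python
import math


def table_lookup(table, r, scale):
    """Interpolate a tabulated function at distance r.

    table: list of (Y, F, G, H) coefficient tuples, one per table point
           (hypothetical layout assumed for this sketch).
    scale: number of table points per unit distance.
    """
    rt = r * scale
    n = int(math.floor(rt))   # index of the table point at or below r
    eps = rt - n              # fractional offset within the interval, in [0, 1)
    Y, F, G, H = table[n]
    # Horner form: every step is one multiply followed by one add,
    # i.e. the a*b + c pattern that a fused multiply-add computes at once.
    fp = G + eps * H          # multiply-add 1
    fp = F + eps * fp         # multiply-add 2
    vv = Y + eps * fp         # multiply-add 3
    return vv
```

With separate multiply and add instructions the three steps cost six arithmetic instructions; with fused multiply-add they cost three.<br>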

<br>

In my tests, the speedup is only about 5% for single-precision (SP) kernels and around 10% for<br>

double-precision (DP) kernels on gmxbench ( <a href="http://www.gromacs.org/About_Gromacs/Benchmarks" target="_blank">http://www.gromacs.org/About_Gromacs/Benchmarks</a> ).<br>

The source code and generated kernels are available for download at<br>

( <a href="https://bitbucket.org/shun.sakuraba/gromacs-fma4-patch" target="_blank">https://bitbucket.org/shun.sakuraba/gromacs-fma4-patch</a> ).<br>

I hope this is of interest to some developers.<br>

<br>

gmxbench 3.0, ns/day<br>

           vanilla fma4+xop<br>

SP    dppc  2.13    2.21<br>

SP lzm.pme 16.15   16.73<br>

SP lzm.cut 23.93   24.79<br>

SP polych2 38.88   38.13<br>

SP  villin 88.32   92.71<br>

DP    dppc  1.35    1.52<br>

DP lzm.pme 10.28   11.14<br>

DP lzm.cut 15.34   17.09<br>

DP polych2 26.73   29.87<br>

DP  villin 48.82   54.92<br>
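<br>
The relative speedup per benchmark can be worked out from the table above. The following is a small sketch in Python; the ns/day values are copied from the table, while the function and variable names are only illustrative.<br>

```python
# ns/day values copied from the gmxbench 3.0 table: (vanilla, fma4+xop)
RESULTS = {
    ("SP", "dppc"): (2.13, 2.21),
    ("SP", "lzm.pme"): (16.15, 16.73),
    ("SP", "lzm.cut"): (23.93, 24.79),
    ("SP", "polych2"): (38.88, 38.13),
    ("SP", "villin"): (88.32, 92.71),
    ("DP", "dppc"): (1.35, 1.52),
    ("DP", "lzm.pme"): (10.28, 11.14),
    ("DP", "lzm.cut"): (15.34, 17.09),
    ("DP", "polych2"): (26.73, 29.87),
    ("DP", "villin"): (48.82, 54.92),
}


def speedup_percent(vanilla, patched):
    """Relative change in ns/day of the patched build, in percent."""
    return (patched / vanilla - 1.0) * 100.0


def summarize(results):
    """Map each (precision, benchmark) key to its speedup in percent."""
    return {key: speedup_percent(*vals) for key, vals in results.items()}


if __name__ == "__main__":
    for (prec, name), pct in summarize(RESULTS).items():
        print(f"{prec} {name:8s} {pct:+5.1f}%")
```

Note that SP polych2 is the one case where the fma4+xop value (38.13) is lower than the vanilla value (38.88), so its computed speedup is negative.<br>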

<br>

Benchmark results were obtained in the following environment:<br>

AMD FX-4100 (2 modules, 4 cores, 3.6GHz / up to 3.9GHz with turbo)<br>

Linux (ArchLinux) 3.1.12-1-ARCH<br>

gcc (GCC) 4.7.0  20120505 (prerelease)<br>

FFTW 3.3.2<br>

GROMACS 4.5.5, run with thread parallelization on 4 threads, compiled with<br>

CFLAGS=&quot;-O3 -fomit-frame-pointer -finline-functions -Wall -Wno-unused -march=native -funroll-all-loops -std=gnu99 -fexcess-precision=fast&quot;.<br>

Each benchmark was run 3 times and the median is reported.<br>
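<br>
The median-of-three reduction can be sketched as below. The function name is hypothetical, and the values in the comment are made-up placeholders, not measured data.<br>

```python
from statistics import median


def median_of_runs(ns_per_day_runs):
    """Return the median ns/day of exactly three repeated runs."""
    if len(ns_per_day_runs) != 3:
        raise ValueError("expected exactly three runs")
    return median(ns_per_day_runs)


# Placeholder values for illustration only (not measurements):
# median_of_runs([10.0, 12.0, 11.0]) picks the middle value, 11.0
```

With three runs, the median simply discards the fastest and slowest run, which makes the reported number robust against a single outlier.<br>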

<br>

Notes:<br>

* In GROMACS, if &quot;-march=native&quot; is replaced with &quot;-msse2&quot; (the default in GROMACS 4.5.5),<br>

  the program runs 5-10% slower.<br>

* I also compiled FFTW with and without -march=native. With the default flags (i.e. without -march=native),<br>

  SP/DP runs are ~3% slower in PME.<br>

* This version of FFTW does not use FMA4 SIMD instructions. Using them in FFTW<br>

  should increase performance, but I have not tried this.<br>

* In NVE simulations, energy conservation is slightly better than with the vanilla kernels, possibly because<br>

  an IEEE-compliant FMA computes the intermediate product exactly and rounds only the final result.<br>
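<br>
The rounding difference mentioned in the last note can be illustrated with a small Python sketch. It emulates a fused multiply-add in double precision with exact rational arithmetic; it does not call the hardware FMA4 instruction, and the function names are only illustrative.<br>

```python
from fractions import Fraction


def unfused(a, b, c):
    """a*b + c with two roundings: the product is rounded, then the sum."""
    return a * b + c


def fused(a, b, c):
    """a*b + c with a single rounding, emulating an IEEE 754 FMA.

    Fractions hold the product and the sum exactly; float() rounds once.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# The exact product of a and b is 1 - 2**-60, which rounds to 1.0 in double
# precision, so the unfused result loses the small term completely.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

if __name__ == "__main__":
    print("unfused:", unfused(a, b, c))
    print("fused:  ", fused(a, b, c))
```

In this example the unfused result is exactly zero, while the fused result keeps the residual -2**-60. Whether this effect explains the observed energy conservation is not established here.<br>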

<span class="HOEnZb"><font color="#888888"><br>

--<br>

Shun SAKURABA, Ph.D.<br>

Postdoc @ Molecular Modeling &amp; Simulation Group, Japan Atomic Energy Agency<br>


</font></span></blockquote></div><br></div>