<br><br><div class="gmail_quote">On Wed, Dec 10, 2008 at 11:09 AM, Knox, Kent <span dir="ltr">&lt;<a href="mailto:Kent.Knox@amd.com">Kent.Knox@amd.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


I&#39;ve done a naïve printf-style instrumentation of the GROMACS FFT interface, and can only see 3D real-to-complex/complex-to-real FFTs being used. &nbsp;For an MPI build of GROMACS, I see that GROMACS chunks 3D FFTs into 2D FFTs itself and passes those down to the underlying FFT library to finish. &nbsp;I believe that in these instances ACML is threaded appropriately, but please let me know if I am drawing the wrong conclusions. &nbsp;I am basing my observations purely on the d.lzm bench.</blockquote>

<div><br>Yes, it is correct that GROMACS partitions the 3D FFT into 2D FFTs itself. It doesn&#39;t use threading for the FFT because the surrounding code is not threaded, so one MPI process runs per core. My work (not in GROMACS yet) is to partition along two dimensions rather than only one, in order to scale to a higher number of processors. The partitioning thus ends up with columns/pencils instead of slabs. <br>
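<br>A minimal sketch of the two partitionings, assuming NumPy and a complex-to-complex transform for simplicity (GROMACS itself uses real-to-complex transforms and MPI; none of the names below come from GROMACS or ACML, and the loops over blocks only stand in for separate MPI ranks). With slabs, each rank owns whole planes, so the rank count is capped by the number of planes along one axis. With columns (often called pencils), ranks form a 2D grid, so the cap is the product of two axis lengths.<br>

```python
import numpy as np

def fft3d_slabs(grid, n_ranks):
    """3D FFT with a 1D (slab) partition: each hypothetical rank owns a block of x-planes."""
    slabs = np.array_split(grid, n_ranks, axis=0)
    # Step 1: every rank runs 2D FFTs over (y, z) on its own planes, no communication.
    partial = np.concatenate([np.fft.fft2(s, axes=(1, 2)) for s in slabs], axis=0)
    # Step 2: after a global transpose (an all-to-all in a real MPI code),
    # finish with 1D FFTs along x.
    return np.fft.fft(partial, axis=0)

def fft3d_pencils(grid, p_rows, p_cols):
    """3D FFT with a 2D (column/pencil) partition over a p_rows x p_cols rank grid."""
    out = np.empty(grid.shape, dtype=complex)
    x_blocks = np.array_split(np.arange(grid.shape[0]), p_rows)
    y_blocks = np.array_split(np.arange(grid.shape[1]), p_cols)
    # Step 1: every rank owns full-length z columns for its (x, y) block
    # and runs 1D FFTs along z locally.
    for xb in x_blocks:
        for yb in y_blocks:
            idx = np.ix_(xb, yb)
            out[idx] = np.fft.fft(grid[idx], axis=2)
    # Steps 2 and 3: two transposes (each within one row or column of the
    # rank grid in a real MPI code), then 1D FFTs along y and along x.
    out = np.fft.fft(out, axis=1)
    return np.fft.fft(out, axis=0)

rng = np.random.default_rng(0)
g = rng.standard_normal((8, 6, 4))
print(np.allclose(fft3d_slabs(g, 4), np.fft.fftn(g)))
print(np.allclose(fft3d_pencils(g, 2, 3), np.fft.fftn(g)))
```

<br>Both functions reproduce the full 3D transform; the difference is only in how the data is divided and how many ranks can take part.<br>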

<br>Roland<br>&nbsp;</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><font color="#888888"><br>

</font><div><div></div><div class="Wj3C7c"><br>

-----Original Message-----<br>

From: roland@rschulz.eu [mailto:<a href="mailto:roland@rschulz.eu">roland@rschulz.eu</a>] On Behalf Of Roland Schulz<br>

Sent: Tuesday, December 09, 2008 2:24 PM<br>

To: Discussion list for GROMACS development<br>

Cc: Knox, Kent<br>

Subject: Re: [gmx-developers] Gromacs FFT<br>

<br>

Hi Kent,<br>

<br>

usually FFT is not a bottleneck for MD when run on one or a few processors. You can increase the FFT load slightly by using a small cut-off (rcoulomb in the mdp file) and a fine grid (fourierspacing in the mdp file). Typically one uses a minimum rcoulomb of 0.8 and a fourierspacing of 1.1. But you could decrease fourierspacing further to see the effect on the FFT time.<br>


<br>

FFT becomes the major bottleneck for parallel runs on more than a few hundred CPUs. I did some work on parallel FFT on Jaguar and Kraken. Let me know in case you are also interested in parallel FFT. Is it correct that ACML only supports serial FFT so far? Do you plan to add a parallel FFT, or an extension like the one for the linear algebra routines with AMD ScaLAPACK?<br>


<br>

Roland<br>

<br>

</div></div></blockquote></div><br><br clear="all"><br>-- <br>ORNL/UT Center for Molecular Biophysics <a href="http://cmb.ornl.gov">cmb.ornl.gov</a><br>