<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Sep 30, 2013 at 11:23 AM, Erik Lindahl <span dir="ltr">&lt;<a href="mailto:erik.lindahl@scilifelab.se" target="_blank">erik.lindahl@scilifelab.se</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word">Hi,<div><br><div><div class="im">

<div>On Sep 30, 2013, at 11:14 AM, Mark Abraham &lt;<a href="mailto:mark.j.abraham@gmail.com" target="_blank">mark.j.abraham@gmail.com</a>&gt; wrote:</div><blockquote type="cite"><p dir="ltr"><br>

&gt; I think this is unlikely to make it for 5.0, but long-term I would like to support multiple hardware accelerations in a single binary again, by making the actual binaries very small and loading one of several libraries as a dynamic module at runtime. This is not technically difficult to do, but there is one step that will be a bit of a pain for us: each symbol we want to use from the library must be resolved manually with a call to dlsym().</p>

<p dir="ltr">I can see three possible division levels: mdrun vs tools, md-loop vs rest, hardware-tuned inner loops vs rest. The third is by far the easiest to do.</p></blockquote></div>We already discussed this in Redmine (see the thread Teemu linked to), and unfortunately the problem is not limited to inner loops: CPU-specific optimization flags have a significant impact on large parts of the code, and improve performance by ~20% on top of the gain from the inner kernels. I don&#39;t think we&#39;re willing to sacrifice 20% performance.</div>

</div></div></blockquote><div><br></div><div>OK, great. I&#39;m glad someone has measured some numbers (even if those reported in <a href="http://redmine.gromacs.org/issues/1165">http://redmine.gromacs.org/issues/1165</a> are 9% and 17%). If premature optimization is the root of all evil, then optimization based on assumption is the tuber of all evil!</div>

<div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div>

<div class="im"><blockquote type="cite"><p dir="ltr">Cray still requires static linking, and BlueGene/Q encourages it, so I think it is important that the implementation does not require dynamic linking in the cases where portability of the binary is immaterial.</p>

</blockquote></div><div>I don&#39;t think we can both have our cake and eat it. For special-purpose highly parallel architectures that require static linking I think it is reasonable that the Gromacs binary will be specific to that particular architecture. </div>

</div></div></div></blockquote><div><br></div><div>Agreed.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div style="word-wrap:break-word"><div><div><div class="im"><blockquote type="cite"><p dir="ltr">&gt; This means we should start thinking of two things to make life simpler in the future:<br>

&gt;<br>

&gt; 1) Decide on what level we want the interface between library and executables, and keep this interface _really_ small (in the sense that we want to resolve as few symbols as possible).<br>

&gt; 2) Since we will have to compile the high-level binaries with generic compiler flags, any code that is performance-sensitive should go in the architecture-specific optimized library.</p><p dir="ltr">I think the third option I give above is the most achievable. I do not know whether the dynamic function calls incur overhead per call, or whether that can be mitigated by the helper object Teemu suggested, but he sounds right (as usual). I hope the libraries would share the same address space. Since we anyway plan for tasks to wrap function calls, the implementations converge.</p>

</blockquote></div><div>See above. It would lose ~20% performance, which I think is unacceptable. The main md loop and all functions under it need to be compiled with CPU-specific optimization, so that&#39;s the lowest level we can split on. Otherwise we can just as well disable AVX optimization and ship SSE4.1 binaries to be portable :-)</div>

</div></div></div></blockquote><div><br></div><div>OK, so if the division is based on &quot;code called from the integrator loop,&quot; then that starts to make for a sensible division of the code base. The general API (i.e. the part that might need to be resolved with dlsym() on x86) would be the integrator functions, unless/until someone identifies specific needs. Organizational sanity suggests that we start moving code from src/gromacs to src/core when someone is fairly sure it belongs there. The criterion for that should be along the lines of &quot;it has been measured to benefit from machine-specific compiler optimization flags, and is not called directly by code in src/gromacs.&quot; So for starters that is kernels, neighbour search code, update code, PME.</div>
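<div><br></div><div>To make the dlsym() route concrete, here is a minimal sketch of how a small binary might resolve a single integrator entry point from a hardware-specific library. The names core_api, core_run_md and the int (*)(int) signature are purely hypothetical, invented for illustration; nothing like them exists in the code base today. The point is only that the lookup happens once at startup, after which every call is an ordinary indirect call through a function pointer.</div>

```c
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical integrator entry point. The real signature would be
 * whatever the core library ends up exporting. */
typedef int (*integrator_fn)(int nsteps);

/* Small table of entry points, resolved once at startup. */
typedef struct {
    void         *handle;  /* library handle returned by dlopen() */
    integrator_fn run_md;  /* hypothetical symbol "core_run_md" */
} core_api;

/* Load the library at `path` and resolve the symbols in the table.
 * Returns 0 on success, -1 on failure (the table is left zeroed). */
static int core_api_load(core_api *api, const char *path)
{
    memset(api, 0, sizeof *api);

    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }

    void *sym = dlsym(handle, "core_run_md");
    if (sym == NULL) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }

    /* Idiom from the POSIX dlsym() rationale for storing the returned
     * void * into a function pointer. */
    *(void **)(&api->run_md) = sym;
    api->handle = handle;
    return 0;
}

/* Release the library and clear the table. */
static void core_api_unload(core_api *api)
{
    if (api->handle != NULL) {
        dlclose(api->handle);
    }
    memset(api, 0, sizeof *api);
}
```

<div>Usage would be along the lines of core_api_load(&amp;api, path_chosen_from_cpu_detection) followed by api.run_md(nsteps). Keeping the table this small is what makes the manual symbol resolution tolerable; every extra function in the API is another dlsym() line to maintain.</div>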

<div><br></div><div>What about data structures (like parts of forcerec) that get filled during setup? If the compiler might want to control alignment and padding to suit the hardware, then they will have to be declared in src/core, and the machinery that sets them must be in the API.</div>
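<div><br></div><div>One way this could look is an opaque handle: generic code sees only an incomplete type plus create/set/get/destroy functions, while the layout, alignment and padding are decided where the struct is compiled with machine-specific flags. This is only a sketch under assumptions. The type core_forcerec, its fields, and the 32-byte alignment are made up for illustration and are not the real forcerec layout.</div>

```c
#include <stdalign.h>
#include <stdlib.h>
#include <string.h>

/* --- What a public header would expose to generically compiled code --- */
typedef struct core_forcerec core_forcerec;   /* incomplete (opaque) type */

/* --- What would live in the hardware-specific library --- */
#define CORE_SIMD_ALIGN 32   /* assumed: 32 bytes, e.g. for AVX loads */

struct core_forcerec {
    alignas(CORE_SIMD_ALIGN) float shift[8];  /* hypothetical SIMD-friendly field */
    float cutoff;                             /* hypothetical scalar setting */
};

/* Allocate a zeroed, suitably aligned structure. sizeof() is already a
 * multiple of the alignment because of the alignas member, which is what
 * aligned_alloc() requires. Returns NULL on allocation failure. */
core_forcerec *core_forcerec_create(void)
{
    core_forcerec *fr = aligned_alloc(CORE_SIMD_ALIGN, sizeof *fr);
    if (fr != NULL) {
        memset(fr, 0, sizeof *fr);
    }
    return fr;
}

/* Setter and getter: the only way generic setup code touches the fields. */
void core_forcerec_set_cutoff(core_forcerec *fr, float cutoff)
{
    fr->cutoff = cutoff;
}

float core_forcerec_get_cutoff(const core_forcerec *fr)
{
    return fr->cutoff;
}

void core_forcerec_destroy(core_forcerec *fr)
{
    free(fr);
}
```

<div>The cost is that each such setter becomes one more symbol in the API, so it would only be worth doing for the structures where the compiler really does need to control the layout.</div>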

<div><br></div><div>Mark</div></div></div></div>