<div dir="auto"><div>Hi,<div dir="auto"><br></div><div dir="auto">I&#39;m running simulations with the CHARMM force field, which also uses UB, and I have seen similar behaviour. If I understood the explanations correctly, the Flops count in the first table does not reflect the actual time spent in the calculations. Instead, it&#39;s the Force row in the second table that corresponds to the bonded forces (with long range and PME on the GPU). So I tried making a SIMD version of UB (currently only standard angles are SIMD-optimised) and got almost a 50% performance gain. Also converting the bonds to SIMD gave only an additional 1 or 2%. My patch is just a draft, as it&#39;s not clear what future SIMD functions should look like, but I&#39;ll share it with you so that you can try it. However, I guess it won&#39;t be in the next release.</div><div dir="auto"><br></div><div dir="auto">Cheers,</div><div dir="auto"><br></div><div dir="auto">Magnus</div><br><div class="gmail_extra"><br><div class="gmail_quote">On 1 Dec. 2017 22:34, &quot;Jochen Hub&quot; &lt;<a href="mailto:jhub@gwdg.de">jhub@gwdg.de</a>&gt; wrote:<br type="attribution"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Dear developers,<br>


<br>


I started a thread in the user list yesterday (and Szilard already gave a quick answer) but I felt this point is relevant for the developers list.<br>


<br>


We did some benchmarks with the 2018-beta1 with PME on the GPU - overall fantastic (!!) - we just don&#39;t understand the performance of lipid membrane simulations (Slipids or Charmm36, with UB potentials). They contain roughly 50% lipid, 50% water atoms. Please see here:<br>


<br>


<a href="http://cmb.bio.uni-goettingen.de/bench.pdf" rel="noreferrer" target="_blank">http://cmb.bio.uni-goettingen.<wbr>de/bench.pdf</a><br>


<br>


As you see in the linked PDF, the Slipid simulations are limited by the CPU up to 10 (!) quite strong Xeon cores, when using a GTX 1080. Szilard pointed out that this is probably due to bonded UB interactions - however, they make up only 0.2% of the Flops, see the log output pasted below, for ntomp=4 or 10 (for a 128 Slipids system with 1nm cutoff). The Flops summary is nearly the same for ntomp=4 or 10, so only the one for ntomp=4 is shown below.<br>


<br>


In contrast, protein simulations (whether membrane protein or purely in water) behave as one hopes, showing that we can buy a cheap CPU when doing PME on the GPU.<br>


<br>


So my question is: Is this expected? Is this really due to Urey-Bradley? Or maybe due to constraints? If UB is the limiting factor, are there any plans to port it to the GPU as well in the future?<br>


<br>


This also has an impact on hardware: Depending on whether you run protein or membrane simulations, you need to buy different hardware.<br>


<br>


Many thanks for any input, and many thanks again for the fabulous work on 2018!<br>


<br>


Jochen<br>


<br>


<br>


 Computing:                               M-Number         M-Flops  % Flops<br>


------------------------------<wbr>------------------------------<wbr>-----------------<br>


 Pair Search distance check             151.929968        1367.370     0.0<br>


 NxN Ewald Elec. + LJ [F]            157598.160192    10401478.573    97.2<br>


 NxN Ewald Elec. + LJ [V&amp;F]            1623.781504      173744.621     1.6<br>


 1,4 nonbonded interactions             200.360064       18032.406     0.2<br>


 Shift-X                                  1.553664           9.322     0.0<br>


 Propers                                246.449280       56436.885     0.5<br>


 Impropers                                1.280256         266.293     0.0<br>


 Virial                                   7.657759         137.840     0.0<br>


 Stop-CM                                  1.553664          15.537     0.0<br>


 P-Coupling                               7.646464          45.879     0.0<br>


 Calc-Ekin                               15.262464         412.087     0.0<br>


 Lincs                                   74.894976        4493.699     0.0<br>


 Lincs-Mat                             1736.027136        6944.109     0.1<br>


 Constraint-V                           226.605312        1812.842     0.0<br>


 Constraint-Vir                           7.614336         182.744     0.0<br>


 Settle                                  25.605120        8270.454     0.1<br>


 Urey-Bradley                           144.668928       26474.414     0.2<br>


------------------------------<wbr>------------------------------<wbr>-----------------<br>


 Total                                                10700125.072   100.0<br>


------------------------------<wbr>------------------------------<wbr>-----------------<br>


<br>


<br>


     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G<br>


<br>


On 1 MPI rank, each using 4 OpenMP threads<br>


<br>


 Computing:          Num   Num      Call    Wall time         Giga-Cycles<br>


                     Ranks Threads  Count      (s)         total sum    %<br>


------------------------------<wbr>------------------------------<wbr>-----------------<br>


 Neighbor search        1    4         51       0.260          2.284   2.4<br>


 Launch GPU ops.        1    4      10002       0.591          5.191   5.4<br>


 Force                  1    4       5001       7.314         64.211  67.1<br>


 Wait PME GPU gather    1    4       5001       0.071          0.626   0.7<br>


 Reduce GPU PME F       1    4       5001       0.078          0.684   0.7<br>


 Wait GPU NB local      1    4       5001       0.017          0.151   0.2<br>


 NB X/F buffer ops.     1    4       9951       0.321          2.822   2.9<br>


 Write traj.            1    4          2       0.117          1.026   1.1<br>


 Update                 1    4       5001       0.199          1.749   1.8<br>


 Constraints            1    4       5001       1.853         16.270  17.0<br>


 Rest                                           0.085          0.743   0.8<br>


------------------------------<wbr>------------------------------<wbr>-----------------<br>


 Total                                         10.907         95.757 100.0<br>


------------------------------<wbr>------------------------------<wbr>-----------------<br>


<br>


******************************<wbr>**<br>


****** 10 Open MP threads ******<br>


******************************<wbr>**<br>


<br>


On 1 MPI rank, each using 10 OpenMP threads<br>


<br>


 Computing:          Num   Num      Call    Wall time         Giga-Cycles<br>


                     Ranks Threads  Count      (s)         total sum    %<br>


------------------------------<wbr>------------------------------<wbr>-----------------<br>


 Neighbor search        1   10         51       0.120          2.625   2.3<br>


 Launch GPU ops.        1   10      10002       0.580         12.731  11.3<br>


 Force                  1   10       5001       2.999         65.828  58.4<br>


 Wait PME GPU gather    1   10       5001       0.066          1.459   1.3<br>


 Reduce GPU PME F       1   10       5001       0.045          0.980   0.9<br>


 Wait GPU NB local      1   10       5001       0.014          0.308   0.3<br>


 NB X/F buffer ops.     1   10       9951       0.157          3.453   3.1<br>


 Write traj.            1   10          2       0.147          3.224   2.9<br>


 Update                 1   10       5001       0.140          3.067   2.7<br>


 Constraints            1   10       5001       0.814         17.867  15.9<br>


 Rest                                           0.053          1.161   1.0<br>


------------------------------<wbr>------------------------------<wbr>-----------------<br>


 Total                                          5.135        112.703 100.0<br>


------------------------------<wbr>------------------------------<wbr>-----------------<font color="#888888"><br>


<br>


<br>


<br>


-- <br>


------------------------------<wbr>---------------------<br>


Dr. Jochen Hub<br>


Computational Molecular Biophysics Group<br>


Institute for Microbiology and Genetics<br>


Georg-August-University of Göttingen<br>


<a href="https://maps.google.com/?q=Justus-von-Liebig-Weg+11,+37077+G%C3%B6ttingen,+Germany&amp;entry=gmail&amp;source=g">Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany</a>.<br>


Phone: <a href="tel:%2B49-551-39-14189" value="+495513914189" target="_blank">+49-551-39-14189</a><br>


<a href="http://cmb.bio.uni-goettingen.de/" rel="noreferrer" target="_blank">http://cmb.bio.uni-goettingen.<wbr>de/</a><br>


------------------------------<wbr>---------------------<br>


</font></blockquote></div><br></div></div></div>