<div dir="ltr"><div><div>This issue appears to not be a GROMACS problem so much as a problem with "huge pages" that is</div><div>triggered by PME tuning. PME tuning creates a large data structure for every cutoff that it tries, which</div><div>is replicated on each PME node. These data structures are not freed during tuning, so memory usage</div><div>expands. Normally it is still too small to cause problems. With huge pages, however, I get errors from<br></div><div>"libhugetlbfs" and very slow runs if more than about five cutoffs are attempted.</div><div><br></div><div>Sample output on NERSC Cori KNL with 32 nodes. Input system size is 248,101 atoms.<br></div><div><br></div><div>step 0<br>step 100, remaining wall clock time: 24 s<br>step 140: timed with pme grid 128 128 128, coulomb cutoff 1.200: 66.2 M-cycles<br>step 210: timed with pme grid 112 112 112, coulomb cutoff 1.336: 69.6 M-cycles<br>step 280: timed with pme grid 100 100 100, coulomb cutoff 1.496: 63.6 M-cycles<br>step 350: timed with pme grid 84 84 84, coulomb cutoff 1.781: 85.9 M-cycles<br>step 420: timed with pme grid 96 96 96, coulomb cutoff 1.559: 68.8 M-cycles<br>step 490: timed with pme grid 100 100 100, coulomb cutoff 1.496: 68.3 M-cycles<br>libhugetlbfs [nid08887:140420]: WARNING: New heap segment map at 0x10001200000 failed: Cannot allocate memory<br>libhugetlbfs [nid08881:97968]: WARNING: New heap segment map at 0x10001200000 failed: Cannot allocate memory<br>libhugetlbfs [nid08881:97978]: WARNING: New heap segment map at 0x10001200000 failed: Cannot allocate memory<br></div><div><br></div><div>Szilárd, to answer to your questions: This is the verlet scheme. The problem happens during tuning, and</div><div>no problems occur if -notunepme is used. In fact, the best performance thus far has been with 50% PME</div><div>nodes, using huge pages, and '-notunepme'.<br></div></div><div><div><div><div><div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra">John</div><div class="gmail_extra"><br></div><div class="gmail_extra"><div class="gmail_quote">On Wed, Sep 13, 2017 at 6:20 AM, Szilárd Páll <span dir="ltr"><<a href="mailto:pall.szilard@gmail.com" target="_blank">pall.szilard@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Forking the discussion as now we've learned more about the issue Åke<br>


John

On Wed, Sep 13, 2017 at 6:20 AM, Szilárd Páll <pall.szilard@gmail.com> wrote:
Forking the discussion as now we've learned more about the issue Åke is
reporting, and it is rather dissimilar.

On Mon, Sep 11, 2017 at 8:09 PM, John Eblen <jeblen@acm.org> wrote:
> Hi Szilárd
>
> No, I'm not using the group scheme.

$ grep -i 'cutoff-scheme' md.log
cutoff-scheme = Verlet

> The problem seems similar because:
>
> 1) Deadlocks and very slow runs can be hard to distinguish.
> 2) Since Mark mentioned it, I assume he believes PME tuning is a possible
>    cause, which is also the cause in my situation.

Does that mean you tested with "-notunepme" and the excessive memory
usage could not be reproduced? Did the memory usage increase only
during the tuning or did it keep increasing after the tuning
completed?
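
For reference, one simple way to tell these two apart (the binary name gmx_mpi and the 60 s interval
are only placeholders) is to sample the resident set size of the mdrun processes on a compute node
while the run progresses:

$ while true; do date; ps -C gmx_mpi -o pid=,rss=; sleep 60; done

If the per-process RSS keeps climbing after the "timed with pme grid" messages stop appearing, the
growth is not limited to the tuning phase.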

> 3) Åke may be experiencing higher-than-normal memory usage as far as I know.
>    Not sure how you know otherwise.
> 4) By "successful," I assume you mean the tuning had completed. That doesn't
>    mean, though, that the tuning could not be creating conditions that cause
>    the problem, like an excessively high cutoff.

Sure. However, it's unlikely that the tuning creates conditions under
which the run proceeds after the initial tuning phase and keeps
allocating memory (which is more likely to be the source of the
issues).

I suggest first ruling out the bug I linked; if that's not the
culprit, we can have a closer look.

Cheers,
--
Szilárd

>
>
> John
>
> On Mon, Sep 11, 2017 at 1:09 PM, Szilárd Páll <pall.szilard@gmail.com>
> wrote:
>>
>> John,
>>
>> In what way do you think your problem is similar? Åke seems to be
>> experiencing a deadlock after successful PME tuning, much later during
>> the run, but no excessive memory usage.
>>
>> Do you happen to be using the group scheme with 2016.x (release code)?
>>
>> Your issue sounds more like it could be related to the excessive
>> tuning bug with the group scheme, fixed quite a few months ago but not
>> yet released (https://redmine.gromacs.org/issues/2200).
>>
>> Cheers,
>> --
>> Szilárd
>>
>>
>> On Mon, Sep 11, 2017 at 6:50 PM, John Eblen <jeblen@acm.org> wrote:
>> > Hi
>> >
>> > I'm having a similar problem that is related to PME tuning. When it is
>> > enabled, GROMACS often, but not always, slows to a crawl and uses
>> > excessive amounts of memory. Using "huge pages" and setting a high
>> > number of PME processes seems to exacerbate the problem.
>> >
>> > Also, occurrences of this problem seem to correlate with how high the
>> > tuning raises the cutoff value.
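>> >
>> > One quick way to see how far the tuning pushed the cutoff is to grep the
>> > timed cutoffs out of md.log (or the job's stderr), for example:
>> >
>> > $ grep -i 'coulomb cutoff' md.log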
>> >
>> > Mark, can you give us more information on the problems with PME tuning?
>> > Is there a redmine?
>> >
>> >
>> > Thanks
>> > John
>> >
>> > On Mon, Sep 11, 2017 at 10:53 AM, Mark Abraham <mark.j.abraham@gmail.com>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> Thanks. Was PME tuning active? Does it reproduce if that is disabled?
>> >> Is the PME tuning still active? How many steps have taken place (at
>> >> least as reported in the log file but ideally from processes)?
>> >>
>> >> Mark
>> >>
>> >> On Mon, Sep 11, 2017 at 4:42 PM Åke Sandgren <ake.sandgren@hpc2n.umu.se>
>> >> wrote:
>> >>>
>> >>> My debugger run finally got to the lockup.
>> >>>
>> >>> All processes are waiting on various MPI operations.
>> >>>
>> >>> Attached a stack dump of all 56 tasks.
>> >>>
>> >>> I'll keep the debug session running for a while in case anyone wants
>> >>> some more detailed data.
>> >>> This is a RelwithDeb build though so not everything is available.
>> >>>
>> >>> On 09/08/2017 11:28 AM, Berk Hess wrote:
>> >>> > But you should be able to get some (limited) information by
>> >>> > attaching a debugger to an already running process with a release
>> >>> > build.
>> >>> >
>> >>> > If you plan on compiling and running a new case, use a release +
>> >>> > debug symbols build. That should run as fast as a release build.
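>> >>> >
>> >>> > For example (the build directory and PID are placeholders), a release
>> >>> > build with debug symbols and attaching to a running rank would look
>> >>> > roughly like:
>> >>> >
>> >>> > $ cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo && make
>> >>> > $ gdb -p <pid-of-a-hung-mdrun-rank>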
>> >>> >
>> >>> > Cheers,
>> >>> >
>> >>> > Berk
>> >>> >
>> >>> > On 2017-09-08 11:23, Åke Sandgren wrote:
>> >>> >> We have at least one case that, when run over 2 or more nodes, quite
>> >>> >> often (always) hangs, i.e. there is no more output in md.log or
>> >>> >> elsewhere while mdrun still consumes CPU time. It takes a random time
>> >>> >> before it happens, like 1-3 days.
>> >>> >>
>> >>> >> The case can be shared if someone else wants to investigate. I'm
>> >>> >> planning to run it in the debugger to be able to break and look at
>> >>> >> the state when it happens, but since it takes so long with the
>> >>> >> production build it is not something I'm looking forward to.
>> >>> >>
>> >>> >> On 09/08/2017 11:13 AM, Berk Hess wrote:
>> >>> >>> Hi,
>> >>> >>>
>> >>> >>> We are far behind schedule for the 2017 release. We are working
>> >>> >>> hard on it, but I don't think we can promise a date yet.
>> >>> >>>
>> >>> >>> We have a 2016.4 release planned for this week (might slip to next
>> >>> >>> week). But if you can give us enough details to track down your
>> >>> >>> hanging issue, we might be able to fix it in 2016.4.
>> >>> >
>> >>>
>> >>> --
>> >>> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
>> >>> Internet: ake@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90-580 14
>> >>> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se