On 2017-09-18 18:34, John Eblen wrote:
> Hi Szilárd
>
> These runs used 2M huge pages. I will file a redmine shortly.
>
> On a related topic, how difficult would it be to modify GROMACS to
> support > 50% PME nodes?

That's not so hard, but I see little benefit, since the MPI
communication would then not be reduced much compared to all ranks
doing PME.

Berk

> John
>
> On Fri, Sep 15, 2017 at 6:37 PM, Szilárd Páll
> <pall.szilard@gmail.com> wrote:
> > Hi John,
> >
> > Thanks for diagnosing the issue!
> >
> > We have been aware of this behavior. It is partly intentional (we
> > re-scan grids after the first pass at least once more), and it has
> > also simply been considered not too big of a deal, given that
> > mdrun generally has a very low memory footprint. However, at least
> > on this particular machine, that assumption was wrong. What are
> > the page sizes on Cori KNL?
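> >
> > Something like this should show them on a compute node (a sketch
> > assuming standard Linux tools and, on Cray systems, the
> > craype-hugepages modules):
> >
> > $ getconf PAGESIZE               # base page size in bytes
> > $ grep Huge /proc/meminfo        # huge page size and pool state
> > $ module avail craype-hugepages  # Cray hugepage modules, if any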
> >
> > Can you please file a redmine with your observations?
> >
> > Thanks,
> > --
> > Szilárd
> >
> > On Fri, Sep 15, 2017 at 8:25 PM, John Eblen <jeblen@acm.org> wrote:
> > > This issue appears to be not so much a GROMACS problem as a
> > > problem with "huge pages" that is triggered by PME tuning. PME
> > > tuning creates a large data structure for every cutoff that it
> > > tries, which is replicated on each PME node. These data
> > > structures are not freed during tuning, so memory usage grows.
> > > Normally it stays small enough not to cause problems. With huge
> > > pages, however, I get errors from "libhugetlbfs" and very slow
> > > runs if more than about five cutoffs are attempted.
> > >
> > > Sample output on NERSC Cori KNL with 32 nodes. Input system size
> > > is 248,101 atoms.
> > >
> > > step 0
> > > step 100, remaining wall clock time: 24 s
> > > step 140: timed with pme grid 128 128 128, coulomb cutoff 1.200: 66.2 M-cycles
> > > step 210: timed with pme grid 112 112 112, coulomb cutoff 1.336: 69.6 M-cycles
> > > step 280: timed with pme grid 100 100 100, coulomb cutoff 1.496: 63.6 M-cycles
> > > step 350: timed with pme grid 84 84 84, coulomb cutoff 1.781: 85.9 M-cycles
> > > step 420: timed with pme grid 96 96 96, coulomb cutoff 1.559: 68.8 M-cycles
> > > step 490: timed with pme grid 100 100 100, coulomb cutoff 1.496: 68.3 M-cycles
> > > libhugetlbfs [nid08887:140420]: WARNING: New heap segment map at 0x10001200000 failed: Cannot allocate memory
> > > libhugetlbfs [nid08881:97968]: WARNING: New heap segment map at 0x10001200000 failed: Cannot allocate memory
> > > libhugetlbfs [nid08881:97978]: WARNING: New heap segment map at 0x10001200000 failed: Cannot allocate memory
> > >
> > > Szilárd, to answer your questions: this is the Verlet scheme.
> > > The problem happens during tuning, and no problems occur if
> > > -notunepme is used. In fact, the best performance thus far has
> > > been with 50% PME nodes, using huge pages, and '-notunepme'.
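> > >
> > > A sketch of that kind of invocation (the launcher, rank counts,
> > > and input name are illustrative, not my exact job script):
> > >
> > > # half of the MPI ranks dedicated to PME, PME tuning disabled
> > > $ srun -n 128 gmx_mpi mdrun -npme 64 -notunepme -s topol.tpr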
> > >
> > > John
> > >
> > > On Wed, Sep 13, 2017 at 6:20 AM, Szilárd Páll
> > > <pall.szilard@gmail.com> wrote:
> > >>
> > >> Forking the discussion, as now we've learned more about the
> > >> issue Åke is reporting and it is rather dissimilar.
> > >>
> > >> On Mon, Sep 11, 2017 at 8:09 PM, John Eblen <jeblen@acm.org> wrote:
> > >> > Hi Szilárd
> > >> >
> > >> > No, I'm not using the group scheme.
> > >>
> > >> $ grep -i 'cutoff-scheme' md.log
> > >> cutoff-scheme = Verlet
> > >>
> > >> > The problem seems similar because:
> > >> >
> > >> > 1) Deadlocks and very slow runs can be hard to distinguish.
> > >> > 2) Since Mark mentioned it, I assume he believes PME tuning is
> > >> > a possible cause, which is also the cause in my situation.
> > >>
> > >> Does that mean you tested with "-notunepme" and the excessive
> > >> memory usage could not be reproduced? Did the memory usage
> > >> increase only during the tuning or did it keep increasing after
> > >> the tuning completed?
> > >>
> > >> > 3) Åke may be experiencing higher-than-normal memory usage as
> > >> > far as I know. Not sure how you know otherwise.
> > >> > 4) By "successful," I assume you mean the tuning had
> > >> > completed. That doesn't mean, though, that the tuning could
> > >> > not be creating conditions that cause the problem, like an
> > >> > excessively high cutoff.
> > >>
> > >> Sure. However, it's unlikely that the tuning creates conditions
> > >> under which the run proceeds after the initial tuning phase and
> > >> keeps allocating memory (which is more prone to be the source
> > >> of issues).
> > >>
> > >> I suggest first ruling out the bug I linked; if that's not the
> > >> culprit, we can have a closer look.
> > >>
> > >> Cheers,
> > >> --
> > >> Szilárd
> > >>
> > >> >
> > >> > John
> > >> >
> > >> > On Mon, Sep 11, 2017 at 1:09 PM, Szilárd Páll
> > >> > <pall.szilard@gmail.com> wrote:
> > >> >> John,
> > >> >>
> > >> >> In what way do you think your problem is similar? Åke seems
> > >> >> to be experiencing a deadlock after successful PME tuning,
> > >> >> much later during the run, but no excessive memory usage.
> > >> >>
> > >> >> Do you happen to be using the group scheme with 2016.x
> > >> >> (release code)?
> > >> >>
> > >> >> Your issue sounds more like it could be related to the
> > >> >> excessive tuning bug with the group scheme that was fixed
> > >> >> quite a few months ago but is yet to be released
> > >> >> (https://redmine.gromacs.org/issues/2200).
> > >> >>
> > >> >> Cheers,
> > >> >> --
> > >> >> Szilárd
> > >> >>
> > >> >> On Mon, Sep 11, 2017 at 6:50 PM, John Eblen
> > >> >> <jeblen@acm.org> wrote:
> > >> >> > Hi
> > >> >> >
> > >> >> > I'm having a similar problem that is related to PME
> > >> >> > tuning. When it is enabled, GROMACS often, but not always,
> > >> >> > slows to a crawl and uses excessive amounts of memory.
> > >> >> > Using "huge pages" and setting a high number of PME
> > >> >> > processes seem to exacerbate the problem.
> > >> >> >
> > >> >> > Also, occurrences of this problem seem to correlate with
> > >> >> > how high the tuning raises the cutoff value.
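> > >> >> >
> > >> >> > Resident memory of a running rank can be watched with
> > >> >> > something like this (the pid is a placeholder):
> > >> >> >
> > >> >> > $ grep VmRSS /proc/<mdrun-pid>/status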
> > >> >> >
> > >> >> > Mark, can you give us more information on the problems
> > >> >> > with PME tuning? Is there a redmine?
> > >> >> >
> > >> >> > Thanks
> > >> >> > John
> > >> >> >
> > >> >> > On Mon, Sep 11, 2017 at 10:53 AM, Mark Abraham
> > >> >> > <mark.j.abraham@gmail.com> wrote:
> > >> >> >> Hi,
> > >> >> >>
> > >> >> >> Thanks. Was PME tuning active? Does it reproduce if that
> > >> >> >> is disabled? Is the PME tuning still active? How many
> > >> >> >> steps have taken place (at least as reported in the log
> > >> >> >> file but ideally from processes)?
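> > >> >> >>
> > >> >> >> Something like this against the log should show how far
> > >> >> >> the tuning got (md.log standing in for whatever log the
> > >> >> >> run wrote):
> > >> >> >>
> > >> >> >> $ grep 'timed with pme grid' md.log | tail -3
> > >> >> >> $ tail -2 md.log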
> > >> >> >>
> > >> >> >> Mark
> > >> >> >>
> > >> >> >> On Mon, Sep 11, 2017 at 4:42 PM Åke Sandgren
> > >> >> >> <ake.sandgren@hpc2n.umu.se> wrote:
> > >> >> >>> My debugger run finally got to the lockup.
> > >> >> >>>
> > >> >> >>> All processes are waiting on various MPI operations.
> > >> >> >>>
> > >> >> >>> Attached is a stack dump of all 56 tasks.
> > >> >> >>>
> > >> >> >>> I'll keep the debug session running for a while in case
> > >> >> >>> anyone wants some more detailed data. This is a
> > >> >> >>> RelWithDebInfo build, though, so not everything is
> > >> >> >>> available.
> > >> >> >>>
> > >> >> >>> On 09/08/2017 11:28 AM, Berk Hess wrote:
> > >> >> >>> > But you should be able to get some (limited)
> > >> >> >>> > information by attaching a debugger to an already
> > >> >> >>> > running process with a release build.
> > >> >> >>> >
> > >> >> >>> > If you plan on compiling and running a new case, use a
> > >> >> >>> > release + debug symbols build. That should run as fast
> > >> >> >>> > as a release build.
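> > >> >> >>> >
> > >> >> >>> > A sketch of both, assuming gdb and a CMake build tree
> > >> >> >>> > (the PID and source path are placeholders):
> > >> >> >>> >
> > >> >> >>> > # attach to a running rank and dump all thread stacks
> > >> >> >>> > $ gdb -p <mdrun-pid> -batch -ex 'thread apply all bt'
> > >> >> >>> >
> > >> >> >>> > # configure a release build with debug symbols
> > >> >> >>> > $ cmake <gromacs-source> -DCMAKE_BUILD_TYPE=RelWithDebInfo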
> > >> >> >>> >
> > >> >> >>> > Cheers,
> > >> >> >>> >
> > >> >> >>> > Berk
> > >> >> >>> >
> > >> >> >>> > On 2017-09-08 11:23, Åke Sandgren wrote:
> > >> >> >>> >> We have at least one case that, when run over 2
> > >> >> >>> >> nodes or more, quite often (always) hangs, i.e. there
> > >> >> >>> >> is no more output in md.log or otherwise while mdrun
> > >> >> >>> >> still consumes cpu time. It takes a random time
> > >> >> >>> >> before it happens, like 1-3 days.
> > >> >> >>> >>
> > >> >> >>> >> The case can be shared if someone else wants to
> > >> >> >>> >> investigate. I'm planning to run it in the debugger
> > >> >> >>> >> to be able to break and look at states when it
> > >> >> >>> >> happens, but since it takes so long with the
> > >> >> >>> >> production build it is not something I'm looking
> > >> >> >>> >> forward to.
> > >> >> >>> >>
> > >> >> >>> >> On 09/08/2017 11:13 AM, Berk Hess wrote:
> > >> >> >>> >>> Hi,
> > >> >> >>> >>>
> > >> >> >>> >>> We are far behind schedule for the 2017 release. We
> > >> >> >>> >>> are working hard on it, but I don't think we can
> > >> >> >>> >>> promise a date yet.
> > >> >> >>> >>>
> > >> >> >>> >>> We have a 2016.4 release planned for this week
> > >> >> >>> >>> (might slip to next week). But if you can give us
> > >> >> >>> >>> enough details to track down your hanging issue, we
> > >> >> >>> >>> might be able to fix it in 2016.4.
> > >> >> >>>
> > >> >> >>> --
> > >> >> >>> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
> > >> >> >>> Internet: ake@hpc2n.umu.se  Phone: +46 90 7866134  Fax: +46 90-580 14
> > >> >> >>> Mobile: +46 70 7716134  WWW: http://www.hpc2n.umu.se