<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
<title></title>
</head>
<body text="#000000" bgcolor="#ffffff">
On 5/06/2011 12:31 PM, Dimitar Pachov wrote:
<blockquote
cite="mid:BANLkTi=Pb8EZZjfSmE=zPJo0Oi86EAKDUQ@mail.gmail.com"
type="cite"><br>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff"> This script is not
using mdrun -append. </div>
</blockquote>
<div><br>
</div>
<div>-append is the default, it doesn't need to
be explicitly listed. <br>
</div>
</div>
</blockquote>
<br>
Ah yes, very true.<br>
<blockquote
cite="mid:BANLkTi=Pb8EZZjfSmE=zPJo0Oi86EAKDUQ@mail.gmail.com"
type="cite">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff">Your original post
suggested the use of -append was a problem. Why aren't we
seeing a script with mdrun -append? Also, please provide the
full script - it looks like there might be a loop around
your tpbconv-then-mdrun fragment.<br>
</div>
</blockquote>
<div><br>
</div>
<div>There is no loop; this is a job script with PBS directives.
The header of it looks like:</div>
<div>===========================</div>
<div>
<div><font class="Apple-style-span" size="1">#!/bin/bash</font></div>
<div><font class="Apple-style-span" size="1">#$ -S /bin/bash</font></div>
<div><font class="Apple-style-span" size="1">#$ -pe mpich 8</font></div>
<div> <font class="Apple-style-span" size="1">#$ -ckpt reloc</font></div>
<div><font class="Apple-style-span" size="1">#$ -l
mem_total=6G</font></div>
</div>
<div>===========================</div>
<div><br>
</div>
<div>as usual submitted by:</div>
<div><br>
</div>
<div><font class="Apple-style-span" size="1">qsub -N aaaa
myjob.q</font></div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff"> <br>
Note that a useful trouble-shooting technique can be to
construct your command line in a shell variable, echo it to
stdout (redirected as suitable) and then execute the
contents of the variable. Now, nobody has to parse a shell
script to know what command line generated what output, and
it can be co-located with the command's stdout.</div>
</blockquote>
<div><br>
</div>
        <div>I somewhat understand your point, but could you give an example
          if you think it is really necessary?</div>
</div>
</blockquote>
<br>
It's just generally helpful if your stdout has "mpirun -np 8
/path/to/mdrun_mpi -deffnm run_4 -cpi run_4" at the top of it so
that you have a definitive record of what you did under the
environment that existed at the time of execution.<br>
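    <br>
    As a minimal sketch of that idea (the helper name run_logged is
    hypothetical, and the mpirun line is only the example from above): a
    small shell function can echo the command line, with a timestamp and
    the node name, and then execute exactly those arguments.<br>
<pre>
```shell
# Hypothetical helper: record a command line on stdout, then run it.
run_logged() {
    cmd="$*"                             # the command line as a single string
    echo "[$(date)] $(uname -n): $cmd"   # when and where it was started
    "$@"                                 # execute the original arguments unchanged
}

# Intended use in the job script (path and file names as in the example above):
# run_logged mpirun -np 8 /path/to/mdrun_mpi -deffnm run_4 -cpi run_4
```
</pre>
    Since the echo goes to the job's stdout, each restart's output file
    would begin with the command that produced it.<br>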
<br>
<blockquote
cite="mid:BANLkTi=Pb8EZZjfSmE=zPJo0Oi86EAKDUQ@mail.gmail.com"
type="cite">
<div class="gmail_quote">
<div>As I said, the queue is like this: you submit the job, it
finds an empty node, it goes there, however seconds later
another user with higher privileges on that particular node
submits a job, his job kicks out my job, mine goes on the
queue again, it finds another empty node, goes there, then
another user with high privileges on that node submits a job,
which consequently kicks out my job again, and the cycle
repeats itself ... theoretically, it could continue forever,
depending on how many and where the empty nodes are, if any.</div>
</div>
</blockquote>
<br>
You've said that *now* - but previously you've said nothing about
why you were getting lots of restarts. In my experience, PBS queues
suspend jobs rather than deleting them, in order that resources are
    not wasted. Apparently other places do things differently. I think that
this information is highly relevant to explaining your observations.<br>
<br>
<blockquote
cite="mid:BANLkTi=Pb8EZZjfSmE=zPJo0Oi86EAKDUQ@mail.gmail.com"
type="cite">
<div class="gmail_quote">
<div>These many restarts suggest that the queue was full with
        relatively short jobs run by users with high privileges.
Technically, I cannot see why the same processes should be
running simultaneously because at any instant my job runs only
on one node, or it stays in the queuing list. <br>
</div>
</div>
</blockquote>
<br>
I/O can be buffered such that the termination of the process and the
completion of its I/O are asynchronous. Perhaps it *shouldn't* be
that way, but this is a problem for the administrators of your
cluster to address. They know how the file system works. If the next
job executes before the old one has finished output, then I think
the symptoms you observe might be possible.<br>
<br>
Note that there is nothing GROMACS can do about that, unless somehow
GROMACS can apply a lock in the first mdrun that is respected by
your file system such that a subsequent mdrun cannot open the same
file until all pending I/O has completed. I'd expect proper HPC file
systems do that automatically, but I don't really know.<br>
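    <br>
    For what it's worth, a job script can try something similar from the
    outside. The following is only a sketch, not anything GROMACS does
    itself: it assumes the flock utility from util-linux is installed on
    the nodes, the function name run_exclusive and the lock file name are
    made up, and whether the lock is honoured across nodes depends
    entirely on your file system.<br>
<pre>
```shell
# Hypothetical helper: run a command only while holding an exclusive lock.
# With -n, flock gives up immediately (non-zero exit status) if another
# process already holds the lock, instead of waiting for it.
run_exclusive() {
    lockfile=$1
    shift
    flock -n "$lockfile" "$@"
}

# Intended use in the job script (names are placeholders):
# run_exclusive run_4.lock mpirun -np 8 /path/to/mdrun_mpi -deffnm run_4 -cpi run_4
```
</pre>
    The lock goes away when the locked process exits, but I would not
    assume that this guarantees its buffered output has reached the disk,
    so at best this narrows the window rather than closing it.<br>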
<br>
<blockquote
cite="mid:BANLkTi=Pb8EZZjfSmE=zPJo0Oi86EAKDUQ@mail.gmail.com"
type="cite">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff">
<div>
<div class="h5">
<blockquote type="cite">
<div><br>
</div>
<div style="border-collapse: collapse; font-family:
arial,sans-serif; font-size: 13px;">
<div>From md-1-2360.out:</div>
<div>=====================================</div>
<div>
<div><font size="1">:::::::</font></div>
</div>
</div>
<div style="border-collapse: collapse; font-family:
arial,sans-serif; font-size: 13px;">
<div><font size="1">Getting Loaded...</font></div>
<div><font size="1">Reading file run1.tpr, VERSION
4.5.4 (single precision)</font></div>
<div><font size="1"><br>
</font></div>
<div><font size="1">Reading checkpoint file run1.cpt
generated: Tue May 31 10:45:22 2011</font></div>
<div><font size="1"><br>
</font></div>
<div><font size="1"><br>
</font></div>
<div><font size="1">Loaded with Money</font></div>
<div><font size="1"><br>
</font></div>
<div><font size="1">Making 2D domain decomposition 4
x 2 x 1</font></div>
<div><font size="1"><br>
</font></div>
<div><font size="1">WARNING: This run will generate
roughly 4915 Mb of data</font></div>
<div><font size="1"><br>
</font></div>
<div><font size="1">starting mdrun 'run1'</font></div>
<div><font size="1">100000000 steps, 200000.0 ps
(continuing from step 51879590, 103759.2 ps).</font></div>
</div>
<div style="border-collapse: collapse; font-family:
arial,sans-serif; font-size: 13px;">=====================================</div>
</blockquote>
<br>
</div>
</div>
These aren't showing anything other than that the restart is
coming from the same point each time.
<div class="im"><br>
<br>
<blockquote type="cite"><span style="border-collapse:
collapse; font-family: arial,sans-serif; font-size:
13px;">
<div>And from the last generated output md-1-2437.out
(I think I killed the job at that point because of
the above observed behavior):</div>
<div>=====================================</div>
<div>
<div><font size="1">:::::::</font></div>
</div>
<div><font size="1">
<div>Getting Loaded...</div>
<div>Reading file run1.tpr, VERSION 4.5.4 (single
precision)</div>
</font></div>
<div><font size="1"><span style="font-size: small;">=====================================</span></font></div>
<div><br>
</div>
<div>I have at least 5-6 additional examples like this
one. In some of them the *xtc file does have size
greater than zero yet still very small, but it
starts from some random frame (for example, in one
of the cases it contains frames from ~91000ps to
~104000ps, but all frames before 91000ps are
missing).</div>
</span></blockquote>
<br>
</div>
I think that demonstrating a problem requires that the set
of output files were fine before one particular restart, and
weird afterwards. I don't think we've seen that yet.
<div class="im"><br>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>I don't understand your point here. I am providing you with
all info I have. I am showing the output files of 3 restarts,
          and they are different in the sense that the last two did not
          progress far enough before another job restart occurred.
The first was fine before the restart, and the others were not
exactly fine after the restart. At this point I realize that
what I call "restart" and what you call "restart" might be two
          different things. That may be where the problem
          lies. </div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff">
<div class="im"> <br>
<blockquote type="cite"><span style="border-collapse:
collapse; font-family: arial,sans-serif; font-size:
13px;">
<div>I realize there might be another problem, but the
bottom line is that there is no mechanism that can
prevent this from happening if many restarts are
required, and particularly if the timing between
these restarts is prone to be small (distributed
computing could easily satisfy this condition).</div>
<div><br>
</div>
<div>Any suggestions, particularly related to the
minimum resistance path to regenerate the missing
data? :)</div>
<div style="color: rgb(80, 0, 80);">
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin: 0px
0px 0px 0.8ex; border-left: 1px solid rgb(204,
204, 204); padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff">
<div><br>
<blockquote type="cite">
                            <div>Using the checkpoint capability &amp;
appending make sense when many restarts
are expected, but unfortunately it is
exactly then when these options completely
fail! As a new user of Gromacs, I must say
I am disappointed, and would like to
obtain an explanation of why the usage of
these options is clearly stated to be safe
when it is not, and why the append option
is the default, and why at least a single
warning has not been posted anywhere in
                              the docs &amp; manuals?</div>
</blockquote>
<br>
</div>
I can understand and sympathize with your
frustration if you've experienced the loss of a
simulation. Do be careful when suggesting that
others' actions are blame-worthy, however.</div>
</blockquote>
<div><br>
</div>
</div>
<div>I have never suggested this. As a user, I am
entitled to ask. </div>
</span></blockquote>
<br>
</div>
Sure. However, talking about something that can "completely
fail"</div>
</blockquote>
<div><br>
</div>
        <div>This is a fact, backed up by my evidence => I don't see
anything bad directed to anybody. </div>
<div> </div>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff"> which makes you
"disappointed"</div>
</blockquote>
<div><br>
</div>
<div>This is me being honest => again not related to anybody
else. </div>
<div> </div>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff"> and wanting to "obtain
an explanation"</div>
</blockquote>
<div><br>
</div>
        <div>Well, this is even funny :) - many people want this,
especially in science. Is that bad?</div>
<div> </div>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff"> about why something
doesn't work as stated and lacks "a single warning"</div>
</blockquote>
<div><br>
</div>
<div>Again a fact => again nothing bad here.</div>
<div> </div>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff"> suggests that someone
has done something less than appropriate</div>
</blockquote>
<div><br>
</div>
<div>This is a completely personal interpretation, and I am
          personally not responsible for how people perceive information.
          For a reason unknown to me, you moved into a very defensive mode.
What could I do? </div>
<div> </div>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff">, and so blame-worthy.
It also assumes that the actions of a new user were correct,
and the actions of a developer with long experience were
not. </div>
</blockquote>
<div><br>
</div>
<div>Sorry, this is too much. Where was this suggested? It seems
to me you took it too personally. </div>
<div> </div>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff">This may or may not
prove to be true. Starting such a discussion from a
conciliatory (rather than antagonistic) stance is usually
more productive. The shared objective should be to fix the
problem, not prove that someone did something wrong.<br>
</div>
</blockquote>
<div><br>
</div>
<div>Agree, and I did it. Again, your perception does not seem
to be correlated with my intended approach. <br>
</div>
</div>
</blockquote>
<br>
Words are open to interpretation. Communicating well requires that
you consider the impact of your words on your reader. You want
people who can address the problem to want to help. You don't want
them to feel defensive about the situation - whether you think that
would be an over-reaction or not.<br>
<br>
<blockquote
cite="mid:BANLkTi=Pb8EZZjfSmE=zPJo0Oi86EAKDUQ@mail.gmail.com"
type="cite">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff"> <br>
An alternative way of wording your paragraph could have
been:<br>
"<span style="border-collapse: collapse; font-family:
arial,sans-serif; font-size: 13px;">Using the checkpoint
            capability &amp; appending make sense when many restarts
are expected, however I observe that under such
circumstances this capability can fail. I am a new user of
GROMACS, might I have been using them incorrectly? Are the
developers aware of any situations under which the
capability is unreliable? If so, should the default
behaviour be different, and should this issue be
documented somewhere?"</span></div>
</blockquote>
<div><br>
</div>
<div>This is helpful, but again a bit too much. I don't tell you
how to write, please do the same. <br>
</div>
</div>
</blockquote>
<br>
    OK, but the tone of how you write determines whether people will
    respond, no matter how important your message.<br>
<br>
<blockquote
cite="mid:BANLkTi=Pb8EZZjfSmE=zPJo0Oi86EAKDUQ@mail.gmail.com"
type="cite">
<div class="gmail_quote">
<div>Moreover, how could I ask questions the answers to which
were mostly known to me before sending my post? <br>
</div>
</div>
</blockquote>
<br>
These are the same ideas about which you asked in your original
post.<br>
<br>
<blockquote
cite="mid:BANLkTi=Pb8EZZjfSmE=zPJo0Oi86EAKDUQ@mail.gmail.com"
type="cite">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff">
<div class="im">
<blockquote type="cite"><span style="border-collapse:
collapse; font-family: arial,sans-serif; font-size:
13px;">
<div>And since my questions were not clearly answered,
I will repeat them in a structured way:</div>
<div><br>
</div>
<div>1. Why is the usage of these options (-cpi and
-append) clearly stated to be safe when in fact it
is not?</div>
</span></blockquote>
<br>
</div>
Because they are believed to be safe. Jussi's suggestion
about file locking may have merit.
<div class="im"><br>
<br>
<blockquote type="cite"><span style="border-collapse:
collapse; font-family: arial,sans-serif; font-size:
13px;">
<div>2. Why have you made the -append option the
default in the most current GMX versions?</div>
</span></blockquote>
<br>
</div>
Because it's the most convenient mode of operation.</div>
</blockquote>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff">
<div class="im"> <br>
<br>
<blockquote type="cite"><span style="border-collapse:
collapse; font-family: arial,sans-serif; font-size:
13px;">
<div>3. Why has not a single warning been posted
                    anywhere in the docs &amp; manuals? (this question
is somewhat clear - because you did not know about
such a problem, but people say "<span
style="font-family: sans-serif; font-size: 13px;
line-height: 20px;">ignorance of the law excuses
                      no one</span>", which means that omitting a
                    warning about something you were not 100% certain
                    would be error-free cannot be excused)</div>
</span></blockquote>
<br>
</div>
Because no-one is aware of a problem to warn about.</div>
</blockquote>
<div><br>
</div>
<div>No, people are aware, they just do not think it is a
problem, because there is an easy work-around (-noappend),
although not as convenient and clean. Ask users of the Condor
distributed grid using Gromacs.</div>
</div>
</blockquote>
<br>
You asked why there was no warning in the documentation - that's
because no-one who can fix the documentation is aware of a problem.
If Condor-using people want to keep using a work-around and not
communicate, that's their prerogative. But if the issue isn't
communicated, then it isn't going to be documented, whether it's a
real issue or not.<br>
<br>
<blockquote
cite="mid:BANLkTi=Pb8EZZjfSmE=zPJo0Oi86EAKDUQ@mail.gmail.com"
type="cite">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff">
<div class="im">
<blockquote type="cite"><span style="border-collapse:
collapse; font-family: arial,sans-serif; font-size:
13px;">
<div>I am blame-worthy - for blindly believing what
was written in the manual without taking the
necessary precautions. Lesson learned. </div>
<div style="color: rgb(80, 0, 80);">
<div> </div>
</div>
<div style="color: rgb(80, 0, 80);">
<blockquote class="gmail_quote" style="margin: 0px
0px 0px 0.8ex; border-left: 1px solid rgb(204,
204, 204); padding-left: 1ex;">
<div text="#000000" bgcolor="#ffffff">However,
developers' time rarely permits addressing
"feature X doesn't work, why not?" in a
productive way. Solving bugs can be hard, but
will be easier (and solved faster!) if the user
who thinks a problem exists follows good
procedure. See <a moz-do-not-send="true"
href="http://www.chiark.greenend.org.uk/%7Esgtatham/bugs.html"
style="color: rgb(0, 0, 204);" target="_blank">http://www.chiark.greenend.org.uk/~sgtatham/bugs.html</a><br>
<br>
</div>
</blockquote>
<div><br>
</div>
</div>
<div>Implying that I did not follow a certain
procedure related to a certain problem without you
knowing what my initial intention was is just a
speculation. <br>
</div>
</span></blockquote>
<br>
</div>
            I don't follow your point. If your intent is to get the
            problem fixed, the advice on that web page is useful.</div>
</blockquote>
<div><br>
</div>
        <div>My intent was clearly stated before, but for the sake of
clarification, let's repeat it again:</div>
<div><br>
</div>
<div>
<div> 1. To let you know about the existence of such a
problem.</div>
</div>
</div>
</blockquote>
<br>
Great. So far, I think it's an artefact of the combination of your
PBS and file system configuration.<br>
<br>
<blockquote
cite="mid:BANLkTi=Pb8EZZjfSmE=zPJo0Oi86EAKDUQ@mail.gmail.com"
type="cite">
<div class="gmail_quote">
<div>
<div>2. To find out why I encountered the problem, although I
            have read and followed all of the Gromacs documentation
            related to the features I used. <br>
</div>
</div>
</div>
</blockquote>
<br>
As above - it's not really the fault of GROMACS. I don't know if a
better solution exists.<br>
<br>
<blockquote
cite="mid:BANLkTi=Pb8EZZjfSmE=zPJo0Oi86EAKDUQ@mail.gmail.com"
type="cite">
<div class="gmail_quote">
<div> </div>
<div>3. To somewhat improve the way the documentation is
written. <br>
</div>
</div>
</blockquote>
<br>
OK, I will add a short note to mdrun -h noting that there exist
execution environments where timing of file access across separate
GROMACS processes might be a problem.<br>
<br>
Mark<br>
<br>
</body>
</html>