<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#ffffff">
On 4/06/2011 8:26 AM, Dimitar Pachov wrote:
<blockquote
cite="mid:BANLkTi=UT2m3eyNyivOTq9Hb8pe-k2MRfQ@mail.gmail.com"
type="cite">
<div><br>
</div>
<div>At first, I thought the -append option of the mdrun command
was great. However, I don't think it is anymore and have
actually started questioning myself why it exists at the first
place, and second, why has it become the default option in the
newest versions?<br clear="all">
</div>
</blockquote>
<br>
It exists because it used to be a pain to manage your simulation
file numbering.<br>
<br>
<blockquote
cite="mid:BANLkTi=UT2m3eyNyivOTq9Hb8pe-k2MRfQ@mail.gmail.com"
type="cite">
<div>It is useless unless you run your simulations in a 100% safe
from any unexpected problems (hardware, restarts, etc) mode,
which is never the case. It is beyond me how such an option can
become the default and how a statement like this:</div>
<div><br>
</div>
<div>"By default the output will be appending to the existing
output files. The checkpoint file contains checksums of all
output files, such that <strong>you will never loose data when
some output files are modified, corrupt or removed.</strong>"</div>
<div><br>
</div>
<div>can be claimed without testing ALL of the scenarios that can
lead to problems, that is, lost data.</div>
</blockquote>
<br>
The checkpoint file records the position of the output file pointers
at the time of the checkpoint, along with an MD5 checksum. Upon
restarting with -append, mdrun seeks to that file pointer position
and verifies the checksum, issuing a fatal error if either step
fails. So if the checkpoint and the other files are not altered or
removed after a crash, then the method seems pretty safe to me.<br>
<br>
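To make that concrete, here is a minimal Python sketch of the idea. It
is not the GROMACS code: the function names, the AppendError class and
the traj.out file name are all made up for illustration, and it
assumes the checksum covers every byte before the recorded offset,
which may not match what mdrun actually does.<br>
<pre>
```python
import hashlib
import os
import tempfile


class AppendError(RuntimeError):
    """Raised when an output file no longer matches the checkpoint."""


def record_checkpoint(path):
    """Return (offset, md5) describing the current state of one output file."""
    with open(path, "rb") as handle:
        data = handle.read()
    return len(data), hashlib.md5(data).hexdigest()


def restart_with_append(path, offset, checksum):
    """Verify the file against the checkpoint, then cut it back to offset."""
    if not os.path.exists(path):
        raise AppendError("output file is missing: " + path)
    with open(path, "r+b") as handle:
        head = handle.read(offset)
        # A short read means the file is smaller than it was at the checkpoint.
        if len(head) != offset:
            raise AppendError("output file is shorter than the checkpoint says")
        if hashlib.md5(head).hexdigest() != checksum:
            raise AppendError("checksum mismatch, refusing to append")
        # Discard everything written after the checkpoint.
        handle.truncate(offset)


# Hypothetical run: one checkpoint, then two crashes in a row.
workdir = tempfile.mkdtemp()
traj = os.path.join(workdir, "traj.out")
with open(traj, "wb") as handle:
    handle.write(b"frames up to step 1003\n")
offset, checksum = record_checkpoint(traj)

# Each crash leaves some data that was written after the checkpoint.
for leftover in (b"frames up to step 1106\n", b"frames up to step 1059\n"):
    with open(traj, "ab") as handle:
        handle.write(leftover)
    restart_with_append(traj, offset, checksum)

with open(traj, "rb") as handle:
    print(handle.read())
```
</pre>
Here both simulated crashes leave extra data after the checkpoint, and
each restart cuts the file back to the recorded offset, so the file
should end up holding only what was there at the checkpoint. If the
bytes before the offset had changed, the sketch would raise an error
instead of appending.<br>
<br>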
The above text claims you are safe even if you remove files;
that's an overstatement. However, I can't see that removing a
non-checkpoint file could lead to loss of useful data from other
non-checkpoint files.<br>
<br>
<blockquote
cite="mid:BANLkTi=UT2m3eyNyivOTq9Hb8pe-k2MRfQ@mail.gmail.com"
type="cite">
<div>If one uses that option and the run is restarted and is again
restarted before reaching the point of attempting to write a
file, then things are lost,</div>
</blockquote>
<br>
If this is true, then it wants fixing, and fast, and will get it :-)
However, this feature has been in the code for a year now, and while
some minor issues have been fixed since the 4.5 release, it would
surprise me greatly if such a problem existed and had not been
reported before now.<br>
<br>
You're saying the equivalent of the steps below can occur:<br>
1. Simulation wanders along normally and writes a checkpoint at step
1003<br>
2. Random crash happens at step 1106<br>
3. An -append restart from the old .tpr and the recent .cpt file
will restart from step 1003<br>
4. Random crash happens at step 1059<br>
5. Now a restart doesn't resume from step 1003, but from some other step<br>
<br>
<blockquote
cite="mid:BANLkTi=UT2m3eyNyivOTq9Hb8pe-k2MRfQ@mail.gmail.com"
type="cite">
<div> and most importantly, the most important piece of data, that
being the trajectory file, could be completely lost! I don't
know the code behind the checkpointing &amp; appending, but I
can see how easy one can overwrite 100ns trajectories, for
example, and "obtain" the same trajectories of size .... 0. <br>
</div>
</blockquote>
<br>
Without a concrete example that rules out user error, I don't see
how that could happen so easily.<br>
<blockquote
cite="mid:BANLkTi=UT2m3eyNyivOTq9Hb8pe-k2MRfQ@mail.gmail.com"
type="cite">
<div>Using the checkpoint capability &amp; appending make sense
when many restarts are expected, but unfortunately it is exactly
then when these options completely fail! As a new user of
Gromacs, I must say I am disappointed, and would like to obtain
an explanation of why the usage of these options is clearly
stated to be safe when it is not, and why the append option is
the default, and why at least a single warning has not been
posted anywhere in the docs &amp; manuals?</div>
</blockquote>
<br>
I can understand and sympathize with your frustration if you've
experienced the loss of a simulation. Do be careful when suggesting
that others' actions are blame-worthy, however. The developers all
act in good faith on a largely volunteer basis. Errors in coding do
happen, and they do get attention as developers' time permits.
However, developers' time rarely permits addressing "feature X
doesn't work, why not?" in a productive way. Bugs can be hard to
solve, but they get solved more easily (and faster!) if the user who
thinks a problem exists follows good procedure. See <a
href="http://www.chiark.greenend.org.uk/%7Esgtatham/bugs.html">http://www.chiark.greenend.org.uk/~sgtatham/bugs.html</a><br>
<br>
Mark<br>
</body>
</html>