[gmx-users] chiller failure leads to truncated .cpt and _prev.cpt files using gromacs 4.6.1

Christopher Neale chris.neale at mail.utoronto.ca
Wed Mar 27 04:13:57 CET 2013


Dear Matthew:

Thank you for noticing the file size. This is a very good lead. 
I had not noticed that this was special. Indeed, here is the complete listing for truncated/corrupt .cpt files:

-rw-r----- 1 cneale cneale 1048576 Mar 26 18:53 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:54 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:54 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:54 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:50 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:50 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:50 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:51 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:51 md3.cpt
-rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt
-rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt
-rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt
-rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt

I will contact my sysadmins and let them know about your suggestions.

Nevertheless, I respectfully reject the idea that there is really nothing that can be done about this inside
gromacs. About 6 years ago, I worked on a cluster with massive sporadic NFS delays. The only way to
automate runs on that machine was, for example, to use sed to create a .mdp from a template .mdp file that
had ;;;EOF as its last line, and then to poll the created .mdp file for ;;;EOF until it appeared before running
grompp (at the time I was using mdrun -sort and desorting with an in-house script prior to domain
decomposition, so I had to stop and restart gromacs every couple of hours). This is not to say that such
workarounds are ideal, but I think gromacs would be all the better if it could avoid problems like this
regardless of the cluster setup.
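In case it is useful to anyone else, here is a minimal sketch of that polling idea in python (the filenames
template.mdp/run.mdp and the substitution keys are made up for the example; at the time it was really a
sed one-liner in a shell script):

import subprocess
import time

SENTINEL = ";;;EOF"

def write_mdp_from_template(template="template.mdp", out="run.mdp", subs=None):
    # Render the template (a plain replace loop stands in for sed); the
    # template itself already ends with the ;;;EOF sentinel line.
    with open(template) as src:
        text = src.read()
    for key, value in (subs or {}).items():
        text = text.replace(key, value)
    with open(out, "w") as dst:
        dst.write(text)

def wait_for_sentinel(path="run.mdp", timeout=600, interval=5):
    # Poll until the last line of the file is the sentinel, i.e. the slow
    # NFS mount has actually made the complete file visible on this node.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(path) as fh:
                lines = fh.read().splitlines()
            if lines and lines[-1].strip() == SENTINEL:
                return True
        except FileNotFoundError:
            pass                          # file not visible here yet
        time.sleep(interval)
    return False

write_mdp_from_template(subs={"@NSTEPS@": "500000"})
if wait_for_sentinel():
    subprocess.run(["grompp", "-f", "run.mdp"], check=True)  # only now is it safe
else:
    raise SystemExit("run.mdp never became complete; aborting")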

Please note that, over the years, I have seen this on 4 different clusters (albeit with different versions of
gromacs), which is to say that it is not just one setup that is to blame.

Matthew, please don't take my comments the wrong way. I deeply appreciate your help. I just want to put it
out there that I believe gromacs would be better if it never overwrote good .cpt files with truncated/corrupt
.cpt files, even if the cluster catches on fire or the earth's magnetic field reverses, etc.
Also, I suspect that sysadmins don't have a lot of time to test their clusters for a graceful exit under
chiller-failure conditions, so a super-careful regime of .cpt updates will always be useful (see the sketch
below).
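To be concrete about what I mean by a super-careful regime, here is a minimal sketch (mine, not how mdrun
actually does it; the filenames and the helper are purely illustrative): write the new checkpoint to a temporary
file, fsync it, and only then rename it into place, so that at any instant either a complete old checkpoint or a
complete new checkpoint exists on disk. A broken filesystem or an unsynced writeback cache can of course
still defeat this, but an ordinary SIGKILL cannot leave behind a truncated .cpt.

import os

def write_checkpoint_atomically(data, path="md3.cpt"):
    # Hypothetical helper, not GROMACS code: either the complete old file
    # or the complete new file survives a crash at any point.
    tmp = path + ".tmp"
    with open(tmp, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())            # push the bytes out of the page cache
    prev = path[:-len(".cpt")] + "_prev.cpt"
    if os.path.exists(path):
        os.replace(path, prev)           # atomic rename, never a copy
    os.replace(tmp, path)                # atomic on POSIX filesystems
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)                  # make the renames themselves durable
    finally:
        os.close(dirfd)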

Thank you again for your help. I'll take it to my sysadmins, who are very good and may be able to remedy
this on their cluster, but who knows what cluster I will be using in 5 years.

Again, thank you for your assistance; it is very useful,
Chris.

-- original message --


Dear Chris,

While it's always possible that GROMACS can be improved (or debugged), this
smells more like a system-level problem. The corrupt checkpoint files are
precisely 1MiB or 2MiB, which suggests strongly either 1) GROMACS was in
the middle of a buffer flush when it was killed (but the filesystem did
everything right; it was just sent incomplete data), or 2) the filesystem
itself wrote a truncated file (but GROMACS wrote it successfully, the data
was buffered, and GROMACS went on its merry way).

#1 could happen, for example, if GROMACS was killed with SIGKILL while
copying .cpt to _prev.cpt -- if GROMACS even copies, rather than renames --
its checkpoint files. #2 could happen in any number of ways, depending on
precisely how your disks, filesystems, and network filesystems are all
configured (for example, if a RAID array goes down hard with per-drive
writeback caches enabled, or your NFS system is soft-mounted and either
client or server goes down). With the sizes of the truncated checkpoint
files being very convenient numbers, my money is on #2.
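Just to make the "convenient numbers" point concrete, here is a toy python illustration (assuming, purely for
the sake of the example, a 1 MiB userspace write buffer): a writer killed mid-stream leaves behind only the
buffers that happened to be flushed, i.e. a file whose size is an exact multiple of the buffer size.

import os

BUF = 1 << 20                            # 1 MiB, matching the observed sizes
fh = open("demo.cpt", "wb", buffering=BUF)
for _ in range(2500):
    fh.write(b"\0" * 1024)               # 1 KiB records accumulate in the buffer
# ~2.44 MiB have been handed to the file object, but only the buffers that
# filled up have actually been flushed to the kernel:
print(os.path.getsize("demo.cpt"))       # 2097152, i.e. exactly 2 MiB
fh.close()                               # the remainder reaches disk only here;
                                         # a SIGKILL before this line loses it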

Have you contacted your sysadmins to report this? They may be able to take
some steps to try to prevent this, and (if this is indeed a system problem)
doing so would provide all their users an increased measure of safety for
their data.

Cheers,
MZ



