<div class="gmail_quote"><br><br><div class="gmail_quote"><div class="im">On Wed, Oct 13, 2010 at 3:08 AM, Erik Lindahl <span dir="ltr">&lt;<a href="mailto:lindahl@cbr.su.se" target="_blank">lindahl@cbr.su.se</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div style="word-wrap:break-word">Hi,<div><br><div><div><div>On Oct 13, 2010, at 9:02 AM, Roland Schulz wrote:</div><br><blockquote type="cite"><div class="gmail_quote">On Wed, Oct 13, 2010 at 2:20 AM, Erik Lindahl <span dir="ltr">&lt;<a href="mailto:lindahl@cbr.su.se" target="_blank">lindahl@cbr.su.se</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-color:rgb(204, 204, 204);border-left-style:solid;padding-left:1ex">


<div style="word-wrap:break-word">Hi,<div><br><div><div><div>On Oct 13, 2010, at 7:53 AM, Roland Schulz wrote:</div><blockquote type="cite"><br><div class="gmail_quote">On Wed, Oct 13, 2010 at 1:02 AM, Erik Lindahl <span dir="ltr">&lt;<a href="mailto:lindahl@cbr.su.se" target="_blank">lindahl@cbr.su.se</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-color:rgb(204, 204, 204);border-left-style:solid;padding-left:1ex">


<div bgcolor="#FFFFFF"><div>Hi,</div><div><br></div><div>File flushing has been a huge issue to get working properly on AFS and other systems that have an extra layer of network disk cache. We also want to make sure the files are available e.g. on the frontend node of a cluster while the simulation is still running.</div>


</div></blockquote><div>Do we want to guarantee that it is available sooner than at each checkpoint (by default, every 15 min)?</div></div></blockquote><div><br></div></div>It&#39;s not only a matter of &quot;being available&quot;, but of making sure you don&#39;t lose all the data held in the disk cache layer if the node crashes and you have (for some reason) disabled checkpointing.</div>


</div></div></blockquote><div>Well, but if you disabled checkpointing then it&#39;s your own fault ;-)</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word">


<div><div><br></div><div>Basically, when a frame has been &quot;written&quot;, it is reasonable for the user to expect that it is actually on disk. The default behavior should be safe, IMHO.</div></div></div></blockquote>


<div>I&#39;m not sure whether the user necessarily assumes that. There are well-known cases where the behavior of the cache is exposed to the user (e.g. writing files to USB sticks). Currently GROMACS only does an fflush, not an fsync, after each frame. Thus, it is not guaranteed that a frame is on disk immediately, because it can still be in the kernel buffers. Even now, an fsync is done only after each checkpoint.</div>


</div></blockquote><div><br></div></div>Priority 1A is that we should never write &quot;broken&quot; trajectory frames to disk - that has caused huge amounts of grief in the past, and can be really confusing to users.</div>


</div></div></blockquote><div><br></div></div><div>This is not what we are doing at the moment. With the current scheme (flush after each frame, sync after each checkpoint) it is possible for the trajectory to be broken. But the checkpoint append feature guarantees that a broken trajectory is fixed automatically. I prefer fast writing plus an automatic fix in the worst case over having to guarantee that the file is always correct from the beginning. It would also be extremely difficult to guarantee that in all cases (e.g. a crash while a frame is being written). </div>
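<div>To make the flush/sync distinction concrete, here is a minimal sketch of the scheme described above, assuming a POSIX system and plain stdio. The names write_frame and sync_at_checkpoint are made up for illustration and are not the actual GROMACS routines.</div>

```c
#define _POSIX_C_SOURCE 200809L  /* for fileno() and fsync() */
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: append one frame and empty the stdio buffer.
 * After fflush() the data has been handed to the kernel, but it may
 * still sit in kernel (or network file system) caches, not on disk. */
static int write_frame(FILE *fp, const void *frame, size_t nbytes)
{
    if (fwrite(frame, 1, nbytes, fp) != nbytes) {
        return -1;
    }
    return fflush(fp) == 0 ? 0 : -1;
}

/* Hypothetical helper: called at checkpoint time. fflush() first moves
 * anything left in the stdio buffer to the kernel, then fsync() asks
 * the kernel to push the file's data to the storage device. */
static int sync_at_checkpoint(FILE *fp)
{
    if (fflush(fp) != 0) {
        return -1;
    }
    return fsync(fileno(fp)) == 0 ? 0 : -1;
}
```

<div>The cheap call runs once per frame and the expensive one only once per checkpoint interval, which is why frames written since the last checkpoint can be lost or truncated after a crash.</div>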


<div class="im">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div><br></div><div>I think that basically leaves two long-term options:</div><div>


<br></div><div>1) Make sure that each frame is properly flushed/synced</div></div></div></blockquote></div><div>that would be slow for frequent writing.</div><div class="im"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div style="word-wrap:break-word"><div><div>2) Buffer IO and wait until the next checkpoint time before you write the frames to disk.</div></div></div></blockquote></div><div>We have already added buffered IO in the CollectiveIO branch. Without buffering it is impossible to get fast collective IO. </div>


<div>At the moment we only buffer as many frames as there are IO nodes. We could certainly increase that, which would reduce or even eliminate the sync problem.</div><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div style="word-wrap:break-word"><div><div><br></div><div>If we go with #2, there are two additional (minor?) issues: First, we need to check whether checkpointing is disabled or only done every 5-10h, and in that case sync frames every ~15 minutes anyway.</div>


</div></div></blockquote></div><div>ok </div><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div> Second, there could be a number of systems where we run out of memory if we buffer things. Then we need to designate a buffer amount and flush files when this is full.</div>


</div></div></blockquote></div><div>Currently the buffer is capped at 2MB per core. This seems to be enough for efficient collective IO. If we add the flush back in, we might want to increase that limit a bit. </div>
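<div>As an illustration of a capped buffer that is flushed when it fills up, here is a serial sketch using plain stdio. It is not the CollectiveIO code (which uses MPI-IO); the frame_buffer type and the fb_* functions are hypothetical names, and the cap is simply whatever the caller passes in.</div>

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical bounded buffer that collects frames in memory. */
typedef struct {
    char  *data;  /* buffered frame bytes */
    size_t used;  /* bytes currently buffered */
    size_t cap;   /* upper limit, e.g. 2 MB per core */
    FILE  *fp;    /* output file */
} frame_buffer;

static int fb_init(frame_buffer *fb, FILE *fp, size_t cap)
{
    fb->data = malloc(cap);
    fb->used = 0;
    fb->cap  = cap;
    fb->fp   = fp;
    return fb->data != NULL ? 0 : -1;
}

/* Write out everything buffered so far and flush the stdio buffer. */
static int fb_drain(frame_buffer *fb)
{
    if (fb->used > 0 && fwrite(fb->data, 1, fb->used, fb->fp) != fb->used) {
        return -1;
    }
    fb->used = 0;
    return fflush(fb->fp) == 0 ? 0 : -1;
}

/* Buffer a frame; drain first if it would not fit under the cap. */
static int fb_add_frame(frame_buffer *fb, const void *frame, size_t nbytes)
{
    if (nbytes > fb->cap - fb->used && fb_drain(fb) != 0) {
        return -1;
    }
    if (nbytes > fb->cap) {
        /* A frame larger than the whole buffer is written directly. */
        return fwrite(frame, 1, nbytes, fb->fp) == nbytes ? 0 : -1;
    }
    memcpy(fb->data + fb->used, frame, nbytes);
    fb->used += nbytes;
    return 0;
}

static void fb_free(frame_buffer *fb)
{
    free(fb->data);
    fb->data = NULL;
    fb->used = fb->cap = 0;
}
```

<div>A caller would invoke fb_drain at every checkpoint as well, so the cap only bounds memory use between checkpoints.</div>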


<div class="im">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div><div><br><blockquote type="cite"><div class="gmail_quote"><div>

The problem is that MPI-IO doesn&#39;t have this distinction. There is only an MPI_File_sync (and no MPI_File_flush), and a sync can be *very* expensive. </div></div></blockquote><div><br></div></div><div>Unfortunately we absolutely need to do a full sync at regular intervals (but #2 above would work), or you risk losing weeks of results on some clusters.</div>


</div></div></div></blockquote></div><div>I never wanted to remove the full sync at regular intervals. The original question was whether we need a flush (not sync) after each frame.</div><div><br></div><font color="#888888"><div>


Roland</div></font></div>

</div><br><br clear="all"><br>-- <br>ORNL/UT Center for Molecular Biophysics <a href="http://cmb.ornl.gov">cmb.ornl.gov</a><br>865-241-1537, ORNL PO BOX 2008 MS6309<br>