<br><br><div class="gmail_quote">On Wed, Oct 13, 2010 at 2:20 AM, Erik Lindahl <span dir="ltr">&lt;<a href="mailto:lindahl@cbr.su.se">lindahl@cbr.su.se</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


<div style="word-wrap:break-word">Hi,<div><br><div><div class="im"><div>On Oct 13, 2010, at 7:53 AM, Roland Schulz wrote:</div><br><blockquote type="cite"><br><br><div class="gmail_quote">On Wed, Oct 13, 2010 at 1:02 AM, Erik Lindahl <span dir="ltr">&lt;<a href="mailto:lindahl@cbr.su.se" target="_blank">lindahl@cbr.su.se</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-color:rgb(204, 204, 204);border-left-style:solid;padding-left:1ex">


<div bgcolor="#FFFFFF"><div>Hi,</div><div><br></div><div>File flushing has been a huge issue to get working properly on AFS and other systems that have an extra layer of network disk cache. We also want to make sure the files are available e.g. on the frontend node of a cluster while the simulation is still running.</div>


</div></blockquote><div>Do we want to guarantee that it is available sooner than at each checkpoint (thus by default 15 min)?</div></div></blockquote><div><br></div></div>It&#39;s not only a matter of &quot;being available&quot;, but of making sure you don&#39;t lose all that data in the disk cache layer if the node crashes and you (for some reason) disabled checkpointing.</div>


</div></div></blockquote><div>Well, but if you disabled checkpointing then it&#39;s your own fault ;-)</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div style="word-wrap:break-word">


<div><div><br></div><div>Basically, when a frame has been &quot;written&quot;, it is reasonable for the user to expect that it is actually on disk. The default behavior should be safe, IMHO.</div></div></div></blockquote>


<div>I&#39;m not sure whether the user necessarily assumes that. There are well-known cases where the behavior of the cache is exposed to the user (e.g. writing files to USB sticks). Currently GROMACS only does an fflush, not an fsync, after each frame. Thus, it is not guaranteed that the frame is immediately on disk, because it can still be in the kernel buffers. Even now, an fsync is only done after each checkpoint.</div>
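<div><br></div><div>To make the fflush/fsync distinction concrete, here is a minimal C sketch. It is not GROMACS code; the helper names write_frame_flushed and sync_to_disk are hypothetical, and it assumes a POSIX system (for fileno and fsync).</div>

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose fileno() and fsync() */
#endif
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write one frame and flush the stdio buffer.
 * Afterwards the data has been handed to the kernel, but it may still
 * sit in kernel buffers and is not guaranteed to be on disk. */
int write_frame_flushed(FILE *fp, const void *frame, size_t nbytes)
{
    assert(fp != NULL);
    if (fwrite(frame, 1, nbytes, fp) != nbytes) {
        return -1;
    }
    return fflush(fp) == 0 ? 0 : -1;
}

/* Hypothetical helper: flush the stdio buffer, then ask the kernel to
 * push the file's data to the storage device. This is the expensive
 * step, done here only at checkpoint-like intervals. */
int sync_to_disk(FILE *fp)
{
    assert(fp != NULL);
    if (fflush(fp) != 0) {
        return -1;
    }
    return fsync(fileno(fp)) == 0 ? 0 : -1;
}
```

<div>In this sketch write_frame_flushed would be called after every frame and sync_to_disk only at each checkpoint, which mirrors the split described above.</div>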


<div><br></div><div>The problem is that MPI-IO doesn&#39;t have this distinction. There is only an MPI_File_sync (and no MPI_File_flush). And a sync can be *very* expensive. </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


<div style="word-wrap:break-word"><div><div><div class="im"><blockquote type="cite"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div bgcolor="#FFFFFF"><div>I think the proper solution is rather to have a separate IO thread so the disk operation can take all the latency in the world without delaying the run.</div></div></blockquote><div>This won&#39;t solve it for all cases. Depending on the write frequency (e.g. every 10 frames), the flushing can take longer than computing the frames, while the actual writing time (measured as the writing time with only infrequent flushes) is fast enough not to cause significant overhead. In those situations the simulation would still wait on the IO thread. </div>


<div><br></div><div>Also, this adds additional complexity. Not all systems like oversubscribing threads as far as I know. I know that older versions of Cray had problems and I heard there are also problems with BlueGene. Thus we would need to make the IO thread functionality optional, which would add yet another duplication of code (both with and without IO thread).</div>


</div></blockquote><div><br></div></div>The difference is that an IO thread would virtually never run, though; it would instantly block waiting for the filesystem, and in the meantime the real threads would get control back?</div>


</div></div></blockquote><div>Yes, but you can only have one IO thread per file (otherwise the synchronization becomes quite difficult). Thus if the overhead is larger than the time between writes then you are still waiting. The time for MPI_File_sync can be *extremely* long (compared to fflush). </div>
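<div><br></div><div>A rough pthreads sketch of the one-IO-thread-per-file case shows where the waiting comes from. Everything here is hypothetical (io_slot, io_slot_submit, fake_sync_write, and the single-slot handoff itself are assumptions for illustration, not an existing design): the compute thread hands over one frame at a time, and the next hand-over blocks until the previous write and sync have finished.</div>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical one-slot handoff between compute thread and IO thread. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    const void     *frame;     /* pending frame, NULL when the slot is free */
    int             shutdown;  /* set when no more frames will arrive */
    int             written;   /* frames the IO thread has finished */
    void          (*slow_write)(const void *frame); /* write + sync stand-in */
} io_slot;

/* Placeholder for the real (possibly very slow) write followed by a sync. */
void fake_sync_write(const void *frame) { (void)frame; }

void io_slot_init(io_slot *s, void (*slow_write)(const void *))
{
    assert(s != NULL);
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->changed, NULL);
    s->frame = NULL;
    s->shutdown = 0;
    s->written = 0;
    s->slow_write = slow_write;
}

/* Compute side: blocks while the previous frame is still being written. */
void io_slot_submit(io_slot *s, const void *frame)
{
    pthread_mutex_lock(&s->lock);
    while (s->frame != NULL) {
        pthread_cond_wait(&s->changed, &s->lock);
    }
    s->frame = frame;
    pthread_cond_broadcast(&s->changed);
    pthread_mutex_unlock(&s->lock);
}

/* Tell the IO thread to exit once the slot is empty. */
void io_slot_shutdown(io_slot *s)
{
    pthread_mutex_lock(&s->lock);
    s->shutdown = 1;
    pthread_cond_broadcast(&s->changed);
    pthread_mutex_unlock(&s->lock);
}

/* IO thread body: sleeps until a frame arrives, then writes it. */
void *io_thread_main(void *arg)
{
    io_slot *s = arg;
    pthread_mutex_lock(&s->lock);
    for (;;) {
        while (s->frame == NULL && !s->shutdown) {
            pthread_cond_wait(&s->changed, &s->lock);
        }
        if (s->frame == NULL) {
            break;                    /* shutdown and nothing pending */
        }
        const void *frame = s->frame;
        pthread_mutex_unlock(&s->lock);
        s->slow_write(frame);         /* slot stays occupied during the write */
        pthread_mutex_lock(&s->lock);
        s->frame = NULL;
        s->written++;
        pthread_cond_broadcast(&s->changed);
    }
    pthread_mutex_unlock(&s->lock);
    return NULL;
}
```

<div>If slow_write takes longer than the interval between io_slot_submit calls, the compute thread ends up spending the difference inside io_slot_submit. A deeper queue would only postpone that until the queue fills.</div>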


<div><br></div><div>Roland</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div style="word-wrap:break-word"><div><div>

<span style="border-collapse:separate;color:rgb(0, 0, 0);font-family:Helvetica;font-size:medium;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:auto;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px">----------------------------------------------------------<div class="im">


<br>Erik Lindahl &lt;<a href="mailto:lindahl@cbr.su.se" target="_blank">lindahl@cbr.su.se</a>&gt;<br></div>Professor, Computational Structural Biology<br>Center for Biomembrane Research &amp; Swedish e-Science Research Center<br>


Department of Biochemistry &amp; Biophysics, Stockholm University<br>Tel: +468164675 Cell: +46703844534<br></span>

</div>

<br></div></div></blockquote></div><br><br clear="all"><br>-- <br>ORNL/UT Center for Molecular Biophysics <a href="http://cmb.ornl.gov">cmb.ornl.gov</a><br>865-241-1537, ORNL PO BOX 2008 MS6309<br>