<br><br><div class="gmail_quote">On Wed, Oct 13, 2010 at 2:05 AM,  <span dir="ltr">&lt;<a href="mailto:hess@sbc.su.se">hess@sbc.su.se</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


Hi,<br>

<br>

I think there is no fundamental reason why we cannot flush<br>

only at checkpointing. I don&#39;t recall what the original motivation<br>

was for flushing every frame. This was probably done before we used<br>

the whole output file list in the checkpointing code, so the only<br>

reason might have been because of checkpointing.<br>

I like that I always have every frame immediately, but having it<br>

up to at most 15 minutes ago is also fine.<br>

The question is if we should only flush at checkpointing by default<br>

or have a switch or some automated setting (we could even catch<br>

a signal to flush immediately).<br></blockquote><div>What would you prefer? </div><div>I think we should try to keep it as simple as possible (we already have too many seldom-used features with the potential to introduce bugs). Thus I&#39;m not sure whether an automated setting or a signal is the best way to go. I don&#39;t see a good reason to flush every frame, so I would flush only at checkpoints by default and add a hidden option to enforce flushing. But that&#39;s just my 2c.</div>
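<div><br></div><div>To make the idea concrete, here is a minimal sketch of how such an option could look. All names (traj_out_t, flush_policy_t, traj_write_frame, traj_checkpoint) are hypothetical and only for illustration; they are not existing GROMACS functions, and the real output code would go through gmx_fio rather than a plain FILE pointer.</div>

```c
#include <stdio.h>

/* Hypothetical flush policy; not part of the GROMACS API. */
typedef enum {
    FLUSH_AT_CHECKPOINT, /* assumed default: flush only when checkpointing */
    FLUSH_EVERY_FRAME    /* assumed hidden option: flush after each frame  */
} flush_policy_t;

/* Hypothetical handle for one output file. */
typedef struct {
    FILE          *fp;
    flush_policy_t policy;
    int            nflush; /* counts explicit flushes, for illustration only */
} traj_out_t;

/* Push buffered data to the OS and record that we did so. */
int traj_flush(traj_out_t *out)
{
    out->nflush++;
    return fflush(out->fp);
}

/* Write one frame; flush only if the policy asks for it. */
int traj_write_frame(traj_out_t *out, const void *frame, size_t nbytes)
{
    if (fwrite(frame, 1, nbytes, out->fp) != nbytes)
    {
        return -1;
    }
    if (out->policy == FLUSH_EVERY_FRAME)
    {
        return traj_flush(out);
    }
    return 0;
}

/* Called before a checkpoint is written: always flush, so that the
 * file position stored in the checkpoint matches the data handed
 * to the OS. */
int traj_checkpoint(traj_out_t *out)
{
    return traj_flush(out);
}
```

<div>With FLUSH_AT_CHECKPOINT the per-frame cost is only the buffered write, and the files are at most one checkpoint interval behind. Note that fflush only hands the data to the OS; whether that is enough on AFS or a parallel file system is exactly what would need testing.</div>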


<div><br></div><div>Roland</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<br>

Berk<br>

<div><div></div><div class="h5"><br>

&gt; On Wed, Oct 13, 2010 at 1:02 AM, Erik Lindahl &lt;<a href="mailto:lindahl@cbr.su.se">lindahl@cbr.su.se</a>&gt; wrote:<br>

&gt;<br>

&gt;&gt; Hi,<br>

&gt;&gt;<br>

&gt;&gt; File flushing has been a huge issue to get working properly on AFS and<br>

&gt;&gt; other systems that have an extra layer of network disk cache. We also<br>

&gt;&gt; want<br>

&gt;&gt; to make sure the files are available e.g. on the frontend node of a<br>

&gt;&gt; cluster<br>

&gt;&gt; while the simulation is still running.<br>

&gt;&gt;<br>

&gt; Do we want to guarantee that it is available sooner than at each<br>

&gt; checkpoint<br>

&gt; (thus by default 15min)?<br>

&gt;<br>

&gt; I think the proper solution is rather to have a separate IO thread so the<br>

&gt;&gt; disk operation can take all the latency in the world without delaying<br>

&gt;&gt; the<br>

&gt;&gt; run.<br>

&gt;&gt;<br>

&gt; This won&#39;t solve it for all cases. Depending on the write frequency (e.g.<br>

&gt; every 10 frames) the flushing time can take longer than computing the<br>

&gt; frames<br>

&gt; while the actual writing time (measured as the writing time with only<br>

&gt; infrequent flush) is fast enough to not cause significant overhead. In<br>

&gt; those<br>

&gt; situations the simulation would still wait on the IO thread.<br>

&gt;<br>

&gt; Also this adds additional complexity. Not all systems like oversubscribing<br>

&gt; threads as far as I know. I know that older versions of Cray had problems<br>

&gt; and I heard there are also problems with BlueGene. Thus we would need to<br>

&gt; make the IO thread functionality optional which would add yet<br>

&gt; another duplication of code (both with and without IO thread).<br>

&gt;<br>

&gt; You are more than welcome to play with it (but not in the release branch!)<br>

&gt; -<br>

&gt;&gt;<br>

&gt; No, this is only going into the CollectiveIO branch and from there into the<br>

&gt; master branch.<br>

&gt;<br>

&gt;<br>

&gt;&gt;  you might already have an account on the AFS-equipped clusters here, or<br>

&gt;&gt; we<br>

&gt;&gt; can arrange it!<br>

&gt;&gt;<br>

&gt;&gt; Alternatively, sync with Sander and he might be able to test new code on<br>

&gt;&gt; AFS.<br>

&gt;&gt;<br>

&gt;<br>

&gt; Yes, if I could get an account or Sander could test the CollectiveIO branch<br>

&gt; that would be great. I&#39;ll write Sander directly tomorrow.<br>

&gt;<br>

&gt; Roland<br>

&gt;<br>

&gt;<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; On Oct 12, 2010, at 23:26, Roland Schulz &lt;<a href="mailto:roland@utk.edu">roland@utk.edu</a>&gt; wrote:<br>

&gt;&gt;<br>

&gt;&gt; Erik, Berk,<br>

&gt;&gt;<br>

&gt;&gt; you added flushing of trn, xtc and ern before the checkpointing<br>

&gt;&gt; functionality had been added. The additional flush can add quite a bit<br>

&gt;&gt; of unnecessary time especially with parallel file systems and/or MPI-IO.<br>

&gt;&gt; Am<br>

&gt;&gt; I right that with checkpointing it is not necessary anymore?  The<br>

&gt;&gt; checkpointing is flushing the file before writing the checkpoint.<br>

&gt;&gt; Also gmx_fio_check_file_position is flushing the file before checking<br>

&gt;&gt; whether the file is too large for gmx_off_t<br>

&gt;&gt; Roland<br>

&gt;&gt;<br>

&gt;&gt; --<br>

&gt;&gt; ORNL/UT Center for Molecular Biophysics<br>

</div></div>&gt;&gt; &lt;<a href="http://cmb.ornl.gov" target="_blank">http://cmb.ornl.gov</a>&gt;<a href="http://cmb.ornl.gov" target="_blank">cmb.ornl.gov</a><br>

<div><div></div><div class="h5">&gt;&gt; 865-241-1537, ORNL PO BOX 2008 MS6309<br>

&gt;&gt;<br>

&gt;&gt;<br>

&gt;<br>

&gt;<br>

&gt; --<br>

&gt; ORNL/UT Center for Molecular Biophysics <a href="http://cmb.ornl.gov" target="_blank">cmb.ornl.gov</a><br>

&gt; 865-241-1537, ORNL PO BOX 2008 MS6309<br>

&gt;<br>

<br>

</div></div></blockquote></div><br><br clear="all"><br>-- <br>ORNL/UT Center for Molecular Biophysics <a href="http://cmb.ornl.gov">cmb.ornl.gov</a><br>865-241-1537, ORNL PO BOX 2008 MS6309<br>