<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#ffffff">
On 8/02/2011 1:48 AM, Qiong Zhang wrote:
<blockquote cite="mid:442280.57183.qm@web53804.mail.re2.yahoo.com"
type="cite">
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td style="font: inherit;" valign="top"><br>
Hi Mark,<br>
<br>
Many thanks for your fast response!<br>
<br>
<p class="MsoNormal"><i style=""><span lang="EN-US">What's
the network hardware? Can other machine load
influence your network
performance?</span></i></p>
<p class="MsoNormal"><span lang="EN-US">The supercomputer
system is based on the
Cray Gemini interconnect technology. I suppose this is
fast network hardware...</span></p>
<p class="MsoNormal"><br>
<span lang="EN-US"></span></p>
<p class="MsoNormal"><i style=""><span lang="EN-US">Are
the systems in the NVT ensemble? Use diff to check
that the .mdp files differ only
in the ways you think they do.</span></i></p>
<p class="MsoNormal"><span lang="EN-US">The systems are in
the NPT ensemble. I saw some
discussions on the mailing list suggesting that the NPT
ensemble is superior to the NVT ensemble
for REMD. The .mdp files differ only in the
temperature.</span></p>
</td>
</tr>
</tbody>
</table>
</blockquote>
<br>
Maybe so, but under NPT the density varies with T, and so with
replica. This means the size of neighbour lists varies, and the cost
of the computation (PME or not) varies. The generalized ensemble is
limited by the progress of the slowest replica. If using PME, in
theory, you can juggle the contribution of the various terms to
balance the computation load across the replicas, but this is not
easy to do.<br>
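<br>
To illustrate the point about the slowest replica, here is a minimal Python sketch. It assumes every replica waits for the slowest one at each exchange attempt and ignores communication cost; the function name is hypothetical and the timings are made up for illustration, not measured:<br>
<pre>
```python
def idle_fraction(work_seconds):
    """Fraction of ensemble compute time lost waiting for the slowest replica.

    work_seconds holds one compute time per replica for the same number
    of steps. Every replica waits at each exchange attempt, so the wall
    time of the whole ensemble is set by the largest entry.
    """
    slowest = max(work_seconds)
    mean = sum(work_seconds) / len(work_seconds)
    return 1.0 - mean / slowest


# Made-up per-replica timings (seconds), for illustration only.
example = [100.0, 90.0, 80.0, 70.0]
print(round(idle_fraction(example), 2))
```
</pre>
If all replicas did the same amount of work, the idle fraction would be zero; the wider the spread, the more CPU time is spent waiting.<br>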
<span lang="EN-US"> </span>
<blockquote cite="mid:442280.57183.qm@web53804.mail.re2.yahoo.com"
type="cite">
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td style="font: inherit;" valign="top">
<p class="MsoNormal"><i style=""><span lang="EN-US">What
are the values of nstlist and<a
moz-do-not-send="true" name="OLE_LINK12">
nstcalcenergy</a>?</span></i></p>
<p class="MsoNormal"><span lang="EN-US">Previously,
nstlist=5</span>, <a moz-do-not-send="true"
name="OLE_LINK15"><span lang="EN-US">nstcalcenergy</span></a><span
lang="EN-US">=1</span></p>
<span style=""></span>
<p class="MsoNormal"><span style="font-size: 11pt;
font-family: NimbusRomNo9L-Regu;" lang="EN-US">Thank
you for
pointing this out. I checked the manual again and saw that
this option affects
performance in parallel simulations, because
calculating energies requires global
communication between all processes. So I have set
this option to -1 this time.
This was probably one reason for the low parallel
efficiency.</span></p>
<p class="MsoNormal"><span style="font-size: 11pt;
font-family: NimbusRomNo9L-Regu;" lang="EN-US">After I
set nstcalcenergy=-1, I found a 3% improvement in
efficiency compared with nstcalcenergy=1.</span></p>
</td>
</tr>
</tbody>
</table>
</blockquote>
<br>
Yep. nstpcouple and nsttcouple also influence this.<br>
<span lang="EN-US"> <br>
</span>
<blockquote cite="mid:442280.57183.qm@web53804.mail.re2.yahoo.com"
type="cite">
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td style="font: inherit;" valign="top">
<p style="font-style: italic;" class="MsoNormal"><span
lang="EN-US">Take a look at the execution time
breakdown
at the end of the .log files, and do so for more than
one replica. With the
current implementation, every simulation has to
synchronize and communicate
every handful of steps, which means that large scale
parallelism won't work
efficiently unless you have<a moz-do-not-send="true"
name="OLE_LINK17"> fast network hardware</a> that
is dedicated to your job. This effect shows up in the
"Rest" row of
the time breakdown. With <a moz-do-not-send="true"
name="OLE_LINK14"></a><a moz-do-not-send="true"
name="OLE_LINK13"><span style="">Infiniband</span></a>,
I'd expect you should
only be losing about 10% of the run time total. The
30-fold loss you have upon
going from 24->42 replicas keeping 4 CPUs/replica
suggests some other
contribution, however.</span></p>
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<p class="MsoNormal"><span lang="EN-US">I checked the time
breakdown in the log
files for short REMD simulations. For the REMD
simulation with 168 cores for 42
replicas, as you see below, the “Rest” row makes up a
surprisingly high <b style=""><u>96.6%</u></b> of
the time for one of the
replicas. The value is at almost the same level for
the other replicas. For
the REMD simulation with 96 cores for 24 replicas,
“Rest” takes up about
24%. I was also aware of your post: </span></p>
<p class="MsoNormal"><span lang="EN-US"><a
moz-do-not-send="true"
href="http://www.mail-archive.com/gmx-users@gromacs.org/msg37507.html">http://www.mail-archive.com/gmx-users@gromacs.org/msg37507.html</a></span></p>
<p class="MsoNormal"><span lang="EN-US">As you suggested,
such a big loss should be
ascribed to other factors. Do you think the
network hardware is to blame, or
are there other reasons? Any suggestions would
be greatly appreciated.<br>
</span></p>
</td>
</tr>
</tbody>
</table>
</blockquote>
<br>
I expect the load imbalance across replicas is partly to blame. Look
at the sum of Force + PME mesh (in seconds) across the generalized
ensemble. That's where the simulation work is all done, and I expect
your low-temperature replicas are doing much more work than your
high-temperature replicas. Unfortunately 4.5.3 doesn't allow the
user to know enough detail here. Future versions of GROMACS will -
work in progress.<br>
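<br>
A minimal sketch of that check in Python, assuming each replica's .log ends with a time breakdown laid out like the one quoted further down (row name, then Nodes, an optional Number, G-Cycles, Seconds, %). The function names are illustrative only and are not part of GROMACS; the three sample rows are copied from the breakdown in this thread:<br>
<pre>
```python
def parse_breakdown(text):
    """Return {row name: seconds} for the rows of a time breakdown table."""
    seconds = {}
    for line in text.splitlines():
        tokens = line.split()
        # Count how many tokens at the end of the line are numbers.
        n_numeric = 0
        for token in reversed(tokens):
            try:
                float(token)
            except ValueError:
                break
            n_numeric += 1
        # Data rows end with Nodes, [Number,] G-Cycles, Seconds, %.
        name_tokens = tokens[:len(tokens) - n_numeric]
        if n_numeric in (4, 5) and name_tokens:
            seconds[" ".join(name_tokens)] = float(tokens[-2])
    return seconds


def work_seconds(text):
    """Seconds spent in Force plus PME mesh for one replica."""
    rows = parse_breakdown(text)
    return rows["Force"] + rows["PME mesh"]


# Three rows copied from the breakdown quoted in this thread.
sample = """
 Force            4   2201   175.303     83.5    2.0
 PME mesh         4   2201    30.314     14.4    0.3
 Rest             4         8426.029   4012.4   96.6
"""
print(round(work_seconds(sample), 1))
```
</pre>
Running work_seconds() on the breakdown from every replica's .log and comparing the values should show how uneven the work is across the ensemble.<br>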
<br>
Strictly, though, your rate-limiting lowest temperature replica in
the 24-replica regime should take an amount of time comparable to
that of the lowest in the 42-replica regime (22K difference is not
that significant) - and similar to a run other than as part of a
replica-exchange simulation. Your reported data is not consistent
with that, so I think your jobs are also experiencing differing
degrees of network or filesystem contention at different times. Your
sysadmins can comment on that.<br>
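<br>
As a rough check on the throughput figures from the first post, here is a short Python sketch. The numbers are the ones reported in this thread; the variable and function names are illustrative only:<br>
<pre>
```python
# Throughput figures quoted at the start of this thread,
# as (replicas, CPUs, steps per 15 minutes).
runs = [
    (24, 96, 58015),
    (42, 42, 865),
    (42, 84, 1175),
    (42, 168, 1875),
    (42, 336, 2855),
]


def steps_for(replicas, cpus):
    """Look up the reported steps per 15 minutes for one run."""
    for r, c, steps in runs:
        if (r, c) == (replicas, cpus):
            return steps
    raise KeyError((replicas, cpus))


# Both of these runs use 4 CPUs per replica.
slowdown = steps_for(24, 96) / steps_for(42, 168)
print(round(slowdown, 1))  # 30.9

# Speed-up of the 42-replica runs relative to 1 CPU per replica.
base = steps_for(42, 42)
for r, c, steps in runs:
    if r == 42:
        print(c // r, "CPUs/replica:", round(steps / base, 2), "x")
```
</pre>
The first number is the roughly 30-fold loss mentioned earlier; the loop shows how little the 42-replica runs gain from extra CPUs per replica.<br>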
<br>
Mark<br>
<br>
<blockquote cite="mid:442280.57183.qm@web53804.mail.re2.yahoo.com"
type="cite">
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td style="font: inherit;" valign="top">
<p class="MsoNormal"><span lang="EN-US"> </span></p>
<pre> Computing:          Nodes     Number     G-Cycles    Seconds       %
---------------------------------------------------------------------
 Domain decomp.          4        442        2.604        1.2     0.0
 DD comm. load           4          6        0.001        0.0     0.0
 Comm. coord.            4       2201        1.145        0.5     0.0
 Neighbor search         4        442       14.964        7.1     0.2
 Force                   4       2201      175.303       83.5     2.0
 Wait + Comm. F          4       2201        1.245        0.6     0.0
 PME mesh                4       2201       30.314       14.4     0.3
 Write traj.             4         11       17.346        8.3     0.2
 Update                  4       2201        2.004        1.0     0.0
 Constraints             4       2201       26.593       12.7     0.3
 Comm. energies          4        442       28.722       13.7     0.3
 Rest                    4                8426.029     4012.4    96.6
---------------------------------------------------------------------
 Total                   4                8726.270     4155.4   100.0</pre>
<br>
<br>
Qiong<br>
<br>
On 7/02/2011 9:52 PM, Qiong Zhang wrote:
<blockquote type="cite">
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td style="font: inherit;" valign="top">
<p class="yiv1366269415MsoNormal"><span
lang="EN-US">Dear all gmx-users,</span></p>
<p class="yiv1366269415MsoNormal"><span
lang="EN-US"> </span></p>
<p class="yiv1366269415MsoNormal"><span
lang="EN-US">I have recently been testing REMD
simulations. I ran the simulations on a
supercomputer system based
on AMD Opteron 12-core (2.1 GHz)
processors, using GROMACS 4.5.3.</span></p>
<p class="yiv1366269415MsoNormal"><span
lang="EN-US"> </span></p>
<p class="yiv1366269415MsoNormal"><span
lang="EN-US">I have a system of 5172 atoms,
of which 138 atoms belong to the solute and the
rest are water molecules. An exponential
distribution of temperatures was generated,
ranging from 276 to 515 K for a total of 42
replicas or from 298 to 420 K for a total of 24
replicas, ensuring that the exchange ratio
between all adjacent replicas is about 0.25.
Replica exchange was attempted every
0.5 ps. The integration step size was 2 fs.</span></p>
<p class="yiv1366269415MsoNormal"><span
lang="EN-US"> </span></p>
<p class="yiv1366269415MsoNormal"><span
lang="EN-US">For the above system, when REMD
is run over 24 replicas, the
simulation speed is reasonably fast.
However, when REMD is run over 42
replicas, the simulation speed is awfully
slow. Please see the following table for the
speed.<br>
</span></p>
<pre>Replica number   CPU number   Speed (steps per 15 minutes)
---------------------------------------------------------
            24           96   58015
            42           42     865
            42           84    1175
            42          168    1875
            42          336    2855
---------------------------------------------------------</pre>
<p class="yiv1366269415MsoNormal"><span
lang="EN-US"> </span></p>
<p class="yiv1366269415MsoNormal"><span
lang="EN-US">The command line for the mdrun
is:</span></p>
<p class="yiv1366269415MsoNormal"><span
lang="EN-US">aprun -n (CPU number here)
mdrun_d -s md.tpr -multi (replica number
here) -replex 250</span></p>
<p class="yiv1366269415MsoNormal"><span
lang="EN-US"> </span></p>
<p class="yiv1366269415MsoNormal"><span
lang="EN-US">My questions are:<br>
</span></p>
<p class="yiv1366269415MsoNormal"><span
lang="EN-US">1) Why is REMD with 42
replicas so slow for the same system?<br>
</span></p>
<p class="yiv1366269415MsoNormal"><span
lang="EN-US">2) How can I
improve the efficiency?<br>
</span></p>
</td>
</tr>
</tbody>
</table>
</blockquote>
<br>
What's the network hardware? Can other machine load
influence your network performance?<br>
<br>
Are the systems in the NVT ensemble? Use diff to check that the
.mdp files differ only in the ways you think they do.<br>
<br>
What are the values of nstlist and nstcalcenergy?<br>
<br>
Take a look at the execution time breakdown at the end of
the .log files, and do so for more than one replica. With
the current implementation, every simulation has to
synchronize and communicate every handful of steps, which
means that large scale parallelism won't work efficiently
unless you have fast network hardware that is dedicated to
your job. This effect shows up in the "Rest" row of the
time breakdown. With Infiniband, I'd expect you should
only be losing about 10% of the run time total. The
30-fold loss you have upon going from 24->42 replicas
keeping 4 CPUs/replica suggests some other contribution,
however.<br>
<br>
Mark</td>
</tr>
</tbody>
</table>
<br>
</blockquote>
<br>
</body>
</html>