<p>
        Hi Berk,
</p>
<p>
        <br>
</p>
<p>
        Thank you very much for the detailed explanation!
</p>
<p>
        <br>
</p>
<p>
        Sincerely,
</p>
<p>
        Zhang<br>
<br>
</p>
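As a side note, the prioritisation Berk describes in the quoted reply (the non-local kernel runs at higher priority and the CPU blocks on the non-local results first, then on the local ones) can be sketched with a toy timeline model. This is not GROMACS code; all timings are hypothetical milliseconds per MD step, chosen only to show why the non-local wait dominates when the CPU has little force work of its own:

```python
# Toy model (illustration only, not GROMACS code) of how the
# "Wait GPU NB nonloc." and "Wait GPU NB local" timers accumulate.
# All times are hypothetical milliseconds per MD step.

def wait_times(cpu_force_ms, gpu_nonlocal_done_ms, gpu_local_done_ms):
    """The CPU launches both kernels at t=0, does its own force work,
    then blocks on the non-local results first, then on the local ones."""
    t = cpu_force_ms                      # CPU becomes idle here
    wait_nonlocal = max(0.0, gpu_nonlocal_done_ms - t)
    t = max(t, gpu_nonlocal_done_ms)      # time when the non-local wait returns
    wait_local = max(0.0, gpu_local_done_ms - t)
    return wait_nonlocal, wait_local

# With little CPU work (forces offloaded), most of the GPU time shows up
# as non-local wait, and the local kernel is nearly done by then.
print(wait_times(cpu_force_ms=1.0,
                 gpu_nonlocal_done_ms=8.0,
                 gpu_local_done_ms=8.5))   # (7.0, 0.5)
```

With plenty of CPU work instead (e.g. `wait_times(10.0, 8.0, 8.5)`), both waits collapse to zero, which matches Berk's point that the wait time is only long when the CPU has relatively little to do.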
<blockquote class="ReferenceQuote" style="padding-left: 5px; margin-right: 0px; margin-left: 5px; border-left-color: rgb(182, 182, 182); border-left-width: 2px; border-left-style: solid;" name="replyContent">
        -----Original Message-----<br>
<b>From:</b><span id="rc_from">"Berk Hess" &lt;hess@kth.se&gt;</span><br>
<b>Sent:</b><span id="rc_senttime">2020-04-14 19:37:31 (Tuesday)</span><br>
<b>To:</b> gmx-developers@gromacs.org<br>
<b>Cc:</b> <br>
<b>Subject:</b> Re: [gmx-developers] Which part of runtime cost does "Wait GPU NB nonloc" and "Wait GPU NB local" actually count?<br>
<br>
        <div class="moz-cite-prefix">
                Hi,<br>
<br>
Forces computed on GPUs are only collected locally. "Non-local" refers to
interactions between atoms of which some or all have their home on other
MPI ranks. We compute these non-local interactions with higher priority on
the GPU and wait on those forces first, so that we can communicate them to
the home ranks of those atoms. Thus the wait time on the non-local forces
can be long when the CPU has relatively little work. The local forces are
usually finished shortly after the non-local forces have been transferred,
so that wait time is often short.<br>
<br>
Cheers,<br>
<br>
Berk<br>
<br>
On 2020-04-14 11:33, 张驭洲 wrote:<br>
        </div>
        <blockquote cite="mid:320f367a.874d.1717809b286.Coremail.zhangyuzhou15@mails.ucas.edu.cn" type="cite">
                <p>
                        Hello Berk,
                </p>
                <p>
                        <br>
                </p>
                <p>
                        Thanks for your reply! I would like to ask one more question. Since the
wall time of Wait GPU NB nonloc. is relatively long while those of Force
and Wait GPU NB local are very short, does it mean that the communication
between a CPU and the GPUs handling its non-local interactions slows down
the run? In other words, is the force kernel itself fast, and is it the
hardware connecting CPUs and GPUs, or their topology, that restricts the
performance?
                </p>
                <p>
                        <br>
                </p>
                <p>
                        Sincerely,
                </p>
                <p>
                        Zhang<br>
                </p>
                <blockquote class="ReferenceQuote" style="padding-left: 5px; margin-right: 0px; margin-left: 5px; border-left-color: rgb(182, 182, 182); border-left-width: 2px; border-left-style: solid;" name="replyContent">
                        -----Original Message-----<br>
<b>From:</b><span id="rc_from">"Berk Hess" <a class="moz-txt-link-rfc2396E" href="mailto:hess@kth.se">&lt;hess@kth.se&gt;</a></span><br>
<b>Sent:</b><span id="rc_senttime">2020-04-14 16:52:51 (Tuesday)</span><br>
<b>To:</b> <a class="moz-txt-link-abbreviated" href="mailto:gmx-developers@gromacs.org">gmx-developers@gromacs.org</a><br>
<b>Cc:</b> <br>
<b>Subject:</b> Re: [gmx-developers] Which part of runtime cost does
"Wait GPU NB nonloc" and "Wait GPU NB local" actually count?<br>
<br>
                        <div class="moz-cite-prefix">
                                Hi,<br>
<br>
Those timers report the time the CPU is waiting for results to
arrive from the local and non-local non-bonded calculations on
the GPU. When the CPU has few or no forces to compute, this
wait time can be a large part of the total run time.<br>
<br>
Cheers,<br>
<br>
Berk<br>
<br>
On 2020-04-14 10:37, 张驭洲 wrote:<br>
                        </div>
                        <blockquote cite="mid:2613d5c7.808f.17177d626f5.Coremail.zhangyuzhou15@mails.ucas.edu.cn" type="cite">
                                <p>
                                        Hello GROMACS developers,
                                </p>
                                <p>
                                        <br>
                                </p>
                                <p>
                                        I'm using GROMACS 2020.1 on a node with 2 Intel(R) Xeon(R)
Gold 6142 CPUs and 4 NVIDIA Tesla V100-PCIE-32GB GPUs.
                                </p>
                                <p>
                                        With the command line as follows:
                                </p>
                                <p>
                                         gmx mdrun -s p16.tpr -o p16.trr -c p16_out.gro -e
p16.edr -g p16.log -pin on -ntmpi 4 -ntomp 6 -nb gpu -bonded
gpu -pme gpu -npme 1
                                </p>
                                <p>
                                        I got the following performance results:
                                </p>
                                <p>
                                        <br>
                                </p>
                                <pre>
     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 3 MPI ranks doing PP, each using 6 OpenMP threads, and
on 1 MPI rank doing PME, using 6 OpenMP threads

 Computing:               Num   Num      Call    Wall time     Giga-Cycles
                         Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.             3    6       2001      15.290        715.584   6.4
 DD comm. load              3    6        245       0.008          0.377   0.0
 DD comm. bounds            3    6         48       0.003          0.151   0.0
 Send X to PME              3    6     200001       9.756        456.559   4.1
 Neighbor search            3    6       2001      12.184        570.190   5.1
 Launch GPU ops.            3    6     400002      17.929        839.075   7.5
 Force                      3    6     200001       3.912        183.082   1.6
 Wait + Comm. F             3    6      40001       4.229        197.913   1.8
 PME mesh *                 1    6     200001      16.733        261.027   2.3
 PME wait for PP *                                162.467       2534.449  22.7
 Wait + Recv. PME F         3    6     200001      18.827        881.091   7.9
 Wait PME GPU gather        3    6     200001       2.896        135.522   1.2
 Wait Bonded GPU            3    6       2001       0.003          0.122   0.0
 Wait GPU NB nonloc.        3    6     200001      15.328        717.330   6.4
 Wait GPU NB local          3    6     200001       0.175          8.169   0.1
 Wait GPU state copy        3    6     160000      26.204       1226.327  11.0
 NB X/F buffer ops.         3    6     798003       7.023        328.655   2.9
 Write traj.                3    6         21       0.182          8.540   0.1
 Update                     3    6     200001       6.685        312.856   2.8
 Comm. energies             3    6      40001       6.684        312.796   2.8
 Rest                                              31.899       1492.851  13.3
-----------------------------------------------------------------------------
 Total                                            179.216      11182.921 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------

              Core t (s)   Wall t (s)        (%)
      Time:     4301.031      179.216     2399.9
                (ns/day)    (hour/ns)
Performance:      96.421        0.249
</pre>
                                <p>
                                        <br>
                                </p>
                                <p>
                                        Using two nodes and the following command:
                                </p>
                                <p>
                                         gmx_mpi mdrun -s p16.tpr -o p16.trr -c p16_out.gro -e
p16.edr -g p16.log -ntomp 6 -nb gpu -bonded gpu -pme gpu
-npme 1
                                </p>
                                <p>
                                        I got these results:
                                </p>
                                <p>
                                        <br>
                                </p>
                                <pre>
     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 6 MPI ranks doing PP, each using 6 OpenMP threads, and
on 1 MPI rank doing PME, using 6 OpenMP threads

 Computing:               Num   Num      Call    Wall time     Giga-Cycles
                         Ranks Threads  Count      (s)         total sum    %
-----------------------------------------------------------------------------
 Domain decomp.             6    6       2001       8.477        793.447   3.7
 DD comm. load              6    6        256       0.005          0.449   0.0
 DD comm. bounds            6    6         60       0.002          0.216   0.0
 Send X to PME              6    6     200001      32.588       3050.168  14.1
 Neighbor search            6    6       2001       6.639        621.393   2.9
 Launch GPU ops.            6    6     400002      14.686       1374.563   6.4
 Comm. coord.               6    6     198000      36.691       3434.263  15.9
 Force                      6    6     200001       2.913        272.694   1.3
 Wait + Comm. F             6    6     200001      32.024       2997.400  13.9
 PME mesh *                 1    6     200001      77.479       1208.657   5.6
 PME wait for PP *                                119.009       1856.517   8.6
 Wait + Recv. PME F         6    6     200001      14.328       1341.122   6.2
 Wait PME GPU gather        6    6     200001      11.115       1040.397   4.8
 Wait Bonded GPU            6    6       2001       0.003          0.279   0.0
 Wait GPU NB nonloc.        6    6     200001      27.604       2583.729  11.9
 Wait GPU NB local          6    6     200001       0.548         51.333   0.2
 NB X/F buffer ops.         6    6     796002      11.095       1038.515   4.8
 Write traj.                6    6         21       0.105          9.851   0.0
 Update                     6    6     200001       3.498        327.440   1.5
 Comm. energies             6    6      40001       2.947        275.863   1.3
-----------------------------------------------------------------------------
 Total                                            198.094      21631.660 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.
-----------------------------------------------------------------------------

              Core t (s)   Wall t (s)        (%)
      Time:     8319.867      198.094     4200.0
                (ns/day)    (hour/ns)
Performance:      87.232        0.275
</pre>
                                <p>
                                        <br>
                                </p>
                                <p>
                                        I'm curious about the "Wait GPU NB nonloc." and "Wait GPU NB local"
entries. As you can see, in both cases the wall time of Wait GPU NB local
is very short while that of nonloc. is quite long, and the wall time of
Force is in both cases much shorter than that of Wait GPU NB nonloc.
Could you please explain these timing terms? I would also very much
appreciate any suggestions for reducing that waiting time!
                                </p>
                                <p>
                                        <br>
                                </p>
                                <p>
                                        Sincerely,
                                </p>
                                <p>
                                        Zhang
                                </p>
                                <p>
                                        <br>
                                </p>
<br>
<br>
                        </blockquote>
<br>
                </blockquote>
<br>
<br>
        </blockquote>
<br>
</blockquote>