[gmx-users] Can we set the number of pure PME nodes when using GPU&CPU?

Theodore Si sjyzhxw at gmail.com
Tue Aug 26 07:51:15 CEST 2014


Hi  Szilárd,

But CUDA 5.5 won't work with icc 14, right?
It only works with icc 12.1 unless a header in CUDA 5.5 is modified.

Theo

On 8/25/2014 9:44 PM, Szilárd Páll wrote:
> On Mon, Aug 25, 2014 at 8:08 AM, Mark Abraham <mark.j.abraham at gmail.com> wrote:
>> On Mon, Aug 25, 2014 at 5:01 AM, Theodore Si <sjyzhxw at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> https://onedrive.live.com/redir?resid=990FCE59E48164A4!2572&authkey=!AP82sTNxS6MHgUk&ithint=file%2clog
>>> https://onedrive.live.com/redir?resid=990FCE59E48164A4!2482&authkey=!APLkizOBzXtPHxs&ithint=file%2clog
>>>
>>> These are 2 log files. The first one uses 64 CPU cores (64 / 16 = 4
>>> nodes) and 4 nodes * 2 = 8 GPUs, and the second uses 512 CPU cores and
>>> no GPUs. When we look at the 64-core log file, we find that in the
>>> "R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G" table the
>>> total wall time is the sum of every line, that is,
>>> 37.730 = 2.201 + 0.082 + ... + 1.150. So we think that while the CPUs
>>> are doing PME, the GPUs are doing nothing. That's why we say they are
>>> working sequentially.
>>>
>> Please note that "sequential" means "one phase after another." Your log
>> files don't show the timing breakdown for the GPUs, which is distinct from
>> showing that the GPUs ran and then the CPUs ran (which I don't think the
>> code even permits!). References to "CUDA 8x8 kernels" do show the GPU was
>> active. There was an issue with mdrun not always being able to gather and
>> publish the GPU timing results; I don't recall the conditions (Szilard
>> might remember), but it might be fixed in a later release.
> It is a limitation (well, I'd say borderline bug) in CUDA that if you
> have multiple work-queues (= streams), reliable timing using the CUDA
> built-in mechanisms is impossible. There may be a way to work around
> this, but that won't happen in the current versions. What's important
> is to observe the wait time on the CPU side, and of course, if the OP is
> profiling, this is not an issue.
>
>> In any case, you
>> should probably be doing performance optimization on a GROMACS version that
>> isn't a year old.
>>
>> I gather that you didn't actually observe the GPUs idle - e.g. with a
>> performance monitoring tool? Otherwise, and in the absence of a description
>> of your simulation system, I'd say that log file looks somewhere between
>> normal and optimal. For the record, for better performance, you should
>> probably be following the advice of the install guide and not compiling
>> FFTW with AVX support, and using one of the five gcc minor versions
>> released since 4.4 ;-)
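>> As a rough check (not a proper profile), something like
>>
>>   nvidia-smi -l 1
>>
>> run on one of the compute nodes during the simulation will show in the
>> GPU-Util column whether the GPUs are actually sitting idle.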
> And besides avoiding ancient gcc versions, I suggest using CUDA 5.5
> (which you can use because you have the version 5.5 driver, which I can
> see in your log file).
>
> Additionally, I suggest avoiding MKL and using FFTW instead. For the
> grid sizes of interest to us, all the benchmarks I did in the past showed
> considerably higher FFTW performance. The same goes for icc, but feel free
> to benchmark and please report back if you find the opposite.
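> For example, a build configuration along these lines (the CUDA path is
> just a placeholder, adjust it to your installation):
>
>   CC=gcc CXX=g++ cmake .. \
>     -DGMX_GPU=ON \
>     -DCUDA_TOOLKIT_ROOT_DIR=/path/to/cuda-5.5 \
>     -DGMX_FFT_LIBRARY=fftw3 \
>     -DGMX_BUILD_OWN_FFTW=ON
>
> GMX_BUILD_OWN_FFTW lets the build download and compile FFTW with the
> settings the install guide recommends, which also sidesteps the AVX issue
> Mark mentioned.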
>
>>> As for the 512-core log file, the total wall time is approximately the
>>> sum of PME mesh and PME wait for PP. We think this is because the
>>> PME-dedicated nodes finished early, and the total wall time is the time
>>> spent on the PP nodes, so the time spent on PME is hidden.
>>
>> Yes, using an offload model makes it awkward to report CPU timings, because
>> there are two kinds of CPU ranks. The total of the "Wall t" column adds up
>> to twice the total time taken (which is noted explicitly in more recent
>> mdrun versions). By design, the PME ranks do finish early, as you know from
>> Figure 3.16 of the manual. As you can see in the table, the PP ranks spend
>> 26% of their time waiting for the results from the PME ranks, and this is
>> the origin of the note (above the table) that you might want to balance
>> things better.
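>> A simple way to look for a better split is to rerun with a few different
>> -npme values, e.g. something like
>>
>>   mpirun -np 512 mdrun_mpi -npme 192 -s topol.tpr -maxh 0.1
>>
>> (topol.tpr and the binary name are placeholders for your setup), or let
>> g_tune_pme do the scan for you, and compare the ns/day of the short runs.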
>>
>> Mark
>>
>> On 8/23/2014 9:30 PM, Mark Abraham wrote:
>>>> On Sat, Aug 23, 2014 at 1:47 PM, Theodore Si <sjyzhxw at gmail.com> wrote:
>>>>
>>>>>   Hi,
>>>>>
>>>>> When we used 2 GPU nodes (each has 2 CPUs and 2 GPUs) to do an mdrun
>>>>> (with no PME-dedicated node), we noticed that while the CPUs are doing
>>>>> PME, the GPUs are idle,
>>>>>
>>>> That could happen if the GPU completes its work too fast, in which case
>>>> the end of the log file will probably scream about imbalance.
>>>>
>>>>> that is, they are doing their work sequentially.
>>>>
>>>>
>>>> Highly unlikely, not least because the code is written to overlap the
>>>> short-range work on the GPU with everything else on the CPU. What's your
>>>> evidence for *sequential* rather than *imbalanced*?
>>>>
>>>>
>>>>>   Is it supposed to be so?
>>>>>
>>>> No, but without seeing your .log files, mdrun command lines and knowing
>>>> about your hardware, there's nothing we can say.
>>>>
>>>>
>>>>>   Is it for the same reason that GPUs on PME-dedicated nodes won't be
>>>>> used during a run, as you said before?
>>>>>
>>>> Why would you suppose that? I said GPUs do work from the PP ranks on their
>>>> node. That's true here.
>>>>
>>>>> So if we want to exploit our hardware, we should map PP-PME ranks
>>>>> manually, right? Say, use one node as the PME-dedicated node, leave the
>>>>> GPUs on that node idle, and use two nodes to do the other work. What do
>>>>> you think about this arrangement?
>>>>>
>>>> Probably a terrible idea. You should identify the cause of the imbalance,
>>>> and fix that.
>>>>
>>>> Mark
>>>>
>>>>
>>>>   Theo
>>>>>
>>>>> On 8/22/2014 7:20 PM, Mark Abraham wrote:
>>>>>
>>>>>>   Hi,
>>>>>>
>>>>>> Because no work will be sent to them. The GPU implementation can
>>>>>> accelerate domains from PP ranks on their node, but with an MPMD setup
>>>>>> that uses dedicated PME nodes, there will be no PP ranks on nodes that
>>>>>> have been set up with only PME ranks. The two offload models (PP work ->
>>>>>> GPU; PME work -> CPU subset) do not work well together, as I said.
>>>>>>
>>>>>> One can devise various schemes in 4.6/5.0 that could use those GPUs,
>>>>>> but they either require
>>>>>> * that each node does both PME and PP work (thus limiting scaling
>>>>>> because of the all-to-all for PME, and perhaps making poor use of
>>>>>> locality on multi-socket nodes), or
>>>>>> * that all nodes have PP ranks, but only some have PME ranks, and the
>>>>>> nodes map their GPUs to PP ranks in a way that is different depending
>>>>>> on whether PME ranks are present (which could work well, but relies on
>>>>>> the DD load-balancer recognizing and taking advantage of the faster
>>>>>> progress of the PP ranks that have better GPU support, requires that
>>>>>> you get your hands very dirty laying out PP and PME ranks onto hardware
>>>>>> in a way that will later match the requirements of the DD load
>>>>>> balancer, and probably requires that you balance the PP-PME load
>>>>>> manually).
>>>>>>
>>>>>> I do not recommend the last approach, because of its complexity.
>>>>>>
>>>>>> Clearly there are design decisions to improve. Work is underway.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 22, 2014 at 10:11 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>    Hi Mark,
>>>>>>>
>>>>>>> Could you tell me why, when we use GPU-CPU nodes as PME-dedicated
>>>>>>> nodes, the GPUs on those nodes will be idle?
>>>>>>>
>>>>>>>
>>>>>>> Theo
>>>>>>>
>>>>>>> On 8/11/2014 9:36 PM, Mark Abraham wrote:
>>>>>>>
>>>>>>>>    Hi,
>>>>>>>>
>>>>>>>> What Carsten said, if running on nodes that have GPUs.
>>>>>>>>
>>>>>>>> If running on a mixed setup (some nodes with GPU, some not), then
>>>>>>>> arranging your MPI environment to place PME ranks on CPU-only nodes
>>>>>>>> is probably worthwhile. For example, all your PP ranks first, mapped
>>>>>>>> to GPU nodes, then all your PME ranks, mapped to CPU-only nodes, and
>>>>>>>> then use mdrun -ddorder pp_pme.
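>>>>>>>> As a sketch (hostnames, slot counts and thread counts here are made
>>>>>>>> up, so adapt them to your system), with e.g. Open MPI you could list
>>>>>>>> the GPU nodes first in a hostfile,
>>>>>>>>
>>>>>>>>   # hosts
>>>>>>>>   gpu01 slots=2
>>>>>>>>   gpu02 slots=2
>>>>>>>>   cpu01 slots=2
>>>>>>>>
>>>>>>>> and then run
>>>>>>>>
>>>>>>>>   mpirun -np 6 -hostfile hosts mdrun_mpi -npme 2 -ddorder pp_pme -ntomp 8
>>>>>>>>
>>>>>>>> so that the four PP ranks (and the GPUs) end up on gpu01/gpu02 and
>>>>>>>> the two PME-only ranks end up on the CPU-only node.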
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Aug 11, 2014 at 2:45 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>     Hi Mark,
>>>>>>>>>
>>>>>>>>> Here is some information about our cluster. Could you give us some
>>>>>>>>> advice regarding our cluster so that we can make GMX run faster on
>>>>>>>>> our system?
>>>>>>>>>
>>>>>>>>> Each CPU node has 2 CPUs, and each GPU node has 2 CPUs and 2 Nvidia
>>>>>>>>> K20M GPUs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Device Name | Device Type | Specifications | Number
>>>>>>>>>
>>>>>>>>> CPU Node | IntelH2216JFFKRNodes | CPU: 2×Intel Xeon E5-2670 (8 Cores,
>>>>>>>>> 2.6GHz, 20MB Cache, 8.0GT); Mem: 64GB (8×8GB) ECC Registered DDR3
>>>>>>>>> 1600MHz Samsung Memory | 332
>>>>>>>>>
>>>>>>>>> Fat Node | IntelH2216WPFKRNodes | CPU: 2×Intel Xeon E5-2670 (8 Cores,
>>>>>>>>> 2.6GHz, 20MB Cache, 8.0GT); Mem: 256GB (16×16GB) ECC Registered DDR3
>>>>>>>>> 1600MHz Samsung Memory | 20
>>>>>>>>>
>>>>>>>>> GPU Node | IntelR2208GZ4GC | CPU: 2×Intel Xeon E5-2670 (8 Cores,
>>>>>>>>> 2.6GHz, 20MB Cache, 8.0GT); Mem: 64GB (8×8GB) ECC Registered DDR3
>>>>>>>>> 1600MHz Samsung Memory | 50
>>>>>>>>>
>>>>>>>>> MIC Node | IntelR2208GZ4GC | CPU: 2×Intel Xeon E5-2670 (8 Cores,
>>>>>>>>> 2.6GHz, 20MB Cache, 8.0GT); Mem: 64GB (8×8GB) ECC Registered DDR3
>>>>>>>>> 1600MHz Samsung Memory | 5
>>>>>>>>>
>>>>>>>>> Computing Network Switch | Mellanox Infiniband FDR Core Switch |
>>>>>>>>> 648× FDR Core Switch MSX6536-10R, Mellanox Unified Fabric Manager | 1
>>>>>>>>>
>>>>>>>>> Computing Network Switch | Mellanox SX1036 40Gb Switch | 36× 40Gb
>>>>>>>>> Ethernet Switch SX1036, 36× QSFP Interface | 1
>>>>>>>>>
>>>>>>>>> Management Network Switch | Extreme Summit X440-48t-10G 2-layer
>>>>>>>>> Switch | 48× 1Giga Switch Summit X440-48t-10G, authorized by
>>>>>>>>> ExtremeXOS | 9
>>>>>>>>>
>>>>>>>>> Management Network Switch | Extreme Summit X650-24X 3-layer Switch |
>>>>>>>>> 24× 10Giga 3-layer Ethernet Switch Summit X650-24X, authorized by
>>>>>>>>> ExtremeXOS | 1
>>>>>>>>>
>>>>>>>>> Parallel Storage | DDN Parallel Storage System | DDN SFA12K Storage
>>>>>>>>> System | 1
>>>>>>>>>
>>>>>>>>> GPU | GPU Accelerator | NVIDIA Tesla Kepler K20M | 70
>>>>>>>>>
>>>>>>>>> MIC | MIC | Intel Xeon Phi 5110P Knights Corner | 10
>>>>>>>>>
>>>>>>>>> 40Gb Ethernet Card | MCX314A-BCBT | Mellanox ConnextX-3 Chip 40Gb
>>>>>>>>> Ethernet Card, 2× 40Gb Ethernet ports, enough QSFP cables | 16
>>>>>>>>>
>>>>>>>>> SSD | Intel SSD910 | Intel SSD910 Disk, 400GB, PCIE | 80
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 8/10/2014 5:50 AM, Mark Abraham wrote:
>>>>>>>>>
>>>>>>>>>>     That's not what I said.... "You can set..."
>>>>>>>>>>
>>>>>>>>>> -npme behaves the same whether or not GPUs are in use. Using
>>>>>>>>>> separate ranks for PME caters to trying to minimize the cost of the
>>>>>>>>>> all-to-all communication of the 3DFFT. That's still relevant when
>>>>>>>>>> using GPUs, but if separate PME ranks are used, any GPUs on nodes
>>>>>>>>>> that only have PME ranks are left idle. The most effective approach
>>>>>>>>>> depends critically on the hardware and simulation setup, and whether
>>>>>>>>>> you pay money for your hardware.
>>>>>>>>>>
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Aug 9, 2014 at 2:56 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>      Hi,
>>>>>>>>>>>
>>>>>>>>>>> You mean that whether or not we use GPU acceleration, -npme is just
>>>>>>>>>>> a reference value? Why can't we set it to an exact value?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 8/9/2014 5:14 AM, Mark Abraham wrote:
>>>>>>>>>>>
>>>>>>>>>>>> You can set the number of PME-only ranks with -npme. Whether it's
>>>>>>>>>>>> useful is another matter :-) The CPU-based PME offload and the
>>>>>>>>>>>> GPU-based PP offload do not combine very well.
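>>>>>>>>>>>> For example (topol.tpr and the binary name are placeholders for
>>>>>>>>>>>> your own setup),
>>>>>>>>>>>>
>>>>>>>>>>>>   mpirun -np 16 mdrun_mpi -npme 4 -s topol.tpr
>>>>>>>>>>>>
>>>>>>>>>>>> gives you 12 PP ranks and 4 PME-only ranks.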
>>>>>>>>>>>>
>>>>>>>>>>>> Mark
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 8, 2014 at 7:24 AM, Theodore Si <sjyzhxw at gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>       Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can we set the number manually with -npme when using GPU
>>>>>>>>>>>>> acceleration?