[gmx-users] Fw: cudaFuncGetAttributes failed: out of memory

Szilárd Páll pall.szilard at gmail.com
Wed Feb 26 16:29:10 CET 2020


Hi,

Indeed, there is an issue with the GPU detection code's consistency checks
that trip and abort the run if any of the detected GPUs behaves in
unexpected ways (e.g. runs out of memory during checks).
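To make the failure mode concrete, here is a schematic sketch (not GROMACS's actual code; the device names, memory numbers, and helper functions are all illustrative) of why a single exhausted GPU can abort the whole run: the detection pass probes every device and treats any probe failure as fatal, instead of simply marking that device unusable.

```python
# Schematic sketch (NOT GROMACS source) of abort-on-any-failure detection
# versus tolerant detection that excludes the failing device.

class ProbeError(RuntimeError):
    """Stands in for a CUDA error such as the reported out-of-memory."""

def probe(free_mem_mib, needed_mib=64):
    # Stand-in for cudaFuncGetAttributes-style checks, which need a small
    # amount of free device memory to succeed.
    if free_mem_mib < needed_mib:
        raise ProbeError("out of memory")

def detect_fatal(devices):
    # Pre-fix behavior: any failing device aborts detection entirely.
    for dev_id, free in devices.items():
        probe(free)                     # raises -> whole run aborts
    return list(devices)

def detect_tolerant(devices):
    # Post-fix behavior: failing devices are excluded, the rest stay usable.
    usable = []
    for dev_id, free in devices.items():
        try:
            probe(free)
            usable.append(dev_id)
        except ProbeError:
            pass                        # skip this device, keep going
    return usable

if __name__ == "__main__":
    # Mirrors the nvidia-smi listing in this thread: GPUs 0 and 2 nearly
    # full, GPUs 1 and 3 idle (free memory in MiB, illustrative values).
    gpus = {0: 16, 1: 16000, 2: 8, 3: 16000}
    try:
        detect_fatal(gpus)
    except ProbeError as e:
        print("fatal:", e)              # what the user saw
    print("usable:", detect_tolerant(gpus))
```

Under this sketch, hiding the full devices (as the poster did with CUDA_VISIBLE_DEVICES) removes them from the loop entirely, which is why the workaround succeeds even though `-gpu_id` alone does not.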

This should be fixed in an upcoming release, but until then, as you have
observed, you can always restrict the set of GPUs exposed to GROMACS with
the CUDA_VISIBLE_DEVICES environment variable.
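For example (the device indices 1 and 3 are taken from the nvidia-smi listing quoted below, and the mdrun flags from the original command; adjust both to your own setup):

```shell
# Expose only the idle GPUs (1 and 3 here) to GROMACS. CUDA renumbers the
# visible devices starting from 0, so -gpu_id now refers to that renumbered
# set: -gpu_id 0 selects physical GPU 1.
CUDA_VISIBLE_DEVICES=1,3 gmx mdrun -deffnm pull -ntmpi 1 -nb gpu -pme gpu -gpu_id 0
```

Note the renumbering: once CUDA_VISIBLE_DEVICES is set, the hidden devices do not exist as far as the process is concerned, which is why this sidesteps the detection checks entirely.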

Cheers,


--
Szilárd


On Sun, Feb 23, 2020 at 7:51 AM bonjour899 <bonjour899 at 126.com> wrote:

> I think I've temporarily solved this problem. Only when I use
> CUDA_VISIBLE_DEVICES to hide the GPUs whose memory is almost fully
> occupied can I run GROMACS smoothly (using -gpu_id alone does not help).
> I think there may be a bug in GROMACS's GPU usage model in a multi-GPU
> environment: as long as one of the GPUs is fully occupied, GROMACS cannot
> submit work to any GPU and returns the error "cudaFuncGetAttributes
> failed: out of memory".
>
>
>
> Best regards,
> W
>
>
>
>
> -------- Forwarding messages --------
> From: "bonjour899" <bonjour899 at 126.com>
> Date: 2020-02-23 11:32:53
> To:  gromacs.org_gmx-users at maillist.sys.kth.se
> Subject: [gmx-users] cudaFuncGetAttributes failed: out of memory
> I also tried restricting the run to different GPUs using -gpu_id, but got
> the same error each time. I've also posted my question at
> https://devtalk.nvidia.com/default/topic/1072038/cuda-programming-and-performance/cudafuncgetattributes-failed-out-of-memory/
> Following is the output of nvidia-smi:
>
>
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |===============================+======================+======================|
> |   0  Tesla P100-PCIE...   On  | 00000000:04:00.0 Off |                    0 |
> | N/A   35C    P0    34W / 250W | 16008MiB / 16280MiB  |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   1  Tesla P100-PCIE...   On  | 00000000:06:00.0 Off |                    0 |
> | N/A   35C    P0    28W / 250W |    10MiB / 16280MiB  |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   2  Tesla P100-PCIE...   On  | 00000000:07:00.0 Off |                    0 |
> | N/A   35C    P0    33W / 250W | 16063MiB / 16280MiB  |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   3  Tesla P100-PCIE...   On  | 00000000:08:00.0 Off |                    0 |
> | N/A   36C    P0    29W / 250W |    10MiB / 16280MiB  |      0%      Default |
> +-------------------------------+----------------------+----------------------+
> |   4  Quadro P4000         On  | 00000000:0B:00.0 Off |                  N/A |
> |  46%  27C    P8     8W / 105W |    12MiB /  8119MiB  |      0%      Default |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID   Type   Process name                             Usage      |
> |=============================================================================|
> |    0     20497      C   /usr/bin/python3                            5861MiB |
> |    0     24503      C   /usr/bin/python3                           10137MiB |
> |    2     23162      C   /home/appuser/Miniconda3/bin/python        16049MiB |
> +-----------------------------------------------------------------------------+
>
> -------- Forwarding messages --------
> From: "bonjour899" <bonjour899 at 126.com>
> Date: 2020-02-20 10:30:36
> To: "gromacs.org_gmx-users at maillist.sys.kth.se" <
> gromacs.org_gmx-users at maillist.sys.kth.se>
> Subject: cudaFuncGetAttributes failed: out of memory
>
> Hello,
>
>
> I have encountered a weird problem. I've been using GROMACS with GPU on a
> server and it has always performed well. However, when I reran a job today
> I suddenly got this error:
>
> Command line:
>   gmx mdrun -deffnm pull -ntmpi 1 -nb gpu -pme gpu -gpu_id 3
>
> Back Off! I just backed up pull.log to ./#pull.log.1#
>
> -------------------------------------------------------
> Program: gmx mdrun, version 2019.4
> Source file: src/gromacs/gpu_utils/gpu_utils.cu (line 100)
>
> Fatal error:
> cudaFuncGetAttributes failed: out of memory
>
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
> It seems the GPU shows 0% utilization and I can run other GPU applications,
> but I cannot run GROMACS mdrun anymore, not even an energy minimization.
>
> --
> Gromacs Users mailing list
>
> * Please search the archive at
> http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before
> posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-request at gromacs.org.


More information about the gromacs.org_gmx-users mailing list