Document Kokkos >= 4.2.x, <= 4.5.x issues with Cray MPICH and CUDA async memory allocations on Polaris #489

Open
felker opened this issue Sep 25, 2024 · 2 comments


felker commented Sep 25, 2024

LAMMPS, AthenaK, XGC, and other Kokkos-based applications built against Kokkos 4.2.00 (released Nov 2023) or later are affected by an incompatibility between Cray MPICH (based on an older UCX) and a CMake option that became enabled by default in that release:

-DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=ON

A runtime error is thrown by CUDA-aware Cray MPICH if you try to use Kokkos with that option enabled:

(GTL DEBUG: 2) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 148
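
When triaging job output, it can help to scan logs for this signature. The sketch below is a hypothetical helper, not part of Kokkos or Cray MPICH. It assumes the message matches the single line quoted above, and that the GTL debug level and trailing line number may differ between versions.

```python
import re

# Pattern modeled on the one error line quoted in this issue. The debug level
# is matched loosely and the trailing "line no ..." part is deliberately left
# out, on the assumption that both can vary between GTL versions.
GTL_IPC_ERROR = re.compile(
    r"\(GTL DEBUG: \d+\) cuIpcGetMemHandle: invalid argument, "
    r"CUDA_ERROR_INVALID_VALUE"
)


def find_gtl_ipc_errors(log_text):
    """Return the lines of log_text that contain the cuIpcGetMemHandle failure."""
    return [line for line in log_text.splitlines() if GTL_IPC_ERROR.search(line)]


if __name__ == "__main__":
    sample = (
        "rank 0: starting\n"
        "(GTL DEBUG: 2) cuIpcGetMemHandle: invalid argument, "
        "CUDA_ERROR_INVALID_VALUE, line no 148\n"
    )
    for hit in find_gtl_ipc_errors(sample):
        print(hit)
```

A match only indicates that the symptom described here occurred; it does not by itself confirm which Kokkos option caused it.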

I assume the three prebuilt Kokkos modules on Polaris were all compiled with that option disabled, @zippylab?

   kokkos/4.2.01/shared/PrgEnv-gnu/8.5.0/gnu/12.3/cuda_cudatoolkit_12.2.91
   kokkos/4.2.01/shared/PrgEnv-gnu/8.5.0/gnu/12.3/cuda_cudatoolkit_12.3.2
   kokkos/4.3.01_shared_PEg8.5.0_cv12.3_ct12.2.91                          (D)

There is a discussion about potentially reverting the default in 4.5.x: kokkos/kokkos#7353

@zippylab
Contributor

The kokkos/4.3.01_shared_PEg8.5.0_cv12.3_ct12.2.91 module was built with the CMake flag -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF.

The two kokkos/4.2.01 modules were built with the default, -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=ON. I believe those will have trouble if you enable GPU-aware MPICH.
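
For a Kokkos build whose build directory is still available, one way to check how it was configured is to read the value back from CMakeCache.txt. This is a minimal sketch: `malloc_async_setting` is a hypothetical helper name, and it assumes the option is recorded in the usual `NAME:TYPE=VALUE` cache format.

```python
import re
import sys

OPTION = "Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC"

# CMake cache entries have the form NAME:TYPE=VALUE, one per line.
_ENTRY = re.compile(r"^" + re.escape(OPTION) + r":[A-Z]+=(.*)$", re.MULTILINE)


def malloc_async_setting(cache_text):
    """Return the cached value of the option (e.g. "ON" or "OFF"), or None.

    None only means the entry was not found in the given text; it does not
    tell you which default the build actually used.
    """
    match = _ENTRY.search(cache_text)
    return match.group(1).strip() if match else None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path is an example): python check_cache.py build/CMakeCache.txt
    with open(sys.argv[1]) as f:
        print(OPTION, "=", malloc_async_setting(f.read()))
```

For the prebuilt modules the build tree may not be accessible, in which case the settings reported in this thread are the reference.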


felker commented Oct 4, 2024

Perhaps we should mention that in all three .lua modulefiles, in addition to the user guide.
