Another aspect to consider is running periodic tests on the GPUs, such as hgemm and igemm benchmarks and any peer-to-peer tests, to detect GPUs whose performance degrades over time. In addition to running these at the beginning of a job, we could rerun them at a fixed interval, for example after every checkpoint.
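As a rough illustration of the periodic check, here is a minimal Python sketch. It assumes a benchmark callable that returns per-GPU GEMM throughput in TFLOP/s. The names `find_slow_gpus` and `check_after_checkpoint`, the 10% tolerance, and the median baseline are all hypothetical choices for this example, not something taken from our existing tooling.

```python
from statistics import median
from typing import Callable, Dict, List


def find_slow_gpus(tflops_by_gpu: Dict[int, float], tolerance: float = 0.10) -> List[int]:
    """Return indices of GPUs whose throughput is more than `tolerance` below the median.

    The median baseline and the 10% default are assumptions for illustration.
    """
    if not tflops_by_gpu:
        return []
    baseline = median(tflops_by_gpu.values())
    cutoff = baseline * (1.0 - tolerance)
    return sorted(gpu for gpu, tflops in tflops_by_gpu.items() if tflops < cutoff)


def check_after_checkpoint(
    step: int,
    checkpoint_interval: int,
    run_benchmark: Callable[[], Dict[int, float]],
) -> List[int]:
    """Run the benchmark only on checkpoint steps and report slow GPUs.

    `run_benchmark` is a placeholder for whatever hgemm/igemm test is used;
    it is expected to return {gpu_index: measured_tflops}.
    """
    if step == 0 or step % checkpoint_interval != 0:
        return []  # not a checkpoint step, skip the benchmark
    return find_slow_gpus(run_benchmark())
```

Comparing against the median of the GPUs in the same job avoids hard-coding an expected number per GPU model, at the cost of missing the case where every GPU on the node is equally slow.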
Per #37:
Separated the issues to better track progress.
We have GEMM benchmarks as part of the node performance overview. I can adapt these tests to be run as part of a pre-execution hook that we run during our job submissions and to target other systems.
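A pre-execution hook along these lines could look roughly like the sketch below. It times a GEMM callable, converts the best run to GFLOP/s, and maps the result to an exit code that a job submission wrapper could act on. Everything here is an assumption for illustration: `naive_gemm` is a pure-Python stand-in for the real benchmark, and `measure_gflops`, `pre_execution_hook`, and the threshold value are hypothetical names and numbers.

```python
import time
from typing import Callable, List


def naive_gemm(n: int) -> List[List[float]]:
    """Stand-in workload: multiply two n x n matrices of ones in pure Python."""
    a = [[1.0] * n for _ in range(n)]
    b = [[1.0] * n for _ in range(n)]
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def measure_gflops(gemm: Callable[[], object], flops_per_call: float, repeats: int = 5) -> float:
    """Time `gemm` several times and return GFLOP/s based on the fastest run.

    A real GPU benchmark would also need to synchronize the device before
    reading the timer; that step is omitted in this CPU-only sketch.
    """
    gemm()  # warm-up run, not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        gemm()
        best = min(best, time.perf_counter() - start)
    return flops_per_call / max(best, 1e-12) / 1e9


def pre_execution_hook(gflops: float, min_gflops: float) -> int:
    """Return a process exit code: 0 if the node meets the threshold, 1 otherwise."""
    return 0 if gflops >= min_gflops else 1


if __name__ == "__main__":
    n = 64
    measured = measure_gflops(lambda: naive_gemm(n), flops_per_call=2.0 * n**3)
    # The threshold below is arbitrary; a real hook would use a per-system value.
    raise SystemExit(pre_execution_hook(measured, min_gflops=0.001))
```

Keeping the pass/fail decision in a separate function from the measurement makes it easier to swap in a different benchmark or a per-system threshold when targeting other machines.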