GPU support
RP should support applications that utilize GPUs on the target resources. That support consists of two main components:
- extend the RP API to express GPU requirements for CUs
- extend the UMGR scheduler, the agent scheduler, and the agent executor to correctly place and execute such CUs
The first part can initially be as simple as adding a `gpus` requirement to the unit description, which would immediately also support mixed CPU/GPU units:
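```python
import radical.pilot as rp

cud = rp.ComputeUnitDescription()
cud.executable = 'sander.GPU'  # GPU build of the application
cud.cores = 1                  # CPU cores, as already supported
cud.gpus = 1                   # proposed new attribute: number of GPUs
```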
On the UMGR scheduler side, the late binding scheduler needs to perform the same kind of bookkeeping for GPUs as it already does for CPU cores. Any more fine-grained scheduling (such as CPU/GPU co-scheduling) should remain in the agent scheduler.
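The GPU bookkeeping described above can be sketched as follows. This is a minimal illustration, not RP code: `PilotCapacity`, `UnitRequest`, and `LateBindingScheduler` are hypothetical names, and a unit is bound to the first active pilot that has enough free cores *and* free GPUs. Node-level placement is deliberately left to the agent scheduler.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PilotCapacity:
    """Hypothetical per-pilot bookkeeping record (not part of the RP API)."""
    free_cores: int
    free_gpus: int


@dataclass
class UnitRequest:
    """Hypothetical view of a unit's resource requirements."""
    uid: str
    cores: int
    gpus: int = 0


class LateBindingScheduler:
    """Sketch: track free cores and GPUs per pilot, bind units late."""

    def __init__(self, pilots: Dict[str, PilotCapacity]) -> None:
        self.pilots = pilots
        self.bound: Dict[str, str] = {}  # unit uid -> pilot id

    def try_bind(self, unit: UnitRequest) -> Optional[str]:
        # Pick the first pilot with enough free cores AND enough free GPUs.
        for pid, cap in self.pilots.items():
            if cap.free_cores >= unit.cores and cap.free_gpus >= unit.gpus:
                cap.free_cores -= unit.cores
                cap.free_gpus -= unit.gpus
                self.bound[unit.uid] = pid
                return pid
        return None  # no capacity right now; the unit stays in the wait queue

    def release(self, unit: UnitRequest) -> None:
        # Return the unit's cores and GPUs to its pilot once the unit completes.
        pid = self.bound.pop(unit.uid)
        self.pilots[pid].free_cores += unit.cores
        self.pilots[pid].free_gpus += unit.gpus
```

For example, with one CPU-only pilot and one pilot that has two GPUs, a unit requesting one GPU skips the first pilot and binds to the second, while a CPU-only unit can still land on the first.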
The agent scheduler is probably the crucial part: its internal bookkeeping and scheduling data structures need to change to accommodate GPUs. Note that the agent scheduler is due for a revamp anyway: it is currently a performance bottleneck because searching and updating nested Python data structures is inefficient. This should improve by switching to the bitarray implementation exposed by the radical.utils scheduler. GPU support should be integrated along the same lines, using a second bitarray that maps the cluster's GPU layout.
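The two-bitarray idea can be sketched as below, assuming a homogeneous cluster with a fixed number of cores and GPUs per node. Plain Python integers stand in for the bitarray type (this is not the radical.utils interface), and `BitmapScheduler`, `allocate`, and `release` are hypothetical names. Bit `i` is set when slot `i` is busy.

```python
from typing import List, Optional, Tuple


class BitmapScheduler:
    """Sketch: one bitmap for CPU cores, a second one for GPUs.

    Python ints stand in for a bitarray; bit i is 1 when slot i is busy.
    """

    def __init__(self, nodes: int, cores_per_node: int, gpus_per_node: int) -> None:
        self.nodes = nodes
        self.cpn = cores_per_node
        self.gpn = gpus_per_node
        self.core_map = 0  # bit (node * cpn + core) is set when that core is busy
        self.gpu_map = 0   # bit (node * gpn + gpu) is set when that GPU is busy

    @staticmethod
    def _free_slots(bitmap: int, node: int, per_node: int) -> List[int]:
        # Global bit indices of the free slots on one node.
        base = node * per_node
        return [base + i for i in range(per_node) if not (bitmap >> (base + i)) & 1]

    def allocate(self, cores: int, gpus: int) -> Optional[Tuple[int, List[int], List[int]]]:
        # Place the whole unit on one node so its cores and GPUs are co-located.
        for node in range(self.nodes):
            free_c = self._free_slots(self.core_map, node, self.cpn)
            free_g = self._free_slots(self.gpu_map, node, self.gpn)
            if len(free_c) >= cores and len(free_g) >= gpus:
                chosen_c, chosen_g = free_c[:cores], free_g[:gpus]
                for bit in chosen_c:
                    self.core_map |= 1 << bit
                for bit in chosen_g:
                    self.gpu_map |= 1 << bit
                return node, chosen_c, chosen_g
        return None  # no single node can host the unit right now

    def release(self, core_bits: List[int], gpu_bits: List[int]) -> None:
        # Clear the bits of a finished unit in both bitmaps.
        for bit in core_bits:
            self.core_map &= ~(1 << bit)
        for bit in gpu_bits:
            self.gpu_map &= ~(1 << bit)
```

Restricting each unit to a single node keeps CPU/GPU co-scheduling trivial in this sketch; a real agent scheduler would also need to handle multi-node units.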
For the agent executor, we will initially make the simple assumption that the CU description accounts for any system-specific peculiarities by selecting the correct executable. More advanced support should follow the integration of application kernels, which will also be needed once we intend to support dynamic selection between CPU and GPU CUs; we consider that out of scope initially.