RFC.4
RP currently starts exactly one pilot agent instance per pilot job. There is no conceptual reason for this (a pilot could in principle host any number of agents, concurrently or sequentially); it is an implementation choice which was made for generality and simplicity. That choice is now limiting for some use cases which (a) need to run different types of workload on the same resource with defined concurrency (co-scheduling pilot jobs would also solve this, but is not widely available), or (b) have workload requirements which vary over the lifetime of a pilot. Additionally, (c) the design also affects scalability through its effect on the pilot agent size.
This RFC proposes a generalization of this design by introducing dynamic pilot partitions: dynamic subdivisions of a pilot job's resources, each of which hosts a pilot agent.
A pilot partition is defined by (a) a subset of the resources acquired by a pilot job, and (b) exactly one pilot agent which manages those resources. Multiple partitions can co-exist on the same pilot job's resources, but no part of the resources is shared between partitions (at least not explicitly). Partitions can further be created and destroyed on demand: their lifetime is thus bounded by, but otherwise independent of, the pilot job lifetime.
RP's resource configuration files currently serve two distinct purposes: (a) configuration of resource access and system stack, and (b) configuration of the agent. RP does, however, also have a notion of an agent configuration, which is where the structure of the agent is defined (number of sub-agents, placement of agent components, etc).
Introducing partitions as an intermediate layer between pilot job and pilot agent will benefit from a clearer separation of resource and agent configuration. The partition itself should not need static configuration files of its own, but only the dynamic, application-specified partition size (and potentially layout).
Partitions are entities with an independent lifetime: they can get created and destroyed dynamically and on demand, and other components, such as the unit manager, will make decisions based on their availability. It thus seems prudent to equip them with a state model. An agent, on the other hand, would currently not benefit from its own state model, as it is always bound to exactly one partition, and its lifetime is also defined and constrained by that partition.
The partition states are as follows:
- NEW: initial state: the application requested the partition to be created
- PENDING: the partition request is published, RP is waiting for the pilot to become active and to pick up that request
- STARTING: the pilot is creating the partition and starts the agent on it
- ACTIVE: the partition is alive, the agent can execute units
- DONE | FAILED | CANCELED: final states
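The state model above could be encoded as a small transition check. The following is a minimal sketch only: the names `PartitionState` and `can_advance` are hypothetical, and the rule that any non-final state may move directly to FAILED or CANCELED is an assumption which this RFC does not state.

```python
from enum import Enum


class PartitionState(Enum):
    NEW      = 'NEW'
    PENDING  = 'PENDING'
    STARTING = 'STARTING'
    ACTIVE   = 'ACTIVE'
    DONE     = 'DONE'
    FAILED   = 'FAILED'
    CANCELED = 'CANCELED'


FINAL = {PartitionState.DONE, PartitionState.FAILED, PartitionState.CANCELED}

# regular progression, in the order the states are listed above
_ORDER = [PartitionState.NEW, PartitionState.PENDING,
          PartitionState.STARTING, PartitionState.ACTIVE,
          PartitionState.DONE]


def can_advance(current, target):
    """Return True if `current -> target` is legal under the assumed rules."""
    if current in FINAL:
        return False          # final states are never left
    if target in (PartitionState.FAILED, PartitionState.CANCELED):
        return True           # assumption: reachable from any non-final state
    # otherwise only the next state of the regular progression is allowed
    return _ORDER.index(target) == _ORDER.index(current) + 1
```

A component such as the unit manager could use a check like this to reject out-of-order state updates before acting on them.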
RP users struggle to remember to add agent resource requirements to pilot job requests. Calculating partition sizes will only aggravate that problem. The addition of partitions thus presents an opportunity, but also a need, to reconsider that scheme. The obvious alternative is to automatically add the agent requirements to partition requests. A corollary of this approach is that the initial pilot job size is determined by RP, based on the initial partition specifications and their respective pilot configurations.
Consider though the following case: a pilot with two 30-node partitions is submitted to a Cray and becomes ACTIVE. On a Cray, each agent requires at least one individual node, so the pilot job will run with 62 nodes. After some time, the application requests to terminate both partitions, and instead requests 3 partitions with 20 nodes each. The pilot would now need to have 63 nodes in order to serve this request.
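The arithmetic of this example can be spelled out in a few lines. The helper name `pilot_nodes` is hypothetical, and the default of one agent node per partition only mirrors the Cray case described above.

```python
def pilot_nodes(partition_nodes, agent_nodes=1):
    """Total pilot job size: each partition's nodes plus its agent's nodes.

    `agent_nodes=1` mirrors the Cray case above; it is an illustrative
    default, not a general RP constant.
    """
    return sum(n + agent_nodes for n in partition_nodes)


initial   = pilot_nodes([30, 30])        # two 30-node partitions
requested = pilot_nodes([20, 20, 20])    # three 20-node partitions

print(initial, requested, requested - initial)   # 62 63 1
```

The same 60 workload nodes thus need one node more once they are split three ways, and that extra node is not available in the running pilot job.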
There does not seem to be any solution to this problem which can completely avoid abstraction leakage of the agent requirements. One way to alleviate that problem is to additionally allow for more abstract specifications, like "50%", "all GPUs", "remaining" (meaning all resources which are not yet used by any other partition), etc, where the RP-internal calculation of the actual partition size can again hide the agent requirements. Note that this problem will likely only pop up in a minority of cases, i.e., those where (i) the number of partitions changes over the pilot lifetime, and (ii) the agent indeed has additional resource requirements. Upper layer libraries like EnTK will also be able to reduce the scope of this problem.
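How such abstract specifications might hide the agent requirements is sketched below. This is an illustration under assumptions, not RP code: the function name `resolve_partition`, the node-based accounting and the choice to take the agent's nodes out of a relative share (but to add them on top of an explicit node count) are all hypothetical.

```python
def resolve_partition(spec, pilot_nodes, used_nodes=0, agent_nodes=1):
    """Translate a partition spec into (total_nodes, workload_nodes).

    spec: an int (workload nodes requested explicitly), a percentage
          string such as '50%', or the string 'remaining'.
    """
    free = pilot_nodes - used_nodes
    if isinstance(spec, int):
        total = spec + agent_nodes       # agent cost is added on top
    elif spec == 'remaining':
        total = free                     # agent cost is taken from the share
    elif isinstance(spec, str) and spec.endswith('%'):
        total = pilot_nodes * int(spec[:-1]) // 100
    else:
        raise ValueError('unsupported partition spec: %r' % (spec,))

    if total > free or total <= agent_nodes:
        raise ValueError('cannot place partition %r: %d of %d nodes free'
                         % (spec, free, pilot_nodes))
    return total, total - agent_nodes
```

On the 62-node pilot from the example above, `resolve_partition('50%', 62)` yields a 31-node partition with 30 workload nodes, whereas a third explicit 20-node request (`resolve_partition(20, 62, used_nodes=42)`) raises an error because it would need 21 nodes and only 20 are free.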
Below is a proposal on how partitions are created and managed on the API level:
    import radical.pilot as rp

    # NOTE: ComputePartitionDescription, pd.parts, pilot.partitions,
    #       pilot.reconfig(), rp.FILL and rp.ALL are proposed by this RFC

    # describe two partitions
    pd1 = rp.ComputePartitionDescription()
    pd1.cores  = 16
    pd1.gpus   = 4
    pd1.config = ['orte']   # this selects an agent config

    pd2 = rp.ComputePartitionDescription()
    pd2.cores  = 8
    pd2.gpus   = 2
    pd2.config = ['aprun']

    # request a pilot which initially hosts both partitions
    pd = rp.ComputePilotDescription()
    pd.resource = "local.localhost"
    pd.parts    = [pd1, pd2]
    pd.runtime  = 60

    session = rp.Session()
    pm      = rp.PilotManager(session=session)
    pilot   = pm.submit_pilots(pd)

    # inspect the partitions
    parts = pilot.partitions
    for p in parts:
        print('%10s [%3d : %3d]: %s' % (p.uid, p.cores, p.gpus, p.state))

    # stop and start partitions, separately or in one call
    pilot.reconfig(stop=parts[0].uid)
    pilot.reconfig(start=[pd2, pd2])
    pilot.reconfig(stop=parts[0].uid,
                   start=[pd2, pd2])

    # abstract partition sizes
    pilot.reconfig(stop=parts[0].uid,
                   start=[rp.ComputePartitionDescription("50%")])
    pilot.reconfig(start=[rp.ComputePartitionDescription(rp.FILL)])

    # replace all partitions at once
    pilot.reconfig(stop=rp.ALL, start=[pd2, pd2])            # WARNING (underutilized)
    pilot.reconfig(stop=rp.ALL, start=[pd2, pd2, pd2])       # OK
    pilot.reconfig(stop=rp.ALL, start=[pd2, pd2, pd2, pd2])  # ERROR (overutilized)
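The WARNING / OK / ERROR outcomes in the last three calls suggest a capacity check along the following lines. This is a sketch under assumptions: `check_reconfig` is a hypothetical helper, partitions are reduced to plain dictionaries, the pilot capacity is taken to be the sum of the initial partitions (16 + 8 cores, 4 + 2 GPUs), and agent overhead is ignored.

```python
def check_reconfig(capacity, requests):
    """Classify a set of partition requests against the pilot's capacity.

    `capacity` and each request are dicts such as {'cores': 24, 'gpus': 6}.
    Returns 'ERROR' if any resource is oversubscribed, 'WARNING' if any
    resource is left partly unused, and 'OK' on an exact fit.
    """
    used = {k: sum(r.get(k, 0) for r in requests) for k in capacity}
    if any(used[k] > capacity[k] for k in capacity):
        return 'ERROR'
    if any(used[k] < capacity[k] for k in capacity):
        return 'WARNING'
    return 'OK'


pd1   = {'cores': 16, 'gpus': 4}
pd2   = {'cores':  8, 'gpus': 2}
pilot = {'cores': 24, 'gpus': 6}     # sized for pd1 + pd2, agents ignored

for n in (2, 3, 4):
    print(n, check_reconfig(pilot, [pd2] * n))
# 2 WARNING
# 3 OK
# 4 ERROR
```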
We do have a communication channel in place which can communicate commands from the PMGR to the pilot agent. That channel can also be used to communicate partition management commands.
An additional layer is needed between bootstrap_1 (first level in Python) and bootstrap_2 (the process which owns agent_0). That layer will receive and enact the respective partition management commands.
The agent's RM layer will need an additional method which can subdivide a pilot job's resource set into partitions, according to the partition descriptions carried by the above requests. Since the RM proper is owned by the agent, which does not exist at that point, that functionality will have to be exposed as a static method. It can borrow large parts of its code from the RM's initialize() method, but initialize() in turn will have to be adapted to pick up partition configurations instead of pilot job configurations.
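A static partitioning method of this kind might look roughly as follows. The sketch makes several assumptions: `ResourceManager` is a stand-in and not RP's actual RM class, `partition_nodes` is a hypothetical name, and resources are modelled as a flat list of node names which is split contiguously.

```python
class ResourceManager(object):
    """Stand-in for the agent's RM layer; not RP's actual class."""

    @staticmethod
    def partition_nodes(node_list, sizes):
        """Split `node_list` into disjoint partitions of the given sizes.

        Returns (partitions, leftover), where `partitions` is a list of
        node lists and `leftover` holds the nodes not assigned to any
        partition. No agent or RM instance is needed to call this.
        """
        if sum(sizes) > len(node_list):
            raise ValueError('requested %d nodes, only %d available'
                             % (sum(sizes), len(node_list)))
        partitions = []
        offset = 0
        for size in sizes:
            partitions.append(node_list[offset:offset + size])
            offset += size
        return partitions, node_list[offset:]


nodes = ['node%02d' % i for i in range(6)]
parts, rest = ResourceManager.partition_nodes(nodes, [2, 3])
print(parts)   # [['node00', 'node01'], ['node02', 'node03', 'node04']]
print(rest)    # ['node05']
```

Because the method is static, the layer between bootstrap_1 and bootstrap_2 could call it before any agent exists, and then hand each resulting node list to the agent it starts for that partition.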