RFC.4
RP currently starts exactly one pilot agent instance per pilot job. There is no conceptual reason for this: a pilot could in principle host any number of agents, concurrently or sequentially. It is an implementation choice that was made for generality and simplicity. That choice now limits use cases which (a) need to run different types of workload on the same resource with defined concurrency (co-scheduling pilot jobs would also solve this, but is not widely available), or (b) have workload requirements which vary over the lifetime of a pilot. Additionally, (c) the design affects scalability because of its effect on the size of the pilot agent.
This RFC proposes a generalization of this design by introducing pilot partitions: dynamic subdivisions of a pilot job's resources, each of which hosts a pilot agent.
A pilot partition is defined by (a) a subset of the resources acquired by a pilot job, and (b) exactly one pilot agent which manages those resources. Multiple partitions can co-exist on a pilot job's resources, but no part of the resources is shared between partitions (at least not explicitly). Partitions can be created and destroyed on demand: their lifetime is thus bounded by, but otherwise independent of, the pilot job lifetime.
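The definition above can be illustrated with a minimal Python sketch. It is not RP code: the class `PilotResources`, its methods, and the node names are hypothetical, and it assumes that resources are tracked at node granularity. It only shows the invariant that partitions are disjoint subsets of a pilot job's resources and can be created and destroyed independently.

```python
from dataclasses import dataclass, field


@dataclass
class PilotResources:
    """Hypothetical stand-in for the nodes acquired by one pilot job."""

    nodes: frozenset
    # partition id -> nodes owned by that partition (assumed bookkeeping)
    partitions: dict = field(default_factory=dict)

    def free_nodes(self):
        """Return the nodes not currently owned by any partition."""
        used = set()
        for part_nodes in self.partitions.values():
            used |= part_nodes
        return set(self.nodes) - used

    def create_partition(self, pid, nodes):
        """Create a partition on demand; its nodes must be free pilot nodes."""
        nodes = frozenset(nodes)
        if pid in self.partitions:
            raise ValueError(f"partition {pid!r} already exists")
        if not nodes:
            raise ValueError("a partition needs at least one node")
        if not nodes <= self.free_nodes():
            raise ValueError("nodes are unknown or owned by another partition")
        self.partitions[pid] = nodes

    def destroy_partition(self, pid):
        """Destroy a partition; its nodes become available again."""
        del self.partitions[pid]


# Usage: two partitions co-exist on one pilot without sharing nodes.
pilot = PilotResources(nodes=frozenset({"n0", "n1", "n2", "n3"}))
pilot.create_partition("p0", {"n0", "n1"})
pilot.create_partition("p1", {"n2"})
```

In this sketch the one agent per partition is left out; it would be started on the partition's nodes once the partition exists.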
RP's resource configuration files currently serve two distinct purposes: (a) configuration of resource access and system stack, and (b) configuration of the agent. RP does, however, also have a notion of an agent configuration, which defines the structure of the agent (number of sub-agents, placement of agent components, etc.).
Introducing partitions as an intermediate layer between pilot job and pilot agent will benefit from a clearer separation of resource and agent configuration. The partition itself should not need static configuration files of its own, only the dynamic, application-specified partition size (and potentially layout).
Partitions are entities with an independent lifetime: they can be created and destroyed dynamically and on demand, and other components, such as the unit manager, will make decisions based on their availability. It therefore seems prudent to equip them with a state model. An agent, on the other hand, would currently not benefit from its own state model, as it is always bound to exactly one partition, and its lifetime is defined and constrained by that partition.
The partition states are as follows:
- NEW: initial state: the application requested the partition to be created
- PENDING: the partition request is published, RP is waiting for the pilot to become active and to pick up that request
- STARTING: the pilot is creating the partition and starts the agent on it
- ACTIVE: the partition is alive, the agent can execute units
- DONE | FAILED | CANCELED: final states
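The state model can be sketched in Python as follows. The state names come from the list above; everything else is an assumption made for illustration: that the non-final states are traversed strictly in the listed order, that any non-final state may move to `FAILED` or `CANCELED`, and the names `PartitionState`, `Partition`, and `is_valid_transition`, none of which are part of RP.

```python
from enum import Enum


class PartitionState(Enum):
    NEW = "NEW"
    PENDING = "PENDING"
    STARTING = "STARTING"
    ACTIVE = "ACTIVE"
    DONE = "DONE"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


S = PartitionState
FINAL = {S.DONE, S.FAILED, S.CANCELED}

# Assumed regular progression: each non-final state has one successor.
_NEXT = {
    S.NEW: S.PENDING,
    S.PENDING: S.STARTING,
    S.STARTING: S.ACTIVE,
    S.ACTIVE: S.DONE,
}


def is_valid_transition(src, dst):
    """Check a transition under the assumptions stated above."""
    if src in FINAL:
        return False  # final states are never left
    if dst in (S.FAILED, S.CANCELED):
        return True   # assumed: failure or cancellation from any live state
    return _NEXT.get(src) is dst


class Partition:
    """Hypothetical partition handle that only tracks its state."""

    def __init__(self, pid):
        self.pid = pid
        self.state = S.NEW

    def advance(self, dst):
        if not is_valid_transition(self.state, dst):
            raise ValueError(f"{self.state.name} -> {dst.name} is not allowed")
        self.state = dst

    @property
    def can_execute_units(self):
        # Only an ACTIVE partition has a running agent that accepts units.
        return self.state is S.ACTIVE
```

A component such as the unit manager could consult something like `can_execute_units` before assigning units to a partition.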