Overview
Andre Merzky edited this page Jul 19, 2013
The goal of saga-pilot is to develop a new pilot framework implementation that can be used:
- by the RADICAL group to design and execute distributed computing experiments
- by production users to run large-scale workloads across national and international cyber-infrastructures

The framework should:
- provide the well-known pilot job semantics (P*, BigJob)
- be suitable as the underpinning pilot layer for AIMES (and thus TROY)
- be suitable as a pilot framework for production infrastructures, while operating fully in user space
- keep the code simple (as simple as possible, but no simpler)
- be easy to maintain
- support different deployment modes:
  - single user / single application: modules bound to the application
  - user community, multiple applications: services with user / application sessions
- be cleanly extensible toward:
  - different backends (pluggable agents)
  - different workloads (pluggable schedulers)
  - compatibility with the Pilot-API
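For reference, the pilot job semantics mentioned above (as in P* / BigJob) follow a two-level pattern: a pilot acquires resources once, and compute units are then scheduled onto that pilot without further round-trips through the batch system. A minimal, self-contained sketch of that pattern; all class and method names here are illustrative, not the actual Pilot-API:

```python
class Pilot:
    """Placeholder resource: acquired once, then reused for many units."""
    def __init__(self, resource, cores):
        self.resource = resource
        self.cores = cores
        self.free = cores       # cores not yet claimed by compute units

class PilotService:
    def create_pilot(self, resource, cores):
        # in a real system, this submits exactly one job to the batch queue
        return Pilot(resource, cores)

class UnitService:
    """Schedules compute units onto an existing pilot (no batch queue wait)."""
    def __init__(self, pilot):
        self.pilot = pilot
        self.units = []

    def submit(self, executable, cores=1):
        if cores > self.pilot.free:
            raise RuntimeError("pilot has no free cores")
        self.pilot.free -= cores
        self.units.append(executable)
        return len(self.units) - 1

# usage: one pilot, many units, no further batch system interaction
ps    = PilotService()
pilot = ps.create_pilot("stampede.tacc.utexas.edu", cores=16)
us    = UnitService(pilot)
for _ in range(4):
    us.submit("/bin/date")
# pilot.free is now 12; 4 units were scheduled without touching the batch system
```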
Explicit non-goals:
- no multi-user pilots (they are mostly useless without mapping to system accounting and without sandboxing, both of which would require system-layer functionality)
- no credential delegation (this seems too hard for us to attempt at the moment; we will need other means to solve the data transfer problem, and we will use session handles instead)
- no pilot placement intelligence (saga-pilot schedulers are limited to CU placement decisions)
- no direct support for workflows (although some schedulers may interpret task dependencies)
- no dynamically resizable pilots (considered a corner use case which would introduce significant agent complexity)
Scalability targets:
- Scale Up: per pilot, use the maximum number of cores available on the largest XSEDE machine (Stampede: 10k cores, by request) concurrently, i.e. 10k tasks, without significant overhead.
- Scale Out: 10k concurrent tasks on ~1,000 small pilots (target use case OSG & clouds: 4-12 cores per agent), without significant overhead.
- Scale Across: 20k concurrent tasks in a hybrid up/out configuration.
- Framework Scalability: in multi-user mode, 10 different backends, with 10 concurrent users running 10 concurrent applications each.
Architectural principles:
- modular component architecture:
  - separate the pilot manager and the queue manager into separate modules/services
  - separate notification/callback system for state updates etc. (supports flexible information flow between pilots, service, and applications)
- queue as the central concept for CU management ('queue' in the sense queuing systems use the term, not in the programming-abstraction sense)
- separate communication/protocol layer:
  - persistent, resilient connectivity backbone (needed for NAT- and firewall-infested environments)
  - lightweight protocol layer, open for bulk operations, asynchronous operations, and notifications at scale (most likely REST)
  - locality transparency (service locality, agent locality, etc.)
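The queue-centric CU management and the notification/callback system above can be sketched together in a few lines. This is a minimal, single-process sketch under assumed names (`ComputeUnit`, `UnitQueue`, and the state names are illustrative, not the saga-pilot design): the application enqueues CUs, an agent dequeues them, and state changes flow back through registered callbacks rather than through polling.

```python
from collections import deque

class ComputeUnit:
    """A compute unit with a state and a list of state-change callbacks."""
    def __init__(self, uid, executable):
        self.uid = uid
        self.executable = executable
        self.state = "NEW"
        self._callbacks = []

    def register_callback(self, cb):
        self._callbacks.append(cb)

    def advance(self, state):
        # update the state, then notify all interested parties
        self.state = state
        for cb in self._callbacks:
            cb(self, state)

class UnitQueue:
    """Central queue: the application enqueues CUs, agents dequeue them."""
    def __init__(self):
        self._queue = deque()

    def submit(self, cu):
        cu.advance("PENDING")
        self._queue.append(cu)

    def dequeue(self):
        cu = self._queue.popleft()
        cu.advance("RUNNING")
        return cu

# usage: the application observes the CU's lifecycle via its callback
events = []
q  = UnitQueue()
cu = ComputeUnit("unit.0001", "/bin/date")
cu.register_callback(lambda u, s: events.append((u.uid, s)))
q.submit(cu)                 # application side
agent_cu = q.dequeue()       # agent side
agent_cu.advance("DONE")
# events == [("unit.0001", "PENDING"), ("unit.0001", "RUNNING"), ("unit.0001", "DONE")]
```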
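To make the "bulk, async ops, notifications" requirement on the protocol layer concrete, here is one possible shape for a wire message. This is purely an assumption for illustration, not a fixed saga-pilot protocol: a single REST request carries a bulk of CU descriptions (amortizing per-request cost at scale), a session handle (since credential delegation is excluded), and a notification endpoint so state updates can flow back asynchronously.

```python
import json

# Hypothetical bulk-submission message (all field names are assumptions):
bulk_request = {
    "op":      "submit_units",
    "session": "session.0042",                       # session handle, no credential delegation
    "notify":  "https://client.example.org/events",  # callback sink for async state updates
    "units": [
        {"executable": "/bin/date",     "cores": 1},
        {"executable": "/bin/hostname", "cores": 1},
    ],
}

payload = json.dumps(bulk_request)   # what would actually go over the wire
decoded = json.loads(payload)        # what the service would see
```

Batching many CU descriptions into one message is what keeps submission overhead acceptable at the 10k-task scale targeted above.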