diff --git a/docs/devguide/roadmap.rst b/docs/devguide/roadmap.rst index a1fe8e44e0..ccb5abce30 100644 --- a/docs/devguide/roadmap.rst +++ b/docs/devguide/roadmap.rst @@ -15,7 +15,7 @@ Code Maintenance * **Type Annotations and Static Type Checking**: Add static type annotations throughout the codebase and add typeguard checks. * **Release Process**: `Improve the overall release process `_ to synchronize docs and code releases and automatically produce changelog documentation. * **Components Maturity Model**: Define the `component maturity model `_ and tag components with their appropriate maturity level. -* **Define and Document Interfaces**: Identify and document interfaces via which `external components `_ can augment the Parsl ecosystem. +* **Define and Document Interfaces**: Identify and document interfaces via which `external components `_ can augment the Parsl ecosystem. * **Distributed Testing Process**: All tests should be run against all possible schedulers, using different executors, on a variety of remote systems. Explore the use of containerized schedulers and remote testing on real systems. New Features and Integrations diff --git a/docs/index.rst b/docs/index.rst index 88b0c7bb4c..a9c5c99881 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -4,7 +4,7 @@ Parsl - Parallel Scripting Library Parsl extends parallelism in Python beyond a single computer. You can use Parsl -`just like Python's parallel executors `_ +`just like Python's parallel executors `_ but across *multiple cores and nodes*. However, the real power of Parsl is in expressing multi-step workflows of functions. Parsl lets you chain functions together and will launch each function as inputs and computing resources are available.
@@ -37,8 +37,8 @@ Parsl lets you chain functions together and will launch each function as inputs Start with the `configuration quickstart `_ to learn how to tell Parsl how to use your computing resource, -see if `a template configuration for your supercomputer `_ is already available, -then explore the `parallel computing patterns `_ to determine how to use parallelism best in your application. +see if `a template configuration for your supercomputer `_ is already available, +then explore the `parallel computing patterns `_ to determine how to use parallelism best in your application. Parsl is an open-source code, and available on GitHub: https://github.com/parsl/parsl/ @@ -57,7 +57,7 @@ Parsl works everywhere *Parsl can run parallel functions on a laptop and the world's fastest supercomputers.* Scaling from laptop to supercomputer is often as simple as changing the resource configuration. -Parsl is tested `on many of the top supercomputers `_. +Parsl is tested `on many of the top supercomputers `_. Parsl is flexible ----------------- diff --git a/docs/quickstart.rst b/docs/quickstart.rst index d54763ee58..9c0d119fad 100644 --- a/docs/quickstart.rst +++ b/docs/quickstart.rst @@ -70,7 +70,7 @@ We describe these components briefly here, and link to more details in the `User .. note:: - Parsl's documentation includes `templates for many supercomputers `_. + Parsl's documentation includes `templates for many supercomputers `_. Even though you may not need to write a configuration from a blank slate, understanding the basic terminology below will be very useful. @@ -112,7 +112,7 @@ with hello world Python and Bash apps. with open('hello-stdout', 'r') as f: print(f.read()) -Learn more about the types of Apps and their options `here `__. +Learn more about the types of Apps and their options `here `__. 
Executors ^^^^^^^^^ @@ -127,7 +127,7 @@ You can dynamically set the number of workers based on available memory and pin each worker to specific GPUs or CPU cores among other powerful features. -Learn more about Executors `here `__. +Learn more about Executors `here `__. Execution Providers ^^^^^^^^^^^^^^^^^^^ @@ -141,7 +141,7 @@ Another key role of Providers is defining how to start an Executor on a remote c Often, this simply involves specifying the correct Python environment and (described below) how to launch the Executor on each acquired computer. -Learn more about Providers `here `__. +Learn more about Providers `here `__. Launchers ^^^^^^^^^ @@ -151,7 +151,7 @@ A common example is an :class:`~parsl.launchers.launchers.MPILauncher`, which us for starting a single program on multiple computing nodes. Like Providers, Parsl comes packaged with Launchers for most supercomputers and clouds. -Learn more about Launchers `here `__. +Learn more about Launchers `here `__. Benefits of a Data-Flow Kernel @@ -164,7 +164,7 @@ and performs the many other functions needed to execute complex workflows. The flexibility and performance of the DFK enables applications with intricate dependencies between tasks to execute on thousands of parallel workers. -Start with the Tutorial or the `parallel patterns `_ +Start with the Tutorial or the `parallel patterns `_ to see the complex types of workflows you can make with Parsl. Starting Parsl @@ -210,7 +210,7 @@ An example which launches 4 workers on 1 node of the Polaris supercomputer looks ) -The documentation has examples for other supercomputers `here `__. +The documentation has examples for other supercomputers `here `__.
The next step is to load the configuration diff --git a/docs/userguide/examples/library/__init__.py b/docs/userguide/advanced/examples/library/__init__.py similarity index 100% rename from docs/userguide/examples/library/__init__.py rename to docs/userguide/advanced/examples/library/__init__.py diff --git a/docs/userguide/examples/library/app.py b/docs/userguide/advanced/examples/library/app.py similarity index 100% rename from docs/userguide/examples/library/app.py rename to docs/userguide/advanced/examples/library/app.py diff --git a/docs/userguide/examples/library/config.py b/docs/userguide/advanced/examples/library/config.py similarity index 100% rename from docs/userguide/examples/library/config.py rename to docs/userguide/advanced/examples/library/config.py diff --git a/docs/userguide/examples/library/logic.py b/docs/userguide/advanced/examples/library/logic.py similarity index 100% rename from docs/userguide/examples/library/logic.py rename to docs/userguide/advanced/examples/library/logic.py diff --git a/docs/userguide/examples/pyproject.toml b/docs/userguide/advanced/examples/pyproject.toml similarity index 100% rename from docs/userguide/examples/pyproject.toml rename to docs/userguide/advanced/examples/pyproject.toml diff --git a/docs/userguide/examples/run.py b/docs/userguide/advanced/examples/run.py similarity index 100% rename from docs/userguide/examples/run.py rename to docs/userguide/advanced/examples/run.py diff --git a/docs/userguide/advanced/index.rst b/docs/userguide/advanced/index.rst new file mode 100644 index 0000000000..39a89f7ecd --- /dev/null +++ b/docs/userguide/advanced/index.rst @@ -0,0 +1,13 @@ +Advanced Topics +=============== + +More to learn about Parsl after starting a project. + +.. 
toctree:: + :maxdepth: 2 + + modularizing + usage_tracking + monitoring + parsl_perf + plugins diff --git a/docs/userguide/examples/library/__init__.py b/docs/userguide/advanced/examples/library/__init__.py similarity index 100% rename from docs/userguide/examples/library/__init__.py rename to docs/userguide/advanced/examples/library/__init__.py diff --git a/docs/userguide/advanced/modularizing.rst b/docs/userguide/advanced/modularizing.rst new file mode 100644 index 0000000000..143a4ebcd8 --- /dev/null +++ b/docs/userguide/advanced/modularizing.rst @@ -0,0 +1,109 @@ +.. _codebases: + +Structuring Parsl programs +-------------------------- + +While it is convenient to build simple Parsl programs as a single Python file, +splitting a Parsl program into multiple files and a Python module +has significant benefits, including: + + 1. Better readability + 2. Logical separation of components (e.g., apps, config, and control logic) + 3. Ease of reuse of components + +Large applications that use Parsl often divide into several core components: + +.. contents:: + :local: + :depth: 2 + +The following sections use an example where each component is in a separate file: + +.. code-block:: + + examples/logic.py + examples/app.py + examples/config.py + examples/__init__.py + run.py + pyproject.toml + +Run the application by first installing the Python library and then executing the ``run.py`` script. + +.. code-block:: bash + + pip install . # Install module so it can be imported by workers + python run.py + + +Core application logic +====================== + +The core application logic should be developed without any reference to Parsl. +Implement capabilities, write unit tests, and prepare documentation +in whichever way works best for the problem at hand. + +Parallelization with Parsl will be easy if the software already follows best practices. + +The example defines a function to convert a single integer into binary. + +.. literalinclude:: examples/library/logic.py + :caption: library/logic.py + +Workflow functions +================== + +Tasks within a workflow may require unique combinations of core functions.
+Functions to be run in parallel must also meet :ref:`specific requirements ` +that may complicate writing the core logic effectively. +As such, separating functions to be used as Apps is often beneficial. + +The example includes a function to convert many integers into binary. + +Key points to note: + +- It is not necessary to have import statements inside the function. + Parsl will serialize this function by reference, as described in :ref:`functions-from-modules`. + +- The function is not yet marked as a Parsl PythonApp. + Keeping Parsl out of the function definitions simplifies testing + because you will not need to run Parsl when testing the code. + +- *Advanced*: Consider including Parsl decorators in the library if using complex workflow patterns, + such as :ref:`join apps ` or functions which take :ref:`special arguments `. + +.. literalinclude:: examples/library/app.py + :caption: library/app.py + + +Parsl configuration functions +============================= + +Create Parsl configurations specific to your application's needs as functions. +While not necessary, including the Parsl configuration functions inside the module +ensures they can be imported into other scripts easily. + +Generating Parsl :class:`~parsl.config.Config` objects from a function +makes it possible to change the configuration without editing the module. + +The example function provides a configuration suited for a single node. + +.. literalinclude:: examples/library/config.py + :caption: library/config.py + +Orchestration Scripts +===================== + +The last file defines the workflow itself. + +Such orchestration scripts typically perform at least four tasks: + +1. *Load execution options* using a tool like :mod:`argparse`. +2. *Prepare workflow functions for execution* by creating :class:`~parsl.app.python.PythonApp` wrappers over each function. +3. *Create a configuration, then start Parsl* with the :func:`parsl.load` function. +4.
*Launch tasks and retrieve results* depending on the needs of the application. + +An example run script is as follows: + +.. literalinclude:: examples/run.py + :caption: run.py diff --git a/docs/userguide/advanced/monitoring.rst b/docs/userguide/advanced/monitoring.rst new file mode 100644 index 0000000000..c1285cb9b3 --- /dev/null +++ b/docs/userguide/advanced/monitoring.rst @@ -0,0 +1,121 @@ +Monitoring +========== + +Parsl includes a monitoring system to capture task state as well as resource +usage over time. The Parsl monitoring system aims to provide detailed +information and diagnostic capabilities to help track the state of your +programs, down to the individual apps that are executed on remote machines. + +The monitoring system records information to an SQLite database while a +workflow runs. This information can then be visualized in a web dashboard +using the ``parsl-visualize`` tool, or queried with SQL using regular +SQLite tools. + + +Monitoring configuration +------------------------ + +Parsl monitoring is only supported with the `parsl.executors.HighThroughputExecutor`. + +The following example shows how to enable monitoring in the Parsl +configuration. Here the `parsl.monitoring.MonitoringHub` is specified to use port +55055 to receive monitoring messages from workers every 10 seconds. + +..
code-block:: python + + import parsl + from parsl.monitoring.monitoring import MonitoringHub + from parsl.config import Config + from parsl.executors import HighThroughputExecutor + from parsl.addresses import address_by_hostname + + config = Config( + executors=[ + HighThroughputExecutor( + label="local_htex", + cores_per_worker=1, + max_workers_per_node=4, + address=address_by_hostname(), + ) + ], + monitoring=MonitoringHub( + hub_address=address_by_hostname(), + hub_port=55055, + monitoring_debug=False, + resource_monitoring_interval=10, + ), + strategy='none' + ) + + +Visualization +------------- + +To run the web dashboard utility ``parsl-visualize``, you first need to install +its dependencies:: + + $ pip install 'parsl[monitoring,visualization]' + +To view the web dashboard while or after a Parsl program has executed, run +the ``parsl-visualize`` utility:: + + $ parsl-visualize + +By default, this command expects that the default ``monitoring.db`` database is used +in the ``runinfo`` directory. Other databases can be loaded by passing +the database URI on the command line. For example, if the full path +to the database is ``/tmp/my_monitoring.db``, run:: + + $ parsl-visualize sqlite:////tmp/my_monitoring.db + +By default, the visualization web server listens on ``127.0.0.1:8080``. If the web server is deployed on a machine with a web browser, the dashboard can be accessed in the browser at ``127.0.0.1:8080``. If the web server is deployed on a remote machine, such as the login node of a cluster, you will need to use an SSH tunnel from your local machine to the cluster:: + + $ ssh -L 50000:127.0.0.1:8080 username@cluster_address + +This command will bind your local machine's port 50000 to the remote cluster's port 8080. +The dashboard can then be accessed via the local machine's browser at ``127.0.0.1:50000``. + +.. warning:: Alternatively, you can deploy the visualization server on a public interface.
However, first check that this is allowed by the cluster's security policy. The following example shows how to deploy the web server on a public port (i.e., open to the Internet via ``public_IP:55555``):: + + $ parsl-visualize --listen 0.0.0.0 --port 55555 + + +Workflows Page +^^^^^^^^^^^^^^ + +The workflows page lists all Parsl workflows that have been executed with monitoring enabled +in the selected database. +It provides a high-level summary of workflow state as shown below: + +.. image:: ../../images/mon_workflows_page.png + +Throughout the dashboard, all blue elements are clickable. For example, clicking a specific workflow +name from the table takes you to the Workflow Summary page described in the next section. + +Workflow Summary +^^^^^^^^^^^^^^^^ + +The workflow summary page captures the run-level details of a workflow, including start and end times +as well as task summary statistics. The workflow summary section is followed by the *App Summary*, which lists +the various apps and the invocation count for each. + +.. image:: ../../images/mon_workflow_summary.png + + +The workflow summary also presents three different views of the workflow: + +* Workflow DAG - with apps differentiated by colors: This visualization is useful to visually inspect the dependency + structure of the workflow. Hovering over the nodes in the DAG shows a tooltip for the app represented by the node and its task ID. + +.. image:: ../../images/mon_task_app_grouping.png + +* Workflow DAG - with task states differentiated by colors: This visualization is useful to identify which tasks have been completed, failed, or are currently pending. + +.. image:: ../../images/mon_task_state_grouping.png + +* Workflow resource usage: This visualization provides resource usage information at the workflow level. + For example, cumulative CPU/Memory utilization across workers over time. + +..
image:: ../../images/mon_resource_summary.png + diff --git a/docs/userguide/advanced/parsl_perf.rst b/docs/userguide/advanced/parsl_perf.rst new file mode 100644 index 0000000000..2ea1adb00f --- /dev/null +++ b/docs/userguide/advanced/parsl_perf.rst @@ -0,0 +1,53 @@ +.. _label-parsl-perf: + +Measuring performance with parsl-perf +===================================== + +``parsl-perf`` is a tool for making basic performance measurements of Parsl +configurations. + +It runs increasingly large numbers of no-op apps until a batch takes +(by default) 120 seconds, giving a measurement of tasks per second. + +This can give a basic measurement of some of the overheads in task +execution. + +``parsl-perf`` must be invoked with a configuration file, which is a Python +file containing either a variable ``config`` which contains a `Config` object, or +a function ``fresh_config`` which returns a `Config` object. The +``fresh_config`` format is the same as used with the pytest test suite. + +To specify a ``parsl_resource_specification`` for tasks, add a ``--resources`` +argument. + +To change the target runtime from the default of 120 seconds, add a +``--time`` parameter. + +For example: + +.. code-block:: bash + + + $ python -m parsl.benchmark.perf --config parsl/tests/configs/workqueue_ex.py --resources '{"cores":1, "memory":0, "disk":0}' + ==== Iteration 1 ==== + Will run 10 tasks to target 120 seconds runtime + Submitting tasks / invoking apps + warning: using plain-text when communicating with workers. + warning: use encryption with a key and cert when creating the manager. + All 10 tasks submitted ... waiting for completion + Submission took 0.008 seconds = 1248.676 tasks/second + Runtime: actual 3.668s vs target 120s + Tasks per second: 2.726 + + [...] + + ==== Iteration 4 ==== + Will run 57640 tasks to target 120 seconds runtime + Submitting tasks / invoking apps + All 57640 tasks submitted ...
waiting for completion + Submission took 34.839 seconds = 1654.487 tasks/second + Runtime: actual 364.387s vs target 120s + Tasks per second: 158.184 + Cleaning up DFK + The end + diff --git a/docs/userguide/advanced/plugins.rst b/docs/userguide/advanced/plugins.rst new file mode 100644 index 0000000000..cd9244960c --- /dev/null +++ b/docs/userguide/advanced/plugins.rst @@ -0,0 +1,106 @@ +Plugins +======= + +Parsl has several places where code can be plugged in. Parsl usually provides +several implementations for each plugin point. + +This page gives a brief summary of those places and why you might want +to use them, with links to the API guide. + +Executors +--------- +When the Parsl dataflow kernel is ready for a task to run, it passes that +task to a `ParslExecutor`. The executor is then responsible for running the task's +Python code and returning the result. This is the abstraction that allows one +executor to run code on the local submitting host, while another executor can +run the same code on a large supercomputer. + + +Providers and Launchers +----------------------- +Some executors are based on blocks of workers (for example, the +`parsl.executors.HighThroughputExecutor`): the submit side requires a +batch system (e.g., Slurm, Kubernetes) to start worker processes, which then +execute tasks. + +The particular way in which a system makes those workers start is implemented +by providers and launchers. + +An `ExecutionProvider` allows a command line to be submitted as a request to the +underlying batch system to be run inside an allocation of nodes. + +A `Launcher` modifies that command line when run inside the allocation to +add on any wrappers that are needed to launch the command (e.g., ``srun`` inside +Slurm). Providers and launchers are usually paired together for a particular +system type.
+ +File staging +------------ +Parsl can copy input files from an arbitrary URL into a task's working +environment, and copy output files from a task's working environment to +an arbitrary URL. A small set of data staging providers is installed by default, +for ``file://``, ``http://``, and ``ftp://`` URLs. More data staging providers can +be added in the workflow configuration, in the ``storage`` parameter of the +relevant `ParslExecutor`. Each provider should subclass the `Staging` class. + + +Default stdout/stderr name generation +------------------------------------- +Parsl can choose names for your bash apps' stdout and stderr streams +automatically, with the ``parsl.AUTO_LOGNAME`` parameter. The choice of path is +made by a function which can be configured with the ``std_autopath`` +parameter of Parsl `Config`. By default, ``DataFlowKernel.default_std_autopath`` +will be used. + + +Memoization/checkpointing +------------------------- + +When Parsl memoizes/checkpoints an app parameter, it does so by computing a +hash of that parameter that should be the same if that parameter is the same +on subsequent invocations. This isn't straightforward to do for arbitrary +objects, so Parsl implements a checkpointing hash function for a few common +types, and raises an exception on unknown types: + +.. code-block:: + + ValueError("unknown type for memoization ...") + +You can plug in your own type-specific hash code for additional types that +you need and understand using `id_for_memo`. + + +Invoking other asynchronous components +-------------------------------------- + +Parsl code can invoke other asynchronous components which return Futures, and +integrate those Futures into the task graph: Parsl apps can be given any +`concurrent.futures.Future` as a dependency, even if those futures do not come +from invoking a Parsl app. This includes using such Futures as the return value of a +``join_app``. + +A specific example of this is integrating Globus Compute tasks into a Parsl +task graph.
See :ref:`label-join-globus-compute`. + +Dependency resolution +--------------------- + +When Parsl examines the arguments to an app, it uses a `DependencyResolver`. +The default `DependencyResolver` will cause Parsl to wait for +``concurrent.futures.Future`` instances (including `AppFuture` and +`DataFuture`), and pass through other arguments without waiting. + +This behaviour is pluggable: Parsl comes with another dependency resolver, +`DEEP_DEPENDENCY_RESOLVER`, which knows about futures contained within structures +such as tuples, lists, sets, and dicts. + +This plugin interface might be used to interface other task-like or future-like +objects to the Parsl dependency mechanism, by describing how they can be +interpreted as a Future. + +Removed interfaces +------------------ + +Parsl previously had a deprecated ``Channel`` abstraction, which has been removed. See +`issue 3515 `_ +for further discussion of its removal. diff --git a/docs/userguide/advanced/usage_tracking.rst b/docs/userguide/advanced/usage_tracking.rst new file mode 100644 index 0000000000..da8ac9b79d --- /dev/null +++ b/docs/userguide/advanced/usage_tracking.rst @@ -0,0 +1,171 @@ +.. _label-usage-tracking: + +Usage Statistics Collection +=========================== + +Parsl uses an **opt-in** model for usage tracking, allowing users to decide if they wish to participate. Usage statistics are crucial for improving software reliability and help focus development and maintenance efforts on the most used components of Parsl. The collected data is used solely for enhancements and reporting and is not shared in its raw form outside of the Parsl team. + +Why are we doing this? +---------------------- + +The Parsl development team relies on funding from government agencies. To sustain this funding and advocate for continued support, it is essential to show that the research community benefits from these investments. + +By opting in to share usage data, you actively support the ongoing development and maintenance of Parsl.
(See :ref:`What is sent? ` below). + +Opt-In Model +------------ + +We use an **opt-in model** for usage tracking to respect user privacy and provide full control over shared information. We hope that developers and researchers will choose to send us this information, because we need this data: it is a requirement for continued funding. + +Choose the data you share with Usage Tracking Levels. + +**Usage Tracking Levels:** + +* **Level 1:** Only basic information such as Python version, Parsl version, and platform name (Linux, MacOS, etc.) +* **Level 2:** Level 1 information and configuration information including provider, executor, and launcher names. +* **Level 3:** Level 2 information and workflow execution details, including the number of applications run, failures, and execution time. + +By enabling usage tracking, you support Parsl's development. + +**To opt in, set** ``usage_tracking`` **to the desired level (1, 2, or 3) in the configuration object** (``parsl.config.Config``) **.** + +Example: + +.. code-block:: python3 + + config = Config( + executors=[ + HighThroughputExecutor( + ... + ) + ], + usage_tracking=3 + ) + +.. _what-is-sent: + +What is sent? +------------- + +The data collected depends on the tracking level selected: + +* **Level 1:** Only basic information such as Python version, Parsl version, and platform name (Linux, MacOS, etc.) +* **Level 2:** Level 1 information and configuration information including provider, executor, and launcher names. +* **Level 3:** Level 2 information and workflow execution details, including the number of applications run, failures, and execution time. + +**Example Messages:** + +- At launch: + + ..
code-block:: json + + { + "correlator":"6bc7484e-5693-48b2-b6c0-5889a73f7f4e", + "parsl_v":"1.3.0-dev", + "python_v":"3.12.2", + "platform.system":"Darwin", + "tracking_level":3, + "components":[ + { + "c":"parsl.config.Config", + "executors_len":1, + "dependency_resolver":false + }, + "parsl.executors.threads.ThreadPoolExecutor" + ], + "start":1727156153 + } + +- On closure (Tracking Level 3 only): + + .. code-block:: json + + { + "correlator":"6bc7484e-5693-48b2-b6c0-5889a73f7f4e", + "execution_time":31, + "components":[ + { + "c":"parsl.dataflow.dflow.DataFlowKernel", + "app_count":3, + "app_fails":0 + }, + { + "c":"parsl.config.Config", + "executors_len":1, + "dependency_resolver":false + }, + "parsl.executors.threads.ThreadPoolExecutor" + ], + "end":1727156156 + } + +**All messages sent are logged in the** ``parsl.log`` **file, ensuring complete transparency.** + +How is the data sent? +--------------------- + +Data is sent using **UDP** to minimize the impact on workflow performance. While this may result in some data loss, it significantly reduces the chances of usage tracking affecting the software's operation. + +The data is processed through AWS CloudWatch to generate a monitoring dashboard, providing valuable insights into usage patterns. + +When is the data sent? +---------------------- + +Data is sent twice per run: + +1. At the start of the script. +2. Upon script completion (for Tracking Level 3). + +What will the data be used for? +------------------------------- + +The data will help the Parsl team understand Parsl usage and make development and maintenance decisions, including: + +* Focus development and maintenance on the most-used components of Parsl. +* Determine which Python versions to continue supporting. +* Track the age of Parsl installations. +* Assess how long it takes for most users to adopt new changes. +* Track usage statistics to report to funders. 
+ +Usage Statistics Dashboard +-------------------------- + +The collected data is aggregated and displayed on a publicly accessible dashboard. This dashboard provides an overview of how Parsl is being used across different environments and includes metrics such as: + +* Total workflows executed over time +* Most-used Python and Parsl versions +* Most common platforms and executors, and more + +`Find the dashboard here `_ + +Leaderboard +----------- + +**Opting in to usage tracking also allows you to participate in the Parsl Leaderboard. +To participate in the leaderboard, you can deanonymize yourself using the** ``project_name`` **parameter in the Parsl configuration object** (``parsl.config.Config``) **.** + +`Find the Parsl Leaderboard here `_ + +Example: + +.. code-block:: python3 + + config = Config( + executors=[ + HighThroughputExecutor( + ... + ) + ], + usage_tracking=3, + project_name="my-test-project" + ) + +Every run of Parsl with usage tracking **Level 1** or **Level 2** earns you **1 point**, and every run with usage tracking **Level 3** earns you **2 points**. + +Feedback +-------- + +Please send us your feedback at parsl@googlegroups.com. Feedback from our user communities will be +useful in determining our path forward with usage tracking in the future. + +**Please consider turning on usage tracking to support the continued development of Parsl.** diff --git a/docs/userguide/app.rst b/docs/userguide/app.rst new file mode 100644 index 0000000000..5e58276c3d --- /dev/null +++ b/docs/userguide/app.rst @@ -0,0 +1,9 @@ +:orphan: + +.. meta:: + :content http-equiv="refresh": 0;url=apps/index.html + +Redirect +-------- + +This page has been `moved `_ diff --git a/docs/userguide/apps/bash.rst b/docs/userguide/apps/bash.rst new file mode 100644 index 0000000000..9f99eb4d95 --- /dev/null +++ b/docs/userguide/apps/bash.rst @@ -0,0 +1,66 @@ + +Bash Apps +--------- + +..
code-block:: python + + @bash_app + def echo( + name: str, + stdout=parsl.AUTO_LOGNAME # Requests Parsl to return the stdout + ): + return f'echo "Hello, {name}!"' + + future = echo('user') + future.result() # block until task has completed + + with open(future.stdout, 'r') as f: + print(f.read()) + + +A Parsl Bash app executes an external application by running a command line. +Parsl will execute the string returned by the function as a command-line script on a remote worker. + +Rules for Function Contents +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Bash Apps follow the same rules :ref:`as Python Apps `. +For example, imports may need to be inside functions and global variables will be inaccessible. + +Inputs and Outputs +^^^^^^^^^^^^^^^^^^ + +Bash Apps can use the same kinds of inputs as Python Apps, but only communicate results with Files. + +Bash Apps, unlike Python Apps, can also return the content printed to the Standard Output and Error. + +Special Keyword Arguments ++++++++++++++++++++++++++ + +In addition to the ``inputs``, ``outputs``, and ``walltime`` keyword arguments +described above, a Bash app can accept the following keywords: + +1. stdout: (string, tuple or ``parsl.AUTO_LOGNAME``) The path to a file to which standard output should be redirected. If set to ``parsl.AUTO_LOGNAME``, the log will be automatically named according to task id and saved under ``task_logs`` in the run directory. If set to a tuple ``(filename, mode)``, standard output will be redirected to the named file, opened with the specified mode as used by the Python `open `_ function. +2. stderr: (string or ``parsl.AUTO_LOGNAME``) Like stdout, but for the standard error stream. +3. label: (string) If the app is invoked with ``stdout=parsl.AUTO_LOGNAME`` or ``stderr=parsl.AUTO_LOGNAME``, this argument will be appended to the log name. + +Outputs ++++++++ + +If the Bash app exits with Unix exit code 0, then the AppFuture will complete.
If the Bash app +exits with any other code, Parsl will treat this as a failure, and the AppFuture will instead +contain a `BashExitFailure` exception. The Unix exit code can be accessed through the +``exitcode`` attribute of that `BashExitFailure`. + + +Execution Options +^^^^^^^^^^^^^^^^^ + +Bash Apps have the same execution options (e.g., pinning to specific sites) as Python Apps. + +MPI Apps +^^^^^^^^ + +Applications which employ MPI to span multiple nodes are a special case of Bash apps, +and require special modification of Parsl's `execution environment <../configuration/execution.html>`_ to function. +Support for MPI applications is described `in a later section `_. diff --git a/docs/userguide/apps/index.rst b/docs/userguide/apps/index.rst new file mode 100644 index 0000000000..be46d4be29 --- /dev/null +++ b/docs/userguide/apps/index.rst @@ -0,0 +1,26 @@ +.. _apps: + +Writing Parsl Apps +================== + +An **App** defines a computation that will be executed asynchronously by Parsl. +Apps are Python functions marked with a decorator which +designates that the function will run asynchronously and causes it to return +a :class:`~concurrent.futures.Future` instead of the result. + +Apps can be one of three types of functions, each with its own type of decorator: + +- ``@python_app``: Most Python functions +- ``@bash_app``: A Python function which returns a command line program to execute +- ``@join_app``: A function which launches one or more new Apps + +Start by learning how to write Python Apps, which define most of the rules needed to write +other types of Apps. + +.. toctree:: + :maxdepth: 1 + + python + bash + mpi_apps + joins diff --git a/docs/userguide/apps/joins.rst b/docs/userguide/apps/joins.rst new file mode 100644 index 0000000000..defb0ad012 --- /dev/null +++ b/docs/userguide/apps/joins.rst @@ -0,0 +1,257 @@ +..
_label-joinapp:
+
+Join Apps
+=========
+
+Join apps, defined with the ``@join_app`` decorator, are a form of app that can
+launch other pieces of a workflow: for example, a Parsl sub-workflow, or a task
+that runs in some other system.
+
+Parsl sub-workflows
+-------------------
+
+One reason for launching Parsl apps from inside a join app, rather than
+directly in the main workflow code, is that the definitions of those tasks
+are not known well enough at the start of the workflow.
+
+For example, a workflow might run an expensive step to detect some objects
+in an image, and then on each object, run a further expensive step. Because
+the number of objects is not known at the start of the workflow, but instead
+only after an expensive step has completed, the subsequent tasks cannot be
+defined until after that step has completed.
+
+In simple cases, the main workflow script can be blocked using
+``Future.result()`` and join apps are not necessary, but in more complicated
+cases, that approach can severely limit concurrency.
+
+Join apps allow more nuanced dependencies to be expressed that can help with:
+
+* increased concurrency - helping with strong scaling
+* more focused error propagation - allowing more of an ultimately failing workflow to complete
+* more useful monitoring information
+
+Using Futures from other components
+-----------------------------------
+
+Sometimes, a workflow might need to incorporate tasks from other systems that
+run asynchronously but do not need a Parsl worker allocated for their entire
+run. An example of this is delegating some work into Globus Compute: work can
+be given to Globus Compute, but Parsl does not need to keep a worker allocated
+to that task while it runs. Instead, Parsl can be told to wait for the ``Future``
+returned by Globus Compute to complete.
+
+Usage
+-----
+
+A `join_app` looks quite like a `python_app`, but should return one or more
+``Future`` objects, rather than a value.
Once the Python code has run, the
+app will wait for those Futures to complete without occupying a Parsl worker,
+and when those Futures complete, their contents will be the return value
+of the `join_app`.
+
+For example:
+
+.. code-block:: python
+
+   @python_app
+   def some_app():
+       return 3
+
+   @join_app
+   def example():
+       x: Future = some_app()
+       return x  # note that x is a Future, not a value
+
+   assert example().result() == 3
+
+Example of a Parsl sub-workflow
+-------------------------------
+
+This example workflow shows a preprocessing step, followed by
+a middle stage that is chosen by the result of the pre-processing step
+(either option 1 or option 2) followed by a known post-processing step.
+
+.. code-block:: python
+
+   @python_app
+   def pre_process():
+       return 3
+
+   @python_app
+   def option_one(x):
+       return x*2
+
+   @python_app
+   def option_two(x):
+       return (-x) * 2
+
+   @join_app
+   def process(x):
+       if x > 0:
+           return option_one(x)
+       else:
+           return option_two(x)
+
+   @python_app
+   def post_process(x):
+       return str(x)
+
+   assert post_process(process(pre_process())).result() == "6"
+
+* Why can't process be a regular Python function?
+
+``process`` needs to inspect the value of ``x`` to make a decision about
+what app to launch. So it needs to defer execution until after the
+pre-processing stage has completed. In Parsl, the way to defer that is
+using apps: even though ``process`` is invoked at the start of the workflow,
+it will execute later on, when the Future returned by ``pre_process`` has a
+value.
+
+* Why can't process be a @python_app?
+
+A Python app, if run in a `parsl.executors.ThreadPoolExecutor`, can launch
+more Parsl apps; so a ``python_app`` implementation of process() would be able
+to inspect x and choose and invoke the appropriate ``option_{one, two}``.
+
+From launching the ``option_{one, two}`` app, the app body Python code would
+get a ``Future[int]`` - a ``Future`` that will eventually contain ``int``.
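This nesting of futures can be reproduced with the standard library alone. The sketch below uses ``concurrent.futures`` rather than Parsl (all names are illustrative) to show why launching a task from inside another task yields a future-of-a-future that must be unwrapped twice:

```python
from concurrent.futures import Future, ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=2)

def option_one(x: int) -> int:
    return x * 2

def process(x: int) -> Future:
    # Launch a nested task and return its Future, not its value
    return pool.submit(option_one, x)

outer = pool.submit(process, 3)  # a Future whose value is itself a Future
inner = outer.result()           # unwrap once: Future[int]
print(inner.result())            # unwrap again: prints 6
```

A join app performs that second unwrapping for you, without occupying a worker while the inner future completes.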
+
+But, we want to invoke ``post_process`` at submission time near the start of
+the workflow so that Parsl knows about as many tasks as possible. But we don't
+want it to execute until the value of the chosen ``option_{one, two}`` app
+is known.
+
+If we don't have join apps, how can we do this?
+
+We could make process wait for ``option_{one, two}`` to complete, before
+returning, like this:
+
+.. code-block:: python
+
+   @python_app
+   def process(x):
+       if x > 0:
+           f = option_one(x)
+       else:
+           f = option_two(x)
+       return f.result()
+
+but this will block the worker running ``process`` until ``option_{one, two}``
+has completed. If there aren't enough workers to run ``option_{one, two}`` this
+can even deadlock. (Principle: apps should not wait on the completion of other
+apps, and should always allow Parsl to handle this through dependencies.)
+
+We could make process return the ``Future`` to the main workflow thread:
+
+.. code-block:: python
+
+   @python_app
+   def process(x):
+       if x > 0:
+           f = option_one(x)
+       else:
+           f = option_two(x)
+       return f  # f is a Future[int]
+
+   # process(3) is a Future[Future[int]]
+
+
+What comes out of invoking ``process(x)`` now is a nested ``Future[Future[int]]``
+- it's a promise that eventually process will give you a promise (from
+``option_{one, two}``) that will eventually give you an int.
+
+We can't pass that future into post_process... because post_process wants the
+final int, and that future will complete before the int is ready, and that
+(outer) future will have as its value the inner future (which won't be complete yet).
+
+So we could wait for the result in the main workflow thread:
+
+.. code-block:: python
+
+   f_outer = process(pre_process())  # Future[Future[int]]
+   f_inner = f_outer.result()        # Future[int]
+   result = post_process(f_inner)
+   # result.result() == "6"
+
+But this now blocks the main workflow thread.
If we really only need to run
+these three lines, that's fine, but what if we are in a for loop that
+sets up 1000 parametrised iterations:
+
+.. code-block:: python
+
+   for x in range(1, 1001):
+       f_outer = process(pre_process(x))  # Future[Future[int]]
+       f_inner = f_outer.result()         # Future[int]
+       result = post_process(f_inner)
+
+The ``for`` loop can only iterate after pre-processing is done for each
+iteration - it is unnecessarily serialised by the ``.result()`` call,
+so that pre-processing cannot run in parallel.
+
+So, the rule about not calling ``.result()`` applies in the main workflow thread
+too.
+
+What join apps add is the ability for Parsl to unwrap that ``Future[Future[int]]`` into a
+``Future[int]`` in a "sensible" way (e.g. it does not need to block a worker).
+
+
+.. _label-join-globus-compute:
+
+Example of invoking a Futures-driven task from another system
+-------------------------------------------------------------
+
+
+This example shows launching some activity in another system, without
+occupying a Parsl worker while that activity happens: in this example, work is
+delegated to Globus Compute, which performs the work elsewhere. When the work
+is completed, Globus Compute will put the result into the future that it
+returns, and then (because the Parsl app is a ``@join_app``), that result will
+be used as the result of the Parsl app.
+
+As above, the motivation for doing this inside an app, rather than at the
+top level, is that sufficient information to launch the Globus Compute task
+might not be available at the start of the workflow.
+
+This workflow will run a first stage, ``const_five``, on a Parsl worker,
+then using the result of that stage, pass the result as a parameter to a
+Globus Compute task, getting a ``Future`` from that submission. Then, the
+results of the Globus Compute task will be passed on to a second Parsl
+local task, ``times_two``.
+
+..
code-block:: python
+
+   import parsl
+   from globus_compute_sdk import Executor
+
+   tutorial_endpoint_uuid = '4b116d3c-1703-4f8f-9f6f-39921e5864df'
+   gce = Executor(endpoint_id=tutorial_endpoint_uuid)
+
+   def increment_in_funcx(n):
+       return n+1
+
+   @parsl.join_app
+   def increment_in_parsl(n):
+       future = gce.submit(increment_in_funcx, n)
+       return future
+
+   @parsl.python_app
+   def times_two(n):
+       return n*2
+
+   @parsl.python_app
+   def const_five():
+       return 5
+
+   parsl.load()
+
+   workflow = times_two(increment_in_parsl(const_five()))
+
+   r = workflow.result()
+
+   assert r == (5+1)*2
+
+
+Terminology
+-----------
+
+The term "join" comes from the use of monads in functional programming, especially Haskell.
diff --git a/docs/userguide/apps/mpi_apps.rst b/docs/userguide/apps/mpi_apps.rst
new file mode 100644
index 0000000000..82123123b6
--- /dev/null
+++ b/docs/userguide/apps/mpi_apps.rst
@@ -0,0 +1,153 @@
+MPI and Multi-node Apps
+=======================
+
+The :class:`~parsl.executors.MPIExecutor` supports running MPI applications or other computations which can
+run on multiple compute nodes.
+
+Background
+----------
+
+MPI applications run multiple copies of a program that complete a single task by
+coordinating using messages passed within or across nodes.
+
+Starting an MPI application requires invoking a "launcher" program (e.g., ``mpiexec``)
+with options that define how the copies of a program should be distributed.
+
+The launcher's options control how copies of the program are distributed
+across the nodes (e.g., how many copies per node) and
+how each copy is configured (e.g., which CPU cores it can use).
+
+The options for launchers vary between MPI implementations and compute clusters.
+
+Configuring ``MPIExecutor``
+---------------------------
+
+The :class:`~parsl.executors.MPIExecutor` is a wrapper over
+:class:`~parsl.executors.high_throughput.executor.HighThroughputExecutor`
+which eliminates options that are irrelevant for MPI applications.
+
+Define a configuration for :class:`~parsl.executors.MPIExecutor` by:
+
+1. Setting ``max_workers_per_block`` to the maximum number of tasks to run per block of compute nodes.
+   This value is typically the number of nodes per block divided by the number of nodes per task.
+2. Setting ``mpi_launcher`` to the launcher used for your application.
+3. Specifying a provider that matches your cluster and using the :class:`~parsl.launchers.SimpleLauncher`,
+   which will ensure that no Parsl processes are placed on the compute nodes.
+
+An example for ALCF's Polaris supercomputer that will run 3 MPI tasks of 2 nodes each at the same time:
+
+.. code-block:: python
+
+   from parsl.addresses import address_by_interface
+   from parsl.config import Config
+   from parsl.executors import MPIExecutor
+   from parsl.launchers import SimpleLauncher
+   from parsl.providers import PBSProProvider
+
+   config = Config(
+       executors=[
+           MPIExecutor(
+               address=address_by_interface('bond0'),
+               max_workers_per_block=3,  # Assuming 2 nodes per task
+               provider=PBSProProvider(
+                   account="parsl",
+                   worker_init="""module load miniconda; source activate /lus/eagle/projects/parsl/env""",
+                   walltime="1:00:00",
+                   queue="debug",
+                   scheduler_options="#PBS -l filesystems=home:eagle:grand",
+                   launcher=SimpleLauncher(),
+                   select_options="ngpus=4",
+                   nodes_per_block=6,
+                   max_blocks=1,
+                   cpus_per_node=64,
+               ),
+           ),
+       ]
+   )
+
+
+.. warning::
+   Please note that ``Provider`` options that specify per-task or per-node resources, for example,
+   ``SlurmProvider(cores_per_node=N, ...)`` should not be used with :class:`~parsl.executors.high_throughput.MPIExecutor`.
+   Parsl primarily uses a pilot job model and assumptions from that context do not translate to the MPI context. For
+   more info refer to
+   `GitHub issue #3006 `_
+
+Writing an MPI App
+------------------
+
+:class:`~parsl.executors.high_throughput.MPIExecutor` can execute both Python and Bash Apps which invoke an MPI application.
+
+Create the app by first defining a function which includes a ``parsl_resource_specification`` keyword argument.
+The resource specification is a dictionary which defines the number of nodes and ranks used by the application:
+
+..
code-block:: python
+
+   resource_specification = {
+       'num_nodes': <int>,        # Number of nodes required for the application instance
+       'ranks_per_node': <int>,   # Number of ranks / application elements to be launched per node
+       'num_ranks': <int>,        # Number of ranks in total
+   }
+
+Then, replace the call to the MPI launcher with ``$PARSL_MPI_PREFIX``.
+``$PARSL_MPI_PREFIX`` references an environment variable which will be replaced with
+the correct MPI launcher, configured for the resource list provided when calling the function
+and with options that map the task to nodes which Parsl knows to be available.
+
+The function can be a Bash app:
+
+.. code-block:: python
+
+   @bash_app
+   def lammps_mpi_application(infile: File, parsl_resource_specification: Dict):
+       # PARSL_MPI_PREFIX will resolve to `mpiexec -n 4 -ppn 2 -hosts NODE001,NODE002`
+       return f"$PARSL_MPI_PREFIX lmp_mpi -in {infile.filepath}"
+
+
+or a Python app:
+
+.. code-block:: python
+
+   @python_app
+   def lammps_mpi_application(infile: File, parsl_resource_specification: Dict):
+       from subprocess import run
+       with open('stdout.lmp', 'w') as fp, open('stderr.lmp', 'w') as fe:
+           # Run through a shell so that $PARSL_MPI_PREFIX is expanded to the launcher command
+           proc = run(f'$PARSL_MPI_PREFIX lmp_mpi -in {infile.filepath}', shell=True, stdout=fp, stderr=fe)
+       return proc.returncode
+
+
+Run either App by calling it with its arguments and a resource specification which defines how to execute it
+
+.. code-block:: python
+
+   # Resources in terms of nodes and how ranks are to be distributed are set on a per app
+   # basis via the resource_spec dictionary.
+   resource_spec = {
+       "num_nodes": 2,
+       "ranks_per_node": 2,
+       "num_ranks": 4,
+   }
+   future = lammps_mpi_application(File('in.file'), parsl_resource_specification=resource_spec)
+
+Advanced: More Environment Variables
+++++++++++++++++++++++++++++++++++++
+
+Parsl Apps which run using :class:`~parsl.executors.high_throughput.MPIExecutor`
+can make their own MPI invocation using other environment variables.
+
+These other variables include versions of the launch command for different launchers
+
+- ``PARSL_MPIEXEC_PREFIX``: mpiexec launch command which works for a large number of batch systems, especially PBS systems
+- ``PARSL_SRUN_PREFIX``: srun launch command for Slurm-based clusters
+- ``PARSL_APRUN_PREFIX``: aprun launch command prefix for some Cray machines
+
+And the information used by Parsl when assembling the launcher commands:
+
+- ``PARSL_NUM_RANKS``: Total number of ranks to use for the MPI application
+- ``PARSL_NUM_NODES``: Number of nodes to use for the calculation
+- ``PARSL_MPI_NODELIST``: List of assigned nodes separated by commas (e.g., NODE1,NODE2)
+- ``PARSL_RANKS_PER_NODE``: Number of ranks per node
+
+Limitations
++++++++++++
+
+Support for MPI tasks in HTEX is limited. It is designed for running many multi-node MPI applications within a single
+batch job.
+
+#. MPI tasks may not span across nodes from more than one block.
+#. Parsl does not correctly determine the number of execution slots per block (`Issue #1647 `_)
+#. The executor uses a Python process per task, which can use a lot of memory (`Issue #2264 `_)
\ No newline at end of file
diff --git a/docs/userguide/apps.rst b/docs/userguide/apps/python.rst
similarity index 71%
rename from docs/userguide/apps.rst
rename to docs/userguide/apps/python.rst
index 41a988db6d..00fac2ac30
--- a/docs/userguide/apps.rst
+++ b/docs/userguide/apps/python.rst
@@ -1,22 +1,3 @@
-.. _apps:
-
-Apps
-====
-
-An **App** defines a computation that will be executed asynchronously by Parsl.
-Apps are Python functions marked with a decorator which
-designates that the function will run asynchronously and cause it to return
-a :class:`~concurrent.futures.Future` instead of the result.
- -Apps can be one of three types of functions, each with their own type of decorator - -- ``@python_app``: Most Python functions -- ``@bash_app``: A Python function which returns a command line program to execute -- ``@join_app``: A function which launches one or more new Apps - -The intricacies of Python and Bash apps are documented below. Join apps are documented in a later -section (see :ref:`label-joinapp`). - Python Apps ----------- @@ -187,10 +168,10 @@ There are several classes of allowed types, each with different rules. capital_future = capitalize(first_line_future) print(capital_future.result()) - See the section on `Futures `_ for more details. + See the section on `Futures <../workflows/futures.html>`_ for more details. -Learn more about the types of data allowed in `the data section `_. +Learn more about the types of data allowed in `the data section <../configuration/data.html>`_. .. note:: @@ -203,7 +184,7 @@ Special Keyword Arguments Some keyword arguments to the Python function are treated differently by Parsl -1. inputs: (list) This keyword argument defines a list of input :ref:`label-futures` or files. +1. inputs: (list) This keyword argument defines a list of input :ref:`label-futures` or files. Parsl will wait for the results of any listed :ref:`label-futures` to be resolved before executing the app. The ``inputs`` argument is useful both for passing files as arguments and when one wishes to pass in an arbitrary number of futures at call time. @@ -225,7 +206,7 @@ Some keyword arguments to the Python function are treated differently by Parsl 2. outputs: (list) This keyword argument defines a list of files that will be produced by the app. For each file thus listed, Parsl will create a future, - track the file, and ensure that it is correctly created. The future + track the file, and ensure that it is correctly created. The future can then be passed to other apps as an input argument. .. 
code-block:: python @@ -253,7 +234,7 @@ Outputs +++++++ A Python app returns an AppFuture (see :ref:`label-futures`) as a proxy for the results that will be returned by the -app once it is executed. This future can be inspected to obtain task status; +app once it is executed. This future can be inspected to obtain task status; and it can be used to wait for the result, and when complete, present the output Python object(s) returned by the app. In case of an error or app failure, the future holds the exception raised by the app. @@ -282,70 +263,3 @@ To summarize, any Python function can be made a Python App with a few restrictio 2. Functions must explicitly import any required modules if they are defined in script which starts Parsl. 3. Parsl uses dill and pickle to serialize Python objects to/from apps. Therefore, Parsl require that all input and output objects can be serialized by dill or pickle. See :ref:`label_serialization_error`. 4. STDOUT and STDERR produced by Python apps remotely are not captured. - - -Bash Apps ---------- - -.. code-block:: python - - @bash_app - def echo( - name: str, - stdout=parsl.AUTO_LOGNAME # Requests Parsl to return the stdout - ): - return f'echo "Hello, {name}!"' - - future = echo('user') - future.result() # block until task has completed - - with open(future.stdout, 'r') as f: - print(f.read()) - - -A Parsl Bash app executes an external application by making a command-line execution. -Parsl will execute the string returned by the function as a command-line script on a remote worker. - -Rules for Function Contents -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Bash Apps follow the same rules :ref:`as Python Apps `. -For example, imports may need to be inside functions and global variables will be inaccessible. - -Inputs and Outputs -^^^^^^^^^^^^^^^^^^ - -Bash Apps can use the same kinds of inputs as Python Apps, but only communicate results with Files. 
- -The Bash Apps, unlike Python Apps, can also return the content printed to the Standard Output and Error. - -Special Keywords Arguments -++++++++++++++++++++++++++ - -In addition to the ``inputs``, ``outputs``, and ``walltime`` keyword arguments -described above, a Bash app can accept the following keywords: - -1. stdout: (string, tuple or ``parsl.AUTO_LOGNAME``) The path to a file to which standard output should be redirected. If set to ``parsl.AUTO_LOGNAME``, the log will be automatically named according to task id and saved under ``task_logs`` in the run directory. If set to a tuple ``(filename, mode)``, standard output will be redirected to the named file, opened with the specified mode as used by the Python `open `_ function. -2. stderr: (string or ``parsl.AUTO_LOGNAME``) Like stdout, but for the standard error stream. -3. label: (string) If the app is invoked with ``stdout=parsl.AUTO_LOGNAME`` or ``stderr=parsl.AUTO_LOGNAME``, this argument will be appended to the log name. - -Outputs -+++++++ - -If the Bash app exits with Unix exit code 0, then the AppFuture will complete. If the Bash app -exits with any other code, Parsl will treat this as a failure, and the AppFuture will instead -contain an `BashExitFailure` exception. The Unix exit code can be accessed through the -``exitcode`` attribute of that `BashExitFailure`. - - -Execution Options -^^^^^^^^^^^^^^^^^ - -Bash Apps have the same execution options (e.g., pinning to specific sites) as the Python Apps. - -MPI Apps -^^^^^^^^ - -Applications which employ MPI to span multiple nodes are a special case of Bash apps, -and require special modification of Parsl's `execution environment `_ to function. -Support for MPI applications is described `in a later section `_. diff --git a/docs/userguide/checkpoints.rst b/docs/userguide/checkpoints.rst index 8867107b7a..23af844c17 100644 --- a/docs/userguide/checkpoints.rst +++ b/docs/userguide/checkpoints.rst @@ -1,299 +1,9 @@ -.. 
_label-memos: +:orphan: -Memoization and checkpointing ------------------------------ +.. meta:: + :content http-equiv="refresh": 0;url=workflows/checkpoints.html -When an app is invoked several times with the same parameters, Parsl can -reuse the result from the first invocation without executing the app again. +Redirect +-------- -This can save time and computational resources. - -This is done in two ways: - -* Firstly, *app caching* will allow reuse of results within the same run. - -* Building on top of that, *checkpointing* will store results on the filesystem - and reuse those results in later runs. - -.. _label-appcaching: - -App caching -=========== - - -There are many situations in which a program may be re-executed -over time. Often, large fragments of the program will not have changed -and therefore, re-execution of apps will waste valuable time and -computation resources. Parsl's app caching solves this problem by -storing results from apps that have successfully completed -so that they can be re-used. - -App caching is enabled by setting the ``cache`` -argument in the :func:`~parsl.app.app.python_app` or :func:`~parsl.app.app.bash_app` -decorator to ``True`` (by default it is ``False``). - -.. code-block:: python - - @bash_app(cache=True) - def hello (msg, stdout=None): - return 'echo {}'.format(msg) - -App caching can be globally disabled by setting ``app_cache=False`` -in the :class:`~parsl.config.Config`. - -App caching can be particularly useful when developing interactive programs such as when -using a Jupyter notebook. In this case, cells containing apps are often re-executed -during development. Using app caching will ensure that only modified apps are re-executed. - - -App equivalence -^^^^^^^^^^^^^^^ - -Parsl determines app equivalence using the name of the app function: -if two apps have the same name, then they are equivalent under this -relation. - -Changes inside the app, or by functions called by an app will not invalidate -cached values. 
- -There are lots of other ways functions might be compared for equivalence, -and `parsl.dataflow.memoization.id_for_memo` provides a hook to plug in -alternate application-specific implementations. - - -Invocation equivalence -^^^^^^^^^^^^^^^^^^^^^^ - -Two app invocations are determined to be equivalent if their -input arguments are identical. - -In simple cases, this follows obvious rules: - -.. code-block:: python - - # these two app invocations are the same and the second invocation will - # reuse any cached input from the first invocation - x = 7 - f(x).result() - - y = 7 - f(y).result() - - -Internally, equivalence is determined by hashing the input arguments, and -comparing the hash to hashes from previous app executions. - -This approach can only be applied to data types for which a deterministic hash -can be computed. - -By default Parsl can compute sensible hashes for basic data types: -str, int, float, None, as well as more some complex types: -functions, and dictionaries and lists containing hashable types. - -Attempting to cache apps invoked with other, non-hashable, data types will -lead to an exception at invocation. - -In that case, mechanisms to hash new types can be registered by a program by -implementing the `parsl.dataflow.memoization.id_for_memo` function for -the new type. - -Ignoring arguments -^^^^^^^^^^^^^^^^^^ - -On occasion one may wish to ignore particular arguments when determining -app invocation equivalence - for example, when generating log file -names automatically based on time or run information. -Parsl allows developers to list the arguments to be ignored -in the ``ignore_for_cache`` app decorator parameter: - -.. 
code-block:: python - - @bash_app(cache=True, ignore_for_cache=['stdout']) - def hello (msg, stdout=None): - return 'echo {}'.format(msg) - - -Caveats -^^^^^^^ - -It is important to consider several important issues when using app caching: - -- Determinism: App caching is generally useful only when the apps are deterministic. - If the outputs may be different for identical inputs, app caching will obscure - this non-deterministic behavior. For instance, caching an app that returns - a random number will result in every invocation returning the same result. - -- Timing: If several identical calls to an app are made concurrently having - not yet cached a result, many instances of the app will be launched. - Once one invocation completes and the result is cached - all subsequent calls will return immediately with the cached result. - -- Performance: If app caching is enabled, there may be some performance - overhead especially if a large number of short duration tasks are launched rapidly. - This overhead has not been quantified. - -.. _label-checkpointing: - -Checkpointing -============= - -Large-scale Parsl programs are likely to encounter errors due to node failures, -application or environment errors, and myriad other issues. Parsl offers an -application-level checkpointing model to improve resilience, fault tolerance, and -efficiency. - -.. note:: - Checkpointing builds on top of app caching, and so app caching must be - enabled. If app caching is disabled in the config ``Config.app_cache``, checkpointing will - not work. - -Parsl follows an incremental checkpointing model, where each checkpoint file contains -all results that have been updated since the last checkpoint. - -When a Parsl program loads a checkpoint file and is executed, it will use -checkpointed results for any apps that have been previously executed. -Like app caching, checkpoints -use the hash of the app and the invocation input parameters to identify previously computed -results. 
If multiple checkpoints exist for an app (with the same hash) -the most recent entry will be used. - -Parsl provides four checkpointing modes: - -1. ``task_exit``: a checkpoint is created each time an app completes or fails - (after retries if enabled). This mode minimizes the risk of losing information - from completed tasks. - - .. code-block:: python - - from parsl.configs.local_threads import config - config.checkpoint_mode = 'task_exit' - -2. ``periodic``: a checkpoint is created periodically using a user-specified - checkpointing interval. Results will be saved to the checkpoint file for - all tasks that have completed during this period. - - .. code-block:: python - - from parsl.configs.local_threads import config - config.checkpoint_mode = 'periodic' - config.checkpoint_period = "01:00:00" - -3. ``dfk_exit``: checkpoints are created when Parsl is - about to exit. This reduces the risk of losing results due to - premature program termination from exceptions, terminate signals, etc. However - it is still possible that information might be lost if the program is - terminated abruptly (machine failure, SIGKILL, etc.) - - .. code-block:: python - - from parsl.configs.local_threads import config - config.checkpoint_mode = 'dfk_exit' - -4. ``manual``: in addition to these automated checkpointing modes, it is also possible - to manually initiate a checkpoint by calling ``DataFlowKernel.checkpoint()`` in the - Parsl program code. - - .. code-block:: python - - import parsl - from parsl.configs.local_threads import config - dfk = parsl.load(config) - .... - dfk.checkpoint() - -In all cases the checkpoint file is written out to the ``runinfo/RUN_ID/checkpoint/`` directory. - -.. Note:: Checkpoint modes ``periodic``, ``dfk_exit``, and ``manual`` can interfere with garbage collection. - In these modes task information will be retained after completion, until checkpointing events are triggered. 
- - -Creating a checkpoint -^^^^^^^^^^^^^^^^^^^^^ - -Automated checkpointing must be explicitly enabled in the Parsl configuration. -There is no need to modify a Parsl program as checkpointing will occur transparently. -In the following example, checkpointing is enabled at task exit. The results of -each invocation of the ``slow_double`` app will be stored in the checkpoint file. - -.. code-block:: python - - import parsl - from parsl.app.app import python_app - from parsl.configs.local_threads import config - - config.checkpoint_mode = 'task_exit' - - parsl.load(config) - - @python_app(cache=True) - def slow_double(x): - import time - time.sleep(5) - return x * 2 - - d = [] - for i in range(5): - d.append(slow_double(i)) - - print([d[i].result() for i in range(5)]) - -Alternatively, manual checkpointing can be used to explictly specify when the checkpoint -file should be saved. The following example shows how manual checkpointing can be used. -Here, the ``dfk.checkpoint()`` function will save the results of the prior invocations -of the ``slow_double`` app. - -.. code-block:: python - - import parsl - from parsl import python_app - from parsl.configs.local_threads import config - - dfk = parsl.load(config) - - @python_app(cache=True) - def slow_double(x, sleep_dur=1): - import time - time.sleep(sleep_dur) - return x * 2 - - N = 5 # Number of calls to slow_double - d = [] # List to store the futures - for i in range(0, N): - d.append(slow_double(i)) - - # Wait for the results - [i.result() for i in d] - - cpt_dir = dfk.checkpoint() - print(cpt_dir) # Prints the checkpoint dir - - -Resuming from a checkpoint -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -When resuming a program from a checkpoint Parsl allows the user to select -which checkpoint file(s) to use. -Checkpoint files are stored in the ``runinfo/RUNID/checkpoint`` directory. - -The example below shows how to resume using all available checkpoints. 
-Here, the program re-executes the same calls to the ``slow_double`` app -as above and instead of waiting for results to be computed, the values -from the checkpoint file are are immediately returned. - -.. code-block:: python - - import parsl - from parsl.tests.configs.local_threads import config - from parsl.utils import get_all_checkpoints - - config.checkpoint_files = get_all_checkpoints() - - parsl.load(config) - - # Rerun the same workflow - d = [] - for i in range(5): - d.append(slow_double(i)) - - # wait for results - print([d[i].result() for i in range(5)]) +This page has been `moved `_ diff --git a/docs/userguide/configuration/data.rst b/docs/userguide/configuration/data.rst new file mode 100644 index 0000000000..0c4c4b334d --- /dev/null +++ b/docs/userguide/configuration/data.rst @@ -0,0 +1,415 @@ +.. _label-data: + +Staging data files +================== + +Parsl apps can take and return data files. A file may be passed as an input +argument to an app, or returned from an app after execution. Parsl +provides support to automatically transfer (stage) files between +the main Parsl program, worker nodes, and external data storage systems. + +Input files can be passed as regular arguments, or a list of them may be +specified in the special ``inputs`` keyword argument to an app invocation. + +Inside an app, the ``filepath`` attribute of a `File` can be read to determine +where on the execution-side file system the input file has been placed. + +Output `File` objects must also be passed in at app invocation, through the +outputs parameter. In this case, the `File` object specifies where Parsl +should place output after execution. + +Inside an app, the ``filepath`` attribute of an output +`File` provides the path at which the corresponding output file should be +placed so that Parsl can find it after execution. 
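The idea of a location-independent file reference with an execution-side ``filepath`` can be illustrated with a minimal stand-in class. This is a sketch only - it is not Parsl's actual :py:class:`~parsl.data_provider.files.File` implementation, and it assumes remote files are staged into the working directory under their basename:

```python
from urllib.parse import urlparse

class FileSketch:
    """Minimal stand-in for a location-independent file reference."""

    def __init__(self, url: str):
        parsed = urlparse(url)
        # Default to the 'file' scheme when none is given
        self.scheme = parsed.scheme or 'file'
        self.path = parsed.path or url

    @property
    def filepath(self) -> str:
        # Local files resolve to their own path; staged remote files are
        # assumed to appear in the working directory under their basename
        if self.scheme == 'file':
            return self.path
        return self.path.rsplit('/', 1)[-1]

local = FileSketch('file:///home/parsl/data.txt')
remote = FileSketch('https://github.com/Parsl/parsl/blob/master/README.rst')
print(local.filepath)   # /home/parsl/data.txt
print(remote.filepath)  # README.rst
```

In real Parsl code, apps never inspect the scheme themselves; they read and write ``filepath`` and leave the transfer to the configured staging providers.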
+ +If the output from an app is to be used as the input to a subsequent app, +then a `DataFuture` that represents whether the output file has been created +must be extracted from the first app's AppFuture, and that must be passed +to the second app. This causes app +executions to be properly ordered, in the same way that passing AppFutures +to subsequent apps causes execution ordering based on an app returning. + +In a Parsl program, file handling is split into two pieces: files are named in an +execution-location independent manner using :py:class:`~parsl.data_provider.files.File` +objects, and executors are configured to stage those files in to and out of +execution locations using instances of the :py:class:`~parsl.data_provider.staging.Staging` +interface. + + +Parsl files +----------- + +Parsl uses a custom :py:class:`~parsl.data_provider.files.File` to provide a +location-independent way of referencing and accessing files. +Parsl files are defined by specifying the URL *scheme* and a path to the file. +Thus a file may represent an absolute path on the submit-side file system +or a URL to an external file. + +The scheme defines the protocol via which the file may be accessed. +Parsl supports the following schemes: file, ftp, http, https, and globus. +If no scheme is specified Parsl will default to the file scheme. + +The following example shows creation of two files with different +schemes: a locally-accessible data.txt file and an HTTPS-accessible +README file. + +.. code-block:: python + + File('file://home/parsl/data.txt') + File('https://github.com/Parsl/parsl/blob/master/README.rst') + + +Parsl automatically translates the file's location relative to the +environment in which it is accessed (e.g., the Parsl program or an app). +The following example shows how a file can be accessed in the app +irrespective of where that app executes. + +.. 
code-block:: python
+
+    @python_app
+    def print_file(inputs=()):
+        with open(inputs[0].filepath, 'r') as inp:
+            content = inp.read()
+        return content
+
+    # create a remote Parsl file
+    f = File('https://github.com/Parsl/parsl/blob/master/README.rst')
+
+    # call the print_file app with the Parsl file
+    r = print_file(inputs=[f])
+    r.result()
+
+As described below, the method by which these files are transferred
+depends on the scheme and the staging providers specified in the Parsl
+configuration.
+
+Staging providers
+-----------------
+
+Parsl is able to transparently stage files between at-rest locations and
+execution locations by specifying a list of
+:py:class:`~parsl.data_provider.staging.Staging` instances for an executor.
+These staging instances define how to transfer files in and out of an execution
+location. This list should be supplied as the ``storage_access``
+parameter to an executor when it is constructed.
+
+Parsl includes several staging providers for moving files using the
+schemes defined above. By default, Parsl executors are created with
+three common staging providers:
+the NoOpFileStaging provider for local and shared file systems
+and the HTTP(S) and FTP staging providers for transferring
+files to and from remote storage locations. The following
+example shows how to explicitly set the default staging providers.
+
+.. code-block:: python
+
+    from parsl.config import Config
+    from parsl.executors import HighThroughputExecutor
+    from parsl.data_provider.data_manager import default_staging
+
+    config = Config(
+        executors=[
+            HighThroughputExecutor(
+                storage_access=default_staging,
+                # equivalent to the following
+                # storage_access=[NoOpFileStaging(), FTPSeparateTaskStaging(), HTTPSeparateTaskStaging()],
+            )
+        ]
+    )
+
+
+Parsl further differentiates when staging occurs relative to
+the app invocation that requires or produces files.
+Staging either occurs with the executing task (*in-task staging*)
+or as a separate task (*separate task staging*) before app execution.
+In-task staging
+uses a wrapper that is executed around the Parsl task and thus
+occurs on the resource on which the task is executed. Separate
+task staging inserts a new Parsl task in the graph and associates
+a dependency between the staging task and the task that depends
+on that file. Separate task staging may occur on either the submit-side
+(e.g., when using Globus) or on the execution-side (e.g., HTTPS, FTP).
+
+
+NoOpFileStaging for Local/Shared File Systems
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The NoOpFileStaging provider assumes that files specified either
+with a path or with the ``file`` URL scheme are available both
+on the submit and execution side. This occurs, for example, when there is a
+shared file system. In this case, files will not be moved, and the
+File object simply presents the same file path to the Parsl program
+and any executing tasks.
+
+Files defined as follows will be handled by the NoOpFileStaging provider.
+
+.. code-block:: python
+
+    File('file://home/parsl/data.txt')
+    File('/home/parsl/data.txt')
+
+
+The NoOpFileStaging provider is enabled by default on all
+executors. It can be explicitly set as the only
+staging provider as follows.
+
+.. code-block:: python
+
+    from parsl.config import Config
+    from parsl.executors import HighThroughputExecutor
+    from parsl.data_provider.file_noop import NoOpFileStaging
+
+    config = Config(
+        executors=[
+            HighThroughputExecutor(
+                storage_access=[NoOpFileStaging()]
+            )
+        ]
+    )
+
+
+FTP, HTTP, HTTPS: separate task staging
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Files named with the ``ftp``, ``http`` or ``https`` URL scheme will be
+staged in using HTTP GET or anonymous FTP commands. These commands
+will be executed as a separate
+Parsl task that will complete before the corresponding app
+executes.
These providers cannot be used to stage out output files.
+
+The following example defines a file accessible on a remote FTP server.
+
+.. code-block:: python
+
+    File('ftp://www.iana.org/pub/mirror/rirstats/arin/ARIN-STATS-FORMAT-CHANGE.txt')
+
+When such a file object is passed as an input to an app, Parsl will download the file to whatever location is selected for the app to execute.
+The following example illustrates how the remote file is implicitly downloaded from an FTP server and then converted. Note that the app does not need to know the location of the downloaded file on the remote computer, as Parsl abstracts this translation.
+
+.. code-block:: python
+
+    @python_app
+    def convert(inputs=(), outputs=()):
+        with open(inputs[0].filepath, 'r') as inp:
+            content = inp.read()
+        with open(outputs[0].filepath, 'w') as out:
+            out.write(content.upper())
+
+    # create a remote Parsl file
+    inp = File('ftp://www.iana.org/pub/mirror/rirstats/arin/ARIN-STATS-FORMAT-CHANGE.txt')
+
+    # create a local Parsl file
+    out = File('file:///tmp/ARIN-STATS-FORMAT-CHANGE.txt')
+
+    # call the convert app with the Parsl file
+    f = convert(inputs=[inp], outputs=[out])
+    f.result()
+
+HTTP and FTP separate task staging providers can be configured as follows.
+
+.. code-block:: python
+
+    from parsl.config import Config
+    from parsl.executors import HighThroughputExecutor
+    from parsl.data_provider.http import HTTPSeparateTaskStaging
+    from parsl.data_provider.ftp import FTPSeparateTaskStaging
+
+    config = Config(
+        executors=[
+            HighThroughputExecutor(
+                storage_access=[HTTPSeparateTaskStaging(), FTPSeparateTaskStaging()]
+            )
+        ]
+    )
+
+FTP, HTTP, HTTPS: in-task staging
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These staging providers are intended for use on executors that do not have
+a file system shared across executor nodes.
+
+These providers will use the same HTTP GET/anonymous FTP as the separate
+task staging providers described above, but will do so in a wrapper around
+individual app invocations, which guarantees that they will stage files to
+a file system visible to the app.
+
+A downside of this staging approach is that the staging tasks are less visible
+to Parsl, as they are not performed as separate Parsl tasks.
+
+In-task staging providers can be configured as follows.
+
+.. code-block:: python
+
+    from parsl.config import Config
+    from parsl.executors import HighThroughputExecutor
+    from parsl.data_provider.http import HTTPInTaskStaging
+    from parsl.data_provider.ftp import FTPInTaskStaging
+
+    config = Config(
+        executors=[
+            HighThroughputExecutor(
+                storage_access=[HTTPInTaskStaging(), FTPInTaskStaging()]
+            )
+        ]
+    )
+
+
+Globus
+^^^^^^
+
+The ``Globus`` staging provider is used to transfer files that can be accessed
+using Globus. A guide to using Globus is available `here
+`_.
+
+A file using the Globus scheme must specify the UUID of the Globus
+endpoint and a path to the file on the endpoint, for example:
+
+.. code-block:: python
+
+    File('globus://037f054a-15cf-11e8-b611-0ac6873fc732/unsorted.txt')
+
+Note: a Globus endpoint's UUID can be found in the Globus `Manage Endpoints `_ page.
+
+There must also be a Globus endpoint available with access to an
+execute-side file system, because Globus file transfers happen
+between two Globus endpoints.
+
+Globus Configuration
+""""""""""""""""""""
+
+In order to manage where files are staged, users must configure the default ``working_dir`` on a remote location. This information is specified in the :class:`~parsl.executors.base.ParslExecutor` via the ``working_dir`` parameter in the :class:`~parsl.config.Config` instance. For example:
+
+.. 
code-block:: python
+
+    from parsl.config import Config
+    from parsl.executors import HighThroughputExecutor
+
+    config = Config(
+        executors=[
+            HighThroughputExecutor(
+                working_dir="/home/user/data"
+            )
+        ]
+    )
+
+Parsl requires knowledge of the Globus endpoint that is associated with an executor. This is done by specifying the ``endpoint_uuid`` (the UUID of the Globus endpoint that is associated with the system) in the configuration.
+
+In some cases, for example when using a Globus `shared endpoint `_ or when a Globus endpoint is mounted on a supercomputer, the path seen by Globus is not the same as the local path seen by Parsl. In this case the configuration may optionally specify a mapping between the ``endpoint_path`` (the common root path seen in Globus), and the ``local_path`` (the common root path on the local file system), as in the following. In most cases, ``endpoint_path`` and ``local_path`` are the same and do not need to be specified.
+
+.. code-block:: python
+
+    from parsl.config import Config
+    from parsl.executors import HighThroughputExecutor
+    from parsl.data_provider.globus import GlobusStaging
+    from parsl.data_provider.data_manager import default_staging
+
+    config = Config(
+        executors=[
+            HighThroughputExecutor(
+                working_dir="/home/user/parsl_script",
+                storage_access=default_staging + [GlobusStaging(
+                    endpoint_uuid="7d2dc622-2edb-11e8-b8be-0ac6873fc732",
+                    endpoint_path="/",
+                    local_path="/home/user"
+                )]
+            )
+        ]
+    )
+
+
+Globus Authorization
+""""""""""""""""""""
+
+In order to transfer files with Globus, the user must first authenticate.
+The first time that Globus is used with Parsl on a computer, the program
+will prompt the user to follow an authentication and authorization
+procedure involving a web browser. Users can authorize out of band by
+running the ``parsl-globus-auth`` utility. This is useful, for example,
+when running a Parsl program in a batch system where it will be unattended.
+
+.. 
code-block:: bash
+
+    $ parsl-globus-auth
+    Parsl Globus command-line authorizer
+    If authorization to Globus is necessary, the library will prompt you now.
+    Otherwise it will do nothing
+    Authorization complete
+
+rsync
+^^^^^
+
+The ``rsync`` utility can be used to transfer files in the ``file`` scheme in configurations where
+workers cannot access the submit-side file system directly, such as when executing
+on an AWS EC2 instance or on a cluster without a shared file system.
+However, the submit-side file system must be exposed using rsync.
+
+rsync Configuration
+"""""""""""""""""""
+
+``rsync`` must be installed on both the submit and worker side. It can usually be installed
+by using the operating system package manager: for example, by ``apt-get install rsync``.
+
+An `RSyncStaging` option must then be added to the Parsl configuration file, as in the following.
+The parameter to RSyncStaging should describe the prefix to be passed to each rsync
+command to connect from workers to the submit-side host. This will often be the username
+and public IP address of the submitting system.
+
+.. code-block:: python
+
+    from parsl.config import Config
+    from parsl.executors import HighThroughputExecutor
+    from parsl.data_provider.http import HTTPInTaskStaging
+    from parsl.data_provider.ftp import FTPInTaskStaging
+    from parsl.data_provider.rsync import RSyncStaging
+
+    # public_ip is the publicly reachable address of the submitting system
+    config = Config(
+        executors=[
+            HighThroughputExecutor(
+                storage_access=[HTTPInTaskStaging(), FTPInTaskStaging(), RSyncStaging("benc@" + public_ip)],
+                ...
+            )
+        ]
+    )
+
+rsync Authorization
+"""""""""""""""""""
+
+The rsync staging provider delegates all authentication and authorization to the
+underlying ``rsync`` command. This command must be correctly authorized to connect back to
+the submit-side system. The form of this authorization will depend on the systems in
+question.
+
+The following example installs an ssh key from the submit-side file system and turns off host key
+checking, in the ``worker_init`` initialization of an EC2 instance. The ssh key must have
+sufficient privileges to run ``rsync`` over ssh on the submit-side system.
+
+.. 
code-block:: python + + with open("rsync-callback-ssh", "r") as f: + private_key = f.read() + + ssh_init = """ + mkdir .ssh + chmod go-rwx .ssh + + cat > .ssh/id_rsa < .ssh/config <`_ to encrypt all communication channels +between the executor and related nodes. + +Encryption performance +^^^^^^^^^^^^^^^^^^^^^^ + +CurveZMQ depends on `libzmq `_ and `libsodium `_, +which `pyzmq `_ (a Parsl dependency) includes as part of its +installation via ``pip``. This installation path should work on most systems, but users have +reported significant performance degradation as a result. + +If you experience a significant performance hit after enabling encryption, we recommend installing +``pyzmq`` with conda: + +.. code-block:: bash + + conda install conda-forge::pyzmq + +Alternatively, you can `install libsodium `_, then +`install libzmq `_, then build ``pyzmq`` from source: + +.. code-block:: bash + + pip3 install parsl --no-binary pyzmq diff --git a/docs/userguide/configuration/examples.rst b/docs/userguide/configuration/examples.rst new file mode 100644 index 0000000000..7e9b7ae9eb --- /dev/null +++ b/docs/userguide/configuration/examples.rst @@ -0,0 +1,333 @@ +Example configurations +====================== + +.. note:: + All configuration examples below must be customized for the user's + allocation, Python environment, file system, etc. + + +The configuration specifies what, and how, resources are to be used for executing +the Parsl program and its apps. +It is important to carefully consider the needs of the Parsl program and its apps, +and the characteristics of the compute resources, to determine an ideal configuration. +Aspects to consider include: +1) where the Parsl apps will execute; +2) how many nodes will be used to execute the apps, and how long the apps will run; +3) should Parsl request multiple nodes in an individual scheduler job; and +4) where will the main Parsl program run and how will it communicate with the apps. 
+
+Stepping through the following questions should help formulate a suitable configuration object.
+
+1. Where should apps be executed?
+
++---------------------+-----------------------------------------------+----------------------------------------+
+| Target              | Executor                                      | Provider                               |
++=====================+===============================================+========================================+
+| Laptop/Workstation  | * `parsl.executors.HighThroughputExecutor`    | `parsl.providers.LocalProvider`        |
+|                     | * `parsl.executors.ThreadPoolExecutor`        |                                        |
+|                     | * `parsl.executors.WorkQueueExecutor`         |                                        |
+|                     | * `parsl.executors.taskvine.TaskVineExecutor` |                                        |
++---------------------+-----------------------------------------------+----------------------------------------+
+| Amazon Web Services | * `parsl.executors.HighThroughputExecutor`    | `parsl.providers.AWSProvider`          |
++---------------------+-----------------------------------------------+----------------------------------------+
+| Google Cloud        | * `parsl.executors.HighThroughputExecutor`    | `parsl.providers.GoogleCloudProvider`  |
++---------------------+-----------------------------------------------+----------------------------------------+
+| Slurm based system  | * `parsl.executors.HighThroughputExecutor`    | `parsl.providers.SlurmProvider`        |
+|                     | * `parsl.executors.WorkQueueExecutor`         |                                        |
+|                     | * `parsl.executors.taskvine.TaskVineExecutor` |                                        |
++---------------------+-----------------------------------------------+----------------------------------------+
+| Torque/PBS based    | * `parsl.executors.HighThroughputExecutor`    | `parsl.providers.TorqueProvider`       |
+| system              | * `parsl.executors.WorkQueueExecutor`         |                                        |
++---------------------+-----------------------------------------------+----------------------------------------+
+| GridEngine based    | * `parsl.executors.HighThroughputExecutor`    | `parsl.providers.GridEngineProvider`   |
+| system              | * `parsl.executors.WorkQueueExecutor`         |                                        |
++---------------------+-----------------------------------------------+----------------------------------------+ +| Condor based | * `parsl.executors.HighThroughputExecutor` | `parsl.providers.CondorProvider` | +| cluster or grid | * `parsl.executors.WorkQueueExecutor` | | +| | * `parsl.executors.taskvine.TaskVineExecutor` | | ++---------------------+-----------------------------------------------+----------------------------------------+ +| Kubernetes cluster | * `parsl.executors.HighThroughputExecutor` | `parsl.providers.KubernetesProvider` | ++---------------------+-----------------------------------------------+----------------------------------------+ + + +2. How many nodes will be used to execute the apps? What task durations are necessary to achieve good performance? + + ++--------------------------------------------+----------------------+-------------------------------------+ +| Executor | Number of Nodes [*]_ | Task duration for good performance | ++============================================+======================+=====================================+ +| `parsl.executors.ThreadPoolExecutor` | 1 (Only local) | Any | ++--------------------------------------------+----------------------+-------------------------------------+ +| `parsl.executors.HighThroughputExecutor` | <=2000 | Task duration(s)/#nodes >= 0.01 | +| | | longer tasks needed at higher scale | ++--------------------------------------------+----------------------+-------------------------------------+ +| `parsl.executors.WorkQueueExecutor` | <=1000 [*]_ | 10s+ | ++--------------------------------------------+----------------------+-------------------------------------+ +| `parsl.executors.taskvine.TaskVineExecutor`| <=1000 [*]_ | 10s+ | ++--------------------------------------------+----------------------+-------------------------------------+ + + +.. [*] Assuming 32 workers per node. If there are fewer workers launched + per node, a larger number of nodes could be supported. + +.. 
[*] The maximum number of nodes tested for the `parsl.executors.WorkQueueExecutor` is 10,000 GPU cores and + 20,000 CPU cores. + +.. [*] The maximum number of nodes tested for the `parsl.executors.taskvine.TaskVineExecutor` is + 10,000 GPU cores and 20,000 CPU cores. + +3. Should Parsl request multiple nodes in an individual scheduler job? +(Here the term block is equivalent to a single scheduler job.) + ++--------------------------------------------------------------------------------------------+ +| ``nodes_per_block = 1`` | ++---------------------+--------------------------+-------------------------------------------+ +| Provider | Executor choice | Suitable Launchers | ++=====================+==========================+===========================================+ +| Systems that don't | Any | * `parsl.launchers.SingleNodeLauncher` | +| use Aprun | | * `parsl.launchers.SimpleLauncher` | ++---------------------+--------------------------+-------------------------------------------+ +| Aprun based systems | Any | * `parsl.launchers.AprunLauncher` | ++---------------------+--------------------------+-------------------------------------------+ + ++---------------------------------------------------------------------------------------------------------------------+ +| ``nodes_per_block > 1`` | ++-------------------------------------+--------------------------+----------------------------------------------------+ +| Provider | Executor choice | Suitable Launchers | ++=====================================+==========================+====================================================+ +| `parsl.providers.TorqueProvider` | Any | * `parsl.launchers.AprunLauncher` | +| | | * `parsl.launchers.MpiExecLauncher` | ++-------------------------------------+--------------------------+----------------------------------------------------+ +| `parsl.providers.SlurmProvider` | Any | * `parsl.launchers.SrunLauncher` if native slurm | +| | | * `parsl.launchers.AprunLauncher`, 
otherwise | ++-------------------------------------+--------------------------+----------------------------------------------------+ + +.. note:: If using a Cray system, you most likely need to use the `parsl.launchers.AprunLauncher` to launch workers unless you + are on a **native Slurm** system like :ref:`configuring_nersc_cori` + +Ad-Hoc Clusters +--------------- + +Parsl's support of ad-hoc clusters of compute nodes without a scheduler +is deprecated. + +See +`issue #3515 `_ +for further discussion. + +Amazon Web Services +------------------- + +.. image:: img/aws_image.png + +.. note:: + To use AWS with Parsl, install Parsl with AWS dependencies via ``python3 -m pip install 'parsl[aws]'`` + +Amazon Web Services is a commercial cloud service which allows users to rent a range of computers and other computing services. +The following snippet shows how Parsl can be configured to provision nodes from the Elastic Compute Cloud (EC2) service. +The first time this configuration is used, Parsl will configure a Virtual Private Cloud and other networking and security infrastructure that will be +re-used in subsequent executions. The configuration uses the `parsl.providers.AWSProvider` to connect to AWS. + +.. literalinclude:: ../../../parsl/configs/ec2.py + + +ASPIRE 1 (NSCC) +--------------- + +.. image:: https://www.nscc.sg/wp-content/uploads/2017/04/ASPIRE1Img.png + +The following snippet shows an example configuration for accessing NSCC's **ASPIRE 1** supercomputer. This example uses the `parsl.executors.HighThroughputExecutor` executor and connects to ASPIRE1's PBSPro scheduler. It also shows how ``scheduler_options`` parameter could be used for scheduling array jobs in PBSPro. + +.. literalinclude:: ../../../parsl/configs/ASPIRE1.py + + + + +Illinois Campus Cluster (UIUC) +------------------------------ + +.. 
image:: https://campuscluster.illinois.edu/wp-content/uploads/2018/02/ND2_3633-sm.jpg + +The following snippet shows an example configuration for executing on the Illinois Campus Cluster. +The configuration assumes the user is running on a login node and uses the `parsl.providers.SlurmProvider` to interface +with the scheduler, and uses the `parsl.launchers.SrunLauncher` to launch workers. + +.. literalinclude:: ../../../parsl/configs/illinoiscluster.py + +Bridges (PSC) +------------- + +.. image:: https://insidehpc.com/wp-content/uploads/2016/08/Bridges_FB1b.jpg + +The following snippet shows an example configuration for executing on the Bridges supercomputer at the Pittsburgh Supercomputing Center. +The configuration assumes the user is running on a login node and uses the `parsl.providers.SlurmProvider` to interface +with the scheduler, and uses the `parsl.launchers.SrunLauncher` to launch workers. + +.. literalinclude:: ../../../parsl/configs/bridges.py + + + +CC-IN2P3 +-------- + +.. image:: https://cc.in2p3.fr/wp-content/uploads/2017/03/bandeau_accueil.jpg + +The snippet below shows an example configuration for executing from a login node on IN2P3's Computing Centre. +The configuration uses the `parsl.providers.LocalProvider` to run on a login node primarily to avoid GSISSH, which Parsl does not support. +This system uses Grid Engine which Parsl interfaces with using the `parsl.providers.GridEngineProvider`. + +.. literalinclude:: ../../../parsl/configs/cc_in2p3.py + + +CCL (Notre Dame, TaskVine) +-------------------------- + +.. image:: https://ccl.cse.nd.edu/software/taskvine/taskvine-logo.png + +To utilize TaskVine with Parsl, please install the full CCTools software package within an appropriate Anaconda or Miniconda environment +(instructions for installing Miniconda can be found `in the Conda install guide `_): + +.. 
code-block:: bash
+
+    $ conda create -y --name  python= conda-pack
+    $ conda activate 
+    $ conda install -y -c conda-forge ndcctools parsl
+
+This creates a Conda environment on your machine with all the necessary tools and setup needed to utilize TaskVine with the Parsl library.
+
+The following snippet shows an example configuration for using the Parsl/TaskVine executor to run applications on the local machine.
+This example uses the `parsl.executors.taskvine.TaskVineExecutor` to schedule tasks, and a local worker will be started automatically.
+For more information on using TaskVine, including configurations for remote execution, visit the
+`TaskVine/Parsl documentation online `_.
+
+.. literalinclude:: ../../../parsl/configs/vineex_local.py
+
+TaskVine's predecessor, WorkQueue, may continue to be used with Parsl.
+For more information on using WorkQueue visit the `CCTools documentation online `_.
+
+Expanse (SDSC)
+--------------
+
+.. image:: https://www.hpcwire.com/wp-content/uploads/2019/07/SDSC-Expanse-graphic-cropped.jpg
+
+The following snippet shows an example configuration for executing remotely on San Diego Supercomputer
+Center's **Expanse** supercomputer. The example is designed to be executed on the login nodes, using the
+`parsl.providers.SlurmProvider` to interface with the Slurm scheduler used by Expanse and the `parsl.launchers.SrunLauncher` to launch workers.
+
+.. literalinclude:: ../../../parsl/configs/expanse.py
+
+
+Improv (Argonne LCRC)
+---------------------
+
+.. image:: https://www.lcrc.anl.gov/sites/default/files/styles/965_wide/public/2023-12/20231214_114057.jpg?itok=A-Rz5pP9
+
+**Improv** is a PBS Pro based supercomputer at Argonne's Laboratory Computing Resource
+Center (LCRC). The following snippet is an example configuration that uses `parsl.providers.PBSProProvider`
+and `parsl.launchers.MpiRunLauncher` to run multi-node jobs.
+
+.. literalinclude:: ../../../parsl/configs/improv.py
+
+
+.. 
_configuring_nersc_cori: + +Perlmutter (NERSC) +------------------ + +NERSC provides documentation on `how to use Parsl on Perlmutter `_. +Perlmutter is a Slurm based HPC system and parsl uses `parsl.providers.SlurmProvider` with `parsl.launchers.SrunLauncher` +to launch tasks onto this machine. + + +Frontera (TACC) +--------------- + +.. image:: https://frontera-portal.tacc.utexas.edu/media/filer_public/2c/fb/2cfbf6ab-818d-42c8-b4d5-9b39eb9d0a05/frontera-banner-home.jpg + +Deployed in June 2019, Frontera is the 5th most powerful supercomputer in the world. Frontera replaces the NSF Blue Waters system at NCSA +and is the first deployment in the National Science Foundation's petascale computing program. The configuration below assumes that the user is +running on a login node and uses the `parsl.providers.SlurmProvider` to interface with the scheduler, and uses the `parsl.launchers.SrunLauncher` to launch workers. + +.. literalinclude:: ../../../parsl/configs/frontera.py + + +Kubernetes Clusters +------------------- + +.. image:: https://d1.awsstatic.com/PAC/kuberneteslogo.eabc6359f48c8e30b7a138c18177f3fd39338e05.png + +Kubernetes is an open-source system for container management, such as automating deployment and scaling of containers. +The snippet below shows an example configuration for deploying pods as workers on a Kubernetes cluster. +The KubernetesProvider exploits the Python Kubernetes API, which assumes that you have kube config in ``~/.kube/config``. + +.. literalinclude:: ../../../parsl/configs/kubernetes.py + + +Midway (RCC, UChicago) +---------------------- + +.. image:: https://rcc.uchicago.edu/sites/rcc.uchicago.edu/files/styles/slideshow-image/public/uploads/images/slideshows/20140430_RCC_8978.jpg?itok=BmRuJ-wq + +This Midway cluster is a campus cluster hosted by the Research Computing Center at the University of Chicago. +The snippet below shows an example configuration for executing remotely on Midway. 
+The configuration assumes the user is running on a login node and uses the `parsl.providers.SlurmProvider` to interface
+with the scheduler, and uses the `parsl.launchers.SrunLauncher` to launch workers.
+
+.. literalinclude:: ../../../parsl/configs/midway.py
+
+
+Open Science Grid
+-----------------
+
+.. image:: https://www.renci.org/wp-content/uploads/2008/10/osg_logo.png
+
+The Open Science Grid (OSG) is a national, distributed computing Grid spanning over 100 individual sites to provide tens of thousands of CPU cores.
+The snippet below shows an example configuration for executing remotely on OSG. You will need to have a valid project name on the OSG.
+The configuration uses the `parsl.providers.CondorProvider` to interface with the scheduler.
+
+.. literalinclude:: ../../../parsl/configs/osg.py
+
+
+Polaris (ALCF)
+--------------
+
+.. image:: https://www.alcf.anl.gov/sites/default/files/styles/965x543/public/2022-07/33181D_086_ALCF%20Polaris%20Crop.jpg?itok=HVAHsZtt
+   :width: 75%
+
+ALCF provides documentation on `how to use Parsl on Polaris `_.
+Polaris uses `parsl.providers.PBSProProvider` and `parsl.launchers.MpiExecLauncher` to launch tasks onto the HPC system.
+
+
+
+Stampede2 (TACC)
+----------------
+
+.. image:: https://www.tacc.utexas.edu/documents/1084364/1413880/stampede2-0717.jpg/
+
+The following snippet shows an example configuration for accessing TACC's **Stampede2** supercomputer. This example uses the HighThroughput executor and connects to Stampede2's Slurm scheduler.
+
+.. literalinclude:: ../../../parsl/configs/stampede2.py
+
+
+Summit (ORNL)
+-------------
+
+.. image:: https://www.olcf.ornl.gov/wp-content/uploads/2018/06/Summit_Exaop-1500x844.jpg
+
+The following snippet shows an example configuration for executing from the login node on Summit, the leadership class supercomputer hosted at the Oak Ridge National Laboratory.
+The example uses the :class:`parsl.providers.LSFProvider` to provision compute nodes from the LSF cluster scheduler and the `parsl.launchers.JsrunLauncher` to launch workers across the compute nodes. + +.. literalinclude:: ../../../parsl/configs/summit.py + + +TOSS3 (LLNL) +------------ + +.. image:: https://hpc.llnl.gov/sites/default/files/Magma--2020-LLNL.jpg + +The following snippet shows an example configuration for executing on one of LLNL's **TOSS3** +machines, such as Quartz, Ruby, Topaz, Jade, or Magma. This example uses the `parsl.executors.FluxExecutor` +and connects to Slurm using the `parsl.providers.SlurmProvider`. This configuration assumes that the script +is being executed on the login nodes of one of the machines. + +.. literalinclude:: ../../../parsl/configs/toss3_llnl.py diff --git a/docs/userguide/configuration/execution.rst b/docs/userguide/configuration/execution.rst new file mode 100644 index 0000000000..ac7217032a --- /dev/null +++ b/docs/userguide/configuration/execution.rst @@ -0,0 +1,227 @@ +.. _label-execution: + +Execution +========= + +Contemporary computing environments may include a wide range of computational platforms or **execution providers**, from laptops and PCs to various clusters, supercomputers, and cloud computing platforms. Different execution providers may require or allow for the use of different **execution models**, such as threads (for efficient parallel execution on a multicore processor), processes, and pilot jobs for running many small tasks on a large parallel system. + +Parsl is designed to abstract these low-level details so that an identical Parsl program can run unchanged on different platforms or across multiple platforms. +To this end, Parsl uses a configuration file to specify which execution provider(s) and execution model(s) to use. +Parsl provides a high level abstraction, called a *block*, for providing a uniform description of a compute resource irrespective of the specific execution provider. + +.. 
note::
+   Refer to :ref:`configuration-section` for information on how to configure the various components described
+   below for specific scenarios.
+
+Execution providers
+-------------------
+
+Clouds, supercomputers, and local PCs offer vastly different modes of access.
+To overcome these differences, and present a single uniform interface,
+Parsl implements a simple provider abstraction. This
+abstraction is key to Parsl's ability to enable scripts to be moved
+between resources. The provider interface exposes three core actions: submit a
+job for execution (e.g., sbatch for the Slurm resource manager),
+retrieve the status of an allocation (e.g., squeue), and cancel a running
+job (e.g., scancel). Parsl implements providers for local execution
+(fork), for various cloud platforms using cloud-specific APIs, and
+for clusters and supercomputers that use a Local Resource Manager
+(LRM) to manage access to resources, such as Slurm and HTCondor.
+
+Each provider implementation may allow users to specify additional parameters for further configuration. Parameters are generally mapped to LRM submission script or cloud API options.
+Examples of LRM-specific options are partition, wall clock time,
+scheduler options (e.g., #SBATCH arguments for Slurm), and worker
+initialization commands (e.g., loading a conda environment). Cloud
+parameters include access keys, instance type, and spot bid price.
+
+Parsl currently supports the following providers:
+
+1. `parsl.providers.LocalProvider`: This provider allows you to run locally on your laptop or workstation.
+2. `parsl.providers.SlurmProvider`: This provider allows you to schedule resources via the Slurm scheduler.
+3. `parsl.providers.CondorProvider`: This provider allows you to schedule resources via the Condor scheduler.
+4. `parsl.providers.GridEngineProvider`: This provider allows you to schedule resources via the GridEngine scheduler.
+5. 
`parsl.providers.TorqueProvider`: This provider allows you to schedule resources via the Torque scheduler.
+6. `parsl.providers.AWSProvider`: This provider allows you to provision and manage cloud nodes from Amazon Web Services.
+7. `parsl.providers.GoogleCloudProvider`: This provider allows you to provision and manage cloud nodes from Google Cloud.
+8. `parsl.providers.KubernetesProvider`: This provider allows you to provision and manage containers on a Kubernetes cluster.
+9. `parsl.providers.LSFProvider`: This provider allows you to schedule resources via IBM's LSF scheduler.
+
+
+
+Executors
+---------
+
+Parsl programs vary widely in terms of their
+execution requirements. Individual Apps may run for milliseconds
+or days, and available parallelism can vary from none for
+sequential programs to millions for "pleasingly parallel" programs.
+Parsl executors, as the name suggests, execute Apps on one or more
+target execution resources such as multi-core workstations, clouds,
+or supercomputers. As it appears infeasible to implement a single
+execution strategy that will meet so many diverse requirements on
+such varied platforms, Parsl provides a modular executor interface
+and a collection of executors that are tuned for common execution
+patterns.
+
+Parsl executors extend the Executor class offered by Python's
+concurrent.futures library, which allows Parsl to use
+existing solutions in the Python Standard Library (e.g., ThreadPoolExecutor)
+and from other packages such as Work Queue. Parsl
+extends the concurrent.futures executor interface to support
+additional capabilities such as automatic scaling of execution resources,
+monitoring, deferred initialization, and methods to set working
+directories.
+All executors share a common execution kernel that is responsible
+for deserializing the task (i.e., the App and its input arguments)
+and executing the task in a sandboxed Python environment.
+
+Parsl currently supports the following executors:
+
+1. 
`parsl.executors.ThreadPoolExecutor`: This executor supports multi-thread execution on local resources.
+
+2. `parsl.executors.HighThroughputExecutor`: This executor implements hierarchical scheduling and batching using a pilot job model to deliver high-throughput task execution on up to 4,000 nodes.
+
+3. `parsl.executors.WorkQueueExecutor`: This executor integrates `Work Queue `_ as an execution backend. Work Queue scales to tens of thousands of cores and implements reliable execution of tasks with dynamic resource sizing.
+
+4. `parsl.executors.taskvine.TaskVineExecutor`: This executor uses `TaskVine `_ as the execution backend. TaskVine scales up to tens of thousands of cores and actively uses local storage on compute nodes to offer a diverse array of performance-oriented features, including: smart caching and sharing common large files between tasks and compute nodes, reliable execution of tasks, dynamic resource sizing, automatic Python environment detection and sharing.
+
+These executors cover a broad range of execution requirements. As with other Parsl components, there is a standard interface (ParslExecutor) that can be implemented to add support for other executors.
+
+.. note::
+   Refer to :ref:`configuration-section` for information on how to configure these executors.
+
+
+Launchers
+---------
+
+Many LRMs offer mechanisms for spawning applications across nodes
+inside a single job and for specifying the
+resources and task placement information needed to execute that
+application at launch time. Common mechanisms include
+`srun `_ (for Slurm),
+`aprun `_ (for Crays), and `mpirun `_ (for MPI).
+Thus, to run Parsl programs on such systems, we typically want first to
+request a large number of nodes and then to *launch* "pilot job" or
+**worker** processes using the system launchers.
+Parsl's Launcher abstraction enables Parsl programs
+to use these system-specific launcher systems to start workers across
+cores and nodes.
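In essence, a launcher is a callable that wraps the per-worker command in the system's launch mechanism. The following standalone sketch mimics the transformation an Srun-style launcher performs; it is illustrative only (the name ``srun_style_launch`` and the exact flags are assumptions, not Parsl's real `parsl.launchers.SrunLauncher`):

```python
# Sketch of the launcher concept: given the worker command, produce a new
# command that the system launcher (here, srun) uses to start one worker
# per task slot across the allocated nodes.
def srun_style_launch(command: str, tasks_per_node: int, nodes_per_block: int) -> str:
    total_tasks = tasks_per_node * nodes_per_block
    return "srun --ntasks={} --ntasks-per-node={} {}".format(
        total_tasks, tasks_per_node, command)

print(srun_style_launch("process_worker_pool.py", 2, 4))
# -> srun --ntasks=8 --ntasks-per-node=2 process_worker_pool.py
```

Parsl's real launcher classes follow the same ``(command, tasks_per_node, nodes_per_block)`` calling convention, as the custom launcher example later in this section shows.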
+
+Parsl currently supports the following set of launchers:
+
+1. `parsl.launchers.SrunLauncher`: Srun-based launcher for Slurm-based systems.
+2. `parsl.launchers.AprunLauncher`: Aprun-based launcher for Cray systems.
+3. `parsl.launchers.SrunMPILauncher`: Launcher for launching MPI applications with Srun.
+4. `parsl.launchers.GnuParallelLauncher`: Launcher using GNU parallel to launch workers across nodes and cores.
+5. `parsl.launchers.MpiExecLauncher`: Uses ``mpiexec`` to launch workers.
+6. `parsl.launchers.SimpleLauncher`: Defaults to launching a single worker.
+7. `parsl.launchers.SingleNodeLauncher`: Launches ``workers_per_node`` workers on a single node.
+
+Additionally, the launcher interface can be used to implement specialized behaviors
+in custom environments (for example, to
+launch node processes inside containers with customized environments).
+For example, the following launcher uses Srun to launch ``worker-wrapper``, passing the
+command to be run as parameters to ``worker-wrapper``. It is the responsibility of ``worker-wrapper``
+to launch the command it is given inside the appropriate environment.
+
+.. code:: python
+
+    from parsl.launchers import SrunLauncher
+
+    class MyShifterSRunLauncher:
+        def __init__(self):
+            self.srun_launcher = SrunLauncher()
+
+        def __call__(self, command, tasks_per_node, nodes_per_block):
+            new_command = "worker-wrapper {}".format(command)
+            return self.srun_launcher(new_command, tasks_per_node, nodes_per_block)
+
+Blocks
+------
+
+One challenge when making use of heterogeneous
+execution resource types is the need to provide a uniform representation of
+resources. Consider that single requests on clouds return individual
+nodes, clusters and supercomputers provide batches of nodes, grids
+provide cores, and workstations provide a single multicore node.
+
+Parsl defines a resource abstraction called a *block* as the most basic unit
+of resources to be acquired from a provider. 
A block contains one
+or more nodes and maps to the different provider abstractions. In
+a cluster, a block corresponds to a single allocation request to a
+scheduler. In a cloud, a block corresponds to a single API request
+for one or more instances.
+Parsl can then execute *tasks* (instances of apps)
+within and across (e.g., for MPI jobs) nodes within a block.
+Blocks are also used as the basis for
+elasticity on batch scheduling systems (see Elasticity below).
+Three different examples of block configurations are shown below.
+
+1. A single block composed of a node executing one task:
+
+   .. image:: ../../images/N1_T1.png
+      :scale: 75%
+
+2. A single block with one node executing several tasks. This configuration is
+   most suitable for single-threaded apps running on multicore target systems.
+   The number of tasks executed concurrently is proportional to the number of cores available on the system.
+
+   .. image:: ../../images/N1_T4.png
+      :scale: 75%
+
+3. A block composed of several nodes and executing several tasks, where a task can span multiple nodes. This configuration
+   is generally used by MPI applications. Starting a task requires using a specific
+   MPI launcher that is supported on the target system (e.g., aprun, srun, mpirun, mpiexec).
+   The `MPI Apps `_ documentation page describes how to configure Parsl for this case.
+
+   .. image:: ../../images/N4_T2.png
+
+The configuration options for specifying the shape of each block are:
+
+1. ``workers_per_node``: Number of workers started per node, which corresponds to the number of tasks that can execute concurrently on a node.
+2. ``nodes_per_block``: Number of nodes requested per block.
+
+
+
+Multi-executor
+--------------
+
+Parsl supports the use of one or more executors as specified in the configuration.
+In this situation, individual apps may indicate which executors they are able to use. 
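For instance, a configuration along the following lines defines two labeled executors side by side. This is a sketch, not a tested recipe: the labels ``cluster_htex`` and ``login_threads``, the partition name, and the block sizes are placeholders to be adapted to your site.

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor, ThreadPoolExecutor
from parsl.launchers import SrunLauncher
from parsl.providers import SlurmProvider

config = Config(
    executors=[
        # Heavyweight tasks: pilot-job workers on cluster nodes
        HighThroughputExecutor(
            label="cluster_htex",
            provider=SlurmProvider(
                partition="normal",      # placeholder partition name
                nodes_per_block=4,
                launcher=SrunLauncher(),
            ),
        ),
        # Lightweight tasks: threads alongside the main program
        ThreadPoolExecutor(label="login_threads", max_threads=4),
    ],
)
```

Apps can then select an executor by label, e.g. ``@python_app(executors=["cluster_htex"])``; an app that does not specify ``executors`` may run on any of them.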
+
+The common scenarios for this feature are:
+
+* A workflow has an initial simulation stage that runs on the compute-heavy
+  nodes of an HPC system followed by an analysis and visualization stage that
+  is better suited for GPU nodes.
+* A workflow follows a repeated fan-out, fan-in model where the long-running
+  fan-out tasks are computed on a cluster and the quick fan-in computation is
+  better suited for execution using threads on a login node.
+* A workflow includes apps that wait and evaluate the results of a
+  computation to determine whether the app should be relaunched.
+  Only apps running on threads may launch other apps. Often, simulations
+  have stochastic behavior and may terminate before completion.
+  In such cases, a wrapper app that checks the exit code can be used to
+  automatically re-execute the app (possibly from a checkpoint) until it
+  completes successfully.
+
+
+The following code snippet shows how apps can specify suitable executors in the app decorator.
+
+.. 
code-block:: python + + #(CPU heavy app) (CPU heavy app) (CPU heavy app) <--- Run on compute queue + # | | | + # (data) (data) (data) + # \ | / + # (Analysis and visualization phase) <--- Run on GPU node + + # A mock molecular dynamics simulation app + @bash_app(executors=["Theta.Phi"]) + def MD_Sim(arg, outputs=()): + return "MD_simulate {} -o {}".format(arg, outputs[0]) + + # Visualize results from the mock MD simulation app + @bash_app(executors=["Cooley.GPU"]) + def visualize(inputs=(), outputs=()): + bash_array = " ".join(inputs) + return "viz {} -o {}".format(bash_array, outputs[0]) + diff --git a/docs/userguide/configuration/heterogeneous.rst b/docs/userguide/configuration/heterogeneous.rst new file mode 100644 index 0000000000..f004f68fbf --- /dev/null +++ b/docs/userguide/configuration/heterogeneous.rst @@ -0,0 +1,106 @@ +Heterogeneous resources +----------------------- + +In some cases, it can be difficult to specify the resource requirements for running a workflow. +For example, if the compute nodes a site provides are not uniform, there is no "correct" resource configuration; +the amount of parallelism depends on which node (large or small) each job runs on. +In addition, the software and filesystem setup can vary from node to node. +A Condor cluster may not provide shared filesystem access at all, +and may include nodes with a variety of Python versions and available libraries. + +The :class:`parsl.executors.WorkQueueExecutor` provides several features to work with heterogeneous resources. +By default, Parsl only runs one app at a time on each worker node. +However, it is possible to specify the requirements for a particular app, +and Work Queue will automatically run as many parallel instances as possible on each node. +Work Queue automatically detects the amount of cores, memory, and other resources available on each execution node. +To activate this feature, add a resource specification to your apps. 
A resource specification is a dictionary with +the following three keys: ``cores`` (an integer corresponding to the number of cores required by the task), +``memory`` (an integer corresponding to the task's memory requirement in MB), and ``disk`` (an integer corresponding to +the task's disk requirement in MB), passed to an app via the special keyword argument ``parsl_resource_specification``. The specification can be set for all app invocations via a default, for example: + + .. code-block:: python + + @python_app + def compute(x, parsl_resource_specification={'cores': 1, 'memory': 1000, 'disk': 1000}): + return x*2 + + +or updated when the app is invoked: + + .. code-block:: python + + spec = {'cores': 1, 'memory': 500, 'disk': 500} + future = compute(x, parsl_resource_specification=spec) + +This ``parsl_resource_specification`` special keyword argument will inform Work Queue about the resources this app requires. +When placing instances of ``compute(x)``, Work Queue will run as many parallel instances as possible based on each worker node's available resources. + +If an app's resource requirements are not known in advance, +Work Queue has an auto-labeling feature that measures the actual resource usage of your apps and automatically chooses resource labels for you. +With auto-labeling, it is not necessary to provide ``parsl_resource_specification``; +Work Queue collects stats in the background and updates resource labels as your workflow runs. +To activate this feature, add the following flags to your executor config: + + .. code-block:: python + + config = Config( + executors=[ + WorkQueueExecutor( + # ...other options go here + autolabel=True, + autocategory=True + ) + ] + ) + +The ``autolabel`` flag tells Work Queue to automatically generate resource labels. +By default, these labels are shared across all apps in your workflow. 
+The ``autocategory`` flag puts each app into a different category, +so that Work Queue will choose separate resource requirements for each app. +This is important if e.g. some of your apps use a single core and some apps require multiple cores. +Unless you know that all apps have uniform resource requirements, +you should turn on ``autocategory`` when using ``autolabel``. + +The Work Queue executor can also help deal with sites that have non-uniform software environments across nodes. +Parsl assumes that the Parsl program and the compute nodes all use the same Python version. +In addition, any packages your apps import must be available on compute nodes. +If no shared filesystem is available or if node configuration varies, +this can lead to difficult-to-trace execution problems. + +If your Parsl program is running in a Conda environment, +the Work Queue executor can automatically scan the imports in your apps, +create a self-contained software package, +transfer the software package to worker nodes, +and run your code inside the packaged and uniform environment. +First, make sure that the Conda environment is active and you have the required packages installed (via either ``pip`` or ``conda``): + +- ``python`` +- ``parsl`` +- ``ndcctools`` +- ``conda-pack`` + +Then add the following to your config: + + .. code-block:: python + + config = Config( + executors=[ + WorkQueueExecutor( + # ...other options go here + pack=True + ) + ] + ) + +.. note:: + There will be a noticeable delay the first time Work Queue sees an app; + it is creating and packaging a complete Python environment. + This packaged environment is cached, so subsequent app invocations should be much faster. + +Using this approach, it is possible to run Parsl applications on nodes that don't have Python available at all. +The packaged environment includes a Python interpreter, +and Work Queue does not require Python to run. + +.. 
note:: + The automatic packaging feature only supports packages installed via ``pip`` or ``conda``. + Importing from other locations (e.g. via ``$PYTHONPATH``) or importing other modules in the same directory is not supported. \ No newline at end of file diff --git a/docs/userguide/aws_image.png b/docs/userguide/configuration/img/aws_image.png similarity index 100% rename from docs/userguide/aws_image.png rename to docs/userguide/configuration/img/aws_image.png diff --git a/docs/userguide/parsl_parallelism.gif b/docs/userguide/configuration/img/parsl_parallelism.gif similarity index 100% rename from docs/userguide/parsl_parallelism.gif rename to docs/userguide/configuration/img/parsl_parallelism.gif diff --git a/docs/userguide/parsl_scaling.gif b/docs/userguide/configuration/img/parsl_scaling.gif similarity index 100% rename from docs/userguide/parsl_scaling.gif rename to docs/userguide/configuration/img/parsl_scaling.gif diff --git a/docs/userguide/configuration/index.rst b/docs/userguide/configuration/index.rst new file mode 100644 index 0000000000..5db6aca918 --- /dev/null +++ b/docs/userguide/configuration/index.rst @@ -0,0 +1,88 @@ +.. _configuration-section: + +Configuring Parsl +================= + +Parsl separates program logic from execution configuration, enabling +programs to be developed entirely independently from their execution +environment. Configuration is described by a Python object (:class:`~parsl.config.Config`) +so that developers can +introspect permissible options, validate settings, and retrieve/edit +configurations dynamically during execution. A configuration object specifies +details of the provider, executors, allocation size, +queues, durations, and data management options. + +The following example shows a basic configuration object (:class:`~parsl.config.Config`) for the Frontera +supercomputer at TACC. +This config uses the `parsl.executors.HighThroughputExecutor` to submit +tasks from a login node. 
It requests an allocation of
+128 nodes, deploying 1 worker for each of the 56 cores per node, from the normal partition.
+To limit network connections to just the internal network, the config specifies the address
+used by the infiniband interface with ``address_by_interface('ib0')``.
+
+.. code-block:: python
+
+    from parsl.config import Config
+    from parsl.providers import SlurmProvider
+    from parsl.executors import HighThroughputExecutor
+    from parsl.launchers import SrunLauncher
+    from parsl.addresses import address_by_interface
+
+    config = Config(
+        executors=[
+            HighThroughputExecutor(
+                label="frontera_htex",
+                address=address_by_interface('ib0'),
+                max_workers_per_node=56,
+                provider=SlurmProvider(
+                    nodes_per_block=128,
+                    init_blocks=1,
+                    partition='normal',
+                    launcher=SrunLauncher(),
+                ),
+            )
+        ],
+    )
+
+
+Use the ``Config`` object to start Parsl's data flow kernel with the ``parsl.load`` method:
+
+.. code-block:: python
+
+    from parsl.configs.htex_local import config
+    import parsl
+
+    with parsl.load(config):
+        # Your workflow here
+
+The ``load`` statement can happen after Apps are defined but must occur before tasks are started.
+Loading the Config object within a context manager such as ``with`` is recommended
+so that the DFK is cleaned up implicitly on exiting the context manager.
+
+The :class:`~parsl.config.Config` object may not be used again after it has been loaded.
+Consider a configuration function if the application will shut down and re-launch the DFK.
+
+.. code-block:: python
+
+    from parsl.config import Config
+    import parsl
+
+    def make_config() -> Config:
+        return Config(...)
+
+    with parsl.load(make_config()):
+        # Your workflow here
+    parsl.clear()  # Stops Parsl
+    with parsl.load(make_config()):  # Re-launches with a fresh configuration
+        # Your workflow here
+
+
+.. 
toctree::
+   :maxdepth: 2
+
+   execution
+   elasticity
+   pinning
+   data
+   heterogeneous
+   encryption
+   examples
diff --git a/docs/userguide/configuration/pinning.rst b/docs/userguide/configuration/pinning.rst
new file mode 100644
index 0000000000..d34f1f9030
--- /dev/null
+++ b/docs/userguide/configuration/pinning.rst
@@ -0,0 +1,125 @@
+Resource pinning
+================
+
+Resource pinning reduces contention between multiple workers using the same CPU cores or accelerators.
+
+Multi-Threaded Applications
+---------------------------
+
+Workflows that launch multiple workers on a single node to perform multi-threaded tasks (e.g., NumPy, Tensorflow operations) may run into thread contention issues.
+Each worker may try to use the same hardware threads, which leads to performance penalties.
+Use the ``cpu_affinity`` feature of the :class:`~parsl.executors.HighThroughputExecutor` to assign workers to specific threads. Users can pin threads to
+workers either with a strategy method or an explicit list.
+
+The strategy methods will auto-assign all detected hardware threads to workers.
+Allowed strategies that can be assigned to ``cpu_affinity`` are ``block``, ``block-reverse``, and ``alternating``.
+The ``block`` method pins threads to workers in sequential order (ex: 4 threads are grouped (0, 1) and (2, 3) on two workers);
+``block-reverse`` pins threads in reverse sequential order (ex: (3, 2) and (1, 0)); and ``alternating`` alternates threads among workers (ex: (0, 2) and (1, 3)).
+
+Select the strategy best suited to the processor's cache hierarchy (choose ``alternating`` if in doubt) to ensure workers do not compete for cores.
+
+.. 
code-block:: python

+    local_config = Config(
+        executors=[
+            HighThroughputExecutor(
+                label="htex_Local",
+                worker_debug=True,
+                cpu_affinity='alternating',
+                provider=LocalProvider(
+                    init_blocks=1,
+                    max_blocks=1,
+                ),
+            )
+        ],
+        strategy='none',
+    )
+
+Users can also use ``cpu_affinity`` to explicitly assign threads to workers with a string that has the format of
+``cpu_affinity="list:<worker1_threads>:<worker2_threads>:<worker3_threads>"``.
+
+Each worker's threads can be specified as a comma-separated list or a hyphenated range:
+``thread1,thread2,thread3``
+or
+``thread_start-thread_end``.
+
+An example for 12 workers on a node with 208 threads is:
+
+.. code-block:: python
+
+    cpu_affinity="list:0-7,104-111:8-15,112-119:16-23,120-127:24-31,128-135:32-39,136-143:40-47,144-151:52-59,156-163:60-67,164-171:68-75,172-179:76-83,180-187:84-91,188-195:92-99,196-203"
+
+This example assigns 16 threads each to 12 workers. Note that in this example there are threads that are skipped.
+If a thread is not explicitly assigned to a worker, it will be left idle.
+The number of thread "ranks" (colon-separated thread lists/ranges) must match the total number of workers on the node; otherwise an exception will be raised.
+
+
+
+Thread affinity is accomplished in two ways.
+Each worker first sets the affinity for the Python process using `the affinity mask `_,
+which may not be available on all operating systems.
+It then sets environment variables to control
+`OpenMP thread affinity `_
+so that any subprocesses launched by a worker that use OpenMP know which processors are valid.
+These include ``OMP_NUM_THREADS``, ``GOMP_CPU_AFFINITY``, and ``KMP_AFFINITY``.
+
+Accelerators
+------------
+
+Many modern clusters provide multiple accelerators per compute node, yet many applications are best suited to using a
+single accelerator per task. Parsl supports pinning each worker to different accelerators using
+the ``available_accelerators`` option of the :class:`~parsl.executors.HighThroughputExecutor`. 
Provide either the number of
+accelerators (Parsl will assume they are named in integers starting from zero) or a list of the names of the accelerators
+available on the node. Parsl will limit the number of workers it launches to the number of accelerators specified;
+in other words, you cannot have more workers per node than there are accelerators. By default, Parsl will launch
+as many workers as the accelerators specified via ``available_accelerators``.
+
+.. code-block:: python
+
+    local_config = Config(
+        executors=[
+            HighThroughputExecutor(
+                label="htex_Local",
+                worker_debug=True,
+                available_accelerators=2,
+                provider=LocalProvider(
+                    init_blocks=1,
+                    max_blocks=1,
+                ),
+            )
+        ],
+        strategy='none',
+    )
+
+It is possible to bind multiple/specific accelerators to each worker by specifying a list of comma-separated strings,
+each specifying the accelerators for one worker. In the context of binding to NVIDIA GPUs, this works by setting ``CUDA_VISIBLE_DEVICES``
+on each worker to a specific string in the list supplied to ``available_accelerators``.
+
+Here's an example:
+
+.. code-block:: python
+
+    # The following config is trimmed for clarity
+    local_config = Config(
+        executors=[
+            HighThroughputExecutor(
+                # Starts 2 workers per node, each bound to 2 GPUs
+                available_accelerators=["0,1", "2,3"],
+
+                # Start a single worker bound to all 4 GPUs
+                # available_accelerators=["0,1,2,3"]
+            )
+        ],
+    )
+
+GPU Oversubscription
+""""""""""""""""""""
+
+For hardware that uses Nvidia devices, Parsl allows for the oversubscription of workers to GPUs. This is intended to
+make use of Nvidia's `Multi-Process Service (MPS) `_ available on many of their
+GPUs that allows users to run multiple concurrent processes on a single GPU. The user needs to include in
+``worker_init`` the commands that start MPS on every node in the block (this is machine dependent). The
+``available_accelerators`` option should then be set to the total number of GPU partitions run on a single node in the
+block. 
For example, for a node with 4 Nvidia GPUs, to create 8 workers per GPU, set ``available_accelerators=32``. +GPUs will be assigned to workers in ascending order in contiguous blocks. In the example, workers 0-7 will be placed +on GPU 0, workers 8-15 on GPU 1, workers 16-23 on GPU 2, and workers 24-31 on GPU 3. diff --git a/docs/userguide/configuring.rst b/docs/userguide/configuring.rst index 88d4456a26..1b0f2be022 100644 --- a/docs/userguide/configuring.rst +++ b/docs/userguide/configuring.rst @@ -1,658 +1,9 @@ -.. _configuration-section: +:orphan: -Configuration -============= +.. meta:: + :content http-equiv="refresh": 0;url=configuration/index.html -Parsl separates program logic from execution configuration, enabling -programs to be developed entirely independently from their execution -environment. Configuration is described by a Python object (:class:`~parsl.config.Config`) -so that developers can -introspect permissible options, validate settings, and retrieve/edit -configurations dynamically during execution. A configuration object specifies -details of the provider, executors, allocation size, -queues, durations, and data management options. - -The following example shows a basic configuration object (:class:`~parsl.config.Config`) for the Frontera -supercomputer at TACC. -This config uses the `parsl.executors.HighThroughputExecutor` to submit -tasks from a login node. It requests an allocation of -128 nodes, deploying 1 worker for each of the 56 cores per node, from the normal partition. -To limit network connections to just the internal network the config specifies the address -used by the infiniband interface with ``address_by_interface('ib0')`` - -.. 
code-block:: python - - from parsl.config import Config - from parsl.providers import SlurmProvider - from parsl.executors import HighThroughputExecutor - from parsl.launchers import SrunLauncher - from parsl.addresses import address_by_interface - - config = Config( - executors=[ - HighThroughputExecutor( - label="frontera_htex", - address=address_by_interface('ib0'), - max_workers_per_node=56, - provider=SlurmProvider( - nodes_per_block=128, - init_blocks=1, - partition='normal', - launcher=SrunLauncher(), - ), - ) - ], - ) - -.. contents:: Configuration How-To and Examples: - - -Creating and Using Config Objects ---------------------------------- - -:class:`~parsl.config.Config` objects are loaded to define the "Data Flow Kernel" (DFK) that will manage tasks. -All Parsl applications start by creating or importing a configuration then calling the load function. - -.. code-block:: python - - from parsl.configs.htex_local import config - import parsl - - with parsl.load(config): - -The ``load`` statement can happen after Apps are defined but must occur before tasks are started. -Loading the Config object within context manager like ``with`` is recommended -for implicit cleaning of DFK on exiting the context manager - -The :class:`~parsl.config.Config` object may not be used again after loaded. -Consider a configuration function if the application will shut down and re-launch the DFK. - -.. code-block:: python - - from parsl.config import Config - import parsl - - def make_config() -> Config: - return Config(...) - - with parsl.load(make_config()): - # Your workflow here - parsl.clear() # Stops Parsl - with parsl.load(make_config()): # Re-launches with a fresh configuration - # Your workflow here - - -How to Configure ----------------- - -.. note:: - All configuration examples below must be customized for the user's - allocation, Python environment, file system, etc. 
- - -The configuration specifies what, and how, resources are to be used for executing -the Parsl program and its apps. -It is important to carefully consider the needs of the Parsl program and its apps, -and the characteristics of the compute resources, to determine an ideal configuration. -Aspects to consider include: -1) where the Parsl apps will execute; -2) how many nodes will be used to execute the apps, and how long the apps will run; -3) should Parsl request multiple nodes in an individual scheduler job; and -4) where will the main Parsl program run and how will it communicate with the apps. - -Stepping through the following question should help formulate a suitable configuration object. - -1. Where should apps be executed? - -+---------------------+-----------------------------------------------+----------------------------------------+ -| Target | Executor | Provider | -+=====================+===============================================+========================================+ -| Laptop/Workstation | * `parsl.executors.HighThroughputExecutor` | `parsl.providers.LocalProvider` | -| | * `parsl.executors.ThreadPoolExecutor` | | -| | * `parsl.executors.WorkQueueExecutor` | | -| | * `parsl.executors.taskvine.TaskVineExecutor` | | -+---------------------+-----------------------------------------------+----------------------------------------+ -| Amazon Web Services | * `parsl.executors.HighThroughputExecutor` | `parsl.providers.AWSProvider` | -+---------------------+-----------------------------------------------+----------------------------------------+ -| Google Cloud | * `parsl.executors.HighThroughputExecutor` | `parsl.providers.GoogleCloudProvider` | -+---------------------+-----------------------------------------------+----------------------------------------+ -| Slurm based system | * `parsl.executors.HighThroughputExecutor` | `parsl.providers.SlurmProvider` | -| | * `parsl.executors.WorkQueueExecutor` | | -| | * 
`parsl.executors.taskvine.TaskVineExecutor` | | -+---------------------+-----------------------------------------------+----------------------------------------+ -| Torque/PBS based | * `parsl.executors.HighThroughputExecutor` | `parsl.providers.TorqueProvider` | -| system | * `parsl.executors.WorkQueueExecutor` | | -+---------------------+-----------------------------------------------+----------------------------------------+ -| GridEngine based | * `parsl.executors.HighThroughputExecutor` | `parsl.providers.GridEngineProvider` | -| system | * `parsl.executors.WorkQueueExecutor` | | -+---------------------+-----------------------------------------------+----------------------------------------+ -| Condor based | * `parsl.executors.HighThroughputExecutor` | `parsl.providers.CondorProvider` | -| cluster or grid | * `parsl.executors.WorkQueueExecutor` | | -| | * `parsl.executors.taskvine.TaskVineExecutor` | | -+---------------------+-----------------------------------------------+----------------------------------------+ -| Kubernetes cluster | * `parsl.executors.HighThroughputExecutor` | `parsl.providers.KubernetesProvider` | -+---------------------+-----------------------------------------------+----------------------------------------+ - - -2. How many nodes will be used to execute the apps? What task durations are necessary to achieve good performance? 
-
-
-+----------------------------------------------+----------------------+-------------------------------------+
-| Executor                                     | Number of Nodes [*]_ | Task duration for good performance  |
-+==============================================+======================+=====================================+
-| `parsl.executors.ThreadPoolExecutor`         | 1 (Only local)       | Any                                 |
-+----------------------------------------------+----------------------+-------------------------------------+
-| `parsl.executors.HighThroughputExecutor`     | <=2000               | Task duration(s)/#nodes >= 0.01;    |
-|                                              |                      | longer tasks needed at higher scale |
-+----------------------------------------------+----------------------+-------------------------------------+
-| `parsl.executors.WorkQueueExecutor`          | <=1000 [*]_          | 10s+                                |
-+----------------------------------------------+----------------------+-------------------------------------+
-| `parsl.executors.taskvine.TaskVineExecutor`  | <=1000 [*]_          | 10s+                                |
-+----------------------------------------------+----------------------+-------------------------------------+
-
-
-.. [*] Assuming 32 workers per node. If there are fewer workers launched
-   per node, a larger number of nodes could be supported.
-
-.. [*] The largest scale tested for the `parsl.executors.WorkQueueExecutor` is 10,000 GPU cores and
-   20,000 CPU cores.
-
-.. [*] The largest scale tested for the `parsl.executors.taskvine.TaskVineExecutor` is
-   10,000 GPU cores and 20,000 CPU cores.
-
-3. Should Parsl request multiple nodes in an individual scheduler job?
-(Here the term block is equivalent to a single scheduler job.)
- -+--------------------------------------------------------------------------------------------+ -| ``nodes_per_block = 1`` | -+---------------------+--------------------------+-------------------------------------------+ -| Provider | Executor choice | Suitable Launchers | -+=====================+==========================+===========================================+ -| Systems that don't | Any | * `parsl.launchers.SingleNodeLauncher` | -| use Aprun | | * `parsl.launchers.SimpleLauncher` | -+---------------------+--------------------------+-------------------------------------------+ -| Aprun based systems | Any | * `parsl.launchers.AprunLauncher` | -+---------------------+--------------------------+-------------------------------------------+ - -+---------------------------------------------------------------------------------------------------------------------+ -| ``nodes_per_block > 1`` | -+-------------------------------------+--------------------------+----------------------------------------------------+ -| Provider | Executor choice | Suitable Launchers | -+=====================================+==========================+====================================================+ -| `parsl.providers.TorqueProvider` | Any | * `parsl.launchers.AprunLauncher` | -| | | * `parsl.launchers.MpiExecLauncher` | -+-------------------------------------+--------------------------+----------------------------------------------------+ -| `parsl.providers.SlurmProvider` | Any | * `parsl.launchers.SrunLauncher` if native slurm | -| | | * `parsl.launchers.AprunLauncher`, otherwise | -+-------------------------------------+--------------------------+----------------------------------------------------+ - -.. 
note:: If using a Cray system, you most likely need to use the `parsl.launchers.AprunLauncher` to launch workers unless you - are on a **native Slurm** system like :ref:`configuring_nersc_cori` - - -Heterogeneous Resources ------------------------ - -In some cases, it can be difficult to specify the resource requirements for running a workflow. -For example, if the compute nodes a site provides are not uniform, there is no "correct" resource configuration; -the amount of parallelism depends on which node (large or small) each job runs on. -In addition, the software and filesystem setup can vary from node to node. -A Condor cluster may not provide shared filesystem access at all, -and may include nodes with a variety of Python versions and available libraries. - -The `parsl.executors.WorkQueueExecutor` provides several features to work with heterogeneous resources. -By default, Parsl only runs one app at a time on each worker node. -However, it is possible to specify the requirements for a particular app, -and Work Queue will automatically run as many parallel instances as possible on each node. -Work Queue automatically detects the amount of cores, memory, and other resources available on each execution node. -To activate this feature, add a resource specification to your apps. A resource specification is a dictionary with -the following three keys: ``cores`` (an integer corresponding to the number of cores required by the task), -``memory`` (an integer corresponding to the task's memory requirement in MB), and ``disk`` (an integer corresponding to -the task's disk requirement in MB), passed to an app via the special keyword argument ``parsl_resource_specification``. The specification can be set for all app invocations via a default, for example: - - .. code-block:: python - - @python_app - def compute(x, parsl_resource_specification={'cores': 1, 'memory': 1000, 'disk': 1000}): - return x*2 - - -or updated when the app is invoked: - - .. 
code-block:: python - - spec = {'cores': 1, 'memory': 500, 'disk': 500} - future = compute(x, parsl_resource_specification=spec) - -This ``parsl_resource_specification`` special keyword argument will inform Work Queue about the resources this app requires. -When placing instances of ``compute(x)``, Work Queue will run as many parallel instances as possible based on each worker node's available resources. - -If an app's resource requirements are not known in advance, -Work Queue has an auto-labeling feature that measures the actual resource usage of your apps and automatically chooses resource labels for you. -With auto-labeling, it is not necessary to provide ``parsl_resource_specification``; -Work Queue collects stats in the background and updates resource labels as your workflow runs. -To activate this feature, add the following flags to your executor config: - - .. code-block:: python - - config = Config( - executors=[ - WorkQueueExecutor( - # ...other options go here - autolabel=True, - autocategory=True - ) - ] - ) - -The ``autolabel`` flag tells Work Queue to automatically generate resource labels. -By default, these labels are shared across all apps in your workflow. -The ``autocategory`` flag puts each app into a different category, -so that Work Queue will choose separate resource requirements for each app. -This is important if e.g. some of your apps use a single core and some apps require multiple cores. -Unless you know that all apps have uniform resource requirements, -you should turn on ``autocategory`` when using ``autolabel``. - -The Work Queue executor can also help deal with sites that have non-uniform software environments across nodes. -Parsl assumes that the Parsl program and the compute nodes all use the same Python version. -In addition, any packages your apps import must be available on compute nodes. -If no shared filesystem is available or if node configuration varies, -this can lead to difficult-to-trace execution problems. 
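A lightweight way to catch such mismatches early is to run a small diagnostic function on the workers before the real workflow starts. The sketch below uses only the standard library; the ``env_report`` name and the idea of wrapping it with ``@python_app`` are illustrative assumptions, not part of Parsl's API:

```python
import importlib.util
import sys

def env_report(required=("numpy", "pandas")):
    """Return the interpreter version and any packages that cannot be imported.

    Illustrative helper: wrapped with Parsl's @python_app decorator, the same
    check would run on a worker node, and comparing its result against a local
    call reveals version or package mismatches.
    """
    return {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "missing": [pkg for pkg in required if importlib.util.find_spec(pkg) is None],
    }

# Run locally here; as a @python_app this would execute on a worker instead.
print(env_report(required=("json", "not_a_real_package")))
```

Comparing the dictionary returned by a worker against the one produced locally narrows down which side of the stack is misconfigured.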
-
-If your Parsl program is running in a Conda environment,
-the Work Queue executor can automatically scan the imports in your apps,
-create a self-contained software package,
-transfer the software package to worker nodes,
-and run your code inside the packaged and uniform environment.
-First, make sure that the Conda environment is active and you have the required packages installed (via either ``pip`` or ``conda``):
-
-- ``python``
-- ``parsl``
-- ``ndcctools``
-- ``conda-pack``
-
-Then add the following to your config:
-
-.. code-block:: python
-
-    config = Config(
-        executors=[
-            WorkQueueExecutor(
-                # ...other options go here
-                pack=True
-            )
-        ]
-    )
-
-.. note::
-   There will be a noticeable delay the first time Work Queue sees an app;
-   it is creating and packaging a complete Python environment.
-   This packaged environment is cached, so subsequent app invocations should be much faster.
-
-Using this approach, it is possible to run Parsl applications on nodes that don't have Python available at all.
-The packaged environment includes a Python interpreter,
-and Work Queue does not require Python to run.
-
-.. note::
-   The automatic packaging feature only supports packages installed via ``pip`` or ``conda``.
-   Importing from other locations (e.g. via ``$PYTHONPATH``) or importing other modules in the same directory is not supported.
-
-
-Accelerators
-------------
-
-Many modern clusters provide multiple accelerators per compute node, yet many applications are best suited to using a
-single accelerator per task. Parsl supports pinning each worker to a different accelerator using the
-``available_accelerators`` option of the :class:`~parsl.executors.HighThroughputExecutor`. Provide either the number of
-accelerators (Parsl will assume they are named with integers starting from zero) or a list of the names of the accelerators
-available on the node.
Parsl will limit the number of workers it launches to the number of accelerators specified;
-in other words, you cannot have more workers per node than there are accelerators. By default, Parsl will launch
-as many workers as there are accelerators specified via ``available_accelerators``.
-
-.. code-block:: python
-
-    local_config = Config(
-        executors=[
-            HighThroughputExecutor(
-                label="htex_Local",
-                worker_debug=True,
-                available_accelerators=2,
-                provider=LocalProvider(
-                    init_blocks=1,
-                    max_blocks=1,
-                ),
-            )
-        ],
-        strategy='none',
-    )
-
-It is possible to bind multiple or specific accelerators to each worker by supplying a list of comma-separated strings,
-each naming the accelerators for one worker. In the context of binding to NVIDIA GPUs, this works by setting ``CUDA_VISIBLE_DEVICES``
-on each worker to a specific string in the list supplied to ``available_accelerators``.
-
-Here's an example:
-
-.. code-block:: python
-
-    # The following config is trimmed for clarity
-    local_config = Config(
-        executors=[
-            HighThroughputExecutor(
-                # Starts 2 workers per node, each bound to 2 GPUs
-                available_accelerators=["0,1", "2,3"],
-
-                # Start a single worker bound to all 4 GPUs
-                # available_accelerators=["0,1,2,3"]
-            )
-        ],
-    )
-
-GPU Oversubscription
-""""""""""""""""""""
-
-For hardware that uses Nvidia devices, Parsl allows for the oversubscription of workers to GPUs. This is intended to
-make use of Nvidia's `Multi-Process Service (MPS) `_, available on many of their
-GPUs, which allows users to run multiple concurrent processes on a single GPU. The user needs to include, in the
-``worker_init`` commands, the machine-dependent commands that start MPS on every node in the block. The
-``available_accelerators`` option should then be set to the total number of GPU partitions run on a single node in the
-block. For example, for a node with 4 Nvidia GPUs, to create 8 workers per GPU, set ``available_accelerators=32``.
-GPUs will be assigned to workers in ascending order in contiguous blocks. 
In the example, workers 0-7 will be placed
-on GPU 0, workers 8-15 on GPU 1, workers 16-23 on GPU 2, and workers 24-31 on GPU 3.
-
-Multi-Threaded Applications
----------------------------
-
-Workflows that launch multiple workers on a single node to perform multi-threaded tasks (e.g., NumPy or TensorFlow operations) may run into thread contention issues.
-Each worker may try to use the same hardware threads, which leads to performance penalties.
-Use the ``cpu_affinity`` feature of the :class:`~parsl.executors.HighThroughputExecutor` to assign workers to specific threads. Users can pin threads to
-workers either with a strategy method or an explicit list.
-
-The strategy methods automatically assign all detected hardware threads to workers.
-Allowed strategies that can be assigned to ``cpu_affinity`` are ``block``, ``block-reverse``, and ``alternating``.
-The ``block`` method pins threads to workers in sequential order (ex: 4 threads are grouped (0, 1) and (2, 3) on two workers);
-``block-reverse`` pins threads in reverse sequential order (ex: (3, 2) and (1, 0)); and ``alternating`` alternates threads among workers (ex: (0, 2) and (1, 3)).
-
-Select the strategy best suited to the processor's cache hierarchy (choose ``alternating`` if in doubt) to ensure workers do not compete for cores.
-
-.. code-block:: python
-
-    local_config = Config(
-        executors=[
-            HighThroughputExecutor(
-                label="htex_Local",
-                worker_debug=True,
-                cpu_affinity='alternating',
-                provider=LocalProvider(
-                    init_blocks=1,
-                    max_blocks=1,
-                ),
-            )
-        ],
-        strategy='none',
-    )
-
-Users can also use ``cpu_affinity`` to explicitly assign threads to workers with a string of the format
-``cpu_affinity="list:<worker1_threads>:<worker2_threads>:..."``.
-
-Each worker's threads can be specified as a comma-separated list or a hyphenated range:
-``thread1,thread2,thread3``
-or
-``thread_start-thread_end``.
-
-An example for 12 workers on a node with 208 threads is:
-
-.. 
code-block:: python
-
-    cpu_affinity="list:0-7,104-111:8-15,112-119:16-23,120-127:24-31,128-135:32-39,136-143:40-47,144-151:52-59,156-163:60-67,164-171:68-75,172-179:76-83,180-187:84-91,188-195:92-99,196-203"
-
-This example assigns 16 threads each to 12 workers. Note that in this example there are threads that are skipped.
-If a thread is not explicitly assigned to a worker, it will be left idle.
-The number of thread "ranks" (colon-separated thread lists/ranges) must match the total number of workers on the node; otherwise an exception will be raised.
-
-
-Thread affinity is accomplished in two ways.
-Each worker first sets the affinity for the Python process using `the affinity mask `_,
-which may not be available on all operating systems.
-It then sets environment variables to control
-`OpenMP thread affinity `_
-so that any subprocesses launched by a worker that use OpenMP know which processors are valid.
-These include ``OMP_NUM_THREADS``, ``GOMP_CPU_AFFINITY``, and ``KMP_AFFINITY``.
-
-Ad-Hoc Clusters
----------------
-
-Parsl's support of ad-hoc clusters of compute nodes without a scheduler
-is deprecated.
-
-See
-`issue #3515 `_
-for further discussion.
-
-Amazon Web Services
--------------------
-
-.. image:: ./aws_image.png
-
-.. note::
-   To use AWS with Parsl, install Parsl with AWS dependencies via ``python3 -m pip install 'parsl[aws]'``
-
-Amazon Web Services is a commercial cloud service which allows users to rent a range of computers and other computing services.
-The following snippet shows how Parsl can be configured to provision nodes from the Elastic Compute Cloud (EC2) service.
-The first time this configuration is used, Parsl will configure a Virtual Private Cloud and other networking and security infrastructure that will be
-re-used in subsequent executions. The configuration uses the `parsl.providers.AWSProvider` to connect to AWS.
-
-.. literalinclude:: ../../parsl/configs/ec2.py
-
-
-ASPIRE 1 (NSCC)
----------------
-
-.. 
image:: https://www.nscc.sg/wp-content/uploads/2017/04/ASPIRE1Img.png
-
-The following snippet shows an example configuration for accessing NSCC's **ASPIRE 1** supercomputer. This example uses the `parsl.executors.HighThroughputExecutor` and connects to ASPIRE1's PBSPro scheduler. It also shows how the ``scheduler_options`` parameter can be used for scheduling array jobs in PBSPro.
-
-.. literalinclude:: ../../parsl/configs/ASPIRE1.py
-
-
-
-
-Illinois Campus Cluster (UIUC)
-------------------------------
-
-.. image:: https://campuscluster.illinois.edu/wp-content/uploads/2018/02/ND2_3633-sm.jpg
-
-The following snippet shows an example configuration for executing on the Illinois Campus Cluster.
-The configuration assumes the user is running on a login node; it uses the `parsl.providers.SlurmProvider` to interface
-with the scheduler and the `parsl.launchers.SrunLauncher` to launch workers.
-
-.. literalinclude:: ../../parsl/configs/illinoiscluster.py
-
-Bridges (PSC)
--------------
-
-.. image:: https://insidehpc.com/wp-content/uploads/2016/08/Bridges_FB1b.jpg
-
-The following snippet shows an example configuration for executing on the Bridges supercomputer at the Pittsburgh Supercomputing Center.
-The configuration assumes the user is running on a login node; it uses the `parsl.providers.SlurmProvider` to interface
-with the scheduler and the `parsl.launchers.SrunLauncher` to launch workers.
-
-.. literalinclude:: ../../parsl/configs/bridges.py
-
-
-
-CC-IN2P3
+Redirect
 --------
-.. image:: https://cc.in2p3.fr/wp-content/uploads/2017/03/bandeau_accueil.jpg
-
-The snippet below shows an example configuration for executing from a login node on IN2P3's Computing Centre.
-The configuration uses the `parsl.providers.LocalProvider` to run on a login node, primarily to avoid GSISSH, which Parsl does not support.
-This system uses Grid Engine, which Parsl interfaces with using the `parsl.providers.GridEngineProvider`.
-
-.. 
literalinclude:: ../../parsl/configs/cc_in2p3.py
-
-
-CCL (Notre Dame, TaskVine)
---------------------------
-
-.. image:: https://ccl.cse.nd.edu/software/taskvine/taskvine-logo.png
-
-To utilize TaskVine with Parsl, please install the full CCTools software package within an appropriate Anaconda or Miniconda environment
-(instructions for installing Miniconda can be found `in the Conda install guide `_):
-
-.. code-block:: bash
-
-   $ conda create -y --name <environment-name> python=<version> conda-pack
-   $ conda activate <environment-name>
-   $ conda install -y -c conda-forge ndcctools parsl
-
-This creates a Conda environment on your machine with all the necessary tools and setup needed to utilize TaskVine with the Parsl library.
-
-The following snippet shows an example configuration for using the Parsl/TaskVine executor to run applications on the local machine.
-This example uses the `parsl.executors.taskvine.TaskVineExecutor` to schedule tasks, and a local worker will be started automatically.
-For more information on using TaskVine, including configurations for remote execution, visit the
-`TaskVine/Parsl documentation online `_.
-
-.. literalinclude:: ../../parsl/configs/vineex_local.py
-
-TaskVine's predecessor, Work Queue, may continue to be used with Parsl.
-For more information on using Work Queue, visit the `CCTools documentation online `_.
-
-Expanse (SDSC)
---------------
-
-.. image:: https://www.hpcwire.com/wp-content/uploads/2019/07/SDSC-Expanse-graphic-cropped.jpg
-
-The following snippet shows an example configuration for executing remotely on San Diego Supercomputer
-Center's **Expanse** supercomputer. The example is designed to be executed on the login nodes, using the
-`parsl.providers.SlurmProvider` to interface with the Slurm scheduler used by Expanse and the `parsl.launchers.SrunLauncher` to launch workers.
-
-.. literalinclude:: ../../parsl/configs/expanse.py
-
-
-Improv (Argonne LCRC)
----------------------
-
-.. 
image:: https://www.lcrc.anl.gov/sites/default/files/styles/965_wide/public/2023-12/20231214_114057.jpg?itok=A-Rz5pP9
-
-**Improv** is a PBS Pro based supercomputer at Argonne's Laboratory Computing Resource
-Center (LCRC). The following snippet is an example configuration that uses `parsl.providers.PBSProProvider`
-and `parsl.launchers.MpiRunLauncher` to run multi-node jobs.
-
-.. literalinclude:: ../../parsl/configs/improv.py
-
-
-.. _configuring_nersc_cori:
-
-Perlmutter (NERSC)
-------------------
-
-NERSC provides documentation on `how to use Parsl on Perlmutter `_.
-Perlmutter is a Slurm based HPC system, and Parsl uses `parsl.providers.SlurmProvider` with `parsl.launchers.SrunLauncher`
-to launch tasks onto this machine.
-
-
-Frontera (TACC)
----------------
-
-.. image:: https://frontera-portal.tacc.utexas.edu/media/filer_public/2c/fb/2cfbf6ab-818d-42c8-b4d5-9b39eb9d0a05/frontera-banner-home.jpg
-
-Deployed in June 2019, Frontera is the 5th most powerful supercomputer in the world. Frontera replaces the NSF Blue Waters system at NCSA
-and is the first deployment in the National Science Foundation's petascale computing program. The configuration below assumes that the user is
-running on a login node; it uses the `parsl.providers.SlurmProvider` to interface with the scheduler and the `parsl.launchers.SrunLauncher` to launch workers.
-
-.. literalinclude:: ../../parsl/configs/frontera.py
-
-
-Kubernetes Clusters
--------------------
-
-.. image:: https://d1.awsstatic.com/PAC/kuberneteslogo.eabc6359f48c8e30b7a138c18177f3fd39338e05.png
-
-Kubernetes is an open-source system for container management, such as automating deployment and scaling of containers.
-The snippet below shows an example configuration for deploying pods as workers on a Kubernetes cluster.
-The KubernetesProvider uses the Python Kubernetes API, which assumes that you have a kube config in ``~/.kube/config``.
-
-.. 
literalinclude:: ../../parsl/configs/kubernetes.py
-
-
-Midway (RCC, UChicago)
-----------------------
-
-.. image:: https://rcc.uchicago.edu/sites/rcc.uchicago.edu/files/styles/slideshow-image/public/uploads/images/slideshows/20140430_RCC_8978.jpg?itok=BmRuJ-wq
-
-The Midway cluster is a campus cluster hosted by the Research Computing Center at the University of Chicago.
-The snippet below shows an example configuration for executing remotely on Midway.
-The configuration assumes the user is running on a login node; it uses the `parsl.providers.SlurmProvider` to interface
-with the scheduler and the `parsl.launchers.SrunLauncher` to launch workers.
-
-.. literalinclude:: ../../parsl/configs/midway.py
-
-
-Open Science Grid
------------------
-
-.. image:: https://www.renci.org/wp-content/uploads/2008/10/osg_logo.png
-
-The Open Science Grid (OSG) is a national, distributed computing Grid spanning over 100 individual sites to provide tens of thousands of CPU cores.
-The snippet below shows an example configuration for executing remotely on OSG. You will need to have a valid project name on the OSG.
-The configuration uses the `parsl.providers.CondorProvider` to interface with the scheduler.
-
-.. literalinclude:: ../../parsl/configs/osg.py
-
-
-Polaris (ALCF)
---------------
-
-.. image:: https://www.alcf.anl.gov/sites/default/files/styles/965x543/public/2022-07/33181D_086_ALCF%20Polaris%20Crop.jpg?itok=HVAHsZtt
-   :width: 75%
-
-ALCF provides documentation on `how to use Parsl on Polaris `_.
-Polaris uses `parsl.providers.PBSProProvider` and `parsl.launchers.MpiExecLauncher` to launch tasks onto the HPC system.
-
-
-
-Stampede2 (TACC)
-----------------
-
-.. image:: https://www.tacc.utexas.edu/documents/1084364/1413880/stampede2-0717.jpg/
-
-The following snippet shows an example configuration for accessing TACC's **Stampede2** supercomputer. This example uses the `parsl.executors.HighThroughputExecutor` and connects to Stampede2's Slurm scheduler.
-
-.. 
literalinclude:: ../../parsl/configs/stampede2.py - - -Summit (ORNL) -------------- - -.. image:: https://www.olcf.ornl.gov/wp-content/uploads/2018/06/Summit_Exaop-1500x844.jpg - -The following snippet shows an example configuration for executing from the login node on Summit, the leadership class supercomputer hosted at the Oak Ridge National Laboratory. -The example uses the `parsl.providers.LSFProvider` to provision compute nodes from the LSF cluster scheduler and the `parsl.launchers.JsrunLauncher` to launch workers across the compute nodes. - -.. literalinclude:: ../../parsl/configs/summit.py - - -TOSS3 (LLNL) ------------- - -.. image:: https://hpc.llnl.gov/sites/default/files/Magma--2020-LLNL.jpg - -The following snippet shows an example configuration for executing on one of LLNL's **TOSS3** -machines, such as Quartz, Ruby, Topaz, Jade, or Magma. This example uses the `parsl.executors.FluxExecutor` -and connects to Slurm using the `parsl.providers.SlurmProvider`. This configuration assumes that the script -is being executed on the login nodes of one of the machines. - -.. literalinclude:: ../../parsl/configs/toss3_llnl.py - - -Further help ------------- - -For help constructing a configuration, you can click on class names such as :class:`~parsl.config.Config` or :class:`~parsl.executors.HighThroughputExecutor` to see the associated class documentation. The same documentation can be accessed interactively at the python command line via, for example: - -.. code-block:: python - - from parsl.config import Config - help(Config) +This page has been `moved `_ diff --git a/docs/userguide/data.rst b/docs/userguide/data.rst index 9350a6d96f..4626f0ed38 100644 --- a/docs/userguide/data.rst +++ b/docs/userguide/data.rst @@ -1,445 +1,9 @@ -.. _label-data: +:orphan: -Passing Python objects -====================== +.. 
meta:: + :content http-equiv="refresh": 0;url=configuration/data.html -Parsl apps can communicate via standard Python function parameter passing -and return statements. The following example shows how a Python string -can be passed to, and returned from, a Parsl app. +Redirect +-------- -.. code-block:: python - - @python_app - def example(name): - return 'hello {0}'.format(name) - - r = example('bob') - print(r.result()) - -Parsl uses the dill and pickle libraries to serialize Python objects -into a sequence of bytes that can be passed over a network from the submitting -machine to executing workers. - -Thus, Parsl apps can receive and return standard Python data types -such as booleans, integers, tuples, lists, and dictionaries. However, not -all objects can be serialized with these methods (e.g., closures, generators, -and system objects), and so those objects cannot be used with all executors. - -Parsl will raise a `SerializationError` if it encounters an object that it cannot -serialize. This applies to objects passed as arguments to an app, as well as objects -returned from an app. See :ref:`label_serialization_error`. - - -Staging data files -================== - -Parsl apps can take and return data files. A file may be passed as an input -argument to an app, or returned from an app after execution. Parsl -provides support to automatically transfer (stage) files between -the main Parsl program, worker nodes, and external data storage systems. - -Input files can be passed as regular arguments, or a list of them may be -specified in the special ``inputs`` keyword argument to an app invocation. - -Inside an app, the ``filepath`` attribute of a `File` can be read to determine -where on the execution-side file system the input file has been placed. - -Output `File` objects must also be passed in at app invocation, through the -outputs parameter. In this case, the `File` object specifies where Parsl -should place output after execution. 
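The serialization rules noted earlier can be seen without Parsl at all: standard data types round-trip through the ``pickle`` module, while objects such as generators do not (Parsl raises a `SerializationError` in the analogous situation). A minimal stdlib sketch:

```python
import pickle

# Typical app arguments and results serialize cleanly.
payload = {"name": "bob", "values": [1, 2, 3], "flags": (True, False)}
restored = pickle.loads(pickle.dumps(payload))
assert restored == payload

# Generators (like closures and system objects) cannot be pickled;
# passing one to or returning one from a Parsl app fails similarly.
gen = (x * x for x in range(3))
try:
    pickle.dumps(gen)
except TypeError as err:
    print("cannot serialize:", err)
```

Running this kind of check on a value before handing it to an app is a quick way to predict whether Parsl will be able to ship it to a worker.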
- -Inside an app, the ``filepath`` attribute of an output -`File` provides the path at which the corresponding output file should be -placed so that Parsl can find it after execution. - -If the output from an app is to be used as the input to a subsequent app, -then a `DataFuture` that represents whether the output file has been created -must be extracted from the first app's AppFuture, and that must be passed -to the second app. This causes app -executions to be properly ordered, in the same way that passing AppFutures -to subsequent apps causes execution ordering based on an app returning. - -In a Parsl program, file handling is split into two pieces: files are named in an -execution-location independent manner using :py:class:`~parsl.data_provider.files.File` -objects, and executors are configured to stage those files in to and out of -execution locations using instances of the :py:class:`~parsl.data_provider.staging.Staging` -interface. - - -Parsl files ------------ - -Parsl uses a custom :py:class:`~parsl.data_provider.files.File` to provide a -location-independent way of referencing and accessing files. -Parsl files are defined by specifying the URL *scheme* and a path to the file. -Thus a file may represent an absolute path on the submit-side file system -or a URL to an external file. - -The scheme defines the protocol via which the file may be accessed. -Parsl supports the following schemes: file, ftp, http, https, and globus. -If no scheme is specified Parsl will default to the file scheme. - -The following example shows creation of two files with different -schemes: a locally-accessible data.txt file and an HTTPS-accessible -README file. - -.. code-block:: python - - File('file://home/parsl/data.txt') - File('https://github.com/Parsl/parsl/blob/master/README.rst') - - -Parsl automatically translates the file's location relative to the -environment in which it is accessed (e.g., the Parsl program or an app). 
-The following example shows how a file can be accessed in the app
-irrespective of where that app executes.
-
-.. code-block:: python
-
-    @python_app
-    def print_file(inputs=()):
-        with open(inputs[0].filepath, 'r') as inp:
-            content = inp.read()
-        return content
-
-    # create a remote Parsl file
-    f = File('https://github.com/Parsl/parsl/blob/master/README.rst')
-
-    # call the print_file app with the Parsl file
-    r = print_file(inputs=[f])
-    r.result()
-
-As described below, the method by which these files are transferred
-depends on the scheme and the staging providers specified in the Parsl
-configuration.
-
-Staging providers
------------------
-
-Parsl is able to transparently stage files between at-rest locations and
-execution locations by specifying a list of
-:py:class:`~parsl.data_provider.staging.Staging` instances for an executor.
-These staging instances define how to transfer files in and out of an execution
-location. This list should be supplied as the ``storage_access``
-parameter to an executor when it is constructed.
-
-Parsl includes several staging providers for moving files using the
-schemes defined above. By default, Parsl executors are created with
-three common staging providers:
-the NoOpFileStaging provider for local and shared file systems
-and the HTTP(S) and FTP staging providers for transferring
-files to and from remote storage locations. The following
-example shows how to explicitly set the default staging providers.
-
-.. 
code-block:: python
-
-    from parsl.config import Config
-    from parsl.executors import HighThroughputExecutor
-    from parsl.data_provider.data_manager import default_staging
-
-    config = Config(
-        executors=[
-            HighThroughputExecutor(
-                storage_access=default_staging,
-                # equivalent to the following
-                # storage_access=[NoOpFileStaging(), FTPSeparateTaskStaging(), HTTPSeparateTaskStaging()],
-            )
-        ]
-    )
-
-
-Parsl further differentiates when staging occurs relative to
-the app invocation that requires or produces files.
-Staging either occurs with the executing task (*in-task staging*)
-or as a separate task (*separate task staging*) before app execution.
-In-task staging
-uses a wrapper that is executed around the Parsl task and thus
-occurs on the resource on which the task is executed. Separate
-task staging inserts a new Parsl task in the graph and associates
-a dependency between the staging task and the task that depends
-on that file. Separate task staging may occur on either the submit side
-(e.g., when using Globus) or on the execution side (e.g., HTTPS, FTP).
-
-
-NoOpFileStaging for Local/Shared File Systems
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The NoOpFileStaging provider assumes that files specified either
-with a path or with the ``file`` URL scheme are available both
-on the submit and execution side. This occurs, for example, when there is a
-shared file system. In this case, files will not be moved, and the
-File object simply presents the same file path to the Parsl program
-and any executing tasks.
-
-Files defined as follows will be handled by the NoOpFileStaging provider.
-
-.. code-block:: python
-
-    File('file://home/parsl/data.txt')
-    File('/home/parsl/data.txt')
-
-
-The NoOpFileStaging provider is enabled by default on all
-executors. It can be explicitly set as the only
-staging provider as follows.
-
-.. 
code-block:: python
-
-    from parsl.config import Config
-    from parsl.executors import HighThroughputExecutor
-    from parsl.data_provider.file_noop import NoOpFileStaging
-
-    config = Config(
-        executors=[
-            HighThroughputExecutor(
-                storage_access=[NoOpFileStaging()]
-            )
-        ]
-    )
-
-
-FTP, HTTP, HTTPS: separate task staging
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Files named with the ``ftp``, ``http`` or ``https`` URL scheme will be
-staged in using HTTP GET or anonymous FTP commands. These commands
-will be executed as a separate
-Parsl task that will complete before the corresponding app
-executes. These providers cannot be used to stage out output files.
-
-The following example defines a file accessible on a remote FTP server.
-
-.. code-block:: python
-
-    File('ftp://www.iana.org/pub/mirror/rirstats/arin/ARIN-STATS-FORMAT-CHANGE.txt')
-
-When such a file object is passed as an input to an app, Parsl will download the file to whatever location is selected for the app to execute.
-The following example illustrates how the remote file is implicitly downloaded from an FTP server and then converted. Note that the app does not need to know the location of the downloaded file on the remote computer, as Parsl abstracts this translation.
-
-.. code-block:: python
-
-    @python_app
-    def convert(inputs=(), outputs=()):
-        with open(inputs[0].filepath, 'r') as inp:
-            content = inp.read()
-        with open(outputs[0].filepath, 'w') as out:
-            out.write(content.upper())
-
-    # create a remote Parsl file
-    inp = File('ftp://www.iana.org/pub/mirror/rirstats/arin/ARIN-STATS-FORMAT-CHANGE.txt')
-
-    # create a local Parsl file
-    out = File('file:///tmp/ARIN-STATS-FORMAT-CHANGE.txt')
-
-    # call the convert app with the Parsl file
-    f = convert(inputs=[inp], outputs=[out])
-    f.result()
-
-HTTP and FTP separate task staging providers can be configured as follows.
-
-.. 
code-block:: python
-
-    from parsl.config import Config
-    from parsl.executors import HighThroughputExecutor
-    from parsl.data_provider.http import HTTPSeparateTaskStaging
-    from parsl.data_provider.ftp import FTPSeparateTaskStaging
-
-    config = Config(
-        executors=[
-            HighThroughputExecutor(
-                storage_access=[HTTPSeparateTaskStaging(), FTPSeparateTaskStaging()]
-            )
-        ]
-    )
-
-FTP, HTTP, HTTPS: in-task staging
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-These staging providers are intended for use on executors that do not have
-a file system shared between worker nodes.
-
-These providers will use the same HTTP GET/anonymous FTP as the separate
-task staging providers described above, but will do so in a wrapper around
-individual app invocations, which guarantees that they will stage files to
-a file system visible to the app.
-
-A downside of this staging approach is that the staging tasks are less visible
-to Parsl, as they are not performed as separate Parsl tasks.
-
-In-task staging providers can be configured as follows.
-
-.. code-block:: python
-
-    from parsl.config import Config
-    from parsl.executors import HighThroughputExecutor
-    from parsl.data_provider.http import HTTPInTaskStaging
-    from parsl.data_provider.ftp import FTPInTaskStaging
-
-    config = Config(
-        executors=[
-            HighThroughputExecutor(
-                storage_access=[HTTPInTaskStaging(), FTPInTaskStaging()]
-            )
-        ]
-    )
-
-
-Globus
-^^^^^^
-
-The ``Globus`` staging provider is used to transfer files that can be accessed
-using Globus. A guide to using Globus is available `here
-`_.
-
-A file using the Globus scheme must specify the UUID of the Globus
-endpoint and a path to the file on the endpoint, for example:
-
-.. code-block:: python
-
-    File('globus://037f054a-15cf-11e8-b611-0ac6873fc732/unsorted.txt')
-
-Note: a Globus endpoint's UUID can be found in the Globus `Manage Endpoints `_ page.
-
-There must also be a Globus endpoint available with access to an
-execute-side file system, because Globus file transfers happen
-between two Globus endpoints.
-
-Globus Configuration
-""""""""""""""""""""
-
-In order to manage where files are staged, users must configure the default ``working_dir`` on a remote location. This information is specified in the :class:`~parsl.executors.base.ParslExecutor` via the ``working_dir`` parameter in the :class:`~parsl.config.Config` instance. For example:
-
-.. code-block:: python
-
-    from parsl.config import Config
-    from parsl.executors import HighThroughputExecutor
-
-    config = Config(
-        executors=[
-            HighThroughputExecutor(
-                working_dir="/home/user/data"
-            )
-        ]
-    )
-
-Parsl requires knowledge of the Globus endpoint that is associated with an executor. This is done by specifying the ``endpoint_uuid`` (the UUID of the Globus endpoint that is associated with the system) in the configuration.
-
-In some cases, for example when using a Globus `shared endpoint `_ or when a Globus endpoint is mounted on a supercomputer, the path seen by Globus is not the same as the local path seen by Parsl. In this case the configuration may optionally specify a mapping between the ``endpoint_path`` (the common root path seen in Globus), and the ``local_path`` (the common root path on the local file system), as in the following. In most cases, ``endpoint_path`` and ``local_path`` are the same and do not need to be specified.
-
-.. 
code-block:: python - - from parsl.config import Config - from parsl.executors import HighThroughputExecutor - from parsl.data_provider.globus import GlobusStaging - from parsl.data_provider.data_manager import default_staging - - config = Config( - executors=[ - HighThroughputExecutor( - working_dir="/home/user/parsl_script", - storage_access=default_staging + [GlobusStaging( - endpoint_uuid="7d2dc622-2edb-11e8-b8be-0ac6873fc732", - endpoint_path="/", - local_path="/home/user" - )] - ) - ] - ) - - -Globus Authorization -"""""""""""""""""""" - -In order to transfer files with Globus, the user must first authenticate. -The first time that Globus is used with Parsl on a computer, the program -will prompt the user to follow an authentication and authorization -procedure involving a web browser. Users can authorize out of band by -running the parsl-globus-auth utility. This is useful, for example, -when running a Parsl program in a batch system where it will be unattended. - -.. code-block:: bash - - $ parsl-globus-auth - Parsl Globus command-line authorizer - If authorization to Globus is necessary, the library will prompt you now. - Otherwise it will do nothing - Authorization complete - -rsync -^^^^^ - -The ``rsync`` utility can be used to transfer files in the ``file`` scheme in configurations where -workers cannot access the submit-side file system directly, such as when executing -on an AWS EC2 instance or on a cluster without a shared file system. -However, the submit-side file system must be exposed using rsync. - -rsync Configuration -""""""""""""""""""" - -``rsync`` must be installed on both the submit and worker side. It can usually be installed -by using the operating system package manager: for example, by ``apt-get install rsync``. - -An `RSyncStaging` option must then be added to the Parsl configuration file, as in the following. 
-The parameter to RSyncStaging should describe the prefix to be passed to each rsync
-command to connect from workers to the submit-side host. This will often be the username
-and public IP address of the submitting system.
-
-.. code-block:: python
-
-    from parsl.data_provider.rsync import RSyncStaging
-
-    config = Config(
-        executors=[
-            HighThroughputExecutor(
-                storage_access=[HTTPInTaskStaging(), FTPInTaskStaging(), RSyncStaging("benc@" + public_ip)],
-                ...
-            )
-        ]
-    )
-
-rsync Authorization
-"""""""""""""""""""
-
-The rsync staging provider delegates all authentication and authorization to the
-underlying ``rsync`` command. This command must be correctly authorized to connect back to
-the submit-side system. The form of this authorization will depend on the systems in
-question.
-
-The following example installs an ssh key from the submit-side file system and turns off host key
-checking, in the ``worker_init`` initialization of an EC2 instance. The ssh key must have
-sufficient privileges to run ``rsync`` over ssh on the submit-side system.
-
-.. code-block:: python
-
-    with open("rsync-callback-ssh", "r") as f:
-        private_key = f.read()
-
-    ssh_init = """
-    mkdir .ssh
-    chmod go-rwx .ssh
-
-    cat > .ssh/id_rsa < .ssh/config <`_
diff --git a/docs/userguide/exceptions.rst b/docs/userguide/exceptions.rst
index d18fbe704d..71af9d52d7 100644
--- a/docs/userguide/exceptions.rst
+++ b/docs/userguide/exceptions.rst
@@ -1,171 +1,9 @@
-.. _label-exceptions:
+:orphan:
-
-Error handling
-==============
+.. meta::
+    :content http-equiv="refresh": 0;url=workflows/exceptions.html
-
-Parsl provides various mechanisms to add resiliency and robustness to programs.
+Redirect
+--------
-
-Exceptions
-----------
-
-Parsl is designed to capture, track, and handle various errors occurring
-during execution, including those related to the program, apps, execution
-environment, and Parsl itself.
-It also provides functionality to appropriately respond to failures during
-execution.
-
-Failures might occur for various reasons:
-
-1. A task failed during execution.
-2. A task failed to launch, for example, because an input dependency was not met.
-3. There was a formatting error while formatting the command-line string in Bash apps.
-4. A task completed execution but failed to produce one or more of its specified
-   outputs.
-5. A task exceeded the specified walltime.
-
-Since Parsl tasks are executed asynchronously and remotely, it can be difficult to determine
-when errors have occurred and to appropriately handle them in a Parsl program.
-
-For errors occurring in Python code, Parsl captures Python exceptions and returns
-them to the main Parsl program. For non-Python errors, for example when a node
-or worker fails, Parsl imposes a timeout, and considers a task to have failed
-if it has not heard from the task by that timeout. Parsl also considers a task to have failed
-if it does not meet the contract stated by the user during invocation, such as failing
-to produce the stated output files.
-
-Parsl communicates these errors by associating Python exceptions with task futures.
-These exceptions are raised only when ``result()`` is called on the future
-of a failed task. For example:
-
-.. code-block:: python
-
-    @python_app
-    def bad_divide(x):
-        return 6 / x
-
-    # Call bad divide with 0, to cause a divide by zero exception
-    doubled_x = bad_divide(0)
-
-    # Catch and handle the exception.
-    try:
-        doubled_x.result()
-    except ZeroDivisionError as e:
-        print('Oops! You tried to divide by 0.')
-    except Exception as e:
-        print('Oops! Something really bad happened.')
-
-
-Retries
--------
-
-Often errors in distributed/parallel environments are transient.
-In these cases, retrying failed tasks can be a simple way
-of overcoming transient (e.g., machine failure,
-network failure) and intermittent failures.
-When ``retries`` are enabled (and set to an integer > 0), Parsl will automatically -re-launch tasks that have failed until the retry limit is reached. -By default, retries are disabled and exceptions will be communicated -to the Parsl program. - -The following example shows how the number of retries can be set to 2: - -.. code-block:: python - - import parsl - from parsl.configs.htex_local import config - - config.retries = 2 - - parsl.load(config) - -More specific retry handling can be specified using retry handlers, documented -below. - - -Lazy fail ---------- - -Parsl implements a lazy failure model through which a workload will continue -to execute in the case that some tasks fail. That is, the program will not -halt as soon as it encounters a failure, rather it will continue to execute -unaffected apps. - -The following example shows how lazy failures affect execution. In this -case, task C fails and therefore tasks E and F that depend on results from -C cannot be executed; however, Parsl will continue to execute tasks B and D -as they are unaffected by task C's failure. - -.. code-block:: - - Here's a workflow graph, where - (X) is runnable, - [X] is completed, - (X*) is failed. - (!X) is dependency failed - - (A) [A] (A) - / \ / \ / \ - (B) (C) [B] (C*) [B] (C*) - | | => | | => | | - (D) (E) (D) (E) [D] (!E) - \ / \ / \ / - (F) (F) (!F) - - time -----> - - -Retry handlers --------------- - -The basic parsl retry mechanism keeps a count of the number of times a task -has been (re)tried, and will continue retrying that task until the configured -retry limit is reached. - -Retry handlers generalize this to allow more expressive retry handling: -parsl keeps a retry cost for a task, and the task will be retried until the -configured retry limit is reached. Instead of the cost being 1 for each -failure, user-supplied code can examine the failure and compute a custom -cost. 
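The interaction between a retry handler's cost and the retry limit can be sketched in plain Python. This is a simplified model for illustration only, not Parsl's internal implementation; the `run_with_retries` helper and the empty `task_record` are invented for the sketch.

```python
# Simplified model of cost-based retries; illustrative only, not Parsl internals.
def run_with_retries(task, retry_limit, retry_handler):
    """Re-run `task` until it succeeds or its accumulated retry cost exceeds retry_limit."""
    cost = 0
    while True:
        try:
            return task()
        except Exception as e:
            cost += retry_handler(e, task_record={})  # task_record details elided
            if cost > retry_limit:
                raise

attempts = []

def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "ok"

# A handler that charges a cost of 1 per failure mirrors the default behaviour.
result = run_with_retries(flaky, retry_limit=2, retry_handler=lambda e, task_record: 1)
```

With ``retry_limit=2`` the task is retried twice and succeeds on the third attempt; a handler returning a large cost for a known-fatal exception would exhaust the limit immediately.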
- -This allows user knowledge about failures to influence the retry mechanism: -an exception which is almost definitely a non-recoverable failure (for example, -due to bad parameters) can be given a high retry cost (so that it will not -be retried many times, or at all), and exceptions which are likely to be -transient (for example, where a worker node has died) can be given a low -retry cost so they will be retried many times. - -A retry handler can be specified in the parsl configuration like this: - - -.. code-block:: python - - Config( - retries=2, - retry_handler=example_retry_handler - ) - - -``example_retry_handler`` should be a function defined by the user that will -compute the retry cost for a particular failure, given some information about -the failure. - -For example, the following handler will give a cost of 1 to all exceptions, -except when a bash app exits with unix exitcode 9, in which case the cost will -be 100. This will have the effect that retries will happen as normal for most -errors, but the bash app can indicate that there is little point in retrying -by exiting with exitcode 9. - -.. code-block:: python - - def example_retry_handler(exception, task_record): - if isinstance(exception, BashExitFailure) and exception.exitcode == 9: - return 100 - else - return 1 - -The retry handler is given two parameters: the exception from execution, and -the parsl internal task_record. The task record contains details such as the -app name, parameters and executor. - -If a retry handler raises an exception itself, then the task will be aborted -and no further tries will be attempted. +This page has been `moved `_ diff --git a/docs/userguide/execution.rst b/docs/userguide/execution.rst index 832985c164..20346bfe35 100644 --- a/docs/userguide/execution.rst +++ b/docs/userguide/execution.rst @@ -1,389 +1,9 @@ -.. _label-execution: +:orphan: +.. 
meta:: + :content http-equiv="refresh": 0;url=configuration/execution.html -Execution -========= +Redirect +-------- -Contemporary computing environments may include a wide range of computational platforms or **execution providers**, from laptops and PCs to various clusters, supercomputers, and cloud computing platforms. Different execution providers may require or allow for the use of different **execution models**, such as threads (for efficient parallel execution on a multicore processor), processes, and pilot jobs for running many small tasks on a large parallel system. - -Parsl is designed to abstract these low-level details so that an identical Parsl program can run unchanged on different platforms or across multiple platforms. -To this end, Parsl uses a configuration file to specify which execution provider(s) and execution model(s) to use. -Parsl provides a high level abstraction, called a *block*, for providing a uniform description of a compute resource irrespective of the specific execution provider. - -.. note:: - Refer to :ref:`configuration-section` for information on how to configure the various components described - below for specific scenarios. - -Execution providers -------------------- - -Clouds, supercomputers, and local PCs offer vastly different modes of access. -To overcome these differences, and present a single uniform interface, -Parsl implements a simple provider abstraction. This -abstraction is key to Parsl's ability to enable scripts to be moved -between resources. The provider interface exposes three core actions: submit a -job for execution (e.g., sbatch for the Slurm resource manager), -retrieve the status of an allocation (e.g., squeue), and cancel a running -job (e.g., scancel). Parsl implements providers for local execution -(fork), for various cloud platforms using cloud-specific APIs, and -for clusters and supercomputers that use a Local Resource Manager -(LRM) to manage access to resources, such as Slurm and HTCondor. 
-
-Each provider implementation may allow users to specify additional parameters for further configuration. Parameters are generally mapped to LRM submission script or cloud API options.
-Examples of LRM-specific options are partition, wall clock time,
-scheduler options (e.g., #SBATCH arguments for Slurm), and worker
-initialization commands (e.g., loading a conda environment). Cloud
-parameters include access keys, instance type, and spot bid price.
-
-Parsl currently supports the following providers:
-
-1. `parsl.providers.LocalProvider`: This provider allows you to run locally on your laptop or workstation.
-2. `parsl.providers.SlurmProvider`: This provider allows you to schedule resources via the Slurm scheduler.
-3. `parsl.providers.CondorProvider`: This provider allows you to schedule resources via the Condor scheduler.
-4. `parsl.providers.GridEngineProvider`: This provider allows you to schedule resources via the GridEngine scheduler.
-5. `parsl.providers.TorqueProvider`: This provider allows you to schedule resources via the Torque scheduler.
-6. `parsl.providers.AWSProvider`: This provider allows you to provision and manage cloud nodes from Amazon Web Services.
-7. `parsl.providers.GoogleCloudProvider`: This provider allows you to provision and manage cloud nodes from Google Cloud.
-8. `parsl.providers.KubernetesProvider`: This provider allows you to provision and manage containers on a Kubernetes cluster.
-9. `parsl.providers.LSFProvider`: This provider allows you to schedule resources via IBM's LSF scheduler.
-
-
-
-Executors
----------
-
-Parsl programs vary widely in terms of their
-execution requirements. Individual Apps may run for milliseconds
-or days, and available parallelism can vary between none for
-sequential programs to millions for "pleasingly parallel" programs.
-Parsl executors, as the name suggests, execute Apps on one or more
-target execution resources such as multi-core workstations, clouds,
-or supercomputers. 
As it appears infeasible to implement a single
-execution strategy that will meet so many diverse requirements on
-such varied platforms, Parsl provides a modular executor interface
-and a collection of executors that are tuned for common execution
-patterns.
-
-Parsl executors extend the Executor class offered by Python's
-concurrent.futures library, which allows Parsl to use
-existing solutions in the Python Standard Library (e.g., ThreadPoolExecutor)
-and from other packages such as Work Queue. Parsl
-extends the concurrent.futures executor interface to support
-additional capabilities such as automatic scaling of execution resources,
-monitoring, deferred initialization, and methods to set working
-directories.
-All executors share a common execution kernel that is responsible
-for deserializing the task (i.e., the App and its input arguments)
-and executing the task in a sandboxed Python environment.
-
-Parsl currently supports the following executors:
-
-1. `parsl.executors.ThreadPoolExecutor`: This executor supports multi-thread execution on local resources.
-
-2. `parsl.executors.HighThroughputExecutor`: This executor implements hierarchical scheduling and batching using a pilot job model to deliver high throughput task execution on up to 4000 nodes.
-
-3. `parsl.executors.WorkQueueExecutor`: This executor integrates `Work Queue `_ as an execution backend. Work Queue scales to tens of thousands of cores and implements reliable execution of tasks with dynamic resource sizing.
-
-4. `parsl.executors.taskvine.TaskVineExecutor`: This executor uses `TaskVine `_ as the execution backend. TaskVine scales up to tens of thousands of cores and actively uses local storage on compute nodes to offer a diverse array of performance-oriented features, including: smart caching and sharing common large files between tasks and compute nodes, reliable execution of tasks, dynamic resource sizing, and automatic Python environment detection and sharing.
-These executors cover a broad range of execution requirements. As with other Parsl components, there is a standard interface (ParslExecutor) that can be implemented to add support for other executors.
-
-.. note::
-   Refer to :ref:`configuration-section` for information on how to configure these executors.
-
-
-Launchers
----------
-
-Many LRMs offer mechanisms for spawning applications across nodes
-inside a single job and for specifying the
-resources and task placement information needed to execute that
-application at launch time. Common mechanisms include
-`srun `_ (for Slurm),
-`aprun `_ (for Crays), and `mpirun `_ (for MPI).
-Thus, to run Parsl programs on such systems, we typically want first to
-request a large number of nodes and then to *launch* "pilot job" or
-**worker** processes using the system launchers.
-Parsl's Launcher abstraction enables Parsl programs
-to use these system-specific launcher systems to start workers across
-cores and nodes.
-
-Parsl currently supports the following set of launchers:
-
-1. `parsl.launchers.SrunLauncher`: Srun based launcher for Slurm based systems.
-2. `parsl.launchers.AprunLauncher`: Aprun based launcher for Crays.
-3. `parsl.launchers.SrunMPILauncher`: Launcher for launching MPI applications with Srun.
-4. `parsl.launchers.GnuParallelLauncher`: Launcher using GNU parallel to launch workers across nodes and cores.
-5. `parsl.launchers.MpiExecLauncher`: Uses Mpiexec to launch.
-6. `parsl.launchers.SimpleLauncher`: This launcher defaults to a single worker launch.
-7. `parsl.launchers.SingleNodeLauncher`: This launcher launches ``workers_per_node`` count workers on a single node.
-
-Additionally, the launcher interface can be used to implement specialized behaviors
-in custom environments (for example, to
-launch node processes inside containers with customized environments).
-For example, the following launcher uses Srun to launch ``worker-wrapper``, passing the
-command to be run as parameters to ``worker-wrapper``. It is the responsibility of ``worker-wrapper``
-to launch the command it is given inside the appropriate environment.
-
-.. code:: python
-
-    class MyShifterSRunLauncher:
-        def __init__(self):
-            self.srun_launcher = SrunLauncher()
-
-        def __call__(self, command, tasks_per_node, nodes_per_block):
-            new_command = "worker-wrapper {}".format(command)
-            return self.srun_launcher(new_command, tasks_per_node, nodes_per_block)
-
-Blocks
-------
-
-One challenge when making use of heterogeneous
-execution resource types is the need to provide a uniform representation of
-resources. Consider that single requests on clouds return individual
-nodes, clusters and supercomputers provide batches of nodes, grids
-provide cores, and workstations provide a single multicore node.
-
-Parsl defines a resource abstraction called a *block* as the most basic unit
-of resources to be acquired from a provider. A block contains one
-or more nodes and maps to the different provider abstractions. In
-a cluster, a block corresponds to a single allocation request to a
-scheduler. In a cloud, a block corresponds to a single API request
-for one or more instances.
-Parsl can then execute *tasks* (instances of apps)
-within and across (e.g., for MPI jobs) nodes within a block.
-Blocks are also used as the basis for
-elasticity on batch scheduling systems (see Elasticity below).
-Three different examples of block configurations are shown below.
-
-1. A single block comprised of a node executing one task:
-
-   .. image:: ../images/N1_T1.png
-      :scale: 75%
-
-2. A single block with one node executing several tasks. This configuration is
-   most suitable for single threaded apps running on multicore target systems.
-   The number of tasks executed concurrently is proportional to the number of cores available on the system.
-
-   .. 
image:: ../images/N1_T4.png - :scale: 75% - -3. A block comprised of several nodes and executing several tasks, where a task can span multiple nodes. This configuration - is generally used by MPI applications. Starting a task requires using a specific - MPI launcher that is supported on the target system (e.g., aprun, srun, mpirun, mpiexec). - The `MPI Apps `_ documentation page describes how to configure Parsl for this case. - - .. image:: ../images/N4_T2.png - -The configuration options for specifying the shape of each block are: - -1. ``workers_per_node``: Number of workers started per node, which corresponds to the number of tasks that can execute concurrently on a node. -2. ``nodes_per_block``: Number of nodes requested per block. - -.. _label-elasticity: - -Elasticity ----------- - -Workload resource requirements often vary over time. -For example, in the map-reduce paradigm the map phase may require more -resources than the reduce phase. In general, reserving sufficient -resources for the widest parallelism will result in underutilization -during periods of lower load; conversely, reserving minimal resources -for the thinnest parallelism will lead to optimal utilization -but also extended execution time. -Even simple bag-of-task applications may have tasks of different durations, leading to trailing -tasks with a thin workload. - -To address dynamic workload requirements, -Parsl implements a cloud-like elasticity model in which resource -blocks are provisioned/deprovisioned in response to workload pressure. -Given the general nature of the implementation, -Parsl can provide elastic execution on clouds, clusters, -and supercomputers. Of course, in an HPC setting, elasticity may -be complicated by queue delays. - -Parsl's elasticity model includes a flow control system -that monitors outstanding tasks and available compute capacity. -This flow control monitor determines when to trigger scaling (in or out) -events to match workload needs. 
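The capacity that this flow control system compares against outstanding tasks follows directly from the block shape; a quick back-of-the-envelope calculation (the numbers here are made up for illustration):

```python
# Hypothetical block shape: concurrent task capacity is the product of
# workers per node, nodes per block, and the number of active blocks.
workers_per_node = 2
nodes_per_block = 4
active_blocks = 3

concurrent_task_capacity = workers_per_node * nodes_per_block * active_blocks
```

With these numbers the executor could run 24 single-node tasks at once; scaling out by one block raises the capacity by ``workers_per_node * nodes_per_block``.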
- -The animated diagram below shows how blocks are elastically -managed within an executor. The Parsl configuration for an executor -defines the minimum, maximum, and initial number of blocks to be used. - -.. image:: parsl_scaling.gif - -The configuration options for specifying elasticity bounds are: - -1. ``min_blocks``: Minimum number of blocks to maintain per executor. -2. ``init_blocks``: Initial number of blocks to provision at initialization of workflow. -3. ``max_blocks``: Maximum number of blocks that can be active per executor. - - - -Parallelism -^^^^^^^^^^^ - -Parsl provides a user-managed model for controlling elasticity. -In addition to setting the minimum -and maximum number of blocks to be provisioned, users can also define -the desired level of parallelism by setting a parameter (*p*). Parallelism -is expressed as the ratio of task execution capacity to the sum of running tasks -and available tasks (tasks with their dependencies met, but waiting for execution). -A parallelism value of 1 represents aggressive scaling where the maximum resources -needed are used (i.e., max_blocks); parallelism close to 0 represents the opposite situation in which -as few resources as possible (i.e., min_blocks) are used. By selecting a fraction between 0 and 1, -the provisioning aggressiveness can be controlled. - -For example: - -- When p = 0: Use the fewest resources possible. If there is no workload then no blocks will be provisioned, otherwise the fewest blocks specified (e.g., min_blocks, or 1 if min_blocks is set to 0) will be provisioned. - -.. code:: python - - if active_tasks == 0: - blocks = min_blocks - else: - blocks = max(min_blocks, 1) - -- When p = 1: Use as many resources as possible. Provision sufficient nodes to execute all running and available tasks concurrently up to the max_blocks specified. - -.. 
code-block:: python
-
-    blocks = min(max_blocks,
-                 ceil((running_tasks + available_tasks) / (workers_per_node * nodes_per_block)))
-
-- When p = 1/2: Queue up to 2 tasks per worker before requesting a new block.
-
-
-Configuration
-^^^^^^^^^^^^^
-
-The example below shows how elasticity and parallelism can be configured. Here, a `parsl.executors.HighThroughputExecutor`
-is used with a minimum of 1 block and a maximum of 2 blocks, where each block may host
-up to 2 workers per node. Thus this setup is capable of servicing 2 tasks concurrently.
-Parallelism of 0.5 means that when more than 2 * the total task capacity (i.e., 4 tasks) are queued a new
-block will be requested. An example :class:`~parsl.config.Config` is:
-
-.. code:: python
-
-    from parsl.config import Config
-    from parsl.providers import LocalProvider
-    from parsl.executors import HighThroughputExecutor
-
-    config = Config(
-        executors=[
-            HighThroughputExecutor(
-                label='local_htex',
-                workers_per_node=2,
-                provider=LocalProvider(
-                    min_blocks=1,
-                    init_blocks=1,
-                    max_blocks=2,
-                    nodes_per_block=1,
-                    parallelism=0.5
-                )
-            )
-        ]
-    )
-
-The animated diagram below illustrates the behavior of this executor.
-In the diagram, the tasks are allocated to the first block, until
-5 tasks are submitted. At this stage, as more than double the available
-task capacity is used, Parsl provisions a new block for executing the remaining
-tasks.
-
-.. image:: parsl_parallelism.gif
-
-
-Multi-executor
---------------
-
-Parsl supports the use of one or more executors as specified in the configuration.
-In this situation, individual apps may indicate which executors they are able to use.
-
-The common scenarios for this feature are:
-
-* A workflow has an initial simulation stage that runs on the compute heavy
-  nodes of an HPC system followed by an analysis and visualization stage that
-  is better suited for GPU nodes.
-* A workflow follows a repeated fan-out, fan-in model where the long running - fan-out tasks are computed on a cluster and the quick fan-in computation is - better suited for execution using threads on a login node. -* A workflow includes apps that wait and evaluate the results of a - computation to determine whether the app should be relaunched. - Only apps running on threads may launch other apps. Often, simulations - have stochastic behavior and may terminate before completion. - In such cases, having a wrapper app that checks the exit code - and determines whether or not the app has completed successfully can - be used to automatically re-execute the app (possibly from a - checkpoint) until successful completion. - - -The following code snippet shows how apps can specify suitable executors in the app decorator. - -.. code-block:: python - - #(CPU heavy app) (CPU heavy app) (CPU heavy app) <--- Run on compute queue - # | | | - # (data) (data) (data) - # \ | / - # (Analysis and visualization phase) <--- Run on GPU node - - # A mock molecular dynamics simulation app - @bash_app(executors=["Theta.Phi"]) - def MD_Sim(arg, outputs=()): - return "MD_simulate {} -o {}".format(arg, outputs[0]) - - # Visualize results from the mock MD simulation app - @bash_app(executors=["Cooley.GPU"]) - def visualize(inputs=(), outputs=()): - bash_array = " ".join(inputs) - return "viz {} -o {}".format(bash_array, outputs[0]) - - -Encryption ----------- - -Users can enable encryption for the ``HighThroughputExecutor`` by setting its ``encrypted`` -initialization argument to ``True``. - -For example, - -.. code-block:: python - - from parsl.config import Config - from parsl.executors import HighThroughputExecutor - - config = Config( - executors=[ - HighThroughputExecutor( - encrypted=True - ) - ] - ) - -Under the hood, we use `CurveZMQ `_ to encrypt all communication channels -between the executor and related nodes. 
- -Encryption performance -^^^^^^^^^^^^^^^^^^^^^^ - -CurveZMQ depends on `libzmq `_ and `libsodium `_, -which `pyzmq `_ (a Parsl dependency) includes as part of its -installation via ``pip``. This installation path should work on most systems, but users have -reported significant performance degradation as a result. - -If you experience a significant performance hit after enabling encryption, we recommend installing -``pyzmq`` with conda: - -.. code-block:: bash - - conda install conda-forge::pyzmq - -Alternatively, you can `install libsodium `_, then -`install libzmq `_, then build ``pyzmq`` from source: - -.. code-block:: bash - - pip3 install parsl --no-binary pyzmq +This page has been `moved `_ diff --git a/docs/userguide/futures.rst b/docs/userguide/futures.rst index 13d22a211b..1a4f40e79e 100644 --- a/docs/userguide/futures.rst +++ b/docs/userguide/futures.rst @@ -1,165 +1,9 @@ -.. _label-futures: +:orphan: -Futures -======= +.. meta:: + :content http-equiv="refresh": 0;url=execution/futures.html -When an ordinary Python function is invoked in a Python program, the Python interpreter waits for the function to complete execution -before proceeding to the next statement. -But if a function is expected to execute for a long period of time, it may be preferable not to wait for -its completion but instead to proceed immediately with executing subsequent statements. -The function can then execute concurrently with that other computation. +Redirect +-------- -Concurrency can be used to enhance performance when independent activities -can execute on different cores or nodes in parallel. The following -code fragment demonstrates this idea, showing that overall execution time -may be reduced if the two function calls are executed concurrently. - -.. code-block:: python - - v1 = expensive_function(1) - v2 = expensive_function(2) - result = v1 + v2 - -However, concurrency also introduces a need for **synchronization**. 
-In the example, it is not possible to compute the sum of ``v1`` and ``v2`` -until both function calls have completed. -Synchronization provides a way of blocking execution of one activity -(here, the statement ``result = v1 + v2``) until other activities -(here, the two calls to ``expensive_function()``) have completed. - -Parsl supports concurrency and synchronization as follows. -Whenever a Parsl program calls a Parsl app (a function annotated with a Parsl -app decorator, see :ref:`apps`), -Parsl will create a new ``task`` and immediately return a -`future `_ in lieu of that function's result(s). -The program will then continue immediately to the next statement in the program. -At some point, for example when the task's dependencies are met and there -is available computing capacity, Parsl will execute the task. Upon -completion, Parsl will set the value of the future to contain the task's -output. - -A future can be used to track the status of an asynchronous task. -For example, after creation, the future may be interrogated to determine -the task's status (e.g., running, failed, completed), access results, -and capture exceptions. Further, futures may be used for synchronization, -enabling the calling Python program to block until the future -has completed execution. - -Parsl provides two types of futures: `AppFuture` and `DataFuture`. -While related, they enable subtly different parallel patterns. - -AppFutures ----------- - -AppFutures are the basic building block upon which Parsl programs are built. Every invocation of a Parsl app returns an AppFuture that may be used to monitor and manage the task's execution. -AppFutures are inherited from Python's `concurrent library `_. -They provide three key capabilities: - -1. An AppFuture's ``result()`` function can be used to wait for an app to complete, and then access any result(s). -This function is blocking: it returns only when the app completes or fails. 
-The following code fragment implements an example similar to the ``expensive_function()`` example above. -Here, the ``sleep_double`` app simply doubles the input value. The program invokes -the ``sleep_double`` app twice, and returns futures in place of results. The example -shows how the future's ``result()`` function can be used to wait for the results from the -two ``sleep_double`` app invocations to be computed. - -.. code-block:: python - - @python_app - def sleep_double(x): - import time - time.sleep(2) # Sleep for 2 seconds - return x*2 - - # Start two concurrent sleep_double apps. doubled_x1 and doubled_x2 are AppFutures - doubled_x1 = sleep_double(10) - doubled_x2 = sleep_double(5) - - # The result() function will block until each of the corresponding app calls have completed - print(doubled_x1.result() + doubled_x2.result()) - -2. An AppFuture's ``done()`` function can be used to check the status of an app, *without blocking*. -The following example shows that calling the future's ``done()`` function will not stop execution of the main Python program. - -.. code-block:: python - - @python_app - def double(x): - return x*2 - - # doubled_x is an AppFuture - doubled_x = double(10) - - # Check status of doubled_x, this will print True if the result is available, else False - print(doubled_x.done()) - -3. An AppFuture provides a safe way to handle exceptions and errors while asynchronously executing -apps. The example shows how exceptions can be captured in the same way as a standard Python program -when calling the future's ``result()`` function. - -.. code-block:: python - - @python_app - def bad_divide(x): - return 6/x - - # Call bad divide with 0, to cause a divide by zero exception - doubled_x = bad_divide(0) - - # Catch and handle the exception. - try: - doubled_x.result() - except ZeroDivisionError as ze: - print('Oops! You tried to divide by 0') - except Exception as e: - print('Oops! 
Something really bad happened') - - -In addition to being able to capture exceptions raised by a specific app, Parsl also raises ``DependencyErrors`` when apps are unable to execute due to failures in prior dependent apps. -That is, an app that is dependent upon the successful completion of another app will fail with a dependency error if any of the apps on which it depends fail. - - -DataFutures ----------- - -While an AppFuture represents the execution of an asynchronous app, -a DataFuture represents a file to be produced by that app. -Parsl's dataflow model requires such a construct so that it can determine -when dependent apps, apps that are to consume a file produced by another app, -can start execution. - -When calling an app that produces files as outputs, Parsl requires that a list of output files be specified (as a list of `File` objects passed in via the ``outputs`` keyword argument). Parsl will return a DataFuture for each output file as part of the AppFuture when the app is executed. -These DataFutures are accessible in the AppFuture's ``outputs`` attribute. - -Each DataFuture will complete when the App has finished executing, -and the corresponding file has been created (and if specified, staged out). - -When a DataFuture is passed as an argument to a subsequent app invocation, -that subsequent app will not begin execution until the DataFuture is -completed. The input argument will then be replaced with an appropriate -File object. - -The following code snippet shows how DataFutures are used. In this -example, the call to the echo Bash app specifies that the results -should be written to an output file ("hello1.txt"). The main -program inspects the status of the output file (via the future's -``outputs`` attribute) and then blocks waiting for the file to -be created (``hello.outputs[0].result()``). - -..
code-block:: python - - # This app echoes the input string to the first file specified in the - # outputs list - @bash_app - def echo(message, outputs=()): - return 'echo {} &> {}'.format(message, outputs[0]) - - # Call echo specifying the output file - hello = echo('Hello World!', outputs=[File('hello1.txt')]) - - # The AppFuture's outputs attribute is a list of DataFutures - print(hello.outputs) - - # Print the contents of the output DataFuture when complete - with open(hello.outputs[0].result().filepath, 'r') as f: - print(f.read()) +This page has been `moved `_ diff --git a/docs/userguide/index.rst b/docs/userguide/index.rst index 12254cd6e2..a626b42a2f 100644 --- a/docs/userguide/index.rst +++ b/docs/userguide/index.rst @@ -1,23 +1,17 @@ User guide ========== +Parsl applications are composed of **Apps** that define tasks to be executed, +**Configuration** objects which define resources available for executing tasks, +and **Workflow** scripts that weave tasks together into parallel workflows. + +Start with an `overview of Parsl `_ before learning about each component. + .. toctree:: - :maxdepth: 5 + :maxdepth: 2 overview - apps - futures - data - execution - mpi_apps - exceptions - checkpoints - configuring - monitoring - workflow - modularizing - lifted_ops - joins - usage_tracking - plugins - parsl_perf + apps/index + configuration/index + workflows/index + advanced/index diff --git a/docs/userguide/joins.rst b/docs/userguide/joins.rst index defb0ad012..4cc741f931 100644 --- a/docs/userguide/joins.rst +++ b/docs/userguide/joins.rst @@ -1,257 +1,9 @@ -.. _label-joinapp: +:orphan: -Join Apps -========= +.. meta:: + :content http-equiv="refresh": 0;url=apps/joins.html -Join apps, defined with the ``@join_app`` decorator, are a form of app that can -launch other pieces of a workflow: for example a Parsl sub-workflow, or a task -that runs in some other system. 
+Redirect
+--------

-Parsl sub-workflows
-------------------
-
-One reason for launching Parsl apps from inside a join app, rather than
-directly in the main workflow code, is that the definitions of those tasks
-are not known well enough at the start of the workflow.
-
-For example, a workflow might run an expensive step to detect some objects
-in an image, and then on each object, run a further expensive step. Because
-the number of objects is not known at the start of the workflow, but instead
-only after an expensive step has completed, the subsequent tasks cannot be
-defined until after that step has completed.
-
-In simple cases, the main workflow script can be stopped using
-``Future.result()`` and join apps are not necessary, but in more complicated
-cases, that approach can severely limit concurrency.
-
-Join apps allow more nuanced dependencies to be expressed that can help with:
-
-* increased concurrency - helping with strong scaling
-* more focused error propagation - allowing more of an ultimately failing workflow to complete
-* more useful monitoring information
-
-Using Futures from other components
-----------------------------------
-
-Sometimes, a workflow might need to incorporate tasks from other systems that
-run asynchronously but do not need a Parsl worker allocated for their entire
-run. An example of this is delegating some work into Globus Compute: work can
-be given to Globus Compute, but Parsl does not need to keep a worker allocated
-to that task while it runs. Instead, Parsl can be told to wait for the ``Future``
-returned by Globus Compute to complete.
-
-Usage
-----
-
-A `join_app` looks quite like a `python_app`, but should return one or more
-``Future`` objects, rather than a value. Once the Python code has run, the
-app will wait for those Futures to complete without occupying a Parsl worker,
-and when those Futures complete, their contents will be the return value
-of the `join_app`.
-
-For example:
-
-..
code-block:: python
-
-  @python_app
-  def some_app():
-    return 3
-
-  @join_app
-  def example():
-    x: Future = some_app()
-    return x  # note that x is a Future, not a value
-
-  assert example().result() == 3
-
-Example of a Parsl sub-workflow
-------------------------------
-
-This example workflow shows a preprocessing step, followed by
-a middle stage that is chosen by the result of the pre-processing step
-(either option 1 or option 2) followed by a known post-processing step.
-
-.. code-block:: python
-
-  @python_app
-  def pre_process():
-    return 3
-
-  @python_app
-  def option_one(x):
-    return x*2
-
-  @python_app
-  def option_two(x):
-    return (-x) * 2
-
-  @join_app
-  def process(x):
-    if x > 0:
-      return option_one(x)
-    else:
-      return option_two(x)
-
-  @python_app
-  def post_process(x):
-    return str(x)
-
-  assert post_process(process(pre_process())).result() == "6"
-
-* Why can't process be a regular python function?
-
-``process`` needs to inspect the value of ``x`` to make a decision about
-what app to launch. So it needs to defer execution until after the
-pre-processing stage has completed. In Parsl, the way to defer that is
-using apps: even though ``process`` is invoked at the start of the workflow,
-it will execute later on, when the Future returned by ``pre_process`` has a
-value.
-
-* Why can't process be a @python_app?
-
-A Python app, if run in a `parsl.executors.ThreadPoolExecutor`, can launch
-more parsl apps; so a ``python_app`` implementation of process() would be able
-to inspect x and choose and invoke the appropriate ``option_{one, two}``.
-
-From launching the ``option_{one, two}`` app, the app body python code would
-get a ``Future[int]`` - a ``Future`` that will eventually contain ``int``.
-
-But, we want to invoke ``post_process`` at submission time near the start of
-workflow so that Parsl knows about as many tasks as possible. But we don't
-want it to execute until the value of the chosen ``option_{one, two}`` app
-is known.
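The nested-future situation described here can be sketched with plain ``concurrent.futures``, independent of Parsl (the function names are illustrative, mirroring the example above):

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=2)

def option_one(x):
    return x * 2

def process(x):
    # Returns a Future rather than a value, like the app body above
    return pool.submit(option_one, x)

f_outer = pool.submit(process, 3)  # Future[Future[int]]
f_inner = f_outer.result()         # unwrap one layer: Future[int]
print(f_inner.result())            # prints 6
```

Each ``.result()`` call here unwraps exactly one layer of future; a join app is what lets Parsl perform that unwrapping without blocking a thread or worker.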
-
-If we don't have join apps, how can we do this?
-
-We could make process wait for ``option_{one, two}`` to complete, before
-returning, like this:
-
-.. code-block:: python
-
-  @python_app
-  def process(x):
-    if x > 0:
-      f = option_one(x)
-    else:
-      f = option_two(x)
-    return f.result()
-
-but this will block the worker running ``process`` until ``option_{one, two}``
-has completed. If there aren't enough workers to run ``option_{one, two}`` this
-can even deadlock. (principle: apps should not wait on completion of other
-apps and should always allow Parsl to handle this through dependencies)
-
-We could make process return the ``Future`` to the main workflow thread:
-
-.. code-block:: python
-
-  @python_app
-  def process(x):
-    if x > 0:
-      f = option_one(x)
-    else:
-      f = option_two(x)
-    return f  # f is a Future[int]
-
-  # process(3) is a Future[Future[int]]
-
-
-What comes out of invoking ``process(x)`` now is a nested ``Future[Future[int]]``
-- it's a promise that eventually process will give you a promise (from
-``option_{one, two}``) that will eventually give you an int.
-
-We can't pass that future into post_process... because post_process wants the
-final int, and that future will complete before the int is ready, and that
-(outer) future will have as its value the inner future (which won't be complete yet).
-
-So we could wait for the result in the main workflow thread:
-
-.. code-block:: python
-
-  f_outer = process(pre_process())  # Future[Future[int]]
-  f_inner = f_outer.result()  # Future[int]
-  result = post_process(f_inner)
-  # result.result() == "6"
-
-But this now blocks the main workflow thread. If we really only need to run
-these three lines, that's fine, but what if we are in a for loop that
-sets up 1000 parametrised iterations:
-
-..
code-block:: python
-
-  for x in range(1, 1001):
-    f_outer = process(pre_process(x))  # Future[Future[int]]
-    f_inner = f_outer.result()  # Future[int]
-    result = post_process(f_inner)
-
-The ``for`` loop can only iterate after pre_processing is done for each
-iteration - it is unnecessarily serialised by the ``.result()`` call,
-so that pre_processing cannot run in parallel.
-
-So, the rule about not calling ``.result()`` applies in the main workflow thread
-too.
-
-What join apps add is the ability for Parsl to unwrap that Future[Future[int]] into a
-Future[int] in a "sensible" way (e.g., it doesn't need to block a worker).
-
-
-.. _label-join-globus-compute:
-
-Example of invoking a Futures-driven task from another system
-------------------------------------------------------------
-
-
-This example shows launching some activity in another system, without
-occupying a Parsl worker while that activity happens: in this example, work is
-delegated to Globus Compute, which performs the work elsewhere. When the work
-is completed, Globus Compute will put the result into the future that it
-returns, and then (because the Parsl app is a ``@join_app``), that result will
-be used as the result of the Parsl app.
-
-As above, the motivation for doing this inside an app, rather than in the
-top level is that sufficient information to launch the Globus Compute task
-might not be available at start of the workflow.
-
-This workflow will run a first stage, ``const_five``, on a Parsl worker,
-then using the result of that stage, pass the result as a parameter to a
-Globus Compute task, getting a ``Future`` from that submission. Then, the
-results of the Globus Compute task will be passed onto a second Parsl
-local task, ``times_two``.
-
-..
code-block:: python - - import parsl - from globus_compute_sdk import Executor - - tutorial_endpoint_uuid = '4b116d3c-1703-4f8f-9f6f-39921e5864df' - gce = Executor(endpoint_id=tutorial_endpoint_uuid) - - def increment_in_funcx(n): - return n+1 - - @parsl.join_app - def increment_in_parsl(n): - future = gce.submit(increment_in_funcx, n) - return future - - @parsl.python_app - def times_two(n): - return n*2 - - @parsl.python_app - def const_five(): - return 5 - - parsl.load() - - workflow = times_two(increment_in_parsl(const_five())) - - r = workflow.result() - - assert r == (5+1)*2 - - -Terminology ------------ - -The term "join" comes from use of monads in functional programming, especially Haskell. +This page has been `moved `_ diff --git a/docs/userguide/lifted_ops.rst b/docs/userguide/lifted_ops.rst index 6e258b9b62..ed28f55d56 100644 --- a/docs/userguide/lifted_ops.rst +++ b/docs/userguide/lifted_ops.rst @@ -1,56 +1,9 @@ -.. _label-liftedops: +:orphan: -Lifted operators -================ +.. meta:: + :content http-equiv="refresh": 0;url=workflows/lifted_ops.html -Parsl allows some operators (``[]`` and ``.``) to be used on an AppFuture in -a way that makes sense with those operators on the eventually returned -result. +Redirect +-------- -Lifted [] operator ------------------- - -When an app returns a complex structure such as a ``dict`` or a ``list``, -it is sometimes useful to pass an element of that structure to a subsequent -task, without waiting for that subsequent task to complete. - -To help with this, Parsl allows the ``[]`` operator to be used on an -`AppFuture`. This operator will return another `AppFuture` that will -complete after the initial future, with the result of ``[]`` on the value -of the initial future. - -The end result is that this assertion will hold: - -.. 
code-block:: python
-
-  fut = my_app()
-  assert fut['x'].result() == fut.result()['x']
-
-but more concurrency will be available, as execution of the main workflow
-code will not stop to wait for ``result()`` to complete on the initial
-future.
-
-`AppFuture` does not implement other methods commonly associated with
-dicts and lists, such as ``len``, because those methods should return a
-specific type of result immediately, and that is not possible when the
-results are not available until the future completes.
-
-If a key does not exist in the returned result, then the exception will
-appear in the Future returned by ``[]``, rather than at the point that
-the ``[]`` operator is applied. This is because the valid values that can
-be used are not known until the underlying result is available.
-
-Lifted . operator
-----------------
-
-The ``.`` operator works similarly to ``[]`` described above:
-
-.. code-block:: python
-
-  fut = my_app()
-  assert fut.x.result() == fut.result().x
-
-Attributes beginning with ``_`` are not lifted as this usually indicates an
-attribute that is used for internal purposes, and to try to avoid mixing
-protocols (such as iteration in for loops) defined on AppFutures vs protocols
-defined on the underlying result object.
+This page has been `moved `_
diff --git a/docs/userguide/modularizing.rst b/docs/userguide/modularizing.rst
index 143a4ebcd8..69ed82482c 100644
--- a/docs/userguide/modularizing.rst
+++ b/docs/userguide/modularizing.rst
@@ -1,109 +1,9 @@
-.. _codebases:
+:orphan:
-Structuring Parsl programs
--------------------------
+.. meta::
+    :content http-equiv="refresh": 0;url=advanced/modularizing.html
-While convenient to build simple Parsl programs as a single Python file,
-splitting a Parsl program into multiple files and a Python module
-has significant benefits, including:
+Redirect
+--------
-    1. Better readability
-    2. Logical separation of components (e.g., apps, config, and control logic)
-    3.
Ease of reuse of components
-
-Large applications that use Parsl often divide into several core components:
-
-.. contents::
-   :local:
-   :depth: 2
-
-The following sections use an example where each component is in a separate file:
-
-.. code-block::
-
-   examples/logic.py
-   examples/app.py
-   examples/config.py
-   examples/__init__.py
-   run.py
-   pyproject.toml
-
-Run the application by first installing the Python library and then executing the "run.py" script.
-
-.. code-block:: bash
-
-   pip install .  # Install module so it can be imported by workers
-   python run.py
-
-
-Core application logic
-======================
-
-The core application logic should be developed without any deference to Parsl.
-Implement capabilities, write unit tests, and prepare documentation
-in whichever way works best for the problem at hand.
-
-Parallelization with Parsl will be easy if the software already follows best practices.
-
-The example defines a function to convert a single integer into binary.
-
-.. literalinclude:: examples/library/logic.py
-   :caption: library/logic.py
-
-Workflow functions
-==================
-
-Tasks within a workflow may require unique combinations of core functions.
-Functions to be run in parallel must also meet :ref:`specific requirements `
-that may complicate writing the core logic effectively.
-As such, separating functions to be used as Apps is often beneficial.
-
-The example includes a function to convert many integers into binary.
-
-Key points to note:
-
-- It is not necessary to have import statements inside the function.
-  Parsl will serialize this function by reference, as described in :ref:`functions-from-modules`.
-
-- The function is not yet marked as a Parsl PythonApp.
-  Keeping Parsl out of the function definitions simplifies testing
-  because you will not need to run Parsl when testing the code.
- -- *Advanced*: Consider including Parsl decorators in the library if using complex workflow patterns, - such as :ref:`join apps ` or functions which take :ref:`special arguments `. - -.. literalinclude:: examples/library/app.py - :caption: library/app.py - - -Parsl configuration functions -============================= - -Create Parsl configurations specific to your application needs as functions. -While not necessary, including the Parsl configuration functions inside the module -ensures they can be imported into other scripts easily. - -Generating Parsl :class:`~parsl.config.Config` objects from a function -makes it possible to change the configuration without editing the module. - -The example function provides a configuration suited for a single node. - -.. literalinclude:: examples/library/config.py - :caption: library/config.py - -Orchestration Scripts -===================== - -The last file defines the workflow itself. - -Such orchestration scripts, at minimum, perform at least four tasks: - -1. *Load execution options* using a tool like :mod:`argparse`. -2. *Prepare workflow functions for execution* by creating :class:`~parsl.app.python.PythonApp` wrappers over each function. -3. *Create configuration then start Parsl* with the :meth:`parsl.load` function. -4. *Launch tasks and retrieve results* depending on the needs of the application. - -An example run script is as follows - -.. literalinclude:: examples/run.py - :caption: run.py +This page has been `moved `_ diff --git a/docs/userguide/monitoring.rst b/docs/userguide/monitoring.rst index 02b3177ca7..138127ec90 100644 --- a/docs/userguide/monitoring.rst +++ b/docs/userguide/monitoring.rst @@ -1,121 +1,9 @@ -Monitoring -========== +:orphan: -Parsl includes a monitoring system to capture task state as well as resource -usage over time. 
The Parsl monitoring system aims to provide detailed
-information and diagnostic capabilities to help track the state of your
-programs, down to the individual apps that are executed on remote machines.
+.. meta::
+    :content http-equiv="refresh": 0;url=advanced/monitoring.html
-The monitoring system records information to an SQLite database while a
-workflow runs. This information can then be visualised in a web dashboard
-using the ``parsl-visualize`` tool, or queried with SQL using regular
-SQLite tools.
-
-
-Monitoring configuration
------------------------
-
-Parsl monitoring is only supported with the `parsl.executors.HighThroughputExecutor`.
-
-The following example shows how to enable monitoring in the Parsl
-configuration. Here the `parsl.monitoring.MonitoringHub` is configured
-to collect resource monitoring messages from workers every 10 seconds.
-
-.. code-block:: python
-
-   import parsl
-   from parsl.monitoring.monitoring import MonitoringHub
-   from parsl.config import Config
-   from parsl.executors import HighThroughputExecutor
-   from parsl.addresses import address_by_hostname
-
-   config = Config(
-      executors=[
-          HighThroughputExecutor(
-              label="local_htex",
-              cores_per_worker=1,
-              max_workers_per_node=4,
-              address=address_by_hostname(),
-          )
-      ],
-      monitoring=MonitoringHub(
-          hub_address=address_by_hostname(),
-          monitoring_debug=False,
-          resource_monitoring_interval=10,
-      ),
-      strategy='none'
-   )
-
-
-Visualization
-------------
-
-To run the web dashboard utility ``parsl-visualize`` you first need to install
-its dependencies::
-
-   $ pip install 'parsl[monitoring,visualization]'
-
-To view the web dashboard while or after a Parsl program has executed, run
-the ``parsl-visualize`` utility::
-
-   $ parsl-visualize
-
-By default, this command expects that the default ``monitoring.db`` database is used
-in the runinfo directory. Other databases can be loaded by passing
-the database URI on the command line.
For example, if the full path
-to the database is ``/tmp/my_monitoring.db``, run::
-
-   $ parsl-visualize sqlite:////tmp/my_monitoring.db
-
-By default, the visualization web server listens on ``127.0.0.1:8080``. If the web server is deployed on a machine with a web browser, the dashboard can be accessed in the browser at ``127.0.0.1:8080``. If the web server is deployed on a remote machine, such as the login node of a cluster, you will need to use an ssh tunnel from your local machine to the cluster::
-
-   $ ssh -L 50000:127.0.0.1:8080 username@cluster_address
-
-This command will bind your local machine's port 50000 to the remote cluster's port 8080.
-The dashboard can then be accessed via the local machine's browser at ``127.0.0.1:50000``.
-
-.. warning:: Alternatively you can deploy the visualization server on a public interface. However, first check that this is allowed by the cluster's security policy. The following example shows how to deploy the web server on a public port (i.e., open to the Internet via ``public_IP:55555``)::
-
-   $ parsl-visualize --listen 0.0.0.0 --port 55555
-
-
-Workflows Page
-^^^^^^^^^^^^^^
-
-The workflows page lists all Parsl workflows that have been executed with monitoring enabled
-with the selected database.
-It provides a high level summary of workflow state as shown below:
-
-.. image:: ../images/mon_workflows_page.png
-
-Throughout the dashboard, all blue elements are clickable. For example, clicking a specific workflow
-name from the table takes you to the Workflow Summary page described in the next section.
-
-Workflow Summary
-^^^^^^^^^^^^^^^^
-
-The workflow summary page captures the run level details of a workflow, including start and end times
-as well as task summary statistics. The workflow summary section is followed by the *App Summary* that lists
-the various apps and invocation count for each.
-
-..
image:: ../images/mon_workflow_summary.png
-
-
-The workflow summary also presents three different views of the workflow:
-
-* Workflow DAG - with apps differentiated by colors: This visualization is useful to visually inspect the dependency
-  structure of the workflow. Hovering over the nodes in the DAG shows a tooltip for the app represented by the node and its task ID.
-
-.. image:: ../images/mon_task_app_grouping.png
-
-* Workflow DAG - with task states differentiated by colors: This visualization is useful to identify what tasks have been completed, failed, or are currently pending.
-
-.. image:: ../images/mon_task_state_grouping.png
-
-* Workflow resource usage: This visualization provides resource usage information at the workflow level.
-  For example, cumulative CPU/Memory utilization across workers over time.
-
-.. image:: ../images/mon_resource_summary.png
+Redirect
+--------
+This page has been `moved `_
diff --git a/docs/userguide/mpi_apps.rst b/docs/userguide/mpi_apps.rst
index 82123123b6..3f758190ee 100644
--- a/docs/userguide/mpi_apps.rst
+++ b/docs/userguide/mpi_apps.rst
@@ -1,153 +1,9 @@
-MPI and Multi-node Apps
-=======================
+:orphan:
-The :class:`~parsl.executors.MPIExecutor` supports running MPI applications or other computations which can
-run on multiple compute nodes.
+.. meta::
+    :content http-equiv="refresh": 0;url=apps/mpi_apps.html
-Background
----------
+Redirect
+--------
-
-MPI applications run multiple copies of a program that complete a single task by
-coordinating using messages passed within or across nodes.
-
-Starting an MPI application requires invoking a "launcher" code (e.g., ``mpiexec``)
-with options that define how the copies of a program should be distributed.
-
-The launcher includes options that control how copies of the program are distributed
-across the nodes (e.g., how many copies per node) and
-how each copy is configured (e.g., which CPU cores it can use).
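The mapping from resources to launcher options can be sketched with a hypothetical helper (the flag names follow common ``mpiexec`` conventions and differ between MPI implementations; this is not Parsl's internal template):

```python
def build_launch_prefix(num_ranks, ranks_per_node, nodelist):
    # Hypothetical helper: assemble an mpiexec-style launch prefix from
    # a total rank count, a per-node rank count, and an assigned node list.
    return "mpiexec -n {} -ppn {} -hosts {}".format(
        num_ranks, ranks_per_node, ",".join(nodelist))

print(build_launch_prefix(4, 2, ["NODE001", "NODE002"]))
# mpiexec -n 4 -ppn 2 -hosts NODE001,NODE002
```

Here ``-n`` sets the total number of ranks, ``-ppn`` the ranks per node, and ``-hosts`` the nodes to use; other launchers (``srun``, ``aprun``) express the same information with different flags.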
- -The options for launchers vary between MPI implementations and compute clusters. - -Configuring ``MPIExecutor`` ---------------------------- - -The :class:`~parsl.executors.MPIExecutor` is a wrapper over -:class:`~parsl.executors.high_throughput.executor.HighThroughputExecutor` -which eliminates options that are irrelevant for MPI applications. - -Define a configuration for :class:`~parsl.executors.MPIExecutor` by - -1. Setting ``max_workers_per_block`` to the maximum number of tasks to run per block of compute nodes. - This value is typically the number of nodes per block divided by the number of nodes per task. -2. Setting ``mpi_launcher`` to the launcher used for your application. -3. Specifying a provider that matches your cluster and use the :class:`~parsl.launchers.SimpleLauncher`, - which will ensure that no Parsl processes are placed on the compute nodes. - -An example for ALCF's Polaris supercomputer that will run 3 MPI tasks of 2 nodes each at the same time: - -.. code-block:: python - - config = Config( - executors=[ - MPIExecutor( - address=address_by_interface('bond0'), - max_workers_per_block=3, # Assuming 2 nodes per task - provider=PBSProProvider( - account="parsl", - worker_init=f"""module load miniconda; source activate /lus/eagle/projects/parsl/env""", - walltime="1:00:00", - queue="debug", - scheduler_options="#PBS -l filesystems=home:eagle:grand", - launcher=SimpleLauncher(), - select_options="ngpus=4", - nodes_per_block=6, - max_blocks=1, - cpus_per_node=64, - ), - ), - ] - ) - - -.. warning:: - Please note that ``Provider`` options that specify per-task or per-node resources, for example, - ``SlurmProvider(cores_per_node=N, ...)`` should not be used with :class:`~parsl.executors.high_throughput.MPIExecutor`. - Parsl primarily uses a pilot job model and assumptions from that context do not translate to the MPI context. 
For
-    more information, refer to
-    `github issue #3006 `_
-
-Writing an MPI App
------------------
-
-:class:`~parsl.executors.high_throughput.MPIExecutor` can execute both Python and Bash Apps which invoke an MPI application.
-
-Create the app by first defining a function which includes a ``parsl_resource_specification`` keyword argument.
-The resource specification is a dictionary which defines the number of nodes and ranks used by the application:
-
-.. code-block:: python
-
-    resource_specification = {
-      'num_nodes': <int>,        # Number of nodes required for the application instance
-      'ranks_per_node': <int>,   # Number of ranks / application elements to be launched per node
-      'num_ranks': <int>,        # Number of ranks in total
-    }
-
-Then, replace the call to the MPI launcher with ``$PARSL_MPI_PREFIX``.
-``$PARSL_MPI_PREFIX`` references an environment variable which will be replaced with
-the correct MPI launcher configured for the resource list provided when calling the function
-and with options that map the task to nodes which Parsl knows to be available.
-
-The function can be a Bash app
-
-.. code-block:: python
-
-    @bash_app
-    def lammps_mpi_application(infile: File, parsl_resource_specification: Dict):
-        # PARSL_MPI_PREFIX will resolve to `mpiexec -n 4 -ppn 2 -hosts NODE001,NODE002`
-        return f"$PARSL_MPI_PREFIX lmp_mpi -in {infile.filepath}"
-
-
-or a Python app:
-
-.. code-block:: python
-
-    @python_app
-    def lammps_mpi_application(infile: File, parsl_resource_specification: Dict):
-        from subprocess import run
-        with open('stdout.lmp', 'w') as fp, open('stderr.lmp', 'w') as fe:
-            # A shell is needed here so that $PARSL_MPI_PREFIX is expanded
-            proc = run('$PARSL_MPI_PREFIX lmp_mpi -i in.lmp', shell=True, stdout=fp, stderr=fe)
-        return proc.returncode
-
-
-Run either App by calling it with its arguments and a resource specification which defines how to execute it
-
-.. code-block:: python
-
-    # Resources in terms of nodes and how ranks are to be distributed are set on a per app
-    # basis via the resource_spec dictionary.
-
-    resource_spec = {
-        "num_nodes": 2,
-        "ranks_per_node": 2,
-        "num_ranks": 4,
-    }
-    future = lammps_mpi_application(File('in.file'), parsl_resource_specification=resource_spec)
-
-Advanced: More Environment Variables
-++++++++++++++++++++++++++++++++++++
-
-Parsl Apps which run using :class:`~parsl.executors.high_throughput.MPIExecutor`
-can make their own MPI invocation using other environment variables.
-
-These other variables include versions of the launch command for different launchers
-
-- ``PARSL_MPIEXEC_PREFIX``: mpiexec launch command which works for a large number of batch systems, especially PBS systems
-- ``PARSL_SRUN_PREFIX``: srun launch command for Slurm-based clusters
-- ``PARSL_APRUN_PREFIX``: aprun launch command prefix for some Cray machines
-
-And the information used by Parsl when assembling the launcher commands:
-
-- ``PARSL_NUM_RANKS``: Total number of ranks to use for the MPI application
-- ``PARSL_NUM_NODES``: Number of nodes to use for the calculation
-- ``PARSL_MPI_NODELIST``: List of assigned nodes separated by commas (e.g., NODE1,NODE2)
-- ``PARSL_RANKS_PER_NODE``: Number of ranks per node
-
-Limitations
-+++++++++++
-
-Support for MPI tasks in HTEX is limited. It is designed for running many multi-node MPI applications within a single
-batch job.
-
-#. MPI tasks may not span across nodes from more than one block.
-#. Parsl does not correctly determine the number of execution slots per block (`Issue #1647 `_)
-#. The executor uses a Python process per task, which can use a lot of memory (`Issue #2264 `_)
\ No newline at end of file
+This page has been `moved `_
diff --git a/docs/userguide/overview.rst b/docs/userguide/overview.rst
index 073cc202e6..a83d9875e5 100644
--- a/docs/userguide/overview.rst
+++ b/docs/userguide/overview.rst
@@ -1,8 +1,8 @@
 Overview
 ========
-Parsl is designed to enable straightforward parallelism and orchestration of asynchronous
-tasks into dataflow-based workflows, in Python.
Parsl manages the concurrent execution of
+Parsl is designed to enable straightforward parallelism and orchestration of asynchronous
+tasks into dataflow-based workflows, in Python. Parsl manages the concurrent execution of
 these tasks across various computation resources, from laptops to supercomputers,
 scheduling each task only when its dependencies (e.g., input data dependencies) are met.
@@ -28,7 +28,7 @@ Parsl and Concurrency
 
 Any call to a Parsl app creates a new task that executes concurrently with the
 main program and any other task(s) that are currently executing. Different
 tasks may execute on the same nodes or on different nodes, and on the same or
-different computers.
+different computers.
 
 The Parsl execution model thus differs from the Python native execution
 model, which is inherently sequential. A Python program that does not contain Parsl
@@ -42,13 +42,13 @@ main program resumes only after the function returns.
 
 .. image:: ../images/overview/python-concurrency.png
    :scale: 70
-   :align: center
+   :align: center
 
 In contrast, the Parsl execution model is inherently concurrent. Whenever a
 program calls an app, a separate thread of execution is created, and the main
 program continues without pausing. Thus, in the example shown in the figure
 below, there is initially a single task: the main program (black). The first
-call to ``double`` creates a second task (red) and the second call to ``double``
+call to ``double`` creates a second task (red) and the second call to ``double``
 creates a third task (orange). The second and third tasks terminate as the
 functions that they execute return. (The dashed lines represent the start and
 finish of the tasks). The calling program will only block (wait) when it is
@@ -70,28 +70,28 @@ Parsl and Execution
 
 We have now seen that Parsl tasks are executed concurrently alongside the main
 Python program and other Parsl tasks. We now turn to the question of how and
 where those tasks are executed.
Given the range of computers on which parallel
-programs may be executed, Parsl allows tasks to be executed using different
-executors (:py:class:`parsl.executors`). Executors are responsible for taking a queue of tasks and executing
 them on local or remote resources.
 
-We briefly describe two of Parsl's most commonly used executors.
+We briefly describe two of Parsl's most commonly used executors.
 Other executors are described in :ref:`label-execution`.
 
-The `parsl.executors.HighThroughputExecutor` (HTEX) implements a *pilot job model* that enables
-fine-grained task execution across one or more provisioned nodes.
-HTEX can be used on a single node (e.g., a laptop) and will make use of
+The `parsl.executors.HighThroughputExecutor` (HTEX) implements a *pilot job model* that enables
+fine-grained task execution across one or more provisioned nodes.
+HTEX can be used on a single node (e.g., a laptop) and will make use of
 multiple processes for concurrent execution.
-As shown in the following figure, HTEX uses Parsl's provider abstraction (:py:class:`parsl.providers`) to
-communicate with a resource manager (e.g., batch scheduler or cloud API) to
+As shown in the following figure, HTEX uses Parsl's provider abstraction (:py:class:`parsl.providers`) to
+communicate with a resource manager (e.g., batch scheduler or cloud API) to
 provision a set of nodes (e.g., Parsl will use Slurm’s sbatch command to request
-nodes on a Slurm cluster) for the duration of execution.
-HTEX deploys a lightweight worker agent on the nodes which subsequently connects
-back to the main Parsl process. Parsl tasks are then sent from the main program
-to the connected workers for execution and the results are sent back via the
-same mechanism.
This approach has a number of advantages over other methods: -it avoids long job scheduler queue delays by acquiring one set of resources -for the entire program and it allows for scheduling of many tasks on individual -nodes. +nodes on a Slurm cluster) for the duration of execution. +HTEX deploys a lightweight worker agent on the nodes which subsequently connects +back to the main Parsl process. Parsl tasks are then sent from the main program +to the connected workers for execution and the results are sent back via the +same mechanism. This approach has a number of advantages over other methods: +it avoids long job scheduler queue delays by acquiring one set of resources +for the entire program and it allows for scheduling of many tasks on individual +nodes. .. image:: ../images/overview/htex-model.png @@ -101,13 +101,13 @@ nodes. back to the main Parsl process. Thus, you should verify that there is network connectivity between the workers and the Parsl process and ensure that the correct network address is used by the workers. Parsl provides a helper - function to automatically detect network addresses + function to automatically detect network addresses (`parsl.addresses.address_by_query`). -The `parsl.executors.ThreadPoolExecutor` allows tasks to be executed on a pool of locally -accessible threads. As execution occurs on the same computer, on a pool of -threads forked from the main program, the tasks share memory with one another +The `parsl.executors.ThreadPoolExecutor` allows tasks to be executed on a pool of locally +accessible threads. As execution occurs on the same computer, on a pool of +threads forked from the main program, the tasks share memory with one another (this is discussed further in the following sections). @@ -115,14 +115,14 @@ Parsl and Communication ----------------------- Parsl tasks typically need to communicate in order to perform useful work. Parsl provides for two forms of communication: by parameter passing -and by file passing. 
+and by file passing.
 
 As described in the next section, Parsl programs may also communicate by
-interacting with shared filesystems and services in its environment.
+interacting with shared filesystems and services in its environment.
 
 Parameter Passing
 ^^^^^^^^^^^^^^^^^
 
-The figure above illustrates communication via parameter passing.
+The figure above illustrates communication via parameter passing.
 The call ``double(3)`` to the app ``double`` in the main program creates a new
 task and passes the parameter value, 3, to that new task. When the task completes
 execution, its return value, 6, is returned to the main program. Similarly, the
@@ -133,27 +133,27 @@ passed to/from tasks.
 
 File Passing
 ^^^^^^^^^^^^
 
-Parsl supports communication via files in both Bash apps and Python apps.
-Files may be used in place of parameter passing for many reasons, such as when
-apps are designed to use files, when data to be exchanged are large,
-or when data cannot be easily serialized into Python objects.
-As Parsl tasks may be executed on remote nodes, without shared file systems,
-Parsl offers a Parsl :py:class:`parsl.data_provider.files.File` construct for location-independent reference
+Parsl supports communication via files in both Bash apps and Python apps.
+Files may be used in place of parameter passing for many reasons, such as when
+apps are designed to use files, when data to be exchanged are large,
+or when data cannot be easily serialized into Python objects.
+As Parsl tasks may be executed on remote nodes, without shared file systems,
+Parsl offers a Parsl :py:class:`parsl.data_provider.files.File` construct for location-independent reference
 to files. Parsl will translate file objects to worker-accessible paths when executing dependent apps.
 
 Parsl is also able to transfer files in, out, and between Parsl
-apps using one of several methods (e.g., FTP, HTTP(S), Globus and rsync).
-To accommodate the asynchronous nature of file transfer, Parsl treats
 data movement like a Parsl app, adding a dependency to the execution graph
-and waiting for transfers to complete before executing dependent apps.
+apps using one of several methods (e.g., FTP, HTTP(S), Globus and rsync).
+To accommodate the asynchronous nature of file transfer, Parsl treats
+data movement like a Parsl app, adding a dependency to the execution graph
+and waiting for transfers to complete before executing dependent apps.
 More information is provided in :ref:`label-data`.
 
 Futures
 ^^^^^^^
 
-Communication via parameter and file passing also serves a second purpose, namely
+Communication via parameter and file passing also serves a second purpose, namely
 synchronization. As we discuss in more detail in :ref:`label-futures`, a call to an
-app returns a special object called a future that has a special unassigned
-state until such time as the app returns, at which time it takes the return
+app returns a special object called a future that has a special unassigned
+state until such time as the app returns, at which time it takes the return
 value. (In the example program, two futures are thus created, d1 and d2.) The
 AppFuture function result() blocks until the future to which it is applied takes
 a value. Thus the print statement in the main program blocks until both child
@@ -168,16 +168,16 @@ and when those values are available, is active again.
 
 The Parsl Environment
 ---------------------
 Regular Python and Parsl-enhanced Python differ in terms of the environment in
-which code executes. We use the term *environment* here to refer to the
-variables and modules (the *memory environment*), the file system(s)
-(the *file system environment*), and the services (the *service environment*)
+which code executes. We use the term *environment* here to refer to the
+variables and modules (the *memory environment*), the file system(s)
+(the *file system environment*), and the services (the *service environment*)
 that are accessible to a function.
-An important question when it comes to understanding the behavior of Parsl
-programs is the environment in which this new task executes: does it have the
-same or different memory, file system, or service environment as its parent
-task or any other task? The answer depends on the executor used, and (in the
-case of the file system environment) where the task executes.
+An important question when it comes to understanding the behavior of Parsl
+programs is the environment in which this new task executes: does it have the
+same or different memory, file system, or service environment as its parent
+task or any other task? The answer depends on the executor used, and (in the
+case of the file system environment) where the task executes.
 
 Below we describe behavior for the most commonly used `parsl.executors.HighThroughputExecutor`
 which is representative of all Parsl executors except the `parsl.executors.ThreadPoolExecutor`.
@@ -186,13 +186,13 @@ which is representative of all Parsl executors except the `parsl.executors.Threa
 it allows tasks to share memory.
 
 Memory environment
-^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^
 
-In Python, the variables and modules that are accessible to a function are defined
-by Python scoping rules, by which a function has access to both variables defined
-within the function (*local* variables) and those defined outside the function
-(*global* variables). Thus in the following code, the print statement in the
-print_answer function accesses the global variable "answer", and we see as output
+In Python, the variables and modules that are accessible to a function are defined
+by Python scoping rules, by which a function has access to both variables defined
+within the function (*local* variables) and those defined outside the function
+(*global* variables). Thus in the following code, the print statement in the
+print_answer function accesses the global variable "answer", and we see as output
 "the answer is 42."
 
..
code-block:: python
@@ -206,17 +206,17 @@ print_answer function accesses the global variable
 
 In Parsl (except when using the `parsl.executors.ThreadPoolExecutor`) a Parsl app is executed
-in a distinct environment that only has access to local variables associated
-with the app function. Thus, if the program above is executed with, say, the
+in a distinct environment that only has access to local variables associated
+with the app function. Thus, if the program above is executed with, say, the
 `parsl.executors.HighThroughputExecutor`, it will print "the answer is 0" rather than "the answer
-is 42," because the print statement in print_answer does not have access to
+is 42," because the print statement in print_answer does not have access to
 the global variable that has been assigned the value 42. The program will run
 without errors when using the `parsl.executors.ThreadPoolExecutor`.
 
-Similarly, the same scoping rules apply to import statements, and thus
-the following program will run without errors with the `parsl.executors.ThreadPoolExecutor`,
-but raise errors when run with any other executor, because the return statement
-in ``ambiguous_double`` refers to a variable (factor) and a module (random) that are
+Similarly, the same scoping rules apply to import statements, and thus
+the following program will run without errors with the `parsl.executors.ThreadPoolExecutor`,
+but raise errors when run with any other executor, because the return statement
+in ``ambiguous_double`` refers to a variable (factor) and a module (random) that are
 not known to the function.
 
 .. code-block:: python
@@ -229,9 +229,9 @@ not known to the function.
return x * random.random() * factor print(ambiguous_double(42)) - -To allow this program to run correctly with all Parsl executors, the random + +To allow this program to run correctly with all Parsl executors, the random library must be imported within the app, and the factor variable must be passed as an argument, as follows. @@ -248,12 +248,12 @@ passed as an argument, as follows. print(good_double(factor, 42)) -File system environment +File system environment ^^^^^^^^^^^^^^^^^^^^^^^ -In a regular Python program the environment that is accessible to a Python -program also includes the file system(s) of the computer on which it is -executing. +In a regular Python program the environment that is accessible to a Python +program also includes the file system(s) of the computer on which it is +executing. Thus in the following code, a value written to a file "answer.txt" in the current directory can be retrieved by reading the same file, and the print statement outputs "the answer is 42." @@ -272,15 +272,15 @@ statement outputs "the answer is 42." The question of which file system environment is accessible to a Parsl app -depends on where the app executes. If two tasks run on nodes that share a -file system, then those tasks (e.g., tasks A and B in the figure below, -but not task C) share a file system environment. Thus the program above will -output "the answer is 42" if the parent task and the child task run on +depends on where the app executes. If two tasks run on nodes that share a +file system, then those tasks (e.g., tasks A and B in the figure below, +but not task C) share a file system environment. Thus the program above will +output "the answer is 42" if the parent task and the child task run on nodes 1 and 2, but not if they run on nodes 2 and 3. .. image:: ../images/overview/filesystem.png :scale: 70 - :align: center + :align: center Service Environment ^^^^^^^^^^^^^^^^^^^ @@ -292,7 +292,7 @@ service. These services are accessible to any task. 
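The file system behavior described above can be illustrated with a hedged, plain-Python sketch (no Parsl involved): the reading "task" only sees the value if it runs somewhere that the writing "task's" file system is visible, which is exactly the node 1/node 2 versus node 3 distinction in the figure.

.. code-block:: python

    # Hedged sketch: two functions standing in for two tasks that share a
    # file system.  If they ran on nodes without a shared file system, the
    # read would instead fail with FileNotFoundError.
    import os
    import tempfile

    def write_answer(directory):           # "task A" writes the file
        with open(os.path.join(directory, 'answer.txt'), 'w') as f:
            f.write('42')

    def read_answer(directory):            # "task B" reads the same file
        with open(os.path.join(directory, 'answer.txt')) as f:
            return f.read()

    with tempfile.TemporaryDirectory() as shared:
        write_answer(shared)
        print('the answer is', read_answer(shared))  # the answer is 42
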
Environment Summary
^^^^^^^^^^^^^^^^^^^
 
-As we summarize in the table, if tasks execute with the `parsl.executors.ThreadPoolExecutor`,
+As we summarize in the table, if tasks execute with the `parsl.executors.ThreadPoolExecutor`,
 they share the memory and file system environment of the parent task. If they
 execute with any other executor, they have a separate memory environment, and
 may or may not share their file system environment with other tasks, depending
@@ -302,7 +302,7 @@ services.
 +--------------------+--------------------+--------------------+---------------------------+------------------+
 |                    | Share memory       | Share file system  | Share file system         | Share service    |
 |                    | environment with   | environment with   | environment with other    | environment      |
-|                    | parent/other tasks | parent             | tasks                     | with other tasks |
+|                    | parent/other tasks | parent             | tasks                     | with other tasks |
 +====================+====================+====================+===========================+==================+
 +--------------------+--------------------+--------------------+---------------------------+------------------+
 | Python             | Yes                | Yes                | N/A                       | N/A              |
diff --git a/docs/userguide/parsl_perf.rst b/docs/userguide/parsl_perf.rst
index 2ea1adb00f..88c4f1a20c 100644
--- a/docs/userguide/parsl_perf.rst
+++ b/docs/userguide/parsl_perf.rst
@@ -1,53 +1,9 @@
-.. _label-parsl-perf:
+:orphan:
-Measuring performance with parsl-perf
=====================================
+.. meta::
+    :content http-equiv="refresh": 0;url=advanced/parsl_perf.html
-``parsl-perf`` is a tool for making basic performance measurements of Parsl
-configurations.
+Redirect
+--------
-It runs increasingly large numbers of no-op apps until a batch takes
-(by default) 120 seconds, giving a measurement of tasks per second.
-
-This can give a basic measurement of some of the overheads in task
-execution.
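``parsl-perf`` is driven by an ordinary Parsl ``Config``, supplied in a configuration file. A minimal, hedged sketch of such a file (the local ``ThreadPoolExecutor`` and the ``max_threads`` value are assumptions for illustration; any executor/provider combination can be benchmarked):

.. code-block:: python

    # Hypothetical parsl-perf configuration file.  ``fresh_config`` follows
    # the pytest-style convention described on this page; the executor
    # choice here is only a local example.
    from parsl.config import Config
    from parsl.executors.threads import ThreadPoolExecutor

    def fresh_config():
        return Config(executors=[ThreadPoolExecutor(max_threads=4)])
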
- -``parsl-perf`` must be invoked with a configuration file, which is a Python -file containing a variable ``config`` which contains a `Config` object, or -a function ``fresh_config`` which returns a `Config` object. The -``fresh_config`` format is the same as used with the pytest test suite. - -To specify a ``parsl_resource_specification`` for tasks, add a ``--resources`` -argument. - -To change the target runtime from the default of 120 seconds, add a -``--time`` parameter. - -For example: - -.. code-block:: bash - - - $ python -m parsl.benchmark.perf --config parsl/tests/configs/workqueue_ex.py --resources '{"cores":1, "memory":0, "disk":0}' - ==== Iteration 1 ==== - Will run 10 tasks to target 120 seconds runtime - Submitting tasks / invoking apps - warning: using plain-text when communicating with workers. - warning: use encryption with a key and cert when creating the manager. - All 10 tasks submitted ... waiting for completion - Submission took 0.008 seconds = 1248.676 tasks/second - Runtime: actual 3.668s vs target 120s - Tasks per second: 2.726 - - [...] - - ==== Iteration 4 ==== - Will run 57640 tasks to target 120 seconds runtime - Submitting tasks / invoking apps - All 57640 tasks submitted ... waiting for completion - Submission took 34.839 seconds = 1654.487 tasks/second - Runtime: actual 364.387s vs target 120s - Tasks per second: 158.184 - Cleaning up DFK - The end - +This page has been `moved `_ diff --git a/docs/userguide/plugins.rst b/docs/userguide/plugins.rst index cd9244960c..2f4cdbcb60 100644 --- a/docs/userguide/plugins.rst +++ b/docs/userguide/plugins.rst @@ -1,106 +1,9 @@ -Plugins -======= +:orphan: -Parsl has several places where code can be plugged in. Parsl usually provides -several implementations that use each plugin point. +.. meta:: + :content http-equiv="refresh": 0;url=advanced/plugins.html -This page gives a brief summary of those places and why you might want -to use them, with links to the API guide. 
+Redirect
+--------
-
-Executors
----------
-When the parsl dataflow kernel is ready for a task to run, it passes that
-task to a `ParslExecutor`. The executor is then responsible for running the task's
-Python code and returning the result. This is the abstraction that allows one
-executor to run code on the local submitting host, while another executor can
-run the same code on a large supercomputer.
-
-
-Providers and Launchers
------------------------
-Some executors are based on blocks of workers (for example the
-`parsl.executors.HighThroughputExecutor`): the submit side requires a
-batch system (e.g. Slurm, Kubernetes) to start worker processes, which then
-execute tasks.
-
-The particular way in which a system makes those workers start is implemented
-by providers and launchers.
-
-An `ExecutionProvider` allows a command line to be submitted as a request to the
-underlying batch system to be run inside an allocation of nodes.
-
-A `Launcher` modifies that command line when run inside the allocation to
-add on any wrappers that are needed to launch the command (e.g. ``srun`` inside
-Slurm). Providers and launchers are usually paired together for a particular
-system type.
-
-File staging
------------
-Parsl can copy input files from an arbitrary URL into a task's working
-environment, and copy output files from a task's working environment to
-an arbitrary URL. A small set of data staging providers is installed by default,
-for ``file://``, ``http://`` and ``ftp://`` URLs. More data staging providers can
-be added in the workflow configuration, in the ``storage`` parameter of the
-relevant `ParslExecutor`. Each provider should subclass the `Staging` class.
-
-
-Default stdout/stderr name generation
--------------------------------------
-Parsl can choose names for your bash apps' stdout and stderr streams
-automatically, with the parsl.AUTO_LOGNAME parameter.
The choice of path is
-made by a function which can be configured with the ``std_autopath``
-parameter of Parsl `Config`. By default, ``DataFlowKernel.default_std_autopath``
-will be used.
-
-
-Memoization/checkpointing
--------------------------
-
-When parsl memoizes/checkpoints an app parameter, it does so by computing a
-hash of that parameter that should be the same if that parameter is the same
-on subsequent invocations. This isn't straightforward to do for arbitrary
-objects, so parsl implements a checkpointing hash function for a few common
-types, and raises an exception on unknown types:
-
-.. code-block::
-
-  ValueError("unknown type for memoization ...")
-
-You can plug in your own type-specific hash code for additional types that
-you need and understand using `id_for_memo`.
-
-
-Invoking other asynchronous components
---------------------------------------
-
-Parsl code can invoke other asynchronous components which return Futures, and
-integrate those Futures into the task graph: Parsl apps can be given any
-`concurrent.futures.Future` as a dependency, even if those futures do not come
-from invoking a Parsl app. This includes futures returned by a
-``join_app``.
-
-A specific example of this is integrating Globus Compute tasks into a Parsl
-task graph. See :ref:`label-join-globus-compute`
-
-Dependency resolution
----------------------
-
-When Parsl examines the arguments to an app, it uses a `DependencyResolver`.
-The default `DependencyResolver` will cause Parsl to wait for
-``concurrent.futures.Future`` instances (including `AppFuture` and
-`DataFuture`), and pass through other arguments without waiting.
-
-This behaviour is pluggable: Parsl comes with another dependency resolver,
-`DEEP_DEPENDENCY_RESOLVER` which knows about futures contained within structures
-such as tuples, lists, sets and dicts.
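A hedged sketch of opting in to the deep resolver (the import location is an assumption based on the current Parsl source layout):

.. code-block:: python

    # Hedged sketch: configure Parsl to wait on futures nested inside
    # tuples/lists/sets/dicts passed as app arguments.  Import path is
    # an assumption and may differ between Parsl versions.
    from parsl.config import Config
    from parsl.dataflow.dependency_resolvers import DEEP_DEPENDENCY_RESOLVER

    config = Config(dependency_resolver=DEEP_DEPENDENCY_RESOLVER)
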
-
-This plugin interface might be used to interface other task-like or future-like
-objects to the Parsl dependency mechanism, by describing how they can be
-interpreted as a Future.
-
-Removed interfaces
-------------------
-
-Parsl had a deprecated ``Channel`` abstraction. See
-`issue 3515 `_
-for further discussion on its removal.
+This page has been `moved `_
diff --git a/docs/userguide/usage_tracking.rst b/docs/userguide/usage_tracking.rst
index da8ac9b79d..1b891ce0e9 100644
--- a/docs/userguide/usage_tracking.rst
+++ b/docs/userguide/usage_tracking.rst
@@ -1,171 +1,9 @@
-.. _label-usage-tracking:
+:orphan:
-Usage Statistics Collection
===========================
+.. meta::
+    :content http-equiv="refresh": 0;url=advanced/usage_tracking.html
-Parsl uses an **Opt-in** model for usage tracking, allowing users to decide if they wish to participate. Usage statistics are crucial for improving software reliability and help focus development and maintenance efforts on the most used components of Parsl. The collected data is used solely for enhancements and reporting and is not shared in its raw form outside of the Parsl team.
-
-Why are we doing this?
-----------------------
-
-The Parsl development team relies on funding from government agencies. To sustain this funding and advocate for continued support, it is essential to show that the research community benefits from these investments.
-
-By opting in to share usage data, you actively support the ongoing development and maintenance of Parsl. (See :ref:`What is sent? ` below).
-
-Opt-In Model
-------------
-
-We use an **opt-in model** for usage tracking to respect user privacy and provide full control over shared information. We hope that developers and researchers will choose to send us this information. The reason is that we need this data - it is a requirement for funding.
-
-Choose the data you share with Usage Tracking Levels.
- -**Usage Tracking Levels:** - -* **Level 1:** Only basic information such as Python version, Parsl version, and platform name (Linux, MacOS, etc.) -* **Level 2:** Level 1 information and configuration information including provider, executor, and launcher names. -* **Level 3:** Level 2 information and workflow execution details, including the number of applications run, failures, and execution time. - -By enabling usage tracking, you support Parsl's development. - -**To opt-in, set** ``usage_tracking`` **to the desired level (1, 2, or 3) in the configuration object** (``parsl.config.Config``) **.** - -Example: - -.. code-block:: python3 - - config = Config( - executors=[ - HighThroughputExecutor( - ... - ) - ], - usage_tracking=3 - ) - -.. _what-is-sent: - -What is sent? -------------- - -The data collected depends on the tracking level selected: - -* **Level 1:** Only basic information such as Python version, Parsl version, and platform name (Linux, MacOS, etc.) -* **Level 2:** Level 1 information and configuration information including provider, executor, and launcher names. -* **Level 3:** Level 2 information and workflow execution details, including the number of applications run, failures, and execution time. - -**Example Messages:** - -- At launch: - - .. code-block:: json - - { - "correlator":"6bc7484e-5693-48b2-b6c0-5889a73f7f4e", - "parsl_v":"1.3.0-dev", - "python_v":"3.12.2", - "platform.system":"Darwin", - "tracking_level":3, - "components":[ - { - "c":"parsl.config.Config", - "executors_len":1, - "dependency_resolver":false - }, - "parsl.executors.threads.ThreadPoolExecutor" - ], - "start":1727156153 - } - -- On closure (Tracking Level 3 only): - - .. 
code-block:: json - - { - "correlator":"6bc7484e-5693-48b2-b6c0-5889a73f7f4e", - "execution_time":31, - "components":[ - { - "c":"parsl.dataflow.dflow.DataFlowKernel", - "app_count":3, - "app_fails":0 - }, - { - "c":"parsl.config.Config", - "executors_len":1, - "dependency_resolver":false - }, - "parsl.executors.threads.ThreadPoolExecutor" - ], - "end":1727156156 - } - -**All messages sent are logged in the** ``parsl.log`` **file, ensuring complete transparency.** - -How is the data sent? ---------------------- - -Data is sent using **UDP** to minimize the impact on workflow performance. While this may result in some data loss, it significantly reduces the chances of usage tracking affecting the software's operation. - -The data is processed through AWS CloudWatch to generate a monitoring dashboard, providing valuable insights into usage patterns. - -When is the data sent? ----------------------- - -Data is sent twice per run: - -1. At the start of the script. -2. Upon script completion (for Tracking Level 3). - -What will the data be used for? -------------------------------- - -The data will help the Parsl team understand Parsl usage and make development and maintenance decisions, including: - -* Focus development and maintenance on the most-used components of Parsl. -* Determine which Python versions to continue supporting. -* Track the age of Parsl installations. -* Assess how long it takes for most users to adopt new changes. -* Track usage statistics to report to funders. - -Usage Statistics Dashboard --------------------------- - -The collected data is aggregated and displayed on a publicly accessible dashboard. 
This dashboard provides an overview of how Parsl is being used across different environments and includes metrics such as: - -* Total workflows executed over time -* Most-used Python and Parsl versions -* Most common platforms and executors and more - -`Find the dashboard here `_ - -Leaderboard ------------ - -**Opting in to usage tracking also allows you to participate in the Parsl Leaderboard. -To participate in the leaderboard, you can deanonymize yourself using the** ``project_name`` **parameter in the parsl configuration object** (``parsl.config.Config``) **.** - -`Find the Parsl Leaderboard here `_ - -Example: - -.. code-block:: python3 - - config = Config( - executors=[ - HighThroughputExecutor( - ... - ) - ], - usage_tracking=3, - project_name="my-test-project" - ) - -Every run of parsl with usage tracking **Level 1** or **Level 2** earns you **1 point**. And every run with usage tracking **Level 3**, earns you **2 points**. - -Feedback +Redirect -------- -Please send us your feedback at parsl@googlegroups.com. Feedback from our user communities will be -useful in determining our path forward with usage tracking in the future. - -**Please consider turning on usage tracking to support the continued development of Parsl.** +This page has been `moved `_ diff --git a/docs/userguide/workflow.rst b/docs/userguide/workflow.rst index 2a0a2c8c28..139dd79175 100644 --- a/docs/userguide/workflow.rst +++ b/docs/userguide/workflow.rst @@ -1,243 +1,9 @@ -.. _label-workflow: +:orphan: -Example parallel patterns -========================= +.. meta:: + :content http-equiv="refresh": 0;url=workflows/workflow.html -Parsl can be used to implement a wide range of parallel programming patterns, from bag of tasks -through to nested workflows. Parsl implicitly assembles a dataflow -dependency graph based on the data shared between apps. -The flexibility of this model allows for the implementation of a wide range -of parallel programming and workflow patterns. 
+Redirect +-------- -Parsl is also designed to address broad execution requirements, from programs -that run many short tasks to those that run a few long tasks. - -Below we illustrate a range of parallel programming and workflow patterns. It is important -to note that this set of examples is by no means comprehensive. - - -Bag of Tasks ------------- -Parsl can be used to execute a large bag of tasks. In this case, Parsl -assembles the set of tasks (represented as Parsl apps) and manages their concurrent -execution on available resources. - -.. code-block:: python - - from parsl import python_app - - parsl.load() - - # Map function that returns double the input integer - @python_app - def app_random(): - import random - return random.random() - - results = [] - for i in range(0, 10): - x = app_random() - results.append(x) - - for r in results: - print(r.result()) - - -Sequential workflows --------------------- - -Sequential workflows can be created by passing an AppFuture from one task to another. For example, in the following program the ``generate`` app (a Python app) generates a random number that is consumed by the ``save`` app (a Bash app), which writes it to a file. Because ``save`` cannot execute until it receives the ``message`` produced by ``generate``, the two apps execute in sequence. - -.. code-block:: python - - from parsl import python_app - - parsl.load() - - # Generate a random number - @python_app - def generate(limit): - from random import randint - """Generate a random integer and return it""" - return randint(1, limit) - - # Write a message to a file - @bash_app - def save(message, outputs=()): - return 'echo {} &> {}'.format(message, outputs[0]) - - message = generate(10) - - saved = save(message, outputs=['output.txt']) - - with open(saved.outputs[0].result(), 'r') as f: - print(f.read()) - - -Parallel workflows ------------------- - -Parallel execution occurs automatically in Parsl, respecting dependencies among app executions. 
In the following example, three instances of the ``wait_sleep_double`` app are created. The first two execute concurrently, as they have no dependencies; the third must wait until the first two complete and thus the ``doubled_x`` and ``doubled_y`` futures have values. Note that this sequencing occurs even though ``wait_sleep_double`` does not in fact use its second and third arguments. - -.. code-block:: python - - from parsl import python_app - - parsl.load() - - @python_app - def wait_sleep_double(x, foo_1, foo_2): - import time - time.sleep(2) # Sleep for 2 seconds - return x*2 - - # Launch two apps, which will execute in parallel, since they do not have to - # wait on any futures - doubled_x = wait_sleep_double(10, None, None) - doubled_y = wait_sleep_double(10, None, None) - - # The third app depends on the first two: - # doubled_x doubled_y (2 s) - # \ / - # doublex_z (2 s) - doubled_z = wait_sleep_double(10, doubled_x, doubled_y) - - # doubled_z will be done in ~4s - print(doubled_z.result()) - - -Parallel workflows with loops ------------------------------ - -A common approach to executing Parsl apps in parallel is via loops. The following example uses a loop to create many random numbers in parallel. - -.. code-block:: python - - from parsl import python_app - - parsl.load() - - @python_app - def generate(limit): - """Generate a random integer and return it""" - from random import randint - return randint(1, limit) - - rand_nums = [] - for i in range(1,5): - rand_nums.append(generate(i)) - - # Wait for all apps to finish and collect the results - outputs = [r.result() for r in rand_nums] - -The :class:`~parsl.concurrent.ParslPoolExecutor` simplifies this pattern using the same interface as -`Python's native Executors `_. - -.. 
code-block:: python - - from parsl.concurrent import ParslPoolExecutor - from parsl.configs.htex_local import config - - # NOTE: Functions used by the ParslPoolExecutor do _not_ use decorators - def generate(limit): - """Generate a random integer and return it""" - from random import randint - return randint(1, limit) - - - with ParslPoolExecutor(config) as pool: - outputs = pool.map(generate, range(1, 5)) - - -In the preceding example, the execution of different tasks is coordinated by passing Python objects from producers to consumers. -In other cases, it can be convenient to pass data in files, as in the following reformulation. Here, a set of files, each with a random number, is created by the ``generate`` app. These files are then concatenated into a single file, which is subsequently used to compute the sum of all numbers. - -.. code-block:: python - - from parsl import python_app, bash_app - - parsl.load() - - @bash_app - def generate(outputs=()): - return 'echo $(( RANDOM % (10 - 5 + 1 ) + 5 )) &> {}'.format(outputs[0]) - - @bash_app - def concat(inputs=(), outputs=(), stdout='stdout.txt', stderr='stderr.txt'): - return 'cat {0} >> {1}'.format(' '.join(inputs), outputs[0]) - - @python_app - def total(inputs=()): - total = 0 - with open(inputs[0].filepath, 'r') as f: - for l in f: - total += int(l) - return total - - # Create 5 files with random numbers - output_files = [] - for i in range (5): - output_files.append(generate(outputs=['random-%s.txt' % i])) - - # Concatenate the files into a single file - cc = concat(inputs=[i.outputs[0] for i in output_files], outputs=['all.txt']) - - # Calculate the average of the random numbers - totals = total(inputs=[cc.outputs[0]]) - - print(totals.result()) - - -MapReduce ---------- -MapReduce is a common pattern used in data analytics. It is composed of a map phase -that filters values and a reduce phase that aggregates values. 
-The following example demonstrates how Parsl can be used to specify a MapReduce computation -in which the map phase doubles a set of input integers and the reduce phase computes -the sum of those results. - -.. code-block:: python - - from parsl import python_app - - parsl.load() - - # Map function that returns double the input integer - @python_app - def app_double(x): - return x*2 - - # Reduce function that returns the sum of a list - @python_app - def app_sum(inputs=()): - return sum(inputs) - - # Create a list of integers - items = range(0,4) - - # Map phase: apply the double *app* function to each item in list - mapped_results = [] - for i in items: - x = app_double(i) - mapped_results.append(x) - - # Reduce phase: apply the sum *app* function to the set of results - total = app_sum(inputs=mapped_results) - - print(total.result()) - -The program first defines two Parsl apps, ``app_double`` and ``app_sum``. -It then makes calls to the ``app_double`` app with a set of input -values. It then passes the results from ``app_double`` to the ``app_sum`` app -to aggregate values into a single result. -These tasks execute concurrently, synchronized by the ``mapped_results`` variable. -The following figure shows the resulting task graph. - -.. image:: ../images/MapReduce.png - -Caching expensive initialisation between tasks ----------------------------------------------- - -Many tasks in workflows require a expensive "initialization" steps that, once performed, can be used across successive invocations for that task. For example, you may want to reuse a machine learning model for multiple interface tasks and avoid loading it onto GPUs more than once. - -`This ExaWorks tutorial `_ gives examples of how to do this. +This page has been `moved `_ diff --git a/docs/userguide/workflows/checkpoints.rst b/docs/userguide/workflows/checkpoints.rst new file mode 100644 index 0000000000..8867107b7a --- /dev/null +++ b/docs/userguide/workflows/checkpoints.rst @@ -0,0 +1,299 @@ +.. 
_label-memos: + +Memoization and checkpointing +----------------------------- + +When an app is invoked several times with the same parameters, Parsl can +reuse the result from the first invocation without executing the app again. + +This can save time and computational resources. + +This is done in two ways: + +* Firstly, *app caching* will allow reuse of results within the same run. + +* Building on top of that, *checkpointing* will store results on the filesystem + and reuse those results in later runs. + +.. _label-appcaching: + +App caching +=========== + + +There are many situations in which a program may be re-executed +over time. Often, large fragments of the program will not have changed +and therefore, re-execution of apps will waste valuable time and +computational resources. Parsl's app caching solves this problem by +storing results from apps that have successfully completed +so that they can be re-used. + +App caching is enabled by setting the ``cache`` +argument in the :func:`~parsl.app.app.python_app` or :func:`~parsl.app.app.bash_app` +decorator to ``True`` (by default it is ``False``). + +.. code-block:: python + + @bash_app(cache=True) + def hello(msg, stdout=None): + return 'echo {}'.format(msg) + +App caching can be globally disabled by setting ``app_cache=False`` +in the :class:`~parsl.config.Config`. + +App caching can be particularly useful when developing interactive programs such as when +using a Jupyter notebook. In this case, cells containing apps are often re-executed +during development. Using app caching will ensure that only modified apps are re-executed. + + +App equivalence +^^^^^^^^^^^^^^^ + +Parsl determines app equivalence using the name of the app function: +if two apps have the same name, then they are equivalent under this +relation. + +Changes inside the app, or in functions called by the app, will not invalidate +cached values.
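This name-based equivalence has a notable consequence: editing an app's body does not invalidate previously cached results. The following is a minimal, plain-Python sketch of that behavior (a conceptual illustration only, not Parsl's actual implementation):

```python
# Conceptual sketch only -- NOT Parsl's real implementation. It shows why
# keying a cache on the function *name* means a changed body still reuses
# stale results.
cache = {}

def memoize_by_name(fn):
    def wrapper(*args):
        key = (fn.__name__, args)  # the function body is not part of the key
        if key not in cache:
            cache[key] = fn(*args)
        return cache[key]
    return wrapper

@memoize_by_name
def double(x):
    return x * 2

first = double(3)   # computed: 6

@memoize_by_name
def double(x):      # same name, different body
    return x * 3

second = double(3)  # cache hit: still 6, the new body never runs
```

The same situation arises in Parsl when an app is edited between checkpointed runs: renaming the app or removing the old checkpoint files forces re-execution.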
+ +There are lots of other ways functions might be compared for equivalence, +and `parsl.dataflow.memoization.id_for_memo` provides a hook to plug in +alternate application-specific implementations. + + +Invocation equivalence +^^^^^^^^^^^^^^^^^^^^^^ + +Two app invocations are determined to be equivalent if their +input arguments are identical. + +In simple cases, this follows obvious rules: + +.. code-block:: python + + # these two app invocations are the same and the second invocation will + # reuse any cached result from the first invocation + x = 7 + f(x).result() + + y = 7 + f(y).result() + + +Internally, equivalence is determined by hashing the input arguments, and +comparing the hash to hashes from previous app executions. + +This approach can only be applied to data types for which a deterministic hash +can be computed. + +By default Parsl can compute sensible hashes for basic data types: +str, int, float, None, as well as some more complex types: +functions, and dictionaries and lists containing hashable types. + +Attempting to cache apps invoked with other, non-hashable, data types will +lead to an exception at invocation. + +In that case, mechanisms to hash new types can be registered by a program by +implementing the `parsl.dataflow.memoization.id_for_memo` function for +the new type. + +Ignoring arguments +^^^^^^^^^^^^^^^^^^ + +On occasion one may wish to ignore particular arguments when determining +app invocation equivalence - for example, when generating log file +names automatically based on time or run information. +Parsl allows developers to list the arguments to be ignored +in the ``ignore_for_cache`` app decorator parameter: + +.. 
code-block:: python + + @bash_app(cache=True, ignore_for_cache=['stdout']) + def hello(msg, stdout=None): + return 'echo {}'.format(msg) + + +Caveats +^^^^^^^ + +It is important to consider several issues when using app caching: + +- Determinism: App caching is generally useful only when the apps are deterministic. + If the outputs may be different for identical inputs, app caching will obscure + this non-deterministic behavior. For instance, caching an app that returns + a random number will result in every invocation returning the same result. + +- Timing: If several identical calls to an app are made concurrently before any + result has been cached, multiple instances of the app will be launched. + Once one invocation completes and the result is cached, + all subsequent calls will return immediately with the cached result. + +- Performance: If app caching is enabled, there may be some performance + overhead, especially if a large number of short-duration tasks are launched rapidly. + This overhead has not been quantified. + +.. _label-checkpointing: + +Checkpointing +============= + +Large-scale Parsl programs are likely to encounter errors due to node failures, +application or environment errors, and myriad other issues. Parsl offers an +application-level checkpointing model to improve resilience, fault tolerance, and +efficiency. + +.. note:: + Checkpointing builds on top of app caching, and so app caching must be + enabled. If app caching is disabled via ``Config.app_cache``, checkpointing will + not work. + +Parsl follows an incremental checkpointing model, where each checkpoint file contains +all results that have been updated since the last checkpoint. + +When a Parsl program loads a checkpoint file and is executed, it will use +checkpointed results for any apps that have been previously executed. +Like app caching, checkpoints +use the hash of the app and the invocation input parameters to identify previously computed +results.
If multiple checkpoints exist for an app (with the same hash), +the most recent entry will be used. + +Parsl provides four checkpointing modes: + +1. ``task_exit``: a checkpoint is created each time an app completes or fails + (after retries if enabled). This mode minimizes the risk of losing information + from completed tasks. + + .. code-block:: python + + from parsl.configs.local_threads import config + config.checkpoint_mode = 'task_exit' + +2. ``periodic``: a checkpoint is created periodically using a user-specified + checkpointing interval. Results will be saved to the checkpoint file for + all tasks that have completed during this period. + + .. code-block:: python + + from parsl.configs.local_threads import config + config.checkpoint_mode = 'periodic' + config.checkpoint_period = "01:00:00" + +3. ``dfk_exit``: checkpoints are created when Parsl is + about to exit. This reduces the risk of losing results due to + premature program termination from exceptions, terminate signals, etc. However, + it is still possible that information might be lost if the program is + terminated abruptly (machine failure, SIGKILL, etc.). + + .. code-block:: python + + from parsl.configs.local_threads import config + config.checkpoint_mode = 'dfk_exit' + +4. ``manual``: in addition to these automated checkpointing modes, it is also possible + to manually initiate a checkpoint by calling ``DataFlowKernel.checkpoint()`` in the + Parsl program code. + + .. code-block:: python + + import parsl + from parsl.configs.local_threads import config + dfk = parsl.load(config) + .... + dfk.checkpoint() + +In all cases the checkpoint file is written out to the ``runinfo/RUN_ID/checkpoint/`` directory. + +.. Note:: Checkpoint modes ``periodic``, ``dfk_exit``, and ``manual`` can interfere with garbage collection. + In these modes, task information will be retained after completion, until checkpointing events are triggered.
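The examples above mutate a template configuration after import. The same settings can also be passed when constructing a :class:`~parsl.config.Config` directly; the sketch below assumes the ``checkpoint_mode`` and ``checkpoint_period`` keyword arguments shown above are accepted by the constructor, and should be treated as illustrative rather than definitive:

```python
# Sketch: set checkpointing options at Config construction time instead of
# mutating an imported template config. Parameter names are taken from the
# examples above.
from parsl.config import Config
from parsl.executors.threads import ThreadPoolExecutor

config = Config(
    executors=[ThreadPoolExecutor()],
    checkpoint_mode='periodic',     # any of the four modes above
    checkpoint_period='01:00:00',   # HH:MM:SS; only used with 'periodic'
)
```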
+ + +Creating a checkpoint +^^^^^^^^^^^^^^^^^^^^^ + +Automated checkpointing must be explicitly enabled in the Parsl configuration. +There is no need to modify a Parsl program as checkpointing will occur transparently. +In the following example, checkpointing is enabled at task exit. The results of +each invocation of the ``slow_double`` app will be stored in the checkpoint file. + +.. code-block:: python + + import parsl + from parsl.app.app import python_app + from parsl.configs.local_threads import config + + config.checkpoint_mode = 'task_exit' + + parsl.load(config) + + @python_app(cache=True) + def slow_double(x): + import time + time.sleep(5) + return x * 2 + + d = [] + for i in range(5): + d.append(slow_double(i)) + + print([d[i].result() for i in range(5)]) + +Alternatively, manual checkpointing can be used to explicitly specify when the checkpoint +file should be saved. The following example shows how manual checkpointing can be used. +Here, the ``dfk.checkpoint()`` function will save the results of the prior invocations +of the ``slow_double`` app. + +.. code-block:: python + + import parsl + from parsl import python_app + from parsl.configs.local_threads import config + + dfk = parsl.load(config) + + @python_app(cache=True) + def slow_double(x, sleep_dur=1): + import time + time.sleep(sleep_dur) + return x * 2 + + N = 5 # Number of calls to slow_double + d = [] # List to store the futures + for i in range(0, N): + d.append(slow_double(i)) + + # Wait for the results + [i.result() for i in d] + + cpt_dir = dfk.checkpoint() + print(cpt_dir) # Prints the checkpoint dir + + +Resuming from a checkpoint +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +When resuming a program from a checkpoint, Parsl allows the user to select +which checkpoint file(s) to use. +Checkpoint files are stored in the ``runinfo/RUN_ID/checkpoint`` directory. + +The example below shows how to resume using all available checkpoints.
+ +Here, the program re-executes the same calls to the ``slow_double`` app +as above, and instead of waiting for results to be computed, the values +from the checkpoint file are immediately returned. + +.. code-block:: python + + import parsl + from parsl.configs.local_threads import config + from parsl.utils import get_all_checkpoints + + config.checkpoint_files = get_all_checkpoints() + + parsl.load(config) + + # Rerun the same workflow + d = [] + for i in range(5): + d.append(slow_double(i)) + + # wait for results + print([d[i].result() for i in range(5)]) diff --git a/docs/userguide/workflows/exceptions.rst b/docs/userguide/workflows/exceptions.rst new file mode 100644 index 0000000000..d18fbe704d --- /dev/null +++ b/docs/userguide/workflows/exceptions.rst @@ -0,0 +1,171 @@ +.. _label-exceptions: + +Error handling +============== + +Parsl provides various mechanisms to add resiliency and robustness to programs. + +Exceptions +---------- + +Parsl is designed to capture, track, and handle various errors occurring +during execution, including those related to the program, apps, execution +environment, and Parsl itself. +It also provides functionality to appropriately respond to failures during +execution. + +Failures might occur for various reasons: + +1. A task failed during execution. +2. A task failed to launch, for example, because an input dependency was not met. +3. There was an error while formatting the command-line string in Bash apps. +4. A task completed execution but failed to produce one or more of its specified + outputs. +5. A task exceeded the specified walltime. + +Since Parsl tasks are executed asynchronously and remotely, it can be difficult to determine +when errors have occurred and to appropriately handle them in a Parsl program. + +For errors occurring in Python code, Parsl captures Python exceptions and returns +them to the main Parsl program.
For non-Python errors, for example when a node +or worker fails, Parsl imposes a timeout, and considers a task to have failed +if it has not heard from the task by that timeout. Parsl also considers a task to have failed +if it does not meet the contract stated by the user during invocation, such as failing +to produce the stated output files. + +Parsl communicates these errors by associating Python exceptions with task futures. +These exceptions are raised only when ``result()`` is called on the future +of a failed task. For example: + +.. code-block:: python + + @python_app + def bad_divide(x): + return 6 / x + + # Call bad divide with 0, to cause a divide by zero exception + doubled_x = bad_divide(0) + + # Catch and handle the exception. + try: + doubled_x.result() + except ZeroDivisionError as e: + print('Oops! You tried to divide by 0.') + except Exception as e: + print('Oops! Something really bad happened.') + + +Retries +------- + +Often errors in distributed/parallel environments are transient. +In these cases, retrying failed tasks can be a simple way +of overcoming transient (e.g., machine failure, +network failure) and intermittent failures. +When ``retries`` are enabled (and set to an integer > 0), Parsl will automatically +re-launch tasks that have failed until the retry limit is reached. +By default, retries are disabled and exceptions will be communicated +to the Parsl program. + +The following example shows how the number of retries can be set to 2: + +.. code-block:: python + + import parsl + from parsl.configs.htex_local import config + + config.retries = 2 + + parsl.load(config) + +More specific retry handling can be specified using retry handlers, documented +below. + + +Lazy fail +--------- + +Parsl implements a lazy failure model through which a workload will continue +to execute in the case that some tasks fail. That is, the program will not +halt as soon as it encounters a failure; rather, it will continue to execute +unaffected apps.
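The same idea can be sketched with Python's standard :mod:`concurrent.futures` (a conceptual analogue only; in Parsl the dependent task fails with a ``DependencyError`` rather than the original exception propagating directly):

```python
# Conceptual analogue using Python's standard library (not Parsl itself):
# task C fails, a task depending on C fails in turn, but the independent
# B -> D chain still runs to completion.
from concurrent.futures import ThreadPoolExecutor

def task_b():
    return "B done"

def task_c():
    raise RuntimeError("C failed")

with ThreadPoolExecutor(max_workers=4) as pool:
    fut_b = pool.submit(task_b)
    fut_c = pool.submit(task_c)

    # D depends only on B, so C's failure does not affect it
    fut_d = pool.submit(lambda: fut_b.result() + ", D done")
    # E depends on C, so it fails with C's exception
    fut_e = pool.submit(lambda: fut_c.result() + ", E done")

    d_result = fut_d.result()       # "B done, D done"
    e_error = fut_e.exception()     # the RuntimeError raised by C
```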
+ +The following example shows how lazy failures affect execution. In this +case, task C fails and therefore tasks E and F that depend on results from +C cannot be executed; however, Parsl will continue to execute tasks B and D +as they are unaffected by task C's failure. + +.. code-block:: + + Here's a workflow graph, where + (X) is runnable, + [X] is completed, + (X*) is failed, + (!X) is dependency failed. + + (A) [A] [A] + / \ / \ / \ + (B) (C) [B] (C*) [B] (C*) + | | => | | => | | + (D) (E) (D) (E) [D] (!E) + \ / \ / \ / + (F) (F) (!F) + + time -----> + + +Retry handlers +-------------- + +The basic parsl retry mechanism keeps a count of the number of times a task +has been (re)tried, and will continue retrying that task until the configured +retry limit is reached. + +Retry handlers generalize this to allow more expressive retry handling: +parsl keeps a retry cost for a task, and the task will be retried until the +configured retry limit is reached. Instead of the cost being 1 for each +failure, user-supplied code can examine the failure and compute a custom +cost. + +This allows user knowledge about failures to influence the retry mechanism: +an exception which is almost certainly a non-recoverable failure (for example, +due to bad parameters) can be given a high retry cost (so that it will not +be retried many times, or at all), and exceptions which are likely to be +transient (for example, where a worker node has died) can be given a low +retry cost so they will be retried many times. + +A retry handler can be specified in the parsl configuration like this: + + +.. code-block:: python + + Config( + retries=2, + retry_handler=example_retry_handler + ) + + +``example_retry_handler`` should be a function defined by the user that will +compute the retry cost for a particular failure, given some information about +the failure.
+ +For example, the following handler will give a cost of 1 to all exceptions, +except when a bash app exits with Unix exit code 9, in which case the cost will +be 100. This will have the effect that retries will happen as normal for most +errors, but the bash app can indicate that there is little point in retrying +by exiting with exit code 9. + +.. code-block:: python + + def example_retry_handler(exception, task_record): + if isinstance(exception, BashExitFailure) and exception.exitcode == 9: + return 100 + else: + return 1 + +The retry handler is given two parameters: the exception from execution, and +the parsl internal task_record. The task record contains details such as the +app name, parameters and executor. + +If a retry handler raises an exception itself, then the task will be aborted +and no further tries will be attempted. diff --git a/docs/userguide/workflows/futures.rst b/docs/userguide/workflows/futures.rst new file mode 100644 index 0000000000..13d22a211b --- /dev/null +++ b/docs/userguide/workflows/futures.rst @@ -0,0 +1,165 @@ +.. _label-futures: + +Futures +======= + +When an ordinary Python function is invoked in a Python program, the Python interpreter waits for the function to complete execution +before proceeding to the next statement. +But if a function is expected to execute for a long period of time, it may be preferable not to wait for +its completion but instead to proceed immediately with executing subsequent statements. +The function can then execute concurrently with that other computation. + +Concurrency can be used to enhance performance when independent activities +can execute on different cores or nodes in parallel. The following +code fragment demonstrates this idea, showing that overall execution time +may be reduced if the two function calls are executed concurrently. + +.. 
code-block:: python + + v1 = expensive_function(1) + v2 = expensive_function(2) + result = v1 + v2 + +However, concurrency also introduces a need for **synchronization**. +In the example, it is not possible to compute the sum of ``v1`` and ``v2`` +until both function calls have completed. +Synchronization provides a way of blocking execution of one activity +(here, the statement ``result = v1 + v2``) until other activities +(here, the two calls to ``expensive_function()``) have completed. + +Parsl supports concurrency and synchronization as follows. +Whenever a Parsl program calls a Parsl app (a function annotated with a Parsl +app decorator, see :ref:`apps`), +Parsl will create a new ``task`` and immediately return a +`future `_ in lieu of that function's result(s). +The program then continues immediately to the next statement. +At some point, for example when the task's dependencies are met and there +is available computing capacity, Parsl will execute the task. Upon +completion, Parsl will set the value of the future to contain the task's +output. + +A future can be used to track the status of an asynchronous task. +For example, after creation, the future may be interrogated to determine +the task's status (e.g., running, failed, completed), to access results, +and to capture exceptions. Further, futures may be used for synchronization, +enabling the calling Python program to block until the future +has completed execution. + +Parsl provides two types of futures: `AppFuture` and `DataFuture`. +While related, they enable subtly different parallel patterns. + +AppFutures +---------- + +AppFutures are the basic building block upon which Parsl programs are built. Every invocation of a Parsl app returns an AppFuture that may be used to monitor and manage the task's execution. +AppFutures extend the ``Future`` class from Python's `concurrent library `_. +They provide three key capabilities: + +1. 
An AppFuture's ``result()`` function can be used to wait for an app to complete, and then access any result(s). +This function is blocking: it returns only when the app completes or fails. +The following code fragment implements an example similar to the ``expensive_function()`` example above. +Here, the ``sleep_double`` app simply doubles the input value. The program invokes +the ``sleep_double`` app twice, and Parsl returns futures in place of results. The example +shows how the future's ``result()`` function can be used to wait for the results from the +two ``sleep_double`` app invocations to be computed. + +.. code-block:: python + + @python_app + def sleep_double(x): + import time + time.sleep(2) # Sleep for 2 seconds + return x*2 + + # Start two concurrent sleep_double apps. doubled_x1 and doubled_x2 are AppFutures + doubled_x1 = sleep_double(10) + doubled_x2 = sleep_double(5) + + # The result() function will block until each of the corresponding app calls has completed + print(doubled_x1.result() + doubled_x2.result()) + +2. An AppFuture's ``done()`` function can be used to check the status of an app, *without blocking*. +The following example shows that calling the future's ``done()`` function will not stop execution of the main Python program. + +.. code-block:: python + + @python_app + def double(x): + return x*2 + + # doubled_x is an AppFuture + doubled_x = double(10) + + # Check the status of doubled_x; this will print True if the result is available, else False + print(doubled_x.done()) + +3. An AppFuture provides a safe way to handle exceptions and errors while asynchronously executing +apps. The following example shows how exceptions can be captured in the same way as in a standard Python program +when calling the future's ``result()`` function. + +.. code-block:: python + + @python_app + def bad_divide(x): + return 6/x + + # Call bad divide with 0, to cause a divide by zero exception + doubled_x = bad_divide(0) + + # Catch and handle the exception.
+ try: + doubled_x.result() + except ZeroDivisionError as ze: + print('Oops! You tried to divide by 0') + except Exception as e: + print('Oops! Something really bad happened') + + +In addition to being able to capture exceptions raised by a specific app, Parsl also raises ``DependencyErrors`` when apps are unable to execute due to failures in apps they depend on. +That is, an app that is dependent upon the successful completion of another app will fail with a dependency error if any of the apps on which it depends fail. + + +DataFutures +----------- + +While an AppFuture represents the execution of an asynchronous app, +a DataFuture represents a file to be produced by that app. +Parsl's dataflow model requires such a construct so that it can determine +when dependent apps, apps that are to consume a file produced by another app, +can start execution. + +When calling an app that produces files as outputs, Parsl requires that a list of output files be specified (as a list of `File` objects passed in via the ``outputs`` keyword argument). Parsl will return a DataFuture for each output file as part of the AppFuture when the app is executed. +These DataFutures are accessible in the AppFuture's ``outputs`` attribute. + +Each DataFuture will complete when the App has finished executing, +and the corresponding file has been created (and if specified, staged out). + +When a DataFuture is passed as an argument to a subsequent app invocation, +that subsequent app will not begin execution until the DataFuture is +completed. The input argument will then be replaced with an appropriate +File object. + +The following code snippet shows how DataFutures are used. In this +example, the call to the echo Bash app specifies that the results +should be written to an output file ("hello1.txt"). The main +program inspects the status of the output file (via the future's +``outputs`` attribute) and then blocks waiting for the file to +be created (``hello.outputs[0].result()``). + +.. 
code-block:: python + + # This app echoes the input string to the first file specified in the + # outputs list + @bash_app + def echo(message, outputs=()): + return 'echo {} &> {}'.format(message, outputs[0]) + + # Call echo specifying the output file + hello = echo('Hello World!', outputs=[File('hello1.txt')]) + + # The AppFuture's outputs attribute is a list of DataFutures + print(hello.outputs) + + # Print the contents of the output DataFuture when complete + with open(hello.outputs[0].result().filepath, 'r') as f: + print(f.read()) diff --git a/docs/userguide/workflows/index.rst b/docs/userguide/workflows/index.rst new file mode 100644 index 0000000000..0ccb0841f2 --- /dev/null +++ b/docs/userguide/workflows/index.rst @@ -0,0 +1,18 @@ +Running Workflows +================= + +Parsl workflows are a Python "main" program that defines Apps, +how the Apps are invoked, +and how results are passed between different Apps. + +The core concept of workflows is that Parsl Apps produce **Futures** +with all features from those in Python's :mod:`concurrent.futures` module and more. + +.. toctree:: + :maxdepth: 2 + + futures + workflow + exceptions + lifted_ops + checkpoints diff --git a/docs/userguide/workflows/lifted_ops.rst b/docs/userguide/workflows/lifted_ops.rst new file mode 100644 index 0000000000..6e258b9b62 --- /dev/null +++ b/docs/userguide/workflows/lifted_ops.rst @@ -0,0 +1,56 @@ +.. _label-liftedops: + +Lifted operators +================ + +Parsl allows some operators (``[]`` and ``.``) to be used on an AppFuture in +a way that makes sense with those operators on the eventually returned +result. + +Lifted [] operator +------------------ + +When an app returns a complex structure such as a ``dict`` or a ``list``, +it is sometimes useful to pass an element of that structure to a subsequent +task, without waiting for that subsequent task to complete. + +To help with this, Parsl allows the ``[]`` operator to be used on an +`AppFuture`. 
This operator will return
+another `AppFuture` that will complete after the initial future, with the
+result of ``[]`` applied to the value of the initial future.
+
+The end result is that this assertion will hold:
+
+.. code-block:: python
+
+    fut = my_app()
+    assert fut['x'].result() == fut.result()['x']
+
+but more concurrency will be available, as execution of the main workflow
+code will not stop to wait for ``result()`` to complete on the initial
+future.
+
+`AppFuture` does not implement other methods commonly associated with
+dicts and lists, such as ``len``, because those methods should return a
+specific type of result immediately, and that is not possible before the
+underlying result is available.
+
+If a key does not exist in the returned result, then the exception will
+appear in the Future returned by ``[]``, rather than at the point that
+the ``[]`` operator is applied. This is because the valid keys
+are not known until the underlying result is available.
+
+Lifted . operator
+-----------------
+
+The ``.`` operator works similarly to ``[]`` described above:
+
+.. code-block:: python
+
+    fut = my_app()
+    assert fut.x.result() == fut.result().x
+
+Attributes beginning with ``_`` are not lifted, as such names usually
+indicate attributes used for internal purposes; this also avoids mixing
+protocols (such as iteration in for loops) defined on AppFutures with
+protocols defined on the underlying result object.
diff --git a/docs/userguide/workflows/workflow.rst b/docs/userguide/workflows/workflow.rst
new file mode 100644
index 0000000000..f8b34fa6c5
--- /dev/null
+++ b/docs/userguide/workflows/workflow.rst
@@ -0,0 +1,243 @@
+.. _label-workflow:
+
+Example parallel patterns
+=========================
+
+Parsl can be used to implement a wide range of parallel programming patterns, from bag of tasks
+through to nested workflows. Parsl implicitly assembles a dataflow
+dependency graph based on the data shared between apps.
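As a point of comparison only (a minimal sketch using Python's standard :mod:`concurrent.futures` rather than Parsl, with illustrative ``inc`` and ``add`` functions not taken from the Parsl docs): with the standard library, the dependency edges between tasks must be wired by hand, whereas Parsl assembles the graph implicitly when futures are passed as arguments.

```python
from concurrent.futures import ThreadPoolExecutor

def inc(x):
    return x + 1

def add(a, b):
    return a + b

with ThreadPoolExecutor() as pool:
    # Submit two independent upstream tasks.
    f1 = pool.submit(inc, 1)
    f2 = pool.submit(inc, 2)
    # With plain concurrent.futures, the dependency must be expressed
    # manually by blocking on the upstream results before submitting the
    # downstream task. Parsl instead accepts the futures themselves as
    # arguments and launches the downstream app when its inputs resolve.
    total = pool.submit(add, f1.result(), f2.result())
    print(total.result())  # prints 5
```
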
+The flexibility of this model allows many such patterns to be expressed
+directly in Python.
+
+Parsl is also designed to address broad execution requirements, from programs
+that run many short tasks to those that run a few long tasks.
+
+Below we illustrate several common parallel programming and workflow patterns;
+this set of examples is by no means comprehensive.
+
+
+Bag of Tasks
+------------
+Parsl can be used to execute a large bag of tasks. In this case, Parsl
+assembles the set of tasks (represented as Parsl apps) and manages their concurrent
+execution on available resources.
+
+.. code-block:: python
+
+    import parsl
+    from parsl import python_app
+
+    parsl.load()
+
+    # App that returns a random number
+    @python_app
+    def app_random():
+        import random
+        return random.random()
+
+    results = []
+    for i in range(0, 10):
+        x = app_random()
+        results.append(x)
+
+    for r in results:
+        print(r.result())
+
+
+Sequential workflows
+--------------------
+
+Sequential workflows can be created by passing an AppFuture from one task to another. For example, in the following program the ``generate`` app (a Python app) generates a random number that is consumed by the ``save`` app (a Bash app), which writes it to a file. Because ``save`` cannot execute until it receives the ``message`` produced by ``generate``, the two apps execute in sequence.
+
+.. code-block:: python
+
+    import parsl
+    from parsl import python_app, bash_app
+    from parsl.data_provider.files import File
+
+    parsl.load()
+
+    # Generate a random number
+    @python_app
+    def generate(limit):
+        """Generate a random integer and return it"""
+        from random import randint
+        return randint(1, limit)
+
+    # Write a message to a file
+    @bash_app
+    def save(message, outputs=()):
+        return 'echo {} &> {}'.format(message, outputs[0])
+
+    message = generate(10)
+
+    saved = save(message, outputs=[File('output.txt')])
+
+    with open(saved.outputs[0].result().filepath, 'r') as f:
+        print(f.read())
+
+
+Parallel workflows
+------------------
+
+Parallel execution occurs automatically in Parsl, respecting dependencies among app executions. In the following example, three instances of the ``wait_sleep_double`` app are created. The first two execute concurrently, as they have no dependencies; the third must wait until the first two complete and thus the ``doubled_x`` and ``doubled_y`` futures have values. Note that this sequencing occurs even though ``wait_sleep_double`` does not in fact use its second and third arguments.
+
+.. code-block:: python
+
+    import parsl
+    from parsl import python_app
+
+    parsl.load()
+
+    @python_app
+    def wait_sleep_double(x, foo_1, foo_2):
+        import time
+        time.sleep(2)  # Sleep for 2 seconds
+        return x*2
+
+    # Launch two apps, which will execute in parallel, since they do not have to
+    # wait on any futures
+    doubled_x = wait_sleep_double(10, None, None)
+    doubled_y = wait_sleep_double(10, None, None)
+
+    # The third app depends on the first two:
+    #    doubled_x   doubled_y   (2 s)
+    #         \       /
+    #         doubled_z          (2 s)
+    doubled_z = wait_sleep_double(10, doubled_x, doubled_y)
+
+    # doubled_z will be done in ~4s
+    print(doubled_z.result())
+
+
+Parallel workflows with loops
+-----------------------------
+
+A common approach to executing Parsl apps in parallel is via loops. The following example uses a loop to create many random numbers in parallel.
+
+.. code-block:: python
+
+    import parsl
+    from parsl import python_app
+
+    parsl.load()
+
+    @python_app
+    def generate(limit):
+        """Generate a random integer and return it"""
+        from random import randint
+        return randint(1, limit)
+
+    rand_nums = []
+    for i in range(1, 5):
+        rand_nums.append(generate(i))
+
+    # Wait for all apps to finish and collect the results
+    outputs = [r.result() for r in rand_nums]
+
+The :class:`~parsl.concurrent.ParslPoolExecutor` simplifies this pattern using the same interface as
+`Python's native Executors `_.
+
+.. code-block:: python
+
+    from parsl.concurrent import ParslPoolExecutor
+    from parsl.configs.htex_local import config
+
+    # NOTE: Functions used by the ParslPoolExecutor do _not_ use decorators
+    def generate(limit):
+        """Generate a random integer and return it"""
+        from random import randint
+        return randint(1, limit)
+
+
+    with ParslPoolExecutor(config) as pool:
+        outputs = pool.map(generate, range(1, 5))
+
+
+In the preceding example, the execution of different tasks is coordinated by passing Python objects from producers to consumers.
+In other cases, it can be convenient to pass data in files, as in the following reformulation. Here, a set of files, each with a random number, is created by the ``generate`` app. These files are then concatenated into a single file, which is subsequently used to compute the sum of all numbers.
+
+.. code-block:: python
+
+    import parsl
+    from parsl import python_app, bash_app
+    from parsl.data_provider.files import File
+
+    parsl.load()
+
+    @bash_app
+    def generate(outputs=()):
+        return 'echo $(( RANDOM % (10 - 5 + 1 ) + 5 )) &> {}'.format(outputs[0])
+
+    @bash_app
+    def concat(inputs=(), outputs=(), stdout='stdout.txt', stderr='stderr.txt'):
+        return 'cat {0} >> {1}'.format(' '.join(inputs), outputs[0])
+
+    @python_app
+    def total(inputs=()):
+        total = 0
+        with open(inputs[0].filepath, 'r') as f:
+            for l in f:
+                total += int(l)
+        return total
+
+    # Create 5 files with random numbers
+    output_files = []
+    for i in range(5):
+        output_files.append(generate(outputs=[File('random-%s.txt' % i)]))
+
+    # Concatenate the files into a single file
+    cc = concat(inputs=[i.outputs[0] for i in output_files], outputs=[File('all.txt')])
+
+    # Calculate the sum of the random numbers
+    totals = total(inputs=[cc.outputs[0]])
+
+    print(totals.result())
+
+
+MapReduce
+---------
+MapReduce is a common pattern used in data analytics. It is composed of a map phase
+that filters values and a reduce phase that aggregates values.
+The following example demonstrates how Parsl can be used to specify a MapReduce computation
+in which the map phase doubles a set of input integers and the reduce phase computes
+the sum of those results.
+
+.. code-block:: python
+
+    import parsl
+    from parsl import python_app
+
+    parsl.load()
+
+    # Map function that returns double the input integer
+    @python_app
+    def app_double(x):
+        return x*2
+
+    # Reduce function that returns the sum of a list
+    @python_app
+    def app_sum(inputs=()):
+        return sum(inputs)
+
+    # Create a list of integers
+    items = range(0, 4)
+
+    # Map phase: apply the double *app* function to each item in list
+    mapped_results = []
+    for i in items:
+        x = app_double(i)
+        mapped_results.append(x)
+
+    # Reduce phase: apply the sum *app* function to the set of results
+    total = app_sum(inputs=mapped_results)
+
+    print(total.result())
+
+The program first defines two Parsl apps, ``app_double`` and ``app_sum``.
+It then calls the ``app_double`` app on each input value and passes
+the resulting futures to the ``app_sum`` app, which aggregates them into a single result.
+The ``app_double`` tasks execute concurrently; ``app_sum`` is synchronized on the futures
+collected in the ``mapped_results`` variable.
+The following figure shows the resulting task graph.
+
+.. image:: ../../images/MapReduce.png
+
+Caching expensive initialisation between tasks
+----------------------------------------------
+
+Many workflow tasks require an expensive "initialization" step that, once performed, can be reused across successive invocations of that task. For example, you may want to reuse a machine learning model for multiple inference tasks and avoid loading it onto GPUs more than once.
+
+`This ExaWorks tutorial `_ gives examples of how to do this.
diff --git a/setup.py b/setup.py
index 62706c60db..94551e2ba3 100755
--- a/setup.py
+++ b/setup.py
@@ -13,7 +13,6 @@
         'sqlalchemy>=2,<2.1'
     ],
     'visualization' : [
-        # this pydot bound is copied from networkx's pyproject.toml,
         # version 3.2 (aa2de1adecea09f7b86ff6093b212ca86f22b3ef),
         # because networkx[extra] installs quite a lot of extra stuff