# General architecture for asynchronous API

Rejected and archived.

## Introduction

oneDPL algorithms with device execution policies are built on top of SYCL, and since the early
days there has been demand for the algorithms to preserve SYCL's ability to compute
asynchronously with respect to the main program running on the host CPU. However, the C++ standard
semantics for parallel algorithms, which oneDPL follows, do not allow asynchronous execution:
the calling thread can only return when the algorithm finishes (for details, see the
[algorithms.parallel.exec] section of the C++ standard).

To address this demand, [experimental asynchronous algorithms](#onedpl-experimental-asynchronous-algorithms)
that do not block the calling thread have been added. Later, oneDPL added experimental functionality for
[dynamic selection](https://uxlfoundation.github.io/oneDPL/dynamic_selection_api_main.html) that also
allows starting asynchronous work and waiting for its completion later.

## Original RFC goal

For the mentioned experimental APIs to solidify and go into production, we wanted to design a single
consistent approach to asynchronous execution. That was the original goal of this RFC.
However, after studying the topic, we decided not to proceed with it at this time.

## Reasons for archival

We have concluded that it would be premature to make the oneDPL asynchronous algorithms fully supported.
The major concerns are:
- All known use cases for these algorithms are SYCL-based. There is therefore little practical motivation
for a general solution, while SYCL-specific APIs, such as oneDPL [kernel templates](
https://uxlfoundation.github.io/oneDPL/kernel_templates_main.html), can potentially serve the needs better.
- In many practical usages we observed the pattern of [synchronization with a queue or a device](#2-synchronize-with-a-work-queue).
It might be addressed in a simpler and more extensible way with deferred waiting hints similar to
the [`par_nosync` policy in Thrust](#thrust-and-cub).
- The C++ community is shifting from [future-based asynchronous APIs](#c-async--future) to the
[schedulers and senders](#c26-execution-control-library) based approach. For example, NVIDIA actively
develops the experimental [stdexec library](https://github.com/NVIDIA/stdexec) while considering
deprecation of the [asynchronous algorithms in Thrust](#thrust-and-cub).

It therefore seems better to explore alternative ways to address the use cases for asynchronous execution,
as well as algorithms based on C++ 26 schedulers and senders. That eliminates the motivation for
designing a unified, generic approach to asynchrony in oneDPL.

The rest of the document contains additional information as well as useful links.

## Use case study

In the practical use of the oneDPL asynchronous APIs, as well as of similar APIs in other libraries
(such as Thrust), we observed several typical patterns; pseudocode examples of them follow.
**The proposal is aimed primarily at supporting these very patterns**. The list can be extended
if there is enough evidence of demand for other patterns of asynchronous compute.

In the examples, `foo-async` represents a call, such as the oneDPL `for_each_async` and `submit`
functions, that starts some asynchronous work, and `sync-with` indicates a synchronization
point with previously started asynchronous work.

### 1. Synchronize with a single call

This is the basic use case where the main program invokes a function asynchronously and later waits
for its completion via the returned object.

```
/* start asynchronous work */
sync-object s = foo-async(/*arguments*/);
/* do some other work */
...
/* synchronize */
sync-with(s);
```

Variations of the pattern are supported by many APIs including `std::thread` and `std::async`.
Some of the APIs do not guarantee that the execution of `foo-async` actually happens simultaneously
with the main program; it might get *deferred* till the synchronization point. However, the usage
of oneDPL asynchronous APIs, and specifically of those that work with or on top of SYCL, likely
assumes that the execution is *eager*, not deferred.
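
As a concrete illustration, this pattern maps directly onto the oneDPL experimental API. Below is a minimal, hedged sketch; it assumes USM shared memory, a default-constructed SYCL queue, and `for_each_async` from `<oneapi/dpl/async>`:

```
#include <oneapi/dpl/execution>
#include <oneapi/dpl/async>
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    int* data = sycl::malloc_shared<int>(1024, q);

    /* start asynchronous work; the returned future is the sync-object */
    auto s = oneapi::dpl::experimental::for_each_async(
        oneapi::dpl::execution::make_device_policy(q),
        data, data + 1024, [](int& x) { x += 1; });

    /* do some other work on the host */

    /* synchronize */
    s.wait();
    sycl::free(data, q);
}
```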

### 2. Synchronize with a work queue

This pattern is especially common for *heterogeneous* APIs that offload execution to
another compute device such as a GPU. Devices are usually represented by *queues* or *streams*
to which work can be submitted. The main program then waits for completion of all the work
previously submitted to a queue.

```
work-queue q{/*initialization arguments*/};
/* submit work to the queue */
foo-async(q, /*arguments*/);
bar-async(q, /*arguments*/);
...
/* synchronize */
sync-with(q);
```

The work queue is shown explicitly in this pseudocode example, but in real APIs it might be
implicit if a default device is assumed. Note also that, unlike in the first example,
the asynchronous calls are not expected to return a synchronization object; it would be redundant
because synchronization happens once for all the work in the queue.

It is important to mention that work queues are usually provided by the "core" heterogeneous
programming model such as CUDA or SYCL, and it is common for a program to use one queue
with several libraries as well as custom-written kernels.
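
For illustration, here is a hedged sketch of the pattern in plain SYCL; it assumes an in-order queue so that the two kernels execute sequentially, and deliberately ignores the events returned by `parallel_for`:

```
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q{sycl::property::queue::in_order{}};
    float* a = sycl::malloc_device<float>(1024, q);

    /* submit work to the queue; the returned events are ignored */
    q.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) { a[i] = float(i); });
    q.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) { a[i] *= 2.0f; });

    /* synchronize with all previously submitted work */
    q.wait();
    sycl::free(a, q);
}
```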

### 3. Fork and join

We also observed programs using asynchronous algorithms in the very common *fork-join*
pattern to process a large piece of work in parallel.

```
work-queue qs[] = {/*initialize all the queues*/};
sync-object jobs[];
/* split work to multiple queues */
for (work-queue q in qs) {
sync-object s = foo-async(q, /*arguments*/);
append s to jobs;
}
/* synchronize with all parts */
sync-with(jobs); /* possibly in a loop */
/* combine the results, if needed */
```

For example, [Distributed Ranges](https://github.com/oneapi-src/distributed-ranges) uses oneDPL
asynchronous algorithms in this way to distribute work across available devices.
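
For illustration, a hedged sketch of this variant in plain SYCL (rather than the oneDPL async API), with the returned `sycl::event` objects playing the role of the sync-objects:

```
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    /* one queue per available GPU device */
    std::vector<sycl::queue> qs;
    for (auto& d : sycl::device::get_devices(sycl::info::device_type::gpu))
        qs.emplace_back(d);

    constexpr size_t chunk = 1024;
    std::vector<sycl::event> jobs;
    std::vector<float*> parts;

    /* split work to multiple queues */
    for (auto& q : qs) {
        float* part = sycl::malloc_device<float>(chunk, q);
        parts.push_back(part);
        jobs.push_back(q.parallel_for(sycl::range<1>{chunk},
            [=](sycl::id<1> i) { part[i] = float(i) * 2.0f; }));
    }

    /* synchronize with all parts */
    for (auto& e : jobs)
        e.wait();

    /* combine the results if needed, then clean up */
    for (size_t i = 0; i < qs.size(); ++i)
        sycl::free(parts[i], qs[i]);
}
```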

The fork-join pattern can also use work queue synchronization:
```
work-queue qs[] = {/*initialize all the queues*/};
/* split work to multiple queues */
for (work-queue q in qs) {
foo-async(q, /*arguments*/);
}
/* synchronize with all queues */
for (work-queue q in qs) {
sync-with(q);
}
/* combine the results, if needed */
```

### Out of scope

There is no intention to support asynchronous computation in general, or use cases beyond the functionality
of oneDPL, such as dependencies between arbitrary asynchronously executed functions. Other libraries,
as well as C++ standard capabilities, exist for that purpose; some are mentioned later in the document.

SYCL supports out-of-order queues and provides APIs to set dependencies between kernels,
including explicit dependencies via events. The experimental async algorithms in oneDPL were designed
to preserve this capability; however, we have no evidence of it being used in practice. Also, our study
of the usage of Thrust asynchronous algorithms has not found examples of dependency chains. Therefore,
support for this use case is not a requirement.

## Additional context

### Thrust and CUB

The Thrust library uses two approaches for its asynchronous algorithms.
Both approaches are implemented for the CUDA backend only.

First, it has a small set of explicitly asynchronous algorithms in `namespace thrust::async`
that return an event or a *future* to later synchronize with. However, this API has not evolved much
since its introduction, is not well documented, and, according to https://github.com/NVIDIA/cccl/issues/100,
has been considered deprecated since at least 2023 (though not yet officially).

Second, Thrust has a special `par_nosync` execution policy that indicates that the implementation
can skip non-essential synchronization as the caller will explicitly synchronize with the device
or stream before accessing the results.
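
A minimal sketch of how the policy is typically used (hedged; assumes the CUDA backend and a Thrust version that provides `par_nosync`, i.e. 1.16 or later):

```
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    thrust::device_vector<int> v(1024, 1);

    /* may return before the device work completes */
    thrust::transform(thrust::cuda::par_nosync.on(stream),
                      v.begin(), v.end(), v.begin(),
                      thrust::negate<int>());

    /* other host work can overlap with the device computation here */

    cudaStreamSynchronize(stream);  /* synchronize with the stream */
    cudaStreamDestroy(stream);
}
```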

More information can be found in the [Thrust changelog](https://nvidia.github.io/cccl/thrust/releases/changelog.html).

The [device algorithms in CUB](https://nvidia.github.io/cccl/cub/developer_overview.html#device-scope) are
implicitly asynchronous. Unlike the Thrust ones, they do not return any waitable object and require
synchronization with the device. There are notably more `cub::Device*` algorithms than `thrust::async` ones.
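
For example, a hedged sketch of the usual two-phase CUB calling convention; note that the stream is the only synchronization handle:

```
#include <cub/cub.cuh>
#include <cuda_runtime.h>

void sum_async(const int* d_in, int* d_out, int n, cudaStream_t stream) {
    /* the first call only queries the temporary storage size */
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n, stream);
    cudaMallocAsync(&d_temp, temp_bytes, stream);

    /* the second call enqueues the reduction and returns immediately;
       no event or future is handed back to the caller */
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n, stream);
    cudaFreeAsync(d_temp, stream);

    cudaStreamSynchronize(stream);  /* synchronize with the device work */
}
```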

### oneDPL experimental asynchronous algorithms

As mentioned before, [the asynchronous algorithms](https://oneapi-src.github.io/oneDPL/parallel_api/async_api.html)
in oneDPL are intended to allow the underlying SYCL implementation to proceed without blocking
the calling thread. These functions return a future that can be used to synchronize and obtain
the computed value at a later time. The functions can also accept a list of `sycl::event` objects
as *input dependencies* (though the implementation has not advanced beyond an immediate wait on these events).
The `wait_for_all` function waits for completion of a given list of events or futures.
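
A hedged sketch of how these pieces compose; it assumes that a future returned by one call can be passed as a trailing input dependency to the next, as the API description suggests:

```
#include <oneapi/dpl/execution>
#include <oneapi/dpl/async>
#include <sycl/sycl.hpp>
#include <functional>

namespace dpl = oneapi::dpl;

int main() {
    sycl::queue q;
    auto policy = dpl::execution::make_device_policy(q);
    int* a = sycl::malloc_shared<int>(1000, q);

    auto f1 = dpl::experimental::fill_async(policy, a, a + 1000, 7);

    /* f1 is passed as an input dependency of the reduction */
    auto f2 = dpl::experimental::reduce_async(policy, a, a + 1000, 0,
                                              std::plus<int>(), f1);

    dpl::experimental::wait_for_all(f2);
    int sum = f2.get();  /* expected to be 7000 */
    sycl::free(a, q);
    return sum == 7000 ? 0 : 1;
}
```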

The second goal of this API was to allow functional mapping for `thrust::async` algorithms,
facilitating support for SYCL in applications that use Thrust.

So far, we know of only a few projects that use these algorithms.

### C++ async & future

The C++ standard provides several ways for a program to use asynchrony, but [`std::future` and
related APIs](https://en.cppreference.com/w/cpp/header/future), especially `std::async`, are of most
interest for this discussion. The `std::async` routine runs a given function asynchronously,
returning `std::future` to obtain the result later. Essentially, this is the model that both
oneDPL and Thrust (as well as other libraries) use for their asynchronous APIs.
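
The model in its simplest form:

```
#include <future>

int main() {
    /* start asynchronous work */
    std::future<int> f = std::async(std::launch::async, [] { return 6 * 7; });

    /* do some other work */

    /* synchronize and obtain the computed value */
    int result = f.get();  /* 42 */
    return result == 42 ? 0 : 1;
}
```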

There is a fair amount of criticism of this model, primarily pointing to the lack of support
for advanced usage scenarios, such as building graphs of dependent asynchronous tasks.
Alternative implementations of futures, for example in [stlab](https://stlab.cc/includes/stlab/concurrency/)
as well as in Thrust, try to address some of the shortcomings.

### C++26 execution control library

The new [execution control library](https://eel.is/c++draft/exec) in C++ 26, also known as
[*schedulers/senders/receivers*](https://wg21.link/p2300), is the proposed way to improve
asynchronous programming in C++, addressing the limitations of `std::future`. In essence,
this library provides a vocabulary for building program flow graphs and then running them on
chosen execution resources.

The stages of creating and executing a graph of computations are separate; the execution can only
be started explicitly by one of a few dedicated calls. Therefore, the approach appears better
suited for deferred execution, while eager execution would be at least more verbose to express.
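
For a flavor of the model, a hedged sketch using the experimental stdexec library (names follow the P2300 design; the thread pool scheduler comes from stdexec's companion `exec` headers):

```
#include <stdexec/execution.hpp>
#include <exec/static_thread_pool.hpp>
#include <utility>

int main() {
    exec::static_thread_pool pool{4};
    auto sched = pool.get_scheduler();

    /* describe the computation graph; nothing runs yet */
    auto work = stdexec::schedule(sched)
              | stdexec::then([] { return 41; })
              | stdexec::then([](int x) { return x + 1; });

    /* explicitly start the graph and wait for its result */
    auto [result] = stdexec::sync_wait(std::move(work)).value();
    /* result == 42 */
    return result == 42 ? 0 : 1;
}
```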

Some companion proposals, notably for [async_scope](https://wg21.link/p3149) and a
[system execution context](https://wg21.link/p2079), have not yet been accepted into the working
draft. The proposal for adding [asynchronous parallel algorithms](https://wg21.link/p3300) is
at a very early stage and is not planned for C++ 26.
