Re-implement SYCL backend `parallel_for` to improve bandwidth utilization #1976

mmichel11 · 2024-12-19T01:57:27Z

High Level Description
This PR improves hardware bandwidth utilization of oneDPL's SYCL backend parallel for pattern through two ideas:

Process multiple input iterations per work-item which involves a switch to a nd_range kernel combined with a sub / work group strided indexing approach.
To generate wide loads for small types, implement a path that vectorizes loads / stores by processing adjacent indices within a single work item. This is combined with the above approach to maximize hardware bandwidth utilization. Vectorization is only applied to fundamental types of size less than 4 (e.g. uint16_t, uint8_t) under a contiguous container.

Implementation Details

Parallel for bricks have been reworked in the following manner:
- Each brick contains a pack of ranges within its template parameters to define tuning parameters.
- The following static integral members are defined (implemented with inheritance):
  - __can_vectorize
  - __preferred_vector_size (1 if __can_vectorize is false)
  - __preferred_iters_per_item
- The following public member functions are defined
  - __scalar_path (for small input sizes this member function is explicitly called)
  - __vector_path (optional for algorithms that are not vectorizable e.g. binary_search)
  - An overloaded function call operator which dispatches to the appropriate strategy

To implement this approach, the parallel for kernel rewrite from #1870 was adopted with additional changes to handle vectorization paths. Additionally, generic vectorization and strided loop utilities have been defined with the intention for these to be applicable in other portions of the codebase as well. Tests have been expanded to ensure coverage of vectorization paths.

This PR will supersedes #1870. Initially, the plan was to merge this PR into 1870 but after comparing the diff, I believe the most straightforward approach will be to target this directly to main.

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

128 byte memory operations are performed instead of 512 after inspecting the assembly. Processing 512 bytes per sub-group still seems to be the best value after experimentation. Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

…ute work for small inputs Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

This reverts commit e4cbceb. Small sizes slightly slower and for horizontal vectorization no "real" benefit is observed.

Small but measurable overheads can be observed for small inputs where runtime dispatch in the kernel is present to check for the correct path to take. Letting the compiler handle the the small input case in the original kernel shows the best performance. Signed-off-by: Matthew Michel <matthew.michel@intel.com>

We now flatten the user-provided ranges and find the minimum sized type to estimate the best __iters_per_work_item. This benefits performance in calls that wrap multiple buffers in a single input / output through a zip_iterator (e.g. dpct::scatter_if in SYCLomatic compatibility headers). Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

include/oneapi/dpl/pstl/hetero/algorithm_impl_hetero.h

include/oneapi/dpl/pstl/hetero/histogram_impl_hetero.h

include/oneapi/dpl/internal/binary_search_impl.h

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

With uint8_t types, the icpx compiler fails to vectorize even when calling begin() on our range within a kernel to pull out a raw pointer. To work around this issue, begin() needs to be called on the host and passed to the kernel Signed-off-by: Matthew Michel <matthew.michel@intel.com>

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_for.h

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

mmichel11 · 2025-01-07T14:31:42Z

So now we have 3 entity with defined constexpr static bool __can_vectorize :

class walk_vector_or_scalar_base

class walk_scalar_base

struct __brick_shift_left

Does these constexpr-variables really has different semantic?

And if the semantic of these entities are the same, may be make sense to make some re-design to have only one entity __can_vectorize ?

These three cases are all unique when you consider that they define __can_vectorize, __preferred_vector_size, and __preferred_iters_per_item. These three fields are all tightly coupled, so in my opinion it makes sense to define them together for readability. If we were to define a single __can_vectorize, then I think it would need to function more like a trait class dependent on the provided brick as the brick itself plays a roll in whether not vectorization is possible. This design would not get us much in my opinion as we would still need specializations for the different cases.

The three unique cases I mention are the following:

struct walk_vector_or_scalar_base - Vectorization is possible so long as the ranges meet the requirements to be vectorizable. This is then used to determine iters per item and the vector size.
struct walk_scalar_base - Vectorization is not possible due to some limitation of the brick. Binary search is a good example since its accesses are non-sequential. The iterations per work item is still set based on the size of the provided ranges.
struct __brick_shift_left - This brick has a limitation that prevents vectorization and that only one iteration per item can be processed and is a special case.

This reverts commit 1336735.

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

…_path_impl Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

mmichel11 · 2025-01-09T16:53:52Z

In some moments implementation details remind me tag-dispatching which were designed by @rarutyun. But with some differences: for example the walk2_vectors_or_scalars has not only information about vectorization or parallelization should be executed, but also two variant of functional staff and operator() with compile-time condition check to run one code or another code.

But what if we instead of two different functions
    template <typename _IsFull, typename _ItemId>
    void
    __vector_path(_IsFull __is_full, const _ItemId __idx, _Range __rng) const
    {
        // This is needed to enable vectorization
        auto __raw_ptr = __rng.begin();
        oneapi::dpl::__par_backend_hetero::__vector_walk<__base_t::__preferred_vector_size>{__n}(__is_full, __idx, __f,
                                                                                                 __raw_ptr);
    }

    // _IsFull is ignored here. We assume that boundary checking has been already performed for this index.
    template <typename _IsFull, typename _ItemId>
    void
    __scalar_path(_IsFull, const _ItemId __idx, _Range __rng) const
    {

        __f(__rng[__idx]);
    }
we will have some two functions with the same name and the format excepting the first parameter type which will be used as some tag ?

Please take a look at __parallel_policy_tag_selector_t for details.

Thanks for the suggestion. I had previously considered this approach by having two function call operators and using
a true_type / false_type as a tag:

     template <typename _IsFull, typename _ItemId>
     void
     operator()(/*__vectorize*/std::true_type, _IsFull __is_full, const _ItemId __idx, _Range __rng) const
     {
         oneapi::dpl::__par_backend_hetero::__vector_walk<__base_t::__preferred_vector_size>{__n}(__is_full, __idx, __f,__rng);
     }
 
     // _IsFull is ignored here. We assume that boundary checking has been already performed for this index.
     template <typename _IsFull, typename _ItemId>
     void
     operator()(/*__vectorize*/std::false_type, _IsFull, const _ItemId __idx, _Range __rng) const
     {
         __f(__rng[__idx]);
     }

This allows us to remove the if constexpr ... else dispatch that shows up in each brick. The reason I decided against this is that the caller of the brick, oneapi::dpl::__par_backend_hetero::__strided_loop, would also need to have an additional vectorization tag that would need to be passed through to the brick. This is intended to be a generic utility, and I do not believe that we should require the caller of __strided_loop to provide any additional tag beyond the necessary _IsFull to avoid OOB accesses.

Alternatively, we could replace __scalar_path_impl / __vector_path_impl with a single function __impl that accepts a vector tag as its first parameter. This is less clear in my opinion compared to separate function names.

This is well beyond the cutoff point to invoke the large submitter and prevents timeouts observed on devices with many compute units when testing CPU paths. Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

…e_range Signed-off-by: Matthew Michel <matthew.michel@intel.com>

SergeyKopienko · 2025-01-14T11:34:51Z

include/oneapi/dpl/internal/binary_search_impl.h

    void
-    operator()(_ItemId idx, _Acc acc) const
+    __scalar_path_impl(_IsFull, _ItemId idx, _Acc acc) const


I believe that we may try to improve this code by replacing run-time bool value use_32bit_indexing to compile-time indexing type specialization.
I found only 3 places with the code

const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();

so it's not big deal to add if statement outside and call __parallel_for inside for both branches with the different index types. But inside the brick we exclude condition check at all.

As discussed offline, I will reevaluate performance here and provide an update. The advantage of the current approach is that we only compile a single kernel whereas your suggestion may improve kernel performance with the cost of increased JIT overhead.

SergeyKopienko · 2025-01-14T11:37:27Z

include/oneapi/dpl/pstl/hetero/algorithm_impl_hetero.h

@@ -56,8 +56,12 @@ __pattern_walk1(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _ForwardIt
        oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read_write, _ForwardIterator>();
    auto __buf = __keep(__first, __last);

+    auto __view = __buf.all_view();


Should we really extract this into local variable? Will we have some profit?
We still have a lot of code without this local variable...

Initially, I did this to remove the duplicate all_view()s that arise but this actually seems to lead to more code looking at the diff. All of these added local view variables have been reverted in the PR.

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

mmichel11 added the enhancement label Dec 19, 2024

mmichel11 added this to the 2022.8.0 milestone Dec 19, 2024

mmichel11 marked this pull request as ready for review December 19, 2024 19:17

mmichel11 requested review from timmiesmith, akukanov, MikeDvorskiy, SergeyKopienko, danhoeflinger, adamfidel and dmitriy-sobolev December 19, 2024 19:18

mmichel11 changed the title ~~[Draft] Re-implement SYCL backend parallel_for to improve bandwidth utilization~~ Re-implement SYCL backend parallel_for to improve bandwidth utilization Dec 19, 2024

mmichel11 added 19 commits December 19, 2024 16:01

Optimize memory transactions in SYCL backend parallel for

9764a57

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

clang-format

c836b1d

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Correct comment and error handling.

55f33a4

128 byte memory operations are performed instead of 512 after inspecting the assembly. Processing 512 bytes per sub-group still seems to be the best value after experimentation. Signed-off-by: Matthew Michel <matthew.michel@intel.com>

__num_groups bugfix

adadd56

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Introduce stride recommender for different targets and better distrib…

71d7bcc

…ute work for small inputs Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Cleanup

ebb3d56

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Unroll loop if possible

2c4ecd0

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Revert "Unroll loop if possible"

dc6bd0c

This reverts commit e4cbceb. Small sizes slightly slower and for horizontal vectorization no "real" benefit is observed.

Code cleanup

d376124

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Clang format

a7c7606

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Update comments

b8aa15c

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Bugfix in comment

b45a7c2

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

More cleanup and better handle non-full case

4f9a360

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Rename __ndi to __item for consistency with codebase

7bb1d2b

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Update all comments on kernel naming trick

a2ad920

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Handle non-full case in a cleaner way

47fe214

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Switch min tuple type utility to return size of type

79a18e9

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

mmichel11 added 5 commits January 2, 2025 11:26

Address review comments

8b0b18b

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Remove intrusive test macro and adjust input sizes in test framework

f7d9753

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Make walk_scalar_base and walk_vector_or_scalar_base structs

83c5ca4

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Add missing max_n

fedd5de

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Add constructors for for-based bricks

08aa260

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

danhoeflinger reviewed Jan 6, 2025

View reviewed changes

include/oneapi/dpl/pstl/hetero/algorithm_impl_hetero.h Outdated Show resolved Hide resolved

danhoeflinger reviewed Jan 6, 2025

View reviewed changes

include/oneapi/dpl/pstl/hetero/histogram_impl_hetero.h Outdated Show resolved Hide resolved

danhoeflinger reviewed Jan 6, 2025

View reviewed changes

include/oneapi/dpl/internal/binary_search_impl.h Show resolved Hide resolved

mmichel11 added 3 commits January 6, 2025 08:23

Remove extraneous {} and add constructor to custom_brick

a5eca96

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Limit recursive searching of __min_nested_type_size to tuples

32612a1

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

danhoeflinger reviewed Jan 6, 2025

View reviewed changes

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_for.h Outdated Show resolved Hide resolved

danhoeflinger reviewed Jan 6, 2025

View reviewed changes

include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_for.h Outdated Show resolved Hide resolved

mmichel11 added 2 commits January 6, 2025 16:15

Add missing decays

c5e7d61

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Add compile time check to ensure we do not get buffer pointer on host

0c6ca75

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

mmichel11 mentioned this pull request Jan 7, 2025

Improve SYCL backend __parallel_for performance for large input sizes #1870

Closed

mmichel11 added 5 commits January 7, 2025 11:41

Revert "Work around compiler vectorization issue"

be8aeda

This reverts commit 1336735.

Remove all begin() calls on views in vectorization paths

86b9c89

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Remove unused __is_passed_directly_range utility

ffd95cc

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Rename __scalar_path / __vector_path to __scalar_path_impl / __vector…

537a6f0

…_path_impl Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Correct __vector_walk deleters and a type in __reverse_copy

50a60ea

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

mmichel11 added 3 commits January 9, 2025 11:29

Set upper limit of 10,000,000 for get_pattern_for_max_n

1081ab8

This is well beyond the cutoff point to invoke the large submitter and prevents timeouts observed on devices with many compute units when testing CPU paths. Signed-off-by: Matthew Michel <matthew.michel@intel.com>

General cleanup and renaming for consistency

9513edb

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Explicitly list template types in specializations of __is_vectorizabl…

b274c8d

…e_range Signed-off-by: Matthew Michel <matthew.michel@intel.com>

SergeyKopienko reviewed Jan 14, 2025

View reviewed changes

mmichel11 added 2 commits January 14, 2025 15:30

Remove unnecessary local variables

92f3374

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Remove unnecessary local variables in async and numeric headers

eb2cdf8

Signed-off-by: Matthew Michel <matthew.michel@intel.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-implement SYCL backend `parallel_for` to improve bandwidth utilization #1976

Re-implement SYCL backend `parallel_for` to improve bandwidth utilization #1976

mmichel11 commented Dec 19, 2024 •

edited

Loading

mmichel11 commented Jan 7, 2025

mmichel11 commented Jan 9, 2025 •

edited

Loading

SergeyKopienko Jan 14, 2025

mmichel11 Jan 14, 2025

SergeyKopienko Jan 14, 2025 •

edited

Loading

mmichel11 Jan 14, 2025

Re-implement SYCL backend parallel_for to improve bandwidth utilization #1976

Are you sure you want to change the base?

Re-implement SYCL backend parallel_for to improve bandwidth utilization #1976

Conversation

mmichel11 commented Dec 19, 2024 • edited Loading

mmichel11 commented Jan 7, 2025

mmichel11 commented Jan 9, 2025 • edited Loading

SergeyKopienko Jan 14, 2025

Choose a reason for hiding this comment

mmichel11 Jan 14, 2025

Choose a reason for hiding this comment

SergeyKopienko Jan 14, 2025 • edited Loading

Choose a reason for hiding this comment

mmichel11 Jan 14, 2025

Choose a reason for hiding this comment

Re-implement SYCL backend `parallel_for` to improve bandwidth utilization #1976

Re-implement SYCL backend `parallel_for` to improve bandwidth utilization #1976

mmichel11 commented Dec 19, 2024 •

edited

Loading

mmichel11 commented Jan 9, 2025 •

edited

Loading

SergeyKopienko Jan 14, 2025 •

edited

Loading