Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-implement SYCL backend parallel_for to improve bandwidth utilization #1976

Open
wants to merge 68 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 66 commits
Commits
Show all changes
68 commits
Select commit Hold shift + click to select a range
9764a57
Optimize memory transactions in SYCL backend parallel for
mmichel11 Sep 5, 2024
c836b1d
clang-format
mmichel11 Sep 5, 2024
55f33a4
Correct comment and error handling.
mmichel11 Sep 6, 2024
adadd56
__num_groups bugfix
mmichel11 Sep 10, 2024
71d7bcc
Introduce stride recommender for different targets and better distrib…
mmichel11 Sep 16, 2024
ebb3d56
Cleanup
mmichel11 Sep 16, 2024
2c4ecd0
Unroll loop if possible
mmichel11 Sep 18, 2024
dc6bd0c
Revert "Unroll loop if possible"
mmichel11 Sep 18, 2024
d5126b2
Use a small and large kernel in parallel for
mmichel11 Sep 20, 2024
6433a50
Improve __iters_per_work_item heuristic.
mmichel11 Sep 20, 2024
d376124
Code cleanup
mmichel11 Sep 20, 2024
a7c7606
Clang format
mmichel11 Sep 23, 2024
b8aa15c
Update comments
mmichel11 Sep 23, 2024
b45a7c2
Bugfix in comment
mmichel11 Sep 23, 2024
4f9a360
More cleanup and better handle non-full case
mmichel11 Sep 23, 2024
7bb1d2b
Rename __ndi to __item for consistency with codebase
mmichel11 Sep 24, 2024
a2ad920
Update all comments on kernel naming trick
mmichel11 Sep 24, 2024
47fe214
Handle non-full case in a cleaner way
mmichel11 Sep 24, 2024
79a18e9
Switch min tuple type utility to return size of type
mmichel11 Sep 24, 2024
3ab8c75
Remove unnecessary template parameter
mmichel11 Sep 24, 2024
4a70fe2
Make non-template function inline for ODR compliance
mmichel11 Sep 24, 2024
5530209
If the iters per work item is 1, then only compile the basic pfor kernel
mmichel11 Sep 24, 2024
90f19d4
Address several PR comments
mmichel11 Sep 25, 2024
1ac65b9
Remove free function __stride_recommender
mmichel11 Sep 25, 2024
6a5a562
Accept ranges as forwarding references in __parallel_for_large_submitter
mmichel11 Sep 25, 2024
357032f
Address reviewer comments
mmichel11 Nov 6, 2024
ca9e594
Introduce vectorized for-path for small types and parallel_backend_sy…
mmichel11 Dec 16, 2024
e4060f5
Improve testing and cleanup of code
mmichel11 Dec 16, 2024
283b053
clang format
mmichel11 Dec 16, 2024
75e4beb
Miscellaneous fixes identified during testing
mmichel11 Dec 17, 2024
7990bc1
clang-format
mmichel11 Dec 17, 2024
4aaa81f
Fix ordering to __vector_load call
mmichel11 Dec 17, 2024
65e4a68
Add support for vectorization with C++20 parallel range APIs
mmichel11 Dec 17, 2024
b4657a6
Add device copyable specializations for new walk patterns
mmichel11 Dec 17, 2024
3086dd3
Align vector_walk implementation with other vector functors
mmichel11 Dec 17, 2024
df17673
Add back non-spirv path
mmichel11 Dec 17, 2024
fd4e2c3
Further improve test coverage
mmichel11 Dec 17, 2024
58fd466
Restore original shift_left due to implicit implementation requiremen…
mmichel11 Dec 17, 2024
094124f
Fix issues in vectorized rotate
mmichel11 Dec 18, 2024
82135f6
Fix fpga parallel for compilation issues
mmichel11 Dec 18, 2024
e979118
Restore initial shift_left_right.pass.cpp
mmichel11 Dec 18, 2024
4bfaada
Fix test side issue when unnamed lambdas are disabled
mmichel11 Dec 18, 2024
8ae18db
Add a vector path specialization for std::swap_ranges
mmichel11 Dec 18, 2024
6cb11c7
General code cleanup
mmichel11 Dec 18, 2024
505bdf3
Bugfix with __pattern_swap using nanoranges
mmichel11 Dec 18, 2024
114924d
clang-format
mmichel11 Dec 19, 2024
845de21
Address applicable comments from PR #1870
mmichel11 Dec 20, 2024
71678d0
Refactor __lazy_ctor_storage deleter
mmichel11 Jan 2, 2025
8b0b18b
Address review comments
mmichel11 Jan 2, 2025
f7d9753
Remove intrusive test macro and adjust input sizes in test framework
mmichel11 Jan 4, 2025
83c5ca4
Make walk_scalar_base and walk_vector_or_scalar_base structs
mmichel11 Jan 4, 2025
fedd5de
Add missing max_n
mmichel11 Jan 4, 2025
08aa260
Add constructors for for-based bricks
mmichel11 Jan 4, 2025
a5eca96
Remove extraneous {} and add constructor to custom_brick
mmichel11 Jan 6, 2025
32612a1
Limit recursive searching of __min_nested_type_size to tuples
mmichel11 Jan 6, 2025
1336735
Work around compiler vectorization issue
mmichel11 Jan 6, 2025
c5e7d61
Add missing decays
mmichel11 Jan 7, 2025
0c6ca75
Add compile time check to ensure we do not get buffer pointer on host
mmichel11 Jan 7, 2025
be8aeda
Revert "Work around compiler vectorization issue"
mmichel11 Jan 7, 2025
86b9c89
Remove all begin() calls on views in vectorization paths
mmichel11 Jan 7, 2025
ffd95cc
Remove unused __is_passed_directly_range utility
mmichel11 Jan 7, 2025
537a6f0
Rename __scalar_path / __vector_path to __scalar_path_impl / __vector…
mmichel11 Jan 8, 2025
50a60ea
Correct __vector_walk deleters and a type in __reverse_copy
mmichel11 Jan 8, 2025
1081ab8
Set upper limit of 10,000,000 for get_pattern_for_max_n
mmichel11 Jan 9, 2025
9513edb
General cleanup and renaming for consistency
mmichel11 Jan 9, 2025
b274c8d
Explicitly list template types in specializations of __is_vectorizabl…
mmichel11 Jan 13, 2025
92f3374
Remove unnecessary local variables
mmichel11 Jan 14, 2025
eb2cdf8
Remove unnecessary local variables in async and numeric headers
mmichel11 Jan 14, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ namespace oneapi::dpl::experimental::kt::gpu::esimd::__impl
{

//------------------------------------------------------------------------
// Please see the comment for __parallel_for_submitter for optional kernel name explanation
// Please see the comment above __parallel_for_small_submitter for optional kernel name explanation
//------------------------------------------------------------------------

template <bool __is_ascending, ::std::uint8_t __radix_bits, ::std::uint16_t __data_per_work_item,
Expand Down
26 changes: 20 additions & 6 deletions include/oneapi/dpl/internal/async_impl/async_impl_hetero.h
Original file line number Diff line number Diff line change
Expand Up @@ -42,9 +42,13 @@ __pattern_walk1_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read_write, _ForwardIterator>();
auto __buf = __keep(__first, __last);

auto __view = __buf.all_view();

auto __future_obj = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf.all_view());
unseq_backend::walk1_vector_or_scalar<_ExecutionPolicy, _Function, decltype(__view)>{
__f, static_cast<std::size_t>(__n)},
__n, __view);
return __future_obj;
}

Expand All @@ -65,9 +69,14 @@ __pattern_walk2_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For
auto __keep2 = oneapi::dpl::__ranges::__get_sycl_range<__acc_mode2, _ForwardIterator2>();
auto __buf2 = __keep2(__first2, __first2 + __n);

auto __view1 = __buf1.all_view();
auto __view2 = __buf2.all_view();

auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf1.all_view(), __buf2.all_view());
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__view1), decltype(__view2)>{
__f, static_cast<std::size_t>(__n)},
__n, __view1, __view2);

return __future.__make_future(__first2 + __n);
}
Expand All @@ -91,10 +100,15 @@ __pattern_walk3_async(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _For
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::write, _ForwardIterator3>();
auto __buf3 = __keep3(__first3, __first3 + __n);

auto __future =
oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n,
__buf1.all_view(), __buf2.all_view(), __buf3.all_view());
auto __view1 = __buf1.all_view();
auto __view2 = __buf2.all_view();
auto __view3 = __buf3.all_view();

auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk3_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__view1), decltype(__view2),
decltype(__view3)>{__f, static_cast<size_t>(__n)},
__n, __view1, __view2, __view3);

return __future.__make_future(__first3 + __n);
}
Expand Down
37 changes: 26 additions & 11 deletions include/oneapi/dpl/internal/binary_search_impl.h
Original file line number Diff line number Diff line change
Expand Up @@ -37,13 +37,19 @@ enum class search_algorithm
binary_search
};

template <typename Comp, typename T, search_algorithm func>
struct custom_brick
#if _ONEDPL_BACKEND_SYCL
template <typename Comp, typename T, typename _Range, search_algorithm func>
struct custom_brick : oneapi::dpl::unseq_backend::walk_scalar_base<_Range>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets fix the naming of this while were touching all its instances __custom_brick

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the historical convention within the internal/ directory is to not use any leading underscores although it has changed a bit over time.

I do not have a strong preference if we make this change or leave it as is, but maybe it fits in a broader discussion regarding the remaining implementations in this directory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there is compelling reason other than resistance to making purely cosmetic changes in the changelog to have a different convention here. This is why I suggest adjusting it while we are already touching all (or most) instances of it. Perhaps someone with a longer historical knowledge of this code could chime in here if there is a reason to keep this with different conventions.

Not super important to me, so optional nitpick.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will wait a bit longer to see if anyone has objections. If not, then I will add this suggestion.

{
Comp comp;
T size;
bool use_32bit_indexing;

custom_brick(Comp comp, T size, bool use_32bit_indexing)
: comp(comp), size(size), use_32bit_indexing(use_32bit_indexing)
{
}

template <typename _Size, typename _ItemId, typename _Acc>
void
search_impl(_ItemId idx, _Acc acc) const
Expand All @@ -68,17 +74,23 @@ struct custom_brick
get<2>(acc[idx]) = (value != end_orig) && (get<1>(acc[idx]) == get<0>(acc[value]));
}
}

template <typename _ItemId, typename _Acc>
template <typename _IsFull, typename _ItemId, typename _Acc>
void
operator()(_ItemId idx, _Acc acc) const
__scalar_path_impl(_IsFull, _ItemId idx, _Acc acc) const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that we may try to improve this code by replacing run-time bool value use_32bit_indexing to compile-time indexing type specialization.
I found only 3 places with the code

const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();

so it's not big deal to add if statement outside and call __parallel_for inside for both branches with the different index types. But inside the brick we exclude condition check at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, I will reevaluate performance here and provide an update. The advantage of the current approach is that we only compile a single kernel whereas your suggestion may improve kernel performance with the cost of increased JIT overhead.

{
if (use_32bit_indexing)
search_impl<std::uint32_t>(idx, acc);
else
search_impl<std::uint64_t>(idx, acc);
}
template <typename _IsFull, typename _ItemId, typename _Acc>
void
operator()(_IsFull __is_full, _ItemId idx, _Acc acc) const
{
__scalar_path_impl(__is_full, idx, acc);
}
};
#endif

template <class _Tag, typename Policy, typename InputIterator1, typename InputIterator2, typename OutputIterator,
typename StrictWeakOrdering>
Expand Down Expand Up @@ -155,7 +167,8 @@ lower_bound_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, InputIt
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
__bknd::__parallel_for(
_BackendTag{}, ::std::forward<decltype(policy)>(policy),
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::lower_bound>{comp, size, use_32bit_indexing},
custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::lower_bound>{
danhoeflinger marked this conversation as resolved.
Show resolved Hide resolved
comp, size, use_32bit_indexing},
value_size, zip_vw)
.__deferrable_wait();
return result + value_size;
Expand Down Expand Up @@ -187,7 +200,8 @@ upper_bound_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, InputIt
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
__bknd::__parallel_for(
_BackendTag{}, std::forward<decltype(policy)>(policy),
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::upper_bound>{comp, size, use_32bit_indexing},
custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::upper_bound>{
comp, size, use_32bit_indexing},
value_size, zip_vw)
.__deferrable_wait();
return result + value_size;
Expand Down Expand Up @@ -217,10 +231,11 @@ binary_search_impl(__internal::__hetero_tag<_BackendTag>, Policy&& policy, Input
auto result_buf = keep_result(result, result + value_size);
auto zip_vw = make_zip_view(input_buf.all_view(), value_buf.all_view(), result_buf.all_view());
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
__bknd::__parallel_for(_BackendTag{}, std::forward<decltype(policy)>(policy),
custom_brick<StrictWeakOrdering, decltype(size), search_algorithm::binary_search>{
comp, size, use_32bit_indexing},
value_size, zip_vw)
__bknd::__parallel_for(
_BackendTag{}, std::forward<decltype(policy)>(policy),
custom_brick<StrictWeakOrdering, decltype(size), decltype(zip_vw), search_algorithm::binary_search>{
comp, size, use_32bit_indexing},
value_size, zip_vw)
.__deferrable_wait();
return result + value_size;
}
Expand Down
96 changes: 71 additions & 25 deletions include/oneapi/dpl/pstl/hetero/algorithm_impl_hetero.h
Original file line number Diff line number Diff line change
Expand Up @@ -56,8 +56,12 @@ __pattern_walk1(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _ForwardIt
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read_write, _ForwardIterator>();
auto __buf = __keep(__first, __last);

auto __view = __buf.all_view();
Copy link
Contributor

@SergeyKopienko SergeyKopienko Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really extract this into local variable? Will we have some profit?
We still have a lot of code without this local variable...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I did this to remove the duplicate all_view()s that arise but this actually seems to lead to more code looking at the diff. All of these added local view variables have been reverted in the PR.

oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, __exec, unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf.all_view())
_BackendTag{}, __exec,
unseq_backend::walk1_vector_or_scalar<_ExecutionPolicy, _Function, decltype(__view)>{
__f, static_cast<std::size_t>(__n)},
__n, __view)
.__deferrable_wait();
}

Expand Down Expand Up @@ -101,9 +105,14 @@ __pattern_walk2(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _ForwardIt
auto __keep2 = oneapi::dpl::__ranges::__get_sycl_range<__acc_mode2, _ForwardIterator2>();
auto __buf2 = __keep2(__first2, __first2 + __n);

auto __view1 = __buf1.all_view();
auto __view2 = __buf2.all_view();

auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n, __buf1.all_view(), __buf2.all_view());
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__view1), decltype(__view2)>{
__f, static_cast<std::size_t>(__n)},
__n, __view1, __view2);

// Call no wait, wait or deferrable wait depending on _WaitMode
__future.wait(_WaitMode{});
Expand All @@ -130,10 +139,28 @@ _ForwardIterator2
__pattern_swap(__hetero_tag<_BackendTag> __tag, _ExecutionPolicy&& __exec, _ForwardIterator1 __first1,
_ForwardIterator1 __last1, _ForwardIterator2 __first2, _Function __f)
{
return __pattern_walk2</*_WaitMode*/ __par_backend_hetero::__deferrable_mode,
__par_backend_hetero::access_mode::read_write,
__par_backend_hetero::access_mode::read_write>(
__tag, ::std::forward<_ExecutionPolicy>(__exec), __first1, __last1, __first2, __f);
auto __n = __last1 - __first1;
if (__n == 0)
return __first2;

auto __keep1 =
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read_write, _ForwardIterator1>();
auto __buf1 = __keep1(__first1, __last1);

auto __keep2 =
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read_write, _ForwardIterator2>();
auto __buf2 = __keep2(__first2, __first2 + __n);

auto __view1 = __buf1.all_view();
auto __view2 = __buf2.all_view();

auto __future = oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec),
unseq_backend::__brick_swap<_ExecutionPolicy, _Function, decltype(__view1), decltype(__view2)>{
__f, static_cast<std::size_t>(__n)},
__n, __view1, __view2);
__future.wait(__par_backend_hetero::__deferrable_mode{});
return __first2 + __n;
}

//------------------------------------------------------------------------
Expand All @@ -160,9 +187,15 @@ __pattern_walk3(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _ForwardIt
auto __keep3 = oneapi::dpl::__ranges::__get_sycl_range<__acc_mode3, _ForwardIterator3>();
auto __buf3 = __keep3(__first3, __first3 + __n);

oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk_n<_ExecutionPolicy, _Function>{__f}, __n,
__buf1.all_view(), __buf2.all_view(), __buf3.all_view())
auto __view1 = __buf1.all_view();
auto __view2 = __buf2.all_view();
auto __view3 = __buf3.all_view();

oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec),
unseq_backend::walk3_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__view1), decltype(__view2),
decltype(__view3)>{__f, static_cast<std::size_t>(__n)},
__n, __view1, __view2, __view3)
.__deferrable_wait();

return __first3 + __n;
Expand Down Expand Up @@ -1564,9 +1597,10 @@ __pattern_reverse(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Iterato
auto __keep = oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read_write, _Iterator>();
auto __buf = __keep(__first, __last);
oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::__reverse_functor<typename ::std::iterator_traits<_Iterator>::difference_type>{__n}, __n / 2,
__buf.all_view())
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec),
unseq_backend::__reverse_functor<typename std::iterator_traits<_Iterator>::difference_type,
decltype(__buf.all_view())>{__n},
__n / 2, __buf.all_view())
.__deferrable_wait();
}

Expand All @@ -1589,10 +1623,13 @@ __pattern_reverse_copy(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Bi
auto __keep2 =
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::write, _ForwardIterator>();
auto __buf2 = __keep2(__result, __result + __n);
auto __view1 = __buf1.all_view();
auto __view2 = __buf2.all_view();
oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::__reverse_copy<typename ::std::iterator_traits<_BidirectionalIterator>::difference_type>{__n},
__n, __buf1.all_view(), __buf2.all_view())
_BackendTag{}, std::forward<_ExecutionPolicy>(__exec),
unseq_backend::__reverse_copy<typename std::iterator_traits<_BidirectionalIterator>::difference_type,
decltype(__view1), decltype(__view2)>{__n},
__n, __view1, __view2)
.__deferrable_wait();

return __result + __n;
Expand Down Expand Up @@ -1626,24 +1663,27 @@ __pattern_rotate(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Iterator
auto __buf = __keep(__first, __last);
auto __temp_buf = oneapi::dpl::__par_backend_hetero::__buffer<_ExecutionPolicy, _Tp>(__exec, __n);

auto __view = __buf.all_view();
auto __temp_rng_w =
oneapi::dpl::__ranges::all_view<_Tp, __par_backend_hetero::access_mode::write>(__temp_buf.get_buffer());

const auto __shift = __new_first - __first;
oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, oneapi::dpl::__par_backend_hetero::make_wrapped_policy<__rotate_wrapper>(__exec),
unseq_backend::__rotate_copy<typename ::std::iterator_traits<_Iterator>::difference_type>{__n, __shift}, __n,
__buf.all_view(), __temp_rng_w);
unseq_backend::__rotate_copy<typename std::iterator_traits<_Iterator>::difference_type, decltype(__view),
decltype(__temp_rng_w)>{__n, __shift},
__n, __view, __temp_rng_w);

//An explicit wait isn't required here because we are working with a temporary sycl::buffer and sycl accessors and
//SYCL runtime makes a dependency graph to prevent the races between two __parallel_for patterns.

using _Function = __brick_move<__hetero_tag<_BackendTag>, _ExecutionPolicy>;
auto __brick = unseq_backend::walk_n<_ExecutionPolicy, _Function>{_Function{}};

auto __temp_rng_rw =
oneapi::dpl::__ranges::all_view<_Tp, __par_backend_hetero::access_mode::read_write>(__temp_buf.get_buffer());
oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec), __brick,
auto __brick =
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__temp_rng_rw),
decltype(__buf.all_view())>{_Function{}, static_cast<std::size_t>(__n)};
oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, std::forward<_ExecutionPolicy>(__exec), __brick,
__n, __temp_rng_rw, __buf.all_view())
.__deferrable_wait();

Expand Down Expand Up @@ -1673,13 +1713,16 @@ __pattern_rotate_copy(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Bid
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::write, _ForwardIterator>();
auto __buf2 = __keep2(__result, __result + __n);

auto __view1 = __buf1.all_view();
auto __view2 = __buf2.all_view();

const auto __shift = __new_first - __first;

oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
unseq_backend::__rotate_copy<typename ::std::iterator_traits<_BidirectionalIterator>::difference_type>{__n,
__shift},
__n, __buf1.all_view(), __buf2.all_view())
unseq_backend::__rotate_copy<typename ::std::iterator_traits<_BidirectionalIterator>::difference_type,
decltype(__view1), decltype(__view2)>{__n, __shift},
__n, __view1, __view2)
.__deferrable_wait();

return __result + __n;
Expand Down Expand Up @@ -1936,19 +1979,22 @@ __pattern_shift_left(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _Rang
if (__n >= __mid)
{
using _Function = __brick_move<__hetero_tag<_BackendTag>, _ExecutionPolicy>;
auto __brick = oneapi::dpl::unseq_backend::walk_n<_ExecutionPolicy, _Function>{_Function{}};

//TODO: to consider use just "read" access mode for a source range and just "write" - for a destination range.
auto __src = oneapi::dpl::__ranges::drop_view_simple<_Range, _DiffType>(__rng, __n);
auto __dst = oneapi::dpl::__ranges::take_view_simple<_Range, _DiffType>(__rng, __size_res);

auto __brick =
unseq_backend::walk2_vectors_or_scalars<_ExecutionPolicy, _Function, decltype(__src), decltype(__dst)>{
_Function{}, static_cast<std::size_t>(__size_res)};

oneapi::dpl::__par_backend_hetero::__parallel_for(_BackendTag{}, ::std::forward<_ExecutionPolicy>(__exec),
__brick, __size_res, __src, __dst)
.__deferrable_wait();
}
else //2. n < size/2; 'n' parallel copying
{
auto __brick = unseq_backend::__brick_shift_left<_ExecutionPolicy, _DiffType>{__size, __n};
auto __brick = unseq_backend::__brick_shift_left<_ExecutionPolicy, _DiffType, decltype(__rng)>{__size, __n};
oneapi::dpl::__par_backend_hetero::__parallel_for(
_BackendTag{},
oneapi::dpl::__par_backend_hetero::make_wrapped_policy<__shift_left_right>(
Expand Down
Loading
Loading