-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-implement SYCL backend parallel_for
to improve bandwidth utilization
#1976
base: main
Are you sure you want to change the base?
Conversation
parallel_for
to improve bandwidth utilizationparallel_for
to improve bandwidth utilization
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
128 byte memory operations are performed instead of 512 after inspecting the assembly. Processing 512 bytes per sub-group still seems to be the best value after experimentation. Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
…ute work for small inputs Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
This reverts commit e4cbceb. Small sizes slightly slower and for horizontal vectorization no "real" benefit is observed.
Small but measurable overheads can be observed for small inputs where runtime dispatch in the kernel is present to check for the correct path to take. Letting the compiler handle the the small input case in the original kernel shows the best performance. Signed-off-by: Matthew Michel <matthew.michel@intel.com>
We now flatten the user-provided ranges and find the minimum sized type to estimate the best __iters_per_work_item. This benefits performance in calls that wrap multiple buffers in a single input / output through a zip_iterator (e.g. dpct::scatter_if in SYCLomatic compatibility headers). Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
With uint8_t types, the icpx compiler fails to vectorize even when calling begin() on our range within a kernel to pull out a raw pointer. To work around this issue, begin() needs to be called on the host and passed to the kernel Signed-off-by: Matthew Michel <matthew.michel@intel.com>
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_for.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_for.h
Outdated
Show resolved
Hide resolved
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
These three cases are all unique when you consider that they define The three unique cases I mention are the following:
|
This reverts commit 1336735.
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
…_path_impl Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Thanks for the suggestion. I had previously considered this approach by having two function call operators and using template <typename _IsFull, typename _ItemId>
void
operator()(/*__vectorize*/std::true_type, _IsFull __is_full, const _ItemId __idx, _Range __rng) const
{
oneapi::dpl::__par_backend_hetero::__vector_walk<__base_t::__preferred_vector_size>{__n}(__is_full, __idx, __f,__rng);
}
// _IsFull is ignored here. We assume that boundary checking has been already performed for this index.
template <typename _IsFull, typename _ItemId>
void
operator()(/*__vectorize*/std::false_type, _IsFull, const _ItemId __idx, _Range __rng) const
{
__f(__rng[__idx]);
} This allows us to remove the Alternatively, we could replace |
This is well beyond the cutoff point to invoke the large submitter and prevents timeouts observed on devices with many compute units when testing CPU paths. Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
…e_range Signed-off-by: Matthew Michel <matthew.michel@intel.com>
void | ||
operator()(_ItemId idx, _Acc acc) const | ||
__scalar_path_impl(_IsFull, _ItemId idx, _Acc acc) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that we may try to improve this code by replacing run-time bool
value use_32bit_indexing
to compile-time indexing type specialization.
I found only 3 places with the code
const bool use_32bit_indexing = size <= std::numeric_limits<std::uint32_t>::max();
so it's not big deal to add if
statement outside and call __parallel_for
inside for both branches with the different index types. But inside the brick we exclude condition check at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, I will reevaluate performance here and provide an update. The advantage of the current approach is that we only compile a single kernel whereas your suggestion may improve kernel performance with the cost of increased JIT overhead.
@@ -56,8 +56,12 @@ __pattern_walk1(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _ForwardIt | |||
oneapi::dpl::__ranges::__get_sycl_range<__par_backend_hetero::access_mode::read_write, _ForwardIterator>(); | |||
auto __buf = __keep(__first, __last); | |||
|
|||
auto __view = __buf.all_view(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we really extract this into local variable? Will we have some profit?
We still have a lot of code without this local variable...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially, I did this to remove the duplicate all_view()
s that arise but this actually seems to lead to more code looking at the diff. All of these added local view variables have been reverted in the PR.
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
Signed-off-by: Matthew Michel <matthew.michel@intel.com>
High Level Description
This PR improves hardware bandwidth utilization of oneDPL's SYCL backend parallel for pattern through two ideas:
Implementation Details
binary_search
)To implement this approach, the parallel for kernel rewrite from #1870 was adopted with additional changes to handle vectorization paths. Additionally, generic vectorization and strided loop utilities have been defined with the intention for these to be applicable in other portions of the codebase as well. Tests have been expanded to ensure coverage of vectorization paths.
This PR will supersedes #1870. Initially, the plan was to merge this PR into 1870 but after comparing the diff, I believe the most straightforward approach will be to target this directly to main.