Optimize merge_sort algorithm for largest data sizes #1977
base: main
Conversation
…introduce new function __find_start_point_in Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…introduce __parallel_merge_submitter_large for merge of biggest data sizes Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…using __parallel_merge_submitter_large for merge data equal to or greater than 4M items Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…fix compile error Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…fix Kernel names Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…rename template parameter names in __parallel_merge_submitter Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…fix review comment Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…fix review comment Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…introduce __starting_size_limit_for_large_submitter into __parallel_merge Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…renames Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…introduce _split_point_t type Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…remove usages of std::make_pair Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…optimize evaluation of split-points on base diagonals Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…renames Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…extract eval_split_points_for_groups function Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…extract run_parallel_merge function Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…using SLM bank size to define chunk in the eval_nd_range_params function Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…using SLM bank size to define chunk in the eval_nd_range_params function (16) Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…restore old implementation of __find_start_point Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…rename: base_diag_part -> steps_between_two_base_diags Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…fix review comment Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…fix an error in __parallel_merge_submitter_large::eval_split_points_for_groups Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…onals is too short Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…erge_submitter_large` into one `__parallel_merge_submitter` (#1956)
…fix review comment: remove extra condition check from __find_start_point_in Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…fix review comment: fix condition check in __find_start_point_in Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
…apply GitHub clang-format Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>
Not a full review, but enough feedback to group together. I still need time to digest the algorithm and look into the details.
const _IndexT __steps = oneapi::dpl::__internal::__dpl_ceiling_div(__rng_size, __chunk);

// TODO required to evaluate this value based on available SLM size for each work-group.
_IndexT __base_diag_count = tune_amount_of_base_diagonals(__rng_size, 32 * 1'024); // 32 Kb
I think more explanation of these magic numbers is warranted. Is it possible to consolidate this magic number with the one in `tune_amount_of_base_diagonals`? It seems like it should be equivalent to something like `__base_diag_count = bit_floor(ceil(__n_sorted / 8192))`.
The parameter `__amount_of_base_diagonals` has been removed from the `tune_amount_of_base_diagonals` function. In the other parts I propose to keep the current state as is.
About the `__base_diag_count` constant: we have this value in two submitters: for the `merge` and for the `merge_sort` algorithms.
So the formula is
base_diags = ceil_div(32 * 1024, bit_ceil(ceil_div(256 * 1024 * 1024, __n_sorted)))
We have two magic numbers here, and I don't know why either was chosen. Do they represent some characteristics of the target hardware?
From the calculation, it seems like in practice we are effectively targeting some "reasonable" size between base diagonals, while keeping the number of base diagonals a power of two. This makes some sense from a performance perspective in that, for the target hardware, there will likely be some fixed distance between base diagonals which optimizes the second kernel's performance. This is all speculation, though, since I haven't done any tuning here.
A simpler function, which should be similar in effect and whose magic number would have more meaning (to me), would be something like this:
__optimal_dist_between_base_diags = 8 * 1024;
__base_diag_count = bit_floor(__n_sorted / __optimal_dist_between_base_diags);
https://godbolt.org/z/TjPMMG44Y
They only differ once `__n_sorted` grows larger than your 256M magic number, at which point the inner ceiling division can do no more than round up to 1.
If the goal is to keep the distance between diagonals the same, then the simpler function does a better job of it (extending to higher numbers). However, this is just my speculation about why this function is successful at tuning the diagonal count. Please check this against your understanding and the performance of the code (and perhaps try different values of `__optimal_dist_between_base_diags` if you have time). I also don't know if there is a reason to keep a power-of-2 number of diagonals, but you may have good reasons to do that.
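To make the comparison concrete, here is a small standalone sketch of both formulas (my reconstruction; `ceil_div` written inline and `std::bit_ceil` / `std::bit_floor` stand in for the `__dpl_*` internals). The printed values agree exactly until `__n_sorted` exceeds 256M:

```cpp
#include <bit>      // std::bit_ceil, std::bit_floor (C++20)
#include <cstddef>
#include <cstdio>

// Current tuning: ceil_div(32 * 1024, bit_ceil(ceil_div(256M, n_sorted)))
std::size_t current_formula(std::size_t n_sorted)
{
    const std::size_t ratio = (256UL * 1024 * 1024 + n_sorted - 1) / n_sorted; // ceil_div
    const std::size_t k = std::bit_ceil(ratio);
    return (32UL * 1024 + k - 1) / k; // ceil_div
}

// Proposed tuning: bit_floor(n_sorted / optimal distance between base diagonals)
std::size_t simplified_formula(std::size_t n_sorted)
{
    const std::size_t optimal_dist_between_base_diags = 8 * 1024;
    return std::bit_floor(n_sorted / optimal_dist_between_base_diags);
}

int main()
{
    // Equal at 4M/32M/256M (512, 4096, 32768); diverge at 1G (32768 vs 131072).
    for (std::size_t n : {1UL << 22, 1UL << 25, 1UL << 28, 1UL << 30})
        std::printf("n_sorted=%zu current=%zu simplified=%zu\n",
                    n, current_formula(n), simplified_formula(n));
}
```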
        }
    });
});

if (2 * __n_sorted < __get_starting_size_limit_for_large_submitter<__value_type>())
Does it make sense that the starting size limit is the same for `merge` as it is for segments of `merge_sort`? I would expect a somewhat lower threshold for merge sort, since we can combine the calculations of base diagonals across multiple merge segments within merge sort in a single kernel.
Perhaps it has more to do with the size between base segments than with justifying the diagonal kernel, in which case it perhaps makes sense to have the same threshold. I think a discussion within a tech meeting would be helpful to consider this (see the sketch of the dispatch below).
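For concreteness, a hypothetical sketch of the dispatch in question. The 4M-item figure comes from this PR's commit history; the real limit is computed per `__value_type` inside `__get_starting_size_limit_for_large_submitter`, so both names and numbers below are assumptions:

```cpp
#include <cstddef>

// Assumed limit: ~4M items, per the commit history; the actual oneDPL value
// may scale with sizeof(_ValueType).
template <typename _ValueType>
constexpr std::size_t get_starting_size_limit_for_large_submitter()
{
    return 4 * 1024 * 1024;
}

// Each merge-sort pass merges pairs of sorted runs of length n_sorted, so the
// produced segment holds 2 * n_sorted items; only segments at or above the
// limit would go to the base-diagonal ("large") submitter.
template <typename _ValueType>
bool use_large_submitter(std::size_t n_sorted)
{
    return 2 * n_sorted >= get_starting_size_limit_for_large_submitter<_ValueType>();
}
```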
…ge_sort.h Co-authored-by: Dan Hoeflinger <109972525+danhoeflinger@users.noreply.github.com>
….h - fix review comment: rename run_parallel_merge to run_parallel_merge_from_diagonals
….h - fix review comment: using a common base diagonals container for all iterations with a big enough size
….h - fix review comment: using a common base diagonals container for all iterations with a big enough size
….h - fix review comment: add __portions parameter into eval_nd_range_params
….h - fix self-review comment: using own __no_op impl
….h - remove extra changes in __subgroup_bubble_sorter::sort()
….h - fix review comment: using a common base diagonals container for all iterations with a big enough size
….h - fix review comment: remove unrequired ternary operators
….h - fix review comments: remove extra ternary operators
….h - pack __amount_of_base_diagonals into tune_amount_of_base_diagonals
….h - pack __amount_of_base_diagonals into tune_amount_of_base_diagonals
….h - pack __amount_of_base_diagonals into tune_amount_of_base_diagonals (fix compile error)
// Multiply work per item by a power of 2 to reach the desired number of iterations.
// __dpl_bit_ceil rounds the ratio up to the next power of 2.
const std::size_t __k =
    oneapi::dpl::__internal::__dpl_bit_ceil((std::size_t)std::ceil(256 * 1024 * 1024 / __n_sorted));
Can `__dpl_ceiling_div` be used here instead of `std::ceil`, which involves a conversion to a floating-point type?
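For what it's worth, a tiny sketch of the concern (`ceiling_div` here is an illustrative stand-in for `__dpl_ceiling_div`): assuming `__n_sorted` is an integer type, `256 * 1024 * 1024 / __n_sorted` truncates before `std::ceil` ever runs, so an integer ceiling division would fix the rounding as well as avoid the float round trip:

```cpp
#include <cstddef>

// Illustrative stand-in for __dpl_ceiling_div.
constexpr std::size_t ceiling_div(std::size_t a, std::size_t b)
{
    return (a + b - 1) / b; // caller ensures a + b does not overflow
}

// The integer quotient is truncated before std::ceil sees it:
static_assert(256 * 1024 * 1024 / std::size_t(100'000'000) == 2, "truncated");
static_assert(ceiling_div(256 * 1024 * 1024, 100'000'000) == 3, "rounded up");
```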
….h - fix review comment: remove assertion from the Kernel code
In this PR we extend the approach from #1933 to the merge_sort algorithm.
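For readers arriving from #1933, here is a serial, runnable sketch of the two-kernel scheme as I understand it from the commit history (a reconstruction, not the actual oneDPL kernels; the names echo the PR's `__find_start_point_in`, `_split_point_t`, and `steps_between_two_base_diags`). A first pass computes merge-path split points only on sparse "base" diagonals; a second pass restricts each work-item's binary search to the band between its two neighboring base diagonals:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using split_point_t = std::pair<std::size_t, std::size_t>; // elements taken from (a, b)

// Merge-path search on diagonal `diag`, with the answer's first component
// known to lie in [x_lo, x_hi]: returns the stable split point (ties are
// consumed from `a` first, matching std::merge).
template <typename T>
split_point_t
find_start_point_in(const std::vector<T>& a, const std::vector<T>& b, std::size_t diag,
                    std::size_t x_lo, std::size_t x_hi)
{
    std::size_t lo = std::max(x_lo, diag > b.size() ? diag - b.size() : std::size_t(0));
    std::size_t hi = std::min({x_hi, diag, a.size()});
    while (lo < hi)
    {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (!(b[diag - 1 - mid] < a[mid])) // a[mid] <= b[diag-1-mid]: take more from a
            lo = mid + 1;
        else
            hi = mid;
    }
    return {lo, diag - lo};
}

// Merge `count` outputs starting at split point `sp` (one work-item's chunk).
template <typename T>
void
serial_merge(const std::vector<T>& a, const std::vector<T>& b, std::vector<T>& out,
             split_point_t sp, std::size_t count)
{
    std::size_t x = sp.first, y = sp.second, k = x + y;
    for (; count > 0 && k < out.size(); --count, ++k)
        out[k] = (x < a.size() && (y >= b.size() || !(b[y] < a[x]))) ? a[x++] : b[y++];
}

int main()
{
    std::vector<int> a(1000), b(1300);
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = int(3 * i);
    for (std::size_t i = 0; i < b.size(); ++i) b[i] = int(2 * i);

    const std::size_t n = a.size() + b.size();
    const std::size_t chunk = 16;                      // outputs per work-item
    const std::size_t steps = (n + chunk - 1) / chunk; // one "work-item" per step
    const std::size_t steps_between_two_base_diags = 8;
    const std::size_t base_diag_count =
        (steps + steps_between_two_base_diags - 1) / steps_between_two_base_diags;

    // "Kernel 1": split points on the sparse base diagonals only.
    std::vector<split_point_t> base_diags(base_diag_count + 1);
    for (std::size_t d = 0; d < base_diag_count; ++d)
        base_diags[d] = find_start_point_in(
            a, b, std::min(d * steps_between_two_base_diags * chunk, n), 0, a.size());
    base_diags[base_diag_count] = {a.size(), b.size()}; // sentinel: end of both inputs

    // "Kernel 2": each step binary-searches only between its bounding base diagonals.
    std::vector<int> out(n);
    for (std::size_t i = 0; i < steps; ++i)
    {
        const std::size_t d = i / steps_between_two_base_diags;
        const split_point_t sp = find_start_point_in(a, b, i * chunk,
                                                     base_diags[d].first,
                                                     base_diags[d + 1].first);
        serial_merge(a, b, out, sp, chunk);
    }

    std::vector<int> ref(n);
    std::merge(a.begin(), a.end(), b.begin(), b.end(), ref.begin());
    assert(out == ref); // matches a stable std::merge
}
```

The payoff is that each work-item's search window shrinks from the whole input to roughly `steps_between_two_base_diags * chunk` elements, which is why the extra base-diagonal kernel only pays off above the size threshold discussed in the review above.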