Optimize merge_sort algorithm for largest data sizes #1977

Open · wants to merge 114 commits into base: main
Conversation

SergeyKopienko (Contributor) commented Dec 19, 2024

In this PR we extend the approach from #1933 to the merge_sort algorithm.

…introduce new function __find_start_point_in
…introduce __parallel_merge_submitter_large for merge of biggest data sizes
…using __parallel_merge_submitter_large for merge data equal or greater then 4M items
…fix compile error
…fix Kernel names
…rename template parameter names in __parallel_merge_submitter
…fix review comment
…fix review comment
…introduce __starting_size_limit_for_large_submitter into __parallel_merge
…renames
…introduce _split_point_t type
…remove usages of std::make_pair
…optimize evaluation of split-points on base diagonals
…renames
…extract eval_split_points_for_groups function
…extract run_parallel_merge function
…using SLM bank size to define chunk in the eval_nd_range_params function
…using SLM bank size to define chunk in the eval_nd_range_params function (16)
…restore old implementation of __find_start_point
…rename: base_diag_part -> steps_between_two_base_diags
…fix review comment
…fix an error in __parallel_merge_submitter_large::eval_split_points_for_groups
…onals is too short
…erge_submitter_large` into one `__parallel_merge_submitter` (#1956)
…fix review comment: remove extra condition check from __find_start_point_in
…fix review comment: fix condition check in __find_start_point_in
…apply GitHUB clang format

(Each commit is Signed-off-by: Sergey Kopienko <sergey.kopienko@intel.com>.)
danhoeflinger (Contributor) left a comment

Not a full review, but enough feedback to group together. I still need time to digest the algorithm and look into the details.

const _IndexT __steps = oneapi::dpl::__internal::__dpl_ceiling_div(__rng_size, __chunk);

// TODO required to evaluate this value based on available SLM size for each work-group.
_IndexT __base_diag_count = tune_amount_of_base_diagonals(__rng_size, 32 * 1'024); // 32 Kb
Contributor:

I think more explanation of these magic numbers is warranted. Is it possible to consolidate this magic number with the one in tune_amount_of_base_diagonals?
It seems like it should be equivalent to something like __base_diag_count = bit_floor(ceil(__nsorted/8192)).

Contributor Author:

The parameter __amount_of_base_diagonals has been removed from the tune_amount_of_base_diagonals function.
Elsewhere I propose to keep the current state as is.
About the __base_diag_count constant: we have this value in two submitters, for the `merge` and `merge_sort` algorithms.

danhoeflinger (Contributor) commented Jan 14, 2025

So the formula is

base_diags = ceil_div(32 * 1024, bit_ceil(ceil_div(256 * 1024 * 1024, __n_sorted)))

We have two magic numbers here, and I don't know why either is chosen. Do they represent some characteristics of the target hardware?

From the calculation, in practice, it seems like we are effectively targeting some "reasonable" size between base diagonals, while keeping the number of base diagonals a power of two. This makes some sense from a performance perspective in that for target hardware, there will likely be some fixed distance between base diagonals which optimizes the second kernel's performance. This is all speculation though, since I haven't done any tuning here.

A simpler function, which should be similar in effect and whose magic number would have more meaning (to me), would be something like this:

__optimal_dist_between_base_diags = 8 * 1024;
__base_diag_count = bit_floor(__n_sorted / __optimal_dist_between_base_diags);

https://godbolt.org/z/TjPMMG44Y

They only differ once __n_sorted grows larger than your 256M magic number, and so the ceiling division does more work to round up to 1.

If the goal is to keep the distance between diagonals the same, then the simpler function does a better job of it (extending to higher numbers). However, this is just my speculation about why this function is successful at tuning the diagonal count. Please check it against your understanding and the performance of the code (and perhaps try different values of __optimal_dist_between_base_diags if you have time). I also don't know if there is a reason to keep a power-of-2 number of diagonals, but you may have good reason to do that.

}
});
});
if (2 * __n_sorted < __get_starting_size_limit_for_large_submitter<__value_type>())
Contributor:

Does it make sense that the starting size limit is the same for "merge" as it is for segments of merge_sort?
I would expect that it would be a somewhat lower threshold for merge sort, since we can combine calculations of base diagonals across multiple segments of merge within merge sort in a single kernel.

Perhaps it has more to do with the size between base segments than with justifying the diagonal kernel, in which case it may make sense to have the same threshold. I think a discussion in a tech meeting would be helpful to work through this.

SergeyKopienko and others added 3 commits January 13, 2025 10:21
…ge_sort.h

Co-authored-by: Dan Hoeflinger <109972525+danhoeflinger@users.noreply.github.com>
….h - fix review comment: rename run_parallel_merge to run_parallel_merge_from_diagonals
….h - fix review comment: using common base diagonal's container for all iterations with the enough big size
….h - fix review comment: using common base diagonal's container for all iterations with the enough big size
….h - fix review comment: add __portions parameter into eval_nd_range_params
….h - fix self-review comment: using own __no_op impl
….h - remove extra changes in __subgroup_bubble_sorter::sort()
….h - fix review comment: using common base diagonal's container for all iterations with the enough big size
….h - fix review comments: remove extra ternary operators
….h - pack __amount_of_base_diagonals into tune_amount_of_base_diagonals
….h - pack __amount_of_base_diagonals into tune_amount_of_base_diagonals
….h - pack __amount_of_base_diagonals into tune_amount_of_base_diagonals (fix compile error)
// Multiply work per item by a power of 2 to reach the desired number of iterations.
// __dpl_bit_ceil rounds the ratio up to the next power of 2.
const std::size_t __k =
oneapi::dpl::__internal::__dpl_bit_ceil((std::size_t)std::ceil(256 * 1024 * 1024 / __n_sorted));
Contributor:

Can __dpl_ceiling_div be used here instead of std::ceil, which involves a conversion to a floating point type?

3 participants