Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Host Implementation of Histogram APIs #1974

Open
wants to merge 46 commits into
base: main
Choose a base branch
from

Conversation

danhoeflinger
Copy link
Contributor

@danhoeflinger danhoeflinger commented Dec 18, 2024

Implementation of histogram APIs for host backends.

Implementations are provided for serial, tbb, and openMP backends. We add a generic __thread_enumerable_storage struct to add a generic thread local storage for our host backends. We use the new TLS (Thread local storage) with parallel_for to implement histogram. Testing is also added, and some minor adjustments are made to cmake.

Please see the RFC documentation / discussion here for more details.

@danhoeflinger danhoeflinger marked this pull request as ready for review December 30, 2024 20:59
@danhoeflinger danhoeflinger changed the title [Draft] Host Implementation of Histogram APIs Host Implementation of Histogram APIs Dec 30, 2024
@danhoeflinger danhoeflinger added this to the 2022.8.0 milestone Dec 30, 2024
include/oneapi/dpl/pstl/algorithm_impl.h Outdated Show resolved Hide resolved
include/oneapi/dpl/pstl/omp/util.h Outdated Show resolved Hide resolved
include/oneapi/dpl/pstl/omp/util.h Outdated Show resolved Hide resolved
include/oneapi/dpl/pstl/omp/util.h Outdated Show resolved Hide resolved
include/oneapi/dpl/pstl/omp/util.h Outdated Show resolved Hide resolved
Copy link
Contributor

@adamfidel adamfidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks generally good to me. I had to go through it a few times to understand __construct_by_args, but I think I get how it works now and why it's useful.

Just some minor comments from my side.

@@ -4289,6 +4289,86 @@ __pattern_shift_right(_Tag __tag, _ExecutionPolicy&& __exec, _BidirectionalItera
return __res.base();
}

template <typename _ForwardIterator, typename _IdxHashFunc, typename _RandomAccessIterator, class _IsVector>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why _IsVector is a class when the rest are typenames?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it seems the convention in this file is generally for everything to be class (though its not fully consistent). I'll adjust it to the norm.

std::uint32_t __count = 0;
std::uint32_t __j = 0;

for (; __j < __thread_specific_storage.size() && __count <= __i; ++__j)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::uint32_t __count = 0 could probably be moved here to the initialization expression of this loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, done.

}

std::uint32_t
size() const
Copy link
Contributor

@SergeyKopienko SergeyKopienko Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better to use std::size_t for size-related stuff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this is associated with the number of threads, is it necessary?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my point of view, if we describe anything about the size, better to use std::size_t type.
The exception - only one case when we save size in the memory and want to reduce the amount of memory.
But of course, I don't insist, it's not one place with another approach.

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
@danhoeflinger danhoeflinger force-pushed the dev/dhoefling/histogram_CPU_impl branch from d2f99dc to 455cf9c Compare January 13, 2025 15:04
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
Comment on lines +4349 to +4351
_DiffType __local_n = __global_histogram_last - __global_histogram_first;
std::uint32_t __num_temporary_copies = __tls.size();
_DiffType __range_begin_id = __global_histogram_first - __histogram_first;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_DiffType __local_n = __global_histogram_last - __global_histogram_first;
std::uint32_t __num_temporary_copies = __tls.size();
_DiffType __range_begin_id = __global_histogram_first - __histogram_first;
const _DiffType __local_n = __global_histogram_last - __global_histogram_first;
const std::uint32_t __num_temporary_copies = __tls.size();
const _DiffType __range_begin_id = __global_histogram_first - __histogram_first;

_PSTL_PRAGMA(omp single) { __thread_specific_storage.resize(omp_get_num_threads()); }
}

// Note: Size should not be used concurrantly with parallel loops which may instantiate storage objects, as it may
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Note: Size should not be used concurrantly with parallel loops which may instantiate storage objects, as it may
// Note: Size should not be used concurrently with parallel loops which may instantiate storage objects, as it may

__backend_tag{}, std::forward<_ExecutionPolicy>(__exec), __histogram_first, __histogram_first + __num_bins,
[__histogram_first, &__tls](auto __global_histogram_first, auto __global_histogram_last) {
_DiffType __local_n = __global_histogram_last - __global_histogram_first;
std::uint32_t __num_temporary_copies = __tls.size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the first parallel_for, __tls.size() should no longer grow, correct? What about moving __num_temporary_copies outside of this parallel_for and capturing it into the lambda? This way we only perform a single atomic load.

using _DiffType = typename std::iterator_traits<_RandomAccessIterator2>::difference_type;

_DiffType __n = __last - __first;
if (__n > 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't insist... But there is a common good practice to process a shorter branch of if..else operator first. It has better readability. Also, in case of immediate return from the shorter branch we have the "non indented" code of the bigger branch:

if(n ==0)
{
    //shorter branch
   return;
}
//else: bigger branch - no indents.
.....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants