Host Implementation of Histogram APIs #1974

danhoeflinger · 2024-12-18T18:28:26Z

Implementation of histogram APIs for host backends.

Implementations are provided for the serial, TBB, and OpenMP backends. We add a generic __thread_enumerable_storage struct that provides thread-local storage (TLS) for our host backends, and we use it with parallel_for to implement histogram. Testing is also added, and some minor adjustments are made to CMake.
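To illustrate the general idea of per-thread temporary histograms followed by a merge pass, here is a minimal sketch. It is not the code from this PR: the function name two_phase_histogram, its parameters, and the rule that out-of-range bin ids are skipped are all assumptions made for the example, and the workers are simulated with a plain loop over chunks rather than a real parallel_for.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch, not the oneDPL implementation.
// Phase 1 fills one private histogram per worker, so no atomics are needed while counting.
// Phase 2 sums the private copies bin by bin into the final histogram.
// Workers are simulated serially here; a real backend would run each chunk on its own thread.
std::vector<std::uint32_t>
two_phase_histogram(const std::vector<std::size_t>& bin_ids, std::size_t num_bins, std::size_t num_workers)
{
    assert(num_workers > 0);

    std::vector<std::vector<std::uint32_t>> private_copies(num_workers,
                                                           std::vector<std::uint32_t>(num_bins, 0));

    const std::size_t n = bin_ids.size();
    const std::size_t chunk = (n + num_workers - 1) / num_workers;
    for (std::size_t w = 0; w < num_workers; ++w)
    {
        const std::size_t begin = std::min(w * chunk, n);
        const std::size_t end = std::min(begin + chunk, n);
        for (std::size_t i = begin; i < end; ++i)
        {
            if (bin_ids[i] < num_bins) // assumption: out-of-range ids are ignored
                ++private_copies[w][bin_ids[i]];
        }
    }

    // Merge pass: each bin of the result is the sum of that bin across all private copies.
    std::vector<std::uint32_t> result(num_bins, 0);
    for (std::size_t b = 0; b < num_bins; ++b)
        for (const auto& copy : private_copies)
            result[b] += copy[b];
    return result;
}
```

The merge pass iterates over bins in the outer loop because that is the dimension a second parallel loop could split across threads without write conflicts.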

Please see the RFC documentation / discussion here for more details.

adamfidel

Looks generally good to me. I had to go through it a few times to understand __construct_by_args, but I think I get how it works now and why it's useful.

Just some minor comments from my side.

adamfidel · 2025-01-10T19:45:44Z

include/oneapi/dpl/pstl/algorithm_impl.h

@@ -4289,6 +4289,86 @@ __pattern_shift_right(_Tag __tag, _ExecutionPolicy&& __exec, _BidirectionalItera
    return __res.base();
 }

+template <typename _ForwardIterator, typename _IdxHashFunc, typename _RandomAccessIterator, class _IsVector>


Is there a reason why _IsVector is a class when the rest are typenames?

No, it seems the convention in this file is generally to use class for everything (though it's not fully consistent). I'll adjust it to match.

adamfidel · 2025-01-10T20:50:22Z

include/oneapi/dpl/pstl/omp/util.h

+        std::uint32_t __count = 0;
+        std::uint32_t __j = 0;
+
+        for (; __j < __thread_specific_storage.size() && __count <= __i; ++__j)


std::uint32_t __count = 0 could probably be moved here to the initialization expression of this loop.

thanks, done.
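For readers without the full diff, the suggestion amounts to something like the sketch below. The function get_ith_instantiated and the use of std::unique_ptr<int> slots are hypothetical stand-ins, not the actual members of the storage class; the sketch assumes that slots are instantiated lazily and that the loop index is still needed after the loop.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for the storage under review: one slot per thread,
// where a slot stays empty until its thread first touches it.
// Returns the i-th instantiated slot, skipping empty ones.
// Precondition: at least i + 1 slots are instantiated.
int&
get_ith_instantiated(std::vector<std::unique_ptr<int>>& slots, std::uint32_t i)
{
    // j stays outside the loop because it is read after the loop ends (assumed here).
    std::uint32_t j = 0;
    // count is only used inside the loop, so it can live in the init expression.
    for (std::uint32_t count = 0; j < slots.size() && count <= i; ++j)
    {
        if (slots[j])
            ++count;
    }
    // The loop stops one past the slot that pushed count above i.
    assert(j > 0 && slots[j - 1]);
    return *slots[j - 1];
}
```

Narrowing the scope of count this way changes no behavior; it only makes it clear that the counter has no meaning once the loop finishes.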

SergeyKopienko · 2025-01-13T11:50:54Z

include/oneapi/dpl/pstl/parallel_backend_serial.h

+    }
+
+    std::uint32_t
+    size() const


I think it is better to use std::size_t for size-related values.

Again, this is tied to the number of threads. Is it necessary?

From my point of view, anything that describes a size is better expressed as std::size_t.
The only exception is when we store the size in memory and want to reduce the memory footprint.
But of course, I don't insist; this is not the only place that takes a different approach.

Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>

SergeyKopienko · 2025-01-14T11:20:03Z

include/oneapi/dpl/pstl/algorithm_impl.h

+                _DiffType __local_n = __global_histogram_last - __global_histogram_first;
+                std::uint32_t __num_temporary_copies = __tls.size();
+                _DiffType __range_begin_id = __global_histogram_first - __histogram_first;


Suggested change

-                _DiffType __local_n = __global_histogram_last - __global_histogram_first;
-                std::uint32_t __num_temporary_copies = __tls.size();
-                _DiffType __range_begin_id = __global_histogram_first - __histogram_first;
+                const _DiffType __local_n = __global_histogram_last - __global_histogram_first;
+                const std::uint32_t __num_temporary_copies = __tls.size();
+                const _DiffType __range_begin_id = __global_histogram_first - __histogram_first;

mmichel11 · 2025-01-14T16:53:47Z

include/oneapi/dpl/pstl/omp/util.h

+        _PSTL_PRAGMA(omp single) { __thread_specific_storage.resize(omp_get_num_threads()); }
+    }
+
+    // Note: Size should not be used concurrantly with parallel loops which may instantiate storage objects, as it may


Suggested change

-    // Note: Size should not be used concurrantly with parallel loops which may instantiate storage objects, as it may
+    // Note: Size should not be used concurrently with parallel loops which may instantiate storage objects, as it may

mmichel11 · 2025-01-14T17:02:10Z

include/oneapi/dpl/pstl/algorithm_impl.h

+            __backend_tag{}, std::forward<_ExecutionPolicy>(__exec), __histogram_first, __histogram_first + __num_bins,
+            [__histogram_first, &__tls](auto __global_histogram_first, auto __global_histogram_last) {
+                _DiffType __local_n = __global_histogram_last - __global_histogram_first;
+                std::uint32_t __num_temporary_copies = __tls.size();


After the first parallel_for, __tls.size() should no longer grow, correct? What about moving __num_temporary_copies outside of this parallel_for and capturing it into the lambda? This way we only perform a single atomic load.
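A minimal sketch of the hoisting being proposed, under the stated assumption that the size no longer changes once the first loop has finished. The counted_storage type, its load_count instrumentation, and the serial for_each_chunk helper are all hypothetical and exist only to make the single load observable; they are not part of the PR.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical storage whose size() performs an atomic load.
struct counted_storage
{
    std::atomic<std::uint32_t> num_elements{0};
    mutable std::uint32_t load_count = 0; // instrumentation for this sketch only

    std::uint32_t
    size() const
    {
        ++load_count;
        return num_elements.load();
    }
};

// Serial stand-in for a parallel_for that invokes the body once per chunk.
template <typename Body>
void
for_each_chunk(std::uint32_t num_chunks, Body body)
{
    for (std::uint32_t c = 0; c < num_chunks; ++c)
        body(c);
}

std::uint32_t
sum_copies_hoisted(const counted_storage& tls, std::uint32_t num_chunks)
{
    // One load before the loop, captured by value into the lambda.
    const std::uint32_t num_temporary_copies = tls.size();
    std::uint32_t total = 0;
    for_each_chunk(num_chunks, [num_temporary_copies, &total](std::uint32_t) {
        total += num_temporary_copies; // no call to size() per chunk
    });
    // The hoist is only valid if no new copies appeared while the loop ran.
    assert(tls.num_elements.load() == num_temporary_copies);
    return total;
}
```

With four chunks, size() is called once instead of four times; whether that matters in practice depends on how costly the atomic load is relative to the work done per chunk.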

MikeDvorskiy · 2025-01-15T10:29:24Z

include/oneapi/dpl/pstl/algorithm_impl.h

+    using _DiffType = typename std::iterator_traits<_RandomAccessIterator2>::difference_type;
+
+    _DiffType __n = __last - __first;
+    if (__n > 0)


I don't insist, but it is common good practice to handle the shorter branch of an if..else first, because it reads better. Also, when the shorter branch returns immediately, the longer branch needs no extra indentation:

if (n == 0)
{
    // shorter branch
    return;
}
// else: longer branch, no indentation
.....

danhoeflinger mentioned this pull request Dec 19, 2024

RFC for histogram CPU implementation #1930

Open

danhoeflinger marked this pull request as ready for review December 30, 2024 20:59

danhoeflinger changed the title ~~[Draft] Host Implementation of Histogram APIs~~ Host Implementation of Histogram APIs Dec 30, 2024

danhoeflinger requested review from reble, timmiesmith, akukanov, MikeDvorskiy, SergeyKopienko, mmichel11, adamfidel and dmitriy-sobolev December 30, 2024 21:00

danhoeflinger added this to the 2022.8.0 milestone Dec 30, 2024

mmichel11 reviewed Jan 3, 2025

mmichel11 reviewed Jan 7, 2025

adamfidel reviewed Jan 10, 2025
