Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC for histogram CPU implementation #1930

Open
wants to merge 29 commits into
base: main
Choose a base branch
from
Open
Changes from 2 commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
a03577e
initial rough commit
danhoeflinger Oct 30, 2024
ce117f5
minor improvements
danhoeflinger Oct 30, 2024
ccc001e
revision
danhoeflinger Nov 1, 2024
d518a14
Formatting, minor
danhoeflinger Nov 1, 2024
6e03468
spelling and grammar
danhoeflinger Nov 1, 2024
10c4e50
Minor improvements
danhoeflinger Nov 1, 2024
efa7c9b
subsection
danhoeflinger Nov 1, 2024
1ac82fd
Adding some alternative approaches
danhoeflinger Nov 1, 2024
02523c4
minor improvements
danhoeflinger Nov 1, 2024
ac7b654
line widths
danhoeflinger Nov 4, 2024
506fb62
fixing numbering.
danhoeflinger Nov 6, 2024
1c6cb47
putting in specifics for TBB / OpenMP
danhoeflinger Nov 6, 2024
ceee3e3
Update Atomic strategy
danhoeflinger Nov 12, 2024
0711090
more clarity about serial backend and policy
danhoeflinger Nov 12, 2024
3c5ad12
minor corrections
danhoeflinger Nov 12, 2024
06a734f
c++17 -> c++20 fix
danhoeflinger Nov 13, 2024
b858a0e
Updates after some experimentation and thought
danhoeflinger Dec 16, 2024
53f4643
improvements from feedback
danhoeflinger Dec 20, 2024
d718e0e
thread enumerable storage +
danhoeflinger Dec 20, 2024
bb9e6f9
remove general language keep specifics to histogram
danhoeflinger Dec 20, 2024
17e0510
SIMD naming
danhoeflinger Dec 20, 2024
9614209
spelling
danhoeflinger Dec 20, 2024
2964a9e
clarifying thread enumerable storage
danhoeflinger Dec 20, 2024
9287fd2
minor improvements
danhoeflinger Dec 30, 2024
cdf5092
spelling
danhoeflinger Dec 30, 2024
215c2b7
adding link to implementation
danhoeflinger Dec 30, 2024
04d5127
rename to __enumerable_thread_local_storage
danhoeflinger Jan 15, 2025
fe1efa2
Added sections on complexity
danhoeflinger Jan 15, 2025
60ec0e5
spelling
danhoeflinger Jan 15, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 10 additions & 8 deletions rfcs/proposed/host_backend_histogram/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,9 +9,7 @@ for a full definition of the semantics of the histogram APIs. In short, they tak
classify them into either evenly distributed or user-defined bins via a list of separating values and count the number
of values in each bin, writing to a user-provided output histogram sequence. Currently, `histogram` is not supported
with serial, tbb, or openmp backends in our oneDPL implementation. This RFC aims to propose the implementation of
`histogram` for these host-side backends. The serial implementation is straightforward and is not worth discussing in
much length here. We will add it, but there is not much to discuss within the RFC, as its implementation will be
straightforward.
`histogram` for these host-side backends.

Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me share my thoughts:
In my understanding RFC is not a book... So, I would preferer to have a short, concise and precise description of what is offered, without frills, like a mathematical theorem. For example:

"The oneDPL library added histogram APIs, currently implemented only for device policies with the DPC++ backend. These APIs are defined in the oneAPI Specification 1.4. Please see the
oneAPI Specification for the details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending histogram support to these backends."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've accepted your language here. Thanks.

## Motivations
Users don't always want to use device policies and accelerators to run their code. It may make more sense in many cases
Expand Down Expand Up @@ -57,6 +55,10 @@ required to overcome the race conditions which we add from the additional concur
sense to decline to add vectorized operations within histogram depending on the implementation used, and based on
performance results.

### Serial Backend
Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/oneapi-src/oneDPL/pull/1930/files#diff-fb5f6394ad0d350d719b9f31b139fa60c347ec64795c78e56875c4f002aeb0e7R25
We already have the key requirements topic where we enumerate all backends that we propose to support.
It is good enough I think, and we also can omit this topic "Serial Backend".

Explanation what is "Serial Backend" means as the others backends mean, is a kind of "oneDPL general description" and not related to RFC for histogram feature, IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With some recent changes, there is some specifics about the serial implementation I wanted to add here so I've kept the section.

We plan to support a serial backend for histogram APIs in addition to openMP and TBB. This backend will handle all
policies types, but always provide a serial unvectorized implementation.

## Existing Patterns

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we intent to give some information about OneDPL parallel backend patterns on which histogram can based on, I would notify, there is not "count_if" pattern, there is "reduce"("transform_reduce") pattern.
When a man says "reduce", it becomes more or less obvious that histogram calculation based on reduce is not effective at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I clarified the language a little here to make it more clear that copy_if uses reduce internally. I still think it deserves some text describing it as it may not be immediately obvious to everyone that reduce is not well matched.

### count_if
Expand Down Expand Up @@ -110,9 +112,11 @@ To deal with atomics appropriately, we have some limitations. We must either use
specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic<T>`, however, this can
only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then
copying it to the output data. `C++20` provides `std::atomic_ref<T>` which would allow us to wrap user-provided output
data in an atomic wrapper, but we cannot assume `C++17` for all users. We could look to implement our own
`atomic_ref<T>` for C++17, but that would require specialization for individual compilers. OpenMP provides atomic
operations, but that is only available for the OpenMP backend.
data in an atomic wrapper, but we cannot assume `C++17` for all users. OpenMP provides atomic
akukanov marked this conversation as resolved.
Show resolved Hide resolved
operations, but that is only available for the OpenMP backend. The working plan is to implement a macro like
`_ONEDPL_ATOMIC_INCREMENT(var)` which uses an `std::atomic_ref` if available , and alternatively uses compiler builtins
like `InterlockedAdd` or `__atomic_fetch_add_n`. It needs to be investigated if we need to have any version which
needs to turn off the atomic implementation, due to lack of support by the compiler (I think this is unlikely).

It remains to be seen if atomics are worth their overhead and contention from a performance perspective and may depend
on the different approaches available.
Expand All @@ -137,8 +141,6 @@ better atomics should do vs redundant copies of the output.
does not provide a good fallback.

## Open Questions
* Would it be worthwhile to add our own implementation of `atomic_ref` for C++17? I believe this would require
specializations for each of our supported compilers.

* What is the overhead of atomics in general in this case and does the overhead there make them inherently worse than
merely having extra copies of the histogram and accumulating?
Expand Down
Loading