Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC for histogram CPU implementation #1930

Open
wants to merge 29 commits into
base: main
Choose a base branch
from
Open
Changes from 23 commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
a03577e
initial rough commit
danhoeflinger Oct 30, 2024
ce117f5
minor improvements
danhoeflinger Oct 30, 2024
ccc001e
revision
danhoeflinger Nov 1, 2024
d518a14
Formatting, minor
danhoeflinger Nov 1, 2024
6e03468
spelling and grammar
danhoeflinger Nov 1, 2024
10c4e50
Minor improvements
danhoeflinger Nov 1, 2024
efa7c9b
subsection
danhoeflinger Nov 1, 2024
1ac82fd
Adding some alternative approaches
danhoeflinger Nov 1, 2024
02523c4
minor improvements
danhoeflinger Nov 1, 2024
ac7b654
line widths
danhoeflinger Nov 4, 2024
506fb62
fixing numbering.
danhoeflinger Nov 6, 2024
1c6cb47
putting in specifics for TBB / OpenMP
danhoeflinger Nov 6, 2024
ceee3e3
Update Atomic strategy
danhoeflinger Nov 12, 2024
0711090
more clarity about serial backend and policy
danhoeflinger Nov 12, 2024
3c5ad12
minor corrections
danhoeflinger Nov 12, 2024
06a734f
c++17 -> c++20 fix
danhoeflinger Nov 13, 2024
b858a0e
Updates after some experimentation and thought
danhoeflinger Dec 16, 2024
53f4643
improvements from feedback
danhoeflinger Dec 20, 2024
d718e0e
thread enumerable storage +
danhoeflinger Dec 20, 2024
bb9e6f9
remove general language keep specifics to histogram
danhoeflinger Dec 20, 2024
17e0510
SIMD naming
danhoeflinger Dec 20, 2024
9614209
spelling
danhoeflinger Dec 20, 2024
2964a9e
clarifying thread enumerable storage
danhoeflinger Dec 20, 2024
9287fd2
minor improvements
danhoeflinger Dec 30, 2024
cdf5092
spelling
danhoeflinger Dec 30, 2024
215c2b7
adding link to implementation
danhoeflinger Dec 30, 2024
04d5127
rename to __enumerable_thread_local_storage
danhoeflinger Jan 15, 2025
fe1efa2
Added sections on complexity
danhoeflinger Jan 15, 2025
60ec0e5
spelling
danhoeflinger Jan 15, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
155 changes: 155 additions & 0 deletions rfcs/proposed/host_backend_histogram/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,155 @@
# Host Backends Support for the Histogram APIs

## Introduction
The oneDPL library added histogram APIs, currently implemented only for device policies with the DPC++ backend. These
APIs are defined in the oneAPI Specification 1.4. Please see the
[oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms)
for the details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending
histogram support to these backends.

Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me share my thoughts:
In my understanding RFC is not a book... So, I would preferer to have a short, concise and precise description of what is offered, without frills, like a mathematical theorem. For example:

"The oneDPL library added histogram APIs, currently implemented only for device policies with the DPC++ backend. These APIs are defined in the oneAPI Specification 1.4. Please see the
oneAPI Specification for the details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending histogram support to these backends."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've accepted your language here. Thanks.

## Motivations
There are many cases to use a host-side serial or a host-side implementation of histogram. Another motivation for adding
the support is simply to be spec compliant with the oneAPI specification.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to it is not a story telling, I would suggest omitting introductory expressions like "It may make more sense" or "It's natural for a user to expect"... Only short and exact information.

For example,
"There are many cases to use a host-side serial or a host-side implementation of histogram. Another motivation for adding the support is simply to be spec compliant with the oneAPI specification."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taken suggestion. Thanks

## Design Considerations

### Key Requirements
Provide support for the `histogram` APIs with the following policies and backends:
- Policies: `seq`, `unseq`, `par`, `par_unseq`
- Backends: `serial`, `tbb`, `openmp`

Users have a choice of execution policies when calling oneDPL APIs. They also have a number of options of backends
which they can select from when using oneDPL. It is important that all combinations of these options have support for
the `histogram` APIs.

### Performance
With little computation, a histogram algorithm is likely a memory-bound algorithm. So, the implementation prioritize
reducing memory accesses and minimizing temporary memory traffic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking into account my shared thought above, I would propose to re-prahse it keeping the main point shorter:

"A histogram algorithm is a memory-bound algorithm. So, the implementation should care of reducing memory accesses and minimizing temporary memory traffic."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taken mostly. Thanks

### Memory Footprint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we wish to specify space / computational complexity of the proposed algorithm somewhere, or is this too much?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some sections on computation complexity and temporary memory requirements.

I don't think we should add something like this to the specification though, because I don't want to dictate the algorithm used from the specification, but its good to document it in the current proposal. Thanks.

There are no guidelines here from the standard library as this is an extension API. Still, we will minimize memory
footprint where possible.

### Code Reuse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can this topic omit at all. It tells nothing about 'histogram', just general wording, which can be applied for any new feature in oneDPL...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed some of the general language and added something which is important for histogram in an attempt to answer feedback from @akukanov to clarify where the implementation of the algorithm will live.

We want to minimize adding requirements for parallel backends to implement, and lift as much as possible to the
algorithm implementation level. We should be able to avoid adding a `__parallel_histogram` call in the individual
backends, and instead rely upon `__parallel_for`.

### SIMD/openMP SIMD Implementation
Currently oneDPL relies upon openMP SIMD to provide its vectorization, which is designed to provide vectorization across
loop iterations. OneDPL does not directly use any intrinsics which may offer more complex functionality than what is
provided by OpenMP.

There are a few parts of the histogram algorithm to consider. For the calculation to determine which bin to increment
there are two APIs, even and custom range which have significantly different methods to determine the bin to
increment. For the even bin API, the calculations to determine selected bin have some opportunity for vectorization as
each input has the same mathematical operations applied to each. However, for the custom range API, each input element
uses a binary search through a list of bin boundaries to determine the selected bin. This operation will have a
different length and control flow based upon each input element and will be very difficult to vectorize.

Next, lets consider the increment operation itself. This operation increments a data dependent bin location, and may
result in conflicts between elements of the same vector. This increment operation therefore is unvectorizable without
more complex handling. Some hardware does implement SIMD conflict detection via specific intrinsics, but this is not
available via OpenMP SIMD. Alternatively, we can multiply our number of temporary histogram copies by a factor of the
vector width, but it is unclear if it is worth the overhead. OpenMP SIMD provides an `ordered` structured block which
we can use to exempt the increment from SIMD operations as well. However, this often results in vectorization being
refused by the compiler. Initial implementation will avoid vectorization of this main histogram loop.

Last, for our below proposed implementation there is the task of combining temporary histogram data into the global
output histogram. This is directly vectorizable via our existing brick_walk implementation, and will be vectorized when
a vector policy is used.

### Serial Backend
Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/oneapi-src/oneDPL/pull/1930/files#diff-fb5f6394ad0d350d719b9f31b139fa60c347ec64795c78e56875c4f002aeb0e7R25
We already have the key requirements topic where we enumerate all backends that we propose to support.
It is good enough I think, and we also can omit this topic "Serial Backend".

Explanation what is "Serial Backend" means as the others backends mean, is a kind of "oneDPL general description" and not related to RFC for histogram feature, IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With some recent changes, there is some specifics about the serial implementation I wanted to add here so I've kept the section.

We plan to support a serial backend for histogram APIs in addition to openMP and TBB. This backend will handle all
policies types, but always provide a serial unvectorized implementation. To make this backend compatible with the other
approaches, we will use a single temporary histogram copy, which then is copied to the final global histogram.

## Existing Patterns

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we intent to give some information about OneDPL parallel backend patterns on which histogram can based on, I would notify, there is not "count_if" pattern, there is "reduce"("transform_reduce") pattern.
When a man says "reduce", it becomes more or less obvious that histogram calculation based on reduce is not effective at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I clarified the language a little here to make it more clear that copy_if uses reduce internally. I still think it deserves some text describing it as it may not be immediately obvious to everyone that reduce is not well matched.

### count_if
`histogram` is similar to `count_if` in that it conditionally increments a number of counters based upon the data in a
sequence. `count_if` returns a scalar-typed value and doesn't provide any function to modify the variable being
incremented. Using `count_if` without significant modification would require us to loop through the entire sequence for
each output bin in the histogram. From a memory bandwidth perspective, this is untenable. Similarly, using a
`histogram` pattern to implement `count_if` is unlikely to provide a well-performing result in the end, as contention
should be far higher, and `reduce` is a very well-matched pattern performance-wise.

### parallel_for
`parallel_for` is an interesting pattern in that it is very generic and embarrassingly parallel. This is close to what
we need for `histogram`. However, we cannot simply use it without any added infrastructure. If we were to just use
`parallel_for` alone, there would be a race condition between threads when incrementing the values in the output
histogram. We should be able to use `parallel_for` as a building block for our implementation, but it requires some way
to synchronize and accumulate between threads.


## Alternative Approaches

### Atomics
This method uses atomic operations to remove the race conditions during accumulation. With atomic increments of the
output histogram data, we can merely run a `parallel_for` pattern.

To deal with atomics appropriately, we have some limitations. We must either use standard library atomics, atomics
specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic<T>`, however, this can
only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then
copying it to the output data. `C++20` provides `std::atomic_ref<T>` which would allow us to wrap user-provided output
data in an atomic wrapper, but we cannot assume `C++20` for all users. OpenMP provides atomic
operations, but that is only available for the OpenMP backend. The working plan was to implement a macro like
`_ONEDPL_ATOMIC_INCREMENT(var)` which uses an `std::atomic_ref` if available, and alternatively uses compiler builtins
like `InterlockedAdd` or `__atomic_fetch_add_n`. In a proof of concept implementation,this seemed to work, but does
reach more into details than compiler / OS specifics than is desired for implementations prior to `C++20`.

After experimenting with a proof of concept implementation of this implementation, it seems that the atomic
implementation has very limited applicability to real cases. We explored a spectrum of number of elements combined with
number of bins with both OpenMP and TBB. There was some subset of cases for which the atomics implementation
outperformed the proposed implementation (below). However, this was generally limited to some specific cases where
the number of bins was very large (~1 Million), and even for this subset significant benefit was only found for cases
with a small number for input elements relative to number of bins. This makes sense because the atomic implementation
is able to avoid the overhead of allocating and initializing temporary histogram copies, which is largest when
the number of bins is large compared to the number of input elements. With many bins, contention on atomics is also
limited as compared to the embarassingly parallel proposal which does experience this contention.

When we examine the real world utility of these cases, we find that they are uncommon and unlikely to be the important
use cases. Histograms generally are used to categorize large images or arrays into a smaller number of bins to
characterize the result. Cases for which there are similar or more bins than input elements are not very practical in
practice. The maintenance and complexity cost associated with supporting and maintaining a second implementation to
serve this subset of cases does not seem to be justified. Therefore, this implementation has been discarded at this
time.

### Other Unexplored Approaches
* One could consider some sort of locking approach which locks mutexes for subsections of the output histogram prior to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I have a curiosity question. Which approach does NVidia use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NVidia has a similar API within CUB but not within Thrust, and therefore does not have a CPU implementation that I am aware of, only one specifically for a GPU device.

modifying them. It's possible such an approach could provide a similar approach to atomics, but with different
overhead trade-offs. It seems quite likely that this would result in more overhead, but it could be worth exploring.

* Another possible approach could be to do something like the proposed implementation one, but with some sparse
representation of output data. However, I think the general assumptions we can make about the normal case make this
less likely to be beneficial. It is quite likely that `n` is much larger than the output histograms, and that a large
percentage of the output histogram may be occupied, even when considering dividing the input amongst multiple
threads. This could be explored if we find temporary storage is too large for some cases and the atomic approach
does not provide a good fallback.

## Proposal
After exploring the above implementation for `histogram`, the following proposal better represents the use
cases which are important, and provides reasonable performance for most cases.

### Embarrassingly Parallel Via Temporary Histograms
This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does Embarrassingly Parallel term mean?

Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Dec 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Update: got it.. https://en.wikipedia.org/wiki/Embarrassingly_parallel

  2. Of course, if you are solving a concrete task and you are allowed to use the all machine recourses, and there are no any other workloads on the node, the best way for histogram calculation - to make static dividing of amount of work, each thread is calculating a local histogram, and after the local histograms are reducing into one.

  3. But, talking about parallelism in a kind of general library we have to keep in mind that a final user's application can work in "different circumstances", depends on their application type, task, real-time data, other workloads on the same host and other many things..
    When we were developing TBB backend we kept in mind that things and preferred to use TBB auto partitioner (instead of static f.e).
    Also composability reasons make sense here.

  4. BTW, have you considered a "common parallel reduce" (in general) pattern (and tbb::parallel_reduce pattern, in particular) for histogram calculation? It seems the parallel histogram calculation matches on the common reduce (with a certain "big" grainsize): each Body calculates a local histogram (bins), Combiner summaries the all local bins into final ones.
    Additionally, if number of bins is "big" we can apply the second level of parallelism within Combiner code - SIMD or even "parallel_for" and SIMD, if number of bins is "too big".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Yes, although I think there is no reason here to do static division of work, but rather rely upon our existing parallel_for implementation to provide a good composable implementation.

  2. Agree, which is why the intent is to use the existing parallel_for structure (including partitioners) to implement the parallelism. If we were to do it from scratch, we would do it in a similarly composable way, but better to rely upon existing infrastructure

  3. Yes, I thought about this. For TBB and even more for openMP the built in reduction functionality is geared toward very simple lightweight types as the reduction variable where we may have an arbitrarily large array. Especially since we want a unified implementation, it does not seem like these backend are really set up to handle these large reduction variables. It seems we should take more control to ensure no unnecessary copies are made, and that the final combination is done performantly, based on knowledge we have of the task. The implementation remains quite simple and unified.

`histogram`.

For this algorithm, each parallel backend will add a `__thread_enumerable_storage<_StoredType>` struct which provides
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Why a TLS is used, not the global memory with "thread id" as a key? F.e. bins[thread_id]?
  2. Does TBB guarantee that the same threads finalize the work? "The same threads" means the threads which have started the work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. There are a number of reasons:
    a) My understanding is that in TBB while it may be technically possible to get the the local thread id within an arena, it is an undocumented API and generally discouraged and against the TBB mindset. Using TLS seems to be the preferred method specifically with TBB.
    b) While what you suggest perhaps fits better within OpenMP, we want to create a single implementation and not require a __parallel_histogram within every current and future backend, but rather depend upon existing functionality within the backend as much as we can (in this case __parallel_for).
    c) With smaller values of n, num_bins and larger number of threads, not all threads should be used because of the overhead associated with allocation and initialization of each temporary bin copy. We can let the partitioner decide how many blocks to employ, but we want to avoid unnecessary allocation and initialization overheads wherever possible.

I will mention a downside for completeness, but it is outweighed here in my opinion:
It requires implementation of a thread local storage class for each backend. This is only non-trivial for OpenMP. It has been written generically though to serve future patterns though so it is nice to have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I'm not exactly sure what you mean here by "finalize the work". If you mean the second parallel for, then no, we are explicitly parallelizing over a different dimension (num_bins), and accumulating across the temporary histograms which were used from different threads. TBB does guarantee that each thread will always use its own TLS for each grain of work though, when retrieved through local().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2. I'm not exactly sure what you mean here by "finalize the work"

I try to explain with example:
Usage of a general TBB pattern tbb::paralle_for doesn't suppose using system thread directly. There is only a "Body" which is called (with a part of data(tbb range) by executing thread. Imagine the input range is split into 4 parts. Two threads call 2 parts simultaneously. The Body stores local bin results in TLS, associated with mentioned threads.
After, to "finalize the work", TBB should call Body two time to process final 2 parts of input range. These final two calls may be done by another threads which have the other associated TLSs. So, it is impossible to make final reduce of local bins, located in TLSs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 parallel_for calls, each of which is "embarrassingly parallel" where no thread body depends on previous thread bodies. The first parallel_for must complete before the second one starts though. The first parallel_for uses the TLS as normal, and just accumulates sections of the input data into each thread's individual TLS.

The second parallel_for call does not use the TLS as normal, but rather has every thread visit a section of every TLS which was created one by one, processing a section of the histogram bins in parallel, combining the work of different threads from the first loop into the final global histogram.

The TLS we propose here (that is also implemented in the PR) supports this, and we obtain the correct result. We will not have perfect cache effects when accessing TLS from different threads than it was created upon but that is just something we have to deal with.

Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand your answer, Dan...
It seems you don't catch my question/concerns...

I try to explain my concern again:
tbb::paralle_for "produces several calls of Body, which is passed to tbb::paralle_for. You don't know how many callbacks is, because "tbb auto-partioner" is applied by default.
Each call of this body may be done by the different threads. Moreover, The first calls of the body may be done by threads "ids" 0-3, the last calls may be done by another threads, "ids" 4-7 by example. Each TLS is associated with its own thread. you don't know IDs of threads.... I don't understand how you can get the calculated local bins from the all TLSs....

Copy link
Contributor Author

@danhoeflinger danhoeflinger Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are basically using TBB's enumerable_thread_specific as a model here, and implementing a stripped down version for omp and a trivial version for serial backend.

enumerable_thread_specific has two ways of accessing the data.
a) local() which gets the TLS for the current thread,
b) begin() and end() which provide iterators to the sequence of all local storage from all threads.

This allows us to use (a) in the first parallel loop and (b) in the second parallel loop. The second parallel loop does not use the enumerable_thread_specific as a "Thread Local Storage" but rather a 2-D array space which it iterates over summing across columns (corresponding to individual histogram bins from different threads). This allows us to accumulate the data from all threads into the global space histogram copy no matter which threads are used and when.

I'm not sure how else I can explain it. The code in the implementation is tested, working, and pretty concise, if you want to see the details you can look at the PR.

Copy link
Contributor

@MikeDvorskiy MikeDvorskiy Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean "TBB TLS", not system TLS...
I clarified that question with Alexey.
TBB TLS is a kind of container and allows to iterate the all local bins... I was not aware of that.
Now I got it.

the following:
* constructor which takes a variadic list of args to pass to the constructor of each thread's object
* `get()` returns reference to the current threads stored object
* `get_with_id(int i)` returns reference to the stored object for an index
* `size()` returns number of stored objects

In the TBB backend, this will use `enumerable_thread_specific` internally. For OpenMP, this will pre-allocate
and initialize an object for each possible thread in parallel. The serial backend will merely create a single copy of
the temporary object for use.

With this new structure we will use the following algorithm:

1) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into
its own temporary histogram returned by `__thread_enumerable_storage`.
2) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the
histogram created within `__thread_enumerable_storage` into the output histogram sequence.

Loading