uxlfoundation · danhoeflinger · Oct 30, 2024 · Oct 30, 2024 · Nov 1, 2024 · Nov 1, 2024
diff --git a/rfcs/proposed/host_backend_histogram/README.md b/rfcs/proposed/host_backend_histogram/README.md
@@ -9,9 +9,7 @@ for a full definition of the semantics of the histogram APIs. In short, they tak
 classify them into either evenly distributed or user-defined bins via a list of separating values and count the number
 of values in each bin, writing to a user-provided output histogram sequence. Currently, `histogram` is not supported
 with serial, tbb, or openmp backends in our oneDPL implementation. This RFC aims to propose the implementation of
-`histogram` for these host-side backends. The serial implementation is straightforward and is not worth discussing in
-much length here. We will add it, but there is not much to discuss within the RFC, as its implementation will be
-straightforward.
+`histogram` for these host-side backends.
 
 ## Motivations
 Users don't always want to use device policies and accelerators to run their code. It may make more sense in many cases
@@ -57,6 +55,10 @@ required to overcome the race conditions which we add from the additional concur
 sense to decline to add vectorized operations within histogram depending on the implementation used, and based on
 performance results.
 
+### Serial Backend
+We plan to support a serial backend for the histogram APIs in addition to OpenMP and TBB. This backend will handle all
+policy types, but will always provide a serial, unvectorized implementation.
+
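As an illustration of what a serial, unvectorized pass over the input could look like, here is a minimal sketch for evenly spaced bins. The function name `sketch_serial_histogram`, the use of `std::vector`, and the half-open range `[min_val, max_val)` are assumptions made for this example only; this is not the oneDPL API or its planned implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical standalone helper (not part of oneDPL): counts how many values
// fall into each of `num_bins` evenly spaced bins covering [min_val, max_val).
// Values outside that range are ignored (an assumption of this sketch).
std::vector<std::size_t> sketch_serial_histogram(const std::vector<double>& input,
                                                 std::size_t num_bins,
                                                 double min_val, double max_val)
{
    std::vector<std::size_t> bins(num_bins, 0);
    if (num_bins == 0 || !(min_val < max_val))
        return bins;

    const double scale = static_cast<double>(num_bins) / (max_val - min_val);
    for (double v : input)
    {
        if (v >= min_val && v < max_val)
        {
            std::size_t idx = static_cast<std::size_t>((v - min_val) * scale);
            if (idx >= num_bins) // guard against floating-point rounding at the upper edge
                idx = num_bins - 1;
            ++bins[idx]; // plain increment: a single thread needs no atomics
        }
    }
    return bins;
}
```

Because only one thread touches the output, the serial path needs none of the synchronization discussed for the TBB and OpenMP backends.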
 ## Existing Patterns
 
 ### count_if
@@ -110,9 +112,11 @@ To deal with atomics appropriately, we have some limitations. We must either use
 specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic<T>`, however, this can
 only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then
 copying it to the output data. `C++20` provides `std::atomic_ref<T>` which would allow us to wrap user-provided output
-data in an atomic wrapper, but we cannot assume `C++17` for all users. We could look to implement our own
-`atomic_ref<T>` for C++17, but that would require specialization for individual compilers. OpenMP provides atomic
-operations, but that is only available for the OpenMP backend.
+data in an atomic wrapper, but we cannot assume `C++20` for all users. OpenMP provides atomic
+operations, but those are only available for the OpenMP backend. The working plan is to implement a macro like
+`_ONEDPL_ATOMIC_INCREMENT(var)` which uses `std::atomic_ref` if available, and otherwise uses compiler builtins
+like `InterlockedAdd` or `__atomic_fetch_add`. It remains to be investigated whether any configuration needs to
+turn off the atomic implementation due to lack of compiler support (I think this is unlikely).
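One possible shape for such an increment is sketched below, written as an inline function rather than a macro to keep it readable. The names `sketch_atomic_increment` and `sketch_histogram_chunk`, the `long` counter type, the relaxed memory ordering, and the exact preprocessor checks are all assumptions made for illustration; the actual `_ONEDPL_ATOMIC_INCREMENT` may look quite different.

```cpp
#include <atomic> // also defines __cpp_lib_atomic_ref when std::atomic_ref exists
#include <cassert>
#include <cstddef>
#include <vector>

#if !defined(__cpp_lib_atomic_ref) && defined(_MSC_VER)
#include <intrin.h> // _InterlockedIncrement
#endif

// Hypothetical helper: atomically adds 1 to a counter that was NOT declared
// as std::atomic, e.g. an element of a user-provided output histogram.
inline void sketch_atomic_increment(long& var)
{
#if defined(__cpp_lib_atomic_ref)
    // C++20 path: wrap the existing object in an atomic reference.
    std::atomic_ref<long>(var).fetch_add(1, std::memory_order_relaxed);
#elif defined(__GNUC__) || defined(__clang__)
    // GCC/Clang builtin fallback for pre-C++20 builds.
    __atomic_fetch_add(&var, 1L, __ATOMIC_RELAXED);
#elif defined(_MSC_VER)
    // MSVC intrinsic fallback (full barrier, stronger than strictly needed).
    _InterlockedIncrement(&var);
#else
#error "This sketch has no atomic increment for the current compiler"
#endif
}

// Hypothetical per-thread work item: each input value is assumed to already
// be a bin index; out-of-range indices are skipped.
inline void sketch_histogram_chunk(const int* first, const int* last,
                                   long* bins, int num_bins)
{
    for (; first != last; ++first)
        if (*first >= 0 && *first < num_bins)
            sketch_atomic_increment(bins[*first]);
}
```

In a parallel backend, each thread would call something like `sketch_histogram_chunk` on its own sub-range of the input while all threads share the same `bins` array. Relaxed ordering is assumed to be sufficient here because only the final counts are read, after the threads have joined.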
 
 It remains to be seen if atomics are worth their overhead and contention from a performance perspective and may depend
 on the different approaches available.
@@ -137,8 +141,6 @@ better atomics should do vs redundant copies of the output.
   does not provide a good fallback.
 
 ## Open Questions
-* Would it be worthwhile to add our own implementation of `atomic_ref` for C++17? I believe this would require
-  specializations for each of our supported compilers.
 
 * What is the overhead of atomics in general in this case and does the overhead there make them inherently worse than
   merely having extra copies of the histogram and accumulating?