This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

Minor edits #487

Merged
17 commits merged on Jan 15, 2024
4 changes: 2 additions & 2 deletions Dockerfile
@@ -48,7 +48,7 @@ CMD cd /portBLAS && \
if [ "${COMMAND}" = 'build-test' ]; then \
if [ "${SYCL_IMPL}" = 'DPCPP' ]; then \
export LD_LIBRARY_PATH="/tmp/dpcpp/lib" && mkdir -p build && cd build && \
cmake .. -DBLAS_ENABLE_STATIC_LIBRARY=ON -DGEMM_TALL_SKINNY_SUPPORT=OFF \
cmake .. -DGEMM_TALL_SKINNY_SUPPORT=OFF \
-DSYCL_COMPILER=dpcpp -DCMAKE_PREFIX_PATH=/tmp/OpenBLAS/build \
-DBLAS_ENABLE_CONST_INPUT=OFF -DCMAKE_BUILD_TYPE=Release && \
make -j$(nproc) && cd test && ctest -VV --timeout 1200; \
@@ -58,7 +58,7 @@ CMD cd /portBLAS && \
elif [ "${COMMAND}" = 'auto-tuner' ]; then \
export LD_LIBRARY_PATH="/tmp/dpcpp/lib" && mkdir -p tools/auto_tuner/build \
&& cd tools/auto_tuner/build && \
cmake .. -DBLAS_ENABLE_STATIC_LIBRARY=ON -DGEMM_TALL_SKINNY_SUPPORT=OFF \
cmake .. -DGEMM_TALL_SKINNY_SUPPORT=OFF \
-DSYCL_COMPILER=dpcpp -DCMAKE_PREFIX_PATH=/tmp/OpenBLAS/build \
-DBLAS_ENABLE_CONST_INPUT=OFF -DCMAKE_BUILD_TYPE=Release && \
make -j$(nproc); \
149 changes: 84 additions & 65 deletions README.md

Large diffs are not rendered by default.

30 changes: 0 additions & 30 deletions Roadmap.md

This file was deleted.

130 changes: 98 additions & 32 deletions benchmark/README.md
@@ -5,53 +5,82 @@ Benchmarks

The portBLAS benchmarks are intended to measure the evolution of the
performance of this BLAS implementation and how it compares with other tuned
implementations, such as [CLBLAST](https://github.com/CNugteren/CLBlast)
(a very performant OpenCL BLAS library).
implementations, such as [cuBLAS](https://docs.nvidia.com/cuda/cublas/index.html), the native, optimized CUDA BLAS library for Nvidia GPUs,
[rocBLAS](https://github.com/ROCm/rocBLAS), the native HIP/ROCm BLAS library
optimized for AMD GPUs, and [CLBlast](https://github.com/CNugteren/CLBlast),
a very performant OpenCL BLAS library for CPU targets.

The benchmarks use Google's [benchmark](https://github.com/google/benchmark)
library and generate a report with indicative metrics (see instructions below).
library and generate a report with indicative metrics *(see instructions below)*.

## How to compile the benchmarks
### portBLAS Benchmarks
The portBLAS default benchmarks are compiled with the project if the
`BLAS_ENABLE_BENCHMARK` CMake option is activated *(which is the case by
default)*. For the results to be relevant, portBLAS needs to be built in Release
mode *(by passing `-DCMAKE_BUILD_TYPE=Release` to the `cmake` command)*.
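
As a rough sketch of such a build *(the SYCL compiler choice and the OpenBLAS path
are placeholders; adapt them to your setup)*:

```bash
# Minimal sketch of a Release build with the benchmarks enabled.
# BLAS_ENABLE_BENCHMARK is ON by default and is only shown for clarity;
# the SYCL compiler and the OpenBLAS path are assumptions about your setup.
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DBLAS_ENABLE_BENCHMARK=ON \
         -DSYCL_COMPILER=dpcpp \
         -DCMAKE_PREFIX_PATH=/path/to/OpenBLAS/build
make -j"$(nproc)"
```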

The benchmarks are compiled with the project if the `BLAS_ENABLE_BENCHMARK`
CMake option is activated (which is the case by default).

### CLBlast Benchmarks
The CLBlast benchmarks are compiled only if the `BUILD_CLBLAST_BENCHMARKS` CMake
option is activated. In that case, the build will fail if CLBlast cannot be found.
The location of CLBlast can be given with `CLBLAST_ROOT`.
To install CLBlast, see:
[CLBlast: Building and installing](
https://github.com/CNugteren/CLBlast/blob/master/doc/installation.md))
https://github.com/CNugteren/CLBlast/blob/master/doc/installation.md)

### cuBLAS Benchmarks
The cuBLAS benchmarks can be built by enabling the `BUILD_CUBLAS_BENCHMARKS` option and
require an installation of CUDA *(>=11.x)* and the cuBLAS library *(both can be obtained by
installing the CUDA Toolkit:
[installation guide](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html))*.

### rocBLAS Benchmarks
The rocBLAS benchmarks can be built by enabling the `BUILD_ROCBLAS_BENCHMARKS` option
and require an installation of ROCm *(>=4.5.x)* and the rocBLAS library *(please refer
to this [installation guide](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/Linux_Install_Guide.html)
for the setup)*.
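
As a hedged sketch, either (or both) vendor benchmark suites would be enabled at
configure time roughly as follows *(the CUDA and ROCm locations are placeholders
for your installation)*:

```bash
# Sketch only: enable the cuBLAS and/or rocBLAS benchmark suites at configure time.
# The toolkit paths below are placeholders; point them at your CUDA / ROCm installs.
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DBUILD_CUBLAS_BENCHMARKS=ON \
         -DBUILD_ROCBLAS_BENCHMARKS=ON \
         -DCMAKE_PREFIX_PATH="/usr/local/cuda;/opt/rocm"
make -j"$(nproc)"
```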

## General Notes
After the compilation, the binaries will be available:
* in the build folder, in `benchmark/portblas/` and `benchmark/clblast/`
* in the build folder, in `benchmark/[Library Name]/`
* if you provide an installation directory with the CMake variable
`CMAKE_INSTALL_PREFIX` and run the installation command *(e.g. `ninja install`)*,
in your installation folder, in `portblas/bin/`

A verification of the results is enabled by default and can be disabled with the
CMake option `BLAS_VERIFY_BENCHMARK` set to `OFF` or `0`. The verification will
be run a small number of times (more than once because of the way the benchmark
be run a small number of times *(more than once because of the way the benchmark
library works, but much less than the usual number of iterations of the
benchmarks). The verification requires that a reference implementation of BLAS
benchmarks)*. The verification requires that a reference implementation of BLAS
like OpenBLAS is installed, whose path can be given with the `CMAKE_PREFIX_PATH`
CMake parameter.
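
For example, a configuration that points verification at a local OpenBLAS build,
or turns it off entirely, could look like this *(the OpenBLAS path is a placeholder)*:

```bash
# Verify benchmark results against a reference BLAS (path is a placeholder) ...
cmake .. -DCMAKE_PREFIX_PATH=/path/to/OpenBLAS/build
# ... or skip result verification entirely.
cmake .. -DBLAS_VERIFY_BENCHMARK=OFF
```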

## How to run the benchmarks

The benchmarks take two kinds of command-line options: those for the benchmark
library and those specific to the portBLAS projects.

Essentially, the benchmarks can take a CSV configuration file (or will use
defaults), and if your machine has more than one OpenCL device, you can specify
which one to use. The other options specify how to output the results.
library and those specific to portBLAS:
- Benchmark library options: these options specify, for instance, the output
format and the verbosity level, and can filter specific benchmarks based on regex
arguments.
- portBLAS options: these options are portBLAS specific; they select
the target SYCL device *(portBLAS benchmarks only)* and pass a custom set
of operator-specific configurations *(this applies to rocBLAS and cuBLAS benchmarks as
well)*. If no CSV file is specified, the benchmarks will use default values
*(found in `include/common/common_utils.hpp`)*.

For the portBLAS and CLBlast benchmarks, if the target machine has more than one
OpenCL device, you can specify which one to use at runtime (\*).

(*): Preferably, for portBLAS benchmarks, this device should match the
`TUNING_TARGET`, as some operators are configured differently depending on the
SYCL target for optimal performance.

The most useful options for us are:

|option|parameter|description|
|------|:-------:|-----------|
| `--help` | | Show help message |
| `--device` | device name | Select a device to run on (e.g `intel:gpu`) |
| `--device` | device name | Select a device to run on (e.g. `intel:gpu`); applies to portBLAS and CLBlast benchmarks |
| `--csv-param` | file path | Path to a CSV file with the benchmark parameters |
| `--benchmark_format` | `console` / `json` / `csv` | Specify the format of the standard output |
| `--benchmark_out` | file path | Specify a file where to write the report |
@@ -64,21 +93,27 @@ The most useful options for us are:
You can check the [GitHub repository](https://github.com/google/benchmark) of
the library for information about the other supported command-line arguments.
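
For instance, the standard benchmark-library flags can be used to list the
registered benchmarks and run only a subset of them *(the regex below is only
illustrative; check the listed names first)*:

```bash
# List the registered benchmark names without running them ...
./bench_gemm --benchmark_list_tests=true
# ... then run only the ones matching a regex, repeating each a few times.
./bench_gemm --benchmark_filter='.*float.*' --benchmark_repetitions=3
```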

Here is an example of an invocation of the GEMM benchmark running on Intel GPU,
displaying the results in the console and saving a json report:
Here is an example of an invocation of the portBLAS GEMM benchmark running on
an Intel GPU, displaying the results in the console and saving a JSON report:

```bash
./bench_gemm --device=intel:gpu --csv-param=parameters.csv \
./benchmark/portblas/bench_gemm --device=intel:gpu --csv-param=parameters.csv \
--benchmark_out=../results.json --benchmark_out_format=json \
--benchmark_format=console
```

Here is the same benchmark through cuBLAS:
```bash
./benchmark/cublas/bench_cublas_gemm --csv-param=parameters.csv --benchmark_out=../results.json \
--benchmark_out_format=json --benchmark_format=console
```
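
Assuming the rocBLAS benchmarks follow the same naming pattern *(the binary name
and path below are an assumption; check `benchmark/rocblas/` in your build tree)*,
the rocBLAS equivalent would be:

```bash
# Hypothetical rocBLAS binary name, following the cuBLAS naming pattern above.
./benchmark/rocblas/bench_rocblas_gemm --csv-param=parameters.csv --benchmark_out=../results.json \
  --benchmark_out_format=json --benchmark_format=console
```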

### CSV format

The benchmarks can be given a CSV file containing the parameters to run with
(matrix/vector dimensions, transpose or not, etc), in the following format: one
line corresponds to one set of parameters, i.e. one name for the library (though
it will be iterated many times for statistical accuracy).
*(matrix/vector dimensions, transpose or not, etc)*, in the following format: one
line corresponds to one set of parameters, i.e. one name for the library *(though
it will be iterated many times for statistical accuracy)*.

The formats for the different BLAS levels are:

@@ -88,24 +123,51 @@ The formats for the different BLAS levels are:
| blas 2 | *transpose_A,m,n,alpha,beta* | Action on the matrix (`n`, `t`, `c`), dimensions, and scalars alpha and beta |
| blas 3 | | |
| gemm | *transpose_A,transpose_B,m,k,n,alpha,beta* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B:kn, C: mn), and scalars alpha and beta |
| gemm (Batched) | *transpose_A,transpose_B,m,k,n,alpha,beta,batch_size* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B:kn, C: mn), scalars alpha and beta, batch size |
| gemm (Batched) | *transpose_A,transpose_B,m,k,n,alpha,beta,batch_size,batch_type* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B:kn, C: mn), scalars alpha and beta, batch size, batch_type |
| trsm | *side,triangle,transpose,diagonal,m,n,alpha* | Position of A (`l`, `r`), A is upper or lower triangular (`u`, `l`), transposition of A (`n`, `t`), A is unit or non-unit diagonal(`u`,`n`),dimensions, scalar alpha |

Note: for operations that support a stride, the benchmarks will use a stride of
1 (contiguous values), except for the GEMM batched operation where valid default stride values are used depending on batch type *(strided or interleaved)*. For operations that support a leading dimension, the
benchmarks use the minimum possible value (the actual leading dimension of the
matrix).
### Notes:

For operations that support an increment, the benchmarks will use an increment of
1 *(contiguous values)*, except for the GEMM batched operation where valid
default increment values are used depending on `batch_type` *(strided or
interleaved)*. For operations that support a leading dimension, the
benchmarks use the minimum possible value *(the actual leading dimension
of the matrix)*.

Here is an example of a valid CSV file for the GEMM benchmark:
For batched-strided operations that expect one or more strides, the benchmarking
suite expects **stride multipliers** instead of explicit stride values. For example,
in the `gemm_batched_strided` operator, `stride_a_mul=3` for a matrix `A` of size
`size_a=m*k` is equivalent to an actual stride of `stride_a = stride_a_mul * size_a`.
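
As a small worked example *(the numbers are arbitrary)*, the multiplier from a
`gemm_batched_strided` row translates into an element stride as follows:

```bash
# Hypothetical row values: m=64, k=64, stride_a_mul=3
m=64; k=64; stride_a_mul=3
size_a=$(( m * k ))                    # A holds 4096 elements
stride_a=$(( stride_a_mul * size_a ))  # effective stride_a = 12288 elements
echo "stride_a = ${stride_a}"
```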

For operations that support the complex data type *(and when the CMake option
`BLAS_ENABLE_COMPLEX` is enabled)*, scalars such as `alpha` and `beta` are
expected to have their real and imaginary parts passed as separate values. If only
single scalar values are passed in the CSV configuration files, complex values
for these arguments will be constructed by duplicating the same real value
for both the real and imaginary parts *(e.g. `alpha_cplx={alpha_real, alpha_real}`)*.


Here are two examples of valid CSV files for the GEMM benchmark:

- Scalar only alpha & beta values

```
n,n,42,42,42,1,0
n,t,64,128,64,0.5,0.5
t,n,13,3,7,0,0.7
```

- Complex alpha & beta values *(two scalars each)*
```
n,n,42,42,42,1,2,3,0
n,t,64,128,64,0.5,1,0.5,1
t,n,13,3,7,0,1,0.7,3
```


The folder `config_csv` provides a few files corresponding to sizes that are
relevant for neural networks, but you can use your own files, see the next
relevant for neural networks, but you can provide your own, see the next
section for more info on how to generate them.

### Python tool to generate a CSV file
Expand Down Expand Up @@ -262,6 +324,9 @@ following keys:
* `cpu_time`: actual CPU time spent running the benchmark
* `time_unit`: unit used for these times. Should be `ns`, if not please file an
issue.
* `label`: contains the name of the benchmark backend *(`@backend`: "portblas",
"cublas" or "rocblas")* and target-device-related information *(`device_name`,
`device_version`, `driver_version`, etc.)*
* `avg_event_time`: the average of the CL/SYCL event times in nanoseconds. This
time depends on the events returned by the BLAS functions used and might not
be accurate in some cases
@@ -280,9 +345,10 @@ following keys:
* `bytes_processed`: total number of bytes read and written in memory. It is
calculated theoretically based on the operations that we think the benchmark
is doing.
* a few benchmark parameters (e.g `m`, `n`, `k` for GEMM)
* other operator-specific parameters that affect the computations *(e.g. `m`, `n`,
`k` for GEMM, but `alpha` is skipped as it doesn't affect the operation directly)*
* some other keys from the benchmark library

**Note:** to calculate the performance in Gflops, you can divide `n_fl_ops` by one
of the best or average time metrics, e.g `avg_overall_time` (the event and wall
time usually converge for large dimensions).
of the best or average time metrics, e.g `avg_overall_time` *(the event and wall
time usually converge for large dimensions)*.
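
As a sketch *(assuming the `n_fl_ops` and `avg_overall_time` counters appear as
per-benchmark fields in the JSON report, which is where the benchmark library
normally emits user counters)*, this can be done with `jq`:

```bash
# Compute GFLOP/s per benchmark from the JSON report: n_fl_ops is a flop count
# and avg_overall_time is in nanoseconds, so their ratio is already GFLOP/s.
jq '.benchmarks[] | {name, gflops: (.n_fl_ops / .avg_overall_time)}' results.json
```
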
4 changes: 3 additions & 1 deletion benchmark/gen_param.py
@@ -1,5 +1,5 @@
#!/usr/bin/env python3
#/***************************************************************************
# /***************************************************************************
# *
# * @license
# * Copyright (C) Codeplay Software Limited
@@ -32,12 +32,14 @@
import itertools
import argparse


def main(args):
"""Generate the csv file according to the given arguments
"""
# Match DSL to Python names
nd_range = itertools.product
value_range = lambda *v: list(v)

def size_range(low, high, mult):
val = low
while val <= high:
41 changes: 31 additions & 10 deletions doc/AddingBlas3Op.md
@@ -35,7 +35,7 @@ typename sb_handle_t::event_t _trsm(sb_handle_t& sb_handle, char Side,
container_0_t A, index_t lda,
container_1_t B, index_t ldb);

}
} // namespace internal

// User-facing function to call the TRSM operation
template <typename sb_handle_t, typename container_0_t, typename container_1_t,
@@ -47,7 +47,6 @@ typename sb_handle_t::event_t inline _trsm(
return internal::_trsm(sb_handle, Side, Triangle, Transpose, Diagonal, M, N, alpha, A, lda,
B, ldb);
}
} // namespace internal
} // namespace blas
```

@@ -98,14 +97,19 @@ void run_test(const combination_t<scalar_t> combi) {
auto q = make_queue();
SB_Handle sb_handle(q);

//
// Perform any host-to-device copies here
//

// Invoke the newly added operation
_trsm(sb_handle, side, triangle, transA, diag, m, n, alpha, a_gpu, lda, b_gpu, ldb);
_trsm(sb_handle, side, triangle, transA, diag, m, n, alpha, a_gpu, lda, b_gpu, ldb, dependencies);

// Verify the results
}

// Create the combinations of parameters to invoke the test
const auto combi = ::testing::Combine(::testing::Values(7, 513, 1027), // m
const auto combi = ::testing::Combine(::testing::Values("usm", "buf"), // allocation type
::testing::Values(7, 513, 1027), // m
::testing::Values(7, 513, 1027), // n
::testing::Values('n', 't'), // transA
::testing::Values('l', 'r'), // side
@@ -143,7 +147,8 @@ template <typename sb_handle_t, typename container_0_t, typename container_1_t,
typename sb_handle_t::event_t _trsm(
sb_handle_t& sb_handle, char Side, char Triangle, char Transpose, char Diagonal,
index_t M, index_t N, element_t alpha, container_0_t A, index_t lda,
container_1_t B, index_t ldb) {
container_1_t B, index_t ldb,
const typename sb_handle_t::event_t& _dependencies) {
// Implementation of the new operation
// This will probably invoke a kernel which we don't yet have defined.

@@ -153,7 +158,7 @@ typename sb_handle_t::event_t _trsm(
auto gemmEvent = internal::_gemm(
sb_handle, 'n', isTranspose ? 't' : 'n', M, currentBlockSize,
currentBlockSize, (i == 0) ? alpha : element_t{1}, B + i * ldb, ldb,
invA + i * blockSize, blockSize, element_t{0}, X + i * ldx, ldx);
invA + i * blockSize, blockSize, element_t{0}, X + i * ldx, ldx, intermediate_events);
trsmEvents = concatenate_vectors(trsmEvents, gemmEvent);

// Ultimately, a list of all events created in this function is returned
@@ -173,8 +178,9 @@ binary is linked against portBLAS, the linker will find the definition of the mi

To do this, we create the source file that will contain instantiations of the new `_trsm` operation.
The file is located at `src/interface/blas3/trsm.cpp.in`. This is not the file that will be
compiled, but a template file that the python script `python_generator/py_gen_blas_binary.py`
will use to generate the actual source file where the instantiation of `_trsm` will happen.
compiled, but a template file that the python script `python_generator/py_gen_blas_ops.py`
will use to generate the actual source file where the instantiation of `_trsm` will happen. The call
to this generator is wrapped within the CMake function `generate_blas_objects`.

The file `src/interface/blas3/trsm.cpp.in` must include all files that are necessary to successfully
compile `blas::internal::_trsm`, for this particular example, this file looks like the following:
@@ -193,13 +199,28 @@ compile `blas::internal::_trsm`, for this particular example, this file looks li
namespace blas {
namespace internal {


// Buffer Declaration
template typename SB_Handle::event_t _trsm(
SB_Handle sb_handle, char Side, char Triangle, char Transpose, char Diagonal,
${INDEX_TYPE} M, ${INDEX_TYPE} N, ${DATA_TYPE} alpha,
BufferIterator<${DATA_TYPE}> A, ${INDEX_TYPE} lda,
BufferIterator<${DATA_TYPE}> B, ${INDEX_TYPE} ldb);
BufferIterator<${DATA_TYPE}> B, ${INDEX_TYPE} ldb,
const typename SB_Handle::event_t& _dependencies);

// USM Declarations
#ifdef SB_ENABLE_USM
template typename SB_Handle::event_t _trsm(
SB_Handle& sb_handle, char side, char uplo, char trans, char diag,
${INDEX_TYPE} M, ${INDEX_TYPE} N, ${DATA_TYPE} alpha, ${DATA_TYPE} * A,
${INDEX_TYPE} lda, ${DATA_TYPE} * B, ${INDEX_TYPE} ldb,
const typename SB_Handle::event_t& _dependencies);

template typename SB_Handle::event_t _trsm(
SB_Handle& sb_handle, char side, char uplo, char trans, char diag,
${INDEX_TYPE} M, ${INDEX_TYPE} N, ${DATA_TYPE} alpha,
const ${DATA_TYPE} * A, ${INDEX_TYPE} lda, ${DATA_TYPE} * B,
${INDEX_TYPE} ldb, const typename SB_Handle::event_t& _dependencies);
#endif

} // namespace internal
} // namespace blas
Empty file.
Empty file.
2 changes: 1 addition & 1 deletion python_generator/py_gen_blas_gemm_launcher.py
@@ -1,4 +1,4 @@
#/***************************************************************************
# /***************************************************************************
# *
# * @license
# * Copyright (C) Codeplay Software Limited
2 changes: 1 addition & 1 deletion python_generator/py_gen_blas_ops.py
@@ -1,4 +1,4 @@
#/***************************************************************************
# /***************************************************************************
# *
# * @license
# * Copyright (C) Codeplay Software Limited