Commit 5f3240f: Updated Benchmark Doc to include cuBLAS, rocBLAS and complex data updates
OuadiElfarouki committed Jan 2, 2024 (1 parent: eee89a0)
Showing 1 changed file: benchmark/README.md (98 additions, 32 deletions)

# Benchmarks

The portBLAS benchmarks are intended to measure the evolution of the
performance of this BLAS implementation and how it compares with other tuned
implementations, such as [cuBLAS](https://docs.nvidia.com/cuda/cublas/index.html), the native,
optimized CUDA BLAS library for Nvidia GPUs, [rocBLAS](https://github.com/ROCm/rocBLAS), the
native HIP/ROCm BLAS library optimized for AMD GPUs, and
[CLBlast](https://github.com/CNugteren/CLBlast), a very performant OpenCL BLAS library for CPU
targets.

The benchmarks use Google's [benchmark](https://github.com/google/benchmark)
library and generate a report with indicative metrics *(see instructions below)*.

## How to compile the benchmarks
### portBLAS Benchmarks
The portBLAS default benchmarks are compiled with the project if the
`BLAS_ENABLE_BENCHMARK` CMake option is activated *(which is the case by
default)*. For the results to be relevant, portBLAS needs to be built in Release
mode *(passing `-DCMAKE_BUILD_TYPE=Release` to the `cmake` command)*.

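For example, a Release configuration with the benchmarks enabled could look like the
following *(a minimal sketch: SYCL compiler and target selection flags depend on your
local toolchain and are omitted here)*:

```bash
mkdir build && cd build
# Ninja generator used here for illustration; BLAS_ENABLE_BENCHMARK is ON by default
# and is passed only for clarity.
cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DBLAS_ENABLE_BENCHMARK=ON ..
ninja
```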

### CLBlast Benchmarks
The CLBlast benchmarks are compiled only if the `BUILD_CLBLAST_BENCHMARKS` CMake
option is activated. If it is enabled and CLBlast cannot be found, the build will fail.
The location of CLBlast can be given with `CLBLAST_ROOT`.
To install CLBlast, see
[CLBlast: Building and installing](https://github.com/CNugteren/CLBlast/blob/master/doc/installation.md).

### cuBLAS Benchmarks
cuBLAS benchmarks can be built by enabling the `BUILD_CUBLAS_BENCHMARKS` option and
require an installation of CUDA *(>= 11.x)* and the cuBLAS library *(both can be obtained
by installing the CUDA Toolkit:
[installation guide](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html))*.

### rocBLAS Benchmarks
rocBLAS benchmarks can be built by enabling the `BUILD_ROCBLAS_BENCHMARKS` option
and require an installation of ROCm *(>= 4.5.x)* and the rocBLAS library *(please refer
to this [installation guide](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/Linux_Install_Guide.html)
for the setup)*.
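
As a sketch, the third-party benchmarks can be enabled at configuration time with the
CMake options above *(paths are placeholders, and you would normally enable only the
backend that matches your hardware)*:

```bash
# CLBlast benchmarks (point CLBLAST_ROOT at your CLBlast installation):
cmake -GNinja -DCMAKE_BUILD_TYPE=Release \
      -DBUILD_CLBLAST_BENCHMARKS=ON -DCLBLAST_ROOT=/path/to/CLBlast ..
# cuBLAS benchmarks (requires the CUDA Toolkit):
#   add -DBUILD_CUBLAS_BENCHMARKS=ON
# rocBLAS benchmarks (requires ROCm):
#   add -DBUILD_ROCBLAS_BENCHMARKS=ON
```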

## General Notes
After the compilation, the binaries will be available:
* in the build folder, in `benchmark/[Library Name]/`
* in your installation folder, in `portblas/bin/`, if you provide an installation
  directory with the CMake variable `CMAKE_INSTALL_PREFIX` and run the installation
  command *(e.g. `ninja install`)*

A verification of the results is enabled by default and can be disabled with the
CMake option `BLAS_VERIFY_BENCHMARK` set to `OFF` or `0`. The verification will
be run a small number of times *(more than once because of the way the benchmark
library works, but much less than the usual number of iterations of the
benchmarks)*. The verification requires that a reference implementation of BLAS
like OpenBLAS is installed, whose path can be given with the `CMAKE_PREFIX_PATH`
CMake parameter.
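
For instance *(a sketch; the OpenBLAS path is a placeholder)*:

```bash
# Disable the verification entirely:
cmake -GNinja -DBLAS_VERIFY_BENCHMARK=OFF ..
# Or keep it enabled and point CMake at a reference BLAS installation:
cmake -GNinja -DCMAKE_PREFIX_PATH=/path/to/openblas ..
```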

## How to run the benchmarks

The benchmarks take two kinds of command-line options: those for the benchmark
library and those specific to portBLAS:
- Benchmark library options: these options help specify, for instance, the output
  format and the verbosity level, and can filter specific benchmarks based on regex
  arguments.
- portBLAS options: these options are portBLAS-specific and help specify the target
  SYCL device *(portBLAS benchmarks only)* as well as pass a custom set of
  operator-specific configurations *(this applies to rocBLAS and cuBLAS benchmarks as
  well)*. If no CSV file is specified, the benchmarks will use default values
  *(found in `include/common/common_utils.hpp`)*.

For portBLAS and CLBlast benchmarks, if the target machine has more than one
OpenCL device, you can specify which one to use at runtime(\*).

(\*): Preferably, for portBLAS benchmarks, this device should match the
`TUNING_TARGET`, as some operators are configured differently depending on the
SYCL target for optimal performance.

The most useful options for us are:

|option|parameter|description|
|------|:-------:|-----------|
| `--help` | | Show help message |
| `--device` | device name | Select a device to run on (e.g. `intel:gpu`); useful when targeting OpenCL devices with portBLAS or CLBlast benchmarks |
| `--csv-param` | file path | Path to a CSV file with the benchmark parameters |
| `--benchmark_format` | `console` / `json` / `csv` | Specify the format of the standard output |
| `--benchmark_out` | file path | Specify a file where to write the report |
You can check the [GitHub repository](https://github.com/google/benchmark) of
the library for information about the other supported command-line arguments.

Here is an example of an invocation of the portBLAS GEMM benchmark running on
an Intel GPU, displaying the results in the console and saving a JSON report:

```bash
./portblas/bench_gemm --device=intel:gpu --csv-param=parameters.csv \
--benchmark_out=../results.json --benchmark_out_format=json \
--benchmark_format=console
```

Here is the same benchmark through cuBLAS:
```bash
./cublas/bench_gemm --csv-param=parameters.csv --benchmark_out=../results.json \
--benchmark_out_format=json --benchmark_format=console
```
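
Similarly, assuming the rocBLAS benchmarks were built, the analogous invocation would be
*(a sketch, relying on the `benchmark/[Library Name]/` layout described above)*:

```bash
./rocblas/bench_gemm --csv-param=parameters.csv --benchmark_out=../results.json \
  --benchmark_out_format=json --benchmark_format=console
```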

### CSV format

The benchmarks can be given a CSV file containing the parameters to run with
*(matrix/vector dimensions, transpose or not, etc)*, in the following format: one
line corresponds to one set of parameters, i.e. one name for the library *(though
it will be iterated many times for statistical accuracy)*.

The formats for the different BLAS levels are:

| level / operator | parameters | description |
|------------------|:----------:|-------------|
| blas 2 | *transpose_A,m,n,alpha,beta* | Action on the matrix (`n`, `t`, `c`), dimensions, and scalars alpha and beta |
| blas 3 | | |
| gemm | *transpose_A,transpose_B,m,k,n,alpha,beta* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B:kn, C: mn), and scalars alpha and beta |
| gemm (Batched) | *transpose_A,transpose_B,m,k,n,alpha,beta,batch_size,batch_type* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B:kn, C: mn), scalars alpha and beta, batch size, batch_type |
| trsm | *side,triangle,transpose,diagonal,m,n,alpha* | Position of A (`l`, `r`), A is upper or lower triangular (`u`, `l`), transposition of A (`n`, `t`), A is unit or non-unit diagonal (`u`, `n`), dimensions, scalar alpha |
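
For instance, a hypothetical blas 2 configuration line for a transposed 1024x512 matrix
with `alpha = 1` and `beta = 0` would read:

```
t,1024,512,1,0
```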

### Notes:

For operations that support an increment, the benchmarks will use an increment of
1 *(contiguous values)*, except for the GEMM batched operation where valid
default increment values are used depending on `batch_type` *(strided or
interleaved)*. For operations that support a leading dimension, the
benchmarks use the minimum possible value *(the actual leading dimension
of the matrix)*.

For batched-strided operations that expect one or more strides, the benchmarking suite
expects **stride multipliers** instead of explicit stride values. For example, in the
`gemm_batched_strided` operator, a `stride_a_mul=3` for matrix `A` of `size_a=m*k`
is equivalent to an actual stride of `stride_a = stride_a_mul * size_a`.
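
As a concrete illustration *(numbers chosen arbitrarily)*: with `m = 64`, `k = 32` and
`stride_a_mul = 3`, the matrix size is `size_a = 64 * 32 = 2048` elements, so the
effective stride used by the benchmark is `stride_a = 3 * 2048 = 6144` elements.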

For operations that support the complex data type *(when the CMake option
`BLAS_ENABLE_COMPLEX` is enabled)*, scalars such as `alpha` and `beta` are
expected to have their real and imaginary parts passed as separate values. If only
single scalar values are passed in the CSV configuration files, the complex values
for these arguments will be constructed by duplicating the same real value
for both real and imaginary parts *(e.g. `alpha_cplx={alpha_real, alpha_real}`)*.


Here are two examples of valid CSV files for the GEMM benchmark:

- Scalar-only alpha & beta values

```
n,n,42,42,42,1,0
n,t,64,128,64,0.5,0.5
t,n,13,3,7,0,0.7
```

- Complex alpha & beta values *(two scalars each)*
```
n,n,42,42,42,1,2,3,0
n,t,64,128,64,0.5,1,0.5,1
t,n,13,3,7,0,1,0.7,3
```


The folder `config_csv` provides a few files corresponding to sizes that are
relevant for neural networks, but you can provide your own; see the next
section for more info on how to generate them.

### Python tool to generate a CSV file
[...]

The output report contains the following keys:
* `cpu_time`: actual CPU time spent running the benchmark
* `time_unit`: unit used for these times. Should be `ns`; if not, please file an
  issue.
* `label`: contains the name of the benchmark backend *(`@backend`: "portblas",
  "cublas" or "rocblas")* and target-device-related information *(`device_name`,
  `device_version`, `driver_version`, etc.)*
* `avg_event_time`: the average of the CL/SYCL event times in nanoseconds. This
time depends on the events returned by the BLAS functions used and might not
be accurate in some cases
* `bytes_processed`: total number of bytes read and written in memory. It is
calculated theoretically based on the operations that we think the benchmark
is doing.
* other operator-specific parameters that affect the computations *(e.g. `m`, `n`,
  `k` for GEMM; `alpha` is skipped as it doesn't affect the operation directly)*
* some other keys from the benchmark library

**Note:** to calculate the performance in Gflops, you can divide `n_fl_ops` by one
of the best or average time metrics, e.g. `avg_overall_time` *(the event and wall
time usually converge for large dimensions)*.
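
For example *(illustrative numbers only)*: a square GEMM with `m = n = k = 1024` performs
roughly `n_fl_ops = 2 * 1024^3 ≈ 2.15e9` floating-point operations; with an
`avg_overall_time` of `2.5e6` ns, the throughput is about `2.15e9 / 2.5e6 ≈ 859` Gflop/s
*(dividing flops by nanoseconds yields Gflop/s directly)*.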
