This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

Minor edits #487

Merged
merged 17 commits into from
Jan 15, 2024
Changes from 16 commits
4 changes: 2 additions & 2 deletions Dockerfile
@@ -48,7 +48,7 @@ CMD cd /portBLAS && \
if [ "${COMMAND}" = 'build-test' ]; then \
if [ "${SYCL_IMPL}" = 'DPCPP' ]; then \
export LD_LIBRARY_PATH="/tmp/dpcpp/lib" && mkdir -p build && cd build && \
cmake .. -DBLAS_ENABLE_STATIC_LIBRARY=ON -DGEMM_TALL_SKINNY_SUPPORT=OFF \
cmake .. -DGEMM_TALL_SKINNY_SUPPORT=OFF \
-DSYCL_COMPILER=dpcpp -DCMAKE_PREFIX_PATH=/tmp/OpenBLAS/build \
-DBLAS_ENABLE_CONST_INPUT=OFF -DCMAKE_BUILD_TYPE=Release && \
make -j$(nproc) && cd test && ctest -VV --timeout 1200; \
@@ -58,7 +58,7 @@ CMD cd /portBLAS && \
elif [ "${COMMAND}" = 'auto-tuner' ]; then \
export LD_LIBRARY_PATH="/tmp/dpcpp/lib" && mkdir -p tools/auto_tuner/build \
&& cd tools/auto_tuner/build && \
cmake .. -DBLAS_ENABLE_STATIC_LIBRARY=ON -DGEMM_TALL_SKINNY_SUPPORT=OFF \
cmake .. -DGEMM_TALL_SKINNY_SUPPORT=OFF \
-DSYCL_COMPILER=dpcpp -DCMAKE_PREFIX_PATH=/tmp/OpenBLAS/build \
-DBLAS_ENABLE_CONST_INPUT=OFF -DCMAKE_BUILD_TYPE=Release && \
make -j$(nproc); \
149 changes: 84 additions & 65 deletions README.md

Large diffs are not rendered by default.

30 changes: 0 additions & 30 deletions Roadmap.md

This file was deleted.

130 changes: 98 additions & 32 deletions benchmark/README.md
@@ -5,53 +5,82 @@ Benchmarks

The portBLAS benchmarks are intended to measure the evolution of the
performance of this BLAS implementation and how it compares with other tuned
implementations, such as [CLBLAST](https://github.com/CNugteren/CLBlast)
(a very performant OpenCL BLAS library).
implementations, such as [cuBLAS](https://docs.nvidia.com/cuda/cublas/index.html), the native, optimized CUDA BLAS library for NVIDIA GPUs,
[rocBLAS](https://github.com/ROCm/rocBLAS), the native HIP/ROCm BLAS
library optimized for AMD GPUs, and [CLBlast](https://github.com/CNugteren/CLBlast),
a very performant OpenCL BLAS library for CPU targets.

The benchmarks use Google's [benchmark](https://github.com/google/benchmark)
library and generate a report with indicative metrics (see instructions below).
library and generate a report with indicative metrics *(see instructions below)*.

## How to compile the benchmarks
### portBLAS Benchmarks
The portBLAS default benchmarks are compiled with the project if the
`BLAS_ENABLE_BENCHMARK` CMake option is activated *(which is the case by
default)*. For the results to be relevant, portBLAS needs to be built in Release
mode *(by passing `-DCMAKE_BUILD_TYPE=Release` to the `cmake` command)*.

The benchmarks are compiled with the project if the `BLAS_ENABLE_BENCHMARK`
CMake option is activated (which is the case by default).

### clBLAST Benchmarks
The CLBLAST benchmarks are compiled only if the `BUILD_CLBLAST_BENCHMARKS` CMake
option is activated. If so, if CLBlast cannot be found the build will fail. The
location of CLBlast can be given with `CLBLAST_ROOT`.
To install CLBlast, see:
[CLBlast: Building and installing](
https://github.com/CNugteren/CLBlast/blob/master/doc/installation.md))
https://github.com/CNugteren/CLBlast/blob/master/doc/installation.md)

### cuBLAS Benchmarks
cuBLAS benchmarks can be built by enabling the `BUILD_CUBLAS_BENCHMARKS` option and
require an installation of CUDA *(>=11.x)* and the cuBLAS library *(both can be obtained by
installing the CUDA Toolkit, see the
[installation guide](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html))*.

### rocBLAS Benchmarks
rocBLAS benchmarks can be built by enabling the `BUILD_ROCBLAS_BENCHMARKS` option
and require an installation of ROCm *(>=4.5.x)* and the rocBLAS library *(please refer
to this [installation guide](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/Linux_Install_Guide.html)
for the setup)*.

## General Notes
After the compilation, the binaries will be available:
* in the build folder, in `benchmark/portblas/` and `benchmark/clblast/`
* in the build folder, in `benchmark/[Library Name]/`
* if you provide an installation directory with the CMake variable
`CMAKE_INSTALL_PREFIX`, and run the installation command, e.g
`ninja install`, in your installation folder, in `portblas/bin/`

A verification of the results is enabled by default and can be disabled with the
CMake option `BLAS_VERIFY_BENCHMARK` set to `OFF` or `0`. The verification will
be run a small number of times (more than once because of the way the benchmark
be run a small number of times *(more than once because of the way the benchmark
library works, but much less than the usual number of iterations of the
benchmarks). The verification requires that a reference implementation of BLAS
benchmarks)*. The verification requires that a reference implementation of BLAS
like OpenBLAS is installed, whose path can be given with the `CMAKE_PREFIX_PATH`
CMake parameter.

## How to run the benchmarks

The benchmarks take two kinds of command-line options: those for the benchmark
library and those specific to the portBLAS projects.

Essentially, the benchmarks can take a CSV configuration file (or will use
defaults), and if your machine has more than one OpenCL device, you can specify
which one to use. The other options specify how to output the results.
library and those specific to portBLAS:
- Benchmark library options: these options specify, for instance, the output
format and the verbosity level, and filter specific benchmarks based on regex
arguments.
- portBLAS options: these options are portBLAS-specific. They specify
the target SYCL device *(portBLAS benchmarks only)* and pass a custom set
of operator-specific configurations *(applies to rocBLAS and cuBLAS benchmarks as
well)*. If no CSV file is specified, the benchmark will use default values
*(found in `include/common/common_utils.hpp`)*.

For portBLAS and CLBlast benchmarks, if the target machine has more than one
OpenCL device, you can specify which one to use at runtime(\*).

(\*): For portBLAS benchmarks, this device should preferably match the
`TUNING_TARGET`, as some operators are configured differently depending on the
SYCL target for optimal performance.

The most useful options for us are:

|option|parameter|description|
|------|:-------:|-----------|
| `--help` | | Show help message |
| `--device` | device name | Select a device to run on (e.g `intel:gpu`) |
| `--device` | device name | Select a device to run on (e.g `intel:gpu`), useful for portBLAS and CLBlast benchmarks |
| `--csv-param` | file path | Path to a CSV file with the benchmark parameters |
| `--benchmark_format` | `console` / `json` / `csv` | Specify the format of the standard output |
| `--benchmark_out` | file path | Specify a file where to write the report |
@@ -64,21 +93,27 @@ The most useful options for us are:
You can check the [GitHub repository](https://github.com/google/benchmark) of
the library for information about the other supported command-line arguments.

Here is an example of an invocation of the GEMM benchmark running on Intel GPU,
displaying the results in the console and saving a json report:
Here is an example of an invocation of the portBLAS GEMM benchmark running on
Intel GPU, displaying the results in the console and saving a json report:

```bash
./bench_gemm --device=intel:gpu --csv-param=parameters.csv \
./benchmark/portblas/bench_gemm --device=intel:gpu --csv-param=parameters.csv \
--benchmark_out=../results.json --benchmark_out_format=json \
--benchmark_format=console
```

Here is the same benchmark through cuBLAS:
```bash
./benchmark/cublas/bench_cublas_gemm --csv-param=parameters.csv --benchmark_out=../results.json \
--benchmark_out_format=json --benchmark_format=console
```

### CSV format

The benchmarks can be given a CSV file containing the parameters to run with
(matrix/vector dimensions, transpose or not, etc), in the following format: one
line corresponds to one set of parameters, i.e. one name for the library (though
it will be iterated many times for statistical accuracy).
*(matrix/vector dimensions, transpose or not, etc)*, in the following format: one
line corresponds to one set of parameters, i.e. one name for the library *(though
it will be iterated many times for statistical accuracy)*.

The formats for the different BLAS levels are:

@@ -88,24 +123,51 @@ The formats for the different BLAS levels are:
| blas 2 | *transpose_A,m,n,alpha,beta* | Action on the matrix (`n`, `t`, `c`), dimensions, and scalars alpha and beta |
| blas 3 | | |
| gemm | *transpose_A,transpose_B,m,k,n,alpha,beta* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B:kn, C: mn), and scalars alpha and beta |
| gemm (Batched) | *transpose_A,transpose_B,m,k,n,alpha,beta,batch_size* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B:kn, C: mn), scalars alpha and beta, batch size |
| gemm (Batched) | *transpose_A,transpose_B,m,k,n,alpha,beta,batch_size,batch_type* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B:kn, C: mn), scalars alpha and beta, batch size, batch_type |
| trsm | *side,triangle,transpose,diagonal,m,n,alpha* | Position of A (`l`, `r`), A is upper or lower triangular (`u`, `l`), transposition of A (`n`, `t`), A is unit or non-unit diagonal(`u`,`n`),dimensions, scalar alpha |
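
As an illustration of the non-batched `gemm` row format above, here is a minimal Python sketch that writes one CSV line per parameter combination. The helper names `gemm_rows` and `to_csv` are hypothetical and are not part of `gen_param.py`; only the column order *transpose_A,transpose_B,m,k,n,alpha,beta* is taken from the table.

```python
import csv
import io
import itertools


def gemm_rows(transposes, sizes, alpha, beta):
    """Yield GEMM rows as transpose_A,transpose_B,m,k,n,alpha,beta."""
    for trans_a, trans_b in itertools.product(transposes, repeat=2):
        for m, k, n in itertools.product(sizes, repeat=3):
            yield [trans_a, trans_b, m, k, n, alpha, beta]


def to_csv(rows):
    """Serialize rows to CSV text, one parameter set per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Four rows: every (transpose_A, transpose_B) pair for a single 64x64x64 size.
text = to_csv(gemm_rows(["n", "t"], [64], 1, 0))
print(text, end="")
```

The resulting text can be saved to a file and passed to a benchmark with `--csv-param`.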

Note: for operations that support a stride, the benchmarks will use a stride of
1 (contiguous values), except for the GEMM batched operation where valid default stride values are used depending on batch type *(strided or interleaved)*. For operations that support a leading dimension, the
benchmarks use the minimum possible value (the actual leading dimension of the
matrix).
### Notes:

For operations that support an increment, the benchmarks will use an increment of
1 *(contiguous values)*, except for the GEMM batched operation where valid
default increment values are used depending on `batch_type` *(strided or
interleaved)*. For operations that support a leading dimension, the
benchmarks use the minimum possible value *(the actual leading dimension
of the matrix)*.

Here is an example of a valid CSV file for the GEMM benchmark:
For batched-strided operations that expect one or more strides, the benchmarking suite
expects **stride multipliers** instead of explicit stride values. For example, in the
`gemm_batched_strided` operator, a `stride_a_mul=3` for matrix `A` of `size_a=m*k`
is equivalent to an actual stride of `stride_a = stride_a_mul * size_a`.
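
The stride-multiplier rule can be checked with a few lines of Python. This is only a worked example of the formula `stride_a = stride_a_mul * size_a`; the function name `actual_stride` is hypothetical and does not exist in the benchmark sources.

```python
def actual_stride(stride_mul: int, rows: int, cols: int) -> int:
    """Return the element stride implied by a stride multiplier.

    The matrix size is rows * cols, so a multiplier of 1 means the
    matrices of the batch are stored back to back.
    """
    return stride_mul * rows * cols


m, k = 64, 128
stride_a = actual_stride(3, m, k)  # 3 * (64 * 128)
print(stride_a)  # 24576
```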

For operations that support the complex data type *(and when the CMake option
`BLAS_ENABLE_COMPLEX` is enabled)*, scalars such as `alpha` and `beta` are
expected to have real and imaginary parts as separate values. If only single
scalar values are passed in the CSV configuration files, complex values
for these arguments will be constructed by duplicating the same real value
for both the real and imaginary parts *(e.g. alpha_cplx={alpha_real, alpha_real})*.
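
The duplication rule can be sketched in Python as follows. This is an illustration and not the benchmark's own parser: the function name `parse_gemm_scalars` is hypothetical, it only handles non-batched GEMM rows, and it assumes that a four-value tail is ordered as real then imaginary part for `alpha`, followed by real then imaginary part for `beta`.

```python
def parse_gemm_scalars(row: str):
    """Return (alpha, beta) as complex numbers from a non-batched GEMM row.

    Fields 0-4 are transpose_A, transpose_B, m, k, n; the rest are scalars.
    """
    fields = [field.strip() for field in row.split(",")]
    scalars = [float(value) for value in fields[5:]]
    if len(scalars) == 2:
        # Single real values: duplicate each into real and imaginary parts.
        alpha = complex(scalars[0], scalars[0])
        beta = complex(scalars[1], scalars[1])
    elif len(scalars) == 4:
        # Assumed order: alpha_real, alpha_imag, beta_real, beta_imag.
        alpha = complex(scalars[0], scalars[1])
        beta = complex(scalars[2], scalars[3])
    else:
        raise ValueError(f"expected 2 or 4 scalar values, got {len(scalars)}")
    return alpha, beta


print(parse_gemm_scalars("n,t,64,128,64,0.5,0.5"))
print(parse_gemm_scalars("n,n,42,42,42,1,2,3,0"))
```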


Here are two examples of valid CSV files for the GEMM benchmark:

- Scalar only alpha & beta values

```
n,n,42,42,42,1,0
n,t,64,128,64,0.5,0.5
t,n,13,3,7,0,0.7
```

- Complex alpha & beta values *(two scalars each)*
```
n,n,42,42,42,1,2,3,0
n,t,64,128,64,0.5,1,0.5,1
t,n,13,3,7,0,1,0.7,3
```


The folder `config_csv` provides a few files corresponding to sizes that are
relevant for neural networks, but you can use your own files, see the next
relevant for neural networks, but you can provide your own, see the next
section for more info on how to generate them.

### Python tool to generate a CSV file
@@ -262,6 +324,9 @@ following keys:
* `cpu_time`: actual CPU time spent running the benchmark
* `time_unit`: unit used for these times. Should be `ns`, if not please file an
issue.
* `label`: contains the name of the benchmark backend *(`@backend`: "portblas",
"cublas" or "rocblas")*, and information related to the target device *(`device_name`,
`device_version`, `driver_version`, etc.)*
* `avg_event_time`: the average of the CL/SYCL event times in nanoseconds. This
time depends on the events returned by the BLAS functions used and might not
be accurate in some cases
@@ -280,9 +345,10 @@ following keys:
* `bytes_processed`: total number of bytes read and written in memory. It is
calculated theoretically based on the operations that we think the benchmark
is doing.
* a few benchmark parameters (e.g `m`, `n`, `k` for GEMM)
* other operator-specific parameters that affect the computations *(e.g `m`, `n`,
`k` for GEMM, but `alpha` is skipped as it doesn't affect the operation directly)*
* some other keys from the benchmark library

**Note:** to calculate the performance in Gflops, you can divide `n_fl_ops` by one
of the best or average time metrics, e.g `avg_overall_time` (the event and wall
time usually converge for large dimensions).
of the best or average time metrics, e.g `avg_overall_time` *(the event and wall
time usually converge for large dimensions)*.
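
The note above can be turned into a short Python sketch. It assumes that the chosen time metric is expressed in nanoseconds, in which case operations per nanosecond equal Gflop/s with no further scaling. The report below is synthetic and its values are made up; only the top-level `benchmarks` list of Google Benchmark's JSON output and the keys `n_fl_ops` and `avg_overall_time` come from this document and the library.

```python
import json


def gflops(entry: dict, time_key: str = "avg_overall_time") -> float:
    """Divide the operation count by a time in ns, which yields Gflop/s."""
    return entry["n_fl_ops"] / entry[time_key]


# Synthetic stand-in for a file written with --benchmark_out_format=json.
report = json.loads("""
{"benchmarks": [
  {"name": "example_gemm", "n_fl_ops": 2000000, "avg_overall_time": 50000}
]}
""")

for entry in report["benchmarks"]:
    print(entry["name"], gflops(entry))
```

Passing another key as `time_key`, for example `avg_event_time`, gives the same figure based on event times.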
4 changes: 3 additions & 1 deletion benchmark/gen_param.py
@@ -1,5 +1,5 @@
#!/usr/bin/env python3
#/***************************************************************************
# /***************************************************************************
# *
# * @license
# * Copyright (C) Codeplay Software Limited
@@ -32,12 +32,14 @@
import itertools
import argparse


def main(args):
"""Generate the csv file according to the given arguments
"""
# Match DSL to Python names
nd_range = itertools.product
value_range = lambda *v: list(v)

def size_range(low, high, mult):
val = low
while val <= high:
41 changes: 31 additions & 10 deletions doc/AddingBlas3Op.md
@@ -35,7 +35,7 @@ typename sb_handle_t::event_t _trsm(sb_handle_t& sb_handle, char Side,
container_0_t A, index_t lda,
container_1_t B, index_t ldb);

}
} // namespace internal

// User-facing function to call the TRSM operation
template <typename sb_handle_t, typename container_0_t, typename container_1_t,
@@ -47,7 +47,6 @@ typename sb_handle_t::event_t inline _trsm(
return internal::_trsm(sb_handle, Side, Triangle, Transpose, Diagonal, M, N, alpha, A, lda,
B, ldb);
}
} // namespace internal
} // namespace blas
```

@@ -98,14 +97,19 @@ void run_test(const combination_t<scalar_t> combi) {
auto q = make_queue();
SB_Handle sb_handle(q);

//
// Perform any host-to-device copies here
//

// Invoke the newly added operation
_trsm(sb_handle, side, triangle, transA, diag, m, n, alpha, a_gpu, lda, b_gpu, ldb);
_trsm(sb_handle, side, triangle, transA, diag, m, n, alpha, a_gpu, lda, b_gpu, ldb, dependencies);

// Verify the results
}

// Create the combinations of parameters to invoke the test
const auto combi = ::testing::Combine(::testing::Values(7, 513, 1027), // m
const auto combi = ::testing::Combine(::testing::Values("usm", "buf"), // allocation type
::testing::Values(7, 513, 1027), // m
::testing::Values(7, 513, 1027), // n
::testing::Values('n', 't'), // transA
::testing::Values('l', 'r'), // side
@@ -143,7 +147,8 @@ template <typename sb_handle_t, typename container_0_t, typename container_1_t,
typename sb_handle_t::event_t _trsm(
sb_handle_t& sb_handle, char Side, char Triangle, char Transpose, char Diagonal,
index_t M, index_t N, element_t alpha, container_0_t A, index_t lda,
container_1_t B, index_t ldb) {
container_1_t B, index_t ldb,
const typename sb_handle_t::event_t& _dependencies) {
// Implementation of the new operation
// This will probably invoke a kernel which we don't yet have defined.

@@ -153,7 +158,7 @@ typename sb_handle_t::event_t _trsm(
auto gemmEvent = internal::_gemm(
sb_handle, 'n', isTranspose ? 't' : 'n', M, currentBlockSize,
currentBlockSize, (i == 0) ? alpha : element_t{1}, B + i * ldb, ldb,
invA + i * blockSize, blockSize, element_t{0}, X + i * ldx, ldx);
invA + i * blockSize, blockSize, element_t{0}, X + i * ldx, ldx, intermediate_events);
trsmEvents = concatenate_vectors(trsmEvents, gemmEvent);

// Ultimately, a list of all events created in this function is returned
@@ -173,8 +178,9 @@ binary is linked against portBLAS, the linker will find the definition of the mi

To do this, we create the source file that will contain instantiations of the new `_trsm` operation.
The file is located at `src/interface/blas3/trsm.cpp.in`. This is not the file that will be
compiled, but a template file that the python script `python_generator/py_gen_blas_binary.py`
will use to generate the actual source file where the instantiation of `_trsm` will happen.
compiled, but a template file that the python script `python_generator/py_gen_blas_ops.py`
will use to generate the actual source file where the instantiation of `_trsm` will happen. The call
to this generator is wrapped within the cmake function `generate_blas_objects`.

The file `src/interface/blas3/trsm.cpp.in` must include all files that are necessary to successfully
compile `blas::internal::_trsm`, for this particular example, this file looks like the following:
@@ -193,13 +199,28 @@ compile `blas::internal::_trsm`, for this particular example, this file looks li
namespace blas {
namespace internal {


// Buffer Declaration
template typename SB_Handle::event_t _trsm(
SB_Handle sb_handle, char Side, char Triangle, char Transpose, char Diagonal,
${INDEX_TYPE} M, ${INDEX_TYPE} N, ${DATA_TYPE} alpha,
BufferIterator<${DATA_TYPE}> A, ${INDEX_TYPE} lda,
BufferIterator<${DATA_TYPE}> B, ${INDEX_TYPE} ldb);
BufferIterator<${DATA_TYPE}> B, ${INDEX_TYPE} ldb,
const typename SB_Handle::event_t& _dependencies);

// USM Declarations
#ifdef SB_ENABLE_USM
template typename SB_Handle::event_t _trsm(
SB_Handle& sb_handle, char side, char uplo, char trans, char diag,
${INDEX_TYPE} M, ${INDEX_TYPE} N, ${DATA_TYPE} alpha, ${DATA_TYPE} * A,
${INDEX_TYPE} lda, ${DATA_TYPE} * B, ${INDEX_TYPE} ldb,
const typename SB_Handle::event_t& _dependencies);

template typename SB_Handle::event_t _trsm(
SB_Handle& sb_handle, char side, char uplo, char trans, char diag,
${INDEX_TYPE} M, ${INDEX_TYPE} N, ${DATA_TYPE} alpha,
const ${DATA_TYPE} * A, ${INDEX_TYPE} lda, ${DATA_TYPE} * B,
${INDEX_TYPE} ldb, const typename SB_Handle::event_t& _dependencies);
#endif

} // namespace internal
} // namespace blas
Empty file.
Empty file.
2 changes: 1 addition & 1 deletion python_generator/py_gen_blas_gemm_launcher.py
@@ -1,4 +1,4 @@
#/***************************************************************************
# /***************************************************************************
# *
# * @license
# * Copyright (C) Codeplay Software Limited
2 changes: 1 addition & 1 deletion python_generator/py_gen_blas_ops.py
@@ -1,4 +1,4 @@
#/***************************************************************************
# /***************************************************************************
# *
# * @license
# * Copyright (C) Codeplay Software Limited