From 0c225e17c015f1caecc37e80674e8962b38fe491 Mon Sep 17 00:00:00 2001 From: Rod Burns <800194+rodburns@users.noreply.github.com> Date: Fri, 22 Dec 2023 10:55:28 +0000 Subject: [PATCH 01/14] Minor edits --- README.md | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index c5383b73f..3788c8793 100644 --- a/README.md +++ b/README.md @@ -3,13 +3,10 @@ portBLAS Implementation [![Build and Test](https://github.com/codeplaysoftware/portBLAS/actions/workflows/build-and-test.yml/badge.svg?event=push)](https://github.com/codeplaysoftware/portBLAS/actions/workflows/build-and-test.yml) -portBLAS implements BLAS - [Basic Linear Algebra Subroutines](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) - using [SYCL 1.2]( -https://www.khronos.org/registry/sycl/specs/sycl-1.2.pdf), the -[Khronos](http://www.khronos.org) abstraction layer for [OpenCL](https://www.khronos.org/opencl/). +portBLAS implements BLAS - [Basic Linear Algebra Subroutines](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) - using [SYCL](https://www.khronos.org/sycl/)). -portBLAS is a current work in progress research project from an ongoing -collaboration with the *High Performance Computing & Architectures (HPCA) group* -from the Universitat Jaume I [UJI](http://www.hpca.uji.es/). +portBLAS is an ongoing collaboration with the *High Performance Computing +& Architectures (HPCA) group* from the Universitat Jaume I [UJI](http://www.hpca.uji.es/). portBLAS is written using modern C++. The current implementation uses C++11 features. From be90e3d2b8c4a1b2b41df11b62e109c38f5b39e0 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Wed, 27 Dec 2023 15:02:32 +0000 Subject: [PATCH 02/14] Added remaining operators to Doc & other Readme fixes --- README.md | 133 ++++++++++++++++++++---------------------------------- 1 file changed, 50 insertions(+), 83 deletions(-) diff --git a/README.md b/README.md index 3788c8793..e8c17bc51 100644 --- a/README.md +++ b/README.md @@ -3,7 +3,7 @@ portBLAS Implementation [![Build and Test](https://github.com/codeplaysoftware/portBLAS/actions/workflows/build-and-test.yml/badge.svg?event=push)](https://github.com/codeplaysoftware/portBLAS/actions/workflows/build-and-test.yml) -portBLAS implements BLAS - [Basic Linear Algebra Subroutines](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) - using [SYCL](https://www.khronos.org/sycl/)). +portBLAS implements BLAS - [Basic Linear Algebra Subroutines](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) - using [SYCL](https://www.khronos.org/sycl/). portBLAS is an ongoing collaboration with the *High Performance Computing & Architectures (HPCA) group* from the Universitat Jaume I [UJI](http://www.hpca.uji.es/). @@ -30,14 +30,11 @@ the project. 
- [Experimental Joint Matrix Support](#jm_support) - [Requirements](#requirements) - [Setup](#setup) - - [Compile with ComputeCpp](#compile-with-computecpp) - [Compile with DPC++](#compile-with-dpc) - [Compile with hipSYCL](#compile-with-hipsycl) - [Instaling portBLAS](#instaling-portBLAS) - - [POWER\_VR support (ComputeCpp Only)](#power_vr-support-computecpp-only) - [Doxygen](#doxygen) - [CMake options](#cmake-options) - - [Cross-Compile (ComputeCpp Only)](#cross-compile-computecpp-only) - [Tests and benchmarks](#tests-and-benchmarks) - [Contributing to the project](#contributing-to-the-project) - [Guides and Other Documents](#guides-and-other-documents) @@ -204,30 +201,32 @@ The following table sums up the interface that can be found in For all these operations: * `vx` and `vy` are containers for vectors `x` and `y`. -* `incx` and `incy` are their increments (number of steps to jump to the next - value, 1 for contiguous values). -* `N`, an integer, is the size of the vectors (less than or equal to the size of - the containers). +* `incx` and `incy` are their increments *(number of steps to jump to the next + value, 1 for contiguous values)*. +* `N`, an integer, is the size of the vectors *(less than or equal to the size of + the containers)*. * `alpha` is a scalar. * `rs` is a container of size 1, containing either a scalar, an integer, or an index-value tuple. -* `c` and `s` for `_rot` are scalars (cosine and sine) +* `c` and `s` for `_rot` are scalars *(cosine and sine)*. +* `sb` for `_sdsdot` is a single precision scalar to be added to output. | operation | arguments | description | |-----------|-------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `_asum` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Absolute sum of the vector `x`; written in `rs` if passed, else returned | | `_axpy` | `sb_handle`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy` | Vector multiply-add: `y = alpha * x + y` | | `_copy` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` | Copies a vector to another: `y = x` | -| `_dot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors `x` and `y`; written in `rs` if passed, else returned | -| `_asum` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Absolute sum of the vector `x`; written in `rs` if passed, else returned | -| `_iamax` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | First index and value of the maximum element of `x`; written in `rs` if passed, else the index only is returned | -| `_iamin` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | First index and value of the minimum element of `x`; written in `rs` if passed, else the index only is returned | -| `_swap` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` | Interchanges two vectors: `y = x` and `x = y` | -| `_scal` | `sb_handle`, `N`, `alpha`, `vx`, `incx` | Scalar product of a vector: `x = alpha * x` | +| `_dot` | `sb_handle`, `N`, `sb`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors `x` and `y`; added to `sb`` and written to `rs` if passed, else returned | +| `_sdsdot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors with double precision `x` and `y`; written in `rs` if passed, else returned | | `_nrm2` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Euclidean norm of the vector `x`; written in `rs` if passed, else returned | | `_rot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, 
`incy`, `c`, `s` | Applies a plane rotation to `x` and `y` with a cosine `c` and a sine `s` | | `_rotg` | `sb_handle`, `a`, `b`, `c`, `s` | Given the Cartesian coordinates (`a`, `b`) of a point, return the parameters `c`, `s`, `r`, and `z` associated with the Givens rotation. | | `_rotm` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy`, `param` | Applies a modified Givens rotation to `x` and `y`. | | `_rotmg` | `sb_handle`, `d1`, `d2`, `x1`, `y1` `param` | Given the Cartesian coordinates (`x1`, `y1`) of a point, return the components of a modified Givens transformation matrix that zeros the y-component of the resulting point. | +| `_scal` | `sb_handle`, `N`, `alpha`, `vx`, `incx` | Scalar product of a vector: `x = alpha * x` | +| `_swap` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` | Interchanges two vectors: `y = x` and `x = y` | +| `_iamax` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | First index and value of the maximum element of `x`; written in `rs` if passed, else the index only is returned | +| `_iamin` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | First index and value of the minimum element of `x`; written in `rs` if passed, else the index only is returned | ### BLAS 2 @@ -253,19 +252,26 @@ For all these operations: its neighbor in the next column and same row. `lda` must be at least `M`. * `vx` and `vy` are containers for vectors `x` and `y`. * `incx` and `incy` are their increments (cf BLAS 1). +* `K` Number of sub/super-diagonals of the matrix. | operation | arguments | description | |---|---|---| -| `_gemv` | `sb_handle`, `trans`, `M`, `N`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Generalised matrix-vector product followed by a vector sum: `y = alpha * A * x + beta * y`. *Note: the dimensions of the vectors depend on the transpose mode (`x`: `N` and `y`: `M` for mode `'n'` ; `x`: `M` and `y`: `N` otherwise)* | | `_gbmv` | `sb_handle`, `trans`, `M`, `N`, `KL`, `KU`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Generalised band matrix-vector product followed by a vector sum: `y = alpha * A * x + beta * y`. *Note: the dimensions of the vectors depend on the transpose mode (`x`: `N` and `y`: `M` for mode `'n'` ; `x`: `M` and `y`: `N` otherwise)* | -| `_trmv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `alpha`, `mA`, `lda`, `vx`, `incx` | Matrix-vector product for a triangular matrix: `x = A * x` | -| `_symv` | `sb_handle`, `uplo`, `N`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Variant of GEMV for a symmetric matrix (`y = alpha * A * x + beta * y`). *Note: `uplo` specifies which side of the matrix will be read* | +| `_gemv` | `sb_handle`, `trans`, `M`, `N`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Generalised matrix-vector product followed by a vector sum: `y = alpha * A * x + beta * y`. 
*Note: the dimensions of the vectors depend on the transpose mode (`x`: `N` and `y`: `M` for mode `'n'` ; `x`: `M` and `y`: `N` otherwise)* |
| `_ger` | `sb_handle`, `M`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy`, `mA`, `lda` | Generalised vector-vector product followed by a matrix sum: `A = alpha * x * yT + A` |
+| `_sbmv` | `sb_handle`, `uplo`, `N`, `K`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Compute a scalar-matrix-vector product and add the result to a scalar-vector product, with a symmetric band matrix: `y = alpha * mA * x + beta * y` |
+| `_spmv` | `sb_handle`, `uplo`, `N`, `alpha`, `mA`, `vx`, `incx`, `beta`, `vy`, `incy` | Symmetric packed matrix-vector product: `y = alpha * A * x + beta * y` |
+| `_spr` | `sb_handle`, `uplo`, `N`, `alpha`, `vx`, `incx`, `mPA` | Symmetric vector-vector product followed by a matrix sum: `mPA = alpha * x * xT + mPA` |
+| `_spr2` | `sb_handle`, `uplo`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy`, `mPA` | Compute two scalar-vector-vector products and add them to a symmetric packed matrix: `mPA = alpha * x * yT + alpha * y * xT + mPA` |
+| `_symv` | `sb_handle`, `uplo`, `N`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Variant of GEMV for a symmetric matrix (`y = alpha * A * x + beta * y`). *Note: `uplo` specifies which side of the matrix will be read* |
| `_syr` | `sb_handle`, `uplo`, `N`, `alpha`, `vx`, `incx`, `mA`, `lda` | Generalised vector squaring followed by a sum with a symmetric matrix: `A = alpha * x * xT + A` |
| `_syr2` | `sb_handle`, `uplo`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy`, `mA`, `lda` | Generalised vector products followed by a sum with a symmetric matrix: `A = alpha*x*yT + alpha*y*xT + A` |
-| `_spr` | `sb_handle`, `uplo`, `N`, `alpha`, `vx`, `incx`, `mPA` | Symmetric vector-vector product followed by a matrix sum: `mPA = alpha * x * xT + mPA` |
-| `_spmv` | `sb_handle`, `uplo`, `N`, `alpha`, `mA`, `vx`, `incx`, `beta`, `vy`, `incy` | Symmetric packed matrix-vector product: `y = alpha * A * x + beta * y` |
-| `_tpmv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `mA`, `vx`, `incx` | Triangular packed matrix-vector product: `x = A * x` |
+| `_tbmv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `K`, `mA`, `lda`, `vx`, `incx` | Compute a matrix-vector product with a triangular band matrix: `x = A * x` |
+| `_tbsv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `K`, `mA`, `lda`, `vx`, `incx` | Solve a system of linear equations whose coefficients are in a triangular band matrix: `A * x = b` |
+| `_tpmv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `mA`, `vx`, `incx` | Triangular packed matrix-vector product: `x = A * x` |
+| `_tpsv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `mA`, `vx`, `incx` | Solve a system of linear equations whose coefficients are in a triangular packed matrix: `A * x = b` |
+| `_trmv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `alpha`, `mA`, `lda`, `vx`, `incx` | Matrix-vector product for a triangular matrix: `x = A * x` |
+| `_trsv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `mA`, `lda`, `vx`, `incx` | Solve a system of linear equations whose coefficients are in a triangular matrix: `A * x = b` |

### BLAS 3

@@ -274,7 +280,7 @@ The following table sums up the interface that can be found in

For all these operations:

-* `A`, `B` and `C` are containers for the column-major matrices A, B and C.
+* `mA`, `mB` and `mC` are containers for the column-major matrices A, B and C.
* `lda`, `ldb` and `ldc` are the leading dimensions of the matrices A, B and C (cf BLAS 2). 
The leading dimension of a matrix must be greater than or equal to its number of rows. @@ -288,15 +294,19 @@ For all these operations: * `uplo` is a `char` that provides information about triangular matrices: `u` for upper triangular and `l` for lower triangular matrices. * `diag` is a `char` that provides information about the diagonal elements of a - triangular matrix: `u` if the matrix is unit triangular (all diagonal elements - are 1), else `n`. + triangular matrix: `u` if the matrix is unit triangular *(all diagonal elements + are 1)*, else `n`. +* `stride_a`, `stride_b` and `stride_c` are the striding size between consecutive +matrices in a batched-strided entry for the inputs/outputs. +* `batch_type` for `_gemm_batched` is either `strided` *(by default)* or `interleaved`*(More details about it here : [Gemm.md](doc/Gemm.md))*. | operation | arguments | description | |---|---|---| -| `_gemm` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `A`, `lda`, `B`, `ldb`, `beta`, `C`, `ldc` | Generalised matrix-matrix multiplication followed by matrix addition: `C = alpha * A * B + beta * C` | -| `_gemm_batched` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `A`, `lda`, `B`, `ldb`, `beta`, `C`, `ldc`, `batch_size`, `batch_type` | Same as `_gemm` but the containers contain `batch_size` end-to-end matrices. GEMM operations are performed independently with matching matrices. | -| `_gemm_strided_batched` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `A`, `lda`, `stridea`, `B`, `ldb`, `strideb`, `beta`, `C`, `ldc`, `stridec`, `batch_size` | Same as `_gemm` but the containers contain `batch_size` end-to-end matrices. GEMM operations are performed independently with matching matrices. -| `_trsm` | `sb_handle`, `side`, `uplo`, `trans`, `diag`, `M`, `N`, `alpha`, `A`, `lda`, `B`, `ldb` | Triangular solve with Multiple Right-Hand Sides. | +| `_gemm` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `mA`, `lda`, `mB`, `ldb`, `beta`, `mC`, `ldc` | Generalised matrix-matrix multiplication followed by matrix addition: `C = alpha * A * B + beta * C` | +| `_gemm_batched` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `mA`, `lda`, `mB`, `ldb`, `beta`, `mC`, `ldc`, `batch_size`, `batch_type` | Same as `_gemm` but the containers contain `batch_size` end-to-end matrices. GEMM operations are performed independently with matching matrices. | +| `_gemm_strided_batched` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `mA`, `lda`, `stride_a`, `mB`, `ldb`, `stride_b`, `beta`, `mC`, `ldc`, `stride_c`, `batch_size` | Same as `_gemm` but the containers contain `batch_size` end-to-end matrices. GEMM operations are performed independently with matching matrices. +| `_symm` | `sb_handle`, `side` , `uplo` , `M`, `N`, `alpha`, `mA`, `lda`, `mB`, `ldb`, `beta`, `mC`, `ldc`| Compute a scalar-matrix-matrix product and add the result to a scalar-matrix product, where one of the matrices in the multiplication is symmetric. | +| `_trsm` | `sb_handle`, `side`, `uplo`, `trans`, `diag`, `M`, `N`, `alpha`, `mA`, `lda`, `mB`, `ldb` | Triangular solve with Multiple Right-Hand Sides. | ### EXTENSION @@ -311,17 +321,17 @@ For all these operations: to its number of rows. In the case of in-place copy/transpose, the same matrix `A` is used with two different leading dimensions for input & output. * `stride_a`, `stride_b` and `stride_c` are the striding size between consecutive -matrices in a batched entry for inputs/outputs A, B and C. 
-* `inc_a` and `inc_b` are the distance between consecutive elements in A & B matrices. +matrices in a batched entry for the inputs/outputs. +* `inc_a` and `inc_b` are the jump-count between consecutive elements in A & B matrices. * `transa` and `transb` are the transpose modes of the matrices A and B (cf BLAS 2). -* `M` and `N` are the dimensions of the matrices. +* `M` and `N` are the dimensions of the matrices (Rows and Columns respectively). * `alpha` and `beta` are scaling scalars. * `batch_size` is the number of batch matrices. -* `inc_a` and `inc_b` are integers. The distance between element in the same column. | operation | arguments | description | |---|---|---| +| `_axpy_batch` | `sb_handle`, `N`, `alpha`, `vx`, `incx`, `stride_x`, `vy`, `incy`, `stride_y`, `batch_size` | Perform multiple axpy operators in batch | | `_omatcopy` | `sb_handle`, `transa`, `M`, `N`, `alpha`, `A`, `lda`, `B`, `ldb` | Perform an out-of-place scaled matrix transpose or copy operation using a general dense matrix. | | `_omatcopy2`| `sb_handle`, `transa`, `M`, `N`, `alpha`, `A`, `lda`, `inc_a`, `B`, `ldb`, `inc_b` | Computes two-strided scaling and out-of-place transposition or copying of general dense matrices. | | `_omatadd`| `sb_handle`, `transa`, `transb`, `M`, `N`, `alpha`, `A`, `lda`, `beta`, `B`, `ldb`, `C`,`ldc` | Computes scaled general dense matrix addition with possibly transposed arguments. | @@ -352,12 +362,12 @@ The user should expect erroneous behaviour from the code if both of these requir ## Requirements -portBLAS is designed to work with any SYCL 1.2.1 implementation. +portBLAS is designed to work with any SYCL implementation. We do not use any OpenCL interoperability, hence, the code is pure C++. -The project is developed using [ComputeCpp CE Edition](http://www.computecpp.com) -using Ubuntu 16.04 on Intel OpenCL CPU and Intel GPU. -In order to build the sources, GCC 5.4 or higher is required. -The build system is CMake version 3.4.2 or higher. +The project is developed using [DPCPP open source](https://github.com/intel/llvm) +or [oneapi release](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html#gs.2iaved), +using Ubuntu 22.04 on Intel OpenCL CPU, Intel GPU, NVIDIA GPU and AMD GPU. +The build system is CMake version 3.4.3 or higher. A BLAS library, such as [OpenBLAS](https://github.com/xianyi/OpenBLAS), is also required to build and verify the test results. @@ -376,20 +386,13 @@ added to the `CMAKE_PREFIX_PATH` when building portBLAS (see **IMPORTANT NOTE:** The `TARGET` CMake variable is no longer supported. It has been replaced by `TUNING_TARGET`, which accepts the same options. -`TUNING_TARGET` affects only the tuning configuration and has no effect on the target +`TUNING_TARGET` affects only the tuning configuration, applicable for some operators such as GEMM, and has no effect on the target triplet for DPC++ or the hipSYCL target. Please refer to the sections below for setting them. 1. Clone the portBLAS repository, making sure to pass the `--recursive` option, in order to clone submodule(s). 2. Create a build directory -3. Run `CMake` from the build directory (see options in the section below): - -### Compile with ComputeCpp -```bash -cd build -cmake -GNinja ../ -DComputeCpp_DIR=/path/to/computecpp -DSYCL_COMPILER=computecpp -ninja -``` +3. 
Run `CMake` from the build directory *(see options in the section below)*: ### Compile with DPC++ ```bash @@ -423,11 +426,6 @@ To install the portBLAS library (see `CMAKE_INSTALL_PREFIX` below) ```bash ninja install ``` -### POWER_VR support (ComputeCpp Only) - -To enable the PowerVR target tuning, pass: `-DTUNING_TARGET=POWER_VR` - -To use the neural network library from Imagination, pass: `-DIMGDNN_DIR=path/to/library` ### Doxygen @@ -460,38 +458,7 @@ Some of the supported options are: | `BLAS_ENABLE_EXTENSIONS` | `ON`/`OFF` | Determines whether to enable portBLAS extensions (`ON` by default) | | `BLAS_DATA_TYPES` | `half;float;double` | Determines the floating-point types to instantiate BLAS operations for. Default is `float` | | `BLAS_INDEX_TYPES` | `int32_t;int64_t` | Determines the type(s) to use for `index_t` and `increment_t`. Default is `int` | -| `BLAS_ENABLE_COMPLEX` | `ON`/`OFF` | Determines whether to enable Complex data type support *(GEMM Kernels only)* (`ON` by default) | - -### Cross-Compile (ComputeCpp Only) - -To cross-compile portBLAS first the following environment variables must be -set: - -```bash -export COMPUTECPP_TOOLCHAIN_DIR="PATH TO TOOLCHAIN_DIR" -export COMPUTECPP_TARGET_TRIPLE="PATH TO TARGET_TRIPLE" -export COMPUTECPP_SYSROOT_DIR="$PATH TO SYSROOT_DIR" -``` - -Clone the [ComputeCpp-SDK](https://github.com/codeplaysoftware/computecpp-sdk) to retrieve the toolchain file. -The following CMake command can be used to cross-compile portBLAS: - -```bash -cmake -GNinja \ - ${SOURCE_ROOT} \ - -DCMAKE_PREFIX_PATH="${OPENBLAS_PATH}" \ - -DComputeCpp_DIR="${COMPUTECPP_DEVICE_PATH}" \ - -DComputeCpp_HOST_DIR="${COMPUTECPP_X86_PATH}" \ - -DCMAKE_TOOLCHAIN_FILE="/path/to/computecpp-sdk/cmake/toolchains/gcc-generic.cmake" \ - -DCMAKE_BUILD_TYPE='Release' \ - -DCMAKE_INSTALL_PREFIX=${CROSS_COMPILED_PORTBLAS_INSTALL} \ - -DOpenCL_INCLUDE_DIR="${OpenCL_Headers_PATH}" \ - -DOpenCL_LIBRARY="${OpenCL_LIBRARY}" \ - -DCOMPUTECPP_BITCODE="${DEVICE_BITCODE}" \ - -DCMAKE_CXX_FLAGS='-O3' \ - -DTUNING_TARGET="${CHOSEN_TARGET}" -``` - +| `BLAS_ENABLE_COMPLEX` | `ON`/`OFF` | Determines whether to enable Complex data type support *(GEMM Operators only)* (`ON` by default) | ## Tests and benchmarks From b46de5f59e5341a207cc1d7596b3938a1dee709c Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Wed, 27 Dec 2023 15:03:19 +0000 Subject: [PATCH 03/14] Fixes and additions to Blas3Op doc file --- doc/AddingBlas3Op.md | 33 ++++++++++++++++++++++++++------- 1 file changed, 26 insertions(+), 7 deletions(-) diff --git a/doc/AddingBlas3Op.md b/doc/AddingBlas3Op.md index 51d40ab7c..d6d578cbc 100644 --- a/doc/AddingBlas3Op.md +++ b/doc/AddingBlas3Op.md @@ -35,7 +35,7 @@ typename sb_handle_t::event_t _trsm(sb_handle_t& sb_handle, char Side, container_0_t A, index_t lda, container_1_t B, index_t ldb); -} +} // namespace internal // User-facing function to call the TRSM operation template combi) { auto q = make_queue(); SB_Handle sb_handle(q); + // + // Perform any host-to-device copies here + // + // Invoke the newly added operation - _trsm(sb_handle, side, triangle, transA, diag, m, n, alpha, a_gpu, lda, b_gpu, ldb); + _trsm(sb_handle, side, triangle, transA, diag, m, n, alpha, a_gpu, lda, b_gpu, ldb, dependencies); // Verify the results } @@ -143,7 +146,8 @@ template A, ${INDEX_TYPE} lda, - BufferIterator<${DATA_TYPE}> B, ${INDEX_TYPE} ldb); + BufferIterator<${DATA_TYPE}> B, ${INDEX_TYPE} ldb, + const typename SB_Handle::event_t& _dependencies); +// USM Declarations +#ifdef SB_ENABLE_USM +template 
typename SB_Handle::event_t _trsm( + SB_Handle& sb_handle, char side, char uplo, char trans, char diag, + ${INDEX_TYPE} M, ${INDEX_TYPE} N, ${DATA_TYPE} alpha, ${DATA_TYPE} * A, + ${INDEX_TYPE} lda, ${DATA_TYPE} * B, ${INDEX_TYPE} ldb, + const typename SB_Handle::event_t& _dependencies); + +template typename SB_Handle::event_t _trsm( + SB_Handle& sb_handle, char side, char uplo, char trans, char diag, + ${INDEX_TYPE} M, ${INDEX_TYPE} N, ${DATA_TYPE} alpha, + const ${DATA_TYPE} * A, ${INDEX_TYPE} lda, ${DATA_TYPE} * B, + ${INDEX_TYPE} ldb, const typename SB_Handle::event_t& _dependencies); +#endif } // namespace internal } // namespace blas From eee89a069b260076d2f0ac45729d89c43a70fdd9 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Mon, 1 Jan 2024 13:25:19 +0000 Subject: [PATCH 04/14] Recovered powerVR mention in readme --- README.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/README.md b/README.md index e8c17bc51..58bc37346 100644 --- a/README.md +++ b/README.md @@ -33,6 +33,7 @@ the project. - [Compile with DPC++](#compile-with-dpc) - [Compile with hipSYCL](#compile-with-hipsycl) - [Instaling portBLAS](#instaling-portBLAS) + - [POWER\_VR support](#power_vr-support-computecpp-only) - [Doxygen](#doxygen) - [CMake options](#cmake-options) - [Tests and benchmarks](#tests-and-benchmarks) @@ -427,6 +428,12 @@ To install the portBLAS library (see `CMAKE_INSTALL_PREFIX` below) ninja install ``` +### POWER_VR support + +To enable the PowerVR target tuning, pass: `-DTUNING_TARGET=POWER_VR` + +To use the neural network library from Imagination, pass: `-DIMGDNN_DIR=path/to/library` + ### Doxygen Doxygen documentation can be generated by running: From 5f3240fb369a1e66c3b2000d94799e0c1ecc1887 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Tue, 2 Jan 2024 15:40:05 +0000 Subject: [PATCH 05/14] Updated Benchmark Doc to include cuBLAS, rocBLAS and complex data updates --- benchmark/README.md | 130 +++++++++++++++++++++++++++++++++----------- 1 file changed, 98 insertions(+), 32 deletions(-) diff --git a/benchmark/README.md b/benchmark/README.md index 0089ba9d2..bfc1f9ec0 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -5,53 +5,82 @@ Benchmarks The portBLAS benchmarks are intended to measure the evolution of the performance of this BLAS implementation and how it compares with other tuned -implementations, such as [CLBLAST](https://github.com/CNugteren/CLBlast) -(a very performant OpenCL BLAS library). +implementations, such as [CUBLAS](https://docs.nvidia.com/cuda/cublas/index.html) native and optimized CUDA BLAS library for Nvidia GPUs, +[ROCBLAS](https://github.com/ROCm/rocBLAS) which is the native HIP/Rocm BLAS +library optimized for AMD GPUs, and [CLBLAST](https://github.com/CNugteren/CLBlast), +a very performant OpenCL BLAS library for CPU targets. The benchmarks use Google's [benchmark](https://github.com/google/benchmark) -library and generate a report with indicative metrics (see instructions below). +library and generate a report with indicative metrics *(see instructions below)*. ## How to compile the benchmarks +### portBLAS Benchmarks +The portBLAS default benchmarks are compiled with the project if the +`BLAS_ENABLE_BENCHMARK` CMake option is activated *(which is the case by +default)*. For the results to be relevant, portBLAS needs to be built in Release +mode *(Passing `-DCMAKE_BUILD_TYPE=Release` to `cmake` command)*. 
-The benchmarks are compiled with the project if the `BLAS_ENABLE_BENCHMARK`
-CMake option is activated (which is the case by default).
-
+### clBLAST Benchmarks
The CLBLAST benchmarks are compiled only if the `BUILD_CLBLAST_BENCHMARKS` CMake
option is activated. If so, if CLBlast cannot be found the build will fail. The
location of CLBlast can be given with `CLBLAST_ROOT`.
To install CLBlast, see: [CLBlast: Building and installing](
-https://github.com/CNugteren/CLBlast/blob/master/doc/installation.md))
+https://github.com/CNugteren/CLBlast/blob/master/doc/installation.md)
+
+### cuBLAS Benchmarks
+cuBLAS benchmarks can be built by enabling the `BUILD_CUBLAS_BENCHMARKS` option and
+require an installation of CUDA *(>=11.x)* and the cuBLAS library *(both can be obtained by
+installing the CUDA Toolkit:
+[installation guide](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html))*.
+### rocBLAS Benchmarks
+rocBLAS benchmarks can be built by enabling the `BUILD_ROCBLAS_BENCHMARKS` option
+and require an installation of ROCm *(>=4.5.x)* and the rocBLAS library *(please refer
+to this [installation guide](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/Linux_Install_Guide.html)
+for the setup)*.
+
+## General Notes
After the compilation, the binaries will be available:
-* in the build folder, in `benchmark/portblas/` and `benchmark/clblast/`
+* in the build folder, in `benchmark/[Library Name]/`
* if you provide an installation directory with the CMake variable
  `CMAKE_INSTALL_PREFIX`, and run the installation command, e.g `ninja install`,
  in your installation folder, in `portblas/bin/`

A verification of the results is enabled by default and can be disabled with the
CMake option `BLAS_VERIFY_BENCHMARK` set to `OFF` or `0`. The verification will
-be run a small number of times (more than once because of the way the benchmark
+be run a small number of times *(more than once because of the way the benchmark
library works, but much less than the usual number of iterations of the
-benchmarks). The verification requires that a reference implementation of BLAS
+benchmarks)*. The verification requires that a reference implementation of BLAS
like OpenBLAS is installed, which path can be given with the `CMAKE_PREFIX_PATH`
CMake parameter.

## How to run the benchmarks

The benchmarks take two kinds of command-line options: those for the benchmark
-library and those specific to the portBLAS projects.
-
-Essentially, the benchmarks can take a CSV configuration file (or will use
-defaults), and if your machine has more than one OpenCL device, you can specify
-which one to use. The other options specify how to output the results.
+library and those specific to portBLAS:
+- Benchmark library options: these specify, for instance, the output format and
+the verbosity level, and can filter specific benchmarks based on regex arguments.
+- portBLAS options: these select the target SYCL device *(portBLAS benchmarks
+only)* and pass a custom set of operator-specific configurations *(this also
+applies to rocBLAS and cuBLAS benchmarks)*. If no CSV file is specified, the
+benchmarks use default values *(found in `include/common/common_utils.hpp`)*.
+
+For portBLAS and clBLASt benchmarks, if the target machine has more than one
+OpenCL device, you can specify which one to use at runtime(\*). 
+ +(*): Preferably, for portBLAS benchmarks, this device has to match the +`TUNING_TARGET` as some operators are configured differently depending on the +SYCL target for optimal performance. The most useful options for us are: |option|parameter|description| |------|:-------:|-----------| | `--help` | | Show help message | -| `--device` | device name | Select a device to run on (e.g `intel:gpu`) | +| `--device` | device name | Select a device to run on (e.g `intel:gpu`), useful when targeting openCL devices with portBLAS or clBLASt benchmarks| | `--csv-param` | file path | Path to a CSV file with the benchmark parameters | | `--benchmark_format` | `console` / `json` / `csv` | Specify the format of the standard output | | `--benchmark_out` | file path | Specify a file where to write the report | @@ -64,21 +93,27 @@ The most useful options for us are: You can check the [GitHub repository](https://github.com/google/benchmark) of the library for information about the other supported command-line arguments. -Here is an example of an invocation of the GEMM benchmark running on Intel GPU, -displaying the results in the console and saving a json report: +Here is an example of an invocation of the portBLAS GEMM benchmark running on +Intel GPU, displaying the results in the console and saving a json report: ```bash -./bench_gemm --device=intel:gpu --csv-param=parameters.csv \ +./portblas/bench_gemm --device=intel:gpu --csv-param=parameters.csv \ --benchmark_out=../results.json --benchmark_out_format=json \ --benchmark_format=console ``` +Here is the same benchmark through cuBLAS: +```bash +./cublas/bench_gemm --csv-param=parameters.csv --benchmark_out=../results.json \ +--benchmark_out_format=json --benchmark_format=console +``` + ### CSV format The benchmarks can be given a CSV file containing the parameters to run with -(matrix/vector dimensions, transpose or not, etc), in the following format: one -line corresponds to one set of parameters, i.e. one name for the library (though -it will be iterated many times for statistical accuracy). +*(matrix/vector dimensions, transpose or not, etc)*, in the following format: one +line corresponds to one set of parameters, i.e. one name for the library *(though +it will be iterated many times for statistical accuracy)*. The formats for the different BLAS levels are: @@ -88,15 +123,34 @@ The formats for the different BLAS levels are: | blas 2 | *transpose_A,m,n,alpha,beta* | Action on the matrix (`n`, `t`, `c`), dimensions, and scalars alpha and beta | | blas 3 | | | | gemm | *transpose_A,transpose_B,m,k,n,alpha,beta* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B:kn, C: mn), and scalars alpha and beta | -| gemm (Batched) | *transpose_A,transpose_B,m,k,n,alpha,beta,batch_size* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B:kn, C: mn), scalars alpha and beta, batch size | +| gemm (Batched) | *transpose_A,transpose_B,m,k,n,alpha,beta,batch_size,batch_type* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B:kn, C: mn), scalars alpha and beta, batch size, batch_type | | trsm | *side,triangle,transpose,diagonal,m,n,alpha* | Position of A (`l`, `r`), A is upper or lower triangular (`u`, `l`), transposition of A (`n`, `t`), A is unit or non-unit diagonal(`u`,`n`),dimensions, scalar alpha | -Note: for operations that support a stride, the benchmarks will use a stride of -1 (contiguous values), except for the GEMM batched operation where valid default stride values are used depending on batch type *(strided or interleaved)*. 
For operations that support a leading dimension, the -benchmarks use the minimum possible value (the actual leading dimension of the -matrix). +### Notes: + +For operations that support an increment, the benchmarks will use an increment of +1 *(contiguous values)*, except for the GEMM batched operation where valid +default increment values are used depending on `batch_type` *(strided or +interleaved)*. For operations that support a leading dimension, the +benchmarks use the minimum possible value *(the actual leading dimension +of the matrix)*. -Here is an example of a valid CSV file for the GEMM benchmark: +For batched-strided operations that expect a stride(s), the benchmarking suite +expects **stride multipliers** instead of explicit stride values. For example, in +`gemm_batched_strided` operator, a `stride_a_mul=3` for matrix `A` of `size_a=m*k` +is equivalent to an actual stride of `stride_a = stride_a_mul * size_a`. + +For operations that support Complex data type *(and when cmake option +BLAS_ENABLE_COMPLEX is enabled)*, scalars such as `alpha` and `beta` are +expected to have real and imaginary parts as separate values, if only single +scalar values are passed in the csv configuration files, complex values +for these arguments will be constructed by duplicating the same real value +for both imaginary and real parts. (e.g. alpha_cplx={alpha_real, alpha_real}). + + +Here are two examples of a valid CSV files for the GEMM benchmark: + +- Scalar only alpha & beta values ``` n,n,42,42,42,1,0 @@ -104,8 +158,16 @@ n,t,64,128,64,0.5,0.5 t,n,13,3,7,0,0.7 ``` +- Complex alpha & beta values *(two scalars each)* +``` +n,n,42,42,42,1,2,3,0 +n,t,64,128,64,0.5,1,0.5,1 +t,n,13,3,7,0,1,0.7,3 +``` + + The folder `config_csv` provides a few files corresponding to sizes that are -relevant for neural networks, but you can use your own files, see the next +relevant for neural networks, but you can provide your own, see the next section for more info on how to generate them. ### Python tool to generate a CSV file @@ -262,6 +324,9 @@ following keys: * `cpu_time`: actual CPU time spent running the benchmark * `time_unit`: unit used for these times. Should be `ns`, if not please file an issue. +* `label`: Contains the name of the benchmark backend *(`@backend` : "portblas", + "cublas" or "rocblas")*, and target device related informations *(`device_name`, + `device_version`, `driver_version` etc..)* * `avg_event_time`: the average of the CL/SYCL event times in nanoseconds. This time depends on the events returned by the BLAS functions used and might not be accurate in some cases @@ -280,9 +345,10 @@ following keys: * `bytes_processed`: total number of bytes read and written in memory. It is calculated theoretically based on the operations that we think the benchmark is doing. -* a few benchmark parameters (e.g `m`, `n`, `k` for GEMM) +* other operator-specific parameters that affect the computations *(e.g `m`, `n`, +`k` for GEMM, but `alpha` is skipped as it doens't affect the operation directly.)* * some other keys from the benchmark library **Note:** to calculate the performance in Gflops, you can divide `n_fl_ops` by one -of the best or average time metrics, e.g `avg_overall_time` (the event and wall -time usually converge for large dimensions). +of the best or average time metrics, e.g `avg_overall_time` *(the event and wall +time usually converge for large dimensions)*. 
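For example, a minimal post-processing sketch can turn those report fields into Gflop/s figures. This is only an illustration: it assumes the standard Google Benchmark JSON layout *(a top-level `benchmarks` array)* and a report file named `results.json` as in the invocation examples above, and it relies on the `n_fl_ops` and `avg_overall_time` keys described in the list of report keys.

```python
#!/usr/bin/env python3
# Illustrative sketch: read a benchmark JSON report (e.g. results.json) and
# print Gflop/s per benchmark entry, computed as n_fl_ops / avg_overall_time.
# The counter names are the report keys described above; the top-level
# "benchmarks" array is assumed from the standard Google Benchmark JSON format.
import json
import sys


def print_gflops(report_path):
    with open(report_path) as f:
        report = json.load(f)
    for bench in report.get("benchmarks", []):
        n_fl_ops = bench.get("n_fl_ops")
        avg_ns = bench.get("avg_overall_time")  # reported in nanoseconds
        if not n_fl_ops or not avg_ns:
            # Skip entries that do not carry the portBLAS counters
            continue
        gflops = float(n_fl_ops) / float(avg_ns)  # flop / ns == Gflop/s
        print(f'{bench["name"]}: {gflops:.2f} Gflop/s')


if __name__ == "__main__":
    print_gflops(sys.argv[1] if len(sys.argv) > 1 else "results.json")
```

Since the times are reported in nanoseconds, dividing the flop count by the time gives Gflop/s directly, so no extra scaling factor is needed in the sketch.
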
From fbcf50f4c5320cc7cd6cbfbc3c42b769e09a8f9b Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI <104583441+OuadiElfarouki@users.noreply.github.com> Date: Wed, 3 Jan 2024 14:04:54 +0000 Subject: [PATCH 06/14] Update README.md Co-authored-by: Muhammad Tanvir <84532306+muhammad-tanvir-1211@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 58bc37346..9e87bd488 100644 --- a/README.md +++ b/README.md @@ -217,7 +217,7 @@ For all these operations: | `_asum` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Absolute sum of the vector `x`; written in `rs` if passed, else returned | | `_axpy` | `sb_handle`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy` | Vector multiply-add: `y = alpha * x + y` | | `_copy` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` | Copies a vector to another: `y = x` | -| `_dot` | `sb_handle`, `N`, `sb`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors `x` and `y`; added to `sb`` and written to `rs` if passed, else returned | +| `_dot` | `sb_handle`, `N`, `sb`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors `x` and `y`; added to `sb` and written to `rs` if passed, else returned | | `_sdsdot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors with double precision `x` and `y`; written in `rs` if passed, else returned | | `_nrm2` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Euclidean norm of the vector `x`; written in `rs` if passed, else returned | | `_rot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy`, `c`, `s` | Applies a plane rotation to `x` and `y` with a cosine `c` and a sine `s` | From 91717dc930407d7aee7d0af816641bd4ae4c0d80 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI <104583441+OuadiElfarouki@users.noreply.github.com> Date: Wed, 3 Jan 2024 14:05:11 +0000 Subject: [PATCH 07/14] Update benchmark/README.md Co-authored-by: Muhammad Tanvir <84532306+muhammad-tanvir-1211@users.noreply.github.com> --- benchmark/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmark/README.md b/benchmark/README.md index bfc1f9ec0..c8e564fa1 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -97,7 +97,7 @@ Here is an example of an invocation of the portBLAS GEMM benchmark running on Intel GPU, displaying the results in the console and saving a json report: ```bash -./portblas/bench_gemm --device=intel:gpu --csv-param=parameters.csv \ +./benchmark/portblas/bench_gemm --device=intel:gpu --csv-param=parameters.csv \ --benchmark_out=../results.json --benchmark_out_format=json \ --benchmark_format=console ``` From f03cb83c476cc8d4988c1052e656d0de90fb0e19 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI <104583441+OuadiElfarouki@users.noreply.github.com> Date: Wed, 3 Jan 2024 14:05:20 +0000 Subject: [PATCH 08/14] Update benchmark/README.md Co-authored-by: Muhammad Tanvir <84532306+muhammad-tanvir-1211@users.noreply.github.com> --- benchmark/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmark/README.md b/benchmark/README.md index c8e564fa1..27d4611db 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -104,7 +104,7 @@ Intel GPU, displaying the results in the console and saving a json report: Here is the same benchmark through cuBLAS: ```bash -./cublas/bench_gemm --csv-param=parameters.csv --benchmark_out=../results.json \ +./benchmark/cublas/bench_cublas_gemm --csv-param=parameters.csv --benchmark_out=../results.json \ --benchmark_out_format=json --benchmark_format=console 
``` From fb087c7e78bf6d67e278f25c5ad2ffd01491917f Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Wed, 3 Jan 2024 15:11:18 +0000 Subject: [PATCH 09/14] Addressed comments & other doc updates --- README.md | 67 +++++++++++++++++++++++++++++++++++++------- Roadmap.md | 30 -------------------- benchmark/README.md | 2 +- doc/AddingBlas3Op.md | 3 +- samples/README.md | 34 ++++++++++------------ 5 files changed, 75 insertions(+), 61 deletions(-) delete mode 100644 Roadmap.md diff --git a/README.md b/README.md index 9e87bd488..a3028e9f6 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -portBLAS Implementation +# portBLAS Implementation === [![Build and Test](https://github.com/codeplaysoftware/portBLAS/actions/workflows/build-and-test.yml/badge.svg?event=push)](https://github.com/codeplaysoftware/portBLAS/actions/workflows/build-and-test.yml) @@ -33,9 +33,12 @@ the project. - [Compile with DPC++](#compile-with-dpc) - [Compile with hipSYCL](#compile-with-hipsycl) - [Instaling portBLAS](#instaling-portBLAS) - - [POWER\_VR support](#power_vr-support-computecpp-only) - [Doxygen](#doxygen) - [CMake options](#cmake-options) + - [ComputeCpp Compilation](#computecpp-deprecated) + - [Compile with ComputeCpp](#compile-with-computecpp) + - [POWER\_VR support](#power_vr-support-computecpp-only) + - [Cross-Compile](#cross-compile-computecpp-only) - [Tests and benchmarks](#tests-and-benchmarks) - [Contributing to the project](#contributing-to-the-project) - [Guides and Other Documents](#guides-and-other-documents) @@ -421,19 +424,13 @@ ninja ``` To build for other than the default devices (`omp`), set the `HIPSYCL_TARGETS` environment variable or specify `-DHIPSYCL_TARGETS` as [documented](https://github.com/illuhad/hipSYCL/blob/develop/doc/using-hipsycl.md). -### Instaling portBLAS +### Installing portBLAS To install the portBLAS library (see `CMAKE_INSTALL_PREFIX` below) ```bash ninja install ``` -### POWER_VR support - -To enable the PowerVR target tuning, pass: `-DTUNING_TARGET=POWER_VR` - -To use the neural network library from Imagination, pass: `-DIMGDNN_DIR=path/to/library` - ### Doxygen Doxygen documentation can be generated by running: @@ -454,7 +451,7 @@ Some of the supported options are: |---|---|---| | `BLAS_ENABLE_TESTING` | `ON`/`OFF` | Set it to `OFF` to avoid building the tests (`ON` is the default value) | | `BLAS_ENABLE_BENCHMARK` | `ON`/`OFF` | Set it to `OFF` to avoid building the benchmarks (`ON` is the default value) | -| `SYCL_COMPILER` | name | Used to determine which SYCL implementation to use. By default, the first implementation found is used. Supported values are: `computecpp`, `dpcpp` and `hipsycl`. | +| `SYCL_COMPILER` | name | Used to determine which SYCL implementation to use. By default, the first implementation found is used. Supported values are: `dpcpp`, `hipsycl` and `computecpp`*(deprecated)*. | | `TUNING_TARGET` | name | By default, this flag is set to `DEFAULT_CPU` to restrict any device specific compiler optimizations. Use this flag to tune the code for a target (**highly recommended** for performance). The supported targets are: `INTEL_GPU`, `NVIDIA_GPU`, `AMD_GPU` | | `CMAKE_PREFIX_PATH` | path | List of paths to check when searching for dependencies | | `CMAKE_INSTALL_PREFIX` | path | Specify the install location, used when invoking `ninja install` | @@ -467,6 +464,56 @@ Some of the supported options are: | `BLAS_INDEX_TYPES` | `int32_t;int64_t` | Determines the type(s) to use for `index_t` and `increment_t`. 
Default is `int` | | `BLAS_ENABLE_COMPLEX` | `ON`/`OFF` | Determines whether to enable Complex data type support *(GEMM Operators only)* (`ON` by default) | +## ComputeCpp Compilation *(Deprecated)* + +portBLAS ComputeCpp compilation is deprecated since ComputeCpp releasing has been +discontinued. More information about this are found in this [announcement](https://codeplay.com/portal/news/2023/07/07/the-future-of-computecpp). + +### Compile with ComputeCpp + +```bash +cd build +cmake -GNinja ../ -DComputeCpp_DIR=/path/to/computecpp -DSYCL_COMPILER=computecpp +ninja +``` + +### Cross-Compile *(ComputeCpp Only)* + +To cross-compile portBLAS first the following environment variables must be +set: + +```bash +export COMPUTECPP_TOOLCHAIN_DIR="PATH TO TOOLCHAIN_DIR" +export COMPUTECPP_TARGET_TRIPLE="PATH TO TARGET_TRIPLE" +export COMPUTECPP_SYSROOT_DIR="$PATH TO SYSROOT_DIR" +``` + +Clone the [ComputeCpp-SDK](https://github.com/codeplaysoftware/computecpp-sdk) to retrieve the toolchain file. +The following CMake command can be used to cross-compile portBLAS: + +```bash +cmake -GNinja \ + ${SOURCE_ROOT} \ + -DCMAKE_PREFIX_PATH="${OPENBLAS_PATH}" \ + -DComputeCpp_DIR="${COMPUTECPP_DEVICE_PATH}" \ + -DComputeCpp_HOST_DIR="${COMPUTECPP_X86_PATH}" \ + -DCMAKE_TOOLCHAIN_FILE="/path/to/computecpp-sdk/cmake/toolchains/gcc-generic.cmake" \ + -DCMAKE_BUILD_TYPE='Release' \ + -DCMAKE_INSTALL_PREFIX=${CROSS_COMPILED_PORTBLAS_INSTALL} \ + -DOpenCL_INCLUDE_DIR="${OpenCL_Headers_PATH}" \ + -DOpenCL_LIBRARY="${OpenCL_LIBRARY}" \ + -DCOMPUTECPP_BITCODE="${DEVICE_BITCODE}" \ + -DCMAKE_CXX_FLAGS='-O3' \ + -DTUNING_TARGET="${CHOSEN_TARGET}" +``` + +### POWER_VR support *(ComputeCpp Only)* + +To enable the PowerVR target tuning, pass: `-DTUNING_TARGET=POWER_VR` + +To use the neural network library from Imagination, pass: `-DIMGDNN_DIR=path/to/library` + + ## Tests and benchmarks The tests and benchmarks have their own documentation: diff --git a/Roadmap.md b/Roadmap.md deleted file mode 100644 index 2a424abcc..000000000 --- a/Roadmap.md +++ /dev/null @@ -1,30 +0,0 @@ -Roadmap -==================== - - -Current Status: - -* BLAS LVL1 : 90% complete, only rotmg missing. -* BLAS LVL2 : Only two sample routines -* BLAS LVL3 : Only three variants of the Matrix Multiplication -* Examples : - * Interface tests for all implemented API entries - * CG example with multiple variants of the algorithm using kernel fusion. 
- -Short Term QX 2017: - -* Have a common class for the implementation of the different view classes -* Obtain and analyse performance bottlenecks -* Work on the Blas 2 interface - -Medium Term: - -* Complete the Blas 2 interface -* Work on the Blas 3 interface -* Move evaluation methods out of the expression tree into the execution tree - - -Long Term: - -* Complete the Blas 3 interface -* Add continuous integration testing diff --git a/benchmark/README.md b/benchmark/README.md index 27d4611db..105e2ebcc 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -80,7 +80,7 @@ The most useful options for us are: |option|parameter|description| |------|:-------:|-----------| | `--help` | | Show help message | -| `--device` | device name | Select a device to run on (e.g `intel:gpu`), useful when targeting openCL devices with portBLAS or clBLASt benchmarks| +| `--device` | device name | Select a device to run on (e.g `intel:gpu`), useful for portBLAS and clBLASt benchmarks| | `--csv-param` | file path | Path to a CSV file with the benchmark parameters | | `--benchmark_format` | `console` / `json` / `csv` | Specify the format of the standard output | | `--benchmark_out` | file path | Specify a file where to write the report | diff --git a/doc/AddingBlas3Op.md b/doc/AddingBlas3Op.md index d6d578cbc..9791d3f24 100644 --- a/doc/AddingBlas3Op.md +++ b/doc/AddingBlas3Op.md @@ -108,7 +108,8 @@ void run_test(const combination_t combi) { } // Create the combinations of parameters to invoke the test -const auto combi = ::testing::Combine(::testing::Values(7, 513, 1027), // m +const auto combi = ::testing::Combine(::testing::Values("usm", "buf"), // allocation type + ::testing::Values(7, 513, 1027), // m ::testing::Values(7, 513, 1027), // n ::testing::Values('n', 't'), // transA ::testing::Values('l', 'r'), // side diff --git a/samples/README.md b/samples/README.md index ec5717f84..412c8ee2b 100644 --- a/samples/README.md +++ b/samples/README.md @@ -2,30 +2,26 @@ portBLAS samples === ## How to compile the samples +A SYCL Compiler (DPCPP, hipSYCL or ComputeCpp) along with the target device's +relevant compute drivers *(OpenCL, CUDA etc..)* are required to compile and +run the samples. +Any project that integrates portBLAS can either use it as : + * a library *(install the library and include `portblas.h` in an application)* + * a header-only framework *(include `portblas.hpp` in an application)* -At the moment any project using portBLAS requires: - -* OpenCL -* [ComputeCpp](http://www.computecpp.com) -* portBLAS, either: - * as a library (install the library and include `portblas.h` in an application) - * as a header-only framework (include `portblas.hpp` in an application) - -### With CMake +### CMake Configuration This folder contains a basic CMake configuration file and a module to find -portBLAS (which will be used as a header-only framework). It also uses a module -to find ComputeCpp that is located in the folder `cmake/Modules`. - -Usage: +portBLAS *(which will be used as a header-only framework)*. It also uses a module +to find the SYCL Compiler(DPCPP, hipSYCL or computeCpp *-deprecated-*) that is +located in the folder `cmake/Modules`. -* set `ComputeCpp_DIR` to your ComputeCpp root path -* set `PORTBLAS_DIR` to your portBLAS root path +Sample usage with DPCPP Compiler: ```bash -mkdir build -cd build -cmake .. -GNinja -DComputeCpp_DIR=/path/to/computecpp \ - -DPORTBLAS_DIR=~/path/to/portblas +mkdir build && cd build +export CXX=/path/to/dpcpp/clang++ +export CC=/path/to/dpcpp/clang +cmake .. 
-GNinja -DPORTBLAS_DIR=~/path/to/portblas ninja ``` From d8d777358b50f20154b3d2bdaa28456f3443bd5f Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Fri, 5 Jan 2024 14:19:31 +0000 Subject: [PATCH 10/14] removed uncessary flag in Dockerfile --- Dockerfile | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Dockerfile b/Dockerfile index 4e26a9db7..e3eda6faa 100644 --- a/Dockerfile +++ b/Dockerfile @@ -48,7 +48,7 @@ CMD cd /portBLAS && \ if [ "${COMMAND}" = 'build-test' ]; then \ if [ "${SYCL_IMPL}" = 'DPCPP' ]; then \ export LD_LIBRARY_PATH="/tmp/dpcpp/lib" && mkdir -p build && cd build && \ - cmake .. -DBLAS_ENABLE_STATIC_LIBRARY=ON -DGEMM_TALL_SKINNY_SUPPORT=OFF \ + cmake .. -DGEMM_TALL_SKINNY_SUPPORT=OFF \ -DSYCL_COMPILER=dpcpp -DCMAKE_PREFIX_PATH=/tmp/OpenBLAS/build \ -DBLAS_ENABLE_CONST_INPUT=OFF -DCMAKE_BUILD_TYPE=Release && \ make -j$(nproc) && cd test && ctest -VV --timeout 1200; \ @@ -58,7 +58,7 @@ CMD cd /portBLAS && \ elif [ "${COMMAND}" = 'auto-tuner' ]; then \ export LD_LIBRARY_PATH="/tmp/dpcpp/lib" && mkdir -p tools/auto_tuner/build \ && cd tools/auto_tuner/build && \ - cmake .. -DBLAS_ENABLE_STATIC_LIBRARY=ON -DGEMM_TALL_SKINNY_SUPPORT=OFF \ + cmake .. -DGEMM_TALL_SKINNY_SUPPORT=OFF \ -DSYCL_COMPILER=dpcpp -DCMAKE_PREFIX_PATH=/tmp/OpenBLAS/build \ -DBLAS_ENABLE_CONST_INPUT=OFF -DCMAKE_BUILD_TYPE=Release && \ make -j$(nproc); \ From 0a9840593785d337145aaaa8d2cd19fc4b0033b7 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Mon, 8 Jan 2024 10:59:54 +0000 Subject: [PATCH 11/14] Additional changes/deletions of py_gen related files and documentation --- benchmark/gen_param.py | 4 +++- doc/AddingBlas3Op.md | 5 +++-- python_generator/py_gen_blas_binary.py | 0 python_generator/py_gen_blas_binary_special.py | 0 python_generator/py_gen_blas_gemm_launcher.py | 2 +- python_generator/py_gen_blas_ops.py | 2 +- python_generator/py_gen_blas_reduction.py | 2 +- python_generator/py_gen_blas_rotg_return.py | 0 python_generator/py_gen_blas_rotmg.py | 0 python_generator/py_gen_blas_ternary.py | 0 10 files changed, 9 insertions(+), 6 deletions(-) delete mode 100644 python_generator/py_gen_blas_binary.py delete mode 100644 python_generator/py_gen_blas_binary_special.py delete mode 100644 python_generator/py_gen_blas_rotg_return.py delete mode 100644 python_generator/py_gen_blas_rotmg.py delete mode 100644 python_generator/py_gen_blas_ternary.py diff --git a/benchmark/gen_param.py b/benchmark/gen_param.py index a8d4019bc..aaa6ebafa 100644 --- a/benchmark/gen_param.py +++ b/benchmark/gen_param.py @@ -1,5 +1,5 @@ #!/usr/bin/env python3 -#/*************************************************************************** +# /*************************************************************************** # * # * @license # * Copyright (C) Codeplay Software Limited @@ -32,12 +32,14 @@ import itertools import argparse + def main(args): """Generate the csv file according to the given arguments """ # Match DSL to Python names nd_range = itertools.product value_range = lambda *v: list(v) + def size_range(low, high, mult): val = low while val <= high: diff --git a/doc/AddingBlas3Op.md b/doc/AddingBlas3Op.md index 9791d3f24..147817025 100644 --- a/doc/AddingBlas3Op.md +++ b/doc/AddingBlas3Op.md @@ -178,8 +178,9 @@ binary is linked against portBLAS, the linker will find the definition of the mi To do this, we create the source file that will contain instantiations of the new `_trsm` operation. The file is located at `src/interface/blas3/trsm.cpp.in`. 
This is not the file that will be -compiled, but a template file that the python script `python_generator/py_gen_blas_binary.py` -will use to generate the actual source file where the instantiation of `_trsm` will happen. +compiled, but a template file that the python script `python_generator/py_gen_blas_ops.py` +will use to generate the actual source file where the instantiation of `_trsm` will happen. The call +to this generator is wrapped within the cmake function `generate_blas_objects`. The file `src/interface/blas3/trsm.cpp.in` must include all files that are necessary to successfully compile `blas::internal::_trsm`, for this particular example, this file looks like the following: diff --git a/python_generator/py_gen_blas_binary.py b/python_generator/py_gen_blas_binary.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/python_generator/py_gen_blas_binary_special.py b/python_generator/py_gen_blas_binary_special.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/python_generator/py_gen_blas_gemm_launcher.py b/python_generator/py_gen_blas_gemm_launcher.py index c34ab3849..3c4fb6a7e 100644 --- a/python_generator/py_gen_blas_gemm_launcher.py +++ b/python_generator/py_gen_blas_gemm_launcher.py @@ -1,4 +1,4 @@ -#/*************************************************************************** +# /*************************************************************************** # * # * @license # * Copyright (C) Codeplay Software Limited diff --git a/python_generator/py_gen_blas_ops.py b/python_generator/py_gen_blas_ops.py index 68f53eaac..b72c05c07 100644 --- a/python_generator/py_gen_blas_ops.py +++ b/python_generator/py_gen_blas_ops.py @@ -1,4 +1,4 @@ -#/*************************************************************************** +# /*************************************************************************** # * # * @license # * Copyright (C) Codeplay Software Limited diff --git a/python_generator/py_gen_blas_reduction.py b/python_generator/py_gen_blas_reduction.py index 5a80ee10d..be45e5884 100644 --- a/python_generator/py_gen_blas_reduction.py +++ b/python_generator/py_gen_blas_reduction.py @@ -1,4 +1,4 @@ -#/*************************************************************************** +# /*************************************************************************** # * # * @license # * Copyright (C) Codeplay Software Limited diff --git a/python_generator/py_gen_blas_rotg_return.py b/python_generator/py_gen_blas_rotg_return.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/python_generator/py_gen_blas_rotmg.py b/python_generator/py_gen_blas_rotmg.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/python_generator/py_gen_blas_ternary.py b/python_generator/py_gen_blas_ternary.py deleted file mode 100644 index e69de29bb..000000000 From 497abf6b67defea9c085c64d4a572c9393be2c88 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Mon, 8 Jan 2024 16:43:46 +0000 Subject: [PATCH 12/14] Addressed PR comments --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index a3028e9f6..f2af6fc2c 100644 --- a/README.md +++ b/README.md @@ -220,8 +220,8 @@ For all these operations: | `_asum` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Absolute sum of the vector `x`; written in `rs` if passed, else returned | | `_axpy` | `sb_handle`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy` | Vector multiply-add: `y = alpha * x + y` | | `_copy` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` | Copies a vector to 
another: `y = x` |
-| `_dot` | `sb_handle`, `N`, `sb`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors `x` and `y`; added to `sb` and written to `rs` if passed, else returned |
-| `_sdsdot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors with double precision `x` and `y`; written in `rs` if passed, else returned |
+| `_dot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors `x` and `y`; written to `rs` if passed, else returned |
+| `_sdsdot` | `sb_handle`, `N`, `sb`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Compute the sum of a constant `sb` and the double precision dot product of two single precision vectors `x` and `y`; written in `rs` if passed, else returned |
| `_nrm2` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Euclidean norm of the vector `x`; written in `rs` if passed, else returned |
| `_rot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy`, `c`, `s` | Applies a plane rotation to `x` and `y` with a cosine `c` and a sine `s` |
@@ -229,8 +229,8 @@ For all these operations:
| `_rotmg` | `sb_handle`, `d1`, `d2`, `x1`, `y1` `param` | Given the Cartesian coordinates (`x1`, `y1`) of a point, return the components of a modified Givens transformation matrix that zeros the y-component of the resulting point. |
| `_scal` | `sb_handle`, `N`, `alpha`, `vx`, `incx` | Scalar product of a vector: `x = alpha * x` |
| `_swap` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` | Interchanges two vectors: `y = x` and `x = y` |
-| `_iamax` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | First index and value of the maximum element of `x`; written in `rs` if passed, else the index only is returned |
-| `_iamin` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | First index and value of the minimum element of `x`; written in `rs` if passed, else the index only is returned |
+| `_iamax` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Index of the first occurrence of the maximum element in `x`; written to `rs` if passed, else returned. |
+| `_iamin` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Index of the first occurrence of the minimum element in `x`; written to `rs` if passed, else returned. |

### BLAS 2

From 99c52b15688ee217a6e0726c0f654945147bde1b Mon Sep 17 00:00:00 2001
From: Ouadie EL FAROUKI
Date: Wed, 10 Jan 2024 13:56:33 +0000
Subject: [PATCH 13/14] updated joint matrix doc link

---
 README.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index f2af6fc2c..3aa9006b5 100644
--- a/README.md
+++ b/README.md
@@ -351,7 +351,7 @@ Other non-official extension operators :
 ### Experimental Joint Matrix Support

 portBLAS now supports sub-group based collective GEMM operation using the experimental
-[`joint_matrix`](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) extension provided by DPC++. This support is only accessible for the latest
+[`joint_matrix`](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) extension provided by DPC++. This support is only accessible for the latest
 NVIDIA Ampere GPUs and beyond. 
The requirements for using this experimental support are: ```bash @@ -390,11 +390,12 @@ added to the `CMAKE_PREFIX_PATH` when building portBLAS (see **IMPORTANT NOTE:** The `TARGET` CMake variable is no longer supported. It has been replaced by `TUNING_TARGET`, which accepts the same options. -`TUNING_TARGET` affects only the tuning configuration, applicable for some operators such as GEMM, and has no effect on the target -triplet for DPC++ or the hipSYCL target. Please refer to the sections below for -setting them. +`TUNING_TARGET` affects only the tuning configuration, applicable for some operators such +as GEMM, and has no effect on the target triplet for DPC++ or the hipSYCL target. Please +refer to the sections below for setting them. -1. Clone the portBLAS repository, making sure to pass the `--recursive` option, in order to clone submodule(s). +1. Clone the portBLAS repository, making sure to pass the `--recursive` option, in order +to clone submodule(s). 2. Create a build directory 3. Run `CMake` from the build directory *(see options in the section below)*: From e926fcb301debe32923327d2c2e8c0605d023c94 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI <104583441+OuadiElfarouki@users.noreply.github.com> Date: Fri, 12 Jan 2024 11:19:40 +0000 Subject: [PATCH 14/14] Update README.md Co-authored-by: Muhammad Tanvir <84532306+muhammad-tanvir-1211@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 3aa9006b5..97476e994 100644 --- a/README.md +++ b/README.md @@ -35,7 +35,7 @@ the project. - [Instaling portBLAS](#instaling-portBLAS) - [Doxygen](#doxygen) - [CMake options](#cmake-options) - - [ComputeCpp Compilation](#computecpp-deprecated) + - [ComputeCpp Compilation (Deprecated)](#computecpp-deprecated) - [Compile with ComputeCpp](#compile-with-computecpp) - [POWER\_VR support](#power_vr-support-computecpp-only) - [Cross-Compile](#cross-compile-computecpp-only)