From 0c225e17c015f1caecc37e80674e8962b38fe491 Mon Sep 17 00:00:00 2001 From: Rod Burns <800194+rodburns@users.noreply.github.com> Date: Fri, 22 Dec 2023 10:55:28 +0000 Subject: [PATCH 01/14] Minor edits --- README.md | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index c5383b73f..3788c8793 100644 --- a/README.md +++ b/README.md @@ -3,13 +3,10 @@ portBLAS Implementation [![Build and Test](https://github.com/codeplaysoftware/portBLAS/actions/workflows/build-and-test.yml/badge.svg?event=push)](https://github.com/codeplaysoftware/portBLAS/actions/workflows/build-and-test.yml) -portBLAS implements BLAS - [Basic Linear Algebra Subroutines](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) - using [SYCL 1.2]( -https://www.khronos.org/registry/sycl/specs/sycl-1.2.pdf), the -[Khronos](http://www.khronos.org) abstraction layer for [OpenCL](https://www.khronos.org/opencl/). +portBLAS implements BLAS - [Basic Linear Algebra Subroutines](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) - using [SYCL](https://www.khronos.org/sycl/)). -portBLAS is a current work in progress research project from an ongoing -collaboration with the *High Performance Computing & Architectures (HPCA) group* -from the Universitat Jaume I [UJI](http://www.hpca.uji.es/). +portBLAS is an ongoing collaboration with the *High Performance Computing +& Architectures (HPCA) group* from the Universitat Jaume I [UJI](http://www.hpca.uji.es/). portBLAS is written using modern C++. The current implementation uses C++11 features. From be90e3d2b8c4a1b2b41df11b62e109c38f5b39e0 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Wed, 27 Dec 2023 15:02:32 +0000 Subject: [PATCH 02/14] Added remaining operators to Doc & other Readme fixes --- README.md | 133 ++++++++++++++++++++---------------------------------- 1 file changed, 50 insertions(+), 83 deletions(-) diff --git a/README.md b/README.md index 3788c8793..e8c17bc51 100644 --- a/README.md +++ b/README.md @@ -3,7 +3,7 @@ portBLAS Implementation [![Build and Test](https://github.com/codeplaysoftware/portBLAS/actions/workflows/build-and-test.yml/badge.svg?event=push)](https://github.com/codeplaysoftware/portBLAS/actions/workflows/build-and-test.yml) -portBLAS implements BLAS - [Basic Linear Algebra Subroutines](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) - using [SYCL](https://www.khronos.org/sycl/)). +portBLAS implements BLAS - [Basic Linear Algebra Subroutines](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) - using [SYCL](https://www.khronos.org/sycl/). portBLAS is an ongoing collaboration with the *High Performance Computing & Architectures (HPCA) group* from the Universitat Jaume I [UJI](http://www.hpca.uji.es/). @@ -30,14 +30,11 @@ the project. 
- [Experimental Joint Matrix Support](#jm_support) - [Requirements](#requirements) - [Setup](#setup) - - [Compile with ComputeCpp](#compile-with-computecpp) - [Compile with DPC++](#compile-with-dpc) - [Compile with hipSYCL](#compile-with-hipsycl) - [Instaling portBLAS](#instaling-portBLAS) - - [POWER\_VR support (ComputeCpp Only)](#power_vr-support-computecpp-only) - [Doxygen](#doxygen) - [CMake options](#cmake-options) - - [Cross-Compile (ComputeCpp Only)](#cross-compile-computecpp-only) - [Tests and benchmarks](#tests-and-benchmarks) - [Contributing to the project](#contributing-to-the-project) - [Guides and Other Documents](#guides-and-other-documents) @@ -204,30 +201,32 @@ The following table sums up the interface that can be found in For all these operations: * `vx` and `vy` are containers for vectors `x` and `y`. -* `incx` and `incy` are their increments (number of steps to jump to the next - value, 1 for contiguous values). -* `N`, an integer, is the size of the vectors (less than or equal to the size of - the containers). +* `incx` and `incy` are their increments *(number of steps to jump to the next + value, 1 for contiguous values)*. +* `N`, an integer, is the size of the vectors *(less than or equal to the size of + the containers)*. * `alpha` is a scalar. * `rs` is a container of size 1, containing either a scalar, an integer, or an index-value tuple. -* `c` and `s` for `_rot` are scalars (cosine and sine) +* `c` and `s` for `_rot` are scalars *(cosine and sine)*. +* `sb` for `_sdsdot` is a single precision scalar to be added to output. | operation | arguments | description | |-----------|-------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `_asum` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Absolute sum of the vector `x`; written in `rs` if passed, else returned | | `_axpy` | `sb_handle`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy` | Vector multiply-add: `y = alpha * x + y` | | `_copy` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` | Copies a vector to another: `y = x` | -| `_dot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors `x` and `y`; written in `rs` if passed, else returned | -| `_asum` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Absolute sum of the vector `x`; written in `rs` if passed, else returned | -| `_iamax` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | First index and value of the maximum element of `x`; written in `rs` if passed, else the index only is returned | -| `_iamin` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | First index and value of the minimum element of `x`; written in `rs` if passed, else the index only is returned | -| `_swap` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` | Interchanges two vectors: `y = x` and `x = y` | -| `_scal` | `sb_handle`, `N`, `alpha`, `vx`, `incx` | Scalar product of a vector: `x = alpha * x` | +| `_dot` | `sb_handle`, `N`, `sb`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors `x` and `y`; added to `sb`` and written to `rs` if passed, else returned | +| `_sdsdot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors with double precision `x` and `y`; written in `rs` if passed, else returned | | `_nrm2` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Euclidean norm of the vector `x`; written in `rs` if passed, else returned | | `_rot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, 
`incy`, `c`, `s` | Applies a plane rotation to `x` and `y` with a cosine `c` and a sine `s` | | `_rotg` | `sb_handle`, `a`, `b`, `c`, `s` | Given the Cartesian coordinates (`a`, `b`) of a point, return the parameters `c`, `s`, `r`, and `z` associated with the Givens rotation. | | `_rotm` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy`, `param` | Applies a modified Givens rotation to `x` and `y`. | | `_rotmg` | `sb_handle`, `d1`, `d2`, `x1`, `y1` `param` | Given the Cartesian coordinates (`x1`, `y1`) of a point, return the components of a modified Givens transformation matrix that zeros the y-component of the resulting point. | +| `_scal` | `sb_handle`, `N`, `alpha`, `vx`, `incx` | Scalar product of a vector: `x = alpha * x` | +| `_swap` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` | Interchanges two vectors: `y = x` and `x = y` | +| `_iamax` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | First index and value of the maximum element of `x`; written in `rs` if passed, else the index only is returned | +| `_iamin` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | First index and value of the minimum element of `x`; written in `rs` if passed, else the index only is returned | ### BLAS 2 @@ -253,19 +252,26 @@ For all these operations: its neighbor in the next column and same row. `lda` must be at least `M`. * `vx` and `vy` are containers for vectors `x` and `y`. * `incx` and `incy` are their increments (cf BLAS 1). +* `K` Number of sub/super-diagonals of the matrix. | operation | arguments | description | |---|---|---| -| `_gemv` | `sb_handle`, `trans`, `M`, `N`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Generalised matrix-vector product followed by a vector sum: `y = alpha * A * x + beta * y`. *Note: the dimensions of the vectors depend on the transpose mode (`x`: `N` and `y`: `M` for mode `'n'` ; `x`: `M` and `y`: `N` otherwise)* | | `_gbmv` | `sb_handle`, `trans`, `M`, `N`, `KL`, `KU`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Generalised band matrix-vector product followed by a vector sum: `y = alpha * A * x + beta * y`. *Note: the dimensions of the vectors depend on the transpose mode (`x`: `N` and `y`: `M` for mode `'n'` ; `x`: `M` and `y`: `N` otherwise)* | -| `_trmv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `alpha`, `mA`, `lda`, `vx`, `incx` | Matrix-vector product for a triangular matrix: `x = A * x` | -| `_symv` | `sb_handle`, `uplo`, `N`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Variant of GEMV for a symmetric matrix (`y = alpha * A * x + beta * y`). *Note: `uplo` specifies which side of the matrix will be read* | +| `_gemv` | `sb_handle`, `trans`, `M`, `N`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Generalised matrix-vector product followed by a vector sum: `y = alpha * A * x + beta * y`. 
*Note: the dimensions of the vectors depend on the transpose mode (`x`: `N` and `y`: `M` for mode `'n'` ; `x`: `M` and `y`: `N` otherwise)* |
| `_ger` | `sb_handle`, `M`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy`, `mA`, `lda` | Generalised vector-vector product followed by a matrix sum: `A = alpha * x * yT + A` |
+| `_sbmv` | `sb_handle`, `uplo`, `N`, `K`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Compute a scalar-matrix-vector product and add the result to a scalar-vector product, with a symmetric band matrix: `y = alpha * mA * x + beta * y` |
+| `_spmv` | `sb_handle`, `uplo`, `N`, `alpha`, `mA`, `vx`, `incx`, `beta`, `vy`, `incy` | Symmetric packed matrix-vector product: `y = alpha * A * x + beta * y` |
+| `_spr` | `sb_handle`, `uplo`, `N`, `alpha`, `vx`, `incx`, `mPA` | Symmetric vector-vector product followed by a matrix sum: `mPA = alpha * x * xT + mPA` |
+| `_spr2` | `sb_handle`, `uplo`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy`, `mPA` | Compute two scalar-vector-vector products and add them to a symmetric packed matrix: `mPA = alpha * x * yT + alpha * y * xT + mPA` |
+| `_symv` | `sb_handle`, `uplo`, `N`, `alpha`, `mA`, `lda`, `vx`, `incx`, `beta`, `vy`, `incy` | Variant of GEMV for a symmetric matrix (`y = alpha * A * x + beta * y`). *Note: `uplo` specifies which side of the matrix will be read* |
| `_syr` | `sb_handle`, `uplo`, `N`, `alpha`, `vx`, `incx`, `mA`, `lda` | Generalised vector squaring followed by a sum with a symmetric matrix: `A = alpha * x * xT + A` |
| `_syr2` | `sb_handle`, `uplo`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy`, `mA`, `lda` | Generalised vector products followed by a sum with a symmetric matrix: `A = alpha*x*yT + alpha*y*xT + A` |
-| `_spr` | `sb_handle`, `uplo`, `N`, `alpha`, `vx`, `incx`, `mPA` | Symmetric vector-vector product followed by a matrix sum: `mPA = alpha * x * xT + mPA` |
-| `_spmv` | `sb_handle`, `uplo`, `N`, `alpha`, `mA`, `vx`, `incx`, `beta`, `vy`, `incy` | Symmetric packed matrix-vector product: `y = alpha * A * x + beta * y` |
-| `_tpmv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `mA`, `vx`, `incx` | Triangular packed matrix-vector product: `x = A * x` |
+| `_tbmv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `K`, `mA`, `lda`, `vx`, `incx` | Compute a matrix-vector product with a triangular band matrix: `x = A * x` |
+| `_tbsv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `K`, `mA`, `lda`, `vx`, `incx` | Solve a system of linear equations whose coefficients are in a triangular band matrix: `A * x = b` |
+| `_tpmv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `mA`, `vx`, `incx` | Triangular packed matrix-vector product: `x = A * x` |
+| `_tpsv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `mA`, `vx`, `incx` | Solve a system of linear equations whose coefficients are in a triangular packed matrix: `A * x = b` |
+| `_trmv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `alpha`, `mA`, `lda`, `vx`, `incx` | Matrix-vector product for a triangular matrix: `x = A * x` |
+| `_trsv` | `sb_handle`, `uplo`, `trans`, `diag`, `N`, `mA`, `lda`, `vx`, `incx` | Solve a system of linear equations whose coefficients are in a triangular matrix: `A * x = b` |

### BLAS 3

@@ -274,7 +280,7 @@ The following table sums up the interface that can be found in

For all these operations:

-* `A`, `B` and `C` are containers for the column-major matrices A, B and C.
+* `mA`, `mB` and `mC` are containers for the column-major matrices A, B and C.
* `lda`, `ldb` and `ldc` are the leading dimensions of the matrices A, B and C (cf BLAS 2). 
The leading dimension of a matrix must be greater than or equal to its number of rows. @@ -288,15 +294,19 @@ For all these operations: * `uplo` is a `char` that provides information about triangular matrices: `u` for upper triangular and `l` for lower triangular matrices. * `diag` is a `char` that provides information about the diagonal elements of a - triangular matrix: `u` if the matrix is unit triangular (all diagonal elements - are 1), else `n`. + triangular matrix: `u` if the matrix is unit triangular *(all diagonal elements + are 1)*, else `n`. +* `stride_a`, `stride_b` and `stride_c` are the striding size between consecutive +matrices in a batched-strided entry for the inputs/outputs. +* `batch_type` for `_gemm_batched` is either `strided` *(by default)* or `interleaved`*(More details about it here : [Gemm.md](doc/Gemm.md))*. | operation | arguments | description | |---|---|---| -| `_gemm` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `A`, `lda`, `B`, `ldb`, `beta`, `C`, `ldc` | Generalised matrix-matrix multiplication followed by matrix addition: `C = alpha * A * B + beta * C` | -| `_gemm_batched` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `A`, `lda`, `B`, `ldb`, `beta`, `C`, `ldc`, `batch_size`, `batch_type` | Same as `_gemm` but the containers contain `batch_size` end-to-end matrices. GEMM operations are performed independently with matching matrices. | -| `_gemm_strided_batched` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `A`, `lda`, `stridea`, `B`, `ldb`, `strideb`, `beta`, `C`, `ldc`, `stridec`, `batch_size` | Same as `_gemm` but the containers contain `batch_size` end-to-end matrices. GEMM operations are performed independently with matching matrices. -| `_trsm` | `sb_handle`, `side`, `uplo`, `trans`, `diag`, `M`, `N`, `alpha`, `A`, `lda`, `B`, `ldb` | Triangular solve with Multiple Right-Hand Sides. | +| `_gemm` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `mA`, `lda`, `mB`, `ldb`, `beta`, `mC`, `ldc` | Generalised matrix-matrix multiplication followed by matrix addition: `C = alpha * A * B + beta * C` | +| `_gemm_batched` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `mA`, `lda`, `mB`, `ldb`, `beta`, `mC`, `ldc`, `batch_size`, `batch_type` | Same as `_gemm` but the containers contain `batch_size` end-to-end matrices. GEMM operations are performed independently with matching matrices. | +| `_gemm_strided_batched` | `sb_handle`, `transa`, `transb`, `M`, `N`, `K`, `alpha`, `mA`, `lda`, `stride_a`, `mB`, `ldb`, `stride_b`, `beta`, `mC`, `ldc`, `stride_c`, `batch_size` | Same as `_gemm` but the containers contain `batch_size` end-to-end matrices. GEMM operations are performed independently with matching matrices. +| `_symm` | `sb_handle`, `side` , `uplo` , `M`, `N`, `alpha`, `mA`, `lda`, `mB`, `ldb`, `beta`, `mC`, `ldc`| Compute a scalar-matrix-matrix product and add the result to a scalar-matrix product, where one of the matrices in the multiplication is symmetric. | +| `_trsm` | `sb_handle`, `side`, `uplo`, `trans`, `diag`, `M`, `N`, `alpha`, `mA`, `lda`, `mB`, `ldb` | Triangular solve with Multiple Right-Hand Sides. | ### EXTENSION @@ -311,17 +321,17 @@ For all these operations: to its number of rows. In the case of in-place copy/transpose, the same matrix `A` is used with two different leading dimensions for input & output. * `stride_a`, `stride_b` and `stride_c` are the striding size between consecutive -matrices in a batched entry for inputs/outputs A, B and C. 
-* `inc_a` and `inc_b` are the distance between consecutive elements in A & B matrices. +matrices in a batched entry for the inputs/outputs. +* `inc_a` and `inc_b` are the jump-count between consecutive elements in A & B matrices. * `transa` and `transb` are the transpose modes of the matrices A and B (cf BLAS 2). -* `M` and `N` are the dimensions of the matrices. +* `M` and `N` are the dimensions of the matrices (Rows and Columns respectively). * `alpha` and `beta` are scaling scalars. * `batch_size` is the number of batch matrices. -* `inc_a` and `inc_b` are integers. The distance between element in the same column. | operation | arguments | description | |---|---|---| +| `_axpy_batch` | `sb_handle`, `N`, `alpha`, `vx`, `incx`, `stride_x`, `vy`, `incy`, `stride_y`, `batch_size` | Perform multiple axpy operators in batch | | `_omatcopy` | `sb_handle`, `transa`, `M`, `N`, `alpha`, `A`, `lda`, `B`, `ldb` | Perform an out-of-place scaled matrix transpose or copy operation using a general dense matrix. | | `_omatcopy2`| `sb_handle`, `transa`, `M`, `N`, `alpha`, `A`, `lda`, `inc_a`, `B`, `ldb`, `inc_b` | Computes two-strided scaling and out-of-place transposition or copying of general dense matrices. | | `_omatadd`| `sb_handle`, `transa`, `transb`, `M`, `N`, `alpha`, `A`, `lda`, `beta`, `B`, `ldb`, `C`,`ldc` | Computes scaled general dense matrix addition with possibly transposed arguments. | @@ -352,12 +362,12 @@ The user should expect erroneous behaviour from the code if both of these requir ## Requirements -portBLAS is designed to work with any SYCL 1.2.1 implementation. +portBLAS is designed to work with any SYCL implementation. We do not use any OpenCL interoperability, hence, the code is pure C++. -The project is developed using [ComputeCpp CE Edition](http://www.computecpp.com) -using Ubuntu 16.04 on Intel OpenCL CPU and Intel GPU. -In order to build the sources, GCC 5.4 or higher is required. -The build system is CMake version 3.4.2 or higher. +The project is developed using [DPCPP open source](https://github.com/intel/llvm) +or [oneapi release](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html#gs.2iaved), +using Ubuntu 22.04 on Intel OpenCL CPU, Intel GPU, NVIDIA GPU and AMD GPU. +The build system is CMake version 3.4.3 or higher. A BLAS library, such as [OpenBLAS](https://github.com/xianyi/OpenBLAS), is also required to build and verify the test results. @@ -376,20 +386,13 @@ added to the `CMAKE_PREFIX_PATH` when building portBLAS (see **IMPORTANT NOTE:** The `TARGET` CMake variable is no longer supported. It has been replaced by `TUNING_TARGET`, which accepts the same options. -`TUNING_TARGET` affects only the tuning configuration and has no effect on the target +`TUNING_TARGET` affects only the tuning configuration, applicable for some operators such as GEMM, and has no effect on the target triplet for DPC++ or the hipSYCL target. Please refer to the sections below for setting them. 1. Clone the portBLAS repository, making sure to pass the `--recursive` option, in order to clone submodule(s). 2. Create a build directory -3. Run `CMake` from the build directory (see options in the section below): - -### Compile with ComputeCpp -```bash -cd build -cmake -GNinja ../ -DComputeCpp_DIR=/path/to/computecpp -DSYCL_COMPILER=computecpp -ninja -``` +3. 
Run `CMake` from the build directory *(see options in the section below)*: ### Compile with DPC++ ```bash @@ -423,11 +426,6 @@ To install the portBLAS library (see `CMAKE_INSTALL_PREFIX` below) ```bash ninja install ``` -### POWER_VR support (ComputeCpp Only) - -To enable the PowerVR target tuning, pass: `-DTUNING_TARGET=POWER_VR` - -To use the neural network library from Imagination, pass: `-DIMGDNN_DIR=path/to/library` ### Doxygen @@ -460,38 +458,7 @@ Some of the supported options are: | `BLAS_ENABLE_EXTENSIONS` | `ON`/`OFF` | Determines whether to enable portBLAS extensions (`ON` by default) | | `BLAS_DATA_TYPES` | `half;float;double` | Determines the floating-point types to instantiate BLAS operations for. Default is `float` | | `BLAS_INDEX_TYPES` | `int32_t;int64_t` | Determines the type(s) to use for `index_t` and `increment_t`. Default is `int` | -| `BLAS_ENABLE_COMPLEX` | `ON`/`OFF` | Determines whether to enable Complex data type support *(GEMM Kernels only)* (`ON` by default) | - -### Cross-Compile (ComputeCpp Only) - -To cross-compile portBLAS first the following environment variables must be -set: - -```bash -export COMPUTECPP_TOOLCHAIN_DIR="PATH TO TOOLCHAIN_DIR" -export COMPUTECPP_TARGET_TRIPLE="PATH TO TARGET_TRIPLE" -export COMPUTECPP_SYSROOT_DIR="$PATH TO SYSROOT_DIR" -``` - -Clone the [ComputeCpp-SDK](https://github.com/codeplaysoftware/computecpp-sdk) to retrieve the toolchain file. -The following CMake command can be used to cross-compile portBLAS: - -```bash -cmake -GNinja \ - ${SOURCE_ROOT} \ - -DCMAKE_PREFIX_PATH="${OPENBLAS_PATH}" \ - -DComputeCpp_DIR="${COMPUTECPP_DEVICE_PATH}" \ - -DComputeCpp_HOST_DIR="${COMPUTECPP_X86_PATH}" \ - -DCMAKE_TOOLCHAIN_FILE="/path/to/computecpp-sdk/cmake/toolchains/gcc-generic.cmake" \ - -DCMAKE_BUILD_TYPE='Release' \ - -DCMAKE_INSTALL_PREFIX=${CROSS_COMPILED_PORTBLAS_INSTALL} \ - -DOpenCL_INCLUDE_DIR="${OpenCL_Headers_PATH}" \ - -DOpenCL_LIBRARY="${OpenCL_LIBRARY}" \ - -DCOMPUTECPP_BITCODE="${DEVICE_BITCODE}" \ - -DCMAKE_CXX_FLAGS='-O3' \ - -DTUNING_TARGET="${CHOSEN_TARGET}" -``` - +| `BLAS_ENABLE_COMPLEX` | `ON`/`OFF` | Determines whether to enable Complex data type support *(GEMM Operators only)* (`ON` by default) | ## Tests and benchmarks From b46de5f59e5341a207cc1d7596b3938a1dee709c Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Wed, 27 Dec 2023 15:03:19 +0000 Subject: [PATCH 03/14] Fixes and additions to Blas3Op doc file --- doc/AddingBlas3Op.md | 33 ++++++++++++++++++++++++++------- 1 file changed, 26 insertions(+), 7 deletions(-) diff --git a/doc/AddingBlas3Op.md b/doc/AddingBlas3Op.md index 51d40ab7c..d6d578cbc 100644 --- a/doc/AddingBlas3Op.md +++ b/doc/AddingBlas3Op.md @@ -35,7 +35,7 @@ typename sb_handle_t::event_t _trsm(sb_handle_t& sb_handle, char Side, container_0_t A, index_t lda, container_1_t B, index_t ldb); -} +} // namespace internal // User-facing function to call the TRSM operation template combi) { auto q = make_queue(); SB_Handle sb_handle(q); + // + // Perform any host-to-device copies here + // + // Invoke the newly added operation - _trsm(sb_handle, side, triangle, transA, diag, m, n, alpha, a_gpu, lda, b_gpu, ldb); + _trsm(sb_handle, side, triangle, transA, diag, m, n, alpha, a_gpu, lda, b_gpu, ldb, dependencies); // Verify the results } @@ -143,7 +146,8 @@ template A, ${INDEX_TYPE} lda, - BufferIterator<${DATA_TYPE}> B, ${INDEX_TYPE} ldb); + BufferIterator<${DATA_TYPE}> B, ${INDEX_TYPE} ldb, + const typename SB_Handle::event_t& _dependencies); +// USM Declarations +#ifdef SB_ENABLE_USM +template 
typename SB_Handle::event_t _trsm( + SB_Handle& sb_handle, char side, char uplo, char trans, char diag, + ${INDEX_TYPE} M, ${INDEX_TYPE} N, ${DATA_TYPE} alpha, ${DATA_TYPE} * A, + ${INDEX_TYPE} lda, ${DATA_TYPE} * B, ${INDEX_TYPE} ldb, + const typename SB_Handle::event_t& _dependencies); + +template typename SB_Handle::event_t _trsm( + SB_Handle& sb_handle, char side, char uplo, char trans, char diag, + ${INDEX_TYPE} M, ${INDEX_TYPE} N, ${DATA_TYPE} alpha, + const ${DATA_TYPE} * A, ${INDEX_TYPE} lda, ${DATA_TYPE} * B, + ${INDEX_TYPE} ldb, const typename SB_Handle::event_t& _dependencies); +#endif } // namespace internal } // namespace blas From eee89a069b260076d2f0ac45729d89c43a70fdd9 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Mon, 1 Jan 2024 13:25:19 +0000 Subject: [PATCH 04/14] Recovered powerVR mention in readme --- README.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/README.md b/README.md index e8c17bc51..58bc37346 100644 --- a/README.md +++ b/README.md @@ -33,6 +33,7 @@ the project. - [Compile with DPC++](#compile-with-dpc) - [Compile with hipSYCL](#compile-with-hipsycl) - [Instaling portBLAS](#instaling-portBLAS) + - [POWER\_VR support](#power_vr-support-computecpp-only) - [Doxygen](#doxygen) - [CMake options](#cmake-options) - [Tests and benchmarks](#tests-and-benchmarks) @@ -427,6 +428,12 @@ To install the portBLAS library (see `CMAKE_INSTALL_PREFIX` below) ninja install ``` +### POWER_VR support + +To enable the PowerVR target tuning, pass: `-DTUNING_TARGET=POWER_VR` + +To use the neural network library from Imagination, pass: `-DIMGDNN_DIR=path/to/library` + ### Doxygen Doxygen documentation can be generated by running: From 5f3240fb369a1e66c3b2000d94799e0c1ecc1887 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Tue, 2 Jan 2024 15:40:05 +0000 Subject: [PATCH 05/14] Updated Benchmark Doc to include cuBLAS, rocBLAS and complex data updates --- benchmark/README.md | 130 +++++++++++++++++++++++++++++++++----------- 1 file changed, 98 insertions(+), 32 deletions(-) diff --git a/benchmark/README.md b/benchmark/README.md index 0089ba9d2..bfc1f9ec0 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -5,53 +5,82 @@ Benchmarks The portBLAS benchmarks are intended to measure the evolution of the performance of this BLAS implementation and how it compares with other tuned -implementations, such as [CLBLAST](https://github.com/CNugteren/CLBlast) -(a very performant OpenCL BLAS library). +implementations, such as [CUBLAS](https://docs.nvidia.com/cuda/cublas/index.html) native and optimized CUDA BLAS library for Nvidia GPUs, +[ROCBLAS](https://github.com/ROCm/rocBLAS) which is the native HIP/Rocm BLAS +library optimized for AMD GPUs, and [CLBLAST](https://github.com/CNugteren/CLBlast), +a very performant OpenCL BLAS library for CPU targets. The benchmarks use Google's [benchmark](https://github.com/google/benchmark) -library and generate a report with indicative metrics (see instructions below). +library and generate a report with indicative metrics *(see instructions below)*. ## How to compile the benchmarks +### portBLAS Benchmarks +The portBLAS default benchmarks are compiled with the project if the +`BLAS_ENABLE_BENCHMARK` CMake option is activated *(which is the case by +default)*. For the results to be relevant, portBLAS needs to be built in Release +mode *(Passing `-DCMAKE_BUILD_TYPE=Release` to `cmake` command)*. 
-The benchmarks are compiled with the project if the `BLAS_ENABLE_BENCHMARK`
-CMake option is activated (which is the case by default).
-
+### clBLAST Benchmarks
The CLBLAST benchmarks are compiled only if the `BUILD_CLBLAST_BENCHMARKS` CMake
option is activated. If so, if CLBlast cannot be found the build will fail. The
location of CLBlast can be given with `CLBLAST_ROOT`.
To install CLBlast, see: [CLBlast: Building and installing](
-https://github.com/CNugteren/CLBlast/blob/master/doc/installation.md))
+https://github.com/CNugteren/CLBlast/blob/master/doc/installation.md)
+
+### cuBLAS Benchmarks
+cuBLAS benchmarks can be built by enabling the `BUILD_CUBLAS_BENCHMARKS` option and
+require an installation of CUDA *(>=11.x)* and the cuBLAS library *(both can be obtained by
+installing the CUDA Toolkit:
+[installation guide](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html))*.
+### rocBLAS Benchmarks
+rocBLAS benchmarks can be built by enabling the `BUILD_ROCBLAS_BENCHMARKS` option
+and require an installation of ROCm *(>=4.5.x)* and the rocBLAS library *(please refer
+to this [installation guide](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/Linux_Install_Guide.html)
+for the setup)*.
+
+## General Notes
After the compilation, the binaries will be available:
-* in the build folder, in `benchmark/portblas/` and `benchmark/clblast/`
+* in the build folder, in `benchmark/[Library Name]/`
* if you provide an installation directory with the CMake variable
  `CMAKE_INSTALL_PREFIX`, and run the installation command, e.g `ninja install`,
  in your installation folder, in `portblas/bin/`

A verification of the results is enabled by default and can be disabled with the
CMake option `BLAS_VERIFY_BENCHMARK` set to `OFF` or `0`. The verification will
-be run a small number of times (more than once because of the way the benchmark
+be run a small number of times *(more than once because of the way the benchmark
library works, but much less than the usual number of iterations of the
-benchmarks). The verification requires that a reference implementation of BLAS
+benchmarks)*. The verification requires that a reference implementation of BLAS
like OpenBLAS is installed, which path can be given with the `CMAKE_PREFIX_PATH`
CMake parameter.

## How to run the benchmarks

The benchmarks take two kinds of command-line options: those for the benchmark
-library and those specific to the portBLAS projects.
-
-Essentially, the benchmarks can take a CSV configuration file (or will use
-defaults), and if your machine has more than one OpenCL device, you can specify
-which one to use. The other options specify how to output the results.
+library and those specific to portBLAS:
+- Benchmark library options: these specify, for instance, the output format and
+the verbosity level, and can filter specific benchmarks based on regex arguments.
+- portBLAS options: these select the target SYCL device *(portBLAS benchmarks
+only)* and pass a custom set of operator-specific configurations *(this also
+applies to rocBLAS and cuBLAS benchmarks)*. If no CSV file is specified, the
+benchmarks use default values *(found in `include/common/common_utils.hpp`)*.
+
+For portBLAS and clBLASt benchmarks, if the target machine has more than one
+OpenCL device, you can specify which one to use at runtime(\*). 
+ +(*): Preferably, for portBLAS benchmarks, this device has to match the +`TUNING_TARGET` as some operators are configured differently depending on the +SYCL target for optimal performance. The most useful options for us are: |option|parameter|description| |------|:-------:|-----------| | `--help` | | Show help message | -| `--device` | device name | Select a device to run on (e.g `intel:gpu`) | +| `--device` | device name | Select a device to run on (e.g `intel:gpu`), useful when targeting openCL devices with portBLAS or clBLASt benchmarks| | `--csv-param` | file path | Path to a CSV file with the benchmark parameters | | `--benchmark_format` | `console` / `json` / `csv` | Specify the format of the standard output | | `--benchmark_out` | file path | Specify a file where to write the report | @@ -64,21 +93,27 @@ The most useful options for us are: You can check the [GitHub repository](https://github.com/google/benchmark) of the library for information about the other supported command-line arguments. -Here is an example of an invocation of the GEMM benchmark running on Intel GPU, -displaying the results in the console and saving a json report: +Here is an example of an invocation of the portBLAS GEMM benchmark running on +Intel GPU, displaying the results in the console and saving a json report: ```bash -./bench_gemm --device=intel:gpu --csv-param=parameters.csv \ +./portblas/bench_gemm --device=intel:gpu --csv-param=parameters.csv \ --benchmark_out=../results.json --benchmark_out_format=json \ --benchmark_format=console ``` +Here is the same benchmark through cuBLAS: +```bash +./cublas/bench_gemm --csv-param=parameters.csv --benchmark_out=../results.json \ +--benchmark_out_format=json --benchmark_format=console +``` + ### CSV format The benchmarks can be given a CSV file containing the parameters to run with -(matrix/vector dimensions, transpose or not, etc), in the following format: one -line corresponds to one set of parameters, i.e. one name for the library (though -it will be iterated many times for statistical accuracy). +*(matrix/vector dimensions, transpose or not, etc)*, in the following format: one +line corresponds to one set of parameters, i.e. one name for the library *(though +it will be iterated many times for statistical accuracy)*. The formats for the different BLAS levels are: @@ -88,15 +123,34 @@ The formats for the different BLAS levels are: | blas 2 | *transpose_A,m,n,alpha,beta* | Action on the matrix (`n`, `t`, `c`), dimensions, and scalars alpha and beta | | blas 3 | | | | gemm | *transpose_A,transpose_B,m,k,n,alpha,beta* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B:kn, C: mn), and scalars alpha and beta | -| gemm (Batched) | *transpose_A,transpose_B,m,k,n,alpha,beta,batch_size* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B:kn, C: mn), scalars alpha and beta, batch size | +| gemm (Batched) | *transpose_A,transpose_B,m,k,n,alpha,beta,batch_size,batch_type* | Action on the matrices (`n`, `t`, `c`), dimensions (A: mk, B:kn, C: mn), scalars alpha and beta, batch size, batch_type | | trsm | *side,triangle,transpose,diagonal,m,n,alpha* | Position of A (`l`, `r`), A is upper or lower triangular (`u`, `l`), transposition of A (`n`, `t`), A is unit or non-unit diagonal(`u`,`n`),dimensions, scalar alpha | -Note: for operations that support a stride, the benchmarks will use a stride of -1 (contiguous values), except for the GEMM batched operation where valid default stride values are used depending on batch type *(strided or interleaved)*. 
For operations that support a leading dimension, the -benchmarks use the minimum possible value (the actual leading dimension of the -matrix). +### Notes: + +For operations that support an increment, the benchmarks will use an increment of +1 *(contiguous values)*, except for the GEMM batched operation where valid +default increment values are used depending on `batch_type` *(strided or +interleaved)*. For operations that support a leading dimension, the +benchmarks use the minimum possible value *(the actual leading dimension +of the matrix)*. -Here is an example of a valid CSV file for the GEMM benchmark: +For batched-strided operations that expect a stride(s), the benchmarking suite +expects **stride multipliers** instead of explicit stride values. For example, in +`gemm_batched_strided` operator, a `stride_a_mul=3` for matrix `A` of `size_a=m*k` +is equivalent to an actual stride of `stride_a = stride_a_mul * size_a`. + +For operations that support Complex data type *(and when cmake option +BLAS_ENABLE_COMPLEX is enabled)*, scalars such as `alpha` and `beta` are +expected to have real and imaginary parts as separate values, if only single +scalar values are passed in the csv configuration files, complex values +for these arguments will be constructed by duplicating the same real value +for both imaginary and real parts. (e.g. alpha_cplx={alpha_real, alpha_real}). + + +Here are two examples of a valid CSV files for the GEMM benchmark: + +- Scalar only alpha & beta values ``` n,n,42,42,42,1,0 @@ -104,8 +158,16 @@ n,t,64,128,64,0.5,0.5 t,n,13,3,7,0,0.7 ``` +- Complex alpha & beta values *(two scalars each)* +``` +n,n,42,42,42,1,2,3,0 +n,t,64,128,64,0.5,1,0.5,1 +t,n,13,3,7,0,1,0.7,3 +``` + + The folder `config_csv` provides a few files corresponding to sizes that are -relevant for neural networks, but you can use your own files, see the next +relevant for neural networks, but you can provide your own, see the next section for more info on how to generate them. ### Python tool to generate a CSV file @@ -262,6 +324,9 @@ following keys: * `cpu_time`: actual CPU time spent running the benchmark * `time_unit`: unit used for these times. Should be `ns`, if not please file an issue. +* `label`: Contains the name of the benchmark backend *(`@backend` : "portblas", + "cublas" or "rocblas")*, and target device related informations *(`device_name`, + `device_version`, `driver_version` etc..)* * `avg_event_time`: the average of the CL/SYCL event times in nanoseconds. This time depends on the events returned by the BLAS functions used and might not be accurate in some cases @@ -280,9 +345,10 @@ following keys: * `bytes_processed`: total number of bytes read and written in memory. It is calculated theoretically based on the operations that we think the benchmark is doing. -* a few benchmark parameters (e.g `m`, `n`, `k` for GEMM) +* other operator-specific parameters that affect the computations *(e.g `m`, `n`, +`k` for GEMM, but `alpha` is skipped as it doens't affect the operation directly.)* * some other keys from the benchmark library **Note:** to calculate the performance in Gflops, you can divide `n_fl_ops` by one -of the best or average time metrics, e.g `avg_overall_time` (the event and wall -time usually converge for large dimensions). +of the best or average time metrics, e.g `avg_overall_time` *(the event and wall +time usually converge for large dimensions)*. 
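For example, a minimal post-processing sketch can turn those report fields into Gflop/s figures. This is only an illustration: it assumes the standard Google Benchmark JSON layout *(a top-level `benchmarks` array)* and a report file named `results.json` as in the invocation examples above, and it relies on the `n_fl_ops` and `avg_overall_time` keys described in the list of report keys.

```python
#!/usr/bin/env python3
# Illustrative sketch: read a benchmark JSON report (e.g. results.json) and
# print Gflop/s per benchmark entry, computed as n_fl_ops / avg_overall_time.
# The counter names are the report keys described above; the top-level
# "benchmarks" array is assumed from the standard Google Benchmark JSON format.
import json
import sys


def print_gflops(report_path):
    with open(report_path) as f:
        report = json.load(f)
    for bench in report.get("benchmarks", []):
        n_fl_ops = bench.get("n_fl_ops")
        avg_ns = bench.get("avg_overall_time")  # reported in nanoseconds
        if not n_fl_ops or not avg_ns:
            # Skip entries that do not carry the portBLAS counters
            continue
        gflops = float(n_fl_ops) / float(avg_ns)  # flop / ns == Gflop/s
        print(f'{bench["name"]}: {gflops:.2f} Gflop/s')


if __name__ == "__main__":
    print_gflops(sys.argv[1] if len(sys.argv) > 1 else "results.json")
```

Since the times are reported in nanoseconds, dividing the flop count by the time gives Gflop/s directly, so no extra scaling factor is needed in the sketch.
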
From fbcf50f4c5320cc7cd6cbfbc3c42b769e09a8f9b Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI <104583441+OuadiElfarouki@users.noreply.github.com> Date: Wed, 3 Jan 2024 14:04:54 +0000 Subject: [PATCH 06/14] Update README.md Co-authored-by: Muhammad Tanvir <84532306+muhammad-tanvir-1211@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 58bc37346..9e87bd488 100644 --- a/README.md +++ b/README.md @@ -217,7 +217,7 @@ For all these operations: | `_asum` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Absolute sum of the vector `x`; written in `rs` if passed, else returned | | `_axpy` | `sb_handle`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy` | Vector multiply-add: `y = alpha * x + y` | | `_copy` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` | Copies a vector to another: `y = x` | -| `_dot` | `sb_handle`, `N`, `sb`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors `x` and `y`; added to `sb`` and written to `rs` if passed, else returned | +| `_dot` | `sb_handle`, `N`, `sb`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors `x` and `y`; added to `sb` and written to `rs` if passed, else returned | | `_sdsdot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors with double precision `x` and `y`; written in `rs` if passed, else returned | | `_nrm2` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Euclidean norm of the vector `x`; written in `rs` if passed, else returned | | `_rot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy`, `c`, `s` | Applies a plane rotation to `x` and `y` with a cosine `c` and a sine `s` | From 91717dc930407d7aee7d0af816641bd4ae4c0d80 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI <104583441+OuadiElfarouki@users.noreply.github.com> Date: Wed, 3 Jan 2024 14:05:11 +0000 Subject: [PATCH 07/14] Update benchmark/README.md Co-authored-by: Muhammad Tanvir <84532306+muhammad-tanvir-1211@users.noreply.github.com> --- benchmark/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmark/README.md b/benchmark/README.md index bfc1f9ec0..c8e564fa1 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -97,7 +97,7 @@ Here is an example of an invocation of the portBLAS GEMM benchmark running on Intel GPU, displaying the results in the console and saving a json report: ```bash -./portblas/bench_gemm --device=intel:gpu --csv-param=parameters.csv \ +./benchmark/portblas/bench_gemm --device=intel:gpu --csv-param=parameters.csv \ --benchmark_out=../results.json --benchmark_out_format=json \ --benchmark_format=console ``` From f03cb83c476cc8d4988c1052e656d0de90fb0e19 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI <104583441+OuadiElfarouki@users.noreply.github.com> Date: Wed, 3 Jan 2024 14:05:20 +0000 Subject: [PATCH 08/14] Update benchmark/README.md Co-authored-by: Muhammad Tanvir <84532306+muhammad-tanvir-1211@users.noreply.github.com> --- benchmark/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmark/README.md b/benchmark/README.md index c8e564fa1..27d4611db 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -104,7 +104,7 @@ Intel GPU, displaying the results in the console and saving a json report: Here is the same benchmark through cuBLAS: ```bash -./cublas/bench_gemm --csv-param=parameters.csv --benchmark_out=../results.json \ +./benchmark/cublas/bench_cublas_gemm --csv-param=parameters.csv --benchmark_out=../results.json \ --benchmark_out_format=json --benchmark_format=console 
``` From fb087c7e78bf6d67e278f25c5ad2ffd01491917f Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Wed, 3 Jan 2024 15:11:18 +0000 Subject: [PATCH 09/14] Addressed comments & other doc updates --- README.md | 67 +++++++++++++++++++++++++++++++++++++------- Roadmap.md | 30 -------------------- benchmark/README.md | 2 +- doc/AddingBlas3Op.md | 3 +- samples/README.md | 34 ++++++++++------------ 5 files changed, 75 insertions(+), 61 deletions(-) delete mode 100644 Roadmap.md diff --git a/README.md b/README.md index 9e87bd488..a3028e9f6 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -portBLAS Implementation +# portBLAS Implementation === [![Build and Test](https://github.com/codeplaysoftware/portBLAS/actions/workflows/build-and-test.yml/badge.svg?event=push)](https://github.com/codeplaysoftware/portBLAS/actions/workflows/build-and-test.yml) @@ -33,9 +33,12 @@ the project. - [Compile with DPC++](#compile-with-dpc) - [Compile with hipSYCL](#compile-with-hipsycl) - [Instaling portBLAS](#instaling-portBLAS) - - [POWER\_VR support](#power_vr-support-computecpp-only) - [Doxygen](#doxygen) - [CMake options](#cmake-options) + - [ComputeCpp Compilation](#computecpp-deprecated) + - [Compile with ComputeCpp](#compile-with-computecpp) + - [POWER\_VR support](#power_vr-support-computecpp-only) + - [Cross-Compile](#cross-compile-computecpp-only) - [Tests and benchmarks](#tests-and-benchmarks) - [Contributing to the project](#contributing-to-the-project) - [Guides and Other Documents](#guides-and-other-documents) @@ -421,19 +424,13 @@ ninja ``` To build for other than the default devices (`omp`), set the `HIPSYCL_TARGETS` environment variable or specify `-DHIPSYCL_TARGETS` as [documented](https://github.com/illuhad/hipSYCL/blob/develop/doc/using-hipsycl.md). -### Instaling portBLAS +### Installing portBLAS To install the portBLAS library (see `CMAKE_INSTALL_PREFIX` below) ```bash ninja install ``` -### POWER_VR support - -To enable the PowerVR target tuning, pass: `-DTUNING_TARGET=POWER_VR` - -To use the neural network library from Imagination, pass: `-DIMGDNN_DIR=path/to/library` - ### Doxygen Doxygen documentation can be generated by running: @@ -454,7 +451,7 @@ Some of the supported options are: |---|---|---| | `BLAS_ENABLE_TESTING` | `ON`/`OFF` | Set it to `OFF` to avoid building the tests (`ON` is the default value) | | `BLAS_ENABLE_BENCHMARK` | `ON`/`OFF` | Set it to `OFF` to avoid building the benchmarks (`ON` is the default value) | -| `SYCL_COMPILER` | name | Used to determine which SYCL implementation to use. By default, the first implementation found is used. Supported values are: `computecpp`, `dpcpp` and `hipsycl`. | +| `SYCL_COMPILER` | name | Used to determine which SYCL implementation to use. By default, the first implementation found is used. Supported values are: `dpcpp`, `hipsycl` and `computecpp`*(deprecated)*. | | `TUNING_TARGET` | name | By default, this flag is set to `DEFAULT_CPU` to restrict any device specific compiler optimizations. Use this flag to tune the code for a target (**highly recommended** for performance). The supported targets are: `INTEL_GPU`, `NVIDIA_GPU`, `AMD_GPU` | | `CMAKE_PREFIX_PATH` | path | List of paths to check when searching for dependencies | | `CMAKE_INSTALL_PREFIX` | path | Specify the install location, used when invoking `ninja install` | @@ -467,6 +464,56 @@ Some of the supported options are: | `BLAS_INDEX_TYPES` | `int32_t;int64_t` | Determines the type(s) to use for `index_t` and `increment_t`. 
Default is `int` | | `BLAS_ENABLE_COMPLEX` | `ON`/`OFF` | Determines whether to enable Complex data type support *(GEMM Operators only)* (`ON` by default) | +## ComputeCpp Compilation *(Deprecated)* + +portBLAS ComputeCpp compilation is deprecated since ComputeCpp releasing has been +discontinued. More information about this are found in this [announcement](https://codeplay.com/portal/news/2023/07/07/the-future-of-computecpp). + +### Compile with ComputeCpp + +```bash +cd build +cmake -GNinja ../ -DComputeCpp_DIR=/path/to/computecpp -DSYCL_COMPILER=computecpp +ninja +``` + +### Cross-Compile *(ComputeCpp Only)* + +To cross-compile portBLAS first the following environment variables must be +set: + +```bash +export COMPUTECPP_TOOLCHAIN_DIR="PATH TO TOOLCHAIN_DIR" +export COMPUTECPP_TARGET_TRIPLE="PATH TO TARGET_TRIPLE" +export COMPUTECPP_SYSROOT_DIR="$PATH TO SYSROOT_DIR" +``` + +Clone the [ComputeCpp-SDK](https://github.com/codeplaysoftware/computecpp-sdk) to retrieve the toolchain file. +The following CMake command can be used to cross-compile portBLAS: + +```bash +cmake -GNinja \ + ${SOURCE_ROOT} \ + -DCMAKE_PREFIX_PATH="${OPENBLAS_PATH}" \ + -DComputeCpp_DIR="${COMPUTECPP_DEVICE_PATH}" \ + -DComputeCpp_HOST_DIR="${COMPUTECPP_X86_PATH}" \ + -DCMAKE_TOOLCHAIN_FILE="/path/to/computecpp-sdk/cmake/toolchains/gcc-generic.cmake" \ + -DCMAKE_BUILD_TYPE='Release' \ + -DCMAKE_INSTALL_PREFIX=${CROSS_COMPILED_PORTBLAS_INSTALL} \ + -DOpenCL_INCLUDE_DIR="${OpenCL_Headers_PATH}" \ + -DOpenCL_LIBRARY="${OpenCL_LIBRARY}" \ + -DCOMPUTECPP_BITCODE="${DEVICE_BITCODE}" \ + -DCMAKE_CXX_FLAGS='-O3' \ + -DTUNING_TARGET="${CHOSEN_TARGET}" +``` + +### POWER_VR support *(ComputeCpp Only)* + +To enable the PowerVR target tuning, pass: `-DTUNING_TARGET=POWER_VR` + +To use the neural network library from Imagination, pass: `-DIMGDNN_DIR=path/to/library` + + ## Tests and benchmarks The tests and benchmarks have their own documentation: diff --git a/Roadmap.md b/Roadmap.md deleted file mode 100644 index 2a424abcc..000000000 --- a/Roadmap.md +++ /dev/null @@ -1,30 +0,0 @@ -Roadmap -==================== - - -Current Status: - -* BLAS LVL1 : 90% complete, only rotmg missing. -* BLAS LVL2 : Only two sample routines -* BLAS LVL3 : Only three variants of the Matrix Multiplication -* Examples : - * Interface tests for all implemented API entries - * CG example with multiple variants of the algorithm using kernel fusion. 
- -Short Term QX 2017: - -* Have a common class for the implementation of the different view classes -* Obtain and analyse performance bottlenecks -* Work on the Blas 2 interface - -Medium Term: - -* Complete the Blas 2 interface -* Work on the Blas 3 interface -* Move evaluation methods out of the expression tree into the execution tree - - -Long Term: - -* Complete the Blas 3 interface -* Add continuous integration testing diff --git a/benchmark/README.md b/benchmark/README.md index 27d4611db..105e2ebcc 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -80,7 +80,7 @@ The most useful options for us are: |option|parameter|description| |------|:-------:|-----------| | `--help` | | Show help message | -| `--device` | device name | Select a device to run on (e.g `intel:gpu`), useful when targeting openCL devices with portBLAS or clBLASt benchmarks| +| `--device` | device name | Select a device to run on (e.g `intel:gpu`), useful for portBLAS and clBLASt benchmarks| | `--csv-param` | file path | Path to a CSV file with the benchmark parameters | | `--benchmark_format` | `console` / `json` / `csv` | Specify the format of the standard output | | `--benchmark_out` | file path | Specify a file where to write the report | diff --git a/doc/AddingBlas3Op.md b/doc/AddingBlas3Op.md index d6d578cbc..9791d3f24 100644 --- a/doc/AddingBlas3Op.md +++ b/doc/AddingBlas3Op.md @@ -108,7 +108,8 @@ void run_test(const combination_t combi) { } // Create the combinations of parameters to invoke the test -const auto combi = ::testing::Combine(::testing::Values(7, 513, 1027), // m +const auto combi = ::testing::Combine(::testing::Values("usm", "buf"), // allocation type + ::testing::Values(7, 513, 1027), // m ::testing::Values(7, 513, 1027), // n ::testing::Values('n', 't'), // transA ::testing::Values('l', 'r'), // side diff --git a/samples/README.md b/samples/README.md index ec5717f84..412c8ee2b 100644 --- a/samples/README.md +++ b/samples/README.md @@ -2,30 +2,26 @@ portBLAS samples === ## How to compile the samples +A SYCL Compiler (DPCPP, hipSYCL or ComputeCpp) along with the target device's +relevant compute drivers *(OpenCL, CUDA etc..)* are required to compile and +run the samples. +Any project that integrates portBLAS can either use it as : + * a library *(install the library and include `portblas.h` in an application)* + * a header-only framework *(include `portblas.hpp` in an application)* -At the moment any project using portBLAS requires: - -* OpenCL -* [ComputeCpp](http://www.computecpp.com) -* portBLAS, either: - * as a library (install the library and include `portblas.h` in an application) - * as a header-only framework (include `portblas.hpp` in an application) - -### With CMake +### CMake Configuration This folder contains a basic CMake configuration file and a module to find -portBLAS (which will be used as a header-only framework). It also uses a module -to find ComputeCpp that is located in the folder `cmake/Modules`. - -Usage: +portBLAS *(which will be used as a header-only framework)*. It also uses a module +to find the SYCL Compiler(DPCPP, hipSYCL or computeCpp *-deprecated-*) that is +located in the folder `cmake/Modules`. -* set `ComputeCpp_DIR` to your ComputeCpp root path -* set `PORTBLAS_DIR` to your portBLAS root path +Sample usage with DPCPP Compiler: ```bash -mkdir build -cd build -cmake .. -GNinja -DComputeCpp_DIR=/path/to/computecpp \ - -DPORTBLAS_DIR=~/path/to/portblas +mkdir build && cd build +export CXX=/path/to/dpcpp/clang++ +export CC=/path/to/dpcpp/clang +cmake .. 
-GNinja -DPORTBLAS_DIR=~/path/to/portblas ninja ``` From d8d777358b50f20154b3d2bdaa28456f3443bd5f Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Fri, 5 Jan 2024 14:19:31 +0000 Subject: [PATCH 10/14] removed uncessary flag in Dockerfile --- Dockerfile | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Dockerfile b/Dockerfile index 4e26a9db7..e3eda6faa 100644 --- a/Dockerfile +++ b/Dockerfile @@ -48,7 +48,7 @@ CMD cd /portBLAS && \ if [ "${COMMAND}" = 'build-test' ]; then \ if [ "${SYCL_IMPL}" = 'DPCPP' ]; then \ export LD_LIBRARY_PATH="/tmp/dpcpp/lib" && mkdir -p build && cd build && \ - cmake .. -DBLAS_ENABLE_STATIC_LIBRARY=ON -DGEMM_TALL_SKINNY_SUPPORT=OFF \ + cmake .. -DGEMM_TALL_SKINNY_SUPPORT=OFF \ -DSYCL_COMPILER=dpcpp -DCMAKE_PREFIX_PATH=/tmp/OpenBLAS/build \ -DBLAS_ENABLE_CONST_INPUT=OFF -DCMAKE_BUILD_TYPE=Release && \ make -j$(nproc) && cd test && ctest -VV --timeout 1200; \ @@ -58,7 +58,7 @@ CMD cd /portBLAS && \ elif [ "${COMMAND}" = 'auto-tuner' ]; then \ export LD_LIBRARY_PATH="/tmp/dpcpp/lib" && mkdir -p tools/auto_tuner/build \ && cd tools/auto_tuner/build && \ - cmake .. -DBLAS_ENABLE_STATIC_LIBRARY=ON -DGEMM_TALL_SKINNY_SUPPORT=OFF \ + cmake .. -DGEMM_TALL_SKINNY_SUPPORT=OFF \ -DSYCL_COMPILER=dpcpp -DCMAKE_PREFIX_PATH=/tmp/OpenBLAS/build \ -DBLAS_ENABLE_CONST_INPUT=OFF -DCMAKE_BUILD_TYPE=Release && \ make -j$(nproc); \ From 0a9840593785d337145aaaa8d2cd19fc4b0033b7 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Mon, 8 Jan 2024 10:59:54 +0000 Subject: [PATCH 11/14] Additional changes/deletions of py_gen related files and documentation --- benchmark/gen_param.py | 4 +++- doc/AddingBlas3Op.md | 5 +++-- python_generator/py_gen_blas_binary.py | 0 python_generator/py_gen_blas_binary_special.py | 0 python_generator/py_gen_blas_gemm_launcher.py | 2 +- python_generator/py_gen_blas_ops.py | 2 +- python_generator/py_gen_blas_reduction.py | 2 +- python_generator/py_gen_blas_rotg_return.py | 0 python_generator/py_gen_blas_rotmg.py | 0 python_generator/py_gen_blas_ternary.py | 0 10 files changed, 9 insertions(+), 6 deletions(-) delete mode 100644 python_generator/py_gen_blas_binary.py delete mode 100644 python_generator/py_gen_blas_binary_special.py delete mode 100644 python_generator/py_gen_blas_rotg_return.py delete mode 100644 python_generator/py_gen_blas_rotmg.py delete mode 100644 python_generator/py_gen_blas_ternary.py diff --git a/benchmark/gen_param.py b/benchmark/gen_param.py index a8d4019bc..aaa6ebafa 100644 --- a/benchmark/gen_param.py +++ b/benchmark/gen_param.py @@ -1,5 +1,5 @@ #!/usr/bin/env python3 -#/*************************************************************************** +# /*************************************************************************** # * # * @license # * Copyright (C) Codeplay Software Limited @@ -32,12 +32,14 @@ import itertools import argparse + def main(args): """Generate the csv file according to the given arguments """ # Match DSL to Python names nd_range = itertools.product value_range = lambda *v: list(v) + def size_range(low, high, mult): val = low while val <= high: diff --git a/doc/AddingBlas3Op.md b/doc/AddingBlas3Op.md index 9791d3f24..147817025 100644 --- a/doc/AddingBlas3Op.md +++ b/doc/AddingBlas3Op.md @@ -178,8 +178,9 @@ binary is linked against portBLAS, the linker will find the definition of the mi To do this, we create the source file that will contain instantiations of the new `_trsm` operation. The file is located at `src/interface/blas3/trsm.cpp.in`. 
This is not the file that will be -compiled, but a template file that the python script `python_generator/py_gen_blas_binary.py` -will use to generate the actual source file where the instantiation of `_trsm` will happen. +compiled, but a template file that the python script `python_generator/py_gen_blas_ops.py` +will use to generate the actual source file where the instantiation of `_trsm` will happen. The call +to this generator is wrapped within the cmake function `generate_blas_objects`. The file `src/interface/blas3/trsm.cpp.in` must include all files that are necessary to successfully compile `blas::internal::_trsm`, for this particular example, this file looks like the following: diff --git a/python_generator/py_gen_blas_binary.py b/python_generator/py_gen_blas_binary.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/python_generator/py_gen_blas_binary_special.py b/python_generator/py_gen_blas_binary_special.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/python_generator/py_gen_blas_gemm_launcher.py b/python_generator/py_gen_blas_gemm_launcher.py index c34ab3849..3c4fb6a7e 100644 --- a/python_generator/py_gen_blas_gemm_launcher.py +++ b/python_generator/py_gen_blas_gemm_launcher.py @@ -1,4 +1,4 @@ -#/*************************************************************************** +# /*************************************************************************** # * # * @license # * Copyright (C) Codeplay Software Limited diff --git a/python_generator/py_gen_blas_ops.py b/python_generator/py_gen_blas_ops.py index 68f53eaac..b72c05c07 100644 --- a/python_generator/py_gen_blas_ops.py +++ b/python_generator/py_gen_blas_ops.py @@ -1,4 +1,4 @@ -#/*************************************************************************** +# /*************************************************************************** # * # * @license # * Copyright (C) Codeplay Software Limited diff --git a/python_generator/py_gen_blas_reduction.py b/python_generator/py_gen_blas_reduction.py index 5a80ee10d..be45e5884 100644 --- a/python_generator/py_gen_blas_reduction.py +++ b/python_generator/py_gen_blas_reduction.py @@ -1,4 +1,4 @@ -#/*************************************************************************** +# /*************************************************************************** # * # * @license # * Copyright (C) Codeplay Software Limited diff --git a/python_generator/py_gen_blas_rotg_return.py b/python_generator/py_gen_blas_rotg_return.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/python_generator/py_gen_blas_rotmg.py b/python_generator/py_gen_blas_rotmg.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/python_generator/py_gen_blas_ternary.py b/python_generator/py_gen_blas_ternary.py deleted file mode 100644 index e69de29bb..000000000 From 497abf6b67defea9c085c64d4a572c9393be2c88 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI Date: Mon, 8 Jan 2024 16:43:46 +0000 Subject: [PATCH 12/14] Addressed PR comments --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index a3028e9f6..f2af6fc2c 100644 --- a/README.md +++ b/README.md @@ -220,8 +220,8 @@ For all these operations: | `_asum` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Absolute sum of the vector `x`; written in `rs` if passed, else returned | | `_axpy` | `sb_handle`, `N`, `alpha`, `vx`, `incx`, `vy`, `incy` | Vector multiply-add: `y = alpha * x + y` | | `_copy` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` | Copies a vector to 
another: `y = x` |
-| `_dot` | `sb_handle`, `N`, `sb`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors `x` and `y`; added to `sb` and written to `rs` if passed, else returned |
-| `_sdsdot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors with double precision `x` and `y`; written in `rs` if passed, else returned |
+| `_dot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Dot product of two vectors `x` and `y`; written to `rs` if passed, else returned |
+| `_sdsdot` | `sb_handle`, `N`, `sb`, `vx`, `incx`, `vy`, `incy` [, `rs`] | Compute the sum of a constant `sb` and the double precision dot product of two single precision vectors `x` and `y`; written in `rs` if passed, else returned |
| `_nrm2` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Euclidean norm of the vector `x`; written in `rs` if passed, else returned |
| `_rot` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy`, `c`, `s` | Applies a plane rotation to `x` and `y` with a cosine `c` and a sine `s` |
@@ -229,8 +229,8 @@ For all these operations:
| `_rotmg` | `sb_handle`, `d1`, `d2`, `x1`, `y1` `param` | Given the Cartesian coordinates (`x1`, `y1`) of a point, return the components of a modified Givens transformation matrix that zeros the y-component of the resulting point. |
| `_scal` | `sb_handle`, `N`, `alpha`, `vx`, `incx` | Scalar product of a vector: `x = alpha * x` |
| `_swap` | `sb_handle`, `N`, `vx`, `incx`, `vy`, `incy` | Interchanges two vectors: `y = x` and `x = y` |
-| `_iamax` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | First index and value of the maximum element of `x`; written in `rs` if passed, else the index only is returned |
-| `_iamin` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | First index and value of the minimum element of `x`; written in `rs` if passed, else the index only is returned |
+| `_iamax` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Index of the first occurrence of the maximum element in `x`; written to `rs` if passed, else returned. |
+| `_iamin` | `sb_handle`, `N`, `vx`, `incx` [, `rs`] | Index of the first occurrence of the minimum element in `x`; written to `rs` if passed, else returned. |

### BLAS 2

From 99c52b15688ee217a6e0726c0f654945147bde1b Mon Sep 17 00:00:00 2001
From: Ouadie EL FAROUKI
Date: Wed, 10 Jan 2024 13:56:33 +0000
Subject: [PATCH 13/14] updated joint matrix doc link

---
 README.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index f2af6fc2c..3aa9006b5 100644
--- a/README.md
+++ b/README.md
@@ -351,7 +351,7 @@ Other non-official extension operators :
 ### Experimental Joint Matrix Support

 portBLAS now supports sub-group based collective GEMM operation using the experimental
-[`joint_matrix`](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) extension provided by DPC++. This support is only accessible for the latest
+[`joint_matrix`](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) extension provided by DPC++. This support is only accessible for the latest
 NVIDIA Ampere GPUs and beyond. 
The requirements for using this experimental support are: ```bash @@ -390,11 +390,12 @@ added to the `CMAKE_PREFIX_PATH` when building portBLAS (see **IMPORTANT NOTE:** The `TARGET` CMake variable is no longer supported. It has been replaced by `TUNING_TARGET`, which accepts the same options. -`TUNING_TARGET` affects only the tuning configuration, applicable for some operators such as GEMM, and has no effect on the target -triplet for DPC++ or the hipSYCL target. Please refer to the sections below for -setting them. +`TUNING_TARGET` affects only the tuning configuration, applicable for some operators such +as GEMM, and has no effect on the target triplet for DPC++ or the hipSYCL target. Please +refer to the sections below for setting them. -1. Clone the portBLAS repository, making sure to pass the `--recursive` option, in order to clone submodule(s). +1. Clone the portBLAS repository, making sure to pass the `--recursive` option, in order +to clone submodule(s). 2. Create a build directory 3. Run `CMake` from the build directory *(see options in the section below)*: From e926fcb301debe32923327d2c2e8c0605d023c94 Mon Sep 17 00:00:00 2001 From: Ouadie EL FAROUKI <104583441+OuadiElfarouki@users.noreply.github.com> Date: Fri, 12 Jan 2024 11:19:40 +0000 Subject: [PATCH 14/14] Update README.md Co-authored-by: Muhammad Tanvir <84532306+muhammad-tanvir-1211@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 3aa9006b5..97476e994 100644 --- a/README.md +++ b/README.md @@ -35,7 +35,7 @@ the project. - [Instaling portBLAS](#instaling-portBLAS) - [Doxygen](#doxygen) - [CMake options](#cmake-options) - - [ComputeCpp Compilation](#computecpp-deprecated) + - [ComputeCpp Compilation (Deprecated)](#computecpp-deprecated) - [Compile with ComputeCpp](#compile-with-computecpp) - [POWER\_VR support](#power_vr-support-computecpp-only) - [Cross-Compile](#cross-compile-computecpp-only)