
Releases: NVIDIA/cutlass

CUTLASS 3.0

10 Mar 04:19
c4f6b8c

3.0.0 (2023-01-23)

CUTLASS 2.11

20 Jan 21:35
66d9cdd

2.11.0 (2022-11-19)

  • Stream-K, a new general approach to split-K. It can not only improve performance but also significantly reduce the number of tile sizes that need to be profiled to find the best one.
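A conceptual sketch of the idea behind Stream-K, under illustrative assumptions (tile counts, iteration counts, and the worker count are made up; this is not the CUTLASS API): instead of splitting every output tile's K loop into a fixed number of pieces, the flat space of MAC-loop iterations across all tiles is divided evenly over a fixed number of workers.

```cpp
// Conceptual sketch of Stream-K work assignment (not the CUTLASS API).
#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical problem: a grid of output tiles, each needing k_iters MAC-loop iterations.
  int tiles_m = 12, tiles_n = 10, k_iters = 32;
  int total_iters = tiles_m * tiles_n * k_iters;

  // Stream-K: divide the flat iteration space evenly over a fixed number of workers
  // (e.g., one CTA per SM), independent of the tile count.
  int num_workers = 108;  // e.g., number of SMs on an A100
  for (int w = 0; w < num_workers; ++w) {
    int64_t begin = int64_t(total_iters) * w / num_workers;
    int64_t end   = int64_t(total_iters) * (w + 1) / num_workers;
    // A worker's range may start or end in the middle of a tile's K loop;
    // partial results for such tiles are combined by a lightweight reduction.
    std::printf("worker %3d: iterations [%lld, %lld)\n", w, (long long)begin, (long long)end);
  }
  return 0;
}
```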

  • Fused multi-head attention kernel. It has two variants: one uses batched GEMM for fixed sequence lengths, and the other uses grouped GEMM for variable sequence lengths. Both versions require only a single kernel.

  • Dual GEMM, which can fuse A x B and A x C into one kernel. The two GEMMs have no producer-consumer dependency.
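A minimal host-side sketch of the semantics the fused Dual GEMM implements (illustrative reference code; the row-major layout and function name are assumptions):

```cpp
// Reference semantics of the fused Dual GEMM (illustrative host code, not the CUTLASS kernel).
#include <vector>

// Computes D0 = A x B and D1 = A x C in a single pass over A.
void dual_gemm_reference(int M, int N, int K,
                         const std::vector<float>& A,   // M x K, row-major
                         const std::vector<float>& B,   // K x N, row-major
                         const std::vector<float>& C,   // K x N, row-major
                         std::vector<float>& D0,        // M x N, row-major
                         std::vector<float>& D1) {      // M x N, row-major
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc0 = 0.f, acc1 = 0.f;
      for (int k = 0; k < K; ++k) {
        float a = A[m * K + k];     // each element of A is read once and reused for both products
        acc0 += a * B[k * N + n];
        acc1 += a * C[k * N + n];
      }
      D0[m * N + n] = acc0;
      D1[m * N + n] = acc1;
    }
  }
}

int main() {
  int M = 2, N = 3, K = 4;
  std::vector<float> A(M * K, 1.f), B(K * N, 2.f), C(K * N, 3.f);
  std::vector<float> D0(M * N), D1(M * N);
  dual_gemm_reference(M, N, K, A, B, C, D0, D1);
  return (D0[0] == 8.f && D1[0] == 12.f) ? 0 : 1;  // 4*1*2 = 8, 4*1*3 = 12
}
```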

  • Hopper improves double-precision matrix multiplication by 2x compared to Ampere at iso-clocks. It is supported starting with CUDA 11.8.

  • BLAS3 functions using Hopper's new double-precision matrix multiplication instructions.

  • ELL Block Sparse GEMM, which uses an ELL matrix to describe the sparsity of the A matrix. The B and output matrices remain dense. The block size can be arbitrary.
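A sketch of what a blocked-ELL description of the A operand can look like and how it drives a sparse-times-dense product (plain C++ reference code; the struct layout and field names are illustrative, not the CUTLASS data structures):

```cpp
// Illustrative blocked-ELL storage for the sparse A operand (not the CUTLASS data structures).
#include <vector>

struct BlockEllMatrix {
  int rows, cols;            // logical size of A in elements
  int block;                 // block size (square); arbitrary in this feature
  int max_blocks_per_row;    // fixed ELL width: stored blocks per block-row
  std::vector<int> block_col;   // column index of each stored block, -1 for padding
  std::vector<float> values;    // dense block values, same order as block_col
};

// Reference D = A_sparse x B_dense, with dense B (cols x N) and dense D (rows x N, zero-initialized).
void ell_spmm_reference(const BlockEllMatrix& A, const std::vector<float>& B, int N,
                        std::vector<float>& D) {
  int block_rows = A.rows / A.block;
  for (int br = 0; br < block_rows; ++br) {
    for (int e = 0; e < A.max_blocks_per_row; ++e) {
      int bc = A.block_col[br * A.max_blocks_per_row + e];
      if (bc < 0) continue;  // padded slot, no stored block
      const float* blk = &A.values[(br * A.max_blocks_per_row + e) * A.block * A.block];
      for (int i = 0; i < A.block; ++i)
        for (int n = 0; n < N; ++n)
          for (int k = 0; k < A.block; ++k)
            D[(br * A.block + i) * N + n] += blk[i * A.block + k] * B[(bc * A.block + k) * N + n];
    }
  }
}

int main() {
  // A 4x4 matrix of 2x2 blocks with one stored block per block-row (the rest implicit zero).
  BlockEllMatrix A{4, 4, 2, 1, {0, 1}, std::vector<float>(8, 1.f)};
  int N = 3;
  std::vector<float> B(A.cols * N, 1.f), D(A.rows * N, 0.f);
  ell_spmm_reference(A, B, N, D);
  return D[0] == 2.f ? 0 : 1;  // each output element sums one 2-wide block of ones
}
```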

  • Optimized Group Conv for SingleGroup mode, which requires that the number of output channels per group be a multiple of the threadblock tile N.

  • Optimized Depthwise Conv. Two new modes are added:

    • kOptimized - uses direct convolution instead of implicit GEMM.
      • The restrictions are: 1) the input channel, output channel, and group counts must be multiples of (128 / sizeof(input element)); 2) the input filter size must match the template parameter configuration.
    • kFixedStrideDilation - puts the stride and dilation into template parameters to further improve performance. In this mode, the kernel keeps some inputs persistent in registers to squeeze out more performance, so large filters, strides, and dilations are not recommended.
      • The restrictions are: 1) the input channel, output channel, and group counts must be multiples of (128 / sizeof(input element)); 2) the input filter size, stride, and dilation must match the template parameter configuration.
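A minimal sketch of the channel-alignment rule stated in the restrictions above (a hypothetical helper, not part of CUTLASS); for half-precision inputs (2 bytes), for example, the channel and group counts must be multiples of 128 / 2 = 64:

```cpp
// Illustrative check of the "multiple of (128 / sizeof(input element))" restriction.
#include <cstdio>

template <typename Element>
bool depthwise_channels_ok(int input_channels, int output_channels, int groups) {
  int required_multiple = 128 / int(sizeof(Element));
  return input_channels  % required_multiple == 0 &&
         output_channels % required_multiple == 0 &&
         groups          % required_multiple == 0;
}

int main() {
  // short is used here as a 2-byte stand-in for a half-precision element.
  std::printf("%d\n", depthwise_channels_ok<short>(128, 128, 128));  // 1: all multiples of 64
  std::printf("%d\n", depthwise_channels_ok<short>(96, 96, 96));     // 0: 96 is not a multiple of 64
  return 0;
}
```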
  • Scripts to fuse multiple back-to-back GEMMs. Their implementation was discussed in a GTC'22 Spring talk.

  • FP8 data type definition and conversion routines.
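A minimal sketch of using the new FP8 types, assuming the cutlass::float_e4m3_t and cutlass::float_e5m2_t definitions in cutlass/float8.h (check the release for the exact header and type names):

```cpp
// Round-trip a float through the two FP8 formats (assumes cutlass/float8.h from this release).
#include <cstdio>
#include "cutlass/float8.h"

int main() {
  float x = 0.4375f;

  cutlass::float_e4m3_t a(x);   // 4 exponent bits, 3 mantissa bits
  cutlass::float_e5m2_t b(x);   // 5 exponent bits, 2 mantissa bits

  std::printf("e4m3: %f  e5m2: %f\n", float(a), float(b));
  return 0;
}
```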

  • Updates and bugfixes from the community (thanks!). Big shout out to Meta's xFormers.

  • Deprecation announcement: CUTLASS plans to deprecate the following:

    • Maxwell and Pascal GPU architectures
    • Ubuntu 16.04
    • CUDA 10.2

CUTLASS 2.10.0

16 Sep 02:42
fc9ebc6


CUTLASS Python now supports GEMM, Convolution and Grouped GEMM for different data types as well as different epilogue flavors.
Optimizations for CUTLASS's Grouped GEMM kernel. When applicable, some scheduling can be moved to the host side.
Optimizations for GEMM+Softmax.
Grouped GEMM for multi-head attention: a general MHA implementation that does not require equal sequence lengths in every GEMM (see the sketch after this list).
GEMM + layer norm fusion for Ampere fuses the layer norm into the GEMMs that precede and follow it.
GEMM Epilogue Permutation Fusion can permute the GEMM output before storing.
Grouped convolution targeting implicit GEMM introduces the first group convolution implementation to CUTLASS. It is an Analytical implementation, not an Optimized one.
Depthwise separable convolution introduces the first depthwise convolution, which is also Analytical for now.
Standalone Layernorm and Pooling kernels.
Back-to-back GEMM enhancements.
Updates and bugfixes from the community (thanks!)
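
As referenced above, the grouped-GEMM multi-head attention builds one GEMM problem per batch entry, so sequence lengths may differ. A minimal sketch of assembling the per-problem shapes (only the problem-size list is shown; cutlass::gemm::GemmCoord is the real CUTLASS coordinate type, but the grouped-GEMM kernel instantiation and the attention-specific details are omitted and assumed):

```cpp
// Sketch of assembling variable-length attention problems for a grouped GEMM.
#include <vector>
#include "cutlass/gemm_coord.h"

int main() {
  int head_dim = 64;
  std::vector<int> seq_lens = {37, 512, 129, 1024};   // one sequence length per batch entry

  // Q x K^T for each batch entry: a (seq_len x seq_len x head_dim) GEMM.
  std::vector<cutlass::gemm::GemmCoord> problem_sizes;
  for (int s : seq_lens) {
    problem_sizes.push_back(cutlass::gemm::GemmCoord(s, s, head_dim));
  }
  // problem_sizes would then be handed to a grouped GEMM, which schedules all
  // of these differently shaped GEMMs within a single kernel launch.
  return problem_sizes.size() == seq_lens.size() ? 0 : 1;
}
```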

CUTLASS 2.9.1

29 Jun 01:15
e45e773

Bug fixes, performance tuning, and enhancements to documentation.

CUTLASS 2.9.0

27 Apr 16:31
319a389


CUTLASS 2.8

06 Dec 19:22
5fe09c2

CUTLASS 2.7

20 Sep 18:10
2e07c4c

2.7.0

CUTLASS 2.6.1

03 Sep 17:27
6c2f8f2
  • Arbitrary padding and striding for CUTLASS Strided DGRAD Convolution operator (Analytic Iterators)
  • Tuning for GEMMs fused with partial reductions
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!

CUTLASS 2.6.0

03 Sep 16:52
a01feb9


  • Optimal performance when compiled with the CUDA 11.4 Toolkit
  • Fused operators with GEMM and Convolution
  • 64b tensor strides and leading dimensions support for GEMMs
  • Affine rank=2 matrix layouts
  • Batched GEMV preview implementation
  • New strided Dgrad implementation
    • Accelerates over the previous implementation by cutting redundant math by 4x
    • Supported via new Dy and w analytic iterators and the existing cutlass::conv::device::ImplicitGemmConvolution interface
  • Quaternion-valued GEMM and Convolution in single- and double-precision (targeting CUDA Cores); see the sketch after this list
  • Many improvements to the epilogue.
    • Provides an option not to fully unroll the epilogue, reducing code size and improving performance when using complicated elementwise operations
    • Performance improvement for FP16 tensor core kernels
    • Bug fixes
  • Enhanced Clang support; the combination of Clang 13 and CUDA 11.4 can build and run kernels on Pascal and Ampere.
  • Updated minimum CUDA Toolkit requirement to 10.2
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!
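
For the quaternion-valued GEMM noted above, the element-level operation is a Hamilton product accumulated per output element. A plain C++ sketch of that product (illustrative only; the CUTLASS implementation provides its own quaternion type):

```cpp
// Hamilton product, the element-level multiply behind a quaternion-valued GEMM.
#include <cstdio>

struct Quat { float w, x, y, z; };

// r = a * b (quaternion multiplication is non-commutative).
Quat qmul(Quat a, Quat b) {
  return {
    a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
    a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
    a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
    a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w
  };
}

// A quaternion GEMM accumulates C[m][n] += A[m][k] * B[k][n] using qmul as the multiply.
int main() {
  Quat i{0, 1, 0, 0}, j{0, 0, 1, 0};
  Quat k = qmul(i, j);                       // expect k = (0, 0, 0, 1)
  std::printf("%g %g %g %g\n", k.w, k.x, k.y, k.z);
  return 0;
}
```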

CUTLASS 2.5.0

03 Mar 19:20
0f10563

CUTLASS 2.5 is a minor release contributing:

  • Tensor reductions (see the sketch after this list)
    • m-to-n reductions of tensors with affine layout
    • Specializations for reductions including contiguous dimension
    • Specializations for reductions excluding contiguous dimension
    • Custom reduction functors such as cutlass::logical_and
    • Large tensor support, up to 2^63 elements (however, each dimension is limited to an extent of 2^31)
  • Optimizations for 3-D convolution
  • Fused Convolution+Convolution example
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!
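
For the tensor reductions above, a plain C++ sketch of the m-to-n semantics: reduce one dimension of a rank-3 tensor with a caller-supplied functor, here a logical AND in the spirit of cutlass::logical_and (illustrative code, not the CUTLASS API):

```cpp
// Reduce a rank-3 tensor over its middle dimension with a custom reduction functor.
#include <vector>

// in has shape [D0][D1][D2] in row-major order; out has shape [D0][D2].
template <typename T, typename ReduceOp>
void reduce_dim1(const std::vector<T>& in, std::vector<T>& out,
                 int D0, int D1, int D2, ReduceOp op, T identity) {
  for (int i = 0; i < D0; ++i)
    for (int k = 0; k < D2; ++k) {
      T acc = identity;
      for (int j = 0; j < D1; ++j)
        acc = op(acc, in[(i * D1 + j) * D2 + k]);
      out[i * D2 + k] = acc;
    }
}

int main() {
  std::vector<int> in(2 * 3 * 4, 1);
  std::vector<int> out(2 * 4, 0);
  // Logical-AND reduction: an output element is 1 only if every reduced element is nonzero.
  reduce_dim1(in, out, 2, 3, 4,
              [](int a, int b) { return (a != 0) && (b != 0) ? 1 : 0; }, 1);
  return out[0] == 1 ? 0 : 1;
}
```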