Releases: NVIDIA/cutlass
Releases ยท NVIDIA/cutlass
CUTLASS 3.6.0
- Hopper structured sparse GEMM.
- A refactor to the CUTLASS 3.x convolution
kernel::ConvUniversal
API to bring it in line withgemm::GemmUniversal
. Now the 3.x convolution API is no longer considered as a beta API. - An improved mixed input GEMM and a lookup table implementation for
INT4
xFP8
scale-only mode. - EVT nodes for Top-K selection and softmax and GEMM example using those.
- Programmatic Dependent Launch (PDL) that leverages a new Hopper feature to speedup two back-to-back kernels, and its corresponding documentations.
- A new debugging tool, synclog, for dumping out all synchronization events from within a kernel to a file. Please see synclog documentation for details.
- A new TMA-enabled epilogue for grouped GEMM that brings significant performance improvement, as well as its EVT support.
- A SIMT-enabled pointer-array epilogue.
- A new Ping-Pong kernel schedule for Grouped GEMM and some other optimizations.
- A new instantiation strategy for CUTLASS profiler kernels along with improved documentation for instantiation level in CUTLASS profiler.
- A new hardware support for comparisons and computations of
cutlass::bfloat16_t
- Fixed use of isnan on Windows for
half_t
.
CUTLASS 3.5.1
- Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code.
- Exposure of L2
cache_hint
s in TMA copy atoms - Exposure of raster order and tile swizzle extent in CUTLASS library profiler, and
example 48. - TMA store based and EVT supported epilogues for Hopper pointer array batched kernels.
- A new
GemmSparseUniversal
API for CUTLASS 2.x Ampere kernels to enable serial and parallel split-k for sparse tensor cores and new tiny tile sizes to better support LLM inference. - CUDA host adapter extensions to support TMA descriptor construction driver APIs.
- Inclusion of more Hopper fprop, dgrad, and wgrad convolution kernels in CUTLASS library and profiler.
- Support for residual add (beta != 0) in convolution kernels.
- A new convolution epilogue for CUTLASS 2.x to support non-packed NHWC output.
- A refactor of include files throughout CUTLASS core directories to reduce circular dependencies and tests to guard against them.
- A guide for setting up VSCode to work well with CUTLASS and expanded code style guide.
- Better support for MSVC as a host compiler.
- Many performance optimizations, improvements, and bug fixes including fixes for FlashAttention-2.
- Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1.
- NOTICE:
- Upcoming CUTLASS 3.6 release will include a breaking refactor to the CUTLASS 3.x convolution
kernel::ConvUniversal
API to bring it in line withgemm::GemmUniversal
. After this, the 3.x convolution API will no longer be considered as a beta API. - Upcoming CUTLASS 3.6 release will include a breaking refactor to the Hopper TMA pointer array batched epilogue in order to support grouped GEMMs.
- Upcoming CUTLASS 3.6 release will include a breaking refactor to the CUTLASS 3.x convolution
CUTLASS 3.5.0
- Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + TMA im2col.
- Native implementation in CUTLASS 3.x using CuTe, mirroring the same design hierarchy as that of GEMMs.
- Support for 1D, 2D, and 3D convolutions in a rank-agnostic fashion.
- Support for Fprop, Dgrad, and Wgrad algorithms.
- CUTLASS profiler support for 2D and 3D convolutions implemented via the 3.x API.
- NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until 3.7 release. Your feedback is welcome on the design!
- Support for Ada (SM89) FP8 tensor cores via the 2.x API. Requires CUDA 12.4 or newer.
- Ampere gather/scatter convolution example in CuTe and CUTLASS 3.x.
- Showcasing how custom kernels can be written and optimized using CUTLASS 3.x and CuTe and the general strategy for implementing convolutions as specializations of GETTs.
- Implementation of a coarse grained sparse gather/scatter kernel achieving peak performance on Ampere class tensor cores.
- 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices.
- Updates to CuTe documentation for
cute::Tensor<>
, MMA atoms, and an overhauled CuTe GEMM tutorial series. - Extensions to CuTe to support L2 prefetching and TMA store+reductions.
- Remove C++11 requirement on a few CUTLASS 2.x API header files. All CUTLASS files now require C++17.
- Fixes to greatly reduce build warnings.
- Updates and bugfixes from the community (thanks!)
CUTLASS 3.4.1
- Statically available CUTLASS Version macros that allow for handling API changes between CUTLASS releases on the users' side.
- Improvements for Hopper Group-GEMMs and Pointer-Array Batched GEMMs.
- Updates and bugfixes from the community (thanks!).
CUTLASS 3.4.0
- Improved Mixed-input Hopper GEMMs supporting {16-bit, 8-bit} x {8-bit, 4-bit} input types with fast numerical converters and group scaling factors tuned for optimal performance on Hopper H100.
- Beta release of Pointer-Array Batched GEMMs utilizing TMA and Hopper H100 tensor cores now available. (Requires CUDA 12.3 or above)
- Beta release of Group-GEMM - commonly used in optimization of Mixture-Of-Expert models, is now available on Hopper GPUs taking advantage of TMA and Hopper H100 tensor cores. (Requires CUDA 12.3 or above)
- Ampere Sparse GEMM supports Epilogue Visitor Tree (EVT) now.
- Impovements to NamedBarriers including details of ReservedNamedBarriers used within the CUTLASS library.
- Improved CuTe documentation including improved clarity and depth of Quickstart, CuTe Layout, and CuTe Layout Algebra. Associated code comments, post-conditions, and details in CuTe Core Unit Tests also improved.
CUTLASS 3.3.0
- New Mixed-input Hopper GEMMs support covering 16-bit x 8-bit input types with optimal performance.
- New Mixed-input Ampere GEMMs with support for canonical layouts (TN). The implementation supports upcast on operandB {fp16, bf16} x {s8, u8} and upcast on operandA {s8, u8} x {fp16, bf16}. They also include fast numeric conversion recipes and warp level shuffles to achieve optimal performance.
- New Copy Async based Hopper GEMMs - which support lower than 16B aligned input tensors (across s8/fp8/fp16/bf16/tf32 types) with optimal performance. As a part of this, new kernel schedules, and Copy Ops SM80_CP_ASYNC_CACHE_* were also added.
- EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See SM90 EVT fusions for details.
- Various subbyte enhancements like tagged device ptrs, support for vectorized copy, various operators to treat subbyte iterators as pointers, and full-fledged CuTe Tensor support.
- Support for Clang as a host compiler.
- Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface
CUTLASS 3.2.2
Bug fix for illegal memory access issue hit by Flash Attention tests in PyTorch. See #1138 for details.
CUTLASS 3.2.1
- Python support SM90 Epilogue Visitor Tree (EVT) on top of the C++ support released in 3.2.0.
- SM80 EVT support in C++ and Python.
- Other SM90 epilogue improvements.
- Splitting CUTLASS library into smaller units based on operation, arch and datatypes. See #1105 for details.
- Making tools/library/scripts packageable - tools/library/scripts is now moving to python/cutlass_library. See the Python README for details.
- SM90 TF32 kernel improvements for all layouts.
- SM90 rasterization direction support in the CUTLASS profiler.
- Improvement for CUTLASS profiler build times.
- Remove Python-C++ bindings.
CUTLASS 3.2
- New warp-specialized persistent FP8 GEMM kernel kernel schedules and mainloops targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters. An example showcasing Hopper warp-specialized FP8 GEMMs.
- New Epilogue Visitor Tree (EVT) support for Hopper TMA epilogues. EVTs allows for user-defined customized epilogue fusion patterns without having to write a new epilogue.
- Stream-K feature for Hopper. Note that this is only a functional implementation of stream-K, and should not be used for performance comparison. Optimizations are expected in a future release.
- Improved CTA rasterization and support for CTA swizzling for Hopper kernels using the Tile Scheduler.
- Improved performance for warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
- Hopper GEMM+Permute, an example of fusing tensor reordering (permutation) with GEMM mainloop or epilogue.
- New CUTLASS 2D Convolution Python interface. New example here.
- Support for Windows (MSVC) builds.
CUTLASS 3.1
- New CUTLASS Python interface that aims to provide an ease-of-use interface for instantiating, emitting, compiling, and running CUTLASS kernels via Python. More details here and new examples.
- New efficient epilogues using TMA for Hopper.
- Support for fused epilogues, such Bias, ReLU and GELU, using the new efficient epilogues.
- New warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
- New warp-specialized persistent cooperative kernel design that allows for larger tile sizes and improves performance on Hopper.
- An example showcasing GEMM-Like Tensor-Tensor Contraction (GETT) capability on Hopper.
- Epilogue builders. Similar to mainloop builders (see example 49), epilogue builders aim to generate the best-possible epilogue while exposing incremental opt-ins for greater customization.
- Profiler support for overriding kernel and epilogue builder auto schedules for 3.x API kernels, allowing specific policies to be run in the CUTLASS profiler.
- Performance optimizations for the warp-specialized persistent ping-pong kernel.
- Changes to the GEMM API 3.x, involving the host-facing arguments and the underlying
Params
structs. - FMHA Backward Pass from Meta xFormers.
- Streamk GEMM with Broadcast enables epilogue broadcast with StreamK GEMM.
- Batched B2B GEMM now can run multiple Back-to-Back GEMM with the same problem size in parallel.
- Batched Strided GEMV support both row major and column major input matrix.
- Permute + GEMM fusion can fuse Permute with following GEMM now. Before, we only support fusing GEMM with Permute in the epilogue.
- Row Broadcast can be fused in the epilogue.
- The GitHub branch is renamed from
master
tomain
in this release. - Optimal performance using CUDA 12.1
- Updates and bugfixes from the community (thanks!)