CUTLASS: CUDA Templates for Linear Algebra
CUDA C++ templates and Python DSLs for high-performance matrix multiplication on GPUs.
CUTLASS is a library that provides CUDA C++ template abstractions and Python domain-specific languages for high-performance linear algebra operations on NVIDIA GPUs. It decomposes matrix operations into reusable, modular software components using hierarchical parallelization strategies and custom tiling approaches. The library supports extensive data type flexibility including mixed-precision computations, various floating-point formats, and specialized tensor core operations across NVIDIA architectures from Volta to Blackwell. CUTLASS is commonly used for implementing optimized GEMM kernels, deep learning operations, and custom GPU compute applications.
Hierarchical Decomposition
Breaks down matrix operations into modular components with customizable tiling sizes and algorithmic policies. Components can be specialized and tuned for different levels of the parallelization hierarchy.
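The tiling idea can be illustrated with a minimal pure-Python sketch (conceptual only, not the CUTLASS API): the output matrix is partitioned into block tiles, and each tile accumulates partial products over slices of the K dimension, mirroring the threadblock level of CUTLASS's hierarchy.

```python
# Conceptual sketch of hierarchical tiling (illustrative, not CUTLASS API):
# C is partitioned into (block_m x block_n) tiles, and each tile accumulates
# partial products over block_k-wide slices of K.
def tiled_matmul(A, B, M, N, K, block_m=2, block_n=2, block_k=2):
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, block_m):          # tile rows of C
        for n0 in range(0, N, block_n):      # tile cols of C
            for k0 in range(0, K, block_k):  # K-dimension slices
                # Accumulate one (block_m x block_n) output tile
                for m in range(m0, min(m0 + block_m, M)):
                    for n in range(n0, min(n0 + block_n, N)):
                        for k in range(k0, min(k0 + block_k, K)):
                            C[m][n] += A[m][k] * B[k][n]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(tiled_matmul(A, B, 2, 2, 2))  # [[19.0, 22.0], [43.0, 50.0]]
```

In CUTLASS, each loop level corresponds to a tunable component (threadblock tile, warp tile, instruction shape) rather than a literal Python loop, which is what makes the decomposition customizable.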
Multi-Language Interface
Combines traditional CUDA C++ templates with Python DSLs for kernel development. CuTe DSL provides faster compile times and intuitive metaprogramming without requiring deep C++ expertise.
Comprehensive Data Type Support
Supports mixed-precision computations across FP64, FP32, TF32, FP16, BF16, 8-bit floating point, block scaled types, narrow integers, and binary data types. Optimized for tensor core operations on modern NVIDIA architectures.
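The mixed-precision pattern can be sketched in NumPy (a conceptual illustration, not the CUTLASS API): low-precision FP16 inputs are multiplied while accumulation happens in FP32, the same input/accumulator split that tensor core GEMMs use to keep throughput high while limiting rounding error.

```python
import numpy as np

# Conceptual mixed-precision GEMM (illustrative, not CUTLASS API):
# FP16 inputs, FP32 accumulation and output.
def mixed_precision_gemm(A_fp16, B_fp16):
    # Upcast to float32 so products are accumulated in higher precision
    return np.matmul(A_fp16.astype(np.float32), B_fp16.astype(np.float32))

A = np.random.rand(64, 32).astype(np.float16)
B = np.random.rand(32, 64).astype(np.float16)
C = mixed_precision_gemm(A, B)
print(C.dtype, C.shape)  # float32 (64, 64)
```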
import cutlass
import cutlass.cute as cute
# Define a simple GEMM kernel using CuTe DSL
@cute.jit
def gemm_kernel(A, B, C, M, N, K):
    # Define thread block and tile sizes
    block_m, block_n, block_k = 128, 128, 32
    # Create tensor layouts
    tA = cute.make_tensor(A, cute.make_layout((M, K)))
    tB = cute.make_tensor(B, cute.make_layout((K, N)))
    tC = cute.make_tensor(C, cute.make_layout((M, N)))
    # Perform tiled matrix multiplication
    cute.gemm(tA, tB, tC, block_m, block_n, block_k)

# Compile and execute
gemm_kernel.compile()
gemm_kernel(tensor_a, tensor_b, tensor_c, m_size, n_size, k_size)

Fixes the CPU overhead issue introduced in v4.3.4 and updates Runtime API usage while refreshing copyright notices.
- Fixed the unexpected CPU overhead issue introduced by 4.3.4
- Updated copyright to 2026
- Query the CUDA driver version via Runtime APIs rather than Driver APIs
Adds PDL support to CuTe DSL and fixes several bugs including CUDA graph and memory layout issues.
- Added PDL support, with an example of kernel launch using Programmatic Dependent Launch
- Fixed a frame refcount issue with CUDA graphs
- Enhanced the tvm-ffi AoT case to unload modules earlier
Enhances CuTe DSL with improved JIT function argument support and better error handling.
- Supported namedtuple and kwargs for JIT function arguments in tvm-ffi
- Supported variadic tuples for JIT function arguments in tvm-ffi
- Fixed an issue with JIT function arguments that have union type annotations in tvm-ffi
Related Repositories
Discover similar tools and frameworks used by developers
OpenAI.fm
Web demo showcasing OpenAI's Speech API text-to-speech capabilities with an interactive Next.js interface.
Open Notebook
Open source implementation of Google's NotebookLM that runs locally with document processing and podcast generation.
DALL-E
Official PyTorch package implementing the discrete VAE component for image tokenization used in OpenAI's DALL-E system.
ComfyUI
Visual graph-based diffusion model workflow builder.
MLX
Lazy-evaluated NumPy-like arrays optimized for Apple silicon.