
CUTLASS: CUDA Templates for Linear Algebra

CUDA C++ templates and Python DSLs for high-performance matrix multiplication on GPUs.

LIVE RANKINGS • 10:20 AM • STEADY
Overall: #211 · AI & ML: #71
Stars: 9.3K (+38 over 7 days) · Forks: 1.7K (+13 over 7 days)

Learn more about CUTLASS

CUTLASS is a library that provides CUDA C++ template abstractions and Python domain-specific languages for high-performance linear algebra operations on NVIDIA GPUs. It decomposes matrix operations into reusable, modular software components using hierarchical parallelization strategies and custom tiling approaches. The library supports extensive data type flexibility including mixed-precision computations, various floating-point formats, and specialized tensor core operations across NVIDIA architectures from Volta to Blackwell. CUTLASS is commonly used for implementing optimized GEMM kernels, deep learning operations, and custom GPU compute applications.

1. Hierarchical Decomposition

Breaks matrix operations down into modular components with customizable tile sizes and algorithmic policies. Components can be specialized and tuned for each level of the parallelization hierarchy.

2. Multi-Language Interface

Combines traditional CUDA C++ templates with Python DSLs for kernel development. The CuTe DSL offers faster compile times and intuitive metaprogramming without requiring deep C++ expertise.

3. Comprehensive Data Type Support

Supports mixed-precision computation across FP64, FP32, TF32, FP16, BF16, 8-bit floating-point formats, block-scaled types, narrow integers, and binary data types, with optimizations for tensor core operations on modern NVIDIA architectures.
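The hierarchical tiling described in feature 1 can be sketched in plain Python, with no GPU or CUTLASS install required. This is a minimal sketch of the blocked-loop idea only: the `tiled_gemm` name, the default tile sizes, and the triple-blocked loop nest are illustrative assumptions, not CUTLASS's actual kernel structure.

```python
# Minimal sketch of tiled (blocked) GEMM decomposition: the output matrix C
# is split into block_m x block_n tiles, and each tile accumulates partial
# products over block_k slices of the reduction dimension K.
def tiled_gemm(A, B, M, N, K, block_m=2, block_n=2, block_k=2):
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, block_m):            # tile rows of C
        for n0 in range(0, N, block_n):        # tile columns of C
            for k0 in range(0, K, block_k):    # slices of the K dimension
                # Inner loops work on one tile; on a GPU this level would be
                # mapped to a thread block, then to warps and tensor cores.
                for m in range(m0, min(m0 + block_m, M)):
                    for n in range(n0, min(n0 + block_n, N)):
                        for k in range(k0, min(k0 + block_k, K)):
                            C[m][n] += A[m][k] * B[k][n]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(tiled_gemm(A, B, 2, 2, 2))  # → [[19.0, 22.0], [43.0, 50.0]]
```

Because the loops only reorder independent additions, the result matches a naive triple loop for any tile sizes; in CUTLASS the payoff of this reordering is data reuse in shared memory and registers at each level of the hierarchy.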


import cutlass
from cutlass import cute

# Illustrative sketch of a GEMM in the CuTe DSL style; exact module paths
# and call signatures vary across CUTLASS 4.x releases.
@cute.jit
def gemm_kernel(A, B, C, M, N, K):
    # Define thread block and tile sizes
    block_m, block_n, block_k = 128, 128, 32

    # Create tensor layouts
    tA = cute.make_tensor(A, cute.make_layout(M, K))
    tB = cute.make_tensor(B, cute.make_layout(K, N))
    tC = cute.make_tensor(C, cute.make_layout(M, N))

    # Perform tiled matrix multiplication
    cute.gemm(tA, tB, tC, block_m, block_n, block_k)

# Compile and execute
gemm_kernel.compile()
gemm_kernel(tensor_a, tensor_b, tensor_c, m_size, n_size, k_size)
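The mixed-precision pattern from the data-type feature above (narrow storage, wide accumulation) can also be sketched without a GPU, using Python's stdlib `struct` module to emulate FP16 rounding. The helper names `to_fp16` and `mixed_precision_dot` are hypothetical, not CUTLASS APIs.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value
    by packing and unpacking with struct's half-float codec ('e')."""
    return struct.unpack('e', struct.pack('e', x))[0]

def mixed_precision_dot(xs, ys):
    """Dot product with FP16-rounded operands and a wide (FP64) accumulator,
    mirroring the narrow-input / wide-accumulate structure of tensor core math."""
    acc = 0.0                                # wide accumulator
    for x, y in zip(xs, ys):
        acc += to_fp16(x) * to_fp16(y)       # narrow operands, wide accumulate
    return acc

# Small integers are exact in FP16, so this matches full precision:
print(mixed_precision_dot([1.0, 2.0], [3.0, 4.0]))  # → 11.0
# 0.1 is not exactly representable, so FP16 rounding error appears here:
print(mixed_precision_dot([0.1, 0.2], [1.0, 1.0]))
```

Accumulating in a wider type than the inputs is what keeps the rounding error of long reductions bounded; CUTLASS exposes this same choice through separate element and accumulator types in its kernels.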

v4.3.5

Fixes CPU overhead issue from v4.3.4 and updates runtime API usage while refreshing copyright notices.

  • Fixed the unexpected CPU overhead issue introduced in v4.3.4
  • Updated copyright to 2026
  • Switched to the CUDA Runtime API, rather than the Driver API, for querying the driver version
v4.3.4

Adds PDL support to CuTe DSL and fixes several bugs including CUDA graph and memory layout issues.

  • Added PDL support, along with an example kernel launch using Programmatic Dependent Launch
  • Fixed a frame refcount issue with CUDA graphs
  • Enhanced the tvm-ffi AoT case to allow earlier module unload
v4.3.3

Enhances CuTe DSL with improved JIT function argument support and better error handling.

  • Supported namedtuple and kwargs for JIT function arguments in tvm-ffi
  • Supported variadic tuples for JIT function arguments in tvm-ffi
  • Fixed an issue with JIT function arguments that carry union type annotations in tvm-ffi


