Triton: Language for GPU Computing Primitives
Domain-specific language and compiler for writing GPU deep learning primitives with higher productivity than CUDA.
Triton is a programming language and compiler infrastructure designed for writing high-performance GPU kernels for deep learning applications. The compiler translates Triton code into optimized GPU assembly through an MLIR-based compilation pipeline that includes automatic memory coalescing, shared memory management, and instruction scheduling. The language uses a Python-like syntax with explicit control over memory hierarchy and parallelization patterns, allowing developers to write GPU kernels without managing low-level CUDA details. Triton is commonly used for implementing custom neural network operators, matrix computations, and other compute-intensive primitives in machine learning frameworks.
MLIR-Based Compilation
Uses Multi-Level Intermediate Representation (MLIR) infrastructure for code generation and optimization. The compiler automatically handles memory coalescing, shared memory usage, and instruction scheduling.
Python-Like Syntax
Provides a high-level programming interface similar to Python while generating efficient GPU code. Developers can write kernels without managing CUDA's low-level memory and threading details.
Automatic Optimization
Performs automatic tiling, vectorization, and memory hierarchy optimization during compilation. The compiler analyzes memory access patterns and generates optimized GPU assembly code.
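The block-and-mask execution model that these optimizations target can be sketched in plain NumPy, with no GPU required. This is a hypothetical simulation (the helper name `add_kernel_sim` is ours, not Triton's): each "program instance" processes one contiguous block of elements, and a mask guards the ragged final block, mirroring the semantics of `tl.program_id`, `tl.arange`, and masked `tl.load`/`tl.store`.

```python
import numpy as np

BLOCK_SIZE = 4  # illustrative; real kernels typically use e.g. 1024

def add_kernel_sim(x, y, out, n_elements, pid):
    # Mirrors one Triton program instance operating on one block.
    block_start = pid * BLOCK_SIZE
    offsets = block_start + np.arange(BLOCK_SIZE)
    mask = offsets < n_elements       # guard the ragged final block
    idx = offsets[mask]
    out[idx] = x[idx] + y[idx]        # masked load, add, masked store

x = np.arange(10, dtype=np.float32)
y = np.ones(10, dtype=np.float32)
out = np.empty_like(x)
grid = -(-x.size // BLOCK_SIZE)       # ceiling division, like triton.cdiv
for pid in range(grid):               # a GPU runs these instances in parallel
    add_kernel_sim(x, y, out, x.size, pid)
```

On real hardware, Triton's compiler chooses how each block maps onto threads, registers, and shared memory; the loop over `pid` here stands in for the parallel launch grid.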
import torch

import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Mask out-of-bounds offsets so the final, partial block is safe.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)


def add_vectors(x, y):
    output = torch.empty_like(x)
    n_elements = output.numel()
    # Launch one program instance per block of BLOCK_SIZE elements.
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output
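The grid lambda maps the problem size to a 1-D launch grid via ceiling division. The arithmetic can be checked in plain Python with a standalone reimplementation (written here so Triton itself need not be imported):

```python
def cdiv(x: int, y: int) -> int:
    # Ceiling division, equivalent to triton.cdiv(x, y).
    return (x + y - 1) // y

# A 1,000,000-element tensor with BLOCK_SIZE=1024 needs 977 program instances,
# and the last instance covers only a partial block, which the mask handles.
n_elements, BLOCK_SIZE = 1_000_000, 1024
grid_size = cdiv(n_elements, BLOCK_SIZE)
print(grid_size)  # 977
```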
Related Repositories
Discover similar tools and frameworks used by developers
Mask2Former
Unified transformer architecture for multi-task image segmentation.
InvokeAI
Node-based workflow interface for local Stable Diffusion deployment.
Ollama
Go-based CLI for local LLM inference and management.
StabilityMatrix
Multi-backend inference UI manager with embedded dependencies.
LeRobot
PyTorch library for robot imitation learning and sim-to-real transfer.