Triton: Language for GPU Computing Primitives
Domain-specific language and compiler for writing GPU deep learning primitives with higher productivity than CUDA.
Triton is a programming language and compiler infrastructure designed for writing high-performance GPU kernels for deep learning applications. The compiler translates Triton code into optimized GPU assembly through an MLIR-based compilation pipeline that includes automatic memory coalescing, shared memory management, and instruction scheduling. The language uses a Python-like syntax with explicit control over memory hierarchy and parallelization patterns, allowing developers to write GPU kernels without managing low-level CUDA details. Triton is commonly used for implementing custom neural network operators, matrix computations, and other compute-intensive primitives in machine learning frameworks.
MLIR-Based Compilation
Uses Multi-Level Intermediate Representation (MLIR) infrastructure for code generation and optimization. The compiler automatically handles memory coalescing, shared memory usage, and instruction scheduling.
Python-Like Syntax
Provides a high-level programming interface similar to Python while generating efficient GPU code. Developers can write kernels without managing CUDA's low-level memory and threading details.
Automatic Optimization
Performs automatic tiling, vectorization, and memory hierarchy optimization during compilation. The compiler analyzes memory access patterns and generates optimized GPU assembly code.
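The block-and-mask execution model that these optimizations target can be sketched in plain NumPy, with no GPU required. This is a hypothetical simulation (the helper name `add_kernel_sim` is ours, not Triton's): each "program instance" processes one contiguous block of elements, and a mask guards the ragged final block, mirroring the semantics of `tl.program_id`, `tl.arange`, and masked `tl.load`/`tl.store`.

```python
import numpy as np

BLOCK_SIZE = 4  # illustrative; real kernels typically use e.g. 1024

def add_kernel_sim(x, y, out, n_elements, pid):
    # Mirrors one Triton program instance operating on one block.
    block_start = pid * BLOCK_SIZE
    offsets = block_start + np.arange(BLOCK_SIZE)
    mask = offsets < n_elements       # guard the ragged final block
    idx = offsets[mask]
    out[idx] = x[idx] + y[idx]        # masked load, add, masked store

x = np.arange(10, dtype=np.float32)
y = np.ones(10, dtype=np.float32)
out = np.empty_like(x)
grid = -(-x.size // BLOCK_SIZE)       # ceiling division, like triton.cdiv
for pid in range(grid):               # a GPU runs these instances in parallel
    add_kernel_sim(x, y, out, x.size, pid)
```

On real hardware, Triton's compiler chooses how each block maps onto threads, registers, and shared memory; the loop over `pid` here stands in for the parallel launch grid.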
import torch

import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Mask out-of-bounds offsets so the final, partial block is safe.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)


def add_vectors(x, y):
    output = torch.empty_like(x)
    n_elements = output.numel()
    # Launch one program instance per block of BLOCK_SIZE elements.
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output
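The grid lambda maps the problem size to a 1-D launch grid via ceiling division. The arithmetic can be checked in plain Python with a standalone reimplementation (written here so Triton itself need not be imported):

```python
def cdiv(x: int, y: int) -> int:
    # Ceiling division, equivalent to triton.cdiv(x, y).
    return (x + y - 1) // y

# A 1,000,000-element tensor with BLOCK_SIZE=1024 needs 977 program instances,
# and the last instance covers only a partial block, which the mask handles.
n_elements, BLOCK_SIZE = 1_000_000, 1024
grid_size = cdiv(n_elements, BLOCK_SIZE)
print(grid_size)  # 977
```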
Related Repositories
Discover similar tools and frameworks used by developers
Mask2Former
Unified transformer architecture for multi-task image segmentation.
InvokeAI
Node-based workflow interface for local Stable Diffusion deployment.
Ollama
Go-based CLI for local LLM inference and management.
StabilityMatrix
Multi-backend inference UI manager with embedded dependencies.
LeRobot
PyTorch library for robot imitation learning and sim-to-real transfer.