Megatron-LM: GPU-optimized transformer training at scale
Library for training large transformer models with distributed computing and GPU-optimized building blocks.
Megatron-LM is a framework developed by NVIDIA for training large-scale transformer models on GPU clusters. It implements multiple parallelism strategies, including tensor, pipeline, data, and expert parallelism, to distribute training across many devices. It also provides GPU-optimized kernels, mixed-precision support (FP16, BF16, FP8), and memory-management optimizations for efficient large-model training. The framework supports a range of model architectures, including GPT, LLaMA, Mixtral, and Mamba.
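To illustrate how these parallelism strategies compose (a plain-Python sketch, not Megatron's actual bookkeeping): the tensor-parallel and pipeline-parallel degrees partition the model, and whatever remains of the GPU count becomes the data-parallel degree.

```python
# Illustrative sketch: how tensor (TP), pipeline (PP), and data (DP)
# parallel degrees compose. The three degrees must multiply out to the
# total number of GPUs, so DP is derived from the other two.
def data_parallel_size(world_size, tensor_parallel, pipeline_parallel):
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by TP * PP")
    return world_size // model_parallel

# Example: 64 GPUs with 8-way tensor parallelism within a node and
# 2-way pipeline parallelism across nodes leaves 4-way data parallelism.
print(data_parallel_size(64, 8, 2))  # -> 4
```

In practice, tensor parallelism is usually kept within a node (where GPU-to-GPU bandwidth is highest), while pipeline and data parallelism span nodes.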
Multi-dimensional Parallelism
Implements tensor, pipeline, data, context, and expert parallelism strategies that can be combined to efficiently distribute training across large GPU clusters.
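A typical launch selects these parallelism degrees via command-line flags. The sketch below assumes a Megatron-LM checkout with its `pretrain_gpt.py` entry point and placeholder dataset and tokenizer files; model sizes and hyperparameters are illustrative only.

```shell
# Sketch: a 2-node, 16-GPU run combining 4-way tensor parallelism and
# 2-way pipeline parallelism; the remaining factor of 2 becomes data
# parallelism. Paths and hyperparameters are placeholders.
torchrun --nproc_per_node 8 --nnodes 2 pretrain_gpt.py \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 2 \
    --num-layers 24 --hidden-size 2048 --num-attention-heads 16 \
    --seq-length 2048 --max-position-embeddings 2048 \
    --micro-batch-size 2 --global-batch-size 256 \
    --train-iters 100000 --lr 1.5e-4 --bf16 \
    --data-path my-dataset_text_document \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file gpt2-vocab.json --merge-file gpt2-merges.txt
```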
Modular Architecture
Megatron Core provides composable building blocks for transformer components, allowing developers to construct custom training frameworks and model architectures.
GPU Kernel Optimization
Includes specialized CUDA kernels and memory management techniques optimized for NVIDIA hardware, with support for advanced precision formats including FP8.
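Low-precision training of this kind typically pairs reduced-precision compute with dynamic loss scaling so small gradients do not underflow. The sketch below is plain Python, not Megatron's implementation, but it shows the standard grow/backoff logic such scalers use.

```python
# Illustrative dynamic loss scaler (plain Python, not Megatron's code).
# The loss is multiplied by `scale` before backprop; if any gradient
# overflows, the step is skipped and the scale backs off, otherwise the
# scale grows after a stretch of stable steps.
class DynamicLossScaler:
    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._stable_steps = 0

    def update(self, found_overflow):
        """Adjust the scale; return True if the optimizer step should run."""
        if found_overflow:
            self.scale *= self.backoff_factor  # shrink scale, skip this step
            self._stable_steps = 0
            return False
        self._stable_steps += 1
        if self._stable_steps >= self.growth_interval:
            self.scale *= self.growth_factor   # grow cautiously after stability
            self._stable_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
print(scaler.update(True), scaler.scale)   # -> False 4.0  (overflow: back off)
print(scaler.update(False), scaler.scale)  # -> True 4.0
print(scaler.update(False), scaler.scale)  # -> True 8.0   (2 stable steps: grow)
```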
from megatron import get_args
from megatron.core.enums import ModelType
from megatron.model import GPTModel
from megatron.training import pretrain
# Note: the imports above follow the legacy megatron.model interface;
# newer releases expose the model classes under megatron.core.models.
# Build the model; pretrain() calls this once per pipeline stage and
# passes the correct pre_process/post_process flags for that stage.
def model_provider(pre_process=True, post_process=True):
    return GPTModel(
        num_tokentypes=0,
        parallel_output=True,
        pre_process=pre_process,
        post_process=post_process
    )
# pretrain() handles Megatron initialization (argument parsing and
# distributed setup), builds the optimizer and learning rate scheduler
# from the command-line arguments, and runs the training loop.
if __name__ == "__main__":
    pretrain(train_valid_test_datasets_provider,  # user-defined dataset builder
             model_provider,
             ModelType.encoder_or_decoder,
             forward_step,  # user-defined forward/loss function
             args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
Related Repositories
Discover similar tools and frameworks used by developers
ByteTrack
Multi-object tracker associating low-confidence detections across frames.
Goose
LLM-powered agent automating local software engineering workflows.
CUTLASS
CUDA C++ templates and Python DSLs for high-performance matrix multiplication on GPUs.
LightRAG
Graph-based retrieval framework for structured RAG reasoning.
TTS
PyTorch toolkit for deep learning text-to-speech synthesis.