DeepSpeed: Deep learning optimization library for distributed training
PyTorch library for training billion-parameter models efficiently.
DeepSpeed is a Python library built on PyTorch that optimizes distributed deep learning training and inference through system-level innovations. It implements multiple parallelism approaches including data parallelism, model parallelism, pipeline parallelism, and sequence parallelism, along with memory optimization techniques like ZeRO (Zero Redundancy Optimizer). The library handles communication patterns across GPU clusters and supports offloading to CPU memory and NVMe storage. Common applications include training large language models with billions to trillions of parameters across multi-GPU and multi-node setups.
ZeRO Memory Optimization
Partitions model states, gradients, and optimizer states across devices to reduce per-GPU memory footprint. Trains models 8x larger than standard data parallelism on the same hardware without code changes.
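The memory accounting behind that claim can be sketched with a few lines of arithmetic. The figures below follow the commonly cited mixed-precision Adam breakdown (2 bytes fp16 params + 2 bytes fp16 grads + 12 bytes fp32 optimizer states = 16 bytes per parameter); the function is an illustration of the partitioning idea, not DeepSpeed code:

```python
def model_state_bytes(params: int, num_gpus: int, zero_stage: int = 0) -> float:
    """Approximate per-GPU memory for model states with mixed-precision Adam.

    Baseline: 2 bytes (fp16 params) + 2 bytes (fp16 grads) + 12 bytes
    (fp32 master params, momentum, variance) = 16 bytes per parameter,
    replicated on every GPU under plain data parallelism.
    """
    param_b, grad_b, opt_b = 2 * params, 2 * params, 12 * params
    if zero_stage >= 1:   # stage 1 partitions optimizer states
        opt_b /= num_gpus
    if zero_stage >= 2:   # stage 2 also partitions gradients
        grad_b /= num_gpus
    if zero_stage >= 3:   # stage 3 also partitions the parameters themselves
        param_b /= num_gpus
    return param_b + grad_b + opt_b

# A 1B-parameter model on 8 GPUs:
baseline = model_state_bytes(1_000_000_000, 8, zero_stage=0)  # 16 GB per GPU
stage3 = model_state_bytes(1_000_000_000, 8, zero_stage=3)    # 2 GB per GPU
```

At stage 3 every model-state tensor is sharded, so per-GPU memory drops by the full degree of data parallelism, which is where the roughly 8x headroom on 8 GPUs comes from.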
Hybrid Parallelism Strategies
Combines data, tensor, pipeline, and sequence parallelism through configuration rather than custom implementation. Users select and compose strategies based on model architecture and cluster topology.
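How these degrees compose can be illustrated with plain rank arithmetic: the world size factors into data x pipeline x tensor parallel degrees, and each global rank maps to one coordinate in that grid. This is a generic sketch of the composition idea (assuming a tensor-fastest rank layout for illustration), not DeepSpeed's actual topology API, which is configurable:

```python
def parallel_coords(rank: int, dp: int, pp: int, tp: int) -> tuple:
    """Map a global rank to (data, pipeline, tensor) parallel coordinates,
    assuming ranks are laid out tensor-parallel-fastest."""
    assert rank < dp * pp * tp, "rank out of range for this topology"
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank

# 16 GPUs split as 2-way data x 4-way pipeline x 2-way tensor parallelism:
coords = [parallel_coords(r, dp=2, pp=4, tp=2) for r in range(16)]
```

Choosing the split is a topology decision: tensor parallelism wants the fastest links (intra-node NVLink), pipeline stages tolerate slower inter-node links, and the remaining factor becomes the data-parallel degree.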
Multi-Tier Memory Offloading
Automatically manages memory across GPU, CPU, and NVMe storage to train models larger than available VRAM. ZeRO-Infinity enables trillion-parameter model training on consumer hardware through intelligent memory orchestration.
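Offloading is selected in the ZeRO section of the DeepSpeed JSON config. A hedged sketch (the `nvme_path` value is a placeholder; tune buffer and pinning options for real workloads):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    }
  }
}
```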
import deepspeed
import torch
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters())
model_engine, optimizer, _, _ = deepspeed.initialize(
model=model,
optimizer=optimizer,
config="ds_config.json"
)
# Training loop: loss = model_engine(inputs); model_engine.backward(loss); model_engine.step()

Patch release with SuperOffloadOptimizerStage3 crash fix and AMD ROCm improvements.
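The `deepspeed.initialize` call above reads its settings from `ds_config.json`. A minimal sketch of such a file (batch size and ZeRO stage chosen arbitrarily for illustration):

```json
{
  "train_batch_size": 32,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2
  }
}
```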
- Fix SuperOffloadOptimizerStage3 crash due to missing paramnames parameter
- Improve support of AMD ROCm
- Disable deterministic option in compile tests
Patch release adding separate learning rates for muon optimizer and leaf module improvements.
- Allow separate learning rates "muonlr" and "adamlr" for the Muon optimizer
- Leaf modules: improve explanation
- Disable nv-lightning-v100.yml CI
Patch release with ZeRO3 fp32 weight deduplication and Ulysses API improvements.
- Deduplicate fp32 weights under torch autocast and ZeRO3
- Ulysses MPU: additional API
- ALST/UlyssesSP: more intuitive API for variable sequence lengths
- Fix misplaced overflow-handling return in fused_optimizer.py
Related Repositories
Discover similar tools and frameworks used by developers
Open WebUI
Extensible multi-LLM chat platform with RAG pipeline.
Unsloth
Memory-efficient Python library for accelerated LLM training.
tiktoken
Fast BPE tokenizer for OpenAI language models.
Kimi-K2
Trillion-parameter MoE model with Muon-optimized training.
Paperless-ngx
Self-hosted OCR document archive with ML classification.