DeepSpeed: Deep learning optimization library for distributed training
PyTorch library for training billion-parameter models efficiently.
Learn more about DeepSpeed
DeepSpeed is a Python library built on PyTorch that optimizes distributed deep learning training and inference through system-level innovations. It implements multiple parallelism approaches including data parallelism, model parallelism, pipeline parallelism, and sequence parallelism, along with memory optimization techniques like ZeRO (Zero Redundancy Optimizer). The library handles communication patterns across GPU clusters and supports offloading to CPU memory and NVMe storage. Common applications include training large language models with billions to trillions of parameters across multi-GPU and multi-node setups.
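Most of this behavior is selected through a JSON configuration rather than model code. Below is a minimal sketch of such a configuration, written as a Python dict (which deepspeed.initialize also accepts in place of a config file path); the batch sizes, precision, and ZeRO stage shown are illustrative placeholders, not recommended values.

# Minimal DeepSpeed configuration sketch; all values are placeholders.
ds_config = {
    "train_batch_size": 32,                 # = micro_batch * grad_accum * data-parallel world size
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,       # assumes 4 data-parallel GPUs: 8 * 1 * 4 = 32
    "fp16": {"enabled": True},              # mixed-precision training
    "zero_optimization": {"stage": 2}       # partition optimizer states and gradients
}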
ZeRO Memory Optimization
Partitions model parameters, gradients, and optimizer states across devices to reduce the per-GPU memory footprint. Trains models up to 8x larger than standard data parallelism allows on the same hardware, without code changes.
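The degree of partitioning is selected with the zero_optimization.stage field. A short sketch of the stage options follows; the comments summarize what each stage partitions, and the overlap_comm flag is an optional tuning knob.

# ZeRO stage selection sketch; higher stages partition more training state.
zero_config = {
    "zero_optimization": {
        # stage 1: partition optimizer states
        # stage 2: also partition gradients
        # stage 3: also partition model parameters
        "stage": 3,
        "overlap_comm": True    # overlap gradient reduction with the backward pass
    }
}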
Hybrid Parallelism Strategies
Combines data, tensor, pipeline, and sequence parallelism through configuration rather than custom implementation. Users select and compose strategies based on model architecture and cluster topology.
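As one example of composing strategies, pipeline parallelism can be layered on top of ZeRO data parallelism by expressing the model as a sequence of layers and wrapping it in DeepSpeed's PipelineModule. The sketch below assumes two pipeline stages, a placeholder ds_config.json, and a user-supplied iterator of (input, label) batches named train_iter; it must be launched with the deepspeed launcher across enough processes to fill the stages.

import torch
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()

# Express the model as a flat list of layers so DeepSpeed can split it
# across pipeline stages (two stages assumed here).
layers = [torch.nn.Linear(1024, 1024) for _ in range(8)]
pipe_model = PipelineModule(layers=layers, num_stages=2,
                            loss_fn=torch.nn.MSELoss())

engine, _, _, _ = deepspeed.initialize(
    model=pipe_model,
    model_parameters=pipe_model.parameters(),
    config="ds_config.json"
)

# Each call pulls micro-batches from the iterator and runs the pipeline schedule.
# loss = engine.train_batch(data_iter=train_iter)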
Multi-Tier Memory Offloading
Automatically manages memory across GPU, CPU, and NVMe storage to train models larger than available VRAM. ZeRO-Infinity extends this to trillion-parameter model training on a single node through intelligent memory orchestration.
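A sketch of a ZeRO-3 configuration with optimizer state offloaded to CPU and parameters offloaded to NVMe, in the ZeRO-Infinity style; the NVMe path is a placeholder and the async I/O and buffer tuning options are omitted.

# ZeRO-Infinity style offloading sketch; /local_nvme is a placeholder path.
offload_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"}
    },
    "bf16": {"enabled": True}
}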
import deepspeed
import torch

# Standard PyTorch model and optimizer
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters())

# Wrap both in a DeepSpeed engine configured by ds_config.json
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="ds_config.json"
)
# Train with model_engine.forward() and model_engine.backward()
Patch release fixing ZeRO-3 memory duplication, optimizer overflow handling, and tensor slicing bugs; no breaking changes noted.
- Update to resolve fp32 weight duplication under torch autocast with ZeRO-3 and fix misplaced overflow handling in the fused optimizer.
- Apply fixes for comm_dtype in large param reduction and 0-dim tensor slicing in state padding to prevent runtime errors.
Patch release fixing a critical illegal memory access bug in multi_tensor_apply and adding tensor learning rate support.
- Update if using multi_tensor_apply with large tensors; fixes illegal memory access when size exceeds INT_MAX.
- Enable tensor learning rates (vs scalar only) and use new DataStates-LLM async checkpointing for large models.
Adds SuperOffload for memory-efficient training, improves DeepCompile ZeRO-3 stability, and fixes universal checkpoint loading bugs in multi-machine and stage-3 scenarios.
- Enable SuperOffload via config to reduce memory footprint during large-model training.
- Update checkpoint loading logic if using ZeRO stage-3 with multiple subgroups or world-size expansion.
Related Repositories
Discover similar tools and frameworks used by developers
goose
LLM-powered agent automating local software engineering workflows.
YOLOX
PyTorch anchor-free object detector with scalable model variants.
AutoGPT
Block-based visual editor for autonomous AI agents.
adk-python
Modular Python framework for building production AI agents.
opencv
Cross-platform C++ library for real-time computer vision algorithms.