
DeepSpeed: Deep learning optimization library for distributed training

PyTorch library for training billion-parameter models efficiently.

Rankings and stats at time of capture: #124 overall · #54 in AI & ML · 41.2K stars · 4.7K forks · +30 stars and -1 fork over the last 7 days.

Learn more about DeepSpeed

DeepSpeed is a Python library built on PyTorch that optimizes distributed deep learning training and inference through system-level innovations. It implements multiple parallelism approaches including data parallelism, model parallelism, pipeline parallelism, and sequence parallelism, along with memory optimization techniques like ZeRO (Zero Redundancy Optimizer). The library handles communication patterns across GPU clusters and supports offloading to CPU memory and NVMe storage. Common applications include training large language models with billions to trillions of parameters across multi-GPU and multi-node setups.


1. ZeRO Memory Optimization

Partitions model states, gradients, and optimizer states across devices to reduce the per-GPU memory footprint. Trains models up to 8x larger than standard data parallelism on the same hardware without model code changes (see the config sketch after this list).

2. Hybrid Parallelism Strategies

Combines data, tensor, pipeline, and sequence parallelism through configuration rather than custom implementation. Users select and compose strategies based on model architecture and cluster topology (see the pipeline sketch after this list).

3. Multi-Tier Memory Offloading

Automatically manages memory across GPU, CPU, and NVMe storage to train models larger than available GPU memory. ZeRO-Infinity extends this to trillion-parameter-scale training on limited GPU counts through intelligent memory orchestration (the config sketch below includes the CPU offload settings).
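
The configuration-driven approach above boils down to a single JSON config. Below is a minimal sketch, with illustrative values, of a ZeRO stage-3 configuration with CPU offload, written as a Python dict so it can either be saved as ds_config.json or passed directly to deepspeed.initialize through its config argument.

# Minimal sketch (illustrative values): ZeRO stage 3 with CPU offload of
# parameters and optimizer states.
ds_config = {
    "train_batch_size": 32,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                               # partition params, grads, and optimizer states
        "offload_optimizer": {"device": "cpu"},   # keep optimizer states in host RAM
        "offload_param": {"device": "cpu"},       # page parameters to host RAM when idle
    },
}
# Save as ds_config.json, or pass directly: deepspeed.initialize(..., config=ds_config)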
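
For the hybrid parallelism side, here is a hedged sketch that combines pipeline parallelism with data parallelism using deepspeed.pipe.PipelineModule. The layer sizes, stage count, and config file name are illustrative, and the script is assumed to be started with the deepspeed launcher so the distributed environment exists.

import torch
import deepspeed
from deepspeed.pipe import PipelineModule

# Assumes launch via the `deepspeed` launcher so ranks and world size are set.
deepspeed.init_distributed()

# A toy 8-layer stack expressed as a flat list so DeepSpeed can partition it.
layers = [torch.nn.Linear(1024, 1024) for _ in range(8)]

# Two pipeline stages; remaining GPUs in the job are used for data parallelism.
net = PipelineModule(layers=layers, num_stages=2, loss_fn=torch.nn.MSELoss())

engine, _, _, _ = deepspeed.initialize(
    model=net,
    model_parameters=net.parameters(),
    config="ds_config.json"
)

# Pipeline engines pull (input, label) pairs from an iterator:
# loss = engine.train_batch(data_iter)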


import deepspeed
import torch

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters())

# Wrap the model and optimizer in a DeepSpeed engine; ZeRO, precision, and
# batch-size settings come from the JSON config file.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="ds_config.json"
)

# One training step: move data to the engine's device (matching its dtype),
# then run forward/backward/step through the engine so it handles gradient
# reduction, loss scaling, and the optimizer update.
dtype = next(model_engine.parameters()).dtype
inputs = torch.randn(8, 1024, device=model_engine.device, dtype=dtype)
loss = model_engine(inputs).pow(2).mean()
model_engine.backward(loss)
model_engine.step()

# Typically launched across GPUs with the DeepSpeed launcher, e.g. `deepspeed train.py`.


v0.18.2

Patch release fixing ZeRO3 memory duplication, optimizer overflow handling, and tensor slicing bugs; no breaking changes noted.

  • Update to resolve fp32 weight duplication under torch autocast with ZeRO3 and fix misplaced overflow handling in fused optimizer.
  • Apply fixes for comm_dtype in large param reduction and 0-dim tensor slicing in state padding to prevent runtime errors.
v0.18.1

Patch release fixing a critical illegal memory access bug in multi_tensor_apply and adding tensor learning rate support.

  • Update if using multi_tensor_apply with large tensors; fixes illegal memory access when size exceeds INT_MAX.
  • Enable tensor learning rates (vs scalar only) and use new DataStates-LLM async checkpointing for large models.
v0.18.0

Adds SuperOffload for memory-efficient training, improves DeepCompile ZeRO-3 stability, and fixes universal checkpoint loading bugs in multi-machine and stage-3 scenarios.

  • Enable SuperOffload via config to reduce memory footprint during large-model training.
  • Update checkpoint loading logic if using ZeRO stage-3 with multiple subgroups or world-size expansion.


