
DeepSpeed: Deep learning optimization library for distributed training

PyTorch library for training billion-parameter models efficiently.

LIVE RANKINGS • 10:20 AM • STEADY

Overall rank: #292 · AI & ML rank: #90
Stars: 41.7K · Forks: 4.7K · 7-day stars: +50 · 7-day forks: +7

Learn more about DeepSpeed

DeepSpeed is a Python library built on PyTorch that optimizes distributed deep learning training and inference through system-level innovations. It implements multiple parallelism approaches including data parallelism, model parallelism, pipeline parallelism, and sequence parallelism, along with memory optimization techniques like ZeRO (Zero Redundancy Optimizer). The library handles communication patterns across GPU clusters and supports offloading to CPU memory and NVMe storage. Common applications include training large language models with billions to trillions of parameters across multi-GPU and multi-node setups.

1. ZeRO Memory Optimization

Partitions model states, gradients, and optimizer states across devices to reduce per-GPU memory footprint. Trains models 8x larger than standard data parallelism on the same hardware without code changes.
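ZeRO is switched on through DeepSpeed's configuration rather than code changes. A minimal sketch of such a config as a Python dict (the batch size and fp16 settings are illustrative values, not recommendations):

```python
# Illustrative DeepSpeed config enabling ZeRO: stage 1 partitions optimizer
# states, stage 2 additionally partitions gradients, and stage 3 also
# partitions the model parameters themselves.
ds_config = {
    "train_batch_size": 32,        # illustrative value
    "fp16": {"enabled": True},     # mixed precision further reduces memory
    "zero_optimization": {
        "stage": 2,                # partition optimizer states + gradients
    },
}
```

A dict like this can be passed to `deepspeed.initialize(..., config=ds_config)` in place of a path to a JSON file.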

2. Hybrid Parallelism Strategies

Combines data, tensor, pipeline, and sequence parallelism through configuration rather than custom implementation. Users select and compose strategies based on model architecture and cluster topology.
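As a back-of-envelope illustration (plain arithmetic, not DeepSpeed API): composing strategies amounts to factorizing the total GPU count across the parallelism dimensions, since the degrees must multiply to the world size. The cluster size and degrees below are hypothetical:

```python
# Hypothetical cluster: 64 GPUs split across three parallelism dimensions.
world_size = 64
tensor_parallel = 4    # split individual layers across 4 GPUs
pipeline_parallel = 4  # split the layer stack into 4 sequential stages
data_parallel = world_size // (tensor_parallel * pipeline_parallel)

# The product of the degrees must equal the total number of GPUs.
assert tensor_parallel * pipeline_parallel * data_parallel == world_size
print(data_parallel)  # 4 replicas of the (tensor x pipeline) sharded model
```

Choosing this factorization is driven by model architecture (e.g., layer width vs. depth) and cluster topology (e.g., intra-node vs. inter-node bandwidth).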

3. Multi-Tier Memory Offloading

Automatically manages memory across GPU, CPU, and NVMe storage to train models larger than available VRAM. ZeRO-Infinity enables trillion-parameter model training on consumer hardware through intelligent memory orchestration.
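Offloading is likewise driven by configuration. A sketch using the ZeRO stage 3 offload keys, where `"/local_nvme"` is a placeholder path and the exact values are illustrative:

```python
# Illustrative ZeRO-3 config: optimizer states spill to CPU memory,
# parameters spill to NVMe storage. "/local_nvme" is a placeholder path.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}
```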


import deepspeed
import torch

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters())

# Wrap the model and optimizer; DeepSpeed reads its settings from ds_config.json
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="ds_config.json"
)

# One training step: forward, backward, and weight update
inputs = torch.randn(8, 1024).to(model_engine.device)
targets = torch.randn(8, 1024).to(model_engine.device)
loss = torch.nn.functional.mse_loss(model_engine(inputs), targets)
model_engine.backward(loss)  # DeepSpeed manages loss scaling and gradient accumulation
model_engine.step()          # DeepSpeed manages the optimizer and LR scheduler step

v0.18.4

Patch release with SuperOffloadOptimizerStage3 crash fix and AMD ROCm improvements.

  • Fix SuperOffloadOptimizerStage3 crash due to missing paramnames parameter
  • Improve support of AMD ROCm
  • Disable deterministic option in compile tests
v0.18.3

Patch release adding separate learning rates for muon optimizer and leaf module improvements.

  • Allow separate learning rates "muonlr" and "adamlr" for the Muon optimizer
  • Leaf modules: improve explanation
  • Disable nv-lightning-v100.yml CI
v0.18.2

Patch release with ZeRO3 fp32 weight deduplication and Ulysses API improvements.

  • Deduplicate fp32 weights under torch autocast and ZeRO3
  • Ulysses mpu: additional API
  • ALST/UlyssesSP: more intuitive API for variable sequence lengths
  • Fix misplaced overflow handling return in fused_optimizer.py
