
Higgsfield: GPU orchestration for large-scale model training

Cluster manager for multi-node PyTorch model training.

OVERALL: #245 • AI & ML: #84
STARS: 3.5K • FORKS: 585
7D STARS: +1 • 7D FORKS: +1

Learn more about higgsfield

Higgsfield is a GPU orchestration framework designed to manage and coordinate distributed training of large-scale PyTorch models across multiple nodes. It provides a cluster management layer that handles resource allocation, task scheduling, and inter-node communication for multi-GPU training workloads. The framework abstracts the complexity of distributed computing by automatically configuring process groups, managing data parallelism strategies, and monitoring training jobs across the cluster. It integrates with existing PyTorch training pipelines through a Python API that wraps model and data configurations with distributed execution logic. The system is optimized for high-performance computing environments where training large neural networks requires coordinated use of GPUs spanning multiple physical machines.
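To make "automatically configuring process groups" concrete, the per-process boilerplate that an orchestration layer like this typically hides is sketched below in plain PyTorch. This is generic torch.distributed setup, not higgsfield's internal code.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_worker(model: torch.nn.Module) -> DDP:
    # A launcher (e.g. torchrun or a cluster manager) sets RANK, WORLD_SIZE, LOCAL_RANK.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")      # join the job-wide process group
    torch.cuda.set_device(local_rank)            # one process per GPU
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])   # synchronizes gradients across ranks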

higgsfield

1. GitHub-integrated workflows
Experiments are defined as Python code and automatically deployed through GitHub Actions, eliminating separate deployment pipelines. Checkpoints and experiment monitoring occur through GitHub's interface rather than custom dashboards.

2. ZeRO-3 and FSDP support
Native compatibility with DeepSpeed's ZeRO-3 and PyTorch's Fully Sharded Data Parallel (FSDP) APIs enables efficient parameter sharding for trillion-parameter models without requiring custom implementation (see the FSDP sketch after this list).

3. Simplified experiment definition
Training experiments are written as standard Python functions with a decorator, avoiding configuration files or argument parsing. Users can incorporate custom PyTorch code, DeepSpeed, or Accelerate without framework constraints (a sketch of the decorator style also follows below).
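For the ZeRO-3/FSDP point above, PyTorch's own Fully Sharded Data Parallel wrapper can be applied directly as sketched here. This is a minimal illustration of the underlying PyTorch API rather than higgsfield-specific code, and it assumes the process group has already been initialized.

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group has already run (one process per GPU).
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
).cuda()
sharded = FSDP(model)  # parameters, gradients, and optimizer state are sharded across ranks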
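The decorator style in point 3 would look roughly like the sketch below. The import path and the experiment decorator name are illustrative assumptions only, not verified higgsfield identifiers; see the repository for the actual API.

# Hypothetical sketch of a decorator-defined experiment; names are assumptions.
from higgsfield import experiment  # assumed import path
import torch.nn as nn

@experiment("linear-baseline")  # assumed decorator registering this function as a training job
def train(params):
    model = nn.Linear(1024, 512)
    # Plain PyTorch, DeepSpeed, or Accelerate code goes here, unconstrained by the framework.
    ...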


from higgsfield import DistributedTrainer
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Example from the listing: wrapping a model with higgsfield's DistributedTrainer.
# A small synthetic dataset is added here so the snippet is self-contained.
model = nn.Linear(1024, 512)
dataset = TensorDataset(torch.randn(256, 1024), torch.randn(256, 512))
dataloader = DataLoader(dataset, batch_size=32)

trainer = DistributedTrainer(model=model, world_size=4, backend='nccl')

for batch in dataloader:
    loss = trainer.training_step(batch)  # forward pass and loss on this rank
    trainer.backward(loss)               # backward pass with cross-rank gradient sync
    trainer.step()                       # optimizer update

v0.0.4-rc

Release candidate with invoker version control and asyncssh security patch; no breaking changes documented.

  • Pin asyncssh to 2.14.1 or later to pick up the security update from 2.14.0.
  • Specify invoker version explicitly if your workflow requires a particular build; action-builder field generation is now fixed.
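If you also want the asyncssh requirement enforced at runtime rather than only in your dependency file, a small guard like the following works; the third-party packaging package is assumed to be available.

# Runtime check that the installed asyncssh includes the 2.14.1 security update.
from importlib.metadata import version
from packaging.version import Version  # assumes `packaging` is installed

installed = version("asyncssh")
if Version(installed) < Version("2.14.1"):
    raise RuntimeError(f"asyncssh {installed} found; >= 2.14.1 is required for the security update")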

