prime - decentralized training at scale

prime (previously called ZeroBand) is a framework for efficient, globally distributed training of AI models over the internet.

Demo video: https://github.com/user-attachments/assets/c034d2a2-400c-4bf8-acd0-c84b6c897d69

Key Features

A research paper about the framework and our INTELLECT-1 10B experiment is coming soon.

Getting Started

For a quick install that also downloads the data:

curl -sSL https://raw.githubusercontent.com/PrimeIntellect-ai/prime/main/scripts/install/install.sh | bash

Or, step by step:

  1. Clone the repository:
git clone git@github.com:PrimeIntellect-ai/prime.git
  2. Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
  3. Set up the environment:
sudo apt install iperf -y
uv venv
source .venv/bin/activate
uv sync --extra all
git submodule update --init --recursive
  4. Log into Hugging Face:
huggingface-cli login
  5. Download the data (you can sanity-check the result with the sketch after this list):
mkdir -p datasets
uv run python scripts/subset_data.py --dataset_name PrimeIntellect/fineweb-edu --data_world_size 1 --data_rank 0 --max_shards 32
mv fineweb-edu/ datasets/fineweb-edu/
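
A minimal sketch to sanity-check the download, assuming the shards are parquet files under datasets/fineweb-edu (the file extension is an assumption; adjust it to whatever subset_data.py actually produced):

# check_data.py - count the downloaded shard files
from pathlib import Path

data_dir = Path("datasets/fineweb-edu")
shards = sorted(data_dir.rglob("*.parquet"))  # assumption: parquet shards
print(f"found {len(shards)} shard(s) under {data_dir}")
for shard in shards[:5]:
    print(f"  {shard} ({shard.stat().st_size / 1e6:.1f} MB)")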

Quick Check

Verify your setup:

GLOO_SOCKET_IFNAME=lo GLOBAL_ADDR=localhost GLOBAL_RANK=0 GLOBAL_UNIQUE_ID=0 GLOBAL_WORLD_SIZE=1 GLOBAL_PORT=8989  uv run torchrun --nproc_per_node=2 src/zeroband/train.py  @configs/debug/diloco.toml

Usage

Running DiLoCo

To test DiLoCo locally, you can use the helper script scripts/simulate_multi_node_diloco.sh:

# Using 4 GPUs (2 nodes, 2 GPUs per node)
ZERO_BAND_LOG_LEVEL=DEBUG ./scripts/simulate_multi_node_diloco.sh 2 2 src/zeroband/train.py @configs/debug/diloco.toml

# Using 2 GPUs (2 nodes, 1 GPU per node)
ZERO_BAND_LOG_LEVEL=DEBUG ./scripts/simulate_multi_node_diloco.sh 2 1 src/zeroband/train.py @configs/debug/diloco.toml

Note: Single GPU setups are currently not supported due to an FSDP implementation bug.
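
For intuition about what the script is simulating: in DiLoCo, each worker takes many cheap inner optimizer steps on its local data shard, then the workers average their parameter deltas (the pseudo-gradient) and apply an outer optimizer to the shared weights. Below is a toy single-process sketch of that loop; the inner AdamW / outer Nesterov SGD split follows the DiLoCo paper's recipe, and none of this is prime's actual implementation (which lives in src/zeroband/ and communicates over a real process group).

# diloco_sketch.py - toy illustration of the DiLoCo outer loop (not prime's code)
import copy
import torch

def inner_steps(model, data, h=100, lr=1e-3):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)  # inner optimizer
    for x, y in data[:h]:
        opt.zero_grad()
        torch.nn.functional.mse_loss(model(x), y).backward()
        opt.step()

global_model = torch.nn.Linear(8, 1)
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7,
                            momentum=0.9, nesterov=True)  # outer optimizer
workers = 2
data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(200)]

for outer_step in range(5):
    deltas = [torch.zeros_like(p) for p in global_model.parameters()]
    for w in range(workers):  # in prime these run on separate nodes
        local = copy.deepcopy(global_model)
        inner_steps(local, data[w::workers])
        for d, gp, lp in zip(deltas, global_model.parameters(), local.parameters()):
            d += (gp.data - lp.data) / workers  # averaged pseudo-gradient
    for p, d in zip(global_model.parameters(), deltas):
        p.grad = d  # treat the averaged delta as the outer gradient
    outer_opt.step()
    outer_opt.zero_grad()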

Running Tests

Ensure you have at least two GPUs to run the full test suite:

uv run pytest

Eval

To run evals, you first need to convert the checkpoint to a Hugging Face-compatible model:

uv run python scripts/export_dcp.py @configs/10B/H100.toml --ckpt.path CONVERTED_MODEL_PATH --ckpt.resume CHECKPOINT_PATH --torch_dtype bfloat16  --ckpt.interval 1
uv run accelerate launch -m lm_eval --model hf --model_args pretrained=CONVERTED_MODEL_PATH,add_bos_token=True  --tasks hellaswag --num_fewshot 10

Environment variables

Global Store Initialization

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| GLOBAL_UNIQUE_ID | Unique identifier for the worker in the global store. | None |
| GLOBAL_ADDR | IP address of the global store. | None |
| GLOBAL_PORT | Port number of the global store. | None |
| GLOBAL_WORLD_SIZE | The size of the global process group. | 1 |
| GLOBAL_RANK | Rank of the process in the global process group. | 0 |
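
As a rough illustration of how these variables fit together, here is a sketch that reads them and initializes a torch.distributed.TCPStore, with rank 0 hosting the store. Only the variable names come from the table above; the actual wiring inside prime is an assumption.

# global_store_sketch.py - illustrative only, not prime's internals
import os
from datetime import timedelta
from torch.distributed import TCPStore

rank = int(os.environ.get("GLOBAL_RANK", "0"))
world_size = int(os.environ.get("GLOBAL_WORLD_SIZE", "1"))
store = TCPStore(
    host_name=os.environ["GLOBAL_ADDR"],
    port=int(os.environ["GLOBAL_PORT"]),
    world_size=world_size,
    is_master=(rank == 0),  # rank 0 hosts the store, others connect to it
    timeout=timedelta(seconds=300),
)
store.set(f"unique_id_{rank}", os.environ["GLOBAL_UNIQUE_ID"])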

Elastic Device Mesh Configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| ZERO_BAND_LOG_LEVEL | Enable debug mode for the logger. | False |
| ZERO_BAND_GLOBAL_STORE_TIMEOUT_SECONDS | Number of seconds before global store operations time out. | 300 |
| ZERO_BAND_GLOBAL_PG_TIMEOUT_SECONDS | Number of seconds before global process group operations time out. | 600 |
| ZERO_BAND_GLOBAL_STORE_POLLING_INTERVAL_SECONDS | Number of seconds between polls to the store when waiting for values. | 0.1 |
| ZERO_BAND_EDM_HEARTBEAT_INTERVAL_SECONDS | Interval in seconds between heartbeats. | 2 |
| ZERO_BAND_EDM_HEARTBEAT_TIMEOUT_SECONDS | Time in seconds after which a node is considered dead if no heartbeat is received. | 10 |
| ZERO_BAND_LIVE_RECO_PORT | Port number for the live recovery server. | random |
| ZERO_BAND_LIVE_RECO_ADDR | IP address for the live recovery server. | localhost |
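
To make the heartbeat settings concrete, here is a hypothetical sketch of the kind of liveness check they govern: each node periodically writes a timestamp under its key, and peers consider a node dead once its last heartbeat is older than the timeout. This is not prime's ElasticDeviceMesh code; a plain dict stands in for the global store.

# heartbeat_sketch.py - hypothetical, illustrative only
import os
import time

INTERVAL = float(os.environ.get("ZERO_BAND_EDM_HEARTBEAT_INTERVAL_SECONDS", "2"))
TIMEOUT = float(os.environ.get("ZERO_BAND_EDM_HEARTBEAT_TIMEOUT_SECONDS", "10"))

def send_heartbeat(store, node_id):
    store[f"heartbeat_{node_id}"] = str(time.time())  # refresh this node's timestamp

def dead_nodes(store, node_ids):
    now = time.time()
    return [n for n in node_ids
            if now - float(store.get(f"heartbeat_{n}", "0")) > TIMEOUT]

store = {}  # stand-in for the shared global store
send_heartbeat(store, node_id=0)
time.sleep(INTERVAL)
print(dead_nodes(store, node_ids=[0, 1]))  # node 1 never sent one -> reported dead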

Troubleshooting

If you encounter any dataset loading errors at the beginning of training, try setting:

export HF_HUB_ETAG_TIMEOUT=500

Pre-downloading datasets

Streaming datasets from the Hugging Face Hub can sometimes fail with HTTP 443 errors, which will crash the training process. To avoid this, you can pre-download the dataset.

Here is an example that downloads all the files in PrimeIntellect/fineweb-edu used by data_rank 5 in a training run with a data_world_size of 12.

python3 scripts/subset_data.py --dataset_name PrimeIntellect/fineweb-edu --data_world_size 12 --data_rank 5
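
The data_rank/data_world_size pair simply partitions the dataset's shard files across workers. As an illustration, here is a round-robin split; the actual assignment scheme used by subset_data.py is an assumption here.

# shard_assignment_sketch.py - illustrative round-robin shard split
def shards_for_rank(num_shards: int, data_rank: int, data_world_size: int) -> list[int]:
    # each rank takes every data_world_size-th shard, offset by its rank
    return [i for i in range(num_shards) if i % data_world_size == data_rank]

# e.g. with 24 shards, rank 5 of 12 would get shards 5 and 17
print(shards_for_rank(24, data_rank=5, data_world_size=12))  # [5, 17]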

For details about the script's arguments, run:

python3 scripts/subset_data.py --help

Exporting checkpoints to a Hugging Face-compatible model

You can convert the checkpoints saved by the training script into a model that can be run with any Hugging Face-compatible inference engine (e.g. transformers, vLLM) using our export script. The export script takes the training config as a positional argument and two keyword arguments: ckpt.resume, the path to the checkpoint, and ckpt.path, the path where you wish to save the converted model. You may also set the torch_dtype argument to either float32 or bfloat16 to specify the precision of the exported model weights; the default is float32.

Example export command:

python scripts/export_dcp.py @configs/10B/H100.toml --ckpt.path /path/to/save/converted_model --ckpt.resume /path/to/ckpt/step_84000 --torch_dtype bfloat16

You can then upload the model to the Hugging Face Hub using huggingface-cli:

# Usage:  huggingface-cli upload [repo_id] [local_path] [path_in_repo]
huggingface-cli upload username/mymodel /path/to/save/converted_model . --private

The repo will be created if repo_id does not exist. The --private flag creates the repo as a private repo; it can be omitted to create a publicly accessible one.
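
If you prefer to script the upload, the same thing can be done with the huggingface_hub Python API (username/mymodel and the local path below are placeholders, as above):

# upload_model.py - programmatic equivalent of the huggingface-cli command
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("username/mymodel", private=True, exist_ok=True)  # no-op if it already exists
api.upload_folder(
    repo_id="username/mymodel",
    folder_path="/path/to/save/converted_model",
    path_in_repo=".",
)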