ffcv ImageNet Training

A minimal, single-file PyTorch ImageNet training script designed for hackability. Run train_imagenet.py to get...

Results

Train models more efficiently, either with 8 GPUs in parallel or by training 8 ResNet-18s at once.
<img src="assets/perf_scatterplot.svg" width='830px'/>

See benchmark setup here: https://docs.ffcv.io/benchmarks.html.

Citation

If you use this setup in your research, cite:

@misc{leclerc2022ffcv,
    author = {Guillaume Leclerc and Andrew Ilyas and Logan Engstrom and Sung Min Park and Hadi Salman and Aleksander Madry},
    title = {ffcv},
    year = {2022},
    howpublished = {\url{https://github.com/libffcv/ffcv/}},
    note = {commit xxxxxxx}
}

(Make sure to replace xxxxxxx above with the hash of the commit used!)

Configurations

The configuration files corresponding to the above results are:

| Link to Config | top_1 | top_5 | # Epochs | Time (mins) | Architecture | Setup |
|---|---|---|---|---|---|---|
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn50_configs/rn50_88_epochs.yaml'>Link</a> | 0.784 | 0.941 | 88 | 77.2 | ResNet-50 | 8 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn50_configs/rn50_56_epochs.yaml'>Link</a> | 0.780 | 0.937 | 56 | 49.4 | ResNet-50 | 8 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn50_configs/rn50_40_epochs.yaml'>Link</a> | 0.772 | 0.932 | 40 | 35.6 | ResNet-50 | 8 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn50_configs/rn50_32_epochs.yaml'>Link</a> | 0.766 | 0.927 | 32 | 28.7 | ResNet-50 | 8 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn50_configs/rn50_24_epochs.yaml'>Link</a> | 0.756 | 0.921 | 24 | 21.7 | ResNet-50 | 8 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn50_configs/rn50_16_epochs.yaml'>Link</a> | 0.738 | 0.908 | 16 | 14.9 | ResNet-50 | 8 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn18_configs/rn18_88_epochs.yaml'>Link</a> | 0.724 | 0.903 | 88 | 187.3 | ResNet-18 | 1 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn18_configs/rn18_56_epochs.yaml'>Link</a> | 0.713 | 0.899 | 56 | 119.4 | ResNet-18 | 1 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn18_configs/rn18_40_epochs.yaml'>Link</a> | 0.706 | 0.894 | 40 | 85.5 | ResNet-18 | 1 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn18_configs/rn18_32_epochs.yaml'>Link</a> | 0.700 | 0.889 | 32 | 68.9 | ResNet-18 | 1 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn18_configs/rn18_24_epochs.yaml'>Link</a> | 0.688 | 0.881 | 24 | 51.6 | ResNet-18 | 1 x A100 |
| <a href='https://github.com/libffcv/ffcv-imagenet/tree/main/rn18_configs/rn18_16_epochs.yaml'>Link</a> | 0.669 | 0.868 | 16 | 35.0 | ResNet-18 | 1 x A100 |
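
Because the ResNet-50 rows use 8 GPUs and the ResNet-18 rows use 1, it can help to convert wall-clock time to GPU-minutes (time multiplied by the number of GPUs) before comparing them. The sketch below does this for the two 88-epoch rows. The helper name `gpu_minutes` is made up for illustration, and the figures are copied from the table.

```python
# Rows copied from the configuration table: (name, epochs, minutes, num_gpus).
ROWS = [
    ("rn50_88_epochs", 88, 77.2, 8),
    ("rn18_88_epochs", 88, 187.3, 1),
]


def gpu_minutes(minutes, num_gpus):
    """Wall-clock minutes multiplied by the number of GPUs used."""
    return minutes * num_gpus


for name, epochs, minutes, num_gpus in ROWS:
    total = gpu_minutes(minutes, num_gpus)
    # Report both the total cost and the cost per epoch.
    print(f"{name}: {total:.1f} GPU-min total, {total / epochs:.2f} GPU-min per epoch")
```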

Training Models

First, install the requirements listed in this directory:

pip install -r requirements.txt

Then, generate an ImageNet dataset. To make the dataset used for the results above, run the following commands (`IMAGENET_DIR` should point to a PyTorch-style ImageNet dataset):

# Required environment variables for the script:
export IMAGENET_DIR=/path/to/pytorch/format/imagenet/directory/
export WRITE_DIR=/your/path/here/

# Starting in the root of the Git repo:
cd examples;

# Serialize images with:
# - 500px side length maximum
# - 50% of images JPEG encoded
# - quality=90 JPEGs
./write_imagenet.sh 500 0.50 90
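
As an illustration of what a 500px maximum side length means, the sketch below computes the stored size of an image. It assumes that images whose longer side exceeds the limit are scaled down with their aspect ratio preserved, and that smaller images are left unchanged. `capped_size` is a hypothetical helper, not part of the writer script, and the exact rounding the writer uses may differ.

```python
def capped_size(width, height, max_side=500):
    """Return (width, height) after capping the longer side at max_side.

    Assumption: aspect ratio is preserved and images already within the
    limit are not upscaled.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to whole pixels and never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(capped_size(1000, 750))
print(capped_size(320, 240))
```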

Then, choose a configuration from the configuration table. With the config file path in hand, train as follows:

# 8 GPU training (use only 1 for ResNet-18 training)
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Set the visible GPUs according to the `world_size` configuration parameter
# Modify `data.in_memory` and `data.num_workers` based on your machine
python train_imagenet.py --config-file rn50_configs/<your config file>.yaml \
    --data.train_dataset=/path/to/train/dataset.ffcv \
    --data.val_dataset=/path/to/val/dataset.ffcv \
    --data.num_workers=12 --data.in_memory=1 \
    --logging.folder=/your/path/here

Adjust the configuration either by editing the YAML file you pass in or by specifying arguments on the command line via fastargs (as was done for the dataset paths above).
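
The idea of layering command-line arguments of the form `--section.param=value` over values from a YAML file can be pictured with the following sketch. It does not use fastargs itself; `apply_overrides` and the sample values are hypothetical and only illustrate how a dotted key maps onto a nested configuration.

```python
import copy


def apply_overrides(config, args):
    """Return a copy of a nested config dict with `--a.b=value` overrides applied."""
    result = copy.deepcopy(config)
    for arg in args:
        # Split "--section.param=value" into its key and value parts.
        key, _, value = arg.lstrip("-").partition("=")
        section, _, param = key.partition(".")
        result.setdefault(section, {})[param] = value
    return result


# Stand-in for values loaded from a YAML file (illustrative only).
yaml_config = {"data": {"num_workers": "8", "in_memory": "1"}}
merged = apply_overrides(
    yaml_config,
    ["--data.num_workers=12", "--logging.folder=/your/path/here"],
)
print(merged)
```

Values are kept as strings here for simplicity; a real argument parser would also convert them to the declared types.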

Training Details

<p><b>System setup.</b> We trained on p4.24xlarge EC2 instances (8 A100s).</p>
<p><b>Dataset setup.</b> Generally, a larger side length improves accuracy but decreases throughput.</p>
<p><b>Algorithmic details.</b> We use a standard ImageNet training pipeline (à la the PyTorch ImageNet example) with only a few differences/highlights.</p>

Refer to the code and configuration files for a more exact specification. To obtain the configurations, we first ran a grid search over hyperparameters at a 30-epoch schedule. Fixing these parameters, we then varied only the number of epochs (stretching the learning rate schedule across the number of epochs, as motivated by Budgeted Training) and plotted the results above.
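
One way to picture stretching a schedule across the number of epochs is to define the learning rate as a function of the fraction of training completed, so that the same shape applies to a 16-epoch or an 88-epoch run. The sketch below uses a linear ramp-up followed by a linear decay to zero as an assumed shape. `stretched_lr`, `peak_lr`, and `peak_frac` are hypothetical names and values, so consult the configuration files for the schedule actually used.

```python
def stretched_lr(epoch, total_epochs, peak_lr=1.0, peak_frac=0.1):
    """Learning rate at `epoch` for a schedule defined on training progress."""
    progress = epoch / total_epochs  # fraction of training completed, in [0, 1]
    if progress < peak_frac:
        # Linear ramp-up from 0 to peak_lr.
        return peak_lr * progress / peak_frac
    # Linear decay from peak_lr to 0 at the end of training.
    return peak_lr * (1.0 - progress) / (1.0 - peak_frac)


# The same point in training gets the same learning rate for any run length.
for total in (16, 88):
    halfway = stretched_lr(total / 2, total)
    print(f"{total} epochs: lr at the halfway point = {halfway:.3f}")
```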

FAQ

Why is the first epoch slow?

The first epoch can be slow if the dataset hasn't been cached in memory yet.

What if I can't fit my dataset in memory?

See this guide here.

Other questions

Please open a GitHub discussion for questions that are not bug reports; if you find a bug, please report it on GitHub issues.