Home

Aims of this library:

Features:

Installation

The Deep Learning and HPC starter pack is available on PyPI:

pip install dlhpcstarter

TL;DR

To train and test a model:

dlhpcstarter -t cifar10 -c tests/task/cifar10/config/tl_dr.yaml --trial 0 --stages_module tests.task.cifar10.stages --train --test

The configuration YAML (tests/task/cifar10/config/tl_dr.yaml):

module: tests.task.cifar10.model.baseline
definition: Baseline
exp_dir: /datasets/work/hb-mlaifsp-mm/work/experiments
monitor: val_acc
monitor_mode: max

# Inputs to Lightning module:
dataset_dir: /datasets/work/hb-mlaifsp-mm/work/datasets
lr: 1e-3
max_epochs: 1
mbatch_size: 32

The configuration must include module, definition, exp_dir, monitor, and monitor_mode; task and config are given as command line arguments (see the mandatory arguments below).

The remaining arguments depend on your Lightning module's __init__ keyword arguments.
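
As a rough illustration (a sketch only, mirroring the Baseline model shown later rather than reproducing it), the keys listed under '# Inputs to Lightning module:' above arrive as keyword arguments of the model's __init__, with any unused keys absorbed by **kwargs:

from lightning.pytorch import LightningModule


class Baseline(LightningModule):
    # dataset_dir, lr, max_epochs, and mbatch_size from the configuration
    # arrive here as keyword arguments; keys not named explicitly fall into **kwargs.
    def __init__(self, dataset_dir, lr, mbatch_size, **kwargs):
        super().__init__()
        self.save_hyperparameters()
        self.dataset_dir = dataset_dir
        self.lr = lr
        self.mbatch_size = mbatch_size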

Package map

The package is structured as follows:

├── dlhpcstarter
│    │
│    ├── __main__.py               - does the following:
│    │                                    1. Reads command line arguments using argparse.
│    │                                    2. Imports the 'stages' function for the task from task/.
│    │                                    3. Loads the specified configuration .yaml for the job from the path given via --config.
│    │                                    4. Submits the job (the configuration + 'stages') to the
│    │                                       cluster manager (or runs it locally if 'submit' is false).
│    ├── cluster.py                - contains the cluster management object.
│    ├── command_line_arguments.py - argparse for reading command line arguments.
│    ├── trainer.py                - contains an optional wrapper for the Lightning Trainer.
│    └── utils.py                  - small utility definitions.
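
Because the package ships a __main__.py, the same entry point should also be reachable with python -m (this is assumed to be equivalent to the dlhpcstarter command used throughout this document):

python -m dlhpcstarter -t cifar10 -c tests/task/cifar10/config/tl_dr.yaml --trial 0 --stages_module tests.task.cifar10.stages --train --test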

Tasks

Tasks can have any name. The name could be based on the data or the type of inference being made. For example:

Some publicly available tasks include:

What does the name do?

It is used to separate the outputs of the experiments for a task from those of other tasks.

Models

Please familiarise yourself with the Lightning LightningModule in order to correctly implement a model: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html

A model is created using a Lightning LightningModule. Everything we need for the model can be placed in the LightningModule, including commonly used libraries and objects, for example, torchvision models and datasets, DataLoaders, optimisers, and metrics.

Note:

Example:

Development via Model Composition and Inheritance

To promote rapid development of models, one solution is to use class composition and/or inheritance. For example, we may have a baseline that not only includes a basic model, but also the data pipeline:

from lightning.pytorch import LightningModule
from torch.utils.data import DataLoader, random_split
import torchvision
import torch

class Baseline(LightningModule):
    def __init__(self, lr, ..., **kwargs):
        super(Baseline, self).__init__()
        self.save_hyperparameters()
        self.lr = lr
        self.model = torchvision.models.resnet18(...)

    def setup(self, stage=None):
        if stage == 'fit' or stage is None:
            train_set = torchvision.datasets.CIFAR10(...)
            self.train_set, self.val_set = random_split(train_set, [45000, 5000])

        if stage == 'test' or stage is None:
            self.test_set = torchvision.datasets.CIFAR10(...)

    def train_dataloader(self, shuffle=True):
        return DataLoader(self.train_set, ...)

    def val_dataloader(self):
        return DataLoader(self.val_set, ...)

    def test_dataloader(self):
        return DataLoader(self.test_set, ...)

    def configure_optimizers(self):
        optimiser = {'optimizer': torch.optim.SGD(self.parameters(), lr=self.lr, momentum=0.9)}
        return optimiser

    def forward(self, images):
        return self.model(images)

    def training_step(self, batch, batch_idx):
        images, labels = batch
        y_hat = self(images)
        loss = self.loss(y_hat, labels)
        self.log_dict({'train_loss': loss}, ...)
        return loss

    def validation_step(self, batch, batch_idx):
        images, labels = batch
        y_hat = self(images)
        loss = self.loss(y_hat, labels)
        self.val_accuracy(torch.argmax(y_hat, dim=1), labels)
        self.log_dict({'val_acc': self.val_accuracy, 'val_loss': loss}, ...)

    def test_step(self, batch, batch_idx):
        images, labels = batch
        y_hat = self(images)
        self.test_accuracy(torch.argmax(y_hat, dim=1), labels)
        self.log_dict({'test_acc': self.test_accuracy}, ...)

After training and testing the baseline, we may want to improve upon its performance. For example, if we wanted to make the following modifications:

  1. Replace the ResNet-18 with a DenseNet-121.
  2. Replace SGD with AdamW.
  3. Add a constant learning rate schedule with warmup.

All we would need to do is inherit the baseline and make our modifications:

from transformers import get_constant_schedule_with_warmup

class Inheritance(Baseline):

    def __init__(self, num_warmup_steps, **kwargs):
        super(Inheritance, self).__init__(**kwargs)
        self.save_hyperparameters()
        self.num_warmup_steps = num_warmup_steps
        self.model = torchvision.models.densenet121(...)

    def configure_optimizers(self):
        optimiser = {'optimizer': torch.optim.AdamW(self.parameters(), lr=self.lr)}
        optimiser['scheduler'] = {
            'scheduler': get_constant_schedule_with_warmup(optimiser['optimizer'], self.num_warmup_steps),
            'interval': 'step',
            'frequency': 1,
        }
        return optimiser

We could also construct a model that is the combination of the two via composition. For example, we may want to use everything from Baseline, but the optimiser from Inheritance:

from lightning.pytorch import LightningModule

class Composite(LightningModule):
    def __init__(self, **kwargs):
        super(Composite, self).__init__()
        self.save_hyperparameters()
        self.lr = kwargs['lr']  # Needed by Inheritance.configure_optimizers().
        self.num_warmup_steps = kwargs['num_warmup_steps']  # Needed by Inheritance.configure_optimizers().
        self.baseline = Baseline(**kwargs)

    def setup(self, stage=None):
        self.baseline.setup(stage)

    def train_dataloader(self, shuffle=True):
        return self.baseline.train_dataloader(shuffle)

    def val_dataloader(self):
        return self.baseline.val_dataloader()

    def test_dataloader(self):
        return self.baseline.test_dataloader()

    def configure_optimizers(self):
        return Inheritance.configure_optimizers(self)  # Use configure_optimizers() from Inheritance.

    def forward(self, images):
        return self.baseline.forward(images)

    def training_step(self, batch, batch_idx):
        return self.baseline.training_step(batch, batch_idx)

    def validation_step(self, batch, batch_idx):
        return self.baseline.validation_step(batch, batch_idx)

    def test_step(self, batch, batch_idx):
        return self.baseline.test_step(batch, batch_idx)

Configuration YAML files and argparse

Currently, there are two methods for giving arguments:

  1. Via command line arguments using the argparse module. argparse mainly handles paths, development stage flags (e.g., training and testing flags), and cluster manager arguments.
  2. Via a configuration file stored in YAML format. This can handle all of the arguments defined by argparse, plus more, including hyperparameters for the model.

NOTE: Command line arguments will override configuration arguments!
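
For example (reusing arguments described later in this document), if the configuration file sets memory: '16GB' but the job is launched with --memory 32GB, the command line value wins and the job requests 32GB:

dlhpcstarter --config task/cifar10/config/baseline --task cifar10 --submit True --memory 32GB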

The mandatory arguments include:

  1. task, the name of the task.
  2. config, relative or absolute path to the configuration file (with or without the extension).
  3. module, the module in which the model definition is housed.
  4. definition, the class representing the model.
  5. exp_dir, the experiment directory, i.e., where all outputs, including model checkpoints, will be saved.
  6. monitor, the metric to monitor for ModelCheckpoint and EarlyStopping (optional), as well as for selecting the checkpoint to load for testing (e.g., 'val_loss').
  7. monitor_mode, whether the monitored metric is to be maximised or minimised ('max' or 'min').

task and config must be given as command line arguments for argparse:

dlhpcstarter --config task/cifar10/config/baseline --task cifar10

module, definition, and exp_dir can be given either as command line arguments, or be placed in the configuration file.
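
For example (assuming the command line flags mirror the argument names; check dlhpcstarter --help for the exact spelling):

dlhpcstarter --config task/cifar10/config/baseline --task cifar10 --module task.cifar10.model.baseline --definition Baseline --exp_dir /my/experiment/directory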

For each model of a task, we define a configuration. Hyperparameters, paths, as well as the device configuration can be stored in a configuration file. Configurations are in YAML format, e.g., task/cifar10/config/baseline.yaml.

Development via Configuration Files

Consider the following configuration file for the aforementioned CIFAR10 Baseline model, task/cifar10/config/baseline.yaml:

train: True
test: True
module: task.cifar10.model.baseline
definition: Baseline
monitor: 'val_acc'
monitor_mode: 'max'
lr: 1e-3
max_epochs: 32
mbatch_size: 32
num_workers: 5
exp_dir: /my/experiment/directory
dataset_dir: /my/datasets/directory

Another way we can improve upon the baseline model, i.e., the baseline configuration, is by modifying its hyperparameters. For example, we can still use Baseline, but alter the learning rate in task/cifar10/config/baseline_rev_a.yaml:

train: True
test: True
module: task.cifar10.model.baseline
definition: Baseline
monitor: 'val_acc'
monitor_mode: 'max'
lr: 1e-4  # modify this.
max_epochs: 32
mbatch_size: 32
num_workers: 5
exp_dir: /my/experiment/directory
dataset_dir: /my/datasets/directory

We then train and test with the new configuration:

dlhpcstarter --config task/cifar10/config/baseline_rev_a --task cifar10

Next level: Configuration composition via Hydra

If your new configuration only modifies a few arguments of another configuration file, you can take advantage of the composition feature of Hydra. This makes creating task/cifar10/config/baseline_rev_a.yaml from the previous section easy. We simply pull in the arguments from task/cifar10/config/baseline.yaml by adding its name to the defaults list:

defaults:
  - baseline
  - _self_

lr: 1e-4

Note that other configuration files are imported with reference to the current configuration path (not the working directory).

Please note that configuration groups are not used in this repository, so configurations composed from a different directory should be placed in the global package using @_global_. For example, the following would not work with this repository, as the arguments in hpc_paths would be grouped under paths:

defaults:
  - paths/hpc_paths
  - _self_

train: True
test: True
resumable: True
module: task.cifar10.model.baseline
definition: Baseline
monitor: 'val_acc'
monitor_mode: 'max'
lr: 1e-3
max_epochs: 3
mbatch_size: 32
num_workers: 5

To get around this, simply append @_global_ to the entry to remove the grouping:

defaults:
  - paths/hpc_paths@_global_  # changed here to remove "paths" grouping.
  - _self_

train: True
test: True
resumable: True
module: task.cifar10.model.baseline
definition: Baseline
monitor: 'val_acc'
monitor_mode: 'max'
lr: 1e-3
max_epochs: 3
mbatch_size: 32
num_workers: 5

This also allows us to organise configurations easily. For example, if we have the following directory structure:

├── task
│   └──  cifar10          
│        └── config  
│            ├── cluster
│            │    ├── 2hr.yaml
│            │    └── 24hr.yaml
│            │
│            ├── distributed
│            │    ├── 1gpu.yaml
│            │    ├── 4gpu.yaml
│            │    └── 4gpu4node.yaml
│            │
│            ├── paths
│            │    ├── local.yaml
│            │    └── hpc.yaml
│            │
│            └── baseline.yaml

With task/cifar10/config/baseline.yaml as:

defaults:
  - cluster/2hr@_global_
  - distributed/4gpu@_global_
  - paths/hpc@_global_
  - _self_

train: True
test: True
resumable: True
module: task.cifar10.model.baseline
definition: Baseline
monitor: 'val_acc'
monitor_mode: 'max'
lr: 1e-3
max_epochs: 3
mbatch_size: 32
num_workers: 5

Where task/cifar10/config/baseline.yaml will now include the arguments from the following sub-configurations: cluster/2hr.yaml, distributed/4gpu.yaml, and paths/hpc.yaml.
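
As an illustration only (these file contents are hypothetical, built from arguments described elsewhere in this document), the sub-configurations might look like:

cluster/2hr.yaml:

time_limit: '02:00:00'
submit: True

distributed/4gpu.yaml:

num_gpus: 4
num_nodes: 1

paths/hpc.yaml:

exp_dir: /my/experiment/directory
dataset_dir: /my/datasets/directory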

See the following documentation for more information:

Stages and Trainer

In each task directory is a Python module called stages.py, which contains the stages definition. This definition takes an object as input that houses the configuration for a job.

Typically, the following things happen in stages():

  1. The model definition is imported and instantiated using the configuration.
  2. A Lightning Trainer is created (e.g., via the trainer_instance wrapper described below).
  3. The model is trained if the train flag is set.
  4. The model is tested if the test flag is set.

In other words, stages() handles the training and testing of a model for a task using a Lightning Trainer.

A helpful wrapper at dlhpcstarter/trainer.py exists that passes frequently used and useful callbacks, loggers, and plugins to a Lightning Trainer instance:

from dlhpcstarter.trainer import trainer_instance

trainer = trainer_instance(**vars(args))

Place any of the parameters for the trainer detailed at https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#trainer-class-api in your configuration file, and they will be passed to the Lightning Trainer instance.
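
A minimal sketch of a stages() definition for the CIFAR10 baseline is given below. Treat it as an illustration only, not the package's exact implementation; it assumes that args carries the configuration keys used throughout this document (module, definition, train, test, etc.):

import importlib

from dlhpcstarter.trainer import trainer_instance


def stages(args):
    # Import the class named by 'definition' from the module named by 'module'.
    definition = getattr(importlib.import_module(args.module), args.definition)

    # Instantiate the LightningModule with the configuration as keyword arguments.
    model = definition(**vars(args))

    # Create the Lightning Trainer via the wrapper described above.
    trainer = trainer_instance(**vars(args))

    # Train and/or test according to the stage flags.
    if args.train:
        trainer.fit(model)
    if args.test:
        trainer.test(model)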

Tying it all together: dlhpcstarter

This is an overview of what occurs when the dlhpcstarter entry point is executed; understanding it is not necessary to use the package.

dlhpcstarter does the following:

  1. Reads the command line arguments using argparse.
  2. Imports the 'stages' function for the task.
  3. Loads the specified configuration .yaml for the job.
  4. Submits the job (the configuration + 'stages') to the cluster manager, or runs it locally if 'submit' is false.

Cluster manager and distributed computing

The following arguments are used for distributed computing:

Argument       Description                                                 Default
num_workers    No. of workers per DataLoader & GPU.                        1
num_gpus       Number of GPUs per node.                                    None
num_nodes      Number of nodes (should only be used with submit = True).  1

The following arguments are used to configure a job for a cluster manager (the default cluster manager is SLURM):

Argument       Description                                                      Default
memory         Amount of memory per node.                                       '16GB'
time_limit     Job time limit.                                                  '02:00:00'
submit         Submit job to the cluster manager.                               None
resumable      Resumable training; automatic resubmission to cluster manager.  None
qos            Quality of service.                                              None
begin          When to begin the Slurm job, e.g., now+1hour.                    None
email          Email for cluster manager notifications.                         None
venv_path      Path to the 'bin/activate' of a venv.                            None

These can be given as command line arguments:

dlhpcstarter --config task/cifar10/config/baseline --task cifar10 --submit 1 --num-gpus 4 --num-workers 5 --memory 32GB

Or they can be placed in the configuration .yaml file:

num_gpus: 4  # Added.
num_workers: 5  # Added.
memory: '32GB'  # Added.

train: True
test: True
module: task.cifar10.model.baseline
definition: Baseline
monitor: 'val_acc'
monitor_mode: 'max'
lr: 1e-3
max_epochs: 32
mbatch_size: 32
num_workers: 5
exp_dir: /my/experiment/directory
dataset_dir: /my/datasets/directory

And executed with:

dlhpcstarter --config task/cifar10/config/baseline --task cifar10 --submit True

If using a cluster manager, add the path to the bin/activate of your virtual environment:

...
venv_path: /my/env/name/bin/activate
...

Monitoring using Neptune.ai

Simply sign up at https://neptune.ai/ and add your username and API token to your configuration file:

...
neptune_username: my_username
neptune_api_key: df987y94y2q9hoiusadhc9wy9tr82uq408rjw98ch987qwhtr093q4jfi9uwehc987wqhc9qw4uf9w3q4h897324th
...

The PyTorch Lightning Trainer will then automatically upload metrics using the Neptune Logger to Neptune.ai. Once logged in to https://neptune.ai/, you will be able to monitor your task. See here for information about using the online UI: https://docs.neptune.ai/you-should-know/displaying-metadata.

Where all the outputs go: exp_dir

The experiment directory is where all of your outputs will be saved, including model checkpoints and metric scores. It is also where the cluster manager script, as well as stderr and stdout, are saved.

Note: the trial number also sets the seed number for your experiment.

Repository Wish List