

Training CNNs with PyTorch

This repository provides a modular base for training CNNs with PyTorch. It integrates with tensorboardX, which enables TensorBoard-style visualisations while the rest of the code remains plain, Pythonic PyTorch.

The files in the directory and their functions are described below.


Anaconda config file that can be used to set up a conda environment with all the required dependencies. The full list of dependencies is given in this file.


Contains the main function, which calls the functions defined in the rest of the files.


Reads the config file in the configs folder and returns an object holding all the parameters that were passed in. The config file to use is specified when calling the program, as python3 main.py --config-file "name of config file".
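As a rough illustration of this step, the sketch below parses the --config-file argument and loads the file into a dictionary. It assumes an INI-style config read with Python's configparser; the actual file format, the function name parse_config, and the key names used in the test are assumptions, not taken from the repository.

```python
import argparse
import configparser


def parse_config(argv=None):
    """Read the config file named on the command line (hypothetical helper).

    Returns a dict mapping each section name to a dict of its parameters.
    """
    parser = argparse.ArgumentParser(description="Train a CNN")
    parser.add_argument("--config-file", required=True,
                        help="path to the config file")
    args = parser.parse_args(argv)

    config = configparser.ConfigParser()
    config.optionxform = str  # keep the case of parameter names as written
    # ConfigParser.read returns the list of files it managed to read.
    if not config.read(args.config_file):
        raise FileNotFoundError(args.config_file)

    return {section: dict(config[section]) for section in config.sections()}
```

All values come back as strings under this approach, so booleans and numbers would still need converting (for example with config.getboolean) before use.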






Note: At most one of Resume, Branch, or Evaluate can be set to True at any given time. The directory structure used for checkpointing is described in the section on the checkpointing.py file.
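The constraint in the note could be enforced with a small check like the one below. This is only a sketch: the function name check_run_mode and the "New" label for a fresh test are hypothetical, and the repository may validate these flags differently.

```python
def check_run_mode(resume, branch, evaluate):
    """Return the active run mode, rejecting conflicting flags.

    Hypothetical helper: 'New' means none of the three flags is set.
    """
    flags = {"Resume": resume, "Branch": branch, "Evaluate": evaluate}
    active = [name for name, value in flags.items() if value]
    if len(active) > 1:
        raise ValueError(
            "Only one of Resume, Branch, or Evaluate may be True, got: "
            + ", ".join(active)
        )
    return active[0] if active else "New"
```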


Looks at the cnn section of the config file and loads the specified model from the models folder. Within the models folder, the __init__.py in the models/dataset folder should contain a from .dataset import * statement for each model that you wish to use, and within the dataset.py file, the __all__ value needs to be set to the names that should be imported.
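To show how __all__ and a star import combine to expose a model by name, the sketch below builds a throwaway package and looks a model up from it. Everything here is hypothetical: the package name demo_models, the module nets.py, the factory small_net, and the helper load_model are stand-ins, not the repository's actual layout or API.

```python
import importlib
from pathlib import Path


def build_demo_package(root):
    """Write a tiny package whose __init__.py re-exports names via __all__."""
    pkg = Path(root) / "demo_models"
    pkg.mkdir()
    # The package __init__ pulls in whatever nets.py lists in __all__.
    (pkg / "__init__.py").write_text("from .nets import *\n")
    (pkg / "nets.py").write_text(
        "__all__ = ['small_net']\n\n"
        "def small_net():\n"
        "    return 'small_net instance'\n\n"
        "def unlisted_helper():\n"
        "    return None\n"
    )


def load_model(package_name, model_name):
    """Look up a model factory by name on the package and call it."""
    package = importlib.import_module(package_name)
    if not hasattr(package, model_name):
        raise ValueError(f"Unknown model: {model_name}")
    return getattr(package, model_name)()
```

Because unlisted_helper is not named in __all__, the star import does not copy it into the package namespace, so only the listed model names can be requested this way.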


Defines a class, instantiated in main, that holds the state during training and handles checkpointing. Whenever a new test is created, i.e. Branch, Resume and Evaluate are all False, the checkpoint directory is set to Checkpoint_Path/Test_Name/orig. After every epoch, two files are stored in this directory, named (epoch_number)-model.pth.tar and (epoch_number)-state.pth.tar.
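The naming scheme for a new test can be sketched as a small path helper. The function name checkpoint_paths and its arguments are hypothetical, and the sketch assumes that the parentheses in (epoch_number) are placeholders rather than literal characters in the file name.

```python
import os


def checkpoint_paths(checkpoint_path, test_name, epoch, run_dir="orig"):
    """Return the (model, state) checkpoint file paths for one epoch.

    Hypothetical helper mirroring Checkpoint_Path/Test_Name/orig.
    """
    directory = os.path.join(checkpoint_path, test_name, run_dir)
    model_file = os.path.join(directory, f"{epoch}-model.pth.tar")
    state_file = os.path.join(directory, f"{epoch}-state.pth.tar")
    return model_file, state_file
```

For example, checkpoint_paths("ckpts", "test1", 4) would point at the two files for epoch 4 inside ckpts/test1/orig.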

If Resume is set, the new checkpoints are placed in the same directory as the model.pth.tar file that was resumed from. The code checks that the checkpoint file passed in belongs to the last epoch stored in that directory. To resume from a different epoch, use Branch instead.
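The last-epoch check described above might look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, and file names are assumed to follow the (epoch_number)-model.pth.tar pattern with a plain integer epoch.

```python
import os
import re

MODEL_PATTERN = re.compile(r"^(\d+)-model\.pth\.tar$")


def last_saved_epoch(directory):
    """Return the highest epoch with a model checkpoint, or None if empty."""
    epochs = []
    for name in os.listdir(directory):
        match = MODEL_PATTERN.match(name)
        if match:
            epochs.append(int(match.group(1)))
    return max(epochs) if epochs else None


def check_resume(checkpoint_file):
    """Validate a resume request and return (directory, next_epoch)."""
    directory, name = os.path.split(os.path.abspath(checkpoint_file))
    match = MODEL_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Not a model checkpoint: {name}")
    epoch = int(match.group(1))
    last = last_saved_epoch(directory)
    if epoch != last:
        raise ValueError(
            f"Can only resume from the last epoch ({last}); "
            f"use Branch to restart from epoch {epoch}"
        )
    return directory, epoch + 1
```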

If Branch is set, a new directory named (start_epoch_number)-(version) is created at the same level as the orig directory. If several branches are created from the same start epoch, the version number is incremented by 1 each time. The checkpoint of the start epoch is copied from the orig directory into the new directory, and training resumes from the following epoch. The relevant entries in the old log file are also copied over, so the log file in the new directory contains both the history up to the start epoch and the new data.
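The branch directory naming and checkpoint copy can be sketched as follows. The helper names are hypothetical, the first version number is assumed to be 0 (the text does not say where numbering starts), and copying of log entries is left out.

```python
import os
import re
import shutil


def next_branch_dir(test_dir, start_epoch, first_version=0):
    """Pick the next unused (start_epoch)-(version) directory name."""
    pattern = re.compile(rf"^{start_epoch}-(\d+)$")
    versions = []
    for name in os.listdir(test_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(test_dir, name)):
            versions.append(int(match.group(1)))
    # Assumption: numbering starts at first_version and grows by 1.
    version = max(versions) + 1 if versions else first_version
    return os.path.join(test_dir, f"{start_epoch}-{version}")


def create_branch(test_dir, start_epoch):
    """Create the branch directory and copy the start checkpoint from orig."""
    branch_dir = next_branch_dir(test_dir, start_epoch)
    os.makedirs(branch_dir)
    for suffix in ("model", "state"):
        name = f"{start_epoch}-{suffix}.pth.tar"
        shutil.copy2(os.path.join(test_dir, "orig", name),
                     os.path.join(branch_dir, name))
    return branch_dir
```

Calling create_branch twice with the same start epoch yields two sibling directories whose version numbers differ by 1, matching the behaviour described above.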