Awesome

Main features

Support for multi-node multi-gpu jobs. The assignment of GPUs to each process (via $CUDA_VISIBLE_DEVICES) is automatically done. Each process of the job is assumed to be single-GPU.
Run a command without writing submission scripts

runjob --ngpus=8 --project=myproject --queue=myqueue python distributed_program.py

Create automatically one directory for each job and redirect the output of each process to a separate log file.
Automatically print the outputs of your job to your monitor. No need to run tail -f manually.
Cancel your job using a CTRL+C as if you were running you program interactively.
Exports useful environment variables related to your project and available resources.

Exported environment variables:

JOB_DIR
PROJECT_DIR
N_CPUS
CONDA_ROOT
CONDA_ENV
N_PROCS
PROC_ID
OUT_FILE
JOB_LOG_FILE

Config file

See examples/config.yaml for config of projects and queues.

Usage

runjob-config examples/config.yaml
runjob --ngpus=8 --project=myproject --queue=myqueue python distributed_program.py

This will start a multi-gpu (possibly multi-node according to your queue config) job with 1 process per GPU and print the output (stdout and stderr) of one of the processes (the one with SLURM_LOCALID=0). Use keyboard interupt to cancel your job.

Running tests

Make sure you are on a SLURM cluster which sinfo should output something.

runjob-config examples/config.yaml
pytest -vs

Features that will be added in the future:

A simple interface to resume long-running jobs automatically.
Copy your python project to an temporary directory to ensure modifications you make to your code do not affect your job while it's running.
A utility to runjob for interactive jobs.