
GPU Task Scheduler

GPU Task Scheduler is a Python library for scheduling GPU jobs in parallel.

When designing and running neural-network-based algorithms, we often need to test the code on a large set of parameter combinations. To be more efficient, we may also want to distribute the tasks across multiple GPUs in parallel. Writing scripts to achieve this can be a headache, especially when the parameter combinations to test are complicated, let alone the extra work of configuring which GPUs to use.

GPU Task Scheduler offers an easy and quick way to do this. All you need to do is define the parameter combinations in a simple configuration, write the test code in our framework, and let GPU Task Scheduler run the tests in parallel for you. If you already have test code, don't worry: migrating it to our framework is very easy.

The following sections explain how to install and use it.

Installation

Usage

Configuration

The configuration is defined as a nested Python dictionary containing a number of mandatory and optional keys. Each key-value pair defines one setting. The definitions of the individual settings are as follows.

Here is a sample configuration. Assume that it is stored in config.py.

config = {
    "scheduler_config": {
        "gpu": ["0", "1", "2"]
    },

    "global_config": {
        "num_run": 5,
        "num_epoch": 200,
    },

    "test_config": [
        {
            "method": ["GAN", "ALI"],
            "num_packing": [1],
            "num_zmode": [1]
        }, 
        {
            "method": ["WGAN"],
            "num_packing": [3],
            "num_zmode": [1, 2]
        }
    ]
}

The test instances for this example will be:

{"num_run": 5, "num_epoch": 200, "method": "GAN", "num_packing": 1, "num_zmode": 1}
{"num_run": 5, "num_epoch": 200, "method": "ALI", "num_packing": 1, "num_zmode": 1}
{"num_run": 5, "num_epoch": 200, "method": "WGAN", "num_packing": 3, "num_zmode": 1}
{"num_run": 5, "num_epoch": 200, "method": "WGAN", "num_packing": 3, "num_zmode": 2}

Implementing the test code interface

Your test code should inherit from the gpu_task_scheduler.gpu_task.GPUTask class, which defines two interfaces.

There are two useful class variables accessible in those interfaces:

Assume that you implement the class in my_gpu_task.py, and the class name is MyGPUTask.
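As a rough sketch, such a class might look like the following. The method name main and the attribute used to access the current test instance's parameters are assumptions for illustration; check the GPUTask base class for the exact interface and the available class variables.

# my_gpu_task.py
from gpu_task_scheduler.gpu_task import GPUTask

class MyGPUTask(GPUTask):
    def main(self):
        # Assumption: the parameters of the current test instance
        # (e.g. method, num_epoch) are exposed by the base class;
        # the attribute name used here is illustrative.
        config = self._config
        print("Running {} for {} epochs".format(
            config["method"], config["num_epoch"]))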

Running

Now that everything is ready, you can start running the tests with a few lines of code. First, let's import the two files you wrote:

from config import config
from my_gpu_task import MyGPUTask

We also need to import the scheduler class:

from gpu_task_scheduler.gpu_task_scheduler import GPUTaskScheduler

Now we construct a scheduler by passing the configuration and the task class to the constructor:

scheduler = GPUTaskScheduler(config=config, gpu_task_class=MyGPUTask)

and start running the tests:

scheduler.start()

The scheduler will now run the test instances in parallel on the GPUs you specified. Whenever a test instance finishes on a GPU, the scheduler fetches the next test instance and runs it on that GPU.
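Putting these pieces together, a complete driver script (call it main.py, the name is arbitrary) might look like this:

# main.py
from config import config
from my_gpu_task import MyGPUTask
from gpu_task_scheduler.gpu_task_scheduler import GPUTaskScheduler

if __name__ == "__main__":
    # Construct the scheduler with the configuration and task class,
    # then run all test instances across the GPUs listed in
    # scheduler_config.
    scheduler = GPUTaskScheduler(config=config, gpu_task_class=MyGPUTask)
    scheduler.start()

The if __name__ == "__main__" guard is optional here; it is just a precaution in case the scheduler launches worker processes.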

WARNING: If you import the theano or tensorflow library at the top level of my_gpu_task.py (or config.py), the code may occupy part of the GPU resources as soon as the module is imported, before the scheduler starts. Usually this only wastes some GPU memory, not GPU compute. If you don't want this to happen, there are a few workarounds, such as deferring the framework import until the task actually runs (see the sketch below).

TODO: A better way to get around this.
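One simple workaround, sketched below, is to import the framework inside the test code itself rather than at the top of my_gpu_task.py, so nothing touches the GPU until the scheduler actually runs the task:

# my_gpu_task.py
from gpu_task_scheduler.gpu_task import GPUTask

class MyGPUTask(GPUTask):
    def main(self):
        # Deferred import: tensorflow is only loaded when this task is
        # actually executed, not when the scheduler imports the module.
        import tensorflow as tf
        # ... build and train the model here ...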

WARNING: When running the test code, the scheduler automatically sets matplotlib's backend to Agg (if matplotlib is installed), because in most cases we don't need to show figures on screen. If this is not what you need, you can call matplotlib.pyplot.switch_backend at the beginning of your test code's main function to switch the backend, or reload(matplotlib) and then choose your desired backend.

TODO: A better way to do this.
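For example, to switch back to an interactive backend at the start of your test code (assuming a display is available and the backend name fits your setup):

import matplotlib.pyplot as plt

# Switch from the scheduler's default Agg backend to an interactive one.
# "TkAgg" is only an example; use whichever backend your system supports.
plt.switch_backend("TkAgg")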

Other useful interfaces

Besides the scheduler class, the library contains another useful class, ConfigManager, which is used for parsing the configuration. It is instantiated implicitly inside the scheduler class, but you can also construct one yourself, as it provides many useful interfaces, especially when you want to collect or further process the results.

It provides the following public interfaces:

Example

Projects that use this library:

Contributing

If you find bugs/problems or want to add more features to this library, feel free to submit issues or make pull requests.