Learning@home: Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts

Original PyTorch implementation of "Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts" (NeurIPS 2020).

TL;DR: Learning@home is an approach for training large (up to multi-terabyte) neural networks on hardware provided by volunteers with slow and unreliable connections.

This repository contains a snapshot of Learning@home that was used to conduct initial experiments. While this snapshot implements the main functionality of Learning@home, it should be treated as a testbed to reproduce our experiments, not as a finished library (see limitations below). To see an updated implementation designed for practical use, please refer to the hivemind project.

What do I need to run it?

How do I run it?

  1. Clone or download this repo. cd to its root directory.
  2. Create a working Python environment. Anaconda works fine.
  3. Install the packages from requirements.txt.
  4. Follow the instructions in the next section

Running the experiments

Throughput

All three scripts are contained in the folder throughput and are ready for customized benchmark runs.

To run the baseline with parameters from the paper, use

python baseline_throughput.py --batches-for-latency 5 --batches-for-throughput 10 --batch-size 4 --throughput-runs 5 --linspace-points 10 --block-type transformer --layers-per-gpu 56 --gpus 0 1 2 3

For testing Learning@home throughput under latency, first start the server for each GPU you have with

python throughput_server.py -a 16 -p PORT_NUMBER --block_type BLOCK_TYPE --gpu GPU_NUMBER

and then run a multiple-trainer client with commands like

python throughput_client.py -j 64 --batches-for-latency 5 --batches-for-throughput 2 --throughput-runs 5 --linspace-points 10 --layers-per-gpu 56 --block-type ffn --hosts HOSTNAME1:PORT_NUMBER1 HOSTNAME2:PORT_NUMBER2

python throughput_client.py -j 64 --batches-for-latency 5 --batches-for-throughput 2 --throughput-runs 5 --linspace-points 10 --layers-per-gpu 56 --block-type transformer --max-ping 0.2 --hosts HOSTNAME1:PORT_NUMBER1 HOSTNAME2:PORT_NUMBER2 --batch-size 4
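When several GPUs are involved, the server and client commands above can be assembled programmatically. The following is a minimal sketch and not part of the repository: the helper names `server_command` and `client_command`, the base port 8780 and the `localhost` hostnames are assumptions made for illustration. Only flags that appear in the commands above are used, spelled exactly as they appear there.

```python
import shlex


def server_command(gpu, port, block_type="transformer"):
    """Command for one throughput server; flags are copied from the README text."""
    return ["python", "throughput_server.py", "-a", "16", "-p", str(port),
            "--block_type", block_type, "--gpu", str(gpu)]


def client_command(hosts, block_type="transformer", max_ping=None, batch_size=None):
    """Command for the multiple-trainer client; hosts is a list of HOST:PORT strings."""
    cmd = ["python", "throughput_client.py", "-j", "64",
           "--batches-for-latency", "5", "--batches-for-throughput", "2",
           "--throughput-runs", "5", "--linspace-points", "10",
           "--layers-per-gpu", "56", "--block-type", block_type]
    if max_ping is not None:
        cmd += ["--max-ping", str(max_ping)]
    cmd += ["--hosts", *hosts]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    return cmd


BASE_PORT = 8780  # hypothetical port, pick any free one
gpus = [0, 1]     # hypothetical GPU indices

# One server per GPU, each listening on its own port.
server_cmds = [server_command(gpu, BASE_PORT + gpu) for gpu in gpus]
hosts = [f"localhost:{BASE_PORT + gpu}" for gpu in gpus]
client_cmd = client_command(hosts, max_ping=0.2, batch_size=4)

for cmd in server_cmds + [client_cmd]:
    print(shlex.join(cmd))
```

The printed lines can be pasted into separate terminals: start every server first, then the client.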

Convergence

This experiment can be conducted both in a distributed setting and with an emulator. We recommend using the emulator to make results hardware-agnostic and reduce variance due to CPU and network interference from other processes.

You can find notebooks for the large FFN, DMoE with 64 experts, and DMoE with 4096 experts in ./experiments/convergence.

Below we include the full grid of parameters used to conduct convergence experiments:

setup                     notebook  experts_per_layer  num_trainers  batch_size  delay_ms
100ms large ffn           click     -                  64            4           100
100ms 64 experts          click     16                 16            4           100
100ms 256 experts         click     64                 64            4           100
100ms 4096 experts        click     1024               64            8           100
1000ms large ffn          click     -                  64            4           1000
1000ms 64 experts         click     16                 16            4           1000
1000ms 256 experts        click     64                 64            4           1000
1000ms 4096 experts       click     1024               64            8           1000
10% failure 64 experts    click     16                 16            4           1000
10% failure 256 experts   click     64                 64            4           1000
10% failure 4096 experts  click     1024               64            8           1000

You can reproduce the curves in Figure 4 by opening the associated notebook, setting the parameters as described in the table, and iterating through random seeds 1337 to 1341 (inclusive).
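To keep track of which runs are still outstanding, the grid and the seed range can be written down as plain data. This is a bookkeeping sketch only: the names `GRID`, `SEEDS` and `runs` are hypothetical, the values are as read from the table above, and the notebooks themselves still have to be configured by hand.

```python
from itertools import product

# (setup, experts_per_layer, num_trainers, batch_size, delay_ms)
# None means the column does not apply (the large FFN baseline has no experts).
GRID = [
    ("100ms large ffn", None, 64, 4, 100),
    ("100ms 64 experts", 16, 16, 4, 100),
    ("100ms 256 experts", 64, 64, 4, 100),
    ("100ms 4096 experts", 1024, 64, 8, 100),
    ("1000ms large ffn", None, 64, 4, 1000),
    ("1000ms 64 experts", 16, 16, 4, 1000),
    ("1000ms 256 experts", 64, 64, 4, 1000),
    ("1000ms 4096 experts", 1024, 64, 8, 1000),
    ("10% failure 64 experts", 16, 16, 4, 1000),
    ("10% failure 256 experts", 64, 64, 4, 1000),
    ("10% failure 4096 experts", 1024, 64, 8, 1000),
]

SEEDS = range(1337, 1342)  # 1337, 1338, 1339, 1340, 1341

# One entry per (setup, seed) pair that has to be run.
runs = [
    {
        "setup": setup,
        "experts_per_layer": experts,
        "num_trainers": trainers,
        "batch_size": batch,
        "delay_ms": delay,
        "seed": seed,
    }
    for (setup, experts, trainers, batch, delay), seed in product(GRID, SEEDS)
]

for run in runs[:3]:
    print(run)
print("total runs:", len(runs))
```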

Please note that these experiments can take up a lot of GPU memory because they store "stale" gradients. With 16 trainers, the code should fit on a consumer GPU. For 4096 experts, we bypassed the memory limit by running on CPU.

Gating function over DHT

We also provide a reference implementation of the DMoE gating function over a Kademlia DHT via lib.GatingFunction.

In order to test our implementation, you need to do two things:

First, set up DHT with at least one server process:

import torch
import lib

# initial kademlia node
node_zero = lib.TesseractNetwork(port=ROOT_PORT, start=True)


# create experts. Warning: expert uids must be unique
experts = {}
for expert_uid in expert_uids:
    expert = torch.jit.script(NetworkBlock(1024))
    expert_backend = lib.ExpertBackend(
        name=expert_uid, expert=expert, opt=torch.optim.Adam(expert.parameters(), amsgrad=True),
        args_schema=(lib.BatchTensorProto(1024),), outputs_schema=lib.BatchTensorProto(1024),
        max_batch_size=2048, pool_size=8)
    experts[expert_uid] = expert_backend

# set up server(s)
runtime = lib.TesseractServer(lib.TesseractNetwork(('127.0.0.1', ROOT_PORT), port=SOME_OTHER_PORT, start=True),
                              experts, port=PORTS[0], conn_handler_processes=64,
                              sender_threads=1, device=torch.device('cuda'),
                              start=True)
# after creating node_zero you can create additional TesseractServer instances in separate processes
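The server snippet above iterates over `expert_uids` without defining it. One possible way to build that list, assuming the "expert.[0-32).[0-32)" pattern used by the client example and a 32 x 32 grid, is sketched below. The helper names `make_expert_uids` and `uids_for_server` are hypothetical and not part of `lib`.

```python
from itertools import product


def make_expert_uids(prefix="expert", grid_size=(32, 32)):
    """Enumerate uids such as 'expert.0.0' ... 'expert.31.31' for the given grid."""
    index_ranges = (range(n) for n in grid_size)
    return [".".join([prefix, *map(str, idx)]) for idx in product(*index_ranges)]


def uids_for_server(all_uids, server_index, num_servers):
    """Give each server a disjoint slice so that no uid is hosted twice."""
    return all_uids[server_index::num_servers]


all_uids = make_expert_uids()

# Hypothetical split across two server processes.
expert_uids = uids_for_server(all_uids, server_index=0, num_servers=2)
other_uids = uids_for_server(all_uids, server_index=1, num_servers=2)

print(len(all_uids), expert_uids[:3], other_uids[:3])
```

Striding through the full list is only one way to partition it. Any split works as long as the uids stay unique across servers.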

Second, create a client process and connect to any DHT node:

import torch
import lib

# create one or several backends with expert uids following the "expert.[0-32).[0-32)" pattern
# all backends must have TesseractNetwork active

network = lib.TesseractNetwork(('127.0.0.1', ROOT_PORT), port=SOME_NEW_PORT, start=True)
dmoe = lib.GatingFunction(in_features=1024, grid_size=[32, 32], k_best=4, network=network, uid_prefix='expert')

average_out = dmoe(torch.randn(32, 1024))
average_out.sum().backward()
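To give some intuition for what `grid_size` and `k_best` control, here is a toy, exhaustive version of choosing experts on a grid. It assumes that an expert's score is the sum of one score per grid dimension and that the chosen experts are weighted by a softmax over their scores. This is an illustration under those assumptions, not the code in lib.GatingFunction, which looks experts up through the DHT rather than scoring the full grid. The function names and the random scores are made up for the example.

```python
import heapq
import math
import random
from itertools import product


def top_k_experts(scores_per_dim, k_best, uid_prefix="expert"):
    """Score every grid cell as the sum of its per-dimension scores, keep the k best."""
    grid = product(*(range(len(scores)) for scores in scores_per_dim))
    scored = (
        (sum(scores[i] for scores, i in zip(scores_per_dim, idx)), idx)
        for idx in grid
    )
    best = heapq.nlargest(k_best, scored)  # sorted from highest to lowest score
    return [(".".join([uid_prefix, *map(str, idx)]), score) for score, idx in best]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Stand-in for the gating network output: one score vector per grid dimension.
rng = random.Random(0)
scores = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(2)]

chosen = top_k_experts(scores, k_best=4)
weights = softmax([score for _, score in chosen])

for (uid, score), weight in zip(chosen, weights):
    print(uid, round(score, 3), round(weight, 3))
```

With additive scores, the best expert is always the one that combines the highest-scoring index in each dimension, which is what makes a prefix-by-prefix search over uids feasible.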

Learning@home quick tour

Trainer process:

Runtime process:

DHT:

Limitations

As stated above, this implementation is a testbed for experiments, not a feature-complete library. More specifically:

An updated version of the library is available at https://github.com/learning-at-home/hivemind.

References

Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts (Max Ryabinin and Anton Gusev, NeurIPS 2020).

@misc{ryabinin2020crowdsourced,
      title={Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts}, 
      author={Max Ryabinin and Anton Gusev},
      year={2020},
      eprint={2002.04013},
      archivePrefix={arXiv},
      primaryClass={cs.DC}
}