
Random Network Distillation

Implementation of Exploration by Random Network Distillation on the Montezuma's Revenge Atari game. The method generates intrinsic rewards based on the novelty of the states the agent visits and uses them to counteract the sparsity of the game's extrinsic reward. The agent itself is trained with Proximal Policy Optimization, which combines extrinsic and intrinsic rewards naturally and keeps the variance of training comparatively low.
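
The core idea is sketched below, assuming small MLPs over flat observations for brevity (the repo's actual networks are CNNs defined in Brain/model.py, so details differ): a fixed, randomly initialized target network and a trained predictor network, where the predictor's error on a state serves as the novelty bonus.

```python
# Minimal RND sketch (not this repo's exact code): prediction error of a
# trained predictor against a frozen random target is the intrinsic reward.
import torch
import torch.nn as nn

class RNDNets(nn.Module):
    def __init__(self, obs_dim: int, feature_dim: int = 512):
        super().__init__()
        # Target network: random weights, never trained.
        self.target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                    nn.Linear(256, feature_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)
        # Predictor network: trained to match the target's output.
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                       nn.Linear(256, feature_dim))

    def intrinsic_reward(self, next_obs: torch.Tensor) -> torch.Tensor:
        # Error is large for novel states and shrinks as states become familiar.
        with torch.no_grad():
            target_feat = self.target(next_obs)
        pred_feat = self.predictor(next_obs)
        return (pred_feat - target_feat).pow(2).mean(dim=-1)
```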

Demo

RNN Policy | CNN Policy | Super Mario Bros (demo GIFs)

Results

RNN Policy | CNN Policy (training plots)

Important findings to mention

Table of hyper-parameters

With a max-and-skip frame factor of 4, the cap of 4500 agent steps per episode corresponds to 4500 * 4 = 18000 raw frames, as specified in the paper.

| Parameter                        | Value |
| :------------------------------- | :---- |
| total rollouts per environment   | 30000 |
| max frames per episode           | 4500  |
| rollout length                   | 128   |
| number of environments           | 128   |
| number of epochs                 | 4     |
| number of mini batches           | 4     |
| learning rate                    | 1e-4  |
| extrinsic gamma                  | 0.999 |
| intrinsic gamma                  | 0.99  |
| lambda                           | 0.95  |
| extrinsic advantage coefficient  | 2     |
| intrinsic advantage coefficient  | 1     |
| entropy coefficient              | 0.001 |
| clip range                       | 0.1   |
| steps for initial normalization  | 50    |
| predictor proportion             | 0.25  |
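
As a rough illustration of how the two reward streams in the table could be combined (this is a hedged sketch, not the repo's runner.py/brain.py code, and it assumes a zero bootstrap value at the end of the rollout): extrinsic and intrinsic advantages are estimated separately with their own gammas, then weighted 2:1 before the PPO update.

```python
# Sketch: separate GAE for extrinsic (gamma=0.999, episodic) and intrinsic
# (gamma=0.99, non-episodic) rewards, combined with coefficients 2 and 1.
import numpy as np

EXT_GAMMA, INT_GAMMA, LAM = 0.999, 0.99, 0.95
EXT_ADV_COEF, INT_ADV_COEF = 2.0, 1.0

def gae(rewards, values, dones, gamma, lam=LAM):
    """Generalized Advantage Estimation over one rollout of length T."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    last_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0  # zero bootstrap for brevity
        non_terminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        last_adv = delta + gamma * lam * non_terminal * last_adv
        advantages[t] = last_adv
    return advantages

def combined_advantage(ext_rewards, int_rewards, ext_values, int_values, dones):
    ext_adv = gae(ext_rewards, ext_values, dones, EXT_GAMMA)
    # The intrinsic stream is treated as non-episodic: episode ends are ignored.
    int_adv = gae(int_rewards, int_values, np.zeros_like(dones), INT_GAMMA)
    return EXT_ADV_COEF * ext_adv + INT_ADV_COEF * int_adv
```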

Structure

PPO-RND
├── Brain
│   ├── brain.py
│   └── model.py
├── Common
│   ├── config.py
│   ├── logger.py
│   ├── play.py
│   ├── runner.py
│   └── utils.py
├── demo
│   ├── CNN_Policy.gif
│   └── RNN_Policy.gif
├── main.py
├── Models
│   └── 2020-10-20-15-39-45
│       └── params.pth
├── Plots
│   ├── CNN
│   │   ├── ep_reward.png
│   │   ├── RIR.png
│   │   └── visited_rooms.png
│   └── RNN
│       ├── ep_reward.png
│       ├── RIR.png
│       └── visited_rooms.png
├── README.md
└── requirements.txt

  1. Brain includes the neural network architectures and the agent's decision-making core.
  2. Common includes code shared by most RL projects for auxiliary tasks such as logging, wrapping the Atari environment, etc. (see the sketch after this list).
  3. main.py is the entry point of the code; it ties the other parts together and makes the agent interact with the environment.
  4. Models includes a pre-trained weight that you can use to play with or to continue training from; every new weight is also saved in this directory.
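
For context on the frame handling mentioned above, here is a minimal max-and-skip wrapper of the kind commonly used for Atari (the repo's own wrappers live in Common/utils.py and may differ in detail). With skip=4, an episode cap of 4500 agent steps corresponds to 18000 raw frames.

```python
# Sketch of a max-and-skip Atari wrapper (old gym step API, as in 2020-era code).
import gym
import numpy as np

class MaxAndSkipEnv(gym.Wrapper):
    def __init__(self, env, skip: int = 4):
        super().__init__(env)
        self.skip = skip

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        obs_buffer = []
        # Repeat the chosen action `skip` times, summing the rewards.
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            obs_buffer.append(obs)
            total_reward += reward
            if done:
                break
        # Max over the last two frames to remove Atari sprite flickering.
        max_frame = np.max(np.stack(obs_buffer[-2:]), axis=0)
        return max_frame, total_reward, done, info
```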

Dependencies

Installation

pip3 install -r requirements.txt

Usage

How to run

usage: main.py [-h] [--n_workers N_WORKERS] [--interval INTERVAL] [--do_test]
               [--render] [--train_from_scratch]

Variable parameters based on the configuration of the machine or user's choice

optional arguments:
  -h, --help            show this help message and exit
  --n_workers N_WORKERS
                        Number of parallel environments.
  --interval INTERVAL   The interval specifies how often different parameters
                        should be saved and printed, counted by iterations.
  --do_test             The flag determines whether to train the agent or play
                        with it.
  --render              The flag determines whether to render each agent or
                        not.
  --train_from_scratch  The flag determines whether to train from scratch or
                        continue previous tries.

python3 main.py --n_workers=128 --interval=100
python3 main.py --n_workers=128 --interval=100 --train_from_scratch
python3 main.py --do_test

Hardware requirements

References

  1. Exploration by Random Network Distillation, Burda et al., 2018
  2. Proximal Policy Optimization Algorithms, Schulman et al., 2017

Papers citing this repo

  1. Benchmarking the Spectrum of Agent Capabilities, D. Hafner, 2021 [Code]

Acknowledgement

  1. @jcwleo for random-network-distillation-pytorch.
  2. @OpenAI for random-network-distillation.