RL Games: High performance RL library

Discord Channel Link

Papers and related links

Some results on different environments

(Result videos: Ant running, Humanoid running, Allegro Hand 400, Shadow Hand OpenAI, Allegro Hand real world, Allegro Kuka)

Implemented in Pytorch:

Implemented in Tensorflow 1.x (removed in this version):

Quickstart: Colab in the Cloud

Explore RL Games quickly and easily in Colab notebooks:

Installation

For maximum training performance, a preliminary installation of PyTorch 2.2 or newer with CUDA 12.1 or newer is highly recommended:

conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia

or:

pip install torch torchvision

Then:

pip install rl-games

To run CPU-based environments, either Ray or envpool is required: pip install envpool or pip install ray. To train on Mujoco, Atari or Box2D based environments, the corresponding packages need to be installed additionally with pip install gym[mujoco], pip install gym[atari] or pip install gym[box2d] respectively.

To run Atari, pip install opencv-python is also required. In addition, installing envpool is highly recommended for maximum simulation and training performance on Mujoco and Atari environments: pip install envpool
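
As a quick, optional sanity check of the installation (this snippet is not part of rl_games itself), you can verify that the package imports and that PyTorch sees a CUDA device:

# optional post-install check: the import succeeds if rl-games is installed,
# and CUDA should be available for GPU training
import torch
import rl_games  # noqa: F401
print('torch', torch.__version__, 'CUDA available:', torch.cuda.is_available())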

Citing

If you use rl-games in your research please use the following citation:

@misc{rl-games2021,
title = {rl-games: A High-performance Framework for Reinforcement Learning},
author = {Makoviichuk, Denys and Makoviychuk, Viktor},
month = {May},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Denys88/rl_games}},
}

Development setup

poetry install
# install cuda related dependencies
poetry run pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

Training

NVIDIA Isaac Gym

Download and follow the installation instructions of Isaac Gym: https://developer.nvidia.com/isaac-gym
And IsaacGymEnvs: https://github.com/NVIDIA-Omniverse/IsaacGymEnvs
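
Note that the train.py script used in the commands below ships with IsaacGymEnvs (which uses rl_games as its training backend), so these commands are run from an IsaacGymEnvs checkout rather than from this repository.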

Ant

python train.py task=Ant headless=True
python train.py task=Ant test=True checkpoint=nn/Ant.pth num_envs=100

Humanoid

python train.py task=Humanoid headless=True
python train.py task=Humanoid test=True checkpoint=nn/Humanoid.pth num_envs=100

Shadow Hand block orientation task

python train.py task=ShadowHand headless=True
python train.py task=ShadowHand test=True checkpoint=nn/ShadowHand.pth num_envs=100

Other

Atari Pong

poetry install -E atari
poetry run python runner.py --train --file rl_games/configs/atari/ppo_pong.yaml
poetry run python runner.py --play --file rl_games/configs/atari/ppo_pong.yaml --checkpoint nn/PongNoFrameskip.pth

Brax Ant

poetry install -E brax
poetry run pip install --upgrade "jax[cuda]==0.3.13" -f https://storage.googleapis.com/jax-releases/jax_releases.html
poetry run python runner.py --train --file rl_games/configs/brax/ppo_ant.yaml
poetry run python runner.py --play --file rl_games/configs/brax/ppo_ant.yaml --checkpoint runs/Ant_brax/nn/Ant_brax.pth

Experiment tracking

rl_games supports experiment tracking with Weights and Biases.

poetry install -E atari
poetry run python runner.py --train --file rl_games/configs/atari/ppo_breakout_torch.yaml --track
WANDB_API_KEY=xxxx poetry run python runner.py --train --file rl_games/configs/atari/ppo_breakout_torch.yaml --track
poetry run python runner.py --train --file rl_games/configs/atari/ppo_breakout_torch.yaml --wandb-project-name rl-games-special-test --track
poetry run python runner.py --train --file rl_games/configs/atari/ppo_breakout_torch.yaml --wandb-project-name rl-games-special-test --wandb-entity openrlbenchmark --track
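
These commands assume the wandb package is installed (pip install wandb) and that you are either logged in via wandb login or pass WANDB_API_KEY as shown in the second example.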

Multi GPU

We use torchrun to orchestrate any multi-gpu runs.

torchrun --standalone --nnodes=1 --nproc_per_node=2 runner.py --train --file rl_games/configs/ppo_cartpole.yaml

Config Parameters

Field | Example Value | Default | Description
--- | --- | --- | ---
seed | 8 | None | Seed for pytorch, numpy etc.
algo |  |  | Algorithm block.
name | a2c_continuous | None | Algorithm name. Possible values are: sac, a2c_discrete, a2c_continuous.
model |  |  | Model block.
name | continuous_a2c_logstd | None | Possible values: continuous_a2c (expects sigma to be in (0, +inf)), continuous_a2c_logstd (expects sigma to be in (-inf, +inf)), a2c_discrete, a2c_multi_discrete.
network |  |  | Network description.
name | actor_critic |  | Possible values: actor_critic or soft_actor_critic.
separate | False |  | Whether or not to use a separate network with the same architecture for the critic. In almost all cases, if you normalize the value it is better to keep it False.
space |  |  | Network space.
continuous |  |  | continuous or discrete.
mu_activation | None |  | Activation for mu. In almost all cases None works best, but tanh may be worth trying.
sigma_activation | None |  | Activation for sigma. Will be treated as log(sigma) or sigma depending on the model.
mu_init |  |  | Initializer for mu.
name | default |  |
sigma_init |  |  | Initializer for sigma. If you are using the logstd model, 0 is a good value.
name | const_initializer |  |
val | 0 |  |
fixed_sigma | True |  | If True, the sigma vector does not depend on the input.
cnn |  |  | Convolution block.
type | conv2d |  | Type: currently two types are supported: conv2d or conv1d.
activation | elu |  | Activation between conv layers.
initializer |  |  | Initializer. I took some names from TensorFlow.
name | glorot_normal_initializer |  | Initializer name.
gain | 1.4142 |  | Additional parameter.
convs |  |  | Convolution layers. Same parameters as in torch.
filters | 32 |  | Number of filters.
kernel_size | 8 |  | Kernel size.
strides | 4 |  | Strides.
padding | 0 |  | Padding.
filters | 64 |  | Next convolution layer info.
kernel_size | 4 |  |
strides | 2 |  |
padding | 0 |  |
filters | 64 |  |
kernel_size | 3 |  |
strides | 1 |  |
padding | 0 |  |
mlp |  |  | MLP block. Convolution is supported too. See other config examples.
units |  |  | Array of sizes of the MLP layers, for example: [512, 256, 128].
d2rl | False |  | Use the D2RL architecture from https://arxiv.org/abs/2010.09163.
activation | elu |  | Activations between dense layers.
initializer |  |  | Initializer.
name | default |  | Initializer name.
rnn |  |  | RNN block.
name | lstm |  | RNN layer name. lstm and gru are supported.
units | 256 |  | Number of units.
layers | 1 |  | Number of layers.
before_mlp | False | False | Whether to apply the RNN before the MLP block.
config |  |  | RL config block.
reward_shaper |  |  | Reward shaper. Can apply simple transformations.
min_val | -1 |  | You can apply min_val, max_val, scale and shift.
scale_value | 0.1 | 1 |
normalize_advantage | True | True | Normalize advantage.
gamma | 0.995 |  | Reward discount.
tau | 0.95 |  | Lambda for GAE. Called tau by mistake a long time ago because lambda is a keyword in Python :(
learning_rate | 3e-4 |  | Learning rate.
name | walker |  | Name which will be used in TensorBoard.
save_best_after | 10 |  | How many epochs to wait before starting to save the checkpoint with the best score.
score_to_win | 300 |  | Training will stop once the score is >= this value.
grad_norm | 1.5 |  | Grad norm. Applied if truncate_grads is True. A good value is in (1.0, 10.0).
entropy_coef | 0 |  | Entropy coefficient. A good value for continuous spaces is 0; for discrete, 0.02.
truncate_grads | True |  | Whether or not to truncate gradients. It stabilizes training.
env_name | BipedalWalker-v3 |  | Environment name.
e_clip | 0.2 |  | Clip parameter for the PPO loss.
clip_value | False |  | Apply clipping to the value loss. If you are using normalize_value you don't need it.
num_actors | 16 |  | Number of running actors/environments.
horizon_length | 4096 |  | Horizon length per actor. The total number of steps will be num_actors * horizon_length * num_agents (if the env is not multi-agent, num_agents == 1).
minibatch_size | 8192 |  | Minibatch size. The total number of steps must be divisible by the minibatch size.
minibatch_size_per_env | 8 |  | Minibatch size per environment. If specified, overrides the default total minibatch size with minibatch_size_per_env * num_envs.
mini_epochs | 4 |  | Number of mini-epochs. A good value is in [1, 10].
critic_coef | 2 |  | Critic coefficient. By default critic_loss = critic_coef * 1/2 * MSE.
lr_schedule | adaptive | None | Scheduler type. Could be None, linear or adaptive. Adaptive is the best for continuous control tasks. The learning rate is changed every mini-epoch.
kl_threshold | 0.008 |  | KL threshold for the adaptive schedule. If KL < kl_threshold/2 then lr = lr * 1.5, and the opposite otherwise.
normalize_input | True |  | Apply a running mean/std to the input.
bounds_loss_coef | 0.0 |  | Coefficient for the auxiliary loss for continuous spaces.
max_epochs | 10000 |  | Maximum number of epochs to run.
max_frames | 5000000 |  | Maximum number of frames (env steps) to run.
normalize_value | True |  | Use value running mean/std normalization.
use_diagnostics | True |  | Adds more information to TensorBoard.
value_bootstrap | True |  | Bootstrap the value when an episode is finished. Very useful for different locomotion envs.
bound_loss_type | regularisation | None | Adds an auxiliary loss for the continuous case. 'regularisation' is the sum of squared actions; 'bound' is the sum of actions higher than 1.1.
bounds_loss_coef | 0.0005 | 0 | Regularisation coefficient.
use_smooth_clamp | False |  | Use a smooth clamp instead of the regular one for clipping.
zero_rnn_on_done | False | True | If False, the RNN internal state is not reset (set to 0) when an environment is reset. Could improve training in some cases, for example when domain randomization is on.
player |  |  | Player configuration block.
render | True | False | Render the environment.
deterministic | True | True | Use a deterministic policy (argmax or mu) or a stochastic one.
use_vecenv | True | False | Use vecenv to create the environment for the player.
games_num | 200 |  | Number of games to run in player mode.
env_config |  |  | Env configuration block. It goes directly to the environment. This example was taken from my Atari wrapper.
skip | 4 |  | Number of frames to skip.
name | BreakoutNoFrameskip-v4 |  | The exact name of an (Atari) gym env. This is just an example; depending on the training env this parameter can differ.
evaluation | True | False | Enables the evaluation feature for inference while training.
update_checkpoint_freq | 100 | 100 | Frequency in number of steps to look for new checkpoints.
dir_to_monitor |  |  | Directory to search for checkpoints in during evaluation.
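
In the YAML files under rl_games/configs these fields are nested beneath a top-level params key, grouped into the algo, model, network and config blocks listed above. As a minimal sketch (the config path is one of the files shipped with the repository; the max_epochs override is purely illustrative), a config can also be loaded and launched programmatically through the same Runner that runner.py uses:

import yaml
from rl_games.torch_runner import Runner

# load a shipped config; any field from the table can be changed on the dict
with open('rl_games/configs/ppo_cartpole.yaml') as f:
    config = yaml.safe_load(f)
config['params']['config']['max_epochs'] = 200  # illustrative override

runner = Runner()
runner.load(config)
runner.reset()
runner.run({'train': True, 'play': False})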

Custom network example:

A simple test network is provided as an example. It takes a dictionary observation. To register it, you can add the following code to your __init__.py:

from rl_games.envs.test_network import TestNetBuilder 
from rl_games.algos_torch import model_builder
model_builder.register_network('testnet', TestNetBuilder)
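
Once registered, the custom network can be selected from a training config by setting the network block's name field to testnet (the name passed to register_network).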

A simple test environment is also provided as an example environment.

Additional environment supported properties and functions

Field | Default Value | Description
--- | --- | ---
use_central_value | False | If True, the returned observation is expected to be a dict with 'obs' and 'state'.
value_size | 1 | Shape of the returned rewards. The network will support multi-head value automatically.
concat_infos | False | Whether the default vecenv should convert a list of dicts to a dict of lists. Very useful if you want to use value bootstrapping; in that case you need to always return 'time_outs' : True or False from the env.
get_number_of_agents(self) | 1 | Returns the number of agents in the environment.
has_action_mask(self) | False | Returns True if the environment has an invalid-action mask.
get_action_mask(self) | None | Returns action masks if has_action_mask is True. A good example is the SMAC env.
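
As a sketch of how these hooks fit together, the hypothetical environment below (the class name, spaces and returned values are placeholders, not part of rl_games) exposes the attributes and methods from the table and returns 'time_outs' in the step info so that value_bootstrap can be used:

import gym
import numpy as np

class MyMaskedEnv(gym.Env):
    """Hypothetical gym-style env exposing the optional rl_games hooks above."""

    def __init__(self, **kwargs):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(8,))
        self.action_space = gym.spaces.Discrete(4)
        # optional attributes read by rl_games (defaults shown in the table)
        self.use_central_value = False
        self.value_size = 1
        self.concat_infos = False

    def get_number_of_agents(self):
        return 1

    def has_action_mask(self):
        return True

    def get_action_mask(self):
        # 1 = action allowed, 0 = invalid action
        return np.ones(self.action_space.n, dtype=np.uint8)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        obs = self.observation_space.sample()
        reward, done = 0.0, False
        # 'time_outs' lets the trainer distinguish timeouts from real terminations
        # when value_bootstrap is enabled
        info = {'time_outs': False}
        return obs, reward, done, info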

Release Notes

1.6.1

1.6.0

1.5.2

1.5.1

1.5.0

1.4.0

1.3.2

1.3.1

1.3.0

1.2.0

1.1.4

1.1.3

1.1.0

Troubleshooting

Known issues