RLlib Reference Results

Benchmarks of RLlib algorithms against published results. These benchmarks are a work in progress. For other results to compare against, see yarlp and more plots from OpenAI.

Ape-X Distributed Prioritized Experience Replay

rllib train -f atari-apex/atari-apex.yaml

Comparison of RLlib Ape-X to Async DQN after 10M time-steps (40M frames). Results compared to learning curves from Mnih et al, 2016 extracted at 10M time-steps from Figure 3.

| env | RLlib Ape-X 8-workers | Mnih et al Async DQN 16-workers | Mnih et al DQN 1-worker |
| --- | --- | --- | --- |
| BeamRider | 6134 | ~6000 | ~3000 |
| Breakout | 123 | ~50 | ~10 |
| QBert | 15302 | ~1200 | ~500 |
| SpaceInvaders | 686 | ~600 | ~500 |

Here we use only eight workers per environment in order to run all experiments concurrently on a single g3.16xl machine. Further speedups may be obtained by using more workers. Comparing wall-time performance after 1 hour of training:

| env | RLlib Ape-X 8-workers | Mnih et al Async DQN 16-workers | Mnih et al DQN 1-worker |
| --- | --- | --- | --- |
| BeamRider | 4873 | ~1000 | ~300 |
| Breakout | 77 | ~10 | ~1 |
| QBert | 4083 | ~500 | ~150 |
| SpaceInvaders | 646 | ~300 | ~160 |

Ape-X plots: apex
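
For reference, a minimal sketch of launching the 8-worker Ape-X run described above through RLlib's Python API. Only the worker count, env, and step budget come from this page; everything else is left at RLlib defaults, and the tuned atari-apex.yaml remains the authoritative config.

```python
# Minimal sketch of an 8-worker Ape-X run via the Python API (not the tuned
# atari-apex.yaml settings).
import ray
from ray import tune

ray.init()
tune.run(
    "APEX",
    stop={"timesteps_total": 10_000_000},  # 10M time-steps = 40M frames
    config={
        "env": "BreakoutNoFrameskip-v4",
        "num_workers": 8,   # 8 rollout workers per environment, as in the tables above
        "num_gpus": 1,      # adjust to the available hardware
    },
)
```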

IMPALA and A2C

rllib train -f atari-impala/atari-impala.yaml

rllib train -f atari-a2c/atari-a2c.yaml

RLlib IMPALA and A2C on 10M time-steps (40M frames). Results compared to learning curves from Mnih et al, 2016 extracted at 10M time-steps from Figure 3.

| env | RLlib IMPALA 32-workers | RLlib A2C 5-workers | Mnih et al A3C 16-workers |
| --- | --- | --- | --- |
| BeamRider | 2071 | 1401 | ~3000 |
| Breakout | 385 | 374 | ~150 |
| QBert | 4068 | 3620 | ~1000 |
| SpaceInvaders | 719 | 692 | ~600 |

IMPALA and A2C vs A3C after 1 hour of training:

| env | RLlib IMPALA 32-workers | RLlib A2C 5-workers | Mnih et al A3C 16-workers |
| --- | --- | --- | --- |
| BeamRider | 3181 | 874 | ~1000 |
| Breakout | 538 | 268 | ~10 |
| QBert | 10850 | 1212 | ~500 |
| SpaceInvaders | 843 | 518 | ~300 |

IMPALA plots: tensorboard

A2C plots: tensorboard

Pong in 3 minutes

With a bit of tuning, RLlib IMPALA can solve Pong in ~3 minutes:

rllib train -f pong-speedrun/pong-impala-fast.yaml

tensorboard

DQN / Rainbow

rllib train -f atari-dqn/basic-dqn.yaml

rllib train -f atari-dqn/duel-ddqn.yaml

rllib train -f atari-dqn/dist-dqn.yaml

RLlib DQN after 10M time-steps (40M frames). Note that RLlib evaluation scores include the 1% random actions of epsilon-greedy exploration. You can expect slightly higher rewards when rolling out the policies without any exploration at all.

| env | RLlib Basic DQN | RLlib Dueling DDQN | RLlib Distributional DQN | Hessel et al. DQN | Hessel et al. Rainbow |
| --- | --- | --- | --- | --- | --- |
| BeamRider | 2869 | 1910 | 4447 | ~2000 | ~13000 |
| Breakout | 287 | 312 | 410 | ~150 | ~300 |
| QBert | 3921 | 7968 | 15780 | ~4000 | ~20000 |
| SpaceInvaders | 650 | 1001 | 1025 | ~500 | ~2000 |

Basic DQN plots: tensorboard

Dueling DDQN plots: tensorboard

Distributional DQN plots: tensorboard
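
One way to estimate the exploration-free scores mentioned above is to report separate evaluation episodes with exploration switched off. A sketch, assuming default hyperparameters otherwise (the tuned YAMLs above remain authoritative); the exploration-free returns then show up under the `evaluation` key of the results:

```python
# Sketch: train DQN while also reporting evaluation episodes with epsilon-greedy
# exploration disabled, so the evaluation metrics exclude the 1% random actions.
import ray
from ray import tune

ray.init()
tune.run(
    "DQN",
    stop={"timesteps_total": 10_000_000},
    config={
        "env": "BreakoutNoFrameskip-v4",
        "evaluation_interval": 10,        # evaluate every 10 training iterations
        "evaluation_num_episodes": 10,    # average over 10 greedy episodes
        "evaluation_config": {"explore": False},  # no epsilon-greedy during evaluation
    },
)
```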

Proximal Policy Optimization

rllib train -f atari-ppo/atari-ppo.yaml

rllib train -f halfcheetah-ppo/halfcheetah-ppo.yaml

2018-09:

RLlib PPO with 10 workers (5 envs per worker) after 10M and 25M time-steps (40M/100M frames). Note that RLlib does not use clip parameter annealing.

| env | RLlib PPO @10M | RLlib PPO @25M | Baselines PPO @10M |
| --- | --- | --- | --- |
| BeamRider | 2807 | 4480 | ~1800 |
| Breakout | 104 | 201 | ~250 |
| QBert | 11085 | 14247 | ~14000 |
| SpaceInvaders | 671 | 944 | ~800 |

tensorboard
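
The worker and vectorization layout described above maps onto two config keys. A sketch with only those values filled in; the remaining settings are illustrative defaults, not the tuned atari-ppo.yaml hyperparameters:

```python
# Sketch of the 10-worker / 5-envs-per-worker PPO layout described above.
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    stop={"timesteps_total": 25_000_000},  # 25M time-steps = 100M frames
    config={
        "env": "BreakoutNoFrameskip-v4",
        "num_workers": 10,          # 10 rollout workers
        "num_envs_per_worker": 5,   # 5 vectorized envs per worker
        "clip_param": 0.1,          # illustrative; held fixed, since RLlib does not anneal it
    },
)
```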

RLlib PPO wall-time performance vs other implementations using a single Titan XP and the same number of CPUs. Results compared to learning curves from Fan et al, 2018 extracted at 1 hour of training from Figure 7. Here we get optimal results with a vectorization of 32 environment instances per worker:

| env | RLlib PPO 16-workers | Fan et al PPO 16-workers | TF BatchPPO 16-workers |
| --- | --- | --- | --- |
| HalfCheetah | 9664 | ~7700 | ~3200 |

tensorboard

2020-01:

Same setup as the 2018-09 runs, comparing only RLlib PPO-tf vs PPO-torch.

| env | RLlib PPO @20M (tf) | RLlib PPO @20M (torch) | plot |
| --- | --- | --- | --- |
| BeamRider | 4142 | 3850 | tensorboard |
| Breakout | 132 | 166 | tensorboard |
| QBert | 7987 | 14294 | tensorboard |
| SpaceInvaders | 956 | 1016 | tensorboard |

Soft Actor Critic

rllib train -f halfcheetah-sac/halfcheetah-sac.yaml

RLlib SAC after 3M time-steps.

RLlib SAC versus the SoftLearning implementation (Haarnoja et al, 2018), benchmarked at 500k and 3M timesteps respectively.

| env | RLlib SAC @500K | Haarnoja et al SAC @500K | RLlib SAC @3M | Haarnoja et al SAC @3M |
| --- | --- | --- | --- | --- |
| HalfCheetah | 9000 | ~9000 | 13000 | ~15000 |

tensorboard

MAML

MAML uses additional metrics to measure performance; episode_reward_mean measures the agent's returns before adaptation, episode_reward_mean_adapt_N measures the agent's returns after N gradient steps of inner adaptation, and adaptation_delta measures the difference in performance before and after adaptation.
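
These metrics land in each trial's results, so they can be pulled out of the progress.csv that Tune writes. A sketch; the trial path is a placeholder, and the column names may be nested differently depending on the RLlib version:

```python
# Sketch: inspect the MAML-specific metrics from a finished trial's progress.csv.
import os
import pandas as pd

# Placeholder path; substitute your actual Tune experiment and trial directory.
progress = os.path.expanduser("~/ray_results/<experiment>/<trial>/progress.csv")
df = pd.read_csv(progress)

cols = [
    "episode_reward_mean",          # return before adaptation
    "episode_reward_mean_adapt_1",  # return after 1 inner-adaptation gradient step
    "adaptation_delta",             # improvement gained by adaptation
]
# Print only the columns that exist in this RLlib version's output.
print(df[[c for c in cols if c in df.columns]].tail())
```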

rllib train -f maml/halfcheetah-rand-direc-maml.yaml

tensorboard

rllib train -f maml/ant-rand-goal-maml.yaml

tensorboard

rllib train -f maml/pendulum-mass-maml.yaml

tensorboard

MB-MPO

rllib train -f mbmpo/halfcheetah-mbmpo.yaml

rllib train -f mbmpo/hopper-mbmpo.yaml

MBMPO uses additional metrics to measure performance. For each MBMPO iteration, MBMPO samples model-generated ("fake") data from the transition-dynamics workers and steps through MAML for N iterations. MAMLIter$i$_DynaTrajInner_$j$_episode_reward_mean corresponds to the agent's mean return across the dynamics models at the i-th MAML iteration after the j-th step of inner adaptation; for example, MAMLIter3_DynaTrajInner_2_episode_reward_mean is the return at MAML iteration 3 after 2 inner-adaptation steps.

RLlib MBMPO versus Clavera et al, 2018, benchmarked at 100k timesteps. Results reported below were run on RLlib and on the master branch of the original codebase, respectively.

| env | RLlib MBMPO @100K | Clavera et al MBMPO @100K |
| --- | --- | --- |
| HalfCheetah | 520 | ~550 |
| Hopper | 620 | ~650 |

tensorboard

Dreamer

rllib train -f dreamer/dreamer-deepmind-control.yaml

RLlib Dreamer at 1M time-steps.

RLlib Dreamer versus the Google implementation (Danijar et al, 2020), benchmarked at 100k and 1M timesteps respectively.

| env | RLlib Dreamer @100K | Danijar et al Dreamer @100K | RLlib Dreamer @1M | Danijar et al Dreamer @1M |
| --- | --- | --- | --- | --- |
| Walker | 320 | ~250 | 920 | ~930 |
| Cheetah | 300 | ~250 | 640 | ~800 |

tensorboard

RLlib Dreamer also logs gifs of Dreamer's imagined trajectories (Top: Ground truth, Middle: Model prediction, Bottom: Delta).


CQL

rllib train -f halfcheetah-cql/halfcheetah-cql.yaml

rllib train -f halfcheetah-cql/halfcheetah-bc.yaml

Since CQL is an offline RL algorithm, CQL's returns are evaluated only during the evaluation loop (once every 1000 gradient steps for MuJoCo-based envs).

RLlib CQL versus Behavior Cloning (BC), benchmarked at 1M gradient steps over datasets derived from the D4RL benchmark (Fu et al, 2020). Results reported below were run on RLlib. The only difference between BC and CQL is the bc_iters parameter in CQL (how many iterations to run the BC loss).
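
A sketch of how the two runs are assumed to differ, using CQL's bc_iters setting; the env, dataset path, and remaining values are placeholders, not the tuned halfcheetah-cql.yaml / halfcheetah-bc.yaml settings:

```python
# Sketch of the CQL-vs-BC comparison described above: the only knob assumed to
# differ between the two runs is `bc_iters` (a very large value effectively
# recovers pure BC). Paths and values are placeholders.
import ray
from ray import tune

ray.init()
tune.run(
    "CQL",
    config={
        "env": "HalfCheetah-v3",                       # placeholder; used only for evaluation rollouts
        "input": "/path/to/d4rl_offline_dataset",      # placeholder offline dataset
        "bc_iters": 20000,                             # iterations of BC loss before the CQL loss
        # Returns come only from the periodic evaluation rollouts:
        "evaluation_interval": 1,
        "evaluation_config": {"input": "sampler", "explore": False},
    },
)
```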

RLlib's CQL is evaluated on four different environments: HalfCheetah-Random-v0 and Hopper-Random-v0 contain datasets collected by a random policy, while HalfCheetah-Medium-v0 and Hopper-Medium-v0 contain datasets collected by a policy trained 1/3 of the way through. In all envs, CQL outperforms BC by a significant margin (especially on HalfCheetah-Random-v0).

| env | RLlib BC @1M | RLlib CQL @1M |
| --- | --- | --- |
| HalfCheetah-Random-v0 | -320 | 3000 |
| Hopper-Random-v0 | 290 | 320 |
| HalfCheetah-Medium-v0 | 3450 | 3850 |
| Hopper-Medium-v0 | 1000 | 2000 |

rllib train -f cql/halfcheetah-cql.yaml & rllib train -f cql/halfcheetah-bc.yaml

tensorboard

tensorboard

rllib train -f cql/hopper-cql.yaml & rllib train -f cql/hopper-bc.yaml

tensorboard

tensorboard

Transformers

rllib train -f vizdoom-attention/vizdoom-attention.yaml

RLlib's model catalog implements a variety of models for the policy and value networks, including one that supports attention in RL. In particular, RLlib implements the Gated Transformer-XL (Parisotto et al, 2019), abbreviated as GTrXL.

GTrXL is benchmarked on the VizDoom environment, where the goal is to shoot a monster as quickly as possible. With PPO as the algorithm and GTrXL as the model, RLlib successfully solves the environment and reaches human-level performance.

| env | RLlib Transformer @2M |
| --- | --- |
| VizdoomBasic-v0 | ~75 |

tensorboard
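
For orientation, a sketch of switching PPO to the GTrXL model through RLlib's model config. The attention keys are taken from RLlib's attention-net options and the values are illustrative; the actual tuned hyperparameters and the VizDoom env setup live in vizdoom-attention.yaml.

```python
# Sketch: enable RLlib's GTrXL attention model for PPO. The attention_* values
# are illustrative, and the VizDoom gym env must be registered separately
# (not shown here).
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    stop={"timesteps_total": 2_000_000},
    config={
        "env": "VizdoomBasic-v0",  # assumes the VizDoom env is registered with gym
        "model": {
            "use_attention": True,                 # swap the default net for GTrXL
            "attention_num_transformer_units": 1,  # number of GTrXL blocks
            "attention_dim": 64,                   # transformer hidden dimension
        },
    },
)
```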