Trust Region Policy Optimization with Generalized Advantage Estimation

By Patrick Coady: Learning Artificial Intelligence

Summary

NOTE: The code has been refactored to use TensorFlow 2.0 and PyBullet (instead of MuJoCo). See the tf1_mujoco branch for the old version.

The project's original goal was to use the same algorithm to "solve" 10 MuJoCo robotic control environments, and, specifically, to achieve this without hand-tuning the hyperparameters (network sizes, learning rates, and TRPO settings) for each environment. This is challenging because the environments range from a simple cart-pole problem with a single control input to a humanoid with 17 controlled joints and 44 observed variables. The project was successful, nabbing top spots on almost all of the OpenAI Gym MuJoCo leaderboards.
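As a quick refresher on the advantage-estimation half of the title, here is a minimal NumPy sketch of generalized advantage estimation (GAE). It is a standalone illustration, not code from this repository, and the gamma and lam defaults are placeholder values rather than the settings used for the leaderboard runs.

    import numpy as np

    def gae_advantages(rewards, values, gamma=0.99, lam=0.97):
        """Generalized advantage estimates for a single episode.

        rewards: r_0 .. r_{T-1}
        values:  value estimates V(s_0) .. V(s_T), length T+1,
                 with V(s_T) = 0 if the episode ended in a terminal state
        """
        rewards = np.asarray(rewards, dtype=np.float64)
        values = np.asarray(values, dtype=np.float64)
        # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        deltas = rewards + gamma * values[1:] - values[:-1]
        # A_t = sum_{l >= 0} (gamma * lam)^l * delta_{t+l}, computed right to left
        advantages = np.zeros_like(deltas)
        running = 0.0
        for t in reversed(range(len(deltas))):
            running = deltas[t] + gamma * lam * running
            advantages[t] = running
        return advantages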

With the release of TensorFlow 2.0, I decided to dust off this project and upgrade the code. And, while I was at it, I moved from the paid MuJoCo simulator to the free PyBullet simulator.

Here are the key points:

PyBullet Gym Environments

HumanoidDeepMimicBulletEnv-v1
CartPoleBulletEnv-v1
MinitaurBulletEnv-v0
MinitaurBulletDuckEnv-v0
RacecarBulletEnv-v0
RacecarZedBulletEnv-v0
KukaBulletEnv-v0
KukaCamBulletEnv-v0
InvertedPendulumBulletEnv-v0
InvertedDoublePendulumBulletEnv-v0
InvertedPendulumSwingupBulletEnv-v0
ReacherBulletEnv-v0
PusherBulletEnv-v0
ThrowerBulletEnv-v0
StrikerBulletEnv-v0
Walker2DBulletEnv-v0
HalfCheetahBulletEnv-v0
AntBulletEnv-v0
HopperBulletEnv-v0
HumanoidBulletEnv-v0
HumanoidFlagrunBulletEnv-v0
HumanoidFlagrunHarderBulletEnv-v0

Using

I ran quick checks on three of the above environments and successfully stabilized a double-inverted pendulum and taught the "half cheetah" to run.

python train.py InvertedPendulumBulletEnv-v0
python train.py InvertedDoublePendulumBulletEnv-v0 -n 5000
python train.py HalfCheetahBulletEnv-v0 -n 5000 -b 5
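The flags are not spelled out above; my reading is that -n sets the total number of training episodes and -b the number of episodes per policy-update batch, but treat that as an assumption and check the script itself. Assuming train.py builds its command-line interface with argparse (which adds -h automatically), the full list of options is available with:

    python train.py -h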

Videos

During training, videos are periodically saved to the /tmp folder. These can be enjoyable to view, and also instructive.
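I haven't dug into exactly how the recording is wired up in this repository, but the usual mechanism with classic Gym is the Monitor wrapper. A hedged sketch, where the output directory and the every-50th-episode trigger are illustrative choices rather than the repository's actual settings:

    import gym
    import pybullet_envs
    from gym import wrappers

    env = gym.make('HalfCheetahBulletEnv-v0')
    # Write a video every 50th episode to a directory under /tmp
    env = wrappers.Monitor(env, '/tmp/halfcheetah-videos',
                           video_callable=lambda episode_id: episode_id % 50 == 0,
                           force=True)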

Dependencies

References

  1. Trust Region Policy Optimization (Schulman et al., 2015)
  2. Emergence of Locomotion Behaviours in Rich Environments (Heess et al., 2017)
  3. High-Dimensional Continuous Control Using Generalized Advantage Estimation (Schulman et al., 2016)
  4. GitHub Repository with several helpful implementation ideas (Schulman)