Constrained Model-Based Policy Optimization
<p align="center"> <img src="https://drive.google.com/uc?export=view&id=1DcXi5wY_anmtlNeIErl1ECgKGsGi4oR1" width="80%"> </p>

This repository contains code for Constrained Model-Based Policy Optimization (CMBPO), a model-based version of Constrained Policy Optimization (Achiam et al.). Installation, execution, and code examples for reproducing the experiments described in Safe Continuous Control with Constrained Model-Based Policy Optimization are provided below.
Prerequisites
- The simulation experiments using mujoco-py require a working installation of MuJoCo 2.0 and a valid license.
- We use conda environments for installs (tested on conda 4.6 - 4.10); please refer to Anaconda for installation instructions.
Installation
- Clone this repository
git clone https://github.com/anyboby/Constrained-Model-Based-Policy-Optimization.git
- Create a conda environment using the cmbpo.yml file
cd Constrained-Model-Based-Policy-Optimization/
conda env create -f cmbpo.yml
conda activate cmbpo
pip install -e .
This should create a conda environment labeled 'cmbpo' with the necessary packages and modules. The number of required modules is limited, so in case of installation trouble it is worth taking a look at the cmbpo.yml and requirements.txt files.
Usage
To start an experiment with cmbpo, run
cmbpo run_local configs.baseconfig --config=configs.cmbpo_hcs --gpus=1 --trial-gpus=1
--config
specifies the configuration file for the experiment (here: CMBPO for HalfCheetahSafe)
--gpus
specifies the number of GPUs to use
A list of all available flags is provided in baseconfig/utils. As of writing, only local running is supported. For further options, refer to the ray documentation.
The cmbpo command uses the console scripts as an entry point for running experiments. A simple workflow for running experiments with ray-tune is illustrated in run.py, which can be executed with
python scripts/run.py configs.cmbpo_hcs
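For orientation, the sketch below shows roughly what such a ray-tune launch looks like. The train_cmbpo trainable is a hypothetical placeholder; only the config module path and the params dict convention are taken from this README:

import importlib
from ray import tune

def train_cmbpo(config):
    # hypothetical placeholder trainable; the actual training loop is constructed by the repository
    print("running task:", config["task"])

params = importlib.import_module("configs.cmbpo_hcs").params
tune.run(train_cmbpo, config=params)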
Algorithms
Constrained Model-Based Policy Optimization aims to combine Constrained Policy Optimization with model-based data augmentation while reconciling constraint satisfaction with the model errors this introduces.
This repository can therefore also be used to run experiments with model-free versions of Constrained Policy Optimization and Trust-Region Policy Optimization by configuring the use_model and constrain_cost flags accordingly in the experiment configurations (see CPO - HalfCheetahSafe and TRPO - HalfCheetahSafe):
'use_model': False, # set to True for model-based
'constrain_cost': False, # set to True for cost-constrained optimization
Adding new environments and running custom experiments
Different environments can be tested by creating a config file in the configs directory. OpenAI Gym environments can be loaded directly with the corresponding parameters, for example:
'universe': 'gym',
'task': 'HalfCheetahSafe-v2',
Environments from other sources require an entry in the ENVS_FUNCTIONS dict in the environment utils that specifies how to create an instance of the environment. For example, the Gym environments are specified with the following entries:
def get_gym_env():
    import gym
    return gym.make  # the factory is gym.make itself; environments are created via gym.make(task)

ENVS_FUNCTIONS = {
    'gym': get_gym_env()
}
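As a hedged sketch, a custom environment source could be registered in the same way; the 'my_suite' key, the MyCustomEnv class, and its import path below are hypothetical names used only for illustration:

def get_my_suite_env():
    # hypothetical factory for a custom environment suite
    from my_suite.envs import MyCustomEnv  # assumed import path

    def make(task, **kwargs):
        return MyCustomEnv(task=task, **kwargs)
    return make

ENVS_FUNCTIONS = {
    'gym': get_gym_env(),
    'my_suite': get_my_suite_env(),
}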
Model-Learning with custom environments
When using a model with custom environments, the model requires a few interfaces to work with the provided code. A learned (or handcrafted) model should inherit from the base model and specify whether rewards, costs, and termination functions are predicted alongside the dynamics.
By default, our algorithm learns to predict rewards but assumes handcrafted cost and termination functions c(s,a,s') and t(s,a,s'). When adding a new environment, these functions should be defined (if not provided by the model) in the statics file. For example, a default termination function that continues episodes for all states looks like this:
import numpy as np

def no_done(obs, act, next_obs):
    assert len(obs.shape) == len(next_obs.shape) == len(act.shape)
    done = np.zeros(shape=obs.shape[:-1], dtype=bool)  # always False
    done = done[..., None]  # add a trailing singleton dimension
    return done
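Handcrafted costs follow the same pattern. The following is a minimal sketch of a cost function in the style of the termination function above; the observation index and the velocity threshold are illustrative assumptions, not values taken from the repository:

import numpy as np

def velocity_cost(obs, act, next_obs):
    # hypothetical cost: 1 whenever the (assumed) velocity entry of the next
    # observation exceeds a threshold; index 8 and the 1.0 limit are placeholders
    assert len(obs.shape) == len(next_obs.shape) == len(act.shape)
    cost = (np.abs(next_obs[..., 8]) > 1.0).astype(np.float32)
    return cost[..., None]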
The static functions should then be linked by the environment's task name, so that the Fake Environment correctly discovers them:
TERMS_BY_TASK = {
'default':no_done,
'HalfCheetah-v2':no_done,
}
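If the cost function is also handcrafted, it can presumably be linked in the same way; the COSTS_BY_TASK name below is an assumption for illustration and should be checked against the statics file:

# hypothetical analog of TERMS_BY_TASK; the dict name and entries are illustrative
COSTS_BY_TASK = {
    'default': velocity_cost,
    'HalfCheetahSafe-v2': velocity_cost,
}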
Hyperparameters
Hyperparameters for a new experiment can be defined in the configs folder. Our config files generally follow this structure:
params = {
'universe': 'gym',
'task': 'HalfCheetahSafe-v2',
'algorithm_params': {...},
'policy_params':{...},
'buffer_params': {...},
'sampler_params': {...},
'run_params': {...},
}
Parameters specified in a config file override the base config file. For new algorithms or a new suite of environments, it might be practical to change the base config directly.
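A minimal override might therefore look like the sketch below. Only keys mentioned in this README are used; whether they belong at the top level or inside algorithm_params should be checked against the base config, and the file name is hypothetical:

# hedged sketch of a new config file, e.g. configs/my_experiment.py
params = {
    'universe': 'gym',
    'task': 'HalfCheetahSafe-v2',
    'algorithm_params': {
        'use_model': True,        # model-based (CMBPO) rather than model-free
        'constrain_cost': True,   # enforce the cost constraint
        'rollout_mode': 'uncertainty',
        'batch_size_policy': 50000,
    },
    'policy_params': {},
    'buffer_params': {},
    'sampler_params': {},
    'run_params': {},
}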
In addition to the model and policy parameters, the main parameters of concern in CMBPO define the rollout and sampling behavior of the algorithm:
'n_initial_exploration_steps': int(10000), ### number of initial exploration steps for model-learning and
# determining uncertainty calibration measurements
'sampling_alpha': 2, ### temperature for Boltzmann sampling
'rollout_mode' : 'uncertainty', ### model rollouts terminate based on per-step uncertainty
'rollout_schedule': [10, 500, 5, 30], ### if rollout_mode:'schedule' this schedule is defined as
# [min_epoch, max_epoch, min_horizon, max_horizon]
## if rollout_mode:'uncertainty', 'min_horizon' is used as
# the initial rollout horizon and adapted throughout
# training based on per-step uncertainty estimates
# (KL-Divergence).
'batch_size_policy': 50000, ### batch size per policy update
'initial_real_samples_per_epoch': 1500, ### initial number of real samples per policy update,
# adapted throughout training based on average uncertainty
# estimates (mean KL-Divergence).
'min_real_samples_per_epoch': 500, ### absolute minimum number of real samples per policy update
Logging
A range of measurements is logged automatically in tensorboard, and the parameter configuration is saved as a JSON file. The location for summaries and checkpoints can be defined by specifying a 'log_dir'
in the configuration files. By default, this location is set to '~/ray_cmbpo/{env-task}/defaults/{seed}'
and can be accessed with tensorboard via
tensorboard --logdir ~/ray_cmbpo/<env>/defaults/<seed_dir>
Acknowledgments
Several sections of this repository contain code from other repositories, notably from Tuomas Haarnoja and Kristian Hartikainen, Michael Janner, Kurtland Chua, and from CPO by Joshua Achiam and Alex Ray.