Latent Action Q-Learning

Installation

This project was developed using Python 3.8. Install dependencies using pip:

pip install -r requirements.txt

Please also install torch==1.8.1 and torchvision==0.9.1 separately, following the official PyTorch installation instructions. To set up MuJoCo, follow the instructions at https://github.com/openai/mujoco-py.
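For torch and torchvision, a default build can usually be installed directly from PyPI; the exact command depends on your CUDA setup, so follow the PyTorch instructions if this does not match your system:

pip install torch==1.8.1 torchvision==0.9.1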

The kitchen, maze, and ant environments require a modified version of d4rl (https://github.com/rail-berkeley/d4rl) that has been set up to support a learned reward function. That repo can be found at https://github.com/MatthewChang/d4rl_learned_reward. Install it by cloning the repo and running the commands below.

git clone git@github.com:MatthewChang/d4rl_learned_reward.git
pip install "git+https://github.com/aravindr93/mjrl@master#egg=mjrl"
pip install -e ./d4rl_learned_reward

Experiments in visual navigation are based on the repo for 'Semantic Visual Navigation by Watching YouTube Videos' (the VLV repo: https://github.com/MatthewChang/video-dqn). Follow the setup instructions in that repo for installation.

Usage

Data Generation

To generate data for the kitchen environment, run

python kitchen/gen_data.py

For the maze environment, run

python maze2d/gen_data.py maze2d-medium-v1 --skip 2 --num-actions 4

These scripts render data into their respective sub-folders.

To generate data for visual navigation, run

VLV_LOCATION=[VLV_LOCATION] GIBSON_LOCATION=[GIBSON_LOCATION] python vis_nav/generate_data.py [LOCATION_TO_WRITE_DATA]

filling in the location of the VLV repo and of the Gibson meshes installed by following the instructions in the VLV repo.
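For example, assuming the VLV repo is cloned at ~/video-dqn, the Gibson meshes live at ~/gibson, and the data should be written to vis_nav/data (all three paths are hypothetical placeholders), the invocation would look like:

VLV_LOCATION=~/video-dqn GIBSON_LOCATION=~/gibson python vis_nav/generate_data.py vis_nav/data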

Data for the gridworld environment is already included in this repo.

To generate data for Freeway, we use the code from 'An Optimistic Perspective on Offline Reinforcement Learning' (https://github.com/google-research/batch_rl), crucially with sticky actions turned off. Store this data in freeway/batch_rl_data.

Training Latent Action Forward Models

cd env
python latent_action_mining.py --gpu [gpu_id]

env can be one of {vis_nav, freeway, maze2d, kitchen}.

This writes the models to env/lam_runs/repro.
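For example, to train the latent action model for the kitchen environment on GPU 0 (the GPU id is just an illustration):

cd kitchen
python latent_action_mining.py --gpu 0

The resulting models are then written to kitchen/lam_runs/repro.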

Train the latent action model on gridworld data with

python gridworld/latent_action_mining.py --gpu 1 --bottleneck_size 8 --batch_norm --logdir grid-bs8-ss6-center-clean --step_size 6 --logdir_prefix output-grid-v2

Saving Latent Action Labels

cd env
python save_actions.py

env can be one of {vis_nav, freeway, maze2d, kitchen}.
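For example, to save latent action labels for the maze environment:

cd maze2d
python save_actions.py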

Generating Value Functions

cd env
python ./train_q_network.py -g [gpu_id] configs/experiments/real_data
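For example, to train the value function for freeway on GPU 0 (the GPU id is illustrative):

cd freeway
python ./train_q_network.py -g 0 configs/experiments/real_data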

For the gridworld, after training the latent action model above, the value function is trained with

python gridworld/generate_value_function.py --model learned --model_path output-grid-v2/grid-bs8-ss6-center-clean/model-70000.pth --bottleneck_size 8 --batch_norm

Model Selection

cd env
python spearman.py

For visual navigation, please follow the VLV repo for 'Semantic Visual Navigation by Watching YouTube Videos' (https://github.com/MatthewChang/video-dqn) to obtain the ground truth value function.

This writes out a file, spearman.npy. Using the Spearman values in this file, copy the checkpoint file at the 95th percentile to value_fuctions/env/.
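As a rough guide, the following sketch locates the checkpoint whose score falls nearest the 95th percentile. It assumes spearman.npy stores one Spearman correlation per saved checkpoint, in the order the checkpoints were written; check the file's actual layout before relying on it.

import numpy as np

# Assumption: spearman.npy holds one Spearman correlation per checkpoint,
# in the order the checkpoints were saved.
scores = np.load('spearman.npy')
cutoff = np.percentile(scores, 95)
idx = int(np.argmin(np.abs(scores - cutoff)))  # checkpoint closest to the cutoff
print('Checkpoint index near the 95th percentile:', idx)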

Note: if you have trouble loading the maze2d DDPG model, try using stable-baselines==2.9.0.

Evaluation

After learning a value function, you can evaluate agent performance with densified reward using

python ./evaluate.py --env [env_name] --model [path_to_model]

where env_name is 'kitchen', 'ant', or 'maze', and path_to_model is the path to the model produced by the value function generation script. For example, if you generated a value function for the maze environment using the command above, you can run

python ./evaluate.py --env maze --model [path_to_model]

Results are written as TensorBoard archives in ./runs.

Evaluation on the gridworld is done with tabular Q-learning and can be launched with

python gridworld/evaluate.py output-grid-v2/grid-bs8-ss6-center-clean/value_function.npy