Q*BERT
Code accompanying the paper "How to Avoid Being Eaten by a Grue: Structured Exploration Strategies for Textual Worlds" by Prithviraj Ammanabrolu, Ethan Tien, Matthew Hausknecht, and Mark O. Riedl.
Please use this BibTeX entry to cite us:
@article{ammanabrolu20how,
title={How to Avoid Being Eaten by a Grue: Structured Exploration Strategies for Textual Worlds},
author={Ammanabrolu, Prithviraj and Tien, Ethan and Hausknecht, Matthew and Riedl, Mark O.},
journal={CoRR},
year={2020},
url={http://arxiv.org/abs/2006.07409},
volume={abs/2006.07409}
}
Structured exploration using knowledge graph A2C agents. The overall architecture performs one-step knowledge graph extraction in the Jericho-QA format at time step t: at each step, the ALBERT-QA model extracts a relevant entity set V_t by answering questions about the observation, and V_t is then used to update the knowledge graph.
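The update step described above can be pictured with a minimal sketch. It assumes the answers have already been extracted and uses a hypothetical triple schema (`you in <location>`, `<object> in <location>`, `you have <item>`); the function name and arguments are illustrative and are not taken from this repository.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def update_knowledge_graph(
    graph: Set[Triple],
    location: str,
    objects_here: List[str],
    inventory: List[str],
) -> Set[Triple]:
    """Merge one step's extracted entities into the graph.

    Returns the triples that were not in the graph before this step.
    The triple schema is an assumption made for illustration only.
    """
    candidates: Set[Triple] = {("you", "in", location)}
    candidates |= {(obj, "in", location) for obj in objects_here}
    candidates |= {("you", "have", item) for item in inventory}
    new_triples = candidates - graph
    graph |= new_triples  # update the graph in place
    return new_triples


# Illustrative answers for a single observation.
graph: Set[Triple] = set()
added = update_knowledge_graph(graph, "kitchen", ["table", "sack"], ["lamp"])
```

Returning only the newly added triples makes it easy to see how much a single observation changed the graph.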
The underlying A2C code is adapted from https://github.com/rajammanabrolu/KG-A2C. The Go-Explore code is adapted from https://github.com/uber-research/go-explore.
Quickstart
Step 1: Install dependencies: Jericho==2.4.2, Redis, PyTorch >= 1.2
The full list of dependencies is in the conda environment file environment.yml:
conda env create -f environment.yml
source activate qbert
python -m spacy download en
Step 2: Download ROM files for games from https://github.com/BYU-PCCL/z-machine-games/archive/master.zip
Step 3: Train the ALBERT-QA model.
Jericho-QA data files can be downloaded here.
cd qbert/extraction
python run_squad.py --model_type albert --model_name_or_path albert-large-v2 --do_train --train_file data/cleaned_qa_train.json --predict_file data/cleaned_qa_dev.json --per_gpu_eval_batch_size 8 --learning_rate 3e-5 --max_seq_length 512 --doc_stride 128 --output_dir ./models/ --warmup_steps 814 --max_steps 8144 --version_2_with_negative --gradient_accumulation_steps 24 --overwrite_output_dir
(Optional) Evaluate the ALBERT-QA model:
cd qbert/extraction
python run_squad.py --model_type albert --model_name_or_path model_name_here --do_eval --train_file data/cleaned_qa_train.json --predict_file data/cleaned_qa_dev.json --per_gpu_eval_batch_size 8 --learning_rate 3e-5 --max_seq_length 512 --doc_stride 128 --output_dir ./models/ --warmup_steps 814 --max_steps 8144 --version_2_with_negative --gradient_accumulation_steps 24 --overwrite_output_dir
Step 4: Train Q*BERT
cd qbert
mkdir -p models/checkpoints
python train.py --rom_file_path path_to_your_rom --tsv_file ../data/rom_name_here --attr_file attrs/rom_name_here --training_type trainingtype --reward_type rew
For example, to train MC!Q*BERT on the game zork1 with the Game+IM reward type:
cd qbert
mkdir -p models/checkpoints
python train.py --rom_file_path roms/zork1.z5 --tsv_file ../data/zork1_entity2id.tsv --attr_file attrs/zork1_attr.txt --training_type chained --reward_type game_and_IM
This will produce a number of files, including progress.csv, which lists averaged scores and other metrics recorded during agent exploration and can be used for evaluation and analysis.
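As a starting point for analysis, the log can be read with the standard library. This is a minimal sketch: it assumes progress.csv is a regular comma-separated file with a header row and makes no assumption about which columns it contains. The helper names `load_progress` and `column_names` are hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_progress(path: str) -> List[Dict[str, str]]:
    """Read a CSV log into a list of row dictionaries keyed by column name."""
    with Path(path).open(newline="") as handle:
        return list(csv.DictReader(handle))


def column_names(rows: List[Dict[str, str]]) -> List[str]:
    """Return the column names of the log, or an empty list if it has no rows."""
    return list(rows[0].keys()) if rows else []


# Only inspect the file if a training run has already produced it.
if Path("progress.csv").exists():
    rows = load_progress("progress.csv")
    print(column_names(rows))
    print(f"{len(rows)} logged rows")
```

Printing the column names first is a safe way to find out which metrics a given run actually logged before plotting anything.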
Q*BERT flags
--training_type
one of ['base', 'chained', 'goexplore'], which trains base Q*BERT, MC!Q*BERT, or GO!Q*BERT, respectively
--reward_type
one of ['game_only', 'IM_only', 'game_and_IM'], where IM means Intrinsic Motivation (calculated from the size of the set of all edges seen so far in the KG)
--intrinsic_motivation_factor
a float constant by which the IM reward is multiplied (only used with the IM_only reward). The game_and_IM reward is computed as base_score + IM * (base_score + epsilon) / max_game_score
--goexplore_logger
Go-Explore logger that records each cell exploration and its observation
--extraction
confidence threshold for ALBERT-QA entity extraction
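The reward variants listed under --reward_type and --intrinsic_motivation_factor can be written out as a short sketch. The game_and_IM formula is the one given above. Treating IM as the number of KG edges not seen before is an assumed reading of the flag description, and all function names and example values here are hypothetical rather than taken from train.py.

```python
from typing import Set, Tuple

Edge = Tuple[str, str, str]  # (subject, relation, object)


def count_new_edges(seen_edges: Set[Edge], step_edges: Set[Edge]) -> int:
    """Assumed IM signal: number of edges not seen before; records them as seen."""
    new_edges = step_edges - seen_edges
    seen_edges |= new_edges
    return len(new_edges)


def im_only_reward(im: float, intrinsic_motivation_factor: float) -> float:
    """IM_only: the IM reward scaled by the constant factor."""
    return intrinsic_motivation_factor * im


def game_and_im_reward(
    base_score: float, im: float, max_game_score: float, epsilon: float = 1e-6
) -> float:
    """game_and_IM: base_score + IM * (base_score + epsilon) / max_game_score."""
    return base_score + im * (base_score + epsilon) / max_game_score


# Illustrative values only.
seen: Set[Edge] = set()
im = count_new_edges(seen, {("you", "in", "kitchen")})
reward = game_and_im_reward(base_score=10, im=im, max_game_score=350)
```

Because the IM term is scaled by (base_score + epsilon) / max_game_score, it stays close to zero while the game score is zero and grows as the agent collects more game reward.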
MC!Q*BERT only flags
--patience
the maximum number of steps taken before a 'bottleneck' is triggered and training is refreshed from the previously best seen state
--buffer_size
the maximum number of valid steps leading up to the current state that are kept so the agent can step back through them
--patience_valid_only
an option to only count towards patience when a valid action is taken
--patience_batch_factor
if patience_valid_only is True, a 'bottleneck' is triggered when this percentage of a batch has taken more valid steps than patience allows
--chained_logger
location of the chained logger, which records the steps where bottlenecks are triggered and the observation at each of those states
--clear_kg_on_reset
boolean; whether to clear the KG upon refresh
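How --patience, --buffer_size, and --patience_valid_only might interact is sketched below for a single environment. This is a simplified illustration, not the repository's implementation: it assumes a bottleneck means "no score improvement for patience steps", and the class name `BottleneckTracker` and its attributes are hypothetical.

```python
from collections import deque
from typing import Deque, List


class BottleneckTracker:
    """Tracks steps without improvement and a bounded buffer of valid states."""

    def __init__(self, patience: int, buffer_size: int, valid_only: bool = False) -> None:
        self.patience = patience
        self.valid_only = valid_only
        self.buffer: Deque[str] = deque(maxlen=buffer_size)  # most recent valid states
        self.best_score = float("-inf")
        self.best_buffer: List[str] = []  # buffer snapshot at the best seen state
        self.stale_steps = 0

    def step(self, state: str, score: float, valid: bool = True) -> bool:
        """Record one step; return True when a bottleneck is triggered."""
        if valid:
            self.buffer.append(state)
        if score > self.best_score:
            # New best state: remember how we got here and reset the counter.
            self.best_score = score
            self.best_buffer = list(self.buffer)
            self.stale_steps = 0
            return False
        if valid or not self.valid_only:
            self.stale_steps += 1
        if self.stale_steps >= self.patience:
            self.stale_steps = 0
            return True  # caller would restart from the states in best_buffer
        return False


tracker = BottleneckTracker(patience=3, buffer_size=2)
scores = [("a", 0), ("b", 5), ("c", 5), ("d", 5), ("e", 5)]
triggered = [tracker.step(state, score) for state, score in scores]
```

When `step` returns True, the states saved in `best_buffer` are the candidates to step back through; with `valid_only=True`, invalid actions do not count towards patience.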