<h1 align="center"> <p>TRIL</p></h1> <h3 align="center"> <p>Transformer Reinforcement and Imitation Learning Library</p> </h3>
TRIL is a modular library for Reinforcement Learning (RL) and Imitation Learning (IL) algorithm development with transformers. We build directly on top of the transformers, accelerate, and peft libraries by 🤗 Hugging Face. That way TRIL is able to support open-sourced pretrained models, distributed computing, and parameter-efficient training. Note that we currently support most decoder and encoder-decoder architectures available in transformers.
Supported Algorithms:
- Behavior Cloning (i.e. Supervised Fine Tuning)
- Proximal Policy Optimization (PPO) (https://arxiv.org/abs/1707.06347)
- Generative Adversarial Imitation Learning (GAIL) (https://arxiv.org/abs/1606.03476)
- PPO++ (https://arxiv.org/pdf/2306.11816)
- AggreVaTeD (https://arxiv.org/pdf/2306.11816)
- Locally Optimal Learning to Search (LOLS) (https://arxiv.org/pdf/2306.11816)
- Direct and Differentiable Locally Optimal Learning to Search (D2LOLS) (https://arxiv.org/pdf/2306.11816)
Supported Tasks:
- IMDB Positive Sentiment (https://arxiv.org/abs/2210.01241)
- CommonGen: Common Sense Generation (https://arxiv.org/abs/1911.03705)
- TL;DR Summarization (https://arxiv.org/pdf/2203.02155.pdf)
Planned Algorithms:
- Direct Preference Optimization (DPO) (https://arxiv.org/pdf/2305.18290.pdf)
- Statistical Rejection Sampling Optimization (RSO) (https://arxiv.org/pdf/2309.06657.pdf)
- Phasic Policy Gradient (PPG) (https://arxiv.org/abs/2009.04416)
- Pairwise Proximal Policy Optimization (P3O) (https://arxiv.org/pdf/2310.00212.pdf)
- Advantage-Induced Policy Alignment (APA) (https://arxiv.org/pdf/2306.02231.pdf)
- Advantage-Leftover Lunch RL (A-LoL) (https://arxiv.org/abs/2305.14718)
Planned Tasks:
- Helpfulness and Harmlessness (https://arxiv.org/pdf/2204.05862.pdf)
Installation
To install tril, run:
pip install tril
For the run scripts and example usage scripts, please see the repository.
To set up a development environment, we use conda to manage the Python environment. To install TRIL from source, please follow these steps:
git clone https://github.com/Cornell-RL/tril.git
cd tril
conda create -n tril python=3.10
conda activate tril
pip install -e .
Optionally, for caption_metrics such as CIDEr-D and SPICE, please install these additional dependencies:
# Spacy model install
python -m spacy download en_core_web_sm
# CoreNLP library install
cd src/tril/metrics/caption_metrics/spice && bash get_stanford_models.sh
Example Scripts
In the examples directory, there are example scripts to run TRIL algorithms on IMDB positive sentiment generation using PyTorch Fully Sharded Data Parallel (FSDP) and on TL;DR summarization using DeepSpeed. Each script is named in the format <task>_<alg>.sh. Run each experiment as follows:
./examples/<task>/<script>
Within each script, the command has the form:
accelerate launch --config_file <accelerate config> [accelerate args] main.py task=<task config> alg=<alg config> [hydra CLI config specification]
Please see the accelerate launch tutorial for how to launch jobs with accelerate. We provide examples of different accelerate configs in the accelerate_cfgs directory. For more details on Hydra CLI and config usage, please see this tutorial.
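As a rough illustration of how the parts of that command fit together, the sketch below assembles an equivalent argument list in Python. The helper build_launch_command and the example values (accelerate_cfgs/my_config.yaml, imdb, ppo, demo) are hypothetical placeholders and not names taken from the repository. It assumes the standard accelerate CLI entry point, accelerate launch --config_file.

```python
import shlex
from typing import List, Optional


def build_launch_command(
    accelerate_config: str,
    task: str,
    alg: str,
    overrides: Optional[List[str]] = None,
) -> List[str]:
    """Assemble an `accelerate launch` argv for main.py with Hydra overrides."""
    cmd = ["accelerate", "launch", "--config_file", accelerate_config, "main.py"]
    # Hydra selects config groups with key=value arguments.
    cmd += [f"task={task}", f"alg={alg}"]
    # Any additional Hydra overrides are appended last.
    cmd += list(overrides or [])
    return cmd


# Hypothetical config path and names, for illustration only.
command = build_launch_command(
    "accelerate_cfgs/my_config.yaml",
    task="imdb",
    alg="ppo",
    overrides=["experiment_name=demo", "log_to_wandb=False"],
)
print(shlex.join(command))
```

Building the command as a list rather than a single string keeps each Hydra override as its own argument, which avoids quoting problems if the list is later passed to subprocess.run.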
Usage Example
Here is a minimal example of running PPO with TRIL:
import logging

import hydra
from accelerate import Accelerator
from omegaconf import OmegaConf
from tril import tril_run
from tril.logging import Tracker
from tril.algorithms import PPO

@hydra.main(version_base=None, config_path="cfgs", config_name="config") # Hydra Decorator for Config
@tril_run # TRIL decorator for hydra config processing
def run_ppo(cfg):
    # Initialize accelerator for distributed computing
    accelerator = Accelerator()

    # Grab experiment save directory from Hydra
    save_path = hydra.core.hydra_config.HydraConfig.get().runtime.output_dir

    # Instantiate TRIL logger for WandB and CLI logging/saving
    tracker = Tracker(
        save_path,
        OmegaConf.to_container(cfg, resolve=True),
        cfg.project_name,
        cfg.experiment_name,
        cfg.entity_name,
        cfg.log_to_wandb,
        log_level=logging.INFO,
        is_main_process=accelerator.is_main_process,
    )

    # Instantiate Algorithm
    ppo = PPO(cfg, accelerator, tracker)

    # Call learn to train the LLM
    ppo.learn()

if __name__ == '__main__':
    run_ppo()
TRIL also provides an AlgorithmRegistry to instantiate algorithms. Please see our main.py to see how our scripts instantiate the algorithms. The list of available algorithms can be seen in the configs in cfgs/task.
Current Task/Algorithm Support Matrix
Algorithm | IMDB | CommonGen | TL;DR |
---|---|---|---|
PPO | ✅ | ✅ | ✅ |
PPO++ | ✅ | ✅ | ✅ |
AggreVaTeD | ✅ | ✅ | ✅ |
LOLS | ✅ | ✅ | ✅ |
D2LOLS | ✅ | ✅ | ✅ |
BC | ✅ | ✅ | ✅ |
GAIL | ✅ | | |
Code Structure
The directory structure of the configs, run script, and TRIL components looks like this.
├── cfgs                    <- Hydra configs
│   ├── alg                 <- Algorithm configs (e.g. PPO)
│   ├── task                <- Task configs (e.g. TL;DR summarization)
│   ├── logging             <- Logging configs (e.g. WandB)
│   │
│   └── config.yaml         <- Main config for training
│
├── accelerate_cfgs         <- Accelerate configs
│
├── main.py                 <- TRIL main function
│
├── tril                    <- TRIL src
│   ├── algorithms          <- Algorithm implementations
│   ├── buffers             <- Data Buffer (e.g. OnlineBuffer, PromptBuffer)
│   ├── metrics             <- Evaluation Metrics
│   ├── policies            <- Language Model Policies (e.g. Actor, ActorCritic)
│   ├── rewards             <- Reward Functions
│   ├── tasks               <- Supported Tasks
│   ├── utils               <- Helper functions for TRIL
│   │
│   ├── agent.py            <- Agent contains all torch.nn Modules (i.e. Policy and Reward)
│   ├── base_algorithm.py   <- Algorithm abstract class
│   ├── base_metric.py      <- Metric abstract class
│   ├── base_reward.py      <- Reward abstract class
│   ├── base_task.py        <- Task abstract class
│   └── logging.py          <- TRIL Logger
In each directory's __init__.py, there is a registry to register all supported algorithms, metrics, rewards, and tasks. When extending TRIL, please add your addition to the corresponding registry.
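To make the registry idea concrete, here is a minimal, self-contained sketch of the general pattern: a class that maps string names to implementations so that configs can refer to them by name. BaseReward, RewardRegistry, LengthReward, and their method names are all hypothetical stand-ins written for this illustration. They are not TRIL's actual classes or signatures, so check the relevant __init__.py for the real interface before extending the library.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class BaseReward(ABC):
    """Hypothetical stand-in for a reward abstract class."""

    @abstractmethod
    def compute_reward(self, texts: List[str]) -> List[float]:
        ...


class RewardRegistry:
    """Maps string ids to reward classes so configs can refer to them by name."""

    _registry: Dict[str, Type[BaseReward]] = {}

    @classmethod
    def add(cls, name: str, reward_cls: Type[BaseReward]) -> None:
        # Refuse duplicate names so one entry cannot silently shadow another.
        if name in cls._registry:
            raise ValueError(f"{name!r} is already registered")
        cls._registry[name] = reward_cls

    @classmethod
    def get(cls, name: str, **kwargs) -> BaseReward:
        # Look up the class by name and instantiate it with the given kwargs.
        if name not in cls._registry:
            raise KeyError(f"Unknown reward {name!r}")
        return cls._registry[name](**kwargs)


class LengthReward(BaseReward):
    """Toy reward: number of whitespace-separated tokens, capped at max_len."""

    def __init__(self, max_len: int = 10) -> None:
        self.max_len = max_len

    def compute_reward(self, texts: List[str]) -> List[float]:
        return [float(min(len(t.split()), self.max_len)) for t in texts]


RewardRegistry.add("length", LengthReward)
reward = RewardRegistry.get("length", max_len=3)
print(reward.compute_reward(["a b", "a b c d e"]))  # [2.0, 3.0]
```

With this pattern, adding a new component amounts to defining the subclass and adding one registration line, after which it can be selected by name from a config.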
Logging
TRIL supports Weights and Biases logging. Please enter your wandb details, such as entity_name and project_name, into cfgs/logging/wandb.yaml. If you would not like to log to wandb, please set log_to_wandb=False.
By default, we save training and evaluation information in outputs/<experiment_name>/<datetime>. You can define experiment_name in cfgs/config.yaml or through the Hydra CLI, main.py experiment_name=<name>.
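For orientation, the following sketch shows how a path of the form outputs/<experiment_name>/<datetime> could be composed. The helper make_output_dir and the timestamp format are assumptions made for this illustration. In TRIL the actual directory is produced by Hydra from the config, so the real timestamp layout may differ.

```python
from datetime import datetime
from pathlib import Path


def make_output_dir(experiment_name: str, now: datetime, root: str = "outputs") -> Path:
    """Build <root>/<experiment_name>/<datetime> without creating it on disk."""
    # Assumed timestamp format; the real one is defined by the Hydra config.
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return Path(root) / experiment_name / stamp


path = make_output_dir("demo", datetime(2023, 11, 1, 9, 30, 0))
print(path.as_posix())  # outputs/demo/2023-11-01_09-30-00
```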
Example WandB Reports
Here is an example WandB Report showing what the logging looks like when running multiple different algorithms.
Citing TRIL
If you use TRIL in your publication, please cite it by using the following BibTeX entry.
@misc{TRIL,
title={TRIL: Transformers Reinforcement and Imitation Learning Library},
author={Jonathan D Chang and Kiante Brantley and Rajkumar Ramamurthy and Dipendra Misra and Wen Sun},
howpublished={\url{https://github.com/Cornell-RL/tril}},
year={2023}
}
Here is the citation of the accompanying paper for many of the supported algorithms.
@misc{chang2023learning,
title={Learning to Generate Better Than Your LLM},
author={Jonathan D. Chang and Kiante Brantley and Rajkumar Ramamurthy and Dipendra Misra and Wen Sun},
year={2023},
eprint={2306.11816},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Acknowledgements
We would like to acknowledge RL4LMs, TRL, and TRLx for being inspirations for this library.