Extreme Q-Learning (X-QL)
[Project Page](https://div99.github.io/XQL)
Official code base for Extreme Q-Learning: MaxEnt RL without Entropy by Div Garg*, Joey Hejna*, Matthieu Geist, and Stefano Ermon. (*Equal Contribution)
This repo contains code for the two novel methods formulated in our paper: Gumbel Regression and Extreme Q-Learning (X-QL).
Gumbel Regression is a novel method that enables accurate and unbiased estimates of the partition function of a distribution using simple gradient descent.
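One way to picture this: minimizing the loss exp(e) − e − 1 on scaled errors e = (x − v)/β drives v to the scaled log-partition β log E[exp(x/β)]. Below is a minimal, self-contained PyTorch sketch (not code from this repo); the samples, temperature `beta`, and optimizer settings are arbitrary choices for illustration.

```python
import torch

beta = 1.0                       # temperature (hypothetical choice)
x = torch.randn(10_000) * 2.0    # samples whose log-partition we want to estimate

v = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([v], lr=1e-2)

for _ in range(2000):
    err = (x - v) / beta
    # Gumbel regression loss exp(e) - e - 1; its minimizer is
    # v = beta * log E[exp(x / beta)], the scaled log-partition of x.
    loss = (torch.exp(err) - err - 1.0).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Closed-form check: beta * (logsumexp(x / beta) - log N)
target = beta * (torch.logsumexp(x / beta, dim=0) - torch.log(torch.tensor(10_000.0)))
print(float(v), float(target))
```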
Extreme Q-learning (X-QL) is a simple, novel Q-learning algorithm that models the maximal soft values (LogSumExp) without needing to sample from a policy. It directly estimates the optimal Bellman operator B* in continuous action spaces, extending Q-iteration to continuous settings.
It obtains state-of-the-art results on offline RL benchmarks such as D4RL and can improve existing online RL methods like SAC and TD3. It combines max-entropy, conservative, and implicit RL in a single framework.
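Concretely, in the offline setting this Gumbel loss can be applied to the gap Q(s, a) − V(s), so that V(s) tracks the soft maximum of Q over dataset actions while Q is fit by ordinary TD regression onto r + γV(s'). The sketch below is a hedged illustration under those assumptions; `q_net`, `v_net`, the batch tensors, and the hyperparameters `beta` and `clip` are hypothetical, and the actual implementations live in the Offline and Online folders.

```python
import torch

def gumbel_value_loss(q_net, v_net, states, actions, beta=2.0, clip=7.0):
    """Hedged sketch of an X-QL-style value update (not the repo's exact code).

    Pushes V(s) toward the soft maximum beta * log E_a[exp(Q(s, a) / beta)]
    over actions in the batch, without sampling actions from a policy.
    """
    with torch.no_grad():
        q = q_net(states, actions)       # fixed Q-targets for this update
    v = v_net(states)
    err = (q - v) / beta
    err = torch.clamp(err, max=clip)     # keep exp() numerically stable
    return (torch.exp(err) - err - 1.0).mean()

# The Q-network can then be fit with ordinary TD regression, e.g.
# q_loss = ((q_net(s, a) - (r + gamma * (1 - done) * v_net(s2).detach())) ** 2).mean()
```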
Introduction
<a href="https://div99.github.io/XQL"> <p align="center"> <img class="img" src="docs/diagram.svg" style="width:400px;height:auto;"><br> </p> </a>

Modern deep reinforcement learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains with an infinite number of possible actions. In this work, we introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT), drawing inspiration from economics. By doing so, we avoid computing Q-values using out-of-distribution actions, which is often a substantial source of error. Our key insight is to introduce an objective that directly estimates the optimal soft-value function (LogSumExp) in the maximum-entropy RL setting without needing to sample from a policy.

Using EVT, we derive our Extreme Q-Learning (X-QL) framework and, consequently, online and (for the first time) offline MaxEnt Q-learning algorithms that do not explicitly require access to a policy or its entropy. Our method obtains consistently strong performance on the D4RL benchmark, outperforming prior works by 10+ points on some tasks, while offering moderate improvements over SAC and TD3 on online DM Control tasks.
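For reference, the soft value referred to above is the LogSumExp of the Q-function. With temperature β and μ denoting the action distribution the expectation is taken over (e.g., the behavior/data distribution in the offline setting), it can be written as:

```latex
V^*(s) = \beta \log \mathbb{E}_{a \sim \mu(\cdot \mid s)}\!\left[ \exp\!\left( Q^*(s, a) / \beta \right) \right]
```

The objective sketched in the sections above estimates this quantity directly from samples of μ, rather than by evaluating Q on actions drawn from a learned policy.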
Citation
@article{garg2022extreme,
  title     = {Extreme Q-Learning: MaxEnt Reinforcement Learning Without Entropy},
  author    = {Garg, Divyansh and Hejna, Joey and Geist, Matthieu and Ermon, Stefano},
  url       = {https://arxiv.org/abs/2301.02328},
  publisher = {arXiv},
  year      = {2023},
}
Key Advantages
✅ Directly models V* in continuous action spaces (Continuous Q-iteration)
✅ Implicit: no OOD sampling or actor-critic formulation required
✅ Conservative with respect to the behavior policy
✅ Improves performance on the D4RL benchmark versus similar approaches
Usage
For exploring Gumbel Regression, you can play with the Gumbel Regression notebook in Google Colab.
This repository is divided into two parts, one for the offline RL experiments and one for the online RL experiments. To install and use X-QL, follow the instructions in the Offline folder for running offline RL and in the Online folder for running online RL.
Questions
Please feel free to email us if you have any questions.
Div Garg (divgarg@stanford.edu), Joey Hejna (jhejna@stanford.edu)