Awesome

Katakomba: Tools and Benchmarks for Data-Driven NetHack

<p align="center"><b>Katakomba</b> is an open-source benchmark for data-driven NetHack. At the moment, it provides a set of standardized datasets with familiar interfaces and offline RL baselines augmented with recurrence. Full training logs synced to the Weights&Biases are included.</p> <div align="center">

</div>

Installation

For now, Katakomba is not pip installable. However, the installation is easy. We also provide an alternative with the Dockerfile already set up to work (this is the preferred way!).

git clone https://github.com/tinkoff-ai/katakomba.git && cd katakomba
pip install -r requirements.txt

# or alternatively, you could use docker
docker build -t katakomba .
docker run --gpus all -it --rm --name katakomba katakomba

One last step is the installation of additional utils used for faster rendering of tty observations as images:

# use pip3 inside the docker container
pip install -e katakomba/utils/render_utils/

Getting Started

from katakomba.env import NetHackChallenge, OfflineNetHackChallengeWrapper
from katakomba.utils.datasets import SequentialBuffer

# The task is specified using the character field
env = NetHackChallenge (
  character = "mon-hum-neu",
  observation_keys = ["tty_chars", "tty_colors", "tty_cursor"]
)

# A convenient wrapper that provides interfaces for dataset loading, score normalization, and deathlevel extraction
env = OfflineNetHackChallengeWrapper(env)

# Several options for dataset reading (check the paper for details): 
# - from RAM, decompressed ("in_memory"): fast but requires a lot of RAM, takes 5-10 minutes for decompression first
# - from Disk, decompressed ("memmap"): a bit slower than RAM, takes 5-10 minutes for decompression first
# - from Disk, compressed ("compressed"): very slow but no need for decompression, useful for debugging
# Note that this will download the dataset automatically if not found
dataset = env.get_dataset(mode="memmap", scale="small")

# Auxillary tools for computing normalized scores or extracting deathlevels
env.get_normalized_score(score=1337.0)
env.get_current_depth()

# We also provide an example of a sequential replay buffer
buffer = SequentialBuffer(
  dataset=dataset,
  seq_len=YOUR_SEQ_LEN,
  batch_size=YOUR_BATCH_SIZE, # Each batch element is a different trajectory
  seed=YOUR_SEED,
  add_next_step=True # if you want (s, a, r, s') instead of (s, a, r)
)

# What's inside the batch?
# Note that the next batch will include the +1 element as expected
batch = buffer.sample()
print(
  batch["tty_chars"],  # [batch_size, seq_len + 1, 80, 24]
  batch["tty_colors"], # [batch_size, seq_len + 1, 80, 24]
  batch["tty_cursor"], # [batch_size, seq_len + 1, 2]
  batch["actions"],    # [batch_size, seq_len + 1]
  batch["rewards"],    # [batch_size, seq_len + 1]
  batch["dones"]       # [batch_size, seq_len + 1]
)

# In case you don't want to store the decompressed dataset beyond code execution
dataset.close()

Baselines

We also provide a set of offline RL baselines for discrete control augmented with recurrence. Implementations are based on the Chaotic-Dwarven-GPT-5 architecture and kept as simple as possible, feel free to dive in both full training logs and algorithms: you can find many useful stuff like sequential replay buffer or bias-corrected vectorized evaluation.

Algorithm	Variants Implemented	Wandb Report
✅ Behavioral Cloning <br>(BC)	`bc_chaotic_lstm.py`	`Katakomba-All`
✅ Conservative Q-Learning for Offline Reinforcement Learning <br>(CQL)	`cql_chaotic_lstm.py`	`Katakomba-All`
✅ Accelerating Online Reinforcement Learning with Offline Datasets <br>(AWAC)	`awac_chaotic_lstm.py`	`Katakomba-All`
✅ Offline Reinforcement Learning with Implicit Q-Learning <br>(IQL)	`iql_chaotic_lstm.py`	`Katakomba-All`
✅ An Optimistic Perspective on Offline Reinforcement Learning <br>(REM)	`rem_chaotic_lstm.py`	`Katakomba-All`

Datasets

In our benchmark, we treat every character configuration as a separate game to be solved -- different configurations may require highly varied forms of gameplay in the early game. To this end, we repacked the original large-scale AutoAscend (this symbolic agent is essentialy an early-game contender) dataset into 38 smaller datasets. This decomposition should allow practitioners to download less data and be more focused on specifics.

Additionally, as benchmarking new algorithms on all of the datasets could be computationally expensive for many practitioners, we separate the benchmark into three categories, where roles > races > alignments as by wisdom of the NetHack community.

We host all of the datasets on the HuggingFace, you can download them from there directly. But as we described above, our wrappers will take care of it automatically similar to the D4RL benchmark. The script for repacking the large-scale dataset can be found here.

Tasks

Tasks	# Transitions	Median Turns	Median Score	Median Deathlvl	Size (GB)	Compressed Size (GB)
Base (Role-Centric)	-	-	-	-	-	-
`arc-hum-neu`	24527163	32858.0	4802.5	2.0	94.5	1.3
`bar-hum-neu`	26266771	35716.0	11964.0	4.0	101.1	1.7
`cav-hum-neu`	21674680	30361.0	8152.0	4.0	83.5	1.3
`hea-hum-neu`	14473997	18051.0	2043.0	1.0	55.7	0.8
`kni-hum-law`	22287283	28246.0	6305.0	3.0	85.8	1.5
`mon-hum-neu`	33741542	42400.0	11356.0	4.0	129.9	2.1
`pri-hum-neu`	18376473	26796.5	5366.5	2.0	70.8	1.1
`ran-hum-neu`	17625493	25354.0	6168.0	2.0	67.9	1.0
`rog-hum-cha`	14284927	19334.0	3005.5	1.0	55.0	0.8
`sam-hum-law`	22422537	32951.0	7850.0	4.0	86.3	1.3
`tou-hum-neu`	13376498	17955.5	2554.5	1.0	51.5	0.8
`val-hum-neu`	27784788	35250.0	11402.5	4.0	107.0	1.8
`wiz-hum-neu`	14343449	19808.5	3132.5	1.0	55.2	0.8
Extended (Race-Centric)	-	-	-	-	-	-
`pri-elf-cha`	18796560	26909.5	4718.5	2.0	72.4	1.1
`ran-elf-cha`	18238686	26607.0	7583.0	4.0	70.2	1.1
`wiz-elf-cha`	15277820	19512.0	2988.5	1.0	58.8	0.9
`arc-dwa-law`	25100788	34669.0	4026.0	1.0	96.7	1.5
`cav-dwa-law`	22871890	32261.0	7158.0	3.0	88.1	1.5
`val-dwa-law`	32787658	33973.0	8652.5	3.0	126.6	2.5
`arc-gno-neu`	24144048	34432.0	4077.5	1.0	93.0	1.4
`cav-gno-neu`	21624779	29860.0	6446.0	3.0	83.3	1.4
`hea-gno-neu`	14884704	18518.0	1980.5	1.0	57.3	0.9
`ran-gno-neu`	17571659	25970.0	5326.0	2.0	67.7	1.1
`wiz-gno-neu`	14193637	19206.0	2736.0	1.0	54.7	0.9
`bar-orc-cha`	27826356	39291.0	10499.0	4.0	107.2	1.8
`ran-orc-cha`	18127448	26707.0	5460.0	2.0	69.8	1.1
`rog-orc-cha`	16674806	22351.0	3103.0	1.0	64.2	1.0
`wiz-orc-cha`	15994150	22570.5	3241.5	1.0	61.6	1.0
Complete (Alignment-Centric)	-	-	-	-	-	-
`arc-hum-law`	23422383	31446.0	4188.0	1.0	90.2	1.3
`cav-hum-law`	22328494	31039.0	8174.0	4.0	86.0	1.3
`mon-hum-law`	30782317	39647.0	10855.0	4.0	118.5	1.9
`pri-hum-law`	18298816	27192.0	4833.0	1.0	70.5	1.1
`val-hum-law`	30171035	34570.5	9707.0	4.0	116.2	2.1
`bar-hum-cha`	25362111	35925.0	12574.0	5.0	97.7	1.6
`mon-hum-cha`	33662420	41730.5	11418.0	4.0	129.6	2.1
`pri-hum-cha`	18667816	28204.5	5847.0	2.0	71.9	1.1
`ran-hum-cha`	16999630	24698.5	6236.0	2.0	65.6	1.0
`wiz-hum-cha`	14635591	20257.0	3294.0	1.0	56.4	0.9

Citing Katakamoba

@misc{kurenkov2023katakomba,
      title={Katakomba: Tools and Benchmarks for Data-Driven NetHack}, 
      author={Vladislav Kurenkov and Alexander Nikulin and Denis Tarasov and Sergey Kolesnikov},
      year={2023},
      eprint={2306.08772},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}