LLM-jp Corpus

This repository contains scripts to reproduce the LLM-jp corpus.

The scripts directory contains code to download, filter, and tokenize the data.

The code in this repository is licensed under the Apache 2.0 license.

For the dataset itself, refer to the license of each data subset: