<p align="center" style="margin-top: -2em"> <img src="res/logo.png" alt="peS2o logo. It's a picture of a mortar and pestle with documents flying in." width=384px height=auto> </p> <p align="center" style="font-size: 1.2em; margin-top: -1em"><i>Pretraining Effectively on <a href="https://github.com/allenai/s2orc">S2ORC</a>!</i></p>

The peS2o dataset is a collection of ~40M open access academic papers that have been cleaned, filtered, and formatted for pre-training of language models. It is derived from the Semantic Scholar Open Research Corpus (Lo et al., 2020), or S2ORC.

<p align="center" style="font-size: 1.2em;">peS2o is available on the <span><img src="res/hf-logo.png" width=auto height=30px style="margin: -8px auto;"></span> <a href="https://huggingface.co/datasets/allenai/pes2o">Huggingface Hub</a>!</p>

```python
from datasets import load_dataset
dataset = load_dataset("allenai/peS2o", "v2", split="train")
```

We release multiple versions of peS2o, each with different processing and a different knowledge cutoff date. We recommend using the latest version available.

If you use this dataset, please cite:

```bibtex
@techreport{peS2o,
    author = {Luca Soldaini and Kyle Lo},
    year = 2023,
    title = {{peS2o (Pretraining Efficiently on S2ORC) Dataset}},
    institution = {{Allen Institute for AI}},
    note = {ODC-By, \url{https://github.com/allenai/pes2o}}
}
```

Document Format

Each document in the dataset is a dictionary with the following fields:


peS2o V1

Key Facts

Processing

Processing differs slightly depending on whether a document was derived from the full-text corpus (s2orc) or the title and abstract corpus (s2ag).

S2ORC-derived documents

Unfiltered, S2ORC contains 11.3M papers and 46.9B whitespace-separated tokens as of 2023-01-03. To derive peS2o v1, we impose the following constraints:

The train set contains papers published before 2022-12-01; the validation set includes documents published after 2022-12-01 and until 2023-01-03.
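
The date-based split described above could be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the function name `assign_split` and both constants are hypothetical, and placing a paper published exactly on 2022-12-01 in the validation set is an assumption, since the text only specifies "before" and "after" that date.

```python
from datetime import date
from typing import Optional

# Dates taken from the text; the constant names are hypothetical.
TRAIN_CUTOFF = date(2022, 12, 1)
CORPUS_SNAPSHOT = date(2023, 1, 3)


def assign_split(published: date) -> Optional[str]:
    """Return "train", "valid", or None for a paper's publication date."""
    if published < TRAIN_CUTOFF:
        return "train"
    # Assumption: the boundary day itself is treated as validation.
    if published <= CORPUS_SNAPSHOT:
        return "valid"
    # Published after the corpus snapshot: not part of this release.
    return None


print(assign_split(date(2021, 6, 15)))
print(assign_split(date(2022, 12, 20)))
```

Keeping the validation set strictly later in time than the train set means it can serve as a check on text published after the training cutoff.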

S2AG-derived documents

The S2AG corpus contains titles and abstracts of papers in Semantic Scholar. Unfiltered, the corpus contains 91.1M papers and 15.5B whitespace-separated tokens as of 2023-01-03. To derive peS2o v1, we impose the following constraints:

Statistics

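| Dataset | Split | # Documents | # Words |
|:-------:|:-----:|------------:|--------:|
| s2orc | train | 8,242,162 | 36,088,195,908 |
| s2orc | valid | 51,323 | 255,139,074 |
| s2ag | train | 59,382,301 | 11,009,123,378 |
| s2ag | valid | 111,228 | 24,398,512 |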

peS2o V2

Key Facts

Processing

peS2o V2 is largely the same as V1, but it includes additional heuristics for s2ag aimed at filtering out OCR errors from abstracts.

First, we check if the abstract was obtained from Semantic Scholar sources that are likely to contain OCR'ed content. For any abstract derived from those sources, we count how often the text contains subsequences matching `\b([A-Za-z]\s)([a-z]\s)*[A-Za-z]\b`, i.e., individual alphabetic letters separated by a space. This heuristic matches cases such as `A b stra ct` (2 matching subsequences), where the OCR parser inserted erroneous spaces. Any abstract with more than 4 matching subsequences is removed.
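
A minimal sketch of this heuristic is shown below. The regular expression and the threshold of 4 come from the description above; the function and constant names are hypothetical, and counting non-overlapping matches with `re.finditer` is an assumption about how "matching subsequences" are tallied. The preceding step (checking whether the abstract comes from an OCR-prone source) is not modeled here.

```python
import re

# Pattern copied from the description: single letters separated by whitespace.
SPACED_LETTERS = re.compile(r"\b([A-Za-z]\s)([a-z]\s)*[A-Za-z]\b")

# Abstracts with more than this many matches are dropped.
MAX_MATCHES = 4


def count_spaced_letter_runs(text: str) -> int:
    """Count non-overlapping runs of space-separated single letters."""
    return sum(1 for _ in SPACED_LETTERS.finditer(text))


def looks_like_ocr_noise(abstract: str) -> bool:
    """Return True if the abstract exceeds the match threshold."""
    return count_spaced_letter_runs(abstract) > MAX_MATCHES


clean = "We study language models trained on scientific text."
noisy = "a b cde f g hij k l mno p q rst u v"
print(looks_like_ocr_noise(clean), looks_like_ocr_noise(noisy))
```

Note that a long unbroken run such as `r e s u l t s` counts as a single match under this tally, so the threshold is sensitive to how many separate runs appear rather than to their length.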

Statistics

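| Dataset | Split | # Documents | # Words |
|:-------:|:-----:|------------:|--------:|
| s2orc | train | 8,242,162 | 36,088,195,908 |
| s2orc | valid | 51,323 | 255,139,074 |
| s2ag | train | 30,569,017 | 5,920,099,207 |
| s2ag | valid | 109,709 | 24,029,459 |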
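
As a quick sanity check, the V2 rows above can be summed to see how they relate to the "~40M papers" figure quoted in the introduction. The snippet below is only illustrative arithmetic over the published counts; the `V2_STATS` name is hypothetical.

```python
# (documents, words) per dataset and split, copied from the V2 statistics.
V2_STATS = {
    ("s2orc", "train"): (8_242_162, 36_088_195_908),
    ("s2orc", "valid"): (51_323, 255_139_074),
    ("s2ag", "train"): (30_569_017, 5_920_099_207),
    ("s2ag", "valid"): (109_709, 24_029_459),
}

total_docs = sum(docs for docs, _ in V2_STATS.values())
total_words = sum(words for _, words in V2_STATS.values())

print(f"{total_docs:,} documents, {total_words:,} words")
```

The totals come to 38,972,211 documents and 42,287,463,648 whitespace-separated words across all four rows.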