Open Access harvester

This tool is a Python utility for efficiently harvesting a very large Open Access collection of scholarly PDF, XML and metadata files, from the Unpaywall dataset, from PubMed Central, or from a given list of DOI.

The utility can be used in particular to harvest the full Unpaywall dataset (PDF) and the full set of PMC publications (PDF and corresponding JATS/NLM XML files). The tool is designed to scale to several tens of millions of full text and metadata downloads.

Requirements

The utility requires Python 3.6 or higher. It is developed for deployment on a POSIX/Linux server (it uses imagemagick to generate thumbnails, and gzip and wget as external processes). For cloud storage of the data collection, an S3 account with a dedicated S3 bucket, or a SWIFT object storage with a dedicated SWIFT container, must have been created.

The utility uses some local storage for the embedded databases that keep track of the progress of the harvesting, for metadata, and for temporarily downloaded resources. Plan a few GB of free space for a large-scale harvesting of TB of PDFs.

Storage: as a rule of thumb, plan between 1 and 1.5 TB for storing 1 million scholarly PDFs. As of May 2023, we harvested 35M full texts for the full Unpaywall collection, which takes around 45 TB of storage space.
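As a quick sanity check, the rule of thumb above can be turned into a back-of-envelope estimate (the 1.3 TB per million figure below is an illustrative midpoint, not a value from the tool):

```python
# Back-of-envelope storage estimate based on the rule of thumb above
# (1 to 1.5 TB per million PDFs). The 1.3 TB/M default is illustrative.
def storage_estimate_tb(n_pdfs, tb_per_million=1.3):
    """Estimated storage in TB for n_pdfs harvested PDFs."""
    return n_pdfs / 1_000_000 * tb_per_million

# 35M full texts at ~1.3 TB per million is consistent with the ~45 TB observed
print(round(storage_estimate_tb(35_000_000), 1))  # → 45.5
```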

Install

Get the github repo:

git clone https://github.com/kermitt2/biblio_glutton_harvester
cd biblio_glutton_harvester

It is advised to first set up a virtual environment to avoid falling into one of these gloomy Python dependency marshlands:

virtualenv --system-site-packages -p python3 env
source env/bin/activate

Install the dependencies and the project:

python3 -m pip install -r requirements.txt
python3 -m pip install -e .

For generating thumbnails corresponding to the harvested PDF, ImageMagick must be installed. For instance on Ubuntu:

apt-get install imagemagick

Using PyPI package

PyPI packages are available for stable versions. The latest stable version is 0.2.3:

python3 -m pip install biblio_glutton_harvester==0.2.3

Configuration

A configuration file must be completed. By default the file config.yaml is used, but it is also possible to use it as a template and specify a particular configuration file when using the tool (via the --config parameter).

The resources part of the configuration indicates how to access PubMed Central (PMC), arXiv and PLOS resources.

In the metadata part of the configuration:

The configuration for a compatible S3 storage uses the aws section of the configuration. If the Amazon AWS S3 service is used, leave aws_end_point empty. If you are using an alternative S3-compatible service, you must indicate the end point in the aws_end_point parameter. If you are not using an S3 storage, remove the related parameters or leave these values empty.

Important: It is assumed that the complete S3 bucket is dedicated to the harvesting. The --reset parameter will clear all the objects stored in the bucket, so be careful.

The configuration for an OpenStack SWIFT object storage uses the swift section of the configuration. If you are not using a SWIFT storage, remove the related parameters or leave these values empty. Important: It is assumed that the complete SWIFT container is dedicated to the harvesting. The --reset parameter will clear all the objects stored in the container, so be careful.

The "swift" key will contain the account and authentication information, typically via Keystone.
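To make the structure concrete, here is an illustrative sketch of the relevant parts of a configuration file. Only aws_end_point, batch_size, compression, data_path and the swift key are named in this document; the other field names and all values below are assumptions, so check the template config.yaml shipped with the repository for the authoritative parameter list:

```yaml
# Illustrative sketch only - refer to the template config.yaml in the
# repository for the authoritative field names.
data_path: ./data        # local path for harvested resources
compression: True        # store files gzipped (.gz extension)
batch_size: 100          # number of parallel download requests

aws:
  bucket_name: my-harvesting-bucket   # hypothetical bucket name
  aws_end_point: ""                   # leave empty for Amazon AWS S3

swift:
  swift_container: my-harvesting-container   # hypothetical container name
  # Keystone account and authentication information go here
```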

Note: for harvesting PMC files, although the ftp server is used, downloads tend to fail as the number of parallel requests increases. It might be useful to lower the default batch_size and to launch a reprocess pass to complete the harvesting. For the Unpaywall dataset, we have good results with a high batch_size (like 200), probably because the distribution of the URLs means that requests are never concentrated on one OA server. However, a batch_size of 100 is more conservative in general and should give a higher download rate; if only PMC files are downloaded, a batch_size of 20 is recommended.

Usage and options

usage: python3 -m biblio_glutton_harvester.OAHarvester [-h] [--unpaywall UNPAYWALL] [--pmc PMC] [--config CONFIG] [--dump DUMP]
                      [--reprocess] [--reset] [--thumbnail] [--sample SAMPLE]

Open Access PDF harvester

optional arguments:
  -h, --help            show this help message and exit
  --unpaywall UNPAYWALL
                        path to the Unpaywall dataset (gzipped)
  --pmc PMC             path to the pmc file list, as available on NIH's site
  --config CONFIG       path to the config file, default is ./config.yaml
  --dump DUMP           write a map with UUID, article main identifiers and available harvested
                        resources
  --reprocess           reprocess failed entries with OA link
  --reset               ignore previous processing states, clear the existing storage and re-
                        init the harvesting process from the beginning
  --thumbnail           generate thumbnail files for the front page of the PDF
  --sample SAMPLE       Harvest only a random sample of indicated size

The Unpaywall database snapshot is available from OurResearch.

PMC_FILE_LIST can currently be accessed as follows:

For processing all entries of an Unpaywall snapshot:

> python3 -m biblio_glutton_harvester.OAHarvester --unpaywall /mnt/data/biblio/unpaywall_snapshot_2018-06-21T164548_with_versions.jsonl.gz

By default, no thumbnail images are generated. For generating thumbnail images from the front page of the downloaded PDF (small, medium, large):

> python3 -m biblio_glutton_harvester.OAHarvester --thumbnail --unpaywall /mnt/data/biblio/unpaywall_snapshot_2018-06-21T164548_with_versions.jsonl.gz 

By default, ./config.yaml is used, but you can pass a specific config with the --config option:

> python3 -m biblio_glutton_harvester.OAHarvester --config ./my_config.yaml --unpaywall /mnt/data/biblio/unpaywall_snapshot_2018-06-21T164548_with_versions.jsonl.gz

If the process is interrupted, relaunching the above command will resume the process at the interruption point. To restart the process from the beginning and remove existing local information about the state of the process, use the parameter --reset:

> python3 -m biblio_glutton_harvester.OAHarvester --reset --unpaywall /mnt/data/biblio/unpaywall_snapshot_2018-06-21T164548_with_versions.jsonl.gz

After the completion of the snapshot, we can retry the PDF harvesting for the failed entries with the parameter --reprocess:

> python3 -m biblio_glutton_harvester.OAHarvester --reprocess --unpaywall /mnt/data/biblio/unpaywall_snapshot_2018-06-21T164548_with_versions.jsonl.gz

For downloading the PDF from the PMC set, simply use the --pmc parameter instead of --unpaywall:

> python3 -m biblio_glutton_harvester.OAHarvester --pmc /mnt/data/biblio/oa_file_list.txt

For harvesting only a predefined random number of entries rather than the whole set, the parameter --sample can be used with the desired number:

> python3 -m biblio_glutton_harvester.OAHarvester --pmc /mnt/data/biblio/oa_file_list.txt --sample 2000

This command will harvest 2000 PDFs randomly distributed over the complete PMC set. For the Unpaywall set, as only around 20% of the entries have an Open Access PDF, you will need to multiply the sample number by 5, e.g. if you wish 2000 PDFs, indicate --sample 10000.
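The 20% rule above can be sketched as a small helper for picking the --sample value; the helper name and the fixed 0.2 rate are illustrative, not part of the tool:

```python
# Rough helper for the Unpaywall case: only ~20% of entries have an Open
# Access PDF, so the --sample value must be scaled up accordingly.
# Illustrative sketch only; not part of biblio_glutton_harvester.
def unpaywall_sample_size(target_pdfs, oa_pdf_rate=0.2):
    """Return the --sample value needed to expect target_pdfs downloads."""
    return round(target_pdfs / oa_pdf_rate)

print(unpaywall_sample_size(2000))  # → 10000
```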

Map for identifier mapping

A mapping between the UUID associated with an Open Access full text resource and the main identifiers of the entry can be dumped in JSONL format (default file name: map.jsonl) with the following command:

> python3 -m biblio_glutton_harvester.OAHarvester --dump output.jsonl

By default, this map is always generated at the completion of a harvesting or re-harvesting. This mapping is necessary for further usage and for accessing the resources associated with an entry (listing millions of files directly with AWS S3 is far too slow, so we need a local index/catalog).

In the JSONL dump, each entry identified as available Open Access is present with its UUID given by the attribute id, its main identifiers (doi, pmid, pmcid, pii, istexId), the list of available harvested resources, and the best Open Access URL targeted.

{"id": "00005fb2-0969-4ed6-92b3-0552f3fa283c", "doi": "10.1001/jamanetworkopen.2019.13325", "pmid": 31617925, "resources": ["json", "pdf"], "oa_link": "https://jamanetwork.com/journals/jamanetworkopen/articlepdf/2752991/ganguli_2019_oi_190509.pdf"}

The UUID can then be used for accessing the resources of this entry, the prefix path being based on the first 8 characters of the UUID, as follows:

Note that if "compression" is set to True in the configuration file, all these files will have a .gz extension.

Depending on the config, the resources can be accessed either locally under data_path or on AWS S3 following the URL prefix: https://bucket_name.s3.amazonaws.com/, for instance https://bucket_name.s3.amazonaws.com/1b/a0/cc/e3/1ba0cce3-335b-46d8-b29f-9cdfb6430fd2/1ba0cce3-335b-46d8-b29f-9cdfb6430fd2.pdf - if you have set the appropriate access rights. The same applies to a SWIFT object storage based on the container name indicated in the config file.
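The prefix scheme described above (the first 8 characters of the UUID split into four 2-character segments) can be sketched in a few lines; resource_path is a hypothetical helper for illustration, not part of the tool:

```python
# Sketch of the storage layout described above. The prefix path is built from
# the first 8 characters of the entry UUID, split into four 2-character
# segments. `resource_path` is an illustrative helper, not part of the tool.
def resource_path(uuid, extension="pdf", compressed=False):
    prefix = "/".join(uuid[i:i + 2] for i in range(0, 8, 2))
    suffix = ".gz" if compressed else ""
    return f"{prefix}/{uuid}/{uuid}.{extension}{suffix}"

print(resource_path("1ba0cce3-335b-46d8-b29f-9cdfb6430fd2"))
# → 1b/a0/cc/e3/1ba0cce3-335b-46d8-b29f-9cdfb6430fd2/1ba0cce3-335b-46d8-b29f-9cdfb6430fd2.pdf
```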

Only entries available in Open Access according to Unpaywall or PMC are present in the JSONL map file. If an entry is present in the JSONL map file but without a full text resource ("pdf" or "xml"), it means that the harvesting of the Open Access file has failed.
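Based on the convention above, failed harvests can be spotted by scanning the map for entries whose resources list contains neither "pdf" nor "xml"; this is an illustrative sketch, not a function provided by the tool:

```python
import json

# Illustrative scan of the JSONL map: entries whose "resources" list contains
# neither "pdf" nor "xml" correspond to failed full-text harvests.
def failed_entries(map_lines):
    failed = []
    for line in map_lines:
        entry = json.loads(line)
        if not {"pdf", "xml"} & set(entry.get("resources", [])):
            failed.append(entry["id"])
    return failed

sample = [
    '{"id": "a", "resources": ["json", "pdf"]}',
    '{"id": "b", "resources": ["json"]}',
]
print(failed_entries(sample))  # → ['b']
```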

Converting the PDF files into XML TEI

GROBID is a service developed to automatically structure scholarly PDFs into XML TEI files using Machine Learning techniques. First, you will need a Grobid service installed and running. We recommend using a Docker container to simplify the installation and deployment of the server. Second, we recommend using the Grobid Python client to process the harvested PDFs at scale. The client will process the PDFs in the data_path directory in an efficient, concurrent manner.

Converting the PMC XML JATS files into XML TEI

After the harvesting realised by biblio_glutton_harvester.OAHarvester, it is possible to efficiently convert the downloaded PMC XML JATS files into XML TEI. This provides better XML quality than what can be extracted automatically by Grobid from the PDF. This conversion makes it possible to have all the documents in the same XML TEI customization format. As the TEI format supersedes JATS, there is no loss of information from the JATS file. It requires Pub2TEI to be installed and the path to Pub2TEI (pub2tei_path) to be set in the config.yaml file of the biblio_glutton_harvester project.

To launch the conversion under the default data/ directory:

python3 -m biblio_glutton_harvester.nlm2tei

If a custom config file and custom data/ path are used:

python3 -m biblio_glutton_harvester.nlm2tei --config ./my_config.yaml

This will apply Pub2TEI (a set of XSLT stylesheets) to all the harvested *.nxml files and add a new TEI file to the document repository, for instance for a CORD-19 entry:

00/0a/je/vz/000ajevz/000ajevz.pub2tei.tei.xml

Note 1: Pub2TEI supports many other publishers' XML formats (and variants of these formats), so the same principle and tool could be used to transform different publisher XML formats into a single one (TEI), not just NLM/JATS, facilitating and centralizing further ingestion and processing by avoiding having to write complicated XML parsers for each case.

Note 2: It is expected that 8 transformations fail at the end of the process; these "failed" transformations correspond to temporary empty DTDs added to avoid loading DTDs online for each input XML document.

Converting the LaTeX source files into XML TEI

After the harvesting realised by biblio_glutton_harvester.OAHarvester, it is possible to convert the downloaded LaTeX source files into XML TEI. These source files typically come from arXiv. This provides better XML quality than what can be extracted automatically by Grobid from the PDF. This conversion makes it possible to have all the documents in the same XML TEI customization format. For best TEI conformance, it requires the forked LaTeXML to be installed and the path to LaTeXML (latexml_path) to be set in the config.yaml file of the biblio_glutton_harvester project.

To launch the conversion under the default data/ directory:

python3 -m biblio_glutton_harvester.latex2tei

If a custom config file and custom data/ path are used:

python3 -m biblio_glutton_harvester.latex2tei --config ./my_config.yaml

This will apply LaTeXML to all the harvested *.zip files, examine the .tex files, identify the root LaTeX file, convert it, and finally add the converted TEI XML file to the document repository, similarly to other resources. The extension for TEI XML files generated from LaTeX sources is .latex.tei.xml, for example:

ea/53/8f/ec/ea538fec-f7ec-4119-bcab-7362a47b31b6/ea538fec-f7ec-4119-bcab-7362a47b31b6.latex.tei.xml

Harvesting from a list of DOI

The tool was designed first for mass harvesting of full texts from the Unpaywall dataset or from PubMed Central. However, it can also be used with a list of DOI to download and an Unpaywall dump. The list of DOI to harvest must be provided in a file, with one DOI per line. The following script will generate the subset of the Unpaywall dataset corresponding to this list of DOI:

usage: unpaywall_preprocess_selection.py [-h] [--unpaywall UNPAYWALL] [--dois DOIS] [--output OUTPUT]

Open Access PDF harvester

optional arguments:
  -h, --help            show this help message and exit
  --unpaywall UNPAYWALL
                        path to the Unpaywall dataset (gzipped)
  --dois DOIS           path to the list of DOIs to be used to create the Unpaywall subset
  --output OUTPUT       where to write the subset Unpaywall file, a .json.gz extension file

For example, with a file of DOI (one DOI per line) called dois.txt:

python3 biblio_glutton_harvester/unpaywall_preprocess_selection.py --unpaywall unpaywall_snapshot_2023-11-12T083002.jsonl.gz --dois dois.txt --output dois-unpaywall.json.gz

The generated file dois-unpaywall.json.gz is the Unpaywall subset corresponding to the list of DOI to download, which can then be used with the main harvesting command:

> python3 -m biblio_glutton_harvester.OAHarvester --unpaywall dois-unpaywall.json.gz
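The filtering performed by unpaywall_preprocess_selection.py can be sketched as follows; this is a minimal illustration of the idea (keep only the Unpaywall JSONL entries whose "doi" appears in the list), not the script's actual implementation:

```python
import gzip
import json

# Minimal sketch of the DOI-based selection: keep only the Unpaywall entries
# whose DOI appears in the provided list. The "doi" field name matches the
# Unpaywall JSONL schema; everything else here is illustrative.
def select_by_doi(unpaywall_path, dois_path, output_path):
    with open(dois_path) as f:
        dois = {line.strip().lower() for line in f if line.strip()}
    kept = 0
    with gzip.open(unpaywall_path, "rt") as inp, \
         gzip.open(output_path, "wt") as out:
        for line in inp:
            entry = json.loads(line)
            if (entry.get("doi") or "").lower() in dois:
                out.write(line)
                kept += 1
    return kept
```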

Troubleshooting with imagemagick

A relatively recent update (end of October 2018) of imagemagick breaks the normal conversion usage. Basically, the converter no longer converts PDF by default, for security reasons related to server usage. For non-server use as in our module, allowing PDF conversion is not a problem. For this, simply edit the file /etc/ImageMagick-6/policy.xml and comment out the following line:

<!-- <policy domain="coder" rights="none" pattern="PDF" /> -->

License and contact

Distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

If you contribute to this Open Source project, you agree to share your contribution following this license.

Main author and contact: Patrice Lopez (patrice.lopez@science-miner.com)