Pub2TEI

Project goal

This project aims at converting XML documents encoded in various scientific publisher formats into a common TEI XML format. Often called document ingestion, converting a myriad of heterogeneous publisher formats into a common working format is a painful and time-consuming sub-task for building scientific digital library applications.

The target TEI XML is the same as the Grobid TEI XML format, which makes it possible to ingest various publisher XML or PDF into the same XML format, thus avoiding the need to write multiple format-specific parsers. The publisher XML transformation should normally preserve all the information from the source XML.

In addition to avoiding any loss of publisher XML information, the converter offers various possibilities to enhance the publisher XML:

when the input publisher XML has raw strings for affiliations and bibliographical references, Grobid can be used automatically to further parse the raw strings into a structured representation that is added to the final TEI document,
a sentence segmentation is possible for the final TEI document with the same sentence segmenter as Grobid,
the converter service will fix various problems like empty nodes, duplicated XML identifiers, and invalid NCName for attribute values.
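
To make the last point concrete, here is a minimal sketch of how duplicated XML identifiers and invalid NCName values could be detected in a document. It is only an illustration, not the converter's actual implementation: the function name `audit_ids` is hypothetical, and the NCName pattern is a simplified ASCII-only approximation of the real production.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Expanded name of the xml:id attribute as seen by ElementTree.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, ASCII-only approximation of an NCName (assumption for this sketch):
# starts with a letter or underscore, then letters, digits, '.', '_' or '-'.
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def audit_ids(xml_text):
    """Return (duplicated ids, ids that are not valid NCNames), both sorted."""
    root = ET.fromstring(xml_text)
    ids = [el.get(XML_ID) for el in root.iter() if el.get(XML_ID) is not None]
    counts = Counter(ids)
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    invalid = sorted(i for i in counts if not NCNAME.fullmatch(i))
    return duplicated, invalid


sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <p xml:id="p1">first</p>
  <p xml:id="p1">second</p>
  <p xml:id="2nd-item">third</p>
</TEI>"""

duplicated, invalid = audit_ids(sample)
print("duplicated:", duplicated)
print("invalid NCName:", invalid)
```

In the sample, `p1` is used twice and `2nd-item` starts with a digit, which an NCName does not allow.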

With Pub2TEI, it is thus possible to obtain TEI XML documents with at least the same level of structuring as Grobid TEI XML created from PDF, while preserving the high quality of publisher full text XML encoding.

Coverage

The following publisher XML formats should be properly processed:

BMJ: metadata, header, bibliography, body
Elsevier (journals and conferences): metadata, header, bibliography, body
IOP: metadata, header, bibliography
NPG (Nature): metadata, header, bibliography, body
NLM/JATS: metadata, header, bibliography, body
OUP: metadata, header, bibliography, body
PNAS: metadata, header, bibliography, body
RSC: metadata, header, bibliography, body
Sage: metadata, header
ScholarOne: metadata, header
Springer: metadata, header, bibliography, body
Wiley: metadata, header, bibliography, body

Coverage of NLM and JATS should be comprehensive (all known versions), covering in particular PMC, PLOS and bioRxiv XML. Unfortunately, JATS is so loose that a new JATS flavor might require some stylesheet adjustments. If you observe issues in the resulting TEI XML for a new JATS publisher flavor, please file an issue in this project.

How it works

The project offers a web service for transforming and enhancing publisher XML in an efficient parallelized manner.

It uses a set of stylesheets for converting XML documents encoded in various scientific publisher formats into a common TEI XML format. These stylesheets were first developed in the context of the European project PEER and have since been further extended, in particular in the context of the ISTEX project. Depending on the publisher (see above), the encoding of bibliographical information, abstracts, citations and full texts is supported.

Enhancement is then performed by Grobid, which dynamically selects the appropriate model based on the raw fields identified in the publisher XML that can be further structured.

The simplest way to run the converter is to use the Docker image and the web service API. The Docker image contains all the required stylesheets, the Grobid deep learning models, the sentence segmenter utility and an XSLT 2.0 processor for XML transformation. The service compiles the stylesheets at startup and keeps them "warm" for the transformation requests.

Running the project with Docker

Start the Pub2TEI service as follows:

docker run --rm --gpus all --init --ulimit core=0 -p 8060:8060 grobid/pub2tei:0.2

As shown above, the service is started by default on port :8060. The host port can be changed as follows, here to port :8080:

docker run --rm --gpus all --init --ulimit core=0 -p 8080:8060 grobid/pub2tei:0.2

Python client

Once the service is started, a simple Python client is provided to easily process directories of XML files:

git clone https://github.com/kermitt2/Pub2TEI
cd Pub2TEI/client
python3 pub2tei_client.py --help

usage: pub2tei_client.py [-h] --input INPUT [--output OUTPUT] [--config CONFIG] [--n N] [--consolidate_references] [--segment_sentences]
                         [--generate_ids] [--grobid_refine] [--force] [--verbose]

Client for Pub2TEI services

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT         path to the directory containing XML files to process: .xml
  --output OUTPUT       path to the directory where to put the results (optional)
  --config CONFIG       path to the config file, default is ./config.json
  --n N                 concurrency for service usage
  --consolidate_references
                        use GROBID for consolidation of the bibliographical references
  --segment_sentences   segment sentences in the text content of the document with additional <s> elements
  --generate_ids        Generate idenfifier for each text item
  --grobid_refine       use Grobid to structure/enhance raw fields: affiliations, references, person, dates
  --force               force re-processing pdf input files when tei output files already exist
  --verbose             print information about processed files in the console

For example, to recursively process all the .xml files in a given directory with sentence segmentation, writing the transformed files alongside the input files:

python3 pub2tei_client.py --input ~/test/input/ --segment_sentences

To recursively process all the .xml files in a given input directory and write the results to a given output directory, using Grobid to further enhance the transformed documents and consolidate the references:

python3 pub2tei_client.py --input ~/test/input/ --output ~/test/output/ --grobid_refine --consolidate_references

Note that consolidation is performed with the consolidation service indicated in the configuration file of the Pub2TEI server (pub2tei/resources/config/config.yml). This setting overrides the consolidation service possibly indicated in the Grobid configuration file.
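
The behaviour of `--input`, `--output` and `--force` described in the help text can be pictured with the following sketch. It is not the client's actual code: the function name `plan_outputs`, the `.tei.xml` result suffix and the mirroring of sub-directories under the output directory are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def plan_outputs(input_dir, output_dir=None, force=False, suffix=".tei.xml"):
    """Return (source, target) pairs for the files that still need processing.

    The suffix and the mirrored directory layout are hypothetical choices.
    """
    input_dir = Path(input_dir)
    # Without an output directory, results are written alongside the inputs.
    target_root = Path(output_dir) if output_dir else input_dir
    plan = []
    for src in sorted(input_dir.rglob("*.xml")):
        if src.name.endswith(suffix):
            continue  # result of an earlier run, not a publisher file
        relative = src.relative_to(input_dir)
        dst = target_root / relative.with_name(src.stem + suffix)
        if dst.exists() and not force:
            continue  # already processed; only redone when force is set
        plan.append((src, dst))
    return plan


# Small self-contained demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.xml").write_text("<article/>")
    (root / "sub" / "b.xml").write_text("<article/>")
    (root / "sub" / "b.tei.xml").write_text("<TEI/>")  # b.xml was already processed
    pending = [src.name for src, _ in plan_outputs(root)]
    forced = [src.name for src, _ in plan_outputs(root, force=True)]

print("pending:", pending)
print("with force:", forced)
```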

Web services

Transform a publisher XML into TEI XML format, with optional enhancements.

method	request type	response type	parameters	requirement	description
POST	`multipart/form-data`	`application/xml`	`input`	required	publisher XML file to be processed
			`segmentSentences`	optional	Boolean, if true the paragraph structures in the resulting TEI will be further segmented into sentence elements <s>
			`grobidRefine`	optional	Boolean, if true the raw affiliations and raw bibliographical reference strings will be parsed with Grobid and the resulting structured information added in the transformed TEI XML
			`consolidateReferences`	optional	Consolidate all the bibliographical references. `consolidateReferences` is a string with value `0` (no consolidation, default value), `1` (consolidate and inject all extra metadata), or `2` (consolidate the citation and inject DOI only).
			`generateIDs`	optional	Inject the attribute `xml:id` in the textual elements (`title`, `note`, `term`, `keywords`, `p`, `s`)

Response status codes:

HTTP Status code	reason
200	Successful operation.
204	Process was completed, but no content could be provided
400	Wrong request, missing parameters, missing header
500	Internal service error, further described by the provided message
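
A caller might handle these status codes along the following lines. This is only a sketch: the function name `check_response` and the choice to raise a `RuntimeError` on failures are assumptions, not part of the service or its client.

```python
# Short labels for the documented status codes.
STATUS_MEANING = {
    200: "success",
    204: "no content",
    400: "bad request",
    500: "internal service error",
}


def check_response(status, body=""):
    """Return the TEI body on 200, None on 204, and raise for anything else."""
    if status == 200:
        return body
    if status == 204:
        return None  # processing completed but there is nothing to return
    reason = STATUS_MEANING.get(status, "unexpected status")
    raise RuntimeError(f"Pub2TEI request failed ({status}, {reason}): {body}")


print(check_response(200, "<TEI/>"))
print(check_response(204))
```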

Assuming that the service is started on the default port :8060 of a local machine, here is a curl example:

curl --form input=@/home/lopez/biblio/PMC_sample_1943/main.nxml --form segmentSentences=1 --form grobidRefine=1 localhost:8060/service/processXML

The resulting TEI has additional sentence markup, additional structured affiliations and additional structured bibliographical references for the entries without markup.
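
The same request can be sent from Python with the standard library only. The sketch below assumes the endpoint and form fields shown in the curl example; the function names `encode_multipart` and `process_xml` are hypothetical, and the `application/octet-stream` type of the file part is an assumption about what the service accepts.

```python
import os
import urllib.error
import urllib.request
import uuid


def encode_multipart(fields, file_field, file_name, file_bytes):
    """Build a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        parts.append(f"{value}\r\n".encode())
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{file_name}"\r\n'.encode()
    )
    parts.append(b"Content-Type: application/octet-stream\r\n\r\n")
    parts.append(file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def process_xml(path, base_url="http://localhost:8060",
                segment_sentences=False, grobid_refine=False,
                consolidate_references="0"):
    """POST one publisher XML file; return (HTTP status, response text)."""
    fields = {"consolidateReferences": consolidate_references}
    if segment_sentences:
        fields["segmentSentences"] = "1"
    if grobid_refine:
        fields["grobidRefine"] = "1"
    with open(path, "rb") as handle:
        data = handle.read()
    body, content_type = encode_multipart(
        fields, "input", os.path.basename(path), data)
    request = urllib.request.Request(
        base_url + "/service/processXML", data=body,
        headers={"Content-Type": content_type}, method="POST")
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; hand status and message back to the caller.
        return err.code, err.read().decode("utf-8", errors="replace")
```

With a running service, `process_xml("main.nxml", segment_sentences=True, grobid_refine=True)` would correspond to the curl call above.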

Running the project as a Java application

It is recommended to use the Docker image, which is the easiest way to run Pub2TEI. The following explains how to install, build and run the service from the Java source.

Requirements

As a first requirement, you need to install and build GROBID as described here.

You need JDK 11 or higher to build the project.

Install and build the library

Install Pub2TEI:

git clone https://github.com/kermitt2/Pub2TEI
cd Pub2TEI
./gradlew clean install

Be sure to indicate the correct installation location of the grobid-home directory, for example:

grobidHome: ../grobid/grobid-home

The service can then be started with:

./gradlew run

By default, the server uses port :8060. This can be changed in the configuration file resources/config/config.yml.

Building the docker image

From a local deployment, under the project repository Pub2TEI/:

docker build -t grobid/pub2tei:0.2 --build-arg PUB2TEI_VERSION=0.2 --file Dockerfile .

Only using the stylesheets

This legacy usage should normally be avoided, because document enhancements and corrections will not take place. In addition, the transformations here are not parallelized, and are thus less efficient for large-scale document processing. However, using only the stylesheets is useful when testing the transformation and working on improving these stylesheets.

Requirement

The minimum requirement is an XSLT 2.0 processor. For convenience, we package saxon9he.jar in the project.

Usage

The starting point of the transformation process is the stylesheet Publishers.xsl.

The resulting TEI documents follow a TEI customisation documented under the sub-directory Schemas.

Example with saxon9

Here is a usage example with the open source Saxon 9 Home Edition (Java). You can download more recent saxon_he versions here (for convenience, one is included in the Samples/ directory):

java -jar localLibs/saxon9he.jar -s:Samples/TestPubInput/BMJ/bmj_sample.xml -xsl:Stylesheets/Publishers.xsl -o:out.tei.xml -dtd:off -a:off -expand:off --parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd:false -t

The command applies the Pub2TEI stylesheets to an NLM file and produces a TEI file out.tei.xml. You can remove the -t option to suppress the trace information.

You can select a directory as input and output in order to process a large number of files while compiling the XSLT only once. The normal behavior is then to transform around one hundred files per second. If it's closer to one file per hundred seconds, see the next section...

Note: the test XML documents present in the sub-directory Samples are dummy documents with realistic publisher structures but random content (due to copyrights).
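
For batch runs driven from a script, the saxon invocation shown above can be assembled programmatically. The sketch below reuses exactly the options of that command; the function name `saxon_command` and its default paths are illustrative assumptions, and the command is only built here, not executed.

```python
import subprocess


def saxon_command(source, output,
                  jar="localLibs/saxon9he.jar",
                  stylesheet="Stylesheets/Publishers.xsl",
                  trace=False):
    """Build the java/saxon argument list for a file or a directory."""
    cmd = [
        "java", "-jar", jar,
        f"-s:{source}",          # input file or directory
        f"-xsl:{stylesheet}",    # entry-point stylesheet
        f"-o:{output}",          # output file or directory
        "-dtd:off", "-a:off", "-expand:off",
        # Ask the underlying XML parser not to load external DTDs.
        "--parserFeature?uri=http%3A//apache.org/xml/features/"
        "nonvalidating/load-external-dtd:false",
    ]
    if trace:
        cmd.append("-t")  # print trace/timing information
    return cmd


cmd = saxon_command("Samples/TestPubInput/BMJ/bmj_sample.xml", "out.tei.xml")
print(" ".join(cmd))
# To actually run it from the project root:
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the `?` and `%` characters of the parser feature option.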

Usual troubleshooting when using stylesheets only

It is crucial to avoid online fetching of resources: for thousands of files, online fetching leads to abysmal runtimes and makes the process unusable.

Remember that XML is from the W3C, so anything simple is by default complicated, painful and inefficient. In particular, pay attention to the following points:

1) Be sure that your XSLT processor does not try to fetch the DTD on the internet (this has a disastrous impact on performance).

For saxon, the option -dtd:off only applies to the XSLT part (the saxon part), which should solve point 1), but unfortunately it does not apply to the parsing, which will always try to fetch these damn DTDs.

2) Be sure that the XML parser used by saxon does not try to fetch the DTD on the internet.

The option --parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd:false should prevent the parser from fetching the DTD (many thanks to @superdude264 for the information, https://github.com/kermitt2/Pub2TEI/issues/3).

3) The DTD declared in the source XML file should point locally to the file system.

Point 3) is a possible solution if 2) is not working.

A further solution is to add empty local DTD files (empty files, yes!) with the same names (see also here). saxon will then intercept the idiotic (but conformant) online fetching of DTDs with these local versions, which neutralizes validation.

If this does not work, avoiding online fetching might require writing a preprocessing script that modifies the paths of the declared DTDs or removes/comments them out completely (but don't parse the XML in your script!).
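
Such a preprocessing step could look like the sketch below, which removes the DOCTYPE declaration with a regular expression instead of an XML parser. The function name `strip_doctype` is hypothetical, and the pattern is a simplification: it handles a plain declaration and a simple internal subset, but not every legal DOCTYPE.

```python
import re

# Matches <!DOCTYPE ...> with an optional internal subset in [ ... ].
# Works on bytes so that the file encoding does not need to be known.
DOCTYPE = re.compile(rb"<!DOCTYPE[^>\[]*(?:\[.*?\]\s*)?>", re.DOTALL)


def strip_doctype(xml_bytes):
    """Remove the first DOCTYPE declaration without parsing the XML."""
    return DOCTYPE.sub(b"", xml_bytes, count=1)


sample = (
    b'<?xml version="1.0"?>\n'
    b'<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" '
    b'"journalpublishing.dtd">\n'
    b"<article/>"
)
cleaned = strip_doctype(sample)
print(cleaned.decode("ascii"))
```

Keep in mind that once the DOCTYPE is removed, named entities defined only in the DTD can no longer be resolved by the parser.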

Alternatively, you can try to use a non-validating XML parser like piccolo, see also here.

In practice, it is normally possible to prevent any idiotic online fetching of resources by combining all the above tricks, and to get the expected "one hundred files transformed per second". Using the web service and application, via the Docker image in particular, already solves all these problems.

License

Pub2TEI is distributed under the Apache 2.0 license.

Maintainer and main developer:

Patrice Lopez, patrice.lopez@science-miner.com

Stylesheet authors:

Laurent Romary, Laurent.Romary@inria.fr
Stephanie Gregorio, stephanie.gregorio@inist.fr
Patrice Lopez, patrice.lopez@science-miner.com