Softcite software mention recognition service

The goal of this GROBID module is to recognize software mentions in scholarly textual documents, publisher XML and PDF. It uses as training data the Softcite Dataset developed by the James Howison Lab at the University of Texas at Austin. This annotated corpus and the present software text mining component were developed with the support of a grant from the Alfred P. Sloan Foundation to improve credit for research software.

Code with paper: the following article is available in CC-BY:

Patrice Lopez, Caifan Du, Johanna Cohoon, Karthik Ram, and James Howison. 2021. 
Mining Software Entities in Scientific Literature: Document-level NER for an Extremely Imbalance and Large-scale Task. 
In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM ’21), 
November 1–5, 2021, QLD, Australia. https://doi.org/10.1145/3459637.3481936
[Best Applied Research Paper Award runner-up]

For more recent evaluations and a description of a use case in production to monitor Open Science in France, see:

Aricia Bassinet, Laetitia Bracco, Anne L'Hôte, Eric Jeangirard, Patrice Lopez, et al. 2023. 
Large-scale Machine-Learning analysis of scientific PDF for monitoring the production and the openness of research data and software in France. ⟨hal-04121339v3⟩ 
https://hal.science/hal-04121339v3

Like the other GROBID models, the module relies only on state-of-the-art machine learning. The tool can use a linear CRF (via Wapiti JNI integration) or Deep Learning models such as BiLSTM-CRF, BiLSTM-CRF with ELMo, or fine-tuned BERT transformers, e.g. SciBERT and LinkBERT (via DeLFT JNI integration), and any combination of them.

Thanks to its integration in the GROBID framework, the software mention extraction on scholar PDF is:

Latest performance (accuracy and runtime) can be found in the most recent cited publication above, and more model comparisons below.

Demo

A public demo of the service is available at the following address: https://cloud.science-miner.com/software/

The web console allows you to test the processing of text or of a full scholarly PDF. The component is developed with complete PDFs as the target, so the output of PDF processing is richer (attachment, parsing and DOI-matching of the bibliographical references appearing with a software mention, coordinates of the mentions in the PDF, document-level propagation of mentions). The console displays extracted mentions directly on the PDF pages (via PDF.js), with an infobox describing, when possible, Wikidata entity linking and full reference metadata (with Open Access links when found via Unpaywall).

This demo is provided for testing only, without any guarantees regarding service quality and availability. If you plan to use this component at scale or in production, you need to install it locally (see how to deploy a docker image).

Note: The demo runs with the CRF model to reduce the computational load, as the server is used for other demos and has no GPU (for cost reasons). For significantly more accurate results (see the benchmarking), SciBERT/LinkBERT models are required; the Docker image is the easiest way to achieve this (fine-tuned transformer models are included and used by default in the image).

The Softcite Dataset

For sampling, training and evaluation of the sequence labeling model and additional attribute attachment mechanisms, we use the Softcite dataset, a gold standard manually annotated corpus of 4,971 scholarly articles, available on Zenodo (version 2.0).

More details on the Softcite dataset can be found in the following publication:

Du, C, Cohoon, J, Lopez, P, Howison, J. Softcite dataset: A dataset of software mentions 
in biomedical and economic research publications. J Assoc Inf Sci Technol. (JASIST) 2021; 1–15. 
https://doi.org/10.1002/asi.24454

The latest version of the dataset is maintained on the following GitHub repository: https://github.com/softcite/softcite_dataset_v2

Original development was carried out at https://github.com/howisonlab/softcite-dataset

Docker image

Using the Docker image is the recommended way to run the service. The best Deep Learning models are included and used by default in this image. To use the image from Docker Hub, pull it (around 11GB) as follows:

docker pull grobid/software-mentions:0.8.0

After pulling or building the Docker image, you can now run the software-mentions service as a container:

>  docker run --rm --gpus all -it --ulimit core=0 -p 8060:8060 grobid/software-mentions:0.8.0

The built image automatically supports GPU when one is available on the host machine, via the parameter --gpus all (with automatic recognition of the CUDA version). GPU support is only available on Linux host machines. If no GPU is available on your host machine, just remove the --gpus all parameter, but using a GPU is recommended for the best runtime:

>  docker run --rm -it --ulimit core=0 -p 8060:8060 grobid/software-mentions:0.8.0

To use only certain GPUs (see the nvidia container toolkit user guide for more details):

> docker run --rm --gpus '"device=1,2"' -it --init --ulimit core=0 -p 8060:8060 grobid/software-mentions:0.8.0

Note that starting the container with the option --ulimit core=0 is a convenience that avoids possible core dumps inside the container, which can otherwise happen due to the very rare crash of the PDF parsing C++ component. Starting the container with the parameter -it allows you to interact with the docker process, which is of limited use here, except for conveniently stopping the docker container with control-c.

The software-mentions service is available at the default host/port localhost:8060, but it is possible to map the port at launch time of the container as follows:

> docker run --rm --gpus all -it --init --ulimit core=0 -p 8080:8060 grobid/software-mentions:0.8.0

In this image, the best deep learning models are used by default. The selection of models can be modified, for example to use faster models or models requiring less GPU memory. To modify the configuration without rebuilding the image, for instance to use the CRF model instead, it is possible to mount a modified config file at launch as follows:

> docker run --rm --gpus all -it --init --ulimit core=0 -p 8060:8060 -v /home/lopez/grobid/software-mentions/resources/config/config.yml:/opt/grobid/software-mentions/resources/config/config.yml:ro  grobid/software-mentions:0.8.0

As an alternative, a docker image for the software-mentions service can be built with the project Dockerfile to match the current master version. The complete process is as follows:

~/grobid/software-mentions$ cp ./Dockerfile.software ..
> docker build -t grobid/software-mentions:0.8.0-SNAPSHOT --build-arg GROBID_VERSION=0.8.0 --file Dockerfile.software .

Building the Docker image takes several minutes: installing GROBID, software-mentions, a complete Python Deep Learning environment based on DeLFT and deep learning models downloaded from the internet (one fine-tuned model with a BERT layer has a size of around 400 MB). The resulting image is thus very large, around 8GB, due to the deep learning resources and models.

Install, build, run

The easiest way to deploy and run the service is to use the Docker image, see previous section. If you're courageous or would like to contribute to the development, this section presents the install and build process.

Building the module requires JDK 1.8 or higher (tested up to Java 15). First install and build the latest development version of GROBID as explained by the documentation, together with DeLFT for Deep Learning model support. An installation of Pub2TEI is also necessary to process a variety of publisher XML formats (including for example JATS).

Under the installed and built grobid/ directory, clone the present module software-mentions (it will appear as sibling sub-project to grobid-core, grobid-trainer, etc.):

cd grobid/
git clone https://github.com/softcite/software-mentions

Copy the provided pre-trained models in the standard grobid-home path:

./gradlew copyModels 

The larger models (fine-tuned transformers, currently the best performing ones, 1.5 GB in total and too large to be stored in the GitHub repo) need to be downloaded and installed with the following command:

./gradlew installModels

Try compiling everything with:

./gradlew clean install 

Run the tests:

./gradlew test

To start the service:

./gradlew run

Console web app

The JavaScript demo/console web app is then accessible at http://localhost:8060. From the console and the RESTful services tab, you can process a chunk of text (select ProcessText) or process a complete PDF document (select Annotate PDF document).

[Screenshot: GROBID Software mentions Demo]

When processing text, it is possible to examine the JSON output of the service with the Response tab:

[Screenshot: GROBID Software mentions Demo]

When processing the PDF of a scientific article, the tool will also identify bibliographical reference markers and, when possible, attach the full parsed bibliographical reference to the identified software entity. In addition, bibliographical references can be resolved via biblio-glutton, providing a unique DOI, and optionally additional identifiers like PubMed ID, PMC ID, etc. and a link to the Open Access full text of the reference work when available (via Unpaywall).

[Screenshot: GROBID Software mentions Demo]

Software entity linking against Wikidata is realized by entity-fishing and provides, when possible, the Wikidata ID and the English Wikipedia page ID. The web console lets you view and interact with the entity information in the infobox:

[Screenshot: GROBID Software mentions Demo]

Python client for the Softcite software mention recognition service

To exploit the Softcite software mention recognition service efficiently (concurrent calls) and robustly, a Python client is available here.

If you want to process a directory of PDF and/or XML documents, this is the best and simplest solution: deploy a Docker image of the server and use this client.

Tutorial

A tutorial is available at https://github.com/softcite/tutorials/blob/master/process_all_of_plos.md describing how to process the "All of PLOS" collection, step by step. You can apply the same approach for any collection of XML or PDF scientific articles.

JSON format for the extracted software mention

The resulting software mention extractions include many attributes and information. These extractions follow the JSON format documented on this page.

Softcite software mention extraction from the CORD-19 publications

This dataset is the result of the extraction of software mentions from the publications of the CORD-19 corpus (https://allenai.org/data/cord-19) by the Softcite software recognizer using the fine-tuned SciBERT model: https://zenodo.org/record/5235661

Web API

/service/processSoftwareText

Identify the software mentions in text and optionally disambiguate the extracted software mentions against Wikidata.

| method | request type | response type | parameters | requirement | description |
|---|---|---|---|---|---|
| GET, POST | multipart/form-data | application/json | text | required | the text to be processed |
| | | | disambiguate | optional | disambiguate is a string of value 0 (no disambiguation, default value) or 1 (disambiguate and inject Wikidata entity id and Wikipedia pageId) |

Response status codes:

| HTTP Status code | reason |
|---|---|
| 200 | Successful operation. |
| 204 | Process was completed, but no content could be extracted and structured |
| 400 | Wrong request, missing parameters, missing header |
| 500 | Indicates an internal service error, further described by a provided message |
| 503 | The service is not available, which usually means that all the threads are currently used |

A 503 error normally means that all the threads available to the Softcite service are currently used for processing concurrent requests. The client needs to re-send the query after a wait time that allows the server to free some threads. The wait time depends on the service and the capacities of the server; we suggest 1 second for the processSoftwareText service.
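
The retry behaviour described above can be sketched in Python as follows. This is a minimal illustration and not the official client: `send_request` is a hypothetical callable standing for whatever HTTP call you use, assumed to return a `(status_code, body)` pair, and `max_attempts` is an arbitrary choice. The 1-second default is the wait time suggested above.

```python
import time


def call_with_retry(send_request, wait_seconds=1.0, max_attempts=5, sleep=time.sleep):
    """Call send_request() until it returns a status other than 503.

    send_request is a hypothetical callable (an assumption of this sketch)
    that performs the HTTP call and returns a (status_code, body) tuple.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send_request()
        if status != 503:
            return status, body
        if attempt < max_attempts:
            # All server threads are busy: wait before re-sending the query.
            sleep(wait_seconds)
    # Give up after max_attempts and report the last 503 to the caller.
    return status, body
```

The `sleep` argument is injectable only so that the waiting logic can be exercised without actually pausing.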

Using curl POST/GET requests with some text:

curl -X POST -d "text=We test GROBID (version 0.7.1)." localhost:8060/service/processSoftwareText
curl -G --data-urlencode "text=We test GROBID (version 0.7.1)." localhost:8060/service/processSoftwareText

which should return this:

{
    "application": "software-mentions",
    "version": "0.7.1",
    "date": "2022-09-10T07:02+0000",
    "mentions": [{
        "software-name": {
            "rawForm": "GROBID",
            "normalizedForm": "GROBID",
            "offsetStart": 8,
            "offsetEnd": 14
        },
        "type": "software",
        "version": {
            "rawForm": "0.7.1",
            "normalizedForm": "0.7.1",
            "offsetStart": 24,
            "offsetEnd": 29
        },
        "context": "We test GROBID (version 0.7.1).",
        "mentionContextAttributes": {
            "used": {
                "value": true,
                "score": 0.9999960660934448
            },
            "created": {
                "value": false,
                "score": 2.384185791015625E-7
            },
            "shared": {
                "value": false,
                "score": 1.1920928955078125E-7
            }
        },
        "documentContextAttributes": {
            "used": {
                "value": true,
                "score": 0.9999960660934448
            },
            "created": {
                "value": false,
                "score": 2.384185791015625E-7
            },
            "shared": {
                "value": false,
                "score": 1.1920928955078125E-7
            }
        }
    }],
    "runtime": 242
}

Runtimes are expressed in milliseconds.
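
For readers who consume this JSON from a script, the following Python sketch pulls the software name, version and mention-level `used` flag out of a response. The field names are those of the response shown above; the helper name `summarize_mentions` and the tuple layout are choices made for this illustration only, and the sample string is a shortened copy of that response.

```python
import json


def summarize_mentions(response_text):
    """Return (name, version, used) for each mention in a service response.

    Field names follow the JSON response shown above. A mention without a
    version yields None in the version position.
    """
    response = json.loads(response_text)
    summary = []
    for mention in response.get("mentions", []):
        name = mention["software-name"]["normalizedForm"]
        version = mention.get("version", {}).get("normalizedForm")
        used = mention.get("mentionContextAttributes", {}).get("used", {}).get("value")
        summary.append((name, version, used))
    return summary


sample = """{"mentions": [{"software-name": {"normalizedForm": "GROBID"},
  "version": {"normalizedForm": "0.7.1"},
  "mentionContextAttributes": {"used": {"value": true, "score": 0.99}}}]}"""
print(summarize_mentions(sample))
```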

/service/annotateSoftwarePDF

| method | request type | response type | parameters | requirement | description |
|---|---|---|---|---|---|
| POST | multipart/form-data | application/json | input | required | PDF file to be processed |
| | | | disambiguate | optional | disambiguate is a string of value 0 (no disambiguation, default value) or 1 (disambiguate and inject Wikidata entity id and Wikipedia pageId) |

Response status codes:

| HTTP Status code | reason |
|---|---|
| 200 | Successful operation. |
| 204 | Process was completed, but no content could be extracted and structured |
| 400 | Wrong request, missing parameters, missing header |
| 500 | Indicates an internal service error, further described by a provided message |
| 503 | The service is not available, which usually means that all the threads are currently used |

A 503 error normally means that all the threads available to the Softcite service are currently used for processing concurrent requests. The client needs to re-send the query after a wait time that allows the server to free some threads. The wait time depends on the service and the capacities of the server; we suggest 2 seconds for the annotateSoftwarePDF service or 3 seconds when disambiguation is also requested.

Using curl POST request with a PDF file:

curl --form input=@./src/test/resources/PMC1636350.pdf --form disambiguate=1 localhost:8060/service/annotateSoftwarePDF

For PDF, each entity will be associated with a list of bounding box coordinates relative to the PDF, see here for more explanation about the coordinate system.

In addition, the response will contain the bibliographical reference information associated with a software mention when found. The bibliographical information is provided in XML TEI (similar format as GROBID).
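
As a rough Python counterpart to the curl call above, the sketch below builds (without sending) a multipart/form-data POST carrying the `input` and `disambiguate` parameters, using only the standard library. The function name, the default `base_url` and the `application/pdf` part type are assumptions of this example, not something the service documentation prescribes.

```python
import urllib.request
import uuid


def build_pdf_request(pdf_bytes, filename, disambiguate=False,
                      base_url="http://localhost:8060"):
    """Build a multipart/form-data POST request for annotateSoftwarePDF."""
    boundary = uuid.uuid4().hex
    # First part: the PDF file, sent under the "input" parameter.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="input"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    # Second part: the "disambiguate" flag as the string 0 or 1, then the closing boundary.
    tail = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="disambiguate"\r\n\r\n'
        f"{1 if disambiguate else 0}"
        f"\r\n--{boundary}--\r\n"
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/service/annotateSoftwarePDF",
        data=head + pdf_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(request)`; note that `urlopen` raises `urllib.error.HTTPError` for a 503 response, which is the case where the query should be re-sent after the suggested wait.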

/service/annotateSoftwareXML

The Softcite software mention service can extract software mentions with sentence context information from a variety of publisher XML formats, including not only JATS but also a dozen mainstream publisher native XML formats (Elsevier, Nature, ScholarOne, Wiley, etc.). See Pub2TEI for the list of supported formats. Each call with an XML file (non-TEI XML) involves a transformation of the XML file into a TEI XML file, which slows down the overall process. This additional time (a few seconds) is due to the loading and compilation of the style sheets, which needs to be performed for every call.

| method | request type | response type | parameters | requirement | description |
|---|---|---|---|---|---|
| POST | multipart/form-data | application/json | input | required | XML file to be processed |
| | | | disambiguate | optional | disambiguate is a string of value 0 (no disambiguation, default value) or 1 (disambiguate and inject Wikidata entity id and Wikipedia pageId) |

Response status codes:

| HTTP Status code | reason |
|---|---|
| 200 | Successful operation. |
| 204 | Process was completed, but no content could be extracted and structured |
| 400 | Wrong request, missing parameters, missing header |
| 500 | Indicates an internal service error, further described by a provided message |
| 503 | The service is not available, which usually means that all the threads are currently used |

A 503 error normally means that all the threads available to the Softcite service are currently used for processing concurrent requests. The client needs to re-send the query after a wait time that allows the server to free some threads. The wait time depends on the service and the capacities of the server; we suggest 2 seconds for the annotateSoftwareXML service or 3 seconds when disambiguation is also requested.

Using curl POST request with an XML file:

curl --form input=@./src/test/resources/PMC3130168.xml --form disambiguate=1 localhost:8060/service/annotateSoftwareXML

/service/annotateSoftwareTEI

The Softcite software mention service extracts software mentions with sentence context information from TEI XML files directly, without the need for a further transformation as for the other publisher XML formats (see above). The process is thus much faster and should be preferred when possible.

| method | request type | response type | parameters | requirement | description |
|---|---|---|---|---|---|
| POST | multipart/form-data | application/json | input | required | TEI XML file to be processed |
| | | | disambiguate | optional | disambiguate is a string of value 0 (no disambiguation, default value) or 1 (disambiguate and inject Wikidata entity id and Wikipedia pageId) |

Response status codes:

| HTTP Status code | reason |
|---|---|
| 200 | Successful operation. |
| 204 | Process was completed, but no content could be extracted and structured |
| 400 | Wrong request, missing parameters, missing header |
| 500 | Indicates an internal service error, further described by a provided message |
| 503 | The service is not available, which usually means that all the threads are currently used |

A 503 error normally means that all the threads available to the Softcite service are currently used for processing concurrent requests. The client needs to re-send the query after a wait time that allows the server to free some threads. The wait time depends on the service and the capacities of the server; we suggest 2 seconds for the annotateSoftwareTEI service or 3 seconds when disambiguation is also requested.

Using curl POST request with a TEI XML file:

curl --form input=@./src/test/resources/PMC3130168.tei.xml --form disambiguate=1 localhost:8060/service/annotateSoftwareTEI

/service/isalive

The service check /service/isalive returns true or false depending on whether the service is up and running.

Service admin and usage information

The service also provides an admin console, reachable at http://yourhost:8071, where some additional checks like ping, metrics and heartbeat are available. We recommend in particular having a look at the metrics (using the Metric library), which provide the rate of execution as well as the throughput of each entry point.

Configuration

The software-mention module inherits the configuration of GROBID.

The configuration parameters specific to the software-mention module can be modified in the file resources/config/config.yml:

entityFishingHost: cloud.science-miner.com/nerd
entityFishingPort:

For larger scale PDF processing and to take advantage of a more recent Wikidata dump, a local instance of entity-fishing should be installed and used:

entityFishingHost: localhost
entityFishingPort: 8090

To process XML files following a variety of publisher native formats, you need to install Pub2TEI and indicate its installation path in the configuration file:

# path to Pub2TEI repository as available at https://github.com/kermitt2/Pub2TEI
pub2teiPath: "../../Pub2TEI/"

For CRF:

models:
  - name: "software"
    engine: "wapiti"

For Deep Learning architectures, which provide significantly better accuracy, set the engine to delft and indicate the installation path of the DeLFT library. To install and take advantage of DeLFT, see the installation instructions here.

The model to be used can be fully parametrised in the model block:

models:
  - name: "software"
    engine: "delft"
    wapiti:
      # wapiti training parameters, only considered when wapiti is used as engine for the model; these parameters are used at training time only
      epsilon: 0.00001
      window: 30
      nbMaxIterations: 1500
    delft:
      # deep learning parameters
      architecture: "BidLSTM_CRF"
      useELMo: false
      embeddings_name: "glove-840B"

To use the fine-tuned SciBERT model, which is the recommended model:

models:
  - name: "software"
    engine: "delft"
    delft:  
      architecture: "BERT_CRF"
      transformer: "allenai/scibert_scivocab_cased"

The transformer field indicates the name of the used pretrained model and should match the name of a HuggingFace transformer model that has been fine-tuned for the software mention recognition task.

The possible values for the Deep Learning architectures (supported by DeLFT) are:

For BiLSTM-CRF you need to further specify the embeddings to be used

    embeddings_name: glove-840B
    transformer: "allenai/scibert_scivocab_cased"

Note that the default setting is CRF Wapiti, which does not require any further installation.

DeLFT sequence labeling models are described here. For more details, see also the GROBID Deep Learning model documentation. Using DeLFT directly, it is possible to re-train other Deep Learning models using different architectures and pre-trained models (see the command line here), and run them in this module.

For a transformer-based architecture using LinkBERT base as pretrained model (recommended):

  - name: "software_context_used"
    engine: "delft"
    delft:
      architecture: "bert"
      transformer: "michiyasunaga/LinkBERT-basecased"

For an RNN GRU architecture using glove-840B static embeddings (not recommended):

  - name: "software_context_used"
    engine: "delft"
    delft:
      architecture: "gru"
      embeddings_name: "glove-840B"

The choice to use a multi-label classifier for the context characterization or 3 binary classifiers can be parametrized. Binary classifiers perform better, but require more memory resources. This can be set by the following parameter:

# if true we use binary classifiers for the contexts, otherwise use a single multi-label classifier
# binary classifiers perform better, but are heavier to use
useBinaryContextClassifiers: true

The single multi-label classifier is named "software_context". The 3 binary classifiers are named "software_context_used", "software_context_creation" and "software_context_shared".
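
To make the effect of this switch concrete, the following Python sketch maps the `useBinaryContextClassifiers` value to the classifier names listed above. The function is purely illustrative (it is not part of the module); only the model names are taken from this documentation.

```python
def context_classifier_names(use_binary_context_classifiers):
    """Return the context model names implied by useBinaryContextClassifiers."""
    if use_binary_context_classifiers:
        # Three binary classifiers: better accuracy, more memory.
        return [
            "software_context_used",
            "software_context_creation",
            "software_context_shared",
        ]
    # A single multi-label classifier covering the three attributes.
    return ["software_context"]
```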

DeLFT text classification models are described here. It is possible to retrain classification models with the DeLFT library (python3 delft/applications/softwareClassifier.py --help) and run them in this module.

Benchmarking of the sequence labeling task

The following sequence labelling algorithms have been benchmarked:

The CRF implementation is based on a custom fork of Wapiti. The other algorithms rely on the Deep Learning library DeLFT. All are natively integrated in the JVM to provide state-of-the-art performance both in accuracy and runtime.

Accuracy of the sequence labeling task

The reference evaluation is realized against a stable holdout set corresponding to 20% of all the documents of the Softcite dataset (994 articles). The remaining articles (3,977 articles) are used for training.

The holdout set reproduces the overall distribution of documents with annotation (29.0% of the documents have at least one annotation) and the distribution between the Biomedicine and Economics fields, and we used stratified sampling to reproduce the overall distribution of mentions per document (Python script under scripts/createHoldoutSet.py). The holdout set in TEI XML format is available under resources/dataset/software/evaluation/softcite_corpus-full.holdout-complete.tei.xml. For evaluating portability, we also provide the subsets of the holdout set with PMC files only (Biomedicine) and econ files only (Economics).
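
The idea of a stratified holdout can be sketched in Python as below. This is not the content of scripts/createHoldoutSet.py, which may proceed differently: the `(doc_id, field, mention_count)` tuple layout, the mention-count buckets and the fixed seed are all assumptions made for this illustration.

```python
import random
from collections import defaultdict


def stratified_holdout(documents, ratio=0.2, seed=42):
    """Split documents into (training, holdout), taking `ratio` of each stratum.

    documents: list of (doc_id, field, mention_count) tuples (assumed layout).
    A stratum is a (field, mention-count bucket) pair.
    """
    def bucket(count):
        # Hypothetical buckets: no mention, a few mentions, many mentions.
        return 0 if count == 0 else (1 if count <= 5 else 2)

    strata = defaultdict(list)
    for doc_id, field, mention_count in documents:
        strata[(field, bucket(mention_count))].append(doc_id)

    rng = random.Random(seed)
    training, holdout = [], []
    for key in sorted(strata):
        doc_ids = sorted(strata[key])
        rng.shuffle(doc_ids)
        cut = round(len(doc_ids) * ratio)
        holdout.extend(doc_ids[:cut])
        training.extend(doc_ids[cut:])
    return training, holdout
```

Sampling inside each stratum keeps the share of documents without annotation, and the balance between fields, roughly the same in the holdout set as in the full corpus.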

Traditional evaluation using 10-fold cross-validation cannot be considered as reliable in this context, because the distribution of annotations in the training data is modified with undersampling methods to address the sparsity of software mentions in scientific literature (Class Imbalance Problem). Evaluation using 10-fold cross-validation will significantly over-estimate the performance as compared to a realistic random distribution.

The training data is built from two sources:

The training data is the combination of the first set with a certain number of negative examples from the second set, depending on the selected undersampling technique. Undersampling techniques are introduced to tackle the Class Imbalance Problem (reducing the weight of the negative majority class). The techniques benchmarked below are random sampling and active sampling, compared against no sampling.
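
The combination step can be sketched in Python as below, here for the random undersampling variant only. It is a simplified illustration, not the project's implementation: `positive_examples` stands for the first set, `negative_examples` for the second, and `negative_ratio` (negatives kept per positive example) is a hypothetical knob introduced for this example. Active sampling is not shown.

```python
import random


def random_undersample(positive_examples, negative_examples, negative_ratio=1.0, seed=7):
    """Combine all positives with a random subset of the negatives.

    negative_ratio is the number of negatives kept per positive example; it is
    a hypothetical parameter of this sketch, not a documented setting.
    """
    rng = random.Random(seed)
    keep = min(len(negative_examples), int(len(positive_examples) * negative_ratio))
    sampled_negatives = rng.sample(list(negative_examples), keep)
    training = list(positive_examples) + sampled_negatives
    rng.shuffle(training)
    return training
```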

Summary

To summarize the table below, the best performing model is the fine-tuned SciBERT with active sampling, with a micro-average F1-score at 74.6. Note that this is on complete PDF extracted articles and with a realistic random distribution of mentions, which means extreme imbalance ratio at token-level between 7200:1 (software name) and 17500:1 (URL field). Combined with document-level processing (to increase recall) and entity disambiguation filtering (to increase precision), the complete processing with SciBERT reached 79.1 micro-average F1-score (76.7 for software name).

All the following scores are given at span level (exact match) against the holdout set (994 complete articles, 20% of the Softcite corpus).

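| model | sampling | software_precision | software_recall | software_f1 | publisher_precision | publisher_recall | publisher_f1 | version_precision | version_recall | version_f1 | URL_precision | URL_recall | URL_f1 | precision_micro_avg | recall_micro_avg | f1_micro_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRF | none | 29.18 | 58.49 | 38.93 | 41.45 | 76.56 | 53.78 | 51.85 | 84.85 | 64.37 | 18.18 | 68.57 | 28.74 | 34.58 | 67.59 | 45.75 |
| CRF | random | 66.92 | 53.7 | 59.59 | 70.4 | 75.12 | 72.69 | 79.75 | 83.55 | 81.61 | 34.78 | 45.71 | 39.51 | 69.25 | 63.58 | 66.30 |
| CRF | active | 68.95 | 52.78 | 59.79 | 70.32 | 73.68 | 71.96 | 80.93 | 82.68 | 81.8 | 32.61 | 42.86 | 37.04 | 70.41 | 62.51 | 66.23 |
| BiLSTM-CRF | none | 21.94 | 68.52 | 33.23 | 45.29 | 82.78 | 58.54 | 53.59 | 90.48 | 67.31 | 16.67 | 57.14 | 25.81 | 29.01 | 75.33 | 41.89 |
| BiLSTM-CRF | random | 57.11 | 71.91 | 63.66 | 67.42 | 85.17 | 75.26 | 72.95 | 88.74 | 80.08 | 50.98 | 74.29 | 60.47 | 61.97 | 77.92 | 69.03 |
| BiLSTM-CRF | active | 62.71 | 68.52 | 65.49 | 68.99 | 85.17 | 76.23 | 63.50 | 92.64 | 75.35 | 63.16 | 68.57 | 65.75 | 64.13 | 76.58 | 69.81 |
| BiLSTM-CRF+features | none | 20.94 | 74.54 | 32.69 | 45.66 | 85.65 | 59.57 | 58.40 | 91.77 | 71.38 | 14.53 | 48.57 | 22.37 | 28.03 | 79.34 | 41.42 |
| BiLSTM-CRF+features | random | 54.08 | 73.61 | 62.35 | 68.48 | 84.21 | 75.54 | 72.20 | 92.21 | 80.99 | 50.00 | 65.71 | 56.79 | 60.07 | 79.16 | 68.31 |
| BiLSTM-CRF+features | active | 54.54 | 73.30 | 62.54 | 68.20 | 85.17 | 75.74 | 79.48 | 92.21 | 85.37 | 47.46 | 80.00 | 59.57 | 61.27 | 79.61 | 69.25 |
| BiLSTM-CRF+elmo | none | 35.56 | 74.85 | 48.21 | 71.55 | 79.43 | 75.28 | 72.86 | 88.31 | 79.84 | 11.62 | 80.00 | 20.29 | 41.71 | 78.63 | 54.51 |
| BiLSTM-CRF+elmo | random | 67.44 | 62.96 | 65.12 | 63.87 | 83.73 | 72.46 | 83.05 | 84.85 | 83.94 | 54.84 | 48.57 | 51.52 | 69.46 | 70.88 | 70.16 |
| BiLSTM-CRF+elmo | active | 61.87 | 70.37 | 65.85 | 74.06 | 84.69 | 79.02 | 77.70 | 90.48 | 83.60 | 48.00 | 68.57 | 56.47 | 66.87 | 77.11 | 71.63 |
| BERT-base-CRF | none | 15.08 | 74.23 | 25.07 | 40.19 | 79.43 | 53.38 | 42.12 | 87.88 | 56.94 | 04.49 | 71.43 | 08.45 | 18.85 | 77.92 | 30.36 |
| BERT-base-CRF | random | 52.76 | 67.75 | 59.32 | 61.57 | 78.95 | 69.18 | 65.89 | 85.28 | 74.34 | 14.96 | 54.29 | 23.46 | 53.74 | 73.02 | 61.91 |
| BERT-base-CRF | active | 56.85 | 67.90 | 61.88 | 66.13 | 78.47 | 71.77 | 73.51 | 85.28 | 78.96 | 19.00 | 54.29 | 28.15 | 58.99 | 73.02 | 65.26 |
| SciBERT-CRF | none | 25.73 | 80.40 | 38.98 | 44.14 | 84.69 | 58.03 | 71.72 | 92.21 | 80.68 | 27.78 | 71.43 | 40.00 | 33.27 | 83.35 | 47.56 |
| SciBERT-CRF | random | 60.48 | 77.01 | 67.75 | 68.11 | 82.78 | 74.73 | 75.36 | 91.34 | 82.58 | 40.32 | 71.43 | 51.55 | 63.90 | 80.85 | 71.38 |
| SciBERT-CRF | active | 69.31 | 72.84 | 71.03 | 75.55 | 82.78 | 79.00 | 80.24 | 87.88 | 83.88 | 45.28 | 68.57 | 54.55 | 71.71 | 77.65 | 74.56 |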
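
For readers unfamiliar with span-level (exact match) scoring and micro-averaging, the following Python sketch shows one common way to compute such figures. It is an illustration under stated assumptions, not the evaluation code used for the table: spans are represented as hypothetical `(label, start, end)` tuples, and a prediction counts as correct only if all three elements match a gold span.

```python
def precision_recall_f1(true_positives, predicted, expected):
    """Return (precision, recall, f1) as percentages from raw counts."""
    precision = 100.0 * true_positives / predicted if predicted else 0.0
    recall = 100.0 * true_positives / expected if expected else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def evaluate_spans(predicted_spans, gold_spans):
    """Score exact-match spans given as sets of (label, start, end) tuples.

    Returns per-label scores plus a micro-average, obtained by summing the
    counts over all labels before computing the ratios.
    """
    labels = sorted({span[0] for span in predicted_spans | gold_spans})
    scores, total_tp, total_pred, total_gold = {}, 0, 0, 0
    for label in labels:
        predicted = {s for s in predicted_spans if s[0] == label}
        gold = {s for s in gold_spans if s[0] == label}
        tp = len(predicted & gold)  # exact match on label, start and end
        scores[label] = precision_recall_f1(tp, len(predicted), len(gold))
        total_tp += tp
        total_pred += len(predicted)
        total_gold += len(gold)
    scores["micro_avg"] = precision_recall_f1(total_tp, total_pred, total_gold)
    return scores
```

Because the micro-average pools counts across labels, frequent labels such as the software name weigh more in it than rare ones such as URL.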

See below and DeLFT for more details about the models and reproducing all these evaluations. The feature-engineered CRF is based on the custom Wapiti fork integrated in GROBID and available in the present repository.

<software> label means “software name”. <publisher> corresponds usually to the publisher of the software or, more rarely, the main developer. <version> corresponds to both version number and version dates, when available.

Note that the maximum sequence length is normally 1,500 tokens, except for BERT architectures, which have a limit of 512 for the input sequence length. Tokens beyond 1,500 or 512 are truncated and ignored.
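
The truncation rule can be pictured with the short Python sketch below. It only restates the limits given above; detecting a BERT architecture by looking for "BERT" in its name is an assumption of this example, and sub-word tokenization (which in practice lowers the number of words fitting into 512 positions) is ignored.

```python
def truncate_tokens(tokens, architecture):
    """Keep only the tokens a model would see, given its input length limit.

    Limits are those stated above: 512 for BERT architectures, 1,500 otherwise.
    Matching on the substring "BERT" is an assumption of this sketch.
    """
    max_length = 512 if "BERT" in architecture.upper() else 1500
    # Tokens beyond the limit are dropped and therefore never labeled.
    return tokens[:max_length]
```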

Custom features

Custom features, when used, are as follows:

[Figure: custom features for software mention recognition]

The known software name/vocabulary is based on a Wikidata/Wikipedia export for all the software entities, excluding video games. 53,239 different software names are exported (corresponding to around 13K software entities), see here.

Note: Deep learning models only support additional categorical features, so word form features are automatically excluded for deep learning models based on the arity of the feature in the complete training data. This is an automatic mechanism implemented in DeLFT.

Runtimes

The following runtimes have been obtained on an Ubuntu 16.04 server with an Intel i7-4790 (4 CPU), 4.00 GHz and 16 GB memory. The runtimes for the Deep Learning architectures are based on the same machine with an NVIDIA GeForce 1080Ti GPU (11 GB). Runtime can be reproduced with the Python script below.

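CRF

| threads | tokens/s |
|---|---|
| 1 | 23,685 |
| 2 | 43,281 |
| 3 | 59,867 |
| 4 | 73,339 |
| 6 | 92,385 |
| 7 | 97,659 |
| 8 | 100,879 |

BiLSTM-CRF

| batch size | tokens/s |
|---|---|
| 50 | 24,774 |
| 100 | 28,707 |
| 150 | 30,247 |
| 200 | 30,520 |

BiLSTM-CRF+ELMo

| batch size | tokens/s |
|---|---|
| 5 | 271 |
| 7 | 365 |

SciBERT+CRF

| batch size | tokens/s |
|---|---|
| 5 | 4,729 |
| 6 | 5,060 |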

Batch size is a parameter constrained by the capacity of the available GPU. Improving the performance of the deep learning architectures requires increasing the number of GPUs and their amount of memory, just as improving CRF capacity requires increasing the number of available threads and CPUs. We observed that running a Deep Learning architecture on CPU is around 50 times slower than on GPU (although it depends on the amount of RAM available with the CPU, which can allow the batch size to be increased significantly).

Software mention context characterization

Every mentioned software in a document is automatically enriched with usage, creation and sharing information based on the different software mention contexts in the document. In the JSON results, mentioned software are characterized with the following attributes:

For each of these attributes, a score in [0,1] and binary class values are provided at mention-level and at document-level. For example, the following mention context indicates that the software Mobyle is shared. However, at document-level, other contexts further characterize the role of the software, indicating that it is also used and is a creation described in the research work corresponding to the document:

{
    "context": "Availability: The Mobyle system is distributed under the terms of the GNU GPLv2 on the project web site (http://bioweb2.pasteur.fr/ projects/mobyle/).",
    "mentionContextAttributes": {
        "used": {
            "value": false,
            "score": 0.012282907962799072
        },
        "created": {
            "value": false,
            "score": 5.9604644775390625E-6
        },
        "shared": {
            "value": true,
            "score": 0.9282650947570801
        }
    },
    "documentContextAttributes": {
        "used": {
            "value": true,
            "score": 0.9994845390319824
        },
        "created": {
            "value": true,
            "score": 0.9999511241912842
        },
        "shared": {
            "value": true,
            "score": 0.9282650947570801
        }
    }
}
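
To spot such cases programmatically, the Python sketch below lists the roles that are asserted at document level but not by the mention context itself. The attribute names are those of the JSON above; the helper name `document_only_roles` is hypothetical and the function is only an illustration of how the two levels can be compared.

```python
def document_only_roles(mention):
    """List the roles asserted at document level but not by this mention context."""
    mention_level = mention.get("mentionContextAttributes", {})
    document_level = mention.get("documentContextAttributes", {})
    roles = []
    for role in ("used", "created", "shared"):
        in_document = document_level.get(role, {}).get("value", False)
        in_mention = mention_level.get(role, {}).get("value", False)
        if in_document and not in_mention:
            roles.append(role)
    return roles
```

Applied to the Mobyle mention above, the function returns the `used` and `created` roles, which come from other contexts in the same document.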

On the demo console, these attributes are reported in the resulting document-level summary box and in the mention-level infobox:

[Screenshot: GROBID Software mentions Demo]

Training and evaluation corpus for context characterizations

We use the following manually annotated resources for training and evaluation:

Schindler, David, Bensmann, Felix, Dietze, Stefan, & Krüger, Frank. (2021). 
SoMeSci - Software Mentions in Science (0.1) [Data set]. 
Zenodo. https://doi.org/10.5281/zenodo.4701764

Evaluation of context characterizations

To perform the enrichment for the three attribute classes defined above, we currently use three binary classifiers applied to the contextual sentences of a mention. We hypothesize that the wording used to introduce and describe a software mention can characterize its possible usage, creation or sharing. The classifiers are based on fine-tuned SciBERT implemented with DeLFT.

The three binary classifiers perform as follows with 10-fold cross-validation:

Evaluation on 303 instances:
                   precision        recall       f-score       support
          used        0.9669        0.9791        0.9730           239
      not_used        0.9180        0.8750        0.8960            64     

Evaluation on 303 instances:
                   precision        recall       f-score       support
      creation        0.8400        0.9130        0.8750            23
  not_creation        0.9928        0.9857        0.9892           280

Evaluation on 303 instances:
                   precision        recall       f-score       support
        shared        0.7368        0.7778        0.7568            18
    not_shared        0.9859        0.9825        0.9842           285

Using three binary classifiers performs better than using a single multi-class, multi-label classifier, which scores as follows, again with 10-fold cross-validation:

Evaluation on 303 instances:
                   precision        recall       f-score       support
          used        0.9438        0.9833        0.9631           239
      creation        0.7097        0.9565        0.8148            23
        shared        0.6522        0.8333        0.7317            18    
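
The comparison can be checked from the reported precision and recall values. The sketch below assumes that the f-score column is the usual F1 (harmonic mean of precision and recall), which is consistent with the figures in the tables; small differences in the last digit come from the rounding of the reported precision and recall.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f-score) of the positive classes,
# copied from the tables above.
binary = {
    "used": (0.9669, 0.9791, 0.9730),
    "creation": (0.8400, 0.9130, 0.8750),
    "shared": (0.7368, 0.7778, 0.7568),
}
multi_label = {
    "used": (0.9438, 0.9833, 0.9631),
    "creation": (0.7097, 0.9565, 0.8148),
    "shared": (0.6522, 0.8333, 0.7317),
}

for name in binary:
    f_binary = f_score(*binary[name][:2])
    f_multi = f_score(*multi_label[name][:2])
    print(f"{name:>10}  binary={f_binary:.4f}  "
          f"multi-label={f_multi:.4f}  gain={f_binary - f_multi:+.4f}")
```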

Commands for sequence labeling training and evaluation

Training only

For training the software model with all the available training data:

> cd PATH-TO-GROBID/grobid/software-mentions/

> ./gradlew train_software 

The training data must be under software-mentions/resources/dataset/software/corpus.

Training and evaluating with automatic corpus split

The following command automatically and randomly splits the available annotated data (under resources/dataset/software/corpus/) into a training set and an evaluation set, trains a model on the first set and launches an evaluation on the second set.

>  ./gradlew eval_software_split [-Ps=0.8 -PgH=/custom/grobid/home -Pt=10] 

In this mode, by default, 90% of the available data is used for training and the remaining 10% for evaluation. This default ratio can be changed with the parameter -Ps. By default, training uses all the threads available on the machine; the number of threads can be set with the parameter -Pt. The GROBID home can optionally be specified with the parameter -PgH; by default, ../grobid-home is used.
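
To make the effect of the split ratio concrete, here is a minimal Python sketch of a random split. It is not the code run by the Gradle task: the function name `split_corpus`, the file names, the seed and the unit being split (whole documents) are assumptions made for illustration.

```python
import random


def split_corpus(items, ratio=0.9, seed=None):
    """Shuffle the items and cut them into (training, evaluation) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the annotated corpus.
documents = [f"doc_{i}.xml" for i in range(100)]
train, evaluation = split_corpus(documents, ratio=0.8, seed=42)
print(len(train), "training items,", len(evaluation), "evaluation items")
```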

Evaluation with n-fold

For n-fold evaluation using the available annotated data (under resources/dataset/software/corpus/), use the command:

>  ./gradlew eval_software_nfold [-Pn=10 -PgH=/path/grobid/home -Pt=10]

where -Pn is the number of folds (10 by default). As above, training uses by default all the threads available on the machine; the number of threads can be set with the parameter -Pt. The GROBID home can optionally be specified with the parameter -PgH; by default, ../grobid-home is used.
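
The principle of n-fold evaluation can be sketched in Python as follows. This is an illustration only, not the partitioning code used by the Gradle task: the function name `n_folds`, the file names and the round-robin assignment of items to folds are assumptions.

```python
def n_folds(items, n=10):
    """Yield (training, evaluation) pairs; each fold is held out exactly once."""
    folds = [items[i::n] for i in range(n)]
    for i, held_out in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, held_out


# Hypothetical file names standing in for the annotated corpus.
documents = [f"doc_{i}.xml" for i in range(25)]
splits = list(n_folds(documents, n=5))
for training, held_out in splits:
    print(len(training), "training items,", len(held_out), "evaluation items")
```

Each item is used exactly once for evaluation, and the reported scores are then typically averaged over the n runs.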

Evaluating only

To evaluate against the labeled data under resources/dataset/software/evaluation (fixed "holdout set" approach), use the command:

>  ./gradlew eval_software [-PgH=/path/grobid/home]

The GROBID home can optionally be specified with the parameter -PgH; by default, ../grobid-home is used.

Evaluation with additional entity disambiguation

Evaluation with entity-disambiguation to discard possible false positives:

>  ./gradlew eval_software_disambiguation [-PgH=/path/grobid/home]

Evaluation is performed against the fixed holdout set under resources/dataset/software/evaluation.

Be sure to set the parameters of the entity-fishing server performing the disambiguation in the YAML configuration file config.yml.

Document-level evaluation

Evaluation with document level propagation controlled with TF-IDF:

>  ./gradlew eval_software_doc_level [-PgH=/path/grobid/home]

Evaluation is performed against the fixed holdout set under resources/dataset/software/evaluation.

Combined entity disambiguation and document level evaluation

Evaluation with entity-disambiguation to discard possible false positives, then document level propagation controlled with TF-IDF:

>  ./gradlew eval_software_disamb_doc_level [-PgH=/path/grobid/home]

Evaluation is performed against the fixed holdout set under resources/dataset/software/evaluation.

This mode is the one implemented in the standard software recognition method at document level.

Training data import

Assembling the softcite dataset

The source of training data is the Softcite dataset developed by the James Howison Lab at the University of Texas at Austin. Before training, the data must be compiled with the actual PDF content in order to create annotated XML documents (MUC conference style). This is done with the following command, which takes three arguments:

> ./gradlew annotated_corpus_generator_csv -Ppdf=/path/input/pdf -Pcsv=path/csv -Poutput=/output/directory

The path to the PDF repo is the path where the PDF files corresponding to the annotated documents will be downloaded (done only the first time). For instance:

> ./gradlew annotated_corpus_generator_csv -Ppdf=/media/lopez/T5/softcite-dataset-local-pdf/pdf/ -Pcsv=/home/lopez/tools/softcite-dataset/data/csv_dataset/ -Poutput=resources/dataset/software/corpus/

The compiled XML training files will be written in the standard GROBID training path for the software recognition model under grobid/software-mentions/resources/dataset/software/corpus/.

Post-processing for adding provenance information in the corpus XML TEI file

Once the snippet-oriented corpus TEI file is generated, manually reviewed and reconciled, it is possible to re-inject back provenance information (when possible), normalize identifiers, add document entries without mention and segments not aligned with actual article content via GROBID, and filter training articles, with the following command:

> ./gradlew post_process_corpus -Pxml=/path/input/corpus/tei/xml/file -Pcsv=path/csv -Ppdf=path/pdf -Poutput=/output/path/tei/corpus/file

For instance:

> ./gradlew post_process_corpus -Pxml=/home/lopez/grobid/software-mentions/resources/dataset/software/corpus/all_clean.tei.xml -Pcsv=/home/lopez/tools/softcite-dataset/data/csv_dataset/ -Ppdf=/home/lopez/tools/softcite-dataset-local-pdf/pdf/ -Poutput=/home/lopez/grobid/software-mentions/resources/dataset/software/corpus/all_clean_post_processed.tei.xml

The post-processed corpus is a TEI corpus dataset corresponding to the release and delivery format of the Softcite dataset.

Inter-Annotator Agreement measures

The import process includes the computation of standard Inter-Annotator Agreement (IAA) measures for the documents annotated by at least two annotators. For the moment, the reported IAA is a percentage agreement measure, with standard error and confidence interval.
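
As an illustration of this kind of measure, the following Python sketch computes a percentage agreement between two annotators with a standard error and a 95% confidence interval. The exact formulas used by the import process are not described here, so the normal approximation of a proportion, the function name `percentage_agreement` and the toy labels are assumptions.

```python
import math


def percentage_agreement(labels_a, labels_b, z=1.96):
    """Return (agreement, standard error, confidence interval) for two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    agreement = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Standard error of a proportion (normal approximation, assumed here).
    std_error = math.sqrt(agreement * (1 - agreement) / n)
    low = max(0.0, agreement - z * std_error)
    high = min(1.0, agreement + z * std_error)
    return agreement, std_error, (low, high)


# Toy labels for ten items; "O" stands for "no annotation".
annotator_1 = ["software", "O", "software", "version", "O", "O", "software", "O", "url", "O"]
annotator_2 = ["software", "O", "O", "version", "O", "O", "software", "O", "O", "O"]
agreement, std_error, interval = percentage_agreement(annotator_1, annotator_2)
print(f"agreement={agreement:.2f} se={std_error:.3f} "
      f"ci=({interval[0]:.3f}, {interval[1]:.3f})")
```

Percentage agreement does not correct for agreement expected by chance, which is what chance-corrected coefficients such as κ address.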

See this nice tutorial about IAA. We might need more sophisticated IAA measures than percentage agreement alone for more robustness. In addition to percentage agreement, we plan to cover various IAA metrics from the π, κ, and α families using the dkpro-statistics-agreement library:

Christian M. Meyer, Margot Mieskes, Christian Stab, and Iryna Gurevych. DKPro Agreement: 
An Open-Source Java Library for Measuring Inter-Rater Agreement, in: Proceedings of the 
25th International Conference on Computational Linguistics (COLING), pp. 105–109, August 
2014. Dublin, Ireland. 

For explanations on these IAA measures, see:

Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics. 
Computational Linguistics, 34(4), 555-596.

Analysis of training data consistency

A Python 3 script is available under scripts/ to analyse the XML training data and spot possible inconsistencies to review. To launch the script:

> python3 scripts/consistency.py _absolute_path_to_training_directory_

For instance:

> python3 scripts/consistency.py /home/lopez/grobid/software-mentions/resources/dataset/software/corpus/

See the description of the output directly in the header of the scripts/consistency.py file.

Generation of training data

To generate training data in XML/TEI in an output repository (given by the -Pout= parameter), based on the current ML model, from a list of text or PDF files in an input repository (given by the -Pin= parameter), use the following command:

> ./gradlew create_training -Pin=/test_software/in/ -Pout=/test_software/out/

Runtime benchmark

A Python script is available for benchmarking the service runtime. The main motivation is to evaluate the runtime of the different machine learning models from an end-to-end perspective and on similar hardware.

By default, the text content for the benchmark is taken from the XML files of the training/eval directory under resources/dataset/software/corpus. To call the script for evaluating the text processing service:

> cd scripts/
> python3 runtime_eval.py
software-mention server is up and running
1000 texts to process
1000 texts to process
317 texts to process
-----------------------------
nb xml files: 1
nb texts: 2317
nb tokens: 962875
-----------------------------
total runtime: 38.769 seconds 
-----------------------------
xml files/s:     0.0258
    texts/s:     59.7642
   tokens/s:     24836.2093

In the above example, 24,836 tokens per second is the processing rate of the CRF model with 1 thread (it goes beyond 100K tokens per second with 8 threads).
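
The three rates in this output are simply the counts divided by the total runtime. The short Python sketch below recomputes them from the figures shown above; the variable names are only illustrative and are not taken from runtime_eval.py.

```python
# Figures copied from the benchmark output above.
nb_xml_files = 1
nb_texts = 2317
nb_tokens = 962875
total_runtime = 38.769  # seconds

rates = {
    "xml files/s": nb_xml_files / total_runtime,
    "texts/s": nb_texts / total_runtime,
    "tokens/s": nb_tokens / total_runtime,
}
for label, rate in rates.items():
    print(f"{label:>12}: {rate:.4f}")
```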

Optionally, you can provide a path to a particular repository of XML files in order to benchmark text processing:

python3 runtime_eval.py --xml-repo /my/xml/directory/

For benchmarking PDF processing, you need to provide a path to a repository of PDF files:

python3 runtime_eval.py --pdf-repo /the/path/to/the/pdf/directory

By default the config file ./config.json will be used, but you can also set a particular config file with the parameter --config:

python3 runtime_eval.py --config ./my_config.json

The config file gives the hostname and port of the software-mention service to be used. Default values are service default values (localhost:8060).

Last but not least, you can indicate the number of threads to be used for querying the service in parallel:

python3 runtime_eval.py --threads 10

The default value is 1, so there is no parallelization in the call to the service by default.
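
The idea of querying with several threads can be sketched with Python's standard `concurrent.futures` module. This is not the code of runtime_eval.py: the `process_text` function below is a local stand-in that only counts tokens, where the real script would send a request to the service, and the names `process_text` and `process_all` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_text(text):
    """Stand-in for one request to the service; here it only counts tokens."""
    return len(text.split())


def process_all(texts, threads=1):
    """Process the texts with a pool of worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=threads) as executor:
        return list(executor.map(process_text, texts))


texts = [
    "We used ImageJ 1.52 for image analysis.",
    "Statistics were computed with R.",
]
token_counts = process_all(texts, threads=10)
print(token_counts)
```

Because each request spends most of its time waiting for the server, threads are usually sufficient on the client side; the achievable speed-up then depends on how many requests the service itself can handle concurrently.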

Tested with python 3.*

Acknowledgements

We would like to acknowledge the support of the Alfred P. Sloan Foundation, Grant/Award Number: 2016-7209, and of the Gordon and Betty Moore Foundation, Grant/Award Number 8622.

How to cite

For citing this software work, please refer to the present GitHub project, together with the Software Heritage project-level permanent identifier. For example, with BibTeX:

@misc{softwarementions,
    title = {Softcite Software Mention Recognizer},
    howpublished = {\url{https://github.com/ourresearch/software-mentions}},
    publisher = {GitHub},
    year = {2018--2021},
    archivePrefix = {swh},
    eprint = {1:dir:68278b7870e291d58d993e672a4cd9788d5a0666}
}

License

GROBID and the Softcite software mentions module are distributed under Apache 2.0 license.

The documentation of the project is distributed under CC-0 license and the annotated data under CC-BY license.

If you contribute to the Softcite software mentions recognition project, you agree to share your contribution under these licenses.

Contact: Patrice Lopez (patrice.lopez@science-miner.com)