# HOS-MetadataTransformations
Automated workflow for harvesting, transforming, and indexing metadata using metha, OpenRefine and Solr. Part of the Hamburg Open Science "Schaufenster" software stack.
## Use case
- Harvest metadata in different standards (Dublin Core, DataCite, ...) from multiple OAI-PMH endpoints
- Transform harvested data with specific rules for each source to produce normalized and enriched data
- Load transformed data into a Solr search index (which serves as a backend for a discovery system, e.g. HOS-TYPO3-find)
## Data Flow
Source: flowchart.mmd (try mermaid live editor)
## Preview
## Features
- Simple automated cronjob-ready workflow: one bash script for each data source and an additional script to run all scripts in parallel
- Cache for incremental OAI harvesting (via metha)
- Graphical user interface (OpenRefine) for exploring the data, creating the transformation rules and checking the results; it is accessible on the local network via a web browser, and the data is updated automatically
- Results are made available in a preinstalled local Solr core or in external Solr cores. You can set (and reset) the Solr schema via a bash script.
- Data is stored in the filesystem in common formats (XML, TSV), so you can extend the workflow with command-line tools to further manipulate the data (see the sketch below).
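
As a hedged illustration of the last point, the snippet below pipes one of the exported TSV files through standard command-line tools; the path and file name are assumptions and may differ from the layout the scripts actually produce.

```
# Hypothetical post-processing of an exported TSV file with standard CLI tools.
# The path data/uhhediss/uhhediss.tsv is an assumption; adjust it to the layout
# created on your machine.
cut -f1 data/uhhediss/uhhediss.tsv | sort | uniq -c | sort -rn | head
```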
## System requirements
- minimum: 2GB RAM
- recommended: 8GB RAM (to run all scripts in parallel)
## Installation
Tested with Ubuntu 16.04 LTS and Ubuntu 18.04 LTS.
Install git:

```
sudo apt install git
```
Clone this git repository:

```
git clone https://github.com/subhh/HOS-MetadataTransformations.git
cd HOS-MetadataTransformations
```
Install openjdk-8-jre-headless, zip, curl, jq, metha 1.29, OpenRefine 3.2 beta, openrefine-client 0.3.4 and Solr 7.3.1:

```
sudo ./install.sh
```
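
If you want to confirm that some of the core tools are available after installation, a quick check like the following may help (only standard version flags are used):

```
# Optional sanity check of a few installed dependencies
java -version
jq --version
curl --version | head -n 1
```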
Configure the Solr schema:

```
./init-solr-schema.sh
```
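
To verify that the schema was applied, you can query the Solr Schema API of the local core (assuming the default core name `hos` used in the URLs below; jq is installed by install.sh):

```
# List the field names currently defined in the local "hos" core
curl -s "http://localhost:8983/solr/hos/schema/fields" | jq -r '.fields[].name'
```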
## Usage
Data will be available after the first run at:
- Solr admin: http://localhost:8983/solr/#/hos
- Solr browse: http://localhost:8983/solr/hos/browse
- OpenRefine: http://localhost:3333
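
After the first run you can also check from the command line how many documents were indexed (a simple sketch using Solr's standard select handler):

```
# Count indexed documents in the local "hos" core
curl -s "http://localhost:8983/solr/hos/select?q=*:*&rows=0&wt=json" | jq '.response.numFound'
```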
Run the workflow with the data source "uhhediss" and load data into the local Solr core (-s) and the local OpenRefine service (-d):

```
bin/uhhediss.sh -s http://localhost:8983/solr/hos -d http://localhost:3333
```
Run the workflow with all data sources in parallel and load data into the local Solr core (-s) and the local OpenRefine service (-d):

```
./run.sh -s http://localhost:8983/solr/hos -d http://localhost:3333
```
Run the workflow with all data sources and load data into two external Solr cores (-s) and an external OpenRefine service (-d):

```
./run.sh -s https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS -s https://openscience.hamburg.de/solrAdmin/HOS -d http://openrefine.sub.uni-hamburg.de:80
```
## Solr authentication
If your external Solr is secured with username/password (Basic Authentication Plugin), you can provide the credentials by copying cfg/solr/credentials.example to cfg/solr/credentials and filling in the username and password:

```
cp cfg/solr/credentials.example cfg/solr/credentials
nano cfg/solr/credentials
chmod 400 cfg/solr/credentials
```
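
The exact format of cfg/solr/credentials is defined by this repository's scripts; independently of that, you can test that your username and password are accepted by the secured Solr core with a plain curl request (hypothetical host shown):

```
# Test Basic Authentication against the external Solr core
curl -u "yourusername:yourpassword" \
  "https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS/select?q=*:*&rows=0"
```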
## Cronjob
Example of a daily cronjob at 00:35 that runs the workflow with all data sources, loads data into an external Solr core (-s) and an external OpenRefine service (-d), and deletes files older than 7 days (-x):

```
command="$(readlink -f run.sh) -s https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS -d http://openrefine.sub.uni-hamburg.de:80 -x 7"
job="35 0 * * * $command"
cat <(fgrep -i -v "$command" <(crontab -l)) <(echo "$job") | crontab -
```
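
The cat/fgrep construction removes any previous crontab entry containing the same command before appending the new one, so the snippet can be re-run without creating duplicates. Afterwards you can confirm that the job was added:

```
# Show the installed crontab entries
crontab -l
```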
## Add a data source
- Step 1: Harvest the new OAI-PMH endpoint and load the data into OpenRefine. Example for a new data source called `yourdatasource` with the OAI-PMH endpoint `http://ediss.sub.uni-hamburg.de/oai2/oai2.php`:

  ```
  ./load-new-data.sh -c yourdatasource -i http://ediss.sub.uni-hamburg.de/oai2/oai2.php
  ```
- Step 2: Explore the data in OpenRefine at http://localhost:3333 (project `yourdatasource_new`) and create transformations until the data looks fine and suits the Solr schema.
- Step 3: Extract the OpenRefine project history in JSON format and save it in a subdirectory of cfg/, e.g. `cfg/yourdatasource/transformation.json`.
- Step 4: Copy an existing bash shell script (e.g. bin/uhhediss.sh) to `bin/yourdatasource.sh` and edit line 17 (codename of the source, e.g. `yourdatasource`) and line 18 (URL of the OAI-PMH endpoint, e.g. `http://ediss.sub.uni-hamburg.de/oai2/oai2.php`). If you load a big dataset, you may need to allocate more memory to OpenRefine (line 19). A hedged sketch of the edited lines follows after this list.

  ```
  cp -a bin/uhhediss.sh bin/yourdatasource.sh
  gedit bin/yourdatasource.sh
  ```
- Step 5: Run your shell script (or the full workflow):

  ```
  bin/yourdatasource.sh -s http://localhost:8983/solr/hos -d http://localhost:3333
  ```
- Step 6: Check the results in OpenRefine at http://localhost:3333 (project `yourdatasource_live`) and in Solr (query: `collectionId:yourdatasource`).
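
For orientation, here is a hedged sketch of what the edited lines in bin/yourdatasource.sh from Step 4 might look like; the variable names are assumptions, so compare with bin/uhhediss.sh for the actual ones.

```
# Hypothetical excerpt from bin/yourdatasource.sh (variable names are assumptions;
# check bin/uhhediss.sh for the real ones)
codename="yourdatasource"                                # line 17: codename of the source
oai_url="http://ediss.sub.uni-hamburg.de/oai2/oai2.php"  # line 18: OAI-PMH endpoint
ram="4G"                                                 # line 19: memory allocated to OpenRefine
```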