# HOS-MetadataTransformations
Automated workflow for harvesting, transforming, and indexing metadata using metha, OpenRefine and Solr. Part of the Hamburg Open Science "Schaufenster" software stack.
## Use case
- Harvest metadata in different standards (Dublin Core, DataCite, ...) from multiple OAI-PMH endpoints
- Transform harvested data with specific rules for each source to produce normalized and enriched data
- Load transformed data into a Solr search index (which serves as a backend for a discovery system, e.g. HOS-TYPO3-find)
## Data Flow
Source: flowchart.mmd (try mermaid live editor)
## Preview
## Features
- Simple automated cronjob-ready workflow: one bash script for each data source and an additional script to run all scripts in parallel
- Cache for incremental OAI harvesting (via metha)
- Graphical user interface (OpenRefine) for exploring the data, creating the transformation rules and checking the results; it is accessible on the local network via a web browser, and the data is updated automatically
- Results are made available in a preinstalled local Solr core or in external Solr cores. You can set (and reset) the Solr schema via a bash script.
- Data is stored in the filesystem in common formats (XML, TSV), so you can extend the workflow with command-line tools to further manipulate the data (see the sketch below).
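
As a hedged illustration of the last point, the snippet below pipes one of the exported TSV files through standard command-line tools; the path and file name are assumptions and may differ from the layout the scripts actually produce.

```
# Hypothetical post-processing of an exported TSV file with standard CLI tools.
# The path data/uhhediss/uhhediss.tsv is an assumption; adjust it to the layout
# created on your machine.
cut -f1 data/uhhediss/uhhediss.tsv | sort | uniq -c | sort -rn | head
```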
## System requirements
- minimum: 2GB RAM
- recommended: 8GB RAM (to run all scripts in parallel)
## Installation
Tested with Ubuntu 16.04 LTS and Ubuntu 18.04 LTS.
Install git:

```
sudo apt install git
```
Clone this git repository:

```
git clone https://github.com/subhh/HOS-MetadataTransformations.git
cd HOS-MetadataTransformations
```
Install openjdk-8-jre-headless, zip, curl, jq, metha 1.29, OpenRefine 3.2 beta, openrefine-client 0.3.4 and Solr 7.3.1:

```
sudo ./install.sh
```
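
If you want to confirm that some of the core tools are available after installation, a quick check like the following may help (only standard version flags are used):

```
# Optional sanity check of a few installed dependencies
java -version
jq --version
curl --version | head -n 1
```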
Configure the Solr schema:

```
./init-solr-schema.sh
```
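
To verify that the schema was applied, you can query the Solr Schema API of the local core (assuming the default core name `hos` used in the URLs below; jq is installed by install.sh):

```
# List the field names currently defined in the local "hos" core
curl -s "http://localhost:8983/solr/hos/schema/fields" | jq -r '.fields[].name'
```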
## Usage
Data will be available after the first run at:
- Solr admin: http://localhost:8983/solr/#/hos
- Solr browse: http://localhost:8983/solr/hos/browse
- OpenRefine: http://localhost:3333
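
After the first run you can also check from the command line how many documents were indexed (a simple sketch using Solr's standard select handler):

```
# Count indexed documents in the local "hos" core
curl -s "http://localhost:8983/solr/hos/select?q=*:*&rows=0&wt=json" | jq '.response.numFound'
```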
Run the workflow with the data source "uhhediss" and load data into the local Solr core (-s) and the local OpenRefine service (-d):

```
bin/uhhediss.sh -s http://localhost:8983/solr/hos -d http://localhost:3333
```
Run the workflow with all data sources in parallel and load data into the local Solr core (-s) and the local OpenRefine service (-d):

```
./run.sh -s http://localhost:8983/solr/hos -d http://localhost:3333
```
Run the workflow with all data sources and load data into two external Solr cores (-s) and an external OpenRefine service (-d):

```
./run.sh -s https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS -s https://openscience.hamburg.de/solrAdmin/HOS -d http://openrefine.sub.uni-hamburg.de:80
```
## Solr authentication
If your external Solr is secured with username/password (Basic Authentication Plugin), you can provide the credentials by copying cfg/solr/credentials.example to cfg/solr/credentials and filling in the username and password:

```
cp cfg/solr/credentials.example cfg/solr/credentials
nano cfg/solr/credentials
chmod 400 cfg/solr/credentials
```
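
The exact format of cfg/solr/credentials is defined by this repository's scripts; independently of that, you can test that your username and password are accepted by the secured Solr core with a plain curl request (hypothetical host shown):

```
# Test Basic Authentication against the external Solr core
curl -u "yourusername:yourpassword" \
  "https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS/select?q=*:*&rows=0"
```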
## Cronjob
Example of a daily cronjob at 00:35 that runs the workflow with all data sources, loads data into an external Solr core (-s) and an external OpenRefine service (-d), and deletes files older than 7 days (-x):

```
command="$(readlink -f run.sh) -s https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS -d http://openrefine.sub.uni-hamburg.de:80 -x 7"
job="35 0 * * * $command"
cat <(fgrep -i -v "$command" <(crontab -l)) <(echo "$job") | crontab -
```
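
The cat/fgrep construction removes any previous crontab entry containing the same command before appending the new one, so the snippet can be re-run without creating duplicates. Afterwards you can confirm that the job was added:

```
# Show the installed crontab entries
crontab -l
```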
## Add a data source
- Step 1: Harvest the new OAI-PMH endpoint and load the data into OpenRefine. Example for a new data source called `yourdatasource` with the OAI-PMH endpoint `http://ediss.sub.uni-hamburg.de/oai2/oai2.php`:

  ```
  ./load-new-data.sh -c yourdatasource -i http://ediss.sub.uni-hamburg.de/oai2/oai2.php
  ```
- Step 2: Explore the data in OpenRefine at http://localhost:3333 (project `yourdatasource_new`) and create transformations until the data looks fine and suits the Solr schema.
- Step 3: Extract the OpenRefine project history in JSON format and save it in a subdirectory of cfg/, e.g. `cfg/yourdatasource/transformation.json`.
- Step 4: Copy an existing bash shell script (e.g. bin/uhhediss.sh) to `bin/yourdatasource.sh` and edit line 17 (codename of the source, e.g. `yourdatasource`) and line 18 (URL of the OAI-PMH endpoint, e.g. `http://ediss.sub.uni-hamburg.de/oai2/oai2.php`). If you load a big dataset, you may need to allocate more memory to OpenRefine (line 19). A hedged sketch of the edited lines follows after this list.

  ```
  cp -a bin/uhhediss.sh bin/yourdatasource.sh
  gedit bin/yourdatasource.sh
  ```
- Step 5: Run your shell script (or the full workflow):

  ```
  bin/yourdatasource.sh -s http://localhost:8983/solr/hos -d http://localhost:3333
  ```
- Step 6: Check the results in OpenRefine at http://localhost:3333 (project `yourdatasource_live`) and in Solr (query: `collectionId:yourdatasource`).
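
For orientation, here is a hedged sketch of what the edited lines in bin/yourdatasource.sh from Step 4 might look like; the variable names are assumptions, so compare with bin/uhhediss.sh for the actual ones.

```
# Hypothetical excerpt from bin/yourdatasource.sh (variable names are assumptions;
# check bin/uhhediss.sh for the real ones)
codename="yourdatasource"                                # line 17: codename of the source
oai_url="http://ediss.sub.uni-hamburg.de/oai2/oai2.php"  # line 18: OAI-PMH endpoint
ram="4G"                                                 # line 19: memory allocated to OpenRefine
```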