RML Mapper - Cefriel fork

This repository contains a fork of the RMLio/rmlmapper-java repository maintained by Cefriel as a building block of the Chimera solution for semantic data conversion.

The fork is aligned with version 4.7.0 of RMLio/rmlmapper-java.

Changelog

Add support for RDF4J Repositories

Base IRI and Prefix Base IRI

New options: -iri (--baseIRI) and -pb (--prefixBaseIRI).

Concurrency

New options:

Memory Performance optimizations

When the conversion is applied to huge datasets, the bottleneck is usually memory consumption, since the conversion is executed once and without particular timing constraints. Analysing memory dumps, we identified the set of objects retaining the largest amount of memory:

Given this information, we implemented a set of options to optimize the performance of the lifting procedure in specific cases.

Optimized JSON and XML files access

We offer alternative implementations of the parsing procedures for JSON and XML files. For JSON, we added caches to optimize path compilation and retrieval. For XML, we switched the implementation to Saxon-HE, which greatly reduces both memory consumption and the time required to process XPath expressions.
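
The idea of caching compiled paths can be illustrated with a minimal sketch. It is not the fork's implementation: it uses the XPath API bundled with the JDK as a stand-in for Saxon-HE or a JSONPath library, and the class name CompiledPathCache and the sample records are hypothetical. Each expression is compiled on first use and reused for every later record.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Hypothetical helper: compiles each path expression once and reuses it. */
final class CompiledPathCache {
    private final XPath xpath = XPathFactory.newInstance().newXPath();
    private final Map<String, XPathExpression> cache = new HashMap<>();
    private int compilations = 0;

    /** Returns the compiled form of the expression, compiling it only on a cache miss. */
    XPathExpression get(String expression) {
        XPathExpression compiled = cache.get(expression);
        if (compiled == null) {
            try {
                compiled = xpath.compile(expression);
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("Cannot compile: " + expression, e);
            }
            cache.put(expression, compiled);
            compilations++;
        }
        return compiled;
    }

    /** Evaluates the (cached) expression against one XML record and returns its string value. */
    String evaluate(String expression, String xml) {
        try {
            return get(expression).evaluate(new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expression, e);
        }
    }

    /** Number of times an expression was actually compiled. */
    int compilations() {
        return compilations;
    }

    public static void main(String[] args) {
        CompiledPathCache cache = new CompiledPathCache();
        String[] records = {
            "<stop><name>Central</name></stop>",
            "<stop><name>Harbour</name></stop>"
        };
        for (String record : records) {
            System.out.println(cache.evaluate("/stop/name", record));
        }
        System.out.println("compilations: " + cache.compilations());
    }
}
```

This sketch is single-threaded on purpose, because the JDK's XPathExpression objects are not thread-safe. A concurrent mapper would need one cache per thread or a different strategy.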

New options: -jopt (--jsonOptRecordFactory) and -sax (--saxRecordFactory).

Incremental Writes

If a triple store is used as the output store, some additional options can help improve performance. For huge materialized knowledge graphs, to reduce memory consumption and to avoid flooding the triple store with a single insert query, we created options to write to the repository in batches: an update is sent each time the number of generated triples reaches the batch size. With this option active, triples written to the triple store are discarded from memory once the query has completed. Duplicate elimination is still guaranteed, since it is delegated to the triple store. Requests to the triple store can also be issued from multiple threads, so that the mapping procedure is not blocked, using the ConcurrentRDF4JRepository class (this approach is currently not available through CLI options).
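
The batching behaviour described above can be sketched in a few lines. This is an illustration under stated assumptions, not the fork's code: BatchingWriter is a hypothetical class, statements are plain strings, and the sink is a generic callback standing in for the write to the repository.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical buffer that hands statements to a sink in fixed-size batches. */
final class BatchingWriter {
    private final int batchSize;
    private final Consumer<List<String>> sink; // stand-in for the write to the triple store
    private final List<String> buffer = new ArrayList<>();

    BatchingWriter(int batchSize, Consumer<List<String>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one statement and flushes as soon as the batch size is reached. */
    void add(String statement) {
        buffer.add(statement);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends the buffered statements to the sink, then drops them from memory. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // copy, so the sink keeps a stable view
        buffer.clear();
    }

    /** Number of statements currently held in memory. */
    int buffered() {
        return buffer.size();
    }

    public static void main(String[] args) {
        BatchingWriter writer =
            new BatchingWriter(2, batch -> System.out.println("write " + batch.size() + " statements"));
        for (int i = 0; i < 5; i++) {
            writer.add("statement-" + i);
        }
        writer.flush(); // final partial batch at the end of the mapping
    }
}
```

The final flush matters: without it, a trailing partial batch would never reach the store. The sketch performs no duplicate elimination, in line with the text, which leaves that to the triple store.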

New options: -inc (--incrementalUpdate) and -b (--batchSize).

Mappings without join conditions

If mappings have few or no join conditions, some additional options can help improve performance. We added an option to avoid using the subjects and records caches in the executor; in our tests, memory consumption decreased with no observable change in execution time. To reduce memory usage further, we added an option to order the execution of TriplesMaps by logical source, cleaning the records cache in RecordsFactory each time all TriplesMaps related to a specific logical source are completed.
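
The ordering strategy can be sketched as follows. This is a simplified illustration, not the executor's actual code: MapSpec is a hypothetical stand-in for a TriplesMap, loadRecords fakes the parsing of a logical source, and the file names are invented. TriplesMaps are grouped by logical source, each source is loaded once, and its cache entry is removed as soon as the group is done.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of executing TriplesMaps grouped by logical source. */
final class OrderedExecutionSketch {
    /** Stand-in for a TriplesMap: a name plus the logical source it reads. */
    static final class MapSpec {
        final String name;
        final String source;

        MapSpec(String name, String source) {
            this.name = name;
            this.source = source;
        }
    }

    private final Map<String, List<String>> recordsCache = new HashMap<>();
    private int peakCachedSources = 0;

    /** Runs all maps source by source and returns the names in execution order. */
    List<String> run(List<MapSpec> maps) {
        // Group by logical source, keeping the order in which sources first appear.
        Map<String, List<MapSpec>> bySource = new LinkedHashMap<>();
        for (MapSpec m : maps) {
            bySource.computeIfAbsent(m.source, k -> new ArrayList<>()).add(m);
        }
        List<String> executed = new ArrayList<>();
        for (Map.Entry<String, List<MapSpec>> group : bySource.entrySet()) {
            String source = group.getKey();
            recordsCache.put(source, loadRecords(source)); // load once per source
            peakCachedSources = Math.max(peakCachedSources, recordsCache.size());
            for (MapSpec m : group.getValue()) {
                executed.add(m.name); // a real executor would generate triples here
            }
            recordsCache.remove(source); // no later map needs these records
        }
        return executed;
    }

    /** Fake loader; a real one would parse the file behind the logical source. */
    private static List<String> loadRecords(String source) {
        List<String> records = new ArrayList<>();
        records.add("record from " + source);
        return records;
    }

    int peakCachedSources() {
        return peakCachedSources;
    }

    int cachedSources() {
        return recordsCache.size();
    }

    public static void main(String[] args) {
        OrderedExecutionSketch sketch = new OrderedExecutionSketch();
        List<String> order = sketch.run(Arrays.asList(
            new MapSpec("StopsMap", "stops.csv"),
            new MapSpec("RoutesMap", "routes.json"),
            new MapSpec("StopTimesMap", "stops.csv")));
        System.out.println(order + ", peak cached sources: " + sketch.peakCachedSources());
    }
}
```

Removing a source's records early is only safe when no later TriplesMap needs them, which is why the text ties this option to mappings with few or no join conditions.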

New options: -n (--noCache) and -ord (--ordered).

Other changes

rmlmapper-cefriel.jar

This is the intended usage of the rmlmapper-cefriel.jar.

usage: java -jar rmlmapper-cefriel.jar <options>
options:
 -b,--batchSize <arg>             If -inc is set it is used as batch size for incremental updates, i.e., 
                                  number of statements for each write, otherwise it is ignored.
 -ctx,--context <arg>             IRI identifying named graph for triples generated.
 -es,--emptyStrings               Set option if empty strings should be considered as values.
 -f,--functionfile <arg>          Path to functions.ttl file (dynamic functions are found relative to functions.ttl).
 -jopt,--jsonOptRecordFactory     Enable optimized parser for JSONPath reference formulation.
 -inc,--incrementalUpdate         Incremental update option to incrementally load triples in the repository 
                                  while performing the mapping procedure. If -b is not set each triple 
                                  generated is directly written to the repository.
 -iri,--baseIRI <arg>             Specify a base IRI for relative IRIs. Otherwise @base is parsed.
 -m,--mappingfile <arg>           One or more mapping file paths and/or strings (multiple values are concatenated).
 -n,--noCache                     Do not use subjects and records caches in the executor. 
 -ord,--ordered                   Mapping execution is ordered by logical source and caches are cleaned 
                                  after each logical source.
 -o,--outputfile <arg>            Path to output file (-o stdout can be used for debugging).
 -pb,--prefixBaseIRI <arg>        Specify a prefix for the base IRI used for relative IRIs.
 -r,--repositoryId <arg>          Repository Id related to the triples store. Also option -ts
                                  should be provided.
 -sax,--saxRecordFactory          Enable Saxon parser for XPath reference formulation.
 -s,--serialization <arg>         Serialization format (nquads (default), turtle, trig, trix, jsonld, hdt).
 -ts,--triplesStore <arg>         Address to reach the triples store. If specified produced triples are also
                                  written at this address. Also option -r should be provided.
 -v,--verbose                     Show more details in debugging output.
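
As a usage illustration, the sketch below assembles a command line for an incremental load into a triple store, using only options listed above. The mapping file name, store address and repository id are placeholder assumptions, and RmlMapperCommand is a hypothetical helper. The sketch only builds and prints the command; it does not launch the jar.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles an rmlmapper-cefriel.jar command line. */
final class RmlMapperCommand {
    /** Builds the arguments for a batched, incremental load into a triple store. */
    static List<String> incrementalLoad(String mappingFile, String storeAddress,
                                        String repositoryId, int batchSize) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add("rmlmapper-cefriel.jar");
        cmd.add("-m");
        cmd.add(mappingFile);   // mapping file path
        cmd.add("-ts");
        cmd.add(storeAddress);  // address of the triple store
        cmd.add("-r");
        cmd.add(repositoryId);  // required together with -ts
        cmd.add("-inc");        // write incrementally while mapping
        cmd.add("-b");
        cmd.add(Integer.toString(batchSize)); // only used because -inc is set
        return cmd;
    }

    public static void main(String[] args) {
        // Placeholder values; replace with a real mapping file, address and repository id.
        List<String> cmd = incrementalLoad("mapping.ttl", "http://localhost:7200/", "my-repo", 10000);
        System.out.println(String.join(" ", cmd));
        // To launch it for real, pass the list to new ProcessBuilder(cmd).inheritIO().start().
    }
}
```

For mappings with few or no join conditions, -n and -ord could be appended to the same list to lower memory usage further.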