# RML Mapper - Cefriel fork
This repository contains a fork of the RMLio/rmlmapper-java repository maintained by Cefriel as a building block of the Chimera solution for semantic data conversion.
The fork is updated to version 4.7.0 of RMLio/rmlmapper-java.
## Changelog
### Add support for RDF4J Repositories
- A new `QuadStore` subclass is defined, `RDF4JRepository`, targeting a remote repository (identified by the address and the repository ID) or an already initialized RDF4J `Repository`.

New options:
- `--triplesStore`: Address to reach the triples store. If specified, produced triples are written at this address.
- `--repositoryId`: Repository ID related to the triples store. The `--triplesStore` option (`-ts`) should also be provided.
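
For illustration, a possible invocation writing the generated triples to a remote repository could look like the following sketch; the mapping file name, server address, and repository ID are placeholders.

```bash
# Sketch: run the mapping and write the output to a remote RDF4J repository.
# mapping.ttl, the server URL, and the repository ID below are placeholders.
java -jar rmlmapper-cefriel.jar \
  -m mapping.ttl \
  -ts http://localhost:8080/rdf4j-server \
  -r myRepository
```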
### Base IRI and Prefix Base IRI
- The base IRI can also be provided as an argument to `cli.Main`.
- Relative IRIs are also allowed for objects in generated triples. The base IRI is added, as for subjects, to build a valid IRI.
- The namespace related to the base IRI is added to the output model/to the triple store. If a prefix for the base IRI is provided, it is used to identify the base IRI; otherwise, the `base:` prefix is used.
New options:
- `--baseIRI` and `--prefixBaseIRI`: To set the base IRI and a related prefix (otherwise `@base` is parsed).
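
As an example of the new options (paths and IRIs below are placeholders), a base IRI and a related prefix can be passed on the command line.

```bash
# Sketch: resolve relative IRIs against a base IRI and bind the "ex" prefix
# to it in the serialized output. All values below are placeholders.
java -jar rmlmapper-cefriel.jar \
  -m mapping.ttl \
  -o output.ttl -s turtle \
  -iri http://example.org/ \
  -pb ex
```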
### Concurrency
- Concurrent Executor
- Modifications to make different procedures thread-safe, such as function loading, record access, and the storing of generated triples.
New options:
- TODO Enable concurrent execution with CLI options
### Memory Performance optimizations
When the conversion is applied to huge datasets, the bottleneck is often memory consumption rather than execution time, since the conversion is executed once and without particular timing constraints. Analysing memory dumps, we identified the objects retaining the largest amount of memory:
- `RDF4JStore`: storing the triples generated by the mapping procedure.
- Caches in the `Executor` class: subjects and records caches in the executor to improve performance when join conditions are used in the RML specification.
- Cache in the `RecordsFactory` class: records cache to improve performance when the same logical source is referred to by multiple TriplesMaps.
Given this information, we implemented a set of options to optimize the performance of the lifting procedure in specific cases.
#### Optimized JSON and XML file access
We offer alternative implementations of the parsing procedures for JSON and XML files. For JSON, we added caches to optimize path compilation and value retrieval. For XML, we changed the implementation to use Saxon-HE, which brings substantial improvements in memory consumption and in the time required to process XPath expressions.
New options:
- `-sax`: Enable Saxon parser for XPath reference formulation.
- `-jopt`: Enable optimized parser for JSONPath reference formulation.
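
As a sketch (mapping and output file names are placeholders), the optimized parsers are enabled per invocation.

```bash
# Sketch: enable the Saxon-based XPath parser for an XML logical source.
java -jar rmlmapper-cefriel.jar -m mapping-xml.ttl -o out.nq -sax

# Sketch: enable the optimized JSONPath parser for a JSON logical source.
java -jar rmlmapper-cefriel.jar -m mapping-json.ttl -o out.nq -jopt
```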
#### Incremental Writes
If a triple store is used as output store, some additional options can help improve performance. For huge materialized knowledge graphs, to reduce memory consumption and avoid flooding the triple store with a single insert query, we added options to send batched updates to the repository each time the number of generated triples reaches the batch size. When this option is active, triples already written to the triple store are discarded from memory once the corresponding update query completes. Duplicate elimination is still guaranteed, but it is delegated to the triple store. Requests to the triple store can be issued on separate threads, to avoid blocking the mapping procedure, using the `ConcurrentRDF4JRepository` class (currently this approach is not available through CLI options).
New options:
- `--incrementalUpdate`: Incrementally load triples into the database while performing the mapping procedure. Note that if a triples store is used, duplicated triples are automatically removed; therefore, even if incremental updates are constant in size, the triples store size may not grow linearly.
- `--batchSize`: Batch size, i.e., the number of statements for each update writing generated triples to the triples store. If `-inc` is set, the default value is 0, i.e., every generated triple is directly written to the triple store. If `-inc` is not set, this value is ignored.
- TODO Enable concurrent writes with CLI options
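
A possible invocation combining incremental updates with a batch size is sketched below; the server address, repository ID, and batch size are placeholders.

```bash
# Sketch: write triples to the repository in batches of 10000 statements
# while the mapping procedure is running. All values below are placeholders.
java -jar rmlmapper-cefriel.jar \
  -m mapping.ttl \
  -ts http://localhost:8080/rdf4j-server \
  -r myRepository \
  -inc -b 10000
```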
#### Mappings without join conditions
If mappings have no or few join conditions, some additional options can help improve performance. We added an option to avoid using the subjects and records caches in the executor; in our tests, memory consumption decreased while execution time was unaffected. Moreover, to further reduce the memory used during execution, we added an option to order the execution of TriplesMaps by logical source, cleaning the records cache in `RecordsFactory` each time all TriplesMaps related to a specific logical source are completed.
New options:
- `--noCache`: Do not use subjects and records caches in the executor.
- `--ordered`: Mapping execution is ordered by logical source and records caches are cleaned after each logical source.
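
For instance (file names are placeholders), for a mapping without join conditions both options can be combined.

```bash
# Sketch: disable the executor caches and process TriplesMaps grouped by
# logical source to keep memory usage low. File names are placeholders.
java -jar rmlmapper-cefriel.jar -m mapping.ttl -o out.nq -n -ord
```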
### Other changes
- Empty strings in CSV/JSON/XML records are not considered in mappings (to avoid generating `?s ?p ""` kinds of triples). To also consider empty strings as values, the `-es` option should be set.
- If a logical source is not found, the procedure continues, skipping the related mapping and logging the event.
- Changed the `-o` option behaviour. If the `-o` option is not set, the default behaviour is to do nothing. To print to stdout, use `-o stdout`. To save to a file, the `-o` option should be set; it can be combined with the `-ts` and `-r` options if `-inc` is not set.
- Added the `-ctx` option to specify a context (named graph) for the generated triples.
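
For illustration (the context IRI is a placeholder), empty strings can be kept as values and a named graph assigned to the generated triples.

```bash
# Sketch: keep empty strings as values, place the generated triples in a
# named graph, and print the result to stdout for inspection.
java -jar rmlmapper-cefriel.jar -m mapping.ttl -o stdout -es \
  -ctx http://example.org/graph/run1
```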
## rmlmapper-cefriel.jar

This is the intended usage of the `rmlmapper-cefriel.jar`.
```
usage: java -jar rmlmapper-cefriel.jar <options>
options:
 -b,--batchSize <arg>            If -inc is set it is used as batch size for incremental updates, i.e.,
                                 number of statements for each write, otherwise it is ignored.
 -ctx,--context <arg>            IRI identifying named graph for triples generated.
 -es,--emptyStrings              Set option if empty strings should be considered as values.
 -f,--functionfile <arg>         Path to functions.ttl file (dynamic functions are found relative to functions.ttl).
 -jopt,--jsonOptRecordFactory    Enable optimized parser for JSONPath reference formulation.
 -inc,--incrementalUpdate        Incremental update option to incrementally load triples in the repository
                                 while performing the mapping procedure. If -b is not set each triple
                                 generated is directly written to the repository.
 -iri,--baseIRI <arg>            Specify a base IRI for relative IRIs. Otherwise @base is parsed.
 -m,--mappingfile <arg>          One or more mapping file paths and/or strings (multiple values are concatenated).
 -n,--noCache                    Do not use subjects and records caches in the executor.
 -ord,--ordered                  Mapping execution is ordered by logical source and caches are cleaned
                                 after each logical source.
 -o,--outputfile <arg>           Path to output file (-o stdout can be used for debugging).
 -pb,--prefixBaseIRI <arg>       Specify a prefix for the base IRI used for relative IRIs.
 -r,--repositoryId <arg>         Repository Id related to the triples store. Also option -ts
                                 should be provided.
 -sax,--saxRecordFactory         Enable Saxon parser for XPath reference formulation.
 -s,--serialization <arg>        Serialization format (nquads (default), turtle, trig, trix, jsonld, hdt).
 -ts,--triplesStore <arg>        Address to reach the triples store. If specified produced triples are also
                                 written at this address. Also option -r should be provided.
 -v,--verbose                    Show more details in debugging output.
```
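
As a final sketch (all file names are placeholders), a typical file-to-file conversion with custom functions could be invoked as follows.

```bash
# Sketch: apply a mapping with custom functions and serialize the result as
# Turtle, with verbose logging. All file names are placeholders.
java -jar rmlmapper-cefriel.jar \
  -m mapping.ttl \
  -f functions.ttl \
  -o output.ttl \
  -s turtle \
  -v
```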