Awesome
Amazon Neptune CSV to RDF Converter
A tool for Amazon Neptune that converts property graphs stored as comma separated values into RDF graphs.
Usage
Amazon Neptune CSV to RDF Converter is a Java library for converting a property graph stored in CSV files to RDF. It expects an input directory containing the CSV files, an output directory, and an optional configuration file. The output directory will be created if it does not exist. See Gremlin Load Data Format about the input and RDF 1.1 N-Quads about the output format.
The input files need to be UTF-8 encoded. The same encoding is used for the output files.
The library is available as executable Jar file and can be run from the command line by java -jar amazon-neptune-csv2rdf.jar -i <input directory> -o <output directory>
. Use java -jar amazon-neptune-csv2rdf.jar -h
to see all options:
Usage: java -jar amazon-neptune-csv2rdf.jar [-hV] [-c=<configuration file>]
-i=<input directory> -o=<output directory>
-c, --config=<configuration file>
Propery file containing the configuration.
-h, --help Show this help message and exit.
-i, --input=<input directory>
Directory containing the CSV files (UTF-8 encoded).
-o, --output=<output directory>
Directory for writing the RDF files (UTF-8 encoded); will be
created if it does not exist.
-V, --version Print version information and exit.
The conversion is based on two steps. First, a general mapping from property graph vertices and edges
to RDF statements is applied to the input files. The optional second step transforms RDF resource IRIs
according to user defined rules for replacing artificial ids with more natural ones. However, this
transformation needs to load all triples into main memory, so the JVM memory must be set accordingly with
-Xmx
, e.g., java -Xmx2g
.
Let's start with a small example to see how both steps work.
General mapping
Let vertices and edges be given as
~id,~label,name,code,country
1,city,Seattle,S,USA
2,city,Vancouver,V,CA
and
~id,~label,~from,~to,distance,type
a,route,1,2,166,highway
Using some simplified namespaces (see Configuration below for the details), the mapping results in:
<vertex:1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <type:City> <dng:/> .
<vertex:1> <vproperty:name> "Seattle" <dng:/> .
<vertex:1> <vproperty:code> "S" <dng:/> .
<vertex:1> <vproperty:country> "USA" <dng:/> .
<vertex:2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <type:City> <dng:/> .
<vertex:2> <vproperty:name> "Vancouver" <dng:/> .
<vertex:2> <vproperty:code> "V" <dng:/> .
<vertex:2> <vproperty:country> "CA" <dng:/> .
<vertex:1> <edge:route> <vertex:2> <econtext:a> .
<econtext:a> <eproperty:distance> "166" <dng:/> .
<econtext:a> <eproperty:type> "highway" <dng:/> .
The result shows that edge identifiers are stored as context of the corresponding RDF statement, and the edge properties are statements about that context. The edge identifiers can be queried in SPARQL using the GRAPH keyword.
Vertex labels are mapped to RDF types. The first letter of the label will be capitalized for this step:
The label city
becomes the RDF type <type:City>
.
Additionally, the mapping can add RDFS labels to the vertices. For example, the configuration
mapper.mapping.pgVertexType2PropertyForRdfsLabel.city=name
creates two additional RDF statements:
<vertex:1> <http://www.w3.org/2000/01/rdf-schema#label> "Seattle" <dng:/> .
<vertex:2> <http://www.w3.org/2000/01/rdf-schema#label> "Vancouver" <dng:/> .
The mapping can also map property values to resources. In the example, the value for country becomes an URI with
mapper.mapping.pgProperty2RdfResourcePattern.country=country:{{VALUE}}
and the two statements with the literal value "USA" and "CA" are replaced by:
<vertex:1> <edge:country> <country:USA> <dng:/> .
<vertex:2> <edge:country> <country:CA> <dng:/> .
URI transformations
A URI transformation rule replaces parts of a resource URI with the value of a property. In the previous example, the code could be used to create the resource URIs. This can be achieved using:
transformer.uriPostTransformations.1.srcPattern=vertex:([0-9]+)
transformer.uriPostTransformations.1.typeUri=type:City
transformer.uriPostTransformations.1.propertyUri=vproperty:code
transformer.uriPostTransformations.1.dstPattern=city:{{VALUE}}
The resulting statements are now:
<city:S> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <type:City> <dng:/> .
<city:S> <http://www.w3.org/2000/01/rdf-schema#label> "Seattle" <dng:/> .
<city:S> <vproperty:name> "Seattle" <dng:/> .
<city:S> <vproperty:code> "S" <dng:/> .
<city:S> <edge:country> <country:USA> <dng:/> .
<city:V> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <type:City> <dng:/> .
<city:V> <http://www.w3.org/2000/01/rdf-schema#label> "Vancouver" <dng:/> .
<city:V> <vproperty:name> "Vancouver" <dng:/> .
<city:V> <vproperty:code> "V" <dng:/> .
<city:V> <edge:country> <country:CA> <dng:/> .
<city:S> <edge:route> <city:V> <econtext:a> .
<econtext:a> <eproperty:distance> "166" <dng:/> .
<econtext:a> <eproperty:type> "highway" <dng:/> .
Configuration
The configuration of the converter is a property file. It contains a default type, a default named graph, and namespaces for building vertex URIs, edge URIs, type URIs, vertex property URIs, and edge property URIs. The rules for adding RDFS labels, creating resources from property values, and the URI transformations are optional. It's also possible to set the file extension of the input files.
If no configuration file is given, the following default values are used:
inputFileExtension=csv
mapper.alwaysAddPropertyStatements=true
mapper.mapping.typeNamespace=http://aws.amazon.com/neptune/csv2rdf/class/
mapper.mapping.vertexNamespace=http://aws.amazon.com/neptune/csv2rdf/resource/
mapper.mapping.edgeNamespace=http://aws.amazon.com/neptune/csv2rdf/objectProperty/
mapper.mapping.edgeContextNamespace=http://aws.amazon.com/neptune/csv2rdf/resource/
mapper.mapping.vertexPropertyNamespace=http://aws.amazon.com/neptune/csv2rdf/datatypeProperty/
mapper.mapping.edgePropertyNamespace=http://aws.amazon.com/neptune/csv2rdf/datatypeProperty/
mapper.mapping.defaultNamedGraph=http://aws.amazon.com/neptune/vocab/v01/DefaultNamedGraph
mapper.mapping.defaultType=http://www.w3.org/2002/07/owl#Thing
The setting mapper.alwaysAddPropertyStatements
has only effect if a rule for adding RDFS labels is used. In
that case it decides whether or not to add the property that is being used for the RDFS label additionally
as RDF literal statement with that property. For the small example above, if the setting was chosen to be
false
, two statements would not be generated:
<city:S> <vproperty:name> "Seattle" <dng:/> .
<city:V> <vproperty:name> "Vancouver" <dng:/> .
The setting mapper.mapping.edgeContextNamespace
takes effect only when explicitly set. Otherwise, it uses
the value set for mapper.mapping.vertexNamespace
.
Vertex type to RDFS label mapping
Vertex types are defined by vertex labels. The option mapper.mapping.pgVertexType2PropertyForRdfsLabel.<vertex type>.<vertex property>
is used to specify a mapping from a vertex type to to a vertex property, whose value is then used
to create RDFS labels for any vertex belonging to this vertex type. Multiple such mappings are allowed.
Property to RDF resource mapping
The option pgProperty2RdfResourcePattern.<vertex property>=<namespace>{{VALUE}}
is used to create RDF resources
instead of literal values for vertices where the specified property is found. The variable {{value}}
will
be replaced with the value of the property and prefixed with the given namespace. Multiple such mappings
are allowed.
URI Post Transformations
URI Post Transformations are used to transform RDF resource IRIs into more readable ones.
An URI Post Transformation consists of four elements:
uriPostTransformation.<ID>.srcPattern=<URI regex patten>
uriPostTransformation.<ID>.typeUri=<URI>
uriPostTransformation.<ID>.propertyUri=<URI>
uriPostTransformation.<ID>.dstPattern=<URI pattern>
A positive integer <ID>
is required to group the elements. The grouping numbers of several transformation
configurations do not need to be consecutive. The transformation rules will be executed in ascending order
according to the grouping numbers. All four configuration items are required:
srcPattern
is a URI with a single regular expression group, e.g.<http://aws.amazon.com/neptune/csv2rdf/resource/([0-9]+)>
, defining the URI patterns of RDF resources to which the post transformation applies.typeUri
filters out all matched source URIs that do not belong to the specified RDF type.propertyUri
is the RDF predicate pointing to the replacement value.dstPattern
is the new URI, which must contain a{{VALUE}}
substring which is then substituted with the value ofpropertyUri
.
Example:
uriPostTransformation.1.srcPattern=http://example.org/resource/([0-9]+)
uriPostTransformation.1.typeUri=http://example.org/class/Country
uriPostTransformation.1.propertyUri=http://example.org/datatypeProperty/code
uriPostTransformation.1.dstPattern=http://example.org/resource/{{VALUE}}
This configuration transforms the URI http://example.org/resource/123
into http://example.org/resource/FR
,
given that there are the statements:
http://example.org/resource/123 a http://example.org/class/Country.
http://example.org/resource/123 http://example.org/datatypeProperty/code "FR".
Note that we assume that the property propertyUri
is unique for each resource, otherwise a runtime exception
will be thrown. Also note that the post transformation is applied using a two-pass algorithm over the
generated data, and the translation mapping is kept fully in memory. This means the property is suitable
only in cases where the number of mappings is small or if the amount of main memory is large.
Complete Configuration
The complete configuration for the small example above is:
mapper.alwaysAddPropertyStatements=false
mapper.mapping.typeNamespace=type:
mapper.mapping.vertexNamespace=vertex:
mapper.mapping.edgeNamespace=edge:
mapper.mapping.edgeContextNamespace=econtext:
mapper.mapping.vertexPropertyNamespace=vproperty:
mapper.mapping.edgePropertyNamespace=eproperty:
mapper.mapping.defaultNamedGraph=dng:/
mapper.mapping.defaultType=dt:/
mapper.mapping.defaultPredicate=dp:/
mapper.mapping.pgVertexType2PropertyForRdfsLabel.city=name
mapper.mapping.pgProperty2RdfResourcePattern.country=country:{{VALUE}}
transformer.uriPostTransformations.1.srcPattern=vertex:([0-9]+)
transformer.uriPostTransformations.1.typeUri=type:City
transformer.uriPostTransformations.1.propertyUri=vproperty:code
transformer.uriPostTransformations.1.dstPattern=city:{{VALUE}}
Examples
The small example above is contained in src/test/example
and can be tested with:
java -jar amazon-neptune-csv2rdf.jar -i src/test/example/ -o . -c src/test/example/city.properties
Additionally, the directory src/test/air-routes
contains a Zip archive of the
Air Routes data set and a sample configuration.
After unzipping the archive into air-routes
, it can be converted with:
java -jar amazon-neptune-csv2rdf.jar -i air-routes/ -o . -c src/test/air-routes/air-routes.properties
Known Limitations
The general mapping from property graph vertices and edges is done individually for each CSV line in order to avoid loading the whole CSV file into memory. However, that means that properties being defined on different lines are not joined and cardinality constraints cannot be checked. For example, the RDF mapping (using the simplified namespaces from the small example above) of the following property graph
- should reject the statement
<vertex:1> <eproperty:since> "tomorrow" <dng:/>
because edge properties have single cardinality, - should contain only one
<vertex:2> <edge:knows> <vertex:3> <vertex:1>
statement (however, RDF joins multiple equal statements into one), and - should not generate the statement
<vertex:3> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <dt:/> <dng:/>
because vertex 3 has a label.
Property Graph:
~id,~label,name
2,person,Alice
3,person,Bob
3,,Robert
~id,~label,~from,~to,since,personally
1,knows,2,3,yesterday,
1,knows,2,3,tomorrow,
1,knows,2,3,,true
RDF mapping:
<vertex:2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <type:Person> <dng:/> .
<vertex:2> <vproperty:name> "Alice" <dng:/> .
<vertex:3> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <type:Person> <dng:/> .
<vertex:3> <vproperty:name> "Bob" <dng:/> .
<vertex:3> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <dt:/> <dng:/> .
<vertex:3> <vproperty:name> "Robert" <dng:/> .
<vertex:2> <edge:knows> <vertex:3> <econtext:1> .
<econtext:1> <eproperty:since> "yesterday" <dng:/> .
<vertex:2> <edge:knows> <vertex:3> <econtext:1> .
<econtext:1> <eproperty:since> "tomorrow" <dng:/> .
<vertex:2> <edge:knows> <vertex:3> <econtext:1> .
<econtext:1> <eproperty:personally> "true" <dng:/> .
Building from source
Amazon Neptune CSV to RDF Converter is a Java Maven project and requires JDK 8 and Maven 3 to build from source. Change
into the source folder containing the file pom.xml
and run mvn clean install
. The directory target/
contains the
executable Jar library amazon-neptune-csv2rdf.jar
after a successful build. The executable Jar is not attached to the
build artifacts.
Activate the profile integration for running the integration tests during the build by using
mvn -Pintegration clean install
. Integration tests are distinguished from other tests by adding
the annotation @Tag("IntegrationTest")
.
Adding the library to your build
The group id of Amazon Neptune CSV to RDF Converter [javadoc]
is software.amazon.neptune
, its artifact id is amazon-neptune-csv2rdf
. In case you want to use
the library as part of another project, use the following to add a dependency in Maven:
<dependency>
<groupId>software.amazon.neptune</groupId>
<artifactId>amazon-neptune-csv2rdf</artifactId>
<version>1.0.0</version>
</dependency>
License
Amazon Neptune CSV to RDF Converter is available under Apache License, Version 2.0.
Copyright Amazon.com Inc. or its affiliates. All Rights Reserved.