Awesome
DataLake
A library to query heterogeneous data sources uniformly using SPARQL.
Description
Data Lake
The term Data Lake denotes a schema-less repository of data residing in its original format and form. As such, there is no single point of entry to the Data Lake: the data, in its diversity, has various schemata, query interfaces and query languages.
Semantic Data Lake
The Semantic Data Lake is an effort to enable querying this wealth of heterogeneous data using Semantic Web principles: a mapping language and the SPARQL query language. This supplies the Data Lake with a schema and provides a single entry point, a SPARQL query, to the various heterogeneous data. Before a data source can be queried, a connection to it needs to be established.
To query the Data Lake using the Semantic Data Lake approach, users therefore need to provide three inputs: (1) a Mappings file, (2) a Config file, and (3) a SPARQL query, described in the next three sections.
1. Mapping Language and Data Lake Schema
A virtual schema is added to the Data Lake by mapping data elements, e.g., tables and attributes, to ontology concepts, e.g., classes and predicates. We use RML mappings to express those schema mapping links.
An example of such mappings is given below. It maps a collection named Offer (rml:source "//Offer") in a MongoDB database to the ontology class schema:Offer (rr:class schema:Offer), meaning that every document in the Offer collection is of type schema:Offer. The mappings also link the MongoDB collection fields validTo, publisher and producer to the ontology predicates bsbm:validTo, dc:publisher and bsbm:producer, respectively. The _id field found in the rr:subjectMap rr:template "http://example.com/{_id}" triple points to the primary key of the MongoDB collection.
<#OfferMapping>
    rml:logicalSource [
        rml:source "//Offer";
        nosql:store nosql:MongoDB
    ];
    rr:subjectMap [
        rr:template "http://example.com/{_id}";
        rr:class schema:Offer
    ];
    rr:predicateObjectMap [
        rr:predicate bsbm:validTo;
        rr:objectMap [rml:reference "validTo"]
    ];
    rr:predicateObjectMap [
        rr:predicate dc:publisher;
        rr:objectMap [rml:reference "publisher"]
    ];
    rr:predicateObjectMap [
        rr:predicate bsbm:producer;
        rr:objectMap [rml:reference "producer"]
    ].
Note the presence of the triple nosql:store nosql:MongoDB: it is an addition to the RML mappings, taken from the NoSQL ontology, that allows stating what type of data source is being mapped.
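For illustration, the same construct can declare a logical source in another store by changing only the source name and the store term. The following is a minimal sketch for a hypothetical Cassandra table; the table name, key column and field are illustrative, and it assumes the NoSQL ontology provides a corresponding nosql:Cassandra term:
<#ProductMapping>
    rml:logicalSource [
        rml:source "Product";
        nosql:store nosql:Cassandra
    ];
    rr:subjectMap [
        rr:template "http://example.com/{nr}";
        rr:class bsbm:Product
    ];
    rr:predicateObjectMap [
        rr:predicate rdfs:label;
        rr:objectMap [rml:reference "label"]
    ].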
The mappings file can be created either manually or using the following graphical utility: Squerall-GUI.
2. Data Connection Configurations
In order to connect to a data source, users need to provide a set of config parameters in JSON format. These differ from one data source to another; for example, for a MongoDB collection, the config parameters could be: database host URL, database name, collection name, and replica set name.
{
    "type": "mongodb",
    "options": {
        "url": "127.0.0.1",
        "database": "bsbm",
        "collection": "offer",
        "options": "replicaSet=mongo-rs"
    },
    "source": "//Offer",
    "entity": "Offer"
}
It is necessary to link the configured source ("source": "//Offer") to the mapped source (rml:logicalSource [rml:source "//Offer"], see the Mapping section above).
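For comparison, a sketch of what the config entry might look like for a different kind of source, e.g., a Cassandra table, is given below. The option keys shown (keyspace, table) are illustrative and depend on the connector used; the "source" value would likewise have to match the rml:source of the corresponding mapping:
{
    "type": "cassandra",
    "options": {
        "keyspace": "bsbm",
        "table": "product"
    },
    "source": "Product",
    "entity": "Product"
}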
The config file can be created either manually or using the following graphical utility: Squerall-GUI.
3. SPARQL Query Interface
SPARQL queries are expressed using the ontology terms the data was previously mapped to. Queries should conform to the currently supported SPARQL fragment:
Query       := Prefix* SELECT Distinguish WHERE { Clauses } Modifiers?
Prefix      := PREFIX "string:" IRI
Distinguish := DISTINCT? ("*" | (Var | Aggregate)+)
Aggregate   := (AggOpe(Var) AS Var)
AggOpe      := SUM | MIN | MAX | AVG | COUNT
Clauses     := TP* Filter?
Filter      := FILTER (Var FiltOpe Literal)
             | FILTER regex(Var, "%string%")
FiltOpe     := = | != | < | <= | > | >=
TP          := Var IRI Var . | Var rdf:type IRI .
Var         := "?string"
Modifiers   := (LIMIT k)? (ORDER BY (ASC|DESC)? Var)? (GROUP BY Var+)?
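As an illustration, the following query conforms to this fragment and uses the ontology terms from the mapping example above to retrieve distinct offers and their publishers. The prefix IRIs are assumed to be the standard namespaces of these vocabularies, and the regex pattern is just an illustrative placeholder:
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <http://schema.org/>
PREFIX dc:     <http://purl.org/dc/elements/1.1/>
PREFIX bsbm:   <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>

SELECT DISTINCT ?offer ?publisher
WHERE {
    ?offer rdf:type schema:Offer .
    ?offer dc:publisher ?publisher .
    ?offer bsbm:validTo ?validTo .
    FILTER regex(?publisher, "%shop%")
}
LIMIT 10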
File Storage Format
The previous three files can be stored either locally, in HDFS, or in an AWS S3 bucket. For the latter, make sure to have your AWS credentials stored in ~/.aws/credentials (C:\Users\USERNAME\.aws\credentials on Windows), in the following form:
[default]
aws_access_key_id=...
aws_secret_access_key=...
Usage
The usage of the Semantic Data Lake is documented under the respective SANSA-Query datalake component.
How to Contribute
We always welcome new contributors to the project! Please see our contribution guide for more details on how to get started contributing to SANSA.