Apache Beam pipelines for Scylla-RDF

Here you can find pipelines that manipulate RDF data in Scylla-RDF.

Bulk loading

For testing purposes or small files, you can run the pipeline with the embedded Apache Flink runner. To do so:

  1. set SCYLLA_RDF_STORAGE_HOSTS in docker-compose.yml to a comma-separated list of hostnames/IPs of your ScyllaDB nodes (see the example after this list),
  2. put your RDF files in the upload folder next to docker-compose.yml,
  3. run the pipeline:
    docker-compose up && docker-compose rm -f
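
For example, a local run could look like the sketch below. The IP addresses and file paths are placeholders, and the exact layout of docker-compose.yml may differ in your checkout:

    # 1. In docker-compose.yml, point the pipeline at your ScyllaDB nodes, e.g.
    #    SCYLLA_RDF_STORAGE_HOSTS: "10.0.0.1,10.0.0.2,10.0.0.3"
    # 2. Put the RDF files you want to load into the upload folder:
    mkdir -p upload
    cp /path/to/your-data/*.ttl upload/
    # 3. Run the pipeline on the embedded Flink and remove the stopped containers:
    docker-compose up && docker-compose rm -f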
    

For large files, you can deploy the pipeline on Google Dataflow or on a standalone Apache Flink cluster. Other Beam runners could be supported as well.

If you want to use Apache Flink, then:

  1. build the project with mvn package -DskipTests,
  2. upload the resulting jar to the cluster; for details, see the Deployment & Operations / Clusters & Deployment section of the Flink documentation,
  3. don't forget to use the same parameters as in run_bulkload_flink.sh when submitting the job (see the sketch after this list).
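
A minimal sketch of such a submission is shown below. The jar name and the pipeline options other than --runner are assumptions; take the actual names and values from run_bulkload_flink.sh after building:

    # Build the bundled jar and submit it to the Flink cluster.
    # The jar name, --source and --scyllaHosts below are hypothetical.
    mvn package -DskipTests
    flink run target/scylla-rdf-bundled-flink.jar \
      --runner=FlinkRunner \
      --source=/path/to/rdf/files/* \
      --scyllaHosts=10.0.0.1,10.0.0.2,10.0.0.3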

If you want to use Google Dataflow, then:

  1. also build the project with mvn package -DskipTests,
  2. change the parameters in run_bulkload_dataflow.sh so that it runs in your GCP project (see the sketch after this list),
  3. run the script:
    ./run_bulkload_dataflow.sh --source=gs://<your bucket>/folder1/*,gs://<your bucket>/folder2/*
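
For reference, the script is expected to pass something along these lines. This is only a sketch: the jar name and --scyllaHosts are assumptions, while --runner, --project, --region and --tempLocation are standard Beam/Dataflow options; check run_bulkload_dataflow.sh for the real contents:

    #!/bin/bash
    # Hypothetical shape of run_bulkload_dataflow.sh; "$@" forwards the
    # --source argument given on the command line.
    java -jar target/scylla-rdf-bundled-dataflow.jar \
      --runner=DataflowRunner \
      --project=my-gcp-project \
      --region=us-central1 \
      --tempLocation=gs://my-bucket/tmp \
      --scyllaHosts=10.0.0.1,10.0.0.2,10.0.0.3 \
      "$@"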