# Apache Beam pipelines for Scylla-RDF
Here you can find pipelines that manipulate RDF data in Scylla-RDF.
## Bulk loading
For testing purposes or small files, you can run the pipeline with the embedded Apache Flink. To do so:

- set `SCYLLA_RDF_STORAGE_HOSTS` in `docker-compose.yml` to the hostnames/IPs of your ScyllaDB nodes, separated by commas,
- put your RDF files in the `upload` folder next to `docker-compose.yml`,
- run the pipeline:

```
docker-compose up && docker-compose rm -f
```
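Put together, a local run might look like the sketch below. The ScyllaDB addresses, data paths, and the `.ttl` extension are placeholder assumptions; use whatever hosts and RDF files you actually have.

```
# Sketch of an end-to-end local run with the embedded Flink runner.
# 1. In docker-compose.yml, point the pipeline at your ScyllaDB nodes
#    (hypothetical addresses):
#      SCYLLA_RDF_STORAGE_HOSTS: 10.0.0.1,10.0.0.2,10.0.0.3
# 2. Stage the RDF files to load (*.ttl is just an example extension):
mkdir -p upload
cp /path/to/rdf-data/*.ttl upload/
# 3. Run the pipeline, then remove the stopped containers:
docker-compose up && docker-compose rm -f
```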
For large files, you can deploy the pipeline on Google Dataflow or Apache Flink. Other runners could be supported as well.
If you want to use Apache Flink, then:

- build the project with `mvn package -DskipTests`,
- upload the resulting jar to the cluster; for details, see the Deployment & Operations / Clusters & Deployment section of Flink's documentation,
- don't forget to use the same parameters as in `run_bulkload_flink.sh` (a submission sketch follows below).
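As a rough sketch, submitting the job to an existing Flink cluster could look like the following. The main class, jar name, and pipeline options are assumptions; take the real values (and the full parameter list) from `run_bulkload_flink.sh`.

```
# Hypothetical Flink submission; the main class, jar path, and options below
# are placeholders, run_bulkload_flink.sh has the actual values.
flink run \
  -c com.example.scyllardf.beam.BulkLoadPipeline \
  target/scylla-rdf-beam-bundled.jar \
  --runner=FlinkRunner \
  --source=hdfs:///rdf/folder1/*
```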
If you want to use Google Dataflow, then:

- also build the project with `mvn package -DskipTests`,
- change the parameters in `run_bulkload_dataflow.sh` so that it runs in your GCP project,
- run the script:

```
./run_bulkload_dataflow.sh --source=gs://<your bucket>/folder1/*,gs://<your bucket>/folder2/*
```
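Once configured, the invocation tends to look like the sketch below. The project, region, bucket, and the assumption that the script forwards standard Beam/Dataflow options (`--runner`, `--project`, `--region`, `--tempLocation`) are placeholders; match them against what `run_bulkload_dataflow.sh` actually sets.

```
# Hypothetical Dataflow run; all values are placeholders and the script may
# hard-code some of these options instead, so check run_bulkload_dataflow.sh.
./run_bulkload_dataflow.sh \
  --runner=DataflowRunner \
  --project=my-gcp-project \
  --region=us-central1 \
  --tempLocation=gs://my-bucket/dataflow-temp \
  --source=gs://my-bucket/folder1/*,gs://my-bucket/folder2/*
```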