Home

Awesome

Datasplash

Clojars Project

cljdoc badge

Clojure API for a more dynamic Google Cloud Dataflow and (not really battle tested) any other Apache Beam backend.

Usage

API docs

You can also see ports of the official Dataflow examples in the datasplash.examples namespace.

Here is the classic word count example.

:information_source: You will need to run (compile 'datasplash.examples) every time you make a change.

(ns datasplash.examples
  (:require [clojure.string :as str]
            [datasplash.api :as ds]
            [datasplash.options :refer [defoptions]])
  (:gen-class))

(defn tokenize
  [^String l]
  (remove empty? (.split (str/trim l) "[^a-zA-Z']+")))

(defn count-words
  [p]
  (ds/->> :count-words p
          (ds/mapcat tokenize {:name :tokenize})
          (ds/frequencies)))

(defn format-count
  [[k v]]
  (format "%s: %d" k v))

(defoptions WordCountOptions
  {:input {:default "gs://dataflow-samples/shakespeare/kinglear.txt"
           :type String}
   :output {:default "kinglear-freqs.txt" :type String}
   :numShards {:default 0 :type Long}})

(defn -main
  [& str-args]
  (let [p (ds/make-pipeline WordCountOptions str-args)
        {:keys [input output numShards]} (ds/get-pipeline-options p)]
    (->> p
         (ds/read-text-file input {:name "King-Lear"})
         (count-words)
         (ds/map format-count {:name :format-count})
         (ds/write-text-file output {:num-shards numShards})
         (ds/run-pipeline))))

Run it from the repl

Locally on your machine using a DirectRunner:

(in-ns 'datasplash.examples)
(clojure.core/compile 'datasplash.examples)
(-main "--input=sometext.txt" "--output=out-freq.txt" "--numShards=1")

Remotely on Google Cloud using a DataflowRunner:

You should have properly configured your Google Cloud account and Dataflow access from your machine.

(in-ns 'datasplash.examples)
(clojure.core/compile 'datasplash.examples)
(-main "--project=my-project"
       "--runner=DataflowRunner"
       "--gcpTempLocation=gs://bucket/tmp"
       "--input=gs://apache-beam-samples/shakespeare/kinglear.txt"
       "--output=gs://bucket/outputs/kinglear-freq.txt"
       "--numShards=1")

Run it as a standalone program

Datasplash needs to be AOT compiled, so you should prepare an uberjar and run from your main entry like so:

java -jar my-dataflow-job-uber.jar [beam-args]

Caveats

About compression libraries

The Beam Java SDK does not pull in the Zstd library by default, so it is the user's responsibility to declare an explicit dependency on zstd-jni. Attempts to read or write .zst files without this library loaded will result in NoClassDefFoundError at runtime.

The Beam Java SDK does not pull in the required libraries for LZOP compression by default, so it is the user's responsibility to declare an explicit dependency on io.airlift:aircompressor and com.facebook.presto.hadoop:hadoop-apache2. Attempts to read or write .lzo files without those libraries loaded will result in a NoClassDefFoundError at runtime.

See Apache Beam Compression enum for details.

License

Copyright © 2015-2024 Oscaro.com

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.