Awesome
Google Cloud Dataflow Template Pipelines
These Dataflow templates are an effort to solve simple, but large, in-Cloud data tasks, including data import/export/backup/restore and bulk API operations, without a development environment. The technology under the hood which makes these operations possible is the Google Cloud Dataflow service combined with a set of Apache Beam SDK templated pipelines.
Google is providing this collection of pre-implemented Dataflow templates as a reference and to provide easy customization for developers wanting to extend their functionality.
Note on Default Branch
As of November 18, 2021, our default branch is now named main
. This does not
affect forks. If you would like your fork and its local clone to reflect these
changes you can
follow GitHub's branch renaming guide.
Template Pipelines
- Get Started
- Process Data Continuously (stream)
- Azure Eventhub to Pubsub
- Bigtable Change Streams to HBase Replicator
- Cloud Bigtable change streams to BigQuery
- Cloud Bigtable change streams to Cloud Storage
- Cloud Spanner change streams to BigQuery
- Cloud Spanner change streams to Cloud Storage
- Cloud Spanner change streams to Pub/Sub
- Cloud Storage Text to BigQuery (Stream)
- Data Masking/Tokenization from Cloud Storage to BigQuery (using Cloud DLP)
- Datastream to BigQuery
- Datastream to Cloud Spanner
- Datastream to SQL
- JMS to Pubsub
- Kafka to BigQuery
- Kafka to Cloud Storage
- Kinesis To Pubsub
- MongoDB (CDC) to BigQuery
- Mqtt to Pubsub
- Ordered change stream buffer to Source DB
- Pub/Sub Avro to BigQuery
- Pub/Sub CDC to Bigquery
- Pub/Sub Proto to BigQuery
- Pub/Sub Subscription or Topic to Text Files on Cloud Storage
- Pub/Sub Subscription to BigQuery
- Pub/Sub Topic to BigQuery
- Pub/Sub to Avro Files on Cloud Storage
- Pub/Sub to Datadog
- Pub/Sub to Elasticsearch
- Pub/Sub to JDBC
- Pub/Sub to Kafka
- Pub/Sub to MongoDB
- Pub/Sub to Pub/Sub
- Pub/Sub to Redis
- Pub/Sub to Splunk
- Pub/Sub to Text Files on Cloud Storage
- Pubsub to JMS
- Spanner Change Streams to Sink
- Synchronizing CDC data to BigQuery
- Text Files on Cloud Storage to Pub/Sub
- Process Data in Bulk (batch)
- Any SourceDB to Cloud Spanner
- AstraDB to BigQuery
- Avro Files on Cloud Storage to Cloud Bigtable
- Avro Files on Cloud Storage to Cloud Spanner
- BigQuery export to Parquet (via Storage API)
- BigQuery to Bigtable
- BigQuery to Datastore
- BigQuery to Elasticsearch
- BigQuery to MongoDB
- BigQuery to TensorFlow Records
- Cassandra to Cloud Bigtable
- Cloud Bigtable to Avro Files in Cloud Storage
- Cloud Bigtable to Parquet Files on Cloud Storage
- Cloud Bigtable to SequenceFile Files on Cloud Storage
- Cloud Spanner to Avro Files on Cloud Storage
- Cloud Spanner to Text Files on Cloud Storage
- Cloud Storage To Splunk
- Cloud Storage to Elasticsearch
- Dataplex JDBC Ingestion
- Dataplex: Convert Cloud Storage File Format
- Dataplex: Tier Data from BigQuery to Cloud Storage
- Firestore (Datastore mode) to BigQuery
- Firestore (Datastore mode) to Text Files on Cloud Storage
- Google Ads to BigQuery
- Google Cloud to Neo4j
- JDBC to BigQuery
- JDBC to BigQuery with BigQuery Storage API support
- JDBC to Pub/Sub
- MongoDB to BigQuery
- MySQL to BigQuery
- Parquet Files on Cloud Storage to Cloud Bigtable
- PostgreSQL to BigQuery
- SQLServer to BigQuery
- SequenceFile Files on Cloud Storage to Cloud Bigtable
- Text Files on Cloud Storage to BigQuery
- Text Files on Cloud Storage to BigQuery with BigQuery Storage API support
- Text Files on Cloud Storage to Cloud Spanner
- Text Files on Cloud Storage to Firestore (Datastore mode)
- Utilities
- Legacy Templates
For documentation on each template's usage and parameters, please see the official docs.
Using UDFs
User-defined functions (UDFs) allow you to customize a template's functionality by providing a short JavaScript function without having to maintain the entire codebase. This is useful in situations which you'd like to rename fields, filter values, or even transform data formats before output to the destination. All UDFs are executed by providing the payload of the element as a string to the JavaScript function. You can then use JavaScript's in-built JSON parser or other system functions to transform the data prior to the pipeline's output. The return statement of a UDF specifies the payload to pass forward in the pipeline. This should always return a string value. If no value is returned or the function returns undefined, the incoming record will be filtered from the output.
UDF Function Specification
Template | UDF Input Type | Input Description | UDF Output Type | Output Description |
---|---|---|---|---|
Datastore Bulk Delete | String | A JSON string of the entity | String | A JSON string of the entity to delete; filter entities by returning undefined |
Datastore to Pub/Sub | String | A JSON string of the entity | String | The payload to publish to Pub/Sub |
Datastore to GCS Text | String | A JSON string of the entity | String | A single-line within the output file |
GCS Text to BigQuery | String | A single-line within the input file | String | A JSON string which matches the destination table's schema |
Pub/Sub to BigQuery | String | A string representation of the incoming payload | String | A JSON string which matches the destination table's schema |
Pub/Sub to Datastore | String | A string representation of the incoming payload | String | A JSON string of the entity to write to Datastore |
Pub/Sub to Splunk | String | A string representation of the incoming payload | String | The event data to be sent to Splunk HEC events endpoint. Must be a string or a stringified JSON object |
UDF Examples
For a comprehensive list of samples, please check our udf-samples folder.
Adding fields
/**
* A transform which adds a field to the incoming data.
* @param {string} inJson
* @return {string} outJson
*/
function transform(inJson) {
var obj = JSON.parse(inJson);
obj.dataFeed = "Real-time Transactions";
obj.dataSource = "POS";
return JSON.stringify(obj);
}
Filtering records
/**
* A transform function which only accepts 42 as the answer to life.
* @param {string} inJson
* @return {string} outJson
*/
function transform(inJson) {
var obj = JSON.parse(inJson);
// only output objects which have an answer to life of 42.
if (obj.hasOwnProperty('answerToLife') && obj.answerToLife === 42) {
return JSON.stringify(obj);
}
}
Contributing
To contribute to the repository, see CONTRIBUTING.md.
Release Process
Templates are released in a weekly basis (best-effort) as part of the efforts to keep Google-provided Templates updated with latest fixes and improvements.
To learn more about this process, or how you can stage your own changes, see Release Process.
More Information
- Dataflow - general Dataflow documentation.
- Dataflow Templates - basic template concepts.
- Google-provided Templates - official documentation for templates provided by Google (the source code is in this repository).
- Dataflow Cookbook: Blog, GitHub Repository - pipeline examples and practical solutions to common data processing challenges.
- Dataflow Metrics Collector - CLI tool to collect dataflow resource & execution metrics and export to either BigQuery or Google Cloud Storage. Useful for comparison and visualization of the metrics while benchmarking the dataflow pipelines using various data formats, resource configurations etc
- Apache Beam
- Overview
- Quickstart: Java, Python, Go
- Tour of Beam - an interactive tour with learning topics covering core Beam concepts from simple ones to more advanced ones.
- Beam Playground - an interactive environment to try out Beam transforms and examples without having to install Apache Beam.
- Beam College - hands-on training and practical tips, including video recordings of Apache Beam and Dataflow Templates lessons.
- Getting Started with Apache Beam - Quest - A 5 lab series that provides a Google Cloud certified badge upon completion.