# meltano-batch
A simple setup of Meltano Extract and Load (EL) on AWS Batch, with the infrastructure managed by Terraform.

A service to set up a repeatable Meltano EL process, with smoke tests included. It runs the Meltano ELT process only and does not provide a Meltano frontend (which, as of writing, is not essential).
If you are looking for an even simpler approach, I strongly recommend taking a look at Meltano-on-Github-Actions, as it is much simpler and requires less DevOps hassle. The only reasons not to use GitHub Actions are if you require much longer-running loads, control over the infrastructure specifications, or data movement fully contained within an AWS environment.
## Prerequisites
- Select an AWS Region. Be sure that all required services (e.g. AWS Batch, AWS Lambda) are available in the Region selected.
- Install Docker.
- Install HashiCorp Terraform.
- Install the latest version of the AWS CLI and confirm it is properly configured.
## Setup
- Set up Terraform:

  ```shell
  git clone git@github.com:mattarderne/meltano-batch.git
  cd meltano-batch/terraform
  terraform init
  ```
- Run Terraform, which will create all necessary infrastructure:

  ```shell
  terraform plan
  terraform apply
  ```
## Build and Push Docker Image
Once finished, Terraform will output the name of your newly created ECR repository, e.g. `123456789.dkr.ecr.eu-west-1.amazonaws.com/meltano-batch-ecr-repo:latest`.

Note this value, as we will use it in subsequent steps (referred to as `MY_REPO_NAME`):
```shell
cd ..
cd meltano

# build the docker image
docker build -t aws-batch-meltano .

# (optional) test the docker image
docker run \
    --volume $(pwd)/output:/project/output \
    aws-batch-meltano \
    elt tap-smoke-test target-jsonl

# tag the image
docker tag aws-batch-meltano:latest <MY_REPO_NAME>:latest

# log in to ECR, replacing <region>
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <MY_REPO_NAME>

# push the image to the ECR repository
docker push <MY_REPO_NAME>:latest
```
The above commands are automated in the `meltano/deploy_aws_ecr.sh` script.
## Create a Job
Now that the Docker image has been deployed to ECR, you can invoke a job with the command below, which will print the logs. Replace `<region>`:

```shell
aws lambda invoke --function-name submit-job-smoke-test --region <region> \
    outfile --log-type Tail \
    --query 'LogResult' --output text | base64 -d
```
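Under the hood, the `submit-job-smoke-test` function is essentially a thin wrapper around the AWS Batch `SubmitJob` API. A minimal sketch of such a handler is below; the job name, queue, and job-definition names are illustrative assumptions, not values taken from this repo's Terraform:

```python
# Hypothetical sketch of a submit-job Lambda handler. The job queue and
# job definition names are placeholders; the real values come from the
# Terraform configuration.
import os


def build_submit_args(command):
    """Build the kwargs passed to batch.submit_job."""
    return {
        "jobName": "meltano-smoke-test",
        "jobQueue": os.environ.get("JOB_QUEUE", "meltano-batch-queue"),
        "jobDefinition": os.environ.get("JOB_DEFINITION", "meltano-batch-job"),
        # Override the container command to run the smoke-test pipeline
        "containerOverrides": {"command": command},
    }


def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime

    client = boto3.client("batch")
    response = client.submit_job(
        **build_submit_args(["elt", "tap-smoke-test", "target-jsonl"])
    )
    return {"jobId": response["jobId"]}
```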
You should be able to view a list of the jobs with the command below. Note that by default `aws batch list-jobs` only returns jobs in the `RUNNING` state, which is why it may return an empty list; pass `--job-status SUCCEEDED` (or another state) to see other jobs:

```shell
aws batch list-jobs --job-queue meltano-batch-queue
```
## Meltano UI
Load the Meltano UI to have a look. It is currently for display purposes only, but can be configured to serve the Meltano app and kick off ad hoc jobs. Using App Runner (example in `terraform/archive/apprunner.tf`) is viable for deploying to production, but requires a backend DB to be configured in the Dockerfile.
```shell
docker run \
    --volume $(pwd)/output:/project/output \
    aws-batch-meltano \
    ui
```
## Resourcing
Depending on the size of the data transferred, you may need to increase the resources in the AWS Batch `aws_batch_job_definition` resource, e.g. from 2 to 8 vCPUs and from 2 GB to 8 GB of RAM, by editing the following fields:

```
"vcpus": 2 -> 8,
"memory": 2000 -> 8000,
```
## Notifications
By default no notifications are set; ideally these should be handled by an AWS SNS topic. Slack notifications can be turned on as follows:
- Change `handler = "lambda.lambda_handler"` to `handler = "alerts.lambda_handler"` in `elt_tap_smoke_test-target_jsonl.tf`
- Change `source_file = "lambda/lambda.py"` to `source_file = "alerts/lambda.py"` in `main.py`
- Create a Slack webhook, then create a `secret.tfvars` file in the `lambda` directory, adding the webhook URL: `slack_webhook = "<slack_webhook>"`
- Change `var.slack_webhook_toggle` in the `variables.tf` file to `true` (lowercase)
- Install `requests` in the `terraform/lambda` directory:

  ```shell
  cd terraform  # must be run from the terraform directory
  pip install --target ./lambda requests
  ```

- Run `terraform apply -var-file="secret.tfvars"`
- Test with the `aws lambda ...` command above. It should ping Slack.
However, it currently only pings when the job starts (or fails to start), not with the outcome of the job. A proper setup would use AWS Batch SNS notifications.
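The alerting handler boils down to posting a JSON payload to the Slack incoming webhook. A minimal sketch is below; it uses the stdlib `urllib` rather than the `requests` library the repo installs, and the function names are illustrative assumptions, not taken from `alerts/lambda.py`:

```python
# Hypothetical sketch of a Slack-alerting handler. Uses stdlib urllib
# instead of the requests library the repo installs; function names are
# illustrative only.
import json
import urllib.request


def build_slack_payload(job_name, status):
    """Format the Slack message body for a Batch job event."""
    return {"text": f"Meltano Batch job '{job_name}': {status}"}


def notify_slack(webhook_url, payload):
    """POST the payload to the Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # Slack returns 200 on success
```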
## Todo

### AWS

- Set up a serverless DB to capture state files for incremental loads
- Adopt AWS Batch best practices
- Look into GitHub Actions as an alternative (done: Meltano-on-Github-Actions)
- Work out SNS in Terraform
- Test AWS App Runner for the frontend
- Look into VPC settings

### Misc

- Work out a GCP equivalent
- Look into Pulumi using https://www.pulumi.com/tf2pulumi/ (WIP on this branch)