serverless-parquet-repartitioner

An AWS Lambda function for repartitioning parquet files in S3 via DuckDB queries.

Requirements

You'll need a current v3 installation of the Serverless Framework on the machine you're planning to deploy the application from.
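
If it's not installed yet, a global install via npm should work (assuming Node.js and npm are already present on your machine):

$ npm i -g serverless@3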

Also, you'll have to set up your AWS credentials according to the Serverless docs.
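
One common way described in the Serverless docs is the config credentials command; the key and secret below are placeholders for your own IAM credentials:

$ serverless config credentials --provider aws --key <YOUR_ACCESS_KEY_ID> --secret <YOUR_SECRET_ACCESS_KEY>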

Install dependencies

After cloning the repo, you'll need to install the dependencies via

$ npm i

Configuration

You can customize the stack via a number of configuration values. Open the serverless.yml file and search for TODO in your IDE; this will point you to the places you need to update according to your needs. A hypothetical example of such a setting is sketched below.
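
As an illustration, a TODO-marked setting could look like the following sketch. The function name repartitionData is taken from this README, but the handler path and the QUERY environment variable name are hypothetical and may differ from the actual serverless.yml:

functions:
  repartitionData:
    # Hypothetical handler path for illustration only
    handler: src/index.handler
    environment:
      # TODO: Replace with the DuckDB query matching your buckets and partition columns
      QUERY: >-
        COPY (SELECT * FROM parquet_scan('s3://my-source-bucket/input/*.parquet', HIVE_PARTITIONING = 1))
        TO 's3://my-target-bucket/output'
        (FORMAT PARQUET, PARTITION_BY (column1, column2, column3), ALLOW_OVERWRITE TRUE);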

Mandatory configuration settings

Optional configuration settings

Using different source/target S3 buckets

If you're planning to use different S3 buckets as source and target for the data repartitioning, you need to adapt the iamRoleStatements settings of the function.

Here's an example with minimal privileges:

iamRoleStatements:
  # Source S3 bucket permissions
  - Effect: Allow
    Action:
      - s3:ListBucket
    Resource: 'arn:aws:s3:::my-source-bucket'
  - Effect: Allow
    Action:
      - s3:GetObject
    Resource: 'arn:aws:s3:::my-source-bucket/*'
  # Target S3 bucket permissions
  - Effect: Allow
    Action:
      - s3:ListBucket
      - s3:ListBucketMultipartUploads
    Resource: 'arn:aws:s3:::my-target-bucket'
  - Effect: Allow
    Action:
      - s3:PutObject
      - s3:AbortMultipartUpload
      - s3:ListMultipartUploadParts
    Resource: 'arn:aws:s3:::my-target-bucket/*'

A query for this use case would look like this:

COPY (SELECT * FROM parquet_scan('s3://my-source-bucket/input/*.parquet', HIVE_PARTITIONING = 1)) TO 's3://my-target-bucket/output' (FORMAT PARQUET, PARTITION_BY (column1, column2, column3), ALLOW_OVERWRITE TRUE);

Deployment

After you've cloned this repository to your local machine and cd'ed into its directory, the application can be deployed like this (don't forget to run npm i first to install the dependencies!):

$ sls deploy

This will deploy the stack to the default AWS region us-east-1. If you want to deploy the stack to a different region, you can specify a --region argument:

$ sls deploy --region eu-central-1

The deployment should take 2-3 minutes.

Checks and manual triggering

You can manually invoke the deployed Lambda function by running

$ sls invoke -f repartitionData

After that, you can check the generated CloudWatch logs by issuing

$ sls logs -f repartitionData

If you don't see any DUCKDB_NODEJS_ERROR in the logs, everything ran successfully, and you can have a look at your S3 bucket for the newly generated parquet files.
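
For example, assuming the AWS CLI is installed and you're using the target bucket and output prefix from the example query above, you could list the generated files like this:

$ aws s3 ls s3://my-target-bucket/output/ --recursive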

Costs

Using this repository will generate costs in your AWS account. Please refer to the AWS pricing docs before deploying and running it.
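
Once you no longer need the stack, you can remove the deployed resources with the Serverless Framework (this removes the CloudFormation stack, but not the data already written to your own S3 buckets):

$ sls remove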