serverless-parquet-repartitioner

An AWS Lambda function for repartitioning parquet files in S3 via DuckDB queries.

Requirements

You'll need a current v3 installation of the Serverless Framework on the machine you're planning to deploy the application from.
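
If it's not installed yet, a global install via npm should work (assuming Node.js and npm are already present on your machine):

$ npm i -g serverless@3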

Also, you'll have to set up your AWS credentials according to the Serverless docs.
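
One common way described in the Serverless docs is the config credentials command; the key and secret below are placeholders for your own IAM credentials:

$ serverless config credentials --provider aws --key <YOUR_ACCESS_KEY_ID> --secret <YOUR_SECRET_ACCESS_KEY>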

Install dependencies

After cloning the repo, you'll need to install the dependencies via

$ npm i

Configuration

You can customize the stack via a number of configuration values. Open the serverless.yml file and search for TODO in your IDE; this will point you to the places you need to update according to your needs. A hypothetical example of such a setting is sketched below.
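
As an illustration, a TODO-marked setting could look like the following sketch. The function name repartitionData is taken from this README, but the handler path and the QUERY environment variable name are hypothetical and may differ from the actual serverless.yml:

functions:
  repartitionData:
    # Hypothetical handler path for illustration only
    handler: src/index.handler
    environment:
      # TODO: Replace with the DuckDB query matching your buckets and partition columns
      QUERY: >-
        COPY (SELECT * FROM parquet_scan('s3://my-source-bucket/input/*.parquet', HIVE_PARTITIONING = 1))
        TO 's3://my-target-bucket/output'
        (FORMAT PARQUET, PARTITION_BY (column1, column2, column3), ALLOW_OVERWRITE TRUE);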

Mandatory configuration settings

Optional configuration settings

Using different source/target S3 buckets

If you're planning to use different S3 buckets as source and target for the data repartitioning, you need to adapt the iamRoleStatements settings of the function.

Here's an example with minimal privileges:

iamRoleStatements:
  # Source S3 bucket permissions
  - Effect: Allow
    Action:
      - s3:ListBucket
    Resource: 'arn:aws:s3:::my-source-bucket'
  - Effect: Allow
    Action:
      - s3:GetObject
    Resource: 'arn:aws:s3:::my-source-bucket/*'
  # Target S3 bucket permissions
  - Effect: Allow
    Action:
      - s3:ListBucket
      - s3:ListBucketMultipartUploads
    Resource: 'arn:aws:s3:::my-target-bucket'
  - Effect: Allow
    Action:
      - s3:PutObject
      - s3:AbortMultipartUpload
      - s3:ListMultipartUploadParts
    Resource: 'arn:aws:s3:::my-target-bucket/*'

A query for this use case would look like this:

COPY (SELECT * FROM parquet_scan('s3://my-source-bucket/input/*.parquet', HIVE_PARTITIONING = 1)) TO 's3://my-target-bucket/output' (FORMAT PARQUET, PARTITION_BY (column1, column2, column3), ALLOW_OVERWRITE TRUE);

Deployment

After you've cloned this repository to your local machine and cd'ed into its directory, the application can be deployed like this (don't forget to run npm i first to install the dependencies!):

$ sls deploy

This will deploy the stack to the default AWS region us-east-1. If you want to deploy the stack to a different region, you can specify a --region argument:

$ sls deploy --region eu-central-1

The deployment should take 2-3 minutes.

Checks and manual triggering

You can manually invoke the deployed Lambda function by running

$ sls invoke -f repartitionData

After that, you can check the generated CloudWatch logs by issuing

$ sls logs -f repartitionData

If you don't see any DUCKDB_NODEJS_ERROR in the logs, everything ran successfully, and you can have a look at your S3 bucket for the newly generated parquet files.
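
For example, assuming the AWS CLI is installed and you're using the target bucket and output prefix from the example query above, you could list the generated files like this:

$ aws s3 ls s3://my-target-bucket/output/ --recursive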

Costs

Using this repository will generate costs in your AWS account. Please refer to the AWS pricing docs before deploying and running it.
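
Once you no longer need the stack, you can remove the deployed resources with the Serverless Framework (this removes the CloudFormation stack, but not the data already written to your own S3 buckets):

$ sls remove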