aws-pdf-textract-pipeline

Mentioned in Awesome CDK.

:mag: Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript.

This example data pipeline illustrates one possible approach to large-scale serverless PDF processing. It should serve as a good foundation to modify for your own purposes.

<!-- https://cloudcraft.co/view/e135397e-a673-411e-9ee7-05a5618052b2?key=R-OLiwplnkA9dtQxtkVqOw&interactive=true&embed=true -->

Getting Started

Run the following commands to install dependencies, build the CDK stack, and deploy it to AWS.

yarn install
yarn build
cdk bootstrap
cdk deploy

Overview

The following is an overview of each process performed by this CDK stack.

  1. Scrape PDF download URLs from a website

    This example scrapes PDF download URLs from the COGCC website.

  2. Store PDF download URL in DynamoDB

  3. Download the PDF to S3

    A Lambda function runs when a new PDF download URL is created in DynamoDB.

  4. Process the PDF with AWS Textract

    Another Lambda function runs when a PDF has been downloaded to the S3 bucket.

  5. Process the AWS Textract results

    When AWS Textract publishes an SNS event, a Lambda function runs to process the result.

  6. Save the processed Textract result to DynamoDB

    After the full result is pruned down to the desired data structure, we save the data in DynamoDB.
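
As a rough illustration of step 1, the sketch below pulls PDF links out of an HTML page. The function name `extractPdfUrls` and the regex-based parsing are assumptions made for this example, not the scraper this repository actually uses; a real crawler may need an HTML parser or a headless browser.

```typescript
// Hypothetical helper: collect absolute PDF URLs from the <a> tags of a page.
function extractPdfUrls(html: string, baseUrl: string): string[] {
  // Capture the href value of each anchor (double- or single-quoted).
  const hrefPattern = /<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;
  const seen: Record<string, boolean> = {};
  const urls: string[] = [];
  let match: RegExpExecArray | null;

  while ((match = hrefPattern.exec(html)) !== null) {
    const href = match[1] || match[2] || "";
    let absolute: string;
    try {
      // Resolve relative links against the page URL.
      absolute = new URL(href, baseUrl).href;
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    // Keep links whose path ends in .pdf, ignoring query strings and fragments.
    const path = absolute.split(/[?#]/)[0];
    if (/\.pdf$/i.test(path) && !seen[absolute]) {
      seen[absolute] = true;
      urls.push(absolute);
    }
  }
  return urls;
}
```

Duplicate links are dropped here so that each URL would be written to DynamoDB only once per crawl.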
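
Step 3 reacts to new DynamoDB items. A minimal sketch of the event handling might look like the following, assuming the table's stream includes new images. The attribute name `pdfUrl`, the `pdfs/` key prefix, and both function names are hypothetical, since the table schema is not shown here.

```typescript
// Subset of the DynamoDB Streams event shape that this sketch reads.
interface StreamRecord {
  eventName?: "INSERT" | "MODIFY" | "REMOVE";
  // NewImage is only present when the stream view type includes new images.
  dynamodb?: { NewImage?: Record<string, { S?: string }> };
}
interface StreamEvent {
  Records: StreamRecord[];
}

// Hypothetical attribute holding the download URL.
const URL_ATTRIBUTE = "pdfUrl";

// Return the download URLs of newly inserted items only.
function newPdfUrls(event: StreamEvent): string[] {
  const urls: string[] = [];
  for (const record of event.Records) {
    if (record.eventName !== "INSERT") continue; // ignore updates and deletes
    const image = record.dynamodb && record.dynamodb.NewImage;
    const attr = image ? image[URL_ATTRIBUTE] : undefined;
    if (attr && attr.S) urls.push(attr.S);
  }
  return urls;
}

// Derive an S3 object key from the URL's file name (assumed naming scheme).
function s3KeyForUrl(url: string): string {
  const name = url.split(/[?#]/)[0].split("/").pop() || "document.pdf";
  return "pdfs/" + decodeURIComponent(name);
}
```

The actual download and upload to S3 are left out; a handler would fetch each URL and store the body under the derived key.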
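
Steps 4 and 5 can be sketched as two small mapping functions: one turns an S3 event into request parameters for Textract's asynchronous `StartDocumentAnalysis` API, and one reads the completion notification that Textract publishes to SNS. This assumes the asynchronous analysis API with the `FORMS` feature; the function names and the choice of feature type are illustrative, and the calls to the AWS SDK itself are omitted.

```typescript
// Subset of the S3 event notification shape.
interface S3Event {
  Records: { s3: { bucket: { name: string }; object: { key: string } } }[];
}

// Parameters accepted by Textract's StartDocumentAnalysis (subset).
interface StartAnalysisParams {
  DocumentLocation: { S3Object: { Bucket: string; Name: string } };
  FeatureTypes: ("FORMS" | "TABLES")[];
  NotificationChannel: { SNSTopicArn: string; RoleArn: string };
}

// Build one analysis request per uploaded object.
function textractRequests(
  event: S3Event,
  snsTopicArn: string,
  roleArn: string
): StartAnalysisParams[] {
  return event.Records.map((record): StartAnalysisParams => ({
    DocumentLocation: {
      S3Object: {
        Bucket: record.s3.bucket.name,
        // Keys in S3 events are URL-encoded, with spaces sent as "+".
        Name: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
      },
    },
    FeatureTypes: ["FORMS"], // assumption: key-value extraction is what is wanted
    NotificationChannel: { SNSTopicArn: snsTopicArn, RoleArn: roleArn },
  }));
}

// Subset of the SNS event shape delivered to Lambda.
interface SnsEvent {
  Records: { Sns: { Message: string } }[];
}

// Return the job IDs of Textract jobs that finished successfully.
function completedJobIds(event: SnsEvent): string[] {
  const jobIds: string[] = [];
  for (const record of event.Records) {
    const note = JSON.parse(record.Sns.Message) as { JobId: string; Status: string };
    if (note.Status === "SUCCEEDED") jobIds.push(note.JobId);
  }
  return jobIds;
}
```

Each returned job ID would then be passed to `GetDocumentAnalysis` to page through the result blocks.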
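
For step 6, one way to prune a Textract result is to collapse its `KEY_VALUE_SET` blocks into a flat key-value object before saving. The sketch below assumes form-style output and a flat target structure; `pruneToKeyValues` and its helpers are hypothetical names, and the structure this pipeline actually stores may differ.

```typescript
// Subset of a Textract Block that this sketch reads.
interface Block {
  Id: string;
  BlockType: string;
  EntityTypes?: string[];
  Text?: string;
  Relationships?: { Type: string; Ids: string[] }[];
}

// IDs that a block points to through relationships of the given type.
function relatedIds(block: Block, type: string): string[] {
  const ids: string[] = [];
  for (const rel of block.Relationships || []) {
    if (rel.Type === type) {
      for (const id of rel.Ids) ids.push(id);
    }
  }
  return ids;
}

// Join the text of a block's WORD children with spaces.
function blockText(block: Block, byId: Record<string, Block>): string {
  return relatedIds(block, "CHILD")
    .map((id) => byId[id])
    .filter((child) => child !== undefined && child.BlockType === "WORD")
    .map((word) => word.Text || "")
    .join(" ");
}

// Reduce the full block list to { keyText: valueText } pairs.
// Selection elements (checkboxes) are ignored in this sketch.
function pruneToKeyValues(blocks: Block[]): Record<string, string> {
  const byId: Record<string, Block> = {};
  for (const block of blocks) byId[block.Id] = block;

  const result: Record<string, string> = {};
  for (const block of blocks) {
    const isKey =
      block.BlockType === "KEY_VALUE_SET" &&
      (block.EntityTypes || []).indexOf("KEY") !== -1;
    if (!isKey) continue;
    const value = relatedIds(block, "VALUE")
      .map((id) => byId[id])
      .filter((b) => b !== undefined)
      .map((b) => blockText(b, byId))
      .join(" ");
    result[blockText(block, byId)] = value;
  }
  return result;
}
```

The resulting object is small enough to store as a single DynamoDB item, whereas the raw block list for a multi-page PDF often is not.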

Scripts

Notes

Built with

Additional Resources

License

Open source under the MIT License.

Built with :heart: by aeksco

<!-- Reddit Threads --> <!-- https://www.reddit.com/r/aws/comments/fbwtr2/example_serverless_data_pipeline_for_crawling/ --> <!-- https://www.reddit.com/r/serverless/comments/fbwsak/serverless_data_pipeline_for_crawling_pdfs_from/ --> <!-- https://www.reddit.com/r/typescript/comments/fcy30x/example_serverless_data_pipeline_for_crawling/ --> <!-- https://www.reddit.com/r/webdev/comments/fd65r2/example_serverless_data_pipeline_for_crawling/ -->