Serverless Reference Architecture: Real-time File Processing

The Real-time File Processing reference architecture is a general-purpose, event-driven, parallel data processing architecture that uses AWS Lambda. This architecture is ideal for workloads that need more than one data derivative of an object.

In this example application, we deliver notes from an interview in Markdown format to S3. S3 Events are used to trigger multiple processing flows - one to convert and persist Markdown files to HTML and another to detect and persist sentiment.

Architectural Diagram

Reference Architecture - Real-time File Processing

Application Components

Event Trigger

In this architecture, individual files are processed as they arrive. To achieve this, we use Amazon S3 Events and Amazon Simple Notification Service (Amazon SNS). When an object is created in S3, an event is emitted to an SNS topic. We deliver the event to two separate SQS queues, representing two different workflows. Refer to What is Amazon Simple Notification Service? for more information about eligible targets.
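
Once the stack is deployed, you can confirm the fan-out by listing the subscriptions on the topic. A minimal sketch, assuming the topic's logical resource ID is NotificationTopic (check the template for the actual ID used in this stack):

# Resolve the topic ARN from the stack, then list its subscriptions;
# you should see one subscription per workflow SQS queue.
TOPIC_ARN=$(aws cloudformation describe-stack-resource --stack-name lambda-file-refarch --logical-resource-id NotificationTopic --query "StackResourceDetail.PhysicalResourceId" --output text)
aws sns list-subscriptions-by-topic --topic-arn ${TOPIC_ARN}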

Conversion Workflow

Our function will take Markdown files stored in our InputBucket, convert them to HTML, and store them in our OutputBucket. The ConversionQueue SQS queue captures the S3 Event JSON payload, allowing for more control of our ConversionFunction and better error handling. Refer to Using AWS Lambda with Amazon SQS for more details.

If our ConversionFunction cannot successfully process messages from the ConversionQueue, they are sent to ConversionDlq, a dead-letter queue (DLQ), for inspection. A CloudWatch alarm is configured to send a notification to an email address whenever there are messages in the ConversionDlq.
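
Once the stack is deployed, you can inspect the dead-letter queue from the command line. A minimal sketch, assuming the queue's logical resource ID matches the ConversionDlq name used above:

# The physical resource ID of an SQS queue is its queue URL.
DLQ_URL=$(aws cloudformation describe-stack-resource --stack-name lambda-file-refarch --logical-resource-id ConversionDlq --query "StackResourceDetail.PhysicalResourceId" --output text)
# Check how many messages are waiting, then pull one to inspect its body.
aws sqs get-queue-attributes --queue-url ${DLQ_URL} --attribute-names ApproximateNumberOfMessages
aws sqs receive-message --queue-url ${DLQ_URL} --max-number-of-messages 1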

Sentiment Analysis Workflow

Our function will take Markdown files stored in our InputBucket, detect the overall sentiment for each file, and store the result in our SentimentTable.

We are using Amazon Comprehend to detect overall interview sentiment. Amazon Comprehend is a machine learning powered service that makes it easy to find insights and relationships in text. We use the Sentiment Analysis API to understand whether interview responses are positive or negative.
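
You can call the same API from the command line to see the shape of the response the SentimentFunction works with (the sample text below is purely illustrative):

aws comprehend detect-sentiment --language-code en --text "The candidate was well prepared and the interview went smoothly."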

The Sentiment workflow uses the same SQS-to-Lambda function pattern as the Conversion workflow.

If our SentimentFunction cannot successfully process messages from the SentimentQueue, they are sent to SentimentDlq, a dead-letter queue (DLQ), for inspection. A CloudWatch alarm is configured to send a notification to an email address whenever there are messages in the SentimentDlq.
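
To check whether either dead-letter queue alarm has fired, you can list the alarms that are currently in the ALARM state (note that this lists every alarm in the account and region, not just those created by this stack):

aws cloudwatch describe-alarms --state-value ALARM --query "MetricAlarms[*].AlarmName" --output text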

Building and Deploying the Application with the AWS Serverless Application Model (AWS SAM)

This application is deployed using the AWS Serverless Application Model (AWS SAM). AWS SAM is an open-source framework that enables you to build serverless applications on AWS. It provides you with a template specification to define your serverless application, and a command line interface (CLI) tool.

Prerequisites

To build and deploy this application you will need Git, the AWS CLI, the AWS SAM CLI (v0.41.0 or newer, as noted below), and Docker (required for sam build --use-container), along with AWS credentials that are able to create the stack's resources.

Clone the Repository

Clone with SSH

git clone git@github.com:aws-samples/lambda-refarch-fileprocessing.git

Clone with HTTPS

git clone https://github.com/aws-samples/lambda-refarch-fileprocessing.git

Build

The AWS SAM CLI comes with abstractions for a number of Lambda runtimes to build your dependencies, and copies the source code into staging folders so that everything is ready to be packaged and deployed. The sam build command builds any dependencies that your application has, and copies your application source code to folders under .aws-sam/build to be zipped and uploaded to Lambda.

sam build --use-container

Note

Be sure to use v0.41.0 of the AWS SAM CLI or newer. Earlier versions do not recognize the EventInvokeConfig property and will fail with an InvalidDocumentException. To confirm your version of the AWS SAM CLI, run sam --version.

Deploy

For the first deployment, run the following command and save the generated configuration file samconfig.toml. Use lambda-file-refarch for the stack name.

sam deploy --guided

You will be prompted to enter data for ConversionLogLevel and SentimentLogLevel. The default value for each is INFO but you can also enter DEBUG. You will also be prompted for AlarmRecipientEmailAddress.

Subsequent deployments can use the simplified sam deploy. The command will use the generated configuration file samconfig.toml.
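
For example, run the following from the directory containing samconfig.toml:

sam deploy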

You will receive an email asking you to confirm subscription to the lambda-file-refarch-AlarmTopic SNS topic that will receive alerts should either the ConversionDlq SQS queue or SentimentDlq SQS queue receive messages.

Testing the Example

After you have created the stack using the CloudFormation template, you can manually test the system by uploading a Markdown file to the InputBucket that was created in the stack.

Alternatively, you can test it by using the pipeline tests.sh script. However, the test script removes the resources it creates, so if you wish to explore the solution and see the output files and DynamoDB table, manually uploading is the better option.

Manually testing

You can use any of the sample-xx.md files in the repository's /tests directory as example files. After the files have been uploaded, you can see the resulting HTML files in the output bucket of your stack. You can also view the CloudWatch logs for each of the functions to see the details of their execution.

You can use the following commands to copy sample files from the repository's /tests directory into the input bucket of your stack.

INPUT_BUCKET=$(aws cloudformation describe-stack-resource --stack-name lambda-file-refarch --logical-resource-id InputBucket --query "StackResourceDetail.PhysicalResourceId" --output text)
aws s3 cp ./tests/sample-01.md s3://${INPUT_BUCKET}/sample-01.md
aws s3 cp ./tests/sample-02.md s3://${INPUT_BUCKET}/sample-02.md

Once the input files have been uploaded to the input bucket, a series of events are put into motion.

  1. The input Markdown files are converted and stored in a separate S3 bucket.

OUTPUT_BUCKET=$(aws cloudformation describe-stack-resource --stack-name lambda-file-refarch --logical-resource-id ConversionTargetBucket --query "StackResourceDetail.PhysicalResourceId" --output text)
aws s3 ls s3://${OUTPUT_BUCKET}

  2. The input Markdown files are analyzed and their sentiment published to a DynamoDB table.

DYNAMO_TABLE=$(aws cloudformation describe-stack-resource --stack-name lambda-file-refarch --logical-resource-id SentimentTable --query "StackResourceDetail.PhysicalResourceId" --output text)
aws dynamodb scan --table-name ${DYNAMO_TABLE} --query "Items[*]"

You can also view the CloudWatch logs generated by the Lambda functions.
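
A convenient way to do this is with the SAM CLI, which looks up the log group from the function's logical ID. For example, to tail the ConversionFunction logs (substitute SentimentFunction for the other workflow):

sam logs --name ConversionFunction --stack-name lambda-file-refarch --tail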

Using the test script

The pipeline end-to-end test script can be executed manually. You will need to ensure that you have adequate permissions to perform the actions in the test script.

bash ./tests.sh lambda-file-refarch

While the script is executing, you will see each stage's output on the command line. The samples are uploaded to the InputBucket, and the script then waits for files to appear in the OutputBucket before checking that they have all been processed and that a matching HTML file exists in the OutputBucket for each. It also checks that the sentiment for each of the files has been recorded in the SentimentTable. Once complete, the script removes all the files it created and the entries from the SentimentTable.

Extra credit testing

Try uploading an oversized (>100 MB) or invalid file type to the input bucket (or adding it to ./tests if you are using the script). You can then use X-Ray to explore how these kinds of errors can be traced within the solution.

# On Linux:
fallocate -l 110M ./tests/sample-oversize.md
# On macOS:
mkfile 110m ./tests/sample-oversize.md

X-Ray Error Tracing - Real-time File Processing
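
If you prefer the command line to the console, you can also query X-Ray directly for recent traces that recorded an error. A rough sketch, using GNU date for the timestamps (adjust the date arithmetic on macOS):

# Summarize traces from the last 10 minutes that recorded an error.
aws xray get-trace-summaries --start-time $(date -d '10 minutes ago' +%s) --end-time $(date +%s) --filter-expression "error = true"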

Viewing the CloudWatch dashboard

A dashboard is created as a part of the stack creation process. Metrics are published for the conversion and sentiment analysis processes. In addition, the alarms and alarm states are published.
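
If you want to find the dashboard's generated name without browsing the console, you can list the dashboards in your account and region:

aws cloudwatch list-dashboards --query "DashboardEntries[*].DashboardName" --output text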

CloudWatch Dashboard - Real-time File Processing

Cleaning Up the Example Resources

To remove all resources created by this example, run the following command:

bash cleanup.sh

What Is Happening in the Script?

Objects are cleared out from the InputBucket and ConversionTargetBucket.

for bucket in InputBucket ConversionTargetBucket; do
  echo "Clearing out ${bucket}..."
  BUCKET=$(aws cloudformation describe-stack-resource --stack-name lambda-file-refarch --logical-resource-id ${bucket} --query "StackResourceDetail.PhysicalResourceId" --output text)
  aws s3 rm s3://${BUCKET} --recursive
  echo
done

The CloudFormation stack is deleted.

aws cloudformation delete-stack \
--stack-name lambda-file-refarch

The CloudWatch Logs Groups associated with the Lambda functions are deleted.

for log_group in $(aws logs describe-log-groups --log-group-name-prefix '/aws/lambda/lambda-file-refarch-' --query "logGroups[*].logGroupName" --output text); do
  echo "Removing log group ${log_group}..."
  aws logs delete-log-group --log-group-name ${log_group}
  echo
done

SAM Template Resources

Resources

The provided template creates the following resources:

- InputBucket - S3 bucket that receives the uploaded Markdown files
- An SNS topic that receives S3 Events from the InputBucket and fans them out to the two workflow queues
- ConversionQueue and ConversionDlq - SQS queue and dead-letter queue for the conversion workflow
- ConversionFunction - Lambda function that converts Markdown files to HTML
- ConversionTargetBucket - S3 bucket that stores the converted HTML files
- SentimentQueue and SentimentDlq - SQS queue and dead-letter queue for the sentiment analysis workflow
- SentimentFunction - Lambda function that detects sentiment with Amazon Comprehend
- SentimentTable - DynamoDB table that stores the sentiment results
- AlarmTopic - SNS topic that notifies AlarmRecipientEmailAddress, together with CloudWatch alarms on the two DLQs
- A CloudWatch dashboard for the conversion and sentiment analysis metrics

License

This reference architecture sample is licensed under Apache 2.0.