Lambda function to convert a tar-gzipped set of pnm files into one OCRed PDF and upload it to Google Drive.
Designed to be used as a processing step after scanning. See this very complete blog post on how to use it together with a scanner via a Raspberry Pi.
Installation
Prerequisites
Before you start, you'll need:
- a source S3 bucket: this is where you'll upload tar.gz files which contain scanned images (only tested with pnm files so far). It needs to be in the same region as your Lambda function.
- an IAM role which allows reading and deleting from the source bucket and storing the log output. Attach this policy to the role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": ["arn:aws:s3:::<source-bucket>/*"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
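If you would rather generate the policy than edit the placeholder by hand, the following is a minimal sketch that fills in the bucket name. The function name `build_role_policy` and the bucket name `my-scans` are made up for illustration and are not part of this project.

```python
import json


def build_role_policy(source_bucket):
    """Return the IAM policy shown above as a dict for the given source bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # read and delete the uploaded tar.gz files
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:DeleteObject"],
                "Resource": ["arn:aws:s3:::{}/*".format(source_bucket)],
            },
            {
                # store the log output
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "arn:aws:logs:*:*:*",
            },
        ],
    }


if __name__ == "__main__":
    # "my-scans" is a placeholder bucket name
    print(json.dumps(build_role_policy("my-scans"), indent=2))
```

The printed JSON can be pasted into the policy editor of the IAM console.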
Lambda function
Set up a Lambda function with:
- Runtime: Python 3.6
- Role: the role you just created
- Handler: handler.handler
- env variable: EMPTY_PAGE_THRESHOLD=200. If tesseract finds fewer than 200 characters on a page, the page is, from experience, likely to be empty and will be removed (this assumes you're using a duplex scanner). To disable empty page removal, set this to 0.
- optional env variable: TESSERACT_LANG. Currently supported are eng (English) and deu (German). The default is English. If you want to add another language, see below.
- Timeout: 5 minutes
- Memory: I chose 2048MB. The more memory you allocate, the faster the execution (see also the official doc). 128MB is not enough; it leads to out-of-memory exceptions.
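To make the effect of the threshold concrete, here is a rough sketch of how such a filter could work. It is not the project's actual handler code: the function names are hypothetical, the fallback of 200 for an unset variable is an assumption, and counting every character of the OCR text (including whitespace) is an assumption as well.

```python
import os


def load_settings(env=os.environ):
    """Read the two OCR settings from the environment.

    Assumption: 200 is used when EMPTY_PAGE_THRESHOLD is unset; the default
    language "eng" follows the description above.
    """
    return {
        "empty_page_threshold": int(env.get("EMPTY_PAGE_THRESHOLD", "200")),
        "tesseract_lang": env.get("TESSERACT_LANG", "eng"),
    }


def keep_non_empty_pages(page_texts, threshold):
    """Return the indices of pages whose OCR text has at least `threshold` characters.

    A threshold of 0 keeps every page, i.e. empty page removal is disabled.
    """
    return [i for i, text in enumerate(page_texts) if len(text) >= threshold]


if __name__ == "__main__":
    settings = load_settings()
    # Stand-in OCR output: an empty back side and a page with plenty of text
    pages = ["", "lorem ipsum " * 50]
    print(keep_non_empty_pages(pages, settings["empty_page_threshold"]))
```

With a duplex scanner, the blank back sides of single-sided sheets produce little or no text and fall below the threshold.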
You have two options for storing your OCRed PDFs:
- Google Drive:
  - First create a Google API project in the developers console and turn on the Google Drive API as described here.
  - Copy the resulting client_secret.json into this project's root, then pip install oauth2client and run python scripts/get_drive_credentials. Now copy-paste the resulting values into the environment variables. This grants your Lambda function permission to create files in your Google Drive and to access the files it created (which it won't need). See here for more details about the rights you're granting.
  - Optional: If you want your PDFs to be stored in a specific folder, go to that folder in your Google Drive, copy the part of the URL after /folders/ and put it into an additional environment variable named GDRIVE_FOLDER.
- S3: This is a lot easier, as you only need to create an S3 bucket (in the same region as your Lambda function) and add these lines to your policy (replace <dest-bucket>):
  {
    "Effect": "Allow",
    "Action": [
      "s3:PutObject"
    ],
    "Resource": ["arn:aws:s3:::<dest-bucket>/*"]
  }
  Then set the environment variables UPLOAD_TYPE=s3 and S3_BUCKET to the name of the bucket you just created.
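For the optional GDRIVE_FOLDER value, the part of the URL after /folders/ can also be extracted programmatically. The snippet below is a small sketch; the helper name `drive_folder_id` and the folder ID in the example URL are invented for illustration.

```python
from urllib.parse import urlparse


def drive_folder_id(url):
    """Return the path segment that follows /folders/ in a Google Drive URL."""
    # Query strings such as ?usp=sharing are not part of the path and are ignored
    segments = [s for s in urlparse(url).path.split("/") if s]
    if "folders" not in segments:
        raise ValueError("no /folders/ segment in URL: {}".format(url))
    index = segments.index("folders")
    if index + 1 >= len(segments):
        raise ValueError("nothing follows /folders/ in URL: {}".format(url))
    return segments[index + 1]


if __name__ == "__main__":
    # The folder ID below is a made-up example
    print(drive_folder_id("https://drive.google.com/drive/folders/abc123XYZ?usp=sharing"))
```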
Now, upload the zip file to lambda:
- download ocr-lambda.zip from the releases section on GitHub
- upload the zip file to an S3 bucket of yours which is in the same region as your Lambda function
- then, choose Code entry type: upload from S3
- S3 link URL: the S3 link of the file you just uploaded; it has the form https://s3.<region>.amazonaws.com/<bucket>/ocr-lambda.zip
Test
Upload a tar.gz containing one or more pnm files into <source-bucket>, then add a new test event with this:
{
  "Records": [
    {
      "s3": {
        "bucket": {
          "name": "<source-bucket>"
        },
        "object": {
          "key": "<filename-you-just-uploaded>.tar.gz"
        }
      }
    }
  ]
}
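The event above carries only the fields a handler needs to locate the uploaded archive. As a sketch of how a handler might read them (the function name `objects_from_event` is hypothetical, and this is not necessarily how this project's handler.handler does it):

```python
from urllib.parse import unquote_plus


def objects_from_event(event):
    """Return a (bucket, key) pair for every record in an S3 event."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Real S3 notifications URL-encode the key (a space arrives as "+")
        key = unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs


if __name__ == "__main__":
    # Placeholder bucket and file names
    test_event = {
        "Records": [
            {
                "s3": {
                    "bucket": {"name": "my-scans"},
                    "object": {"key": "scan-001.tar.gz"},
                }
            }
        ]
    }
    print(objects_from_event(test_event))
```

Because of the decoding step, a hand-written test event whose key contains a literal "+" would be read as containing a space, so plain file names are the safest choice for testing.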
Add trigger
From the Add triggers menu on the left choose S3, then in the Configure trigger dialogue:
- reload the page (otherwise you'll be in "test mode" and cannot save the trigger; yes, that's an AWS bug)
- Bucket: the bucket the Lambda function should listen to
- Event type: Object Created (All)
- Prefix and Suffix: you can leave these empty
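If you prefer to script the trigger instead of clicking through the console, the same settings can be expressed as an S3 notification configuration. This is only a sketch: the helper name `s3_trigger_config` and the ARN in the example are placeholders. "Object Created (All)" corresponds to the event name `s3:ObjectCreated:*`.

```python
def s3_trigger_config(lambda_arn, prefix="", suffix=""):
    """Build an S3 notification configuration that invokes a Lambda function
    for every created object, optionally filtered by key prefix and suffix."""
    config = {
        "LambdaFunctionArn": lambda_arn,
        "Events": ["s3:ObjectCreated:*"],
    }
    rules = []
    if prefix:
        rules.append({"Name": "prefix", "Value": prefix})
    if suffix:
        rules.append({"Name": "suffix", "Value": suffix})
    if rules:
        # Empty prefix and suffix mean no filter at all
        config["Filter"] = {"Key": {"FilterRules": rules}}
    return {"LambdaFunctionConfigurations": [config]}


if __name__ == "__main__":
    # Placeholder ARN; with boto3 installed the result could be passed on as
    #   boto3.client("s3").put_bucket_notification_configuration(
    #       Bucket="<source-bucket>",
    #       NotificationConfiguration=s3_trigger_config("<lambda-arn>"))
    print(s3_trigger_config("arn:aws:lambda:eu-west-1:123456789012:function:ocr"))
```

Unlike the console, which sets this up for you, the API route also requires granting S3 permission to invoke the function before the configuration is accepted.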
Build lambda function
cd root/of/repo
virtualenv --python=python3.6 .
pip install -r requirements.txt
scripts/zip.sh
aws s3 cp ocr-lambda.zip s3://<s3-bucket>/
aws lambda update-function-code --function-name <lambda-name> --s3-bucket <s3-bucket> --s3-key ocr-lambda.zip
Further docs
- Build tesseract4 binaries
- Adding another language: git clone this repo, then cd into tessdata and download one of the files from tessdata_fast into that directory. Then follow the instructions at "Build lambda function". Don't forget to set the TESSERACT_LANG env variable.