cbers-2-stac
Create and keep up-to-date a STAC static catalog and API for CBERS-4/4A and Amazonia-1 images on AWS.
STAC version and extensions
Implements STAC version 1.0.0 and STAC-API version 1.0.0-beta.1.
Live version
A live version of the stack is deployed to AWS and serves its contents at:
- arn:aws:s3:::cbers-stac-1-0-0: S3 bucket with static STAC content, including its root catalog.
- STAC API endpoint - please check this important notice.
- SNS topic for new scenes: arn:aws:sns:us-east-1:769537946825:cbers2stac-prod-stacitemtopic4BCE3141-VI09VRB6LBEK. This topic receives the complete STAC item for each ingested scene.
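A subscriber to the new scenes topic could unpack the items along these lines. This is a minimal sketch, assuming the subscriber is an AWS Lambda function that receives the standard SNS event envelope and that the message body is the STAC item serialized as JSON. The function name stac_items_from_sns_event and the sample item are hypothetical and not part of cbers-2-stac.

```python
import json
from typing import Any, Dict, List


def stac_items_from_sns_event(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the STAC items carried by an SNS-triggered Lambda event."""
    items = []
    for record in event.get("Records", []):
        # SNS delivers the published message as a JSON string in Sns.Message.
        items.append(json.loads(record["Sns"]["Message"]))
    return items


# Hypothetical, heavily trimmed item used only to exercise the function.
sample_event = {
    "Records": [
        {"Sns": {"Message": json.dumps({"type": "Feature", "id": "example-scene"})}}
    ]
}
sample_ids = [item["id"] for item in stac_items_from_sns_event(sample_event)]
```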
Deployment to AWS
Install
$ git clone git@github.com:fredliporace/cbers-2-stac.git
$ cd cbers-2-stac
$ pip install -e .[dev,test,deploy]
Some lambdas require extra pip packages to be installed in the lambda directory before deployment. To install these packages execute:
./pip-on-lambdas.sh
CDK bootstrap
Deployment uses AWS CDK2.
Requirements:
- node: use nvm to make sure a supported Node.js version is in use (tested with 18.0.0)
- AWS credentials configured
To install and check AWS CDK (tested with CDK 2.129.0):
$ npm install -g aws-cdk
$ cdk --version
$ cdk bootstrap # Deploys the CDK toolkit stack into an AWS environment
# in a specific region
$ cdk bootstrap aws://${AWS_ACCOUNT_ID}/eu-central-1
Configuration
Create a .env file in the project root directory and configure your application. Use .env_example as a guide; it documents all parameters.
If STACK_ENABLE_API is set, you should review the Elasticsearch domain configuration, for which some parameters are hard-coded; see this issue.
Once the configuration is complete, check that CDK lists the configured stack:
$ cdk list
Deployment
Deploy the stack, replacing cbers2stac-dev with your configured stack name:
$ cdk deploy cbers2stac-dev
The e-mail address configured in STACK_OPERATOR_EMAIL receives execution alarms. During the first deploy it should receive a message requesting confirmation of the alarm topic subscription; accept the request to receive the alarms.
The stack output shows the SNS topic for new scenes and the API endpoint, if configured:
✅ cbers2stac-prod
Outputs:
cbers2stac-prod.stacapiEndpointBED73CCA = https://....
cbers2stac-prod.stacitemtopicoutput = arn:aws:sns:us-east-1:...:...
Static catalogs and collections
Empty static STAC catalogs and collections are created when the stack is deployed. Note that if these files are updated and a new deploy is executed against an already populated STAC bucket, the deployment may fail.
To work around this, set STACK_DEPLOY_STATIC_CATALOG_STRUCTURE to false in .env and copy the static files manually. From a prompt in the stack/static_catalog_structure directory:
$ aws s3 cp ./ s3://[REPLACE_WITH_STAC_BUCKET_NAME] --recursive
Replace [REPLACE_WITH_STAC_BUCKET_NAME] with your STAC bucket name. The collections initially do not contain links to their children; these are added when a new scene is inserted at any level.
Creating the Elasticsearch index
If STACK_ENABLE_API is set in the configuration, you should now create the Elasticsearch index. This needs to be executed only once, right after the first deploy that enables the API. The index is created by the lambda create_elastic_index_lambda, which may be executed from the AWS console or awscli. The function requires no parameters.
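As an alternative to the console or awscli, the lambda can also be invoked from Python. The sketch below assumes boto3 is available and wraps its Lambda client invoke call. The helper name invoke_lambda is hypothetical, and the deployed function name is generated at deploy time, so it is left as a placeholder to be looked up in the AWS console.

```python
import json
from typing import Any, Dict, Optional


def invoke_lambda(client: Any, function_name: str,
                  payload: Optional[Dict[str, Any]] = None) -> int:
    """Invoke a Lambda synchronously and return the HTTP status code."""
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the function to finish
        Payload=json.dumps(payload or {}).encode("utf-8"),
    )
    return response["StatusCode"]


# Typical use (requires boto3 and AWS credentials; the name is a placeholder):
#   import boto3
#   invoke_lambda(boto3.client("lambda"), "<deployed create_elastic_index_lambda name>")
```

Since the function requires no parameters, the payload defaults to an empty JSON object.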
It is recommended to change the cluster configuration to disable the automatic creation of indices. As far as I know this can't be done through CDK options; you need to access the domain configuration endpoint directly, see example.
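For illustration, the request that disables automatic index creation could be built as follows. This is a sketch based on the standard Elasticsearch cluster settings API (the action.auto_create_index setting), not a copy of the linked example. The endpoint is a placeholder, the helper name is hypothetical, and the request is only built, not sent: sending it needs whatever authentication your domain's access policy requires.

```python
import json
import urllib.request


def disable_auto_create_request(domain_endpoint: str) -> urllib.request.Request:
    """Build (but do not send) a request that turns off automatic index creation."""
    body = {"persistent": {"action.auto_create_index": "false"}}
    return urllib.request.Request(
        url=f"{domain_endpoint.rstrip('/')}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder endpoint; use the endpoint of your own domain.
request = disable_auto_create_request("https://example-domain.invalid")
```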
Populate the catalog
Once created, the stack automatically listens to the SNS topics that publish new quicklooks. The STAC item is created as soon as the message is received.
Reconciliation from INPE's original metadata
The lambda populate_reconcile_queue_lambda may be used to reconcile the STAC catalog with the original CBERS metadata catalog. This is typically used to populate a new STAC instance with data from the open datasets.
The lambda payload is a bucket name and a prefix; all directories under the prefix are queued. Each directory is then scanned separately by a distinct lambda instance: all XMLs are converted to static STAC documents and, if the API is configured, indexed again.
Some examples are shown below.
To index all CBERS-4 MUX scenes with path 102 and row 83:
{
"bucket": "cbers-pds",
"prefix": "CBERS4/MUX/102/083/"
}
To index all CBERS-4A MUX scenes:
{
"bucket": "cbers-pds",
"prefix": "CBERS4A/MUX/"
}
To index all CBERS-4A MUX scenes with path 120:
{
"bucket": "cbers-pds",
"prefix": "CBERS4A/MUX/120/"
}
To index all Amazonia-1 WFI scenes:
{
"bucket": "amazonia-pds",
"prefix": "AMAZONIA1/WFI/"
}
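The payloads above follow a regular pattern, so they can be generated rather than typed by hand. The helper below is a hypothetical sketch, not part of cbers-2-stac; it assumes that path and row are always zero-padded to three digits, which is inferred from the examples.

```python
from typing import Dict, Optional


def reconcile_payload(bucket: str, satellite: str, camera: str,
                      path: Optional[int] = None,
                      row: Optional[int] = None) -> Dict[str, str]:
    """Build a payload dict; path and row are zero-padded to three digits."""
    if row is not None and path is None:
        raise ValueError("row requires path")
    parts = [satellite, camera]
    if path is not None:
        parts.append(f"{path:03d}")
    if row is not None:
        parts.append(f"{row:03d}")
    # The prefix always ends with a slash, as in the examples.
    return {"bucket": bucket, "prefix": "/".join(parts) + "/"}


payload = reconcile_payload("cbers-pds", "CBERS4", "MUX", path=102, row=83)
# payload == {"bucket": "cbers-pds", "prefix": "CBERS4/MUX/102/083/"}
```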
The indexed documents are immediately available through the STAC API. The static catalogs are updated every 30 minutes. To update the static catalogs before that, you may execute the generate_catalog_levels_to_be_updated_lambda lambda.
Reconciliation from STAC static catalog
The lambda reindex_stac_items_lambda may be used to reconcile the STAC API service with the static catalog. The lambda payload is the set of parameters to be passed to list_objects_v2; all STAC items under the prefix are queued and re-indexed. An example is shown below.
To index all CBERS-4 AWFI scenes with path 1:
{
"prefix": "CBERS4/AWFI/001/"
}
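To give an idea of what listing under a prefix involves, the sketch below pages through list_objects_v2 results with a boto3-style S3 client. It is an illustration only and not the lambda's actual code: the helper name iter_keys is hypothetical, the mapping from payload keys to boto3 arguments (for example prefix to Prefix) is an assumption, and selecting only STAC items among the keys is left out.

```python
from typing import Any, Dict, Iterator


def iter_keys(s3_client: Any, bucket: str, **list_params: Any) -> Iterator[str]:
    """Yield every object key matching the list_objects_v2 parameters."""
    params: Dict[str, Any] = {"Bucket": bucket, **list_params}
    while True:
        response = s3_client.list_objects_v2(**params)
        for obj in response.get("Contents", []):
            yield obj["Key"]
        if not response.get("IsTruncated"):
            break
        # Continue listing from where the previous page stopped.
        params["ContinuationToken"] = response["NextContinuationToken"]


# Typical use (requires boto3 and AWS credentials):
#   import boto3
#   for key in iter_keys(boto3.client("s3"), "<stac bucket>", Prefix="CBERS4/AWFI/001/"):
#       print(key)
```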
Operation
SQS-Lambda and dead letter queues (DLQs)
The system makes extensive use of the SQS-lambda integration pattern. DLQs are defined to store messages representing failed jobs:
- reconcile_queue: jobs representing the S3 prefixes that will be reconciled are queued here. Consumed by consume_reconcile_queue_lambda. Failed jobs are sent to consume_reconcile_queue_dlq.
- new_scenes_queue: jobs representing a key for a scene to be converted to STAC and indexed. Consumed by process_new_scene_lambda. Failed jobs are sent to process_new_scenes_queue_dlq.
- insert_into_elasticsearch_queue: jobs representing a STAC item. This queue subscribes to stac_item_topic and reconcile_stac_item_topic, receiving the STAC items as notifications. Consumed by insert_into_elastic_lambda. Failed jobs (for now) are sent to dead_letter_queue.
Failed lambda executions from other queues are sent to the general dead_letter_queue.
A tool is provided to move messages between SQS queues; it may be used to re-queue failed jobs:
cb2stac-redrive-sqs --src-url=https://... --dst-url=https://... --messages-no=100
The jobs may also be re-queued using the Start DLQ redrive option, now available from the AWS console.
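Conceptually, re-queuing means receiving each message from the source queue, sending it to the destination and only then deleting it from the source. The sketch below shows that loop with a boto3-style SQS client. It is not the implementation of cb2stac-redrive-sqs; the function name move_messages is hypothetical, and message attributes are not copied.

```python
from typing import Any


def move_messages(sqs_client: Any, src_url: str, dst_url: str, messages_no: int) -> int:
    """Move up to messages_no messages from src_url to dst_url; return the count."""
    moved = 0
    while moved < messages_no:
        response = sqs_client.receive_message(
            QueueUrl=src_url,
            MaxNumberOfMessages=min(10, messages_no - moved),  # SQS caps a batch at 10
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # nothing was returned, assume the source queue is drained
        for message in messages:
            sqs_client.send_message(QueueUrl=dst_url, MessageBody=message["Body"])
            # Delete only after the copy succeeded, so a failure loses nothing.
            sqs_client.delete_message(
                QueueUrl=src_url, ReceiptHandle=message["ReceiptHandle"]
            )
            moved += 1
    return moved
```

Deleting after sending means a crash between the two calls can duplicate a message but never drop one, so this ordering suits jobs that are safe to run more than once.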
Recovering from Elasticsearch (ES) cluster failures
Restore from ES snapshot
See AWS documentation.
Re-ingest items from backup queue
The system keeps backup_insert_into_elasticsearch_queue with the STAC items ingested in the last days; see the STACK_BACKUP_QUEUE_RETENTION_DAYS configuration parameter. The items may be re-ingested after a restore to make sure that the archive includes the items processed between the restored backup date and the failure.
Use cb2stac-redrive-sqs (or Start DLQ redrive from the AWS console) to transfer messages from the DLQ to insert_into_elasticsearch_queue.
Development
Git hooks
This repo is set to use pre-commit to run isort, pylint, pydocstyle, black ("uncompromising Python code formatter") and mypy when committing new code.
$ pre-commit install
Testing
Tests require localstack to be running. Before starting it, check that the /tmp/localstack/es_backup directory exists; it is required to start the ES service.
$ cd test && docker-compose up # Starts localstack
$ pytest
Check CI integration testing before pushing
act may be used to test GitHub Actions locally. From the project's root directory:
$ act --env-file < /dev/null -j tests
$ act --env-file < /dev/null -r -j tests # To keep docker container's state
The env-file option is used to make sure that the .env file is used only for CDK and not by docker.
References
The mechanism to include new items into the archive as soon as they are ingested is described in this AWS blog post.
Acknowledgments
Radiant Earth Foundation supported the migration to STAC 1.0.0 final.