RKMS (Reliable Key Management Service)
RKMS is a highly available key management service, built on top of AWS's KMS.
Objective
While AWS's KMS is an amazing service, it does not have an SLA. As a result, if KMS goes down in the region you are using, your application goes down too, because it can't encrypt or decrypt data. The idea of RKMS is to replicate your application's data keys across multiple regions, so you can fall back to another region if your main region goes down.
To get a better understanding, check out my blog post about the project.
Architecture
Overview
Before we look at how RKMS is designed, let's go over the main functionalities AWS's KMS provides:
- GenerateDataKey(): creates and returns a random data key to encrypt/decrypt data with
- Encrypt(data, kmsKeyId): encrypts data with the specified KMS key
- Decrypt(data, kmsKeyId): decrypts data with the specified KMS key
RKMS's main endpoint is GET /key?id=<id>, which roughly does the following:
- Look in the key/value store for a value for id
  - If found, the value will contain mappings from KMS regions to encrypted data keys
    - Pick a region
    - Decrypt the encrypted data key in the selected region and return the plaintext data key returned by KMS
    - If the call to KMS fails, try other regions
  - If not found, a new key has to be created for the given id
    - Ask one of the KMS regions to generate a data key
    - Encrypt the data key in every region
    - Save all the encrypted data keys in the store under id
    - Return the plaintext data key
Notes:
- RKMS is AWS specific
- It is not an implementation of a key management service from ground up
- It currently uses DynamoDB as the key/value store, but other stores can easily be swapped in by implementing the Store interface.
High Availability and Race Conditions
One of the benefits of RKMS is that it is stateless. As a result, you can run multiple copies of the service to avoid a single point of failure. On the other hand, running multiple copies raises concerns about race conditions (e.g. creating the same key at the "same" time on multiple servers). To address this, RKMS follows a First Write Wins approach. The last step of creating a key is to save it in the key/value store. RKMS performs a conditional write here: it saves to the store only if no value exists for the given key. So when the same key is being created at the "same" time, the writes to the store happen serially and only the first write wins. The second writer then just re-reads from the store and returns the value generated by the other RKMS server.
Get Started
- (Optional) Use the Terraform code in the terraform folder to create the necessary resources
- Update the config.toml file with values specific to your needs and environment
- Execute the following:
go build ./rkms
Contributing
Contributions to this project are very welcome! You can even contribute by simply requesting features or reporting bugs.
Things I would like to do in the future (which you can help with!) are:
- Write more tests
- Add a DELETE /key?id=<id> endpoint to allow deletion
- Allow key creation even if some regions are down
- gRPC support
- Create a Makefile
- Create a Dockerfile
- Create a Helm chart