Aptos Multi-Region Benchmark Setup
This repo contains deployment configurations, operational scripts, and benchmarks for a multi-region Aptos benchmark on Google Cloud.
- Each region is deployed separately via open source Terraform modules published by Aptos Labs. These are the same deployment modules used to run validators and fullnodes in production on mainnet. Validators and fullnodes connect to each other over the public internet.
- A lightweight wrapper around the kube API and `kubectl` provides a way to form the network and submit load against it.
Multi-region setup
The following regions are used; see the Google Cloud Inter-Region Latency and Throughput data in the Raw data section below:
- asia-east1 -- Taiwan
- europe-west4 -- Netherlands
- us-west1 -- Oregon
For each validator, the following is spun up:
- 2x `t2d-standard-48` -- one each for the validator itself and the validator fullnode (VFN). The machine family and size can be tuned via kubernetes resource requests and nodeAffinities, using GKE's Node auto-provisioning. More details in the sections below.
- 2x Google Cloud Load Balancers -- one each for the validator and VFN
- 2x 1 TiB SSD -- one each for the validator and VFN
Raw data
The latency and throughput stats below were pulled from Google Cloud Inter-Region Latency and Throughput. The raw filtered CSV can be found in the `./data` directory.
Latency (snapshot Dec 5, 2022 - Jan 3, 2023):
| sending_region | receiving_region | milliseconds |
|---|---|---|
| asia-east1 | europe-west4 | 251.794 |
| asia-east1 | us-west1 | 118.553 |
| europe-west4 | asia-east1 | 251.777 |
| europe-west4 | us-west1 | 133.412 |
| us-west1 | asia-east1 | 118.541 |
| us-west1 | europe-west4 | 133.435 |
Throughput (snapshot Dec 5, 2022 - Jan 3, 2023):
| sending_region | receiving_region | Gbits/sec |
|---|---|---|
| asia-east1 | europe-west4 | 9.344 |
| asia-east1 | us-west1 | 9.811 |
| europe-west4 | asia-east1 | 9.326 |
| europe-west4 | us-west1 | 9.815 |
| us-west1 | asia-east1 | 9.802 |
| us-west1 | europe-west4 | 9.778 |
Benchmark setup
Clone the repo
This repo uses a git submodule to https://github.com/aptos-labs/aptos-core, so be sure to clone that as well:
git clone https://github.com/aptos-labs/aptos-multi-region-bench.git --recurse-submodules
cd aptos-multi-region-bench
At any point you can update the submodule with:
git submodule update --remote
Set up GCP access
Create a GCP project and sign in with the `gcloud` CLI. It will also be useful to set the environment variable `GCP_PROJECT_ID` for future use.
For reference:
- Install the `gcloud` CLI: https://cloud.google.com/sdk/docs/install
- Create a GCP project: https://cloud.google.com/resource-manager/docs/creating-managing-projects
export GCP_PROJECT_ID=<YOUR_GCP_PROJECT_ID>
gcloud auth login --update-adc
gcloud config set project $GCP_PROJECT_ID
Set up the infrastructure
NOTE: This section may take a while to run through all the steps. A lot of the time will be spent running commands and waiting on cloud infrastructure to come alive.
Each region's infrastructure is deployed separately via Terraform. Each directory in the top-level `terraform/` directory corresponds to a Terraform project.
If you are new to Terraform, it's highly recommended that you familiarize yourself with the core concepts before you get started. This will help you keep your infrastructure healthy and avoid unnecessary costs. In particular, see these reference docs:
- What is Terraform: https://developer.hashicorp.com/terraform/intro
- Terraform backends: https://developer.hashicorp.com/terraform/language/settings/backends/configuration
- Terraform workspaces: https://developer.hashicorp.com/terraform/language/state/workspaces
Create a storage bucket for storing the Terraform state on Google Cloud Storage. Use the console or the `gsutil` command below to create the bucket. The name of the bucket must be unique. See the Google Cloud Storage documentation: https://cloud.google.com/storage/docs/creating-buckets#prereq-cli.
gsutil mb gs://BUCKET_NAME
# for example
gsutil mb gs://<project-name>-aptos-terraform-bench
Then, edit `terraform/example.backend.tfvars` to reference the GCS bucket created in the previous step, and rename it to `terraform/backend.tfvars`.
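As a minimal sketch, assuming `example.backend.tfvars` uses the standard GCS backend `bucket` key (keep whatever other keys the example file already defines), the edit-and-rename step might look like this:
# write terraform/backend.tfvars pointing at the bucket created above
cat > terraform/backend.tfvars <<'EOF'
bucket = "<project-name>-aptos-terraform-bench"
EOF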
Deploy each region's infrastructure with Terraform. For each of the Terraform project directories in `terraform/`, run the following series of commands:
# Initialize terraform and its backend, using the backend configuration created in the previous step
# This will copy the public reference terraform modules written by Aptos Labs into the .terraform/modules directory
terraform init -backend-config=../backend.tfvars
# This environment variable is used to apply the infrastructure to the GCP project you set up in the previous step
export TF_VAR_project=$GCP_PROJECT_ID
# Initialize your terraform workspaces, one unique workspace name for each directory.
terraform workspace new <WORKSPACE_NAME>
# for example
terraform workspace new bench-asia-east1
# check the infrastructure that will be applied
terraform plan
# apply it
terraform apply
After all the infrastructure is created, you can use the `cluster.py` utility to authenticate against all clusters. This will be your primary tool for interacting with each cluster's workloads. It is a wrapper around the kube API and familiar `kubectl` commands.
Authenticate with all GKE clusters
# this script must be run from the repository root
./bin/cluster.py auth
Initialize the Network
At this point, most of the required infrastructure has been set up. You must now begin the genesis process and start all the Aptos nodes in each kubernetes cluster. As a quick sanity check, visit https://console.cloud.google.com/kubernetes/list/overview?referrer=search&project=<YOUR_PROJECT_ID> to view all of the active kubernetes clusters in your project, and confirm that all are in a healthy "green" state. If not, use GKE's tooltips and logs to help debug.
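You can also list the clusters and their status from the CLI:
# list all GKE clusters in the project and their status
gcloud container clusters list --project $GCP_PROJECT_ID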
By default, the Terraform modules also install some baseline Aptos workloads on each of the kubernetes clusters (e.g. 1 validator). To check these running workloads, run the following from the project root:
./bin/cluster.py kube get pods
These workloads will shortly be replaced by the following steps, which initialize the benchmark network.
Install the `aptos` CLI
Some of the scripts below require the `aptos` CLI to be installed. Install instructions: https://aptos.dev/cli-tools/aptos-cli-tool/
Also ensure that the CLI is available in your `PATH`.
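A quick sanity check that the CLI is installed and on your `PATH` (it should print the installed version):
aptos --version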
Run genesis
In this setup, you will mostly be interacting with `aptos_node_helm_values.yaml` to configure the benchmark network as a whole.
First, start all the validators and fullnodes.
# 1. This performs a helm upgrade to all clusters to spin up the validators and fullnodes (this may take a few minutes)
time ./bin/cluster.py upgrade --new
You will see that most pods are in a `ContainerCreating` state. This is because these pods (fullnodes and validators) are waiting for their keys and genesis configuration, which will be provided in a later step.
You might also see some pods in a `Pending` state. This is likely due to GKE's underlying autoscaler kicking in; it may take a few minutes for the necessary compute to become available to the cluster. Part of the reason the validator and fullnode workloads are installed as the first step is to warm up the infrastructure.
In order to progress to the next steps, check that a LoadBalancer has been provisioned for each validator and fullnode. From the output, check whether any services still have `<pending>` as their `EXTERNAL-IP`. Wait until all LoadBalancers are up before proceeding to the next step.
# 1.1. Filter all kubernetes services by LoadBalancer type, checking for pending
./bin/cluster.py kube get svc | grep LoadBalancer
# to continue, this should be zero
./bin/cluster.py kube get svc | grep -c pending
To run genesis for the first time, execute the command below. This will generate keys, fetch the public IP of each validator and fullnode, and then generate a genesis blob and waypoint. These are then uploaded to each node (via kubernetes) for startup:
./bin/cluster.py genesis create --generate-keys
From here onwards, you can use Helm to manage the lifecycle of your nodes. If there is any config change you want to make, you can run `upgrade` again (NOTE: this time, without `--new`). If nothing has changed, running it again should be idempotent:
# 4. Upgrade all nodes (this may take a few minutes)
time ./bin/cluster.py upgrade
Scripts Reference
- `bin/loadtest.py` -- cluster loadtest utility
- `bin/cluster.py` -- cluster management utility. Creates genesis and manages node lifecycle
loadtest.py
Submits a load test against the network. The root keypair is hardcoded in genesis. The commands below show some customization options for the loadtest utility.
# apply the benchmark loadtest for an hour:
./bin/loadtest.py 0xE25708D90C72A53B400B27FC7602C4D546C7B7469FA6E12544F0EBFB2F16AE19 7 --apply --txn-expiration-time-secs=60 --mempool-backlog=25000 --duration=3600 --only-within-cluster --coin-transfer
# more customizations can be seen here
./bin/loadtest.py --help
cluster.py
Spin compute up or down, e.g. to save cost by going idle:
./bin/cluster.py start
./bin/cluster.py stop
Delete all workloads in each cluster, e.g. for a clean wipe:
./bin/cluster.py delete
To bring back the network, you can try:
./bin/cluster.py upgrade --new
Wipe the network and start from scratch
To wipe the chain, change the chain's "era" in the helm values in `aptos_node_helm_values.yaml`. This tells the kubernetes workloads to switch their underlying volumes, thus starting the chain from scratch. Then, follow the steps above to Run genesis.
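As a minimal sketch, assuming you use a YAML editor such as yq (https://github.com/mikefarah/yq, not part of this repo) rather than editing the file by hand:
# bump the era, e.g. from 1 to 2, so the workloads switch to fresh volumes
yq -i '.chain.era = 2' aptos_node_helm_values.yaml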
Changing the network size (and starting a new network)
- Edit `CLUSTERS` in `constants.py` to change the number of validators (and VFNs) in each region. Please note the quota in your GCP project (see the quota check sketch below).
- Follow the instructions above to re-run genesis and wipe the chain.
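Regional quota can be inspected with gcloud, for example:
# describe a region, including its quotas (CPUs, SSD, etc.)
gcloud compute regions describe us-west1 --project $GCP_PROJECT_ID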
Changing the node deployment configuration
Each node is deployed via `helm` on each cluster. The configuration is controlled by the helm values in `aptos_node_helm_values.yaml`. Documentation on which values are available to configure can be found in aptos-core: https://github.com/aptos-labs/aptos-core/tree/main/terraform/helm/aptos-node
For example:
- `imageTag` -- change the image for each validator and VFN
- `chain.era` -- change the chain era and wipe storage
- `validator.config` -- override the NodeConfig as YAML, such as tuning execution, consensus, state sync, etc.
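For instance, a hypothetical change (the image tag and config key below are placeholders, not recommendations) could be made by hand or with a tool like yq, then rolled out with upgrade:
# set a specific aptos-core image for all validators and VFNs
yq -i '.imageTag = "<APTOS_CORE_IMAGE_TAG>"' aptos_node_helm_values.yaml
# override one NodeConfig value under validator.config
yq -i '.validator.config.api.enabled = true' aptos_node_helm_values.yaml
# push the change to all clusters (no --new, since the network already exists)
time ./bin/cluster.py upgrade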
Misc
Grab the latest aptos-framework for genesis
# start a container from the tools image (leave this shell running)
docker run -it aptoslabs/tools:${IMAGE_TAG} bash
# then, from a separate terminal, copy the framework bundle out of the running container
docker cp `docker container ls | grep tools:${IMAGE_TAG} | awk '{print $1}'`:/aptos-framework/move/head.mrb genesis/framework.mrb
Individual GKE cluster auth
`./bin/cluster.py auth` authenticates across all clusters at once, but you may want to authenticate and change your kube context manually for each cluster.
Each cluster is deployed in its own region via its own top-level directory under `terraform/`. The `kubectx.sh` script in each directory will authenticate you against that cluster and set your `kubectl` context to match.
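Under the hood this amounts to fetching credentials per cluster, roughly as follows (cluster name and region are placeholders; use --zone instead of --region for zonal clusters):
# fetch kubeconfig credentials for one cluster and switch the current context to it
gcloud container clusters get-credentials <CLUSTER_NAME> --region <REGION> --project $GCP_PROJECT_ID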
Changing machine types
This kubernetes setup relies on GKE's Node auto-provisioning (NAP). This allows us to specify a machine family in each workload's `nodeAffinity`. The size of the machine is assigned automatically, based on the size of the workload. In general, the workload resource request must be a bit less than the machine's max capacity. For example, if you want to use a 48 vCPU machine for validators, you may need to set the resource request to only 42 vCPU to give slack to the Node auto-provisioner; otherwise it may provision the next largest machine size.
The default machine configuration via node auto-provisioning is already set in `aptos_node_helm_values.yaml`. In particular, note the following keys (a rough sketch of their shape follows the list):
- `validator.affinity.nodeAffinity` -- guarantee machine type via NAP
- `validator.resources` -- resource request and limit
- `validator.affinity.podAntiAffinity` -- prevent validators from sharing the same machine as other validators and fullnodes
- `fullnode.affinity.nodeAffinity` -- same as above, but for fullnodes
- `fullnode.affinity.podAntiAffinity` -- same as above, but for fullnodes
- `fullnode.resources` -- same as above, but for fullnodes
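As a rough, illustrative sketch of the validator side of these keys (the machine family and CPU figures are placeholders, not tuned recommendations), written to a scratch file rather than merged automatically:
# write an illustrative override snippet to a scratch file; merge the keys into
# aptos_node_helm_values.yaml by hand if they match what you want
cat > /tmp/nap-example-values.yaml <<'EOF'
validator:
  affinity:
    nodeAffinity:   # ask NAP for a specific machine family
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: cloud.google.com/machine-family
                operator: In
                values: ["t2d"]
  resources:        # request a bit below the 48 vCPU cap to leave NAP some slack
    requests:
      cpu: 42
    limits:
      cpu: 42
EOF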