Overview

Purpose

The purpose of this tool is to provide a quick and simple way to provision Google Cloud Platform (GCP) compute clusters built from accelerator-optimized machine types.

Machine Type Comparison

Feature \ Machine | A2                   | A3
Nvidia GPU Type   | A100 (40GB and 80GB) | H100 80GB
VM Shapes         | Several              | 8 GPUs
GPUDirect-TCPX    | Unsupported          | Supported
Multi-NIC         | Unsupported          | 5 vNICs (1 for CPU and 4 for GPUs, one per pair of GPUs)

Repository Content Summary

This repository contains:

How to provision a cluster

Prerequisites

In order to provision a cluster, the following are required:

Google Cloud Authentication

The command to authorize tools to create resources on your behalf is:

gcloud auth application-default login

The above command obtains your user credentials and stores them as Application Default Credentials, which the provisioning tools (terraform, the docker image, ghpc) use to create resources on your behalf.

Methods

After running through the prerequisites above, there are a few ways to provision a cluster:

  1. Run the docker image: do this if you don't have any existing infrastructure as code.
  2. Integrate into an existing terraform project: do this if you already have (or plan to have) a terraform project and would like to have the same terraform apply create this cluster along with all your other infrastructure.
  3. Integrate into an existing HPC Toolkit Blueprint: do this if you already have (or plan to have) an HPC Toolkit Blueprint and would like to have the same ghpc deploy create this cluster along with all your other infrastructure.

Run the docker image

For this method, all you need (in addition to the prerequisites above) is the ability to run docker and a terraform.tfvars file in your current directory, either written by hand or copied from an example such as a3-mega; a sketch of such a file follows the flag explanations below. In a terminal, run:

# create/update the cluster
docker run \
  --rm \
  -v "${HOME}/.config/gcloud:/root/.config/gcloud" \
  -v "${PWD}:/root/aiinfra/input" \
  us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
  create a3-mega mig-cos

# destroy the cluster
docker run \
  --rm \
  -v "${HOME}/.config/gcloud:/root/.config/gcloud" \
  -v "${PWD}:/root/aiinfra/input" \
  us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
  destroy a3-mega mig-cos

Quick explanation of the docker run flags and arguments (in the same order as above):

  1. --rm: remove the container once the command finishes.
  2. -v "${HOME}/.config/gcloud:/root/.config/gcloud": mount your gcloud configuration, including the Application Default Credentials created above, into the container.
  3. -v "${PWD}:/root/aiinfra/input": mount your current directory, which contains terraform.tfvars, as the container's input directory.
  4. us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest: the image to run.
  5. create a3-mega mig-cos (or destroy a3-mega mig-cos): the action to perform, followed by the machine type (a3-mega) and deployment variant (mig-cos) of the cluster.
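
For orientation, a minimal terraform.tfvars might look like the sketch below. The variable names shown (project_id, resource_prefix, region, zone, target_size) are illustrative assumptions, not the authoritative schema; copy the real names and values from the example you start from.

# terraform.tfvars (illustrative sketch; take the real variable names from the example you copy)
project_id      = "my-gcp-project"      # project to create the cluster in
resource_prefix = "my-a3mega-cluster"   # prefix for the names of created resources
region          = "us-central1"         # region to create the cluster in
zone            = "us-central1-a"       # zone within that region
target_size     = 2                     # number of VMs in the cluster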

Integrate into an existing terraform project

For this method, you need to install terraform. Examples of usage as a terraform module can be found in the main.tf file of any of the examples, such as a3-mega. Cluster provisioning then works the same way as for any other terraform project (a sketch of a wrapping main.tf follows the commands below):

# assuming the directory containing main.tf is the current working directory

# create/update the cluster
terraform init && terraform validate && terraform apply -var-file="terraform.tfvars"

# destroy the cluster
terraform init && terraform validate && terraform apply -destroy
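
For reference, wrapping the cluster in your own project is just a module block in main.tf. The sketch below is a hypothetical skeleton: the source path and settings are placeholders, so copy the real ones from an example's main.tf rather than from here.

# main.tf (hypothetical skeleton; copy the real source path and settings from an example)
module "a3_mega_cluster" {
  # placeholder source; point this at the module used by the a3-mega example
  source = "github.com/<org>/<repo>//path/to/a3-mega/module"

  # pass through the same values you would put in terraform.tfvars
  project_id      = var.project_id
  resource_prefix = var.resource_prefix
  region          = var.region
  zone            = var.zone
  target_size     = var.target_size
}

Because the module is plain terraform, it participates in the same state, plan, and apply lifecycle as the rest of your infrastructure.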

Integrate into an existing HPC Toolkit Blueprint

For this method, you need to build ghpc. Examples of usage as an HPC Toolkit Blueprint can be found in the blueprint.yaml file of any of the examples, such as a3-mega. Cluster provisioning then works the same way as for any other blueprint (a sketch of a blueprint is shown after the commands below):

# assuming the ghpc binary and blueprint.yaml are both in
# the current working directory

# create/update the cluster
./ghpc create -w ./blueprint.yaml && ./ghpc deploy a3-mega-mig-cos

# destroy the cluster
./ghpc create -w ./blueprint.yaml && ./ghpc destroy a3-mega-mig-cos
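
For reference, a skeletal blueprint.yaml might look like the sketch below. The module source and settings are placeholder assumptions, so copy the real ones from an example's blueprint.yaml; the deployment name matches the a3-mega-mig-cos argument passed to ghpc deploy and ghpc destroy above.

# blueprint.yaml (skeletal sketch; copy the real module source and settings from an example)
blueprint_name: a3-mega-mig-cos

vars:
  project_id: my-gcp-project
  deployment_name: a3-mega-mig-cos   # the name passed to ghpc deploy/destroy
  region: us-central1
  zone: us-central1-a

deployment_groups:
- group: primary
  modules:
  - id: a3-mega-cluster
    # placeholder source; point this at the module used by the a3-mega example
    source: github.com/<org>/<repo>//path/to/a3-mega/module
    settings:
      resource_prefix: my-a3mega-cluster
      target_size: 2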