Awesome

Akkalat, Wafer-Scale GPU Simulation Infrastructure

Akkalat is the infrastructure to simulate wafer-scale GPUs, based on the architecture-agnostic simulator framework Akita.

The simulator targeted on Cerebras Wafer-Scale Engines and supported a variety of OpenCL benchmarks.

In the latest stable Akkalat v3, we devised a tile-based hardware paradigm, which can highly scale up compute resources at CU(Compute Unit)-level. Each tile consists of optimized L1 caches and TLBs, distributed memory and other basic components like reorder buffers and address translators. The tiles are connected to a Network-on-Chip (NoC) for communication.

The further performance optimization and hardware designs could be assisted with Vis4Mesh, a visualization tool developed later for mesh Network-on-Chip research.

Build

git clone git@github.com:ueqri/akkalat.git
cd akkalat/samples/fir
go build

# Case 1: run FIR in 64x64 mesh without NoC tracing
./fir -width=64 -height=64 -length=100000 -verify -timing

# Case 2: run FIR with tracing and Vis4Mesh,
# and customize the env to meet your demands
export AKITA_TRACE_PASSWORD=[RedisPassword]
export AKITA_TRACE_IP=127.0.0.1
export AKITA_TRACE_PORT=6379
export AKITA_TRACE_REDIS_DB=0
./fir -width=64 -height=64 -length=100000 -verify -timing -trace-vis
# Then follow Vis4Mesh tutorial to run visualization

Note: many Akita libraries used here are under development, therefore please stick with the latest commit of the correct branches for go packages util, noc, and mgpusim (as described in go.mod).

# Assume the work directory is inside akkalat/
cd ..
git clone --single-branch -b v2 git@gitlab.com/akita/util
git clone --single-branch -b 9-task-tracing-for-networks git@gitlab.com/akita/noc
git clone --single-branch -b 90-command-processor-reuses-todma-sender-port git@gitlab.com/akita/mgpusim

Benchmark

We support the following benchmarks the evaluate the performance.

AMD APP SDK	DNN Mark	HeteroMark	Polybench	Rodinia	SHOC
Bitonic Sort	MaxPooling	AES	ATAX	Needleman-Wunsch	BFS
Fast Walsh Transform	ReLU	FIR	BICG		FFT
Floyd-Warshall		KMeans			SPMV
Matrix Multiplication		PageRank			Stencil2D
Matrix Transpose
NBody
Simple Covolution

Performance

We demonstrate a 32x32 wafer-scale GPU using akkalat. Compared to 16 unified AMD R9Nano GPUs(with the same numbers of compute units) modeled by MGPUSim, the wafer-scale outperforms in many workloads and achieves up-to 4 speedup in Polybench.

This figure is generated full-automatically by another self-developed tool akitaplot, feel free to try it :-). And the detailed metrics of this test could also download in that repo, link.

speedup

Publications

Chris Thames, Hang Yan, and Yifan Sun. Understanding wafer-scale GPU performance using an architectural simulator. In Proceedings of the 14th Workshop on General Purpose Processing Using GPU (GPGPU '22). April 2022.