zig_gpt2
GPT-2 inference engine written in Zig. Generation time: ~28ms per token.
Features:
- No third-party dependencies besides BLAS (Accelerate or OpenBLAS).
- No memory allocations at runtime.
- Can run NanoGPT.
How to Run:
Download the GPT-2 checkpoint from OpenAI.
python3 download_weights.py
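
For context, a minimal sketch of what this download step typically involves, assuming the standard public hosting of the 124M GPT-2 checkpoint. The URL, file list, and output directory below are assumptions for illustration, not taken from download_weights.py:

```python
import os
import urllib.request

# Assumed public location of the OpenAI GPT-2 checkpoint files (124M variant).
BASE_URL = "https://openaipublic.blob.core.windows.net/gpt-2/models/124M"
FILES = [
    "checkpoint", "encoder.json", "hparams.json", "vocab.bpe",
    "model.ckpt.data-00000-of-00001", "model.ckpt.index", "model.ckpt.meta",
]

os.makedirs("models/124M", exist_ok=True)
for name in FILES:
    print(f"downloading {name} ...")
    urllib.request.urlretrieve(f"{BASE_URL}/{name}", os.path.join("models/124M", name))
```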
Build the Zig binary and run it with a prompt to generate completions:
zig build -Doptimize=ReleaseFast
./zig-out/bin/zig_gpt2 "Marcus Aurelius said"
How to Test:
Generate test data by forwarding random tensors through PyTorch ops.
python3 generate_test_data.py
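
Purely as an illustration of the approach (the ops covered and the file names are defined by generate_test_data.py, not by this sketch): a random tensor is forwarded through a PyTorch op and both sides are dumped as raw float32 so the Zig tests can read them back.

```python
import os
import torch

torch.manual_seed(0)
os.makedirs("test_data", exist_ok=True)  # output directory is hypothetical

# Forward a random tensor through a PyTorch op and dump input/output as raw
# float32 buffers that a Zig test can read back and compare elementwise.
x = torch.randn(4, 768)
y = torch.nn.functional.gelu(x, approximate="tanh")  # GPT-2 uses the tanh approximation

x.numpy().astype("float32").tofile("test_data/gelu_input.bin")
y.numpy().astype("float32").tofile("test_data/gelu_output.bin")
```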
Run the tests. They verify that the Zig ops produce the same output as the PyTorch reference.
zig build test
TODO
Implementation:
- ✅ Implement basic ops: Embedding, Linear, LayerNorm, GELU, Softmax, CausalSelfAttention (see the reference sketch after this list).
- ✅ Implement transformer modules: MLP, Transformer block.
- ✅ Implement the full GPT model.
- ✅ Implement sampling from the model.
- ✅ Implement BPE encoding/decoding.
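
For reference, a NumPy sketch of two of the ops listed above: the tanh-approximation GELU that GPT-2 uses and a numerically stable softmax. This mirrors the math only, not the project's Zig kernels.

```python
import numpy as np

def gelu(x):
    # GPT-2's tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

print(gelu(np.array([-1.0, 0.0, 1.0])))    # ~[-0.1588, 0.0, 0.8412]
print(softmax(np.array([1.0, 2.0, 3.0])))  # ~[0.0900, 0.2447, 0.6652]
```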
Efficiency:
- ✅ Replace custom linear algebra kernels with BLAS.
- ✅ Stream output as each new token is generated.
- ✅ Create central set of memory buffers and reuse them for each layer. No allocations at runtime.
- ✅ Add KV cache (see the sketch after this list).
- Parallelize softmax and gelu operations.
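
A minimal NumPy sketch of the KV-cache idea from the list above, assuming single-head attention for brevity; the names and shapes are illustrative rather than the project's. Per generated token, only that token's key and value are computed and appended, and attention runs over the cached history:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []                   # grows by one entry per generated token
for step in range(5):
    x = rng.standard_normal(d)              # embedding of the newest token only
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)                       # append; never recompute past tokens
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = softmax(q @ K.T / np.sqrt(d))  # one row of weights over cached tokens
    out = scores @ V                        # attention output for the new token
```

The point of the cache is that each step computes attention for the newest token only, against stored keys/values, instead of re-running the projections for the whole prompt on every step.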