An extension of the llama2.java implementation, accelerated on GPUs using TornadoVM

<img src="https://github.com/mikepapadim/llama2.tornadovm.java/assets/8652854/4493fe14-7427-4532-91fa-7299cd96034b" width="30%">

This repository provides an implementation of llama2.java, extended to use the Vector API and TornadoVM for acceleration.

Prerequisites

Build

The set_paths.sh file provides a template with all the paths that need to be set for compilation and execution; edit it so that each path matches your local installation.

After the set_paths.sh file has been configured with the correct paths, run:

./set_paths.sh  

And finally, compile the project by running this script:

./compile.sh

Execution

Token files

Just like the original Java implementation, the program requires a tokenizer.bin file and one of the model checkpoints from the TinyLlamas collection on Hugging Face:

wget https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin

How to run

The repository contains a run.sh script for running the program. Additionally, the script can take an optional flag that enables execution of the program in pure Java, without TornadoVM:

# Run with just the model
./run.sh stories15M.bin
# Run with the workgroup size and the model
./run.sh -n 128 stories15M.bin
# Run with the workgroup size, one of the [VectorFloat4|VectorFloat8|VectorFloat16] types, and the model
./run.sh -n 128 -v -Dllama2.VectorFloat4=true stories15M.bin
# Run with one of the [VectorFloat4|VectorFloat8|VectorFloat16] types and the model
./run.sh -v -Dllama2.VectorFloat4=true stories15M.bin
# Run in pure Java, without TornadoVM
./run.sh -j java

Performance

| Component | Specification |
|-----------|---------------|
| CPU       | 13th Gen Intel® Core i7-13700 × 24 threads |
| GPU       | NVIDIA GeForce RTX 3070 |
| OS        | Pop!_OS Linux |
| JDK       | OpenJDK 21+35-2513 |
| TornadoVM | v1.0 |

Test Objective: Synergy Between the Vector API, Project Panama, and TornadoVM

This test illustrates the combined efficiency gained by integrating the Vector API, employing off-heap memory via MemorySegments for read-only weights, and TornadoVM. Following profiling of the original Java implementation, optimization focused on offloading only the final matrix-vector multiplication to the GPU through TornadoVM.
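For reference, the offloaded operation is essentially a matrix-vector multiplication. The following is a minimal plain-Java sketch of such a kernel (a hypothetical standalone version, not the repository's actual code); in the TornadoVM path, the outer loop would be the parallel dimension scheduled on the GPU:

```java
public class MatVec {
    // out[i] = sum_j w[i * n + j] * x[j], with w stored row-major (d rows, n cols).
    // In the TornadoVM version, the outer loop over rows is parallelized on the GPU.
    public static void matvec(float[] out, float[] x, float[] w, int n, int d) {
        for (int i = 0; i < d; i++) {
            float sum = 0.0f;
            for (int j = 0; j < n; j++) {
                sum += w[i * n + j] * x[j];
            }
            out[i] = sum;
        }
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f};
        float[] w = {1f, 0f, 0f,   // identity-like rows for a quick sanity check
                     0f, 1f, 0f};
        float[] out = new float[2];
        matvec(out, x, w, 3, 2);
        System.out.println(out[0] + " " + out[1]); // prints "1.0 2.0"
    }
}
```

In the transformer's forward pass this kernel dominates the runtime for the large output projection, which is why it is the natural candidate for offloading.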

To ensure an unbiased and reliable performance evaluation, the test is executed for more than 100 iterations. This extended duration allows the Java Virtual Machine (JVM) to reach a warmed-up state, ensuring stable performance measurements.
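This methodology can be sketched as follows (a hypothetical harness, not the repository's benchmark code): run many iterations, treat the first batch as JVM warm-up, and report the maximum throughput observed afterwards:

```java
import java.util.function.DoubleSupplier;

public class Harness {
    // Runs `iterations` rounds of `tokensPerSecond`, ignores the first
    // `warmup` rounds, and returns the maximum observed afterwards.
    public static double maxAfterWarmup(DoubleSupplier tokensPerSecond,
                                        int iterations, int warmup) {
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < iterations; i++) {
            double tps = tokensPerSecond.getAsDouble();
            if (i >= warmup) {
                max = Math.max(max, tps);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Simulated measurements: throughput climbs as the JIT warms up,
        // then plateaus at 600 tokens/s after 100 calls.
        final int[] call = {0};
        double best = maxAfterWarmup(() -> 500.0 + Math.min(call[0]++, 100), 120, 100);
        System.out.println(best); // prints "600.0"
    }
}
```

Reporting the post-warm-up maximum rather than an overall average avoids penalizing the measurement with interpreter-mode iterations that occur before JIT compilation kicks in.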

Multi-threaded

llama2.java was executed with -Djava.util.concurrent.ForkJoinPool.common.parallelism=24. The following table records the maximum number of tokens per second achieved after warm-up.

| Model | Tokens per second | Speedup vs. llama2.java | Implementation |
|-------|-------------------|-------------------------|----------------|
| stories15M.bin  | 718 | 1.15x | llama2TornadoVM.java |
| stories15M.bin  | 626 | 1.0   | llama2.java |
| stories42M.bin  | 326 | 1.16x | llama2TornadoVM.java |
| stories42M.bin  | 281 | 1.0   | llama2.java |
| stories110M.bin | 137 | 1.09x | llama2TornadoVM.java |
| stories110M.bin | 126 | 1.0   | llama2.java |

License

MIT