Awesome

TurboPilot 🚀

Turbopilot is deprecated/archived as of 30/9/23. There are other mature solutions that meet the community's needs better. Please read my blog post about my decision to down tools and for recommended alternatives.

BSD Licensed Time Spent

TurboPilot is a self-hosted copilot clone which uses the library behind llama.cpp to run the 6 Billion Parameter Salesforce Codegen model in 4GiB of RAM. It is heavily based and inspired by on the fauxpilot project.

NB: This is a proof of concept right now rather than a stable tool. Autocompletion is quite slow in this version of the project. Feel free to play with it, but your mileage may vary.

a screen recording of turbopilot running through fauxpilot plugin

✨ Now Supports StableCode 3B Instruct simply use TheBloke's Quantized GGML models and set -m stablecode.

✨ New: Refactored + Simplified: The source code has been improved to make it easier to extend and add new models to Turbopilot. The system now supports multiple flavours of model

✨ New: Wizardcoder, Starcoder, Santacoder support - Turbopilot now supports state of the art local code completion models which provide more programming languages and "fill in the middle" support.

🤝 Contributing

PRs to this project and the corresponding GGML fork are very welcome.

Make a fork, make your changes and then open a PR.

👋 Getting Started

The easiest way to try the project out is to grab the pre-processed models and then run the server in docker.

Getting The Models

You have 2 options for getting the model

Option A: Direct Download - Easy, Quickstart

You can download the pre-converted, pre-quantized models from Huggingface.

For low RAM users (4-8 GiB), I recommend StableCode and for high power users (16+ GiB RAM, discrete GPU or apple silicon) I recomnmend WizardCoder.

Turbopilot still supports the first generation codegen models from v0.0.5 and earlier builds. Although old models do need to be requantized.

You can find a full catalogue of models in MODELS.md.

Option B: Convert The Models Yourself - Hard, More Flexible

Follow this guide if you want to experiment with quantizing the models yourself.

⚙️ Running TurboPilot Server

Download the latest binary and extract it to the root project folder. If a binary is not provided for your OS or you'd prefer to build it yourself follow the build instructions

Run:

./turbopilot -m starcoder -f ./models/santacoder-q4_0.bin

The application should start a server on port 18080, you can change this with the -p option but this is the default port that vscode-fauxpilot tries to connect to so you probably want to leave this alone unless you are sure you know what you're doing.

If you have a multi-core system you can control how many CPUs are used with the -t option - for example, on my AMD Ryzen 5000 which has 6 cores/12 threads I use:

./codegen-serve -t 6 -m starcoder -f ./models/santacoder-q4_0.bin

To run the legacy codegen models. Just change the model type flag -m to codegen instead.

NOTE: Turbopilot 0.1.0 and newer re-quantize your codegen models old models from v0.0.5 and older. I am working on providing updated quantized codegen models

📦 Running From Docker

You can also run Turbopilot from the pre-built docker image supplied here

You will still need to download the models separately, then you can run:

docker run --rm -it \
  -v ./models:/models \
  -e THREADS=6 \
  -e MODEL_TYPE=starcoder \
  -e MODEL="/models/santacoder-q4_0.bin" \
  -p 18080:18080 \
  ghcr.io/ravenscroftj/turbopilot:latest

Docker and CUDA

As of release v0.0.5 turbocode now supports CUDA inference. In order to run the cuda-enabled container you will need to have nvidia-docker enabled, use the cuda tagged versions and pass in --gpus=all to docker with access to your GPU like so:

docker run --gpus=all --rm -it \
  -v ./models:/models \
  -e THREADS=6 \
  -e MODEL_TYPE=starcoder \
  -e MODEL="/models/santacoder-q4_0.bin" \
  -e GPU_LAYERS=32 \
  -p 18080:18080 \
  ghcr.io/ravenscroftj/turbopilot:v0.2.0-cuda11-7

If you have a big enough GPU then setting GPU_LAYERS will allow turbopilot to fully offload computation onto your GPU rather than copying data backwards and forwards, dramatically speeding up inference.

Swap ghcr.io/ravenscroftj/turbopilot:v0.1.0-cuda11 for ghcr.io/ravenscroftj/turbopilot:v0.2.0-cuda12-0 or ghcr.io/ravenscroftj/turbopilot:v0.2.0-cuda12-2 if you are using CUDA 12.0 or 12.2 respectively.

You will need CUDA 11 or CUDA 12 later to run this container. You should be able to see /app/turbopilot listed when you run nvidia-smi.

Executable and CUDA

As of v0.0.5 a CUDA version of the linux executable is available - it requires that libcublas 11 be installed on the machine - I might build ubuntu debs at some point but for now running in docker may be more convenient if you want to use a CUDA GPU.

You can use GPU offloading via the --ngl option.

🌐 Using the API

Support for the official Copilot Plugin

Support for the official VS Code copilot plugin is underway (See ticket #11). The API should now be broadly compatible with OpenAI.

Using the API with FauxPilot Plugin

To use the API from VSCode, I recommend the vscode-fauxpilot plugin. Once you install it, you will need to change a few settings in your settings.json file.

Open settings (CTRL/CMD + SHIFT + P) and select Preferences: Open User Settings (JSON)
Add the following values:

{
    ... // other settings

    "fauxpilot.enabled": true,
    "fauxpilot.server": "http://localhost:18080/v1/engines",
}

Now you can enable fauxpilot with CTRL + SHIFT + P and select Enable Fauxpilot

The plugin will send API calls to the running codegen-serve process when you make a keystroke. It will then wait for each request to complete before sending further requests.

Calling the API Directly

You can make requests to http://localhost:18080/v1/engines/codegen/completions which will behave just like the same Copilot endpoint.

For example:

curl --request POST \
  --url http://localhost:18080/v1/engines/codegen/completions \
  --header 'Content-Type: application/json' \
  --data '{
 "model": "codegen",
 "prompt": "def main():",
 "max_tokens": 100
}'

Should get you something like this:

{
 "choices": [
  {
   "logprobs": null,
   "index": 0,
   "finish_reason": "length",
   "text": "\n  \"\"\"Main entry point for this script.\"\"\"\n  logging.getLogger().setLevel(logging.INFO)\n  logging.basicConfig(format=('%(levelname)s: %(message)s'))\n\n  parser = argparse.ArgumentParser(\n      description=__doc__,\n      formatter_class=argparse.RawDescriptionHelpFormatter,\n      epilog=__doc__)\n  "
  }
 ],
 "created": 1681113078,
 "usage": {
  "total_tokens": 105,
  "prompt_tokens": 3,
  "completion_tokens": 102
 },
 "object": "text_completion",
 "model": "codegen",
 "id": "01d7a11b-f87c-4261-8c03-8c78cbe4b067"
}

👉 Known Limitations

Currently Turbopilot only supports one GPU device at a time (it will not try to make use of multiple devices).

👏 Acknowledgements

This project would not have been possible without Georgi Gerganov's work on GGML and llama.cpp
It was completely inspired by fauxpilot which I did experiment with for a little while but wanted to try to make the models work without a GPU
The frontend of the project is powered by Venthe's vscode-fauxpilot plugin
The project uses the Salesforce Codegen models.
Thanks to Moyix for his work on converting the Salesforce models to run in a GPT-J architecture. Not only does this confer some speed benefits but it also made it much easier for me to port the models to GGML using the existing gpt-j example code
The model server uses CrowCPP to serve suggestions.
Check out the original scientific paper for CodeGen for more info.