Petals Chat

A chatbot web app + HTTP and WebSocket endpoints for LLM inference with the Petals client

Interactive Chat

<div align="center"> <img src="https://i.imgur.com/QVTzc6u.png" width="600px"> </div>

You can try it out at https://chat.petals.dev or run the backend on your server using these commands:

git clone https://github.com/petals-infra/chat.petals.dev.git
cd chat.petals.dev
pip install -r requirements.txt
flask run --host=0.0.0.0 --port=5000

🦙 Want to serve Llama 2? Request access to its weights at the ♾️ Meta AI website and 🤗 Model Hub, then run huggingface-cli login in the terminal before starting the web app. If you don't want Llama 2, just remove the meta-llama models from config.py.

🦄 Deploying with Gunicorn. In production, we recommend using gunicorn instead of the Flask dev server:

gunicorn app:app --bind 0.0.0.0:5000 --worker-class gthread --threads 100 --timeout 1000

The chat uses the WebSocket API under the hood.

APIs

The backend provides two API endpoints: the WebSocket API (/api/v2/generate) and the HTTP API (/api/v1/...).

Please use the WebSocket API when possible: it is much faster, more powerful, and consumes fewer resources.

If you develop your own web app, you can use our endpoint at https://chat.petals.dev/api/... for research and development, then set up your own backend for production using the commands above.

Note: We do not recommend using the endpoint at https://chat.petals.dev/api/... in production. It has limited throughput, and we may pause or stop it at any time.

<details> <summary><b>Endpoint's system requirements</b></summary>
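<table>
<tr><th>Model family</th><th>Embeds in 16-bit</th><th>Embeds in 32-bit</th></tr>
<tr><td>Llama 2 (70B, 70B-Chat), Llama-65B, Guanaco-65B</td><td>1.05 GB</td><td>2.1 GB</td></tr>
<tr><td>BLOOM-176B, BLOOMZ-176B</td><td>7.19 GB</td><td>14.38 GB</td></tr>
</table>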
</details>

WebSocket API (/api/v2/generate)

To use this API, open a WebSocket connection and exchange JSON-encoded requests and responses. You can do this from any programming language.

<details> <summary><b>Example code (JavaScript)</b></summary>

This code opens an inference session with the stabilityai/StableBeluga2 model, sends the prompt "A cat sat on", and samples new tokens until the total length reaches 30 tokens. Sampling is done with temperature = 0.6 and top_p = 0.9.

const ws = new WebSocket(`wss://chat.petals.dev/api/v2/generate`);
ws.onopen = () => {
    const prompt = "A cat sat on";
    const maxLength = 30;
    ws.send(JSON.stringify({
        type: "open_inference_session", model: "stabilityai/StableBeluga2", max_length: maxLength
    }));
    ws.send(JSON.stringify({
        type: "generate", inputs: prompt, max_length: maxLength, do_sample: 1, temperature: 0.6, top_p: 0.9
    }));
    ws.onmessage = event => {
        const response = JSON.parse(event.data);
        if (response.ok) {
            if (response.outputs === undefined) {
                console.log("Session opened, generating...");
            } else {
                console.log("Generated: " + prompt + response.outputs);
                ws.close();
            }
        } else {
            console.log("Error: " + response.traceback);
            ws.close();
        }
    };
};
</details>

🐍 Using Python on Linux/macOS? Please consider running the native Petals client instead. This way, you can connect to the swarm directly (without this API endpoint) and even run fine-tuning.

The requests must follow this protocol:

open_inference_session

The first request must be of type open_inference_session and include these parameters:

Notes:

Request:

{type: "open_inference_session", max_length: 1024}

Response:

{ok: true}  // If successful
{ok: false, traceback: "..."}  // If failed
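
The request/response pair above can be wrapped in two small helpers. This is only a sketch: the function names buildOpenSessionRequest and describeSessionResponse are hypothetical, the positive-integer check on max_length is an assumption rather than a documented server rule, and the model field is included because the JavaScript example earlier in this README sends it.

```javascript
// Hypothetical helper: builds the first message of a WebSocket session.
function buildOpenSessionRequest(model, maxLength) {
    // Assumption: max_length is a positive integer number of tokens.
    if (!Number.isInteger(maxLength) || maxLength <= 0) {
        throw new RangeError("max_length must be a positive integer");
    }
    return JSON.stringify({
        type: "open_inference_session",
        model: model,
        max_length: maxLength,
    });
}

// Hypothetical helper: interprets the server's reply to the request above.
// Success looks like {ok: true}; failure like {ok: false, traceback: "..."}.
function describeSessionResponse(rawMessage) {
    const response = JSON.parse(rawMessage);
    if (response.ok) {
        return { opened: true, error: null };
    }
    return { opened: false, error: response.traceback ?? "unknown error" };
}
```

With an open socket, this could be used as ws.send(buildOpenSessionRequest("stabilityai/StableBeluga2", 1024)), followed by describeSessionResponse(event.data) on the first incoming message.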

generate

The next requests must be of type generate and include the same parameters as in the /api/v1/generate HTTP API. In contrast to the HTTP API, you can use this API in a streaming fashion, generating a response token-by-token and accepting intermediate prompts from a user (e.g., to make a chatbot).

A new feature of the WebSocket API is the stop_sequence parameter (str, optional). If you set it, the server will continue generation with the same parameters until it generates the stop_sequence, so you may get multiple responses without having to send the request again and wait for the round trip's latency.

Intermediate responses contain the field stop: false, and the last response contains stop: true. For example, you can set max_new_tokens: 1 and receive tokens one by one, as soon as they are generated. Check out the chat's frontend code for a detailed example of how to do that.

Request:

{type: "generate", inputs: "A cat in French is \"", max_new_tokens: 3}

Response (one or multiple):

{ok: true, outputs: "chat\".", stop: true}  // If successful
{ok: false, traceback: "..."}  // If failed
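
When stop_sequence is set, or max_new_tokens is small, one generate request may produce several responses. A minimal sketch of a message handler that accumulates them until stop: true arrives might look like this. The name createStreamCollector and its three callbacks are hypothetical and not part of the API; the sketch also assumes, as in the JavaScript example above, that a successful response without an outputs field is the reply to open_inference_session.

```javascript
// Hypothetical helper: collects streamed "generate" responses into one string.
// onToken(chunk) runs for every partial output, onDone(fullText) once the
// server sends stop: true, and onError(traceback) if a response has ok: false.
function createStreamCollector(onToken, onDone, onError) {
    let text = "";
    let finished = false;
    return function handleMessage(rawMessage) {
        if (finished) return; // ignore anything after completion or failure
        const response = JSON.parse(rawMessage);
        if (!response.ok) {
            finished = true;
            onError(response.traceback);
            return;
        }
        // Assumption: no `outputs` field means "session opened", not a token.
        if (response.outputs === undefined) return;
        text += response.outputs;
        onToken(response.outputs);
        if (response.stop) {
            finished = true;
            onDone(text);
        }
    };
}

// Possible wiring with an open socket (not executed here):
//   const handleMessage = createStreamCollector(console.log, console.log, console.error);
//   ws.onmessage = event => handleMessage(event.data);
```

Keeping the handler free of socket code makes it easy to exercise with recorded messages before pointing it at a live endpoint.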

HTTP API (/api/v1/...)

POST /api/v1/generate

Parameters:

Generation parameters (compatible with .generate() from 🤗 Transformers):

Notes:

Returns (JSON):

Example (curl):

$ curl -X POST "https://chat.petals.dev/api/v1/generate" -d "model=meta-llama/Llama-2-70b-chat-hf" -d "inputs=Once upon a time," -d "max_new_tokens=20"
{"ok":true,"outputs":" there was a young woman named Sophia who lived in a small village nestled in the rolling hills"}
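
The same request can be sent from JavaScript. The sketch below assumes a runtime with global fetch and URLSearchParams (modern browsers, Node.js 18+). The helper names buildGenerateBody and generate are hypothetical, and the traceback field on a failed HTTP response is an assumption carried over from the WebSocket responses shown earlier.

```javascript
// Hypothetical helper: encodes parameters as form fields, like `curl -d`.
function buildGenerateBody(params) {
    const body = new URLSearchParams();
    for (const [name, value] of Object.entries(params)) {
        body.append(name, String(value));
    }
    return body;
}

// Hypothetical helper: POSTs to /api/v1/generate and returns the generated text.
// A URLSearchParams body makes fetch send it as
// application/x-www-form-urlencoded.
async function generate(params, baseUrl = "https://chat.petals.dev") {
    const response = await fetch(`${baseUrl}/api/v1/generate`, {
        method: "POST",
        body: buildGenerateBody(params),
    });
    const result = await response.json();
    if (!result.ok) {
        // Assumption: failures carry a `traceback` field, as in the WebSocket API.
        throw new Error(result.traceback ?? "generation failed");
    }
    return result.outputs;
}

// Possible usage (performs a network request, so it is not run here):
//   generate({
//       model: "meta-llama/Llama-2-70b-chat-hf",
//       inputs: "Once upon a time,",
//       max_new_tokens: 20,
//   }).then(console.log);
```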