promptfoo: test your LLM app locally

promptfoo is a tool for testing, evaluating, and red-teaming LLM apps.

With promptfoo, you can:

The goal: test-driven LLM development instead of trial-and-error.

npx promptfoo@latest init

» View full documentation «

promptfoo produces matrix views that let you quickly evaluate outputs across many prompts and inputs:

prompt evaluation matrix - web viewer

It works on the command line too:

Prompt evaluation

It also produces high-level vulnerability and risk reports:

gen ai red team

Why choose promptfoo?

There are many different ways to evaluate prompts. Here are some reasons to consider promptfoo:

Workflow

Start by establishing a handful of test cases - core use cases and failure cases that you want to ensure your prompt can handle.

As you explore modifications to the prompt, use promptfoo eval to rate all outputs. This ensures the prompt is actually improving overall.

As you collect more examples and establish a user feedback loop, continue to build the pool of test cases.

<img width="772" alt="LLM ops" src="https://github.com/promptfoo/promptfoo/assets/310310/cf0461a7-2832-4362-9fbb-4ebd911d06ff">

Usage - evals

To get started, run this command:

npx promptfoo@latest init

This will create a promptfooconfig.yaml placeholder in your current directory.

After editing the prompts and variables to your liking, run the eval command to kick off an evaluation:

npx promptfoo@latest eval

Usage - red teaming/pentesting

Run this command:

npx promptfoo@latest redteam init

This will ask you questions about what types of vulnerabilities you want to find and walk you through running your first scan.

Configuration

The YAML configuration format runs each prompt through a series of example inputs (aka "test cases") and checks whether the outputs meet requirements (aka "asserts").

See the Configuration docs for a detailed guide.

prompts:
  - file://prompt1.txt
  - file://prompt2.txt
providers:
  - openai:gpt-4o-mini
  - ollama:llama3.1:70b
tests:
  - description: 'Test translation to French'
    vars:
      language: French
      input: Hello world
    assert:
      - type: contains-json
      - type: javascript
        value: output.length < 100

  - description: 'Test translation to German'
    vars:
      language: German
      input: How's it going?
    assert:
      - type: llm-rubric
        value: does not describe self as an AI, model, or chatbot
      - type: similar
        value: was geht
        threshold: 0.6 # cosine similarity
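To make the prompt/provider/test matrix concrete, here is a minimal sketch of how a config like the one above fans out into individual evaluation cells. It is an illustration only, not promptfoo's internal implementation: `renderPrompt` and `expandMatrix` are hypothetical helper names, the prompt strings are made up, and the `{{name}}` substitution is a simplified stand-in for real templating.

```javascript
// Illustrative only: a simplified model of the evaluation matrix,
// not promptfoo's internal implementation.

// Replace {{name}} placeholders with values from vars (hypothetical helper).
// Unknown placeholders are left untouched.
function renderPrompt(template, vars) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    name in vars ? String(vars[name]) : match,
  );
}

// Build one cell per (test, prompt, provider) combination.
function expandMatrix({ prompts, providers, tests }) {
  const cells = [];
  for (const test of tests) {
    for (const prompt of prompts) {
      for (const provider of providers) {
        cells.push({
          provider,
          prompt: renderPrompt(prompt, test.vars),
          assert: test.assert ?? [],
        });
      }
    }
  }
  return cells;
}

const cells = expandMatrix({
  // Made-up inline prompts standing in for prompt1.txt and prompt2.txt.
  prompts: ['Translate to {{language}}: {{input}}', 'Say in {{language}}: {{input}}'],
  providers: ['openai:gpt-4o-mini', 'ollama:llama3.1:70b'],
  tests: [
    { vars: { language: 'French', input: 'Hello world' } },
    { vars: { language: 'German', input: "How's it going?" } },
  ],
});

console.log(cells.length); // 2 tests x 2 prompts x 2 providers = 8
console.log(cells[0].prompt); // Translate to French: Hello world
```

Each cell corresponds to one model call whose output would then be checked against that test's assertions.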

Supported assertion types

See Test assertions for full details.

Deterministic eval metrics

| Assertion Type | Returns true if... |
| --- | --- |
| equals | output matches exactly |
| contains | output contains substring |
| icontains | output contains substring, case insensitive |
| regex | output matches regex |
| starts-with | output starts with string |
| contains-any | output contains any of the listed substrings |
| contains-all | output contains all of the listed substrings |
| icontains-any | output contains any of the listed substrings, case insensitive |
| icontains-all | output contains all of the listed substrings, case insensitive |
| is-json | output is valid JSON (optional JSON schema validation) |
| contains-json | output contains valid JSON (optional JSON schema validation) |
| is-sql | output is valid SQL |
| contains-sql | output contains valid SQL |
| is-xml | output is valid XML |
| contains-xml | output contains valid XML |
| is-refusal | output indicates the model refused to perform the task |
| javascript | provided JavaScript function validates the output |
| python | provided Python function validates the output |
| webhook | provided webhook returns {pass: true} |
| rouge-n | Rouge-N score is above a given threshold (default 0.75) |
| bleu | BLEU score is above a given threshold (default 0.5) |
| levenshtein | Levenshtein distance is below a threshold |
| latency | Latency is below a threshold (milliseconds) |
| perplexity | Perplexity is below a threshold |
| perplexity-score | Normalized perplexity |
| cost | Cost is below a threshold (for models with cost info such as GPT) |
| is-valid-openai-function-call | Ensure that the function call matches the function's JSON schema |
| is-valid-openai-tools-call | Ensure that all tool calls match the tools JSON schema |
| assert-set | Group assertions together with optional threshold |
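As an example of what a deterministic metric computes, the sketch below implements the standard Levenshtein edit distance and a threshold check around it. This is a generic textbook version, not promptfoo's code; `withinDistance` is a hypothetical helper, and treating the threshold as inclusive is an assumption.

```javascript
// Classic dynamic-programming edit distance, keeping only one row at a time.
function levenshtein(a, b) {
  // prev[j] = distance between the empty prefix of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical check: pass when the distance does not exceed the threshold
// (inclusive boundary is an assumption).
function withinDistance(output, expected, threshold) {
  return levenshtein(output, expected) <= threshold;
}

console.log(levenshtein('kitten', 'sitting')); // 3
console.log(withinDistance('Bonjour le monde', 'Bonjour, le monde', 2)); // true
```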

Model-assisted eval metrics

| Assertion Type | Method |
| --- | --- |
| similar | Embeddings and cosine similarity are above a threshold |
| classifier | Run LLM output through a classifier |
| llm-rubric | LLM output matches a given rubric, using a Language Model to grade output |
| answer-relevance | Ensure that LLM output is related to original query |
| context-faithfulness | Ensure that LLM output uses the context |
| context-recall | Ensure that ground truth appears in context |
| context-relevance | Ensure that context is relevant to original query |
| factuality | LLM output adheres to the given facts, using Factuality method from OpenAI eval |
| model-graded-closedqa | LLM output adheres to given criteria, using Closed QA method from OpenAI eval |
| moderation | Make sure outputs are safe |
| select-best | Compare multiple outputs for a test case and pick the best one |
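The similar assertion compares embeddings by cosine similarity. The sketch below shows only the similarity calculation and threshold comparison, assuming the embedding vectors have already been obtained; the tiny vectors here are made up for illustration, and `isSimilar` is a hypothetical helper rather than a promptfoo function.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical check: pass when similarity meets the threshold.
function isSimilar(outputEmbedding, expectedEmbedding, threshold) {
  return cosineSimilarity(outputEmbedding, expectedEmbedding) >= threshold;
}

// Made-up 2-dimensional "embeddings"; real ones have many more dimensions.
console.log(isSimilar([1, 1], [1, 0], 0.6)); // true (similarity is about 0.707)
console.log(isSimilar([1, 0], [0, 1], 0.6)); // false (orthogonal vectors)
```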

Every test type can be negated by prepending not-. For example, not-equals or not-regex.
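One way to picture the not- prefix is as a wrapper that inverts the result of the underlying check. The following is a rough sketch under that assumption, covering only a handful of simple assertion types; the `checks` table and `runAssertion` function are hypothetical names, not promptfoo internals.

```javascript
// A few simple checks, keyed by assertion type (illustrative subset).
const checks = {
  equals: (output, value) => output === value,
  contains: (output, value) => output.includes(value),
  icontains: (output, value) => output.toLowerCase().includes(value.toLowerCase()),
  'starts-with': (output, value) => output.startsWith(value),
  regex: (output, value) => new RegExp(value).test(output),
};

// Strip an optional "not-" prefix, run the base check, and invert if negated.
function runAssertion(output, { type, value }) {
  const negated = type.startsWith('not-');
  const baseType = negated ? type.slice('not-'.length) : type;
  const check = checks[baseType];
  if (!check) {
    throw new Error(`Unknown assertion type: ${type}`);
  }
  const passed = check(output, value);
  return negated ? !passed : passed;
}

const output = 'Bonjour le monde';
console.log(runAssertion(output, { type: 'contains', value: 'monde' })); // true
console.log(runAssertion(output, { type: 'not-contains', value: 'monde' })); // false
console.log(runAssertion(output, { type: 'not-regex', value: '^Hello' })); // true
```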

Tests from spreadsheet

Some people prefer to configure their LLM tests in a CSV. In that case, the config is pretty simple:

prompts:
  - file://prompts.txt
providers:
  - openai:gpt-4o-mini
tests: file://tests.csv

See example CSV.
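Conceptually, each CSV row becomes one test case. The sketch below assumes that every column header is a variable name and every cell is that variable's value; it uses a deliberately naive parser (no quoted fields or embedded commas) and a hypothetical `csvToTests` helper, so treat it as an illustration of the mapping rather than the loader promptfoo uses.

```javascript
// Convert simple CSV text into an array of { vars } test cases.
// Assumption: no quoted fields, no commas inside values.
function csvToTests(csvText) {
  const [headerLine, ...rows] = csvText.trim().split(/\r?\n/);
  const headers = headerLine.split(',').map((h) => h.trim());
  return rows
    .filter((row) => row.trim() !== '')
    .map((row) => {
      const values = row.split(',').map((v) => v.trim());
      const vars = {};
      headers.forEach((header, i) => {
        vars[header] = values[i] ?? ''; // missing cells become empty strings
      });
      return { vars };
    });
}

const tests = csvToTests(`language,input
French,Hello world
German,Good morning`);

console.log(tests.length); // 2
console.log(tests[0].vars); // { language: 'French', input: 'Hello world' }
```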

Command-line

If you're looking to customize your usage, you have a wide set of parameters at your disposal.

| Option | Description |
| --- | --- |
| -p, --prompts <paths...> | Paths to prompt files, directory, or glob |
| -r, --providers <name or path...> | One of: openai:chat, openai:completion, openai:model-name, localai:chat:model-name, localai:completion:model-name. See API providers |
| -o, --output <path> | Path to output file (csv, json, yaml, html) |
| --tests <path> | Path to external test file |
| -c, --config <paths> | Path to one or more configuration files. promptfooconfig.yaml is automatically loaded if present |
| -j, --max-concurrency <number> | Maximum number of concurrent API calls |
| --table-cell-max-length <number> | Truncate console table cells to this length |
| --prompt-prefix <path> | This prefix is prepended to every prompt |
| --prompt-suffix <path> | This suffix is appended to every prompt |
| --grader | Provider that will conduct the evaluation, if you are using an LLM to grade your output |

After running an eval, you may optionally use the view command to open the web viewer:

npx promptfoo view

Examples

Prompt quality

In this example, we evaluate whether adding adjectives to the personality of an assistant bot affects the responses:

npx promptfoo eval -p prompts.txt -r openai:gpt-4o-mini -t tests.csv

This command evaluates the prompts in prompts.txt, substituting the variable values from tests.csv, and outputs the results in your terminal.

You can also output a nice spreadsheet, JSON, YAML, or an HTML file:

Table output

Model quality

In the next example, we evaluate the difference between GPT-4o and GPT-4o mini outputs for a given prompt:

npx promptfoo eval -p prompts.txt -r openai:gpt-4o openai:gpt-4o-mini -o output.html

Produces this HTML table:

Side-by-side evaluation of LLM model quality, gpt-4o vs gpt-4o-mini, html output

Usage (node package)

You can also use promptfoo as a library in your project by importing the evaluate function. The function takes the following parameters:

Example

promptfoo exports an evaluate function that you can use to run prompt evaluations.

import promptfoo from 'promptfoo';

const results = await promptfoo.evaluate({
  prompts: ['Rephrase this in French: {{body}}', 'Rephrase this like a pirate: {{body}}'],
  providers: ['openai:gpt-4o-mini'],
  tests: [
    {
      vars: {
        body: 'Hello world',
      },
    },
    {
      vars: {
        body: "I'm hungry",
      },
    },
  ],
});

This code imports the promptfoo library, defines the evaluation options, and then calls the evaluate function with these options.

See the full example here, which includes an example results object.
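When test inputs come from data rather than being written by hand, the options object can be assembled programmatically before it is handed to evaluate. A small sketch, assuming the same option shape as the example above; `buildEvalOptions` is a hypothetical helper, and the evaluate call is left commented out because it requires the promptfoo package and provider credentials.

```javascript
// Build an options object in the same shape as the example above:
// one test case per input string, each exposing it as the `body` variable.
function buildEvalOptions(bodies) {
  return {
    prompts: ['Rephrase this in French: {{body}}', 'Rephrase this like a pirate: {{body}}'],
    providers: ['openai:gpt-4o-mini'],
    tests: bodies.map((body) => ({ vars: { body } })),
  };
}

const options = buildEvalOptions(['Hello world', "I'm hungry"]);
console.log(JSON.stringify(options.tests));
// [{"vars":{"body":"Hello world"}},{"vars":{"body":"I'm hungry"}}]

// With promptfoo installed and imported as in the example above:
// const results = await promptfoo.evaluate(options);
```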

Installation

Requires Node.js 18 or newer.

You can install promptfoo using npm, npx, Homebrew, or by cloning the repository.

npm (recommended)

Install promptfoo globally:

npm install -g promptfoo

Or install it locally in your project:

npm install promptfoo

npx

Run promptfoo without installing it:

npx promptfoo@latest init

This will create a promptfooconfig.yaml placeholder in your current directory.

Homebrew

If you prefer using Homebrew, you can install promptfoo with:

brew install promptfoo

From source

For the latest development version:

git clone https://github.com/promptfoo/promptfoo.git
cd promptfoo
npm install
npm run build
npm link

Verify installation

To verify that promptfoo is installed correctly, run:

promptfoo --version

This should display the version number of promptfoo.

For more detailed installation instructions, including system requirements and troubleshooting, please visit our installation guide.

API Providers

We support OpenAI's API as well as a number of open-source models. It's also possible to set up your own custom API provider. See Provider documentation for more details.

Development

Here's how to build and run locally:

git clone https://github.com/promptfoo/promptfoo.git
cd promptfoo

# Optionally use the Node.js version specified in the .nvmrc file - make sure you are on node >= 18
nvm use

npm i
cd path/to/experiment-with-promptfoo   # contains your promptfooconfig.yaml
npx path/to/promptfoo-source eval

The web UI is located in src/app. To run it in dev mode, run npm run local:app. This will host the web UI at http://localhost:3000. The web UI expects promptfoo view to be running separately.

Then run:

npm run build

The build has some side effects, such as copying HTML templates and migrations.

Contributions are welcome! Please feel free to submit a pull request or open an issue.

promptfoo includes several npm scripts to make development easier and more efficient. To use these scripts, run npm run <script_name> in the project directory.

Here are some of the available scripts:

To run the CLI during development you can run a command like: npm run local -- eval --config $(readlink -f ./examples/cloudflare-ai/chat_config.yaml), where any parts of the command after -- are passed through to our CLI entrypoint. Since the Next dev server isn't supported in this mode, see the instructions above for running the web server.

Adding a New Provider

  1. Create an implementation in src/providers/SOME_PROVIDER_FILE
  2. Update loadApiProvider in src/providers.ts to load your provider via string
  3. Add test cases in test/providers.test.ts
    1. Test the actual provider implementation
    2. Test loading the provider via a loadApiProvider test
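The "load your provider via string" step can be pictured as a lookup from a string prefix to a factory. The sketch below is purely hypothetical and does not reflect the real loadApiProvider in src/providers.ts: `registerProvider`, `loadProviderFromString`, the `echo` provider, and its `respond` method are all invented for illustration. It splits on the first colon so that model names containing colons (such as ollama:llama3.1:70b) stay intact.

```javascript
// Hypothetical registry mapping a provider prefix to a factory function.
const factories = new Map();

function registerProvider(prefix, factory) {
  factories.set(prefix, factory);
}

// Split "prefix:rest" on the FIRST colon only, then delegate to the factory.
function loadProviderFromString(id) {
  const sep = id.indexOf(':');
  const prefix = sep === -1 ? id : id.slice(0, sep);
  const rest = sep === -1 ? '' : id.slice(sep + 1);
  const factory = factories.get(prefix);
  if (!factory) {
    throw new Error(`Unknown provider: ${id}`);
  }
  return factory(rest);
}

// Invented example provider that simply echoes the prompt back.
registerProvider('echo', (model) => ({
  id: `echo:${model}`,
  respond: async (prompt) => ({ output: prompt }),
}));

const provider = loadProviderFromString('echo:demo:v1');
console.log(provider.id); // echo:demo:v1
```

A test for the loading step would then assert that a known string resolves to the expected provider and that an unknown prefix is rejected.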