FActScore
:warning: This is a fork of shmsw25/FActScore with three modifications:
- We add the functionality to use provided context documents directly, skipping the retrieval stage.
- We assume `topic` is not always available. When `topic` is not available, we start the prompt with `"Answer the question based on the given context."` instead of `"Answer the question about {topic} based on the given context."`.
- The `factscore.factscorer` module now saves the results (including sample scores) if `--result_save_path` is set.
This is the official release accompanying our preprint, FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. FActScore is available as a PIP package as well.
If you find FActScore useful, please cite:
@article{ factscore,
title={ {FActScore}: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation },
author={ Min, Sewon and Krishna, Kalpesh and Lyu, Xinxi and Lewis, Mike and Yih, Wen-tau and Koh, Pang Wei and Iyyer, Mohit and Zettlemoyer, Luke and Hajishirzi, Hannaneh },
year={ 2023 },
journal={ arXiv preprint arXiv:2305.14251 },
url={ https://arxiv.org/abs/2305.14251 }
}
Install
Make a new Python 3.7+ environment using `virtualenv` or `conda`.
pip install --upgrade factscore
python -m spacy download en_core_web_sm
Download the data
python -m factscore.download_data --llama_7B_HF_path "llama-7B"
This command does the following.
- Download the knowledge source and example data.
- Take the LLAMA 7B model and reconstruct Inst-LLAMA. This requires access to the HuggingFace weights of the LLAMA-7B model, whose path is passed via the `--llama_7B_HF_path` flag. Follow this guide to obtain those weights. Skip `--llama_7B_HF_path` if you would only like to use the ChatGPT version of FActScore.
Optional flags:
- `--data_dir`: directory to store the knowledge source and example data. `.cache/factscore` by default.
- `--model_dir`: directory to store Inst-LLAMA weights. `.cache/factscore` by default.
Troubleshooting:
- If you get an `ERROR 429: Too Many Requests` error while downloading the DB file, please download the DB from this Google Drive link and place it under `--data_dir` (`.cache/factscore` by default).
- If everything else fails, consider downloading the files manually from this link and placing them in `--data_dir` and `--model_dir`; see `factscore/download_data.py` for more details.
Running FActScore using a command line
We expect running FActScore to cost about $1 in API fees per 100 sentences. For instance, if you have 100 generations, each with 5 sentences on average (500 sentences in total), it costs about $5.
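As a rough back-of-the-envelope check, the arithmetic behind that estimate can be sketched as below. The helper name `estimate_cost_usd` and its parameters are hypothetical and not part of the FActScore package; the $1 per 100 sentences rate is the approximate figure quoted above, not a guaranteed price.

```python
def estimate_cost_usd(num_generations, avg_sentences, usd_per_100_sentences=1.0):
    """Rough API cost estimate: (total sentences / 100) * rate.

    Hypothetical helper for illustration only; the default rate is the
    approximate figure from this README.
    """
    total_sentences = num_generations * avg_sentences
    return total_sentences / 100 * usd_per_100_sentences


# 100 generations with 5 sentences each -> 500 sentences -> about $5
print(estimate_cost_usd(100, 5))  # 5.0
```

The `--cost_estimate` flag described below gives the tool's own estimate before any API call is made, so this sketch is only useful for planning ahead.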
python -m factscore.factscorer --input_path {input_path} --model_name {estimator_name} --openai_key {openai_key}
- `--input_path` can be something like `data/unlabeled/InstructGPT.jsonl`. It should be a `.jsonl` file where each line contains `topic` (a topic entity that corresponds to the Wikipedia title) and `output` (a generation from the model).
- `--model_name`: `retrieval+ChatGPT` and `retrieval+llama+npm` (You can also use `retrieval+ChatGPT+npm` or `retrieval+llama` but we recommend the former two.)
- `--openai_key`: File containing OpenAI API Key.
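To sanity-check an input file before spending API budget, a small loader along these lines may help. This is a minimal sketch: `load_generations` is a hypothetical helper, not a FActScore function, and the example records are made up. It only relies on the `topic` and `output` fields described above.

```python
import json
import tempfile
from pathlib import Path


def load_generations(path):
    """Read a .jsonl file and return (topics, generations) lists.

    Hypothetical helper: raises ValueError if a line lacks `topic` or `output`.
    """
    topics, generations = [], []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = {"topic", "output"} - set(record)
            if missing:
                raise ValueError(f"line {line_no}: missing {sorted(missing)}")
            topics.append(record["topic"])
            generations.append(record["output"])
    return topics, generations


# Build a tiny example file (contents are made up for illustration).
example_path = Path(tempfile.mkdtemp()) / "example.jsonl"
records = [
    {"topic": "Ada Lovelace", "output": "Ada Lovelace was an English mathematician."},
    {"topic": "Alan Turing", "output": "Alan Turing was a British computer scientist."},
]
example_path.write_text(
    "".join(json.dumps(r) + "\n" for r in records), encoding="utf-8"
)

topics, generations = load_generations(example_path)
print(topics)
```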
Optional flags:
- `--data_dir`: Directory containing knowledge source, etc. `.cache/factscore` by default.
- `--model_dir`: Directory containing Inst-LLAMA weights. Skip if your `model_name` doesn't include `llama`. `.cache/factscore` by default.
- `--cache_dir`: Directory containing cache from API/models. `.cache/factscore` by default.
- `--use_atomic_facts`: If specified, it uses model-generated atomic facts released as part of our data instead of running the atomic fact generator. This will allow reproducing our results with no (or little if it still uses ChatGPT) cost. You can't specify it if you are running new model generations.
- `--n_samples`: If specified, it runs the model on a subset of the data.
- `--verbose`: If specified, it shows the progress bar.
- `--print_rate_limit_error`: If specified, it prints out rate limit errors from OpenAI API.
- `--cost_estimate`: This flag decides the type of OpenAI API cost estimation that we provide before calling it. It can be `"consider_cache"` (default) or `"ignore_cache"`.
Additional flags added in this fork
- `--ignore_topics`: Do not use the `topic` field in `--input_path`. `--use_passages` must be used in combination with this flag.
- `--use_passages`: Use the `passages` field in `--input_path` as the context documents (this skips the retrieval stage). `passages` should be an array of `{"title": "some title", "text": "body of the document"}` (the `title` field is optional).
- `--result_save_path`: Save the results, including sample scores, to `--result_save_path` as serialized JSON.
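For illustration, one input line carrying its own context documents might be built as follows. This is a sketch under assumptions: `make_record` is a hypothetical helper, the example text is made up, and omitting the `topic` key entirely (rather than leaving it empty) when `--ignore_topics` is set is an assumption about how the fork reads the file.

```python
import json


def make_record(output, passages, topic=None):
    """Build one .jsonl line with a `passages` array.

    Hypothetical helper. Each passage needs `text`; `title` is optional.
    Leaving out `topic` altogether is an assumption, see the note above.
    """
    for passage in passages:
        if "text" not in passage:
            raise ValueError("each passage needs a 'text' field")
    record = {"output": output, "passages": passages}
    if topic is not None:
        record["topic"] = topic
    return json.dumps(record)


line = make_record(
    output="The Eiffel Tower is in Paris.",
    passages=[
        {"title": "Eiffel Tower", "text": "The Eiffel Tower is a tower in Paris, France."},
        {"text": "A passage without a title."},
    ],
)
print(line)
```

Writing one such line per generation to a `.jsonl` file gives a file that can be passed as `--input_path` together with `--use_passages`.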
This command uses the English Wikipedia from 2023/04/01 as a knowledge source. See this section to use your own database as a knowledge source!
To evaluate your own LM
There are two sets of prompt entities, `data/labeled/prompt_entities.txt` (183 entities) and `data/unlabeled/prompt_entities.txt` (500 entities). Each line contains the name of a person (which is also the corresponding Wikipedia title). You can use the labeled version if you want to be compatible with the data under `data/labeled` (Section 3 and Section 4.2 in the paper), and use the unlabeled version if you want to be compatible with the data under `data/unlabeled` (Section 4.3 in the paper).
You can prompt your LM with your own prompt (we used `Question: Tell me a bio of <entity>.`) and use the following code.
from factscore.factscorer import FactScorer
fs = FactScorer(openai_key="...")
# topics: list of strings (human entities used to generate bios)
# generations: list of strings (model generations)
out = fs.get_score(topics, generations)
print(out["score"])  # FActScore
print(out["respond_ratio"])  # % of responding (not abstaining from answering)
print(out["num_facts_per_response"])  # average number of atomic facts per response
Alternatively, you can create a `.jsonl` file, where each line has `topic` (entity name, exactly the same as the one from the `.txt` file) and `output` (generation from LM), and then use the command line above.
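For either route, the `topics` and `generations` lists have to be prepared first. A minimal sketch of that step, using the prompt quoted above, is shown here. `build_prompts` and `my_lm_generate` are hypothetical names; the latter is a placeholder that you would replace with a call to your own LM.

```python
PROMPT_TEMPLATE = "Question: Tell me a bio of {entity}."


def build_prompts(entity_lines):
    """Turn lines from a prompt_entities.txt file into (topics, prompts)."""
    topics = [line.strip() for line in entity_lines if line.strip()]
    prompts = [PROMPT_TEMPLATE.format(entity=topic) for topic in topics]
    return topics, prompts


def my_lm_generate(prompt):
    """Hypothetical stand-in for your own LM call; replace with real code."""
    return f"(model output for: {prompt})"


# In practice these lines would come from one of the prompt_entities.txt files.
entity_lines = ["Ada Lovelace\n", "Alan Turing\n"]
topics, prompts = build_prompts(entity_lines)
generations = [my_lm_generate(p) for p in prompts]
print(prompts[0])  # Question: Tell me a bio of Ada Lovelace.
```

Keeping `topics` and `generations` in the same order matters, since each generation is scored against the knowledge source entry for its topic.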
We recommend using (A) `FactScorer(model_name="retrieval+ChatGPT")` (default) or (B) `FactScorer(model_name="retrieval+llama+npm")`. They have 0.99 Pearson correlation. Here are results of a range of models, which you can easily reproduce through these command lines.
Model | % respond | # facts | FActScore from (A) | FActScore from (B) |
---|---|---|---|---|
GPT-4 | 88.2 | 60.8 | 73.1 | 59.9 |
ChatGPT | 84.2 | 37.0 | 71.6 | 60.4 |
Alpaca 65B | 100.0 | 17.1 | 55.6 | 46.3 |
InstructGPT | 99.8 | 27.7 | 52.8 | 41.7 |
Alpaca 13B | 100.0 | 16.6 | 47.7 | 40.3 |
Vicuna 13B | 76.6 | 50.9 | 46.6 | 40.7 |
Alpaca 7B | 100.0 | 17.4 | 39.7 | 36.5 |
Vicuna 7B | 91.0 | 45.6 | 38.9 | 36.9 |
MPT Chat 7B | 88.8 | 37.3 | 30.1 | 27.9 |
Oasst Pythia 12B | 100.0 | 39.7 | 25.1 | 20.8 |
Dolly 12B | 100.0 | 24.6 | 21.7 | 17.1 |
StableLM tuned 7B | 66.6 | 38.0 | 17.3 | 16.3 |
`% respond` (% of responding instead of abstaining from answering) and `# facts` (# of atomic facts per valid response) indicate "factual recall" (how many pieces of information the model gives) and FActScore indicates "factual precision" (how accurate each piece of information the model gives is).
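To see how closely the two estimators track each other, the Pearson correlation between the two FActScore columns can be recomputed from the table. This is a sketch using only the twelve rows listed above; whether the reported 0.99 was computed over exactly these rows is an assumption.

```python
import math

# FActScore columns copied from the table above, in row order.
scores_a = [73.1, 71.6, 55.6, 52.8, 47.7, 46.6, 39.7, 38.9, 30.1, 25.1, 21.7, 17.3]
scores_b = [59.9, 60.4, 46.3, 41.7, 40.3, 40.7, 36.5, 36.9, 27.9, 20.8, 17.1, 16.3]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson(scores_a, scores_b)
print(round(r, 2))
```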
To use a custom knowledge source
By default, FActScore uses a Wikipedia dump from 2023/04/01. But you can also use your own knowledge source!
The knowledge source should be ready in `.jsonl` format, where each line is a dictionary containing `title` and `text`. `text` can either be a string or a list of strings (e.g., sections).
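A file in this format could be produced along the following lines. This is a minimal sketch: the documents, the file name, and the temporary directory are made up for illustration, and only the `title` and `text` fields come from the description above.

```python
import json
import tempfile
from pathlib import Path

# Made-up documents for illustration; `text` is a string in the first
# record and a list of strings (one per section) in the second.
documents = [
    {"title": "Example Person", "text": "Example Person is a fictional biologist."},
    {"title": "Another Person", "text": ["Early life section.", "Career section."]},
]

kb_path = Path(tempfile.mkdtemp()) / "my_knowledge_source.jsonl"
with open(kb_path, "w", encoding="utf-8") as f:
    for doc in documents:
        # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")

# Read it back to confirm there is one JSON object per line.
loaded = [
    json.loads(line)
    for line in kb_path.read_text(encoding="utf-8").splitlines()
]
print(len(loaded))  # 2
```

The resulting path is what you would pass as `data_path` when registering the knowledge source.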
from factscore.factscorer import FactScorer

fs = FactScorer()

# this will create a database using your file
# for English Wikipedia (18GB), it takes ~8 hours
# once the DB file is created, you can reuse it by only specifying `db_path`
fs.register_knowledge_source(name_of_your_knowledge_source,
                             data_path=path_to_jsonl_file,
                             db_path=path_to_output_db_file)

# now, when you compute a score, specify the knowledge source to use
out = fs.get_score(topics, generations, knowledge_source=name_of_your_knowledge_source)
print(out["score"])  # FActScore
print(out["respond_ratio"])  # % of responding (not abstaining from answering)
print(out["num_facts_per_response"])  # average number of atomic facts per response