Code for "Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation" (arXiv).
## Replicate Our Experiments
Packages you might need:

- `simple-disk-queue`: used to store and run tasks.
- `persist_to_disk`: used to cache experiment results (i.e. those `@ptd.persistf` decorators and `ptd.manual_cache` calls).
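Assuming both packages are published on PyPI under these names (worth verifying), installation could look like:

```bash
# Install the disk-queue and caching dependencies used below.
pip install simple-disk-queue persist-to-disk
```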
### Set the Paths
First, set the corresponding paths of "Step 1" in `_settings.py`.
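Purely as an illustration (the real variable names are the ones marked "Step 1" in `_settings.py`; those below are hypothetical), the edit amounts to pointing a few constants at local directories:

```python
# _settings.py -- hypothetical variable names for illustration only;
# edit the actual "Step 1" variables defined in the real file.
DATA_FOLDER = "/path/to/data"       # where the datasets live
OUTPUT_FOLDER = "/path/to/output"   # where generated responses are written
```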
### Generate the responses
Use `llama2-13b`, `gemma-7b`, or `mistralai/Mistral-7B-v0.1` for the model, and `coqa_new`, `triviaqa_new`, or `nq_open_new` for the dataset below.
```bash
python -m pipeline.generate --model llama2-13b --dataset coqa_new
```
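For example, to sweep all nine model/dataset pairs in one pass (a convenience loop, not a script shipped with the repo):

```bash
# Generate responses for every model/dataset combination listed above.
for model in llama2-13b gemma-7b mistralai/Mistral-7B-v0.1; do
  for dataset in coqa_new triviaqa_new nq_open_new; do
    python -m pipeline.generate --model "$model" --dataset "$dataset"
  done
done
```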
Update `GEN_PATHS` in `_settings.py` for the next steps. (You can find the exact generations we used in our paper here in "output".)
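The exact shape of `GEN_PATHS` is defined by the repo; as a hypothetical sketch only (the path below is a placeholder), it presumably maps each dataset/model pair to its generation file:

```python
# _settings.py -- hypothetical structure and placeholder path; mirror
# whatever format GEN_PATHS already uses in the file.
GEN_PATHS = {
    "coqa_new": {
        "llama2-13b": "/path/to/output/llama2-13b_coqa_new.pkl",
    },
    # ... one entry per model/dataset pair you generated ...
}
```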
### Caching/Computing Results
First, add all tasks to a queue on disk by running:

```bash
python -m scripts.dq_add
```
Then, run the actual computation via the following (in sequence). You can specify the device(s) to use via `-d [device_numbers]`:

```bash
python -m scripts.dq_work -q qAll_1 -d 1
python -m scripts.dq_work -q qAll_2 -d 1
python -m scripts.dq_work -q qMult -d 0,1,2  # This runs a 70B model, so it might require more GPUs
python -m scripts.dq_work -q qAPI  # This queue has only GPT API calls, so no GPU is needed
```
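Because `qAPI` needs no GPU, one convenient arrangement (a pattern of our own, not something the repo requires) is to let it run in the background while the GPU queues proceed:

```bash
# Run the API-only queue in the background; it does not compete for GPUs.
python -m scripts.dq_work -q qAPI &
python -m scripts.dq_work -q qAll_1 -d 1
python -m scripts.dq_work -q qAll_2 -d 1
python -m scripts.dq_work -q qMult -d 0,1,2
wait  # also wait for the background qAPI worker to finish
```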
### Downloading the Cache
The previous computation can be skipped by downloading our cache from the link in "persist_to_disk". First, run `python -m test` so that the `persist_to_disk` package automatically creates a cache folder that looks like `/path/persist_to_disk/cache/ContextSL-1/test`. Then, put all contents of "persist_to_disk cache" under `/path/persist_to_disk/cache/ContextSL-1`.
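Assuming the download unpacks into a local folder called `persist_to_disk cache` (the name used above), placing it could look like:

```bash
# Copy the downloaded cache into the folder created by `python -m test`.
# "/path" is the placeholder root from above; substitute your own.
cp -r "persist_to_disk cache"/* /path/persist_to_disk/cache/ContextSL-1/
```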
Once you have downloaded the cache, run `python -m scripts.dq_add` to confirm that all queues are empty.
### Optional But Recommended
After all queues have finished, you can optionally run the following to cache some summarization results:

```bash
python -m pipeline.uq
python -m scripts.cache
```
### Run the Notebooks
Now, you can run `notebook/demo.ipynb` (or the other notebooks).
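For instance, with a standard Jupyter installation:

```bash
# Open the demo notebook.
jupyter notebook notebook/demo.ipynb
```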