DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models

🤔 What is DCR-Consistency

DCR-Consistency is a novel framework that uses LLM agents to detect and mitigate inconsistencies (in other words, hallucinations). It takes advantage of the strength of LLMs in semantic understanding while circumventing known pitfalls such as their relatively poor performance in math. For more details, please see our paper.

Given a reference as the ground truth and a candidate to evaluate, it outputs a numeric score in the range [0, 1] indicating the candidate's consistency, where 0 means no sentence in the candidate is consistent with the reference and 1 means every sentence is. It also outputs a list of reasons explaining why this score was given. Better yet, based on those reasons, it can improve the candidate and mitigate the detected inconsistencies.
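As a rough illustration of how such a score can relate to sentence-level judgments, the sketch below assumes the score is simply the fraction of candidate sentences judged consistent. That is an assumption made for illustration only; the conversion the framework actually uses is described in the paper and may differ. The function name `consistency_score` is hypothetical and is not part of the package.

```python
from typing import Sequence


def consistency_score(sentence_is_consistent: Sequence[bool]) -> float:
    """Map per-sentence consistency decisions to a score in [0, 1].

    Assumption (not taken from the library): the score is the fraction
    of candidate sentences judged consistent with the reference.
    """
    if not sentence_is_consistent:
        raise ValueError("need at least one sentence decision")
    return sum(sentence_is_consistent) / len(sentence_is_consistent)


# Three candidate sentences, one of them flagged as inconsistent.
print(consistency_score([True, False, True]))
```

Under this assumption, a candidate with no consistent sentence scores 0 and a fully consistent candidate scores 1, matching the endpoints described above.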

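It is composed of three parts: a divide-conquer evaluator (DCE) that assesses the candidate against the reference and gives its reasons, an auto-metric converter (AMC) that turns the DCE output into the numeric score, and a reason-assisted improver (RAI) that uses those reasons to rewrite the candidate.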

😋 How well does DCR-Consistency work?

We evaluated the DCR-Consistency framework on a wide range of datasets: QQP, PAWS-QQP, SummEval, QAGS-CNN, and QAGS-XSUM.

Below is a comparison of DCR-Consistency with some state-of-the-art metrics for consistency on the SummEval dataset. We included well-established metrics like BERTScore, as well as newer LLM-based ones (GPT-3.5/4) such as G-Eval. DCR-Consistency outperforms those metrics by a large margin.

<img src="assets/performance.png" width="300"/>

We also evaluated DCR-Consistency's effectiveness at inconsistency mitigation. Below is an illustration showing how the consistency rate changes over successive iterations of applying DCR-Consistency. We observe effective mitigation on all three datasets, and 100% mitigation of the detected inconsistencies can be achieved within three rounds.

<img src="assets/rai.png" width="300"/>

🤖 Installation

pip install . 

DCR-Consistency will also be installable directly from PyPI (coming soon!):

pip install dcr-consistency

🚀 Quickstart

The easiest way to start is to play with the example in examples/example.py. To do so:

pip install -r examples/requirements_example.txt
python examples/example.py

📃 Usage

Evaluation

res = evaluate(_your_LLM_, _your_model_config_, data, worker_count=5)

The result has the following columns:

| column | meaning |
| --- | --- |
| id | Unique identifier for each row |
| score | Final consistency score of the row |
| dce_reasons | Reasons for the final score given by DCE |
| amc_reasons | Reasons for scoring of each sentence given by AMC |
| dce_raw | Raw data from DCE |
| amc_raw | Raw data from AMC |
| decision | Consistency decision based on DCE |
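
One way to consume these columns is to pull out the low-scoring rows together with their reasons. The sketch below does not call the library: it uses hand-written placeholder rows (a list of dicts) that merely mirror the `id`, `score`, and `dce_reasons` columns, and the helper name `low_scoring_rows` and the 0.5 threshold are assumptions. The actual return type of `evaluate` may differ, so adapt the row access accordingly.

```python
from typing import Any, Dict, List


def low_scoring_rows(
    rows: List[Dict[str, Any]], threshold: float = 0.5
) -> List[Dict[str, Any]]:
    """Return rows whose consistency score is below the threshold, lowest first."""
    flagged = [row for row in rows if row["score"] < threshold]
    return sorted(flagged, key=lambda row: row["score"])


# Placeholder rows for illustration only; this is not real evaluation output.
rows = [
    {"id": "a", "score": 1.0, "dce_reasons": "placeholder reason"},
    {"id": "b", "score": 0.25, "dce_reasons": "placeholder reason"},
    {"id": "c", "score": 0.0, "dce_reasons": "placeholder reason"},
]

for row in low_scoring_rows(rows):
    print(row["id"], row["score"], row["dce_reasons"])
```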

Inconsistency Mitigation

res = improve(_your_LLM_, _your_model_config_, data, worker_count=5)

The result has the following columns:

| column | meaning |
| --- | --- |
| id | Unique identifier for each row |
| improved_version | The improved version where inconsistency is mitigated |
| rai_raw | Raw data from RAI |
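
Evaluation and mitigation can be alternated for a few rounds, in the spirit of the iterative results reported earlier. The sketch below shows only the control flow: `evaluate_fn` and `improve_fn` are hypothetical single-item callables standing in for the library's `evaluate` and `improve` (whose real signatures take an LLM, a model config, and a data table), and the round limit and target score are assumptions.

```python
from typing import Callable, Tuple


def mitigate_iteratively(
    reference: str,
    candidate: str,
    evaluate_fn: Callable[[str, str], float],
    improve_fn: Callable[[str, str], str],
    max_rounds: int = 3,
    target_score: float = 1.0,
) -> Tuple[str, float, int]:
    """Alternate evaluation and improvement until the target score or round limit.

    Returns the final candidate, its score, and the number of improvement rounds used.
    """
    score = evaluate_fn(reference, candidate)
    rounds = 0
    while score < target_score and rounds < max_rounds:
        candidate = improve_fn(reference, candidate)
        score = evaluate_fn(reference, candidate)
        rounds += 1
    return candidate, score, rounds


# Toy stand-ins so the sketch runs without an LLM; they are not the real framework.
def toy_evaluate(reference: str, candidate: str) -> float:
    return 1.0 if candidate == reference else 0.0


def toy_improve(reference: str, candidate: str) -> str:
    return reference


final, score, rounds = mitigate_iteratively(
    "The sky is blue.", "The sky is green.", toy_evaluate, toy_improve
)
print(final, score, rounds)
```

Capping the number of rounds keeps the cost bounded when a candidate cannot be brought up to the target score.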

👏 Contributing

See CONTRIBUTING.md.

💁 Citation

@inproceedings{cui2023dcr,
      title={DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models},
      author={Wendi Cui and Jiaxin Zhang and Zhuohang Li and Damien Lopez and Kamalika Das and Bradley Malin and Sricharan Kumar},
      booktitle={arXiv preprint arXiv:2401.02132},
      year={2023},
      primaryClass={cs.CL}
}