A Survey on the Honesty of Large Language Models
<img src="https://img.shields.io/badge/Version-1.0-blue.svg" alt="Version">
This repository offers a comprehensive collection of papers on the honesty of LLMs, covering how honesty is defined, how it is evaluated, and how it can be improved. For a deeper treatment of these studies, see our survey: A Survey on the Honesty of Large Language Models.
Table of Contents
- Honesty in LLMs
- Evaluation of LLM Honesty
- Improvement of Self-knowledge
- Improvement of Self-expression
Honesty in LLMs
What is Honesty in LLMs?
<div align="center"> <img src="./assets/main_figure.jpg"> <p><em>Figure 1: An illustration of an honest LLM that demonstrates both self-knowledge and self-expression.</em></p> </div>

In this paper, we consider an LLM to be honest if it fulfills two widely accepted criteria: <i>possessing both self-knowledge and self-expression</i>. Self-knowledge means the model is aware of its own capabilities, recognizing what it knows and what it doesn't, so that it can acknowledge its limitations or convey uncertainty when necessary. Self-expression refers to the model's ability to faithfully express its knowledge, leading to reliable outputs. An illustrated example is shown in Figure 1.
- A general language assistant as a laboratory for alignment, <ins>arXiv, 2021</ins> [Paper]
- Language models (mostly) know what they know, <ins>arXiv, 2022</ins> [Paper]
- Truthful AI: Developing and governing AI that does not lie, <ins>arXiv, 2021</ins> [Paper]
- Teaching models to express their uncertainty in words, <ins>TMLR, 2022</ins> [Paper][Code]
- Alignment for honesty, <ins>arXiv, 2023</ins> [Paper][Code]
- BeHonest: Benchmarking honesty of large language models, <ins>arXiv, 2024</ins> [Paper][Code]
Self-knowledge
The self-knowledge capacity of LLMs hinges on their ability to recognize what they know and what they don't know. This enables them to explicitly state "I don't know" when they lack the necessary knowledge, thereby avoiding wrong statements. It also allows them to attach confidence or uncertainty indicators to their responses, reflecting the likelihood that they are correct.
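Many of the approaches collected below elicit a confidence estimate from the model and abstain when it is low. A minimal, purely illustrative sketch of that confidence-then-abstain pattern (the prompt format and the `ask_llm` callable are placeholders, not any specific paper's method):

```python
from typing import Callable

PROMPT = (
    "Answer the question and rate your confidence from 0 to 100.\n"
    "Question: {question}\n"
    "Reply exactly as: ANSWER: <answer> | CONFIDENCE: <number>"
)

def answer_or_abstain(question: str,
                      ask_llm: Callable[[str], str],
                      threshold: float = 60.0) -> str:
    """Return the model's answer, or "I don't know." if its stated confidence is low."""
    reply = ask_llm(PROMPT.format(question=question))
    try:
        answer_part, conf_part = reply.split("| CONFIDENCE:")
        answer = answer_part.replace("ANSWER:", "").strip()
        confidence = float(conf_part.strip())
    except ValueError:
        return "I don't know."  # unparsable reply: abstain by default
    return answer if confidence >= threshold else "I don't know."
```

How to set the threshold, and whether the verbalized confidence is trustworthy at all, is exactly what the calibration and selective prediction work in the evaluation section examines.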
- Self-knowledge, <ins>Routledge, 2010</ins> [Paper]
- A general language assistant as a laboratory for alignment, <ins>arXiv, 2021</ins> [Paper]
- Language models (mostly) know what they know, <ins>arXiv, 2022</ins> [Paper]
- Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness?, <ins>EMNLP, 2023</ins> [Paper][Code]
- Can AI assistants know what they don't know?, <ins>ICML, 2024</ins> [Paper][Code]
- BeHonest: Benchmarking honesty of large language models, <ins>arXiv, 2024</ins> [Paper][Code]
- Alignment for honesty, <ins>arXiv, 2023</ins> [Paper][Code]
- R-tuning: Instructing large language models to say "I don't know", <ins>NAACL, 2024</ins> [Paper][Code]
- Teaching models to express their uncertainty in words, <ins>TMLR, 2022</ins> [Paper][Code]
- SaySelf: Teaching LLMs to express confidence with self-reflective rationales, <ins>arXiv, 2024</ins> [Paper][Code]
- LACIE: Listener-aware finetuning for confidence calibration in large language models, <ins>arXiv, 2024</ins> [Paper][Code]
- Don't Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration, <ins>ACL, 2024</ins> [Paper][Code]
- Know Your Limits: A Survey of Abstention in Large Language Models, <ins>arXiv, 2024</ins> [Paper]
- Mitigating LLM hallucinations via conformal abstention, <ins>arXiv, 2024</ins> [Paper]
- Uncertainty-based abstention in LLMs improves safety and reduces hallucinations, <ins>arXiv, 2024</ins> [Paper]
- On Hallucination and Predictive Uncertainty in Conditional Language Generation, <ins>EACL, 2021</ins> [Paper]
- SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models, <ins>EMNLP, 2023</ins> [Paper][Code]
- Semantic entropy probes: Robust and cheap hallucination detection in LLMs, <ins>arXiv, 2024</ins> [Paper]
- INSIDE: LLMs' internal states retain the power of hallucination detection, <ins>ICLR, 2024</ins> [Paper]
- Detecting hallucinations in large language models using semantic entropy, <ins>Nature, 2024</ins> [Paper][Code]
- Active retrieval augmented generation, <ins>EMNLP, 2023</ins> [Paper][Code]
- Self-Knowledge Guided Retrieval Augmentation for Large Language Models, <ins>arXiv, 2023</ins> [Paper]
- When do LLMs need retrieval augmentation? Mitigating LLMs' overconfidence helps retrieval augmentation, <ins>arXiv, 2024</ins> [Paper][Code]
- SEAKR: Self-Aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation, <ins>ICLR, 2024</ins> [Paper][Code]
- Large Language Model Cascades with Mixture of Thoughts Representations for Cost-Efficient Reasoning, <ins>arXiv, 2024</ins> [Paper]
- Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement, <ins>arXiv, 2024</ins> [Paper][Code]
- Language Model Cascades: Token-Level Uncertainty and Beyond, <ins>arXiv, 2024</ins> [Paper]
- Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection, <ins>arXiv, 2024</ins> [Paper]
Self-expression
Self-expression refers to the model's ability to faithfully express its knowledge, whether parametric knowledge acquired through training or in-context knowledge provided at inference time. This enables the model to ground its responses in its knowledge rather than fabricate information.
- Physics of Language Models: Part 3.1, Knowledge Storage and Extraction, <ins>arXiv, 2024</ins> [Paper]
- How Language Model Hallucinations Can Snowball, <ins>arXiv, 2024</ins> [Paper][Code]
- Inference-time intervention: Eliciting truthful answers from a language model, <ins>NeurIPS, 2024</ins> [Paper][Code]
- Lost in the middle: How language models use long contexts, <ins>TACL, 2024</ins> [Paper]
- Trusting your evidence: Hallucinate less with context-aware decoding, <ins>NAACL, 2024</ins> [Paper]
- Hallucination of Multimodal Large Language Models: A Survey, <ins>arXiv, 2024</ins> [Paper]
- Robustness of Learning from Task Instructions, <ins>ACL Findings, 2023</ins> [Paper]
- State of what art? A call for multi-prompt LLM evaluation, <ins>TACL, 2024</ins> [Paper][Code]
- On the robustness of ChatGPT: An adversarial and out-of-distribution perspective, <ins>ICLR Workshop, 2023</ins> [Paper][Code]
- Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting, <ins>ICLR, 2024</ins> [Paper][Code]
- On the worst prompt performance of large language models, <ins>arXiv, 2024</ins> [Paper][Code]
- Simple synthetic data reduces sycophancy in large language models, <ins>arXiv, 2023</ins> [Paper][Code]
- Towards Robust and Faithful Summarization with Large Language Models, <ins>arXiv, 2024</ins> [Paper]
- TrustLLM: Trustworthiness in Large Language Models, <ins>arXiv, 2024</ins> [Paper][Code]
Evaluation of LLM Honesty
Self-knowledge
<div align="center"> <img src="./assets/evaluation_self_knowledge.jpg"> <p><em>Figure 2: Illustrations of self-knowledge evaluation, encompassing the recognition of known/unknown, calibration, and selective prediction. "Conf" indicates the LLM's confidence score and "Acc" represents the accuracy of the response.</em></p> </div>

Recognition of Known/Unknown
- Do large language models know what they don't know?, <ins>ACL Findings, 2023</ins> [Paper][Code]
- Knowledge of knowledge: Exploring known-unknowns uncertainty with large language models, <ins>arXiv, 2023</ins> [Paper][Code]
- Examining LLMs' uncertainty expression towards questions outside parametric knowledge, <ins>arXiv, 2024</ins> [Paper][Code]
- The best of both worlds: Toward an honest and helpful large language model, <ins>arXiv, 2024</ins> [Paper][Code]
- BeHonest: Benchmarking honesty of large language models, <ins>arXiv, 2024</ins> [Paper][Code]
- Can AI assistants know what they don't know?, <ins>ICML, 2024</ins> [Paper][Code]
Calibration
- On calibration of modern neural networks, <ins>ICML, 2017</ins> [Paper]
- Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback, <ins>EMNLP, 2023</ins> [Paper]
- A survey of confidence estimation and calibration in large language models, <ins>NAACL, 2024</ins> [Paper]
- On the Calibration of Large Language Models and Alignment, <ins>EMNLP Findings, 2023</ins> [Paper]
- Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs, <ins>ICLR, 2024</ins> [Paper][Code]
- Calibrating large language models with sample consistency, <ins>arXiv, 2024</ins> [Paper]
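For a concrete sense of what the calibration papers above measure, expected calibration error (ECE) bins predictions by confidence and averages the gap between mean confidence and accuracy per bin. A minimal NumPy sketch on toy data (illustrative only, not tied to any particular paper):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: bin-weighted gap between average confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy example: an overconfident model (high stated confidence, mixed accuracy).
conf = [0.9, 0.95, 0.8, 0.85, 0.7]
hit  = [1,   0,    1,   0,    1]
print(expected_calibration_error(conf, hit))
```

Reliability diagrams plot the same per-bin gaps visually; a perfectly calibrated model has zero gap in every bin.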
Selective Prediction
- Out-of-Distribution Detection and Selective Generation for Conditional Language Models, <ins>ICLR, 2023</ins> [Paper]
- Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs, <ins>ICLR, 2024</ins> [Paper][Code]
- Adaptation with self-evaluation to improve selective prediction in LLMs, <ins>EMNLP Findings, 2023</ins> [Paper]
- Uncertainty-based abstention in LLMs improves safety and reduces hallucinations, <ins>arXiv, 2024</ins> [Paper]
- SaySelf: Teaching LLMs to express confidence with self-reflective rationales, <ins>arXiv, 2024</ins> [Paper][Code]
- Factual confidence of LLMs: On reliability and robustness of current estimators, <ins>ACL, 2024</ins> [Paper][Code]
- Self-evaluation improves selective generation in large language models, <ins>NeurIPS Workshop, 2023</ins> [Paper]
- A survey of confidence estimation and calibration in large language models, <ins>NAACL, 2024</ins> [Paper]
- Generating with confidence: Uncertainty quantification for black-box large language models, <ins>TMLR, 2024</ins> [Paper][Code]
- Getting MoRE out of Mixture of Language Model Reasoning Experts, <ins>EMNLP Findings, 2023</ins> [Paper][Code]
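Selective prediction is typically scored by answering only above a confidence threshold and reporting coverage together with accuracy on the answered subset. A minimal sketch of that trade-off on toy arrays (illustrative only):

```python
import numpy as np

def coverage_and_selective_accuracy(confidences, correct, threshold: float):
    """Answer only when confidence >= threshold; report coverage and accuracy on kept items."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    kept = confidences >= threshold
    coverage = kept.mean()
    selective_acc = correct[kept].mean() if kept.any() else float("nan")
    return coverage, selective_acc

conf = np.array([0.95, 0.60, 0.80, 0.30, 0.90])
hit  = np.array([1,    0,    1,    0,    1])
for t in (0.5, 0.7, 0.9):
    cov, acc = coverage_and_selective_accuracy(conf, hit, t)
    print(f"threshold={t:.1f}  coverage={cov:.2f}  selective_acc={acc:.2f}")
```

Sweeping the threshold traces the risk/coverage curve that several of the papers above report.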
Self-expression
<div align="center"> <img src="./assets/evaluation_self_expression.jpg"> <p><em>Figure 3: Illustrations of self-expression evaluation, encompassing both identification-based and identification-free approaches.</em></p> </div>

Identification-based Evaluation
- Do large language models know what they don't know?, <ins>ACL Findings, 2023</ins> [Paper][Code]
- Knowledge of knowledge: Exploring known-unknowns uncertainty with large language models, <ins>arXiv, 2023</ins> [Paper][Code]
- Examining LLMs' uncertainty expression towards questions outside parametric knowledge, <ins>arXiv, 2024</ins> [Paper][Code]
- The best of both worlds: Toward an honest and helpful large language model, <ins>arXiv, 2024</ins> [Paper][Code]
- BeHonest: Benchmarking honesty of large language models, <ins>arXiv, 2024</ins> [Paper][Code]
- Can AI assistants know what they don't know?, <ins>ICML, 2024</ins> [Paper][Code]
Identification-free Evaluation
- Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting, <ins>ICLR, 2024</ins> [Paper][Code]
- BeHonest: Benchmarking honesty of large language models, <ins>arXiv, 2024</ins> [Paper][Code]
- State of what art? A call for multi-prompt LLM evaluation, <ins>TACL, 2024</ins> [Paper][Code]
- On the worst prompt performance of large language models, <ins>arXiv, 2024</ins> [Paper][Code]
- TrustLLM: Trustworthiness in Large Language Models, <ins>arXiv, 2024</ins> [Paper][Code]
- Simple synthetic data reduces sycophancy in large language models, <ins>arXiv, 2023</ins> [Paper]
- Benchmarking and improving generator-validator consistency of language models, <ins>ICLR, 2024</ins> [Paper][Code]
Improvement of Self-knowledge
<div align="center"> <img src="./assets/improvement_self_knowledge.jpg"> <p><em>Figure 4: Improvement of self-knowledge, encompassing both training-based and training-free approaches.</em></p> </div>

Training-free Approaches
Predictive Probability
- Uncertainty quantification with pre-trained language models: A large-scale empirical analysis, <ins>EMNLP Findings, 2022</ins> [Paper][Code]
- Language models (mostly) know what they know, <ins>arXiv, 2022</ins> [Paper]
- Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation, <ins>ICLR, 2023</ins> [Paper][Code]
- Self-evaluation improves selective generation in large language models, <ins>NeurIPS Workshop, 2023</ins> [Paper]
- Shifting attention to relevance: Towards the uncertainty estimation of large language models, <ins>ACL, 2024</ins> [Paper][Code]
- Uncertainty estimation in autoregressive structured prediction, <ins>ICLR, 2021</ins> [Paper]
- Prompting gpt-3 to be reliable, <ins>ICLR, 2023</ins> [Paper][Code]
Prompting
- Language models (mostly) know what they know, <ins>arXiv, 2022</ins> [Paper]
- Fact-and-Reflection (FaR) improves confidence calibration of large language models, <ins>ACL Findings, 2024</ins> [Paper]
- Self-[in]correct: LLMs struggle with refining self-generated responses, <ins>arXiv, 2024</ins> [Paper]
- Do large language models know what they don't know?, <ins>ACL Findings, 2023</ins> [Paper][Code]
- Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback, <ins>EMNLP, 2023</ins> [Paper]
- Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs, <ins>ICLR, 2024</ins> [Paper][Code]
- Navigating the grey area: How expressions of uncertainty and overconfidence affect language models, <ins>EMNLP, 2023</ins> [Paper][Code]
Sampling and Aggregation
- Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs, <ins>ICLR, 2024</ins> [Paper][Code]
- Just rephrase it! Uncertainty estimation in closed-source language models via multiple rephrased queries, <ins>arXiv, 2024</ins> [Paper]
- Prompt consistency for zero-shot task generalization, <ins>EMNLP Findings, 2022</ins> [Paper][Code]
- Calibrating large language models with sample consistency, <ins>arXiv, 2024</ins> [Paper]
- Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation, <ins>ICLR, 2023</ins> [Paper][Code]
- Generating with confidence: Uncertainty quantification for black-box large language models, <ins>TMLR, 2024</ins> [Paper][Code]
- SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models, <ins>EMNLP, 2023</ins> [Paper][Code]
- Mitigating LLM hallucinations via conformal abstention, <ins>arXiv, 2024</ins> [Paper]
- Calibrating long-form generations from large language models, <ins>arXiv, 2024</ins> [Paper]
- Detecting hallucinations in large language models using semantic entropy, <ins>Nature, 2024</ins> [Paper][Code]
- Fact-checking the output of large language models via token-level uncertainty quantification, <ins>ACL Findings, 2024</ins> [Paper][Code]
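A recurring recipe in the sampling-and-aggregation works above is to sample several answers and treat agreement with the most frequent one as a confidence score. A minimal sketch, where `sample_answers` is a placeholder for repeated temperature-sampled LLM calls:

```python
from collections import Counter
from typing import Callable, List

def consistency_confidence(question: str,
                           sample_answers: Callable[[str, int], List[str]],
                           n_samples: int = 10):
    """Confidence = fraction of samples agreeing with the most frequent answer."""
    answers = sample_answers(question, n_samples)        # e.g. temperature-sampled generations
    normalized = [a.strip().lower() for a in answers]    # crude normalization for matching
    top_answer, top_count = Counter(normalized).most_common(1)[0]
    return top_answer, top_count / len(normalized)

# Usage with a fake sampler, purely for illustration:
fake = lambda q, n: ["Paris", "Paris", "paris", "Lyon", "Paris"][:n]
print(consistency_confidence("Capital of France?", fake, n_samples=5))  # ('paris', 0.8)
```

Papers that operate on free-form text replace the exact-match grouping with semantic clustering of the sampled answers.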
Training-based Approaches
Supervised Fine-tuning
- Alignment for honesty, <ins>arXiv, 2023</ins> [Paper][Code]
- R-tuning: Instructing large language models to say "I don't know", <ins>NAACL, 2024</ins> [Paper][Code]
- Can AI assistants know what they don't know?, <ins>ICML, 2024</ins> [Paper][Code]
- Teaching large language models to express knowledge boundary from their own signals, <ins>arXiv, 2024</ins> [Paper]
- Knowledge verification to nip hallucination in the bud, <ins>arXiv, 2024</ins> [Paper][Code]
- Large language models must be taught to know what they don't know, <ins>arXiv, 2024</ins> [Paper][Code]
- Teaching models to express their uncertainty in words, <ins>TMLR, 2022</ins> [Paper][Code]
- Calibrating large language models using their generations only, <ins>arXiv, 2024</ins> [Paper][Code]
- Enhancing confidence expression in large language models through learning from past experience, <ins>arXiv, 2024</ins> [Paper]
Reinforcement Learning
- Can AI assistants know what they don't know?, <ins>ICML, 2024</ins> [Paper][Code]
- Rejection improves reliability: Training LLMs to refuse unknown questions using RL from knowledge feedback, <ins>COLM, 2024</ins> [Paper]
- The best of both worlds: Toward an honest and helpful large language model, <ins>arXiv, 2024</ins> [Paper][Code]
- SaySelf: Teaching LLMs to express confidence with self-reflective rationales, <ins>arXiv, 2024</ins> [Paper][Code]
- LACIE: Listener-aware finetuning for confidence calibration in large language models, <ins>arXiv, 2024</ins> [Paper][Code]
- Linguistic calibration of long-form generations, <ins>ICML, 2024</ins> [Paper][Code]
Probing
- Language models (mostly) know what they know, <ins>arXiv, 2022</ins> [Paper]
- The internal state of an LLM knows when it's lying, <ins>EMNLP Findings, 2023</ins> [Paper]
- The geometry of truth: Emergent linear structure in large language model representations of true/false datasets, <ins>COLM, 2024</ins> [Paper][Code]
- On the universal truthfulness hyperplane inside LLMs, <ins>arXiv, 2024</ins> [Paper]
- Discovering latent knowledge in language models without supervision, <ins>ICLR, 2023</ins> [Paper][Code]
- Semantic entropy probes: Robust and cheap hallucination detection in LLMs, <ins>arXiv, 2024</ins> [Paper]
- LLM internal states reveal hallucination risk faced with a query, <ins>arXiv, 2024</ins> [Paper]
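The probing papers above generally fit a lightweight (often linear) classifier on an LLM's hidden states to predict truthfulness or correctness. A minimal scikit-learn sketch; the random hidden states and labels here are stand-ins so the snippet runs on its own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: rows are hidden states from some layer, labels are 1 if the
# corresponding statement/generation was true or correct, 0 otherwise.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 768))
labels = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)   # a simple linear probe
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))  # ~0.5 on random data
```

With real hidden states, the interesting question studied by these papers is how far above chance such a probe can get, and how well it transfers across datasets.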
Improvement of Self-expression
<div align="center"> <img src="./assets/improvement_self_expression.jpg"> <p><em>Figure 5: Improvement of self-expression, encompassing both training-based and training-free approaches.</em></p> </div>

Training-free Approaches
Prompting
- Chain-of-thought prompting elicits reasoning in large language models, <ins>NeurIPS, 2022</ins> [Paper]
- Large Language Models are Zero-Shot Reasoners, <ins>NeurIPS, 2022</ins> [Paper]
- Least-to-most prompting enables complex reasoning in large language models, <ins>ICLR, 2023</ins> [Paper]
- Measuring and Narrowing the Compositionality Gap in Language Models, <ins>EMNLP Findings, 2023</ins> [Paper][Code]
- Take a step back: Evoking reasoning via abstraction in large language models, <ins>ICLR, 2024</ins> [Paper]
- Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models, <ins>ACL, 2023</ins> [Paper][Code]
- Fact-and-Reflection (FaR) Improves Confidence Calibration of Large Language Models, <ins>ACL Findings, 2024</ins> [Paper][Code]
Decoding-time Intervention
- Inference-time intervention: Eliciting truthful answers from a language model, <ins>NeurIPS, 2024</ins> [Paper][Code]
- In-context sharpness as alerts: An inner representation perspective for hallucination mitigation, <ins>ICML, 2024</ins> [Paper][Code]
- DoLa: Decoding by contrasting layers improves factuality in large language models, <ins>ICLR, 2024</ins> [Paper][Code]
- Alleviating hallucinations of large language models through induced hallucinations, <ins>arXiv, 2023</ins> [Paper][Code]
- Trusting your evidence: Hallucinate less with context-aware decoding, <ins>NAACL, 2024</ins> [Paper]
- Mitigating object hallucinations in large vision-language models through visual contrastive decoding, <ins>CVPR, 2024</ins> [Paper][Code]
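Several of the decoding-time interventions above share one core operation: contrast two next-token distributions (for example, with vs. without the context, or a later vs. an earlier layer) and favor tokens the better-informed distribution prefers. A minimal sketch over raw logit vectors; the mixing weight `alpha` and the toy logits are placeholders, not any paper's exact formulation:

```python
import numpy as np

def log_softmax(logits):
    shifted = np.asarray(logits, dtype=float) - np.max(logits)
    return shifted - np.log(np.exp(shifted).sum())

def contrastive_next_token(logits_informed, logits_uninformed, alpha: float = 1.0) -> int:
    """Pick the token the informed distribution prefers relative to the uninformed one."""
    lp_informed = log_softmax(logits_informed)
    lp_uninformed = log_softmax(logits_uninformed)
    scores = lp_informed + alpha * (lp_informed - lp_uninformed)
    return int(np.argmax(scores))

# Toy vocabulary of 4 tokens: conditioning on the context shifts mass to token 2.
with_context    = [1.0, 0.5, 3.0, 0.2]
without_context = [1.0, 0.5, 1.0, 0.2]
print(contrastive_next_token(with_context, without_context))  # -> 2
```

The individual papers differ mainly in where the two distributions come from and how the contrast is regularized or gated.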
Sampling and Aggregation
- Self-consistency improves chain of thought reasoning in language models, <ins>ICLR, 2023</ins> [Paper]
- Universal self-consistency for large language model generation, <ins>arXiv, 2023</ins> [Paper]
- Integrate the essence and eliminate the dross: Fine-grained self-consistency for free-form language generation, <ins>ACL, 2024</ins> [Paper][Code]
- Atomic self-consistency for better long form generations, <ins>arXiv, 2024</ins> [Paper][Code]
Post-generation Revision
- Chain-of-verification reduces hallucination in large language models, <ins>arXiv, 2023</ins> [Paper]
- A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation, <ins>arXiv, 2023</ins> [Paper]
- Verify-and-edit: A knowledge-enhanced chain-of-thought framework, <ins>ACL, 2023</ins> [Paper][Code]
Training-based Approaches
Self-aware Fine-tuning
- Alignment for honesty, <ins>arXiv, 2023</ins> [Paper][Code]
- R-tuning: Instructing large language models to say "I don't know", <ins>NAACL, 2024</ins> [Paper][Code]
- Can AI assistants know what they don't know?, <ins>ICML, 2024</ins> [Paper][Code]
- Knowledge verification to nip hallucination in the bud, <ins>arXiv, 2024</ins> [Paper][Code]
- Unfamiliar fine-tuning examples control how language models hallucinate, <ins>arXiv, 2024</ins> [Paper][Code]
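A common thread in the self-aware fine-tuning papers above is to relabel supervised data according to what the model itself can answer: keep the gold target when the model gets it right, and otherwise train it to refuse. A minimal data-construction sketch (the `model_answer` callable, the matching rule, and the refusal string are all placeholders):

```python
from typing import Callable, Dict, List

REFUSAL = "I don't know."  # placeholder refusal target

def build_self_aware_sft_data(qa_pairs: List[Dict[str, str]],
                              model_answer: Callable[[str], str]) -> List[Dict[str, str]]:
    """Keep gold targets the model can already reproduce; otherwise train it to refuse."""
    data = []
    for example in qa_pairs:
        prediction = model_answer(example["question"])
        # Crude exact-match check; real pipelines use softer answer matching.
        known = prediction.strip().lower() == example["answer"].strip().lower()
        data.append({"question": example["question"],
                     "target": example["answer"] if known else REFUSAL})
    return data
```

The resulting dataset is then used for ordinary supervised fine-tuning, so the refusal behavior is tied to the model's own knowledge boundary rather than to a fixed question list.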
Self-supervised Fine-tuning
- Fine-tuning language models for factuality, <ins>ICLR, 2024</ins> [Paper][Code]
- Self-alignment for factuality: Mitigating hallucinations in LLMs via self-evaluation, <ins>arXiv, 2024</ins> [Paper][Code]
- FLAME: Factuality-aware alignment for large language models, <ins>arXiv, 2024</ins> [Paper]
Citation
If you find this resource valuable for your research, we would appreciate it if you could cite our paper. Thank you!
@article{li2024survey,
title={A Survey on the Honesty of Large Language Models},
author={Siheng Li and Cheng Yang and Taiqiang Wu and Chufan Shi and Yuji Zhang and Xinyu Zhu and Zesen Cheng and Deng Cai and Mo Yu and Lemao Liu and Jie Zhou and Yujiu Yang and Ngai Wong and Xixin Wu and Wai Lam},
year={2024},
journal={arXiv preprint arXiv:2409.18786}
}