LLM-Factuality-Survey

The repository for the survey paper "Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity"

<p align="center"> Cunxiang Wang<sup>1,7</sup>*, Xiaoze Liu<sup>2</sup>*, Yuanhao Yue<sup>3</sup>*, Qipeng Guo<sup>4</sup>, Xiangkun Hu<sup>4</sup>, Xiangru Tang<sup>5</sup>, Tianhang Zhang<sup>6</sup>, Cheng Jiayang<sup>7</sup>, Yunzhi Yao<sup>8</sup>, Wenyang Gao<sup>1,8</sup>, Xuming Hu<sup>9</sup>, Zehan Qi<sup>9</sup>, Yidong Wang<sup>1</sup>, Linyi Yang<sup>1</sup>, Jindong Wang<sup>10</sup>, Xing Xie<sup>10</sup>, Zheng Zhang<sup>4,11</sup> and Yue Zhang<sup>1</sup>. </p> <p align="center"> 1. School of Engineering, Westlake University; 2. Purdue University; 3. Fudan University; 4. Amazon AWS AI Lab; 5. Yale University; 6. Shanghai Jiao Tong University; 7. HKUST; 8. Zhejiang University; 9. Tsinghua University; 10. Microsoft Research; 11. NYU Shanghai;<br> (*: Equal Contribution; Correspondence to: Yue Zhang) </p>

NOTE: Because the arXiv paper cannot be updated in real time, please consult this repository for the most recent developments and modifications. We welcome pull requests and issues that improve the quality of this survey. All contributions will be listed in the <a href="#acknowledgements">acknowledgements</a> section.

Paper List

Analysis of Factuality

Knowledge Storage

  1. Language Models as Knowledge Bases?. Petroni et al. 2019. [Paper]
  2. Locating and Editing Factual Associations in GPT. Meng et al. 2022. [Paper]
  3. Transformer Feed-Forward Layers Are Key-Value Memories. Geva et al. 2021. [Paper]
  4. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Geva et al. 2022. [Paper]
  5. Dissecting Recall of Factual Associations in Auto-Regressive Language Models. Geva et al. 2023. [Paper]
  6. Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons. Chen et al. 2023. [Paper]
  7. A rigorous study of integrated gradients method and extensions to internal neuron attributions. Lundstrom et al. 2022. [Paper]

Knowledge Awareness

  1. CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. Gou et al. 2023. [Paper]
  2. Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation. Ren et al. 2023. [Paper]
  3. Do Large Language Models Know What They Don't Know?. Yin et al. 2023. [Paper]
  4. A Survey on In-context Learning. Dong et al. 2023. [Paper]
  5. Language Models (Mostly) Know What They Know. Kadavath et al. 2022. [Paper]
  6. The Internal State of an LLM Knows When It's Lying. Azaria et al. 2023. [Paper]

Parametric Knowledge vs Retrieved Knowledge

  1. Generate rather than retrieve: Large language models are strong context generators. Yu et al. 2023. [Paper]
  2. Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators. Chen et al. 2023. [Paper]
  3. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. Izacard et al. 2021. [Paper]
  4. Large language models struggle to learn long-tail knowledge. Kandpal et al. 2023. [Paper]
  5. Head-to-Tail: How Knowledgeable are Large Language Models (LLM)? AKA Will LLMs Replace Knowledge Graphs?. Sun et al. 2023. [Paper]

Contextual Influence

  1. Large Language Models with Controllable Working Memory. Li et al. 2023. [Paper]
  2. Context-faithful Prompting for Large Language Models. Zhou et al. 2023. [Paper]
  3. Benchmarking Large Language Models in Retrieval-Augmented Generation. Chen et al. 2023. [Paper]
  4. Automatic Evaluation of Attribution by Large Language Models. Yue et al. 2023. [Paper]

Knowledge Conflicts

  1. Entity-Based Knowledge Conflicts in Question Answering. Longpre et al. 2021. [Paper]
  2. Rich Knowledge Sources Bring Complex Knowledge Conflicts: Recalibrating Models to Reflect Conflicting Evidence. Chen et al. 2022. [Paper]
  3. Adaptive Chameleon or Stubborn Sloth: Unraveling the Behavior of Large Language Models in Knowledge Clashes. Xie et al. 2023. [Paper]
  4. Large Language Models with Controllable Working Memory. Li et al. 2023. [Paper]

Causes of Factual Errors

Model-level Causes

Forgetting

  1. An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks. Goodfellow et al. 2015. [Paper]
  2. Preserving In-Context Learning ability in Large Language Model Fine-tuning. Wang et al. 2022. [Paper]
  3. Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting. Chen et al. 2020. [Paper]
  4. Investigating the Catastrophic Forgetting in Multimodal Large Language Models. Zhai et al. 2023. [Paper]
  5. An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. Luo et al. 2023. [Paper]

Reasoning Failure

  1. We're Afraid Language Models Aren't Modeling Ambiguity. Liu et al. 2023. [Paper]
  2. The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". Berglund et al. 2023. [Paper]
  3. Understanding Catastrophic Forgetting in Language Models via Implicit Inference. Kotha et al. 2023. [Paper]
  4. Can ChatGPT Replace Traditional KBQA Models? An In-depth Analysis of the Question Answering Performance of the GPT LLM Family. Tan et al. 2023. [Paper]

Retrieval-level Causes

Misinformation Not Recognized by LLMs

  1. Entity-Based Knowledge Conflicts in Question Answering. Longpre et al. 2021. [Paper]
  2. On the Risk of Misinformation Pollution with Large Language Models. Pan et al. 2023. [Paper]
  3. A Survey on Truth Discovery. Han et al. 2015. [Paper]

Distracting Information

  1. SAIL: Search-Augmented Instruction Learning. Luo et al. 2023. [Paper]
  2. Lost in the middle: How language models use long contexts. Liu et al. 2023. [Paper]

Misinterpretation of Related Information

  1. ReAct: Synergizing Reasoning and Acting in Language Models. Yao et al. 2023. [Paper]

Inference-level Causes

Snowballing

  1. How language model hallucinations can snowball. Zhang et al. 2023. [Paper]
  2. A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation. Varshney et al. 2023. [Paper]

Erroneous Decoding

  1. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. Chuang et al. 2023. [Paper]
  2. How Decoding Strategies Affect the Verifiability of Generated Text. Massarelli et al. 2020. [Paper]

Exposure Bias

  1. WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models. Felkner et al. 2023. [Paper]
  2. Bias and Fairness in Large Language Models: A Survey. Gallegos et al. 2023. [Paper]
  3. MISGENDERED: Limits of Large Language Models in Understanding Pronouns. Hossain et al. 2023. [Paper]

Evaluation of Factuality

Benchmarks

  1. Measuring Massive Multitask Language Understanding. Hendrycks et al. 2021. [Paper]
  2. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Lin et al. 2022. [Paper]
  3. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. Li et al. 2023. [Paper]
  4. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. Huang et al. 2023. [Paper]
  5. Do Large Language Models Know What They Don't Know?. Yin et al. 2023. [Paper]
  6. Do Large Language Models Know about Facts?. Hu et al. 2023. [Paper]
  7. RealTime QA: What's the Answer Right Now?. Kasai et al. 2022. [Paper]
  8. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. Vu et al. 2023. [Paper]
  9. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Srivastava et al. 2023. [Paper]
  10. Natural Questions: a Benchmark for Question Answering Research. Kwiatkowski et al. 2019. [Paper]
  11. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Joshi et al. 2017. [Paper]
  12. Semantic Parsing on Freebase from Question-Answer Pairs. Berant et al. 2013. [Paper]
  13. Open Question Answering over Tables and Text. Chen et al. 2021. [Paper]
  14. AmbigQA: Answering Ambiguous Open-domain Questions. Min et al. 2020. [Paper]
  15. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Yang et al. 2018. [Paper]
  16. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. Ho et al. 2020. [Paper]
  17. IIRC: A Dataset of Incomplete Information Reading Comprehension Questions. Ferguson et al. 2020. [Paper]
  18. MuSiQue: Multihop Questions via Single-hop Question Composition. Trivedi et al. 2022. [Paper]
  19. ELI5: Long Form Question Answering. Fan et al. 2019. [Paper]
  20. FEVER: a large-scale dataset for Fact Extraction and VERification. Thorne et al. 2018. [Paper]
  21. Fool Me Twice: Entailment from Wikipedia Gamification. Eisenschlos et al. 2021. [Paper]
  22. HoVer: A Dataset for Many-Hop Fact Extraction And Claim Verification. Jiang et al. 2020. [Paper]
  23. The Fact Extraction and VERification Over Unstructured and Structured information (FEVEROUS) Shared Task. Aly et al. 2021. [Paper]
  24. T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples. Elsahar et al. 2018. [Paper]
  25. Zero-Shot Relation Extraction via Reading Comprehension. Levy et al. 2017. [Paper]
  26. Language Models as Knowledge Bases?. Petroni et al. 2019. [Paper]
  27. Neural Text Generation from Structured Data with Application to the Biography Domain. Lebret et al. 2016. [Paper]
  28. WikiAsp: A Dataset for Multi-domain Aspect-based Summarization. Hayashi et al. 2021. [Paper]
  29. KILT: a Benchmark for Knowledge Intensive Language Tasks. Petroni et al. 2021. [Paper]
  30. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. Rae et al. 2022. [Paper]
  31. Curation Corpus Base. Curation et al. 2020. [Paper]
  32. Pointer sentinel mixture models. Merity et al. 2016. [Paper]
  33. The LAMBADA dataset: Word prediction requiring a broad discourse context. Paperno et al. 2016. [Paper]
  34. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel et al. 2020. [Paper]
  35. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. Gao et al. 2020. [Paper]
  36. Wizard of Wikipedia: Knowledge-Powered Conversational agents. Dinan et al. 2019. [Paper]
  37. Grounded response generation task at dstc7. Galley et al. 2019. [Paper]
  38. "What do others think?": Task-Oriented Conversational Modeling with Subjective Knowledge. Zhao et al. 2023. [Paper]
  39. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Gehman et al. 2020. [Paper]
  40. Hey AI, Can You Solve Complex Tasks by Talking to Agents?. Khot et al. 2022. [Paper]
  41. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Geva et al. 2021. [Paper]
  42. TempQuestions: A Benchmark for Temporal Question Answering. Jia et al. 2018. [Paper]
  43. INFOTABS: Inference on Tables as Semi-structured Data. Gupta et al. 2020. [Paper]

Studies

  1. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. Manakul et al. 2023. [Paper]
  2. Evaluating Open Question Answering Evaluation. Wang et al. 2023. [Paper]
  3. Measuring and Modifying Factual Knowledge in Large Language Models. Pezeshkpour et al. 2023. [Paper]
  4. A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation. Varshney et al. 2023. [Paper]
  5. FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios. Chern et al. 2023. [Paper]
  6. Language Models (Mostly) Know What They Know. Kadavath et al. 2022. [Paper]
  7. Generate rather than retrieve: Large language models are strong context generators. Yu et al. 2023. [Paper]
  8. Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators. Chen et al. 2023. [Paper]
  9. Teaching language models to support answers with verified quotes. Menick et al. 2022. [Paper]

Evaluating Domain-specific Factuality

  1. PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance. Xie et al. 2023. [Paper]
  2. When FLUE Meets FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain. Shah et al. 2022. [Paper]
  3. EcomGPT: Instruction-tuning Large Language Model with Chain-of-Task Tasks for E-commerce. Li et al. 2023. [Paper]
  4. CMB: A Comprehensive Medical Benchmark in Chinese. Wang et al. 2023. [Paper]
  5. GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information. Jin et al. 2023. [Paper]
  6. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. Guha et al. 2023. [Paper]
  7. LawBench: Benchmarking Legal Knowledge of Large Language Models. Fei et al. 2023. [Paper]

Factuality Enhancement

On Standalone LLM Generation

Pretraining-based

Initial Pretraining
  1. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. Yejin Bang et al. arXiv 2023. [Paper]
  2. Deduplicating Training Data Makes Language Models Better. Lee, Katherine et al. ACL 2022. [Paper]
  3. Unsupervised Improvement of Factual Knowledge in Language Models. Sadeq, Nafis et al. EACL 2023. [Paper]
Continual Pretraining
  1. Factuality Enhanced Language Models for Open-Ended Text Generation. Lee, Nayeon et al. NeurIPS 2022. [Paper]

Supervised Finetuning

Continual SFT
  1. SKILL: Structured Knowledge Infusion for Large Language Models. Moiseev, Fedor et al. NAACL 2022. [Paper]
  2. Contrastive Learning Reduces Hallucination in Conversations. Sun, Weiwei et al. AAAI 2023. [Paper]
  3. ChatGPT is not Enough: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling. Linyao Yang et al. arXiv 2023. [Paper]
Model Editing
  1. Editing Large Language Models: Problems, Methods, and Opportunities. Yunzhi Yao et al. arXiv 2023. [Paper]
  2. Knowledge Neurons in Pretrained Transformers. Dai, Damai et al. ACL 2022. [Paper]
  3. Locating and Editing Factual Associations in GPT. Kevin Meng et al. NeurIPS 2022. [Paper]
  4. Editing Factual Knowledge in Language Models. De Cao, Nicola et al. EMNLP 2021. [Paper]
  5. Fast Model Editing at Scale. Eric Mitchell et al. ICLR 2022. [Paper]
  6. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. Kenneth Li et al. arXiv 2023. [Paper]

Multi-Agent

  1. Improving Factuality and Reasoning in Language Models through Multiagent Debate. Yilun Du et al. arXiv 2023. [Paper]
  2. LM vs LM: Detecting Factual Errors via Cross Examination. Roi Cohen et al. arXiv 2023. [Paper]

Novel Prompt

  1. Generate Rather than Retrieve: Large Language Models are Strong Context Generators. Yu, Wenhao et al. ICLR 2023. [Paper]
  2. "According to ..." Prompting Language Models Improves Quoting from Pre-Training Data. Orion Weller et al. arXiv 2023. [Paper]
  3. Decomposed Prompting: A Modular Approach for Solving Complex Tasks. Tushar Khot et al. arXiv 2023. [Paper]
  4. Chain-of-Verification Reduces Hallucination in Large Language Models. Dhuliawala et al. arXiv 2023. [Paper]

Decoding

  1. Factuality Enhanced Language Models for Open-Ended Text Generation. Lee, Nayeon et al. NeurIPS 2022. [Paper]
  2. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. Chuang, Yung-Sung et al. arXiv 2023. [Paper]

On Retrieval-Augmented Generation

Normal RAG Setting

  1. Improving Language Models by Retrieving From Trillions of Tokens. Sebastian Borgeaud et al. arXiv 2021. [Paper]
  2. Internet-Augmented Language Models through Few-Shot Prompting for Open-Domain Question Answering. Angeliki Lazaridou et al. arXiv 2022. [Paper]

Interactive Retrieval

CoT-based Retrieval
  1. Rethinking with Retrieval: Faithful Large Language Model Inference. Hangfeng He et al. arXiv 2023. [Paper]
  2. Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. Trivedi, Harsh et al. ACL 2023. [Paper]
  3. Active Retrieval Augmented Generation. Zhengbao Jiang et al. arXiv 2023. [Paper]
Agent-based Retrieval
  1. ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao et al. arXiv 2023. [Paper]
  2. Reflexion: Language Agents with Verbal Reinforcement Learning. Noah Shinn et al. arXiv 2023. [Paper]
  3. A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation. Neeraj Varshney et al. arXiv 2023. [Paper]

Retrieval Adaptation

Prompt-based
  1. Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback. Baolin Peng et al. arXiv 2023. [Paper]
  2. Knowledge-Augmented Language Model Verification. Jinheon Baek et al. EMNLP 2023. [Paper]
  3. WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia. Semnani et al. EMNLP 2023 findings. [Paper] [GitHub] [Demo]
SFT-based
  1. Atlas: Few-shot Learning with Retrieval Augmented Language Models. Gautier Izacard et al. arXiv 2022. [Paper]
  2. REPLUG: Retrieval-Augmented Black-Box Language Models. Weijia Shi et al. arXiv 2023. [Paper]
  3. SAIL: Search-Augmented Instruction Learning. Luo, Hongyin et al. arXiv 2023. [Paper]
RLHF-based
  1. Teaching Language Models to Support Answers with Verified Quotes. Jacob Menick et al. arXiv 2022. [Paper]

Retrieval on External Memory

  1. Decoupled Context Processing for Context Augmented Language Modeling. Zonglin Li et al. NeurIPS 2022. [Paper]
  2. G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks. Zhongwei Wan et al. EMNLP 2022. [Paper]
  3. Parameter-Efficient Transfer Learning for NLP. Neil Houlsby et al. ICML 2019. [Paper]
  4. KALA: Knowledge-Augmented Language Model Adaptation. Kang, Minki et al. NAACL 2022. [Paper]
  5. Entities as Experts: Sparse Memory Access with Entity Supervision. Thibault Févry et al. EMNLP 2020. [Paper]
  6. Mention Memory: Incorporating Textual Knowledge into Transformers through Entity Mention Attention. Michiel de Jong et al. ICLR 2022. [Paper]
  7. Plug-and-Play Knowledge Injection for Pre-trained Language Models. Zhang, Zhengyan et al. ACL 2023. [Paper]
  8. Evidence-based Factual Error Correction. Thorne, James et al. ACL 2021. [Paper]
  9. RARR: Researching and Revising What Language Models Say, Using Language Models. Gao, Luyu et al. ACL 2023. [Paper]
  10. PURR: Efficiently Editing Language Model Hallucinations by Denoising Language Model Corruptions. Chen, Anthony et al. arXiv 2023. [Paper]

Retrieval on Structured Knowledge Source

  1. Mitigating Language Model Hallucination with Interactive Question-Knowledge Alignment. Shuo Zhang et al. arXiv 2023. [Paper]
  2. StructGPT: A general framework for Large Language Model to Reason on Structured Data. Jinhao Jiang et al. arXiv 2023. [Paper]
  3. Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering. Jinheon Baek et al. arXiv 2023. [Paper]

Domain Factuality Enhanced LLMs

Healthcare Domain-enhanced LLMs

  1. CohortGPT: An Enhanced GPT for Participant Recruitment in Clinical Study. Guan, Zihan et al. arXiv 2023. [paper]
  2. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Li, Yunxiang et al. Cureus 2023. [paper]
  3. Deid-GPT: Zero-Shot Medical Text De-Identification By Gpt-4. Liu, Zhengliang et al. arXiv 2023. [paper]
  4. BioMedLM: A Domain-Specific Large Language Model for Biomedical Text. Venigalla, A et al. [blog] [model]
  5. MedChatZH: A Better Medical Adviser Learns from Better Instructions. Tan, Yang et al. arXiv 2023. [paper]
  6. BioGPT: Generative Pre-Trained Transformer for Biomedical Text Generation and Mining. Luo, Renqian et al. Briefings in Bioinformatics 2022. [paper]
  7. GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information. Jin, Qiao et al. arXiv 2023. [paper]
  8. Almanac: Retrieval-Augmented Language Models for Clinical Medicine. Hiesinger, William et al. arXiv 2023. [paper]
  9. MolXPT: Wrapping Molecules with Text for Generative Pre-training. Liu, Zequn et al. arXiv 2023. [paper]
  10. HuatuoGPT, Towards Taming Language Model to Be a Doctor. Zhang, Hongbo et al. arXiv 2023. [paper]
  11. Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue. Yang, Songhua et al. arXiv 2023. [paper]
  12. Augmenting Black-box LLMs with Medical Textbooks for Clinical Question Answering. Wang, Yubo et al. arXiv 2023. [paper]
  13. DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation. Bao, Zhijie et al. arXiv 2023. [paper]

Legal Domain-enhanced LLMs

  1. Brief Report on LawGPT 1.0: A Virtual Legal Assistant Based on GPT-3. Nguyen, Ha-Thanh et al. arXiv 2023. [paper]
  2. Chatlaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases. Cui, Jiaxi et al. arXiv 2023. [paper]
  3. Explaining Legal Concepts with Augmented Large Language Models (GPT-4). Savelka, Jaromir et al. arXiv 2023. [paper]
  4. Lawyer LLaMA Technical Report. Huang, Quzhe et al. arXiv 2023. [paper]

Finance Domain-enhanced LLMs

  1. EcomGPT: Instruction-tuning Large Language Model with Chain-of-Task Tasks for E-commerce. Li, Yangning et al. arXiv 2023. [paper]
  2. BloombergGPT: A Large Language Model for Finance. Shijie Wu et al. arXiv 2023. [paper]

Other Domain-Enhanced LLMs

Geoscience and Environment domain-enhanced LLMs
  1. Learning A Foundation Language Model for Geoscience Knowledge Understanding and Utilization. Deng, Cheng et al. arXiv 2023. [paper]
  2. HouYi: An Open-Source Large Language Model Specially Designed for Renewable Energy and Carbon Neutrality Field. Bai, Mingliang et al. arXiv 2023. [paper]
Education Domain-enhanced LLMs
  1. GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning. Fan, Yaxin et al. arXiv 2023. [paper]
Food Domain-enhanced LLMs
  1. FoodGPT: A Large Language Model in Food Testing Domain with Incremental Pre-training and Knowledge Graph Prompt. Qi, Zhixiao et al. arXiv 2023. [paper]
Home Renovation Domain-enhanced LLMs
  1. ChatHome: Development and Evaluation of a Domain-Specific Language Model for Home Renovation. Wen, Cheng et al. arXiv 2023. [paper]

Tables

Table: Comparison between the factuality issue and the hallucination issue.

| Category | Description |
| --- | --- |
| Factual and Non-Hallucinated | Factually correct outputs. |
| Non-Factual and Hallucinated | Entirely fabricated outputs. |
| Hallucinated but Factual | 1. Outputs that are unfaithful to the prompt but remain factually correct (cao-etal-2022-hallucinated).<br> 2. Outputs that deviate from the prompt's specifics but don't touch on factuality, e.g., a prompt asking for a story about a rabbit and wolf becoming friends, but the LLM produces a tale about a rabbit and a dog befriending each other.<br> 3. Outputs that provide additional factual details not specified in the prompt, e.g., a prompt asking about the capital of France, and the LLM responds with "Paris, which is known for the Eiffel Tower." |
| Non-Factual but Non-Hallucinated | 1. Outputs where the LLM states, "I don't know," or avoids a direct answer.<br> 2. Outputs that are partially correct, e.g., for the question, "Who landed on the moon with Apollo 11?" If the LLM responds with just "Neil Armstrong," the answer is incomplete but not hallucinated.<br> 3. Outputs that provide a generalized or vague response without specific details, e.g., for a question about the causes of World War II, the LLM might respond with "It was due to various political and economic factors." |
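
The table above treats factuality and hallucination as two separate axes. A minimal Python sketch of that 2x2 grid follows. It assumes the two boolean judgments come from some external judge (human or model); the names `OutputCategory` and `categorize` are illustrative and do not come from the survey.

```python
from enum import Enum


class OutputCategory(Enum):
    """The four rows of the comparison table (labels copied from the table)."""
    FACTUAL_NON_HALLUCINATED = "Factual and Non-Hallucinated"
    NON_FACTUAL_HALLUCINATED = "Non-Factual and Hallucinated"
    HALLUCINATED_BUT_FACTUAL = "Hallucinated but Factual"
    NON_FACTUAL_NON_HALLUCINATED = "Non-Factual but Non-Hallucinated"


def categorize(is_factual: bool, is_hallucinated: bool) -> OutputCategory:
    """Map two independent judgments onto one table row.

    Assumption: `is_factual` and `is_hallucinated` are supplied by an
    external judge; this function only encodes the grid itself.
    """
    if is_factual and is_hallucinated:
        return OutputCategory.HALLUCINATED_BUT_FACTUAL
    if is_factual:
        return OutputCategory.FACTUAL_NON_HALLUCINATED
    if is_hallucinated:
        return OutputCategory.NON_FACTUAL_HALLUCINATED
    return OutputCategory.NON_FACTUAL_NON_HALLUCINATED


if __name__ == "__main__":
    # "Paris, which is known for the Eiffel Tower": correct, but adds
    # detail the prompt did not ask for.
    print(categorize(is_factual=True, is_hallucinated=True).value)
```

How each flag is decided (for example, whether "I don't know" counts as non-factual) is left to the judge, as in the table's own examples.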

Causes of Factual Errors

| Category | Cause | Example Dialog | Notes and references |
| --- | --- | --- | --- |
| Model-level causes | Domain knowledge deficit | Q: CEO of Assicurazioni Generali? <br> BloombergGPT: Philippe Donnet <br> GPT-NeoX: Antonio De Lorenzo, Simone Gambarini, Enrico Zanetti <br> FLAN-T5-XXL: John M Forsyth, Christopher K Peters, {empty string} | BloombergGPT is a finance domain-specific language model. Wu et al, 2023 |
| Model-level causes | Outdated information | Q: When was Kyiv attacked by Russia?<br> ChatGPT: As of my last knowledge update in September 2021, Russia had not launched an attack on Kyiv. | Kyiv was attacked by Russia on 25 February 2022. |
| Model-level causes | Reasoning error | Q: Who is Tom Cruise’s mother? <br> A: Mary Lee Pfeiffer <br> Q: Who is Mary Lee Pfeiffer’s son? <br> A: There is no widely known information about... | From Berglund et al, 2023. It is clear that the model knows Tom Cruise’s mother is Mary Lee Pfeiffer, but it fails to reason that Mary Lee Pfeiffer has a son named Tom Cruise. |
| Retrieval-level causes | Distracted by the retrieved information | Retrieved context: "Alan Melikdjanian - Age, Bio, Faces and Birthday"<br> Alan Melikdjanian is the Director, Writer (with his father), Editor, and Composer of the 2006 direct-to-DVD Russian-language film Citizen Mavzik...<br> Q: What is the nationality of the director of Citizen Mavzik? <br> A: (1) The director of the film Citizen Mavzik is Alan Melikdjanian. (2) Alan Melikdjanian is from Russia. So, the answer is Russia. | Alan Melikdjanian, known by the alias Captain Disillusion, is a Soviet-born American independent filmmaker and YouTuber. Melikdjanian et al, 2018 |
| Retrieval-level causes | Misunderstood the retrieved information | Retrieved context: "Gerard III, Count of Guelders - Wikipedia"<br> Gerard married Margaretha of Brabant...<br> Q: Who is Gerard I, Count Of Guelders's wife? <br> A: ... the name of Gerard I, Count of Guelders's wife, which is Margaretha of Brabant. | The retrieved context accurately provided the name of Gerard III, Count of Guelders' wife, which is Margaretha of Brabant. However, the model erroneously referred to him as Gerard I, resulting in an incorrect answer. |
| Retrieval-level causes | Failed to address misinformation in retrieved documents | Retrieved context: [0] "James Vincent - Death Notice and Service Information - Legacy.com"<br> James Vincent died on August 9, 2023...<br> [1] "UPDATE: Jury convicts James Vincent of murder after deliberating..." A jury convicted James Vincent of murder in the 2018 shooting death of Luka Grabb...<br> Q: When did James Vincent die? <br> A: James Vincent died on August 9, 2023, in Long Beach, California. | In the presence of contradictory information from retrieved documents, the model should thoroughly outline the response, while also noting the need for clarity. For instance, the model could state: "According to the first document, James Vincent passed away on August 9, 2023, in Long Beach. However, the second document suggests that James Vincent was murdered in 2018." |
| Inference-level causes | Snowballing | Q: Was there ever a US senator who represented the state of New Hampshire and whose alma mater was the University of Pennsylvania? <br> A: Yes... His name was John P. Hale | John P. Hale graduated from Bowdoin College. Zhang et al, 2023 |
| Inference-level causes | Exposure bias | Q: Aamari was very stoic. [PRONOUN] rarely showed any emotion. <br> A: He. | The correct answer was Xe according to Hossain et al, 2023. |
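
The note on misinformation in retrieved documents suggests that a model should report each source's claim when sources disagree. A rough Python sketch of that behavior follows. It assumes the claim supported by each document has already been extracted, which is the hard part and is out of scope here; `Evidence` and `answer_with_conflicts` are hypothetical names, and plain string comparison stands in for real claim matching.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str  # title or id of the retrieved document
    claim: str   # the answer that document supports, as a full sentence


def answer_with_conflicts(evidence: list[Evidence]) -> str:
    """Return a single claim if sources agree, else attribute each claim.

    Assumption: two claims "agree" only if they are the same string after
    trimming and lowercasing. A real system would need semantic matching.
    """
    if not evidence:
        return "No evidence was retrieved."
    distinct_claims = {item.claim.strip().lower() for item in evidence}
    if len(distinct_claims) == 1:
        return evidence[0].claim
    attributed = [f'According to "{item.source}", {item.claim}' for item in evidence]
    return "The retrieved documents disagree. " + " ".join(attributed)


if __name__ == "__main__":
    docs = [
        Evidence("Document 0", "the event took place in 2023."),
        Evidence("Document 1", "the event took place in 2018."),
    ]
    print(answer_with_conflicts(docs))
```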

Evaluations

| Reference | Task | Dataset | Metrics | Human Eval | Evaluated LLMs | Granularity |
| --- | --- | --- | --- | --- | --- | --- |
| FActScore Min et al, 2023 | Biography Generation | 183 people entities | F1 |  | GPT-3.5,<br/>ChatGPT... | T |
| SelfCheckGPT Manakul et al, 2023 | Bio Generation | WikiBio | AUC-PR,<br/>Human Score |  | GPT-3,<br/>LLaMA,<br/>OPT,<br/>GPT-J... | S |
| Wang et al, 2023 | Open QA | NQ, TQ | ACC,<br/>EM |  | GPT-3.5,<br/>ChatGPT,<br/>GPT-4,<br/>Bing Chat | S |
| Pezeshkpour et al, 2023 | Knowledge Probing | T-REx,<br/>LAMA | ACC |  | GPT3.5 | T |
| De Cao et al, 2021 | QA,<br/>Fact Checking | KILT,<br/>FEVER,<br/>zsRE | ACC |  | GPT-3,<br/>FLAN-T5 | S/T |
| Varshney et al, 2023 | Article Generation | Unnamed Dataset | ACC,<br/>AUC |  | GPT3.5,<br/>Vicuna | S |
| FacTool Chern et al, 2023 | KB-based QA | RoSE | ACC,<br/>F1... |  | GPT-4,<br/>ChatGPT,<br/>FLAN-T5 | S |
| Kadavath et al, 2022 | Self-evaluation | BIG Bench,<br/>MMLU, LogiQA,<br/>TruthfulQA,<br/>QuALITY, TriviaQA, Lambada | ACC,<br/>Brier Score,<br/>RMS Calibration Error... |  | Claude | T |

| Reference | Task | Dataset | Metrics | Human Eval | Evaluated LLMs | Granularity |
| --- | --- | --- | --- | --- | --- | --- |
| Retro Borgeaud et al, 2022 | QA,<br>Language<br>Modeling | MassiveText,<br> Curation Corpus,<br> Wikitext103,<br> Lambada,<br> C4, Pile, NQ | PPL,<br> ACC,<br> Exact Match |  | Retro | T |
| GenRead Yu et al, 2023 | QA,<br> Dialogue,<br> Fact Checking | NQ, TQ, WebQ,<br> FEVER,<br> FM2, WoW | EM, ACC,<br> F1, Rouge-L | - | GPT3.5, Codex<br>GPT-3, Gopher<br>FLAN, GLaM<br>PaLM | S |
| GopherCite Menick et al, 2022 | Self-supported QA | NQ, ELI5,<br> TruthfulQA<br>(Health, Law, Fiction, Conspiracies) | Human Score |  | GopherCite | S |
| Trivedi et al, 2023 | QA | HotpotQA, IIRC<br>2WikiMultihopQA,<br> MuSiQue(music) | Retrieval recall,<br> Answer F1 | - | GPT-3<br>FLAN-T5 | S/T |
| Peng et al, 2023 | QA,<br> Dialogue | DSTC7 track2<br> DSTC11 track5,<br> OTT-QA | ROUGE, chrF,<br> BERTScore, Usefulness,<br> Humanness... |  | ChatGPT | S/T |
| CRITIC Gou et al, 2023 | QA<br>Toxicity Reduction | AmbigNQ, TriviaQA, HotpotQA,<br> RealToxicityPrompts | Exact Match, maximum toxicity,<br> perplexity, n-gram diversity,<br> AUROC... | - | GPT-3.5<br>ChatGPT | T |
| Khot et al, 2023 | QA,<br> long-context QA | CommaQA-E, 2WikiMultihopQA, MuSiQue, HotpotQA | Exact Match, Answer F1 | - | GPT-3<br>FLAN-T5 | T |
| ReAct Yao et al, 2023 | QA<br> Fact Verification | HotpotQA, FEVER | Exact Match, ACC | - | PaLM<br>GPT-3 | S/T |
| Jiang et al, 2023 | QA, Commonsense Reasoning,<br> long-form QA... | 2WikiMultihopQA, StrategyQA, ASQA, WikiAsp | Exact Match, Disambig-F1, ROUGE,<br> entity F1... | - | GPT-3.5 | T |
| Lee et al, 2022 | Open-ended Generation | FEVER | Entity score, Entailment Ratio, ppl... | - | Megatron-LM | T |
| SAIL Luo et al, 2023 | QA<br> Fact Checking | UniLC | ACC<br> F1 | - | LLaMA Vicuna<br>SAIL | T |
| He et al, 2022 | Commonsense Reasoning, Temporal Reasoning,<br> Tabular Reasoning | StrategyQA, TempQuestions, INFOTABS | ACC | - | GPT-3 | T |
| Pan et al, 2023 | Fact Checking | HOVER<br> FEVEROUS-S | Macro-F1 | - | Codex<br>FLAN-T5 | S |
| Multiagent Debate Du et al, 2023 | Biography<br> MMLU | Unnamed Biography Dataset,<br> MMLU | ChatGPT Evaluator, ACC | - | Bard<br>ChatGPT | S |
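
Many rows above report Exact Match (EM) and answer F1. As a rough illustration, the sketch below implements one common, SQuAD-style formulation of both metrics in Python. The individual papers may normalize answers differently, so the normalization steps here (lowercasing, stripping punctuation and English articles) are an assumption, not a description of any listed paper's scoring script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", "eiffel tower"))
    print(token_f1("Neil Armstrong", "Neil Armstrong and Buzz Aldrin"))
```

The second call shows why a partially correct answer scores zero on EM but can still earn partial credit on F1.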

Benchmarks

| Reference | Task Type | Dataset | Metrics | Performance of Representative LLMs |
| --- | --- | --- | --- | --- |
| MMLU Hendrycks et al, 2021 | Multi-Choice QA | Humanities,<br/>Social Sciences,<br/>STEM... | ACC | (ACC, 5-shot)<br/>GPT-4: 86.4<br/>GPT-3.5: 70<br/>LLaMA2-70B: 68.9 |
| TruthfulQA Lin et al, 2022 | QA | Health, Law,<br/>Conspiracies,<br/>Fiction... | Human Score,<br/> GPT-judge,<br/> ROUGE, BLEU,<br/>MC1, MC2... | (zero-shot)<br/>GPT-4: ~29 (MC1)<br/>GPT-3.5: ~28 (MC1),<br/> 79.92 (%true)<br/>LLaMA2-70B: 53.37 (%true) |
| C-Eval Huang et al, 2023 | Multi-Choice QA | STEM,<br/>Social Science,<br/>Humanities... | ACC | (zero-shot, average ACC)<br/>GPT-4: 68.7<br/>GPT-3.5: 54.4<br/>LLaMA2-70B: 50.13 |
| AGIEval Zhong et al, 2023 | Multi-Choice QA | Gaokao (geometry, Bio,<br/>history...), SAT, Law... | ACC | (zero-shot, average ACC)<br/>GPT-4: 56.4<br/>GPT-3.5: 42.9<br/>LLaMA2-70B: 40.02 |
| HaluEval Li et al, 2023 | Hallucination Evaluation | HaluEval | ACC | (general ACC)<br/>GPT-3.5: 86.22 |
| BigBench Srivastava et al, 2023 | Multi-tasks (QA, NLI, Fact Checking, Reasoning...) | BigBench | Metric for each type of task | (Big-Bench Hard)<br/>GPT-3.5: 49.6<br/>LLaMA-65B: 42.6 |
| ALCE Gao et al, 2023 | Citation Generation | ASQA, ELI5,<br/>QAMPARI | MAUVE, Exact Match, ROUGE-L... | (ASQA, 3-psg, citation prec)<br/>GPT-3.5: 73.9<br/>LLaMA-33B: 23.0 |
| QUIP Weller et al, 2023 | Generative QA | TriviaQA,<br/> NQ, ELI5,<br/>HotpotQA | QUIP-Score, Exact match | (ELI5, QUIP, null prompt)<br/>GPT-4: 21.0<br/>GPT-3.5: 27.7 |
| PopQA Mallen et al, 2023 | Multi-Choice QA | PopQA,<br/>EntityQuestions | ACC | (overall ACC)<br/>GPT-3.5: ~37.0 |
| UniLC Zhang et al, 2023 | Fact Checking | Climate,<br/>Health, MGFN | ACC, F1 | (zero-shot, fact tasks, average F1)<br/>GPT-3.5: 51.62 |
| Pinocchio Hu et al, 2023 | Fact Checking, QA, Reasoning | Pinocchio | ACC, F1 | GPT-3.5: (Zero-shot ACC: 46.8, F1: 44.4)<br/>GPT-3.5: (Few-shot ACC: 47.1, F1: 45.7) |
| SelfAware Yin et al, 2023 | Self-evaluation | SelfAware | ACC | (instruction input, F1)<br/>GPT-4: 75.47<br/>GPT-3.5: 51.43<br/>LLaMA-65B: 46.89 |
| RealTimeQA Kasai et al, 2022 | Multi-Choice QA, Generative QA | RealTimeQA | ACC, F1 | (original setting, GCS retrieval)<br/>GPT-3: 69.3 (ACC for MC)<br/>GPT-3: 39.4 (F1 for generation) |
| FreshQA Vu et al, 2023 | Generative QA | FRESHQA | ACC (Human) | (strict ACC, null prompt)<br/>GPT-4: 28.6<br/>GPT-3.5: 26.0 |

Domain evaluation

| Reference | Domain | Task | Datasets | Metrics | Evaluated LLMs |
| --- | --- | --- | --- | --- | --- |
| Xie et al, 2023 | Finance | Sentiment analysis,<br/>News headline classification,<br/>Named entity recognition,<br/>Question answering,<br/>Stock movement prediction | FLARE | F1, Acc,<br/>Avg F1,<br/>Entity F1,<br/>EM, MCC | GPT-4,<br/>BloombergGPT,<br/>FinMA-(7B, 30B, 7B-full),<br/>Vicuna-7B |
| Li et al, 2023 | Finance | 134 E-com tasks | EcomInstruct | Micro-F1,<br/>Macro-F1,<br/>ROUGE | BLOOM, BLOOMZ,<br/>ChatGPT, EcomGPT |
| Wang et al, 2023 | Medicine | Multi-Choice QA | CMB | Acc | GPT-4, ChatGLM2-6B,<br/>ChatGPT, DoctorGLM,<br/>Baichuan-13B-chat,<br/>HuatuoGPT, MedicalGPT,<br/>ChatMed-Consult,<br/>ChatGLM-Med,<br/>Bentsao, BianQue-2 |
| Li et al, 2023 | Medicine | Generative-QA | Huatuo-26M | BLEU,<br/>ROUGE,<br/>GLEU | T5, GPT2 |
| Jin et al, 2023 | Medicine | Nomenclature,<br/>Genomic location,<br/>Functional analysis,<br/>Sequence alignment | GeneTuring | Acc | GPT-2, BioGPT,<br/>BioMedLM,<br/>GPT-3,<br/>ChatGPT, New Bing |
| Guha et al, 2023 | Law | Issue-spotting,<br/>Rule-recall,<br/>Rule-application,<br/>Rule-conclusion,<br/>Interpretation,<br/>Rhetorical-understanding | LegalBench | Acc, EM | GPT-4, GPT-3.5,<br/>Claude-1, Incite, OPT,<br/>Falcon, LLaMA-2, FLAN-T5... |
| Fei et al, 2023 | Law | Legal QA, NER,<br/>Sentiment Analysis,<br/>Reading Comprehension | LawBench | F1, Acc,<br/>ROUGE-L,<br/>Normalized log-distance... | GPT-4,<br/>ChatGPT,<br/>InternLM-Chat,<br/>StableBeluga2... |

Enhancement

Enhancement methods

| Reference | Dataset | Metrics | Baselines ➝ Theirs | Dataset | Metrics | Baselines ➝ Theirs |
| --- | --- | --- | --- | --- | --- | --- |
| Li et al, 2022 | NQ | EM | 34.5 ➝ 44.35 (T5 11B) | GSM8K | ACC | 77.0 ➝ 85.0 (ChatGPT) |
| Yu et al, 2023 | NQ | EM | 20.9 ➝ 28.0 (InstructGPT) | TriviaQA | EM | 57.5 ➝ 59.0 (InstructGPT) |
| - | - | - | - | WebQA | EM | 18.6 ➝ 24.6 (InstructGPT) |
| Chuang et al, 2023 | FACTOR News | ACC | 58.3 ➝ 62.0 (LLaMa-7B) | FACTOR News | ACC | 61.1 ➝ 62.5 (LLaMa-13B) |
| - | FACTOR News | ACC | 63.8 ➝ 65.4 (LLaMa-33B) | FACTOR News | ACC | 63.6 ➝ 66.2 (LLaMa-65B) |
| - | FACTOR Wiki | ACC | 58.6 ➝ 62.2 (LLaMa-7B) | FACTOR Wiki | ACC | 62.6 ➝ 66.2 (LLaMa-13B) |
| - | FACTOR Wiki | ACC | 69.5 ➝ 70.3 (LLaMa-33B) | FACTOR Wiki | ACC | 72.2 ➝ 72.4 (LLaMa-65B) |
| - | TruthfulQA | %Truth * Info | 32.4 ➝ 44.6 (LLaMa-13B) | TruthfulQA | %Truth * Info | 34.8 ➝ 49.2 (LLaMa-65B) |
| Li et al, 2022 | TruthfulQA | %Truth * Info | 32.4 ➝ 44.4 (LLaMa-13B) | TruthfulQA | %Truth * Info | 31.7 ➝ 36.7 (LLaMa-33B) |
| - | TruthfulQA | %Truth * Info | 34.8 ➝ 43.4 (LLaMa-65B) | - | - | - |
| Li et al, 2023 | NQ | ACC | 46.6 ➝ 51.3 (LLaMA-7B) | TriviaQA | ACC | 89.6 ➝ 91.1 (LLaMA-7B) |
| - | MMLU | ACC | 35.7 ➝ 40.1 (LLaMA-7B) | TruthfulQA | %Truth * Info | 32.5 ➝ 65.1 (Alpaca) |
| - | TruthfulQA | %Truth * Info | 26.9 ➝ 43.5 (LLaMa-7B) | TruthfulQA | %Truth * Info | 51.5 ➝ 74.0 (Vicuna) |
| Cohen et al, 2023 | LAMA | F1 | 50.7 ➝ 80.8 (ChatGPT) | TriviaQA | F1 | 56.2 ➝ 82.6 (ChatGPT) |
| - | NQ | F1 | 60.6 ➝ 79.1 (ChatGPT) | PopQA | F1 | 65.2 ➝ 85.4 (ChatGPT) |
| - | LAMA | F1 | 42.5 ➝ 79.3 (GPT-3) | TriviaQA | F1 | 46.7 ➝ 77.2 (GPT-3) |
| - | NQ | F1 | 52.0 ➝ 78.0 (GPT-3) | PopQA | F1 | 43.7 ➝ 77.4 (GPT-3) |
...

Domain-enhanced LLMs

ReferenceDomainModelEval TaskEval DatasetContinual Pretrained?Continual SFT?Train From Scratch?External Knowledge
Zhang et al, 2023HealthcareBaichuan-7B, Ziya-LLaMA-13BQAcMedQA2, WebMedQA, Huatuo-26M✔️
Yang et al, 2023HealthcareZiya-LLaMA-13BQACMtMedQA, huatuo-26M✔️✔️
Wang et al, 2023HealthcareGPT-3.5-Turbo, LLaMA-2-13BQAMedQAUSMLE, MedQAMCMLE, MedMCQA✔️
Ross et al, 2022HealthcareMOLFORMERMolecule properties prediction✔️
Bao et al, 2023HealthcareBaichuan-13BQACMB-Clin, CMD, CMID✔️
Guan et al, 2023HealthcareChatGPTIU-RR, MIMIC-CXR✔️
Liu et al, 2023HealthcareGPT-4Medical Text De-Identification✔️
Li et al, 2023HealthcareLLaMAQA✔️
Venigalla et al, 2022HealthcareGPT (2.7b)QA✔️
Xiong et al, 2023HealthcareChatGLM-6BQA✔️
Tan et al, 2023HealthcareBaichuan-7BQAC-Eval, MMLU✔️
Luo et al, 2022HealthcareGPT-2QA, DC, RE✔️
Jin et al, 2023HealthcareCodexQAGeneTuring✔️
Zakka et al, 2023Healthcaretext-davinci-003QAClinicalQA✔️
Liu et al, 2023HealthcareGPT-2mediumMolecular Property Prediction, Molecule-text translation✔️✔️
Nguyen et al, 2023LawGPT3✔️
Savelka et al, 2023LawGPT-4✔️
Huang et al, 2023LawLLaMACN Legal Tasks✔️✔️
Cui et al, 2023LawZiya-LLaMA-13BQAnational judicial examination question✔️✔️
Li et al, 2023FinanceBLOOMZ4 major tasks 12 subtasksEcomInstruct✔️
Wu et al, 2023FinanceBLOOMFinancial NLP (SA, BC, NER, NER+NED, QA)Financial Datasets✔️
Deng et al, 2023GeoscienceLLaMA-7BGeoBench✔️
Bai et al, 2023GeoscienceChatGLM-6B✔️
Fan et al, 2023Educationphoenix-inst-chat-7bChinese Grammatical Error CorrectionChatGPT-generated, Human-annotated✔️
Qi et al, 2023FoodChinese-LLaMA2-13BQA✔️✔️
Wen et al, 2023Home RenovationBaichuan-13BC-Eval, CMMLU, EvalHome✔️

Reference

If you find this project useful in your research or work, please consider citing it:

@misc{wang2023survey,
      title={Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity}, 
      author={Cunxiang Wang and Xiaoze Liu and Yuanhao Yue and Xiangru Tang and Tianhang Zhang and Cheng Jiayang and Yunzhi Yao and Wenyang Gao and Xuming Hu and Zehan Qi and Yidong Wang and Linyi Yang and Jindong Wang and Xing Xie and Zheng Zhang and Yue Zhang},
      year={2023},
      eprint={2310.07521},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Acknowledgements

  1. CHEN Liang (ChanLiang) for PR#1.
  2. JinheonBaek (JinheonBaek) for PR#2 and PR#3.
