
The full pipeline for creating the UHGEval hallucination dataset

Status: Full data; available.
Data location: ./sources/xinhua/raw/
Number: 75 txt files, 737,766 news articles in total
Note: These data belong to Xinhua News Agency and are used for research purposes only.

Status: No data included; generate it with the script.
Script: ./sources/xinhua/preprocessor.py
Data location: ./sources/xinhua/processed (created by running the script)
Number: 25,005 news articles retained (3.39% of the raw news).
Filtering settings:
- Only includes news categories such as: '政治', '法律', '军事', '教育', '体育', '经济', '市场', '科学', '技术', '医疗', '卫生', '社会', '文化', '艺术', '娱乐', '天气', '环保', '灾害', '事故' ('Politics', 'Law', 'Military', 'Education', 'Sports', 'Economics', 'Market', 'Science', 'Technology', 'Medical', 'Health', 'Society', 'Culture', 'Art', 'Entertainment', 'Weather', 'Environmental Protection', 'Disaster', 'Accident').
- The combined length of newsBeginning + newsRemainder is in the range [630, 870].
- newsBeginning has [2, 5] sentences. Note: sentence-ending symbols include "。；：？！"
- The length of newsBeginning is in the range [80, 120].
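
The filtering settings above can be sketched as a single predicate. This is a minimal illustration, not the code in preprocessor.py: it assumes each article is a dict, that lengths are counted in characters with len(), that a sentence is any non-empty piece between sentence-ending symbols, and that the category is stored under a hypothetical "category" key with an exact-match check.

```python
import re

# Categories copied from the filtering settings above.
ALLOWED_CATEGORIES = {
    "政治", "法律", "军事", "教育", "体育", "经济", "市场", "科学", "技术", "医疗",
    "卫生", "社会", "文化", "艺术", "娱乐", "天气", "环保", "灾害", "事故",
}
# Sentence-ending symbols listed in the settings.
SENTENCE_END_PATTERN = re.compile("[。；：？！]")


def count_sentences(text: str) -> int:
    """Count non-empty pieces after splitting on sentence-ending symbols."""
    return sum(1 for piece in SENTENCE_END_PATTERN.split(text) if piece.strip())


def keep_article(record: dict) -> bool:
    """Return True if the article passes all four preprocessing filters."""
    beginning = record["newsBeginning"]
    remainder = record["newsRemainder"]
    return (
        record["category"] in ALLOWED_CATEGORIES  # "category" is an assumed key
        and 630 <= len(beginning) + len(remainder) <= 870  # assumed: characters
        and 2 <= count_sentences(beginning) <= 5
        and 80 <= len(beginning) <= 120
    )
```

Under these assumptions, an article is kept only when every condition holds, so relaxing any single range would increase the number of retained articles.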

Status: No data included; generate it with the script.
Script: ./gen_candidates.py
Data location: ./candidates/
Number: 17,503 news articles retained (70.00% of the preprocessed news).
Filtering settings:
- keywordPrecision is in the range (0, 1) and generally should fall in (0.2, 0.6).
- candidateHallucinatedContinuation consists of exactly 1 sentence.
- The length of candidateHallucinatedContinuation is in the range [20, 70].
- appearedKeywords has at least 2 keywords.
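
The candidate filters above can be expressed the same way. This is a hedged sketch rather than the logic in gen_candidates.py: it assumes each candidate is a dict keyed by the field names in the list, that appearedKeywords is a list, that lengths are character counts, and that sentences are split on the symbols "。；：？！". The precision_range parameter is a hypothetical knob for switching between the hard bound (0, 1) and the recommended (0.2, 0.6).

```python
import re

# Assumed sentence-ending symbols, matching the preprocessing step.
SENTENCE_END_PATTERN = re.compile("[。；：？！]")


def count_sentences(text: str) -> int:
    """Count non-empty pieces after splitting on sentence-ending symbols."""
    return sum(1 for piece in SENTENCE_END_PATTERN.split(text) if piece.strip())


def keep_candidate(record: dict, precision_range=(0.0, 1.0)) -> bool:
    """Return True if the candidate continuation passes all four filters."""
    low, high = precision_range
    continuation = record["candidateHallucinatedContinuation"]
    return (
        low < record["keywordPrecision"] < high  # open interval, as in the settings
        and count_sentences(continuation) == 1
        and 20 <= len(continuation) <= 70  # assumed: characters
        and len(record["appearedKeywords"]) >= 2
    )
```

Calling keep_candidate(record, precision_range=(0.2, 0.6)) applies the narrower recommended band instead of the hard bound.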

Status: Partial data provided as examples; generate the full set with the script.
Script: ./gen_machine_annotations.py
Data location: ./machine_annotations/keyword_hallucinated
Note: Only articles labeled as containing hallucinations are kept for subsequent processing; those without hallucinations are stored in ./machine_annotations/unhallucinated

Label Studio is a multi-type data labeling and annotation tool with a standardized output format.

Relevant files can be found in ./label_studio_annotations/.