ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information

This directory contains the ChineseWebText2.0 dataset and MDFG-tool, a new toolchain for constructing large-scale, high-quality Chinese datasets with multi-dimensional and fine-grained information. The ChineseWebText2.0 dataset is publicly available on Hugging Face.

ChineseWebText2.0

We have released the latest and largest Chinese dataset, ChineseWebText 2.0, which consists of 3.8 TB of data. Each text in the dataset is accompanied by a quality score, single-label and multi-label domain tags, and a toxicity label and score, enabling LLM researchers to select data using their own quality, domain, and toxicity thresholds.
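As an illustration of threshold-based selection, the following is a minimal sketch that filters JSONL records by score. The field names `quality_score`, `toxicity_score`, and `multi_label`, as well as the default thresholds, are assumptions made for this example; check the released files for the actual schema.

```python
import json
from typing import Iterable, Iterator, Optional


def select_records(
    lines: Iterable[str],
    min_quality: float = 0.9,      # hypothetical default threshold
    max_toxicity: float = 0.1,     # hypothetical default threshold
    domain: Optional[str] = None,
) -> Iterator[dict]:
    """Yield records that pass the quality, toxicity, and optional domain filters."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Field names below are assumed, not taken from the dataset card.
        if record["quality_score"] < min_quality:
            continue
        if record["toxicity_score"] > max_toxicity:
            continue
        if domain is not None and domain not in record["multi_label"]:
            continue
        yield record
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, so a large shard is streamed rather than loaded into memory.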

MDFG-tool

Introduction

We introduce a new toolchain, MDFG-tool (see Figure 1). It begins with a coarse-grained filtering module that applies rule-based methods to clean the data, using criteria such as text length and sensitive words. After cleaning, a BERT-based model evaluates text quality and assigns each text a quality score; by selecting an appropriate threshold on this score, we can extract high-quality text that meets our needs. Next, we use FastText for both single-label and multi-label domain classification of the cleaned data. In parallel, a FastText model performs toxicity assessment, filtering out toxic content and assigning a toxicity score to each text. These scores allow researchers to set thresholds for identifying and selecting harmful texts for further training.

<div align="center"> <img src=".\assets\structure.png" width="50%" /> </div>

Environment Dependencies

The required packages are listed in requirement.txt.

Stage 1: Preprocessing

This section focuses on extracting high-quality text from Chinese monolingual data by employing manually constructed rules to filter out violent, pornographic, and advertisement content, as well as erroneous characters. The detailed filtering rules are outlined as follows:

Extract the text content from the JSONL files produced by the data preparation stage.

To improve language model training, documents will be filtered out if they have an average line length of fewer than 10 characters or a total text length of less than 200 characters, as such short texts often lack meaningful context and semantic relevance.

To build a high-quality Simplified Chinese dataset from web data, we eliminate Traditional Chinese characters and remove texts in which fewer than 30% of the characters are Chinese, so that the dataset is suitable for training large language models.

To prevent large language models from generating toxic content, we count the occurrences of harmful words from a predefined list in each text. Any text with more than 0.5 occurrences of such words per line is classified as toxic and removed from the training dataset.

To enhance training efficiency and model performance, we then run a repetition analysis at 13-gram granularity and filter out any data entry in which over 50% of the character sequences are repeated.
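The filtering rules above can be sketched roughly as follows. This is an illustrative sketch, not the code in `preprocess.py`: the function names are hypothetical, the Chinese-character test uses only the basic CJK Unified Ideographs block, the repetition ratio is assumed to be the share of character 13-grams that duplicate an earlier 13-gram in the same text, and the Traditional Chinese step is omitted.

```python
import re

# Assumption: "Chinese character" means the basic CJK Unified Ideographs block.
CJK = re.compile(r"[\u4e00-\u9fff]")


def _nonempty_lines(text):
    return [ln for ln in text.splitlines() if ln.strip()]


def passes_length_rule(text, min_avg_line=10, min_total=200):
    """Reject texts with short average lines or a short total length."""
    lines = _nonempty_lines(text)
    if not lines:
        return False
    total = sum(len(ln) for ln in lines)
    return total >= min_total and total / len(lines) >= min_avg_line


def chinese_ratio(text):
    """Fraction of non-whitespace characters that are Chinese."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if CJK.match(c)) / len(chars)


def harmful_words_per_line(text, harmful_words):
    """Occurrences of listed harmful words divided by the number of lines."""
    lines = _nonempty_lines(text)
    if not lines:
        return 0.0
    return sum(text.count(w) for w in harmful_words) / len(lines)


def repeated_ngram_ratio(text, n=13):
    """Share of character n-grams that already appeared earlier in the text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    seen, repeated = set(), 0
    for gram in grams:
        if gram in seen:
            repeated += 1
        else:
            seen.add(gram)
    return repeated / len(grams)


def keep_document(text, harmful_words):
    """Apply the four rules with the thresholds stated in the text."""
    return (
        passes_length_rule(text)
        and chinese_ratio(text) >= 0.30
        and harmful_words_per_line(text, harmful_words) <= 0.5
        and repeated_ngram_ratio(text) <= 0.5
    )
```

Keeping each rule in its own function makes it easy to log which rule rejected a document when tuning thresholds.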

Here is an example command to run the preprocessing stage:

python ./Preprocessing/preprocess.py --dates 

Stage 2: Quality Evaluation

In the preprocessing stage, we used handcrafted rules to remove explicitly noisy texts from the dataset. However, the remaining data still contains a considerable amount of low-quality text that handcrafted rules cannot filter out. To extract higher-quality data from it, this section proposes a quality evaluation model.

Stage 2.1: BERTEval

1. The Classification Results of Different Evaluation Models

<div align="center"> <img src=".\assets\BERTEval.png" width="40%" /> </div>

2. BERTEval Training and Inference

Stage 3: Domain Evaluation

1. Composition of Domain Training and Test Data

<div align="center"> <img src=".\assets\Composition of Domain.png" width="50%" /> </div>

2. Steps for Domain Classification

We developed an interactive rule- and model-guided classification system to provide accurate, domain-specific single-label and multi-label classifications for each data item. The specific process is as follows:

A rule-based classification approach was initially employed to assign preliminary labels using expert-curated keywords. Specifically, 20 to 50 keywords were designated for each category, with a frequency threshold of 3 to 5 non-repeating occurrences. Each text could receive single or multiple labels depending on keyword matches, while texts without category-specific keywords were labeled as "general."

Following the rule-based approach, FastText was utilized to perform model-based classification in a multi-label and multi-category framework.

To improve classification accuracy and generalizability, an iterative approach combined rule- and model-based methods. High-confidence predictions (confidence > 0.9) from the model were used to refine rule-based keywords, enhancing training data quality. This process allowed fine-tuning of labels, particularly for structured data categories like "instruction," ensuring comprehensive label coverage.

The following is the implementation of the domain classification process:

python ./Domain_Classifier/domain_classifier_process.py your_data_path_to_classifier.jsonl > result_output_path.jsonl 
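To make the rule-based step and the high-confidence selection more concrete, here is a minimal sketch. It is not the logic of `domain_classifier_process.py`: the function names and the keyword lists are hypothetical, the single label is assumed to be the domain with the most distinct keyword hits, and a text is assumed to fall back to "general" when no domain reaches the hit threshold.

```python
def rule_based_labels(text, keywords_by_domain, min_distinct_hits=3):
    """Assign preliminary domain labels from distinct keyword matches."""
    scored = []
    for domain, keywords in keywords_by_domain.items():
        # Count each keyword once, however often it occurs in the text.
        hits = {kw for kw in keywords if kw in text}
        if len(hits) >= min_distinct_hits:
            scored.append((len(hits), domain))
    if not scored:
        return {"single_label": "general", "multi_label": ["general"]}
    # Most distinct hits first; the stable sort keeps dictionary order on ties.
    scored.sort(key=lambda item: -item[0])
    return {
        "single_label": scored[0][1],
        "multi_label": [domain for _, domain in scored],
    }


def high_confidence(predictions, threshold=0.9):
    """Keep (text, label) pairs whose model confidence exceeds the threshold.

    `predictions` is an iterable of (text, label, confidence) tuples produced
    by whatever classifier is in use.
    """
    return [(text, label) for text, label, conf in predictions if conf > threshold]
```

The pairs returned by `high_confidence` could then be inspected to extend the keyword lists before the next iteration.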

Stage 4: Toxicity Evaluation

1. Composition of Toxicity Training and Test Data

<div align="center"> <img src=".\assets\Composition of Toxicity.png" width="40%" /> </div>

2. Steps for Toxicity Classification

To evaluate and score toxicity levels within the dataset, we trained a FastText model. Compared to the BERT model, it strikes a favorable balance between performance and computational efficiency, delivering high accuracy while significantly reducing both training and inference time. The specific steps are as follows:

To enhance the efficiency of data collection, we integrated high-quality Chinese toxicity datasets into the initial training set. We also sampled a subset of data from our large-scale dataset to serve as benign samples. Furthermore, to maintain a balanced distribution between toxic and benign samples, the number of toxic samples in the training set was doubled. Using this data, we trained an initial FastText model, named Toxic Classifier R0.

We refined our dataset by applying Toxic Classifier R0 to score a subset of our large-scale dataset, selecting samples with toxicity scores above 0.5 for further analysis. These candidates were re-evaluated by the Qwen2.5-32B model, which classified them as toxic or benign to form a new training set. The FastText model was then retrained on this refined data, enhancing its performance. This iterative process, conducted over two rounds, improved the model’s generalization and classification accuracy, yielding Toxic Classifier R1 in the first round and R2 in the second.

While models trained with the LLM-in-the-loop approach demonstrate strong generalization across diverse datasets, their performance is limited on specialized datasets, such as Chinese poetry, which contain fewer toxic samples. To address this, we incorporated directly extracted samples into the training set to improve performance across varied data types. Additionally, we introduced a hybrid rule-based and model-based method to handle mathematical formulas, classifying sentences with numerical and symbolic elements exceeding 50% as non-toxic instructions.
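The formula rule in the last step could look roughly like the sketch below. The exact definition of "numerical and symbolic elements" is not given here, so the character set is an assumption: digits, Unicode symbol characters, and a small set of ASCII operators and brackets. The function names and the 0.5 score cutoff for the model branch are also hypothetical.

```python
import unicodedata

# Assumption: ASCII operators and brackets that Unicode classes as punctuation.
EXTRA_MATH_CHARS = set("-*/%()[]{}")


def symbol_ratio(sentence):
    """Fraction of non-whitespace characters that are digits or symbols."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 0.0

    def is_numeric_or_symbol(c):
        return (
            c.isdigit()
            or unicodedata.category(c).startswith("S")  # e.g. + = < > ^ $
            or c in EXTRA_MATH_CHARS
        )

    return sum(1 for c in chars if is_numeric_or_symbol(c)) / len(chars)


def toxicity_label(sentence, model_score, score_threshold=0.5):
    """Hybrid decision: the formula rule takes priority over the model score."""
    if symbol_ratio(sentence) > 0.5:
        return "non-toxic"
    return "toxic" if model_score > score_threshold else "non-toxic"
```

Applying the rule before the model score means a formula-heavy sentence is never flagged, even if the classifier assigns it a high score.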

cd fasttext
python ./toxic_classifier/main.py --mode train --train_file ./data/train.txt --test_file ./data/test.txt
python ./toxic_classifier/predict_toxic.py  file_path  save_path
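The data side of the training steps above can be sketched as follows. This is an illustration under stated assumptions rather than the repository's code: `score_fn` and `llm_is_toxic` are hypothetical callables standing in for the current toxic classifier and the Qwen2.5-32B re-evaluation, the label names `toxic` and `benign` are chosen for the example, and word segmentation of Chinese text is left out. The `__label__` prefix is the standard fastText supervised training format.

```python
def to_fasttext_line(text, label):
    """Format one sample as a single-line fastText supervised training example."""
    # Collapse newlines and repeated whitespace so the sample stays on one line.
    return f"__label__{label} " + " ".join(text.split())


def initial_training_lines(toxic_texts, benign_texts, toxic_copies=2):
    """Build the initial training set, repeating each toxic sample `toxic_copies` times."""
    lines = []
    for text in toxic_texts:
        lines.extend([to_fasttext_line(text, "toxic")] * toxic_copies)
    for text in benign_texts:
        lines.append(to_fasttext_line(text, "benign"))
    return lines


def relabel_round(samples, score_fn, llm_is_toxic, threshold=0.5):
    """One refinement round: score, keep candidates above the threshold, relabel.

    score_fn(text) -> float       stands in for the current classifier
    llm_is_toxic(text) -> bool    stands in for the LLM re-evaluation
    """
    labeled = []
    for text in samples:
        if score_fn(text) <= threshold:
            continue  # only high-scoring candidates are sent to the LLM
        labeled.append((text, "toxic" if llm_is_toxic(text) else "benign"))
    return labeled
```

The pairs returned by `relabel_round` can be passed through `to_fasttext_line` and written to the training file used by the `--train_file` option.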

Citation

Please cite the paper if you use the data or code in this repo.

@misc{zhang2024chinesewebtext20largescalehighquality,
      title={ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information}, 
      author={Wanyue Zhang and Ziyong Li and Wen Yang and Chunlin Leng and Yinan Bai and Qianlong Du and Chengqing Zong and Jiajun Zhang},
      year={2024},
      eprint={2411.19668},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.19668}, 
}