
<div align="center"> <h2>LLM-as-a-Coauthor: Can Mixed Human-Written and Machine-Generated Text Be Detected?</h2>

Paper Dataset

<img src="https://img.shields.io/github/last-commit/Dongping-Chen/MixSet?style=flat-square&color=5D6D7E" alt="git-last-commit" /> <img src="https://img.shields.io/github/commit-activity/m/Dongping-Chen/MixSet?style=flat-square&color=5D6D7E" alt="GitHub commit activity" /> <img src="https://img.shields.io/github/languages/top/Dongping-Chen/MixSet?style=flat-square&color=5D6D7E" alt="GitHub top language" /> <img src="figures/outline.jpg"> <img src="figures/self_bleu.jpg"> <p align="center"> </p> </div>

Updates & News


Dataset: MixSet

Overview

MixSet is a dataset for machine learning experiments on mixed human-written and machine-generated text (MixText). It is structured to support tasks such as machine-generated text (MGT) classification in the era of LLMs and natural language understanding.

Dataset Location

The dataset is located in the ./data/MixSet/ directory relative to the project's root. Ensure that this path exists and contains the necessary data files before running any scripts that depend on the MixSet dataset.
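
As a quick sanity check before running dependent scripts, the following minimal Python sketch lists the files found under the dataset directory. The helper name `list_data_files` is hypothetical and not part of the repository; it assumes only the `./data/MixSet/` location stated above and makes no assumption about which files the directory contains.

```python
from pathlib import Path


def list_data_files(data_dir="./data/MixSet"):
    """Return sorted relative paths of all files under data_dir.

    Hypothetical helper, not part of the MixSet repository.
    Raises FileNotFoundError if the directory is missing.
    """
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {root}")
    # Walk the directory recursively and keep regular files only.
    return sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())
```

Calling `list_data_files()` from the project root either returns the file list or fails early with a clear error, which is usually easier to diagnose than a failure in the middle of an experiment script.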

Usage

Benchmark your MGT/MixText detector

Please refer to ./data/MixSet/README.md for our MixSet data structure and how to leverage our dataset with ease.

Reproducing the Experiments

Prerequisites

Installation

To set up your environment to run the code, follow these steps:

  1. Clone the repository:
git clone https://github.com/Dongping-Chen/MixSet.git
cd MixSet
  2. Create and activate a virtual environment (optional but recommended), then install the required packages:
conda create --name mixset python=3.9
conda activate mixset
pip install -r requirements.txt
  3. Download the datasets: to download the pure MGT and HWT datasets, please refer to this link, then move the dataset folders to <YOUR PATH>/MixSet/data/MGT_datasets/ and <YOUR PATH>/MixSet/data/pure_processed_HWT/.

  4. Download the GPT-Sentinel checkpoint: follow the instructions here to download the pre-trained GPT-Sentinel t5-small checkpoint t5-small.0422.pt, and place it in <YOUR PATH>/MixSet/.
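
Once the downloads are in place, a short script can confirm that everything landed where the steps above expect it. This is a minimal sketch, not part of the repository: the function `missing_paths` and the constant `REQUIRED_PATHS` are hypothetical names, and the paths are taken from the setup steps, relative to the MixSet project root.

```python
from pathlib import Path

# Paths named in the setup steps, relative to the MixSet project root.
REQUIRED_PATHS = (
    "data/MGT_datasets",
    "data/pure_processed_HWT",
    "t5-small.0422.pt",
)


def missing_paths(project_root=".", required=REQUIRED_PATHS):
    """Return the required paths that do not exist under project_root.

    Hypothetical helper for illustration; an empty list means the
    layout described in the setup steps is present.
    """
    root = Path(project_root)
    return [rel for rel in required if not (root / rel).exists()]
```

Running `missing_paths()` from the repository root returns an empty list when both dataset folders and the checkpoint file are present; otherwise it names what still needs to be downloaded or moved.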

Experiment 1

To reproduce the first experiment, run:

./Ex1_run.sh

To run GPT-Zero, use:

./Ex1_run_GPTzero

As for Ghostbuster, we will update the code as soon as possible.

Experiment 2

To reproduce the second experiment for binary classification, run:

./Ex2_binary_run

To reproduce the second experiment for three-class classification, run:

./Ex2_three_class_run

Experiment 3

To reproduce the third experiment for operation-wise transfer learning, run:

./Ex3_operation_train.sh
./Ex3_operation_test.sh

To reproduce the third experiment for LLM-wise transfer learning, run:

./Ex3_LLM_transfer.sh

Storage Requirements for Experiments 3 and 4 Scripts

The scripts for Experiments 3 and 4 save trained checkpoints to disk, which may occupy more than 20GB. Make sure that much storage is free before running them, since running out of space can interrupt execution partway through.
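
The free-space check can be automated with Python's standard library. The sketch below is an illustration rather than part of the repository: the names `free_gb` and `has_enough_space` are hypothetical, and the 20 GB threshold simply mirrors the estimate above, so adjust it if your checkpoints turn out larger.

```python
import shutil

# Threshold taken from the "more than 20GB" estimate; treat it as a lower bound.
REQUIRED_GB = 20


def free_gb(path="."):
    """Return the free space, in GiB, on the filesystem containing path."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if at least required_gb GiB are free at path."""
    return free_gb(path) >= required_gb
```

Calling `has_enough_space()` from the directory where checkpoints will be written gives a quick yes/no answer before starting a long training run.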

Experiment 4

To reproduce the fourth experiment for the ablation study, run:

./Ex4_auto_train.sh
./Ex4_auto_test.sh
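
Because the test script is run after the training script, it can help to chain them so that testing is skipped if training fails. The following is a minimal sketch under that assumption; `run_in_order` is a hypothetical helper, not something the repository provides, and it relies only on `subprocess.run` with `check=True`.

```python
import subprocess
import sys


def run_in_order(commands):
    """Run each command in sequence and stop at the first failure.

    Returns the number of commands run when all of them succeed.
    Raises subprocess.CalledProcessError if a command exits non-zero,
    so later commands are never started after a failure.
    """
    completed = 0
    for cmd in commands:
        print(f"Running: {' '.join(cmd)}", file=sys.stderr)
        subprocess.run(cmd, check=True)  # raises on a non-zero exit code
        completed += 1
    return completed


# Example (run from the repository root):
# run_in_order([["./Ex4_auto_train.sh"], ["./Ex4_auto_test.sh"]])
```

The same helper applies to the train and test scripts of Experiment 3.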

Script Parameters Description

Below are the parameters used in the script along with their descriptions:

Contact

For any issues, questions, or suggestions related to the MixSet dataset, feel free to contact me or open an issue in the project's repository.

Acknowledgments

Part of the code is borrowed from MGTBench. The corresponding author, Lichao Sun, is supported by National Science Foundation Grant CRII-2246067.

Citation

@misc{zhang2024llmasacoauthor,
      title={LLM-as-a-Coauthor: Can Mixed Human-Written and Machine-Generated Text Be Detected?}, 
      author={Qihui Zhang and Chujie Gao and Dongping Chen and Yue Huang and Yixin Huang and Zhenyang Sun and Shilin Zhang and Weiye Li and Zhengyan Fu and Yao Wan and Lichao Sun},
      year={2024},
      eprint={2401.05952},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}