DF40: Toward Next-Generation Deepfake Detection (Project Page; Paper; Download DF40; Checkpoints)
🎉🎉🎉 Our DF40 has been accepted by the NeurIPS 2024 D&B track!
Welcome to DF40, our work toward next-generation deepfake detection.
In this work, we propose: (1) a diverse deepfake dataset with 40 distinct generation methods; and (2) a comprehensive benchmark for training, evaluation, and analysis.
"Expand your evaluation with 40 distinct high-quality fake datasets from the FF++ and CDF domains!"
DF40 Dataset Highlights: The key features of our proposed DF40 dataset are as follows:
✅ Forgery Diversity: DF40 comprises 40 distinct deepfake techniques (both representative and SOTA methods are included), facilitating the detection of today's SOTA deepfakes and AIGC content. We provide 10 face-swapping methods, 13 face-reenactment methods, 12 entire face synthesis methods, and 5 face-editing methods.
✅ Forgery Realism: DF40 includes realistic deepfake data created by highly popular generation software and methods (e.g., HeyGen, MidJourney, DeepFaceLab) to simulate real-world deepfakes. We even include the recently released DiT, SiT, PixArt-$\alpha$, etc.
✅ Forgery Scale: DF40 offers million-scale deepfake data for both images and videos.
✅ Data Alignment: DF40 provides alignment between fake methods and data domains. Most methods (31) are generated under both the FF++ and CDF domains. Using our fake data, you can further expand your evaluation (e.g., training on FF++ and testing on CDF).
The figure below provides a brief introduction to our DF40 dataset.
<div align="center"> </div> <div style="text-align:center;"> <img src="df40_figs/df40_intro.jpg" style="max-width:60%;"> </div>The following table presents the statistics and details of our DF40 dataset; please check our paper for more information.
<div align="center"> </div> <div style="text-align:center;"> <img src="df40_figs/table1.jpg" style="max-width:60%;"> </div>🔥 DF40 Dataset
Type | ID-Number | Generation Method | Original Data Source | Visual Examples |
---|---|---|---|---|
Face-swapping (FS) | 1 | FSGAN | FF++ and Celeb-DF | |
Face-swapping (FS) | 2 | FaceSwap | FF++ and Celeb-DF | |
Face-swapping (FS) | 3 | SimSwap | FF++ and Celeb-DF | |
Face-swapping (FS) | 4 | InSwapper | FF++ and Celeb-DF | |
Face-swapping (FS) | 5 | BlendFace | FF++ and Celeb-DF | |
Face-swapping (FS) | 6 | UniFace | FF++ and Celeb-DF | |
Face-swapping (FS) | 7 | MobileSwap | FF++ and Celeb-DF | |
Face-swapping (FS) | 8 | e4s | FF++ and Celeb-DF | |
Face-swapping (FS) | 9 | FaceDancer | FF++ and Celeb-DF | |
Face-swapping (FS) | 10 | DeepFaceLab | UADFV | |
Face-reenactment (FR) | 11 | FOMM | FF++ and Celeb-DF | |
Face-reenactment (FR) | 12 | FS_vid2vid | FF++ and Celeb-DF | |
Face-reenactment (FR) | 13 | Wav2Lip | FF++ and Celeb-DF | |
Face-reenactment (FR) | 14 | MRAA | FF++ and Celeb-DF | |
Face-reenactment (FR) | 15 | OneShot | FF++ and Celeb-DF | |
Face-reenactment (FR) | 16 | PIRender | FF++ and Celeb-DF | |
Face-reenactment (FR) | 17 | TPSM | FF++ and Celeb-DF | |
Face-reenactment (FR) | 18 | LIA | FF++ and Celeb-DF | |
Face-reenactment (FR) | 19 | DaGAN | FF++ and Celeb-DF | |
Face-reenactment (FR) | 20 | SadTalker | FF++ and Celeb-DF | |
Face-reenactment (FR) | 21 | MCNet | FF++ and Celeb-DF | |
Face-reenactment (FR) | 22 | HyperReenact | FF++ and Celeb-DF | |
Face-reenactment (FR) | 23 | HeyGen | FVHQ | |
Entire Face Synthesis (EFS) | 24 | VQGAN | Finetuning on FF++ and Celeb-DF | |
Entire Face Synthesis (EFS) | 25 | StyleGAN2 | Finetuning on FF++ and Celeb-DF | |
Entire Face Synthesis (EFS) | 26 | StyleGAN3 | Finetuning on FF++ and Celeb-DF | |
Entire Face Synthesis (EFS) | 27 | StyleGAN-XL | Finetuning on FF++ and Celeb-DF | |
Entire Face Synthesis (EFS) | 28 | SD-2.1 | Finetuning on FF++ and Celeb-DF | |
Entire Face Synthesis (EFS) | 29 | DDPM | Finetuning on FF++ and Celeb-DF | |
Entire Face Synthesis (EFS) | 30 | RDDM | Finetuning on FF++ and Celeb-DF | |
Entire Face Synthesis (EFS) | 31 | PixArt-$\alpha$ | Finetuning on FF++ and Celeb-DF | |
Entire Face Synthesis (EFS) | 32 | DiT-XL/2 | Finetuning on FF++ and Celeb-DF | |
Entire Face Synthesis (EFS) | 33 | SiT-XL/2 | Finetuning on FF++ and Celeb-DF | |
Entire Face Synthesis (EFS) | 34 | MidJourney6 | FFHQ | |
Entire Face Synthesis (EFS) | 35 | WhichisReal | FFHQ | |
Face Edit (FE) | 36 | CollabDiff | CelebA | |
Face Edit (FE) | 37 | e4e | CelebA | |
Face Edit (FE) | 38 | StarGAN | CelebA | |
Face Edit (FE) | 39 | StarGANv2 | CelebA | |
Face Edit (FE) | 40 | StyleCLIP | CelebA | |
⏳ Quick Start
<a href="#top">[Back to top]</a>
1. Installation
Please run the following script to install the required libraries:
sh install.sh
2. Download ckpts for inference
All checkpoints/weights of the ten models trained on our DF40 are released at Google Drive and Baidu Disk.
Note that:
- If you want to use the CLIP model trained on all FS methods of DF40, you can find it at `df40_weights/train_on_fs/clip.pth`. You can use all checkpoints under `df40_weights/train_on_xxx_matrix` to reproduce the results of Protocols 1, 2, and 3 of our paper.
- Similarly, if you want to use the Xception model trained specifically on the SimSwap method, you can find it at `df40_weights/train_on_fs_matrix/simswap_ff.pth`. You can use all checkpoints under `df40_weights/train_on_xxx_matrix` to reproduce the results of Protocol-4 of our paper.
3. Download DF40 data (after pre-processing)
For quick use and convenience, we provide all DF40 data after the pre-processing used in our research. You do NOT need to run the pre-processing again; you can directly use our processed data.
- DF40 (testing data):
- Description: We provide Google Drive and Baidu Disk links for the whole DF40 testing data (40 methods) after preprocessing (frame extraction and face cropping), including fake images only.
- Size: ~93 GB in total, covering all testing fake data of DF40.
- DF40 (training data):
- Description: Similar to the DF40-test, we provide the processed fake images for training in Google Drive Link and Baidu Disk. Please note that the training set ONLY includes the "known" methods and utilizes the FaceForensics++ (ff) domain for training. The Celeb-DF (cdf) domain is not used for training purposes but for testing only.
- Size: ~50 GB in total, covering all training fake data of DF40 (FF++ domain only).
- Original Real Data (FF++ and Celeb-DF):
- For "known" 31 methods: To obtain the real data for both training and testing purposes, please use the following links: FaceForensics++ real data (Google Drive Link and Baidu Disk) and Celeb-DF real data (Google Drive Link and Baidu Disk).
- For the "unknown" 9 methods: The real data is already included within the folder, so there is NO additional download link required for the real data of the unknown methods.
- JSON files for recording image paths:
- Description: we create a JSON file to load all frame paths for each method in a unified way.
- All the JSON files used in our research can be downloaded here (Google Drive and Baidu Disk).
- After downloading, please put the `dataset_json` folder inside the `./preprocessing/` folder.
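Once the `dataset_json` folder is in place, each JSON file can be flattened into a list of frame paths for loading. The sketch below is illustrative only: the nesting and key names (`frames`, the split level) are assumptions about the JSON layout for demonstration, not the repo's exact schema.

```python
def collect_frame_paths(dataset_json: dict, split: str) -> list:
    """Flatten one split of a dataset JSON into a flat list of frame paths.

    Assumed (hypothetical) layout: {label_group: {split: {video: {"frames": [...]}}}}.
    """
    paths = []
    for _group, splits in dataset_json.items():
        for _video, meta in splits.get(split, {}).items():
            paths.extend(meta["frames"])  # each video entry lists its frame paths
    return paths

# Tiny hand-made example mimicking the assumed structure.
example = {
    "simswap_fake": {
        "train": {"vid_001": {"frames": ["frames/vid_001/000.png",
                                         "frames/vid_001/001.png"]}},
        "test":  {"vid_900": {"frames": ["frames/vid_900/000.png"]}},
    }
}
print(collect_frame_paths(example, "train"))
# → ['frames/vid_001/000.png', 'frames/vid_001/001.png']
```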
4. Run inference
You can then run inference using the trained weights from our research.
Example-1: If you want to use the Xception model trained on SimSwap (FF) and test it on BlendFace (FF), run the following lines.
cd DeepfakeBench_DF40
python training/test.py \
--detector_path training/config/detector/xception.yaml \
--weights_path training/df40_weights/train_on_fs_matrix/simswap_ff.pth \
--test_dataset blendface_ff
Example-2: If you want to use the Xception model trained on SimSwap (FF) and test it on SimSwap (CDF), run the following lines.
cd DeepfakeBench_DF40
python training/test.py \
--detector_path training/config/detector/xception.yaml \
--weights_path training/df40_weights/train_on_fs_matrix/simswap_ff.pth \
--test_dataset simswap_cdf
Example-3: If you want to use the CLIP model trained on all FS (FF) methods and test it on DeepFaceLab, run the following lines.
cd DeepfakeBench_DF40
python training/test.py \
--detector_path training/config/detector/clip.yaml \
--weights_path training/df40_weights/train_on_fs/clip.pth \
--test_dataset deepfacelab
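To run a checkpoint against several test sets in sequence, a small shell loop avoids retyping the command. This is a dry-run sketch reusing the paths from the examples above: it only prints each command; remove the `echo` (and run from inside `DeepfakeBench_DF40`) to execute for real.

```shell
# Dry run: print the inference command for each test set; drop `echo` to execute.
for ds in blendface_ff simswap_cdf; do
  echo python training/test.py \
    --detector_path training/config/detector/xception.yaml \
    --weights_path training/df40_weights/train_on_fs_matrix/simswap_ff.pth \
    --test_dataset "$ds"
done
```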
💻 Reproduction and Development
<a href="#top">[Back to top]</a>
1. Download DF40 dataset
We provide two ways to download our dataset:
- Option-1: Use the processed data (after preprocessing) that was also used in our research. Please see the Quick Start section above.
- Option-2: If you also want to download the original fake videos of all FS and FR methods, please download them at the link (Google Drive). For EFS and FE methods, the original data is the same as the processed data, since these methods do not require preprocessing (e.g., frame extraction and face cropping).
2. Preprocessing (optional)
If you only want to use the processed data we provide, you can skip this step. Otherwise, use the following code to preprocess the data.
To start preprocessing DF40 dataset, please follow these steps:
1. Open `./preprocessing/config.yaml` and locate the line `default: DATASET_YOU_SPECIFY`. Replace `DATASET_YOU_SPECIFY` with the name of the dataset you want to preprocess, such as `FaceForensics++`.
2. Specify the `dataset_root_path` in `config.yaml`: search for the line that mentions `dataset_root_path`. By default, it looks like this: `dataset_root_path: ./datasets`. Replace `./datasets` with the actual path to the folder where your dataset is arranged.
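After both edits, the two relevant lines of `./preprocessing/config.yaml` would look roughly like this (the root path is an illustrative placeholder, not a real value):

```yaml
# ./preprocessing/config.yaml (only the two fields discussed above)
default: FaceForensics++                                   # was: DATASET_YOU_SPECIFY
dataset_root_path: /path/to/deepfake_detection_datasets    # was: ./datasets
```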
Once you have completed these steps, you can proceed with running the following line to do the preprocessing:
cd preprocessing
python preprocess.py
3. Rearrangement (optional)
"Rearrangement" here means creating a JSON file for each dataset that collects all frames across different folders.
If you only want to use the processed data we provide, you can skip this step and use the JSON files from our research (Google Drive). Otherwise, use the following code to rearrange the data.
After the preprocessing above, you will obtain the processed data (e.g., frames, landmarks, and masks) for each dataset you specify. Similarly, you need to set the parameters in `./preprocessing/config.yaml` for each dataset. After that, run the following lines:
cd preprocessing
python rearrange.py
After running the above line, you will obtain the JSON files for each dataset in the `./preprocessing/dataset_json` folder. The rearranged structure organizes the data hierarchically, grouping videos by their labels and data splits (i.e., train, test, validation). Each video is represented as a dictionary entry containing relevant metadata, including file paths, labels, compression levels (if applicable), etc.
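The hierarchical layout described above can be pictured with a small sketch that groups frame paths by their video folder. The key names below (`label`, `split`, `frames`) are assumptions for illustration, not the exact schema produced by `rearrange.py`.

```python
from pathlib import PurePosixPath

def rearrange(frame_paths, dataset, label, split):
    """Group frame paths by video folder into a hierarchical dict:
    {dataset: {label: {split: {video: {...metadata...}}}}} (keys are illustrative)."""
    videos = {}
    for p in frame_paths:
        video = PurePosixPath(p).parent.name  # parent folder name identifies the video
        entry = videos.setdefault(video, {"label": label, "split": split, "frames": []})
        entry["frames"].append(p)
    return {dataset: {label: {split: videos}}}

out = rearrange(
    ["simswap/ff/frames/vid_001/000.png", "simswap/ff/frames/vid_001/001.png"],
    dataset="simswap_ff", label="fake", split="train",
)
print(out["simswap_ff"]["fake"]["train"].keys())
```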
4. Training
Our benchmark includes four standard protocols. You can use the following examples of each protocol to train the models:
(a). Protocol-1: Same Data Domain, Different Forgery Types
First, you can run the following lines to train a model (e.g., if you want to train the Xception model on all FS methods):
- For multiple GPUs:
python3 -m torch.distributed.launch --nproc_per_node=8 training/train.py \
--detector_path ./training/config/detector/xception.yaml \
--train_dataset FSAll_ff \
--test_dataset FSAll_ff \
--ddp
- For a single GPU:
python3 training/train.py \
--detector_path ./training/config/detector/xception.yaml \
--train_dataset FSAll_ff \
--test_dataset FSAll_ff \
Note that here we perform both training and evaluation on FSAll_ff (using all testing FS methods of the FF domain as the evaluation set) to select the best checkpoint. Once training is finished, you can use the best checkpoint to evaluate other testing datasets (e.g., all testing EFS and FR methods of the FF domain). Specifically:
python3 training/test.py \
--detector_path ./training/config/detector/xception.yaml \
--test_dataset "FSAll_ff" "FRAll_ff" "EFSAll_ff" \
--weights_path ./training/df40_weights/train_on_fs/xception.pth
Then, you can obtain evaluation results similar to those reported in Tab. 3 of the manuscript.
(b). Protocol-2: Same Forgery Type, Different Data Domains. Similarly, run the following lines for Protocol-2.
python3 training/test.py \
--detector_path ./training/config/detector/xception.yaml \
--test_dataset "FSAll_cdf" "FRAll_cdf" "EFSAll_cdf" \
--weights_path ./training/df40_weights/train_on_fs/xception.pth
Then, you can obtain evaluation results similar to those reported in Tab. 4 of the manuscript.
(c). Protocol-3: Different Forgery Types, Different Data Domains. Similarly, run the following lines for Protocol-3.
python3 training/test.py \
--detector_path ./training/config/detector/xception.yaml \
--test_dataset "deepfacelab" "heygen" "whichisreal" "MidJourney" "stargan" "starganv2" "styleclip" "e4e" "CollabDiff" \
--weights_path ./training/df40_weights/train_on_fs/xception.pth
Then, you can obtain all evaluation results reported in Tab. 5 of the manuscript.
(d). Protocol-4: Train on one fake method and test on all other methods (One-vs-All). Similarly, you should first train a model (e.g., Xception) on one specific fake method (e.g., SimSwap):
python3 training/train.py \
--detector_path ./training/config/detector/xception.yaml \
--train_dataset simswap_ff \
--test_dataset simswap_ff \
Then run the following lines for evaluation:
python3 training/test.py \
--detector_path ./training/config/detector/xception.yaml \
--test_dataset ... (type them one-by-one) \
--weights_path ./training/df40_weights/train_on_fs_matrix/simswap_ff.pth
You can also directly use the bash script (`./training/test_df40.sh`) for convenience, so you do not need to type all the fake methods one by one in the terminal.
Then, you can obtain all evaluation results reported in Fig. 4 of the manuscript.
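If you do type the test sets yourself, the names follow the `<method>_<domain>` convention used throughout this README (e.g., `simswap_ff`, `blendface_ff`, `simswap_cdf`). A small helper sketch that generates the names for the nine FS methods available in both domains (DeepFaceLab is excluded here since, as an "unknown" method, it is tested without a domain suffix):

```python
# Build --test_dataset names from the FS method folders listed in the
# Folder Structure section, using the "<method>_<domain>" convention.
fs_methods = ["fsgan", "faceswap", "simswap", "inswap", "blendface",
              "uniface", "mobileswap", "e4s", "facedancer"]
names = [f"{m}_{d}" for m in fs_methods for d in ("ff", "cdf")]
print(" ".join(f'"{n}"' for n in names))  # paste after --test_dataset
```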
🖼 More visual examples
<a href="#top">[Back to top]</a>
- Example samples created by FS (face-swapping) methods: please check here.
- Example samples created by FR (face-reenactment) methods: please check here.
- Example samples created by EFS (entire face synthesis) methods: please check here.
- Example samples created by FE (face editing) methods: please check here.
Folder Structure
deepfake_detection_datasets
│
├── DF40
│   ├── fsgan
│   │   ├── ff
│   │   └── cdf
│   ├── faceswap
│   │   ├── ff
│   │   └── cdf
│   ├── simswap
│   │   ├── ff
│   │   └── cdf
│   ├── inswap
│   │   ├── ff
│   │   └── cdf
│   ├── blendface
│   │   ├── ff
│   │   └── cdf
│   ├── uniface
│   │   ├── ff
│   │   └── cdf
│   ├── mobileswap
│   │   ├── ff
│   │   └── cdf
│   ├── e4s
│   │   ├── ff
│   │   └── cdf
│   ├── facedancer
│   │   ├── ff
│   │   └── cdf
│   ├── fomm
│   │   ├── ff
│   │   └── cdf
│   ├── facevid2vid
│   │   ├── ff
│   │   └── cdf
│   ├── wav2lip
│   │   ├── ff
│   │   └── cdf
│   ├── MRAA
│   │   ├── ff
│   │   └── cdf
│   ├── one_shot_free
│   │   ├── ff
│   │   └── cdf
│   ├── pirender
│   │   ├── ff
│   │   └── cdf
│   ├── tpsm
│   │   ├── ff
│   │   └── cdf
│   ├── lia
│   │   ├── ff
│   │   └── cdf
│   ├── danet
│   │   ├── ff
│   │   └── cdf
│   ├── sadtalker
│   │   ├── ff
│   │   └── cdf
│   ├── mcnet
│   │   ├── ff
│   │   └── cdf
│   ├── heygen
│   │   ├── fake
│   │   └── real
│   ├── VQGAN
│   │   ├── ff
│   │   └── cdf
│   ├── StyleGAN2
│   │   ├── ff
│   │   └── cdf
│   ├── StyleGAN3
│   │   ├── ff
│   │   └── cdf
│   ├── StyleGANXL
│   │   ├── ff
│   │   └── cdf
│   ├── sd2.1
│   │   ├── ff
│   │   └── cdf
│   ├── ddim
│   │   ├── ff
│   │   └── cdf
│   ├── PixArt
│   │   ├── ff
│   │   └── cdf
│   ├── DiT
│   │   ├── ff
│   │   └── cdf
│   ├── SiT
│   │   ├── ff
│   │   └── cdf
│   ├── MidJourney
│   │   ├── fake
│   │   └── real
│   ├── whichfaceisreal
│   │   ├── fake
│   │   └── real
│   ├── stargan
│   │   ├── fake
│   │   └── real
│   ├── starganv2
│   │   ├── fake
│   │   └── real
│   ├── styleclip
│   │   ├── fake
│   │   └── real
│   ├── e4e
│   │   ├── fake
│   │   └── real
│   └── CollabDiff
│       ├── fake
│       └── real
│
└── DF40_train
    ├── fsgan
    │   ├── ff
    │   └── cdf
    ├── faceswap
    │   ├── ff
    │   └── cdf
    ├── simswap
    │   ├── ff
    │   └── cdf
    ├── inswap
    │   ├── ff
    │   └── cdf
    ├── blendface
    │   ├── ff
    │   └── cdf
    ├── uniface
    │   ├── ff
    │   └── cdf
    ├── mobileswap
    │   ├── ff
    │   └── cdf
    ├── e4s
    │   ├── ff
    │   └── cdf
    ├── facedancer
    │   ├── ff
    │   └── cdf
    ├── fomm
    │   ├── ff
    │   └── cdf
    ├── facevid2vid
    │   ├── ff
    │   └── cdf
    ├── wav2lip
    │   ├── ff
    │   └── cdf
    ├── MRAA
    │   ├── ff
    │   └── cdf
    ├── one_shot_free
    │   ├── ff
    │   └── cdf
    ├── pirender
    │   ├── ff
    │   └── cdf
    ├── tpsm
    │   ├── ff
    │   └── cdf
    ├── lia
    │   ├── ff
    │   └── cdf
    ├── danet
    │   ├── ff
    │   └── cdf
    ├── sadtalker
    │   ├── ff
    │   └── cdf
    ├── mcnet
    │   ├── ff
    │   └── cdf
    ├── heygen
    │   ├── ff
    │   └── cdf
    ├── VQGAN
    │   ├── ff
    │   └── cdf
    ├── StyleGAN2
    │   ├── ff
    │   └── cdf
    ├── StyleGAN3
    │   ├── ff
    │   └── cdf
    ├── StyleGANXL
    │   ├── ff
    │   └── cdf
    ├── sd2.1
    │   ├── ff
    │   └── cdf
    ├── ddim
    │   ├── ff
    │   └── cdf
    ├── PixArt
    │   ├── ff
    │   └── cdf
    ├── DiT
    │   ├── ff
    │   └── cdf
    ├── SiT
    │   ├── ff
    │   └── cdf
    ├── MidJourney
    │   ├── fake
    │   └── real
    ├── whichfaceisreal
    │   ├── fake
    │   └── real
    ├── stargan
    │   ├── fake
    │   └── real
    ├── starganv2
    │   ├── fake
    │   └── real
    ├── styleclip
    │   ├── fake
    │   └── real
    ├── e4e
    │   ├── fake
    │   └── real
    └── CollabDiff
        ├── fake
        └── real
Citations
If you use our DF40 dataset, checkpoints/weights, or code in your research, you must cite DF40 as follows:
@article{yan2024df40,
title={DF40: Toward Next-Generation Deepfake Detection},
author={Yan, Zhiyuan and Yao, Taiping and Chen, Shen and Zhao, Yandan and Fu, Xinghe and Zhu, Junwei and Luo, Donghao and Yuan, Li and Wang, Chengjie and Ding, Shouhong and others},
journal={arXiv preprint arXiv:2406.13495},
year={2024}
}
Since our codebase is mainly built on DeepfakeBench, you should also cite it as follows:
@inproceedings{DeepfakeBench_YAN_NEURIPS2023,
author = {Yan, Zhiyuan and Zhang, Yong and Yuan, Xinhang and Lyu, Siwei and Wu, Baoyuan},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
pages = {4534--4565},
publisher = {Curran Associates, Inc.},
title = {DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/0e735e4b4f07de483cbe250130992726-Paper-Datasets_and_Benchmarks.pdf},
volume = {36},
year = {2023}
}
License
The use of both the dataset and code is RESTRICTED to the Creative Commons Attribution-NonCommercial 4.0 International Public License (CC BY-NC 4.0). See https://creativecommons.org/licenses/by-nc/4.0/ for details.