CelebV-Text: A Large-Scale Facial Text-Video Dataset (CVPR 2023)
<img src="./assets/teaser.png" width="96%" height="96%">CelebV-Text: A Large-Scale Facial Text-Video Dataset<br> Jianhui Yu*, Hao Zhu*, Liming Jiang, Chen Change Loy, Weidong Cai, and Wayne Wu <br> (*Equal contribution)</small><br> Demo Video | Project Page | Paper (arXiv)
Currently, text-driven generation models are booming in video editing with their compelling results. However, for face-centric text-to-video generation, severe challenges remain because a suitable dataset with high-quality videos and highly relevant texts is lacking. In this work, we present a large-scale, high-quality, and diverse facial text-video dataset, <b>CelebV-Text</b>, to facilitate research on facial text-to-video generation tasks. CelebV-Text contains 70,000 in-the-wild face video clips covering diverse visual content. Each video clip is paired with 20 texts generated by the proposed semi-automatic text generation strategy, which can describe both static and dynamic attributes precisely. We perform a comprehensive statistical analysis of the videos, texts, and text-video relevance of CelebV-Text, verifying its superiority over other datasets. We also conduct extensive self-evaluations to show the effectiveness and potential of CelebV-Text. Furthermore, a benchmark is constructed with representative methods to standardize the evaluation of the facial text-to-video generation task.
Updates
- [11/08/2023]
- Audio files (67k) can now be downloaded; see the related issue.
- [20/06/2023]
- Videos can now be downloaded; see the related issue.
- [28/03/2023]
- Paper is now released here!
- [01/01/2023]
- [28/12/2022]
- The codebase and project page are created.
- The download and processing tools for the dataset are released. Use them to construct your CelebV-Text!
- [04/01/2024]
- Common points of confusion about the annotation files are explained here.
Table of contents
<!--ts--> <!--te-->
TODO
- Video download and processing tools.
- Text descriptions.
- Data annotations.
- Code of MMVID-interp.
- Automatic text generation tool and templates.
- Pretrained models of benchmarks.
<a name="stat"></a>
Dataset Statistics
<!-- https://user-images.githubusercontent.com/121470971/209757073-77fd707b-e8cc-49ea-8d1d-836bc43d078f.mp4 -->
<!-- https://user-images.githubusercontent.com/10545746/226877623-1474519a-3389-409b-b599-3b8c924f4999.mp4 -->
The distributions of each attribute. CelebV-Text contains <b>70,000 video clips</b> with a total duration of around <b>279 hours</b>. Each video is accompanied by <b>20 sentences</b> describing <b>6 designed attributes</b>, including 40 general appearances, 5 detailed appearances, 6 light conditions, 37 actions, 8 emotions, and 6 light directions.
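As a quick sanity check on these figures, the sketch below derives the average clip length and the total number of text descriptions from the totals quoted above. The constant names are illustrative only, and since the total duration is stated as "around" 279 hours, the results are approximate.

```python
# Figures quoted in the statistics paragraph above.
NUM_CLIPS = 70_000        # video clips
TOTAL_HOURS = 279         # approximate total duration
TEXTS_PER_CLIP = 20       # sentences per clip

# Average clip length in seconds (approximate, because TOTAL_HOURS is rounded).
avg_clip_seconds = TOTAL_HOURS * 3600 / NUM_CLIPS

# Total number of text descriptions across the dataset.
total_texts = NUM_CLIPS * TEXTS_PER_CLIP

print(f"average clip length: about {avg_clip_seconds:.1f} s")
print(f"total text descriptions: {total_texts:,}")
```

Under these numbers, a clip lasts roughly 14 seconds on average and the dataset holds 1.4 million sentences in total.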
<img src="./assets/video_stats.png" width="60%" height="60%" alt="video stats"> <img src="./assets/text_stats.png" width="60%" height="60%" alt="text stats"> <img src="./assets/text_video_rel.png" width="70%" height="70%" alt="text-video rel">
Visual ChatGPT Demo
This is a toy example of applying a text-to-face model with ChatGPT. In this demo, we use MMVID, simply trained on the proposed CelebV-Text dataset, to demonstrate CelebV-Text's potential in enabling visual GPT applications. In the future, more sophisticated methods are expected to lead to better results.
<a name="Agreement"></a>
Agreement
- The CelebV-Text dataset is available for non-commercial research purposes only.
- All videos of the CelebV-Text dataset are obtained from the Internet and are not the property of our institutions. Our institutions are not responsible for the content or the meaning of these videos.
- You agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purposes, any portion of the videos and any portion of derived data.
- You agree not to further copy, publish or distribute any portion of the CelebV-Text dataset, except that copies of the dataset may be made for internal use at a single site within the same organization.
<a name="download"></a>
Dataset Download
<a name="text"></a>
(1) Text Descriptions & Metadata Annotation
Description | Link |
---|---|
general & detailed face attributes | Google Drive |
emotion | Google Drive |
action | Google Drive |
light direction | Google Drive |
light intensity | Google Drive |
light color temperature | Google Drive |
*metadata annotation | Google Drive |
<a name="videos"></a>
(2) Video Download Pipeline
Prepare the environment and run the script:
# prepare the environment
pip install youtube_dl
pip install opencv-python
# you can change the download folder in the code
python download_and_process.py
JSON File Structure:
{
"clips":
{
"0-5BrmyFsYM_0": // clip 1
{
"ytb_id": "0-5BrmyFsYM", // youtube id
"duration": {"start_sec": 0.0, "end_sec": 9.64}, // start and end times in the original video
"bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872}, // bounding box
"version": "v0.1"
},
"00-30GQl0TM_7": // clip 2
{
"ytb_id": "00-30GQl0TM", // youtube id
"duration": {"start_sec": 415.29, "end_sec": 420.88}, // start and end times in the original video
"bbox": {"top": 0, "bottom": 1183, "left": 665, "right": 1956}, // bounding box
"version": "v0.1"
},
"..."
"..."
}
}
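To illustrate how these fields could be consumed, here is a minimal sketch that parses the structure and builds an `ffmpeg` argument list for trimming and cropping one clip. It is not the logic of `download_and_process.py`. It assumes every clip stores its times under `start_sec`/`end_sec` as in the first clip, that `bbox` holds pixel coordinates in the source video, and that the source video has already been downloaded. The helper names `iter_clips` and `ffmpeg_args` and the file paths are hypothetical.

```python
import json

# First clip from the structure above, with the // comments removed so it parses.
SAMPLE = """
{
  "clips": {
    "0-5BrmyFsYM_0": {
      "ytb_id": "0-5BrmyFsYM",
      "duration": {"start_sec": 0.0, "end_sec": 9.64},
      "bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872},
      "version": "v0.1"
    }
  }
}
"""


def iter_clips(metadata):
    """Yield one flat record per clip (hypothetical helper)."""
    for name, info in metadata["clips"].items():
        dur, box = info["duration"], info["bbox"]
        yield {
            "name": name,
            "ytb_id": info["ytb_id"],
            "start": dur["start_sec"],
            "end": dur["end_sec"],
            # (width, height, x, y), assuming bbox values are pixels.
            "crop": (box["right"] - box["left"],
                     box["bottom"] - box["top"],
                     box["left"],
                     box["top"]),
        }


def ffmpeg_args(clip, src_path, dst_path):
    """Build an ffmpeg command that seeks, limits duration, and crops."""
    w, h, x, y = clip["crop"]
    return [
        "ffmpeg",
        "-ss", f"{clip['start']:.2f}",               # seek in the input
        "-i", src_path,
        "-t", f"{clip['end'] - clip['start']:.2f}",  # output duration
        "-filter:v", f"crop={w}:{h}:{x}:{y}",        # crop=w:h:x:y
        dst_path,
    ]


clips = list(iter_clips(json.loads(SAMPLE)))
args = ffmpeg_args(clips[0], "raw/0-5BrmyFsYM.mp4", "out/0-5BrmyFsYM_0.mp4")
print(" ".join(args))
```

The argument list can be passed to `subprocess.run(args, check=True)` once `ffmpeg` is installed. If the coordinates turn out to refer to a resolution other than that of the downloaded video, they would need to be rescaled first.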
<a name="benchmark"></a>
Benchmark on Facial Text-to-Video Generation
<a name="baselines"></a>
(1) Baselines
To train the baselines, we used their original implementations in our paper:
<a name="models"></a>
(2) Pretrained Models
Text Descriptions (MMVID) | Link |
---|---|
VQGAN | Google Drive |
general & detailed face attributes | Google Drive |
emotion | Google Drive |
action | Google Drive |
light direction | Google Drive |
light intensity & color temperature | Google Drive |
general face attributes + emotion + action + light direction | Google Drive |
<a name="related"></a>
More Work May Interest You
Several of our previous publications might be of interest to you.
- Face Generation:
  - (ECCV 2022) CelebV-HQ: A Large-scale Video Facial Attributes Dataset. Zhu et al. [Paper], [Project Page], [Dataset]
  - (CVPR 2022) TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing. Xu et al. [Paper], [Project Page], [Code]
- Human Generation:
  - (Tech. Report 2022) 3DHumanGAN: Towards Photo-realistic 3D-Aware Human Image Generation. Yang et al. [Paper], [Project Page], [Code]
  - (ECCV 2022) StyleGAN-Human: A Data-Centric Odyssey of Human. Fu et al. [Paper], [Project Page], [Dataset]
  - (SIGGRAPH 2022) Text2Human: Text-Driven Controllable Human Image Generation. Jiang et al. [Paper], [Project Page], [Code]
<a name="citation"></a>
Citation
If you find this work useful for your research, please consider citing our paper:
@inproceedings{yu2022celebvtext,
title={{CelebV-Text}: A Large-Scale Facial Text-Video Dataset},
author={Yu, Jianhui and Zhu, Hao and Jiang, Liming and Loy, Chen Change and Cai, Weidong and Wu, Wayne},
booktitle={CVPR},
year={2023}
}
<a name="acknowledgement"></a>
Acknowledgement
CelebV-Text is affiliated with OpenXDLab -- an open platform for X-Dimension high-quality data. This work is supported by NTU NAP, MOE AcRF Tier 1 (2021-T1-001-088).