CelebV-Text: A Large-Scale Facial Text-Video Dataset (CVPR 2023)

<img src="./assets/teaser.png" width="96%" height="96%">

CelebV-Text: A Large-Scale Facial Text-Video Dataset<br> Jianhui Yu*, Hao Zhu*, Liming Jiang, Chen Change Loy, Weidong Cai, and Wayne Wu<br> <small>(*Equal contribution)</small><br> Demo Video | Project Page | Paper (arXiv)

Text-driven generation models are currently flourishing and produce compelling results for video generation and editing. However, face-centric text-to-video generation remains challenging, largely because no suitable dataset with high-quality videos and highly relevant texts exists. In this work, we present <b>CelebV-Text</b>, a large-scale, high-quality, and diverse facial text-video dataset, to facilitate research on facial text-to-video generation. CelebV-Text contains 70,000 in-the-wild face video clips covering diverse visual content. Each video clip is paired with 20 texts produced by our semi-automatic text generation strategy, which precisely describes both static and dynamic attributes. We conduct a comprehensive statistical analysis of the videos, texts, and text-video relevance of CelebV-Text, verifying its superiority over other datasets, and perform extensive self-evaluations to demonstrate its effectiveness and potential. Furthermore, we construct a benchmark with representative methods to standardize the evaluation of facial text-to-video generation.

Updates

Table of contents

<!--ts--> <!--te-->

TODO

<a name="stat"></a>

Dataset Statistics


https://user-images.githubusercontent.com/10545746/227458030-fbb48f66-db14-4c89-a001-4d7cdd29b248.mp4

The video above shows the distribution of each attribute. CelebV-Text contains <b>70,000 video clips</b> with a total duration of around <b>279 hours</b>. Each video is accompanied by <b>20 sentences</b> describing <b>6 designed attributes</b>: 40 general appearances, 5 detailed appearances, 6 light conditions, 37 actions, 8 emotions, and 6 light directions.

<img src="./assets/video_stats.png" width="60%" height="60%" alt="video stats"> <img src="./assets/text_stats.png" width="60%" height="60%" alt="text stats"> <img src="./assets/text_video_rel.png" width="70%" height="70%" alt="text-video rel">

Visual ChatGPT Demo

This is a toy example of combining a text-to-face-video model with ChatGPT. In this demo, we use MMVID trained on the proposed CelebV-Text dataset to demonstrate CelebV-Text's potential for enabling visual GPT applications. We expect more sophisticated methods to yield even better results in the future.

https://user-images.githubusercontent.com/10545746/226870355-83c9c875-0e3b-439e-9df7-453d8e408807.mp4

<a name="Agreement"></a>

Agreement

<a name="download"></a>

Dataset Download

<a name="text"></a>

(1) Text Descriptions & Metadata Annotation

| Description | Link |
| --- | --- |
| general & detailed face attributes | Google Drive |
| emotion | Google Drive |
| action | Google Drive |
| light direction | Google Drive |
| light intensity | Google Drive |
| light color temperature | Google Drive |
| *metadata annotation | Google Drive |
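
The exact layout of the released description files is not documented here, so the snippet below is only a rough sketch of how the per-attribute texts might be gathered for one clip. It assumes each attribute archive unpacks into its own folder containing one UTF-8 text file per clip, named by clip ID (e.g. `0-5BrmyFsYM_0.txt`) with one sentence per line; adjust the folder names and parsing to whatever the downloaded archives actually contain.

```python
from pathlib import Path

# Hypothetical layout: one folder per attribute, one .txt file per clip ID.
ATTRIBUTE_DIRS = {
    "appearance": Path("texts/general_detailed_appearance"),
    "emotion": Path("texts/emotion"),
    "action": Path("texts/action"),
    "light_direction": Path("texts/light_direction"),
}


def load_descriptions(clip_id: str) -> dict:
    """Collect all available sentences for one clip, grouped by attribute."""
    descriptions = {}
    for attr, folder in ATTRIBUTE_DIRS.items():
        txt_file = folder / f"{clip_id}.txt"
        if txt_file.exists():
            descriptions[attr] = [
                line.strip() for line in txt_file.read_text().splitlines() if line.strip()
            ]
    return descriptions


print(load_descriptions("0-5BrmyFsYM_0"))
```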

<a name="videos"></a>

(2) Video Download Pipeline

Prepare the environment and run the script:

```bash
# prepare the environment
pip install youtube_dl
pip install opencv-python

# you can change the download folder in the code
python download_and_process.py
```
JSON File Structure:

```javascript
{
    "clips":
    {
        "0-5BrmyFsYM_0":  // clip 1
        {
            "ytb_id": "0-5BrmyFsYM",                                        // YouTube ID
            "duration": {"start_sec": 0.0, "end_sec": 9.64},                // start and end times in the original video, in seconds
            "bbox": {"top": 0, "bottom": 937, "left": 849, "right": 1872},  // face bounding box in the original video
            "version": "v0.1"
        },

        "00-30GQl0TM_7":  // clip 2
        {
            "ytb_id": "00-30GQl0TM",                                        // YouTube ID
            "duration": {"start_sec": 415.29, "end_sec": 420.88},           // start and end times in the original video, in seconds
            "bbox": {"top": 0, "bottom": 1183, "left": 665, "right": 1956}, // face bounding box in the original video
            "version": "v0.1"
        },
        "..."
        "..."

    }
}
```
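
As a rough illustration of how this metadata can be consumed (this is not the official `download_and_process.py`), the sketch below reads the JSON, downloads the source video with youtube_dl, and then cuts the annotated time span and crops the face bounding box with OpenCV. The JSON file name and output paths are placeholders.

```python
import json

import cv2
import youtube_dl  # same dependency as the official script


def download_source(ytb_id: str, out_path: str) -> None:
    """Download the full YouTube video that a clip was cut from."""
    opts = {"format": "mp4", "outtmpl": out_path}
    with youtube_dl.YoutubeDL(opts) as ydl:
        ydl.download([f"https://www.youtube.com/watch?v={ytb_id}"])


def crop_clip(src_path: str, dst_path: str, duration: dict, bbox: dict) -> None:
    """Cut the annotated time span and crop the face bounding box."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.set(cv2.CAP_PROP_POS_MSEC, duration["start_sec"] * 1000)

    w = bbox["right"] - bbox["left"]
    h = bbox["bottom"] - bbox["top"]
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

    n_frames = int((duration["end_sec"] - duration["start_sec"]) * fps)
    for _ in range(n_frames):
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(frame[bbox["top"]:bbox["bottom"], bbox["left"]:bbox["right"]])

    cap.release()
    writer.release()


if __name__ == "__main__":
    clips = json.load(open("celebvtext_info.json"))["clips"]  # placeholder file name
    clip_id, meta = next(iter(clips.items()))
    download_source(meta["ytb_id"], f"{meta['ytb_id']}.mp4")
    crop_clip(f"{meta['ytb_id']}.mp4", f"{clip_id}.mp4", meta["duration"], meta["bbox"])
```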

<a name="benchmark"></a>

Benchmark on Facial Text-to-Video Generation

<a name="baselines"></a>

(1) Baselines

To train the baselines reported in our paper, we used their original implementations.

<a name="models"></a>

(2) Pretrained Models

| Text Descriptions (MMVID) | Link |
| --- | --- |
| VQGAN | Google Drive |
| general & detailed face attributes | Google Drive |
| emotion | Google Drive |
| action | Google Drive |
| light direction | Google Drive |
| light intensity & color temperature | Google Drive |
| general face attributes + emotion + action + light direction | Google Drive |
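
The checkpoints are hosted on Google Drive. If you prefer to fetch them from a script rather than the browser, one option is the gdown package; the file ID below is a placeholder to be replaced with the ID from the share link of the checkpoint you need.

```python
import gdown  # pip install gdown

# Replace FILE_ID with the ID taken from the Google Drive share link.
gdown.download("https://drive.google.com/uc?id=FILE_ID", "mmvid_celebvtext_ckpt.pt", quiet=False)
```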

<a name="related"></a>

More Work May Interest You

Several of our previous publications may also be of interest to you.

<a name="citation"></a>

Citation

If you find this work useful for your research, please consider citing our paper:

```bibtex
@inproceedings{yu2022celebvtext,
  title={{CelebV-Text}: A Large-Scale Facial Text-Video Dataset},
  author={Yu, Jianhui and Zhu, Hao and Jiang, Liming and Loy, Chen Change and Cai, Weidong and Wu, Wayne},
  booktitle={CVPR},
  year={2023}
}
```

<a name="acknowledgement"></a>

Acknowledgement

CelebV-Text is affiliated with OpenXDLab, an open platform for high-quality X-Dimension data. This work is supported by NTU NAP and MOE AcRF Tier 1 (2021-T1-001-088).