
Summary

This is the dataset proposed in our paper VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models (NeurIPS 2024).

VidProM is the first dataset featuring 1.67 million unique text-to-video prompts and 6.69 million videos generated from 4 different state-of-the-art diffusion models. It inspires many exciting new research areas, such as Text-to-Video Prompt Engineering, Efficient Video Generation, Fake Video Detection, and Video Copy Detection for Diffusion Models.

Download

You can download VidProM from Hugging Face.

For users in China, we cooperate with Wisemodel, where the same files can be downloaded faster.

Automatic

First, install the datasets library:

pip install datasets

The dataset can then be downloaded automatically with:

from datasets import load_dataset
dataset = load_dataset('WenhaoWang/VidProM')

Manual

You can also download each file with wget, for instance:

wget https://huggingface.co/datasets/WenhaoWang/VidProM/resolve/main/VidProM_unique.csv
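
The same download can be scripted from Python. The sketch below uses only the standard library and assumes that other files sit at the repository root under the same resolve/main path as the wget command above. resolve_url and download are illustrative helper names, not part of any VidProM tooling.

```python
import os
import urllib.request

# Base of the wget URL above; assumed to hold for other files in the repository
BASE_URL = "https://huggingface.co/datasets/WenhaoWang/VidProM/resolve/main"


def resolve_url(filename):
    """Build the direct-download URL for one file in the dataset repository."""
    return f"{BASE_URL}/{filename}"


def download(filename, dest_dir="."):
    """Download one file into dest_dir, skipping it if it already exists."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(filename))
    if not os.path.exists(dest):
        urllib.request.urlretrieve(resolve_url(filename), dest)
    return dest

# Example (performs a network request):
# download("VidProM_unique.csv", dest_dir="VidProM")
```

Skipping files that already exist makes the script safe to re-run after an interrupted session, although a partially written file would need to be deleted by hand first.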

Dataloader

We use the example folder to illustrate how to load VidProM with a PyTorch DataLoader and with WebDataset.

PyTorch Dataloader

The example directory is

*example
    *VidProM_unique_example.csv
    *VidProM_embed_example.hdf5
    *pika_videos_example
        pika-xxx-xxx.mp4
        pika-xxx-xxx.mp4
        ...
    *t2vz_videos_example
        t2vz-xxx-xxx.mp4
        t2vz-xxx-xxx.mp4
        ...
    *vc2_videos_example
        vc2-xxx-xxx.mp4
        vc2-xxx-xxx.mp4
        ...
    *ms_videos_example
        ms-xxx-xxx.mp4
        ms-xxx-xxx.mp4
        ...

The following PyTorch Dataset and DataLoader load this directory:

import os
import pandas as pd
import h5py
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision.io import read_video
import numpy as np

class VidProMDataset(Dataset):
    def __init__(self, csv_file, hdf5_file, video_dirs, transform=None):
        
        self.metadata = pd.read_csv(csv_file)
        self.video_dirs = video_dirs
        self.transform = transform
        self.nsfw_names = ['toxicity','obscene','identity_attack','insult','threat','sexual_explicit']

        self.hdf5_file =  h5py.File(hdf5_file, 'r')
        self.hdf5_uuid = np.array(self.hdf5_file["uuid"][:], dtype=object).astype(str).tolist()
        self.hdf5_embed = np.array(self.hdf5_file['embeddings'])
        
    def __len__(self):
        return len(self.metadata)

    def __getitem__(self, idx):
        
        video_info = self.metadata.iloc[idx]
        video_id = video_info['uuid']
        prompt = video_info['prompt']
        time = video_info['time']
        nsfw_scores = torch.tensor(list(video_info[self.nsfw_names]))
        
        embed = torch.tensor(self.hdf5_embed[self.hdf5_uuid.index(video_id)])
        video_path = self._find_video_path(video_id)
        video_frames, _, _ = read_video(video_path, pts_unit='sec')
        
        if self.transform:
            video_frames = self.transform(video_frames)

        return {
            'video_id': video_id,
            'video_frames': video_frames,
            'embed': embed,
            'prompt': prompt,
            'time': time,
            'nsfw_scores': nsfw_scores
        }

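    def _find_video_path(self, video_id):
        for video_dir in self.video_dirs:
            # Take the model prefix (e.g. 'pika') from the folder name only,
            # so that parent directories containing '_' do not break the lookup
            prefix = os.path.basename(os.path.normpath(video_dir)).split('_')[0]
            video_file = os.path.join(video_dir, f"{prefix}-{video_id}.mp4")
            if os.path.exists(video_file):
                return video_file
        raise FileNotFoundError(f"Video {video_id}.mp4 not found in any of the directories.")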

    def __del__(self):
        self.hdf5_file.close()

csv_file = 'VidProM_unique_example.csv'
hdf5_file = 'VidProM_embed_example.hdf5'
video_dirs = ['t2vz_videos_example', 'pika_videos_example', 'vc2_videos_example', 'ms_videos_example']
dataset = VidProMDataset(csv_file, hdf5_file, video_dirs)
dataloader = DataLoader(dataset, batch_size=16, shuffle=False, num_workers=0)
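
With a batch size greater than 1, PyTorch's default collate function tries to stack every field into a single tensor, which fails if the clips in a batch differ in frame count or resolution. The sketch below is one possible workaround, assuming each sample is the dictionary returned by __getitem__ above. vidprom_collate is a hypothetical name, not part of the VidProM code.

```python
import torch


def vidprom_collate(batch):
    """Collate VidProM samples, keeping videos as a list of tensors."""
    return {
        'video_id': [s['video_id'] for s in batch],
        # read_video returns (T, H, W, C) tensors; shapes may differ between
        # clips, so they are kept in a list instead of being stacked
        'video_frames': [s['video_frames'] for s in batch],
        # Fixed-size fields can be stacked along a new batch dimension
        'embed': torch.stack([s['embed'] for s in batch]),
        'prompt': [s['prompt'] for s in batch],
        'time': [s['time'] for s in batch],
        'nsfw_scores': torch.stack([s['nsfw_scores'] for s in batch]),
    }

# Usage with the dataset defined above:
# dataloader = DataLoader(dataset, batch_size=16, shuffle=False,
#                         num_workers=0, collate_fn=vidprom_collate)
```

If a transform resizes and temporally crops every clip to a common shape, the default collate function works and this helper is unnecessary.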

WebDataset

WebDataset can load videos directly from the tar files. We assume the following directory layout:

*example
    *VidProM_unique_example.csv
    *VidProM_embed_example.hdf5
    *pika_videos_example.tar
    *t2vz_videos_example.tar
    *vc2_videos_example.tar 
    *ms_videos_example.tar

We have the following:

import io
import av
import pandas as pd
import h5py
import numpy as np
import torchvision.transforms as transforms
import torch
import webdataset as wds

tar_file_path = 't2vz_videos_example.tar'  # t2vz_videos_example.tar serves as the example
csv_file = 'VidProM_unique_example.csv'
hdf5_path = 'VidProM_embed_example.hdf5'
dataset = wds.WebDataset(tar_file_path)
metadata = pd.read_csv(csv_file)
# Load the uuids and prompt embeddings into memory
hdf5_file = h5py.File(hdf5_path, 'r')
hdf5_uuid = np.array(hdf5_file["uuid"][:], dtype=object).astype(str).tolist()
hdf5_embed = np.array(hdf5_file['embeddings'])

# Build the frame-to-tensor transform once, outside the loop
transform = transforms.ToTensor()

for sample in dataset:
    # Decode the mp4 bytes into a tensor of shape (num_frames, C, H, W)
    binary_data = sample['mp4']
    container = av.open(io.BytesIO(binary_data))
    frames = []
    for frame in container.decode(video=0):
        img = frame.to_image()
        frames.append(transform(img))
    container.close()
    video_tensor = torch.stack(frames)

    # Recover the uuid by dropping the model prefix (e.g. 't2vz-') from the key
    uuid = '-'.join(sample['__key__'].split('/')[-1].split('-')[1:])

    # Look up the metadata row once and read its fields
    row = metadata[metadata['uuid'] == uuid].iloc[0]
    prompt = row['prompt']
    time = row['time']
    nsfw_scores = list(row.iloc[3:])

    # Look up the prompt embedding
    embed = torch.tensor(hdf5_embed[hdf5_uuid.index(uuid)])
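
The uuid extraction and the embedding lookup in the loop above can be factored into small helpers. This is a sketch: split_key and build_index are hypothetical names, and the key format (an optional directory, then the model prefix, a hyphen, and the uuid) is inferred from the loop above. The dictionary replaces the linear list.index search with a constant-time lookup, which matters once the uuid list is large.

```python
def split_key(key):
    """Split a WebDataset key such as 'dir/t2vz-<uuid>' into (model prefix, uuid)."""
    name = key.split('/')[-1]              # drop any directory component
    model, _, uuid = name.partition('-')   # split at the first hyphen only
    return model, uuid


def build_index(uuids):
    """Map each uuid to its row position so lookups avoid scanning the list."""
    return {u: i for i, u in enumerate(uuids)}

# Usage with the variables defined above:
# uuid_to_row = build_index(hdf5_uuid)      # build once, before the loop
# _, uuid = split_key(sample['__key__'])
# embed = torch.tensor(hdf5_embed[uuid_to_row[uuid]])
```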

Explanation

VidProM_unique.csv contains the UUID, prompt, time, and 6 NSFW probabilities.

It can easily be read by

import pandas as pd
df = pd.read_csv("VidProM_unique.csv")

Below are three rows from VidProM_unique.csv:

uuid	prompt	time	toxicity	obscene	identity_attack	insult	threat	sexual_explicit
6a83eb92-faa0-572b-9e1f-67dec99b711d	Flying among clouds and stars, kitten Max discovered a world full of winged friends. Returning home, he shared his stories and everyone smiled as they imagined flying together in their dreams.	Sun Sep 3 12:27:44 2023	0.00129	0.00016	7e-05	0.00064	2e-05	2e-05
3ba1adf3-5254-59fb-a13e-57e6aa161626	Use a clean and modern font for the text "Relate Reality 101." Add a small, stylized heart icon or a thought bubble above or beside the text to represent emotions and thoughts. Consider using a color scheme that includes warm, inviting colors like deep reds, soft blues, or soothing purples to evoke feelings of connection and intrigue.	Wed Sep 13 18:15:30 2023	0.00038	0.00013	8e-05	0.00018	3e-05	3e-05
62e5a2a0-4994-5c75-9976-2416420526f7	zoomed out, sideview of an Grey Alien sitting at a computer desk	Tue Oct 24 20:24:21 2023	0.01777	0.00029	0.00336	0.00256	0.00017	5e-05
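
The time and NSFW columns can be used to filter prompts. The sketch below parses the timestamp format shown in the rows above and keeps a row only when all six scores fall below a threshold. The threshold of 0.01 is an arbitrary illustration, not a value recommended by the dataset, and parse_time and is_safe are hypothetical helper names. Rows are plain dictionaries, such as those produced by df.to_dict('records').

```python
from datetime import datetime

NSFW_COLUMNS = ['toxicity', 'obscene', 'identity_attack',
                'insult', 'threat', 'sexual_explicit']
TIME_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Sun Sep 3 12:27:44 2023"


def parse_time(value):
    """Convert the 'time' string into a datetime object."""
    return datetime.strptime(value, TIME_FORMAT)


def is_safe(row, threshold=0.01):
    """Return True when every NSFW score in the row is below the threshold."""
    return all(float(row[name]) < threshold for name in NSFW_COLUMNS)


# First and third rows from the table above (prompt column omitted)
rows = [
    {'uuid': '6a83eb92-faa0-572b-9e1f-67dec99b711d',
     'time': 'Sun Sep 3 12:27:44 2023',
     'toxicity': 0.00129, 'obscene': 0.00016, 'identity_attack': 7e-05,
     'insult': 0.00064, 'threat': 2e-05, 'sexual_explicit': 2e-05},
    {'uuid': '62e5a2a0-4994-5c75-9976-2416420526f7',
     'time': 'Tue Oct 24 20:24:21 2023',
     'toxicity': 0.01777, 'obscene': 0.00029, 'identity_attack': 0.00336,
     'insult': 0.00256, 'threat': 0.00017, 'sexual_explicit': 5e-05},
]

safe_rows = [row for row in rows if is_safe(row)]
timestamps = [parse_time(row['time']) for row in rows]
```

With this threshold only the first row is kept, because the toxicity score of the second row (0.01777) exceeds 0.01.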

VidProM_semantic_unique.csv is a semantically unique version of VidProM_unique.csv.

VidProM_embed.hdf5 contains the 3072-dim embeddings of our prompts. They are computed with text-embedding-3-large, which is the latest text embedding model of OpenAI.

It can easily be read by

import numpy as np
import h5py
def read_descriptors(filename):
    # Open read-only and close the file once both arrays are in memory
    with h5py.File(filename, "r") as hh:
        descs = np.array(hh["embeddings"])
        names = np.array(hh["uuid"][:], dtype=object).astype(str).tolist()
    return names, descs

uuid, features = read_descriptors('VidProM_embed.hdf5')
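
Once loaded, the embeddings can be used for nearest-neighbour search over prompts. Below is a minimal NumPy sketch using cosine similarity. top_k_similar is a hypothetical helper, and for the full 1.67 million prompts an approximate nearest-neighbour library would likely be a better fit than this brute-force version.

```python
import numpy as np


def top_k_similar(features, query_index, k=5):
    """Return (indices, similarities) of the k rows closest to one query row."""
    feats = np.asarray(features, dtype=np.float32)
    # Normalise each row so that a dot product equals cosine similarity
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_index]
    sims[query_index] = -np.inf  # exclude the query itself
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Usage with the arrays returned by read_descriptors:
# idx, sims = top_k_similar(features, query_index=0, k=5)
# neighbours = [uuid[i] for i in idx]
```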

original_files contains the HTML files from the official Pika Discord, collected with DiscordChatExporter. You may use them freely under the CC BY-NC 4.0 license.

pika_videos, vc2_videos, t2vz_videos, and ms_videos contain the videos generated by the 4 state-of-the-art text-to-video diffusion models. Each contains 30 tar files.
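
Before extracting a video archive, its contents can be inspected with Python's standard tarfile module. This is a small sketch, assuming the archives are plain tar files whose video members end in .mp4, as in the example directories. list_videos is a hypothetical helper name.

```python
import tarfile


def list_videos(tar_path):
    """Return (member name, size in bytes) for every .mp4 file in a tar archive."""
    videos = []
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            # Skip directories and any non-video files
            if member.isfile() and member.name.endswith(".mp4"):
                videos.append((member.name, member.size))
    return videos

# Example:
# for name, size in list_videos("t2vz_videos_example.tar"):
#     print(name, size)
```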

Datapoint

Comparison with DiffusionDB

Click the WizMap (and wait for 5 seconds) for an interactive visualization of our 1.67 million prompts. Above is a thumbnail.

Please check our paper for a detailed comparison.

Curators

VidProM was created by Wenhao Wang and Professor Yi Yang.

License

The prompts and the videos generated by Pika in VidProM are licensed under the CC BY-NC 4.0 license. As with their original repositories, the videos from VideoCrafter2, Text2Video-Zero, and ModelScope are released under the Apache license, the CreativeML Open RAIL-M license, and the CC BY-NC 4.0 license, respectively. Our code is released under the CC BY-NC 4.0 license.

Citation

@inproceedings{wang2024vidprom,
  title={VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models},
  author={Wang, Wenhao and Yang, Yi},
  booktitle={Thirty-eighth Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=pYNl76onJL}
}

Contact

If you have any questions, feel free to contact Wenhao Wang (wangwenhao0716@gmail.com).