

Youku-mPLUG 10M Chinese Large-Scale Video Text Dataset

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks Download Link HERE


<p align="center"> <img src="assets/youku_mplug_logo.png" alt="examples for youku-mplug"/> </p>

What is Youku-mPLUG?

We release Youku-mPLUG, the largest public high-quality Chinese video-language dataset (10 million videos). It is collected from Youku, a well-known Chinese video-sharing website, under strict criteria for safety, diversity, and quality.

<p align="center"> <img src="assets/pretrain_data.jpg" alt="examples for youku-mplug"/> </p> <p align="center"> <img src="assets/examples.png" alt="examples for youku-mplug"/> </p> <p align="center"> <font size=2 color="gray">Examples of video clips and titles in the proposed Youku-mPLUG dataset.</font> </p>

We provide three downstream multimodal video benchmark datasets to measure the capabilities of pre-trained models. The three tasks are illustrated below:

<p align="center"> <img src="assets/downstream_data.jpg" alt="examples for youku-mplug downstream dataset"/> </p>

Data statistics

The dataset contains 10 million high-quality videos in total, distributed across 20 super categories and 45 categories.

<p align="center"> <img src="assets/statics.jpg" alt="statistics" width="60%"/> </p> <p align="center"> <font size=2 color="gray">The distribution of categories in Youku-mPLUG dataset.</font> </p>

Zero-shot Capability

<p align="center"> <img src="assets/case1.jpg" alt="case1" width="80%"/> <img src="assets/case2.jpg" alt="case2" width="80%"/> </p>


You can download all the videos and annotation files through this link


Note: Because of a bug in megatron_util, after installing it you must replace conda/envs/youku/lib/python3.10/site-packages/megatron_util/initialize.py with the initialize.py in the current directory.

conda env create -f environment.yml
conda activate youku
pip install megatron_util==1.3.0 -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

# For caption evaluation
apt-get install default-jre
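
The initialize.py replacement described in the note above can be scripted. The following is a minimal sketch, not part of the repository: the helper name `patch_megatron_initialize` is hypothetical, and it assumes that megatron_util is already installed in the active environment and that the patched initialize.py is in the current directory. It looks up the installed package location instead of hard-coding the conda path, and keeps a backup of the original file.

```python
import importlib.util
import shutil
from pathlib import Path


def patch_megatron_initialize(patched_file="initialize.py", package_dir=None):
    """Copy the patched initialize.py over the installed megatron_util copy.

    package_dir is the installed megatron_util directory; when None it is
    looked up from the active Python environment (hypothetical helper).
    """
    if package_dir is None:
        spec = importlib.util.find_spec("megatron_util")
        if spec is None or not spec.submodule_search_locations:
            raise RuntimeError("megatron_util is not installed in this environment")
        package_dir = list(spec.submodule_search_locations)[0]

    source = Path(patched_file)
    target = Path(package_dir) / "initialize.py"
    if not source.is_file():
        raise FileNotFoundError(f"patched file not found: {source}")

    # Keep a one-time backup of the original so the change can be reverted.
    backup = target.with_name(target.name + ".orig")
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)

    shutil.copy2(source, target)
    return target
```

Calling `patch_megatron_initialize()` after the pip install step, from the directory that contains the patched initialize.py, performs the same replacement as the manual copy.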

mPLUG-Video (1.3B / 2.7B)


First, download the GPT-3 1.3B and 2.7B checkpoints from ModelScope. The pre-trained models can be downloaded here (1.3B) and here (2.7B).

Run the pre-training of mPLUG-Video as follows:

python -m torch.distributed.launch --nproc_per_node=8 --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  --nnodes=$WORLD_SIZE \
  --node_rank=$RANK \
  --use_env run_pretrain_distributed_gpt3.py \
  --config ./configs/${exp_name}.yaml \
  --output_dir ./output/${exp_name} \
  --enable_deepspeed \
  2>&1 | tee ./output/${exp_name}/train.log


To perform downstream fine-tuning, we take Video Category Prediction as an example:

python -m torch.distributed.launch --nproc_per_node=8 --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  --nnodes=$WORLD_SIZE \
  --node_rank=$RANK \
  --use_env downstream/run_cls_distributed_gpt3.py \
  --config ./configs/${exp_name}.yaml \
  --output_dir ./output/${exp_name} \
  --enable_deepspeed \
  --resume path/to/1_3B_mp_rank_00_model_states.pt \
  2>&1 | tee ./output/${exp_name}/train.log
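
Both launch commands above pipe their output through tee into ./output/${exp_name}/train.log, and tee cannot create that log file if the directory is missing. The sketch below is a hypothetical Python wrapper (the name `build_launch_command` is not part of the repository) that assembles the same arguments from MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK, and returns the output directory so it can be created before launching. It uses only the flags that appear in the commands above; the values in the demo are placeholders.

```python
import os
import shlex


def build_launch_command(script, exp_name, nproc_per_node=8, resume=None, env=None):
    """Return (argument list, output directory) for one experiment."""
    env = os.environ if env is None else env
    output_dir = f"./output/{exp_name}"
    cmd = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_addr={env['MASTER_ADDR']}",
        f"--master_port={env['MASTER_PORT']}",
        f"--nnodes={env['WORLD_SIZE']}",
        f"--node_rank={env['RANK']}",
        "--use_env", script,
        "--config", f"./configs/{exp_name}.yaml",
        "--output_dir", output_dir,
        "--enable_deepspeed",
    ]
    if resume is not None:
        # Fine-tuning starts from a pre-trained checkpoint.
        cmd += ["--resume", resume]
    return cmd, output_dir


if __name__ == "__main__":
    # Placeholder values; on a real cluster these come from the scheduler.
    demo_env = {
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "29500",
        "WORLD_SIZE": "1",
        "RANK": "0",
    }
    cmd, output_dir = build_launch_command(
        "downstream/run_cls_distributed_gpt3.py",
        "my_experiment",  # hypothetical experiment name
        resume="path/to/1_3B_mp_rank_00_model_states.pt",
        env=demo_env,
    )
    print("create first:", output_dir)
    print(shlex.join(cmd))
```

Before running the printed command, create the directory with `os.makedirs(output_dir, exist_ok=True)` or `mkdir -p`, so that the log file can be opened.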

Experimental results

Below we show the results on the validation sets for reference.

<p align="left"> <img src="assets/val_cls.jpg" alt="Video category prediction results on the validation set." width="70%"/> <img src="assets/val_retrieval.jpg" alt="Video retrieval results on the validation set." width="70%"/> </p>

mPLUG-Video (BloomZ-7B)

We build the mPLUG-Video model on top of mPLUG-Owl. To use the model, first clone the mPLUG-Owl repo:

git clone https://github.com/X-PLUG/mPLUG-Owl.git
cd mPLUG-Owl/mPLUG-Owl

The instruction-tuned checkpoint is available on HuggingFace. To fine-tune the model, refer to the mPLUG-Owl repo. To perform video inference, you can use the following code:

import torch
from mplug_owl_video.modeling_mplug_owl import MplugOwlForConditionalGeneration
from transformers import AutoTokenizer
from mplug_owl_video.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor

pretrained_ckpt = 'MAGAer13/mplug-youku-bloomz-7b'
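model = MplugOwlForConditionalGeneration.from_pretrained(
    pretrained_ckpt,
    torch_dtype=torch.bfloat16,  # inputs are cast to bfloat16 below
    device_map={'': 0},  # place the whole model on GPU 0
)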
image_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)
tokenizer = AutoTokenizer.from_pretrained(pretrained_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)

# We use a human/AI template to organize the context as a multi-turn conversation.
# <|video|> denotes a video placeholder.
# The question below means: "What is the woman in the video doing?"
prompts = [
'''The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <|video|>
Human: 视频中的女人在干什么?
AI: ''']

video_list = ['yoga.mp4']

# Generation kwargs (the same as in transformers) are passed to model.generate().
generate_kwargs = {
    'do_sample': True,
    'top_k': 5,
    'max_length': 512
}
inputs = processor(text=prompts, videos=video_list, num_frames=4, return_tensors='pt')
inputs = {k: v.bfloat16() if v.dtype == torch.float else v for k, v in inputs.items()}
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    res = model.generate(**inputs, **generate_kwargs)
sentence = tokenizer.decode(res.tolist()[0], skip_special_tokens=True)
print(sentence)

Citing Youku-mPLUG

If you find this dataset useful for your research, please consider citing our paper.

    title={Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks},
    author={Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Chenliang Li, Qi Qian, Que Maofei, Ji Zhang, Xiao Zeng, Fei Huang},