<div align="center"> <h2><a href="https://github.com/bytedance/tarsier">Tarsier: Recipes for Training and Evaluating Large Video Description Models</a></h2>Jiawei Wang*, Liping Yuan*, Yuchen Zhang*, Haomiao Sun
ByteDance Research
*:Equal contribution, sorted alphabetically.
</div> <!-- [![Paper](https://img.shields.io/badge/cs.CV-2311.17005-b31b1b?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2311.17005) --> <div align="center"> <a href="https://github.com/bytedance/tarsier"> <img src="assets/figures/tarsier_logo.jpg" width = "60%"> </a> </div>

Release Notes
- [2024/11/05] 🔥🚀 Online Demo of Tarsier2-7B is out! We've redesigned our pre-training and post-training processes, using larger, high-quality video-text datasets (see Tarsier2 Training Data). Tarsier2-7B generates video descriptions that are significantly more precise than those of Tarsier-34B, rivaling state-of-the-art models like GPT-4o. In the human side-by-side comparison, Tarsier2-7B gains a slight advantage (4.8%) over GPT-4o.
- [2024/09/19] 🔥🚀 DREAM-1K Leaderboard is out! 20+ recent open-source and closed-source video understanding models are evaluated on detailed video description over 1,000 video clips from multiple sources and of varying complexity. Check out the DREAM-1K Explorer for the video clips and the results of different models.
- [2024/07/04] 🔥 Tarsier is out! We released the model (Tarsier-7b/Tarsier-34b), code, and data for inference, evaluation, and deployment. Tarsier-34B achieves SOTA results on 6 open video understanding benchmarks and detailed video description capability comparable to Gemini 1.5 Pro!
Preface
Welcome to Tarsier!
In this repository, we introduce Tarsier -- a family of large-scale video-language models designed to generate high-quality video descriptions (see Figure 1) while retaining strong general video understanding (SOTA results on 6 open benchmarks). Tarsier uses a simple model structure (CLIP-ViT + LLM) combined with a carefully designed training strategy: multi-task pre-training (stage-1) and multi-grained instruction tuning (stage-2).
Besides the model, we propose a new video description benchmark called DREAM-1K (<b>D</b>escription with <b>R</b>ich <b>E</b>vents, <b>A</b>ctions, and <b>M</b>otions), featuring videos from diverse sources and varying complexity. AutoDQ (<b>Auto</b>matic <b>D</b>escription <b>Q</b>uality) is also introduced as a highly interpretable and discriminative approach to evaluate video description quality.
We have released the model, code, and data for inference, evaluation, and deployment. We also provide an online demo for Tarsier2-7B:
- Model:

  | Model | Link |
  | :--- | :--- |
  | Tarsier-7b | https://huggingface.co/omni-research/Tarsier-7b |
  | Tarsier-34b | https://huggingface.co/omni-research/Tarsier-34b |

- Dataset: https://huggingface.co/datasets/omni-research/DREAM-1K
- Demo: https://huggingface.co/spaces/omni-research/Tarsier2-7b
Please <a href="#citeus">cite us</a> if you find our work helpful.
<div align="center"> <img src="assets/figures/chatbot-example.png" width = "100%"> <br>Figure 1: Example dialogue between a user and Tarsier. The input video is: <a href="https://github.com/bytedance/tarsier/blob/main/assets/videos/coffee.gif">assets/videos/coffee.gif</a> </div>Overview
Abstract
<!-- <details> -->Generating fine-grained video descriptions is a fundamental challenge in video understanding. In this work, we introduce Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions. Tarsier employs CLIP-ViT to encode frames separately and then uses an LLM to model temporal relationships. Despite its simple architecture, we demonstrate that with a meticulously designed two-stage training procedure, the Tarsier models exhibit substantially stronger video description capabilities than any existing open-source model, showing a +51.4% advantage in human side-by-side evaluation over the strongest model. Additionally, they are comparable to state-of-the-art proprietary models, with a +12.3% advantage against GPT-4V and a −6.7% disadvantage against Gemini 1.5 Pro. Besides video description, Tarsier proves to be a versatile generalist model, achieving new state-of-the-art results across nine public benchmarks, including multi-choice VQA, open-ended VQA, and zero-shot video captioning. Our second contribution is the introduction of a new benchmark for evaluating video description models, consisting of a new challenging dataset featuring videos from diverse sources and varying complexity, along with an automatic method specifically designed to assess the quality of fine-grained video descriptions. We make our models and evaluation benchmark publicly available at https://github.com/bytedance/tarsier.
<!-- </details> -->

Simple Model Structure
Tarsier has a simple structure: an MLP projection layer connects the visual encoder (CLIP-ViT) to the text decoder (LLM). Frames are encoded independently, and the resulting tokens are concatenated as input to the LLM.
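To make this concrete, here is a minimal PyTorch sketch of the layout described above. It is an illustration only, not the repository's implementation: the class and module names, the MLP shape, and the assumed encoder output format are all hypothetical.

```python
import torch
import torch.nn as nn

class TarsierLikeModel(nn.Module):
    """Sketch of the Tarsier-style layout: a CLIP-ViT encodes each frame
    independently, an MLP projects the visual tokens into the LLM embedding
    space, and the LLM models temporal relations over the concatenated
    frame tokens. Names and shapes are illustrative, not the repo's API."""

    def __init__(self, vision_encoder, llm, vision_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. CLIP-ViT, kept frozen
        self.projector = nn.Sequential(           # MLP projection layer
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                            # text decoder (LLM)

    def forward(self, frames, text_embeds):
        # frames: (batch, num_frames, channels, height, width)
        b, t = frames.shape[:2]
        # Encode every frame independently; assume the encoder returns
        # (batch * num_frames, tokens_per_frame, vision_dim).
        frame_feats = self.vision_encoder(frames.flatten(0, 1))
        visual_embeds = self.projector(frame_feats)
        # Concatenate the tokens of all frames along the sequence dimension.
        visual_embeds = visual_embeds.view(b, -1, visual_embeds.size(-1))
        # Prepend the visual tokens to the text embeddings and decode.
        inputs = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```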
<div align="center"> <img src="assets/figures/model-arch.png" width = "90%"> <br>Figure 2: Tarsier Model Structure. </div>Two-stage Training
Tarsier adopts a two-stage training strategy.
- Stage-1: Multi-task Pre-training on 13M data
- Stage-2: Multi-grained Instruction Tuning on 500K data
In both stages, we freeze the ViT and train all parameters of the projection layer and the LLM.
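In code, this amounts to freezing the vision encoder while keeping the projector and LLM trainable, roughly as sketched below (reusing the hypothetical module names from the sketch above, not the repository's training script):

```python
def trainable_parameters(model):
    """Freeze the ViT; train the projection layer and the LLM (both stages).
    `model` is assumed to expose vision_encoder / projector / llm submodules."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for module in (model.projector, model.llm):
        for p in module.parameters():
            p.requires_grad = True
    # Hand only the trainable parameters to the optimizer.
    return [p for p in model.parameters() if p.requires_grad]
```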
<span id="Tarsier2-Data">Update for Tarsier2 Training Data</span>
For Tarsier2, we have increased both the scale and the quality of our training data:
- 26.1M video-text pairs, with 18.7M high-quality in-house data;
- 11.0M image-text pairs, with 1.13M high-quality in-house data;
- 1.11M text instruction tuning data.
Video Description Evaluation
Benchmark: DREAM-1K
We propose DREAM-1K as a challenging video description benchmark. It contains 1,000 video clips of varying complexity from five different origins: live-action movies, animated movies, stock videos, long YouTube videos, and TikTok-style short videos. We provide a fine-grained manual annotation for each video. See: data/annotations/DREAM-1k.jsonl
<div align="center"> <img src="assets/figures/dream-1k-statistics.png" width = "90%"> <br>Figure 3: DREAM-1K data Statistics. </div>Figure 4 shows the human reference and description results of different models of one video clip (assets/videos/sitting.mp4) from DREAM-1K.
<div align="center"> <img src="assets/figures/video-description-example.jpg" width = "100%"> <br>Figure 4: Human reference and description results of different models on one video clip from DREAM-1K. This video features six actions, each highlighted in a unique color. Model hallucinations are indicated by underlining and red color. </div>Evaluation Approach: AutoDQ
We propose AutoDQ as a more interpretable approach to automatic video description evaluation. AutoDQ uses an extraction model to extract events from two video descriptions, then uses an entailment model to examine how many events extracted from one description are entailed by the other description. We use ChatGPT to implement both models, as shown in Figure 5.
<div align="center"> <img src="assets/figures/automatic-evaluation.png" width = "90%"> <br>Figure 5: The AutoDQ workflow.</a> </div>The relative code is: evaluation/metrics/evaluate_dream_gpt.py
Evaluation Results
We evaluate several advanced open-source video understanding models and two proprietary models (GPT-4V and Gemini 1.5 Pro) on DREAM-1K. The results are shown in Figure 6.
<div align="center"> <img src="assets/figures/dream_1k_results.png" width = "100%"> <br>Figure 6: Evaluation results on DREAM-1K. </div>Video Understanding Benchmarks Evaluation
Tarsier is evaluated on 7 commonly used video understanding benchmarks: MVBench, NeXT-QA, Egoschema, MSVD-QA, MSR-VTT-QA, ActivityNet-QA, and TGIF-QA. Tarsier-34b achieves SOTA results on 6 of the 7 benchmarks.
Usage
This section provides guidance on how to run, evaluate and deploy Tarsier.
Setup
Everything below assumes Python 3.9. If you are not using Python 3.9, you can create a virtual environment with:
conda create -n tarsier python=3.9
Then run the setup script:
git clone https://github.com/bytedance/tarsier.git
cd tarsier
bash setup.sh
Note that you need to fill in the environment parameters for calling the OpenAI service through Azure if you want to run the ChatGPT-based evaluations (for open-ended QA and DREAM-1K).
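For reference, with the official openai Python package (v1+) an Azure client is usually configured from environment variables like the ones below. These variable names are common conventions, not necessarily the exact names this repo's scripts read, so check the evaluation code for the real ones.

```python
import os
from openai import AzureOpenAI  # requires openai>=1.0

# Placeholder variable names -- verify against the repo's evaluation scripts.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2024-02-01"),
)
```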
Model Preparation
Download the model checkpoints from Hugging Face: Tarsier-7b and Tarsier-34b.
Quick Start
You can use the following script for a quick start on detailed video description:
MODEL_NAME_OR_PATH="path_to_the_model"
VIDEO_FILE="assets/videos/coffee.gif" # Or try your own example; the input can be an image (GIF included) or a video.
python3 -m tasks.inference_quick_start \
--model_name_or_path $MODEL_NAME_OR_PATH \
--instruction "Describe the video in detail." \
--input_path $VIDEO_FILE
The result should be:
# Tarsier-7b:
"Prediction: A person is sitting at an outdoor café table with a cup of coffee and a book. The person takes a sip from the cup and then looks at the book. The background features the Eiffel Tower and is surrounded by red roses. Another person walks by in the background, waving at the first person. The scene is set in a Parisian outdoor café with street lamps and greenery visible."
# Tarsier-34b:
"Prediction: A person picks up a cup of coffee from a table and takes a sip while holding a book. The scene is set in an outdoor café with the Eiffel Tower visible in the background. The person then turns to look at two other individuals who are seated at a nearby table. The two individuals wave at the person with the coffee. The person continues to hold the coffee cup and book, looking back at the two individuals."
Benchmark Inference and Evaluation
Data Preparation
- DREAM-1K

  Download the videos from https://huggingface.co/datasets/omni-research/DREAM-1K.

  We have preprocessed the metadata for all benchmarks we used; see data/annotations. However, you need to replace the "<placeholder>" in each annotation file with your local video file path, according to the "vid" field. We provide example code for processing DREAM-1K (see also the sketch after this list), which you can adapt when processing the other benchmarks.
- Other Benchmarks
  - Multi-choice VQA: MVBench, NeXT-QA and Egoschema
  - Open-ended VQA: MSVD-QA, MSR-VTT-QA, ActivityNet-QA and TGIF-QA
  - Video Caption: MSVD-Caption, MSRVTT-Caption, VATEX
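If you prefer to do the substitution yourself, here is a minimal sketch. It assumes each JSONL line has a "vid" field and that your videos are stored as {video_dir}/{vid}.mp4; adjust the filename pattern to your local layout, and prefer the repository's own example code where they differ.

```python
import json

def fill_video_paths(anno_in, anno_out, video_dir):
    """Replace "<placeholder>" in a DREAM-1K-style annotation JSONL with local paths."""
    with open(anno_in) as fin, open(anno_out, "w") as fout:
        for line in fin:
            vid = json.loads(line)["vid"]
            local_path = f"{video_dir}/{vid}.mp4"  # assumed filename pattern
            fout.write(line.replace("<placeholder>", local_path))

# Hypothetical usage:
# fill_video_paths("data/annotations/DREAM-1k.jsonl",
#                  "data/annotations/DREAM-1k.local.jsonl",
#                  "/path/to/DREAM-1K/videos")
```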
Benchmark Inference and Evaluation
The following command first runs inference on the selected benchmarks in parallel (edit the "CHUNKS" and "GPULIST" parameters in scripts/run_inference_benchmark.sh to control the parallelism), and then runs the evaluation.
model_name_or_path="path_to_the_model"
output_dir="dream_predictions"
benchmarks="dream" # Split benchmarks by space. Default as 'all' to inference on all benchmarks; Also could be task types: ('dream', 'caption', 'mc_qa', 'oe_qa'); Or specific benchmark names: ('dream', 'msvd-caption', 'msr-vtt-caption', 'vatex-caption', 'next-qa', 'egoschema', 'mvbench', 'video-mme', 'msvd-qa', 'msr-vtt-qa', 'tgif-qa', 'anet-qa')
mkdir $output_dir
bash scripts/run_inference_benchmark.sh $model_name_or_path $output_dir $benchmarks
The evaluation results will be printed and saved in $output_dir.
Evaluation Only
Run the following script to only calculate the metrics for the selected benchmarks.
pred_dir="dream_predictions"
benchmarks="dream" # Same as above code block
bash run_evaluation_only.sh $pred_dir $benchmarks
The evaluation result will be saved as: {pred_dir}/{benchmark-name}_eval_result.txt
Deployment
CLI Demo
Use the following script to run a conversational demo on the command line.
model_path="path_to_the_model"
bash scripts/run_demo_cli.sh $model_path
Below are the input video and a conversation with Tarsier-34b about the video:
<div align="center"> <img src="assets/videos/demo_test.gif" width = "100%"> <br>Figure 7: Input video in CLI Demo.</a> </div> <br> <div align="center"> <img src="assets/videos/demo_cli_example.gif" width = "100%"> <br>Figure 8: Conversation in CLI Demo.</a> </div>Gradio Demo
Use the following script to run a Gradio Demo.
model_path="path_to_the_model"
bash scripts/run_demo_gradio.sh $model_path
The Gradio page should look like the following. First upload a video/image/GIF in the corresponding block, then start the conversation. Click the "Clear" button to restart.
<div align="center"> <img src="assets/figures/gradio_page.png" width = "100%"> <br>Figure 9: Tarsier Gradio Demo.</a> </div><span id="citeus">Citation</span>
Please cite us as:
@misc{wang2024tarsierrecipestrainingevaluating,
title={Tarsier: Recipes for Training and Evaluating Large Video Description Models},
author={Jiawei Wang and Liping Yuan and Yuchen Zhang and Haomiao Sun},
year={2024},
eprint={2407.00634},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.00634},
}