TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability

<img width="900" src="https://github.com/TimeMarker-LLM/TimeMarker/blob/main/assets/logo_w_name.jpg">

Introduction

Recent work on video-language models has focused mainly on visual perception and reasoning, with less attention paid to temporal localization and detection. Current models, although trained extensively on video captioning and QA datasets, struggle to place precise temporal references within video content. Many Video-LLMs incorporate temporal embeddings into video features, but this approach still has significant drawbacks: such models perceive only relative time, such as the order of events, rather than absolute time points, such as the exact second at which an event occurs. This lack of precise temporal grounding makes responses less interpretable and verifiable, and it complicates subsequent temporal reasoning and inference. To address these limitations, we present TimeMarker, a versatile Video-LLM designed for high-quality dialogue based on video content, featuring robust temporal localization abilities.

Key Innovations:

  1. Temporal Separator Tokens Integration: To enhance temporal awareness, TimeMarker interleaves textual temporal separator tokens (e.g., sec{20}) with video frame tokens, which encodes the absolute temporal position of each video frame. These tokens serve as precise time markers, allowing the model to identify and reference specific moments within the video.

  2. AnyLength Mechanism: To process videos of varying lengths efficiently, TimeMarker employs the AnyLength mechanism, which combines dynamic frame sampling with adaptive token resizing/merging. Based on the video's length, this mechanism adjusts the frames per second (FPS) used when sampling video frames and the token compression ratio used when merging tokens within a single video frame, ensuring comprehensive coverage of videos of various lengths.

  3. Advanced Data Utilization: Beyond conventional video captioning and QA datasets, TimeMarker converts annotations from various temporal-related datasets into video QA formats, facilitating comprehensive model training on temporal understanding tasks. Although training uses only approximately 5M video-text pairs, the video data is diverse in duration, ranging from less than one minute to 120 minutes. Additionally, extensive image training data (around 90M) and interleaved multi-image data (around 12M) enhance the model's semantic perception and cognitive abilities.

  4. Benchmark Excellence Across Various Video Lengths: TimeMarker achieves state-of-the-art performance across multiple public video benchmarks, excelling in both short and long video categories. It surpasses traditional models in tasks such as temporal sentence grounding, demonstrating superior temporal localization and understanding capabilities. This underscores the model's robustness and versatility in handling videos of varying lengths with exceptional accuracy in time-based tasks.
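As a rough illustration of how items 1 and 2 could fit together, the sketch below samples frame timestamps at a length-dependent FPS, picks a per-frame token count from a fixed budget, and interleaves `sec{t}` separators with frame tokens. It is not the TimeMarker implementation: the function names, the frame cap, the token budget, the placeholder frame tokens, and the choice to put each separator before its frame are all assumptions made for illustration.

```python
import math


def sample_timestamps(duration_s: float, base_fps: float = 1.0,
                      max_frames: int = 8) -> list[float]:
    """Sample at base_fps, lowering the FPS when the frame cap would be exceeded.

    base_fps and max_frames are hypothetical defaults, not values from TimeMarker.
    """
    if duration_s <= 0:
        return []
    # Longer videos get a larger step between frames (i.e. a lower FPS).
    step = max(1.0 / base_fps, duration_s / max_frames)
    count = min(max_frames, math.ceil(duration_s / step))
    return [round(i * step, 2) for i in range(count)]


def tokens_per_frame(num_frames: int, token_budget: int = 64,
                     max_per_frame: int = 16, min_per_frame: int = 1) -> int:
    """Choose how many tokens each frame keeps after merging (hypothetical budget rule)."""
    if num_frames == 0:
        return 0
    return max(min_per_frame, min(max_per_frame, token_budget // num_frames))


def build_sequence(timestamps: list[float], per_frame: int) -> list[str]:
    """Interleave textual separators such as sec{20} with placeholder frame tokens."""
    sequence: list[str] = []
    for idx, t in enumerate(timestamps):
        # Assumption: the separator precedes the tokens of the frame it marks.
        sequence.append(f"sec{{{int(round(t))}}}")
        sequence.extend(f"<frame{idx}_tok{j}>" for j in range(per_frame))
    return sequence


if __name__ == "__main__":
    for duration in (4, 120):
        ts = sample_timestamps(duration)
        k = tokens_per_frame(len(ts))
        print(duration, ts, k, build_sequence(ts, k)[:3])
```

With these illustrative defaults, a 4-second clip keeps one frame per second at 16 tokens per frame, while a 120-second clip is sampled every 15 seconds at 8 tokens per frame, so both fit the same overall token budget.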

News

[2024/10/30] 🔥 We release our TimeMarker model. TimeMarker is built on the Llama3-8B LLM and achieves 🌟Rank 1 on LVBench, 🌟Rank 2 on VideoVista (Rank 1 on VideoVista is Human Performance), 🌟Rank 2 on MVBench, and 🌟Rank 3 on the MLVU test set! TimeMarker also ranks highly on other video benchmarks. Our paper is coming soon.

Model Architecture

<img width="1260" src="https://github.com/TimeMarker-LLM/TimeMarker/blob/main/assets/timemarker_framework.png">

Performance

Results on Video Benchmarks

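<table>
<tr> <th>Model Name</th> <th>LLM</th> <th>VideoMME (w/o subs)</th> <th>VideoVista</th> <th>LVBench</th> <th>LongVideoBench (dev)</th> <th>MLVU (dev)</th> <th>MVBench</th> <th>MMBench-Video</th> <th>TempCompass</th> </tr>
<tr> <td>Gemini-1.5-pro</td> <td>-</td> <td>75.0</td> <td>76.4</td> <td>33.1</td> <td>66.4</td> <td>-</td> <td>-</td> <td>1.30</td> <td>67.1</td> </tr>
<tr> <td>GPT-4V</td> <td>-</td> <td>59.9</td> <td>-</td> <td>-</td> <td>60.7</td> <td>49.2</td> <td>43.7</td> <td>1.53</td> <td>-</td> </tr>
<tr> <td>GPT-4o</td> <td>-</td> <td>71.9</td> <td>78.3</td> <td>27.0</td> <td>66.7</td> <td>64.6</td> <td>-</td> <td>1.64</td> <td>-</td> </tr>
<tr> <td>LLaVA-Next-Video-7B</td> <td>Vicuna-7b-v1.5</td> <td>33.7</td> <td>56.7</td> <td>-</td> <td>43.5</td> <td>-</td> <td>53.1</td> <td>-</td> <td>-</td> </tr>
<tr> <td>PLLaVA-7B</td> <td>Vicuna-7b-v1.5</td> <td>-</td> <td>60.4</td> <td>-</td> <td>39.2</td> <td>-</td> <td>46.6</td> <td>1.03</td> <td>-</td> </tr>
<tr> <td>VideoChat2-HD</td> <td>Mistral-7B</td> <td>-</td> <td>61.6</td> <td>-</td> <td>-</td> <td>47.9</td> <td>62.3</td> <td>1.22</td> <td>-</td> </tr>
<tr> <td>VideoLLaMA2-7B</td> <td>Mistral-7B</td> <td>47.9</td> <td>60.5</td> <td>-</td> <td>-</td> <td>48.5</td> <td>54.6</td> <td>-</td> <td>-</td> </tr>
<tr> <td>LongVA</td> <td>Qwen2-7B</td> <td>52.6</td> <td>67.4</td> <td>-</td> <td>-</td> <td>56.3</td> <td>-</td> <td>-</td> <td>56.9</td> </tr>
<tr> <td>Video-XL</td> <td>Qwen2-7B</td> <td>55.5</td> <td>-</td> <td>-</td> <td>49.5</td> <td>64.9</td> <td>55.3</td> <td>-</td> <td>-</td> </tr>
<tr> <td>Qwen2-VL-7B-Instruct</td> <td>Qwen2-7B</td> <td>63.3</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>67.0</td> <td>-</td> <td>67.8</td> </tr>
<tr> <td>Kangaroo</td> <td>Llama3-8B</td> <td>56.0</td> <td>69.5</td> <td>39.4</td> <td>54.8</td> <td>61.0</td> <td>61.1</td> <td>1.44</td> <td>-</td> </tr>
<tr> <td>TimeMarker (Ours)</td> <td>Llama3-8B</td> <td>57.3</td> <td>78.4</td> <td>41.3</td> <td>56.3</td> <td>63.9</td> <td>67.4</td> <td>1.53</td> <td>60.4</td> </tr>
</table>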

Results on Temporal Sentence Grounding Benchmarks

<table>
<tr> <th rowspan="2" style="width: 100px;">Model Name</th> <th rowspan="2" style="width: 100px;">Setup</th> <th colspan="4" style="text-align: center; width: 400px;">Charades-STA</th> <th colspan="4" style="text-align: center; width: 400px;">ActivityNet Captions</th> <th colspan="4" style="text-align: center; width: 400px;">DiDeMo</th> </tr>
<tr> <th style="width: 100px;">R@1,IoU=0.3</th> <th style="width: 100px;">R@1,IoU=0.5</th> <th style="width: 100px;">R@1,IoU=0.7</th> <th style="width: 100px;">mIoU</th> <th style="width: 100px;">R@1,IoU=0.3</th> <th style="width: 100px;">R@1,IoU=0.5</th> <th style="width: 100px;">R@1,IoU=0.7</th> <th style="width: 100px;">mIoU</th> <th style="width: 100px;">R@1,IoU=0.3</th> <th style="width: 100px;">R@1,IoU=0.5</th> <th style="width: 100px;">R@1,IoU=0.7</th> <th style="width: 100px;">mIoU</th> </tr>
<tr> <td style="width: 100px;"><a href="https://github.com/researchmm/2D-TAN">2D-TAN</a></td> <td style="width: 100px;">FS</td> <td style="width: 100px;">57.3</td><td style="width: 100px;">45.8</td><td style="width: 100px;">27.9</td><td style="width: 100px;">41.0</td> <td style="width: 100px;">60.4</td><td style="width: 100px;">43.4</td><td style="width: 100px;">25.0</td><td style="width: 100px;">42.5</td> <td style="width: 100px;">-</td><td style="width: 100px;">-</td><td style="width: 100px;">-</td><td style="width: 100px;">-</td> </tr>
<tr> <td style="width: 100px;"><a href="https://github.com/MCG-NJU/MMN">MMN</a></td> <td style="width: 100px;">FS</td> <td style="width: 100px;">65.4</td><td style="width: 100px;">53.3</td><td style="width: 100px;">31.5</td><td style="width: 100px;">46.5</td> <td style="width: 100px;">64.5</td><td style="width: 100px;">48.2</td><td style="width: 100px;">29.4</td><td style="width: 100px;">46.6</td> <td style="width: 100px;">-</td><td style="width: 100px;">-</td><td style="width: 100px;">-</td><td style="width: 100px;">-</td> </tr>
<tr> <td style="width: 100px;"><a href="https://github.com/showlab/UniVTG">UniVTG</a></td> <td style="width: 100px;">FS</td> <td style="width: 100px;">72.6</td><td style="width: 100px;">60.2</td><td style="width: 100px;">38.6</td><td style="width: 100px;">52.2</td> <td style="width: 100px;">-</td><td style="width: 100px;">-</td><td style="width: 100px;">-</td><td style="width: 100px;">-</td> <td style="width: 100px;">-</td><td style="width: 100px;">-</td><td style="width: 100px;">-</td><td style="width: 100px;">-</td> </tr>
<tr> <td style="width: 100px;"><a href="https://github.com/DCDmllm/Momentor">Momentor</a></td> <td style="width: 100px;">VLM</td> <td style="width: 100px;">42.6</td><td style="width: 100px;">26.6</td><td style="width: 100px;">11.6</td><td style="width: 100px;">28.5</td> <td style="width: 100px;">42.9</td><td style="width: 100px;">23.0</td><td style="width: 100px;">12.4</td><td style="width: 100px;">29.3</td> <td style="width: 100px;">-</td><td style="width: 100px;">-</td><td style="width: 100px;">-</td><td style="width: 100px;">-</td> </tr>
<tr> <td style="width: 100px;"><a href="https://openaccess.thecvf.com/content/CVPR2024W/PVUW/papers/Qu_ChatVTG_Video_Temporal_Grounding_via_Chat_with_Video_Dialogue_Large_CVPRW_2024_paper.pdf">ChatVTG</a></td> <td style="width: 100px;">VLM</td> <td style="width: 100px;">52.7</td><td style="width: 100px;">33.0</td><td style="width: 100px;">15.9</td><td style="width: 100px;">34.9</td> <td style="width: 100px;">40.7</td><td style="width: 100px;">22.5</td><td style="width: 100px;">9.4</td><td style="width: 100px;">27.2</td> <td style="width: 100px;">-</td><td style="width: 100px;">-</td><td style="width: 100px;">-</td><td style="width: 100px;">-</td> </tr>
<tr> <td style="width: 100px;"><a href="https://github.com/huangb23/VTimeLLM">VTimeLLM</a></td> <td style="width: 100px;">VLM</td> <td style="width: 100px;">55.3</td><td style="width: 100px;">34.3</td><td style="width: 100px;">14.7</td><td style="width: 100px;">34.6</td> <td style="width: 100px;">44.8</td><td style="width: 100px;">29.5</td><td style="width: 100px;">14.2</td><td style="width: 100px;">31.4</td> <td style="width: 100px;">-</td><td style="width: 100px;">-</td><td style="width: 100px;">-</td><td style="width: 100px;">-</td> </tr>
<tr> <td style="width: 100px;">TimeMarker (Ours)</td> <td style="width: 100px;">VLM</td> <td style="width: 100px;">73.5</td><td style="width: 100px;">51.9</td><td style="width: 100px;">26.9</td><td style="width: 100px;">48.4</td> <td style="width: 100px;">67.4</td><td style="width: 100px;">50.7</td><td style="width: 100px;">33.0</td><td style="width: 100px;">49.5</td> <td style="width: 100px;">71.3</td><td style="width: 100px;">63.9</td><td style="width: 100px;">56.2</td><td style="width: 100px;">63.6</td> </tr>
</table>

Note: FS denotes a specialized temporal sentence grounding model trained in a fully supervised setting; VLM denotes a Video-LLM.
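
For readers unfamiliar with the column headers, the sketch below shows how R@1 at an IoU threshold and mIoU are commonly computed for temporal sentence grounding. It assumes one predicted (start, end) segment in seconds per query and counts a hit when IoU is greater than or equal to the threshold. The function names are illustrative, and this is not the evaluation script used to produce the numbers above.

```python
Segment = tuple[float, float]  # (start_s, end_s)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(preds: list[Segment], gts: list[Segment],
                      thresholds: tuple[float, ...] = (0.3, 0.5, 0.7)):
    """Return ({threshold: R@1 in percent}, mIoU in percent)."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    # R@1,IoU=m: share of queries whose top-1 prediction reaches IoU >= m.
    recall = {m: 100.0 * sum(iou >= m for iou in ious) / n for m in thresholds}
    miou = 100.0 * sum(ious) / n
    return recall, miou


if __name__ == "__main__":
    preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
    gts = [(0.0, 10.0), (10.0, 20.0), (0.0, 5.0)]
    print(grounding_metrics(preds, gts))
```

In this toy example the three IoUs are 1.0, 1/3, and 0, so two of the three predictions count at the 0.3 threshold and only one at 0.5 and 0.7.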