
TAVGBench: Benchmarking Text to Audible-Video Generation

Project Overview

We are excited to introduce a new task in multimodal AI: generating audible video (video with temporally aligned audio) from textual descriptions using a latent diffusion model. To support this task, we have developed TAVGBench, a large-scale benchmark of 1.7 million entries, each annotated with a corresponding text description.

The TAVGBench

Dataset size

Our benchmark comprises 1.7 million entries, each annotated with a rich textual description aligned with its audio and video content. This extensive collection provides a robust foundation for training and evaluating text to audible-video generation models.

Dataset annotation pipeline

The TAVGBench annotation pipeline is designed to produce high-quality, consistent data: each audio-video pair is matched with a detailed textual description, and the pipeline runs multiple stages of annotation and validation to ensure the accuracy and relevance of the resulting labels.

The video and audio captions within TAVGBench have been open-sourced and are available for download here.
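As an illustration of how such paired captions might be consumed, the sketch below parses newline-delimited JSON records into simple entry objects. The record layout and field names (`video_id`, `video_caption`, `audio_caption`) are hypothetical assumptions for this example, not the released TAVGBench schema; check the downloaded files for the actual format.

```python
import json
from dataclasses import dataclass

# Hypothetical record layout for one caption entry; the actual released
# schema may differ -- all field names here are assumptions.
@dataclass
class TAVGEntry:
    video_id: str
    video_caption: str
    audio_caption: str

def load_entries(jsonl_text: str) -> list[TAVGEntry]:
    """Parse newline-delimited JSON caption records into entries."""
    entries = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        entries.append(TAVGEntry(
            video_id=rec["video_id"],
            video_caption=rec["video_caption"],
            audio_caption=rec["audio_caption"],
        ))
    return entries

# Tiny in-memory sample in the assumed format.
sample = (
    '{"video_id": "abc123", '
    '"video_caption": "A dog runs along a beach", '
    '"audio_caption": "Waves crash while a dog barks"}'
)
entries = load_entries(sample)
print(entries[0].video_id)  # → abc123
```

In practice one would stream records from the downloaded caption files line by line instead of holding the whole text in memory.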

Video demo

To showcase the capabilities of our approach, we have prepared a video demonstration highlighting the results achievable with our text to audible-video generation model, a tangible example of this technology's potential applications.