Awesome

MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

Xuan Ju1*, Yiming Gao1*, Zhaoyang Zhang1, Ziyang Yuan1, Xintao Wang1#, Ailing Zeng, Yu Xiong, Qiang Xu, Ying Shan1, 1ARC Lab, Tencent PCG *Equal contribution #Project lead

</div>

Introduction

Video datasets play a crucial role in video generation such as sora. However, existing text-video datasets often fall short when it comes to handling long video sequences and capturing shot transitions. To address these limitations, we introduce MiraData (Mini-Sora Data), a large-scale video dataset designed specifically for long video generation tasks.

Key Features of MiraData

Long Video Duration: Unlike previous datasets, where video clips are often very short (typically less than 6 seconds), MiraData focuses on uncut video segments with durations ranging from 1 to 2 minutes. This extended duration allows for more comprehensive modeling of video content.
Structured Captions: Each video in MiraData is accompanied by structural captions. These captions provide detailed descriptions from various perspectives, enhancing the richness of the dataset. The average caption length is 349 words, ensuring a thorough representation of the video content.

Current Status

In this initial release, MiraData includes two scenarios:

Gaming: Videos related to gaming experiences.
City/Scenic Exploration: Videos capturing urban or scenic views.

MiraData is still in its early stages, and we will release more scenarios and improve the quality of the dataset in the near future.

<h3 align='center'>Demo Video</h3>

Dataset

Meta Files

This version of MiraData contains 57,803 video clips with an overall of 1,754 hours, containing two scenarios: gaming and city/scenic exploration. The clip number and video duration is shown as follows:

Scenario	Clip Num	Video Duration
Gaming	31,159	893 hrs
City/Scenic Exploration	26,644	861 hrs

The meta file for this version of MiraData is provided here. Additionally, for a better and quicker understanding of our meta file composition, we randomly sample a set of 100 video clips, which can be accessed here. The meta file contains the following index information:

index: video clip index, which is composed of {download_idx}_{video_id}-{clip_id}
video_id: youtube video id
start_frame: clip start frame of the youtube video
end_frame: clip end frame of the youtube video
main_object_caption: caption of the main object in video
background_caption: caption of the video background
style_caption: caption of the video style
camera_caption: caption of the camera movie
short_caption: a short overall caption
dense_caption: a dense overall caption
fps: the video fps used for extracting frame

Note that you can obtain the start and end timestamps by using start_frame/fps or end_frame/fps.

How to Download

To download the videos and split the videos into clips, you can use the following scripts:

python download_data.py --meta_csv miradata_v0.csv --video_start_id 0 --video_end_id 10631 --raw_video_save_dir miradata/raw_video --clip_video_save_dir miradata/clip_video

where the --video_start_id and --video_end_id indicates the start and end values of the download_idx in meta file's index for downloading. The gaming scenario is ranging from 0 to 7416 and the city/scenic exploration is ranging from 7417 to 10631.

Collection and Annotation

To collect the MiraData, we first manually select youtube channels in different scenarios. Then, all the videos in corresponding channels are downloaded and splitted using PySceneDetect. After that, we selected video clips with a duration ranging from 1 to 2 minutes. For video clips longer than 2 minutes, we split them into multiple 2-minute clips. Finally, we caption the video clip using GPT-4V.

Structured Captions

Each video in MiraData is accompanied by structured captions. These captions provide detailed descriptions from various perspectives, enhancing the richness of the dataset.

Six Types of Captions

Main Object Description: Describes the primary object or subject in the video, including their attributes, actions, positions, and movements throughout the video.
Background: Provides context about the environment or setting, including objects, location, weather, and time.
Style: Covers artistic style, visual and photographic aspects, such as realistic, cyberpunk, and cinematic style.
Camera Movement: Details any camera pans, zooms, or other movements.
Short Caption: A concise summary capturing the essence of the video, generated using the Panda-70M caption model.
Dense Caption: A more elaborate and detailed description that summarizes the above five types of captions.

Captions with GPT-4V

We tested the existing open-source visual LLM methods and GPT-4V, and found that GPT-4V's captions show better accuracy and coherence in semantic understanding in terms of temporal sequence. It also provides more accurate descriptions of the main subject and background objects, with fewer object omissions and less hallucination issues. Therefore, we use GPT-4V to generate Dense Captions, Main Object Descriptions, Background Descriptions, Camera Movement Descriptions, and Video Styles.

In order to balance annotation costs and caption accuracy, we uniformly sample 8 frames for each video and arrange them into a 2x4 grid of one large image. Then, we use the caption model of Panda-70M to annotate each video with a one-sentence caption, which serves as a hint for the main content, and input it into our fine-tuned prompt. By feeding the fine-tuned prompt and a 2x4 large image to GPT-4V, we can efficiently output captions for multiple dimensions in just one round of conversation. The specific prompt content can be found in the caption_gpt4v.py, and we welcome everyone to contribute to the more high-quality text-video data. :raised_hands:

Statistic

<div style="display:inline-block" align=center> <img src="assets/statistic_dense.png" width="350"/> <img src="assets/statistic_full.png" width="350"/> </div> <div style="display:inline-block" align=center>      Total text length statistics of dense captions.       Total text length statistics of six types of captions.</div> <div style="display:inline-block" align=center> <img src="assets/wordcloud_short.png" width="300"/> <img src="assets/wordcloud_dense.png" width="300"/> </div> <div style="display:inline-block" align=center>Word cloud of short captions.                 Word cloud of dense captions.</div>

Demonstration

Video-Caption Pairs in MiraData

<table class="center"> <tr> <td width=10% style="border: none">Video</td> <td width=30% style="border: none"><video src="https://github.com/mira-space/MiraData/assets/89566272/b2989b1f-d8d0-42d1-b337-2d1e6b6a0f11"></td> <td width=30% style="border: none"><video src="https://github.com/mira-space/MiraData/assets/89566272/1cb6562b-b99c-4ba6-ab81-215771ea6657"></td> <td width=30% style="border: none"><video src="https://github.com/mira-space/MiraData/assets/89566272/c388f3a4-ced7-4fc2-937b-49b482cb8390"></td> </tr> <tr style="text-align: center;"> <td width=10% style="border: none">Main Object Caption</td> <td width=30% style="border: none">A futuristic car, is seen driving through various parts of the neon city. Initially, the car is at a standstill, showcasing its design and lighting, before accelerating into the traffic. The car maneuvers through the streets, dodging other vehicles and obstacles, with a change in perspective to a first-person view showing a hand holding a gun, suggesting a shift to an action sequence or a new gameplay mechanic. The car's movement is fluid and fast, indicating high-speed travel through the city.</td> <td width=30% style="border: none">From the player's perspective, initially grapples with an adversary, as evidenced by the close-up of mechanical parts and the player's hands. The focus then shifts to the elderly woman, who is initially aggressive or defensive, holding a shovel raised as if ready to strike. She then turns, leading the player around the side of a wooden structure, possibly her house. Over time, her demeanor softens, and she appears to be talking to the player, as she lowers her shovel and adopts a more relaxed posture.</td> <td width=30% style="border: none">A person, is seen walking at a leisurely pace along a path that cuts through the ancient ruins and natural landscape. They appear to be exploring the area, moving from the open spaces between the large arches into a more enclosed tunnel. Their movement is consistent and unhurried, suggesting a relaxed and contemplative stroll through the historic site.</td> </tr> <tr style="text-align: center;"> <td width=10% style="border: none">Background Caption</td> <td width=30% style="border: none">A bustling, rain-slicked metropolis at night, characterized by towering skyscrapers, bright neon signs, and holographic billboards. The weather is rainy, creating reflective surfaces on the roads that mirror the city's vivid lights. The time is night, which is evident from the artificial lighting and dark sky, contributing to the cyberpunk aesthetic of the environment.</td> <td width=30% style="border: none">The background depicts a lush rural setting with a wooden house or shed, surrounded by greenery, rocks, and patches of red flowers. The environment has a naturalistic feel, with a clear sky and daylight indicating a daytime setting. There are no other characters or moving elements visible in the background, which suggests a peaceful, albeit isolated, location.</td> <td width=30% style="border: none">A picturesque scene of an ancient city in ruins, with large brick arches and remnants of structures that speak to a bygone era. The setting is peaceful and natural, with trees, bushes, and grasses softly swaying in the light breeze. The time appears to be late afternoon or early evening, as indicated by the long shadows and the warm, soft light that bathes the scene, likely the golden hour before sunset.</td> </tr> <tr style="text-align: center;"> <td width=10% style="border: none">Style Caption</td> <td width=30% style="border: none">The visual, photographic, and artistic style is reminiscent of a cyberpunk universe, with a strong emphasis on neon lighting, rain-drenched streets, and a high-contrast color palette that creates a sense of depth and vibrancy.</td> <td width=30% style="border: none">The visual style is realistic with detailed character models, natural lighting, and a high level of environmental detail, creating an immersive and believable rural setting in a video game context.</td> <td width=30% style="border: none">The video exhibits a cinematic and picturesque quality, utilizing natural lighting and the historical setting to create a visually rich and contemplative narrative.</td> </tr> <tr style="text-align: center;"> <td width=10% style="border: none">Camera Caption</td> <td width=30% style="border: none">The view shot starts with a third-person perspective from behind the car, giving a clear view of the vehicle and its immediate surroundings. The camera then transitions to a first-person perspective, showing the driver's hand with a gun, implying a change in gameplay or narrative. The camera angles vary, capturing the high-speed motion of the car from different viewpoints, including a dynamic, forward-facing angle that conveys the sensation of speeding through the city.</td> <td width=30% style="border: none">The camera perspective is consistent with a first-person viewpoint throughout the sequence. The initial frame suggests a dynamic struggle with rapid movement, while the subsequent frames show a steadier camera as the player interacts with the woman. The camera follows the woman as she moves, maintaining her as the focal point, and the shooting angles change as the player's perspective shifts to keep the woman in view, particularly as she moves and turns.</td> <td width=30% style="border: none">The camera follows a steady, linear path, maintaining a consistent distance from the main subject. It captures the scene from a series of angles that alternate between showcasing the expansive ruins and focusing on the path ahead. As the person enters the tunnel, the camera angle shifts to frame them against the light at the end of the tunnel, creating a silhouette effect.</td> </tr> <tr style="text-align: center;"> <td width=10% style="border: none">Short Caption</td> <td width=30% style="border: none">The player is driving a futuristic car in a neon-lit city at night.</td> <td width=30% style="border: none">A video game character standing in front of a house.</td> <td width=30% style="border: none">There is a person walking on a path surrounded by trees and ruins of an ancient city.</td> </tr> <tr style="text-align: center;"> <td width=10% style="border: none">Dense Caption</td> <td width=30% style="border: none">A player navigating a sleek, futuristic car through a neon-lit cityscape at night. The city is alive with vibrant colors, illuminated advertisements, and dynamic lighting that reflects off the wet streets, adding to the sense of speed and movement. The car's design is cutting-edge, with glowing elements and a streamlined shape that suggests advanced technology and high performance. As the player drives, the environment whizzes by, creating a blur of lights and structures that enhances the feeling of racing through this urban playground.</td> <td width=30% style="border: none">The video sequence showcases a first-person perspective of a video game character interacting with a non-playable character (NPC) in a rural environment. Initially, the player character appears to be grappling with an enemy or creature, as indicated by the close-up struggle and the presence of sparks or embers. The scene transitions to the player character standing before an elderly woman, who is wielding a shovel in a defensive or threatening posture. The woman's expressions and stance suggest she is wary or confrontational towards the player. As the video progresses, the woman seems to relax slightly, lowering her shovel and engaging in conversation with the player, indicated by her changing facial expressions and body language.</td> <td width=30% style="border: none">A serene and historical ambiance as a person walks through a path surrounded by the lush greenery and the majestic ruins of an ancient city. The ruins feature large arches and weathered brick walls, hinting at a grand past. The path is well-trodden and flanked by trees and grass, with the golden hour sunlight casting a warm glow over the scene. Other visitors can be seen in the distance, enjoying the tranquil environment. As the person progresses, they pass through a tunnel-like structure, where the play of light and shadow creates a dramatic effect, enhancing the sense of exploration and discovery.</td> </tr> </table> <table class="center"> <tr> <td width=10% style="border: none">Video</td> <td width=30% style="border: none"><video src="https://github.com/mira-space/MiraData/assets/89566272/aeaf4b11-1fae-4a80-9ff0-ea6d01fdc755"></td> <td width=30% style="border: none"><video src="https://github.com/mira-space/MiraData/assets/89566272/b213cf2a-a04d-4d46-aa23-9dbaf8116ede"></td> <td width=30% style="border: none"><video src="https://github.com/mira-space/MiraData/assets/89566272/2adf3418-093d-48ec-86f6-81c8537596c5"></td> </tr> <tr style="text-align: center;"> <td width=10% style="border: none">Main Object Caption</td> <td width=30% style="border: none">There are no main subjects such as people or animals that are the focus of the video. Instead, the video's main subject is the changing urban landscape itself. The sequence shows a transition from a pedestrian-friendly street with sidewalks to an area with construction barriers and finally to a riverside scene with the bridge as a focal point.</td> <td width=30% style="border: none">A cyclist, is seen navigating through the city streets. Starting on a quieter street, the cyclist passes by shops with inviting warm lighting and continues through intersections and alongside parked cars. The rider's movement is fluid and uninterrupted, suggesting a familiarity with the route. The bicycle's lights are on, ensuring visibility as the evening darkens.</td> <td width=30% style="border: none">Throughout the video, the main subjects are the pedestrians who are walking at a leisurely pace. They are scattered across the walkway, maintaining a casual flow of movement. Their attire suggests a warm and comfortable climate, and their relaxed demeanor indicates a peaceful setting. The fountain in the early part of the video adds a dynamic element as water sprays rhythmically, while the statue towards the end stands still, providing a contrast to the moving subjects.</td> </tr> <tr style="text-align: center;"> <td width=10% style="border: none">Background Caption</td> <td width=30% style="border: none">A cityscape that includes a mix of architectural styles, from red-brick buildings to industrial metal structures. The weather appears to be clear and sunny, with blue skies and minimal cloud cover. The time seems to be during the day, given the shadows and the brightness of the sunlight.</td> <td width=30% style="border: none">A cityscape at twilight, with the sky dimming as the video progresses. The streets are moderately busy with pedestrians and occasional vehicles. The architecture is a mix of residential buildings, small businesses, and modern commercial spaces. Streetlights and building lights provide illumination, and the weather appears clear and calm.</td> <td width=30% style="border: none">Features a blend of natural and urban elements. The plaza with the fountain is surrounded by buildings that suggest a downtown area, while the walkway is bordered by lush greenery, indicating well-kept urban parks or gardens. The clear blue sky suggests fair weather, and the bright sunlight indicates daytime, possibly morning or afternoon given the angle of the shadows.</td> </tr> <tr style="text-align: center;"> <td width=10% style="border: none">Style Caption</td> <td width=30% style="border: none">The visual, photographic, and artistic style of the video is realistic with a clear, bright, and high-contrast depiction of an urban environment during a sunny day.</td> <td width=30% style="border: none">The visual style of the video is naturalistic and immersive, capturing the tranquil ambiance of a city transitioning from day to night with a steady, first-person perspective.</td> <td width=30% style="border: none">The video showcases a wide-angle, high-definition view with vivid colors and a clear focus, capturing the tranquility and beauty of a city's public space in a documentary-style presentation.</td> </tr> <tr style="text-align: center;"> <td width=10% style="border: none">Camera Caption</td> <td width=30% style="border: none">The camera movement is smooth and appears to be a tracking shot moving forward through the city street. The shooting angle is mostly at street level, providing a first-person perspective of the environment. The camera angle shifts slightly upwards as the video progresses, especially as the bridge becomes the central element in the later frames.</td> <td width=30% style="border: none">The camera angle is consistent throughout the video, maintaining a first-person perspective that likely mimics the cyclist's point of view. The camera moves smoothly, following the natural motion of cycling, and there are no abrupt changes in direction or speed. This steady camera work allows for an immersive experience as if the viewer is the one riding the bicycle.</td> <td width=30% style="border: none">The camera appears to move smoothly along the walkway, maintaining a consistent level and distance from the ground, providing a continuous perspective of the environment. The angle of the shots changes as the camera progresses, starting with a frontal view of the fountain and transitioning to a path leading towards the statue, suggesting a linear path of travel.</td> </tr> <tr style="text-align: center;"> <td width=10% style="border: none">Short Caption</td> <td width=30% style="border: none">A view of a city street with a bridge in the background.</td> <td width=30% style="border: none">A person is riding a bicycle on a city street at night, passing by various buildings and cars.</td> <td width=30% style="border: none">People are walking on a sidewalk in a city, and there is a fountain in the middle of the street.</td> </tr> <tr style="text-align: center;"> <td width=10% style="border: none">Dense Caption</td> <td width=30% style="border: none">The video presents a panoramic journey through a city street leading towards a bridge. The sequence begins with a view of a well-maintained urban area, featuring red-brick buildings with large windows and street lamps, and transitions towards a more industrial setting with metal structures and a bridge in the background. The progression of the frames suggests a movement from a commercial zone towards a waterfront area, with the bridge becoming increasingly prominent in the view.</td> <td width=30% style="border: none">The essence of a city at dusk, with the sky transitioning from the last hints of daylight to the onset of night. A person rides a bicycle along the city streets, weaving through the urban landscape that is a mix of residential and commercial areas. The streets are lined with a variety of buildings, from cozy eateries to modern storefronts, all under the soft glow of streetlights and the occasional bright signage. The cyclist moves at a steady pace, allowing viewers to take in the serene atmosphere of the city in the evening.</td> <td width=30% style="border: none">A serene urban scene where people are leisurely walking along a wide sidewalk in a city. The initial frames show a modern plaza with a reflective surface, where water jets from a fountain create a playful and refreshing atmosphere. As the video progresses, the perspective shifts away from the fountain plaza to a tree-lined walkway, with well-maintained grassy areas on either side. Pedestrians of various ages can be seen strolling, some alone and others in groups, indicating a relaxed urban environment. The latter part of the video reveals a statue prominently positioned at the end of the walkway, adding a historical or cultural dimension to the cityscape.</td> </tr> </table>

**We will remove the video samples from our dataset / Github / project webpage as long as you need it. Please contact to us for the request.

License Agreement

Please see LICENSE.

The MiraData dataset is only available for informational purposes only. The copyright remains with the original owners of the video.
All videos of the MiraData dataset are obtained from the Internet which are not property of our institutions. Our institution are not responsible for the content nor the meaning of these videos.
You agree not to reproduce, duplicate, copy, sell, trade, resell or exploit for any commercial purposes, any portion of the videos and any portion of derived data. You agree not to further copy, publish or distribute any portion of the MiraData dataset.

Citation

If you find this project useful for your research, please cite our paper. :blush:

Current Limitations

Video Data:

Data Volume and Coverage: Since we manually select videos from certain scenarios to guarantee video quality and motion amplitude, videos are selected from a relatively narrow domain with a small video number. This may result in a limited data volume and coverage, which is insufficient for general-purpose video generation and may lead to overfit.
Inaccurate Video Splitting: The video splitting process utilizing PySceneDetect, which detect video clip based on changes in frame intensity, brightness, or HSV color space. This may introduce certain errors, such as missing or superfluous splitting, particularly in videos with effects like fade-ins or fade-outs.
Shot Transitions: Our current release doesn't have many clips describing multiple camera angles for the same scene. We will incorporate a wider range of clips with diverse shot transitions

Caption:

Loss of Detail: The current dataset generates captions based on only 8 frames, which may not fully cover all actions, background details, and content variations within longer video durations. This limitation could potentially lead to missing or insufficient descriptions of certain details. Moreover, combining 8 frames into a single image for captioning, as opposed to frame-by-frame input, might result in the loss of some detailed information. The granularity of the visual data is compromised, which might affect the accuracy of the generated captions.
Hallucination Issues: Despite the advancements, the GPT4-V model still suffers from inevitable hallucination problems. These could manifest as misinterpretations or misrepresentations of the visual data, thereby affecting the quality of the generated captions.
Misleading Hints: In the Panda-70M dataset, there are instances where the provided hints are incorrect. For example, a scene of a person walking might be mistakenly interpreted as a car moving. Such misleading hints could potentially misguide the GPT4-V model and result in inaccurate caption generation.

Acknowlegement

This README file is adpated from Panda-70M

Contact Information

For any inquiries, please email mira-x@googlegroups.com.