Perception Test: A Diagnostic Benchmark for Multimodal Video Models

News

The Second Perception Test Challenge is being organised as an ECCV 2024 workshop! Please see the workshop website for more details and links to the eval.ai challenge pages: ptchallenge-workshop.github.io.

Overview

| Resource | Links |
| --- | --- |
| Quickstart visualisation notebook | Open In Colab |
| Dataset Explorer | Dataset Explorer |
| Download data | Download section here |
| Evaluation scripts (including data loader, dummy baseline, evaluation metrics) | multiple-choice video QA, object tracking, action localisation, point tracking, sound localisation, grounded video QA |
| Challenges and evaluation servers | multiple-choice video QA, object tracking, action localisation, point tracking, sound localisation, grounded video QA |

The Perception Test is a diagnostic benchmark designed to comprehensively evaluate the perception and reasoning skills of multimodal video models. The dataset introduces real-world videos that show perceptually interesting situations, and defines multiple tasks (object and point tracking, action and sound localisation, multiple-choice and grounded video question-answering) that require understanding of memory, abstract patterns, physics, and semantics, across the visual, audio, and text modalities.

In this repository, you will find:

- A 5-minute summary of the Perception Test
- The Perception Test overview presentation
- Try the Perception Test for yourself by accessing this quiz.
- For more example videos from the Perception Test, check out this playlist.

Download the data and annotations

The Perception Test dataset can be downloaded as zip files containing:

Full Dataset Splits

| Task | Split | Videos | Audio | Labels |
| --- | --- | --- | --- | --- |
| Sample | All | sample_videos.zip (214.9MB) | sample_audios.zip (83.9MB) | sample_annotations.zip (3MB) |
| All Tasks | Train | train_videos.zip (26.5GB) | train_audios.zip (12.3GB) | train_annotations.zip (30.6MB) |
| All Tasks | Valid | valid_videos.zip (70.2GB) | valid_audios.zip (33.1GB) | valid_annotations.zip (81.5MB) |
| All Tasks | Test | test_videos.zip (41.8GB) | test_audios.zip (19.3GB) | test_annotations.zip (633.9kB) |

*In test videos where the end of the video gives away the answer to some questions (e.g. in cup games, where the hidden object is at the end), we cut the final part of the video. For the validation split, we provide the frame ID where the cut should be made: cut_frame_mapping_valid.json.
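Once an archive is downloaded, extracting it and loading an annotation file can be sketched as below. This is a minimal, unofficial helper: the exact file layout inside the zips is an assumption, so inspect the extracted files for the real schema.

```python
import json
import zipfile
from pathlib import Path


def extract_split(zip_path, out_dir):
    """Extract a downloaded split archive (e.g. sample_annotations.zip)
    and return the names of the extracted files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out)
    return sorted(p.name for p in out.rglob("*") if p.is_file())


def load_annotations(json_path):
    """Load an annotation JSON file; a dict keyed by video ID is an
    assumption here -- check the extracted files for the exact structure."""
    with open(json_path) as f:
        return json.load(f)
```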

Challenge Downloads

Video IDs
Since some of the challenges use subsets of the benchmark, we provide here the lists of video IDs for each challenge. These should be used to filter the videos/audios/annotations from the full splits above. For single object tracking, single point tracking, and grounded video QA we provide separate zip files since the subsets are much smaller than the full dataset.

| Computational Task | Challenge Train Video IDs | Challenge Valid Video IDs | Challenge Test Video IDs |
| --- | --- | --- | --- |
| Single Object Tracking | object_tracking_train_id_list.csv | object_tracking_valid_subset_id_list.csv | object_tracking_test_subset_id_list.csv |
| Single Point Tracking | point_tracking_train_id_list.csv | point_tracking_valid_id_list.csv | point_tracking_test_id_list.csv |
| Temporal Action Localisation | action_localisation_train_id_list.csv | localisation_challenge_valid_id_list.csv | localisation_challenge_test_id_list.csv |
| Temporal Sound Localisation | sound_localisation_train_id_list.csv | localisation_challenge_valid_id_list.csv | localisation_challenge_test_id_list.csv |
| Multiple-Choice Video QA | mc_question_train_id_list.csv | mc_question_valid_id_list.csv | mc_question_test_id_list.csv |
| Grounded Video QA | grounded_question_train_id_list.csv | grounded_question_valid_id_list.csv | grounded_question_test_id_list.csv |
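The filtering described above can be sketched in a few lines of Python. This is a minimal, unofficial example that assumes the ID-list CSVs contain one video ID per row and that annotations are held in a dict keyed by video ID; adjust to the actual file layout.

```python
import csv


def read_id_list(csv_path):
    """Read a challenge ID-list CSV; one video ID per row is an assumption."""
    with open(csv_path, newline="") as f:
        return {row[0].strip() for row in csv.reader(f) if row}


def filter_by_ids(annotations, video_ids):
    """Keep only the annotations whose video ID is in the challenge subset."""
    return {vid: ann for vid, ann in annotations.items() if vid in video_ids}
```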

Single Object Tracking
Challenge link: https://eval.ai/web/challenges/challenge-page/2094/overview

| Task | Split | Videos | Audio | Labels |
| --- | --- | --- | --- | --- |
| Single Object Tracking | Train | Use full split download above | N/A | Use full split download above |
| Single Object Tracking | Valid | sot_valid_videos_challenge2023.zip (11.6GB) | N/A | sot_valid_annotations_challenge2023.zip (9MB) |
| Single Object Tracking | Test | sot_test_videos_challenge2023.zip (12.1GB) | N/A | sot_test_annotations_challenge2023.zip (613kB) |

Single Point Tracking
Challenge link: https://eval.ai/web/challenges/challenge-page/2108/overview

| Task | Split | Videos | Audio | Labels |
| --- | --- | --- | --- | --- |
| Single Point Tracking | Train | point_tracking_train_videos.zip (398.4MB) | N/A | point_tracking_train_annotations.zip (4.7MB) |
| Single Point Tracking | Valid | point_tracking_valid_videos.zip (1.1GB) | N/A | point_tracking_valid_annotations.zip (11.1MB) |
| Single Point Tracking | Test | point_tracking_test_videos.zip (691MB) | N/A | point_tracking_test_annotations.zip (42.2kB) |

Temporal Action Localisation
Challenge link: https://eval.ai/web/challenges/challenge-page/2101/overview

| Task | Split | Videos | Audio | Labels | Video Features (TSP) |
| --- | --- | --- | --- | --- | --- |
| Temporal Action Localisation | Train | Use full split download above | Use full split download above | action_localisation_train_annotations.zip (217kB) | action_localisation_train_video_features.zip (81.7MB) |
| Temporal Action Localisation | Valid | Use full split download above | Use full split download above | challenge_action_localisation_valid_annotations.zip (558kB) | action_localisation_valid_video_features.zip (219.2MB) |
| Temporal Action Localisation | Test | Use full split download above | Use full split download above | N/A | action_localisation_test_video_features.zip (131.7MB) |

Temporal Sound Localisation
Challenge link: https://eval.ai/web/challenges/challenge-page/2109/overview

| Task | Split | Videos | Audio | Labels | Audio Features (MMV) |
| --- | --- | --- | --- | --- | --- |
| Temporal Sound Localisation | Train | Use full split download above | Use full split download above | sound_localisation_train_annotations.zip (363kB) | sound_localisation_train_audio_features.zip (109.1MB) |
| Temporal Sound Localisation | Valid | Use full split download above | Use full split download above | challenge_sound_localisation_valid_annotations.zip (552kB) | sound_localisation_valid_audio_features.zip (291.4MB) |
| Temporal Sound Localisation | Test | Use full split download above | Use full split download above | N/A | sound_localisation_test_video_features.zip (177.2MB) |

Multiple-Choice Video QA
Challenge link: https://eval.ai/web/challenges/challenge-page/2091/overview

| Task | Split | Videos | Audio | Labels |
| --- | --- | --- | --- | --- |
| Multiple-Choice Video QA | Train | Use full split download above | Use full split download above | mc_question_train_annotations.zip (85kB) |
| Multiple-Choice Video QA | Valid | Use full split download above | Use full split download above | mc_question_valid_annotations.zip (200kB) |
| Multiple-Choice Video QA | Test | Use full split download above | Use full split download above | mc_question_test_annotations.zip (200kB) |

Grounded Video QA
Challenge link: https://eval.ai/web/challenges/challenge-page/2110/overview

| Task | Split | Videos | Audio | Labels |
| --- | --- | --- | --- | --- |
| Grounded Video QA | Train | grounded_question_train_videos.zip (7.3GB) | grounded_question_train_audios.zip (3.4GB) | grounded_question_train_annotations.zip (6.1MB) |
| Grounded Video QA | Valid | grounded_question_valid_videos.zip (19.3GB) | grounded_question_valid_audios.zip (9.1GB) | grounded_question_valid_annotations.zip (16.8MB) |
| Grounded Video QA | Test | grounded_question_test_videos.zip (11.3GB) | | grounded_question_test_annotations.zip (17.5kB) |

Baselines

In this repo, we provide dummy baselines that demonstrate how to load the data, run the evaluation, and recreate some of the baseline results from the paper. For the other results in the baselines section of the paper, we will be adding a separate external repo.

| Computational task | Baseline | Description |
| --- | --- | --- |
| Single Object Tracking | Static | Static object baseline. |
| Single Point Tracking | Static | Static point baseline. |
| Temporal Action Localisation | ActionFormer | ActionFormer model fine-tuned on Perception Test data. |
| Temporal Sound Localisation | ActionFormer | ActionFormer model fine-tuned on Perception Test data. |
| Multiple-Choice Video QA | Frequency | Frequency baseline using training question/answer pairs. More details are provided in the paper. |
| Grounded Video QA | MDETR + static | MDETR open-vocabulary object detections kept static throughout the video. |
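As a reference point, the static baselines above can be sketched as follows. This is a minimal, unofficial illustration (not the repo's baseline code): the baseline simply repeats the initial annotation for every frame of the video.

```python
def static_object_baseline(initial_box, num_frames):
    """Static object baseline: predict the initial [x1, y1, x2, y2] box,
    unchanged, for every frame of the video."""
    return [list(initial_box) for _ in range(num_frames)]


def static_point_baseline(initial_point, num_frames):
    """Static point baseline: predict the initial [y, x] point,
    unchanged, for every frame of the video."""
    return [list(initial_point) for _ in range(num_frames)]
```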

Metrics

<!-- The [metrics file](https://link) contains the metric code to evaluate the performance for the different tasks. -->
| Computational task | Metric |
| --- | --- |
| Single Object Tracking | Average IoU |
| Single Point Tracking | Average Jaccard |
| Temporal Action Localisation | Mean Average Precision |
| Temporal Sound Localisation | Mean Average Precision |
| Multiple-Choice Video QA | Top-1 Accuracy |
| Grounded Video QA | HOTA |

Metrics code to evaluate performance on the different tasks is coming soon.
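In the meantime, the object-tracking metric can be sketched as below: a minimal, unofficial implementation of average IoU over one track (the official evaluation may differ in details such as occlusion handling).

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_iou(pred_boxes, gt_boxes):
    """Mean IoU over corresponding predicted/ground-truth boxes of a track."""
    return sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)
```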

Perception Test annotations

Explore the annotations: data_visualisation.ipynb

Summary

| Annotation type | Number of videos | Number of annotations |
| --- | --- | --- |
| Object tracks | 11,609 | 189,940 |
| Point tracks | 145 | 8,647 |
| Action segments | 11,353 | 73,503 |
| Sound segments | 11,433 | 137,128 |
| Multiple-choice Questions | 10,361 | 38,060 |
| Grounded video Questions | 3,063 | 6,086 |

Video metadata

| Field Name | Description |
| --- | --- |
| split | The data split the video belongs to; one of ['train','valid','test']. |
| video_id | The ID of the video ['video_xxxx']. |
| frame_rate | The frame rate of the video, in frames per second. |
| num_frames | The total number of frames in the video. |
| resolution | The height and width of the video, in pixels. |
| audio_samples | The total number of audio samples in the video. |
| audio_sample_rate | The sample rate of the audio in the video, in Hz. |
| is_cup_game | Whether the video shows a cup game or not; see the paper for details. |
| is_camera_moving | Whether the camera used to film the video is moving or not. |

Object tracks

| Field Name | Description |
| --- | --- |
| id | A unique annotation ID for each object track. |
| label | The name of the object; can also contain object attributes, e.g. red box. |
| is_occluder | Whether the object occludes other objects in the video (valid only for the cup-game videos). |
| bounding_boxes | The coordinates of the object's bounding box in the format [x1,y1,x2,y2]; shape [n,4], where n is the number of annotated frames. |
| initial_tracking_box | A one-hot vector indicating which box annotation should be used to start the tracking for this object; shape [n]. |
| frame_ids | The IDs of the frames that are annotated, normally 1 per second, e.g. 0, 30, 60, etc.; shape [n]. |
| timestamps | The timestamps of the annotated frames, in μs; shape [n]. |
| is_masked | Whether the object is masked in the annotated frame; corresponds to the bounding boxes, shape [n] (valid only for the cup-game videos). |
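The fields above can be combined directly; for example, selecting the box where tracking should start can be sketched as below. Field names follow the schema above; representing the fields as Python lists is an assumption (the actual files may use arrays).

```python
def initial_tracking_annotation(track):
    """Return the (box, frame_id) flagged by the one-hot
    initial_tracking_box vector of an object track."""
    idx = list(track["initial_tracking_box"]).index(1)
    return track["bounding_boxes"][idx], track["frame_ids"][idx]
```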

Point tracks

| Field Name | Description |
| --- | --- |
| id | A unique annotation ID for each point track. |
| label | The label of the point track. |
| parent_objects | The ID of the object that the point belongs to. |
| frame_ids | The IDs of the frames that are annotated, normally 0, 1, 2, etc.; shape [N], where N is the total number of points in the track. |
| points | The coordinates of the points, in [y,x]; shape [N,2]. |

Action segments

| Field Name | Description |
| --- | --- |
| id | A unique annotation ID for each action segment. |
| label | The templated class of the action segment, e.g. Putting something into something. |
| parent_objects | The IDs of the objects involved in the action; can be empty, single, multiple, or -1 for an object that is not annotated. |
| timestamps | The start and end timestamps of the action segment, in μs: [start time, end time]. |
| frame_ids | The start and end frame IDs of the action segment: [start frame, end frame]. |
| label_id | A unique class ID for each label in the dataset. |

Sound segments

| Field Name | Description |
| --- | --- |
| id | A unique annotation ID for each sound segment. |
| label | The name or class of the sound segment. |
| parent_objects | The object IDs related to this sound segment; can be empty, single, multiple, or -1 for an object that is not annotated. |
| timestamps | The start and end timestamps of the sound segment, in μs: [start time, end time]. |
| frame_ids | The start and end frame IDs of the sound segment: [start frame, end frame]. |
| is_visible | Whether the objects causing the sound in this segment are visible or not; e.g. if an object falls off the table and the impact point with the floor is occluded, then is_visible=False. |
| label_id | A unique class ID for each label in the dataset. |

Multiple-choice video question-answers

| Field Name | Description |
| --- | --- |
| id | A unique annotation ID for each question. |
| question | The text of the question. |
| options | The possible options for the question. There are 3 options, and only one is correct. |
| answer_id | The ID of the correct option for the question. |
| area | The skill area the question pertains to: Memory, Abstraction, Physics, or Semantics. |
| reasoning | The type of reasoning required to answer the question: Descriptive, Explanatory, Predictive, or Counterfactual. |
| tag | The different skills involved in answering the given question. A question can have multiple skill tags. |
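Given per-question predictions, the top-1 accuracy metric over these fields can be sketched as below. This is a minimal, unofficial example; for the challenge, scores are computed by the evaluation server from a submitted results file.

```python
def top1_accuracy(predictions, questions):
    """Fraction of questions whose predicted option ID equals answer_id.
    `predictions` maps question id -> chosen option index (an assumed layout)."""
    correct = sum(1 for q in questions if predictions.get(q["id"]) == q["answer_id"])
    return correct / len(questions)
```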

Grounded video question-answers

| Field Name | Description |
| --- | --- |
| id | A unique annotation ID for each question. |
| question | The text of the question. |
| answers | The answer to the question, given as a list of IDs; these relate to single object tracking annotation IDs, specifically the 'id' field of the corresponding objects in the same video. |
| area | The skill area the question pertains to: Memory, Abstraction, Physics, or Semantics. |
| reasoning | The type of reasoning required to answer the question: Descriptive, Explanatory, Predictive, or Counterfactual. |

Feedback and support

If you have any questions, feedback, or require support regarding the Perception Test dataset or challenge, please contact us at perception-test@google.com.

Citing this work

@inproceedings{patraucean2023perception,
      title={Perception Test: A Diagnostic Benchmark for Multimodal Video Models}, 
      author={Viorica Pătrăucean and Lucas Smaira and Ankush Gupta and Adrià Recasens Continente and Larisa Markeeva and Dylan Banarse and Skanda Koppula and Joseph Heyward and Mateusz Malinowski and Yi Yang and Carl Doersch and Tatiana Matejovicova and Yury Sulsky and Antoine Miech and Alex Frechette and Hanna Klimczak and Raphael Koster and Junlin Zhang and Stephanie Winkler and Yusuf Aytar and Simon Osindero and Dima Damen and Andrew Zisserman and João Carreira},
      booktitle={Advances in Neural Information Processing Systems},
      year={2023},
      url={https://openreview.net/forum?id=HYEGXFnPoq}
}

License and disclaimer

Copyright 2022 DeepMind Technologies Limited

All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0

All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode

Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.

This is not an official Google product.