Awesome
Video Copy Segment Localization (VCSL) dataset and benchmark [CVPR2022]
This is the code for benchmarking segment-level video copy detection approaches with the Video Copy Segment Localization (VCSL) dataset [CVPR2022]. Paper Link.
Updates!!
- 【2023-12-20】: We append 90k labelled video pairs
data/append_202312
to VCSL, the corresponding ISC features can be found indata/vcsl_features.txt
. - 【2023-03-05】: We append 11k labelled video pairs
data/append_202301
to VCSL, the corresponding ISC features can be found indata/vcsl_features.txt
. - 【2023-03-05】: We updated the dataset by removing duplicate video pairs (~1%), the benchmark results are updated accordingly.
- 【2022-11-25】: Our recent work TransVCL was accepted by AAAI2023 (Paper Link, Code Link). TransVCL is a novel network with joint optimization of multiple components for segment-level video copy detection. It achieves the state-of-the-art performance on VCSL benchmark.
- 【2022-07-15】: SOTA frame features (1st Place Solution of the Facebook AI Image Similarity Challenge repo) are extracted and available in
data/vcsl_features.txt
. This feature achieves best performance on VCSL benchmark with even more compact dimension (256d). We recommend to evaluate your video copy localization algorithm with this frame feature (marked as ISC). - 【2022-06-15】: We append 45k labelled copied video pairs
data/append_2022S1
to VCSL (+27% more data compared with original VCSL dataset with 167k copied video pairs). Although small part of video links are not available, we will keep adding labelled video data to VCSL continuously to maintain its large scale. - 【2022-06-15】: We release extracted frame features
data/vcsl_features.txt
(RMAC, ViT, DINO) of all videos in VCSL (9207 in total). Due to the large size of ViSiL feature (400G+), we will not provide its public available link. You could go to visil repo and extract features by yourself. - 【2022-06-15】: SPD codes and the trained VTA models on RMAC, ViSiL, ViT, DINO features are all released in
data/spd_models.txt
. - 【2022-06-15】: We supplement the same amount of negative samples (27765 pairs) to
data/pair_file_test.csv
(55530 pairs in total now) to evaluate the algorithm performance more comprehensively. The metric and benchmark are also updated in the arxiv version of VCSL. - 【2022-03-03】: The paper was accepted by CVPR 2022!
Installation
Requirements
- python 3.6+
- pytorch 1.7+
In order to install PyTorch, follow the official PyTorch guidelines. Other required packages can be installed by:
pip install -r requirements.txt
Pipeline
Video Download
We provide the original video urls in file data/videos_url_uuid.csv
, you can use tools such as youtube-dl or ykdl to download
the videos.
Frame Sampling
We use FFmpeg to extract the frames from a video. A command-line example:
ffmpeg -nostdin -y -vf fps=1 -start_number 0 -q 0 ${video_id}/%05d.jpg
Frames will be sampled at 1FPS and will be saved as JPEG images
with name %05d.jpg
(00000.jpg for example) under directory ${video_id}
Together with the extracted frames for all videos, a csv file should also be generated containing the following info:
Header | Description |
---|---|
uuid | video id |
path | relative path of the directory of frames |
frame_count | number of extracted frames of a video |
An example of the csv file looks like:
uuid,path,frame_count
13fbffb8ccf94766b9066225ca226ca2,13fbffb8ccf94766b9066225ca226ca2,193
7f7a837c5dd548eca00389dc6b870a11,7f7a837c5dd548eca00389dc6b870a11,79
Feature Extraction and Similarity Map Calculation
We provide some handy torch-like Dataset
classes to read the previously extracted frames.
import numpy as np
import pandas as pd
import torch
from torchvision import transforms
from vcsl import VideoFramesDataset
from torch.utils.data import DataLoader
# the frames_all.csv must contain columns described in Frame Sampling
df = pd.read_csv("frames_all.csv")
data_list = df[['uuid', 'path', 'frame_count']].values.tolist()
data_transforms = [
lambda x: x.convert('RGB'),
transforms.Resize((224, 224)),
lambda x: np.array(x)[:, :, ::-1]
]
dataset = VideoFramesDataset(data_list,
id_to_key_fn=VideoFramesDataset.build_image_key,
transforms=data_transforms,
root=args.input_root,
store_type="local")
loader = DataLoader(dataset, collate_fn=lambda x: x,
batch_size=args.batch_size,
num_workers=args.data_workers)
for batch_data in loader:
# batch data: List[Tuple[str, int, np.ndarray]]
video_ids, frame_ids, images = zip(*batch_data)
# for torch models, transform the images to a Tensor
images = torch.Tensor(images) # shape(B, H, W, C)
# preprocess the images if necessary and use your model to do the inference
Features of the same video should be concatenated in temporal order, i.e., a numpy.ndarray
or a torch.Tensor
with shape (N, D)
where N
is the frame number and D
is the feature dimension.
We recommend to put the features together as follows:
├── OUTPUT_ROOT
├── 0000ab50f69044d898ffd71a3d215a81.npy
└── 6445fe9aa1564a3783141ee9f8f56d3c.npy
where each file named as ${video_id}.npy represents a video feature.
You can also directly use the extracted features provided in data/vcsl_features.txt
.
To compute the similarity map between videos, run
bash scripts/test_video_sim.sh
Please remember to correctly set the --input-root
parameter
which is usually the --output-root
from the previous step.
Temporal Alignment
We re-implement some Video Temporal Alignment (VTA) algorithms with some modifications to make them suitable for detecting more than one copied segments between two videos.
For fair comparison, we tune the hyper params of each method with a given feature on the validation set, run the script to start the tuning process and find the best hyper params
bash scripts/test_video_vta_tune.sh
The script first runs run_video_vta_tune.py
to tune the hyper params on the valid set data in a grid search manner
and to evaluate on the val set to find the best hyper params.
Then it runs run_video_vta.py
to run the VTA method with the best hyper params on the test set.
Please refer to the code and comments for tunable parameters of different VTA algorithms.
Evaluation
After the optional parameter tuning step, we evaluate the method on the test set, run the script as:
bash scripts/test_video_vta.sh
then you can see the final metric on the test set in the terminal outputs.
Metric
We provide the following three evaluation metrics:
- Overall segment-level precision/recall performance
- Video-level FRR/FAR performance
- Segment-level macro precision/recall performance on positive samples over query set
We recommend using the first overall metric to reflect segment-level alignment accuracy, while it is also influenced by video-level results. Meanwhile, the second or third metrics can be utilized as an auxiliary metric from the perspective of intuitive video-level or only positive samples.
Benchmark
After executing the above several steps, the overall segment-level precision/recall performance of video copy localization algorithms (with ISC frame feature as example) is indicated below:
Performance | Recall | Precision | Fscore |
---|---|---|---|
HV | 86.84 | 36.48 | 51.37 |
TN | 62.17 | 66.11 | 64.08 |
DP | 49.22 | 60.16 | 54.14 |
DTW | 44.74 | 56.19 | 49.82 |
SPD | 71.19 | 55.44 | 62.34 |
- This table is updated in March 05, 2023, using the updated test set. The parameters used to reproduce the numbers are provided in
data/benchmark_parameters.txt
.
Performance | Recall | Precision | Fscore |
---|---|---|---|
HV | 86.94 | 36.82 | 51.73 |
TN | 62.49 | 66.50 | 64.43 |
DP | 49.56 | 60.63 | 54.53 |
DTW | 45.10 | 56.67 | 50.23 |
SPD | 71.47 | 56.27 | 62.97 |
- This table shows the results on the original test set.
Solution to invalid links
- VCSL is originally constructed in mid-2021 and up to January 2022 around 8% urls are removed by video websites. We will keep adding labelled video data to VCSL continuously (every six months) to maintain its large scale. We hope this will make the dataset valuable and usable for the community in the long term.
- As to result reproducibility and fair comparison across methods, we release the features of all the videos including the invalid ones and other necessary experiment details. Thus, one can reproduce the results reported in the paper. We encourage researchers who use VCSL do what we have done. In this way, following works can compare fairly different approaches by filtering out invalid data at that time.
- For researchers whose interests are in video temporal alignment and other fields that do not depend on raw video data, they can leverage the released features and annotations to develop new techniques on which the invalid videos will have no effect.
Cite VCSL
If the code is helpful for your work, please cite our paper
@inproceedings{he2022large,
title={A Large-scale Comprehensive Dataset and Copy-overlap Aware Evaluation Protocol for Segment-level Video Copy Detection},
author={He, Sifeng and Yang, Xudong and Jiang, Chen and others},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={21086--21095},
year={2022}
}
@inproceedings{jiang2021learning,
title={Learning segment similarity and alignment in large-scale content based video retrieval},
author={Jiang, Chen and He, Sifeng and Yang, Xudong and others},
booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
pages={1618--1626},
year={2021}
}
License
The code is released under MIT license
MIT License
Copyright (c) 2021 Ant Group
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.