Awesome
Spotting Temporally Precise, Fine-Grained Events in Video
This repository contains code for our paper:
Spotting Temporally Precise, Fine-Grained Events in Video
In ECCV 2022
James Hong, Haotian Zhang, Michael Gharbi, Matthew Fisher, Kayvon Fatahalian
Link: project website
@inproceedings{precisespotting_eccv22,
author={Hong, James and Zhang, Haotian and Gharbi, Micha\"{e}l and Fisher, Matthew and Fatahalian, Kayvon},
title={Spotting Temporally Precise, Fine-Grained Events in Video},
booktitle={ECCV},
year={2022}
}
This code is released under the BSD-3 LICENSE.
Overview
Our paper presents a study of temporal event detection (spotting) in video at the precision of a single or small (e.g., 1-2) tolerance of frames. This is a useful task for annotating video for analysis and synthesis when temporal precision is important, and we demonstrate fine-grained events in several sports as examples.
In this regime, the most crucial design aspect is end-to-end learning of spatial-temporal features from the pixels. We present a surprisingly strong, compact, and end-to-end learned baseline that is conceptually simpler than the two-phase architectures common in the temporal action detection, segmentation, and spotting literature.
Environment
The code is tested in Linux (Ubuntu 16.04 and 20.04) with the dependency versions in requirements.txt
. Similar environments are likely to work also but YMMV.
Datasets
Refer to the READMEs in the data directory for pre-processing and setup instructions.
Basic usage
To train a model, use python3 train_e2e.py <dataset_name> <frame_dir> -s <save_dir> -m <model_arch>
.
<dataset_name>
: supports tennis, fs_comp, fs_perf, finediving, finegym, soccernetv2, soccernet_ball<frame_dir>
: path to the extracted frames<save_dir>
: path to save logs, checkpoints, and inference results<model_arch>
: feature extractor architecture (e.g., RegNet-Y 200MF w/ GSM :rny002_gsm
)
Training will produce checkpoints, predictions for the val
split, and predictions for the test
split on the best validation epoch.
To evaluate a set of predictions with the mean-AP metric, use python3 eval.py -s <split> <model_dir_or_prediction_file>
.
<model_dir_or_prediction_file>
: can be the saved directory of a model containing predictions or a path to a prediction file.
The predictions are saved as either pred-{split}.{epoch}.recall.json.gz
or pred-{split}.{epoch}.json
files. The latter contains only the top class predictions for each frame, omitting all background, while the former contains all non-background detections above a low threshold, to complete the precision-recall curve.
We also save per-frame scores, pred-{split}.{epoch}.score.json.gz
, which can be used to combine predictions from multiple models (see eval_ensemble.py
).
Trained models
Models and configurations can be found at https://github.com/jhong93/e2e-spot-models/. Place the checkpoint file and config.json file in the same directory.
To perform inference with an already trained model, use python3 test_e2e.py <model_dir> <frame_dir> -s <split> --save
. This will save the predictions in the model directory, using the default file naming scheme.
Baselines
Implementations for several baselines in the paper are in baseline,py
. TSP and 2D-VPD features are available in our Google Drive.
Using your own data
Each dataset has plaintext files that contain the list of classes and events in each video.
class.txt
This is a list of the class names, one per line.
{split}.json
This file contains entries for each video and its contained events.
[
{
"video": VIDEO_ID,
"num_frames": 4325, // Video length
"num_events": 10,
"events": [
{
"frame": 525, // Frame
"label": CLASS_NAME, // Event class
"comment": "" // Optional comments
},
...
],
"fps": 25,
"width": 1920, // Metadata about the source video
"height": 1080
},
...
]
Frame directory
We assume pre-extracted frames (either RGB in jpg format or optical flow), that have been resized to 224 pixels high or similar. The organization of the frames is expected to be <frame_dir>/<video_id>/<frame_number>.jpg
. For example,
video1/
├─ 000000.jpg
├─ 000001.jpg
├─ 000002.jpg
├─ ...
video2/
├─ 000000.jpg
├─ ...
Prediction file format
Predictions are formatted similarly to the labels:
[
{
"video": VIDEO_ID,
"events": [
{
"frame": 525, // Frame
"label": CLASS_NAME, // Event class
"score": 0.96142578125
},
...
],
"fps": 25 // Metadata about the source video
},
...
]