MLB-YouTube Dataset
The MLB-YouTube dataset is a new, large-scale dataset consisting of 20 baseball games from the 2017 MLB post-season available on YouTube, with over 42 hours of video footage. Our dataset consists of two components: segmented videos for activity recognition and continuous videos for activity detection. Our dataset is quite challenging as it is created from TV broadcast baseball games, where multiple different activities share the same camera angle. Further, the motion/appearance difference between the various activities is quite small.
Please see our paper for more details on the dataset [arXiv].
If you use our dataset or find the code useful for your research, please cite our paper:
@inproceedings{mlbyoutube2018,
title={Fine-grained Activity Recognition in Baseball Videos},
booktitle={CVPR Workshop on Computer Vision in Sports},
author={AJ Piergiovanni and Michael S. Ryoo},
year={2018}
}
Example Frames from various activities:
NEW! MLB-YouTube Captions
We densely annotated the videos with captions from the commentary given by the announcers, resulting in approximately 50 hours of matching text and video. These captions only roughly describe what is happening in the video, and often contain unrelated stories or commentary on a previous event, making this a challenging task. Examples of the text and video:
For more details see our paper introducing the captions dataset [arXiv].
@article{mlbcaptions2018,
title={Learning Shared Multimodal Embeddings with Unpaired Data},
author={AJ Piergiovanni and Michael S. Ryoo},
journal={arXiv preprint arXiv:1806.08251},
year={2018}
}
Segmented Dataset
Our segmented video dataset consists of 4,290 video clips. Each clip is annotated with the various baseball activities that occur, such as swing, hit, ball, strike, foul, etc. A video clip can contain multiple activities, so we treat this as a multi-label classification task. A full list of the activities and the number of examples of each is shown in the table below.
Activity | # Examples |
---|---|
No Activity | 2983 |
Ball | 1434 |
Strike | 1799 |
Swing | 2506 |
Hit | 1391 |
Foul | 718 |
In Play | 679 |
Bunt | 24 |
Hit by Pitch | 14 |
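Since a clip can contain several of the activities in the table above, the labels are naturally represented as multi-hot vectors over the activity classes. Below is a minimal sketch of building such targets in PyTorch; the `ACTIVITIES` list and the per-clip label format are illustrative assumptions, not the exact annotation schema shipped with the dataset.

```python
import torch

# Illustrative class list; the order here is arbitrary and not tied to the annotation files.
ACTIVITIES = ['ball', 'strike', 'swing', 'hit', 'foul', 'in play', 'bunt', 'hit by pitch']

def multi_hot(clip_labels):
    """Turn the list of activity names for one clip into a multi-hot target vector."""
    target = torch.zeros(len(ACTIVITIES))
    for name in clip_labels:
        target[ACTIVITIES.index(name)] = 1.0
    return target

# Example: a clip containing a swing that results in a hit.
print(multi_hot(['swing', 'hit']))  # tensor([0., 0., 1., 1., 0., 0., 0., 0.])
```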
We additionally annotated each clip containing a pitch with the pitch type (e.g., fastball, curveball, slider, etc.) and the speed of the pitch. We also collected a set of 2,983 hard negative examples where no action occurs. These examples include views of the crowd, the field, or the players standing before or after a pitch occurred. Examples of the activities and hard negatives are shown below:
Strike
<img src="/examples/strike1.gif?raw=true" width="425"> <img src="/examples/strike2.gif?raw=true" width="425">
Ball
<img src="/examples/ball1.gif?raw=true" width="425"> <img src="/examples/ball2.gif?raw=true" width="425">
Swing
<img src="/examples/swing1.gif?raw=true" width="425"> <img src="/examples/swing2.gif?raw=true" width="425">
Hit
<img src="/examples/hit1.gif?raw=true" width="425"> <img src="/examples/hit2.gif?raw=true" width="425">
Foul
<img src="/examples/foul1.gif?raw=true" width="425"> <img src="/examples/foul2.gif?raw=true" width="425">
Bunt
<img src="/examples/bunt1.gif?raw=true" width="425"> <img src="/examples/bunt2.gif?raw=true" width="425">
Hit By Pitch
<img src="/examples/hbp1.gif?raw=true" width="425"> <img src="/examples/hbp2.gif?raw=true" width="425">
Hard-Negative No Activity
<img src="/examples/neg1.gif?raw=true" width="425"> <img src="/examples/neg2.gif?raw=true" width="425">
Continuous Dataset
Our continuous video dataset consists of 2,128 1-2 minute long clips from the videos. We densely annotate each frame of the clip with the baseball activities that occur. Each continuous clip contains an average of 7.2 activities, resulting in a total of over 15,000 activity instances. We evaluate models using per-frame mean average precision (mAP).
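Per-frame mAP averages the per-class average precision, scoring every frame independently. A minimal sketch of the metric, assuming per-frame ground truth and prediction scores are already arranged as `(num_frames, num_classes)` arrays (the toy inputs below are placeholders):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def per_frame_map(gt, scores):
    """gt and scores are (num_frames, num_classes); gt is binary, scores are confidences."""
    aps = []
    for c in range(gt.shape[1]):
        if gt[:, c].sum() > 0:  # skip classes with no positive frames
            aps.append(average_precision_score(gt[:, c], scores[:, c]))
    return float(np.mean(aps))

# Toy example: 4 frames, 2 classes.
gt = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
scores = np.array([[0.9, 0.1], [0.6, 0.3], [0.2, 0.8], [0.1, 0.2]])
print(per_frame_map(gt, scores))
```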
Create the dataset
- Download the YouTube videos by running `python download_videos.py`, which relies on youtube-dl. Change the `save_dir` in the script to where you want the full videos saved.
- To extract the segmented video clips, run `python extract_segmented_videos.py` and change `input_directory` to the directory containing the full videos and `output_directory` to the location to save the extracted clips.
- To extract the continuous video clips, run `python extract_continuous_videos.py` and change `input_directory` to the directory containing the full videos and `output_directory` to the location to save the extracted clips (a rough sketch of the clipping step follows this list).
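The extraction scripts above handle the clipping; the snippet below is only a rough sketch of the underlying step, cutting one segment out of a downloaded game with ffmpeg via Python. The file names and timestamps are placeholders, and in practice the start/end times come from the annotation files.

```python
import subprocess

# Placeholder paths and timestamps for illustration only.
subprocess.run([
    'ffmpeg', '-i', 'full_videos/game.mkv',  # a downloaded full game
    '-ss', '00:12:34.0',                     # clip start time
    '-t', '4.0',                             # clip duration in seconds
    '-c', 'copy',                            # copy streams without re-encoding
    'clips/example_clip.mp4',
])
```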
Baseline Experiments
We compared many approaches using I3D [1] and InceptionV3 [2] as feature extractors.
Segmented Video Activity Recognition Results
Please see our paper for more experimental details and results.
Results from multi-label video classification:
Method | mAP (%) |
---|---|
Random | 16.3 |
I3D + max-pool | 57.2 |
I3D + pyramid pooling | 58.7 |
I3D + LSTM | 53.1 |
I3D + temporal conv | 58.4 |
I3D + sub-events [3] | 61.3 |
InceptionV3 + max-pool | 54.4 |
InceptionV3 + pyramid pooling | 55.3 |
InceptionV3 + LSTM | 57.7 |
InceptionV3 + temporal conv | 56.1 |
InceptionV3 + sub-events [3] | 62.6 |
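As a concrete reference point for the pooling baselines, here is a minimal PyTorch sketch of the max-pool variant: per-segment features (e.g., from I3D or InceptionV3) are max-pooled over time and fed to a linear multi-label classifier trained with a sigmoid/BCE loss. The feature dimension and class count below are placeholders, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class MaxPoolClassifier(nn.Module):
    """Max-pool per-timestep features over time, then a linear multi-label classifier."""
    def __init__(self, feat_dim=1024, num_classes=8):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):            # feats: (batch, time, feat_dim)
        pooled, _ = feats.max(dim=1)     # (batch, feat_dim)
        return self.fc(pooled)           # logits; pair with BCEWithLogitsLoss

model = MaxPoolClassifier()
logits = model(torch.randn(2, 16, 1024))                  # 2 clips, 16 timesteps of features
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(2, 8))  # dummy all-negative targets
```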
Pitch Speed Regression:
Method | RMSE |
---|---|
I3D | 4.3 mph |
I3D + LSTM | 4.1 mph |
I3D + sub-events [3] | 3.9 mph |
InceptionV3 | 5.3 mph |
InceptionV3 + LSTM | 4.5 mph |
InceptionV3 + sub-events [3] | 3.6 mph |
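Pitch speed is a single scalar output rather than a set of class probabilities. Below is a minimal sketch of a regression head over pooled features together with the RMSE metric reported above; the shapes and the pooling choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeedRegressor(nn.Module):
    """Mean-pool features over time and regress a single pitch speed (mph)."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1)

    def forward(self, feats):                          # feats: (batch, time, feat_dim)
        return self.fc(feats.mean(dim=1)).squeeze(1)   # (batch,) predicted speeds

def rmse(pred, target):
    return torch.sqrt(((pred - target) ** 2).mean())

model = SpeedRegressor()
pred = model(torch.randn(4, 16, 1024))
print(rmse(pred, torch.tensor([92.0, 88.0, 95.0, 84.0])))
```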
Continuous Video Activity Detection
Method | mAP (%) |
---|---|
Random | 13.4 |
InceptionV3 | 31.9 |
InceptionV3 + max-pool | 35.2 |
InceptionV3 + pyramid pooling | 36.8 |
InceptionV3 + LSTM | 34.1 |
InceptionV3 + temporal conv | 33.4 |
InceptionV3 + sub-events [3] | 37.3 |
InceptionV3 + super-events [4] | 39.6 |
InceptionV3 + sub+super-events | 40.9 |
InceptionV3 + TGM [5] | 37.4 |
InceptionV3 + 3 TGM [5] | 38.2 |
InceptionV3 + super-event [4] + 3 TGM [5] | 42.9 |
I3D | 34.2 |
I3D + max-pool | 36.8 |
I3D + pyramid pooling | 39.7 |
I3D + LSTM | 39.4 |
I3D + temporal conv | 39.2 |
I3D + sub-events [3] | 38.5 |
I3D + super-events [4] | 39.1 |
I3D + sub+super-events | 40.4 |
I3D + TGM [5] | 38.5 |
I3D + 3 TGM [5] | 40.1 |
I3D + super-event [4] + 3 TGM [5] | 47.1 |
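In the continuous setting the model outputs a prediction for every frame. As one example of how that can be done, the temporal-convolution baseline slides a 1D convolution over the per-frame feature sequence; the sketch below is a minimal illustration with placeholder dimensions, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class TemporalConvDetector(nn.Module):
    """1D convolution over per-frame features, producing per-frame class logits."""
    def __init__(self, feat_dim=1024, num_classes=8, kernel_size=9):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, num_classes, kernel_size, padding=kernel_size // 2)

    def forward(self, feats):                # feats: (batch, time, feat_dim)
        x = feats.transpose(1, 2)            # (batch, feat_dim, time) for Conv1d
        return self.conv(x).transpose(1, 2)  # (batch, time, num_classes) logits

model = TemporalConvDetector()
logits = model(torch.randn(1, 300, 1024))    # e.g., a 300-frame feature sequence
print(logits.shape)                          # torch.Size([1, 300, 8])
```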
Experiments
We provide our code to train and evaluate the models in the experiments directory. The various models are implemented in models.py, along with a script to load the dataset and a script to train the models.
We also include our PyTorch implementation of I3D, see pytorch-i3d for more details.
Requirements
- youtube-dl to download the videos
- ffmpeg (tested with 2.8.11) to extract clips
- PyTorch (tested with version 0.3.1)
References
[1] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [arxiv] [code]
[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[3] A. Piergiovanni, C. Fan, and M. S. Ryoo. Learning latent sub-events in activity videos using temporal attention filters. In Proceedings of the American Association for Artificial Intelligence (AAAI), 2017. [arxiv] [code]
[4] A. Piergiovanni and M. S. Ryoo. Learning latent super-events to detect multiple activities in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. [arxiv] [code]
[5] A. Piergiovanni and M. S. Ryoo. Temporal Gaussian Mixture Layer for Videos. arXiv preprint arXiv:1803.06316, 2018. [arxiv]