# Multi-Fiber Networks for Video Recognition
This repository contains the code and trained models for:
Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, Jiashi Feng. "Multi-Fiber Networks for Video Recognition" (PDF).
## Implementation
We use MXNet @92053bd for image classification and PyTorch 0.4.0a0@a83c240 for video classification.
## Normalization
The mean RGB values [124, 117, 104] are subtracted from the inputs, which are then multiplied by 0.0167.
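In code, this normalization amounts to the following (a minimal NumPy sketch; the function name and array layout are illustrative, not the repository's exact pipeline):

```python
import numpy as np

def normalize(frame_rgb):
    """Normalize one RGB frame: subtract the mean, then scale by 0.0167.

    frame_rgb: uint8 array of shape (H, W, 3), channels in RGB order.
    """
    mean_rgb = np.array([124, 117, 104], dtype=np.float32)
    return (frame_rgb.astype(np.float32) - mean_rgb) * 0.0167
```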
## Usage
Train the model from scratch:

```bash
python train_kinetics.py
```

Fine-tune with a pre-trained model:

```bash
python train_ucf101.py
```

or

```bash
python train_hmdb51.py
```

Evaluate the trained model:

```bash
cd test
# the default setting is to test the trained model on UCF-101 (split1)
python evaluate_video.py
```
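For context, video-level accuracy is commonly computed by sampling several clips per video and averaging their class probabilities. Below is a minimal sketch of that idea; the function name and tensor shapes are assumptions, not the exact logic of `evaluate_video.py`:

```python
import torch

def video_score(net, clips):
    """Average softmax scores over multiple clips from one video.

    net:   a trained 3D network returning class logits.
    clips: float tensor of shape (num_clips, C, T, H, W).
    Returns the averaged class probabilities for the whole video.
    """
    net.eval()
    with torch.no_grad():
        probs = torch.softmax(net(clips), dim=1)  # (num_clips, num_classes)
    return probs.mean(dim=0)                      # video-level prediction
```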
## Results
### Image Recognition (ImageNet-1k)
Single Model, Single Crop Validation Accuracy:
Model | Params | FLOPs | Top-1 | Top-5 | MXNet Model |
---|---|---|---|---|---|
ResNet-18 (reproduced) | 11.7 M | 1.8 G | 71.4 % | 90.2 % | GoogleDrive |
ResNet-18 (MF embedded) | 9.6 M | 1.6 G | 74.3 % | 92.1 % | GoogleDrive |
MF-Net (N=16) | 5.8 M | 861 M | 74.6 % | 92.0 % | GoogleDrive |
### Video Recognition (UCF-101, HMDB51, Kinetics)
Model | Params | Target Dataset | Top-1 |
---|---|---|---|
MF-Net (3D) | 8.0 M | Kinetics | 72.8 % |
MF-Net (3D) | 8.0 M | UCF-101 | 96.0 %* |
MF-Net (3D) | 8.0 M | HMDB51 | 74.6 %* |
\* Accuracy averaged over split1, split2, and split3.
## Trained Models
Model | Target Dataset | PyTorch Model |
---|---|---|
MF-Net (2D) | ImageNet-1k | GoogleDrive |
MF-Net (3D) | Kinetics | GoogleDrive |
MF-Net (3D) | UCF-101 (split1) | GoogleDrive |
MF-Net (3D) | HMDB51 (split1) | GoogleDrive |
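A downloaded PyTorch checkpoint can be restored along these lines. This is a hedged sketch: the file name, the checkpoint layout, and the `load_pretrained` helper are assumptions, not this repository's exact API:

```python
import torch

def load_pretrained(net, path):
    """Load a downloaded checkpoint into an MF-Net instance.

    path: e.g. 'MFNet3D_Kinetics.pth' (file name is illustrative).
    Handles both raw state_dicts and {'state_dict': ...} wrappers and
    strips the 'module.' prefix left behind by DataParallel training.
    """
    checkpoint = torch.load(path, map_location='cpu')
    state_dict = checkpoint.get('state_dict', checkpoint)
    state_dict = {(k[len('module.'):] if k.startswith('module.') else k): v
                  for k, v in state_dict.items()}
    net.load_state_dict(state_dict)
    return net
```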
## Other Resources
ImageNet-1k Training/Validation List:
- Download link: GoogleDrive
ImageNet-1k category name mapping table:
- Download link: GoogleDrive
Kinetics Dataset:
- Downloader: GitHub
UCF-101 Dataset:
- Download link: Website
HMDB51 Dataset:
- Download link: Website
## FAQ
Do I need to convert the raw videos to a specific format?
- No. Our `dataiter` supports reading from raw videos directly and can tolerate corrupted videos.
How can I make the training faster?
- Decoding frames from compressed videos consumes a significant amount of CPU resources, which is the main speed bottleneck. You can try converting the downloaded videos to another format or reducing their quality (see the batch-conversion sketch after this list). For example:
```bash
# convert to short_edge_length = 360
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(360*iw)/min(iw\,ih)):-1" -b:v 640k -an ${DST_VID}
# or, convert to short_edge_length = 256
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(256*iw)/min(iw\,ih)):-1" -b:v 512k -an ${DST_VID}
# or, convert to short_edge_length = 160
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(160*iw)/min(iw\,ih)):-1" -b:v 240k -an ${DST_VID}
```
- Move the training to a machine with a more powerful CPU.
- Note that group convolution may not be well optimized, which can also limit speed.
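To apply one of the ffmpeg commands above to an entire dataset, a simple batch-conversion loop can help. This is a sketch, assuming `.mp4` inputs, the 360-pixel setting, and illustrative directory names:

```python
import subprocess
from pathlib import Path

SRC_DIR, DST_DIR = Path('raw_videos'), Path('videos_360')
# Same scale filter as the 360-pixel ffmpeg command above.
SCALE = r"scale=min(iw\,(360*iw)/min(iw\,ih)):-1"

for src in SRC_DIR.rglob('*.mp4'):
    dst = DST_DIR / src.relative_to(SRC_DIR)
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Re-encode with mpeg4 at 640 kbps, dropping the audio stream (-an).
    subprocess.run(['ffmpeg', '-y', '-i', str(src),
                    '-c:v', 'mpeg4', '-filter:v', SCALE,
                    '-b:v', '640k', '-an', str(dst)],
                   check=True)
```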
## Citation
If you use our code or models in your work, or find them helpful, please cite the paper:
```
@inproceedings{chen2018multifiber,
  title={Multi-Fiber Networks for Video Recognition},
  author={Chen, Yunpeng and Kalantidis, Yannis and Li, Jianshu and Yan, Shuicheng and Feng, Jiashi},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2018}
}
```