

Multi-Fiber Networks for Video Recognition

This repository contains the code and trained models of:

Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, Jiashi Feng. "Multi-Fiber Networks for Video Recognition" (PDF).


We use MXNet @92053bd for image classification and PyTorch 0.4.0a0@a83c240 for video classification.


The mean RGB value [ 124, 117, 104 ] is subtracted from the inputs, and the result is then multiplied by 0.0167.
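The normalization described above can be sketched as follows. This is a minimal illustration, not the repository's own preprocessing code: the function name `normalize_rgb` is hypothetical, and it assumes frames are arrays with values in the 0..255 range and RGB channels on the last axis.

```python
import numpy as np

# Values quoted in this README.
MEAN_RGB = np.array([124.0, 117.0, 104.0], dtype=np.float32)
SCALE = 0.0167

def normalize_rgb(frame):
    """Subtract the per-channel mean, then scale.

    Assumption: `frame` has shape (..., 3) with RGB order on the last axis.
    """
    frame = np.asarray(frame, dtype=np.float32)
    return (frame - MEAN_RGB) * SCALE

# Example: a pixel equal to the mean maps to zero in every channel.
pixel = np.array([[[124, 117, 104]]], dtype=np.uint8)
normalized = normalize_rgb(pixel)
```

If the data loader yields channels-first tensors (for example C x T x H x W for video clips), the mean would need to be reshaped accordingly before subtraction.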


Train the motion (video) model from scratch on Kinetics:

python train_kinetics.py

Fine-tune from a pre-trained model:

python train_ucf101.py


python train_hmdb51.py

Evaluate the trained model:

cd test
# by default, the trained model is tested on UCF-101 (split1)
python evaluate_video.py


Image Recognition (ImageNet-1k)

Single Model, Single Crop Validation Accuracy:

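| Model | Params | FLOPs | Top-1 | Top-5 | MXNet Model |
| --- | --- | --- | --- | --- | --- |
| ResNet-18 (reproduced) | 11.7 M | 1.8 G | 71.4 % | 90.2 % | GoogleDrive |
| ResNet-18 (MF embedded) | 9.6 M | 1.6 G | 74.3 % | 92.1 % | GoogleDrive |
| MF-Net (N=16) | 5.8 M | 861 M | 74.6 % | 92.0 % | GoogleDrive |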

Video Recognition (UCF-101, HMDB51, Kinetics)

| Model | Params | Target Dataset | Top-1 |
| --- | --- | --- | --- |
| MF-Net (3D) | 8.0 M | Kinetics | 72.8 % |
| MF-Net (3D) | 8.0 M | UCF-101 | 96.0 %* |
| MF-Net (3D) | 8.0 M | HMDB51 | 74.6 %* |

* Accuracy averaged over split1, split2, and split3.

Trained Models

| Model | Target Dataset | PyTorch Model |
| --- | --- | --- |
| MF-Net (2D) | ImageNet-1k | GoogleDrive |
| MF-Net (3D) | Kinetics | GoogleDrive |
| MF-Net (3D) | UCF-101 (split1) | GoogleDrive |
| MF-Net (3D) | HMDB51 (split1) | GoogleDrive |

Other Resources

ImageNet-1k Training/Validation List:

ImageNet-1k category name mapping table:

Kinetics Dataset:

UCF-101 Dataset:

HMDB51 Dataset:


Do I need to convert the raw videos to specific format?

How can I make the training faster?

# convert to short_edge_length = 360
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(360*iw)/min(iw\,ih)):-1" -b:v 640k -an ${DST_VID}
# or, convert to short_edge_length = 256
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(256*iw)/min(iw\,ih)):-1" -b:v 512k -an ${DST_VID}
# or, convert to short_edge_length = 160
ffmpeg -y -i ${SRC_VID} -c:v mpeg4 -filter:v "scale=min(iw\,(160*iw)/min(iw\,ih)):-1" -b:v 240k -an ${DST_VID}
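To make the scale expression in the commands above easier to follow, here is a small Python sketch. It is an illustration under stated assumptions, not part of this repository: the helper names `scaled_size` and `build_ffmpeg_cmd` are hypothetical, and ffmpeg's own rounding of the `-1` height may differ from Python's `round` by a pixel. The filter sets the width to `min(iw, short_edge * iw / min(iw, ih))`, so the short edge is reduced to the target length and videos that are already smaller are not upscaled.

```python
# Bitrates copied from the three commands above, keyed by short-edge length.
BITRATES = {360: "640k", 256: "512k", 160: "240k"}

def scaled_size(width, height, short_edge):
    """Approximate the output size produced by the scale filter."""
    # Same expression as the filter: never wider than the source.
    new_width = min(width, short_edge * width / min(width, height))
    # A height of -1 tells ffmpeg to preserve the aspect ratio.
    new_height = new_width * height / width
    return round(new_width), round(new_height)

def build_ffmpeg_cmd(src, dst, short_edge=360):
    """Build the argument list for one of the commands above (hypothetical helper)."""
    # The backslash before each comma is kept, as in the shell commands.
    scale = f"scale=min(iw\\,({short_edge}*iw)/min(iw\\,ih)):-1"
    return ["ffmpeg", "-y", "-i", src, "-c:v", "mpeg4",
            "-filter:v", scale, "-b:v", BITRATES[short_edge], "-an", dst]

# A 1280x720 source is reduced to a 360-pixel short edge.
landscape = scaled_size(1280, 720, 360)
# A 320x240 source is already smaller than the target and keeps its size.
small = scaled_size(320, 240, 360)
```

The list returned by `build_ffmpeg_cmd` could be passed to `subprocess.run` to convert a directory of videos in a loop, assuming `ffmpeg` is available on the PATH.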


If you use our code/model in your work or find it helpful, please cite the paper:

  title={Multi-Fiber networks for Video Recognition},
  author={Chen, Yunpeng and Kalantidis, Yannis and Li, Jianshu and Yan, Shuicheng and Feng, Jiashi},
  booktitle={European Conference on Computer Vision (ECCV)},