UniFormerV2

arXiv · keras-2.12 · Open In Colab · Hugging Face

UniFormerV2 is a generic paradigm for building a powerful family of video networks by arming pre-trained ViTs with efficient UniFormer designs. It achieves state-of-the-art recognition performance on 8 popular video benchmarks, including scene-related Kinetics-400/600/700 and Moments in Time, temporal-related Something-Something V1/V2, and untrimmed ActivityNet and HACS. In particular, it is the first model to achieve 90% top-1 accuracy on Kinetics-400.

This is an unofficial Keras implementation of UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer. The official PyTorch code is here.

News

Install

git clone https://github.com/innat/UniFormerV2.git
cd UniFormerV2
pip install -e . 

Usage

The UniFormerV2 checkpoints are available in both SavedModel and H5 formats, covering a total of 8 datasets: Kinetics-400/600/700/710, Something-Something V2, Moments in Time V1, ActivityNet, and HACS. The model comes in base and large variants, and each variant may come in multiple configurations with different input sizes and numbers of input frames, giving around 35 checkpoints in total. Check the release and Model Zoo pages for details, and see model_configs.py for an overview of the available model configs. Some highlights follow.
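
To see which variants can be instantiated by name, the config module can be inspected directly. A minimal sketch, assuming model_configs.py exposes a dictionary of configs (here called MODEL_CONFIGS; check the file for the actual attribute name):

# List the available UniFormerV2 checkpoint names and their configs.
# MODEL_CONFIGS is an assumed attribute; see model_configs.py.
from model_configs import MODEL_CONFIGS

for name, config in MODEL_CONFIGS.items():
    print(name, config)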

Inference

import numpy as np
import tensorflow as tf
from uniformerv2 import UniFormerV2

>>> model = UniFormerV2(name='K400_B16_8x224')
>>> model.load_weights('TFUniFormerV2_K400_B16_8x224.h5')
>>> container = read_video('sample.mp4')               # repo utility
>>> frames = frame_sampling(container, num_frames=8)   # repo utility
>>> y = model(frames)
>>> y.shape
TensorShape([1, 400])

>>> # label_map_inv maps class index -> Kinetics-400 label name
>>> probabilities = tf.nn.softmax(y)
>>> probabilities = probabilities.numpy().squeeze(0)
>>> confidences = {
...     label_map_inv[i]: float(probabilities[i])
...     for i in np.argsort(probabilities)[::-1]
... }
>>> confidences
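
The read_video and frame_sampling helpers above come from this repository's utilities. Below is a minimal sketch of what they do, using OpenCV and uniform temporal sampling; the exact signatures and preprocessing are assumptions, so check the repo utilities for the actual implementation.

import cv2
import numpy as np
import tensorflow as tf

def read_video(path):
    # Decode every frame of the clip into a (T, H, W, 3) RGB array.
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)

def frame_sampling(container, num_frames=8, size=224):
    # Uniformly sample `num_frames` frames, resize to the model's input
    # resolution, scale to [0, 1], and add a batch axis: (1, T, H, W, 3).
    indices = np.linspace(0, len(container) - 1, num_frames).astype(int)
    frames = tf.image.resize(container[indices], (size, size))
    frames = tf.cast(frames, tf.float32) / 255.0
    return frames[tf.newaxis, ...]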

Classification results on a sample from Kinetics-400:

Top-5 predictions:

{
 'playing cello': 0.9992249011,
 'playing violin': 0.00016990336,
 'playing clarinet': 6.66150512e-05,
 'playing harp': 4.858616014e-05,
 'playing bass guitar': 2.0927140212e-05
}

Fine Tune

Each UniFormerV2 checkpoint returns logits, so we can simply add a custom classifier on top. A sample is shown below; see the Colab notebook linked above for more details.

from tensorflow import keras
from tensorflow.keras import layers
from uniformerv2 import UniFormerV2

# Load a pre-trained model, e.g. the ActivityNet large variant,
# and freeze it as a feature extractor.
model_name = 'ANET_L14_16x224'
uniformer_v2 = UniFormerV2(name=model_name)
uniformer_v2.load_weights(f'TFUniFormerV2_{model_name}.h5')
uniformer_v2.trainable = False

# Downstream model: a linear classifier on top of the frozen backbone.
# `class_folders` holds the target class names; dtype='float32' keeps
# the output head in full precision if mixed precision is enabled.
model = keras.Sequential([
    uniformer_v2,
    layers.Dense(
        len(class_folders), dtype='float32', activation=None
    )
])
model.compile(...)
model.fit(...)
model.predict(...)
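
For completeness, a minimal version of the elided compile/fit calls could look like the following. Here train_ds is a hypothetical tf.data pipeline yielding (video, label) batches; since the classifier head outputs raw logits, the loss is configured with from_logits=True.

# Sketch only: train_ds is an assumed tf.data.Dataset of (video, label)
# batches; tune the optimizer, loss, and epochs for your task.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
model.fit(train_ds, epochs=5)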

Model Zoo

The UniFormerV2 checkpoints are listed in MODEL_ZOO.md.

TODO

Citation

If you use this UniFormerV2 implementation in your research, please cite it using the metadata from our CITATION.cff file.

@misc{li2022uniformerv2,
      title={UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer}, 
      author={Kunchang Li and Yali Wang and Yinan He and Yizhuo Li and Yi Wang and Limin Wang and Yu Qiao},
      year={2022},
      eprint={2211.09552},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}