UniFormerV2

arXiv · keras-2.12 · Open In Colab · Hugging Face

UniFormerV2 is a generic paradigm for building a powerful family of video networks by arming pre-trained ViTs with efficient UniFormer designs. It achieves state-of-the-art recognition performance on 8 popular video benchmarks, including scene-related Kinetics-400/600/700 and Moments in Time, temporal-related Something-Something V1/V2, and untrimmed ActivityNet and HACS. In particular, it is the first model to achieve 90% top-1 accuracy on Kinetics-400.

This is an unofficial Keras implementation of UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer. The official PyTorch code is here.

News

Install

git clone https://github.com/innat/UniFormerV2.git
cd UniFormerV2
pip install -e . 

Usage

The UniFormerV2 checkpoints are available in both SavedModel and H5 formats, covering a total of 8 datasets: Kinetics-400/600/700/710, Something-Something V2, Moments in Time V1, ActivityNet, and HACS. The model comes in base and large variants, and each variant may come in multiple configurations with different input sizes and numbers of input frames, giving around 35 checkpoints in total. Check the release and Model Zoo pages for details, and see model_configs.py for an overview of the available model configs. Some highlights follow.
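
To see which variants can be instantiated by name, the config module can be inspected directly. A minimal sketch, assuming model_configs.py exposes a dictionary of configs (here called MODEL_CONFIGS; check the file for the actual attribute name):

# List the available UniFormerV2 checkpoint names and their configs.
# MODEL_CONFIGS is an assumed attribute; see model_configs.py.
from model_configs import MODEL_CONFIGS

for name, config in MODEL_CONFIGS.items():
    print(name, config)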

Inference

import numpy as np
import tensorflow as tf
from uniformerv2 import UniFormerV2

>>> model = UniFormerV2(name='K400_B16_8x224')
>>> model.load_weights('TFUniFormerV2_K400_B16_8x224.h5')
>>> container = read_video('sample.mp4')               # repo utility
>>> frames = frame_sampling(container, num_frames=8)   # repo utility
>>> y = model(frames)
>>> y.shape
TensorShape([1, 400])

>>> # label_map_inv maps class index -> Kinetics-400 label name
>>> probabilities = tf.nn.softmax(y)
>>> probabilities = probabilities.numpy().squeeze(0)
>>> confidences = {
...     label_map_inv[i]: float(probabilities[i])
...     for i in np.argsort(probabilities)[::-1]
... }
>>> confidences
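
The read_video and frame_sampling helpers above come from this repository's utilities. Below is a minimal sketch of what they do, using OpenCV and uniform temporal sampling; the exact signatures and preprocessing are assumptions, so check the repo utilities for the actual implementation.

import cv2
import numpy as np
import tensorflow as tf

def read_video(path):
    # Decode every frame of the clip into a (T, H, W, 3) RGB array.
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)

def frame_sampling(container, num_frames=8, size=224):
    # Uniformly sample `num_frames` frames, resize to the model's input
    # resolution, scale to [0, 1], and add a batch axis: (1, T, H, W, 3).
    indices = np.linspace(0, len(container) - 1, num_frames).astype(int)
    frames = tf.image.resize(container[indices], (size, size))
    frames = tf.cast(frames, tf.float32) / 255.0
    return frames[tf.newaxis, ...]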

Classification results on a sample from Kinetics-400:

Top-5 predictions:

{
 'playing cello': 0.9992249011,
 'playing violin': 0.00016990336,
 'playing clarinet': 6.66150512e-05,
 'playing harp': 4.858616014e-05,
 'playing bass guitar': 2.0927140212e-05
}

Fine Tune

Each UniFormerV2 checkpoint returns logits, so we can simply add a custom classifier on top. A sample is shown below; see the Colab notebook linked above for more details.

from tensorflow import keras
from tensorflow.keras import layers
from uniformerv2 import UniFormerV2

# Load a pre-trained model, e.g. the ActivityNet large variant,
# and freeze it as a feature extractor.
model_name = 'ANET_L14_16x224'
uniformer_v2 = UniFormerV2(name=model_name)
uniformer_v2.load_weights(f'TFUniFormerV2_{model_name}.h5')
uniformer_v2.trainable = False

# Downstream model: a linear classifier on top of the frozen backbone.
# `class_folders` holds the target class names; dtype='float32' keeps
# the output head in full precision if mixed precision is enabled.
model = keras.Sequential([
    uniformer_v2,
    layers.Dense(
        len(class_folders), dtype='float32', activation=None
    )
])
model.compile(...)
model.fit(...)
model.predict(...)
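
For completeness, a minimal version of the elided compile/fit calls could look like the following. Here train_ds is a hypothetical tf.data pipeline yielding (video, label) batches; since the classifier head outputs raw logits, the loss is configured with from_logits=True.

# Sketch only: train_ds is an assumed tf.data.Dataset of (video, label)
# batches; tune the optimizer, loss, and epochs for your task.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
model.fit(train_ds, epochs=5)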

Model Zoo

The UniFormerV2 checkpoints are listed in MODEL_ZOO.md.

TODO

Citation

If you use this UniFormerV2 implementation in your research, please cite it using the metadata from our CITATION.cff file.

@misc{li2022uniformerv2,
      title={UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer}, 
      author={Kunchang Li and Yali Wang and Yinan He and Yizhuo Li and Yi Wang and Limin Wang and Yu Qiao},
      year={2022},
      eprint={2211.09552},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}