Continual 3D Convolutional Neural Networks

<div align="center"> <video src='https://github.com/LukasHedegaard/co3d/raw/c5891aaf6f76bb8bda0ddef238dd7a0feb1afc38/presentation/5612.mp4' width=512/> </div>

Continual 3D Convolutional Neural Networks (Co3D CNNs) are a novel computational formulation of spatio-temporal 3D CNNs, in which videos are processed frame-by-frame rather than by clip.

In online processing tasks that demand frame-wise predictions, Co3D CNNs dispense with the computational redundancies of regular 3D CNNs, namely the repeated convolutions over frames that appear in multiple overlapping clips.

Co3D CNNs are weight-compatible with regular 3D CNNs, do not need further training, and reduce the floating point operations for frame-wise computations by more than an order of magnitude!
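To make the redundancy concrete, the sketch below counts how many frames pass through the network when a clip-based model is slid over a stream one frame at a time, compared with processing each frame once. The stream length and clip length are illustrative values, not figures from the paper, and the count ignores everything except how often a frame enters the network.

```python
def frame_visits_clipwise(num_frames: int, clip_len: int) -> int:
    """Frames fed through the network when one full clip is processed per step."""
    # One prediction per new frame once the first clip has been filled.
    num_clips = num_frames - clip_len + 1
    return num_clips * clip_len


def frame_visits_continual(num_frames: int) -> int:
    """Frames fed through the network when every frame is processed exactly once."""
    return num_frames


if __name__ == "__main__":
    n, t = 100, 16  # hypothetical stream length and clip length
    print(frame_visits_clipwise(n, t), frame_visits_continual(n))
```

Under these assumptions the clip-based count grows with the clip length, while the frame-by-frame count does not.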

Principle

<div align="center"> <img src="figures/coconv.png" width="500"> <br> Continual Convolution. An input (green d or e) is convolved with a kernel (blue α, β). The intermediary feature-maps corresponding to all but the last temporal position are stored, while the last feature map and prior memory are summed to produce the resulting output. For a continual stream of inputs, Continual Convolutions produce identical outputs to regular convolutions. </div>
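The caption above can be illustrated with a minimal, dependency-free sketch of a purely temporal convolution over scalar "frames". The class name `ContinualConv` and its `step` method are hypothetical names chosen for this example and are not the API of this repository; the sketch only shows the caching of partial sums and the equivalence with a regular convolution.

```python
def regular_conv(frames, kernel):
    """Temporal convolution (cross-correlation) over a full sequence, no padding."""
    k = len(kernel)
    return [
        sum(kernel[j] * frames[i + j] for j in range(k))
        for i in range(len(frames) - k + 1)
    ]


class ContinualConv:
    """Processes one frame per step and caches partial sums for future outputs."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # memory[i] is the partial sum of the output that is i + 1 steps ahead.
        self.memory = [0] * (len(self.kernel) - 1)
        self.seen = 0

    def step(self, frame):
        k = len(self.kernel)
        # The new frame is multiplied with every temporal kernel position once.
        products = [w * frame for w in self.kernel]
        padded = self.memory + [0]
        # The last kernel position completes the oldest partial sum.
        output = padded[0] + products[-1]
        # All other products are stored, shifted one step closer to completion.
        self.memory = [padded[i + 1] + products[k - 2 - i] for i in range(k - 1)]
        self.seen += 1
        # Outputs without full temporal context are withheld.
        return output if self.seen >= k else None


if __name__ == "__main__":
    frames, kernel = [1, 2, 3, 4, 5], [1, 2, 3]
    layer = ContinualConv(kernel)
    streamed = [y for y in (layer.step(x) for x in frames) if y is not None]
    print(streamed == regular_conv(frames, kernel))
```

Each frame is multiplied with the kernel only once, whereas the regular convolution would revisit it for every clip in which it appears.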

Results

<div align="center"> <img src="figures/acc-vs-flops.png" width="500"> <br> Accuracy/complexity trade-off for Continual X3D (CoX3D) and recent state-of-the-art 3D CNNs on Kinetics-400 using 1-clip/frame testing. For regular 3D CNNs, the FLOPs per clip ■ are noted, while the FLOPs per frame ● are shown for the Continual 3D CNNs. The CoX3D models used the weights from the X3D models without further fine-tuning. The global average pool size for the network is noted in each point. The diagonal and vertical arrows indicate respectively a transfer from regular to Continual 3D CNN and an extension of receptive field. <br> <br> <img src="figures/results.png"> <br> Benchmark of state-of-the-art methods on Kinetics-400. The noted accuracy is the single clip or frame top-1 score using RGB as the only input-modality. The performance was evaluated using publicly available pre-trained models without any further fine-tuning. For throughput comparison, evaluations per second denote frames per second for the CoX3D models and clips per second for the remaining models. Throughput results are the mean +/- std of 100 measurements. Pareto-optimal models are marked in bold. Mem. is the maximum allocated memory during inference noted in megabytes. </div>
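As a rough illustration of how a "mean +/- std of 100 measurements" figure can be obtained, the following sketch times a callable repeatedly and reports evaluations per second. It is not the profiling code used for the table above; the function name `measure_throughput` and the dummy workload are assumptions made for this example.

```python
import statistics
import time


def measure_throughput(fn, repeats=100):
    """Return (mean, std) of evaluations per second over `repeats` timed calls."""
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        # Guard against a zero reading on timers with coarse resolution.
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates.append(1.0 / elapsed)
    return statistics.mean(rates), statistics.stdev(rates)


if __name__ == "__main__":
    # Placeholder workload standing in for a single model evaluation.
    mean, std = measure_throughput(lambda: sum(range(10_000)))
    print(f"{mean:.1f} +/- {std:.1f} evaluations/s")
```

For a continual model, `fn` would process one frame; for a clip-based model, one clip.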

Setup

  1. Clone the project code

    git clone https://github.com/LukasHedegaard/co3d
    cd co3d
    
  2. Create and activate conda environment (optional)

    conda create --name co3d python=3.8
    conda activate co3d
    
  3. Install Python dependencies

    pip install -e .[dev]
    
  4. Install FFMPEG and UNRAR

  5. Fill in the information on your dataset folder path in the .env file:

    DATASETS_PATH=/path/to/datasets
    LOGS_PATH=/path/to/logs
    CACHE_PATH=.cache
    
  6. Download dataset using these instructions
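Before running anything, it can help to check that the `.env` file from step 5 contains the expected keys. The helper below is a standalone sketch using only the standard library; `read_env_file` is a hypothetical name and is not how the project itself loads the file.

```python
from pathlib import Path


def read_env_file(path):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        env = read_env_file(env_path)
        # Keys taken from the example in step 5.
        for key in ("DATASETS_PATH", "LOGS_PATH", "CACHE_PATH"):
            print(key, "->", env.get(key, "<missing>"))
```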

Models

CoX3D

CoX3D is the Continual-CNN implementation of X3D. In contrast to regular 3D CNNs, which take a whole video clip as input, Continual CNNs operate frame-by-frame and can thus speed up computation by a significant margin.

CoSlow

CoSlow is the Continual-CNN implementation of Slow.

CoI3D

CoI3D is the Continual-CNN implementation of I3D.

X3D

X3D [ArXiv, Repo] is a family of 3D variants of the EfficientNet architecture, which produce state-of-the-art results for lightweight human activity recognition.

R(2+1)D

R(2+1)D [ArXiv, Repo] is a CNN for activity recognition, which separates the 3D convolution into a spatial 2D convolution and a temporal 1D convolution in order to reduce the number of parameters and increase the network efficiency.
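The effect of the factorisation on the weight count can be sketched with simple arithmetic. The channel widths, kernel sizes, and in particular the intermediate channel count `c_mid` below are assumptions for illustration; the resulting ratio depends directly on how `c_mid` is chosen, and biases are ignored.

```python
def conv3d_params(c_in, c_out, t, d):
    """Weights in a full t x d x d 3D convolution."""
    return c_in * c_out * t * d * d


def conv2plus1d_params(c_in, c_mid, c_out, t, d):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    spatial = c_in * c_mid * d * d
    temporal = c_mid * c_out * t
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical layer: 64 channels throughout, 3 x 3 x 3 kernel.
    print(conv3d_params(64, 64, 3, 3))           # full 3D convolution
    print(conv2plus1d_params(64, 64, 64, 3, 3))  # factorised (2+1)D block
```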

I3D

I3D [ArXiv, Repo] is a 3D CNN for activity recognition, proposed to "inflate" the weights from a 2D CNN pretrained on ImageNet in the initialisation of the 3D CNN, thereby improving accuracy and reducing training time.

The implementation here is a port of the one found in the SlowFast Repo.
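The "inflation" idea can be shown on a single kernel. The sketch below assumes the commonly described scheme of repeating the 2D weights along a new temporal axis and dividing by the temporal size, so that a video of identical frames gives the same response as the 2D filter gives on one image. It uses plain nested lists and is not taken from this repository's code.

```python
def inflate_kernel(kernel_2d, time_dim):
    """Repeat a 2D kernel `time_dim` times along time and rescale by 1 / time_dim."""
    return [
        [[w / time_dim for w in row] for row in kernel_2d]
        for _ in range(time_dim)
    ]


def response_2d(kernel, patch):
    """Response of a 2D filter on an image patch of the same size."""
    return sum(w * x for k_row, p_row in zip(kernel, patch) for w, x in zip(k_row, p_row))


def response_3d(kernel_3d, patches):
    """Response of a 3D filter on a stack of patches, one patch per time step."""
    return sum(response_2d(k, p) for k, p in zip(kernel_3d, patches))


if __name__ == "__main__":
    kernel = [[1.0, 2.0], [3.0, 4.0]]
    patch = [[1.0, 0.0], [0.0, 1.0]]
    inflated = inflate_kernel(kernel, time_dim=4)
    # A "boring" video made of four identical frames.
    print(response_2d(kernel, patch), response_3d(inflated, [patch] * 4))
```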

SlowFast

SlowFast [ArXiv, Repo] is a two-stream 3D CNN architecture for video recognition. The structure includes two pathways, with one pathway operating at a slower frame rate than the other.
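The difference in frame rate between the two pathways can be sketched as two strided samplings of the same raw clip. The parameter names `tau` (temporal stride of the slow pathway) and `alpha` (frame-rate ratio between the pathways), as well as the example values, are assumptions for illustration and are not read from this repository's configuration.

```python
def sample_pathways(num_frames, tau, alpha):
    """Return the frame indices seen by the slow and the fast pathway."""
    if tau % alpha != 0:
        raise ValueError("tau must be divisible by alpha in this sketch")
    slow = list(range(0, num_frames, tau))           # sparse temporal sampling
    fast = list(range(0, num_frames, tau // alpha))  # alpha times denser sampling
    return slow, fast


if __name__ == "__main__":
    slow, fast = sample_pathways(num_frames=64, tau=16, alpha=8)
    print(len(slow), len(fast))
```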

Slow

Slow is the "slow" branch of the SlowFast network [ArXiv, Repo].

Usage

The project code is written in PyTorch and uses Ride to provide implementations of training, evaluation, and benchmarking methods. A plethora of usage options are available, which are best explored in the Ride docs or the command-line help, e.g.:

    python models/cox3d/main.py --help

This repository contains the implementations of Continual X3D (CoX3D), as well as a number of 3D-CNN baselines.

Each model has its own folder with a self-contained implementation, scripts, weight download utilities, hparams and profiling results. Overview tables for scripts used to download weights, run the model test-sequences, and throughput benchmarks are found below:

Download weights

| Model | Dataset | Download |
| --- | --- | --- |
| I3D-R50 | Kinetics | download |
| R(2+1)D-18 | Kinetics | download |
| SlowFast-8x8 | Kinetics | download |
| SlowFast-4x16 | Kinetics | download |
| Slow-8x8 | Kinetics | download |
| (Co)X3D-XS | Kinetics | download |
| (Co)X3D-S | Kinetics | download |
| (Co)X3D-M | Kinetics | download |
| (Co)X3D-L | Kinetics | download |
| (Co)Slow-8x8 | Charades | download |

Evaluate on Kinetics400

Evaluate the 1-clip accuracy of pretrained models. The scripts should be executed from project root.

| Model | Script |
| --- | --- |
| I3D-R50 | ./models/i3d/scripts/test/kinetics400.sh |
| R(2+1)D-18 | ./models/r2plus1d/scripts/test/kinetics400.sh |
| SlowFast | ./models/slowfast/scripts/test/kinetics400.sh |
| Slow | ./models/slow/scripts/test/kinetics400.sh |
| X3D | ./models/x3d/scripts/test/kinetics400.sh |
| CoX3D | ./models/cox3d/scripts/test/kinetics400.sh |
| CoSlow | ./models/coslow/scripts/test/kinetics400.sh |
| CoI3D | ./models/coi3d/scripts/test/kinetics400.sh |

Evaluate on Charades

Evaluate the 1-clip accuracy of pretrained models. The scripts should be executed from project root.

| Model | Script |
| --- | --- |
| (Co)Slow-8x8 | ./models/coslow/scripts/test/charades.sh |

Benchmark FLOPs and throughput

The scripts should be executed from project root.

| Model | Script |
| --- | --- |
| I3D-R50 | ./models/i3d/scripts/profile/kinetics400.sh |
| R(2+1)D-18 | ./models/r2plus1d/scripts/profile/kinetics400.sh |
| SlowFast | ./models/slowfast/scripts/profile/kinetics400.sh |
| Slow | ./models/slow/scripts/profile/kinetics400.sh |
| X3D | ./models/x3d/scripts/profile/kinetics400.sh |
| CoX3D | ./models/cox3d/scripts/profile/kinetics400.sh |
| CoI3D | ./models/coi3d/scripts/profile/kinetics400.sh |
| CoSlow | ./models/coslow/scripts/profile/kinetics400.sh |

Citation

@inproceedings{hedegaard2022continual,
    title={Continual 3D Convolutional Neural Networks for Real-time Processing of Videos},
    author={Lukas Hedegaard and Alexandros Iosifidis},
    booktitle={European Conference on Computer Vision (ECCV)},
    year={2022},
}

Acknowledgement

This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871449 (OpenDR).