Don't Judge by the Look: Towards Motion Coherent Video Representation (ICLR 2024)

<div align="left"> <a><img src="fig/smile.png" height="70px" ></a> <a><img src="fig/neu.png" height="70px" ></a> </div>

arXiv | Primary contact: Yitian Zhang

<div align="center"> <img src="fig/fig1_iclr.jpeg" width="800px" height="240px"> </div>

TL;DR: We propose Motion Coherent Augmentation (MCA), which perturbs the appearance of videos (SwapMix) while preserving their motion, together with Variation Alignment (VA) to resolve the resulting distribution shift, yielding consistent gains across architectures and datasets.

Datasets

Please follow the instructions in TSM to prepare the Something-Something V1/V2, Kinetics400, and HMDB51 datasets.

Supported Models

MCA is a general data augmentation method that can be applied to existing methods with only a few lines of code for stronger performance:

######  Hyperparameter  ######

import numpy as np
import torch
import torch.nn as nn

Beta = 1.0        # Beta distribution parameter for sampling the interpolation coefficient
MCA_Prob = 1.0    # probability of applying MCA to a batch
Lambda_AV = 1.0   # weight of the Variation Alignment (KL) loss

######  SwapMix  ######

r = np.random.rand(1)
if r < MCA_Prob:
    # sample the interpolation coefficient lambda from a Beta distribution
    batch_num = inputs.shape[0]
    lam = torch.from_numpy(np.random.beta(Beta, Beta, batch_num)).float().view(-1, 1, 1, 1, 1).cuda()
    # randomly shuffle the channel order (reject the identity permutation)
    rand_index = torch.randperm(3).cuda()
    while (rand_index - torch.tensor([0, 1, 2]).cuda()).abs().sum() == 0:
        rand_index = torch.randperm(3).cuda()
    # interpolate between the original and channel-shuffled inputs to enlarge the input space
    inputs_color = lam * inputs + (1 - lam) * inputs[:, rand_index]
    

######  Variation Alignment  ######    

if r < MCA_Prob:
    # construct the training pair and forward the original and augmented inputs together
    inputs_cat = torch.cat((inputs, inputs_color), 0)
    output = model(inputs_cat)
    # standard classification loss on the original inputs
    loss = criterion(output[:batch_num], target)
    # Variation Alignment: align predictions on augmented inputs with those on the original inputs
    loss_kl = Lambda_AV * nn.KLDivLoss(reduction='batchmean')(
        nn.LogSoftmax(dim=1)(output[batch_num:]),
        nn.Softmax(dim=1)(output[:batch_num].detach()))
    loss += loss_kl
else:
    output = model(inputs)
    loss = criterion(output, target)
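
For reference, below is a minimal, self-contained sketch of one training step with SwapMix and Variation Alignment on random dummy data. The toy classifier, tensor shapes (a batch of [B, C, T, H, W] clips), and class count are illustrative assumptions and not part of the released codebase:

import numpy as np
import torch
import torch.nn as nn

# toy settings for illustration only
Beta, MCA_Prob, Lambda_AV = 1.0, 1.0, 1.0
batch_num, num_classes = 4, 10

# dummy video batch of shape [B, C, T, H, W] and a toy classifier
inputs = torch.randn(batch_num, 3, 8, 32, 32)
target = torch.randint(0, num_classes, (batch_num,))
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, num_classes))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

r = np.random.rand(1)
if r < MCA_Prob:
    # SwapMix: interpolate towards a channel-shuffled copy of the batch
    lam = torch.from_numpy(np.random.beta(Beta, Beta, batch_num)).float().view(-1, 1, 1, 1, 1)
    rand_index = torch.randperm(3)
    while (rand_index - torch.tensor([0, 1, 2])).abs().sum() == 0:
        rand_index = torch.randperm(3)
    inputs_color = lam * inputs + (1 - lam) * inputs[:, rand_index]

    # Variation Alignment: forward both views and align their predictions
    output = model(torch.cat((inputs, inputs_color), 0))
    loss = criterion(output[:batch_num], target)
    loss += Lambda_AV * nn.KLDivLoss(reduction='batchmean')(
        nn.LogSoftmax(dim=1)(output[batch_num:]),
        nn.Softmax(dim=1)(output[:batch_num].detach()))
else:
    output = model(inputs)
    loss = criterion(output, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()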

Note that Variation Alignment can be easily extended to resolve the distribution shift introduced by other augmentation methods: simply replace inputs_color with inputs_aug, the training samples generated by the other augmentation operation:

inputs_cat = torch.cat((inputs, inputs_aug), 0)
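
As a purely hypothetical illustration (not from the repository), inputs_aug could for example come from a simple per-clip brightness jitter, with Variation Alignment then applied exactly as above:

# hypothetical augmentation for illustration: per-clip brightness scaling in [0.8, 1.2]
scale = 0.8 + 0.4 * torch.rand(inputs.shape[0], 1, 1, 1, 1, device=inputs.device)
inputs_aug = inputs * scale
inputs_cat = torch.cat((inputs, inputs_aug), 0)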

Currently, MCA supports the following implementations: 2D network (TSM), 3D network (SlowFast), and Transformer network (Uniformer).

Results

<div align="center"> <img src="fig/architecture.png" width="500px" height="220px"> </div>

MCA consistently outperforms the baseline methods across different architectures on the Something-Something V1 dataset.

Here we provide the pretrained models on all these architectures:

| Model | Top-1 Acc. | Weight |
| ------------- | ------ | ---- |
| TSM | 45.63% | link |
| TSM-MCA | 47.57% | link |
| SlowFast | 44.12% | link |
| SlowFast-MCA | 45.88% | link |
| Uniformer | 48.48% | link |
| Uniformer-MCA | 50.51% | link |

<div align="center"> <img src="fig/datasets.png" width="800px" height="100px"> </div>

MCA consistently outperforms the baseline method on different datasets.

Here we provide the pretrained models on Something-Something V2:

| Model | Top-1 Acc. | Weight |
| ------- | ------ | ---- |
| TSM | 59.29% | link |
| TSM-MCA | 60.71% | link |

and Kinetics400:

| Model | Top-1 Acc. | Weight |
| ------- | ------ | ---- |
| TSM | 70.28% | link |
| TSM-MCA | 71.08% | link |

<div align="center"> <img src="fig/compatible.png" width="380px" height="400px"> </div> <div align="center"> <img src="fig/application.png" width="500px" height="180px"> </div>

Get Started

We provide a comprehensive codebase for video recognition that contains the implementations of the 2D network, 3D network, and Transformer network. Please refer to the corresponding folders for detailed documentation.

Acknowledgment

Our codebase is heavily built upon TSM, SlowFast, Uniformer and FFN. We sincerely thank the authors for their wonderful works. The README format is heavily based on the GitHub repos of my colleagues Huan Wang, Xu Ma and Yizhou Wang. Great thanks to them! We also greatly thank the anonymous ICLR'24 reviewers for their constructive comments, which helped us improve the paper.