
<br /> <p align="center"> <img src="figs/logo.png" align="center" width="42%"> <h3 align="center"><strong>Unsupervised Video Domain Adaptation for Action Recognition:<br>A Disentanglement Perspective</strong></h3> <p align="center"> <a href="https://scholar.google.com/citations?user=a94WthkAAAAJ" target='_blank'>Pengfei Wei</a><sup>1</sup>&nbsp;&nbsp; <a href="https://scholar.google.com/citations?user=-j1j7TkAAAAJ" target='_blank'>Lingdong Kong</a><sup>1,2</sup>&nbsp;&nbsp; <a href="https://scholar.google.com/citations?user=2PxlmU0AAAAJ" target='_blank'>Xinghua Qu</a><sup>1</sup>&nbsp;&nbsp; <a href="https://scholar.google.com/citations?user=4FA6C0AAAAAJ" target='_blank'>Yi Ren</a><sup>1</sup>&nbsp;&nbsp; <a href="https://scholar.google.com/citations?user=0R20iBMAAAAJ" target='_blank'>Zhiqiang Xu</a><sup>3</sup>&nbsp;&nbsp; <a href="https://scholar.google.com/citations?user=XFtCe08AAAAJ" target='_blank'>Jing Jiang</a><sup>4</sup>&nbsp;&nbsp; <a href="https://scholar.google.com/citations?user=e6_J-lEAAAAJ" target='_blank'>Xiang Yin</a><sup>1</sup> <br> <sup>1</sup>ByteDance AI Lab&nbsp;&nbsp; <sup>2</sup>National University of Singapore&nbsp;&nbsp; <sup>3</sup>MBZUAI&nbsp;&nbsp; <sup>4</sup>University of Technology Sydney </p> </p> <p align="center"> <a href="https://neurips.cc/" target='_blank'><b>NeurIPS 2023</b></a> </p> <p align="center"> <a href="https://arxiv.org/abs/2208.07365" target='_blank'> <img src="https://img.shields.io/badge/Paper-%F0%9F%93%83-firebrick"> </a> <a href="https://ldkong.com/TranSVAE" target='_blank'> <img src="https://img.shields.io/badge/Project-%F0%9F%94%97-red"> </a> <a href="https://huggingface.co/spaces/ldkong/TranSVAE" target='_blank'> <img src="https://img.shields.io/badge/Demo-%F0%9F%8E%AC-lightgray"> </a> <a href="https://zhuanlan.zhihu.com/p/553169112" target='_blank'> <img src="https://img.shields.io/badge/%E4%B8%AD%E8%AF%91%E7%89%88-%F0%9F%90%BC-lightblue"> </a> <a href="" target='_blank'> <img src="https://visitor-badge.laobi.icu/badge?page_id=ldkong1205.TranSVAE&left_color=gray&right_color=blue"> </a> </p>

About

TranSVAE is a disentanglement framework designed for unsupervised video domain adaptation. It aims to disentangle domain information from the data during the adaptation process. We model the generation of cross-domain videos with two sets of latent factors: one encoding the static, domain-related information and the other encoding the temporal, semantics-related information. Objectives are imposed on these latent factors to achieve domain disentanglement and transfer.

<br> <p align="center"> <img src="https://github.com/ldkong1205/TranSVAE/blob/main/figs/example.gif" align="center" width="60%"> <br> <strong>Col1:</strong> Original sequences ("Human" $\mathcal{D}=\mathbf{P}_1$ and "Alien" $\mathcal{D}=\mathbf{P}_2$); <strong>Col2:</strong> Sequence reconstructions; <strong>Col3:</strong> Reconstructed sequences using $z_1^{\mathcal{D}},...,z_T^{\mathcal{D}}$; <strong>Col4:</strong> Domain transferred sequences with exchanged $z_d^{\mathcal{D}}$. </p> <br>
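To make the two latent branches concrete, below is a minimal PyTorch-style sketch of the idea, not the released implementation: the module names, dimensions, and the `swap_domains` helper are illustrative placeholders. It shows one static code $z_d$ per video, per-frame dynamic codes $z_1,...,z_T$, and the static-code swap used for the domain-transferred sequences in Col4 of the figure above.

```python
import torch
import torch.nn as nn

class TranSVAESketch(nn.Module):
    """Illustrative sketch only: one static (domain) latent per video
    and one dynamic (semantic) latent per frame."""
    def __init__(self, feat_dim=512, zd_dim=64, zt_dim=64):
        super().__init__()
        self.static_enc = nn.Linear(feat_dim, zd_dim)                    # z_d: static, domain-related
        self.dynamic_enc = nn.LSTM(feat_dim, zt_dim, batch_first=True)   # z_1..z_T: temporal, semantic
        self.decoder = nn.Linear(zd_dim + zt_dim, feat_dim)

    def forward(self, frames):                       # frames: (B, T, feat_dim)
        z_d = self.static_enc(frames.mean(dim=1))    # one code per video
        z_t, _ = self.dynamic_enc(frames)            # one code per frame
        z_d_seq = z_d.unsqueeze(1).expand(-1, z_t.size(1), -1)
        recon = self.decoder(torch.cat([z_d_seq, z_t], dim=-1))
        return recon, z_d, z_t

# Domain transfer as in Col4 of the figure: swap the static codes of two videos
# while keeping their own dynamic codes, then decode (hypothetical helper).
def swap_domains(model, frames_src, frames_tgt):
    _, zd_src, zt_src = model(frames_src)
    _, zd_tgt, zt_tgt = model(frames_tgt)
    T = zt_src.size(1)
    src_as_tgt = model.decoder(torch.cat([zd_tgt.unsqueeze(1).expand(-1, T, -1), zt_src], dim=-1))
    tgt_as_src = model.decoder(torch.cat([zd_src.unsqueeze(1).expand(-1, T, -1), zt_tgt], dim=-1))
    return src_as_tgt, tgt_as_src
```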

Visit our project page to explore more details. :paw_prints:

Updates

Outline

Highlights

<strong>Conceptual Comparison</strong>
<img src="figs/idea.jpg" width="70%">
<strong>Graphical Model</strong>
<img src="figs/graph.png" width="60%">
<strong>Framework Overview</strong>
<img src="figs/framework.png" width="96%">

Installation

Please refer to INSTALL.md for the installation details.

Data Preparation

Please refer to DATA_PREPARE.md for details on preparing the <sup>1</sup>UCF<sub>101</sub>, <sup>2</sup>HMDB<sub>51</sub>, <sup>3</sup>Jester, <sup>4</sup>Epic-Kitchens, and <sup>5</sup>Sprites datasets.

Getting Started

Please refer to GET_STARTED.md to learn more about using this codebase.

Main Results

UCF<sub>101</sub> - HMDB<sub>51</sub>


| Method | Backbone | U<sub>101</sub> → H<sub>51</sub> | H<sub>51</sub> → U<sub>101</sub> | Average |
| :-- | :-: | :-: | :-: | :-: |
| DANN (JMLR'16) | ResNet-101 | 75.28 | 76.36 | 75.82 |
| JAN (ICML'17) | ResNet-101 | 74.72 | 76.69 | 75.71 |
| AdaBN (PR'18) | ResNet-101 | 72.22 | 77.41 | 74.82 |
| MCD (CVPR'18) | ResNet-101 | 73.89 | 79.34 | 76.62 |
| TA<sup>3</sup>N (ICCV'19) | ResNet-101 | 78.33 | 81.79 | 80.06 |
| ABG (MM'20) | ResNet-101 | 79.17 | 85.11 | 82.14 |
| TCoN (AAAI'20) | ResNet-101 | 87.22 | 89.14 | 88.18 |
| MA<sup>2</sup>L-TD (WACV'22) | ResNet-101 | 85.00 | 86.59 | 85.80 |
| Source-only | I3D | 80.27 | 88.79 | 84.53 |
| DANN (JMLR'16) | I3D | 80.83 | 88.09 | 84.46 |
| ADDA (CVPR'17) | I3D | 79.17 | 88.44 | 83.81 |
| TA<sup>3</sup>N (ICCV'19) | I3D | 81.38 | 90.54 | 85.96 |
| SAVA (ECCV'20) | I3D | 82.22 | 91.24 | 86.73 |
| CoMix (NeurIPS'21) | I3D | 86.66 | 93.87 | 90.22 |
| CO<sup>2</sup>A (WACV'22) | I3D | 87.78 | 95.79 | 91.79 |
| TranSVAE (Ours) | I3D | 87.78 | 98.95 | 93.37 |
| Oracle | I3D | 95.00 | 96.85 | 95.93 |

Jester


| Task | Source-only | DANN | ADDA | TA<sup>3</sup>N | CoMix | TranSVAE (Ours) | Oracle |
| :-- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| J<sub>S</sub> → J<sub>T</sub> | 51.5 | 55.4 | 52.3 | 55.5 | 64.7 | 66.1 | 95.6 |

Epic-Kitchens


| Task | Source-only | DANN | ADDA | TA<sup>3</sup>N | CoMix | TranSVAE (Ours) | Oracle |
| :-- | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| D<sub>1</sub> → D<sub>2</sub> | 32.8 | 37.7 | 35.4 | 34.2 | 42.9 | 50.5 | 64.0 |
| D<sub>1</sub> → D<sub>3</sub> | 34.1 | 36.6 | 34.9 | 37.4 | 40.9 | 50.3 | 63.7 |
| D<sub>2</sub> → D<sub>1</sub> | 35.4 | 38.3 | 36.3 | 40.9 | 38.6 | 50.3 | 57.0 |
| D<sub>2</sub> → D<sub>3</sub> | 39.1 | 41.9 | 40.8 | 42.8 | 45.2 | 58.6 | 63.7 |
| D<sub>3</sub> → D<sub>1</sub> | 34.6 | 38.8 | 36.1 | 39.9 | 42.3 | 48.0 | 57.0 |
| D<sub>3</sub> → D<sub>2</sub> | 35.8 | 42.1 | 41.4 | 44.2 | 49.2 | 58.0 | 64.0 |
| Average | 35.3 | 39.2 | 37.4 | 39.9 | 43.2 | 52.6 | 61.5 |

Ablation Study

<strong>UCF<sub>101</sub> → HMDB<sub>51</sub></strong> <br> <img src="figs/ablation-ucf2hmdb.png">

<strong>HMDB<sub>51</sub> → UCF<sub>101</sub></strong> <br> <img src="figs/ablation-hmdb2ucf.png">

<strong>Domain Transfer Example</strong> <br>

| Source (Original) | Target (Original) | Source (Original) | Target (Original) |
| :-: | :-: | :-: | :-: |
| src_original | tar_original | src_original | tar_original |
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{T}}$) |
| src_recon | tar_recon | src_recon | tar_recon |
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{0}$) |
| recon_srcZf | recon_tarZf | recon_srcZf | recon_tarZf |
| Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{T}}$) |
| recon_srcZt | recon_tarZt | recon_srcZt | recon_tarZt |
| Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{S}}$) |
| recon_srcZf_tarZt | recon_tarZf_srcZt | recon_srcZf_tarZt | recon_tarZf_srcZt |
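The four reconstruction variants in the table above can be read as four different decoder inputs. Continuing the illustrative sketch from the About section (again with placeholder names and a hypothetical helper, not the released code):

```python
import torch

def ablation_reconstructions(model, frames_src, frames_tgt):
    """Decode the four latent combinations shown above (illustrative only).

    `model` is the TranSVAESketch-style module from the About section sketch.
    """
    _, zd_s, zt_s = model(frames_src)
    _, zd_t, zt_t = model(frames_tgt)
    T = zt_s.size(1)
    expand = lambda z: z.unsqueeze(1).expand(-1, T, -1)   # repeat static code over time
    zeros_d = torch.zeros_like(expand(zd_s))               # zeroed static branch
    zeros_t = torch.zeros_like(zt_s)                       # zeroed dynamic branch

    full     = model.decoder(torch.cat([expand(zd_s), zt_s], dim=-1))    # z_d^S + z_t^S
    static   = model.decoder(torch.cat([expand(zd_s), zeros_t], dim=-1)) # z_d^S + 0
    dynamic  = model.decoder(torch.cat([zeros_d, zt_s], dim=-1))         # 0 + z_t^S
    transfer = model.decoder(torch.cat([expand(zd_t), zt_s], dim=-1))    # z_d^T + z_t^S
    return full, static, dynamic, transfer
```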

TODO List

License

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a> <br /> This work is under the <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

Acknowledgement

We acknowledge the use of the following public resources during the course of this work: <sup>1</sup>UCF<sub>101</sub>, <sup>2</sup>HMDB<sub>51</sub>, <sup>3</sup>Jester, <sup>4</sup>Epic-Kitchens, <sup>5</sup>Sprites, <sup>6</sup>I3D, and <sup>7</sup>TRN.

Citation

If you find this work helpful, please kindly consider citing our paper:

@inproceedings{wei2023transvae,
  title = {Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective},
  author = {Wei, Pengfei and Kong, Lingdong and Qu, Xinghua and Ren, Yi and Xu, Zhiqiang and Jiang, Jing and Yin, Xiang},
  booktitle = {Advances in Neural Information Processing Systems}, 
  year = {2023},
}