<br /> <p align="center"> <img src="figs/logo.png" align="center" width="42%"> <h3 align="center"><strong>Unsupervised Video Domain Adaptation for Action Recognition:<br>A Disentanglement Perspective</strong></h3> <p align="center"> <a href="https://scholar.google.com/citations?user=a94WthkAAAAJ" target='_blank'>Pengfei Wei</a><sup>1</sup> <a href="https://scholar.google.com/citations?user=-j1j7TkAAAAJ" target='_blank'>Lingdong Kong</a><sup>1,2</sup> <a href="https://scholar.google.com/citations?user=2PxlmU0AAAAJ" target='_blank'>Xinghua Qu</a><sup>1</sup> <a href="https://scholar.google.com/citations?user=4FA6C0AAAAAJ" target='_blank'>Yi Ren</a><sup>1</sup> <a href="https://scholar.google.com/citations?user=0R20iBMAAAAJ" target='_blank'>Zhiqiang Xu</a><sup>3</sup> <a href="https://scholar.google.com/citations?user=XFtCe08AAAAJ" target='_blank'>Jing Jiang</a><sup>4</sup> <a href="https://scholar.google.com/citations?user=e6_J-lEAAAAJ" target='_blank'>Xiang Yin</a><sup>1</sup> <br> <sup>1</sup>ByteDance AI Lab <sup>2</sup>National University of Singapore <sup>3</sup>MBZUAI <sup>4</sup>University of Technology Sydney </p> </p> <p align="center"> <a href="https://neurips.cc/" target='_blank'><b>NeurIPS 2023</b></a> </p> <p align="center"> <a href="https://arxiv.org/abs/2208.07365" target='_blank'> <img src="https://img.shields.io/badge/Paper-%F0%9F%93%83-firebrick"> </a> <a href="https://ldkong.com/TranSVAE" target='_blank'> <img src="https://img.shields.io/badge/Project-%F0%9F%94%97-red"> </a> <a href="https://huggingface.co/spaces/ldkong/TranSVAE" target='_blank'> <img src="https://img.shields.io/badge/Demo-%F0%9F%8E%AC-lightgray"> </a> <a href="https://zhuanlan.zhihu.com/p/553169112" target='_blank'> <img src="https://img.shields.io/badge/%E4%B8%AD%E8%AF%91%E7%89%88-%F0%9F%90%BC-lightblue"> </a> <a href="" target='_blank'> <img src="https://visitor-badge.laobi.icu/badge?page_id=ldkong1205.TranSVAE&left_color=gray&right_color=blue"> </a> </p>

## About
TranSVAE is a disentanglement framework designed for unsupervised video domain adaptation. It disentangles domain information from the data during adaptation by modeling the generation of cross-domain videos with two sets of latent factors: one encoding static, domain-related information, and the other encoding temporal, semantic-related information. Objectives are enforced on these two sets of factors to achieve domain disentanglement and transfer.
<br> <p align="center"> <img src="https://github.com/ldkong1205/TranSVAE/blob/main/figs/example.gif" align="center" width="60%"> <br> <strong>Col1:</strong> Original sequences ("Human" $\mathcal{D}=\mathbf{P}_1$ and "Alien" $\mathcal{D}=\mathbf{P}_2$); <strong>Col2:</strong> Sequence reconstructions; <strong>Col3:</strong> Reconstructed sequences using $z_1^{\mathcal{D}},...,z_T^{\mathcal{D}}$; <strong>Col4:</strong> Domain transferred sequences with exchanged $z_d^{\mathcal{D}}$. </p> <br>Visit our project page to explore more details. :paw_prints:
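To make the two-factor structure concrete, below is a minimal PyTorch sketch of this factorization. It is an illustration only: the module names, dimensions, and the deterministic (non-variational) simplification are our assumptions, not the actual TranSVAE architecture, and the probabilistic machinery and disentanglement objectives are omitted; see the codebase for the real model.

```python
import torch
import torch.nn as nn

class TranSVAESketch(nn.Module):
    """Simplified sketch of the two-factor latent model (illustrative only)."""

    def __init__(self, feat_dim=1024, zd_dim=128, zt_dim=128):
        super().__init__()
        # Static branch: one code per video, meant to carry domain information.
        self.enc_static = nn.GRU(feat_dim, zd_dim, batch_first=True)
        # Dynamic branch: one code per frame, meant to carry semantics/motion.
        self.enc_dynamic = nn.GRU(feat_dim, zt_dim, batch_first=True)
        # Decoder reconstructs each frame feature from [z_d, z_t].
        self.dec = nn.Sequential(
            nn.Linear(zd_dim + zt_dim, 512), nn.ReLU(),
            nn.Linear(512, feat_dim),
        )

    def forward(self, x):                        # x: (B, T, feat_dim)
        B, T, _ = x.shape
        _, h = self.enc_static(x)                # h: (1, B, zd_dim)
        z_d = h[-1]                              # static code, shared by all frames
        z_t, _ = self.enc_dynamic(x)             # (B, T, zt_dim), one code per frame
        z_d_rep = z_d.unsqueeze(1).expand(-1, T, -1)
        recon = self.dec(torch.cat([z_d_rep, z_t], dim=-1))
        return recon, z_d, z_t
```

A quick smoke test: `recon, z_d, z_t = TranSVAESketch()(torch.randn(2, 8, 1024))`.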
## Updates
- [2023.10] - We provide our extracted I3D features; kindly refer to this page for more details.
- [2023.09] - TranSVAE was accepted to NeurIPS 2023! :tada:
- [2022.08] - TranSVAE achieves 1st place on the UDA leaderboards of UCF-HMDB, Jester, and Epic-Kitchens, according to Papers with Code.
- [2022.08] - Try a Gradio demo for domain disentanglement in TranSVAE at Hugging Face Spaces! :hugs:
- [2022.08] - Our paper is available on arXiv; click here to check it out!
## Outline
- [Highlights](#highlights)
- [Installation](#installation)
- [Data Preparation](#data-preparation)
- [Getting Started](#getting-started)
- [Main Results](#main-results)
- [TODO List](#todo-list)
- [License](#license)
- [Acknowledgement](#acknowledgement)
- [Citation](#citation)
## Highlights
<strong>Conceptual Comparison</strong> |
---|
<img src="figs/idea.jpg" width="70%"> |
<strong>Graphical Model</strong> |
<img src="figs/graph.png" width="60%"> |
<strong>Framework Overview</strong> |
<img src="figs/framework.png" width="96%"> |
## Installation
Please refer to INSTALL.md for the installation details.
## Data Preparation
Please refer to DATA_PREPARE.md for details on preparing the <sup>1</sup>UCF<sub>101</sub>, <sup>2</sup>HMDB<sub>51</sub>, <sup>3</sup>Jester, <sup>4</sup>Epic-Kitchens, and <sup>5</sup>Sprites datasets.
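For a rough idea of what consuming the pre-extracted I3D features can look like, here is a hypothetical loader. The file layout, names, and the `(T, 1024)` shape are placeholders of ours; the authoritative format is documented in DATA_PREPARE.md.

```python
import numpy as np
from torch.utils.data import Dataset

class I3DFeatureDataset(Dataset):
    """Hypothetical loader for pre-extracted I3D features (layout is assumed)."""

    def __init__(self, file_list, labels):
        self.file_list = file_list   # e.g. ["features/ucf101/v_0001.npy", ...]
        self.labels = labels         # one integer action label per video

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, idx):
        feat = np.load(self.file_list[idx])      # assumed shape: (T, 1024)
        return feat.astype(np.float32), self.labels[idx]
```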
## Getting Started
Please refer to GET_STARTED.md to learn more about the usage of this codebase.
## Main Results
### UCF<sub>101</sub> - HMDB<sub>51</sub>
Method | Backbone | U<sub>101</sub> → H<sub>51</sub> | H<sub>51</sub> → U<sub>101</sub> | Average |
---|---|---|---|---|
DANN (JMLR'16) | ResNet-101 | 75.28 | 76.36 | 75.82 |
JAN (ICML'17) | ResNet-101 | 74.72 | 76.69 | 75.71 |
AdaBN (PR'18) | ResNet-101 | 72.22 | 77.41 | 74.82 |
MCD (CVPR'18) | ResNet-101 | 73.89 | 79.34 | 76.62 |
TA<sup>3</sup>N (ICCV'19) | ResNet-101 | 78.33 | 81.79 | 80.06 |
ABG (MM'20) | ResNet-101 | 79.17 | 85.11 | 82.14 |
TCoN (AAAI'20) | ResNet-101 | 87.22 | 89.14 | 88.18 |
MA<sup>2</sup>L-TD (WACV'22) | ResNet-101 | 85.00 | 86.59 | 85.80 |
Source-only | I3D | 80.27 | 88.79 | 84.53 |
DANN (JMLR'16) | I3D | 80.83 | 88.09 | 84.46 |
ADDA (CVPR'17) | I3D | 79.17 | 88.44 | 83.81 |
TA<sup>3</sup>N (ICCV'19) | I3D | 81.38 | 90.54 | 85.96 |
SAVA (ECCV'20) | I3D | 82.22 | 91.24 | 86.73 |
CoMix (NeurIPS'21) | I3D | 86.66 | 93.87 | 90.22 |
CO<sup>2</sup>A (WACV'22) | I3D | 87.78 | 95.79 | 91.79 |
TranSVAE (Ours) | I3D | 87.78 | 98.95 | 93.37 |
Oracle | I3D | 95.00 | 96.85 | 95.93 |
### Jester
Task | Source-only | DANN | ADDA | TA<sup>3</sup>N | CoMix | TranSVAE (Ours) | Oracle |
---|---|---|---|---|---|---|---|
J<sub>S</sub> → J<sub>T</sub> | 51.5 | 55.4 | 52.3 | 55.5 | 64.7 | 66.1 | 95.6 |
### Epic-Kitchens
Task | Source-only | DANN | ADDA | TA<sup>3</sup>N | CoMix | TranSVAE (Ours) | Oracle |
---|---|---|---|---|---|---|---|
D<sub>1</sub> → D<sub>2</sub> | 32.8 | 37.7 | 35.4 | 34.2 | 42.9 | 50.5 | 64.0 |
D<sub>1</sub> → D<sub>3</sub> | 34.1 | 36.6 | 34.9 | 37.4 | 40.9 | 50.3 | 63.7 |
D<sub>2</sub> → D<sub>1</sub> | 35.4 | 38.3 | 36.3 | 40.9 | 38.6 | 50.3 | 57.0 |
D<sub>2</sub> → D<sub>3</sub> | 39.1 | 41.9 | 40.8 | 42.8 | 45.2 | 58.6 | 63.7 |
D<sub>3</sub> → D<sub>1</sub> | 34.6 | 38.8 | 36.1 | 39.9 | 42.3 | 48.0 | 57.0 |
D<sub>3</sub> → D<sub>2</sub> | 35.8 | 42.1 | 41.4 | 44.2 | 49.2 | 58.0 | 64.0 |
Average | 35.3 | 39.2 | 37.4 | 39.9 | 43.2 | 52.6 | 61.5 |
### Ablation Study
<strong>UCF<sub>101</sub></strong> → <strong>HMDB<sub>51</sub></strong> <br> <img src="figs/ablation-ucf2hmdb.png">
<strong>HMDB<sub>51</sub></strong> → <strong>UCF<sub>101</sub></strong> <br> <img src="figs/ablation-hmdb2ucf.png">
<strong>Domain Transfer Example</strong> <br>
Source (Original) | Target (Original) | Source (Original) | Target (Original)
---|---|---|---
Reconstruct ($\mathbf{z}_d^{\mathcal{S}}$ + $\mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}}$ + $\mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}}$ + $\mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}}$ + $\mathbf{z}_t^{\mathcal{T}}$)
Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{0}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{0}$)
Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{0} + \mathbf{z}_t^{\mathcal{T}}$)
Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{S}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{S}} + \mathbf{z}_t^{\mathcal{T}}$) | Reconstruct ($\mathbf{z}_d^{\mathcal{T}} + \mathbf{z}_t^{\mathcal{S}}$)
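The four "Reconstruct" rows above correspond to simple manipulations of the two codes before decoding: keep both, zero out one, or exchange a code across domains. The sketch below spells this out under the same caveats as the earlier snippet; the stand-in decoder, random codes, shapes, and names are all hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical shapes; the real dimensions live in the codebase.
B, T, ZD, ZT, FEAT = 2, 8, 128, 128, 1024
dec = nn.Linear(ZD + ZT, FEAT)                                    # stand-in decoder

z_d_src, z_d_tgt = torch.randn(B, ZD), torch.randn(B, ZD)         # static (domain) codes
z_t_src, z_t_tgt = torch.randn(B, T, ZT), torch.randn(B, T, ZT)   # dynamic (semantic) codes

def decode(z_d, z_t):
    """Broadcast the static code across frames and decode [z_d, z_t]."""
    z_d_rep = z_d.unsqueeze(1).expand(-1, z_t.size(1), -1)
    return dec(torch.cat([z_d_rep, z_t], dim=-1))

recon_full = decode(z_d_src, z_t_src)                     # z_d^S + z_t^S: full reconstruction
recon_stat = decode(z_d_src, torch.zeros_like(z_t_src))   # z_d^S + 0: static content only
recon_dyn  = decode(torch.zeros_like(z_d_src), z_t_src)   # 0 + z_t^S: motion/semantics only
recon_swap = decode(z_d_src, z_t_tgt)                     # z_d^S + z_t^T: domain transfer
```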
## TODO List
- Initial release. 🚀
- Add license. See here for more details.
- Add demo at Hugging Face Spaces.
- Add installation details.
- Add data preparation details.
- Add evaluation details.
- Add training details.
## License
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a> <br /> This work is under the <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
## Acknowledgement
We acknowledge the use of the following public resources during the course of this work: <sup>1</sup>UCF<sub>101</sub>, <sup>2</sup>HMDB<sub>51</sub>, <sup>3</sup>Jester, <sup>4</sup>Epic-Kitchens, <sup>5</sup>Sprites, <sup>6</sup>I3D, and <sup>7</sup>TRN.
## Citation
If you find this work helpful, please kindly consider citing our paper:
```bibtex
@inproceedings{wei2023transvae,
  title     = {Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective},
  author    = {Wei, Pengfei and Kong, Lingdong and Qu, Xinghua and Ren, Yi and Xu, Zhiqiang and Jiang, Jing and Yin, Xiang},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2023},
}
```