[ECCV2024 oral] C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition
Project Page | Paper
<div align="center"> <table style="border-collapse: collapse;"> <tr> <td style="text-align: center; padding: 10px;"> <img src="samples/open_door.gif" width="120" /> <br /> <i> <font color="black"><strong>Seen:</strong></font> <font color="red">Open</font> <font color="blue">a door</font> </i> </td> <td style="text-align: center; padding: 10px;"> <img src="samples/close_book.gif" width="120" /> <br /> <i> <font color="black"><strong>Seen:</strong></font> <font color="red">Close</font> <font color="blue">a book</font> </i> </td> <td style="height: 120px; width: 1px; border-left: 2px dashed gray; text-align: center; padding: 10px;"></td> <td style="text-align: center; padding: 10px;"> <img src="samples/close_door.gif" width="120" /> <br /> <i> <font color="black"><strong>Unseen:</strong></font> <font color="red">Close</font> <font color="blue">a door</font> </i> </td> </tr> </table> <div style="margin-top: 1px;"> <strong>Zero-Shot Compositional Action Recognition (ZS-CAR)</strong> </div> </div>

Rongchang Li, Zhenhua Feng, Tianyang Xu, Linze Li, Xiaojun Wu†, Muhammad Awais, Sara Atito, Josef Kittler
ECCV, 2024
🛠️ Prepare Something-composition (Sth-com)
<p align="middle" style="margin-bottom: 0.5px;"> <img src="samples/bend_spoon.gif" height="80" /> <img src="samples/bend_book.gif" height="80" /> <img src="samples/close_door.gif" height="80" /> <img src="samples/close_book.gif" height="80" /> <img src="samples/twist_obj.gif" height="80" /> </p> <p align="middle" style="margin-bottom: 0.5px;margin-top: 0.5px;"> <img src="samples/squeeze_bottle.gif" height="80" /> <img src="samples/squeeze_pillow.gif" height="80" /> <img src="samples/tear_card.gif" height="80" /> <img src="samples/tear_leaf.gif" height="80" /> <img src="samples/open_wallet.gif" height="80" /> </p> <p align="center" style="margin-top: 0.5px;"> <strong>Some samples in Something-composition</strong> </p>

- Download Something-Something V2 (Sth-v2). Our proposed Something-composition (Sth-com) is built on Sth-v2. Please follow the official website to download the videos to video_path.
- Extract frames. To speed up data loading during training, we extract the frames of each video and save them to frame_path:
python tools/extract_frames.py --video_root video_path --frame_root frame_path
- Download the dataset annotations. We provide the Sth-com annotation files in the data_split dir; please download them to annotation_path. Each file is a list formatted as follows (a loading sketch is given after the directory tree below):

      [
          {
              "id": "54463",                  # the sample name
              "action": "opening a book",     # the composition
              "verb": "Opening [something]",  # the verb component
              "object": "book"                # the object component
          },
          { ... },
          { ... },
      ]
- Finally, the dataset is ready. The directory structure should look like this:
  - annotation_path
      - data_split
          - generalized
              - train_pairs.json
              - val_pairs.json
              - test_pairs.json
  - frame_path
      - 0
          - 000001.jpg
          - 000002.jpg
          - ......
      - 1
          - 000001.jpg
          - 000002.jpg
          - ......
      - ......
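The following is a minimal sketch (not part of the released code) showing how the annotations and the extracted frames fit together: it loads one of the provided splits and counts the frames of a few samples. It assumes that the extraction step stores the frames of each video under frame_path/&lt;video id&gt;; the annotation_path and frame_path variables stand in for the paths you chose above.

```python
import json
import os

annotation_path = "annotation_path"  # where the data_split annotations were downloaded
frame_path = "frame_path"            # where the frames were extracted

# Load the generalized train split provided in data_split.
split_file = os.path.join(annotation_path, "data_split", "generalized", "train_pairs.json")
with open(split_file) as f:
    samples = json.load(f)

for sample in samples[:5]:
    # Each entry stores the composition ("action") and its verb/object components.
    # Assumption: frames of a video live under frame_path/<video id>.
    video_dir = os.path.join(frame_path, sample["id"])
    num_frames = len([x for x in os.listdir(video_dir) if x.endswith(".jpg")])
    print(f'{sample["id"]}: "{sample["action"]}" '
          f'(verb: {sample["verb"]}, object: {sample["object"]}) -> {num_frames} frames')
```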
🚀 Train and test
🔔 From here on, treat the codes directory as the project root.
Before running
- Prepare the word embedding models. We recommend following Compcos to download them.
- Modify the following paths (taking C2C_vanilla with a TSM-18 backbone as an example):
- dataset_path in ./config/c2c_vanilla_tsm.yml
- save_path in ./config/c2c_vanilla_tsm.yml
- The line t = fasttext.load_model('YOUR_PATH/cc.en.300.bin') in models/vm_models/word_embedding.py (a quick loading check is sketched below)
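Before launching training, it can help to confirm that the word embedding model loads correctly. The snippet below is a small sanity check (not part of the released code) using the official fasttext Python package and the cc.en.300.bin model referenced above:

```python
import fasttext

# Adjust to wherever you stored the downloaded model.
model = fasttext.load_model("YOUR_PATH/cc.en.300.bin")

print(model.get_dimension())         # the cc.en.300 model yields 300-dim vectors
vec = model.get_word_vector("book")  # embedding of an object component word
print(vec.shape)                     # (300,)
```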
Train
- Train a model with the command:
CUDA_VISIBLE_DEVICES=YOUR_GPU_INDEX python train.py --config config/c2c_vm/c2c_vanilla_tsm.yml
Test
- Suppose you have trained a model and its log directory is YOUR_LOG_PATH. You can then test it with:
CUDA_VISIBLE_DEVICES=YOUR_GPU_INDEX python test_for_models.py --logpath YOUR_LOG_PATH
📝 TODO List
- Add training code for the VM + word embedding paradigm.
- Add training code for the VLM paradigm.