
SynSE - Syntactically Guided Generative Embeddings For Zero Shot Skeleton Action Recognition

Original PyTorch implementation of 'Syntactically Guided Generative Embeddings For Zero Shot Skeleton Action Recognition', accepted at the IEEE International Conference on Image Processing (ICIP) 2021.

TL;DR version of the work: <a href="https://threadreaderapp.com/thread/1356893399576580100.html"> HERE </a>

<img src = "Images/SynSE_arch.png" /> <div align="center"> <a href="https://youtu.be/r6xgTg3zIwI"> <img src="https://img.shields.io/badge/Watch on YouTube-FF0000?style=for-the-badge&logo=youtube&logoColor=white"/> </a> <table> <tr> <td> <a href="https://youtu.be/r6xgTg3zIwI" target="_blank"> <img src="http://img.youtube.com/vi/r6xgTg3zIwI/maxresdefault.jpg" alt="Watch the video" width="640" height="360" border="5"/> </a> </td> </tr> <tr> <th><samp> Video Overview(Click on Image above)</samp></th> </tr> </table> </div>

Dependencies

<ul> <li> Python >= 3.5 </li> <li> Torch == 1.2.0 </li> <li> Scikit-Learn </li> </ul>

Data Preparation

Creating the test-train splits.

The unseen classes of the various splits are listed below. The splits are also provided under <code>synse_resources/resources/label_splits</code>, which can be downloaded from here. Place the <code>resources</code> folder in the root <code>synse-zsl</code> directory. All split files follow the same naming scheme: R = random, S = seen, U = unseen, V = validation split. For example, the 5 random unseen classes are stored in <code>ru5.npy</code>.
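As a minimal sketch of how a split file might be consumed, the snippet below loads an unseen-class file such as <code>ru5.npy</code> with NumPy and derives the seen classes as the complement. The helper name <code>load_split</code> is hypothetical, and the assumption that the file stores a flat array of integer class indices in the range 0 to num_classes - 1 is not confirmed by the repository; check the indexing convention against the actual files.

```python
import numpy as np


def load_split(unseen_path, num_classes):
    """Return (seen, unseen) class-index arrays for one split.

    Assumes the .npy file holds a flat array of integer class indices
    in [0, num_classes); this layout is an assumption, not a documented fact.
    """
    unseen = np.unique(np.load(unseen_path).astype(int).ravel())
    all_classes = np.arange(num_classes)
    # Seen classes are every class that is not listed as unseen.
    seen = np.setdiff1d(all_classes, unseen)
    return seen, unseen
```

For the NTU-60 55/5 split, a call could look like <code>seen, unseen = load_split("resources/label_splits/ru5.npy", 60)</code>.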

NTU-60:

Unseen Classes (55/5 split):

<table> <tr> <td align = "center">A11 reading </td> <td align = "center">A12 writing </td> <td align = "center">A20 put on a hat/cap </td> <td align = "center">A27 jump up </td> <td align = "center">A57 touch pocket </td> </tr> </table>

Unseen Classes (48/12 split):

<table> <tr> <td align = "center">A4 brush hair </td> <td align = "center">A6 pick up </td> <td align = "center">A10 clapping </td> <td align = "center">A13 tear up paper </td> <td align = "center">A16 put on shoe </td> </tr> <tr> <td align = "center">A41 sneeze or cough </td> <td align = "center">A43 falling down </td> <td align = "center">A48 nausea or vomiting </td> <td align = "center">A52 pushing </td> <td align = "center">A57 touch pocket </td> </tr> <tr> <td align = "center">A59 walking towards </td> <td align = "center">A60 walking apart </td> </tr> </table>

NTU-120:

Unseen Classes (110/10 split):

<table> <tr> <td align = "center">A5 drop </td> <td align = "center">A14 put on jacket </td> <td align = "center">A38 salute </td> <td align = "center">A44 headache </td> <td align = "center">A50 punch or slap </td> </tr> <tr> <td align = "center">A66 juggle table tennis ball </td> <td align = "center">A89 put object into bag </td> <td align = "center">A96 cross arms </td> <td align = "center">A100 butt kicks </td> <td align = "center">A107 wield knife </td> </tr> </table>

Unseen Classes (96/24 split):

<table> <tr> <td align = "center">A6 pick up </td> <td align = "center">A10 clapping </td> <td align = "center">A12 writing </td> <td align = "center">A17 take off shoe </td> <td align = "center">A19 take off glasses </td> </tr> <tr> <td align = "center">A21 take off hat or cap </td> <td align = "center">A23 hand waving </td> <td align = "center">A30 type on keyboard </td> <td align = "center">A36 shake head </td> <td align = "center">A40 cross hands in front </td> </tr> <tr> <td align = "center">A46 back pain </td> <td align = "center">A50 punch or slap </td> <td align = "center">A60 walking apart </td> <td align = "center">A69 thumb up </td> <td align = "center">A71 make ok sign </td> </tr> <tr> <td align = "center">A82 fold paper </td> <td align = "center">A85 apply cream on face </td> <td align = "center">A88 take off bag </td> <td align = "center">A94 throw up cap or hat </td> <td align = "center">A95 capitulate </td> </tr> <tr> <td align = "center">A105 blow nose </td> <td align = "center">A114 carry object </td> <td align = "center">A115 take photo </td> <td align = "center">A120 rock paper scissors </td> </tr> </table>

Visual Feature Generation:

We provide the visual features generated via Shift-GCN for the NTU-60 and NTU-120 datasets for the various splits. They can be found under the <code>synse_resources/ntu_results</code> directory, which is downloadable from here. <code>train.npy</code> contains the visual features of the training data from the seen classes, <code>ztest.npy</code> contains the test data from the unseen classes, and <code>gtest.npy</code> contains the test data from all classes.
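The three feature files can be read with NumPy. The sketch below is illustrative only: the helper name <code>load_visual_features</code> is hypothetical, and it assumes that the three <code>.npy</code> files sit directly inside one directory per split, which should be checked against the downloaded folder layout.

```python
import os

import numpy as np

# File stems described above: seen-class training data, unseen-class
# test data (ZSL), and test data from all classes (GZSL).
FEATURE_FILES = ("train", "ztest", "gtest")


def load_visual_features(split_dir):
    """Load train.npy, ztest.npy and gtest.npy from one split directory."""
    feats = {}
    for name in FEATURE_FILES:
        feats[name] = np.load(os.path.join(split_dir, name + ".npy"))
    return feats
```

A call such as <code>feats = load_visual_features(split_dir)</code> returns a dictionary keyed by <code>"train"</code>, <code>"ztest"</code> and <code>"gtest"</code>.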

If you wish to generate the visual features yourself:

<ol> <li>Download the NTU-60 and NTU-120 datasets by requesting them from <a href="http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp">here</a>.</li> <li>Create the train-test-val splits for the datasets using the split files created in the previous step.</li> <li>Train the visual feature generator. Follow this for training Shift-GCN. Under the zero-shot learning assumption, a new feature generator has to be trained for each split, so that it never sees samples from that split's unseen classes. The trained Shift-GCN weights can be found under <code>synse_resources/ntu_results/shift_5_r/weights/</code>.</li> <li>Save the features for the train data, the unseen test data (ZSL), and the entire test data (GZSL).</li> </ol>
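The final saving step can be sketched as follows, assuming the features have already been extracted into NumPy arrays with one row per sample and a matching array of integer class labels. The function names and argument layout are hypothetical and are not taken from the repository's scripts; only the output file names follow the description above.

```python
import os

import numpy as np


def select_feature_sets(train_x, train_y, test_x, test_y, unseen):
    """Build the train / ZSL-test / GZSL-test feature sets for one split."""
    unseen = np.asarray(unseen)
    train_seen = ~np.isin(train_y, unseen)  # training samples of seen classes
    test_unseen = np.isin(test_y, unseen)   # test samples of unseen classes
    return {
        "train": train_x[train_seen],
        "ztest": test_x[test_unseen],
        "gtest": test_x,                     # every test sample
    }


def save_feature_sets(sets, out_dir):
    """Write each feature set to <out_dir>/<name>.npy."""
    os.makedirs(out_dir, exist_ok=True)
    for name, arr in sets.items():
        np.save(os.path.join(out_dir, name + ".npy"), arr)
```

Filtering the training data by class keeps the feature generator's outputs consistent with the zero-shot assumption: no unseen-class sample ends up in <code>train.npy</code>.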

Text feature generators

We also provide the generated language features for the labels in the NTU-60 and NTU-120 datasets. They can be found in <code>./synse_resources/resources/</code>. Place the <code>resources</code> folder in the root <code>synse-zsl</code> directory.

If you wish to generate the language features yourself:

<ul> <li>Word2Vec: Download the <a href="https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit">Pre-Trained Word2Vec Vectors</a> and extract the contents of the archive. Generate the Word2Vec representations using the gensim Python module as described here.</li> <li>Sentence-BERT: We use the sentence-transformers package from here, with the <code>stsb-bert-large</code> model.</li> </ul>
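For multi-word class labels, one common way to obtain a single Word2Vec representation is to average the vectors of the individual words. The sketch below illustrates that idea; averaging is an assumption here and not necessarily the procedure the authors followed, and the helper name <code>label_embedding</code> is hypothetical. It accepts any mapping that supports <code>word in vectors</code> and <code>vectors[word]</code>, which includes a plain dictionary as well as a gensim <code>KeyedVectors</code> object.

```python
import numpy as np


def label_embedding(label, vectors, dim=300):
    """Average the word vectors of a (possibly multi-word) class label.

    `vectors` is any mapping supporting `word in vectors` and
    `vectors[word]`. Words missing from the vocabulary are skipped.
    Averaging is an illustrative choice, not a confirmed detail of SynSE.
    """
    words = [w for w in label.lower().split() if w in vectors]
    if not words:
        # No known word: fall back to a zero vector of the expected size.
        return np.zeros(dim, dtype=np.float32)
    return np.mean([vectors[w] for w in words], axis=0)
```

With gensim, the extracted archive can be loaded via <code>KeyedVectors.load_word2vec_format(path, binary=True)</code> and the result passed in as <code>vectors</code>; the default <code>dim=300</code> matches the dimensionality of the pre-trained Google News vectors.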

Experiments

We provide the scripts necessary to obtain the results shown in the paper. They include training and evaluation scripts for ReViSE [1], JPoSE [2], CADA-VAE [3], and our model, SynSE. The scripts are located in their respective folders (<code>jpose</code>, <code>revise</code>, <code>synse</code>). Each folder contains a README detailing how to use the provided scripts for both training and evaluation.

References:

<ol> <li>Hubert Tsai, Yao-Hung, Liang-Kang Huang, and Ruslan Salakhutdinov. "Learning robust visual-semantic embeddings." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3571-3580. 2017. </li> <li>Wray, Michael, Diane Larlus, Gabriela Csurka, and Dima Damen. "Fine-grained action retrieval through multiple parts-of-speech embeddings." In Proceedings of the IEEE International Conference on Computer Vision, pp. 450-459. 2019.</li> <li>Schonfeld, Edgar, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, and Zeynep Akata. "Generalized zero-and few-shot learning via aligned variational autoencoders." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8247-8255. 2019.</li> </ol>