
TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation

By Dongxu Li*, Chenchen Xu*, Xin Yu, Kaihao Zhang, Benjamin Swift, Hanna Suominen and Hongdong Li

(* Authors contributed equally.)

<img src='figs/teaser.png'>

This repository contains the implementation of TSPNet. The preprocessed dataset, video features, and inference results are available on Google Drive.

We thank the authors of fairseq for their efforts.

Requirements

Install from source

Install the project from source and develop locally:

cd TSPNet/
pip install --editable .
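
If the install succeeded, the package should be importable (a minimal sanity check; it just imports fairseq and prints its version):

import fairseq
print(fairseq.__version__)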

Getting started

Preprocessing

Download the preprocessed dataset and arrange the files as follows:

TSPNet/
├── i3d-features/
│   ├── span=8_stride=2
│   ├── span=12_stride=2
│   └── span=16_stride=2
├── data-bin/
│   └── phoenix2014T/
│       └── sp25000/
│   
├── README.md
├── run-scripts/
└── test-scripts/

Training

Go to the run-scripts folder and start training:

cd TSPNet/run-scripts
SAVE_DIR=CHECKPOINT_PATH bash run_phoenix_pos_embed_sp_test_3lvl.sh
The script replicates the performance reported in the paper.

Testing

After training, you can run inference on the test set by specifying a checkpoint file.


Note that CHECKPOINT_FILE_PATH should point to a saved checkpoint file, not to the checkpoint folder.

CHECKPOINT=CHECKPOINT_FILE_PATH bash test_phoenix_pos_embed_sp_test_3lvl.sh

The script reports multiple metrics, including the ROUGE-L and BLEU-{n} scores reported in the paper.

Alternative instructions for preparing the dataset yourself

  1. Text

Install the German subword embeddings package BPEmb with pip install bpemb.

Preprocess the translation texts into BPE using preprocess_sign.py, once for each split, for example (a sketch of the underlying BPE step follows these commands):

python preprocess_sign.py --save-vecs data/processed/emb data/ori/phoenix2014T.train.de data/processed/train.de

python preprocess_sign.py data/ori/phoenix2014T.test.de data/processed/test.de
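
For reference, a minimal sketch of what the BPE step looks like when using bpemb directly. The 25,000-subword German vocabulary is an assumption inferred from the sp25000 folder name, and preprocess_sign.py may differ in detail:

from bpemb import BPEmb

# Load pretrained German subword embeddings.
# vs=25000 is an assumption based on the "sp25000" folder name.
bpemb_de = BPEmb(lang="de", vs=25000, dim=100)

# Segment a sentence into BPE subword units.
print(bpemb_de.encode("guten morgen"))

# The pretrained subword embedding matrix (presumably what --save-vecs saves).
print(bpemb_de.vectors.shape)  # (vocab_size, 100)
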
  2. Vocabulary

Run fairseq-preprocess to generate the dictionary file dict.de.txt. With --dataset-impl raw, no binarized dataset is produced; the vocabulary file is the only output needed from this step. Drop the flag if you also want a binarized dataset.

fairseq-preprocess --source-lang de --target-lang de --trainpref data/processed/train --testpref data/processed/test --destdir data-bin/ --dataset-impl raw
  3. Video

Prepare the sign videos and their corresponding video features (e.g., extracted with pretrained I3D networks), and create a JSON file for each split (e.g., train.sign-de.sign) in the format below. The file must contain the same number of entries as the split's text file, where each entry corresponds to the sentence on the same line of the prepared text file (a generation sketch follows the format example).
[
    {
        "ident": "VIDEO_ID",
        "size": 64
    },
    ...
]

Here, size is the length of the video's feature sequence.
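
A minimal sketch for generating such a file, assuming one .npy feature array per video and a video_ids list already ordered to match the lines of the split's text file (both names are hypothetical):

import json
import numpy as np

# Hypothetical: video IDs ordered to match the lines of the split's text file.
video_ids = ["VIDEO_ID_1", "VIDEO_ID_2"]

entries = []
for vid in video_ids:
    # Assumes features are stored as one (length, feature_dim) array per video.
    feats = np.load(f"i3d-features/span=8_stride=2/{vid}.npy")
    entries.append({"ident": vid, "size": int(feats.shape[0])})

with open("data-bin/train.sign-de.sign", "w") as f:
    json.dump(entries, f, indent=4)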

  4. Finally, arrange the text files, video JSON files, word embeddings, and vocabulary file into a folder as below:
data-bin/
├── train.sign-de.sign
├── train.sign-de.de
│
├── test.sign-de.sign
├── test.sign-de.de
│
├── emb
└── dict.de.txt
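
A quick consistency check (a sketch, assuming the layout above): each split's JSON file must contain exactly as many entries as its text file has lines.

import json

for split in ["train", "test"]:
    with open(f"data-bin/{split}.sign-de.sign") as f:
        n_videos = len(json.load(f))
    with open(f"data-bin/{split}.sign-de.de") as f:
        n_sentences = sum(1 for _ in f)
    assert n_videos == n_sentences, f"{split}: {n_videos} videos vs {n_sentences} sentences"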

Citations

Please cite our paper and the WLASL dataset (used for pre-training) as:

@inproceedings{li2020tspnet,
	title        = {TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation},
	author       = {Li, Dongxu and Xu, Chenchen and Yu, Xin and Zhang, Kaihao and Swift, Benjamin and Suominen, Hanna and Li, Hongdong},
	year         = 2020,
	booktitle    = {Advances in Neural Information Processing Systems},
	volume       = 33
}

@inproceedings{li2020word,
    title={Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison},
    author={Li, Dongxu and Rodriguez, Cristian and Yu, Xin and Li, Hongdong},
    booktitle={The IEEE Winter Conference on Applications of Computer Vision},
    pages={1459--1469},
    year={2020}
}

Other works you might be interested in:

@inproceedings{li2020transferring,
  title={Transferring cross-domain knowledge for video sign language recognition},
  author={Li, Dongxu and Yu, Xin and Xu, Chenchen and Petersson, Lars and Li, Hongdong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={6205--6214},
  year={2020}
}