CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment

This is the official code of the CVPR 2023 paper (Highlight presentation, acceptance rate: 2.5% of submitted papers) CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment [CVPR Version] [arXiv Version].


Proposed CVT-SLR Framework

<img src=".\imgs\framework.jpg" alt="framework" style="zoom: 80%;" />

For more details, please refer to our paper.

Prerequisites

Dependencies

As a prerequisite, we suggest creating a brand-new conda environment first.
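A minimal setup might look like this (the environment name cvtslr is an arbitrary choice):

conda create -n cvtslr python=3.8
conda activate cvtslr

The reference Python dependencies can then be installed, for example: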

(1) python==3.8.16

(2) torch==1.12.0+cu116, please see the PyTorch official website

(3) PyYAML==6.0

(4) tqdm==4.64.0

(5) opencv-python==4.2.0.32

(6) scipy==1.4.1

FYI: not all of these packages are strictly required; adjust the versions to your actual setup.

In addition, you must install ctcdecode==0.4 for beam search decoding; please see this repo for details. Run the following command to install ctcdecode:

cd ctcdecode && pip install .
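For reference, a minimal decoding sketch with ctcdecode might look as follows; the gloss vocabulary and the log-probability tensor are placeholders, so adapt them to the model's actual outputs:

import torch
from ctcdecode import CTCBeamDecoder

vocab = ["-", "HELLO", "WORLD"]  # placeholder gloss vocabulary; index 0 is the CTC blank
decoder = CTCBeamDecoder(vocab, beam_width=10, blank_id=0, log_probs_input=True)
log_probs = torch.randn(1, 50, len(vocab)).log_softmax(-1)  # (batch, time, vocab) dummy scores
beam_results, beam_scores, timesteps, out_lens = decoder.decode(log_probs)
best = beam_results[0][0][:out_lens[0][0]]  # label indices of the top hypothesis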

Datasets

For data preparation, please download the phoenix2014 dataset and phoenix2014T dataset in advance. After extracting them, we suggest creating a soft link that points to the downloaded data.
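For example (the source paths are placeholders and depend on where you extracted the archives):

ln -s /path/to/phoenix2014-release ./dataset/phoenix2014
ln -s /path/to/PHOENIX-2014-T-release-v3 ./dataset/phoenix2014-T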

For more details on data preparation and prerequisites, please refer to this repo. We are grateful for the foundation that their work has given us.

NB: 1) Please refer to the above-mentioned repo for extracting the datasets into the ./dataset directory. 2) Resize the original sign images from 210x260 to 256x256 for augmentation; the generated gloss dict and resized image sequences are saved in ./preprocess for your reference (a minimal resize sketch follows). 3) We did not use the sclite library for evaluation (it can be tricky to install); instead, we use pure-Python evaluation tools, see ./evaluation.
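As an illustration of step 2), a resize along these lines (using opencv-python; the file paths are placeholders) would do:

import cv2

img = cv2.imread("frame_in.png")   # an original 210x260 sign frame (placeholder path)
img = cv2.resize(img, (256, 256))  # note: cv2.resize takes (width, height)
cv2.imwrite("frame_out.png", img)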

Configuration Setting

According to your actual situation, update the configurations in ./configs/phoenix14.yaml and ./configs/cvtslt_eval_config.yaml. In particular, pay attention to hyper-parameters such as dataset_root, evaluation_dir, and work_dir.
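For instance, the relevant entries might look like this (the values are placeholders; point them at your own directories):

dataset_root: ./dataset/phoenix2014   # root of the extracted dataset
evaluation_dir: ./evaluation          # pure-Python evaluation tools
work_dir: ./out_cvpr/                 # where logs and outputs are written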

Demo Evaluation

We provide pretrained CVT-SLR models for inference.

First, download the checkpoints into the ./trained_models directory from the links in the table below. Then evaluate a pretrained model using one of the following commands:

-> [Option 1] Using the AE-based configuration:

python run_demo.py --work-dir ./out_cvpr/cvtslt_2/ --config ./configs/cvtslt_eval_config.yaml --device 1 --load-weights ./trained_models/cvtslt_model_dev_19.87.pt --use_seqAE AE

Evaluation results: test 20.17%, dev 19.87%

-> [Option 2] Using the VAE-based configuration:

python run_demo.py --work-dir ./out_cvpr/cvtslt_1/ --config ./configs/cvtslt_eval_config.yaml --device 1 --load-weights ./trained_models/cvtslt_model_dev_19.80.pt --use_seqAE VAE

Evaluation results: test 20.06%, dev 19.80%

The updated evaluation results (WER %) and download links:

| Group | Models | Dev | Test | Trained Checkpoints |
| --- | --- | --- | --- | --- |
| Group 1 (single-cue) | SubUNet | 40.8 | 40.7 | - |
| | Staged-Opt | 39.4 | 38.7 | - |
| | Align-iOpt | 37.1 | 36.7 | - |
| | DPD+TEM | 35.6 | 34.5 | - |
| | Re-Sign | 27.1 | 26.8 | - |
| | SFL | 26.2 | 26.8 | - |
| | DNF | 23.8 | 24.4 | - |
| | FCN | 23.7 | 23.9 | - |
| | VAC | 21.2 | 22.3 | - |
| | CMA | 21.3 | 21.9 | - |
| | SFL | 24.9 | 25.3 | - |
| | VL-SLT | 21.9 | 22.5 | - |
| | SMKD | 20.8 | 21.0 | - |
| Group 2 (multi-cue) | DNF | 23.1 | 22.9 | - |
| | STMC | 21.1 | 20.7 | - |
| | C2SLR | 20.5 | 20.4 | - |
| Group 3 (Ours) | CVT-SLR w/ AE | 19.87 | 20.17 | [Baidu] (pwd: k42q) or [GoogleDrive] |
| | CVT-SLR w/ VAE | 19.80 | 20.06 | [Baidu] (pwd: 0kga) or [GoogleDrive] |

NB: please refer to our paper for more details.

Visualization

We visualize the key parts of the sign video frames that the model focuses on by using Grad-CAM. To implement this, you can use the open-source Python tool pytorch-grad-cam:

import pytorch_grad_cam
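A minimal usage sketch is shown below; it uses a torchvision ResNet purely for illustration, and choosing the last convolutional stage as the target layer is our assumption, so swap in the CVT-SLR visual backbone and an appropriate layer:

import numpy as np
import torch
from torchvision.models import resnet18
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

model = resnet18().eval()                          # stand-in for the visual backbone
cam = GradCAM(model=model, target_layers=[model.layer4[-1]])

input_tensor = torch.randn(1, 3, 224, 224)         # placeholder frame batch
grayscale_cam = cam(input_tensor=input_tensor)[0]  # heatmap in [0, 1], shape (H, W)
rgb_img = np.float32(np.random.rand(224, 224, 3))  # placeholder frame scaled to [0, 1]
overlay = show_cam_on_image(rgb_img, grayscale_cam, use_rgb=True)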

To generate the cross alignment matrices, here are some hints:

import torch.nn.functional as F

a = ret["conv_logits"].squeeze(1)      # logits from the visual (conv) stream
b = ret["sequence_logits"].squeeze(1)  # logits from the sequence stream
T = 1  # temperature
simi_matrix = F.softmax(T * (a @ b.T), dim=-1)
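To inspect the resulting alignment, you could render the matrix as a heatmap, e.g. with matplotlib:

import matplotlib.pyplot as plt

plt.imshow(simi_matrix.detach().cpu().numpy(), cmap="viridis")
plt.colorbar()
plt.title("Cross alignment matrix")
plt.savefig("alignment_matrix.png")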

Citation

If you find this repository useful, please consider citing:

@inproceedings{zheng2023cvt,
  title={Cvt-slr: Contrastive visual-textual transformation for sign language recognition with variational alignment},
  author={Zheng, Jiangbin and Wang, Yile and Tan, Cheng and Li, Siyuan and Wang, Ge and Xia, Jun and Chen, Yidong and Li, Stan Z},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={23141--23150},
  year={2023}
}