CV-SLT

This repo holds the code for the paper: Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment.

CV-SLT builds upon the strong baseline MMTLB; many thanks for their great work!

Introduction

We propose CV-SLT to facilitate direct and sufficient cross-modal alignment between sign language videos and spoken language text. Specifically, our CV-SLT consists of two paths with two KL divergences to regularize the outputs of the encoder and decoder, respectively. In the prior path, the model solely relies on visual information to predict the target text; whereas in the posterior path, it simultaneously encodes visual information and textual knowledge to reconstruct the target text. Experiments conducted on public datasets (PHOENIX14T and CSL-daily) demonstrate the effectiveness of our framework, achieving new state-of-the-art results while significantly alleviating the cross-modal representation discrepancy.
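Below is a minimal, self-contained PyTorch sketch of the two-path idea, intended only to illustrate the mechanism described above; module and variable names (e.g. TwoPathSketch, prior_net, posterior_net) are illustrative and do not correspond to this repo's actual code. The prior path conditions only on video features, the posterior path additionally conditions on text features, and a Gaussian KL term pulls the two together; CV-SLT applies such KL regularizers at both the encoder and decoder outputs.

```python
import torch
import torch.nn as nn

class TwoPathSketch(nn.Module):
    """Conceptual sketch (not the repo's implementation): a prior path
    (video only) and a posterior path (video + text) each predict a
    Gaussian latent; a KL term aligns the prior with the posterior."""

    def __init__(self, d_model=512, latent_dim=64):
        super().__init__()
        # Prior path sees only the visual representation.
        self.prior_net = nn.Linear(d_model, 2 * latent_dim)
        # Posterior path additionally sees the target-text representation.
        self.posterior_net = nn.Linear(2 * d_model, 2 * latent_dim)

    @staticmethod
    def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
        # KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over the latent dim.
        return 0.5 * (
            logvar_p - logvar_q
            + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
            - 1.0
        ).sum(-1)

    def forward(self, video_feat, text_feat):
        # video_feat, text_feat: (batch, d_model) pooled representations.
        mu_p, logvar_p = self.prior_net(video_feat).chunk(2, dim=-1)
        mu_q, logvar_q = self.posterior_net(
            torch.cat([video_feat, text_feat], dim=-1)
        ).chunk(2, dim=-1)
        # One KL term shown here; CV-SLT uses two such KL divergences, at the
        # encoder and decoder outputs respectively.
        kl = self.gaussian_kl(mu_q, logvar_q, mu_p, logvar_p).mean()
        # Reparameterized sample from the posterior, used during training.
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
        return z, kl

model = TwoPathSketch()
video = torch.randn(4, 512)  # stand-in for pooled video features
text = torch.randn(4, 512)   # stand-in for pooled target-text features
z, kl = model(video, text)
print(z.shape, kl.item())
```

At test time only the prior path can be used, since the target text is not available.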

Detailed model framework of CV-SLT

Performance

| Dataset | R (Dev) | B1 (Dev) | B2 (Dev) | B3 (Dev) | B4 (Dev) | R (Test) | B1 (Test) | B2 (Test) | B3 (Test) | B4 (Test) |
|---|---|---|---|---|---|---|---|---|---|---|
| PHOENIX14T | 55.05 | 55.35 | 42.99 | 35.07 | 29.55 | 54.54 | 54.76 | 42.80 | 34.97 | 29.52 |
| CSL-daily | 56.36 | 58.05 | 44.73 | 35.14 | 28.24 | 57.06 | 58.29 | 45.15 | 35.77 | 28.94 |

R denotes ROUGE and B1-B4 denote BLEU-1 to BLEU-4, reported on the Dev and Test sets.

Implementation

Prerequisites

# Create and activate the conda environment from the provided spec
conda env create -f environment.yml
conda activate slt
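
Optionally, a quick sanity check after activating the environment (assuming PyTorch is installed via environment.yml and a CUDA GPU will be used for training):

```python
# Confirm that PyTorch imports and a CUDA device is visible;
# the distributed training commands below assume at least one GPU.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```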

Data preparation

The raw data come from the PHOENIX14T and CSL-daily datasets.

Please refer to the implementation of MMTLB for preparing the data and models, as CV-SLT focuses only on the SLT training stage; the required processed data and pre-trained models are those produced by that pipeline.

Note that the data paths are configured in the *.yaml files and can be changed to any location you prefer.

We back up the checkpoints used in this repo here.

Train and Evaluate

Train

dataset=phoenix-2014t  # phoenix-2014t or csl-daily
python -m torch.distributed.launch \
    --nproc_per_node 1 \
    --use_env training.py \
    --config experiments/configs/SingleStream/${dataset}_vs2t.yaml

Evaluate

Upon finishing training, you can evaluate the model with:

dataset=phoenix-2014t  # phoenix-2014t or csl-daily
python -m torch.distributed.launch \
    --nproc_per_node 1 \
    --use_env prediction.py \
    --config experiments/configs/SingleStream/${dataset}_vs2t.yaml

You can also reproduce our reported performance with our trained checkpoints.

We also provide a trained g2t checkpoint for CSL-daily to help re-train CV-SLT, since it is missing from the MMTLB repo.

TODO

Citation

@InProceedings{Zhao_2024_AAAI,
    author    = {Zhao, Rui and Zhang, Liang and Fu, Biao and Hu, Cong and Su, Jinsong and Chen, Yidong},
    title     = {Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment},
    booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
    year      = {2024},
}