<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a>
# SSVP-SLT: Self-supervised Video Pretraining for Sign Language Translation
This repository contains research code for the paper [Towards Privacy-Aware Sign Language Translation at Scale](https://aclanthology.org/2024.acl-long.467).
<p align="middle">
  <img src=".github/ssvp_slt_overview.png" alt="SSVP-SLT Overview">
</p>
<p align="middle">
  <img width=50% src=".github/ssvp_slt_language_supervised.png" alt="SSVP-SLT Overview">
</p>

SSVP-SLT uses masked autoencoding (MAE) over anonymized, unannotated videos as self-supervised pretraining to learn continuous sign language representations at scale. The learned representations are then transferred to the supervised, gloss-free sign language translation task. SSVP-SLT outperforms prior state-of-the-art methods on the ASL-to-English How2Sign benchmark by over 3 BLEU points in both the finetuned and zero-shot settings.
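To give a flavor of the pretraining stage, the PyTorch snippet below sketches one MAE-style training step: only a small fraction of flattened space-time video patches is encoded, and a lightweight decoder reconstructs the pixels of the masked patches. This is a deliberately minimal illustration, not the repository's actual (Hiera-based) model code; the patch size, masking ratio, module sizes, and the omission of positional embeddings are all simplifying assumptions.

```python
# Minimal, illustrative MAE-style pretraining step (NOT the actual SSVP-SLT model code).
# Sizes, masking ratio, and modules are placeholder assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

# One "patch" here is a flattened 2x16x16 RGB space-time tube.
patch_dim, embed_dim, mask_ratio = 3 * 2 * 16 * 16, 256, 0.9

embed = nn.Linear(patch_dim, embed_dim)                      # patch embedding
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True), num_layers=2
)
decoder = nn.Linear(embed_dim, patch_dim)                    # lightweight pixel decoder
mask_token = nn.Parameter(torch.zeros(embed_dim))            # learned token for masked slots


def mae_step(patches: torch.Tensor) -> torch.Tensor:
    """One MAE step. `patches`: (batch, num_patches, patch_dim); positional embeddings omitted."""
    b, n, _ = patches.shape
    num_keep = int(n * (1 - mask_ratio))

    # Randomly keep a small subset of patches visible; the rest are masked out.
    shuffle = torch.rand(b, n).argsort(dim=1)
    keep_idx, drop_idx = shuffle[:, :num_keep], shuffle[:, num_keep:]

    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, patch_dim))
    latent = encoder(embed(visible))                         # encode only the visible patches

    # Rebuild the full sequence: encoded tokens at visible positions, mask tokens elsewhere.
    full = mask_token.repeat(b, n, 1).scatter(
        1, keep_idx.unsqueeze(-1).expand(-1, -1, embed_dim), latent
    )

    recon = decoder(full)                                    # predict pixels at every position
    pred = torch.gather(recon, 1, drop_idx.unsqueeze(-1).expand(-1, -1, patch_dim))
    target = torch.gather(patches, 1, drop_idx.unsqueeze(-1).expand(-1, -1, patch_dim))
    return F.mse_loss(pred, target)                          # loss only on masked patches


loss = mae_step(torch.randn(2, 128, patch_dim))              # toy batch of random "video" patches
loss.backward()
```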
## Installation
We provide installation instructions in INSTALL.md.
## Usage
### 1. Preparing the data
We describe how to prepare the datasets in DATASETS.md.
### 2. Pretraining
- MAE pretraining instructions are in pretraining/README.md.
- Joint MAE & CLIP/FLIP pretraining instructions are in pretraining_clip/README.md.
### 3. Sign Language Translation (SLT)
Instructions for feature extraction and SLT training and evaluation are in translation/README.md.
## DailyMoth-70h
As part of this project, we release the DailyMoth-70h (DM-70) dataset under a CC-BY-NC 4.0 license.
You can find an overview of the data, along with download and preparation instructions, in DATASETS.md.
Alternatively, download the files manually via these links:
<table><tbody>
<!-- START TABLE -->
<!-- TABLE HEADER -->
<th valign="bottom">Subset</th>
<th valign="bottom">Link</th>
<th valign="bottom">md5</th>
<tr><td align="left">Raw videos</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/dailymoth-70h/raw_videos.tar.gz">download</a></td>
<td align="center"><tt>875ffe4eeac3a37e50b4202c2b4996d2</tt></td>
</tr>
<tr><td align="left">Blurred clips</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/dailymoth-70h/blurred_clips.tar.gz">download</a></td>
<td align="center"><tt>a2819c7b06a8b38eb7686e4dc90a7433</tt></td>
</tr>
<tr><td align="left">Unblurred clips</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/dailymoth-70h/unblurred_clips.tar.gz">download</a></td>
<td align="center"><tt>3e69046f6cf415cec89c3544d0523325</tt></td>
</tr>
<tr><td align="left">Manifest files</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/dailymoth-70h/manifests.tar.gz">download</a></td>
<td align="center"><tt>69e500cc5cfad3133c4b589428865472</tt></td>
</tr>
</tbody></table>

> [!NOTE]
> Check out our paper for detailed information on the DailyMoth-70h dataset.
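If you script the manual download, you can verify the archives against the MD5 checksums listed above. The sketch below is one possible way to do this with the Python standard library; the URLs and checksums come from the table, while the output directory and iteration order are arbitrary choices.

```python
# Download the DailyMoth-70h archives and verify them against the MD5 checksums
# from the table above. Output directory and iteration order are arbitrary choices.
import hashlib
import urllib.request
from pathlib import Path

BASE_URL = "https://dl.fbaipublicfiles.com/dailymoth-70h/"
CHECKSUMS = {
    "raw_videos.tar.gz": "875ffe4eeac3a37e50b4202c2b4996d2",
    "blurred_clips.tar.gz": "a2819c7b06a8b38eb7686e4dc90a7433",
    "unblurred_clips.tar.gz": "3e69046f6cf415cec89c3544d0523325",
    "manifests.tar.gz": "69e500cc5cfad3133c4b589428865472",
}


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 of a file without loading it into memory all at once."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


out_dir = Path("dailymoth-70h")
out_dir.mkdir(exist_ok=True)
for name, expected in CHECKSUMS.items():
    target = out_dir / name
    if not target.exists():
        print(f"Downloading {name} ...")
        urllib.request.urlretrieve(BASE_URL + name, target)
    status = "OK" if md5sum(target) == expected else "CHECKSUM MISMATCH"
    print(f"{name}: {status}")
```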
## Citing our work
If you find our work useful in your research, please consider citing:

    @inproceedings{rust-etal-2024-towards,
        title = "Towards Privacy-Aware Sign Language Translation at Scale",
        author = "Rust, Phillip and Shi, Bowen and Wang, Skyler and Camgoz, Necati Cihan and Maillard, Jean",
        booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
        year = "2024",
        address = "Bangkok, Thailand",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2024.acl-long.467",
        pages = "8624--8641",
    }
## References
This codebase is heavily influenced by the mae and mae_st repositories. Our models are based on code from Hiera, HF Transformers, OpenCLIP, and Fairseq.
## License
This project is primarily under the CC-BY-NC 4.0 license; see LICENSE for details. Portions of the project are available under separate license terms: Transformers is licensed under the Apache-2.0 license and OpenCLIP is licensed under the OpenCLIP license.