ZeroNLG

PyTorch implementation of our paper published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2024:

ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation

Bang Yang, Fenglin Liu, Yuexian Zou, Xian Wu, Yaowei Wang, and David A. Clifton.

[TPAMI], [arXiv]

Update Notes

[2023-12-01] Release notebooks and upgrade to zeronlg==1.0.1

[2023-04-06] Release the code, data, and pre-trained models

Environment

# clone the repo
git clone https://github.com/yangbang18/ZeroNLG

# enter the repo
cd ZeroNLG

# install a proper version of PyTorch
# see https://pytorch.org/get-started/previous-versions/
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117

# optional: install the version of transformers we tested with
pip install transformers==4.12.5

# install this repo in editable mode
pip install -e .
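
To confirm the installation succeeded, you can import the package and query its installed version with the standard library. A minimal check (the version shown is the latest one noted in the Update Notes):

import zeronlg  # should import without errors
from importlib.metadata import version
print(version('zeronlg'))  # e.g., 1.0.1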

Note:

Quick Start

Visual Captioning:

from zeronlg import ZeroNLG

# Automatically download models pre-trained for visual captioning from Huggingface Hub
model = ZeroNLG('zeronlg-4langs-vc')

# `images` can be a remote image URL, a local image/video file, etc.
# `lang` should be one of English ('en'), Chinese ('zh'), German ('de'), or French ('fr')
url = './asserts/dogs.webp'
model.forward(images=url, lang='en', num_beams=3, task='caption')
# ["dogs playing in the snow"]

model.forward(images=url, lang='zh', num_beams=3, task='caption')
# ["狗 在 雪 地 里 玩 耍"]

# Alternatively, you can call the task-specific forward function
model.forward_caption(images=url, lang='en', num_beams=3)
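
Since the same checkpoint covers all four languages, you can also caption the same image in each of them with a simple loop. The sketch below only reuses the calls shown above:

from zeronlg import ZeroNLG

model = ZeroNLG('zeronlg-4langs-vc')
url = './asserts/dogs.webp'

# generate one caption per supported language
for lang in ['en', 'zh', 'de', 'fr']:
    print(lang, model.forward_caption(images=url, lang=lang, num_beams=3))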

Machine Translation

from zeronlg import ZeroNLG

# Automatically download models pre-trained for machine translation from Huggingface Hub
model = ZeroNLG('zeronlg-4langs-mt')

# Translating English into Chinese
# Note: the multilingual encoder is language-agnostic, so the `lang` below means the language to be generated
model.forward_translate(texts='a girl and a boy are playing', lang='zh', num_beams=3)
# ["一 个 女 孩 和 一 个 男 孩 一 起 玩"]

Zero-Shot Performance

Visual captioning

Model: zeronlg-4langs-vc's multilingual decoder + CLIP's ViT-B-32 image encoder.

| Dataset | Language | Type | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|---|---|
| Flickr30K | English | Image | 46.4 | 27.2 | 15.5 | 8.9 | 13.0 | 31.3 | 21.0 | 7.6 |
| Flickr30K | Chinese | Image | 45.3 | 25.5 | 14.6 | 8.4 | - | 31.8 | 18.0 | - |
| Flickr30K | German | Image | 41.9 | 21.1 | 11.2 | 5.7 | - | 21.2 | 17.1 | - |
| Flickr30K | French | Image | 19.8 | 9.5 | 5.0 | 2.8 | - | 18.6 | 24.8 | - |
| COCO | English | Image | 47.5 | 29.0 | 16.8 | 9.6 | 14.4 | 34.9 | 29.9 | 8.7 |
| MSR-VTT | English | Video | 52.2 | 31.9 | 16.6 | 8.7 | 15.0 | 35.4 | 9.9 | - |
| VATEX | English | Video | 42.2 | 24.6 | 12.5 | 6.3 | 11.7 | 29.3 | 9.1 | - |
| VATEX | Chinese | Video | 41.9 | 24.3 | 13.7 | 7.1 | - | 29.6 | 9.8 | - |

Notes:

Machine translation

Model: zeronlg-4langs-mt only.

| Toolkit | En->Zh | En<-Zh | En->De | En<-De | En->Fr | En<-Fr | Zh->De | Zh<-De | Zh->Fr | Zh<-Fr | De->Fr | De<-Fr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SacreBLEU | 14.7 | 8.8 | 20.5 | 21.1 | 22.0 | 24.6 | 7.3 | 11.9 | 5.2 | 16.2 | 16.7 | 18.5 |
| NLTK | 6.0 | 9.2 | 21.6 | 23.2 | 27.2 | 26.8 | 7.8 | 4.6 | 6.1 | 9.7 | 20.9 | 19.6 |

Notes:

Reproduction

Data

Please see data/README.md for more details.

Training

The training process does not involve any validation, i.e., you must choose the best-performing model on a specific dataset yourself. In our experiments, we always use the ZeroNLG checkpoint obtained after 3 epochs of auto-encoding training for downstream evaluations, because it generally performs the best.
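
For example, since the checkpoint of each epoch is saved in a numbered subfolder (see Testing below), you can compare epochs by captioning a held-out sample with each checkpoint. A minimal sketch, assuming `ZeroNLG` accepts a local checkpoint folder just like a Hub model name, with the sample image path as a placeholder:

from zeronlg import ZeroNLG

# epoch checkpoints live at output/2_ZeroNLG_VC/{0,1,2}
for epoch in range(3):
    model = ZeroNLG(f'output/2_ZeroNLG_VC/{epoch}')
    caption = model.forward_caption(images='./asserts/dogs.webp', lang='en', num_beams=3)
    print(f'epoch {epoch}:', caption)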

Stage 1: Cross-Lingual Alignment

# requires about 2 hours on a 24-GB RTX 4090
# (3 epochs, batch_size 128)
python train.py \
--use_amp \
--scales 1 0 0 0 \
--target_languages en zh de fr \
--student_model_name distilbert-base-multilingual-cased \
--output_path output/1_mDistilBERT \
--batch_size 128

Here, we showcase how to use a pre-trained multilingual DistilBERT as the starting point for training the multilingual encoder; other pre-trained models can be specified via `--student_model_name`.
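
Conceptually, this stage distills the teacher's (English) sentence embeddings into the multilingual student: for each (English, target-language) caption pair, both sentences are pushed towards the teacher's embedding of the English one. Below is a minimal sketch of this objective, not the repo's actual implementation; `encode` is assumed to return a tensor of sentence embeddings:

import torch
import torch.nn.functional as F

def alignment_loss(student, teacher, en_texts, xx_texts):
    """Cross-lingual alignment via MSE distillation (illustrative sketch)."""
    with torch.no_grad():
        target = teacher.encode(en_texts)  # frozen teacher embeddings
    # pull the English sentences and their translations towards
    # the teacher's embedding of the English sentences
    loss_en = F.mse_loss(student.encode(en_texts), target)
    loss_xx = F.mse_loss(student.encode(xx_texts), target)
    return loss_en + loss_xx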

Stage 2: Denoising Language Reconstruction (Visual Captioning)

# requires about 6.5 hours on a 24-GB RTX 4090
# (3 epochs, batch_size 32)
python train.py \
--use_amp \
--scales 0 0 1 0 \
--target_languages en zh de fr \
--student_model_name output/1_mDistilBERT \
--output_path output/2_ZeroNLG_VC \
--no_tie_all \
--init_word_embeddings \
--noise_std 0.1
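
Here, `--noise_std 0.1` controls the denoising objective: the sentence embedding is perturbed with Gaussian noise before the decoder reconstructs the original caption, so that at test time the decoder tolerates inputs that are close to, but not exactly, text embeddings (e.g., visual embeddings). A minimal sketch of the corruption step; the function name is a placeholder, not the repo's API:

import torch

def corrupt_embedding(emb: torch.Tensor, noise_std: float = 0.1) -> torch.Tensor:
    """Add zero-mean Gaussian noise to an embedding (denoising auto-encoding)."""
    return emb + noise_std * torch.randn_like(emb)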

Note:

Stage 2: Denoising Language Reconstruction (Machine Translation)

# requires about 6.5 hours on a 24-GB RTX 4090
# (3 epochs, batch_size 32)
python train.py \
--use_amp \
--scales 0 0 1 0 \
--target_languages en zh de fr \
--student_model_name output/1_mDistilBERT \
--output_path output/2_ZeroNLG_MT \
--student_emb_keyname token_embeddings \
--use_masking \
--mask_prob 0.05 \
--noise_std 0.01
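
For machine translation, `--use_masking --mask_prob 0.05` adds a second kind of input corruption: each token is randomly masked with 5% probability before the sentence is reconstructed. A minimal sketch of such masking; names are placeholders, not the repo's API:

import random

def mask_tokens(tokens, mask_prob=0.05, mask_token='[MASK]'):
    """Randomly replace tokens with a mask symbol (input corruption)."""
    return [mask_token if random.random() < mask_prob else tok for tok in tokens]

# mask_tokens(['a', 'girl', 'is', 'playing']) may yield ['a', '[MASK]', 'is', 'playing']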

Note:

Testing (Zero-Shot Transfer)

# visual captioning
## evaluate the model trained after 3 epochs
## `output/2_ZeroNLG_VC` is equivalent to `output/2_ZeroNLG_VC/2`
export model=output/2_ZeroNLG_VC
bash scripts/caption.sh $model

## evaluate the model trained after 1 epoch
export model=output/2_ZeroNLG_VC/0
bash scripts/caption.sh $model

# machine translation
export model=output/2_ZeroNLG_MT
bash scripts/translate.sh $model

# retrieval
export model=output/2_ZeroNLG_VC
bash scripts/retrieval.sh $model

Semi-Supervised Training on Visual Captioning

# training on limited labeled data w/o pre-training
bash scripts/semi.sh coco en
bash scripts/semi.sh msrvtt en
bash scripts/semi.sh flickr30k de
bash scripts/semi.sh flickr30k fr
bash scripts/semi.sh vatex zh

# training on limited labeled data w/ pre-training
export model=output/2_ZeroNLG_VC
bash scripts/semi.sh coco en $model
bash scripts/semi.sh msrvtt en $model
bash scripts/semi.sh flickr30k de $model
bash scripts/semi.sh flickr30k fr $model
bash scripts/semi.sh vatex zh $model

The script will loop over 0.01% (if available), 0.1%, 1%, and 10% labeled data, each three times (we generate the subsets with 3 different seeds); a sketch of the subset sampling follows below.
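
For reference, the sketch below shows one way such seeded subsets could be produced; this is a hypothetical helper for illustration, not the repo's actual data code:

import random

def sample_subset(annotations, ratio, seed):
    """Sample a fixed fraction of labeled examples with a given seed."""
    rng = random.Random(seed)
    k = max(1, int(len(annotations) * ratio))
    return rng.sample(annotations, k)

# e.g., 0.1%, 1%, and 10% subsets, each with three seeds
# subsets = {(r, s): sample_subset(data, r, s)
#            for r in (0.001, 0.01, 0.1) for s in (0, 1, 2)}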

Visualization and More

Please refer to notebooks.

Bugs or Questions?

If you encounter any problems when using the code, or want to report a bug, you can open an issue or email yangbang@pku.edu.cn or fenglin.liu@eng.ox.ac.uk. Please describe the problem in detail so that we can help you better and more quickly!

Citation

Please consider citing our paper if our code, data, or models are useful to your work. Thank you sincerely!

@ARTICLE{Yang2024ZeroNLG,
  author={Yang, Bang and Liu, Fenglin and Zou, Yuexian and Wu, Xian and Wang, Yaowei and Clifton, David A.},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation}, 
  year={2024},
  volume={46},
  number={8},
  pages={5712-5724},
  doi={10.1109/TPAMI.2024.3371376}}

Acknowledgements

Our code is built upon sentence-transformers.