Our Project

[WACV 2024] Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models. (arXiv)

If you find this repository useful, please cite our paper :)

    @article{yi2023augment,
      title={Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models},
      author={Jingru Yi and Burak Uzkent and Oana Ignat and Zili Li and Amanmeet Garg and Xiang Yu and Linda Liu},
      journal={arXiv preprint arXiv:2311.02536},
      year={2023}
    }

Introduction

Grounding-based vision and language models have been successfully applied to low-level vision tasks, aiming to precisely locate objects referred to in captions. The effectiveness of grounding representation learning heavily relies on the scale of the training dataset. Despite being a useful data enrichment strategy, data augmentation has received minimal attention in existing vision and language tasks, as augmentation for image-caption pairs is non-trivial. In this study, we propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations. Specifically, we apply text-conditioned color jittering and horizontal flipping to ensure semantic consistency between images and captions. To guarantee image-caption correspondence in the training samples, we modify the captions according to pre-defined keywords when applying horizontal flipping. Additionally, inspired by recent masked signal reconstruction, we propose pixel-level masking as a novel form of data augmentation. While we demonstrate our data augmentation method with the MDETR framework, the proposed approach is applicable to common grounding-based vision and language tasks with other frameworks. Finally, we show that an image encoder pretrained on large-scale image and language datasets (such as CLIP) can further improve the results. Through extensive experiments on three commonly applied datasets (Flickr30k, referring expressions, and GQA), our method demonstrates advanced performance over the state of the art on various metrics.

<p align="center"> <img src="imgs/fig1.png" width="600"> </p>

Our Method

We propose to use text-conditioned and text-unconditioned augmentations for the phrase grounding task.

<p align="center"> <img src="imgs/fig2.png" width="600"> </p> <p align="center"> <img src="imgs/fig3.png" width="800"> </p> <p align="center"> <img src="imgs/table1.png" width="800"> </p>
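The text-conditioned horizontal flip described above can be illustrated with a short sketch. The snippet below is not the repository's implementation; `flip_caption`, `hflip_pair`, and the keyword list are hypothetical, and it only assumes what the abstract states: when an image is mirrored, directional keywords in the caption are swapped and the boxes are flipped so that image and caption stay consistent.

```python
# Minimal sketch of a text-conditioned horizontal flip (hypothetical helpers,
# not the repository's exact implementation).
import re
import torch
from PIL import Image
import torchvision.transforms.functional as F

# Pre-defined directional keywords; the exact list used in the paper may differ.
KEYWORD_SWAPS = {"left": "right", "right": "left"}

def flip_caption(caption: str) -> str:
    """Swap directional keywords so the caption matches the flipped image."""
    def swap(match):
        word = match.group(0)
        repl = KEYWORD_SWAPS[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl
    pattern = re.compile(r"\b(" + "|".join(KEYWORD_SWAPS) + r")\b", re.IGNORECASE)
    return pattern.sub(swap, caption)

def hflip_pair(image: Image.Image, boxes: torch.Tensor, caption: str):
    """Horizontally flip the image and xyxy boxes, and update the caption."""
    w, _ = image.size
    image = F.hflip(image)
    # Boxes are [x_min, y_min, x_max, y_max]; mirror x-coordinates around the width.
    boxes = boxes.clone()
    boxes[:, [0, 2]] = w - boxes[:, [2, 0]]
    return image, boxes, flip_caption(caption)
```

For example, a caption "a dog to the left of a bench" becomes "a dog to the right of a bench" after the flip, keeping the referred region consistent with the mirrored image.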

How to start

The code in this repo contains only the augmentation part. In our paper, we use MDETR as the base architecture; you may incorporate the augmentations into other phrase grounding architectures.
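To illustrate the text-unconditioned side, below is a minimal sketch of pixel-level masking on an image tensor. The patch size, masking ratio, and fill value are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of pixel-level masking as a text-unconditioned augmentation.
# Patch size and mask ratio are illustrative assumptions.
import torch

def mask_pixels(image: torch.Tensor, patch_size: int = 16, mask_ratio: float = 0.3):
    """Randomly zero out a fraction of non-overlapping patches of a CxHxW tensor."""
    c, h, w = image.shape
    num_h, num_w = h // patch_size, w // patch_size
    # Decide which patches to mask.
    mask = torch.rand(num_h, num_w) < mask_ratio
    masked = image.clone()
    for i in range(num_h):
        for j in range(num_w):
            if mask[i, j]:
                masked[:, i * patch_size:(i + 1) * patch_size,
                          j * patch_size:(j + 1) * patch_size] = 0.0
    return masked
```

Because the masking does not depend on the caption, it can be applied after data preparation without any caption modification.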

To apply the hflip augmentation randomly, we insert it into the ConvertCocoPolysToMask function of MDETR's datasets/coco.py, where annotations are prepared when loading the training dataset. Other augmentations can be applied after data preparation. During evaluation/validation, we disable the augmentation functions.
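A hypothetical sketch of that insertion point is shown below: a wrapper around an annotation-preparation callable (such as ConvertCocoPolysToMask) that applies the flip with some probability during training and is switched off for evaluation. The wrapper name, flip probability, and target keys are assumptions about the data layout rather than the repository's exact code.

```python
# Hypothetical wrapper around MDETR-style annotation preparation; the flip is
# only applied during training and disabled for evaluation/validation.
import random

class PrepareWithRandomHflip:
    """Wrap an annotation-preparation callable and randomly apply the
    text-conditioned horizontal flip during training only."""

    def __init__(self, prepare, flip_prob: float = 0.5, training: bool = True):
        self.prepare = prepare          # e.g. an instance of ConvertCocoPolysToMask
        self.flip_prob = flip_prob
        self.training = training        # set False for evaluation/validation

    def __call__(self, image, target):
        image, target = self.prepare(image, target)
        if self.training and random.random() < self.flip_prob:
            # hflip_pair is the sketch shown earlier; the target layout
            # ("boxes" in xyxy format, "caption" as a string) is assumed.
            image, target["boxes"], target["caption"] = hflip_pair(
                image, target["boxes"], target["caption"]
            )
        return image, target
```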

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.