Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
This repository contains the code for the TMLR paper Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning, by Maurits Bleeker<sup>1</sup>, Mariya Hendriksen<sup>1</sup>, Andrew Yates<sup>1</sup>, and Maarten de Rijke<sup>1</sup>.
The implementation builds upon the codebase of Latent Target Decoding (LTD).
<sup>1</sup>University of Amsterdam, The Netherlands
News
- Jul 2024: The paper has been accepted for publication in TMLR.
- Feb 2024: Initial release on arXiv.
Requirements
To set up the environment, install the requirements using the provided YAML file:
```bash
conda env create --file src/environment.yaml
```

This command will create a conda environment named `contrastive-shortcuts`. Activate the created environment:

```bash
source activate contrastive-shortcuts
```
Training the models
For local development, execute the following command:
```bash
python src/trainer.py --yaml_file src/configs/{f30k, coco}/development_local.yaml
```

To train a model, run `python src/trainer.py` and provide a base config in YAML format using `--yaml_file <config path.yaml>`.
Hyperparameters can be overridden using command line flags. For example:
```bash
python src/trainer.py --yaml_file src/configs/f30k/development_local.yaml --experiment.wandb_project <your project name>
```

The recommended approach is to keep a fixed base config for each experiment and override only the specific hyperparameters that differ between training/evaluation settings.
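As a rough illustration of how such dotted flags could map onto a nested YAML base config, the sketch below merges an override into the loaded config. This is a hypothetical helper, not the repository's actual trainer logic.

```python
# Illustrative sketch (not the repository's actual implementation) of merging a
# dotted command-line override such as --experiment.wandb_project into a nested config.
import yaml


def apply_override(config: dict, dotted_key: str, value) -> dict:
    """Set the nested config entry addressed by a dotted key, e.g. 'experiment.wandb_project'."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})  # create intermediate dicts if missing
    node[keys[-1]] = value
    return config


with open("src/configs/f30k/development_local.yaml") as f:  # base config from the repo
    config = yaml.safe_load(f)

apply_override(config, "experiment.wandb_project", "my-wandb-project")
```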
All training and evaluation were conducted using a SLURM-based scheduling system.
Data loading and preparation
We implemented a PyTorch Dataloader class that loads the images from the memory of the compute node on which training runs. The captions are loaded from either the Flickr30k or MS-COCO annotation file.
Update the `*.yaml` config with the right file paths:

```yaml
img_path:
annotation_file:
annotation_path:
```
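For reference, a minimal sketch of what such an image-caption dataset could look like is shown below. The class name and annotation layout are assumptions for illustration and do not mirror the repository's actual Dataloader implementation.

```python
# Hypothetical image-caption dataset sketch; field names and annotation layout are
# assumptions, not the repository's actual Dataloader class.
import json
import os

from PIL import Image
from torch.utils.data import Dataset


class CaptionDataset(Dataset):
    def __init__(self, img_path: str, annotation_file: str, transform=None):
        self.img_path = img_path
        self.transform = transform
        # Assumed annotation layout: a JSON list of {"image": <file name>, "caption": <text>}.
        with open(annotation_file) as f:
            self.annotations = json.load(f)

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        record = self.annotations[idx]
        image = Image.open(os.path.join(self.img_path, record["image"])).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, record["caption"]
```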
Vocabulary class
To create the vocabulary class, run `python utils/vocab.py` with the appropriate input flags.
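Conceptually, the vocabulary maps caption tokens to integer ids. The sketch below shows one common way to build such a mapping; the tokenization, special tokens, and frequency threshold used by `utils/vocab.py` may differ.

```python
# Illustrative vocabulary construction; utils/vocab.py may tokenize and threshold differently.
from collections import Counter


def build_vocab(captions, min_count=4):
    """Map each sufficiently frequent caption token to an integer id."""
    counts = Counter(token for caption in captions for token in caption.lower().split())
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
    for token, count in counts.items():
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab
```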
Job files
Job and hyperparameter files to reproduce the experiments can be found in `src/jobs/{coco, f30k}/`.
The shortcut experiments (Section 4) are available in the `shortcuts` folder, the LTD experiments in the `LTD` folder, and the IFM experiments in the `IFM` folder (Section 6).
Evaluation
To reproduce results from Section 3, run the following evaluation script (ensure correct file paths):

```bash
sbatch src/jobs/{coco, f30k}/snellius/shortcuts/{clip, vse}/{clip, vse}_{coco, f30k}_shortcut_experiments_eval.job
```

Next, copy all the RSUM values to `notebooks/visualizations/visualization.ipynb` to generate the plot.
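RSUM here refers to the usual retrieval summary metric, i.e. the sum of Recall@{1, 5, 10} over both retrieval directions; a small illustrative computation (not the notebook's exact code) is shown below.

```python
# RSUM = R@1 + R@5 + R@10 for image-to-text retrieval plus the same for text-to-image retrieval.
def rsum(i2t_recalls, t2i_recalls):
    """Both arguments are (R@1, R@5, R@10) tuples, in percent."""
    return sum(i2t_recalls) + sum(t2i_recalls)


# Example: rsum((55.0, 82.0, 90.0), (41.0, 71.0, 81.0)) -> 420.0
```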
The results from Section 6 are generated using `notebooks/Evaluation.ipynb`.
Citing and Authors
If you find this repository helpful, feel free to cite our paper "Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning":
```bibtex
@article{bleeker-2024-demonstrating,
    title={Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning},
    author={Bleeker, Maurits and Hendriksen, Mariya and Yates, Andrew and de Rijke, Maarten},
    journal={Transactions on Machine Learning Research},
    url={https://openreview.net/forum?id=gfANevPraH},
    year={2024}
}
```