# Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks

<p align="center"> <img src="demo_uninlx.png" width="784"/> </p>

[arXiv] | [video presentation at ICCV]
## Requirements
- PyTorch 1.8 or higher
- CLIP (install with `pip install git+https://github.com/openai/CLIP.git`)
- transformers (install with `pip install transformers`)
- cococaption
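
As a quick sanity check that the requirements import correctly, the minimal sketch below loads a CLIP model and a transformers tokenizer. The `ViT-B/32` backbone and the GPT-2 tokenizer are only illustrative choices for the check, not necessarily the components this repository uses.

```python
# Environment sanity check: verifies that PyTorch, CLIP and transformers are installed.
import torch
import clip
from transformers import GPT2Tokenizer  # any transformers class works for this check

device = "cuda" if torch.cuda.is_available() else "cpu"

# "ViT-B/32" is just an example backbone for the check;
# see clip_model.py for the encoder actually used in this repository.
model, preprocess = clip.load("ViT-B/32", device=device)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

print("PyTorch:", torch.__version__)
print("CLIP visual encoder loaded:", model.visual.__class__.__name__)
print("Tokenizer vocab size:", tokenizer.vocab_size)
```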
## Images Download
- COCO
- MPI. Rename the folder to `mpi`
- Flickr30K. Rename the folder to `flickr30k`
- VCR
- ImageNet (ILSVRC2012). Rename the folder to `ImageNet`
- Visual Genome v1.2. Rename the folder to `VG_100K`
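
After downloading and renaming, a small sketch like the one below can confirm the folders are in place. The single `images/` root and the `coco`/`vcr` folder names are assumptions for illustration (only `mpi`, `flickr30k`, `ImageNet`, and `VG_100K` are named above); adjust the paths to your own layout.

```python
# Check that the renamed dataset folders exist.
# The common "images" root and the "coco"/"vcr" names are assumed, not prescribed by the repo.
import os

IMAGE_ROOT = "images"
expected_dirs = ["coco", "mpi", "flickr30k", "vcr", "ImageNet", "VG_100K"]

for d in expected_dirs:
    path = os.path.join(IMAGE_ROOT, d)
    status = "ok" if os.path.isdir(path) else "MISSING"
    print(f"{path}: {status}")
```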
## Data
The training and test data (combined for all datasets) can be found here.
## Annotations
The annotations in the format that cococaption expects can be found here. Please place them inside the `cococaption` folder.
## Code
- `train_nlx.py`: script for training only
- `test_datasets.py`: script for validation/testing of all epochs on all 7 NLE tasks
- `clip_model.py`: script for the vision backbone we use (the CLIP visual encoder)
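
For orientation, here is a minimal sketch of encoding an image with a CLIP visual encoder, as used for the vision backbone. The `ViT-B/32` backbone name and the pooled-feature usage are assumptions for illustration; the actual encoder configuration is defined in `clip_model.py`.

```python
# Sketch: encode an image with CLIP's visual encoder.
# "ViT-B/32" and the pooled embedding are illustrative assumptions;
# the repository's actual setup lives in clip_model.py.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("demo_uninlx.png")).unsqueeze(0).to(device)

with torch.no_grad():
    # encode_image returns a pooled visual embedding (512-d for ViT-B/32)
    visual_features = model.encode_image(image)

print(visual_features.shape)  # torch.Size([1, 512]) for ViT-B/32
```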