IMAD

This repo contains code and published data for the AINL 2023 paper IMAD: IMage-Augmented multi-modal Dialogue.

Our dataset serves the task of interpreting an image in the context of a dialogue. The published code can help with:

  1. Classifying whether an utterance is replaceable with an image
  2. Finding the best image for an utterance
  3. Generating the utterance that was replaced with an image

Data

The IMAD Dataset is created from multiple dialogue dataset sources and Unsplash images. Every sample in the dataset consists of:

  1. The dialogue context
  2. An image
  3. The replaced utterance

Example
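To illustrate the structure above, a sample might be laid out as follows. This is a hypothetical sketch: the field names, dialogue text, and identifier are illustrative and are not taken from the actual dataset schema.

```python
# Hypothetical sample layout: a dialogue context, the Unsplash image
# that replaced the final utterance, and the original utterance itself.
sample = {
    "context": [
        "Hi! How was your trip?",
        "It was great, you should see the view from the hotel.",
    ],
    "image_id": "abc123",  # illustrative Unsplash photo identifier
    "replaced_utterance": "Here is the view from my window.",
}

print(sample["image_id"])
```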

The dataset is available with image_id values at HuggingFace. Due to the images' specific license we are unable to publish them, but you can still obtain the full dataset without any difficulty. There are two ways of doing so:

  1. Request the full image dataset from Unsplash and match images via the image_id field in the dataset
  2. Request the full dataset directly via email; see Contacts
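If you go the first route, reconstructing the full dataset amounts to joining the published samples with the downloaded images on image_id. The sketch below assumes the images are stored locally as JPEG files named after their image_id; the directory layout and file extension are assumptions, not the repo's actual convention.

```python
from pathlib import Path


def attach_image_paths(samples, images_dir):
    """Match each sample's image_id to a local image file.

    Returns only the samples whose image was found, with an added
    "image_path" field pointing at the matching file.
    """
    images_dir = Path(images_dir)
    matched = []
    for sample in samples:
        candidate = images_dir / f"{sample['image_id']}.jpg"
        if candidate.exists():
            matched.append({**sample, "image_path": str(candidate)})
    return matched
```

Usage would look like `attach_image_paths(text_samples, "unsplash_images/")`, where `text_samples` is the list of records loaded from HuggingFace.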

Code

Replace Text with Image

This tool classifies whether an utterance could potentially be replaced with an image. For this purpose, the data should contain the following features:

  1. Image Score
  2. Sentence Similarity
  3. BLEU
  4. Maximum Entity Score
  5. Thresholding

Classification is performed with a model from the models directory. All of the features are generated with models from the features scripts directory. An example of usage is shown in the Text Replacing Tutorial. Note that the scripts use Paths, which is essential to them.
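The classification step can be sketched as follows. This is a minimal illustration, assuming a scikit-learn-style model with a `predict` method; the feature key names and ordering are assumptions chosen to mirror the feature list above, not the repo's actual identifiers.

```python
# Feature ordering assumed by the (hypothetical) trained classifier,
# mirroring the five features listed above.
FEATURES = [
    "image_score",
    "sentence_similarity",
    "bleu",
    "max_entity_score",
    "threshold_flag",
]


def to_feature_vector(sample):
    """Arrange a sample's features in the order the model expects."""
    return [float(sample[name]) for name in FEATURES]


def is_replaceable(sample, model):
    """Return True if the model predicts the utterance can be replaced."""
    return bool(model.predict([to_feature_vector(sample)])[0])
```

Any model exposing `predict` on a 2D feature array (for example, one loaded from the models directory) can be plugged into `is_replaceable`.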

Find better image with VQA

This tool finds a better image with the help of BLIP VQA. In short, it finds the top-N images (N is specified) closest to the utterance and then scores them with the VQA model. This is performed with models from the features scripts directory. An example of usage is shown in the Image Replacing Tutorial.
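The two-stage search described above can be sketched as below. Here `similarity_fn` and `vqa_score_fn` are placeholders for the repo's BLIP-based text-image similarity and VQA scoring models; their names and signatures are assumptions.

```python
def find_best_image(utterance, images, similarity_fn, vqa_score_fn, top_n=5):
    """Two-stage image search: similarity shortlist, then VQA re-ranking.

    Stage 1 keeps the top_n images closest to the utterance; stage 2
    re-scores that shortlist with the VQA model and returns the best one.
    """
    shortlist = sorted(
        images,
        key=lambda img: similarity_fn(utterance, img),
        reverse=True,
    )[:top_n]
    return max(shortlist, key=lambda img: vqa_score_fn(utterance, img))
```

The design point is that the cheap similarity model prunes the candidate pool so the more expensive VQA model only scores N images instead of the full collection.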

Paths

This is a special dataclass that contains all the paths used in the scripts.
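A minimal sketch of what such a dataclass might look like; the field names and default locations here are assumptions for illustration, not the repo's actual values.

```python
from dataclasses import dataclass


@dataclass
class Paths:
    """Central place for the file-system locations the scripts rely on."""
    data_dir: str = "data/"
    models_dir: str = "models/"
    features_scripts_dir: str = "features scripts/"


# Scripts can then share one Paths instance instead of hard-coded strings.
paths = Paths()
print(paths.models_dir)
```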

License

All the code, except the fine-tuned BLIP, is licensed under the Apache 2.0 license. The fine-tuned version of BLIP is licensed under BSD-3. The text version of the IMAD Dataset is licensed under Creative Commons NC; the dataset with images is licensed under the Unsplash License.

References

To cite this article, please use the following BibTeX reference:

@misc{viktor2023imad,
      title={IMAD: IMage-Augmented multi-modal Dialogue}, 
      author={Moskvoretskii Viktor and Frolov Anton and Kuznetsov Denis},
      year={2023},
      eprint={2305.10512},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Or via MLA citation:

Moskvoretskii, Viktor, et al. “IMAD: IMage-Augmented multi-modal Dialogue.” (2023).

Contacts

Feel free to reach out to us at vvmoskvoretskiy@yandex.ru for inquiries, collaboration suggestions, or data requests related to our work.