IMAD

This repo contains code and published data for the AINL2023 paper IMAD: IMage Augmented multi-modal Dialogue.

Our dataset targets the task of interpreting an image in the context of a dialogue. The published code can help with:

Classifying whether an utterance can be replaced with an image
Finding the best image for an utterance
Generating the utterance that was replaced with an image

Data

The IMAD dataset was created from multiple dialogue dataset sources and Unsplash images. Each sample consists of:

Context of dialogue
Image
Replaced utterance

Example

The dataset is available on HuggingFace with image ids (image_id) in place of the images. Because of the images' license we are unable to publish them, but you can still obtain the full dataset in one of two ways:

Request the full image dataset from Unsplash and match it via image_id in the dataset
Request the full dataset directly by email (see Contacts)
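
The first option amounts to a join on the image id. A minimal sketch, assuming both the IMAD samples and the Unsplash rows have already been loaded as lists of dicts. Only image_id comes from this README; the Unsplash column names (photo_id, photo_image_url) and the function name are assumptions, so adjust them to the files you actually receive.

```python
def attach_image_urls(samples, unsplash_rows,
                      sample_id_key="image_id",      # named in this README
                      unsplash_id_key="photo_id",    # assumed Unsplash column name
                      url_key="photo_image_url"):    # assumed Unsplash column name
    """Return copies of the samples with an added 'image_url' field.

    Samples whose id is not found in the Unsplash rows get image_url=None.
    """
    # Build a lookup table once so each sample is matched in O(1).
    url_by_id = {row[unsplash_id_key]: row[url_key] for row in unsplash_rows}
    return [{**sample, "image_url": url_by_id.get(sample[sample_id_key])}
            for sample in samples]


if __name__ == "__main__":
    demo_samples = [{"image_id": "abc", "utterance": "Look at this beach!"}]
    demo_rows = [{"photo_id": "abc", "photo_image_url": "https://example.com/abc.jpg"}]
    print(attach_image_urls(demo_samples, demo_rows))
```

Samples left with image_url set to None point to ids missing from your copy of the Unsplash data.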

Code

Replace Text with Image

This tool classifies whether an utterance could be replaced with an image. For this purpose, the data should contain the following features:

Image Score
Sentence Similarity
BLEU
Maximum Entity Score
Thresholding

Classification is performed with the model from the models directory. All features are generated with the models from the features scripts directory. An example of usage is shown in the Text Replacing Tutorial. Note that the scripts rely on Paths (see the Paths section), so it must be configured before running them.
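
The classification step can be sketched as follows. This is an illustration under assumptions, not the repository's actual script. The feature key names and their order are hypothetical, and Thresholding is left out because this README does not say how it is represented. The model is any object with a scikit-learn style predict method, for example one loaded with joblib.load from path2trained_model.

```python
# Hypothetical feature keys and order; the trained model may expect different columns.
FEATURE_ORDER = ("image_score", "sentence_similarity", "bleu", "max_entity_score")


def build_feature_row(features):
    """Turn a dict of named features into a list in a fixed column order."""
    missing = [name for name in FEATURE_ORDER if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(features[name]) for name in FEATURE_ORDER]


def classify(model, feature_dicts):
    """Predict, per utterance, whether it is replaceable with an image.

    `model` only needs a predict(rows) method, e.g. (assumption):
        model = joblib.load("./models/random_forest.joblib")
    """
    rows = [build_feature_row(f) for f in feature_dicts]
    return list(model.predict(rows))
```

Fixing the column order in one place keeps the rows consistent with whatever order was used when the model was trained.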

Find better image with VQA

This tool finds a better image using BLIP VQA. In short, it finds the top-N images (N is specified) that are closest to the utterance and then scores them with the VQA model. This is performed with the models from the features scripts directory. An example of usage is shown in the Image Replacing Tutorial.
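
The two-stage idea (retrieve top-N candidates, then rescore them) can be sketched in plain Python. Cosine similarity as the closeness measure is an assumption, since this README does not name the metric. score_fn is a hypothetical stand-in for the BLIP VQA scorer, and all function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_n_candidates(utterance_vec, image_vecs, n):
    """Stage 1: ids of the n images whose embeddings are closest to the utterance."""
    ranked = sorted(image_vecs.items(),
                    key=lambda item: cosine(utterance_vec, item[1]),
                    reverse=True)
    return [image_id for image_id, _ in ranked[:n]]


def find_better_image(utterance_vec, image_vecs, n, score_fn):
    """Stage 2: rescore the candidates with score_fn and return the best id.

    score_fn(image_id) -> float stands in for the VQA model's score.
    Returns None when there are no candidates.
    """
    candidates = top_n_candidates(utterance_vec, image_vecs, n)
    return max(candidates, key=score_fn) if candidates else None
```

Restricting the expensive scorer to N candidates keeps its cost independent of the size of the image collection.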

Paths

This is a special dataclass that contains all the paths used in the scripts:

dialog_features_path is the path to the directory where utterance embedding vectors are stored. It can be empty initially; the vectors are then generated during the script run. An example is shown in the tutorial, and the default value is './feature_vectors/test_vectors/'. Make sure you create a new directory or clean the existing one before running your examples.
image_vectors_path is the path to the .pt file that contains the image embedding vectors. The default value is './images/vectors.pt'.
output_path is the path to the output .json file. The script saves all output to that path and sometimes also reads from it. The default value is './outputs/test_output.json'.
temporary_path is the path to the temporary .json file. It is used to store intermediate outputs that are not needed at the end. The default value is './outputs/temporary_path.json'.
entity_vectors_path is the path to the directory where entity embedding vectors are stored. It can be empty initially; the vectors are then generated during the script run. The default value is './feature_vectors/entity_vectors/'.
images_dataset_path is the path to the dataset containing information about the images. It should contain the image ids, url, description and ai_desription. All fields except the id can be left blank. The default value is './images/dataset.json'.
path2images is the path to the directory that contains the raw images. Images should be named with the id used in images_dataset_path. The default value is './images/full_images/'.
path2images_features is the path to the directory that contains the image embedding vectors, named with the same id as in images_dataset_path. The default value is './images/vectors'.
path2trained_model is the path to the trained model for Text Replacing. You can use the default value './models/random_forest.joblib'.
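
For orientation, the fields and defaults listed above could be mirrored by a dataclass like the one below. This is a sketch reconstructed from this README only. The class name PathsSketch is hypothetical, and the repository's own Paths class may differ in types, validation, or extra fields.

```python
from dataclasses import dataclass


@dataclass
class PathsSketch:
    """Illustrative mirror of the Paths fields; defaults are copied from this README."""
    dialog_features_path: str = "./feature_vectors/test_vectors/"
    image_vectors_path: str = "./images/vectors.pt"
    output_path: str = "./outputs/test_output.json"
    temporary_path: str = "./outputs/temporary_path.json"
    entity_vectors_path: str = "./feature_vectors/entity_vectors/"
    images_dataset_path: str = "./images/dataset.json"
    path2images: str = "./images/full_images/"
    path2images_features: str = "./images/vectors"
    path2trained_model: str = "./models/random_forest.joblib"


if __name__ == "__main__":
    # Override only what differs from the defaults, e.g. a fresh vectors directory.
    paths = PathsSketch(dialog_features_path="./feature_vectors/my_run/")
    print(paths)
```

Pointing dialog_features_path at a fresh directory per run is one way to follow the advice about not reusing stale vectors.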

License

All the code except the fine-tuned BLIP is licensed under the Apache 2.0 license. The fine-tuned version of BLIP is licensed under BSD3. The text version of the IMAD dataset is licensed under Creative Commons NC, and the dataset with images is licensed under the Unsplash License.

References

To cite this article, please use this BibTeX reference:

@misc{viktor2023imad,
      title={IMAD: IMage-Augmented multi-modal Dialogue}, 
      author={Moskvoretskii Viktor and Frolov Anton and Kuznetsov Denis},
      year={2023},
      eprint={2305.10512},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Or use the MLA citation:

Viktor, Moskvoretskii et al. “IMAD: IMage-Augmented multi-modal Dialogue.” (2023).

Contacts

Feel free to reach out to us at vvmoskvoretskiy@yandex.ru for inquiries, collaboration suggestions, or data requests related to our work.