
<p align="center">Multimodal Composite Editing and Retrieval</p>

:fire::fire: This is a curated collection of papers on multimodal composite editing and retrieval. :fire::fire:

[NEWS.20240909] The related survey paper has been released.

If you find this repository useful, please cite our paper:

@misc{li2024survey,
      title={A Survey of Multimodal Composite Editing and Retrieval}, 
      author={Suyan Li and Fuxiang Huang and Lei Zhang},
      year={2024},
      eprint={2409.05405},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Papers and related code

Image-text composite editing

2024

[ACM SIGIR, 2024] Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval
Haokun Wen, Xuemeng Song, Xiaolin Chen, Yinwei Wei, Liqiang Nie, Tat-Seng Chua
[Paper]

[IEEE TIP, 2024] Multimodal Composition Example Mining for Composed Query Image Retrieval
Gangjian Zhang, Shikun Li, Shikui Wei, Shiming Ge, Na Cai, Yao Zhao
[Paper]

[IEEE TMM, 2024] Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback
Yahui Xu, Yi Bin, Jiwei Wei, Yang Yang, Guoqing Wang, Heng Tao Shen
[Paper]

[WACV, 2024] Text-to-Image Editing by Image Information Removal
Zhongping Zhang, Jian Zheng, Zhiyuan Fang, Bryan A. Plummer
[Paper]

[WACV, 2024] Shape-Guided Diffusion with Inside-Outside Attention
Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, Trevor Darrell
[Paper]

2023

[IEEE Access, 2023] Text-Guided Image Manipulation via Generative Adversarial Network With Referring Image Segmentation-Based Guidance
Yuto Watanabe, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
[Paper]

[arXiv, 2023] InstructEdit: Improving Automatic Masks for Diffusion-Based Image Editing with User Instructions
Qian Wang, Biao Zhang, Michael Birsak, Peter Wonka
[Paper] [GitHub]

[ICLR, 2023] DiffEdit: Diffusion-Based Semantic Image Editing with Mask Guidance
Guillaume Couairon, Jakob Verbeek, Holger Schwenk, Matthieu Cord
[Paper] [GitHub]

[CVPR, 2023] SINE: Single Image Editing with Text-to-Image Diffusion Models
Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris N Metaxas, Jian Ren
[Paper] [GitHub]

[CVPR, 2023] Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
Narek Tumanyan, Michal Geyer, Shai Bagon, Tali Dekel
[Paper] [GitHub]

[arXiv, 2023] PRedItOR: Text Guided Image Editing with Diffusion Prior
Hareesh Ravi, Sachin Kelkar, Midhun Harikumar, Ajinkya Kale
[Paper]

[TOG, 2023] Unitune: Text-Driven Image Editing by Fine Tuning a Diffusion Model on a Single Image
Dani Valevski, Matan Kalman, Eyal Molad, Eyal Segalis, Yossi Matias, Yaniv Leviathan
[Paper] [GitHub]

[arXiv, 2023] Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models
Jooyoung Choi, Yunjey Choi, Yunji Kim, Junho Kim, Sungroh Yoon
[Paper] [GitHub]

[CVPR, 2023] Imagic: Text-Based Real Image Editing with Diffusion Models
Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani
[Paper] [GitHub]

[ICLR, 2023] Diffusion-Based Image Translation Using Disentangled Style and Content Representation
Gihyun Kwon, Jong Chul Ye
[Paper] [GitHub]

[arXiv, 2023] MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path
Qian Wang, Biao Zhang, Michael Birsak, Peter Wonka
[Paper] [GitHub]

[CVPR, 2023] InstructPix2Pix: Learning to Follow Image Editing Instructions
Tim Brooks, Aleksander Holynski, Alexei A. Efros
[Paper] [GitHub]

[ICCV, 2023] Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models
Wenkai Dong, Song Xue, Xiaoyue Duan, Shumin Han
[Paper]

[arXiv, 2023] DeltaSpace: A Semantic-Aligned Feature Space for Flexible Text-Guided Image Editing
Yueming Lyu, Kang Zhao, Bo Peng, Yue Jiang, Yingya Zhang, Jing Dong
[Paper]

[AAAI, 2023] DE-Net: Dynamic Text-Guided Image Editing Adversarial Networks
Ming Tao, Bing-Kun Bao, Hao Tang, Fei Wu, Longhui Wei, Qi Tian
[Paper] [GitHub]

2022

[ACM MM, 2022] LS-GAN: Iterative Language-Based Image Manipulation via Long and Short Term Consistency Reasoning
Gaoxiang Cong, Liang Li, Zhenhuan Liu, Yunbin Tu, Weijun Qin, Shenyuan Zhang, Chengang Yan, Wenyu Wang, Bin Jiang
[Paper]

[arXiv, 2022] FEAT: Face Editing with Attention
Xianxu Hou, Linlin Shen, Or Patashnik, Daniel Cohen-Or, Hui Huang
[Paper]

[ECCV, 2022] VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance
Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, Edward Raff
[Paper] [GitHub]

[ICML, 2022] GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen
[Paper] [GitHub]

[WACV, 2022] StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation
Umut Kocasari, Alara Dirik, Mert Tiftikci, Pinar Yanardag
[Paper] [GitHub] [website]

[CVPR, 2022] HairCLIP: Design Your Hair by Text and Reference Image
Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Zhentao Tan, Lu Yuan, Weiming Zhang, Nenghai Yu
[Paper] [GitHub]

[NeurIPS, 2022] One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations
Yiming Zhu, Hongyu Liu, Yibing Song, Ziyang Yuan, Xintong Han, Chun Yuan, Qifeng Chen, Jue Wang
[Paper] [GitHub]

[CVPR, 2022] Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model
Zipeng Xu, Tianwei Lin, Hao Tang, Fu Li, Dongliang He, Nicu Sebe, Radu Timofte, Luc Van Gool, Errui Ding
[Paper] [GitHub]

[CVPR, 2022] Blended Diffusion for Text-Driven Editing of Natural Images
Omri Avrahami, Dani Lischinski, Ohad Fried
[Paper] [GitHub]

[CVPR, 2022] DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation
Gwanghyun Kim, Taesung Kwon, Jong Chul Ye
[Paper] [GitHub]

[ICLR, 2022] SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon
[Paper] [GitHub] [Website]

2021

[CVPR, 2021] TediGAN: Text-guided diverse face image generation and manipulation
Weihao Xia, Yujiu Yang, Jing-Hao Xue, Baoyuan Wu
[Paper] [GitHub]

[ICIP, 2021] Segmentation-Aware Text-Guided Image Manipulation
Tomoki Haruyama, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
[Paper] [GitHub]

[IJPR, 2021] FocusGAN: Preserving Background in Text-Guided Image Editing
Liuqing Zhao, Linyan Li, Fuyuan Hu, Zhenping Xia, Rui Yao
[Paper]

[ICCV, 2021] StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, Dani Lischinski
[Paper] [GitHub]

[MM, 2021] Text as Neural Operator: Image Manipulation by Text Instruction
Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee, Irfan Essa
[Paper] [GitHub]

[arXiv, 2021] Paint by Word
Alex Andonian, Sabrina Osmany, Audrey Cui, YeonHwan Park, Ali Jahanian, Antonio Torralba, David Bau
[Paper] [GitHub] [Website]

[CVPR, 2021] Learning by Planning: Language-Guided Global Image Editing
Jing Shi, Ning Xu, Yihang Xu, Trung Bui, Franck Dernoncourt, Chenliang Xu
[Paper] [GitHub]

2020

[ACM MM, 2020] IR-GAN: Image Manipulation with Linguistic Instruction by Increment Reasoning
Zhenhuan Liu, Jincan Deng, Liang Li, Shaofei Cai, Qianqian Xu, Shuhui Wang, Qingming Huang
[Paper] [GitHub]

[CVPR, 2020] ManiGAN: Text-Guided Image Manipulation
Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, Philip HS Torr
[Paper] [GitHub]

[NeurIPS, 2020] Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation
Bowen Li, Xiaojuan Qi, Philip Torr, Thomas Lukasiewicz
[Paper] [GitHub]

[LNCS, 2020] CAFE-GAN: Arbitrary Face Attribute Editing with Complementary Attention Feature
Jeong-gi Kwak, David K. Han, Hanseok Ko
[Paper] [GitHub]

[ECCV, 2020] Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions
Xihui Liu, Zhe Lin, Jianming Zhang, Handong Zhao, Quan Tran, Xiaogang Wang, Hongsheng Li
[Paper] [GitHub]

2019

[ICASSP, 2019] Bilinear Representation for Language-based Image Editing Using Conditional Generative Adversarial Networks
Xiaofeng Mao, Yuefeng Chen, Yuhong Li, Tao Xiong, Yuan He, Hui Xue
[Paper] [GitHub]

2018

[NeurIPS, 2018] Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language
Seonghyeon Nam, Yunji Kim, Seon Joo Kim
[Paper] [GitHub]

[CVPR, 2018] Language-based image editing with recurrent attentive models
Jianbo Chen, Yelong Shen, Jianfeng Gao, Jingjing Liu, Xiaodong Liu
[Paper]

[arXiv, 2018] Interactive Image Manipulation with Natural Language Instruction Commands
Seitaro Shinagawa, Koichiro Yoshino, Sakriani Sakti, Yu Suzuki, Satoshi Nakamura
[Paper]

2017

[ICCV, 2017] Semantic image synthesis via adversarial learning
Hao Dong, Simiao Yu, Chao Wu, Yike Guo
[Paper] [GitHub]

Image-text composite retrieval

2024

[AAAI, 2024] Dynamic weighted combiner for mixed-modal image retrieval
Fuxiang Huang, Lei Zhang, Xiaowei Fu, Suqi Song
[Paper] [GitHub]

[ICMR, 2024] Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models
Hongyi Zhu, Jia-Hong Huang, Stevan Rudinac, Evangelos Kanoulas
[Paper] [GitHub]

[ACM MM, 2024] Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives
Zhangchi Feng, Richong Zhang, Zhijie Nie
[Paper] [GitHub]

[CVPR, 2024] Language-only Training of Zero-shot Composed Image Retrieval
Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, Sangdoo Yun
[Paper] [GitHub]

[AAAI, 2024] Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval
Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Yue Hu, Qi Wu
[Paper] [GitHub]

[CVPR, 2024] Knowledge-enhanced dual-stream zero-shot composed image retrieval
Yucheng Suo, Fan Ma, Linchao Zhu, Yi Yang
[Paper]

2023

[CVPR, 2023] Fame-vil: Multi-tasking vision-language model for heterogeneous fashion tasks
Xiao Han, Xiatian Zhu, Licheng Yu, Li Zhang, Yi-Zhe Song, Tao Xiang
[Paper] [GitHub]

[ICCV, 2023] FashionNTM: Multi-turn fashion image retrieval via cascaded memory
Anwesan Pal, Sahil Wadhwa, Ayush Jaiswal, Xu Zhang, Yue Wu, Rakesh Chada, Pradeep Natarajan, Henrik I Christensen
[Paper]

[CVPR, 2023] Pic2word: Mapping pictures to words for zero-shot composed image retrieval
Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister
[Paper] [GitHub]

[arXiv, 2023] Pretrain like you inference: Masked tuning improves zero-shot composed image retrieval
Junyang Chen, Hanjiang Lai
[Paper] [GitHub]

[ICCV, 2023] Zero-shot composed image retrieval with textual inversion
Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto Del Bimbo
[Paper] [GitHub]

[ACM, 2023] Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features
Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto Del Bimbo
[Paper] [GitHub]

2022

[IEEE, 2022] Adversarial and isotropic gradient augmentation for image retrieval with text feedback
Fuxiang Huang, Lei Zhang, Yuhang Zhou, Xinbo Gao
[Paper]

[TOMM, 2022] Tell, imagine, and search: End-to-end learning for composing text and image to image retrieval
Feifei Zhang, Mingliang Xu, Changsheng Xu
[Paper]

[arXiv, 2022] Image Search with Text Feedback by Additive Attention Compositional Learning
Yuxin Tian, Shawn Newsam, Kofi Boakye
[Paper]

[IEEE, 2022] Heterogeneous feature alignment and fusion in cross-modal augmented space for composed image retrieval
Huaxin Pang, Shikui Wei, Gangjian Zhang, Shiyin Zhang, Shuang Qiu, Yao Zhao
[Paper]

[ICLR, 2022] ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity
Ginger Delmas, Rafael S. Rezende, Gabriela Csurka, Diane Larlus
[Paper] [GitHub]

[WACV, 2022] SAC: Semantic attention composition for text-conditioned image retrieval
Surgan Jandial, Pinkesh Badjatiya, Pranit Chawla, Ayush Chopra, Mausoom Sarkar, Balaji Krishnamurthy
[Paper]

[ACM TOMCCAP, 2022] AMC: Adaptive Multi-expert Collaborative Network for Text-guided Image Retrieval
Hongguang Zhu, Yunchao Wei, Yao Zhao, Chunjie Zhang, Shujuan Huang
[Paper][GitHub]

2021

[SIGIR, 2021] Comprehensive linguistic-visual composition network for image retrieval
Haokun Wen, Xuemeng Song, Xin Yang, Yibing Zhan, Liqiang Nie
[Paper]

[AAAI, 2021] Dual compositional learning in interactive image retrieval
Jongseok Kim, Youngjae Yu, Hoeseong Kim, Gunhee Kim
[Paper] [GitHub]

[CVPRW, 2021] Leveraging Style and Content features for Text Conditioned Image Retrieval
Pranit Chawla, Surgan Jandial, Pinkesh Badjatiya, Ayush Chopra, Mausoom Sarkar, Balaji Krishnamurthy
[Paper]

[ICCV, 2021] Image retrieval on real-life images with pre-trained vision-and-language models
Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, Stephen Gould
[Paper] [GitHub]

[SIGIR, 2021] Conversational fashion image retrieval via multiturn natural language feedback
Yifei Yuan, Wai Lam
[Paper] [GitHub]

[WACV, 2021] Compositional learning of image-text query for image retrieval
Muhammad Umer Anwaar, Egor Labintcev, Martin Kleinsteuber
[Paper] [GitHub]

2020

[ECCV, 2020] Learning joint visual semantic matching embeddings for language-guided retrieval
Yanbei Chen, Loris Bazzani
[Paper]

[arXiv, 2020] CurlingNet: Compositional Learning between Images and Text for Fashion IQ Data
Youngjae Yu, Seunghwan Lee, Yuncheol Choi, Gunhee Kim
[Paper]

[CVPR, 2020] Image search with text feedback by visiolinguistic attention learning
Yanbei Chen, Shaogang Gong, Loris Bazzani
[Paper] [GitHub]

[arXiv, 2020] Modality-Agnostic Attention Fusion for visual search with text feedback
Eric Dodds, Jack Culpepper, Simao Herdade, Yang Zhang, Kofi Boakye
[Paper] [GitHub]

2018

[CVPR, 2018] Language-based image editing with recurrent attentive models
Jianbo Chen, Yelong Shen, Jianfeng Gao, Jingjing Liu, Xiaodong Liu
[Paper]

[NeurIPS, 2018] Dialog-based interactive image retrieval
Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, Rogerio Feris
[Paper] [GitHub]

2017

[ICCV, 2017] Automatic spatially-aware fashion concept discovery
Xintong Han, Zuxuan Wu, Phoenix X Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, Larry S Davis
[Paper]

[ICCV, 2017] Be your own prada: Fashion synthesis with structural coherence
Shizhan Zhu, Raquel Urtasun, Sanja Fidler, Dahua Lin, Chen Change Loy
[Paper] [GitHub]

Other multimodal composite retrieval

2024

[CVPR, 2024] Tri-modal motion retrieval by learning a joint embedding space
Kangning Yin, Shihao Zou, Yuxuan Ge, Zheng Tian
[Paper]

[WACV, 2024] Modality-Aware Representation Learning for Zero-shot Sketch-based Image Retrieval
Eunyi Lyou, Doyeon Lee, Jooeun Kim, Joonseok Lee
[Paper] [GitHub]

[CVPR, 2024] Pros: Prompting-to-simulate generalized knowledge for universal cross-domain retrieval
Kaipeng Fang, Jingkuan Song, Lianli Gao, Pengpeng Zeng, Zhi-Qi Cheng, Xiyao Li, Heng Tao Shen
[Paper] [GitHub]

[CVPR, 2024] You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval
Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song
[Paper]

[AAAI, 2024] T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models
Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan
[Paper] [GitHub]

[IEEE/CVF, 2024] TriCoLo: Trimodal contrastive loss for text to shape retrieval
Yue Ruan, Han-Hung Lee, Yiming Zhang, Ke Zhang, Angel X Chang
[Paper] [GitHub]

2023

[CVPR, 2023] SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo and Text
Pinaki Nath Chowdhury, Ayan Kumar Bhunia, Aneeshan Sain, Subhadeep Koley, Tao Xiang, Yi-Zhe Song
[Paper]

2022

[ECCV, 2022] A sketch is worth a thousand words: Image retrieval with text and sketch
Patsorn Sangkloy, Wittawat Jitkrittum, Diyi Yang, James Hays
[Paper]

[ECCV, 2022] Motionclip: Exposing human motion generation to clip space
Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, Daniel Cohen-Or
[Paper] [GitHub]

[IEEE, 2022] Multimodal Fusion Remote Sensing Image–Audio Retrieval
Rui Yang, Shuang Wang, Yingzhi Sun, Huan Zhang, Yu Liao, Yu Gu, Biao Hou, Licheng Jiao
[Paper]

2021

[CVPR, 2021] Connecting what to say with where to look by modeling human attention traces
Zihang Meng, Licheng Yu, Ning Zhang, Tamara L Berg, Babak Damavandi, Vikas Singh, Amy Bearman
[Paper] [GitHub]

[ICCV, 2021] Telling the what while pointing to the where: Multimodal queries for image retrieval
Soravit Changpinyo, Jordi Pont-Tuset, Vittorio Ferrari, Radu Soricut
[Paper]

2020

[arXiv, 2020] A Feature Analysis for Multimodal News Retrieval
Golsa Tahmasebzadeh, Sherzod Hakimov, Eric Müller-Budack, Ralph Ewerth
[Paper] [GitHub]

2019

[Multimedia Tools and Applications, 2019] Efficient and interactive spatial-semantic image retrieval
Ryosuke Furuta, Naoto Inoue, Toshihiko Yamasaki
[Paper]

[arXiv, 2019] Query by Semantic Sketch
Luca Rossetto, Ralph Gasser, Heiko Schuldt
[Paper]

2017

[IJCNLP, 2017] Draw and tell: Multimodal descriptions outperform verbal- or sketch-only descriptions in an image retrieval task
Ting Han, David Schlangen
[Paper]

[CVPR, 2017] Spatial-Semantic Image Search by Visual Feature Synthesis
Long Mai, Hailin Jin, Zhe Lin, Chen Fang, Jonathan Brandt, Feng Liu
[Paper]

[ACM Multimedia, 2017] Region-based image retrieval revisited
Ryota Hinami, Yusuke Matsui, Shin'ichi Satoh
[Paper]

2014

[Cancer Informatics, 2014] Medical image retrieval: a multimodal approach
Yu Cao, Shawn Steffey, Jianbiao He, Degui Xiao, Cui Tao, Ping Chen, Henning Müller
[Paper]

2013

[10th Conference on Open Research Areas in Information Retrieval, 2013] NovaMedSearch: a multimodal search engine for medical case-based retrieval
André Mourão, Flávio Martins
[Paper]

[12th International Conference on Document Analysis and Recognition, 2013] Multi-modal Information Integration for Document Retrieval
Ehtesham Hassan, Santanu Chaudhury, M. Gopal
[Paper]

2003

[EURASIP, 2003] Semantic indexing of multimedia content using visual, audio, and text cues
WH Adams, Giridharan Iyengar, Ching-Yung Lin, Milind Ramesh Naphade, Chalapathy Neti, Harriet J Nock, John R Smith
[Paper]

Datasets

Datasets for image-text composite editing

| Dataset | Modalities | Scale | Link |
|---|---|---|---|
| Caltech-UCSD Birds (CUB) | Images, Captions | 11K images, 11K attributes | Link |
| Oxford-102 Flower | Images, Captions | 8K images, 8K attributes | Link |
| CelebFaces Attributes (CelebA) | Images, Captions | 202K images, 8M attributes | Link |
| DeepFashion (Fashion Synthesis) | Images, Captions | 78K images, - | Link |
| MIT-Adobe 5k | Images, Captions | 5K images, 20K texts | Link |
| MS-COCO | Images, Captions | 164K images, 616K texts | Link |
| ReferIt | Images, Captions | 19K images, 130K texts | Link |
| CLEVR | 3D images, Questions | 100K images, 865K questions | Link |
| i-CLEVR | 3D images, Instructions | 10K sequences, 50K instructions | Link |
| CSS | 3D images, 2D images, Instructions | 34K images, - | Link |
| CoDraw | Images, Text instructions | 9K images, - | Link |
| Cityscapes | Images, Captions | 25K images, - | Link |
| Zap-Seq | Image sequences, Captions | 8K images, 18K texts | - |
| DeepFashion-Seq | Image sequences, Captions | 4K images, 12K texts | - |
| FFHQ | Images | 70K images | Link |
| LSUN | Images | 1M images | Link |
| Animal Faces HQ (AFHQ) | Images | 15K images | Link |
| CelebA-HQ | Images | 30K images | Link |
| Animal Faces | Images | 16K images | Link |
| Landscapes | Images | 4K images | Link |

Datasets for image-text composite retrieval

| Dataset | Modalities | Scale | Link |
|---|---|---|---|
| Fashion200k | Images, Captions | 200K images, 200K texts | Link |
| MIT-States | Images, Captions | 53K images, 53K texts | Link |
| Fashion IQ | Images, Captions | 77K images, - | Link |
| CIRR | Images, Captions | 21K images, - | Link |
| CSS | 3D images, 2D images, Instructions | 34K images, - | Link |
| Shoes | Images | 14K images | Link |
| Birds-to-Words | Images, Captions | - | Link |
| SketchyCOCO | Images, Sketches | 14K sketches, 14K photos | Link |
| FSCOCO | Images, Sketches | 10K sketches | Link |
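
Most of the composed-retrieval benchmarks above (e.g., Fashion IQ, CIRR, CSS, Shoes) pair a reference image with a modification text and a ground-truth target image, and systems rank a gallery against the composed query. The sketch below illustrates that triplet layout and a generic scoring loop; it is a minimal, hypothetical example, and `encode_image`, `encode_text`, and `fuse` are placeholders for whatever encoders and fusion module a given method uses, not APIs from any of the listed papers or datasets.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class CIRTriplet:
    reference_image: str    # path or id of the reference image
    modification_text: str  # e.g., "is sleeveless and has a floral print"
    target_image: str       # path or id of the intended target image

def rank_gallery(triplet: CIRTriplet,
                 gallery_feats: np.ndarray,  # (N, d) precomputed, L2-normalized image features
                 encode_image: Callable[[str], np.ndarray],
                 encode_text: Callable[[str], np.ndarray],
                 fuse: Callable[[np.ndarray, np.ndarray], np.ndarray]) -> np.ndarray:
    """Return gallery indices sorted from most to least similar to the composed query."""
    # Compose the (reference image, modification text) pair into a single query vector.
    query = fuse(encode_image(triplet.reference_image),
                 encode_text(triplet.modification_text))
    query = query / np.linalg.norm(query)
    scores = gallery_feats @ query  # cosine similarity against every gallery image
    return np.argsort(-scores)      # descending order of similarity
```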

Datasets for other multimodal composite retrieval

| Dataset | Modalities | Scale | Link |
|---|---|---|---|
| HumanML3D | Motions, Captions | 14K motion sequences, 44K texts | Link |
| KIT-ML | Motions, Captions | 3K motion sequences, 6K texts | Link |
| Text2Shape | Shapes, Captions | 6K chairs, 8K tables, 70K texts | Link |
| Flickr30k LocNar | Images, Captions | 31K images, 155K texts | Link |
| Conceptual Captions | Images, Captions | 3.3M images, 33M texts | Link |
| Sydney_IV | RS images, Audio captions | 613 images, 3K audio descriptions | Link |
| UCM_IV | Images, Audio captions | 2K images, 10K audio descriptions | Link |
| RSICD_IV | Images, Audio captions | 11K images, 55K audio descriptions | Link |


Experimental Results
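
The tables below report Recall@K (R@K): the percentage of queries whose ground-truth target appears among the top-K retrieved candidates, with the "Average"/"Avg." columns averaging R@K over subsets or over the listed K values. For reference, here is a minimal, framework-agnostic sketch of how R@K is typically computed from a query-gallery similarity matrix; the function and variable names are illustrative only and are not taken from any of the surveyed papers.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, target_idx: np.ndarray, k: int) -> float:
    """Recall@K: sim is a (num_queries, num_gallery) similarity matrix and
    target_idx[i] is the gallery index of query i's ground-truth target."""
    topk = np.argsort(-sim, axis=1)[:, :k]              # top-k gallery indices per query
    hits = (topk == target_idx[:, None]).any(axis=1)    # did the target land in the top-k?
    return float(hits.mean()) * 100.0                   # reported as a percentage

# Toy example with random scores (illustrative only).
rng = np.random.default_rng(0)
sim = rng.standard_normal((100, 1000))
target_idx = rng.integers(0, 1000, size=100)
for k in (1, 10, 50):
    print(f"R@{k}: {recall_at_k(sim, target_idx, k):.2f}")
```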

Performance comparison on the Fashion-IQ dataset (VAL split)

| Methods | Image Encoder | Dress R@10 | Dress R@50 | Shirt R@10 | Shirt R@50 | Toptee R@10 | Toptee R@50 | Average R@10 | Average R@50 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| ARTEMIS+LSTM | ResNet-18 | 25.23 | 48.64 | 20.35 | 43.67 | 23.36 | 46.97 | 22.98 | 46.43 | 34.70 |
| ARTEMIS+BiGRU | ResNet-18 | 24.84 | 49.00 | 20.40 | 43.22 | 23.63 | 47.39 | 22.95 | 46.54 | 34.75 |
| JPM(VAL,MSE) | ResNet-18 | 21.27 | 43.12 | 21.88 | 43.30 | 25.81 | 50.27 | 22.98 | 45.59 | 34.29 |
| JPM(VAL,Tri) | ResNet-18 | 21.38 | 45.15 | 22.81 | 45.18 | 27.78 | 51.70 | 23.99 | 47.34 | 35.67 |
| EER | ResNet-50 | 30.02 | 55.44 | 25.32 | 49.87 | 33.20 | 60.34 | 29.51 | 55.22 | 42.36 |
| Ranking-aware | ResNet-50 | 34.80 | 60.22 | 45.01 | 69.06 | 47.68 | 74.85 | 42.50 | 68.04 | 55.27 |
| CRN | ResNet-50 | 30.20 | 57.15 | 29.17 | 55.03 | 33.70 | 63.91 | 31.02 | 58.70 | 44.86 |
| DWC | ResNet-50 | 32.67 | 57.96 | 35.53 | 60.11 | 40.13 | 66.09 | 36.11 | 61.39 | 48.75 |
| DATIR | ResNet-50 | 21.90 | 43.80 | 21.90 | 43.70 | 27.20 | 51.60 | 23.70 | 46.40 | 35.05 |
| CoSMo | ResNet-50 | 25.64 | 50.30 | 24.90 | 49.18 | 29.21 | 57.46 | 26.58 | 52.31 | 39.45 |
| FashionVLP | ResNet-50 | 32.42 | 60.29 | 31.89 | 58.44 | 38.51 | 68.79 | 34.27 | 62.51 | 48.39 |
| CLVC-Net | ResNet-50 | 29.85 | 56.47 | 28.75 | 54.76 | 33.50 | 64.00 | 30.70 | 58.41 | 44.56 |
| SAC w/BERT | ResNet-50 | 26.52 | 51.01 | 28.02 | 51.86 | 32.70 | 61.23 | 29.08 | 54.70 | 41.89 |
| SAC w/ Random Emb. | ResNet-50 | 26.13 | 52.10 | 26.20 | 50.93 | 31.16 | 59.05 | 27.83 | 54.03 | 40.93 |
| DCNet | ResNet-50 | 28.95 | 56.07 | 23.95 | 47.30 | 30.44 | 58.29 | 27.78 | 53.89 | 40.83 |
| AMC | ResNet-50 | 31.73 | 59.25 | 30.67 | 59.08 | 36.21 | 66.60 | 32.87 | 61.64 | 47.25 |
| VAL(Lvv) | ResNet-50 | 21.12 | 42.19 | 21.03 | 43.44 | 25.64 | 49.49 | 22.60 | 45.04 | 33.82 |
| ARTEMIS+LSTM | ResNet-50 | 27.34 | 51.71 | 21.05 | 44.18 | 24.91 | 49.87 | 24.43 | 48.59 | 36.51 |
| ARTEMIS+BiGRU | ResNet-50 | 27.16 | 52.40 | 21.78 | 43.64 | 29.20 | 54.83 | 26.05 | 50.29 | 38.17 |
| VAL(Lvv + Lvs) | ResNet-50 | 21.47 | 43.83 | 21.03 | 42.75 | 26.71 | 51.81 | 23.07 | 46.13 | 34.60 |
| VAL(GloVe) | ResNet-50 | 22.53 | 44.00 | 22.38 | 44.15 | 27.53 | 51.68 | 24.15 | 46.61 | 35.38 |
| AlRet | ResNet-50 | 30.19 | 58.80 | 29.39 | 55.69 | 37.66 | 64.97 | 32.36 | 59.76 | 46.12 |
| RTIC | ResNet-50 | 19.40 | 43.51 | 16.93 | 38.36 | 21.58 | 47.88 | 19.30 | 43.25 | 31.28 |
| RTIC-GCN | ResNet-50 | 19.79 | 43.55 | 16.95 | 38.67 | 21.97 | 49.11 | 19.57 | 43.78 | 31.68 |
| Uncertainty (CLVC-Net) | ResNet-50 | 30.60 | 57.46 | 31.54 | 58.29 | 37.37 | 68.41 | 33.17 | 61.39 | 47.28 |
| Uncertainty (CLIP4CIR) | ResNet-50 | 32.61 | 61.34 | 33.23 | 62.55 | 41.40 | 72.51 | 35.75 | 65.47 | 50.61 |
| CRR | ResNet-101 | 30.41 | 57.11 | 33.67 | 64.48 | 30.73 | 58.02 | 31.60 | 59.87 | 45.74 |
| CIRPLANT | ResNet-152 | 14.38 | 34.66 | 13.64 | 33.56 | 16.44 | 38.34 | 14.82 | 35.52 | 25.17 |
| CIRPLANT w/OSCAR | ResNet-152 | 17.45 | 40.41 | 17.53 | 38.81 | 21.64 | 45.38 | 18.87 | 41.53 | 30.20 |
| ComqueryFormer | Swin | 33.86 | 61.08 | 35.57 | 62.19 | 42.07 | 69.30 | 37.17 | 64.19 | 50.68 |
| CRN | Swin | 30.34 | 57.61 | 29.83 | 55.54 | 33.91 | 64.04 | 31.36 | 59.06 | 45.21 |
| CRN | Swin-L | 32.67 | 59.30 | 30.27 | 56.97 | 37.74 | 65.94 | 33.56 | 60.74 | 47.15 |
| BLIP4CIR1 | BLIP-B | 43.78 | 67.38 | 45.04 | 67.47 | 49.62 | 72.62 | 46.15 | 69.15 | 57.65 |
| CASE | BLIP | 47.44 | 69.36 | 48.48 | 70.23 | 50.18 | 72.24 | 48.79 | 70.68 | 59.74 |
| BLIP4CIR2 | BLIP | 40.65 | 66.34 | 40.38 | 64.13 | 46.86 | 69.91 | 42.63 | 66.79 | 54.71 |
| BLIP4CIR2+Bi | BLIP | 42.09 | 67.33 | 41.76 | 64.28 | 46.61 | 70.32 | 43.49 | 67.31 | 55.40 |
| CLIP4CIR3 | CLIP | 39.46 | 64.55 | 44.41 | 65.26 | 47.48 | 70.98 | 43.78 | 66.93 | 55.36 |
| CLIP4CIR | CLIP | 33.81 | 59.40 | 39.99 | 60.45 | 41.41 | 65.37 | 38.32 | 61.74 | 50.03 |
| AlRet | CLIP-RN50 | 40.23 | 65.89 | 47.15 | 70.88 | 51.05 | 75.78 | 46.10 | 70.80 | 58.50 |
| Combiner | CLIP-RN50 | 31.63 | 56.67 | 36.36 | 58.00 | 38.19 | 62.42 | 35.39 | 59.03 | 47.21 |
| DQU-CIR | CLIP-H | 57.63 | 78.56 | 62.14 | 80.38 | 66.15 | 85.73 | 61.97 | 81.56 | 71.77 |
| PL4CIR | CLIP-L | 38.18 | 64.50 | 48.63 | 71.54 | 52.32 | 76.90 | 46.37 | 70.98 | 58.68 |
| TG-CIR | CLIP-B | 45.22 | 69.66 | 52.60 | 72.52 | 56.14 | 77.10 | 51.32 | 73.09 | 62.21 |
| PL4CIR | CLIP-B | 33.22 | 59.99 | 46.17 | 68.79 | 46.46 | 73.84 | 41.98 | 67.54 | 54.76 |

Performance comparison on the Fashion-IQ dataset (original split)

| Methods | Image Encoder | Dress R@10 | Dress R@50 | Shirt R@10 | Shirt R@50 | Toptee R@10 | Toptee R@50 | Average R@10 | Average R@50 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| ComposeAE | ResNet-18 | 10.77 | 28.29 | 9.96 | 25.14 | 12.74 | 30.79 | - | - | - |
| TIRG | ResNet-18 | 14.87 | 34.66 | 18.26 | 37.89 | 19.08 | 39.62 | 17.40 | 37.39 | 27.40 |
| MAAF | ResNet-50 | 23.80 | 48.60 | 21.30 | 44.20 | 27.90 | 53.60 | 24.30 | 48.80 | 36.60 |
| Leveraging | ResNet-50 | 19.33 | 43.52 | 14.47 | 35.47 | 19.73 | 44.56 | 17.84 | 41.18 | 29.51 |
| MCR | ResNet-50 | 26.20 | 51.20 | 22.40 | 46.01 | 29.70 | 56.40 | 26.10 | 51.20 | 38.65 |
| MCEM (L_CE) | ResNet-50 | 30.07 | 56.13 | 23.90 | 47.60 | 30.90 | 57.52 | 28.29 | 53.75 | 41.02 |
| MCEM (L_FCE) | ResNet-50 | 31.50 | 58.41 | 25.01 | 49.73 | 32.77 | 61.02 | 29.76 | 56.39 | 43.07 |
| MCEM (L_AFCE) | ResNet-50 | 33.23 | 59.16 | 26.15 | 50.87 | 33.83 | 61.40 | 31.07 | 57.14 | 44.11 |
| AlRet | ResNet-50 | 27.34 | 53.42 | 21.30 | 43.08 | 29.07 | 54.21 | 25.86 | 50.17 | 38.02 |
| MCEM (L_AFCE) w/ BERT | ResNet-50 | 32.11 | 59.21 | 27.28 | 52.01 | 33.96 | 62.30 | 31.12 | 57.84 | 44.48 |
| JVSM | MobileNet-v1 | 10.70 | 25.90 | 12.00 | 27.10 | 13.00 | 26.90 | 11.90 | 26.63 | 19.27 |
| FashionIQ (Dialog Turn 1) | EfficientNet-b | 12.45 | 35.21 | 11.05 | 28.99 | 11.24 | 30.45 | 11.58 | 31.55 | 21.57 |
| FashionIQ (Dialog Turn 5) | EfficientNet-b | 41.35 | 73.63 | 33.91 | 63.42 | 33.52 | 63.85 | 36.26 | 66.97 | 51.61 |
| AACL | Swin | 29.89 | 55.85 | 24.82 | 48.85 | 30.88 | 56.85 | 28.53 | 53.85 | 41.19 |
| ComqueryFormer | Swin | 28.85 | 55.38 | 25.64 | 50.22 | 33.61 | 60.48 | 29.37 | 55.36 | 42.36 |
| AlRet | CLIP | 35.75 | 60.56 | 37.02 | 60.55 | 42.25 | 67.52 | 38.30 | 62.82 | 50.56 |
| MCEM (L_AFCE) | CLIP | 33.98 | 59.96 | 40.15 | 62.76 | 43.75 | 67.70 | 39.29 | 63.47 | 51.38 |
| SPN (TG-CIR) | CLIP | 36.84 | 60.83 | 41.85 | 63.89 | 45.59 | 68.79 | 41.43 | 64.50 | 52.97 |
| SPN (CLIP4CIR) | CLIP | 38.82 | 62.92 | 45.83 | 66.44 | 48.80 | 71.29 | 44.48 | 66.88 | 55.68 |
| PL4CIR | CLIP-B | 29.00 | 53.94 | 35.43 | 58.88 | 39.16 | 64.56 | 34.53 | 59.13 | 46.83 |
| FAME-ViL | CLIP-B | 42.19 | 67.38 | 47.64 | 68.79 | 50.69 | 73.07 | 46.84 | 69.75 | 58.30 |
| PALAVRA | CLIP-B | 17.25 | 35.94 | 21.49 | 37.05 | 20.55 | 38.76 | 19.76 | 37.25 | 28.51 |
| MagicLens-B | CLIP-B | 21.50 | 41.30 | 27.30 | 48.80 | 30.20 | 52.30 | 26.30 | 47.40 | 36.85 |
| SEARLE | CLIP-B | 18.54 | 39.51 | 24.44 | 41.61 | 25.70 | 46.46 | 22.89 | 42.53 | 32.71 |
| CIReVL | CLIP-B | 25.29 | 46.36 | 28.36 | 47.84 | 31.21 | 53.85 | 28.29 | 49.35 | 38.82 |
| SEARLE-OTI | CLIP-B | 17.85 | 39.91 | 25.37 | 41.32 | 24.12 | 45.79 | 22.44 | 42.34 | 32.39 |
| PLI | CLIP-B | 25.71 | 47.81 | 33.36 | 53.47 | 34.87 | 58.44 | 31.31 | 53.24 | 42.28 |
| PL4CIR | CLIP-L | 33.60 | 58.90 | 39.45 | 61.78 | 43.96 | 68.33 | 39.02 | 63.00 | 51.01 |
| SEARLE-XL | CLIP-L | 20.48 | 43.13 | 26.89 | 45.58 | 29.32 | 49.97 | 25.56 | 46.23 | 35.90 |
| SEARLE-XL-OTI | CLIP-L | 21.57 | 44.47 | 30.37 | 47.49 | 30.90 | 51.76 | 27.61 | 47.90 | 37.76 |
| Context-I2W | CLIP-L | 23.10 | 45.30 | 29.70 | 48.60 | 30.60 | 52.90 | 27.80 | 48.90 | 38.35 |
| CompoDiff (with SynthTriplets18M) | CLIP-L | 32.24 | 46.27 | 37.69 | 49.08 | 38.12 | 50.57 | 36.02 | 48.64 | 42.33 |
| CompoDiff (with SynthTriplets18M) | CLIP-L | 37.78 | 49.10 | 41.31 | 55.17 | 44.26 | 56.41 | 39.02 | 51.71 | 46.85 |
| Pic2Word | CLIP-L | 20.00 | 40.20 | 26.20 | 43.60 | 27.90 | 47.40 | 24.70 | 43.70 | 34.20 |
| PLI | CLIP-L | 28.11 | 51.12 | 38.63 | 58.51 | 39.42 | 62.68 | 35.39 | 57.44 | 46.42 |
| KEDs | CLIP-L | 21.70 | 43.80 | 28.90 | 48.00 | 29.90 | 51.90 | 26.80 | 47.90 | 37.35 |
| CIReVL | CLIP-L | 24.79 | 44.76 | 29.49 | 47.40 | 31.36 | 53.65 | 28.55 | 48.57 | 38.56 |
| LinCIR | CLIP-L | 20.92 | 42.44 | 29.10 | 46.81 | 28.81 | 50.18 | 26.28 | 46.49 | 36.39 |
| MagicLens-L | CLIP-L | 25.50 | 46.10 | 32.70 | 53.80 | 34.00 | 57.70 | 30.70 | 52.50 | 41.60 |
| LinCIR | CLIP-H | 29.80 | 52.11 | 36.90 | 57.75 | 42.07 | 62.52 | 36.26 | 57.46 | 46.86 |
| DQU-CIR | CLIP-H | 51.90 | 74.37 | 53.57 | 73.21 | 58.48 | 79.23 | 54.65 | 75.60 | 65.13 |
| LinCIR | CLIP-G | 38.08 | 60.88 | 46.76 | 65.11 | 50.48 | 71.09 | 45.11 | 65.69 | 55.40 |
| CIReVL | CLIP-G | 27.07 | 49.53 | 33.71 | 51.42 | 35.80 | 56.14 | 32.19 | 52.36 | 42.28 |
| MagicLens-B | CoCa-B | 29.00 | 48.90 | 36.50 | 55.50 | 40.20 | 61.90 | 35.20 | 55.40 | 45.30 |
| MagicLens-L | CoCa-L | 32.30 | 52.70 | 40.50 | 59.20 | 41.40 | 63.00 | 38.00 | 58.20 | 48.10 |
| SPN (BLIP4CIR1) | BLIP | 44.52 | 67.13 | 45.68 | 67.96 | 50.74 | 73.79 | 46.98 | 69.63 | 58.30 |
| PLI | BLIP-B | 28.62 | 50.78 | 38.09 | 57.79 | 40.92 | 62.68 | 35.88 | 57.08 | 46.48 |
| SPN (SPRC) | BLIP-2 | 50.57 | 74.12 | 57.70 | 75.27 | 60.84 | 79.96 | 56.37 | 76.45 | 66.41 |
| CurlingNet | - | 24.44 | 47.69 | 18.59 | 40.57 | 25.19 | 49.66 | 22.74 | 45.97 | 34.36 |

Performance comparison on the Fashion200k dataset

| Methods | Image Encoder | R@1 | R@10 | R@50 |
|---|---|---|---|---|
| TIRG | ResNet-18 | 14.10 | 42.50 | 63.80 |
| ComposeAE | ResNet-18 | 22.80 | 55.30 | 73.40 |
| HCL | ResNet-18 | 23.48 | 54.03 | 73.71 |
| CoSMo | ResNet-18 | 23.30 | 50.40 | 69.30 |
| JPM(TIRG,MSE) | ResNet-18 | 19.80 | 46.50 | 66.60 |
| JPM(TIRG,Tri) | ResNet-18 | 17.70 | 44.70 | 64.50 |
| ARTEMIS | ResNet-18 | 21.50 | 51.10 | 70.50 |
| GA(TIRG-BERT) | ResNet-18 | 31.40 | 54.10 | 77.60 |
| LGLI | ResNet-18 | 26.50 | 58.60 | 75.60 |
| AlRet | ResNet-18 | 24.42 | 53.93 | 73.25 |
| FashionVLP | ResNet-18 | - | 49.90 | 70.50 |
| CLVC-Net | ResNet-50 | 22.60 | 53.00 | 72.20 |
| Uncertainty | ResNet-50 | 21.80 | 52.10 | 70.20 |
| MCR | ResNet-50 | 49.40 | 69.40 | 59.40 |
| CRN | ResNet-50 | - | 53.10 | 73.00 |
| EER w/ Random Emb. | ResNet-50 | - | 51.09 | 70.23 |
| EER w/ GloVe | ResNet-50 | - | 50.88 | 73.40 |
| DWC | ResNet-50 | 36.49 | 63.58 | 79.02 |
| JGAN | ResNet-101 | 17.34 | 45.28 | 65.65 |
| CRR | ResNet-101 | 24.85 | 56.41 | 73.56 |
| GSCMR | ResNet-101 | 21.57 | 52.84 | 70.12 |
| VAL(GloVe) | MobileNet | 22.90 | 50.80 | 73.30 |
| VAL(Lvv+Lvs) | MobileNet | 21.50 | 53.80 | 72.70 |
| DATIR | MobileNet | 21.50 | 48.80 | 71.60 |
| VAL(Lvv) | MobileNet | 21.20 | 49.00 | 68.80 |
| JVSM | MobileNet-v1 | 19.00 | 52.10 | 70.00 |
| TIS | MobileNet-v1 | 17.76 | 47.54 | 68.02 |
| DCNet | MobileNet-v1 | - | 46.89 | 67.56 |
| TIS | Inception-v3 | 16.25 | 44.14 | 65.02 |
| LBF(big) | Faster-RCNN | 17.78 | 48.35 | 68.50 |
| LBF(small) | Faster-RCNN | 16.26 | 46.90 | 71.73 |
| ProVLA | Swin | 21.70 | 53.70 | 74.60 |
| CRN | Swin | - | 53.30 | 73.30 |
| ComqueryFormer | Swin | - | 52.20 | 72.20 |
| AACL | Swin | 19.64 | 58.85 | 78.86 |
| CRN | Swin-L | - | 53.50 | 74.50 |
| DQU-CIR | CLIP-H | 36.80 | 67.90 | 87.80 |

Performance comparison on the MIT-States dataset

| Methods | Image Encoder | R@1 | R@10 | R@50 | Average |
|---|---|---|---|---|---|
| TIRG | ResNet-18 | 12.20 | 31.90 | 43.10 | 29.10 |
| ComposeAE | ResNet-18 | 13.90 | 35.30 | 47.90 | 32.37 |
| HCL | ResNet-18 | 15.22 | 35.95 | 46.71 | 32.63 |
| GA(TIRG) | ResNet-18 | 13.60 | 32.40 | 43.20 | 29.70 |
| GA(TIRG-BERT) | ResNet-18 | 15.40 | 36.30 | 47.70 | 33.20 |
| GA(ComposeAE) | ResNet-18 | 14.60 | 37.00 | 47.90 | 33.20 |
| LGLI | ResNet-18 | 14.90 | 36.40 | 47.70 | 33.00 |
| MAAF | ResNet-50 | 12.70 | 32.60 | 44.80 | - |
| MCR | ResNet-50 | 14.30 | 35.36 | 47.12 | 32.26 |
| CRR | ResNet-101 | 17.71 | 37.16 | 47.83 | 34.23 |
| JGAN | ResNet-101 | 14.27 | 33.21 | 45.34 | 29.10 |
| GSCMR | ResNet-101 | 17.28 | - | 36.45 | - |
| TIS | Inception-v3 | 13.13 | 31.94 | 43.32 | 29.46 |
| LBF(big) | Faster-RCNN | 14.72 | 35.30 | 46.56 | 96.58 |
| LBF(small) | Faster-RCNN | 14.29 | - | 34.67 | 46.06 |

Performance comparison on the CSS dataset

| Methods | Image Encoder | R@1 (3D-to-3D) | R@1 (2D-to-3D) |
|---|---|---|---|
| TIRG | ResNet-18 | 73.70 | 46.60 |
| HCL | ResNet-18 | 81.59 | 58.65 |
| GA(TIRG) | ResNet-18 | 91.20 | - |
| TIRG+JPM(MSE) | ResNet-18 | 83.80 | - |
| TIRG+JPM(Tri) | ResNet-18 | 83.20 | - |
| LGLI | ResNet-18 | 93.30 | - |
| MAAF | ResNet-50 | 87.80 | - |
| CRR | ResNet-101 | 85.84 | - |
| JGAN | ResNet-101 | 76.07 | 48.85 |
| GSCMR | ResNet-101 | 81.81 | 58.74 |
| TIS | Inception-v3 | 76.64 | 48.02 |
| LBF(big) | Faster-RCNN | 79.20 | 55.69 |
| LBF(small) | Faster-RCNN | 67.26 | 50.31 |

Performance comparison on the Shoes dataset

| Methods | Image Encoder | R@1 | R@10 | R@50 | Average |
|---|---|---|---|---|---|
| ComposeAE | ResNet-18 | 31.25 | 60.30 | - | - |
| TIRG | ResNet-50 | 12.60 | 45.45 | 69.39 | 42.48 |
| VAL(Lvv) | ResNet-50 | 16.49 | 49.12 | 73.53 | 46.38 |
| VAL(Lvv + Lvs) | ResNet-50 | 16.98 | 49.83 | 73.91 | 46.91 |
| VAL(GloVe) | ResNet-50 | 17.18 | 51.52 | 75.83 | 48.18 |
| CoSMo | ResNet-50 | 16.72 | 48.36 | 75.64 | 46.91 |
| CLVC-Net | ResNet-50 | 17.64 | 54.39 | 79.47 | 50.50 |
| DCNet | ResNet-50 | - | 53.82 | 79.33 | - |
| SAC w/BERT | ResNet-50 | 18.50 | 51.73 | 77.28 | 49.17 |
| SAC w/Random Emb. | ResNet-50 | 18.11 | 52.41 | 75.42 | 48.64 |
| ARTEMIS+LSTM | ResNet-50 | 17.60 | 51.05 | 76.85 | 48.50 |
| ARTEMIS+BiGRU | ResNet-50 | 18.72 | 53.11 | 79.31 | 50.38 |
| AMC | ResNet-50 | 19.99 | 56.89 | 79.27 | 52.05 |
| DATIR | ResNet-50 | 17.20 | 51.10 | 75.60 | 47.97 |
| MCR | ResNet-50 | 17.85 | 50.95 | 77.24 | 48.68 |
| EER | ResNet-50 | 20.05 | 56.02 | 79.94 | 52.00 |
| CRN | ResNet-50 | 17.19 | 53.88 | 79.12 | 50.06 |
| Uncertainty | ResNet-50 | 18.41 | 53.63 | 79.84 | 50.63 |
| FashionVLP | ResNet-50 | - | 49.08 | 77.32 | - |
| DWC | ResNet-50 | 18.94 | 55.55 | 80.19 | 51.56 |
| MCEM (L_CE) | ResNet-50 | 15.17 | 49.33 | 73.78 | 46.09 |
| MCEM (L_FCE) | ResNet-50 | 18.13 | 54.31 | 78.65 | 50.36 |
| MCEM (L_AFCE) | ResNet-50 | 19.10 | 55.37 | 79.57 | 51.35 |
| AlRet | ResNet-50 | 18.13 | 53.98 | 78.81 | 50.31 |
| RTIC | ResNet-50 | 43.66 | 72.11 | - | - |
| RTIC-GCN | ResNet-50 | 43.38 | 72.09 | - | - |
| CRR | ResNet-101 | 18.41 | 56.38 | 79.92 | 51.57 |
| CRN | Swin | 17.32 | 54.15 | 79.34 | 50.27 |
| ProVLA | Swin | 19.20 | 56.20 | 73.30 | 49.57 |
| CRN | Swin-L | 18.92 | 54.55 | 80.04 | 51.17 |
| AlRet | CLIP | 21.02 | 55.72 | 80.77 | 52.50 |
| PL4CIR | CLIP-L | 22.88 | 58.83 | 84.16 | 55.29 |
| PL4CIR | CLIP-B | 19.53 | 55.65 | 80.58 | 51.92 |
| TG-CIR | CLIP-B | 25.89 | 63.20 | 85.07 | 58.05 |
| DQU-CIR | CLIP-H | 31.47 | 69.19 | 88.52 | 63.06 |

Performance comparison on the CIRR dataset

| Methods | Image Encoder | R@1 | R@5 | R@10 | R@50 |
|---|---|---|---|---|---|
| ComposeAE | ResNet-18 | - | 29.60 | 59.82 | - |
| MCEM (L_CE) | ResNet-18 | 14.26 | 40.46 | 55.61 | 85.66 |
| MCEM (L_FCE) | ResNet-18 | 16.12 | 43.92 | 58.87 | 86.85 |
| MCEM (L_AFCE) | ResNet-18 | 17.48 | 46.13 | 62.17 | 88.91 |
| Ranking-aware | ResNet-50 | 32.24 | 66.63 | 79.23 | 96.43 |
| SAC w/BERT | ResNet-50 | - | 19.56 | 45.24 | - |
| SAC w/Random Emb. | ResNet-50 | - | 20.34 | 44.94 | - |
| ARTEMIS+BiGRU | ResNet-152 | 16.96 | 46.10 | 61.31 | 87.73 |
| CIRPLANT | ResNet-152 | 15.18 | 43.36 | 60.48 | 87.64 |
| CIRPLANT w/ OSCAR | ResNet-152 | 19.55 | 52.55 | 68.39 | 92.38 |
| CASE | ViT | 48.00 | 79.11 | 87.25 | 97.57 |
| ComqueryFormer | Swin | 25.76 | 61.76 | 75.90 | 95.13 |
| CLIP4CIR | CLIP | 38.53 | 69.98 | 81.86 | 95.93 |
| CLIP4CIR3 | CLIP | 44.82 | 77.04 | 86.65 | 97.90 |
| SPN (TG-CIR) | CLIP | 47.28 | 79.13 | 87.98 | 97.54 |
| SPN (CLIP4CIR) | CLIP | 45.33 | 78.07 | 87.61 | 98.17 |
| Combiner | CLIP | 33.59 | 65.35 | 77.35 | 95.21 |
| MCEM (L_AFCE) | CLIP | 39.80 | 74.24 | 85.71 | 97.23 |
| TG-CIR | CLIP-B | 45.25 | 78.29 | 87.16 | 97.30 |
| CIReVL | CLIP-B | 23.94 | 52.51 | 66.00 | 86.95 |
| SEARLE-OTI | CLIP-B | 24.27 | 53.25 | 66.10 | 88.84 |
| SEARLE | CLIP-B | 24.00 | 53.42 | 66.82 | 89.78 |
| PLI | CLIP-B | 18.80 | 46.07 | 60.75 | 86.41 |
| SEARLE-XL | CLIP-L | 24.24 | 52.48 | 66.29 | 88.84 |
| SEARLE-XL-OTI | CLIP-L | 24.87 | 52.31 | 66.29 | 88.58 |
| CIReVL | CLIP-L | 24.55 | 52.31 | 64.92 | 86.34 |
| Context-I2W | CLIP-L | 25.60 | 55.10 | 68.50 | 89.80 |
| Pic2Word | CLIP-L | 23.90 | 51.70 | 65.30 | 87.80 |
| CompoDiff (with SynthTriplets18M) | CLIP-L | 18.24 | 53.14 | 70.82 | 90.25 |
| LinCIR | CLIP-L | 25.04 | 53.25 | 66.68 | - |
| PLI | CLIP-L | 25.52 | 54.58 | 67.59 | 88.70 |
| KEDs | CLIP-L | 26.40 | 54.80 | 67.20 | 89.20 |
| CIReVL | CLIP-G | 34.65 | 64.29 | 75.06 | 91.66 |
| LinCIR | CLIP-G | 35.25 | 64.72 | 76.05 | - |
| CompoDiff (with SynthTriplets18M) | CLIP-G | 26.71 | 55.14 | 74.52 | 92.01 |
| LinCIR | CLIP-H | 33.83 | 63.52 | 75.35 | - |
| DQU-CIR | CLIP-H | 46.22 | 78.17 | 87.64 | 97.81 |
| PLI | BLIP | 27.23 | 58.87 | 71.40 | 91.25 |
| BLIP4CIR2 | BLIP | 40.17 | 71.81 | 83.18 | 95.69 |
| BLIP4CIR2+Bi | BLIP | 40.15 | 73.08 | 83.88 | 96.27 |
| SPN (BLIP4CIR1) | BLIP | 46.43 | 77.64 | 87.01 | 97.06 |
| SPN (SPRC) | BLIP-2 | 55.06 | 83.83 | 90.87 | 98.29 |
| BLIP4CIR1 | BLIP-B | 46.83 | 78.59 | 88.04 | 97.08 |

[NOTE] If you have any questions, please don't hesitate to contact us.