Composed Image Retrieval for Training-FREE DOMain Conversion (WACV 2025) 🎨
This repository contains the official PyTorch implementation of our WACV 2025 paper: "Composed Image Retrieval for Training-FREE DOMain Conversion". [arXiv]
Overview
We introduce FREEDOM, a training-free composed image retrieval (CIR) method for domain conversion based on vision-language models (VLMs). Given an $\textcolor{orange}{image\ query}$ and a $\it{text\ query}$ that names a domain, images are retrieved that have the class of the $\textcolor{orange}{image\ query}$ and the domain of the $\it{text\ query}$. A range of applications is targeted, where classes can be defined at category level (a, b) or instance level (c), and domains can be defined as styles (a, c) or context (b). In the visualization below, for each image query, retrieved images are shown for different text queries.
<div align="center"> <img width="80%" alt="Domains" src="images/teaser.png"> </div>

Motivation
In this paper, we focus on a specific variant of composed image retrieval, namely domain conversion, where the text query defines the target domain. Unlike conventional cross-domain retrieval, where models are trained to use queries from a source domain and retrieve items from a target domain, we address a more practical, open-domain setting, where the query and database may come from any unseen domain. We target different variants of this task, where the class of the query object is defined at category level (a, b) or instance level (c), while the domain corresponds to descriptions of style (a, c) or context (b). Even though domain conversion is a subset of the tasks handled by existing CIR methods, the variants considered in our work reflect a more comprehensive set of applications than those encountered in prior art.
Approach
Given a $\textcolor{orange}{query\ image}$ and a $\it{query\ text}$ indicating the target domain, proxy images are first retrieved for the query through an image-to-image search over a visual memory. Then, a set of text labels is associated with each proxy image through an image-to-text search over a textual memory. Each of the most frequent text labels is combined with the $\it{query\ text}$ in the text space, and images are retrieved from the database by text-to-image search. The resulting sets of similarities are linearly combined, with the frequencies of occurrence as weights. In the method figure below: $k=4$ proxy images, $n=3$ text labels per proxy image, $m=2$ most frequent text labels.
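For intuition, here is a minimal sketch of this pipeline in NumPy, assuming precomputed, L2-normalized image and text features; the names used (`freedom_retrieval`, `embed_composed`, the memory arrays) are illustrative and are not the repository's API.

```python
# Illustrative sketch of the retrieval scheme described above; not the official
# implementation. All feature matrices are assumed precomputed and L2-normalized.
import numpy as np

def freedom_retrieval(q_img,           # (d,)   query image feature
                      memory_img,      # (M, d) visual memory (image features)
                      memory_txt,      # (T, d) textual memory (label features)
                      db_img,          # (N, d) database image features
                      embed_composed,  # label index -> (d,) feature of the label
                                       # combined with the query domain text
                      k=4, n=3, m=2):
    # 1) Image-to-image search: the k proxy images closest to the query.
    proxies = np.argsort(memory_img @ q_img)[::-1][:k]

    # 2) Image-to-text search: n text labels per proxy image, with counts.
    votes = {}
    for p in proxies:
        for t in np.argsort(memory_txt @ memory_img[p])[::-1][:n]:
            votes[t] = votes.get(t, 0) + 1

    # 3) Keep the m most frequent labels; frequencies become the weights.
    top = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)[:m]
    total = sum(w for _, w in top)

    # 4) Text-to-image search per composed text; linearly combine similarities.
    scores = np.zeros(db_img.shape[0])
    for t, w in top:
        scores += (w / total) * (db_img @ embed_composed(t))

    return np.argsort(scores)[::-1]     # ranked database indices
```

Normalizing the weights by their sum does not change the ranking; it only keeps the combined scores in a comparable range.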
<div align="center"> <img width="80%" alt="Domains" src="images/method.png"> </div>

Environment
Our experiments were conducted with Python 3.10. To set up a Python virtual environment, run:
python -m venv ~/freedom
source ~/freedom/bin/activate
pip install -r requirements.txt
To set up a Conda environment, run:
conda create --name freedom python=3.10
conda activate freedom
pip install -r requirements.txt
Dataset
Downloading the datasets
- Download the ImageNet-R dataset and the validation set of ILSVRC2012 and place them in the directory `data/imagenet-r`.
- Download the LTLL dataset and place it in the directory `data/ltll`.
- Download the four domains of mini-DomainNet (clipart, painting, real, and sketch) and place them in the directory `data/minidn`.
- Download the NICO++ dataset, specifically `DG_Benchmark.zip` from the Dropbox link, and place it in the directory `data/nico`.

The starting data directory structure should look like this:
freedom/
├── data/
│   ├── imagenet-r/
│   │   ├── imagenet-r.tar
│   │   ├── ILSVRC2012_img_val.tar
│   │   ├── imgnet_real_query.txt
│   │   ├── imgnet_targets.txt
│   │   └── label_names.csv
│   ├── ltll/
│   │   ├── LTLL.zip
│   │   └── full_files.csv
│   ├── minidn/
│   │   ├── clipart.zip
│   │   ├── painting.zip
│   │   ├── real.zip
│   │   ├── sketch.zip
│   │   ├── database_files.csv
│   │   └── query_files.csv
│   └── nico/
│       ├── DG_Benchmark.zip
│       ├── database_files.csv
│       └── query_files.csv
Setting up the datasets
- To set up ImageNet-R, run:
mkdir -p ./data/imagenet_r/imagenet_val && tar -xf ./data/imagenet_r/ILSVRC2012_img_val.tar -C ./data/imagenet_r/imagenet_val
tar -xf ./data/imagenet_r/imagenet-r.tar -C ./data/imagenet_r/
python set_dataset.py --dataset imagenet_r
The script `set_dataset.py` will create the folder `real`, which contains the images of the 200 ImageNet-R classes taken from the validation set of ILSVRC2012, placed into their corresponding class folders. It will also create the file `full_files.csv`, which is needed for data loading. After that, you no longer need the `imagenet_val` folder.
- To set up LTLL, run:
unzip ./data/ltll/LTLL.zip -d ./data/ltll
python set_dataset.py --dataset ltll
The script `set_dataset.py` will handle space (`" "`) characters in directory and file names.
- To set up Mini-DomainNet, run:
unzip ./data/minidn/clipart.zip -d ./data/minidn/
unzip ./data/minidn/painting.zip -d ./data/minidn/
unzip ./data/minidn/real.zip -d ./data/minidn/
unzip ./data/minidn/sketch.zip -d ./data/minidn/
- To set up NICO++, run:
unzip ./data/nico/DG_Benchmark.zip -d ./data/nico
unzip ./data/nico/NICO_DG_Benchmark.zip -d ./data/nico
mv ./data/nico/NICO_DG/* ./data/nico/
rmdir ./data/nico/NICO_DG
python set_dataset.py --dataset nico
The script `set_dataset.py` will handle space (`" "`) characters in directory names.
The necessary files in the final data directory structure are the following:
freedom/
├── data/
│   ├── imagenet-r/
│   │   ├── imagenet-r/
│   │   ├── real/
│   │   └── full_files.csv
│   ├── ltll/
│   │   ├── New/
│   │   ├── Old/
│   │   └── full_files.csv
│   ├── minidn/
│   │   ├── clipart/
│   │   ├── painting/
│   │   ├── real/
│   │   ├── sketch/
│   │   ├── database_files.csv
│   │   └── query_files.csv
│   └── nico/
│       ├── autumn/
│       ├── dim/
│       ├── grass/
│       ├── outdoor/
│       ├── rock/
│       ├── water/
│       ├── database_files.csv
│       └── query_files.csv
Experiments
Extract features
To run any experiment, you first need to extract features with the script `create_features.py`. Both the features of the corpus and the features of the desired dataset are needed, but for a given backbone they only need to be extracted once. You specify which features to produce with `--dataset`, which takes one of `corpus`, `imagenet_r`, `nico`, `minidn`, and `ltll`. You also need to specify the `--backbone`, which can be either `clip` or `siglip`. You can specify the GPU ID with `--gpu`. For example, to run experiments on ImageNet-R with CLIP on GPU 0, first extract the following features:
python create_features.py --dataset corpus --backbone clip --gpu 0
python create_features.py --dataset imagenet_r --backbone clip --gpu 0
Run experiments
The experiments are run through the `run_retrieval.py` script. You should specify:
- `--dataset`: one of `imagenet_r`, `nico`, `minidn`, and `ltll`.
- `--backbone`: either `clip` or `siglip`.
- `--method`: one of `freedom`, `image`, `text`, `sum`, and `product`.

For example, you can run:
python run_retrieval.py --dataset imagenet_r --backbone clip --method freedom --gpu 0
Specifically for `freedom` experiments, you can change the hyperparameters described in the paper by specifying `--kappa`, `--miu`, and `--ni`.
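For instance, such a run could look like the command below; the hyperparameter values here are arbitrary placeholders chosen only to illustrate the syntax (see the paper for the settings actually used):

python run_retrieval.py --dataset imagenet_r --backbone clip --method freedom --kappa 15 --miu 5 --ni 3 --gpu 0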
Expected results
All mAP values in this paper are calculated as in Radenović et al.; a minimal mAP sketch is given after the result tables below. The experiments conducted with the `run_retrieval.py` script produce the following Domain Conversion mAP (%) results for the datasets and methods outlined above. By running the experiments as described, you should expect the following numbers for the CLIP backbone:
ImageNet-R
Method | CAR | ORI | PHO | SCU | TOY | AVG |
---|---|---|---|---|---|---|
Text | 0.82 | 0.63 | 0.68 | 0.78 | 0.78 | 0.74 |
Image | 4.27 | 3.12 | 0.84 | 5.86 | 5.08 | 3.84 |
Text + Image | 6.61 | 4.45 | 2.17 | 9.18 | 8.62 | 6.21 |
Text × Image | 8.21 | 5.62 | 6.98 | 8.95 | 9.41 | 7.83 |
FreeDom | 35.97 | 11.80 | 27.97 | 36.58 | 37.21 | 29.91 |
MiniDomainNet
Method | CLIP | PAINT | PHO | SKE | AVG |
---|---|---|---|---|---|
Text | 0.63 | 0.52 | 0.63 | 0.51 | 0.57 |
Image | 7.15 | 7.31 | 4.38 | 7.78 | 6.66 |
Text + Image | 9.59 | 9.97 | 9.22 | 8.53 | 9.33 |
Text × Image | 9.01 | 8.66 | 15.87 | 5.90 | 9.86 |
FreeDom | 41.96 | 31.65 | 41.12 | 34.36 | 37.27 |
NICO++
Method | AUT | DIM | GRA | OUT | ROC | WAT | AVG |
---|---|---|---|---|---|---|---|
Text | 1.00 | 0.99 | 1.15 | 1.23 | 1.10 | 1.05 | 1.09 |
Image | 6.45 | 4.85 | 5.67 | 7.67 | 7.53 | 8.75 | 6.82 |
Text + Image | 8.46 | 6.58 | 9.22 | 11.91 | 11.20 | 8.41 | 9.30 |
Text × Image | 8.24 | 6.36 | 12.11 | 12.71 | 10.46 | 8.84 | 9.79 |
FreeDom | 24.35 | 24.41 | 30.06 | 30.51 | 26.92 | 20.37 | 26.10 |
LTLL
Method | TODAY | ARCHIVE | AVG |
---|---|---|---|
Text | 5.28 | 6.16 | 5.72 |
Image | 8.47 | 24.51 | 16.49 |
Text + Image | 9.60 | 26.13 | 17.86 |
Text × Image | 16.42 | 29.90 | 23.16 |
FreeDom | 30.95 | 35.52 | 33.24 |
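As noted above, the mAP protocol follows Radenović et al. For reference, the snippet below is a generic, non-interpolated mAP computation over ranked retrieval lists; it is only an illustration and may differ from the repository's evaluation code in interpolation details.

```python
# Generic mean Average Precision over ranked retrieval results (illustrative).
import numpy as np

def average_precision(ranked_ids, positive_ids):
    """AP for one query: precision@rank accumulated at each relevant hit,
    divided by the total number of relevant items."""
    positives = set(positive_ids)
    hits, ap = 0, 0.0
    for rank, idx in enumerate(ranked_ids, start=1):
        if idx in positives:
            hits += 1
            ap += hits / rank
    return ap / len(positives) if positives else 0.0

def mean_average_precision(rankings, relevances):
    """mAP over queries; rankings[i] is the ranked id list for query i."""
    return float(np.mean([average_precision(r, p)
                          for r, p in zip(rankings, relevances)]))
```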
Acknowledgement
NTUA thanks NVIDIA for supporting this work through the donation of GPU hardware.
License
This repository is released under the MIT license as found in the LICENSE file.
Citation
If you find this repository useful, please consider giving it a star 🌟 and citing our work:
@inproceedings{efth2025composed,
title={Composed Image Retrieval for Training-Free Domain Conversion},
author={Efthymiadis, Nikos and Psomas, Bill and Laskar, Zakaria and Karantzalos, Konstantinos and Avrithis, Yannis and Chum, Ondřej and Tolias, Giorgos},
booktitle={IEEE Winter Conference on Applications of Computer Vision},
year={2025},
organization={IEEE}
}