Clip Embedding Reordering

Note: This tool currently relies on the --output_format="webdataset" option of img2dataset. If your images are not inside .tar files, it will not work correctly.

CLIP embeddings generated by clip-retrieval are not stored in the same order as the webdataset they were generated from. This tool reorders large CLIP embedding datasets so that they match the order of the image dataset they were generated from.
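To illustrate the idea, here is a minimal sketch of reordering by key. It is not this tool's implementation: it assumes that each embedding comes with the zero-padded key of its image and that the image dataset is ordered by ascending key. The function name reorder_by_key and the toy data are hypothetical, and plain lists stand in for the embedding arrays.

```python
def reorder_by_key(keys, embeddings):
    """Return (keys, embeddings) sorted by key, keeping each pair together.

    Hypothetical helper for illustration only; assumes zero-padded keys of
    equal width, so string order matches numeric order.
    """
    if len(keys) != len(embeddings):
        raise ValueError("keys and embeddings must have the same length")
    # Compute the permutation that sorts the keys, then apply it to both lists.
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    return [keys[i] for i in order], [embeddings[i] for i in order]


# Toy data: four keys in scrambled order, each with a one-element "embedding".
keys = ["0000114", "0000022", "0000123", "0000031"]
embeddings = [[0.3], [0.1], [0.4], [0.2]]

sorted_keys, sorted_embeddings = reorder_by_key(keys, embeddings)
print(sorted_keys)
print(sorted_embeddings)
```

Sorting an index permutation rather than the embeddings themselves keeps the approach usable when the embeddings live in a separate array that is reindexed in one step.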

Install

git clone https://github.com/Veldrovive/embedding-dataset-reordering

cd embedding-dataset-reordering

pip install -e .

API

This module exposes three functions. Example commands are meant to be evaluated from inside the examples folder.

For example, to download the test dataset with img2dataset, navigate to the root directory and run cd examples && reorder-embeddings download-data.

To generate embeddings with clip-retrieval for this test data, run reorder-embeddings clip-inference from the examples folder.


reorder: Takes as input an unordered embedding dataset along with metadata generated by clip-retrieval and reorders the embeddings to match the order of the image dataset.

Note: Before starting, you need to find the shard string width and index string width of your dataset. This is a manual step, but the values are easy to find. Navigate to the metadata directory of your embedding dataset and run reorder-embeddings example_key.

This will print something similar to:

Example Keys:
Shard 3 has keys ['0000309', '0000321']
Shard 2 has keys ['0000209', '0000237']
Shard 0 has keys ['0000022', '0000031']
Shard 1 has keys ['0000114', '0000123']

By inspection, the first 5 characters are the shard index (the keys for shard 3 start with 00003), so the shard width is 5. The remaining digits are the index within the shard; the keys shown here have 2 remaining digits, so the index width is 2.
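The same inspection can be sketched in code. This is a hypothetical helper, not part of reorder-embeddings: it assumes every key has the same length and that the key prefix, read as an integer, equals the shard number. The name infer_widths and the input mapping (copied from the example output above) are illustrative.

```python
def infer_widths(example_keys):
    """Guess (shard_width, index_width) from a {shard_number: [keys]} mapping.

    Hypothetical helper for illustration only.
    """
    key_lengths = {len(key) for keys in example_keys.values() for key in keys}
    if len(key_lengths) != 1:
        raise ValueError("expected all keys to have the same length")
    key_length = key_lengths.pop()

    # A prefix width is a candidate only if it decodes to the right shard
    # number for every key of every shard.
    candidates = [
        width
        for width in range(1, key_length)
        if all(
            int(key[:width]) == shard
            for shard, keys in example_keys.items()
            for key in keys
        )
    ]
    if len(candidates) != 1:
        raise ValueError(f"no unique shard width found: {candidates}")
    shard_width = candidates[0]
    return shard_width, key_length - shard_width


example_keys = {
    3: ["0000309", "0000321"],
    2: ["0000209", "0000237"],
    0: ["0000022", "0000031"],
    1: ["0000114", "0000123"],
}
shard_width, index_width = infer_widths(example_keys)
print(shard_width, index_width)
```

Checking all shards together matters: the keys of shard 0 alone are consistent with several prefix widths, and only the nonzero shards narrow the choice down.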

Parameters

download-data: Uses img2dataset to download a test dataset. Run this from the examples directory to download the default one.

clip-inference: Uses clip-retrieval to generate embeddings for the test dataset. Run this from the examples directory after downloading the test dataset.

For development

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run tests:

pip install -r requirements-test.txt

then

make lint
make test

You can use make black to reformat the code.