
Clip Embedding Reordering

Note: This currently relies on using the --output_format="webdataset" option from img2dataset. If your images are not inside .tar files, this will not work correctly.

CLIP embeddings generated by clip-retrieval are not stored in the same order as the webdataset they are generated from. This tool can reorder large CLIP embedding datasets so that they match the order of the image dataset they were generated from.
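The core idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not this tool's implementation: it assumes every embedding row has a matching fixed-width, zero-padded key, and the function name reorder_by_key and the sample data are hypothetical.

```python
def reorder_by_key(embeddings, keys):
    """Return (embeddings, keys) sorted by ascending key.

    Assumes keys are fixed-width, zero-padded strings, so that
    lexicographic order matches numeric order.
    """
    if len(embeddings) != len(keys):
        raise ValueError("embeddings and keys must have the same length")
    # Positions of the rows in ascending key order.
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    return [embeddings[i] for i in order], [keys[i] for i in order]


# Hypothetical, made-up data: three embeddings stored out of order.
embeddings = [[0.3, 0.3], [0.1, 0.1], [0.2, 0.2]]
keys = ["0000021", "0000009", "0000015"]
sorted_embeddings, sorted_keys = reorder_by_key(embeddings, keys)
```

A real embedding dataset is usually too large to sort in memory in one pass like this, so treat the sketch as a description of the goal rather than of the method.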

Install

git clone https://github.com/Veldrovive/embedding-dataset-reordering

cd embedding-dataset-reordering

pip install -e .

API

This module exposes three functions. Example commands are meant to be evaluated from inside the examples folder.

For example, to download the test dataset with img2dataset, navigate to the root directory and run cd examples && reorder-embeddings download-data.

To generate embeddings with clip-retrieval for this test data, run reorder-embeddings clip-inference from the examples folder.

reorder: Takes as input an unordered embedding dataset along with metadata generated by clip-retrieval and reorders the embeddings to match the order of the image dataset.

Note: Before starting, you need to find the shard string width and index string width of your dataset. This is a manual step, but both values are easy to read off. Navigate to the metadata directory of your embedding dataset and run reorder-embeddings example_key.

This will print something similar to:

Example Keys:
Shard 3 has keys ['0000309', '0000321']
Shard 2 has keys ['0000209', '0000237']
Shard 0 has keys ['0000022', '0000031']
Shard 1 has keys ['0000114', '0000123']

By inspection, we can see that the first 5 characters of each key represent the index of the shard (i.e. the keys for shard 3 start with 00003), so the shard width is 5. The remaining 2 digits represent the index of the sample within the shard, which means the index width is 2.
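Once the index width is known, splitting a key into its shard and index parts is a plain string slice. The sketch below is a minimal illustration, not code from this package: the helper name split_key and the sample key are hypothetical, and it assumes keys consist only of zero-padded digits.

```python
from typing import Tuple


def split_key(key: str, index_width: int) -> Tuple[int, int]:
    """Split a zero-padded key into (shard, index).

    The last `index_width` characters are the index within the shard;
    everything before them is the shard number.
    """
    if not 0 < index_width < len(key):
        raise ValueError("index_width must be between 1 and len(key) - 1")
    return int(key[:-index_width]), int(key[-index_width:])


# Hypothetical key with a 5-character shard part and a 4-character index part.
shard, index = split_key("000120034", index_width=4)
```

Checking a few keys this way against the shard numbers printed by example_key is a quick way to confirm that the chosen index width is right.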

Parameters

embeddings_folder: Path to the folder containing the embedding .npy files.
metadata_folder: Path to the folder containing the .parquet metadata files.
output_folder: Path to the folder where the reordered .npy files will be saved.
index_width: The index width found above.
output_shard_width: The width of the shard string for the output files. Should be the same as the shard width for the webdataset.
limit: The number of shards to reorder.
run-concurrent: The number of workers to use during reordering.
verbose: Whether to print out expanded logging.
tmp-folder: With many workers, the temporary file directories get very large. If this is a problem, reduce the number of workers or set tmp-folder to a location with more space available.

download-data: Uses img2dataset to download a test dataset. Run this from the examples directory to download the default one.

clip-inference: Uses clip-retrieval to generate embeddings for the test dataset. Run this from the examples directory after downloading the test dataset.

For development

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run tests:

pip install -r requirements-test.txt

then

make lint
make test

You can use make black to reformat the code.