snip-dedup
This repo is a work in progress (WIP).
You can no longer filter the LAION dataset to remove duplicates, as LAION has disabled the webdataset on Hugging Face. I'll focus on adding deduplication functionality for future webdatasets using CLIP features.
- Compress features using pretrained SNIP networks (for ViT-H-14, ViT-L-14, ViT-B-32)
- Read our research paper
- Train SNIP on your CLIP features
- Run a de-duplication of your dataset using our de-dup code
SNIP is a technique for compressing CLIP features. It is competitive with previous works for large-scale retrieval of deep features and has some nice properties for multi-modal features. Read more about it here.
We used SNIP together with the faiss library to deduplicate a billion-scale dataset and found a high level of duplication (roughly 700M duplicates out of 2 billion samples). This webdataset is no longer being distributed by LAION.
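As a rough, illustrative sketch of this kind of pairwise check (not the exact pipeline from the paper), compressed features can be indexed and queried with faiss; the random features and the fixed threshold below are placeholders:

```python
import faiss
import numpy as np

# toy stand-in for compressed SNIP features (real features come from CLIP + SNIP)
feats = np.random.rand(10_000, 32).astype("float32")

index = faiss.IndexFlatL2(feats.shape[1])  # exact index; billion-scale runs need a quantized index
index.add(feats)

# nearest neighbor of every feature (column 0 is the feature itself)
dist, nn = index.search(feats, 2)
threshold = 0.1  # placeholder; the paper uses an adaptive threshold on ADC distances
dup_pairs = [(i, int(nn[i, 1])) for i in range(len(feats)) if dist[i, 1] < threshold]
```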
Install
pip install --upgrade snip-dedup
Usage
# List available commands
snip --help
snip download --help
# Download and deduplicate the first 10 shards of the dataset
snip download --start 0 --end 10
Then, you may download the (deduplicated) LAION-2B images with the awesome img2dataset.
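For illustration, here is a hedged sketch of calling img2dataset from Python; the metadata path, column names, and image size below are assumptions and need to be adapted to your filtered metadata:

```python
from img2dataset import download

download(
    url_list="laion2b_dedup_metadata/",  # hypothetical folder of filtered parquet metadata
    input_format="parquet",
    url_col="URL",
    caption_col="TEXT",
    output_format="webdataset",
    output_folder="laion2b_dedup_images/",
    image_size=256,
    processes_count=16,
    thread_count=64,
)
```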
See the colab for a demo on search.
What is a Duplicate?
In our first iteration, we merely marked duplicates pairwise and removed one sample from each duplicate pair (the above code downloads a binary array of samples to remove). In our latest run, we recorded the entire adjacency matrix of duplication. For instance, suppose SNIP has labeled feature $k$ as a duplicate of feature $j$; then $A[k,j] = A[j,k] = 1$ in the adjacency matrix. We are currently having trouble computing the full connected components of this matrix; see this issue.
If you allow connected components with only one node, then the number of "unique" samples follows by keeping one sample from each connected component: with $|\mathcal{C}|$ components over $N$ nodes, there are $D := N - |\mathcal{C}|$ duplicates.
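As a small illustration of this count (toy adjacency matrix, not the real one), the connected components and the duplicate count $D$ can be computed with scipy:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# toy adjacency matrix: nodes 0-1-2 form one duplicate set, node 3 is unique
N = 4
rows = np.array([0, 1, 1, 2])
cols = np.array([1, 0, 2, 1])
A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(N, N))

n_components, labels = connected_components(A, directed=False)
D = N - n_components  # keep one sample per component
print(n_components, D)  # 2 components, 2 duplicates
```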
Approximate CCs of Duplicates
Currently, we have an approximation of the connected components of the duplicates. During de-duplication, we label nodes as follows. Suppose we are at node $n$; the pseudo-code for one step of labeling is:
labels = np.arange(0, N)
...
d, i = index.search(feats[n:n+1, :], k)  # faiss expects a 2D array of queries
dups = get_dups(d, i)  # use adaptive threshold on ADC (see paper)
labels[dups] = resolve_labels_one_step(dups)
where $N$ is the number of nodes (2B for LAION-2B). Here resolve_labels_one_step simply re-labels any node that is still unlabeled to the current node $n$. This can be thought of as building a tree. We then connect nodes with common ancestors with a fixed-point iteration:
while not np.array_equal(labels, labels[labels]):
    labels = labels[labels]  # pointer jumping until the labels stop changing
The labels produced by the above loop can be found on Hugging Face: vitl14_labels.
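To see why iterating labels = labels[labels] resolves chains of ancestors, here is a tiny self-contained example (toy labels, not the real ones):

```python
import numpy as np

# toy forest: node 1 points to root 0, node 3 points to 1, node 4 points to 3; node 2 is its own root
labels = np.array([0, 0, 2, 1, 3])

while not np.array_equal(labels, labels[labels]):
    labels = labels[labels]  # pointer jumping: each node adopts its ancestor's label

print(labels)  # [0 0 2 0 0] -> nodes 0, 1, 3, 4 share root 0; node 2 is alone
```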
Other:
- cumulative sizes of features (for indexing sharded files)
Finding images overfit by Stable Diffusion
By analyzing the most duplicated images, we have found several more images copied verbatim by Stable Diffusion, posing a copyright problem:
Note on False positives
We noticed that many images labeled as duplicates by SNIP but not by the raw features are in fact near-duplicates, for example:
You may check a list of (randomly sampled) detected duplicate pairs here.
Semantic Search
You may use the compressed features to do semantic search with faiss (see, for instance, the clip-retrieval repository).
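A minimal sketch of such a search with faiss, assuming you already have compressed image features and a query feature of the same dimension (sizes and names below are illustrative):

```python
import faiss
import numpy as np

dim = 32  # dimension of the compressed features (illustrative)
image_feats = np.random.rand(100_000, dim).astype("float32")
query = np.random.rand(1, dim).astype("float32")  # e.g. a compressed CLIP text embedding

# normalize so that inner product equals cosine similarity
faiss.normalize_L2(image_feats)
faiss.normalize_L2(query)

index = faiss.IndexFlatIP(dim)
index.add(image_feats)

scores, ids = index.search(query, 10)  # top-10 most similar images
print(ids[0], scores[0])
```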
Contribute
Contributions are welcome. Usually, the best way is first to open an issue to discuss things.
This Python project uses the hatch project manager. Dependencies are specified inside the pyproject.toml file, and build configs inside the hatch.toml file. As such, you can enter the isolated development environment with hatch shell from inside the repository.
The code should be documented following the Numpy docstring standard.
To avoid silly mistakes, the code is checked with pyright. To ensure consistent styling, all Python code is formatted with black, and we use the ruff linter. Note that these can usually be installed in your editor, such as VS Code, to view the checks directly in the code. Once you have installed them (pipx is suggested), you can check that the code is consistent with:
hatch run check # check for mistakes via static analysis with pyright
black --check snip_dedup/ # check formatting of all python files
ruff check snip_dedup/ # check linting rules
STILL TODO:
- add docs / tutorial
- add tests
- check max file size on CI to prevent pushing data
- auto-publish GitHub action; example at https://github.com/ofek/hatch-showcase/blob/master/.github/workflows/build.yml
Citation
@misc{webster2023deduplication,
  title={On the De-duplication of LAION-2B},
  author={Ryan Webster and Julien Rabin and Loic Simon and Frederic Jurie},
  year={2023},
  eprint={2303.12733},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}