CLIPS
Official implementation of the paper "CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions".
Previous works show that noisy, web-crawled image-text pairs can limit vision-language pretraining methods such as CLIP, and propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions:
- Motivated by a strong inverse effect we observe with synthetic captions, we feed only partial synthetic captions to the text encoder, which yields significantly better performance (see the sketch after this list).
- We incorporate an autoregressive captioner that mimics the recaptioning process, predicting full-length synthetic captions conditioned on the image and original web-crawled captions.
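As a rough sketch of the first design, the synthetic caption is shortened before it reaches the text encoder. The sentence-level splitting and the function below are illustrative assumptions, not the paper's exact sampling scheme:

```python
import random

def sample_sub_caption(synthetic_caption: str, max_sentences: int = 2) -> str:
    """Return a partial synthetic caption (illustrative sub-caption variant).

    The rich synthetic caption is split into sentences and a short contiguous
    chunk is sampled, so the text encoder only ever sees part of it.
    """
    sentences = [s.strip() for s in synthetic_caption.split(".") if s.strip()]
    if len(sentences) <= max_sentences:
        return synthetic_caption
    start = random.randint(0, len(sentences) - max_sentences)
    return ". ".join(sentences[start:start + max_sentences]) + "."
```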
Our method achieves state-of-the-art (SOTA) results in zero-shot image-text retrieval on MSCOCO and Flickr30K, while enhancing the visual capability of LLaVA.
Key Results
Inverse Effect with Synthetic Captions
Visualization of four token reduction strategies. Each improves the model's learning efficiency on synthetic captions to varying degrees; among them, sub-caption and block mask perform best.
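One plausible reading of the block-mask strategy is to keep a single contiguous block of caption tokens (token-level, in contrast to the sentence-level sub-caption above). The block size and selection policy below are assumptions, not the paper's exact settings:

```python
import random
import torch

def block_mask_tokens(token_ids: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep one contiguous block of caption tokens and drop the rest.

    token_ids: 1-D tensor of token ids for a single caption.
    keep_ratio: fraction of tokens retained as one contiguous block.
    """
    n = token_ids.numel()
    keep = max(1, int(n * keep_ratio))
    start = random.randint(0, n - keep)
    return token_ids[start:start + keep]
```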
Zero-Shot Cross-Modal Retrieval
Our method consistently achieves superior performance across all benchmarks and model sizes, yielding significant improvements over the baselines.
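For reference, zero-shot retrieval is typically scored with Recall@K over the image-text similarity matrix. Below is a minimal sketch, not the repository's evaluation code, assuming the i-th image and i-th text form the ground-truth pair:

```python
import torch

def image_to_text_recall_at_k(similarity: torch.Tensor, k: int = 1) -> float:
    """Compute image-to-text Recall@K from a [num_images, num_texts] similarity matrix."""
    num_images = similarity.size(0)
    topk = similarity.topk(k, dim=-1).indices        # [num_images, k] retrieved text indices
    targets = torch.arange(num_images).unsqueeze(1)  # ground-truth text index per image
    hits = (topk == targets).any(dim=-1).float()
    return hits.mean().item()
```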
Comparison with State-of-the-Art Methods
With increased computational resources and scaling, our best model further achieves 76.4% and 96.6% R@1 text retrieval performance on MSCOCO and Flickr30K respectively, and 57.2% and 83.9% R@1 image retrieval performance on the same datasets, setting new state-of-the-art (SOTA) results.
CLIPS in LLaVA
Replacing OpenAI-CLIP with CLIPS significantly boosts LLaVA's performance across various benchmarks.
Model Zoo
| Model | Link |
|---|---|
| CLIPS-Large-14-224 | 🤗 HuggingFace Model |
| CLIPS-Large-14-336 | 🤗 HuggingFace Model |
| CLIPS-Huge-14-224 | 🤗 HuggingFace Model |
| CLIPS-Huge-14-336 | Coming Soon... |
Model Usage
Environment
Install dependencies:
pip3 install -r requirements.txt
With OpenCLIP
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer
model, preprocess = create_model_from_pretrained('hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')
tokenizer = get_tokenizer('hf-hub:UCSC-VLAA/ViT-L-14-CLIPS-Recap-DataComp-1B')
image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[0., 0., 0., 1.0]]
Note: We made modifications to the tokenizer implementation in open_clip/tokenizer.py.
Acknowledgement
This PyTorch repo is built on OpenCLIP. Many thanks to the awesome work of the open-source community!
We would like to thank the TPU Research Cloud (TRC) program, the Google Cloud Research Credits program, and the AWS Cloud Credit for Research program for supporting our computing needs.
Citation
If you use our work, please cite it:
@misc{liu2024clipsenhancedclipframework,
  title={CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions},
  author={Yanqing Liu and Xianhang Li and Zeyu Wang and Bingchen Zhao and Cihang Xie},
  year={2024},
  eprint={2411.16828},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.16828},
}