
<br /> <p align="center"> <h1 align="center">Multilingual-CLIP</h1> <h3 align="center">OpenAI CLIP text encoders for any language</h3> <p align="center"> <a href="https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2Fknn5.laion.ai&index=laion_400m&useMclip=true">Live Demo</a> · <a href="https://huggingface.co/M-CLIP">Pre-trained Models</a> · <a href="https://github.com/FreddeFrallan/Contrastive-Tension/issues">Report Bug</a> </p> </p>


<!-- ABOUT THE PROJECT -->

Overview


OpenAI recently released the paper Learning Transferable Visual Models From Natural Language Supervision, in which they present the CLIP (Contrastive Language–Image Pre-training) model. This model is trained to connect text and images by matching their corresponding vector representations using a contrastive learning objective. CLIP consists of two separate models, a visual encoder and a text encoder. These were trained on a whopping 400 million images and corresponding captions. OpenAI has since released a set of their smaller CLIP models, which can be found on the official CLIP GitHub.
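To make the contrastive objective above more concrete, here is a minimal pure-Python sketch of a symmetric contrastive loss over a batch of matched image/text pairs. It is an illustration only, not CLIP's training code: the function names, the toy 2-dimensional vectors, and the fixed temperature of 0.07 are all assumptions made for this example.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i])."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
           for img in images]
    # Rows: image i should select text i. Columns: text j should select image j.
    img_to_txt = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    txt_to_img = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (img_to_txt + txt_to_img) / 2

# Toy batch (hypothetical values): two images and two candidate caption sets
images = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(images, [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss(images, [[0.1, 0.9], [0.9, 0.1]])
print(matched < shuffled)
```

The loss is low when each image is most similar to its own caption and high when the captions are swapped, which is the behaviour the objective rewards.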

Demo

A live demonstration of multilingual Text-Image retrieval using M-CLIP can be found here! This demo was created by Rom1504, and it allows you to search the LAION-400M dataset in various languages using M-CLIP.

This repository contains

Requirements

While it is possible that other versions work equally well, we have worked with the following:

Install

pip install multilingual-clip torch

You can also choose to pip install tensorflow instead of torch.

Inference Usage

Inference code for Tensorflow is also available in inference_example.py

from multilingual_clip import pt_multilingual_clip
import transformers

texts = [
    'Three blind horses listening to Mozart.',
    'Älgen är skogens konung!',
    'Wie leben Eisbären in der Antarktis?',
    'Вы знали, что все белые медведи левши?'
]
model_name = 'M-CLIP/XLM-Roberta-Large-Vit-L-14'

# Load Model & Tokenizer
model = pt_multilingual_clip.MultilingualCLIP.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

embeddings = model.forward(texts, tokenizer)
print(embeddings.shape)
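The text embeddings are meant to be compared against CLIP image embeddings, typically with cosine similarity. The sketch below shows that ranking step in pure Python. The 3-dimensional vectors are stand-ins invented for this example (the models listed in this README use 512, 640, or 768 dimensions), and `cosine_similarity` and `rank_images` are hypothetical helper names, not part of the `multilingual_clip` package.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(text_emb, image_embs):
    """Return image indices sorted from most to least similar to the text."""
    scores = [cosine_similarity(text_emb, img) for img in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)

# Stand-in embeddings; in practice these would come from the text and image encoders
text_emb = [0.2, 0.9, 0.1]
image_embs = [
    [0.9, 0.1, 0.0],  # points in a different direction than the text
    [0.1, 0.8, 0.2],  # nearly parallel to the text embedding
    [0.0, 0.3, 0.9],  # partially aligned
]

ranking = rank_images(text_emb, image_embs)
print(ranking)
```

With real tensors the same computation is usually done in one batched matrix product after normalizing both sets of embeddings.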

Install for development

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

Pre-trained Models

Every text encoder is a transformer available on Huggingface, with an additional linear layer on top. For more information about a specific model, click the Model Name to see its model card. <br> <br>

| Name | Model Base | Vision Model | Vision Dimensions | Pre-trained Languages | #Parameters |
|---|---|---|---|---|---|
| LABSE Vit-L/14 | LaBSE | OpenAI ViT-L/14 | 768 | 109 Languages | 110 M |
| XLM-R Large Vit-B/32 | XLM-Roberta-Large | OpenAI ViT-B/32 | 512 | 100 Languages | 344 M |
| XLM-R Large Vit-L/14 | XLM-Roberta-Large | OpenAI ViT-L/14 | 768 | 100 Languages | 344 M |
| XLM-R Large Vit-B/16+ | XLM-Roberta-Large | Open CLIP ViT-B-16-plus-240 | 640 | 100 Languages | 344 M |
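As a rough illustration of "a transformer with an additional linear layer on top", the sketch below pools per-token vectors into one sentence vector and projects it to the vision model's dimension. It assumes mask-weighted mean pooling before the linear layer. Treat that pooling choice, the helper names, and the toy sizes as assumptions for this example; the model cards and source code are the reference for the actual architecture.

```python
def mean_pool(token_embs, attention_mask):
    """Average token vectors, ignoring padded positions (mask value 0)."""
    dim = len(token_embs[0])
    kept = sum(attention_mask)
    return [
        sum(tok[d] * m for tok, m in zip(token_embs, attention_mask)) / kept
        for d in range(dim)
    ]

def linear(x, weight, bias):
    """Apply y = W x + b, with weight given as a list of output rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

# Hypothetical toy sizes: 3 tokens (the last is padding), hidden size 2, output size 3
token_embs = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]

pooled = mean_pool(token_embs, attention_mask)   # sentence vector in transformer space
projected = linear(pooled, weight, bias)         # sentence vector in vision space
print(pooled, projected)
```

In the real models the output size of the linear layer corresponds to the Vision Dimensions column, so that text embeddings can be compared directly with image embeddings.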

Validation & Training Curves

The following table shows the <b>Txt2Img Recall@10</b> on the human-translated MS-COCO test set.

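| Name | En | De | Es | Fr | Zh | It | Pl | Ko | Ru | Tr | Jp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI CLIP Vit-B/32 | 90.3 | - | - | - | - | - | - | - | - | - | - |
| OpenAI CLIP Vit-L/14 | 91.8 | - | - | - | - | - | - | - | - | - | - |
| OpenCLIP ViT-B-16+ | 94.3 | - | - | - | - | - | - | - | - | - | - |
| LABSE Vit-L/14 | 91.6 | 89.6 | 89.5 | 89.9 | 88.9 | 90.1 | 89.8 | 80.8 | 85.5 | 89.8 | 73.9 |
| XLM-R Large Vit-B/32 | 91.8 | 88.7 | 89.1 | 89.4 | 89.3 | 89.8 | 91.4 | 82.1 | 86.1 | 88.8 | 81.0 |
| XLM-R Vit-L/14 | 92.4 | 90.6 | 91.0 | 90.0 | 89.7 | 91.1 | 91.3 | 85.2 | 85.8 | 90.3 | 81.9 |
| XLM-R Large Vit-B/16+ | <b>95.0</b> | <b>93.0</b> | <b>93.6</b> | <b>93.1</b> | <b>94.0</b> | <b>93.1</b> | <b>94.4</b> | <b>89.0</b> | <b>90.0</b> | <b>93.0</b> | <b>84.2</b> |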
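For readers unfamiliar with the metric, the following sketch shows one common way to compute text-to-image Recall@k from a caption-by-image similarity matrix. It assumes exactly one correct image per caption, stored at the same index, which is a simplification; it is not the evaluation script used to produce the numbers above, and `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(similarity, k=10):
    """Fraction of captions whose paired image is among the k highest-scoring images.

    similarity[i][j] is the score between caption i and image j;
    caption i is assumed to be paired with image i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(similarity)

# Toy 3x3 similarity matrix with made-up scores
toy = [
    [0.9, 0.1, 0.3],  # caption 0: correct image ranked 1st
    [0.8, 0.5, 0.6],  # caption 1: correct image ranked 3rd
    [0.2, 0.7, 0.6],  # caption 2: correct image ranked 2nd
]

r1 = recall_at_k(toy, k=1)
r2 = recall_at_k(toy, k=2)
r3 = recall_at_k(toy, k=3)
print(r1, r2, r3)
```

The table reports this kind of score as a percentage with k = 10.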

The training curves for these models are available at this Weights and Biases Report. The results for other unsuccessful and ongoing experiments can be found in the Weights and Biases Project.

Legacy Usage and Models

Older versions of M-CLIP had the linear weights stored separately from Huggingface, whereas the new models have them incorporated directly in the Huggingface repository. More information about these older models can be found in this section.

<details> <summary>Click for more information</summary>
Download CLIP Model
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git

Replace cudatoolkit=11.0 above with the appropriate CUDA version on your machine, or with cpuonly when installing on a machine without a GPU. For more information, please see the official CLIP repository.

Download Linear Weights
# Linear Model Weights
$ bash legacy_get-weights.sh

Inference

from multilingual_clip import multilingual_clip

print(multilingual_clip.AVAILABLE_MODELS.keys())

model = multilingual_clip.load_model('M-BERT-Distil-40')

embeddings = model(['Älgen är skogens konung!', 'Wie leben Eisbären in der Antarktis?', 'Вы знали, что все белые медведи левши?'])
print(embeddings.shape)
# Yields: torch.Size([3, 640])
<!--- For a more elaborative example see this [Google Colab](https://colab.research.google.com/github/FreddeFrallan/Multilingual-CLIP/blob/master/Multilingual_CLIP.ipynb). --->

For a more elaborate example that compares the textual embeddings to the CLIP image embeddings, see this colab notebook.

<!-- GETTING STARTED -->

Legacy Pre-trained Models

Every text encoder is a transformer available on Huggingface, with an additional linear layer on top. None of the models has been extensively tested, but for more information and qualitative test results for a specific model, click the Model Name to see its model card. <br> <br> <b>*** Make sure to update to the most recent version of the repository when downloading a new model, and re-run the shell script to download the Linear Weights. *** </b>

| Name | Model Base | Vision Model | Pre-trained Languages | Target Languages | #Parameters |
|---|---|---|---|---|---|
| Multilingual | | | | | |
| M-BERT Distil 40 | M-BERT Distil | RN50x4 | 101 Languages | 40 Languages | 66 M |
| M-BERT Base 69 | M-BERT Base | RN50x4 | 101 Languages | 68 Languages | 110 M |
| M-BERT Base ViT-B | M-BERT Base | ViT-B/32 | 101 Languages | 68 Languages | 110 M |
| Monolingual | | | | | |
| Swe-CLIP 500k | KB-BERT | RN50x4 | Swedish | Swedish | 110 M |
| Swe-CLIP 2M | KB-BERT | RN50x4 | Swedish | Swedish | 110 M |
</details>

Training a new model

This folder contains the code used for training the above models. If you wish to train your own model, you must do the following things:

Pre-computed CLIP Embeddings & Translation Data

This Google Drive folder contains pre-computed CLIP-Text Embeddings for a large portion of the image captions of GCC + MSCOCO + VizWiz.

The Google Drive folder also contains the translation data used to train the currently available models. Good luck!
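The pre-computed CLIP text embeddings and the translation data fit together in a teacher-student setup: the embedding of an English caption acts as the target, and the new encoder is trained to produce the same vector for the translated caption. The sketch below illustrates this with a mean squared error loss. The choice of MSE, the helper names, and the toy vectors are assumptions for this example; see the training code in this folder for the actual procedure.

```python
def mse(student, teacher):
    """Mean squared error between two equally sized vectors."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def batch_loss(student_embs, teacher_embs):
    """Average per-example MSE over a batch of (student, teacher) pairs."""
    pairs = list(zip(student_embs, teacher_embs))
    return sum(mse(s, t) for s, t in pairs) / len(pairs)

# Teacher: stand-ins for pre-computed CLIP text embeddings of English captions
teacher_embs = [[0.5, -0.2, 0.1], [0.0, 0.3, 0.9]]
# Student: stand-ins for the new encoder's output on the translated captions
student_embs = [[0.4, -0.1, 0.1], [0.1, 0.3, 0.7]]

loss = batch_loss(student_embs, teacher_embs)
print(loss)
```

Because the targets are pre-computed, training in this style needs neither images nor the CLIP model itself at training time.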

Contribution

If you have trained a CLIP text encoder specific to your language, or another model covering a language not supported here, please feel free to contact us, and we will either upload your model and credit you or simply link to your already uploaded model.

<!-- CONTACT -->

Contact

If you have questions regarding the code or anything else related to this GitHub page, please open an issue.

For other purposes, feel free to contact me directly at: Fredrik.Carlsson@ri.se

<!-- ACKNOWLEDGEMENTS -->

Acknowledgements

<!-- LICENSE -->

License

Distributed under the MIT License. See LICENSE for more information.

<!-- CITATION -->

Citing

If you found this repository useful, please consider citing:

@InProceedings{carlsson-EtAl:2022:LREC,
  author    = {Carlsson, Fredrik  and  Eisen, Philipp  and  Rekathati, Faton  and  Sahlgren, Magnus},
  title     = {Cross-lingual and Multilingual CLIP},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {6848--6854},
  abstract  = {The long-standing endeavor of relating the textual and the visual domain recently underwent a pivotal breakthrough, as OpenAI released CLIP. This model distinguishes how well an English text corresponds with a given image with unprecedented accuracy. Trained via a contrastive learning objective over a huge dataset of 400M of images and captions, it is a work that is not easily replicated, especially for low resource languages. Capitalizing on the modularization of the CLIP architecture, we propose to use cross-lingual teacher learning to re-train the textual encoder for various non-English languages. Our method requires no image data and relies entirely on machine translation which removes the need for data in the target language. We find that our method can efficiently train a new textual encoder with relatively low computational cost, whilst still outperforming previous baselines on multilingual image-text retrieval.},
  url       = {https://aclanthology.org/2022.lrec-1.739}
}
<!-- MARKDOWN LINKS & IMAGES --> <!-- https://www.markdownguide.org/basic-syntax/#reference-style-links -->