Home

Awesome

Prototypical Contrastive Language Image Pretraining

Welcome to the official PyTorch implementation of ProtoCLIP from our paper ProtoCLIP: Prototypical Contrastive Language Image Pretraining, published in IEEE Transactions on Neural Networks and Learning Systems (TNNLS).

by Delong Chen, Zhao Wu, Fan Liu, Zaiquan Yang, Shaoqiu Zheng, Ying Tan, and Erjin Zhou

Abstract: Contrastive Language Image Pretraining (CLIP) has received widespread attention since its learned representations transfer well to various downstream tasks. During CLIP training, the InfoNCE objective aims to align positive image-text pairs and separate negative ones. In this paper, we show a representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerged within-modal anchors.

We introduce Prototypical Contrastive Language Image Pretraining (ProtoCLIP) to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. We further propose Prototypical Back Translation (PBT) to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to scale to unlimited amounts of data.

Combining the above novel designs, we train ProtoCLIP on Conceptual Captions and achieve a +5.81% ImageNet linear probing improvement and a +2.01% ImageNet zero-shot classification improvement.

Fig1

protoclip_model_structure

🔔 Updates

🚀 What can you get from this repo

Requirements

1. Install Dependencies

2. Prepare Pretraining Data

This codebase reads a tab-separated (\t) CSV file with two columns: the path to an image ("filepath" by default) and a text caption ("title" by default).

| filepath | title |
| --- | --- |
| path/to/image.jpg | A very typical bus station |
| ... | ... |
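A minimal sketch of producing and reading a file in this format with the standard library (the file name and sample row below are illustrative, not part of the codebase):

```python
import csv

# Write a minimal pretraining CSV in the expected format:
# tab-separated, with "filepath" and "title" columns.
rows = [
    {"filepath": "path/to/image.jpg", "title": "A very typical bus station"},
]
with open("pretrain.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["filepath", "title"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

# Read it back the way a tab-separated loader would.
with open("pretrain.csv", newline="") as f:
    loaded = list(csv.DictReader(f, delimiter="\t"))
```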

The script src/utils/gather_cc.py will collect the Conceptual Captions (CC3M) dataset. First, download the Conceptual Captions URLs from here, then run the following script:

python3 src/utils/gather_cc.py path/to/Train_GCC-training.tsv

Note: The requirement of CC3M validation data from OpenCLIP is removed in this codebase. The CC3M dataset was made public by Google in 2018. As noted in our paper, the number of accessible images keeps dropping due to expired image links; this issue has also been raised by several recent works. Since we could only collect 2,643,718 images (concurrent to our work, CyCLIP collected 2,631,703 images), we randomly sample a 2,500,000-image subset (75% of the full CC3M) from them to train our ProtoCLIP. Considering the dropping accessibility of image links in Conceptual Captions, we call for the use of this dataset size (2.5M) in future benchmarking for better comparability.
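The subset-sampling step above can be sketched as follows; the seed and the assumption that downloaded pairs are indexed 0..N-1 are illustrative, not the exact procedure used in the repo:

```python
import random

# Hypothetical indices of the successfully downloaded CC3M image-text pairs.
NUM_DOWNLOADED = 2_643_718
SUBSET_SIZE = 2_500_000

# Draw a fixed-seed random subset so the 2.5M split is reproducible.
random.seed(0)
subset = random.sample(range(NUM_DOWNLOADED), SUBSET_SIZE)
```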

Note: webdataset is no longer supported in this codebase.

3. Prepare Downstream Data

Evaluation

Training

Build an External Teacher (optional)

We use a pretrained RoBERTa language model as the external teacher for ProtoCLIP, loading the pretrained RoBERTa-large weights provided by FAIRSEQ via PyTorch Hub. Run the following script to extract text features from a given .csv file and reduce the feature dimension from 1024 to 64 by PCA to save memory:

python src/utils/RoBERTa.py
>>> Input your csv file: <YOUR-PRETRAINING-DATASET-CSV-FILE.csv>
>>> Input your feature file: <FEATURE-CACHE-FILE.npy> (e.g., 'features/RoBERTA_features_CC.npy')

With a single NVIDIA 2080 Ti GPU, extracting RoBERTa features from CC2.5M takes about 3 hours, and the resulting feature file takes 600+ MB of storage.

Note: We use the pooled and normalized features from RoBERTa:

import torch.nn.functional as F

# Token-level features from the pretrained RoBERTa model
text_feature = roberta.extract_features(texts)
# Mean-pool over the token dimension, then L2-normalize
text_feature = text_feature.mean(dim=1)
text_feature = F.normalize(text_feature, dim=-1)
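The 1024-to-64 PCA reduction mentioned above can be sketched with plain NumPy via SVD; the array shape and the random stand-in for the cached features are illustrative assumptions:

```python
import numpy as np

# Stand-in for cached RoBERTa features (N x 1024); in practice this would
# be the .npy file produced by src/utils/RoBERTa.py.
rng = np.random.default_rng(0)
features = rng.standard_normal((256, 1024)).astype(np.float32)

# PCA via SVD: center the data, then project onto the top 64
# right-singular vectors (principal components).
mean = features.mean(axis=0, keepdims=True)
centered = features - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:64].T  # shape (N, 64)
```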

Sample Single GPU Training

By running the commands provided below, the following results can be obtained. On a machine with a single NVIDIA 2080 Ti GPU, 16 CPU cores, and 100 GB RAM, CLIP training takes 1.03 days, while ProtoCLIP training takes 1.84 days.

| Model | Backbone | Batch Size | ImageNet Linear Prob | ImageNet Zero-shot | 10 Dataset Zero-shot Avg. | COCO Mean Recall |
| --- | --- | --- | --- | --- | --- | --- |
| CLIP | RN50 | 64 | 38.95 | 12.29 | 15.30 | 26.20 |
| ProtoCLIP | RN50 | 64 | 44.55 | 14.50 | 20.48 | 28.26 |

Multi GPU Training

CLIP and ProtoCLIP achieve the following downstream performance with CC2.5M:

| Model | Backbone | Batch Size | ImageNet Linear Prob | ImageNet Zero-shot | 10 Dataset Zero-shot Avg. | COCO Mean Recall |
| --- | --- | --- | --- | --- | --- | --- |
| CLIP | RN50 | 512 | 49.41 | 19.46 | 21.87 | 36.48 |
| ProtoCLIP | RN50 | 512 | 55.22 | 21.47 | 22.52 | 35.69 |

Some Notes on Arguments

Run python src/training/main.py --help to see descriptions of all arguments. Here we provide some explanations of our newly added arguments:

📈 Monitoring Downstream Performances During Training

The experiment will be logged to <Your Experiment Log dir> as follows:

<Your Experiment Log dir>
    ├── cache
    ├── checkpoints
    │   ├── epoch_4.pt
    │   ├── epoch_8.pt
    │   ├── epoch_12.pt
    │   ├── epoch_16.pt
    │   ├── epoch_20.pt
    │   ├── epoch_24.pt
    │   ├── epoch_28.pt
    │   ├── epoch_32.pt
    │   └── epoch_latest.pt
    ├── out.log
    ├── params.txt
    ├── results.jsonl
    ├── evaluation_metrics_all.csv
    └── tensorboard
        └── events.out.tfevents

We provide a useful tool for monitoring downstream performance. Run src/utils/evaluate_checkpoints.py and specify an experiment logging dir; the script reads configurations from params.txt and automatically monitors and evaluates checkpoints. Results are saved as a .csv file (evaluation_metrics_all.csv). You can also specify an individual checkpoint to evaluate.

>>> python src/utils/evaluate_checkpoints.py
Please input your experiment dir: <Your Experiment Log dir>
Specify a checkpoint epoch? (press "enter" to scan and evaluate all checkpoints) 

🎈 Acknowledgements

If you find this project useful for your research, please consider citing our paper:

@article{chen2023prototypical,
    author    = {Delong Chen and
                Zhao Wu and
                Fan Liu and
                Zaiquan Yang and
                Shaoqiu Zheng and
                Ying Tan and
                Erjin Zhou},
    title     = {ProtoCLIP: Prototypical Contrastive Language Image Pretraining},
    journal   = {IEEE Transactions on Neural Networks and Learning Systems (TNNLS)},
    year      = {2023},
}

If you have any questions about the ProtoCLIP algorithm or this implementation, please create an issue or email chendelong@hhu.edu.cn.