
Data-Centric Vision-Language Pre-training


This work is still in progress; the current compression rate is around 70%-80%.

However, the current data selection strategy is quite simple, and we are exploring more solid methods.

We also focus on refining existing datasets with our toolbox Image2Paragraph.

News

08/17/2023: Code released.

To do

1. Introduction

1. Conventional Vision-language Datasets

| Index | Original Dataset | #Original Samples | Reduced Dataset | #Reduced Samples | Compression Rate |
|---|---|---|---|---|---|
| 0 | CC3M | 2.82M | TL;DR CC3M | 0.67M | 76.25% |
| 1 | CC12M | 10.8M | TL;DR CC12M | 2.4M | 77.8% |
| 2 | YFCC | 14.9M | TL;DR YFCC | 2.5M | 83.33% |
| 3 | LAION-Sub | 40M | TL;DR LAION-Sub | 8.04M | 79.90% |

2. Data-efficient learning methods

"Large-scale" means that the methods are effective when used on datasets that are very large in size. The "task agnostic" means that the methods can be used regardless of the specific downstream task, and without any prior exposure to the associated data.

| Method | Year | Data Type | Compression Ratio | Task Agnostic | Large-scale | Supervision | Generation/Selection |
|---|---|---|---|---|---|---|---|
| Dataset Distillation [1] | 2017 | Image | 99%-99.99% | No | No | Class Label | Generation |
| Data Pruning [2] | 2022 | Image | 20%-30% | No | Yes | Class Label | Selection |
| Neural Data Server [3] | 2020 | Multi-modality | 94%-98% | No | Yes | Image-text Pairs | Selection |
| TL;DR (ours) | 2023 | Multi-modality | 75%-90% | Yes | Yes | Image-text Pairs | Generation+Selection |

[1] Wang T, et al. Dataset Distillation. arXiv preprint arXiv:1811.10959, 2018.

[2] Sorscher B, et al. Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning. NeurIPS, 2022.

[3] Yan X, et al. Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data. CVPR, 2020.

2. Run

Step 1. Pre-train Codebook-based Vision-Language Model

The codebook implementation is from VQ-VAE.

Please follow GETTING_START.md for data preparation and captioner model training.
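
The quantizer follows the standard VQ-VAE recipe: each continuous visual feature is snapped to its nearest entry in a learnable codebook. Below is a minimal sketch of such a quantizer for illustration only, not the exact module in this repo; the codebook size and code dimension are placeholders.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ-VAE style codebook: maps each feature vector to its
    nearest codebook entry, with a straight-through estimator for gradients."""

    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):                      # z: (batch, seq_len, code_dim)
        flat = z.reshape(-1, z.shape[-1])      # (batch*seq_len, code_dim)
        # Squared L2 distance to every codebook entry
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        indices = dist.argmin(dim=1)           # nearest code per vector
        z_q = self.codebook(indices).view_as(z)
        # Straight-through: pass gradients from z_q back to z
        z_q = z + (z_q - z).detach()
        return z_q, indices.view(z.shape[:-1])
```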

Step 2. Codebook Extractor

python codebook_extractor.py
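
This step runs the pre-trained quantizer over the whole dataset and stores each sample's discrete code indices for later clustering. A hedged sketch of that loop is below; `encode_to_codes`, the dataloader layout, and the output path are hypothetical stand-ins, not the script's real interface.

```python
import torch

def extract_codes(model, dataloader, out_path="codebook_codes.pth"):
    """Run the pre-trained codebook model over the dataset and save the
    discrete code indices per sample. `model.encode_to_codes` is a
    hypothetical method name standing in for the real quantizer interface."""
    all_codes, all_keys = [], []
    model.eval()
    with torch.no_grad():
        for images, keys in dataloader:           # keys identify each sample
            codes = model.encode_to_codes(images)  # (batch, seq_len) code indices
            all_codes.append(codes.cpu())
            all_keys.extend(keys)
    torch.save({"keys": all_keys, "codes": torch.cat(all_codes)}, out_path)
```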

Step 3. Codebook Clustering and Selection

python codebook_cluster.py

For comparison, random selection can also be used:

python random_selection.py
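
As a rough sketch of what codebook-based clustering and selection can look like (the exact criterion in codebook_cluster.py may differ): represent each sample by its codebook-usage histogram, cluster with k-means, and keep the samples closest to each centroid; the random baseline simply samples uniformly at the same budget.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_by_codebook_clusters(histograms, keep_ratio=0.25, n_clusters=100, seed=0):
    """histograms: (num_samples, num_codes) codebook-usage features per sample.
    Returns indices of the kept subset (samples closest to their centroid)."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(histograms)
    # Distance of each sample to its own cluster centroid
    dists = np.linalg.norm(histograms - km.cluster_centers_[km.labels_], axis=1)
    per_cluster = int(len(histograms) * keep_ratio / n_clusters)
    keep = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        ranked = members[np.argsort(dists[members])]   # nearest-to-centroid first
        keep.extend(ranked[:per_cluster].tolist())
    return np.array(keep)

def select_randomly(num_samples, keep_ratio=0.25, seed=0):
    """Uniform random baseline at the same selection budget."""
    rng = np.random.default_rng(seed)
    return rng.choice(num_samples, int(num_samples * keep_ratio), replace=False)
```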

Step 4. Fine-tuning VLP Model on Human-Cleaned Captioning Dataset

python vq_compress_model/train_caption.py

Step 5. Generate Training JSON

python generate_train_json_w_caption.py

The ITM (image-text matching) score distribution is shown below:

The main purpose of these steps is to raise the image-text matching score. This is not limited to an image captioner; Neural Data Server and other techniques that improve the alignment between vision and text also work.
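
A hedged sketch of how the training JSON could be assembled with such an ITM-based check; the threshold value, the `itm_score` / `generate_caption` helpers, and the record layout are assumptions for illustration, not the repo's exact interface.

```python
import json

ITM_THRESHOLD = 0.5  # assumed cutoff; tune it from the observed score distribution

def build_annotation(samples, itm_score, generate_caption, out_path="train_tldr.json"):
    """samples: iterable of dicts with 'image' path and original 'caption'.
    Keeps well-aligned pairs as-is and re-captions poorly aligned ones."""
    records = []
    for s in samples:
        score = itm_score(s["image"], s["caption"])
        caption = s["caption"] if score >= ITM_THRESHOLD else generate_caption(s["image"])
        records.append({"image": s["image"], "caption": caption})
    with open(out_path, "w") as f:
        json.dump(records, f)
```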

Step 6. Pre-training and Evaluating on Downstream Tasks

Use the generated annotation files to train the VLP model in the normal way.
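
For example, the generated file can be consumed like any other image-text annotation file; below is a minimal PyTorch dataset sketch whose field names follow the JSON format sketched above and are therefore assumptions.

```python
import json
from PIL import Image
from torch.utils.data import Dataset

class TLDRPretrainDataset(Dataset):
    """Reads the generated annotation JSON and yields (image, caption) pairs."""

    def __init__(self, ann_path, transform=None):
        with open(ann_path) as f:
            self.records = json.load(f)
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        image = Image.open(rec["image"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, rec["caption"]
```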

3. Some Results

a. CC3M

| Dataset | Sample | Pretraining Time | COCO TR@1 | COCO IR@1 | COCO Captioning B@4 | NLVR2 |
|---|---|---|---|---|---|---|
| CC3M | 2.82M | 19H | 70.9 | 54.3 | 36.8 | 76.2 |
| TL;DR CC3M | 0.67M | 4.7H | 72.8 | 54.8 | 37.6 | 78.0 |

b. CC12M

| Dataset | Sample | Pretraining Time | Flickr TR@1 | Flickr IR@1 | COCO Captioning B@4 | NLVR2 |
|---|---|---|---|---|---|---|
| CC12M | 10.8M | 65H | 84.7 | 75.3 | 37.5 | 78.9 |
| TL;DR CC12M | 2.4M | 14H | 85.5 | 76.3 | 38.1 | 78.5 |

c. YFCC

Compression Rate: 83.33%

d. LAION-Subset

Compression Rate: 80%

Acknowledgement

This work is mainly inspired by Dataset Distillation and Data Pruning. The architecture ablations are mainly based on BLIP and ViLT. Thanks for these great works.

Citation

If you find our work helpful, please use the following BibTeX entry for citation.

@article{wang2023tldr,
  title={Too Large; Data Reduction for Vision-Language Pre-Training},
  author={Alex Jinpeng Wang and Kevin Qinghong Lin and David Junhao Zhang and Stan Weixian Lei and Mike Zheng Shou},
  journal={ICCV},
  year={2023}
}