
# DreamLIP: Language-Image Pre-training with Long Captions

<a href="https://zkcys001.github.io/">Kecheng Zheng</a>, <a href="https://github.com/zyf0619sjtu">Yifei Zhang</a>, <a href="https://github.com/wuw2019">Wei Wu</a>, <a href="https://github.com/LuFan31">Fan Lu</a>, <a href="https://scholar.google.com/citations?user=dNhzCu4AAAAJ&hl=zh-CN">Shuailei Ma</a>, <a href="http://home.ustc.edu.cn/~jinxustc/">Xin Jin</a>, <a href="http://www.cad.zju.edu.cn/home/chenwei/">Wei Chen</a>, <a href="https://shenyujun.github.io/">Yujun Shen</a> <br> Project Page | Paper | Data

## 📰 News

## 💡 Highlights

<img src="figures/radar.jpg" style="vertical-align: -10px; display: block; margin-left: auto; margin-right: auto;" height="400px" width="440px">

## 🎨 In-Progress

### 🏝️ Overview of supported long captions

<details open> <summary><b>Long Captions of Supported Datasets (5)</b></summary>

- [x] CC3M
- [x] CC12M
- [x] YFCC15M
- [x] Laion49M
- [x] COYO24M

</details>

<details open> <summary><b>Long Captions of MLLMs (3)</b></summary>

- [x] InstructBLIP
- [x] LLAVA1.5
- [x] ShareGPT4V

</details>

### Generated Long Captions

<table><tbody>
  <!-- TABLE HEADER -->
  <tr>
    <th valign="center">Dataset</th>
    <th valign="center">Huggingface Dataset</th>
  </tr>
  <!-- TABLE BODY -->
  <tr>
    <td align="center">CC3M</td>
    <td align="center"><a href="https://huggingface.co/datasets/qidouxiong619/dreamlip_long_captions">Raw/Long/Short Caption</a></td>
  </tr>
  <tr>
    <td align="center">CC12M</td>
    <td align="center"><a href="https://huggingface.co/datasets/qidouxiong619/dreamlip_long_captions">Raw/Long/Short Caption</a></td>
  </tr>
  <tr>
    <td align="center">YFCC15M</td>
    <td align="center"><a href="https://huggingface.co/datasets/qidouxiong619/dreamlip_long_captions">Raw/Long/Short Caption</a></td>
  </tr>
  <tr>
    <td align="center">Laion49M</td>
    <td align="center"><a href="https://huggingface.co/datasets/weiwu-ww/Recap-Long-Laion">Long Caption</a></td>
  </tr>
  <tr>
    <td align="center">COYO24M</td>
    <td align="center"><a href="https://huggingface.co/datasets/weiwu-ww/Recap-Long-Coyo">Long Caption</a></td>
  </tr>
</tbody></table>
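The released captions can be pulled directly with the Hugging Face `datasets` library. Below is a minimal sketch; the split and column names are assumptions, so inspect the printed schema (or the dataset card) for the actual layout.

```python
from datasets import load_dataset

# Repo id taken from the table above; the "train" split and record fields
# are assumptions -- print the dataset to see the real schema.
ds = load_dataset("qidouxiong619/dreamlip_long_captions")
print(ds)              # available splits and columns
print(ds["train"][0])  # first record, e.g. image key plus raw/long/short captions
```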

### Pretrained Checkpoints

<table><tbody>
  <!-- TABLE HEADER -->
  <tr>
    <th valign="center">Dataset</th>
    <th valign="center">Model</th>
    <th valign="center">ShareGPT4V</th>
    <th valign="center">InstructBLIP + LLAVA1.5 + ShareGPT4V</th>
  </tr>
  <!-- TABLE BODY -->
  <tr>
    <td align="center">CC3M</td>
    <td align="center">ViT-B/16</td>
    <td align="center"><a href="https://drive.google.com/file/d/1f8JdXmdGRQtCzXpEGTpE_T7bWXLMnaMj/view?usp=sharing">Link</a></td>
    <td align="center">TODO</td>
  </tr>
  <tr>
    <td align="center">CC12M</td>
    <td align="center">ViT-B/16</td>
    <td align="center"><a href="https://drive.google.com/file/d/12qSRzW8q2Jg2L4y05s-AMXyCvPS7O6BK/view?usp=sharing">Link</a></td>
    <td align="center">TODO</td>
  </tr>
  <tr>
    <td align="center">YFCC15M</td>
    <td align="center">ViT-B/16</td>
    <td align="center"><a href="https://drive.google.com/file/d/1CG1-XRsnff7b26WYdygNOWnhAqI5y_a7/view?usp=sharing">Link</a></td>
    <td align="center">TODO</td>
  </tr>
  <tr>
    <td align="center">CC30M</td>
    <td align="center">ViT-B/16</td>
    <td align="center"><a href="https://drive.google.com/file/d/1pPVVOt_YALq_YX7x2kNEfDWSdHQ5wqew/view?usp=sharing">Link</a></td>
    <td align="center">TODO</td>
  </tr>
</tbody></table>
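Since the project builds on open_clip, a downloaded checkpoint can presumably be loaded by passing its local path as `pretrained`. A minimal sketch, assuming the released files are standard open_clip ViT-B/16 state dicts (the filename below is a placeholder):

```python
import open_clip

# Placeholder path: point this at a checkpoint downloaded from the table above.
CKPT = "checkpoints/dreamlip_vitb16_cc30m.pt"

# open_clip accepts a local file path for `pretrained`; this assumes the
# released weights follow the standard open_clip ViT-B/16 layout.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained=CKPT)
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()
```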

## 📣 Instructions

### Environment installation

```bash
pip install -r requirements.txt
```

### Evaluate zero-shot classification

```bash
bash eval_zs.sh
```
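For reference, the sketch below shows the usual open_clip zero-shot recipe that a script like `eval_zs.sh` typically wraps: encode prompted class names and an image, normalize both, and pick the most similar class. The checkpoint path, image file, class names, and prompt template are all illustrative placeholders, not the script's actual settings.

```python
import torch
import open_clip
from PIL import Image

# Load a model as in the checkpoint sketch above (placeholder path).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="checkpoints/dreamlip_vitb16_cc30m.pt"
)
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

# Illustrative labels and prompt template; eval_zs.sh may use different ones.
classnames = ["cat", "dog", "car"]
text = tokenizer([f"a photo of a {c}" for c in classnames])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)   # L2-normalize features
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(classnames[probs.argmax().item()])  # predicted class
```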

## License

This project is released under the standard Creative Commons Attribution 4.0 (CC-BY-4.0) license.

## 📖 Citation

We open-source this library to facilitate research in the community. If you find our work useful and use this codebase in your projects, please cite it as follows.

```bibtex
@inproceedings{DreamLIP,
  title={DreamLIP: Language-Image Pre-training with Long Captions},
  author={Zheng, Kecheng and Zhang, Yifei and Wu, Wei and Lu, Fan and Ma, Shuailei and Jin, Xin and Chen, Wei and Shen, Yujun},
  booktitle={ECCV},
  year={2024}
}
```

## Acknowledgements

This project is built on top of open_clip; thanks for the great work! We also thank InstructBLIP, ShareGPT4V, and LLaVA for their pretrained models and code.