:wolf: COYO-700M: Image-Text Pair Dataset

COYO-700M is a large-scale dataset that contains 747M image-text pairs, along with many other meta-attributes that make it easier to use for training various models. Our dataset follows a strategy similar to that of previous vision-and-language datasets, collecting many informative pairs of alt-text and the associated image from HTML documents. We expect COYO to be used to train popular large-scale foundation models, complementing other similar datasets.

More details on the data acquisition process can be found in [our paper] (which will be updated soon).

<p align="center"><img src=./assets/coyo-samples.png width="1020" alt=""></p>

Updates

Data Collection Process

<div align="center"> <figure> <img alt="" src="./assets/alt-text-example.png" width="800"> <figcaption>https://en.wikipedia.org/wiki/Napoleon</figcaption> </figure> </div>

Data Filtering

Image Level

Text Level

Image-Text Level

Dataset Preview

| id | url | text | width | height | image_phash | text_length | word_count | num_tokens_bert | num_tokens_gpt | num_faces | clip_similarity_vitb32 | clip_similarity_vitl14 | nsfw_score_opennsfw2 | nsfw_score_gantman | watermark_score | aesthetic_score_laion_v2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4896263451343 | <img src="https://cdn.shopify.com/s/files/1/0190/8574/products/Art_Riley-Monterey_Fishing_Fleet_1_grande.jpg?v=1479962684" width="400" /> | Fishing Fleet (Monterey), California art by Art Riley. HD giclee art prints for sale at CaliforniaWatercolor.com - original California paintings, & premium giclee prints for sale | 600 | 447 | bac58374982e0fc7 | 178 | 25 | 39 | 40 | 0 | 0.319336 | 0.248169 | 2.54512e-05 | 0.0293861 | 0.0406009 | 7.04812 |
| 1425929344479 | <img src="https://www.ephotozine.com/resize/2018/07/xlrg/121543_1518912380.jpg?RTUdGk5cXyJFBQgJVANtcQlnYF8JERFaGwJRNQh6SlYUAEw1cmsCdg1hAWoxXFNGLSI=" width="400" /> | The Gate by Pete2453 | 600 | 347 | 8374726575bc0f8a | 20 | 4 | 6 | 6 | 0 | 0.24939 | 0.203735 | 6.97374e-06 | 0.00823276 | 0.0721415 | 6.98521 |
| 7456063527931 | <img src="https://www.boredart.com//wp-content/uploads/2014/06/Beautiful-Pictures-From-the-Shores-of-the-Mythical-Land-421.jpg" width="400" /> | Beautiful Pictures From the Shores of the Mythical Land (42 | 600 | 320 | 949d1fe559e2cc90 | 59 | 10 | 11 | 14 | 0 | 0.290771 | 0.179321 | 0.0130615 | 0.0178628 | 0.489642 | 6.94643 |
| 3221225511175 | <img src="https://homesfeed.com/wp-content/uploads/2017/12/contemporary-expensive-lighting-fixtures-with-minimum-lighting.jpg" width="400" /> | contemporary expensive lighting fixtures with minimum lighting | 800 | 499 | e5ea35075ab912c6 | 62 | 7 | 7 | 8 | 0 | 0.263916 | 0.217896 | 0.000990868 | 0.0137114 | 0.0960748 | 4.57594 |
| 5626407855002 | <img src="https://api.time.com/wp-content/uploads/2015/03/168951187.jpg" width="400" /> | Nintendo Co.'s Super Mario is displayed on coffee mugs for sale at the Nintendo World store in New York, U.S., on Friday, May 17, 2013. | 2000 | 1309 | 9311891e9437f4f3 | 135 | 27 | 37 | 35 | 0 | 0.400878 | 0.316650 | 0.00362968 | 0.0317519 | 0.0022693 | 6.324910 |
| 1125282207474 | <img src="https://s.yimg.com/ny/api/res/1.2/mOZe9uKtwugmPrqeXBlxFg--/YXBwaWQ9aGlnaGxhbmRlcjt3PTk2MDtoPTYzMA--/https://s.yimg.com/uu/api/res/1.2/JuTSVK74cI8II09Q75uzGA--~B/aD01MjU7dz04MDA7YXBwaWQ9eXRhY2h5b24-/https://media.zenfs.com/en/reuters.com/15941d3b47960da80f8033f4ddf9da64" width="400" /> | FILE PHOTO: A rainbow appears on the Auckland skyline featuring Sky Tower in New Zealand | 800 | 525 | 85b89c0166ee63be | 88 | 15 | 16 | 16 | 0 | 0.4453125 | 0.3505859 | 2.640485e-05 | 0.012074 | 0.0219129 | 5.294523 |
| 1434519186493 | <img src="https://static.straitstimes.com.sg/s3fs-public/styles/article_pictrure_780x520_/public/articles/2013/07/24/CHINA24e_2x.jpg?itok=6ppRPBJs&timestamp=1436931188" width="400" /> | A man covers himself with algae as he poses for photographs on a beach in Qingdao, Shandong province on Tuesday, July 23, 2013. -- FILE PHOTO: REUTERS | 860 | 573 | f2c48dabbf93810a | 150 | 26 | 35 | 36 | 7 | 0.4165039 | 0.3427734 | 0.025009 | 0.01608 | 0.072775 | 6.833739 |

Dataset Numbers

| | count | ratio |
|---|---|---|
| # of image-text pairs | 746,972,269 | 100.00% |
| # of unique urls | 656,114,783 | 87.84% |
| # of unique image_phash | 579,679,137 | 77.60% |
| # of unique text | 566,253,888 | 75.81% |
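
As a rough illustration of how counts and ratios like these can be derived, the sketch below counts distinct values per column and divides by the number of pairs. The helper name `uniqueness_summary` and the toy rows are hypothetical; only the column names (`url`, `image_phash`, `text`) and the published counts come from this README, and this is not necessarily how the numbers were originally produced.

```python
def uniqueness_summary(rows, columns=("url", "image_phash", "text")):
    """Return {label: (count, ratio_percent)} for the total and each column.

    Assumes `rows` is a non-empty list of dicts keyed by column name.
    """
    total = len(rows)
    summary = {"# of image-text pairs": (total, 100.0)}
    for col in columns:
        unique = len({row[col] for row in rows})  # distinct values in this column
        summary[f"# of unique {col}"] = (unique, round(100 * unique / total, 2))
    return summary


# Made-up rows for illustration only; they are not taken from the dataset.
toy_rows = [
    {"url": "https://a.example/1.jpg", "image_phash": "bac58374982e0fc7", "text": "a cat"},
    {"url": "https://a.example/1.jpg", "image_phash": "bac58374982e0fc7", "text": "a cat on a sofa"},
    {"url": "https://b.example/2.jpg", "image_phash": "bac58374982e0fc7", "text": "a cat"},
    {"url": "https://c.example/3.jpg", "image_phash": "8374726575bc0f8a", "text": "a gate"},
]

for label, (count, ratio) in uniqueness_summary(toy_rows).items():
    print(f"{label}: {count} ({ratio:.2f}%)")

# The published ratio for unique urls follows from the published counts.
published_url_ratio = round(100 * 656_114_783 / 746_972_269, 2)
print(published_url_ratio)
```

Counting exact `image_phash` matches only finds images whose hashes are identical; near-duplicates with slightly different hashes would need a distance-based comparison.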

Meta-Attributes

Attributes

| name | type | description |
|---|---|---|
| id | long | Unique 64-bit integer ID generated by monotonically_increasing_id() |
| url | string | The image URL extracted from the src attribute of the <img> tag |
| text | string | The text extracted from the alt attribute of the <img> tag |
| width | integer | The width of the image |
| height | integer | The height of the image |
| image_phash | string | The perceptual hash (pHash) of the image |
| text_length | integer | The length of the text |
| word_count | integer | The number of words separated by spaces |
| num_tokens_bert | integer | The number of tokens using BertTokenizer |
| num_tokens_gpt | integer | The number of tokens using GPT2TokenizerFast |
| num_faces | integer | The number of faces in the image detected by SCRFD |
| clip_similarity_vitb32 | float | The cosine similarity between text and image (ViT-B/32) embeddings by OpenAI CLIP |
| clip_similarity_vitl14 | float | The cosine similarity between text and image (ViT-L/14) embeddings by OpenAI CLIP |
| nsfw_score_opennsfw2 | float | The NSFW score of the image by OpenNSFW2 |
| nsfw_score_gantman | float | The NSFW score of the image by GantMan/NSFW |
| watermark_score | float | The watermark probability of the image by our internal model |
| aesthetic_score_laion_v2 | float | The aesthetic score of the image by LAION-Aesthetics-Predictor-V2 |
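
These attributes make it possible to select subsets without downloading any images. The sketch below filters three rows from the dataset preview using four of the score columns. The function name `passes_filters` and every threshold value are assumptions chosen purely for illustration; they are not recommended settings and are not the filters used to build COYO.

```python
def passes_filters(row, min_clip=0.28, max_watermark=0.3, max_nsfw=0.1, min_aesthetic=5.0):
    """Return True if a metadata row satisfies all (hypothetical) thresholds."""
    return (
        row["clip_similarity_vitb32"] >= min_clip
        and row["watermark_score"] <= max_watermark
        and row["nsfw_score_opennsfw2"] <= max_nsfw
        and row["aesthetic_score_laion_v2"] >= min_aesthetic
    )


# Three rows from the dataset preview, reduced to the columns used here.
preview_rows = [
    {"id": 4896263451343, "clip_similarity_vitb32": 0.319336, "watermark_score": 0.0406009,
     "nsfw_score_opennsfw2": 2.54512e-05, "aesthetic_score_laion_v2": 7.04812},
    {"id": 7456063527931, "clip_similarity_vitb32": 0.290771, "watermark_score": 0.489642,
     "nsfw_score_opennsfw2": 0.0130615, "aesthetic_score_laion_v2": 6.94643},
    {"id": 3221225511175, "clip_similarity_vitb32": 0.263916, "watermark_score": 0.0960748,
     "nsfw_score_opennsfw2": 0.000990868, "aesthetic_score_laion_v2": 4.57594},
]

kept_ids = [row["id"] for row in preview_rows if passes_filters(row)]
print(kept_ids)
```

With these example thresholds, the second row is dropped for its watermark score and the third for its CLIP similarity and aesthetic score, so only the first row remains.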

Statistics

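| | width | height | text_length | word_count | num_tokens_bert | num_tokens_gpt | num_faces |
|---|---|---|---|---|---|---|---|
| mean | 621.78 | 540.99 | 68.53 | 11.13 | 15.75 | 17.24 | 0.60 |
| min | 200 | 200 | 6 | 3 | 1 | 3 | 0 |
| max | 21449 | 22507 | 1000 | 323 | 811 | 1523 | 736 |

| | watermark_score | clip_similarity_vitb32 | clip_similarity_vitl14 | aesthetic_score_laion_v2 | nsfw_score_opennsfw2 |
|---|---|---|---|---|---|
| mean | 0.178544 | 0.291266 | 0.254632 | 4.769132 | 0.012903 |
| min | 0.0 | -0.080871 | -0.176269 | 1.171712 | 0.0 |
| max | 1.0 | 0.591796 | 0.581542 | 8.082607 | 0.499755 |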
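
The slightly negative minima in the CLIP similarity columns are expected, since cosine similarity ranges over [-1, 1]. The sketch below shows the computation on toy 3-dimensional vectors; these vectors are invented stand-ins, not real CLIP embeddings, and the function name is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for an image embedding and three text embeddings.
image_emb = [1.0, 0.0, 0.0]
text_aligned = [2.0, 0.0, 0.0]      # same direction: similarity is 1
text_orthogonal = [0.0, 3.0, 0.0]   # perpendicular: similarity is 0
text_opposed = [-1.0, 0.5, 0.0]     # pointing mostly away: similarity is negative

for name, emb in [("aligned", text_aligned),
                  ("orthogonal", text_orthogonal),
                  ("opposed", text_opposed)]:
    print(name, cosine_similarity(image_emb, emb))
```

Because the result depends only on direction, scaling either vector leaves the similarity unchanged.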

Getting Started

Download

Usage

Experiments

We empirically validated the quality of the COYO dataset by re-implementing popular models such as ALIGN, unCLIP, and ViT. We trained these models from scratch on COYO-700M or its subsets, achieving performance competitive with the reported numbers or generated samples in the original papers. Since this observation supports the high quality of our dataset, we hope it will be continuously updated through open collaboration. Our pre-trained models and training code will be released soon along with the technical report.

ALIGN

| Model | Data | ImageNet KNN | COCO I2T | COCO T2I |
|---|---|---|---|---|
| EfficientNet-B7 + BERT-base | ALIGN-1.8B | 69.300 | 55.400 | 41.700 |
| EfficientNet-B7 + BERT-base | COYO-700M | 68.618 | 59.000 | 42.419 |

unCLIP (OpenAI DALL·E 2)

| <div style="width:150px"><img src=./assets/dalle2_knight.png width=80% height=80%></div> | <div style="width:150px"><img src=./assets/dalle2_andywarhol.png width=80% height=80%></div> |
|:---:|:---:|
| A high quality picture of a medieval knight with golden armor | A person with the head of a cat in the style of Andy Warhol |
| <div style="width:150px"><img src=./assets/dalle2_astronaut.png width=80% height=80%></div> | <div style="width:150px"><img src=./assets/dalle2_darthvader.png width=80% height=80%></div> |
| A pencil drawing of an astronaut riding a horse | Goryeo celadon in the shape of darth vader |

ViT

| Model | Data | ImageNet <br/> Validation <br/> Top-1 Acc |
|---|---|---|
| ViT-L/16 | JFT-300M | 87.76% |
| ViT-L/16 | COYO-Labeled-300M | 87.24% |

Citation

If you use this dataset in any project or research, please cite our code:

@misc{kakaobrain2022coyo-700m,
  title         = {COYO-700M: Image-Text Pair Dataset},
  author        = {Byeon, Minwoo and Park, Beomhee and Kim, Haecheon and Lee, Sungjun and Baek, Woonhyuk and Kim, Saehoon},
  year          = {2022},
  howpublished  = {\url{https://github.com/kakaobrain/coyo-dataset}},
}

People

Disclaimer & Content Warning

The COYO dataset is recommended for research purposes. Kakao Brain tried to construct a "Safe" dataset when building COYO (see the Data Filtering section) and is constantly making efforts to create more "Safe" datasets. Despite these efforts, because of its very large size (over 700M pairs), the dataset was not hand-picked by humans to avoid such risks. Keep in mind that, because the dataset is unscreened, the collected images may include content that is strongly discomforting or disturbing to humans. The COYO dataset may contain some inappropriate data, and any problems resulting from such data are the full responsibility of the user. We therefore strongly recommend using this dataset only for research, and Kakao Brain does not recommend using it as is, without special processing to remove inappropriate data, to create commercial products.

License

The COYO dataset of Kakao Brain is licensed under the CC-BY-4.0 License. The full license can be found in the LICENSE.cc-by-4.0 file. The dataset includes “Image URL” and “Text” collected from various sites by analyzing Common Crawl data, an open data web crawling project. The collected data (images and text) is subject to the license of the content it belongs to.

Obligation to use

While Open Source may be free to use, that does not mean it is free of obligation. To determine whether your intended use of the COYO dataset is suitable for the CC-BY-4.0 license, please consider the license guide. If you violate the license, you may be subject to legal action such as the prohibition of use or claim for damages depending on the use.

Contact

The COYO dataset was released as open source in the hope that it will be helpful to many research institutes and startups for research purposes. We look forward to hearing from anyone who wishes to cooperate with us.

coyo@kakaobrain.com