VisionTransformer

This repository attempts to reproduce the ViT results from the original paper using the publicly available COYO-Labeled-300M dataset in place of JFT-300M.

The model was pre-trained on the COYO-Labeled-300M dataset, making it, to our knowledge, the classification ViT trained on the largest publicly released labeled dataset.

We provide code for pretraining and finetuning in TensorFlow 2.

We are also working with Hugging Face to publish the weights and make the model usable in PyTorch and JAX through the Hugging Face platform.
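
Once the weights land on the Hugging Face Hub, loading them in PyTorch could look roughly like the sketch below. The model id is a placeholder, since no checkpoint has been published there yet.

```python
# Hypothetical sketch: load the ViT weights from the Hugging Face Hub in PyTorch.
# The model id below is a placeholder, not a released checkpoint.
from transformers import ViTForImageClassification, ViTImageProcessor

model_id = "kakaobrain/vit-l16-coyo-labeled-300m"  # placeholder id
processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)
```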

Training

Results

| Model | Upstream Dataset | Resolution | ImageNet (downstream) | ImageNet-ReaL (downstream) | Public |
| --- | --- | --- | --- | --- | --- |
| ViT-L/16 | JFT-300M | 512 | 87.76 | 90.54 | X |
| ViT-L/16 | COYO-Labeled-300M | 512 | 87.24 (-0.52) | 90.03 (-0.51) | O |
| ViT-L/16 | JFT-300M | 384 | 87.12 | 89.99 | X |
| ViT-L/16 | COYO-Labeled-300M | 384 | 86.72 (-0.40) | 89.84 (-0.15) | O |

Checkpoints

| Model | Upstream Dataset | Downstream Dataset | Resolution | Link |
| --- | --- | --- | --- | --- |
| ViT-L/16 | COYO-Labeled-300M | - | 224 | link |
| ViT-L/16 | COYO-Labeled-300M | ImageNet | 384 | link |
| ViT-L/16 | COYO-Labeled-300M | ImageNet | 512 | link |
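
As a rough illustration, a downloaded checkpoint might be restored in TensorFlow 2 as sketched below; the constructor name and checkpoint layout are assumptions for illustration, not this repository's actual API.

```python
import tensorflow as tf

# Hypothetical sketch: `build_vit_l16` is a stand-in for this repository's
# model constructor, and the files are assumed to be standard TensorFlow
# checkpoints rather than SavedModels.
model = build_vit_l16(image_size=384, num_classes=1000)  # hypothetical constructor
ckpt = tf.train.Checkpoint(model=model)
ckpt.restore(tf.train.latest_checkpoint("/path/to/downloaded_ckpt")).expect_partial()
```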

Requirements

Commands

We use Hydra to manage the configuration. For detailed usage, see here.
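
For reference, the same trainer configs and overrides used in the commands below can also be composed programmatically. This is a minimal sketch using Hydra's compose API; `config_path="configs"` and `config_name="config"` are assumptions about this repository's layout.

```python
from hydra import compose, initialize
from omegaconf import OmegaConf

# Minimal sketch of composing a config with Hydra's compose API.
# The config path and name are assumptions, not this repo's documented layout.
with initialize(config_path="configs", version_base=None):
    cfg = compose(
        config_name="config",
        overrides=["trainer=vit_l16_i1k_downstream", "trainer.epochs=16"],
    )
    print(OmegaConf.to_yaml(cfg))  # inspect the resolved configuration tree
```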

Pretraining

python3 -m trainer trainer=vit_l16_coyo300m \
  runtime.tpu.name={your_tpu_name} \
  runtime.tpu.type={your_tpu_type} \
  experiment.debug=false \
  experiment.save_dir={your_save_dir}

Finetuning

python3 -m trainer trainer=vit_l16_i1k_downstream \
  runtime.tpu.name={your_tpu_name} \
  runtime.tpu.type={your_tpu_type} \
  experiment.debug=false \
  experiment.save_dir={your_save_dir} \
  trainer.backbone.pretrained={your_pretrained_weight} 

You can also experiment by overriding configuration values on the command line, as follows.

python3 -m trainer trainer=vit_l16_i1k_downstream \
  runtime.tpu.name={your_tpu_name} \
  runtime.tpu.type={your_tpu_type} \
  experiment.debug=false \
  experiment.save_dir={your_save_dir} \
  trainer.backbone.pretrained={your_pretrained_weight} \
  trainer.epochs=16 \
  trainer.learning_rate.base_lr=3e-2

Evaluation

python3 -m trainer trainer=vit_l16_i1k_downstream \
  runtime.tpu.name={your_tpu_name} \
  runtime.tpu.type={your_tpu_type} \
  experiment.debug=false \
  experiment.save_dir={your_weight_path} \
  experiment.mode='eval'

Citation

@misc{kakaobrain2022coyo-vit,
  title         = {COYO-ViT},
  author        = {Lee, Sungjun and Park, Beomhee},
  year          = {2022},
  howpublished  = {\url{https://github.com/kakaobrain/coyo-vit}},
}

@misc{kakaobrain2022coyo-700m,
  title         = {COYO-700M: Image-Text Pair Dataset},
  author        = {Byeon, Minwoo and Park, Beomhee and Kim, Haecheon and Lee, Sungjun and Baek, Woonhyuk and Kim, Saehoon},
  year          = {2022},
  howpublished  = {\url{https://github.com/kakaobrain/coyo-dataset}},
}

@misc{dosovitskiy2020image,
  title         = {An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author        = {Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
  year          = {2020},
  eprint        = {2010.11929},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}

People

Contact

This project is released as open source in the hope that it will be helpful to many research institutes and startups for research purposes.

jun.untitled@kakaobrain.com

License

The source code is licensed under the Apache 2.0 License.