NaViT

My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"

Paper: https://arxiv.org/abs/2307.06304
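
The core idea: an image of any resolution is split into (h / patch_size) x (w / patch_size) patch tokens, and token sequences from several images are packed into a single transformer sequence, with attention masked so images do not attend to each other. Below is a minimal sketch of the packing arithmetic, using the same groups as the Usage example further down; the num_patches helper is illustrative and not part of navit-torch:

def num_patches(h, w, patch_size=32):
    # each image contributes (h // patch_size) * (w // patch_size) tokens
    return (h // patch_size) * (w // patch_size)

# the three groups from the Usage example, as (height, width) pairs
groups = [
    [(256, 256), (128, 128)],
    [(256, 256), (256, 128)],
    [(64, 256)],
]

for group in groups:
    lengths = [num_patches(h, w) for h, w in group]
    print(lengths, "->", sum(lengths), "tokens in one packed sequence")
# [64, 16] -> 80 tokens in one packed sequence
# [64, 32] -> 96 tokens in one packed sequence
# [16] -> 16 tokens in one packed sequence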

Appreciation

Install

pip install navit-torch

Usage

import torch
from navit.main import NaViT


n = NaViT(
    image_size = 256,           # maximum image height/width
    patch_size = 32,            # size of each square patch
    num_classes = 1000,
    dim = 1024,
    depth = 6,                  # number of transformer layers, required by the reference implementation
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1,
    token_dropout_prob = 0.1    # probability of dropping patch tokens during training
)

# each inner list is a group of images packed into one sequence;
# heights and widths can vary freely, as long as they are divisible by patch_size
images = [
    [torch.randn(3, 256, 256), torch.randn(3, 128, 128)],
    [torch.randn(3, 256, 256), torch.randn(3, 256, 128)],
    [torch.randn(3, 64, 256)]
]

preds = n(images)  # (5, 1000): one prediction per image across all groups
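
Training composes with a standard classification loss. A minimal sketch, assuming the model returns one row of logits per image (as in the reference implementation); the labels here are random stand-ins:

import torch
import torch.nn.functional as F

labels = torch.randint(0, 1000, (5,))  # hypothetical targets, one per image

preds = n(images)                      # (5, 1000)
loss = F.cross_entropy(preds, labels)  # cross-entropy over the packed predictions
loss.backward()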

Dataset Strategy

The table below lists the key datasets used in the paper to pretrain and evaluate NaViT:

| Dataset      | Type                          | Size                       | Details                                | Source |
|--------------|-------------------------------|----------------------------|----------------------------------------|--------|
| JFT-4B       | Image classification          | 4 billion images           | Private dataset from Google            | [1]    |
| WebLI        | Image-text                    | 73M image-text pairs       | Web-crawled dataset                    | [2]    |
| ImageNet     | Image classification          | 1.3M images, 1000 classes  | Standard benchmark                     | [3]    |
| ImageNet-A   | Image classification          | 7,500 images               | Out-of-distribution variant            | [4]    |
| ObjectNet    | Image classification          | 50K images, 313 classes    | Out-of-distribution variant            | [5]    |
| LVIS         | Object detection              | 120K images, 1000 classes  | Large vocabulary instance segmentation | [6]    |
| ADE20K       | Semantic segmentation         | 20K images, 150 classes    | Scene parsing dataset                  | [7]    |
| Kinetics-400 | Video classification          | 300K videos, 400 classes   | Action recognition dataset             | [8]    |
| FairFace     | Face attribute classification | 108K images, 9 attributes  | Balanced dataset for facial analysis   | [9]    |
| CelebA       | Face attribute classification | 200K images, 40 attributes | Face attributes dataset                | [10]   |

[1] Zhai et al. "Scaling Vision Transformers". 2022. https://arxiv.org/abs/2106.04560
[2] Chen et al. "PaLI". 2022. https://arxiv.org/abs/2209.06794
[3] Deng et al. "ImageNet". 2009. http://www.image-net.org/
[4] Hendrycks et al. "Natural Adversarial Examples". 2021. https://arxiv.org/abs/1907.07174
[5] Barbu et al. "ObjectNet". NeurIPS 2019.
[6] Gupta et al. "LVIS". 2019. https://arxiv.org/abs/1908.03195
[7] Zhou et al. "ADE20K". 2017. https://arxiv.org/abs/1608.05442
[8] Kay et al. "Kinetics". 2017. https://arxiv.org/abs/1705.06950
[9] Kärkkäinen and Joo. "FairFace". 2019. https://arxiv.org/abs/1908.04913
[10] Liu et al. "CelebA". 2015. https://arxiv.org/abs/1411.7766

Todo

License

MIT

Citations

@misc{2307.06304,
Author = {Mostafa Dehghani and Basil Mustafa and Josip Djolonga and Jonathan Heek and Matthias Minderer and Mathilde Caron and Andreas Steiner and Joan Puigcerver and Robert Geirhos and Ibrahim Alabdulmohsin and Avital Oliver and Piotr Padlewski and Alexey Gritsenko and Mario Lučić and Neil Houlsby},
Title = {Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution},
Year = {2023},
Eprint = {arXiv:2307.06304},
}