<font size=8> :label: Recognize Anything Model </font>
This project aims to develop a series of open-source and strong fundamental image recognition models.
- Recognize Anything Plus Model (RAM++) [Paper] <br>
  RAM++ is the next generation of RAM, which can recognize any category with high accuracy, including both predefined common categories and diverse open-set categories.
- Recognize Anything Model (RAM) [Paper] [Demo] <br>
  RAM is an image tagging model, which can recognize any common category with high accuracy. <br>
  RAM is accepted at the CVPR 2024 Multimodal Foundation Models Workshop.
- Tag2Text (ICLR 2024) [Paper] [Demo] <br>
  Tag2Text is a vision-language model guided by tagging, which can support tagging and comprehensive captioning simultaneously. <br>
  Tag2Text is accepted at ICLR 2024! See you in Vienna!
:bulb: Highlight
Superior Image Recognition Capability
RAM++ outperforms existing SOTA fundamental image recognition models on common tag categories, uncommon tag categories, and human-object interaction phrases.
<p align="center"> <table class="tg"> <tr> <td class="tg-c3ow"><img src="images/ram_plus_compare.jpg" align="center" width="700" ></td> </tr> </table> <p align="center">Comparison of zero-shot image recognition performance.</p> </p>

Strong Visual Semantic Analysis
We have combined Tag2Text and RAM with localization models (Grounding-DINO and SAM) and developed a strong visual semantic analysis pipeline in the Grounded-SAM project.
:sunrise: Model Zoo
<details> <summary><font size="3" style="font-weight:bold;"> RAM++ </font></summary>

RAM++ is the next generation of RAM, which can recognize any category with high accuracy, including both predefined common categories and diverse open-set categories.
- For Common Predefined Categories. RAM++ exhibits exceptional image tagging capabilities with powerful zero-shot generalization, maintaining the same capabilities as RAM. <!-- - RAM++ showcases impressive zero-shot performance, significantly outperforming CLIP and BLIP. - RAM++ even surpasses fully supervised methods (ML-Decoder). - RAM++ exhibits competitive performance with the Google tagging API. -->
- For Diverse Open-set Categories. RAM++ achieves notable enhancements beyond CLIP and RAM. <!-- - RAM++ integrates the image-tags-text triplets within a unified alignment framework. - RAM++ pioneers the integration of LLM's knowledge into image tagging training. -->
</details>

<details> <summary><font size="3" style="font-weight:bold;"> RAM </font></summary>

RAM is a strong image tagging model, which can recognize any common category with high accuracy.
- Strong and general. RAM exhibits exceptional image tagging capabilities with powerful zero-shot generalization;
  - RAM showcases impressive zero-shot performance, significantly outperforming CLIP and BLIP.
  - RAM even surpasses fully supervised methods (ML-Decoder).
  - RAM exhibits competitive performance with the Google tagging API.
- Reproducible and affordable. RAM has a low reproduction cost thanks to its open-source and annotation-free dataset;
- Flexible and versatile. RAM offers remarkable flexibility, catering to various application scenarios.
RAM significantly improves the tagging ability of the Tag2Text framework.

- Accuracy. RAM utilizes a data engine to generate additional annotations and clean incorrect ones, resulting in higher accuracy than Tag2Text.
- Scope. RAM increases the number of fixed tags from 3,400+ to 6,400+ (reduced to 4,500+ distinct semantic tags by merging synonyms), covering more valuable categories. Moreover, RAM is equipped with open-set capability, making it feasible to recognize tags not seen during training.
</details>

<details> <summary><font size="3" style="font-weight:bold;"> Tag2Text </font></summary>

Tag2Text is an efficient and controllable vision-language model with tagging guidance.

- Tagging. Tag2Text recognizes 3,400+ commonly used categories without manual annotations.
- Captioning. Tag2Text integrates tag information into text generation as guiding elements, resulting in more controllable and comprehensive descriptions.
- Retrieval. Tag2Text provides tags as additional visible alignment indicators for image-text retrieval.

</details>
:open_book: Training Datasets
Image Texts and Tags
These annotation files come from Tag2Text and RAM. Tag2Text automatically extracts image tags from image-text pairs. RAM further augments both tags and texts via an automatic data engine.
Dataset | Annotation Size | Images | Texts | Tags |
---|---|---|---|---|
COCO | 168 MB | 113K | 680K | 3.2M |
VG | 55 MB | 100K | 923K | 2.7M |
SBU | 234 MB | 849K | 1.7M | 7.6M |
CC3M | 766 MB | 2.8M | 5.6M | 28.2M |
CC3M-val | 3.5 MB | 12K | 26K | 132K |
CC12M will be released in the next update.
LLM Tag Descriptions
These tag description files are generated by RAM++ by calling the GPT API. You can also generate descriptions for any custom tag categories with generate_tag_des_llm.py.
Tag Descriptions | Number of Tags |
---|---|
RAM Tag List | 4,585 |
OpenImages Uncommon | 200 |
:toolbox: Checkpoints
Note: you need to create a 'pretrained' folder and download these checkpoints into it.
<!-- insert a table --> <table> <thead> <tr style="text-align: right;"> <th></th> <th>Name</th> <th>Backbone</th> <th>Data</th> <th>Illustration</th> <th>Checkpoint</th> </tr> </thead> <tbody> <tr> <th>1</th> <td>RAM++ (14M)</td> <td>Swin-Large</td> <td>COCO, VG, SBU, CC3M, CC3M-val, CC12M</td> <td>Provide strong image tagging ability for any category.</td> <td><a href="https://huggingface.co/xinyu1205/recognize-anything-plus-model/blob/main/ram_plus_swin_large_14m.pth">Download link</a></td> </tr> <tr> <th>2</th> <td>RAM (14M)</td> <td>Swin-Large</td> <td>COCO, VG, SBU, CC3M, CC3M-val, CC12M</td> <td>Provide strong image tagging ability for common category.</td> <td><a href="https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text/blob/main/ram_swin_large_14m.pth">Download link</a></td> </tr> <tr> <th>3</th> <td>Tag2Text (14M)</td> <td>Swin-Base</td> <td>COCO, VG, SBU, CC3M, CC3M-val, CC12M</td> <td>Support comprehensive captioning and tagging.</td> <td><a href="https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text/blob/main/tag2text_swin_14m.pth">Download link</a></td> </tr> </tbody> </table>

:running: Model Inference
Setting Up
- Create and activate a Conda environment:

<pre>
conda create -n recognize-anything python=3.8 -y
conda activate recognize-anything
</pre>
- Install recognize-anything as a package:

<pre>
pip install git+https://github.com/xinyu1205/recognize-anything.git
</pre>
- Or, for development, you may build from source:
<pre>
git clone https://github.com/xinyu1205/recognize-anything.git
cd recognize-anything
pip install -e .
</pre>
Then the RAM++, RAM, and Tag2Text models can be imported in other projects:
<pre>
from ram.models import ram_plus, ram, tag2text
</pre>
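For reference, here is a minimal sketch of single-image tagging with RAM++ through the package API. It mirrors the bundled inference scripts; the helper names (get_transform, inference_ram), the constructor arguments (image_size, vit), and the checkpoint path are assumptions taken from those scripts and may differ between versions.

```python
# Minimal sketch of tagging one image with RAM++ through the package API.
# The helpers (get_transform, inference_ram) and constructor arguments below
# mirror the bundled inference scripts and are assumptions, not a stable API;
# check the scripts shipped with your installed version.
import torch
from PIL import Image

from ram import get_transform, inference_ram
from ram.models import ram_plus

device = "cuda" if torch.cuda.is_available() else "cpu"

transform = get_transform(image_size=384)
model = ram_plus(pretrained="pretrained/ram_plus_swin_large_14m.pth",
                 image_size=384, vit="swin_l").eval().to(device)

image = transform(Image.open("images/demo/demo1.jpg").convert("RGB")).unsqueeze(0).to(device)

# Returns the English tag string and its Chinese counterpart.
english_tags, chinese_tags = inference_ram(image, model)
print("Image Tags:", english_tags)
print("图像标签:", chinese_tags)
```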
RAM++ Inference
Get the English and Chinese outputs of the images:
<pre>
python inference_ram_plus.py --image images/demo/demo1.jpg --pretrained pretrained/ram_plus_swin_large_14m.pth
</pre>
The output will look like the following:
Image Tags: armchair | blanket | lamp | carpet | couch | dog | gray | green | hassock | home | lay | living room | picture frame | pillow | plant | room | wall lamp | sit | wood floor
图像标签: 扶手椅 | 毯子/覆盖层 | 灯 | 地毯 | 沙发 | 狗 | 灰色 | 绿色 | 坐垫/搁脚凳/草丛 | 家/住宅 | 躺 | 客厅 | 相框 | 枕头 | 植物 | 房间 | 壁灯 | 坐/放置/坐落 | 木地板
RAM++ Inference on Unseen Categories (Open-Set)
- Get the OpenImages-Uncommon categories of the image:
We have released the LLM tag descriptions of OpenImages-Uncommon categories in openimages_rare_200_llm_tag_descriptions.
<pre>
python inference_ram_plus_openset.py --image images/openset_example.jpg \
  --pretrained pretrained/ram_plus_swin_large_14m.pth \
  --llm_tag_des datasets/openimages_rare_200/openimages_rare_200_llm_tag_descriptions.json
</pre>

The output will look like the following:
Image Tags: Close-up | Compact car | Go-kart | Horse racing | Sport utility vehicle | Touring car
- You can also customize any tag categories for recognition through tag descriptions:
Modify the categories, and call the GPT API to generate the corresponding tag descriptions:
<pre>
python generate_tag_des_llm.py \
  --openai_api_key 'your openai api key' \
  --output_file_path datasets/openimages_rare_200/openimages_rare_200_llm_tag_descriptions.json
</pre>

<details> <summary><font size="4" style="font-weight:bold;"> RAM Inference </font></summary>

Get the English and Chinese outputs of the images:
<pre>
python inference_ram.py --image images/demo/demo1.jpg \
  --pretrained pretrained/ram_swin_large_14m.pth
</pre>

The output will look like the following:
Image Tags: armchair | blanket | lamp | carpet | couch | dog | floor | furniture | gray | green | living room | picture frame | pillow | plant | room | sit | stool | wood floor
图像标签: 扶手椅 | 毯子/覆盖层 | 灯 | 地毯 | 沙发 | 狗 | 地板/地面 | 家具 | 灰色 | 绿色 | 客厅 | 相框 | 枕头 | 植物 | 房间 | 坐/放置/坐落 | 凳子 | 木地板
</details>
<details>
<summary><font size="4" style="font-weight:bold;">
RAM Inference on Unseen Categories (Open-Set)
</font></summary>
First, customize the recognition categories in build_openset_label_embedding, then get the tags of the images (see the sketch after this section):
<pre>
python inference_ram_openset.py --image images/openset_example.jpg \
  --pretrained pretrained/ram_swin_large_14m.pth
</pre>

The output will look like the following:
Image Tags: Black-and-white | Go-kart
</details>
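As a rough illustration of what customizing the categories involves, the sketch below follows the overall flow of inference_ram_openset.py. The helper build_openset_label_embedding (and whether it accepts a custom category list), its return values, and the 0.5 class threshold are assumptions; verify them against the script and the ram/utils module before relying on this.

```python
# Hedged sketch of RAM open-set inference with custom categories.
# build_openset_label_embedding and its behaviour with a custom category list
# are assumptions based on inference_ram_openset.py; verify before use.
import numpy as np
import torch
from torch import nn
from PIL import Image

from ram import get_transform, inference_ram_openset
from ram.models import ram
from ram.utils import build_openset_label_embedding

device = "cuda" if torch.cuda.is_available() else "cpu"
categories = ["go-kart", "horse racing"]  # hypothetical custom categories

transform = get_transform(image_size=384)
model = ram(pretrained="pretrained/ram_swin_large_14m.pth",
            image_size=384, vit="swin_l")

# Replace the closed-set tag head with embeddings for the custom categories.
label_embed, openset_categories = build_openset_label_embedding(categories)
model.tag_list = np.array(openset_categories)
model.label_embed = nn.Parameter(label_embed.float())
model.num_class = len(openset_categories)
model.class_threshold = torch.ones(model.num_class) * 0.5  # assumed threshold

model = model.eval().to(device)
image = transform(Image.open("images/openset_example.jpg").convert("RGB")).unsqueeze(0).to(device)
print("Image Tags:", inference_ram_openset(image, model))
```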
<details>
<summary><font size="4" style="font-weight:bold;">
Tag2Text Inference
</font></summary>
Get the tagging and captioning results:
<pre>
python inference_tag2text.py --image images/demo/demo1.jpg \
  --pretrained pretrained/tag2text_swin_14m.pth
</pre>
Or get the tagging and specified captioning results (optional):
<pre>
python inference_tag2text.py --image images/demo/demo1.jpg \
  --pretrained pretrained/tag2text_swin_14m.pth \
  --specified-tags "cloud,sky"
</pre>

</details>
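For programmatic use, the following is a hedged sketch of Tag2Text tagging and captioning through the package API, loosely mirroring inference_tag2text.py. The helper inference_tag2text, its assumed return order (identified tags, user-specified tags, caption), and the constructor arguments are assumptions to verify against that script.

```python
# Hedged sketch of Tag2Text tagging + captioning through the package API.
# The helper inference_tag2text, its return order and the constructor
# arguments are assumptions mirroring inference_tag2text.py; verify before use.
import torch
from PIL import Image

from ram import get_transform, inference_tag2text
from ram.models import tag2text

device = "cuda" if torch.cuda.is_available() else "cpu"

transform = get_transform(image_size=384)
model = tag2text(pretrained="pretrained/tag2text_swin_14m.pth",
                 image_size=384, vit="swin_b").eval().to(device)

image = transform(Image.open("images/demo/demo1.jpg").convert("RGB")).unsqueeze(0).to(device)

# Pass user-specified tags (comma separated) to guide the caption, or "None".
tags, user_tags, caption = inference_tag2text(image, model, "cloud,sky")
print("Model Identified Tags:", tags)
print("Image Caption:", caption)
```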
Batch Inference and Evaluation
We release two datasets, OpenImages-common (214 common tag classes) and OpenImages-rare (200 uncommon tag classes). Copy or sym-link test images of OpenImages v6 to datasets/openimages_common_214/imgs/ and datasets/openimages_rare_200/imgs/.

To evaluate RAM++ on OpenImages-common:

<pre>
python batch_inference.py \
  --model-type ram_plus \
  --checkpoint pretrained/ram_plus_swin_large_14m.pth \
  --dataset openimages_common_214 \
  --output-dir outputs/ram_plus
</pre>
To evaluate RAM++ open-set capability on OpenImages-rare:

<pre>
python batch_inference.py \
  --model-type ram_plus \
  --checkpoint pretrained/ram_plus_swin_large_14m.pth \
  --open-set \
  --dataset openimages_rare_200 \
  --output-dir outputs/ram_plus_openset
</pre>
To evaluate RAM on OpenImages-common:

<pre>
python batch_inference.py \
  --model-type ram \
  --checkpoint pretrained/ram_swin_large_14m.pth \
  --dataset openimages_common_214 \
  --output-dir outputs/ram
</pre>
To evaluate RAM open-set capability on OpenImages-rare:

<pre>
python batch_inference.py \
  --model-type ram \
  --checkpoint pretrained/ram_swin_large_14m.pth \
  --open-set \
  --dataset openimages_rare_200 \
  --output-dir outputs/ram_openset
</pre>
To evaluate Tag2Text on OpenImages-common:

<pre>
python batch_inference.py \
  --model-type tag2text \
  --checkpoint pretrained/tag2text_swin_14m.pth \
  --dataset openimages_common_214 \
  --output-dir outputs/tag2text
</pre>
Please refer to batch_inference.py for more options. To get the P/R reported in Table 3 of the RAM paper, pass --threshold=0.86 for RAM and --threshold=0.68 for Tag2Text.

To run batch inference on custom images, you can set up your own dataset following the structure of the two given datasets.
:golfing: Model Training/Finetuning
RAM++
- Download the RAM training datasets, where each json file contains a list. Each item in the list is a dictionary with three key-value pairs: {'image_path': path_of_image, 'caption': text_of_image, 'union_label_id': image tags for tagging, which include parsed tags and pseudo tags} (see the sketch after this list).
- In ram/configs/pretrain.yaml, set 'train_file' as the paths for the json files.
- Prepare the pretrained Swin-Transformer, and set 'ckpt' in ram/configs/swin.
- Download the RAM++ frozen tag embedding file "ram_plus_tag_embedding_class_4585_des_51.pth", and place it at "ram/data/frozen_tag_embedding/ram_plus_tag_embedding_class_4585_des_51.pth".
- Pre-train the model using 8 A100 GPUs:
<pre>
python -m torch.distributed.run --nproc_per_node=8 pretrain.py \
  --model-type ram_plus \
  --config ram/configs/pretrain.yaml \
  --output-dir outputs/ram_plus
</pre>
- Fine-tune the pre-trained checkpoint using 8 A100 GPUs:
<pre>
python -m torch.distributed.run --nproc_per_node=8 finetune.py \
  --model-type ram_plus \
  --config ram/configs/finetune.yaml \
  --checkpoint outputs/ram_plus/checkpoint_04.pth \
  --output-dir outputs/ram_plus_ft
</pre>
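For clarity, below is a minimal sketch of what one training annotation file is expected to look like. The key names come from the description above; the concrete path, caption, and label ids are made-up placeholders (RAM and Tag2Text use analogous files with the keys listed in their sections).

```python
# Hypothetical example of one training annotation file for RAM++.
# Each json file listed in 'train_file' holds a list of records like this;
# the path, caption and label ids below are placeholders for illustration.
import json

records = [
    {
        "image_path": "images/train/000000000001.jpg",        # placeholder
        "caption": "a dog lying on a couch in a living room",  # placeholder
        # tag ids used for tagging supervision (parsed tags + pseudo tags)
        "union_label_id": [102, 345, 2678],                    # placeholder
    },
]

with open("datasets/train_example.json", "w") as f:
    json.dump(records, f)
```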
<details>
<summary><font size="4" style="font-weight:bold;">
RAM
</font></summary>
- Download the RAM training datasets, where each json file contains a list. Each item in the list is a dictionary with four key-value pairs: {'image_path': path_of_image, 'caption': text_of_image, 'union_label_id': image tags for tagging, which include parsed tags and pseudo tags, 'parse_label_id': image tags parsed from the caption}.
- In ram/configs/pretrain.yaml, set 'train_file' as the paths for the json files.
- Prepare the pretrained Swin-Transformer, and set 'ckpt' in ram/configs/swin.
- Download the RAM frozen tag embedding file "ram_tag_embedding_class_4585.pth", and place it at "ram/data/frozen_tag_embedding/ram_tag_embedding_class_4585.pth".
- Pre-train the model using 8 A100 GPUs:
<pre>
python -m torch.distributed.run --nproc_per_node=8 pretrain.py \
  --model-type ram \
  --config ram/configs/pretrain.yaml \
  --output-dir outputs/ram
</pre>
- Fine-tune the pre-trained checkpoint using 8 A100 GPUs:
<pre>
python -m torch.distributed.run --nproc_per_node=8 finetune.py \
  --model-type ram \
  --config ram/configs/finetune.yaml \
  --checkpoint outputs/ram/checkpoint_04.pth \
  --output-dir outputs/ram_ft
</pre>
</details>
<details>
<summary><font size="4" style="font-weight:bold;">
Tag2Text
</font></summary>
- Download the RAM training datasets, where each json file contains a list. Each item in the list is a dictionary with three key-value pairs: {'image_path': path_of_image, 'caption': text_of_image, 'parse_label_id': image tags parsed from the caption}.
- In ram/configs/pretrain_tag2text.yaml, set 'train_file' as the paths for the json files.
- Prepare the pretrained Swin-Transformer, and set 'ckpt' in ram/configs/swin.
- Pre-train the model using 8 A100 GPUs:
<pre>
python -m torch.distributed.run --nproc_per_node=8 pretrain.py \
  --model-type tag2text \
  --config ram/configs/pretrain_tag2text.yaml \
  --output-dir outputs/tag2text
</pre>
- Fine-tune the pre-trained checkpoint using 8 A100 GPUs:
<pre>
python -m torch.distributed.run --nproc_per_node=8 finetune.py \
  --model-type tag2text \
  --config ram/configs/finetune_tag2text.yaml \
  --checkpoint outputs/tag2text/checkpoint_04.pth \
  --output-dir outputs/tag2text_ft
</pre>
</details>
:black_nib: Citation
If you find our work useful for your research, please consider citing:
@article{huang2023open,
  title={Open-Set Image Tagging with Multi-Grained Text Supervision},
  author={Huang, Xinyu and Huang, Yi-Jie and Zhang, Youcai and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Xie, Yanchun and Li, Yaqian and Zhang, Lei},
  journal={arXiv preprint arXiv:2310.15200},
  year={2023}
}

@article{zhang2023recognize,
  title={Recognize Anything: A Strong Image Tagging Model},
  author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
  journal={arXiv preprint arXiv:2306.03514},
  year={2023}
}

@article{huang2023tag2text,
  title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
  author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
  journal={arXiv preprint arXiv:2303.05657},
  year={2023}
}
:hearts: Acknowledgements
This work is done with the help of the amazing code base of BLIP, thanks very much!

We want to thank @Cheng Rui, @Shilong Liu, and @Ren Tianhe for their help in marrying RAM/Tag2Text with Grounded-SAM.

We also want to thank Ask-Anything and Prompt-can-anything for integrating RAM/Tag2Text, which greatly expands their application boundaries.