<font size=8> :label: Recognize Anything Model </font>

This project aims to develop a series of strong, open-source fundamental image recognition models.

:bulb: Highlight

Superior Image Recognition Capability

RAM++ outperforms existing SOTA fundamental image recognition models on common tag categories, uncommon tag categories, and human-object interaction phrases.

<p align="center"> <table class="tg"> <tr> <td class="tg-c3ow"><img src="images/ram_plus_compare.jpg" align="center" width="700" ></td> </tr> </table> <p align="center">Comparison of zero-shot image recognition performance.</p> </p>

Strong Visual Semantic Analysis

We have combined Tag2Text and RAM with localization models (Grounding-DINO and SAM) and developed a strong visual semantic analysis pipeline in the Grounded-SAM project.

:sunrise: Model Zoo

<details> <summary><font size="3" style="font-weight:bold;"> RAM++ </font></summary>

RAM++ is the next generation of RAM, which can recognize any category with high accuracy, including both predefined common categories and diverse open-set categories.

<p align="center"> <table class="tg"> <tr> <td class="tg-c3ow"><img src="images/ram_plus_experiment.png" align="center" width="800" ></td> </tr> </table> <p align="center">(Green color means fully supervised learning and others means zero-shot performance.)</p> </p> <p align="center"> <table class="tg"> <tr> <td class="tg-c3ow"><img src="images/ram_plus_visualization.jpg" align="center" width="800" ></td> </tr> </table> <p align="center">RAM++ demonstrate a significant improvement in open-set category recognition.</p> </p> </details> <details> <summary><font size="3" style="font-weight:bold;"> RAM </font></summary>

RAM is a strong image tagging model, which can recognize any common category with high accuracy.

<p align="center"> <table class="tg"> <tr> <td class="tg-c3ow"><img src="images/experiment_comparison.png" align="center" width="800" ></td> </tr> </table> <p align="center">(Green color means fully supervised learning and Blue color means zero-shot performance.)</p> </p> <p align="center"> <table class="tg"> <tr> <td class="tg-c3ow"><img src="images/tagging_results.jpg" align="center" width="800" ></td> </tr> </table> </p>

RAM significantly improves on the tagging ability of the Tag2Text framework.

</details> <details> <summary><font size="3" style="font-weight:bold;"> Tag2Text </font></summary>

Tag2Text is an efficient and controllable vision-language model with tagging guidance.

<p align="center"> <table class="tg"> <tr> <td class="tg-c3ow"><img src="images/tag2text_visualization.png" align="center" width="800" ></td> </tr> </table> <p align="center">Tag2Text generate more comprehensive captions with tagging guidance.</p> </p> <p align="center"> <table class="tg"> <tr> <td class="tg-c3ow"><img src="images/tag2text_retrieval_visualization.png" align="center" width="800" ></td> </tr> </table> <p align="center">Tag2Text provides tags as additional visible alignment indicators.</p> </p> </details> <!-- ## :sparkles: Highlight Projects with other Models - [Tag2Text/RAM with Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) is trong and general pipeline for visual semantic analysis, which can automatically **recognize**, detect, and segment for an image! - [Ask-Anything](https://github.com/OpenGVLab/Ask-Anything) is a multifunctional video question answering tool. Tag2Text provides powerful tagging and captioning capabilities as a fundamental component. - [Prompt-can-anything](https://github.com/positive666/Prompt-Can-Anything) is a gradio web library that integrates SOTA multimodal large models, including Tag2text as the core model for graphic understanding --> <!-- ## :fire: News - **`2023/10/30`**: We release the [Recognize Anything Model Plus Model(RAM++)](), checkpoints and inference code! - **`2023/06/08`**: We release the [Recognize Anything Model (RAM) Tag2Text web demo 🤗](https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text), checkpoints and inference code! - **`2023/06/07`**: We release the [Recognize Anything Model (RAM)](https://recognize-anything.github.io/), a strong image tagging model! - **`2023/06/05`**: Tag2Text is combined with [Prompt-can-anything](https://github.com/OpenGVLab/Ask-Anything). - **`2023/05/20`**: Tag2Text is combined with [VideoChat](https://github.com/OpenGVLab/Ask-Anything). - **`2023/04/20`**: We marry Tag2Text with with [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything). - **`2023/04/10`**: Code and checkpoint is available Now! - **`2023/03/14`**: [Tag2Text web demo 🤗](https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text) is available on Hugging Face Space! --> <!-- ## :writing_hand: TODO - [x] Release checkpoints. - [x] Release inference code. - [x] Release demo and checkpoints. - [x] Release training codes. - [x] Release training datasets. - [ ] Release full training codes and scripts. -->

:open_book: Training Datasets

Image Texts and Tags

These annotation files come from Tag2Text and RAM. Tag2Text automatically extracts image tags from image-text pairs. RAM further augments both tags and texts via an automatic data engine.

| Dataset  | Size   | Images | Texts | Tags  |
|----------|--------|--------|-------|-------|
| COCO     | 168 MB | 113K   | 680K  | 3.2M  |
| VG       | 55 MB  | 100K   | 923K  | 2.7M  |
| SBU      | 234 MB | 849K   | 1.7M  | 7.6M  |
| CC3M     | 766 MB | 2.8M   | 5.6M  | 28.2M |
| CC3M-val | 3.5 MB | 12K    | 26K   | 132K  |

CC12M to be released in the next update.
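
For a quick sanity check after downloading, the annotation files can be inspected directly. A minimal sketch (the file path below is a placeholder for whichever json you downloaded):

```python
# Minimal sketch: inspect one of the annotation json files listed above.
# "datasets/cc3m_val.json" is a placeholder path -- substitute the file you downloaded.
import json

with open("datasets/cc3m_val.json", "r") as f:
    annotations = json.load(f)  # a list of per-image records

print(f"{len(annotations)} records")
print(annotations[0])  # e.g. image path, caption text, and tag annotations
```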

LLM Tag Descriptions

These tag description files come from RAM++ and were generated by calling the GPT API. You can also generate descriptions for custom tag categories with generate_tag_des_llm.py.

| Tag Descriptions    | Number of Tags |
|---------------------|----------------|
| RAM Tag List        | 4,585          |
| OpenImages Uncommon | 200            |

:toolbox: Checkpoints

Note: create a 'pretrained' folder and download these checkpoints into it.

<!-- insert a table --> <table> <thead> <tr style="text-align: right;"> <th></th> <th>Name</th> <th>Backbone</th> <th>Data</th> <th>Illustration</th> <th>Checkpoint</th> </tr> </thead> <tbody> <tr> <th>1</th> <td>RAM++ (14M)</td> <td>Swin-Large</td> <td>COCO, VG, SBU, CC3M, CC3M-val, CC12M</td> <td>Provide strong image tagging ability for any category.</td> <td><a href="https://huggingface.co/xinyu1205/recognize-anything-plus-model/blob/main/ram_plus_swin_large_14m.pth">Download link</a></td> </tr> <tr> <th>2</th> <td>RAM (14M)</td> <td>Swin-Large</td> <td>COCO, VG, SBU, CC3M, CC3M-val, CC12M</td> <td>Provide strong image tagging ability for common category.</td> <td><a href="https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text/blob/main/ram_swin_large_14m.pth">Download link</a></td> </tr> <tr> <th>3</th> <td>Tag2Text (14M)</td> <td>Swin-Base</td> <td>COCO, VG, SBU, CC3M, CC3M-val, CC12M</td> <td>Support comprehensive captioning and tagging.</td> <td><a href="https://huggingface.co/spaces/xinyu1205/Recognize_Anything-Tag2Text/blob/main/tag2text_swin_14m.pth">Download link</a></td> </tr> </tbody> </table>
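
If you prefer to script the download, something like the following works with the huggingface_hub client. This is a sketch for the RAM++ checkpoint only; the repo id is taken from the download link above, and the other checkpoints live in a Hugging Face Space rather than a model repo.

```python
# Optional: fetch the RAM++ checkpoint into ./pretrained with huggingface_hub
# (repo id taken from the download link above).
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="xinyu1205/recognize-anything-plus-model",
    filename="ram_plus_swin_large_14m.pth",
    local_dir="pretrained",
)
```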

:running: Model Inference

Setting Up

  1. Create and activate a Conda environment:

     conda create -n recognize-anything python=3.8 -y
     conda activate recognize-anything

  2. Install recognize-anything as a package:

     pip install git+https://github.com/xinyu1205/recognize-anything.git

  3. Or, for development, build from source:

     git clone https://github.com/xinyu1205/recognize-anything.git
     cd recognize-anything
     pip install -e .

Then the RAM++, RAM, and Tag2Text models can be imported in other projects:

from ram.models import ram_plus, ram, tag2text
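
As a rough sketch of what the inference scripts below do in Python: the helper names and signatures used here (get_transform, inference_ram, the ram_plus constructor arguments) are assumptions based on the repo's inference_ram_plus.py, so check that script if they differ.

```python
# Minimal Python sketch mirroring inference_ram_plus.py; helper names and
# signatures (get_transform, inference_ram, ram_plus arguments) are assumptions
# based on that script.
import torch
from PIL import Image

from ram import get_transform, inference_ram as inference
from ram.models import ram_plus

device = "cuda" if torch.cuda.is_available() else "cpu"
transform = get_transform(image_size=384)

model = ram_plus(pretrained="pretrained/ram_plus_swin_large_14m.pth",
                 image_size=384, vit="swin_l")
model.eval()
model = model.to(device)

image = transform(Image.open("images/demo/demo1.jpg")).unsqueeze(0).to(device)
english_tags, chinese_tags = inference(image, model)
print("Image Tags:", english_tags)
```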

RAM++ Inference

Get the English and Chinese outputs of the image:

python inference_ram_plus.py --image images/demo/demo1.jpg --pretrained pretrained/ram_plus_swin_large_14m.pth

The output will look like the following:

Image Tags:  armchair | blanket | lamp | carpet | couch | dog | gray | green | hassock | home | lay | living room | picture frame | pillow | plant | room | wall lamp | sit | wood floor
图像标签:  扶手椅  | 毯子/覆盖层 | 灯  | 地毯  | 沙发 | 狗 | 灰色 | 绿色  | 坐垫/搁脚凳/草丛 | 家/住宅 | 躺  | 客厅  | 相框  | 枕头  | 植物  | 房间  | 壁灯  | 坐/放置/坐落 | 木地板

RAM++ Inference on Unseen Categories (Open-Set)

  1. Get the OpenImages-Uncommon categories of the image:

We have released the LLM tag descriptions of OpenImages-Uncommon categories in openimages_rare_200_llm_tag_descriptions.

<pre/>
python inference_ram_plus_openset.py --image images/openset_example.jpg \
  --pretrained pretrained/ram_plus_swin_large_14m.pth \
  --llm_tag_des datasets/openimages_rare_200/openimages_rare_200_llm_tag_descriptions.json
</pre>

The output will look like the following:

Image Tags: Close-up | Compact car | Go-kart | Horse racing | Sport utility vehicle | Touring car
  2. You can also customize any tag categories for recognition through tag descriptions:

Modify the categories, and call the GPT API to generate the corresponding tag descriptions:

<pre/>
python generate_tag_des_llm.py \
  --openai_api_key 'your openai api key' \
  --output_file_path datasets/openimages_rare_200/openimages_rare_200_llm_tag_descriptions.json
</pre>

<details> <summary><font size="4" style="font-weight:bold;"> RAM Inference </font></summary>

Get the English and Chinese outputs of the image:

<pre/>
python inference_ram.py --image images/demo/demo1.jpg \
  --pretrained pretrained/ram_swin_large_14m.pth
</pre>

The output will look like the following:

Image Tags:  armchair | blanket | lamp | carpet | couch | dog | floor | furniture | gray | green | living room | picture frame | pillow | plant | room | sit | stool | wood floor
图像标签:  扶手椅  | 毯子/覆盖层 | 灯  | 地毯  | 沙发 | 狗 | 地板/地面 | 家具  | 灰色 | 绿色  | 客厅  | 相框  | 枕头  | 植物  | 房间  | 坐/放置/坐落 | 凳子  | 木地板 
</details> <details> <summary><font size="4" style="font-weight:bold;"> RAM Inference on Unseen Categories (Open-Set) </font></summary>

First, customize the recognition categories in build_openset_label_embedding, then get the tags of the image:

<pre/>
python inference_ram_openset.py --image images/openset_example.jpg \
  --pretrained pretrained/ram_swin_large_14m.pth
</pre>

The output will look like the following:

Image Tags: Black-and-white | Go-kart
</details> <details> <summary><font size="4" style="font-weight:bold;"> Tag2Text Inference </font></summary>

Get the tagging and captioning results:

<pre/>
python inference_tag2text.py --image images/demo/demo1.jpg \
  --pretrained pretrained/tag2text_swin_14m.pth
</pre>

Or get the tagging and specified captioning results (optional):

<pre/>
python inference_tag2text.py --image images/demo/demo1.jpg \
  --pretrained pretrained/tag2text_swin_14m.pth \
  --specified-tags "cloud,sky"
</pre>

</details>

Batch Inference and Evaluation

We release two evaluation datasets: OpenImages-common (214 common tag classes) and OpenImages-rare (200 uncommon tag classes). Copy or sym-link the test images of OpenImages v6 to datasets/openimages_common_214/imgs/ and datasets/openimages_rare_200/imgs/ (a small sym-link helper is sketched below).
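
A small helper for the sym-link step; the OpenImages v6 source path is hypothetical, so point it at wherever your test images actually live.

```python
# Sym-link the OpenImages v6 test images into the two evaluation dataset folders.
# "/data/openimages_v6/test" is a hypothetical location -- adjust to your setup.
import os

openimages_test_dir = "/data/openimages_v6/test"
for target in ("datasets/openimages_common_214/imgs",
               "datasets/openimages_rare_200/imgs"):
    if not os.path.exists(target):
        os.symlink(openimages_test_dir, target, target_is_directory=True)
```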

To evaluate RAM++ on OpenImages-common:

python batch_inference.py \
  --model-type ram_plus \
  --checkpoint pretrained/ram_plus_swin_large_14m.pth \
  --dataset openimages_common_214 \
  --output-dir outputs/ram_plus

To evaluate RAM++ open-set capability on OpenImages-rare:

python batch_inference.py \
  --model-type ram_plus \
  --checkpoint pretrained/ram_plus_swin_large_14m.pth \
  --open-set \
  --dataset openimages_rare_200 \
  --output-dir outputs/ram_plus_openset

To evaluate RAM on OpenImages-common:

python batch_inference.py \
  --model-type ram \
  --checkpoint pretrained/ram_swin_large_14m.pth \
  --dataset openimages_common_214 \
  --output-dir outputs/ram

To evaluate RAM open-set capability on OpenImages-rare:

python batch_inference.py \
  --model-type ram \
  --checkpoint pretrained/ram_swin_large_14m.pth \
  --open-set \
  --dataset openimages_rare_200 \
  --output-dir outputs/ram_openset

To evaluate Tag2Text on OpenImages-common:

python batch_inference.py \
  --model-type tag2text \
  --checkpoint pretrained/tag2text_swin_14m.pth \
  --dataset openimages_common_214 \
  --output-dir outputs/tag2text

Please refer to batch_inference.py for more options. To reproduce the P/R numbers in Table 3 of the RAM paper, pass --threshold=0.86 for RAM and --threshold=0.68 for Tag2Text.

To run batch inference on custom images, you can set up your own dataset following the two given datasets.

:golfing: Model Training/Finetuning

RAM++

  1. Download the RAM training datasets above, where each json file contains a list. Each item in the list is a dictionary with three key-value pairs: {'image_path': path_of_image, 'caption': text_of_image, 'union_label_id': image tags for tagging, which include parsed tags and pseudo tags}. (A hypothetical example of this format is sketched after these steps.)

  2. In ram/configs/pretrain.yaml, set 'train_file' to the paths of the json files.

  3. Prepare the pretrained Swin-Transformer checkpoint, and set 'ckpt' in ram/configs/swin.

  4. Download the RAM++ frozen tag embedding file "ram_plus_tag_embedding_class_4585_des_51.pth", and place it at "ram/data/frozen_tag_embedding/ram_plus_tag_embedding_class_4585_des_51.pth".

  5. Pre-train the model using 8 A100 GPUs:

python -m torch.distributed.run --nproc_per_node=8 pretrain.py \
  --model-type ram_plus \
  --config ram/configs/pretrain.yaml  \
  --output-dir outputs/ram_plus

  6. Fine-tune the pre-trained checkpoint using 8 A100 GPUs:

python -m torch.distributed.run --nproc_per_node=8 finetune.py \
  --model-type ram_plus \
  --config ram/configs/finetune.yaml  \
  --checkpoint outputs/ram_plus/checkpoint_04.pth \
  --output-dir outputs/ram_plus_ft
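
For reference, a rough sketch of the annotation format described in step 1. All values are made-up placeholders, and 'union_label_id' is assumed to hold indices into the 4,585-category tag list (parsed tags plus pseudo tags).

```python
# Hypothetical example of one training json file (see step 1 above).
# Paths, caption, and label ids are placeholders; 'union_label_id' is assumed
# to index into the 4,585-category RAM tag list.
import json

train_annotations = [
    {
        "image_path": "images/demo/demo1.jpg",
        "caption": "a dog lying on a couch in a living room",
        "union_label_id": [512, 1024, 2048],
    },
]

with open("datasets/my_train_file.json", "w") as f:
    json.dump(train_annotations, f)
```
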
<details> <summary><font size="4" style="font-weight:bold;"> RAM </font></summary>
  1. Download the RAM training datasets above, where each json file contains a list. Each item in the list is a dictionary with four key-value pairs: {'image_path': path_of_image, 'caption': text_of_image, 'union_label_id': image tags for tagging, which include parsed tags and pseudo tags, 'parse_label_id': image tags parsed from the caption}.

  2. In ram/configs/pretrain.yaml, set 'train_file' to the paths of the json files.

  3. Prepare the pretrained Swin-Transformer checkpoint, and set 'ckpt' in ram/configs/swin.

  4. Download the RAM frozen tag embedding file "ram_tag_embedding_class_4585.pth", and place it at "ram/data/frozen_tag_embedding/ram_tag_embedding_class_4585.pth".

  5. Pre-train the model using 8 A100 GPUs:

python -m torch.distributed.run --nproc_per_node=8 pretrain.py \
  --model-type ram \
  --config ram/configs/pretrain.yaml  \
  --output-dir outputs/ram

  6. Fine-tune the pre-trained checkpoint using 8 A100 GPUs:

python -m torch.distributed.run --nproc_per_node=8 finetune.py \
  --model-type ram \
  --config ram/configs/finetune.yaml  \
  --checkpoint outputs/ram/checkpoint_04.pth \
  --output-dir outputs/ram_ft
</details> <details> <summary><font size="4" style="font-weight:bold;"> Tag2Text </font></summary>
  1. Download the RAM training datasets above, where each json file contains a list. Each item in the list is a dictionary with three key-value pairs: {'image_path': path_of_image, 'caption': text_of_image, 'parse_label_id': image tags parsed from the caption}.

  2. In ram/configs/pretrain_tag2text.yaml, set 'train_file' to the paths of the json files.

  3. Prepare the pretrained Swin-Transformer checkpoint, and set 'ckpt' in ram/configs/swin.

  4. Pre-train the model using 8 A100 GPUs:

python -m torch.distributed.run --nproc_per_node=8 pretrain.py \
  --model-type tag2text \
  --config ram/configs/pretrain_tag2text.yaml  \
  --output-dir outputs/tag2text

  5. Fine-tune the pre-trained checkpoint using 8 A100 GPUs:

python -m torch.distributed.run --nproc_per_node=8 finetune.py \
  --model-type tag2text \
  --config ram/configs/finetune_tag2text.yaml  \
  --checkpoint outputs/tag2text/checkpoint_04.pth \
  --output-dir outputs/tag2text_ft
</details>

:black_nib: Citation

If you find our work to be useful for your research, please consider citing.

@article{huang2023open,
  title={Open-Set Image Tagging with Multi-Grained Text Supervision},
  author={Huang, Xinyu and Huang, Yi-Jie and Zhang, Youcai and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Xie, Yanchun and Li, Yaqian and Zhang, Lei},
  journal={arXiv e-prints},
  pages={arXiv--2310},
  year={2023}
}

@article{zhang2023recognize,
  title={Recognize Anything: A Strong Image Tagging Model},
  author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
  journal={arXiv preprint arXiv:2306.03514},
  year={2023}
}

@article{huang2023tag2text,
  title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
  author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
  journal={arXiv preprint arXiv:2303.05657},
  year={2023}
}

:hearts: Acknowledgements

This work is done with the help of the amazing code base of BLIP, thanks very much!

We want to thank @Cheng Rui @Shilong Liu @Ren Tianhe for their help in marrying RAM/Tag2Text with Grounded-SAM.

We also want to thank Ask-Anything and Prompt-can-anything for integrating RAM/Tag2Text, which greatly expands their application boundaries.