<!--- Copyright 2023 The OFA-Sys Team. All rights reserved. This source code is licensed under the Apache 2.0 license found in the LICENSE file in the root directory. --> <p align="center"> <br> <img src="assets/logo.png" width="350" /> <br> <p> <p align="center"> 📖 <a href="https://arxiv.org/abs/2305.11172">Paper</a>&nbsp&nbsp | &nbsp🤗 <a href="https://huggingface.co/spaces/OFA-Sys/ONE-PEACE_Multimodal_Retrieval">Demo</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/models/damo/ONE-PEACE-4B/summary">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp<a href="checkpoints.md">Checkpoints</a>&nbsp | &nbsp<a href="datasets.md">Datasets</a> </p> <br>

ONE-PEACE is a general representation model across vision, audio, and language modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on vision, audio, audio-language, and vision-language tasks. Furthermore, ONE-PEACE possesses a strong emergent zero-shot retrieval capability, enabling it to align modalities that are not paired in the training data.

The figure below shows the architecture and pretraining tasks of ONE-PEACE. With its scaling-friendly architecture and modality-agnostic tasks, ONE-PEACE has the potential to expand to unlimited modalities.

<p align="center"> <img src="assets/one_peace.png" width=100%> </p> <br>

Online Demo

We provide an online demo on Hugging Face Spaces. In this demo, you can combine multiple modalities to retrieve related images, such as audio-to-image, audio+text-to-image, audio+image-to-image, and even audio+image+text-to-image.

<p align="center"> <img src="assets/demo.png" width=100%> </p> <br>

News

Models and Results

Model Card

We list the parameters and pretrained checkpoints of ONE-PEACE below. Note that ONE-PEACE can be disassembled into different branches to handle different tasks. We also provide the vision branch of ONE-PEACE, which can be used to perform vision tasks.

<table border="1" width="100%"> <tr align="center"> <th>Model</th><th>Ckpt</th><th>Params</th><th>Hidden size</th><th>Intermediate size</th><th>Attention heads</th><th>Layers</th> </tr> <tr align="center"> <td>ONE-PEACE</td><td><a href="http://one-peace-shanghai.oss-accelerate.aliyuncs.com/one-peace.pt">Download</a></td><td>4B</td><td>1536</td><td>6144</td><td>24</td><td>40</td> </tr> <tr align="center"> <td>ONE-PEACE<br>(Vision Branch)</td><td><a href="https://one-peace-shanghai.oss-accelerate.aliyuncs.com/one_peace_checkpoints/one-peace-vision.pkl">Download</a></td><td>1.5B</td><td>1536</td><td>6144</td><td>24</td><td>40</td> </tr> </table> <br>
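The checkpoints above can be loaded with the from_pretrained helper shown in the Usage section, either by model name or by the path of a downloaded file. A minimal sketch (the local path below is just an illustrative example):

import torch
from one_peace.models import from_pretrained

device = "cuda" if torch.cuda.is_available() else "cpu"
# pass the model name to download automatically, or point to a local checkpoint,
# e.g. the one-peace.pt file downloaded from the table above
model = from_pretrained("checkpoints/one-peace.pt", device=device, dtype="float32")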

Results

Vision Tasks

<table border="1" width="100%"> <tr align="center"> <th>Task</th><th>Image classification</th><th>Semantic Segmentation</th><th>Object Detection (w/o Object365)</th><th>Video Action Recognition</th> </tr> <tr align="center"> <td>Dataset</td><td>Imagenet-1K</td><td>ADE20K</td><td>COCO</td><td>Kinetics 400</td> </tr> <tr align="center"> <td>Split</td><td>val</td><td>val</td><td>val</td><td>val</td> </tr> <tr align="center"> <td>Metric</td><td>Acc.</td><td>mIoU<sup>ss</sup> / mIoU<sup>ms</sup></td><td>AP<sup>box</sup> / AP<sup>mask</sup></td><td>Top-1 Acc. / Top-5 Acc.</td> </tr> <tr align="center"> <td>ONE-PEACE</td><td>89.8</td><td>62.0 / 63.0</td><td>60.4 / 52.9</td><td>88.1 / 97.8</td> </tr> </table>

Audio Tasks

<table border="1" width="100%"> <tr align="center"> <th>Task</th><th colspan="4">Audio-Text Retrieval</th><th colspan="3">Audio Classification</th><th>Audio Question Answering</th> </tr> <tr align="center"> <td>Dataset</td><td colspan="2">AudioCaps</td><td colspan="2">Clotho</td><td>ESC-50</td><td>FSD50K</td><td>VGGSound (Audio-Visual)</td><td>AVQA</td> </tr> <tr align="center"> <td>Split</td><td colspan="2">test</td><td colspan="2">evaluation</td><td>full</td><td>eval</td><td>test</td><td>val</td> </tr> <tr align="center"> <td>Metric</td><td>T2A R@1</td><td>A2T R@1</td><td>T2A R@1</td><td>A2T R@1</td><td>Zero-shot Acc.</td><td>MAP</td><td>Acc.</td><td>Acc.</td> </tr> <tr align="center"> <td>ONE-PEACE</td><td>42.5</td><td>51.0</td><td>22.4</td><td>27.1</td><td>91.8</td><td>69.7</td><td>68.2</td><td>92.2</td> </tr> </table>

Vision-Language Tasks

<table border="1" width="100%"> <tr align="center"> <th>Task</th><th colspan="4">Image-Text Retrieval (w/o ranking)</th><th colspan="3">Visual Grounding</th><th>VQA</th><th>Visual Reasoning</th> </tr> <tr align="center"> <td>Dataset</td><td colspan="2">COCO</td><td colspan="2">Flickr30K</td><td>RefCOCO</td><td>RefCOCO+</td><td>RefCOCOg</td><td>VQAv2</td><td>NLVR2</td> </tr> <tr align="center"> <td>Split</td><td colspan="2">test</td><td colspan="2">test</td><td>val / testA / testB</td><td>val / testA / testB</td><td>val-u / test-u</td><td>test-dev / test-std</td><td>dev / test-P</td> </tr> <tr align="center"> <td>Metric</td><td>I2T R@1</td><td>T2I R@1</td><td>I2T R@1</td><td>T2I R@1</td><td colspan="3">Acc@0.5</td><td>Acc.</td><td>Acc.</td> </tr> <tr align="center"> <td>ONE-PEACE</td><td>84.1</td><td>65.4</td><td>97.6</td><td>89.6</td><td>92.58 / 94.18 / 89.26</td><td>88.77 / 92.21 / 83.23</td><td>89.22 / 89.27</td><td>82.6 / 82.5</td><td>87.8 / 88.3</td> </tr> </table> <br></br>

Requirements and Installation

# 1. Clone the repository and install the Python requirements
git clone https://github.com/OFA-Sys/ONE-PEACE
cd ONE-PEACE
pip install -r requirements.txt

# 2. Install NVIDIA apex (builds the fused CUDA extensions used by the fused/distributed Adam optimizers)
git clone https://github.com/NVIDIA/apex
cd apex && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./

# 3. Install xformers (memory-efficient attention)
conda install xformers -c xformers

# 4. Install flash-attention and its fused layer-norm kernels
git clone --recursive https://github.com/HazyResearch/flash-attention
cd flash-attention && pip install .
cd csrc/layer_norm && pip install .
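After installation, a quick sanity check such as the one below can confirm that PyTorch sees your GPU and report whether the acceleration packages (apex, xformers, flash-attention) are importable. This is only a convenience sketch, not part of the official setup:

import importlib

import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
for pkg in ("apex", "xformers", "flash_attn"):
    try:
        importlib.import_module(pkg)
        print(pkg, "imported successfully")
    except ImportError:
        print(pkg, "is not installed")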
<br>

Datasets and Checkpoints

See datasets.md and checkpoints.md. <br></br>

Usage

API

We provide simple code snippets to show how to use the ONE-PEACE API.

Multi-modal Embedding

We use ONE-PEACE to compute embeddings for text, images, and audio, as well as their similarities:

import torch
from one_peace.models import from_pretrained

device = "cuda" if torch.cuda.is_available() else "cpu"
# "ONE-PEACE" can also be replaced with ckpt path
model = from_pretrained("ONE-PEACE", device=device, dtype="float32")

# process raw data
src_tokens = model.process_text(["cow", "dog", "elephant"])
src_images = model.process_image(["assets/dog.JPEG", "assets/elephant.JPEG"])
src_audios, audio_padding_masks = model.process_audio(["assets/cow.flac", "assets/dog.flac"])

with torch.no_grad():
    # extract normalized features
    text_features = model.extract_text_features(src_tokens)
    image_features = model.extract_image_features(src_images)
    audio_features = model.extract_audio_features(src_audios, audio_padding_masks)

    # compute similarity
    i2t_similarity = image_features @ text_features.T
    a2t_similarity = audio_features @ text_features.T

print("Image-to-text similarities:", i2t_similarity)
print("Audio-to-text similarities:", a2t_similarity)

Visual Grounding

We use ONE-PEACE to perform visual grounding on anime pictures:

import torch
import cv2
from one_peace.models import from_pretrained

device = "cuda" if torch.cuda.is_available() else "cpu"
model = from_pretrained(
    "ONE-PEACE_Grounding",
    model_type="one_peace_classify",
    device=device,
    dtype="float32"
)

# process raw data
image_text_list = [
    ("assets/pokemons.jpg", "a blue turtle-like pokemon with round head"),
    ("assets/pokemons.jpg", "Bulbasaur"),
    ("assets/pokemons.jpg", "Charmander"),
    ("assets/pokemons.jpg", "Squirtle"),
    ("assets/one_piece.jpeg", "Brook"),
    ("assets/one_piece.jpeg", "Franky"),
    ("assets/one_piece.jpeg", "Monkey D. Luffy"),
    ("assets/one_piece.jpeg", "Nami"),
    ("assets/one_piece.jpeg", "Nico Robin"),
    ("assets/one_piece.jpeg", "Roronoa Zoro"),
    ("assets/one_piece.jpeg", "Tony Tony Chopper"),
    ("assets/one_piece.jpeg", "Usopp"),
    ("assets/one_piece.jpeg", "Vinsmoke Sanji"),
]
(src_images, image_widths, image_heights), src_tokens = model.process_image_text_pairs(
    image_text_list, return_image_sizes=True
)

with torch.no_grad():
    # extract normalized box coordinates in [0, 1]
    vl_features = model.extract_vl_features(src_images, src_tokens).sigmoid()
    # scale the coordinates back to pixel space
    vl_features[:, ::2] *= image_widths.unsqueeze(1)
    vl_features[:, 1::2] *= image_heights.unsqueeze(1)
    coords = vl_features.cpu().tolist()

# display results
for i, image_text_pair in enumerate(image_text_list):
    image, text = image_text_pair
    img = cv2.imread(image)
    cv2.rectangle(
        img,
        (int(coords[i][0]), int(coords[i][1])),
        (int(coords[i][2]), int(coords[i][3])),
        (0, 255, 0),
        3
    )
    cv2.imshow(text, img)
    cv2.waitKey(3500)
    cv2.destroyAllWindows()
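cv2.imshow requires a display, so on a headless server the loop above will fail; writing the annotated images to disk is a simple alternative (the output directory and filenames below are arbitrary):

import os

os.makedirs("grounding_outputs", exist_ok=True)
for i, (image, text) in enumerate(image_text_list):
    img = cv2.imread(image)
    x0, y0, x1, y1 = [int(c) for c in coords[i]]
    # draw the predicted box and save the result instead of opening a window
    cv2.rectangle(img, (x0, y0), (x1, y1), (0, 255, 0), 3)
    cv2.imwrite(os.path.join("grounding_outputs", "{}_{}.jpg".format(i, text.replace(" ", "_"))), img)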

Audio Classification

We use ONE-PEACE to perform audio classification:

import torch
import json
from one_peace.models import from_pretrained

with open("assets/vggsound_id2label.json") as f:
    id2label = json.load(f)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = from_pretrained(
    "ONE-PEACE_VGGSound",
    model_type="one_peace_classify",
    device=device,
    dtype="float32"
)

# process audio
audio_list = ["assets/cow.flac", "assets/dog.flac"]
src_audios, audio_padding_masks = model.process_audio(audio_list)

with torch.no_grad():
    # compute classification logits over the VGGSound label set
    audio_logits = model.extract_audio_features(src_audios, audio_padding_masks)
    print(audio_logits.size())
    predict_label_ids = audio_logits.argmax(1).cpu().tolist()

for audio, predict_label_id in zip(audio_list, predict_label_ids):
    predict_label = id2label[str(predict_label_id)]
    print('audio: {}, predict label: {}'.format(audio, predict_label))
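If you want more than the top-1 prediction, the logits can be converted to probabilities and ranked with topk; the choice of top-5 below is arbitrary:

with torch.no_grad():
    probs = audio_logits.softmax(dim=-1)
    top_probs, top_ids = probs.topk(5, dim=-1)

for audio, ids, ps in zip(audio_list, top_ids.cpu().tolist(), top_probs.cpu().tolist()):
    print(audio)
    for label_id, p in zip(ids, ps):
        print("  {}: {:.3f}".format(id2label[str(label_id)], p))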

Training & Inference

If you are not satisfied with only using the API, we also offer comprehensive training and inference instructions for audio, multimodal, and vision tasks.

<br></br>

Gallery

Visual Grounding (unseen domain)

[Image: visual grounding examples on unseen domains]

Emergent Zero-shot Retrieval

[Image: audio-to-image retrieval examples]

[Image: audio+text-to-image retrieval examples]

[Image: audio+image-to-image retrieval examples] <br></br>

Acknowledgement

Getting Involved

Feel free to submit GitHub issues or pull requests; contributions to the project are welcome!

To contact us, never hesitate to send an email to zheluo.wp@alibaba-inc.com or saimeng.wsj@alibaba-inc.com! <br></br>

Citation

If you find our paper and code useful in your research, please consider giving this repository a star :star: and citing the paper :pencil: :)

@article{wang2023one,
  title={ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities},
  author={Wang, Peng and Wang, Shijie and Lin, Junyang and Bai, Shuai and Zhou, Xiaohuan and Zhou, Jingren and Wang, Xinggang and Zhou, Chang},
  journal={arXiv preprint arXiv:2305.11172},
  year={2023}
}