
CM3Leon: Autoregressive Multi-Modal Model for Text and Image Generation (wip)


CM3Leon is a transformer-based autoregressive model designed for multi-modal tasks, specifically text and image generation. It is trained in two stages on a large, diverse multimodal dataset, uses retrieval-augmented pretraining, and applies contrastive decoding to improve the quality of generated samples.

Paper: "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning"

Install

pip3 install cm3


Usage & Example

To start with CM3Leon in a PyTorch environment:

import torch
from cm3.model import CM3

# a random batch of one 256x256 RGB image
img = torch.randn(1, 3, 256, 256)
# a random caption of 1,024 token ids from a 20,000-token vocabulary
caption = torch.randint(0, 20000, (1, 1024))

model = CM3()

output = model(img, caption)
print(output.shape)  # (1, 1024, 20000): logits over the vocabulary for each caption position


This repository hosts the open-source implementation of CM3Leon, a state-of-the-art autoregressive multi-modal model for text and image generation. The model is introduced in the paper "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning".


Overview

Key Features of CM3Leon:

CM3Leon sets a new benchmark in text-to-image generation, outperforming comparable models while requiring roughly five times less training compute.

Getting Started

The following sections provide a detailed analysis of the model architecture, the necessary resources, and the steps needed to replicate the CM3Leon model.

Requirements

Replicating CM3Leon involves several critical components and requires proficiency in the areas outlined in the sections that follow.

System Architecture

The CM3Leon implementation comprises components for data processing, model training, and serving. Implementing these components involves challenges such as efficient utilization of large compute clusters, minimizing data loading and preprocessing bottlenecks, optimizing memory usage during training and inference, and ensuring low-latency serving.
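
To make the memory challenge concrete, here is a minimal sketch of a training step that combines mixed-precision autocasting with gradient accumulation. It assumes a CUDA device, the (img, caption) interface from the usage example above, and a plain next-token cross-entropy objective; none of these details are prescribed by the repository, and the dummy loader stands in for a real data pipeline.

import torch
import torch.nn.functional as F
from cm3.model import CM3

model = CM3().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)  # peak LR from the 350M row below
scaler = torch.cuda.amp.GradScaler()
accum_steps = 8  # accumulate gradients to emulate a larger batch without holding it in memory

# dummy loader: real training would stream image-caption batches from the datasets listed below
loader = [(torch.randn(1, 3, 256, 256), torch.randint(0, 20000, (1, 1024))) for _ in range(8)]

for step, (img, caption) in enumerate(loader):
    img, caption = img.cuda(), caption.cuda()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(img, caption)  # (batch, seq_len, vocab)
        # assumed objective: predict each caption token from the preceding ones
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            caption[:, 1:].reshape(-1),
        ) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)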

Model Architecture

The architecture of CM3Leon is a transformer-based, decoder-only autoregressive model over text and image tokens. Model sizes range from 350M to 7B parameters.

Data

The datasets used in the paper, with their domains, sizes, and sources:

| Dataset | Domain | Size | Source |
|---|---|---|---|
| Shutterstock | Images and captions | 3 billion text tokens, licensed image data | Proprietary dataset, described in paper |
| MS-COCO | Image captioning | 591K image-caption pairs | Microsoft COCO Captions |
| Flickr30k | Image captioning | 144K image-caption pairs | Flickr30k Entities |
| Image Paragraph | Dense image captioning | 14K images with paragraph captions | Image Paragraph dataset |
| Localized Narratives | Image paragraph captioning | 164K images with localized narratives | Localized Narratives |
| VQA2 | Visual question answering | 1.3M images with question-answer pairs | VQA2 dataset |
| VizWiz | Visual question answering for blind users | 92K images with question-answer pairs | VizWiz dataset |
| OKVQA | Knowledge-based VQA | 26K images with question-answer pairs | OK-VQA dataset |
| ScienceQA | Scientific visual QA | 6K images with multi-choice QA pairs | ScienceQA |

The model was trained and evaluated on several datasets including MS-COCO [...] (Chen et al., 2015), Flickr30k [...] (Young et al., 2014), etc.
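
As a rough illustration of how an image-caption pair from the datasets above could be turned into the tensors the usage example expects, here is a hedged preprocessing sketch; the toy hash-based tokenizer and the 20,000-token vocabulary size are illustrative assumptions, not the repository's actual tokenizer.

import torch
from PIL import Image
from torchvision import transforms

# resize and convert the image to a (1, 3, 256, 256) float tensor
image_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

def toy_tokenize(caption: str, vocab_size: int = 20000, seq_len: int = 1024) -> torch.Tensor:
    # toy tokenizer for illustration only: hash words into a fixed vocabulary, pad to seq_len
    ids = [hash(word) % vocab_size for word in caption.lower().split()]
    ids = ids[:seq_len] + [0] * max(0, seq_len - len(ids))
    return torch.tensor(ids).unsqueeze(0)

img = image_transform(Image.new("RGB", (640, 480))).unsqueeze(0)   # stand-in for a real photo
caption = toy_tokenize("a dog catching a frisbee in the park")
# img: (1, 3, 256, 256), caption: (1, 1024) -- the shapes expected by model(img, caption)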

For a successful implementation, CM3Leon requires access to large-scale image-text data such as the datasets listed above.

Training

CM3Leon's training process involves two stages: retrieval-augmented pretraining on a large, diverse multimodal dataset, followed by supervised fine-tuning (instruction tuning); the hyperparameters for both stages are listed in the tables further below.
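
As a small, self-contained illustration of one part of that recipe, here is a sketch of the linear learning-rate warm-up implied by the warm-up-step columns in those tables; the specific scheduler is an assumption, not necessarily the repository's.

import torch

params = [torch.nn.Parameter(torch.zeros(1))]     # stand-in for model.parameters()
optimizer = torch.optim.AdamW(params, lr=6e-4)    # peak LR from the 350M pretraining row
warmup_steps = 1500                               # from the pretraining table below

# linearly scale the LR from ~0 up to the peak over `warmup_steps` updates
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)

for _ in range(3):        # in a real loop: forward, backward, then...
    optimizer.step()      # ...apply the update...
    scheduler.step()      # ...and advance the warm-up schedule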

Inference

For efficient inference, consider running the model in evaluation mode without gradient tracking and, on GPUs, in reduced precision; the decoding strategies from the paper (classifier-free guidance and contrastive decoding) are outlined in the Roadmap section. A minimal sketch follows.
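
This sketch reuses the (img, caption) interface from the usage example; greedy argmax decoding is shown purely for illustration and is not the paper's decoding strategy.

import torch
from cm3.model import CM3

model = CM3().eval()                          # disable dropout for inference

img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))

with torch.inference_mode():                  # skip autograd bookkeeping entirely
    logits = model(img, caption)              # (1, 1024, 20000)
    next_tokens = logits.argmax(dim=-1)       # greedy choice over the vocabulary
print(next_tokens.shape)                      # (1, 1024)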

Pretraining Hyperparameters

| Model | # Layers | d_model | Seq Length | Batch Size (tokens) | LR | Warm-up Steps | # GPUs | # Tokens |
|---|---|---|---|---|---|---|---|---|
| 350M | 24 | 1024 | 4096 | 8M | 6e-04 | 1500 | 256 | 1.4T |
| 760M | 24 | 1536 | 4096 | 8M | 5e-04 | 1500 | 256 | 1.9T |
| 7B | 32 | 4096 | 4096 | 8M | 1.2e-04 | 1500 | 512 | 2.4T |

Supervised Fine-Tuning (SFT) Hyperparameters

| Model | # GPUs | Seq Length | Batch Size (tokens) | LR | Warm-up Steps | # Tokens |
|---|---|---|---|---|---|---|
| CM3Leon-760m | 64 | 4096 | 2M | 5e-05 | 150 | 30B |
| CM3Leon-7b | 128 | 4096 | 2M | 5e-05 | 150 | 30B |

Key innovations described in the paper include retrieval-augmented pretraining, the two-stage training recipe (pretraining followed by supervised fine-tuning), and contrastive decoding for higher-quality generations.

Contributing

This repository welcomes contributions. Feel free to submit pull requests, create issues, or suggest any enhancements.

Support

If you encounter any issues or need further clarification, please create an issue in the GitHub issue tracker.

License

CM3Leon is open-sourced under the MIT license.

Roadmap

Planned: implement the decoding strategies described in the paper, classifier-free guidance (CFG) and contrastive decoding TopK (CD-K).

Classifier-free guidance mixes conditional and unconditional logits:

logits.cond = T(t_y | t_x)
logits.uncond = T(t_y | <mask>)
logits.cf = logits.uncond + a_c * (logits.cond - logits.uncond)

where:

T = the transformer
t_y = the output tokens
t_x = the conditional input text
<mask> = no input text, replaced with a mask token
a_c = the guidance scaling factor

Contrastive decoding TopK restricts sampling to the candidate set of tokens whose probability is within a factor a of the most likely token:

V(t_y<i) = { t_yi in V : P_exp(t_yi | t_y<i) >= a * max_w P_exp(w | t_y<i) }
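
A minimal sketch of how these formulas might be applied at a single decoding step; the tensors below are random stand-ins for the conditional and unconditional logits a transformer would produce, and the hyperparameter values are illustrative, not the paper's.

import torch

def cfg_logits(logits_cond, logits_uncond, a_c):
    # logits.cf = logits.uncond + a_c * (logits.cond - logits.uncond)
    return logits_uncond + a_c * (logits_cond - logits_uncond)

def plausible_mask(probs, a):
    # V(t_y<i): keep tokens whose probability is at least a fraction `a`
    # of the most probable token's probability
    return probs >= a * probs.max(dim=-1, keepdim=True).values

# stand-ins for T(t_y | t_x) and T(t_y | <mask>) over a 20,000-token vocabulary
logits_cond = torch.randn(1, 20000)
logits_uncond = torch.randn(1, 20000)

guided = cfg_logits(logits_cond, logits_uncond, a_c=3.0)          # illustrative guidance scale
mask = plausible_mask(torch.softmax(guided, dim=-1), a=0.1)       # illustrative cutoff
guided = guided.masked_fill(~mask, float("-inf"))
next_token = torch.multinomial(torch.softmax(guided, dim=-1), 1)  # sample the next token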

Citation