Home

Awesome

TaskMatrix

TaskMatrix connects ChatGPT and a series of Visual Foundation Models to enable sending and receiving images during chatting.

See our paper: <font size=5>Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models</font>

<a src="https://img.shields.io/badge/%F0%9F%A4%97-Open%20in%20Spaces-blue" href="https://huggingface.co/spaces/microsoft/visual_chatgpt"> <img src="https://img.shields.io/badge/%F0%9F%A4%97-Open%20in%20Spaces-blue" alt="Open in Spaces"> </a> <a src="https://colab.research.google.com/assets/colab-badge.svg" href="https://colab.research.google.com/drive/1P3jJqKEWEaeNcZg8fODbbWeQ3gxOHk2-?usp=sharing"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"> </a>

Updates:

Insight & Goal:

On the one hand, ChatGPT (or LLMs) serves as a general interface that provides a broad and diverse understanding of a wide range of topics. On the other hand, Foundation Models serve as domain experts by providing deep knowledge in specific domains. By leveraging both general and deep knowledge, we aim at building an AI that is capable of handling various tasks.

Demo

<img src="./assets/demo_short.gif" width="750">

System Architecture

<p align="center"><img src="./assets/figure.jpg" alt="Logo"></p>

Quick Start

# clone the repo
git clone https://github.com/microsoft/TaskMatrix.git

# Go to directory
cd visual-chatgpt

# create a new environment
conda create -n visgpt python=3.8

# activate the new environment
conda activate visgpt

#  prepare the basic environments
pip install -r requirements.txt
pip install  git+https://github.com/IDEA-Research/GroundingDINO.git
pip install  git+https://github.com/facebookresearch/segment-anything.git

# prepare your private OpenAI key (for Linux)
export OPENAI_API_KEY={Your_Private_Openai_Key}

# prepare your private OpenAI key (for Windows)
set OPENAI_API_KEY={Your_Private_Openai_Key}

# Start TaskMatrix !
# You can specify the GPU/CPU assignment by "--load", the parameter indicates which 
# Visual Foundation Model to use and where it will be loaded to
# The model and device are separated by underline '_', the different models are separated by comma ','
# The available Visual Foundation Models can be found in the following table
# For example, if you want to load ImageCaptioning to cpu and Text2Image to cuda:0
# You can use: "ImageCaptioning_cpu,Text2Image_cuda:0"

# Advice for CPU Users
python visual_chatgpt.py --load ImageCaptioning_cpu,Text2Image_cpu

# Advice for 1 Tesla T4 15GB  (Google Colab)                       
python visual_chatgpt.py --load "ImageCaptioning_cuda:0,Text2Image_cuda:0"
                                
# Advice for 4 Tesla V100 32GB                            
python visual_chatgpt.py --load "Text2Box_cuda:0,Segmenting_cuda:0,
    Inpainting_cuda:0,ImageCaptioning_cuda:0,
    Text2Image_cuda:1,Image2Canny_cpu,CannyText2Image_cuda:1,
    Image2Depth_cpu,DepthText2Image_cuda:1,VisualQuestionAnswering_cuda:2,
    InstructPix2Pix_cuda:2,Image2Scribble_cpu,ScribbleText2Image_cuda:2,
    SegText2Image_cuda:2,Image2Pose_cpu,PoseText2Image_cuda:2,
    Image2Hed_cpu,HedText2Image_cuda:3,Image2Normal_cpu,
    NormalText2Image_cuda:3,Image2Line_cpu,LineText2Image_cuda:3"

GPU memory usage

Here we list the GPU memory usage of each visual foundation model, you can specify which one you like:

Foundation ModelGPU Memory (MB)
ImageEditing3981
InstructPix2Pix2827
Text2Image3385
ImageCaptioning1209
Image2Canny0
CannyText2Image3531
Image2Line0
LineText2Image3529
Image2Hed0
HedText2Image3529
Image2Scribble0
ScribbleText2Image3531
Image2Pose0
PoseText2Image3529
Image2Seg919
SegText2Image3529
Image2Depth0
DepthText2Image3531
Image2Normal0
NormalText2Image3529
VisualQuestionAnswering1495

Acknowledgement

We appreciate the open source of the following projects:

Hugging FaceLangChainStable DiffusionControlNetInstructPix2PixCLIPSegBLIP

Contact Information

For help or issues using the TaskMatrix, please submit a GitHub issue.

For other communications, please contact Chenfei WU (chewu@microsoft.com) or Nan DUAN (nanduan@microsoft.com).

Trademark Notice

Trademarks This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.

Disclaimer

The recommended models in this Repo are just examples, used for scientific research exploring the concept of task automation and benchmarking with the paper published at Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. Users can replace the models in this Repo according to their research needs. When using the recommended models in this Repo, you need to comply with the licenses of these models respectively. Microsoft shall not be held liable for any infringement of third-party rights resulting from your usage of this repo. Users agree to defend, indemnify and hold Microsoft harmless from and against all damages, costs, and attorneys' fees in connection with any claims arising from this Repo. If anyone believes that this Repo infringes on your rights, please notify the project owner email.