<h2 align="center"> <a href="">MaxFusion: Plug & Play multimodal generation in text to image diffusion models</a> </h2>
<h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest updates. </h5>


## Applications

<img src="./assets/img1.png" width="100%"> <img src="./assets/img2.png" width="100%">

Keywords: Multimodal generation, Text-to-image generation, Plug-and-play

We propose MaxFusion, a plug-and-play framework for multimodal generation using text-to-image diffusion models. (a) Multimodal generation. We address the problem of conflicting spatial conditioning for text-to-image models. (b) Saliency in variance maps. We discover that the variance maps of different feature layers express the strength of conditioning.
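To make the saliency idea concrete, below is a minimal sketch of variance-based feature merging. This is an illustration under our own assumptions, not the repository's exact implementation: it takes two spatially aligned feature maps from two conditioning branches and keeps, at each location, the features from the branch whose channel-wise variance is higher. The function name `maxfusion_merge` and the toy shapes are hypothetical.

```python
import torch

def maxfusion_merge(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Merge two aligned (B, C, H, W) feature maps from different
    conditioning branches, using per-pixel variance as a saliency proxy."""
    # Channel-wise variance at each spatial location -> (B, 1, H, W)
    var_a = feat_a.var(dim=1, keepdim=True)
    var_b = feat_b.var(dim=1, keepdim=True)
    # At each location, keep the feature vector from the more salient branch
    return torch.where(var_a >= var_b, feat_a, feat_b)

# Toy usage: random tensors standing in for two conditioning branches
a = torch.randn(1, 320, 64, 64)
b = torch.randn(1, 320, 64, 64)
print(maxfusion_merge(a, b).shape)  # torch.Size([1, 320, 64, 64])
```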

<br>

## Contributions

<!-- <p align="center"> <img src="./utils/intropng.png" alt="Centered Image" style="width: 50%;"> </p> -->

## Environment setup

```bash
conda env create -f environment.yml
```
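Then activate the environment before running the demos with `conda activate <env-name>`; the environment name is defined in environment.yml.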

## Code demo

A notebook covering the different demo conditions is provided in `demo.ipynb`.

## Testing on custom datasets

Will be released shortly.

## Instructions for the interactive demo

An interactive demo can be run locally using:

```bash
python gradio_maxfusion.py
```

This codebase builds on:

https://github.com/google/prompt-to-prompt/

## Citation

If you find our work useful, please cite:
```bibtex
@inproceedings{nair2025maxfusion,
  title={Maxfusion: Plug\&play multi-modal generation in text-to-image diffusion models},
  author={Nair, Nithin Gopalakrishnan and Valanarasu, Jeya Maria Jose and Patel, Vishal M},
  booktitle={European Conference on Computer Vision},
  pages={93--110},
  year={2025},
  organization={Springer}
}
```