Consistent Human Image and Video Generation with Spatially Conditioned Diffusion
Mingdeng Cao, Chong Mou, Ziyang Yuan, Xintao Wang, Zhaoyang Zhang, Ying Shan, Yinqiang Zheng
<div align="center"> <img src="assets/teaser.png"> </div>In this paper, we explore the generation of consistent human-centric visual content through a spatial conditioning strategy. We frame consistent reference-based controllable human image and video synthesis as a spatial inpainting task, where the desired content is spatially inpainted under the conditioning of a reference human image. Additionally, we propose a causal spatial conditioning strategy that constrains the interaction between reference and target features causally, thereby preserving the appearance information of the reference images for enhanced consistency. By leveraging the inherent capabilities of the denoising network for appearance detail extraction and conditioned generation, our approach is both straightforward and effective in maintaining fine-grained appearance details and the identity of the reference human image.
Main Architecture
Core idea: utilize the denoising U-Net for both reference feature extraction and target image synthesis to ensure content consistency.
<div align="center"> <img src="assets/main_arch.png"> </div>Results of Human Animation
Results of Human Animation

The model is trained on the TikTok dataset (350 videos), UBCFashion (500 videos), and a self-gathered dance video dataset (3,500 dance videos featuring about 200 humans).
<div align="center"> <img src="assets/demo.gif"> </div>More Applications
Our method can also be applied to the visual try-on task to generate garment-consistent human images. During training, we only add noise to the garment region in the human image:
<div align="center"> <img width="300" src="assets/train_tryon.png"> </div>Correspondingly, a regional loss is applied to the denoising U-Net's prediction during loss calculation. The results of the model trained on VTON-HD dataset:
<div align="center"> <i>Paired setting</i> <img src="assets/tryon_results_2.png"> </div> <div align="center"> <i>Unpaired setting</i> <img src="assets/tryon_results_1.png"> </div>Results with Diffusion Transformer-based Model
Our proposed method can also be integrated into Diffusion Transformer (DiT)-based models, such as SD3 and FLUX, to enhance synthesis quality. In this setting, the reference image is supplied as additional tokens during training, and the loss is computed only on the noisy (target) tokens, as sketched below.
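A minimal sketch of this token-level conditioning follows; the names and interfaces are illustrative, and the epsilon target stands in for whatever objective the base model actually uses (e.g., the flow-matching target in SD3/FLUX).

```python
import torch
import torch.nn.functional as F

def dit_spatial_conditioning_step(dit, noise_scheduler, ref_tokens, tgt_tokens, t, cond):
    """DiT-style variant: reference-image tokens are concatenated with the noisy
    target tokens along the sequence axis, and the loss covers only the noisy
    (target) tokens. `dit` stands for any transformer mapping (B, N, D) token
    sequences to per-token predictions; this is an illustrative sketch."""
    noise = torch.randn_like(tgt_tokens)
    noisy_tgt = noise_scheduler.add_noise(tgt_tokens, noise, t)

    # Sequence-wise conditioning: [reference tokens | noisy target tokens].
    seq = torch.cat([ref_tokens, noisy_tgt], dim=1)                  # (B, Nr + Nt, D)
    pred = dit(seq, t, cond)                                         # (B, Nr + Nt, D)

    # Restrict the objective to the noisy target tokens.
    tgt_pred = pred[:, ref_tokens.shape[1]:]
    return F.mse_loss(tgt_pred, noise)
```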
<div align="center"> <i>Tryon results of different base models</i> <img src="assets/sd_vs_flux.jpg"> </div>Contact
If you have any comments or questions, please feel free to contact Mingdeng Cao.