StitchDiffusion (Keep Update)

Customizing 360-Degree Panoramas through Text-to-Image Diffusion Models
Hai Wang, Xiaoyu Xiang, Yuchen Fan, Jing-Hao Xue

Project arXiv

Data | Pretrained Model | 8K Data

The Colab notebook was implemented by @lshus.

StitchDiffusion Code

StitchDiffusion is a tailored generation (denoising) process for synthesizing 360-degree panoramas; we provide its core code here.

## following MultiDiffusion: https://github.com/omerbt/MultiDiffusion/blob/master/panorama.py ##
## the window size is changed for 360-degree panorama generation ##
def get_views(panorama_height, panorama_width, window_size=[64,128], stride=16):
    panorama_height /= 8
    panorama_width /= 8
    num_blocks_height = (panorama_height - window_size[0]) // stride + 1
    num_blocks_width = (panorama_width - window_size[1]) // stride + 1
    total_num_blocks = int(num_blocks_height * num_blocks_width)
    views = []
    for i in range(total_num_blocks):
        h_start = int((i // num_blocks_width) * stride)
        h_end = h_start + window_size[0]
        w_start = int((i % num_blocks_width) * stride)
        w_end = w_start + window_size[1]
        views.append((h_start, h_end, w_start, w_end))
    return views
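# For height = 512 and width = 2048 (latent size 64x256), get_views(512, 2048)
# returns 9 windows of latent size 64x128, sliding along the width with stride 16.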
#####################
## StitchDiffusion ##
#####################

views_t = get_views(height, width) # height = 512; width = 4*height = 2048
count_t = torch.zeros_like(latents)
value_t = torch.zeros_like(latents)
# latents are sampled from a standard normal distribution (torch.randn()) with size Bx4x64x256,
# where B denotes the batch size.
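# e.g., an illustrative initialization (not part of the original snippet):
# latents = torch.randn((B, 4, height // 8, width // 8), device=device, dtype=text_embeddings.dtype)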

for i, t in enumerate(tqdm(timesteps)):

    count_t.zero_()
    value_t.zero_()

    # initialize latent_view_t with the correct shape; cloned so that the in-place
    # assignments below do not write back into latents through a view
    latent_view_t = latents[:, :, :, 64:192].clone()

    #### pre-denoising operations twice on the stitch block ####
    for ii_md in range(2):

        latent_view_t[:, :, :, 0:64] = latents[:, :, :, 192:256] # left part of the stitch block
        latent_view_t[:, :, :, 64:128] = latents[:, :, :, 0:64] # right part of the stitch block

        # expand the latents if we are doing classifier free guidance
        latent_model_input = latent_view_t.repeat((2, 1, 1, 1))

        # predict the noise residual
        noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)['sample']

        # perform guidance
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

        # compute the denoising step with the reference (customized) model
        latent_view_denoised = self.scheduler.step(noise_pred, t, latent_view_t)['prev_sample']

        value_t[:, :, :, 192:256] += latent_view_denoised[:, :, :, 0:64]
        count_t[:, :, :, 192:256] += 1

        value_t[:, :, :, 0:64] += latent_view_denoised[:, :, :, 64:128]
        count_t[:, :, :, 0:64] += 1

    # the same denoising operations as in MultiDiffusion
    for h_start, h_end, w_start, w_end in views_t:

        latent_view_t = latents[:, :, h_start:h_end, w_start:w_end]
    
        # expand the latents if we are doing classifier free guidance
        latent_model_input = latent_view_t.repeat((2, 1, 1, 1))
        latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

        # predict the noise residual
        noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)['sample']

        # perform guidance
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

        # compute the denoising step with the reference (customized) model
        latent_view_denoised = self.scheduler.step(noise_pred, t, latent_view_t)['prev_sample']
        value_t[:, :, h_start:h_end, w_start:w_end] += latent_view_denoised
        count_t[:, :, h_start:h_end, w_start:w_end] += 1

    latents = torch.where(count_t > 0, value_t / count_t, value_t)

latents = 1 / 0.18215 * latents
image = self.vae.decode(latents).sample
image = (image / 2 + 0.5).clamp(0, 1)


#### global cropping operation ####
# keep the central 1024 columns of the 2048-wide image as the final 512x1024 panorama
image = image[:, :, :, 512:1536]
image = image.cpu().permute(0, 2, 3, 1).float().numpy()
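
As a usage note (not part of the released code), the resulting array can be converted to an 8-bit image and written to disk; this minimal sketch assumes a batch size of 1, that NumPy and Pillow are available, and an arbitrary file name.

import numpy as np
from PIL import Image

# `image` is a (B, H, W, 3) array with values in [0, 1] after the permutation above
panorama = (image[0] * 255).round().astype(np.uint8)
Image.fromarray(panorama).save("panorama.png")  # final 512x1024 panorama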

Useful Tools

360 panoramic images viewer: It can be used to view the synthesized 360-degree panoramas.

Seamless Texture Checker: It can be used to check the continuity between the leftmost and rightmost sides of the generated image (a simple numerical check along the same lines is sketched after this list).

clip-interrogator: It provides a Google Colab notebook of BLIP for generating text prompts.

CLIP: It provides a Google Colab notebook for calculating the CLIP score.

FID: It provides a Google Colab notebook for calculating FID.
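
As a rough, self-contained alternative to the online checker above (an illustrative sketch, not part of the released code; it assumes the panorama was saved as panorama.png), the leftmost and rightmost pixel columns can also be compared directly:

import numpy as np
from PIL import Image

pano = np.asarray(Image.open("panorama.png"), dtype=np.float32) / 255.0
left_edge = pano[:, 0, :]    # leftmost pixel column, shape (H, 3)
right_edge = pano[:, -1, :]  # rightmost pixel column, shape (H, 3)
seam_error = np.abs(left_edge - right_edge).mean()
print(f"mean absolute difference across the wrap-around seam: {seam_error:.4f}")

A low value indicates that the left and right borders wrap around seamlessly.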

Statement

This research was done by Hai Wang at University College London. The code and released models are owned by Hai Wang.

Citation

If you find the code helpful in your research or work, please cite our paper:

@inproceedings{wang2024customizing,
  title={Customizing 360-Degree Panoramas through Text-to-Image Diffusion Models},
  author={Wang, Hai and Xiang, Xiaoyu and Fan, Yuchen and Xue, Jing-Hao},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={4933--4943},
  year={2024}
}

Acknowledgments

We thank the authors of MultiDiffusion; our work builds on their excellent code.