# Text2Cinemagraph

**[Paper](https://arxiv.org/abs/2307.03190)** | **[Website](https://text2cinemagraph.github.io/website/)**


This is the official PyTorch implementation of "Text-Guided Synthesis of Eulerian Cinemagraphs".

<br> <div class="gif"> <p align="center"> <img src='assets/demo.gif' align="center"> </p> </div>

## Method Details

<br> <div class="gif"> <p align="center"> <img src='assets/method_recording.gif' align="center"> </p> </div>

We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions - an especially challenging task when prompts feature imaginary elements and artistic styles, given the complexity of interpreting the semantics and motions of such images. We propose the idea of synthesizing image twins from a single text prompt using Stable Diffusion: a pair consisting of an artistic image and its pixel-aligned, natural-looking counterpart. While the artistic image depicts the style and appearance described in the text prompt, the realistic counterpart greatly simplifies layout and motion analysis. Leveraging existing natural image and video datasets, we accurately segment the realistic image and predict plausible motion given the semantic information. The predicted motion is then transferred to the artistic image to create the final cinemagraph.
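As a rough illustration of the twin idea (not the repository's actual implementation, which shares diffusion features in the spirit of plug-and-play to keep the pair pixel-aligned), reusing the same initial latents for an artistic prompt and its plain-language counterpart already yields a loosely aligned pair. The prompts and model ID below are only examples:

```python
# Simplified sketch of "image twins": same initial latents, two prompts.
# This is NOT the repo's pipeline; the actual method shares diffusion features
# (plug-and-play style) so the twins become pixel-aligned.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator("cuda").manual_seed(0)
latents = torch.randn(
    (1, pipe.unet.config.in_channels, 64, 64),
    generator=generator, device="cuda", dtype=torch.float16,
)

artistic_prompt = "a large waterfall in the style of an oil painting"  # example prompt
natural_prompt = "a large waterfall"                                   # realistic twin's prompt

artistic = pipe(artistic_prompt, latents=latents).images[0]
natural = pipe(natural_prompt, latents=latents.clone()).images[0]
artistic.save("artistic_twin.png")
natural.save("natural_twin.png")
```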

## Getting Started

### Environment Setup

### Download Pretrained Models

### Inference (Artistic Domain)

<img src="assets/video.gif" width="256" /><img src="assets/video2.gif" width="256" /><img src="assets/video3.gif" width="256" />
<img src="assets/cap1.png" width="256" /><img src="assets/cap2.png" width="256" /><img src="assets/cap3.png" width="256" />
<img src="assets/control1.gif" width="400" /><img src="assets/control2.gif" width="400" />
<img src="assets/caption1.png" width="400" /><img src="assets/caption2.png" width="400" />
| Artistic Image (s1) | Natural Image (s2) | ODISE Mask (s3) |
| :---: | :---: | :---: |
| <img src="assets/0.png" width="256" /> | <img src="assets/sample_0.png" width="256" /> | <img src="assets/mask_odise.png" width="256" /> |
| **Self-Attention Mask (s4)** | **Optical Flow (s5)** | **Cinemagraph (s6)** |
| <img src="assets/mask_self_attn_erosion.png" width="256" /> | <img src="assets/synthesized_flow.jpg" width="256" /> | <img src="assets/video.gif" width="256" /> |

**Tips and Tricks for achieving better results (Artistic Domain)**<br><br>If you do not achieve the desired results, adjust the parameters in inference.yaml or inference_directional.yaml.
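If you want to sweep such settings from a script rather than editing the files by hand, here is a minimal sketch, assuming the configs are plain YAML and using a made-up key name as a placeholder:

```python
# Minimal sketch: load a config, override one value, and write it back.
# "example_param" is a placeholder, not necessarily a real option in inference.yaml.
import yaml

with open("inference.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["example_param"] = 0.5   # substitute the parameter you actually want to change

with open("inference.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```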

## Data Preparation for Training

### Optical Flow and Videos

### Masks (ODISE)

### Text Guided Direction Control

### Artistic Domain Prompts

## Training

### Optical Flow Prediction
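As a generic illustration of how a flow-prediction network can be supervised (an assumption for exposition, not the repository's exact training objective), one common choice is a masked endpoint-error loss against ground-truth flow estimated from the training videos:

```python
# Illustrative sketch only: penalize the per-pixel endpoint error between the
# predicted and ground-truth flow, restricted to the moving-region mask.
import torch

def masked_epe_loss(pred_flow, gt_flow, mask):
    """pred_flow, gt_flow: (B,2,H,W); mask: (B,1,H,W) with 1 = moving region."""
    epe = torch.norm(pred_flow - gt_flow, dim=1, keepdim=True)   # per-pixel endpoint error
    return (epe * mask).sum() / mask.sum().clamp(min=1.0)

pred = torch.randn(2, 2, 64, 64)
gt = torch.randn(2, 2, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.5).float()
print(masked_epe_loss(pred, gt, mask).item())
```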

### Optical Flow Prediction (for text guidance direction)

### Video Generation
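In the Eulerian formulation, a single constant flow field drives every frame: the displacement at time t is the flow integrated (here, simply scaled) up to t, and frames warped forward from the start are blended with frames warped from the loop's end so the video loops seamlessly. The sketch below is a simplified stand-in that uses plain backward warping and a crossfade; the actual implementation builds on symmetric softmax-splatting (forward warping).

```python
# Simplified sketch of Eulerian animation (not the repo's symmetric splatting):
# one constant flow field, scaled over time, warps the single input frame into a loop.
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Backward-warp `image` (1,C,H,W) by `flow` (1,2,H,W) given in pixels."""
    _, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), 0).float().unsqueeze(0) + flow
    gx = 2 * base[:, 0] / (w - 1) - 1
    gy = 2 * base[:, 1] / (h - 1) - 1
    return F.grid_sample(image, torch.stack((gx, gy), -1), align_corners=True)

frame0 = torch.rand(1, 3, 256, 256)           # the single input image
flow = torch.zeros(1, 2, 256, 256)
flow[:, 1] = -2.0                             # constant (Eulerian) flow: 2 px/frame upward
n_frames = 60

frames = []
for t in range(n_frames):
    alpha = t / n_frames
    fwd = warp(frame0, flow * t)              # displaced forward from frame 0
    bwd = warp(frame0, flow * (t - n_frames)) # displaced backward from the loop end
    frames.append((1 - alpha) * fwd + alpha * bwd)   # crossfade -> seamless loop
```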

## Evaluation (Real Domain)

### Generate Results

### Compute FVD on Real Domain Results

The code for FVD computation has been taken from StyleGAN-V.
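For reference, FVD is a Fréchet distance between Gaussian fits of video features (the StyleGAN-V code extracts them with an I3D network). A minimal sketch of the distance itself, with the feature extraction omitted and random features as stand-ins:

```python
# Minimal sketch of the Frechet distance underlying FVD, given per-video feature
# vectors from some backbone (e.g. I3D); this is not the StyleGAN-V implementation.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """feats_*: (N, D) arrays of video features."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real              # drop tiny imaginary parts from sqrtm
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

real = np.random.randn(256, 64)             # stand-ins for extracted video features
fake = np.random.randn(256, 64)
print(frechet_distance(real, fake))
```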

## Citation

@article{mahapatra2023synthesizing,
    title={Text-Guided Synthesis of Eulerian Cinemagraphs},
    author={Mahapatra, Aniruddha and Siarohin, Aliaksandr and Lee, Hsin-Ying and Tulyakov, Sergey and Zhu, Jun-Yan},
    journal={arXiv preprint arXiv:2307.03190},
    year={2023}
}

## Acknowledgments

The code for this project was built using the codebases of pix2pixHD, ODISE, plug-and-play, and SLR-SFS. The symmetric-splatting code was built on top of softmax-splatting. The code for the evaluation metric (FVD) was built on the codebase of StyleGAN-V. We are very thankful to the authors of the corresponding works for releasing their code.

We are also grateful to Nupur Kumari, Gaurav Parmar, Or Patashnik, Songwei Ge, Sheng-Yu Wang, Chonghyuk (Andrew) Song, Daohan (Fred) Lu, Richard Zhang, and Phillip Isola for fruitful discussions. This work is partly supported by Snap Inc. and was partly done while Aniruddha was an intern at Snap Inc.