WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models
<img src="figures/p1.png" width="1000" height="600">Method piepeline
<img src="figures/p2.png" width="1000" height="330">Requirements
diffusers==0.17.0
Dataset and pre-processing
VITON-HD
- Download the VITON-HD dataset
- Clothes feature extraction: run extract_dino_fea.py (see the sketch after the directory tree below)
Once the dataset is downloaded, the folder structure should look like this:
├── VITON-HD
| ├── test_pairs.txt
| ├── train_pairs.txt
│ ├── [train | test]
| | ├── image
│ │ │ ├── [000006_00.jpg | 000008_00.jpg | ...]
│ │ ├── cloth
│ │ │ ├── [000006_00.jpg | 000008_00.jpg | ...]
│ │ ├── cloth-mask
│ │ │ ├── [000006_00.jpg | 000008_00.jpg | ...]
│ │ ├── image-parse-v3
│ │ │ ├── [000006_00.png | 000008_00.png | ...]
│ │ ├── openpose_img
│ │ │ ├── [000006_00_rendered.png | 000008_00_rendered.png | ...]
│ │ ├── openpose_json
│ │ │ ├── [000006_00_keypoints.json | 000008_00_keypoints.json | ...]
│ │ ├── dino_fea
│ │ │ ├── [000006_00.pt | 000008_00.pt | ...]
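For reference, here is a minimal sketch of the kind of per-garment feature extraction extract_dino_fea.py performs. It assumes a DINOv2 ViT-L/14 backbone loaded from torch.hub, a 224x224 input, and the VITON-HD paths above; the actual script may use a different DINO variant, resolution, or output layout.

```python
# Hedged sketch of per-garment DINO feature extraction; backbone, resolution,
# and output layout are assumptions, not the repository's exact implementation.
import os

import torch
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumption: a DINOv2 ViT-L/14 backbone; the real script may use another DINO variant.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # multiple of the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

cloth_dir = "VITON-HD/test/cloth"
fea_dir = "VITON-HD/test/dino_fea"
os.makedirs(fea_dir, exist_ok=True)

with torch.no_grad():
    for name in sorted(os.listdir(cloth_dir)):
        img = preprocess(Image.open(os.path.join(cloth_dir, name)).convert("RGB"))
        out = model.forward_features(img.unsqueeze(0).to(device))
        tokens = out["x_norm_patchtokens"].squeeze(0).cpu()  # [256, 1024] patch tokens
        torch.save(tokens, os.path.join(fea_dir, os.path.splitext(name)[0] + ".pt"))
```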
DressCode
- Download the DressCode dataset
- We found that the warping module performs better when the in-shop garment images have a white background, so we provide pre-extracted masks for removing the background. You can download the masks from the following link: here. After downloading, extract the mask files and place them in the dataset folder alongside the corresponding images (a compositing sketch is given after the directory tree below).
- Clothes feature extraction: run extract_dino_fea.py
Once the dataset is downloaded, the folder structure should look like this:
├── DressCode
| ├── test_pairs_paired.txt
| ├── test_pairs_unpaired.txt
| ├── train_pairs.txt
│ ├── [dresses | lower_body | upper_body]
| | ├── test_pairs_paired.txt
| | ├── test_pairs_unpaired.txt
| | ├── train_pairs.txt
│ │ ├── images
│ │ │ ├── [013563_0.jpg | 013563_1.jpg | 013564_0.jpg | 013564_1.jpg | ...]
│ │ ├── masks
│ │ │ ├── [013563_1.png| 013564_1.png | ...]
│ │ ├── keypoints
│ │ │ ├── [013563_2.json | 013564_2.json | ...]
│ │ ├── label_maps
│ │ │ ├── [013563_4.png | 013564_4.png | ...]
│ │ ├── skeletons
│ │ │ ├── [013563_5.jpg | 013564_5.jpg | ...]
│ │ ├── dense
│ │ │ ├── [013563_5.png | 013563_5_uv.npz | 013564_5.png | 013564_5_uv.npz | ...]
│ │ ├── dino_fea
│ │ │ ├── [013563_1.pt | 013564_1.pt | ...]
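A minimal compositing sketch for the white-background step above, assuming the masks align pixel-wise with the in-shop images; the category and the images_white output folder are placeholders.

```python
# Sketch: use the provided masks to place in-shop garments on a white background.
# File names follow the DressCode tree above; output folder is a placeholder.
import os

import numpy as np
from PIL import Image

cat_dir = "DressCode/upper_body"                  # also: dresses, lower_body
img_dir = os.path.join(cat_dir, "images")
mask_dir = os.path.join(cat_dir, "masks")
out_dir = os.path.join(cat_dir, "images_white")   # hypothetical output folder
os.makedirs(out_dir, exist_ok=True)

for mask_name in sorted(os.listdir(mask_dir)):    # e.g. 013563_1.png
    stem = os.path.splitext(mask_name)[0]
    cloth = np.asarray(Image.open(os.path.join(img_dir, stem + ".jpg")).convert("RGB"), dtype=np.float32)
    mask = np.asarray(Image.open(os.path.join(mask_dir, mask_name)).convert("L"), dtype=np.float32) / 255.0
    mask = mask[..., None]                        # HxWx1 for broadcasting
    composed = cloth * mask + 255.0 * (1.0 - mask)  # keep garment, whiten background
    Image.fromarray(composed.astype(np.uint8)).save(os.path.join(out_dir, stem + ".jpg"))
```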
VVT Dataset
- Download the VVT dataset (ask the authors for access)
- Pose estimation (see the OpenPose sketch after the directory tree below)
- Feature extraction: run extract_dino_fea_vtt.py
├── VVT
| ├── test_pairs.txt
| ├── train_pairs.txt
| ├── clothes_person
│ │ ├── dino_fea
│ │ ├── img
| ├── train_frames
│ │ ├── [4be21d0a1-n11 | 4be21d09i-k11 | ...]
| ├── train_frames_parsing
│ │ ├── [4be21d0a1-n11 | 4be21d09i-k11 | ...]
| ├── train_openpose_img
│ │ ├── [4be21d0a1-n11 | 4be21d09i-k11 | ...]
| ├── train_openpose_json
│ │ ├── [4be21d0a1-n11 | 4be21d09i-k11 | ...]
| ├── test_frames
│ │ ├── [4he21d00f-g11 | 4he21d00f-k11 | ...]
| ├── test_frames_parsing
│ │ ├── [4he21d00f-g11 | 4he21d00f-k11 | ...]
| ├── test_openpose_img
│ │ ├── [4he21d00f-g11 | 4he21d00f-k11 | ...]
| ├── test_openpose_json
│ │ ├── [4he21d00f-g11 | 4he21d00f-k11 | ...]
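The pose-estimation step can be scripted over the per-sequence frame folders. Below is a hedged sketch that drives the standard OpenPose demo binary via subprocess to fill the *_openpose_img and *_openpose_json folders expected above; the binary path is a placeholder.

```python
# Sketch: run OpenPose over every VVT sequence to produce rendered skeletons
# and per-frame keypoint JSON files (the binary path is a placeholder).
import os
import subprocess

OPENPOSE_BIN = "/path/to/openpose/build/examples/openpose/openpose.bin"  # placeholder

frames_root = "VVT/test_frames"
img_out_root = "VVT/test_openpose_img"
json_out_root = "VVT/test_openpose_json"

for seq in sorted(os.listdir(frames_root)):       # e.g. 4he21d00f-g11
    img_out = os.path.join(img_out_root, seq)
    json_out = os.path.join(json_out_root, seq)
    os.makedirs(img_out, exist_ok=True)
    os.makedirs(json_out, exist_ok=True)
    subprocess.run([
        OPENPOSE_BIN,
        "--image_dir", os.path.join(frames_root, seq),
        "--write_images", img_out,                # rendered skeleton images
        "--write_json", json_out,                 # per-frame keypoints
        "--display", "0",
        "--render_pose", "1",
    ], check=True)
```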
TikTok Dataset
- Download the TikTok Dataset
- Pre-processing
- OpenPose for pose estimation
- Human parsing with SCHP (see the sketch after the directory tree below)
- Feature extraction: run extract_dino_fea.py
├── TikTok
| ├── test_pairs.txt
| ├── train_pairs.txt
| ├── cloth
│ │ ├── [00008.png | 00009.png | ...]
│ ├── dino_fea
│ │ ├── [00008.pt | 00009.pt | ...]
| ├── openpose_img
│ │ ├── [00001 | 00002 | ...]
| ├── openpose_json
│ │ ├── [00001 | 00002 | ...]
| ├── parse
│ │ ├── [00001 | 00002 | ...]
| ├── parse_lip
│ │ ├── [00001 | 00002 | ...]
| ├── TikTok_dataset
│ │ ├── [00001 | 00002 | ...]
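For the parsing step, here is a hedged sketch that runs the Self-Correction-Human-Parsing (SCHP) simple_extractor.py over every TikTok sequence to produce the parse_lip folders; the checkpoint path and frame sub-folder layout are placeholders.

```python
# Sketch: batch SCHP parsing over TikTok sequences. Assumes the SCHP repo's
# simple_extractor.py interface; checkpoint and folder layout are placeholders.
import os
import subprocess

frames_root = "TikTok/TikTok_dataset"
parse_root = "TikTok/parse_lip"

for seq in sorted(os.listdir(frames_root)):       # e.g. 00001, 00002, ...
    out_dir = os.path.join(parse_root, seq)
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run([
        "python", "simple_extractor.py",          # from the SCHP repository
        "--dataset", "lip",
        "--model-restore", "checkpoints/exp-schp-lip.pth",   # placeholder checkpoint
        "--input-dir", os.path.join(frames_root, seq, "images"),
        "--output-dir", out_dir,
    ], check=True)
```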
Pretrained model preparation
- Download Stable Diffusion 1.5
- Download the MAE pretrained model
Trained models
unet
- model_VTT_192_256_1030_fixbug_long: VVT
- model_TikTok_512_fixbug_1109_lip: TikTok
- model_VITON_512_DINO_large_large_TikTok2: VITON
vae
- HR_VITON_vae: general-purpose VAE fine-tuned with EMASC
- model_VTT_vae: VAE for the VVT dataset
Training
Train the network for VITON-HD and DressCode
accelerate launch --mixed_precision="fp16" anydoor_train.py
Train the network for TikTok (only the dataset differs)
accelerate launch --mixed_precision="fp16" anydoor_train_TikTok.py
Train VAE
Train the VAE using EMASC from LaDI-VTON
train_cloth_vae_agnostic.py
train_cloth_vae_agnostic_TikTok.py
Inference
Image-level
Set the dataset and model in config.py, then run
python infer.py
Video-level
Note: adjust the starting position in the dataset.
python infer_video.py: basic video inference; works for TikTok, VVT, and wild videos.
Note: VTTDataset's set_group is used to select which source data to use.
python infer_video_vtt_list: for VVT, prediction over multiple cases.
python infer_video_mae_guided.py: adds MAE guidance.
python infer_video_guided.py: adds CLIP guidance, based on this clip guiding example.
python infer_video_mae_clip_guided.py: MAE and similarity guidance.
Evaluate
FID and KID for the image try-on network. Note that the ground-truth images must be resized to match the prediction size (see the resizing sketch after the command below).
fidelity --input1 VITON_test_unpaired --input2 /data1/hzj/zalando-hd-resized/test/image_512/ -g 1 -f -k
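A small resizing sketch for the ground-truth folder used in the command above; the source path and the 384x512 target are examples and should match your prediction resolution.

```python
# Sketch: resize ground-truth images to the prediction resolution before
# computing FID/KID with torch-fidelity (paths and target size are examples).
import os

from PIL import Image

src = "/data1/hzj/zalando-hd-resized/test/image"       # assumed ground-truth folder
dst = "/data1/hzj/zalando-hd-resized/test/image_512"   # folder used by the command above
os.makedirs(dst, exist_ok=True)

for name in sorted(os.listdir(src)):
    img = Image.open(os.path.join(src, name)).convert("RGB")
    img.resize((384, 512), Image.BICUBIC).save(os.path.join(dst, name))
```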
Other Notes
When training with anydoor_train, loading our own checkpoint does not require deleting the conv layer.
wild_config.py and WildVideoDataset.py perform try-on directly on in-the-wild videos.
The pipeline should use pcm rather than paser_upper_mask, since pcm zeroes out a smaller region. One unresolved issue: agnostic = agnostic * (1 - pcm) + pcm * torch.zeros_like(agnostic) appears to cause a data leak; the exact cause is unknown (our guess is that the gray mask values are not exactly 0), since the visualizations look identical. A possible workaround is sketched below.
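One possible workaround, consistent with the guess about gray mask values, is to binarize pcm before compositing so that the masked region is exactly zero. This is an untested sketch, not the current pipeline behaviour.

```python
# Untested sketch: force pcm to a hard 0/1 mask so the garment region of the
# agnostic image is exactly zero (addresses the "gray mask values" guess above).
import torch

def make_agnostic(agnostic: torch.Tensor, pcm: torch.Tensor, thresh: float = 0.5) -> torch.Tensor:
    pcm_bin = (pcm > thresh).float()    # values become exactly 0 or 1
    return agnostic * (1.0 - pcm_bin)   # same compositing as above, but with a hard mask
```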
Demo
<img src="figures/case1.gif" width="576" height="304"> <img src="figures/case2.gif" width="576" height="304">
Citation
If you find this code helpful for your research, please cite:
@article{he2024wildvidfit,
title={WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models},
author={He, Zijian and Chen, Peixin and Wang, Guangrun and Li, Guanbin and Torr, Philip HS and Lin, Liang},
journal={arXiv preprint arXiv:2407.10625},
year={2024}
}