High-Fidelity Human Avatars from a Single RGB Camera
Project Page | Paper | Supp
News
- The previous version had a problem with pose initialization that caused poor texture quality. The code has been updated, and this problem should now be fixed.
Installation
conda create -n Avatar python==3.6.8
conda activate Avatar
conda install pytorch==1.7.0 torchvision==0.8.0 cudatoolkit=11.0 -c pytorch
# or, for CUDA 10.2:
conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
wget https://github.com/facebookresearch/pytorch3d/archive/refs/tags/v0.4.0.zip
unzip v0.4.0.zip
cd pytorch3d-0.4.0
pip install -e .
cd ..
cd thirdparty/neural_renderer_pytorch
python setup.py install
Please make sure your gcc version is greater than 7.5.
Download the asset files from here, unzip them, and move them to the assets folder.
Download the pre-trained model from here, unzip it, and move it to the checkpoints folder.
In addition, we adopt the pose initialization of octopus. Because octopus uses a different deep learning framework from our work, you need to create a separate conda environment for it.
conda create -n octopus python==2.7
conda install tensorflow-gpu=1.13.1 keras=2.2.4 cudatoolkit=10.0
The octopus environment requires dirt. We recommend installing dirt as follows:
cd dirt
mkdir build ; cd build
cmake ../csrc
make
cd ..
pip install -e .
You also need to adjust the arch parameter to match your graphics card; otherwise the compilation may fail.
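To choose a value for arch, it can help to look up the compute capability of your GPU. The sketch below is not part of dirt or octopus. It assumes that arch takes the usual nvcc form (for example sm_61) and that it is run in the Avatar environment, where PyTorch is installed. The helper names arch_flag and detect_arch are hypothetical.

```python
def arch_flag(major, minor):
    """Format a CUDA compute capability as an nvcc-style arch string."""
    return "sm_{}{}".format(major, minor)


def detect_arch():
    """Return the arch string for GPU 0, or None if it cannot be queried."""
    try:
        import torch  # only available in the PyTorch (Avatar) environment
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return arch_flag(major, minor)
```

Calling detect_arch() in the Avatar environment prints nothing by itself; print its return value and compare it with the arch setting used when building dirt.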
Data Preparation
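All frames are resized to 1024x1024. We pad the input frame to a square before resizing:
import cv2
import numpy as np

def fill_frame(frame):
    # Pad the shorter side with black borders to make the frame square, then resize.
    h, w = frame.shape[0], frame.shape[1]
    if h > w:
        # Portrait: pad left and right.
        _pad = np.zeros([h, int((h - w) / 2), 3])
        frame = np.concatenate([_pad, frame, _pad], axis=1)
    elif h < w:
        # Landscape: pad top and bottom, then swap the two image axes
        # (the transpose is assumed to belong to this branch).
        _pad = np.zeros([int((w - h) / 2), w, 3])
        frame = np.concatenate([_pad, frame, _pad], axis=0)
        frame = np.transpose(frame, [1, 0, 2])
    frame = cv2.resize(frame, (1024, 1024))
    return frame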
The person should not occupy too small a portion of the image. If you cannot guarantee this during recording, crop the image before filling the frame. We crop the frame by:
def crop_frame(frame, bounding_box):
    # bounding_box['y_min'], bounding_box['y_max'], bounding_box['x_min'], bounding_box['x_max'] are the top, bottom, left, and right positions of the person over the whole video.
    h, w = frame.shape[0], frame.shape[1]
    m_pixel = 30  # margin in pixels added on every side
    # Use local variables so that a bounding_box shared across frames is not enlarged on every call.
    y_min = max(bounding_box['y_min'] - m_pixel, 0)
    y_max = min(bounding_box['y_max'] + m_pixel, h)
    x_min = max(bounding_box['x_min'] - m_pixel, 0)
    x_max = min(bounding_box['x_max'] + m_pixel, w)
    frame = frame[y_min:y_max, x_min:x_max]
    return frame
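crop_frame expects a single bounding box that covers the person over the whole video. One possible way to obtain it, sketched below, is to take the union of the foreground extents over per-frame binary masks, for example from a preliminary segmentation pass on the raw frames. The function name video_bounding_box and the mask format (2D arrays, nonzero where the person is) are assumptions for illustration and are not part of the released code.

```python
import numpy as np


def video_bounding_box(masks):
    """Union bounding box of the foreground over an iterable of 2D masks.

    Returns a dict with y_min, y_max, x_min, x_max (max values are exclusive,
    so they can be used directly as slice bounds), or None if every mask is empty.
    """
    box = None
    for mask in masks:
        ys, xs = np.nonzero(mask)
        if ys.size == 0:
            continue  # no foreground pixels in this frame
        cur = {
            'y_min': int(ys.min()), 'y_max': int(ys.max()) + 1,
            'x_min': int(xs.min()), 'x_max': int(xs.max()) + 1,
        }
        if box is None:
            box = cur
        else:
            box['y_min'] = min(box['y_min'], cur['y_min'])
            box['y_max'] = max(box['y_max'], cur['y_max'])
            box['x_min'] = min(box['x_min'], cur['x_min'])
            box['x_max'] = max(box['x_max'], cur['x_max'])
    return box
```

The returned dict uses the same keys as the bounding_box argument of crop_frame, so it can be computed once per video and reused for every frame.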
Next, run OpenPose, PIFuHD, and MODNet to generate the 2D joints, normal maps, and masks needed to train our model. The generated data should be organized as follows:
--data_dir
----frames_mat
------subject_name
----2d_joints
------subject_name
--------json
----mask_mat
------subject_name
----normal
------subject_name
We provide sample data at this link.
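Before training, it may be worth checking that the directory layout matches the tree above. The following is a minimal sketch of such a check; the function name missing_dirs is hypothetical, and it only checks that the directories exist, not what they contain.

```python
from pathlib import Path

# Sub-directories expected under data_dir, taken from the layout above.
EXPECTED_SUBDIRS = ("frames_mat", "2d_joints", "mask_mat", "normal")


def missing_dirs(data_dir, subject_name):
    """Return the expected directories that do not exist yet."""
    root = Path(data_dir)
    expected = [root / sub / subject_name for sub in EXPECTED_SUBDIRS]
    # OpenPose output is expected in a json folder under the subject.
    expected.append(root / "2d_joints" / subject_name / "json")
    return [str(p) for p in expected if not p.is_dir()]
```

An empty list means that every expected directory is present for the given subject.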
Usage
First, run the pose initialization:
cd thirdparty/octopus
python _infer_single.py --root_dir $data_dir --name $subject_name
Then, generate the initial geometry:
python dynamic_offsets_runner.py --root_dir $data_dir --name $subject_name --device_id $device_id
Finally, generate the texture map:
python texture_generation.py --root_dir $data_dir --name $subject_name --device_id $device_id
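The three stages can also be chained from a small driver script. The sketch below is an illustration, not part of the released code. It assumes that the last two scripts are run from the repository root, that the conda environments are named octopus and Avatar as in the installation steps, and that conda run is available in your conda version. The function names build_steps and run_pipeline are hypothetical.

```python
import os
import subprocess


def build_steps(data_dir, subject_name, device_id=0):
    """Return (conda_env, working_dir, command) tuples for the three stages."""
    # Use an absolute path because the first stage runs from another directory.
    root = os.path.abspath(str(data_dir))
    common = ["--root_dir", root, "--name", str(subject_name)]
    device = ["--device_id", str(device_id)]
    return [
        ("octopus", "thirdparty/octopus", ["python", "_infer_single.py"] + common),
        ("Avatar", ".", ["python", "dynamic_offsets_runner.py"] + common + device),
        ("Avatar", ".", ["python", "texture_generation.py"] + common + device),
    ]


def run_pipeline(data_dir, subject_name, device_id=0):
    """Run the stages in order, stopping at the first failure."""
    for env, cwd, cmd in build_steps(data_dir, subject_name, device_id):
        subprocess.run(["conda", "run", "-n", env] + cmd, cwd=cwd, check=True)
```

Call run_pipeline(data_dir, subject_name, device_id) from the repository root; check=True makes a failing stage raise an exception instead of silently continuing to the next one.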
Citation
If you find our work useful in your research, please consider citing:
@inproceedings{zhao2022avatar,
author = {Hao Zhao and Jinsong Zhang and Yu-Kun Lai and Zerong Zheng and Yingdi Xie and Yebin Liu and Kun Li},
title = {High-Fidelity Human Avatars from a Single RGB Camera},
booktitle = {CVPR},
year={2022},
}
Acknowledgement
We borrow some code from NeuralTexture and LWG. Thanks for their great contributions.