High-Fidelity Human Avatars from a Single RGB Camera

Project Page | Paper | Supp

News

Installation

conda create -n Avatar python==3.6.8
conda activate Avatar
conda install pytorch==1.7.0 torchvision==0.8.0 cudatoolkit=11.0 -c pytorch
# or, for CUDA 10.2:
# conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.2 -c pytorch

pip install -r requirements.txt

wget https://github.com/facebookresearch/pytorch3d/archive/refs/tags/v0.4.0.zip
unzip v0.4.0.zip
cd pytorch3d-0.4.0
pip install -e .
cd ..

cd thirdparty/neural_renderer_pytorch
python setup.py install 

Please make sure your gcc version is newer than 7.5.
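The gcc requirement above can be checked before compiling. The following is a minimal sketch, not part of this repository: the helper names are hypothetical, and it assumes that "newer than 7.5" is judged on the major and minor numbers only, so 7.5.0 does not pass.

```python
import re
import subprocess


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))


def gcc_is_new_enough(version_text, minimum=(7, 5)):
    """Compare major.minor against the minimum (assumption: strictly greater)."""
    return parse_version(version_text)[:2] > minimum


def installed_gcc_version():
    """Ask the gcc on PATH for its version string, e.g. "9.4.0"."""
    output = subprocess.check_output(["gcc", "-dumpfullversion", "-dumpversion"])
    return output.decode().strip()
```

Calling gcc_is_new_enough(installed_gcc_version()) inside the Avatar environment tells you whether the compiler on your PATH meets the requirement.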

Download the asset files from here, unzip them, and move them to the assets folder.

Download the pre-trained model from here, unzip it, and move it to the checkpoints folder.

We also adopt the pose initialization of octopus. Because octopus uses a different deep learning framework from ours, you need to create a separate conda environment for it.

conda create -n octopus python==2.7
conda activate octopus
conda install tensorflow-gpu=1.13.1 keras=2.2.4 cudatoolkit=10.0

The octopus environment requires dirt. We recommend installing it as follows:

cd dirt
mkdir build ; cd build
cmake ../csrc
make
cd ..
pip install -e .

You also need to set the arch parameter to match your graphics card; otherwise, compilation may fail.

Data Preparation

All frames are resized to 1024x1024. We pad each input frame to a square before resizing:

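import cv2
import numpy as np

def fill_frame(frame):
    # Pad the shorter side with black borders, then resize to 1024x1024.
    h, w = frame.shape[0], frame.shape[1]
    if h > w:
        # Portrait: add equal padding on the left and right.
        _pad = np.zeros([h, int((h - w) / 2), 3], dtype=frame.dtype)
        frame = np.concatenate([_pad, frame, _pad], axis=1)
    elif h < w:
        # Landscape: add equal padding on the top and bottom.
        _pad = np.zeros([int((w - h) / 2), w, 3], dtype=frame.dtype)
        frame = np.concatenate([_pad, frame, _pad], axis=0)
        # Swap the height and width axes, as in the original code.
        frame = np.transpose(frame, [1, 0, 2])
    frame = cv2.resize(frame, (1024, 1024))
    return frame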

The person should not occupy too small a portion of the image. If you cannot guarantee this during recording, crop the image before filling the frame. We crop the frame by:

def crop_frame(frame, bounding_box):
    # bounding_box['y_min'], bounding_box['y_max'], bounding_box['x_min'], bounding_box['x_max'] are the top, bottom, left, and right extents of the person over the whole video.
    h, w = frame.shape[0], frame.shape[1]

    # Expand the box by a margin, clamped to the frame, without modifying bounding_box.
    m_pixel = 30
    y_min = max(bounding_box['y_min'] - m_pixel, 0)
    y_max = min(bounding_box['y_max'] + m_pixel, h)
    x_min = max(bounding_box['x_min'] - m_pixel, 0)
    x_max = min(bounding_box['x_max'] + m_pixel, w)
    frame = frame[y_min:y_max, x_min:x_max]

    return frame
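The bounding box passed to crop_frame covers the person over the whole video. One possible way to obtain it is sketched below, assuming you already have rough per-frame foreground masks (for example from a preliminary segmentation pass, which this README does not prescribe). The function name video_bounding_box is hypothetical, and the max values are exclusive so that they can be used directly as slice ends.

```python
import numpy as np


def video_bounding_box(masks):
    """Return the union bounding box of the foreground over all masks.

    masks: iterable of 2D arrays in which non-zero pixels mark the person.
    """
    box = None
    for mask in masks:
        ys, xs = np.nonzero(mask)
        if ys.size == 0:
            continue  # no foreground in this frame
        current = {
            'y_min': int(ys.min()), 'y_max': int(ys.max()) + 1,
            'x_min': int(xs.min()), 'x_max': int(xs.max()) + 1,
        }
        if box is None:
            box = current
        else:
            # Grow the running box so that it also covers this frame.
            box['y_min'] = min(box['y_min'], current['y_min'])
            box['y_max'] = max(box['y_max'], current['y_max'])
            box['x_min'] = min(box['x_min'], current['x_min'])
            box['x_max'] = max(box['x_max'], current['x_max'])
    if box is None:
        raise ValueError("no foreground pixel found in any mask")
    return box
```

The returned dictionary uses the same keys that crop_frame reads, so it can be computed once and reused for every frame of the video.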

Next, run OpenPose, PIFuHD, and MODNet to generate the 2D joints, normal maps, and masks, respectively, which are needed to train our model. Organize the generated data as follows:

--data_dir
----frames_mat
------subject_name
----2d_joints
------subject_name
--------json
----mask_mat
------subject_name
----normal
------subject_name

Sample data is available at this link.
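Before training, it can help to confirm that the folders match the layout shown above. The sketch below is not part of this repository; the function name missing_dirs is hypothetical, and it checks only that the directories exist, not their contents.

```python
from pathlib import Path

# Sub-directories expected under data_dir, following the tree above.
EXPECTED_SUBDIRS = (
    "frames_mat/{name}",
    "2d_joints/{name}/json",
    "mask_mat/{name}",
    "normal/{name}",
)


def missing_dirs(data_dir, subject_name):
    """Return the expected sub-directories (relative paths) that do not exist."""
    root = Path(data_dir)
    missing = []
    for template in EXPECTED_SUBDIRS:
        relative = template.format(name=subject_name)
        if not (root / relative).is_dir():
            missing.append(relative)
    return missing
```

An empty list means every expected folder for that subject is present.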

Usage

First, run the pose initialization:

cd thirdparty/octopus 
python _infer_single.py --root_dir $data_dir --name $subject_name

Then, generate the initial geometry:

python dynamic_offsets_runner.py --root_dir $data_dir --name $subject_name --device_id $device_id

Finally, generate the texture map:

python texture_generation.py --root_dir $data_dir --name $subject_name --device_id $device_id
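The three stages above can be collected in a small driver. This is a hedged sketch rather than a script shipped with the repository: build_commands and run_stage are hypothetical helpers, and the sketch assumes that the second and third scripts are launched from the repository root. The first stage still has to run inside the octopus environment, and the other two inside the Avatar environment.

```python
import subprocess


def build_commands(data_dir, subject_name, device_id):
    """Return (working_directory, argv) pairs for the three stages, in order."""
    common = ["--root_dir", str(data_dir), "--name", subject_name]
    device = ["--device_id", str(device_id)]
    return [
        # Stage 1: pose initialization (octopus environment).
        ("thirdparty/octopus", ["python", "_infer_single.py"] + common),
        # Stage 2: initial geometry (Avatar environment).
        (".", ["python", "dynamic_offsets_runner.py"] + common + device),
        # Stage 3: texture map (Avatar environment).
        (".", ["python", "texture_generation.py"] + common + device),
    ]


def run_stage(stage):
    """Run one (working_directory, argv) pair and raise if it fails."""
    cwd, argv = stage
    subprocess.run(argv, cwd=cwd, check=True)
```

Because the stages need different conda environments, call run_stage on one entry at a time from the matching environment instead of looping over all three in a single process.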

Citation

If you find our work useful in your research, please consider citing:

@inproceedings{zhao2022avatar,
  author = {Hao Zhao and Jinsong Zhang and Yu-Kun Lai and Zerong Zheng and Yingdi Xie and Yebin Liu and Kun Li},
  title = {High-Fidelity Human Avatars from a Single RGB Camera},
  booktitle = {CVPR},
  year={2022},
}

Acknowledgement

We borrow some code from NeuralTexture and LWG. Thanks for their great contributions.