Youtube Gesture Dataset

This repository contains scripts to build the Youtube Gesture Dataset. You can download YouTube videos and transcripts, divide the videos into scenes, and extract human poses. Please see the project page and the paper for details.

[Project page] [Paper]

If you have any questions or comments, please feel free to contact me by email (youngwoo@etri.re.kr).

Environment

The scripts are tested on Ubuntu 16.04 LTS and Python 3.5.2.

Dependencies

A step-by-step guide

  1. Set config

    • Update the paths and the YouTube developer key in config.py (the directories will be created if they do not exist).
    • Update the target channel ID. The scripts are tested on the TED and LaughFactory channels.
  2. Execute download_video.py

    • Download YouTube videos, metadata, and subtitles (./videos/*.mp4, *.json, *.vtt).
  3. Execute run_openpose.py

    • Run OpenPose to extract body, hand, and face skeletons for all videos (./skeleton/*.pickle).
  4. Execute run_scenedetect.py

    • Run PySceneDetect to divide videos into scene clips (./clip/*.csv).
  5. Execute run_gentle.py

    • Run Gentle for word-level alignments (./videos/*_align_results.json).
    • Skip this step if you use auto-generated subtitles; it is necessary for the TED Talks channel, which has manual transcripts.
  6. Execute run_clip_filtering.py

    • Remove inappropriate clips.
    • Save clips with body skeletons (./clip/*.json).
  7. (optional) Execute review_filtered_clips.py

    • Review filtering results.
  8. Execute make_ted_dataset.py

    • Do some post-processing and split the data into train, validation, and test sets (./script/*.pickle).
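The steps above can be sketched as a small driver script. The stage script names come from this guide; the runner itself, its skip_gentle option, and the stages_to_run() helper are illustrative additions, not part of the repository:

```python
# Hypothetical driver that runs the stage scripts above in order.
import subprocess
import sys

STAGES = [
    "download_video.py",      # step 2: videos, metadata, subtitles
    "run_openpose.py",        # step 3: body/hand/face skeletons
    "run_scenedetect.py",     # step 4: scene clips
    "run_gentle.py",          # step 5: word-level alignment
    "run_clip_filtering.py",  # step 6: remove inappropriate clips
    "make_ted_dataset.py",    # step 8: post-processing and dataset split
]

def stages_to_run(skip_gentle=False):
    """Return the stage scripts to execute; Gentle is skipped
    when subtitles are auto-generated."""
    return [s for s in STAGES if not (skip_gentle and s == "run_gentle.py")]

def run_pipeline(skip_gentle=False):
    # Each stage is run to completion before the next starts;
    # check=True aborts the pipeline if a stage fails.
    for script in stages_to_run(skip_gentle):
        subprocess.run([sys.executable, script], check=True)
```

The optional review step (review_filtered_clips.py) is interactive, so it is left out of the automated sequence.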

Pre-built TED gesture dataset

Running the whole data collection pipeline is complex and takes several days, so we provide a pre-built dataset for the videos in the TED channel.

Number of videos                    1,766
Average length of videos            12.7 min
Shots of interest                   35,685 (20.2 per video on average)
Ratio of shots of interest          25% (35,685 / 144,302)
Total length of shots of interest   106.1 h
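The derived figures in the table are internally consistent, which a few lines of Python confirm:

```python
# Sanity-check the per-video average and the shots-of-interest ratio
# against the raw counts given in the table.
videos = 1766
shots_of_interest = 35685
total_shots = 144302

shots_per_video = shots_of_interest / videos   # about 20.2
ratio_percent = 100 * shots_of_interest / total_shots  # about 25%

assert round(shots_per_video, 1) == 20.2
assert round(ratio_percent) == 25
```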

Download videos and transcripts

We do not provide the videos and transcripts of the TED talks due to copyright issues. Please download the actual videos and transcripts yourself as follows:

  1. Download the [video_ids.txt] file, which contains the video IDs, and copy it into the ./videos directory.
  2. Run download_video.py. It downloads the videos and transcripts listed in video_ids.txt. Some videos may not match the extracted poses we provide if they have been re-uploaded, so please compare the numbers of frames, just in case.
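A minimal frame-count comparison might look like the sketch below. The helper and its tolerance are illustrative; you would feed it the frame count read from the downloaded video (e.g. via OpenCV) and the number of pose frames in the corresponding skeleton pickle:

```python
def frames_match(n_video_frames, n_skeleton_frames, tolerance=2):
    """Return True when the downloaded video and the provided skeleton
    file have (nearly) the same number of frames. A small tolerance
    absorbs off-by-one differences between video decoders; a larger
    mismatch suggests the video was re-uploaded."""
    return abs(n_video_frames - n_skeleton_frames) <= tolerance
```

For example, with OpenCV you might compare int(cv2.VideoCapture(path).get(cv2.CAP_PROP_FRAME_COUNT)) against len() of the loaded skeleton list.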

Citation

If our code or dataset is helpful, please cite the following paper:

@INPROCEEDINGS{yoonICRA19,
  title={Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots},
  author={Yoon, Youngwoo and Ko, Woo-Ri and Jang, Minsu and Lee, Jaeyeon and Kim, Jaehong and Lee, Geehyuk},
  booktitle={Proc. of The International Conference on Robotics and Automation (ICRA)},
  year={2019}
}

Related Projects

Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity (SIGGRAPH Asia 2020), https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context

Acknowledgement