<p align="center"> <img width="60%" src="https://raw.githubusercontent.com/PantoMatrix/BEAT/master/docs/assets/teaser2.png" /> </p> <h1 style="text-align: center;"> BEAT: Body-Expression-Audio-Text Dataset </h1>

News

Features

Benchmark

Gesture Generation on BEAT-16h (speakers 2, 4, 6, 8 in English data v0.2.1)

| Method | Venue | Input Modalities | FID** | SRGR | BeatAlign | ckpt |
|---|---|---|---|---|---|---|
| Seq2Seq | ICRA'19 | text | 261.3 | 0.173 | 0.729 | - |
| Speech2Gestures | CVPR'19 | audio | 256.7 | 0.092 | 0.751 | - |
| Audio2Gestures | ICCV'21 | audio | 223.8 | 0.097 | 0.766 | - |
| MultiContext | SIGGRAPH ASIA'20 | audio, text | 176.2 (177.2*) | 0.195 (0.227) | 0.776 (0.751) | link |
| CaMN | ECCV'22 | audio, text, facial | 123.7 (122.8) | 0.239 (0.240) | 0.783 (0.782) | link |

*Results from checkpoints trained with this repo are shown in parentheses. Results reported in the paper come from the following code: seq2seq, s2g, a2g, multi, camn.

**FID is calculated with a pretrained 300D AutoEncoder.
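
For reference, FID-style scores are usually computed as the Fréchet distance between two Gaussians fitted to feature vectors. The block below states that standard definition; it is an assumption that this repo follows it exactly, with features taken from the pretrained 300D AutoEncoder. The symbols are introduced here for illustration: $\mu_r, \Sigma_r$ are the mean and covariance of features from ground-truth gestures, and $\mu_g, \Sigma_g$ those of generated gestures.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean the generated feature distribution is closer to the ground-truth one, which matches the ordering of the methods in the table.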

Dataset

Introduction

Train/val/test split

The split script is in /dataloaders/preprocessing.ipynb; the train:val:test ratio is 2880:500:500.

split_rule_english = {
    # 4h speakers x 10
    "1, 2, 3, 4, 6, 7, 8, 9, 11, 21":{
        # 48+40+100=188mins each
        "train": [
            "0_9_9", "0_10_10", "0_11_11", "0_12_12", "0_13_13", "0_14_14", "0_15_15", "0_16_16", \
            "0_17_17", "0_18_18", "0_19_19", "0_20_20", "0_21_21", "0_22_22", "0_23_23", "0_24_24", \
            "0_25_25", "0_26_26", "0_27_27", "0_28_28", "0_29_29", "0_30_30", "0_31_31", "0_32_32", \
            "0_33_33", "0_34_34", "0_35_35", "0_36_36", "0_37_37", "0_38_38", "0_39_39", "0_40_40", \
            "0_41_41", "0_42_42", "0_43_43", "0_44_44", "0_45_45", "0_46_46", "0_47_47", "0_48_48", \
            "0_49_49", "0_50_50", "0_51_51", "0_52_52", "0_53_53", "0_54_54", "0_55_55", "0_56_56", \
            
            "0_66_66", "0_67_67", "0_68_68", "0_69_69", "0_70_70", "0_71_71",  \
            "0_74_74", "0_75_75", "0_76_76", "0_77_77", "0_78_78", "0_79_79",  \
            "0_82_82", "0_83_83", "0_84_84", "0_85_85",  \
            "0_88_88", "0_89_89", "0_90_90", "0_91_91", "0_92_92", "0_93_93",  \
            "0_96_96", "0_97_97", "0_98_98", "0_99_99", "0_100_100", "0_101_101",  \
            "0_104_104", "0_105_105", "0_106_106", "0_107_107", "0_108_108", "0_109_109",  \
            "0_112_112", "0_113_113", "0_114_114", "0_115_115", "0_116_116", "0_117_117",  \
            
            "1_2_2", "1_3_3", "1_4_4", "1_5_5", "1_6_6", "1_7_7", "1_8_8", "1_9_9", "1_10_10", "1_11_11",
        ],
        # 8+7+10=25mins each
        "val": [
            "0_57_57", "0_58_58", "0_59_59", "0_60_60", "0_61_61", "0_62_62", "0_63_63", "0_64_64", \
            "0_72_72", "0_80_80", "0_86_86", "0_94_94", "0_102_102", "0_110_110", "0_118_118", \
            "1_12_12",
        ],
        # 8+7+10=25mins each
        "test": [
           "0_1_1", "0_2_2", "0_3_3", "0_4_4", "0_5_5", "0_6_6", "0_7_7", "0_8_8", \
           "0_65_65", "0_73_73", "0_81_81", "0_87_87", "0_95_95", "0_103_103", "0_111_111", \
           "1_1_1",
        ],
    },
    
    # 1h speakers x 20
    "5, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30":{
        # 8+7+20=35mins each
        "train": [
            "0_9_9", "0_10_10", "0_11_11", "0_12_12", "0_13_13", "0_14_14", "0_15_15", "0_16_16", \
            "0_66_66", "0_74_74", "0_82_82", "0_88_88", "0_96_96", "0_104_104", "0_112_112", "0_118_118", \
            "1_2_2", "1_3_3", 
            "1_0_0", "1_4_4", # for speaker 29 only
        ],
        # 4+3.5+5 = 12.5mins each
        # 0_65_a and 0_65_b denote the first and second half of sequence 0_65_65
        "val": [
            "0_5_5", "0_6_6", "0_7_7", "0_8_8",  \
            "0_65_b", "0_73_b", "0_81_b", "0_87_b", "0_95_b", "0_103_b", "0_111_b", \
            "1_1_b",
        ],
        # 4+3.5+5 = 12.5mins each
        "test": [
           "0_1_1", "0_2_2", "0_3_3", "0_4_4", \
           "0_65_a", "0_73_a", "0_81_a", "0_87_a", "0_95_a", "0_103_a", "0_111_a", \
           "1_1_a",
        ],
    },
}
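
The rule above groups speakers under comma-separated keys, so a loader has to expand it before it can look up a single speaker. The following is a minimal sketch of one way to do that, not the code in preprocessing.ipynb. The helper names `expand_split_rule` and `find_split` and the shortened `toy_rule` are hypothetical and exist only for illustration.

```python
def expand_split_rule(split_rule):
    """Map each speaker ID to its own {"train": [...], "val": [...], "test": [...]}."""
    per_speaker = {}
    for speaker_key, splits in split_rule.items():
        # Keys such as "1, 2, 3" list the speakers that share one rule.
        for speaker_id in (int(s) for s in speaker_key.split(",")):
            per_speaker[speaker_id] = {name: list(seqs) for name, seqs in splits.items()}
    return per_speaker


def find_split(per_speaker, speaker_id, sequence):
    """Return the split name that contains `sequence` for this speaker, or None."""
    for name, seqs in per_speaker[speaker_id].items():
        if sequence in seqs:
            return name
    return None


# Toy rule in the same shape as split_rule_english (hypothetical, shortened values).
toy_rule = {
    "1, 2": {"train": ["0_9_9", "1_2_2"], "val": ["0_57_57"], "test": ["0_1_1"]},
    "5": {"train": ["0_9_9"], "val": ["0_65_b"], "test": ["0_65_a"]},
}
toy_lookup = expand_split_rule(toy_rule)
print(find_split(toy_lookup, 2, "0_57_57"))
print(find_split(toy_lookup, 5, "0_65_a"))
```

Passing `split_rule_english` instead of `toy_rule` should work the same way. Note that the sketch does not handle the entries marked "for speaker 29 only"; a real loader would need to treat those separately, for example by skipping sequences whose files do not exist.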

Other scripts and avatars

Reproduction

Train and test CaMN

  1. python == 3.7
  2. build folders like:
    audio2pose
    ├── codes
    │   └── audio2pose
    ├── datasets
    │   ├── beat_raw_data
    │   ├── beat_annotations
    │   └── beat_cache
    └── outputs
        └── audio2pose
            ├── custom
            └── wandb   
    
  3. download the scripts to codes/audio2pose/
  4. run pip install -r requirements.txt in the path ./codes/audio2pose/
  5. download full dataset to datasets/beat
  6. build the data cache and calculate the mean and std for the given number of joints, FPS, and speakers using /dataloaders/preprocessing.ipynb
  7. run cd ./dataloaders && python build_vocab.py to build the vocabulary for the language model
  8. run python train.py -c ./configs/ae_4english_15_141.yaml to train the autoencoder (pretrained_ae) used for FID calculation, or download the pretrained ckpt to /datasets/beat_cache/cache_name/weights/
  9. run python train.py -c ./configs/camn_4english_15_141.yaml for training, or download the pretrained ckpt to /datasets/beat_cache/cache_name/weights/.
  10. run python test.py -c ./configs/camn_4english_15_141.yaml for inference.
  11. load ./outputs/audio2pose/custom/exp_name/epoch_number/xxx.bvh into Blender to visualize the test results.
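
The folder tree from step 2 can also be created with a short script. This is a minimal sketch that uses only the folder names listed above; the `LAYOUT` list and the `build_layout` helper are hypothetical names, and the demonstration runs in a temporary directory so that it does not touch a real workspace.

```python
from pathlib import Path
import tempfile

# Sub-folders from step 2, relative to the project root "audio2pose".
LAYOUT = [
    "codes/audio2pose",
    "datasets/beat_raw_data",
    "datasets/beat_annotations",
    "datasets/beat_cache",
    "outputs/audio2pose/custom",
    "outputs/audio2pose/wandb",
]


def build_layout(root):
    """Create the folder tree under `root` and return the created paths."""
    root = Path(root)
    created = []
    for rel in LAYOUT:
        path = root / rel
        # exist_ok=True makes the call safe to repeat on an existing tree.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstration in a throwaway directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = build_layout(Path(tmp) / "audio2pose")
    demo_ok = all(p.is_dir() for p in demo_paths)
print(demo_ok)
```

To set up a real workspace, call `build_layout("audio2pose")` from the directory that should contain the project.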

From data in other datasets (e.g. Trinity)

Citation

BEAT is established for the following research project. Please consider citing our work if you use the BEAT dataset.

@article{liu2022beat,
  title   = {BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis},
  author  = {Haiyang Liu and Zihao Zhu and Naoya Iwamoto and Yichen Peng and Zhengqing Li and You Zhou and Elif Bozkurt and Bo Zheng},
  journal = {European Conference on Computer Vision},
  year    = {2022}
}