S2IGAN
This is the PyTorch implementation of our paper S2IGAN: Speech-to-Image Generation via Adversarial Learning. More results can be seen on the project page.
Data processing
CUB-200 (Bird) and Oxford-102 (Flower)
step0: You can download the synthesized spoken caption database from the project page and then go directly to step3. Alternatively, start from step1 and synthesize the spoken captions yourself.
step1: Download CUB and Oxford Image and Text Captions
step2: Use a TTS system to convert the text captions into spoken captions. In our work, the Tacotron2 model pre-trained by NVIDIA was adopted. The original code released by NVIDIA does not provide batch inference; we made slight changes so that it can perform batch inference directly. You can download it here.
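If you do not want to use our modified code, the sketch below illustrates the general idea of batch Tacotron2 inference using the entry points NVIDIA publishes on PyTorch Hub; the hub names (nvidia_tacotron2, nvidia_tts_utils) and the example captions are taken from NVIDIA's hub example, not from this repo's modified script, so treat it as an illustration only.

```python
import torch

# Hedged sketch: batched text-to-spectrogram inference with NVIDIA's Tacotron2
# loaded from PyTorch Hub. These entry points come from NVIDIA's published hub
# example and may differ from the modified batch-inference code linked above.
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2')
tacotron2 = tacotron2.to('cuda').eval()
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')

# Hypothetical captions; in practice these come from the CUB/Oxford text files.
captions = ["this bird has a bright red head and a short beak",
            "a small yellow bird with black wings"]
sequences, lengths = utils.prepare_input_sequence(captions)
with torch.no_grad():
    mels, _, _ = tacotron2.infer(sequences, lengths)  # one mel spectrogram per caption
```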
step3: To speed up the training process, we convert the wav audio to filter-bank spectrograms in advance:
python data_processing/Audio_to_mel.py
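As a rough illustration of this step, a minimal wav-to-log-mel conversion could look like the sketch below; the librosa-based function and the parameter values (sampling rate, FFT size, hop length, number of mel bins) are illustrative assumptions, not necessarily the exact settings used in Audio_to_mel.py.

```python
import numpy as np
import librosa

def wav_to_log_mel(wav_path, sr=16000, n_fft=400, hop_length=160, n_mels=40):
    # Load the waveform at a fixed sampling rate (parameters are assumptions).
    y, _ = librosa.load(wav_path, sr=sr)
    # Mel filter-bank spectrogram, then log compression.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-6)

# Example: precompute and cache one spectrogram under audio_mel/.
# np.save("audio_mel/example.npy", wav_to_log_mel("audio/example.wav"))
```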
step4: Download the train/test split files for CUB and Oxford
Directory tree
├── birds
│ ├── CUB_200_2011
│ │ ├── audio
│ │ ├── audio_mel
│ │ ├── images
│ ├── train
│ │ ├── filenames.pickle
│ │ ├── class_info.pickle
│ ├── test
│ │ ├── filenames.pickle
│ │ ├── class_info.pickle
├── flowers
│ ├── Oxford102
│ │ ├── audio
│ │ ├── audio_mel
│ │ ├── images
│ ├── train
│ │ ├── filenames.pickle
│ │ ├── class_info.pickle
│ ├── test
│ │ ├── filenames.pickle
│ │ ├── class_info.pickle
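The filenames.pickle and class_info.pickle files in the trees above are ordinary Python pickles; a minimal loading sketch, assuming StackGAN-style splits written with Python 2 (hence the latin1 encoding), is shown below.

```python
import pickle

def load_split(split_dir):
    # Load the list of image/audio base names for this split.
    with open(f"{split_dir}/filenames.pickle", "rb") as f:
        filenames = pickle.load(f, encoding="latin1")
    # Load the corresponding class ids.
    with open(f"{split_dir}/class_info.pickle", "rb") as f:
        class_ids = pickle.load(f, encoding="latin1")
    return filenames, class_ids

# filenames, class_ids = load_split("birds/train")
```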
Places-subset
Download the Places audio data. Images of the Places-subset and the split files can be downloaded here. The database files are organized as follows:
├── places
│ ├── images
│ ├── audio
│ │ ├── mel
│ │ ├── wav
│ ├── train
│ │ ├── filenames.pickle
│ ├── test
│ │ ├── filenames.pickle
Flickr8k
step1: Download Flickr8k Image and Audio Captions
step2: Convert the wav files to spectrograms.
step3: Download the train/test split files.
├── Flickr8k
│ ├── images
│ ├── flickr_audio
│ │ ├── mel
│ │ ├── wavs
│ ├── train
│ │ ├── filenames.pickle
│ ├── test
│ │ ├── filenames.pickle
Running Step-by-step
Note: Change the paths in the .sh files to your data path. If you use the speech embeddings provided by us (see step2), you can start from step3.
step1: Train SEN
sh run/flickr/01_Pretrain.sh
step2: Extract speech embeddings
sh run/flickr/02_Extracting.sh
You can skip the first two steps by using our provided speech embeddings and pre-trained image encoder for CUB, Oxford, Flickr8k, and Places-subset. Then place these files as follows:
├── outputs
│ ├── pre_train
│ │ ├── birds
│ │ │ ├── speech_embeddings_train.pickle
│ │ │ ├── speech_embeddings_test.pickle
│ │ │ ├── models
│ │ │ │ ├── best_image_model.pth
│ │ ├── flowers
│ │ │ ├── speech_embeddings_train.pickle
│ │ │ ├── speech_embeddings_test.pickle
│ │ │ ├── models
│ │ │ │ ├── best_image_model.pth
│ │ ├── flickr
│ │ │ ├── speech_embeddings_train.pickle
│ │ │ ├── speech_embeddings_test.pickle
│ │ │ ├── models
│ │ │ │ ├── best_image_model.pth
│ │ ├── places
│ │ │ ├── speech_embeddings_train.pickle
│ │ │ ├── speech_embeddings_test.pickle
│ │ │ ├── models
│ │ │ │ ├── best_image_model.pth
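After placing the files, a quick sanity check like the one below can confirm that the embeddings load; the internal structure of the pickle (for example, a dict mapping filenames to embedding vectors) is an assumption here, so inspect it before relying on it.

```python
import pickle

# Hedged sanity check: the pickle's internal structure is an assumption;
# print its type and size and inspect the contents before training.
with open("outputs/pre_train/birds/speech_embeddings_train.pickle", "rb") as f:
    embeddings = pickle.load(f)
print(type(embeddings), len(embeddings))
```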
step3: Train the generator
sh run/flickr/03_TrainGAN.sh
step4: Generate images
sh run/flickr/04_GenImage.sh
step5: Calculate the Inception Score (IS)
For Flickr and Places-subset, you can directly run the .sh files in the corresponding directory, such as
sh run/flickr/05_InsceptionScore_generally.sh
For CUB and Oxford, we use the fine-tuned Inception model.
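For reference, the Inception Score is exp(E_x[KL(p(y|x) || p(y))]) computed over the softmax class probabilities of the generated images. The sketch below only illustrates that formula given a matrix of probabilities; it is not the repo's evaluation script, which also handles the data splits and the fine-tuned models.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (num_images, num_classes) softmax outputs, one row per image.
    marginal = probs.mean(axis=0, keepdims=True)                  # p(y)
    kl = probs * (np.log(probs + eps) - np.log(marginal + eps))   # KL(p(y|x) || p(y))
    return float(np.exp(kl.sum(axis=1).mean()))
```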
step6: Semantic Consistency Evaluation
For Flickr and Places-subset:
sh run/flickr/06_Recall.sh
For CUB and Oxford:
sh run/birds/06_mAP.sh
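Both evaluations measure cross-modal retrieval between speech and image embeddings. As an illustration, a minimal Recall@K over paired embeddings could look like the sketch below; the cosine-similarity setup and the assumption that row i of each matrix forms a ground-truth pair are illustrative, not the repo's exact protocol.

```python
import numpy as np

def recall_at_k(speech_emb, image_emb, k=1):
    # L2-normalize so the dot product is cosine similarity.
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = s @ v.T
    # A speech query counts as a hit if its paired image is in the top-k results.
    ranks = (-sims).argsort(axis=1)
    hits = [i in ranks[i, :k] for i in range(len(s))]
    return float(np.mean(hits))
```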
step7: FID. Download the code to calculate the FID.
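For reference, FID compares Gaussian statistics of Inception activations for real and generated images: FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2)). The sketch below only illustrates this formula given precomputed activation matrices; it is not a substitute for the linked implementation.

```python
import numpy as np
from scipy import linalg

def fid(real_feats, fake_feats):
    # Means and covariances of the two activation sets.
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # numerical noise can leave tiny imaginary parts
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean))
```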
Cite
@article{wang2020s2igan,
title={S2IGAN: Speech-to-Image Generation via Adversarial Learning},
author={Wang, Xinsheng and Qiao, Tingting and Zhu, Jihua and Hanjalic, Alan and Scharenborg, Odette},
journal={arXiv preprint arXiv:2005.06968},
year={2020}
}