S2IGAN

This is the PyTorch implementation of our paper S2IGAN: Speech-to-Image Generation via Adversarial Learning. More results can be found on the project page.

Data processing

CUB-200 (Bird) and Oxford-102 (Flower)

step0: You can download the synthesized spoken-caption database from the project page and then go directly to step 3. Alternatively, start from step 1 and synthesize the spoken captions yourself.

step1: Download the CUB and Oxford images and text captions

step2: Use a TTS system to convert the text captions to speech captions. In our work, we adopted Tacotron2 pre-trained by NVIDIA. The code originally released by NVIDIA does not support batch inference, so we made slight changes so that it can perform batch inference directly. You can download our modified version here.
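
If you prefer not to use our modified code, the sketch below shows roughly equivalent batched synthesis via NVIDIA's torch.hub distribution of Tacotron 2 and WaveGlow; the caption strings are placeholders, and the hub entry points are those documented by NVIDIA.

import torch

# Load NVIDIA's pre-trained Tacotron 2 (text -> mel) and WaveGlow (mel -> wav).
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                           'nvidia_tacotron2', model_math='fp32')
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                          'nvidia_waveglow', model_math='fp32')
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')

tacotron2 = tacotron2.to('cuda').eval()
waveglow = waveglow.remove_weightnorm(waveglow).to('cuda').eval()

# Placeholder captions; in practice, read the CUB/Oxford text captions.
texts = ['this bird has a yellow belly and grey wings.',
         'this flower has large pink petals and a white stigma.']
sequences, lengths = utils.prepare_input_sequence(texts)

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)  # batched mel spectrograms
    audio = waveglow.infer(mel)                      # waveforms at 22,050 Hz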

step3: To speed up training, we convert the WAV audio to filter-bank (mel) spectrograms in advance.

python data_processing/Audio_to_mel.py
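
For reference, a minimal sketch of the conversion; the exact hyperparameters used by data_processing/Audio_to_mel.py are assumptions here.

import librosa
import numpy as np

def wav_to_mel(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=40):
    # Load the waveform and compute a log mel filter-bank spectrogram.
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-6)  # shape: (n_mels, frames)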

step4: Download the train/test split files for CUB and Oxford

Directory tree

├── birds
│   ├── CUB_200_2011
│   │   ├── audio
│   │   ├── audio_mel
│   │   ├── images
│   ├── train
│   │   ├── filenames.pickle
│   │   ├── class_info.pickle
│   ├── test
│   │   ├── filenames.pickle
│   │   ├── class_info.pickle
├── flowers
│   ├── Oxford102
│   │   ├── audio
│   │   ├── audio_mel
│   │   ├── images
│   ├── train
│   │   ├── filenames.pickle
│   │   ├── class_info.pickle
│   ├── test
│   │   ├── filenames.pickle
│   │   ├── class_info.pickle

Places-subset

Download the Places audio data. Images of the Places-subset and the split files can be downloaded here. The database files are organized as follows:

├── places
│   ├── images
│   ├── audio
│   │   ├── mel
│   │   ├── wav
│   ├── train
│   │   ├── filenames.pickle
│   ├── test
│   │   ├── filenames.pickle

Flickr8k

step1: Download the Flickr8k images and audio captions

step2: Convert the WAV files to mel spectrograms (e.g., with data_processing/Audio_to_mel.py, as in step 3 above).

step3: Download the train/test split files.

├── Flickr8k
│   ├── images
│   ├── flickr_audio
│   │   ├── mel
│   │   ├── wavs
│   ├── train
│   │   ├── filenames.pickle
│   ├── test
│   │   ├── filenames.pickle

Running Step-by-step

Note: change the paths in the .sh files to your data path. If you use the speech embeddings we provide (see step 2), you can start from step 3.

step1: Train SEN

sh run/flickr/01_Pretrain.sh

step2: Extract speech embeddings

sh run/flickr/02_Extracting.sh

You can skip the first two steps by using our provided speech embeddings and pre-trained image encoders for CUB, Oxford, Flickr8k, and Places-subset. Place the embeddings as follows:

├── outputs
│   ├── pre_train
│   │   ├── birds
│   │   │   ├── speech_embeddings_train.pickle
│   │   │   ├── speech_embeddings_test.pickle
│   │   │   ├── models
│   │   │   │   ├── best_image_model.pth 
│   │   ├── flowers
│   │   │   ├── speech_embeddings_train.pickle
│   │   │   ├── speech_embeddings_test.pickle
│   │   │   ├── models
│   │   │   │   ├── best_image_model.pth
│   │   ├── flickr
│   │   │   ├── speech_embeddings_train.pickle
│   │   │   ├── speech_embeddings_test.pickle
│   │   │   ├── models
│   │   │   │   ├── best_image_model.pth
│   │   ├── places
│   │   │   ├── speech_embeddings_train.pickle
│   │   │   ├── speech_embeddings_test.pickle
│   │   │   ├── models
│   │   │   │   ├── best_image_model.pth
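
The internal structure of these pickles is not documented here; a quick way to inspect them (path shown for birds) is:

import pickle

with open('outputs/pre_train/birds/speech_embeddings_train.pickle', 'rb') as f:
    emb = pickle.load(f)
print(type(emb))  # inspect whatever structure the pickle holds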

step3: Train the generator

sh run/flickr/03_TrainGAN.sh

step4: Generate images

sh run/flickr/04_GenImage.sh

step5: Calculate the Inception Score (IS)

For Flickr and Places-subset, you can directly run the .sh files in the corresponding directory, for example:

sh run/flickr/05_InsceptionScore_generally.sh

For CUB and Oxford, we use fine-tuned Inception models.
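
For reference, a minimal sketch of the standard Inception Score computation (the .sh scripts above wrap the repo's own implementation; batch size and preprocessing are assumptions, images are expected to be normalized with ImageNet statistics, and for CUB/Oxford the fine-tuned weights would be loaded instead of the ImageNet ones):

import torch
import torch.nn.functional as F
from torchvision.models import inception_v3

@torch.no_grad()
def inception_score(images, splits=10, batch=32, device='cuda'):
    # Class posteriors p(y|x) from a pre-trained Inception-v3.
    net = inception_v3(pretrained=True, transform_input=False).to(device).eval()
    preds = []
    for i in range(0, len(images), batch):
        x = images[i:i + batch].to(device)
        x = F.interpolate(x, size=(299, 299), mode='bilinear', align_corners=False)
        preds.append(F.softmax(net(x), dim=1).cpu())
    p = torch.cat(preds)
    scores = []
    for part in p.chunk(splits):
        py = part.mean(dim=0, keepdim=True)               # marginal p(y)
        kl = (part * (part.log() - py.log())).sum(dim=1)  # KL(p(y|x) || p(y))
        scores.append(kl.mean().exp().item())
    s = torch.tensor(scores)
    return s.mean().item(), s.std().item()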

step6: Semantic Consistency Evaluation

For Flickr and Places-subset:

sh run/flickr/06_Recall.sh

For CUB and Oxford:

sh run/birds/06_mAP.sh
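
For reference, a hedged sketch of the recall@K retrieval metric computed from paired speech and image embeddings (function and variable names are assumptions; the scripts above produce the official numbers):

import torch
import torch.nn.functional as F

def recall_at_k(speech_emb, image_emb, k=1):
    # Cosine similarity of every speech query against every image.
    s = F.normalize(speech_emb, dim=1)
    v = F.normalize(image_emb, dim=1)
    sim = s @ v.t()                          # (N, N) similarity matrix
    topk = sim.topk(k, dim=1).indices        # k best images per query
    gt = torch.arange(len(s)).unsqueeze(1)   # query i matches image i
    return (topk == gt).any(dim=1).float().mean().item()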

step7: Calculate the Fréchet Inception Distance (FID). Download the code to calculate the FID.
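
If the downloaded code is the widely used pytorch-fid package (an assumption), it can be invoked directly on two image directories:

python -m pytorch_fid path/to/real_images path/to/generated_images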

Cite

@article{wang2020s2igan,
  title={S2IGAN: Speech-to-Image Generation via Adversarial Learning},
  author={Wang, Xinsheng and Qiao, Tingting and Zhu, Jihua and Hanjalic, Alan and Scharenborg, Odette},
  journal={arXiv preprint arXiv:2005.06968},
  year={2020}
}