iLID

Automatic spoken language identification (LID) using deep learning.

Motivation

We wanted to classify the spoken language within audio files, a process that usually serves as the first step for NLP or speech transcription.

We implemented two deep learning approaches using the TensorFlow and Caffe frameworks with different model configurations.

Repo Structure

Requirements

# Install additional Python requirements
pip install -r requirements.txt
pip install youtube_dl

Datasets

The scripts below download training data / audio samples from various sources.

Voxforge

/data/voxforge/download-data.sh
/data/voxforge/extract_tgz.sh {path_to_german.tgz} german

YouTube

python /data/youtube/download.py

Models

We trained models on either two (English, German) or all four languages (English, German, French, Spanish).

Best Performing Models

The top-scoring networks were trained with 15,000 images per language, a batch size of 64, and a learning rate of 0.001 that was decayed to 0.0001 after 7,000 iterations.
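
This schedule amounts to a simple step decay. The lines below are a minimal, hypothetical sketch in current TensorFlow (Keras API); the repository's own train.py may use an older API or a different optimizer.

# Hypothetical sketch of the documented schedule: lr 0.001, dropped to 0.0001
# after 7,000 iterations, with batches of 64. Not the repository's actual code.
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[7000],        # iteration at which the learning rate drops
    values=[0.001, 0.0001],   # learning rate before and after the boundary
)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)

BATCH_SIZE = 64               # batch size used for the best performing models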

Shallow Network EN/DE

Shallow Network EN/DE/FR/ES

Training

# Caffe:
/models/{model_name}/training.sh
# Tensorflow:
python /tensorflow/train.py

Labels

0 English
1 German
2 French
3 Spanish
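
For convenience, this index-to-language mapping can be written as a small lookup that turns a network's argmax prediction back into a language name. This is an illustrative sketch; the variable and function names are not taken from the repository.

# Map class indices (network output) to language names, following the label
# order documented above. Illustrative only.
import numpy as np

LABELS = {0: "English", 1: "German", 2: "French", 3: "Spanish"}

def decode_prediction(scores):
    """Return the language name for the highest-scoring class."""
    return LABELS[int(np.argmax(scores))]

# Example: a fake 4-class score vector where German wins.
print(decode_prediction([0.1, 0.7, 0.1, 0.1]))  # -> "German"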

Training Data

For training we used both the public Voxforge dataset and newsreel videos downloaded from YouTube. Check out the /data directory for download scripts.
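
Since the models above are described as being trained on images per language, the downloaded audio is presumably converted into spectrogram images first. The following is a minimal, hypothetical sketch of such a conversion using SciPy and Matplotlib; file names and parameters are illustrative and do not reflect the repository's actual preprocessing pipeline.

# Hypothetical sketch: turn a WAV file into a spectrogram image, assuming the
# "images per language" mentioned above are spectrograms. File names and
# parameters are illustrative, not the repository's actual preprocessing.
import numpy as np
import matplotlib
matplotlib.use("Agg")                               # render without a display
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, samples = wavfile.read("sample_german.wav")   # hypothetical input file
if samples.ndim > 1:                                # mix stereo down to mono
    samples = samples.mean(axis=1)

freqs, times, spec = spectrogram(samples, fs=rate, nperseg=512, noverlap=256)
log_spec = 10 * np.log10(spec + 1e-10)              # log scale for visibility

plt.axis("off")
plt.pcolormesh(times, freqs, log_spec)
plt.savefig("sample_german.png", bbox_inches="tight", pad_inches=0)
plt.close()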