# Kaggle Freesound Audio Tagging 2019 2nd place code
## Usage

- Download the datasets and place them in the `input` folder.
- Unzip `train_curated.zip` and `train_noisy.zip`, then put all the audio clips into `audio_train`.
- `sh run.sh`
## Requirements

```
tensorflow_gpu==1.11.0
numpy==1.14.2
tqdm==4.22.0
librosa==0.6.3
scipy==1.0.0
iterative_stratification==0.1.6
Keras==2.1.5
pandas==0.24.2
scikit_learn==0.21.2
```
## Hardware
- 64GB of RAM
- 1 Tesla P100
## Solution

- single model CV: 0.89763
- ensemble CV: 0.9108
### Feature engineering
- log mel spectrogram: shape (441, 64) (time, mels)
- global features: shape (128, 12) (split the clip evenly into 128 frames and compute 12 features per frame, as in `get_global_feat` below; local CV +0.005)
- clip length
```python
import numpy as np

def get_global_feat(x, num_steps):
    """Split the clip into num_steps frames and compute 12 statistics per frame."""
    stride = len(x) / num_steps
    ts = []
    for s in range(num_steps):
        i = s * stride
        # window around the frame: half a stride to the left, 1.5 strides to the right
        wl = max(0, int(i - stride / 2))
        wr = int(i + 1.5 * stride)
        local_x = x[wl:wr]
        percent_feat = np.percentile(local_x, [0, 1, 25, 30, 50, 60, 75, 99, 100]).tolist()
        range_feat = local_x.max() - local_x.min()
        # 3 summary stats + 9 percentiles = 12 features per frame
        ts.append([np.mean(local_x), np.std(local_x), range_feat] + percent_feat)
    ts = np.array(ts)
    assert ts.shape == (128, 12), (len(x), ts.shape)
    return ts
```
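For a raw waveform `clip`, this is presumably called as `get_global_feat(clip, 128)`, yielding the (128, 12) matrix described above (the assert hard-codes `num_steps=128`).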
### Preprocess

- audio clips are first trimmed of leading and trailing silence
- randomly select a 5 s clip from each audio clip
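A minimal sketch of this pipeline, assuming `librosa` for loading and silence trimming at 44.1 kHz; the `top_db` threshold and the padding scheme (also used for the random-padding augmentation below) are our assumptions, not the exact competition settings:

```python
import numpy as np
import librosa

def load_clip(path, clip_seconds=5, sr=44100):
    y, _ = librosa.load(path, sr=sr)
    # trim leading and trailing silence (top_db is an assumed threshold)
    y, _ = librosa.effects.trim(y, top_db=60)
    clip_len = clip_seconds * sr
    if len(y) >= clip_len:
        # randomly select a 5 s window
        start = np.random.randint(0, len(y) - clip_len + 1)
        y = y[start:start + clip_len]
    else:
        # randomly pad shorter clips to 5 s
        pad = clip_len - len(y)
        left = np.random.randint(0, pad + 1)
        y = np.pad(y, (left, pad - left), mode='constant')
    return y
```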
### Model

For details, please refer to `code/models.py`.
- Melspectrogram layer (code from kapre; we use it to search the log mel hyperparameters end-to-end)
- Our main model is a 9-layer CNN. The two axes of the log mel feature have different physical meanings, so instead of plain max or average pooling we pool with max along one axis and average along the other. (Our local CV gained a lot from this, but we have forgotten the exact number.)
- global pooling: pixel shuffle + max pooling along the time axis + average pooling along the mel axis
- SE block (several of our models use SE blocks; sketched at the end of this section)
- highway + 1×1 conv (several of our models use it; sketched at the end of this section)
- label smoothing (sketched at the end of this section)
```python
# log mel layer (from kapre); n_hop acts as the stride, n_dft as the kernel size
x = Melspectrogram(n_dft=1024, n_hop=cfg.stride,
                   input_shape=(1, K.int_shape(x_in)[1]),
                   padding='same', sr=44100, n_mels=64,
                   power_melgram=2, return_decibel_melgram=True,
                   trainable_fb=False, trainable_kernel=False,
                   image_data_format='channels_last', trainable=False)(x)
```
```python
# pooling mode: average pooling along one axis, max pooling along the other
x = AveragePooling2D(pool_size=(pool_size1, 1), padding='same', strides=(stride, 1))(x)
x = MaxPool2D(pool_size=(1, pool_size2), padding='same', strides=(1, stride))(x)
```
```python
# model head: pixel shuffle, then max pooling over time and average pooling over mel
def pixelShuffle(x):
    _, h, w, c = K.int_shape(x)
    bs = K.shape(x)[0]
    assert w % 2 == 0
    # fold pairs of mel bins into the channel axis
    x = K.reshape(x, (bs, h, w // 2, c * 2))
    return x

x = Lambda(pixelShuffle)(x)
x = Lambda(lambda x: K.max(x, axis=1))(x)   # max pooling over the time axis
x = Lambda(lambda x: K.mean(x, axis=1))(x)  # average pooling over the mel axis
```
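The SE block, highway + 1×1 conv, and label smoothing bullets have no snippet above. Below are minimal generic sketches under our own assumptions (the `se_block`, `highway_conv`, and `smooth_labels` names, the `ratio` and `eps` values, and the layer arrangement are ours, not the exact code from `code/models.py`):

```python
from keras import backend as K
from keras.layers import (Add, Conv2D, Dense, GlobalAveragePooling2D,
                          Lambda, Multiply, Reshape)

def se_block(x, ratio=8):
    # squeeze-and-excitation: global-average "squeeze", then channel-wise gating
    c = K.int_shape(x)[-1]
    s = GlobalAveragePooling2D()(x)
    s = Dense(c // ratio, activation='relu')(s)
    s = Dense(c, activation='sigmoid')(s)
    s = Reshape((1, 1, c))(s)
    return Multiply()([x, s])

def highway_conv(x):
    # highway connection built from 1x1 convs; keeps the channel count of x
    c = K.int_shape(x)[-1]
    h = Conv2D(c, 1, activation='relu')(x)     # transform branch
    t = Conv2D(c, 1, activation='sigmoid')(x)  # transform gate
    carry = Lambda(lambda g: 1.0 - g)(t)
    return Add()([Multiply()([h, t]), Multiply()([x, carry])])

def smooth_labels(y, eps=0.05):
    # soften hard 0/1 multi-label targets before binary cross-entropy
    return y * (1.0 - eps) + 0.5 * eps
```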
### Data augmentation

- mixup (local CV +0.002, LB +0.008; see the sketch after this list)
- randomly select a 5 s clip + random padding
- 3× TTA
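A minimal mixup sketch for multi-label batches (the single per-batch `lam` and `alpha=1.0` are our assumptions):

```python
import numpy as np

def mixup(x, y, alpha=1.0):
    # blend each sample (and its labels) with a random partner from the batch
    lam = np.random.beta(alpha, alpha)
    idx = np.random.permutation(len(x))
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]
```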
### Pretrain

- train a model only on train_noisy and use it as the pretrained model
### Ensemble

For details, please refer to `code/ensemble.py`.

- We use a NN for stacking: a LocallyConnected1D layer learns per-class ensemble weights, then a fully connected layer learns label correlations, with some initialization and weight-constraint tricks.
```python
def stacker(cfg, n):
    # initialize each class's local weights so that only the last model is used
    def kinit(shape, name=None):
        value = np.zeros(shape)
        value[:, -1] = 1
        return K.variable(value, name=name)

    x_in = Input((80, n))  # 80 classes, n models
    x = x_in
    # per-class ensemble weights (normNorm is a custom weight constraint, see code/ensemble.py)
    x = LocallyConnected1D(1, 1, kernel_initializer=kinit,
                           kernel_constraint=normNorm(1), use_bias=False)(x)
    x = Flatten()(x)
    # fully connected layer to learn label correlation, initialized to the identity
    x = Dense(80, use_bias=False, kernel_initializer=Identity(1))(x)
    x = Lambda(lambda x: x - 1.6)(x)
    x = Activation('tanh')(x)
    # map the tanh output back into [0, 1]
    x = Lambda(lambda x: (x + 1) * 0.5)(x)
    model = Model(inputs=x_in, outputs=x)
    model.compile(
        loss='binary_crossentropy',
        optimizer=Nadam(lr=cfg.lr),
    )
    return model
```
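`stacker(cfg, n)` takes the per-class predictions of `n` base models as an (80, n) input and is fit with binary cross-entropy; fitting it on out-of-fold predictions is the standard setup and our assumption here.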