Concrete_Shazam

A privacy-preserving version of Shazam using Concrete-ML.

Approaches

Fingerprinting Method from the SHAZAM paper

The method described in the SHAZAM paper uses fingerprinting to retrieve songs. In a nutshell, the algorithm extracts spectrogram peaks from the audio, hashes pairs of peaks into compact fingerprints, stores the fingerprints of every song in a database, and identifies a sample by looking up its fingerprints in that database.

While this is very efficient in a usual cleartext setting (time complexity with binary search: O(sample_fp_size × log(database_fp_size))), it is challenging to implement in FHE.
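As an illustration of that lookup cost, here is a minimal cleartext sketch (not code from this repository) that matches a sample's fingerprint hashes against a pre-sorted database with binary search; fingerprints are modelled as plain integers and the names are purely illustrative:

```python
import bisect

def count_matches(sample_fps, database_fps_sorted):
    """Count how many of a sample's fingerprint hashes occur in the
    pre-sorted database, using binary search for each lookup.

    Time complexity: O(len(sample_fps) * log(len(database_fps_sorted))).
    """
    hits = 0
    for fp in sample_fps:
        i = bisect.bisect_left(database_fps_sorted, fp)
        if i < len(database_fps_sorted) and database_fps_sorted[i] == fp:
            hits += 1
    return hits

# Toy usage with made-up integer hashes
database_fps_sorted = sorted([101, 205, 333, 478, 560, 999])
print(count_matches([205, 478, 777], database_fps_sorted))  # -> 2
```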

It is closely related to the PSI (Private Set Intersection) problem. As described in the paper Fast Private Set Intersection from Homomorphic Encryption, it is possible to achieve a similar time complexity; however, this has not been attempted for now as it involves low-level cryptographic optimisations.

The file fingerprinting_approach.py uses this fingerprinting approach, but instead of matching the fingerprints, each fingerprint is treated as a feature vector and a Decision Tree is trained on them. This approach is not scalable at all and takes around 7 minutes for just 10 songs in the training set. Hence, I use a different approach.
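A minimal sketch of that idea using Concrete-ML's DecisionTreeClassifier, assuming the fingerprints have already been turned into fixed-length feature vectors with song ids as labels (the actual preprocessing and model configuration in fingerprinting_approach.py may differ, and the fhe keyword varies across Concrete-ML versions):

```python
import numpy as np
from concrete.ml.sklearn import DecisionTreeClassifier

# Toy stand-in data: each row is a fingerprint treated as a feature vector,
# each label is the id of the song the fingerprint came from.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(100, 16)).astype(np.float64)
y = rng.integers(0, 10, size=100)

model = DecisionTreeClassifier(n_bits=6)    # quantization bit-width for FHE
model.fit(X, y)                             # training happens in the clear
model.compile(X)                            # build the FHE circuit from the input set
pred = model.predict(X[:1], fhe="execute")  # one encrypted prediction
```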

Main Approach: MFCC-feature-based Logistic Regression

(Workflow diagram)

The workflow is described in the chart above and has been demonstrated in the Main_analysis.ipynb notebook. The whole notebook takes around 6 minutes to run (on a MacBook Pro M2 with 8 GB RAM) and gives results for 400 songs selected from the fma_small.zip dataset. The model was also evaluated on a 1000-song subset, and the results are presented in the Results section. The process is also described in the notebook in the form of markdown cells.

Logistic Regression uses a One-vs-All approach to model multiclass outputs. While this is accurate, the size of the model during the model.compile(inputset) stage increases significantly, which consumes a lot of RAM. This presents an issue for scaling.

The good part of this approach is that the predictions are returned to the client in FHE. Hence the server not only never sees the plaintext features, it also never sees the plaintext prediction of the song, which gives strong privacy.
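A minimal sketch of this train/compile/predict flow, assuming MFCC feature matrices and song-id labels are already prepared (the real constants and helpers live in Main_analysis.ipynb; the n_bits value and the fhe keyword are assumptions that may differ across Concrete-ML versions):

```python
import numpy as np
from concrete.ml.sklearn import LogisticRegression

# Toy stand-in features: one MFCC-derived feature vector per audio sample,
# with the song id as the class label (one class per song).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 40))
y_train = rng.integers(0, 20, size=200)
X_test = rng.normal(size=(5, 40))

model = LogisticRegression(n_bits=8)  # quantization bit-width used for FHE
model.fit(X_train, y_train)           # trained in the clear (One-vs-All under the hood)
model.compile(X_train)                # RAM-heavy step: builds the FHE circuit over the input set
preds = model.predict(X_test, fhe="execute")  # inference runs on encrypted data
```

In the deployed client/server split, the client holds the keys, so the server only ever sees ciphertexts for both the features and the prediction.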

MFCC features

MFCC stands for Mel-Frequency Cepstral Coefficients. MFCCs are widely used as discriminating features for audio datasets.

(MFCC scatter plot)

The above plot depicts the discriminating power of MFCCs on a subset of the fma dataset, where two different genres form separate clusters.
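For reference, a minimal sketch of extracting a fixed-length MFCC feature vector with librosa; the number of coefficients and the mean/std aggregation are illustrative assumptions, not necessarily what the notebook uses:

```python
import numpy as np
import librosa

def mfcc_feature_vector(path, n_mfcc=20, duration=30.0):
    """Load an audio clip and summarise its MFCCs into one fixed-length vector."""
    y, sr = librosa.load(path, duration=duration)           # mono waveform + sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    # Aggregate over time so every clip maps to a vector of the same size.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```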

Using the Repository

To run the notebook, just modify the constants in the constants section.

After making these modifications you also need to provide a truth_label dictionary to the evaluate function for evaluation, where truth_label maps song_name to song_id.
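For illustration, such a dictionary could look like the following (file names and ids are hypothetical):

```python
# Hypothetical mapping from evaluation file name to the expected song id.
truth_label = {
    "000002.mp3": 0,
    "000005.mp3": 1,
    "000010.mp3": 2,
}
```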

Alternatively, there are also a client.py and a train.py file for production deployment. To use them, follow the instructions in the Deployment Guide.

Challenges in Scaling: Next Steps

When it comes to scaling the model we are presented with two challenges which give a tradeoff: keeping the time and memory cost of compiling and running the FHE model manageable, and keeping accuracy high as the number of classes (songs) grows.

While each of these challenges can be handled separately, addressing them together gives us a tradeoff.

The problem of time/memory constraints can be handled by using a small ML model with an 8k-class prediction vector.

To address the accuracy concern regardless of the number of classes, we could use a contrastive approach. We can train a Logistic Regression model LR with LR(feature_vector_1, feature_vector_2) = 1 if both feature vectors are from the same song, and 0 otherwise. We can then apply this LR model to the query paired with each feature vector in the training database and pick the max bincount to get the song_id. However, this does not scale well in time: if a single run of the LR model takes around 0.1 seconds, then each query song would need 8000 × 60 × 0.1 seconds = 48,000 seconds ≈ 800 minutes (assuming 8k 30-second samples).
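A minimal cleartext sketch of this contrastive matching idea (the pairwise model, the feature pairing by concatenation, and the database layout are assumptions for illustration, not code from the repository):

```python
import numpy as np

def identify_song(query_vectors, db_vectors, db_song_ids, pairwise_model):
    """Score a query against every feature vector in the database with a
    pairwise 'same song?' classifier and return the most-voted song id."""
    votes = []
    for q in query_vectors:
        for v, song_id in zip(db_vectors, db_song_ids):
            pair = np.concatenate([q, v]).reshape(1, -1)
            if pairwise_model.predict(pair)[0] == 1:  # 1 means "same song"
                votes.append(song_id)
    if not votes:
        return None
    return np.bincount(votes).argmax()  # song id with the most votes
```

The nested loop over the whole database is exactly where the 800-minute estimate above comes from.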

Results

The results here are reported for a 1000-song subset of fma_small.zip. The test set consists of the same samples as the training set, but with noise added as in audio_augmentation.py.

| Noise STD | Top-1 Accuracy (%) | Top-3 Accuracy (%) | Inference Time per Sample in FHE (s) |
|-----------|--------------------|--------------------|--------------------------------------|
| 0.01      | 95.3               | 97.4               | 0.46                                 |
| 0.02      | 94.8               | 97.1               | 0.45                                 |

Results on Held-out Set Evaluation

In this case the model is trained on the first 15 seconds of every song sample and then evaluated on the later 15 seconds (the fma dataset provides 30-second clips). The results are reported for a 1000-song subset.

| Model       | Top-1 Accuracy (%) | Top-3 Accuracy (%) | Inference Time per Sample (s) |
|-------------|--------------------|--------------------|-------------------------------|
| Concrete LR | 74.2               | 77.2               | 0.41                          |
| SkLearn LR  | 75.3               | 77.8               | 0.000233                      |