Abstract is out; Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras

paper | repo


Since last December, I’ve been developing and using Kapre, which is just a set of Keras layers that do some audio preprocessing, say, STFT, mel-spectrogram, etc. I mean, I did so without a real benchmark.

There are many small reasons for this. Well, having my own Python package that can be installed with pip is cool, right? It’s one of the cool kids’ requirements. I’ve learned a lot by doing it. But, most importantly, it makes doing research with more than a handful of audio files so much nicer.

A scenario

Let’s say you’re doing some classification with a not-so-small music dataset, like FMA-medium, which is roughly 30GB after uncompressing.

Ok, decoding mp3s takes a long time, so it should be done beforehand.


But actually, what should I convert them to? Spectrogram? Melspectrogram? With what parameters? Alright, I’ll try n_fft=1024, 512 and… hop=512, 256,… hm.. and.. n_mels=128, 96.

Congrats, just do it and store them, which would take up something like 2 * 2 * 2 * 30 = 240GB. (Yeah, it’s a very loose estimate.)
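To make that back-of-the-envelope number concrete, here is a minimal sketch that enumerates the grid from the scenario. It assumes, very loosely, that every precomputed variant takes about as much disk space as the decoded audio (roughly 30GB); the variable names are just for illustration.

```python
from itertools import product

# Candidate settings from the scenario above.
n_ffts = [1024, 512]
hops = [512, 256]
n_mels_options = [128, 96]

# Loose assumption: each precomputed variant takes about as much disk
# space as the decoded audio (~30 GB for FMA-medium).
GB_PER_VARIANT = 30

# One precomputed dataset per (n_fft, hop, n_mels) combination.
configs = list(product(n_ffts, hops, n_mels_options))
total_gb = len(configs) * GB_PER_VARIANT

print(len(configs), "variants,", total_gb, "GB")
```

Add one more value to any of the three lists and the total jumps again, which is exactly the annoying part.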

A day or two later.

Okay they are ready! Now let’s run the model.

Result: [512, 256, 96] worked the best.

Wut, then is 96 better? or maybe I gotta try with 72? 48?

Ok then I can remove the others… should I? But it took me a day, and what if I need them again later..

👏👏👏 YES IT GOES ON AND ON! What if you could just do this?

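from keras.models import Sequential
from kapre.time_frequency import Melspectrogram

# sr is the sampling rate of the input audio, defined elsewhere
model = Sequential()
# A mel-spectrogram layer: change n_dft, n_hop or n_mels here instead of re-computing files
model.add(Melspectrogram(n_dft=512, n_hop=256,
                 border_mode='same', sr=sr, n_mels=128,
                 fmin=0.0, fmax=sr/2, power=1.0,
                 return_decibel_melgram=False, trainable_fb=False))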


Experiment result

Hey hey, but it would add more computation at training time. How much is it?

This much,

[Figure: normalised timing charts on four GPU models]

for 16384 audio samples (30 seconds each), and the convnet looks like this (5 layers, 157k parameters).

[Figure: the convnet architecture]

  • The experiment uses dummy data arrays, which have no overhead to feed.
  • We tested it on four GPU models and the charts above are normalised. But what really matters is the absolute time.
    • If your model is very simple, the proportion of time that Kapre uses increases.
    • And vice versa: the bigger your model, the smaller that proportion,
      • which is good because if you have a large dataset → pre-computing and storing spectrograms is annoying → with a large dataset you’d probably like to use a large network to take advantage of it → yeah, so within the overall training time, the overhead that Kapre introduces becomes trivial.
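The proportion argument above can be sketched in a few lines. The timings here are hypothetical placeholders, not the measurements from the charts, and `kapre_share` is a made-up helper name: the point is only that a fixed preprocessing cost shrinks as a fraction once the rest of the network gets heavier.

```python
def kapre_share(preprocess_sec, model_sec):
    """Fraction of one training step spent in the preprocessing layer."""
    return preprocess_sec / (preprocess_sec + model_sec)


# Hypothetical per-batch timings in seconds, NOT measured results.
PREPROCESS_SEC = 0.02

for model_sec in (0.05, 0.5, 5.0):
    share = kapre_share(PREPROCESS_SEC, model_sec)
    print(f"model step {model_sec} s -> preprocessing share {share:.1%}")
```

The absolute overhead stays the same in all three cases; only its share of the step changes.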


paper | repo | slides

EDIT: Slides



3 thoughts on “Abstract is out; Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras”

  1. Really nice tool. Though I understand that you still need to keep all the original audio. So, when working with the Million Song Dataset, do you retrieve the 30 sec audio sample using some API ? Or do you reconstruct the audio via some features (chroma + mfcc maybe) ?

