Abstract is out; Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras

paper | repo


Since last December, I have been developing and using Kapre, which is just a set of Keras layers that do some audio preprocessing, e.g., STFT and melspectrogram. Until now, though, I had no real benchmark.

There are many small reasons for this. Well, having my own Python package that can be installed with pip is cool, right? It's one of the cool kids' requirements, and I've learned a lot by building it. Most importantly, though, it makes the research experience so much nicer when you work with more than a handful of audio files.

A scenario

Let's say you're doing some classification with a not-so-small music dataset, like FMA-medium, which is roughly 30 GB after uncompressing.

OK, decoding mp3 takes a long time, so it should be done in advance.


But actually, what should I convert them to? Spectrogram? Melspectrogram? With which parameters? Alright, I'll try n_fft=1024, 512 and… hop=512, 256,… hm… and… n_mels=128, 96.

Congrats, just compute and store them all, which would take up something like 2 * 2 * 2 * 30 GB = 240 GB. (Yeah, it's a very loose estimate.)

A day or two later.

Okay they are ready! Now let’s run the model.

Result: [512, 256, 96] worked the best.

Wut, then is 96 better? or maybe I gotta try with 72? 48?

Ok then I can remove the others… should I? But it took me a day and what if I need it again later..

👏👏👏 YES IT GOES ON AND ON! What if you can just do this?

model = Sequential()
# A mel-spectrogram layer as the first layer;
# input_shape is the shape of one sample, e.g., (n_channel, len_src)
model.add(Melspectrogram(n_dft=512, n_hop=256,
                 border_mode='same', sr=sr, n_mels=128,
                 fmin=0.0, fmax=sr/2, power=1.0,
                 return_decibel_melgram=False, trainable_fb=False,
                 input_shape=(1, len_src)))


Experiment result

Hey hey, but it adds more computation at training time. How much?

This much,

[Figure: normalised computation time of the Kapre preprocessing on four GPU models]

for 16384 audio samples (30 seconds each), with the convnet below (5 layers, 157k parameters).

[Figure: the 5-layer, 157k-parameter convnet used in the benchmark]

  • The experiment uses dummy data arrays, so there is no data-feeding overhead.
  • We tested it on four GPU models and the charts above are normalised. But what really matters is the absolute time.
    • If your model is very simple, the proportion of time that Kapre uses would increase.
    • Vice versa,
      • which is good: with a large dataset, pre-computing and storing spectrograms is annoying → with a large dataset you would probably also use a large network to take advantage of it → so, within the overall training time, the overhead that Kapre introduces becomes trivial.


paper | repo

Notes on my paper; On the Robustness of Deep Convolutional Neural Networks for Music Classification

In this paper I talk quite a lot about music tagging, quite a lot about audio preprocessing, and give some analysis of the trained networks, which relates back to the music tagging problem.

Music tagging dataset groundtruth is so wrong

Yeah, because of the weak labelling it is quite incorrect. But how much?

[Figure: error rates of the MSD groundtruth vs. my manual annotation]

…about this much. (I manually annotated 4 labels on 500 songs.)

Gosh, 70% error? No worries though; it's a sort of 'weakly-supervised learning' situation, where things turn out fine given enough data.


But how fine is it, exactly, in evaluation?

[Figure: evaluation curves with the different groundtruths, as described below]

red dotted: evaluation of the four instrument tags with my annotation.
blue dashed: evaluation of the four instrument tags with the MSD groundtruth.
yellow solid: evaluation of all tags with the MSD groundtruth.

Corr(red, blue) = how fine it is to use the MSD groundtruth for the 4 tags.
Corr(blue, yellow) = how fine it is to generalise the 4-tag result to all tags.

And as you see… well, I'd say this is fine. For sure there is error (which can be significant when the differences are subtle).

X vs log(X) if X in [spectrogram, melgram, cqt, …]

tl;dr: use log(x). See this distribution!

[Figure: distributions of spectrogram magnitudes, with and without log()]

Or, see how much of a disadvantage it is without log(): you need roughly 2x the data, which is a lot.

[Figure: performance gap when log() is not applied]
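By the way, log() here means decibel scaling. A minimal numpy sketch of it (the amin and dynamic-range constants are my own illustrative choices; librosa ships the same thing as logamplitude/power_to_db, depending on the version):

```python
import numpy as np

def log_compress(spec, amin=1e-10, dynamic_range=80.0):
    """Map a power spectrogram/melgram onto a decibel scale."""
    log_spec = 10.0 * np.log10(np.maximum(spec, amin))
    log_spec -= log_spec.max()                   # 0 dB at the loudest bin
    return np.maximum(log_spec, -dynamic_range)  # clip the quiet tail

spec = np.abs(np.random.randn(96, 1366)) ** 2    # stand-in for a melgram
db_spec = log_compress(spec)                     # peaks at 0 dB, floor at -80 dB
```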

Spectral whitening (per-frequency standardisation)? A-weighting? Should I do some special normalisation for a better result?


How similar are music tags according to a trained convnet?

[Figure: tag similarity according to the trained convnet]

More details are in the paper.


Paper is out; Transfer learning for music classification and regression tasks, and behind the scene, negative results, etc.

1. Summary

A catch for MIR:

For ML:

So what does it do?


A Convnet is trained.

Then we extract the knowledge like above, and use it for…

  • genre classifications
  • emotion prediction (on the arousal-valence plane)
  • speech/music classification
  • vocal/non-vocal excerpt classification
  • audio event classification

Is it good?


What’s special in this work?

It uses (up to) ALL the layers, not just the last N layers. That's partly because it's music; I'm not sure what would happen if we did the same for images. I tried to find a similar approach elsewhere, but found nothing.



2. Behind the scene

Q. Did you really select the tagging task because,

Music tagging is selected as a source task because i) large training data is accessible for this task and ii) its label set covers various aspects of music, i.e., genre, mood, era, and instrumentations

? You've been doing tagging for a year+ and you must just be looking to re-use your weights!

A. NOOOOOOO… about 1.5 years ago, music recommendation was still part of my PhD plan and I thought I’d do content-based music tagging ‘briefly’ and use the result for recommendation.

…although I forgot about that part for a long while.


Q. What else did you try?

A. I tried similarity prediction using the 5th-layer features and the similarity data in MSD. Similarity was estimated using the cosine distance between two song vectors. The result:

[Figure: scatter plot of predicted vs. groundtruth similarity]

There are 18M pairs of similarities and this plot uses 1K samples of them. The correlation values are something like 0.08, -0.05, … totally uncorrelated.

I suspect the similarity in the groundtruth is not really about audio similarity, although I couldn't find out how the similarity was gathered. Also, I listened to K-NN results of the convnet features to check whether they really sound similar, and (OK, it's subjective) they certainly do!

Hopefully there's a similarity dataset that focuses on the sound rather than the social/cultural aspects of songs, so that I can test this again.

… ok actually I hope someone else find it and do it =)

I tried something else, too: measuring precision@k, as below:

First, collect similar pairs of songs from the ground-truth data (you can threshold the similarity.) Let’s call these pairs “similar pairs at similarity P”. Something like P=0.8 should be used.

Then, for each of these similar pairs (A,B), you will do the following:

(1) Find K most similar songs to A according to the cosine similarity
(2) If B is included in this set, it’s correct.
(3) Do the same for B as well.
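In code, the procedure is roughly the following (a toy numpy sketch with names of my own; the actual run was over the 18M MSD pairs):

```python
import numpy as np

def precision_at_k(features, similar_pairs, k=10):
    """For each groundtruth-similar pair (a, b), count it correct if b is
    among the k nearest neighbours of a by cosine similarity, and vice versa."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T                   # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)            # a song is not its own neighbour
    topk = np.argsort(-sim, axis=1)[:, :k]    # k nearest neighbours per song
    hits = sum((b in topk[a]) + (a in topk[b]) for a, b in similar_pairs)
    return hits / (2 * len(similar_pairs))

feats = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
precision_at_k(feats, [(0, 1), (2, 3)], k=1)  # both pairs retrieved perfectly
```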


checking 139684 pairs, threshold=0.999 (out of 18M pairs)

precision: 0.0105667077117 for 139684 pairs at k=10
precision: 0.0416654734973 for 139684 pairs at k=100
precision: 0.115367543885 for 139684 pairs at k=1000

checking 292255 pairs, threshold=0.8 (out of 18M pairs)

precision: 0.00903662897128 for 292255 pairs at k=10
precision: 0.0364476227952 for 292255 pairs at k=100
precision: 0.106280474243 for 292255 pairs at k=1000

In short, it doesn’t work.


Q. Why is it on ICLR template?

A. Don’t you like it!??


Q. Why didn’t you add the result using ELM?

A. I will do that in another paper, which will also include some of these transfer learning results.


That's it. Again, please also enjoy the paper and the code.


EDIT: 30 June 2017 – It’s updated to v2, the camera-ready version for ISMIR 2017.

An unexpected encounter with Extreme Learning Machines

I was testing multiple MIR tasks with my pre-trained features from compact_cnn. Or so I thought.

It turned out that I hadn't even loaded the weights file. There is simply no such code. (You're supposed to laugh at me now. Out loud.)

The featured (thumbnail) image is what I ended up doing. Please scroll up and take a look. I almost never realised it and even wrote quite a lot of 'discussion' on top of the results. (Yeah, laugh at me. LOUDER.) Anyway, it means my system was a deep convolutional extreme learning machine.

I couldn't realise it earlier because the results were quite good. Let's see how the features worked with an SVM classifier.


Not bad, huh?

Finally, Kyunghyun pointed out that this has been known since 1965, as Cover's theorem.

A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.

— Cover, T.M., Geometrical and Statistical properties of systems of linear inequalities with applications in pattern recognition, 1965

Still quite interesting. Oh, and the non-ELM results (what I thought I had been computing) will be posted to arXiv soon. See you there.

PS. Saxe et al. 2011 is a NIPS paper about this. The Deep Learning book also mentions in the convnet chapter that "random filters often work surprisingly well in convolutional networks".

How do I preprocess audio signals

With the UrbanSound8K dataset, which is a reasonable size, I did it like this:


The code structure seems not bad. As a result, I have a couple of hdf files, which are convenient to use with Keras.


  • No multi-processing is used here.
  • hdf doesn't yet support multiple readers. That means you can't keep a file open in different processes; in other words, you can't use the same file at the same time. So even if you have N>1 GPUs, you can't just run several jobs on it simultaneously. There are workarounds, though.




For beginners; Writing a custom Keras layer

I have written a few simple Keras layers. This post summarises how to write your own layers. It's for beginners, because I only know the simple and easy ones 😉


1. Keras layer

https://keras.io/layers/about-keras-layers/ introduces some common methods. For beginners I don’t think it’s necessary to know these.

2. Keras Lambda layer

The Lambda layer is an easy way to customise a layer for simple arithmetic. As written on the page, it wraps

…an arbitrary Theano / TensorFlow expression… 

we can use the operations supported by the Keras backend, such as dot, transpose, max, pow, sign, etc., as well as operations that are not listed in the backend documentation but are actually supported by Theano and TensorFlow, e.g., **, /, //, and % for Theano.
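For example, a Lambda layer that only uses the ** and / operators (written here against tf.keras; the Theano-era Keras of this post works the same way, and every name below is mine):

```python
import numpy as np
from tensorflow.keras import layers

# square the input and halve it: no trainable weights, so Lambda is enough
halve_square = layers.Lambda(lambda x: x ** 2 / 2.0)

out = np.asarray(halve_square(np.array([[1.0, 2.0, 3.0]], dtype='float32')))
# out is [[0.5, 2.0, 4.5]]
```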

2.1 Lambda layer and output_shape

You might need to specify the output shape of your Lambda layer, especially if your Keras runs on Theano. Otherwise, it just seems to infer it from the input_shape.

2.1.1 With function

You can create a function that returns the output shape, typically taking input_shape as its argument. Here, the function returns the shape of the WHOLE BATCH.

2.1.2 With tuple

If you pass a tuple, it should be the shape of ONE DATA SAMPLE.
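A sketch of the two options (tf.keras syntax, my own toy example; the whole-batch convention for the function is the Theano-era behaviour described above):

```python
import numpy as np
from tensorflow.keras import layers

# a Lambda that keeps only the first half of the feature axis
take_first_half = lambda x: x[:, : x.shape[-1] // 2]

# 2.1.1 with a function: it receives and returns the WHOLE-BATCH shape
def first_half_shape(input_shape):
    return (input_shape[0], input_shape[1] // 2)

layer_fn = layers.Lambda(take_first_half, output_shape=first_half_shape)

# 2.1.2 with a tuple: the shape of ONE DATA SAMPLE, without the batch axis
layer_tuple = layers.Lambda(take_first_half, output_shape=(2,))

out = np.asarray(layer_fn(np.arange(8, dtype='float32').reshape(2, 4)))
# out keeps features 0 and 1 of each sample: [[0., 1.], [4., 5.]]
```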

3. A Keras model as a layer

At a high level, you can combine some layers to design your own layer. For example, I made a Melspectrogram layer as below. (The complete code is in the keras_STFT_layer repo.) This way, I could re-use the Convolution2D layer in the way I wanted.

The downside would be some overhead due to the many layers.

4. Customising Layer

When Lego-ing known layers doesn’t get you what you want, write your own!

4.1 Read the document

https://keras.io/layers/writing-your-own-keras-layers/ Read this, whether you fully understand it or not. I didn't fully understand it at first, but later I got it thanks to @fchollet's help.

4.2 Four methods

4.2.1 __init__():

Initialise the layer. Assign attributes to self so that you can use them later.

4.2.2 build(self, input_shape):

  1. Initialise the tensor variables (e.g., W, bias, or whatever) using Keras backend functions (e.g., self.W = K.variable(an_init_numpy_array)).
  2. Set self.trainable_weights to a list of those variables, e.g., self.trainable_weights = [self.W].

Remember: trainable weights should be tensor variables so that the machine can auto-differentiate them for you.

Remember (2): Check the dtype of every variable! If you initialise a tensor variable with a float64 numpy array, the variable might also be float64, which will get you an error. Usually it won't, because by default K.variable() casts the value to float32. But check, check, check! Check it by simply printing x.dtype.

4.2.3 call(self, x, mask=None):

This is where you implement the forward-pass operation. You may want to take the dot product of the input and one of the trainable weights (K.dot(x, self.W)), expand the dimensionality of a tensor variable (K.expand_dims(var1, dim=2)), or whatever.

Again, dtype! For example, I had to use this line, np.sqrt(2. * np.pi).astype('float32'), to make the constant float32.

4.2.4 get_output_shape_for(self, input_shape)

As the name says.
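Putting the four methods together, here is a toy layer of mine that learns one multiplicative weight per feature. I write it for today's tf.keras, where get_output_shape_for() is named compute_output_shape() and add_weight() takes care of trainable_weights for you; the roles of the methods are the same as above:

```python
import numpy as np
from tensorflow.keras import layers

class Scale(layers.Layer):
    """Toy layer: multiplies each feature by a trainable weight."""

    def __init__(self, **kwargs):
        # 4.2.1: assign hyperparameters to self here, if you have any
        super().__init__(**kwargs)

    def build(self, input_shape):
        # 4.2.2: create the trainable tensor variables (float32!)
        self.w = self.add_weight(name='w', shape=(input_shape[-1],),
                                 initializer='ones', trainable=True)
        super().build(input_shape)

    def call(self, x):
        # 4.2.3: the forward pass; differentiable w.r.t. x and self.w
        return x * self.w

    def compute_output_shape(self, input_shape):
        # 4.2.4: as the name says (get_output_shape_for in old Keras)
        return input_shape

out = np.asarray(Scale()(np.ones((2, 3), dtype='float32')))  # w starts at ones
```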

4.3 Examples

4.3.1 Example 1: Cropping2D Layer

It crops 2D input. Simple!

4.3.2 Example 2: ParametricMel

4.4 Tips

Remember: you need to make the layer's operation differentiable w.r.t. the input and the trainable weights you set. Look up the Keras backend functions and use them.

tensor.shape.eval() returns an integer tuple. You will need to print shapes a lot 😉

paper is UPDATED; Convolutional Recurrent Neural Networks for Music Classification (+reviews)


I updated my paper: arXiv link

Compared to the previous version, I

  • added one more architecture,
  • changed their names,
  • removed dropout from all convolution and fully-connected layers,
  • and re-ran all the experiments

Hopefully the figures and table below are interesting enough to make you read the paper!

  • Those are layouts.


  • In detail,


  • Results are,


  • The same results in time-AUC plane,


  • Performance per tag, though this figure is better seen within the paper



Review 1

  • Importance/Relevance: Of sufficient interest
  • Novelty/Originality: Minor originality
  • Technical Correctness: Probably correct
  • Experimental Validation: Sufficient validation/theoretical paper
    • Comment on Experimental Validation:
      Experiments on a large music corpus were carried out thoroughly, where comparisons among three different models (Conv1D, Conv2D, and CRNN) were done for different combinations of the number of hidden layers and the number of parameters.
  • Clarity of Presentation: Clear enough
  • Reference to Prior Work: References adequate
  • General Comments to Authors:
    • This study considers automatic classification of music, and reports the results of the experiment where three different types of convolutional neural networks (Conv1D, Conv2D, and CRNN) were compared thoroughly.
      Although convolutional recurrent neural network (CRNN) is not a new model, the evaluation done in the study is solid, which provides useful information to the people working in the research area.

Review 2

  • Importance/Relevance: Of limited interest
    • Comment on Importance/Relevance:
      In my opinion the paper might be of interest only for people working specifically on music classification problem with CNNs.
  • Novelty/Originality: Minor originality
    • Comment on Novelty/Originality:
      The use of RNN in CNNs for music tagging seems to be a relatively simple extension of the existing methods. The improvement is also moderate.
  • Technical Correctness: Definitely correct
  • Experimental Validation: Limited but convincing
  • Clarity of Presentation: Clear enough
  • Reference to Prior Work: References adequate

Review 3

  • Importance/Relevance: Of sufficient interest
    • Comment on Importance/Relevance:
      investigating properties of deep learning is important, especially these days, across application domains
  • Novelty/Originality: Moderately original
    • Comment on Novelty/Originality:
      little novel technical contribution, but a solid and much-needed empirical study
  • Technical Correctness: Definitely correct
  • Experimental Validation: Sufficient validation/theoretical paper
  • Clarity of Presentation: Very clear
  • Reference to Prior Work: Excellent references
  • General Comments to Authors:
    • well done experimental study

My comments about reviews

  • TL;DR of the reviews: "not too original, useful enough, nice experiment, good writing", which I'm quite glad about. The paper kind of proposes CRNN, but it is also about benchmarking/comparison (which was the original title of the paper).
  • I was surprised that review 1 mentioned http://www.cs.stanford.edu/people/anusha/static/deepplaylist.pdf. I knew of it but wasn't sure whether I had to cite it, for two reasons: it's a school class project, and it's not related beyond 'music' x 'ConvRNN'. I guess the reviewer searched while reviewing, which makes for a good review!


paper is out; Convolutional Recurrent Neural Networks for Music Classification




It is highly likely that you don’t need to read the paper after reading this post.


We introduce a convolutional recurrent neural network (CRNN) for music tagging. CRNNs take advantage of convolutional neural networks (CNNs) for local feature extraction and recurrent neural networks for temporal summarisation of the extracted features. We compare CRNN with two CNN structures that have been used for music tagging while controlling the number of parameters with respect to their performance and training time per sample. Overall, we found that CRNNs show strong performance with respect to the number of parameters and training time, indicating the effectiveness of their hybrid structure in music feature extraction and feature summarisation.


1. Introduction

  • CNNs (convolutional neural networks) are good useful strong robust popular etc.
  • CNNs can learn hierarchical features well.
  • Recently, CNNs are sometimes combined with RNNs (recurrent neural networks). We call it CRNN.
    • Convolutional layers are used first, then recurrent layers follow to summarise high-level features.
  • CRNNs fit music tasks well.
    • i) RNNs can be better at aggregation – they are more flexible.
    • ii) If input size varies, RNNs rock!
  • That’s why we write this paper.

2. Models


  • We compare the three models above.
  • We use identical settings of dropout, batch normalization, and activation function (ELU) for a fair comparison.

2.1 Conv1D

2.2 Conv2D

2.3 CRNN

  • 4-layer 2D conv layers + 2-layer GRU
    • conv: extract local (and short-segment) features
    • rnn (gru): summarise those (already quite high-level) features along time
      • It doesn't have to be the time axis that the RNN sweeps over; it's just that the shape of spectrograms almost always turns out that way when the prediction is at track level.
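A minimal sketch of this structure in Keras (tf.keras syntax; the widths and pool sizes below are illustrative choices of mine, not the exact configuration from the paper):

```python
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Input(shape=(96, 1366, 1)))      # mel bins x frames x channel
# 4 conv layers; pooling collapses frequency to 1 and shrinks time to 14 steps
for n, pool in zip((30, 60, 60, 60), [(2, 2), (4, 4), (4, 4), (3, 3)]):
    model.add(layers.Conv2D(n, (3, 3), padding='same', activation='elu'))
    model.add(layers.BatchNormalization())
    model.add(layers.MaxPooling2D(pool))
model.add(layers.Reshape((14, 60)))               # (time, features) for the RNN
model.add(layers.GRU(30, return_sequences=True))  # 2 GRU layers summarise the
model.add(layers.GRU(30))                         # features along time
model.add(layers.Dense(50, activation='sigmoid')) # 50 tags, multi-label
```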

2.4 Scaling networks

  • Q. I want to compare them while controlling parameters. How?
  • A.
    • Don’t change the depth.
    • Output node numbers are always fixed.
    • I control the width of layers. What is width?
      • number of nodes in fully-connected layers
      • number of (hidden) nodes in gru
      • number of feature maps in conv layers
  • #parameters ∈ {0.1M, 0.25M, 0.5M, 1M, 3M}
    • They may seem too small. So, why?
      • There are only 50 output nodes.
      • Hence there is no need for many nodes in the last N layers.
    • Is it fair?
      • I feel 0.1M and 0.25M are a bit too harsh for Conv1D, because it needs a lot of nodes just for the fully-connected layers. That's one of the properties of the network, though.
      • The RNN layers in CRNN don't need that many hidden parameters, no matter what the total number of parameters is. You will see this again in the results section.
      • Conv2D is quite well optimised in this range. In my previous paper (Automatic Tagging using Deep Convolutional Neural Networks), I set the numbers of feature maps too high, resulting in an inefficient structure.
  • A better way?
    • Perhaps I will amend this paper for the final version
    • Current setting: the same scale factor s multiplies the widths of all layers except the output layer. E.g., s can be 0.3, 0.9, 1.3.
    • Proposed setting: use s, but for layers near the output layer, use sqrt(s).
      • i.e., layer widths vary less the closer they are to the output layer,
      • because the number of output-layer nodes is fixed.
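In a few lines, the proposed scaling would look like this (a sketch; base_widths, n_near_output, and the rounding are my illustrative choices):

```python
import math

def scale_widths(base_widths, s, n_near_output=2):
    """Multiply widths by s, but only by sqrt(s) near the output layer."""
    widths = []
    for i, w in enumerate(base_widths):
        factor = math.sqrt(s) if i >= len(base_widths) - n_near_output else s
        widths.append(max(1, round(w * factor)))
    return widths

scale_widths([64, 128, 128, 256], 0.25)  # -> [16, 32, 64, 128]
```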


3. Experiments

  • MSD is used; the top 50 tags are predicted
  • 29s subsegment, 12kHz down-sampled, 96-bin mel-spectrogram
  • keras, theano, librosa.
  • AUC-ROC for evaluation

3.1 Memory-controlled experiment


  • CRNN > Conv2D > Conv1D, except at 3.0M parameters
    • CRNN > Conv2D: RNNs rock. Even with narrower conv layers, CRNN shows better performance.
      • Convolution and max-pooling are quite simple and static, while recurrent layers are flexible in summarising the features. Actually, we don't know exactly how the information should be summarised; we can only say RNNs seem better at it. Well, considering their strong performance in sentence summarisation, it's not surprising.
    • Conv2D > Conv1D
      • Fully-connected layers don't do better: they overfit easily and take a large number of parameters. It looks like the gradual subsampling over the time and frequency axes in Conv2D works well.
      • There can be many variants in between, e.g., why not mix 1D conv layers with 2D sub-sampling? That's out of the scope here, though.

3.2 Computation-controlled experiment


  • Same result but plotted in time-AUC domain
  • In training time < 150, CRNN is the best.
  • Conv2D (3M-parameter) is better in time > 150
  • Okay, [DISCLAIMER]
    • As described in the paper, I waited much longer for the 3M-parameter structures because I wanted them to show their best performance, as reference structures.
    • They were trained MUCH LONGER than the structures with #parameters <= 1M.
    • Therefore it's not fair to compare the others to the 3M-parameter ones.

3.3 Performance per tag


Fig. 3: AUCs of 1M-parameter structures. i) The average AUCs over all samples are plotted with dashed lines. ii) AUC of each tag is plotted using a bar chart and line. For each tag, red line indicates the score of Conv2D which is used as a baseline of bar charts for Conv1D (blue) and CRNN (green). In other words, blue and green bar heights represent the performance gaps, Conv2D-Conv1D and CRNN-Conv2D, respectively. iii) Tags are grouped by categories (genre/mood/instrument/era) and sorted by the score of Conv2D. iv) The number in parentheses after each tag indicates that tag’s popularity ranking in the dataset.

  • AUCs per tags (1M params structures)
    • CRNN > Conv2D for all
    • Conv2D > Conv1D for 47/50
  • Let's frame the tagging problem as a multi-task problem.
    • A better structure on task A is a better structure for the other tasks as well.
  • Tag popularity (= #occurrences of each tag) is not correlated with the tag performances. Therefore the performance differences are mostly not about popularity bias.

4. Conclusion

  • There is a trade-off between speed and memory.
  • Either Conv2D or CRNN can be used, depending on the circumstances.

5. Github

https://github.com/keunwoochoi/music-auto_tagging-keras provides the Conv2D and CRNN structures and pre-trained weights. Their performance is better than in the paper because for this repo I didn't use early stopping and waited like forever.

Oh, and again, here’s a link to the paper.

keras STFT layers

I started implementing new Keras layers in the keras_STFT_layer repo.

What are these?

With these layers, you wouldn't need to pre-compute STFT/melgram/CQT and store them on your hard drive. The new pipeline would be…

  • Store the audio files as they are,
    • or perhaps decode them into raw waveform (PCM) and store it in npy or hdf.
  • Start training!


The code would be

model = keras.models.Sequential()
specgram = Spectrogram(n_dft=512, n_hop=128, input_shape=(len_src, 1))
model.add(specgram)  # don't forget to actually add the layer
model.add(BatchNormalization(axis=time_axis))  # recommended

Would it be faster?

I will find out 🙂

How’s the quality of the conversion?



More info

Stay tuned to the keras_STFT_layer repo; there are code, IPython notebook files, etc.