Paper revision’s out: The effects of noisy labels on deep convolutional neural networks for music classification

It’s a revision of this paper. It’s a major revision, so major changes! I’ll only take notes on the new stuff.

Screen Shot 2017-09-12 at 01.31.31

The groundtruth in the tagging dataset is very noisy. The recall and precision are our (estimated) evaluation of the groundtruth. Yeah, they’re pretty low and we call it ‘groundtruth’…

Screen Shot 2017-09-12 at 01.31.42

That noise hurts the performance of the networks trained on it.

The good thing is that the trend doesn’t change no matter which groundtruth we use, either the provided one or our re-annotation.

Screen Shot 2017-09-12 at 01.31.52

This is a figure from Convolutional Recurrent Neural Networks for Music Classification, where I couldn’t work out why the performance differs so much per tag. Well, I think I know at least one of the reasons now. More noise on tag A → more confusing for the network (whatever the exact structure is) → lower performance.


Screen Shot 2017-09-12 at 01.38.06

Why don’t we try to explain another tag category from the same perspective? In the dataset, 90s and 00s are the majority (84%), but they probably don’t get tagged properly, at least not as well as 60s/70s/80s because, come on, you’re in 2010 and listening to 00s music. Why would you tag it? It’s more likely that you would tag 60s/70s/80s music because doing so gets you some information. As a result, the old-era tags have less noise, and so higher performance. Yes, this is our guess.

Screen Shot 2017-09-12 at 01.42.07

Ok, so with such a corrupted groundtruth, we know what happens when we use it to train. What happens when we use it to evaluate?

(a)(b) : ok it’s fine.

(c) : no it’s not that fine when the differences between the systems are subtle. Which is obvious because at some point, the noise in evaluation > the system-wise differences.


That’s it. Please go read it if it sounds interesting! arXiv link here.


Machine learning for music discovery (workshop) @ICML2017, Sydney

(Erik Schmidt from Pandora, who organised this workshop(s), is giving an opening introduction.)

Another ICML, another ML4MD workshop! This year’s was the 3rd machine learning for music discovery workshop and, as expected, it featured many awesome talks. I’ll briefly summarise who talked about what. Please also check out the workshop website.


Matrix Co-Factorisation and Applications to Music Analysis [abstract]

Slim Essid@Telecom ParisTech

Slim introduced how matrix co-factorisation (with respect to NMF) can be used for multi-view problems. NMF (non-negative matrix factorisation) is a technique to decompose a matrix into two matrices while keeping their elements non-negative. It is often applied to music spectrograms V, which are approximated by the product of column vectors W (frequency responses/harmonic patterns/etc.) and row vectors H (the activations of those column vectors/time-domain envelopes/etc.).
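To make the V ≈ WH idea concrete, here is a minimal sketch of plain NMF with the classic multiplicative updates for the Euclidean cost. It is not the co-factorisation from the talk, just the single-view building block, and the toy matrix, the function name `nmf` and all sizes are my own assumptions.

```python
import numpy as np

def nmf(V, n_components, n_iter=500, eps=1e-9, seed=0):
    """Factorise a non-negative matrix V (freq x time) into W @ H."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, n_components)) + eps   # spectral templates (columns)
    H = rng.random((n_components, n_time)) + eps   # activations over time (rows)
    for _ in range(n_iter):
        # Multiplicative updates: non-negative factors stay non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram" (an assumption): two random spectral patterns
# mixed by two random activation rows.
rng = np.random.default_rng(1)
V = rng.random((64, 2)) @ rng.random((2, 100))

W, H = nmf(V, n_components=2)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_error:.4f}")
```

A co-factorisation would run two of these at once and tie the two activation matrices together, either exactly or through a penalty term.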

Multi-view streams, such as audio-visual information from a performance, can be used for source separation by assuming that their activations H1 and H2 should be similar. The original iterative update rules are modified with this condition and shown to work well. Fitzgerald et al. (2009), Yoo & Choi (2011), and Yokoya et al. (2012) assumed hard constraints, while Seichepine et al. (2014) assumed a soft one.

The slides show only a few examples. One of the recent works uses audio signals and motion capture of string players to separate string instruments. The video was awesome; sad that I don’t have any video.


Learning a Large-Scale Vocal Similarity Embedding [abstract]

By Aparna Kumar@Spotify — as well as Rachel Bittner, Nicola Montecchio, Andreas Jansson, Maria Panteli, Eric Humphrey, Tristan Jehan

So many authors, and now you’re thinking they just put in all the names of those who drank together? Actually, what Aparna presented was a big and awesome system focusing on the (vocal) melody of pop music. Melody, for sure, is such an intriguing part of music. People say melody, rhythm, and chords are the three essential parts of music, which I think is true for most popular music. I can’t think of any popular music that doesn’t have vocals, which is why we call it a pop song.

What did they do with the vocal melody? Let’s see.

1. Andreas applied U-net, which is a convolutional auto-encoder with some bridges (which I think should be called BAE; bridged auto-encoder) and is popular for visual image segmentation, as below (image from Oxford). An analogy in music would be… source separation! It works pretty well. I had a chance to listen to the demo before; it will be up soon, as will the paper at ISMIR 2017. Okay, we’ve got the vocal track now.


2. With a well-separated vocal track, estimating pitch (f0) is trivial!… no it’s not. But it’s definitely easier. Rachel’s ISMIR paper this year is about this, afaik. I don’t think source separation was a part of the paper, but it is in this talk. So, yay, we’ve got a pretty accurate melody.

3. What do we do with the melody? Well, whatever machine learning helps to understand it further. It’s not precisely specified in the slides, and I might have missed it in her talk, but Maria Panteli, who’s my friend at c4dm and doing an internship at Spotify@London, has done lots of machine listening work for music style understanding. A melody can be a good cue for this task. 25 vocal styles were provided by musicologists and… okay I kinda forgot what’s been done exactly on it… um..

4. They also tried a genre classification task, making it clear that it is not an ultimate goal but a proxy task that can prove the idea of exploiting melody for whatever machine-listening task.

5. Another task was artist retrieval which didn’t work very well.

6. An ongoing work was to retrieve music by some semantic query that describes the melody.

Obviously there are lots of things going on there, and I was impressed that such a large-scale, long-pipeline system actually works.


Aligned Hierarchies – A Multi-Scale Structure-Based Representation for Music [abstract]

Katherine Kinnaird at Brown university @kmkinnaird

Katherine’s work was done by her and her three students, one of whom is a super-talented undergrad looking for a grad school to work further on MIR and math, which is one of the most important words to spread.

Her talk was about music structure analysis, especially focusing on hierarchical structures and applications to cover song detection. Hierarchy in structure is such a pain, and her approach is based on using repeats at every level. This is an interesting idea! The acknowledgements say it is a portion of her doctoral thesis, but some of it is to appear at this year’s ISMIR and there will be more by her and her students.


Mining Creation Methods from Music Data for Automated Content Generation [abstract]

Satoru Fukayama and Masataka Goto
National Institute of Advanced Industrial Science and Technology (AIST)

I also enjoyed Satoru’s quite comprehensive review of the work that has been done on music(-related) creation at AIST. As in the paper, there were four topics:

  • machine dancing
  • automatic chord generation (NMF, N-gram-based music language model)
  • automatic guitar tab generation
  • song2quartet: generate a quartet music from audio content

Check out the four references in his abstract for more details.

NSynth: Unsupervised Understanding of Musical Notes [abstract]

Cinjon Resnick et  al., Google Brain (Magenta team)

It is based on their ICML paper. As many would already know, NSynth is a sample-level synthesiser with WaveNet and a conditional auto-encoder, released along with the large dataset that they used for training.

Cinjon also played many clips that are probably not public online yet (check out his YouTube channel!). The demos are about controlling the volume (which is more like ‘gain’), oscillation, and mean sound by modifying z, the latent vector of the auto-encoder. As summarised in the abstract, it didn’t always work as the analogy suggests. It seems there is still a lot of work to do to understand and improve this approach, especially because people would expect a synthesiser to be easily and effectively controllable.


Multi-Level and Multi-Scale Feature Aggregation Using Sample-level Deep Convolutional Neural Networks for Music Classification [abstract]

Jongpil Lee and Juhan Nam, KAIST, Korea

Jongpil has been working on music tagging (hm, sounds familiar). His approach achieved a really good result on the MSD tagging dataset and I think it is probably at least one of the ways to go.

It is actually quite different from the previous approaches as the long title indicates.

  • Multi-level: activations of different levels are aggregated to directly contribute to the final decision (which is similar to the approach in my transfer learning paper, and Jongpil’s paper was out earlier).
  • Multi-scale: Multi-level is already somewhat multi-scale, though.
  • Sample-level: I think they should’ve made it clearer than the current title/name. Here, sample-level means something like character-level as an alternative to word-level in NLP. The minimum length is not something like 512 or 256 samples, but as small as 2 or 3 samples! Check out their SPL paper for details.


Music Highlight Extraction via Convolutional Recurrent Attention Networks [abstract]

Jung-Woo Ha, Adrian Kim, Dongwon Kim, Chanju Kim, and Jangyeon Park, at NAVER Corp

As a Korean Google, NAVER Corp has many services and departments like Naver (a search engine + many more), Line (a messenger used in Japan/Korea/many other Asian countries, which had its IPO on NASDAQ), Clova/CLAIR (AI/robotics research department), etc.

The presentation was about their highlight selection algorithm based on RNN/attention. They trained a genre classifier with an LSTM, which is then analysed to find the parts that are most attended to. It’s an interesting work, and it requires a whole track as a data sample, which is probably part of why there’s no such work yet.

One of my questions is how much it is correlated with other simple metrics such as energy/zero-crossing/etc. Hopefully there’s a full paper that discusses those aspects. I’d also be curious to know if finding a shorter highlight would formulate the problem better. It was 1 minute in this abstract, but isn’t that too long? Although it has to match the requirements of their service.


Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras [abstract]

Keunwoo Choi, Deokjin Joo, and Juho Kim

It’s my work and already covered in the previous post. It got me a best title award from Katherine.


Ephemeral Context to Support Robust and Diverse Recommendations [abstract]

Pavel Kucherbaev, Nava Tintarev, and Carlos Rodriguez
Delft University of Technology and University of New South Wales

The final slot was filled by Australian researchers! (Along with the Netherlands.)

This abstract is mainly about what kinds of information a context-based music recommender can aggregate, focusing on multi-sensory information. A demo is online; it’s not really a fully-working recommender but more about the potential API, though. If the webapp isn’t working, you can check out the repo. It’s kind of a position paper and can also be useful for picking up the literature in the field so far.


The audio-video failures

Finally, I’d like to report that there were three sound failures, which gave us ironic and joyful moments. This is known to be an AV-hard problem, which is at least as difficult as NP-hard.

PS. Thanks to the organisers, Erik Schmidt, Oriol Nieto, Fabien Gouyon, and Gert Lanckriet!

Abstract is out; Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras

paper | repo


Since last December, I have been developing and using Kapre, which is just a set of Keras layers that do some audio preprocessing, say, STFT/mel-spectrogram, etc. I mean, without a real benchmark.

There are many small reasons for this. Well, having my own python package that can be installed with pip command is cool, right? It’s one of the cool kids’ requirements. I’ve learned a lot by doing it. But, most importantly, it makes the user experience so nice to do research with more than a bunch of audio files.

A scenario

Let’s say you’re doing some classification with a not-so-small music dataset. Like FMA-medium, which is roughly 30GB after uncompressing.

Ok, decoding mp3 takes long, so it should be done beforehand.


But actually to what should I convert them? Spectrogram? Melspectrogram? With what parameter? Alright, I’ll try n_fft=1024, 512 and… hop=512, 256,… hm.. and.. n_mels=128, 96.

Congrats, just do it and store them, which would take up like 2 * 2 * 2 * 30 = 240GB. (Yeah, it’s a very loose estimation.)
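The back-of-the-envelope number above can be spelled out as a tiny script. It assumes, very loosely, that every stored variant is about as large as the decoded audio itself; the variable names are just for illustration.

```python
from itertools import product

dataset_gb = 30                 # rough size of the decoded dataset, as above
n_ffts = [1024, 512]
hops = [512, 256]
n_mels_options = [128, 96]

# Every combination of the three parameters is one more copy to store.
configs = list(product(n_ffts, hops, n_mels_options))
total_gb = len(configs) * dataset_gb   # loose assumption: one copy is ~30GB
print(len(configs), "configurations ->", total_gb, "GB")
```

Adding one more option to any of the three lists multiplies the total again, which is the whole problem.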

A day or two later.

Okay they are ready! Now let’s run the model.

Result: [512, 256, 96] worked the best.

Wut, then is 96 better? or maybe I gotta try with 72? 48?

Ok then I can remove the others… should I? But it took me a day and what if I need it again later..

👏👏👏 YES IT GOES ON AND ON! What if you can just do this?

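from keras.models import Sequential
from kapre.time_frequency import Melspectrogram

sr = 44100  # sampling rate in Hz (an assumed value; use your dataset's rate)
model = Sequential()
# A mel-spectrogram layer, computed inside the model
# (add input_shape=... as well when it is the first layer)
model.add(Melspectrogram(n_dft=512, n_hop=256,
                 border_mode='same', sr=sr, n_mels=128,
                 fmin=0.0, fmax=sr/2, power=1.0,
                 return_decibel_melgram=False, trainable_fb=False))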


Experiment result

Hey hey, but it would add more computation at run time. How much is it?

This much,

Screen Shot 2017-06-17 at 14.31.45

for 16384 audio samples (30 seconds each), and the convnet looks like this. (157k parameters, 5-layer)

Screen Shot 2017-06-20 at 02.14.23

  • The experiment uses dummy data arrays, which have no overhead to feed.
  • We tested it on four GPU models and the charts above are normalised. But what really matters is the absolute time.
    • If your model is very simple, the proportion of time that Kapre uses would increase.
    • Vice versa,
      • which is good because if you’ve got a large dataset → pre-computing and storing spectrograms is annoying → with a large dataset you would probably like to use a large network to take advantage of it → yeah, so within the overall training time, the overhead that Kapre introduces becomes trivial.


paper | repo |slides

EDIT: Slides


Notes on my paper; On the Robustness of Deep Convolutional Neural Networks for Music Classification

In this paper I talk quite a lot about music tagging, also quite a lot about audio preprocessing, and some analysis of the trained network, which is related to the music tagging problem again.

Music tagging dataset groundtruth is so wrong

Yeah, because of weak labelling it is so incorrect. But how much?

Screen Shot 2017-06-07 at 21.04.31

..about this much. I manually annotated 4 labels on 500 songs.

Gosh, 70% error? No worries though, it’s sort of a ‘weakly-supervised learning’ situation, in which it’s fine with enough data.


But how fine is it, in evaluation?

Screen Shot 2017-06-07 at 20.55.46
red: evaluation of four instrument tags with my annotation
blue.dash: evaluation of four instrument tags with MSD groundtruth
yellow.solid: evaluation of all tags with MSD groundtruth

Corr(red, blue) = how much it’s fine to use MSD groundtruth for 4 tags.
Corr(blue, yellow) = how much it’s fine to generalise 4-tag result for all-tag result.

And as you see, well, I’d say this is fine. For sure there’s some error (which can be significant if the difference is subtle).

X vs log(X) if X in [spectrogram, melgram, cqt, …]

tl;dr: use log(x). See this distribution!

Screen Shot 2017-06-07 at 21.17.51

Or, see how disadvantageous it is to skip log(). Roughly 2x the data, which is a lot.

Screen Shot 2017-06-07 at 21.18.42
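For reference, the log() in question can be as simple as a decibel-style scaling. The following is only a sketch on synthetic data: the helper name `log_compress`, the `amin` floor and the random stand-in for a mel-spectrogram are my assumptions, not the exact preprocessing used in the paper.

```python
import numpy as np

def log_compress(x, amin=1e-10):
    """Decibel-style log scaling: 10 * log10(max(x, amin))."""
    return 10.0 * np.log10(np.maximum(x, amin))

# Synthetic stand-in for a power mel-spectrogram: non-negative and
# heavily skewed towards small values, with a few large peaks.
rng = np.random.default_rng(0)
melgram = rng.exponential(scale=1.0, size=(96, 1366)) ** 2

log_melgram = log_compress(melgram)

# Before log the mean sits far above the median (a long right tail);
# after log the values are spread much more evenly.
print("raw: mean %.2f, median %.2f" % (melgram.mean(), np.median(melgram)))
print("log: mean %.2f, median %.2f" % (log_melgram.mean(), np.median(log_melgram)))
```

The `amin` floor is only there so that exact zeros do not turn into -inf.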

Spectral whitening (per-frequency stdd)? A-weighting? Should I do some special normalisation for better result?


How similar are music tags according to a trained convnet?

Screen Shot 2017-06-07 at 21.21.53

More details on the paper:

Paper is out; Transfer learning for music classification and regression tasks, and behind the scene, negative results, etc.

1. Summary

A catch for MIR:

For ML:

So what does it do?


A Convnet is trained.

Then we extract the knowledge like above, and use it for…

  • genre classifications
  • emotion prediction (on av plane)
  • speech/music classification
  • vocal/non-vocal excerpt classification
  • audio event classification

Is it good?


What’s special in this work?

It uses (up to) ALL layers, not just the last N layers. Partly because it’s music; I’m not sure what would happen if we did the same for images, though. I tried to find out if there’s a similar approach and found nothing.



2. Behind the scene

Q. Did you really select the tagging task because,

Music tagging is selected as a source task because i) large training data is accessible for this task and ii) its label set covers various aspects of music, i.e., genre, mood, era, and instrumentations

? You’ve been doing tagging for a year+ and you must be just looking to re-use your weights!!

A. NOOOOOOO… about 1.5 years ago, music recommendation was still part of my PhD plan and I thought I’d do content-based music tagging ‘briefly’ and use the result for recommendation.

…Although I was forgetting that part for a long while.


Q. What else did you try?

A. I tried similarity prediction using the 5th-layer feature and the similarity data in MSD. Similarity was estimated using the cosine distance between two song vectors. The result is:

Screen Shot 2017-03-28 at 16.33.12

There are 18M pairs of similarities and this plot uses 1K samples of them. Correlation value is something like 0.08, -0.05,… totally uncorrelated.

I suspect that the similarity in the groundtruth is not much about audio similarity, although I couldn’t find out how the similarity was gathered. + I tested K-NN on the convnet features to see if the neighbours really sound similar, and (ok, it’s subjective) they certainly do!

Hopefully there’s a similarity dataset that focuses on the sound rather than the social/cultural aspects of songs, so that I can test it again.

… ok actually I hope someone else finds it and does it =)

I tried something else, too: measuring them by precision@k, as below:

First, collect similar pairs of songs from the ground-truth data (you can threshold the similarity.) Let’s call these pairs “similar pairs at similarity P”. Something like P=0.8 should be used.

Then, for each of these similar pairs (A,B), you will do the following:

(1) Find K most similar songs to A according to the cosine similarity
(2) If B is included in this set, it’s correct.
(3) Do the same for B as well.
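The three steps above can be sketched in NumPy as follows. This is a reconstruction from the description, not the script that produced the numbers below; `precision_at_k`, the feature matrix and the toy pairs are all hypothetical.

```python
import numpy as np

def precision_at_k(features, similar_pairs, k):
    """Fraction of queries (both directions of each pair) whose partner is in the top-k."""
    # L2-normalise rows so that a dot product equals cosine similarity.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    hits, total = 0, 0
    for a, b in similar_pairs:
        for query, target in ((a, b), (b, a)):         # step (3): do the same for B
            sims = unit @ unit[query]
            sims[query] = -np.inf                      # never retrieve the query itself
            top_k = np.argpartition(-sims, k - 1)[:k]  # step (1): K most similar songs
            hits += int(target in top_k)               # step (2): is the partner there?
            total += 1
    return hits / total

# Toy data: songs 0/1 point the same way, and so do songs 2/3.
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
good = precision_at_k(features, [(0, 1), (2, 3)], k=1)
bad = precision_at_k(features, [(0, 2)], k=1)
print(good, bad)
```

With 18M pairs you would want to batch the similarity computation instead of looping like this, but the metric is the same.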


checking 139684 pairs, threshold=0.999 (out of 18M pairs)

precision: 0.0105667077117 for 139684 pairs at k=10
precision: 0.0416654734973 for 139684 pairs at k=100
precision: 0.115367543885 for 139684 pairs at k=1000

checking 292255 pairs, threshold=0.8 (out of 18M pairs)

precision: 0.00903662897128 for 292255 pairs at k=10
precision: 0.0364476227952 for 292255 pairs at k=100
precision: 0.106280474243 for 292255 pairs at k=1000

In short, it doesn’t work.


Q. Why is it on ICLR template?

A. Don’t you like it!??


Q. Why didn’t you add the result using ELM?

A. I will do it in another paper, which will include some of this transfer learning results.


That’s it. Again, please also enjoy the paper and the codes.


EDIT: 30 June 2017 – It’s updated to v2, the camera-ready version for ISMIR 2017.

An unexpected encounter to Extreme Learning Machines

I was testing multiple MIR tasks with my pre-trained features from compact_cnn. I thought I was.

It turned out that I didn’t even load the weights file. There is simply no such code. (You’re supposed to laugh at me now. out loud.)

The featured (thumbnail) image is what I ended up doing. Please scroll up and take a look. But I almost never realised it and even wrote quite a lot of ‘discussion’ on top of the result. (Yeah, laugh at me. LOUDER) Anyway, it means my system was a deep convolutional extreme learning machine.

I couldn’t realise it earlier because the results were quite good. Let’s see how the features worked with an SVM classifier.


Not bad, huh?

Finally, Kyunghyun pointed out that it has been known since 1965 through Cover’s theorem.

A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.

— Cover, T.M., Geometrical and Statistical properties of systems of linear inequalities with applications in pattern recognition, 1965
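Here is a toy illustration of that statement, far smaller than my accidental experiment. Two classes that no straight line can separate in 2-D become easy for a linear readout once they pass through a random hidden layer that is never trained. Everything here (the ring data, 100 tanh units, a least-squares readout instead of an SVM) is an assumption chosen to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rings(n):
    """An inner disc (class 0) surrounded by an outer ring (class 1)."""
    labels = rng.integers(0, 2, size=n)
    radius = np.where(labels == 0, rng.uniform(0.0, 1.0, n), rng.uniform(1.5, 2.5, n))
    angle = rng.uniform(0.0, 2 * np.pi, n)
    points = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
    return points, labels

def fit_readout(features, labels):
    """Least-squares linear readout on +/-1 targets, with a bias column."""
    design = np.hstack([features, np.ones((len(features), 1))])
    weights, *_ = np.linalg.lstsq(design, 2.0 * labels - 1.0, rcond=None)
    return weights

def accuracy(features, labels, weights):
    design = np.hstack([features, np.ones((len(features), 1))])
    return np.mean((design @ weights > 0) == (labels == 1))

x_train, y_train = make_rings(1000)
x_test, y_test = make_rings(1000)

# Random hidden layer: drawn once, never updated.
hidden_w = rng.normal(size=(2, 100))
hidden_b = rng.normal(size=100)

def random_features(x):
    return np.tanh(x @ hidden_w + hidden_b)

linear_acc = accuracy(x_test, y_test, fit_readout(x_train, y_train))
elm_acc = accuracy(random_features(x_test), y_test,
                   fit_readout(random_features(x_train), y_train))
print(f"linear readout on raw 2-D input:       {linear_acc:.2f}")
print(f"linear readout on 100 random features: {elm_acc:.2f}")
```

Only the readout is fitted in both cases; the difference comes purely from the random nonlinear projection.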

Still quite interesting. Oh, and the non-ELM results, what I thought I was doing, will be posted to arXiv soon. See you there then.

PS. Saxe et al. 2011 is a NIPS paper about it. The Deep Learning Book also mentions in its convnet chapter that “Random filters often work surprisingly well in convolutional networks”.

How do I preprocess audio signals

With the UrbanSound8K dataset, which is of a reasonable size, I did it like this:

The code structure seems not bad. As a result, I have a couple of hdf files. They are convenient to use in Keras.


  • No multi-processing is used here.
  • hdf doesn’t yet support multiple readers. It means you can’t keep the files open in different processes. In other words, you can’t use the same file at the same time. In other words, even if you have N>1 GPUs, you can’t just run the tasks at the same time. There are workarounds though.




For beginners; Writing a custom Keras layer

I have written a few simple Keras layers. This post will summarise how to write your own layers. It’s for beginners because I only know simple and easy ones 😉


1. Keras layer introduces some common methods. For beginners I don’t think it’s necessary to know these.

2. Keras Lambda layer

Lambda layer is an easy way to customise a layer to do simple arithmetic. As written in the page,

…an arbitrary Theano / TensorFlow expression… 

we can use the operations supported by the Keras backend such as dot, transpose, max, pow, sign, etc., as well as those that are not specified in the backend documents but are actually supported by Theano and TensorFlow – e.g., **, /, //, % for Theano.

2.1 Lambda layer and output_shape

You might need to specify the output shape of your Lambda layer, especially if your Keras is on Theano. Otherwise it just seems to infer it from input_shape.

2.1.1 With function

You can create a function that returns the output shape, probably after taking input_shape as an input. Here, the function returns the shape of the WHOLE BATCH.

2.1.2 With tuple

If you pass a tuple, it should be the shape of ONE DATA SAMPLE.
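The difference between 2.1.1 and 2.1.2 is easy to mix up, so here is a small sketch of it. It runs without Keras; the Lambda calls appear only in comments, and the example (averaging over the last axis of a (batch, 100, 96) input) is a hypothetical one.

```python
# Hypothetical Lambda: average over the last axis, so a batch of shape
# (batch, 100, 96) becomes (batch, 100).

def output_shape_as_function(input_shape):
    """Function form: receives and returns the shape of the WHOLE BATCH."""
    return tuple(input_shape[:-1])

# Tuple form: the shape of ONE DATA SAMPLE, so the batch axis is left out.
output_shape_as_tuple = (100,)

batch_input_shape = (None, 100, 96)   # None stands for the unknown batch size
print(output_shape_as_function(batch_input_shape))
print((None,) + output_shape_as_tuple)

# With Keras, either form would be passed along these lines:
#   Lambda(lambda x: K.mean(x, axis=-1), output_shape=output_shape_as_function)
#   Lambda(lambda x: K.mean(x, axis=-1), output_shape=output_shape_as_tuple)
```

Both prints describe the same output; the two forms differ only in whether the batch axis is included.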

3. A Keras model as a layer

At a high level, you can combine some layers to design your own layer. For example, I made a Melspectrogram layer as below. (The complete code is in the keras_STFT_layer repo.) In this way, I could re-use the Convolution2D layer in the way I want.

The downside would be some overhead due to the many layers.

4. Customising Layer

When Lego-ing known layers doesn’t get you what you want, write your own!

4.1 Read the document

Read this! Whether you fully understand it or not. I didn’t fully understand it at first, but later I got it thanks to @fchollet’s help.

4.2 Four methods

4.2.1 __init__() :

initiate the layer. Assign attributes to self so that you can use them later.

4.2.2 build(self, input_shape) :

  1. initiate the tensor variables (e.g. W, bias, or whatever) using Keras backend functions (e.g., self.W = K.variable(an_init_numpy_array)).
  2. set self.trainable_weights with a list of variables. e.g., self.trainable_weights=[self.W].

Remember : trainable weights should be tensor variables so that the machine can auto-differentiate them for you.

Remember (2): Check out the dtype of every variable! If you initiate a tensor variable with a float64 numpy array, the variable might also be float64, which will get you an error. Usually it wouldn’t because by default K.variable() casts the value into float32. But, check check check! Check it by simply printing x.dtype.

4.2.3 call(self, x, mask=None) :

This is where you implement the forward-pass operation. You may want to take the dot product of the input and one of the trainable weights (K.dot(x, self.W)), expand the dimensionality of a tensor variable (K.expand_dims(var1, dim=2)), or whatever.

Again, dtype! For example, I had to use this line, np.sqrt(2. * np.pi).astype('float32'), to make the constant float32.

4.2.4 get_output_shape_for(self, input_shape)

As the name says.
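To show how the four methods fit together without depending on a particular Keras version, here is a plain-NumPy mock-up. It is NOT a real Keras layer: `ScaleLayer` is a made-up class, and a real layer would subclass Keras’ Layer and create self.W with K.variable() rather than holding NumPy arrays.

```python
import numpy as np

class ScaleLayer:
    """Plain-NumPy mock-up of the four methods; not a real Keras layer."""

    def __init__(self, init_scale=1.0):
        # 4.2.1: keep the configuration on self for later use.
        self.init_scale = init_scale
        self.trainable_weights = []

    def build(self, input_shape):
        # 4.2.2: create the weights once the input shape is known, and cast
        # explicitly so nothing silently ends up as float64.
        n_features = input_shape[-1]
        self.W = np.full((n_features,), self.init_scale).astype('float32')
        self.trainable_weights = [self.W]

    def call(self, x, mask=None):
        # 4.2.3: the forward pass, here an element-wise scaling.
        return x * self.W

    def get_output_shape_for(self, input_shape):
        # 4.2.4: scaling does not change the shape.
        return input_shape

layer = ScaleLayer(init_scale=2.0)
layer.build(input_shape=(None, 4))
x = np.ones((3, 4), dtype='float32')
y = layer.call(x)
print(y.dtype, y.shape, layer.get_output_shape_for((None, 4)))
```

Printing the dtype at the end is the "check check check" habit from above in miniature.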

4.3 Examples

4.3.1 Example 1 : Cropping2D Layer

It crops 2D input. Simple!

4.3.2 Example 2. ParametricMel

4.4 Tips

Remember: you need to make the operation of the layer differentiable w.r.t. the input and the trainable weights you set. Look up the Keras backend functions and use them.

tensor.shape.eval() returns an integer tuple. You would need to print them a lot 😉

paper is UPDATED; Convolutional Recurrent Neural Networks for Music Classification (+reviews)


I updated my paper: arXiv link

Compared to the previous one, I

  • added one more architecture,
  • changed their names,
  • removed dropout from all convolution and fully-connected layers,
  • and re-ran all the experiments

Hopefully the figures and table below are interesting enough to make you read the paper!

  • Those are layouts.


  • In detail,


  • Results are,


  • The same results in time-AUC plane,


  • Performances per tag, but this figure is better to be seen within the paper



Review 1

  • Importance/Relevance: Of sufficient interest
  • Novelty/Originality: Minor originality
  • Technical Correctness: Probably correct
  • Experimental Validation: Sufficient validation/theoretical paper
    • Comment on Experimental Validation:
      Experiments on a large music corpus were carried out thoroughly, where comparisons among three different models (Conv1D, Conv2D, and CRNN) were done for different combinations of the number of hidden layers and the number of parameters.
  • Clarity of Presentation: Clear enough
  • Reference to Prior Work: References adequate
  • General Comments to Authors:
    • This study considers automatic classification of music, and reports the results of the experiment where three different types of convolutional neural networks (Conv1D, Conv2D, and CRNN) were compared thoroughly.
      Although convolutional recurrent neural network (CRNN) is not a new model, the evaluation done in the study is solid, which provides useful information to the people working in the research area.

Review 2

  • Importance/Relevance: Of limited interest
    • Comment on Importance/Relevance:
      In my opinion the paper might be of interest only for people working specifically on music classification problem with CNNs.
  • Novelty/Originality: Minor originality
    • Comment on Novelty/Originality:
      The use of RNN in CNNs for music tagging seems to be a relatively simple extension of the existing methods. The improvement is also moderate.
  • Technical Correctness: Definitely correct
  • Experimental Validation: Limited but convincing
  • Clarity of Presentation: Clear enough
  • Reference to Prior Work: References adequate

Review 3

  • Importance/Relevance: Of sufficient interest
    • Comment on Importance/Relevance:
      investigating properties of deep learning is important, especially these days, across application domains
  • Novelty/Originality: Moderately original
    • Comment on Novelty/Originality:
      little novel technical contribution, but a solid and much-needed empirical study
  • Technical Correctness: Definitely correct
  • Experimental Validation: Sufficient validation/theoretical paper
  • Clarity of Presentation: Very clear
  • Reference to Prior Work: Excellent references
  • General Comments to Authors:
    • well done experimental study

My comments about reviews

  • TL;DR of the reviews would be “not too original, useful enough, nice experiment, good writing”, which I’m quite glad about. The paper is kind of suggesting CRNN, but it’s also about benchmark/comparison (which was the original title of the paper).
  • Surprised that the review 1 mentioned I knew it and wasn’t sure if I had to cite this, for two reasons: it’s a school class project, and it’s not directly related beyond ‘music’ x ‘ConvRNN’. I guess the reviewer searched while reviewing, which would make the review good!