paper is out; Convolutional Recurrent Neural Networks for Music Classification




It is highly likely that you don’t need to read the paper after reading this post.


We introduce a convolutional recurrent neural network (CRNN) for music tagging. CRNNs take advantage of convolutional neural networks (CNNs) for local feature extraction and recurrent neural networks for temporal summarisation of the extracted features. We compare CRNN with two CNN structures that have been used for music tagging while controlling the number of parameters with respect to their performance and training time per sample. Overall, we found that CRNNs show strong performance with respect to the number of parameters and training time, indicating the effectiveness of their hybrid structure in music feature extraction and feature summarisation.


1. Introduction

  • CNNs (convolutional neural networks) are good useful strong robust popular etc.
  • CNNs can learn hierarchical features well.
  • Recently, CNNs have sometimes been combined with RNNs (recurrent neural networks). We call the combination a CRNN.
    • Convolutional layers are used first, then recurrent layers follow to summarise high-level features.
  • CRNNs fit music tasks well.
    • i) RNNs can be better at aggregation – they are more flexible.
    • ii) If input size varies, RNNs rock!
  • That’s why we write this paper.

2. Models


  • We compare three models: Conv1D, Conv2D, and CRNN.
  • We use identical settings for dropout, batch normalization, and the activation function (ELU) for a fair comparison.

2.1 Conv1D

2.2 Conv2D

2.3 CRNN

  • 4-layer 2D conv layers + 2-layer GRU
    • conv: extract local (and short-segment) features
    • rnn (gru): summarise those (already quite high-level) features along time
      • The RNN doesn’t have to sweep over the time axis. It’s just that spectrograms are almost always shaped that way when the prediction is at track level.
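To make the "conv first, then GRU along time" idea concrete, here is a minimal shape walk-through. It is only a sketch: the 29 s, 12 kHz and 96 mel bins come from this post, while `hop_length=256`, the pooling sizes in `pools`, and the 'same'-padded convolutions are assumptions chosen for illustration, not the settings from the paper.

```python
def n_frames(duration_s, sample_rate, hop_length):
    """Number of spectrogram frames, ignoring edge padding."""
    return (duration_s * sample_rate) // hop_length


def crnn_shapes(n_mels, frames, pool_sizes):
    """Track the (freq, time) size after each conv + max-pool stage.

    Convolutions are assumed to be 'same'-padded, so only pooling shrinks the map.
    """
    shapes = [(n_mels, frames)]
    for pool_f, pool_t in pool_sizes:
        f, t = shapes[-1]
        shapes.append((f // pool_f, t // pool_t))
    return shapes


# 29 s at 12 kHz with 96 mel bins comes from the post; hop_length is assumed.
frames = n_frames(duration_s=29, sample_rate=12000, hop_length=256)

# Hypothetical (freq, time) pooling sizes for the four conv stages.
pools = [(2, 2), (3, 3), (4, 4), (4, 4)]
shapes = crnn_shapes(96, frames, pools)

for stage, (f, t) in enumerate(shapes):
    print(f"after stage {stage}: freq={f}, time={t}")

# Once the frequency axis has collapsed to 1, each remaining time step is a
# feature vector, and the GRU sweeps over these steps.
gru_steps = shapes[-1][1]
print("GRU sequence length:", gru_steps)
```

With these assumed pooling sizes the conv stack leaves a short sequence of high-level feature vectors, which is what the two GRU layers then summarise.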

2.4 Scaling networks

  • Q. I want to compare them while controlling the number of parameters. How?
  • A.
    • Don’t change the depth.
    • The number of output nodes is always fixed.
    • I control the width of the layers. What is width?
      • number of nodes in fully-connected layers
      • number of (hidden) nodes in gru
      • number of feature maps in conv layers
  • #parameters ∈ {0.1M, 0.25M, 0.5M, 1M, 3M}
    • They may seem too small. So, why?
      • There are only 50 output nodes.
      • Hence there is no need for many nodes in the last N layers.
    • Is it fair?
      • I feel like 0.1M and 0.25M are a bit too harsh for Conv1D, because it needs a lot of nodes just for its fully-connected layers. That’s one of the properties of the network though.
      • RNN layers in CRNN don’t need that many hidden nodes, no matter what the total number of parameters is. You will realise it again in the result section.
      • Conv2D is quite well optimised in this range. In my previous paper (Automatic Tagging using Deep Convolutional Neural Networks), I set the numbers of feature maps too high, resulting in an inefficient structure.
  • A better way?
    • Perhaps I will amend this paper for the final version.
    • Current setting: the same scale factor s is applied to all layers except the output layer. E.g., s can be a number like 0.3, 0.9, or 1.3.
    • Proposed setting: set s, but for layers near the output layer, use sqrt(s).
      • i.e. layer width varies less the closer the layer is to the output layer,
      • because the number of output-layer nodes is fixed.
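The scaling idea above can be sketched as a small parameter-counting script. Everything specific here is an assumption for illustration: the base widths `BASE_CONV` and `BASE_GRU`, the 3x3 kernels, treating the GRU layers as the "layers near the output", and ignoring batch-normalization parameters. The GRU count uses the classic three-gate formulation. The helper `find_scale` is hypothetical; it just bisects for a scale factor s that reaches a given parameter budget.

```python
import math

N_OUTPUTS = 50  # fixed: one output node per tag


def conv2d_params(c_in, c_out, k=3):
    """Weights plus biases of a k x k 2D convolution."""
    return k * k * c_in * c_out + c_out


def gru_params(n_in, n_hidden):
    """Three gates, each with input weights, recurrent weights, and a bias."""
    return 3 * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)


def dense_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


def scaled_widths(base_widths, s, n_near_output=0):
    """Scale every width by s; the last n_near_output layers use sqrt(s)."""
    first_near = len(base_widths) - n_near_output
    widths = []
    for i, w in enumerate(base_widths):
        factor = math.sqrt(s) if i >= first_near else s
        widths.append(max(1, round(w * factor)))
    return widths


def crnn_params(conv_widths, gru_widths, n_outputs=N_OUTPUTS):
    """Total parameters of a conv stack, a GRU stack, and the fixed output layer."""
    total, c_in = 0, 1  # one input channel (the mel-spectrogram)
    for c_out in conv_widths:
        total += conv2d_params(c_in, c_out)
        c_in = c_out
    n_in = c_in  # assumes the frequency axis has been pooled down to 1
    for n_hidden in gru_widths:
        total += gru_params(n_in, n_hidden)
        n_in = n_hidden
    return total + dense_params(n_in, n_outputs)


def params_at(s, base_conv, base_gru, n_near_output=0):
    """Parameter count of the network after scaling the base widths by s."""
    widths = scaled_widths(base_conv + base_gru, s, n_near_output)
    n_conv = len(base_conv)
    return crnn_params(widths[:n_conv], widths[n_conv:])


def find_scale(target, base_conv, base_gru, n_near_output=0, lo=0.01, hi=20.0):
    """Bisect for a scale factor whose parameter count reaches the target."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if params_at(mid, base_conv, base_gru, n_near_output) < target:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical base widths: four conv layers and two GRU layers.
BASE_CONV = [32, 64, 64, 64]
BASE_GRU = [32, 32]

for target in (100_000, 250_000, 500_000, 1_000_000, 3_000_000):
    s = find_scale(target, BASE_CONV, BASE_GRU)
    print(target, round(s, 3), params_at(s, BASE_CONV, BASE_GRU))
```

Passing `n_near_output=2` switches to the sqrt(s) variant for the last two layers, so the GRU widths move less than the conv widths as the budget changes.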


3. Experiments

  • MSD (Million Song Dataset) is used; the top-50 tags are predicted.
  • 29 s sub-segments, down-sampled to 12 kHz, 96-bin mel-spectrograms
  • Keras, Theano, librosa.
  • AUC-ROC for evaluation
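For readers who want to see what the AUC-ROC metric measures, here is a dependency-free sketch. The function names `auc_roc` and `mean_tag_auc` and the toy labels and scores are made up for illustration, and averaging per tag is an assumed choice. For real evaluation, a library routine such as `roc_auc_score` in scikit-learn is the practical option.

```python
def auc_roc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_tag_auc(label_rows, score_rows):
    """Average AUC over tags; rows are tracks, columns are tags."""
    n_tags = len(label_rows[0])
    aucs = []
    for t in range(n_tags):
        labels = [row[t] for row in label_rows]
        scores = [row[t] for row in score_rows]
        aucs.append(auc_roc(labels, scores))
    return sum(aucs) / n_tags, aucs


# Toy example: 4 tracks, 2 tags (made-up numbers).
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.3, 0.6], [0.4, 0.1], [0.5, 0.3]]

mean_auc, per_tag = mean_tag_auc(toy_labels, toy_scores)
print("per-tag AUC:", per_tag)
print("mean AUC:", mean_auc)
```

The pairwise loop is quadratic in the number of tracks, so it is only meant to show the definition, not to score a full test set.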

3.1 Memory-controlled experiment


  • CRNN > Conv2D > Conv1D, except at 3.0M parameters
    • CRNN > Conv2D: RNN rocks. Even with narrower conv layers, CRNN shows better performance.
      • Convolution and max-pooling are quite simple and static, while recurrent layers are flexible in summarising the features. Actually, we don’t know exactly how they should summarise the information. We can only say the RNN seems better at it. Well, considering its strong performance in sentence summarisation, it’s not surprising.
    • Conv2D > Conv1D
      • Fully-connected layers don’t behave better: they overfit easily and take a large number of parameters. It looks like the gradual sub-sampling over the time and frequency axes in Conv2D works well.
      • There can be many variants between them. E.g., why not mix 1D conv layers + 2D sub-sampling? It’s out of scope here though.

3.2 Computation-controlled experiment


  • Same results, but plotted in the time-AUC domain
  • For training time < 150, CRNN is the best.
  • Conv2D (3M-parameter) is better for time > 150
  • Okay, [DISCLAIMER]
    • As described in the paper, I waited much longer for the 3M-parameter structures because I wanted them to show their best performance as reference structures.
    • They were trained MUCH LONGER than the structures with #parameters <= 1M.
    • Therefore it’s not fair to compare the others to the 3M-parameter structures.

3.3 Performance per tag


Fig. 3: AUCs of 1M-parameter structures. i) The average AUCs over all samples are plotted with dashed lines. ii) AUC of each tag is plotted using a bar chart and line. For each tag, red line indicates the score of Conv2D which is used as a baseline of bar charts for Conv1D (blue) and CRNN (green). In other words, blue and green bar heights represent the performance gaps, Conv2D-Conv1D and CRNN-Conv2D, respectively. iii) Tags are grouped by categories (genre/mood/instrument/era) and sorted by the score of Conv2D. iv) The number in parentheses after each tag indicates that tag’s popularity ranking in the dataset.

  • AUCs per tag (1M-parameter structures)
    • CRNN > Conv2D for all
    • Conv2D > Conv1D for 47/50
  • Let’s frame the tagging problem as a multi-task problem.
    • A better structure for task A is a better structure for the other tasks as well.
  • Tag popularity (= number of occurrences of each tag) is not correlated with tag performance. Therefore the differences in performance are mostly not due to popularity bias.
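One way to check a claim like "popularity is not correlated with performance" is a rank correlation between each tag's popularity ranking and its AUC. The sketch below implements Spearman's rank correlation in plain Python. The five popularity ranks and AUC values are made-up toy numbers, not results from the paper, and the post does not say which correlation measure (if any) was actually computed.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data (hypothetical): popularity ranking and AUC for five tags.
popularity_rank = [1, 2, 3, 4, 5]
tag_auc = [0.85, 0.92, 0.80, 0.90, 0.88]

rho = spearman(popularity_rank, tag_auc)
print("Spearman rho:", round(rho, 3))
```

A rho near zero would be in line with the observation above, while a value near +1 or -1 would point to a popularity effect.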

4. Conclusion

  • There is a trade-off between speed and memory.
  • Either Conv2D or CRNN can be used, depending on the circumstances.

5. GitHub provides the Conv2D and CRNN structures and pre-trained weights. Their performance is better than in the paper because I didn’t use early stopping for this repo and waited like forever.

Oh, and again, here’s a link to the paper.


19 thoughts on “paper is out; Convolutional Recurrent Neural Networks for Music Classification”

  1. can you pls share your evaluation code? I tried to reproduce your model with your weights on magnatagatune dataset in torch. But my AUC scores are lesser ..


    1. Hi, I just used roc_auc_score in scikit-learn. I only shared weights on MSD, not MagnaTagATune and the tag sets are not identical. How did you use it for MTT?


      1. Oh, even I use the metric in scikit-learn. Yea tags are not same.. I just initialized your weights and started finetuning on MTT.


      1. I’m doing my master’s thesis and I want to test the performance on specific tags.. So I just wanted to replace few tags in your network and fine tune..


  2. Yea technically either weights should work but auc diverges after 2 MTT epoch.. well yes I ll try out the compact_cnn

