A self-critique of my ISMIR paper “Automatic Tagging Using Deep Convolutional Neural Networks”

In my 2016 ISMIR paper (Paper is out: Automatic tagging using deep convolutional neural networks), I applied deep convolutional networks to music tagging. It’s been 10 months since I wrote the paper, and I have since realised that I made many mistakes and less-than-ideal design choices, which I’d like to share. (Yes, I am writing a report and it’s long and boring.)

  • Number of feature maps

I used 128 to 2048 feature maps from the first to the last layer. That is absolutely too many. No doubt. It is redundant when the number of output nodes is only 50. (Even for ImageNet, which has 1000 output nodes, 2048 could be too many.)

With only 32 feature maps in all 5 layers, I got similar performance. To be safe, 64 would be fine. But not 2048.
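To get a feel for how much smaller the narrow network is, here is a rough parameter count for the convolutional layers only. It is a sketch under assumptions that are not stated above: 3x3 kernels, a single input channel, and a 128, 256, 512, 1024, 2048 progression for the wide network.

```python
def conv_params(widths, in_channels=1, kernel=(3, 3)):
    """Count weights and biases in a plain stack of 2D conv layers."""
    kh, kw = kernel
    total = 0
    prev = in_channels
    for out in widths:
        total += kh * kw * prev * out + out  # kernel weights + one bias per map
        prev = out
    return total


wide = [128, 256, 512, 1024, 2048]  # assumed progression from 128 to 2048
narrow = [32] * 5                   # 32 feature maps in all 5 layers

n_wide = conv_params(wide)
n_narrow = conv_params(narrow)
print(f"wide: {n_wide:,}  narrow: {n_narrow:,}  ratio: {n_wide / n_narrow:.0f}x")
```

Under these assumptions the wide stack has several hundred times as many convolutional parameters as the narrow one.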

  • Split setting

I released the split setting (release; Million Song Dataset split setting that I used) so that the experiment can be reproduced, which I believe is good. What’s not so cool is that it is probably not the best possible split.

The problem is that, for a multi-label task, a stratified split is not easy. Perhaps someone could implement this paper, which splits the data iteratively over labels, in scikit-learn?

..until then let’s use the same setting; at least we get reproducibility 😉
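For illustration, this is a simplified greedy sketch of the general idea of splitting iteratively over labels, rarest label first. It is not the referenced paper's exact algorithm and not a scikit-learn API; the function name `iterative_split` and the toy data are made up.

```python
def iterative_split(label_sets, ratios):
    """Greedy multi-label stratified split (simplified sketch).

    label_sets: list of sets of labels, one set per example.
    ratios: fraction of the data wanted in each split, e.g. [0.8, 0.2].
    Returns a list holding the split index assigned to each example.
    """
    n_splits = len(ratios)
    all_labels = set().union(*label_sets)
    # How many examples each split still wants, overall and per label.
    want_total = [r * len(label_sets) for r in ratios]
    want_label = {
        lab: [r * sum(lab in s for s in label_sets) for r in ratios]
        for lab in all_labels
    }
    assignment = [None] * len(label_sets)
    remaining = set(range(len(label_sets)))

    def assign(i, k):
        assignment[i] = k
        remaining.discard(i)
        want_total[k] -= 1
        for lab in label_sets[i]:
            want_label[lab][k] -= 1

    while remaining:
        # Count unassigned examples per label.
        counts = {}
        for i in remaining:
            for lab in label_sets[i]:
                counts[lab] = counts.get(lab, 0) + 1
        if not counts:
            # Only unlabeled examples are left: just balance split sizes.
            for i in sorted(remaining):
                assign(i, max(range(n_splits), key=lambda k: want_total[k]))
            break
        # Handle the rarest remaining label first.
        rare = min(counts, key=lambda lab: (counts[lab], lab))
        for i in sorted(j for j in remaining if rare in label_sets[j]):
            # Send the example to the split that most needs this label,
            # breaking ties by the split that most needs examples overall.
            k = max(range(n_splits),
                    key=lambda k: (want_label[rare][k], want_total[k]))
            assign(i, k)
    return assignment


# Toy data: tag "a" on all 10 tracks, tag "b" on the first 5 only.
tracks = [{"a", "b"}] * 5 + [{"a"}] * 5
splits = iterative_split(tracks, [0.8, 0.2])
print(splits)
```

On the toy data the 20% split receives two tracks, one of which carries the rarer tag "b", so both tags keep roughly the same proportion in each split.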

  • Dropout? Batch normalization? – too much noise will kill you.

Dropout still helps convnets, but it is not as critical as it used to be (read these), and I think 0.5 was too large. Towards the end of training, it only makes it hard to decide when to stop. These days I rely on batch normalization + early stopping, which seems more stable.
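A minimal sketch of patience-based early stopping, independent of any deep learning framework. The class name, the `patience` and `min_delta` parameters, and the loss values are made up for illustration.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, purely for illustration.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.64, 0.66, 0.65, 0.66]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

The noisier the validation curve, the larger `patience` has to be before the stopping point is trustworthy, which is one way to see why a heavy dropout rate makes the decision harder.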

  • Could have used zero-padding

..so that none of the layers discards information. The pooling schemes that I used sometimes discard some edges. It wouldn’t be critical, though.
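The edge loss is simple arithmetic: non-overlapping pooling without padding keeps only whole windows and drops the remainder. A small sketch, where the length 1366 and the pool size 4 are illustrative numbers rather than values taken from the text above:

```python
import math


def pooled_length(n, pool, pad_to_multiple=False):
    """Return (output length, input columns discarded) for non-overlapping pooling."""
    if pad_to_multiple:
        # Zero-pad up to the next multiple of `pool`, so nothing is dropped.
        return math.ceil(n / pool), 0
    # Without padding, the last n % pool columns never enter any window.
    return n // pool, n % pool


print(pooled_length(1366, 4))                        # no padding
print(pooled_length(1366, 4, pad_to_multiple=True))  # with zero-padding
```

The discarded columns are always fewer than the pool size per layer, which fits the remark that the loss is not critical.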

  • It is not really fully convolutional

because, in essence, the output layer is fully connected to the last Nx1x1 feature map. Using average pooling in a truly fully-convolutional setting should work, though I haven’t tried it with MSD tagging.
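A toy sketch of the average-pooling alternative: average each of the N channels over however many positions remain, then apply the linear output layer. The function name, weights, and activations are made up, and the final sigmoid is omitted.

```python
def tag_scores(feature_map, weights):
    """Global average pooling followed by a linear output layer.

    feature_map: list of N channels, each a list of activations over
                 any number of positions.
    weights: one list of N weights per tag.
    """
    pooled = [sum(ch) / len(ch) for ch in feature_map]  # one value per channel
    return [sum(w * p for w, p in zip(row, pooled)) for row in weights]


weights = [[1.0, 0.0], [0.5, 0.5]]             # 2 tags, N = 2 channels
short_map = [[2.0], [4.0]]                     # N x 1, like the Nx1x1 case
long_map = [[1.0, 3.0, 2.0], [4.0, 4.0, 4.0]]  # N x 3, from a longer input

print(tag_scores(short_map, weights))
print(tag_scores(long_map, weights))
```

Because the pooling step removes the dependence on the number of positions, the same weights apply to inputs of any length, which is what a fully-connected layer tied to an Nx1x1 map cannot do.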

  • A bug

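I happened to use log(power-power-melspectrogram), i.e. the log of a power mel-spectrogram that had been squared once more, with an 80-dB dynamic range limit. Since squaring doubles every value in dB, this made it a 40-dB dynamic range input representation.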
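A quick check of the arithmetic, assuming the bug amounts to squaring the power values before taking the log. The helper names and the sample power value are made up.

```python
import math


def db(x):
    """Decibels of a power-like quantity."""
    return 10.0 * math.log10(x)


def effective_range_db(clip_db, exponent=2):
    """Dynamic range in true power dB when clipping is applied to power**exponent."""
    return clip_db / exponent


power = 1e-3  # a made-up mel-band power value
doubled = math.isclose(db(power ** 2), 2 * db(power))  # squaring doubles the dB value
print(doubled, effective_range_db(80))
```

So an 80-dB limit applied to the squared values covers only 40 dB of the actual power spectrogram.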

  • FCN-6 and FCN-7 are pointless

They are. I just wanted to add more experiments..


That’s it. Please keep these problems in mind if you’re doing more than just reading the paper!


