Paper is out; Transfer learning for music classification and regression tasks, and behind the scene, negative results, etc.

1. Summary

A catch for MIR:

For ML:

So what does it do?


A Convnet is trained.

Then we extract the knowledge like above, and use it for…

  • genre classification
  • emotion prediction (on the arousal-valence plane)
  • speech/music classification
  • vocal/non-vocal excerpt classification
  • audio event classification

Is it good?


What’s special in this work?

It uses (up to) ALL layers, not just the last N layers. This is partly because it’s music; I’m not sure what would happen if we did the same for images. I looked for similar approaches and found nothing.
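To make the "all layers" idea concrete, here is a minimal sketch, assuming each layer's activation is average-pooled to one value per channel and the pooled vectors are then concatenated. The function name, the number of layers, the 32 channels, and the feature-map shapes are all hypothetical, and random arrays stand in for real convnet activations.

```python
import numpy as np

def concat_layer_features(feature_maps):
    """Average-pool each layer's activation and concatenate the results.

    feature_maps: list of arrays shaped (channels, freq, time), one per layer.
    Returns a 1-D vector with one value per channel per layer.
    """
    pooled = [fm.mean(axis=(1, 2)) for fm in feature_maps]
    return np.concatenate(pooled)

# Hypothetical activations: 5 layers with 32 channels each (assumed sizes).
rng = np.random.default_rng(0)
shapes = [(48, 340), (24, 170), (12, 56), (4, 14), (1, 1)]
maps = [rng.standard_normal((32, f, t)) for f, t in shapes]

all_layers = concat_layer_features(maps)       # uses every layer
last_only = concat_layer_features(maps[-1:])   # uses only the last layer
print(all_layers.shape, last_only.shape)
```

Passing a subset of `maps` gives the "last N layers" variant, so the two strategies can be compared with the same downstream classifier.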



2. Behind the scene

Q. Did you really select the tagging task because,

“Music tagging is selected as a source task because i) large training data is accessible for this task and ii) its label set covers various aspects of music, i.e., genre, mood, era, and instrumentations”?

You’ve been doing tagging for a year+ and you must be just looking to re-use your weights!

A. NOOOOOOO… about 1.5 years ago, music recommendation was still part of my PhD plan and I thought I’d do content-based music tagging ‘briefly’ and use the result for recommendation.

…Although I was forgetting that part for a long while.


Q. What else did you try?

A. I tried similarity prediction using the 5th-layer feature and the similarity data in MSD (Million Song Dataset). Similarity was estimated using the cosine distance between two song vectors. The result is:

[Figure: plot of the similarity prediction result (Screen Shot 2017-03-28 at 16.33.12)]

There are 18M similarity pairs, and this plot uses 1K samples of them. The correlation value is something like 0.08, -0.05,… totally uncorrelated.
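A rough sketch of this kind of check, assuming the predicted similarity is the cosine similarity between the two songs' feature vectors (cosine distance is one minus this) and that it is compared to the ground truth with a Pearson correlation. The data here is random and purely illustrative; the helper names and the 32-dimensional features are assumptions, not the original experiment.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two arrays of shape (n_pairs, dim)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dots = np.sum(a * b, axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return dots / norms

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])

# Illustrative stand-ins: 1K sampled pairs with made-up features and labels.
rng = np.random.default_rng(0)
feats_a = rng.standard_normal((1000, 32))
feats_b = rng.standard_normal((1000, 32))
groundtruth = rng.uniform(0.0, 1.0, size=1000)

predicted = cosine_similarity(feats_a, feats_b)
print(pearson(predicted, groundtruth))
```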

I suspect that the similarity in the ground truth is not much about audio similarity, although I couldn’t find out how the similarity was gathered. I also tested K-NN on the convnet features to see if the neighbours really sound similar, and (ok, it’s subjective) they certainly do!

Hopefully there’s a similarity dataset that focuses on the sound rather than the social/cultural aspects of songs, so that I can test it again.

… ok actually I hope someone else finds it and does it =)

I tried something else, too: measuring them by precision@k, as below.

First, collect similar pairs of songs from the ground-truth data (you can threshold the similarity). Let’s call these pairs “similar pairs at similarity P”. Something like P=0.8 should be used.

Then, for each of these similar pairs (A,B), you will do the following:

(1) Find the K most similar songs to A according to cosine similarity
(2) If B is included in this set, it’s correct.
(3) Do the same for B as well.
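The steps above can be sketched as follows. This is a minimal illustration, not the original script: the function name is hypothetical, the query song is assumed to be excluded from its own neighbour list, and the score is assumed to be the fraction of (query, target) checks that succeed.

```python
import numpy as np

def precision_at_k(features, pairs, k):
    """Fraction of similar pairs (A, B) where B is in A's top-k neighbours.

    features: array of shape (n_songs, dim).
    pairs: iterable of (index_a, index_b) taken from the ground truth.
    Each pair is checked in both directions, as in steps (1) to (3).
    """
    features = np.asarray(features, dtype=float)
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    k = min(k, len(unit) - 1)  # cannot return more neighbours than exist
    hits, total = 0, 0
    for a, b in pairs:
        for query, target in ((a, b), (b, a)):
            sims = unit @ unit[query]   # cosine similarity to every song
            sims[query] = -np.inf       # assumption: ignore the query itself
            top_k = np.argpartition(-sims, k - 1)[:k]
            hits += int(target in top_k)
            total += 1
    return hits / total

# Toy example with four songs forming two obvious clusters.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(precision_at_k(toy, [(0, 1), (2, 3)], k=1))
print(precision_at_k(toy, [(0, 2)], k=1))
```

For a large collection, computing `unit @ unit[query]` for every query is slow; an approximate nearest-neighbour index would be a natural replacement.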


checking 139684 pairs, threshold=0.999 (out of 18M pairs)

precision: 0.0105667077117 for 139684 pairs at k=10
precision: 0.0416654734973 for 139684 pairs at k=100
precision: 0.115367543885 for 139684 pairs at k=1000

checking 292255 pairs, threshold=0.8 (out of 18M pairs)

precision: 0.00903662897128 for 292255 pairs at k=10
precision: 0.0364476227952 for 292255 pairs at k=100
precision: 0.106280474243 for 292255 pairs at k=1000

In short, it doesn’t work.


Q. Why is it in the ICLR template?

A. Don’t you like it!??


Q. Why didn’t you add the result using ELM?

A. I will do it in another paper, which will include some of these transfer learning results.


That’s it. Again, please also enjoy the paper and the code.


EDIT: 30 June 2017 – It’s updated to v2, the camera-ready version for ISMIR 2017.


7 thoughts on “Paper is out; Transfer learning for music classification and regression tasks, and behind the scene, negative results, etc.”

    1. Because SVMs have been (I think) the most popular classifiers in MIR work, and using one makes it easier to compare only the feature sets, removing the classifier dependency. It’s good enough, though it might not be the best.


      1. Could you give some intuition on how your model extracts features when the sound files are all of different lengths? For example, if I train my model with 1000 files of 4-5 sec and 1000 files of 20-30 sec, does it affect the classification accuracy badly? I would like to know your experience with such issues.


      2. It is actually in the paper: I extended the short signals by repeating them to make a 30s signal. It’s due to the average pooling at the end of the feature extractors.
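A minimal sketch of that repetition step, assuming the audio is a 1-D NumPy array. The function name and the sample rate used in the example are hypothetical, and truncating signals longer than the target is an added assumption, not something stated above.

```python
import numpy as np

def repeat_to_length(signal, sr, target_seconds=30.0):
    """Tile a short signal until it fills target_seconds, then cut to length."""
    signal = np.asarray(signal)
    if signal.size == 0:
        raise ValueError("cannot repeat an empty signal")
    target_len = int(round(sr * target_seconds))
    n_repeats = -(-target_len // len(signal))  # ceiling division
    return np.tile(signal, n_repeats)[:target_len]

# Example: a 4-second clip at an assumed 12 kHz sample rate becomes 30 seconds.
sr = 12000
clip = np.random.default_rng(0).standard_normal(4 * sr)
extended = repeat_to_length(clip, sr)
print(len(extended) / sr)
```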

