1. A catch for MIR
So what does it do?
A convnet is trained on the source task (music tagging).
Then we extract the knowledge, as above, and use it for… (a rough sketch of the whole pipeline follows this list)
- genre classification
- emotion prediction (on the arousal-valence plane)
- speech/music classification
- vocal/non-vocal excerpt classification
- audio event classification
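For concreteness, here is the promised sketch of that pipeline. Everything below is hypothetical (random vectors stand in for the convnet features, and the sizes are made up); it only shows the shape of the approach: frozen features in, shallow classifier on top.

```python
# Minimal transfer-learning sketch: frozen convnet features + a shallow
# classifier. Random vectors stand in for the real convnet features.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_tracks, dim = 200, 160                 # hypothetical sizes
X = rng.normal(size=(n_tracks, dim))     # one feature vector per track
y = rng.integers(0, 10, size=n_tracks)   # target-task labels, e.g. 10 genres

# Only this classifier is trained; the convnet features stay fixed.
clf = SVC(kernel='rbf')
print(cross_val_score(clf, X, y, cv=5).mean())
```

For the regression-style targets (e.g. arousal-valence), swap the labels for continuous values and the SVC for a regressor.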
Is it good?
What’s special in this work?
It uses (up to) ALL the layers, not just the last N layers. That's partly because it's music; I'm not sure what would happen if we did the same for images. I tried to find a similar approach elsewhere, but found nothing.
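To illustrate what "all layers" means here, a rough Keras sketch (layer counts, filter sizes, and input shape are placeholders, not the paper's exact architecture): global-average-pool every conv layer's activation and concatenate them into one feature vector.

```python
# "All layers" feature extractor sketch: pool each conv layer's output
# and concatenate, instead of keeping only the last layer.
from tensorflow.keras import Model
from tensorflow.keras.layers import (Concatenate, Conv2D,
                                     GlobalAveragePooling2D, Input,
                                     MaxPooling2D)

inp = Input(shape=(96, 1366, 1))               # e.g. a mel-spectrogram
x, pooled = inp, []
for _ in range(5):                             # 5 conv layers as a placeholder
    x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
    x = MaxPooling2D((2, 4))(x)
    pooled.append(GlobalAveragePooling2D()(x)) # one 32-dim vector per layer

extractor = Model(inp, Concatenate()(pooled))  # 5 x 32 = 160-dim feature
print(extractor.output_shape)                  # (None, 160)
```

Early layers tend to carry low-level information and late layers more task-specific information, so concatenating keeps both on the table for the target task.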
2. Behind the scenes
Q. Did you really select the tagging task because, as the paper puts it, "Music tagging is selected as a source task because i) large training data is accessible for this task and ii) its label set covers various aspects of music, i.e., genre, mood, era, and instrumentations"? You've been doing tagging for a year+ and you must just be looking to re-use your weights!
A. NOOOOOOO… About 1.5 years ago, music recommendation was still part of my PhD plan, and I thought I'd do content-based music tagging 'briefly' and use the result for recommendation.
…Although I'd forgotten about that part for a long while.
Q. What else did you try?
A. I tried similarity prediction using the 5th-layer feature and the similarity data in the MSD. Similarity was estimated as the cosine distance between two song vectors. The result:
There are 18M pairs of similarities, and the plot uses 1K samples of them. The correlation is something like 0.08, -0.05… totally uncorrelated.
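For reference, a minimal sketch of that check, with random stand-ins for both the 5th-layer features and the MSD similarity scores (everything here is fake data; it only shows the computation):

```python
# Cosine similarity of feature pairs vs. ground-truth similarity.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 32))      # stand-in 5th-layer features

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

pairs = rng.integers(0, len(feats), size=(1000, 2))  # 1K sampled pairs
pred = np.array([cosine_sim(feats[i], feats[j]) for i, j in pairs])
gt = rng.random(len(pairs))              # stand-in ground-truth scores
print(pearsonr(pred, gt)[0])             # the numbers above were ~0.08 / -0.05
```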
I suspect the similarity in the ground truth is not really about audio similarity, although I couldn't find out how the similarity was gathered. Plus, I tested K-NN on the convnet features to see if the neighbours really sound similar, and (OK, it's subjective) they certainly do!
Hopefully there's a similarity dataset that focuses on the sound rather than the social/cultural aspects of songs, so that I can test this again.
…OK, actually I hope someone else finds one and does it =)
I tried something else, too: measuring it by precision@k, as below (a small code sketch follows the numbers).
First, collect similar pairs of songs from the ground-truth data (you can threshold the similarity). Let's call these pairs "similar pairs at similarity P". Something like P=0.8 should be used.
Then, for each of these similar pairs (A,B), you will do the following:
(1) Find the K most similar songs to A according to the cosine similarity.
(2) If B is included in this set, it counts as correct.
(3) Do the same for B.
```
checking 139684 pairs, threshold=0.999 (out of 18M pairs)
precision: 0.0105667077117 for 139684 pairs at k=10
precision: 0.0416654734973 for 139684 pairs at k=100
precision: 0.115367543885 for 139684 pairs at k=1000

checking 292255 pairs, threshold=0.8 (out of 18M pairs)
precision: 0.00903662897128 for 292255 pairs at k=10
precision: 0.0364476227952 for 292255 pairs at k=100
precision: 0.106280474243 for 292255 pairs at k=1000
```
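And here is the promised sketch of that precision@k check (fake data, made-up names; only the procedure matches the description above):

```python
# precision@k: for each ground-truth similar pair (A, B), check whether
# B is among A's k nearest neighbours by cosine similarity, and vice versa.
import numpy as np

rng = np.random.default_rng(0)
n_songs, k = 500, 10
feats = rng.normal(size=(n_songs, 32))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)  # unit-normalise

pairs = [(i, (i + 1) % n_songs) for i in range(100)]   # stand-in similar pairs

sim = feats @ feats.T                # cosine similarity matrix
np.fill_diagonal(sim, -np.inf)       # a song is not its own neighbour

hits = 0
for a, b in pairs:
    hits += b in np.argsort(sim[a])[-k:]   # is B among A's top k?
    hits += a in np.argsort(sim[b])[-k:]   # and A among B's top k?

print(hits / (2 * len(pairs)))       # precision@k over both directions
```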
In short, it doesn’t work.
Q. Why is it on the ICLR template?
A. Don’t you like it!??
Q. Why didn’t you add the result using ELM?
A. I will do it in another paper, which will include some of these transfer learning results.
EDIT: 30 June 2017 – The paper has been updated to v2, the camera-ready version for ISMIR 2017.