It’s a revision of this paper. It’s a major revision, so major changes! I’ll only take notes on the new stuff.
The groundtruth in the tagging dataset is very noisy. The recall and precision here are our (estimated) evaluation of the groundtruth itself. Yeah, they're pretty low, and we still call it 'groundtruth'… Which, of course, hurts the performance of the models trained on it.
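To make the idea of a groundtruth having its own precision and recall concrete, here's a toy simulation. All the numbers (tag prevalence, miss rate, spurious-tag rate) are made up for illustration, not taken from the paper:

```python
import random

random.seed(0)

n = 100_000        # number of (track, tag) pairs (assumed)
p_true = 0.2       # fraction of pairs where the tag truly applies (assumed)
p_miss = 0.5       # chance an annotator misses a true tag (assumed)
p_spurious = 0.05  # chance a wrong tag gets added anyway (assumed)

# the "real" labels, which we never get to see in practice
true_labels = [random.random() < p_true for _ in range(n)]

# the crowd-sourced "groundtruth": true tags are often missed,
# and a few spurious tags sneak in
noisy_labels = [
    (t and random.random() >= p_miss) or ((not t) and random.random() < p_spurious)
    for t in true_labels
]

# score the noisy groundtruth against the real labels
tp = sum(t and y for t, y in zip(true_labels, noisy_labels))
fp = sum((not t) and y for t, y in zip(true_labels, noisy_labels))
fn = sum(t and (not y) for t, y in zip(true_labels, noisy_labels))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"precision of the 'groundtruth': {precision:.2f}")
print(f"recall of the 'groundtruth':    {recall:.2f}")
```

Even with a modest miss rate and a small spurious-tag rate, the 'groundtruth' itself scores far from perfect, which is exactly the situation with crowd-sourced tags.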
The good thing is that the trend doesn't change no matter which groundtruth we use, either the provided one or our re-annotation.
It's a figure from Convolutional Recurrent Neural Networks for Music Classification, where I couldn't figure out why there were such differences in per-tag performance. Well, I think I know at least one of the reasons now. More noise on tag A → more confusion for the network (whatever its exact structure is) → lower performance.
Why don't we try to explain another tag category from the same perspective? In the dataset, 90s and 00s are the majority (84%), but they probably don't get tagged properly, at least not as well as 60s/70s/80s, because come on, you're in 2010 listening to 00s music. Why would you tag it? You're much more likely to tag 60s/70s/80s music, because doing so actually adds information. As a result, the older-decade tags get less noise, hence higher performance. Yes, this is our guess.
OK, so with such a corrupted groundtruth, we know what happens when we use it for training. What happens when we use it for evaluation?
(a)(b): OK, it's fine.
(c): no, it's not so fine when the differences between the systems are subtle. Which is obvious, because at some point the noise in the evaluation exceeds the system-wise differences.
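To see how evaluation noise can swamp a subtle difference between systems, here's another toy sketch. Two hypothetical systems with true accuracies of 0.80 and 0.82 are scored against labels that are wrong 20% of the time (all numbers assumed for illustration, not from the paper):

```python
import random

random.seed(0)

n_eval = 2000    # evaluation set size (assumed)
flip = 0.2       # chance an evaluation label is wrong (assumed)
acc_a, acc_b = 0.80, 0.82  # true accuracies of two hypothetical systems

def measured_acc(true_acc):
    """Score a system against labels that are wrong with probability `flip`."""
    correct = 0
    for _ in range(n_eval):
        system_right = random.random() < true_acc
        label_right = random.random() >= flip
        # the system only gets credit when it agrees with the (possibly wrong) label
        correct += system_right == label_right
    return correct / n_eval

for trial in range(5):
    gap = measured_acc(acc_b) - measured_acc(acc_a)
    print(f"trial {trial}: measured gap = {gap:+.3f}")
```

The true gap is 0.02, but label noise both shrinks the expected measured gap (by a factor of 1 − 2·flip) and adds variance, so on a moderate-sized eval set the measured gap can even come out negative, i.e., the ranking of the two systems flips.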
That’s it. Please go read it if it sounds interesting! arXiv link here.