In this paper I talk a lot about music tagging, a lot about audio preprocessing, and a bit about analysing the trained network, which relates back to the music tagging problem.
Music tagging dataset groundtruth is so wrong
Yeah, because of the weak labelling it's quite incorrect. But by how much?
..about this much — I manually annotated 4 labels on 500 songs.
Gosh, 70% error? No worries though; it's a sort of 'weakly-supervised learning' situation, where with enough data it's fine.
But how fine is it, exactly — in evaluation?
red dotted: evaluation of the four instrument tags with my annotation.
blue dashed: evaluation of the four instrument tags with the MSD groundtruth.
yellow solid: evaluation of all tags with the MSD groundtruth.
Corr(red, blue) = how fine it is to use the MSD groundtruth for the 4 tags.
Corr(blue, yellow) = how fine it is to generalise the 4-tag result to the all-tag result.
And as you can see: well, I'd say this is fine. For sure there's error, which can be significant when the difference is subtle.
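A minimal sketch of how such a correlation could be computed, assuming the 'red' and 'blue' curves are simply vectors of evaluation scores (the numbers below are made up for illustration, not from the paper):

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation between two score vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc) + 1e-12))

# Hypothetical AUC curves under the two groundtruths
auc_my_annotation = [0.60, 0.65, 0.70, 0.72, 0.74]    # 'red'
auc_msd_groundtruth = [0.55, 0.59, 0.66, 0.69, 0.70]  # 'blue'
print(pearson_corr(auc_my_annotation, auc_msd_groundtruth))
```

A high correlation here means the MSD groundtruth ranks systems roughly the same way the manual annotation does, even if its absolute labels are noisy.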
X vs log(X) if X in [spectrogram, melgram, cqt, …]
tl;dr: use log(X). See this distribution!
Or, see how disadvantageous it is without log(): you need roughly 2x the data, which is a lot.
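A tiny sketch of the log compression I mean; the `eps` constant is my own assumption to avoid log(0) (decibel scaling would work similarly):

```python
import numpy as np

def log_compress(spec, eps=1e-7):
    """Log compression of a magnitude spectrogram; eps avoids log(0)."""
    return np.log(spec + eps)

# Toy magnitude spectrogram: heavy-tailed, like real audio energy
rng = np.random.default_rng(0)
S = rng.lognormal(mean=0.0, sigma=2.0, size=(128, 100))

print(S.std() / S.mean())       # raw values: strongly skewed
print(log_compress(S).std())    # after log: far tamer spread
```

The point is visible in the spread: raw spectrogram magnitudes are heavily skewed, while the log-compressed values are much closer to a well-behaved distribution for a network to learn from.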
Spectral whitening (per-frequency standardisation)? A-weighting? Should I do some special normalisation for a better result?
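For the per-frequency standardisation option, a minimal sketch (my assumed convention: input shaped frequency x time, statistics taken over the time axis):

```python
import numpy as np

def per_freq_standardise(log_spec):
    """Zero-mean, unit-std per frequency bin, over the time axis."""
    mean = log_spec.mean(axis=1, keepdims=True)
    std = log_spec.std(axis=1, keepdims=True) + 1e-8  # avoid divide-by-zero
    return (log_spec - mean) / std

# Toy log-spectrogram: 128 frequency bins, 200 time frames
rng = np.random.default_rng(0)
log_S = rng.normal(loc=-3.0, scale=1.5, size=(128, 200))
whitened = per_freq_standardise(log_S)
```

After this, every frequency bin contributes on the same scale, which is roughly what "spectral whitening" asks for.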
How similar are music tags, according to the trained convnet?
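One possible way to ask this question (an assumption on my part, not necessarily the method in the paper): compare tags by the cosine similarity of their weight columns in the network's final dense layer.

```python
import numpy as np

def tag_similarity(W):
    """Cosine similarity between columns of the last dense layer's
    weight matrix W (features x tags); entry [i, j] compares tags i and j."""
    Wn = W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-12)
    return Wn.T @ Wn

# Hypothetical last-layer weights: 64 penultimate features, 4 tags
rng = np.random.default_rng(1)
W = rng.normal(size=(64, 4))
S = tag_similarity(W)
```

Tags whose weight vectors point in similar directions are, from the network's point of view, predicted from similar evidence, which is one reasonable notion of tag similarity.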