Preprint is on arXiv: Explaining deep convolutional neural networks on music classification

arXiv link here. I'm probably not going to work on this issue any further unless I find a computational way to analyse learned features. Networks keep getting bigger and bigger these days, and I might need to fly to the moon to have enough time to play among the features.

The abstract is as follows:

Deep convolutional neural networks (CNNs) have been actively adopted in the field of music information retrieval, e.g. genre classification, mood detection, and chord recognition. However, the process of learning and prediction is little understood, particularly when it is applied to spectrograms. We introduce auralisation of a CNN to understand its underlying mechanism, which is based on a deconvolution procedure introduced in [2]. Auralisation of a CNN is converting the learned convolutional features that are obtained from deconvolution into audio signals. In the experiments and discussions, we explain trained features of a 5-layer CNN based on the deconvolved spectrograms and auralised signals. The pairwise correlations per layers with varying different musical attributes are also investigated to understand the evolution of the learnt features. It is shown that in the deep layers, the features are learnt to capture textures, the patterns of continuous distributions, rather than shapes of lines.

  • To learn what deconvolution is: check out this paper, which proposed deconvolution for visualising the learnt features of CNNs for image recognition.
  • To learn what auralisation is and how it is done: see Section 3 of my paper.
  • To listen to the learned features: see my previous post.
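
For a rough idea of the resynthesis step, here is a minimal sketch. It is not the code from the paper. It assumes the procedure that Reviewer 3 summarises below: take the magnitude of a deconvolved spectrogram, add the phase of the original signal's STFT, and invert the result back to audio. The function names (`stft`, `istft`, `auralise`), the Hann window, and the `n_fft` and `hop` values are illustrative assumptions, and the deconvolution itself is not shown.

```python
import numpy as np


def stft(x, n_fft=512, hop=128):
    """Short-time Fourier transform with a Hann window; returns (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=512, hop=128):
    """Inverse STFT by windowed overlap-add, normalised by the summed squared window."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        start = i * hop
        out[start:start + n_fft] += frames[i] * window
        norm[start:start + n_fft] += window ** 2
    # Guard against division by (near) zero at the very edges of the signal.
    return out / np.maximum(norm, 1e-8)


def auralise(deconvolved_magnitude, original_audio, n_fft=512, hop=128):
    """Combine a deconvolved magnitude spectrogram with the original phase.

    Assumption: deconvolved_magnitude has the same (frames, bins) shape as
    the STFT of original_audio computed with the same n_fft and hop.
    """
    phase = np.angle(stft(original_audio, n_fft, hop))
    return istft(deconvolved_magnitude * np.exp(1j * phase), n_fft, hop)
```

As a sanity check, passing the magnitude of the original STFT as `deconvolved_magnitude` should give back the original audio away from the signal edges. A real deconvolved spectrogram keeps only the parts of the input that activate a given feature, so the output should sound like a filtered version of the input.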

Below are the reviews I got from this year’s MLSP, where my paper was not accepted.

==============
= Reviewer 1 =
==============
The main flaw of this paper is that there is no comparison to any other technique. Beside this, how do you justify the use of deep NNs in this task? The title would sound better as “Explaining deep convolutional neural networks for music classification”.
Also, in the paper there are some grammar flaws. Please go again through the paper and correct them.

==============
= Reviewer 2 =
==============
This is interesting work, however I don’t see how it is enough material to constitute a paper. This is a straightforward application of visualizing CNN behavior, the only added bit being that this is done in the audio domain. I don’t really see a strong conclusion in the results, nor any noteworthy technical advances.

I think this would make a fantastic blog post, but I don’t see the novelty, analysis, or conclusions that warrant a scientific publication.

==============
= Reviewer 3 =
==============
This paper proposes an way to analyse the CNN developed for music genre classification. The paper tries to analyse the CNN behavior by deconvoluting images and add phases obtained from original signals to resynthesis sounds. The paper call it auralisation. The analysis part of the paper is well written, but the proposed method of auralisation is rather incremental from the original visualization used in the CNN. Also the paper lacks the explanation of the auralisation (only providing the pseud code), and this is not appropriate as a machine learning (for signal processing) paper.

comments:
1) In abstract, you should explain more about “auralisation”. Also “textures” and “shapes” in the music processing context are not trivial and need some more explanations.
2) Section 3 “STFT is therefore recommended)”: No left side parentheses

I like the second and third reviews 😉


The paper is on Motherboard, thanks!
