[UPDATE] DEMO: What CNNs see when CNNs see spectrograms

This is a demo for my paper, Explaining Deep Convolutional Neural Networks on Music Classification.

In my previous post about auralisation of CNNs, I posted 8 deconvolution (and auralisation) results, which were the demonstration contents I selected on the plane from Korea to the UK. It was a perfect allocation, the best way of using the time above 10,000 m, and also the best timing for listening to all the tedious samples (5 layers x 64 features x 3 songs).

Now I have to confess: I had actually listened to less than half of them. I mean, it was that tedious. I could probably have done a bit more if I had been in business class.

Instead, I spent more time on my bed listening to the rest, and found more interesting patterns. I’m going to submit a paper to MLSP this year and here’s the demo for it. I’ll update more lines from the paper after submission.

The first layer



The second layer



The third layer



The fourth layer



The fifth layer


PS. Things used: Keras on Theano with librosa. Oh, and a pair of active noise-cancelling headphones.


9 thoughts on “[UPDATE] DEMO: What CNNs see when CNNs see spectrograms”

  1. If these are the same headphones you were using at ISMIR 2015 for your demo, they are a great research tool 🙂
    Looking forward to reading your paper. The features from higher layers look really interesting. I wonder if those textures sound meaningful and natural to a human.


    1. Exactly! Haha, yes, people kept asking me about that part of the research – ‘btw, what is the name of these headphones?’
      Most of them sound strange but interesting. I also fed the networks some simple sounds, such as various intervals and chords played on different instruments. The textures come out more clearly that way: the evolution of the onset detector, key/chord invariance in general, some features being invariant to instruments while others are not, etc.
      The problem with writing this up in a paper is space. There’s much more to illustrate than to discuss (which is perfect for a blog post..). So I’m looking for some generous workshops that allow long papers with lots of colourful spectrograms.. I hope I can make it soon 🙂


  2. It is interesting to try to hear what these CNNs have learnt. However, training shallower architectures would possibly allow you to experiment faster. In fact, when reviewing the deep learning literature in MIR, there seems to be a consensus on using no more than two layers, probably due to the current (small) size of the datasets.

    If you are using the same set-up as in your ISMIR’15 paper (spectrogram input with small square conv filters), your results confirm what intuition suggests: small square filters can model short time dependencies in sub-bands. Indeed, you find filters capable of detecting an onset, bass notes or kicks. It would be great to experiment with different filter shapes to observe what different filters can model. For example, I would expect the CNN to model cymbals or snare drums if the filters are wider in frequency.

    Congrats on your work!
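
A rough way to see why small square filters start out with short time dependencies in narrow sub-bands, and how that widens with depth, is to compute the receptive field of each layer. The sketch below assumes 3x3 kernels with stride 1, each followed by 2x2 max-pooling, in all 5 layers. Those sizes are assumptions for illustration only, not the configuration confirmed in the post, and the function name is hypothetical.

```python
def receptive_field_after_each_conv(n_blocks=5, kernel=3, pool=2):
    """Receptive-field width (in spectrogram bins or frames) after each conv layer.

    Assumes every block is: conv (kernel x kernel, stride 1) followed by
    max-pooling (pool x pool, stride = pool). These sizes are illustrative
    assumptions, not the architecture used in the post.
    """
    rf = 1      # one input bin sees itself
    jump = 1    # distance in input bins between neighbouring units
    sizes = []
    for _ in range(n_blocks):
        rf += (kernel - 1) * jump   # the conv widens the field by (k - 1) jumps
        sizes.append(rf)
        rf += (pool - 1) * jump     # the pooling window widens it further
        jump *= pool                # pooling stride spreads units apart
    return sizes


if __name__ == "__main__":
    for layer, size in enumerate(receptive_field_after_each_conv(), start=1):
        print(f"layer {layer}: sees {size} x {size} input bins")
```

Under these assumptions, a first-layer feature only sees a 3 x 3 patch of the spectrogram, while a fifth-layer feature covers 78 bins along each axis. Because the filters are square, the field grows equally in time and frequency; a filter that is wider in frequency would change only one of the two axes.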


    1. Hi and thanks! In my opinion, the network should still be able to capture what it needs to, which means there should be some redundancy among the features (especially with dropout). Also, I had to cross-check the results across several songs, which multiplies the listening time by N. I think what we need in the end is a computational way to do the analysis. Not sure how it could be done, though.
      I hope my other submission gets accepted, so that a setup with more than two layers is added to the deep learning literature in MIR and breaks the consensus 🙂 The kernel shape is tricky, but yes, probably some multi-resolution (not only in time but also in frequency) would help.


  3. Hi there, nice work, and it’s quite good to see what a CNN can see in audio, but I am sure we can’t tell the differences apart as well as a CNN can. I have just started working with sounds and NL. I want to create a model which can detect distortions in a given audio sample. I have done some research and I think mel spectrograms can do this. I would like your opinion on that. I have classified noises as continuous (white noise), intermittent (static and such) and impulsive (sudden occurrences, like glass breaking). Is it possible to do this with CNNs? Thanks.

