ISMIR 2015 LBD – Auralisation of Deep Convolutional Neural Networks: Listening to Learned Features (Auralization)

ISMIR 2015 Late-Breaking/Demo Session on-line materials

Auralisation of Deep Convolutional Neural Networks: Listening to Learned Features (Auralization) 

by Keunwoo Choi, Jeonghee Kim, George Fazekas, and Mark Sandler

0. I’m looking for an internship for 2016 summer! 

1. Please Download materials: paperposterposter(row resolution), code

2. Summary and demonstrations

This is about my auralisation, or auralization, or sonification (I now regret why I didn’t use this word) of learned features in CNNs.

The descriptors below are by-products of my CNNs that were learned for a genre classification task, and the labels are my interpretation of the learned CNNs filters. In other words, the goal is to understand what is learned as results of CNNs training, not to learn/use them as the descriptors.

First, see this figure, which is included in the poster.

As you might already notice, the frequency axis is in logarithmic scale.

I spend fair amount of time to listen to the files, every features in the five layers. (It’s actually quite a good hobby when I’m flying.)

IF YOU’RE BUSY: listen to the piano piece (the lefts below) at the row [1] (original), row [4] (a better onset detector). Then, listen to the piano+vocal+drums (+bass guitar, oh god, I’m a bass player and I omitted it in my poster. Yeah, bass guitar is not the most appreciated instrument in band, but…) at the row [1] (original), row [7] (a kick drum selector). 

[1] Original (Piano, Piano+Vocal, Piano+Vocal+Drums)

I selected this three among 15 files I auralised since they are diverse in the level of complexity. (Those last two are K-pop around 2000, Lena Park and Toy.)

[2] Attack suppressor, at layer 1, 18th feature


This is an attack suppressor, you can check by seeing the image that vertical lines are removed, which is the result of filtering by a horizontal-edge detector in the first layer, [1 1; -1 -1] for example. Take a look and listen!

[3] Onset detector, at layer 1, 19th feature


This is the opposite of [2], an onset detector, which is obtained by filtering with a vertical edge detector, [1 -1; 1 -1] for example. In the first layer, there were couple of features that behave like this. This is well explained if you see this figure(, which is also included in the paper).

First of all, this is not the visualisation of the exact network I used for the auralisation, but of a similar network, only differed by enlarging the filter size at the first layer for better visualisation, as shown. There are many vertical edge detectors! 


[4] (Better) Onset detector, at layer 2, 0th feature

Then here comes the better one, the better onset detector! 

Why does it evolve? Because the features at layer 2 is a (nonlinear mapping of) linear combination of the features at layer 1. This would be the result from mainly combining onset detectors at layer 1.

[5] Bass note selector, at layer 2, 1st feature

Obviously these are not low-pass filtered signals. I named it a bass note selector as it kinda of behaves in that way for excerpts with obvious bass part, but it doesn’t work in the way when there’s no bass guitar or something like that.

[6] Melody (or harmonic part) selector, at layer 2, 10th feature

This sounds like a melodic contour separator with somehow removing its harmonics. 

  [7] Kick drum selector (?), at layer 3, 3rd feature

This is an interesting one, a kick drum separator if exists! It behaves in the same way when I put other files that have kick drum sound. 

[8] unknown, at layer 4, 5th feature

Now it becomes trickier to say what they are, in the layer 4 and 5, the highest two layers.


 [9] unknown, at layer 5, 10th feature

For further understanding there are a lot to do. The weight matrices should be investigated to verify if the network really learned in those ways I presumed, other architectures with different tasks would be compared, comparing them with hand-craft features. Please check Sander’s great post on CNNs, his work provides a big hints and what’s going on at the high-level layers. Of course please check my materials – paperposterposter(row resolution), code. Thanks! 

PS. You can also download these sound files:  click to download wave files


2 thoughts on “ISMIR 2015 LBD – Auralisation of Deep Convolutional Neural Networks: Listening to Learned Features (Auralization)

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s