ISMIR 2015 Late-Breaking/Demo Session on-line materials
Auralisation of Deep Convolutional Neural Networks: Listening to Learned Features (Auralization)
by Keunwoo Choi, Jeonghee Kim, George Fazekas, and Mark Sandler
0. I’m looking for an internship for summer 2016!
1. Please download the materials: paper, poster, poster (low resolution), code
2. Summary and demonstrations
This is about my auralisation, or auralization, or sonification (I now regret not using that word), of the learned features in CNNs.
The descriptors below are by-products of CNNs that I trained for a genre classification task, and the labels are my interpretation of the learned CNN filters. In other words, the goal is to understand what is learned as a result of CNN training, not to learn or use them as descriptors.
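As a rough sketch of what auralisation means here (this is not the exact procedure from the paper; the STFT parameters, the mask, and the input signal below are all made up for illustration): keep only the spectrogram content a learned feature responds to, reuse the original phase, and invert back to audio.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    # complex spectrogram, frequency bins along rows, frames along columns
    win = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    return np.stack([np.fft.rfft(x[s:s + n_fft] * win) for s in starts], axis=1)

def istft(S, n_fft=512, hop=256):
    # overlap-add inverse with window-squared normalisation
    win = np.hanning(n_fft)
    n = hop * (S.shape[1] - 1) + n_fft
    out, norm = np.zeros(n), np.zeros(n)
    for t in range(S.shape[1]):
        out[t * hop:t * hop + n_fft] += np.fft.irfft(S[:, t], n_fft) * win
        norm[t * hop:t * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

# auralisation, roughly: zero out the spectrogram content a feature does not
# respond to, keep the original phase, and resynthesise to audio
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)                  # stand-in for a music excerpt
S = stft(x)
activation = np.abs(S) > np.median(np.abs(S))  # stand-in for a CNN feature map
y = istft(activation * S)                      # 'what this feature hears'
```

With an all-pass mask the round trip reconstructs the interior of the signal almost exactly, which is why reusing the original phase makes the masked version sound natural.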
First, see this figure, which is included in the poster.
As you might have already noticed, the frequency axis is on a logarithmic scale.
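For reference, a log-scaled frequency axis just means the bin centre frequencies are spaced by a constant ratio rather than a constant difference; a quick sketch (the bin count and frequency range are illustrative, not the ones used in the paper):

```python
import numpy as np

# e.g. 96 bin centres from 30 Hz up to a Nyquist of 11025 Hz, log-spaced,
# so each bin sits a constant *ratio* above the previous one
freqs = np.logspace(np.log10(30.0), np.log10(11025.0), num=96)
```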
I spent a fair amount of time listening to the files, every feature in the five layers. (It’s actually quite a good hobby when I’m flying.)
IF YOU’RE BUSY: listen to the piano piece (on the left below) at row  (original) and row  (a better onset detector). Then listen to the piano+vocal+drums (+bass guitar; oh god, I’m a bass player and I omitted it in my poster. Yeah, the bass guitar is not the most appreciated instrument in a band, but…) at row  (original) and row  (a kick drum selector).
This is an attack suppressor; you can verify it by seeing in the image that the vertical lines are removed, which is the result of filtering with a horizontal-edge detector in the first layer, [1 1; -1 -1] for example. Take a look and listen!
 Onset detector, at layer 1, 19th feature
This is the opposite of , an onset detector, which is obtained by filtering with a vertical-edge detector, [1 -1; 1 -1] for example. In the first layer, there were a couple of features that behave like this. This is well explained by this figure (which is also included in the paper).
First of all, this is not a visualisation of the exact network I used for the auralisation, but of a similar network, which differs only in having a larger filter size at the first layer for better visualisation, as shown. There are many vertical-edge detectors!
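A toy numpy sketch of the two first-layer kernels mentioned above, [1 1; -1 -1] and [1 -1; 1 -1], applied to made-up spectrogram patches (rows are frequency bins, columns are time frames):

```python
import numpy as np

def conv2d_valid(x, k):
    # naive 'valid'-mode 2-D filtering, as in a CNN forward pass
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

k_suppress = np.array([[1., 1.], [-1., -1.]])  # horizontal-edge detector
k_onset = np.array([[1., -1.], [1., -1.]])     # vertical-edge detector

onset = np.zeros((4, 4)); onset[:, 2] = 1.0        # vertical line = an attack
harmonic = np.zeros((4, 4)); harmonic[2, :] = 1.0  # horizontal line = a sustained tone

# the horizontal-edge kernel kills the attack but passes the sustained tone;
# the vertical-edge kernel does the opposite
print(np.abs(conv2d_valid(onset, k_suppress)).max())     # 0.0
print(np.abs(conv2d_valid(onset, k_onset)).max())        # 2.0
print(np.abs(conv2d_valid(harmonic, k_suppress)).max())  # 2.0
print(np.abs(conv2d_valid(harmonic, k_onset)).max())     # 0.0
```

An attack is (roughly) a vertical line in the spectrogram and a harmonic is a horizontal one, so differencing along frequency suppresses attacks while differencing along time detects them.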
 (Better) Onset detector, at layer 2, 0th feature
Then here comes the better one, the better onset detector!
Why does it improve? Because the features at layer 2 are (nonlinear mappings of) linear combinations of the features at layer 1. This one is probably the result of combining mainly the onset detectors at layer 1.
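A hypothetical numpy sketch of that combination (the feature maps, weights, and noise level are all invented): a layer-2 feature applies a nonlinearity to a learned weighted sum of layer-1 maps, so averaging a few noisy onset detectors yields a cleaner one.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# toy layer-1 outputs: two noisy onset-detector feature maps (1 x 16 frames)
rng = np.random.default_rng(0)
onsets = np.zeros((1, 16))
onsets[0, [4, 10]] = 1.0                            # true onset positions
map_a = onsets + 0.3 * rng.standard_normal(onsets.shape)
map_b = onsets + 0.3 * rng.standard_normal(onsets.shape)

# a layer-2 feature is (roughly) a nonlinearity applied to a learned weighted
# combination of layer-1 maps; averaging two noisy detectors reduces the noise
layer2 = relu(0.5 * map_a + 0.5 * map_b)
```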
 Bass note selector, at layer 2, 1st feature
Obviously these are not simply low-pass-filtered signals. I named it a bass note selector as it kind of behaves that way for excerpts with an obvious bass part, but it doesn’t work that way when there is no bass guitar or anything like that.
 Melody (or harmonic part) selector, at layer 2, 10th feature
This sounds like a melodic contour separator that also somehow removes the harmonics.
 Kick drum selector (?), at layer 3, 3rd feature
This is an interesting one: a kick drum separator, if such a thing exists! It behaves the same way when I feed it other files that contain kick drum sounds.
In layers 4 and 5, the highest two layers, it becomes trickier to say what the features are.
For further understanding, there is a lot to do. The weight matrices should be investigated to verify whether the network really learned in the ways I presumed; other architectures trained on different tasks should be compared; and the learned features should be compared with hand-crafted ones. Please check Sander’s great post on CNNs; his work provides big hints about what’s going on at the high-level layers. And of course, please check my materials: paper, poster, poster (low resolution), code. Thanks!
PS. You can also download these sound files: click to download wave files