Machine learning for music discovery (workshop) @ICML2017, Sydney

(Erik Schmidt from Pandora, who organised the workshop, is giving the opening introduction.)

Another ICML, another ML4MD workshop! This year's was the third Machine Learning for Music Discovery workshop, and as expected it featured many awesome talks. I'll briefly summarise who talked about what. Please also check out the workshop website.


Matrix Co-Factorisation and Applications to Music Analysis [abstract]

Slim Essid@Telecom ParisTech

Slim introduced how matrix co-factorisation (with respect to NMF) can be used for multi-view problems. NMF (non-negative matrix factorisation) is a technique that decomposes a non-negative matrix into a product of two matrices while keeping all of their elements non-negative. It is often applied to music spectrograms V, which are approximated as V ≈ WH: the column vectors of W are spectral templates (frequency responses, harmonic patterns, etc.) and the row vectors of H are the activations of those templates over time (time-domain envelopes, etc.).
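To make the V ≈ WH idea concrete, here is a minimal NumPy sketch of plain NMF with the classic multiplicative updates for the Euclidean cost. It is not code from the talk: the function name `nmf`, the toy random "spectrogram", the rank, and the iteration count are all assumptions for illustration.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorise a non-negative matrix V (freq x time) as V ~= W @ H.

    Uses multiplicative updates for the squared Euclidean cost.
    W holds spectral templates (columns), H holds activations (rows).
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_time)) + eps
    for _ in range(n_iter):
        # Multiplying by non-negative ratios keeps W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy stand-in for a magnitude spectrogram (hypothetical data):
# 16 frequency bins, 40 frames, built from 3 non-negative components.
rng = np.random.default_rng(1)
V = rng.random((16, 3)) @ rng.random((3, 40))

W, H = nmf(V, rank=3)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.4f}")
```

With a real spectrogram you would pass the magnitude (or power) STFT as `V` and read the learned templates off the columns of `W`.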

Multi-view streams, such as audio-visual information from a performance, can be used for source separation under the assumption that their activations H1 and H2 should be similar. The original iterative update rules are modified to include this condition, and this was shown to work well. Fitzgerald et al. (2009), Yoo & Choi (2011), and Yokoya et al. (2012) assumed hard constraints, while Seichepine et al. (2014) assumed a soft one.

The slides show only a few examples. One of the recent works uses audio signals together with motion capture of string players to separate the string instruments. The video was awesome; sadly I don't have a copy of it.
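As a rough illustration of the soft-constraint idea, the sketch below factorises two views at once and adds a penalty that pulls H1 and H2 together. Note the assumptions: it uses a squared Euclidean penalty lam * ||H1 - H2||^2 with heuristic multiplicative updates, which is a simplification and not the exact algorithm of any of the cited papers. The names `soft_co_nmf`, `coupled_cost`, and `lam`, and the toy data, are hypothetical. A hard constraint would correspond to forcing H1 and H2 to be one shared matrix.

```python
import numpy as np

def coupled_cost(V1, V2, W1, H1, W2, H2, lam):
    """Reconstruction error of both views plus the soft coupling penalty."""
    return (np.sum((V1 - W1 @ H1) ** 2)
            + np.sum((V2 - W2 @ H2) ** 2)
            + lam * np.sum((H1 - H2) ** 2))

def soft_co_nmf(V1, V2, rank, lam=1.0, n_iter=200, eps=1e-9, seed=0):
    """Jointly factorise V1 ~= W1 @ H1 and V2 ~= W2 @ H2 with H1 close to H2.

    Both views must share the same number of time frames (columns).
    """
    rng = np.random.default_rng(seed)
    W1 = rng.random((V1.shape[0], rank)) + eps
    W2 = rng.random((V2.shape[0], rank)) + eps
    H1 = rng.random((rank, V1.shape[1])) + eps
    H2 = rng.random((rank, V2.shape[1])) + eps
    for _ in range(n_iter):
        # Template updates are the same as in plain NMF.
        W1 *= (V1 @ H1.T) / (W1 @ H1 @ H1.T + eps)
        W2 *= (V2 @ H2.T) / (W2 @ H2 @ H2.T + eps)
        # Activation updates get an extra term from the coupling penalty:
        # the other view's activations appear in the numerator.
        H1 *= (W1.T @ V1 + lam * H2) / (W1.T @ W1 @ H1 + lam * H1 + eps)
        H2 *= (W2.T @ V2 + lam * H1) / (W2.T @ W2 @ H2 + lam * H2 + eps)
    return W1, H1, W2, H2

# Toy two-view data (hypothetical): an "audio" view with 16 features and a
# "motion" view with 8 features, driven by the same activations over 40 frames.
rng = np.random.default_rng(1)
H_true = rng.random((3, 40))
V1 = rng.random((16, 3)) @ H_true
V2 = rng.random((8, 3)) @ H_true

W1, H1, W2, H2 = soft_co_nmf(V1, V2, rank=3, lam=1.0)
print("final cost:", coupled_cost(V1, V2, W1, H1, W2, H2, 1.0))
```

Raising `lam` trades reconstruction accuracy for agreement between the two sets of activations.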


Learning a Large-Scale Vocal Similarity Embedding [abstract]

By Aparna Kumar@Spotify — as well as Rachel Bittner, Nicola Montecchio, Andreas Jansson, Maria Panteli, Eric Humphrey, Tristan Jehan

So many authors that you might think they just listed everyone who drank together? Actually, what Aparna presented was a big and awesome system focusing on the (vocal) melody of pop music. Melody is for sure an intriguing part of music. People say melody, rhythm, and chords are the three essential parts of music, which I think is true for most popular music. I can't think of any popular music that doesn't have vocals; that's why we call it a pop song.

What did they do with the vocal melody? Let's see.

1. Andreas applied U-Net, a convolutional auto-encoder with some bridges (which I think should be called a BAE, a bridged auto-encoder). It is popular for image segmentation, as shown below (image from Oxford). An analogy in music would be… source separation! It works pretty well. I had a chance to listen to the demo before, and it will be up soon along with the paper at ISMIR 2017. Okay, now we've got the vocal track.


2. With a well-separated vocal track, estimating pitch (f0) is trivial!… no, it's not. But it's definitely easier. Rachel's ISMIR paper this year is about this, afaik. I don't think source separation was part of that paper, but it is in this talk. So, yay, we've got a pretty accurate melody.

3. What do we do with the melody? Well, whatever machine learning helps to understand it further. It's not precisely specified in the slides, and I might have missed it in her talk, but Maria Panteli, who's my friend at c4dm and is doing an internship at Spotify in London, has done lots of machine listening work on music style understanding. A melody can be a good cue for this task. 25 vocal styles were provided by musicologists and… okay, I kinda forgot what exactly was done with them… um..

4. They also tried a genre classification task, making it clear that it is not an ultimate goal but a proxy task to show that melody can be exploited for machine-listening tasks in general.

5. Another task was artist retrieval, which didn't work very well.

6. Ongoing work is on retrieving music from a semantic query that describes the melody.

Obviously there are lots of things going on there, and I was impressed that such a large-scale, long-pipeline system actually works.


Aligned Hierarchies – A Multi-Scale Structure-Based Representation for Music [abstract]

Katherine Kinnaird at Brown University @kmkinnaird

Katherine's work was done with her three students, one of whom is a super-talented undergrad looking for a grad school to work further on MIR and math, which is one of the most important things to spread the word about.

Her talk was about music structure analysis, focusing especially on hierarchical structures and applications to cover song detection. Hierarchy in structure is such a pain, and her approach is based on using repeats at every level. This is an interesting idea! The acknowledgements say it is a portion of her doctoral thesis, but some of it is to appear at this year's ISMIR, and there will be more from her and her students.


Mining Creation Methods from Music Data for Automated Content Generation [abstract]

Satoru Fukayama and Masataka Goto
National Institute of Advanced Industrial Science and Technology (AIST)

I also enjoyed Satoru's quite comprehensive review of the work that has been done on music(-related) creation at AIST. As in the paper, there were four topics:

  • machine dancing
  • automatic chord generation (NMF, N-gram-based music language model)
  • automatic guitar tab generation
  • song2quartet: generates quartet music from audio content

Check out the four references in his abstract for more details.

NSynth: Unsupervised Understanding of Musical Notes [abstract]

Cinjon Resnick et al., Google Brain (Magenta team)

It is based on their ICML paper. As many already know, NSynth is a sample-level synthesiser built with WaveNet and a conditional auto-encoder, released along with the large dataset they used for training.

Cinjon also played many clips that are probably not public online yet (check out his YouTube channel!). The demos were about controlling the volume (which is more like ‘gain’), oscillation, and mean sound by modifying z, the latent vector of the auto-encoder. As summarised in the abstract, it didn't always work as the analogy suggests. It seems there is still a lot of work to do to understand and improve this approach, especially because people expect a synthesiser to be easily and effectively controllable.


Multi-Level and Multi-Scale Feature Aggregation Using Sample-level Deep Convolutional Neural Networks for Music Classification [abstract]

Jongpil Lee and Juhan Nam, KAIST, Korea

Jongpil has been working on music tagging (hm, sounds familiar). His approach achieved a really good result on the MSD tagging dataset, and I think it is probably at least one of the ways to go.

It is actually quite different from the previous approaches as the long title indicates.

  • Multi-level: activations from different levels are aggregated to contribute directly to the final decision (which is similar to the approach in my transfer learning paper; Jongpil's paper was out earlier).
  • Multi-scale: multi-level is already somewhat multi-scale, though.
  • Sample-level: I think they should've made this clearer than the current title/name does. Here, sample-level means something like character-level as an alternative to word-level in NLP. The smallest filters are not 512 or 256 samples long, but only 2 or 3 samples! Check out their SPL paper for details.
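To give a feel for the sample-level and multi-level ideas in the list above, here is a toy NumPy sketch. It stacks 1-D convolutions with a filter length and stride of 3 samples directly on a raw waveform, then global-max-pools the last few layers and concatenates them. Everything here is an assumption for illustration: the weights are random and untrained, and the layer count, channel count, and input length are made up, so this is not the authors' architecture.

```python
import numpy as np

def strided_conv1d(x, w, stride):
    """ReLU(conv1d). x: (channels_in, length); w: (channels_out, channels_in, k)."""
    k = w.shape[2]
    n_frames = (x.shape[1] - k) // stride + 1
    # Index matrix that slices x into (possibly overlapping) frames of length k.
    idx = np.arange(n_frames)[:, None] * stride + np.arange(k)[None, :]
    frames = x[:, idx]                       # (channels_in, n_frames, k)
    y = np.einsum("oik,ink->on", w, frames)  # sum over input channels and taps
    return np.maximum(y, 0.0)

def sample_level_features(wave, n_layers=6, channels=4, k=3, n_pooled=3, seed=0):
    """Stack tiny strided filters on raw samples and aggregate several levels."""
    rng = np.random.default_rng(seed)
    x = wave[None, :]                        # one input channel: the raw samples
    pooled, lengths = [], []
    for _ in range(n_layers):
        w = rng.standard_normal((channels, x.shape[0], k))  # untrained weights
        x = strided_conv1d(x, w, stride=k)
        lengths.append(x.shape[1])
        pooled.append(x.max(axis=1))         # global max pooling over time
    # Multi-level aggregation: concatenate the pooled last n_pooled layers.
    return np.concatenate(pooled[-n_pooled:]), lengths

wave = np.random.default_rng(1).standard_normal(3 ** 6)  # 729 raw samples
features, lengths = sample_level_features(wave)
print(lengths)         # time length after each layer shrinks by a factor of 3
print(features.shape)  # 3 pooled layers x 4 channels
```

Each layer only ever looks at 3 values at a time, yet after six layers a single output frame covers all 729 input samples.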


Music Highlight Extraction via Convolutional Recurrent Attention Networks [abstract]

Jung-Woo Ha, Adrian Kim, Dongwon Kim, Chanju Kim, and Jangyeon Park, at NAVER Corp

Naver Corp is something like a Korean Google and has many services and departments, such as Naver (a search engine plus many more), Line (a messenger used in Japan, Korea, and many other Asian countries, which did its IPO on the NYSE), Clova/CLAIR (AI/robotics research departments), etc.

The presentation was about their highlight selection algorithm based on an RNN with attention. They trained a genre classifier with an LSTM, which is then analysed to find the parts that receive the most attention. It's interesting work, and it requires a whole track as a data sample, which is probably part of why there hasn't been such work yet.

One of my questions is how much it correlates with simpler metrics such as energy, zero-crossing rate, etc. Hopefully there's a full paper that discusses those aspects. I'd also be curious to know if finding a shorter highlight would formulate the problem better. It was 1 minute in this abstract, but isn't that too long? It does have to match the requirements of their service, though.
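One simple way to turn per-frame attention into a highlight is to pick the fixed-length window with the largest total attention. The sketch below does only that last step and assumes the attention scores already exist; the function name `best_window`, the toy scores, and the window width are hypothetical, and the abstract's actual selection rule may differ.

```python
import numpy as np

def best_window(scores, width):
    """Return (start, end) of the width-frame window with the largest score sum."""
    scores = np.asarray(scores, dtype=float)
    # Prefix sums let us get every window sum with one subtraction.
    csum = np.concatenate([[0.0], np.cumsum(scores)])
    window_sums = csum[width:] - csum[:-width]
    start = int(np.argmax(window_sums))
    return start, start + width

# Toy attention scores for 7 frames (hypothetical values).
attention = [0.1, 0.1, 0.5, 0.9, 0.8, 0.2, 0.1]
highlight = best_window(attention, width=3)
print(highlight)  # frames 2..4 carry the most attention
```

With frames of known duration, the same call with a wider `width` would give a 1-minute (or shorter) highlight.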


Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras [abstract]

Keunwoo Choi, Deokjin Joo, and Juho Kim

It's my work and was already covered in the previous post. It got me a best title award from Katherine.


Ephemeral Context to Support Robust and Diverse Recommendations [abstract]

Pavel Kucherbaev, Nava Tintarev, and Carlos Rodriguez
Delft University of Technology and University of New South Wales

The final slot was filled by Australian researchers! (Along with the Netherlands.)

This abstract is mainly about what kinds of information a context-based music recommender can aggregate, focusing on multi-sensory information. A demo is online; it's not really a fully working recommender but more a showcase of the potential API. If the webapp isn't working, you can check out the repo. It's kind of a position paper, and it can also be useful for picking up the literature in the field so far.


The audio-video failures

Finally, I'd like to report that there were three sound failures, which gave us some ironic and joyful moments. This is known to be an AV-hard problem, which is at least as difficult as NP-hard.

PS. Thanks to the organisers, Erik Schmidt, Oriol Nieto, Fabien Gouyon, and Gert Lanckriet!

