I attended this amazing workshop again this year: Machine Learning for Music Discovery (ML4MD) at the International Conference on Machine Learning (ICML) 2016. ICML is one of the biggest conferences in machine learning (ICML is THE summer ML conference and NIPS is THE winter one, or the opposite if it happens somewhere in the southern hemisphere). The whole conference was massive! The committee expected ~3,000 attendees. The ML4MD workshop was also rather packed, though the room was not as large as the one for the deep learning workshop.
There was one keynote (1hr), 5 invited talks, 8 accepted talks, and happy hours.
Project Magenta: Can Music Generation be Solved with Music Recommendation?
By Douglas Eck, Google Brain
Douglas Eck gave this presentation on a rather hot topic: Project Magenta by Google Brain. If you haven’t heard of it, please check out the website. The current examples are not quite the state-of-the-art-as-Google-does-all-the-time kind, but the project started less than a month ago. Douglas worked on music recommendation for Google Play Music before leading this project. He was also one of the very first people (if not the first) to apply LSTMs to music generation (check out this Google search result), which means we can expect some great insights!
The talk did not focus on technical/computational aspects but on discussion, especially about evaluation, which was one of the main issues at CSMC. For some keywords, I suggest “deep dream, artistic style transfer, generative adversarial networks, attention model, RNN”. These keywords also frequently appear on the Machine Learning subreddit in the form of ‘Why don’t we do *** for music?’. Of course, things are not that easy.
It was interesting that he also mentioned potential applications of Q-learning (or reinforcement learning in general) to music generation. I have always had a similar idea: the RL framework seems to make much more sense than going without it, e.g. applying a plain LSTM to music (as I did recently).
By Joshua Moore, Cornell University (and now at Facebook)
Joshua introduced his work on embedding cities and artists into the same (embedding) space. In the work, he used the Million Musical Tweets Dataset, which includes location information and music information together, e.g. “I listened to Steely Dan’s music, Aja (at Seoul, Korea, Asia)”. He squeezed the data into tuples of (City, Artist), e.g. (Seoul, Steely Dan). Sounds like the very thing we should do with the dataset, doesn’t it?
Many interesting figures that are not included in the abstract were presented. Many seem to be the same figures as in their ISMIR 2014 paper, “Taste Space Versus the World: an Embedding Analysis of Listening Habits and Geography”. The approach seems quite useful for cultural analysis of music, or for anything else where you can get a dataset in such a form (tuples, triples, …).
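To make the "squeezing" step concrete, here is a minimal sketch of how listening records could be reduced to (City, Artist) tuples and counted. The record fields (`artist`, `track`, `city`, `country`) and the sample rows are my own illustrative assumptions, not the actual schema of the dataset.

```python
from collections import Counter

# Hypothetical listening records; field names and rows are made up for illustration.
records = [
    {"artist": "Steely Dan", "track": "Aja", "city": "Seoul", "country": "Korea"},
    {"artist": "Steely Dan", "track": "Peg", "city": "Seoul", "country": "Korea"},
    {"artist": "Steely Dan", "track": "Aja", "city": "London", "country": "UK"},
]


def to_city_artist_pairs(rows):
    """Drop everything except the (city, artist) pair of each record."""
    return [(row["city"], row["artist"]) for row in rows]


pairs = to_city_artist_pairs(records)

# Pair counts are the kind of input an embedding model over cities and artists could use.
pair_counts = Counter(pairs)
print(pair_counts.most_common())
```

Counts like these are only the raw material; how the embedding itself is trained is described in the paper, not here.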
By Dawen Liang, LabROSA, Columbia University
According to Dawen, this is
“A weird trick to boost the performance of your recommender system without using any additional data”
So the author started the presentation by declaring that his work is weird! (Overall, the most fun presentation award should be his.)
The motivation is to modify matrix factorisation (MF) so that it can also learn the context, like embeddings do. (I mentioned embedding again!) For those who wonder how MF is related to recommender systems, see some tutorials like this. MF is just THE method for every recommendation system,
“For example, Pandora,… and many others!”.
One other approach is the pointwise mutual information (PMI) matrix (Levy and Goldberg, NIPS 2014), which is related to the word2vec algorithm; word2vec uses the context of each item (word) to find the meaning of the item and was designed for NLP tasks.
In this work Dawen proposes to use an embedding-based approach for the recommendation problem. An embedding of an item can be computed from its context items (e.g. songs that are played before/after, clicked items, …), just as an embedding of a word is computed from its context words (other words around it) in word2vec.
Equation (1) clarifies the work with its objective function. The proposed method tries to find item embeddings that encode both users’ preferences and item co-occurrences. β_i is the item embedding, and it is therefore shared between both terms (MF and embedding).
“It does not have a proper generative story.”
“regularizing the traditional MF objective with item embeddings learned by factorizing the item co-occurrence matrix.”
A RecSys paper on the subject is coming soon.
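To see what "sharing β between both terms" means, here is a rough NumPy sketch of a joint objective in that spirit. It is not the authors' equation (1): I leave out any confidence weights and bias terms, and the names `sppmi`, `joint_objective`, `theta`, `gamma` and `lam` are my own assumptions. The co-occurrence matrix is turned into a shifted positive PMI matrix, following the Levy and Goldberg idea.

```python
import numpy as np


def sppmi(cooc, shift=1.0):
    """Shifted positive PMI of an item-item co-occurrence count matrix."""
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0  # pairs that never co-occur contribute nothing
    return np.maximum(pmi - np.log(shift), 0.0)


def joint_objective(Y, M, theta, beta, gamma, lam=0.1):
    """MF reconstruction error plus co-occurrence factorisation error.

    beta (item factors) appears in both terms, so the co-occurrence term
    acts as a regulariser on the MF item factors.
    """
    mf_term = np.sum((Y - theta @ beta.T) ** 2)
    mask = M != 0  # only observed (non-zero) co-occurrence entries
    emb_term = np.sum(((M - beta @ gamma.T) ** 2)[mask])
    reg = lam * (np.sum(theta ** 2) + np.sum(beta ** 2) + np.sum(gamma ** 2))
    return mf_term + emb_term + reg


# Toy user-by-item click matrix: items 0 and 1 are always consumed together.
Y = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
cooc = Y.T @ Y
np.fill_diagonal(cooc, 0.0)  # an item is not its own context
M = sppmi(cooc)

rng = np.random.default_rng(0)
k = 2
theta = rng.normal(scale=0.1, size=(Y.shape[0], k))  # user factors
beta = rng.normal(scale=0.1, size=(Y.shape[1], k))   # item factors (shared)
gamma = rng.normal(scale=0.1, size=(Y.shape[1], k))  # context factors
print(joint_objective(Y, M, theta, beta, gamma))
```

Note that no extra data is used: the co-occurrence matrix is derived from the same click matrix `Y`, which matches the "without using any additional data" claim.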
By Eric Humphrey, Spotify
It was literally about music (and audio) discovery. We know the discovery of audio files sucks, so we try tagging, auto-tagging, and many visualisation and search methods. His work is about visualisation of audio, or perhaps I should say audio files, because the point is to make discovery among many audio files easier.
Details are in Chap 2 of his PhD Thesis and that’s actually what I’m summarising here.
The goal of the work is to learn a mapping: timbre of audio signals → N-dim embedding. Yes, it’s another work involving embedding, but equivalently we can say (I think) it’s about learning a timbre descriptor.
The whole structure is as below:
The figure is self-explanatory (and interesting!). CQT is used as the input since it helps with pitch invariance (a pitch shift is just a translation along its log-frequency axis). 3 convolutional layers and 2 fully-connected layers are used with max-pooling and tanh (!). Let’s say this is a feature extractor. The cost (which should be large if two inputs are not similar in timbre, and small if they are) is computed on a pair of outputs, which means the whole procedure is pair-wise, while the weights of the feature extractor are shared.
Here, why tanh? According to his dissertation,
Hyperbolic tangents are chosen as the activation function for the hidden layers purely as a function of numerical stability. It was empirically observed that randomly initialized networks designed with rectified linear units instead were near impossible to train; perhaps due to the relative nature of the learning problem, i.e. the network must discover an equilibrium for the training data, the parameters were routinely pulled into a space where all activations would go to zero, collapsing the network. Conversely, hyperbolic tangents, which saturate and are everywhere-differentiable, did not suffer the same fate. It is possible that the use of activation functions that provide an error signal everywhere, such as sigmoids or “leaky” rectified linear units, or better parameter initialization might avoid this behavior, but neither are explored here.
The computation of cost and loss takes advantage of DrLIM (dimensionality reduction by learning an invariant mapping), which is similar to, but not as popular as, t-SNE. Eric answered that DrLIM is preferred to t-SNE because it generalises better, which I couldn’t quite follow because I don’t deeply understand either DrLIM or t-SNE. Nor does there seem to be a further explanation of this in the thesis.
The results are plotted in the abstract as well as in the thesis (for sure!). I picked the first one just for a taste. The figure in the abstract is a result from an extended experiment, and for me it was easier to understand after seeing the figures in the thesis, as below.
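For reference, DrLIM trains on pairs with a contrastive loss: similar pairs are pulled together, dissimilar pairs are pushed apart until they are at least a margin away. Below is a minimal NumPy sketch of that loss in its standard form. The function name, the `margin` default and the toy embeddings are my assumptions, and the thesis may well use a variant.

```python
import numpy as np


def contrastive_loss(x1, x2, similar, margin=1.0):
    """Pair-wise contrastive loss on two batches of embeddings.

    x1, x2:  arrays of shape (batch, dim), outputs of the SAME feature extractor.
    similar: boolean array of shape (batch,), True if the pair shares a timbre class.
    """
    dist = np.linalg.norm(x1 - x2, axis=-1)
    pull = 0.5 * dist ** 2                            # similar pairs: shrink distance
    push = 0.5 * np.maximum(0.0, margin - dist) ** 2  # dissimilar: push beyond margin
    return np.where(similar, pull, push)


# Toy embeddings standing in for the network outputs of three pairs.
a = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
b = np.array([[3.0, 4.0], [3.0, 4.0], [0.5, 0.0]])
labels = np.array([True, False, False])
losses = contrastive_loss(a, b, labels)
print(losses)
```

Because the mapping is a learned function, a new audio clip can be embedded without refitting anything, which may be what the generalisation remark refers to (that is my guess, not something stated in the talk).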
Learning Better Representations for Sequence Retrieval, with Applications to Large-Scale Audio-to-MIDI Matching
By Colin Raffel, LabRosa, Columbia University
This work is a hyper-compressed summary of his PhD. I would like to spend more time understanding the whole work. It spans two ISMIR papers (the second one is from this year and not released yet), two ICASSP papers, and one NIPS workshop paper. So, the paper in a nutshell:
- Find similar (audio) sequences for a query
- with the robustness of methods such as dynamic time warping, but
- more quickly by downsampling
- and even more quickly with an attention-based neural net
- at the cost of a small performance degradation.
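The first three bullets can be illustrated with a plain dynamic time warping (DTW) cost and a naive mean-pooling downsampler. This is only a baseline sketch under my own assumptions (function names, Euclidean frame distance, toy feature sequences); the learned representations and the attention-based model from the talk are not shown.

```python
import numpy as np


def dtw_cost(a, b):
    """Total cost of the best DTW alignment between two feature sequences.

    a: (n, dim) array, b: (m, dim) array. Runs in O(n * m) time.
    """
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i - 1, j - 1] + best_prev
    return acc[n, m]


def downsample(x, factor):
    """Average every `factor` consecutive frames (trailing frames are dropped)."""
    usable = (len(x) // factor) * factor
    return x[:usable].reshape(-1, factor, x.shape[1]).mean(axis=1)


# A toy sequence and a version played at half speed (each frame repeated twice).
query = np.arange(12, dtype=float).reshape(6, 2)
slow = np.repeat(query, 2, axis=0)

full_cost = dtw_cost(query, slow)                     # robust to the tempo change
fast_cost = dtw_cost(downsample(query, 2), downsample(slow, 2))
print(full_cost, fast_cost)
```

Downsampling both sequences by a factor of 2 cuts the DTW table to a quarter of its size, which is where the speed-up comes from; the price is a coarser alignment.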
By Justin Salamon, New York University
Justin summarised his work on pitch analysis for music discovery. He argues that pitch information accounts for a significant part of commercial music. (My comment 1: as far as I understand, he’s talking about modern, popular music. Wouldn’t it be even more important in classical music?) (Comment 2: it ironically reminds me of the upcoming tutorial at this year’s ISMIR, “ISMIR 2016 tutorial: why hip-hop is interesting”. Well, basically we can always find a counterexample against ‘music is blah blah’ (or X is blah blah).)
He first summarises many ML approaches to pitch analysis. This subject is also quite new to me, but what caught my attention is that
to date no A[utomatic] M[elody] E[xtraction] algorithm has significantly outperformed the purely-heuristic Melodia algorithm.
He argues that this is due to the lack of data, so that ML algorithms can’t learn properly. We definitely lack large datasets across MIR tasks in general. What is tricky is that we do not actually know whether it truly is because of dataset size until we have a breakthrough that overcomes the existing problems.
That’s why they released
MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research, at ISMIR 2014. The dataset is expected to grow bigger with NYU recordings. Still, still, still, there are problems such as (hand-)annotating continuous f0.
The latest work of the authors is to leverage synthesis. The paper for this work is being submitted. If I may summarise, it goes the other way around: synthesise polyphonic signals and use them for training, because then we don’t need to worry about (creating) ground truth.
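The appeal of synthesis is easy to show with a toy example: if we generate the mixture ourselves, the melody f0 is known exactly for every frame. The sketch below is purely illustrative and is not the authors' method (their paper is not out yet). The additive synthesiser, the sample rate, the hop size and all function names are my assumptions.

```python
import numpy as np

SR = 16000  # sample rate in Hz (assumed)


def synth_note(f0_hz, dur_s, n_harmonics=5, sr=SR):
    """Very simple additive synthesis: a few harmonics with 1/k amplitudes."""
    t = np.arange(int(dur_s * sr)) / sr
    return sum(np.sin(2 * np.pi * f0_hz * k * t) / k
               for k in range(1, n_harmonics + 1))


def synth_mixture(melody_f0s, bass_f0s, note_dur=0.25, hop_s=0.01, sr=SR):
    """Return a two-voice mixture and the frame-level f0 track of the melody."""
    if len(melody_f0s) != len(bass_f0s):
        raise ValueError("melody and bass need the same number of notes")
    melody = np.concatenate([synth_note(f, note_dur, sr=sr) for f in melody_f0s])
    bass = np.concatenate([synth_note(f, note_dur, sr=sr) for f in bass_f0s])
    mix = melody + 0.5 * bass
    frames_per_note = int(round(note_dur / hop_s))
    # The annotation comes for free: we know which f0 we synthesised in each frame.
    f0_track = np.repeat(np.asarray(melody_f0s, dtype=float), frames_per_note)
    return mix, f0_track


mix, f0_track = synth_mixture([440.0, 494.0, 523.0, 440.0],
                              [110.0, 110.0, 131.0, 110.0])
print(mix.shape, f0_track.shape)
```

A real pipeline would need far more realistic timbres and mixtures for the trained model to transfer to recorded music, but the free and exact f0 annotation is the point.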