DrummerNet – Deep Unsupervised Drum Transcription

Paper | Code

I’m glad to announce that our recent work, DrummerNet, was accepted to ISMIR 2019. My deepest gratitude goes to my great co-author, Kyunghyun Cho. A huge thanks to Chih-Wei Wu (now with Netflix) for discussions, comments, and insights, too.


What is that?

DrummerNet is a deep neural network that performs drum transcription (it takes drum track audio as input and outputs the drum note annotation – which drum component was played when).

(Figure: block diagram of DrummerNet)

What’s so special?

We trained DrummerNet in an unsupervised learning fashion. In other words, no drum annotation (transcription) is needed.

Like how?

By letting it reconstruct the input drum audio using known drum component waveforms and an estimated transcription. In the block diagram above, the grey boxes are not trained. The black boxes, which comprise the transcriber, are trained.
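The core idea can be sketched in a few lines (a minimal numpy sketch with made-up names, not the actual DrummerNet implementation): the transcriber’s estimated activations are convolved with the known one-shot waveforms, and the reconstruction is compared to the input.

```python
import numpy as np

def synthesize(activations, templates):
    """Reconstruct a drum track by convolving each (sparse) activation
    sequence with the corresponding known drum one-shot waveform.

    activations: (K, T) array -- per-component onset strengths over time
    templates:   list of K 1-D arrays -- drum one-shot waveforms
    """
    length = activations.shape[1] + max(len(t) for t in templates) - 1
    out = np.zeros(length)
    for act, tmpl in zip(activations, templates):
        out[:len(act) + len(tmpl) - 1] += np.convolve(act, tmpl)
    return out

# toy example: two components, impulse-like activations
acts = np.zeros((2, 100))
acts[0, 10] = 1.0   # "kick" at frame 10
acts[1, 50] = 0.5   # "snare" at frame 50, half velocity
kits = [np.hanning(32), np.hanning(16)]
mix = synthesize(acts, kits)
```

In the real model this synthesis step is differentiable end-to-end, so the reconstruction loss can train the transcriber without any annotated labels.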

Does it work well?


(Figure: evaluation results on the SMT dataset)

You should be careful when interpreting it, though. This is the ‘cross eval’ scenario, where all networks are trained on some other datasets and then tested on SMT (no fine-tuning). Hence it’s more realistic, and it evaluates generalizability.

And that’s the reason I was interested in unsupervised learning. When I was deciding on the research topic, I was actually not interested in a particular task. I only cared about what can be done, and only done, with a large un-annotated dataset (= unsupervised learning), because in MIR, annotated datasets are TINY. Like, tiny as in a really tiny thing. Like… 20 songs!

What was the trick?

I didn’t test anything but U-net for the analysis part. For me, its implicit multi-resolution nature fits the task very well.

I didn’t test 2-dimensional representations as input. To be honest, I think they might work better, because KD/SD/HH are pretty easily distinguished visually.

SparseMax was very important, as you can see in the Ablation Study section of the paper.
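For reference, here is a small numpy sketch of sparsemax (the closed-form projection from Martins & Astudillo, 2016). Unlike softmax, it outputs exact zeros, which is what makes the estimated activations sparse:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: like softmax, but low logits are pushed to exact zero."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum       # candidate support set
    k_z = k[support][-1]                      # size of the support
    tau = (cumsum[support][-1] - 1) / k_z     # threshold
    return np.maximum(z - tau, 0.0)

p = sparsemax([2.0, 1.0, -1.0])
# p == [1.0, 0.0, 0.0]: sums to one, with exact zeros
```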

Unsupervised Learning

Hey hey hey, unsupervised is the way to go, especially in MIR. The reality has shown us so far that we won’t get enough annotated data anyway. These days, to get a linear improvement of a deep neural network, we need an exponentially larger dataset.


DrummerNet paper reviews and my response

Ok, this is the fun part! See the reviews I got from ISMIR along with my responses.

# Meta review

> This is a meta-review.
> This paper proposes an automatic drum transcription (ADT) method called DrummerNet that combines an analysis module with a synthesis module in a unified autoencoder-type network. The proposed method was comprehensively evaluated and the results are promising and convincing.
> All reviewers agree with (strongly) accepting the paper. The proposed method based on the integration of synthesis and generation is technically novel. As some reviewers pointed out, the missing acoustic information problem in generation and the semi-supervised property should be discussed more carefully and clearly in the camera-ready manuscript.

> My comments as a reviewer are as follows.
> I enjoyed reading the paper. It is well written and organized. This is the first study that uses the autoencoder-type network for ADT. The paper should be accepted due to its significant scientific contribution.


> The proposed method is technically sound, but it should be called a semi-supervised method, NOT an unsupervised method, because isolated drum sounds are used as templates for synthesis. The authors should clarify this in abstract and Section 1 and amend the title. There are some semi-supervised ADT methods, so the authors should make a more specific title indicating the paper content.

I wouldn’t agree; to me that sounds like too wide a definition of semi-supervised learning. Although its definition may vary, some sources clearly define it as using labelled and unlabelled datasets together. As long as the goal of DrummerNet is NOT to find the right templates, using templates itself shouldn’t make it semi-supervised learning. I know there are some works calling something similar semi-supervised, but there are also other works calling exactly the same thing unsupervised. I’m of the latter opinion.

> The neural integration of analysis and synthesis has recently been studied in some fields (e.g., computer vision and automatic speech recognition). Please refer to such attempts, e.g.,
> (a) Semi-supervised End-to-end Speech Recognition Using Text-to-speech and Autoencoders, ICASSP 2019.
> (b) Cycle-consistency Training for End-to-end Speech Recognition, ICASSP 2019.

(Hm, these works were published after my submission, and I already mentioned a paper from computer vision, though…)
(a) shares a similar idea, but it uses both {paired, unpaired} datasets for both {ASR, TTS} (and that’s why, again, it’s called SEMI-supervised). This is different from DrummerNet. In DrummerNet, you need extra machinery to make sure it outputs transcriptions, since it is trained only with drum stems – no supervision on y, the label (= transcription). Not quite a similar problem.

Also, the given situation is different – for both ASR and TTS, there are annotated datasets that are large enough for successful supervised learning that is deployable in a realistic scenario. Meanwhile, for drum transcription, SMT is like 100 drum loops with 3 drum kits… and we use it for training/validation/test.

(b) is more similar to DrummerNet than (a) is. But it’s still not that similar. So, probably not.

> The authors should discuss a critical limitation of the integrated analysis and synthesis approach in Section 1. In principle, it is impossible to completely reconstruct original signals only from musical scores because musical scores have no acoustic information (e.g., timbres and volumes). Therefore, a core of the proposed method is that the synthesis module is trained to generate onset-enhanced spectrograms obtained by significantly losing acoustic information. This idea is very close to the text-to-encoder (TTE) approach proposed in (b) instead of the text-to-speech (TTS) approach.

Isn’t that the scope of the transcription task, rather than a limitation of DrummerNet? And isn’t removing redundant information == feature extraction? DrummerNet isn’t about making drum tracks sound better, so I wouldn’t call it losing information. All classification tasks ultimately reduce the input data to an N-dimensional one-hot vector.

Also, DrummerNet actually uses volumes (or velocities). In this respect it’s actually even better than the supervised learning approach, and potentially more so. Transcription annotation does NOT include acoustic information. However, we need that information AS WELL to synthesize well. Analysis-synthesis transcription is, hence, already estimating volumes, and can potentially extract even richer information, while supervised learning of transcription is upper-bounded by the task of finding the onset positions.

# Review 1

> The paper provides a method to train a machine learning system, in this case a neural network, in an unsupervised manner in the case where data is scarce. The task addressed, drum transcription, is perennial in MIR. The main idea behind the paper is not: synthesis has been used before to address the lack of data annotation. Papers which use similar ideas:
> Carabias-Orti, Julio J., et al. “Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings.” EURASIP Journal on Advances in Signal Processing 2013.1 (2013): 184.
> Salamon, Justin, et al. “An Analysis/Synthesis Framework for Automatic F0 Annotation of Multitrack Datasets.” ISMIR. 2017.

Um……… not really. OK, DrummerNet synthesizes drum signals, but not for generating more training data. That’s what Mark Cartwright or Richard Vogl did (and I mentioned it in the paper as another important direction).
DrummerNet synthesizes drum audio tracks based on the drum transcription estimated by its transcriber module.

> Furthermore, it is interesting to mention Yoshii’s drum transcription which uses adaptive templates, which is another case of unsupervised drum transcription. This paper doesn’t appear in the state of the art as well as some other drum transcription papers published in recent years.

Yes, that’s what I mentioned in the introduction, citing it ([44]).

> Integrating the synthesis part into a fully-differentiable network is an excellent idea. However, I have some concerns with some tweaks which were introduced to improve performance, regarding their differentiability. First, is the CQT transform differentiable? This is needed to compute the cost between on x and \hat x, respectively on their onset enhanced CQT spectrum. How is this implemented within pytorch, which does not have CQT?

PyTorch doesn’t have an official CQT (or pseudo-CQT, as mentioned in the paper), so I implemented it myself. (It’s also provided in the source code.)
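The usual recipe for a differentiable pseudo-CQT is to apply a fixed log-frequency filterbank to STFT magnitudes; since both the STFT and a matrix multiplication are differentiable in PyTorch, gradients flow through to the waveform. This is a rough sketch with a home-made triangular filterbank (the filterbank design and all parameters here are my simplification, not the paper’s exact implementation):

```python
import numpy as np
import torch

def logfreq_filterbank(n_fft=1024, sr=16000, n_bins=84, fmin=32.7):
    """Fixed log-frequency (pseudo-CQT-like) filterbank: triangular
    filters centred at constant-Q-spaced frequencies, 12 bins/octave."""
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    centers = fmin * 2.0 ** (np.arange(n_bins) / 12.0)
    fb = np.zeros((n_bins, len(freqs)))
    for i, c in enumerate(centers):
        width = c * (2 ** (1 / 12.0) - 1) + 1e-6  # ~constant-Q bandwidth
        fb[i] = np.maximum(0.0, 1.0 - np.abs(freqs - c) / width)
    return torch.tensor(fb, dtype=torch.float32)

def pseudo_cqt(x, fb, n_fft=1024, hop=256):
    """Differentiable pseudo-CQT: STFT magnitude times a fixed filterbank."""
    spec = torch.stft(x, n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True).abs()
    return fb @ spec  # (n_bins, n_frames)

x = torch.randn(16000, requires_grad=True)
C = pseudo_cqt(x, logfreq_filterbank())
C.sum().backward()  # gradients flow back to the waveform
```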

> The heuristic peak picking method is introduced only in the evaluation. Is this function differentiable?

It is used only in the evaluation (hence it’s fine even though it’s not differentiable).
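For illustration, evaluation-time peak picking can be as simple as this (a scipy sketch with made-up thresholds, not the exact heuristic in the paper):

```python
import numpy as np
from scipy.signal import find_peaks

def pick_onsets(activation, frame_rate, height=0.1, min_gap_s=0.05):
    """Non-differentiable peak picking, used only at evaluation time:
    local maxima above a threshold, at least `min_gap_s` apart."""
    distance = max(1, int(min_gap_s * frame_rate))
    peaks, _ = find_peaks(activation, height=height, distance=distance)
    return peaks / frame_rate  # onset times in seconds

act = np.zeros(200)
act[[20, 21, 90]] = [0.9, 0.8, 0.5]  # two adjacent frames + a later peak
times = pick_onsets(act, frame_rate=100)
# only frames 20 and 90 survive -> onsets at 0.2 s and 0.9 s
```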

> The loss is computed on an analogue task to drum transcription: reconstruction. The onset enhancing is meant to improve a bit the similarity between the tasks, however, there are many factors missing: how the velocity (intensity, amplitude) of the transcribed events influence the transcription, particularly when they are masked by other events?

That’s when DrummerNet makes mistakes. Section 4.4 (last paragraph) and Figure 7 are about that problem.

> Some SoA based on NMF uses tail models to account for the dissimilarity between the decay and the attack. How does that influences this paper?

  • My guess: for kick/snare/hi-hat, it seemed fine (the problem of variations in the envelope curve is already part of what DrummerNet has to solve to do the job right).
  • My opinion: that model is a simplified approach, and I think we should try something more generalizable than that.

> What if only 3 drum classes are used? How does that impact the f-measure?

I didn’t get it exactly – we’re already using only 3 classes.

> Electronic kits are not used here and the datasets are scarce regarding this type of music which has a diverse and different timbre. How do authors plan to deal with this problem.

Several opinions.
– For the ‘kits’ of the Synthesizer module, it’s trivial to add electronic kits.
– But I understand the concern raised is more about how it’d work with electronic drum stems.
– One thing: I would see it as a less difficult problem, because by definition it is way easier to synthesize realistic, high-quality electronic drum tracks. That’s not the case for recordings of acoustic drums; the randomness and the various modifications introduced during the recording and production process are the most natural data augmentation. Probably for this reason, real drum recordings would provide more information than synthesized drum tracks.

> The starting paragraph in 4.3 is not clear and it should be rewritten. It’s not clear what authors want to address with that evaluation? The system seems to be retrained on test/train collections of SMT for a fair comparison with state of the art?

Fair enough, will update. Quick note: we do not re-train DrummerNet on SMT, and that’s what the transfer learning scenario, or ‘Eval Cross’, means in the cited review paper by Wu et al.

> The last phrase of 4.3 overhypes the proposed system, which is another ML system after all and can suffer from overfitting and domain-mismatch. There will always be some mismatch between training and testing data and it is very difficult to predict how a system would perform on unseen future data. Strong claims need strong evidence which is not justified here: DrummerNet is easier to train (doesn’t need annotated data) but it is prone to overfitting as any ML system.

Partly because of what I answered right above, I don’t think it is an oversell to say so. Naturally, although an ML module trained with supervised learning would not overfit to the training set given a proper split, it still only fits that dataset at its best. When that dataset is really big (e.g., ImageNet), we don’t need to worry about whether it has fitted to a certain distribution only (i.e., it’s almost the ‘true’ distribution!).

The recent deep supervised learning approaches for drum transcription have done a very good job (2016–now). Their within-split performances, which are what people have reported, are excellent. But their cross-dataset performances, which I used in Figure 5, are slightly less good. How slightly? Well, large enough that DrummerNet (2019) and NMFD (2012… yes. 2 0 1 2 !!!) outperformed them. It wouldn’t happen if we had an ImageNet-sized annotated dataset. The reality is, the SMT dataset is 130 minutes. MDB is 20 minutes. ENST is 61 minutes. And they are quite different from each other, each representing only a small subset of drum stems.

Training data can represent the real-world scenario in some cases. When training sets are tiny, however, we can’t expect that to happen. And the annotated sets for supervised transcription are not big.

# Review 2

> The paper proposes a system for (bass, snare drum, and hi-hat) drum transcription that can be trained on (drum-only) audio alone, without any ground truth transcriptions.
> The main idea of constructing a waveform autoencoder whose bottleneck looks like a transcription (by decomposing it into sparse temporal activation sequences for each source, similar to NMF) is a nice one. It can be thought of as a deep-learning version
> of NMF-style unsupervised transcription.

Yup 🙂

> Based on the experiments, the key components which make the system work for transcription appear to be:
> 1. enforcing sparse activations by using sparsemax nonlinearity to determine activations
> 2. making use of features which enhance the onsets of each drum event
> Although the training mechanism is unsupervised in the sense that it does not depend on fully transcribed examples, it does require predefined signal templates for each of the targeted instruments.
> This is in contrast to NMF-style techniques, which can (sometimes) learn instrument templates along with the transcription.

Just to comment: we’d need a better differentiable synthesizer for this.

> On first read I was surprised that it worked to define the templates at the waveform level, essentially turning the synthesis module F_s into a crude drum machine. It seems that the important tricks to achieve generalization are: randomly using different “drum-kit” templates in each training batch, and computing the loss in terms of the “onset spectrum”, which discards much of the low level signal detail.

I agree.
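To illustrate why an onset-enhanced spectrum discards low-level detail, here is a spectral-flux-style sketch (my own toy version, not the paper’s exact onset enhancement): sustained energy cancels to zero, and only attacks survive.

```python
import numpy as np

def enhance_onsets(spec):
    """Half-wave-rectified temporal difference of a magnitude
    spectrogram: steady-state energy cancels out, attacks survive."""
    diff = np.diff(spec, axis=1, prepend=spec[:, :1])
    return np.maximum(diff, 0.0)

# a sustained sound with one re-attack
spec = np.ones((3, 6))
spec[:, 3] = 2.0           # sudden energy increase at frame 3
o = enhance_onsets(spec)   # non-zero only at frame 3
```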

> The assumption that the onsets of instrument events can uniquely identify the instrument (as shown in the right of Fig 3) seems limited to certain classes of instruments, e.g. those with percussive attacks. As does (to a lesser extent) the preference for sparsity in activations. That said, the paper does not address the more general transcription task, so this is not a problem.

This is true, and it is one of the reasons that future work should head towards a better loss function than audio similarity.

(edit: I removed comments on typos/etc.)

# Review 3

> This paper presents an unsupervised method for drum transcription. Overall I believe the idea behind this paper and how it has been undertaken is very good. It is definitely appropriate for the ISMIR conference. However, I do have a couple of major issues and minor issues regarding this paper.


> Major issues
> – No mention of system [35] within the background section. The high level usage of a ‘synthesiser’ and a ‘transcriber’ is very similar to the ‘player’ and ‘transcriber’ used in this work. [35] also clearly states that it can be used without existing training data and so is extremely similar to the model proposed in this paper. Why is this paper not mentioned in the background section when it is the paper that shares the most similarity to this work?

For me, they are pretty different even at the high level. Having a transcriber module shouldn’t count as something in common between transcription systems. The ‘player’ in [35] generates training data for (supervised) reinforcement learning. I believe I’ve covered the more similar “analysis-synthesis” works in MIR well enough… at least as much as I can within 6 pages!

> – The evaluation against other baselines is limited to only the results which make the system look good.


DrummerNet worked well in the experiment that I care about.

> Only the results for the eval-cross (trained on DTP) experiment are presented? The evaluation is performed on the MDB and ENST datasets which could be interpreted as a DTP eval-subset scenario.

No, they’re not. I made it clearer in the camera-ready version: DrummerNet is not trained on any of SMT/MDB/ENST at all. Hence it’s eval-cross, not eval-subset.

This is from Wu’s review paper.
“Eval Subset: This strategy also evaluates the ADT performance within the ”closed world” of each dataset but using a three-fold subset cross-validation. To this end, each of the three subsets (see Table V) is evenly split into validation and testing sets. The union of all items contained in the remaining two subsets serves as training data. A single subset is used for the validation and testing set in order to maintain sufficient training data.”

> The results for these datasets are presented in Figure 4, but seem to not be discussed further, possibly because the system does not achieve comparable results here. If that is the case it needs to be clearly stated and the possible reasons why discussed so that readers who aren’t familiar with the field get a full understanding of where the contributions are within the larger field.

I’ll probably put more info if there’s space, but (as written in Section 4.3) we could only do the comparison on SMT (Figure 5), because that’s the only option for a correct/fair comparison. Our experiment is ‘DTD, and eval-cross (trained on DTP)’, and this corresponds only to the SMT experiment in the review paper.

I mean…
I cannot compare mine with something that doesn’t exist.

> Minor Points:
> – I’m not 100% convinced that this system falls under the unsupervised bracket as both the Fs and Fa parts require the user to determine the number of ‘instrument classes’ i.e., K. I would explain somewhere what you mean by the word unsupervised in this context, especially regarding my earlier points about [35].

Is specifying the number of classes considered supervision?

> – I don’t believe ‘y’ can be called a score, the majority of existing state-of-the-art drum transcription systems do not create a score, only output times in seconds. To achieve a score requires more tasks for example beat and downbeat tracking.

That’s true. But I think it’s fair to say so in the very first sentence of the paper, where I introduce what transcription is as a layperson’s explanation of the task.

(edit: remove comments about typos/grammar)

> – I had to read the DrummerNet section multiple times to fully understand what the system was aiming to do. To aid future readers I believe that a high level explanation of the full system would be useful. For example, you never actually explain what the Fa and Fs system aim to do on a high level.


> – The evaluation and system is limited to the DTD and DTP context. Maybe mention in the future work how the model could be extended for use within the DTM context.

👍 (fun fact: DTM was actually my original task.)

> – This years conference asked for submissions to explicitly discuss reusable insights. It would be useful to add more to the future work section regarding how this model could be incorporated in other work possibly other types of instrument transcription?

👍 (I added a bit more)

Links (again)

This is it. Thanks! Please check out the paper and the code, too.

Paper | Code

Machine Learning for Creativity and Design Workshop (NeurIPS2018), and +@

Following their first workshop last year, the second Machine Learning for Creativity and Design workshop was held on 8 Dec 2018 at NeurIPS 2018 (= one of the biggest machine learning conferences) in Montreal (= one of the coldest places I’ve ever been). It was great! And even greater for those who are interested in music. I missed last year’s one, but it seems there was more musical stuff this year than before. Here’s my summary of the workshop, a non-exhaustive and mostly musical one, but please treat yourself to the other papers too. Ok, here we go.

1. Music-related works


  • “Neural wavetable: a playable wavetable synthesizer using neural networks”
    • By Lamtharn Hantrakul and Li-Chia Yang (Google Brain residency and Georgia Tech)
    • To generate a wavetable, which is a (data)base for a certain type of synthesizer (a wavetable synthesizer, obviously), they used WaveNet + AutoEncoder, so that by controlling the latent space (the hidden representation of the AutoEncoder), the waveforms of the table can be manipulated continuously.




  • (continued)
    • Compared to MusicVAE, multitrack VAE is..
      • still with a global z over time, but this time z has multitrack information encoded
      • and with chord conditioned for each bar.





  • “Transformer-NADE for piano performances”
    • by Curtis Hawthorne et al. (Google Magenta)
    • proposed to use NADE (neural autoregressive distribution estimator) to predict the next note, where each dimension is an element of the note tuple; the elements are properties of the note (onset timing, duration, ..).
    • FYI, “Transformer” is a purely attention-based sequence-to-sequence model, originally proposed for language translation and recently used for symbolic music generation.



  • Piano Genie


2. Some others

  • “Artistic Influence GAN”
    • by Eric Chu, MIT Media Lab
    • “What if Banksy had met Jackson Pollock during his formative years, or if David Hockney had missed out on the Tate Gallery’s famous 1960 Picasso exhibition?”
    • Similar to something I’ve always thought about: simulating the history of music, maybe with RL though.


3. An awesome talk

  • Michael Levin’s keynote, titled “What bodies think about: Bioelectric computation outside the nervous system, primitive cognition, and synthetic morphology”, totally blew people’s minds. I think it could be the talk that excited the NeurIPS 2018 participants the most, and you don’t need to be a deep learning researcher to be impressed.

4. Neither musical nor a talk (i.e., the usual NeurIPS stuff)

  • Best GAN ever

  • VAE + GAN

Ok this is it. Thanks for the great works everyone!

Paper revision’s out: The effects of noisy labels on deep convolutional neural networks for music classification

It’s a revision of this paper. It’s a major revision, so major changes! I’ll only take notes on the new stuff.


The groundtruth is very noisy in tagging datasets. The recall and precision are our (estimated) evaluation of the groundtruth. Yeah, it’s pretty low, and we call it ‘groundtruth’…


Which hurts the networks’ performance.

The good thing is that the trend doesn’t change no matter which groundtruth we use – either the provided one or our re-annotation.


It’s a figure from Convolutional Recurrent Neural Networks for Music Classification, where I couldn’t figure out why there were such differences in per-tag performance. Well, I think I now know at least one of the reasons: more noise on tag A → more confusion for the network (whatever its exact structure is) → lower performance.



Why don’t we try to explain another tag category from the same perspective? In the dataset, 90s and 00s are the majority (84%), but they probably don’t get tagged properly, at least not as well as 60s/70s/80s, because, come on, you’re in 2010 and listening to 00s music. Why would you tag it? It’s more likely that you would tag 60s/70s/80s music, because doing so gives you some information. As a result, old tags got less noise, and hence higher performance. Yes, this is our guess.


Ok, so with such a corrupted groundtruth, we know what happens when we use it to train. What happens when we use it to evaluate?

(a)(b): ok, it’s fine.

(c): no, it’s not that fine when the differences between the systems are subtle. Which is obvious, because at some point the noise in the evaluation > the system-wise differences.


That’s it. Please go read it if it sounds interesting! arXiv link here.

Machine learning for music discovery (workshop) @ICML2017, Sydney

(Erik Schmidt from Pandora, who organised this workshop, giving the opening introduction.)

Another ICML, another ML4MD workshop! It was the 3rd Machine Learning for Music Discovery workshop this year, featuring many awesome talks as expected. I’ll briefly summarise who talked about what. Please also check out the workshop website.


Matrix Co-Factorisation and Applications to Music Analysis [abstract]

Slim Essid@Telecom ParisTech


Slim introduced how matrix co-factorisation (w.r.t. NMF) can be used for multi-view problems. NMF (non-negative matrix factorisation) is a technique to decompose a matrix into two while keeping their elements non-negative. It is often applied to music spectrograms V, which are decomposed into several column vectors W (frequency responses / harmonic patterns / etc.) and row vectors H (which decide the activations of the column vectors / time-domain envelopes / ..).
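As a quick illustration of the V ≈ WH decomposition (a toy sklearn sketch, unrelated to Slim’s actual co-factorisation method):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy "spectrogram": a non-negative matrix, 513 freq bins x 100 frames
rng = np.random.default_rng(0)
V = rng.random((513, 100))

# Decompose V ~= W @ H with k non-negative components:
# columns of W ~ spectral templates, rows of H ~ their activations in time
model = NMF(n_components=4, init="random", random_state=0, max_iter=300)
W = model.fit_transform(V)  # (513, 4)
H = model.components_       # (4, 100)
```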

Multi-view streams, such as audio-visual information from a performance, can be used for source separation under the assumption that their activations H1 and H2 should be similar. The original iterative update rules are modified with this condition and shown to work well. Fitzgerald et al. (2009), Yoo & Choi (2011), and Yokoya et al. (2012) assumed hard constraints, while Seichepine et al. (2014) assumed a soft one.

The slides above show only a few examples. One of the recent works uses audio signals and motion capture of string players to separate the string instruments. The video was awesome; sad that I don’t have it.


Learning a Large-Scale Vocal Similarity Embedding [abstract]

By Aparna Kumar@Spotify — as well as Rachel Bittner, Nicola Montecchio, Andreas Jansson, Maria Panteli, Eric Humphrey, Tristan Jehan

So many authors – now you’re thinking they just put down all the names of those who drank together? Actually, what Aparna presented was a big and awesome system focusing on the (vocal) melody of pop music. Melody, for sure, is such an intriguing part of music; people say melody, rhythm, and chords are the three essential parts of music, which I think is true for most popular music. I can’t think of much popular music that doesn’t have vocals – so we call it a pop song.

What did they do with the vocal melody? Let’s see.


1. Andreas applied U-net, a convolutional auto-encoder with some bridges (which I think should be called a BAE: bridged auto-encoder), which is popular for visual image segmentation (image from Oxford). An analogy in music would be… source separation! It works pretty well; I had a chance to listen to the demo before, and it will be up soon, as well as the paper at ISMIR 2017. Okay, we’ve got the vocal track now.


2. With a well-separated vocal track, estimating pitch (f0) is trivial!… no, it’s not. But it’s definitely easier. Rachel’s ISMIR paper this year is about this, AFAIK. I don’t think source separation was part of that paper, but it is in this talk. So, yay, we’ve got a pretty accurate melody.

3. What do we do with the melody? Well, whatever machine learning helps us understand further. It’s not precisely specified in the slides, and I might have missed it in her talk, but Maria Panteli, a friend of mine at C4DM who is doing an internship at Spotify in London, has done lots of machine listening work on music style understanding. A melody can be a good cue for this task. 25 vocal styles were provided by musicologists and… okay, I kind of forgot what exactly was done with them… um…

4. They also tried a genre classification task, making it clear that it is not an ultimate goal but a proxy task that can prove the idea of exploiting melody for whatever machine-listening task.

5. Another task was artist retrieval, which didn’t work very well.

6. An ongoing work was to retrieve music by some semantic query that describes the melody.

Obviously there is a lot going on there, and I was impressed that such a large-scale, long-pipeline system is indeed working.


Aligned Hierarchies – A Multi-Scale Structure-Based Representation for Music [abstract]

Katherine Kinnaird at Brown University @kmkinnaird


Katherine’s work was done by her and her three students, one of whom is a super-talented undergrad looking for a grad school to work further on MIR and math – and that is word worth spreading.

Her talk was about music structure analysis, especially focusing on hierarchical structures and applications to cover song detection. Hierarchy in structure is such a pain, and her approach is based on using repeats at every level. This is an interesting idea! The acknowledgements say it is a portion of her doctoral thesis, but some of it was to appear at this year’s ISMIR, and there will be more by her and her students.


Mining Creation Methods from Music Data for Automated Content Generation [abstract]

Satoru Fukayama and Masataka Goto
National Institute of Advanced Industrial Science and Technology (AIST)


I also enjoyed Satoru’s quite comprehensive review of the work that has been done on music(-related) creation at AIST. As in the paper, there were four topics:

  • machine dancing
  • automatic chord generation (NMF, N-gram-based music language model)
  • automatic guitar tab generation
  • song2quartet: generate a quartet music from audio content

Check out the four references in his abstract for more details.

NSynth: Unsupervised Understanding of Musical Notes [abstract]

Cinjon Resnick et al., Google Brain (Magenta team)


It is based on their ICML paper. As many will already know, NSynth is a sample-level synthesiser built with WaveNet and a conditional auto-encoder, released together with the large dataset they used for training.

Cinjon also played many clips that probably aren’t public online yet (check out his YouTube channel!). The demos are about controlling the volume (which is more like ‘gain’), oscillation, and mean sound by modifying z, the latent vector of the auto-encoder. As summarised in the abstract, it didn’t always work as the analogy suggests. It seems there is still a lot of work left to understand and improve this approach, especially because people expect a synthesiser to be easily and effectively controllable.


Multi-Level and Multi-Scale Feature Aggregation Using Sample-level Deep Convolutional Neural Networks for Music Classification [abstract]

Jongpil Lee and Juhan Nam, KAIST, Korea


Jongpil has been working on music tagging (hm, sounds familiar). His approach achieved a really good result on the MSD tagging dataset, and I think it probably is at least one of the ways to go.

It is actually quite different from the previous approaches as the long title indicates.

  • Multi-level: activations of different levels are aggregated to directly contribute to the final decision (which is similar to my transfer learning paper's approach, and Jongpil's paper was out earlier).
  • Multi-scale: Multi-level is already somewhat multi-scale, though.
  • Sample-level: I think they should've made this clearer than the current title/name. Here, sample-level means something like character-level as an alternative to word-level in NLP. The minimum filter length is not something like 512 or 256 samples, but as small as 2 or 3 samples! Check out their SPL paper for details.
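To make 'sample-level' concrete, here is a toy numpy sketch of my own (not their code): a strided 1-D convolution whose filter is only 3 samples long, playing the role that a 256- or 512-point framed filter plays in frame-level models.

```python
import numpy as np

def conv1d_strided(x, w, stride):
    """Valid 1-D convolution (cross-correlation) with a stride."""
    n = (len(x) - len(w)) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + len(w)], w)
                     for i in range(n)])

x = np.arange(27, dtype=float)      # 27 "raw audio" samples
w = np.ones(3) / 3.0                # a length-3, sample-level filter
h = conv1d_strided(x, w, stride=3)  # frame-like splitting with a tiny filter
print(h.shape)                      # (9,)
```

In a real sample-level model this is stacked many times with pooling in between, so the receptive field grows while each filter stays tiny.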


Music Highlight Extraction via Convolutional Recurrent Attention Networks [abstract]


Jung-Woo Ha, Adrian Kim, Dongwon Kim, Chanju Kim, and Jangyeon Park, at NAVER Corp

As a sort of Korean Google, Naver Corp has many services and departments, such as Naver (a search engine and much more), Line (a messenger popular in Japan/Korea/many other Asian countries, which had its IPO on NASDAQ), and Clova/CLAIR (AI/robotics research departments).

The presentation was about their highlight-selection algorithm based on an RNN with attention. They trained a genre classifier with an LSTM, which is then analysed to find the parts that are most attended to. It's interesting work, and it requires whole tracks as data samples, which is partly and probably why there was no such work before.

One of my questions is how much it correlates with other simple metrics such as energy, zero-crossing rate, etc. Hopefully there will be a full paper that discusses those aspects. I'd also be curious to know whether finding a shorter highlight would formulate the problem better. It was 1 minute in this abstract, but isn't that too long? Although it has to match the requirements of their service.
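For illustration only, the selection step as I understand it could be sketched like this (the attention weights and window length here are made up; this is not their code):

```python
import numpy as np

def best_window(attention, win_len):
    # sum of attention over every contiguous window, pick the maximum
    sums = np.convolve(attention, np.ones(win_len), mode='valid')
    start = int(np.argmax(sums))
    return start, start + win_len

rng = np.random.default_rng(1)
att = rng.random(200)        # fake per-segment attention over a track
att[120:140] += 2.0          # pretend the chorus attracts attention
print(best_window(att, 20))  # (120, 140)
```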


Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras [abstract]

Keunwoo Choi, Deokjin Joo, and Juho Kim

It’s my work and already covered in the previous post. It got me a best title award by Katherline.


Ephemeral Context to Support Robust and Diverse Recommendations [abstract]


Pavel Kucherbaev, Nava Tintarev, and Carlos Rodriguez
Delft University of Technology and University of New South Wales

The final slot was filled by Australian researchers! (along with the Netherlands.)

This abstract is mainly about what kinds of information a context-based music recommender could aggregate, focusing on multi-sensory information. A demo is online at https://icml2017demo.herokuapp.com; it's not really a fully-working recommender but more a showcase of a potential API. If the webapp is not working, you can check out the repo. It's kind of a position paper and can also be useful for picking up the literature in the field so far.


The audio-video failures

Finally, I’d like to report that there were three sound failures which got us ironic and joyful moments. This is known to be a AV-hard problem which is at least as difficult as NP-hard.

PS. Thanks to the organisers, Erik Schmidt, Oriol Nieto, Fabien Gouyon, and Gert Lanckriet!

Abstract is out; Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras

paper | repo


Since last December, I have been developing and using Kapre, which is just a set of Keras layers that do some audio preprocessing, say, STFT, melspectrogram, etc. Without a real benchmark, I mean.

There are many small reasons for this. Well, having my own Python package that can be installed with a pip command is cool, right? It's one of the cool kids' requirements. I've learned a lot by doing it. But, most importantly, it makes the experience of doing research with more than a handful of audio files so much nicer.

A scenario

Let’s say you’re doing some classification with not-so-small music dataset. Like FMA-medium which is roughly 30GB after uncompressing.

Ok, decoding mp3 takes long, so it should be done in advance.


But to what should I actually convert them? Spectrogram? Melspectrogram? With what parameters? Alright, I'll try n_fft=1024, 512 and… hop=512, 256,… hm.. and.. n_mels=128, 96.

Congrats, just do it and store them, which would take up something like 2 * 2 * 2 * 30 = 240GB. (Yeah, it's a very loose estimate.)

A day or two later.

Okay they are ready! Now let’s run the model.

Result: [512, 256, 96] worked the best.

Wut, then is 96 better? or maybe I gotta try with 72? 48?

Ok then I can remove the others… should I? But it took me a day and what if I need it again later..

👏👏👏 YES IT GOES ON AND ON! What if you could just do this?

model = Sequential()
# A mel-spectrogram layer as the first layer of the model;
# len_src (input length in samples) and sr are assumed to be defined
model.add(Melspectrogram(n_dft=512, n_hop=256, input_shape=(1, len_src),
                 border_mode='same', sr=sr, n_mels=128,
                 fmin=0.0, fmax=sr/2, power=1.0,
                 return_decibel_melgram=False, trainable_fb=False))


Experiment result

Hey hey, but it would add more computation at run time. How much is it?

This much,

Screen Shot 2017-06-17 at 14.31.45

for 16384 audio excerpts (30 seconds each), and the convnet looks like this (157k parameters, 5 layers).

Screen Shot 2017-06-20 at 02.14.23

  • The experiment uses dummy data arrays, which add no feeding overhead.
  • We tested on four GPU models and the charts above are normalised, but what really matters is the absolute time.
    • If your model is very simple, the proportion of time that Kapre takes increases.
    • Vice versa: with a large model it shrinks,
      • which is good, because with a large dataset, pre-computing and storing spectrograms is annoying, and you probably want a large network to take advantage of the data; so, among the overall training time, the overhead that Kapre introduces becomes trivial.


paper | repo | slides

EDIT: Slides


Notes on my paper; On the Robustness of Deep Convolutional Neural Networks for Music Classification

In this paper I talk quite a lot about music tagging, quite a lot about audio preprocessing, and present some analysis of the trained network, which relates back to the music tagging problem again.

Music tagging dataset groundtruth is so wrong

Yeah, because of the weak labelling it is quite incorrect. But by how much?

Screen Shot 2017-06-07 at 21.04.31

..about this much. — I manually annotated 4 labels on 500 songs.

Gosh, 70% error? No worries though; it's sort of a 'weakly-supervised learning' situation, in which things are fine with enough data.


But, how much is it fine? — in evaluation?

Screen Shot 2017-06-07 at 20.55.46

red dotted: evaluation of four instrument tags with my annotation.
blue dashed: evaluation of four instrument tags with the MSD groundtruth.
yellow solid: evaluation of all tags with the MSD groundtruth.

Corr(red, blue) = how fine it is to use the MSD groundtruth for the 4 tags.
Corr(blue, yellow) = how fine it is to generalise the 4-tag result to the all-tag result.

And as you see, well, I'd say this is fine. For sure there's error (which can be significant if the differences are subtle).

X vs log(X) if X in [spectrogram, melgram, cqt, …]

tl;dr: use log(x). See this distribution!

Screen Shot 2017-06-07 at 21.17.51

Or, see how disadvantageous it is without log(): roughly 2x the data is needed, which is a lot.

Screen Shot 2017-06-07 at 21.18.42
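A minimal numpy sketch of the log() mapping I mean (my own illustration; the 80 dB clipping threshold is an assumption, not from the paper):

```python
import numpy as np

def amplitude_to_decibel(x, dynamic_range=80.0):
    """Log-compress a magnitude spectrogram to decibels,
    clipping everything below (max - dynamic_range) dB."""
    log_spec = 20.0 * np.log10(np.maximum(x, 1e-10))
    return np.maximum(log_spec, log_spec.max() - dynamic_range)

rng = np.random.default_rng(0)
mag = np.abs(rng.normal(size=(128, 100)))  # a fake magnitude spectrogram
db = amplitude_to_decibel(mag)
print(db.min() >= db.max() - 80.0)  # True: dynamic range clipped to 80 dB
```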

Spectral whitening (per-frequency standard deviation)? A-weighting? Should I do some special normalisation for a better result?


How similar music tags are according to trained convnet?

Screen Shot 2017-06-07 at 21.21.53

More details on the paper:


Paper is out; Transfer learning for music classification and regression tasks, and behind the scene, negative results, etc.

1. Summary

A catch for MIR:

For ML:

So what does it do?


A Convnet is trained.

Then we extract the knowledge like above, and use it for…

  • genre classifications
  • emotion prediction (on the arousal-valence plane)
  • speech/music classification
  • vocal/non-vocal excerpt classification
  • audio event classification

Is it good?


What’s special in this work?

It uses (up to) ALL layers, not just the last N layers. Partly because it's music; I'm not sure what would happen if we did the same for images though. I tried to find a similar approach but found nothing.
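The aggregation idea can be sketched as follows (with fake random 'activations' standing in for a pretrained convnet's outputs, just to show the mechanics):

```python
import numpy as np

def aggregate_features(layer_activations):
    """Average-pool each layer's activation over time, then
    concatenate all layers into a single feature vector."""
    return np.concatenate([a.mean(axis=1) for a in layer_activations])

rng = np.random.default_rng(0)
acts = [rng.normal(size=(32, 100)),  # three fake conv-layer outputs,
        rng.normal(size=(32, 50)),   # each (n_channels, n_frames)
        rng.normal(size=(32, 25))]
feat = aggregate_features(acts)
print(feat.shape)  # (96,)
```

The resulting vector then feeds a simple classifier (e.g. an SVM) for the target task.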



2. Behind the scene

Q. Did you really select the tagging task because,

Music tagging is selected as a source task because i) large training data is accessible for this task and ii) its label set covers various aspects of music, i.e., genre, mood, era, and instrumentations

? You’ve been doing tagging for a year+ and you must be just looking for re-using your weights! !

A. NOOOOOOO… about 1.5 years ago, music recommendation was still part of my PhD plan and I thought I’d do content-based music tagging ‘briefly’ and use the result for recommendation.

…Although I was forgetting that part for a long while.


Q. What else did you try?

A. I tried similarity prediction using 5th-layer features and the similarity data in MSD. Similarity was estimated using the cosine distance of two song vectors. The result is:

Screen Shot 2017-03-28 at 16.33.12

There are 18M pairs of similarities and this plot uses 1K samples of them. The correlation value is something like 0.08, -0.05,… totally uncorrelated.

I suspect that the similarity in the groundtruth is not much about audio similarity, although I couldn't find out how the similarity was gathered. Also, I tested K-NN on the convnet features to see if the neighbours really sound similar, and (ok, it's subjective) they certainly do!

Hopefully there’s similarity dataset that focuses on the sound rather than social/cultural aspects of song so that I can test it again.

… ok, actually I hope someone else finds it and does it =)

I tried something else, too: measuring them by precision@k, as below.

First, collect similar pairs of songs from the ground-truth data (you can threshold the similarity). Let's call these pairs "similar pairs at similarity P". Something like P=0.8 should be used.

Then, for each of these similar pairs (A,B), you will do the following:

(1) Find K most similar songs to A according to the cosine similarity
(2) If B is included in this set, it’s correct.
(3) Do the same for B as well.
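The steps above can be sketched like this (a toy re-implementation with fake song vectors, not the original evaluation code):

```python
import numpy as np

def precision_at_k(features, pairs, k):
    """For each ground-truth similar pair (a, b), count whether b is
    among the k nearest neighbours of a by cosine similarity,
    and vice versa."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)          # exclude the query itself
    topk = np.argsort(-sims, axis=1)[:, :k]  # k nearest per song
    hits = sum(int(b in topk[a]) + int(a in topk[b]) for a, b in pairs)
    return hits / (2 * len(pairs))

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 8))  # 50 fake song vectors
feats[1] = feats[0] + 0.01        # make songs 0 and 1 nearly identical
print(precision_at_k(feats, [(0, 1)], k=5))  # 1.0
```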


checking 139684 pairs, threshold=0.999 (out of 18M pairs)

precision: 0.0105667077117 for 139684 pairs at k=10
precision: 0.0416654734973 for 139684 pairs at k=100
precision: 0.115367543885 for 139684 pairs at k=1000

checking 292255 pairs, threshold=0.8 (out of 18M pairs)

precision: 0.00903662897128 for 292255 pairs at k=10
precision: 0.0364476227952 for 292255 pairs at k=100
precision: 0.106280474243 for 292255 pairs at k=1000

In short, it doesn’t work.


Q. Why is it on ICLR template?

A. Don’t you like it!??


Q. Why didn’t you add the result using ELM?

A. I will do it in another paper, which will include some of this transfer learning results.


That’s it. Again, please also enjoy the paper and the codes.


EDIT: 30 June 2017 – It’s updated to v2, the camera-ready version for ISMIR 2017.

An unexpected encounter to Extreme Learning Machines

I was testing multiple MIR tasks with my pre-trained features from compact_cnn. I thought I was.

It turned out that I didn’t even load the weights file. There is simply no such code. (You’re supposed to laugh at me now. out loud.)

The featured (thumbnail) image is what I ended up doing. Please scroll up and take a look. But I almost didn't realise it and even wrote quite a lot of 'discussion' on top of the results. (Yeah, laugh at me. LOUDER.) Anyway, it means my system was a deep convolutional extreme learning machine.

I couldn’t realised it earlier because the results are quite good. Let’s see how the feature worked with SVM classifier.


Not bad, huh?

Finally, Kyunghyun pointed out that it has been known since 1965 as Cover's theorem.

A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.

— Cover, T.M., Geometrical and Statistical properties of systems of linear inequalities with applications in pattern recognition, 1965
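A tiny demonstration of the theorem, and of the ELM recipe (a fixed random nonlinear projection plus a trained linear readout), on the classic XOR problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR data: not linearly separable in the original 2-D space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# fixed random "hidden layer": project to 50 dims, apply tanh
W = rng.normal(size=(2, 50))
b = rng.normal(size=50)
H = np.tanh(X @ W + b)

# only the linear readout is trained (least squares on +/-1 targets)
w, *_ = np.linalg.lstsq(H, 2.0 * y - 1.0, rcond=None)
pred = (H @ w > 0).astype(int)
print((pred == y).all())  # True
```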

Still quite interesting. Oh, and the non-ELM results, what I thought I was doing, will be posted to arXiv soon. See you there then.

PS. Saxe et al. 2011 is a NIPS paper about it. The Deep Learning book also mentions in its convnet chapter that "Random filters often work surprisingly well in convolutional networks".

How do I preprocess audio signals

With the Urbansound8k dataset, which is of a reasonable size, I did it like this:


The code structure seems not bad. As a result, I have a couple of hdf files, which are convenient to use in Keras.
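The snippet itself isn't shown here, but a librosa-free sketch of the same idea, assuming the audio is already decoded into numpy arrays (the names and parameters are mine), could look like:

```python
import h5py
import numpy as np

def stft_mag(y, n_fft=512, hop=256):
    """Magnitude spectrogram via a plain numpy STFT (Hann window)."""
    win = np.hanning(n_fft)
    frames = [y[i:i + n_fft] * win
              for i in range(0, len(y) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T.astype('float32')

def preprocess(signals, out_path):
    """Store one spectrogram per decoded signal in a single hdf file."""
    specs = np.stack([stft_mag(y) for y in signals])
    with h5py.File(out_path, 'w') as f:
        f.create_dataset('spectrogram', data=specs)

# four fake one-second clips at 22050 Hz
preprocess([np.random.randn(22050) for _ in range(4)], 'features.h5')
with h5py.File('features.h5', 'r') as f:
    print(f['spectrogram'].shape)  # (4, 257, 85)
```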


  • No multi-processing is used here.
  • hdf doesn't yet support multiple readers, which means you can't keep the same file open in different processes. In other words, even if you have N>1 GPUs, you can't just run tasks on it at the same time. There are workarounds though.