LSTMetallica: Generating drum tracks by learning the drum tracks of 60 Metallica songs


Update 20 Apr 2016: Check out the paper on arXiv (PDF)


A word-RNN (LSTM) on Keras, trained on wordified text representations of Metallica’s drumming MIDI files. In short:

  • MIDI files of the Metallica tracks come from
  • The LSTM model comes from Keras.
  • Read the MIDI files with python-midi.
  • Convert them into a text file (corpus) by my rules, which are
    • (temporal) quantisation
    • simplification/omitting some notes
    • a ‘word’ made of binary digits
  • Train an LSTM model on the corpus and generate new drumming by predicting words.
  • Convert the words in the text file → MIDI according to the rules above.
  • Listen!

A quick look at things (copied and pasted from my previous post)


LSTM (Long Short-Term Memory) is a type of RNN that is known to learn sequences effectively.


RNN (Recurrent Neural Network) is a type of deep learning neural network. See this post by WildML for further understanding.


Keras is a deep learning framework that runs on Theano or TensorFlow. I used Theano as the backend, but this shouldn’t affect the output.

Another quick look at things


Metallica is an American heavy metal band formed in Los Angeles, California. Metallica was formed in 1981 when vocalist/guitarist James Hetfield responded to an advertisement posted by drummer Lars Ulrich in a local newspaper. Wikipedia

Why Metallica?

Because I found quite a lot of Metallica MIDI tracks – more than for any other artist so far. The drum tracks in Metallica songs are also relatively consistent, and the simplification in the preprocessing makes sense for Metallica drum tracks according to…… my brain.

Preprocessing – how to model a rhythm representation as text?

LSTM models are basically about time-series modelling, i.e. 1-D data. In my previous work it was straightforward to model a chord progression as text (so straightforward that I didn’t even mention ‘modelling’). Here I converted the MIDI tracks into symbolic 1-D data so that I could treat them as text.

A drum track is not 1-D

[Figure: piano-roll view of a Metallica drum track]

No, it isn’t, as you can see above. This is a so-called piano-roll view: the y-axis is pitch, the x-axis is time, and each note represents a different part of the drum kit. Here, blue: kick, green: snare, yellow-or-olive-or-whatever: (opened) hi-hats, and red: crash cymbals. Yes, this is a piano-roll view of the drum track of Master of Puppets – from 0:28 in this live performance:

It is not 1-D because drummers use their arms and legs simultaneously.

First, Quantisation and Simplification

Quantisation means snapping the notes to certain timings, and only those timings – it’s just a rounding function along the time axis. I quantised the MIDI files to 16th notes, assuming Lars Ulrich doesn’t play anything finer. This introduces some errors, especially where there are triplets.
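The rounding function above can be sketched in a few lines. This is a minimal illustration, not the script I used; it assumes absolute MIDI ticks and the file's ticks-per-quarter-note resolution (PPQ):

```python
def quantise_tick(tick, resolution):
    """Snap an absolute MIDI tick to the nearest 16th-note grid point.

    `resolution` is the file's ticks per quarter note (PPQ),
    so a 16th note spans resolution / 4 ticks.
    """
    grid = resolution / 4.0  # ticks per 16th note
    return int(round(tick / grid) * grid)

# With the common PPQ of 480, a 16th note is 120 ticks:
print(quantise_tick(115, 480))  # a slightly early note snaps to 120
print(quantise_tick(190, 480))  # closer to 240 than to 120, snaps to 240
```

This is exactly where the triplet errors come from: a triplet note falls between grid points and gets pushed onto the nearest 16th.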

The further simplification I did was to limit the types of notes: a kick, a snare, open hi-hats, closed hi-hats, three tom-toms, a crash, and a ride – 9 notes in total. Using the General MIDI drum map, this is expressed as below:

allowed_pitch = [36, 38, 42, 46, 41, 45, 48, 51, 49]  # 46: open HH

drum_conversion = {35: 36,  # acoustic bass drum -> bass drum (36)
                   37: 38, 40: 38,  # 37: side stick, 38: acoustic snare, 40: electric snare
                   43: 41,  # 41: low floor tom, 43: high floor tom
                   47: 45,  # 45: low tom, 47: low-mid tom
                   50: 48,  # 50: high tom, 48: hi-mid tom
                   44: 42,  # 42: closed HH, 44: pedal HH
                   57: 49,  # 57: crash 2, 49: crash 1
                   59: 51, 53: 51, 55: 51,  # 59: ride 2, 51: ride 1, 53: ride bell, 55: splash
                   52: 49}  # 52: China cymbal
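Applying those tables boils down to a lookup followed by a filter. A minimal, self-contained sketch (the tables are repeated inline so the snippet runs on its own; this is the idea, not my original script):

```python
# Map less common drum notes onto the nine allowed ones; drop everything else.
ALLOWED = [36, 38, 42, 46, 41, 45, 48, 51, 49]
CONVERT = {35: 36, 37: 38, 40: 38, 43: 41, 47: 45, 50: 48,
           44: 42, 57: 49, 59: 51, 53: 51, 55: 51, 52: 49}

def simplify(pitch):
    """Return the simplified pitch, or None if the note is discarded."""
    pitch = CONVERT.get(pitch, pitch)
    return pitch if pitch in ALLOWED else None

print(simplify(35))  # acoustic bass drum -> 36 (bass drum)
print(simplify(44))  # pedal hi-hat -> 42 (closed hi-hat)
print(simplify(54))  # tambourine: not mapped, not allowed -> None
```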

My encoding scheme (midi→text)

The basic idea is to represent, with a single word, whether each of the nine notes is played (True) or not (False) at a given time step. The words are rather self-explanatory:

‘000000000’ : nothing played
‘100000000’ : kick is played
‘100000001’ : kick and crash played
‘010100000’ : snare and open-HH played

Obviously, 1 is True and 0 is False, each at its dedicated position, following this rule:

[Figure: the rule mapping each of the nine bit positions to a drum part]

with ‘BAR’ added to mark the boundaries between measures.
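The encoding can be sketched as below. I'm assuming the bit order follows the allowed_pitch list shown earlier (kick first, crash last), which matches the example words:

```python
# Assumed bit order, following the allowed_pitch list: kick ... crash.
ALLOWED = [36, 38, 42, 46, 41, 45, 48, 51, 49]

def encode_word(pitches):
    """Encode the set of pitches sounding at one 16th-note step as a 9-bit word."""
    return ''.join('1' if p in pitches else '0' for p in ALLOWED)

print(encode_word(set()))      # nothing played
print(encode_word({36}))       # kick
print(encode_word({36, 49}))   # kick and crash
```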

Corpus looks like this:

0b010000000 0b010000000 0b000000000 0b010000000 0b010000000 0b000001000 0b000000000 0b000001000 0b010000000 0b010000000 0b000000000 0b010000000 0b010000000 0b000001000 0b000000000 0b000001000 BAR 0b010000000 0b010000000 0b000000000 0b010000000 0b010000000 0b000001000 0b000000000 0b000001000 0b010000000 0b000000000 0b000000000 0b000001000 0b000000000 0b000001000 0b000001000 0b000000000 BAR 0b100000001 0b000000000 0b000000000 0b000000000 0b010000001 0b000000000 0b000000000 0b000000000 0b100000001 0b000000000 0b000000000 0b000000000 0b010000001 0b000000000 0b000000000 0b000000000 BAR …


where the prefix 0b is added to indicate that it is a binary number.
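Going back from a corpus word to MIDI pitches is the mirror image of the encoding. Again a hedged sketch, assuming the same bit order as the allowed_pitch list:

```python
ALLOWED = [36, 38, 42, 46, 41, 45, 48, 51, 49]  # assumed bit order

def decode_word(word):
    """Turn a corpus word like '0b100000001' back into the set of MIDI pitches."""
    bits = word[2:] if word.startswith('0b') else word
    return {p for p, b in zip(ALLOWED, bits) if b == '1'}

print(sorted(decode_word('0b100000001')))  # kick (36) and crash (49)
```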

LSTM Structure

I applied a word-RNN here. Some numbers:

  • 60 songs as training data
  • Total number of words: 2,141,692 (including a ‘BAR’ every 16 words)
  • Vocabulary size: 119 unique words
    • 119 out of 2**9 == 512 possibilities.
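Those vocabulary numbers come straight out of the corpus: split the text on whitespace, deduplicate, and index. A tiny sketch with a toy corpus (not the real 2.1M-word one):

```python
def build_vocab(corpus_text):
    """Collect the word sequence, vocabulary, and word-to-index map of a corpus."""
    words = corpus_text.split()
    vocab = sorted(set(words))
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    return words, vocab, word_to_idx

corpus = '0b100000000 0b010000000 BAR 0b100000000'
words, vocab, word_to_idx = build_vocab(corpus)
print(len(words), len(vocab))  # 4 words in total, 3 distinct
```

On the real corpus, len(words) is 2,141,692 and len(vocab) is 119.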

The code is the same as I used for LSTM Realbook:

from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(maxlen, num_chars)))
model.add(LSTM(512, return_sequences=True))
model.add(LSTM(512, return_sequences=False))
model.add(Dense(num_chars))       # output layer over the vocabulary
model.add(Activation('softmax'))  # needed for categorical_crossentropy

model.compile(loss='categorical_crossentropy', optimizer='adam')
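The model above expects one-hot-encoded windows of words. The standard preparation (as in the Keras text-generation example; maxlen and the index sequence here are illustrative, not my exact values) looks like this:

```python
import numpy as np

def vectorise(indices, maxlen, vocab_size, step=1):
    """Slice a word-index sequence into overlapping windows and one-hot encode them.

    X[i] holds maxlen consecutive words; y[i] is the word that follows them.
    """
    sentences, next_words = [], []
    for i in range(0, len(indices) - maxlen, step):
        sentences.append(indices[i:i + maxlen])
        next_words.append(indices[i + maxlen])
    X = np.zeros((len(sentences), maxlen, vocab_size), dtype=bool)
    y = np.zeros((len(sentences), vocab_size), dtype=bool)
    for i, sent in enumerate(sentences):
        for t, idx in enumerate(sent):
            X[i, t, idx] = True
        y[i, next_words[i]] = True
    return X, y

# Toy sequence over a 3-word vocabulary:
X, y = vectorise([0, 1, 2, 1, 0, 2], maxlen=3, vocab_size=3)
print(X.shape, y.shape)  # (3, 3, 3) (3, 3)
```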

The results

After the first iteration it hadn’t yet learned the structure of bars and 16th notes.

0b000000000 0b010100000 0b100000000 0b100010011 0b000000000 0b000000000 0b100001010 0b100000000 0b000000000 0b010000000 0b000000000 0b101000000 0b000011000 0b000000000 0b001001101 BAR
0b000000000 0b000000000 0b001000000 0b100000000 0b100000001 0b010000000 0b000100000 0b000000000 0b000001100 0b000000000 0b000000000 0b101000000 0b000000000 0b011000000 0b001000000 0b000000000 BAR
0b100000000 0b000000000 0b000001000 BAR
0b000000000 0b100100000 0b010100000 0b100100000 0b010100000 0b101000000 BAR
0b000000000 0b000000000 0b000000000 0b100000010 0b000000000 0b010100000 0b000000000 0b101000000 0b000000000 0b101000000 0b000100000 0b011000000 0b100000001 0b000000000 0b000000000 0b100100000 0b011000000 0b000000000 0b001000000 0b101000000 0b000000000 0b010000001 0b010000000 0b101000000 0b100100000 0b100000000 0b100000000 0b000000000 0b000000000 0b101000000 0b010010000 0b000000000 0b101000000 0b101000000 0b000000000 0b000000000 0b000000000 0b101000000 0b000000000 0b000011000 0b000000000 0b000000000 0b100100000 0b000010000 0b100100000 0b000000000 0b100000000 0b001000000 0b000000000 0b101000000 0b000000000 0b000000000 0b000000000 0b000000000 BAR
0b000000000 0b101000000 0b100000001 0b000000000 0b000000000 0b000000000 0b000100000 0b000000000 0b000000000 0b000000000 0b000000000 BAR 0b000000000 0b000000000 0b000000000 0b100000000 BAR


After 45 iterations it looks more structured. (I added \n for readability.)

[Figure: generated corpus after 45 iterations]

After 60 iterations it became a bit more structured.

[Figure: generated corpus after 60 iterations]

Okay, this is getting boring – let’s listen to the rhythm.

First, here is a score for one generated track:

[Figure: drum score of the generated track]

It looks like a proper drum score. You can listen to it below:


I ran 60 iterations with diversity parameters of [0.5, 0.8, 1.0, 1.25, 1.5]. I’ll present 10 tracks – every diversity parameter × the results of the 30th and 60th iterations.
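The diversity parameter is the softmax temperature used when sampling the next word. A sketch following the standard Keras text-generation recipe (I'm assuming that's what was used here, since the diversity values match its convention):

```python
import numpy as np

def sample(preds, diversity=1.0):
    """Sample a word index from the network's output distribution.

    Low diversity sharpens the distribution towards the most likely word
    (safer, more repetitive drumming); high diversity flattens it
    (wilder, more 'virtuosic' output).
    """
    preds = np.asarray(preds, dtype=np.float64)
    preds = np.log(preds + 1e-12) / diversity  # rescale log-probabilities
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)      # renormalise (softmax)
    return int(np.argmax(np.random.multinomial(1, preds, 1)))

# At very low diversity this is effectively argmax:
print(sample([0.1, 0.8, 0.1], diversity=0.01))  # prints 1
```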

[Figure: the 10 generated tracks]

With diversity values of 1.50 and 1.25, the tracks sound a bit too virtuosic. Tracks 3, 8, 9, and 10 are boring but probably make more sense.

Tracks 8 and 9 are interesting: to some extent they have regular patterns of kick, snare, and hi-hats plus crash cymbals.

Do they sound like Lars Ulrich?… Perhaps not yet.


  • It looks like I have to fine-tune the diversity parameter to get more reasonable drum tracks.
  • The proposed encoding scheme – based on nine binary digits – seems to make sense.
  • Learning with a different segmentation would probably lead to a whole, complete track with a proper beginning and ending.
  • I don’t think it deeply understands the structure – the regular patterns of kick, snare, and hi-hats, and their meaning (except in tracks 8 and 9). At least that didn’t come to it easily.
  • It would also be cool to have more data from other bands, for some pun fun. E.g. Rage Against the Long Short-Term Memory.
  • What I really want is to do this with jazz drum tracks so that I can combine them with LSTM Realbook. Are there any good (and hopefully free) resources for that?

Code & dataset

The dataset is now shared in my repo. I will share the code soon.


15 thoughts on “LSTMetallica: Generating drum tracks by learning the drum tracks of 60 Metallica songs”

  1. I stumbled upon your post.
    Well, I wouldn’t describe myself as a machine learning professional, but I think no one can deny that you’ve made a great thing here, which shows in a very fun way what machine learning can do in real life.


      1. Thank you.
        I really enjoyed your work and would like to contribute to it and use it as a basis for new research (with citation, of course).
        Could you just tell me whether you created the algorithm yourself or used a ready-made tool?


  2. Oh, excuse me, I got confused with the chord progression work. I coded it myself using . Sadly, I can’t find the script I used – I’m afraid I removed it while cleaning up the folder. (I don’t want to believe it… I never delete code; I just found this out.)

    In the file above there are two classes – Note and Note_List. The methods of Note_List are listed in the order they are supposed to be used: add notes, quantise them, simplify the drum events, and return the result as text. I hope it helps.

