Q&A: How to transcribe rap songs

… I want to understand what they are rapping about … I want to ask if it is indeed possible to transcribe rap songs? I have vocals extracted from the songs and tried to use Google speech2text API for it but the results look very random and bad. I am given the impression that transcribing songs in general to lyrics is far from achieving ok performance atm. Does this sound right to you?

(From an email I received)

From what I’ve heard, I agree – vocal extractor + automatic speech recognition is not a completely working solution. So, what’s the real issue and what can we do? When you’re asking a question like this, you don’t want an answer like “Make a rap song dataset and train a neural net with WaveNet and GAN and CTC loss end-to-end”. Oh wait.. yes that could work, too. But we probably want something simpler.

Obviously, we should google “singing voice transcription”. People working on it must’ve discussed the issues they faced, although you’ll also need to take the difference between singing and rapping into account. However, in this post, I’ll just describe my thought process without reading papers.

Problem definition

People are rapping over some music / beat and we have recordings of it. These are not production music that we can find on streaming services. We want to transcribe the lyrics. The language is English.

Vocal extractor

Sound source separation is working pretty well these days. Check out recent demos like Open-Unmix. (CAVEAT – training datasets for vocal source separation usually consist of singing voices, not rapping. Even with strong drum beats in the mix, I wouldn’t worry too much about the difference between singing and rapping (or more like, the difference in the patterns they have on waveform/spectrogram representations). It’s still worth keeping in mind.) OK, seems like it’s pretty good.

Speech recognition

I believe Google’s speech recognition API should be (nearly) state-of-the-art. That said, it’s not a bad idea to do some survey.. by Googling “google speech2text API performance”. Checked out the first link. Ok, they don’t seem to provide any official performance numbers, which can be annoying but also makes sense.
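Since there are no official numbers, one option is to measure it yourself: transcribe a few clips by hand and compare them against whatever the API returns. Below is a minimal sketch of word error rate (WER), the usual metric for this, written with the standard library only. The function name and the lowercase-and-split normalization are my own assumptions; a real evaluation would probably also strip punctuation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

A WER of 0 means a perfect match, and it can go above 1 when the hypothesis contains many extra words. Running this on the same clip with and without vocal separation is a cheap way to see which stage is hurting.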

If speech recognition is the problem, why would it be? If it’s not an end-to-end model, an ASR (automatic speech recognition) model consists of two stages – an acoustic model (audio-to-phoneme) and a language model (phoneme-to-word). Even in an end-to-end model, both are still happening, just seamlessly, and we could choose to look at them separately.

Conclusion… NOT

WELL I DON’T KNOW

🤪

Ok let’s try further.

Break down the problems

We have a bunch of smaller questions to answer. I only have some ideas about how we can answer them.

  • Does a vocal separator work well for rapping voices?
    • Rapping voices are mixed loud enough, and the instrumentation is relatively sparse in Hip-hop. So yes, I think it would work well. This can be tested by running real examples through the model you use.
  • Does a vocal separator work well for non-production recordings?
    • This is hard to answer because I don’t know the recording quality. But probably yes, as long as the signal-to-noise (or voice-to-others) ratio is at least as large as in typical music, and as long as the ‘others’ sound similar to the typical non-vocal components in music.
  • Does a vocal separator work well for hip-hop?
    • Besides rapping vs singing, I don’t see any potential problem with Hip-hop as a target for a vocal separator.
  • Does speech recognition work well for rapping?
    • 🤔.. next section.
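The voice-to-others ratio mentioned above can be roughly estimated once you have two stems from a separator. Here is a minimal sketch, assuming the vocal and accompaniment stems are already loaded as plain lists of float samples (how you load them is up to you, and the function names are made up for illustration). Keep in mind that the stems are themselves estimates, so this is a rough sanity check, not a measurement.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def voice_to_others_db(vocals, accompaniment):
    """Ratio of vocal level to accompaniment level, in decibels."""
    vocal_level = rms(vocals)
    other_level = rms(accompaniment)
    if vocal_level == 0 or other_level == 0:
        raise ValueError("both stems must contain non-silent audio")
    # 20 * log10 because we are comparing amplitudes, not powers.
    return 20 * math.log10(vocal_level / other_level)
```

A positive value means the voice is louder than everything else. Comparing this number between your recordings and a few commercial tracks gives a hint about whether your material is harder than what the separator was trained on.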

Transcribe rapping voices

  • On one hand, rapping is more similar to typical speaking than singing is. I.e., your use-case should be pretty compatible with the acoustic model in the ASR model.
  • On the other hand, there are some particular words and expressions in rapping that won’t show up in typical conversation and script reading (=training dataset). I.e., your use-case might not be compatible with the language model.
    • There might be a way to adapt your model to the vocabulary you’re expecting to appear a lot in your test case.
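One simple way to picture vocabulary adaptation: if your recognizer can return several candidate transcripts with scores, you can nudge the choice toward candidates that contain words you expect. The sketch below is only that, a toy rescoring step with the standard library. The `(text, score)` input format, the `bonus` value, and the function name are all hypothetical and not any particular API's interface. Some hosted services expose their own phrase-hint options for the same purpose, which would be the first thing to check.

```python
def pick_with_vocab_bias(hypotheses, domain_vocab, bonus=0.5):
    """Pick the best transcript after rewarding expected domain words.

    hypotheses: list of (text, score) pairs, higher score = more likely.
                (Hypothetical format; adapt it to your recognizer's output.)
    domain_vocab: words you expect to hear a lot, e.g. slang or artist names.
    bonus: score added per matching word (an arbitrary illustrative value).
    """
    vocab = {word.lower() for word in domain_vocab}

    def biased_score(item):
        text, score = item
        hits = sum(1 for word in text.lower().split() if word in vocab)
        return score + bonus * hits

    best_text, _ = max(hypotheses, key=biased_score)
    return best_text
```

For example, with candidates `("i got the flu", -1.0)` and `("i got the flow", -1.2)` and a vocabulary containing "flow", the second one wins after the bonus. Set the bonus too high and it will hallucinate your vocabulary everywhere, so it needs tuning against hand-checked examples.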

Conclusion

Let’s say there’s no conclusion here 🙂 chew it, digest it, and find your way!

Another thought

If it’s production music, I’d rather run audio fingerprinting and look up the corresponding lyrics.
