YouTube Music Video 5M dataset [Beta]

Introduction

Music videos are quite a thing. Not only because three music videos have now been watched more than INT_MAX times, but also because, thanks to YouTube’s innovative and provocative strategy, which has been shown to work very well (for them), there are loads of music videos online!

Ironically, this means that, unlike when you’re looking for a dataset that provides the music signal itself, you can just download the music content. For free. No API blocking. No copyright law bans you from doing so (redistribution is restricted, though). Not just a 30s preview but the damn whole song. Just because it’s not mp3 but mp4.

As an MIR researcher I find it annoying but also a blessing. {music} is banned but {music video} is not! It’s nonsense, but thanks! Well, as a music lover I totally agree with the situation. OK. That’s enough. ANYWAY,

I beta-released the YouTube Music Video 5M dataset. The readme has pretty much everything and will be kept up to date. In short, it’s 5M+ YouTube music video URLs that are categorised by Spotify artist IDs, which are sorted by some sort of artist popularity.
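To make that a bit more concrete, here is a minimal sketch of how one might hold such a listing in Python and pull the video ID out of each URL. The dict layout, the artist IDs, and the URLs are all placeholders made up for illustration (check the readme for the actual file format), and `youtube_id` is a hypothetical helper, not something shipped with the dataset.

```python
from urllib.parse import urlparse, parse_qs


def youtube_id(url):
    """Return the video ID from a YouTube URL, or None if none is found.

    Hypothetical helper: handles only the two common URL shapes
    (youtube.com/watch?v=... and youtu.be/...).
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host == "youtu.be":
        # Short links carry the ID in the path.
        return parsed.path.lstrip("/") or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Watch links carry the ID in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


# ASSUMED layout: Spotify artist ID -> list of video URLs.
# Every ID and URL below is a placeholder, not a real dataset entry.
dataset = {
    "artist_id_A": [
        "https://www.youtube.com/watch?v=VIDEO_ID_01",
        "https://youtu.be/VIDEO_ID_02",
    ],
    "artist_id_B": ["https://www.youtube.com/watch?v=VIDEO_ID_03"],
}

# Same structure, but with bare video IDs instead of full URLs.
ids_by_artist = {
    artist: [youtube_id(url) for url in urls]
    for artist, urls in dataset.items()
}
```

Working with bare IDs rather than full URLs should make deduplication easier, since the same video can show up under several URL shapes.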

Unfortunately I can’t redistribute anything further that is crawled from either YouTube (e.g., music video titles) or Spotify (e.g., artist genre labels), but you can get them yourself. (My gratitude to Lostanen for volunteering as an online, non-certified, temporary lawyer.)

But how can we make it complete?

There is a lot of potential in a dataset like this. Perhaps we can filter the videos by some metric such as #songs/artist and then do some MTurk sort of thing to label them, or get some artist-level information using the Spotify API (Can I expect some official help? 🙂). Let’s brainstorm. How should we filter? What features should and could we add? Would it be a good idea to build something on top of the YouTube IDs? (I think so, but there would be some problems too.)
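As a starting point for the filtering idea, here is a rough sketch of a #songs/artist filter. It assumes the listing has already been loaded into a plain dict from artist ID to a list of video URLs; that layout, the function name `filter_by_video_count`, and the toy data are all made up for illustration and are not part of the released dataset.

```python
def filter_by_video_count(dataset, min_videos=1, max_videos=None):
    """Keep only artists whose number of distinct videos is within range.

    dataset: dict mapping artist ID -> list of video URLs (ASSUMED layout).
    max_videos=None means no upper bound.
    """
    kept = {}
    for artist_id, urls in dataset.items():
        n = len(set(urls))  # count each distinct URL once
        if n >= min_videos and (max_videos is None or n <= max_videos):
            kept[artist_id] = urls
    return kept


# Toy data with placeholder IDs and URLs, not real dataset entries.
toy = {
    "artist_id_A": ["url_1", "url_2", "url_3"],
    "artist_id_B": ["url_4"],
    "artist_id_C": ["url_5", "url_5"],  # a duplicate counts once
}

# Artists with at least two distinct videos.
at_least_two = filter_by_video_count(toy, min_videos=2)
```

The thresholds are something to argue about: a lower bound drops artists with too few videos to be useful, while an upper bound could keep a handful of very prolific artists from dominating.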
