1. Is There Really A Problem?

The scale of copyright infringement on otherwise legal sites

We’re told that Article 13 will ‘break the internet’, and one of the ways it would cause that is by forcing YouTube and similar sites to shut down, or block EU users from uploading or accessing the site, or impose some other draconian resolution. This would supposedly happen because there is far too much content being uploaded - e.g. 400 hours of video uploaded to YouTube per minute, according to Google’s own data - and Article 13 makes the content-sharing sites liable for all of it. Opponents claim this is insurmountable because:

  1. It’s too much content for a human moderating team to manually check every upload.
  2. It’s too complex a task for a computer to automatically check without causing many false positives (i.e. videos judged to infringe when a human assessment would rule otherwise).

These two points are true, at least for a massive general purpose site such as YouTube, maybe even smaller ones such as Tumblr. However, what Article 13 opponents are missing is that not only is it not a choice between one or the other, but there are other tools, processes, and policies available to augment content checking by humans and computers. For example, why is Wikipedia (another top-10 website, like YouTube) not rife with copyright infringement, even though you don’t need an account to edit it? Why do we not keep hearing tales of Bandcamp being full of other people's music, even though it’s obviously designed to allow users to distribute music to the public? It turns out that ensuring your platform is mostly free of infringement is not as impossible as some people would have you believe.

The Elephant In The Room

There are many big content sharing platforms around today - Twitter, Instagram, Facebook, YouTube, Tumblr, Pinterest, and so on. Each offers many features to its users, including the ability to freely upload content, and each of these sites has some degree of copyright infringement, some more than others. By far the biggest example is YouTube, as a repository of both video and music content and as the second most popular site in the world, so it’s useful to examine it closely. To what degree is it a problem?

YouTube is a site with a wide range of different offerings to suit all audiences. There are various genres that never really existed before it came along, some of which can be seen on this list of “the 20 types of videos that get the most views on YouTube” - “Couple VLogs”, “Self-Improvement”, “Unboxing”, “Haul Videos”, and so on. Other lists of top video categories exist: here, and here, and here. What do all these lists have in common? They’re all wrong. Each of them omits the category of video on YouTube that is most popular by far, and that is simply Music.

It’s hard to prove this definitively without having access to YouTube’s internal stats, but the evidence we do have makes any alternative relatively implausible. Wikipedia’s page on “Most-Viewed YouTube Videos” shows that, at the time of writing, an astonishing 95 out of the top 100 most popular videos on YouTube are music videos.

Obviously the major pop artists seen in that list are outliers. To be rigorous, we need to ask whether this popularity extends to other artists and to music in general, and how it stacks up against other categories.

The company Pex attempts to track audio and video online. Their research in October 2017 suggested that at least 27.7% of YouTube traffic was in music, the most popular category - and that this is an underestimate since not all music videos are correctly categorised.

Obviously I have no insight into the methodology used by Pex, so it's hard to corroborate their figures. So, as a software engineer by trade, I set about doing a similar survey for myself. This involved writing some simple software that requested 10,000 random YouTube videos, looked at the content’s metadata - such as the uploader, view count, title, category, etc - and gathered some statistics that reflect the site as a whole.

My data suggests that although music videos only make up roughly 15% of uploads, they account for over 47% of all views on the site. Music videos are often shorter than other video forms such as LetsPlays or Vlogs, so you might think that the relative viewing time is low - but again, once you do the mathematics, music still makes up about 35% of all time spent watching YouTube - more than gaming, more than other entertainment. In other words, at any given time, roughly 1 in 3 of YouTube’s video-watching users is listening to music.
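The aggregation behind those three percentages can be sketched roughly as follows. This is not the actual survey tool, only a minimal illustration: the records in `SAMPLE` are hypothetical, and watch time is approximated by assuming every view plays the whole video, which overstates it.

```python
from collections import defaultdict

# Hypothetical sample records: (category, view_count, duration_seconds).
# These are illustrative values, not data from the survey.
SAMPLE = [
    ("Music", 1200, 240),
    ("Gaming", 300, 1800),
    ("Music", 800, 200),
    ("People & Blogs", 100, 600),
    ("Entertainment", 400, 900),
]


def category_shares(videos):
    """Return {category: (upload_share, view_share, watch_time_share)}."""
    uploads = defaultdict(int)
    views = defaultdict(int)
    watch = defaultdict(int)
    for category, view_count, duration in videos:
        uploads[category] += 1
        views[category] += view_count
        # Assumption: each view lasts the full duration of the video.
        watch[category] += view_count * duration
    total_uploads = sum(uploads.values())
    total_views = sum(views.values())
    total_watch = sum(watch.values())
    return {
        c: (
            uploads[c] / total_uploads,
            views[c] / total_views,
            watch[c] / total_watch,
        )
        for c in uploads
    }


shares = category_shares(SAMPLE)
for category, (up, vw, wt) in shares.items():
    print(f"{category}: {up:.0%} of uploads, {vw:.0%} of views, {wt:.0%} of watch time")
```

Separating the three shares matters because a category of short videos can dominate views while taking a smaller slice of watch time, which is the pattern described above for music.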

Is this just a side-effect of the recommendation algorithm preferring music, perhaps? No - it seems that the top searches on YouTube are disproportionately weighted towards bands and music. People are going there in large numbers explicitly to find music, and their search is successful.

What about the question of how music compares to other categories, not just for the pop stars but for videos in general? To avoid the data being distorted by millions of views for top pop stars (or, indeed, other YouTube celebrities), we can look at the median view count for videos in each category (music, news & politics, people & blogs, etc). The data suggests that the median view count of all videos on YouTube is roughly 490 views, whereas the median view count of music videos is somewhere around 950. If you take music videos out of the equation the median video view count is closer to 425, which suggests that music videos are generally 2.2x as popular as other videos.
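To illustrate why the median is used here, the sketch below compares medians on a small hypothetical sample in which each category contains one viral outlier, then reproduces the 2.2x ratio from the two medians quoted above. The sample values are invented for illustration only.

```python
from statistics import mean, median

# Hypothetical (category, view_count) pairs; each category has one viral hit.
SAMPLE = [
    ("Music", 900), ("Music", 1000), ("Music", 50_000_000),
    ("Gaming", 400), ("Gaming", 450), ("Gaming", 2_000_000),
]

music = [views for category, views in SAMPLE if category == "Music"]
other = [views for category, views in SAMPLE if category != "Music"]

# The median ignores the size of the outliers; the mean is dominated by them.
music_median = median(music)
other_median = median(other)
music_mean = mean(music)

# Ratio using the medians quoted in the article (950 for music, 425 without music).
article_ratio = 950 / 425

print(music_median, other_median, round(article_ratio, 1))
```

With the quoted medians, 950 / 425 comes to roughly 2.2, which is where the "2.2x as popular" figure comes from.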

Therefore we can reasonably argue, based on how it is actually used, that YouTube is first and foremost a music site, with the other categories almost an afterthought. It is also likely that the majority of its advertising revenue comes from ads shown alongside music videos. So, despite many anti-copyright activists dismissing the music industry as an isolated group selling “legacy content” - the implication being that recorded music is somehow outdated or obsolete - the actual and inescapable truth is that music is a disproportionately massive part of YouTube’s appeal and business offering in the present day.

But is it infringement?

We can see that YouTube is much more of a music site than some people are willing to accept. But is this a problem, necessarily?

It is certainly the case that many of the top-viewed music videos on YouTube are completely authorised uploads, put there by the artist’s account or their label. But that doesn’t mean the unauthorised ones are just unsigned bands nobody has heard of - this Ed Sheeran video with over 30 million views was uploaded by a random user, as was Lynyrd Skynyrd’s Sweet Home Alabama with over 45 million, and Iron Maiden’s The Trooper has over 70 million.

By looking at the videos in my data set and analysing the text in the descriptions, it appears that less than half of the music videos on YouTube were clearly uploaded by the original artist on their own channel, or by authorised people acting on their behalf, such as a distributor or record label.

Many of the rest appear to have been uploaded by fans, despite this being an obviously illegal practice. In about half of these cases, the much-maligned 'upload filters' spotted the content and applied a licence to it. The content was still uploaded without permission and in breach of section 7.4 of YouTube's Terms of Service, although this is rarely enforced. My survey suggests that ContentID - YouTube's system for matching content and blocking or applying advertising revenue to it - is ‘retroactively legalising’ roughly 22% of the videos that users upload without permission. We’ll come back to this figure shortly.

(It's also worth remembering that the typical independent musician does not get to use ContentID themselves. They can sometimes work with a third party to use it - meaning the already meagre share of advertising revenue it doles out is diluted even further.)

If you take the amount of video uploaded daily, work out how much of that was music, then see how much of it was not authorised by the rightsholders, it’s a massive amount. I used my tool to check the metadata and personally examined a representative sample of the remaining videos that the tool couldn't understand, and came to these figures:

If we do the maths, taking the amount of video being uploaded per day and extrapolating that to the number of music videos likely to be among them, we see that there are likely to be over 100,000 infringing music videos uploaded daily. If we include the unauthorised uploads that got matched by ContentID, it goes up to over 215,000.
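As a rough consistency check, the quoted figures can be worked backwards to see what daily volume of music uploads they imply. The sketch below is an illustration under stated assumptions, not the original calculation: it reads the 22% ContentID figure as a share of all music uploads, and it treats the "over 100,000" and "over 215,000" figures as exact, so the results are only approximate.

```python
# Figures quoted in the article (rounded, and partly lower bounds).
UNMATCHED_PER_DAY = 100_000          # infringing uploads not matched by ContentID
ALL_UNAUTHORISED_PER_DAY = 215_000   # the same, plus ContentID-matched uploads
MATCHED_SHARE_OF_MUSIC = 0.22        # assumption: share of ALL music uploads


def implied_daily_figures(unmatched, all_unauthorised, matched_share):
    """Back out the daily music upload volume implied by the quoted figures."""
    matched = all_unauthorised - unmatched
    music_uploads = matched / matched_share
    return {
        "matched": matched,
        "music_uploads": music_uploads,
        "unmatched_share": unmatched / music_uploads,
        "unauthorised_share": all_unauthorised / music_uploads,
    }


figures = implied_daily_figures(
    UNMATCHED_PER_DAY, ALL_UNAUTHORISED_PER_DAY, MATCHED_SHARE_OF_MUSIC
)
for name, value in figures.items():
    print(f"{name}: {value:,.3f}")
```

Under these assumptions the numbers imply roughly 520,000 music uploads per day, of which about 19% are unauthorised and unmatched and about 41% are unauthorised in total.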

Is it a big deal?

So we know there’s a lot of music on YouTube that shouldn’t be, and we know that it’s getting a lot of attention from YouTube’s audience. But still, YouTube is a video site, and people are mostly listening to music via music services, such as Spotify, Apple Music, Beats, Deezer… right? Well, not as much as you might think. According to The Trichordist’s Streaming Price Bible, in 2018 YouTube’s ContentID accounted for over 48% of the industry’s recorded streams, making it by far the most popular destination for music online.

And there’s one more thing - ContentID only applies to a minority of the music on YouTube, namely the 22% of automatically tagged user uploads plus the 16.5% that were supplied directly by distributors and labels - by view count, these come to about 70% of all the views of the Music category on YouTube. So that 48% market share figure starts to look a lot more like 57.5% once you factor in the other, untracked content.
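The adjustment can be sketched as below. It assumes the 48% is YouTube's share of all tracked streams, and that the untracked views are added to both YouTube's count and the overall total; the function name and parameters are illustrative. With the rounded inputs of 48% and 70% it gives about 57%, close to the 57.5% quoted, and the small gap presumably comes from the inputs being rounded ("over 48%", "about 70%").

```python
def adjusted_share(tracked_share, tracked_fraction_of_views):
    """Estimate a platform's share of streams once untracked views are included.

    tracked_share: the platform's share of all tracked streams (e.g. 0.48).
    tracked_fraction_of_views: fraction of the platform's music views that
        are tracked (e.g. 0.70).
    """
    # Scale the tracked streams up to cover the untracked views as well.
    platform_total = tracked_share / tracked_fraction_of_views
    # Streams on every other service are assumed to be fully tracked.
    others = 1.0 - tracked_share
    return platform_total / (platform_total + others)


share = adjusted_share(0.48, 0.70)
print(f"{share:.1%}")
```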

To put this in very stark terms - most music being streamed online comes from a site where less than half of it was put there with the owner’s clear consent.

So we clearly do have a widespread problem of unauthorised music uploads, some of which get licensed after the fact and some of which do not.

Can it be helped?

Many Article 13 opponents don’t really deny the level of infringement we just covered - indeed, the chief opponent of the directive in the European Parliament even admits that “copyright infringement is pretty common”. They just claim that this level of infringement is a necessary evil in order to get the benefits to society that these internet platforms supposedly bring, and that there’s nothing that can be done without wrongly blocking a lot of legitimate content, or denying such content the visibility it deserves. But it turns out that there is an entire toolbox at a platform’s disposal which could do a much better job of finding and removing infringing content.

We have heard so much about “upload filters”, mostly because of the arguments that (a) it’s not possible to implement Article 13 without compelling sites to use automated filtering on uploads, and (b) such filters would be imprecise and/or too expensive. Let’s examine these two claims in turn, after we consider the steps that led us to where we are today.