4. Algorithmic Measures

(Or, “The Tools Formerly Disparaged and Oversimplified as ‘Upload Filters’”)

Imagine a site has employed as many of the manual processes as possible to reduce its moderation burden - but there are still too many infringing works on its service for it to be confident of avoiding a lawsuit. Is it time for the dreaded “upload filters” that will block masses of work unfairly? Not exactly. Content recognition technology can indeed help, but it’s not as black-and-white as being an “upload filter”, and there’s no reason it needs to indiscriminately block legitimate works.

Matching technology

The essence here is: if necessary, can we match the content in uploads to the relevant rightsholders’ works using a mostly automated process? In many cases the answer is yes, and this means the burden on human moderation teams can be decreased.

Metadata checks

Before we move on to the complex stuff, there are some simple victories to be found here. Most people who upload content want it to be found and viewed. As such, they tend to tag music videos as music, and they tend to include the artist name and the track name, so that people can find the video via the site’s search function or via external search engines. This metadata can be checked against simple text databases, some of which are already available for free, such as MusicBrainz or Discogs in the case of music.

Obviously no responsible site would block a video just because the name matched something from these lists, but it might be prudent to put that video near the top of the list of videos that need a quick check from a moderator as soon as practical. After all, if I upload a video with “Katy Perry Dark Horse” in the title then you don’t need advanced technology to guess that it’s more likely to be a video of the song ‘Dark Horse’ by Katy Perry than to be anything else, and if I’m not the official Katy Perry channel, it’s highly likely to be infringement.
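To make this concrete, here’s a minimal sketch in Python of such a metadata check, querying MusicBrainz through the real musicbrainzngs client library. The service name, contact address, and score threshold are placeholders for illustration - and note that a hit doesn’t block anything, it just promotes the upload in the moderation queue.

    import musicbrainzngs

    # MusicBrainz requires clients to identify themselves (details are placeholders).
    musicbrainzngs.set_useragent("ExampleUploadChecker", "0.1", "admin@example.com")

    def title_matches_known_recording(upload_title, min_score=90):
        """Return (title, artist) of the best MusicBrainz match for an upload's
        title metadata, or None. A hit doesn't prove infringement - it just
        means the upload should be moved up the human moderation queue."""
        result = musicbrainzngs.search_recordings(query=upload_title, limit=5)
        for recording in result.get("recording-list", []):
            # MusicBrainz attaches a 0-100 relevance score to each search result.
            if int(recording.get("ext:score", 0)) >= min_score:
                return recording["title"], recording.get("artist-credit-phrase", "?")
        return None

    match = title_matches_known_recording("Katy Perry Dark Horse")
    if match:
        print("Possible match:", match, "- prioritise for moderator review")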

But of course, sometimes the names are misleading or missing or ambiguous, so we might want to go a bit further and match against the actual content. (I’ll mostly talk about music below, but the principles are much the same for video, images, and text.)

Content checks

Firstly, let’s imagine that rightsholders have provided the relevant data to the services. This isn’t impractical, given that it would be little more than an upload of each work, as is currently done for YouTube’s Content ID. It’s also fairly easy for the rightsholders to instead provide access for site owners to pull the data they need, whether direct from their own servers or perhaps via a third party. And if Article 13 passes, it is in the interests of rightsholders to do this, given that paragraph 4b says that platform owners only have to prevent access to works that they have been provided information about.

Internet platforms can then use technology to match the works their users want to broadcast against the works on the protected list. Explaining how this operates gets quite technical, but the broad strokes are set out below.

How can sites do this?

There is already effective technology to do this, and contrary to most claims, it does not have to be prohibitively expensive, meaning even smaller sites can use it. It’s not necessary to licence multi-million-dollar technology from Google to comply with Article 13’s ‘best efforts’ requirement, because there are various cheap or free tools available, as well as the non-algorithmic measures mentioned previously. And these tools are getting better all the time as the underlying technology improves.

Written works

There are off-the-shelf tools (e.g. Turnitin, PlagTracker), as well as free data-analysis software that developers can integrate to check for similarities between works (e.g. ‘frequent itemset mining’ to find words that are used together, or ‘term frequency’ models that can see whether two documents use terms in a significantly similar way). It’s possible to use statistical tools like these to identify likely matches, and then run ‘common substring’ searches to see if there are clear extracts that confirm things further. This is perhaps the simplest type of content to check.
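As a rough sketch of how little code the basic statistical approach needs: the Python example below uses the scikit-learn library for a term-frequency comparison and the standard library for the common-substring check. The sample texts and the 0.5 cut-off are purely illustrative.

    from difflib import SequenceMatcher
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def similarity_score(doc_a, doc_b):
        """Cosine similarity of the documents' term-frequency vectors (0.0 to 1.0)."""
        vectors = TfidfVectorizer().fit_transform([doc_a, doc_b])
        return cosine_similarity(vectors[0], vectors[1])[0][0]

    def longest_common_extract(doc_a, doc_b):
        """The longest contiguous extract that appears in both documents."""
        matcher = SequenceMatcher(None, doc_a, doc_b, autojunk=False)
        match = matcher.find_longest_match(0, len(doc_a), 0, len(doc_b))
        return doc_a[match.a:match.a + match.size]

    upload = "the quick brown fox jumps over the lazy dog"        # stand-in texts
    reference = "the quick brown fox leaps over a sleeping dog"
    score = similarity_score(upload, reference)
    print(f"similarity: {score:.2f}")
    if score > 0.5:                                               # illustrative cut-off
        print("Possible match; shared extract:", longest_common_extract(upload, reference))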

Photographs, drawings, and other images

Image recognition is also highly advanced - indeed, TinEye have been doing it for years. Google also offer a ‘reverse image search’, where you provide an image as a search term rather than text and it finds matching images. But you don’t need to use their services: the technology is freely available to developers in the form of open-source libraries, such as ‘image-match’ or ‘pHash’. Open-source software can also handle finding rotated and scaled versions of the original image, using tools like OpenCV. So any site that wanted to offer a large-scale photo- or art-sharing platform would have several tools available to detect copies.
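For illustration, here’s a minimal sketch using the freely available Python imagehash library (a different package from the two named above, but implementing the same pHash idea). The filenames and the distance threshold are placeholders.

    import imagehash
    from PIL import Image

    def perceptual_hash(path):
        """A 64-bit pHash fingerprint; visually similar images get similar hashes."""
        return imagehash.phash(Image.open(path))

    # Index the reference images provided by rightsholders (filenames are placeholders).
    reference_hashes = {perceptual_hash(p): p for p in ["ref1.jpg", "ref2.jpg"]}

    def find_match(upload_path, max_distance=8):    # illustrative threshold
        """Return the reference image the upload resembles, if any."""
        upload_hash = perceptual_hash(upload_path)
        for ref_hash, ref_path in reference_hashes.items():
            # Subtracting two hashes gives the Hamming distance (bits that differ).
            if upload_hash - ref_hash <= max_distance:
                return ref_path
        return None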

Audio and music

Audio recognition too is very advanced - technology like that used by Shazam is capable of identifying music even when the input is a small snippet of a song captured through a phone’s microphone. Many people have mentioned that Audible Magic provide a paid service to do this for site owners, but there are a variety of other software libraries for ‘audio fingerprinting’ available, such as AcoustID or soundfingerprinting - in fact, it’s so straightforward that there are blog posts walking through sample code. Any site wishing to distribute a lot of audio or video content would be able to use these tools to quickly find most infringing uses.
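As a small sketch, the free Chromaprint/AcoustID stack can be driven from Python via the pyacoustid library, which fingerprints an audio file and can look it up against AcoustID’s public database. The API key and filename are placeholders (AcoustID issues keys for free), and a production system would more likely match against its own stored fingerprints.

    import acoustid

    API_KEY = "YOUR_ACOUSTID_API_KEY"   # placeholder - register with AcoustID for a key

    def identify(path):
        """Fingerprint an audio file and look it up in the AcoustID database."""
        for score, recording_id, title, artist in acoustid.match(API_KEY, path):
            # score is a 0.0-1.0 confidence that the fingerprints match.
            print(f"{score:.0%} match: {artist} - {title} ({recording_id})")

    # Alternatively, just compute the raw fingerprint to store for later comparison:
    duration, fingerprint = acoustid.fingerprint_file("upload.mp3")

    identify("upload.mp3")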

All these tools enable a site to implement a practical “take down and stay down” system, even without rightsholders providing a massive database upfront: once material has been identified as infringing and removed from the site, it can be fingerprinted and stored in the database, so that the next time someone uploads it, it’s instantly flagged for attention.
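Here’s a minimal sketch of that loop for images, reusing the perceptual-hash approach from earlier; the same pattern works with audio or text fingerprints, and the distance threshold is again illustrative.

    import imagehash
    from PIL import Image

    # Fingerprints of content previously confirmed as infringing and removed.
    taken_down = set()

    def on_takedown(image_path):
        """Record the removed content's fingerprint so that it 'stays down'."""
        taken_down.add(imagehash.phash(Image.open(image_path)))

    def on_upload(image_path, max_distance=8):      # illustrative threshold
        """Flag re-uploads of previously removed content before publication."""
        fingerprint = imagehash.phash(Image.open(image_path))
        if any(fingerprint - known <= max_distance for known in taken_down):
            return "hold for moderation"            # matches removed content
        return "publish"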

What about false positives, or ‘overblocking’?

So, we know that matching content can be done quite effectively. Indeed, it is already happening, with over 99% precision being reported. But what about the false positives - aren’t these a massive problem? Isn’t it impossible for a machine to know when it’s made a mistake? How can a computer tell the difference between extracts shown for a review and a copy of the whole song?

The answer requires knowing a bit about the technology. Most matching algorithms use something called a ‘similarity measure’, which does not give a simple “yes or no” answer to whether two works match, but instead gives a score - say, from 0.0 to 1.0 - representing how closely the works match. There is no reason why a site needs to draw a line at a given match score, automatically blocking everything above that line and allowing everything below it.

A more nuanced approach might be (for example) to automatically block only the highest-scoring matches (assuming there’s no licence already in place), put the remaining works in the top 10% of match scores into a queue pending manual moderation, put the next 20% of match scores into a ‘spot check’ queue where moderators check some random proportion of the works after publication, and leave the rest. It would be hard for many false positives to slip through such a system, since humans would be checking the things that the computer wasn’t sure about.
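That routing logic fits in a few lines. In the Python sketch below, the score cut-offs and the 25% spot-check rate are illustrative stand-ins for the “highest”, “top 10%” and “next 20%” bands above; a real site would derive them from its own score distribution.

    import random

    def route_upload(match_score, licensed=False):
        """Decide what to do with an upload, given its best match score (0.0-1.0)."""
        if licensed:
            return "publish"                        # an existing licence trumps everything
        if match_score >= 0.98:
            return "block"                          # near-certain wholesale copy
        if match_score >= 0.90:
            return "hold for manual moderation"     # a human decides before publication
        if match_score >= 0.70:
            # Published immediately, but a random proportion gets checked afterwards.
            return "spot-check queue" if random.random() < 0.25 else "publish"
        return "publish"

    print(route_upload(0.95))   # -> hold for manual moderation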

What about excerpts for review use and criticism? Obviously there’s a difference between “Video A is 50% similar to Video B” and “Video A contains 50% of Video B’s content”. Thankfully this is also a problem that the algorithms can handle. For example, audio fingerprinting technology is able to listen to a short snippet of audio and detect which song in its database it matches, so if I uploaded a video where I review the latest album by some band, the system can scan my video and instantly spot the snippet I play of that band’s song. The algorithm can report “From 0:45 to 0:55 of Ben’s Video, we detected the fingerprint of ‘Enter Sandman’ by Metallica, 1:46 to 1:56”.

From there, if no other Enter Sandman sections were detected, it’s easy to compare against the track length of 5:30, see that only 3% of the work was used, and conclude that the upload is therefore much more likely to be a review, criticism, or parody, and maybe doesn’t need to be held in a moderation queue. A computer may not be able to recognise parody or a review, but it can certainly tell the difference between a wholesale copy of a work and a small excerpt of one, meaning false matches on these grounds can be minimal.
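The arithmetic behind that judgement is trivial once the matcher reports which spans of the reference work it detected. A minimal sketch using the numbers from the example above; the 10% cut-off is an illustrative guess.

    def fraction_of_work_used(matched_spans, track_length):
        """matched_spans: (start, end) times, in seconds, of the sections of the
        *reference* work detected in the upload. Returns the fraction of the
        reference work that appears, merging any overlapping detections."""
        total, last_end = 0.0, 0.0
        for start, end in sorted(matched_spans):
            start = max(start, last_end)    # merge overlapping detections
            if end > start:
                total += end - start
                last_end = end
        return total / track_length

    # "We detected 1:46 to 1:56 of 'Enter Sandman'" - ten seconds of a 5:30
    # (330-second) track.
    used = fraction_of_work_used([(106, 116)], 330)
    print(f"{used:.0%} of the work used")   # -> 3% of the work used
    if used < 0.10:                         # illustrative cut-off
        print("Probably an excerpt - likely review/criticism, deprioritise")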

Even better, we have a lot of information to work with to help make better decisions. If the system identifies many short matches within music tracks, they are probably samples. If a video uses 100% of a song but it’s spread out over three times the song’s normal duration, interspersed with non-matching content, maybe it’s one of those “Vocal Coach Reacts To X” videos. There are various indicators which any company dealing with masses of content can use to prioritise which pieces it double-checks. It’s not hard to imagine a moderation queue that shows the new upload side by side with the existing content or audio track it matched, so a human moderator can quickly confirm or refute the match. The technology multiplies the efficiency of the moderators, and the moderators multiply the correctness of the technology.
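One way to encode a couple of those indicators as triage rules; every rule and threshold below is an illustrative guess rather than a tuned value.

    def guess_usage(matched_spans, track_length, upload_length):
        """Rough triage based on the (start, end) spans of the reference work
        detected in an upload. All thresholds are illustrative."""
        # Assumes non-overlapping spans; see the previous sketch for merging.
        coverage = sum(end - start for start, end in matched_spans) / track_length
        if coverage > 0.9 and upload_length > 2.5 * track_length:
            # The whole song is present but stretched out, with non-matching
            # content in between - the "Vocal Coach Reacts" pattern.
            return "likely reaction/commentary - spot-check"
        if len(matched_spans) >= 5 and all(e - s < 10 for s, e in matched_spans):
            return "many short matches - possibly samples"
        if coverage > 0.9:
            return "wholesale copy - hold for moderation"
        return "small excerpt - low priority"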

None of these tools are perfect - just like human assessments are not perfect - but they can make quick assessments which are usually correct, and their assessments can be used to vastly speed up the human element.

Conclusion

With all the tools described above, platform owners have a range of effective ways to reduce the amount of infringing content on their platforms to a manageable level, such that the vast majority of their content is authorised, and courts and rightsholders can see that they have taken their responsibilities seriously - all without needing to shut down, block entire regions, or impose crude filters. As such, the provisions in Article 13 can be reasonably implemented without significant risk to users’ freedom of expression, and all responsible platforms will be able to comply.