Summary of the project

One of the primary concerns of the news media industry is how to manage the comments that readers post on news articles. Most online news publishers provide content in a form that allows readers not only to access it but to post their own comments: for readers, this is valuable in allowing them to express their opinions and interact with each other; for the publishers, it is valuable in that it provides a way to understand their audience and increase reader engagement. However, the ability to comment is often misused, with comments used to advertise, abuse others, spread misinformation and post illegal content. In many countries, publishers are legally accountable for the content that is posted. Publishers therefore usually employ some form of moderation: human moderators will scan the comments posted, and apply some moderation policy to block those that should not appear, and in severe cases perhaps ban the users from posting again.

This job is not easy: decisions can be subjective and hard to make consistently, it is easy to miss comments that need blocking, and when comments arrive in high volumes (peaks of many thousands of comments per hour are not unusual during notable events) it can be difficult to keep up. There has therefore been great interest in recent years in AI tools to assist moderators: tools that analyse the content of comments using natural language processing (NLP) methods and flag those which should or should not be blocked, speeding up the moderators’ work and helping produce consistent results. Recent research reports impressive accuracy for such tools.

However, transferring these AI methods from research to practical industry use is not straightforward. Tools must usually be trained on large volumes of data labelled with the correct output decisions. This data must match the domain, style and language that will be seen in use, so it must generally be produced from scratch for any new publisher, newspaper or topic. This process is expensive and needs expertise in NLP and AI methods.

This project seeks to develop new methods to overcome this problem and make initial implementation easy and fast. We will develop methods for semi-automatic annotation of data, including new variants of active learning in which the AI tools themselves quickly select the data they most need to have labelled. We will build on recent progress in topic-dependent comment filtering to build tools that take the context of the associated news article into account, reducing the amount of new labelled data needed. Finally, we will use recent progress in transfer learning to allow tools to be initialised from existing labelled data in other domains and languages, further reducing the amount of new data required.
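
To illustrate the general idea behind active learning, the sketch below shows plain uncertainty sampling, a standard baseline and not one of the new variants this project proposes. It assumes a classifier that returns a probability that a comment should be blocked; the function name, the toy comments and their scores are all hypothetical and serve only as an illustration.

```python
from typing import Callable, List, Sequence


def select_for_labelling(
    comments: Sequence[str],
    predict_block_probability: Callable[[str], float],
    batch_size: int,
) -> List[str]:
    """Return the comments the model is least sure about (uncertainty sampling)."""
    # A predicted probability near 0.5 means the model cannot decide
    # between "block" and "allow", so a human label is most informative there.
    ranked = sorted(comments, key=lambda c: abs(predict_block_probability(c) - 0.5))
    return ranked[:batch_size]


# Hypothetical stand-in for a trained classifier: fixed scores, illustration only.
toy_scores = {
    "Great article, thanks!": 0.03,
    "Buy cheap watches at my site": 0.97,
    "You people never learn": 0.55,
    "This is borderline rude": 0.42,
}

# Ask the moderator to label only the two most ambiguous comments.
to_label = select_for_labelling(list(toy_scores), toy_scores.__getitem__, batch_size=2)
print(to_label)
```

In this toy case the two clear-cut comments are skipped and only the two ambiguous ones are sent to a moderator, which is how active learning can reduce the amount of labelling needed.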

The result will be a suite of tools to enable easy, fast, practical implementation of accurate, robust comment filtering methods for use in the news media industry.
