Tech firms threatened with government fines over abuse: how can speech recognition help?

Sadly, it has become the norm to expect and accept offensive behaviour online, whether we receive this abuse directly or watch others become targets. 47% of internet users have experienced online harassment or abuse. Faceless communities mean that people can hide behind an online profile and therefore not face the repercussions they would in real life. But whose responsibility is it to combat this complex problem: individuals, police, tech firms or government? And how can tech firms identify this behaviour?

At the end of last year the UK government announced that new rules were to be introduced to make the online world a safer place. These rules are aimed at tech firms that allow users to post their own content or interact. If these firms fail to protect people, they could face fines of up to 10% of turnover or the blocking of their sites, and the government will reserve the power to hold senior managers liable.

Online services such as social media sites, gaming platforms, websites, apps and any other service that hosts user-generated content or lets people talk to one another online will need to remove or limit the spread of illegal content such as child abuse, terrorist material and suicide content. As well as this, tech firms will need to protect children from being exposed to harmful content or activity such as grooming, bullying and pornography.

But how can this abuse be identified and tracked, particularly in communities that communicate mainly by voice, such as gaming?

Games culture is well known for breeding toxic language and behaviour. Join any online game and you are likely to find aggression, hostility and trash talk over voice chats. The frequent use of slurs and other demeaning language creates an unfriendly space. This hostility disproportionately affects marginalised groups such as women, people of colour and LGBTQIA+ communities, as it does everywhere online.

Gaming also creates dangers for children, including grooming. The Covid-19 pandemic and the closure of schools have meant that children are spending more time online, increasing the risk.

Where can speech recognition help?

Technology exists to try to moderate the worst of hate and toxic speech online. Jigsaw, the Google subsidiary, maintains a number of tools that are designed to flag toxic comments online, although these have recently been criticised for a tendency to flag Black-authored speech as “toxic” because it does not conform to the white-aligned English that the underlying Machine Learning algorithms are trained on.

However, these tools are limited to the “text-only” world. Why is that? With a huge array of interactions taking place across voice channels, especially with the growth of platforms like Discord, surely we should be tackling this potentially far more wide-ranging threat?

In recent years, with the rise of Machine Learning, Large Vocabulary Continuous Speech Recognition (LVCSR) has become significantly more accurate and robust, and much better able to cope with a wide variety of languages and accents that might be thrown at it.  In somewhat contrived laboratory settings, there have even been claims of “human parity” speech recognition achieved by machines.

But it is by no means perfect, especially when faced with fast, conversational and overlapping speech. This means the output from LVCSR systems is not always close enough to the clean text that text-based models were trained on for those models to be useful.

But allowing for this, simple keyword spotting systems can at least identify potentially offensive language and flag it. Surely that is reason enough to have machines police our voice channels, even though we know from our work in compliance monitoring that these techniques can lead to too many false positives?
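
To make the false-positive problem concrete, here is a minimal Python sketch of keyword spotting over transcribed speech. It is an illustration only, not a description of any production system: the watch list, the function name spot_keywords and both sample transcripts are invented for this example, and the transcripts are assumed to be plain text already produced by a speech recogniser.

```python
import re
from typing import List, Set, Tuple

# Hypothetical watch list, invented for this example.
WATCH_TERMS: Set[str] = {"trash", "idiot", "loser"}


def spot_keywords(transcript: str, terms: Set[str] = WATCH_TERMS) -> List[Tuple[int, str]]:
    """Return (word_index, term) for every watch-list term found in the transcript."""
    # Lower-case the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", transcript.lower())
    return [(i, w) for i, w in enumerate(words) if w in terms]


# Two made-up transcripts: one hostile, one harmless.
hostile = "you are trash just uninstall the game"
benign = "give me a minute I need to take out the trash"

hostile_flags = spot_keywords(hostile)
benign_flags = spot_keywords(benign)

print(hostile_flags)
print(benign_flags)
```

Both transcripts are flagged for the same word, although only the first is abusive. A plain word match has no notion of context, which is why this kind of system tends to over-report.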

This is where we come on to the second challenge, which is cost.

Someone trying to develop a system to protect their users’ online audio experience is usually faced with using an API provided by one of the big three cloud-based LVCSR providers (Amazon, Google and Microsoft) or, at a pinch, IBM. At the bottom end, prices for untrained speech recognition start at $1 for Microsoft, with Google just about undercutting that if you allow them to listen to your audio and use it to train their models further.

During the various international lockdowns, recorded gaming activity has risen to about 8.5 hours per week for each gamer, which means that, on this very conservative estimate, a games developer could be paying $35 a week just to try to transcribe their users’ speech, even before they put in place systems to analyse it. And of course, if they have been forced to accept Google’s lower pricing for economic reasons, all of that personal voice data from their users is being slurped up and stored.

Yes, speech recognition is a processor-intensive activity, but a well-engineered, highly tuned, GPU-powered system can slash these costs, while also allowing the gaming company to train the system themselves to cope with the types of slang their customers use.

We are also beginning to see the rise of behavioural models that are not trained solely on text-based data, but which have been trained on the real output from LVCSR models, which means the corruptions and mis-transcriptions seen from these systems can be baked into the model building process, making them more robust in domains where there is “real” speech. Our work in behavioural analytics is making great strides in addressing this issue.

There is one thing that a lot of automated systems seem to forget: fun. If these domains are to be made safer, that must not be at the expense of freedom of speech or the enjoyment of gaming. To be truly effective and minimise collateral intrusion, any system must be context-aware and must take account of all elements of speech. The use of voice should not just stop at “compliance”. Our work in AR/VR is focussed on surfacing emotional features mined from online “Zoom-type” conference calls in user-configured avatars. Why not use this voice data as a powerful feature of the virtual world?

The technology is there to help protect online gamers and others who use voice and video channels to communicate. If you would like to know more about how we can help you build a system to address this growing problem of toxicity, just email us at info@intelligent-voice.test.