Is Speech Recognition just OCR for Voice? I don’t think so…
In 2015, I put up a slide at a Financial Services technology conference saying “2015 is the Year of Voice”, citing examples like Siri and the massive fines then being handed out for rogue trading as reasons why the technology was being taken seriously. The following year it was “2016 is the Year of Voice”. And then 2017 rolled on, as did my slide.
Finally, I’m sure, 2018 is the Year of Voice. I think.
Alexa, more than any other device, has shown the potential for voice technology. For the first time, someone has put into production a solution that is actually usable by consumers, that doesn’t require pre-training, and that does something useful. Anyone who has tried to use transcription software (the last real attempt at using voice technology to make people’s lives easier) knows that even small imprecisions in the output made it quicker to learn to type. Finally, you can walk into your kitchen and turn on the radio just by asking. Progress! Who needs pesky buttons?
But it has also highlighted some more interesting issues, like how we feel about having an always-on microphone in the house, and how we feel about all of our sensitive data passing through Amazon’s hands. I have often cited the “Magic Pipe” problem as one of the great dangers of current implementations.
When you use a web browser to contact your bank, the whole transaction is secured end to end using SSL. But when you say “Alexa, what is my bank balance?”, those voice commands are sent to Amazon, who send the question to your bank, who send the information back to Amazon who send it to your Alexa device. There is no Magic Pipe of secure data connection between your somewhat underpowered Alexa device and the bank. How comfortable do you feel now?
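To make the point concrete, here is a minimal sketch in Python of that two-hop flow. The function names and the balance are invented, and no real Alexa or banking API is involved; the point is simply that each leg is its own secured session, so the sensitive answer passes through the middle in the clear.

```python
# Illustrative sketch only: the function names and the balance are invented,
# and no real Alexa or banking API is used. Each hop is its own secured
# session, so the assistant's cloud handles the sensitive answer in the clear
# before relaying it on: there is no single end-to-end channel.

def bank_api(question: str) -> str:
    """The bank answers over its own secure session with the assistant's cloud."""
    if question == "balance":
        return "Your balance is 1,234.56"
    return "Unknown request"

def assistant_cloud(voice_command: str) -> str:
    """Transcribe the command, call the bank, relay the answer back to the device."""
    intent = "balance" if "bank balance" in voice_command.lower() else "unknown"
    reply = bank_api(intent)                 # leg 2: cloud <-> bank
    print(f"[cloud can read]: {reply}")      # the missing "Magic Pipe" in one line
    return reply                             # leg 1: device <-> cloud

print(assistant_cloud("Alexa, what is my bank balance?"))
```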
While privacy is undoubtedly an issue with home voice devices, as it is with using any public cloud service, what I want to explore are the common mistakes that people make when assessing and implementing speech recognition technology.
Speech recognition is hard, and it is complicated by a lot of factors. Telephony speech in particular is hard, as it is highly compressed and often conversational. When you then add in multiple parties on a conference call, it can be hard even as a human to really grasp what is being said. Have you ever tried to record a meeting using the recording app on your phone? It is very easy to strategically place your phone so that no-one comes out clearly, thanks to a combination of the “far-field” effect of distance from the microphone and background noise.
It is one thing to accurately transcribe clearly spoken speech, but quite something else to transcribe things accurately in the wild.
The first question we get asked when talking to customers is “How accurate is your speech recognition engine?”. There are all sorts of answers to that. “As good as Google” is one. “Better than IBM” is another. Although true, none of these replies is that helpful. I could say “95% accurate” or “50% accurate”, because both of those would be true as well. It is dependent on the use case. So, our answer to that question is another question, or often a series of questions, the primary one being “What do you want to use speech technology for?”, and the answer is usually “We want a transcript”.
Let me tell you very clearly. In almost every use case involving speech, the one thing you don’t want is a transcript, at least not in isolation.
This leads on to what our first question should probably be, which is “Why do you think you want a transcript?”
There are a number of answers to this. Some are search use cases, where people want to retrieve files at a later date using text search. Others are keyword spotting (compliance, QA and fraud often need this). Others are NLU, where the interest is in the speaker’s intent. Some will be for sentiment analysis. And then there is the worst one: “We want to read and review the transcript”.
Have you ever read a transcript, particularly of a telephone call?
Try this little snippet for size:
Believe it or not, this is between two people at the top of the “articulate” scale, one of whom will probably be the next King of England (so I have to be careful what I say, lest my head be severed from my body).
The point is that conversations are not like properly written documents (like emails, for example). They are difficult to “read” even when they are perfectly transcribed. And the reason why there is still a thriving “human in the loop” transcription industry (costing $60+ per audio hour for the transcript) here in the Age of Alexa is that machine transcripts are still not that accurate, especially in the more challenging environments identified above.
So, if speech technology is not 100% perfect “OCR for Voice”, what is it and what can it do? And how can it be used for better business benefit?
You wouldn’t set off on a 100k bike ride without doing a little training, would you? Well, don’t start your voice project without it either. We used to be fairly soft with clients about this. We would say “We do have a process that quickly allows you to improve the recognition accuracy, but feel free to run it out of the box if you want”.
But we found that in every case the customer didn’t get what they wanted. This is because every domain is different, and unless you adapt to that domain, you get good general results, but you don’t get results that match your expectations. This is often noticeable where people have a search or keyword spotting scenario. If I am on the lookout for a particularly unusual name of a person or product, it is almost certain that a speech recognition system will not have it in its “lexicon” (a dictionary to you and me) of words.
Want to catch references to “Martin Shkreli” in your voice data? You won’t unless you search for “Martin Squirly” or unless you have pointed your speech recognition system to the Turing Pharmaceuticals Wikipedia page (here, if you are interested), and said “Learn this”.
Ask Alexa to play “The Word Girl” by Scritti Politti. She’ll get it (unfortunately, from a musical standpoint), because “Play” sends her off to her dictionary of musical words and terms, and so she’s pre-sensitised to artist, song and album names.
Not magic, just domain-specific training.
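As a rough sketch of what “Learn this” amounts to in practice (nothing here is any particular engine’s real API; the base lexicon, the sample text and the pronunciation helper are all illustrative stand-ins), domain adaptation often starts with harvesting out-of-vocabulary words from in-domain text and giving them pronunciations:

```python
# A minimal sketch of lexicon adaptation: pull out-of-vocabulary words from
# in-domain text and add them to the engine's dictionary with a pronunciation.
# base_lexicon stands in for the engine's existing word list, and
# guess_pronunciation() is a placeholder for a real grapheme-to-phoneme model.
import re
from collections import Counter

base_lexicon = {"martin", "raised", "the", "price", "of", "a", "drug", "at",
                "turing", "pharmaceuticals"}

domain_text = "Martin Shkreli raised the price of Daraprim at Turing Pharmaceuticals."

def guess_pronunciation(word: str) -> str:
    # Placeholder: a real system would run a grapheme-to-phoneme model here.
    return " ".join(word.upper())

def new_lexicon_entries(text: str, lexicon: set) -> dict:
    """Return pronunciations for domain words the engine has never seen."""
    words = re.findall(r"[a-z']+", text.lower())
    unseen = [w for w, _ in Counter(words).most_common() if w not in lexicon]
    return {w: guess_pronunciation(w) for w in unseen}

print(new_lexicon_entries(domain_text, base_lexicon))
# {'shkreli': 'S H K R E L I', 'daraprim': 'D A R A P R I M'}
```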
Seeing the Future
If you speak into a “live” speech recogniser like Siri, you will often see phrases appearing before you have said them, and you may well see words that have been transcribed change, often from something wrong to something right. This is because speech recognisers use statistical models to help them work out what humans “mean” when they are speaking, and so settle on the right words.
Start a sentence with “The cat sat on the” and your listener (and your speech recognition engine) is expecting you to say “mat” at the end. Say “rat” and you have engendered a state of confusion.
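That expectation is just a language model doing its job. A toy bigram model, with counts I have made up purely for illustration, shows why “mat” wins and “rat” causes a double-take:

```python
# Toy bigram language model with made-up counts, purely to illustrate why the
# recogniser "expects" certain words: it scores candidates by how often they
# followed the same context in its training data.
counts_after_the = {"mat": 90, "rat": 5, "hat": 5}   # times each word followed "sat on the"
total = sum(counts_after_the.values())

for word in ("mat", "rat"):
    p = counts_after_the[word] / total
    print(f"P({word!r} | 'the cat sat on the') = {p:.2f}")

# P('mat' | ...) = 0.90 vs P('rat' | ...) = 0.05: even if the acoustics are
# ambiguous, the language model pulls the output towards "mat".
```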
This power to see the future also helps in certain other scenarios. When we do machine transcription, we only expect to see single lines of text as output, representing what has been said. But in the background, where the system is not necessarily sure what you meant, it will have generated alternatives, often homophones of what it heard (Claus, clause and claws, for example).
If your use case is trying to find things, like for search or keyword spotting, you really want to have access to that data, because it will expand your search pool.
This is called a lattice: a list of alternative words, each with a “confidence” score attached.
Below you see an example from a phone call. Every single word that was said appears somewhere in the lattice, so if you had used the lattice to search, you would have found what you were looking for. That matters especially because the name, often a key search term, only appears with a confidence of 0.08.
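To see what the lattice buys you for search, here is a minimal sketch. The lattice format and the confidence numbers are illustrative rather than any engine’s real output, but the idea is the same: search the alternatives, not just the 1-best transcript.

```python
# Sketch of keyword search over a lattice rather than the 1-best transcript.
# The lattice format here is illustrative: one list of (word, confidence)
# alternatives per time slot, with the 1-best word first.
lattice = [
    [("my", 0.94)],
    [("name", 0.91)],
    [("is", 0.95)],
    [("squirly", 0.62), ("shkreli", 0.08)],   # the right name survives, at low confidence
]

one_best = " ".join(slot[0][0] for slot in lattice)

def lattice_contains(lattice, term, min_conf=0.05):
    """True if any alternative in any slot matches the term above a threshold."""
    return any(word == term and conf >= min_conf
               for slot in lattice for word, conf in slot)

print("shkreli" in one_best)                 # False: missed by the plain transcript
print(lattice_contains(lattice, "shkreli"))  # True: found via the lattice
```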
Is Voice Ready for Primetime?
Yes, and no. And whether it is very much depends on you.
Step 1, understand your use case.
Step 2, really understand your use case. Unless you are trying to produce closed captions for a podcast, you don’t just want a transcript “out of the box”.
Here are some questions to ask:
Is my use case really a “search” use case? If so, think about how you use training and lattice output. Microsoft did some research on recall using lattice results and found:
“Experiments on a 170-hour lecture set show an accuracy improvement by 30-60% for phrase searches and by 130% for two-term AND queries, compared to indexing linear text.”
This means that if you search for a phrase, say “Intelligent Voice”, you are up to 60% more likely to pick it up in a lattice search than if you search the plain text output you usually see from a straight transcription. If you make that a Boolean search, “Intelligent AND Voice”, the chances of picking it up rise by 130%.
Am I worried about privacy? Are you happy that your data is commercially or legally suitable to be processed by a public cloud provider? If not, you need to look at on-device, on-prem or private cloud solutions.
Am I worried about cost? Google charges $1.44 per audio hour processed. Think about the volume of data you might be processing and multiply it by that. If you are looking at telephony monitoring solutions with agents on the phone 30 hours a week, does that make commercial sense?
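A rough back-of-the-envelope calculation makes that question easier to answer. The rate is the one quoted above; the working weeks and the number of agents are assumptions you should swap for your own.

```python
# Back-of-the-envelope cloud transcription cost. The rate comes from the text;
# the weeks per year and the agent count are assumptions to replace with your own.
rate_per_audio_hour = 1.44      # USD per audio hour processed
hours_per_week = 30             # one agent on the phone
weeks_per_year = 48             # assumed working weeks
agents = 100                    # assumed contact-centre size

per_agent_year = rate_per_audio_hour * hours_per_week * weeks_per_year
print(f"Per agent per year:  ${per_agent_year:,.2f}")               # $2,073.60
print(f"Across {agents} agents: ${per_agent_year * agents:,.2f}")   # $207,360.00
```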
The abilities of voice technology do not really live up to the hype. Much commercial speech recognition is focused heavily on English and “normal” accents, and it usually relies on the audio being processed being pretty clean to start with: Alexa uses a 7-microphone array and clever “beamforming” techniques to overcome the problem of you being distant from the microphone.
At the moment, we are building a model for the Bavarian dialect: not Bavarian-accented German, but an actual dialect, albeit one that is reduced in written form to Standard German. Not an easy task, but another step in trying to break the Anglo hold on speech recognition.
Will it ever be 100% accurate as well as 100% useful? As they say in Munich, “Schau ma moi”: we’ll see.