We all love Alexa!  And Google Home!  Less so Siri, but it can come in handy.  Let’s face it, the world has fallen in love with voice-powered everything.  You can Alexa-enable your microwave, kettle and refrigerator.  Who knows what’s next?  An Alexa-powered tank?

And this is where you should recoil in horror.  I used to think the scariest phrase in the world was “Alexa, what’s my bank balance?” but I’m beginning to wonder if “Alexa, shell that building” is scarier, particularly if your vehicle unexpectedly starts blasting music rather than buildings in a Noriega style. 

Here’s the thing:  It is really, really easy to build a voice “skill”.  And it is really, really easy to build deeper voice integrations using speech-to-text tools from major cloud providers like IBM, Google and Amazon.  A few days ago (and I shan’t say where) I saw a proposal to build a law-enforcement surveillance system based on Amazon Transcribe and Amazon Lex.  I mean, yes, it’s great to show how you can string a few APIs together to build a proof of concept, but this is highly sensitive, highly valuable data.  You can’t just sling it out to the cloud because there is an API to throw it at.

This is what happens when you let yourself be led by your developers and innovation teams (I have both, so I’m allowed to be slightly rude about them):  Inevitably, to get prototypes out, they choose what is easy, rather than perhaps what is right.  My first question about any system I am building where I have sensitive information passing through is “Is it secure?” and the second is “How can I guarantee privacy?”

Cave dwellers may not have seen the stories about how Amazon, Google and even the sainted Apple have been listening in to audio files captured by their devices, but the rest of us have.  We have traded privacy for convenience and cheap, even free, services in the consumer domain.  But that should not be the same in the enterprise. 

And here’s the rub.  Deploying a lot of this voice technology in the enterprise, either on-prem or as a piece of kit, is hard.  It is difficult to procure, the costing models don’t always work, and the systems can be slow and not that accurate.  And also, no-one just buys transcription technology.  You need a use case, and you need a solution that matches it.

The fact is, these easy-to-use cloud tools have made people believe the “magic pipe” fallacy: many people think that speaking to a voice assistant is like using a secure web browser, where you connect directly to the service provider, like a bank, through a magic pipe.  But the opposite is true.  When you say “Alexa, I have a migraine”, it doesn’t go straight to your doctor.  Both your question and the response pass through Amazon’s servers, in the clear.

Enter “Vox in a Box”:  IV’s core technology packed into a single GPU-accelerated server, which can be shipped straight to a rack in your data centre.  Or, if you are a lover of the private cloud, it comes as a virtual “box”: an AMI for Amazon Web Services, or a VHD for Azure.  A REST-based API wrapped around speech recognition, Natural Language Processing, Speaker Identification and lots of other fun stuff like easy model adaptation from text, and in over 20 languages and dialects.  This version is optimised for massive batch processing, so tens or hundreds of thousands of call recordings, voicemails and the like can all be processed in record time.

Full disclosure:  We have been shipping versions of this for years, but as part of larger solution sets into banks for trader surveillance, prisons for prisoner monitoring and identification, “Blue lights” services to help channel calls appropriately, Government departments for QA and into the legal profession to help speed up litigation involving audio.  These are all use cases, not just technology looking for a problem.  These are all highly privacy-sensitive environments, and in every case, we have put in highly accelerated GPU boxes that provide a lot of speech technology in a small hardware footprint.

And now we’ve been working with our friends at NVIDIA to make it even faster:  If you are at GTC DC (use code GMXGTC for a 20% discount on any #GTC19 pass), you can see the work we have been doing, and how we make it work in the real world.

Our mission is to give our users the simplicity of working with a “cloudy” system, with the benefits of having full control over it.  We have packed a lot of expertise in how to make speech recognition work better into one appliance.

Don’t worry, if you are looking for live streaming or instant voice response, we’ve got systems for that too.  Just get in touch. 

Remember, next time you are tempted to send your voice data to an unknown data centre in an unknown location looked at by who knows who:  A secret shared is not a secret any more! 
