Once more unto the (data) breach

by Dr Cornelius Glackin


One in four companies will experience a data breach in the next 12 months, according to the Ponemon Institute’s “2017 Cost of Data Breach Study: Global Overview”[1]. The perception is that the vast majority of data breaches involve on-premises infrastructure. As such, many companies prefer to store their data in the cloud; it makes sense in principle to outsource cyber security to a professional cloud provider, and it is also lower in cost. However, some of the largest and most costly breaches have involved cloud-based systems, e.g. Apple iCloud, Dropbox, LinkedIn, Microsoft and Yahoo[2], each resulting in millions (and in some cases billions) of accounts being compromised.

Cloud computing means that organisations allow access to business-critical applications and sensitive data over the Internet. Recent advances in deep learning have revolutionised image and speech processing, making exciting new applications possible. Many of these applications rely on cloud computing infrastructure to centralise the computing power required to process video and audio data. There are numerous emerging examples, such as Amazon’s personal assistant Alexa, which employs cloud processing to support its voice recognition and dialogue management functionality. Whilst no breaches of this system have been reported, the implication is that unencrypted audio data must reside in the cloud to enable it to be processed, and hence carries a substantial risk.

Earlier this year, an open database containing links to more than 2 million voice messages recorded on cuddly toys was discovered[3]. Personal pictures of celebrities were stolen from Apple’s iCloud service. In the majority of such cases, cloud providers respond by urging their customers to use stronger passwords and by adding notification systems that look for suspicious activity.

Whilst personal photos of Jennifer Lawrence are seemingly of interest to hackers, the implications of a leak of audio data could be even more serious. Perhaps the largest unknown in this scenario is what effect the future capabilities of deep learning will have on the analysis of biometric signals like voice.

Dr Rita Singh from Carnegie Mellon University and her colleagues pieced together a profile of a serial US Coast Guard prank caller solely from recordings of his voice[4]. This included a prediction of his height and weight, and also the size of the room he was calling from, leading to his apprehension by the authorities. Dr Singh’s team are using this research to identify a person’s use of intoxicants or other substances, and also the onset of various medical conditions the speaker may not even be aware they have. For instance, the biomarker for Parkinson’s Disease can be detected in a person’s voice long before any other symptoms arise. This raises the prospect of using voice recognition in the medical field to diagnose diseases with speech-related biomarkers.

Voice biometrics are now used by some banks to “secure” accounts. Banking has embraced voice authentication in order to make the customer’s experience frictionless. However, a recent BBC article detailed a voice biometric breach that occurred when a journalist gained access to his twin brother’s HSBC bank account. Whilst this flaw was attributed to legacy voice biometric solutions, one should be cautious about relying on voice as the principal mode of authentication, if for no other reason than that it is not difficult to record someone’s voice and, in the near future, to use that recording to synthesise that voice saying anything. Start-ups like Lyrebird[5] are working on ways to replicate a voice using just a minute of recorded speech. In the very near future, any sample of your voice could be used to realistically impersonate you.

The implication is that the future will feature a significant arms race between AI-equipped adversaries intent on breaching cloud-based systems and the intelligent algorithms designed to protect such systems. So, what is the answer? Well, first of all, organisations must understand the probability of being attacked, how it affects them, and even more importantly, which factors can reduce or increase the impact and cost of a data breach. One way to mitigate the effects of a breach of audio or video data in particular is to encrypt it.

For sensitive data, there is the option of encrypting it before storing it in the cloud. However, while we have become increasingly good at encrypting data at rest, processing that data in the cloud first requires decrypting it. This rules out using the cloud’s resources to process sensitive data unless the processing can be done in a secure way. Cryptography research has made some innovative strides on this issue in recent years.

Searchable Encryption (SE) is a relatively new form of encryption that enables encrypted data to be searched with encrypted keywords. The idea is that the cloud can be used to store sensitive data that has been encrypted. An authenticated user can then search that data using search terms that are also encrypted, and the Searchable Encryption protocol residing on the cloud is able to compare the encrypted search terms and match them to the relevant encrypted data without ever learning either what was being searched for or what the data contains. It is no surprise that the seminal paper[6] co-authored by Seny Kamara, one of the pioneers of this revolutionary cryptosystem, is one of the most-cited security papers since 1981.
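To make the idea concrete, here is a minimal sketch of keyword search over encrypted data, written in Python using only the standard library. It is an illustration under stated assumptions, not the scheme from the cited paper and not the implementation of any product: the names `Client`, `CloudServer` and `keyword_token` are hypothetical, the search tokens are simply keyed hashes (HMAC-SHA256) of each keyword, and the toy keystream cipher stands in for a vetted authenticated cipher such as AES-GCM.

```python
import hashlib
import hmac
import secrets


def keyword_token(index_key: bytes, keyword: str) -> str:
    """Deterministic search token: HMAC-SHA256 of the lower-cased keyword."""
    return hmac.new(index_key, keyword.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (HMAC-SHA256 in counter mode). Illustrative only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]


def encrypt(data_key: bytes, plaintext: str) -> bytes:
    """XOR the plaintext with a fresh keystream; the nonce is prepended."""
    nonce = secrets.token_bytes(16)
    data = plaintext.encode("utf-8")
    stream = _keystream(data_key, nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def decrypt(data_key: bytes, blob: bytes) -> str:
    """Inverse of encrypt(): split off the nonce and XOR again."""
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(data_key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream)).decode("utf-8")


class CloudServer:
    """Stores only ciphertexts and opaque tokens; it never holds a key."""

    def __init__(self):
        self.documents = {}  # doc_id -> ciphertext
        self.index = {}      # token -> set of doc_ids

    def upload(self, doc_id, ciphertext, tokens):
        self.documents[doc_id] = ciphertext
        for token in tokens:
            self.index.setdefault(token, set()).add(doc_id)

    def search(self, token):
        matches = self.index.get(token, set())
        return {doc_id: self.documents[doc_id] for doc_id in matches}


class Client:
    """Holds the secret keys; encrypts, tokenises and decrypts locally."""

    def __init__(self):
        self.index_key = secrets.token_bytes(32)
        self.data_key = secrets.token_bytes(32)

    def upload(self, server, doc_id, transcript):
        tokens = {keyword_token(self.index_key, word) for word in transcript.split()}
        server.upload(doc_id, encrypt(self.data_key, transcript), tokens)

    def search(self, server, keyword):
        hits = server.search(keyword_token(self.index_key, keyword))
        return {doc_id: decrypt(self.data_key, blob) for doc_id, blob in hits.items()}


# Usage: the server matches tokens to ciphertexts without seeing either in the clear.
server = CloudServer()
client = Client()
client.upload(server, "call-001", "please transfer the funds today")
client.upload(server, "call-002", "the weather is lovely")
results = client.search(server, "funds")
print(results)
```

A sketch this simple still leaks information to the server, for example which documents share a keyword and when the same search is repeated. Published searchable encryption schemes are designed to define and limit that kind of leakage.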

Searchable Symmetric Encryption (SSE) is also the basis of Intelligent Voice’s encrypted search product CryptoSearch, with which large volumes of a user’s encrypted speech transcripts and their corresponding encrypted audio can be outsourced to the cloud for storage. For review, the audio database and its associated encrypted transcripts can be searched, and once the pertinent audio file has been found it can be downloaded and decrypted behind the client’s own firewall, without the need to download everything, decrypt it, find what you are looking for, re-encrypt and re-upload. At no point does the cloud server ever see the data or the search terms in the clear. In the event of a breach, any data retrieved is encrypted and can only be decrypted either by prohibitively costly brute force or with the user’s private encryption key.
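To give a rough sense of what “prohibitively costly” means, the back-of-envelope calculation below estimates how long an exhaustive key search would take. Both inputs are assumptions chosen for illustration rather than figures from any product: a 256-bit symmetric key, and an attacker able to test a trillion keys per second. The function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def expected_years_to_brute_force(key_bits: int, guesses_per_second: int) -> float:
    """Expected time to find a key by exhaustive search.

    On average the attacker must try half of the keyspace.
    """
    expected_guesses = 2 ** key_bits // 2
    return expected_guesses / (guesses_per_second * SECONDS_PER_YEAR)


# Assumed attacker speed: 10**12 key guesses per second.
years_256 = expected_years_to_brute_force(256, 10**12)
years_128 = expected_years_to_brute_force(128, 10**12)
print(f"256-bit key: about {years_256:.1e} years")
print(f"128-bit key: about {years_128:.1e} years")
```

Under these assumptions, even the 128-bit case runs to billions of billions of years, which is why stealing the key is a far more realistic threat than guessing it.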

Ultimately, it is advances such as Searchable Symmetric Encryption and Fully Homomorphic Encryption that will be the cloud defender’s most valuable assets for safeguarding our data in the cyber security threat climate we can expect in the very near future.



[1] https://www.ibm.com/security/data-breach

[2] https://www.storagecraft.com/blog/7-infamous-cloud-security-breaches/

[3] http://www.bbc.co.uk/news/technology-39115001


[5] https://lyrebird.ai/

[6] https://blog.cs.brown.edu/2017/05/09/kamaras-work-searchable-symmetric-encryption-2-most-cited-2006-security-paper/
