Yup, what you don’t hear can hurt you. I’m talking about automatic speech recognition systems and voice user interfaces, which I’ll refer to interchangeably from here on as ASRs and VUIs. More and more VUIs are becoming widely accepted as input methods. You don’t have to take my word for it. Look at the sales stats for Amazon Echo, Google Home or I-Home devices over the past 4 years. “As of 2017, it’s estimated that almost 50% of adults in the United States are using voice assistants as part of their regular daily routine.” The website SlashGear reported that “100 million Alexa enabled devices have been sold. The 100 million number includes both Amazon’s own products and third-party devices with Alexa built-in, but the company doesn’t reveal how many of each have been sold. There are now over 150 products shipping with Alexa built-in, like the Sonos One speaker and the Bose QC35 II headphones, but it’s unclear how many they’ve sold versus Amazon’s Echo Dot.”
That is a truckload of automated eavesdroppers. These invaders posing as guests, helpers and family friends use an artificial intelligence framework known as Natural Language Processing (introduced in an early chapter of the book Content Weapons). The front ends, or VUIs, that you know as Cortana, Siri or Alexa all have a key weakness. Unlike your significant other, they love to hear you talk. They are always in the mood to hear what you have to say! These “talk goblins” continue to showcase major improvements in automatic speech recognition, and their already impressive capabilities keep advancing with each interaction. This is a relatively new technique made possible by a combination of newish technologies. It leverages the concept of S.E.N.S: in biology, this is basically the concept of engineering cells to get better with use. This is exactly how these A.I.-driven algorithms work: the more data and the more users, voilà, the better the accuracy.
Most Psychoacoustic Hiding attacks rely on gradient descent algorithms (sorry, I keep nerding out). But the thing to consider here is that instead of updating the neural network’s parameters, they modify the audio signal itself. The attackers basically harness human hearing thresholds to ensure that the altered file still comes across like a normal one.
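To make that a little more concrete, here is a minimal sketch of gradient descent aimed at the input instead of the model. Everything in it is a made-up stand-in: the "recognizer" is just a fixed linear score, and `craft_adversarial`, `max_change`, `lr` and `steps` are hypothetical names, with `max_change` playing the role of a hearing threshold. It is not how Kaldi or any real ASR works, only an illustration of the idea.

```python
def score(weights, signal):
    """Toy stand-in for a recognizer: a fixed linear score of the input."""
    return sum(w * x for w, x in zip(weights, signal))


def craft_adversarial(weights, signal, target, max_change=0.05, lr=0.1, steps=200):
    """Run gradient descent on the input signal; the weights are never updated."""
    original = list(signal)
    adv = list(signal)
    for _ in range(steps):
        error = score(weights, adv) - target
        # Gradient of (w.x - target)^2 with respect to x is 2 * error * w.
        grad = [2.0 * error * w for w in weights]
        adv = [x - lr * g for x, g in zip(adv, grad)]
        # Assumption: a flat per-sample limit stands in for the hearing threshold.
        adv = [min(max(a, o - max_change), o + max_change)
               for a, o in zip(adv, original)]
    return adv


weights = [0.5, -0.3, 0.8, 0.1]   # frozen "model"
signal = [0.2, 0.4, -0.1, 0.3]    # original "audio"
target = 0.0                      # output the attacker wants
adversarial = craft_adversarial(weights, signal, target)

print("score before:", score(weights, signal))
print("score after: ", score(weights, adversarial))
```

The model stays exactly as it was. Only the audio moves, and only by a small, capped amount per sample.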
In principle, a transcription can be hidden in almost any audio file. These attacks are possible, and they can definitely be advanced by those with malicious intent. That makes it very tricky and also quite scary, because no one really knows what to expect, or whether a rather simple-looking audio file is carrying hidden messages and attacks.
As with other application vulnerabilities, attackers can use flaws to cause all kinds of issues. In the case of ASRs, researchers used the psychoacoustic model of hearing to add commands to audio signals. Because the human auditory system focuses on the louder sounds, our minds are distracted and cannot perceive the quieter sounds in the background. That’s how Psychoacoustic Hiding happens. So, raise your hand if you know the most popular file type for audio. If you guessed .mp3, you are correct, and this is also the format most easily adapted for these types of attacks. Uuuggghh, I really liked .mp3s, but now you know that what makes them great is also what allows them to pass on exploits.
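Here is a rough sketch of that masking idea. It assumes a deliberately oversimplified rule: within a frequency band, a quiet sound goes unnoticed if it sits at least `margin_db` below the loud sound. The function names, the 12 dB margin and the band levels are all hypothetical; real psychoacoustic models are far more detailed than this.

```python
def hidden_budget(masker_levels_db, margin_db=12.0):
    """Per band, the loudest a hidden sound could be and still stay masked."""
    return [level - margin_db for level in masker_levels_db]


def audible_bands(masker_levels_db, hidden_levels_db, margin_db=12.0):
    """Indexes of bands where the hidden sound pokes above its budget."""
    budget = hidden_budget(masker_levels_db, margin_db)
    return [i for i, (hidden, limit) in enumerate(zip(hidden_levels_db, budget))
            if hidden > limit]


music_db = [60.0, 55.0, 40.0]    # the loud audio you actually hear
hidden_db = [45.0, 50.0, 20.0]   # the quiet signal an attacker adds

print(hidden_budget(music_db))
print(audible_bands(music_db, hidden_db))
```

In this toy run only the middle band would give the game away, so an attacker would simply turn the hidden signal down in that band.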
For example, researchers in Bochum passed manipulated audio files to Kaldi as input data. In future studies they want to show that Psychoacoustic Hiding can work very well even when all of this data is passed along in a manner that is completely inaudible to humans. It all comes down to reaching the voice assistant device, be it an Amazon Echo, Google Home, laptop running Cortana, Apple Watch or Samsung phone, one way or another so the attack can be effective, and eventually, as we all know, it will work very well. “Son of a bitch”, first it was cigarettes, then bacon, oh yeah, sugar also, and now Alexa. Can we have anything good anymore?
Calm down, Michael, all is not lost. Keep in mind that most ASRs are based on deep neural networks, and their providers are taking active measures to create secure systems. These DNNs have multiple layers: you have an input, then it goes through multiple layers, and then you have the output, which is a recognized sentence. The trick is to ensure that checks are in place to verify user intent upon identifying a malformed or outlier sentence based on user history. Sounds easy, but we will see what comes of it. So, don’t throw out your devices just yet.
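What might such a check look like? Below is a bare-bones sketch, assuming the simplest possible notion of an "outlier": a recognized sentence whose words mostly never appear in the user's past commands. The names `familiarity` and `needs_confirmation` and the 0.5 threshold are hypothetical, and a real provider would use something far more sophisticated.

```python
def familiarity(command, history):
    """Fraction of the command's words already seen in the user's history."""
    known = {word for past in history for word in past.lower().split()}
    words = command.lower().split()
    if not words:
        return 0.0
    return sum(word in known for word in words) / len(words)


def needs_confirmation(command, history, threshold=0.5):
    """Flag commands that look unlike anything this user normally says."""
    return familiarity(command, history) < threshold


history = [
    "play some jazz",
    "what is the weather today",
    "set a timer for ten minutes",
]

for command in ["play some jazz", "unlock the front door"]:
    verdict = "ask the user first" if needs_confirmation(command, history) else "go ahead"
    print(command, "->", verdict)
```

The point of the design is that the device does not refuse the odd request. It just asks before acting, which is exactly where a hidden command would get caught.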
What can we do to prevent such attacks? Unfortunately, we as consumers can’t really do a lot. Developers need to accept that these attacks are actually possible, keep finding ways to adapt the training of these models, and make consistent efforts to customize and secure the ASRs. But getting past and eventually improving on all of this is very time-consuming. Also, with the increasing adoption of VUI technology, more infotainment becomes available and expands its capabilities, which creates more demand. This is the perfect storm for the ill-intentioned. Commercial pressure rushes services out the door, and security usually becomes an afterthought. This has to change, and it starts with the developers and platform providers.
As a provider of such a platform, or even if your organization has developed and deployed a voice application, this should remain an issue to monitor continuously. The criticality and cost effectiveness of your security efforts will ultimately depend on the nature of that service. I mean come on, if your application is there to provide standard information and offer daily deals, then you have significantly less risk exposure than if your application had access to specific user data or allowed your users to manage their bank or credit card accounts. Either way, keep in mind that all software has vulnerabilities, and that is the world we now live in. Be diligent in what you can do to protect your customers’ as well as your organization’s best interests. I mean, we have all heard the saying “An ounce of prevention is worth a pound of cure”. Clichés are clichés for a reason.
#contentweapons, Content Weapons
J. Michael Stattelman