The Next Frontier in Threats to your Privacy: Voice Recognition

Updated on Jan 27, 2024 by Glyn Moody

Privacy News Online has been tracking the increasing use of facial recognition technologies for some time. Concerns about their wider deployment are growing. But the surveillance world does not stand still. While people focus their attention on facial recognition, a new form of tracking is being rolled out: voice recognition – detecting who is speaking, not what is being spoken (speech recognition).

Interpol, whose motto is “connecting police for a safer world,” is the largest international police organization, with 192 member countries around the globe. It recently announced the completion of a four-year project “to develop new technology to help the law enforcement community identify the voices of unknown individuals.” As well as Interpol, the Speaker Identification Integrated Project (SIIP) involves an international consortium of 17 partners including end-users, industry and academia. The €10.5 million funding (about $12 million) came from the European Commission. SIIP aims to tackle two challenges facing law enforcement agencies (LEAs) in the digital world:

The adoption of multiple fake and arbitrary identities by terrorists and criminals using telecommunication and Internet mediums in aim to avoid their lawful interception and tracking by LEAS, through prepaid cell-phones, Frequent alerting of SIM cards in cell-phones, Random access to other people’s phones and Using different nick names in various Internet VOIP applications such as: Tango, ooVoo, Skype, Viber and G-Talk, etc.
“The second side problem” – The difficulty to identify unknown participants in a lawfully intercepted call of a known suspect.

At the heart of SIIP lies the voiceprint – a unique digital signature representing a single speaker. Drawing on technologies from a number of companies, the SIIP system is designed to handle variations in gender, age, language and accent. SIIP will obtain its voice samples from multiple communication channels. It aims to:

Run on any speech source and channel (Internet, Social-Media, PSTN [public switched telephone network], Cellular and SATCOM) and provide LEAs with better intelligence and improved judicial admissible evidence of lawful intercepted calls.
Associate each speaker identification with rich-metadata (Identifiers used by the suspect, Personal details, Location-profiles, Social-connections and many more), taken from variety of sources including the WEB and Social-Media.
Enrolment of rich suspect voice and metadata from Telecom and Internet sources, incl. Social-Media (e.g. YouTube).

The important role played by social media is striking. A video produced by SIIP outlines another way the resource is used to establish a speaker’s identity:

SIIP also searches social media to find matches with persons not yet known to police. SIIP compares a specific unknown voice reference, posted in social media, against social media postings. If a match is found in a social media video, where the speaker’s face is not covered, then it is possible to identify the unknown speaker.

The logic seems to be that since these videos are posted on the open Internet, the police may use them to match with voices obtained through “lawful interception”, or from other online material. In effect, people posting voice recordings on social media are handing over their voice prints, and not just to one law enforcement agency, but to all 192 members of Interpol. A central database of voiceprints drawn from many sources around the world is being created, and all Interpol police forces will have access to it. A SIIP user manual explains how the system will work in practice.
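SIIP's internal matching method is not described here, but the general idea of comparing an unknown voice against a collection of voiceprints can be illustrated with a minimal sketch. It assumes, purely for illustration, that each voiceprint has already been reduced to a fixed-length list of numbers, and that two voiceprints are compared by cosine similarity against a cut-off. The function names, the threshold of 0.8 and the sample vectors are all hypothetical and are not taken from SIIP.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector carries no information to compare
    return dot / (norm_a * norm_b)


def best_match(unknown, enrolled, threshold=0.8):
    """Return the label of the closest enrolled voiceprint, or None.

    `enrolled` maps a label (for example a video posting) to its
    voiceprint vector. The threshold is an assumed cut-off: scores
    below it are treated as "no match".
    """
    best_label, best_score = None, threshold
    for label, voiceprint in enrolled.items():
        score = cosine_similarity(unknown, voiceprint)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


# Made-up vectors standing in for voiceprints taken from public videos.
enrolled = {
    "video_a": [0.9, 0.1, 0.3],
    "video_b": [0.1, 0.8, 0.5],
}
unknown = [0.85, 0.15, 0.35]
print(best_match(unknown, enrolled))
```

The sketch also shows why scale matters: every additional voiceprint in the collection is one more candidate that an intercepted voice can be checked against.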

Voice recognition systems are already deployed by many police forces around the world. An Interpol survey on the use of speaker identification by law enforcement agencies, published in 2016, revealed that 44 of the 91 respondents from 69 different countries said they had speaker identification capabilities in house or via external laboratories. The number has doubtless increased since then.

The NSA has been using this kind of voice recognition technology for many years. An Intercept article drawing on documents provided by Edward Snowden reveals that as far back as 2008, voiceprints were an area “where NSA reigns supreme”, according to the agency itself. The Chinese authorities also regard voice recognition as a key surveillance technique. As Privacy News Online wrote last year, the government there is building a nationwide voiceprint database of its citizens. More recently, the privacy organization Big Brother Watch discovered that the UK tax authorities have collected huge numbers of voiceprints without people’s consent:

Millions of callers to HMRC [Her Majesty’s Revenue & Customs, the UK government body responsible for collecting taxes] have been required to repeat the phrase, “My voice is my password” on an automated line before being able to access services. Big Brother Watch said taxpayers are being “railroaded into a mass ID scheme” as they are not given the choice to opt in or out, in a scheme that experts say breaches UK data protection laws.
Big Brother Watch submitted Freedom of Information requests revealing the Government department has amassed a staggering 5.1 million voiceprints.
However, HMRC has refused to disclose which other Government departments the voice IDs have been shared with, how the IDs are stored and used, whether it is possible to delete a voice ID, which legal territory the data is kept in, how much the scheme has cost taxpayers, or the legally-required ‘privacy impact assessment’.

A transcript of a call to HMRC reveals that UK citizens have to refuse to say “My voice is my password” multiple times in order to proceed without creating a voiceprint that is held on the system and shared with other government agencies. Another transcript shows how hard it is to have an existing voiceprint removed.

The collection of voiceprints on a massive scale by government departments is troubling. But potentially just as problematic is the rise of smart speakers and other devices that work through spoken commands. By definition, these listen all the time to what people say. In general, they work using cloud computing power. Amazon writes of its Alexa-enabled products: “Alexa lives in the cloud so it’s always getting smarter, and updates are delivered automatically. The more you talk to Alexa, the more it adapts to your speech patterns, vocabulary, and personal preferences.” That means voice data is being sent elsewhere to be processed, and the system is constantly improving its ability to recognize your voice. Clearly, both of those features could be abused in ways that seriously harm privacy.

It might be argued that people who buy voice-enabled, connected products think that the benefits outweigh the risks. But there are now moves to install such devices in public locations, where people have less say over the audio surveillance that is carried out on them. For example, after a gradual spread of its smart speakers in hotels, Amazon has launched its Alexa for Hospitality service:

Alexa for Hospitality integrates seamlessly with your existing amenities and services, to become your guests’ virtual concierge. Alexa simplifies tasks for guests like playing music, ordering towels, controlling in-room temperature or lighting, finding local restaurants and attractions, calling, and even checking out. Alexa makes delivering a great customer experience simple. Just ask.

Privacy is naturally a concern here, which Amazon addresses thus:

Using Alexa is optional. If you do not want to use Alexa, you can push the microphone on/off button built on top of the device. When the microphone on/off button is pressed, the microphones are electrically disconnected, cannot detect the wake word, and cannot stream any audio to the cloud. The light ring will turn red when the microphones are disconnected.

That’s good as far as it goes, but it requires active intervention on the part of the hotel guest. As we know from other contexts, many people either forget or simply can’t be bothered to protect their privacy when using digital devices. It is inevitable that smart speaker systems will be left on in hotel rooms and elsewhere. And as these discreetly placed units become commonplace, people will increasingly accept them to the point of not noticing them, and will therefore not take the available measures to preserve their privacy. It’s also possible that even when devices are supposedly shut down, they can still eavesdrop, as is the case for smartphones.

The larger issue is that voice commands picked up by constantly-listening devices powered by cloud-based AI farms will probably become one of the main ways of controlling digital products in the future. After all, it’s easy and natural – it mimics what we do in the analog world. But the difference is that every command can be unambiguously linked to the voice that uttered it, which leaves a tell-tale track of actions and their instigator. That’s in contrast to traditional devices. They may log which account is in use, but can rarely identify with certainty the person actually controlling it at any given moment. As voiceprint databases spring up around the world, the potential for tapping into them to abuse people’s privacy grows too.

Featured image from SIIP.