Is there something special about the human voice?

BBC || Shining BD

Published: 11/25/2024 8:15:25 PM

Artificial intelligence-powered speech synthesisers can now hold eerily realistic spoken conversations, putting on accents, whispering and even cloning the voices of others. So how can we tell them apart from the human voice?

These days it's quite easy to strike up a conversation with AI. Ask a question of some chatbots, and they'll even provide an engaging response verbally. You can chat with them across multiple languages and request a reply in a particular dialect or accent.

It is now even possible to use AI-powered speech cloning tools to replicate the voices of real humans. One was recently used to copy the voice of the late British broadcaster Sir Michael Parkinson to produce an eight-part podcast series, while natural history broadcaster Sir David Attenborough was "profoundly disturbed" to hear his voice had been cloned by AI and used to say things he never uttered.

In some cases the technology is being used in sophisticated scams to trick people into handing over money to criminals.

Not all AI-generated voices are used for nefarious ends. They are also being built into chatbots powered by large language models so they can respond and converse in a far more natural and convincing way. ChatGPT's voice function, for example, can now vary its tone and emphasise certain words in much the same way a human would to convey empathy and emotion. It can also pick up non-verbal cues such as sighs and sobs, speak in 50 languages and render accents on the fly. It can even make phone calls on behalf of users to help with tasks. At one demonstration by OpenAI, the system ordered strawberries from a vendor.

These capabilities raise an interesting question: is there anything unique about the human voice to help us distinguish it from robo-speech?

Jonathan Harrington, a professor of phonetics and digital speech processing at the University of Munich, Germany, has spent decades studying the intricacies of how humans talk, produce the sounds of words and accents. Even he is impressed by the capabilities of AI-powered voice synthesisers.

"In the last 50 years, and especially recently, speech generation/synthesis systems have become so good that it is often very difficult to tell an AI-generated and a real voice apart," he says.

But he believes there are still some important cues that can help us to tell if we are talking to a human or an AI.

Before we get into that, however, we decided to set up a little challenge to see just how convincing an AI-generated voice could be compared to a human one. To do this we asked New York University Stern School of Business chief AI architect Conor Grennan to create pairs of audio clips reading out short segments of text.

One was a passage from Lewis Carroll's classic tale, "Alice in Wonderland" read by Grennan and the other was an identical segment generated with an AI speech cloning tool from software company ElevenLabs. You can listen to them both below to see if you can tell the difference.

Surprisingly, around half of the people we played the clips to couldn't tell which was which by ear. It's worth pointing out that our experiment was far from scientific and the clips weren't played over high-end audio equipment – just typical laptop and smartphone speakers.

Steve Grobman, chief technology officer of cybersecurity company McAfee, struggled to discern which voice was human and which was AI just by listening.

"There were definitely things beyond speech, like the inhalation which would have me go more towards human, but the cadence, balance, tonality would push me to AI," he says. For the untrained human ear, many of these things can be difficult to pick up.

"Humans are very bad at this," says Grobman, explaining that deepfake detection software is helping catch things the human ear can miss. But it gets especially challenging when bad actors manipulate real audio with bits of fake audio, he says, pointing to a video of Microsoft co-founder Bill Gates hawking a quantum AI stock trading tool. To the human ear, the audio sounded exactly like the tech billionaire, but when it was run through a scam classifier, it was flagged as a deepfake.

McAfee recently highlighted how a fabricated advert mixed deepfake and real audio of singer Taylor Swift. Grobman's tip: "Always listen to the context of what is being said, things that sound suspicious likely are."

Another cybersecurity expert we spoke to – Pete Nicoletti, global chief information security officer of Check Point Software, a threat analysis platform – was also stumped by our "Alice in Wonderland" challenge.

He says he usually listens for unnatural speech patterns such as irregular pauses and awkward phrasing when playing audio. Strange artefacts like distortions and mismatched background noise can also be a give-away. He also listens for limited variations in volume, cadence and tone because voices that are cloned from just a few seconds of audio may not have the full range of a human voice.

"We live in a post-real society where AI generated voice clones can fool even the voice validation systems of credit card companies," Nicoletti says. "Turing would be turning over in his grave right now," he adds, referring to World War II British code breaker Alan Turing, who designed the "Turing Test" as a way to identify AI by engaging with it in conversation.

Dane Sherrets, innovation architect of emerging technologies at HackerOne, a community of bug bounty hunters that work to expose security vulnerabilities of some of the biggest companies in the world, was among those able to correctly identify the human voice. The natural inflection and breathing in the clips were the give-away, he says.

Listening for the accentuation, or emphasis, words are given in a sentence can be a good trick for spotting computer-generated speech, agrees Harrington. This is because humans use accentuation to give a sentence more meaning within the context of a dialogue.

"For example, a sentence like 'Marianna made the marmalade' typically has most emphasis on the first and last words if read as an individual sentence devoid of context," he says. But if someone asked if Marianna bought the marmalade, the emphasis might instead fall on the word "made" in the answer.

Intonation – the change in pitch of the voice across a sentence – can also turn the same words from a statement ("Marianna made the marmalade") into a question ("Marianna made the marmalade?").

The ability to clone people's voices creates a security risk by potentially fooling voice recognition systems, friends and family (Credit: Estudio Santa Rita)

Phrasing is also an important factor, because the way a sentence is broken up can alter its meaning. The sentence "when danger threatens, children call the police" has a very different meaning from "when danger threatens children, call the police", Harrington explains.

Together these three elements of speech are known as sentence-level prosody. It is "one of the ways computer-generated speech has been quite poor and not very human like", says Harrington.

But as the technology develops, AI is growing more adept at replicating these aspects of speech too.

"If you think about it, this is the worst the technology is ever going to be," says Sherrets. "Even something that is 60% as good is still pretty powerful. It's only going to get cheaper, faster, better from here."

He and many of the people we spoke to are particularly worried about voice cloning. It is a very real threat for businesses, for example. Assaf Rappaport, chief executive at Wiz, a leading cybersecurity company, told an audience at a technology conference in October that someone had created a voice clone of him from one of his recent talks. They then used it to send a deepfake voice message to dozens of employees in an attempt to steal credentials. The scammers were unsuccessful, but the incident was a wake-up call.

In another example, a school principal received death threats after a fake audio clip appeared to show him making deeply offensive remarks. Other cases have seen family members scammed out of money in phone calls using voice clones of their loved ones.

Sherrets advises developing other ways of authenticating that you really are speaking to the person you think you are.

"At home this means deciding on family passwords," he says. "At work this means not making a wire transfer just because you got a voice message from the chief executive officer of your company."

You can also ask personal questions, such as their favourite song. But perhaps the best thing to do if you suspect an AI is impersonating someone you know is to say you will call them back. Call them on the number you have for them and don't panic.

Michael McNerney is senior vice president of security at cyber risk insurance firm Resilience, which covers attacks like "spear phishing", where employees are duped into wire transferring money with deepfake audio. He too correctly guessed which voice was AI and which was human in our "Alice in Wonderland" challenge.

As he listened to the samples, he found himself asking: Is that real breathing or fake breathing? Were there any mistakes being made? Was it too bright, too perfect? Stumbling over words and taking breaths are very human, so if things are too perfect, it can actually be a sign that AI is faking it.

But McNerney says even here, the technology is sounding more and more human. "These are super hard to tell," he says.

Listening to our two pairs of audio clips, Harrington and his colleagues at the University of Munich's Institute of Phonetics also struggled to tell the AI voices apart when listening by ear. They pointed to a number of features that should have helped them identify the human speech.

Variations in the rate of speech are often an apparent giveaway of a human voice, but in fact the AI voice seemed to produce this more than the human in our examples.

Breath intakes should be another tell-tale sign. A few of those we played the clips to identified something off about the breathing in both sets of clips. Harrington and his colleagues also said they found the breath intakes in one of the "Alice in Wonderland" clips almost too regular for their liking. But it turned out to be the human sample.

The fact that many of the experts we spoke to for this article struggled to tell the AI and human voices apart should not be seen as a failure in their abilities. Rather it is a sign of just how good at imitating human voices AI has now become.

It is something that could have some worrying implications, says Harrington.

"I'm amazed at how the AI voices knew where to put false starts and hesitations, assuming they were not typed in by someone at the keyboard," he says. "The ability for AI to communicate, in speech, ideas from an individual that might be completely at odds with the individual's real views is now complete. That's the bit I find quite scary."

Distinguishing AI-generated speech from real human voices just by using our ears is becoming increasingly difficult (Credit: Estudio Santa Rita)

There could, however, be another way of telling a human from an AI voice, Harrington says. He suggests using something known as prosodic deaccenting. Take the example below:

Question: Has John read "Hard Times" yet?

Answer: John doesn't LIKE Dickens.

The emphasis on the verb in the answer signals that the person replying understands that Dickens is the author of the novel, "Hard Times".

"The synthesis of these types of dialogue with a natural prosody might still be quite hard for many AI systems because it requires a knowledge of the world that goes well beyond the words printed on the page," says Harrington.

But even this sort of test could soon be overcome by large language models that draw on large datasets from the internet as they are trained to sound more human.

"It would be really interesting to find out at some stage if AI gets that right as well," Harrington adds.

Mainstream services such as ChatGPT's voice function can already laugh, whisper, and pick up where they left off after being interrupted. ChatGPT can also remember everything you have ever told it.

When asked what safeguards were in place to ensure its AI would disclose that it is AI while conversing with humans, OpenAI – the developer of ChatGPT – said there were none. It also said it was not planning to "watermark" AI-generated speech to identify it because of the potential for bias against its users. These could include groups of impaired speakers using ChatGPT to communicate, or students using ChatGPT to help with homework.

However, OpenAI says it is actively trying to block voice cloning as ChatGPT's advanced features roll out.

"We work to prevent our synthetic voices from copying the voices of real people," ChatGPT multimodal product lead Jackie Shannon tells the BBC. "For Advanced Voice, in particular, we only allow the model to use the preset voices." These include two British-sounding and seven American-sounding voices, split between gender.

There are a couple of other tricks you could try if you suspect that the voice you are conversing with might not be human. You could, for example, ask it to scream. Many AI voice systems struggle to speak outside the normal vocal range unless they have been specifically trained to, says Nicoletti. I asked ChatGPT to shout and it told me it couldn't.

The flaws in human speech could be another giveaway, says Grennan. Correcting oneself and doubling back on one's thoughts is a very human thing to do. It's unlikely you'll ever hear ChatGPT say, "Uh, never mind!" or "You know what!?"

There are also moves to make deepfake detection software more readily available to consumers. McAfee, for example, has partnered with Dell, HP, Lenovo, Samsung, Acer and Asus to pre-install its detection software on AI-enabled PCs. The company also expects to roll out the software to mobile devices in the near future, according to Grobman.

ElevenLabs – which is the maker of the tool that was used to create the AI voice clones in our "Alice in Wonderland" challenge – also offers a free AI detection tool to help people identify if its software has been used to create a piece of audio.

But in the inevitable arms race between AI generation and AI detection, we may find new value in something we have lost in our increasingly virtually connected world – physical interaction. Perhaps in the search to find out if you are speaking to a human, the solution is simple – spend more time meeting face to face.

For those of you still puzzling over which of our audio clips was real, we can reveal that the first clip was AI while the second was human. Were you able to guess correctly?
