In 2019, the director of a British company fell for a scam. He received a fake voicemail from his manager asking him to transfer €220,000 ($240,000) to a supplier. A year later, a bank manager in Hong Kong got a call from someone who sounded familiar. Because they had an existing business relationship, the banker transferred $400,000 before realizing something was wrong. Scams like these, which use artificial intelligence (AI) voice cloning technology, are becoming more frequent, and as AI quickly improves, detecting deepfake voices will get harder, even for trained people using special tools.
A recent study published in PLOS ONE involving 529 participants revealed that humans struggle to accurately distinguish between real and fake voice messages. The study found that participants failed 25% of the time when attempting to detect voice deepfakes, and even training had minimal impact. Half of the participants received prior training by listening to five synthesized voice examples, but their performance was only 3% better than the untrained group.
The study by researchers from University College London (UK) also aimed to understand whether the challenge was easier or harder depending on the characteristics of different languages, so they conducted the tests in English and Mandarin. The findings indicate that both groups rated the authenticity of the messages equally. They considered attributes like naturalness and lack of a robotic-sounding voice as important factors. “Both English-speaking and Mandarin-speaking participants frequently cited incorrect pronunciations and atypical intonations in the sound clips as factors influencing their decision-making process,” said Kimberly Mai, lead author of the study.
Audio is more subjective than visual images
Participants mentioned the same characteristics, regardless of the accuracy of the response. This is because audio is subjective. Unlike detecting visual deepfakes, where authenticity can be judged by observing objects and backgrounds, the subjective nature of speech causes perceptions to vary more. “When looking at a potentially fake image of a person, you can count the number of fingers or see if their clothing and accessories match,” said Mai.
To compare human and technological capabilities, the researchers also tested two automated detection systems. The first used software trained on an unrelated database, achieving 75% accuracy, similar to human responses. The second detector, trained on both the original and synthesized voice versions, achieved 100% accuracy in identifying fake and real audio. Mai says that advanced programs outperform humans due to their ability to recognize subtle acoustical nuances, something humans cannot do.
Complex sounds, like human speech, consist of various frequencies. Frequency refers to the number of times a sound wave repeats in one second. “During their training phase, automated detectors analyze thousands of voice samples and learn about peculiarities in specific frequency levels and rhythmic irregularities that humans are unable to discern,” said Mai.
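To make the idea of a sound "consisting of various frequencies" concrete, here is a minimal Python sketch. It is only an illustration, not the detectors used in the study: the two-tone signal, the sample rate, and the function names `dft_magnitudes` and `strongest_frequencies` are all assumptions chosen for the example. It builds a sound from two pure tones and then recovers those tones with a plain discrete Fourier transform.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each frequency bin up to the Nyquist limit (naive O(n^2) DFT)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2)
    ]


def strongest_frequencies(samples, sample_rate, count):
    """Return the `count` strongest frequencies in Hz, in ascending order."""
    mags = dft_magnitudes(samples)
    n = len(samples)
    top_bins = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:count]
    return sorted(k * sample_rate / n for k in top_bins)


# Hypothetical test signal: a 100 Hz tone plus a quieter 220 Hz tone,
# sampled 800 times per second for half a second.
SAMPLE_RATE = 800
N_SAMPLES = 400
signal = [
    math.sin(2 * math.pi * 100 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 220 * t / SAMPLE_RATE)
    for t in range(N_SAMPLES)
]

print(strongest_frequencies(signal, SAMPLE_RATE, 2))
```

Real speech contains far more components than this toy signal, and a trained detector learns patterns across them rather than simply listing the loudest ones.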
Automated detectors have proven to be more effective than humans in this task, yet they also have limitations. First, they are not available for everyday use. Furthermore, their performance decreases when audio levels fluctuate and in noisy environments. However, the main challenge is keeping up with advances in generative artificial intelligence, which produces increasingly realistic content that is synthesized much more quickly. In the past, training a program to create deepfakes required hours of recording, but now it can be accomplished in seconds.
According to Fernando Cucchietti, an expert in the field, the study’s findings have certain limitations. The experimental conditions were tightly controlled and not representative of the real-life challenges posed by this technology. “They’re not really practical in situations where deepfakes can cause issues, like when you personally know the person being imitated,” Cucchietti, the head of data analysis and visualization at the Barcelona Supercomputing Center, told Spain’s Science Media Center. However, Cucchietti points out that these findings align with other studies in controlled environments, and “... the results are less influenced by factors like biases or preconceived notions, as seen in studies on misinformation.”
At an individual level, people struggle to reliably detect voice deepfakes. However, research suggests that aggregating the opinions of multiple individuals and making decisions based on majority voting improves detection capabilities. “If you come across an unusual audio message that causes some doubt, like if it’s asking you to transfer a big sum of money, it’s always a good idea to talk to others and double-check where it’s coming from,” said Mai.
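A rough calculation shows why majority voting can help. The sketch below is an illustration under strong assumptions, not the aggregation method used in the research: it assumes every listener is right 75% of the time (taken from the failure rate reported above) and that listeners make their mistakes independently of one another, which real groups rarely do. The function name `majority_accuracy` is hypothetical.

```python
from math import comb


def majority_accuracy(p, n):
    """Chance that a strict majority of n independent listeners is correct,
    when each listener alone is correct with probability p."""
    if n % 2 == 0:
        raise ValueError("use an odd number of listeners to avoid tied votes")
    votes_needed = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(votes_needed, n + 1)
    )


# Assumed individual accuracy of 0.75; group sizes are arbitrary examples.
for listeners in (1, 3, 5, 9):
    print(listeners, round(majority_accuracy(0.75, listeners), 3))
```

Under these assumptions, three listeners voting together are right about 84% of the time and five about 90%. If people tend to be fooled by the same clips, the real gain will be smaller.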
Mai proposes improving automated detectors by making them more resilient to variations in test audio. Her team is currently adapting successful models from other domains, such as text and images. “Since these models use a great deal of data for training, we can expect them to get better at recognizing variations in sound clips.” Mai also believes that institutions need to take action. “They should make it a priority to implement strategies, like regulations and policies, to mitigate the risks related to voice deepfakes.”