Will the real Barack Obama please speak up?

In recent years, the development of deepfake technology has created a new frontier in digital manipulation. Deepfakes are generated through artificial intelligence algorithms that can create realistic audio and video clips of people saying or doing things that they never actually did. While this technology has the potential to be used for creative and entertainment purposes, it also poses significant risks and concerns related to disinformation, identity theft, and privacy violations. 

One of the most pressing concerns regarding deepfakes is their potential to deceive people into believing that they are real. For example, deepfake audio clips could be used to spread false information or incite political unrest by creating fake statements from public figures or political leaders. 

However, researchers are developing new techniques and tools to help detect deepfakes and distinguish them from real audio recordings. One promising approach is to use biometric authentication techniques, such as x-vectors and cosine similarity search, to compare the characteristics of real and deepfake voices. 

X-vectors are a type of feature embeddings extracted using deep neural networks, which represent speaker characteristics from audio recordings. These characteristics include features such as the pitch, tone, and rhythm of the voice. By analyzing these features, researchers can create a unique “voiceprint” that can be used to identify a specific individual’s voice. 

Cosine similarity search, on the other hand, involves comparing the similarity between the speakers in different audio clips based on their acoustic features. This technique can be used to determine whether two audio recordings were produced by the same person or by different people. 
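As a rough illustration of the comparison step, the sketch below computes the cosine similarity between two embedding vectors in plain Python. The function name and the four-dimensional toy vectors are assumptions made for this example. Real x-vectors come from a trained neural network and have several hundred dimensions, and how a production system maps this value onto a percentage score is not described here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (made-up numbers, not real x-vectors).
enrolled = [0.9, 0.1, 0.3, 0.5]
same_direction = [1.8, 0.2, 0.6, 1.0]  # the enrolled vector scaled by 2
different = [-0.2, 0.8, -0.5, 0.1]

print(cosine_similarity(enrolled, same_direction))  # close to 1.0
print(cosine_similarity(enrolled, different))       # much lower
```

Because the measure depends only on the direction of the vectors and not on their length, scaling an embedding leaves the score unchanged.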

To test the effectiveness of these biometric authentication techniques, researchers at Intelligent Voice recently conducted a study comparing real recordings of former US President Barack Obama with deepfake recordings and with recordings made by voice actors. The researchers enrolled three kinds of voice into a biometric database: the “real” voice of Obama, deepfake voices generated using artificial intelligence, and the actor voices. They then compared the real and “fake” voices against five real clips and one fake clip of Obama speaking.

The results of the study were both surprising and reassuring. The system correctly identified the real clips as real above the 80% threshold required for a match, whereas every “fake” audio file was rejected. 

The team started with only one “real” audio file to simulate real-life matching conditions, taken from https://www.youtube.com/watch?v=iaxqTVNHFB0 (labelled “Obama_Real”).

Can’t believe your eyes? 

One of the most famous Barack Obama deepfake videos was put together by actor and director Jordan Peele (it can be found at https://www.youtube.com/watch?v=cQ54GDm1eL0).

Although the video and audio are extremely believable, the voiceover is actually performed by Jordan Peele himself. The similarity between Obama_Real and the Jordan Peele “fake” audio was only 67.19%, and for a match to be accepted the score must be above 80%. So already we see that the system can detect a fake even when the voice is a very believable, human-generated one.
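The accept-or-reject decision described here amounts to a simple threshold test. The sketch below is a minimal illustration that uses the 80% threshold and the 67.19% score quoted in this article. The constant and function names are hypothetical and are not taken from the Intelligent Voice system.

```python
MATCH_THRESHOLD = 80.0  # percent; the threshold quoted in the article

def is_match(score_percent, threshold=MATCH_THRESHOLD):
    """Accept a comparison only if its score is strictly above the threshold."""
    return score_percent > threshold

# Obama_Real versus the Jordan Peele voiceover scored 67.19%.
print(is_match(67.19))  # False
```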

To test against actual deepfake audio, four clips of synthetic voices that sound like the real Obama to the human ear were obtained (from https://this-voice-does-not-exist.com/barack_obama_deepfake). Again, only scores above 80% are regarded as a match by the biometric system.

The match between the enrolled real Obama voice and the AI-generated voice was:

A biometric profile was created from the four clips above and enrolled as “Obama_Deepfake”, and the Jordan Peele impersonation was enrolled as “Obama_JP_Deepfake” in the biometric database. A series of further tests was then performed.
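The article does not say how the four clips were combined into a single profile. One common approach, shown here purely as an assumption, is to length-normalise each clip's embedding and average them. The function names and the two-dimensional toy vectors are hypothetical.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]

def build_profile(embeddings):
    """Average length-normalised clip embeddings into one speaker profile.

    Assumption: this is one common enrolment strategy, not necessarily
    the one used in the study.
    """
    if not embeddings:
        raise ValueError("at least one embedding is required")
    normalized = [l2_normalize(e) for e in embeddings]
    dim = len(normalized[0])
    if any(len(e) != dim for e in normalized):
        raise ValueError("all embeddings must have the same dimension")
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)

# Two toy 2-dimensional "clip embeddings" (made-up numbers).
profile = build_profile([[1.0, 0.0], [0.0, 1.0]])
print(profile)  # both components equal, unit length overall
```

Normalising before averaging keeps a single loud or long clip from dominating the profile.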

Can you hear the real me, can ya?

It is important not only that the system can identify deepfakes but also, as a control, that it can identify actual Obama clips, preferably ones with different acoustic properties.

So the team tested five real Obama files from various domains, such as interviews and speeches, and (as a comparison) one synthesized audio file generated with DC-TTS (Deep Convolutional Text To Speech), to look for the highest-scoring match.

Files: 

1. https://www.youtube.com/watch?v=51uk5SFF5s8 

2. https://www.youtube.com/watch?v=RShdL1jB7-Q  

3. https://www.youtube.com/watch?v=MS5UjNKw_1M 

4. https://www.youtube.com/watch?v=mIA0W69U2_Y 

5. https://www.youtube.com/watch?v=2hOp408Ib5w 

DC-TTS synthesized audio, clipped from 44 seconds to 1 minute 16 seconds of the video below:

6. https://www.youtube.com/watch?v=6bFN2YkN6bo&t=44s 

Obama_Deepfake: 1 Match

Obama_Real: 5 Matches 

The system correctly identifies the “real” clips as coming from Obama himself. The synthesized voice matched most closely to the “fake” clips enrolled earlier. So again, a good result.
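Looking for the highest-scoring match among several enrolled profiles can be sketched as follows. The profile names are the ones used in this article, but the scores are invented for illustration and are not the study's results. The function name is hypothetical.

```python
def best_match(scores):
    """Return the (profile name, score) pair with the highest score."""
    if not scores:
        raise ValueError("no enrolled profiles to compare against")
    name = max(scores, key=scores.get)
    return name, scores[name]

# Made-up percentage scores for one test clip against each enrolled profile.
example_scores = {
    "Obama_Real": 55.0,
    "Obama_Deepfake": 90.0,
    "Obama_JP_Deepfake": 40.0,
}

print(best_match(example_scores))  # ('Obama_Deepfake', 90.0)
```

In this toy case the clip would be attributed to the deepfake profile rather than to the real speaker, which is the kind of outcome reported for the DC-TTS clip.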

We saw earlier that human actors can create very realistic “deepfakes”. If that were not so, impressionists would have gone out of business a long time ago! So the team found more examples of acted Obama audio (all from the “Conan” show on TBS) and ran tests against them.

Files: 

1. https://www.youtube.com/watch?v=ucynU4IDlGg 

2. https://www.youtube.com/watch?v=ldZzRlFmxfg 

3. https://www.youtube.com/watch?v=Fe2456MZB5c 

Again, machine beats human. 

Safe:  For now… 

While these results suggest that deepfake audio technology still has some limitations, it is important to note that this technology is evolving rapidly. As deepfake algorithms become more sophisticated and difficult to detect, it will become increasingly challenging to distinguish between real and generated audio recordings. 

To address this challenge, researchers are exploring new approaches to deepfake detection, such as analyzing the metadata associated with an audio recording or using machine learning algorithms to detect subtle differences in voice patterns. A recent study showed that it is possible to detect the actual system used to generate a deepfake, which suggests that each system develops a common “voice”. This is evident when the systems are used to generate audio that falls outside their original training data, which is very often based on open-source US English corpora.

In addition to technical solutions, it is also important to develop a comprehensive policy and legal framework to address the risks and harms associated with deepfakes. This could include regulations that require platforms to label deepfake content or to remove it entirely, as well as laws that criminalize the creation or dissemination of malicious deepfakes. 

Moreover, it is crucial to invest in media literacy and public education programs to help people become more aware of the risks and challenges associated with deepfake technology. By educating the public on how deepfakes are created and how to detect them, we can reduce the potential harms associated with this technology. Identifying the source of video and audio content is going to become more crucial, and the need could not come at a worse time: trust in mainstream media is at a record low, with Gallup reporting that only 34% of Americans believe news is reported “fully, accurately and fairly” by mass media.

In conclusion, deepfake audio technology poses significant risks and challenges, but biometric authentication techniques such as x-vectors and cosine similarity search offer promising approaches to detect and distinguish between real and generated audio recordings. While these techniques are not foolproof, they represent a step forward in the fight against deepfakes. However, addressing the risks and harms associated with deepfakes will require a multifaceted approach that includes technical, legal, and educational solutions. By working together, we can develop effective strategies to mitigate the risks of deepfake technology and safeguard our digital identities and public discourse.