
Microsoft's AI Program Can Clone Your Voice From a 3-Second Audio Clip

The technology, while impressive, would make it easy for cybercriminals to clone people's voices for scam and identity fraud purposes.

By Michael Kan
January 10, 2023
(Credit: Getty Images/photoworldwide)

A new artificial intelligence advancement from Microsoft can clone your voice after hearing you speak for a mere 3 seconds. 

The program, called VALL-E, was designed for text-to-speech synthesis. A team of researchers at Microsoft created it by having the system listen to 60,000 hours of English audiobook narration from over 7,000 different speakers in a bid to get it to reproduce human-sounding speech. This training corpus is hundreds of times larger than the data other text-to-speech programs have been built on.

The Microsoft team published a website that includes several demos of VALL-E in action. As the demos show, the AI program can not only clone someone's voice from a 3-second audio clip, but also make the cloned voice say whatever is desired. In addition, the program can replicate the emotion in a person's voice or adopt different speaking styles.

[Audio demos: 3-second human speaker samples, each followed by VALL-E's cloned output]

Voice cloning is nothing new. But Microsoft’s approach stands out by making it easy to replicate anyone’s voice with only a short snippet of audio data. Hence, it’s not hard to imagine the same technology fueling cybercrime—which the Microsoft team acknowledges as a potential threat.  

“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” the researchers wrote in their paper. That said, the team notes it might be possible to build programs that can “discriminate whether an audio clip was synthesized by VALL-E.”

VALL-E encodes speech as "discrete tokens," then generates new tokens to voice different text. "VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording," the researchers wrote. "Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder."
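The pipeline the researchers describe has three stages: a neural codec encoder turns the 3-second prompt into discrete acoustic tokens, a language model generates new acoustic tokens conditioned on those prompt tokens plus the target text, and a codec decoder turns the generated tokens back into a waveform. The toy sketch below illustrates only that flow of data; every component here is a trivial stand-in (simple quantization and arithmetic), not Microsoft's actual models, and all function names are invented for illustration.

```python
# Toy illustration of the three-stage pipeline described in the paper.
# None of this is VALL-E's real code; the encoder, "language model,"
# and decoder are simplistic placeholders showing how data moves.

def encode_to_tokens(samples, levels=8):
    """Stand-in codec encoder: quantize each audio sample in [-1, 1]
    to one of `levels` discrete tokens."""
    return [min(levels - 1, int((s + 1.0) / 2.0 * levels)) for s in samples]

def generate_tokens(prompt_tokens, text, levels=8):
    """Stand-in 'language model': emit one token per character of the
    target text, biased by the prompt's average token -- a crude nod to
    how the real model conditions on the enrolled speaker's tokens."""
    bias = sum(prompt_tokens) // max(1, len(prompt_tokens))
    return [(bias + ord(ch)) % levels for ch in text]

def decode_tokens(tokens, levels=8):
    """Stand-in codec decoder: map discrete tokens back to samples."""
    return [t / (levels - 1) * 2.0 - 1.0 for t in tokens]

# The "enrolled" 3-second clip (here just a few fake samples in [-1, 1]).
prompt = [0.1, -0.3, 0.5, 0.2]
prompt_tokens = encode_to_tokens(prompt)          # stage 1: encode prompt
new_tokens = generate_tokens(prompt_tokens, "hello")  # stage 2: generate
waveform = decode_tokens(new_tokens)              # stage 3: decode to audio
```

The point of the structure is that the voice "identity" travels through the middle stage as conditioning information, which is why a short enrollment clip is enough to steer the output.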

However, the technology is far from perfect. In their research paper, Microsoft's team notes VALL-E can sometimes struggle or fail to pronounce certain words. At other times, the words can sound garbled, artificially synthesized, robotic, or just tonally off.

[Audio demo: a human speaker sample, followed by a flawed VALL-E reproduction]

"Even if we use 60K hours of data for training, it still cannot cover everyone's voice, especially accent speakers," the team added. "Moreover, the diversity of speaking styles is not enough, as LibriLight [the audiobook dataset VALL-E was trained on] is an audiobook dataset, in which most utterances are in reading style."

Nevertheless, the research suggests creating an even more accurate voice cloning program is achievable if it’s trained on even more audio clips. In the meantime, it doesn’t appear Microsoft has released VALL-E to the public, likely to guard against misuse.


About Michael Kan

Senior Reporter

I've been with PCMag since October 2017, covering a wide range of topics, including consumer electronics, cybersecurity, social media, networking, and gaming. Prior to working at PCMag, I was a foreign correspondent in Beijing for over five years, covering the tech scene in Asia.
