
Microsoft's AI Program Can Clone Your Voice From a 3-Second Audio Clip

The technology, while impressive, would make it easy for cybercriminals to clone people's voices for scam and identity fraud purposes.

By Michael Kan
January 10, 2023
(Credit: Getty Images/photoworldwide)

A new artificial intelligence advancement from Microsoft can clone your voice after hearing you speak for a mere 3 seconds. 

The program, called VALL-E, was designed for text-to-speech synthesis. A team of researchers at Microsoft created it by having the system listen to 60,000 hours of English audiobook narration from over 7,000 different speakers in a bid to get it to reproduce human-sounding speech. This training corpus is hundreds of times larger than the data other text-to-speech programs have been built on.

The Microsoft team published a website that includes several demos of VALL-E in action. As the demos show, the AI program can not only clone someone's voice from a 3-second audio clip, but also make the cloned voice say whatever is desired. In addition, the program can replicate the emotion in a person's voice or adopt different speaking styles.

[Audio demos: 3-second human speaker samples, each followed by VALL-E's cloned output]

Voice cloning is nothing new. But Microsoft’s approach stands out by making it easy to replicate anyone’s voice with only a short snippet of audio data. Hence, it’s not hard to imagine the same technology fueling cybercrime—which the Microsoft team acknowledges as a potential threat.  

“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” the researchers wrote in their paper. That said, the team notes it might be possible to build programs that can “discriminate whether an audio clip was synthesized by VALL-E.”

VALL-E encodes speech as "discrete tokens," then generates new tokens to voice different text. "VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording," the researchers wrote. "Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder."
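The pipeline the researchers describe has three stages: a neural codec encoder turns the 3-second prompt into discrete acoustic tokens, a language model generates new acoustic tokens conditioned on those prompt tokens plus the target text, and a codec decoder turns the generated tokens back into a waveform. The toy sketch below illustrates only that flow of data; every component here is a trivial stand-in (simple quantization and arithmetic), not Microsoft's actual models, and all function names are invented for illustration.

```python
# Toy illustration of the three-stage pipeline described in the paper.
# None of this is VALL-E's real code; the encoder, "language model,"
# and decoder are simplistic placeholders showing how data moves.

def encode_to_tokens(samples, levels=8):
    """Stand-in codec encoder: quantize each audio sample in [-1, 1]
    to one of `levels` discrete tokens."""
    return [min(levels - 1, int((s + 1.0) / 2.0 * levels)) for s in samples]

def generate_tokens(prompt_tokens, text, levels=8):
    """Stand-in 'language model': emit one token per character of the
    target text, biased by the prompt's average token -- a crude nod to
    how the real model conditions on the enrolled speaker's tokens."""
    bias = sum(prompt_tokens) // max(1, len(prompt_tokens))
    return [(bias + ord(ch)) % levels for ch in text]

def decode_tokens(tokens, levels=8):
    """Stand-in codec decoder: map discrete tokens back to samples."""
    return [t / (levels - 1) * 2.0 - 1.0 for t in tokens]

# The "enrolled" 3-second clip (here just a few fake samples in [-1, 1]).
prompt = [0.1, -0.3, 0.5, 0.2]
prompt_tokens = encode_to_tokens(prompt)          # stage 1: encode prompt
new_tokens = generate_tokens(prompt_tokens, "hello")  # stage 2: generate
waveform = decode_tokens(new_tokens)              # stage 3: decode to audio
```

The point of the structure is that the voice "identity" travels through the middle stage as conditioning information, which is why a short enrollment clip is enough to steer the output.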

However, the technology is far from perfect. In their research paper, Microsoft's team notes VALL-E can sometimes struggle or fail to pronounce certain words. At other times, the words can sound garbled, artificially synthesized, robotic, or just tonally off.

[Audio demo: a human speaker sample, followed by a flawed VALL-E reproduction]

"Even if we use 60K hours of data for training, it still cannot cover everyone's voice, especially accent speakers," the team added. "Moreover, the diversity of speaking styles is not enough, as LibriLight [the audiobook dataset VALL-E was trained on] is an audiobook dataset, in which most utterances are in reading style."

Nevertheless, the research suggests creating an even more accurate voice cloning program is achievable if it’s trained on even more audio clips. In the meantime, it doesn’t appear Microsoft has released VALL-E to the public, likely to guard against misuse.


About Michael Kan

Senior Reporter

I've been with PCMag since October 2017, covering a wide range of topics, including consumer electronics, cybersecurity, social media, networking, and gaming. Prior to working at PCMag, I was a foreign correspondent in Beijing for over five years, covering the tech scene in Asia.
