A Change You Can Hear: AI Now Mimics Voices

Andrei Mihai

While the world is currently abuzz with artificial intelligence (AI) models that produce text or images, another revolution is happening more quietly: seemingly unnoticed, sound-generating AIs have also advanced tremendously, and are now enabling our favorite actors to continue to be heard – even after losing their voices. However, for the voice-over and music industries, this could mean a significant shift in business practices.

Who Wants to Live Forever?

**Image credits: Thomas Le (CC BY 3.0)**

Top Gun: Maverick brought back one of the most iconic characters of the 80s and – less than one year later – has become one of the highest-grossing movies of all time, earning $1.5 billion at the box office. However, the movie almost did not happen.

The star of the show is undoubtedly Tom Cruise, but in the original movie, Val Kilmer also played a key role. Cruise had said he would not do the follow-up if Kilmer did not join, but there was a problem: Kilmer had lost his voice after a battle with throat cancer. Ten years ago, that would have been the end of the story. No Kilmer, no Cruise, no Top Gun. But things are different now.

Instead of simply giving up, filmmakers restored Kilmer’s voice with the aid of AI technology. He provided a company with hours of recordings of his voice, and after clearing the samples of any noise and unwanted signals, the company trained a model to produce new outputs of Kilmer’s voice from a text input, thus enabling his appearance on the big screen.

An AI-generated rendition of actor Val Kilmer’s voice.

Remarkably, this isn’t the only high-profile cinema project to feature AI voices. Actor James Earl Jones uttered possibly the most famous line in the history of cinema: “No, I am your father,” Jones told Luke Skywalker, as he was playing the role of the infamous Darth Vader in “Star Wars”. Now 91 years old, Jones has announced that he will retire from acting and voice acting. However, he too will live on in perpetuity via an AI-generated voice.

As is often the case with AI projects, details regarding the model type and training used are scarce – companies are careful to divulge as little as possible, so as not to lose any edge against their competition. But the results are convincing, and the number of instances where AI is used to replace specific human voices is on the rise. For instance, the 2021 documentary “Roadrunner” also features a 45-second clip with the AI-generated voice of celebrity chef Anthony Bourdain.

In fact, all signs point to AI voice generation reaching beyond cinematic applications and having an even more sweeping impact.

Reshaping the industry

If models are good enough to create a replica of someone’s voice, then they could be a game-changer for voice acting and maybe even the music industry. But can this approach really be scaled?

Microsoft believes so. Just a few days ago, the tech company released its new text-to-speech AI model: VALL-E, the voice equivalent of the DALL-E image-generating model. Microsoft claims that it can simulate a person’s voice after only a three-second audio input, subsequently synthesizing audio of that person speaking while preserving the individual’s unique voice signature and emotional tone.

The 3-second speaker prompt.

The AI generated audio based on the prompt.

It is not hard to envision applications for VALL-E, but Microsoft is particularly eyeing voice-overs for the gaming industry, with its planned acquisition of the gaming company Activision Blizzard for over $68 billion. Blizzard is one of the world’s largest voice actor employers for video games. If Microsoft could automate that process and make it virtually free, it would likely mean huge windfalls for the company (and very bad news for voice actors).

VALL-E was trained on Libri-light, an open source dataset for the training of weakly supervised and unsupervised speech models. Unsupervised learning requires the algorithm to learn patterns from untagged data, whereas weakly supervised learning relies on data whose labels are noisy or imprecise. Overall, Libri-light contains over 60,000 hours of recordings from over 7,000 audiobook narrators. Since the dataset is open source, other models could use it as well – and indeed, Microsoft is not the only one eyeing this type of voice generation scalability.

The Chinese music company Tencent Music Entertainment (TME) revealed that an AI-based voice synthesis system was used to record more than 1,000 songs, including a track titled “Today” that surpassed 100 million streams and has generated an estimated revenue of nearly $350,000 so far. In fact, Tencent claims its proprietary model is so advanced it could be used to get any singer (living or not) to ‘perform’ any song, and the company says it is already incorporating AI-generated voices into its core strategy. Tencent is not the first to think of this: earlier in 2022, an AI-powered rapper was signed to Capitol Records. But like Microsoft, Tencent claims to be able to scale the process. If details were scarce in the first few examples, verifying Tencent’s claims is harder still, given how difficult it is to access the company’s music from outside of China.

But regardless of where you are in the world, the prospect of an AI capable of replicating anyone’s voice should probably be of interest.

The future of the music industry may involve fewer instruments played by real human beings. Image credits: Wes Hicks (CC BY 3.0)

Giving voices and taking voices away

While AI has given Kilmer his voice back and enabled some of our favorite actors to keep their voices on screen after their retirement, the advent of this technology raises a number of ethical concerns.

For starters, there is the problem of copyright. Who owns the rights to people’s voices? James Earl Jones has a very recognizable voice that people may want to use, say, in a commercial. Jones reportedly lives in New York, a state that protects all rights to publicity against commercial use and permits that right to be inherited – so you would not be able to use his voice outright. But what if you use a voice that really sounds like his, but is a bit different? Or what if you want to use the voice of someone who lives in a state or country with laws that are not as clearly defined? Several high-profile lawsuits have already been launched regarding image-generating algorithms, and voice-generating algorithms could trigger a similar uproar. Microsoft itself indirectly admitted to how big a problem copyright is for audio-generating AIs, as it announced its new music-generating algorithm, but did not release it, specifically to avoid copyright issues.

Then, there is the problem of potentially eliminating numerous jobs virtually overnight – jobs that, only a few years ago, seemed safe from automation. How the world of music and voice acting might react to this is anyone’s guess at the moment.

Some of the ethical challenges that arise with AI also boil down to the datasets that models are trained on. AI is supposed to be a simple, neutral box: information goes into the model and then some type of output is returned. But we know by now that algorithms are not neutral and that they can be vulnerable to the biases of our society. Since companies are likely to use AI voices to interact with their customers, it is reasonable to assume this practice might be plagued by the same biases found in AI use elsewhere. Voices directed at customers may subtly differ in tone and emotion depending on the target audience and lead towards discrimination based on gender, ethnic and racial factors, or even presumed likes and dislikes of a given customer base. The field of voice AI is a bit less mature than its sister fields, but there is no reason to believe it will be spared from the same biases.

But perhaps the biggest cause for concern is the potential for misuse.

As exciting as it is to have the option to get everyone’s voice to say whatever you want, the problem is just that: you could potentially use anyone’s voice to say whatever you want. The potential for deepfakes, forgeries, scams, phishing, and other cybercrime looms on the horizon. A politician you do not like? Get them to say something their supporters will hate. The truth will come out, but as we have painfully learned, lies travel faster than the truth in today’s online world.

Judging by what we are already seeing now, it will not be long until anyone’s voice can become fair game. Even though this technology is extremely recent, we are already seeing the potential for abuse. Just a few days ago, an AI startup reported that online trolls used its platform to get the voices of celebrities to read memes, hate speech, and misinformation. In one example, the voice of Emma Watson was heard reading fragments from Mein Kampf, while other celebrity voices were used to utter homophobic or racist remarks. The company has since announced safeguards to tackle this sort of abuse, but these are just the very early days; what happens a year or two from now, when there will likely be multiple companies with this type of capability?

Companies are usually quick to point out that they are putting safeguards in place, but this will likely come down to an AI arms race. Just days ago, OpenAI, the start-up behind ChatGPT, announced a classifier that assesses the likelihood of a text being written by an AI. Similar approaches can be used for image and voice content, but will they be good enough? Both content-generating algorithms and content-classifier algorithms will become more and more sophisticated, but it may be too early to tell whether the former will be able to produce content indistinguishable from that of a human or the latter will be able to flag non-human content. If we end up in a middle ground, which seems more likely, there will often be some confusion regarding the authenticity of voice clips – and it is not hard to envision this causing chaos during moments like elections.

No doubt, this technology will have a profound impact on society, but as usually seems to be the case, technology is moving faster than legislation and social norms. Technology’s approach seems to be “build fast, ask questions later,” which accentuates the benefits of having earlier access to revolutionary algorithms, but also accentuates the potential for misuse before safeguards are in place.

As laureates also discussed during the 2022 Hot Topic at the 9th HLF, the great potential of AI comes with a great deal of risk.

“We’re bringing more and more powerful tools into the world, and they can bring a lot of good,” said Yoshua Bengio, co-recipient of the Turing Award, at the 9th HLF’s Hot Topic in 2022. “But it can also be used in nefarious ways that we can’t even fathom yet.”

The key is to direct technology in a way that makes a positive impact on our society. Essentially, technology can provide tools like AI, but technology cannot provide the framework in which those tools are best used. That is up to ethicists, policymakers, and experts in various fields outside of computer science. AI content generation is no longer a future prospect – it is already upon us, and the deluge of non-human visual, audio, and text content is already knocking at our door. Ultimately, as Bengio himself concluded, it is imperative that we, as a society, decide how we want to deal with this type of challenge. Technology is a tool, and we should decide how we want to use it.

The post A Change You Can Hear: AI Now Mimics Voices originally appeared on the HLFF SciLogs blog.
