UCSD Researchers Make AI-Generated Voices More Expressive with New Technology

An AI-generated voice is synthetic speech that a system, such as a smart assistant, uses to communicate with the end user. Until now, these voices have largely lacked mood or emotion: the same flat, monotone delivery that never betrays any feeling.


It is past time AI voice had a major upgrade, won’t you agree?

In that regard, computer scientists at the University of California San Diego have developed a way to train AI voices to be more expressive.

This new technology

This work was presented at the annual ACML conference in 2021 by a research team of electrical engineers and computer scientists from the University of California San Diego. Their results suggest that the technology could considerably improve the personal assistants on smart devices and other areas where AI voices are used. It doesn't stop there: it could also be applied to translating speech into different languages and to producing higher-quality voice-overs for animated movies. Furthermore, this technology could be a major boost for speech-generating devices. A speech-generating device (SGD) is a personalized voice output system that aids people who have difficulty speaking; think of it as a speech supplement. The famous physicist Stephen Hawking used one. Imagine what happens when SGDs and this new AI tech team up: the computerized voice could become far more expressive.

Shehzeen Hussain, one of the lead authors and a Ph.D. candidate from the UC San Diego Jacobs School of Engineering, described their research as a mission to add “expressive meaning to speech”.

Essentially, this tech, developed by the team at UC San Diego, generates expressive speech with little training for speakers who were not in the group the model was trained on. Pre-existing AI voice systems had limitations that left them below par. Some can synthesize expressive speech only for a particular speaker, and only after long hours of training on that speaker's data. Others can create speech from just a few minutes of audio from a new speaker, but they cannot synthesize expressive speech for that person.

To develop this new tech, the scientists used the rhythm and pitch of speech to represent emotion. This approach helped them clone expressive speech across a wide range of voices.
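To make "pitch" and "rhythm" a little more concrete, here is a minimal sketch of how those two cues can be measured from raw audio samples. It is not the researchers' method, which relies on trained neural models; it only illustrates the kind of signal involved. The function names, the 80 to 400 Hz search range, and the synthetic test tone are all assumptions made for this example.

```python
import math


def estimate_pitch_hz(frame, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate the pitch of one audio frame by autocorrelation.

    Hypothetical helper: tries every lag between the shortest and longest
    expected pitch period and keeps the lag where the frame best matches
    a shifted copy of itself.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # A best_lag of 0 means no periodicity was found (e.g. silence).
    return sample_rate / best_lag if best_lag else 0.0


def rms_energy(signal, frame_len):
    """Return the loudness (RMS) of each non-overlapping frame.

    The rise and fall of this envelope over time is a rough proxy for rhythm.
    """
    energies = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


# Synthetic stand-in for speech: 0.1 s of a 200 Hz tone, then 0.1 s of silence.
SAMPLE_RATE = 16000
tone = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(1600)]
silence = [0.0] * 1600

pitch = estimate_pitch_hz(tone, SAMPLE_RATE)
envelope = rms_energy(tone + silence, 800)
print(round(pitch), [round(e, 2) for e in envelope])
```

On real speech, the pitch estimate would be repeated frame by frame to trace a contour over time; that contour, together with the loudness envelope, is the sort of information the article refers to as pitch and rhythm.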

The researchers say, “We demonstrate that our proposed model can make a new voice express, emote, sing or copy the style of a given reference speech.” Could this have implications in the health sector, possibly to create a sense of human connection for those undergoing quarantine?

On the downside, this work could be used to make more convincing deepfake videos and audio. Paarth Neekhara, a co-author and Ph.D. candidate, acknowledges this threat. The team plans to address it next by working on a watermark that would help identify cloned voices.

Conclusion

The ability of this tech to interpret pitch and rhythm as expression, convert written text to speech, and reconstruct whole speech from a small sample of a speaker's voice means that, bit by bit, we are adding an emotional component to AI.

Though promising, there is a lot of work to be done to improve this technology, especially to help speakers with strong accents.

References

Neekhara, P., Hussain, S., Dubnov, S., Koushanfar, F., & McAuley, J. (2021). Expressive neural voice cloning (arXiv:2102.00151). arXiv. https://doi.org/10.48550/arXiv.2102.00151
