
AI system can convert voice track to video of a person speaking using a still image

We propose EMO, an expressive audio-driven portrait-video generation framework. Given a single reference image and vocal audio, e.g. talking or singing, our method can generate vocal avatar videos with expressive facial expressions and varied head poses; videos of any duration can be generated depending on the length of the input audio. Credit: arXiv (2024). DOI: 10.48550/arxiv.2402.17485

A small team of artificial intelligence researchers at the Institute for Intelligent Computing, Alibaba Group, has demonstrated, through videos they created, a new AI application that can accept a single photograph of a person's face and a soundtrack of someone talking or singing and use them to create an animated version of that person speaking or singing the voice track. The group has published a paper describing their work on the arXiv preprint server.

Prior researchers have demonstrated AI applications that can process a photograph of a face and use it to create a semi-animated version. In this new effort, the team at Alibaba has taken this a step further by adding sound. And perhaps just as importantly, they have done so without the use of 3D models or even facial landmarks. Instead, the team used diffusion modeling based on training an AI on large datasets of audio and video data. In this instance, the team used roughly 250 hours of such data to create their app, which they call Emote Portrait Alive (EMO).

By directly converting the audio waveform into video frames, the researchers created an application that captures subtle human facial gestures, quirks of speech and other characteristics that identify an animated image of a face as human-like. The videos faithfully recreate the likely mouth shapes used to form words and sentences, along with the expressions typically associated with them.
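
The paper's actual architecture and training code are not reproduced here, but a minimal Python sketch can illustrate the general idea it describes: a diffusion-style denoiser that generates one video frame per window of audio, conditioned on audio features and an embedding of the single reference image. All class names, tensor shapes and the simple sampling loop below are assumptions made for illustration only, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of an audio-driven
# portrait-video diffusion pipeline in the spirit of EMO.
import torch
import torch.nn as nn

class AudioToVideoDiffusion(nn.Module):
    """Toy denoiser: predicts noise for a video frame, conditioned on an
    audio-feature vector and an embedding of the single reference image."""
    def __init__(self, frame_ch=3, cond_dim=128):
        super().__init__()
        self.cond_proj = nn.Linear(2 * cond_dim, frame_ch)   # fuse audio + reference
        self.denoise = nn.Conv2d(2 * frame_ch, frame_ch, 3, padding=1)

    def forward(self, noisy_frame, audio_feat, ref_embed):
        cond = self.cond_proj(torch.cat([audio_feat, ref_embed], dim=-1))
        cond_map = cond[:, :, None, None].expand_as(noisy_frame)
        return self.denoise(torch.cat([noisy_frame, cond_map], dim=1))

@torch.no_grad()
def generate_video(model, audio_feats, ref_embed, size=64, steps=20):
    """One frame per audio feature vector, so video length follows audio length."""
    frames = []
    for audio_feat in audio_feats:                 # e.g. one feature per ~40 ms of audio
        x = torch.randn(1, 3, size, size)          # start each frame from pure noise
        for _ in range(steps):                     # crude iterative denoising loop
            eps = model(x, audio_feat, ref_embed)
            x = x - eps / steps
        frames.append(x)
    return torch.stack(frames, dim=1)              # (batch, time, C, H, W)

# Usage with random stand-ins for real audio features and the reference embedding.
model = AudioToVideoDiffusion()
audio_feats = [torch.randn(1, 128) for _ in range(5)]   # 5 audio windows -> 5 frames
ref_embed = torch.randn(1, 128)
video = generate_video(model, audio_feats, ref_embed)
print(video.shape)  # torch.Size([1, 5, 3, 64, 64])
```

The point of the sketch is the conditioning structure: because every generated frame is driven by both the audio window and the same reference-image embedding, the output can follow the soundtrack for as long as it runs while keeping the identity of the person in the photograph.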







Character: Mona Lisa. Vocal Source: Shakespeare's Monologue II, As You Like It: Rosalind, "Yes, one; and in this manner." Credit: https://humanaigc.github.io/emote-portrait-alive/

The team has posted several videos demonstrating the strikingly accurate performances they generated, claiming that they outperform other applications in realism and expressiveness. They also note that the finished video length is determined by the length of the original audio track. In the videos, the original picture is shown alongside that person talking or singing in the voice of the person who was recorded on the original audio track.







Credit: Emote Portrait Alive

The team concludes by acknowledging that use of such an application will need to be restricted or monitored to prevent unethical use of the technology.

More information:
Linrui Tian et al, EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions, arXiv (2024). DOI: 10.48550/arxiv.2402.17485

EMO: humanaigc.github.io/emote-portrait-alive/

Journal info:
arXiv


© 2024 Science X Network

Citation:
AI system can convert voice track to video of a person speaking using a still image (2024, March 1)
retrieved 1 March 2024
from https://techxplore.com/news/2024-03-ai-voice-track-video-person.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.


