Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

Wang, S; Li, L; Ding, Y; Fan, C; Yu, X

Publisher: IJCAI
Publication Type: Conference Proceeding
Citation: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021, pp. 1098-1105
Issue Date: 2021-01-01

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, S
dc.contributor.author	Li, L
dc.contributor.author	Ding, Y
dc.contributor.author	Fan, C
dc.contributor.author	Yu, X https://orcid.org/0000-0002-0269-5649
dc.date	2021-08-29
dc.date.accessioned	2022-05-26T00:39:45Z
dc.date.available	2022-05-26T00:39:45Z
dc.date.issued	2021-01-01
dc.identifier.citation	Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021, pp. 1098-1105
dc.identifier.isbn	9780999241196
dc.identifier.issn	1045-0823
dc.identifier.uri	http://hdl.handle.net/10453/157705
dc.description.abstract	We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image. In this work, we tackle two key challenges: (i) producing natural head motions that match speech prosody, and (ii) maintaining the appearance of a speaker under large head motions while stabilizing the non-face regions. We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN). In this way, the predicted head poses act as the low-frequency holistic movements of a talking head, thus allowing our subsequent network to focus on detailed facial movement generation. To depict the entire image motions arising from audio, we exploit a keypoint-based dense motion field representation. Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image. As this keypoint-based representation models the motions of facial regions, head, and backgrounds integrally, our method can better constrain the spatial and temporal consistency of the generated videos. Finally, an image generation network is employed to render photo-realistic talking-head videos from the estimated keypoint-based motion fields and the input reference image. Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds, and that it outperforms the state of the art.
dc.language	en
dc.publisher	IJCAI
dc.relation.ispartof	Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence
dc.relation.ispartof	International Joint Conference on Artificial Intelligence
dc.relation.isbasedon	10.24963/ijcai.2021/152
dc.rights	info:eu-repo/semantics/openAccess
dc.title	Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion
dc.type	Conference Proceeding
utslib.location.activity	Montreal, Canada
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	open_access	*
pubs.consider-herdc	false
dc.date.updated	2022-05-26T00:39:43Z
pubs.finish-date	2021-08-27
pubs.place-of-publication	USA
pubs.publication-status	Published
pubs.start-date	2021-08-29
dc.location	USA

Abstract:

We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image. In this work, we tackle two key challenges: (i) producing natural head motions that match speech prosody, and (ii) maintaining the appearance of a speaker under large head motions while stabilizing the non-face regions. We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN). In this way, the predicted head poses act as the low-frequency holistic movements of a talking head, thus allowing our subsequent network to focus on detailed facial movement generation. To depict the entire image motions arising from audio, we exploit a keypoint-based dense motion field representation. Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image. As this keypoint-based representation models the motions of facial regions, head, and backgrounds integrally, our method can better constrain the spatial and temporal consistency of the generated videos. Finally, an image generation network is employed to render photo-realistic talking-head videos from the estimated keypoint-based motion fields and the input reference image. Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds, and that it outperforms the state of the art.
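
The abstract refers to "rigid 6D head movements" without stating how the six degrees of freedom are parameterized. As a rough illustration only, the sketch below assumes three rotation angles (pitch, yaw, roll, in radians) plus a 3D translation, and applies such a pose to a single point. The Euler-angle convention and the names `euler_to_rotation` and `apply_pose` are assumptions made for this example and are not taken from the paper.

```python
import math

def euler_to_rotation(pitch, yaw, roll):
    """Build a 3x3 rotation matrix R = Rz(roll) * Ry(yaw) * Rx(pitch).

    The axis order is an assumed convention, not the paper's.
    """
    cx, sx = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    cz, sz = math.cos(roll), math.sin(roll)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]

def apply_pose(pose, point):
    """Apply a hypothetical 6D pose (pitch, yaw, roll, tx, ty, tz) to a 3D point.

    Computes R * point + t, i.e. a rigid motion: rotation followed by translation.
    """
    pitch, yaw, roll, tx, ty, tz = pose
    rotation = euler_to_rotation(pitch, yaw, roll)
    rotated = [sum(rotation[i][j] * point[j] for j in range(3)) for i in range(3)]
    return [rotated[0] + tx, rotated[1] + ty, rotated[2] + tz]

if __name__ == "__main__":
    # A quarter-turn roll about the z-axis, then a shift by (1, 2, 3).
    pose = (0.0, 0.0, math.pi / 2, 1.0, 2.0, 3.0)
    print(apply_pose(pose, [1.0, 0.0, 0.0]))
```

A sequence of such poses, one per video frame, would describe the holistic head trajectory that the abstract separates from detailed facial movement; how the paper's predictor actually represents and outputs poses is not specified on this page.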

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/157705