Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

Li, L; Wang, S; Zhang, Z; Ding, Y; Zheng, Y; Yu, X; Fan, C

Publisher: AAAI
Publication Type: Conference Proceeding
Citation: 35th AAAI Conference on Artificial Intelligence, AAAI 2021, 2021, 3A, pp. 1911-1920
Issue Date: 2021-01-01

Closed Access

	Filename	Description	Size	Format
	16286-Article Text-19780-1-2-20210518.pdf	Published version	1.55 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Li, L
dc.contributor.author	Wang, S
dc.contributor.author	Zhang, Z
dc.contributor.author	Ding, Y
dc.contributor.author	Zheng, Y
dc.contributor.author	Yu, X https://orcid.org/0000-0002-0269-5649
dc.contributor.author	Fan, C
dc.date	2021-02-02
dc.date.accessioned	2022-06-18T01:14:18Z
dc.date.available	2022-06-18T01:14:18Z
dc.date.issued	2021-01-01
dc.identifier.citation	35th AAAI Conference on Artificial Intelligence, AAAI 2021, 2021, 3A, pp. 1911-1920
dc.identifier.isbn	9781713835974
dc.identifier.uri	http://hdl.handle.net/10453/158236
dc.language	en
dc.publisher	AAAI
dc.relation.ispartof	35th AAAI Conference on Artificial Intelligence, AAAI 2021
dc.relation.ispartof	Conference on Artificial Intelligence
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation
dc.type	Conference Proceeding
utslib.citation.volume	3A
utslib.location.activity	Virtual
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	closed_access	*
pubs.consider-herdc	false
dc.date.updated	2022-06-18T01:14:16Z
pubs.finish-date	2021-02-09
pubs.place-of-publication	USA
pubs.publication-status	Published
pubs.start-date	2021-02-02
pubs.volume	3A
dc.location	USA

Abstract:

In this paper, we propose a novel text-based talking-head video generation framework that synthesizes high-fidelity facial expressions and head motions in accordance with contextual sentiments as well as speech rhythm and pauses. To be specific, our framework consists of a speaker-independent stage and a speaker-specific stage. In the speaker-independent stage, we design three parallel networks to generate animation parameters of the mouth, upper face, and head from texts, separately. In the speaker-specific stage, we present a 3D face model guided attention network to synthesize videos tailored for different individuals. It takes the animation parameters as input and exploits an attention mask to manipulate facial expression changes for the input individuals. Furthermore, to better establish authentic correspondences between visual motions (i.e., facial expression changes and head movements) and audios, we leverage a high-accuracy motion capture dataset instead of relying on long videos of specific individuals. After attaining the visual and audio correspondences, we can effectively train our network in an end-to-end fashion. Extensive experiments on qualitative and quantitative results demonstrate that our algorithm achieves high-quality photo-realistic talking-head videos including various facial expressions and head motions according to speech rhythms and outperforms the state-of-the-art.
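The two-stage structure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `AnimationParams`, `speaker_independent_stage`, and `blend_with_attention` are hypothetical, the three "networks" are toy stand-ins, and the per-pixel blending rule is one common way an attention mask is applied in image synthesis, which the abstract itself does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]  # a flattened grayscale image, values in [0, 1]


@dataclass
class AnimationParams:
    """Parameters named after the three groups listed in the abstract."""
    mouth: List[float]
    upper_face: List[float]
    head: List[float]


def speaker_independent_stage(
    text: str,
    mouth_net: Callable[[str], List[float]],
    upper_face_net: Callable[[str], List[float]],
    head_net: Callable[[str], List[float]],
) -> AnimationParams:
    # The three networks run separately on the same text input.
    return AnimationParams(mouth_net(text), upper_face_net(text), head_net(text))


def blend_with_attention(reference: Frame, generated: Frame, mask: Frame) -> Frame:
    # Assumed compositing rule (not taken from the paper): mask = 1 takes the
    # generated pixel, mask = 0 keeps the reference pixel of the individual.
    if not (len(reference) == len(generated) == len(mask)):
        raise ValueError("reference, generated and mask must have equal length")
    return [m * g + (1.0 - m) * r for r, g, m in zip(reference, generated, mask)]


# Toy stand-ins for trained networks (hypothetical; they return dummy vectors).
params = speaker_independent_stage(
    "hello world",
    mouth_net=lambda t: [float(len(t))],
    upper_face_net=lambda t: [0.0],
    head_net=lambda t: [0.0, 0.0, 0.0],
)

# Two-pixel example: the mask changes the first pixel and preserves the second.
blended = blend_with_attention([0.0, 0.0], [1.0, 1.0], [1.0, 0.0])
```

In this sketch the mask limits changes to the regions it selects, so the rest of the reference frame passes through unchanged. Whether the paper's speaker-specific network composes its output this way would need to be checked against the published version.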

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/158236