StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads.

Wang, S; Ma, Y; Ding, Y; Hu, Z; Fan, C; Lv, T; Deng, Z; Yu, X

Publisher: IEEE COMPUTER SOC
Publication Type: Journal Article
Citation: IEEE Trans Pattern Anal Mach Intell, 2024, 46, (6), pp. 4331-4347
Issue Date: 2024-06

Closed Access

Filename	Description	Size
1721988.pdf	Published version	9.14 MB

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, S
dc.contributor.author	Ma, Y
dc.contributor.author	Ding, Y
dc.contributor.author	Hu, Z
dc.contributor.author	Fan, C
dc.contributor.author	Lv, T
dc.contributor.author	Deng, Z
dc.contributor.author	Yu, X https://orcid.org/0000-0002-0269-5649
dc.date.accessioned	2024-08-21T05:39:46Z
dc.date.available	2024-08-21T05:39:46Z
dc.date.issued	2024-06
dc.identifier.citation	IEEE Trans Pattern Anal Mach Intell, 2024, 46, (6), pp. 4331-4347
dc.identifier.issn	0162-8828
dc.identifier.issn	1939-3539
dc.identifier.uri	http://hdl.handle.net/10453/180500
dc.description.abstract	Individuals have unique facial expression and head pose styles that reflect their personalized speaking styles. Existing one-shot talking head methods cannot capture such personalized characteristics and therefore fail to produce diverse speaking styles in the final videos. To address this challenge, we propose a one-shot style-controllable talking face generation method that can obtain speaking styles from reference speaking videos and drive the one-shot portrait to speak with the reference speaking styles and another piece of audio. Our method aims to synthesize the style-controllable coefficients of a 3D Morphable Model (3DMM), including facial expressions and head movements, in a unified framework. Specifically, the proposed framework first leverages a style encoder to extract the desired speaking styles from the reference videos and transform them into style codes. Then, the framework uses a style-aware decoder to synthesize the coefficients of 3DMM from the audio input and style codes. During decoding, our framework adopts a two-branch architecture, which generates the stylized facial expression coefficients and stylized head movement coefficients, respectively. After obtaining the coefficients of 3DMM, an image renderer renders the expression coefficients into a specific person's talking-head video. Extensive experiments demonstrate that our method generates visually authentic talking head videos with diverse speaking styles from only one portrait image and an audio clip.
dc.format	Print-Electronic
dc.language	eng
dc.publisher	IEEE COMPUTER SOC
dc.relation	http://purl.org/au-research/grants/arc/DP220100800
dc.relation	http://purl.org/au-research/grants/arc/DE230100477
dc.relation.ispartof	IEEE Trans Pattern Anal Mach Intell
dc.relation.isbasedon	10.1109/TPAMI.2024.3357808
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0801 Artificial Intelligence and Image Processing, 0806 Information Systems, 0906 Electrical and Electronic Engineering
dc.subject.classification	Artificial Intelligence & Image Processing
dc.subject.classification	4603 Computer vision and multimedia computation
dc.subject.classification	4611 Machine learning
dc.subject.mesh	Humans
dc.subject.mesh	Facial Expression
dc.subject.mesh	Speech
dc.subject.mesh	Video Recording
dc.subject.mesh	Head Movements
dc.subject.mesh	Algorithms
dc.subject.mesh	Head
dc.subject.mesh	Imaging, Three-Dimensional
dc.subject.mesh	Face
dc.subject.mesh	Image Processing, Computer-Assisted
dc.title	StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads.
dc.type	Journal Article
utslib.citation.volume	46
utslib.location.activity	United States
utslib.for	0801 Artificial Intelligence and Image Processing
utslib.for	0806 Information Systems
utslib.for	0906 Electrical and Electronic Engineering
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	closed_access	*
dc.date.updated	2024-08-21T05:39:37Z
pubs.issue	6
pubs.publication-status	Published
pubs.volume	46
utslib.citation.issue	6

Abstract:

Individuals have unique facial expression and head pose styles that reflect their personalized speaking styles. Existing one-shot talking head methods cannot capture such personalized characteristics and therefore fail to produce diverse speaking styles in the final videos. To address this challenge, we propose a one-shot style-controllable talking face generation method that can obtain speaking styles from reference speaking videos and drive the one-shot portrait to speak with the reference speaking styles and another piece of audio. Our method aims to synthesize the style-controllable coefficients of a 3D Morphable Model (3DMM), including facial expressions and head movements, in a unified framework. Specifically, the proposed framework first leverages a style encoder to extract the desired speaking styles from the reference videos and transform them into style codes. Then, the framework uses a style-aware decoder to synthesize the coefficients of 3DMM from the audio input and style codes. During decoding, our framework adopts a two-branch architecture, which generates the stylized facial expression coefficients and stylized head movement coefficients, respectively. After obtaining the coefficients of 3DMM, an image renderer renders the expression coefficients into a specific person's talking-head video. Extensive experiments demonstrate that our method generates visually authentic talking head videos with diverse speaking styles from only one portrait image and an audio clip.
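
The data flow described in the abstract (reference video to style code, then audio plus style code to two sets of 3DMM coefficients) can be sketched roughly as follows. This is only a shape-level illustration, not the authors' implementation: the dimensions, the mean pooling in the encoder, the random linear maps standing in for trained networks, and every function name are assumptions made here. The image renderer that turns coefficients into video frames is left out.

```python
import random

# Illustrative sizes only; none of these values come from the paper.
AUDIO_DIM = 8   # per-frame audio feature size (assumed)
STYLE_DIM = 4   # style code size (assumed)
EXP_DIM = 10    # 3DMM expression coefficients per frame (assumed)
POSE_DIM = 6    # head movement coefficients per frame (assumed)


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a trained network (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply_linear(weights, vec):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def style_encoder(reference_frames, weights):
    """Pool per-frame reference coefficients over time, then project the
    pooled vector to a style code. Mean pooling is an assumption."""
    count = len(reference_frames)
    width = len(reference_frames[0])
    pooled = [sum(frame[i] for frame in reference_frames) / count
              for i in range(width)]
    return apply_linear(weights, pooled)


def style_aware_decoder(audio_features, style_code, exp_weights, pose_weights):
    """Two branches read the same per-frame audio feature joined with the
    style code: one emits expression coefficients, the other head movement."""
    expressions, poses = [], []
    for feature in audio_features:
        joint = feature + style_code  # list concatenation
        expressions.append(apply_linear(exp_weights, joint))
        poses.append(apply_linear(pose_weights, joint))
    return expressions, poses


def synthesize_coefficients(num_audio_frames, num_ref_frames, seed=0):
    """Run encoder then decoder on synthetic inputs (hypothetical helper)."""
    rng = random.Random(seed)
    ref_width = EXP_DIM + POSE_DIM
    reference = [[rng.gauss(0, 1) for _ in range(ref_width)]
                 for _ in range(num_ref_frames)]
    audio = [[rng.gauss(0, 1) for _ in range(AUDIO_DIM)]
             for _ in range(num_audio_frames)]
    enc_w = make_linear(ref_width, STYLE_DIM, rng)
    exp_w = make_linear(AUDIO_DIM + STYLE_DIM, EXP_DIM, rng)
    pose_w = make_linear(AUDIO_DIM + STYLE_DIM, POSE_DIM, rng)
    style_code = style_encoder(reference, enc_w)
    return style_aware_decoder(audio, style_code, exp_w, pose_w)
```

Calling `synthesize_coefficients(num_audio_frames=5, num_ref_frames=3)` returns one expression vector and one head movement vector per audio frame. The number of reference frames affects only the style code, which mirrors the separation between the style source and the driving audio that the abstract describes.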

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/180500