Stochastic Latent Talking Face Generation Towards Emotional Expressions and Head Poses
- Publisher:
- Institute of Electrical and Electronics Engineers (IEEE)
- Publication Type:
- Journal Article
- Citation:
- IEEE Transactions on Circuits and Systems for Video Technology, 2023, vol. PP, no. 99, pp. 1-1
- Issue Date:
- 2023-01-01
Closed Access
Filename | Description | Size
---|---|---
Stochastic Latent Talking Face Generation Towards Emotional Expressions and Head Poses_OPUS.pdf | Accepted version | 20.89 MB
This item is closed access and not available.
Current talking face generation methods have achieved promising lip-synchronization results, but they still struggle to generate talking face videos that exhibit emotional expressions and head poses. Studies in psychology have demonstrated that people may manifest diverse facial animations that follow a time-varying distribution. This presents two stochastic challenges that make generating appropriate emotional expressions and head poses difficult: (1) modelling the time-varying distribution of facial deformations to synthesize the stochastic dynamics of emotional expressions and head poses, and (2) estimating the complex motion distribution from given audio features to capture ambiguous audio-related expressions and head poses. To address these issues, we present a Stochastic Latent talkIng face Generation mOdel (SLIGO), which builds a deep state space model (SSM) for talking face generation. SLIGO captures diverse and stochastic facial dynamics via the latent motion distribution. Additionally, we devise a dynamic variational autoencoder (DVAE) method to optimize the deep SSM. This method decomposes the Evidence Lower BOund (ELBO) of the SSM into three components: a posterior for latent motion encoding, a prior for audio-driven motion prediction, and a likelihood for talking face decoding. Furthermore, we propose a novel mixer continuous normalizing flow (CNF) module to model the complex facial motion prior distribution. Experimental results demonstrate that SLIGO outperforms existing methods and achieves state-of-the-art performance.
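The three-part ELBO decomposition described in the abstract follows the general DVAE pattern; a generic sketch is given below. The notation is assumed for illustration and is not taken from the paper: $x_{1:T}$ denotes the video frames, $a_{1:T}$ the audio features, and $z_{1:T}$ the latent motion sequence.

```latex
% Generic DVAE-style ELBO for a deep state space model (illustrative notation):
\mathcal{L}_{\mathrm{ELBO}}
= \underbrace{\mathbb{E}_{q_\phi(z_{1:T}\mid x_{1:T},\,a_{1:T})}
      \Big[\textstyle\sum_{t=1}^{T} \log p_\theta(x_t \mid z_t)\Big]}_{\text{likelihood: talking face decoding}}
\;-\; \underbrace{\textstyle\sum_{t=1}^{T} \mathbb{E}_{q_\phi}\Big[
      \mathrm{KL}\big(\underbrace{q_\phi(z_t \mid z_{<t},\, x_{1:T})}_{\text{posterior: motion encoding}}
      \,\big\|\, \underbrace{p_\theta(z_t \mid z_{<t},\, a_{1:T})}_{\text{prior: audio-driven prediction}}\big)\Big]}_{\text{KL regularization}}
```

In this reading, the posterior encodes latent motion from observed frames, the audio-conditioned prior predicts motion from speech (where the paper's mixer CNF module would supply a more expressive prior), and the likelihood decodes frames from latents; the exact factorization used by SLIGO may differ.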