Stochastic Latent Talking Face Generation Towards Emotional Expressions and Head Poses
- Publisher:
- Institute of Electrical and Electronics Engineers (IEEE)
- Publication Type:
- Journal Article
- Citation:
- IEEE Transactions on Circuits and Systems for Video Technology, 2023, vol. PP, no. 99, pp. 1-1
- Issue Date:
- 2023-01-01
Closed Access
Filename | Description | Size
---|---|---
Stochastic Latent Talking Face Generation Towards Emotional Expressions and Head Poses_OPUS.pdf | Accepted version | 20.89 MB
This item is closed access and not available.
Current talking face generation methods have achieved promising lip-synchronization results, but they still struggle to generate talking face videos that exhibit emotional expressions and head poses. Studies in psychology have demonstrated that people may manifest diverse facial animations that follow a time-varying distribution. This presents two stochastic challenges that make generating appropriate emotional expressions and head poses difficult: (1) modelling the time-varying distribution of facial deformations to synthesize the stochastic dynamics of emotional expressions and head poses, and (2) estimating the complex motion distribution from given audio features to capture ambiguous audio-related expressions and head poses. To address these issues, we present a Stochastic Latent talkIng face Generation mOdel (SLIGO), which builds a deep state space model (SSM) for talking face generation. SLIGO captures diverse and stochastic facial dynamics via the latent motion distribution. Additionally, we devise a dynamic variational autoencoder (DVAE) method to optimize the deep SSM. This method decomposes the Evidence Lower BOund (ELBO) of the SSM into three components: a posterior for latent motion encoding, a prior for audio-driven motion prediction, and a likelihood for talking face decoding. Furthermore, we propose a novel mixer continuous normalizing flow (CNF) module to model the complex facial motion prior distribution. Experimental results demonstrate that SLIGO outperforms existing methods and achieves state-of-the-art performance.
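The three-part ELBO decomposition described in the abstract follows the general DVAE pattern; a generic sketch is given below. The notation is assumed for illustration and is not taken from the paper: $x_{1:T}$ denotes the video frames, $a_{1:T}$ the audio features, and $z_{1:T}$ the latent motion sequence.

```latex
% Generic DVAE-style ELBO for a deep state space model (illustrative notation):
\mathcal{L}_{\mathrm{ELBO}}
= \underbrace{\mathbb{E}_{q_\phi(z_{1:T}\mid x_{1:T},\,a_{1:T})}
      \Big[\textstyle\sum_{t=1}^{T} \log p_\theta(x_t \mid z_t)\Big]}_{\text{likelihood: talking face decoding}}
\;-\; \underbrace{\textstyle\sum_{t=1}^{T} \mathbb{E}_{q_\phi}\Big[
      \mathrm{KL}\big(\underbrace{q_\phi(z_t \mid z_{<t},\, x_{1:T})}_{\text{posterior: motion encoding}}
      \,\big\|\, \underbrace{p_\theta(z_t \mid z_{<t},\, a_{1:T})}_{\text{prior: audio-driven prediction}}\big)\Big]}_{\text{KL regularization}}
```

In this reading, the posterior encodes latent motion from observed frames, the audio-conditioned prior predicts motion from speech (where the paper's mixer CNF module would supply a more expressive prior), and the likelihood decodes frames from latents; the exact factorization used by SLIGO may differ.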