StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads.
- Publisher: IEEE COMPUTER SOC
- Publication Type: Journal Article
- Citation: IEEE Trans Pattern Anal Mach Intell, 2024, 46, (6), pp. 4331-4347
- Issue Date: 2024-06
Closed Access
Filename | Description | Size
---|---|---
1721988.pdf | Published version | 9.14 MB
This item is closed access and is not available for download.
Full metadata record
Field | Value | Language |
---|---|---|
dc.contributor.author | Wang, S | |
dc.contributor.author | Ma, Y | |
dc.contributor.author | Ding, Y | |
dc.contributor.author | Hu, Z | |
dc.contributor.author | Fan, C | |
dc.contributor.author | Lv, T | |
dc.contributor.author | Deng, Z | |
dc.contributor.author | Yu, X | |
dc.date.accessioned | 2024-08-21T05:39:46Z | |
dc.date.available | 2024-08-21T05:39:46Z | |
dc.date.issued | 2024-06 | |
dc.identifier.citation | IEEE Trans Pattern Anal Mach Intell, 2024, 46, (6), pp. 4331-4347 | |
dc.identifier.issn | 0162-8828 | |
dc.identifier.issn | 1939-3539 | |
dc.identifier.uri | http://hdl.handle.net/10453/180500 | |
dc.description.abstract | Individuals have unique facial expression and head pose styles that reflect their personalized speaking styles. Existing one-shot talking head methods cannot capture such personalized characteristics and therefore fail to produce diverse speaking styles in the final videos. To address this challenge, we propose a one-shot style-controllable talking face generation method that can obtain speaking styles from reference speaking videos and drive the one-shot portrait to speak with the reference speaking styles and another piece of audio. Our method aims to synthesize the style-controllable coefficients of a 3D Morphable Model (3DMM), including facial expressions and head movements, in a unified framework. Specifically, the proposed framework first leverages a style encoder to extract the desired speaking styles from the reference videos and transform them into style codes. Then, the framework uses a style-aware decoder to synthesize the coefficients of 3DMM from the audio input and style codes. During decoding, our framework adopts a two-branch architecture, which generates the stylized facial expression coefficients and stylized head movement coefficients, respectively. After obtaining the coefficients of 3DMM, an image renderer renders the expression coefficients into a specific person's talking-head video. Extensive experiments demonstrate that our method generates visually authentic talking head videos with diverse speaking styles from only one portrait image and an audio clip. | |
dc.format | Print-Electronic | |
dc.language | eng | |
dc.publisher | IEEE COMPUTER SOC | |
dc.relation | http://purl.org/au-research/grants/arc/DP220100800 | |
dc.relation | http://purl.org/au-research/grants/arc/DE230100477 | |
dc.relation.ispartof | IEEE Trans Pattern Anal Mach Intell | |
dc.relation.isbasedon | 10.1109/TPAMI.2024.3357808 | |
dc.rights | info:eu-repo/semantics/closedAccess | |
dc.subject | 0801 Artificial Intelligence and Image Processing, 0806 Information Systems, 0906 Electrical and Electronic Engineering | |
dc.subject.classification | Artificial Intelligence & Image Processing | |
dc.subject.classification | 4603 Computer vision and multimedia computation | |
dc.subject.classification | 4611 Machine learning | |
dc.subject.mesh | Humans | |
dc.subject.mesh | Facial Expression | |
dc.subject.mesh | Speech | |
dc.subject.mesh | Video Recording | |
dc.subject.mesh | Head Movements | |
dc.subject.mesh | Algorithms | |
dc.subject.mesh | Head | |
dc.subject.mesh | Imaging, Three-Dimensional | |
dc.subject.mesh | Face | |
dc.subject.mesh | Image Processing, Computer-Assisted | |
dc.title | StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads. | |
dc.type | Journal Article | |
utslib.citation.volume | 46 | |
utslib.location.activity | United States | |
utslib.for | 0801 Artificial Intelligence and Image Processing | |
utslib.for | 0806 Information Systems | |
utslib.for | 0906 Electrical and Electronic Engineering | |
pubs.organisational-group | University of Technology Sydney | |
pubs.organisational-group | University of Technology Sydney/Faculty of Engineering and Information Technology | |
pubs.organisational-group | University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute | |
utslib.copyright.status | closed_access | * |
dc.date.updated | 2024-08-21T05:39:37Z | |
pubs.issue | 6 | |
pubs.publication-status | Published | |
pubs.volume | 46 | |
utslib.citation.issue | 6 |
Abstract:
Individuals have unique facial expression and head pose styles that reflect their personalized speaking styles. Existing one-shot talking head methods cannot capture such personalized characteristics and therefore fail to produce diverse speaking styles in the final videos. To address this challenge, we propose a one-shot style-controllable talking face generation method that can obtain speaking styles from reference speaking videos and drive the one-shot portrait to speak with the reference speaking styles and another piece of audio. Our method aims to synthesize the style-controllable coefficients of a 3D Morphable Model (3DMM), including facial expressions and head movements, in a unified framework. Specifically, the proposed framework first leverages a style encoder to extract the desired speaking styles from the reference videos and transform them into style codes. Then, the framework uses a style-aware decoder to synthesize the coefficients of 3DMM from the audio input and style codes. During decoding, our framework adopts a two-branch architecture, which generates the stylized facial expression coefficients and stylized head movement coefficients, respectively. After obtaining the coefficients of 3DMM, an image renderer renders the expression coefficients into a specific person's talking-head video. Extensive experiments demonstrate that our method generates visually authentic talking head videos with diverse speaking styles from only one portrait image and an audio clip.
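The abstract describes a pipeline in which a style encoder distills a style code from a reference clip and a two-branch, style-aware decoder maps audio plus that code to stylized 3DMM expression and head-pose coefficients. The following is a minimal, hypothetical sketch of that data flow, assuming PyTorch; the module names (`StyleEncoder`, `StyleAwareBranch`), the GRU-based layers, and all tensor dimensions are illustrative assumptions and do not reproduce the authors' actual architecture, training losses, or image renderer.

```python
# Hypothetical sketch of the pipeline described in the abstract (assumed shapes):
# a reference video is assumed to be pre-processed into a sequence of 3DMM
# expression coefficients (64-dim per frame), and the driving audio into
# per-frame acoustic features (80-dim). This is NOT the paper's implementation.
import torch
import torch.nn as nn


class StyleEncoder(nn.Module):
    """Maps a reference 3DMM coefficient sequence to a single style code."""
    def __init__(self, coeff_dim=64, style_dim=128):
        super().__init__()
        self.gru = nn.GRU(coeff_dim, style_dim, batch_first=True)

    def forward(self, ref_coeffs):              # (B, T_ref, coeff_dim)
        _, h = self.gru(ref_coeffs)              # h: (1, B, style_dim)
        return h.squeeze(0)                      # style code: (B, style_dim)


class StyleAwareBranch(nn.Module):
    """One decoding branch: audio features + style code -> stylized coefficients."""
    def __init__(self, audio_dim=80, style_dim=128, out_dim=64, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(audio_dim + style_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, audio_feats, style_code): # (B, T, audio_dim), (B, style_dim)
        T = audio_feats.size(1)
        style = style_code.unsqueeze(1).expand(-1, T, -1)  # repeat style over time
        x = torch.cat([audio_feats, style], dim=-1)
        h, _ = self.rnn(x)
        return self.head(h)                      # (B, T, out_dim)


class StyleControllableTalkingHead(nn.Module):
    """Two-branch style-aware decoder: expression (64-d) and head-pose (6-d) coefficients."""
    def __init__(self):
        super().__init__()
        self.style_encoder = StyleEncoder()
        self.expression_branch = StyleAwareBranch(out_dim=64)
        self.head_pose_branch = StyleAwareBranch(out_dim=6)

    def forward(self, audio_feats, ref_coeffs):
        style_code = self.style_encoder(ref_coeffs)
        expr = self.expression_branch(audio_feats, style_code)
        pose = self.head_pose_branch(audio_feats, style_code)
        return expr, pose   # passed, with the portrait image, to an image renderer


# Toy usage: one clip, 100 audio frames, 150 reference frames.
model = StyleControllableTalkingHead()
audio = torch.randn(1, 100, 80)
reference = torch.randn(1, 150, 64)
expr, pose = model(audio, reference)
print(expr.shape, pose.shape)   # torch.Size([1, 100, 64]) torch.Size([1, 100, 6])
```

In the method as described, the same style code conditions every decoding step and both branches; the sketch mimics this by concatenating the code to each audio frame, while the final rendering of the coefficients into a talking-head video is left to a separate image renderer, as in the abstract.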
Please use this identifier to cite or link to this item: http://hdl.handle.net/10453/180500