AvatarFusion: Zero-shot Generation of Clothing-Decoupled 3D Avatars Using 2D Diffusion

Publisher:
Association for Computing Machinery (ACM)
Publication Type:
Conference Proceeding
Citation:
MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 5734-5745
Issue Date:
2023-10-26
File:
3581783.3612022.pdf (Published version, Adobe PDF, 42.53 MB)
Abstract:
Large-scale pre-trained vision-language models enable zero-shot, text-based generation of 3D avatars. The previous state-of-the-art method used CLIP to supervise neural implicit models that reconstructed a human body mesh. However, this approach has two limitations. First, the lack of avatar-specific models can cause facial distortion and unrealistic clothing in the generated avatars. Second, CLIP provides an optimization direction only for the overall appearance, resulting in less impressive results. To address these limitations, we propose AvatarFusion, the first framework to use a latent diffusion model to provide pixel-level guidance for generating human-realistic avatars while simultaneously segmenting clothing from the avatar's body. AvatarFusion includes the first clothing-decoupled neural implicit avatar model, which employs a novel Dual Volume Rendering strategy to render the decoupled skin and clothing sub-models in one space. We also introduce a novel optimization method, called Pixel-Semantics Difference-Sampling (PS-DS), which semantically separates the generation of body and clothes and produces a variety of clothing styles. Moreover, we establish the first benchmark for zero-shot text-to-avatar generation. Our experimental results demonstrate that our framework outperforms previous approaches, with significant improvements across all metrics. Additionally, since our model is clothing-decoupled, the clothes of generated avatars can be exchanged. Code is available on our project page: https://hansenhuang0823.github.io/AvatarFusion.
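The abstract does not detail how Dual Volume Rendering composites the decoupled skin and clothing sub-models in one space. Below is a minimal NumPy sketch of one plausible reading, in which the two sub-models' densities are summed along each ray and their colors are blended in proportion to density, as in standard multi-field volume rendering. All function names, tensor shapes, and the blending rule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def composite_two_fields(sigma_skin, rgb_skin, sigma_cloth, rgb_cloth, deltas):
    """Render one ray through two co-located radiance fields (skin + clothing).

    sigma_*: (N,) per-sample densities from each sub-model
    rgb_*:   (N, 3) per-sample colors from each sub-model
    deltas:  (N,) distances between consecutive samples along the ray
    Returns the composited (3,) pixel color.
    """
    # Combined density at each sample: the two sub-models share one space,
    # so their densities simply add along the ray.
    sigma = sigma_skin + sigma_cloth

    # Blend sub-model colors by their relative density contribution
    # (0.5/0.5 where both densities are zero; the epsilon avoids div-by-zero).
    w = np.where(sigma > 0, sigma_skin / np.maximum(sigma, 1e-8), 0.5)[..., None]
    rgb = w * rgb_skin + (1.0 - w) * rgb_cloth

    # Standard volume-rendering quadrature: alpha per sample, then
    # accumulated transmittance, then per-sample compositing weights.
    alpha = 1.0 - np.exp(-sigma * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(axis=0)

# Usage with random stand-in samples for a single 64-sample ray.
rng = np.random.default_rng(0)
N = 64
pixel = composite_two_fields(
    sigma_skin=rng.uniform(0, 5, N),
    rgb_skin=rng.uniform(0, 1, (N, 3)),
    sigma_cloth=rng.uniform(0, 5, N),
    rgb_cloth=rng.uniform(0, 1, (N, 3)),
    deltas=np.full(N, 0.01),
)
print(pixel)  # composited RGB in [0, 1]
```

Because each sub-model keeps its own density and color, either field can be swapped out independently, which is consistent with the clothing-exchange capability the abstract describes.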