Uncovering Limitations in Text-to-Image Generation: A Contrastive Approach with Structured Semantic Alignment
- Publication Type:
- Conference Proceeding
- Citation:
- Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 8876-8888
- Issue Date:
- 2023-01-01
Closed Access
Filename | Description | Size
---|---|---
2023.findings-emnlp.595.pdf | Published version | 7.36 MB
This item is closed access and not available.
Despite significant advances, text-to-image generation models still struggle to produce highly detailed or complex images from textual descriptions. In this work, we propose a Structured Semantic Alignment (SSA) method for evaluating text-to-image generation models. SSA focuses on learning structured semantic embeddings across different modalities and aligning them in a joint space. The method proceeds in four steps: (i) generating mutated prompts by substituting words with semantically equivalent or nonequivalent alternatives while preserving the original syntax; (ii) representing sentence structure through parse trees obtained via syntactic parsing; (iii) learning fine-grained structured embeddings that project semantic features from different modalities into a shared embedding space; (iv) evaluating the semantic consistency between the structured text embeddings and the corresponding visual embeddings. Through experiments on various benchmarks, we demonstrate that SSA offers improved measurement of the semantic consistency of text-to-image generation models. It also uncovers a wide range of generation errors, including under-generation, incorrect constituency, incorrect dependency, and semantic confusion. By exposing these biases and limitations embedded within the models, our method provides valuable insights into their shortcomings in real-world scenarios.
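The mutation and consistency-evaluation steps above can be sketched minimally as follows. This is an illustrative reconstruction, not the paper's implementation: the substitution tables, the `mutate_prompt` helper, and the use of plain cosine similarity as the consistency score are all assumptions for demonstration.

```python
import math

# Hypothetical substitution tables (step i); the paper's actual lexical
# resources and mutation procedure are not specified here.
EQUIVALENT = {"cat": ["kitten"], "sofa": ["couch"]}
NONEQUIVALENT = {"cat": ["dog"], "sofa": ["table"]}

def mutate_prompt(prompt: str, table: dict) -> list:
    """Substitute one word at a time, keeping word order fixed as a
    simple proxy for preserving the original syntax."""
    words = prompt.split()
    mutants = []
    for i, word in enumerate(words):
        for alt in table.get(word.lower(), []):
            mutants.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return mutants

def consistency_score(text_emb: list, image_emb: list) -> float:
    """Semantic consistency between a text embedding and a visual
    embedding (step iv), scored here as cosine similarity."""
    dot = sum(a * b for a, b in zip(text_emb, image_emb))
    norm_t = math.sqrt(sum(a * a for a in text_emb))
    norm_v = math.sqrt(sum(b * b for b in image_emb))
    return dot / (norm_t * norm_v)
```

In this sketch, a low `consistency_score` for an equivalent mutant, or a high score for a nonequivalent one, would flag a potential generation error of the kinds the abstract lists.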