EGGen: Image Generation with Multi-entity Prior Learning through Entity Guidance
- Publisher:
- Association for Computing Machinery (ACM)
- Publication Type:
- Conference Proceeding
- Citation:
- MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6637-6645
- Issue Date:
- 2024-10-28
Open Access
This item is open access.
Diffusion models have shown remarkable prowess in text-to-image synthesis and editing, yet they often stumble when interpreting complex prompts that describe multiple entities with specific attributes and interrelations. The generated images often exhibit inconsistent multi-entity representation (IMR): inaccurate depictions of the multiple entities and their attributes. Although spatial layout guidance improves multi-entity generation quality in existing works, handling attribute leakage and avoiding unnatural characteristics remain challenging. To address the IMR challenge, we first conduct in-depth analyses of the diffusion process and attention operation, revealing that IMR largely stems from the cross-attention mechanism. Guided by these analyses, we introduce an entity guidance generation mechanism that preserves the integrity of the original diffusion model parameters by integrating plug-in networks. Our work advances the Stable Diffusion model by segmenting comprehensive prompts into distinct entity-specific prompts with bounding boxes, enabling a transition from multi-entity to single-entity generation in the cross-attention layers. More importantly, we introduce entity-centric cross-attention layers that focus on individual entities to preserve their uniqueness and accuracy, alongside global entity alignment layers that refine cross-attention maps using multi-entity priors for precise positioning and attribute accuracy. Additionally, a linear attenuation module progressively reduces the influence of these layers during inference, preventing oversaturation and preserving generation fidelity. Comprehensive experiments demonstrate that entity guidance generation enhances existing text-to-image models in generating detailed, accurate multi-entity images.
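To make the two core ideas of the abstract concrete, the sketch below illustrates (a) cross-attention in which each entity-specific prompt only influences spatial positions inside its bounding box, and (b) a linear attenuation weight that decays the plug-in layers' influence over inference steps. All function names, the masking scheme, and the schedule are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def linear_attenuation(step, total_steps, w0=1.0):
    # Hypothetical schedule: the plug-in layers' influence decays
    # linearly from w0 at step 0 to 0 at the final inference step.
    return w0 * (1.0 - step / total_steps)

def entity_masked_attention(q, k, v, bbox, h, w):
    # q: (h*w, d) spatial queries from the latent feature map.
    # k, v: (t, d) key/value tokens of ONE entity-specific prompt.
    # bbox = (y0, x0, y1, x1): only positions inside the box keep
    # this entity's attention output; everything else is zeroed,
    # which is one simple way to curb attribute leakage.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (h*w, t)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)         # softmax over tokens
    out = attn @ v                                   # (h*w, d)
    mask = np.zeros((h, w), dtype=bool)
    y0, x0, y1, x1 = bbox
    mask[y0:y1, x0:x1] = True
    return out * mask.reshape(-1, 1)                 # zero outside the box
```

In a full pipeline, one such masked attention pass would run per entity, the per-entity outputs would be summed with the base model's global cross-attention output, and `linear_attenuation` would scale the entity terms down as denoising progresses.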