Context-aware person image generation

Publication Type: Thesis
Issue Date: 2025
Human image generation is an intriguing yet fundamentally challenging problem in computer vision. The ability to synthesize high-quality, semantically meaningful novel instances of a person has potential use cases across multiple domains, spanning academic research and enterprise applications. Visual realism and contextual coherence in such generative frameworks can directly benefit synthetic data generation, scene understanding, creative software, digital media, retail advertising, animation, and augmented/virtual reality (AR/VR) products. Although the premise is intellectually and commercially appealing, generating realistic novel human instances remains a significant challenge. Moreover, imposing semantic constraints on the generative process to achieve contextually coherent visual results introduces additional complexity. In recent years, both unconditional and conditional generative algorithms have achieved a remarkable uplift in photorealism by adopting Generative Adversarial Networks (GANs) and, later, Diffusion Models (DMs). However, most existing approaches focus on synthesizing an instance of a single object class, whereas a real-world scene generally contains multiple object classes with diverse inter-object contextual relationships. Conditioning the generation process on the scene context therefore becomes essential for semantically meaningful visual synthesis.

This thesis explores two foundational aspects of context-aware person image generation. The first phase investigates visually realistic image generation of an isolated human instance from a local input context, such as geometric structure (pose) or a textual description. The second phase introduces global semantic constraints into the generative process and learns to blend a human instance into a complex scene while adapting to a contextually valid scene-human interaction.
The high degree of appearance diversity and pose variation in human images contributes to the key challenges of the problem: generating semantically consistent novel views of a highly deformable object (the human body) from a single observation. The proposed research addresses these challenges by introducing generative strategies that achieve state-of-the-art performance on multiple visual and analytical benchmarks for context-aware person image generation.