Context-aware person image generation

Publication Type: Thesis
Issue Date: 2025
Human image generation is an intriguing yet fundamentally challenging problem in computer vision. The ability to synthesize high-quality, semantically meaningful novel instances of a person has potential use cases across multiple domains, spanning academic research and enterprise applications. Visual realism and contextual coherence in such generative frameworks can directly benefit synthetic data generation, scene understanding, creative software, digital media, retail advertising, animation, and augmented/virtual reality (AR/VR) products. Although the premise is intellectually and commercially appealing, generating realistic novel human instances remains a significant challenge. Moreover, imposing semantic constraints on the generative process to achieve contextually coherent visual results introduces additional complexity. In recent years, both unconditional and conditional generative algorithms have achieved a remarkable uplift in photorealism by adopting Generative Adversarial Networks (GANs) and, later, Diffusion Models (DMs). However, most existing approaches focus on synthesizing an instance of a single object class, whereas a real-world scene generally contains multiple object classes with diverse inter-object contextual relationships. Conditioning the generation process on the scene context therefore becomes essential for semantically meaningful visual synthesis.

This thesis explores two foundational aspects of context-aware person image generation. The first phase investigates visually realistic image generation of an isolated human instance from a local input context, such as geometric structure (pose) or a textual description. The second phase introduces global semantic constraints into the generative process and learns to blend a human instance into a complex scene while adapting to a contextually valid scene-human interaction.
The high degree of appearance diversity and pose variation in human images contributes to the key challenges of the problem: generating semantically consistent novel views of a highly deformable object (the human body) from a single observation. The proposed research addresses these challenges by introducing generative strategies that achieve state-of-the-art performance on multiple visual and analytical benchmarks for context-aware person image generation.