Advancing Image Inpainting: From GANs to Diffusion Models

Zhang, Yongle

Publication Type: Thesis
Issue Date: 2025

This item is open access.

Download thesis: Adobe PDF (54.08 MB)

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, Yongle
dc.date.accessioned	2026-05-27T00:31:35Z
dc.date.available	2026-05-27T00:31:35Z
dc.date.issued	2025
dc.identifier.uri	http://hdl.handle.net/10453/195151
dc.description	University of Technology Sydney. Faculty of Engineering and Information Technology.	en_US.UTF-8
dc.description.abstract	See the Abstract section below.	en_US.UTF-8
dc.format	Thesis (PhD)
dc.language.iso	en_US	en_US.UTF-8
dc.relation	https://opus.lib.uts.edu.au/bitstream/10453/195151/1/thesis.pdf
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	The author owns the copyright in this thesis including all reproduction and reuse rights for the work. The work may not be altered without the permission of the copyright owner. Attribution is essential when quoting or paraphrasing from this thesis.
dc.rights	© 2025 Yongle Zhang
dc.rights	au.edu.uts.lib/cph
dc.title	Advancing Image Inpainting: From GANs to Diffusion Models	en_US.UTF-8
dc.type	Thesis
utslib.copyright.status	open_access	*

Abstract:

Image inpainting, the task of reconstructing corrupted regions in images to achieve visually coherent results, is pivotal in computer vision, enabling applications such as photo restoration, object removal, and creative design. Despite significant advances, existing methods struggle with complex real-world corruptions involving diverse semantic content, intricate structural details, and precise spatial control. This thesis addresses three critical challenges in image inpainting, spanning generative adversarial network (GAN)-based and diffusion-based frameworks, to advance the field toward solutions that restore semantic plausibility, structural fidelity, and spatial consistency, thereby meeting the demands of diverse applications.

The first challenge, in GAN-style inpainting, involves restoring multiple corrupted semantic regions, such as facial features with distinct class-level content. The difficulty lies in maintaining semantic coherence across diverse regions. Existing methods that rely on implicit learning of semantics (e.g., GANs trained on large datasets to infer patterns) struggle to restore such regions effectively, while methods that use explicit guidance (e.g., from pre-trained models or predicted semantic maps) often propagate errors from inaccurate initial predictions. To address this, a dual-task generative framework is proposed that jointly refines semantic segmentation predictions and texture restoration, iteratively correcting segmentation errors using restored textures and guiding inpainting with refined semantics.

The second challenge, also in GAN-style inpainting, focuses on reconstructing structurally intricate regions, such as repetitive patterns in building facades, where semantic priors alone are insufficient for capturing fine-grained spatial details. Existing methods incorporate structural priors (e.g., sketches) through direct or modulated feature fusion, but they lack dynamic and global adaptation to evolving inpainting features, leading to inconsistencies or artifacts. An adaptive multi-modal framework, inspired by human drawing processes, is introduced that dynamically integrates structural and semantic priors to ensure both structural accuracy and semantic plausibility throughout the progressive inpainting process.

The third challenge, specific to diffusion-style inpainting, concerns recovering partially occluded objects with precise posture control. Text-only guidance is limited in conveying pixel-aligned spatial attributes, and existing visual guides ignore residual cues from uncorrupted object regions, leading to pose mismatches. This thesis addresses the problem with a dual-path visual control module that explicitly models interactions between uncorrupted visual cues and guidance sketches, then integrates the sketches into the diffusion-based inpainting process. This spatial bridge ensures that sketch-controlled generated regions connect seamlessly to existing structures, achieving precise posture control and consistency. Additionally, two novel datasets, CUB-sketch and MSCOCO-sketch, are introduced to benchmark posture-aware inpainting.
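
The abstract frames inpainting as reconstructing corrupted regions while staying consistent with the uncorrupted ones. As a minimal sketch of that problem setup, the snippet below builds a masked input and composites a generated result back over the known pixels. This follows a common convention in the inpainting literature and is not code from the thesis: the function names `corrupt`, `composite`, and `mean_fill`, the mask convention (1 = missing), and the mean-colour stand-in for a real GAN or diffusion generator are all illustrative assumptions.

```python
import numpy as np


def corrupt(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out the pixels marked missing (assumed convention: mask == 1)."""
    return image * (1.0 - mask[..., None])


def composite(known: np.ndarray, generated: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep uncorrupted pixels from `known`; take missing pixels from `generated`."""
    m = mask[..., None]
    return m * generated + (1.0 - m) * known


def mean_fill(corrupted: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a generator: paints the mean known colour everywhere."""
    known_pixels = corrupted[mask == 0]            # shape (num_known, channels)
    return np.broadcast_to(known_pixels.mean(axis=0), corrupted.shape)


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                      # toy H x W x C image in [0, 1)
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0                               # square hole in the middle

corrupted = corrupt(image, mask)
result = composite(corrupted, mean_fill(corrupted, mask), mask)
```

Compositing with the mask guarantees that uncorrupted pixels pass through unchanged, so any remaining inconsistency is confined to the boundary and interior of the hole. That is where the semantic, structural, and posture priors described in the abstract would come into play.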

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/195151