Advancing Image Inpainting: From GANs to Diffusion Models

Publication Type: Thesis
Issue Date: 2025
Image inpainting, the task of reconstructing corrupted regions of an image so that the result is visually coherent, is pivotal in computer vision and enables applications such as photo restoration, object removal, and creative design. Despite significant advances, existing methods struggle with complex real-world corruptions that involve diverse semantic content, intricate structural details, and a need for precise spatial control. This thesis addresses three challenges in image inpainting, spanning generative adversarial network (GAN)-based and diffusion-based frameworks, with the aim of restoring semantic plausibility, structural fidelity, and spatial consistency across diverse applications.
The first challenge, in GAN-based inpainting, is restoring multiple corrupted semantic regions, such as facial features with distinct class-level content. The task requires semantic coherence across diverse regions. Existing methods that learn semantics implicitly (e.g., GANs trained on large datasets to infer patterns) struggle to restore such regions effectively, while methods that use explicit guidance (e.g., from pre-trained models or predicted semantic maps) often propagate errors from inaccurate initial predictions. To address this, a dual-task generative framework is proposed that jointly refines semantic segmentation predictions and texture restoration: restored textures are used to iteratively correct segmentation errors, and the refined semantics in turn guide inpainting.
The second challenge, also in GAN-based inpainting, is reconstructing structurally intricate regions, such as repetitive patterns in building facades, where semantic priors alone are insufficient to capture fine-grained spatial details. Existing methods incorporate structural priors (e.g., sketches) through direct or modulated feature fusion, but they do not adapt dynamically and globally to the evolving inpainting features, which leads to inconsistencies or artifacts. An adaptive multi-modal framework, inspired by human drawing processes, is introduced that dynamically integrates structural and semantic priors to ensure both structural accuracy and semantic plausibility throughout the progressive inpainting process.
The third challenge, specific to diffusion-based inpainting, is recovering partially occluded objects with precise posture control. Text-only guidance cannot adequately convey pixel-aligned spatial attributes, and existing visual guidance ignores residual cues from the uncorrupted object regions, which leads to pose mismatches. This thesis addresses the problem with a dual-path visual control module that explicitly models interactions between uncorrupted visual cues and guiding sketches and then integrates the sketches into the diffusion-based inpainting process. This spatial bridge ensures that sketch-controlled generated regions connect seamlessly to existing structures, achieving precise posture control and consistency. Additionally, two novel datasets, CUB-sketch and MSCOCO-sketch, are introduced to benchmark posture-aware inpainting.