A Study on the Detection of Generative Visual Data
- Publication Type: Thesis
- Issue Date: 2024
Open Access
This item is open access.
The development of deep learning enables generative models such as Generative Adversarial Networks (GANs) and Diffusion Models (DMs) to produce realistic visual data such as images and videos. Generators that produce visual data containing human faces are commonly known as Deepfake techniques. Deepfake human faces are more deceptive than general fake scenes because they carry sensitive biological information and can easily mislead the public. Detecting Deepfake human faces has therefore attracted much research attention recently. While existing detectors leverage artifacts such as biometrics, texture distortion, and blending edges and have achieved promising performance, detecting more challenging Deepfakes, such as multimodal and high-realism Deepfakes, remains underexplored because such forgeries usually require more sophisticated feature extraction and feature modeling. In this thesis, we first comprehensively survey the generation and detection of GAN-generated and DM-generated Deepfakes. We then propose four detectors targeting multimodal and high-realism Deepfakes. Firstly, we exploit temporal identity inconsistency to distinguish Deepfake videos from real videos. Based on the observation that identity similarities among fake video frames are significantly lower than those in real videos, we extract per-frame identity vectors and compute their frame-to-frame differences as temporal inconsistency features. These features are then processed by a Gated Recurrent Unit (GRU) to model temporal dependencies for classification. Secondly, we propose a novel framework that adopts motion information to detect Deepfake videos. By disentangling and discarding the identity-related components, we feed the more generalizable identity-unrelated embeddings to a Swin Transformer and achieve state-of-the-art generalization ability. Thirdly, we propose to identify the authenticity of multimodal inputs. For audio-video inputs, we analyze the intra-modal inconsistencies within each modality and the cross-modal inconsistencies between them. The intra-modal and cross-modal inconsistencies are fused to predict the binary label of the overall input. In addition, we achieve multi-grained forgery localization using auxiliary unimodal classification and spatiotemporal attention weights. Lastly, we propose a detector for DM-generated images. Based on a comprehensive analysis of the DM generation pipeline and the generated images, we observe that DM-generated images are more robust to minor additive Gaussian noise than real images. Therefore, we collect the high-frequency responses to Gaussian noise and feed them to a classifier to recognize DM-generated Deepfakes. Overall, this thesis provides insight into Deepfake detection and offers several valuable solutions.
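To make the first detector's pipeline concrete, the sketch below shows how per-frame identity differences could be fed to a GRU for classification. It assumes a separate face-recognition encoder has already mapped each frame to an identity vector; the class name, dimensions, and two-way output head are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class TemporalIdentityDetector(nn.Module):
    """Illustrative sketch: classify a video from its identity-vector sequence."""

    def __init__(self, id_dim=512, hidden_dim=128):
        super().__init__()
        # GRU models the temporal dependencies of the inconsistency features.
        self.gru = nn.GRU(id_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)  # real vs. fake logits

    def forward(self, id_vectors):
        # id_vectors: (batch, num_frames, id_dim) identity embeddings per frame,
        # assumed to come from a pretrained face-recognition encoder.
        # Frame-to-frame differences serve as temporal inconsistency features;
        # fake videos tend to show larger identity drift between frames.
        diffs = id_vectors[:, 1:] - id_vectors[:, :-1]
        _, last_hidden = self.gru(diffs)
        return self.classifier(last_hidden.squeeze(0))
```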
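For the multimodal detector, the following is a strongly simplified sketch of how intra-modal and cross-modal inconsistencies could be computed from temporally aligned audio and video segment embeddings (assumed to be projected into a shared space) and fused into a clip-level decision. The cosine-based measures and the small fusion network are illustrative assumptions, and the multi-grained localization branch is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Illustrative fusion of intra- and cross-modal inconsistency cues."""

    def __init__(self, feat_dim=3, hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))  # real vs. fake logits

    def forward(self, audio_emb, video_emb):
        # audio_emb, video_emb: (num_segments, dim), temporally aligned and
        # assumed to share one embedding space.
        # Intra-modal inconsistency: similarity drop between adjacent segments.
        intra_a = 1 - F.cosine_similarity(audio_emb[1:], audio_emb[:-1]).mean()
        intra_v = 1 - F.cosine_similarity(video_emb[1:], video_emb[:-1]).mean()
        # Cross-modal inconsistency: per-segment audio-video mismatch.
        cross = 1 - F.cosine_similarity(audio_emb, video_emb).mean()
        feats = torch.stack([intra_a, intra_v, cross]).unsqueeze(0)
        return self.mlp(feats)
```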
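For the DM-image detector, the abstract only states that high-frequency responses to minor additive Gaussian noise are collected, so the sketch below is one plausible, heavily hedged instantiation: the Gaussian-blur perturbation and Laplacian high-pass are illustrative stand-ins for whatever response operator the thesis actually uses, and the noise and blur scales are arbitrary.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def noise_response_feature(image, sigma_noise=0.02, sigma_blur=1.0, rng=None):
    """Illustrative high-frequency response of `image` to minor Gaussian noise.

    image: grayscale float array in [0, 1], shape (H, W).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Minor additive Gaussian perturbation, clipped back to the valid range.
    noisy = np.clip(image + rng.normal(0.0, sigma_noise, image.shape), 0.0, 1.0)
    # High-frequency part of how the image changes under perturb-then-smooth;
    # DM-generated images were observed to respond more weakly than real photos.
    response = laplace(gaussian_filter(noisy, sigma_blur) - image)
    return np.abs(response)  # 2-D map to be fed to a downstream classifier
```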