Multimodal and Generative Representation Learning

Publication Type: Thesis
Issue Date: 2023
Representation learning is fundamental to most vision and language tasks: the quality of the designed or learned representation of the input text or visual signals largely determines the success or failure of the task at hand. In this thesis, I present novel representation learning methods based on deep neural networks for two tasks: Natural Language Queries (NLQ) and Face Swapping.

1) Natural Language Queries is a multimodal information retrieval task between video and text. Given an egocentric video clip and a text query, the goal of NLQ is to localize the temporal moment in the video where the answer to the query can be found. The central challenge is learning a joint representation from the two modality features, video and text. To address it, we propose a multi-scale cross-modal transformer together with a video frame-level contrastive loss that fully exploits the correlation between video and text.

2) Face Swapping is a generative task. Given a target image and a source image, Face Swapping aims to replace the identity of the target image with that of the source image, while preserving the target's other attributes (e.g., background and expression). The primary challenge is modulating the identity representation, which we address with an ID modulation block. Furthermore, we exploit the rich and diverse representation priors learned by a pre-trained face GAN to produce high-fidelity, high-quality face-swapped images.

Experiments show that the proposed representation learning methods bring significant improvements on both tasks.