Learning for Visual Synthesis and Transformation

Publication Type: Thesis
Issue Date: 2020
Visual synthesis is one of the most fundamental problems in computer vision and artificial intelligence. It aims to create pixel-level data (e.g., images and videos) from descriptions such as text, noise, semantic annotations and images. Recently, deep generative learning has greatly advanced visual synthesis. However, existing generative methods still suffer from several issues, including limited interpretability, controllability, stability, efficiency and performance. In this thesis, several generative models are proposed to address these challenges. The thesis makes the following contributions.

First, this thesis introduces an attention generative model for local image synthesis, improving the controllability and interpretability of the generative model. The key problem in local image synthesis is how to precisely locate the foreground region of an image and generate the target object in the specified region. Object transfiguration, which transforms an object in an image into another object, is an application of local image synthesis. Existing generative methods often fail to decompose the foreground and background. In this thesis, an attention mechanism is incorporated into the generative model so that the object of interest is transformed without altering the background (a minimal sketch of this attention-guided blending is given below). The model decomposes the generative network into two separate networks, each dedicated to one sub-task: detecting the region of interest and transforming the object from one category to another. The attention network predicts spatial attention maps, which are encouraged to be sparse so that the model attends only to the objects of interest, while the transformation network focuses on translating the objects. In addition, a novel perceptual loss is introduced to improve the quality of transformed images in a high-level feature space. Experimental results demonstrate the necessity of investigating attention in image-to-image transformation and the resulting improvement in the quality of the generated images.

Second, this thesis proposes a multi-domain generative model in which multiple styles of images can be generated by a single network. The major challenge is how to efficiently generate multiple styles within one network. The model extracts the content and style features of images and applies multiple style features to a content image. The proposed gated generative model consists of three modules: an encoder, a gated transformer, and a decoder. Different styles are produced by different branches of the gated transformer, while the encoder and decoder capture structural information and share weights across all styles. A discriminative network distinguishes stylized images from genuine ones, and an auxiliary classifier recognizes the style category of a transferred image, thereby helping the generator produce images in multiple styles. In addition, to stabilize the adversarial training process, an auto-encoder reconstruction loss is introduced by combining the encoder and decoder modules. Extensive experiments demonstrate the stability and effectiveness of the proposed model for multi-domain image synthesis.
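The first contribution above hinges on blending a translated image with the input through a predicted attention map. The following is a minimal sketch of that idea, not the thesis implementation: the module names, the sigmoid squashing, and the L1 sparsity weight are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class AttentionGuidedGenerator(nn.Module):
    """Blend a translated image with the input using a predicted attention map."""

    def __init__(self, attention_net: nn.Module, transform_net: nn.Module):
        super().__init__()
        self.attention_net = attention_net  # predicts a 1-channel spatial map
        self.transform_net = transform_net  # translates the full image

    def forward(self, x):
        attn = torch.sigmoid(self.attention_net(x))  # B x 1 x H x W, values in [0, 1]
        translated = self.transform_net(x)           # B x 3 x H x W
        # Only the attended region is replaced; the background is copied from the input.
        out = attn * translated + (1.0 - attn) * x
        return out, attn


def attention_sparsity_loss(attn, weight=1e-3):
    # L1 penalty encouraging the attention map to stay sparse, i.e. to cover
    # only the object of interest rather than the whole image.
    return weight * attn.abs().mean()
```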
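Similarly, the second contribution can be pictured as a shared encoder and decoder with one transformer branch per style domain. The sketch below is a rough illustration under assumed layer sizes and branch structure; the discriminator and auxiliary style classifier used for adversarial training are omitted.

```python
import torch.nn as nn


class GatedGenerator(nn.Module):
    """Shared encoder/decoder with per-style ("gated") transformer branches."""

    def __init__(self, num_styles: int, channels: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=7, padding=3), nn.ReLU(inplace=True))
        # One branch per style domain; only the branch matching the requested
        # style is used in a forward pass, while encoder/decoder weights are shared.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                          nn.ReLU(inplace=True))
            for _ in range(num_styles)])
        self.decoder = nn.Sequential(
            nn.Conv2d(channels, 3, kernel_size=7, padding=3), nn.Tanh())

    def forward(self, x, style_id: int, reconstruct: bool = False):
        h = self.encoder(x)
        if reconstruct:
            # Auto-encoder path (encoder -> decoder only), usable as a
            # reconstruction loss to stabilize adversarial training.
            return self.decoder(h)
        return self.decoder(self.branches[style_id](h))
```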
Third, this thesis investigates the video synthesis problem over a long-term horizon and proposes a temporal generative model for long-term video frame prediction. Existing generative models for video prediction usually cannot produce high-quality predictions over a long time horizon, because they recursively output subsequent frames by taking the newly generated frames as observations, so the prediction error accumulates dramatically. The introduced retrospection process is designed to look back on what has been learned from the past and rectify deficiencies in the predictions. To this end, a retrospection network is built to reconstruct the past frames from the currently predicted frames. In addition, an auxiliary route is built by reversing the flow of time and performing a similar retrospection. The two routes interact with each other to boost the performance of the retrospection network and enhance the understanding of dynamics across frames, especially over the long-term horizon (a minimal sketch of this training objective is given at the end of this record).

Overall, this thesis investigates deep generative models and solves several practical issues in visual synthesis and transformation. For local image synthesis, we propose an attention generative model. We also propose a gated generative model for generating images in multiple domains with a single generative network. For video synthesis, a temporal generative model is proposed to produce long-term video frames by incorporating prediction and retrospection processes into the model. Extensive experimental results on large-scale benchmark datasets demonstrate that the proposed methods in this thesis perform favorably against previous visual synthesis algorithms in terms of efficiency, controllability, and robustness.
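As a closing illustration of the third contribution, the sketch below couples a frame predictor with a retrospection network that reconstructs the past from the predicted future and adds the two errors. It is a simplified reading under assumed tensor layout and MSE losses, with the reversed-time auxiliary route omitted; it is not the thesis code.

```python
import torch
import torch.nn.functional as F


def prediction_with_retrospection_loss(predictor, retrospector, past, future):
    """past, future: tensors of shape (B, T, C, H, W); predictor maps a clip to
    the next frame; retrospector maps predicted frames back to the past clip."""
    predicted = []
    context = past
    for _ in range(future.size(1)):
        next_frame = predictor(context)                        # (B, C, H, W)
        predicted.append(next_frame)
        # Recursive rollout: the newly generated frame becomes an observation,
        # which is why errors accumulate over long horizons.
        context = torch.cat([context[:, 1:], next_frame.unsqueeze(1)], dim=1)
    predicted = torch.stack(predicted, dim=1)                  # (B, T, C, H, W)

    prediction_loss = F.mse_loss(predicted, future)
    # Retrospection: look back from the predictions and penalize what cannot
    # be recovered about the past, rectifying prediction deficiencies.
    retrospection_loss = F.mse_loss(retrospector(predicted), past)
    return prediction_loss + retrospection_loss
```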