Connecting Perception with Cognition for Deep Representations Learning

A long-held goal in the field of computer vision is to enable agents to perceive and understand the visual world. Significant progress has been made over the past decades thanks to underlying advances in data collection and computing infrastructure. Perception is envisaged as an interface that allows systems to interact with and learn from the surrounding environment. However, most tasks still depend on assigning labor-intensive labels (e.g., category, location, and description) to visual data, and development has been limited for tasks where supervision is difficult or impossible to obtain. Humans, by contrast, naturally process information by ``thinking'' through a cognitive process. If we can build generalized models by connecting perception with cognitive inference, tasks such as domain adaptation and cross-modal learning would benefit from rich, well-structured representations. The aim of this thesis is to explore the importance of cognition in representation learning across different visual tasks. In doing so, we target vision-based systems with the following objectives: i) learning consistent and robust embeddings that are resistant to confounding factors is critical for better understanding the visual world; ii) imagination is essential for models to build connections between visual inputs and other modalities, e.g., 3D models for reconstruction and lingual descriptions for visual grounding and reasoning; iii) beyond that, it is important to leverage prior knowledge when learning from and exploring out-of-domain data in the wild, an indispensable ability for models to increase their universality without forgetting pre-acquired knowledge. This thesis elaborates on alternative methods for learning deep representations for visual tasks. The key idea is to build fundamental correspondences as objectives for supervising the learning procedure.
Accordingly, models and techniques are developed to connect visual inputs with latent semantics. In particular, we validate the proposed methodology under different scenarios with multiple modalities, investigating how connecting perception with cognition makes the most of in-domain knowledge and generalizes to out-of-domain data, e.g., in domain adaptation and open-set recognition. We implement these ideas on the relevant benchmarks, and the competitive performance demonstrates their effectiveness and universality.