Generalizable Visual Understanding with Deep Neural Networks

Publication Type: Thesis
Issue Date: 2023
Deep neural networks (DNNs) have transformed computer vision, advancing object recognition, scene understanding, and image synthesis. However, their ability to generalize to unseen distributions and novel categories remains limited, restricting their usefulness in real-world applications. This thesis addresses that limitation by developing methodologies that equip vision models with strong generalizability in diverse and changing environments. It examines two complementary aspects of generalizability in computer vision: generalizing to novel structures and generalizing to novel categories.

In addressing generalization to novel structures, the research extracts generalizable structural representations from diverse visual scenarios: 2D rigid scenes, 3D rigid scenes, and non-rigid structures. The study identifies the main obstacle to generalization in each setting: discrepancies in layout distribution for 2D scenes, dropout noise that disrupts 3D scene geometry, and variations in inter-joint relationships within non-rigid structures. To overcome these obstacles, three methodologies are developed: layout-matching techniques that bridge layout distribution gaps, an adversarial masking paradigm that improves robustness to disruptive geometric noise, and a "decompose to generalize" paradigm that reinforces commonalities in inter-joint relationships across different species, thereby promoting generalization.

Regarding generalizability to novel categories, the thesis is structured around two fundamental questions: (1) how to distinguish novel categories from known ones, and (2) how to classify each newly encountered category effectively. The first is framed as the "category shift" problem, in which only some categories are shared between two correlated domains or datasets; a clustering algorithm is proposed that separates the known from the unknown using cross-domain consensus knowledge. The second is addressed by leveraging cross-modality knowledge from Vision-Language Models (VLMs), distinguishing known from novel categories through discriminative mappings in the latent text space; to this end, the thesis introduces a "decouple to contrast" methodology that alleviates ambiguities between the visual and text latent spaces in a decoupled manner.

In conclusion, this thesis contributes to the advancement of generalizable visual understanding by proposing novel approaches tailored to deep neural networks. The developed techniques enhance a network's ability to learn robust, transferable representations, enabling better generalization across diverse visual domains. These findings have implications for real-world applications including autonomous systems, robotics, and other computer-vision-based technologies.
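To make the adversarial-masking idea concrete, the following is a minimal sketch in PyTorch of one plausible instantiation: points whose removal would most increase the loss are dropped during training, simulating the structured dropout noise the model must withstand. All names here (PointEncoder, mask_ratio, the gradient-based saliency) are illustrative assumptions, not the thesis's actual architecture or algorithm.

import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """Toy per-point encoder + classifier; stands in for a real point-cloud network."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64))
        self.head = nn.Linear(64, num_classes)

    def forward(self, pts):                         # pts: (B, N, 3)
        feats = self.mlp(pts)                       # (B, N, 64)
        return self.head(feats.max(dim=1).values)   # global max-pool -> logits

def adversarial_mask(model, pts, labels, mask_ratio=0.1):
    """Drop the points whose gradient saliency indicates the largest loss increase."""
    pts = pts.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(pts), labels)
    grad = torch.autograd.grad(loss, pts)[0]        # (B, N, 3)
    saliency = grad.norm(dim=-1)                    # per-point influence, (B, N)
    n_keep = int(pts.shape[1] * (1 - mask_ratio))
    keep = saliency.argsort(dim=1)[:, :n_keep]      # keep the least salient points
    return torch.gather(pts.detach(), 1, keep.unsqueeze(-1).expand(-1, -1, 3))

# Usage: train on adversarially masked clouds to harden against dropout noise.
model = PointEncoder()
pts, labels = torch.randn(4, 1024, 3), torch.randint(0, 10, (4,))
masked = adversarial_mask(model, pts, labels)
loss = nn.functional.cross_entropy(model(masked), labels)
loss.backward()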
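Similarly, to illustrate the general idea of classifying categories via a VLM's latent text space (the starting point that "decouple to contrast" refines, not the method itself), here is a minimal zero-shot sketch with a CLIP-style model from the transformers library; the checkpoint name, prompt template, and category lists are assumptions for the example.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

known = ["cat", "dog"]               # categories seen during training
novel = ["axolotl", "pangolin"]      # categories encountered only at test time
prompts = [f"a photo of a {c}" for c in known + novel]

image = Image.new("RGB", (224, 224))  # placeholder; use a real image in practice
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# Cosine-similarity logits between the image and each text embedding; a
# discriminative mapping in text space would further separate known from novel.
probs = out.logits_per_image.softmax(dim=-1)
for name, p in zip(known + novel, probs[0].tolist()):
    print(f"{name}: {p:.3f}")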