Generalizable Visual Understanding with Deep Neural Networks

Publication Type: Thesis
Issue Date: 2023
Deep neural networks (DNNs) have transformed computer vision, advancing object recognition, scene understanding, and image synthesis. However, their ability to generalize to unseen distributions and novel categories remains limited, restricting their usefulness in real-world applications. This thesis addresses that limitation by developing methodologies that equip vision models with strong generalizability in diverse and changing environments. It examines two complementary aspects of generalizability in computer vision: generalizing to novel structures and generalizing to novel categories.

In addressing generalization to novel structures, the research extracts generalizable structural representations from diverse visual scenarios: 2D rigid scenes, 3D rigid scenes, and non-rigid structures. The study identifies the main obstacle to generalization in each setting: discrepancies in layout distribution for 2D scenes, dropout noise that disrupts 3D scene geometry, and variations in inter-joint relationships within non-rigid structures. To overcome these obstacles, three methodologies are developed: layout-matching techniques that bridge layout distribution gaps, an adversarial masking paradigm that improves robustness to disruptive geometric noise, and a "decompose to generalize" paradigm that reinforces commonalities in inter-joint relationships across different species, thereby promoting generalization.

Regarding generalizability to novel categories, the thesis is structured around two fundamental questions: (1) how to distinguish novel categories from known ones, and (2) how to classify each newly encountered category effectively. The first is framed as the "category shift" problem, in which only some categories are shared between two correlated domains or datasets; a clustering algorithm is proposed that separates the known from the unknown using cross-domain consensus knowledge. The second is addressed by leveraging cross-modality knowledge from Vision-Language Models (VLMs), distinguishing known from novel categories through discriminative mappings in the latent text space; to this end, the thesis introduces a "decouple to contrast" methodology that alleviates ambiguities between the visual and text latent spaces in a decoupled manner.

In conclusion, this thesis contributes to the advancement of generalizable visual understanding by proposing novel approaches tailored to deep neural networks. The developed techniques enhance a network's ability to learn robust, transferable representations, enabling better generalization across diverse visual domains. These findings have implications for real-world applications including autonomous systems, robotics, and other computer-vision-based technologies.
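To make the adversarial-masking idea concrete, the following is a minimal sketch in PyTorch of one plausible instantiation: points whose removal would most increase the loss are dropped during training, simulating the structured dropout noise the model must withstand. All names here (PointEncoder, mask_ratio, the gradient-based saliency) are illustrative assumptions, not the thesis's actual architecture or algorithm.

import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """Toy per-point encoder + classifier; stands in for a real point-cloud network."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64))
        self.head = nn.Linear(64, num_classes)

    def forward(self, pts):                         # pts: (B, N, 3)
        feats = self.mlp(pts)                       # (B, N, 64)
        return self.head(feats.max(dim=1).values)   # global max-pool -> logits

def adversarial_mask(model, pts, labels, mask_ratio=0.1):
    """Drop the points whose gradient saliency indicates the largest loss increase."""
    pts = pts.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(pts), labels)
    grad = torch.autograd.grad(loss, pts)[0]        # (B, N, 3)
    saliency = grad.norm(dim=-1)                    # per-point influence, (B, N)
    n_keep = int(pts.shape[1] * (1 - mask_ratio))
    keep = saliency.argsort(dim=1)[:, :n_keep]      # keep the least salient points
    return torch.gather(pts.detach(), 1, keep.unsqueeze(-1).expand(-1, -1, 3))

# Usage: train on adversarially masked clouds to harden against dropout noise.
model = PointEncoder()
pts, labels = torch.randn(4, 1024, 3), torch.randint(0, 10, (4,))
masked = adversarial_mask(model, pts, labels)
loss = nn.functional.cross_entropy(model(masked), labels)
loss.backward()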
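Similarly, to illustrate the general idea of classifying categories via a VLM's latent text space (the starting point that "decouple to contrast" refines, not the method itself), here is a minimal zero-shot sketch with a CLIP-style model from the transformers library; the checkpoint name, prompt template, and category lists are assumptions for the example.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

known = ["cat", "dog"]               # categories seen during training
novel = ["axolotl", "pangolin"]      # categories encountered only at test time
prompts = [f"a photo of a {c}" for c in known + novel]

image = Image.new("RGB", (224, 224))  # placeholder; use a real image in practice
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# Cosine-similarity logits between the image and each text embedding; a
# discriminative mapping in text space would further separate known from novel.
probs = out.logits_per_image.softmax(dim=-1)
for name, p in zip(known + novel, probs[0].tolist()):
    print(f"{name}: {p:.3f}")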