Towards structured visual perception

Publication Type:
Thesis
Issue Date:
2025
Human visual perception, the foundation of our understanding of the world, is characterized by its ability to interpret scenes as structured, coherent wholes rather than mere collections of isolated objects. Although deep learning has driven significant progress in computer vision, current visual perception models still fall short of this holistic comprehension. This thesis argues that attaining human-like visual intelligence requires a fundamental shift towards structured visual perception, and presents a body of research developing computational methods that can explicitly model, learn, and reason with visual structures. The dissertation advances this vision through three interconnected and progressively deepening research thrusts. First, I model dynamic visual structures by leveraging temporal correspondences to capture the evolution of scenes and objects over time. Second, the focus extends to spatial relational structures, developing approaches that uncover the rich connections between objects and their components to build structured representations of scenes. Finally, I investigate general principles for structured perception through the integration of symbolic knowledge, using commonsense or domain-specific constraints to guide both the learning and inference processes of deep models. Collectively, this thesis outlines a comprehensive roadmap towards equipping machines with visual intelligence that more closely emulates the structured, holistic nature of human visual perception.