Multimodal Learning and Video Analysis with Deep Neural Networks

Publication Type: Thesis
Issue Date: 2021
Multi-modal perception is essential when we humans explore, capture and perceive the real world. As a multi-modal medium, video captures informative content in our daily life. Although deep-learning-based networks have proven successful in understanding visual images, an intelligent system is expected to perceive the world through an overall understanding of multiple modalities (e.g., vision and audio) and to communicate with humans in natural language. This thesis introduces several works on multi-modal perception and video analysis, including audio-visual video understanding, anticipating future actions, and describing unseen visual content in natural language. For detailed analysis of audio-visual events in videos, I present a double attention corresponding network for synchronized audio-visual events and explore heterogeneous clues for asynchronous audio-visual video parsing. For anticipating future actions, I propose to generate intermediate future features and to optimize the generation via contrastive learning over multiple modality sources. For visual captioning, I design a decoupled novel object captioner that generates generalized captioning sentences for unseen objects.