Collaborative Dual-Stream Modeling for Video Understanding

Most existing video recognition systems either classify the input video into coarse-grained labels with single-stream architectures or combine multi-modal predictions by simple late fusion. However, real-world video applications usually require understanding complex human-object interactions and fine-grained content, which demands that a video analysis system be capable of meticulous reasoning. Moreover, the pressing need for multi-modal alignment and for communication among different models calls for multi-stream video modeling, which is beyond the capacity of single-stream architectures. In this thesis, I argue that video understanding should be tackled with collaborative dual-stream modeling in several challenging scenarios. Interaction between the different sources of information in a video encourages the video understanding system to exploit spatio-temporal relations. The idea is applied to three tasks. First, for egocentric action recognition, a symbiotic attention mechanism and an interactive prototype learning scheme are developed to explore the relationship between the motion stream and the appearance stream. Second, for text-video retrieval, the T2VLAD framework is designed to align the text stream and the video stream. Third, for efficient video recognition, a parallel sampling network enables communication between a lightweight model and a heavyweight model so that more salient frames are sampled. Extensive experiments on popular video datasets demonstrate the effectiveness of the proposed approaches.