Large-scale video analysis and understanding

Publication Type:
Issue Date:
Full metadata record
Video understanding is a complex task in computer vision, which requires not only recognizing objects, persons, and scenes, but also capturing and remembering the changes of visual content along time. Rapid development in building blocks like image classification task in recent years provides great opportunities for accurate and efficient video understanding. Based on deep convolutional neural networks and recurrent neural networks, various kinds of deep learning applications on video understanding have been studied. In this thesis, I present my research on large-scale video analysis and understanding in three major aspects: video representation learning, recognition with limited examples, and vision & language. Representation and features are the most important part for vision tasks, since it is very general and can be used for classification task, detection task and also tasks for structural prediction like vision and language. We begin with video classification from multimodal features, which are hand-crafted features from different streams, i.e. vision and audio. For representation learning, we investigate aggregation methods to generate video representation from frame features. Significant improvements over classical pooling methods have been demonstrated. In addition, we propose a hierarchical recurrent neural network to learn the hierarchical structure for video. Going beyond supervised learning, we develop a sequence model to learn from reconstruction of future and past features based on the current sequences, showing that unlabeled videos can help learning good and generalizable video representation. We explore the problem of recognition with limited examples, which tries to tackle the situation that we cannot obtain enough data to train the model. The encouraging results show that it is feasible to obtain good performance with only a few examples for the target class. Except for the video classification task which only outputs labels for the video, we also seek for richer interaction between machine and human on vision content via natural language. We consider two major forms of vision and language tasks, the first is video captioning, i.e., to automatically generate caption to describe the given video sequence, and video question answering, i.e., to answer questions related to the presented video sequence. Finally, I conclude the thesis with some future directions on video understanding.
Please use this identifier to cite or link to this item: