Video representation learning with deep neural networks

Despite the recent success of neural networks in image feature learning, a major problem in the video domain is the lack of sufficient labeled data for learning to model temporal information. One way to learn a video representation from untrimmed videos is unsupervised temporal modeling. Given a clip sampled from a video, its past and future neighboring clips serve as temporal context, and the model reconstructs two temporal transitions, i.e., the present→past transition and the present→future transition, which reflect the temporal information from different views. In this thesis, the two transitions are exploited simultaneously through a bidirectional reconstruction that consists of a backward reconstruction and a forward reconstruction. A second problem is adapting an existing model to recognize a new category that was unseen during training, which may require manually collecting hundreds of new training samples. Such a procedure is tedious and labor intensive, especially when there are many new categories. In this thesis, a classification model is proposed that learns from a few examples in a lifelong manner. To evaluate the effectiveness of the learned representations, extensive experiments are conducted on multimedia event detection, image classification, video captioning, and video question answering.
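The bidirectional reconstruction objective described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the linear decoders, the mean-squared-error loss, the unweighted sum of the two terms, and all names (`linear`, `mse`, `bidirectional_loss`, `sample_triplet`) are assumptions made purely for illustration.

```python
# Illustrative sketch only: decoders, loss, and names are assumptions,
# not the architecture or objective used in the thesis.

def linear(weights, x):
    """Apply a bias-free dense layer: returns the matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def sample_triplet(clip_features, index):
    """Return (past, present, future) features around an interior clip index."""
    return clip_features[index - 1], clip_features[index], clip_features[index + 1]


def bidirectional_loss(past, present, future, w_backward, w_forward):
    """Sum of the present->past and present->future reconstruction errors."""
    backward = mse(linear(w_backward, present), past)   # backward reconstruction
    forward = mse(linear(w_forward, present), future)   # forward reconstruction
    return backward + forward


# Toy usage: three 2-dimensional clip features and identity decoders.
identity = [[1.0, 0.0], [0.0, 1.0]]
clips = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
past, present, future = sample_triplet(clips, 1)
print(bidirectional_loss(past, present, future, identity, identity))  # prints 0.5
```

In this toy case the forward term is zero because the future clip equals the present clip, so the whole loss comes from the backward term. In practice the two decoders would be trained jointly with the clip encoder so that one representation supports both transitions.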