Large-scale video analysis and understanding

Xu, Zhongwen

Publication Type: Thesis
Issue Date: 2017

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Xu, Zhongwen
dc.date.accessioned	2019-01-09T00:21:07Z
dc.date.available	2019-01-09T00:21:07Z
dc.date.issued	2017
dc.identifier.uri	http://hdl.handle.net/10453/129351
dc.description	University of Technology Sydney. Faculty of Engineering and Information Technology.	en_AU
dc.description.abstract	Video understanding is a complex task in computer vision: it requires not only recognizing objects, persons, and scenes, but also capturing and remembering changes in visual content over time. Rapid progress in recent years on building blocks such as image classification provides great opportunities for accurate and efficient video understanding. Building on deep convolutional neural networks and recurrent neural networks, various deep learning applications for video understanding have been studied. In this thesis, I present my research on large-scale video analysis and understanding in three major aspects: video representation learning, recognition with limited examples, and vision and language. Representations and features are the most important part of vision tasks, since they are general and can be used for classification, detection, and structured prediction tasks such as vision and language. We begin with video classification from multimodal features, which are hand-crafted features from different streams, i.e., vision and audio. For representation learning, we investigate aggregation methods that generate video representations from frame features, and demonstrate significant improvements over classical pooling methods. In addition, we propose a hierarchical recurrent neural network to learn the hierarchical structure of video. Going beyond supervised learning, we develop a sequence model that learns by reconstructing future and past features from the current sequence, showing that unlabeled videos can help learn good, generalizable video representations. We also explore recognition with limited examples, which addresses the situation in which not enough data can be obtained to train the model. The encouraging results show that it is feasible to obtain good performance with only a few examples of the target class.
Beyond the video classification task, which only outputs labels for a video, we also seek richer interaction between machines and humans over visual content via natural language. We consider two major forms of vision and language tasks: video captioning, i.e., automatically generating a caption that describes a given video sequence, and video question answering, i.e., answering questions about the presented video sequence. Finally, I conclude the thesis with some future directions for video understanding.	en_AU
dc.format	Thesis (PhD)
dc.language.iso	en_AU	en_AU
dc.relation	https://opus.lib.uts.edu.au/bitstream/10453/129351/7/02whole.pdf
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	au.edu.uts.lib/ppc
dc.rights	The author owns the copyright in this thesis including all reproduction and reuse rights for the work. The work may not be altered without the permission of the copyright owner. Attribution is essential when quoting or paraphrasing from this thesis.
dc.subject	Image classification	en_AU
dc.subject	Video understanding	en_AU
dc.subject	Video representation learning	en_AU
dc.subject	Neural network	en_AU
dc.title	Large-scale video analysis and understanding	en_AU
dc.type	Thesis	en_AU
utslib.copyright.status	open_access

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/129351