Video representation learning with deep neural networks

Publication Type: Thesis
Issue Date: 2019
Despite the recent success of neural networks in image feature learning, a major problem in the video domain is the lack of sufficient labeled data for learning to model temporal information. One way to learn a video representation from untrimmed videos is unsupervised temporal modeling: given a clip sampled from a video, its past and future neighboring clips are used as temporal context, and the two temporal transitions, i.e., the present→past transition and the present→future transition, are reconstructed, reflecting the temporal information from different views. In this thesis, the two transitions are exploited simultaneously through a bidirectional reconstruction that consists of a backward reconstruction and a forward reconstruction. To adapt an existing model to recognize a new category unseen during training, it may be necessary to manually collect hundreds of new training samples, a procedure that is tedious and labor-intensive, especially when there are many new categories. This thesis therefore also proposes a classification model that learns from a few examples in a lifelong manner. To evaluate the effectiveness of the learned representations, extensive experiments are conducted on multimedia event detection, image classification, video captioning, and video question answering.
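The abstract's bidirectional reconstruction idea can be illustrated with a minimal sketch: a shared encoder maps the present clip's feature to a latent code, and two decoders reconstruct the past and future neighboring clips, giving a backward and a forward reconstruction loss. The module names, feature dimensions, MLP encoder/decoders, and MSE loss below are illustrative assumptions, not the thesis's actual architecture.

```python
# Minimal sketch of bidirectional reconstruction for unsupervised temporal
# modeling, assuming clip-level features have already been extracted
# (e.g., by a pretrained CNN). All dimensions and layer choices are
# hypothetical, not the model described in the thesis.
import torch
import torch.nn as nn


class BiReconstruction(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        # Shared encoder for the present clip's feature.
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        # Backward decoder: present -> past transition.
        self.backward_decoder = nn.Linear(hidden_dim, feat_dim)
        # Forward decoder: present -> future transition.
        self.forward_decoder = nn.Linear(hidden_dim, feat_dim)

    def forward(self, present):
        h = self.encoder(present)
        past_hat = self.backward_decoder(h)
        future_hat = self.forward_decoder(h)
        return past_hat, future_hat


def bi_reconstruction_loss(model, past, present, future):
    # Reconstruct both temporal neighbors from the present clip and sum
    # the backward and forward reconstruction errors.
    past_hat, future_hat = model(present)
    criterion = nn.MSELoss()
    return criterion(past_hat, past) + criterion(future_hat, future)


if __name__ == "__main__":
    model = BiReconstruction()
    # Toy batch of clip features: (batch, feat_dim).
    past, present, future = (torch.randn(4, 2048) for _ in range(3))
    loss = bi_reconstruction_loss(model, past, present, future)
    loss.backward()
    print(float(loss))
```

In this sketch the encoder is shared between the two directions, so the latent code must capture temporal context useful for both the present→past and present→future transitions, which is the intuition behind exploiting the two transitions simultaneously.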