Hierarchically Supervised Deconvolutional Network for Semantic Video Segmentation

Wang, Y; Liu, J; Li, Y; Fu, J; Xu, M; Lu, H

Hierarchically Supervised Deconvolutional Network for Semantic Video Segmentation

Wang, Y Liu, J Li, Y Fu, J Xu, M

Lu, H

Permalink

Publication Type:: Journal Article
Citation:: Pattern Recognition, 2017, 64 pp. 437 - 445
Issue Date:: 2017-04-01

Closed Access

	Filename	Description	Size
	1-s2.0-S0031320316303077-main.pdf	Published Version	1.47 MB	Adobe PDF	View/Open

Copyright Clearance Process

Recently Added
In Progress
Closed Access

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, Y	en_US
dc.contributor.author	Liu, J	en_US
dc.contributor.author	Li, Y	en_US
dc.contributor.author	Fu, J	en_US
dc.contributor.author	Xu, M https://orcid.org/0000-0001-9581-8849	en_US
dc.contributor.author	Lu, H	en_US
dc.date.issued	2017-04-01	en_US
dc.identifier.citation	Pattern Recognition, 2017, 64 pp. 437 - 445	en_US
dc.identifier.issn	0031-3203	en_US
dc.identifier.uri	http://hdl.handle.net/10453/124963
dc.description.abstract	© 2016 Elsevier Ltd Semantic video segmentation is a challenging task of fine-grained semantic understanding of video data. In this paper, we present a jointly trained deep learning framework to make the best use of spatial and temporal information for semantic video segmentation. Along the spatial dimension, a hierarchically supervised deconvolutional neural network (HDCNN) is proposed to conduct pixel-wise semantic interpretation for single video frames. HDCNN is constructed with convolutional layers in VGG-net and their mirrored deconvolutional structure, where all fully connected layers are removed. And hierarchical classification layers are added to multi-scale deconvolutional features to introduce more contextual information for pixel-wise semantic interpretation. Besides, a coarse-to-fine training strategy is adopted to enhance the performance of foreground object segmentation in videos. Along the temporal dimension, we introduce Transition Layers upon the structure of HDCNN to make the pixel-wise label prediction consist with adjacent pixels across space and time domains. The learning process of the Transition Layers can be implemented as a set of extra convolutional calculations connected with HDCNN. These two parts are jointly trained as a unified deep network in our approach. Thorough evaluations are performed on two challenging video datasets, i.e., CamVid and GATECH. Our approach achieves state-of-the-art performance on both of the two datasets.	en_US
dc.relation.ispartof	Pattern Recognition	en_US
dc.relation.isbasedon	10.1016/j.patcog.2016.09.046	en_US
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.title	Hierarchically Supervised Deconvolutional Network for Semantic Video Segmentation	en_US
dc.type	Journal Article
utslib.citation.volume	64	en_US
utslib.for	0899 Other Information and Computing Sciences	en_US
utslib.for	0906 Electrical and Electronic Engineering	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
utslib.for	0806 Information Systems	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
pubs.organisational-group	/University of Technology Sydney/Strength - GBDTC - Global Big Data Technologies
pubs.organisational-group	/University of Technology Sydney/Strength - INEXT - Innovation in IT Services and Applications
utslib.copyright.status	closed_access	*
pubs.publication-status	Published	en_US
pubs.volume	64	en_US

Abstract:

© 2016 Elsevier Ltd Semantic video segmentation is a challenging task of fine-grained semantic understanding of video data. In this paper, we present a jointly trained deep learning framework to make the best use of spatial and temporal information for semantic video segmentation. Along the spatial dimension, a hierarchically supervised deconvolutional neural network (HDCNN) is proposed to conduct pixel-wise semantic interpretation for single video frames. HDCNN is constructed with convolutional layers in VGG-net and their mirrored deconvolutional structure, where all fully connected layers are removed. And hierarchical classification layers are added to multi-scale deconvolutional features to introduce more contextual information for pixel-wise semantic interpretation. Besides, a coarse-to-fine training strategy is adopted to enhance the performance of foreground object segmentation in videos. Along the temporal dimension, we introduce Transition Layers upon the structure of HDCNN to make the pixel-wise label prediction consist with adjacent pixels across space and time domains. The learning process of the Transition Layers can be implemented as a set of extra convolutional calculations connected with HDCNN. These two parts are jointly trained as a unified deep network in our approach. Thorough evaluations are performed on two challenging video datasets, i.e., CamVid and GATECH. Our approach achieves state-of-the-art performance on both of the two datasets.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/124963