Single Image Depth Estimation with Normal Guided Scale Invariant Deep Convolutional Fields

Publisher:
IEEE (Institute of Electrical and Electronics Engineers)
Publication Type:
Journal Article
Citation:
IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 1, pp. 80-92, Jan. 2019
Issue Date:
2019-01-01
Abstract:
Estimating scene depth from a single image can be widely applied to 3D scene understanding, owing to the easy availability of images captured by consumer-level cameras. Previous works exploit conditional random fields (CRFs) to estimate image depth, where neighboring pixels (superpixels) with similar appearances are constrained to share the same depth. However, depth may vary significantly across a slanted surface, leading to severe estimation errors. To eliminate those errors, we propose a superpixel-based normal guided scale invariant deep convolutional field that encourages neighboring superpixels with similar appearance to lie on the same 3D plane of the scene. To this end, a depth-normal multitask CNN is introduced to produce superpixel-wise depth and surface normal predictions simultaneously. To correct the errors of the roughly estimated superpixel-wise depth, we develop a normal guided scale invariant CRF (NGSI-CRF). NGSI-CRF consists of a scale invariant unary potential, which measures both the relative depth between superpixels and the absolute depth of each superpixel, and a normal guided pairwise potential, which constrains the spatial relationships between superpixels in accordance with the 3D layout of the scene. In other words, the normal guided pairwise potential smooths the depth prediction without deteriorating its 3D structure. The superpixel-wise depth maps estimated by NGSI-CRF are fed into a pixel-wise refinement module to produce a smooth, fine-grained depth prediction. Furthermore, we derive a closed-form solution for the maximum a posteriori (MAP) inference of NGSI-CRF, so the proposed network can be trained efficiently in an end-to-end manner. We conduct experiments on the NYU-D2, KITTI, and Make3D datasets; the results demonstrate that our method achieves superior performance in both indoor and outdoor scenes.
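The abstract does not give the exact form of the two potentials, but the LaTeX sketch below illustrates, under stated assumptions, what a CRF energy with a scale invariant unary term and a normal guided pairwise term could look like. Here d_i is the depth of superpixel i, z_i the CNN depth prediction, n_i the CNN normal prediction, p_i the back-projected 3D position of the superpixel, w_ij an appearance affinity between neighbors, S the set of all superpixels, N the set of neighboring pairs, and lambda a weight trading absolute against relative depth error; all of these symbols and the precise functional forms are illustrative assumptions, not the paper's formulation.

% Hypothetical NGSI-CRF-style energy over superpixels (a sketch, not the paper's exact model).
\begin{align}
E(\mathbf{d}) &= \sum_{i \in \mathcal{S}} \psi_u(d_i)
  \;+\; \sum_{(i,j) \in \mathcal{N}} \psi_p(d_i, d_j), \\
% Scale-invariant unary in log depth: summed over all superpixels this gives
% \sum_i \delta_i^2 - (\lambda / |\mathcal{S}|) (\sum_i \delta_i)^2, penalizing
% absolute log-depth error while partially discounting a global scale shift,
% i.e. it measures relative as well as absolute depth.
\psi_u(d_i) &= \delta_i^2
  - \frac{\lambda}{|\mathcal{S}|}\, \delta_i \sum_{k \in \mathcal{S}} \delta_k,
  \qquad \delta_i = \log d_i - \log z_i, \\
% Normal guided pairwise: the penalty vanishes exactly when the 3D point of
% superpixel j lies on the plane through p_i with normal n_i, so similar-looking
% neighbors are pushed onto a common 3D plane rather than to a common depth.
\psi_p(d_i, d_j) &= w_{ij} \left( \mathbf{n}_i^{\top} (\mathbf{p}_j - \mathbf{p}_i) \right)^2 .
\end{align}

Since p_i is linear in d_i (back-projection through the camera intrinsics), each term is quadratic in its depth variable or its logarithm, the kind of structure that makes a closed-form MAP solution plausible; this is consistent with, though not necessarily identical to, the closed-form inference the abstract claims.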