Exploring temporal consistency for human pose estimation in videos

Publisher:
ELSEVIER SCI LTD
Publication Type:
Journal Article
Citation:
Pattern Recognition, 2020, 103
Issue Date:
2020-07-01
File:
1-s2.0-S0031320320300637-main.pdf (Published version, Adobe PDF, 3.62 MB)
Abstract:
In this paper, we introduce a method that exploits temporal information to estimate human poses in videos. Current state-of-the-art methods that use temporal information fall into two major branches. The first is model-based: the temporal information is captured entirely by a learnable function such as an RNN or a 3D convolution. However, these methods are limited in their ability to explore temporal consistency, which is essential for estimating human joint positions in videos. The second is posterior enhancement, where an independent post-processing step (e.g., based on optical flow) refines the predictions. However, operations such as optical flow estimation are susceptible to occlusion and motion blur, which adversely affect the final performance. We propose a novel Temporal Consistency Exploration (TCE) module to address both shortcomings. Compared to previous approaches, the TCE module is more efficient because it captures temporal consistency at the feature level, without post-processing or extra optical flow computation. Further, to capture the rich spatial context in video data, we design a multi-scale TCE that explores temporal consistency at multiple spatial scales. Finally, we design a video-based pose estimation network built on an encoder-decoder architecture and extended with the multi-scale TCE module. We comprehensively evaluate the proposed model on two video datasets, Sub-JHMDB and Penn, and achieve state-of-the-art performance on both.
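The abstract does not describe the TCE module's internals, only that it models temporal consistency at the feature level (no optical flow) and operates at multiple spatial scales. The following is a minimal, hypothetical PyTorch sketch in that spirit, assuming the module fuses each frame's backbone features with a frame-to-frame feature difference at several spatial scales. All class names, tensor shapes, and the differencing mechanism are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of feature-level temporal-consistency modeling.
# Assumption: fuse current-frame features with the difference to the
# previous frame's features as a cheap, flow-free motion cue, and do
# so at several spatial scales before merging. Not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TCESketch(nn.Module):
    """Fuses frame-t features with the feature difference to frame t-1."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_prev: torch.Tensor, feat_curr: torch.Tensor) -> torch.Tensor:
        diff = feat_curr - feat_prev  # temporal difference (assumed motion cue)
        return self.fuse(torch.cat([feat_curr, diff], dim=1))


class MultiScaleTCESketch(nn.Module):
    """Applies the TCE sketch at several spatial scales and merges the results."""

    def __init__(self, channels: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.tce = nn.ModuleList(TCESketch(channels) for _ in scales)
        self.merge = nn.Conv2d(len(scales) * channels, channels, kernel_size=1)

    def forward(self, feat_prev: torch.Tensor, feat_curr: torch.Tensor) -> torch.Tensor:
        h, w = feat_curr.shape[-2:]
        outs = []
        for s, tce in zip(self.scales, self.tce):
            # Downsample both frames' features to the current scale.
            fp = F.avg_pool2d(feat_prev, s) if s > 1 else feat_prev
            fc = F.avg_pool2d(feat_curr, s) if s > 1 else feat_curr
            out = tce(fp, fc)
            if s > 1:
                # Upsample back to the full resolution before merging.
                out = F.interpolate(out, size=(h, w), mode="bilinear", align_corners=False)
            outs.append(out)
        return self.merge(torch.cat(outs, dim=1))


if __name__ == "__main__":
    # Backbone features of two consecutive frames: (batch, channels, H, W).
    f_prev, f_curr = torch.randn(2, 1, 64, 32, 32)
    module = MultiScaleTCESketch(channels=64)
    print(module(f_prev, f_curr).shape)  # torch.Size([1, 64, 32, 32])
```

In an encoder-decoder pose network of the kind the abstract describes, a module like this would sit between the encoder and decoder, so the decoder's joint heatmaps are predicted from temporally fused rather than per-frame features; how the actual TCE module is wired is not specified in this record.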