Differentiable Multi-Granularity Human Representation Learning for Instance-Aware Human Semantic Parsing

Zhou, T; Wang, W; Liu, S; Yang, Y; Van Gool, L

Publisher: IEEE
Publication Type: Conference Proceeding
Citation: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, 00, pp. 1622-1631
Issue Date: 2021-11-13

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhou, T
dc.contributor.author	Wang, W
dc.contributor.author	Liu, S
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.contributor.author	Van Gool, L
dc.date	2021-06-20
dc.date.accessioned	2022-06-04T06:28:16Z
dc.date.available	2022-06-04T06:28:16Z
dc.date.issued	2021-11-13
dc.identifier.citation	2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, 00, pp. 1622-1631
dc.identifier.isbn	9781665445092
dc.identifier.issn	1063-6919
dc.identifier.uri	http://hdl.handle.net/10453/157922
dc.language	en
dc.publisher	IEEE
dc.relation	http://purl.org/au-research/grants/arc/DP200100938
dc.relation.ispartof	2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
dc.relation.ispartof	2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition
dc.relation.ispartofseries	IEEE Conference on Computer Vision and Pattern Recognition
dc.relation.isbasedon	10.1109/cvpr46437.2021.00167
dc.rights	info:eu-repo/semantics/openAccess
dc.title	Differentiable Multi-Granularity Human Representation Learning for Instance-Aware Human Semantic Parsing
dc.type	Conference Proceeding
utslib.citation.volume	00
utslib.location.activity	Nashville, TN, USA
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	open_access	*
dc.date.updated	2022-06-04T06:28:13Z
pubs.finish-date	2021-06-25
pubs.publication-status	Published
pubs.start-date	2021-06-20
pubs.volume	00

Abstract:

To address the challenging task of instance-aware human part parsing, a new bottom-up regime is proposed to learn category-level human semantic segmentation as well as multi-person pose estimation in a joint and end-to-end manner. It is a compact, efficient and powerful framework that exploits structural information over different human granularities and eases the difficulty of person partitioning. Specifically, a dense-to-sparse projection field, which allows explicitly associating dense human semantics with sparse keypoints, is learnt and progressively improved over the network feature pyramid for robustness. Then, the difficult pixel grouping problem is cast as an easier, multi-person joint assembling task. By formulating joint association as maximum-weight bipartite matching, a differentiable solution is developed to exploit projected gradient descent and Dykstra’s cyclic projection algorithm. This makes our method end-to-end trainable and allows back-propagating the grouping error to directly supervise multi-granularity human representation learning. This is distinguished from current bottom-up human parsers or pose estimators which require sophisticated post-processing or heuristic greedy algorithms. Experiments on three instance-aware human parsing datasets show that our model outperforms other bottom-up alternatives with much more efficient inference.
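
The abstract's combination of maximum-weight bipartite matching, projected gradient descent, and Dykstra's cyclic projection can be illustrated with a small sketch. This is not the authors' implementation. It assumes the usual relaxation of matching (non-negative entries, row sums and column sums at most 1), and the function names, step size, and iteration counts are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: relaxed maximum-weight bipartite matching solved by
# projected gradient ascent, with Dykstra's cyclic projection used to project
# onto the intersection of three convex constraint sets. All names and
# hyperparameters below are assumptions, not taken from the paper.

def clip_nonneg(X):
    """Project onto {X : X[i][j] >= 0}."""
    return [[max(v, 0.0) for v in row] for row in X]

def cap_row_sums(X):
    """Project each row onto the half-space {x : sum(x) <= 1}."""
    out = []
    for row in X:
        excess = sum(row) - 1.0
        shift = excess / len(row) if excess > 0 else 0.0
        out.append([v - shift for v in row])
    return out

def transpose(X):
    return [list(col) for col in zip(*X)]

def cap_col_sums(X):
    """Project each column onto the half-space {x : sum(x) <= 1}."""
    return transpose(cap_row_sums(transpose(X)))

def dykstra_project(X, n_cycles=50):
    """Approximate the projection of X onto the intersection of the three sets."""
    projections = (clip_nonneg, cap_row_sums, cap_col_sums)
    # One correction term per constraint set, as Dykstra's algorithm requires.
    corrections = [[[0.0] * len(X[0]) for _ in X] for _ in projections]
    for _ in range(n_cycles):
        for k, project in enumerate(projections):
            Y = [[x + c for x, c in zip(rx, rc)]
                 for rx, rc in zip(X, corrections[k])]
            X_new = project(Y)
            corrections[k] = [[y - x for y, x in zip(ry, rx)]
                              for ry, rx in zip(Y, X_new)]
            X = X_new
    return X

def relaxed_matching(W, n_steps=100, step_size=0.5):
    """Maximize sum(W * X) over the relaxed matching polytope."""
    X = [[0.0] * len(W[0]) for _ in W]
    for _ in range(n_steps):
        # Gradient of the linear objective with respect to X is W itself.
        ascent = [[x + step_size * w for x, w in zip(rx, rw)]
                  for rx, rw in zip(X, W)]
        X = dykstra_project(ascent)
    return X

# Hypothetical 2x2 affinity scores between two sets of candidate joints.
W = [[0.9, 0.1],
     [0.2, 0.8]]
X = relaxed_matching(W)
for row in X:
    print([round(v, 3) for v in row])
```

For this toy input the soft assignment should approach the diagonal matching, which is the maximum-weight one. Every step is built from additions, sums, and element-wise maxima, so the same loop written in an automatic differentiation framework would let a matching loss propagate gradients back to the scores in `W`, which is the property the abstract relies on for end-to-end training.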

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/157922