Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised Correspondence Learning

Li, L; Zhou, T; Wang, W; Yang, L; Li, J; Yang, Y

Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Publication Type: Conference Proceeding
Citation: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022, 2022-June, pp. 8709-8720
Issue Date: 2022-01-01

Filename	Description	Size	Format
Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised Correspondence Learning.pdf	Accepted version	7.7 MB	Adobe PDF

Full metadata record

Field	Value	Language
dc.contributor.author	Li, L
dc.contributor.author	Zhou, T
dc.contributor.author	Wang, W
dc.contributor.author	Yang, L
dc.contributor.author	Li, J
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.date	2022-06-18
dc.date.accessioned	2023-03-08T05:35:51Z
dc.date.available	2023-03-08T05:35:51Z
dc.date.issued	2022-01-01
dc.identifier.citation	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022, 2022-June, pp. 8709-8720
dc.identifier.isbn	9781665469463
dc.identifier.issn	1063-6919
dc.identifier.uri	http://hdl.handle.net/10453/166708
dc.language	en
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)
dc.relation	http://purl.org/au-research/grants/arc/DE220101390
dc.relation.ispartof	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
dc.relation.ispartof	2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
dc.relation.isbasedon	10.1109/CVPR52688.2022.00852
dc.rights	info:eu-repo/semantics/restrictedAccess
dc.title	Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised Correspondence Learning
dc.type	Conference Proceeding
utslib.citation.volume	2022-June
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	in_progress	*
dc.date.updated	2023-03-08T05:35:45Z
pubs.finish-date	2022-06-24
pubs.publication-status	Published
pubs.start-date	2022-06-18
pubs.volume	2022-June

Abstract:

Our goal is to learn visual correspondence from unlabeled videos. We develop Liir, a locality-aware inter- and intra-video reconstruction method that fills in three missing pieces of the self-supervised correspondence learning puzzle: instance discrimination, location awareness, and spatial compactness. First, whereas most existing efforts focus on intra-video self-supervision only, we exploit cross-video affinities as extra negative samples within a unified inter- and intra-video reconstruction scheme. This enables instance-discriminative representation learning by contrasting the desired intra-video pixel association against negative inter-video correspondence. Second, we merge position information into correspondence matching and design a position-shifting strategy to remove the side effect of position encoding during inter-video affinity computation, making Liir location-sensitive. Third, to make full use of the spatial continuity of video data, we impose a compactness-based constraint on correspondence matching, yielding sparser and more reliable solutions. The learned representation surpasses self-supervised state-of-the-art methods on label propagation tasks involving objects, semantic parts, and keypoints.
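
The first idea in the abstract, using cross-video affinities as extra negatives during reconstruction, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the temperature value, the dot-product affinity, and the squared-error loss are all assumptions made for illustration, and the position encoding and compactness constraint are left out. The affinity is normalized over reference pixels from both the same video and another video, but only the intra-video part is used to reconstruct the query, so any affinity mass assigned to the other video weakens the reconstruction.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reconstruct(query_feat, intra_feat, inter_feat, intra_vals, temperature=0.07):
    """Reconstruct query pixels from intra-video reference pixels.

    query_feat: (Nq, D) features of the query frame
    intra_feat: (Ni, D) features of a reference frame from the same video
    inter_feat: (Nx, D) features of a frame from another video (negatives)
    intra_vals: (Ni, C) values (e.g. colors) of the intra-video reference
    The temperature is a hypothetical choice, not taken from the paper.
    """
    # Affinity is normalized jointly over intra- and inter-video pixels.
    ref_feat = np.concatenate([intra_feat, inter_feat], axis=0)
    logits = query_feat @ ref_feat.T / temperature
    affinity = softmax(logits, axis=1)

    # Only the intra-video columns contribute to the reconstruction.
    intra_aff = affinity[:, : intra_feat.shape[0]]
    return intra_aff @ intra_vals, intra_aff


def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Toy data: 16 query pixels, 16 intra-video and 16 inter-video reference pixels.
rng = np.random.default_rng(0)
query = l2_normalize(rng.normal(size=(16, 8)))
intra = l2_normalize(rng.normal(size=(16, 8)))
inter = l2_normalize(rng.normal(size=(16, 8)))
intra_colors = rng.uniform(size=(16, 3))
query_colors = rng.uniform(size=(16, 3))

recon, intra_aff = reconstruct(query, intra, inter, intra_colors)
loss = np.mean((recon - query_colors) ** 2)  # assumed squared-error loss
print(recon.shape, float(loss))
```

In this sketch each row of `intra_aff` sums to less than one whenever the other video receives some affinity, which is what makes the inter-video pixels act as negatives.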

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/166708