Mask Propagation for Efficient Video Semantic Segmentation

Weng, Y; Han, M; He, H; Li, M; Yao, L; Chang, X; Zhuang, B


Publication Type: Conference Proceeding
Citation: Advances in Neural Information Processing Systems, 2023, 36
Issue Date: 2023-01-01


This item is open access.


Full metadata record

Field	Value	Language
dc.contributor.author	Weng, Y
dc.contributor.author	Han, M
dc.contributor.author	He, H
dc.contributor.author	Li, M
dc.contributor.author	Yao, L
dc.contributor.author	Chang, X https://orcid.org/0000-0002-7778-8807
dc.contributor.author	Zhuang, B
dc.date.accessioned	2024-05-22T07:29:08Z
dc.date.available	2024-05-22T07:29:08Z
dc.date.issued	2023-01-01
dc.identifier.citation	Advances in Neural Information Processing Systems, 2023, 36
dc.identifier.issn	1049-5258
dc.identifier.uri	http://hdl.handle.net/10453/179146
dc.language	en
dc.relation.ispartof	Advances in Neural Information Processing Systems
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	1701 Psychology, 1702 Cognitive Sciences
dc.subject.classification	4611 Machine learning
dc.title	Mask Propagation for Efficient Video Semantic Segmentation
dc.type	Conference Proceeding
utslib.citation.volume	36
utslib.for	1701 Psychology
utslib.for	1702 Cognitive Sciences
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	open_access	*
dc.date.updated	2024-05-22T07:29:06Z
pubs.publication-status	Published
pubs.volume	36

Abstract:

Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence. Prior work in this field has demonstrated promising results by extending image semantic segmentation models to exploit temporal relationships across video frames; however, these approaches often incur significant computational costs. In this paper, we propose an efficient mask propagation framework for VSS, called MPVSS. Our approach first employs a strong query-based image segmentor on sparse key frames to generate accurate binary masks and class predictions. We then design a flow estimation module utilizing the learned queries to generate a set of segment-aware flow maps, each associated with a mask prediction from the key frame. Finally, the mask-flow pairs are warped to serve as the mask predictions for the non-key frames. By reusing predictions from key frames, we circumvent the need to process a large volume of video frames individually with resource-intensive segmentors, alleviating temporal redundancy and significantly reducing computational costs. Extensive experiments on VSPW and Cityscapes demonstrate that our mask propagation framework achieves SOTA accuracy and efficiency trade-offs. For instance, our best model with Swin-L backbone outperforms the SOTA MRCFA using MiT-B5 by 4.0% mIoU, requiring only 26% FLOPs on the VSPW dataset. Moreover, our framework reduces up to 4× FLOPs compared to the per-frame Mask2Former baseline with only up to 2% mIoU degradation on the Cityscapes validation set. Code is available at https://github.com/ziplab/MPVSS.
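
The propagation step described in the abstract (warp each key-frame mask with its own flow map, then reuse the key-frame class predictions) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which is available at the repository linked above. The function names `warp_bilinear` and `propagate`, the backward-flow convention, the border clamping, and the Mask2Former-style aggregation of masks and class scores are all assumptions made for illustration. The flow maps are taken as given; the learned flow estimation module is not modelled here.

```python
import numpy as np


def warp_bilinear(mask, flow):
    """Backward-warp one (H, W) mask with a (2, H, W) flow map.

    Assumed convention: flow[0] is the x offset and flow[1] the y offset,
    in pixels, pointing from each non-key-frame pixel to the location it
    came from in the key frame.
    """
    h, w = mask.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Source coordinates in the key frame, clamped to the image border.
    x = np.clip(xs + flow[0], 0, w - 1)
    y = np.clip(ys + flow[1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = x - x0
    wy = y - y0
    # Bilinear interpolation between the four neighbouring pixels.
    top = (1 - wx) * mask[y0, x0] + wx * mask[y0, x1]
    bottom = (1 - wx) * mask[y1, x0] + wx * mask[y1, x1]
    return (1 - wy) * top + wy * bottom


def propagate(mask_probs, class_probs, flows):
    """Produce a non-key-frame label map from key-frame predictions.

    mask_probs:  (N, H, W) per-query mask probabilities from the key frame
    class_probs: (N, C) per-query class probabilities from the key frame
    flows:       (N, 2, H, W) one flow map per query (mask-flow pairs)
    Returns an (H, W) array of class indices.
    """
    warped = np.stack(
        [warp_bilinear(m, f) for m, f in zip(mask_probs, flows)]
    )
    # Assumed aggregation: weight each warped mask by its class scores
    # and take the best-scoring class at every pixel.
    scores = np.einsum("nc,nhw->chw", class_probs, warped)
    return scores.argmax(axis=0)


# Toy key frame: query 0 covers the left half, query 1 the right half.
masks = np.zeros((2, 4, 4))
masks[0, :, :2] = 1.0
masks[1, :, 2:] = 1.0
classes = np.array([[1.0, 0.0], [0.0, 1.0]])

# Every pixel samples one pixel to its left: content shifts right by one.
flows = np.zeros((2, 2, 4, 4))
flows[:, 0] = -1.0

labels = propagate(masks, classes, flows)
print(labels)
```

In this toy case the boundary between the two segments moves one pixel to the right in the propagated label map. The segmentor runs only on the key frame; the non-key frame costs one warp per query plus the aggregation, which is the source of the savings the abstract reports.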

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/179146