VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild

Miao, J; Wei, Y; Wu, Y; Liang, C; Li, G; Yang, Y

Publisher: IEEE
Publication Type: Conference Proceeding
Citation: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, 00, pp. 4131-4141
Issue Date: 2021-11-13

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Miao, J
dc.contributor.author	Wei, Y
dc.contributor.author	Wu, Y https://orcid.org/0000-0002-1680-8253
dc.contributor.author	Liang, C
dc.contributor.author	Li, G
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.date	2021-06-20
dc.date.accessioned	2022-06-05T01:40:13Z
dc.date.available	2022-06-05T01:40:13Z
dc.date.issued	2021-11-13
dc.identifier.citation	2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, 00, pp. 4131-4141
dc.identifier.isbn	9781665445092
dc.identifier.issn	1063-6919
dc.identifier.uri	http://hdl.handle.net/10453/157934
dc.language	en
dc.publisher	IEEE
dc.relation.ispartof	2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
dc.relation.ispartof	2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition
dc.relation.ispartofseries	IEEE Conference on Computer Vision and Pattern Recognition
dc.relation.isbasedon	10.1109/cvpr46437.2021.00412
dc.rights	info:eu-repo/semantics/openAccess
dc.title	VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild
dc.type	Conference Proceeding
utslib.citation.volume	00
utslib.location.activity	Nashville, TN, USA
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	open_access	*
dc.date.updated	2022-06-05T01:40:11Z
pubs.finish-date	2021-06-25
pubs.publication-status	Published
pubs.start-date	2021-06-20
pubs.volume	00

Abstract:

In this paper, we present a new dataset aimed at advancing the scene parsing task from images to videos. The dataset targets Video Scene Parsing in the Wild (VSPW) and covers a wide range of real-world scenarios and categories. Specifically, VSPW has the following features: 1) Well-trimmed long-temporal clips. Each video contains a complete shot, lasting around 5 seconds on average. 2) Dense annotation. Pixel-level annotations are provided at a high frame rate of 15 f/s. 3) High resolution. Over 96% of the captured videos have high spatial resolutions, from 720P to 4K. In total, we annotate 3,536 videos, comprising 251,633 frames from 124 categories. To the best of our knowledge, VSPW is the first attempt to tackle the challenging video scene parsing task in the wild by considering diverse scenarios. Based on VSPW, we design a generic Temporal Context Blending (TCB) network, which can effectively harness long-range contextual information from past frames to help segment the current one. Extensive experiments show that our TCB network improves both segmentation performance and temporal stability compared with image-/video-based state-of-the-art methods. We hope that the scale, diversity, long temporal extent, and high frame rate of VSPW can significantly advance research on video scene parsing and beyond. The dataset is available at https://www.vspwdataset.com/.
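
The headline figures in the abstract can be cross-checked with a little arithmetic. The sketch below is not part of the paper or the dataset tooling. It assumes that all 251,633 frames are annotated frames sampled at the stated 15 f/s and that each of the 3,536 videos is a single clip. The function and constant names are illustrative.

```python
# Figures quoted in the abstract.
NUM_VIDEOS = 3_536
NUM_FRAMES = 251_633
ANNOTATION_FPS = 15  # annotated frames per second


def mean_frames_per_video(num_frames: int, num_videos: int) -> float:
    """Average number of annotated frames per video."""
    return num_frames / num_videos


def mean_clip_seconds(num_frames: int, num_videos: int, fps: float) -> float:
    """Mean clip length implied by the frame count and annotation rate.

    Assumption: every counted frame was sampled at `fps`.
    """
    return mean_frames_per_video(num_frames, num_videos) / fps


if __name__ == "__main__":
    frames = mean_frames_per_video(NUM_FRAMES, NUM_VIDEOS)
    seconds = mean_clip_seconds(NUM_FRAMES, NUM_VIDEOS, ANNOTATION_FPS)
    print(f"{frames:.1f} annotated frames per video on average")
    print(f"{seconds:.2f} s implied mean clip length")
```

Under these assumptions the figures work out to roughly 71 annotated frames per video and an implied mean clip length of about 4.7 seconds, which agrees with the abstract's "around 5 seconds on average".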

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/157934