Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation

Xu, Y; Yang, Z; Yang, Y

Publisher: IEEE
Publication Type: Conference Proceeding
Citation: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2024, 00, pp. 9704-9717
Issue Date: 2024-01-15

Closed Access

Filename	Description	Size	Format
1704444.pdf	Published version	3.43 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Xu, Y
dc.contributor.author	Yang, Z
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.date	2023-10-01
dc.date.accessioned	2024-05-13T04:25:33Z
dc.date.available	2024-05-13T04:25:33Z
dc.date.issued	2024-01-15
dc.identifier.citation	2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2024, 00, pp. 9704-9717
dc.identifier.isbn	979-8-3503-0719-1
dc.identifier.issn	1550-5499
dc.identifier.uri	http://hdl.handle.net/10453/178908
dc.description.abstract	Tracking any given object(s) spatially and temporally is a common purpose in Visual Object Tracking (VOT) and Video Object Segmentation (VOS). Joint tracking and segmentation have been attempted in some studies, but they often lack full compatibility of both box and mask in initialization and prediction, and mainly focus on single-object scenarios. To address these limitations, this paper proposes a Multi-object Mask-box Integrated framework for unified Tracking and Segmentation, dubbed MITS. Firstly, the unified identification module is proposed to support both box and mask reference for initialization, where detailed object information is inferred from boxes or directly retained from masks. Additionally, a novel pinpoint box predictor is proposed for accurate multi-object box prediction, facilitating target-oriented representation learning. All target objects are processed simultaneously from encoding to propagation and decoding, as a unified pipeline for VOT and VOS. Experimental results show MITS achieves state-of-the-art performance on both VOT and VOS benchmarks. Notably, MITS surpasses the best prior VOT competitor by around 6% on the GOT-10k test set, and significantly improves the performance of box initialization on VOS benchmarks. The code is available at https://github.com/yoxu515/MITS.
dc.language	en
dc.publisher	IEEE
dc.relation.ispartof	2023 IEEE/CVF International Conference on Computer Vision (ICCV)
dc.relation.ispartof	2023 IEEE/CVF International Conference on Computer Vision
dc.relation.isbasedon	10.1109/iccv51070.2023.00893
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation
dc.type	Conference Proceeding
utslib.citation.volume	00
utslib.location.activity	Paris, France
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
utslib.copyright.status	closed_access	*
dc.date.updated	2024-05-13T04:25:29Z
pubs.finish-date	2023-10-06
pubs.place-of-publication	Piscataway, USA
pubs.publication-status	Published
pubs.start-date	2023-10-01
pubs.volume	00
dc.location	Piscataway, USA

Abstract:

Tracking any given object(s) spatially and temporally is a common purpose in Visual Object Tracking (VOT) and Video Object Segmentation (VOS). Joint tracking and segmentation have been attempted in some studies, but they often lack full compatibility of both box and mask in initialization and prediction, and mainly focus on single-object scenarios. To address these limitations, this paper proposes a Multi-object Mask-box Integrated framework for unified Tracking and Segmentation, dubbed MITS. Firstly, the unified identification module is proposed to support both box and mask reference for initialization, where detailed object information is inferred from boxes or directly retained from masks. Additionally, a novel pinpoint box predictor is proposed for accurate multi-object box prediction, facilitating target-oriented representation learning. All target objects are processed simultaneously from encoding to propagation and decoding, as a unified pipeline for VOT and VOS. Experimental results show MITS achieves state-of-the-art performance on both VOT and VOS benchmarks. Notably, MITS surpasses the best prior VOT competitor by around 6% on the GOT-10k test set, and significantly improves the performance of box initialization on VOS benchmarks. The code is available at https://github.com/yoxu515/MITS.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/178908