MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation

Publisher:
IEEE
Publication Type:
Conference Proceeding
Citation:
2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2024, pp. 1196-1205
Issue Date:
2024-01-15
Filename: 1721237.pdf
Description: Published version
Format: Adobe PDF
Size: 3.3 MB
Abstract:
Recently, semantic segmentation models trained with image-level text supervision have shown promising results in challenging open-world scenarios. However, these models still face difficulties in learning fine-grained semantic alignment at the pixel level and predicting accurate object masks. To address this issue, we propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation that enhances a model's ability to reorganize patches mixed across images, exploring both local visual relevance and global semantic coherence. Our approach involves generating fine-grained patch-text pairs by mixing image patches while preserving the correspondence between patches and text. The model is then trained to minimize the segmentation loss of the mixed images and the two contrastive losses of the original and restored features. With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment, which is crucial for open-world segmentation. After training with large-scale image-text data, MixReorg models can be applied directly to segment visual objects of arbitrary categories without the need for further fine-tuning. Our proposed framework demonstrates strong performance on popular zero-shot semantic segmentation benchmarks, outperforming GroupViT by significant margins of 5.0, 6.2, 2.5, and 3.4 mIoU on PASCAL VOC2012, PASCAL Context, MS COCO, and ADE20K, respectively.
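The following PyTorch snippet is only an illustrative sketch of the idea described in the abstract, not the authors' released implementation: the function names (mix_patches, reorg_and_contrastive_losses), the tensor shapes, the temperature tau, and the equal weighting of the loss terms are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def mix_patches(patch_tokens):
    """Shuffle patch tokens across a batch while recording each patch's source image.

    patch_tokens: (B, N, D) patch embeddings from B images with N patches each.
    Returns mixed tokens of the same shape and (B, N) source-image indices, which
    act as dense targets for reorganizing the mixed images.
    """
    B, N, D = patch_tokens.shape
    flat = patch_tokens.reshape(B * N, D)
    src = torch.arange(B).repeat_interleave(N)   # source image id of every patch
    perm = torch.randperm(B * N)                 # mix patches across all images
    mixed = flat[perm].reshape(B, N, D)
    targets = src[perm].reshape(B, N)
    return mixed, targets


def reorg_and_contrastive_losses(image_feats, mixed_patch_feats, targets, text_feats, tau=0.07):
    """Segmentation-style reorganization loss on mixed patches plus an image-text
    contrastive loss (hypothetical combination for illustration)."""
    B, N, D = mixed_patch_feats.shape

    # Reorganization as dense classification: score each mixed patch against the
    # text embedding of every image in the batch and require it to match the text
    # of its source image (the preserved patch-text correspondence).
    logits = torch.einsum(
        "bnd,kd->bnk",
        F.normalize(mixed_patch_feats, dim=-1),
        F.normalize(text_feats, dim=-1),
    ) / tau
    seg_loss = F.cross_entropy(logits.reshape(B * N, B), targets.reshape(B * N))

    # Standard symmetric image-text contrastive loss on global features.
    sim = F.normalize(image_feats, dim=-1) @ F.normalize(text_feats, dim=-1).t() / tau
    labels = torch.arange(B)
    contrastive = 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))

    return seg_loss + contrastive
```

In a training step of this sketch, the contrastive term would be computed twice, once with features of the original images and once with features recovered from the reorganized (restored) patches, corresponding to the two contrastive losses mentioned in the abstract.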