CCNet: Criss-Cross Attention for Semantic Segmentation.

Huang, Z; Wang, X; Wei, Y; Huang, L; Shi, H; Liu, W; Huang, TS

CCNet: Criss-Cross Attention for Semantic Segmentation.

Huang, Z Wang, X Wei, Y Huang, L Shi, H Liu, W Huang, TS

Permalink

Publisher:: Institute of Electrical and Electronics Engineers (IEEE)
Publication Type:: Journal Article
Citation:: IEEE Trans Pattern Anal Mach Intell, 2023, 45, (6), pp. 6896-6908
Issue Date:: 2023-06

Closed Access

	Filename	Description	Size
	1811.11721.pdf	Published version	1.84 MB		View/Open

Copyright Clearance Process

Recently Added
In Progress
Closed Access

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Huang, Z
dc.contributor.author	Wang, X
dc.contributor.author	Wei, Y
dc.contributor.author	Huang, L
dc.contributor.author	Shi, H
dc.contributor.author	Liu, W
dc.contributor.author	Huang, TS
dc.date.accessioned	2024-04-19T01:18:29Z
dc.date.available	2024-04-19T01:18:29Z
dc.date.issued	2023-06
dc.identifier.citation	IEEE Trans Pattern Anal Mach Intell, 2023, 45, (6), pp. 6896-6908
dc.identifier.issn	0162-8828
dc.identifier.issn	1939-3539
dc.identifier.uri	http://hdl.handle.net/10453/178058
dc.description.abstract	Contextual information is vital in visual understanding problems, such as semantic segmentation and object detection. We propose a criss-cross network (CCNet) for obtaining full-image contextual information in a very effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Besides, a category consistent loss is proposed to enforce the criss-cross attention module to produce more discriminative features. Overall, CCNet is with the following merits: 1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11× less GPU memory usage. 2) High computational efficiency. The recurrent criss-cross attention significantly reduces FLOPs by about 85 percent of the non-local block. 3) The state-of-the-art performance. We conduct extensive experiments on semantic segmentation benchmarks including Cityscapes, ADE20K, human parsing benchmark LIP, instance segmentation benchmark COCO, video segmentation benchmark CamVid. In particular, our CCNet achieves the mIoU scores of 81.9, 45.76 and 55.47 percent on the Cityscapes test set, the ADE20K validation set and the LIP validation set respectively, which are the new state-of-the-art results. The source codes are available at https://github.com/speedinghzl/CCNethttps://github.com/speedinghzl/CCNet.
dc.format	Print-Electronic
dc.language	eng
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)
dc.relation	http://purl.org/au-research/grants/arc/DE190101315
dc.relation	http://purl.org/au-research/grants/arc/DP200100938
dc.relation.ispartof	IEEE Trans Pattern Anal Mach Intell
dc.relation.isbasedon	10.1109/TPAMI.2020.3007032
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0801 Artificial Intelligence and Image Processing, 0806 Information Systems, 0906 Electrical and Electronic Engineering
dc.subject.classification	Artificial Intelligence & Image Processing
dc.subject.classification	4603 Computer vision and multimedia computation
dc.subject.classification	4611 Machine learning
dc.title	CCNet: Criss-Cross Attention for Semantic Segmentation.
dc.type	Journal Article
utslib.citation.volume	45
utslib.location.activity	United States
utslib.for	0801 Artificial Intelligence and Image Processing
utslib.for	0806 Information Systems
utslib.for	0906 Electrical and Electronic Engineering
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	closed_access	*
dc.date.updated	2024-04-19T01:18:27Z
pubs.issue	6
pubs.publication-status	Published
pubs.volume	45
utslib.citation.issue	6

Abstract:

Contextual information is vital in visual understanding problems, such as semantic segmentation and object detection. We propose a criss-cross network (CCNet) for obtaining full-image contextual information in a very effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Besides, a category consistent loss is proposed to enforce the criss-cross attention module to produce more discriminative features. Overall, CCNet is with the following merits: 1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11× less GPU memory usage. 2) High computational efficiency. The recurrent criss-cross attention significantly reduces FLOPs by about 85 percent of the non-local block. 3) The state-of-the-art performance. We conduct extensive experiments on semantic segmentation benchmarks including Cityscapes, ADE20K, human parsing benchmark LIP, instance segmentation benchmark COCO, video segmentation benchmark CamVid. In particular, our CCNet achieves the mIoU scores of 81.9, 45.76 and 55.47 percent on the Cityscapes test set, the ADE20K validation set and the LIP validation set respectively, which are the new state-of-the-art results. The source codes are available at https://github.com/speedinghzl/CCNethttps://github.com/speedinghzl/CCNet.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/178058