Vision transformers for dense prediction: A survey

Publisher:
Elsevier
Publication Type:
Journal Article
Citation:
Knowledge-Based Systems, 2022, 253
Issue Date:
2022-10-11
Filename: 1-s2.0-S0950705122007821-main.pdf
Description: Published version
Size: 3.52 MB
Format: Adobe PDF
Abstract:
Transformers have demonstrated impressive expressive power and transfer capability in computer vision. Dense prediction is a fundamental class of computer vision problems that is more challenging than general image-level prediction. The inherent properties of transformers enable them to process feature representations at a stable, relatively high resolution, which precisely matches the demand of dense prediction tasks for finer-grained, globally coherent predictions. Furthermore, compared with convolutional networks, transformer methods encode minimal inductive bias and permit long-range information interaction. These strengths have driven exciting advances in dense prediction tasks built on transformer networks. This survey provides a comprehensive overview of transformer models with a specific focus on dense prediction. We give a well-rounded view of state-of-the-art transformer-based approaches, with explicit emphasis on pixel-level prediction tasks. We mainly consider transformer variants from the network-architecture perspective and propose a novel taxonomy that organizes these models according to their structures. We then examine specific optimization strategies that tackle key bottlenecks in dense prediction tasks. We explore the commonalities and differences among these works and provide several horizontal comparisons from an experimental point of view. Finally, we summarize several persistent problems that continue to affect vision transformers and outline possible directions for future development.
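
The resolution claim above can be made concrete with a toy example. The following minimal NumPy sketch is illustrative only (it is not code from the surveyed paper, and all shapes and names are assumptions): it runs single-head self-attention over a ViT-style grid of patch tokens, where every token attends to every other token (long-range interaction) and the layer returns exactly as many tokens as it receives, so spatial resolution never drops the way it does across the pooled stages of a convolutional network.

import numpy as np

# Toy single-head self-attention over patch tokens (illustrative shapes).
rng = np.random.default_rng(0)
H = W = 14          # 14x14 patch grid, e.g. a 224-px image with 16-px patches
N, d = H * W, 64    # N = 196 tokens, each a d-dimensional feature
x = rng.standard_normal((N, d))                   # patch tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)                     # (N, N): global token-to-token affinities
scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)          # row-wise softmax
out = attn @ v                                    # (N, d): same token count in and out

print(out.shape)                                  # (196, 64) -- resolution preserved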