Parameter-Efficient Vision Transformer with Linear Attention

Zhao, Y; Tang, H; Jiang, Y; A, Y; Wu, Q; Wang, J

Publisher: IEEE
Publication Type: Conference Proceeding
Citation: 2023 IEEE International Conference on Image Processing (ICIP), 2023, 00, pp. 1275-1279
Issue Date: 2023-09-11

Closed Access

	Filename	Description	Size	Format
	Parameter-Efficient_Vision_Transformer_with_Linear_Attention.pdf	Published version	962.42 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhao, Y
dc.contributor.author	Tang, H
dc.contributor.author	Jiang, Y
dc.contributor.author	A, Y
dc.contributor.author	Wu, Q https://orcid.org/0000-0001-5641-2483
dc.contributor.author	Wang, J
dc.date	2023-10-08
dc.date.accessioned	2024-03-13T10:17:06Z
dc.date.available	2024-03-13T10:17:06Z
dc.date.issued	2023-09-11
dc.identifier.citation	2023 IEEE International Conference on Image Processing (ICIP), 2023, 00, pp. 1275-1279
dc.identifier.isbn	978-1-7281-9836-1
dc.identifier.issn	1522-4880
dc.identifier.uri	http://hdl.handle.net/10453/176639
dc.description.abstract	Recent advances in vision transformers (ViTs) have achieved outstanding performance in visual recognition tasks, including image classification and detection. ViTs can learn global representations with their self-attention mechanism, but they are usually heavy-weight and unsuitable for resource-constrained devices. In this paper, we propose a novel linear feature attention (LFA) module to reduce computation costs for vision transformers and combine efficient mobile CNN modules to form a parameter-efficient and high-performance CNN-ViT hybrid model, called LightFormer, which can serve as a general-purpose backbone to learn both global and local representation. Comprehensive experiments demonstrate that LightFormer achieves competitive performance across different visual recognition tasks. On the ImageNet-1K dataset, LightFormer achieves top-1 accuracy of 78.5% with 5.5 million parameters. Our model also performs well when transferred to object detection and semantic segmentation tasks. On the MS COCO dataset, LightFormer attains mAP of 33.2 within the YOLOv3 framework, and on the Cityscapes dataset, with only a simple all-MLP decoder, LightFormer achieves mIoU of 78.5% and FPS of 15.3, surpassing state-of-the-art lightweight segmentation networks.
dc.language	en
dc.publisher	IEEE
dc.relation.ispartof	2023 IEEE International Conference on Image Processing (ICIP)
dc.relation.ispartof	2023 IEEE International Conference on Image Processing
dc.relation.isbasedon	10.1109/icip49359.2023.10222812
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Parameter-Efficient Vision Transformer with Linear Attention
dc.type	Conference Proceeding
utslib.citation.volume	00
utslib.location.activity	Kuala Lumpur, Malaysia
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	University of Technology Sydney/Strength - INEXT - Innovation in IT Services and Applications
pubs.organisational-group	University of Technology Sydney/Strength - GBDTC - Global Big Data Technologies
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
utslib.copyright.status	closed_access	*
dc.date.updated	2024-03-13T10:17:05Z
pubs.finish-date	2023-10-11
pubs.place-of-publication	Piscataway, USA
pubs.publication-status	Published
pubs.start-date	2023-10-08
pubs.volume	00
dc.location	Piscataway, USA

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/176639