Beyond Fixation: Dynamic Window Visual Transformer
- Publisher:
- IEEE COMPUTER SOC
- Publication Type:
- Conference Proceeding
- Citation:
- Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022, 2022-June, pp. 11977-11987
- Issue Date:
- 2022-01-01
Open Access
This item is open access.
Recently, a surge of interest in visual transformers has focused on reducing computational cost by limiting the calculation of self-attention to a local window. Most current work uses a fixed single-scale window by default, ignoring the impact of window size on model performance. However, this may limit the capacity of these window-based models to capture multi-scale information. In this paper, we propose a novel method, named Dynamic Window Vision Transformer (DW-ViT). The dynamic window strategy proposed by DW-ViT goes beyond models that employ a fixed single window setting. To the best of our knowledge, we are the first to use dynamic multi-scale windows to explore the upper limit of the effect of window settings on model performance. In DW-ViT, multi-scale information is obtained by assigning windows of different sizes to different head groups of window multi-head self-attention. The information is then dynamically fused by assigning different weights to the multi-scale window branches. We conducted a detailed performance evaluation on three datasets: ImageNet-1K, ADE20K, and COCO. Compared with related state-of-the-art (SoTA) methods, DW-ViT obtains the best performance. Specifically, compared with the current SoTA Swin Transformer [31], DW-ViT achieves consistent and substantial improvements on all three datasets with similar parameter counts and computational costs. In addition, DW-ViT exhibits good scalability and can be easily inserted into any window-based visual transformer. Code release: https://github.com/pzhren/DW-ViT. This work was done when the first author interned at Dark Matter AI.
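The core idea in the abstract — window-restricted self-attention, head groups with different window sizes, and weighted fusion of the branches — can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration of the concept, not the authors' implementation (which is in the linked repository): it uses 1-D windows, a single head per group, and fixed fusion logits rather than the paper's learned, input-dependent weighting.

```python
import numpy as np

def window_attention(x, window):
    """Self-attention restricted to non-overlapping windows (sketch).
    x: (L, C) token features; L must be divisible by window."""
    L, C = x.shape
    out = np.empty_like(x)
    for s in range(0, L, window):
        w = x[s:s + window]                    # tokens of one local window
        scores = (w @ w.T) / np.sqrt(C)        # scaled dot-product scores
        scores -= scores.max(axis=-1, keepdims=True)
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)  # row-wise softmax
        out[s:s + window] = attn @ w
    return out

def dynamic_multiscale_window_attention(x, window_sizes, branch_logits):
    """Illustrative DW-ViT-style block: split channels into head groups,
    run window attention with a different window size per group, then
    fuse the branch outputs with softmax-normalised dynamic weights.
    `branch_logits` stands in for the weights the real model predicts
    from the input."""
    L, C = x.shape
    G = len(window_sizes)
    assert C % G == 0, "channels must split evenly across head groups"
    weights = np.exp(branch_logits - branch_logits.max())
    weights /= weights.sum()                   # dynamic fusion weights
    groups = np.split(x, G, axis=1)            # one head group per scale
    fused = [weights[g] * window_attention(groups[g], window_sizes[g])
             for g in range(G)]
    return np.concatenate(fused, axis=1)       # back to (L, C)
```

For example, 16 tokens with 8 channels and two branches (window sizes 4 and 8) yield a fused output of the same shape, so the block can drop into a window-based transformer stage without changing tensor shapes:

```python
x = np.random.default_rng(0).standard_normal((16, 8))
out = dynamic_multiscale_window_attention(x, [4, 8], np.array([0.0, 1.0]))
print(out.shape)  # (16, 8)
```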