Local self-attention in transformer for visual question answering

Shen, X; Han, D; Guo, Z; Chen, C; Hua, J; Luo, G

Publisher: Springer
Publication Type: Journal Article
Citation: Applied Intelligence, 2023, 53, (13), pp. 16706-16723
Issue Date: 2023-01-01

Closed Access

	Filename	Description	Size	Format
	s10489-022-04355-w.pdf	Published version	2.75 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Shen, X
dc.contributor.author	Han, D
dc.contributor.author	Guo, Z
dc.contributor.author	Chen, C
dc.contributor.author	Hua, J https://orcid.org/0000-0002-3409-2076
dc.contributor.author	Luo, G
dc.date.accessioned	2024-05-13T06:52:20Z
dc.date.available	2024-05-13T06:52:20Z
dc.date.issued	2023-01-01
dc.identifier.citation	Applied Intelligence, 2023, 53, (13), pp. 16706-16723
dc.identifier.issn	0924-669X
dc.identifier.issn	1573-7497
dc.identifier.uri	http://hdl.handle.net/10453/178939
dc.description.abstract	Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Many VQA models adopt the Transformer architecture because self-attention is well suited to modeling global dependencies. However, balancing global and local dependency modeling in conventional Transformer structures remains an open issue: a Transformer-based VQA model that models only global dependencies cannot effectively capture image context information. To address this, the paper proposes Local Self-Attention in Transformer (LSAT), a model for visual question answering. By setting local windows over the visual features, the LSAT model captures intra-window and inter-window attention simultaneously, which lets it avoid redundant information in global self-attention while still capturing rich contextual information. The paper uses grid visual features to conduct extensive experiments and ablation studies on the VQA benchmark datasets VQA 2.0 and CLEVR. The results show that, when an appropriate local window size is selected, the LSAT model outperforms the benchmark model on all indicators. Specifically, the best test results of LSAT using grid visual features on the VQA 2.0 and CLEVR datasets were 71.94% and 98.72%, respectively. The experiments and ablation studies together support the effectiveness of the proposed method. Source code is available at https://github.com/shenxiang-vqa/LSAT.
dc.language	English
dc.publisher	Springer
dc.relation.ispartof	Applied Intelligence
dc.relation.isbasedon	10.1007/s10489-022-04355-w
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0801 Artificial Intelligence and Image Processing
dc.subject.classification	Artificial Intelligence & Image Processing
dc.subject.classification	46 Information and computing sciences
dc.title	Local self-attention in transformer for visual question answering
dc.type	Journal Article
utslib.citation.volume	53
utslib.for	0801 Artificial Intelligence and Image Processing
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Provost
pubs.organisational-group	University of Technology Sydney/Provost/TD School
utslib.copyright.status	closed_access	*
pubs.consider-herdc	false
dc.date.updated	2024-05-13T06:52:18Z
pubs.issue	13
pubs.publication-status	Published
pubs.volume	53
utslib.citation.issue	13

Abstract:

Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Many VQA models adopt the Transformer architecture because self-attention is well suited to modeling global dependencies. However, balancing global and local dependency modeling in conventional Transformer structures remains an open issue: a Transformer-based VQA model that models only global dependencies cannot effectively capture image context information. To address this, the paper proposes Local Self-Attention in Transformer (LSAT), a model for visual question answering. By setting local windows over the visual features, the LSAT model captures intra-window and inter-window attention simultaneously, which lets it avoid redundant information in global self-attention while still capturing rich contextual information. The paper uses grid visual features to conduct extensive experiments and ablation studies on the VQA benchmark datasets VQA 2.0 and CLEVR. The results show that, when an appropriate local window size is selected, the LSAT model outperforms the benchmark model on all indicators. Specifically, the best test results of LSAT using grid visual features on the VQA 2.0 and CLEVR datasets were 71.94% and 98.72%, respectively. The experiments and ablation studies together support the effectiveness of the proposed method. Source code is available at https://github.com/shenxiang-vqa/LSAT.
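
The idea of restricting self-attention to local windows over grid visual features can be illustrated with a minimal sketch. This is not the authors' implementation (that is in the repository linked above). It assumes a row-major grid of feature vectors, non-overlapping square windows, and no learned projections, so queries, keys, and values are all the raw features. The function names window_ids, local_mask, and masked_self_attention are hypothetical. Only the intra-window part is shown; the abstract does not specify how inter-window attention is computed, so it is left out.

```python
import math


def window_ids(height, width, window):
    """Assign each cell of a row-major height x width grid to a window id.

    Assumption: windows are non-overlapping squares of side `window`.
    """
    per_row = math.ceil(width / window)  # number of windows along one grid row
    return [(r // window) * per_row + (c // window)
            for r in range(height) for c in range(width)]


def local_mask(height, width, window):
    """mask[i][j] is True when grid cells i and j fall in the same window."""
    ids = window_ids(height, width, window)
    return [[a == b for b in ids] for a in ids]


def masked_self_attention(features, mask):
    """Scaled dot-product self-attention restricted by a boolean mask.

    Assumption: Q = K = V = features (no learned projection matrices).
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    outputs = []
    for i, query in enumerate(features):
        # Scores only for positions the mask allows; others stay None.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale if mask[i][j] else None
            for j, key in enumerate(features)
        ]
        # Every cell shares a window with itself, so at least one score exists.
        top = max(s for s in scores if s is not None)
        # Numerically stable softmax; masked positions get zero weight.
        weights = [0.0 if s is None else math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, features)) / total
            for d in range(dim)
        ])
    return outputs


# Usage: a 4 x 4 grid of 2-dimensional toy features with 2 x 2 windows.
toy_features = [[float(r), float(c)] for r in range(4) for c in range(4)]
toy_mask = local_mask(4, 4, 2)
attended = masked_self_attention(toy_features, toy_mask)
```

With this mask each grid cell attends only to the cells in its own window, so the attention weights for every other position are exactly zero. Passing a mask of all True values instead recovers ordinary global self-attention over the grid.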

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/178939