Visual commonsense reasoning with directional visual connections

Han, Y; Wu, A; Zhu, L; Yang, Y

Publisher: ZHEJIANG UNIV PRESS
Publication Type: Journal Article
Citation: Frontiers of Information Technology and Electronic Engineering, 2021, 22, (5), pp. 625-637
Issue Date: 2021-05-01

Closed Access

Full metadata record

Field	Value	Language
dc.contributor.author	Han, Y
dc.contributor.author	Wu, A
dc.contributor.author	Zhu, L https://orcid.org/0000-0002-4093-7557
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.date.accessioned	2022-02-05T10:48:32Z
dc.date.available	2022-02-05T10:48:32Z
dc.date.issued	2021-05-01
dc.identifier.citation	Frontiers of Information Technology and Electronic Engineering, 2021, 22, (5), pp. 625-637
dc.identifier.issn	2095-9184
dc.identifier.issn	2095-9230
dc.identifier.uri	http://hdl.handle.net/10453/154198
dc.description.abstract	To boost research into cognition-level visual understanding, i.e., making an accurate inference based on a thorough understanding of visual details, visual commonsense reasoning (VCR) has been proposed. Compared with traditional visual question answering, which requires models to select correct answers, VCR requires models to select not only the correct answers, but also the correct rationales. Recent research into human cognition has indicated that brain function or cognition can be considered as a global and dynamic integration of local neuron connectivity, which is helpful in solving specific cognition tasks. Inspired by this idea, we propose a directional connective network to achieve VCR by dynamically reorganizing the visual neuron connectivity that is contextualized using the meaning of questions and answers and leveraging the directional information to enhance the reasoning ability. Specifically, we first develop a GraphVLAD module to capture visual neuron connectivity to fully model visual content correlations. Then, a contextualization process is proposed to fuse sentence representations with visual neuron representations. Finally, based on the output of contextualized connectivity, we propose directional connectivity to infer answers and rationales, which includes a ReasonVLAD module. Experimental results on the VCR dataset and visualization analysis demonstrate the effectiveness of our method.
dc.language	English
dc.publisher	ZHEJIANG UNIV PRESS
dc.relation.ispartof	Frontiers of Information Technology and Electronic Engineering
dc.relation.isbasedon	10.1631/FITEE.2000722
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0906 Electrical and Electronic Engineering
dc.title	Visual commonsense reasoning with directional visual connections
dc.type	Journal Article
utslib.citation.volume	22
utslib.for	0906 Electrical and Electronic Engineering
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	closed_access	*
dc.date.updated	2022-02-05T10:48:30Z
pubs.issue	5
pubs.publication-status	Published
pubs.volume	22
utslib.citation.issue	5

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/154198