Recurrent Attention Network with Reinforced Generator for Visual Dialog

Publisher:
Association for Computing Machinery
Publication Type:
Journal Article
Citation:
ACM Transactions on Multimedia Computing, Communications, and Applications, 16(3), 2020
Issue Date:
2020-09-01
© 2020 ACM. In Visual Dialog, an agent has to parse temporal context in the dialog history and spatial context in the image to hold a meaningful dialog with humans. For example, to answer "what is the man on her left wearing?" the agent needs to (1) analyze the temporal context in the dialog history to infer who is being referred to as "her," (2) parse the image to attend to "her," and (3) uncover the spatial context to shift the attention to "her left" and check the apparel of the man. In this article, we use a dialog network to memorize the temporal context and an attention processor to parse the spatial context. Since the question and the image are usually very complex, the question is difficult to ground with a single glimpse, so the attention processor attends to the image multiple times to better collect visual information. In the Visual Dialog task, the generative decoder (G) is trained under the word-by-word paradigm and thus lacks sentence-level supervision. To ameliorate this problem, we propose to reinforce G at the sentence level using the discriminative model (D), whose goal is to select the right answer from a set of candidates. Experimental results on the VisDial dataset demonstrate the effectiveness of our approach.
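The two ideas in the abstract can be illustrated in code. Below is a minimal PyTorch sketch, not the paper's actual architecture: a module that attends to regional image features over several glimpses while refining its query, and a sentence-level REINFORCE loss that uses D's score on a sampled answer as the reward for G. All class, function, and parameter names here are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGlimpseAttention(nn.Module):
    # Attends to region features several times, refining the query after
    # each glimpse (hypothetical module; dimensions are illustrative).
    def __init__(self, feat_dim, query_dim, num_glimpses=3):
        super().__init__()
        self.num_glimpses = num_glimpses
        self.key_proj = nn.Linear(feat_dim, query_dim)
        self.score = nn.Linear(query_dim, 1)
        self.refine = nn.GRUCell(feat_dim, query_dim)

    def forward(self, image_feats, query):
        # image_feats: (B, R, feat_dim) regional features; query: (B, query_dim)
        keys = self.key_proj(image_feats)                               # (B, R, query_dim)
        for _ in range(self.num_glimpses):
            logits = self.score(torch.tanh(keys + query.unsqueeze(1)))  # (B, R, 1)
            attn = F.softmax(logits, dim=1)                             # attention over regions
            glimpse = (attn * image_feats).sum(dim=1)                   # (B, feat_dim)
            query = self.refine(glimpse, query)                         # refined query for next glimpse
        return query

def reinforce_generator_loss(answer_log_probs, d_scores, baseline=0.0):
    # Sentence-level REINFORCE: answer_log_probs is the summed log-probability
    # of each sampled answer under G (shape (B,)); d_scores is D's score for
    # that answer, used as a sentence-level reward (an assumption, not the
    # paper's exact objective). The reward is detached so gradients flow
    # only through G; the baseline reduces variance.
    reward = d_scores.detach() - baseline
    return -(reward * answer_log_probs).mean()

For instance, with a batch of 8 dialogs, 36 region features of size 2048, and a 512-d question encoding, MultiGlimpseAttention(2048, 512)(feats, q) returns an (8, 512) grounded query; the loss from reinforce_generator_loss can then be backpropagated to update G alone.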