Recurrent Attention Network with Reinforced Generator for Visual Dialog

Publication Type:
Journal Article
ACM Transactions on Multimedia Computing, Communications and Applications, 2020, 16, (3)
© 2020 ACM. In Visual Dialog, an agent has to parse temporal context in the dialog history and spatial context in the image to hold a meaningful dialog with humans. For example, to answer "what is the man on her left wearing?" the agent needs to (1) analyze the temporal context in the dialog history to infer who is being referred to as "her," (2) parse the image to attend to "her," and (3) uncover the spatial context to shift the attention to "her left" and check the apparel of the man. In this article, we use a dialog network to memorize the temporal context and an attention processor to parse the spatial context. Because the question and the image are usually complex, the question is difficult to ground with a single glimpse, so the attention processor attends to the image multiple times to better collect visual information. In the Visual Dialog task, the generative decoder (G) is trained under the word-by-word paradigm, which suffers from the lack of sentence-level training. To ameliorate this problem, we propose to reinforce G at the sentence level using the discriminative model (D), which aims to select the right answer from a few candidates. Experimental results on the VisDial dataset demonstrate the effectiveness of our approach.
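The idea of attending to the image multiple times can be illustrated with a small sketch. This is not the article's implementation: the function name `multi_glimpse_attention`, the dot-product scoring, and the additive query update are all assumptions chosen for brevity, and the abstract does not specify how the attention processor actually combines glimpses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multi_glimpse_attention(query, regions, num_glimpses=2):
    """Attend to image regions several times, refining the query each time.

    query:   question/history encoding (list of floats); hypothetical input.
    regions: one feature vector per image region, same length as query.
    Returns the refined query and the attention weights of every glimpse.
    """
    history = []
    for _ in range(num_glimpses):
        # Score each region against the current query (assumed: dot product).
        weights = softmax([dot(query, r) for r in regions])
        # Attended visual context: weighted sum of region features.
        context = [
            sum(w * r[i] for w, r in zip(weights, regions))
            for i in range(len(query))
        ]
        # Assumed update rule: fold the attended context back into the query,
        # so the next glimpse is conditioned on what was already seen.
        query = [q + c for q, c in zip(query, context)]
        history.append(weights)
    return query, history


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]  # two toy image regions
    refined, glimpses = multi_glimpse_attention([1.0, 0.0], regions)
    for step, weights in enumerate(glimpses, start=1):
        print(f"glimpse {step}: {weights}")
```

In this toy setup the second glimpse places more weight on the first region than the first glimpse did, because the query has absorbed the context gathered earlier. That mirrors, in a very simplified way, the motivation stated above for not relying on a single glimpse.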