Saying the Unseen: Video Descriptions via Dialog Agents

Publisher:
Institute of Electrical and Electronics Engineers
Publication Type:
Journal Article
Citation:
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, vol. PP, no. 99, pp. 1-14
Issue Date:
2021-06-29
Filename: Saying_the_Unseen_Video_Descriptions_via_Dialog_Agents.pdf (Accepted version, Adobe PDF, 955.58 kB)
Abstract:
Current vision-and-language tasks usually take complete visual data as input; in practice, however, part of the visual information often becomes inaccessible for various reasons. We introduce a novel task that aims to describe a video using the natural language dialog between two agents as a supplementary information source, given incomplete visual data. Unlike most existing vision-language tasks, where AI has full access to visual data that may reveal sensitive information such as recognizable human faces, we intentionally limit the visual input for AI and seek a more secure and transparent information medium, i.e., the dialog, to supplement the missing visual information. Specifically, one intelligent agent, Q-BOT, is given two segmented frames and a finite number of opportunities to ask natural language questions before describing the unseen video. The other agent, A-BOT, which has access to the entire video, assists Q-BOT by answering its questions. We introduce two experimental settings in which the internal dialog process is either generative (the agents generate questions and answers freely) or discriminative (the agents select questions and answers from candidates). With the proposed QA-Cooperative networks, we experimentally demonstrate the knowledge transfer process between the two dialog agents and the effectiveness of our method.
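The abstract's two-agent protocol (Q-BOT sees two frames and has a finite question budget; A-BOT sees the full video and answers) can be sketched as a simple loop. This is a toy illustration only: the `QBot`/`ABot` classes and every method name below are hypothetical stand-ins, not the paper's QA-Cooperative networks, which use learned question, answer, and caption models.

```python
# Toy sketch of the Q-BOT / A-BOT dialog loop described in the abstract.
# All class and method names are illustrative assumptions, not the paper's API.

class QBot:
    """Sees only two segmented frames; asks questions, then describes the video."""
    def __init__(self, frames):
        self.frames = frames
        self.facts = []  # knowledge accumulated from A-BOT's answers

    def ask(self, round_idx):
        # A real model would generate (or select) a question here.
        return f"What happens in the unseen part? (round {round_idx})"

    def observe(self, answer):
        self.facts.append(answer)

    def describe(self):
        # A real model would decode a natural language caption;
        # here we just report how much dialog knowledge was gathered.
        return f"Video description based on {len(self.facts)} answers."


class ABot:
    """Has access to the entire video; answers Q-BOT's questions."""
    def __init__(self, video):
        self.video = video

    def answer(self, question):
        return f"Answer drawn from full video of {len(self.video)} frames."


def dialog_caption(video, num_rounds=3):
    q_bot = QBot(frames=video[:2])   # Q-BOT receives only two frames
    a_bot = ABot(video)              # A-BOT receives the whole video
    for t in range(num_rounds):      # finite number of question opportunities
        question = q_bot.ask(t)
        q_bot.observe(a_bot.answer(question))
    return q_bot.describe()


print(dialog_caption(video=list(range(10)), num_rounds=3))
```

In the discriminative setting described above, `ask` and `answer` would score a fixed candidate pool instead of generating free-form text; the surrounding loop stays the same.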