Saying the Unseen: Video Descriptions via Dialog Agents
- Publisher:
- Institute of Electrical and Electronics Engineers
- Publication Type:
- Journal Article
- Citation:
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, PP, (99), pp. 1-14
- Issue Date:
- 2021-06-29
Closed Access
Filename | Description | Size
---|---|---
Saying_the_Unseen_Video_Descriptions_via_Dialog_Agents.pdf | Accepted version | 955.58 kB
This item is closed access and not available.
Current vision-and-language tasks usually take complete visual data as input; in practice, however, part of the visual information often becomes inaccessible for various reasons. We introduce a novel task that aims to describe a video using natural language dialog between two agents as a supplementary information source when the visual data are incomplete. Unlike most existing vision-language tasks, where the AI has full access to visual data that may reveal sensitive information such as recognizable human faces, we intentionally limit the AI's visual input and seek a more secure and transparent information medium, i.e., the dialog, to supplement the missing visual information. Specifically, one intelligent agent, Q-BOT, is given two segmented frames and a finite number of opportunities to ask natural language questions before describing the unseen video. A-BOT, the other agent, has access to the entire video and assists Q-BOT by answering the questions. We introduce two experimental settings with either a generative (i.e., agents generate questions and answers freely) or a discriminative (i.e., agents select questions and answers from candidates) internal dialog process. With the proposed QA-Cooperative networks, we experimentally demonstrate the knowledge transfer process between the two dialog agents and the effectiveness of our method.
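The abstract's two-agent protocol can be sketched as a simple loop: Q-BOT sees only two frames, asks a fixed number of questions, A-BOT (which sees the whole video) answers, and Q-BOT then produces a description conditioned on its frames plus the dialog history. The `QBot` and `ABot` classes below are toy stand-ins for illustration only, not the paper's QA-Cooperative networks; all method names and the first/last-frame choice are assumptions.

```python
class QBot:
    """Sees only two segmented frames; asks questions, then describes."""
    def __init__(self, visible_frames):
        self.visible_frames = visible_frames
        self.history = []  # accumulated (question, answer) pairs

    def ask(self, round_idx):
        # A real model would generate (generative setting) or select
        # (discriminative setting) a question here.
        return f"Q{round_idx}: what happens between the frames I can see?"

    def observe(self, question, answer):
        self.history.append((question, answer))

    def describe(self):
        # A real model would decode a caption from frames + dialog history.
        return (f"description conditioned on {len(self.visible_frames)} "
                f"frames and {len(self.history)} QA rounds")


class ABot:
    """Has access to the entire video; answers Q-BOT's questions."""
    def __init__(self, full_video):
        self.full_video = full_video

    def answer(self, question):
        return f"answer grounded in all {len(self.full_video)} frames"


def dialog_rounds(q_bot, a_bot, num_rounds):
    """Run the finite question budget, then describe the unseen video."""
    for r in range(num_rounds):
        q = q_bot.ask(r)
        a = a_bot.answer(q)
        q_bot.observe(q, a)
    return q_bot.describe()


video = [f"frame_{i}" for i in range(8)]            # full video (A-BOT only)
q_bot = QBot(visible_frames=[video[0], video[-1]])  # two segmented frames
a_bot = ABot(video)
print(dialog_rounds(q_bot, a_bot, num_rounds=3))
```

The point of the sketch is the information flow: no raw frames beyond the two segments ever reach Q-BOT, only natural language answers, which is the privacy property the task is built around.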