Saying the Unseen: Video Descriptions via Dialog Agents
- Publisher:
- Institute of Electrical and Electronics Engineers
- Publication Type:
- Journal Article
- Citation:
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, PP, (99), pp. 1-14
- Issue Date:
- 2021-06-29
Closed Access
Filename | Description | Size
---|---|---
Saying_the_Unseen_Video_Descriptions_via_Dialog_Agents.pdf | Accepted version | 955.58 kB
This item is closed access and not available.
Current vision-and-language tasks usually take complete visual data as input; in practice, however, part of the visual information often becomes inaccessible for various reasons. We introduce a novel task that aims to describe a video using natural language dialog between two agents as a supplementary information source when the visual data are incomplete. Unlike most existing vision-language tasks, where the AI has full access to visual data that may reveal sensitive information such as recognizable human faces, we intentionally limit the AI's visual input and seek a more secure and transparent information medium, i.e., the dialog, to supplement the missing visual information. Specifically, one intelligent agent, Q-BOT, is given two segmented frames and a finite number of opportunities to ask natural language questions before describing the unseen video. A-BOT, the other agent, has access to the entire video and assists Q-BOT by answering the questions. We introduce two experimental settings with either a generative (i.e., agents generate questions and answers freely) or a discriminative (i.e., agents select questions and answers from candidates) internal dialog process. With the proposed QA-Cooperative networks, we experimentally demonstrate the knowledge transfer process between the two dialog agents and the effectiveness of our method.
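The abstract's two-agent protocol can be sketched as a simple loop: Q-BOT sees only two frames, asks a fixed number of questions, A-BOT (which sees the whole video) answers, and Q-BOT then produces a description conditioned on its frames plus the dialog history. The `QBot` and `ABot` classes below are toy stand-ins for illustration only, not the paper's QA-Cooperative networks; all method names and the first/last-frame choice are assumptions.

```python
class QBot:
    """Sees only two segmented frames; asks questions, then describes."""
    def __init__(self, visible_frames):
        self.visible_frames = visible_frames
        self.history = []  # accumulated (question, answer) pairs

    def ask(self, round_idx):
        # A real model would generate (generative setting) or select
        # (discriminative setting) a question here.
        return f"Q{round_idx}: what happens between the frames I can see?"

    def observe(self, question, answer):
        self.history.append((question, answer))

    def describe(self):
        # A real model would decode a caption from frames + dialog history.
        return (f"description conditioned on {len(self.visible_frames)} "
                f"frames and {len(self.history)} QA rounds")


class ABot:
    """Has access to the entire video; answers Q-BOT's questions."""
    def __init__(self, full_video):
        self.full_video = full_video

    def answer(self, question):
        return f"answer grounded in all {len(self.full_video)} frames"


def dialog_rounds(q_bot, a_bot, num_rounds):
    """Run the finite question budget, then describe the unseen video."""
    for r in range(num_rounds):
        q = q_bot.ask(r)
        a = a_bot.answer(q)
        q_bot.observe(q, a)
    return q_bot.describe()


video = [f"frame_{i}" for i in range(8)]            # full video (A-BOT only)
q_bot = QBot(visible_frames=[video[0], video[-1]])  # two segmented frames
a_bot = ABot(video)
print(dialog_rounds(q_bot, a_bot, num_rounds=3))
```

The point of the sketch is the information flow: no raw frames beyond the two segments ever reach Q-BOT, only natural language answers, which is the privacy property the task is built around.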