Describing Unseen Videos via Multi-modal Cooperative Dialog Agents

Zhu, Y; Wu, Y; Yang, Y; Yan, Y

Publisher: Springer International Publishing
Publication Type: Conference Proceeding
Citation: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2020, 12368 LNCS, pp. 153-169
Issue Date: 2020-01-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhu, Y
dc.contributor.author	Wu, Y
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.contributor.author	Yan, Y
dc.date.accessioned	2021-05-10T01:43:46Z
dc.date.available	2021-05-10T01:43:46Z
dc.date.issued	2020-01-01
dc.identifier.citation	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2020, 12368 LNCS, pp. 153-169
dc.identifier.isbn	9783030585914
dc.identifier.issn	0302-9743
dc.identifier.issn	1611-3349
dc.identifier.uri	http://hdl.handle.net/10453/148796
dc.description.abstract	With rising concerns about AI systems that are given direct access to abundant sensitive information, researchers seek to develop more reliable AI that relies on implicit information sources. To this end, we introduce a new task, video description via two multi-modal cooperative dialog agents, whose ultimate goal is for one conversational agent to describe an unseen video based on the dialog and two static frames. Specifically, one of the intelligent agents, Q-BOT, is given two static frames from the beginning and the end of the video, as well as a finite number of opportunities to ask relevant natural language questions before describing the unseen video. A-BOT, the other agent, which has already seen the entire video, assists Q-BOT in accomplishing the goal by providing answers to those questions. We propose a QA-Cooperative Network with a dynamic dialog history update learning mechanism to transfer knowledge from A-BOT to Q-BOT, thus helping Q-BOT to better describe the video. Extensive experiments demonstrate that Q-BOT can effectively learn to describe an unseen video with the proposed model and the cooperative learning method, achieving the promising performance where Q-BOT is given the full ground truth history dialog. Code and models are available at https://github.com/L-YeZhu/Video-Description-via-Dialog-Agents-ECCV2020.
dc.language	en
dc.publisher	Springer International Publishing
dc.relation.ispartof	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
dc.relation.isbasedon	10.1007/978-3-030-58592-1_10
dc.rights	info:eu-repo/semantics/openAccess
dc.subject.classification	Artificial Intelligence & Image Processing
dc.title	Describing Unseen Videos via Multi-modal Cooperative Dialog Agents
dc.type	Conference Proceeding
utslib.citation.volume	12368 LNCS
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	open_access	*
dc.date.updated	2021-05-10T01:43:44Z
pubs.publication-status	Published
pubs.volume	12368 LNCS
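
The dialog protocol summarized in the abstract in the record above (Q-BOT sees only two static frames, asks a finite number of questions, A-BOT answers from the full video, the dialog history is updated, and Q-BOT then describes the video) can be illustrated with a minimal Python sketch. This is not the authors' QA-Cooperative Network, which is a learned neural model; it only shows the information flow. All names (DialogState, run_dialog, the toy_* functions, VIDEO_FACTS) are hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class DialogState:
    """Everything Q-BOT may see: two static frames plus the dialog so far."""
    first_frame: str  # placeholder for the frame at the start of the video
    last_frame: str   # placeholder for the frame at the end of the video
    history: List[Tuple[str, str]] = field(default_factory=list)


def run_dialog(
    state: DialogState,
    ask: Callable[[DialogState], str],       # Q-BOT: pick the next question
    answer: Callable[[str], str],            # A-BOT: answer using the full video
    describe: Callable[[DialogState], str],  # Q-BOT: produce the final description
    max_rounds: int,
) -> str:
    """Run a fixed number of question-answer rounds, then describe the video."""
    for _ in range(max_rounds):
        question = ask(state)
        reply = answer(question)
        # Dialog history update: each round extends what Q-BOT can condition on.
        state.history.append((question, reply))
    return describe(state)


# Toy stand-ins for the two agents (assumptions, for illustration only).
# Only the answering function can read VIDEO_FACTS, mimicking A-BOT's
# privileged access to the whole video.
VIDEO_FACTS = {
    "who is in the video?": "a man",
    "what does he do?": "he opens a laptop",
}
QUESTIONS = list(VIDEO_FACTS)


def toy_ask(state: DialogState) -> str:
    # Cycle through a fixed question list based on how many rounds have passed.
    return QUESTIONS[len(state.history) % len(QUESTIONS)]


def toy_answer(question: str) -> str:
    return VIDEO_FACTS.get(question, "unknown")


def toy_describe(state: DialogState) -> str:
    # Build the description from the collected answers only.
    return "; ".join(reply for _, reply in state.history)
```

Calling run_dialog(DialogState("frame_start", "frame_end"), toy_ask, toy_answer, toy_describe, max_rounds=2) runs two rounds and builds the description from the two collected answers. Q-BOT never touches VIDEO_FACTS directly, which mirrors the task's constraint that knowledge reaches Q-BOT only through the dialog.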

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/148796