Video interactive captioning with human prompts

Wu, A; Han, Y; Yang, Y

Video interactive captioning with human prompts

Wu, A Han, Y Yang, Y

Permalink

Publication Type:: Conference Proceeding
Citation:: IJCAI International Joint Conference on Artificial Intelligence, 2019, 2019-August pp. 961 - 967
Issue Date:: 2019-01-01

Closed Access

	Filename	Description	Size
	final version of the manuscript.pdf	Accepted Manuscript version	1.42 MB		View/Open

Copyright Clearance Process

Recently Added
In Progress
Closed Access

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Wu, A	en_US
dc.contributor.author	Han, Y	en_US
dc.contributor.author	Yang, Y https://orcid.org/0000-0001-5528-0546	en_US
dc.date.accessioned	2020-04-06T05:00:37Z
dc.date.available	2020-04-06T05:00:37Z
dc.date.issued	2019-01-01	en_US
dc.identifier.citation	IJCAI International Joint Conference on Artificial Intelligence, 2019, 2019-August pp. 961 - 967	en_US
dc.identifier.isbn	9780999241141	en_US
dc.identifier.issn	1045-0823	en_US
dc.identifier.uri	http://hdl.handle.net/10453/139836
dc.description.abstract	© 2019 International Joint Conferences on Artificial Intelligence. All rights reserved. Video captioning aims at generating a proper sentence to describe the video content. As a video often includes rich visual content and semantic details, different people may be interested in different views. Thus the generated sentence always fails to meet the ad hoc expectations. In this paper, we make a new attempt that, we launch a round of interaction between a human and a captioning agent. After generating an initial caption, the agent asks for a short prompt from the human as a clue of his expectation. Then, based on the prompt, the agent could generate a more accurate caption. We name this process a new task of video interactive captioning (ViCap). Taking a video and an initial caption as input, we devise the ViCap agent which consists of a video encoder, an initial caption encoder, and a refined caption generator. We show that the ViCap can be trained via a full supervision (with ground-truth) way or a weak supervision (with only prompts) way. For the evaluation of ViCap, we first extend the MSRVTT with interaction ground-truth. Experimental results not only show the prompts can help generate more accurate captions, but also demonstrate the good performance of the proposed method.	en_US
dc.relation.ispartof	IJCAI International Joint Conference on Artificial Intelligence	en_US
dc.title	Video interactive captioning with human prompts	en_US
dc.type	Conference Proceeding
utslib.citation.volume	2019-August	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	closed_access
pubs.publication-status	Published	en_US
pubs.volume	2019-August	en_US

Abstract:

© 2019 International Joint Conferences on Artificial Intelligence. All rights reserved. Video captioning aims at generating a proper sentence to describe the video content. As a video often includes rich visual content and semantic details, different people may be interested in different views. Thus the generated sentence always fails to meet the ad hoc expectations. In this paper, we make a new attempt that, we launch a round of interaction between a human and a captioning agent. After generating an initial caption, the agent asks for a short prompt from the human as a clue of his expectation. Then, based on the prompt, the agent could generate a more accurate caption. We name this process a new task of video interactive captioning (ViCap). Taking a video and an initial caption as input, we devise the ViCap agent which consists of a video encoder, an initial caption encoder, and a refined caption generator. We show that the ViCap can be trained via a full supervision (with ground-truth) way or a weak supervision (with only prompts) way. For the evaluation of ViCap, we first extend the MSRVTT with interaction ground-truth. Experimental results not only show the prompts can help generate more accurate captions, but also demonstrate the good performance of the proposed method.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/139836