Video interactive captioning with human prompts

Publication Type:
Conference Proceeding
IJCAI International Joint Conference on Artificial Intelligence, 2019, 2019-August, pp. 961-967
Issue Date:
Filename: final version of the manuscript.pdf
Description: Accepted Manuscript version
Size: 1.42 MB (Adobe PDF)
© 2019 International Joint Conferences on Artificial Intelligence. All rights reserved.
Video captioning aims to generate a sentence that describes the content of a video. Because a video often includes rich visual content and semantic details, different people may be interested in different aspects of it, so a single generated sentence often fails to meet a viewer's ad hoc expectations. In this paper, we make a new attempt: we launch a round of interaction between a human and a captioning agent. After generating an initial caption, the agent asks the human for a short prompt as a clue to their expectation. Based on the prompt, the agent can then generate a more accurate caption. We name this new task video interactive captioning (ViCap). Taking a video and an initial caption as input, we devise a ViCap agent that consists of a video encoder, an initial caption encoder, and a refined caption generator. We show that ViCap can be trained with full supervision (with ground truth) or with weak supervision (with only prompts). To evaluate ViCap, we first extend MSRVTT with interaction ground truth. Experimental results show that the prompts help generate more accurate captions and demonstrate the good performance of the proposed method.
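The data flow of one interaction round described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract names the three components but gives no internals, so the class name `ViCapSketch`, the helper functions `mean_pool`, `tokenize`, and `toy_generator`, and the rule the toy generator uses are all hypothetical placeholders standing in for learned models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class ViCapSketch:
    """Wires together the three components named in the abstract.

    All three callables are hypothetical placeholders; the real agent
    would use learned neural modules whose design is not given here.
    """
    encode_video: Callable[[Sequence[Vector]], Vector]
    encode_caption: Callable[[str], List[str]]
    generate_refined: Callable[[Vector, List[str], str], str]

    def refine(self, frames: Sequence[Vector], initial_caption: str, prompt: str) -> str:
        # One interaction round: encode the video and the initial caption,
        # then generate a refined caption conditioned on the human prompt.
        video_state = self.encode_video(frames)
        caption_state = self.encode_caption(initial_caption)
        return self.generate_refined(video_state, caption_state, prompt)


def mean_pool(frames: Sequence[Vector]) -> Vector:
    # Placeholder video encoder: average the per-frame feature vectors.
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def tokenize(caption: str) -> List[str]:
    # Placeholder caption encoder: lowercase whitespace tokenization.
    return caption.lower().split()


def toy_generator(video_state: Vector, caption_tokens: List[str], prompt: str) -> str:
    # Toy rule (an assumption, not the paper's method): start from the prompt,
    # then append initial-caption words that the prompt does not already contain.
    # A real generator would also condition on video_state; this toy ignores it.
    prompt_tokens = prompt.lower().split()
    rest = [t for t in caption_tokens if t not in prompt_tokens]
    return " ".join(prompt_tokens + rest)


agent = ViCapSketch(mean_pool, tokenize, toy_generator)
refined = agent.refine([[0.0, 1.0], [1.0, 3.0]], "a man is cooking", "in the kitchen")
print(refined)
```

Passing the components in as callables keeps the interaction loop separate from any particular model choice, so each placeholder could be swapped for a trained module without changing `refine`.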