Show Me a Video: A Large-Scale Narrated Video Dataset for Coherent Story Illustration

Publisher:
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Publication Type:
Journal Article
Citation:
IEEE Transactions on Multimedia, 2024, 26, pp. 2456-2466
Issue Date:
2024-01-01
Illustrating a multi-sentence story with visual content is a significant challenge in multimedia research. Previous work has focused on sequential story-to-visual representation at the image level, or on representing a single sentence with a video clip; illustrating a long, multi-sentence story with coherent videos remains under-explored. In this paper, we propose the task of video-based story illustration, whose goal is to visually illustrate a story with retrieved video clips. To support this task, we first create a large-scale dataset of coherent video stories, consisting of 85K narrative stories with 60 pairs of consistent clips and texts per sample. We then propose the Story Context-Enhanced Model, which leverages local and global contextual information within the story, inspired by sequence modeling in language understanding. Comprehensive quantitative experiments demonstrate the effectiveness of our baseline model, and qualitative results together with detailed user studies show that our method retrieves coherent video sequences for stories.
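To make the retrieval setup concrete, the following is a minimal, illustrative sketch of the general idea described in the abstract: each sentence embedding is enriched with local context (neighboring sentences) and global context (the whole story), and each contextualized sentence then retrieves its nearest video-clip embedding by cosine similarity. All function names, the mixing weights, and the use of simple averaging are assumptions for illustration; the paper's actual Story Context-Enhanced Model is a learned sequence model, not this heuristic.

```python
import numpy as np

def contextualize(sent_embs, window=1):
    """Enrich each sentence embedding with local context (a window of
    neighboring sentences) and global context (the story-wide mean).
    This is a hand-rolled stand-in for a learned context model; the
    0.5 mixing weights are arbitrary illustrative choices."""
    n, _ = sent_embs.shape
    global_ctx = sent_embs.mean(axis=0)
    out = np.empty_like(sent_embs)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        local_ctx = sent_embs[lo:hi].mean(axis=0)
        out[i] = sent_embs[i] + 0.5 * local_ctx + 0.5 * global_ctx
    return out

def retrieve(query_embs, clip_embs):
    """For each contextualized sentence, return the index of the
    most similar clip embedding under cosine similarity."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    return (q @ c.T).argmax(axis=1)

# Usage with random stand-in embeddings: a 5-sentence story
# retrieving from a pool of 20 candidate clips.
rng = np.random.default_rng(0)
story = contextualize(rng.normal(size=(5, 128)))
clip_ids = retrieve(story, rng.normal(size=(20, 128)))
```

In practice, `sent_embs` and `clip_embs` would come from trained text and video encoders embedded in a shared space, and the retrieved indices would form the illustrated video sequence for the story.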