Show Me a Video: A Large-Scale Narrated Video Dataset for Coherent Story Illustration

Publisher:
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Publication Type:
Journal Article
Citation:
IEEE Transactions on Multimedia, 2024, 26, pp. 2456-2466
Issue Date:
2024-01-01
Illustrating a multi-sentence story with visual content is a significant challenge in multimedia research. Previous work has focused on sequential story-to-visual representation at the image level, or on representing a single sentence with a video clip; illustrating a long, multi-sentence story with coherent videos remains under-explored. In this paper, we propose the task of video-based story illustration, whose goal is to visually illustrate a story with retrieved video clips. To support this task, we first create a large-scale dataset of coherent video stories, consisting of 85K narrative stories with 60 pairs of consistent clips and texts per sample. We then propose the Story Context-Enhanced Model, which leverages local and global contextual information within the story, inspired by sequence modeling in language understanding. Comprehensive quantitative experiments demonstrate the effectiveness of our baseline model, and qualitative results together with detailed user studies show that our method retrieves coherent video sequences for stories.
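To make the retrieval setup concrete, the following is a minimal, illustrative sketch of the general idea described in the abstract: each sentence embedding is enriched with local context (neighboring sentences) and global context (the whole story), and each contextualized sentence then retrieves its nearest video-clip embedding by cosine similarity. All function names, the mixing weights, and the use of simple averaging are assumptions for illustration; the paper's actual Story Context-Enhanced Model is a learned sequence model, not this heuristic.

```python
import numpy as np

def contextualize(sent_embs, window=1):
    """Enrich each sentence embedding with local context (a window of
    neighboring sentences) and global context (the story-wide mean).
    This is a hand-rolled stand-in for a learned context model; the
    0.5 mixing weights are arbitrary illustrative choices."""
    n, _ = sent_embs.shape
    global_ctx = sent_embs.mean(axis=0)
    out = np.empty_like(sent_embs)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        local_ctx = sent_embs[lo:hi].mean(axis=0)
        out[i] = sent_embs[i] + 0.5 * local_ctx + 0.5 * global_ctx
    return out

def retrieve(query_embs, clip_embs):
    """For each contextualized sentence, return the index of the
    most similar clip embedding under cosine similarity."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    return (q @ c.T).argmax(axis=1)

# Usage with random stand-in embeddings: a 5-sentence story
# retrieving from a pool of 20 candidate clips.
rng = np.random.default_rng(0)
story = contextualize(rng.normal(size=(5, 128)))
clip_ids = retrieve(story, rng.normal(size=(20, 128)))
```

In practice, `sent_embs` and `clip_embs` would come from trained text and video encoders embedded in a shared space, and the retrieved indices would form the illustrated video sequence for the story.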