Uncovering the Temporal Context for Video Question Answering

Zhu, L; Xu, Z; Yang, Y; Hauptmann, AG


Publication Type: Journal Article
Citation: International Journal of Computer Vision, 2017, 124 (3), pp. 409–421
Issue Date: 2017-09-01


This item is open access.


Full metadata record

Field	Value	Language
dc.contributor.author	Zhu, L https://orcid.org/0000-0002-4093-7557	en_US
dc.contributor.author	Xu, Z	en_US
dc.contributor.author	Yang, Y https://orcid.org/0000-0001-5528-0546	en_US
dc.contributor.author	Hauptmann, AG	en_US
dc.date.issued	2017-09-01	en_US
dc.identifier.citation	International Journal of Computer Vision, 2017, 124 (3), pp. 409 - 421	en_US
dc.identifier.issn	0920-5691	en_US
dc.identifier.uri	http://hdl.handle.net/10453/126982
dc.description.abstract	© 2017, Springer Science+Business Media, LLC. In this work, we introduce Video Question Answering in the temporal domain to infer the past, describe the present and predict the future. We present an encoder–decoder approach using Recurrent Neural Networks to learn the temporal structures of videos and introduce a dual-channel ranking loss to answer multiple-choice questions. We explore approaches for finer understanding of video content using the question form of “fill-in-the-blank”, and collect our Video Context QA dataset consisting of 109,895 video clips with a total duration of more than 1000 h from existing TACoS, MPII-MD and MEDTest 14 datasets. In addition, 390,744 corresponding questions are generated from annotations. Extensive experiments demonstrate that our approach significantly outperforms the compared baselines.	en_US
dc.relation	http://purl.org/au-research/grants/arc/LP160100630
dc.relation.ispartof	International Journal of Computer Vision	en_US
dc.relation.isbasedon	10.1007/s11263-017-1033-7	en_US
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.title	Uncovering the Temporal Context for Video Question Answering	en_US
dc.type	Journal Article
utslib.citation.volume	3	en_US
utslib.citation.volume	124	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
pubs.organisational-group	/University of Technology Sydney/Students
utslib.copyright.status	open_access
pubs.issue	3	en_US
pubs.publication-status	Published	en_US
pubs.volume	124	en_US

Abstract:

© 2017, Springer Science+Business Media, LLC. In this work, we introduce Video Question Answering in the temporal domain to infer the past, describe the present and predict the future. We present an encoder–decoder approach using Recurrent Neural Networks to learn the temporal structures of videos and introduce a dual-channel ranking loss to answer multiple-choice questions. We explore approaches for finer understanding of video content using the “fill-in-the-blank” question form, and collect our Video Context QA dataset, consisting of 109,895 video clips with a total duration of more than 1,000 hours, from the existing TACoS, MPII-MD and MEDTest 14 datasets. In addition, 390,744 corresponding questions are generated from annotations. Extensive experiments demonstrate that our approach significantly outperforms the compared baselines.
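
The abstract mentions a ranking loss for multiple-choice answering but does not give its form. As a rough illustration only, the sketch below shows a generic single-channel margin (hinge) ranking loss over candidate answers. It is not the paper's dual-channel formulation. The function names, the dot-product scoring, the toy embeddings and the margin value are all assumptions made for this example.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product used here as a stand-in similarity score (an assumption)."""
    return sum(a * b for a, b in zip(u, v))


def ranking_loss(video: Sequence[float],
                 candidates: List[Sequence[float]],
                 correct_index: int,
                 margin: float = 1.0) -> float:
    """Generic hinge ranking loss: the correct answer should score at least
    `margin` higher than every incorrect candidate."""
    positive = dot(video, candidates[correct_index])
    loss = 0.0
    for i, candidate in enumerate(candidates):
        if i == correct_index:
            continue
        # Penalize only candidates that come within the margin of the correct one.
        loss += max(0.0, margin - positive + dot(video, candidate))
    return loss


def pick_answer(video: Sequence[float],
                candidates: List[Sequence[float]]) -> int:
    """Choose the candidate with the highest similarity to the video embedding."""
    scores = [dot(video, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical 2-dimensional embeddings for one clip and three candidate answers.
video_embedding = [1.0, 0.0]
answer_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.0]]

example_loss = ranking_loss(video_embedding, answer_embeddings, correct_index=0)
example_choice = pick_answer(video_embedding, answer_embeddings)
```

In a trained system the embeddings would come from learned video and text encoders rather than hand-written vectors. Here the third candidate falls inside the margin, so it is the only one that contributes to the loss.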

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/126982