MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

Gao, D; Zhou, L; Ji, L; Zhu, L; Yang, Y; Shou, MZ

Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Publication Type: Conference Proceeding
Citation: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2023, 2023-June, pp. 14773-14783
Issue Date: 2023-01-01

In Progress

	Filename	Description	Size	Format
	2212.09522v1.pdf	Published version	1.88 MB	Adobe PDF

This item is being processed and is not currently available.

Full metadata record

Field	Value	Language
dc.contributor.author	Gao, D
dc.contributor.author	Zhou, L
dc.contributor.author	Ji, L
dc.contributor.author	Zhu, L
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.contributor.author	Shou, MZ
dc.date	2023-06-17
dc.date.accessioned	2024-06-13T00:26:36Z
dc.date.available	2024-06-13T00:26:36Z
dc.date.issued	2023-01-01
dc.identifier.citation	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2023, 2023-June, pp. 14773-14783
dc.identifier.issn	1063-6919
dc.identifier.uri	http://hdl.handle.net/10453/179502
dc.description.abstract	To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, using a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios where multi-event and multi-granularity visual reasoning are required. In this work, we introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules that adaptively select frames and image regions that are closely relevant to the question itself. Visual concepts at different granularities are then processed efficiently through an attention module. In addition, MIST iteratively conducts selection and attention over multiple layers to support reasoning over multiple events. The experimental results on four VideoQA datasets, including AGQA, NExT-QA, STAR, and Env-QA, show that MIST achieves state-of-the-art performance and is superior at efficiency. The code is available at github.com/showlab/mist.
dc.language	en
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)
dc.relation.ispartof	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
dc.relation.ispartof	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
dc.relation.isbasedon	10.1109/CVPR52729.2023.01419
dc.rights	info:eu-repo/semantics/restrictedAccess
dc.title	MIST : Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering
dc.type	Conference Proceeding
utslib.citation.volume	2023-June
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
utslib.copyright.status	in_progress	*
dc.date.updated	2024-06-13T00:26:34Z
pubs.finish-date	2023-06-24
pubs.publication-status	Published
pubs.start-date	2023-06-17
pubs.volume	2023-June

Abstract:

To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, using a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios where multi-event and multi-granularity visual reasoning are required. In this work, we introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules that adaptively select frames and image regions that are closely relevant to the question itself. Visual concepts at different granularities are then processed efficiently through an attention module. In addition, MIST iteratively conducts selection and attention over multiple layers to support reasoning over multiple events. The experimental results on four VideoQA datasets, including AGQA, NExT-QA, STAR, and Env-QA, show that MIST achieves state-of-the-art performance and is superior at efficiency. The code is available at github.com/showlab/mist.
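
Illustrative sketch (not the authors' code): the abstract describes a cascade in which segments are selected first, regions are selected within them, and attention then runs only over what survives, with the whole step repeated across layers. The toy Python below shows that control flow only. Every name in it (select_then_attend, iterate, k_segments, k_regions) is hypothetical, and the dot-product scoring, mean pooling, hard top-k, and residual query update are assumptions chosen for brevity; the released implementation at github.com/showlab/mist may differ on all of these points.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def top_k(items, scores, k):
    """Return the k highest-scoring items, kept in their original order."""
    ranked = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)[:k]
    return [items[i] for i in sorted(ranked)]


def attend(query, tokens):
    """Single-query softmax attention: a weighted average of the tokens."""
    scale = math.sqrt(len(query))
    logits = [dot(query, t) / scale for t in tokens]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [
        sum(w * t[d] for w, t in zip(weights, tokens)) / total
        for d in range(len(query))
    ]


def select_then_attend(video, query, k_segments, k_regions):
    """video is a list of segments; a segment is a list of region vectors."""
    # Segment selection (assumed scoring: query . mean-pooled segment).
    seg_scores = [dot(query, mean(seg)) for seg in video]
    segments = top_k(video, seg_scores, k_segments)
    # Region selection inside each kept segment (assumed scoring: dot product).
    regions = []
    for seg in segments:
        reg_scores = [dot(query, r) for r in seg]
        regions.extend(top_k(seg, reg_scores, k_regions))
    # Attention runs over the selected regions only, not the full video.
    return attend(query, regions), regions


def iterate(video, question, layers=2, k_segments=2, k_regions=1):
    """Repeat selection and attention; each layer refines the query."""
    query = list(question)
    for _ in range(layers):
        context, _ = select_then_attend(video, query, k_segments, k_regions)
        # Assumed residual update, so later layers can pick other segments.
        query = [q + c for q, c in zip(query, context)]
    return query
```

In this toy version each layer attends over at most k_segments * k_regions region vectors, however long the video is, which is the efficiency argument the abstract makes against dense spatial-temporal self-attention.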

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/179502