Local-Global Context Aware Transformer for Language-Guided Video Segmentation.

Liang, C; Wang, W; Zhou, T; Miao, J; Luo, Y; Yang, Y

Publisher: IEEE COMPUTER SOC
Publication Type: Journal Article
Citation: IEEE Trans Pattern Anal Mach Intell, 2023, 45, (8), pp. 10055-10069
Issue Date: 2023-08

Full metadata record

Field	Value	Language
dc.contributor.author	Liang, C
dc.contributor.author	Wang, W
dc.contributor.author	Zhou, T
dc.contributor.author	Miao, J
dc.contributor.author	Luo, Y
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.date.accessioned	2024-05-05T20:30:54Z
dc.date.available	2024-05-05T20:30:54Z
dc.date.issued	2023-08
dc.identifier.citation	IEEE Trans Pattern Anal Mach Intell, 2023, 45, (8), pp. 10055-10069
dc.identifier.issn	0162-8828
dc.identifier.issn	1939-3539
dc.identifier.uri	http://hdl.handle.net/10453/178661
dc.format	Print
dc.language	eng
dc.publisher	IEEE COMPUTER SOC
dc.relation.ispartof	IEEE Trans Pattern Anal Mach Intell
dc.relation.isbasedon	10.1109/TPAMI.2023.3262578
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0801 Artificial Intelligence and Image Processing, 0806 Information Systems, 0906 Electrical and Electronic Engineering
dc.subject.classification	Artificial Intelligence & Image Processing
dc.subject.classification	4603 Computer vision and multimedia computation
dc.subject.classification	4611 Machine learning
dc.title	Local-Global Context Aware Transformer for Language-Guided Video Segmentation.
dc.type	Journal Article
utslib.citation.volume	45
utslib.location.activity	United States
utslib.for	0801 Artificial Intelligence and Image Processing
utslib.for	0806 Information Systems
utslib.for	0906 Electrical and Electronic Engineering
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
utslib.copyright.status	closed_access	*
dc.date.updated	2024-05-05T20:30:52Z
pubs.issue	8
pubs.publication-status	Published
pubs.volume	45
utslib.citation.issue	8

Abstract:

We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representations, so they struggle to capture long-term context and easily suffer from visual-linguistic misalignment. In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression efficiently. The memory has two components: one that persistently preserves global video content, and one that dynamically gathers local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, Locater holistically and flexibly comprehends the expression as an adaptive query vector for each frame. The vector is used to query the corresponding frame for mask generation. The memory also allows Locater to process videos with linear time complexity and constant-size memory, whereas Transformer-style self-attention computation scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S+ show that Locater outperforms previous state-of-the-art methods. Further, we won 1st place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge, where Locater served as the foundation for the winning solution.
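
The complexity claim in the abstract (linear time with a finite memory, versus quadratic self-attention) can be illustrated with a simple count of query-key pairs. This is only a back-of-the-envelope sketch, not Locater's actual architecture: it assumes each frame contributes a fixed number of tokens and that a memory-based model lets each frame attend to its own tokens plus a fixed number of memory slots. The function names and the values of `tokens_per_frame` and `memory_slots` are hypothetical, chosen for illustration.

```python
def full_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token attends to every token in the video."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def memory_attention_pairs(num_frames: int, tokens_per_frame: int, memory_slots: int) -> int:
    """Query-key pairs when each frame attends only to itself and a fixed-size memory."""
    keys_per_frame = tokens_per_frame + memory_slots
    return num_frames * tokens_per_frame * keys_per_frame


if __name__ == "__main__":
    # Hypothetical sizes: 100 tokens per frame, 20 memory slots.
    tokens_per_frame, memory_slots = 100, 20
    for num_frames in (10, 20, 40):
        full = full_attention_pairs(num_frames, tokens_per_frame)
        mem = memory_attention_pairs(num_frames, tokens_per_frame, memory_slots)
        print(f"frames={num_frames}: full={full}, memory-based={mem}")
```

Under these assumptions, doubling the number of frames quadruples the pair count for full self-attention but only doubles it for the memory-based variant, while the memory itself stays the same size regardless of video length.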

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/178661