Align and Tell: Boosting Text-Video Retrieval With Local Alignment and Fine-Grained Supervision

Publisher:
Institute of Electrical and Electronics Engineers (IEEE)
Publication Type:
Journal Article
Citation:
IEEE Transactions on Multimedia, vol. PP, no. 99, pp. 1-11, 2022
Issue Date:
2022-01-01
Abstract:
Text-video retrieval is one of the fundamental tasks in multimodal research and is widely used in real-world systems. Most existing approaches directly compare global representations of videos and text descriptions and train the model with a global contrastive loss, overlooking local alignment and word-level supervision signals. In this paper, we propose a new framework, called Align and Tell, for text-video retrieval. Compared with prior work, our framework adds further modules: two transformer decoders for local alignment and a captioning head that strengthens representation learning. First, we introduce a set of learnable queries that interact with both the textual and the video representations and project them to a fixed number of local features; local contrastive learning is then performed to complement the global comparison. Moreover, we design a video captioning head that provides additional supervision signals during training. This word-level supervision enhances the video representation and alleviates the cross-modal gap. The captioning head is removed at inference and therefore introduces no extra computational cost. Extensive experiments demonstrate that Align and Tell achieves state-of-the-art performance on four text-video retrieval datasets: MSR-VTT, MSVD, LSMDC, and ActivityNet-Captions.
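To make the local-alignment idea concrete, the sketch below (not the authors' released code) shows one plausible PyTorch realization of the abstract's description: a small set of learnable queries attends to per-frame or per-word features through a transformer decoder, yielding a fixed number of local features, which are then trained with a symmetric InfoNCE-style local contrastive loss that complements the global comparison. All module names, dimensions, pooling choices, and the exact loss form are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of learnable-query local alignment + local contrastive loss.
# Assumptions: feature dim 512, mean-pooled local features, temperature 0.05.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalAligner(nn.Module):
    """Maps a variable-length token sequence to K local features via learnable queries."""

    def __init__(self, dim: int = 512, num_queries: int = 8, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, L, dim) per-frame or per-word features -> (B, K, dim) local features.
        batch_queries = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return self.decoder(tgt=batch_queries, memory=tokens)


def local_contrastive_loss(video_local: torch.Tensor, text_local: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE over pooled local features; matched pairs share a batch index."""
    v = F.normalize(video_local.mean(dim=1), dim=-1)   # (B, dim)
    t = F.normalize(text_local.mean(dim=1), dim=-1)    # (B, dim)
    logits = v @ t.t() / temperature                    # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    B, L_video, L_text, D = 4, 12, 20, 512
    video_tokens = torch.randn(B, L_video, D)   # e.g. per-frame features from a video encoder
    text_tokens = torch.randn(B, L_text, D)     # e.g. per-word features from a text encoder
    video_aligner, text_aligner = LocalAligner(D), LocalAligner(D)
    loss = local_contrastive_loss(video_aligner(video_tokens), text_aligner(text_tokens))
    print(loss.item())
```

In this reading, the captioning head described in the abstract would be a separate decoder trained with word-level cross-entropy on the video features during training only, so dropping it at inference leaves the retrieval path unchanged.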