Cross-domain multi-style merge for image captioning
- Publisher:
- ACADEMIC PRESS INC ELSEVIER SCIENCE
- Publication Type:
- Journal Article
- Citation:
- Computer Vision and Image Understanding, 2023, 228
- Issue Date:
- 2023-02-01
Closed Access
Filename | Description | Size
---|---|---
Cross domaing multi style merge for image captioning.pdf | Accepted version | 894.24 kB
This item is closed access and not available.
Multi-style image captioning has attracted wide attention recently. Existing approaches mainly rely on style synthesis within a single domain; they cannot handle combinations of multiple styles, since diverse styles naturally cannot be collected into a single unified dataset. This paper is the first to investigate the cross-domain multi-style merge for image captioning. Specifically, we propose a novel image captioning model with a multi-style gated transformer block to fit the cross-domain caption generation task. Conventional generative adversarial learning methods for language may suffer from the distribution distortion problem, since real datasets do not contain captions with style combinations. We therefore devise a multi-stage self-learning framework in which the proposed model gradually exploits the real corpus with pseudo styles. Comprehensive experiments and ablation studies demonstrate the effectiveness of the proposed method on the multi-style merge for image captioning.
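The paper itself is closed access, so no implementation details are reproduced here. The PyTorch sketch below only illustrates one plausible form a "multi-style gated transformer block" could take: a soft mixture over learned style embeddings gates each sublayer's output, so that fractional style weights can express a merge of styles. All names, dimensions, and the sigmoid-gating choice are assumptions for illustration, not the authors' actual design.

```python
# Hypothetical sketch of a multi-style gated transformer block.
# The gating scheme below is an assumption; the paper's real block
# is not publicly available.
import torch
import torch.nn as nn


class MultiStyleGatedBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_styles=4, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One learned gate vector per style; a soft mixture over styles
        # decides how strongly each sublayer output passes through.
        self.style_gates = nn.Embedding(n_styles, d_model)

    def forward(self, x, style_weights):
        # x: (batch, seq, d_model)
        # style_weights: (batch, n_styles), e.g. summing to 1, so a merged
        # style such as 0.5 humorous + 0.5 romantic is just a weight vector.
        gate = torch.sigmoid(style_weights @ self.style_gates.weight)  # (batch, d_model)
        gate = gate.unsqueeze(1)  # broadcast the gate over the sequence axis
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + gate * attn_out)   # gated residual, self-attention
        x = self.norm2(x + gate * self.ff(x))  # gated residual, feed-forward
        return x


# Minimal usage: merging two styles with equal weight on the first example.
block = MultiStyleGatedBlock()
x = torch.randn(2, 10, 512)
w = torch.tensor([[0.5, 0.5, 0.0, 0.0],
                  [1.0, 0.0, 0.0, 0.0]])
out = block(x, w)  # (2, 10, 512)
```

Under this reading, passing fractional style weights at inference time is one way such a block could realize a cross-domain style merge without any training caption ever exhibiting the combined style.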