Decoupled novel object captioner

Wu, Y; Zhu, L; Jiang, L; Yang, Y

Publication Type: Conference Proceeding
Citation: MM 2018 - Proceedings of the 2018 ACM Multimedia Conference, 2018, pp. 1029-1037
Issue Date: 2018-10-15

Closed Access

Filename	Description	Size	Format
p1029-wu.pdf	Published version	2.63 MB	Adobe PDF

Full metadata record

Field	Value	Language
dc.contributor.author	Wu, Y https://orcid.org/0000-0002-1680-8253	en_US
dc.contributor.author	Zhu, L https://orcid.org/0000-0002-4093-7557	en_US
dc.contributor.author	Jiang, L	en_US
dc.contributor.author	Yang, Y https://orcid.org/0000-0001-5528-0546	en_US
dc.date.issued	2018-10-15	en_US
dc.identifier.citation	MM 2018 - Proceedings of the 2018 ACM Multimedia Conference, 2018, pp. 1029 - 1037	en_US
dc.identifier.isbn	9781450356657	en_US
dc.identifier.uri	http://hdl.handle.net/10453/131493
dc.relation.ispartof	MM 2018 - Proceedings of the 2018 ACM Multimedia Conference	en_US
dc.relation.isbasedon	10.1145/3240508.3240640	en_US
dc.title	Decoupled novel object captioner	en_US
dc.type	Conference Proceeding
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
pubs.organisational-group	/University of Technology Sydney/Students
utslib.copyright.status	closed_access
pubs.publication-status	Published	en_US

Abstract:

© 2018 Association for Computing Machinery. Image captioning is a challenging task in which a machine automatically describes an image in sentences or phrases. Training usually requires a large number of paired image-sentence annotations. However, a pre-trained captioning model can hardly be applied to a new domain that contains novel object categories, i.e., objects whose description words were unseen during model training. Captioning a novel object correctly requires professional human workers to annotate images with sentences that contain the novel words. This is labor-intensive and limits the use of such models in real-world applications. In this paper, we introduce the zero-shot novel object captioning task, in which the machine generates descriptions without extra training sentences about the novel object. To tackle this problem, we propose a Decoupled Novel Object Captioner (DNOC) framework that fully decouples the language sequence model from the object descriptions. DNOC has two components. 1) A Sequence Model with the Placeholder (SM-P) generates a sentence containing placeholders. Each placeholder represents an unseen novel object, so the sequence model is decoupled from the novel object descriptions. 2) A key-value object memory, built upon a freely available detection model, contains the visual information and the corresponding word for each object. A query generated by the SM-P is used to retrieve words from the object memory. The placeholder is then filled with the retrieved word, resulting in a caption with novel object descriptions. Experimental results on the held-out MSCOCO dataset demonstrate the ability of DNOC to describe novel concepts.
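
The two components described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the placeholder token name, the dot-product scoring, the helper names (retrieve_word, fill_placeholders), and the three-dimensional toy features are all assumptions made for illustration. It only shows the general idea of a sentence with placeholders whose queries look up words in a key-value object memory.

```python
import math

PLACEHOLDER = "<PL>"  # hypothetical token name, not taken from the paper


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_word(query, memory):
    """Return the word whose key gets the highest attention weight.

    memory is a list of (key_vector, word) pairs, one per detected object.
    Dot-product scoring is an assumption of this sketch.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key, _ in memory]
    weights = softmax(scores)
    best = max(range(len(memory)), key=lambda i: weights[i])
    return memory[best][1]


def fill_placeholders(tokens, queries, memory):
    """Replace each placeholder with a word retrieved from the object memory.

    queries holds one query vector per placeholder, in order of appearance.
    """
    remaining = iter(queries)
    return [
        retrieve_word(next(remaining), memory) if tok == PLACEHOLDER else tok
        for tok in tokens
    ]


# Toy data: made-up features standing in for detector outputs.
memory = [([0.9, 0.1, 0.0], "zebra"), ([0.0, 0.2, 0.9], "grass")]
# Sentence as a sequence model might emit it, with two placeholders.
tokens = ["a", PLACEHOLDER, "standing", "on", "the", PLACEHOLDER]
# One query vector per placeholder (made up for this example).
queries = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

caption = " ".join(fill_placeholders(tokens, queries, memory))
print(caption)  # a zebra standing on the grass
```

In this sketch the sentence template never contains the word "zebra"; the word enters the caption only through the memory lookup, which is the sense in which the sequence model is decoupled from the object vocabulary.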

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/131493