Switchable Novel Object Captioner.

Wu, Y; Jiang, L; Yang, Y

Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Publication Type: Journal Article
Citation: IEEE Trans Pattern Anal Mach Intell, 2022, PP, (1), pp. 1162-1173
Issue Date: 2022-01-25

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Wu, Y
dc.contributor.author	Jiang, L
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.date.accessioned	2023-04-06T03:51:52Z
dc.date.available	2023-04-06T03:51:52Z
dc.date.issued	2022-01-25
dc.identifier.citation	IEEE Trans Pattern Anal Mach Intell, 2022, PP, (1), pp. 1162-1173
dc.identifier.issn	0162-8828
dc.identifier.issn	1939-3539
dc.identifier.uri	http://hdl.handle.net/10453/169313
dc.description.abstract	Image captioning aims at automatically describing images by sentences. It often requires lots of paired image-sentence data for training. However, trained captioning models can hardly be applied to new domains in which some novel words exist. In this paper, we introduce the zero-shot novel object captioning task, where the machine generates descriptions about novel objects without extra training sentences. To tackle the challenging task, we mimic the way that babies talk about something unknown, using the word of a similar known object. Following this motivation, we build a key-value object memory by detection models, containing visual information and corresponding words for objects in the image. For those novel objects, we use words of most similar seen objects as proxy visual words to solve the out-of-vocabulary issue. We then propose a Switchable LSTM that incorporates knowledge from the object memory into sentence generation. The model has two switchable working modes: generating sentences like standard LSTMs, and retrieving proper nouns from the key-value memory. Thus our model fully disentangles language generation from training objects, and requires zero training sentences in describing novel objects. Experiments on three large-scale datasets demonstrate the ability of our method to describe novel concepts.
dc.format	Print-Electronic
dc.language	eng
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)
dc.relation.ispartof	IEEE Trans Pattern Anal Mach Intell
dc.relation.isbasedon	10.1109/TPAMI.2022.3144984
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	0801 Artificial Intelligence and Image Processing, 0806 Information Systems, 0906 Electrical and Electronic Engineering
dc.subject.classification	Artificial Intelligence & Image Processing
dc.title	Switchable Novel Object Captioner.
dc.type	Journal Article
utslib.citation.volume	PP
utslib.location.activity	United States
utslib.for	0801 Artificial Intelligence and Image Processing
utslib.for	0806 Information Systems
utslib.for	0906 Electrical and Electronic Engineering
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
utslib.copyright.status	open_access	*
dc.date.updated	2023-04-06T03:51:51Z
pubs.issue	1
pubs.publication-status	Published online
pubs.volume	PP
utslib.citation.issue	1

Abstract:

Image captioning aims at automatically describing images by sentences. It often requires lots of paired image-sentence data for training. However, trained captioning models can hardly be applied to new domains in which some novel words exist. In this paper, we introduce the zero-shot novel object captioning task, where the machine generates descriptions about novel objects without extra training sentences. To tackle the challenging task, we mimic the way that babies talk about something unknown, using the word of a similar known object. Following this motivation, we build a key-value object memory by detection models, containing visual information and corresponding words for objects in the image. For those novel objects, we use words of most similar seen objects as proxy visual words to solve the out-of-vocabulary issue. We then propose a Switchable LSTM that incorporates knowledge from the object memory into sentence generation. The model has two switchable working modes: generating sentences like standard LSTMs, and retrieving proper nouns from the key-value memory. Thus our model fully disentangles language generation from training objects, and requires zero training sentences in describing novel objects. Experiments on three large-scale datasets demonstrate the ability of our method to describe novel concepts.
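
The key-value object memory and proxy visual words described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how similarity between novel and seen objects is measured or how the memory is queried, so the cosine similarity over label embeddings, the dictionary layout, and every name below (cosine, proxy_word, build_object_memory, retrieve, and the toy vectors) are assumptions made only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def proxy_word(novel_label, seen_vocab, label_embeddings):
    """Pick the seen word most similar to a novel label.

    Assumption: similarity is cosine similarity between label embeddings.
    """
    target = label_embeddings[novel_label]
    return max(seen_vocab, key=lambda w: cosine(target, label_embeddings[w]))


def build_object_memory(detections, seen_vocab, label_embeddings):
    """Build a key-value memory from detector output.

    detections: list of (visual_feature, detector_label) pairs.
    Each slot keeps the visual feature as key, the detected word as value,
    and a proxy word that is guaranteed to be in the seen vocabulary.
    """
    memory = []
    for feature, label in detections:
        if label in seen_vocab:
            proxy = label
        else:
            proxy = proxy_word(label, seen_vocab, label_embeddings)
        memory.append({"key": feature, "value": label, "proxy": proxy})
    return memory


def retrieve(memory, query):
    """Hypothetical retrieval mode: return the slot whose key best matches the query."""
    return max(memory, key=lambda slot: cosine(query, slot["key"]))


# Toy data, invented for this example only.
label_embeddings = {
    "zebra": [0.9, 0.1, 0.0],
    "horse": [0.8, 0.2, 0.1],
    "bus": [0.0, 0.1, 0.9],
}
seen_vocab = ["horse", "bus"]  # words that appear in training sentences
detections = [([1.0, 0.0], "zebra"), ([0.0, 1.0], "bus")]

memory = build_object_memory(detections, seen_vocab, label_embeddings)
best = retrieve(memory, [0.9, 0.2])
print(best["value"], "via proxy", best["proxy"])
```

In this toy setup the novel label "zebra" is represented inside the language model by the seen word "horse", while the memory slot still carries the detector's own word, so a retrieval step can emit a noun that never occurred in any training sentence.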

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/169313