Fast parameter adaptation for few-shot image captioning and visual question answering

Dong, X; Zhu, L; Zhang, D; Yang, Y; Wu, F

Publication Type: Conference Proceeding
Citation: MM 2018 - Proceedings of the 2018 ACM Multimedia Conference, 2018, pp. 54-62
Issue Date: 2018-10-15

Closed Access

Filename	Description	Size	Format
FPAIT-MM-18.pdf	Published version	3.25 MB	Adobe PDF


Full metadata record

Field	Value	Language
dc.contributor.author	Dong, X https://orcid.org/0000-0001-9272-1590	en_US
dc.contributor.author	Zhu, L https://orcid.org/0000-0002-4093-7557	en_US
dc.contributor.author	Zhang, D	en_US
dc.contributor.author	Yang, Y https://orcid.org/0000-0001-5528-0546	en_US
dc.contributor.author	Wu, F	en_US
dc.date.issued	2018-10-15	en_US
dc.identifier.citation	MM 2018 - Proceedings of the 2018 ACM Multimedia Conference, 2018, pp. 54 - 62	en_US
dc.identifier.isbn	9781450356657	en_US
dc.identifier.uri	http://hdl.handle.net/10453/131494
dc.relation.ispartof	MM 2018 - Proceedings of the 2018 ACM Multimedia Conference	en_US
dc.relation.isbasedon	10.1145/3240508.3240527	en_US
dc.title	Fast parameter adaptation for few-shot image captioning and visual question answering	en_US
dc.type	Conference Proceeding
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
pubs.organisational-group	/University of Technology Sydney/Students
utslib.copyright.status	closed_access
pubs.publication-status	Published	en_US

Abstract:

© 2018 Association for Computing Machinery. Given only a few image-text pairs, humans can learn to detect semantic concepts and describe the content. Machine learning algorithms, in contrast, usually require a large amount of data to train a deep neural network for the same problem. It is challenging for existing systems to generalize well to the few-shot multi-modal scenario, because the learner must understand not only images and texts but also their relationships from only a few examples. In this paper, we tackle two multi-modal problems, image captioning and visual question answering (VQA), in the few-shot setting. We propose Fast Parameter Adaptation for Image-Text Modeling (FPAIT), which learns to learn to jointly understand image and text data from a few examples. In practice, FPAIT has two benefits. (1) Fast learning ability. FPAIT learns proper initial parameters for the joint image-text learner from a large number of different tasks. When a new task arrives, FPAIT can reach good performance with a small number of gradient steps. (2) Robustness to few examples. In few-shot tasks, the small training set introduces large biases in Convolutional Neural Networks (CNNs) and damages the learner's performance. FPAIT leverages dynamic linear transformations to alleviate the side effects of the small training set. In this way, FPAIT flexibly normalizes the features and thus reduces the biases during training. Quantitatively, FPAIT achieves superior performance on both few-shot image captioning and VQA benchmarks.
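
The fast-adaptation idea in the abstract (start from a shared initialization, then take a small number of gradient steps on a new task) can be illustrated with a minimal sketch. This is not the FPAIT implementation: the toy linear model, the `adapt` function, the learning rate, and all numeric values are assumptions chosen for illustration, and the meta-training stage that would produce the initialization is omitted.

```python
def mse_loss(params, examples):
    """Mean squared error of the toy model y = w * x + b."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


def mse_grad(params, examples):
    """Analytic gradient of mse_loss with respect to (w, b)."""
    w, b = params
    n = len(examples)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in examples) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in examples) / n
    return grad_w, grad_b


def adapt(init_params, support_set, steps=3, lr=0.1):
    """Take a few gradient steps from a shared initialization on one new task."""
    w, b = init_params
    for _ in range(steps):
        grad_w, grad_b = mse_grad((w, b), support_set)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


# Hypothetical values: a stand-in for a meta-learned initialization
# and a three-example task drawn from y = 2x + 1.
init_params = (1.0, 0.0)
support_set = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

adapted_params = adapt(init_params, support_set)
loss_before = mse_loss(init_params, support_set)
loss_after = mse_loss(adapted_params, support_set)
```

In this toy setting, comparing `loss_before` with `loss_after` shows how much three gradient steps on three examples improve the fit. In a meta-learning setup, the initialization would itself be trained so that this improvement is large across many tasks.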

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/131494