Recall What You See Continually Using GridLSTM in Image Captioning

Publisher:
Institute of Electrical and Electronics Engineers (IEEE)
Publication Type:
Journal Article
Citation:
IEEE Transactions on Multimedia, 2019, 22(3), pp. 808-818
Issue Date:
2019-07-29
File: early_access_version.pdf (Accepted Manuscript Version, Adobe PDF, 11.57 MB)
Abstract:
The goal of image captioning is to automatically describe an image with a sentence, a task that has attracted research attention from both the computer vision and natural language processing communities. The existing encoder–decoder model and its variants, which are the most popular models for image captioning, use the image features in three ways: first, they inject the encoded image features into the decoder only once at the initial step, which does not allow the rich image content to be explored sufficiently while the text caption is gradually generated; second, they concatenate the encoded image features with the text as extra input at every step, which introduces unnecessary noise; and, third, they use an attention mechanism, which increases the computational complexity because extra neural networks are introduced to identify the attention regions. Different from the existing methods, in this paper we propose a novel network, the Recall Network, for generating captions that are consistent with the images. The Recall Network selectively incorporates the visual features through a GridLSTM and is thus able to recall the image content while generating each word. By importing the visual information as the latent memory along the depth dimension of the LSTM, the decoder is able to admit the visual features dynamically through the inherent LSTM structure, without adding any extra neural networks or parameters. The Recall Network efficiently prevents the decoder from deviating from the original image content. To verify the effectiveness of our model, we conducted extensive experiments on full and dense image captioning. The experimental results clearly demonstrate that our Recall Network outperforms the conventional encoder–decoder model by a large margin and performs comparably to state-of-the-art methods.
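The recall mechanism summarized above lends itself to a compact illustration. The following is a minimal sketch, not the authors' released code, of one decoding step in which a second LSTM cell along the depth dimension carries the encoded image features as its latent memory, so visual content can be re-admitted through the ordinary LSTM gates at every word without any attention module; the class name, dimensions, and exact wiring are illustrative assumptions.

import torch
import torch.nn as nn

class RecallDecoderCell(nn.Module):
    """One decoding step: a temporal LSTM advances along the sentence,
    while a depth-dimension LSTM, whose memory is initialized from the
    encoded image features, lets each step recall visual content."""

    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        # Temporal LSTM: consumes the embedding of the previous word.
        self.time_lstm = nn.LSTMCell(embed_dim, hidden_dim)
        # Depth LSTM: consumes the temporal hidden state; its cell state
        # carries the visual information down the depth dimension.
        self.depth_lstm = nn.LSTMCell(hidden_dim, hidden_dim)

    def forward(self, word_embedding, time_state, depth_state):
        h_t, c_t = self.time_lstm(word_embedding, time_state)
        # The depth LSTM gates decide how much of the image memory to admit,
        # without introducing any extra attention parameters.
        h_d, c_d = self.depth_lstm(h_t, depth_state)
        return h_d, (h_t, c_t), (h_d, c_d)

# Illustrative usage: initialize the depth-dimension memory from projected
# CNN image features, then call the cell once per generated word.
cell = RecallDecoderCell(embed_dim=256, hidden_dim=512)
image_feat = torch.randn(1, 512)                        # stand-in for encoded image features
time_state = (torch.zeros(1, 512), torch.zeros(1, 512))
depth_state = (image_feat, image_feat.clone())          # visual features as latent depth memory
prev_word = torch.randn(1, 256)                         # stand-in for a word embedding
output, time_state, depth_state = cell(prev_word, time_state, depth_state)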