CLIP-Prefix for Image Captioning and an Experiment on Blind Image Guessing

Huynh, TM; Nguyen, DL; Nguyen, TT; Vu, TDT; Dang-Ngoc, H; Dang, DNM

Publisher: Springer Nature
Publication Type: Chapter
Citation: Industrial Networks and Intelligent Systems, 2024, 595 LNICST, pp. 189-203
Issue Date: 2024-01-01

Closed Access

	Filename	Description	Size	Format
	978-3-031-67357-3_14.pdf	Published version	1.49 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Huynh, TM
dc.contributor.author	Nguyen, DL
dc.contributor.author	Nguyen, TT
dc.contributor.author	Vu, TDT
dc.contributor.author	Dang-Ngoc, H
dc.contributor.author	Dang, DNM
dc.date.accessioned	2024-09-16T03:03:32Z
dc.date.available	2024-09-16T03:03:32Z
dc.date.issued	2024-01-01
dc.identifier.citation	Industrial Networks and Intelligent Systems, 2024, 595 LNICST, pp. 189-203
dc.identifier.isbn	9783031673566
dc.identifier.uri	http://hdl.handle.net/10453/180829
dc.description.abstract	Image caption generation resides at the intersection of computer vision and natural language processing, with its primary goal being the creation of descriptive and coherent textual narratives that faithfully depict the content of an image. This paper presents two models that leverage CLIP as the image encoder and fine-tune GPT-2 for caption generation on the Flickr30k and Flickr8k datasets. The first model utilizes a straightforward mapping network and outperforms the original architecture with a BLEU-1 score of 0.700, BLEU-4 score of 0.257, and ROUGE score of 0.569 on the Flickr8k dataset. The second model constitutes a new architecture exploring the boundaries of minimal visual information required for captioning. It incorporates CLIP’s text encoder to produce input for the generator, while the image embedding serves solely as a validation mechanism. Despite its relatively lower performance, with a BLEU-1 score of 0.546, BLEU-4 score of 0.108, and ROUGE score of 0.444 on the Flickr8k dataset, this model demonstrates the decoder’s ability to create captions based on keyword descriptions alone, without direct access to the context vector.
dc.language	en
dc.publisher	Springer Nature
dc.relation.ispartof	Industrial Networks and Intelligent Systems
dc.relation.ispartofseries	Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
dc.relation.isbasedon	10.1007/978-3-031-67357-3_14
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	CLIP-Prefix for Image Captioning and an Experiment on Blind Image Guessing
dc.type	Chapter
utslib.citation.volume	595 LNICST
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
utslib.copyright.status	closed_access	*
dc.date.updated	2024-09-16T03:03:30Z
pubs.publication-status	Published
pubs.volume	595 LNICST
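
The abstract reports BLEU-1 and BLEU-4 scores for both models. As a reading aid, the sketch below shows how a BLEU-n score can be computed for a single caption: clipped n-gram precisions for n = 1..N are combined by a geometric mean and multiplied by a brevity penalty. This is a minimal sentence-level sketch without smoothing, assuming whitespace tokenization; the record does not state the paper's evaluation setup (tokenizer, smoothing, corpus-level aggregation), so the function names and the example captions here are hypothetical and the numbers it produces are not comparable to the reported scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties go to the shorter reference.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def sentence_bleu(candidate, references, max_n):
    """BLEU-max_n with uniform weights and no smoothing (assumption)."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


# Hypothetical captions, not taken from Flickr8k or Flickr30k.
candidate = "a dog runs on the grass".split()
references = [
    "a dog is running on the grass".split(),
    "a brown dog runs across the field".split(),
]
bleu1 = sentence_bleu(candidate, references, max_n=1)
bleu4 = sentence_bleu(candidate, references, max_n=4)
```

Because BLEU-4 requires matching four-word sequences, it is much stricter than BLEU-1, which is consistent with the gap between the two figures quoted in the abstract for each model.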

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/180829