A New Hybrid Method for Caption and Scene Text Classification in Action Video Images

Nandanwar, L; Shivakumara, P; Pal, U; Lu, T; Blumenstein, M

Publisher: World Scientific Pub Co Pte Ltd
Publication Type: Journal Article
Citation: International Journal of Pattern Recognition and Artificial Intelligence, 2021, 35, (12), pp. 2160009
Issue Date: 2021-09-30

Closed Access

	Filename	Description	Size	Format
	Hybrid.pdf	Published version	3.92 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Nandanwar, L
dc.contributor.author	Shivakumara, P
dc.contributor.author	Pal, U
dc.contributor.author	Lu, T
dc.contributor.author	Blumenstein, M https://orcid.org/0000-0002-9908-3744
dc.date.accessioned	2022-06-10T03:20:16Z
dc.date.available	2022-06-10T03:20:16Z
dc.date.issued	2021-09-30
dc.identifier.citation	International Journal of Pattern Recognition and Artificial Intelligence, 2021, 35, (12), pp. 2160009
dc.identifier.issn	0218-0014
dc.identifier.issn	1793-6381
dc.identifier.uri	http://hdl.handle.net/10453/158052
dc.language	en
dc.publisher	World Scientific Pub Co Pte Ltd
dc.relation.ispartof	International Journal of Pattern Recognition and Artificial Intelligence
dc.relation.isbasedon	10.1142/S0218001421600090
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0801 Artificial Intelligence and Image Processing, 1702 Cognitive Sciences
dc.subject.classification	Artificial Intelligence & Image Processing
dc.title	A New Hybrid Method for Caption and Scene Text Classification in Action Video Images
dc.type	Journal Article
utslib.citation.volume	35
utslib.for	0801 Artificial Intelligence and Image Processing
utslib.for	1702 Cognitive Sciences
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Strength - QSI - Centre for Quantum Software and Information
utslib.copyright.status	closed_access	*
dc.date.updated	2022-06-10T03:19:39Z
pubs.issue	12
pubs.publication-status	Published
pubs.volume	35
utslib.citation.issue	12

Abstract:

Achieving a better recognition rate for text in action video images is challenging because multiple types of text appear against backgrounds with unpredictable actions. In this paper, we propose a new method for classifying caption text (which is edited text) and scene text (text that is part of the video) in video images. This work considers five action classes, namely Yoga, Concert, Teleshopping, Craft, and Recipes, where both types of text are expected to play a vital role in understanding the video content. The proposed method introduces a new fusion criterion based on Discrete Cosine Transform (DCT) and Fourier coefficients to obtain the reconstructed images for caption and scene text. The fusion criterion involves computing the variances of the coefficients at corresponding pixels of the DCT and Fourier images, and these variances are used as the respective weights. This step results in Reconstructed image-1. Inspired by the special property of Chebyshev-Harmonic-Fourier-Moments (CHFM), which can reconstruct a redundancy-free image, we explore CHFM to obtain Reconstructed image-2. The reconstructed images, along with the input image, are passed to a Deep Convolutional Neural Network (DCNN) for classification of caption and scene text. Experimental results on the five action classes and a comparative study with existing methods demonstrate that the proposed method is effective. In addition, recognition results obtained from different methods before and after classification show that recognition performance improves significantly after classification.
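
The variance-weighted fusion described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not say how the variances are computed or how the weighted coefficients are combined, so the 3 x 3 neighbourhood, the use of coefficient magnitudes, the normalisation of the two weights to sum to one, and the summation of the two inverse transforms are all assumptions made here for illustration. The function names (dct_matrix, local_variance, fuse_dct_fourier) are hypothetical.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix C, so that C @ x is the DCT of x."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c


def local_variance(a: np.ndarray, size: int = 3) -> np.ndarray:
    """Variance over each position's size x size neighbourhood (edge-padded)."""
    pad = size // 2
    h, w = a.shape
    padded = np.pad(a, pad, mode="edge")
    shifted = [padded[r:r + h, c:c + w] for r in range(size) for c in range(size)]
    return np.stack(shifted).var(axis=0)


def fuse_dct_fourier(image: np.ndarray, size: int = 3) -> np.ndarray:
    """Illustrative variance-weighted fusion of DCT and Fourier coefficients."""
    img = image.astype(np.float64)
    rows, cols = img.shape
    c_r, c_c = dct_matrix(rows), dct_matrix(cols)

    dct_coeffs = c_r @ img @ c_c.T                 # 2-D orthonormal DCT-II
    fft_coeffs = np.fft.fft2(img, norm="ortho")    # 2-D DFT on a comparable scale

    # Assumption: variances are taken over local windows of coefficient magnitudes.
    var_d = local_variance(np.abs(dct_coeffs), size)
    var_f = local_variance(np.abs(fft_coeffs), size)

    # Assumption: the variances act as weights after normalising them to sum to 1.
    total = var_d + var_f
    safe_total = np.where(total > 0, total, 1.0)
    w_d = np.where(total > 0, var_d / safe_total, 0.5)
    w_f = 1.0 - w_d

    # Assumption: each weighted coefficient map is inverted and the results summed.
    rec_d = c_r.T @ (w_d * dct_coeffs) @ c_c
    rec_f = np.real(np.fft.ifft2(w_f * fft_coeffs, norm="ortho"))
    return rec_d + rec_f
```

Calling fuse_dct_fourier on a 2-D grayscale array returns an array of the same shape, standing in for Reconstructed image-1 under the assumptions above. The real part is taken after the inverse Fourier transform because the real-valued weights need not preserve the conjugate symmetry of the spectrum.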

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/158052