A New Transformer-Based Approach for Text Detection in Shaky and Non-shaky Day-Night Video

Publisher:
Springer
Publication Type:
Conference Proceeding
Citation:
Pattern Recognition, 2023, LNCS 14407, pp. 30-44
Issue Date:
2023-01-01
File:
978-3-031-47637-2_3.pdf (Published version, Adobe PDF, 4.89 MB)
Abstract:
Text detection in shaky and non-shaky videos is challenging because of the variations between day and night footage. In addition, moving objects, vehicles, and humans in the video make text detection more challenging than in ordinary natural scene images. Motivated by the capacity of the transformer, we propose a new transformer-based approach for detecting text in both shaky and non-shaky day-night videos. To reduce the effect of object movement, poor quality, and the other challenges mentioned above, the proposed work explores temporal frames to obtain activation frames based on similarity and dissimilarity measures. To estimate similarity and dissimilarity, our method extracts luminance, contrast, and structural features. The activation frames are fed to a transformer comprising an encoder, a decoder, and a feed-forward network for text detection in shaky and non-shaky day-night video. Since this is the first work on the problem, we create our own dataset for experimentation. To demonstrate the effectiveness of the proposed method, experiments are also conducted on a standard dataset, the ICDAR-2015 video dataset. The results on both our dataset and the standard dataset show that the proposed model is superior to state-of-the-art methods in terms of recall, precision, and F-measure.
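The luminance, contrast, and structural features named in the abstract correspond to the three components of the structural similarity index (SSIM). The sketch below shows one plausible way to select activation frames by thresholding frame-to-frame SSIM; the global (window-free) SSIM computation, the threshold value, and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ssim(frame_a: np.ndarray, frame_b: np.ndarray,
         c1: float = (0.01 * 255) ** 2,
         c2: float = (0.03 * 255) ** 2) -> float:
    """Global SSIM between two grayscale frames, combining the
    luminance, contrast, and structure terms of Wang et al. (2004)."""
    a = frame_a.astype(np.float64)
    b = frame_b.astype(np.float64)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov_ab = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov_ab + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def select_activation_frames(frames, threshold=0.9):
    """Assumed selection rule: keep a frame when it is sufficiently
    dissimilar (SSIM below `threshold`) from the last kept frame."""
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        if ssim(frames[kept[-1]], frames[i]) < threshold:
            kept.append(i)
    return kept

# Usage: five synthetic 64x64 grayscale frames.
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, (64, 64), dtype=np.uint8) for _ in range(5)]
print(select_activation_frames(frames))
```

Under this assumed rule, frames whose SSIM with the last kept frame falls below the threshold are treated as carrying new content and become candidate activation frames for the transformer; the paper's exact measure and selection criterion may differ.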