An approach for detecting and cleaning of struck-out handwritten text

Chaudhuri, BB; Adak, C

Publication Type: Journal Article
Citation: Pattern Recognition, 2017, 61, pp. 282-294
Issue Date: 2017-01-01

Closed Access

Filename	Description	Size	Format
1-s2.0-S003132031630190X-main.pdf	Published Version	2.04 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Chaudhuri, BB	en_US
dc.contributor.author	Adak, C https://orcid.org/0000-0002-9085-2770	en_US
dc.date.issued	2017-01-01	en_US
dc.identifier.citation	Pattern Recognition, 2017, 61 pp. 282 - 294	en_US
dc.identifier.issn	0031-3203	en_US
dc.identifier.uri	http://hdl.handle.net/10453/125478
dc.relation.ispartof	Pattern Recognition	en_US
dc.relation.isbasedon	10.1016/j.patcog.2016.07.032	en_US
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.title	An approach for detecting and cleaning of struck-out handwritten text	en_US
dc.type	Journal Article
utslib.citation.volume	61	en_US
utslib.for	0899 Other Information and Computing Sciences	en_US
utslib.for	0906 Electrical and Electronic Engineering	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
utslib.for	0806 Information Systems	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	closed_access
pubs.publication-status	Published	en_US
pubs.volume	61	en_US

Abstract:

© 2016 Elsevier Ltd. This paper deals with the identification and processing of struck-out text in unconstrained offline handwritten document images. If passed to an OCR engine, such text produces nonsense character strings. We present a method for identifying such text that combines (a) pattern classification and (b) a graph-based approach. For (a), a feature-based two-class (normal vs. struck-out text) SVM classifier detects moderate-sized struck-out components. For (b), the skeleton of the text component is treated as a graph, and the strike-out stroke is identified using a constrained shortest path algorithm. To identify zigzag or wavy strike-outs, all paths are found and properties of zigzag and wavy lines are used. Some other types of strike-out stroke are detected by modifying this method. Large multi-word and multi-line struck-out regions are segmented into smaller components and treated in the same way. The detected struck-out text can then be blocked from entering the OCR engine. In another kind of application, involving historical documents, page images are to be generated along with their annotated ground truth. In this case the strike-out strokes can be deleted from the words, which are then fed to the OCR engine. For this purpose an inpainting-based cleaning approach is employed. We worked on 500 pages of documents and obtained an overall F-measure of 91.56% for English and 91.06% for Bengali script in struck-out text detection. For strike-out stroke identification and deletion, the F-measures (English, with Bengali in parentheses) were 89.65% (89.31%) and 91.16% (89.29%), respectively.
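
The graph-based step (b) can be illustrated with a small sketch. The abstract does not state which constraints the paper's constrained shortest path algorithm uses, so the version below is only an assumption-laden illustration, not the authors' method: it treats every skeleton pixel as a graph node, connects 8-neighbours, and runs Dijkstra's algorithm from the leftmost to the rightmost pixel, with an extra cost on steps that change row so that near-horizontal paths are preferred. The names `skeleton_pixels`, `candidate_stroke`, and `slope_penalty`, and the `#`/`.` grid encoding, are hypothetical.

```python
import heapq


def skeleton_pixels(rows):
    """Return the set of (row, col) coordinates marked '#' in a text grid."""
    return {
        (r, c)
        for r, line in enumerate(rows)
        for c, ch in enumerate(line)
        if ch == "#"
    }


def candidate_stroke(rows, slope_penalty=2.0):
    """Find a cheapest left-to-right path through the skeleton pixels.

    Assumption (not from the paper): a straight strike-out is roughly
    horizontal, so each step that changes row pays `slope_penalty` on top
    of its Euclidean length. Returns the path as a list of (row, col)
    pixels, or [] if the leftmost and rightmost pixels are not connected.
    """
    pixels = skeleton_pixels(rows)
    if not pixels:
        return []

    # Endpoints: leftmost and rightmost skeleton pixels (ties go to the top row).
    source = min(pixels, key=lambda p: (p[1], p[0]))
    target = max(pixels, key=lambda p: (p[1], -p[0]))

    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        r, c = node
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                nxt = (r + dr, c + dc)
                if nxt not in pixels:
                    continue
                step = (dr * dr + dc * dc) ** 0.5 + slope_penalty * abs(dr)
                nd = d + step
                if nd < dist.get(nxt, float("inf")):
                    dist[nxt] = nd
                    prev[nxt] = node
                    heapq.heappush(heap, (nd, nxt))

    if target not in dist:
        return []

    # Walk the predecessor links back from the target to the source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    # Toy skeleton: two vertical letter strokes crossed by a horizontal line.
    grid = [
        "..#...#..",
        "..#...#..",
        "#########",
        "..#...#..",
        "..#...#..",
    ]
    print(candidate_stroke(grid))
```

On the toy grid the returned path runs along row 2, the horizontal line crossing the two vertical strokes. A real pipeline would first binarize and skeletonize the word image and would need further checks (for example on path length relative to word width) before calling the path a strike-out; those checks are outside this sketch.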

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/125478