Arbitrary-Shape Scene Text Detection and Its Application in Educational Resource Navigation

Publication Type: Thesis
Issue Date: 2022
Text instances are widespread information carriers in natural scenes, videos and document photos. However, localizing text instances of arbitrary shape is challenging because their style, colour, size, aspect ratio and shape vary greatly with the usage scenario. These variations hinder information retrieval and the digitization of raw photos and videos, and the situation worsens when the raw photos and videos are for educational purposes. In this thesis, we address the problem of arbitrary-shape scene text detection by proposing two deep learning-based bottom-up approaches. We then create a navigation system for slide-based educational resources that uses the semantic information of the detected text as its primary cue.

In the first approach, we revitalize GCN-based bottom-up text detection frameworks by aggregating the visual-relational features of text with two effective false positive/negative suppression mechanisms. First, dense overlapping text segments depicting the “characterness” and “streamline” of text are generated for further relational reasoning and weakly supervised segment classification. Then, a Location-Aware Transfer (LAT) module is designed to transfer the relational features of text into visually compatible features, with a Fuse Decoding (FD) module to enhance the representation of text regions for the second suppression step. Finally, a novel multiple-text-map-aware contour-approximation strategy is developed to replace the route-finding process.

In the second approach, to build reliable connections between text segments and alleviate error accumulation in bottom-up modelling, we propose a novel approach that captures the regularity of text by embedding deep morphology for arbitrary-shape text detection, so as to regularize false text segment detections and link missing connections. To this end, two deep morphological modules are designed to regularize text segments and determine the linkage between them. First, a Deep Morphological Opening (DMOP) module is constructed to remove false text segment detections accumulated during feature extraction. Then, a Deep Morphological Closing (DMCL) module is proposed to allow text instances of various shapes to stretch their morphology in all directions while deriving their connections.

Using the arbitrary-shape text detected in educational resources as the primary cue, we propose a slide-based video navigation tool that extracts the hierarchical structure and semantic relationships of visual entities in videos by integrating multi-channel information. A clustering approach is proposed for restoring the hierarchical relationship between visual entities. The restored visual entities are then associated with their corresponding audio speech text by evaluating their semantic relationship.
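
For readers unfamiliar with the morphological operators that the DMOP and DMCL modules are named after, the following is a minimal sketch of classical binary opening and closing on a toy text mask. It is not the thesis's learned modules: the function names, the 3 x 3 square structuring element and the toy mask are all assumptions made for illustration. Opening (erosion, then dilation) removes an isolated false detection; closing (dilation, then erosion) bridges a one-pixel gap between two neighbouring segments.

```python
import numpy as np


def _sweep(mask, k, combine, init):
    """Combine all k x k shifted views of a boolean mask (outside = background)."""
    pad = k // 2
    padded = np.pad(mask, pad, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.full(mask.shape, init, dtype=bool)
    for dy in range(k):
        for dx in range(k):
            out = combine(out, padded[dy:dy + h, dx:dx + w])
    return out


def dilate(mask, k=3):
    """A pixel is set if any pixel in its k x k neighbourhood is set."""
    return _sweep(mask, k, np.logical_or, False)


def erode(mask, k=3):
    """A pixel stays set only if its whole k x k neighbourhood is set."""
    return _sweep(mask, k, np.logical_and, True)


def opening(mask, k=3):
    """Erosion followed by dilation: removes blobs smaller than the element."""
    return dilate(erode(mask, k), k)


def closing(mask, k=3):
    """Dilation followed by erosion: fills gaps narrower than the element."""
    return erode(dilate(mask, k), k)


# Toy mask (hypothetical data): two text segments separated by a
# one-pixel gap at column 6, plus one isolated noise pixel.
mask = np.zeros((7, 14), dtype=bool)
mask[2:5, 2:6] = True    # first segment
mask[2:5, 7:11] = True   # second segment
mask[0, 13] = True       # isolated false detection

opened = opening(mask)   # noise pixel removed, segments unchanged
closed = closing(opened) # gap between the two segments bridged
```

In this sketch the structuring element is fixed by hand; the abstract describes the thesis's modules as deep (learned) counterparts, so the example only conveys the intuition behind the two operations.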