Deep Learning-Based Text Detection and Recognition

Publication Type: Thesis
Issue Date: 2020
Text plays a critical role in our daily lives. It is everywhere, from slogans on posters to licence plates on cars, transmitting information and knowledge. With the popularity of camera-equipped mobile devices, more and more text is collected, transmitted and stored as text images, so automatically reading text from images has high application potential, and related research has attracted considerable attention from the computer vision community. Scene text and handwritten text are the two most difficult kinds of text to read automatically because of the challenges posed by complex backgrounds, uncertain capturing conditions, diverse text appearances, touching characters and varied handwriting styles. Text detection, i.e., localizing text areas in images, and text recognition, i.e., transcribing located text areas into character sequences, are the two key steps of robust text reading. In recent years, both have entered a deep learning era in which the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) play important roles. Here, we conduct research on CNN- and LSTM-based text detection and recognition, as presented below.

1. To improve the recall rate of small text areas in oriented text detection, we propose an Xception-based multi-ASPP-assembled scene text detector named DeepText. DeepText inserts multiple Atrous Spatial Pyramid Pooling (ASPP) modules into Xception after feature maps of different resolutions to retain richer information for small text areas, and introduces auxiliary connections and auxiliary losses to speed up convergence and boost the discrimination ability of the lower encoder layers (an ASPP sketch follows this list).

2. To address the issue that Mask R-CNN cannot fully leverage global information when making predictions, we propose a scene text detector named GMask R-CNN, in which a global mask module performs semantic segmentation by considering global information (see the sketch after this list).

3. To tackle the problem that LSTM neglects the valuable spatial and structural information of 2-D text images, we propose two scene text recognisers: FACLSTM, which exploits convolutional LSTM to perform sequential transcription directly in 2-D space (a ConvLSTM cell sketch follows this list), and ReELFA, which utilizes one-hot encoded locations to enhance features with pixels' spatial information.

4. To solve the problem that CNNs with fully connected layers are unsuitable for sequential prediction tasks because they require fixed-size inputs and outputs, we propose a CNN-based handwritten text recogniser named CFSPP. CFSPP embeds a Spatial Pyramid Pooling-based intermediate layer between the convolutional and fully connected layers to convert arbitrary-size feature maps into fixed-length feature vectors (an SPP sketch follows this list).
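The abstract does not include code, so the following is a minimal PyTorch sketch of a generic ASPP module as popularized by the DeepLab family, not DeepText's exact configuration; the class name, channel counts and dilation rates are illustrative assumptions. The idea is to run parallel dilated convolutions at one resolution so small text regions are seen at several receptive-field sizes without downsampling.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Minimal Atrous Spatial Pyramid Pooling: parallel dilated
    convolutions whose outputs are concatenated and fused by a 1x1 conv."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch,
                          kernel_size=3 if r > 1 else 1,
                          padding=r if r > 1 else 0,
                          dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True))
            for r in rates])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # Each branch sees a different receptive field at the same spatial
        # resolution, which helps retain detail for small text areas.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```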
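For GMask R-CNN, one plausible reading of a "global mask module" is a small fully convolutional head that segments text over the entire backbone feature map, complementing Mask R-CNN's per-RoI mask branch, which only sees cropped regions. The sketch below is an assumption about the general shape of such a head, not the thesis's actual architecture; the class name and layer widths are hypothetical.

```python
import torch.nn as nn

class GlobalMaskHead(nn.Module):
    """A fully convolutional head predicting a text/non-text segmentation
    map over the whole feature map, so predictions can draw on global
    context rather than isolated RoI crops."""
    def __init__(self, in_ch, mid_ch=256, num_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, num_classes, 1))

    def forward(self, feats):   # feats: (N, C, H, W) backbone output
        return self.head(feats) # per-pixel class logits over the full map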
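FACLSTM's key ingredient, convolutional LSTM, replaces the matrix multiplications of a standard LSTM with convolutions so that hidden states remain 2-D feature maps and spatial structure survives across time steps. Below is a minimal sketch of a standard ConvLSTM cell (without peepholes); it illustrates the general mechanism, not FACLSTM's specific design, and the names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: gates are computed with convolutions over
    the concatenated input and hidden maps, preserving spatial layout."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch,
                               kernel_size, padding=pad)

    def forward(self, x, state):  # x: (N, C_in, H, W)
        h, c = state              # hidden/cell maps: (N, C_hid, H, W)
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# Usage: initial hidden and cell states are typically zero maps.
cell = ConvLSTMCell(in_ch=64, hid_ch=32)
x = torch.randn(2, 64, 8, 25)
h = c = torch.zeros(2, 32, 8, 25)
h, c = cell(x, (h, c))
```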
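Finally, the Spatial Pyramid Pooling idea that CFSPP builds on can be shown compactly: pooling an arbitrary-size feature map at several fixed grid sizes yields a vector whose length depends only on the channel count and the pyramid levels. This is a sketch of generic SPP (after SPP-net), not CFSPP's exact intermediate layer; the pyramid levels are assumed for illustration.

```python
import torch
import torch.nn.functional as F
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    """Pools an arbitrary-size feature map into a fixed-length vector via
    adaptive max pooling at several grid sizes, then concatenation."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):  # x: (N, C, H, W) with any H and W
        pooled = [F.adaptive_max_pool2d(x, l).flatten(start_dim=1)
                  for l in self.levels]
        # Output length is C * sum(l*l for l in levels), independent of H, W,
        # so the following fully connected layers see a fixed-size input.
        return torch.cat(pooled, dim=1)
```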