Translating Arabic as low resource language using distribution representation and neural machine translation models
The rapid growth of social media platforms has made communication between users easier, which in turn has increased the importance of translating human languages. Machine translation has been widely used to translate many languages using different approaches, such as rule-based translation, statistical machine translation and, more recently, neural machine translation. The quality of machine translation depends on the availability of parallel datasets. Languages that lack sufficient datasets pose many challenges for processing and analysis; these are referred to as low-resource languages. This research focuses on low-resource languages, particularly Arabic and its dialects. Dialectal Arabic can be treated as non-standard text that is used in Arab social media and needs to be translated into its standard form. In this context, machine translation has recently received increased attention. Unlike English and other languages, the translation of Arabic and its dialects has not been thoroughly investigated: existing attempts were mostly based on statistical and rule-based approaches, while neural network approaches have hardly been considered. Therefore, a distributed representation model (embedding model) is proposed to translate dialectal Arabic to Modern Standard Arabic. Because Arabic is a morphologically rich language in which the same word takes many different forms, the proposed model can help capture more linguistic features, such as semantic and syntactic features, without any hand-written rules. Another benefit of the proposed model is that it can be trained on monolingual datasets instead of parallel datasets. The model was used to translate Egyptian dialect text to Modern Standard Arabic. We also built monolingual datasets from available resources, together with a small parallel dictionary. Different datasets were used to evaluate the performance of the proposed method.
This research provides new insight into dialectal Arabic translation. Recently, there has been increased interest in Neural Machine Translation (NMT). NMT is a deep-learning-based approach that is trained on large parallel datasets with the aim of mapping text from the source language to the target language. While it shows promising results for high-resource languages such as English, low-resource languages remain challenging for NMT. A number of NMT-based models have therefore been developed to translate low-resource languages, for instance pre-trained models that utilize monolingual datasets. These models, however, operate at the word level and rely on recurrent neural networks, which have some limitations. We therefore proposed a hybrid model that combines recurrent and convolutional neural networks at the character level to translate low-resource languages.
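To make the embedding-based idea more concrete, the following is a minimal sketch of how dialect words might be matched to Modern Standard Arabic words when only monolingual embeddings and a small seed dictionary are available. It is not the model described in this abstract: it substitutes a simple anchor-based matching scheme, in which each word is described by its cosine similarities to the seed dictionary entries. The three-dimensional vectors, the seed pairs and the names `DIALECT_VECS`, `MSA_VECS`, `SEED_DICT`, `anchor_profile` and `translate` are all hypothetical and chosen only for illustration.

```python
import math

# Toy 3-d embeddings, invented for illustration only. In practice each table
# would be learned separately from a monolingual corpus, so the two vector
# spaces are not directly comparable.
DIALECT_VECS = {            # Egyptian dialect words
    "ازاي":   [1.0, 0.0, 0.1],   # "how"
    "دلوقتي": [0.0, 1.0, 0.1],   # "now"
    "عايز":   [0.6, 0.2, 0.9],   # "want"
    "فين":    [0.9, 0.3, 0.0],   # "where"
}
MSA_VECS = {                # Modern Standard Arabic words
    "كيف":  [0.1, 1.0, 0.0],     # "how"
    "الآن": [0.1, 0.0, 1.0],     # "now"
    "أريد": [0.9, 0.5, 0.2],     # "want"
    "أين":  [0.0, 0.9, 0.4],     # "where"
}

# Small seed dictionary (assumed) linking the two vocabularies.
SEED_DICT = [("ازاي", "كيف"), ("دلوقتي", "الآن")]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_profile(vec, anchors):
    """Describe a word by its similarity to each seed (anchor) word."""
    return [cosine(vec, anchor) for anchor in anchors]


def translate(word):
    """Return the MSA word whose anchor profile is closest to the dialect word's."""
    src_anchors = [DIALECT_VECS[s] for s, _ in SEED_DICT]
    tgt_anchors = [MSA_VECS[t] for _, t in SEED_DICT]
    query = anchor_profile(DIALECT_VECS[word], src_anchors)

    def squared_distance(candidate):
        profile = anchor_profile(MSA_VECS[candidate], tgt_anchors)
        return sum((q - p) ** 2 for q, p in zip(query, profile))

    return min(MSA_VECS, key=squared_distance)


if __name__ == "__main__":
    for dialect_word in ("عايز", "فين"):
        print(dialect_word, "->", translate(dialect_word))
```

Because the profiles are computed within each vocabulary separately, no sentence-aligned parallel corpus is needed, only the seed pairs. With real embeddings, a larger seed dictionary would be required for the profiles to be discriminative.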