Translating dialectal Arabic as low resource language using word embedding

Publication Type:
Conference Proceeding
International Conference Recent Advances in Natural Language Processing, RANLP, 2017, 2017-September pp. 52 - 57
Issue Date:
Full metadata record
© 2018 Association for Computational Linguistics (ACL). All rights reserved. A number of machine translation methods have been proposed in recent years to deal with the increasingly important problem of automatic translation between texts of different languages or languages and their dialects. These methods have produced promising results when applied to some of the widely studied languages. Existing translation methods are mainly implemented using rule-based and static machine translation approaches. Rule based approaches utilize language translation rules that can either be constructed by an expert, which is quite difficult when dealing with dialects, or rely on rule construction algorithms, which require very large parallel datasets. Statistical approaches also require large parallel datasets to build the translation models. However, large parallel datasets do not exist for languages with low resources, such as the Arabic language and its dialects. In this paper we propose an algorithm that attempts to overcome this limitation, and apply it to translate the Egyptian dialect (EGY) to Modern Standard Arabic (MSA). Monolingual corpus was collected for both MSA and EGY and a relatively small parallel language pair set was built to train the models. The proposed method utilizes Word embeddings as it requires monolingual data rather than parallel corpus. Both Continuous Bag of Words and Skip-gram were used to build word vectors. The proposed method was validated on four different datasets using a four-fold cross validation approach.
Please use this identifier to cite or link to this item: