Translating Arabic as low resource language using distribution representation and neural machine translation models

Almansor, Ebtesam Hussain

Publication Type: Thesis
Issue Date: 2018

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Almansor, Ebtesam Hussain
dc.date.accessioned	2019-01-09T01:39:09Z
dc.date.available	2019-01-09T01:39:09Z
dc.date.issued	2018
dc.identifier.uri	http://hdl.handle.net/10453/129360
dc.description	University of Technology Sydney. Faculty of Engineering and Information Technology.	en_AU
dc.description.abstract	The rapid growth of social media platforms has made communication between users easier, which in turn has increased the importance of translating between human languages. Machine translation has been widely used to translate many languages using approaches such as rule-based machine translation, statistical machine translation and, more recently, neural machine translation. The quality of machine translation depends on the availability of parallel datasets. Languages that lack sufficient datasets, referred to as low-resource languages, pose many challenges for processing and analysis. In this research, we mainly focused on low-resource languages, particularly Arabic and its dialects. Dialectal Arabic can be treated as non-standard text that is used in Arab social media and needs to be translated into its standard form. In this context, machine translation has recently gained importance and attention. Unlike for English and other languages, the translation of Arabic and its dialects has not been thoroughly investigated: existing attempts were mostly based on statistical and rule-based approaches, while neural network approaches have hardly been considered. Therefore, a distribution representation model (embedding model) has been proposed to translate dialectal Arabic to Modern Standard Arabic. Because Arabic is a morphologically rich language with many forms of the same word, the proposed model can help capture more linguistic features, such as semantic and syntactic features, without any rules. Another benefit of the proposed model is that it can be trained on monolingual datasets instead of parallel datasets. This model was used to translate Egyptian dialect text to Modern Standard Arabic. We also built monolingual datasets from available resources and a small parallel dictionary. Different datasets were used to evaluate the performance of the proposed method. This research provides new insight into dialectal Arabic translation.
Recently, there has been increased interest in Neural Machine Translation (NMT). NMT is a deep-learning-based approach in which a model is trained on large parallel datasets with the aim of mapping text from the source language to the target language. While it shows promising results for high-resource languages such as English, low-resource languages remain challenging for NMT. Therefore, a number of NMT-based models have been developed to translate low-resource languages, for instance pre-trained models that utilize monolingual datasets. Whereas these models operate at the word level and use recurrent neural networks, which have some limitations, we proposed a hybrid model that combines recurrent and convolutional neural networks at the character level to translate low-resource languages.	en_AU
dc.format	Thesis (MSc)
dc.language.iso	en_AU	en_AU
dc.relation	https://opus.lib.uts.edu.au/bitstream/10453/129360/2/02whole.pdf
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	au.edu.uts.lib/ppc
dc.rights	The author owns the copyright in this thesis including all reproduction and reuse rights for the work. The work may not be altered without the permission of the copyright owner. Attribution is essential when quoting or paraphrasing from this thesis.
dc.subject	Translating Arabic
dc.subject	Neural machine translation models
dc.subject	Low resource language
dc.title	Translating Arabic as low resource language using distribution representation and neural machine translation models	en_AU
dc.type	Thesis	en_AU
utslib.copyright.status	open_access

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/129360