Translating dialectal Arabic as low resource language using word embedding

Almansor, EH; Al-Ani, A

Publication Type: Conference Proceeding
Citation: International Conference Recent Advances in Natural Language Processing, RANLP, 2017, 2017-September pp. 52 - 57
Issue Date: 2017-01-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Almansor, EH	en_US
dc.contributor.author	Al-Ani, A https://orcid.org/0000-0002-8092-8954	en_US
dc.date.issued	2017-01-01	en_US
dc.identifier.citation	International Conference Recent Advances in Natural Language Processing, RANLP, 2017, 2017-September pp. 52 - 57	en_US
dc.identifier.isbn	9789544520489	en_US
dc.identifier.issn	1313-8502	en_US
dc.identifier.uri	http://hdl.handle.net/10453/125972
dc.relation.ispartof	International Conference Recent Advances in Natural Language Processing, RANLP	en_US
dc.relation.isbasedon	10.26615/978-954-452-049-6-008	en_US
dc.rights	info:eu-repo/semantics/openAccess
dc.title	Translating dialectal Arabic as low resource language using word embedding	en_US
dc.type	Conference Proceeding
utslib.citation.volume	2017-September	en_US
utslib.for	0802 Computation Theory and Mathematics	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Biomedical Engineering
pubs.organisational-group	/University of Technology Sydney/Strength - CHT - Health Technologies
pubs.organisational-group	/University of Technology Sydney/Students
utslib.copyright.status	open_access	*
pubs.publication-status	Published	en_US
pubs.volume	2017-September	en_US

Abstract:

© 2018 Association for Computational Linguistics (ACL). All rights reserved. A number of machine translation methods have been proposed in recent years to deal with the increasingly important problem of automatic translation between texts of different languages, or between languages and their dialects. These methods have produced promising results when applied to some of the widely studied languages. Existing translation methods are mainly implemented using rule-based and statistical machine translation approaches. Rule-based approaches utilize language translation rules that can either be constructed by an expert, which is quite difficult when dealing with dialects, or rely on rule construction algorithms, which require very large parallel datasets. Statistical approaches also require large parallel datasets to build the translation models. However, large parallel datasets do not exist for languages with low resources, such as the Arabic language and its dialects. In this paper we propose an algorithm that attempts to overcome this limitation, and apply it to translate the Egyptian dialect (EGY) to Modern Standard Arabic (MSA). Monolingual corpora were collected for both MSA and EGY, and a relatively small parallel language-pair set was built to train the models. The proposed method utilizes word embeddings, as they require monolingual data rather than a parallel corpus. Both Continuous Bag of Words and Skip-gram were used to build word vectors. The proposed method was validated on four different datasets using a four-fold cross-validation approach.
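
The abstract names Continuous Bag of Words (CBOW), Skip-gram and four-fold cross-validation but gives no implementation details. The sketch below is an illustration only, not the authors' code: it shows how the two embedding objectives typically form training examples from a tokenized sentence, and how a dataset can be divided into four folds. The function names, the window size and the placeholder tokens are all assumptions made for this example.

```python
from typing import Iterator, List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram: each centre word predicts every word in its context window."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: the context words together predict the centre word."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, centre))
    return pairs


def k_fold_splits(n_items: int, k: int = 4) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous folds.

    Any remainder is spread over the first folds, so fold sizes differ
    by at most one item.
    """
    fold_sizes = [n_items // k + (1 if f < n_items % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size


if __name__ == "__main__":
    # Placeholder tokens standing in for a tokenized EGY or MSA sentence.
    sentence = ["w1", "w2", "w3", "w4"]
    print(skipgram_pairs(sentence, window=1))
    print(cbow_pairs(sentence, window=1))
    for train_idx, test_idx in k_fold_splits(10, k=4):
        print(len(train_idx), test_idx)
```

Both objectives consume only monolingual text, which is consistent with the abstract's point that word embeddings do not need a parallel corpus. How the resulting EGY and MSA vectors are then related through the small parallel set is not described in the abstract, so it is left out of this sketch.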

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/125972