Regressing Word and Sentence Embeddings for Low-Resource Neural Machine Translation

Publisher:
Institute of Electrical and Electronics Engineers (IEEE)
Publication Type:
Journal Article
Citation:
IEEE Transactions on Artificial Intelligence, 2022, vol. PP, no. 99, pp. 1-15
Issue Date:
2022-01-01
Abstract:
In recent years, neural machine translation (NMT) has achieved unprecedented performance in the automated translation of resource-rich languages. However, it has not yet achieved comparable performance over the many low-resource languages and specialized translation domains, mainly due to its tendency to overfit small training sets and consequently perform poorly on new data. For this reason, in this paper we propose a novel approach to regularize the training of NMT models and improve their performance over low-resource language pairs. In the proposed approach, the model is trained to co-predict the target training sentences both as the usual categorical outputs (i.e., sequences of words) and as word and sentence embeddings. Because the word and sentence embeddings are pre-trained over large corpora of monolingual data, they help the model generalize beyond the available translation training set. Extensive experiments over three low-resource language pairs show that the proposed approach outperforms strong state-of-the-art baseline models, with more marked improvements over the smaller training sets (e.g., up to +6.57 BLEU points in Basque-English translation). A further experiment on unsupervised NMT shows that the proposed approach improves translation quality even with no parallel data at all.
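The co-prediction objective described in the abstract can be read as a multi-task loss: the standard cross-entropy over target tokens, plus regression terms that pull the decoder's states toward pre-trained word and sentence embeddings. The following PyTorch snippet is a minimal illustrative sketch of such a combined loss, not the paper's actual implementation; the module name, the linear projection heads, the mean-pooled sentence representation, and the lambda weights are all assumptions introduced here for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingRegressionLoss(nn.Module):
    """Hypothetical combined loss: token cross-entropy plus word- and
    sentence-embedding regression, in the spirit of the paper's approach."""

    def __init__(self, hidden_dim, emb_dim, lambda_word=0.1, lambda_sent=0.1):
        super().__init__()
        # Project decoder hidden states into the pre-trained embedding space.
        self.word_head = nn.Linear(hidden_dim, emb_dim)
        self.sent_head = nn.Linear(hidden_dim, emb_dim)
        self.lambda_word = lambda_word  # assumed weight, not from the paper
        self.lambda_sent = lambda_sent  # assumed weight, not from the paper

    def forward(self, logits, hidden, targets, word_embs, sent_emb, pad_id=0):
        # logits:    (batch, seq, vocab)   decoder token predictions
        # hidden:    (batch, seq, hidden)  decoder hidden states
        # targets:   (batch, seq)          gold token ids
        # word_embs: (batch, seq, emb)     pre-trained target word embeddings
        # sent_emb:  (batch, emb)          pre-trained target sentence embedding
        mask = (targets != pad_id).float()  # ignore padding positions

        # 1) Usual categorical loss over the target tokens.
        ce = F.cross_entropy(logits.transpose(1, 2), targets,
                             ignore_index=pad_id)

        # 2) Word-embedding regression: each decoder state should predict
        #    the pre-trained embedding of the gold word at that position.
        pred_word = self.word_head(hidden)
        word_mse = ((pred_word - word_embs) ** 2).mean(-1)
        word_loss = (word_mse * mask).sum() / mask.sum()

        # 3) Sentence-embedding regression: the mean-pooled decoder states
        #    should match the pre-trained sentence embedding.
        pooled = (hidden * mask.unsqueeze(-1)).sum(1) \
                 / mask.sum(1, keepdim=True).clamp(min=1)
        sent_loss = F.mse_loss(self.sent_head(pooled), sent_emb)

        return ce + self.lambda_word * word_loss + self.lambda_sent * sent_loss

# Usage (random tensors standing in for real model outputs):
#   loss_fn = EmbeddingRegressionLoss(hidden_dim=512, emb_dim=300)
#   loss = loss_fn(logits, hidden, targets, word_embs, sent_emb)

Because the regression targets come from embeddings pre-trained on large monolingual corpora, the two auxiliary terms act as a regularizer on the small parallel training set, which matches the motivation given in the abstract.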