Regressing Word and Sentence Embeddings for Low-Resource Neural Machine Translation

Publisher:
Institute of Electrical and Electronics Engineers (IEEE)
Publication Type:
Journal Article
Citation:
IEEE Transactions on Artificial Intelligence, 2022, vol. PP, no. 99, pp. 1-15
Issue Date:
2022-01-01
Abstract:
In recent years, neural machine translation (NMT) has achieved unprecedented performance in the automated translation of resource-rich languages. However, it has not yet achieved comparable performance over the many low-resource languages and specialized translation domains, mainly due to its tendency to overfit small training sets and consequently perform poorly on new data. For this reason, in this paper we propose a novel approach to regularize the training of NMT models and improve their performance over low-resource language pairs. In the proposed approach, the model is trained to co-predict the target training sentences both as the usual categorical outputs (i.e., sequences of words) and as word and sentence embeddings. Because the word and sentence embeddings are pre-trained over large corpora of monolingual data, they help the model generalize beyond the available translation training set. Extensive experiments over three low-resource language pairs show that the proposed approach outperforms strong state-of-the-art baseline models, with more marked improvements over the smaller training sets (e.g., up to +6.57 BLEU points in Basque-English translation). A further experiment on unsupervised NMT shows that the proposed approach improves translation quality even with no parallel data at all.
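The co-prediction objective described in the abstract can be read as a multi-task loss: the standard cross-entropy over target tokens, plus regression terms that pull the decoder's states toward pre-trained word and sentence embeddings. The following PyTorch snippet is a minimal illustrative sketch of such a combined loss, not the paper's actual implementation; the module name, the linear projection heads, the mean-pooled sentence representation, and the lambda weights are all assumptions introduced here for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingRegressionLoss(nn.Module):
    """Hypothetical combined loss: token cross-entropy plus word- and
    sentence-embedding regression, in the spirit of the paper's approach."""

    def __init__(self, hidden_dim, emb_dim, lambda_word=0.1, lambda_sent=0.1):
        super().__init__()
        # Project decoder hidden states into the pre-trained embedding space.
        self.word_head = nn.Linear(hidden_dim, emb_dim)
        self.sent_head = nn.Linear(hidden_dim, emb_dim)
        self.lambda_word = lambda_word  # assumed weight, not from the paper
        self.lambda_sent = lambda_sent  # assumed weight, not from the paper

    def forward(self, logits, hidden, targets, word_embs, sent_emb, pad_id=0):
        # logits:    (batch, seq, vocab)   decoder token predictions
        # hidden:    (batch, seq, hidden)  decoder hidden states
        # targets:   (batch, seq)          gold token ids
        # word_embs: (batch, seq, emb)     pre-trained target word embeddings
        # sent_emb:  (batch, emb)          pre-trained target sentence embedding
        mask = (targets != pad_id).float()  # ignore padding positions

        # 1) Usual categorical loss over the target tokens.
        ce = F.cross_entropy(logits.transpose(1, 2), targets,
                             ignore_index=pad_id)

        # 2) Word-embedding regression: each decoder state should predict
        #    the pre-trained embedding of the gold word at that position.
        pred_word = self.word_head(hidden)
        word_mse = ((pred_word - word_embs) ** 2).mean(-1)
        word_loss = (word_mse * mask).sum() / mask.sum()

        # 3) Sentence-embedding regression: the mean-pooled decoder states
        #    should match the pre-trained sentence embedding.
        pooled = (hidden * mask.unsqueeze(-1)).sum(1) \
                 / mask.sum(1, keepdim=True).clamp(min=1)
        sent_loss = F.mse_loss(self.sent_head(pooled), sent_emb)

        return ce + self.lambda_word * word_loss + self.lambda_sent * sent_loss

# Usage (random tensors standing in for real model outputs):
#   loss_fn = EmbeddingRegressionLoss(hidden_dim=512, emb_dim=300)
#   loss = loss_fn(logits, hidden, targets, word_embs, sent_emb)

Because the regression targets come from embeddings pre-trained on large monolingual corpora, the two auxiliary terms act as a regularizer on the small parallel training set, which matches the motivation given in the abstract.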