Abstract:
Machine Translation (MT) has come a long way in recent years, but it still suffers from data scarcity because parallel corpora are lacking for low-resource (and sometimes zero-resource) languages. Transfer Learning (TL) is one widely used direction for overcoming this issue in low-resource MT systems. Creating a parallel corpus for such languages is another way of dealing with data scarcity, but it is a costly, time-consuming and laborious task. To avoid these limitations of parallel corpus creation, we present a TL-based Semi-supervised Pseudo-corpus Generation (TLSPG) approach for zero-shot MT systems. It generates a pseudo corpus by exploiting, via TL, the relatedness between low-resource language pairs and zero-resource language pairs. Our experiments show empirically that such relatedness helps improve the performance of zero-shot MT systems. Experiments on zero-resource language pairs show that our approach outperforms the existing state-of-the-art models, yielding improvements of +15.56, +8.13, +3.98 and +2 BLEU points for Bhojpuri→Hindi, Magahi→Hindi, Hindi→Bhojpuri and Hindi→Magahi, respectively.