Machine translation by projecting text into the same phonetic-orthographic space using a common encoding


dc.contributor.author Kumar, Amit
dc.contributor.author Parida, Shantipriya
dc.contributor.author Pratap, Ajay
dc.contributor.author Singh, Anil Kumar
dc.date.accessioned 2024-04-10T07:05:48Z
dc.date.available 2024-04-10T07:05:48Z
dc.date.issued 2023-11-04
dc.identifier.issn 0256-2499
dc.identifier.uri http://localhost:8080/xmlui/handle/123456789/3129
dc.description This paper was published with an affiliation to IIT (BHU), Varanasi, in open access mode. en_US
dc.description.abstract The use of subword embeddings has proved to be a major innovation in neural machine translation (NMT). It helps NMT learn better context vectors for low-resource languages (LRLs), so that target words are predicted by better modelling the morphology of the two languages as well as the morphosyntactic transfer between them. Some of the NMT models that achieve state-of-the-art improvements on LRLs, such as the Transformer, BERT, BART, and mBART, can all use subword embeddings. Even so, their performance on Indian-to-Indian language translation is still not as good as on resource-rich languages. One reason for this is the relative morphological richness of Indian languages; another is that most of them fall into the extremely low-resource or zero-shot categories. Since most major Indian languages use Indic (Brahmi-origin) scripts, text written in them is highly phonetic in nature and phonetically similar in terms of abstract letters and their arrangements. We use these characteristics of Indian languages and their scripts to propose an approach based on a common multilingual Latin-based encoding (WX notation) that takes advantage of language similarity while addressing the morphological complexity issue in NMT. Such multilingual Latin-based encodings, together with Byte Pair Encoding, allow us to better exploit the phonetic, orthographic, and lexical similarities of these languages and to improve translation quality by projecting different but similar languages onto the same orthographic-phonetic character space. We verify the proposed approach with experiments on similar language pairs (Gujarati ↔ Hindi, Marathi ↔ Hindi, Nepali ↔ Hindi, Maithili ↔ Hindi, Punjabi ↔ Hindi, and Urdu ↔ Hindi) under low-resource conditions. The proposed approach shows an improvement in a majority of cases, in one case by as much as ∼10 BLEU points over baseline techniques for similar language pairs. We also obtain an improvement of up to ∼1 BLEU point on distant and zero-shot language pairs. (An illustrative sketch of the shared-encoding idea follows this record.) en_US
dc.language.iso en en_US
dc.publisher Springer en_US
dc.relation.ispartofseries Sadhana - Academy Proceedings in Engineering Sciences;48
dc.subject byte pair encoding en_US
dc.subject common phonetic-orthographic space en_US
dc.subject Neural machine translation en_US
dc.subject similar languages en_US
dc.subject transformer model en_US
dc.subject Abstracting en_US
dc.subject Computational linguistics en_US
dc.subject Computer aided language translation en_US
dc.subject Encoding (symbols) en_US
dc.subject Modeling languages en_US
dc.subject Natural language processing systems en_US
dc.subject Signal encoding en_US
dc.title Machine translation by projecting text into the same phonetic-orthographic space using a common encoding en_US
dc.type Article en_US
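
The abstract above describes a concrete pipeline: convert Indic-script text into a common Latin-based WX notation, then apply Byte Pair Encoding before NMT training. Below is a minimal, illustrative Python sketch of that shared-space projection. The mapping tables are hypothetical fragments covering only the example word, not the full WX standard, and this is not the authors' implementation.

# Illustrative sketch: project Devanagari (Hindi) and Gujarati characters
# onto one shared Latin, WX-style phonetic space. The tables below are
# hypothetical fragments chosen to cover the example word; the real WX
# notation also handles vowel signs, viramas, and other Indic marks.

DEVANAGARI_TO_WX = {"क": "ka", "म": "ma", "ल": "la"}
GUJARATI_TO_WX = {"ક": "ka", "મ": "ma", "લ": "la"}

def to_wx(text: str, table: dict) -> str:
    """Replace each script-specific character with its shared WX-style code."""
    return "".join(table.get(ch, ch) for ch in text)

# The cognate word "kamala" (lotus) written in the two scripts:
hindi_word = "कमल"      # Devanagari
gujarati_word = "કમલ"   # Gujarati

wx_hindi = to_wx(hindi_word, DEVANAGARI_TO_WX)
wx_gujarati = to_wx(gujarati_word, GUJARATI_TO_WX)

print(wx_hindi, wx_gujarati)  # kamala kamala
# After projection the two words are identical strings, so subword merges
# learned by BPE on the concatenated WX corpora are shared across the two
# languages, which is the phonetic-orthographic overlap the approach exploits.
assert wx_hindi == wx_gujarati

In the full approach, this projection would be applied to both sides of the parallel corpus, with BPE merges learned jointly over the WX-converted text before training the Transformer model.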

