Indian Journal of Science and Technology, Vol 10(16), DOI: 10.17485/ijst/2017/v10i16/111895, April 2017
ISSN (Print): 0974-6846
ISSN (Online): 0974-5645

Approaches for Improving Hindi to English Machine Translation System

Rajesh Kumar Chakrawarti (1) and Pratosh Bansal (2)

(1) Faculty of Computer Engineering, Institute of Engineering and Technology, Devi Ahilya Vishwavidyalaya, Indore – 452017, Madhya Pradesh, India; rajesh_kr_chakra@yahoo.com
(2) Department of Information Technology, Institute of Engineering and Technology, Devi Ahilya Vishwavidyalaya, Indore – 452017, Madhya Pradesh, India; pratosh@hotmail.com

Abstract

Objectives: To provide approaches for effective Hindi-to-English Machine Translation (MT) that can help in the inexpensive and easy implementation of MT systems.

Methods/Statistical Analysis: The structures of the Hindi and English languages have been studied thoroughly. The possible processing steps for natural languages have also been studied. The methods, rules, approaches, tools and resources related to MT are discussed in detail.

Findings: MT is the automatic translation of text from one language into another. India is a country rich in cultural and linguistic diversity: more than 20 regional languages are spoken, along with several dialects. Hindi is widely spoken across the states of the country. A large body of literature, poetry and other valuable texts is available in Hindi, which offers opportunities for translation into English. Moreover, the new generation is learning English rapidly and is keen to learn it in a simple, lucid manner. Several efforts have been made in this direction. Although a large number of approaches and solutions exist for MT, there is still considerable scope for improvement. The paper addresses the challenges of MT and the efforts made to solve them, which should motivate researchers to implement new Hindi-to-English Machine Translation systems.
Application/Improvements: Efficient, inexpensive and easy translation of the available Hindi literature, poetry and other valuable texts into English. Children can readily learn about the culture through poetry and literature; hence Machine Translation of these texts can have a significant impact.

Keywords: English Language, Hindi Language, Machine Translation, Translation-Rules and Translation Approaches

1. Introduction

India is one of the finest examples of a multi-lingual and multi-social country. People from different regions speak different languages, and the spoken language may change every few tens of kilometers. In India, Hindi is the national language, which is spoken by most of the people. English is an internationally accepted language used for communication throughout the world. The constitution of India accepts only these two languages, Hindi and English, as official languages. Official communication between the central and state governments is also carried out in these two languages. State governments may use their own regional languages to carry out their work. Most newspapers are also published in various regional languages. There are 22 regional languages spoken in various regions: Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi (which is also official), Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu and Urdu. Hence there is a pressing demand for better Machine Translation systems to establish better communication and exchange of information with other countries, states and central governments [1,2].

Machine Translation is a key research area in the field of Natural Language Processing (NLP). It is a computerized, automated process for translating text or documents from one language (called the source language) into another language (called the target language).
*Author for correspondence
Vol 10 (16) | April 2017 | www.indjst.org | Indian Journal of Science and Technology

Work on machine translation has been going on for several decades, but efficient machine translation is still a challenging task. India is the largest market for Machine Translation [3]. Figure 1 represents a block diagram for a simple Machine Translation system.

Figure 1. A simple Machine Translation (MT) System.

Machine Translation poses challenges at all levels of Natural Language Processing: Phonetics and Phonology, Morphology, Syntax, Semantics, Pragmatics and Discourse. Among these, ambiguity (Semantics) is the biggest one. In addition, different languages may exhibit linguistic diversity (called translation divergence). Machine Translation systems deal with the ambiguity and linguistic diversity problems under the umbrella of Natural Language Processing [4]. In India, we feel that the most important Machine Translation directions are Hindi → English and Hindi → Regional Language.

1.1 Hindi-to-English Translation

Hindi is our national language. People speak different regional languages, but Hindi is the main official language for standard communication. Outside India, Hindi is also known in countries such as Pakistan, Bangladesh and Nepal.

The default structure of a Hindi sentence is Subject-Object-Verb (SOV), e.g. "पृथ्वी सोना चाहता है |" where S = पृथ्वी, O = सोना and V = चाहना.

Indian languages (primarily Hindi) have the following characteristics:

• Highly inflectional language,
• Rich morphology, and
• Relatively free word order.

Hindi-to-English Machine Translation is more complex because of these characteristics. Anything written in Hindi may convey different senses depending upon the context. The spoken word sequence of a statement in an Indian language may also differ from speaker to speaker [5,6]. Figure 2 represents a block diagram for a Hindi-to-English Machine Translation system.

Figure 2. Hindi → English Machine Translation.

1.2 English-to-Hindi Translation

English is a major internationally accepted language, spoken and used in all kinds of communication among almost all countries of the world. It can be said that English is almost the only language that is popular among people all over the world.

The default structure of an English sentence is Subject-Verb-Object (SVO), e.g. "Prithvi wants gold" where S = Prithvi, V = want and O = gold.

English has the following main characteristics:

• Highly positional language
• Rudimentary (poor) morphology.

English-to-Hindi Machine Translation results in verb movement over large distances. Hindi also marks gender agreement, which English does not. By enriching the source-side English resources with linguistic factors, the morphological issues can be resolved [5,6]. Figure 3 shows a block diagram for an English-to-Hindi Machine Translation system.

Figure 3. English → Hindi Machine Translation.

Hindi ↔ English Machine Translation can be improved by incorporating a technique called Word Sense Disambiguation (WSD), defined as the task of identifying the correct sense of a word depending upon the context. WSD algorithms can be broadly classified as knowledge/dictionary-based, supervised, semi-supervised and unsupervised approaches. However, there is no restriction on using them either singly or in combination; combinations have also produced good results in earlier work [7,8].

Over the last three decades, a lot of research and many research projects have been carried out in India in the area of Machine Translation. Although they have produced some good Machine Translation systems, all of them have their own advantages, disadvantages and limitations, and "It is not possible to have fully automatic, qualitative, and general-purpose Machine Translation" [5]. Hence there is still scope for further research in this area. Many research efforts and projects are under way to overcome these disadvantages and limitations. This scope also motivates the teaching of Machine Translation, from an Indian perspective, to students and researchers [9].

Many surveys in the field of Machine Translation have been carried out from the Indian perspective. The first kind of survey covers resources, services and tools for Machine Translation systems throughout India; it is a rigorous collection for the Indian perspective [10]. The second kind covers Word Sense Disambiguation approaches that can be used for improving Machine Translation systems [11]. These surveys describe the type of approach (knowledge-based, supervised, minimally-supervised, unsupervised, hybrid etc.), corpus or WordNet details, the features, advantages, disadvantages and limitations of each approach, and new techniques under these approaches. The third kind covers the different types of Machine Translation approaches used for developing systems [12-15]. These surveys give the name of the approach (direct, rule-based, corpus-based, hybrid etc.), its features, advantages, disadvantages and limitations, and new techniques under it. The fourth kind covers the different Machine Translation systems developed in India. These surveys report the name, year of development, people and/or organization, funding agency, place of development, domains/applications of the system, approaches/techniques and tools/resources used, features etc. [14-17]. All of these surveys also provide web links for using the Machine Translation systems they describe. The literature discussed in this paragraph is based on survey papers only, whereas the next paragraph is based on actual research, research projects and resources.

Machine Translation systems face ambiguity and divergence issues at all levels of Natural Language Processing [4,18]. It is observed that multilingual systems are bound by resource constraints such as WordNet, which is costly and takes more processing time. Anglabharti is an English to Indian languages machine-aided translation system [19]. It uses a rule-based (pseudo-interlingua based) method. The system produces good results; however, it sometimes produces more than one target sentence for a given source English sentence. The computer-assisted translation system Mantra, which translates texts from English to Hindi in the domain of Personnel Administration, is developed using a rule-based (transfer-based) method [20]. Research through this system opens new areas in which other facilities can be contributed. The Anusaaraka system, which makes documents in one Indian language accessible in another Indian language, is developed using a direct (word-to-word) method [21]. This system also produces good results, but it has major implications if it enters common use. The Universal Networking Language (UNL) based (interlingua) Machine Translation system is used for translation from English to Indian languages; although it is a good system, language divergence between the source and target languages and UNL has implications [22]. AnglaHindi is a participant project of the Anglabharti translation system and is responsible for English to Hindi translation [23]. It is developed using a hybrid rule-based and example-based method. MaTra is a fully automatic system for English-Hindi Machine Translation (MT) of general-purpose texts [24]. It is developed using a rule-based (transfer-based) method. The statistical Machine Translation systems from Google, Microsoft, Worldlingo and IBM are Google Translate, Bing Translator, Worldlingo and IBM Server respectively.

Machine Translation approaches are classified as direct translation, rule-based (transfer and interlingua-based) translation, corpus-based (statistical and example-based) translation and hybrid translation (a combination of these approaches) [25]. These systems and approaches have their own features, advantages, disadvantages and limitations. The Statistical Machine Translation (SMT) model [3,14] and its types (word-based, phrase-based, hierarchical phrase-based and other models) provide the basis for improving Machine Translation systems. They are also helpful in developing new systems.

A number of online applications are available for Hindi-to-English Machine Translation. Table 1 gives a detailed analysis of the effectiveness of those applications. For example, the Hindi statement "पृथ्वी सोना चाहता है |" was translated into English using the online applications listed in the table. Analysis of the output shows that most of the applications failed to produce the desired output. Only "Google Translate" produces a good result, "Earth wants to sleep". However, it cannot identify the noun "पृथ्वी" as a name, which is why it produces "Earth" where it should write "Prithvi". The remaining applications produce improper results. Hence there is a need for an enhanced Hindi-to-English Machine Translator that can provide better and more appropriate results.

WordNet is an online lexical database designed for the English language. It includes four main parts of speech (PoS), (i) Noun, (ii) Verb, (iii) Adjective and (iv) Adverb, which are organized into sets of synonyms [26]. HindiWordNet is an online lexical database designed for the Hindi language on the basis of the English WordNet. Like the English WordNet, it includes the four main parts of speech of Hindi, (i) Noun, (ii) Verb, (iii) Adjective and (iv) Adverb, which are organized into sets of synonyms. IndoWordNet is a linked structure of wordnets of major Indian languages [27].

Word Sense Disambiguation algorithms and applications are categorized as knowledge/dictionary-based, supervised, semi-supervised, unsupervised and hybrid approaches [7]. Each has its own features, advantages, disadvantages and limitations. A critical analysis provides the knowledge needed to choose the appropriate Word Sense Disambiguation approach for improving Machine Translation systems [28]. An experimental study of graph connectivity for unsupervised Word Sense Disambiguation helps in improving Machine Translation [29].

Concept map construction might help in improving Machine Translation, because it allows ideas and knowledge that are related to each other in some respect to be combined. This creates a semantic binding between two ideas or pieces of knowledge. With a concept map, we can interlink the concepts which belong to the same domain [30,31].

A proposed Chinese-Japanese Sign Language Translation system provides research directions for similar kinds of translation, such as a Hindi → English Sign Language Translation System [32]. Bilingual Hindi-English (Hinglish) Machine Translation is an important research direction for separating the pure component languages from a mixed-language text [33].

BLEU (Bilingual Evaluation Understudy) is the major metric, and some other metrics are also helpful, in the automatic evaluation of Machine Translation systems. Different techniques under BLEU play an important role in evaluating Machine Translation systems [6,34].

A lot of ancient literature exists in Hindi. It is written in the "Devanagari lipi (script)", which had been developed during the 15th century. Most books, novels, volumes etc. are in Hindi script. In the modern era, there is a huge demand for English translation, and research in this direction has increased over the last decades [35].

One of the hardest kinds of machine translation is poetry translation. A lot of poetry is available in Hindi, and a lot of work has been done in this direction. The available systems require a better mechanism for poetry translation into English [36].

Many researchers, institutions and research organizations have started working on Machine Translation systems for Hindi to English, English to Hindi, and Hindi to regional language translation and vice versa, and have succeeded in obtaining very satisfactory results. The prominent institutions and research organizations which have worked, and are still working, in the area of Machine Translation are as follows [2,5,17]:

• Technology Development for Indian Languages (TDIL) project by Department of Electronics and Information Technology (DeitY), Ministry of Communications and Information Technology, Government of India.
• Department of Computer Science and Engineering, Indian Institute of Technology (IIT), Kanpur, Bombay and Delhi.
• Department of Computer and Information Sciences, University of Hyderabad (UoH), Hyderabad.
• Language Technologies Research Center (LTRC), International Institute of Information Technology (IIIT), Hyderabad.
• Centre for Development of Advanced Computing (CDAC), Pune, Noida and Bangalore.
• National Center for Software Technology (NCST) (now CDAC), Bombay.
• Department of Computer Science and Engineering, Jadavpur University, Kolkata.
• Machine Learning Lab, CSA, Indian Institute of Science (IISc), Bangalore.
• AU-KBC Research Centre, Chennai.
• Department of Computer Science and Application, Utkal University, Bhubaneswar.
• Advanced Center for Technical Development of Punjabi Language, Literature and Culture, Punjabi University, Patiala.
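
To make the knowledge/dictionary-based family of Word Sense Disambiguation approaches discussed earlier more concrete, the following minimal Python sketch counts the overlap between the context of an ambiguous word and a small signature of cue words for each sense, in the spirit of the simplified Lesk algorithm. It is applied to the Hindi word "सोना", which can mean "gold" or "to sleep". Everything in it is an assumption made for illustration: the sense signatures are hand-written and hypothetical (they are not taken from HindiWordNet), the names SENSE_SIGNATURES, tokenize and disambiguate are invented here, and the two longer test sentences are not from the paper. It is not the method of any system surveyed above.

```python
from typing import Dict, List, Optional, Set

# Hypothetical, hand-written cue words for each sense of one ambiguous word.
# A real system would derive such signatures from a lexical resource
# (for example glosses and examples in a wordnet); these are assumptions.
SENSE_SIGNATURES: Dict[str, Dict[str, Set[str]]] = {
    "सोना": {
        "gold": {"सुनार", "बाजार", "गहना", "धातु"},
        "sleep": {"रात", "बिस्तर", "नींद", "थकान"},
    }
}

# Punctuation stripped from token edges (includes the Devanagari danda).
PUNCTUATION = "|।.,?!"


def tokenize(sentence: str) -> List[str]:
    """Split on whitespace and drop surrounding punctuation and empty tokens."""
    stripped = (token.strip(PUNCTUATION) for token in sentence.split())
    return [token for token in stripped if token]


def disambiguate(word: str, sentence: str) -> Optional[str]:
    """Return the sense whose signature overlaps the context most.

    Returns None when the word is unknown, when no cue word occurs in the
    context, or when two senses tie, i.e. when the context is not decisive.
    """
    context = set(tokenize(sentence)) - {word}
    scores = {
        sense: len(signature & context)
        for sense, signature in SENSE_SIGNATURES.get(word, {}).items()
    }
    if not scores:
        return None
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


if __name__ == "__main__":
    for text in (
        "सुनार बाजार में सोना बेचता है",
        "रात में बिस्तर पर सोना चाहिए",
        "पृथ्वी सोना चाहता है |",
    ):
        print(text, "->", disambiguate("सोना", text))
```

With these toy signatures, the sentence about a goldsmith selling in the market resolves to the "gold" sense and the sentence about night and bed resolves to the "sleep" sense. For the example statement "पृथ्वी सोना चाहता है |" the sketch returns None, because none of its context words appears in either signature; an overlap count of this kind needs wider context or a richer lexical resource before it can choose between the two readings.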