The IIT Bombay Hindi⇔English Translation System at WMT 2014
Piyush Dungarwal, Rajen Chatterjee, Abhijit Mishra, Anoop Kunchukuttan,
Ritesh Shah, Pushpak Bhattacharyya
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
{piyushdd,rajen,abhijitmishra,anoopk,ritesh,pb}@cse.iitb.ac.in
Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 90–96, Baltimore, Maryland, USA, June 26–27, 2014. © 2014 Association for Computational Linguistics

Abstract

In this paper, we describe our English-Hindi and Hindi-English statistical systems submitted to the WMT14 shared task. The core components of our translation systems are phrase based (Hindi-English) and factored (English-Hindi) SMT systems. We show that the use of number, case and Tree Adjoining Grammar information as factors helps to improve English-Hindi translation, primarily by generating morphological inflections correctly. We show improvements to the translation systems using pre-processing and post-processing components. To overcome the structural divergence between English and Hindi, we preorder the source side sentence to conform to the target language word order. Since the parallel corpus is limited, many words are not translated. We translate out-of-vocabulary words and transliterate named entities in a post-processing stage. We also investigate ranking of translations from multiple systems to select the best translation.

1 Introduction

India is a multilingual country with Hindi being the most widely spoken language. Hindi and English act as link languages across the country and languages of official communication for the Union Government. Thus, the importance of English⇔Hindi translation is obvious. Over the last decade, several rule based (Sinha, 1995), interlingua based (Dave et al., 2001) and statistical methods (Ramanathan et al., 2008) have been explored for English-Hindi translation.

In the WMT 2014 shared task, we undertake the challenge of improving translation between the English and Hindi language pair using Statistical Machine Translation (SMT) techniques. The WMT 2014 shared task has provided a standardized test set to evaluate multiple approaches and avails the largest publicly downloadable English-Hindi parallel corpus. Using these resources, we have developed a phrase-based and a factored system for Hindi-English and English-Hindi translation respectively, with pre-processing and post-processing components to handle the structural divergence and morphological richness of Hindi. Section 2 describes the issues in Hindi↔English translation.

The rest of the paper is organized as follows. Section 3 describes corpus preparation and experimental setup. Sections 4 and 5 describe our English-Hindi and Hindi-English translation systems respectively. Section 6 describes the post-processing operations on the output from the core translation system for handling OOV words and named entities, and for reranking outputs from multiple systems. Section 7 mentions the details of our systems submitted to the WMT shared task. Section 8 concludes the paper.

2 Problems in Hindi⇔English Translation

Languages can be differentiated in terms of structural divergences and morphological manifestations. English is structurally classified as a Subject-Verb-Object (SVO) language with poor morphology, whereas Hindi is a morphologically rich, Subject-Object-Verb (SOV) language. Largely, these divergences are responsible for the difficulties in translation using a phrase based/factored model, which we summarize in this section.

2.1 English-to-Hindi

The fundamental structural differences described earlier result in long-distance verb and modifier movements across English-Hindi. Local reordering models prove to be inadequate to overcome the problem; hence, we transformed the source side sentence using pre-ordering rules to conform to the target word order. The availability of robust parsers for English makes this approach effective for English-Hindi translation.
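As a toy illustration of the idea (not the paper's actual rule set, which applies hand-written rules over Stanford parse trees), the following sketch reorders a flat, role-annotated English clause from SVO to SOV order; the role labels and the example clause are invented for the illustration:

```python
# Toy source-side pre-ordering: rewrite an English SVO clause into
# Hindi-like SOV order. Purely illustrative: the real system works
# on parse trees, not on a flat (word, role) annotation like this.

def preorder_svo_to_sov(tagged_clause):
    """Move the verb group after the object group (SVO -> SOV)."""
    subject = [w for w, r in tagged_clause if r == "SUBJ"]
    verb = [w for w, r in tagged_clause if r == "VERB"]
    obj = [w for w, r in tagged_clause if r == "OBJ"]
    return subject + obj + verb

clause = [("John", "SUBJ"), ("ate", "VERB"), ("an", "OBJ"), ("apple", "OBJ")]
print(" ".join(preorder_svo_to_sov(clause)))  # John an apple ate
```

The reordered source then matches the Hindi verb-final order, so the decoder needs only short-distance distortion.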
As far as morphology is concerned, Hindi is richer in terms of case markers and inflection-rich surface forms, including verb forms. Hindi exhibits gender agreement and syncretism in inflections, which are not observed in English. We attempt to enrich the source side English corpus with linguistic factors in order to overcome the morphological disparity.

2.2 Hindi-to-English

The lack of accurate linguistic parsers makes it difficult to overcome the structural divergence using preordering rules. In order to preorder Hindi sentences, we build rules using shallow parsing information. The source side reordering helps to reduce the decoder's search complexity and learn better phrase tables. Some of the other challenges in the generation of English output are: (1) generation of articles, which Hindi lacks, and (2) heavy overloading of English prepositions, making it difficult to predict them.

3 Experimental Setup

We process the corpus through appropriate filters for normalization and then create a train-test split.

3.1 English Corpus Normalization

To begin with, the English data was tokenized using the Stanford tokenizer (Klein and Manning, 2003) and then true-cased using truecase.perl provided in the MOSES toolkit.

3.2 Hindi Corpus Normalization

For Hindi data, we first normalize the corpus using the NLP Indic Library (Kunchukuttan et al., 2014) (https://bitbucket.org/anoopk/indic_nlp_library). Normalization is followed by tokenization, wherein we make use of trivtokenizer.pl (http://ufallab.ms.mff.cuni.cz/~bojar/hindencorp/) provided with the WMT14 shared task. In Table 1, we highlight some of the post-normalization statistics for the en-hi parallel corpora.

                                English       Hindi
  Tokens                        2,898,810     3,092,555
  Types                         95,551        118,285
  Total characters              18,513,761    17,961,357
  Total sentences               289,832       289,832
  Sentences (word count ≤ 10)   188,993       182,777
  Sentences (word count > 10)   100,839       107,055

Table 1: en-hi corpora statistics, post normalisation.

3.3 Data Split

Before splitting the data, we first randomize the parallel corpus. We filter out English sentences longer than 50 words along with their parallel Hindi translations. After filtering, we select 5000 sentences which are 10 to 20 words long as the test data, while the remaining 284,832 sentences are used for training.

4 English-to-Hindi (en-hi) translation

We use the MOSES toolkit (Koehn et al., 2007a) for carrying out various experiments. Starting with Phrase Based Statistical Machine Translation (PB-SMT) (Koehn et al., 2003) as the baseline system, we go ahead with pre-order PBSMT, described in Section 4.1. After pre-ordering, we train a Factor Based SMT (Koehn, 2007b) model, where we add factors on the pre-ordered source corpus. In Factor Based SMT we have two variations: (a) using supertag as factor, described in Section 4.2, and (b) using number and case as factors, described in Section 4.3.

4.1 Pre-ordering source corpus

Research has shown that pre-ordering the source language to conform to the target language word order significantly improves translation quality (Collins et al., 2005). There are many variations of pre-ordering systems, primarily emerging from either rule based or statistical methods. We use the rule based pre-ordering approach developed by Patel et al. (2013), which uses the Stanford parser (Klein and Manning, 2003) for parsing English sentences. This approach is an extension of an earlier approach developed by Ramanathan et al. (2008). The existing source reordering system requires the input text to contain only surface forms;
however, we extended it to support surface forms along with factors like POS, lemma, etc. An example of improvement in translation after pre-ordering is shown below:

Example: trying to replace bad ideas with good ideas.
Phr: replace बुरे विचारों को अच्छे विचारों के साथ
(replace bure vichaaron ko acche vichaaron ke saath)
Gloss: replace bad ideas good ideas with
Pre-order PBSMT: अच्छे विचारों से बुरे विचारों को बदलने की कोशिश कर रहे हैं
(acche vichaaron se bure vichaaron ko badalane ki koshish kara rahe hain)
Gloss: good ideas with bad ideas to replace trying

4.2 Supertag as Factor

The notion of supertag was first proposed by Joshi and Srinivas (1994). Supertags are elementary trees of Lexicalized Tree Adjoining Grammar (LTAG) (Joshi and Schabes, 1991). They provide syntactic as well as dependency information at the word level by imposing complex constraints in a local context. These elementary trees are combined in some manner to form a parse tree, due to which supertagging is also known as "an approach to almost parsing" (Bangalore and Joshi, 1999). A supertag can also be viewed as a fragment of the parse tree associated with each lexical item. Figure 1 shows an example of the supertagged sentence "The purchase price includes taxes" described in (Hassan et al., 2007). It clearly shows the sub-categorization information available in the verb include, which takes a subject NP to its left and an object NP to its right.

Figure 1: LTAG supertag sequence obtained using MICA Parser.

Use of supertags as factors has already been studied by Hassan (2007) in the context of Arabic-English SMT. They use a supertag language model along with a supertagged English corpus. Ours is the first study using supertag as a factor for English-to-Hindi translation on a pre-ordered source corpus.

We use the MICA Parser (Bangalore et al., 2009) for obtaining supertags. After supertagging, we run the pre-ordering system, preserving the supertags. For translation, we create a mapping from source-word|supertag to target-word. An example of improvement in translation by using supertag as factor is shown below:

Example: trying to understand what your child is saying to you
Phr: आपका बच्चा आपसे क्या कह रहा है यह
(aapkaa bacchaa aapse kya kaha rahaa hai yaha)
Gloss: your child you what saying is this
SupertagFact: आपका बच्चा आपसे क्या कह रहा है, उसे समझने की कोशिश करना
(aapkaa bacchaa aapse kya kaha rahaa hai, use samajhane kii koshish karnaa)
Gloss: your child to you what saying is, that understand try

4.3 Number, Case as Factor

In this section, we discuss how to generate correct noun inflections while translating from English to Hindi. Previous work has addressed the problem of data sparsity due to complex verb morphology for English to Hindi translation (Gandhe, 2011). Noun inflections in Hindi are affected only by the number and case of the noun. Number can be singular or plural, whereas case can be direct or oblique. We use the factored SMT model to incorporate this linguistic information during training of the translation models. We attach root-word, number and case as factors to English nouns. To Hindi nouns, on the other hand, we attach root-word and suffix as factors. We define the translation and generation steps as follows:

• Translation step (T0): Translates English root|number|case to Hindi root|suffix

• Generation step (G0): Generates the Hindi surface word from Hindi root|suffix

An example of improvement in translation by using number and case as factors is shown below:

Example: Two sets of statistics
Phr: दो के आंकड़े
(do ke aankade)
Gloss: two of statistics
Num-CaseFact: आंकड़ों के दो सेट
(aankadon ke do set)
Gloss: statistics of two sets
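The T0/G0 pipeline above can be sketched as follows; the helper functions and the toy lexicon are invented for illustration (romanized forms, simple suffix concatenation) and are not the paper's actual trained models:

```python
# Sketch of the factored noun pipeline: factored English input,
# translation step T0 (root|number|case -> root|suffix), and
# generation step G0 (root|suffix -> surface form). The lexicon
# below is a hypothetical stand-in for the learned phrase tables.

def to_factored(word, root, number, case):
    """Render an English noun in pipe-delimited factored form."""
    return f"{word}|{root}|{number}|{case}"

# T0: English root|number|case -> Hindi root|suffix (toy lexicon)
toy_t0 = {
    ("statistic", "pl", "oblique"): ("aankad", "on"),  # आंकड़ों
    ("set", "pl", "direct"): ("set", ""),              # सेट
}

# G0: Hindi root|suffix -> Hindi surface form
def generate_surface(root, suffix):
    return root + suffix

print(to_factored("statistics", "statistic", "pl", "oblique"))
# statistics|statistic|pl|oblique
print(generate_surface(*toy_t0[("statistic", "pl", "oblique")]))
# aankadon
```

Because the translation step maps roots rather than full surface forms, a plural oblique noun unseen in training can still receive the correct inflection from the generation step.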
4.3.1 Generating number and case factors

With the help of syntactic and morphological tools, we extract the number and case of the English nouns as follows:

• Number factor: We use the Stanford POS tagger (http://nlp.stanford.edu/software/tagger.shtml) to identify the English noun entities (Toutanova, 2003). The POS tagger itself differentiates between singular and plural nouns by using different tags.

• Case factor: It is difficult to find the direct/oblique case of the nouns, as English nouns do not contain this information. Hence, to get the case information, we need to find features of an English sentence that correspond to the direct/oblique case of the parallel nouns in the Hindi sentence. We use object of preposition, subject, direct object, and tense as our features. These features are extracted using the semantic relations provided by Stanford's typed dependencies (Marneffe, 2008).

4.4 Results

Listed below are the different statistical systems trained using Moses:

• Phrase Based model (Phr)

• Phrase Based model with pre-ordered source corpus (PhrReord)

• Factor Based model with factors on pre-ordered source corpus
  – Supertag as factor (PhrReord+STag)
  – Number, case as factor (PhrReord+NC)

We evaluated the translation systems with BLEU and TER, as shown in Table 2. Evaluation on the development set shows that the factor based models achieve competitive scores compared to the baseline system, whereas evaluation on the WMT14 test set shows significant improvement in the performance of the factor based models.

                    Development        WMT14
  Model             BLEU    TER     BLEU    TER
  Phr               27.62   0.63    8.0     0.84
  PhrReord          28.64   0.62    8.6     0.86
  PhrReord+STag     27.05   0.64    9.8     0.83
  PhrReord+NC       27.50   0.64    10.1    0.83

Table 2: English-to-Hindi automatic evaluation on the development set and on the WMT14 test set.

5 Hindi-to-English (hi-en) translation

As English follows SVO word order and Hindi follows SOV word order, the simple distortion penalty in phrase-based models cannot handle the reordering well. For the shared task, we follow the approach that pre-orders the source sentence to conform to the target word order.

A substantial volume of work has been done in the field of source-side reordering for machine translation. Most of the experiments are based on applying reordering rules at the nodes of the parse tree of the source sentence. These reordering rules can be learnt automatically (Genzel, 2010). But many source languages do not have a good, robust parser. Hence, we can instead use shallow parsing techniques to get chunks of words and then reorder them. Reordering rules can be learned automatically from chunked data (Zhang, 2007).

Hindi does not have a functional constituency or dependency parser available as of now, but a shallow parser (http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php) is available for Hindi. Hence, we follow a chunk-based pre-ordering approach, wherein we develop a set of rules to reorder the chunks in a source sentence. The following are the chunk tags generated by this shallow parser: Noun chunks (NP), Verb chunks (VGF, VGNF, VGNN), Adjectival chunks (JJP), Adverb chunks (RBP), Negatives (NEGP), Conjuncts (CCP), Chunk fragments (FRAGP), and miscellaneous entities (BLK) (Bharati, 2006).

5.1 Development of rules

After chunking an input sentence, we apply hand-crafted reordering rules on these chunks. The following sections describe these rules. Note that we apply the rules in the same order as they are listed below.

5.1.1 Merging of chunks

After chunking, we merge adjacent chunks if they follow the same order in the target language.

1. Merge {JJP VGF} chunks (consider the result as a single VGF chunk), e.g., वर्णित है (varnit hai), स्थित है (sthit hai)
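The merging step above can be sketched as follows; the rule and chunk tags come from the paper, but the list-of-tuples chunk representation and the function itself are illustrative:

```python
# Sketch of rule 1: merge an adjacent {JJP, VGF} chunk pair into a
# single VGF chunk. Chunks are modeled here as (tag, words) tuples
# over romanized Hindi; the real system operates on shallow-parser
# output.

def merge_jjp_vgf(chunks):
    """Collapse each adjacent JJP + VGF pair into one VGF chunk."""
    merged = []
    i = 0
    while i < len(chunks):
        tag, words = chunks[i]
        if tag == "JJP" and i + 1 < len(chunks) and chunks[i + 1][0] == "VGF":
            # e.g. JJP "varnit" + VGF "hai" -> VGF "varnit hai"
            merged.append(("VGF", words + chunks[i + 1][1]))
            i += 2
        else:
            merged.append((tag, words))
            i += 1
    return merged

chunks = [("NP", ["kitaab"]), ("JJP", ["varnit"]), ("VGF", ["hai"])]
print(merge_jjp_vgf(chunks))
# [('NP', ['kitaab']), ('VGF', ['varnit', 'hai'])]
```

Merging first keeps the later reordering rules simple: once {JJP VGF} behaves as one verb chunk, a single rule can move it as a unit.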