Yandex School of Data Analysis approach to English-Turkish translation
at WMT16 News Translation Task

Anton Dvorkovich¹,², Sergey Gubanov², and Irina Galinskaya²
{dvorkanton,esgv,galinskaya}@yandex-team.ru
¹ Yandex School of Data Analysis, 11/2 Timura Frunze St., Moscow 119021, Russia
² Yandex, 16 Leo Tolstoy St., Moscow 119021, Russia
Abstract

We describe the English-Turkish and Turkish-English translation systems submitted by Yandex School of Data Analysis team to WMT16 news translation task. We successfully applied hand-crafted morphological (de-)segmentation of Turkish, syntax-based pre-ordering of English in English-Turkish and post-ordering of English in Turkish-English. We perform desegmentation using SMT and propose a simple yet efficient modification of post-ordering. We also show that Turkish morphology and word order can be handled in a fully-automatic manner with only a small loss of BLEU.

1 Introduction

Yandex School of Data Analysis participated in WMT16 shared task "Machine Translation of News" in Turkish-English language pair.

Machine translation between English and Turkish is a challenging task, due to the strong differences between languages. In particular, Turkish has rich agglutinative morphology, and the word order differs between languages (SOV in Turkish, SVO in English).

To deal with these dissimilarities, we preprocess both source and target parts of the parallel corpus before training: we perform morphological segmentation of Turkish and reordering of English into Turkish word order, aiming to achieve a monotonous one-to-one correspondence between tokens to aid SMT.

Since we changed the target side of the parallel corpus, at runtime we had to do post-processing: desegmentation of Turkish for EN-TR and post-ordering of English words for TR-EN. We employ additional SMT decoders to solve both tasks, which results in two-stage translation.

For morphological segmentation and English-to-Turkish reordering we tried both rule-based/supervised and fully unsupervised approaches.

2 Data & common system components

In our two systems (Turkish-English and English-Turkish) we used several common components described below.

The specific application of these tools varies for Turkish-English and English-Turkish systems, so we discuss it separately in Sections 4 and 3.

2.1 Phrase-based translator

We used an in-house implementation of phrase-based MT (Koehn et al., 2003) with Berkeley Aligner (Liang et al., 2006) and MERT tuning (Och, 2003).

2.2 English syntactic parser

We used an in-house transition-based English dependency parser similar to (Zhang and Nivre, 2011).

2.3 English-to-Turkish reorderers

We used two different reorderers that put English words in Turkish order. Both reorderers need an English dependency parse tree as input.

Rule-based reorderer modifies parse trees using rules similar to Tregex (Levy and Andrew, 2006), adapted to dependency trees¹. We used a set of about 70 hand-crafted rules, an example of a rule is given in Figure 1.

    w1 role 'PMOD'
    and .--> (w2 not role 'CONJ')
    ::
    move group w1 before node w2;

Figure 1: Sample dependency tree reordering rule

¹ Our dependency tree reordering tool is available here: https://github.com/yandex/dep_tregex
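As an illustration of what the rule in Figure 1 does, here is a minimal Python sketch. It is not the dep_tregex implementation. It assumes that the rule reads as "w1 has role PMOD, and its head w2 stands to its left and does not have role CONJ", and that "group" means w1 together with all of its descendants. The `Token` class, the function names, and the role labels other than PMOD and CONJ are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Token:
    form: str
    head: int  # index of the head token in the sentence, -1 for the root
    role: str  # dependency label


def subtree(tokens: List[Token], root: int) -> Set[int]:
    """Return the indices of `root` and all of its descendants."""
    members = {root}
    changed = True
    while changed:
        changed = False
        for i, tok in enumerate(tokens):
            if i not in members and tok.head in members:
                members.add(i)
                changed = True
    return members


def apply_pmod_rule(tokens: List[Token]) -> List[str]:
    """Move every PMOD group in front of its head (assumed reading of Figure 1)."""
    order = list(range(len(tokens)))  # current surface order of token indices
    for w1, tok in enumerate(tokens):
        w2 = tok.head
        if tok.role != "PMOD" or w2 < 0:
            continue
        if order.index(w2) > order.index(w1):  # head must be on the left
            continue
        if tokens[w2].role == "CONJ":
            continue
        group = subtree(tokens, w1)
        moved = [i for i in order if i in group]
        rest = [i for i in order if i not in group]
        pos = rest.index(w2)
        order = rest[:pos] + moved + rest[pos:]
    return [tokens[i].form for i in order]


# Toy parse of "went to his friends"; labels ROOT, ADV, NMOD are illustrative.
sentence = [
    Token("went", -1, "ROOT"),
    Token("to", 0, "ADV"),
    Token("his", 3, "NMOD"),
    Token("friends", 1, "PMOD"),
]
print(" ".join(apply_pmod_rule(sentence)))  # went his friends to
```

On this toy input the preposition ends up after its object, which mirrors the postpositional order of Turkish.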
Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 281–288, Berlin, Germany, August 11-12, 2016. © 2016 Association for Computational Linguistics
Automatic reorderer uses word alignments on a parallel corpus to construct reference reorderings, and then trains a feedforward neural-network classifier which makes node-swapping decisions (de Gispert et al., 2015).

2.4 Turkish morphological analyzers

We used an in-house finite state transducer similar to (Oflazer, 1994) for Turkish morphological tagging, and structured perceptron similar to (Sak et al., 2007) for morphological disambiguation.

As an alternative, we trained our implementation of unsupervised morphology model, following (Soricut and Och, 2015), with a single distinctive feature: in each connected component C of the morphological graph, we select the lemma as $\arg\max_{w \in C} (\log f(w) - \alpha \cdot l(w))$, where l(w) is word length and f(w) is word frequency². This is a heuristic, justified by the facts, that (1) lemma tends to be shorter than other surface forms of a word, and (2) log f(w) is proportional to l(w) (Strauss et al., 2007). We also make use of morphology induction for unseen words, as described in the original paper. The automatic method requires no disambiguation and yields no part-of-speech tags or morphological features.

2.5 Turkish morphological segmenter

We used three strategies for segmenting Turkish words into less-sparse units. The "simple" strategy splits a word into lemma and chain of affixes. The latter is chosen as suffix of the surface form, starting from (l + 1)-th letter, where l is lemma's length.

    arkadaşlarına     arkadaş $larına
    to his friends    to his friends

The "rule-based" strategy uses hand-crafted rules similar to (Oflazer and El-Kahlout, 2007), (Yeniterzi and Oflazer, 2010) or (Bisazza and Federico, 2009) to split word into lemma and groups of morphological features, some of which might be attached to lemma. Rules are designed to achieve a better correspondence between Turkish and English words. This strategy requires morphological analyzer to output features as well as lemma.

    arkadaşlarına     arkadaş+a3pl +p3sg +dat
    to his friends    to his friends

The "aggressive rule-based" strategy, in addition, forcefully splits all features attached to the lemma into a separate group.

    arkadaşlarına     arkadaş +a3pl +p3sg +dat
    to his friends    to his friends

2.6 NMT reranker

Finally, we used a sequence-to-sequence neural network with attention (Bahdanau et al., 2014) as a feature for 100-best reranking. We used hidden layer and embedding sizes of 100, and vocabulary sizes of 40000 (the Turkish side was morphologically segmented).

2.7 Data

For training translation model, language models, and NMT reranker, we used only the provided constrained data (SETIMES 2 parallel Turkish-English corpus, and monolingual Turkish and English Common Crawl corpora).

Throughout our experiments, we used the BLEU (Papineni et al., 2002) on provided devset (news-dev2016) to estimate the performance of our systems, tuning MERT on a random sample of 1000 sentences from the SETIMES corpus (these sentences, to which we refer as "the SETIMES subsample", were excluded from training data). For the final submissions, we tuned MERT directly on news-dev2016.

Due to our setup, we provide BLEU scores on news-dev2016 for our intermediate experiments and on news-test2016 for our final systems.

3 Turkish-English system

3.1 Baseline

For a baseline, we trained a standard phrase-based system: Berkeley Aligner (IBM Model 1 and HMM, both for 5 iterations); phrase table with up to 5 tokens per phrase, 40-best translation options per source phrase, and Good-Turing smoothing; 5-gram lowercased LM with stupid backoff and pruning of singleton n-grams due to memory constraints; MERT on the SETIMES subsample; simple reordering model, penalized only by movement distance, with distortion limit set to 16.

We lowercased both the training and development corpora, taking into account Turkish specifics: I → ı, İ → i.

Baseline system achieves 10.84 uncased BLEU on news-dev2016 (here and on, we ignore case in BLEU computation).

² We used α = 0.6 throughout our experiments.
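The lemma-selection heuristic of Section 2.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a connected component is represented as a dictionary from word to corpus frequency, the frequencies below are invented for the example, and the function names are hypothetical. Only the scoring formula and α = 0.6 come from the text.

```python
import math

ALPHA = 0.6  # value reported in the footnote


def lemma_score(word: str, freq: int, alpha: float = ALPHA) -> float:
    """Score log f(w) - alpha * l(w): frequent and short words score higher."""
    return math.log(freq) - alpha * len(word)


def select_lemma(component: dict, alpha: float = ALPHA) -> str:
    """Pick the highest-scoring word of one connected component as its lemma."""
    return max(component, key=lambda w: lemma_score(w, component[w], alpha))


# Hypothetical component; the frequencies are made up for illustration.
component = {"arkadaş": 500, "arkadaşlar": 300, "arkadaşlarına": 20}
print(select_lemma(component))  # arkadaş
```

The length penalty keeps a very frequent inflected form from being chosen over a shorter base form unless its frequency advantage is large.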
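The Turkish-specific lowercasing mentioned in Section 3.1 (I → ı, İ → i) needs explicit handling in most programming languages, because default lowercasing maps I to i. A minimal Python sketch, with a hypothetical function name, could look like this:

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish mappings I -> ı and İ -> i."""
    # Handle the two Turkish-specific capitals first, then lowercase the rest.
    return text.replace("I", "ı").replace("İ", "i").lower()


print(turkish_lower("IŞIK İstanbul"))  # ışık istanbul
```

Doing the two replacements before calling `lower()` matters: Python's default `lower()` would turn "I" into "i" and "İ" into "i" followed by a combining dot.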
#  System description                                       BLEU (uncased), dev³   BLEU (uncased), test³
1  Baseline, phrase-based                                   11.68                  11.50
2  (1) + automatic morph., simple seg.                      12.16                  -
3  (1) + FST/perceptron morph., simple seg.                 11.75                  -
4  (1) + FST/perceptron morph., rule-based seg.             12.93                  -
5  (1) + FST/perceptron morph., aggressive rule-based seg.  14.06                  -
6  (5) + "reordered" post-ordering, rule-based reorderer    14.24                  -
7  (5) + "translated" post-ordering, rule-based reorderer   15.13                  -
8  (2) + "translated" post-ordering, automatic reorderer    13.43                  13.39
9  (7) + NMT reranking in first stage                       15.49                  15.12

Table 1: Our TR-EN setups on news-dev2016 and news-test2016 (submitted system in bold)
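The "simple" segmentation strategy of Section 2.5 (used in runs #2 and #3 of Table 1) can be sketched as follows. This is a minimal illustration: the function name is hypothetical, the `$` marker follows the example in Section 2.5, and returning the bare lemma when no suffix remains is an assumption.

```python
from typing import List


def simple_segment(surface: str, lemma: str) -> List[str]:
    """Split a word into its lemma and one affix token.

    The affix is the part of the surface form that starts at the
    (l + 1)-th letter, where l is the length of the lemma.
    """
    suffix = surface[len(lemma):]
    if not suffix:  # assumption: an uninflected word stays a single token
        return [lemma]
    return [lemma, "$" + suffix]


print(" ".join(simple_segment("arkadaşlarına", "arkadaş")))  # arkadaş $larına
```

Note that the first token is the lemma itself, not the surface prefix, so the sketch only needs a lemma from the morphological analyzer and no feature tags.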
3.2 Morphological segmentation

In Turkish-to-English translator we directly applied Turkish morphological segmenters (see Section 2.5) as an initial step in the pipeline (Oflazer and El-Kahlout, 2007; Bisazza and Federico, 2009).

The effect of different morphological tagging and segmentation methods is shown in Table 1. FST/perceptron analyzer with aggressive rule-based segmentation (run #5) turned out to be the most successful method, bringing +2.60 BLEU.

Our segmenters split Turkish words into lemmas and auxiliary tokens like $ini or +a3sg. To account for the increased number of tokens on Turkish side, we increased the length of a target phrase from 5 to 10 (but still allowing only up to 5 non-auxiliary tokens in a phrase). In order to further decrease sparsity we also removed all diacritics from the intermediate segmented Turkish. Possible ambiguity in translations, caused by this, is handled by English LM.

For a rule-based segmentation we note that it is beneficial to aggressively separate away lemma and morphological features that would normally be attached to it (that is, if we acted according to the rules). We think the reason for this is the presence of errors and non-optimal decisions in our segmentation rules, but we still consider the extra split helpful:

• If we do the extra split, a wordform is segmented into a lemma and several auxiliary tokens, so if we have seen just the lemma, we might still translate the unseen wordform correctly.
• An excessive segmentation does not really hurt a phrase-based system, as shown by (Chang et al., 2008).

3.3 Post-ordering

It is not possible to directly apply English-to-Turkish reorderer as a preprocessing step in this translation direction, and we also could not construct a Turkish-to-English reorderer (due to the absence of Turkish parser).

Instead, we reordered the target side of the parallel corpus on the training phase using the rule-based reorderer described in Section 2.3, and employed a second-stage translator to restore English word order at runtime, following (Sudoh et al., 2011).

As shown in Figure 2, the first, "monotonous translation" stage is trained to translate from Turkish to English that was reordered to the Turkish order⁴, and the second, "reordering" stage is trained to translate from reordered English to normal English, relying on the LM and baseline reordering inside the phrase-based decoder.

³ We tune on the SETIMES subsample for "dev" column, and on news-dev2016 for "test" column. So the same line lists the results for two sets of MERT coefficients.

⁴ This does not mean we completely disable the baseline reordering mechanism in the decoder on this stage; that would have made sense only if (a) our English-to-Turkish reorderer was perfect and (b) if the two languages could be perfectly aligned using just word reordering. Obviously, neither of those is the case.
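The data flow of the two-stage post-ordering setup in Section 3.3 can be sketched as below. This is a schematic illustration, not the authors' pipeline: the function names are hypothetical, the decoders are passed in as plain callables, and the toy reorderer that reverses the word order merely stands in for the rule-based reorderer of Section 2.3.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_post_ordering_corpora(
    parallel: List[Pair], reorder: Callable[[str], str]
) -> Tuple[List[Pair], List[Pair]]:
    """Derive training pairs for both stages from a Turkish-English corpus."""
    stage1, stage2 = [], []
    for turkish, english in parallel:
        reordered = reorder(english)  # English words in Turkish order
        stage1.append((turkish, reordered))  # "monotonous translation" stage
        stage2.append((reordered, english))  # "reordering" stage
    return stage1, stage2


def translate(
    turkish: str,
    stage1_decoder: Callable[[str], str],
    stage2_decoder: Callable[[str], str],
) -> str:
    """At runtime the second decoder restores English order after the first."""
    return stage2_decoder(stage1_decoder(turkish))


def toy_reorder(sentence: str) -> str:
    """Stand-in reorderer for illustration only: reverse the word order."""
    return " ".join(reversed(sentence.split()))


corpus = [("arkadaşlarına gitti", "he went to his friends")]
stage1_pairs, stage2_pairs = build_post_ordering_corpora(corpus, toy_reorder)
print(stage1_pairs)
print(stage2_pairs)
```

Both stages are trained from the same parallel corpus; only the reordered English side is shared between them, as the target of the first stage and the source of the second.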
[Figure 2 diagram: Turkish → MT stage 1 → English (in Turkish order) → MT stage 2 → English]

Figure 2: Two-stage post-ordering

Figures 3 and 4 illustrate the training of two-stage postordering systems. We explore two options for the training of the second, "reordering" stage: as the source-side, we can either use (a) the reordered English sentences, or (b) Turkish sentences translated to reordered English with first-stage translator.

[Figure 3 diagram: English is passed through "Reorder" to give English (in Turkish order); MT stage 1 is trained on Turkish and English (in Turkish order)]

Figure 3: Training the "monotonous translation" stage of post-ordering system

[Figure 4 diagram: English (in Turkish order) is obtained either by (a) "Reorder" applied to English or by (b) "Translate using stage 1" applied to Turkish; MT stage 2 is trained on English (in Turkish order) and English]

Figure 4: Two options for training the "reordering" stage of post-ordering system

The two decoders have two sets of MERT coefficients. We tune them jointly and iteratively: first, we tune the first-stage decoder (with second-stage coefficients fixed), optimizing BLEU of the whole-system output, then we tune the second-stage decoder (with first-stage coefficients fixed), again optimizing the whole-system BLEU, and so on.

As shown in Table 1, the best results are achieved using "translated Turkish" for training the second-stage translator, yielding an additional +1.60 BLEU.

3.4 NMT reranking

Finally, we enhanced the first-stage translator with a 100-best reranking which uses decoder features and a neural sequence-to-sequence network described in Section 2.6. To train the network, we used the same corpus used to train the first-stage PBMT translator (incorporating Turkish segmentation and English reordering).

NMT reranking yields an additional +0.47 BLEU score.

3.5 Final system

The complete pipeline of our submitted system is shown in Figure 5.

We selected the setup that performed best during experiments (#9 in Table 1), and re-tuned it on the development set; for contrastive runs we also re-tuned baseline and "fully automatic" systems (#1 and #8 respectively). See Table 1 for results.

Our best setup reaches 15.17 BLEU, which is a +3.17 BLEU improvement over the baseline.

The system without the hand-crafted rules achieves a lower improvement of +1.89 BLEU, which is a nice gain nevertheless. Comparing runs #2 and #3, we see that the decrease in BLEU is not due to the quality of morphological analysis; comparing runs #3 and #5, we see that the difference in quality is purely due to the segmentation scheme.

4 English-Turkish system

4.1 Baseline

As a baseline, we trained the same phrase-based system as in Section 3.1 (except we did not prune singleton n-grams in the Turkish language model).

Baseline system achieves 8.51 uncased BLEU on news-dev2016.

4.2 Pre-ordering

We directly apply English-to-Turkish reorderers described in Section 2.3 as a pre-processing step in the phrase-based MT pipeline, like e.g. (Xia and McCord, 2004; Collins et al., 2005). Results are shown in Table 2.

The rule-based reorderer earns +1.65 BLEU against the baseline (run #2), so we selected it as a