JU-Saarland Submission in the WMT2019 English–Gujarati Translation Shared Task
Riktim Mondal¹,*, Shankha Raj Nayek¹,*, Aditya Chowdhury¹,*,
Santanu Pal², Sudip Kumar Naskar¹, Josef van Genabith²
¹ Jadavpur University, Kolkata, India
² Saarland University, Germany
{riktimrules,shankharaj29,adityachowdhury21}@gmail.com
{santanu.pal,josef.vangenabith}@uni-saarland.de
sudip.naskar@cse.jdvu.ac.in
* These three authors have contributed equally.

Proceedings of the Fourth Conference on Machine Translation (WMT), Volume 2: Shared Task Papers (Day 1), pages 308–313, Florence, Italy, August 1-2, 2019. © 2019 Association for Computational Linguistics

Abstract

In this paper we describe our joint submission (JU-Saarland) from Jadavpur University and Saarland University in the WMT 2019 news translation shared task for the English–Gujarati language pair within the translation task sub-track. Our baseline and primary submissions are built using a recurrent neural network (RNN) based neural machine translation (NMT) system with an attention mechanism, followed by fine-tuning on in-domain data. Given that the two languages belong to different language families and there is not enough parallel data for this language pair, building a high quality NMT system for it is a difficult task. We produced synthetic data through back-translation from the available monolingual data. We report the automatic evaluation scores of our English–Gujarati and Gujarati–English NMT systems trained at word, byte-pair and character encoding levels, where the RNN at word level is considered the baseline and used for comparison. Our English–Gujarati system ranked second in the shared task.

1 Introduction

Neural machine translation (NMT) is an approach to machine translation (MT) that uses an artificial neural network to directly model the conditional probability p(y|x) of translating a source sentence (x_1, x_2, ..., x_n) into a target sentence (y_1, y_2, ..., y_m). NMT has consistently performed better than phrase-based statistical MT (PB-SMT) approaches and has provided state-of-the-art results in the last few years. However, one of the major constraints of supervised NMT is that it is not suitable for low-resource language pairs. Thus, to use supervised NMT, low-resource pairs need to resort to other techniques to increase the size of the parallel training dataset. In the WMT 2019 news translation shared task, one such resource-scarce language pair is English-Gujarati. Because the volume of parallel corpora available to train an NMT system for such language pairs is insufficient, the creation of more actual or synthetic parallel data for low-resource languages such as Gujarati is an important issue.

In this paper, we describe the joint participation of Jadavpur University and Saarland University in the WMT 2019 news translation task for English–Gujarati and Gujarati–English. The released training data come from a completely different domain than the development set, and their size is nowhere near the amount of training data typically required for the success of NMT systems. We use additional synthetic data produced through back-translation from the monolingual corpus. This provides significant improvements in translation performance for both our English–Gujarati and Gujarati–English NMT systems. Our English–Gujarati system was ranked second in terms of BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) in the shared task.

2 Related Works

Dungarwal et al. (Dungarwal et al., 2014) developed a statistical method for machine translation, in which a phrase-based method was used for Hindi-English and a factored method for the English-Hindi SMT system. They showed improvements over the existing SMT systems using pre-processing and post-processing components that generated morphological inflections correctly. Imankulova et al. (Imankulova et al., 2017) showed how back-translation and filtering from monolingual data can be used to build an effective translation system for a low-resource language pair like Japanese-
Russian. Sennrich et al. (Sennrich et al., 2016a) showed how back-translation of monolingual data can improve the NMT system. Ramesh et al. (Ramesh and Sankaranarayanan, 2018) demonstrated how an existing model like a bidirectional recurrent neural network can be used to generate parallel sentences for low-resource language pairs like English-Tamil and English-Hindi, in order to improve the SMT and the NMT systems. Choudhary et al. (Choudhary et al., 2018) showed how to build an NMT system for a low-resource language pair like English-Tamil using techniques like word embeddings and Byte-Pair-Encoding (Sennrich et al., 2016b) to handle Out-Of-Vocabulary words.

3 Data Preparation

For our experiments we used both the parallel and the monolingual corpora released by the WMT 2019 Organizers. We back-translate the monolingual corpus and use it as an additional synthetic parallel corpus to train our NMT system. The detailed statistics of the corpus are given in Table 1.

Dataset                     Pairs
Parallel Corpora            192,367
Cleaned Parallel Corpora    64,346
Back-translated Data        219,654
Development Data            1,998
Gujarati Test Data          1,016
English Test Data           998

Table 1: Data Statistics of WMT 2019 English–Gujarati translation shared task.

We performed our experiments on two datasets, one using the parallel corpus provided by WMT 2019 for the Gujarati–English news translation shared task, and the other using the parallel corpus combined with back-translated sentences from the provided monolingual corpus (only the News crawl corpus was used for back-translation) for the same language pair.

Since the released parallel corpus was very noisy, containing redundant sentences, we cleaned the parallel corpus; the procedure is described in Section 3.1.

In the next step we shuffle the whole corpus, as this reduces variance and makes sure that our model overfits less. We then split the dataset into three parts: training, validation and test set. Shuffling is important for the split too, as the test and validation sets should come from the same distribution and must be chosen randomly from the available data. Here, the test set was also shuffled, as this dataset was used for our internal assessment. After cleaning, we randomly selected 64,346 sentence pairs for training, 1,500 sentence pairs for validation and 1,500 sentences as test data. Note that our validation and test corpora are taken from the released parallel data to set up a baseline model. Later, when the WMT19 Organizers released the development set, we continued training our models, treating the WMT19 development set as our test set and using a new development set of 3,000 sentences, obtained by combining the 1,500 validation sentences and the 1,500 test sentences (both drawn from the parallel corpus, as stated above). While training our final model, the released development set was used. After cleaning, it was obvious that the amount of training data is not enough to train a neural system for such a low-resource language pair. Therefore, a large volume of parallel data has to be prepared, which can be produced either through manual translation by professional translators or by scraping parallel data from the internet. However, these processes are costly, tedious and sometimes inefficient (in the case of scraping from the internet).

As the released data was insufficient, we use back-translation to generate more training data. For back-translation we applied two methods: first, unsupervised statistical machine translation as described in (Artetxe et al., 2018), and second, the Doc translation API¹ (the API uses Google translator as of April 2019). We explain the extraction of sentences and the corresponding results using the above methods in Section 4.2. The synthetic dataset which we have generated can be found here.²

¹ https://www.onlinedoctranslator.com/en/
² https://github.com/riktimmondal/Synthetic-Data-WMT19-for-En-Gu-Language-pair

3.1 Data Preprocessing

To train an efficient machine translation system, the available raw parallel corpus must be cleaned so that the system produces consistent and reliable translations. The released version of the raw parallel corpus contained redundant pairs which need to be removed to obtain better results
as demonstrated in previous works (Johnson et al., 2017), and which are of the types given below:
• The source is the same for different targets.
• The source is different for the same target.
• Repeated identical sentence pair.

The redundancy in the translation pairs makes the model prone to overfitting and hence prevents it from recognizing new features. Thus, one of the sentence pairs is kept while the other redundant pairs are removed. Some sentence pairs mixed text from both languages; these were also identified as redundant. Such pairs strictly need to be eliminated, because otherwise the vocabulary of each language contains alphanumeric characters of the other language, which results in inconsistent encoding and decoding during the encoder-decoder application steps on the considered language pair. We tokenize the English side using the Moses (Koehn et al., 2007) tokenizer and, for Gujarati, we use the Indic NLP library tokenization tool³. Punctuation normalization was also done.

³ http://anoopkunchukuttan.github.io/indic_nlp_library/

3.2 Data Postprocessing

Postprocessing, such as detokenization (Klein et al., 2017) and punctuation normalization⁴ (Koehn et al., 2007), was performed on our translated data (on the test set) to produce the final translated data.

⁴ punctuation normalization.perl

4 Experiment Setup

We explain our experimental setups in the next two sections. The first section contains the setup used for our final submission and the next section describes all the other supporting experimental setups. We use the OpenNMT toolkit (Klein et al., 2017) for our experiments. We performed several experiments where the parallel corpus is sent to the model in space-separated character format, space-separated word format, and space-separated Byte Pair Encoding (BPE) format (Sennrich et al., 2016b). For our final (i.e., primary) submission for the English–Gujarati task, the source input words were converted to BPE whereas the Gujarati words were kept as they are. For our Gujarati–English submission, both the source and the target were in simple word-level format.

4.1 Primary System description

Our primary NMT systems are based on an attention-based uni-directional RNN (Cho et al., 2014) for Gujarati–English and a bi-directional RNN (Cheng et al., 2016) for English–Gujarati.

hyper-parameter              Value
Model-type                   text
Model-dtype                  fp32
Attention-layer              2
Attention-Head/layer         8
Hidden-layers                500
Batch-Size                   256
Training-steps               160,000
Source vocab-size            50,000
Target vocab-size            50,000
learning-rate                warm-up+decay*
global-attention function    softmax
tokenization-strategy        wordpiece
RNN-type                     LSTM

Table 2: Hyper-parameter configurations for Gujarati–English translation using unidirectional RNN (Cho et al., 2014). *learning-rate was initially set to 1.0.

Table 2 shows the hyper-parameter configurations for our Gujarati–English translation system. We initially trained our model on the cleaned parallel corpus provided by WMT 2019 for up to 100K training steps. Thereafter, we fine-tune our generic model on a domain-specific corpus (containing 219K sentences back-translated using the Doc Translator API), changing the learning rate to 0.5, starting the decay from 130K training steps with a decay factor of 0.5, and keeping the other hyper-parameters the same as in Table 2.

hyper-parameter              Value
Model-type                   text
Model-dtype                  fp32
Encoder-type                 BRNN
Attention-layer              2
Attention-Head/layer         8
Hidden-layers                512
Batch-Size                   256
Training-steps               135,000
Source vocab-size            26,859
Target vocab-size            50,000
learning-rate                warm-up+decay
global-attention function    softmax
tokenization-strategy        Byte-pair Encoding
RNN-type                     LSTM

Table 3: Hyper-parameter configurations for English–Gujarati translation using bi-directional RNN (Cheng et al., 2016).

To build our English–Gujarati translation system, we initially trained a generic model like our
Gujarati–English translation system. However, in this case we use different hyper-parameter configurations, as given in Table 3. Additionally, here, we use byte-pair encoding on the English side with 32K merge operations. We do not perform BPE on the Gujarati corpus; we keep the original word format for Gujarati. Our generic model was trained for up to 100K training steps and then fine-tuned on a domain-specific parallel corpus with the English side in BPE and the Gujarati side in word-level format. During fine-tuning, we reduce the learning rate from 1.0 to 0.25 and start decaying from 120K training steps with a decay factor of 0.5. The other hyper-parameter configurations remain unchanged. The hyper-parameters used for the English–Gujarati task in our primary system submission were also tested for the reverse direction; however, this did not perform as well as the primary system and hence the final system was modified accordingly.

4.2 Other Supporting Experiments

In this section we describe all the supporting experiments that we performed for this shared task, from Statistical MT to NMT in both supervised and unsupervised settings.

All the results and experiments discussed below are tested on the released development set (considering this as the test set). These models were not tested with the released test set as they provided poor BLEU scores on the development set.

We used a uni-directional RNN with LSTM units trained on 64,346 pre-processed sentences (cf. Section 3) with 120K training steps and a learning rate of 1.0. For English–Gujarati, where the input was space-separated words on both sides, we achieved the highest BLEU score of 4.15 after fine-tuning with 10K sentences selected from the cleaned parallel corpus whose total number of tokens (words) exceeded 8. The BLEU score dropped to 3.56 when applying BPE on both sides. For the other direction (Gujarati–English) of the language pair, we got the highest BLEU scores of 5.13 and 5.09 at word level and BPE level respectively.

We also tried a transformer-based NMT model (Vaswani et al., 2017), which however gave extremely poor results in similar experimental settings. The highest BLEU we achieved was 0.74 for Gujarati–English and 0.96 for English–Gujarati. The transformer model was trained for 100K training steps, with a batch size of 64 on a single GPU, and the positional encoding layers size was set to 2.

Since the training data size was not enough, we used back-translation to generate additional synthetic sentence pairs from the monolingual corpus released in WMT 2019. We initially used monoses (Artetxe et al., 2018), which is based on unsupervised statistical phrase-based machine translation, to translate the monolingual sentences from English to Gujarati. We used 2M English sentences to train the monoses system. The training process took around 6 days on our modest 64 GB server. However, the results were extremely poor, with a BLEU score of 0.24 for English–Gujarati and 0.01 for the opposite direction, without using the preprocessed parallel corpus. Moreover, after adding the preprocessed parallel corpus, the BLEU score dropped significantly. This motivated us to use an online document translator, in our case the Google translation API, for back-translating sentence pairs from the released monolingual dataset. The back-translated data was later combined with our preprocessed parallel corpus for our final model.

Additionally, we also tried a simple unidirectional RNN model at character level; however, this also failed to improve performance. We have compiled all the results in Table 4.

5 Primary System Results

Our primary submission for English–Gujarati, using the bidirectional RNN model with BPE on the English side (see Section 4.1) and word format on the Gujarati side, gave the best result. On the other hand, for Gujarati–English the primary submission, based on a uni-directional RNN model with both English and Gujarati in word format, gave the best result. Before submission, we performed punctuation normalization, unicode normalization, and detokenization for each run. Table 5 shows the published results of our primary submissions on the WMT 2019 test set. Table 6 shows our hands-on experimental results on the development set.

6 Conclusion and Future Work

In this paper, we applied NMT to one of the most challenging language pairs, English–Gujarati, as the availability of parallel corpus is really scarce
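As an illustration of the cleaning and splitting procedure of Sections 3 and 3.1, the following is a minimal sketch, not the authors' actual script. It assumes the corpus is held in memory as a list of (source, target) string pairs, and that "one pair is kept" means the first occurrence is kept; the paper does not say which one. The function names `clean_pairs` and `shuffle_and_split` and the `seed` parameter are hypothetical.

```python
import random


def clean_pairs(pairs):
    """Remove the three redundancy types listed in Section 3.1.

    Assumption: the first pair seen for a given source or target is kept.
    Skipping any pair whose source OR target was already seen covers
    same-source/different-target, different-source/same-target,
    and exact repeats.
    """
    seen_src, seen_tgt = set(), set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if src in seen_src or tgt in seen_tgt:
            continue
        seen_src.add(src)
        seen_tgt.add(tgt)
        kept.append((src, tgt))
    return kept


def shuffle_and_split(pairs, n_valid, n_test, seed=0):
    """Shuffle once, then cut validation, test and training sets.

    All three sets come from the same shuffled list, so they are
    drawn from the same distribution and do not overlap.
    """
    rng = random.Random(seed)  # fixed seed (hypothetical) for repeatability
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test
```

With the sizes reported in Section 3, a call would look like `shuffle_and_split(clean_pairs(raw_pairs), n_valid=1500, n_test=1500)`, where `raw_pairs` stands for the released parallel corpus.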
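The fine-tuning learning-rate settings of Section 4.1 can be pictured with a small step-decay sketch. This is only an approximation under stated assumptions: the paper gives the base rate, the step at which decay starts and the decay factor, but not how often the rate is decayed, so `decay_every` below is a hypothetical value, as is the choice to apply the first decay exactly at the start step. The warm-up phase mentioned in Tables 2 and 3 is not modelled, and the function name `step_decay_lr` is illustrative.

```python
def step_decay_lr(step, base_lr, start_decay_step,
                  decay_factor=0.5, decay_every=10_000):
    """Return the learning rate at a given training step.

    Before start_decay_step the rate is constant. From that step on it
    is multiplied by decay_factor once every decay_every steps.
    decay_every is an assumption; the paper does not state it.
    """
    if step < start_decay_step:
        return base_lr
    n_decays = (step - start_decay_step) // decay_every + 1
    return base_lr * decay_factor ** n_decays


# English-Gujarati fine-tuning values from Section 4.1:
# learning rate 0.25, decay from step 120K, decay factor 0.5.
en_gu_lr_at_125k = step_decay_lr(125_000, base_lr=0.25,
                                 start_decay_step=120_000)
```

For the Gujarati–English system the same sketch would be called with `base_lr=0.5` and `start_decay_step=130_000`, the values reported for that direction.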