NITK-UoH: Tamil-Telugu Machine Translation Systems for the WMT21 Similar Language Translation Task

Richard Saldanha, Ananthanarayana V. S and Anand Kumar M
Department of Information Technology, National Institute of Technology Karnataka
NH 66, Srinivasnagar, Surathkal, Mangalore, Karnataka 575025, India
richardsaldanha.207it005@nitk.edu.in, anvs@nitk.edu.in, m_anandkumar@nitk.edu.in

Parameswari Krishnamurthy
Centre for Applied Linguistics and Translation Studies, University of Hyderabad
Prof. CR Rao Road, Gachibowli, Hyderabad, Telangana 500046, India
pksh@uohyd.ac.in
Proceedings of the Sixth Conference on Machine Translation (WMT), pages 299–303, November 10–11, 2021. ©2021 Association for Computational Linguistics

Abstract

In this work, two Neural Machine Translation (NMT) systems have been developed and evaluated as part of the bidirectional Tamil-Telugu similar languages translation subtask in WMT21. The OpenNMT-py toolkit has been used to create quick prototypes of the systems. The models have then been trained on the training datasets containing the parallel corpus and evaluated on the dev datasets provided as part of the task. Both systems have been trained on a DGX station with 4 V100 GPUs.

The first NMT system in this work is a Transformer based 6 layer encoder-decoder model, trained for 100000 training steps, whose configuration is similar to the one provided by OpenNMT-py; it is used to create a single model for bidirectional translation. The second NMT system contains two unidirectional translation models with the same configuration as the first system, with the addition of Byte Pair Encoding (BPE) for subword tokenization through the pre-trained MultiBPEmb model. Based on the dev dataset evaluation metrics for both systems, the first system, i.e. the vanilla Transformer model, has been submitted as the Primary system. Since there were no improvements in the metrics during training of the second system with BPE, it has been submitted as a contrastive system.

1 Introduction

Tamil is a language predominantly spoken in Tamil Nadu, a state in Southern India, as well as in countries with a large Tamil speaking diaspora such as Sri Lanka, Malaysia and Singapore, to name a few. Telugu, on the other hand, is the official language of two Southern states in India, namely Andhra Pradesh and Telangana. It is also spoken among the Telugu speaking immigrant population in the USA, Canada and the UK. Both languages belong to the Dravidian family of languages, whose major languages spoken in South India are Tamil, Telugu, Kannada and Malayalam. Despite belonging to the same family of languages, there are many differences between Tamil and Telugu, such as the script used for writing and linguistic differences in terms of phonology, morphology and syntax, among others. Tamil belongs to the Southern branch of Dravidian languages and has a rich literary tradition spanning more than 2000 years. Telugu, on the other hand, belongs to the South Central branch of Dravidian languages and has a considerable number of linguistic characteristics that differ from Tamil, as described by Krishnamurthy (2019).

As part of the Similar Language Translation subtask for Dravidian languages, namely Tamil (TA) and Telugu (TE), we have attempted to build Neural Machine Translation (NMT) models using the OpenNMT-py toolkit¹, which helps to generate quick prototypes for the NMT models with the desired configurations. The first NMT system (submitted as the primary system) in this work is a Transformer based 6 layer encoder-decoder model which provides a single model for bidirectional translation between Tamil and Telugu using the datasets provided for this shared task. The second NMT system (submitted as the contrastive system) consists of two unidirectional translation models with the same configuration as the first system, but with the addition of Byte Pair Encoding (BPE) for subword tokenization using the pre-trained MultiBPEmb model (Heinzerling and Strube, 2018).

The rest of the work is described in sections that pertain to the related work, data, system description, results and conclusion.

¹ https://opennmt.net/OpenNMT-py/main.html

2 Rationale for Selecting the Models and Related Work

There has been a significant amount of work done on developing machine translation systems for Indian languages, with some notable examples for Dravidian languages such as Tamil and Malayalam described in Kumar et al. (2019). This shared task provides a unique challenge in terms of the constraint on the parallel aligned language pair data made available for training. The other challenges include the linguistically rich and domain specific content present in the Prime Minister of India (PMI) and the Mann ki Baat (MKB) datasets, where topics related to India's domestic and foreign policy issues can be found.

In order to address the challenge of lengthy input (samples containing more than 300 space delimited tokens), the Transformer model described by Vaswani et al. (2017) was adopted. This model provides the multi head attention mechanism, which helps retain context for longer sentence samples. To reduce the vocabulary, reduce the training time and possibly improve the translation quality (through subword tokenization), a MultiBPEmb model trained with a vocabulary of 100000 tokens from 275 languages has been utilised (Heinzerling and Strube, 2018).

Other methods to improve translation quality that have not been explored as part of this work include back translation using a monolingual corpus or corpora, along the lines of the approach described by Sennrich et al. (2016). Factored NMT (which uses data tagged on the basis of morphology and Parts of Speech (POS)), such as the one described by García-Martínez et al. (2016), is another possible candidate suitable for the kind of challenge provided by the similar language translation task, as the use of POS and morphological information can reduce the number of tokens and make the models more generalizable in terms of predictions.

3 Data

The datasets used in the NMT systems for this work are the parallel aligned Tamil and Telugu (TA-TE) language pairs provided as part of the Dravidian Language subtask of the Similar Language Translation shared task². Some statistics about the dataset are outlined in Table 1.

² https://wmt21similar.cs.upc.edu/

Dataset Type                              Dataset Name   Number of samples
Parallel Aligned TA-TE pairs (Training)   PMIndia        26009
Parallel Aligned TA-TE pairs (Training)   News           11038
Parallel Aligned TA-TE pairs (Training)   MKB            3100
Parallel Aligned TA-TE pairs (Dev)        Dev            1261
Non Aligned TA-TE sets (Test)             Test           1735 (per language set)
Table 1: Dataset statistics for parallel aligned Tamil-Telugu pairs used as train and dev (validation) datasets along with non aligned samples used as the test set.

Dataset Type   Dataset Name   Language   Longest Line Length
Training       PMIndia        TA         659
Training       News           TA         1524
Training       MKB            TA         412
Dev            Dev            TA         923
Test           Test           TA         1544
Training       PMIndia        TE         718
Training       News           TE         1356
Training       MKB            TE         376
Dev            Dev            TE         1004
Test           Test           TE         757
Table 2: Dataset statistics for Longest Line.

3.1 Dataset preprocessing

Due to the moderate size of the training dataset, which contains 40147 samples, along with the topic
Model Configuration Name            Model Configuration Value
Corpus Weights for PMI dataset      23
Corpus Weights for News dataset     19
Corpus Weights for MKB dataset      3
Source and Target Sequence Length   1600
Save checkpoint after steps         500
Number of training steps            100000
Number of validation steps          5000
Training batch size                 4096
Dev (validation) batch size         16
Optimizer                           Adam
Number of Encoder Decoder Layers    6 (each)
Number of Attention heads           8
Table 3: Training Configuration for Transformer based Encoder-Decoder Model (Primary System).
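
The corpus weights in Table 3 and the pair reversal used for the bidirectional system can be illustrated with a small sketch. It assumes that a weight of 23/19/3 means "take 23 examples from PMI, then 19 from News, then 3 from MKB, and repeat"; whether OpenNMT-py interleaves the corpora in exactly this way is an assumption that should be checked against the toolkit's documentation. The function names and the placeholder strings are hypothetical and are not taken from the authors' code.

```python
from itertools import cycle, islice

# Corpus weights as listed in Table 3.
WEIGHTS = {"PMI": 23, "News": 19, "MKB": 3}


def make_bidirectional(pairs):
    """Append a reversed (target, source) copy of every pair, doubling the data."""
    return pairs + [(tgt, src) for src, tgt in pairs]


def weighted_stream(corpora, weights):
    """Yield (corpus_name, pair) forever, taking `weight` consecutive pairs
    from each corpus in turn. This interleaving scheme is an assumption."""
    iterators = {name: cycle(corpora[name]) for name in weights}
    while True:
        for name, weight in weights.items():
            for pair in islice(iterators[name], weight):
                yield name, pair


# Placeholder strings stand in for real Tamil and Telugu sentences.
toy_corpora = {
    "PMI": make_bidirectional([("ta-pmi-0", "te-pmi-0"), ("ta-pmi-1", "te-pmi-1")]),
    "News": make_bidirectional([("ta-news-0", "te-news-0")]),
    "MKB": make_bidirectional([("ta-mkb-0", "te-mkb-0")]),
}

one_round = list(islice(weighted_stream(toy_corpora, WEIGHTS), sum(WEIGHTS.values())))
counts = {name: sum(1 for n, _ in one_round if n == name) for name in WEIGHTS}
print(counts)  # {'PMI': 23, 'News': 19, 'MKB': 3}
```

Under this reading, roughly half of the training examples in each round (23 of 45) come from PMI, even though PMI holds about 65% of the 40147 training pairs, so the weights shift emphasis towards News relative to plain concatenation. Applying the reversal to the full training set would give 2 × 40147 = 80294 examples.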
overlap of sentence samples between the training and dev datasets as well as the test set (to a certain extent) on topics such as the Indian Prime Minister's statements on domestic issues and foreign policies in the PM India dataset, the entire training dataset has been utilized in its original form.

The lengthwise statistics of the dataset (in terms of space delimited tokens) are given in Table 2; these were taken as the deciding factor in fixing the maximum input length as 1600 for the NMT systems developed. The tokenization for the primary system was done as space delimited tokens, which yielded a shared Tamil-Telugu vocabulary of 194860 tokens. On the other hand, using the MultiBPEmb model for subword tokenization gave a vocabulary of 14056 tokens for Tamil (TA) and 13170 tokens for Telugu (TE), which included some words in English as well.

4 System Description

As mentioned in section 1, the PyTorch based toolkit OpenNMT-py has been used to create rapid prototypes for NMT models (the motivations for the same can be seen in section 2). These have then been trained on the datasets provided and validated against the provided dev sets. Finally, translations for the test sets described in section 3 have been obtained and submitted to the committee for evaluating the Similar Language Translation task.

A DGX station with 4 V100 GPUs has been used to train the models utilized in this task. A Transformer based 6 layer encoder-decoder model, along the lines of the NMT system described by Vaswani et al. (2017), was trained for 100000 training steps as the first NMT system to be evaluated. The configuration for this model is the same as that provided by OpenNMT-py. In order to save time, a single bidirectional translation model for the TA-TE language pair has been created, which can translate from Tamil to Telugu and vice versa. The datasets used in this system were doubled in terms of the number of samples when compared to the second NMT system (contrastive submission), by reversing the position of the TA-TE language pair and appending the reversed pairs to the original datasets. No special tagging identifiers were used as the Tamil and Telugu scripts are distinct.

Basic space delimited tokenization was applied on the datasets, which resulted in a combined TA-TE vocabulary of 194860 tokens being generated. The relevant key configuration values for this model are listed in Table 3.

The corpus weights help assign varied importance to the particular datasets used in this task. The values for these weights were determined after visual analysis of the dev (validation) dataset, which indicated that the dev dataset's contents had a greater overlap with PMI, News and MKB (Mann ki Baat, which roughly translates to "From the heart"), in that particular order. The training time for the entire model was 18 hours.

The second NMT system consists of two unidirectional translation models with the same configuration as the first system, with the addition of Byte Pair Encoding (BPE) for subwords using the pretrained MultiBPEmb model (Heinzerling and Strube, 2018). The intuition behind using BPE was to reduce the vocabulary size using subword tokenization. The choice of the pre-trained BPE model was based on the relevance of content
System Name                                            Source Language   Target Language   BLEU    RIBES   TER
Primary System (Transformer Based)                     TA                TE                4.321   7.4     99.1
Contrastive System (Transformer Based + BPE subword)   TA                TE                0.003   0.0     130.6
Primary System (Transformer Based)                     TE                TA                3.908   9.0     98.7
Contrastive System (Transformer Based + BPE subword)   TE                TA                0.029   3.0     105.0
Table 4: Dev dataset BLEU, RIBES and TER Corpus level scores using the VizSeq library.
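
To make the BLEU column of Table 4 easier to interpret, the following is a minimal sketch of corpus-level BLEU (Papineni et al., 2002) over space delimited tokens, with a single reference per sentence and no smoothing. It is not the VizSeq implementation used in the paper and is not expected to reproduce the reported scores; the function names and the placeholder token strings are illustrative assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale (single reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
            # Counter intersection keeps the minimum count, i.e. clipped matches.
            matches[n - 1] += sum((h_ngrams & r_ngrams).values())
            totals[n - 1] += sum(h_ngrams.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


# Placeholder tokens, not real Tamil or Telugu text.
reference = ["a b c d e"]
exact = corpus_bleu(["a b c d e"], reference)
partial = corpus_bleu(["a b c d x"], reference)
disjoint = corpus_bleu(["v w x y z"], reference)
print(exact, partial, disjoint)
```

A hypothesis that shares no tokens with its reference scores zero under this definition, which matches the general shape of the near-zero contrastive scores in Table 4; the paper itself attributes those scores to a vocabulary matching failure.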
System Name          Source Language   Target Language   BLEU   RIBES   TER      System Rank
Primary System       TA                TE                6.09   17.03   -        1
Contrastive System   TA                TE                0.00   0.03    -        9
Primary System       TE                TA                6.55   19.61   98.356   4
Contrastive System   TE                TA                0.04   1.00    -        9
Table 5: Test dataset BLEU, RIBES, TER scores and BLEU based System Rank in the Shared Task
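
A dash in the TER column of Table 5 marks a value above 100.0. The sketch below shows how an edit-rate metric can exceed 100: the number of word-level edits is divided by the number of reference words, so a hypothesis that is longer than its reference and shares no words with it needs more edits than the reference has words. This is a simplification that omits the block shifts of the full TER definition (Snover et al., 2006), so it should be read as an upper bound on TER rather than as the metric used in the paper; the function names and placeholder tokens are hypothetical.

```python
def edit_distance(hyp_tokens, ref_tokens):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(ref_tokens) + 1))
    for i, hyp_tok in enumerate(hyp_tokens, start=1):
        current = [i]
        for j, ref_tok in enumerate(ref_tokens, start=1):
            cost = 0 if hyp_tok == ref_tok else 1
            current.append(min(previous[j] + 1,          # drop a hypothesis word
                               current[j - 1] + 1,       # add a reference word
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def edit_rate_without_shifts(hypotheses, references):
    """100 * total edits / total reference words (assumes non-empty references)."""
    edits = sum(edit_distance(h.split(), r.split())
                for h, r in zip(hypotheses, references))
    ref_words = sum(len(r.split()) for r in references)
    return 100.0 * edits / ref_words


# Placeholder tokens: a 5-word hypothesis against a 3-word reference.
long_mismatch = edit_rate_without_shifts(["v w x y z"], ["a b c"])
identical = edit_rate_without_shifts(["a b c"], ["a b c"])
print(long_mismatch, identical)
```

Here the mismatched hypothesis needs five edits against three reference words, giving a rate well above 100, while the identical hypothesis scores 0.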
used for BPE model training, languages supported and size of the vocabulary. Heinzerling and Strube (2018) describe a MultiBPE model with a vocabulary of 100000 tokens, which was deemed suitable for this task as it supported Tamil and Telugu, was trained on WikiNews and could use a single vocabulary like the first NMT system used in this work. During training it was found that the translations for the Dev set couldn't distinguish between Tamil and Telugu subwords correctly, due to the failure in vocabulary matching for the candidates used in the evaluation and possibly due to the vocabulary shared between the languages. Hence, this system was trained twice, generating two unidirectional models for TA-TE and TE-TA translations. The training time for each model was 5 hours, which is less than that of the primary system due to the number of samples used (the primary system uses double the number of samples) and the vocabulary size (the contrastive system has a smaller and fixed vocabulary as a pre-trained BPE model has been used).

5 Results

The evaluation metrics used to evaluate the systems in this task are the BiLingual Evaluation Understudy (BLEU) score as described by Papineni et al. (2002), the Rank-based Intuitive Bilingual Evaluation (RIBES) score as described by Isozaki et al. (2010) and the Translation Error Rate (TER) as described by Snover et al. (2006). Corpus level metrics for the dev dataset were computed using the VizSeq Python library, which is an implementation of several metrics described by Wang et al. (2019). The metrics for the dev dataset are listed in Table 4.

Based on the evaluation metrics of the dev (validation) dataset translations for both systems evaluated in this work, the first system, i.e. the vanilla Transformer model, has been submitted as the Primary system. Since there were no improvements in the metrics during training of the second system, which consists of the Transformer model along with the MultiBPEmb model for subword tokenization (the reason can be seen in section 6), the second system has been submitted as a contrastive system.

Table 5³ lists the evaluation metrics applied on the test dataset and the BLEU based system rank in the shared task provided by the evaluation committee⁴,⁵.

³ The results of the TER metrics for the test set translations have been marked as - (refer Table 5) when the values exceed 100.0.
⁴ https://mzampieri.com/workshops/wmt/2021/TA_TE.pdf
⁵ https://mzampieri.com/workshops/wmt/2021/TE_TA.pdf

6 Conclusion and Future Work

The analysis of the evaluation metrics, from section 5, on the dev dataset indicates that the primary system, which is a Transformer based Encoder-