Source: www.lrec-conf.org
Challenges and Solutions for Consistent Annotation of Vietnamese Treebank
Quy T. Nguyen (1, 2), Yusuke Miyao (1, 2), Ha T.T. Le (3), Ngan L.T. Nguyen (4)
(1) The Graduate University for Advanced Studies (SOKENDAI), Japan
(2) National Institute of Informatics, Japan
(3) University of Social Sciences and Humanities, Vietnam
(4) University of Information Technology, Vietnam
quynt@nii.ac.jp, yusuke@nii.ac.jp, trucha.ussh@gmail.com, ngannlt@uit.edu.vn
Abstract
Treebanks are important resources for research in natural language processing, speech recognition, theoretical linguistics, etc. To
strengthen the automatic processing of the Vietnamese language, a Vietnamese treebank has been built. However, the quality of this
treebank is not satisfactory and is a possible source for the low performance of Vietnamese language processing. We have been building
a new treebank for Vietnamese with about 40,000 sentences annotated with three layers: word segmentation, part-of-speech tagging,
and bracketing. In this paper, we describe several challenges of the Vietnamese language and how we solve them in developing annotation
guidelines. We also present our methods to improve the quality of the annotation guidelines and ensure annotation accuracy and
consistency. Experimental results show that inter-annotator agreement ratios and accuracy are higher than 90%, which is satisfactory.
Keywords: Vietnamese Treebank, Consistent Annotation, Challenges and Solutions
1. Introduction

Treebanks, corpora annotated with syntactic structures, are important resources for researchers in natural language processing (NLP). Treebanks provide important syntactic information in order to improve the quality of NLP tools. To strengthen the automatic processing of the Vietnamese language, Nguyen et al. (2009) have built a Vietnamese treebank, named VLSP treebank, containing 10,000 sentences. However, the quality of the VLSP treebank, including the quality of the annotation scheme, the annotation guidelines, and the annotation process, is not satisfactory and is a possible source for the low performance of Vietnamese language processing (Nguyen et al., 2012; Nguyen et al., 2013).

We have been building a new Vietnamese treebank with 3,000 texts (about 40,000 sentences) covering 14 topics collected from a Vietnamese online newspaper, Thanhnien news [1]. Our treebank is annotated with three layers: word segmentation (WS), part-of-speech (POS) tagging, and bracketing, as shown in Figure 1 [2]. We have found that ensuring the annotation consistency and accuracy is one of the most important considerations in the annotation of a treebank. This requires clear and complete annotation guidelines. The guidelines contain the annotation scheme, consistent principles to annotate linguistic phenomena, and sufficient examples. These documents are not only used to train annotators but are also valuable sources serving the uses of the treebank.

We prepared three sets of guidelines for the Vietnamese treebank: WS guidelines, POS tagging guidelines, and bracketing guidelines. In this paper, Section 2 describes the general characteristics of the Vietnamese language in comparison with other languages (e.g., English and Chinese) to indicate that building a high-quality Vietnamese treebank is a challenging problem. We also present our methodology to tackle the challenges in this section. We then discuss difficulties in WS, POS tagging, and bracketing, and how we solve them in developing the annotation guidelines, in Sections 3, 4, and 5, respectively. Finally, in Section 6, we describe our annotation process, how we revise the guidelines during the annotation process, and methods to ensure the annotation consistency and accuracy.

This study is not only beneficial for the development of computational processing technologies for Vietnamese, a language spoken by over 90 million people, but also for similar languages such as Thai, Lao, and so on. This study also promotes the computational linguistic studies on how to transfer methods developed for a popular language, like English, to a language that has not yet been intensively studied.

Treeing a Vietnamese sentence

Original sentence:
    Nam kể về tai nạn hôm qua.
    {Nam tells about the yesterday's accident.}

1. Word segmentation:
    Nam   kể        về      tai_nạn    hôm_qua     .
          to tell   about   accident   yesterday

2. POS tagging:
    Nam/Nr kể/Vv về/Cs tai_nạn/Nn hôm_qua/Nt ./PU

3. Bracketing:
    (S
      (NP-SBJ (Nr-H Nam))
      (VP (Vv-H kể)
          (PP-DOB (Cs-H về)
                  (NP (Nn-H tai_nạn)
                      (NP-TMP (Nt-H hôm_qua)))))
      (PU .))

Figure 1: An example to illustrate the process of treeing a Vietnamese sentence.

[1] http://thanhnien.vn
[2] Underscore "_" is used to link syllables of Vietnamese multi-syllable words. Translation for the Vietnamese word is given as a subscript. If the Vietnamese word does not have a translatable meaning, the subscript is blank. Translation for a Vietnamese sentence is given in curly brackets below the original text.
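
The three layers in Figure 1 can be read back programmatically. The following is a minimal sketch, assuming the `word/TAG` and bracket notations exactly as printed in the figure; the function names (`read_tagged`, `syllables`, `parse_brackets`, `leaves`) are illustrative and are not part of the treebank's tooling.

```python
import re

# Layers 2 and 3 of Figure 1, transcribed as plain strings.
TAGGED = "Nam/Nr kể/Vv về/Cs tai_nạn/Nn hôm_qua/Nt ./PU"
BRACKETED = """
(S (NP-SBJ (Nr-H Nam))
   (VP (Vv-H kể)
       (PP-DOB (Cs-H về)
               (NP (Nn-H tai_nạn)
                   (NP-TMP (Nt-H hôm_qua)))))
   (PU .))
"""

def read_tagged(line):
    """Split 'word/TAG' tokens; rsplit keeps any '/' inside the word itself."""
    return [tuple(tok.rsplit("/", 1)) for tok in line.split()]

def syllables(word):
    """Underscore links the syllables of a multi-syllable word (footnote 2)."""
    return word.split("_")

def parse_brackets(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word under a POS label
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    tree = node()
    assert pos == len(tokens), "trailing material after the tree"
    return tree

def leaves(tree):
    """Return (word, preterminal label) pairs in sentence order."""
    label, children = tree
    out = []
    for child in children:
        if isinstance(child, tuple):
            out.extend(leaves(child))
        else:
            out.append((child, label))
    return out

tree = parse_brackets(BRACKETED)
# Dropping the head marker "-H" should recover the POS layer.
pos_from_tree = [(w, lab.split("-")[0]) for w, lab in leaves(tree)]
print(pos_from_tree == read_tagged(TAGGED))
```

Checking that the bracketing layer reproduces the POS layer, as in the last line, is one simple way to detect inconsistencies between layers of the same sentence.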
Meaning: The construction unit is too slow.

a) (S (NP-SBJ (Nn-H Đơn_vị) (Vv thi_công))
      (ADJP-PRD (R quá) (Aa-H chậm_chạp))
      (PU .))

b) (S (NP-SBJ (Nn-H Đơn_vị) (Vv thi_công))
      (Cp thì)
      (ADJP-PRD (R quá) (Aa-H chậm_chạp))
      (PU .))

c) (S (SPL (NP (Nn-H Đơn_vị) (Vv thi_công)))
      (Cp thì)
      (SPL (ADJP (R quá) (Aa-H chậm_chạp)))
      (PU .))

Glosses: Đơn_vị {unit}, thi_công {to construct}, thì {to be}, quá {too}, chậm_chạp {slow}

Figure 2: Examples showing ambiguity of annotating a sentence in Vietnamese.
2. Characteristics of Vietnamese language and methodology for guideline preparation

Unlike Western languages, in which blank spaces denote word delimiters, in Vietnamese, blank spaces play the roles of not only word delimiters but also syllable delimiters (Diep, 2005; SCSSV, 1983), which causes difficulties in defining words. In addition, unlike English and Japanese, Vietnamese is not an inflectional language for which morphological forms can provide useful clues for word segmentation and POS tagging. While similar problems also occur with Chinese (Xia et al., 2000), annotating Vietnamese words may be more difficult, because the modern Vietnamese writing system is based on Latin characters, which represent the pronunciation but not the meaning of words, resulting in many homonyms.

Difficulties in Vietnamese occur not only in determining words as mentioned above but also in bracketing phrases. One of the reasons is that there are many expressions having the same POS sequence but different phrase types in Vietnamese. Other difficulties are caused by the fact that word order in Vietnamese is very flexible.

Moreover, there is little consensus in the community about how to define words, phrases and grammatical structures. Though people agree that Vietnamese is a subject-verb-object (SVO) language, Figure 2a shows a sentence in Vietnamese in which the head word of the predicate is not a verb. For sentences that do not have a main verb, we can use the conjunction thì to link the subject and the predicate as shown in Figure 2b. However, when the conjunction thì is used, linguists disagree about how to bracket this sentence. Diep (2005) considered this sentence as a single sentence (Figure 2b), where the conjunction thì is used to link the subject and the predicate. SCSSV (1983), in contrast, considered this sentence as a subordinate compound sentence (Figure 2c) because they said that the conjunction thì is used to link two clauses of a subordinate compound sentence.

We prepared the guidelines for the Vietnamese treebank in three sets: word segmentation guidelines, POS tagging guidelines, and bracketing guidelines. The problems were tackled on the basis of the following approaches:

• We refer to Vietnamese grammar books (SCSSV, 1983; Diep, 2005) and discuss with our collaborators, who are Vietnamese linguistics experts, to solve the ambiguities and difficulties.
• We study the guidelines of Chinese Penn Treebank (Xia, 2000b; Xia, 2000a; Xue et al., 2000), English Penn Treebank (Santorini, 1990; Bies et al., 1995), and VLSP treebank (Nguyen et al., 2010b; Nguyen et al., 2010a; Nguyen et al., 2010c) and adapt them to our guidelines if possible.
• During the annotation process, annotators [3] are requested to discuss with us the constructions that they cannot annotate or find ambiguous. These constructions are important clues to revise the guidelines.
• We conduct nine rounds of measurement of inter-annotator agreement and accuracy, for which two annotators annotate the same data. The inconsistencies and annotation errors found in each round are important clues to improve annotation guidelines and to train annotators again.

Details of applying these approaches during the process of building the Vietnamese treebank are explained in the following sections.

[3] Our treebank is annotated by two annotators who are graduate linguistics students.

3. Word segmentation guidelines

3.1. Challenges of word segmentation

Words are the most basic units of a treebank (Sciullo and Williams, 1987), and defining words is the first step in the annotation process (Xia, 2000b; Xia, 2000a; Sornlertlamvanich et al., 1999). For languages like English, defining words is almost trivial, because the blank spaces denote word delimiters. However, it is a difficult problem in Vietnamese even for a native speaker. Although most linguists agree that the Vietnamese language has two types of words, single-syllable words (single words) and multi-syllable words (compound words), distinguishing between single and multi-syllable words involves much ambiguity.

The ambiguities of Vietnamese WS occur for the following reasons. First, in Vietnamese, blank spaces play the roles of not only word delimiters but also syllable delimiters. Second, there are no morphological marks to act as important clues to identify words. Third, the Vietnamese writing system is based on Latin characters, which represent the pronunciation but not the meaning of words. Expressions that have the same surface form but different word segmentation appear frequently in Vietnamese. Rows 1 and 2 in Table 1, for instance, show two different segmentation
types of the expression quần áo. Fourth, there is little consistency in segmenting the expressions. For example, some linguists consider the expression cá rô {anabas} (cá {fish}, rô {anabas}) as a compound word but bệnh sởi {measles} (bệnh {illness}, sởi {measles}) as two words (Hoang, 1998; Diep, 2005). However, these expressions have a similar construction: the combination of a categorization noun [4] and a specific noun.

No.  Expression (A B)                     Meaning              WS
1    quần {trousers} áo {shirt}           clothes              a word
2    quần {trousers} áo {shirt}           trousers and shirt   2 words
3    ăn {eat} nói {speak}                 to speak             a word
4    tìm {find} kiếm {find}               to find              a word
5    nồi {pot} đồng {copper}              copper pot           2 words
6    nồi {pot} bằng {by} đồng {copper}    copper pot           3 words
7    đen {black} đúa {}                   black                a word
8    cá {fish} heo {pig}                  dolphin              a word
9    cá {fish} lia_thia {betta fish}      betta fish           2 words
10   nghiên_cứu {research} viên {-er}     researcher           2 words
11   nhà {-er} nghiên_cứu {research}      researcher           2 words

Table 1: Examples to illustrate the principles of word segmentation.

3.2. Policy for annotation of word segmentation

As mentioned above, our purpose for word segmentation is to build a treebank for Vietnamese. Therefore, we consider a word as the smallest syntactic unit having a complete meaning and preventing syntactic rules from analyzing word structure (Sciullo and Williams, 1987). On the basis of this word definition, we propose the following rules to solve the difficulties in Vietnamese word segmentation:

• If A and B [5] have different meanings and the meaning of the combination form (A_B) is different from the split form (A B), we select the form that has a meaning more appropriate for the context. Examples 1 and 2 in Table 1 show an expression having two different meanings because of different word segmentation.
• If A and B have different meanings and A_B has the same meaning as A or B, the combination form is selected. The example is given in row 3 of Table 1.
• If A and B have the same meaning, the combination form is selected (example 4 in Table 1).
• If another syllable can be inserted between A and B, we select the split form (examples 5 and 6 in Table 1).
• If A is a word and B is not (or vice versa), we select the combination form. Example 7 in Table 1 shows that if đúa is considered as a single word, its meaning is undefined. Therefore, it is considered as part of a multi-syllable word.
• For the expression of a categorization noun (A) and a specific noun (B), if B indicates something different from what the expression indicates, A_B is considered as a compound word. In contrast, if B has a similar meaning to A B, A and B are considered as two words (examples 8 and 9 in Table 1).
• An expression of one or more Sino-Vietnamese syllables and an original Vietnamese word, in which the Sino-Vietnamese syllables are the elements used to create the new words, is not considered as a word (example 10 in Table 1).
• Special classifier nouns are considered as single words (example 11 in Table 1).

It should be noted that these rules do not necessarily conform to the rules used by linguists. For example, Diep (2005) considers the Sino-Vietnamese syllable viên {-er} in example 10 in Table 1 as a component of the compound word and considers the special classifier noun nhà {-er} as a single word. We, on the other hand, consider both viên {-er} and nhà {-er} as single words because we found that they both have the same grammatical function, that is, forming new words. However, in our guidelines, the word types for which there is little consensus between linguists on how to segment them are annotated with additional information so that such words can be automatically converted according to the need.

[4] Categorization nouns indicate general entities, such as cá {fish} and cây {tree}.
[5] Without loss of generality, we assume the expression we want to segment is A B, where A and B can be syllables or words.

4. Part-of-speech tagging guidelines

4.1. Challenges of POS tagging

Tagging POS for Vietnamese words is not a trivial problem because they are not marked with morphological features, such as tense, number, gender, etc. While the same problem also appears with Chinese, Vietnamese may be more difficult, because the Vietnamese writing system is based on Latin characters, which represent the pronunciation, but not the meaning of words.

Words that have the same surface form and pronunciation but different meanings and grammar functions occur frequently in the text. For example, we can understand the word mới in accordance with two meanings shown in rows 1 and 2 of Table 2. If we consider mới as an adjective modifying the preceding word, the noun nghiên_cứu {research}, it means new; the word mới means recently or just if we consider it as an adjunct modifying the following word, the verb thực_hiện {to conduct}.

Determining POS of the words having the same surface form may be more ambiguous because a verb or an adjective can appear in the position of a noun as in the case of báo cáo in rows 3 and 4 of Table 2. Solely referring to the sentence, we do not have any clue to determine if báo cáo belongs to the verb class or noun class. Báo cáo means defend if it is considered as a verb (row 3) and thesis if it is considered as a noun (row 4).

Ambiguity of the POS tagging is also caused by the omission of words, which happens frequently in Vietnamese. For example, if a verb or an adjective plays the same roles as a noun, it is actually preceded by a special classifier noun
[6] (as in the case of báo cáo in row 5 of Table 2). Otherwise, a noun is preceded by a classifier noun [7] (the noun báo cáo in row 6 of Table 2 follows the classifier noun cuốn). However, such useful nouns are usually omitted in Vietnamese sentences, which causes ambiguity in tagging words.

No.  Word in context                                          Word                     POS
1    Một nghiên cứu mới thực hiện tại Nhật.                   mới {new}                Adjective
     {A new research conducted in Japan.}
2    Một nghiên cứu mới thực hiện tại Nhật.                   mới {just}               Adjunct
     {A research has just conducted in Japan.}
3    Báo cáo tốt nghiệp của cô ấy rất tốt.                    báo cáo {defense}        Verb
     {Her final defense is very good.}
4    Báo cáo tốt nghiệp của cô ấy rất tốt.                    báo cáo {thesis}         Noun
     {Her thesis is very good.}
5    Việc báo cáo tốt nghiệp của cô ấy rất tốt.               việc báo cáo {defense}   Verb
     {Her final defense is very good.}
6    Cuốn báo cáo tốt nghiệp của cô ấy rất tốt.               cuốn báo cáo {thesis}    Noun
     {Her thesis is very good.}
7    Bạn sẽ đẹp nhất đêm nay.                                 sẽ {will}                Adjunct
     {You will be the most beautiful girl tonight.}
8    Tôi sẽ đi Nhật vào tối nay.                              sẽ {will}                Adjunct
     {I will go to Japan tonight.}

Table 2: Examples illustrating the challenges of POS tagging.

Some linguists (SCSSV, 1983; Diep, 2005) have claimed that POS can be recognized by referring to the adjuncts modifying the words. For example, adjuncts indicating degree and tenses modify adjectives and verbs, respectively. However, this method does not necessarily work sufficiently with real texts. In practice, many verbs and adjectives in Vietnamese can be modified by the same adjunct. For example, the adjunct indicating tense, sẽ {will}, shown in Table 2 can modify both the adjective đẹp {beautiful} (row 7) and the verb đi {to go} (row 8).

Because of the above characteristics of Vietnamese, it is difficult not only to define the POS tag set but also to tag each word in context. In addition, there is still little consensus between linguists as to methodology for classifying words in Vietnamese. For instance, both Diep (2005) and SCSSV (1983) classified the words based on their meanings, their combination ability, and their syntactic functions. However, Diep (2005) considered the words expressing the whole, such as cả {all}, tất_cả {all}, toàn_bộ {all}, etc. as pronouns, while SCSSV (1983), in contrast, considered them as nouns, and Hoang (1998) considered cả as a pronoun and tất_cả as a noun in all contexts.

[6] Việc is a special classifier noun that is understood as -ion, -ment, -ing, -ity, -ness, or so on when it comes before verbs or adjectives. An expression of the special classifier noun việc and a verb or adjective is understood as a noun in English. For example, học_tập means to learn, so to express learning, we can say việc học_tập.
[7] Classifier nouns indicate two types of things, animate things and inanimate things.

4.2. Building part-of-speech tag set

In previous work, Nguyen et al. (2009) classified the words on the basis of their combination ability and syntactic function. They created a POS tag set for Vietnamese including a total of 17 tags (excluding the tags for unknown words and the punctuation). However, this tag set cannot cover all the combination abilities as well as the syntactic functions of the Vietnamese words. For example, they used the tag P to annotate all pronouns. However, the pronouns used to express space or time (demonstrative pronouns) such as này {this} and đó {that} can be modifiers of the head nouns in noun phrases. Personal pronouns, in contrast, always play the roles of the head words of noun phrases.

Therefore, in this work, we created a new POS tag set for Vietnamese. Our criteria to classify the words are also based on the combination abilities and the syntactic functions of the words, like those of the VLSP treebank. However, we referred to the linguistics literature, carefully analyzed the roles of words and discussed with our linguistics colleagues to create a new POS tag set for Vietnamese with 33 tags, which are shown in Table 3. Using our POS tags, we can recognize the role of a word in a phrase or sentence. For example, the demonstrative pronouns modifying head words of noun phrases are annotated with the Pd label, and personal pronouns that are head words of noun phrases are annotated with the Pp label.

No.  POS tag  Meaning of tag
1    SV       Sino-Vietnamese syllable
2    Nc       Classifier noun
3    Ncs      Special classifier noun
4    Nu       Unit noun
5    Nun      Administrative unit noun
6    Nw       Quantifier indicating the whole
7    Num      Number
8    Nq       Other quantifier
9    Nr       Proper noun
10   Nt       Noun of time
11   Nn       Other noun
12   Ve       Existing verb
13   Vc       Copula "là" verb
14   D        Directional verb
15   VA       Verb-adjective
16   VN       Verb-noun
17   NA       Noun-adjective
18   Vcp      Comparative verb
19   Vv       Other verb
20   An       Ordinal number
21   Aa       Other adjective
22   Pd       Demonstrative pronoun
23   Pp       Other pronoun
24   R        Adjunct
25   Cs       Preposition or conjunction introducing a clause
26   Cp       Other conjunction
27   ON       Onomatopoeia
28   ID       Idioms
29   E        Exclamation word
30   M        Modifier word
31   FW       Foreign word
32   X        Unidentified word
33   PU       Punctuation

Table 3: POS tag set designed for our treebank.

4.3. Policy for annotation of part-of-speech

In our POS tagging guidelines, the words are tagged on the basis of the following criteria:

• Combination ability of the word. For example, khó_khăn can be understood as difficulty or difficult. However, if it is a noun, it cannot combine with the adjunct rất {very}. If it is an adjective, it cannot combine with the quantifier những {-s/-es}.
• Syntactic function of the word. For example, if the quantifier indicating the whole modifies a noun, it will be annotated with an Nw tag. The quantifier indicating the whole will be annotated with a Pp tag if it is the head word of a noun phrase.
• Meaning of the word in the sentence. For example, the combination ability of the verb đi {to go} and the adjective đẹp {beautiful} mentioned above is the same: they are modified by the adjunct sẽ. They also have the same syntactic function, which is head word of predicates. However, their meanings are different: the adjective expresses the quality, and the verb expresses the action.
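
The first two criteria above lend themselves to simple context tests. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the tags are taken from Table 3, and "the next token is tagged as a noun" is used as a crude stand-in for "modifies a noun", which the guidelines decide from the phrase structure rather than from adjacency. The third criterion (meaning) is not modelled.

```python
# Hypothetical helpers; tags (Aa, Nn, Nw, Pp) come from Table 3.

def tag_by_combination(prev_word):
    """Combination-ability test for a noun/adjective-ambiguous word such as khó_khăn.

    A noun cannot combine with the adjunct rất, and an adjective cannot
    combine with the quantifier những, so the left neighbour rules one
    reading out.
    """
    if prev_word == "rất":      # 'very': rules out the noun reading
        return "Aa"
    if prev_word == "những":    # plural quantifier: rules out the adjective reading
        return "Nn"
    return None                 # no clue from the left context


# Words expressing the whole, as listed in Section 4.1.
WHOLE_WORDS = {"cả", "tất_cả", "toàn_bộ"}

def tag_whole_word(tags, i):
    """Syntactic-function test for a word expressing the whole at position i.

    Assumption: a following noun tag (any tag starting with 'N') is taken
    as evidence that the word modifies a noun (Nw); otherwise it is treated
    as the head of a noun phrase (Pp).
    """
    next_tag = tags[i + 1] if i + 1 < len(tags) else None
    if next_tag is not None and next_tag.startswith("N"):
        return "Nw"
    return "Pp"


# Illustrative inputs (not taken from the treebank).
print(tag_by_combination("rất"))
print(tag_by_combination("những"))
print(tag_whole_word([None, "Nn"], 0))   # whole-word followed by a noun
print(tag_whole_word([None], 0))         # whole-word standing alone
```

In practice such tests only narrow the candidates; ambiguous cases with no left or right clue (the `None` result) still need an annotator's judgement.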