Building an HPSG-based Indonesian Resource Grammar (INDRA)
David Moeljadi Francis Bond Sanghoun Song
Division of Linguistics and Multilingual Studies
Nanyang Technological University
Singapore
{D001,fcbond,sanghoun}@ntu.edu.sg
Abstract

This paper presents the creation and the initial stage development of a broad-coverage Indonesian Resource Grammar (INDRA) within the framework of Head-Driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994) and Minimal Recursion Semantics (MRS) (Copestake et al., 2005). At the present stage, INDRA focuses on verbal constructions and subcategorization since they are fundamental for argument and event structure. Verbs in INDRA were semi-automatically acquired from the English Resource Grammar (ERG) (Flickinger, 2000) via Wordnet Bahasa (Nurril Hirfana Mohamed Noor et al., 2011; Bond et al., 2014). In the future, INDRA will be used in the development process of machine translation. A preliminary evaluation of INDRA on the MRS test-suite shows promising coverage.

1 Introduction to Indonesian

Indonesian (ISO 639-3: ind) is a Western Malayo-Polynesian language of the Austronesian language family. Within this subgroup, it belongs to the Malayic branch with Standard Malay in Malaysia and other Malay varieties (Lewis, 2009). It is spoken mainly in the Republic of Indonesia as the sole official and national language and as the common language for hundreds of ethnic groups living there (Alwi et al., 2014, pp. 1-2). In Indonesia it is spoken by around 22.8 million people as their first language and by more than 140 million people as their second language. The lexical similarity is over 80% with Standard Malay (Lewis, 2009).

Morphologically, Indonesian is a mildly agglutinative language, compared to Finnish or Turkish where the morpheme-per-word ratio is higher (Larasati et al., 2011). It has a rich affixation system, including a variety of prefixes, suffixes, circumfixes, and reduplication. Most of the affixes are derivational. Two important inflectional affixes are the prefix meN- which marks active voice and di- which denotes passive voice (Sneddon et al., 2010, pp. 29, 72).

Indonesian has a strong tendency to be head-initial (Sneddon et al., 2010, pp. 26-28). In a noun phrase with an adjective, a demonstrative or a relative clause, the head noun precedes the adjective, the demonstrative or the relative clause. There is no agreement in Indonesian. In general, grammatical relations are only distinguished in terms of word order. As is often the case with Austronesian languages of Indonesia, Indonesian has a basic word order of SVO with a nominative-accusative alignment pattern. Argument alternations are triggered by passive and applicative constructions.

2 Background

This section introduces the background theory, as well as an overview of the Deep Linguistic Processing with HPSG Initiative (DELPH-IN) and the tools to build and develop INDRA.

2.1 Frameworks

INDRA uses the theoretical framework of HPSG (Pollard and Sag, 1994). HPSG is monostratal, handling orthography, syntax, semantics and pragmatics in a single structure (sign), modeled through typed feature structures. HPSG is unification- and constraint-based. The words and phrases are combined according to constraints of the lexical entries based on the type hierarchy.

INDRA uses MRS (Copestake et al., 2005) as its semantic framework because it is adaptable for HPSG typed-feature structure and suitable for parsing and generation. The semantic structures in MRS are underspecified for scope and thus suitable for representing ambiguous scoping.
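
As a rough illustration of what "unification-based" means in Section 2.1, the following minimal Python sketch unifies two feature structures written as nested dictionaries. It is a simplification under stated assumptions, not the algorithm used by the LKB or ACE: it ignores the type hierarchy and structure sharing (reentrancy), and the feature names HEAD, SUBJ and ANIMACY are chosen here only for illustration.

```python
def unify(fs1, fs2):
    """Unify two feature structures (nested dicts with atomic leaf values).

    Returns the merged structure, or None if the two structures clash.
    Simplification: no types and no reentrancy, unlike real HPSG signs.
    """
    if isinstance(fs1, dict) and isinstance(fs2, dict):
        result = dict(fs1)  # shallow copy so the inputs are not modified
        for feature, value in fs2.items():
            if feature in result:
                unified = unify(result[feature], value)
                if unified is None:
                    return None  # incompatible values for the same feature
                result[feature] = unified
            else:
                result[feature] = value
        return result
    # Atomic values (or an atom against a dict) unify only if they are equal.
    return fs1 if fs1 == fs2 else None


# Hypothetical constraint from a verb and information supplied by its subject.
verb_constraint = {"HEAD": "verb", "SUBJ": {"HEAD": "noun"}}
subject_info = {"SUBJ": {"HEAD": "noun", "ANIMACY": "human"}}

print(unify(verb_constraint, subject_info))
print(unify({"HEAD": "verb"}, {"HEAD": "noun"}))
```

The first call succeeds and merges the information from both structures; the second fails because the two values of HEAD are incompatible.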
Proceedings of the Grammar Engineering Across Frameworks (GEAF) Workshop, 53rd Annual Meeting of the ACL and 7th IJCNLP, pages 9–16, Beijing, China, July 26-31, 2015. © 2015 Association for Computational Linguistics
There is no previous work done on Indonesian HPSG but much has been done using Lexical Functional Grammar (LFG) (Kaplan and Bresnan, 1982), e.g. Arka and Manning (2008) on active and passive voice and Arka (2000) on control constructions. In addition, Arka (2012) and Mistica (2013) have worked on the computational grammar “IndoGram” which is a part of the ParGram (Sulger et al., 2013).¹ However, it is not open-source or very broad in its coverage. Further, it does not produce MRS, so it cannot be easily incorporated into our machine translation system. Thus, there is a need to build and develop a broad-coverage open-source HPSG of Indonesian.

¹ http://iness.uib.no/iness/xle-web

2.2 DELPH-IN

The DELPH-IN consortium (Deep Linguistic Processing with HPSG Initiative, http://www.delph-in.net) is a research collaboration between linguists and computer scientists which builds and develops open source grammars, tools for grammar development and applications using HPSG and MRS. More than fifteen grammars have been created and developed within DELPH-IN, e.g. English Resource Grammar (ERG) (Copestake and Flickinger, 2000) and Japanese grammar Jacy (Siegel and Bender, 2002). DELPH-IN grammars define typed feature structures using Type Description Language (TDL) (Copestake, 2002).

We make extensive use of several open-source tools for grammar development provided by DELPH-IN: Linguistic Knowledge Builder (LKB) (Copestake, 2002), a grammar and lexicon development environment for typed feature structure grammars; The LinGO Grammar Matrix (Bender et al., 2010), a web-based questionnaire for writing new DELPH-IN grammars, providing a wide range of phenomena and basic files to make the grammars compatible with DELPH-IN parsers and generators; Answer Constraint Engine (ACE) (http://sweaglesw.org/linguistics/ace/), an efficient processor for DELPH-IN grammars; ITSDB or [incr tsdb()] (Oepen and Flickinger, 1998), a tool for testing, profiling the performance of the grammar and treebanking; Full Forest Treebanker (FFTB) (http://moin.delph-in.net/FftbTop), a treebanking tool for DELPH-IN grammars, allowing the selection of an arbitrary tree from the “full forest” without enumerating all analyses in the parsing stage; and LOGON (Oepen et al., 2007), a collection of software, grammars, and other linguistic resources for transfer-based machine translation.

3 INDRA

This section describes some preliminary work as well as the methodology.

3.1 Methodology

The methodology used in INDRA follows Bender et al. (2008). We model our analysis in HPSG and implement it by editing some TDL files after analyzing a phenomenon based on reference grammars and other linguistic literature. Afterwards, we compile the grammar and test it by parsing sample sentences or test-suites. The grammar is debugged and developed further if some gaps or problems are found according to the parse results. Afterwards, the sample sentences in test-suites will be parsed again and treebanked. This process is repeated iteratively. If problems are not found or the debugging process has finished with a good result, the grammar will be updated in GitHub (https://github.com/davidmoeljadi/INDRA).

3.2 Grammar Development

INDRA was first created by filling in the required sections of the online page of the LinGO Grammar Matrix questionnaire, which covers basic grammar phenomena such as word order, tense-aspect-mode, coordination, morphology, and subcategorization of nouns and verbs (http://www.delph-in.net/matrix/customize/matrix.cgi). INDRA subcategorizes nouns into three groups: common noun, pronoun and proper name. Common nouns are subcategorized into inanimate, non-human and human based on three main classifiers in Indonesian: the classifier buah (lit. fruit) for inanimate nouns, ekor (lit. tail) for non-human animate nouns and orang (lit. person) for human nouns (Sneddon et al., 2010, p. 139; Alwi et al., 2014, p. 288).

Verbs are subcategorized into three groups: intransitive which has one argument, transitive which has two arguments and optional transitive which has one obligatory subject argument and one optional object argument as in Adi makan (nasi) “Adi eats (rice)”. The verb subcategorization here follows Alwi et al. (2014, pp. 95-98).

Besides the number of arguments, the possibility of passivization with morphological inflection plays an important role in distinguishing intransitives from transitives in Indonesian. Examples (1) and (2a) show intransitive and transitive sentences respectively.

(1) Adi tidur.
    Adi sleep
    “Adi sleeps.”

(2) a. Adi mengejar Budi.
       Adi ACT-chase Budi
       “Adi chases Budi.”
    b. Budi dikejar Adi.
       Budi PASS-chase Adi
       “Budi is chased by Adi.”
    c. Budi saya kejar.
       Budi 1SG chase
       “Budi is chased by me.”

In Example (2a), the verb mengejar is formed from an active prefix meN- and the base kejar (the initial sound k undergoes nasalization; see Section 4.2). The active prefix meN- is changed to a passive prefix di- in passive type one (Sneddon et al., 2010, pp. 256-257) in Example (2b), and the verb appears without an affix in passive type two (Sneddon et al., 2010, pp. 257-258) in Example (2c). Sneddon et al. (2010, pp. 256-257) state that in passive type one, the actor is third person or a noun, while in passive type two, the agent is a pronoun or pronoun substitute and it comes before the unprefixed verb.

The more detailed verb subcategorization into other groups such as ditransitive will be mentioned in the next subsection. The lexical items for each noun and verb subcategory were added and the affixes to support the active-passive voice were included. However, the Matrix does not handle morphology as in the nasalization process of meN-, which thus has to be added manually (see Section 4.2).

3.3 Lexical Acquisition

The lexicon is important in the robustness of the grammar. Since inputting words or lexical entries manually into the grammar is labor intensive and time consuming, doing lexical acquisition semi-automatically is vital. In order to do this, we need good lexical resources. We attempted to extract Indonesian verbs from Wordnet Bahasa (Nurril Hirfana Mohamed Noor et al., 2011; Bond et al., 2014) and group them based on syntactic types in the ERG, such as intransitive, transitive, and ditransitive, using Python 3.4 and Natural Language Toolkit (NLTK) (Bird et al., 2009). The grouping of verbs (verb frames) in Wordnet (Fellbaum, 1998) is employed as the bridge between the English and Indonesian grammar.

Each verb synset in Wordnet (also Wordnet Bahasa) contains a list of sentence frames specified by the lexicographer illustrating the types of simple sentences in which the verbs in the synset can be used (Fellbaum, 1998). There are 35 verbal sentence frames in Wordnet, some of which are shown below with their frame numbers:

(3) 1   Something ----s
    8   Somebody ----s something
    21  Somebody ----s something PP

Frame 1 is a typical intransitive verbal sentence frame, as in the book fell; frame 8 is a typical (mono)transitive verbal sentence frame, as in he chases his friend; and frame 21 is a typical ditransitive verbal sentence frame, as in she put a book on a table. A verb may have more than one synset and each synset may have more than one verb frame, e.g. the verb eat has six synsets with each synset having different verb frames. Three of the six synsets, together with their definition and verb frames, are presented in Table 1. These verb frames can be employed as a bridge between the verb types (also verb lexical items) in ERG and those in INDRA.

Synset        Definition                         Verb frame
01168468-v    Take in solid food                 8 Somebody ----s something
01166351-v    Eat a meal, take a meal            2 Somebody ----s
01157517-v    Use up (resources or materials)    11 Something ----s something
                                                 8 Somebody ----s something

Table 1: Three of six synsets of the verb “eat” and their verb frames in Wordnet

Out of 354 verb types in ERG, the top eleven most frequently used types in the corpus were chosen, excluding the specific English verb types such as be-type verbs (e.g. is, be and was), have-type verbs, verbs with prepositions (e.g. depend on, refer to and look after) and modals (e.g. would, may and need). The chosen eleven verb types are given in Table 2. The third, fifth and eighth type (v_-_unacc_le, v_-_le and v_pp_unacc_le all written in
bold in Table 2) are regarded as the same type, i.e. intransitive verb type, in INDRA.

Verb type          Freq (Corp)   Freq (Lex)   Examples of verb
v_pp*_dir_le       7079          204          go, come, hike
v_vp_seq_le        3921          105          want, like, try
v_-_unacc_le       3144          334          close, start, end
v_np_noarg3_le     2723          5            make, take, give
v_-_le             2666          486          arrive, occur, stand
v_np-pp_e_le       2439          334          compare, know, relate
v_pp*-cp_le        2360          154          think, add, note
v_pp_unacc_le      2307          44           rise, fall, grow
v_np-pp_prop_le    1861          135          base, put, locate
v_cp_prop_le       1600          80           believe, know, find
v_np_ntr_le        1558          10           get, want, total

Table 2: The eleven most frequently used ERG verb types in the corpus

The first type contains verbs expressing movement or direction with optional PP complements, as in B crept into the room. The verbs in the second type are subject control verbs, as in B intended to win. The third type consists of unaccusative verbs without complements as in The plate gleamed. The fourth type contains verbs having two arguments (monotransitive) although they have a potential to be ditransitive as in B took the book. The fifth type contains intransitive (unergative) verbs as in B arose. The verbs in the sixth type have obligatory NP and PP complements as in B compared C with D. The verbs in the seventh type are verbs with optional PP complements and obligatory subordinate clauses as in B said to C that D won. Unaccusative verbs with optional PP complements as in The seed grew into a tree belong to the eighth type. Ditransitive verbs with obligatory NPs and PPs with state result as in B put C on D belong to the ninth type. The tenth type consists of verbs with optional complementizers as in B hoped (that) C won and the eleventh type consists of verbs with obligatory NP complements which cannot be passivized as in B remains C.

Based on the syntactic information of each verb type mentioned above, the corresponding verb frames in Wordnet were manually chosen. For example, the first type contains intransitive verbs with optional PP; thus, the verb frames should be Sb ----s and Sb ----s PP. The intransitive verbs without complements should correspond to the verb frames Sth ----s or Sb ----s, regardless of whether the subject is a thing or a person. Table 3 shows the eleven verb types in ERG and their corresponding Wordnet verb frames.

Verb type                               Verb frame
v_pp*_dir_le                            2 Sb ----s &
                                        22 Sb ----s PP
v_vp_seq_le                             28 Sb ----s to INFINITIVE
v_-_unacc_le, v_-_le, v_pp_unacc_le     1 Sth ----s ||
                                        2 Sb ----s
v_np_noarg3_le                          8 Sb ----s sth ||
                                        11 Sth ----s sth
v_np-pp_e_le                            15 Sb ----s sth to sb ||
                                        17 Sb ----s sb with sth ||
                                        20 Sb ----s sb PP ||
                                        21 Sb ----s sth PP ||
                                        31 Sb ----s sth with sth
v_pp*-cp_le                             26 Sb ----s that CLAUSE
v_np-pp_prop_le                         20 Sb ----s sb PP ||
                                        21 Sb ----s sth PP
v_cp_prop_le                            26 Sb ----s that CLAUSE
v_np_ntr_le                             8 Sb ----s sth ||
                                        11 Sth ----s sth

Table 3: The eleven most frequently used ERG verb types in the corpus and their corresponding Wordnet verb frames (sb = somebody, sth = something, & = AND, || = OR)

First, we checked for each verb in each verb type in Table 2 whether it is in Wordnet or not. If it could be found in Wordnet, the next step was to check whether the verb includes the verb frames mentioned in Table 3 or not. This step had to be done in order to find the right synset, since a verb can have many synsets with different verb frames as shown in Table 1. After the right synset was found, the corresponding Indonesian lemmas or translations were checked. One synset may have more than one Indonesian lemma or may not have Indonesian lemmas at all.

The next important step is to check one by one whether the Indonesian lemmas belonging to the same synset and verb frames can each be grouped in the same verb type or not. This manual step has to be done because grouping verbs in a particular language into types is language-specific work. Arka (2000) states that languages vary with respect to their lexical stock of “synonymous” verbs that may have different argument structures, e.g. the verb know can be both intransitive and transitive in Indonesian tahu and ketahui respectively, transitive only with an obligatory NP in Balinese² tawang, and transitive with optional NP in English know. Lastly, after the Indonesian verbs were extracted and grouped into their cor-

² Balinese (ISO 639-3: ban) is a Western Malayo-Polynesian language of the Austronesian language family. It belongs to the Malayo-Sumbawan branch. It is mainly spoken in the island of Bali in the Republic of Indonesia as a regional language (Lewis, 2009).
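
The synset-selection step of Section 3.3 (checking a verb's Wordnet frames against the frames listed for its ERG verb type in Table 3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' script. The frame numbers are hard-coded from Tables 1 and 3 so that the sketch runs without Wordnet data (with NLTK installed, they could instead be read with `Synset.frame_ids()`). Reading "&" as "all frames required" and "||" as "at least one frame required" is our interpretation of the Table 3 caption. Pairing eat with the two verb types below is for illustration only, and the function names are hypothetical.

```python
# Frame requirements per ERG verb type, copied from Table 3.
# "all": every listed frame must be present (&); "any": at least one (||).
TYPE_FRAMES = {
    "v_pp*_dir_le": ("all", {2, 22}),
    "v_vp_seq_le": ("all", {28}),
    "v_-_unacc_le": ("any", {1, 2}),
    "v_-_le": ("any", {1, 2}),
    "v_pp_unacc_le": ("any", {1, 2}),
    "v_np_noarg3_le": ("any", {8, 11}),
    "v_np-pp_e_le": ("any", {15, 17, 20, 21, 31}),
    "v_pp*-cp_le": ("all", {26}),
    "v_np-pp_prop_le": ("any", {20, 21}),
    "v_cp_prop_le": ("all", {26}),
    "v_np_ntr_le": ("any", {8, 11}),
}

# Three synsets of "eat" with their frame numbers, copied from Table 1.
EAT_SYNSETS = {
    "01168468-v": {8},
    "01166351-v": {2},
    "01157517-v": {11, 8},
}


def frames_match(mode, required, synset_frames):
    """Return True if a synset's frames satisfy a verb type's requirement."""
    if mode == "all":
        return required <= synset_frames  # all required frames are present
    return bool(required & synset_frames)  # at least one frame is shared


def candidate_synsets(verb_type, synsets):
    """List the synsets whose frames are compatible with the given verb type."""
    mode, required = TYPE_FRAMES[verb_type]
    return sorted(
        synset_id
        for synset_id, frames in synsets.items()
        if frames_match(mode, required, frames)
    )


print(candidate_synsets("v_-_le", EAT_SYNSETS))          # ['01166351-v']
print(candidate_synsets("v_np_noarg3_le", EAT_SYNSETS))  # ['01157517-v', '01168468-v']
```

The sketch covers only the automatic filtering; the manual check of the Indonesian lemmas for each selected synset is not modeled.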