A Sub-Character Architecture for Korean Language Processing
Karl Stratos
Toyota Technological Institute at Chicago
stratos@ttic.edu
Abstract

We introduce a novel sub-character architecture that exploits a unique compositional structure of the Korean language. Our method decomposes each character into a small set of primitive phonetic units called jamo letters from which character- and word-level representations are induced. The jamo letters divulge syntactic and semantic information that is difficult to access with conventional character-level units. They greatly alleviate the data sparsity problem, reducing the observation space to 1.6% of the original while increasing accuracy in our experiments. We apply our architecture to dependency parsing and achieve dramatic improvement over strong lexical baselines.

1 Introduction

Korean is generally recognized as a language isolate: that is, it has no apparent genealogical relationship with other languages (Song, 2006; Campbell and Mixco, 2007). A unique feature of the language is that each character is composed of a small, fixed set of basic phonetic units called jamo letters. Despite the important role jamo plays in encoding syntactic and semantic information of words, it has been neglected in existing modern Korean processing algorithms. In this paper, we bridge this gap by introducing a novel compositional neural architecture that explicitly leverages the sub-character information.

Specifically, we perform Unicode decomposition on each Korean character to recover its underlying jamo letters and construct character- and word-level representations from these letters. See Figure 1 for an illustration of the decomposition. The decomposition is deterministic; this is a crucial departure from previous work that uses language-specific sub-character information such as radical (a graphical component of a Chinese character). The radical structure of a Chinese character does not follow any systematic process, requiring an incomplete dictionary mapping between characters and radicals to take advantage of this information (Sun et al., 2014; Yin et al., 2016). In contrast, our Unicode decomposition does not need any supervision and can extract correct jamo letters for all possible Korean characters.

Sentence:    산을 갔다
Words:       산을 | 갔다
Characters:  산 | 을 | 갔 | 다
Jamos:       ㅅ ㅏ ㄴ | ㅇ ㅡ ㄹ | ㄱ ㅏ ㅆ | ㄷ ㅏ ∅

Figure 1: Korean sentence “산을 갔다” (I went to the mountain) decomposed to words, characters, and jamos.

Our jamo architecture is fully general and can be plugged in any Korean processing network. For a concrete demonstration of its utility, in this work we focus on dependency parsing. McDonald et al. (2013) note that “Korean emerges as a very clear outlier” in their cross-lingual parsing experiments on the universal treebank, implying a need to tailor a model for this language isolate. Because of the compositional morphology, Korean suffers extreme data sparsity at the word level: 2,703 out of 4,698 word types (> 57%) in the held-out portion of our treebank are OOV. This makes the language challenging for simple lexical parsers even when
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 721–726
Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics
augmented with a large set of pre-trained word representations.

While such data sparsity can also be alleviated by incorporating more conventional character-level information, we show that incorporating jamo is an effective and economical new approach to combating the sparsity problem for Korean. In experiments, we decisively improve the LAS of the lexical BiLSTM parser of Kiperwasser and Goldberg (2016) from 82.77 to 91.46 while reducing the size of input space by 98.4% when we replace words with jamos. As a point of reference, a strong feature-rich parser using gold POS tags obtains 88.61.

To summarize, we make the following contributions.

• To our knowledge, this is the first work that leverages jamo in end-to-end neural Korean processing. To this end, we develop a novel sub-character architecture based on deterministic Unicode decomposition.

• We perform extensive experiments on dependency parsing to verify the utility of the approach. We show clear performance boost with a drastically smaller set of parameters. Our final model outperforms strong baselines by a large margin.

• We release an implementation of our jamo architecture which can be plugged in any Korean processing network.¹

¹ https://github.com/karlstratos/koreannet

2 Related Work

We make a few additional remarks on related work to better situate our work. Our work follows the successful line of work on incorporating sub-lexical information to neural models. Various character-based architectures have been proposed. For instance, Ma and Hovy (2016) and Kim et al. (2016) use CNNs over characters whereas Lample et al. (2016) and Ballesteros et al. (2015) use bidirectional LSTMs (BiLSTMs). Both approaches have been shown to be profitable; we employ a BiLSTM-based approach.

Many previous works have also considered morphemes to augment lexical models (Luong et al., 2013; Botha and Blunsom, 2014; Cotterell et al., 2016). Sub-character models are substantially rarer; an extreme case is considered by Gillick et al. (2016) who process text as a sequence of bytes. We believe that such byte-level models are too general and that there are opportunities to exploit natural sub-character structure for certain languages such as Korean and Chinese.

There exists a line of work on exploiting graphical components of Chinese characters called radicals (Sun et al., 2014; Yin et al., 2016). For instance, 足 (foot) is the radical of 跑 (run). While related, our work on Korean is distinguished in critical ways and should not be thought of as just an extension to another language. First, as mentioned earlier, the compositional structure is fundamentally different between Chinese and Korean. The mapping between radicals and characters in Chinese is nondeterministic and can only be loosely approximated by an incomplete dictionary. In contrast, the mapping between jamos and Korean characters is deterministic (Section 3.1), allowing for systematic decomposition of all possible Korean characters. Second, the previous work on Chinese radicals was concerned with learning word embeddings. We develop an end-to-end compositional model for a downstream task: parsing.

3 Method

3.1 Jamo Structure of the Korean Language

Let W denote the set of word types and C the set of character types. In many languages, c ∈ C is the most basic unit that is meaningful. In Korean, each character is further composed of a small fixed set of phonetic units called jamo letters J where |J| = 51. The jamo letters are categorized as head consonants J_h, vowels J_v, or tail consonants J_t. The composition is completely systematic. Given any character c ∈ C, there exist c_h ∈ J_h, c_v ∈ J_v, and c_t ∈ J_t such that their composition yields c. Conversely, any c_h ∈ J_h, c_v ∈ J_v, and c_t ∈ J_t can be composed to yield a valid character c ∈ C.

As an example, consider the word 갔다 (went). It is composed of two characters, 갔, 다 ∈ C. Each character is furthermore composed of three jamo letters as follows:

• 갔 ∈ C is composed of ㄱ ∈ J_h, ㅏ ∈ J_v, and ㅆ ∈ J_t.

• 다 ∈ C is composed of ㄷ ∈ J_h, ㅏ ∈ J_v, and an empty letter ∅ ∈ J_t.

The tail consonant can be empty; we assume a special symbol ∅ ∈ J_t to denote an empty letter.
Figure 1 illustrates the decomposition of a Korean sentence down to jamo letters.

Note that the number of possible characters is combinatorial in the number of jamo letters, loosely upper bounded by 51^3 = 132,651. This upper bound is loose because certain combinations are invalid. For instance, ㅁ ∈ J_h ∩ J_t but ㅁ ∉ J_v whereas ㅏ ∈ J_v but ㅏ ∉ J_h ∪ J_t.

The combinatorial nature of Korean characters motivates the compositional architecture below. For completeness, we describe the entire forward pass of the transition-based BiLSTM parser of Kiperwasser and Goldberg (2016) that we use in our experiments.

3.2 Jamo Architecture

The parameters associated with the jamo layer are

• Embedding e^l ∈ R^d for each letter l ∈ J

• U^J, V^J, W^J ∈ R^{d×d} and b^J ∈ R^d

Given a Korean character c ∈ C, we perform Unicode decomposition (Section 3.3) to recover the underlying jamo letters c_h, c_v, c_t ∈ J. We compose the letters to induce a representation of c as

\[ h^c = \tanh\left( U^J e^{c_h} + V^J e^{c_v} + W^J e^{c_t} + b^J \right) \]

This representation is then concatenated with a character-level lookup embedding, and the result is fed into an LSTM to produce a word representation. We use an LSTM (Hochreiter and Schmidhuber, 1997) simply as a mapping φ : R^{d_1} × R^{d_2} → R^{d_2} that takes an input vector x and a state vector h to output a new state vector h′ = φ(x, h). The parameters associated with this layer are

• Embedding e^c ∈ R^{d′} for each c ∈ C

• Forward LSTM φ^f : R^{d+d′} × R^d → R^d

• Backward LSTM φ^b : R^{d+d′} × R^d → R^d

• U^C ∈ R^{d×2d} and b^C ∈ R^d

Given a word w ∈ W and its character sequence c_1 ... c_m ∈ C, we compute

\[ f^c_i = \phi^f\left( \begin{bmatrix} h^{c_i} \\ e^{c_i} \end{bmatrix}, f^c_{i-1} \right) \quad \forall i = 1 \ldots m \]

\[ b^c_i = \phi^b\left( \begin{bmatrix} h^{c_i} \\ e^{c_i} \end{bmatrix}, b^c_{i+1} \right) \quad \forall i = m \ldots 1 \]

and induce a representation of w as

\[ h^w = \tanh\left( U^C \begin{bmatrix} f^c_m \\ b^c_1 \end{bmatrix} + b^C \right) \]

Lastly, this representation is concatenated with a word-level lookup embedding (which can be initialized with pre-trained word embeddings), and the result is fed into a BiLSTM network. The parameters associated with this layer are

• Embedding e^w ∈ R^{d_W} for each w ∈ W

• Two-layer BiLSTM Φ that maps h_1 ... h_n ∈ R^{d+d_W} to z_1 ... z_n ∈ R^{d*}

• Feedforward for predicting transitions

Given a sentence w_1 ... w_n ∈ W, the final d*-dimensional word representations are given by

\[ (z_1 \ldots z_n) = \Phi\left( \begin{bmatrix} h^{w_1} \\ e^{w_1} \end{bmatrix} \ldots \begin{bmatrix} h^{w_n} \\ e^{w_n} \end{bmatrix} \right) \]

The parser then uses the feedforward network to greedily predict transitions based on words that are active in the system. The model is trained end-to-end by optimizing a max-margin objective. Since this part is not a contribution of this paper, we refer to Kiperwasser and Goldberg (2016) for details.

By setting the embedding dimension of jamos d, characters d′, or words d_W to zero, we can configure the network to use any combination of these units. We report these experiments in Section 4.

3.3 Unicode Decomposition

Our architecture requires dynamically extracting jamo letters given any Korean character. This is achieved by simple Unicode manipulation. For any Korean character c ∈ C with Unicode value U(c), let Ū(c) = U(c) − 44032 and T(c) = Ū(c) mod 28. Then the Unicode values U(c_h), U(c_v), and U(c_t) corresponding to the head consonant, vowel, and tail consonant are obtained by

\[ U(c_h) = 1 + \left\lfloor \frac{\bar{U}(c)}{588} \right\rfloor + \mathrm{0x10ff} \]

\[ U(c_v) = 1 + \left\lfloor \frac{(\bar{U}(c) - T(c)) \bmod 588}{28} \right\rfloor + \mathrm{0x1160} \]

\[ U(c_t) = T(c) + \mathrm{0x11a7} \]

where c_t is set to ∅ if T(c) = 0.
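The decomposition above, followed by the jamo composition layer of Section 3.2, can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the released implementation: the names `decompose`, `init_params`, and `char_representation` are hypothetical, the parameters are randomly initialized rather than trained, plain lists stand in for DyNet tensors, and the indices are mapped to compatibility jamo through lookup tables instead of being converted to code points.

```python
import math
import random

# Lookup tables in Unicode Hangul syllable order (illustrative names).
HEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"        # 19 head consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"    # 21 vowels
# Index 0 is the empty letter, written "∅" as in the text.
TAILS = ["∅"] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def decompose(char):
    """Return (head, vowel, tail) for one precomposed Hangul syllable."""
    offset = ord(char) - 44032                   # U-bar(c) = U(c) - 44032
    if not 0 <= offset < 19 * 21 * 28:
        raise ValueError(f"not a precomposed Hangul syllable: {char!r}")
    tail = offset % 28                           # T(c)
    head = offset // 588                         # floor(U-bar(c) / 588)
    vowel = (offset - tail) % 588 // 28          # floor(((U-bar - T) mod 588) / 28)
    return HEADS[head], VOWELS[vowel], TAILS[tail]


def init_params(d, seed=0):
    """Random jamo-layer parameters: one embedding per letter, U, V, W, b."""
    rng = random.Random(seed)
    vec = lambda: [rng.uniform(-0.1, 0.1) for _ in range(d)]
    letters = sorted(set(HEADS) | set(VOWELS) | set(TAILS))
    return {
        "e": {letter: vec() for letter in letters},   # e^l in R^d
        "U": [vec() for _ in range(d)],               # d x d matrices
        "V": [vec() for _ in range(d)],
        "W": [vec() for _ in range(d)],
        "b": [0.0] * d,
    }


def matvec(matrix, x):
    """Multiply a d x d matrix (list of rows) by a length-d vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def char_representation(char, p):
    """h^c = tanh(U e^{c_h} + V e^{c_v} + W e^{c_t} + b)."""
    head, vowel, tail = decompose(char)
    u = matvec(p["U"], p["e"][head])
    v = matvec(p["V"], p["e"][vowel])
    w = matvec(p["W"], p["e"][tail])
    return [math.tanh(a + b + c + bias) for a, b, c, bias in zip(u, v, w, p["b"])]


if __name__ == "__main__":
    print([decompose(c) for c in "산을갔다"])
    print(char_representation("갔", init_params(d=4)))
```

A letter such as ㄱ shares one embedding whether it appears as a head or a tail; its position is distinguished only by which of the three matrices multiplies it, which follows the parameter list in Section 3.2.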
                          Training   Development   Test
# projective trees           5,425           603    299
# non-projective trees          12             0      0

Unit         #    # Ko   Examples
word    31,060       –   … booz
char     1,772   1,315   … @ 正 a
jamo       500      48   ㄱ ㄳ ㄼ ㅏ ㅠ ㅢ @ 正 a

Table 1: Treebank statistics. Upper: Number of trees in the split. Lower: Number of unit types in the training portion. For simplicity, we include non-Korean symbols (e.g., @, 正, a) as characters/jamos.

3.4 Why Use Jamo Letters?

The most obvious benefit of using jamo letters is alleviating data sparsity by flattening the combinatorial space of Korean characters. We discuss some additional explicit benefits. First, jamo letters often indicate syntactic properties of words. For example, a tail consonant ㅆ strongly implies that the word is a past tense verb as in 갔다 (went), 왔다 (came), and 했다 (did). Thus a jamo-level model can identify unseen verbs more effectively than word- or character-level models. Second, jamo letters dictate the sound of a character. For example, 갔 is pronounced as got because the head consonant ㄱ is associated with the sound g, the vowel ㅏ with o, and the tail consonant ㅆ with t. This is clearly critical for speech recognition/synthesis and indeed has been investigated in the speech community (Lee et al., 1994; Sakti et al., 2010). While speech processing is not our focus, the phonetic signals can capture useful lexical correlation (e.g., for onomatopoeic words).

4 Experiments

Data. We use the publicly available Korean treebank in the universal treebank version 2.0 (McDonald et al., 2013).² The dataset comes with a train/development/test split; data statistics are shown in Table 1. Since the test portion is significantly smaller than the dev portion, we report performance on both.

² https://github.com/ryanmcd/uni-dep-tb

As expected, we observe severe data sparsity with words: 24,814 out of 31,060 elements in the vocabulary appear only once in the training data. On the dev set, about 57% word types and 3% character types are OOV. Upon Unicode decomposition, we obtain the following 48 jamo types:

ㄱㄳㄲㄵㄴㄷㄶㄹㄸㄻㄺㄼㅁ
ㅀㅃㅂㅅㅄㅇㅆㅉㅈㅋㅊㅍㅌ
ㅏㅎㅑㅐㅓㅒㅕㅔㅗㅖㅙㅘㅛㅚ
ㅝㅜㅟㅞㅡㅠㅣㅢ

none of which is OOV in the dev set.

Implementation and baselines. We implement our jamo architecture using the DyNet library (Neubig et al., 2017) and plug it into the BiLSTM parser of Kiperwasser and Goldberg (2016).³ For Korean syllable manipulation, we use the freely available toolkit by Joshua Dong.⁴ We train the parser for 30 epochs and use the dev portion for model selection. We compare our approach to the following baselines:

• McDonald13: A cross-lingual parser originally reported in McDonald et al. (2013).

• Yara: A beam-search transition-based parser of Rasooli and Tetreault (2015) based on the rich non-local features in Zhang and Nivre (2011). We use beam width 64. We use 5-fold jackknifing on the training portion to provide POS tag features. We also report on using gold POS tags.

• K&G16: The basic BiLSTM parser of Kiperwasser and Goldberg (2016) without the sub-lexical architecture introduced in this work.

• Stack LSTM: A greedy transition-based parser based on stack LSTM representations. Dyer15 denotes the word-level variant (Dyer et al., 2015). Ballesteros15 denotes the character-level variant (Ballesteros et al., 2015).

³ https://github.com/elikip/bist-parser
⁴ https://github.com/JDongian/python-jamo

For pre-trained word embeddings, we apply the spectral algorithm of Stratos et al. (2015) on a 2015 Korean Wikipedia dump to induce 285,933 embeddings of dimension 100.

Parsing accuracy. Table 2 shows the main result. The baseline test LAS of the original cross-lingual parser of McDonald13 is 55.85. Yara achieves 85.17 with predicted POS tags and 88.61 with gold POS tags. The basic BiLSTM model of K&G16 obtains 82.77 with pre-trained word embeddings (78.95 without). The stack LSTM parser is comparable to K&G16 at the word level