INTERSPEECH 2006 - ICSLP, September 17-21, Pittsburgh, Pennsylvania
DOI: 10.21437/Interspeech.2006-252

Chinese Input Method Based On Reduced Mandarin Phonetic Alphabet

Chun-Han Tseng, Chia-Ping Chen
Department of Computer Science and Engineering
National Sun Yat-Sen University
Kaohsiung, Taiwan 800
M943040041@student.nsysu.edu.tw, cpchen@cse.nsysu.edu.tw
Abstract

In this paper we study the problem of simplifying Chinese input methods and making them suitable for use with mobile devices. To see the feasibility of aggressively reducing the number of keystrokes per Chinese character, we compare three input modes: character-based, syllable-based and first-symbol-based. Specifically, we use these linguistic units as token types and compare the perplexities. With the language model trained on data based on the ASBC corpus, the perplexity of the data set we collect from on-line chat and instant messages is 102.6 for the character-based model, 67.7 for the syllable-based model and 16.3 for the first-symbol-based model. Arguing from the relation between the perplexity and the number of "typical" sentences of a language model, our conclusion is that on average there are 6 to 7 characters per first-symbol in natural Chinese language.

Index Terms: speech synthesis, unit selection, join costs.

1. Introduction

With more powerful handsets and faster data communication speeds, mobile electronic devices appear to be the converging points for new information technologies, looming to replace their immobile counterparts. However, for that to happen, the user interfaces on these devices do need significant overhauls.

Take the instant message (IM) service for example. Having long run on desktops and laptops, it now runs on mobile phones since the advent of 3G wireless networks. However, in order to input a text message, users can only use key pads limited in size and in the number of distinct keys. Since the set of potential text is large, this constraint in size poses a severe challenge for a convenient and healthy interface.

From the perspective of source coding, we can view the Chinese input problem as representing each Chinese sentence (source) by a codeword of input symbols. Ideally, a source code has a high probability of being decodable and a low expected code length. Here, in addition, we require that the number of code symbols (the size of the alphabet set) should be as low as possible.

The scenario of our input scheme is as follows. When a user wants to input a sentence, the user inputs the sequence of first Mandarin phonetic symbols¹ of the characters in the sentence. Given the input sequence, the system outputs the most likely candidate sentences for the user to choose from. Whether this is a feasible approach or not depends on the entropy of the text (source) and the entropy of the symbol sequence. It is certainly feasible if these entropies are similar in magnitude. Otherwise, there will be many sentences (exponential in the input size) for a given input symbol sequence. If this is the case, the system must be able to search efficiently for potential sentences and list the top candidates in the order of probability for the user to choose.

This paper is organized as follows. In Section 2, we review common Chinese input methods and research on those methods related to the Mandarin phonetic alphabet. We describe the principle and practice of our system in Sections 3 and 4. We present our experiments and discuss the results in Section 5. In Section 6, we summarize our work.

¹The first symbol in a Mandarin syllable is loosely known as the head, but they are not quite the same; sometimes the tail is the first-symbol if the syllable contains only one symbol.

This work was supported by the National Science Council of Taiwan, ROC, grant number NSC94-2213-E-110-061.

2. Review

There are several common Chinese input methods: Pinyin, Pinzi, Complex, Hand-written, and Number. The Pinyin type is based on using the Mandarin phonetic symbols to represent a character, such as the Syllable, the Microsoft NewSyllable, and the Natural input methods. The Pinzi method is based on using parts of a character for representation, such as the Chang-Jie and Da-Yi input methods. The Complex type is based on using the form, phoneme and morpheme of a character, such as the Liu input method. The Hand-written type is based on character recognition. In the Number input method, the basic strokes are coded by numbers and the user inputs the sequence of strokes as numbers for a character.
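As a concrete sketch of the phonetic-symbol representation behind the Pinyin-type methods, and of the first-symbol reduction this paper studies, here is a toy two-character mapping. The dictionary below is a hypothetical stand-in (tones omitted), not the xcin dictionary used later in the paper:

```python
# Toy illustration (hypothetical two-entry dictionary; tones omitted):
# a character is represented by its Mandarin phonetic (bopomofo)
# syllable, and the reduced alphabet keeps only the first symbol.
syllables = {"中": "ㄓㄨㄥ", "山": "ㄕㄢ"}

def first_symbol(syllable):
    # first (or sole) phonetic symbol of a syllable
    return syllable[0]

reduced = [first_symbol(syllables[c]) for c in "中山"]  # ['ㄓ', 'ㄕ']
```

Many characters share a syllable, and many syllables share a first symbol, which is the source of the ambiguity analyzed in Sections 3-5.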
On the Pinyin methods, there are several research works to improve accuracy and efficiency. In [1], a statistical approach combining a trigram language model and a segmentation model is proposed to improve the conversion accuracy. In [2], an approach based on compression by partial match is implemented in the language model, which outperforms modified Kneser-Ney smoothing methods. In [3], a scalar-quantized compact bigram model is used on mobile phones to reduce computational resources.
3. System Overview

The block diagram of our system is shown in Figure 1. First, a user inputs a symbol sequence into the system. With the input as the constraint condition, the system searches and generates a list of candidate sentences with significant probabilities. The list is redirected to the screen for the user to select.

Figure 1: The system block diagram.

Figure 2 illustrates the three modes of user input for the Chinese of "National Sun Yat-Sen University". In the character-based mode, a user has to type all characters correctly. This is virtually error-free as long as the user knows the correct characters. However, it is very time-consuming, and can be tedious on a small device such as a mobile phone. In the syllable-based mode, a user inputs the correct syllable sequence in the symbols of the Mandarin phonetic alphabet. The system outputs the most likely character sequences as the input goes along. The user makes a selection when the input of a sentence, word or phrase is finished. This mode is currently the most commonly used mode for Chinese input on PCs or notebooks. In the first-symbol-based mode, a user inputs just the first Mandarin phonetic symbols of the intended characters. This is essentially the same idea as the syllable-based mode, but with a smaller alphabet and a smaller number of keystrokes per character. It relies on the "intelligence" of the system to do the rest of the job of outputting the intended text.

Figure 2: The input sequences of the three modes for the Chinese of "National Sun Yat-Sen University".

Since different characters can have the same syllable and different syllables can have the same first-symbol, it is expected that, compared to the character sequence, the ambiguity is higher with the syllable sequence and even higher with the first-symbol sequence. A higher ambiguity is reflected by a lower entropy. Let X be the character sequence and Y be the syllable sequence. The joint entropy is

    H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).    (1)

Since H(Y|X) = 0 and H(X|Y) >= 0, we have

    H(X) >= H(Y).    (2)

4. Language Models

We use the bigram language model. In this model, the probability of a sentence s is

    Pr(s) = p(w_1 | <s>) [ prod_{j=2}^{l} p(w_j | w_{j-1}) ] p(</s> | w_l),    (3)

where <s> and </s> are the symbols for the start-of-sentence and end-of-sentence tokens. They are added artificially to each sentence in the corpus. With these tokens, the word unigram at the start of a sentence can be replaced by a bigram, and the probabilities of all sentences, not conditional on the sentence length, sum to 1.

Given the test set T and the language model P trained on the training set, we compute the perplexity

    PPL = 2^( -(1/n) log P(T) ),    (4)

where P(T) is the probability of the test set T under the model P, and n is the number of word tokens in the test set.
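The sentence probability of Eq. (3) and the perplexity of Eq. (4) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names are ours, the model is an unsmoothed MLE bigram, and we assume n counts the word tokens of the test set (the paper does not state whether </s> is included):

```python
import math
from collections import defaultdict

def train_bigram(sentences):
    """MLE bigram model (no smoothing) over tokenized sentences,
    padded with <s> and </s> as in Eq. (3)."""
    bigram, context = defaultdict(int), defaultdict(int)
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        for v, u in zip(toks, toks[1:]):
            bigram[(u, v)] += 1
            context[v] += 1
    # p(u | v) as a relative frequency
    return lambda u, v: bigram[(u, v)] / context[v]

def log2_prob(p, sent):
    """log2 of Pr(s) in Eq. (3) under a bigram model p(u | v)."""
    toks = ["<s>"] + sent + ["</s>"]
    return sum(math.log2(p(u, v)) for v, u in zip(toks, toks[1:]))

def perplexity(p, test_sents):
    """Eq. (4): PPL = 2 ** (-(1/n) * log2 P(T)).
    Assumption: n counts the word tokens of the test set."""
    n = sum(len(s) for s in test_sents)
    log_pt = sum(log2_prob(p, s) for s in test_sents)
    return 2.0 ** (-log_pt / n)
```

On a deterministic toy corpus every bigram has probability 1 and the perplexity is exactly 1; an unseen bigram makes the unsmoothed probability zero, which is what the smoothing and backoff of Eqs. (7)-(10) address.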
set. From (4), the number of typical sentences is approximately [5]

    1/P(T) ~= (PPL)^n.    (5)

Using the bigram model, we have

    log P(T) = sum_{i=1}^{N} [ sum_{j=1}^{l_i} log p(w^i_j | w^i_{j-1}) + log p(</s> | w^i_{l_i}) ],    (6)

where N is the total number of sentences, l_i is the number of words in sentence i, and w^i_j is the j-th word in sentence i (with w^i_0 = <s>).

To estimate the parameters in the bigram language model, we use a maximum-likelihood-based estimator modified by smoothing and backing-off. The maximum-likelihood estimate (MLE) is simply the relative frequency

    p(u|v) = n(u,v) / n(v),    (7)

where n(u,v) is the number of times the bigram (w_j = u, w_{j-1} = v) appears in the train set. To cope with bigrams unseen in the train set, we use the add-one smoothing scheme, adjusting the counts to be

    n~(u,v) = (n(u,v) + 1) n(v) / (n(v) + V),    (8)

where V is the size of the vocabulary, and use the MLE for the adjusted counts

    p~(u|v) = n~(u,v) / n(v) = (n(u,v) + 1) / (n(v) + V).    (9)

On top of smoothing, we also incorporate a backoff scheme into our bigram language model,

    p*(u|v) = p~(u|v),       if n(u,v) > 0,
              alpha(v) p~(u), if n(u,v) = 0,    (10)

where alpha(v) is chosen so that the total probability is 1.

5. Experiments

5.1. Data

5.1.1. Dictionary and Vocabulary

We extract a dictionary from the open-source xcin² and related library source. An entry in the dictionary is a Chinese character (similar to an orthography in English) followed by all its pronunciation variations (similar to homographs). We call this dictionary the xcin dictionary.

²xcin is a server for Chinese input under the X Window system. See http://xcin.linux.org.tw/

For the character-based mode, a character in the text is labelled by itself. We use all characters in the xcin dictionary, for a total of 13065 characters. The vocabulary (of a task) is a subset of the dictionary, containing those characters appearing in the train set.

For the syllable-based mode, a character in the text is labelled by the first syllable of the character's entry in the xcin dictionary. For the label set, we use all syllables that appear in the xcin dictionary as the first (or the sole) syllable of some character, resulting in a total of 1256 syllables. Note that toned syllables are used.

For the first-symbol-based mode, a character is labelled by the first phonetic symbol of the first syllable in the xcin dictionary. It is straightforward to use the set of the Mandarin phonetic alphabet, which contains a total of 37 (first-) symbols.

5.1.2. Text Sets

Two text sets are used in this study. The first, called the ASBC set, is extracted from the Academia Sinica Balanced Corpus [4]. The number of characters in this set is approximately 7.7 million. After adding the start and end tokens, the number of tokens in ASBC is approximately 8.3 million. The content of ASBC covers seven different subjects: literature, life, society, science, philosophy, art, and none.

We collect the second, called the CHAT set, from on-line chat messages. As the name indicates, the content of this set is essentially "chats" between friends or classmates. The number of characters collected in CHAT is approximately 130 thousand. Examples of sentences in CHAT are the Chinese for "Let me ask you something" or "No problem", the kind of utterances commonly used in on-line conversations or instant messages to communicate with other people.

These two sets are of quite different natures. The ASBC set is of various genres and is quite formal (well-written). The CHAT set is more informal and interactive, imitating the spoken language to a large extent.

5.2. Results

For each mode (character-, syllable- and first-symbol-based), we compute the perplexities of the test set using the language model trained on the train set, for the 4 cases listed in Table 1. Since there are 3 modes, a total of 12 runs of experiments are conducted in this evaluation, as shown in Table 2. The results on perplexities are summarized in Table 3.

The cross entropy (CE) is an upper bound for the entropy rate of the stochastic process of natural languages. In other words, it is an approximation to the entropy. PPL and entropy are thus related via CE.
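The smoothed, backed-off estimator of Eqs. (7)-(10) can be sketched as below. This is an illustrative sketch, not the authors' code: the class name is ours, and we assume the back-off distribution p~(u) is an add-one-smoothed unigram, a detail the paper does not spell out.

```python
from collections import defaultdict

class BackoffBigram:
    """Sketch of Eqs. (7)-(10): add-one-smoothed bigram probabilities
    for seen bigrams, backing off to alpha(v) * p~(u) for unseen ones.
    Assumption: p~(u) is an add-one-smoothed unigram distribution."""

    def __init__(self, bigrams, vocab):
        self.n2 = defaultdict(int)   # n(u, v): bigram counts
        self.n1 = defaultdict(int)   # n(v): context counts
        for v, u in bigrams:         # pairs (previous word, next word)
            self.n2[(u, v)] += 1
            self.n1[v] += 1
        self.vocab = list(vocab)
        self.V = len(self.vocab)
        total = sum(self.n1.values())
        # hypothetical unigram back-off distribution (add-one smoothed)
        self.p_uni = {u: (self.n1[u] + 1) / (total + self.V)
                      for u in self.vocab}

    def p_smooth(self, u, v):
        # Eq. (9): p~(u|v) = (n(u,v) + 1) / (n(v) + V)
        return (self.n2[(u, v)] + 1) / (self.n1[v] + self.V)

    def prob(self, u, v):
        # Eq. (10): seen bigrams keep p~(u|v); unseen back off to unigram,
        # with alpha(v) chosen so the distribution over u sums to 1.
        if self.n2[(u, v)] > 0:
            return self.p_smooth(u, v)
        seen = sum(self.p_smooth(w, v)
                   for w in self.vocab if self.n2[(w, v)] > 0)
        unseen_uni = sum(self.p_uni[w]
                         for w in self.vocab if self.n2[(w, v)] == 0)
        alpha = (1.0 - seen) / unseen_uni
        return alpha * self.p_uni[u]
```

For every context v, the probabilities prob(u, v) over the vocabulary sum to 1, which is exactly the normalization role of alpha(v) in Eq. (10).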
Table 1: Usage of data sets for evaluation.

          train set   test set
    A1    ASBC        CHAT
    A2    CHAT        ASBC
    A3    CHAT        CHAT
    A4    ASBC        ASBC

Table 2: The list of task IDs for our experiments.

          character   syllable   first-symbol
    A1    X1          Y1         Z1
    A2    X2          Y2         Z2
    A3    X3          Y3         Z3
    A4    X4          Y4         Z4

Table 3: Experimental results. OOV = out-of-vocabulary, rate = OOV rate (%), CE = cross entropy.

    ID    OOV     rate    CE    PPL
    X1     15     0.01    6.7   102.6
    X2   297k      3.6    9.6   782.2
    X3      0        0    6.6    98.8
    X4      0        0    6.7   103.1
    Y1      0        0    6.1    67.7
    Y2    45k      0.5    8.3   315.6
    Y3      0        0    5.9    57.5
    Y4      0        0    6.2    73.8
    Z1      0        0    4.0    16.3
    Z2    362    0.004    4.6    23.4
    Z3      0        0    4.1    17.1
    Z4      0        0    4.0    16.1

Compare the perplexities using ASBC as the train set and CHAT as the test set (X1, Y1 and Z1). The perplexities are 102.6, 67.7 and 16.3 respectively for the character-based, syllable-based and first-symbol-based modes. On average, the ambiguity of the input mode is 1.5 characters per syllable and 6.5 characters per first symbol.

For all three modes, using CHAT as the train set and ASBC as the test set (X2, Y2 and Z2) yields the highest perplexity. This is due to the fact that CHAT is a small set with a small vocabulary, resulting in many OOV (out-of-vocabulary) tokens in the test set.

The fact that using CHAT outperforms using ASBC as the train set on CHAT as the test set is not too surprising, since most probability is distributed to the patterns that appear in the train set.

5.3. Discussion

The result on the syllable-based mode actually supports the fact that the syllable-based approach is highly feasible. The search space of character sequences for a given syllable sequence is manageable, and fast search can be implemented without significant computational resources.

For the feasibility of the first-symbol-based input mode, further research work is required, as the search space is enormous. It is necessary to structure the search space so that good candidates can be approached efficiently.

The current framework does not consider adapting the system to specific users: if a user frequently inputs certain patterns, the model parameters can be adjusted accordingly to reflect such idiosyncrasy for better performance.

The language model used here is a bigram model with smoothing and back-off. Although good for fast evaluation, there is a risk that this model is oversimplified and unable to capture important dependencies between linguistic patterns.

The CHAT set is quite limited in size. The collection of such data is a difficult issue because text in on-line chat or instant messages is quite personal. Instead of switching to other sets, we will continue to work on this domain, since the application in mind is IM with mobile devices.

6. Conclusion

In this paper, we evaluate the feasibility of a Chinese input method based on the first Mandarin phonetic symbols of the syllables of characters. We use the ASBC corpus and collect on-line chat messages. We compute perplexities using bigram language models with smoothing and backoff. We base our evaluation on the ambiguity of the input symbol sequence in specifying the output character sequence. The experimental results suggest that side information may be needed to reduce the ambiguity for the first-symbol-based mode, and justify the feasibility of the syllable-based mode.

7. References

[1] Zheng Chen and Kai-Fu Lee, "A New Statistical Approach to Chinese Pinyin Input", ACL-2000, The 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, 3-6 October 2000.

[2] Jin Hu Huang and David Powers, "Adaptive Compression-based Approach for Chinese Pinyin Input", ACL SIGHAN Workshop, pp. 24-27.

[3] Feng Zhang, Zheng Chen, Mingjing Li, Guozhong Dai, "Chinese Pinyin Input Method for Mobile Phone", ISCSLP 2000.

[4] Academia Sinica Balanced Corpus documentation (in Chinese), http://www.sinica.edu.tw/SinicaCorpus/98-04.pdf.

[5] T. Cover and J. Thomas, "Elements of Information Theory", John Wiley and Sons, Inc., 1991, USA, ISBN 0-471-06259-6.