[Mechanical Translation, vol.4, no.3, December 1957; pp. 76-78]
A Refinement in Coding the Russian Cyrillic Alphabet
B. Zacharov, London University, London, England
By reducing the number of characters to be coded the problem of devising a
numerical code for the Cyrillic alphabet can be simplified. This reduction can be
achieved by providing code-words for only the lower-case forms of characters that
do not occur initially, by disregarding the diacritic of the character ё, and by
disregarding the character ъ entirely. Ambiguities that arise in the latter cases
can be resolved by an examination of the context.
THE PROBLEM of coding the Russian Cyrillic alphabet in numerical form has been
considered previously in several papers¹ and it is clear that it would be
desirable if each character of the Russian alphabet (together with any required
numbers, punctuation marks and capitals) could be coded in such a way that a
separate unique numerical code-word existed for each lower-case character,
capital, etc. Unfortunately, the speed of modern digital computers and the size
of their memories are such that a code of this form would result in considerable
time being spent in the memory search for the appropriate target language
equivalent.

It is clear, then, that ways must be found, apart from engineering advances, to
speed up the memory search time. One way of doing this would be to decrease the
amount of linguistic data stored in the memory, and this has been considered.²
Another method would be to decrease the amount of numerical data (i.e., the
number of bits) in the memory for a given number of source language characters.
This last approach has been considered in a recent paper on mechanical
translation³ where all the lower-case characters, except ё, й, ъ and ь, are
represented by a five binary-digit code, while all the capitals and decimal
numbers use a ten-bit code; in the code proposed in that paper simplification is
obtained on the basis of the statement that "... five of the 33 Russian letters
never start a word and will not need to be capitalized ...". The five Russian
letters referred to are ё, й, ъ, ь, ы.

All the other Russian characters occur frequently in both upper and lower case.
They must either be coded separately in both these forms or be given the same
numerical code in both, with the upper case always preceded by some number
which denotes an 'upper-case shift'.

Inspection of the statement quoted above reveals that it is formally incorrect
with respect to ё, although it is quite correct to state that none of the four
characters й, ъ, ь and ы ever begins a word in the Russian language, so that,
clearly, it will never be necessary for them to be coded in upper-case form. (A
rigorously phonetic transliteration of some other alphabet into Russian may
create a trivial exception in the cases of й and ы. This will not be considered
here.)

1. Harper, K. E., "The Mechanical Translation of Russian: Preliminary Report",
Modern Language Forum, vol. 38, no. 3-4, pp. 12-29, Sept.-Dec. 1953.
2. Oettinger, A. G., "The Design of an Automatic Russian-English Dictionary",
Machine Translation of Languages, John Wiley and Sons, New York (1955),
pp. 47-65.
3. Wall, R. E., "Some of the Engineering Aspects of the Machine Translation of
Languages", AIEE Transactions, I, vol. 75, 580 (1956).
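
The two ways of handling capitals described above (separate code-words, or a
shared code preceded by an upper-case shift) can be compared in a small sketch.
The code numbers, the value of SHIFT and the function names are assumptions
made purely for illustration; they are not the assignments used in any of the
papers cited, and numbers and punctuation marks are left out.

```python
# Illustrative only: code numbers and SHIFT are assumed, not taken from the papers.
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"  # the 33 lower-case letters
NEVER_INITIAL = set("йъьы")  # letters that never begin a Russian word

# Scheme A: separate code-words for lower and upper case.
# The never-initial letters receive no upper-case code-word.
LOWER = {ch: i for i, ch in enumerate(ALPHABET)}
UPPER = {
    ch.upper(): len(ALPHABET) + i
    for i, ch in enumerate(c for c in ALPHABET if c not in NEVER_INITIAL)
}


def encode_separate(word):
    """Encode a word with distinct code-words for capitals."""
    table = {**LOWER, **UPPER}
    return [table[ch] for ch in word]


# Scheme B: one code-word per letter, capitals preceded by a shift code.
SHIFT = len(ALPHABET)  # hypothetical 'upper-case shift' code


def encode_shift(word):
    """Encode a word, emitting SHIFT before each upper-case letter."""
    codes = []
    for ch in word:
        if ch.isupper():
            codes.append(SHIFT)
            ch = ch.lower()
        codes.append(LOWER[ch])
    return codes


code_words_a = len(LOWER) + len(UPPER)  # 33 + 29
code_words_b = len(LOWER) + 1           # 33 letters + the shift code
```

Under these assumptions scheme A needs 62 distinct code-words for letters and
scheme B needs 34, at the cost of one extra code for every capitalized word.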
The Problem of ё

Reference to a Russian-English dictionary⁴ shows us that many words of the
Russian language begin with ё. Notable examples are ёлка 'fir tree' and ёмкость
'capacity'; the latter is of especial importance in scientific texts.

Superficially, therefore, it would appear that ё should be treated in the same
way as the other word-initial characters and that it should be coded in upper
and lower case. However, the following points must be considered.
i) In practice, ё is never written in script form with the diacritic, either in
lower or upper case; е and Е are used.
ii) A modern standard Russian typewriter keyboard does not contain Ё or ё; the
upper and lower case forms of е are used, as in (i).
iii) Both ё and Ё frequently appear in print, especially in the texts of
scientific periodicals.

Thus, from (i), (ii) and (iii) above, it can be seen that the problem of
encoding ё and Ё is complicated by the source of the Russian language text. If
е and ё are coded separately, it would appear that words containing ё would
have to be stored in the memory in two separate locations, with both е and ё in
the corresponding positions of each word.

a) ё at the beginning of a word

For words with ё at the beginning, any coding difficulty can be overcome if it
is noted that, if the diacritic is ignored, no ambiguity can arise. This is
because no two words in the Russian language exist with different meaning such
that corresponding letters of both words are the same except that ё at the
beginning of the first word is replaced by е in the second word. As a result of
this consideration it will clearly never be necessary to encode ё in
capitalized form; the upper-case form of е will be sufficient.

b) ё in any letter position

If ё occurs in some letter position other than at the beginning of some word
(x), ambiguity can arise only if another word (y) exists such that all the
letters of the (y)-word are the same as the corresponding letters of the
(x)-word except that ё in (x) is replaced by е in (y). Examination of a
Russian-English dictionary reveals that this does not occur often in the stem
of a word. Similarly, experience tells us that ambiguity seldom arises as a
result of word endings together with stem.

Examples of words where ambiguity may occur are:

    все     all (plural)
    всё     all (singular, neuter)

    села    of the village (genitive, singular); she sat
    сёла    villages (nominative/accusative, plural)

Whereas discrepancy need not necessarily occur in the first example,
considerable ambiguity can arise in the second case since the words are
different grammatical forms of widely different words (сёла is a plural noun
while села may be a verb form or a singular noun).

However, we note that if the contexts of these words are examined, most cases
of ambiguity disappear (this is especially true for Russian where strict
grammatical rules concerning case endings and conjugation must be observed).
Indeed, such an examination is essential for certain words in Russian and, more
especially, in English.⁵

Certain Russian words are such that their spelling is associated with multiple
meaning and, here, it is often the case that an examination of the context will
not reveal which alternative is meant. In this event it becomes necessary to
print out all the alternatives stored in the computer memory which correspond
to the source word. At this stage a simplification may be effected if the
computer dictionary is concerned only with a certain field (e.g., nuclear
physics), in which case only those terms which may reasonably be expected to
relate to that field will be printed out.

Examples of Russian words in such a category are:

    замок       castle; lock
    замотать    twist; shake

4. Smirinskii, A. I., Russian-English Dictionary, State Publishing House for
Foreign and National Dictionaries, Moscow (1952).
5. Yngve, V. H., "Syntax and the Problem of Multiple Meaning", Machine
Translation of Languages, John Wiley and Sons, New York (1955), pp. 208-226.
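
The condition stated under (b), that ignoring the diacritic causes ambiguity
only when two distinct words differ solely in ё versus е, can be checked
mechanically against a word list. The following is a minimal sketch; the
function names and the tiny sample list are hypothetical, and a real check
would run over the full dictionary.

```python
from collections import defaultdict


def fold_yo(word):
    """Ignore the diacritic: treat ё as е and Ё as Е."""
    return word.replace("ё", "е").replace("Ё", "Е")


def find_collisions(words):
    """Return groups of distinct words that become identical once ё is folded."""
    groups = defaultdict(set)
    for word in words:
        groups[fold_yo(word)].add(word)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


# Hypothetical sample drawn from the examples in the text.
SAMPLE = ["все", "всё", "села", "сёла", "ёлка", "ёмкость", "замок"]
collisions = find_collisions(SAMPLE)
```

For this sample only the pairs все/всё and села/сёла collide; ёлка and ёмкость
do not, in line with the observation that word-initial ё causes no ambiguity.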
In the two examples above, ambiguity will disappear if the words are used in
idiomatic context (e.g. padlock = висячий замок).

In the case of words containing е or ё, however, difficulties of multiple
meaning that cannot be resolved by simple context (i.e., syntax) examination
are very rare. In fact, in the author's experience, no example can readily be
quoted.

Suggested Encoding Rules

From the above considerations, a set of rules can be formulated to include
words containing ё and Ё. They are:
i) Source language words containing ё or Ё are stored in the dictionary in
numerical form as if they contained е or Е in the corresponding letter
positions.
ii) Incoming source language words are coded with a unique number code for
every lower-case character except ё, which is treated as if it were е. All
upper-case characters will have unique number codes corresponding to them (or
they will be preceded by a coded upper-case symbol), except Ё, where the
diacritic is ignored and the character is treated as if it were Е; й, ъ, ь and
ы will have no upper-case code.
iii) If more than one target language alternative is found, the context of the
Russian language word must be examined; this will also be required for any
other word (not containing е or ё) where ambiguity may exist, as in the
examples above.

The Problem of ъ

It may be noted that ъ could also be ignored completely since it occurs so very
rarely in the Russian language. This may be of some importance since the
character can be represented in several different ways, namely:
i) as ъ
ii) as '
iii) as a gap in a word
iv) it is ignored completely.

As in the above encoding rules, if ambiguity occurs because ъ is ignored, the
context of the word must be examined. An example of words where this kind of
difficulty can arise is

    сесть  = sit down
    съесть = eat

In these cases, if a unique meaning cannot be found simply from the program,
all the target-language equivalents will have to be printed out and the
required meaning determined by post-editing.

From an examination of the occurrence of ё in the Russian language it seems
that, if the diacritic is ignored, the chances of ambiguity occurring in MT,
with the rules formulated above, are very slight. Indeed, for a specific
subject, where all the source language words in the dictionary are known, most
cases of ambiguity and difficulties of multiple meaning could be overcome by
sufficiently sophisticated programming techniques (i.e., syntactical and
idiomatic context examination for all the cases of expected ambiguity).

As to ъ, it may be ignored in the encoding. The few cases of ambiguity will be
resolved from a study of context.
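
The suggested rules can be sketched as a normalization step applied both when
the dictionary is stored and when an incoming word is looked up. This is a
minimal illustration only: the function names, the use of plain strings in
place of numerical codes, and the sample entries are assumptions, and the
context examination of rule (iii) is not attempted. Where several equivalents
are found, all of them are returned, as they would be printed out for
post-editing.

```python
from collections import defaultdict


def normalize(word):
    """Treat ё/Ё as е/Е and ignore ъ entirely."""
    return (
        word.replace("ё", "е")
        .replace("Ё", "Е")
        .replace("ъ", "")
        .replace("Ъ", "")
    )


def build_dictionary(entries):
    """Store each source word under its normalized spelling (rule i)."""
    table = defaultdict(list)
    for source, target in entries:
        table[normalize(source)].append(target)
    return dict(table)


def look_up(table, word):
    """Normalize the incoming word (rule ii) and return every stored equivalent."""
    return table.get(normalize(word), [])


# Hypothetical sample entries taken from the examples in the text.
ENTRIES = [
    ("ёлка", "fir tree"),
    ("все", "all (plural)"),
    ("всё", "all (singular, neuter)"),
    ("сесть", "sit down"),
    ("съесть", "eat"),
    ("замок", "castle"),
    ("замок", "lock"),
]
TABLE = build_dictionary(ENTRIES)
```

With these entries, looking up ёлка or елка yields the single equivalent
'fir tree', whereas съесть yields both 'sit down' and 'eat', so that the choice
is left to context examination or to the post-editor.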