A Segmentation-free Approach to Recognise Printed Sinhala Script
H. L. Premaratne
University of Colombo School of Computing, Sri Lanka
Lalith.Premaratne@ide.hh.se hlp@mail.cmb.ac.lk
J. Bigun
School of Information Science, Computer and Electrical Engineering
Halmstad University, S-301 18 Halmstad, Sweden
Josef.Bigun@ide.hh.se
Abstract
The majority of character recognition algorithms, such as those using ANNs, need the script to be segmented prior to recognition. In contrast to Western scripts, Brahmi-descended South Asian scripts such as Sinhala contain modifier symbols, which make segmentation a difficult task that needs to be addressed as a separate issue. Further, the change of shape of the basic character (by violating modification rules) in the modification process makes some modified Sinhala characters impossible to segment. The proposed method, which uses Linear Symmetry to examine the correlation between characters in the script and the testing alphabet, recognises characters directly within the image of the script. A similar method is used to resolve confusing characters. Experiments show highly favourable results not only for the basic characters of the alphabet but also for the modifier symbols. A novel but simple method using Linear Symmetry for skew correction is also proposed.

Key Words: Linear Symmetry, Recognition, Segmentation, Skew Correction

1. INTRODUCTION
1.1 Alphabet and the Modification Process
The Sinhala script, used by over 80% of the 18.4 million population of Sri Lanka, descended from the ancient Brahmi script and evolved independently over many centuries. The Sinhala language is unique to Sri Lanka, and the Sinhala characters, which are generally round in shape, differ from those of all the other Brahmi-descended scripts in South Asia. The Sinhala alphabet consists of 18 vowels, 41 consonants and 17 modifier symbols. A vowel may appear only as the first character of a word, and a consonant is modified using one or more of the modifier symbols to produce the required vocal sound. The total number of different modifications from the entire alphabet, including the basic characters, is nearly 400. Although each character possesses a characteristic shape that distinguishes it from the others, some characters resemble one or more of the other characters in appearance. Some examples are given in Figure 1.

Modification of a character is carried out by simply adding one or more modifier symbols before, after, above or below the character without affecting its general shape. However, in most printed scripts this rule is violated for a specific subset of 10 characters in order to give a better appearance (Figure 2). Also, in some modifications, the joint between the character and the modifier symbol is smoothed to make the modified character appear as a single symbol.

1.2 Characteristics of the Script
A single line of script is organised in three horizontal layers. The middle layer, which contributes approximately 50% of the total line height, mainly includes fifteen (15) basic characters and nine (9) modifier symbols. Twenty-two (22) other basic characters occupy the middle layer and the upper layer, with approximately 75% and 25% of the total height of each character in each layer respectively. The middle and the lower layers include the remaining eight (8) characters, with approximately 75% and 25% of the total height of each character in each layer respectively. Four (4) modifiers occupy the upper layer, while the remaining five (5) modifiers are assigned to the lower layer. The upper and the lower layers are of equal height, each having 25% of the total line height (Figure 4).
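The layer proportions given in Section 1.2 translate directly into row ranges within a text line. The sketch below is illustrative only: the function name `layer_bounds` and the convention of describing a line by its top row and height in pixels are assumptions, not part of the original method.

```python
from typing import Dict, Tuple


def layer_bounds(line_top: int, line_height: int) -> Dict[str, Tuple[int, int]]:
    """Split a text line into the three layers of Section 1.2.

    Returns half-open row ranges (first_row, end_row) using the approximate
    proportions 25% (upper), 50% (middle) and 25% (lower).
    """
    upper_end = line_top + round(0.25 * line_height)
    middle_end = line_top + round(0.75 * line_height)
    return {
        "upper": (line_top, upper_end),
        "middle": (upper_end, middle_end),
        "lower": (middle_end, line_top + line_height),
    }
```

For a line that starts at row 100 and is 40 pixels high, the middle layer covers rows 110 to 129, with ten rows each for the upper and lower layers.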
1.3 The OCR Technology and Recent Developments
Optical Character Recognition (OCR) is the process of converting typed or printed documents into machine-readable code. The original typed or printed documents, scanned to form an image file, are the input to the OCR software system. The scanned result is a picture represented as light intensities on a rectangular grid of points, which does not yet identify individual characters. The OCR system in turn recognises each character or symbol in the image file and makes them available in a suitable text editor, where they can be edited or modified.

Most OCR systems use Artificial Neural Networks (ANNs) as the major tool. In addition to the features identified in a rectangular grid of a matrix that encloses a single character, other features of the character, such as curvature features and transition counts, are also used. In the case of handwriting recognition, some common approaches are ANNs, mathematical morphology, shape analysis and hidden Markov models (HMM). Each of these approaches has its own strengths and weaknesses. Researchers have achieved a significant improvement in performance by combining two or more of these methods. The majority of alphabets contain confusing characters that closely resemble each other. Resolving this problem, especially in the case of handwriting recognition, is a critical issue.

Research on South and South-East Asian scripts lags behind that on European scripts for various reasons. The main reason is the complexity of the scripts. In Asian alphabets, the number of characters is high and the generation of a vocal sound by modifying a character using modifier symbols is complex. Extensive research has been done on a few scripts used by very large populations. Some of this research has been initiated in developed countries owing to the high exposure to such research.

At present, OCR software for languages such as Sindhi, Bengali and Thai is available as commercial products. Research on Devanagari and Tamil has achieved tremendous progress. To the best of our knowledge, little or no research has been done on the recognition of printed Sinhala script.

2. RECOGNITION PROCESS
2.1 Theory
The theory used in the recognition process is the orientation field tensor, which has been used effectively in many applications over the past few years. A local neighbourhood with ideal local orientation is characterised by the fact that the gray value only changes in one direction. In all other directions it is constant. Since the gray values are constant along lines, local orientation is also denoted as linear symmetry [1]. The linear symmetry is also represented in the form of a vector. Since the direction of a simple neighbourhood is different from the direction of a gradient, which is strictly cyclic, representation of the linear symmetry needs the doubling of the angle of orientation. The vector that represents the linear symmetry is composed of two quantities. One is the orientation angle and the other is the certainty measure.

2.1.1 Mathematical representation
The local orientation is determined using the following three steps [1].
i. Select a local neighbourhood from the image using a window function.
ii. Fourier transform the windowed image.
iii. Determine the local orientation by fitting a straight line to the spectral density distribution.

When fitting a straight line, the sum of the squares of the distances of the data points is minimised.

Since the minimisation of $d_i$ is the same as the maximisation of SI, equation (2) is obtained.

The orientation is obtained as the eigenvector of the largest eigenvalue of J. J can be rotated so that it is diagonalised. The rotation matrix is in fact the eigenvector matrix given in equation (1).

Comparison of the diagonal elements on both sides of equation (1) gives

$$\lambda_1 + \lambda_2 = J_{xx} + J_{yy}$$

$$\lambda_1 - \lambda_2 = (J_{xx} - J_{yy})\cos 2\varphi + 2 J_{xy}\sin 2\varphi = \left\langle (J_{xx} - J_{yy},\ 2J_{xy}),\ (\cos 2\varphi,\ \sin 2\varphi) \right\rangle$$
$$\lambda_1 - \lambda_2 = \left\langle I_{20},\ (\cos 2\varphi,\ \sin 2\varphi) \right\rangle = \left\langle I_{20},\ \frac{I_{20}}{|I_{20}|} \right\rangle = |I_{20}|$$

$$\therefore\ \lambda_1 - \lambda_2 = |I_{20}|$$

Define $\nabla f = \partial f/\partial x + i\,(\partial f/\partial y)$; then

$$I_{20} = (\nabla f)^2 = \left(\frac{\partial f}{\partial x}\right)^2 - \left(\frac{\partial f}{\partial y}\right)^2 + 2i\,\frac{\partial f}{\partial x}\,\frac{\partial f}{\partial y} = \left[(\omega_x + i\omega_y)^2 (\omega_x - i\omega_y)^0 |F|^2\right] = (\lambda_1 - \lambda_2)\exp(2i\varphi)$$

$$I_{11} = \left[(\omega_x + i\omega_y)^1 (\omega_x - i\omega_y)^1 |F|^2\right] = (\omega_x^2 + \omega_y^2)|F|^2 = \left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2 = \lambda_1 + \lambda_2$$

The angle of $I_{20}$ is twice the inclination angle of the fitted orientation, if linear symmetry exists, and $I_{11}$ represents the sum of the best and the worst total errors.

The Linear Symmetry algorithm that extracts the tensor is characterised by the fact that it delivers a dense orientation field along with certainties. In the case of high confidence in the existence of an orientation, the linear orientation represents the least change of gray values in one direction and the maximal change in the orthogonal direction. Hence a Linear Symmetry Tensor for an image is constructed by averaging the orientation of the local neighbourhood for each pixel of the image.

2.1.2 Implementation
The LS Tensor for an image is built as explained in the following steps.

Four 1-D derivative filters dx (Gaussian kernel), dy (= - dx'), gx (Gaussian kernel) and gy (= gx') are generated.

The two derivative convolutions dxf (= convolution(gy, convolution(dx, Image))) and dyf (= convolution(gx, convolution(dy, Image))) of the original image with respect to x and y are constructed using the above pairs of filters.

The LS Tensor (complex) is then given by
LS = (dxf + j∗dyf)^2 where j = √(-1)

The correlation between the character being tested and the image is calculated using the formula
absolute(convolution(conjugate(LS Tensor of Character), LS Tensor of Image)).

2.2 Determination of Skew Angle
Almost all recognition algorithms need the text lines in the input image to be horizontal. Therefore, any skew associated with the input image needs correction prior to recognition. Experiments show that the recognition algorithm proposed in this thesis tolerates a skew of +10 to -10. The accuracy of recognition deviates considerably with increasing skew. Therefore a robust method for skew correction needs to be incorporated.

Careful observation of a line of Sinhala script shows that the boundary between the upper and the middle layers and the boundary between the middle and the lower layers (fig. 8) possess the highest amount of energy in the horizontal direction. The horizontal projection of a sample script clearly agrees with this observation. This is because every character in the alphabet touches at least one of these boundaries. Therefore, tracing the appearance of one of these boundaries in a skewed script could be used to determine the skew angle. Although any straightforward method to detect a boundary line could have been used, a more appropriate method using the Linear Symmetry (LS) tensor has been proposed.

The Linear Symmetry tensor [1], which gives information for each pixel of the image on how it is organised with respect to the orientation within a local neighbourhood, can be used effectively to determine the orientation of the script. In general, the orientation angle of the resultant of all the vectors representing the LS at each pixel of the image provides a close approximation to the skew angle. In order to improve the accuracy, the interference with the final result from the following components should be eliminated.

i. Edges of the image.
ii. Background of the image, which consists of pixels having random orientations of low confidence.
iii. Other pixels (within the text area) having orientations of low confidence.

The results obtained for the LS tensor derived in section 3.3.2 yield the skew angle within +10 to -10 accuracy, which is well within the required accuracy for the recognition algorithm.

2.3 Recognition Procedure
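The recognition procedure relies on the LS tensor of Section 2.1.2 and on the resultant-vector skew estimate of Section 2.2. The following is a minimal NumPy sketch of both, under stated assumptions rather than the authors' implementation. The function names, the choice of sigma, the kernel radius of 3 sigma, the margin used to discard image edges and the relative confidence threshold are all illustrative assumptions. The smoothing filters gx and gy are taken to be a sampled Gaussian and the derivative filters dx and dy its analytic derivative.

```python
import numpy as np


def gaussian_kernels(sigma=1.0):
    """Return a 1-D Gaussian and its derivative, truncated at 3*sigma (assumed)."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    g /= g.sum()
    dg = -x / sigma ** 2 * g  # derivative of the Gaussian
    return g, dg


def _filter_axis(img, kernel, axis):
    """Convolve every 1-D slice of img along the given axis with kernel."""
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, img)


def ls_tensor(image, sigma=1.0):
    """Complex LS tensor (dxf + j*dyf)**2, one value per pixel."""
    img = np.asarray(image, dtype=float)
    g, dg = gaussian_kernels(sigma)
    # Derivative along x (columns index), Gaussian smoothing along y.
    dxf = _filter_axis(_filter_axis(img, dg, axis=1), g, axis=0)
    # Derivative along y (rows index), Gaussian smoothing along x.
    dyf = _filter_axis(_filter_axis(img, dg, axis=0), g, axis=1)
    return (dxf + 1j * dyf) ** 2


def estimate_skew_deg(image, sigma=1.0, rel_threshold=0.1):
    """Skew angle in degrees from the resultant of the LS vectors.

    Assumes a non-blank image larger than the kernel. The angle is measured
    in array coordinates, where the row index increases downwards.
    """
    ls = ls_tensor(image, sigma)
    m = int(np.ceil(3 * sigma))
    ls = ls[m:-m, m:-m]                      # (i) discard the image edges
    mag = np.abs(ls)
    keep = mag >= rel_threshold * mag.max()  # (ii), (iii) drop low confidence
    resultant = ls[keep].sum()
    # LS encodes twice the gradient angle; text lines are perpendicular to
    # the gradient, which in the double-angle domain is a sign change.
    return float(np.degrees(0.5 * np.angle(-resultant)))
```

The sign change before taking the angle converts the doubled gradient direction into the doubled line direction, so horizontal structures give an estimate near zero. On real scans the margin and threshold would need tuning.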
2.3.1 Testing Database.
The recognition process is based on the examination of the correlation of characters in the script with each character of the alphabet through a filtering operation. The testing alphabet, which consists of all the characters (including the modifier symbols), is built by extracting characters from an LS tensor. Each character in the testing alphabet is filtered (one at a time) through the LS tensor of the script in order to identify its occurrences in the entire script. The plot of correlation at each pixel (Fig. 10) shows that each occurrence of the character being tested gives a strong correlation. A suitable threshold that separates the required character from the rest of the characters in the script is then determined. This procedure is conducted for every character of the alphabet. During this process, it has been observed that a total of 35 characters, amounting to 60% of the alphabet, separate from all the other characters with a clear threshold (Fig. 10(a)), while the remaining 40% are confused with one or more characters of similar shape (Fig. 10(b)). Eight (8) such confusing groups have been identified. Once all the confusing groups are identified, another level of filtering is carried out to separate each character within its confusing group. The secondary level of filtering examines the correlation of a distinct segment from one character with all the members of the group (Fig. 11). A suitable (secondary) threshold that separates each character from the rest is then determined. A further level of filtering is carried out if the confusion still occurs.

The structure of the testing database is as follows.
Character Identifier
LS Tensor of character
Primary Threshold
Flag to indicate confusing status
Secondary Threshold (for confusing characters)
Tertiary Threshold (for confusing characters)

2.3.2 Recognition.
The image is initially pre-processed to remove the background noise. The image is then scaled (if necessary) to match the average height of a character to that of the testing alphabet.

Recognition of a script is performed by filtering the LS tensor of each character of the testing alphabet with the LS tensor of the script. In each filtering cycle, all the occurrences of the character being tested are identified. If the testing character is a confusing one, the secondary level of filtering is carried out in order to determine the acceptance or rejection of the identified character. A tertiary level of filtering is carried out similarly.

It has been observed that, in addition to the highest value of correlation, usually produced at the centre of the character, a few more relatively high values are produced at the neighbouring pixels. This is because the template of the testing character nearly coincides with the character when it is centred on pixels neighbouring the true centre. This would result in recognising the same character in the image more than once. Therefore, once the filtering has been performed, non-maxima in a small neighbourhood (e.g. 3x3) are suppressed in order to eliminate multiple acceptance of the same character.

The recognition algorithm is as follows:

Input image
Input database-of-characters   /* Alphabet */
Pre-process image
Perform Horizontal-projection
Extract Line-data
Construct LS-tensor
Read character
While not-end-of-alphabet do
    Filter character with the LS Tensor   /* Primary Filtering */
    Suppress non-maximums
    While not-end-of-image do
        Segment occurrences above threshold
        If confusing-character
            Determine relative threshold
            Perform secondary-filtering   /* and tertiary-filtering if necessary */
        End-If
        Store image-coordinates of each occurrence
    End-While   /* not-end-of-image */
    Update output array   /* with ASCII Value, row, column no. */
    Read character
End-While   /* not-end-of-alphabet */
Sort output on Column No. within the Row No.

Since a character is identified directly within the image of the script, the need to segment individual characters does not arise. Symbols such as the comma, full stop and question mark are also recognised with the same accuracy.
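The primary filtering step, correlating the LS tensor of one testing character with the LS tensor of the script, thresholding, and suppressing non-maxima in a 3x3 neighbourhood, can be sketched as follows. This is an illustrative NumPy sketch under assumptions, not the authors' code: the function names are hypothetical, the correlation is computed directly over "valid" positions only, a reported position is the top-left corner of the matched window rather than its centre, and the secondary and tertiary filtering of confusing characters is not shown.

```python
import numpy as np


def ls_correlation(ls_image, ls_char):
    """|sum conj(template) * window| at every valid template position."""
    H, W = ls_image.shape
    h, w = ls_char.shape
    out = np.zeros((H - h + 1, W - w + 1), dtype=complex)
    for i in range(h):
        for j in range(w):
            # Accumulate the contribution of template element (i, j)
            # for all window positions at once.
            out += np.conj(ls_char[i, j]) * ls_image[i:i + out.shape[0],
                                                     j:j + out.shape[1]]
    return np.abs(out)


def suppress_non_maxima(score):
    """Boolean mask of pixels that are >= all 8 neighbours (ties are kept)."""
    rows, cols = score.shape
    padded = np.pad(score, 1, mode="constant", constant_values=-np.inf)
    is_max = np.ones(score.shape, dtype=bool)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            neighbour = padded[1 + dr:1 + dr + rows, 1 + dc:1 + dc + cols]
            is_max &= score >= neighbour
    return is_max


def find_occurrences(ls_image, ls_char, threshold):
    """Return (row, col) of local correlation maxima at or above threshold."""
    score = ls_correlation(ls_image, ls_char)
    peaks = suppress_non_maxima(score) & (score >= threshold)
    return [(int(r), int(c)) for r, c in np.argwhere(peaks)]
```

In this sketch a perfect match scores the sum of squared magnitudes of the template, so a primary threshold could be expressed as a fraction of that value. The paper instead determines the threshold for each character from the correlation plots.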