OCR for historical Kannada documents using clustering methods

Motivation: Kannada is an ancient language of India and the official language of the state of Karnataka. The study of ancient Kannada scripts from stone carvings, palm leaves, metal, cloth, paper and other sources enhances our knowledge of the traditions and culture practiced in Karnataka. Because of poor quality, variability and contrast, it is very challenging to extract information from ancient Kannada scripts or to recognize their characters. Objectives: To design a suitable Optical Character Recognition (OCR) technique to read ancient Kannada scripts. Method: Clustering by fast search and find of density peaks is a state-of-the-art density-based clustering algorithm that can effectively find clusters with arbitrary shapes. However, it requires calculating the distances between all pairs of points in a data set to determine the density and separation of each point. Consequently, its computational cost is extremely high for large-scale data sets. In this work the given document is preprocessed. Features such as SIFT and SURF are extracted and clustered using K-means clustering. The similarity is computed using different measures. Findings: The classification accuracy was studied under different clustering methods (K-means, agglomerative and density-based clustering) with distance-based measures (Euclidean and Manhattan). To evaluate the performance of the proposed method, we created our own database of Ashok, Kadamba, Hoysala and Mysuru scripts, and experiments were conducted on this database of 4 classes under 70, 50 and 30 different training models from each class. Novelty: We propose K-means clustering using SIFT and SURF features for ancient Kannada manuscripts. Experiments were conducted on our own database to validate the performance of the presented system.


Introduction
Kannada is an ancient language of India and the official language of Karnataka, attested in epigraphs or inscriptions dating back to before 230 BC. Epigraphs or inscriptions are ancient scripts written on stone carvings, palm leaves, metal, cloth and paper. These inscriptions enhance our knowledge of the astrology, culture, philosophy and spirituality of ancient people. Recognizing an inscription is not a straightforward process because of the style of its script; ancient scripts can be read with the help of epigraphers, who are trained to identify them, but identifying them manually is a tiresome and time-consuming process. It would therefore be useful to develop an OCR system that recognizes ancient Kannada scripts automatically and overcomes the disadvantages of the manual process. In Kannada, recognition of historical or ancient manuscripts is a very hard task because of low quality, variability in writing style, a large character set and certain resemblances among the characters, as shown in Figure 1. Feature extraction and clustering or classification are very significant stages for increasing the performance of an OCR system. In the proposed system, the features are obtained with the SIFT and SURF methods. When a database of ancient Kannada manuscripts is created, identification becomes even more challenging because large variability is seen in the collected data samples. Hence, there is a need to organize and search the data inside the collections, so we used different clustering methods (K-means, agglomerative and density-based clustering) with distance-based measures (Euclidean and Manhattan) to increase the performance of the OCR system. For the proposed method, we carried out extensive trials on character datasets of Ashok, Kadamba, Hoysala and Mysuru scripts.
An efficient character recognition system using LDA, PCA and k-means clustering is proposed in (1); it reduces redundant information in the training data and increases the performance of the system. In (2), a k-means clustering method for recognizing printed Kannada documents is presented; it offers a natural degree of font independence, is used to reduce the size of the training dataset, and achieves good accuracy. A clustering technique is used to extract features from handwritten signature images in (3); the authors considered height-width, occupancy and distance ratios to validate signatures with higher accuracy. In (4), a recognition system for handwritten Hindi characters based on K-means clustering and SVM is proposed, and its results were better than those obtained with Euclidean distance. The authors of (5) presented a clustering method for recognizing handwritten characters of Nandinagari, a Brahmi-derived script that was in use in India from the 8th to the 19th century. The SIFT and SURF approaches identify the interest points and derive feature descriptors, achieving decent recognition accuracy. A clustering algorithm is used to recognize handwritten Thai script in (6); the clustering serves as a rough classification method or global feature that reflects the structural significance of the characters. In (7), a recognition system for Yi characters using a CNN with a density-based clustering method is proposed; the authors compared experimental results for different parameters and achieved good accuracy. The authors of (8) proposed a system that uses SIFT and HOG as feature extractors for recognizing handwritten Thai, Bengali and Latin characters and obtained good results. Finding text areas and decorative features in ancient scripts in a robust manner is inspired by object recognition systems; SIFT descriptors are chosen to find interest regions, which are used for localization (9).
The authors of (10) presented two feature extraction methods, diagonal and transition features, conducted experiments on a Gurmukhi database and achieved good accuracy with different parameters. In an OCR system, feature extraction from overlapping character blocks is a major problem that reduces the performance of the recognition phase. The authors of (11) address this problem and propose a system that increases the performance of the recognition phase with the help of clustering and the Hamming distance; their experiments show an improvement in performance. In (12,13), a precise system for the recognition of historical printed records is proposed that combines an LSTM neural network with clustering; combining clustering and classification improves the recognition rate. In (14), a feature extraction method for the printed Odia character set based on k-means and spectral clustering is presented; after comparing the two, the authors concluded that k-means performs better than spectral clustering. The authors of (20) presented an OCR system that uses a connected component technique to recognize handwritten Kannada script with good accuracy. The complete OCR process, syntactical analysis and a ternary search tree for Kannada script are discussed in (21). A Kannada OCR for Kannada sign boards on the Android OS, which uses Kohonen's algorithm and looks up the meaning of the recognized word on the Internet, is described in (22). A complete survey of Optical Character Recognition systems for printed and handwritten Kannada script is presented in (23). The authors of (24) designed a multipurpose OCR system for documents in any language printed in Kannada script. The authors of (25) proposed an accelerated density clustering algorithm based on k-means for several synthetic and real data sets. This algorithm involves additional computational costs.
Although these algorithms improve performance, they do not fully consider how to reduce redundant computations and accelerate the clustering. Therefore, in this work, we use different clustering methods (K-means, agglomerative and density-based clustering) with distance-based measures (Euclidean and Manhattan). The density-based algorithm determines the membership of a point in a cluster by considering not only the connectivity but also the separation of points. Thus, its performance is robust with respect to the radius of a neighborhood, compared to other density-based algorithms. However, as with other density-based algorithms, it requires calculating the distances between all pairs of points in a data set to determine the densities of the points and the separations between them. Consequently, its computational cost is extremely high for large-scale data sets.
Section 2 explains the proposed system, including the feature extraction and clustering methods. Section 3 discusses the experiments and results. The paper ends with the conclusion.

Proposed System
In the proposed work, the given document is first preprocessed. Features such as SIFT and SURF are then extracted and clustered using K-means clustering, as described in the following subsections.

SIFT (Scale Invariant Feature Transform)
SIFT was introduced by David Lowe (2004) for finding distinctive invariant features in images, which can be used to perform reliable matching. This algorithm has been used for recognizing handwritten Nandinagari and Thai characters (5,8) and for identifying text areas and decorative features in ancient scripts (9). The extraction of SIFT features involves the steps explained below.

Keypoint Detection
This stage involves finding points of interest, known as keypoints, which are invariant to scale and orientation. The keypoints are identified through a cascade filtering approach; for each keypoint the scale and location are determined, and gradient operators are then used for orientation assignment. The identification of keypoints is summarized as follows.
Detection of scale space extrema
The first stage of keypoint detection is to detect locations and scales that can be repeatedly assigned under different views of the same object. Locations that are invariant to changes in scale are found by searching for stable features across scales using a continuous function of scale, which is specified by
$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y), \qquad G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)}$$
Here σ is the width of the Gaussian filter G, I is the input image and * is the convolution operator. The Difference of Gaussian (DOG) image is calculated from two nearby scales that differ by a constant multiplicative factor k:
$$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)$$
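The Difference of Gaussian construction can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the truncation of the kernel at about 3σ, the replicated borders and the default k = √2 are all assumptions made for the example.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel, truncated at about 3 sigma (assumption)."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image, sigma):
    """Separable Gaussian blur of a 2-D list; border pixels are replicated."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    rows, cols = len(image), len(image[0])

    def clamp(value, size):
        return min(max(value, 0), size - 1)

    # Horizontal pass over every row.
    tmp = [[sum(kernel[j + radius] * image[r][clamp(c + j, cols)]
                for j in range(-radius, radius + 1))
            for c in range(cols)] for r in range(rows)]
    # Vertical pass over every column of the intermediate result.
    return [[sum(kernel[j + radius] * tmp[clamp(r + j, rows)][c]
                 for j in range(-radius, radius + 1))
             for c in range(cols)] for r in range(rows)]


def difference_of_gaussians(image, sigma, k=math.sqrt(2)):
    """DOG image: blur at scale k*sigma minus blur at scale sigma."""
    low = blur(image, sigma)
    high = blur(image, k * sigma)
    return [[h - l for h, l in zip(high_row, low_row)]
            for high_row, low_row in zip(high, low)]
```

A flat image yields a DOG response of zero everywhere, while an isolated bright pixel yields a negative response at its centre, because the wider Gaussian spreads the intensity further.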

Keypoint Localization
The DOG images obtained in the above step are used to find the keypoints by locating local minima or maxima across scales. Every pixel in a DOG image is compared with its 8 neighbors at the same scale and with the 9 neighbors in each of the scales above and below. The pixel is selected as a candidate keypoint if it is either a local minimum or a local maximum in the 3 × 3 × 3 region spanning the current and adjacent scales. The next step is to reject keypoints that lie on an edge or have low contrast, since they are easily corrupted by noise.
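The 3 × 3 × 3 comparison can be sketched as follows, assuming three DOG images of equal size stored as 2-D lists. The function names and the contrast threshold of 0.03 are illustrative assumptions, and the rejection of edge responses is omitted for brevity.

```python
def is_local_extremum(dog_below, dog_current, dog_above, row, col):
    """True if the pixel is strictly above or strictly below all 26 neighbours."""
    center = dog_current[row][col]
    neighbours = []
    for index, layer in enumerate((dog_below, dog_current, dog_above)):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if index == 1 and dr == 0 and dc == 0:
                    continue  # skip the centre pixel itself
                neighbours.append(layer[row + dr][col + dc])
    return (all(center > n for n in neighbours)
            or all(center < n for n in neighbours))


def candidate_keypoints(dog_below, dog_current, dog_above,
                        contrast_threshold=0.03):
    """Return (row, col) positions of extrema that pass the contrast check."""
    rows, cols = len(dog_current), len(dog_current[0])
    keypoints = []
    for row in range(1, rows - 1):
        for col in range(1, cols - 1):
            # Low-contrast responses are discarded (threshold is an assumption).
            if abs(dog_current[row][col]) < contrast_threshold:
                continue
            if is_local_extremum(dog_below, dog_current, dog_above, row, col):
                keypoints.append((row, col))
    return keypoints
```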
Orientation Assignment
Here a consistent orientation is assigned to each keypoint, which makes the feature invariant to rotation because the keypoint descriptor is expressed relative to this orientation. The scale of the keypoint is used to select the Gaussian-smoothed image L. For each Gaussian-smoothed image sample L(x, y), the magnitude m(x, y) and orientation θ(x, y) are specified by
$$m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}$$
$$\theta(x, y) = \tan^{-1}\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right)$$
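The gradient magnitude and orientation of a smoothed image sample can be computed from pixel differences, as in the following sketch. The image is assumed to be a 2-D list indexed as L[y][x], the function name is hypothetical, and `math.atan2` is used in place of a plain arctangent so that the full angular range is covered and a zero horizontal difference causes no division error.

```python
import math


def gradient_magnitude_orientation(L, x, y):
    """Magnitude and orientation (radians) at pixel (x, y) from central differences."""
    dx = L[y][x + 1] - L[y][x - 1]  # horizontal difference
    dy = L[y + 1][x] - L[y - 1][x]  # vertical difference
    magnitude = math.hypot(dx, dy)
    orientation = math.atan2(dy, dx)
    return magnitude, orientation
```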

Keypoint Descriptor
After orientation assignment, the feature descriptor is calculated. This is done by computing a set of orientation histograms over 4 × 4 pixel neighborhoods. The orientation histograms are expressed relative to the keypoint orientation, as shown in (Figure 2).
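One such orientation histogram can be sketched as below for a single neighborhood. The gradient magnitudes and orientations are assumed to be precomputed 2-D lists, the function name is hypothetical, and the choice of 8 bins is an assumption that the text does not state.

```python
import math


def orientation_histogram(magnitudes, orientations,
                          keypoint_orientation=0.0, bins=8):
    """Magnitude-weighted histogram of orientations relative to the keypoint."""
    histogram = [0.0] * bins
    bin_width = 2 * math.pi / bins
    for magnitude_row, orientation_row in zip(magnitudes, orientations):
        for magnitude, theta in zip(magnitude_row, orientation_row):
            # Express the angle relative to the keypoint orientation in [0, 2*pi).
            relative = (theta - keypoint_orientation) % (2 * math.pi)
            histogram[int(relative / bin_width) % bins] += magnitude
    return histogram
```

Subtracting the keypoint orientation before binning is what makes the resulting histogram unchanged when the whole patch is rotated.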

Speeded Up Robust Features (SURF)
SURF is a scale-invariant and rotation-invariant interest point detector and descriptor. This algorithm has been used for recognizing handwritten Nandinagari characters (5) and in a human recognition method based on profile face and ear (17). It uses a keypoint detector and a descriptor, which are explained below.
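Detecting Keypoints with Fast-Hessian
SURF makes use of the Hessian matrix to detect the keypoints in the image. For a point x = (x, y) in an image I, the Hessian matrix at scale σ is defined as
$$H(\mathbf{x}, \sigma) = \begin{bmatrix} L_{xx}(\mathbf{x}, \sigma) & L_{xy}(\mathbf{x}, \sigma) \\ L_{xy}(\mathbf{x}, \sigma) & L_{yy}(\mathbf{x}, \sigma) \end{bmatrix}$$
where L_xx(x, σ) is the convolution of the second-order Gaussian derivative with the image I at the point x, and similarly for L_xy(x, σ) and L_yy(x, σ).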
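A Hessian-based response can be illustrated with the following simplified sketch. It is not the Fast-Hessian detector itself: SURF approximates the second derivatives with box filters on an integral image, whereas this example swaps in plain finite differences on an already smoothed image L, indexed as L[y][x]. The function name is hypothetical.

```python
def hessian_determinant(L, x, y):
    """Determinant of the 2x2 Hessian at (x, y) using finite differences."""
    lxx = L[y][x + 1] - 2 * L[y][x] + L[y][x - 1]
    lyy = L[y + 1][x] - 2 * L[y][x] + L[y - 1][x]
    lxy = (L[y + 1][x + 1] - L[y + 1][x - 1]
           - L[y - 1][x + 1] + L[y - 1][x - 1]) / 4.0
    return lxx * lyy - lxy * lxy
```

The determinant is large for blob-like structures, where the intensity curves in both directions, and close to zero along straight edges, which is why its local maxima are used as keypoint candidates.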

Extracting SURF Descriptor
The extraction of the SURF descriptor first involves orientation assignment around the keypoint and then the extraction of the descriptor with reference to the orientation.
Orientation Assignment
First, a circular region is considered around each selected keypoint, and the dominant orientation is calculated from the data in this region. The Haar wavelet responses in the vertical and horizontal directions are calculated and summed, and the maximum summed response yields the dominant orientation of the keypoint. The feature vector is then computed relative to the dominant orientation, which makes it invariant to rotation.

Descriptor Components
A square region aligned with the dominant orientation is constructed around the selected keypoint. This region is then divided into 4 × 4 sub-regions, and the Haar wavelet responses are calculated for each of these sub-regions. The sums of the wavelet responses d_x and d_y in the horizontal and vertical directions over each sub-region form the first part of the feature vector. Next, the sums of the absolute values of the responses, |d_x| and |d_y|, are calculated, which provide information about the polarity of the changes in image intensity. Therefore, the feature vector V_j for the j-th sub-region is given as
$$V_j = \left[ \sum d_x, \; \sum d_y, \; \sum |d_x|, \; \sum |d_y| \right]$$
Concatenating the feature vectors of the sixteen sub-regions surrounding the keypoint provides a descriptor vector of length 64 (16 × 4).
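The assembly of the 64-element descriptor can be sketched as follows. The Haar wavelet responses are assumed to be precomputed and passed in as two square 2-D lists, dx and dy, whose side length is a multiple of 4; the function name is hypothetical, and the Gaussian weighting and final normalization used in practice are left out.

```python
def surf_descriptor(dx, dy):
    """Concatenate [sum dx, sum dy, sum |dx|, sum |dy|] over 4 x 4 sub-regions."""
    size = len(dx)
    step = size // 4  # side length of one sub-region
    descriptor = []
    for block_row in range(4):
        for block_col in range(4):
            sum_dx = sum_dy = sum_abs_dx = sum_abs_dy = 0.0
            for r in range(block_row * step, (block_row + 1) * step):
                for c in range(block_col * step, (block_col + 1) * step):
                    sum_dx += dx[r][c]
                    sum_dy += dy[r][c]
                    sum_abs_dx += abs(dx[r][c])
                    sum_abs_dy += abs(dy[r][c])
            descriptor.extend([sum_dx, sum_dy, sum_abs_dx, sum_abs_dy])
    return descriptor
```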

K-Means clustering method
K-means is the most popular and easiest clustering method to implement. It is a partition-based clustering method that is used in many applications. Partition clustering attempts to split a group of N objects into K clusters such that the partition optimizes a certain criterion function. Every cluster is represented by its centroid, as in k-means. Normally, K seeds are selected arbitrarily, and a relocation scheme then iteratively reassigns the points between the clusters to optimize the clustering criterion (18). This algorithm is executed in four steps.
Density-based clustering, in contrast, relies on the following notions:
3. Notion of a point y being density-reachable from a core object x
4. Definition of density-connectivity between two points x, y
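The seed-and-relocate scheme of K-means can be illustrated with a minimal sketch of the standard iteration (choose K seeds, assign each point to the nearest centroid, recompute the centroids, repeat until the assignment stops changing). This is a generic textbook version under stated assumptions, not necessarily the exact four steps used by the authors; the function names, the iteration limit and the fixed random seed are illustrative choices.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick K distinct points as the initial seeds.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(iterations):
        # Step 2: assign every point to its nearest centroid.
        new_labels = [min(range(k),
                          key=lambda j: squared_distance(p, centroids[j]))
                      for p in points]
        # Step 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(column) / len(members)
                                for column in zip(*members)]
    return centroids, labels
```

In the setting of this paper, the points would be SIFT or SURF descriptor vectors rather than the 2-D toy coordinates typically used to try out such a sketch.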

Distance measures
The similarity or dissimilarity between two objects is measured using the Euclidean and Manhattan distances.

Euclidean distance
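The Euclidean distance between two points x = (x_1, ..., x_n) and y = (y_1, ..., y_n) is computed by the following equation:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$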

Manhattan distance
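The Manhattan distance between two points x = (x_1, ..., x_n) and y = (y_1, ..., y_n) is computed by the following equation:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$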
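Both distance measures can be written in a few lines of plain Python. The function names are illustrative, and the inputs are assumed to be equal-length sequences of numbers such as feature descriptors.

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def manhattan_distance(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 and the Manhattan distance is 7, which shows how the Manhattan measure penalizes differences spread over several coordinates more heavily.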

Experimentation and Results
The dataset of historical Kannada characters used for experimentation is shown in (Figure 1). To substantiate the effectiveness of the proposed methodology, we carried out extensive trials on various character datasets, viz. Ashok, Kadamba, Hoysala and Mysuru scripts. Each character dataset contains 25 images, as shown in (Figure 1). In this section, we study the performance of the proposed system under different clustering methods (K-means, agglomerative and density-based clustering) with distance-based measures (Euclidean and Manhattan). We picked images randomly from the dataset, and experiments were conducted on a database of 4 classes under 70, 50 and 30 different training models from each class. The accuracy of SIFT features with different clustering methods with Euclidean distance is shown in (Tables 1, 2). Figure 4 shows the comparison of the clustering methods with the distance measures for both SIFT and SURF. From the tables, we observe that density-based clustering with SIFT features achieves the highest accuracy in all cases when compared with K-means and agglomerative clustering.