Tri-level handwritten text segmentation techniques for Gujarati language

Objectives: To improve the efficiency of tri-level segmentation for handwritten Gujarati text. Methods: We use a hybrid approach to tri-level segmentation, extracting lines, words, and characters from the image. This study presents a segmentation paradigm that handles touching characters, sloped text lines, overlapping characters, etc. It was evaluated on a dataset of 500+ images that we created from different sentences written by different people. We used the horizontal projection technique for line segmentation, the scale-space technique for word segmentation, and the vertical projection technique for character segmentation. Findings: The experimental results show that the proposed method is more efficient for handwritten Gujarati text with diacritics. We obtained an accuracy of 82% for character-level segmentation, 90% for word-level segmentation, and 87% for line-level segmentation. Novelty: We designed a methodology to segment handwritten Gujarati text with diacritics at all three levels: characters, words, and lines. Applications: The proposed tri-level segmentation is a pre-processing task that can be used in any character recognition system, i.e., OCR.


Introduction
In the current era of digital computing, Natural Language Processing (NLP) is becoming increasingly essential in day-to-day life. To widen the reach of technology, it has to be accessible to the population at the grassroots level, which requires considerable NLP work. Character recognition of printed documents has achieved great success in this field. Gujarati is the 7th most spoken language in India. The Gujarat government and local people also use Gujarati as their medium of communication, both verbal and written. According to the literature review, many studies that focused on online and word segmentation have not focused deeply on word segmentation. Many authors have investigated different segmentation methods, but because of the difficulty of the writing style they have not been able to achieve 100% line, word, and character segmentation accuracy.
https://www.indjst.org/
Segmentation can be described as a method of separating a document into smaller sections or small useful regions. It partitions the whole document into standardized units such as lines, words, and characters. The segmentation phases can be divided into line segmentation, word segmentation, and character segmentation (tri-level segmentation). Text line segmentation, an important step, is the toughest task for handwritten Gujarati script. In the Gujarati language, character recognition is challenging because the script has many curves, holes, and strokes, and because writing style differs from person to person.
In (1), the authors improved the methods for line segmentation and developed a novel method for word segmentation. They used Cartesian space calculation with the Hough transform for line segmentation and achieved 98.9% accuracy. For word segmentation, they used two stages. Stage 1, distance computation, calculates the distance between neighboring characters. Stage 2, gap classification, identifies the inter-word gaps; in this stage, the authors used the Gaussian mixture method for universal clustering. They achieved 96.8% accuracy for word segmentation.
In (2), the author noted that every natural language processing system has different segmentation requirements because of unique writing styles. They also suggest building hybrid methods for segmenting lines, words, and characters from a scanned image. In their comparative study of segmentation techniques for a variety of languages, they suggest neural networks, HMM, or SVM for the segmentation phase. In (3), the authors discussed the use of SVM and CNN in a deep learning approach and concluded that SVM provides more accurate results for segmentation and character recognition. In (4), the authors used SVM and BLSTM, decision tree, and dynamic programming for character segmentation. With these methods, they achieved 98.81% accuracy.
In (5), the authors presented segmentation using horizontal and vertical projection for line and character segmentation, respectively. They also present a novel concept for overlapping characters such as 'matras' in the Gujarati language: each overlapping character is split at multiple points using projection, and the characters are then re-merged. Using projection, they performed segmentation on 112 documents containing 7724 characters and report 96.72% accuracy. In (6), the authors noted that histogram projection techniques count the ON pixels in each row to identify lines in the image; they also use projection methods for word and character segmentation. They achieved significant results and suggested improving the available segmentation methods. In (7), the authors used modified horizontal and vertical projection methods for line and character segmentation, with orthogonal projection onto the x-axis. They experimented on over 550 images and obtained a segmentation accuracy of 96%.
In (8), the authors implemented zone determination for line segmentation, along with zone boundary detection, segmentation line generation, segmentation line confirmation, and lower zone component separation. They also present methods for overlapping character segmentation. With this implementation, they achieved almost 85% accuracy. The author of (9) used the layout projection method, dividing the script into multiple zones and creating N*N blocks. They applied this technique to the Tibetan language and achieved 76% accuracy on a database of 5844 images.
In (10), the authors used a modified header and baseline method for line segmentation. This method depends entirely on the pixel values of the image. They obtained 98.1% accuracy for handwritten line segmentation. In (5), the authors focused particularly on characters overlapping with vowels in the Lanna language used in Thailand. They used the histogram method for splitting, rotating, and merging the processed characters.
In (11), the authors built a Bangla OCR that uses the Hough transform for line segmentation, together with color filling of the non-text areas in the line. After segmenting a line, they process each word using connected component (CC) analysis, which lets them easily segment words from the line, and they use zone segmentation for character segmentation. They achieved accuracies of 90.46% for line segmentation, 90.06% for word segmentation, and 75.97% for character segmentation. In (12), the authors also performed Hough-based projection to segment lines from the scanned image, and used the vertical histogram method for character segmentation. They achieved almost 90.866% accuracy on printed Gujarati characters. In (13), the authors implemented a Hough transform technique for text line segmentation of Arabic script and achieved 98.9% accuracy for line segmentation.
In (14), the author used projection and adaptive thresholding algorithms for line segmentation. They tested on their own dataset of almost 2500 lines and on the public IAM handwriting dataset, achieving 97.70% accuracy on their own dataset.
In (15), the author used a deep neural network window approach with the right and left context of the target syllable for the Dzongkha language. They experimented both with and without pre-trained syllable endings. The author achieved 94.40% accuracy for line and word segmentation. In (16), the authors use another neural network model, deep fully convolutional networks (FCNs). They use this network to identify the x-height as a line representation, which lets them segment lines from the image. With this method, the authors achieved 91.3% segmentation accuracy.
In this study, we summarize the latest techniques available for line, word, and character segmentation and discuss the proposed method used in our research. To implement the different segmentation methods, we used our own dataset consisting of 500+ images of different handwriting styles with 1000+ lines and approximately 5000+ different characters.

Methodology
We used a hybrid algorithm to achieve maximum accuracy and uniformity in the segmentation technique. Segmentation is more difficult in the Gujarati language because of diacritics, writing patterns, the slope of characters, character overlapping, etc. In this methodology, we used the horizontal projection method for line segmentation, the scale-space method for word segmentation, and the vertical projection method for character segmentation. In this paper, we focus on the preprocessing and segmentation of scanned handwritten Gujarati images. We have divided the complete segmentation process into different levels, as described below.
For segmentation, we followed the algorithms described below.

Horizontal projection technique for line segmentation
• Step 1: Identify and set an appropriate threshold value for converting the image into a binary image. The image is also converted into grayscale format. We used 120 pixels and the THRESH_BINARY constant value.
cv.adaptiveThreshold(src, maxValue, adaptiveMethod, thresholdType, blockSize, C)
• Step 2: Remove all image borders so that we can easily identify the inner region of the image.
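The idea behind horizontal projection can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this study: the function names, the tiny test page, and the convention that ink pixels become 1 after thresholding (the inverse of THRESH_BINARY on dark-ink scans) are all assumptions made for illustration. Rows whose count of ON pixels reaches a minimum are grouped into runs, and each run is taken as one text line.

```python
def binarize(gray, thresh=120):
    """Map a grayscale image (rows of 0-255 values) to 0/1.

    Assumption: ink is darker than paper, so values <= thresh become 1 (ON).
    """
    return [[1 if value <= thresh else 0 for value in row] for row in gray]


def horizontal_projection(binary):
    """Count the ON pixels in every row of a 0/1 image."""
    return [sum(row) for row in binary]


def segment_lines(binary, min_count=1, min_height=1):
    """Return (top, bottom) row ranges, bottom exclusive, one per text line."""
    profile = horizontal_projection(binary)
    lines = []
    start = None
    # A trailing 0 acts as a sentinel so a run touching the last row is closed.
    for y, count in enumerate(profile + [0]):
        if count >= min_count and start is None:
            start = y                      # first row of a new text line
        elif count < min_count and start is not None:
            if y - start >= min_height:    # ignore runs that are too thin
                lines.append((start, y))
            start = None
    return lines


# Hypothetical 7x6 page with two "text lines" separated by an empty row.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
line_ranges = segment_lines(page)
print(line_ranges)
```

A sketch like this only separates lines that have at least one empty row between them; sloped or overlapping lines, which the text identifies as the main source of error, would need additional handling.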

Scale-space technique for word segmentation
• Step 1: Create the kernel from the following parameters: the kernel size; sigma, the standard deviation of the Gaussian function used for the filter kernel; theta, the approximate width/height ratio of words (the filter function is distorted by this factor); and minArea, the area below which word candidates are ignored.
• Place the kernel anchor on top of a given pixel, with the rest of the kernel overlaying the corresponding local pixels in the image.
• Append the bounding box and image of the word to the result list.
• Step 6: The list of words, sorted by x-coordinate, is ready to use.
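The following plain-Python sketch shows one way the listed steps could fit together; it is an illustration under stated assumptions, not the implementation used in this study. The kernel is assumed to be a plain anisotropic Gaussian whose horizontal standard deviation is sigma * theta, the relative threshold rel_threshold is a hypothetical parameter tuned to the toy input, and word candidates are found as column runs of the thresholded image, which presumes a single text line as input.

```python
import math


def create_kernel(kernel_size, sigma, theta):
    """Anisotropic Gaussian kernel normalised to sum to 1.

    Assumption: the horizontal spread is sigma * theta, the vertical is sigma.
    """
    half = kernel_size // 2
    sigma_x, sigma_y = sigma * theta, sigma
    kernel = []
    for i in range(kernel_size):
        y = i - half
        row = []
        for j in range(kernel_size):
            x = j - half
            row.append(math.exp(-(x * x) / (2 * sigma_x ** 2)
                                - (y * y) / (2 * sigma_y ** 2)))
        kernel.append(row)
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


def apply_kernel(image, kernel):
    """Place the kernel anchor (centre) on each pixel; outside pixels count as 0."""
    height, width = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for i, kernel_row in enumerate(kernel):
                for j, weight in enumerate(kernel_row):
                    yy, xx = y + i - half, x + j - half
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += weight * image[yy][xx]
            out[y][x] = acc
    return out


def segment_words(line_image, kernel_size=3, sigma=1.0, theta=1.5,
                  rel_threshold=0.56, min_area=2):
    """Return (left, right) column ranges of word candidates, sorted by x."""
    blurred = apply_kernel(line_image, create_kernel(kernel_size, sigma, theta))
    peak = max(max(row) for row in blurred)
    mask = [[1 if v > rel_threshold * peak else 0 for v in row] for row in blurred]
    column_on = [any(column) for column in zip(*mask)]
    words, start = [], None
    for x, on in enumerate(column_on + [False]):   # sentinel closes the last run
        if on and start is None:
            start = x
        elif not on and start is not None:
            area = sum(sum(row[start:x]) for row in mask)
            if area >= min_area:                   # drop candidates below minArea
                words.append((start, x))
            start = None
    return sorted(words)


# Hypothetical one-row text line: two "words", each with a one-pixel inner gap.
line = [[0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0]]
word_ranges = segment_words(line)
print(word_ranges)
```

In the toy input, the blur bridges the one-pixel gaps inside each word while the wider gap between the words survives the threshold, which is the effect the scale-space technique relies on. Real handwriting would need a larger kernel and connected-component or contour analysis rather than column runs.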

Vertical projection technique for character segmentation
• Step 1: Identify and set an appropriate threshold value for converting the image into a binary image. The image is also converted into grayscale format. We use 20 pixels and the THRESH_BINARY constant value.
• Step 5: Identify the appropriate starting point from which the word can be split. At this step, we also store the split positions, which are used to segment the word into characters. Pixel runs whose width is less than the threshold value are omitted; here we use cValue as the threshold value.
• Step 6: Draw and cut the word at the desired positions, and create an image for each character.
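A minimal plain-Python sketch of vertical projection follows, again as an illustration rather than the implementation used in this study. The function names and the toy word image are hypothetical; c_value plays the role of the cValue width threshold mentioned in the steps, and the input is assumed to be a 0/1 image in which ink pixels are 1.

```python
def vertical_projection(binary):
    """Count the ON pixels in every column of a 0/1 image."""
    return [sum(column) for column in zip(*binary)]


def segment_characters(binary, c_value=2):
    """Return (left, right) column ranges, right exclusive, one per character.

    Runs of inked columns narrower than c_value are omitted.
    """
    profile = vertical_projection(binary)
    segments = []
    start = None
    # A trailing 0 acts as a sentinel so a run touching the right edge is closed.
    for x, count in enumerate(profile + [0]):
        if count > 0 and start is None:
            start = x                     # stored split position: left edge
        elif count == 0 and start is not None:
            if x - start >= c_value:      # omit runs narrower than c_value
                segments.append((start, x))
            start = None
    return segments


def crop_columns(binary, left, right):
    """Cut one character image out of the word image."""
    return [row[left:right] for row in binary]


# Hypothetical 3x9 word image: a 2-column glyph, a 1-column speck, a 3-column glyph.
word = [
    [1, 1, 0, 0, 1, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 0, 1, 1, 1],
]
char_ranges = segment_characters(word, c_value=2)
char_images = [crop_columns(word, left, right) for left, right in char_ranges]
print(char_ranges)
```

Because the split positions are columns with no ink at all, a sketch like this cannot separate touching or overlapping characters; these are the cases the text identifies as the hardest.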

Data Collection
A database is of utmost significance for any research or experimental task. For the Gujarati language, data are available for individual handwritten characters, but data with multiple lines and paragraphs are not available online. We downloaded the character dataset from the Indian government portal. This data may be used as a trained dataset against which individual segmented characters are compared.

Dataset Generation
We used a dataset that we created ourselves with the help of people with different handwriting styles, age groups, and genders. Our dataset includes printed as well as handwritten Gujarati images. We also included isolated and allied characters for experimental purposes. This dataset includes 1000+ handwritten documents with 10 lines, 4-5 words, and almost 12-15 characters each.

Model Architecture
To enhance line, word, and character segmentation and to improve character recognition, we used projection and scale-space methods for segmentation. The model architecture represents the proposed workflow of the complete paradigm, which includes image scanning, image preprocessing, binarization, and segmentation. In this study, we focused on tri-level segmentation and obtained encouraging results.

Experimental results and discussion
The tri-level algorithm was run on our own dataset of almost 1000+ images. From this dataset, we selected 600+ images for testing. The Gujarati language has 11 vowels and 36 consonants. Each character in Gujarati has a distinctive appearance, and handwritten Gujarati script is irregular in style because of connected characters, overlapping words, the slope of the lines, etc. As described above, we used projection methods for line and character segmentation and the scale-space method for word segmentation.
From the research review (2), we concluded that a combined method for all three levels of segmentation is required. The authors of (5) also used the projection method for segmentation. An individual method is not sufficient for complete segmentation. Hence, we combined the methods to obtain better results.
After implementing the tri-level segmentation described above, we achieved the following results. Using the horizontal projection method for line segmentation, we successfully segmented lines from scanned images. Images with different writing patterns allowed us to test our line segmentation algorithm thoroughly. We processed a total of 642 images, each carrying an average of 10 lines. Accuracy falls short by ~13% because of the writing patterns and the slope of the lines.
For word segmentation, we achieved more accurate results with our implementation. We used 550 lines produced by the line segmentation algorithm, with 6 words in each line. We achieved ~90% accuracy for segmenting words from the lines. The remaining errors are due to very small spaces between adjacent words. The algorithm recovers almost all the words from the input line.
Character segmentation is the most difficult task in the segmentation process because of connected characters in the writing style and because some characters have an irregular shape. Using 2700 different word images with an average of 4 characters each, our algorithm segmented the character images with ~82% success.
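To put these percentages in terms of units, the short sketch below multiplies the reported test-set sizes by the reported averages and accuracies. The resulting totals are back-of-the-envelope estimates implied by those averages, not counts reported in this study, and the names used are hypothetical.

```python
# Test-set sizes, per-input averages, and accuracies as stated in the text.
LEVELS = {
    "line": {"inputs": 642, "units_per_input": 10, "accuracy_percent": 87},
    "word": {"inputs": 550, "units_per_input": 6, "accuracy_percent": 90},
    "character": {"inputs": 2700, "units_per_input": 4, "accuracy_percent": 82},
}


def estimate_counts(inputs, units_per_input, accuracy_percent):
    """Return (approximate total units, approximate correctly segmented units)."""
    total = inputs * units_per_input
    correct = total * accuracy_percent // 100   # integer arithmetic, rounds down
    return total, correct


estimates = {name: estimate_counts(**spec) for name, spec in LEVELS.items()}
for name, (total, correct) in estimates.items():
    print(f"{name}: roughly {correct} of about {total} units")
```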
In this study, we have presented different methods for segmentation. Tri-level segmentation is implemented with hybrid methods, namely projection and the scale-space technique. Compared with existing work specifically for the Gujarati language, we have achieved strong results on handwritten scripts. We also attempt to obtain accurate segmentation from scripts with wide variation in writing pattern, paper quality, ink color, etc.
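The combined accuracy of our implementation is as follows: approximately 87% for line segmentation, 90% for word segmentation, and 82% for character segmentation.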

Conclusion and future work
This study introduces segmentation methods that can be applied to handwritten images to segment lines, words, and characters. It also presents a novel database of a kind that is not publicly available for the Gujarati language, and it presents a combination of multiple segmentation methods for a common handwritten image. We used the horizontal and vertical projection methods for line and character segmentation, respectively, and the scale-space method for word segmentation. Hence, tri-level segmentation can be achieved with hybrid segmentation methods.
With the novel hybrid algorithm we achieve good results, but tri-level segmentation still requires improvement for words and characters. In addition, a prototype must be implemented that combines all three levels of segmentation for further enhancement.
Using our segmentation methods, we will implement a combined model for image segmentation. The segmented images will be passed as input to the next phase, which will be our learning model. Further improvements can be made after comparing our images with the existing dataset and identifying complete handwritten characters in the Gujarati language with diacritics.