Multilingual OCR systems for the regional languages in Balochistan

Background: Optical character recognition (OCR) technology has been developed for various languages, but most systems address a single language, so multilingual OCR remains a challenge. Methods: The development of multilingual OCR is a widely debated issue. Researchers are studying the technical and operational feasibility of multilingual OCR, which covers both printed and handwritten characters. In this paper, we study the significance, challenges, and issues of developing a multilingual OCR system for regional languages based on the Persio-Arabic script by conducting a comprehensive survey of the operational viability of multilingual OCR. Findings: Feedback from 339 participants was collected through an online survey to determine the scope and applicability of multilingual OCR. The respondents came from different linguistic backgrounds. The study found that a large majority of participants are willing to use their native language to accomplish their computational tasks and believe that support for multiple languages in software would increase their productivity. Novelty: In its current form, the study addresses the viability of multilingual OCR for regional languages based on the Persio-Arabic script. To the best of our knowledge, no such study has previously been conducted for Pakistan.


Introduction
OCR technology is an emerging field and one of the most important software techniques in computer science. The aim of OCR is to process a human-readable image of text and translate it into a machine-readable format for editing and searching. Commercial OCR packages are readily available for converting text in various natural languages, such as Arabic, Persian, and Chinese, into machine-readable codes efficiently and effectively. The most common and well-known OCR engines are ABBYY and Tesseract (Marcin Heliński). For example, ABBYY FineReader (1) is an all-in-one PDF and OCR system that supports more than 200 languages, including Arabic, Persian/Farsi, Chinese, Japanese, Korean, and Thai. The Tesseract (2) OCR engine is an open-source technology that supports a wide variety of languages, up to 250, and is used to recognize typed, printed, and handwritten text. Tesseract runs on a variety of operating systems, such as Windows, Linux, and Mac. Tesseract was initially developed by HP and UNLV in the 1990s and has been sponsored and developed by Google since 2006 (3). The applications of OCR include processing checks in banks, processing handwritten addresses in post offices, recognizing business cards, and processing all kinds of data entry forms in office automation, among many others. The function of OCR (4) engines is to read an image of handwritten or printed text and convert it into a machine-readable and editable form. Letters, symbols, numerals, punctuation marks, diacritical marks, broken characters, and text lines in images of multiple languages can be recognized easily and efficiently. The most common and powerful commercial OCR engines are available for desktop PCs and for mobile or handheld devices.
OCR engines are especially useful for documents that contain multiple languages, multidirectional text, or coloured text. An OCR engine provides high-speed image processing, automatic recognition, and the ability to search, edit, and copy words across multiple documents, which saves time.
The regional languages of Balochistan have a wide influence on its culture, art, music, and literature. Balochi is one of the major languages of the Balochistan province of Pakistan. Pashto is considered the secondary language of the province. Brahvi and Sindhi are widely spoken in several districts of Balochistan. Balochi (5) is a Northwestern Iranian language and the main language of the Baloch people. It was an unwritten language before the 19th century; the official language at the time was Persian. Balochi is written from right to left. Balochi has 30 letters (6), shown in Figure 1. Taimur Mengal introduced Balochi letters in his pamphlet "Balochi Nama Qasim", published in 1987. He also published the same alphabet in his article "Balochi Mund Likh". Balochi scholars follow the Urdu Arabic script. Mir Gul Khan Nasir published his first Balochi poetry collection in the Urdu Arabic script in 1951. Sayyad Zahurshah Hashemi, the "Father of Balochi", wrote a comprehensive guide in the Urdu Arabic script and standardized Balochi in Pakistan and Iran. According to Sayyad Zahurshah Hashemi, there are 26 characters (7) in the Balochi alphabet, as shown in Figure 2.
Brahvi is a Dravidian language spoken by Baloch people. The Brahvi people are an ethnic group of Pakistan of about 2.2 million people (8), found in Balochistan, Pakistan. The main areas of the Brahvi people in Balochistan are the Bolan Pass and Ras Muari. Brahvi is spoken predominantly in Kalat, Mastung, Khuzdar, the Bolan Pass, and some parts of District Quetta. The Brahvi people are found mostly in Balochistan and in some areas of Sindh in Pakistan. Brahvi (6) speakers are also found in Afghanistan, Iran, Iraq, Qatar, and the UAE. Brahvi has 39 letters and is written in the Arabic and Latin scripts. It has no official status and is used neither in education nor in government.
In Persian literature Pashto is known as Afghani, and in Urdu and Hindi it is known as Pathani. Pashto speakers are known as Pashtuns or Pukhtuns, and sometimes as Afghans or Pathans. Pashto belongs to the Indo-European family and is an Eastern Iranian language. It is one of the two largest and official languages of Afghanistan, where it is the national language, and it is the second-largest regional language of Pakistan. Pashto speakers number about 45-60 million people worldwide. Pashto has 44 letters in its alphabetic character set and 4 diacritical marks. Pashto is written (9) from right to left. Like Urdu and Sindhi in Pakistan, Pashto is bidirectional. Cursive scripts have various writing styles, such as Naskh, Kufi, Nastaleeq, Thuluth, Diwani, and Riqa (10). Of these, Naskh and Nastaleeq are the two most commonly used. Urdu, Brahvi and Balochi are written in the Nastaleeq script. Both styles are commonly used for cursive languages such as Arabic, Persian, Urdu, Sindhi, Balochi, Brahvi and Pashto.
The word Sindhi is an adjective meaning "belonging to Sindh". It is derived from the word Sindhu, the name of the Indus River in Sindh, Pakistan. The word Sindhu means an ocean or river. Sindhi draws on ancient languages including Sanskrit, Prakrit, Arabic, Farsi, and Dravidian languages (such as Brahvi in Balochistan). Sindhi is one of the most widely known languages of the Sindh province of Pakistan, where it is the official language. The history of the Sindhi language is as old as the civilization of Moen-jo-Daro. Sindhi is used as a medium of instruction in schools and colleges and is used in offices. It has gained popularity in print media, such as the Sindhi newspaper Kawish, and in electronic media, such as KTN News. There are 52 characters in the Sindhi alphabet. Sindhi is a cursive language written from right to left. Like Arabic, Urdu, and Farsi, Sindhi is also bidirectional. The authors of (11) proposed a Sindhi OCR for the recognition of Sindhi handwritten numerals and arithmetic strings of numerals without using input devices and memory. Among the regional languages, Sindhi and Pashto have been considered for OCR, but Balochi and Brahvi have not gained attention in the field. Urdu is the national language of Pakistan. Both Balochi and Brahvi use a common font, Noori Nastaleeq.

Background
The regional languages Sindhi, Pashto, Brahvi, and Balochi are cursive-script languages. There is no database for Brahvi or Balochi, and no commercial OCR engine exists for them in the field of computer vision and image processing. Pashto and Sindhi have gained a little attention, but to the best of our knowledge no commercial software is available for them either. The major languages spoken in Pakistan are Sindhi, Punjabi, Balochi, Pashto, and Urdu. The minor languages spoken in Pakistan include Brahvi, Kashmiri, Hindko, Siraiki, Gujrati, and Farsi. Most of the population of Pakistan is Muslim, and Arabic is spoken to some extent because the Holy Quran is in Arabic. The regional languages are based on the Persio-Arabic script and are written in the Naskh and Noori Nastaleeq styles. Urdu, Brahvi, and Balochi use the Noori Nastaleeq font, while Sindhi, Pashto, and Arabic use Naskh. The character sets of Balochi, Brahvi, Pashto, and Sindhi are shown in the figures above. Balochi has 26 characters, Brahvi has 39, Pashto has 43, and Sindhi has 52. These languages are written from right to left. The Sindhi character set is a superset of those of the other regional languages, Balochi, Brahvi, and Pashto. Most of the characters of these cursive regional languages belong to similar class families. Cursiveness refers to the joining of the characters of a ligature. A ligature, or sub-word, is a connected part of a word with no spacing between its characters. A word consists of one or more ligatures.
A ligature is a connected part of a word, with no space between its characters. A word is a set of ligatures. A ligature has two components: the primary ligature and the secondary ligature. The primary ligature is the continuous set of connected characters, with no spacing between them. The secondary ligature consists of the dots or diacritical marks, which are placed above or below the primary ligature. The words Balochistan and Pakistan are shown in Figure 3. The word Balochistan has three ligatures, "Balo", "chista", and "n", as shown in Figure 3. The word Pakistan also has three ligatures, as shown in Figure 2(b). In the regional languages, some characters are distinguished by the marks "Tauy", "Hamza", and "Mada". A complex ligature has two or more connected sub-words within a word. Context sensitivity refers to the multiple glyphs that a character, ligature, or word can take. A character takes a different glyph depending on whether it appears in the initial, medial, final, or isolated position. The different glyphs of a character are shown in Table 1. Characters such as "Alif", "Bay", and "Jeem", and a few others, have four basic shapes: initial, medial, final, and isolated. The shape and position of a character change when a ligature is formed by joining characters.
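The ligature structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's method: it assumes undiacritized Unicode text, and the set `NON_CONNECTING` is a hand-picked, non-exhaustive list of letters that do not join to the following letter (an assumption introduced here, not taken from the paper).

```python
# Letters that join only to the preceding letter, so they close a ligature.
# Illustrative and non-exhaustive (assumption, not from the paper).
NON_CONNECTING = set("اآأإدذرزژوؤڈڑے")


def split_ligatures(word):
    """Split a word into its ligatures (connected sub-words)."""
    ligatures, current = [], ""
    for ch in word:
        current += ch
        if ch in NON_CONNECTING:
            # The next letter cannot attach, so the ligature ends here.
            ligatures.append(current)
            current = ""
    if current:
        ligatures.append(current)
    return ligatures


print(split_ligatures("بلوچستان"))  # the word Balochistan
print(split_ligatures("پاکستان"))   # the word Pakistan
```

Under these assumptions both words split into three ligatures, which agrees with the counts given in the text. A real system would also have to attach dots and diacritical marks (the secondary ligature) to the primary ligature they belong to.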
Sindhi, Pashto, Brahvi, and Balochi are cursive in nature: characters are joined when writing ligatures and words. These regional languages are written from right to left, and their writing system follows that of Urdu, Arabic, and Farsi. The ligatures (sub-words), words, and sentences of these regional languages are shown in Figure 4. A further property of the cursive regional languages is bi-directionality, which is inherited from the Persio-Arabic script. The text is mainly read and written from right to left, whereas numerals are read and written from left to right. Figure 5 illustrates bi-directionality in the cursive regional languages: the numerals in the box are read and written from left to right, and the rest of the text is read and written from right to left. Diagonality is a characteristic of the Noori Nastaleeq font, in which text is written from the top right towards the baseline at the bottom left at a variable tilt angle according to well-defined rules. Figure 6 shows examples of diagonal characters and diagonal ligatures. Stretching refers to the elongation of characters. There are two types: horizontal and vertical stretching. In horizontal stretching, the length of a character increases along the line; in vertical stretching, the character is extended vertically.
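Bi-directionality can be observed directly from Unicode character properties. The following sketch uses Python's standard `unicodedata.bidirectional` to group a string into right-to-left and left-to-right runs. The function name `direction_runs` and the rule that neutral characters (such as spaces) inherit the direction of the preceding run are simplifying assumptions made here; this is not the full Unicode Bidirectional Algorithm.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}        # right-to-left letters (Hebrew, Arabic script)
LTR_CLASSES = {"L", "EN", "AN"}  # Latin letters and numerals


def direction_runs(text):
    """Group text into (direction, substring) runs in logical order."""
    runs = []
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            direction = "RTL"
        elif bidi in LTR_CLASSES:
            direction = "LTR"
        else:
            # Neutral characters follow the previous run (simplification).
            direction = runs[-1][0] if runs else "RTL"
        if runs and runs[-1][0] == direction:
            runs[-1] = (direction, runs[-1][1] + ch)
        else:
            runs.append((direction, ch))
    return runs


for direction, chunk in direction_runs("سال 1947 میں"):
    print(direction, repr(chunk))
```

For a sentence that contains a number, the letters fall into right-to-left runs and the digits into a left-to-right run, which is the behaviour that Figure 5 illustrates.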
The multifont, multisize, and multiscript problems have not been addressed for the regional languages of Balochistan. Recognizing multifont, omnifont, multisize, and compound characters of varying shape and style is still a challenge for both handwritten and printed characters (1). These are open challenges for OCR systems for the natural languages of Pakistan, such as Urdu, Balochi, Pashto, and Sindhi. Omnifont recognition (12) refers to the recognition of any font, regardless of writing style, shape, size, weight, width, or cursive glyph, character, or word, for handwritten and printed scripts. Recognition of single-font, single-size characters has achieved promising accuracy. Chain code and zoning features for 30 Tamil characters, classified with a Support Vector Machine (SVM) (13), have been reported to achieve an accuracy of 88% (2). Natural language processing, image processing, computer vision, and pattern recognition have been successfully applied to character recognition, also known as optical character recognition (14,15). OCR remains an active (16) and challenging research field in natural language processing.

Multilingual Regional Languages Characters Family
The character families of the regional languages of Balochistan are shown in Table 2. Sindhi and Pashto use the Naskh font family, and Brahvi and Balochi use the Nastaleeq font family. The cursive regional languages are written from right to left. Characters are grouped by the number of dots they carry: no dots, one dot, two dots, three dots, and four dots, as shown in Table 3. The dot or period (.) sign is known as "nukta" in the regional languages.
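The grouping of characters by dot count can be represented as a simple lookup table. The sketch below is illustrative only: the `DOT_COUNT` mapping is a small hand-entered sample introduced here (it is not a copy of Table 3), and `group_by_dots` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-entered dot counts for a few letters in isolated form
# (illustrative sample, not a reproduction of Table 3).
DOT_COUNT = {
    "ا": 0, "د": 0, "ر": 0, "س": 0,
    "ب": 1, "ج": 1, "ن": 1,
    "ت": 2, "ق": 2,
    "پ": 3, "چ": 3, "ش": 3,
    "ڀ": 4, "ڇ": 4, "ڦ": 4,  # four-dot letters used in Sindhi
}


def group_by_dots(dot_count):
    """Group letters by the number of dots (nukta) they carry."""
    groups = defaultdict(list)
    for letter, dots in dot_count.items():
        groups[dots].append(letter)
    return dict(groups)


for dots, letters in sorted(group_by_dots(DOT_COUNT).items()):
    print(dots, " ".join(letters))
```

A grouping of this kind is one way to separate the primary shape of a character from its dots, so that letters sharing a base shape can be handled as one class family.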

Structural Framework
A formal description of multilingual OCR is important for developing its application environment. The generic framework is shown in Figure 7, and each step of the OCR design is explained briefly. OCR is a highly successful application of computer vision and pattern recognition.
Image acquisition is the process of acquiring input images from various sources, offline or online, through electronic devices. Digital cameras and scanners are the devices most commonly used to obtain images in electronic format. There are two methods of image acquisition for digitizing the input image into a stored template format:

1. Online image acquisition
2. Offline image acquisition
An online image is obtained by connecting a digital camera to the computer system and transforming the captured image into electronic format for further processing. An offline image is obtained by scanning or photographing an existing document with a scanner or digital camera to convert it into digital format. Such images are called offline because they are not obtained directly from a specialized electronic device. The digital image is stored in a database on the computer, and this database of input images is prepared for the preprocessing operation. An input training image is supplied to the system for recognition and classification. Input images are matched against the stored template database to determine the class of each input image.
Fig 7. General OCR framework for the regional languages
Preprocessing is an important operation that follows image acquisition. It removes noise and imperfections from the input image and improves its properties. There are many preprocessing algorithms, such as image binarization, thresholding (17), and image normalization. The first preprocessing operation transforms the input image into a gray-level image and then into a binary image. A gray-level (monochrome) image takes intensity values from 0 to 255, whereas a binary (black-and-white) image has only the two levels 0 and 1. In simplified terms, thresholding (17) determines the cut-off value used to produce the binary image. Preprocessing operations also include noise removal, smoothing, edge detection (18) (9), slant correction and skew detection (19), image normalization (20) (5), thinning (21) (22) or skeletonization, and baseline detection (23) (24).
The process of separating (25)(26)(27) and partitioning an image is known as segmentation (28) (29). Segmentation is still a challenging and complex problem in image processing. Segmentation algorithms divide the whole image into sub-images: first lines, then words, and then characters. They provide the regions of interest (ROI) (8) for the recognition of isolated images.
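The thresholding and line-segmentation steps described above can be sketched in plain Python. The paper does not name specific algorithms, so two common choices are swapped in here as assumptions: Otsu's method for choosing the cut-off value, and a horizontal projection profile for separating text lines. The function names and the tiny `page` image are hypothetical, and the image is represented as a list of rows of gray levels in the range 0 to 255.

```python
def otsu_threshold(gray):
    """Pick the cut-off t that best separates pixels <= t from pixels > t."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]            # number of pixels in the dark class
        sum0 += t * hist[t]      # sum of gray levels in the dark class
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance.
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(gray, threshold):
    """Map dark pixels (ink) to 1 and light pixels (paper) to 0."""
    return [[1 if v <= threshold else 0 for v in row] for row in gray]


def segment_lines(binary):
    """Return (start_row, end_row) spans of rows containing ink (end exclusive)."""
    lines, start = [], None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            lines.append((start, y))
            start = None
    if start is not None:
        lines.append((start, len(binary)))
    return lines


# Hypothetical 5x4 page: dark values are ink, 250 is paper.
page = [
    [250, 250, 250, 250],
    [250, 20, 30, 250],
    [250, 250, 250, 250],
    [250, 10, 250, 250],
    [250, 15, 25, 250],
]
threshold = otsu_threshold(page)
binary = binarize(page, threshold)
print(threshold, segment_lines(binary))
```

The same projection idea can be applied column-wise inside each line to find word and ligature boundaries, although the diagonality and overlap of Nastaleeq text make that step considerably harder than this sketch suggests.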
Feature extraction captures the properties and characteristics of the segmented objects as numerical values. The shape, pattern, intensity distribution, and texture of an isolated image obtained by segmentation are its features, and they are represented as vectors of numerical values. The extracted features of an isolated image are passed to a classifier for recognition and classification. Classifiers such as Neural Networks (30), Convolutional Neural Networks (CNN) (31), and state-of-the-art Support Vector Machines (SVM) (32) identify the class of the object in the image. Classification is sometimes referred to as recognition, and the two terms can be used interchangeably: recognition is the process of identifying an object using a classification technique. After classification, the text in the image has been converted from its human-readable image form into a machine-readable format. Deep neural networks are preferable for multilingual OCR, as they have already been very successful in other areas, including healthcare (33) and face verification and person identification (34).
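As a small illustration of feature extraction followed by classification, the sketch below computes zoning features (the fraction of ink in each cell of a grid) and assigns the label of the nearest stored template. The nearest-template rule is a deliberately simple stand-in for the neural network and SVM classifiers named above, and the function names and toy 4x4 glyphs are hypothetical. The sketch assumes binary images at least as large as the zoning grid.

```python
def zoning_features(binary, zones=2):
    """Return the fraction of ink pixels in each cell of a zones x zones grid."""
    height, width = len(binary), len(binary[0])
    features = []
    for zy in range(zones):
        for zx in range(zones):
            y0, y1 = zy * height // zones, (zy + 1) * height // zones
            x0, x1 = zx * width // zones, (zx + 1) * width // zones
            ink = sum(binary[y][x] for y in range(y0, y1) for x in range(x0, x1))
            features.append(ink / ((y1 - y0) * (x1 - x0)))
    return features


def classify(features, templates):
    """Return the label whose template vector is closest (squared Euclidean)."""
    def distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(templates, key=lambda label: distance(features, templates[label]))


# Toy 4x4 binary glyphs (hypothetical, for illustration only).
vertical_bar = [[0, 1, 0, 0]] * 4
horizontal_bar = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
templates = {
    "vertical_bar": zoning_features(vertical_bar),
    "horizontal_bar": zoning_features(horizontal_bar),
}

# A noisy vertical stroke with one extra ink pixel.
query = [[0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]]
print(classify(zoning_features(query), templates))
```

In practice the feature vector would be much richer and the classifier would be trained on many labelled samples per character class, but the flow is the same: segmented image, numerical feature vector, class label.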

Design Materials and Methods
Balochistan is a multilingual province of Pakistan. Numerous languages are spoken in Pakistan, but a few regional languages, such as Balochi, Brahvi, Pashto, and Sindhi, have not gained due attention in the field of character recognition. There is a need to survey and identify the problems related to OCR for these cursive languages, which have not gained popularity in computer vision and pattern recognition. Our focus is to determine the problems and issues of these regional languages in the context of OCR. The goal of OCR is to develop a system that reads these cursive regional languages and transforms them into a machine-readable format. Widely used languages such as Chinese, Japanese, and English have gained popularity in computing. Among the regional languages, Pashto and Sindhi have gained a little attention, but Balochi and Brahvi have still not gained attention in computer vision and image processing. To assess the viability of a multilingual OCR system, a small survey study was conducted. The study population comprised 339 participants with different linguistic backgrounds. The number of participants and their mother tongues are shown in Figure 8. Speakers of most of the natural languages took part in the survey, which indicates the breadth of the study. The participants were asked which natural language they would like to use to accomplish their computational tasks. The feedback received is shown in Figure 9. It can be seen that a large number of participants are keen to use their natural language in the computational domain, which indicates the scope and viability of a multilingual OCR system. The participants were also asked whether the use of their natural language on the computer would simplify their work.
The feedback received from the participants is shown in Figure 10. The participants were also asked whether software that supports multiple natural languages would simplify their work and increase their performance and productivity. The feedback received is shown in Figure 11. The overall feedback indicates that multilingual applications are a need of the current era, and that multilingual OCR is therefore extremely important and could improve the performance of users.

Conclusion and Future Work
In this survey paper, we studied the regional languages Balochi, Brahvi, Pashto, and Sindhi. Sindhi and Pashto have been considered for OCR, but there is still no commercial system for recognizing these languages. Moreover, there is no research on Balochi and Brahvi in the context of optical character recognition. Future work will proceed in three directions: i) development of a unified computable OCR framework for all the regional languages of Pakistan; ii) use of deep learning for recognition in multilingual OCR systems; and iii) Bachmann-Landau analysis of multilingual OCR systems. As further future work, we also plan to combine OCR technology with a Text to Speech Synthesizer (TSS) for multilingual characters and regional languages, so that visually impaired people can use a computer through voice interaction.