Prosody Generation Using Back Propagation Neural Networks for Sindhi Speech Processing Applications

Automated analysis and synthesis of speech still require more research effort in general, and in particular for the development of speech processing applications for Arabic-script languages, such as Sindhi Text-to-Speech. To achieve the required results from speech processing applications, prosodic features must be exploited thoroughly, because prosody is closely linked with information of different kinds, such as linguistic rules, complications and variations of expression. Objectives: This study aims to generate and analyze prosodic information, specifically pitch and duration, from recorded Sindhi sounds using a back propagation neural network. Methods: Two methods are used to obtain the prosodic information of Sindhi sounds: the PRAAT speech analyser is used to obtain the results, and a back propagation neural network model is implemented for validation. From four districts of Sindh, 228 speakers were chosen, and their readings of different descriptive sentences were recorded for the experiments. Finding: In experiments on the collected sounds with a multi-layer neural network model, a highly acceptable accuracy of 98.8% was achieved at the 18th epoch out of 100 epochs. Application and improvements: The generated Sindhi prosodic information and the adopted research methodology will support scholars working on Sindhi speech processing applications. This research work can be considered a first step, as no prior work on generating Sindhi prosody has been found.


Introduction
Sindhi, which is categorized into six dialects, is widely spoken across the world with assorted accents [1]. The language has a large sound inventory and is linguistically as well as phonologically rich compared to other languages spoken in the subcontinent [2]. Over the last two decades, substantial research on Sindhi language processing has been published, but no significant work is found on speech processing in general and on prosody generation in particular, owing to the complex variations in Sindhi accents [3].
The objective of this research is to generate and validate prosodic information from the units of recorded Sindhi sounds. The fundamental prosodic features are pitch and duration, which researchers treat as prerequisite parameters for the development of speech-related software applications such as speech recognition and text-to-speech [4]. The generation and analysis of prosody is complicated and difficult because prosody is connected with several levels of information of different natures. Nevertheless, prosodic information is important for obtaining the correct sentence accent in automatic language understanding, communication and speech synthesis [5].
Various modelling approaches, such as statistical data-driven [6], rule-based [7] and hybrid [8] approaches, have been proposed and implemented to obtain prosodic information from the recorded sounds of languages [9]. However, comprehensive research on the automatic generation and analysis of prosodic information has not yet been performed for the Sindhi language; hence, reliable Sindhi speech processing applications have not been developed for everyday use. This is due to the lack of prosody generation modelling and investigation, which is essential for speech synthesis.
In this study, following [10], various undergraduate students of our university are selected as speakers for the recording of sounds. The speakers are inhabitants of four districts: Sukkur, Ghotki, Shikarpur and Khairpur. The fundamental prosodic information, i.e. pitch and duration, is measured from the recorded sounds using the PRAAT speech analyzer. The calculated prosodic information is mandatory for the development of speech processing software applications [11]. The back propagation Neural Network (NN) approach implemented by [12] is also used for the validation of the prosodic information. This network requires three kinds of values: input values, output values and target values; these are taken from the computed prosodic information of the recorded Sindhi sounds.

Literature Survey
Sindhi phonemes have attracted considerable research interest from a large number of researchers. Problems of Sindhi phonology are observed and discussed in detail by [13]. They report that the strain and vocal inflection of the Sindhi language and its six dialects vary in several aspects of the language. A comparison is made between the accents of speakers of different dialects; it was carried out on waveform images in order to resolve the discovered problems. The authors of [2,14] have also worked on the same subject, with the addition of letter-to-sound conversion. In that research, the emphasis is on demonstrating the f0 peak of different syllables in which long and short vowels are used at different places within words.
The authors of [15][16] analyzed the fundamental frequency of the Sindhi language. Their investigation centres on how pitch operates between intonation and stress. The pitch accent is examined in the recorded sounds of 69 words with different syllables, which were processed through digital experiments. The final outcome of the research is that the stress is directly orthogonal to f0 contours.
The authors of [17] investigated the consonant sounds of the Sindhi language through acoustic analysis. Mostly VCV sequences were collected as sound samples for the experiments. The researchers focused mostly on the liquid consonants, with emphasis on the difference between trill and lateral consonants. In addition, [18] examined the variation of vocalic features in vowels, giving specific consideration to the differences among the languages spoken in Pakistan and including Sindhi phonology. The experiments, performed using the PRAAT speech analyser, found variation in vowels, particularly when comparing Sindhi vowels with those of other languages spoken in Pakistan.

Research Methodology
Accomplishing the project of prosody generation for the Sindhi language requires various steps, calculations and analyses. The core milestones of the research methodology are described in Figure 1. The first and foremost phase is the selection of the speakers and the recording of their sounds in order to get the required results. For this, we selected 228 graduate students of our University who belong to four different districts of Sindh province and have distinct pronunciations of Sindhi words.
Moreover, descriptive sentences are needed for the recordings. Hence, 81 sentences were composed, and 10 randomly selected sentences were given to each selected speaker for recording. Furthermore, a speech corpus was prepared, and the segmented forms of the sounds (phonemes, syllables, words and sentences) were stored in binary format in separate files using SQL. After that, the pitch and duration of each sound were obtained using the PRAAT speech analyzer, examined digitally and measured for further use. Data sets of pitches and durations were also prepared to provide the input, output and target values for the NN.
A back propagation NN model is proposed as the next step of the research methodology, in which the number of inputs and hidden layers is fixed according to the requirements of prosody generation. The proposed network is simulated in Matlab. The next step is to train and test the proposed NN with different inputs of Sindhi prosodic information as well as boundaries. The last phase of this project is to evaluate the performance of the developed NN model.
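To make the two measured quantities concrete, the following minimal Python sketch computes the duration of a segment from its sample count and estimates the pitch of one voiced frame with a plain autocorrelation peak search. It is an illustration only: PRAAT uses its own, more refined pitch analysis, and the synthetic 200 Hz tone, the 75 to 500 Hz search range and the function names are assumptions that do not come from this study. The 16 kHz sampling rate is the one reported for the speech corpus.

```python
import math

SAMPLE_RATE = 16000  # Hz, the sampling rate reported for the speech corpus


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Duration of a segment is its sample count divided by the sampling rate."""
    return num_samples / sample_rate


def estimate_pitch_hz(frame, sample_rate=SAMPLE_RATE, fmin=75.0, fmax=500.0):
    """Estimate f0 of one voiced frame from the highest autocorrelation peak.

    Simplified stand-in for a real pitch tracker; fmin and fmax are assumed bounds.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Hypothetical test signal: 50 ms of a pure 200 Hz tone.
tone = [math.sin(2 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(800)]
print(duration_seconds(len(tone)))     # 0.05
print(round(estimate_pitch_hz(tone)))  # 200
```

On real speech, the frame would be taken from a voiced region of a recorded word, and the per-frame estimates would be combined into a pitch value for the segment.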

Speech Collection and Storing Procedure
The duration and pitch of the recorded voices need to be computed; hence, appropriate speakers are required to accomplish this task [19]. Graduate students of our University aged 20, 21 and 22 years were selected for recording their voices. This research specifically addresses the prosodic analysis of the voices of undergraduates who study at this university but belong to four districts of Sindh province. The number of selected speakers from each district is given in Table 1: 51 speakers from Sukkur, 54 from Shikarpur, 58 from Ghotki and 65 from Khairpur. The sounds of the selected speakers were recorded at the Radio Station Khairpur. The entire recording and storing process is identical to that described in [10]. The prosodic information present in the recorded sounds was carefully considered during the development of the speech corpus, which uses 16-bit encoding and a 16 kHz sampling rate. From the 81 composed communicative sentences, 10 sentences were given at random to each selected speaker, along with the Sindhi prosodic restrictions. With 228 speakers and 10 descriptive sentences each, the total number of recorded sentences is 228 × 10 = 2280, and the number of segmented words is 11348. The total number of syllables is 21673, with 3 to 12 syllables per sentence. The total number of phonemes is 53491, varying from 12 to 28. All of these segmented sounds are used for experimentation with the back propagation NN in order to generate the prosodic information.
The letter-based length of the words plays an important role in the computation of the duration and pitch of the recorded sounds. Sindhi words of two to five characters are selected for the investigation. The recorded sounds are segmented and characterized by distinct values using the PRAAT speech analyzer, a tool commonly used by researchers, as in [20], because of its specific speech analysis capabilities. In the speech database, a taxonomy of words based on their number of letters is used, and the segmented sounds are stored individually according to this taxonomy. The pitch and duration calculation process for the recorded sounds is described in [10]. One or more kinds of prosodic information have also been calculated by various other researchers, as in [21], for different applications. The duration and pitch of each recorded word are given in seconds and Hz respectively.
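The corpus figures above can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the counts reported in the text; the per-sentence averages it prints are derived values that the study does not state, and the function and variable names are illustrative.

```python
# Figures reported in the text.
SPEAKERS_PER_DISTRICT = {"Sukkur": 51, "Shikarpur": 54, "Ghotki": 58, "Khairpur": 65}
SENTENCES_PER_SPEAKER = 10
UNIT_TOTALS = {"words": 11348, "syllables": 21673, "phonemes": 53491}


def corpus_summary():
    """Return the speaker and sentence totals plus derived per-sentence averages."""
    speakers = sum(SPEAKERS_PER_DISTRICT.values())
    sentences = speakers * SENTENCES_PER_SPEAKER
    averages = {unit: round(total / sentences, 2) for unit, total in UNIT_TOTALS.items()}
    return {"speakers": speakers, "sentences": sentences, "averages": averages}


summary = corpus_summary()
print(summary["speakers"])   # 228
print(summary["sentences"])  # 2280
print(summary["averages"])   # derived averages per recorded sentence
```

The district counts add up to the 228 speakers stated in the text, and the derived averages come to roughly 5 words, 9.5 syllables and 23.5 phonemes per recorded sentence, which is consistent with the reported range of 3 to 12 syllables per sentence.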

Investigated Pitches and Durations
The

Performance Evaluation Using Neural Network
Once the recorded sentences were segmented into words, syllables and phonemes, the process of obtaining and classifying prosodic information using the NN began. Two approaches were chosen for extracting as much prosodic information as possible, so that a comparison and statistical analysis could be carried out; the NN is one of them. Following the feed-forward back propagation method, a neural network with 4 inputs and 2 output targets was developed in Matlab, as depicted in Figure 2.
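As a rough illustration of the feed-forward back propagation idea, the Python sketch below builds a network with 4 inputs and 2 outputs and trains it with plain batch gradient descent. It is not the Matlab model used in this study: the hidden layer size, the learning rate, the sigmoid and linear activations, the training rule and the synthetic data are all assumptions chosen to keep the example small and self-contained.

```python
import math
import random

random.seed(0)  # reproducible illustration

N_IN, N_HID, N_OUT = 4, 6, 2  # 4 inputs and 2 outputs as in the text; hidden size assumed


def init(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


W1, b1 = init(N_HID, N_IN), [0.0] * N_HID
W2, b2 = init(N_OUT, N_HID), [0.0] * N_OUT


def forward(x):
    """Feed-forward pass: sigmoid hidden layer, linear output layer."""
    h = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
         for row, b in zip(W1, b1)]
    y = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W2, b2)]
    return h, y


def mse(data):
    """Mean squared error over all samples and both outputs."""
    return sum((yk - tk) ** 2 for x, t in data
               for yk, tk in zip(forward(x)[1], t)) / (len(data) * N_OUT)


def train_epoch(data, lr=0.1):
    """One batch epoch: accumulate gradients over all samples, then update once."""
    gW1 = [[0.0] * N_IN for _ in range(N_HID)]
    gb1 = [0.0] * N_HID
    gW2 = [[0.0] * N_HID for _ in range(N_OUT)]
    gb2 = [0.0] * N_OUT
    for x, t in data:
        h, y = forward(x)
        d_out = [yk - tk for yk, tk in zip(y, t)]  # output error (linear units)
        # Back-propagate the error through the sigmoid hidden layer.
        d_hid = [h[j] * (1 - h[j]) * sum(d_out[k] * W2[k][j] for k in range(N_OUT))
                 for j in range(N_HID)]
        for k in range(N_OUT):
            gb2[k] += d_out[k]
            for j in range(N_HID):
                gW2[k][j] += d_out[k] * h[j]
        for j in range(N_HID):
            gb1[j] += d_hid[j]
            for i in range(N_IN):
                gW1[j][i] += d_hid[j] * x[i]
    n = len(data)
    for k in range(N_OUT):
        b2[k] -= lr * gb2[k] / n
        for j in range(N_HID):
            W2[k][j] -= lr * gW2[k][j] / n
    for j in range(N_HID):
        b1[j] -= lr * gb1[j] / n
        for i in range(N_IN):
            W1[j][i] -= lr * gW1[j][i] / n


# Synthetic stand-in data: 4 features in [0, 1] and 2 targets that depend on them.
data = []
for _ in range(50):
    x = [random.random() for _ in range(N_IN)]
    data.append((x, [0.5 * (x[0] + x[1]), 0.5 * (x[2] + x[3])]))

before = mse(data)
for _ in range(100):  # 100 epochs, matching the epoch budget in the text
    train_epoch(data)
after = mse(data)
print(before > after)  # True: the training error decreases
```

In the study itself, the four inputs and two targets would come from the prepared prosodic data sets rather than from random numbers.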
In order to train as well as test the network, a training dataset is required. It was prepared in an Excel sheet with predefined target parameters based on the results obtained from the recorded speech units, i.e. sentence, word, syllable and phoneme, as shown in Figure 3. During training, all of the training samples pass through the learning algorithm simultaneously in one epoch before the weights are updated. Figure 5 shows the gradient at the 18th epoch; Mu, the control parameter of the algorithm used to train the neural network, is 0.1, and validation checks run up to the 18th epoch. Admirable results were obtained during the training of the developed network. The testing phase was performed after training, on the selected dataset and with the predefined parameters. The network produced its best validation performance of 198.8672, measured as Mean Square Error, at the 12th epoch during the testing phase, as depicted in Figure 6. The performance and results of the whole process, using the dataset defined above for the inputs and desired target outputs, are depicted in Figure 7, which shows that the developed system reaches 98.9% accuracy in training and 99% in validation. The accuracy in the testing phase is 98.8%, which is close to the training result, and the overall accuracy achieved by the developed system is also 98.8%. The obtained pitch and duration results have been evaluated with the error rate provided by the developed neural network and with the standard deviation technique, to find the variation of pitch between male and female speakers of different age groups.
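The two evaluation measures named above, mean squared error and standard deviation, can be sketched in a few lines of Python. All pitch values below are hypothetical numbers invented for illustration; they are not measurements from this study, and the grouping and function names are assumptions.

```python
import statistics


def mean_squared_error(targets, outputs):
    """Average of the squared differences between target and network output."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


# Hypothetical pitch values in Hz, invented purely for illustration.
pitch_by_group = {
    "male": [118.0, 125.0, 131.0, 122.0],
    "female": [205.0, 214.0, 198.0, 221.0],
}

for group, values in pitch_by_group.items():
    # The sample standard deviation measures the spread of pitch within a group.
    print(group, statistics.mean(values), round(statistics.stdev(values), 2))

# Hypothetical target and output pitch values for two segments.
print(mean_squared_error([120.0, 210.0], [118.0, 213.0]))  # 6.5
```

Comparing the per-group standard deviations indicates how much pitch varies within each group, while the mean squared error summarizes how far the network outputs lie from their targets.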

Conclusion
Automatic speech analysis and synthesis still require research effort, particularly for the development of Arabic-script-based speech processing applications. Prosody is correlated with information of dissimilar natures, such as linguistic rules, complications and variations in the manifestation of sounds; hence, prosodic features should be exploited as fully as possible to achieve the required results from speech processing applications. In this work, Sindhi prosodic information, specifically pitch and duration, is generated and analyzed from recorded Sindhi sounds using a back propagation NN. The prosodic information of Sindhi sounds is obtained with two methods. First, the PRAAT speech analyzer is used to obtain the results. Second, a back propagation NN model is proposed and implemented on the 2280 sentence recordings collected from 228 speakers living in four districts of Sindh province, in order to validate the generated Sindhi prosodic information. The experiments were performed on the pitch and duration data sets collected from the recorded sounds. To reach an acceptable level of accuracy, multiple layers and 100 epochs were used, giving 98.8% accuracy at the 18th epoch. The generated Sindhi prosodic information and the adopted research methodology will support scholars working on Sindhi speech processing applications.