Investigations of the Approximate Percentage of Noise Required to Perceive Hindi Phonemes using HNM

Objectives: The Harmonic plus Noise Model (HNM) has been found to be one of the best speech analysis-synthesis methods in terms of important characteristics like naturalness, intelligibility, and pleasantness, which are prerequisites for any speech synthesiser. The present study explores the approximate percentage of noise required to perceive some phonemes of the Hindi language. Method / Analysis: HNM treats speech as a combination of periodic and aperiodic signals, so the effect of each part on the quality and intelligibility of different phonemes may be measured individually using HNM. HNM has been employed as the analysis-synthesis platform and the quality of the synthesized speech is tested with the ITU-T standard PESQ (perceptual evaluation of speech quality) measure and MOS (mean opinion score). Findings: Objective results suggest that the percentage of noise is a significant constituent of the quality of synthesized speech. Novelty: Investigations suggest that each individual phoneme requires a different noise and voice percentage for clear perception. Further, the optimum percentage of the noise part for good speech quality has been found to be speaker and phoneme dependent.


Introduction
Verbal communication is the ability to convey one's thoughts by means of a set of signs, whether graphical, acoustic, gestural, or even musical. Among these, speech is an incomparable faculty of human beings which expresses intent, ideas, and desires 1 . Human speech production is a complicated sensorimotor process which requires the assimilation of diverse information sources and intricate patterns of muscle activation 2,3 . A thought process in the brain initiates the excitation of the vocal tract required for the production of an utterance from the oral tract. The vocal folds, the source of vibration, vibrate at the fundamental frequency of the voice. Children are immersed in a world of culture and begin to learn quickly, transitioning from babbling at 6 months of age to full sentences by the age of three.
Knowledge of the generation of various speech sounds helps in understanding their spectral and temporal attributes, which in turn helps in classifying them on a broader scale. Speech can be classified into voiced sounds, unvoiced sounds, vowels, consonants, nasal sounds, continuants, stops, fricatives, syllables, diphthongs, and monophthongs. With a slight amendment of the shape of the vocal tract, different types of sounds ranging from vowels to consonants may be produced 4 . The unvoiced sounds, normally represented as a random white noise source, do not show any periodicity and, as a result, do not have a direct relationship with the pitch 8 . The information that is communicated through speech is intrinsically of a discrete nature, as the constituents of speech are discrete phonemes 5 . Phonemes, also referred to as speech units, are of critical importance not because they are indestructible, but because they comprise the smallest distinction among minimal pairs such as let and lit, or pat and bat. The wide range of phonemes occupies a region in the articulatory space 9 . To utter vowels, the tip of the tongue is positioned in the middle region of the oral cavity (vocal area). The lowering of the velum with simultaneous constriction of the oral cavity, so as to let the air flow through the nasal area, generates nasal sounds, which show a quite broad spectral response similar to that of vowels 8,10 . Stops define the category of sounds which require complete closure of the vocal tract 8 . Noise is produced as a result of partial constriction of the vocal tract, and fricatives are the result of this kind of constriction. Another class of speech sounds, the syllable, is a vowel surrounded by consonants 11 . Vowels are quasi-periodic in nature 12 . All Indian languages use different scripts consisting of dissimilar graphemes, but there is wide-ranging linguistic uniformity at the micro-level 13 .
India has 22 official languages and about 1652 dialects/ native tongues consisting of 10-12 major scripts. Indian languages have a refined notation of a character unit (or akshara).
An akshara is an essential linguistic entity which includes 0, 1, 2, or 3 consonants and a vowel. A word comprises one or more aksharas, and since the languages are exclusively phonetic, each akshara can be expressed independently 14 . Aksharas with more than one consonant are termed samyuktaksharas or combo-characters. In a samyuktakshara, the consonant at the end is the foremost. Unique characters in Indian language scripts are close to syllables and can, in general, be structured as: C, V, CV, VCV, CVC, and CCV, where C signifies a consonant and V stands for a vowel 15 . Hindi is an official language of India spoken by 33% of the total population. I.P.A. symbols for different classes of Hindi phonemes are shown in Table 1, which are further categorized into short vowels, long vowels, semivowels, nasals, fricatives, and stop sounds.
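The C/V structures above can be derived mechanically from a Roman transliteration. The sketch below is purely illustrative: the single-letter vowel set and the function name are our simplifying assumptions, not part of any standard Indic transliteration scheme.

```python
# Simplified Roman transliteration vowel set (an illustrative assumption)
VOWELS = set("aeiou")

def akshara_pattern(akshara: str) -> str:
    """Map a transliterated akshara to its consonant/vowel skeleton,
    e.g. 'ka' -> 'CV' and 'pra' -> 'CCV' (a samyuktakshara)."""
    return "".join("V" if ch in VOWELS else "C" for ch in akshara.lower())
```

For instance, `akshara_pattern("ama")` yields `VCV`, the same unit structure used for the recording script described later.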

Stops
Retroflexion and aspiration are two phonemic features which occupy a vital place in Hindi. There are 8 aspirated plosives and 2 aspirated fricatives in addition to their unaspirated counterparts 16 . It has been stated 16 that when the duration of a short vowel pair is essentially high (i.e., in the case of the vowels ओ and इ), native speakers appear to stress the duration of vowels for preserving the phonemic distinction between members of a vowel pair more than non-native Hindi speakers. In the case of stop sounds, release durations are essentially more significant than closure durations 17 . Stop sounds illustrate that release durations have statistical significance in the case of unvoiced unaspirated, unvoiced aspirated, and voiced aspirated phonemes. In Hindi, the durations of unvoiced aspirated stops are twice the durations of unvoiced unaspirated stops.
In the Tamil language (spoken in Tamil Nadu, India), the durations of unvoiced aspirated stops are equal to the durations of unvoiced unaspirated stops; this may be because Tamil has a single letter for the stop sound at each place of articulation 17 . In Telugu (spoken mainly in Andhra Pradesh, India), the durations of voiced unaspirated stops are equal to those of voiced aspirated stops, and the durations of unvoiced aspirated stops are twice the durations of unvoiced unaspirated stops 17 . The vowels show major dissimilarity between Telugu and Hindi for short vowels and between Tamil and Telugu for long vowels. Vowels help in distinguishing the Telugu language, since Hindi and Tamil speakers speak both short and long vowels at approximately equivalent rates.
Hindi vowels have longer durations than Tamil vowels, while Tamil shows shorter durations than Telugu for short and long vowels respectively. For nasals, the Hindi language shows important durational attributes 17 , and these features carry practical significance in differentiating the Hindi language; consequently, Telugu can be classified using vowels and Hindi can be classified using nasals. Singleton stop phoneme durations are also an imperative feature for the classification of these three languages 17 . It has also been found that most Dogri (spoken in Jammu, northern India) vowels have shorter durations in comparison to Hindi phonemes. The duration in milliseconds of the same vowel when spoken in Hindi is found to be longer compared to when it is spoken by a person with Dogri as the mother tongue 18 . Thus phoneme durations play a considerable role in distinguishing the Hindi, Dogri, Telugu, and Tamil languages pertaining to stop sounds, vowels, and nasals [16][17][18] , and awareness of the durational characteristics of a language plays an essential role in building highly intelligible Text to Speech (TTS) systems. Research has shown that people are sensitive not only to the words they hear but also to the manner in which they are spoken.
Speech synthesis finds wide application in text-to-speech (TTS) systems 19 , in text-independent 20,21 and text-dependent speaker identification 22 , multimedia entertainment, speech recognition 23 , and speaker transformation 24 . The two ways of speech generation are model based and waveform based. Generally, the source-filter model, the directions into velocities of articulators (DIVA) model, and Levelt's model are used in model based synthesis. Unit selection, HNM 25 , concatenative, Linear Predictive Coding (LPC), and formant synthesis techniques are employed in waveform based synthesis 26,27 . The Hidden Markov Model (HMM) based speech synthesis system (HTS) version 2 has also been employed for speech synthesis 28 . Considering the evidence from the last ten years of research, the HNM model appears to be the more promising and robust speech synthesis technique 27,[29][30][31] . The objective of this paper is to determine the minimum percentage of noise and harmonic parts required to perceive the sound of the phonemes. Different models for speech analysis and synthesis are described in the following section. The experimentation and estimation methods employed for the evaluation of the quality of synthesized speech are discussed in Section 3. Section 4 presents the results and discussion, and the conclusion is summarized in Section 5.

Analysis-synthesis Models
One of the most attractive areas of speech communication is man-machine communication. Researchers, including linguists, psychologists, and neurologists, have attempted to throw some light on the development of speech production and perception, initially focusing on the monolingual speaker and then moving to more complex situations which take account of bilingual and multilingual speakers 32 . Although for many years man-made speech has been completely intelligible from a segmental point of view, there are certain areas which still await acceptable realization. During recent years much effort has been directed at increasing naturalness, as speech synthesized from arbitrary text still sounded unnatural 33 . Mechanical sounding voices may be satisfying, but only to a limited extent 34 . The need for natural sounding voices led to the requirement of more innate and intelligible speech synthesizers. In 1791 Von Kempelen proposed that the human speech production system can be represented using mechanical models. He demonstrated his idea by building a machine that could produce human voice 35,36 . Following Von Kempelen's work, another well-known scientist, Wheatstone, made a speaking machine. Much later, Riesz, Homer Dudley, the Haskins Pattern Playback, and many others contributed to this field of speech 37 .
The commercial formant synthesizer DECtalk was the first emotionally expressive speech synthesis system 38,39 . The use of talking machines provides flexibility for comprehensive vocabularies, which is essential for applications such as unlimited translation from written text to speech 40 . Only a few Indian languages, like Hindi, Tamil, Kannada, Marathi, and Bangla, have been employed for developing TTS systems. Synthetic speech can be generated by two approaches: waveform based and model based 41 . Model based techniques make use of the natural model of speech generation in human beings for developing artificial models 42 . Phonemes, morphemes, diphones, triphones, or syllables are required as basic acoustic units to build a speech synthesizer. A combination of these units is also used in some synthesizers 20 .
The DIVA model of speech production provides a computationally and neuroanatomically comprehensive account of the network of brain regions involved in speech acquisition and production 19,20 . Levelt developed a model of human communication in steps, visualizing speech production as a sequence of diverse phases with three foremost mechanisms, specifically the conceptualizer, the formulator, and the articulator 32 . Articulatory synthesis employs simulations and models that originate from the articulatory mechanism of the human speech production system for generating more natural sounding voices 35,43 . The automatic systems built by Von Kempelen and Wheatstone belong to this class 40,44 . Source-filter synthesis, also called formant synthesis, makes use of spectral shaping of the driving excitation, employing formants to characterize the spectral shape 35 . Formants have a direct acoustic-phonetic interpretation and are computationally simple when compared to full articulatory models. The formant synthesis method makes use of an acoustic model for speech generation instead of real recorded human speech 45 .
LPC is amongst the most powerful analysis techniques used in signal processing for the representation of the spectral envelope of speech in compact form, taking into consideration the information exploited in the linear predictive model 46 . It is an important technique for precise, inexpensive 47 measurement of speech parameters like pitch, formant spectra, and vocal tract area functions, and for the representation of speech for low-rate transmission and storage. Such approaches are extensively used by many systems in which speech unit waveforms are stored and later concatenated during synthesis 46 .
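As a concrete illustration, the classic way to obtain LPC coefficients from a speech frame is the autocorrelation method with the Levinson-Durbin recursion; the NumPy sketch below is our own illustrative code, not tied to any particular toolkit.

```python
import numpy as np

def lpc_coeffs(frame, order):
    """LPC via the autocorrelation method and Levinson-Durbin recursion.
    Returns the prediction polynomial a (with a[0] = 1) and the residual
    prediction error energy. A sketch for illustration, not production code."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation at lags 0..order
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this recursion step
        k = -(r[i] + np.dot(a[1:i], r[1:i][::-1])) / err
        # Order update: a_new[j] = a[j] + k * a[i - j], and a_new[i] = k
        a[1:i + 1] += k * np.concatenate((a[1:i][::-1], [1.0]))
        err *= 1.0 - k * k
    return a, err
```

The all-pole filter 1/A(z) defined by these coefficients models the spectral envelope of the frame, which is what makes the compact representation mentioned above possible.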
The acoustic composition of speech can be described as a combination of units called phonemes; put another way, this theory suggests that the distinct phonemes can be blended together to generate a speech waveform, and this is the basic principle of concatenative speech synthesizers 30 . Most implementations involve a mechanism in which voiced sounds are compressed by manipulating a pitch-period waveform to reduce the number of signal samples required for a power spectrum sufficiently close to the original. Concatenative speech synthesis may be categorized into unit selection and diphone synthesis. Unit selection synthesis uses a richer variety of speech and simply cuts out and rearranges speech segments. The main aim is more naturalness in the generated speech 35 . It uses a large database of pre-recorded speech. Research has shown that the sound obtained as a result of unit selection is often difficult to distinguish from the real one; however, for maximum naturalness a large database is needed, which may amount to hours of recorded speech 48,49 . Diphone synthesis makes use of the notion that a small portion of the acoustic signal varies only to a minor degree and is also less affected by the phonetic context than others. The quality of sound obtained using diphone synthesis lies somewhere between that of concatenative and formant synthesis. However, the resulting speech suffers from glitches and is even a little mechanical to hear, similar to the quality of sound obtained from formant synthesis 50 . Sine wave synthesis operates by replacing the formants with pure tone whistles. Domain specific synthesis is a relatively uncomplicated technique which uses the principle of concatenation of pre-recorded speech to generate a complete statement. Because of its trouble-free implementation, this method has long been used extensively 29 for commercial purposes.
This technique finds application in areas where the output text is limited to a particular domain, like weather reports and transit schedule announcements.
Speech analysis-synthesis techniques that take into consideration the characteristics and modification of diverse models for speech quality enrichment can provide more natural sounding and intelligible systems. It has been seen that the quality of the speech originating from a speech synthesizer depends upon the model used by the synthesizer for this process. Harmonic model based concatenative techniques are widely used in TTS systems 22 . The fast generation of a harmonic signal is an important issue in reducing the complexity of TTS systems based on these models 22 . A good model generally requires virtues like intelligible synthesized speech, no difficulty of parameter extraction, ease of modification of parameters, a small number of required parameters, and a low computational load.
A very versatile speech synthesizer called Festival is a concatenative TTS synthesis system developed at the Centre for Speech Technology Research, University of Edinburgh, with components supporting front-end processing of the input text 47,51 . The HNM model performs analysis and synthesis of the speech signal and is basically a modification of sinusoidal models 26 . HNM decomposes the speech signal into a quasi-periodic lower "harmonic" part and a non-periodic upper "noise" part 20,21 . This decomposition 52 , which represents the lower harmonic band as the voiced part and the upper stochastic band as the noise part of the speech signal, permits more naturalness in the speech synthesized with HNM.
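The lower-band/upper-band view can be mimicked crudely with a fixed band split. A full HNM analysis instead estimates a time-varying maximum voiced frequency per frame and models the lower band with harmonic sinusoids, so the fixed 4 kHz cutoff and Butterworth filters below are illustrative assumptions only:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def hnm_band_split(x, fs, max_voiced_freq=4000.0):
    """Split a signal into a lower 'harmonic' band and an upper 'noise'
    band at an assumed maximum voiced frequency, using zero-phase
    (forward-backward) Butterworth filtering."""
    nyq = fs / 2.0
    b_lo, a_lo = butter(6, max_voiced_freq / nyq, btype="low")
    b_hi, a_hi = butter(6, max_voiced_freq / nyq, btype="high")
    harmonic = filtfilt(b_lo, a_lo, x)
    noise = filtfilt(b_hi, a_hi, x)
    return harmonic, noise
```

On 16 kHz recordings such as those described later, this keeps the strongly harmonic lower half of the spectrum in one stream and the noise-like upper half in the other.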
HNM has a reduced database and provides a direct technique for smoothing discontinuities of acoustic units around concatenation points 53,54 , produced as a result of different distributions of the system phase around the points of concatenation. This incoherence generates noise between the harmonic peaks, destroying the harmonic structure of a periodic sound and hence degrading the voice quality. Analysis shows that all vowels and syllables can be produced with better quality by the implementation of HNM 20 . Moreover, HNM is a pitch-synchronous system 55 , unlike TD-PSOLA and other concatenative approaches, hence it eliminates the problem of synchronization of speech frames and shows the capability of providing high-quality prosodic modifications without buzziness when compared to other methods 56 .
The HNM framework is also used in low bit rate speech coders to increase naturalness as well as intelligibility 57 . The harmonics plus noise model has also been used for the development of high-quality vocoders applicable in statistical frameworks, particularly in modern speech synthesizers 58,59 . Speaker transformation and voice conversion techniques can also be implemented using the HNM system 20,46,47 . Since HNM appears to be more promising than all existing models, in the present research this technique has been employed to determine the effect of the noise and voiced parts separately on the speech signal.

Recording and Segmentation
For data collection, six speakers (3 males and 3 females) in the age group of 18-25 years were selected for recording in the Hindi language. It is desirable that the speakers belong to the same group in terms of language and education. The speakers participating in our experiments were university students who had Hindi as their first language.
The script for recording, as shown in Table 2, consists of 35 phonemes (consonants), with the first and last phoneme (V) being /a:/ for all the VCV combinations. Speech was recorded in an acoustically treated room using a Sony ICD-PX820 audio recorder with 16 kHz sampling and 16-bit quantization. The recorded speech was manually segmented and labeled into VCVs; those considered to be correctly articulated by all the speakers were selected for use as the speech utterances for the experiment. The duration of each utterance was 5−9 s. Figure 1 shows the basic scheme used for HNM-based modification of speech. The parameters of the speech signal are obtained from HNM analysis and modified for the spectral modification. The synthesis axis is prepared according to the given pitch and time scaling factors. HNM parameters are estimated at the instants on the synthesis axis using interpolation. The modified parameters on the synthesis axis are used for synthesizing the speech output. In the present experiment, in order to inspect the effect of the noise percentage on speech quality, analysis-synthesis of the speech using HNM has been performed. Keeping the voice part fixed, the speech was synthesized by varying the percentage of the noise part.
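Once the harmonic ("voice") and noise streams of an utterance are available, the experimental conditions can be sketched as a simple linear remix. The function name and the linear amplitude scaling here are our assumptions about the setup, not the authors' code:

```python
import numpy as np

def remix(harmonic, noise, voice_pct, noise_pct):
    """Resynthesize speech with the harmonic and noise parts scaled to
    the given percentages of their original amplitudes."""
    return (voice_pct / 100.0) * np.asarray(harmonic, dtype=float) \
         + (noise_pct / 100.0) * np.asarray(noise, dtype=float)

# Example sweep: voice part fixed at 50%, noise part varied from 0% to 100%:
# conditions = {f"v50n{p}": remix(h, n, 50, p) for p in range(0, 101, 10)}
```

Each resulting condition (e.g. v50n40) is then scored against the original recording, as described in the evaluation section.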

Evaluation techniques
The methods used for the evaluation of synthesized speech can be broadly classified as subjective and objective. Subjective methods demand listening tests by means of human subjects. The results of subjective evaluation may be affected by the test conditions, and hence these have to be standardized and consistently followed. The subjects should be adequately familiarized with the reference quality before the test. The subjective tests may be grouped into three categories: intelligibility, quality, and identity. The quality of the phrases is generally evaluated by the mean opinion score (MOS), degradation category rating (DCR), and preference tests.
In the MOS test, or absolute category rating test, the subject rates the quality of the speech stimuli on a 1-5 scale (1: bad, 2: poor, 3: fair, 4: good, 5: excellent). The stimuli are presented in randomized order, with three to five presentations of each stimulus. The average score calculated across stimuli and subjects is known as the mean opinion score (MOS) 63 . The test gives an assessment based on all the parameters affecting quality. It is easy to conduct and does not need trained listeners, but its sensitivity for high quality speech is low. The objective methods ensure consistency in the evaluation and can be performed by means of computations 60 . PESQ, one of the methods for objective evaluation, evaluates one-way speech quality. The signal to be evaluated is introduced into the system under test, and the synthesized output signal is matched with the input (reference) signal 61 . PESQ has been incorporated as the ITU-T P.862 recommendation 62 . PESQ is a narrow-band (3.2 kHz) speech quality assessment method and has the capability of providing an admirable quality estimate in a range of conditions, including background noise, analogue filtering, and variable delay.
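The MOS computation itself is just an average over subjects and stimuli; the ratings below are invented purely to illustrate the arithmetic:

```python
import numpy as np

# Hypothetical ratings on the 1-5 scale: rows = subjects, columns = stimuli
ratings = np.array([
    [4, 5, 3, 4],
    [5, 4, 4, 4],
    [4, 4, 3, 5],
])

per_stimulus_mos = ratings.mean(axis=0)  # average across subjects
overall_mos = ratings.mean()             # average across stimuli and subjects
```

Per-stimulus averages reveal which conditions (e.g. a given voice/noise combination) listeners rated poorly, while the overall mean is the single MOS figure reported.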
First, both the reference signal and the degraded signal are level aligned to a constant power level. The signals are then time aligned, assuming the delay introduced by the system under test to be piecewise constant. A perceptual transform eliminates those parts of the signal that are inaudible to the listener. Frame-by-frame delays are estimated by means of envelope- and fine-correlation histogram-based delay detection. PESQ assessment finds use in a wide range of applications. Because of its speed and repeatability, PESQ makes it feasible to execute extensive testing over a short period and also enables the quality of time-varying conditions to be observed 61 . PESQ offers specific and repeatable estimation of speech quality. Figure 2 shows plots for different proportions of the voice part (v10-v100, where e.g. v10 indicates a 10% voice part) for all six speakers. In each histogram the horizontal axis shows the percentage of noise, and the vertical axis shows the corresponding PESQ score. With a voice part of only 10%, in the case of a single female speaker, it was observed that the minimum quality (PESQ score of approximately 2) is obtained with zero noise part, whereas an adequate PESQ score (approximately 3), at which the speech quality is quite acceptable, is obtained when only 10% noise is added. As the proportion of the voice part is gradually increased, the point at which the maximum (highest quality) occurs requires a greater percentage (40%) of noise for optimum speech quality; also, the extent to which the speech quality showed steep degradation after the maximum (in the case of v10 and v20) is considerably reduced.

Results and Discussion
Similar results were obtained in the case of males, with only one significant change: the minimum quality is obtained at a PESQ score of approximately 2.3, a little higher than that found for females. A pictorial representation of the experimental results is given in the histograms of Figure 2, which depict the consequence of the change in voice and noise percentage on the quality of the synthesized speech of all six speakers (3 males and 3 females) taken together. It may be observed that for a constant voice part, as the percentage of noise is increased, the PESQ score shows a substantial increase until it attains a peak value, after which there is a gradual decrease in the quality of the synthesized speech. As the percentage of the voice part is increased, the PESQ score at which the minimum is obtained remains approximately the same, but the maximum speech quality requires a greater percentage of the noise part. However, as soon as this maximum is attained, any further increase in the noise proportion does not appear to affect the speech quality, i.e. the quality becomes quite stable even if the noise percentage is increased. Thus it may be concluded that the noise part plays an important role in the quality of synthesized speech. With no noise part added, the speech quality is quite poor. Also, the percentage of the noise part to be added for optimum voice quality depends strictly on the voice part. Figure 3 and Figure 4 show the results obtained after quality assessment of the HNM synthesized speech using MOS. The purpose of this experiment is to evaluate the quality and identity of the HNM synthesized speech using subjective listening tests and to verify the results of the objective tests (as obtained by PESQ evaluation) for male and female speakers. The set of 35 VCV utterances listed in Table 2 was used as test material for one male and one female speaker in this experiment.
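The peak-then-plateau behaviour described above amounts to locating the noise percentage at which the PESQ curve attains its maximum. The score values below are invented, shaped like the reported fixed-voice curve, purely to illustrate the computation:

```python
import numpy as np

noise_pcts = np.arange(0, 101, 10)
# Invented PESQ scores for a fixed 50% voice part (illustrative only):
# poor with no noise, peaking near 40% noise, then a stable plateau
pesq_scores = np.array([2.0, 2.6, 3.0, 3.3, 3.6, 3.55, 3.5,
                        3.5, 3.5, 3.5, 3.5])
optimal_noise = int(noise_pcts[np.argmax(pesq_scores)])
```

With a curve of this shape, `np.argmax` returns the first index of the maximum, so the plateau beyond the peak does not shift the selected optimum.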
Subjective evaluation of the quality and intelligibility of the synthesized speech was carried out using the MOS test. For each presentation, the subject could listen to the sounds one after the other in a sequence, more than once, before finalizing the response and proceeding to the next presentation. The average MOS scores for the 35 VCV utterances at different voice and noise levels for the male and female speaker are listed in Table 3 and Table 4 respectively. From the objective evaluation it was found that a voice level of 50% and a noise level of 40% assure good quality for HNM synthesized speech. Above a voice level of 50% the quality does not degrade even if the noise percentage is further increased. The mean score for the male speaker at a voice level of 50% and a noise level of 40% is 4.7. Mean scores for the male speaker's utterances at increasing voice levels clearly show that the quality increases gradually with the percentage of the voice and noise parts, and at a particular level of voice (50%) and noise (40%) the quality is maximum. The plots of quality as obtained by the MOS scores of six listeners are presented in Figure 3 and Figure 4 for the male and female speaker respectively at different voice and noise percentages (e.g. v35n100 denotes a 35% voice part and a 100% noise part). In the case of the male speaker it can be seen that at v50n40 the quality is maximum. Similarly, in the case of the female speaker the MOS score at v50n40 is 4.6, and hence again appreciable quality has been obtained at this percentage of voice and noise parts. Table 5 shows the approximate percentage of noise required to perceive different Hindi phonemes for the six speakers (sp1-sp6). It has also been observed that at 10% noise, the quality of the phonemes pə, ɾə, pʰə, bə, jə, lə, t̪ ə, ʋə, mə shows the least speaker dependency, while the phonemes d̪ ə, tʃʰə, dʒʰə, ʃə, sə, ɖə, tra, dʒə, nə, ɳə, d̪ ʰə, ʈə, sə, ɳə, ʃə, gʰə, bʰə, ɖʰə, d̪ ʰə, ʈə show speaker dependency.
At 20% the phonemes ŋə, dʒə, ʃə, ʃə, ksʰ, kʰə, t̪ ʰə, d̪ ə are speaker independent, but the phonemes gʰə, dʒʰə, d̪ ʰə, jna, ɳə, ɾə, jə, ʈʰə, tʃʰə, t̪ ə, pʰə, lə, bə, ʋə, ɦə, ɖʰə, nə, ɦə, ʈə, sə, ɖə, jna, sʰrə, gə, ŋə, ʈə show speaker dependency.

Conclusion
Investigations were carried out to find the minimum percentage of the periodic and aperiodic portions of speech required for clear perception of different Hindi phonemes. HNM has been employed for the analysis and synthesis of speech, while the PESQ method is used for objective evaluation and MOS for subjective evaluation of the speech quality. Investigations carried out by varying the voice and noise parts of the speech signal show that quality and intelligibility are related to the relative percentages of the noise and voice parts. The quality of the synthesized speech without any added noise part gives almost the same PESQ score independent of the percentage of the voice part. With a 50% voice part, the noise percentage required for acceptable speech quality is found to be around 40%. As the voice percentage is increased beyond 50%, the speech quality shows no degradation even if the noise percentage is further increased. Results obtained from MOS also verify the same. These values of noise proportions are approximately similar for male and female speakers.