Weighted Mel frequency cepstral coefficient based feature extraction for automatic assessment of stuttered speech using Bi-directional LSTM

Objective: To propose a system for automatic assessment of stuttered speech that helps Speech-Language Pathologists during their treatment of a person who stutters. Methods: A novel technique is proposed for automatic assessment of stuttered speech, composed of feature extraction based on the Weighted Mel Frequency Cepstral Coefficient (WMFCC) and classification using a Bi-directional Long Short-Term Memory (Bi-LSTM) neural network. It focuses on detecting prolongation and syllable, word, and phrase repetition in stuttered events. Findings: This study presents a comparative analysis of the WMFCC feature extraction method against widely used extensions of MFCC, namely the Delta and Delta-Delta cepstrum. The speech parameterization techniques are compared based on the effect of different frame lengths, percentages of window overlapping, and pre-emphasis filter alpha values. The experimental investigation shows that WMFCC outperforms the other feature extraction methods, providing an average recognition accuracy of 96.67%. The 14-dimensional WMFCC also achieves a lower computational overhead than the conventional 42-dimensional MFCC with Delta and Delta-Delta cepstrum. Application: The integration of WMFCC-based speech feature extraction and deep-learning Bi-LSTM-based classification proposed in this study yields an efficient model for automatically classifying stuttered events such as prolongation and repetition.


Introduction
For communication between human beings, speech is the most habitual and widely used verbal means to express feelings, ideas, and thoughts. Not all human beings are blessed with normal means of speech. The power of speech in sharing information during interaction depends on fluency (1). Speech is fluent when the continuity between semantic units, rhythm, speed, and the effort contributing to its flow are natural. Dysfluency is characterized as any form of disruption of fluency. The most complex form of dysfluency is stuttering. In stuttering, pauses and blocks disrupt continuity and rhythm, the speaking rate is much slower, and the effort required is greater than normal.
There may be three kinds of disorders in people who stutter (PWS): repetition of a syllable, word, or phrase; prolongation; and silent blocks at the start of a vocalization or expression or within the middle of a word. Stuttering influences individuals of all ages, cultures, and races, irrespective of their intelligence and financial status (2). Many studies have reported that stuttering affects approximately 1% of the world population and is more common in males than in females (3). The area is therefore a knowledge base for distinct domains such as speech pathology, physiology, psychology, acoustics, and signal analysis.
Speech-Language Pathologists (SLPs) diagnose the person who stutters and assess fluency to determine the stutterer's response during the treatment phase. SLPs previously determined the severity of stuttering manually through their experience: they counted the stuttered events and divided their frequency by the total number of spoken words. Such stuttering assessments are subjective, inconsistent, lengthy, and error-prone. Therefore, SLPs have paid considerable attention to objective assessment methods for identifying stuttered events over the past few decades (4).

Literature Survey
The survey presents a detailed comparative analysis of various feature extraction and classification techniques based on the dataset used, type of disfluency, and accuracy (5-22). The previously published work illustrates the significance of feature extraction and classification techniques in identifying stuttered events. This paper focuses on the implementation and performance analysis of the feature extraction technique used in the proposed methodology. A wide variety of speech parameterization techniques are available for the recognition process, such as Perceptual Linear Prediction (PLP), Linear Prediction Coding (LPC), Linear Predictive Cepstral Coefficient (LPCC), and Mel Frequency Cepstral Coefficient (MFCC). In (12) and (19), the authors extracted PLP features to analyze stuttered speech samples. PLP feature vectors depend on formant amplitudes while maintaining the overall spectral balance, and are sensitive to noise and the communication channel (23). In (24), the authors introduced a stuttered speech recognition system based on LPC features. LPC assumes that speech is stationary and is therefore ineffective in representing and analyzing speech accurately (25). In (8) and (10), the authors analyzed the performance of LPCC features for assessing stuttered speech. LPCC performs poorly under high quantization noise and uses a linear frequency scale that is not adequate for speech processing (23).
From Table 1, it can be observed that MFCC is a highly employed feature extraction technique. However, these features capture only the static information of speech signals. Based on the above considerations, this paper introduces a more efficient extension of MFCC, known as Weighted MFCC (WMFCC), for feature extraction from stuttered speech samples. WMFCC incorporates the dynamic information of the speech samples, which increases the detection accuracy of stuttered events and reduces the computational overhead passed to the classification stage.
The proposed work introduces a low-dimensional, dynamic feature extraction method, WMFCC, and a deep-learning classification technique, Bi-directional Long Short-Term Memory (Bi-LSTM), for the automatic evaluation and diagnosis of four forms of disfluency: prolongation and syllable, word, and phrase repetition. The efficiency of WMFCC is determined by comparing the classification accuracy of stuttered events for four feature extraction methods: MFCC, Delta cepstrum, Delta-Delta cepstrum, and WMFCC. In (26), the authors discussed the implementation and analysis of the classification technique employed in this study.
The paper is structured as follows. Section 2 elaborates the framework of the proposed system. Experimental results and a comparative analysis of the feature extraction techniques are presented in Section 3. Section 4 concludes the paper.

Methodology
The proposed method for disfluency detection (Figure 1) is split into five phases: pre-processing of the speech signal; segmentation and labeling of the disfluent speech signal; splitting the labeled samples into training, validation, and test sets; feature extraction; and classification. The study conducts a comparative analysis of extensions of the MFCC feature extraction technique, namely Delta MFCC, Delta-delta MFCC, and Weighted MFCC. The University College London Archive of Stuttered Speech (UCLASS) database is utilized for analysis (27). The Bi-LSTM classifier evaluates the efficacy of the feature extraction techniques in classifying prolongation and repetition dysfluencies.

Speech Signal Pre-processing
A signal is pre-processed by removing its silence regions (28). During a silence region there is no excitation in the vocal tract, and hence no speech production. Pre-processing thus reduces the amount of processing and enhances the system's overall efficiency and accuracy. In this study, an integration of two widely known techniques, Short-Time Energy (STE) and Zero-Crossing Rate (ZCR) (Figure 2), is applied (29). It is a quick and straightforward approach and provides a good voiced/unvoiced/silence classification of speech (Figure 3).
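The STE/ZCR-based silence removal described above can be sketched as follows. The frame length and thresholds here are illustrative choices for the sketch, not values taken from the study:

```python
import numpy as np

def remove_silence(signal, frame_len=256, energy_thresh=0.01, zcr_thresh=0.5):
    """Drop frames classified as silence using Short-Time Energy (STE)
    and Zero-Crossing Rate (ZCR). Thresholds are illustrative."""
    kept = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        ste = np.mean(frame ** 2)                           # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # zero-crossing rate
        # Keep voiced frames (high STE) and unvoiced fricatives (high ZCR);
        # frames with both measures low are treated as silence.
        if ste > energy_thresh or zcr > zcr_thresh:
            kept.append(frame)
    return np.concatenate(kept) if kept else np.array([])
```

Silence (near-zero samples) has both low energy and a low crossing rate, so those frames are discarded while speech frames survive.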

Disfluent speech sample segmentation and labeling
The disfluent speech signals are obtained from the University College London Archive of Stuttered Speech (UCLASS) (27). The dataset used in this study comprises 20 stuttered speech samples from UCLASS Version 1, covering two female and 18 male speakers aged 7 years 8 months to 17 years 9 months. The speech signals were selected to cover a broad range of stuttering rates and ages. Only samples available with transcriptions are included in the dataset.
This paper investigates only four forms of disfluencies: prolongation and syllable, word, and phrase repetition. These are easily detectable in monosyllabic words. After pre-processing the selected speech samples, disfluent segments were identified and segmented manually by listening to the pre-processed signals. The segmented samples were labeled into five classes: Fluent, Prolongation, Syllable Repetition, Word Repetition, and Phrase Repetition (Figure 4).

Labeled samples splitting
The segmented and labeled disfluent speech samples were divided into three sets for training, validation, and testing. The training set is the subset of annotated stuttered speech samples used to train the classification model. The validation set, smaller than the training set, is used to optimize the model's performance by tuning the hyperparameter values. The test set determines the final accuracy of the model and helps in analyzing the performance of the proposed model. In this study, the datastore of disfluent speech samples is split into training, validation, and test sets in the ratio of 60%, 20%, and 20%, respectively.
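The 60/20/20 split can be sketched as below; the helper name and the fixed shuffling seed are illustrative, not taken from the study:

```python
import numpy as np

def split_indices(n_samples, seed=0):
    """Shuffle sample indices and split them 60% / 20% / 20%
    into train / validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(0.6 * n_samples)
    n_val = int(0.2 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

The returned index arrays are disjoint and together cover every sample, so each labeled segment lands in exactly one set.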

Speech feature extraction
The extraction of speech features is a form of dimensionality reduction applied to minimize the enormous amount of data an algorithm must process. The key objective of feature extraction is to decompose the speech signal into acoustically recognizable elements and to obtain compact feature vectors with minimal loss of information, keeping the processing efficient. The proposed work applies the frequency-domain MFCC and its variants for assessing speech disfluencies (Figure 11).

Mel Frequency Cepstrum Coefficients (MFCC)
MFCC (30) is among the most prominent feature extraction techniques for speech recognition. It operates in the frequency domain using the Mel scale, which is modeled on the frequency resolution of the human ear. The resulting coefficients are stable and robust to speaker-dependent variations and recording conditions. MFCCs are commonly derived using the steps described below (Figure 5) (30).

(i) Pre-emphasis
The first stage pre-emphasizes the signal spectrum by boosting the high frequencies (Figure 6). A low-order digital filter is employed to spectrally flatten the signal, making later signal-processing stages less susceptible to finite-precision effects (28). Generally, a first-order FIR filter is used, represented as Eq. (1):

H(z) = 1 - αz^(-1)  (1)

The standard value of α lies in the range 0.91-0.99.
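The first-order pre-emphasis filter, y[n] = x[n] - α·x[n-1], can be sketched as:

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order FIR high-pass filter: y[n] = x[n] - alpha * x[n-1].
    The first sample is passed through unchanged."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```

Because the filter subtracts a scaled copy of the previous sample, slowly varying (low-frequency) components cancel while fast (high-frequency) components are boosted.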
Framing: The speech signal is split into short-duration blocks, called frames, so that spectral analysis can be performed on each. The frame length is the duration of each frame in milliseconds, while frame overlapping is the number of milliseconds shared between two successive frames (28) (Figure 7).
Spectral Estimation: The Discrete Fourier Transform (DFT) extracts spectral coefficients for discrete frequency bands of a discrete-time signal. The DFT is computed with the Fast Fourier Transform (FFT) algorithm, and only the magnitude of the spectral coefficients is retained. The DFT can be defined as Eq. (3):

X[k] = Σ_{n=0}^{N-1} x[n] e^(-j2πkn/N),  0 ≤ k ≤ N-1  (3)

where X[k] are the spectral coefficients and x[n] is the framed signal of length N.
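The framing and FFT-based spectral estimation steps can be sketched as below. The 16 kHz sampling rate, 512-point FFT, and Hamming window are common assumptions for the sketch; the 30 ms frame length and 75% overlap defaults mirror the optimal values reported later in the paper:

```python
import numpy as np

def frame_signal(signal, fs, frame_ms=30, overlap=0.75):
    """Split a signal into overlapping frames (e.g. 30 ms frames, 75% overlap)."""
    frame_len = int(fs * frame_ms / 1000)
    hop = max(1, int(frame_len * (1 - overlap)))      # step between frame starts
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])

def magnitude_spectrum(frames, n_fft=512):
    """Magnitude of the DFT of each Hamming-windowed frame, computed via the FFT."""
    windowed = frames * np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(windowed, n_fft))
```

`rfft` returns only the non-redundant half of the spectrum (n_fft/2 + 1 bins), which is all the filter bank stage needs.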

(iii) Mel Frequency Filter Bank
The frequencies output by the DFT are warped onto the Mel scale. A bank of 20 triangular Mel-frequency filters is constructed to capture the energy in each frequency band. The bank of filters (Figure 9) consists of ten filters spaced linearly below 1000 Hz and the remaining filters spaced logarithmically above 1000 Hz. Eq. (4) converts a linear-scale frequency f to a Mel-scale frequency:

f_mel = 2595 log10(1 + f/700)  (4)
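A common construction of such a filter bank is sketched below; for simplicity it spaces the 20 centre frequencies uniformly on the Mel scale of Eq. (4), a slight simplification of the 10-linear/10-logarithmic layout described above. The FFT size and sampling rate are assumptions:

```python
import numpy as np

def hz_to_mel(f):
    """Linear-to-Mel frequency conversion (Eq. 4)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of Eq. (4)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=20, n_fft=512, fs=16000):
    """Triangular filters with centre frequencies equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):       # rising edge of the triangle
            bank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):      # falling edge of the triangle
            bank[i - 1, k] = (right - k) / max(right - centre, 1)
    return bank
```

Multiplying a frame's magnitude spectrum by this matrix (transposed) yields the 20 per-band filter energies used in the next stage.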
Each triangular filter in the filter bank satisfies Eq. (6).

(v) Discrete Cosine Transform (DCT)
The filters computed above are overlapping; thus, the filter bank energies are strongly correlated. Hence, the DCT of the log filter bank energies is computed to decorrelate them. Only the first 14 coefficients, called the Mel Frequency Cepstral Coefficients, are kept for each frame. The DCT can be defined as Eq. (8):

c[n] = Σ_{m=1}^{M} log(E_m) cos(πn(m - 0.5)/M),  n = 1, 2, …, K  (8)

where E_m is the energy of the m-th filter, M is the number of filters, and K is chosen as 14. This stage outputs a matrix with rows equal to the number of frames and columns equal to 14.
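The log-energy DCT step can be sketched as below, assuming the DCT-II-style formulation of Eq. (8); the small epsilon guarding the logarithm is an implementation convenience:

```python
import numpy as np

def mfcc_from_energies(filter_energies, n_coeffs=14):
    """DCT of the log filter bank energies (frames x filters),
    keeping the first n_coeffs cepstral coefficients per frame."""
    log_e = np.log(filter_energies + 1e-10)   # avoid log(0) on empty bands
    n_filters = log_e.shape[1]
    m = np.arange(n_filters)
    # One cepstral coefficient per DCT basis vector, truncated to n_coeffs.
    dct = np.array([np.sum(log_e * np.cos(np.pi * n * (2 * m + 1) / (2 * n_filters)),
                           axis=1)
                    for n in range(n_coeffs)]).T
    return dct
```

Truncating to 14 coefficients keeps the smooth spectral-envelope information while discarding fast cepstral ripple, which is what makes the 14-dimensional MFCC compact.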

Delta and Delta-Delta cepstrum coefficients
The features provided by MFCC are static. The dynamic coefficients, delta and delta-delta, are appended to the MFCC to capture dynamic information about the speech signal. These features improve recognition accuracy because they account for temporal variability in the feature vectors. The first-order derivative of MFCC gives the delta coefficients, and the second-order derivative gives the delta-delta coefficients (30). The delta coefficients are given by Eq. (9):

Δc_t = Σ_{m=1}^{M} m (c_{t+m} - c_{t-m}) / (2 Σ_{m=1}^{M} m²)  (9)

where c and Δc represent the static and dynamic coefficients, respectively, M corresponds to the number of surrounding frames, and c_t represents the MFCC feature vector at frame t. Delta-delta coefficients are computed analogously from the delta coefficients. The obtained features are appended to the original feature vectors, resulting in a 28-dimensional Delta MFCC and a 42-dimensional Delta-delta MFCC feature vector for each frame.
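The delta regression of Eq. (9) can be sketched as below; edge frames are handled by repeating the first and last feature vectors, a common convention assumed here:

```python
import numpy as np

def delta(features, M=2):
    """Delta coefficients over M surrounding frames (Eq. 9):
    delta_t = sum_{m=1..M} m * (c_{t+m} - c_{t-m}) / (2 * sum_{m=1..M} m^2)."""
    padded = np.pad(features, ((M, M), (0, 0)), mode='edge')  # repeat edge frames
    denom = 2 * sum(m * m for m in range(1, M + 1))
    return np.array([
        sum(m * (padded[t + M + m] - padded[t + M - m]) for m in range(1, M + 1)) / denom
        for t in range(features.shape[0])
    ])
```

Applying `delta` twice yields the delta-delta (acceleration) coefficients, matching the "computed analogously" description above.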

Weighted MFCC
The overall disfluency recognition rate improves when delta and delta-delta features are employed. However, this leads to higher computational overhead because the feature vector dimension increases. WMFCC retains the benefits of dynamic features while reducing the feature vector dimension (31). WMFCC is described by Eq. (10):

wc(n) = c(n) + p·Δc(n) + q·ΔΔc(n)  (10)

where p and q are the weights of the Delta and Delta-delta coefficients, respectively, and wc(n) is the 14-dimensional WMFCC feature vector. The resultant vector is a fusion of MFCC and its derivatives, and thus contains both the static and dynamic information of the signal. Moreover, because the feature vector remains of size 14, it incurs less computational overhead (Figure 10).
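The WMFCC fusion can be sketched as below. To keep the sketch self-contained, a simple gradient-based delta stands in for the regression formula, and the default weights p = 1/3 and q = 1/6 follow the optimal pair reported in the results section:

```python
import numpy as np

def simple_delta(feats):
    """First-order temporal derivative, approximated with np.gradient."""
    return np.gradient(feats, axis=0)

def wmfcc(mfcc_feats, p=1/3, q=1/6):
    """Weighted MFCC: fuse static MFCC with its delta and delta-delta
    derivatives while keeping the original feature dimension."""
    d = simple_delta(mfcc_feats)     # dynamic (delta) information
    dd = simple_delta(d)             # acceleration (delta-delta) information
    return mfcc_feats + p * d + q * dd
```

The weighted sum keeps the output the same shape as the input, so the classifier still sees 14-dimensional vectors rather than the 42-dimensional concatenation.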

Stuttered speech samples classification
This study applies a deep-learning technique known as Bi-directional LSTM for the classification of stuttered speech samples. The feature vectors extracted in the previous phase are input to the classifier. The classifier is trained and validated with 60% and 20% of the segmented stuttered speech samples, respectively; the rest were used for testing the model. The proposed classification model achieves a classification accuracy of 96.67% and performs better than other models (26).
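To illustrate the bi-directional recurrence itself (not the study's trained model), a minimal NumPy forward pass of a Bi-LSTM layer is sketched below; all weight shapes and sizes are illustrative:

```python
import numpy as np

def lstm_forward(x, W, U, b, h0, c0):
    """Single LSTM pass over a sequence x of shape (T, d_in).
    W: (4H, d_in), U: (4H, H), b: (4H,). Gate order: input, forget, cell, output."""
    H = h0.shape[0]
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h, c, outputs = h0, c0, []
    for t in range(x.shape[0]):
        z = W @ x[t] + U @ h + b
        i, f = sigmoid(z[:H]), sigmoid(z[H:2 * H])
        g, o = np.tanh(z[2 * H:3 * H]), sigmoid(z[3 * H:])
        c = f * c + i * g          # update cell state
        h = o * np.tanh(c)         # update hidden state
        outputs.append(h)
    return np.stack(outputs)

def bilstm(x, params_fwd, params_bwd, H):
    """Bi-directional LSTM: one pass forward in time, one backward,
    hidden states concatenated per time step -> shape (T, 2H)."""
    h0, c0 = np.zeros(H), np.zeros(H)
    fwd = lstm_forward(x, *params_fwd, h0, c0)
    bwd = lstm_forward(x[::-1], *params_bwd, h0, c0)[::-1]
    return np.concatenate([fwd, bwd], axis=1)
```

The backward pass gives each frame access to future context, which is what lets the classifier use evidence on both sides of a prolongation or repetition.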

Experiments and Results
This section discusses the comparative analysis of the proposed WMFCC feature extraction against MFCC, Delta MFCC, and Delta-Delta MFCC, based on the Bi-LSTM classification results, compares it with some existing works, and determines the optimal parameter values required for an efficient feature extraction process. The performance of feature extraction methods depends on various parameters such as frame length, frame overlapping (Figure 7), and the pre-emphasis factor (Figure 6). Therefore, the classification results are discussed for different frame sizes, different pre-emphasis filter alpha values, and different percentages of frame overlapping. The experiments were performed with the parameter configuration listed in Table 2.

The first observational study determines the best frame length by fixing alpha at 0.97 and the percentage of overlapping at 50%. The frame length was varied from 10 ms to 50 ms, and the results are presented in Figure 12. A 30 ms frame length produced the best classification accuracy of 94.33% for WMFCC on the available stuttered data. MFCC, Delta-delta MFCC, and WMFCC provide their highest average accuracies of 81.67%, 91.67%, and 94.33%, respectively, for a 30 ms frame length, while Delta MFCC peaks at 86.67% for a 40 ms frame length. This experiment leads to two conclusions: WMFCC outperforms the other three feature extraction techniques with a classification accuracy of 94.33%, and a frame length of 30 ms gives the best recognition accuracy.

The second observational study analyses the effect of alpha for values between 0.91 and 0.99, with the frame length set to the best value determined in the first experiment and the percentage of frame overlapping at 50%. The average classification accuracy versus alpha is shown in Figure 13.
The experiment showed that MFCC, Delta MFCC, Delta-delta MFCC, and WMFCC produced their highest classification accuracies of 81.67%, 86.67%, 93.33%, and 95.67%, respectively, for an alpha value of 0.98. This implies that the optimal value of alpha for controlling the degree of pre-emphasis is 0.98, with WMFCC as the feature extraction technique.
The effect of the percentage of overlapping was analyzed in the third experiment by fixing the frame length and alpha to the best values found in the previous experiments. This study considered no overlap, 33.33%, 50%, and 75% overlap; the results are presented in Figure 14. A frame overlap of 75% yields the best recognition accuracy of speech disfluencies for WMFCC, with a value of 96.67%, compared to the other techniques. The highest average accuracies given by MFCC and Delta-delta MFCC are 81.67% and 93.33%, respectively, for 50% frame overlapping, while Delta MFCC and WMFCC achieve 88.33% and 96.67%, respectively, for 75% frame overlapping. The results show that WMFCC performed consistently better than the other features, and that features extracted with a higher percentage of overlapping provide optimal classification accuracy.
The observational studies above strongly suggest that the WMFCC feature extraction method is superior to the widely used MFCC technique for automatic assessment of stuttered speech. WMFCC combines both the delta and delta-delta cepstrum with the MFCC vectors according to the weights p and q of Eq. (10). Experiments were performed for various combinations of p and q to obtain an optimal pair; the computed results are presented in Table 3. The results show that p = 1/3 and q = 1/6 provide the highest recognition accuracy.

From Figures 12, 13, and 14, it can be deduced that WMFCC performs marginally better than Delta-delta MFCC and significantly outperforms Delta MFCC and MFCC across all three speech parameterization parameters, frame length, pre-emphasis filter alpha, and frame overlapping, with the Bi-LSTM classifier. Delta and Delta-delta MFCC give better accuracy than MFCC because both are dynamic coefficients that account for temporal variability. By taking both the Delta and acceleration coefficients with fewer cepstral coefficients, WMFCC maximizes the classification accuracy compared to MFCC and reduces the computational overhead passed to the classification stage. The optimal parameter values determined are a 30 ms frame with a frame overlap of 75% and an alpha of 0.98. The summarized analytical results are presented in Table 4.

Table 5 compares the classification accuracy of the proposed method with other existing feature extraction techniques, employed in (21), (17), (12), and (8). The proposed WMFCC feature extraction method with the deep-learning technique Bi-directional LSTM shows an average classification accuracy of 96.67%, while (21) applied a Gated Recurrent CNN for classification and MFCC for feature extraction and achieved an average accuracy of 94%.
In (17), MFCC, formant, and shimmer features were employed for speech parameterization and Dynamic Time Warping (DTW) for classification, yielding an accuracy of 94%. In (12), a comparative analysis of classifiers such as k-NN, LDA, and SVM was carried out for classifying repetition and prolongation dysfluencies, with MFCC, PLP, and LPC as the feature extraction techniques; SVM achieved the highest accuracy of 95%. In (8), speech parameterization was performed using the LPCC technique and classification by two classifiers, LDA and k-NN, with an average recognition accuracy of 88.05%. Thus, it can be concluded that the proposed work provides an efficient feature extraction technique that has a high success rate, is dynamic in nature, incurs less computational overhead, and integrates well with the deep-learning Bi-directional LSTM for the classification of stuttered events. However, a direct comparison cannot be made because of differences in language, classifier, and the type, size, and categorical distribution of the stuttered speech databases, as well as in the way the databases were segmented to gather stuttered speech samples.

Conclusion
In this study, WMFCC speech parameters were extracted, and the Bi-directional LSTM classifier was used for automated assessment of stuttered speech. The speech parameterization technique was compared with MFCC, Delta MFCC, and Delta-delta MFCC, based on the recognition accuracy for four forms of disfluencies (prolongation and syllable, word, and phrase repetition), and with other existing models. The experimental results of this study show that WMFCC slightly outperforms Delta-delta MFCC and significantly outperforms Delta MFCC and MFCC in all configurations of frame length, alpha value, and frame overlap percentage. The optimally configured 14-dimensional WMFCC features achieve the highest accuracy of 96.67%, while the 14-dimensional MFCC features achieve 81.67%. WMFCC fuses MFCC features with the dynamic Delta and Delta-delta coefficients. Thus, WMFCC significantly increases the detection accuracy of stuttered events compared to existing methods and reduces the computational overhead passed to the classification stage. The optimal values of frame length, alpha, and percentage of frame overlapping observed in the performance analysis are 30 ms, 0.98, and 75%, respectively. The current study also showed that Bi-directional LSTM can be employed for disfluency classification. In future work, other feature extraction and classification techniques may be applied to improve the recognition accuracy of speech disfluencies.