System for Fusion of Face and Speech Modalities Using DTCWT+QFT and MFCC+RASTA Techniques

Objectives: The main objective is to propose a multimodal biometric system that fuses the face and speech modalities, using DTCWT+QFT techniques for face recognition and MFCC+RASTA techniques for speech recognition. The experimental results are compared with existing works and the performance is analysed against these counterparts. Methods: The proposed model uses the DTCWT and QFT techniques to extract features from face images and fuses the two feature sets. The MFCC and RASTA techniques are applied to extract features from speech data, and their outputs are likewise fused. Various databases for both the face and speech recognition systems are discussed and utilized. Findings: The experimental results are compared with existing systems, and the analysis shows that the proposed system is better placed. The fusion of the DTCWT and QFT techniques for face recognition is implemented, and results for performance parameters such as False Acceptance Rate (FAR), False Rejection Rate (FRR), Total Success Rate (TSR), Partial Error Rate (PER) and Equal Error Rate (EER) are tabulated for six different face data sets. The average performance is compared with four existing fusion techniques, showing that the proposed system performs better. The fusion of the MFCC and RASTA techniques for speech recognition is implemented, and performance is measured by calculating accuracy, precision, recall and F1-score. These results are compared with five different schemes, confirming that the proposed fusion of face and speech traits works better for human recognition. Novelty: The fusion of two algorithms for face recognition is implemented and the results analysed; then the fusion of two algorithms for speech recognition is implemented and analysed. A novel approach is presented that combines the face and speech recognition systems into a single system to improve security using multimodal biometrics.


Introduction
Advancements in biometric systems using various modalities make it possible to recognize a particular human from behavioural and physiological traits in a fast and efficient manner. The number of studies on recognition systems for the face and speech modalities has been growing every year. People comprehend and express different emotions constantly; this can be observed through elements such as facial muscle movements, voice, hand gestures, and so on (1). Speech recognition frameworks have been utilized in numerous areas. A speech or voice recognition system is a step-by-step process of distinguishing words or sentences by means of a machine, and it depends on accurate algorithms for classification and feature extraction. Widely used feature extraction strategies today include RASTA (Relative Spectral Filtering) and MFCC (Mel Frequency Cepstral Coefficients). MFCC is one of the best-known feature extraction techniques because of its accuracy, and several studies on voice recognition frameworks using MFCC for feature extraction have been conducted (2,3). The speech enhancement technique RASTA originally emerged with the objective of suppressing additive and unwanted disturbances in automatic voice recognition systems. RASTA not only eases the effect of noise in the voice signal but also upgrades the quality of voice recorded with background disturbances. RASTA is thus a technique of modulation-frequency band-pass filtering, applicable either in the log spectral domain or the cepstral domain, where the RASTA filter passes over every feature coefficient. The procedure of the RASTA speech processing technique is illustrated in the block diagram given in Figure 1. The more standardized MFCC mechanism is based on the human ear's known variation of frequency resolution with critical bandwidth.
MFCC is advantageous in reducing the frequency content of the input voice signal into a small set of coefficients; it is a fast, dependable and simple computation method. The primary goal of the Mel-Frequency cepstral coefficients is to imitate the human auditory system, which is why they are widely used for speech processing. The procedure of the MFCC feature extraction technique is illustrated in the block diagram given in Figure 2. The DTCWT can be carried out by applying two separable real 2D wavelet transforms in parallel. The first real wavelet transform can be implemented using the low-pass and high-pass filter coefficients H_0(k) and H_1(k) applied along the row and column dimensions of the 2D data, which forms the upper filter bank of the DTCWT.
The second real wavelet transform, which constitutes the lower filter bank of the DTCWT, can be developed using low-pass and high-pass filter coefficients G_0(k) and G_1(k) that, together with the upper filter bank, yield an approximately analytic complex wavelet, giving perfect reconstruction of the incoming image data. The QFT (Quick Fourier Transform) is a faster algorithm for computing the Discrete Fourier Transform, used in signal processing applications such as correlation analysis, linear filtering and spectrum analysis, whose direct computation would otherwise require considerable time. In the QFT, the data sequence is divided into smaller sequences until single-point sequences are obtained. Considering N = 2^s, such a decomposition can be computed in s = log₂N stages. Hence, the total count of complex multiplications is decreased to (N/2)·log₂N versus the N² complex multiplications of a straightforward DFT calculation. Similarly, the count of complex additions is decreased to N·log₂N compared with the N² − N complex additions of direct DFT calculation.
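The divide-and-conquer idea behind such fast transforms can be sketched in a few lines of Python. This is an illustrative radix-2 decimation-in-time implementation of the DFT, not the exact QFT variant referenced above: the sequence is split recursively into even- and odd-indexed halves until single-point sequences remain, giving log₂N stages of N/2 butterfly multiplications each.

```python
# Minimal pure-Python sketch of a radix-2 decimation-in-time fast DFT.
import cmath

def fft_radix2(x):
    """Recursive radix-2 DFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)            # a single-point sequence is its own DFT
    even = fft_radix2(x[0::2])    # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])     # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # one "butterfly": a single complex multiplication per output pair
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out
```

For N = 2^s points the recursion depth is s, so the multiplication count drops from N² (direct DFT) to (N/2)·log₂N, as stated above.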
A good amount of work has been carried out in the field of speech and face recognition systems by various researchers. For the detection of an active speaker, the work in paper (4) presents a procedure for effective fusion of correlated auditory and visual information. With the redundant data and noise produced during single-modal feature extraction, conventional learning algorithms find it hard to achieve ideal recognition performance. The authors in (5) propose a deep-learning-based multimodal fusion strategy for emotion recognition from voice expressions. During a person's social and day-to-day activities, voice, text and facial expressions are the primary channels for conveying human feelings. In the work (6), a fusion strategy for multi-modal emotion recognition based on voice, motion and text is proposed. An investigation of a robust strategy for multimodal emotion detection during a conversation is presented in (7), where three distinct models for the text, video and audio modalities are fine-tuned and organized on MELD.
The work in article (8) gives a careful assessment of the various studies conducted since 2006, when deep learning techniques first emerged as a new area of machine learning, for speech applications. The work in article (9) depicts an implementation of speech identification to pick and place an item using a robot arm; Mel-Frequency Cepstrum Coefficients (MFCC) are used for feature extraction of the speech signal, a Support Vector Machine (SVM) is used to learn the data set for speech recognition, and the algorithm is implemented in Python. The authors in article (10) presented a methodology for assessing the quality of grammar descriptions of the Tatar language and the degree of lexicon coverage.
The historical development of face recognition technologies, the present state-of-the-art techniques, and directions for the future are discussed in (11), with explicit focus on the latest data sets and on 2D and 3D face identification techniques; particular attention is given to deep learning methods, which dominate those fields. Face identification using the DTCWT has been applied effectively to the L-Spacek database (12). Pre-processing is performed on the face images to obtain a uniform size for all images, the Dual-Tree Complex Wavelet Transform is applied to the resized face images, and the resulting DTCWT features are taken as the final attributes; Euclidean distance is adopted for matching. The authors in (11) propose a CNN-based ASR in which the performance of two feature extraction techniques, namely Mel Frequency Cepstral Coefficients (MFCC) and Relative Spectral Transform - Perceptual Linear Prediction (RASTA-PLP), is compared on Bangla isolated words comprising digits and voice commands. This work also contributes to the literature on Bangla ASR in a few distinct ways: first, the effect of noise on Bangla voice commands as well as isolated words is probed in a CNN-based ASR; second, the performance of MFCC and RASTA-PLP is analysed in noisy environments using a CNN-based classifier.
An acoustic feature extraction approach based on a hybrid procedure combining Perceptual Wavelet Packet (PWP) and Mel Frequency Cepstral Coefficients (MFCC) is presented in (13). In (14), the wavelet transform is used as a denoising strategy in a voice identification framework that employs MFCC for feature extraction; the best-performing configuration uses the Minimax soft-thresholding selection rule at decomposition level 10. The authors in (15) worked on a structural modification of the power normalized cepstral coefficient (PNCC) framework to extract more relevant speech features, specifically at low signal-to-noise ratios (SNRs), without affecting performance on undistorted speech. The article (16) focuses on improving speech recognition by using hybrid features, namely MFCC and LPC. By applying the spectral subtraction strategy in the pre-processing stage, the authors effectively removed noise from the voice and extracted the source voice from noisy surroundings; feature extraction was performed precisely by the LPC and MFCC strategies, with each feature type involving echo-based phase variations. The automatic implementation of multiple modalities can be achieved using parallel algorithms (17) on multicore systems. The authors in (18) recommended a blind multimodal watermarking strategy for biometric verification frameworks; the approach fuses the face and fingerprint modalities through a combination of the DTCWT and DCT frequency-domain methods to improve the performance of biometric verification systems, and the conducted tests showed the ability to validate images with a considerably high precision level.
The authors in (19,20) presented an approach that uses face detection as a biometric technology to mark students' and employees' attendance in organizations. The researchers in (21) and (22) presented an overview of multimodal biometrics using the face and ear. The work in paper (23) proposed a hybrid methodology joining cascading and fusion in a multimodal biometric system using the face and fingerprint traits.

Face and Speech Datasets
The various available and newly created data sets used for face and speech recognition are discussed in this section.
Spacek Face database: This database was created by Libor Spacek (24).

As the data set for the speech recognition system, the LibriSpeech corpus (28) is used, which consists of 1000 hours of English speech sampled at 16 kHz, derived from audiobooks; the speakers recorded their sentences according to the audiobook texts. Because of its low error rate, the database is well suited for training and evaluating speech recognition systems. The LibriSpeech subset used in our research contains the voices of 16 speakers, 8 men and 8 women, with each speaker speaking 11 distinct sentences.

Methodology
In this section, the proposed methodology for the face and speech recognition system is discussed. The extracted QFT and DTCWT features are fused to generate the final face feature set, and the extracted MFCC and RASTA features are fused to produce the final speech feature set. The proposed model concentrates on enhancing the recognition rates for both the face and speech modalities. The diagram illustrating the proposed model is given in Figure 7. The face and speech databases mentioned in the previous section are used in the proposed model to extract the required images and voice inputs for pre-processing and performance analysis.
The various face databases contain faces of numerous dimensions; therefore, the images must be processed into uniformly sized images. Every image is resized to 2^p × 2^q, where p and q are integer variables; here the face images are resized to 128 × 512. The DTCWT and QFT algorithms are then applied to the resized images. Feature extraction with the two-dimensional DTCWT follows the steps given below. Firstly, the input image is decomposed using the two-dimensional DWT. Our proposed model applies a five-stage DTCWT on the face images, which gives sixteen sub-bands at every stage: four sub-bands of lower frequencies and twelve sub-bands of higher frequencies. In each stage the image size is halved in each dimension, so that after five stages the image is reduced to a 4 × 16 size.
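The per-stage decomposition and size halving can be illustrated with a minimal one-level 2D DWT sketch. For simplicity this uses the Haar filter pair rather than the actual DTCWT filter banks, and plain Python lists in place of image arrays; both functions and their names are illustrative, not from the paper:

```python
# One-level 2D DWT sketch using the Haar low-pass/high-pass pair.
import math

def haar_1d(v):
    """One level of the 1-D Haar transform: low-pass (averages) and
    high-pass (differences), each half the input length."""
    lo = [(v[i] + v[i + 1]) / math.sqrt(2) for i in range(0, len(v), 2)]
    hi = [(v[i] - v[i + 1]) / math.sqrt(2) for i in range(0, len(v), 2)]
    return lo, hi

def dwt2_one_level(img):
    """One DWT stage on a 2-D list: filter rows, then columns, producing
    the four sub-bands LL, LH, HL, HH, each half the size per dimension."""
    rows_lo, rows_hi = [], []
    for row in img:
        lo, hi = haar_1d(row)
        rows_lo.append(lo)
        rows_hi.append(hi)

    def cols(mat):
        transposed = list(map(list, zip(*mat)))
        lo_cols, hi_cols = zip(*(haar_1d(c) for c in transposed))
        return (list(map(list, zip(*lo_cols))),
                list(map(list, zip(*hi_cols))))

    LL, LH = cols(rows_lo)   # low/low and low/high sub-bands
    HL, HH = cols(rows_hi)   # high/low and high/high sub-bands
    return LL, LH, HL, HH
```

Applying such a stage five times to a 128 × 512 image halves each dimension per stage, reaching the 4 × 16 size described above.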
Secondly, each pair of corresponding sub-bands from the two trees that share the same pass levels is linearly combined by averaging and differencing. As a result, the sub-bands of the two-dimensional CWT at every stage are computed as (SP_x + SP_y)/√2, (SP_x − SP_y)/√2, (PS_x + PS_y)/√2, (PS_x − PS_y)/√2, (PP_x + PP_y)/√2 and (PP_x − PP_y)/√2. Thirdly, for recognition of the face features, the magnitudes of the real and imaginary levels are considered. The two-dimensional DWT at every decomposition yields three higher-frequency sub-bands, PS, SP and PP, which provide directional information. The DTCWT built from the two two-dimensional real wavelet transforms generates six complex wavelets providing directional information in various directions, obtained by taking the real and imaginary parts of every complex wavelet. The magnitudes of the set of six complex wavelets are computed by Equations 1 and 2, and the final magnitude coefficients are produced by the concatenation operation given in Equation 3.
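The averaging/differencing step and the magnitude computation can be sketched as follows. This is illustrative only: the sub-bands are simplified to flat 1-D lists, and the function name is hypothetical:

```python
# Sketch of combining matching sub-bands of the two real wavelet trees.
import math

def combine_subbands(sp_x, sp_y):
    """Average and difference two matching sub-bands (scaled by 1/sqrt(2)),
    treating the results as the real and imaginary parts of a complex
    wavelet, then take the per-coefficient magnitude."""
    plus = [(a + b) / math.sqrt(2) for a, b in zip(sp_x, sp_y)]
    minus = [(a - b) / math.sqrt(2) for a, b in zip(sp_x, sp_y)]
    magnitude = [math.hypot(re, im) for re, im in zip(plus, minus)]
    return plus, minus, magnitude
```

Running this once per sub-band pair (SP, PS, PP from each of the two trees) yields the six magnitude maps described above.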
In the above equation, F(p,q) represents a QFT coefficient and f(k,i) represents the input face image.
Resultant feature = Σ_{n=0}^{384} (FFT_n + DTCWT_n), where FFT_n and DTCWT_n are the coefficients of the Quick Fourier Transform and the DTC wavelets, respectively.
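The fusion expressed by this summation can be sketched directly; the function name is hypothetical and the coefficient vectors are assumed to be equal-length lists:

```python
# Element-wise fusion of the two face feature vectors by summation.
def fuse_features(fft_coeffs, dtcwt_coeffs):
    """Combine the QFT and DTCWT coefficient vectors term by term,
    following the resultant-feature summation above."""
    return [a + b for a, b in zip(fft_coeffs, dtcwt_coeffs)]
```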
In Equation 6, x_k represents the feature value for images from the database, and y_k represents the feature value for test images. The voice data from the speech database are pre-processed before the MFCC and RASTA techniques are applied for feature extraction.
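Assuming Equation 6 is the Euclidean distance between the stored features x_k and the test features y_k (the matching measure mentioned for the DTCWT-based system in the related work), the matching step might look like the sketch below; the helper `best_match` is hypothetical:

```python
# Sketch of Euclidean-distance matching between feature vectors.
import math

def euclidean_distance(db_features, test_features):
    """Distance between a stored feature vector (x_k) and a test
    feature vector (y_k); the smallest distance gives the best match."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(db_features, test_features)))

def best_match(database, test_features):
    """Return the identity whose stored features are closest to the test sample."""
    return min(database,
               key=lambda ident: euclidean_distance(database[ident], test_features))
```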
The Mel-frequency scale is approximately linear for frequencies below 1000 Hz and logarithmic for frequencies above 1000 Hz. Equation 7 provides the relationship between the Mel scale and frequency in Hz: Mel(f) = 2595 · log₁₀(1 + f/700).
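The commonly used Mel mapping, Mel(f) = 2595·log₁₀(1 + f/700), and its inverse can be computed as follows (function names are illustrative):

```python
# Conversion between linear frequency (Hz) and the Mel scale.
import math

def hz_to_mel(f_hz):
    """Standard Mel mapping: roughly linear below 1000 Hz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

By construction, 1000 Hz maps to roughly 1000 Mel, which is where the scale transitions from near-linear to logarithmic behaviour.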
The computation in Equation 8 provides X_i as the i-th MFCC coefficient, where Z_i is the Mel-frequency power spectrum, i = 1, 2, 3, …, N, N is the number of coefficients desired, and M is the number of filters. In the RASTA algorithm, a special band-pass filter is applied to each frequency sub-band to smooth out short-term noise variations and to remove any constant offset in the speech channel. As shown in Figure 7, the voice data is pre-processed and given to the RASTA mechanism for extraction of the relevant features in a separate channel. The fusion of MFCC and RASTA yields good matching results and in turn increases the recognition rates.
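As a sketch of the RASTA filtering step, the transfer function most commonly cited in the RASTA literature is H(z) = 0.1·(2 + z⁻¹ − z⁻³ − 2z⁻⁴)/(1 − 0.98z⁻¹); these coefficients are an assumption here, since the paper does not give its own. The filter is applied to each log-spectral (or cepstral) coefficient trajectory over time:

```python
# Sketch of RASTA band-pass filtering of one coefficient trajectory.
# Coefficients are the commonly cited RASTA filter, not from this paper.
def rasta_filter(trajectory):
    """Apply H(z) = 0.1*(2 + z^-1 - z^-3 - 2*z^-4) / (1 - 0.98*z^-1)
    causally to a time series of one spectral coefficient."""
    b = [0.2, 0.1, 0.0, -0.1, -0.2]   # numerator (band-pass FIR part)
    a = 0.98                          # feedback (slow-decay pole)
    out = []
    y_prev = 0.0
    for n in range(len(trajectory)):
        acc = sum(b[k] * trajectory[n - k] for k in range(len(b)) if n - k >= 0)
        y = acc + a * y_prev
        out.append(y)
        y_prev = y
    return out
```

Because the numerator coefficients sum to zero, a constant (DC) component of the trajectory decays to zero, which is exactly the "constant offset" removal described above.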

Results and Discussion
The experimentation with the face recognition system in the proposed model is done using MATLAB. Various sets of face images are drawn from the databases discussed in the Face and Speech Datasets section, namely the Spacek Face database, the Extended Yale Face Database B+, the Near Infrared Face Database and the ORL database, along with the Indian male and Indian female databases.
The performance-analysis parameters and the experimental results obtained using DTCWT, QFT, and the fusion of these techniques are analysed below. Numerous combinations of humans inside the database (HID) and humans outside the database (HOD) are used for each database to study the variations in the performance parameters.
The performance parameters used for evaluation, namely False Acceptance Rate (FAR), False Rejection Rate (FRR), Total Success Rate (TSR), Partial Error Rate (PER) and Equal Error Rate (EER), are defined as follows. Let A be the count of human faces accepted from the outside database and B be the total count of humans in the outside database; then FAR = A/B. Let C be the count of genuine humans rejected from the inside database and D be the total count of humans in the database; then FRR = C/D. Let X be the count of matched humans and Y be the total count of humans in the inside database; then TSR = X/Y. The EER is the error rate at which FRR and FAR are equal. These parameter values for DTCWT, QFT and the fusion of the two mechanisms are recorded in Table 1 for various combinations of HID and HOD values. Table 2 gives the results recorded for the performance comparison of the proposed fusion technique with the existing mechanisms presented by the researchers in (29), (30), (31) and (32), illustrated graphically in Figures 8 and 9. The recognition rate and half error rate of the proposed technique are compared with the TLPP, BSIF+TLPP (7x7) and FFT+DWT techniques presented in the works (30) and (29); the graphical analysis given in Figures 10 and 11 shows that the proposed technique performs better. The speech recognition system using the fusion of the MFCC and RASTA algorithms is evaluated against the LibriSpeech corpus, which consists of 1000 hours of English speech, with 16 speakers, both men and women, speaking distinct sentences. The results are compared with existing speech recognition methods. During the experimentation, the recognition rate is computed by taking into account the total count of speakers and the count of correct matches. The performance metrics accuracy, F1-score, recall and precision are utilized for the evaluation.
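The FAR, FRR and TSR definitions earlier in this section can be written directly in code. This sketch follows the letter definitions in the text (FAR = A/B, FRR = C/D, TSR = X/Y); the ratio forms are an assumption, since the equations themselves are not reproduced here:

```python
# Sketch of the face-recognition performance ratios from their counts.
def far(accepted_outside, total_outside):
    """FAR = A / B: fraction of out-of-database faces wrongly accepted."""
    return accepted_outside / total_outside

def frr(rejected_genuine, total_inside):
    """FRR = C / D: fraction of genuine in-database faces wrongly rejected."""
    return rejected_genuine / total_inside

def tsr(matched, total_inside):
    """TSR = X / Y: fraction of in-database faces correctly matched."""
    return matched / total_inside
```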
Here, Accuracy (ACC) is the count of data matched correctly out of the total data sets, Recall (RC) gives the proportion of speech checked positive and recognized, Precision (PR) gives the proportion of speech precisely recognized, and F1-score is computed from the recall and precision values. The parameters True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) used for computing these metrics are defined below.
PR = TP / (TP + FP) (15)

F1-Score = 2 · RC · PR / (RC + PR) (16)

Table 3 records the values of the performance parameters calculated for the proposed MFCC+RASTA technique and the comparative values for the existing techniques. The graphical comparison of the proposed technique with the existing techniques, shown in Figure 12, places the proposed technique ahead in terms of accuracy and precision. Hence, the results demonstrate that the fusion of MFCC and RASTA produces good recognition in speech detection.
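The four metrics follow directly from the confusion-matrix counts. The sketch below matches Equations 15 and 16; the accuracy and recall formulas used are the standard ones, assumed here since those equations are not shown:

```python
# Sketch of accuracy, precision, recall and F1 from confusion-matrix counts.
def classification_metrics(tp, fp, tn, fn):
    """Compute ACC, PR, RC and F1-score from True/False Positive/Negative
    counts, with PR and F1 matching Equations 15 and 16."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # standard definition (assumed)
    precision = tp / (tp + fp)                   # Equation 15
    recall = tp / (tp + fn)                      # standard definition (assumed)
    f1 = 2 * recall * precision / (recall + precision)  # Equation 16
    return accuracy, precision, recall, f1
```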

Discussion
The results of the experimentation are tabulated and compared with other existing research works in Tables 1, 2 and 3, and analysed using the graphical representations presented in Figure 8 through Figure 12. As mentioned in the explanation of the results above, the obtained results are compared with various related speech recognition strategies, namely RASTA+LPC+DWT (34), LPCC+MFCC (35), SVM (36), HMM (37) and RNN (38). The performance of the proposed fusion technique for face recognition is evaluated against various existing techniques proposed by different researchers, namely FFT+DWT (27), LBP+DWT+SOM (28), DWT+SVM (29) and RLM for Canny Edge (30). The results demonstrate that the proposed fusion techniques for both speech and face recognition perform better. The proposed model behaves differently across the data sets: the human face data available inside and outside the database produces variations in the recognition rates. In future work, the model will be enhanced to ensure higher recognition rates for all possible data sets, and a larger number of modalities will be tried for fusion.

Conclusion
In this work, both the face and voice modalities are discussed along with the data sets required for testing face and speech recognition systems. The model proposed in this paper comprises face and speech recognition systems: the face recognition system is implemented by extracting features from face images using the DTCWT and QFT techniques and then fusing the two, while the speech recognition system is implemented by extracting features from voice data using the MFCC and RASTA techniques and then fusing the two to obtain effective speech recognition results. The unimodal speech recognition accuracy is 98.67% and an overall recognition rate of 98.8% is achieved by the proposed model. The results for both modalities are checked against various techniques using the performance metrics, demonstrating that the proposed model works better. In future work, different biometric traits will be considered in order to develop a fusion system of more than two biometric modalities, so as to obtain the most advanced and secure human recognition systems.