Real-time video based emotion recognition using convolutional neural network and transfer learning

Background/Objectives: Deep learning approaches have paved the way for various artificial intelligence products, and the proposed system uses a convolutional neural network (CNN) for detecting human emotions in real time. The objective of this study is to develop a real-time application for emotion recognition using a convolutional neural network and transfer learning methods. Methods/Statistical analysis: The proposed system considers the happy, normal and surprised categories of emotion. The system consists of four major steps: dataset collection, training, validation, and real-time testing. The dataset comprises face images expressing the happy, normal and surprised emotions, captured as video frames. The face and mouth regions are detected using the Haar-based cascade classifier at 20 frames per second. Findings: The CNN is trained using mouth images, and the pre-trained models VGG16 and VGG19 are trained with face images. The trained model is used to detect emotions in live webcam video. The experimental results show that the CNN model trained on mouth images gives an accuracy of 85.71%, while the pre-trained models trained on face images using the transfer learning method achieve an accuracy of 77.78%. The proposed CNN system outperforms the pre-trained models for recognizing emotions in real-time video. Novelty/Applications: The proposed system is based entirely on mouth-region video frames, and a real-time emotion recognition system is developed. This work can detect the three emotions in an unconstrained laboratory environment.


Introduction
Emotion recognition plays an indispensable role in inspecting human feelings and internal thoughts more precisely. Emotions and mood help in identifying the state of the human mind quickly. According to psychologists, emotions are mainly short-lived, whereas a mood is milder than an emotion and longer lasting. Humans may interact with society through emotions in the absence of verbal communication (1). Among the non-verbal channels, emotions are a very effective way of exchanging internal thoughts with society. Human emotions can be detected in distinct ways, such as through verbal responses or voice tone, physical responses or body language, autonomic responses, etc. (1). The basic emotions of a person are happy, normal, surprised, fear, anger, disgust (dislike) and sadness. Emotions like happy, normal, surprised, disgust and fear are easy to identify, whereas expressions like amusement, pride, contempt, and shame are very hard to detect from facial expressions. There is a variety of applications for facial emotion recognition, such as student classroom behavior monitoring systems, airport/railway suspicious-person detection systems, expression detection for children with autism, facial-expression-based emotion chat applications, real-time pain monitoring systems, etc. (2,3).
In this work, the basic expressions happy, normal and surprised are considered in the mouth and face regions of a person to detect emotions efficiently. The main issues in constructing a facial emotion recognition system using deep learning are the need for a large dataset, emotion recognition under varying lighting conditions, and identifying human expressions in real time under different scenarios. Numerous architecture models from past ImageNet competitions are useful for redefining new models for each problem: when their layers are properly fine-tuned and frozen, they give better accuracy and minimize the workload of developing a completely new architecture for each problem, thereby reducing complexity by reusing pre-trained models instead of training from scratch.
Several approaches to emotion recognition have been reported in the literature (4)(5)(6)(7). Among facial-image-based emotion recognition techniques, a hybrid method was proposed combining CNN and Recurrent Neural Network (RNN) classifiers using Rectified Linear Unit (ReLU) activations, and it gives a better accuracy of 94.46% (6). In a combined approach for both facial expression recognition and gender classification (8), feature extraction is done using the Viola-Jones algorithm. The eye and nose regions are detected using Haar cascade classifiers, and lip corners are detected using the Sobel edge detection method, where 19 patches are extracted from the segments with the landmarks. These landmarks are trained with a Quadratic Discriminant Analyzer (QDA) and a Support Vector Machine (SVM) classifier, giving better accuracy than state-of-the-art methods. In a video-frame-based emotion recognition system (7), Local Binary Pattern (LBP) features are read using the Adaboost algorithm and then fed to Gaussian Mixture Models (GMM) for emotion classification; this model gives maximum accuracy with minimum time consumption. Electroencephalography (EEG) signals are significant features for classifying six emotional states, as they are well suited to clinical diagnosis. In an unsupervised learning method called the hypergraph-based emotion recognition method (9), acoustic features like MFCC (Mel Frequency Cepstral Coefficients) combined with epoch (glottal closure) features from speech signals are used, giving better accuracy than either feature set individually. A collective feature group comprising MFCC, spectral centroids and MFCC derivatives, along with bagged ensemble methods consisting of 20 support vectors, gives an accuracy of 75.69% (10).
A feature fusion vector is formed from DBN (Deep Belief Network) features and statistical features like Electro-Dermal Activity (EDA), Photoplethysmogram (PPG) and Zygomaticus Electromyography (zEMG), and is used for classifying emotions with a Fine Gaussian Support Vector Machine (FGSVM), giving 89.53% overall accuracy (11). A method was proposed combining MFCC with residual features to extract useful information from each emotion, and models are created for music emotion recognition using an Auto Associative Neural Network (AANN), SVM and Radial Basis Function Neural Network (RBFNN), where SVM shows the highest accuracy of 99.0% for the combined features compared with the other classifiers (12). In the transfer learning method, pre-trained models (AlexNet (13), VGG-S (14), VGG-M (14), and VGG-VD16 (15)) are used to extract low-level features using the "MatConvNet" toolkit to predict human activity in surveillance camera video. Mehmet Akif Ozdemir (2) identified seven emotions by training the LeNet architecture and obtained a validation accuracy of 91.81%. A CNN model is adopted by Denis Sokolov and Mikhail Patkin (3) for detecting emotions in real time on iPhone SE or later smartphones, where the CNN model is trained for 1 hour on the WeSee dataset and obtains an accuracy of 63.01% on test data. The eye region alone has been used for real-time emotion recognition instead of the entire face image (16). RGB images have also been generated from spectrogram arrays of speech signals with 8 to 4 kHz frequency, with AlexNet used as the pre-trained CNN model to detect the emotions with 79.7% average weighted accuracy (17).
https://www.indjst.org/

Proposed Methodology
The objective of this work is to detect the emotion of a person in real time using a CNN architecture and transfer learning of pre-trained models. The system consists of these main steps: mouth and face detection, training, validation, and real-time testing. The mouth emotion images are trained using the CNN architecture, and the face emotion images are trained using the pre-trained models. The models are built with the Keras library (18) on a TensorFlow backend (19). The emotions considered are happy, normal and surprised, as these are the most prevailing emotions expressed through the mouth region.

Mouth and Face Detection
OpenCV 4.1.0 is an open-source Python library for image and video processing, packaged with pre-trained Haar-feature-based cascade classifiers, which consist of XML files for detecting the face, eye and mouth regions in an image. This library was developed to teach a machine about objects existing in real life. The cascade classifier is an algorithm created by Paul Viola and Michael Jones (20) using machine learning techniques. These cascade classifiers are trained with samples containing face images and non-face images. In this work, the Haar-based cascade classifier is used for detecting the mouth and face in every video frame.

CNN
The CNN is a deep neural network (DNN) performing numerous operations in various layers: convolution, sub-sampling, flattening, and backpropagation-style learning in the dense layers. In the convolution layer, a convolutional filter is applied over the input matrix as a dot product operation. In the sub-sampling layer, a max-pooling operation reduces the size of the matrix while preserving the important information in the feature maps. The values are then flattened and passed through fully connected layers; the last layer is the output layer with Softmax activation, which classifies the inputs into their respective classes, following the supervised learning strategy. The CNN has several nodes/neurons that learn their weights and biases through continuous training, as shown in Figure 1. The images first undergo a simple pre-processing step in which the RGB images are converted to grayscale. We have used the Scikit library (21) for pre-processing the input frames. The frames are converted to NumPy arrays to feed the CNN: the 6720 training samples are converted to the shape (6720, 43, 72, 1) and the 1680 validation samples to the shape (1680, 43, 72, 1). The dataset is passed as input to the CNN model to classify and detect mouth emotion expressions; the model includes four convolution layers and four max-pooling layers, after which the feature maps are flattened. The flattened output is given as input to a varying number of fully connected dense layers. The input layer consists of the raw pixels of the 43x72 mouth image. The first convolution layer with 8 filters performs a dot product of its weights and the input image pixel values and reduces the dimension to 41x70 (8 feature maps).
The first max-pooling layer is applied along the spatial dimensions (height x width) to perform a downsampling operation, reducing the dimension to 20x35. In the second layer, 8 CNN filters of size (3x3) are applied, giving an output dimension of 18x33 (8 feature maps), followed by the second max-pooling layer of size (2x2), which gives 9x16. In the third layer, 16 CNN filters give an output of 8x15 (16 feature maps). The third max-pooling layer of size (2x2) gives 4x7. In layer 4, 16 CNN filters give 3x6. The fourth max-pooling layer (2x2) gives 1x3 (16 feature maps). The feature maps are then flattened in layer 5 to form 48 features. These 48 features are passed to the dense layers, which comprise a varying number of hidden neurons. Different models have been developed with varying numbers of dense layers and hidden units. The last layer is the output layer with the Softmax activation function, comprising three output neurons, which classifies the output into the normal, happy and surprised categories for the proposed mouth-based emotion recognition system.
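The layer-by-layer arithmetic above can be checked with a short sketch. The 3x3 kernels in the first two convolutions follow the text; the 2x2 kernels in convolutions 3 and 4 and the 48-unit hidden dense layer are assumptions chosen so that the reported feature-map sizes and the 4,731-parameter total both hold.

```python
def conv(h, w, k):   # 'valid' convolution, stride 1
    return h - k + 1, w - k + 1

def pool(h, w):      # 2x2 max-pooling, stride 2
    return h // 2, w // 2

shape = (43, 72)                               # raw mouth-image input
layers = [(3, 8), (3, 8), (2, 16), (2, 16)]    # (kernel size, filters), assumed for layers 3-4
for k, _ in layers:
    shape = pool(*conv(*shape, k))
print(shape)         # -> (1, 3): 16 maps of 1x3 = 48 flattened features

def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out        # weights + biases

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

total = (conv_params(3, 1, 8) + conv_params(3, 8, 8)
         + conv_params(2, 8, 16) + conv_params(2, 16, 16)
         + dense_params(48, 48) + dense_params(48, 3))   # hypothetical 48-unit hidden layer
print(total)         # -> 4731, matching the reported parameter count
```

Under these assumptions the stated sequence 41x70, 20x35, 18x33, 9x16, 8x15, 4x7, 3x6, 1x3 and the flattened 48 features all follow exactly.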

Transfer learning
The pre-trained CNN architectures are models that have already been trained on a subset of the ImageNet dataset during the ImageNet competitions and have learned weights on that larger dataset. The weights of these models can therefore be reused to learn the proposed emotion recognition problem. Applying the knowledge obtained in one domain to another domain is called 'transfer learning'. Transfer learning can be done using two methods: fine-tuning and freezing. In the fine-tuning method, the pre-trained model layers are tuned with varying filters, layers and hidden units to optimize their learning on the current problem, thereby increasing the accuracy in learning the newly defined problem. In the freezing method, the pre-trained model layer weights are frozen (locked), preventing those weights from changing during the current training. The pre-trained models reduce the burden of training CNN architectures from scratch. A pre-trained convolutional base consists of various layers such as convolution blocks, max-pooling layers, rectified linear activation units (ReLU), batch normalization layers, separable convolution layers, inception layers, etc. The blocks already trained on the ImageNet dataset are frozen to preserve/lock their weights from being trained on the proposed emotion recognition task. Figure 2 illustrates the transfer learning used in the proposed facial emotion recognition. The pre-trained Visual Geometry Group (VGG) models are publicly available ConvNet models for computer vision problems used in image and video recognition tasks; VGG16 and VGG19 are used in this work to recognize the emotions. VGG was developed by K. Simonyan and A. Zisserman (15) and was the runner-up model in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (22) conducted in 2014, classifying a subset of the ImageNet database into 1000 object classes and trained on NVIDIA Titan Black GPUs. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images. The architectures of VGG16 and VGG19 are described in (13).
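The freezing strategy can be sketched in Keras as follows, in a form close to "model 6" described later: all VGG16 layers up to the fc2 layer are frozen and only a new 3-class Softmax head is trained. `weights=None` keeps the sketch offline; in practice `weights="imagenet"` would load the pre-trained ImageNet weights.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

base = VGG16(weights=None, include_top=True)   # use weights="imagenet" in practice
for layer in base.layers:
    layer.trainable = False                    # freeze (lock) the pre-trained layers

features = base.get_layer("fc2").output        # 4096-d feature vector
outputs = Dense(3, activation="softmax")(features)  # happy / normal / surprised
model = Model(base.input, outputs)
# Trainable parameters: 4096*3 + 3 = 12,291, matching the count reported
# for the 3-neuron-output transfer-learning models.
```

The fine-tuning variant would instead leave the last block (or the replaced dense layers) trainable while freezing the earlier convolution blocks.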

Data sets
The dataset is collected using a web camera with a resolution of 1280 x 720 in the laboratory environment. A total of 8400 mouth images and 5400 face images are used from 20 subjects (10 male, 10 female). The dataset is divided into training and validation sets. For the CNN architecture, 6720 mouth images are used for training and 1680 for validation; for real-time testing, 2100 mouth images are used across the expressions. For the transfer learning methods, 4320 face images are used for training, 1080 for validation, and 2100 face images during real-time testing.
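The reported counts correspond to an 80/20 train/validation split (the ratio is an inference from the numbers, not stated explicitly), which can be reproduced as:

```python
# Compute train/validation sizes from a total image count, assuming the
# 80/20 split implied by the figures reported above.
def split_counts(total, train_pct=80):
    train = total * train_pct // 100   # integer arithmetic keeps counts exact
    return train, total - train

print(split_counts(8400))   # mouth images -> (6720, 1680)
print(split_counts(5400))   # face images  -> (4320, 1080)
```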

Training and Validation
The four-layer CNN models are trained with varying numbers of dense layers: two, three, four, five and six. The models are trained with various numbers of parameters, and their weights are adjusted for 50 epochs with patience=5. The model with two dense layers, trained with a total of 4,731 parameters, gives good classification results. The training time of this model is 8 min 7 s on a system with a 2.29 GHz CPU. In the simple four-layer CNN with the two-dense-layer model, the training loss and accuracy saturate at 19 epochs, yielding an accuracy of 99.05% on the training data and 96.96% on the validation data, as shown in Table 1. For the pre-trained models, varying numbers of layers are frozen and trained on the free cloud-based service Google Colab with GPUs (23). All the models are trained for 25 epochs and the best model is saved with patience=5. The loss/error and accuracy on the training and validation data for the different models are shown in Table 2, where model 6 and model 7 are obtained from VGG16, and model 8 and model 9 from VGG19, using transfer learning. In model 6, 15 layers are frozen and the model is trained with a 3-neuron output layer, giving 12,291 trainable parameters. In model 7, 14 layers are frozen, the dense-layer hidden neurons are changed from 4096 to 512 in the first dense layer and 128 in the second, feeding 3 output classes with a dropout of 50%, for 66,051 trainable parameters. In model 8, 18 layers are frozen, giving 12,291 trainable parameters. In model 9, 17 layers are frozen with the same 512/128 dense-layer configuration, 3 output classes and 50% dropout, for 66,051 trainable parameters.
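The "patience=5" rule used during training corresponds to early stopping: training halts once the validation loss has not improved for 5 consecutive epochs, keeping the best epoch seen so far. A minimal sketch of this logic (in Keras the equivalent is the EarlyStopping callback):

```python
# Find the epoch selected by early stopping with the given patience.
def early_stop_epoch(val_losses, patience=5):
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0   # new best: reset counter
        else:
            waited += 1
            if waited >= patience:                      # no improvement for `patience` epochs
                break
    return best_epoch, best

# Synthetic validation-loss curve: improvement stops after epoch 3.
losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60]
print(early_stop_epoch(losses))   # -> (3, 0.5)
```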

Real-time testing
The CNN models are tested on real-time video for varying numbers of dense layers. The CNN model trained with 2 dense layers performs well compared with the 3-, 4-, 5- and 6-dense-layer models. The two-dense-layer model, when tested on every frame, gives 71.88% accuracy. For each emotion, 700 mouth images are extracted from the live webcam and used to evaluate the performance of the system over 'n' consecutive images (n = 1, 3, 7 and 10).
Similarly, when the model is tested on every 3 frames and every 7 frames, it gives 75.71% and 80% respectively. The proposed CNN model achieves an accuracy of 85.71% for every 10 frames. Among the four models obtained using the transfer learning methods, model 7 and model 9 give the highest validation accuracies, and these models are tested on the real-time video to evaluate the performance of the system. For each emotion, 700 face images are extracted from the live webcam and used to evaluate the performance of the system over n consecutive images (n = 1, 3, 7 and 10). A snapshot of the real-time emotion recognition is shown.
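The paper evaluates predictions over n consecutive frames but does not spell out the aggregation rule; a majority vote over each window of n per-frame predictions is one plausible reading, and that assumption is what this sketch implements:

```python
from collections import Counter

def windowed_votes(frame_preds, n):
    """Collapse per-frame labels into one label per block of n frames
    by majority vote (an assumed aggregation rule)."""
    votes = []
    for i in range(0, len(frame_preds) - n + 1, n):
        window = frame_preds[i:i + n]
        votes.append(Counter(window).most_common(1)[0][0])
    return votes

preds = ["happy", "happy", "normal", "happy", "happy", "surprised"]
print(windowed_votes(preds, 3))   # -> ['happy', 'happy']
```

Larger n smooths out isolated per-frame misclassifications, which is consistent with the accuracy rising from 71.88% at n=1 to 85.71% at n=10.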
The performance of the classifier is evaluated using measures like precision, recall, F1-score, and accuracy. These measures for the real-time testing images are shown in Table 3 for every 10 consecutive testing images. In the real-time video evaluation, model 7 gives the maximum accuracy of 77.78% using the transfer learning method.

Performance analysis
The performance of the classifier is also evaluated using precision, recall, F1-score, and accuracy; these measures for the real-time testing images are shown in Table 4 for every 10 consecutive testing images. The confusion matrix is one of the most intuitive and easiest metrics for assessing the correctness and accuracy of a model. It is used for classification problems where the output can be one of two or more classes. In the confusion matrix (24), a true positive is the number of emotion frames correctly identified as their respective class, and a true negative is the number of emotion frames correctly identified as belonging to other classes. The confusion matrix (in %) for real-time emotion recognition using 10 consecutive face images is shown in Figure 3. For model 1, 95.20% of the happy test frames are correctly classified, along with 85.80% of the normal and 77.10% of the surprised frames. In total, 85.71% of the test frames are correctly classified and 14.29% are misclassified, as shown in Figure 3(a). Similarly, for model 7, 77.78% of the test frames are correctly classified and 22.22% are misclassified, as shown in Figure 3(b); for model 9, 53.49% are correctly classified and 46.51% are misclassified, as shown in Figure 3(c).
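The per-class metrics above follow directly from a confusion matrix. A plain-Python sketch (the matrix values here are illustrative, not the paper's Table 4):

```python
def metrics(cm, cls):
    """Precision, recall and F1 for class `cls` of a square confusion
    matrix with rows = actual class, columns = predicted class."""
    tp = cm[cls][cls]
    fp = sum(cm[r][cls] for r in range(len(cm))) - tp   # predicted cls, actually other
    fn = sum(cm[cls]) - tp                              # actually cls, predicted other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative 3-class matrix (happy, normal, surprised).
cm = [[90,  5,  5],
      [10, 80, 10],
      [ 5, 15, 80]]
accuracy = sum(cm[i][i] for i in range(3)) / sum(map(sum, cm))
print(accuracy)        # diagonal mass over total: 250/300
print(metrics(cm, 0))  # precision/recall/F1 for the 'happy' class
```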

Conclusion
This work has proposed a real-time emotion recognition system for recognizing the emotions normal, happy and surprised using mouth images with a CNN architecture and face images with pre-trained models. The face images are extracted in an unconstrained laboratory environment using a web camera. The experimental results show that the proposed CNN system recognizes the three emotions from mouth images with an accuracy of 85.71%, while transfer learning of pre-trained models using face images gives an accuracy of 77.78%. The proposed CNN system gives better performance than the pre-trained models for recognizing emotions in real-time video.