A Hybrid of Proposed Filtration and Feature Selections to Enhance the Model Performance

Objectives: To extract and identify subjective information about social media users from unstructured data. To overcome high dimensionality and sparsity, the two major challenges in sentiment analysis of text datasets. To increase model performance by using the minimum possible feature set in a text classification problem. Methods: We propose a new filtration method that removes correlated features and zero-importance features before the various feature selection methods are applied. Feature selection methods such as Mutual Information, Lasso and Recursive Feature Elimination, together with the dimensionality reduction method Principal Component Analysis (PCA), were used along with the proposed filtration to find the most compelling features. This approach was evaluated on tweets about three Indian Government schemes, classified using a Random Forest classifier. The performance was evaluated using various metrics such as accuracy, precision, recall, f1_score, log loss and roc-auc. Findings: In this research, we propose a model for selecting relevant, non-correlated feature subsets from an unstructured dataset. With this model, an accuracy of 92% with a minimum log loss of 0.22 was achieved using the minimum number of features. Improvements: This study shows that model performance improves when the two problems of dimensionality and sparsity are overcome. Various feature selection methods were applied together with the proposed filtration in order to minimize the number of features. Decreasing the number of features improves both computing time and model performance, which is especially effective for large datasets. Even though Random Forest performs well on high dimensional datasets, some further optimization is needed.


Introduction
Ensemble classifiers outperform neural networks. The best results, 87.25% accuracy and 87.46% f1-score, were obtained from the combinations of Count Difference feature selection with an ensemble of LR + Bagging, and Count Difference feature selection with an ensemble of LR + Random Subspace (11). Online reviews were analysed using four classifiers, NB, KNN, ME and SVM, in combination with three ensemble methods: Boosting, Bagging and Random Subspace. It was found that 88.95% accuracy was achieved by standalone ME, while the highest ensemble accuracy of 88.68% was obtained with SVM + RS (12).
The impact of two feature extractors, TF-IDF and Bi-grams, was analysed on the SS-Tweet dataset. Six different algorithms, Decision Tree, SVM, KNN, Naïve Bayes, Random Forest and Logistic Regression, were used to classify the sentiments as positive, negative and neutral. The results show that TF-IDF performs 3-4% better than Bi-grams; the best scores of 57% accuracy, 57% precision, 50% recall and 50% f-score were obtained using LR (13). Tweets about Indian Railways were analysed to find positive, negative and neutral sentiments. The model was evaluated in terms of accuracy, precision, recall and f1-score using four classifiers: C4.5, Naïve Bayes, SVM and Random Forest. The highest scores of 91.5% accuracy, 88.5% precision, 87.5% f1-score and 83% recall were obtained using SVM, which was found to perform better than Random Forest, Naïve Bayes and C4.5 (14). Tweets about three digital payment service providers in Indonesia were gathered to perform sentiment analysis. KNN and NB classifiers were used to classify the tweets; KNN performed better than NB, achieving an accuracy of 91% (15).
The authors developed a model to determine which molecular function of electron transport proteins a given protein belongs to. The PSSM with AA Index feature set was successful in identifying electron transport proteins among transport proteins, achieving a sensitivity of 73.2%, specificity of 94.1%, accuracy of 91.3% and MCC of 0.64 on an independent dataset (16).
The authors developed a model to observe the impact of feature selection on tweet sentiment classification. They experimented with combinations of four classifiers, ten feature rankers and ten feature subset sizes (5, 10, 15, 20, 25, 50, 75, 100, 150 and 200) out of the 2388 features available in a dataset of 3000 tweets. In general, using 200 features works better than 100 or 150 features, and they showed that using feature selection to select 50 or fewer features generally results in poor performance. ANOVA analysis was performed to test the statistical significance of these findings, and the performance improvement achieved by selecting 75 or more features was found to be statistically significant (17). A model was proposed to improve the classification of Twitter data using three different classifiers with three categories of feature selection; the kernlab support vector machine with the third category of feature selection gave the best accuracy of 86.22% (18).
With the implementation of feature selection and feature weighting, accuracy was increased using an SVM classifier. The combination of Chi2 and TF-IDF was used to improve accuracy, and system performance was evaluated using 10-fold cross validation; as a result, accuracy improved from 68.7% to 80.2% (19). The authors proposed a model to classify the sentiments of e-commerce tweets downloaded from the Twitter Cloud Repository. A Naïve Bayes classifier was used along with the Information Gain feature selection method to improve accuracy, and 88.80% accuracy was obtained with the proposed system (20). A novel hybrid framework was developed to classify Twitter data, based on the feature selection method locally linear embedding (LLE) and three machine learning classifiers: Random Forest, K-Nearest Neighbors and Logistic Regression. Random Forest performed best of all, achieving the highest accuracy of 80% (21).
The authors proposed a model to classify the sentiments of Twitter data regarding the Citizenship Amendment Act 2020. The proposed system presents a faster approach to sentiment analysis, using various classifiers to classify tweets as positive, negative or neutral, with VADER used for faster and more accurate POS tagging. Among the various classifiers, SVM performed best, obtaining an accuracy of 77.32% (22). Table 1 below compares the previous methods with the proposed method, and Figure 1 shows the architecture diagram.

Proposed Model
The proposed model has been described in two phases.
https://www.indjst.org/

Phase 1
The sentiment analysis has been processed in five steps:

1. Tweets extraction: The tweets were collected using the Twitter API. The analysis covers three schemes of the Indian Government: Make in India, Digital India and Swachh Bharat. About 7500 tweets regarding these schemes were extracted.
2. Pre-processing: This includes tokenization and the removal of links, stop words, and unwanted symbols and characters, done using the PySpark package.
3. Tokens evaluation: The tokens were assigned their corresponding ratings using the AFINN dictionary, which contains English words associated with polarity scores between -5 and 5. Each tweet was labelled using these scores.
4. Feature extraction: The feature vectors were generated using Hashing TF-IDF. In this model, the features were extracted for three feature sets: Unigram, Bigram and Unigram + Bigram. When text is used as features, Hashing TF-IDF improves performance.
5. Classification: After pre-processing, 7336 tweets with 7519 features were obtained. Before classification, the samples were split into a train set and a test set in the ratio 7:3. The Random Forest classifier was used to classify the sentiments as positive, negative and neutral, and a comparative analysis was done between Unigram, Bigram and Unigram + Bigram feature vectors.

Algorithm1 below was used for extracting and classifying the tweets.
Algorithm1:
Input: Tweets collected as JSON files.
Output: Classify the tweets as negative, positive and neutral.
1. Convert the tweets from the JSON files to a PySpark data frame.
2. Remove the duplicate records in the data frame using User_id.
3. Extract and pre-process the tweets from the data frame using the PySpark package.
4. Label the tweets with their corresponding polarity scores using the AFINN dictionary.
5. Organise the tokens as combinations of unigrams and bigrams.
6. Extract the feature vectors for the three feature sets using the Hashing TF and IDF methods.
7. Take 70% of the data as the train set and 30% as the test set.
8. Use the Random Forest classifier to predict the sentiments.
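The tweet-labelling rule of step 4 can be sketched as follows. This is a minimal pure-Python illustration, not the PySpark implementation used in the model, and the five-word score table is a hypothetical stand-in for the full AFINN dictionary:

```python
# Hypothetical sample of AFINN-style polarity scores (the real AFINN
# dictionary maps thousands of English words to scores in [-5, 5]).
AFINN_SAMPLE = {"good": 3, "great": 3, "bad": -3, "dirty": -2, "clean": 2}

def label_tweet(tokens):
    """Sum the AFINN-style scores of the tokens and map the total to a
    sentiment label: positive (>0), negative (<0) or neutral (0)."""
    score = sum(AFINN_SAMPLE.get(t, 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label_tweet(["swachh", "bharat", "keeps", "streets", "clean"]))  # positive
print(label_tweet(["dirty", "roads"]))                                 # negative
```

Tokens absent from the dictionary contribute a score of 0, so tweets with no scored words fall into the neutral class.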
In general, the more trees in a Random Forest, the better the results; however, the improvement diminishes as the number of trees grows. It has been shown that between 64 and 128 trees in a forest gives a good balance between AUC, processing time and memory usage (23). So, this model was experimented with for 10 trees and 100 trees. Comparing Tables 2, 3 and 4 below, the combination of Unigram and Bigram for 100 trees with 160 features performs better than Unigram or Bigram individually with 80 features, achieving the highest scores of 92% accuracy, 92% precision, 92% recall and 92% f1-score, 99% ROC-AUC and the minimum log loss of 0.22.
When the number of features was increased to 400, accuracy, precision, recall and f1-score each decreased by 1%, as shown in Table 5. So, the minimum of 160 features with the combination of Unigram and Bigram was taken forward to the filtration method.

Phase 2
In this phase, three different feature selection methods were applied to the 160 features to select the relevant ones. Before applying the feature selection methods, a filtration was done. This filtration consists of two steps: the first removes the features whose mean value is less than a threshold, and the second filters out the correlated features. Algorithm2 below describes these two filtration steps.

Algorithm2:
Input: Dataframe df, with each feature as a column.
Output: A data frame from which zero-importance and correlated features have been removed.
i. Remove the features whose mean value is less than a threshold (zero-importance features).
ii. Remove the features whose correlation value is greater than 60%.
After the first filtration step, a total of 23 columns had been removed, so the correlation coefficient was computed between each pair of the remaining 137 columns (160 - 23 = 137).
After the second filtration step, it was found that another 23 columns were correlated by more than 60%. The remaining 114 columns (137 - 23 = 114) were taken forward for further feature selection using the various feature selection methods.
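The two filtration steps above can be sketched as follows. This is a minimal NumPy illustration, not the exact PySpark implementation; the threshold values and the greedy keep-first rule for correlated pairs are assumptions:

```python
import numpy as np

def filter_features(X, mean_threshold=0.01, corr_threshold=0.60):
    """Two-step filtration sketch.
    Step 1: drop features whose mean falls below mean_threshold
            (treated here as zero-importance features).
    Step 2: among the survivors, drop one feature from every pair whose
            absolute Pearson correlation exceeds corr_threshold
            (greedy: the earlier column of the pair is kept).
    Returns the indices of the surviving columns of X."""
    keep = [j for j in range(X.shape[1]) if X[:, j].mean() >= mean_threshold]
    corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
    selected = []  # positions within `keep`
    for pos in range(len(keep)):
        if all(corr[pos, s] <= corr_threshold for s in selected):
            selected.append(pos)
    return [keep[pos] for pos in selected]
```

On the paper's data this corresponds to 160 columns being reduced to 137 by step 1 and to 114 by step 2.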
Using feature selection methods such as Mutual Information (MI), Lasso (L1) and Recursive Feature Elimination, RFE (RF), with the proposed filtration method yields good results. Table 6 compares the model performance before and after the filtration method. There is only a slight difference between training the classifier on 160 features without filtration and on 114 features with filtration: with the 114 feature set, f1-score decreased by 0.003, and accuracy and recall by 0.002. But the log loss improved, decreasing from 0.22 to 0.21. So, the eliminated features are likely not important for the further classification process.

Table 7 shows the performance of the model when the feature selections are applied without filtration to the 160 feature set. Here the model was trained on the least number of features that still obtains the best results; this number varies between the feature selection methods. Selecting 112 features in MI, 93 features in L1 and 95 features in RFE (RF) produced the maximum results, and decreasing these numbers further in any method degraded the performance of the model.

Table 8 shows the results of the three feature selection methods applied after the proposed filtration. Comparing Tables 7 and 8, the model performance increased while the number of features decreased in each selection method: better results were produced by selecting 80 features instead of 112 in MI, 86 instead of 93 in L1 and 51 instead of 95 in RFE (RF). Accuracy, precision, recall and f1-score each increased by 1% in the L1 method after filtration, while MI and RFE (RF) produced the same results with fewer features. As the number of features decreases, the time taken for classification also decreases; RFE (RF) needs a minimum of only 51 features, reducing the classification time to 811 milliseconds.
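Of the three selection methods, mutual information is the simplest to illustrate. The following is a minimal pure-Python sketch of MI scoring for discrete features, estimated from empirical frequencies; the helper `mutual_information` is hypothetical, not the implementation used in the experiments:

```python
from collections import Counter
from math import log2

def mutual_information(feature, labels):
    """Mutual information (in bits) between a discrete feature column and
    the class labels, using empirical probabilities:
        MI = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y)))
    Features scoring near 0 carry little information about the label and
    are candidates for removal."""
    n = len(labels)
    px = Counter(feature)
    py = Counter(labels)
    pxy = Counter(zip(feature, labels))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y)/(p(x)p(y)) simplifies to c*n/(count(x)*count(y)).
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi

# A feature identical to the labels is maximally informative;
# a feature independent of the labels scores 0.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
print(mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
```

Ranking all columns by this score and keeping the top k is the essence of MI feature selection; continuous TF-IDF values would first need to be discretized or handled with a density-based estimator.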
Table 9 shows how the filtration helps in achieving better results with a minimum number of features. Here, the numbers of features selected by each selection method after filtration were applied directly to the 160 features, and the performance decreases significantly when the features are reduced further. Comparing Tables 8 and 9, the number of features is the same for each selection method, but the results differ because of the filtration performed in Table 8 before the selection methods were applied.
Thus, the proposed method is effective in reducing the number of features to obtain better performance.
Table 10 shows that the model performance increases when filtration is applied before dimensionality reduction. There is no difference between the 160 feature set and the 114 feature set for zero components in PCA, except in the time taken; but with filtration, accuracy, precision, recall and f1_score each increased by 1% for ten components of PCA.
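The PCA step can be sketched as follows. This is a minimal NumPy illustration via eigendecomposition of the covariance matrix, not the PySpark PCA used in the experiments:

```python
import numpy as np

def pca_components(X, k):
    """Project X onto its first k principal components.
    The data is centred, the covariance matrix is eigendecomposed, and
    the eigenvectors are sorted so the directions of largest variance
    come first."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # largest variance first
    return Xc @ eigvecs[:, order[:k]]
```

Because the components are ordered by explained variance, the first projected column always has at least as much variance as the second, which is what makes truncating to ten components (as in Table 10) a principled reduction.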

Performance Measures
Evaluating a model is a core part of building an effective machine learning model. To demonstrate the performance of the proposed method, the following metrics have been evaluated:

• Accuracy: The proportion of correctly predicted data points out of all data points.
• F1-Score: The weighted average of precision and recall; it therefore takes both false positives and false negatives into account. The F-score is widely used in the natural language processing literature.
• Roc-Auc: The ROC curve is the plot of sensitivity against (1 - specificity), where (1 - specificity) is the false positive rate and sensitivity is the true positive rate. The AUC is the ratio of the area under the curve to the total area. It considers the predicted probabilities when assessing the model's performance and is widely used when the dataset is imbalanced, since accuracy is not a reliable performance metric for imbalanced data.
• Log Loss: Indicates how close the prediction probabilities are to the corresponding actual/true values. It is useful for assessing model performance because accuracy alone cannot measure how good the model's probability predictions are.
Log Loss = -(1/N) * Σ_i Σ_j y_ij * log(p_ij)

where y_ij indicates whether sample i belongs to class j, and p_ij is the predicted probability of sample i belonging to class j.
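The multi-class log loss defined above can be computed directly; a minimal pure-Python sketch (with probabilities clipped away from zero to keep the logarithm finite):

```python
from math import log

def log_loss(y_true, y_prob, eps=1e-15):
    """Multi-class log loss: -(1/N) * sum_i sum_j y_ij * log(p_ij).
    y_true[i][j] is 1 if sample i belongs to class j, else 0;
    y_prob[i][j] is the predicted probability of that membership."""
    n = len(y_true)
    total = 0.0
    for yi, pi in zip(y_true, y_prob):
        for y_ij, p_ij in zip(yi, pi):
            if y_ij:  # only the true class contributes for each sample
                total += log(max(p_ij, eps))
    return -total / n

# Confident, correct predictions give a low loss; a perfect model gives 0.
print(round(log_loss([[1, 0], [0, 1]], [[0.9, 0.1], [0.2, 0.8]]), 3))  # 0.164
```

A confident wrong prediction is penalised heavily, which is why log loss complements accuracy when comparing models.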

Results And Analysis
In this section, the performance of the proposed model on real-time datasets is discussed. The comparison of the proposed model with other existing works is presented in Table 1; the results in this table show that the proposed model yields better results than the other classification systems. In addition to the various metrics evaluated in existing works, log loss was also analysed in the proposed work. Moreover, various combinations of feature size and feature sets such as unigrams and bigrams were experimented with in Phase 1 to find the optimal solution. The improvement of the proposed model was also analysed in terms of computational time for every feature selection. The time taken to classify with all 160 features was 1.02 seconds, while the hybrid of the proposed filtration and RFE (RF) feature selection needs a minimum of only 51 features and reduces the time to 811 milliseconds. This is a difference of about 20% ((1020 - 811) * 100 / 1020 = 20.49), which is significant in the case of big data.
The various metrics accuracy, precision, recall, f1_score, log loss and roc-auc have been analysed for the proposed model. Completeness and exactness matter as much as high accuracy, so in addition to accuracy, the three parameters precision, recall and f1-score were evaluated. For measuring model performance, AUC is not biased by the test data size, whereas accuracy is; since only 30% of the data is used as test data in this model, it is better to also evaluate roc-auc. ROC-AUC ranges between 0 and 1, and a good model will have a roc-auc near 1. Figure 2 shows that the proposed model performs well, with a roc-auc of 0.99 for each class. From Figure 3, it can be seen that the proposed method reduced the features significantly for the MI and RFE (RF) methods; even though it shows only a slight difference for the Lasso method, it increases accuracy from 91% to 92%.
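For a binary split of the classes, the ROC-AUC discussed above can be computed without plotting the curve at all, via the rank-comparison (Mann-Whitney) formulation; a minimal pure-Python sketch:

```python
def roc_auc(labels, scores):
    """Binary ROC-AUC as the fraction of (positive, negative) pairs that
    the scores rank correctly, counting ties as half a correct ranking.
    labels: 1 for the positive class, 0 for the negative class.
    scores: the classifier's predicted probabilities for the positive class."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 1.0 (perfect ranking)
```

Because only the ranking of the scores matters, this measure is unaffected by the class imbalance that makes plain accuracy unreliable; for the three-class setting of the paper, a one-vs-rest AUC is computed per class, as in Figure 2.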
The loss function acts as a good proxy for accuracy: log loss evaluates a classifier by penalising false classifications, and minimising the log loss is essentially equivalent to maximising the accuracy of the classifier. Log loss ranges from 0 upwards, and a perfect model has a log loss of 0. The proposed model can be considered a good model since it has a log loss of 0.22 with the minimum number of features. Figure 4 shows the differences in log loss obtained before and after the filtration method with each feature selection; the log loss reduced significantly for every feature selection with filtration. Finally, the sentiment classification was performed on these tweets using the Random Forest classifier, with the results described in (24).

Discussions
Before applying the feature selection methods or the dimensionality reduction method, the proposed filtration method was used to filter out insignificant features and features correlated above 60%, in order to improve the efficiency of the model, since irrelevant and correlated features degrade model performance. The proposed filtration was shown to reduce the features significantly while achieving better results, and filtering out the features correlated above 60% before applying tree-based feature selection produced efficient results in the proposed model. This model can also be applied to large datasets, as it is implemented in the Apache PySpark framework, and it becomes more effective as the size of the data or feature set increases. The performance of the model is evaluated using various metrics, accuracy, precision, recall, f1-score, log loss and roc-auc, which are well suited to text classification: roc-auc is evaluated because accuracy is not a reliable performance metric for imbalanced data, log loss represents how close the predictions are to the actual values, and f1-score is widely used in any NLP approach.
The proposed model is not domain specific and can be used in various domains to classify sentiments. Since the model has proved good in both time and performance with a minimum feature set, it can certainly find application in the field of big data, and when it comes to social data its application can surely be extended.

Conclusion
In this study, a hybrid of filtration and feature selection methods has been implemented in order to reduce the number of features. Multi-class classification (positive, negative and neutral) was performed using the AFINN dictionary and the Random Forest algorithm, along with the three feature selection methods MI, L1 and RFE (RF). The filtration was also tested with the dimensionality reduction method PCA. Finally, the performance of the proposed model was evaluated using various metrics: accuracy, precision, recall, f1_score, log loss and roc-auc. The model achieves 92% accuracy, precision, recall and f1_score and 99% roc-auc, with a minimum log loss of 0.22. The results in Table 1 show that the proposed model performs better than the previous models.
A 1% increase in accuracy is not the only advantage of the proposed model; we also obtain more or less the same accuracy using a minimum set of features, which reduces the complexity of the model. So, for large datasets the proposed model will be even more useful, improving model performance with a minimum feature set. With this dataset, we obtain 92% accuracy with a minimum of 51 features instead of 160, i.e. only about one third of the features for the same or better results. As noted in (17), the authors proved that using feature selection to select 50 or fewer features generally results in poor performance, and that the improvement from selecting 75 or more features was statistically significant. Our results also show a 20% reduction in computational time, which will be even more effective for large datasets.

Future Scope
As future work, the analysis should be improved to further increase the evaluation metrics. Other, possibly larger, datasets could be run through the proposed model to assess its performance, and the sentiments of Tamil tweets also need to be analysed.