A search space enhanced modified whale optimization algorithm for feature selection in large-scale microarray datasets

Objectives: To enhance microarray data classification accuracy, accelerate the convergence speed of the classifier, and refine the balance between local exploitation and global exploration in the Modified Whale Optimization Algorithm (MWOA), a Search space enhanced Modified Whale Optimization Algorithm (SMWOA) is proposed. Methods: SMWOA selects the optimal features using the Levy flight method and the quadratic interpolation method. Levy flight accelerates the convergence of SMWOA and keeps the result from being trapped in local optima by increasing population diversity. Quadratic interpolation is used in the exploitation stage for deeper searching within the search area. Findings: In addition, a self-adaptive control parameter is introduced to improve solution quality. It refines the balance between local exploitation and global exploration. After feature selection, the selected features are processed by Naïve Bayes (NB), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Artificial Neural Network (ANN) classifiers for cancer detection. Novelty: The classification accuracy is improved by processing the most discriminative features in the classifiers. The overall accuracy, specificity, sensitivity, F1-score and average error of SMWOA-ANN are 6.7%, 5.6%, 7.3% and 5.6% greater than MWOA-ANN respectively for cancer detection.


Introduction
The principle of microarray technology was established and verified for antibody microarrays in a registered series of patents (1). In a single experiment, microarray technology allows biologists to determine the expression levels of thousands of genes. Microarray data consist of a small number of samples with a very large number of dimensions. A disadvantage of gene expression microarray data is that the number of genes greatly exceeds the sample size, a problem generally referred to as the curse of dimensionality. Microarrays were developed as a result of efforts to improve the drug discovery process. To avoid the complications caused by the curse of dimensionality, dimension reduction plays a pivotal role in DNA microarray investigation.
Microarray studies supply scientists with a tremendous amount of data, but the important information and knowledge contained in these data cannot be found without proper instruments and methodologies. Analytical and computational problems emerge from the large amount of raw gene expression data. The nature of microarray data is what makes the analyst's work challenging. The best analytical models depend greatly on the number of possible gene combinations. Hence, the success of microarray technologies depends on large-scale data mining and analytical methods. Data mining plays a chief role in addressing the curse of dimensionality (2)(3)(4).
Feature selection (5)(6)(7)(8) is a data mining technique used to address the curse of dimensionality. It distinguishes relevant from irrelevant features and eliminates the irrelevant ones. A feature selection strategy using Particle Swarm Optimization (PSO) (9) has been introduced to reduce the dimensions of microarray data. The reduced features are processed by a Naïve Bayes classifier and a Support Vector Machine (SVM) classifier for data classification. However, PSO sometimes converges slowly. The advantages of the Whale Optimization Algorithm (WOA) (10)(11)(12)(13)(14)(15), such as its small number of parameters and low likelihood of getting stuck in local optima, make it attractive for feature selection in cancer detection. To select feature subsets effectively, a Modified Whale Optimization Algorithm (MWOA) (16) was proposed for selecting the most relevant features in the microarray cancer dataset. Owing to the simplicity of WOA and its low dependence on parameters, the search around local minima can be extended at random toward the best solution, so that exploration can be tuned to the best position of the agents. MWOA focuses on improving the exploration of the Whale Optimization Algorithm; it relies on a fitness function and aims to locate the agents at minimum distance. However, MWOA sometimes gets stuck in local optima (i.e., solutions that are optimal only within a neighboring set of candidate solutions), which degrades cancer detection accuracy.
In this paper, a Search space enhanced MWOA (SMWOA) is proposed to enhance cancer detection accuracy and to improve the exploration and exploitation strategy of MWOA. A non-linear dynamic strategy is introduced in SMWOA to update the control parameter of MWOA, which also balances its exploration and exploitation abilities. To escape from local optima, a Levy-flight strategy is introduced in SMWOA. Moreover, the leader of the population is updated by a quadratic interpolation method, which boosts the local exploitation ability and thereby enhances cancer detection accuracy.

Literature Survey
A global feature selection method (17) was developed with semi-definite programming models using Lagrange multipliers. However, the threshold value used with the Lagrange multipliers greatly influenced the classification accuracy. A framework (18) was proposed to choose top-ranked features for microarray data. However, when the datasets are imbalanced, additional effort is needed to resolve the imbalance. A bi-objective genetic algorithm (19) was proposed for ensemble-based feature selection. Its classification accuracy could be improved by using multiple objectives in the genetic algorithm.
A two-step attribute selection method (20) was introduced for cancer diagnosis using kernel-based learning. Other objective functions, such as the correlation between genes or class separability distances, could be considered for cancer diagnosis. A recursive Particle Swarm Optimization (PSO) (21) was proposed for gene selection in microarray datasets. Key future directions for recursive PSO-based gene selection include exploring other soft computing approaches and further minimizing the number of genes. An improved binary particle swarm optimization method with correlation-based feature selection (22) was proposed for gene selection and cancer classification. However, the accuracy of this method can be further improved.
A multi-objective simplified swarm optimization method (23) was proposed for gene selection in microarray data. However, this method was not well suited to complex datasets. A Partial Maximum Correlation Information (PMCI) method (24) was proposed for microarray data classification. However, it has a poor F-score. A Binary Coral Reef Optimization (BCRO) algorithm (25) was proposed to select the most significant features from microarray datasets. The exploration and exploitation of BCRO could be improved by combining it with other local search strategies or swarm intelligence algorithms. A multi-objective feature selection model (26) was constructed for microarray data through a distributed parallel algorithm. However, conflicts may arise between the multiple objectives.
A hybrid metaheuristic combining a binary black hole algorithm and Particle Swarm Optimization (PSO) (27) was proposed for gene selection. However, it was not suitable for complex datasets. A feature selection method (28) based on the Hidden Markov Model (HMM) was proposed for microarray data classification. This method could be extended by combining more feature selection methods with the HMM-based method; KNN and ANNs (29) gave better results and a higher classification rate. A metaheuristic method (30) based on the binary shuffled frog leaping algorithm was proposed for gene selection. This method could be extended to high-dimensional multi-class classification problems.
A feature selection strategy (31) was presented to improve classification performance on high-dimensional datasets. However, analyzing each cluster with a Multi-Layer Perceptron (MLP) in sequential order was highly time-consuming, which was the major drawback of this strategy. A hybrid filter-wrapper feature selection approach based on the Genetic Algorithm (GA), using a penalty scheme and weighted occurrence frequency (32), was proposed for dimensionality reduction in biomedical datasets. However, this approach required a large number of records for reliable sampling. A wrapper-based feature selection method (33) was proposed to increase the capability of an intrusion detection system. This method could be extended by considering more groups so that the algorithm can categorize diverse types of intrusions within intrusion detection datasets. An enhanced Artificial Bee Colony algorithm based on the Whale Optimization Algorithm (ABCWOA) (34) was proposed for data clustering. The performance of ABCWOA could be enhanced by using new operators suited to the optimization problem.

Proposed Methodology
This paper aims to reduce the dimensionality of microarray data and to increase the convergence speed of classifiers, thereby improving detection accuracy. Three modifications are made in SMWOA so that it selects the most discriminative features in the microarray dataset: tracking the large-scale global optimization problem using Quadratic Interpolation (QI), securing the solution from local optima using Levy Flight (LF), and maintaining the balance between exploration and exploitation using a non-linear control parameter strategy.
• QI is carried out in the exploitation phase of SMWOA to preserve population diversity.
• LF is introduced in SMWOA to escape local optima by boosting population diversity.
• The non-linear control parameter strategy introduces a parameter that balances exploration and exploitation.
By making these modifications, the SMWOA selects the most discriminative features from the dataset and the selected features are used in NB, SVM, KNN, and ANN for data classification.

Tracking the large scale global optimization problem using quadratic interpolation
Even though MWOA explores the search space well with its encircling mechanism and spiral path, it still requires enhancement to track large-scale global optimization problems. The Quadratic Interpolation (QI) method is embedded in SMWOA to enhance its exploitation capability and, in turn, the data classification accuracy. Mathematically, QI finds the minimum point of the quadratic curve passing through three selected solutions in n-dimensional space. As a crossover operator, QI takes the best search agent $X^* = (x_1^*, x_2^*, \ldots, x_n^*)$ and two other parents $A = (a_1, a_2, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_n)$ as its three parents, and creates a new solution $X = (x_1, x_2, \ldots, x_n)$ that is used to select the optimal features from the microarray dataset. In the quadratic crossover, the current best search agent $X^*$ takes the leading role, which allows the search agents to better discover the global optimum.
To preserve population diversity and enhance exploitation, the exploitation phase of SMWOA uses the QI method. The exploitation phase therefore has two elements: the spiral-shaped path and the quadratic crossover. A probability parameter controls which of the two is applied. The spiral-shaped path is followed if the probability is less than 0.6; it is calculated by Equations (3.3) and (3.4), in which D′ denotes the distance between the ith whale (search agent) and the best solution (optimal features) obtained so far, which is replaced in any iteration that yields a better result; t is the current iteration; |.| denotes the absolute value; × denotes element-by-element multiplication; b is a constant defining the shape of the logarithmic spiral; and l is a random number in the range −1 to 1.
If the probability is greater than 0.6, the quadratic crossover is executed to update the positions of the whales.
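The two exploitation moves can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the spiral rule is taken to be the standard WOA form $X(t+1) = D' e^{bl} \cos(2\pi l) + X^*(t)$ with $D' = |X^*(t) - X(t)|$, and the crossover is the textbook three-point quadratic interpolation minimum applied component by component. The function names, the `eps` guard against a zero denominator, and the default `b = 1` are illustrative choices.

```python
import math
import random


def spiral_update(x, best, b=1.0, rng=random):
    """Move whale x along a logarithmic spiral around the best agent (assumed standard WOA form)."""
    l = rng.uniform(-1.0, 1.0)  # random number in [-1, 1]
    factor = math.exp(b * l) * math.cos(2.0 * math.pi * l)
    # D' = |X* - X| is computed per component
    return [abs(bs - xi) * factor + bs for xi, bs in zip(x, best)]


def qi_crossover(best, a, b, f_best, f_a, f_b, eps=1e-12):
    """Component-wise minimum of the parabola through the three parents.

    f_best, f_a, f_b are the fitness values of the parents best, a and b.
    """
    child = []
    for xs, ai, bi in zip(best, a, b):
        num = (ai**2 - bi**2) * f_best + (bi**2 - xs**2) * f_a + (xs**2 - ai**2) * f_b
        den = (ai - bi) * f_best + (bi - xs) * f_a + (xs - ai) * f_b
        # Fall back to the best agent's component when the parabola is degenerate
        child.append(xs if abs(den) < eps else 0.5 * num / den)
    return child


def exploit(x, best, a, b, fitness, p_threshold=0.6, rng=random):
    """Spiral path when p < 0.6, quadratic crossover otherwise."""
    if rng.random() < p_threshold:
        return spiral_update(x, best, rng=rng)
    return qi_crossover(best, a, b, fitness(best), fitness(a), fitness(b))
```

For a one-dimensional check with fitness $f(x) = (x - 3)^2$, the parents 1, 2 and 5 have fitness 4, 1 and 4, and the crossover returns the parabola's minimum at 3.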

Securing the solution from local optima using Levy Flight (LF)
The Levy Flight (LF) process is used to protect the solution from local optima and to speed up convergence through a powerful global search. LF is therefore used in SMWOA to escape local optima by increasing population diversity. LF is a non-Gaussian random process whose step length follows a Levy distribution. The Levy distribution, in simple power-law form, is given in Equation (3.5), where β denotes an index and s denotes the step length of the LF. The step length s is calculated using Mantegna's algorithm, Equation (3.6), in which µ and v obey normal distributions. To keep LF from jumping out of the design domain, a step size is adopted; in its calculation, Equation (3.11), size(D) is the range of the problem considered, ⊕ denotes entry-wise multiplication and $X_i$ is the ith solution vector. Because the Lévy distribution has infinite variance, LF makes occasional long-distance moves in addition to frequent short-distance moves, which maximizes exploration potential. This property helps the metaheuristic algorithm jump out of local optima. To explore the search space more effectively, the shrinking encircling mechanism is replaced by LF. The new position is updated by the following rule:

$X(t+1) = X(t) + \frac{1}{\sqrt{t}} \times \mathrm{sign}(\mathrm{random} - 0.5) \oplus \mathrm{Levy}$ (3.12)

In Equation (3.12), $1/\sqrt{t}$ is a scale factor tied to the current iteration number t. As a result, larger search moves are made in the early stage and smaller ones in later iterations. The sign function sign(random − 0.5) takes one of three values, −1, 0 or 1, which yields a random search direction. This update is used in the exploration phase of SMWOA.
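The Levy-flight update of Equation (3.12) can be sketched as below. The sketch assumes the usual statement of Mantegna's algorithm, $s = \mu / |v|^{1/\beta}$ with $\mu \sim N(0, \sigma_\mu^2)$, $v \sim N(0, 1)$ and $\sigma_\mu = \left[ \frac{\Gamma(1+\beta)\sin(\pi\beta/2)}{\Gamma((1+\beta)/2)\,\beta\,2^{(\beta-1)/2}} \right]^{1/\beta}$. The default β = 1.5, the per-dimension sampling of the sign and the Levy step, and all function names are assumptions made for illustration; the domain-limiting step size of Equation (3.11) is not modeled.

```python
import math
import random


def mantegna_sigma(beta):
    """Standard deviation of the numerator variable in Mantegna's algorithm."""
    num = math.gamma(1.0 + beta) * math.sin(math.pi * beta / 2.0)
    den = math.gamma((1.0 + beta) / 2.0) * beta * 2.0 ** ((beta - 1.0) / 2.0)
    return (num / den) ** (1.0 / beta)


def levy_step(beta=1.5, rng=random):
    """Draw one Levy-distributed step length s = mu / |v|^(1/beta)."""
    mu = rng.gauss(0.0, mantegna_sigma(beta))
    v = rng.gauss(0.0, 1.0)
    return mu / abs(v) ** (1.0 / beta)


def levy_position_update(x, t, beta=1.5, rng=random):
    """X(t+1) = X(t) + 1/sqrt(t) * sign(random - 0.5) (+) Levy, applied per dimension."""
    scale = 1.0 / math.sqrt(t)  # shrinks the moves as the iteration count t grows
    new_x = []
    for xi in x:
        r = rng.random() - 0.5
        sign = (r > 0) - (r < 0)  # -1, 0 or 1
        new_x.append(xi + scale * sign * levy_step(beta, rng))
    return new_x


if __name__ == "__main__":
    rng = random.Random(42)
    print(levy_position_update([0.0, 0.0, 0.0], t=1, rng=rng))
    print(levy_position_update([0.0, 0.0, 0.0], t=100, rng=rng))
```

Because of the $1/\sqrt{t}$ factor, the same Levy draw moves a whale ten times less at iteration 100 than at iteration 1, which matches the early-exploration, late-refinement behavior described in the text.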

Maintaining perfect harmony between exploration and exploitation by using non-linear control parameter strategy
To achieve good performance, a balance must be maintained between exploration and exploitation. In MWOA, the coefficient A is an important factor in this balance. As mentioned above, the whales move towards the prey (exploitation) when |A| < 1 and examine the search area (exploration) when |A| > 1. A is directly affected by the linearly decreasing parameter a. However, a linearly decreasing a cannot respond correctly to a difficult, non-linear search process. With this in view, SMWOA uses a non-linear control parameter to directly influence the quality of the solution. SMWOA employs a cosine function to update a in each iteration, as given in Equation (3.14), where max_itr denotes the maximum number of iterations.
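To make the contrast between a linear and a cosine-based schedule concrete, the sketch below compares the two. The exact form of Equation (3.14) is not reproduced in this text, so the cosine schedule used here, $a = a_{max} \cos\left(\frac{\pi}{2} \cdot \frac{t}{max\_itr}\right)$ with $a_{max} = 2$, is a hypothetical stand-in chosen only because it also decays from 2 to 0; it should not be read as the authors' formula. The coefficient $A = 2ar - a$ with $r \in [0, 1]$ is the standard WOA definition.

```python
import math


def linear_a(t, max_itr, a_max=2.0):
    """Linearly decreasing control parameter, as in the original WOA."""
    return a_max * (1.0 - t / max_itr)


def cosine_a(t, max_itr, a_max=2.0):
    """Hypothetical cosine schedule (assumed form, not Equation (3.14) itself)."""
    return a_max * math.cos(0.5 * math.pi * t / max_itr)


def coefficient_A(a, r):
    """Standard WOA coefficient A = 2*a*r - a, with r drawn from [0, 1]."""
    return 2.0 * a * r - a


if __name__ == "__main__":
    max_itr = 100
    for t in (0, 25, 50, 75, 100):
        print(t, round(linear_a(t, max_itr), 3), round(cosine_a(t, max_itr), 3))
```

Under this assumed schedule, a stays larger than the linear value for most of the run, so |A| > 1 (exploration) remains possible for longer before the search narrows to exploitation.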

Search space enhanced modified whale optimization algorithmic method for feature selection (SMWOA)
27. t++
28. end while
29. return X*

SMWOA is used to select the most discriminative features in the microarray dataset. The selected features are processed by NB, SVM, KNN and ANN for cancer detection. The overall flow of SMWOA is shown in [Figure 1].

Results and Discussion
This section presents the performance of MWOA and SMWOA with different classifiers for cancer detection, analyzed in terms of accuracy, specificity, sensitivity, F1-score and average error. For the experiments, three microarray datasets are used: Leukemia, Lymphoma and Prostate. These datasets are publicly available on the internet. The Leukemia dataset has 72 instances, 3572 features and 2 classes; the Lymphoma dataset has 77 instances, 2647 features and 2 classes; and the Prostate dataset has 102 instances, 2135 features and 2 classes. The method is evaluated on training and testing sets. The datasets are divided into testing and training sets in the ratio of 60:40. Table 1 shows the number of features selected by the MWOA and SMWOA methods.
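A 60:40 partition of the three datasets can be sketched as follows. The text does not say which part receives 60% or how samples are assigned, so the sketch assumes 60% for training, a shuffled (non-stratified) split and a fixed seed; `split_indices` is an illustrative helper name.

```python
import random


def split_indices(n_samples, train_fraction=0.6, seed=0):
    """Shuffle sample indices and split them (assumed: 60% training, 40% testing)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


if __name__ == "__main__":
    # Instance counts as reported for the three datasets
    for name, n in {"Leukemia": 72, "Lymphoma": 77, "Prostate": 102}.items():
        train, test = split_indices(n)
        print(name, len(train), len(test))
```

With so few instances per dataset, the test sets hold only a few dozen samples each, so a stratified split or repeated splits may be worth considering when reproducing the comparison.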

Accuracy
Accuracy is the proportion of records in the microarray data that are correctly classified among all records in the dataset. It is calculated as $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives. [Figure 2] compares the accuracy of MWOA and SMWOA with different classifiers for cancer detection. The X-axis shows the classifiers and the Y-axis shows the accuracy obtained with each feature selection method. For the leukemia dataset, the accuracy of SMWOA-ANN is 51.30% greater than MWOA-SVM, 5.25% greater than MWOA-KNN, 10.03% greater than MWOA-NB, 4.18% greater than MWOA-ANN, 43.32% greater than SMWOA-SVM, 1.41% greater than SMWOA-KNN and 6.13% greater than SMWOA-NB. Similarly, for the prostate dataset, the accuracy of SMWOA-ANN is 4.29% greater than MWOA-SVM, 16.21% greater than MWOA-KNN, 21.66% greater than MWOA-NB, 1.69% greater than MWOA-ANN, 2.02% greater than SMWOA-SVM, 10.69% greater than SMWOA-KNN and 17.33% greater than SMWOA-NB. For the lymphoma dataset, the accuracy of SMWOA-ANN is 5.86% greater than MWOA-SVM, 23.52% greater than MWOA-KNN, 18.58% greater than MWOA-NB, 14.34% greater than MWOA-ANN, 3.31% greater than SMWOA-SVM, 16.85% greater than SMWOA-KNN and 14.61% greater than SMWOA-NB. This analysis shows that the proposed SMWOA-ANN method produces higher accuracy than the other methods for cancer detection.

Specificity
Specificity measures how well a clinical test identifies people without the illness among all people who are free of it. It is calculated as $\text{Specificity} = \frac{TN}{FP + TN}$. [Figure 3] compares the specificity of MWOA and SMWOA with different classifiers for cancer detection. The X-axis shows the classifiers and the Y-axis shows the specificity obtained with each feature selection method. For the leukemia dataset, the specificity of SMWOA-ANN is 20.86% greater than MWOA-SVM, 7.49% greater than MWOA-KNN, 6.30% greater than MWOA-NB, 3.58% greater than MWOA-ANN, 14.59% greater than SMWOA-SVM, 3.27% greater than SMWOA-KNN and 1.50% greater than SMWOA-NB. Similarly, for the prostate dataset, the specificity of SMWOA-ANN is 6.37% greater than MWOA-SVM, 13.65% greater than MWOA-KNN, 21.31% greater than MWOA-NB, 11.72% greater than MWOA-ANN, 3.50% greater than SMWOA-SVM, 7.89% greater than SMWOA-KNN and 15.15% greater than SMWOA-NB. For the lymphoma dataset, the specificity of SMWOA-ANN is 5.99% greater than MWOA-SVM, 5.14% greater than MWOA-KNN, 3.36% greater than MWOA-NB, 2.44% greater than MWOA-ANN, 3.92% greater than SMWOA-SVM, 2.47% greater than SMWOA-KNN and 1.59% greater than SMWOA-NB. This analysis shows that the proposed SMWOA-ANN method has higher specificity than the alternative methods used for cancer detection.

Sensitivity
Sensitivity measures how well a clinical test identifies people who have the illness. It is the proportion of people with the disease who test positive, expressed as a percentage. It is calculated as $\text{Sensitivity} = \frac{TP}{TP + FN}$.
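The classification metrics used in this section can all be computed from the four confusion-matrix counts, as the short sketch below illustrates. It uses the standard definitions (F1-score as the harmonic mean of precision and sensitivity); the function names and the toy labels are illustrative only and do not come from the experiments.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == positive and yp == positive)
    tn = sum(1 for yt, yp in zip(y_true, y_pred) if yt != positive and yp != positive)
    fp = sum(1 for yt, yp in zip(y_true, y_pred) if yt != positive and yp == positive)
    fn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == positive and yp != positive)
    return tp, tn, fp, fn


def _ratio(num, den):
    """Return num/den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, specificity, sensitivity and F1-score from confusion counts."""
    sensitivity = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "specificity": _ratio(tn, fp + tn),
        "sensitivity": sensitivity,
        "f1": _ratio(2.0 * precision * sensitivity, precision + sensitivity),
    }


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # toy labels, 1 = disease present
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(classification_metrics(*confusion_counts(y_true, y_pred)))
```

In the toy example there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all four metrics equal 0.75.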

Average Error
It is the average error of the classifiers when classifying the gene expression data with the features selected by MWOA and SMWOA. [Figure 6] shows the average error of the classifiers for cancer detection with the features selected by MWOA and SMWOA. The X-axis shows the number of iterations and the Y-axis shows the average error of the classifier. When the number of iterations is 100, the average error of SMWOA is 12.22% less than that of MWOA. This analysis shows that the proposed SMWOA method has a lower average error than MWOA for cancer detection.

Conclusion
The proposed SMWOA accelerates the convergence speed of classifiers, enhances cancer detection accuracy and effectively improves the exploration and exploitation strategy of MWOA. LF and quadratic interpolation are introduced in SMWOA to enhance classification accuracy. With the help of Lévy flight, MWOA breaks out of local optima through frequent short-distance search steps and occasional longer-distance steps, while quadratic interpolation improves solution accuracy by enhancing the exploitation ability. In addition, a self-adaptive control parameter improves the balance between local exploitation and global exploration. The experiments indicate that the proposed SMWOA-ANN improves on the other methods for microarray data classification in terms of accuracy, specificity, sensitivity, F1-score and average error. However, SMWOA may get stuck in a part of the Pareto-optimal problem since multiple objectives are used in SMWOA. In the future, an efficient technique will be introduced to solve the Pareto-optimal problem and obtain a better final subset of features for microarray data classification.