Performance Analysis of Different Classifiers in Prediction of Breast Cancer

Objectives: The main motivation is to build a prediction model for diagnosis. The aim is to predict breast cancer at an early stage, which improves the chance of long-term survival of patients. Methods/Statistical Analysis: In the medical field, classifying the tissue surrounding malignant cancer cells into benign and malignant categories is an extremely challenging prediction task. For diagnosis of the disease, Naive Bayesian [NB], Support Vector Machine [SVM] and Artificial Neural Network [ANN] classification systems are investigated, and Fuzzy C-Means Clustering [FCM] is used to form clusters. The FCM algorithm clusters the data with simulated annealing, and the clustered data is then classified using the above-mentioned classifiers in order to develop the best prediction model with predefined rules. The performance is validated with K-fold cross validation. Findings: The Wisconsin Breast Cancer Dataset [WBCD] from the UCI repository is used to test the performance of the classifiers. This dataset holds 10 attributes with 699 records. The dataset is clustered into benign and malignant groups. Within the clusters, simulated annealing is used to reach the global optimum, and the classifiers are then applied to the clusters. In this study, FCM with simulated annealing and the Naive Bayesian classifier proves to be the best combination, with 89.2% accuracy and an F-measure of 0.9417. Various performance metrics are computed for the proposed model and compared with existing values. The results indicate that the Naive Bayesian classifier works well for data without dependencies, as there is no affinity between the attributes, and it is considered the most noteworthy of the classifiers studied. Application/Improvements: The prediction model can be used for predicting any disease in the medical domain, and can be further improved by using the Farthest First Clustering [FFC] algorithm.


Introduction
Machine learning techniques are extensively used in medical applications, including the identification and classification of tumours. Machine learning is improving diagnosis, predicting outcomes, and beginning to scratch the surface of personalized care. It is mainly used as an aid for cancer analysis and prediction. Recent cancer research has attempted to apply machine learning methods to cancer prediction and prognosis. Machine learning remains a powerful field because it allows decisions to be made that could not be made using conventional methodologies 1. Predictive analysis is one of the important areas of data mining; it deals with extracting information from data and is used to predict trends and behaviour patterns.
Predictive analysis is a prominent statistical technique with the capability to build predictive models 2. In data mining, breast cancer is an important research topic in medical science. Breast cancer is the most common invasive cancer in women, with more than one million cases and deaths occurring worldwide annually.
Detecting breast cancer at an early stage is an efficient way to reduce deaths due to cancer. The objective is to predict breast cancer at an early stage, which supports long-term survival of patients. A complicated test for the initial diagnosis makes it hard to obtain the final results 3.
In predictive analysis, determining the outcome of a disease is one of the most interesting and challenging tasks. The lack of analysis of accurate and important data in medical science can be addressed by handling large volumes of data with machine learning techniques. These algorithms can be used directly to determine the final outcome by exploiting different classification techniques in data mining. For predicting breast cancer accurately, there are various possible approaches, such as supervised and unsupervised learning. Supervised learning includes the decision tree, a popular classification approach in knowledge discovery and data mining that classifies labeled training data into a tree or rules; the Artificial Neural Network (ANN), a mathematical or computational model based on biological neural networks; K-nearest neighbour, which classifies a sample according to its closest training examples; SVM, which builds an optimal separating boundary between classes by solving an optimization problem; and classification systems constructed by association rule discovery techniques. Unsupervised learning includes clustering, which discovers useful patterns within the data. Semi-supervised learning, also called inductive learning, deduces labels for an unlabeled dataset 4.
The paper is organized as follows: related work on the diagnosis of breast cancer is reviewed in Section II. Section III briefly explains the methods used in the existing and proposed models. The system design is presented in Section IV. The experimental results and data set are described in Section V. Finally, Section VI concludes the study.

Literature Survey
The work in 5 compared various classification models, such as Bayes Net, Naive Bayesian and Sequential Minimal Optimization [SMO], for cancer prognosis. In classification, dimensionality reduction is used in order to remove features that contribute little or do not influence the result. The gain ratio technique is used to remove undesirable features, and a ranker algorithm is used to rank the features by their gain ratio values. The reduction step removes the features with the lowest gain ratio values. Among ten classification algorithms, the Bayes Net classifier provides the best accuracy, but the time taken to build the model is large.
The work in 6 compared decision tree algorithms such as ID3, CART and C4.5. ID3 uses the information gain approach to determine a suitable attribute for each node of the generated decision tree. The disadvantage of the ID3 algorithm is that it cannot handle continuous values and accepts only categorical attributes. C4.5 is an extension of ID3; it is based on Hunt's algorithm and can handle both categorical and continuous attributes when building a decision tree. It uses gain ratio as the attribute selection measure, which removes the bias of information gain. The disadvantage is that the time taken to build the model is too large.
The work in 7 diagnosed cancer by combining farthest first clustering, an Outlier Detection Algorithm (ODA) and the J48 decision tree. After clustering the data, ODA is used to identify deviations within the clusters formed. The clusters are given as input to J48, which has two parts: tree building and pruning. The advantage is better performance, since clustering is sped up and outliers are removed. The limitation of this technique is that it is computationally expensive and building the decision tree is time consuming.
The work in 3 proposes a hybrid DT-SVM approach as a predictive framework for breast cancer. The first stage is data processing and feature extraction, followed by prediction with the DT-SVM hybrid model. The input features for the SVM were optimized using the DT algorithm. The advantage of the hybrid model is that it is robust to noise and yields good accuracy. The disadvantage is that the accuracy depends on the choice of kernel and the model is computationally expensive.
The work in 8 compared different clustering techniques, namely FCM, K-means and EM (Expectation Maximization) clustering. FCM and K-means play a fundamental role in intrusion detection systems because clustering does not require any labeling information. K-means is an iterative clustering algorithm that moves items among the set of clusters until the desired set is reached. Among them, K-means gives superior results, but FCM provides results close to those of K-means. K-means is an exclusive clustering, whereas FCM is an overlapping clustering. FCM is better for detection as it has a high detection rate and a low false positive rate, though it is time consuming.
The work in 9 proposes a Simulated Annealing based Fuzzy Classification System (SAFCS). Initially, if-then fuzzy rules are developed, and perturb operations are applied to generate new fuzzy rules. SAFCS is compared with C4.5, which depends on entropy criteria and pruning techniques. Both classification methods are applied to different datasets, and SAFCS achieves better results in terms of accuracy on both the training set and the testing set. The disadvantages are the execution time of the prediction model and the difficulty of choosing the cooling rate.

Methodologies
The following methods are employed in this study.

Data Collection and Pre-Processing
The dataset is the Wisconsin (Original) Breast Cancer dataset (WBC), collected from the UCI Machine Learning Repository. WBC has 699 instances, 2 class labels (2 for benign and 4 for malignant) and 11 attributes. The attributes are cardinal valued. The dataset contains missing values, marked '?'. The dataset is pre-processed by a single imputation method, i.e., each missing value is replaced by the mean value of that variable. The advantage is that the sample mean remains unchanged. The Breast Cancer dataset is provided in Table 1.
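The mean imputation step can be illustrated with a minimal sketch. The function name `impute_mean` and the toy records are hypothetical and are not taken from WBC; the sketch only assumes that missing entries are marked with '?', as stated above.

```python
def impute_mean(rows, missing="?"):
    """Replace each missing entry with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [float(r[j]) for r in rows if r[j] != missing]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if r[j] == missing else float(r[j]) for j in range(n_cols)]
        for r in rows
    ]


# Toy records (hypothetical values, not taken from WBC)
records = [["5", "1", "1"], ["3", "?", "2"], ["4", "3", "?"]]
filled = impute_mean(records)

# The mean of the second column is the same before and after imputation.
mean_after = sum(r[1] for r in filled) / len(filled)
```

In this toy example the observed values of the second column are 1 and 3, so the missing entry becomes 2.0 and the column mean stays at 2.0, which is the property the text refers to.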

Fuzzy C-Means Clustering
The Fuzzy C-Means Clustering (FCM) algorithm is a soft clustering method in which one data point can belong to more than one cluster. FCM is an unsupervised clustering algorithm that is applied in agricultural engineering, astronomy, image analysis and medical diagnosis 8. In FCM, a degree of membership is assigned to each data point, based on which the data points are assigned to clusters.
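A minimal sketch of the membership idea is given below. It assumes the standard FCM update rules (membership inversely related to relative distance, centers as membership-weighted means); the paper does not list its exact update equations, so this is an illustration rather than the authors' implementation. The function names and the toy points are hypothetical; the default m=1.4 follows the fuzziness index reported for the existing system.

```python
import math


def fcm_memberships(points, centers, m=1.4):
    """u[j][i] = degree to which point j belongs to cluster i (each row sums to 1)."""
    power = 2.0 / (m - 1.0)
    u = []
    for x in points:
        d = [math.dist(x, c) for c in centers]
        if min(d) == 0.0:
            # Point sits exactly on a center: split full membership among such centers.
            hits = [1.0 if di == 0.0 else 0.0 for di in d]
            u.append([h / sum(hits) for h in hits])
        else:
            u.append([1.0 / sum((di / dk) ** power for dk in d) for di in d])
    return u


def fcm_centers(points, u, m=1.4):
    """Each center is the mean of all points, weighted by membership ** m."""
    dim = len(points[0])
    centers = []
    for i in range(len(u[0])):
        w = [row[i] ** m for row in u]
        centers.append(
            [sum(wj * p[k] for wj, p in zip(w, points)) / sum(w) for k in range(dim)]
        )
    return centers


def fcm(points, centers, m=1.4, iterations=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    for _ in range(iterations):
        u = fcm_memberships(points, centers, m)
        centers = fcm_centers(points, u, m)
    return centers, fcm_memberships(points, centers, m)


# Toy 2-D points forming two obvious groups (hypothetical data)
points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
centers, u = fcm(points, centers=[[0, 0], [10, 10]])
```

Each row of `u` holds the membership degrees of one point and sums to 1; a hard cluster label can be obtained by taking the cluster with the largest membership.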

Simulated Annealing
Simulated Annealing (SA) is an iterative method inspired by the annealing of metals. It is mainly used as an optimization search paradigm to escape from local minima and to reach the global optimum. SA has been widely applied to a range of combinatorial optimization problems and achieves good results 10. The optimization proceeds by accepting moves that degrade the solution quality with a probability governed by a parameter called temperature. The temperature is gradually reduced according to a cooling schedule. The algorithm terminates when the temperature reaches zero. The parameters required for simulated annealing are the starting temperature, the final temperature and the temperature decrement. One approach to reducing the temperature is a simple linear method, in which a fixed decrement is subtracted from the temperature at each step.
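The annealing loop with a linear cooling schedule can be sketched as follows. The acceptance rule used here (accept a worse move with probability exp(-delta / t)) is the usual Metropolis criterion and is an assumption, since the paper does not state its rule; the objective function, the neighbourhood move and all parameter values are hypothetical.

```python
import math
import random


def simulated_annealing(cost, start, neighbour,
                        t_start=10.0, t_final=0.0, decrement=0.01, seed=0):
    """Minimise `cost` by annealing from t_start down to t_final with linear cooling."""
    rng = random.Random(seed)
    current = best = start
    t = t_start
    while t > t_final:
        candidate = neighbour(current, rng)
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept worse moves with probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current = candidate
            if cost(current) < cost(best):
                best = current
        t -= decrement  # linear cooling: subtract a fixed decrement each step
    return best


def cost(x):
    # Toy objective with several local minima (hypothetical)
    return x * x + 10.0 * math.sin(3.0 * x)


def neighbour(x, rng):
    # Random step of at most 1.0 in either direction
    return x + rng.uniform(-1.0, 1.0)


start = 4.0
best = simulated_annealing(cost, start, neighbour)
```

At high temperature almost any move is accepted, which lets the search leave a local minimum; as the temperature falls, worse moves are accepted less and less often.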

Decision Tree [C4.5]
For classification problems, C4.5 is a supervised algorithm that generates a decision tree (DT). It improves on the ID3 algorithm by handling both continuous and discrete attributes, missing values and tree pruning 6. C4.5 builds decision trees from a set of training data by calculating the information gain for each attribute. The attribute with the highest information gain is taken as the root of the decision tree. The information gain of each attribute, sorted in descending order, is shown in Table 2.
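The ranking of attributes by information gain can be sketched as below. The sketch treats every attribute value as a discrete category, which is a simplifying assumption; the attribute names and toy values are hypothetical and do not reproduce Table 2.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on `values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data: class 2 = benign, 4 = malignant (attribute values are hypothetical)
labels = [2, 2, 4, 4]
attributes = {
    "attr_a": [1, 1, 10, 10],  # separates the classes perfectly
    "attr_b": [1, 10, 1, 10],  # carries no information about the class
}
gains = {name: information_gain(vals, labels) for name, vals in attributes.items()}
ranked = sorted(gains, key=gains.get, reverse=True)  # descending order of gain
```

The first entry of `ranked` is the attribute that would be chosen as the root of the tree.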

Support Vector Machine
Support vector machine [SVM] is used to solve binary classification problems by mapping the input data into a higher-dimensional feature space in which the classes can be separated linearly. To carry out this mapping, a kernel function is required 11. The kernel function is used to train the classifier, which selects the support vectors.
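How a kernel enters the SVM decision can be illustrated with a small sketch. The paper does not name the kernel it uses, so the RBF kernel here is an assumption, and the support vectors, labels, multipliers and bias are hypothetical hand-picked values rather than the output of a trained model.

```python
import math


def rbf_kernel(a, b, gamma=0.1):
    """RBF kernel: similarity that decays with the squared distance between a and b."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def svm_decision(x, support_vectors, labels, alphas, bias=0.0, kernel=rbf_kernel):
    """Sign of the kernelised decision function sum_i alpha_i * y_i * K(sv_i, x) + bias."""
    score = sum(
        a * y * kernel(sv, x) for sv, y, a in zip(support_vectors, labels, alphas)
    ) + bias
    return 1 if score >= 0 else -1


# Hypothetical support vectors: -1 stands for benign, +1 for malignant
support_vectors = [[1, 1], [9, 9]]
labels = [-1, 1]
alphas = [1.0, 1.0]

pred_low = svm_decision([1, 2], support_vectors, labels, alphas)
pred_high = svm_decision([8, 9], support_vectors, labels, alphas)
```

Only the support vectors appear in the sum, which is why training reduces to selecting them and their multipliers.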

Artificial Neural Networks
The Artificial Neural Network [ANN] is implemented as a three-layer neural network trained with the back propagation approach. The network has an input layer, a hidden layer and an output layer 12. Each layer contains elements called neurons, which are connected via links. The output layer consists of 2 neurons, which classify a sample as either benign or malignant. The back propagation approach is used to train the network; all activations are calculated in the forward pass. The error at the output nodes is measured directly by comparing the network output with the target output of the training set.
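One forward pass and one back propagation update for such a three-layer network can be sketched as follows. The sigmoid activation, squared-error loss, absence of bias terms, layer sizes, learning rate and input values are all assumptions made for illustration; the paper does not specify them.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Forward pass: compute hidden and output activations."""
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    o = [sigmoid(sum(w * hi for w, hi in zip(ws, h))) for ws in w_out]
    return h, o


def train_step(x, target, w_hidden, w_out, lr=0.5):
    """One back propagation update; returns the squared error before the update."""
    h, o = forward(x, w_hidden, w_out)
    # Output error is measured directly against the target values.
    d_out = [(ok - tk) * ok * (1.0 - ok) for ok, tk in zip(o, target)]
    # Hidden error is propagated back through the output weights.
    d_hid = [
        hj * (1.0 - hj) * sum(d_out[k] * w_out[k][j] for k in range(len(w_out)))
        for j, hj in enumerate(h)
    ]
    for k, ws in enumerate(w_out):
        for j in range(len(ws)):
            ws[j] -= lr * d_out[k] * h[j]
    for j, ws in enumerate(w_hidden):
        for i in range(len(ws)):
            ws[i] -= lr * d_hid[j] * x[i]
    return 0.5 * sum((ok - tk) ** 2 for ok, tk in zip(o, target))


rng = random.Random(0)
n_in, n_hidden, n_out = 3, 4, 2  # 2 output neurons: benign, malignant
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]

x, target = [0.5, 0.1, 0.1], [1.0, 0.0]  # hypothetical scaled sample and one-hot target
losses = [train_step(x, target, w_hidden, w_out) for _ in range(200)]
```

Repeating the step on a single sample should drive the recorded loss down, which is a quick sanity check that the gradients have the right sign.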

Naive Bayesian
In machine learning, the Naive Bayes algorithm is a simple probabilistic classifier based on Bayes' theorem with strong independence assumptions between the features 13. Naive Bayes classifiers are highly scalable, since the number of parameters is linear in the number of variables (features/predictors) of a learning problem. The Naive Bayesian classifier depends on Bayes' theorem and the theorem of total probability. The probability that a vector x = <x1, ..., xn> belongs to hypothesis h (class Y) is

$$P(Y \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n \mid Y)\, P(Y)}{P(X_1, \ldots, X_n)}$$
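Under the independence assumption, the class-conditional term factorises into a product over the individual features, which can be sketched as below. The sketch treats attribute values as categories and adds Laplace (add-one) smoothing; both choices, the function names and the toy data are assumptions and are not stated in the paper.

```python
import math
from collections import Counter, defaultdict


def train_nb(X, y):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(y)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(X, y):
        for j, v in enumerate(row):
            feature_counts[(label, j)][v] += 1
    n_values = [len({row[j] for row in X}) for j in range(len(X[0]))]
    return class_counts, feature_counts, n_values


def predict_nb(model, row):
    """Return the class with the largest log posterior (up to a constant)."""
    class_counts, feature_counts, n_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        log_p = math.log(count / total)  # log prior P(Y)
        for j, v in enumerate(row):
            # Laplace smoothing (assumption): avoids zero probabilities
            num = feature_counts[(label, j)][v] + 1
            den = count + n_values[j]
            log_p += math.log(num / den)  # log P(X_j | Y)
        scores[label] = log_p
    return max(scores, key=scores.get)


# Toy data: class 2 = benign, 4 = malignant (hypothetical values)
X = [[1, 1], [2, 1], [1, 2], [9, 8], [10, 9], [8, 10]]
y = [2, 2, 2, 4, 4, 4]
model = train_nb(X, y)
pred_benign = predict_nb(model, [1, 1])
pred_malignant = predict_nb(model, [9, 9])
```

The denominator P(X1, ..., Xn) is the same for every class, so it can be dropped when only the most probable class is needed.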

Existing System
In the existing method, a combined approach of Fuzzy C-Means clustering [FCM] with simulated annealing and the decision tree (C4.5) classifier is used for diagnosis of breast cancer 14. The pre-processed dataset is clustered using the FCM algorithm. In this algorithm, 'm' is a fuzziness index whose value lies in the range [1, ∞). The fuzziness index measures the tolerance of the required clustering: the larger the value of 'm', the larger the overlap between clusters. In general, m=1 gives crisp clustering and m=2 is used for fuzzy clustering. In this investigation, m=1.4 is chosen as the fuzziness index. The fuzzy membership degree µij lies in [0, 1]. The clustered data is then annealed, with the cooling schedule chosen as f(t)=4. The starting and final temperatures are chosen as the minimum and maximum of a feature, selected in a random manner. After clustering, the C4.5 classifier is used to classify the clustered dataset, and labels are predicted as either benign or malignant. The model is then cross validated using K-fold cross validation, here with K=10. The flow of the existing system is provided in Table 3.

Table 3. Existing System Diagnosis of Breast Cancer
Input: Pre-processed Wisconsin Breast Cancer data set.
Output: Benign or Malignant cancer with better accuracy.
Procedure:
a. Get dataset WBCD from the UCI Machine Learning Repository.
b. Apply Fuzzy C-Means clustering to the pre-processed dataset.
c. Apply simulated annealing to the clustered data. Apply FCM again to the output.
d. Repeat steps b and c until the minimum objective function is achieved.
e. Apply the C4.5 classification algorithm to the clustered data.
f. Diagnose the tumor patient as either benign or malignant with better accuracy using 10-fold cross validation.

Disadvantages of Existing Model
The drawbacks of the existing model are as follows. The tree structure is sensitive to the sampling of the training data. Although trees are generally robust to outliers, overfitting means that a decision tree tends not to produce good results on unseen data. Decision tree induction is also a greedy algorithm, which produces local optima.

Proposed System
In the proposed model, the pre-processed data is clustered by the Fuzzy C-Means Clustering [FCM] algorithm with Simulated Annealing. The clustered records are then classified by several classifiers, namely Naïve Bayesian (NB), Support Vector Machine (SVM) and Artificial Neural Network (ANN) models, for diagnosis of breast cancer. In the existing system, the decision tree classifier was used to classify the samples; due to overfitting, it provides only local optima.
After clustering, these classifier models are used to classify the clustered dataset, and labels are predicted as either benign or malignant. The model is then cross validated using K-fold cross validation (here K=10). The flow of the proposed system is provided in Tables 4 and 5.

Evaluation Metrics
In this section, a comparative study of the performance of the existing and proposed classification models is discussed, based on accuracy, error rate, F-measure, precision and recall. Accuracy measures how well the given tuples are classified correctly 7. TP refers to positive tuples and TN to negative tuples that are correctly classified by the classifiers. Similarly, FP refers to negative tuples incorrectly classified as positive and FN to positive tuples incorrectly classified as negative.

Precision
Precision is the ratio of true positive tuples to all tuples predicted as positive. Precision is given by

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall
Recall is the ratio of true positive tuples to all actually positive tuples, i.e., true positives plus false negatives. Recall is given by

$$\text{Recall} = \frac{TP}{TP + FN}$$

F-Measure
F-Measure is also called the F-Score. It is the harmonic mean of precision and recall, and its value varies from 0 to 1. A higher F-Measure indicates a better classifier. It is given by

$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Accuracy
The accuracy of a classifier is an important metric for evaluation. It is the ratio of correctly classified positive and negative tuples to all tuples. It is given by

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Error Rate
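The error rate is an essential measure for evaluation. A lower error rate indicates a better classifier. The error rate is the proportion of tuples for which the prediction differs from the actual class. It is given by

$$\text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN} = 1 - \text{Accuracy}$$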
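The five metrics of this section can be computed together from the four confusion-matrix counts, as in the sketch below. The function name `evaluate` and the counts used in the example are hypothetical and are not results from this study.

```python
def evaluate(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total
    error_rate = (fp + fn) / total
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "error_rate": error_rate,
    }


# Hypothetical counts for illustration only
metrics = evaluate(tp=40, tn=50, fp=5, fn=5)
```

The sketch does not guard against empty denominators (for example, a classifier that never predicts the positive class), which a full implementation would need to handle.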

Results
In this research, 10-fold cross validation is used to validate the results. The dataset is randomly divided into ten equal subsets. One partition acts as the testing set, while the remaining partitions act as the training set used to train the model. The existing and proposed classification models are compared on accuracy, error rate, F-measure, precision and recall.
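The partitioning used in 10-fold cross validation can be sketched as follows. The function name `k_fold_splits` and the fixed random seed are hypothetical choices; since 699 records do not divide evenly by ten, the folds in this sketch differ in size by at most one record.

```python
import random


def k_fold_splits(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # random assignment of records to folds
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


# 699 records, as in the WBC dataset
splits = list(k_fold_splits(699, k=10))
```

Each record is used for testing exactly once and for training in the other nine rounds; the reported metrics are then averaged over the ten rounds.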

Conclusion
The study presents a comparative analysis of several classifiers combined with clustering for the prediction of breast cancer. Fuzzy C-Means Clustering [FCM] with the Naive Bayesian classifier provides better prediction than the other classifiers, achieving the highest accuracy with a lower error rate. Its high F-Measure value also indicates that FCM with Naive Bayes is the better classifier, and it is suggested as a better prediction model for diagnosis of breast cancer.