Computer Aided Detection System using Entropy Based Segmentation as Decision Support for Detection of Neoplasms and Comparison of Feature Spaces for Classification

Objective: To design a computer-aided detection (CAD) system for the early detection of breast cancer from mammograms that can assist doctors in diagnosis. Methodology: The proposed CAD system uses the textural differences between abnormal and normal mammograms to detect breast cancer. Gabor features and gray-level co-occurrence matrix (GLCM) features are extracted from the region of interest of mammograms segmented using entropy-based segmentation. A Support Vector Machine (SVM) classifier is used to classify the mammograms into cancerous and non-cancerous cases. Thirty-five normal and abnormal mammograms are taken from the Mammographic Image Analysis Society (MIAS) data set. The MIAS database is chosen because its data are more challenging: the images include a lot of unwanted tissue and flesh whose intensity level is higher than that of the micro-calcifications. Findings: The classification accuracies obtained are 92.98% and 98.11% using Gabor and GLCM features respectively. The sensitivity achieved with the GLCM features is 100%, i.e. no cancerous case is missed. The classification accuracy is higher with GLCM features than with Gabor features, which shows the superiority of GLCM features in capturing the texture of the mammograms. Novelty: The removal of the pectoral muscle is an important pre-processing step; the mammograms can be segmented only after the pectoral muscle has been properly eliminated. The method proposed in this paper removed the pectoral muscle from all the mammograms used in the study. The entropy-based segmentation and the technique of removing the non-contributing features outperform the other CAD systems available in the literature.
These are very promising results for the successful design of a computer-aided detection system for early detection of breast cancer that can be put to clinical trials or used for double reading.


Introduction
For the diagnosis of cancer from mammograms, doctors are required to examine a large number of medical images, which is a highly complex task and is prone to human error. Computer-aided classification can reduce the intra- and inter-observer differences among doctors and plays a significant role in diagnosis.
The first stage in the development of a CAD system is pre-processing, which is very important for proper segmentation of the mammograms. It includes noise removal, image enhancement and pectoral muscle removal. In (1), a rectangle is used to isolate the pectoral muscle from the region of interest (ROI), and the pectoral muscle is suppressed using the seeded region growing (SRG) algorithm. Ma and Manjunath (2) applied a boundary detection scheme based on edge flow to remove the pectoral muscle. In (3), the noise in mammographic images is suppressed and edge enhancement is performed based on the wavelet transform. A noise equalization algorithm is proposed in (4) for obtaining images in which the local contrast is equal at all image intensities. A number of image enhancement techniques are described in (5). An important step in pre-processing is the elimination of the pectoral muscle. The method proposed in this paper eliminates the pectoral muscle from all the mammograms used in the work. The Difference of Gaussian (DoG) filter removes the noise while conserving the frequency information in the mammograms.
Various segmentation techniques are found in the literature. In (6), the Chan-Vese level set is used to trace the contours of the mammograms for segmentation. A scalable approach for retrieval and diagnosis of mammographic masses is proposed in (7). The authors of (8) applied a region-based method of image edge-profile acutance to characterize the variation in density of a region of interest (ROI). In this paper, an entropy-based segmentation technique is used, followed by extraction of the region.
The texture features most commonly used to describe mammographic images are described in (9); GLCM and Gabor features are used in this work. In the classification step, the cases are classified as benign or malignant (10,11). A number of classification methods, both unsupervised and supervised, are used for mammograms. In (12), K-means clustering is employed for classification.
Computer-aided diagnosis systems provide this fully automatic detection, but they are still not in clinical use. CAD systems can, however, be used for double reading to improve reader performance in breast cancer detection. CAD systems reduce the dependence of mammographic screening results on the expertise of the radiologists (13).

Materials and Methods
Computer-aided detection (CAD) is an artificial-intelligence-based system. The steps used in this system are image preprocessing, segmentation, feature extraction and classification. The general algorithm used in developing the CAD system is shown in Figure 1. The steps involved are explained in the following sections. https://www.indjst.org/

Dataset
The Mammographic Image Analysis Society (MIAS), a research organization in the United Kingdom, provides mammograms of normal and cancerous patients (14). The size of these mammograms is 1024 × 1024 pixels. Thirty-five normal and abnormal mammograms from this data set are used in this work.

Image Preprocessing
Pre-processing is necessary for detecting the abnormalities without interference from other regions of the mammogram such as labels and the pectoral muscle. Because of their fuzzy nature, digital mammograms are hard to interpret, so a pre-processing algorithm is applied to enhance the quality of the image for accurate segmentation. The mammogram is flipped to the left and the black border is removed. The image is thresholded with Otsu's method, and Moore's algorithm is used for boundary detection. Artefacts and labels in the breast background region are removed. The pectoral muscle is removed using a novel method (15). The noise is removed while retaining the spatial frequency components using the Difference of Gaussian (DoG) filter.
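Two of the pre-processing steps named above, Otsu thresholding and DoG filtering, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the 8-bit input assumption and the two Gaussian widths (sigma_small, sigma_large) are hypothetical choices made only for illustration, and border removal, Moore boundary tracing and pectoral muscle removal are not covered.

```python
import numpy as np

def otsu_threshold(image: np.ndarray) -> int:
    """Grey level that maximizes the between-class variance (8-bit image assumed)."""
    hist = np.bincount(image.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    omega = np.cumsum(prob)           # probability of the class "<= level"
    mu = np.cumsum(prob * levels)     # cumulative first moment
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu[-1] * omega - mu) ** 2 / denom
    sigma_b[denom == 0] = 0.0         # levels where one class is empty
    return int(np.argmax(sigma_b))

def gaussian_blur(image: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur with reflected borders."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(image.astype(float), radius, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def difference_of_gaussians(image, sigma_small=1.0, sigma_large=2.0):
    """Band-pass DoG response; the two sigmas are illustrative assumptions."""
    return gaussian_blur(image, sigma_small) - gaussian_blur(image, sigma_large)
```

A breast mask could then be taken as `image > otsu_threshold(image)`, with the DoG response computed on the masked image; the order of these operations in the original pipeline is as described in the text, not as fixed by this sketch.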

Extraction of Region of Interest (ROI) and Segmentation
The region of interest is extracted using an automatic cropping algorithm, which uses Gaussian blurring and Otsu's thresholding method. The suspected region of neoplasm is located using texture-based entropy thresholding.
In this method, a threshold t is calculated on the texture image, which contains the texture information. The threshold t partitions the grey-level transition (co-occurrence) matrix of the texture image into four quadrants, A, B, C and D. Pixels above the grey level t are taken as foreground (F) and the remaining pixels as background (B). The probabilities are the normalized transitions obtained from the transitions s_ij from grey level i to grey level j, as given in Equations 1 to 5. The local transition entropy of BB and the joint transition entropy of FB are defined in Equations 6 and 7 respectively. The total entropy is given by Equation 8.
The optimal threshold t is defined as the one that maximizes the sum of the BB and FB entropies, as in Equation 9.
The threshold generated from this new combined entropy, the sum of the local background-to-background and the foreground-to-background entropies, gives better segmentation results than thresholds generated from the local or the joint entropy alone. The threshold obtained from the local entropy is better than that obtained from the joint entropy. Therefore, this new entropy is chosen, and the segmentation algorithm is applied to the regions of interest containing masses.
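Equations 1 to 9 are not reproduced in this text. As an aid to the reader, one common formulation that is consistent with the description above is sketched here. It assumes L grey levels, a transition count s_ij, a quadrant ordering in which BB covers i ≤ t, j ≤ t and FB covers i > t, j ≤ t, and a factor of 1/2 in the second-order entropies; the original equations may use different labels or normalization constants.

```latex
\begin{align*}
p_{ij} &= \frac{s_{ij}}{\sum_{k=0}^{L-1}\sum_{l=0}^{L-1} s_{kl}} \\
P_{BB}(t) &= \sum_{i=0}^{t}\sum_{j=0}^{t} p_{ij}, \qquad
P_{FB}(t) = \sum_{i=t+1}^{L-1}\sum_{j=0}^{t} p_{ij} \\
H_{BB}(t) &= -\frac{1}{2}\sum_{i=0}^{t}\sum_{j=0}^{t}
  \frac{p_{ij}}{P_{BB}(t)} \log_2 \frac{p_{ij}}{P_{BB}(t)} \\
H_{FB}(t) &= -\frac{1}{2}\sum_{i=t+1}^{L-1}\sum_{j=0}^{t}
  \frac{p_{ij}}{P_{FB}(t)} \log_2 \frac{p_{ij}}{P_{FB}(t)} \\
H(t) &= H_{BB}(t) + H_{FB}(t) \\
t^{*} &= \arg\max_{0 \le t \le L-2} H(t)
\end{align*}
```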
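The threshold search described above can be sketched in NumPy as follows. This is an illustrative reading of the method, not the authors' code: it assumes 8-bit integer grey levels, counts transitions to the right and lower neighbours, takes BB as the quadrant i ≤ t, j ≤ t and FB as the quadrant i > t, j ≤ t, and uses a factor of 1/2 in the entropies. All function names are hypothetical.

```python
import numpy as np

def transition_matrix(image: np.ndarray, levels: int = 256) -> np.ndarray:
    """Count grey-level transitions s[i, j] to the right and lower neighbours."""
    s = np.zeros((levels, levels), dtype=float)
    np.add.at(s, (image[:, :-1].ravel(), image[:, 1:].ravel()), 1)
    np.add.at(s, (image[:-1, :].ravel(), image[1:, :].ravel()), 1)
    return s

def quadrant_entropy(block: np.ndarray) -> float:
    """Second-order entropy of one quadrant, normalized within the quadrant."""
    total = block.sum()
    if total == 0:
        return 0.0
    p = block[block > 0] / total
    return float(-0.5 * np.sum(p * np.log2(p)))

def entropy_threshold(image: np.ndarray, levels: int = 256) -> int:
    """Threshold t that maximizes H_BB(t) + H_FB(t)."""
    s = transition_matrix(image, levels)
    best_t, best_h = 0, -np.inf
    for t in range(levels - 1):
        h_bb = quadrant_entropy(s[: t + 1, : t + 1])   # background -> background
        h_fb = quadrant_entropy(s[t + 1 :, : t + 1])   # foreground -> background
        if h_bb + h_fb > best_h:
            best_t, best_h = t, h_bb + h_fb
    return best_t
```

Under these assumptions, the segmented foreground is `texture_image > entropy_threshold(texture_image)`, where `texture_image` stands for the integer-valued texture image mentioned in the text.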

Feature Extraction
The texture features strongly affect the overall accuracy of the classifier, so their proper extraction is crucial in the design of the CAD system. Gabor texture features and GLCM features are used in this work. These two feature spaces are compared to assess their efficacy for classification.

Gabor Transform Features
The texture of the mammograms is extracted using Gabor texture features, which are used for texture representation and discrimination. Bovik et al. employed a technique for finding the coefficients of the filters by using texture power spectrum features (16). A dyadic Gabor filter bank has been used to analyze the spatial-frequency domain (17). Many research papers report Gabor features to be effective in extracting textural descriptions (18)(19)(20).
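The text does not specify the filter bank used in this work. The following NumPy sketch shows one way a set of Gabor texture features can be computed, taking the mean magnitude of each filter response as the feature. The five frequencies, four orientations, kernel size and Gaussian width are assumptions chosen only so that the bank yields twenty values; they are not taken from the paper.

```python
import numpy as np

def gabor_kernel(frequency, theta, sigma=4.0, size=21):
    """Complex Gabor kernel: Gaussian envelope times a complex sinusoid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot ** 2 + y_rot ** 2) / (2 * sigma ** 2))
    carrier = np.exp(2j * np.pi * frequency * x_rot)
    return envelope * carrier

def filter_fft(image, kernel):
    """Circular convolution via the FFT; adequate for a texture-energy sketch."""
    shape = image.shape
    return np.fft.ifft2(np.fft.fft2(image, shape) * np.fft.fft2(kernel, shape))

def gabor_features(image, frequencies=(0.05, 0.1, 0.2, 0.3, 0.4), n_orient=4):
    """Mean response magnitude per (frequency, orientation) pair."""
    image = image.astype(float)
    image = image - image.mean()      # remove the DC component
    features = []
    for f in frequencies:
        for k in range(n_orient):
            theta = k * np.pi / n_orient
            response = np.abs(filter_fft(image, gabor_kernel(f, theta)))
            features.append(response.mean())
    return np.array(features)
```

The ROI is assumed to be at least as large as the kernel (21 × 21 pixels here). Each feature measures how much texture energy the ROI has near one spatial frequency and orientation.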

Gray Level Co-Occurrence Matrix (GLCM)
Haralick's texture features are extracted from the segmented mammograms. Haralick proposed the GLCM for texture analysis using co-occurrence probabilities (9). The GLCM is a 2-D histogram of the gray levels of pairs of pixels separated by a fixed spatial distance. The GLCM of an image is computed for a radius d and an orientation θ. The number of rows and columns of the matrix equals the number of gray levels G in the image.
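The construction of the GLCM for a given d and θ, and three of the Haralick features discussed in this paper (energy, contrast and inverse difference moment), can be sketched in NumPy as below. The sketch assumes an integer image already quantized to G = `levels` gray levels and a symmetric, normalized matrix. The function names are hypothetical and the remaining Haralick features are omitted.

```python
import numpy as np

def glcm(image: np.ndarray, d: int = 1, theta_deg: int = 0, levels: int = 8) -> np.ndarray:
    """Symmetric, normalized G x G co-occurrence matrix for offset (d, theta)."""
    offsets = {0: (0, d), 45: (-d, d), 90: (-d, 0), 135: (-d, -d)}
    dr, dc = offsets[theta_deg]
    rows, cols = image.shape
    r0, r1 = max(0, -dr), min(rows, rows - dr)
    c0, c1 = max(0, -dc), min(cols, cols - dc)
    src = image[r0:r1, c0:c1]                       # reference pixels
    dst = image[r0 + dr:r1 + dr, c0 + dc:c1 + dc]   # neighbours at the offset
    m = np.zeros((levels, levels), dtype=float)
    np.add.at(m, (src.ravel(), dst.ravel()), 1)
    m = m + m.T                                     # count both directions
    return m / m.sum()

def haralick_subset(p: np.ndarray) -> dict:
    """Energy, contrast and inverse difference moment of a normalized GLCM."""
    i, j = np.indices(p.shape)
    return {
        "energy": float(np.sum(p ** 2)),
        "contrast": float(np.sum((i - j) ** 2 * p)),
        "inverse_difference_moment": float(np.sum(p / (1.0 + (i - j) ** 2))),
    }
```

A perfectly homogeneous region gives energy 1, contrast 0 and inverse difference moment 1, which matches the observation in this paper that energy and inverse difference moment are high for homogeneous texture.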

Results
The mammograms after label removal, pectoral muscle removal and application of the DoG filter are shown in Figure 2. All twenty Gabor texture features are extracted and evaluated for both the normal and the abnormal set. The Gabor feature space for normal and abnormal mammograms is plotted in Figure 6. The feature spaces for normal and abnormal cases are found to overlap for most of the Gabor features.
The thirteen Haralick texture features are extracted from the GLCM. Features such as energy and inverse difference moment are high for normal candidates, as they have a more homogeneous texture than masses. Malignant masses give a higher measure of contrast than benign masses because of their highly radiopaque nature. All thirteen Haralick texture features are evaluated for both the normal and the malignant/benign set. The GLCM feature space for normal and abnormal mammograms is plotted in Figure 7. Most of the features for normal and abnormal mammograms are found to be well separated in the feature space. The averages of these features for Gabor and GLCM are plotted in Figure 8(a) and (b) and Figure 9(a) and (b). As indicated in Figures 8 and 9, the GLCM features are much better separated than the Gabor features for normal and abnormal cases, so they are expected to give better classification accuracy.
The confusion matrices for the classification of masses and non-masses using Gabor features and the SVM classifier over six iterations are shown in Figure 10. Classification and validation measures using the Gabor texture features evaluated over the ROI image are shown in Table 1, and the Receiver Operating Characteristic (ROC) curves in Figure 11. The accuracy achieved with Gabor features is 91.91%. As seen in Figure 6, the features 13, 17, 18 and 18 for the normal and cancerous cases lie in the overlapping space. So these features are considered not to be useful for the classification. The confusion matrices for the classification of malignant and benign mammograms using GLCM features and the SVM classifier over six iterations are shown in Figure 12.
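The validation measures reported in this paper (accuracy, sensitivity and specificity) follow directly from the entries of a confusion matrix. A minimal sketch is given below. The function name is hypothetical, and the counts in the usage example are invented for illustration only; they are not the values from the figures or tables of this paper.

```python
def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn: cancerous cases classified correctly / missed
    tn/fp: non-cancerous cases classified correctly / flagged as cancerous
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must contain at least one case")
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # 1.0 means no missed cancerous case
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts, for illustration only (not results from this paper).
example = classification_metrics(tp=18, fn=0, fp=1, tn=17)
```

A sensitivity of 1.0 corresponds to zero false negatives, which is the property the paper emphasizes for the GLCM features.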

Discussion
The accuracy achieved with GLCM features is 97.21%. As seen in Figure 7, the features inertia, difference variance and information measure of correlation 2 for the normal and abnormal cases lie in the overlapping space. These features are therefore considered not useful for the classification and are omitted. By omitting them, the classification accuracy improves from 97.21% to 98.11%, with a reduced standard deviation of ±1.51%. The performance of the proposed CAD system is compared with that of existing systems from the literature in Table 3; the accuracy achieved in this work is the highest among them. The entropy-based segmentation works well for the mammograms, so that texture features can be extracted for classification. The technique of removing the features that do not contribute to the classification increases the classification accuracy significantly.
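In this paper the overlapping features are identified from the feature-space plots. The text gives no numerical criterion, so the following NumPy sketch shows only one hypothetical way to automate such a selection: it measures how much of the combined value range of a feature is shared by the two classes and drops features whose overlap exceeds a chosen limit. The overlap measure, the `max_overlap` value and the function names are all assumptions.

```python
import numpy as np

def overlap_fraction(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of the combined value range that both classes share."""
    shared = min(a.max(), b.max()) - max(a.min(), b.min())
    span = max(a.max(), b.max()) - min(a.min(), b.min())
    if span == 0:
        return 1.0                     # identical constant values overlap fully
    return max(0.0, shared) / span

def select_features(normal: np.ndarray, abnormal: np.ndarray, max_overlap: float = 0.5) -> list:
    """Indices of the feature columns whose class ranges overlap little."""
    return [
        j for j in range(normal.shape[1])
        if overlap_fraction(normal[:, j], abnormal[:, j]) <= max_overlap
    ]
```

Here `normal` and `abnormal` are arrays with one row per mammogram and one column per feature; the retained column indices would then be passed on to the classifier.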

Conclusion
This study proposed a CAD system for cancer detection using mammography. The fuzzy nature of mammograms makes segmentation and detection a difficult task. In this work, an algorithm has been developed for pectoral muscle removal. Pectoral muscle removal, along with the other pre-processing steps and exact extraction of the ROIs, is necessary for proper segmentation of mammograms. The edges are preserved for better classification accuracy. The segmentation has been successfully carried out using entropy-based segmentation with good results. Two sets of texture features, GLCM and Gabor, are employed to achieve the maximum classification accuracy. The performance of the CAD system with the GLCM features was found to be better than with the Gabor features, because the feature spaces for normal and abnormal mammograms overlap more for Gabor features than for GLCM features; this is the reason for the lower classification accuracy with Gabor texture features. The results obtained with the segmentation technique are found to be better than those of the other techniques present in the literature. The performance of the Gabor and GLCM texture features is compared using the SVM classifier with a Radial Basis Function (RBF) kernel for classifying the mammograms into masses/non-masses. The classification accuracies achieved are 92.98% and 98.11% for the reduced Gabor and GLCM feature sets respectively. The novel, robust and simple technique of removing the non-contributing features has increased the classification accuracy for both Gabor and GLCM features. The results obtained in this work are even better than those of recent learning techniques such as deep learning and convolutional neural networks, as shown in Table 3. The reason lies in the precisely segmented mammograms from which the features are extracted. The sensitivity achieved is 100%, i.e. there is no false negative case, which is very important in medical diagnosis. The cost of misclassification can be very large in the case of false negatives. In breast cancer detection, classifying a positive case as negative (a false negative) is much more serious (expensive) than a false positive, in which a healthy patient is diagnosed as cancerous: a false negative can result in the patient's death because of the incorrect diagnosis and the delay in treatment.
CAD-based detection of breast cancer helps the radiologist in accurate diagnosis. The sensitivity and specificity obtained with GLCM features are quite high, resulting in a small number of false positives and false negatives. The encouraging findings in this paper make clinical trials of the CAD system possible. Manual reading is subject to inter- and intra-observer variability among radiologists, which results in false negatives and false positives and hence in unnecessary further investigations and a potential threat to the life of the patient. This system can reduce such aspects of the manual reading of mammograms.
In the future, the proposed system can be applied to other databases such as INbreast or DDSM for further validation, as the use of mammograms only from the MIAS database limits its generality. More recent machine learning algorithms can be used for classification of these segmented images with the same features, or other features such as shape features may be added for better results.