Extraction with Map-Reduce Framework and Correlation-based Feature Selection in Lung Cancer Towards Big Data

Background/objectives: To extract the nucleus and cytoplasm from high-dimensional images of raw sputum cells of all types and to optimize the features derived from them. The following features are calculated efficiently: area, perimeter, intensity, NC ratio, and circularity. Methods/Statistical analysis: We introduce a map-reduce framework that separates similar cells in sputum cell images collected from microscope laboratories at the intended magnification and staining. To keep the model from learning irrelevant features, correlation-based feature selection supplies the appropriate features, which are then fed to classification. The features are converted to vectors for estimating symmetric uncertainty, a correlation-based measure. Findings: Performance evaluation metrics are used to measure the resulting performance. Although many works rely on feature extraction, our work combines feature extraction with a map-reduce framework, which improves classification accuracy. The proposed method makes extraction of the nucleus and cytoplasm easier than other methods, and the proposed feature selection yields optimized performance. Novelty/applications: The accuracy obtained for every feature with the proposed approach is higher than in existing works. In addition, ROC curves show a higher true positive rate even as the dataset grows. Another significant innovation of our work is that the map-reduce framework is applied to images to sort cells with respect to staining.


Introduction
Lung cancer affects a large and rapidly growing number of people. Its contentious character lies in its treatment and outcomes. Because many types of lung cancer grow quickly and spread rapidly, and because the lungs are vital organs, early detection and prompt treatment, usually surgery to remove the tumor, are critical. Medical diagnosis offers several approaches to detect and treat lung cancer. Computed tomography images and sputum cell images are used for lung cancer prediction and classification. Image quality, which is improved through image processing techniques, determines how useful the features are for predicting lung cancer. Classification of lung cancer proceeds by extracting features such as area, perimeter, and eccentricity, followed by feature selection. Even though the features are categorical in nature, some extracted features take large values that are irrelevant to the expected outcomes. Hence, optimal feature selection techniques are proposed. Meanwhile, inconsistency across various types of sputum images, such as eosinophilia, bronchial mucus, and squamous carcinoma cells, is removed by an underlying map-reduce framework, and the images are then fed to feature extraction using MATLAB. Furthermore, feature selection and classification run in the ML-PYSPARK environment for parallel processing of large datasets.
In Ref. [1], Lin proposes a PQD recognition method based on image enhancement techniques for optimal analysis of feature importance and effective classification. The Gini index is used to evaluate sequential forward search and to select the optimal feature subset. Disturbance features are extracted and eliminated from the binary image, and the images are reconstructed for classification purposes. In Ref. [2], Wang presents graph kernel-based structured feature selection (gk-SFS), an approach for structural discrimination of networks and for estimating discriminative features in brain disease classification. To improve performance, an l1-norm based sparsity regularizer is deployed in [2]. Performance is assessed by comparing the accuracy of the proposed approach with existing approaches, and the elimination of noisy or redundant information is also considered in [2]. In Ref. [3], Chen provides an approach for securing image manipulation detection against attacks. They address random feature selection, which incurs a negligible loss of performance. Specifically, they focus on two manipulation techniques, adaptive histogram equalization and median filtering. In Ref. [4], Nie proposes an auto-weighted feature selection framework via global redundancy minimization (AGRM) for selecting non-redundant and representative features; it is non-parametric and weights features automatically. Hence, they address both supervised and unsupervised feature selection. Meanwhile, our proposed map-reduce framework follows the one illustrated by the author of Ref. [5]. Furthermore, they provide feature scores for redundant features and compare performance against other existing approaches. In Ref. [6], Singh proposed a novel filter-based approach for feature selection that ranks the features based on a score. The proposed framework markedly improves the results, even with high-dimensional datasets.
Moreover, they sought to improve classification accuracy and precision over other methods. In Ref. [7], Maulik proposed a prediction scheme that combines a fuzzy preference-based rough set (FPRS) method for feature (gene) selection with semi-supervised SVMs. They demonstrated its effectiveness by comparison with the signal-to-noise ratio (SNR) and consistency-based feature selection (CBFS) methods. In Ref. [8], Shen describes a methodology that combines the fused lasso and the elastic net as regularization for a linear support vector machine (SVM), called feature selection SVM (OFSSVM), which uses the huberized hinge loss as the loss function. The author improves performance by addressing both binary and multi-class classification. In Ref. [9], Zhang predicts the prognosis and survival time of different subtypes of GBM by combining gene testing with clinical treatment, using a dataset extracted from the Cancer Genome Atlas (TCGA) database, which further improves the efficiency of standards. To this end, they use the minimum redundancy feature selection method (mRMR) and the Multiple Kernel Machine (MKL) learning method for effective prediction and feature selection. In Ref. [10], Taşkın proposed a classification approach in which dimensionality reduction addresses the correlation between the spectral features and the noise present in spectral bands. Performance is demonstrated through the stability of the feature selection method, the computational time of classification, and the accuracy. Specifically, they used hyperspectral datasets as images. In Ref. [11], Archibald proposed a band selection method that is performed together with classification to improve accuracy. An embedded-feature-selection (EFS) algorithm tailored to support vector machines (SVMs) is introduced to perform band selection and classification in parallel, which reduces the computational time. In Ref. [12], Chong describes the Robustness-Driven Feature Selection (RDFS) algorithm, which considerably increases robustness in CT images under various factors. Two SVM-based approaches, one with RDFS and one without, are compared to assess the robustness of the proposed algorithm. The comparison is performed on a multi-reconstruction dataset using Cohen's kappa. In Ref. [13], Xiabi Liu et al. introduced a novel approach based on the Fisher criterion and genetic optimization (FIG), which selects subsets of various features, including bag-of-visual-words based on the histogram of oriented gradients, wavelet transform-based features, the local binary pattern, and the CT value histogram. In Ref. [14], Bolourchi proposed a score-based approach that uses an entropy score to select the top k features in synthetic aperture radar images for dimensionality reduction and feasible feature selection. High-dimensional datasets are formed by combining various feature sets into vectors, which improves image classification accuracy. In Ref. [15], Fauvel proposed a framework for hyperspectral images based on a Gaussian mixture model (GMM) classifier. They classify images, evaluate performance with K-fold cross-validation, and demonstrate the significance of the proposed classification. Besides, Refs. [16][17][18] deal with sputum cells for nucleus and cytoplasm extraction and with big data workflows. Furthermore, Ref. [19] performs feature selection for better classification. Other works lay out machine learning techniques for feature extraction, as in Ref. [20]. In Ref. [21], CT images are collected and the stages of lung cancer are determined using an SVM together with image contrast enhancement and optimal feature extraction techniques.

Our Contribution
A map-reduce framework has been developed to filter out ill-suited images in the mapper phase, so that cells of a similar type are sorted and fed to the reducer phase, as in Figure 1.
The retrieved images then undergo the feature extraction techniques described in the following sub-sections. Moreover, correlation-based symmetric uncertainty performs feature selection to expedite classification.
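The mapper and reducer roles described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the record layout, the image identifiers, and the rule that an image without a staining label counts as ill-suited are hypothetical, and the framework's actual implementation is not given in the text.

```python
from collections import defaultdict

# Hypothetical records: (image_id, stain, cell_type). The field names and
# values are illustrative assumptions, not taken from the paper's dataset.
RECORDS = [
    ("img_001", "PAP", "squamous carcinoma"),
    ("img_002", "H&E", "eosinophil"),
    ("img_003", "PAP", "bronchial mucus"),
    ("img_004", "PAP", "squamous carcinoma"),
    ("img_005", None, "unknown"),  # assumed ill-suited: no staining label
]


def mapper(record):
    """Emit ((stain, cell_type), image_id) pairs and drop ill-suited records."""
    image_id, stain, cell_type = record
    if stain is None:
        return []  # filtered out in the map phase
    return [((stain, cell_type), image_id)]


def shuffle(pairs):
    """Group mapped values by key, as happens between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, image_ids):
    """Collect all images that share a stain and cell type into one batch."""
    return key, sorted(image_ids)


def run_job(records):
    mapped = [pair for record in records for pair in mapper(record)]
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


batches = run_job(RECORDS)
```

Each resulting batch holds images of one cell type under one stain, which is the grouping that the later extraction steps rely on.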

Feature Extraction
Images were collected from several government hospitals in Tamil Nadu at 40× magnification with PAP staining and H&E staining. Raw sputum images contain various cells, such as eosinophilia, bronchial mucous cells, and squamous carcinoma cells, whose nucleus and cytoplasm differ in nature. Hence, they are processed using the map-reduce framework, as in Figure 1, to split them into groups of cells of the same nature. The groups are then passed to K-means clustering followed by morphological operations such as erosion and dilation, which separate merged nuclei so that parameters such as area and perimeter are computed accurately.
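To illustrate how erosion followed by dilation can separate merged nuclei, the sketch below applies both operations to a small binary mask. It is a minimal pure-Python example, assuming a 3×3 square structuring element and a hand-made mask in which two blobs are joined by a one-pixel bridge; neither choice comes from the paper.

```python
NEIGHBOURHOOD = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def _apply(mask, combine):
    """Apply a 3x3 neighbourhood rule; pixels outside the image count as 0."""
    h, w = len(mask), len(mask[0])
    return [
        [
            int(combine(
                0 <= r + dr < h and 0 <= c + dc < w and mask[r + dr][c + dc]
                for dr, dc in NEIGHBOURHOOD
            ))
            for c in range(w)
        ]
        for r in range(h)
    ]


def erode(mask):
    """A pixel survives only if its whole 3x3 neighbourhood is foreground."""
    return _apply(mask, all)


def dilate(mask):
    """A pixel is set if any pixel in its 3x3 neighbourhood is foreground."""
    return _apply(mask, any)


def count_components(mask):
    """Count 4-connected foreground regions with an iterative flood fill."""
    h, w = len(mask), len(mask[0])
    seen, count = set(), 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] and (r, c) not in seen:
                count += 1
                seen.add((r, c))
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
    return count


# Two 3x3 "nuclei" joined by a one-pixel bridge at row 2, column 4.
MASK = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]

opened = dilate(erode(MASK))  # erosion removes the bridge, dilation restores size
```

In this toy mask, erosion removes the thin bridge and dilation grows the two remaining cores back to their original 3×3 extent, so the regions can be measured separately.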
RGB images are first converted into gray-level images for noise removal. A median filter removes the noise from the gray images. The gray images are then reconstructed to RGB images using MATLAB code. The RGB images are then converted to the L*a*b* color space for further processing:
• L*: Lightness
• a*: Red/Green Value
• b*: Blue/Yellow Value
The LAB color model is a three-axis color system. The first axis, the L channel or lightness, runs vertically through the 3D color model from white to black, with all gray colors lying along its center. The a* axis runs from green to red, and the b* axis runs from blue to yellow. The color space is also device-independent. The colors in L*a*b* color space are passed to K-means clustering, which extracts the nucleus from sputum cells. K-means clusters the images and extracts the targeted features. The algorithm is run for 3 iterations. Furthermore, a region growing algorithm similar to the method proposed in Ref. [16], classified as a pixel-based image segmentation method that selects the necessary points or regions, has been applied, as in Figure 2(b). After region growing, the images are again passed to K-means clustering, which clusters the images iteratively. Finally, morphological operations such as erosion and dilation are applied to separate connected nuclei.
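The clustering step can be sketched as plain K-means over per-pixel (a*, b*) values, run for the 3 iterations mentioned above. The pixel values, the choice of three clusters, and the initial centroids are hypothetical and chosen only to make the example deterministic. The paper's pipeline uses MATLAB, so this Python version is a stand-in for the technique, not the authors' code.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def kmeans(points, initial_centroids, iterations=3):
    """Lloyd's algorithm with a fixed number of iterations."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        labels = assign(points, centroids)
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assign(points, centroids), centroids


# Hypothetical (a*, b*) values for nucleus, cytoplasm, and background pixels.
PIXELS = [
    (25, -30), (27, -28), (24, -32),  # nucleus-like
    (10, 15), (12, 14), (9, 17),      # cytoplasm-like
    (0, 0), (1, -1), (-1, 1),         # background-like
]

labels, centroids = kmeans(PIXELS, [PIXELS[0], PIXELS[3], PIXELS[6]])
```

Clustering on a* and b* alone ignores lightness, a common choice for stain-based segmentation that the text does not confirm.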

Parameters and Features
The first feature is the NC ratio, which is computed from the number of pixels in the nucleus region (nuclei area), as in Figure 3, and in the cytoplasmic region, as in Figure 2.
• Area: a scalar value giving the actual number of pixels in the extracted nucleus and cytoplasm. Area is calculated as in (1), where i and j denote the nucleic area and the cytoplasmic area, each containing the number of pixels in the respective region.
• NC Ratio: evaluated using the areas of the nucleus and cytoplasm from (1):
$$\text{NC ratio} = \frac{\text{Area of nucleus}}{\text{Area of cytoplasm}} \times 100 \quad (2)$$
• Intensity: describes the collection of color pixel values of the extracted nucleus and cytoplasm.
Here, edge(nuclei) and edge(cytoplasm) are the vector coordinates of i and j, respectively.
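The features listed above can be computed from binary masks, as in the following sketch. Only the NC ratio follows a formula stated in the text, namely (2). The perimeter is assumed to be the count of exposed pixel edges, circularity is assumed to follow the common definition 4πA/P², and intensity is assumed to be the mean gray value inside the mask. The masks and gray values are synthetic.

```python
import math


def area(mask):
    """Number of foreground pixels."""
    return sum(sum(row) for row in mask)


def perimeter(mask):
    """Count pixel edges that face background or the image border (assumption)."""
    h, w = len(mask), len(mask[0])
    edges = 0
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                inside = 0 <= nr < h and 0 <= nc < w
                if not inside or not mask[nr][nc]:
                    edges += 1
    return edges


def nc_ratio(nucleus, cytoplasm):
    """Equation (2): nucleus area over cytoplasm area, times 100."""
    return area(nucleus) / area(cytoplasm) * 100


def circularity(mask):
    """4*pi*A / P^2, a common definition assumed here."""
    return 4 * math.pi * area(mask) / perimeter(mask) ** 2


def mean_intensity(gray, mask):
    """Mean gray value over the masked pixels (assumed intensity measure)."""
    values = [gray[r][c] for r in range(len(mask))
              for c in range(len(mask[0])) if mask[r][c]]
    return sum(values) / len(values)


# Synthetic 7x7 example: a 3x3 nucleus centred inside a 5x5 cell.
SIZE = 7
nucleus = [[int(2 <= r <= 4 and 2 <= c <= 4) for c in range(SIZE)] for r in range(SIZE)]
cell = [[int(1 <= r <= 5 and 1 <= c <= 5) for c in range(SIZE)] for r in range(SIZE)]
cytoplasm = [[cell[r][c] - nucleus[r][c] for c in range(SIZE)] for r in range(SIZE)]
gray = [[50 if nucleus[r][c] else 200 for c in range(SIZE)] for r in range(SIZE)]

features = {
    "area_nucleus": area(nucleus),
    "area_cytoplasm": area(cytoplasm),
    "nc_ratio": nc_ratio(nucleus, cytoplasm),
    "perimeter_nucleus": perimeter(nucleus),
    "circularity_nucleus": circularity(nucleus),
    "intensity_nucleus": mean_intensity(gray, nucleus),
}
```

For the synthetic square nucleus, the circularity equals π/4, the value of this measure for a square.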

Feature Selection
Feature selection is the process of selecting subspaces of appropriate features, which are then used to develop a model. Several approaches to feature selection exist; the ones considered in this work are described in the following sub-sections.

Relief
Relief computes a score for each feature and ranks the features by score to identify the highest-scoring ones. The scores obtained are also used as feature weights in the model.

The feature weight is calculated as a difference of probabilities:
$$W_{\text{Relief}} = P(\text{value of } w \mid \text{neighbor instance of inter-class}) - P(\text{value of } w \mid \text{neighbor instance of intra-class}) \quad (7)$$
This is then simplified by eliminating the conditional attributes, so that the probabilities are conditioned on the intra-class or inter-class alone instead of on neighbor instances, as in (8).
Whenever Relief is used, it measures twice, since each feature is treated as a class. Here, h is a subset of class C and w is the weight of each feature, which is therefore calculated twice, as stated above.
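As a rough illustration of how Relief-style weights behave, the sketch below implements the basic Relief update of Kira and Rendell. For every instance it finds the nearest neighbor of the same class (hit) and of a different class (miss), then rewards features that differ on the miss and agree on the hit. This is the standard textbook variant, assumed here because equation (8) is not available in the text, and the toy data are hypothetical.

```python
def relief(X, y):
    """Basic Relief weights for numeric features (textbook variant, assumed)."""
    n, d = len(X), len(X[0])
    # Per-feature value range, used to scale differences into [0, 1].
    spans = [
        (max(row[f] for row in X) - min(row[f] for row in X)) or 1.0
        for f in range(d)
    ]

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def distance(a, b):
        return sum(diff(f, a, b) for f in range(d))

    weights = [0.0] * d
    for i, x in enumerate(X):
        hits = [X[j] for j in range(n) if j != i and y[j] == y[i]]
        misses = [X[j] for j in range(n) if y[j] != y[i]]
        nearest_hit = min(hits, key=lambda z: distance(x, z))
        nearest_miss = min(misses, key=lambda z: distance(x, z))
        for f in range(d):
            # Inter-class difference raises the weight, intra-class lowers it.
            weights[f] += (diff(f, x, nearest_miss) - diff(f, x, nearest_hit)) / n
    return weights


# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
X = [[0.10, 0.50], [0.20, 0.10], [0.15, 0.90],
     [0.90, 0.40], [0.80, 0.95], [0.85, 0.20]]
y = [0, 0, 0, 1, 1, 1]

weights = relief(X, y)
```

On this toy data the informative feature receives a positive weight and the noisy one a negative weight, which is the ranking behavior described above.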

Minimum Description Length (MDL)
Although such a model reduces the description length of the data, it is complex and costly, and its values are too large to compute. The description length of theory and data is approximated as the description length of the data given the theory plus the description length of the theory, $L(T, D) = L(D \mid T) + L(T)$, where T is the theory and D is the data. The description length of the data given the theory is computed using the entropy of B given A [1]. MDL also measures quality with the equation given in [2]. MDL is computed with (9) and (10), where $N$ is the number of training attributes in class C, $N_j$ is the number of attributes with the j-th value of the given attribute, Pr_MDL is prior_MDL, and Po_MDL is post_MDL.

Symmetric Uncertainty
Symmetric uncertainty is derived by first determining the probabilities of attributes A and B with values a ∈ A and b ∈ B, respectively. The values of one attribute are partitioned according to the values of the other, and if the entropy of the partitioned values is lower than that of the non-partitioned values, the two attributes are related, as in [1]. Formally, a relation between attributes A and B exists when $H(B \mid A) < H(B)$, where $H$ denotes the entropy value. The amount by which the entropy of B decreases after observing attribute A is the information gain [1]:

$$\text{Info-Gain} = H(B) + H(A) - H(A, B)$$

This Info-Gain is then used to compute symmetric uncertainty as

$$\mathrm{SU} = 2.0 \times \frac{\text{Info-Gain}}{H(B) + H(A)}$$

The values are normalized to the range (0, 1) with (16). We used symmetric uncertainty for feature selection because the Relief selector relies on a ranking that changes gradually with every iteration, while MDL is costly and its values are too large to compute.
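The two formulas above translate directly into code. The sketch below estimates entropies from value counts and assumes that features have already been discretized, since the text does not describe a discretization step. The example vectors are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(a, b):
    """SU = 2 * (H(B) + H(A) - H(A, B)) / (H(B) + H(A))."""
    h_a, h_b = entropy(a), entropy(b)
    h_ab = entropy(list(zip(a, b)))  # joint entropy H(A, B)
    if h_a + h_b == 0:
        return 0.0  # both attributes are constant
    info_gain = h_b + h_a - h_ab
    return 2.0 * info_gain / (h_b + h_a)


# Hypothetical discretized feature columns and class labels.
labels = [0, 0, 1, 1]
informative = [0, 0, 1, 1]  # mirrors the labels exactly
irrelevant = [0, 1, 0, 1]   # statistically independent of the labels

su_informative = symmetric_uncertainty(informative, labels)
su_irrelevant = symmetric_uncertainty(irrelevant, labels)
```

A feature that determines the label reaches the upper end of the range, while an independent feature scores zero, so the measure can be compared against a single cut-off across features.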

Results and Discussion
Sputum color images were collected from the microscope lab and processed using the map-reduce framework, which was designed to make the outcome reliable. Several extraction techniques were deployed to retrieve the features. Feature selection was then applied to obtain optimized features. Figure 4 depicts the feature importance values alongside those of other feature selection methods. These include Chi-square, Recursive Feature Elimination (RFE), and Random Forest, which select features through filter and wrapper methods without any feature transformation.
Specifically, we took certain features and applied both chi-square selection and the proposed symmetric uncertainty; the results are compared in Figure 5, which also illustrates the individual results of the other methodologies. Since chi-square yields 2 degrees of freedom, changes in feature selection are reflected in the results. Symmetric uncertainty removes irrelevant features by computing the value of each feature against the labels, setting a threshold of 0.60, and ignoring features whose values fall below the threshold.
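The thresholding rule amounts to a simple filter over per-feature scores. In the sketch below only the 0.60 threshold comes from the text. The scores are hypothetical placeholders, not results from the paper, and the function name is illustrative.

```python
SU_THRESHOLD = 0.60  # threshold stated in the text

# Hypothetical symmetric uncertainty scores, for illustration only.
su_scores = {
    "area": 0.82,
    "perimeter": 0.41,
    "intensity": 0.67,
    "nc_ratio": 0.90,
    "circularity": 0.58,
}


def select_features(scores, threshold=SU_THRESHOLD):
    """Keep features scoring at or above the threshold, best first."""
    kept = [name for name, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda name: scores[name], reverse=True)


selected = select_features(su_scores)
```

Only the retained feature names would then be passed on to the classifier.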
The symmetric uncertainty method with a threshold relieves the model of irrelevant features and proves to be a worthwhile approach, as in Table 1.
Sputum color images were collected in various microscope labs at the intended magnification of 40×. Since staining differs for each type of cell, we used PAP staining and H&E staining. To extract the nucleus and cytoplasm accordingly, we first developed a map-reduce framework that sorts cells of a similar type in the reducer phase in large volumes. We then performed extraction and optimal feature selection on the sputum color images with the map-reduce framework and PySpark, respectively. Experimental results show that the proposed model performs better than other existing models. Even though several features such as NC ratio, mean, circularity, and intensity are considered in other works, our contribution is an expedited approach for high-dimensional datasets using the map-reduce framework. Table 1 shows the feature importance in terms of accuracy. This approach obtained an overall accuracy of 91%, as in Table 1, which is higher than the values reported in the related works. In the literature, some works extract features using various MATLAB properties. Our work focused on separating the features with the map-reduce framework in high-dimensional datasets and on obtaining clear and valuable results for further classification.
Furthermore, Figure 6 shows that our proposed model maintains an improved true positive rate, even under unfavorable conditions. We compare against existing models to demonstrate accuracy, as in Table 1.
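For reference, the true positive rate plotted on an ROC curve can be computed by sweeping a decision threshold over classifier scores. The sketch below is a minimal illustration with hypothetical scores and labels. It does not reproduce Figure 6.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per distinct score used as a threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical classifier scores and ground-truth labels (1 = cancerous).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
area_under_curve = auc(points)
```

A curve that stays close to the top-left corner, with an area near 1, corresponds to the high true positive rate at low false positive rate described above.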

Conclusion
This work presents feature extraction over the map-reduce framework together with optimal correlated feature selection to improve classification. To this end, sputum color images are given as input for processing over map-reduce and for correlation-based symmetric uncertainty feature selection, which uses feature importance to improve performance. The filters used are the median and Wiener filters, which remove noise; the filtered results are then converted back to RGB images for better extraction from color images. Furthermore, feature subset selection is optimized by converting the features to dense vectors and applying the symmetric uncertainty method. Whereas the approaches discussed earlier provide optimal feature selection on randomized and preprocessed images, our proposed techniques build on a map-reduce framework and use raw sputum images containing various cells as the source. We enhance the performance over ROC curves and