An effective approach to feature extraction for classification of plant diseases using machine learning

Objectives: To automatically classify diseased potato and grape leaves from healthy ones. Methods: Experimental samples of 3000 potato and 4270 grape leaf images were used. The diseased and healthy leaf image samples were taken from the PlantVillage dataset. Color features, viz. the average Red, Green, Blue and Hue intensities of the lesion region, were calculated. Texture features, namely Contrast, Dissimilarity, Homogeneity, Energy, Correlation, ASM and Entropy, were extracted from the hue lesion region. Histogram features, mean and standard deviation, were also extracted from the hue infected region. Data normalization was then applied to bring all features to a common scale. Finally, Naïve Bayes, K Nearest Neighbor and Support Vector Machine classifiers were applied to these feature sets. Findings: The dataset was split 80%/20% into training and test sets. The NB, KNN and SVM classifiers classified potato leaves with accuracies of 88.67%, 94.00% and 96.83% respectively, and grape leaves with accuracies of 81.87%, 93.10% and 96.02% respectively. For both species, the SVM classifier gave the highest accuracy, and the proposed method performs well compared with related works in the literature. Novelty/Applications: An effective feature extraction method for classifying grape and potato diseases is proposed in this research work.

Early Blight: early symptoms are small, irregular to circular dark brown spots restricted by leaf veins; on severely infected leaves the small lesions coalesce and cover large areas of the leaf (12). Refer fig. A.

Late Blight (fungus Phytophthora infestans) (13): circular to irregular-shaped dark brown or black lesions. Refer fig. B.

Grape Black Rot (fungus Guignardia bidwellii) (14): reddish brown, circular-to-angular spots that merge into irregular blotches.

Esca (fungi P. aleophilum and Phaeomoniella chlamydospora) (15): interveinal (in between veins) striping that starts out dark red and becomes necrotic (premature death of cells). Refer fig. D.

Leaf Blight (fungus) (16): lesions dull red to brown in color, turning black later; if the disease is severe, these lesions may coalesce.

Materials and Methods
The experiments in this study were carried out on the PlantVillage dataset (17). A dataset of 2000 diseased potato leaf images, 3270 diseased grape leaf images and 1000 healthy leaf images of each species was used in implementing this research work. All the images considered are of size 256 X 256. The PlantVillage dataset is a collection of 54,306 images of healthy and diseased plant leaves covering 14 plant species and 26 diseases.

Color features
Average intensity values of the Red, Green and Blue components of the RGB color space and the Hue component of the HSV color space are calculated as color features. Each is obtained by averaging the pixel values of the corresponding component image:

m_i = (1/N) Σ_(x,y) f_i(x,y)

where f_i(x,y) is the intensity value of the pixel at (x,y) in component i, N is the total number of pixels in the image, and i represents the color components Red, Green, Blue and Hue.
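As a minimal sketch of this averaging step (the array layout, toy values and function name are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def average_color_features(rgb, hue):
    """Mean R, G and B (from the RGB image) and mean Hue (from the HSV
    image), each averaged over all N pixels of the component image."""
    r_mean = rgb[:, :, 0].mean()
    g_mean = rgb[:, :, 1].mean()
    b_mean = rgb[:, :, 2].mean()
    h_mean = hue.mean()
    return r_mean, g_mean, b_mean, h_mean

# Toy 2x2 image; real leaf images in this work are 256 X 256.
rgb = np.array([[[10, 20, 30], [30, 40, 50]],
                [[50, 60, 70], [70, 80, 90]]], dtype=float)
hue = np.array([[40.0, 60.0], [80.0, 100.0]])
features = average_color_features(rgb, hue)
```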

Histogram features
A histogram plots the frequency of occurrence of each intensity value in an image. The weighted mean and weighted standard deviation (18) are calculated from the histogram as

μ* = Σ_i w_i x_i / Σ_i w_i

s = sqrt( Σ_i w_i (x_i − μ*)² / ( ((N′ − 1)/N′) Σ_i w_i ) )

where w_i is the weight of the i-th observation (the bin count), x_i is the i-th intensity value, N′ is the number of non-zero weights, and μ* is the weighted mean.
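These two formulas can be sketched directly in NumPy (function and variable names are assumptions for the illustration):

```python
import numpy as np

def weighted_mean_std(values, weights):
    """Weighted mean and weighted standard deviation from a histogram:
    `values` are the bin centers (hue intensities) and `weights` the bin
    counts; N' is the number of non-zero weights."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n_prime = np.count_nonzero(w)
    mu = np.sum(w * x) / np.sum(w)  # weighted mean
    var = np.sum(w * (x - mu) ** 2) / (((n_prime - 1) / n_prime) * np.sum(w))
    return mu, np.sqrt(var)

# With equal weights this reduces to the ordinary sample mean and SD.
mu, sd = weighted_mean_std([1, 2, 3], [1, 1, 1])
```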

Grey level co-occurrence matrix (GLCM) and Image Texture
An image texture is a spatial arrangement of intensities, or grey levels, in an image or a selected region of an image. A GLCM is a tabulation of how frequently different combinations of grey levels occur in an image. In this work, the Hue component of the HSV color space was used as the grey image, since variations in color value characterize the disease.
Any grey scale image has 256 grey levels, ranging from 0 to 255, and hence the size of the GLCM would be 256 X 256. In this research work, the GLCM generated was of size 32 X 32, as the number of grey levels was reduced to 32.
In this study, the leaf image was partitioned into 16 X 16 patches, and patches carrying more than 10% of information were considered for processing. Patches with less than 10% information were discarded, since for most of their seed pixels the grey level intensity values of the neighboring pixels would be 0; such pixels do not contribute textural information and may lead to misclassification. The texture features Contrast, Dissimilarity, Homogeneity, Angular Second Moment, Energy, Correlation and Entropy of the useful patches were taken as the texture features for the leaf.
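The GLCM tabulation described above can be sketched for one pixel offset as follows (a minimal illustration on a toy patch; the function name and the symmetrization choice are assumptions, not the paper's code):

```python
import numpy as np

def glcm(patch, dx, dy, levels):
    """Count how often grey level i occurs next to grey level j for the
    offset (dx, dy), then symmetrize so each pair is counted in both
    orders -- the tabulation a GLCM represents."""
    m = np.zeros((levels, levels), dtype=int)
    h, w = patch.shape
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                m[patch[y, x], patch[ny, nx]] += 1
    return m + m.T  # symmetric co-occurrence counts

# Toy 3x3 patch with 3 grey levels; in this work patches are 16 X 16
# with 32 grey levels.
patch = np.array([[0, 0, 1],
                  [0, 1, 1],
                  [1, 2, 2]])
horizontal = glcm(patch, dx=1, dy=0, levels=3)  # 0-degree direction
```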

Feature normalization
Feature normalization converts feature values to a common range of values. It is required when the features have different ranges, and it is an important pre-processing step for classification algorithms that compute distance measures, such as K Nearest Neighbors (18) and Support Vector Machines (19). SVM assumes that the data are in the range 0 to 1 or -1 to 1 (20). For certain algorithms, however, such as Naïve Bayes, feature normalization may not make much difference (20).
Min-Max Scaling normalizes each feature to a given range of values using (18)

x′ = (x − min) / (max − min)

where max and min are the maximum and minimum values of the feature (21).
https://www.indjst.org/
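A one-line sketch of Min-Max Scaling for a single feature column (the function name is illustrative):

```python
import numpy as np

def min_max_scale(x):
    """Min-Max Scaling: x' = (x - min) / (max - min), mapping the
    feature values into the common range [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

scaled = min_max_scale([10, 20, 30, 50])  # -> 0.0, 0.25, 0.5, 1.0
```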

Classifiers
Classification in machine learning is a supervised learning method in which models are trained to learn the mapping function, from input X to output Y, Y= f(X). Here, X is the feature set and Y is the set of Categories or Classes. Then, this mapping function is used in predicting the classes of new observations.

Naïve Bayes classifier
The Naïve Bayesian classifier is a probabilistic supervised learning algorithm. It predicts the class Y for a feature set X by applying Bayes rule, which uses the conditional probability P(X|Y), calculable from the training dataset, to find P(Y|X):

P(Y|X) = P(X|Y) P(Y) / P(X)

The Naïve Bayes algorithm can be applied when there are multiple features x_1, …, x_n that are all independent of each other, in which case

P(Y|x_1, …, x_n) ∝ P(Y) Π_i P(x_i|Y)
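As a sketch, scikit-learn's Gaussian Naïve Bayes applies exactly this rule, estimating P(x_i|Y) per feature from the training data (the toy values below are illustrative; the paper does not specify its implementation):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy feature set X (e.g. two normalized features) and class labels Y.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]])
Y = np.array([0, 0, 1, 1])

clf = GaussianNB().fit(X, Y)         # estimates P(x_i | Y) per feature
pred = clf.predict([[0.15, 0.15]])   # class with the highest posterior
```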

KNN classifier
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common amongst its K nearest neighbors measured by a distance function. Figure 2 illustrates K Nearest Neighbor Classification.
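The majority-vote scheme described above can be sketched with scikit-learn (toy data; K = 3 is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D feature set; values are illustrative only.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 1.0]])
Y = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # majority vote of 3 nearest
knn.fit(X, Y)
pred = knn.predict([[0.2, 0.2]])  # two of the 3 nearest are class 0
```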

SVM classifier
Support Vector Machine is a supervised machine learning algorithm commonly used for classification problems. The method plots data in n-dimensional space, with the n feature values as coordinate positions. The algorithm outputs an optimal hyperplane that clearly separates the data points; the samples lying on the margin are the support vectors. The dimension of the hyperplane is determined by the number of features. This hyperplane is then used to predict new examples. The advantage of the SVM classifier is that it tries to achieve a maximum margin. Figure 3 shows a maximum-margin hyperplane. The margin is the distance from the separating hyperplane to the closest points of each class; a good margin is one in which the data points of one class do not cross over to the other class.
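A minimal sketch of a maximum-margin linear SVM on toy data (scikit-learn stands in here; the paper's implementation is not specified):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D feature set; values are illustrative only.
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]])
Y = np.array([0, 0, 1, 1])

svm = SVC(kernel='linear')  # fits a maximum-margin separating hyperplane
svm.fit(X, Y)               # samples on the margin become support vectors
pred = svm.predict([[0.2, 0.2]])  # new point falls on the class-0 side
```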

K-Fold cross validation
K-Fold Cross Validation is a technique for estimating the performance of a machine learning model. The dataset is split into k parts, called folds. In the first iteration, the first fold is used to test the model and the remaining k-1 folds are used to fit it; in the second iteration, the second fold is the test set and the remaining k-1 folds form the training set. The process is repeated until every fold has had a chance to be the held-out test set. Two important sources of error, bias and variance, can be estimated from K-Fold Cross Validation. High bias with low variance indicates underfitting, i.e., the model does not fit the data well. The model is said to overfit when it learns the data so well that it also fits the noise present in the data; such a model performs extremely well on training data but poorly on test data. To neither overfit nor underfit, the model needs to be a generalized one: it fits the dataset such that it performs equally well on both training and test sets, a result of low bias and low variance, or of a trade-off between the two.
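The fold rotation above can be sketched with scikit-learn; iris stands in for the leaf feature set (an assumption for the illustration), with 10 folds as used later in this work:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Each of the 10 folds serves once as the held-out test set while the
# other 9 are used for fitting.
X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel='linear'), X, y, cv=cv)
mean_accuracy = scores.mean()  # average accuracy over the 10 folds
```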

Proposed methodology
The overall workflow of this research work is shown in Figure 4. In the proposed work, the background was removed from the RGB leaf image using the automatic enhanced GrabCut algorithm (23). Figure 5 shows the input leaf image and the leaf image obtained after applying the enhanced GrabCut algorithm. The lesion (infected) region was then segmented from the leaf. Next, the Red, Green and Blue color features were extracted from the segmented RGB image, and the Hue color feature, histogram features and texture features were extracted from the Hue component of the lesion region in HSV color space. Naïve Bayes, K Nearest Neighbor and Support Vector Machine classifiers were applied to these feature sets to classify the diseased leaf image.

Segmentation of lesion region
Let img_leaf be the RGB leaf image. img_leaf was converted into HSV color space, giving img_hsv, and the hue component img_hue was extracted from it. A mask corresponding to the green (healthy) region was created by thresholding hue values between 36 and 104 on img_hue. The mask was applied to img_hue to segment the lesion region, img_lesion. The results obtained by segmenting the infected region from the leaf image are shown in Figure 6.
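The thresholding step can be sketched in NumPy (assuming OpenCV-style hue values in 0-179, consistent with the 36-104 green bounds above; the function name and the choice to zero out masked pixels are illustrative):

```python
import numpy as np

def segment_lesion(hue):
    """Mask out the healthy green region (hue in [36, 104]) and keep
    the remaining pixels as the lesion region."""
    green_mask = (hue >= 36) & (hue <= 104)
    lesion = np.where(green_mask, 0, hue)  # zero out the green region
    return lesion, green_mask

# Toy 2x2 hue image: 20 and 150 are non-green (lesion) hues.
hue = np.array([[20, 60], [90, 150]])
lesion, mask = segment_lesion(hue)
```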

Feature extraction
This section discusses the method of extraction of various features used in this research work in classifying the plant diseases.

Color features
Algorithm 1 extracts the average color features of the Red, Green and Blue components of the RGB image and the Hue component of the HSV color space. The average color features extracted by Algorithm 1 are given in Table 2.

Histogram features
The histogram of img_hue, with bins representing hue value on the x axis and the number of occurrences of each hue value on the y axis, is shown in Figure 7(c) and Figure 7(f). The hue values corresponding to true colors are depicted in Figure 7(g). The mean and the standard deviation were calculated from the histogram; the results obtained for the hue values of the lesion region of one leaf image are given in Table 3. The infected regions show variation in color, and where the color changes there are large variations in hue value. Hence, the standard deviation takes a higher value than the mean for the lesion region of infected leaves.

The infected leaf tissues (lesion region) are rough in nature, whereas normal leaf tissues are smooth. Texture features can determine whether a region of a leaf image is rough or smooth: in a rough region, the differences between neighboring grey pixel values are very large, whereas in a smooth region neighboring pixels have the same or close grey values. Figure 8 demonstrates the sequence of steps performed in this research work to extract texture features, and Algorithm 2 implements them.

Texture features
The input images used in this research work are of size 256 X 256. Generating a Grey Level Co-occurrence Matrix of size 256 X 256 for each image is computationally expensive. Thus, the grey levels in the hue component of the lesion region were divided by 8, reducing them to 32 levels (0 to 31). This reduces the size of the GLCM from 256 X 256 to 32 X 32, thereby increasing calculation speed and decreasing complexity. Figure 9(a) shows the hue component of the lesion region. Figure 9(b) shows the image obtained by dividing the hue values by 8, in which the differences are not visible to the naked eye.
It can be observed from Table 5(a) that the intensity levels of neighboring pixel values are either very close or vary by a huge difference.
Close neighboring hue values indicate that neighboring pixels are of the same color or of the same shades of color. When the pixel values are divided by 8, they are grouped into the same grey level or into adjacent grey levels, as shown in Table 5(b). A huge variation in hue colors indicates that the neighboring pixels are of different colors, and division by 8 then results in corresponding pixel values with different grey levels. Hence, dividing the hue values by 8, and thereby reducing the grey levels, does not affect the performance of the classification model, as color difference was the important parameter taken into consideration for texture analysis.

It is meaningless to generate GLCM elements for image regions where little or no information is present. It can be seen from Figure 10 that the infected leaf area occupies less than 50% of the overall image area, not counting the background and uninfected leaf areas. This is the region of interest to be taken into consideration for generating the GLCM; the rest of the image area can be discarded. To take into account only the infected regions, the leaf image Grey8Img_lesion is partitioned into 16 X 16 blocks. The entire lesion region needs to be considered as a whole only when it has a definite shape and shape is used as a feature in classifying the disease. The plant disease symptoms considered in this work do not have a definite shape: though initial symptoms show a definite shape, as the disease develops further the lesion regions merge together and become irregular. Hence, partitioning the image into 16 X 16 patches does not affect the result. Each patch with less than 10% information is rejected and not considered for further processing.
From experiments, it was found that patches with less than 10% of information do not affect the result. For each useful patch, Grey Level Co-occurrence Matrices were generated for immediate neighboring pixels (distance = 1) and alternate neighboring pixels (distance = 2) in the directions with angles 0°, 45°, 90° and 135° ([0, π/4, π/2, 3π/4]), as illustrated in Figure 11. As the GLCM is a symmetric matrix, the upper and lower triangular elements are the same; these duplicate values were omitted, and feature values were calculated only for the angles 0°, 45°, 90° and 135°. This results in 8 sets of values, two for each direction (one per distance), for the features Contrast, Dissimilarity, Energy, Correlation and ASM. Averages of the features were calculated to obtain one set of features for each patch; Table 6 shows the GLCM features for one patch. GLCM features give the degree of correlation between pairs of pixels with given grey level values, and these grey level values represent hue value, i.e., color information, in this study. A change in hue value indicates a change in color or shade of color. Inter-pixel correlation between adjacent pixels was measured by taking distance = 1, and between alternate pixels by taking distance = 2, to obtain better information about the details of the texture. This helps in discriminating various diseases based on hue values. Thus, GLCM features were calculated for all useful patches and averaged to get one feature set for one leaf image.
The average features of all useful patches were obtained and taken as the GLCM texture features for the infected region. For all useful patches, Entropy was computed over the entire patch and averaged to calculate the Entropy for one leaf image. Table 7 shows the GLCM texture feature set obtained for one leaf image.
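The texture descriptors named above follow the standard GLCM definitions; as a NumPy sketch for one normalized GLCM (function name is assumed, Energy is taken as the square root of ASM, and Entropy uses base-2 logarithms, all common conventions rather than details stated in the paper):

```python
import numpy as np

def glcm_props(counts):
    """Contrast, dissimilarity, homogeneity, ASM, energy and entropy of
    one GLCM, using the standard definitions over the normalized
    co-occurrence probabilities p(i, j)."""
    p = counts / counts.sum()  # normalize counts to probabilities
    i, j = np.indices(p.shape)
    contrast      = np.sum(p * (i - j) ** 2)
    dissimilarity = np.sum(p * np.abs(i - j))
    homogeneity   = np.sum(p / (1.0 + (i - j) ** 2))
    asm           = np.sum(p ** 2)          # Angular Second Moment
    energy        = np.sqrt(asm)
    nz            = p[p > 0]
    entropy       = -np.sum(nz * np.log2(nz))
    return contrast, dissimilarity, homogeneity, asm, energy, entropy

# Perfectly uniform diagonal GLCM: zero contrast, maximal homogeneity.
props = glcm_props(np.array([[2, 0], [0, 2]], dtype=float))
```

In the full method these six values are computed per direction and distance, then averaged per patch and finally per leaf.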

Univariate Analysis
Univariate data analysis takes a single variable, summarizes it and finds patterns in the data (24). Here, univariate analysis was done to determine the range of feature values for each class. Figures 12 and 13 show the univariate analysis corresponding to the Potato and Grape diseases.

Feature Normalization
Features like Hue Intensity, Red Intensity, Green Intensity, Blue Intensity, Mean of Hue intensity, SD of Hue Intensity, Contrast, Dissimilarity and Entropy values are not in normalized form. They are normalized, using Min-Max Scaling, between 0 and 1. Table 8 shows the normalized feature values. Note that the features such as Homogeneity, Energy, Correlation, and ASM are already in normalized form.

Feature selection techniques
To improve the speed of the machine learning algorithms, feature selection techniques are applied. The Chi-squared statistical test and the ANOVA test were performed on the normalized feature set and target variable. The results obtained are tabulated in Table 9. Table 10 shows the classification accuracy obtained by keeping the top significant features and removing irrelevant ones.
Feature selection techniques select the features that are useful for classification. The prediction accuracies tabulated in Table 10 clearly show that classification accuracy increases when all 13 features are taken into account. Since removing features reduced accuracy while adding all features improved it, all the extracted features were significant, and hence all 13 features, namely Contrast, Dissimilarity, Homogeneity, Energy, Correlation, Angular Second Moment (ASM), Entropy, Hue Intensity, Red Intensity, Green Intensity, Blue Intensity, Mean and Standard Deviation (SD), were used in classifying the plant diseases.
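Both scoring tests are available in scikit-learn; a sketch, with iris standing in for the normalized 13-feature leaf set (an assumption for the illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# chi2 requires non-negative features, which Min-Max scaling guarantees.
X, y = load_iris(return_X_y=True)
chi_scores, chi_p = chi2(X, y)     # Chi-squared statistic per feature
f_scores, f_p = f_classif(X, y)    # ANOVA F-value per feature
top2 = SelectKBest(f_classif, k=2).fit_transform(X, y)  # keep best 2
```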

Data set size
The size of the dataset used in this research work is 3000 potato leaf images and 4270 grape leaf images. The dataset was split 80%/20% into training and test sets; Table 11 presents the training and test sizes. For the diseased and healthy grape leaf images, an accuracy of 96.02% and a Kappa score of 0.94 (almost perfect agreement) were obtained. 10-Fold Cross Validation was performed to evaluate the Naïve Bayes, K-Nearest Neighbor and Support Vector Machine models. The bias and variance measures obtained are tabulated in Table 14, and it can be concluded from the results that the SVM classifier best suits the data, as both bias and variance are low. The results obtained from previous works in the literature were compared with the proposed work; the proposed work gives the highest classification accuracy, 96.02% for grape diseases and 96.83% for potato diseases, as shown in Figure 14.
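The split-and-evaluate protocol can be sketched with scikit-learn (iris stands in for the leaf feature set, an assumption; the 80%/20% split and the accuracy/Kappa evaluation mirror the protocol described above):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hold out 20% of the samples for testing, train on the remaining 80%.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)
pred = SVC(kernel='linear').fit(X_tr, y_tr).predict(X_te)
acc = accuracy_score(y_te, pred)        # fraction classified correctly
kappa = cohen_kappa_score(y_te, pred)   # chance-corrected agreement
```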

Conclusion
This study focused on segmenting the lesion region and classifying plant diseases from plant leaf images using color, texture and histogram features. Naïve Bayes, KNN and SVM classifiers were tested, and for classification of diseases, the SVM classifier gave the highest accuracy: 96.83% for Potato leaf images and 96.02% for Grape leaf images. Kappa values of 0.91 and 0.94 for the Potato and Grape species respectively indicate almost perfect agreement between the ground truth and the predicted values.
For the Potato plant, out of 180 Early Blight infected leaf images, 147, 174 and 176 were classified correctly by Naïve Bayes, K Nearest Neighbor and Support Vector Machines respectively; 171, 186 and 188 Late Blight diseased leaves were predicted correctly out of 190 test samples by Naïve Bayes, K Nearest Neighbor and Support Vector Machines respectively; out of 230 …

For the Grape plant, out of 202 Black Rot infected leaf images, 161, 178 and 188 were classified correctly by Naïve Bayes, K Nearest Neighbor and Support Vector Machines respectively; 196, 225 and 234 Esca (Black Measles) diseased leaves were predicted correctly out of 246 test samples by Naïve Bayes, K Nearest Neighbor and Support Vector Machines respectively; 173, 205 and 204 Leaf Blight diseased leaves were predicted correctly out of 208 test samples by Naïve Bayes, K Nearest Neighbor and Support Vector Machines respectively; and out of 199 healthy leaf images, 170, 188 and 195 were correctly predicted by Naïve Bayes, K Nearest Neighbor and Support Vector Machines respectively.