A novel pixel-based supervised hybrid approach for prediction of land cover from satellite imagery

Background/Objectives: To determine land use/cover from satellite imagery using image enhancement, image processing, and supervised machine learning techniques. This land usage information will help in land use policy development, disaster assessment, planning of urban infrastructure, and forest and agriculture monitoring and conservation. Methods/Statistical analysis: A pixel-based supervised hybrid machine learning approach is used that combines parametric density estimation with a k-nearest neighbor (k-NN) classifier to predict whether a particular pixel belongs to a nucleated village or to a field, forest, river, or some other terrain in satellite images. Spatial and texture features derived from the images are used as features for the models. Free satellite images crawled from Google Earth TM are hand labeled to serve as the ground truth for training and testing the models. Models are evaluated using four-fold cross-validation. A comparison with other related techniques is also presented. Findings: Our experiments suggest that instead of using raw pixel intensity values as features, the intensity values after edge detection give better prediction accuracy. The parametric density estimation for the two classes is better modeled as a Rayleigh distribution than as a normal distribution. Smoothing further helps improve the accuracy of the models. Passing the prediction of the parametric density estimator through a k-NN classifier further reduces the error by removing salt-and-pepper effects. The effects of using different sizes and numbers of smoothing filters are also discussed. Additionally, different parameters for k-NN were evaluated to find the best models. The model was successful in achieving a very high accuracy of more than 97% with a very small false positive rate. The results demonstrate the successful discrimination of the land cover of towns and nucleated villages from the surrounding terrain.
Novelty/Applications: A pixel-based hybrid approach is proposed to improve the accuracy of land cover classification. The proposed features and image enhancement techniques also help improve the prediction significantly. Using the proposed technique, the covered area of nucleated villages/towns can be determined for assessing the growth of towns/villages, for better planning of urban infrastructure and land use policy, for disaster assessment, and for forest and agriculture monitoring and conservation.


Introduction
The making of land cover and usage maps is as old as history itself. Traditionally, these maps were made with the help of scouts and explorers. It is only recently that they are being made through aerial and satellite images. The earliest interpretation of aerial images for land cover classification started in the 1950s (1,2). It was in 1972 that the Earth Resource Technology Satellite, now known as Landsat, was launched into orbit (3). Landsat opened the door to remote sensing using satellite imagery for land cover classification. Although a commercial resource at first, Landsat now provides free satellite images for land cover and usage analysis. More recently, in 2015 and 2017, the European Space Agency launched the Sentinel-2 A and Sentinel-2 B satellites with better remote sensing capabilities (4). These satellites, along with the satellite imagery freely provided by Google Earth TM, have resulted in hundreds of studies that have developed land cover maps used in numerous applications such as land use policy development, crop mapping, ecosystem services, infrastructure planning, disaster assessment, conservation, forest management, agricultural monitoring, sustainable development, nature protection, and dynamic assessment of land use/cover (5)(6)(7). For example, (8) show in Figure 1 how deforestation has occurred at a place near Villamontes, Bolivia. Of all these various applications, we focus on the geo-spatial data of nucleated villages and towns because their locations, size, population, and other parameters can greatly help decision makers in many developing countries. For example, this data can help in determining the area occupied by a village, and hence its growth over the years. In case of a disaster, such as a flood, it can be used to estimate the extent of the inundated area. We employ twelve different images having different types and sizes of nucleated villages and surrounding terrain to train and test our model.
After applying the necessary pre-processing, feature engineering, and image enhancements we are able to accurately predict which area on the image belongs to the village. For model training and parameter tuning we have used four-fold cross validation. The model consists of a parametric model followed by a k-Nearest Neighbor (k-NN) algorithm. A very high accuracy of more than 97% is achieved with a low False Positive (FP) rate. The effects of varying the parameter values and the image enhancement techniques are also discussed in detail.
The rest of the paper is organized as follows: Section 2 provides the related work, followed by the materials and methods in Section 3. The results, along with the discussion, are provided in Section 4. We finally conclude in Section 5.

Related Work
In the early 1970s, satellite images were manually classified into various land cover types through the examination of printed aerial images. The images were printed as black and white composites or individual bands (3). Later, the classification was done through knowledge-based systems in which pre-defined rules were programmed into software products that classified land areas into different cover types (9,10). Since the start of this millennium, machine learning (ML) and pattern recognition (PR) systems have emerged as the leading approach for satellite image classification (11,12). ML and PR methods alleviate the need for a domain expert responsible for coming up with rules to classify images. Furthermore, ML and PR achieve better accuracies, as these methods can find hidden and complex patterns that are beyond the comprehension of domain experts (13). ML and PR methods can be either unsupervised or supervised. Unsupervised techniques do not require human-labeled images for model training and testing, whereas supervised approaches require parts of an image to be labeled into the different land type categories that are of interest to the end user. Our approach belongs to the latter category.
Depending on the resolution of the images, land cover classification techniques can be pixel, sub-pixel, patch, or object based. In a pixel-based approach, it is assumed that a pixel is made up of one homogeneous land cover type, so each individual pixel is assigned a category (14,15). However, in low resolution images many pixels are composed of more than one land cover type. In such scenarios, sub-pixel-based approaches such as fuzzy set approaches are used (16). A patch-based approach performs classification on a window of a specific size, e.g., an 8 x 8 pixel window consisting of 64 pixels may be classified as belonging to a particular class (8). The object-based approach, on the other hand, classifies images using a grouping or clustering of pixels (14,17).
Various techniques for land cover classification use different numbers and types of categories. For example, (18) use satellite imagery data for retrieving the leaf area index and the leaf and canopy chlorophyll content of potato crops, whereas (19) classifies images into six different land cover types, namely: barren land, grassland, roads, trees, water bodies, and buildings. We, on the other hand, classify the land cover into two types or classes, namely village/town and terrain. Our work is perhaps most similar to (12), having used the same land cover types and source satellites, but it differs in the approach and features used.
Multi-temporal spatial images derived from a time series of satellite images are used to assess the changes over time (16,20) . Our approach is time independent but can be applied on images from different timelines to determine the difference between the two images over time. This study is also not about using dimensionality reduction (21) or removing cloud cover from images (22) or other image cleaning and pre-processing.
Several approaches try to make use of images from multiple new and advanced remote sensors such as advanced very high-resolution radiometer (AVHRR), radio detection and ranging (Radar), moderate-resolution imaging spectroradiometer (MODIS), synthetic aperture radar (SAR), light detection and ranging (LiDAR), and panchromatic images (23,24) . Different signals from different sensors are then fused together to increase the accuracy by complementing each other. Our approach does not make use of any of these advanced sensor data and instead just relies on imagery provided by Google Earth TM and uses only the pixel color intensity values. This makes our approach easier, cheaper, and more scalable.

Data Collection
The data for this study was collected from the free satellite imagery resource Google Earth TM. Various images were captured, but only 12 were selected that had a village/town surrounded by agricultural fields, barren land, rivers, or other terrain. We chose only 12 images because manually hand labeling these images for ground truth is very time consuming. Each image has dimensions of 2048 by 2048 pixels. Figure 2 shows two of the twelve images used for this study. What makes this challenging is the variation introduced by different land cover types, different sizes and shapes of villages, and the fact that Google Earth TM captures images at different times from different sensors under varying atmospheric conditions. We hand labeled all the images by demarcating the village boundary to make the ground truth, which we use for training and evaluating our models. The challenge in labelling these images was that it was unclear where the boundaries of the villages lie, as they are spread far and wide with many individual buildings surrounded by terrain. Therefore, we have labelled houses as belonging to the village, while the adjoining open spaces and farmland are considered part of the terrain. Only the pixel values of images such as the one in (a) of Figure 2 were used; no other data from advanced remote sensors such as AVHRR, LiDAR, MODIS, etc. was collected or used.

Data Split for Training and Evaluation
To train and then test the model we need to split the data into two sets, namely the training set and the test set. We split the data by randomly choosing 9 of the 12 images for training and the rest for testing. This process is repeated four times, so each iteration takes different training and testing images, a technique referred to in the machine learning literature as cross-validation. The more iterations we perform, the better the estimate of the classifier's true performance, but this would lead to a very long running time; we have therefore restricted the experiments to four iterations. Hence, all the results reported in this manuscript, including the tables, are an average of the above mentioned four iterations for the same parameters and models, unless specified otherwise.
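The split procedure described above can be sketched as follows. The image ids and random seed here are illustrative placeholders, not the actual data used in the study:

```python
import random

def cross_validation_splits(image_ids, n_iterations=4, n_train=9, seed=0):
    """Repeatedly split the image ids into random training and test
    sets, mirroring the four-iteration scheme described in the text."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_iterations):
        shuffled = list(image_ids)
        rng.shuffle(shuffled)
        # The first n_train images train the model; the rest test it.
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits

# Twelve images, four iterations of a 9/3 train/test split.
splits = cross_validation_splits(range(12))
```

Reported metrics would then be averaged over the four (train, test) pairs.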

Feature Engineering
Looking at the images of the houses and fields, one can notice that the texture variation where there are houses is much greater than that of the fields, because the texture of the fields is quite smooth. From this observation, it can be inferred that if we were to perform edge detection on these images, the areas containing houses would have more edges than the areas with fields. The intuition behind this is that there is larger texture variation in houses, and also, because of the small size of the houses, there are many edges close together compared to the fields: a field block is roughly equivalent in size to 10 to 20 houses, so its edges run only along the perimeter of the block.

Distribution Analysis and Edge Detection
To determine the validity of this hypothesis, we synthesized two ground truth images, one for houses and one for fields; these are shown in Figure 3. Many different images were sampled and patched together to make each representative image. Edge detection (ED) was then performed, and the histograms of the intensity values of the two images before and after ED are plotted in Figure 4. The difference between the histograms before and after ED suggests that the amount of overlap has decreased, but even then there is quite a significant overlap between the two distributions, suggesting that this feature alone is not good enough for discriminating the classes.
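The intuition can be illustrated with a small numpy sketch. The gradient-magnitude edge map below is a generic stand-in, since the text does not specify which edge detector was used; a highly textured patch (standing in for houses) yields a much stronger average edge response than a smooth patch (standing in for fields):

```python
import numpy as np

def edge_magnitude(img):
    """Gradient-magnitude edge map of a grayscale image."""
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy)

rng = np.random.default_rng(0)
# "Houses": strong local texture variation.
houses = rng.integers(0, 256, size=(64, 64)).astype(float)
# "Fields": nearly uniform intensity with mild noise.
fields = 120.0 + rng.normal(0.0, 2.0, size=(64, 64))

house_edges = edge_magnitude(houses)
field_edges = edge_magnitude(fields)
```

Histogramming `house_edges` and `field_edges` reproduces the kind of (still overlapping) separation discussed above.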

Image Enhancement and Smoothing
Based on the observation mentioned earlier in this section, the edge density in areas having houses is very high compared to that of the fields. It is therefore intuitive to apply smoothing (or blurring) to the synthesized images: the edges in the fields are surrounded by more pixels with zero or near-zero intensity values than those of the houses, because the houses are close together and form a cluster. As a result, the intensity values of edges in the fields move closer to zero than the edges in the areas containing houses. Figure 4 shows that this observation is correct. It can be seen from this figure that as we increase the number of smoothing passes, the distribution of the fields moves closer and closer to the left side of the intensity scale, whereas the distribution of the houses changes from a negative exponential distribution (Figure 3) to a Rayleigh distribution (Figure 4) as the smoothing is increased to five passes after ED, and finally to a normal distribution when smoothing is applied 10 times after ED. A similar trend is seen for our 9 training images, the average intensity values for which are given in Figure 5. It is evident from this figure that before smoothing the distributions of the two classes almost completely overlapped, but as the number of smoothing passes increases the two distributions start to split apart. There is also a limit to how many times we can smooth the images. When smoothing is applied many times (30 times) the advantage diminishes, as a fair portion of the field distribution now lies under the distribution of the houses, because after repeated smoothing the intensity values of the field pixels surrounding house pixels also increase.
From this figure it can be further inferred that as we increase smoothing, the distribution of houses increasingly resembles a normal distribution while the distribution of fields resembles a Rayleigh distribution.
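This behavior can be reproduced with a numpy-only sketch of repeated box blurring (a uniform mean filter with edge replication; the exact filter used in the experiments may differ). An isolated bright edge pixel surrounded by dark pixels, like a field edge, fades quickly under repeated smoothing, while the center of a dense cluster of bright pixels, like a village, retains much of its intensity:

```python
import numpy as np

def box_blur(img, size=3):
    """Mean filter of the given size, replicating values at the border."""
    pad = size // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (size * size)

def smooth_repeatedly(img, times, size=3):
    """Apply the blur filter the given number of times."""
    for _ in range(times):
        img = box_blur(img, size)
    return img

# Isolated bright pixel (a lone field edge) vs. a 5x5 bright cluster
# (edges packed closely together, as in a village).
isolated = np.zeros((21, 21)); isolated[10, 10] = 255.0
cluster = np.zeros((21, 21)); cluster[8:13, 8:13] = 255.0

faded = smooth_repeatedly(isolated, times=5)
kept = smooth_repeatedly(cluster, times=5)
```

After five passes the isolated pixel has nearly vanished while the cluster center is still bright, which is exactly the separation the repeated smoothing exploits.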

Parametric Density Estimation
The distribution of houses resembles a normal distribution while the distribution of fields resembles a Rayleigh distribution, as seen in Figure 5. We therefore model the two classes through parametric density estimation. The feature vector is one dimensional, with 256 distinct values for the 256 unique pixel intensity values. Therefore, approximately 256 probabilities for the village class (P(pixel | village)) and 256 for the terrain class (P(pixel | terrain)) need to be estimated from the nine training images, which total around 37 million (2048 x 2048 x 9) pixels. Given such a huge number of pixels from which to estimate our probabilities, the parametric density approach adopted here, even though quite simple, gives very good results. Prior probabilities are not taken into account because in some images the area covered by the village is very high while in others it is the opposite. Since there is no fixed percentage of houses or fields in an image, the priors of the two distributions are not used in modeling.
Initially, both distributions were thought to be best modelled by a Rayleigh distribution (Figure 5). Further experiments showed that the village class is better modeled as a normal distribution. This was realized after the densities of the two classes were modeled both as Rayleigh and as normal distributions, as shown in Figure 6. The next step is to find the decision boundary. Having determined the decision boundary, we are left with a very simple decision rule to classify between the village and terrain classes: if the value of the feature is greater than the decision boundary, classify the pixel as village; otherwise, classify it as terrain.
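A minimal sketch of the parameter estimation and the decision rule follows; the scale value and the boundary of 21 here are illustrative (the actual parameters are learned from the training images):

```python
import numpy as np

def rayleigh_mle_sigma2(samples):
    """Maximum-likelihood estimate of the Rayleigh scale parameter:
    sigma^2 = (1 / 2N) * sum(x_i^2)."""
    x = np.asarray(samples, dtype=float)
    return float(np.mean(x ** 2) / 2.0)

def normal_mle(samples):
    """Maximum-likelihood mean and standard deviation for a normal fit."""
    x = np.asarray(samples, dtype=float)
    return float(x.mean()), float(x.std())

def classify_pixels(values, boundary):
    """Decision rule from the text: a feature value above the boundary
    is labeled village (1), otherwise terrain (0)."""
    return (np.asarray(values, dtype=float) > boundary).astype(int)

rng = np.random.default_rng(1)
# Synthetic "terrain" feature values drawn from a Rayleigh distribution.
terrain_samples = rng.rayleigh(scale=4.0, size=100_000)
sigma2_hat = rayleigh_mle_sigma2(terrain_samples)  # should be close to 16
```

The same estimators would be run once per class over the ~37 million training pixels.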

Determining Decision Boundary
The Rayleigh distribution is given by
$$f(x) = \frac{x}{\sigma^2}\, e^{-x^2/(2\sigma^2)}, \qquad x \ge 0.$$
To find the decision boundary when both distributions are Rayleigh, we equate $f_1(x) = f_2(x)$. Solving this equation, we get the following expression for the decision boundary of the distributions:
$$x = \sqrt{\frac{2\,\sigma_1^2\,\sigma_2^2\,\ln\!\left(\sigma_1^2/\sigma_2^2\right)}{\sigma_1^2 - \sigma_2^2}}.$$
The MLE of $\sigma^2$ is given by
$$\hat{\sigma}^2 = \frac{1}{2N}\sum_{i=1}^{N} x_i^2.$$
The normal distribution is given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}.$$
To find the decision boundary in the case when one distribution is normal and the other is Rayleigh, we solve the following equation numerically for $x$:
$$\frac{x}{\sigma_1^2}\, e^{-x^2/(2\sigma_1^2)} = \frac{1}{\sigma_2\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma_2^2)}.$$
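Both boundary computations can be sketched in a few lines of numpy: the Rayleigh-Rayleigh case has a closed form, while the Rayleigh-normal case is solved by bisection (the parameter values and bracket below are illustrative assumptions, not the fitted values from the paper):

```python
import numpy as np

def rayleigh_pdf(x, s):
    """Rayleigh density with scale parameter sigma^2 = s."""
    return (x / s) * np.exp(-x * x / (2.0 * s))

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

def rayleigh_rayleigh_boundary(s1, s2):
    """Closed-form crossing point of two Rayleigh densities (s1 != s2)."""
    return float(np.sqrt(2.0 * s1 * s2 * np.log(s1 / s2) / (s1 - s2)))

def rayleigh_normal_boundary(s1, mu, sigma, lo, hi, tol=1e-10):
    """Solve rayleigh_pdf(x, s1) = normal_pdf(x, mu, sigma) by bisection;
    the interval [lo, hi] must bracket exactly one crossing."""
    f = lambda x: rayleigh_pdf(x, s1) - normal_pdf(x, mu, sigma)
    a, b = lo, hi
    while b - a > tol:
        m = 0.5 * (a + b)
        if f(a) * f(m) <= 0.0:
            b = m
        else:
            a = m
    return 0.5 * (a + b)

b_rr = rayleigh_rayleigh_boundary(2.0, 1.0)
b_rn = rayleigh_normal_boundary(4.0, 10.0, 2.0, lo=2.0, hi=8.0)
```

At either returned point the two class-conditional densities agree, which is exactly the defining property of the decision boundary.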

k-NN
The result of the parametric density classifier is then fed as input to a k-NN classifier to obtain the final prediction. The rationale for this is the observation that the images classified by the parametric model have regions within the village area that were classified as terrain, and similarly there were pixels within the terrain that were classified as village pixels. To remove this salt-and-pepper effect in the prediction, or in other words to remove these False Positive (FP) and False Negative (FN) errors, a k-NN classifier is used. The results also demonstrate improvement with the use of k-NN. In this problem setting, k-NN degenerates to counting the number of pixels belonging to each class within a spatial window and assigning the class that has the highest number of pixels in that window. Furthermore, the closest neighbors here are the spatial neighbors and not the neighbors closest in terms of the feature value. Along the borders of the image, zero padding could change the results significantly; it is therefore avoided by replicating the values along the border. Different window sizes for the k-NN classifier are used: 1x1 (equivalent to density estimation without k-NN), 3x3, 5x5, and 9x9.
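In this degenerate form, the spatial k-NN step is simply a majority-vote filter over the binary prediction map. A numpy-only sketch with border replication, assuming 0 = terrain and 1 = village:

```python
import numpy as np

def majority_filter(labels, size=3):
    """Assign each pixel the majority class (0/1) among its size x size
    spatial neighbors; borders are replicated, not zero-padded."""
    pad = size // 2
    padded = np.pad(labels, pad, mode="edge")
    ones = np.zeros(labels.shape, dtype=int)
    for dy in range(size):
        for dx in range(size):
            ones += padded[dy:dy + labels.shape[0], dx:dx + labels.shape[1]]
    # A pixel becomes village (1) when ones outnumber zeros in the window.
    return (2 * ones > size * size).astype(int)

# A lone terrain pixel inside a village region (salt-and-pepper noise)
# is flipped back to village by the majority vote.
pred = np.ones((7, 7), dtype=int)
pred[3, 3] = 0
cleaned = majority_filter(pred)
```

Larger window sizes (5x5, 9x9, ...) remove larger speckles at the cost of smoothing away small genuine regions.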

Smoothing Parameters
There are three different parameters for the smoothing filter applied to the images: 1- the size of the smoothing filter, 2- the number of times the filter is applied, and 3- the choice of a Gaussian or uniform blurring filter. There are various tradeoffs when selecting these parameters. For example, the advantage of selecting a large filter is that, since the area surrounding the edges of fields is dark, these edges blur quickly. The drawback, on the other hand, is that houses that are not located in a cluster, or houses on the outer rim of a cluster, also get blurred to black. If a filter that is too small is used, the edges of fields do not get blurred, because they have a small neighborhood of high intensity values. The number of smoothing passes is also crucial: smoothing only a few times does not increase the discriminability between the classes significantly, but applying the filter too many times also does not help, because eventually the intensity of each pixel converges to the mean intensity value of the image. A Gaussian filter weighs the contribution of each pixel according to its proximity, while a uniform filter gives equal weight to each pixel in the filter window. Hence, a bigger window might be required for Gaussian smoothing; furthermore, it might have to be applied a greater number of times than a uniform smoothing filter, thus taking more time. Table 1 shows the results of the two different parametric models. The Rayleigh-Rayleigh model (RRM) refers to the model in which the distributions of both classes are modeled using Rayleigh distributions, whereas the Rayleigh-Normal model (RNM) refers to the model in which the terrain class is modeled as a Rayleigh distribution but the village class is modeled as a normal distribution.
Surprisingly, even though the Rayleigh-Rayleigh model (RRM) does not fit the distributions as well as the Rayleigh-Normal model (RNM) does (Figure 6), RRM performs better than RNM. RRM achieves both the overall minimum error as well as the least error for a given number of smoothing passes. This result is counter-intuitive because both the training and test data undergo the same transformation, and in the transformed space RNM better approximates the distribution (Figure 6). One thing to notice, though, is that RNM has a higher TP rate than RRM, which is desirable, but at the same time it has a higher FP rate than RRM; the reason for this is evident from Figure 6. In this figure it can be seen that the decision boundary of RRM lies at around 21 on the x-axis while the decision boundary of RNM lies at around 24, because of which RNM misclassifies fewer of the village pixels as field pixels, resulting in a higher TP rate and, as a consequence, a higher FP rate as well. In a scenario where the cost of an FN is higher, RNM is therefore better suited for classification. Table 2 shows the significant improvement in the error rate with the application of k-NN. One can argue that the same effect can be obtained with more smoothing, but our results show otherwise. We can also notice that a bigger neighborhood size, i.e. 15x15, gives a better performance gain than a smaller one like 1x1. The results of Table 3 demonstrate that applying k-NN multiple times helps reduce the error; we expect this to help only up to a certain number of repetitions, after which performance will start to degrade. We could not verify this claim because applying the k-NN filter is quite expensive and time-consuming, since we also perform 5 iterations over it to give the averaged result.

Smoothing Parameters
Experiments with different parameters of the smoothing filter were carried out to obtain the best set of features. For a fixed set of training and test examples, the results of applying different filter sizes are given in Table 4. It shows that as the filter size increases, the error rate decreases until a certain filter size (15x15), after which it starts to increase. However, the optimal feature cannot be determined by the filter size alone; the number of times we apply that filter is also important. With smaller filter sizes we have to smooth many more times than with a bigger filter size. The results in Table 4 are obtained after smoothing 10 times, but smoothing 20 times gives an even better performance with a filter of size 7x7, i.e., around 3.47% error rate (refer to Table 1). Similarly, smoothing only 3 times gives around 4.76% error rate (Table 5). Therefore, we conclude that smaller filter sizes require a greater number of smoothing passes than larger filter sizes. A question then arises: why not always choose a very big filter? The answer is that with a big filter we get more error (see Table 5), because the houses that are located at the corners of the town or are surrounded by fields get blurred out. One might then think that we should choose the smallest filter size; the problem with this argument is that a 1x1 or 2x2 filter has no significant effect. A filter size of 3x3 will benefit the performance of the classifier, but it will need to be applied many times and the cost becomes restrictive. As a tradeoff between efficiency and performance we recommend a filter size between 5x5 and 9x9.

Comparison with Other Results
The performance of our model is comparable to or better than the performance of models reported in other studies for similar settings. Our reported accuracy is better than all the pixel-based approaches in (25) and in Table 2 of (26), although the number of classes and the size of the training data vary from the approaches in (26) and (25).

Conclusion
Automatic classification of satellite imagery into different land cover types can help land use policy development, disaster assessment, planning of urban infrastructure, and forest and agriculture monitoring and conservation. We propose a pixel-based hybrid approach to improve the accuracy of land cover classification. The proposed approach combines parametric density estimation with a k-nearest neighbor (k-NN) classifier to predict whether a pixel belongs to a nucleated village or to a field, forest, river, or some other terrain. The proposed features and image enhancement techniques also help significantly improve the prediction accuracy. Using the proposed technique, the covered area of nucleated villages/towns can be determined for assessing the growth of towns/villages and for better planning and conservation. We achieved an accuracy of more than 97% with a very low FP rate. Results for different parameter choices, along with their discussion, are also presented.