Intelligent socio-economic status prediction system using machine learning models on Rajahmundry A.P., SES dataset

Background: Developing economic and social systems and ensuring the efficiency of economic and social processes is a major task for the government of any country. Predictive machine learning (ML) models are used to analyze data sets in ways that allow more efficient enterprise management. Nowadays, research combining Socio-Economic Status (SES) and Machine Learning (ML) is crucial for finding socio-economic inequalities and for taking further actions such as prevention, protection, and suppression. Objectives: The main objective of this research is to understand socio-economic system issues and to predict SES levels in a particular area, Rajahmundry, AP, India, using statistical analysis and machine learning methodologies. Methods: We analyze data collected from Rajahmundry (Rajamahandravaram), Andhra Pradesh, India, with 48 feature attributes (dimensions) and one four-class target attribute (poor, rich, middle, upper-middle). The SES levels (poor, rich, middle, and upper-middle) are predicted by 5 ML algorithms. Findings: We conduct a statistical analysis of each attribute, and analyze and compare the performance of five ML algorithms, Naïve Bayes, Decision Trees (DTs), k-NN, SVM (RBF kernel), and Random Forest (RF), using the confusion matrix, performance parameters (classification accuracy, precision, recall, and F1), and the area under the receiver operating characteristic (ROC) curve (AUC). We observed that the RF algorithm showed better results than the other algorithms on the Rajahmundry AP SES dataset. The RF algorithm achieves 97.82% classification accuracy (CA), and model construction takes 0.41 seconds. The next best ML model is DTs, with 96.67% CA and 0.16 seconds for model construction.
Novelty: Comprehensive analysis indicates that the novel AP SES dataset, together with empirical statistical analysis, gives good results, and that predicting SES levels with the RF model is very effective. Keywords: Machine Learning; socio-economic status; Rajahmundry; household; poverty


Introduction
with an independent contribution of income, occupation, and education to a set of risk factors for heart disease, such as smoking, BP, BMI, and cholesterol. For this, they chose 2380 people from Stanford and used a forward selection model. They conclude that higher education might be the best SES indicator of good health. (11) analyzed the relationship between SES and COPD. For the experiment, they collected SES, lung function, and demographic data from 11,042 people aged 35-95 in 5 countries: Argentina, Chile, Bangladesh, Uruguay, and Peru. To relate SES and COPD, they used PCA (principal component analysis) and multivariable alternating (MVAL) logistic regression methods. The overall COPD prevalence was 9.2%, ranging from 1.7% to 15.4% across sites. As per their analysis, lower education, a lower composite SES index, and lower household income were related to COPD.
Machine Learning (ML) and deep learning (DL) play a vital role in SES research, for example in predicting inequalities, SES levels, or area-wise SES. Some researchers have studied SES-level problems with GPU systems, demographic data, image maps, and so on using ML and DL. (12) estimate the SES of French Twitter users. They take a horizontal approach to the SES problem and investigate different methods for inferring the SES of web-based social media users. They propose various data collection methods and combinations of crawlable data, expert-annotated data, and open census information for the prediction. The main motive of (13) was the prediction of MSW (municipal solid waste) based on demographic and socio-economic variables of 220 municipalities in Ontario, Canada. For this experiment, they used two algorithms, DTs (Decision Trees) and NN (Neural Networks). As per their results and conclusions, the performance and accuracy of the ML models were good, indicating that ML models can make predictions from SES-level data. The NN models were the most accurate, explaining 72% of the variation in the data. The outcomes showed that, given adequate socio-economic factors, ML methods can produce models with high accuracy for waste prediction applications. SES factors are of key significance during all periods of wildfire management, which include restoration, prevention, and suppression. (14) described SES drivers of wildfire occurrence in central Spain. GLM and the ML method Maxent were used to predict wildfire occurrence during the 1980s and during the 2000s in order to recognize changes between the two periods in the SES drivers influencing wildfire occurrence. Creating social and economic frameworks and guaranteeing the effectiveness of social and financial procedures is one of the significant tasks for the government of any nation. Predictive ML models used for analyzing large volumes of data permit effective enterprise management.
(15) predicted Ukraine's GDP using the ARIMA model and a double exponential smoothing model. In the review of literature, very little work has been reported on socio-economic systems with statistical and ML models for predicting SES levels in different areas of the world.

Contributions/ motivations of the work
The contributions of this work are as follows:
• We collected household information from Rajahmundry, Andhra Pradesh, India using a questionnaire. The data were sampled according to the ratios of the SES levels: rich, upper-middle class, middle class, and poor.
• We composed the data set (*.csv) using 49 attributes, including the class attribute.
• We apply 5 well-known ML models, Naïve Bayes, DTs (Tree), k-NN, SVM (RBF kernel), and Random Forest (RF), and compare them with each other as well as with past and recent research on SES level detection. As per the comparison, the RF model is superior to the others.
• This research is useful in socio-economic systems for researchers, analysts, administrative employees, government, and so on.
• This research can support applications that detect SES levels automatically, such as mobile apps.
• We will extend this research to the effects of COVID-19 on the SES of the Rajahmundry area.

Organization of the paper
The paper is organized as follows:
• Section 2 presents the introduction and a detailed description of the literature relevant to SES research.
• Section 3 outlines the proposed model along with its structure. The detailed architecture describes the mathematical and algorithmic structure.
• Section 4 presents the experimental setup and the analysis of the simulated results. Here we analyze the ML models on the Rajahmundry AP SES dataset and compare their results.
• Section 5 concludes the work with some future directions.
https://www.indjst.org/

Materials and Methods
In this section, we describe the experimental setup and its working process step by step, together with the experimental materials such as metrics and measurement equations. The section focuses mainly on the ML algorithms, their setup, and their working process, and also describes performance measurement tools such as the confusion matrix, ROC, and so on.

Socio economic level prediction model
Figure 1 shows the proposed model of the poverty prediction system with the household data set. We collected information from each house in the rural and urban areas of the Rajahmundry constituency, East Godavari district, A.P., India.
We gathered all the information with a questionnaire and stored the necessary information in secondary storage. After that, we extracted the information, with features and classes, into a data set in *.csv format. The predicted class attribute contains four classes: rich, upper-middle class, middle class, and poor. For this investigation, we extracted the feature attributes from the household information, namely personal-data, Socio-status, Economical-status, Living-status, Health-wealthy status, and so on. We constructed 1742 records with 49 attributes as *.csv files and stored them in secondary storage.
Using this information, we construct statistical analysis reports that help analysts and decision-makers take preventive action against poverty. In addition, the data is pre-processed with algorithms such as PCA (principal component analysis), and the data set is split into training and testing parts (80% train and 20% test) for applying the Machine Learning algorithms. We use popular ML algorithms: Naïve Bayes, Decision Trees, Random Forest, k-NN (k-Nearest Neighbors), and SVM (Support Vector Machines). After designing the models, we evaluate them with metrics such as Accuracy (AC), TP Rate, FP Rate, F1, and AUC (using ROC). As per the comparison, we choose the best-performing ML model for predicting the class of unknown input feature attribute values. Lastly, we send the performance results, predicted values, and visualization graphs to the analysts and decision-makers.
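The 80%/20% split described above can be sketched as follows. This is a minimal illustration in Python, not the authors' pipeline: the function name `split_train_test`, the fixed seed, and the use of record ids in place of the household rows are assumptions made for the example.

```python
import random

def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle the records and split them into a training and a testing part."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)          # reproducible shuffle
    n_test = round(len(shuffled) * test_fraction)  # size of the test part
    return shuffled[n_test:], shuffled[:n_test]

# 1742 placeholder record ids stand in for the household records.
train, test = split_train_test(range(1742))
```

With 1742 records this gives 1394 training and 348 testing records.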

Dataset description
Rajahmundry, renamed Rajamahandravaram, is one of the major constituencies of East Godavari district in Andhra Pradesh, India. We gathered information about each house from the rural and urban areas of this constituency. We collected 1742 samples according to socio-economic ratios and area-wise ratios, using questionnaires, between 2018 and 2019. Some of the data is plotted on the Rajamahandravaram map using longitude and latitude values. Figure 2 shows the location details; clicking a plotted point or its "more details" button shows detailed information about that house. For this experiment, we used 48 feature attributes and one class attribute, the status (rich, poor, middle, and upper-middle classes). Table 1 describes the 49 attributes of the data set, including the class attribute.

Naïve Bayes (NB) classification
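The Naïve Bayes classifier assumes that the presence of a particular feature of a class is independent of every other feature (16). As per Bayes' theorem, the conditional probability of a class c given the features x is given by the equation

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$

where P(c) is the prior probability of the class, P(x | c) is the likelihood, and P(c | x) is the posterior probability. It is one of the most successful algorithms for many applications, such as text document classification, spam filtering, recommender systems, etc.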

Working of Naive Bayes Algorithm in SES Problem
NB classifier model for the SES level probabilities:
Step 1: Compute the prior probability of each class level in the SES data set.
Step 2: Find the likelihood of each attribute for each class in SES.
Step 3: Apply the Bayes formula to the feature attributes of SES to compute the posterior probabilities.
Step 4: Assign the input to the class with the highest posterior probability.
To simplify the calculation of the prior and posterior probabilities, two tables are used: a frequency table and a probability table. All features of SES are in the frequency table.
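The four steps above can be sketched for categorical attributes as follows. This is an illustrative sketch only: the function names, the toy rows, and the use of Laplace smoothing are assumptions and are not taken from the paper.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels):
    """Build the frequency tables: class counts and per-class value counts."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def predict_naive_bayes(model, row):
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                    # Step 1: prior probability
        for i, value in enumerate(row):          # Step 2: likelihood per attribute
            # Laplace smoothing (an assumption) avoids zero probabilities.
            score *= (value_counts[(i, label)][value] + 1) / (count + len(values_seen[i]))
        scores[label] = score                    # Step 3: unnormalised posterior
    return max(scores, key=scores.get)           # Step 4: most probable class

# Hypothetical toy data: (house ownership, income band) -> SES level.
rows = [("own", "high"), ("own", "high"), ("rent", "low"), ("rent", "low"), ("rent", "high")]
labels = ["rich", "rich", "poor", "poor", "poor"]
model = fit_naive_bayes(rows, labels)
```

The denominator P(x) of Bayes' theorem is the same for every class, so it can be dropped when only the most probable class is needed.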

Support Vector Machine (SVM):
Another powerful supervised ML model is SVM, which can be used for both regression and classification problems. Figure 4 shows the analysis of data using SVM. A data item with n features is plotted as a point in n-dimensional space, with the value of each feature being the value of a specific coordinate. The aim is to find a hyperplane that separates the classes and maximizes the margin in the n-dimensional space (17).

K-Nearest neighbors (k-NN) classification
k-NN is a non-parametric supervised algorithm suitable for both classification and regression. It considers the k closest data points in the training examples. The output depends on whether k-NN is used for classification or regression. For classification, the output is the class to which a data point belongs, based on how closely it matches its k nearest neighbors. It is an instance-based, or lazy, learning algorithm (18). The algorithm uses a distance function to find the k nearest neighbors. For continuous variables, the Euclidean, Manhattan, and Minkowski distance measures are used, and for categorical variables the Hamming distance, as shown in equations (3)(4)(5).
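The distance measures named above can be written compactly as below. The sketch uses the standard textbook definitions, which are assumed to correspond to the paper's equations; the function names are chosen for illustration.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two numeric vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)   # special case p = 2

def manhattan(x, y):
    return minkowski(x, y, 1)   # special case p = 1

def hamming(x, y):
    """Number of positions at which two categorical vectors differ."""
    return sum(a != b for a, b in zip(x, y))
```

For example, the points (0, 0) and (3, 4) are at Euclidean distance 5 and Manhattan distance 7.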

Working of KNN algorithm for SES Dataset
The K-nearest neighbors (KNN) model uses 'similarity of features' to predict the values of new data points, which means that a new data point is assigned a value based on how closely it matches the points in the training set. Figure 5 shows the classification process for the SES dataset in detail. Its working can be understood with the help of the following algorithm:
Step 1 - Provide the SES training and testing data sets.
Step 2 - Initialize the value of K; it can be any positive integer.
Step 3 - For each data point in the test data, do the following:
i. Calculate the distance between the test point and each training point using the Hamming, Manhattan, or Euclidean method. (Euclidean distance is used in the experimental setup for the SES data set.)
ii. Sort the distances in ascending order.
iii. Pick the top K rows from the sorted data.
iv. Assign a class to the test point based on the most frequent class among these rows.
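The steps above can be sketched in a few lines of Python. The function name and the toy points are hypothetical; Euclidean distance and a default of k = 5 follow the experimental setup described in the paper.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 3.i and 3.ii: compute Euclidean distances and sort them ascending.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Step 3.iii: keep the labels of the k closest rows.
    top_k = [label for _, label in distances[:k]]
    # Step 3.iv: return the most frequent class among them.
    return Counter(top_k).most_common(1)[0][0]

# Hypothetical two-feature toy data.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["poor", "poor", "poor", "rich", "rich", "rich"]
```

Calling `knn_predict(points, labels, (1.5, 1.5), k=3)` returns the class of the three nearest toy points.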

Decision Tree Algorithm
The DTs model is a supervised learning algorithm. It can be used for solving both classification and regression problems, but most researchers use it for classification. It is a tree-structured classifier in which internal nodes represent the features of the dataset, branches represent the decision rules, and leaf nodes represent the outcomes. DTs classify data points by sorting them down the tree from the root to some leaf node, with the leaf node giving the classification of the data point. Every node in the tree acts as a test on some feature attribute, and each edge descending from the node corresponds to one of the possible answers to the test. This procedure is recursive in nature and is repeated for each subtree rooted at the new node.

Decision Tree algorithm working with SES Data set
In Decision Trees, to predict the class label of a record, we start from the root of the tree. We compare the value of the root attribute with the record's (actual dataset) attribute. Based on the comparison, we follow the branch corresponding to that value and jump to the next node. For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. Figure 6 shows the Decision Trees classification process for the SES data set. The complete procedure can be better understood using the algorithm below:
Step-1: Begin the tree with the root node, say S, which contains the complete SES dataset.
Step-2: Find the best attribute in the dataset using an attribute selection measure.
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively build new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such a final node is called a leaf node.
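Choosing the best attribute for a node is the core of the procedure above. The text does not state which attribute selection measure is used, so the sketch below assumes information gain based on entropy (C4.5 itself uses the related gain ratio); the function names and toy rows are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one attribute."""
    subsets = defaultdict(list)
    for row, label in zip(rows, labels):
        subsets[row[attribute]].append(label)   # divide S by attribute value
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels):
    """Index of the attribute with the highest information gain."""
    return max(range(len(rows[0])), key=lambda a: information_gain(rows, labels, a))

# Hypothetical toy data: (house ownership, area) -> SES level.
rows = [("own", "urban"), ("own", "rural"), ("rent", "urban"), ("rent", "rural")]
labels = ["rich", "rich", "poor", "poor"]
```

In the toy data, ownership separates the two classes perfectly and area carries no information, so the first attribute is selected for the root node.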

Random Forest (RFs) algorithm
RF is a supervised ensemble learning model for classification. The basic idea of this model is that building a small decision tree with a small set of features is a computationally cheap procedure. If we can build many small, weak trees in parallel, we can then combine them into a single strong learner by averaging or by taking a majority vote. Figure 7 shows the Random Forest classification process for the SES data set. In the RF classifier, a higher number of trees in the forest generally gives more accurate results.

Working of Random Forest Algorithm for SES Data Set
Step 1: Randomly choose n features from the total SES feature set.
Step 2: As in decision trees, choose the best split for the root node among these features.
Step 3: Predict the result using each of the resulting decision trees.
Step 4: Count the votes for each target class from the predictions of the decision trees.
Step 5: The target class with the most votes is taken as the final prediction for the SES data set.
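The five steps above can be illustrated with a deliberately small forest. This is a sketch under stated assumptions, not the implementation used in the experiments: each tree is reduced to a one-level tree (a stump), rows are not bootstrapped as they would be in a standard random forest, and all names and toy data are hypothetical.

```python
import random
from collections import Counter

def train_stump(rows, labels, features):
    """One-level tree: among `features`, pick the attribute whose
    value -> majority-class rule makes the fewest training errors."""
    best = None
    for f in features:
        by_value = {}
        for row, label in zip(rows, labels):
            by_value.setdefault(row[f], Counter())[label] += 1
        rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
        correct = sum(rule[row[f]] == label for row, label in zip(rows, labels))
        if best is None or correct > best[0]:
            best = (correct, f, rule)
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    return best[1], best[2], default

def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        features = rng.sample(range(len(rows[0])), n_features)  # Step 1
        forest.append(train_stump(rows, labels, features))      # Step 2
    return forest

def predict_forest(forest, row):
    # Steps 3 and 4: every tree predicts, and the predictions are counted as votes.
    votes = Counter(rule.get(row[f], default) for f, rule, default in forest)
    return votes.most_common(1)[0][0]                            # Step 5

# Hypothetical toy data: (ownership, income band, area) -> SES level.
rows = [("own", "high", "urban"), ("own", "high", "rural"),
        ("rent", "low", "urban"), ("rent", "low", "rural")]
labels = ["rich", "rich", "poor", "poor"]
forest = train_forest(rows, labels)
```

Because every pair of attributes in the toy data contains at least one perfectly predictive attribute, each stump, and therefore the vote, classifies the training rows correctly.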

Confusion Matrix
Here we represent the 4-class problem with the classes middle, poor, rich, and upper-middle. Table 2 shows the confusion matrix for the Socio-Economic Status 4-class problem. The confusion matrix is constructed from the actual (true) values and the predicted values (19). The accuracy is calculated from the diagonal of the confusion matrix: the sum of the diagonal entries divided by the total number of instances.
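A 4-class confusion matrix and the accuracy computed from its diagonal can be sketched as follows. The class order matches the one listed above; the function names and the six example predictions are hypothetical.

```python
CLASSES = ["middle", "poor", "rich", "upper-middle"]

def confusion_matrix(actual, predicted, classes=CLASSES):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all instances."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical example: one "poor" household is misclassified as "middle".
actual = ["poor", "poor", "rich", "middle", "upper-middle", "middle"]
predicted = ["poor", "middle", "rich", "middle", "upper-middle", "middle"]
matrix = confusion_matrix(actual, predicted)
```

Here five of the six predictions lie on the diagonal, so the accuracy is 5/6.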

Performance parameters
Performance parameter results describe how well the models perform on the data set (20). We calculated the performance parameters like TPR-
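The parameters used in this paper (TP rate, FP rate, precision, recall, and F1) can be derived per class from a confusion matrix. The sketch below uses the standard textbook definitions, which are assumed to match the paper's; the function name and the example matrix are hypothetical.

```python
def per_class_metrics(matrix, k):
    """Metrics for class k of a confusion matrix (rows actual, columns predicted)."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]                         # class k predicted as k
    fn = sum(matrix[k]) - tp                  # class k predicted as something else
    fp = sum(row[k] for row in matrix) - tp   # other classes predicted as k
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # recall is the TP rate
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    return {"tp_rate": recall, "fp_rate": fp_rate, "precision": precision, "f1": f1}

# Hypothetical 3-class matrix used only to exercise the function.
example = [[8, 2, 0],
           [1, 9, 0],
           [0, 0, 5]]
metrics = per_class_metrics(example, 0)
```

For class 0 of the example, TP = 8, FN = 2, FP = 1, and TN = 14, so the TP rate is 0.8 and the precision is 8/9.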

Results and Discussion
In this section, we analyze the statistical analysis results and the classification accuracies of the machine learning models in detail.

Statistical Analysis
We collected the data from rural and urban areas of the Rajahmundry constituency, East Godavari District, A.P., India. The data were sampled according to the ratios of social and economic status. There are 946 rural samples and 796 urban samples (1742 in total). As per the statistical analysis of the household dataset, houses contain on average 4 to 5 members, with a mean of 4.381 and a standard deviation of 1.467. Some houses have only one member (the minimum is 1) and some contain 16 (the maximum). Each house contains at least one male person (the minimum is 1) and at most 8, with an average of 2 to 3 male persons per house. On the other hand, the number of female persons has a minimum of 0 and a maximum of 8, with a mean of 1.975 and an SD of 0.776, which means every house contains on average one to two females. The statistics show some good conditions: very few child workers, on average 2 to 3 young people in every house, and on average 1 to 2 workers in each house. Another good sign is that the numbers of diseased and handicapped people are very low, with mean values of 0.066 and 0.024 respectively. The economic status depends heavily on the annual income of each house and on its sources, such as public or private employment, assets, work, and so on. As per the statistics, the minimum annual income is 27000/- and the maximum is 8000000/-. Income comes from private, government, or pension sources. The detailed analysis is shown in Table 3. Educational and health resources are also available within reach of every house.

Experimental setup
In this section, we analyze the accuracy values of the ML algorithms k-NN, DTs, SVM, RF, and NB in detail. For this, we used the confusion matrix of each algorithm.

K-Nearest neighbor
The k-NN model correctly classifies 1643 instances out of 1742. The remaining 99 instances are classified incorrectly. The total accuracy (CA) value is 0.94316 (94.3%). The F1-score is 0.94312 and the precision value is 0.94341. The time taken to construct the model is 0.29 seconds. Figure 8 shows the confusion matrix of the KNN model. We used k = 5 with the Euclidean distance. As per the analysis, the predicted class "poor" is the most accurate (0.975 or 97.5%), with only 9 of the 353 poor-class instances classified incorrectly. The second and third positions are occupied by the upper-middle class with 94.4% accuracy and the middle class with 93.7% accuracy, respectively. For the rich class, 62 instances are classified correctly and 10 instances are assigned to the upper-middle class, so the accuracy is only 86.1%.

Decision Tree (DTs):
The C4.5 model correctly classifies 1684 instances out of 1742. The remaining 58 instances are classified incorrectly. The total accuracy (CA) value is 0.9667 (96.67%). The F1-score is 0.96659 and the precision value is 0.96672. The time taken to construct the model is 0.18 seconds. Figure 9 shows the confusion matrix of the DTs model. As per the analysis, the class "middle" is slightly more accurate (0.97468 or 97.47%) than the class "poor", for which 344 of 353 instances are classified correctly (accuracy 0.974504 or 97.45%). The accuracy of the "rich" class is 0.875 (87.5%) and that of the "upper-middle" class is 0.962 (96.2%). In the upper-middle class, 507 of 527 instances are classified correctly; of the remaining 20, 19 are incorrectly classified as "middle" and one as "rich". The DTs algorithm performs well in that it classifies three of the four target classes (poor, middle, and upper-middle) with above 96% accuracy.

Support Vector Machine (SVM)
Figure 10 shows the confusion matrix of the SVM model with the RBF kernel. In this model, we used the RBF kernel, whose expression is exp(-g|x-y|^2) with g = 0.1; the numerical tolerance is 0.001, the cost (C) value is 1.0, the regression loss epsilon is 0.1, and the iteration limit is 100. As per the analysis, the target class "poor" is the most accurate (0.9887 or 98.87%), with only 4 of 353 instances classified incorrectly (3 as "middle" and 1 as "upper-middle"). The second and third positions are occupied by the upper-middle class with 97.91% accuracy and the middle class with 90.88% accuracy, respectively. For the rich class, 64 instances are classified correctly and 8 incorrectly (7 as "upper-middle" and 1 as "middle"), so the accuracy is only 88.88%.
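The RBF kernel expression exp(-g|x-y|^2) with g = 0.1 can be evaluated directly, as in the short sketch below. The function name and the example vectors are hypothetical; only the kernel formula and the value of g come from the text.

```python
import math

def rbf_kernel(x, y, g=0.1):
    """RBF kernel exp(-g * |x - y|^2) between two numeric vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-g * squared_distance)

same = rbf_kernel((1.0, 2.0), (1.0, 2.0))   # identical points
near = rbf_kernel((0.0, 0.0), (1.0, 1.0))   # squared distance 2
```

The kernel equals 1 for identical points and decays towards 0 as the points move apart; for a squared distance of 2 it equals exp(-0.2).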

Random forest
The RF model correctly classifies 1704 instances out of 1742. The remaining 38 instances are classified incorrectly. The total accuracy (CA) value is 0.97818 (97.82%). The F1-score is 0.9782 and the precision value is 0.9781. The time taken to construct the model is 0.41 seconds. Figure 11 shows the confusion matrix of the RF model. As per the analysis, the class "poor" is slightly more accurate (0.9887 or 98.87%) than the classes "middle", "upper-middle", and "rich". For the rich class, 71 of 72 instances are classified correctly (accuracy 0.9861 or 98.61%). The accuracy of the "upper-middle" class is 0.98102 (98.1%) and that of the "middle" class is 0.9708 (97.08%). In the middle class, 767 of 790 instances are classified correctly; of the remaining 23, 11 are incorrectly classified as "poor" and 12 as "upper-middle". The RF algorithm performs well in that it classifies three of the four target classes (poor, rich, and upper-middle) with above 98% accuracy.
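The per-class and overall RF accuracies can be cross-checked from the class counts. In the sketch below the class sizes (353, 72, 527, 790) and the correct counts for "rich" (71) and "middle" (767) are taken from the text, while the correct counts for "poor" (349) and "upper-middle" (517) are back-computed from the reported accuracies and are therefore assumptions.

```python
# Class sizes as reported in the text.
totals = {"poor": 353, "rich": 72, "upper-middle": 527, "middle": 790}
# Correctly classified instances; "poor" and "upper-middle" are back-computed
# from the reported accuracies 0.9887 and 0.98102 (assumption).
correct = {"poor": 349, "rich": 71, "upper-middle": 517, "middle": 767}

per_class = {c: correct[c] / totals[c] for c in totals}
overall = sum(correct.values()) / sum(totals.values())
```

The counts add up to 1704 correct instances out of 1742, and the resulting overall accuracy agrees with the reported CA of 0.97818 up to rounding.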

Naïve Bayes (NB)
The Naïve Bayes model correctly classifies 1514 instances out of 1742. The remaining 228 instances are classified incorrectly. The total accuracy (CA) value is 0.86912 (86.91%). The F1-score is 0.87333 and the precision value is 0.89096. The time taken to construct the model is 0.12 seconds. Figure 12 shows the confusion matrix of the Naïve Bayes model. As per the analysis, the class "poor" is more accurate (0.991501 or 99.15%) than the classes "middle", "upper-middle", and "rich". For the rich class, 67 of 72 instances are classified correctly (accuracy 0.93055 or 93.05%). The accuracy of the "upper-middle" class is 0.853889943 (85.38%) and that of the "middle" class is 0.818987342 (81.9%). In the middle class, 647 of 790 instances are classified correctly; of the remaining 143, 45 are incorrectly classified as "poor" and 98 as "upper-middle". The NB algorithm performs worse than the other ML algorithms used: it classifies three of the four target classes (middle, rich, and upper-middle) with about 93% accuracy or less.

Receiver Operating Characteristic (ROC) curves
The ROC curve is constructed from the FP rate (1 - specificity) and the TP rate (sensitivity), each taking values from 0 to 1. Figure 13 shows the ROC curves of the experimental models for the target class "middle". In this analysis, the AUC values of k-NN, DTs (Tree), SVM, Random Forest, and Naïve Bayes for the target class "middle" are 0.9937, 0.9884, 0.9862, 0.9989, and 0.9794 respectively. Each model's ROC curve is drawn in a different color. All AUC values are above 0.97, so all the models are efficient and effective for predicting the middle class on unknown values. The RF model is the most efficient performer for the target class "middle". The light green curve in Figure 13 is the RF ROC curve.
Figure 14 shows the ROC curves of the experimental models for the target class "poor". In this analysis, the AUC values of k-NN, DTs (Tree), SVM, Random Forest, and Naïve Bayes for the target class "poor" are 0.9992, 0.9927, 0.9992, 0.9998, and 0.9952 respectively. Each model's ROC curve is drawn in a different color. All AUC values are above 0.99, so all the models are efficient and effective for predicting the target class "poor" on unknown values. The RF model is the most efficient performer for the target class "poor". The light green curve in Figure 14 is the RF ROC curve.
Figure 15 shows the ROC curves of the experimental models for the target class "rich". In this analysis, the AUC values of k-NN, DTs (Tree), SVM, Random Forest, and Naïve Bayes for the target class "rich" are 0.9985, 0.9874, 0.99903, 0.99993, and 0.98745 respectively. Each model's ROC curve is drawn in a different color. All AUC values are above 0.98, so all the models are efficient and effective for predicting the target class "rich" on unknown values. The RF model is the most efficient performer for the target class "rich". The light green curve in Figure 15 is the RF ROC curve.
Figure 16 shows the ROC curves of the experimental models for the target class "rich". In this analysis, the AUC values of k-NN, DTs (Tree), SVM, Random Forest, and Naïve Bayes are 0.9985, 0.9874, 0.99903, 0.99993, and 0.98745 respectively. Each model's ROC curve is drawn in a different color. All AUC values are above 0.98, so all the models are efficient and effective for predicting the target class "rich" on unknown values. The RF model is again the most efficient performer. The light green curve in Figure 16 is the RF ROC curve.
Table 5 shows all performance parameter values of each ML model and the time taken to build the respective model. One of the main observations is that the RF algorithm is the best of the experimental ML models: its CA, AUC, and F1 values are all greater than those of the others. However, it takes 0.41 seconds to build the model, the longest time among the experimental models. The second best-performing model is the DTs model, which takes only 0.16 seconds for model building. Naïve Bayes takes the least time for model building, but it occupies the last position in performance.
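The per-class AUC values discussed in the ROC analysis treat one target class as positive and all other classes as negative. A minimal sketch of that computation is given below; it relies on the fact that the AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The function name and the example scores are hypothetical.

```python
def auc_one_vs_rest(scores, labels, positive):
    """AUC for one class against the rest, via pairwise score comparison."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities of the class "poor" for five households.
scores = [0.9, 0.4, 0.7, 0.6, 0.2]
labels = ["poor", "poor", "middle", "poor", "rich"]
auc_poor = auc_one_vs_rest(scores, labels, "poor")
```

In the example, four of the six positive/negative pairs are ordered correctly, so the AUC is 4/6; a value of 1.0 would mean that every "poor" household is scored above every other household.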

Experimental ML algorithms comparative analysis
This comparison is presented in detail using bar charts. Figure 17 shows the comparative analysis of the ML models, with classification accuracy (CA) shown in pink and AUC in blue. The highest CA and AUC values (0.976 and 0.999) are achieved by the random forest. The second-highest AUC value (0.9954) is achieved by the KNN model, and the second-highest CA value (0.966) by the Tree (DTs) model. Figure 18 shows the time taken to build the ML models. The Naïve Bayes model takes 0.12 seconds, but its CA value is 0.869, making it the worst-performing model; it is preferable only when build time is the sole criterion. On the other hand, the Random Forest model takes the longest time (0.41 sec.) to build but is in first position in terms of accuracy, so RF is the better model for predicting SES levels when only accuracy is considered. The DTs model is a good compromise: it is built within 0.19 seconds (third position) and its accuracy of 0.966 is in second position. Considering both accuracy and time, DTs is better than the other experimental ML models.

Conclusion
Analysis and prediction of socio-economic status are very useful for analysts, organizations, and government. The statistical results from well-sampled data represent the economic features, social standards, and SES levels of the Rajamahandravaram constituency area. This work described SES using machine learning. As per the comparative study, the Random Forest ML model was the best for predicting SES levels on the Rajahmundry SES data set, with an accuracy (CA) of 0.976 and an AUC of 0.999. Further, we will extend the study to samples from the whole East Godavari district and work with GPS data using Deep Learning concepts for more accurate performance values. We will also conduct research comparing the periods before and after COVID-19 for this area.