Voting-Boosting: A novel machine learning ensemble for the prediction of Infants' Data

Background/Objectives: Owing to the continuous increase of electronic records and recent advances in machine learning, various automated disease diagnosis tools have been developed and proposed in the healthcare sector. In the present study, an ensemble methodology using voting and boosting techniques has been proposed for the optimal selection of features and the prediction of infants' data of India. Methods/Analysis: For feature selection, the best-first search algorithm of the wrapper technique has been used in addition to voting-boosting. The proposed ensemble consists of a combination of heterogeneous classifiers including Random Forest, J48, JRip, CART and Stochastic Gradient Descent (SGD). The effectiveness of the proposed ensemble and single classifiers has been investigated in terms of classification accuracy, precision, f-measure, recall, MCC and PRC area using varied k-fold cross-validation. Findings: The results showed that the proposed Voting-Boosting ensemble (k=15) outperforms the individual classifiers using the selected features. Applications/Improvements: The proposed Voting-Boosting ensemble can be extended with more state-of-the-art classification approaches and further applied to other healthcare datasets to enhance performance.


Introduction
India is one of the fastest growing economies and the second most populous country in the world, but poor health outcomes among infants have drawn global attention to India's health profile. There are still diverse challenges and shortfalls in terms of healthcare expenditure, vaccination, malnutrition, poor health facilities for newborns and widespread inaccessible geographical locations that need to be addressed urgently. India bears a large proportion of the global disease burden, within which the Infant Mortality Rate (IMR) remains a major hurdle for the government. IMR is a standard measure of infants' deaths under one year of age per 1,000 live births (1). India's IMR declined from 81/1,000 live births in 1990 to 34/1,000 in 2016, and there were a total of 1.08 million deaths of under-5 children in 2016 (2). India contributed 500,000, i.e. one-third, of global deaths annually, most of which are vaccine-preventable (3), and nearly 60 million children are malnourished every year (4). Mortality rates are gradually declining globally, but it is important to note that this reduction does not occur in the same way in all countries. The health of newborns is an important indicator in assessing the development of any society and a growing concern at the global level.
Infant mortality has two additional dimensions, viz. the neonatal mortality rate and the post-neonatal mortality rate. The neonatal mortality rate covers deaths of newborns from 0 to 27 days of life, the period with the highest risk of dying. The post-neonatal mortality rate covers deaths from 28 days up to the completion of 1 year of age and is expressed per 1,000 live births (5). Newborn deaths mostly occur due to preventable causes, including prematurity/preterm birth, neonatal infections and congenital malformations, whereas post-neonatal and under-5 deaths mostly occur due to infectious diseases such as pneumonia, diarrhea, malaria and measles, as well as malnutrition and intrapartum-related complications (6). All these deaths are preventable through ingenious and affordable interventions. The World Health Organization (WHO) and the United Nations (UN) have coordinated efforts to reduce these mortality rates. They formulated eight Millennium Development Goals (MDGs) with measurable targets and clear deadlines, of which MDG 4 aimed to reduce child mortality throughout the world by 2015 (7). In line with the MDGs, India has witnessed considerable progress, but much work remains in some areas. The Government of India has implemented various initiatives at both national and state levels, mainly targeting deprived sections and poor families, with the aim of qualitative improvements in public health standards and rural healthcare and, above all, the reduction of infant mortality. It has thereby initiated intensified schemes including (8) the Universal Immunization Programme (UIP), Janani Suraksha Yojana (JSY), Janani Shishu Suraksha Karyakaram (JSSK), Rashtriya Bal Swasthya Karyakram (RBSK), Mission Indradhanush (MI), Pradhan Mantri Surakshit Matritva Abhiyan (PMSMA), Intensified Mission Indradhanush (IMI) and many others. Despite all these initiatives, there is still a need for a more flexible approach, i.e. the implementation of recent technologies, to reduce the burden of the present Infant Mortality Rate (IMR) in India.
The potential of the available data can be exploited only if the data are analyzed and transformed into useful information, which in turn generates knowledge to support decision making or the development of intelligent automated systems for early detection of problems. This study applies multiple data mining and machine learning algorithms to the healthcare domain in general and infants' data in particular. The performance of data mining algorithms in predicting mortality rates is highly efficient when a good combination of salient features is paired with proper implementation of the prediction algorithms (9). Individual classifiers seem incapable of ensuring optimal results in terms of prediction and stability. Thus, a heterogeneous ensemble approach exploits the strengths of different classifiers while overcoming the weaknesses of single classifiers (10).
In this study, a new heterogeneous ensemble technique using voting and boosting has been proposed for feature selection and prediction. The removal of weak features is highly desirable for reducing the dimensionality of any dataset (11,12). The objective of feature selection is therefore to identify the subset of relevant and non-redundant features in the dataset. The selected features contribute towards the final prediction, and the rejected features are not used in subsequent modules and analysis. In the present work, the wrapper method has been applied to the full training dataset. Afterwards, majority voting has been applied as the baseline method for combining the output of each boosted classifier with k-fold cross-validation.
The remainder of this paper is arranged as follows. Section 2 reviews work related to the proposed ensemble methodology in the healthcare domain. Section 3 details the proposed ensemble methodology, architecture and algorithm based on wrapper feature selection. Section 4 describes the working environment and presents the results. Section 5 elaborates the conclusion, future scope and benefits of the proposed methodology.

Related Work
In recent years, the use of data mining and machine learning techniques to predict the possibility and course of diseases has increased. Numerous ensemble methodologies and toolkits have been created, proposed and studied by researchers in the healthcare sector. This section presents a few noteworthy works closely related to the proposed ensemble methodology and highlights the considerable potential of ensemble techniques.
Moreira et al. (13) proposed an ensemble of nearest neighbor classifiers using the random subspace algorithm to classify an unbalanced pregnancy database. The performance was evaluated by the Area Under the Curve (AUC) and other indicators of the well-known confusion matrix using 10-fold cross-validation, and the results indicated that the Subspace KNN ensemble achieved a high predictive performance of 0.937. This approach predicted the Apgar score, intrauterine growth restriction problems and gestational age during childbirth, which can be strongly associated with neonatal death risk, and also predicted fetus-related problems that develop into hypertensive disorders in pregnancy. Kabir and Ludwig (14) aimed to improve classification performance with a stacked-ensemble technique that finds the optimal weighted average of various learning models. In their stacked ensemble, Gradient Boosting Machine (GBM), Random Forest (RF) and Deep Neural Network (DNN) were used as base learners and a Generalized Linear Model (GLM) as the meta-learner. The results indicated that the stacked ensemble outperformed the individual base learners. Bashir et al. (15) used bootstrap aggregation of heterogeneous classifiers, namely Naive Bayes, Linear Regression, Quadratic Discriminant Analysis, Instance-Based Learner and Support Vector Machine, on five different heart disease datasets. The proposed bagging method (BagMOOV) achieved 84.16% accuracy, 93.29% sensitivity, 96.70% specificity and 82.15% f-measure with 10-fold cross-validation. Huang et al. (16) created SVM ensembles based on bagging and boosting over small- and large-scale breast cancer datasets. The performance of the proposed ensemble was evaluated by classification accuracy, ROC, f-measure and computational training time, and the results showed that the SVM ensemble performed slightly better than single SVM classifiers.
Cong et al. (17) created a new selected ensemble method integrating K-Nearest Neighbor, Support Vector Machine and Naive Bayes to diagnose breast cancer using both ultrasound and mammography images. A new indicator R was proposed to choose the base classifiers for ensemble learning. The proposed method achieved an accuracy of 88.73% and a sensitivity of 97.06%, with evidence that the classifier-fusion method was better than the feature-fusion method. Das and Sengur (18) investigated powerful ensemble learning techniques, viz. bagging, boosting and random subspace, with K-Nearest Neighbors (K-NN), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM) as base classifiers. The results indicated that the ensemble method was more effective for diagnosing valvular heart disease. Rijn (19) built an online performance estimation framework for dynamic data streams that weights the votes of individual classifier members across the data stream, relying only on Hoeffding trees as the base-level classifier. The performance was estimated using two functions based on windows and fading factors, and the results showed that BLAST with fading factors outperformed BLAST using the window approach. Santos (20) proposed an ensemble feature ranking using Information Gain, Gain Ratio, Symmetrical Uncertainty, Chi-Square and other methods on a breast cancer dataset, with Naïve Bayes performing best with a higher AUC and lower FPR. Khan et al. (21) presented a technique called Optimal Tree Ensemble (OTE) that integrates trees that are both accurate and diverse. The performance of OTE was assessed on 35 different datasets and compared with k-nearest neighbor, classification and regression trees, random forest, node harvest and support vector machine. The results revealed that the ensemble size was reduced significantly and, in most cases, better results were obtained.
In summary, many analogous methods have been proposed to uncover hidden patterns in the healthcare domain, and the literature continues to broaden.

Proposed Voting-Boosting Ensemble Model
This section introduces the proposed system architecture, algorithm and methodology, covering the fundamental design, the feature selection design and the ensemble methodology for prediction. The aim of combining multiple classifiers is to obtain better performance than individual classifiers. In this work, two popular ensemble techniques, viz. majority voting and AdaBoost, have been used. Majority voting is an ensemble strategy that selects one of many alternatives based on the predicted class with the most votes. AdaBoost is a boosting meta-algorithm that iteratively re-weights training instances based on the training error of the base classifier (22). The proposed framework consists of two main components, viz. a Feature Selection module and a Model Building & Evaluation module, for infants' data prediction. The pseudocode of the proposed voting-boosting ensemble model algorithm applied in the present research is shown in Algorithm 1; its output is the best-suited model for the infants' dataset.

Feature Selection module
Feature selection attempts to reduce a dataset by removing irrelevant or redundant features, thereby enhancing classifier performance and reducing data noise (23,24). Feature selection methods are mainly categorized into three types, viz. filter, wrapper and embedded methods (25). The wrapper method uses a classifier to evaluate multiple models by their predictive accuracy (on test data) after statistical re-sampling or cross-validation of the dataset, in order to find the combination of features that maximizes model performance (26).
In the present study, a Wrapper-Voting-Boosting (WVB) feature selection approach has been proposed, in which various algorithms and ensemble techniques are combined with the wrapper method to demonstrate their usability in detecting the important features of the infants' dataset. Let D_s be the dataset, f_n the set of feature vectors, t_n the set of target variables and L = {Algo_1, Algo_2, Algo_3, …, Algo_n} the set of algorithms applied to the dataset to achieve good feature selection performance. In this phase of the work, a classical wrapper search algorithm, viz. the best-first search method, has been used. WVB selects 'm' relevant features from the 'n' original features. The schematic flow of feature selection is shown in Figure 1. In the feature selection process, five algorithms, viz. CART, J48, JRip, Random Forest and SGD, have been used, each selecting different subsets and thus yielding different results. If the desired feature subset is generated, the process stops; otherwise, other classifiers are selected and the process is repeated, terminating at the validation procedure. The ensemble feature selection process not only reduces the risk of selecting an unstable subset but also avoids the problem of local optima, as ensemble techniques are usually superior to single models (27).
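As an illustration only (not the WEKA implementation used in this study), a wrapper-style feature selection can be sketched in Python with scikit-learn: greedy forward selection stands in for the best-first search, a CART-style decision tree is the wrapped classifier, and the data are a synthetic stand-in for the infants' dataset.

```python
# Minimal wrapper feature-selection sketch (illustrative, not the WEKA setup).
# Greedy forward selection approximates the best-first search used in WVB;
# the dataset here is synthetic (40 features, of which 15 are retained).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=40, n_informative=10,
                           random_state=42)

selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=42),  # CART-style wrapped classifier
    n_features_to_select=15,                  # as in the WVB step above
    direction="forward",
    cv=5,                                     # wrapper evaluation by CV accuracy
)
selector.fit(X, y)
selected = np.flatnonzero(selector.get_support())
print(len(selected))  # 15
```

In the actual study, each of the five classifiers would be wrapped in turn and the resulting subsets combined, as described above.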

Model Building & Evaluation
An ensemble method is a machine learning technique that combines several classifiers to improve overall predictive performance (28,29). This section on model building & evaluation deals with the heterogeneous ensemble called the Voting-Boosting (VB) ensemble, whose architecture is shown in Figure 2. The VB ensemble is flexible in choosing different classification algorithms for the selection and prediction of healthcare datasets. The subset of 'm' features obtained from WVB is used as input for the model. This approach focuses on enhancing the performance of ensemble learning with different classification algorithms {C_1, C_2, …, C_n}. Each classification algorithm is boosted {B_1, B_2, …, B_n} and the boosted classifiers are then combined using the majority voting technique to calculate the final output. The same set of five classifiers, including Random Forest, J48, JRip, CART and SGD, has been used. The best classifier is evaluated on the basis of various performance measures {M_1, M_2, …, M_n} with varied k-fold cross-validation.

Experiments
The experiments have been conducted on the infants' dataset with the selected subset of attributes. For the classification of data, a class label is required. The class label used in the present work is IMR, with two values, viz. high and low. 'High' signifies the districts of India with an IMR value of 33 or greater, whereas 'low' signifies the districts with an IMR value of less than 33. Numerous individual classifiers and the proposed ensemble have been applied to each test set with varied folds of cross-validation, and their results are analyzed and compared to find the best model.
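The labelling rule can be sketched as follows; the IMR values here are made up for illustration, while the real values come from the HMIS district data.

```python
# Hypothetical district IMR values; 33 is the threshold described above.
imr_values = [21.4, 33.0, 47.2, 30.9, 58.1]
labels = ["high" if v >= 33 else "low" for v in imr_values]
print(labels)  # ['low', 'high', 'high', 'low', 'high']
```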

Dataset
The dataset used in this study has been taken from the Health Management Information System (HMIS), a portal of the Ministry of Health and Family Welfare (MoHFW), Government of India, for 2014-18. The dataset initially contained 40 features; after applying the WVB ensemble with five different classification algorithms, viz. CART, J48, JRip, Random Forest and SGD, a subset of 15 features has been extracted, as shown in Table 1. Among these are: Child immunization - BCG (BCG); Children more than 5 years who received DPT5, 2nd Booster (DPT5_2B); Children more than 10 years who received TT10 (TT_10); Number of cases of AEFI - Death (AEFI_D); Number of cases of AEFI - Others (AEFI_O); and Number of children more than 16 months of age who received the Japanese Encephalitis (JE) vaccine (JE_16M).

Working Environment
The experiments have been carried out using the open-source Waikato Environment for Knowledge Analysis (WEKA) toolkit. WEKA is a collection of machine learning algorithms for data visualization, classification, clustering, regression, etc., and is widely used for the study, research, implementation, construction and development of new machine learning schemes (30).

K-fold Cross-Validation
K-fold cross-validation is a common technique in statistical learning to evaluate the performance of a model or the generalization of a trained model (31). The protocol partitions the dataset into k mutually exclusive subsets; each subset in turn is used as a validation set while the model is trained on the remaining k-1 subsets (32). The overall performance is obtained by averaging the performance over all k subsets, which reduces the bias associated with random selection of samples from the dataset (33). In this study, k-fold cross-validation with k = 5, 10 and 15 has been used.
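A minimal sketch of this protocol, using synthetic data and a single illustrative classifier rather than the full ensemble:

```python
# k-fold cross-validation: split into k partitions, validate on each partition
# in turn while training on the remaining k-1, then average the k scores.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=150, random_state=1)
for k in (5, 10, 15):  # the three settings evaluated in this study
    scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=k)
    print(k, round(scores.mean(), 3))  # mean accuracy over the k folds
```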

Evaluation Measures
The performance was evaluated using several standard performance metrics such as:

Accuracy
Accuracy refers to the percentage of correct predictions made by a classifier and is mainly used when the data classes are nearly balanced. It is defined in terms of the confusion matrix. True Positive (TP): an observation that is positive and is also predicted to be positive. False Positive (FP): an observation that is negative but is predicted to be positive. True Negative (TN): an observation that is negative and is also predicted to be negative. False Negative (FN): an observation that is positive but is predicted to be negative.

Precision
Precision is the ratio of correctly predicted positive observations to all observations predicted as positive, i.e. TP / (TP + FP).

Recall
Recall is the number of correctly predicted positive observations out of the total observations that are truly positive in that particular class, i.e. TP / (TP + FN).

F-measure
F-measure or F-score is the harmonic mean of precision and recall, i.e. both are interpreted together rather than individually.

MCC
The Matthews correlation coefficient (MCC) is a well-balanced measure of the quality of binary classifications, ranging from -1 (anti-correlation) to +1 (a perfect classifier), with values around 0 corresponding to a random guess (34).

PRC
The precision-recall curve (PRC) shows precision values for corresponding sensitivity (recall) values, with recall plotted on the x-axis and precision on the y-axis; its baseline is determined by the ratio of positives to negatives (35).
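All the measures above can be computed from the four confusion-matrix counts; a small worked example with made-up counts (not results from this study):

```python
# Illustrative confusion-matrix counts, chosen only for the arithmetic.
import math

tp, fp, tn, fn = 40, 10, 45, 5

accuracy  = (tp + tn) / (tp + fp + tn + fn)  # fraction of correct predictions
precision = tp / (tp + fp)                   # correctness of positive calls
recall    = tp / (tp + fn)                   # coverage of true positives
f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(round(accuracy, 2), round(precision, 2), round(recall, 2))  # 0.85 0.8 0.89
```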

Results and Discussion
In this study, the models were evaluated based on the accuracy, precision, recall, f-measure, MCC and PRC area discussed above, with prediction accuracy considered the most significant factor. Additionally, the batch size is set to 100. This section reports the performance of the individual classifiers and the proposed Voting-Boosting ensemble with varied folds of cross-validation. The obtained results are discussed and displayed in the following subsections.

Results without Feature Selection and Ensembling
This section describes the results obtained by applying the individual classifiers, i.e. without feature selection and ensembling, with varied folds of cross-validation. Table 2 reports the performance of the 5 applied classifiers on all evaluation measures. It can be seen that Random Forest performs better than the other classifiers, with its highest accuracy of 89.31% achieved at k=15.

Results with Feature Selection and Ensembling
The proposed model incorporates the well-known ensemble techniques of voting and boosting to improve the performance of the traditional classifiers. The results in Table 3 were obtained after applying the Voting-Boosting ensemble for feature selection and prediction. They show that the Voting-Boosting (VB) ensemble performs best, with 90.5% accuracy at k=15, and provides noteworthy effectiveness on all applied evaluation measures.

Overall Comparison
The accuracy, precision, recall, f-measure, MCC and PRC area obtained without feature selection & ensembling and with feature selection & ensembling at varied k-fold cross-validation are compared in the bar graphs of Figures 3 and 4.

Conclusion
With the rapid development of technologies, experts from various fields are working for the well-being of society by investigating electronic health records. Vast amounts of data are analyzed by data mining and machine learning techniques, and various new methodologies and automated systems have been developed. The use of ensemble techniques in the healthcare sector plays a vital role in disease prediction and classification, and a novel ensemble technique equipped with a good feature selection function contributes effectively to classification and prediction performance. The wrapper method with the best-first search algorithm finds an optimal subset of features for enhancing the predictability of the model. The proposed ensemble model overcomes the limitations of conventional data mining techniques by employing an ensemble of five heterogeneous classifiers, viz. Random Forest, J48, JRip, CART and SGD. The reliability of the system was evaluated by computing different parameters, including accuracy, precision, recall, f-measure, MCC and PRC area. Working on the best feature subset obtained after the feature selection process, the proposed Voting-Boosting (VB) ensemble achieved noteworthy effectiveness on all these measures across varied k-fold cross-validation. Thus, from this study, it can be deduced that the Voting-Boosting ensemble is the best-suited model for classifying the infants' data into different categories. In future, the focus shall be on utilizing other techniques such as bagging, stacked generalization, blending, random subspace and other ensemble techniques to improve the performance of the present work. In addition, the same methodology shall be implemented on other healthcare datasets with diverse attributes to confirm the robustness of the Voting-Boosting ensemble.