Robust Linear Model Selection Using Paired Bootstrap

Department of Statistics, University of Peshawar, Khyber Pakhtunkhwa, Pakistan; fazlirabbi1@gmail.com
CECOS University of IT and Emerging Sciences, Hayatabad, Peshawar, Khyber Pakhtunkhwa, Pakistan; salahuddin_90@yahoo.com
Shaheed Benazir Bhutto Women University, Peshawar, Khyber Pakhtunkhwa, Pakistan; najma.fwu@gmail.com


Introduction
Model selection procedures involve fitting a set of competing models and then choosing among them by comparing their prediction loss. Robust estimators are used when the data quality is questionable, i.e., when assumptions about the error distribution are not fulfilled. Many model selection procedures depend on maximum likelihood-type or least squares approaches, such as [1], [2], and [3]. All of these criteria are severely affected when the dataset contains outliers. Other criteria for model selection are based on minimizing the expected squared prediction loss, where the prediction loss is estimated by re-sampling techniques such as cross-validation or the bootstrap [4-6]. However, these criteria are also susceptible to outliers because they use an unbounded loss function when computing the prediction loss.
Keywords: MM Estimation, Outliers, Stratified Bootstrap, Out-Of-Bag Bootstrap, Robust Expected Prediction Loss

The rest of the paper is organized as follows: Section 2 describes the paired bootstrap method for calculating prediction error, Section 3 discusses the stratified bootstrap procedure, Section 4 presents the proposed Robust Paired Bootstrap Criterion (RPBC) for model selection, Section 5 reports the simulation results, Section 6 demonstrates a data example, and Section 7 provides the conclusion with a brief discussion.

Bootstrap Measures of Prediction Error
Re-sampling procedures such as the bootstrap and cross-validation are recommended for estimating the prediction error for variable/model selection. Bootstrap procedures obtain multiple samples by sampling with replacement from the original data set. For the paired bootstrap, sampling with replacement is done to form bootstrap samples from the original sample {(y_1, x_1), (y_2, x_2), …, (y_n, x_n)} [22]. The estimate of the prediction error for the k-th bootstrap sample is given by:

$$\hat{\Lambda}_k = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - x_i^{T}\hat{\beta}_k^{*}\right)^2,$$

where $\hat{\beta}_k^{*}$ is the bootstrap estimator based on the k-th bootstrap sample. An unbiased estimator of the prediction error for the k-th bootstrap sample was suggested by [23]:

$$\tilde{\Lambda}_k = \hat{\Lambda}_k + \frac{1}{n}\sum_{i=1}^{n}\left(y_i - x_i^{T}\hat{\beta}\right)^2 - \frac{1}{n}\sum_{i=1}^{n}\left(y_i^{*} - x_i^{*T}\hat{\beta}_k^{*}\right)^2,$$

where $\hat{\beta}$ is the estimator from the original sample, and $y_i^{*}$ and $x_i^{*}$ are the response and covariates of the i-th observation in the k-th bootstrap sample. It was shown in [5] that a selection procedure based on $\hat{\Lambda}_k$ is inconsistent. To obtain asymptotic consistency, [5] used an m-out-of-n bootstrap procedure with an appropriately chosen m < n.
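To make the paired-bootstrap estimate concrete, the following is a minimal Python sketch of $\hat{\Lambda}_k$ averaged over bootstrap samples, assuming a least squares fit; the number of replications B is an arbitrary illustrative choice, not a value from the paper.

```python
import numpy as np

def paired_bootstrap_pe(y, X, B=200, rng=None):
    """Average squared prediction error over B paired bootstrap samples.

    Each replication resamples (y_i, x_i) pairs with replacement, refits
    by least squares, and evaluates the fit on the full original sample.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    errors = np.empty(B)
    for k in range(B):
        idx = rng.integers(0, n, size=n)             # sample pairs with replacement
        beta_k = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        errors[k] = np.mean((y - X @ beta_k) ** 2)   # Lambda_hat for sample k
    return errors.mean()
```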
Suppose that we have a column vector of n responses Y = (y_1, y_2, …, y_n)^T and an n × p design matrix X. Let α represent a subset of size p_α of {1, 2, …, p}, let X_α be the corresponding n × p_α matrix, and let $x_{\alpha i}^{T}$ denote the i-th row vector of X_α. The linear regression model is given by

$$Y = X_\alpha \beta_\alpha + \sigma_\alpha \epsilon_\alpha,$$

where σ_α > 0, X_α and ε_α = (ε_{α1}, ε_{α2}, …, ε_{αn})^T are independent, the errors ε_{αi} ~ N(0, 1), and β_α is an unknown p_α-vector of regression coefficients. Let $\mathcal{A}$ represent a collection of candidate models. The interest here is to select a model α from $\mathcal{A}$ that fits the data well; the model is indexed by α ∈ $\mathcal{A}$, and β_α is estimated by the estimator $\hat{\beta}_\alpha$. A good model should be able to predict future observations with high accuracy, so one can use the conditional expected prediction error. For a given nonnegative loss function ρ(·), the robust conditional expected prediction error for model α is measured as

$$\Gamma_\alpha = E\left[\frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\frac{z_i - x_{\alpha i}^{T}\hat{\beta}_\alpha}{\sigma}\right)\,\middle|\; y\right],$$

where $\hat{\beta}_\alpha$ is the estimator of β_α, σ is a measure of spread for the given data, and z = (z_1, z_2, …, z_n)^T is a vector of future responses at X, independent of y. The measure of conditional expected prediction error with ρ(x) = x² was initially considered by [5] as a selection criterion. Following [5], an m-out-of-n stratified bootstrap procedure was used by [14] to estimate the conditional expected prediction error. The estimated robust expected prediction error is given by:

$$\hat{\Gamma}_{\alpha,m,n} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\frac{y_i - x_{\alpha i}^{T}\hat{\beta}_{\alpha,k}^{*}}{\hat{\sigma}}\right), \qquad (6)$$

where $\hat{\beta}_{\alpha,k}^{*}$ is the estimate of β_α computed from the k-th stratified bootstrap sample of size m, K is the number of bootstrap samples, and $\hat{\sigma}$ is a robust scale estimate (specified in Section 4). Ignoring the stratification, the bootstrap estimator in eq. (6) becomes a robust form of the estimator suggested by [5].

The Stratified Bootstrap
Here we explain the main steps in applying the stratified bootstrap procedure: residuals from the full-model fit are ranked and used to allocate observations into strata, and rows are then re-sampled with replacement within each stratum (the steps are detailed in Section 4; a code sketch follows).
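As a concrete illustration, the following is a minimal Python sketch of this stratified re-sampling step, assuming residuals from a full-model fit are available; the default stratum count and the proportional allocation of the bootstrap sample across strata are illustrative choices, not taken from the paper.

```python
import numpy as np

def stratified_sample(residuals, m, n_strata=5, rng=None):
    """Draw an m-out-of-n stratified bootstrap sample of row indices.

    Observations are ranked by their full-model residuals and split into
    n_strata contiguous rank groups, so extreme residuals land in the tail
    strata; rows are then re-sampled with replacement within each stratum.
    """
    rng = np.random.default_rng(rng)
    n = len(residuals)
    strata = np.array_split(np.argsort(residuals), n_strata)
    # Allocate the m draws across strata proportionally to stratum size,
    # so the bootstrap sample preserves the tail/core composition.
    return np.concatenate([
        rng.choice(s, size=max(1, round(m * len(s) / n)), replace=True)
        for s in strata
    ])
```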

The Proposed Robust Model Selection Criterion
In this section, the RPBC is proposed for model selection. The proposed criterion is based on the Robust Expected Prediction Loss (REPL), which is estimated using an m-out-of-n paired bootstrap procedure. To obtain the bootstrap estimate of the robust expected prediction loss, we follow these steps:

1. From the full model, calculate and arrange the Pearson residuals.
2. Fix the number of strata S. The number of strata should be between 3 and 8, as suggested by [24].
3. Allocate observations to the strata so that observations in the extreme tails are kept in the lower or upper tail strata, with the remaining observations in the other strata.
4. From each stratum, sample rows of (y, X) independently with replacement so that the total bootstrap sample size is m ≤ n.
5. Estimate the robust expected prediction loss by

$$\hat{\Gamma}_{\alpha,m,n} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{n}\sum_{i=1}^{n}\rho\!\left(\frac{y_i - x_{\alpha i}^{T}\hat{\beta}_{\alpha,k}^{*}}{\hat{\sigma}}\right), \qquad (7)$$

where $\hat{\beta}_{\alpha,k}^{*}$ is computed from the k-th stratified paired-bootstrap sample. To use eq. (7), we have to specify ρ(·) and σ. We take ρ(·) to be bounded because our interest here is to fit and predict the core observations rather than those that lie in the tails. Here we prefer trimming, meaning that ρ(·) is constant for large |x|. Following [14], we use a function that behaves like a quadratic near the origin and becomes constant away from it:

$$\rho(x) = \min(x^2, b^2). \qquad (8)$$

The value of b can be varied, but b = 2 proved reasonable in our simulation study. For simplicity, σ is measured by the MAD from the median:

$$\hat{\sigma} = 1.4826 \cdot \operatorname{median}_i \left| r_i - \operatorname{median}_j r_j \right|,$$

where the r_i are the residuals from the full model. The proposed criterion then selects the model that minimizes the estimated loss:

$$\hat{\alpha}_{m,n} = \arg\min_{\alpha \in \mathcal{A}} \hat{\Gamma}_{\alpha,m,n}. \qquad (9)$$

The optimal m depends on the true model; one should use 0.25n ≤ m ≤ 0.5n for moderate n (i.e., 50 ≤ n ≤ 200) [14, 15].
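The steps above translate directly into code. The following Python sketch puts them together under stated assumptions: a generic `fit` function stands in for the MM-estimator (any routine mapping (y, X) to coefficients can be plugged in), the proportional allocation across strata mirrors the Section 3 sketch, and the default values of m, the stratum count, and the number of bootstrap replications K are illustrative.

```python
import numpy as np

def rho(x, b=2.0):
    # Bounded loss of eq. (8): quadratic near the origin, constant beyond |x| = b
    return np.minimum(x**2, b**2)

def mad_sigma(r):
    # Scale estimate: normalized MAD of the residuals from their median
    return 1.4826 * np.median(np.abs(r - np.median(r)))

def repl_estimate(y, X, fit, m=None, n_strata=5, K=200, b=2.0, rng=None):
    """Bootstrap estimate of the robust expected prediction loss, eq. (7).

    `fit` maps (y, X) to a coefficient vector; an MM-estimator is intended,
    but least squares can be used for clean data.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    m = m if m is not None else n // 2      # 0.25n <= m <= 0.5n recommended
    beta_full = fit(y, X)                   # step 1: full-model fit and residuals
    resid = y - X @ beta_full               # (Pearson residuals share this ordering)
    sigma = mad_sigma(resid)
    strata = np.array_split(np.argsort(resid), n_strata)   # steps 2-3
    losses = np.empty(K)
    for k in range(K):
        # Step 4: m-out-of-n stratified paired-bootstrap sample of rows
        idx = np.concatenate([
            rng.choice(s, size=max(1, round(m * len(s) / n)), replace=True)
            for s in strata
        ])
        beta_k = fit(y[idx], X[idx])
        # Step 5: robust prediction loss of the bootstrap fit on the full sample
        losses[k] = np.mean(rho((y - X @ beta_k) / sigma, b))
    return losses.mean()
```

Model selection then amounts to calling `repl_estimate` once per candidate design matrix X_α and choosing the subset with the smallest value, as in eq. (9).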

Simulation Study
We carry out two simulation studies to evaluate the performance of our proposed RPBC. The first is designed to compare the behavior of our proposed procedure on contamination-free data (simulation setting 1) with that of classical procedures (i.e., AIC and BIC). The second simulation study demonstrates the utility of the RPBC in handling contaminated data (simulation setting 2).

Simulation Setting 1
To assess the performance of our proposed bootstrap model selection criterion in the no-outlier case, the following regression model is considered:

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \epsilon_i,$$

where ε_i ~ N(0, 1), X_1 is the column of 1's, and X_2, X_3, X_4, and X_5 are taken from the solid waste data of [25], the same as those used in [4, 5, 13, 14, 16, 26, 27]. Our proposed criterion $\hat{\alpha}_{m,n}$ (eq. (9)) is compared with a robust version of [5]'s criterion $\tilde{\alpha}_{m,n}$ (based on eq. (6)). In the case of zero contamination, the LS estimator is used to fit the regression models. Table 1 reports the estimated selection probabilities for $\hat{\alpha}_{m,n}$ and $\tilde{\alpha}_{m,n}$ using the LS estimator and the robust loss function ρ(x) = min(x², b²).
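As an illustration of how such selection probabilities can be estimated, the sketch below runs the criterion over Monte Carlo replications. Since the solid waste covariates are not reproduced here, the design matrix is simulated and the true coefficients `beta_true` are placeholders, not the values used in the paper; the sketch also assumes the `repl_estimate` function from Section 4.

```python
import itertools
import numpy as np

# Hypothetical stand-ins: simulated design and placeholder coefficients.
rng = np.random.default_rng(1)
n, beta_true = 60, np.array([2.0, 0.0, 0.0, 4.0, 0.0])
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, 4))])
ls_fit = lambda y, X: np.linalg.lstsq(X, y, rcond=None)[0]

counts = {}
for rep in range(200):                       # Monte Carlo replications
    y = X @ beta_true + rng.standard_normal(n)
    best, best_loss = None, np.inf
    # Candidate models: all subsets that include the intercept column.
    for size in range(1, 5):
        for subset in itertools.combinations(range(1, 5), size):
            cols = (0,) + subset
            # repl_estimate is the Section 4 sketch
            loss = repl_estimate(y, X[:, cols], ls_fit, rng=rep)
            if loss < best_loss:
                best, best_loss = cols, loss
    counts[best] = counts.get(best, 0) + 1

# Selection probability of each model = its count / number of replications.
for model, c in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(model, c / 200)
```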

Simulation Setting 2
Another simulation study is carried out to show the performance of our proposed RPBC in handling contaminated data. Furthermore, the use of the stratified bootstrap is explored when atypical observations are present in the data. For this purpose, the following regression model is considered:

$$y_i = x_i^{T}\beta + \epsilon_i,$$

where the design matrix X has columns generated as uniform on [-1, 1], and X is kept constant across all simulation replications. We choose five different error distributions that deviate from normality, among them a contaminated normal distribution (ε_1), the Cauchy distribution (ε_4), and the slash distribution (ε_5). The simulation results are reported in Tables 2 and 3. From the outputs in Table 3, it is clear that for all error distributions our proposed robust criterion $\hat{\alpha}_{m,n}$ performs very well compared to $\tilde{\alpha}_{m,n}$; however, in the case of the slash error distribution, our modified procedure does not perform as well. The output indicates that the proposed model selection procedure using the robust function in (8) and the MM-estimator is robust in the presence of highly contaminated data. For example, the percent correct is 36.2% for the un-stratified bootstrap, whereas it is 75.3% for the stratified bootstrap in the contaminated normal situation ε_1. Furthermore, we observe that, in the presence of outliers and heavy-tailed error distributions, the conventional least squares selection procedures do not perform well. For example, under the ε_4 error distribution, the percent correct is 96.6% for our proposed robust criterion using the MM-estimator, whereas it is 17.5% for the least squares procedure. Similarly, under the ε_5 error distribution, our proposed procedure with the MM-estimator is 87.7% correct, whereas the least squares procedure is 11.6% correct.
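For readers wishing to reproduce such error structures, the snippet below generates draws from the three distributions named above; the 10% contamination rate and the scale of the contaminating component are illustrative assumptions, since the paper's exact specification of ε_1 is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Contaminated normal (epsilon_1): mostly N(0,1), a fraction from a
# wider normal. The 10% rate and scale 10 are illustrative choices.
mask = rng.uniform(size=n) < 0.10
eps1 = np.where(mask, rng.normal(0.0, 10.0, n), rng.normal(0.0, 1.0, n))

# Cauchy (epsilon_4): standard Cauchy draws.
eps4 = rng.standard_cauchy(n)

# Slash (epsilon_5): standard normal divided by an independent Uniform(0,1).
eps5 = rng.standard_normal(n) / rng.uniform(0.0, 1.0, n)
```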
These outputs show that the proposed robust method has good robustness features under contaminated normal and heavy-tailed distributions, while the least squares procedure performs very poorly in both situations. This clearly demonstrates the lack of robustness of the least squares procedure in the presence of outliers and heavy-tailed distributions.

Data Example (Stack Loss Data)
In this section, we analyze the stack loss data presented by [29]. This dataset consists of three explanatory variables and n = 21 observations, and it contains four outliers, namely observations 1, 3, 4, and 21. The response is the stack loss (y). The explanatory variables are the flow of cooling air (X_1), the cooling water temperature (X_2), and the concentration of acid (X_3). We applied our robust method, the existing robust methods, and the traditional methods to these data. Table 2 presents a summary of the selected best models: the classical methods select the full model, whereas the robust criteria agree on the importance of the two variables X_1 and X_2. The best model according to our criterion includes X_1 and X_2.
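As a point of reference, a robust fit to these data is easy to reproduce in Python; the sketch below uses the stack loss dataset shipped with statsmodels and M-estimation with Tukey's biweight, which is related to, but not the same as, the MM-estimator used in the paper.

```python
import statsmodels.api as sm

# Stack loss data (n = 21) shipped with statsmodels.
data = sm.datasets.stackloss.load_pandas()
X = sm.add_constant(data.exog)   # air flow, water temperature, acid concentration
y = data.endog                   # stack loss

# M-estimation with Tukey's biweight: a bounded rho, like eq. (8) in spirit.
rlm = sm.RLM(y, X, M=sm.robust.norms.TukeyBiweight()).fit()
print(rlm.params)
print(rlm.weights.round(2))      # heavily downweighted rows flag candidate outliers
```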

Conclusion
The RPBC is proposed to select the best subset of variables in linear regression models. The proposed method is based on the robust conditional expected prediction loss and a robust ρ-function; a stratified bootstrap procedure is used to estimate the expected prediction loss. We recommend a bounded ρ-function to decrease the effect of large residuals. The simulation outputs show that in both cases, i.e., contamination-free data as well as contaminated data, the proposed selection criterion performs well. The results show that the proposed procedure has good robustness features under contaminated normal and heavy-tailed error distributions, while the least squares method performs very poorly in both situations. This clearly demonstrates the lack of robustness of the least squares procedure in the presence of outliers and heavy-tailed distributions. Furthermore, our proposed criterion is less dependent on the bootstrap sample size m than the robust version of [5]. In conclusion, our proposed criterion is superior and will do better when the data-generating model is small.