Comparative Analysis of the Accuracy of Partial Least Squares and Principal Component Regression Methods

Objective: To compare the goodness of fit of Principal Component Regression (PCR) and Partial Least Squares (PLS) models using metrics such as RMSEP, MSEP and R2. Methods and Statistical Analysis: Regression analysis is used in this study to investigate the correlation between independent and dependent variables. Analysis becomes simpler when researchers understand and use the most suitable method based on the type of the dependent variables, the type of the independent variables and the dimensionality of the data. Cross-validation is used in both predictive models (PCR and PLS). Dataset and Findings: This study presents a comparative analysis of PCR and PLS by applying both methods to a public dataset, the octane dataset, which provides the spectral data of gasoline samples with 401 attributes. The study concludes that the PLS regression model yields better prediction results than the PCR model because PLS selects the principal components more accurately; the number of principal components identified by PLS is also comparatively smaller. An analysis of preprocessing is also performed in this paper with the same regression methods and dataset. Improvements: This analysis highlights the importance of removing Regions of No Interest. When Regions of No Interest are removed, the number of principal components is reduced, which in turn increases the prediction accuracy; the study shows that when Regions of No Interest are retained, the number of principal components is high, which decreases the prediction accuracy.


Introduction
Research organizations, government bodies and corporate agencies capture data at each stage for different applications, and this data capture has increased exponentially in the last five to six years. The data generated by these applications is huge, with the number of attributes ranging from a few hundred to several hundred thousand. Visualizing such data is difficult, and further computation on it takes time. This is where data cleansing and data preprocessing come into play.
Data preprocessing is the process of reducing the dimensions of data without losing much information. A question naturally arises: "Is it possible to diminish the importance of certain attributes without losing the integrity of the data?" Understanding the data is essential when preparing it for any analysis, and the key step in that preparation is preprocessing. The basic preprocessing techniques are data cleansing and normalization. The question "Why preprocessing?" is answered in this section: real-world (raw) data is generally unreliable and contains outliers and errors, so it has to be converted into meaningful data through data cleansing for flawless analysis and prediction.
Identifying and removing Regions of No Interest (RoNI) is part of data cleansing; many datasets have regions that should be removed before analysis. Some regions of the data simply do not carry much information; they contribute noise or outliers and little more. After preprocessing, the next step is to discover the number of principal components that best represent the original (independent) variables in the least-squares error sense.
The rest of the study is structured as follows. The applications and usage of PLS and PCR are explained in Section 2, and Section 3 explains these procedures in detail. The results of applying these methods to the octane dataset are discussed in Section 4. Conclusions and future work are provided in the last section.

Literature Survey
Principal component analysis was first proposed by Hotelling in 1933. Principal component regression 1 is a straightforward method with high computational costs. Distinctive works have been published in recent years on the PCR and PLS algorithms 2 . The Partial Least Squares (PLS) regression method is gaining importance in numerous fields of science, including the analytical, physical and clinical sciences and industrial process control. Empirical research in international marketing was conducted with PLS 3 , demonstrating its impact.
The subjective evaluation of a set of five wines was predicted using PLS. The dependent variables to be predicted for each wine are its likability and how well it goes with meat or dessert; the predictors are the price, sugar, alcohol and acidity content of each wine 4 .
PLS correlation was illustrated with an example in which 36 wines were described by a matrix X comprising 5 independent quantities (price, total acidity, alcohol, sugar and tannin) and by a matrix Y containing 9 sensory quantities (fruity, floral, vegetal, spicy, woody, sweet, astringent, acidic, hedonic).
With the help of PLS, Abdi analyzed the information common to both tables and predicted one matrix from the other 5 . The PLS approach has also been used as a statistical methodology for fault detection and isolation in robotic systems 6 . PLS was applied to normally distributed data, with the suggestion that a logarithmic transformation be performed prior to PLS analysis if the data contain outliers 7 . PLS was used to estimate yields for the Stock Exchange of Thailand 8 , and on neuroimaging data for three groups of participants with three members in each group, where matrix X stores the neuroimaging (brain-activity) data, i.e. amplitudes across time for the vertex electrode, and matrix Y stores the behavioral data from a memory task 9 .
Important components of fresh raw milk were investigated using VIS-NIR spectrometers and the PLS method 10 . A study on nonlinear multivariate models 11 produced a good fit in regression analysis, with high coefficients of determination (R 2 = 0.988 and 0.994) for the Research Octane Number (RON). PLS and PCR were applied to nine gasoline samples acquired from the Pacific Coast Exchange Group 12 of the ASTM motor method.
More than 50 research papers have been published so far using PCR and PLS methodologies in different research sectors, but no research suggests the appropriate number, or range, of principal components for predicting the dependent variable with the PCR and PLS regression models. Since a random selection of K principal components yields erroneous predictions, while a smart selection of K components from the training set provides better prediction on test data, this paper analyzes the performance of PLS and PCR for various numbers of principal components. The octane dataset, with 401 spectral reflectance values for sixty gasoline samples, is used for the proposed analysis. The next section provides the fundamentals of normalization, PCR and PLS.

Methods and Techniques
Real-world datasets contain features that vary greatly in range, units and magnitude. Normalization should be performed when the scale of a feature is irrelevant or misleading, and should not be performed when the scale is meaningful; normalization protects data integrity. Min-max scaling, or min-max normalization 13 , is the simplest method of rescaling the range of features to [0, 1] or [−1, 1]. Choosing the target range depends on the nature of the data.
For the target range [0, 1], the transformation is x' = (x − min(x)) / (max(x) − min(x)), where x is an original value and x' is the normalized value.
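The study's analysis was carried out in R; purely as an illustrative sketch, min-max normalization can be written as follows in Python with NumPy (the function name and target-range parameters are our own, not from the study):

```python
import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    """Rescale values into [new_min, new_max] via min-max normalization."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    scaled = (x - x_min) / (x_max - x_min)          # maps into [0, 1]
    return scaled * (new_max - new_min) + new_min   # maps into the target range

spectrum = np.array([0.2, 0.5, 1.1, 0.8])
print(min_max_normalize(spectrum))   # values now span [0, 1]
```

Applying this column-wise to a spectral matrix rescales every wavelength channel to a comparable range before regression.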

Principal Component Regression (PCR)
The fundamental aim of PCR is to find the regression model with the best prediction performance, not the maximization of the total explained variance of the independent variables (X). Given a data matrix X of n × m values containing the predictor variables, and a dependent variable y with n observations, the univariate regression model is built as follows.
Algorithm: PCR
Input: data matrix X (n × m predictors) and response y
Output: Predict the Y value from X
Sequence of operations:
Step 1: Perform multiple linear regression with k principal components t1, . . . , tk instead of all of the x's.
Step 2: Determine how many components by cross-validation:
• Leave out one of the observations
• Fit a model on the remaining (reduced) data
• Predict the left-out observation with the model
• Do this in turn for ALL observations and compute the overall performance of the model by the Root Mean Squared Error of Prediction (RMSEP)
Step 3: Validate the model (select the one with the lowest RMSEP value and the highest R2 value).
Step 4: Interpret, conclude, and predict future values of Y.
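The PCR-with-cross-validation steps above can be sketched as follows. This is a minimal NumPy illustration (the study itself used R), and the helper names are our own assumption:

```python
import numpy as np

def pcr_fit_predict(X_train, y_train, X_test, k):
    """Fit a k-component Principal Component Regression and predict X_test."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    # Principal directions come from the SVD of the centered training data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T                                  # loadings of the first k components
    scores = Xc @ V                               # training-sample scores
    b, *_ = np.linalg.lstsq(scores, y_train - y_mean, rcond=None)
    return ((X_test - x_mean) @ V) @ b + y_mean   # regress y on the scores

def loo_rmsep(X, y, k):
    """Leave-one-out cross-validated RMSEP for a k-component PCR model."""
    sq_errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i             # leave observation i out
        pred = pcr_fit_predict(X[keep], y[keep], X[~keep], k)
        sq_errors.append((y[i] - pred[0]) ** 2)
    return float(np.sqrt(np.mean(sq_errors)))
```

The number of components K would then be chosen as the value minimizing `loo_rmsep(X, y, k)` over a candidate range such as 1 to 10.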

Partial Least Squares (PLS)
Like PCR, PLS selects components that explain the most variance in the model; unlike PCR, PLS also incorporates the response variable, i.e., it includes the feedback from Y. PLS is a powerful multivariate statistical tool that estimates the predictive relationship between variables.

Algorithm: PLS
Input: data matrix X and response Y
Output: Predict the Y value from X
Sequence of operations:
Step 1: Find a set of components.
Step 2: Fit a set of components to X (as in PCA).
Step 3: Likewise fit a set of components to Y.
Step 4: Choose the X and Y scores so that the association between consecutive pairs of scores is as strong as possible (maximum covariance of X and Y).
Step 5: Validate the model (select the one with the lowest RMSEP value).
Step 6: Interpret, conclude, and predict future values of Y.
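As an illustrative sketch of these steps (the study's analysis was done in R; this NIPALS-style PLS1 implementation and its function names are our own assumption, not the authors' code):

```python
import numpy as np

def pls1_fit(X, y, k):
    """PLS1 via NIPALS: extract k components of maximal X/y covariance."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean                 # working (deflated) copies
    W, P, Q = [], [], []
    for _ in range(k):
        w = E.T @ f                               # weight: max-covariance direction
        w /= np.linalg.norm(w)
        t = E @ w                                 # X scores
        p = E.T @ t / (t @ t)                     # X loadings
        q = (f @ t) / (t @ t)                     # y loading
        E = E - np.outer(t, p)                    # deflate X
        f = f - q * t                             # deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P = np.array(W).T, np.array(P).T
    B = W @ np.linalg.inv(P.T @ W) @ np.array(Q)  # regression coefficients
    return x_mean, y_mean, B

def pls1_predict(model, X_new):
    x_mean, y_mean, B = model
    return (X_new - x_mean) @ B + y_mean
```

Because each weight vector is chosen from the covariance between the deflated X and y, the response guides the component extraction, which is exactly the difference from PCR described above.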
The following section provides the results of the analysis of PCR and PLS. The accuracy of the models is measured by the minimum Root Mean Squared Error of Prediction (RMSEP), the Mean Squared Error of Prediction (MSEP) and the maximum R 2 ; the smaller the estimated error, the better the prediction.
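These three metrics can be computed directly; a minimal Python sketch (the function names are ours, not from the study):

```python
import numpy as np

def msep(y_true, y_pred):
    """Mean Squared Error of Prediction."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def rmsep(y_true, y_pred):
    """Root Mean Squared Error of Prediction."""
    return float(np.sqrt(msep(y_true, y_pred)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    ss_res = np.sum((y_true - np.asarray(y_pred)) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

A perfect prediction gives RMSEP = MSEP = 0 and R2 = 1; model selection below picks the component count with the lowest RMSEP and highest R2.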

Analysis Results
A data frame of sixty gasoline samples with octane number and 401 NIR spectra has been analyzed in this study. The NIR spectra were measured using diffuse reflectance as log(1/R) from 900 nm to 1700 nm in 2 nm intervals, giving 401 wavelengths [14][15][16] . The feasibility of predicting the octane number from the near-infrared spectra was investigated. Min-max normalization is applied to the entire dataset, scaling all values between 0 and 1 before processing. The analytics environment used is R i386 3.4.0; R is a primary tool for machine learning, statistics and data analysis.

Analysis of PCR and PLS on Raw Data
The data frame with 60 observations is separated into training data and test data: observations 1 to 50 are selected for training our models using PCR and PLS, and the remaining observations 51 to 60 are kept for testing. Figure 1(a) shows the plot generated by R i386 3.4.0 of the spectral signature of the actual (raw) data. A PCR model and a PLS model are generated by applying PCR and PLS to the training dataset. With the help of the PCR model and PLS model, the octane number of the test dataset is predicted, and the RMSEP, MSEP and R 2 values of the training set and test set are tabulated for further analysis. Figure 2 shows the RMSEP, MSEP and R 2 values generated by R i386 3.4.0 for components ranging from 1 to 10, which helps in selecting the optimum number of components for prediction for both the PCR and PLS models. Table 1 lists the RMSEP, MSEP and R 2 values for components ranging from 1 to 10 for PCR and PLS on the raw data.
From the plot and table, the conclusion is that for PCR the number of components for accurate prediction is 9, because at component 9 the RMSEP (0.23) is minimum and the R 2 (0.97) is maximum; in the PLS model, at component 6 the RMSEP (0.23) is already minimum and the R 2 (0.97) maximum. So for further prediction, the best-suited number of components is K = 9 for PCR and K = 6 for PLS. In PCR the principal components can be selected in the range 7 ≤ K ≤ 9; in PLS, in the range 4 ≤ K ≤ 9. Figure 3 shows the predicted vs. measured octane number using PCR and PLS, generated using R i386 3.4.0. Prediction accuracy is best in PCR when 9 components are taken into account, whereas in PLS 6 components suffice.

Analysis of PCR and PLS on Preprocessed Data
The reflectance variation is found between 1150 and 1250 nm, 1350 and 1450 nm, and 1600 and 1700 nm. The wavelengths from 900 to 1149 nm, 1252 to 1348 nm, and 1452 to 1598 nm are considered Regions of No Interest, because they contribute redundant information, and are removed from the spectral data. In total, 243 spectral wavelengths are treated as Regions of No Interest and are not considered for training or testing; as a result, the 401 columns are reduced to 158 columns. A PCR model and a PLS model are generated by applying PCR and PLS to the preprocessed training dataset. With the help of the PCR model and PLS model, the octane number of the preprocessed test dataset is predicted, and the RMSEP, MSEP and R 2 values of the training set and test set are tabulated for further analysis. Figure 1(b) shows the spectral region after removing the Regions of No Interest, generated using R i386 3.4.0. Figure 4 shows the RMSEP, MSEP and R 2 values generated using R i386 3.4.0 for components ranging from 1 to 10, which helps in selecting the optimum number of components for prediction for both the PCR and PLS models after removing the Regions of No Interest. Table 2 lists the RMSEP, MSEP and R 2 values for components ranging from 1 to 10 for PCR and PLS on the normalized data with the Regions of No Interest removed. From the plot and table, the conclusion is that preprocessed data improve the prediction with a smaller number of components: 1. the number of components is reduced in PCR, and 2. the number of components is reduced in PLS. Figure 5 shows the predicted vs. measured octane number using PCR and PLS, generated using R i386 3.4.0. Accurate prediction in PCR is obtained when 8 components are taken into account, whereas in PLS 4 components suffice.
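The Region-of-No-Interest removal described above amounts to a wavelength mask over the spectral columns. The following Python/NumPy snippet is our own illustration (the study used R); note that the exact number of retained columns depends on how the band boundaries are snapped to the 2 nm grid, and the study itself reports 158:

```python
import numpy as np

# Wavelength axis of the octane dataset: 900-1700 nm in 2 nm steps (401 channels)
wavelengths = np.arange(900, 1701, 2)

# Bands where reflectance variation was observed (regions of interest)
roi_bands = [(1150, 1250), (1350, 1450), (1600, 1700)]

keep = np.zeros(wavelengths.shape, dtype=bool)
for lo, hi in roi_bands:
    keep |= (wavelengths >= lo) & (wavelengths <= hi)

# X stands in for the (60, 401) spectral matrix; RoNI columns are dropped
X = np.random.rand(60, wavelengths.size)
X_roi = X[:, keep]
print(wavelengths.size, "->", X_roi.shape[1], "columns retained")
```

Training and testing then proceed exactly as before, but on `X_roi` instead of the full 401-column matrix.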
Overfitting and underfitting are not an issue in our gasoline dataset, but if the size of the data were larger, overfitting the model could happen.

Conclusion
Proper application of data cleansing and data preprocessing techniques can reduce analysis time and increase prediction accuracy. The primary focus of this study was to investigate the feasibility of predicting the K principal components, or suggesting the range of K, in the PCR and PLS algorithms on raw data and preprocessed data. PLS has: 1. improved predictive accuracy, and 2. a lower risk of chance correlation. Our study shows that PLS is a good alternative to the more classical multiple linear regression and principal component regression methods because it is more robust: robust means that the model parameters do not change much when new calibration samples are taken from the total population. Finally, the comparison of PCR and PLS on raw data and preprocessed data is tabulated, and the K principal component, and range of K values, for the PCR and PLS regression models is suggested. In future, the same sequence of analysis can be applied to big data; larger datasets show a large variation in the number of components selected for PCR and PLS on raw and preprocessed data. The aim is to prepare the model to generalize from the training records to any data from the problem domain, which allows us to make predictions in the future on data the model has never seen.