Modeling and prediction of third-party claim using a Machine learning approach

Background/Objectives: The main objective of this research paper is to build an appropriate mathematical model for forecasting the overall claim amount based on chosen characteristics of the data. Methods/Statistical analysis: In actuarial research, forecasting the third-party claim amount for motor vehicles is a challenging task, and only limited empirical research has been done on predicting such claims. In this context, annual historical time series claim data were collected for a period of 34 years to examine the predictive performance of the linear regression model, the exponential smoothing model, the autoregressive integrated moving average (ARIMA) model, the artificial neural network (ANN), and hybrid ARIMA-ANN models in predicting the third-party (TP) claim amount of motor insurance data in India. Findings: The empirical evidence from the study shows that the ANN model improved prediction accuracy compared to linear regression, exponential smoothing, ARIMA, and the hybrid model with respect to performance criteria such as root mean squared error (RMSE) and mean absolute percentage error (MAPE). Therefore, the ANN model is more potent in forecasting TP claim amounts, considering the adequacy, suitability, and accuracy of the data modeling. Novelty/Applications: This data analytics approach would help motor insurance companies in India to estimate expected future claim amounts. It will also help them provide a better customer-centric forecasting model, which supports better claims settlement and management.


Introduction
In India, the motor insurance sector has suffered heavily from motor own-damage claims on commercial motor vehicles, which affects the viability of the insurance companies; long-term claims intimation and non-cooperation in claim settlement are some of the other issues faced by the industry. Predictive modeling and forecasting of time series are essential in fields such as insurance, where the occurrence of events is uncertain. Empirical research on modeling insurance claim amounts is scarce; a few authors have considered the ARIMA model for predicting property damage claims (1), future health care insurance costs of inpatient diagnosis (2), and fire insurance losses (3). For life insurance data from the Republic of Macedonia, a comparison of two competing models, a combined ARIMA model and a moving average model, concluded that the combined Box-Jenkins model gives the best fit to the data series (4). Earlier studies of forecasting techniques applied to inflation and stock exchange prediction show that the ARIMA model is widely used for forecasting time series data (5). Although ARIMA models are flexible enough to address different types of problems in various fields, they are limited by the assumption that the data are linear (6,7). Because the exact nature of a data series is uncertain and the data points may have a non-linear structure, we also apply the ANN. Several works have forecast time series data with ANN architectures in different fields, using the Box-Jenkins methodology as the benchmark for testing the efficiency of the neural network model (8)(9)(10). In recent years, hybrid models have been developed because no single model predicts accurately in all situations.
Hence, the hybrid ARIMA-ANN model was developed to forecast data with greater accuracy (11)(12)(13). Multiple machine learning predictive models have also been developed in a study of spine surgery patients; they can be used to improve risk adjustment when assessing patients with varying co-morbidities (14). In another study forecasting future health care expenditure from insurance claims data, a Recurrent Neural Network (RNN) predictive model outperformed linear regression, least absolute regression, and gradient boosting machine models (15). Another study, using claims data from six health insurance societies to support population health management in Japan, developed a LASSO regression model, which may be useful for assessing a population health management plan (16). A comparative study on predicting stock exchange data shows that techniques such as stacking and blending offer higher prediction accuracy than bagging and boosting when applied to machine learning approaches such as Decision Trees, Support Vector Machines, and Neural Networks (17). A study on correctly pricing deposit insurance used a machine learning approach with a regularized cost function and found the derived quadratic model helpful for modeling implied volatility (18). Another study on predicting accident claims compared the machine learning algorithm XGBoost (19) with logistic regression and found that the classical logistic regression method showed higher predictive performance than XGBoost (20). However, no substantial research has been done on motor insurance in India, which motivates us to build and compare predictive models.

Data used for research
Secondary data were collected from different insurance companies in India for the 34 years from 1985 to 2018. The original data consist of 108 column variables distributed over 15,15,600 rows (i.e., approximately one hundred fifty million data points). After removing outliers, the data set was reduced to 12,72,311 rows. We randomly divided this data set into a training set of 8,90,618 rows (70%) and a testing set of 3,81,693 rows (30%). The variable studied is the third-party (TP) claim amount, to which the Linear Model, Exponential Smoothing, ARIMA, ANN, and Hybrid Model are fitted. The data were analyzed using STATGRAPHICS Version 18.1.12 and MATLAB 2019b. The model-building techniques are discussed below.
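The 70/30 partition described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the helper names `split_sizes` and `random_split`, the rounding rule, and the fixed seed are all assumptions.

```python
import random

def split_sizes(n_rows, train_fraction=0.70):
    """Return (n_train, n_test) for a random train/test partition."""
    n_train = round(n_rows * train_fraction)  # rounding rule is an assumption
    return n_train, n_rows - n_train

def random_split(rows, train_fraction=0.70, seed=0):
    """Shuffle a copy of rows and cut it into training and testing parts."""
    shuffled = list(rows)
    # A fixed seed is an assumption, used here only for reproducibility.
    random.Random(seed).shuffle(shuffled)
    n_train, _ = split_sizes(len(shuffled), train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Sizes for the cleaned data set of 12,72,311 rows reported in the text.
n_train, n_test = split_sizes(1272311)  # (890618, 381693)

# Small demonstration on placeholder row identifiers.
train, test = random_split(range(10))
```

Under this rounding rule the computed sizes agree with the 8,90,618 and 3,81,693 rows reported above.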

Modeling of time series
In modeling the time series data, we first applied traditional time series models: the Linear Model, exponential smoothing, and ARIMA. We then applied the ANN model and a hybrid model that combines the linear and non-linear domains, in order to predict future values with greater accuracy.

Linear model
The Generalized Linear Model is a generalization of the conventional linear regression model for a continuous response variable (TP claim amount) given a continuous predictor (time). Fitting a model to the TP claim data gives insurance companies a basis for predicting future claim amounts from historical data.
A simple linear regression model structure is given by

$$y_i = \beta_0 + \beta_1 x_i + e_i \qquad (1)$$

where $x_i$ stands for the time variable in yearly units, $y_i$ is the claim amount, $\beta_0$ and $\beta_1$ are the intercept and slope coefficients, and $e_i$ is a residual error term.
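As an illustration of how a linear trend of the form y = β0 + β1·x + e can be estimated, the sketch below uses ordinary least squares in plain Python. The function name `fit_linear_trend` and the sample values are hypothetical; the study itself fitted the model in STATGRAPHICS.

```python
def fit_linear_trend(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and cross-products of x and y about their means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

# Illustrative values only (not the study's data): year index vs. log claim amount.
years = [1, 2, 3, 4, 5]
log_claims = [2.0, 2.5, 3.0, 3.5, 4.0]
b0, b1 = fit_linear_trend(years, log_claims)
next_year_forecast = b0 + b1 * 6  # extrapolate one year ahead
```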

Exponential smoothing model
In this methodology, a weighted mean of past data is used to forecast the claim amount. It can be applied when the time series has no seasonality or trend and only has a level. The model is represented by the equation

$$\hat{y}_{i+1} = \alpha\, y_i + (1 - \alpha)\, \hat{y}_i \qquad (2)$$

where $\alpha$ is the smoothing parameter with $0 \le \alpha \le 1$, $y_i$ is the claim amount, and $\hat{y}_{i+1}$ is the forecast value of the claim amount. The smoothing parameter $\alpha$ defines the weighting. If the actual TP claim data appear relatively stable over time, we select a smaller value for $\alpha$. If the TP claim data tend to fluctuate rapidly, we choose a larger value for $\alpha$ so that the smoothed forecast adjusts more quickly to recent observations.
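The recursion behind simple exponential smoothing can be sketched in a few lines. This is a hedged illustration: the function name is hypothetical, and initialising the first forecast with the first observation is an assumption, since the text does not state how the recursion was started.

```python
def simple_exponential_smoothing(series, alpha, initial=None):
    """Return one-step-ahead forecasts from simple exponential smoothing.

    forecasts[i] is the forecast of series[i]; the final element is the
    forecast for the next, not yet observed, period.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not series:
        raise ValueError("series must not be empty")
    # Assumption: start the recursion from the first observation.
    forecast = series[0] if initial is None else initial
    forecasts = [forecast]
    for y in series:
        # New forecast = weighted mean of latest observation and old forecast.
        forecast = alpha * y + (1 - alpha) * forecast
        forecasts.append(forecast)
    return forecasts

# Illustrative values only.
claims = [10, 12, 11]
smoothed = simple_exponential_smoothing(claims, alpha=0.5)
naive = simple_exponential_smoothing(claims, alpha=1.0)
```

With alpha = 1 the forecast simply repeats the latest observation, which is why large values of alpha track rapidly fluctuating data more closely.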

ARIMA (p, d, q)
To apply an ARIMA model to the data efficiently, the series needs to be stationary. When the TP claim data are not stationary, the series is differenced d times to make it stationary; the original process can then be reconstructed by integrating the differenced series. The stationary series is modeled by ARIMA (p, d, q), with the orders identified using the autocorrelation function and the partial autocorrelation function. These suggest the most applicable model for the data.
The mathematical equation of the Box and Jenkins (21) process can be expressed as AR (p) I (d) MA (q) in terms of the backward shift operator:

$$\phi_p(B)\,(1 - B)^d\, y_t = \theta_q(B)\, \varepsilon_t \qquad (3)$$

where $\phi_p(B)$ and $\theta_q(B)$ are the autoregressive and moving average polynomials of orders p and q, $\varepsilon_t$ is the random error, and B is the backward shift operator describing the process of differencing, which is given by $B\,y_t = y_{t-1}$. When more than one ARIMA model is compared, the model with the smallest Bayesian Information Criterion (BIC), Akaike Information Criterion (AIC), RMSE, and MAPE is generally considered the best fit.
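To make the differencing step and the information criteria concrete, the sketch below shows differencing, its inversion by integration, and the usual AIC and BIC formulas. It is an illustration only: the helper names are hypothetical, and the AIC/BIC expressions are the common textbook definitions, which may differ by a constant from what a particular software package reports.

```python
import math

def difference(series, d=1):
    """Apply (1 - B) to the series d times."""
    out = list(series)
    for _ in range(d):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out

def integrate_once(diffed, first_value):
    """Invert one round of differencing, given the first original value."""
    out = [first_value]
    for step in diffed:
        out.append(out[-1] + step)
    return out

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Illustrative values only.
series = [3, 5, 8, 12]
first_diff = difference(series)         # [2, 3, 4]
second_diff = difference(series, d=2)   # [1, 1]
restored = integrate_once(first_diff, series[0])
```

Among candidate models fitted to the same data, the one with the smaller criterion value is preferred; BIC penalises additional parameters more heavily than AIC once the sample has more than about seven observations.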

Artificial Neural Network (ANN Model)
Neural network models are built from interconnected nodes. These networks are classified as single-layer or multilayer perceptrons and as feedback or feed-forward models. In feed-forward propagation, signals are transmitted from one neuron to another in the forward direction only. In most time series modeling, a multilayer perceptron feed-forward neural network (MLFFN) is used, in which an input layer, a hidden layer, and an output layer are connected by acyclic links. The mathematical relationship between the inputs $(y_{t-1}, y_{t-2}, \ldots, y_{t-p})$ and the output $y_t$ for the MLFFN model is

$$y_t = \alpha_0 + \sum_{j=1}^{q} \alpha_j\, f\!\left(\beta_{0j} + \sum_{i=1}^{p} \beta_{ij}\, y_{t-i}\right) + e_t \qquad (4)$$

Here $\alpha_j$ are the weights from the hidden to the output nodes, and $\beta_{ij}$ are the weights from the input to the hidden nodes; p and q refer to the number of input and hidden nodes, and f refers to the sigmoidal activation function, which can be expressed mathematically as

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (5)$$

The ANN model of equation (4) implements a non-linear mapping from the past observations to the future value, i.e.,

$$y_t = f(y_{t-1}, y_{t-2}, \ldots, y_{t-p}, \omega) + e_t \qquad (6)$$

Here $\omega$ is a vector of all parameters, f is a function determined by the weights and the network structure, and $e_t$ is the error component.
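The forward pass of a single-hidden-layer feed-forward network with a sigmoid activation and a linear output node can be sketched as below. The function names and the weights are hypothetical; in the study the network was built and trained in MATLAB, so this only illustrates how a forecast is computed once the weights are known.

```python
import math

def sigmoid(x):
    """Logistic activation function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forecast(lags, hidden_weights, hidden_biases, output_weights, output_bias):
    """One-step forecast from a single-hidden-layer feed-forward network.

    lags            -- past observations [y_{t-1}, ..., y_{t-p}]
    hidden_weights  -- one list of p input weights per hidden node
    hidden_biases   -- one bias per hidden node
    output_weights  -- one weight per hidden node (hidden -> output)
    output_bias     -- bias of the linear output node
    """
    hidden = [
        sigmoid(bias + sum(w * y for w, y in zip(weights, lags)))
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return output_bias + sum(a * h for a, h in zip(output_weights, hidden))

# Placeholder weights (not trained values): two lags, two hidden nodes.
prediction = mlp_forecast(
    lags=[0.4, 0.2],
    hidden_weights=[[0.0, 0.0], [0.0, 0.0]],
    hidden_biases=[0.0, 0.0],
    output_weights=[1.0, 1.0],
    output_bias=0.25,
)
```

With all hidden weights set to zero, each hidden node outputs sigmoid(0) = 0.5, so the placeholder call returns 0.25 + 0.5 + 0.5.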

Hybrid ARIMA-ANN model
In many real-world time series forecasting problems, the Box-Jenkins model is appropriate for the linear domain, whereas the ANN is used for the non-linear domain. Both models have been successful in their own domains. Therefore, a hybrid model covering both linear and non-linear domains was proposed by Zhang (10). In this approach, Zhang combined ARIMA for modeling the linear component and ANN for modeling the non-linear component to improve forecasting accuracy. According to the model,

$$y_i = L_i + N_i \qquad (7)$$

where $y_i$ stands for the TP claim amount, and $L_i$ and $N_i$ denote the linear and non-linear components of the claim data. According to Zhang, the original claim amount is first fitted by ARIMA to model the linear part, so that the residuals obtained from ARIMA contain only the non-linear part, which is then fitted by the ANN model. The residuals of the ARIMA model are defined by $e_i = y_i - \hat{L}_i$, where $\hat{L}_i$ denotes the predicted value of ARIMA. Secondly, the residuals are modeled using the ANN, which captures the non-linear pattern of the TP claim data. Using p inputs, the ANN model for the residuals is of the form

$$e_i = f(e_{i-1}, e_{i-2}, \ldots, e_{i-p}) + \varepsilon_i \qquad (8)$$

where f denotes a non-linear function determined by the ANN and $\varepsilon_i$ represents the random error. With the predicted value obtained from the ANN denoted as $\hat{N}_i$, the hybrid ARIMA-ANN forecast of the time series is obtained as

$$\hat{y}_i = \hat{L}_i + \hat{N}_i \qquad (9)$$

The hybrid model is then applied to the third-party claim data to check whether it gives better forecasting accuracy than ARIMA and ANN by comparing RMSE and MAPE.
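The two-stage combination described above (linear forecast plus a forecast of its residuals) can be sketched generically. This is a minimal illustration under stated assumptions: `hybrid_forecast` is a hypothetical helper, the linear predictions are taken as given, and the residual model is a stand-in callable rather than a trained neural network.

```python
def hybrid_forecast(series, linear_fitted, linear_next, residual_model, p):
    """Combine a linear forecast with a non-linear forecast of its residuals.

    series          -- observed values
    linear_fitted   -- in-sample predictions of the linear model (same length)
    linear_next     -- linear model forecast for the next period
    residual_model  -- callable mapping the last p residuals (most recent
                       first) to a predicted next residual
    p               -- number of lagged residuals fed to residual_model
    """
    if len(series) != len(linear_fitted):
        raise ValueError("series and linear_fitted must have equal length")
    # Stage 1: residuals left unexplained by the linear model.
    residuals = [y - fitted for y, fitted in zip(series, linear_fitted)]
    # Stage 2: forecast the next residual from the p most recent ones.
    recent = residuals[-p:][::-1]
    nonlinear_next = residual_model(recent)
    # Final forecast is the sum of the two components.
    return linear_next + nonlinear_next

# Illustrative values only; the lambda stands in for a trained ANN.
forecast = hybrid_forecast(
    series=[10.0, 12.0, 11.0, 13.0],
    linear_fitted=[9.5, 11.5, 11.5, 12.5],
    linear_next=13.5,
    residual_model=lambda recent: 0.5 * recent[0],
    p=2,
)
```

Keeping the residual model as a plain callable makes it easy to swap in any fitted non-linear model without changing the combination step.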

Results and Discussion
The secondary sample data set of overall third-party claims is used. Following other works in the literature, the natural logarithm of the data is used in the analysis. In this study, the Linear Regression Model and the Exponential Smoothing model are built using STATGRAPHICS, while the ARIMA, ANN, and hybrid models are built using MATLAB 2019b. The Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE) are then estimated to identify the best forecasting model. From Figure 1, we can observe that the number of claims and the claim amount increase gradually from 1985 to 2018. This indicates that the occurrence of claims and the claim amounts are skewed.
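The two accuracy measures used for model comparison can be computed as in the sketch below, which uses their standard definitions. Returning MAPE as a fraction rather than a percentage is an assumption, since the text does not state the scale of the reported values.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction.

    Multiply by 100 for a percentage. Undefined if any actual value is zero.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return sum(ratios) / len(ratios)

# Illustrative values only.
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_mape = mape([2.0, 4.0], [1.0, 5.0])
```

RMSE is expressed in the units of the (log-transformed) data and penalises large errors more heavily, while MAPE is scale-free, which is why the two are reported together.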

Results of linear model
First, we fitted a linear regression model to the TP claim data, taking the claim amount as the response variable and time as the explanatory variable for overall motor vehicles. From the observed data, the coefficients β0 and β1 are statistically significant. Thus, the linear regression model of equation (1) was fitted to the data. Based on the fitted equation, we predicted the claim amount for the entire data set and calculated the RMSE and MAPE accuracy measures for model comparison.

Results of exponential smoothing model
First, we determine the optimal value of the smoothing constant α. For that, ten experimental trials are performed with smoothing parameters ranging from 0.1 to 0.9871. From Table 1, it is observed that the smallest error measures are obtained at higher values of the smoothing parameter (α), because the actual and forecast TP claim amounts fluctuate rapidly (Figure 2). The minimum values of the accuracy measures MAPE, MSE, and RMSE occurred at the optimum smoothing parameter (α = 0.9703). It is also clear that the error measures decrease as the smoothing constant increases. The exponential smoothing model of equation (2) was fitted to the historical data, and the actual average claim amounts were plotted against the predicted values together with the measures of accuracy.
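Choosing the smoothing constant by trial can be sketched as a small search over candidate values, keeping the one with the lowest one-step-ahead RMSE. The function names, the candidate grid, and the sample series are hypothetical and do not reproduce the ten trials of the study; the first observation is assumed as the starting forecast.

```python
import math

def one_step_rmse(series, alpha):
    """RMSE of one-step-ahead simple exponential smoothing forecasts."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    forecast = series[0]  # assumption: initial forecast is the first value
    squared_error = 0.0
    for y in series[1:]:
        # 'forecast' was formed from observations before y.
        squared_error += (y - forecast) ** 2
        forecast = alpha * y + (1 - alpha) * forecast
    return math.sqrt(squared_error / (len(series) - 1))

def best_alpha(series, candidates):
    """Return the candidate smoothing constant with the smallest RMSE."""
    return min(candidates, key=lambda alpha: one_step_rmse(series, alpha))

# Illustrative, steadily rising series (not the study's data).
rising = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]
chosen = best_alpha(rising, candidates)
```

For a series that keeps moving away from its past level, the search favours the largest candidate, which is consistent with the observation that the error measures fall as the smoothing constant increases.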

Results of the ARIMA (p, d, q) model
From Figure 3, it is observed that the original data series is non-stationary. The augmented Dickey-Fuller (ADF) unit root test also confirmed this: the test was not significant (p = 0.1714 > 0.01), so the series is non-stationary. The TP claim data series was therefore differenced. When the ADF unit root test was applied again to the differenced series, it was concluded that the series is stationary (p = 0.0000 < 0.01). The data series is therefore modeled by an ARIMA (p, d, q) with the parameter d = 1. In Table 2, we report the feasible ARIMA models and their performance criteria: AIC, BIC, RMSE, and MAPE. Comparing all the fitted ARIMA models on these criteria, the ARIMA (2, 1, 1) model fits well, with relatively smaller values of the criteria, and is used to forecast the TP claim amount.

Results of ANN model
The modeling process was performed using the Neural Network tool (nntool) in the MATLAB 2019b workspace. From the total samples, 70% are selected randomly for training, 15% for validation, and the remaining 15% for testing the network. While training on the claim data, the Levenberg-Marquardt backpropagation algorithm (TRAINLM) is applied, and the RMSE and MAPE are computed. The network performance is checked, and the network is retrained when the performance is not satisfactory. Finally, the TRAINLM method builds the forecast results based on the Mean Square Error (MSE) through trial-and-error experiments. From Figure 4, it is observed that the regression plots of the ANN for the training, validation, testing, and overall data sets fall along the 45° line, with R values close to 0.999 in each case and a best validation performance measure of 0.0010621. This indicates that the predictions represent a good fit; the best performance measures for the overall data set are an RMSE of 0.03829 and a MAPE of 0.01202.

Results of the Hybrid ARIMA-ANN Model
In this model, we first fitted the Box-Jenkins ARIMA model to the original data based on equation (3). The residuals obtained from the fitted ARIMA model indicate non-linearity. We then applied the ANN technique to the ARIMA residuals using the TRAINLM algorithm. Finally, based on equation (9), the forecast values of the hybrid model are obtained by adding the predicted values of ARIMA and ANN, giving a best performance of RMSE 0.03921 and MAPE 0.01341 for the overall TP claim amount.

Prediction performance of forecasting models
Table 3 compares the prediction performance criteria RMSE and MAPE for the Linear Regression Model, Exponential Smoothing, ARIMA, ANN, and Hybrid models. The comparative results show that the ANN model yields a more accurate forecast than any other model, with lower RMSE and MAPE. In particular, the Artificial Neural Network model is slightly better than the hybrid model for forecasting the claim amount. Figure 5 compares the overall claim data with the values predicted by each model. The predicted ANN plot follows the overall claim more closely than the other models, which indicates that ANN is a better fit for the overall third-party claim amount.