Forecasting Analysis of GMDH Model with LSSVM and MARS Models for Hydrological Datasets (Case Study)

Objectives: To forecast hydrological datasets using a time-series forecasting model, namely the Group Method of Data Handling (GMDH). Methods/statistical analysis: Monthly streamflow datasets covering periods of 485 and 550 months were collected from two well-known rivers of Pakistan, the Indus and the Chenab, respectively, to validate the GMDH model. The computed results are compared with those of two other forecasting models: Least Squares Support Vector Machine (LSSVM) and Multivariate Adaptive Regression Splines (MARS). The accuracy of the models was assessed using three statistical measures: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Correlation Coefficient (CE). Findings: The GMDH model forecasts the observed values of the hydrological datasets with higher precision than the other models discussed in the present article, and it is also more robust. Applications/improvements: The novelty of this study is that it provides a reliable forecast of the streamflow of these rivers.


Keywords: GMDH, LSSVM, MARS, MAE, RMSE, CE

Introduction
The Islamic Republic of Pakistan has one of the world's largest irrigation systems. The river system comprises 61,000 km of canals and 105,000 watercourses and irrigates around 35 million acres of land. 1 Pakistan's economy is highly dependent on agriculture. The Indus River (locally called Sindhu) runs through the entire length of the country, and the Chenab River (which originates from the Bara Lacha Pass) plays a crucial role in the irrigation system. The country's increasing agricultural growth and the resulting stress on limited water resources call for well-organised management of existing water resources rather than the building of new facilities. 2 Within the water management community, Pakistan is well recognised for its efforts to maximise the efficiency of water management through streamflow forecasting, a methodology that plays a vital role in helping the government tackle water shortages. Forecasting river streamflow therefore plays a significant role in the planning and operation of water management, and it is useful for irrigation, hydropower generation, recreation, and ecological and other purposes. The quality of a streamflow forecast can be assessed in terms of lead time and accuracy. Lead time refers to the interval between the date on which the forecast is issued and the date on which the forecast flow occurs. Forecast accuracy is an essential requirement for improving operational, managerial, and strategic decision-making.
GMDH was introduced by the Ukrainian scientist A. G. Ivakhnenko and colleagues in the late 1960s; it uses multivariate analysis and multilayered modelling to study nonlinear relations between input and output variables. The GMDH model is well suited to multilayered, unstructured systems. Its predictions are useful in data mining, optimisation, and pattern recognition in many areas. As a forecasting model for regression estimation, GMDH has been used to develop a feed-forward network whose nodes use an analytically determined quadratic transfer function (TF). 3,4 Several models for estimating and predicting time series have been discussed in the literature. The LSSVM and MARS models are considered among the leading models in conventional time-series forecasting and are commonly used for benchmarking and comparison. Both are related to linear model formulations and can capture the linear component of the information in a time series. Subsequently, many researchers have attempted to integrate different time-series models to enhance the precision of forecasting. 5,6 New methods have been developed and applied to these models, and the forecasting tasks they accomplish are considerably more complex than those handled by earlier models. 7,8 This study evaluates the monthly river streamflow forecasting performance of the GMDH model and compares its forecasts with those of the LSSVM and MARS models.

Group Method of Data Handling (GMDH) Model
The GMDH model is a family of computer-based mathematical modelling algorithms for multi-parametric datasets that feature automatic structural and parametric optimisation. The GMDH algorithm makes it possible to identify the underlying interrelations in the data and to choose an optimal model, which improves on the accuracy of existing algorithms. 9 The GMDH model is recognised for its ability to represent complex nonlinear systems by using a transfer function (TF) to express the relationship between the input and output datasets in the form of the Volterra functional series, more commonly known as the Kolmogorov-Gabor polynomial (KGP), which is defined as

$$h = a_0 + \sum_{i=1}^{m} a_i x_i + \sum_{i=1}^{m}\sum_{j=1}^{m} a_{ij} x_i x_j + \sum_{i=1}^{m}\sum_{j=1}^{m}\sum_{k=1}^{m} a_{ijk} x_i x_j x_k + \cdots$$

where $m$ is the number of input variables. This algebraic series is expressed by a system of TFs, each comprising two variables (neurons), defined as follows:

$$h = a_0 + a_1 x_i + a_2 x_j + a_3 x_i x_j + a_4 x_i^2 + a_5 x_j^2$$

The coefficients $\{a_0, a_1, a_2, a_3, a_4, a_5\}$ are determined using the least squares method. The input variables of the system (observed variables) are denoted by $x$ and the output (predicted variable) by $h$.
The following iterative structure was observed for the GMDH model:
Step 3: The coefficients of the TF determined by SME are in the form of

$$A = (X^{T} X)^{-1} X^{T} Y$$

Here $A = \{a_0, a_1, \ldots, a_5\}$ is the vector of unknown coefficients, $Y = \{y_1, y_2, \ldots, y_M\}^{T}$ is the vector of observed output values, and $X$ is the $M \times 6$ matrix whose $p$-th row is $(1, x_{ip}, x_{jp}, x_{ip} x_{jp}, x_{ip}^2, x_{jp}^2)$.
Step 4: Select the optimal factors and eliminate the weakest variable. The selection of the optimal factors depends on the three performance indices: MAE, RMSE, and CE.
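As a minimal sketch of the coefficient estimation in Step 3, the following Python code fits a single quadratic TF neuron by least squares. The ordering of the terms (1, x_i, x_j, x_i x_j, x_i^2, x_j^2), the function names, and the synthetic data are assumptions made for illustration; they are not taken from the study.

```python
import numpy as np

def design_matrix(xi, xj):
    """One row per observation: (1, x_i, x_j, x_i*x_j, x_i^2, x_j^2)."""
    xi = np.asarray(xi, dtype=float)
    xj = np.asarray(xj, dtype=float)
    return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])

def fit_neuron(xi, xj, y):
    """Least-squares estimate of the six coefficients a_0..a_5."""
    X = design_matrix(xi, xj)
    # lstsq returns the same least-squares solution as (X^T X)^{-1} X^T Y
    # when X has full column rank, but avoids forming X^T X explicitly.
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coeffs

def predict_neuron(coeffs, xi, xj):
    """Evaluate the fitted quadratic TF on new input pairs."""
    return design_matrix(xi, xj) @ coeffs

# Synthetic data generated from a known quadratic, so the fit can be checked.
rng = np.random.default_rng(0)
xi = rng.uniform(-1.0, 1.0, size=50)
xj = rng.uniform(-1.0, 1.0, size=50)
true_a = np.array([1.0, 2.0, -1.0, 0.5, 0.3, -0.7])
y = design_matrix(xi, xj) @ true_a

a_hat = fit_neuron(xi, xj, y)
print(np.round(a_hat, 3))
```

In a full GMDH network, a neuron of this kind would be fitted for each pair of inputs, and only the best-performing neurons would be passed on to the next layer.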

Least-Squares Support Vector Machine (LSSVM) Formulation
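The least-squares version of the support vector machine (SVM) classifier is obtained by reformulating the minimisation problem as follows:

$$\min_{\omega, b, e} J(\omega, b, e) = \frac{\mu}{2}\,\omega^{T}\omega + \frac{\zeta}{2}\sum_{i=1}^{N} e_i^2$$

subject to the equality constraints

$$y_i\left[\omega^{T}\phi(x_i) + b\right] = 1 - e_i, \quad i = 1, \ldots, N.$$

The above formulation of the LSSVM classifier implicitly corresponds to a regression interpretation with binary targets $y_i = \pm 1$. Applying $y_i^2 = 1$ gives

$$\sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (y_i e_i)^2 = \sum_{i=1}^{N} \left(y_i - \left(\omega^{T}\phi(x_i) + b\right)\right)^2.$$

Thus, the error $e_i$ also makes sense for least-squares (LS) data fitting. Therefore, the LSSVM classifier formulation is equivalent to

$$J(\omega, b, e) = \mu E_W + \zeta E_D, \quad E_W = \frac{1}{2}\,\omega^{T}\omega, \quad E_D = \frac{1}{2}\sum_{i=1}^{N} e_i^2,$$

where $\mu$ and $\zeta$ are hyper-parameters that adjust the amount of regularisation versus the sum of squared errors. The solution depends only on the ratio $\gamma = \zeta/\mu$, and the original formulation therefore uses only $\gamma$ as a tuning parameter. Both parameters $\mu$ and $\zeta$ are used in order to give a Bayesian interpretation of LSSVM. The solution of the LSSVM model is obtained after constructing the following Lagrangian function, written here with the single parameter $\gamma$:

$$L(\omega, b, e, \alpha) = \frac{1}{2}\,\omega^{T}\omega + \frac{\gamma}{2}\sum_{i=1}^{N} e_i^2 - \sum_{i=1}^{N} \alpha_i \left\{\left[\omega^{T}\phi(x_i) + b\right] + e_i - y_i\right\}$$

Here, $\alpha_i$ are the Lagrange multipliers, and the conditions for optimality are

$$\omega = \sum_{i=1}^{N} \alpha_i \phi(x_i), \quad \sum_{i=1}^{N} \alpha_i = 0, \quad \alpha_i = \gamma e_i, \quad y_i = \omega^{T}\phi(x_i) + b + e_i.$$

After eliminating $\omega$ and $e$, a linear system is obtained in place of a quadratic programming (QP) problem.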
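The linear system that replaces the QP problem can be sketched for the regression case as follows. The block form of the system, the RBF kernel, and the values of gamma and sigma are assumptions based on the standard LSSVM formulation; the kernel and hyper-parameters actually used in the study are not stated here, and the sine data are purely illustrative.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

def lssvm_fit(X, y, gamma, sigma):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = 1.0                      # first row enforces sum(alpha) = 0
    M[1:, 0] = 1.0                      # bias column
    M[1:, 1:] = K + np.eye(n) / gamma   # kernel matrix plus ridge term
    rhs = np.concatenate([[0.0], y])
    sol = np.linalg.solve(M, rhs)
    return sol[0], sol[1:]              # bias b, multipliers alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma):
    """f(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Illustrative data only: one period of a sine wave.
gamma, sigma = 100.0, 1.0
X = np.linspace(0.0, 2.0 * np.pi, 40).reshape(-1, 1)
y = np.sin(X).ravel()

b, alpha = lssvm_fit(X, y, gamma, sigma)
y_fit = lssvm_predict(X, b, alpha, X, sigma)
print("sum of alpha:", alpha.sum())
```

Because the training residuals satisfy e_i = alpha_i / gamma, a larger gamma fits the training data more closely, while a smaller gamma gives a smoother function.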

Multivariate Adaptive Regression Splines (MARS)
The MARS model is suited to forecasting continuous outcomes and is implemented in two stages: a forward and a backward stepwise technique. The forward stepwise technique adds a large set of basis functions of the input variables with different knots; however, this can result in a complex, over-fitted model. 12 Such a model cannot guarantee a strong forecast and has in fact been found to have weak forecasting ability. To increase forecast accuracy, the backward stepwise technique is therefore applied; it eliminates redundant variables from the chosen set by pruning the basis functions that contribute least to the approximation. The mapping of the input variable x onto the output variable y is carried out using basis functions that define the point of inflection along the input range 13 :

$$y = \max(0,\, x - c) \quad \text{and} \quad y = \max(0,\, c - x)$$

In these functions, x is the input and c is the threshold value, called the knot. The functions are used in the forward-backward stepwise techniques for each input variable in order to locate the knots, where the value of the function changes. 14 These functions are called spline functions and form a reflected pair about the knot c. The following is the general form of the MARS model. 15

$$y = f(x) = c_0 + \sum_{i=1}^{M} c_i B_i(x)$$

where y is the output variable estimated by the MARS model, $c_0$ is a constant, $c_i$ is the coefficient of the ith basis function, determined by minimising the Root Mean Squared Error (RMSE), and $B_i(x)$ is the ith basis function. The optimal MARS model is selected as the one with the smallest Generalised Cross-Validation (GCV) value.
The GCV is defined as follows:

$$\mathrm{GCV} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2}{\left(1 - \frac{C(M)}{n}\right)^2}$$

where $y_i$ is the target output, $f(x_i)$ is the predicted output, n is the number of observations, and $C(M)$ is a complexity penalty determined by d, the penalty for each basis function included in the model, and M, the number of basis functions.
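To illustrate how a knot is located and scored, the following Python sketch builds a reflected pair of hinge functions, fits the coefficients by least squares, and compares GCV values for three candidate knots. The hinge form max(0, x - c), the GCV form (mean squared error divided by (1 - C(M)/n)^2), the numerical value used for C(M), and the toy data are assumptions based on the standard MARS formulation, not details taken from the study.

```python
import numpy as np

def hinge_pair(x, c):
    """Reflected pair of hinge functions with knot c."""
    return np.maximum(0.0, x - c), np.maximum(0.0, c - x)

def gcv(y, y_hat, c_m):
    """GCV = mean squared error / (1 - C(M)/n)^2."""
    n = len(y)
    mse = np.mean((y - y_hat) ** 2)
    return mse / (1.0 - c_m / n) ** 2

# Toy target: piecewise linear with a kink at x = 4.
x = np.linspace(0.0, 10.0, 21)
y = np.abs(x - 4.0)

def fit_with_knot(c):
    """Fit c_0 + c_1*max(0, x - c) + c_2*max(0, c - x) by least squares."""
    right, left = hinge_pair(x, c)
    B = np.column_stack([np.ones_like(x), right, left])
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return B @ coef

# Hypothetical penalty value; in practice C(M) depends on d and M.
c_m = 5.0
scores = {c: gcv(y, fit_with_knot(c), c_m) for c in (2.0, 4.0, 6.0)}
best_knot = min(scores, key=scores.get)
print("best knot:", best_knot)
```

The candidate knot at the true kink reproduces the target exactly, so it receives the smallest GCV; the other candidates leave a residual and are penalised accordingly.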

Results and Discussion
The chief aim of the GMDH model used in the present study was to analyse the input time-series hydrological data collected from the Indus and Chenab rivers and to produce forecasts close to the observed values, as discussed in the opening paragraphs of this article. The six distinct input combinations prepared for this purpose are shown in Table 1.
The input combinations M1-M6 were used in the training and testing phases of the forecasting models. The combinations M1-M6 differ in the number of input variables, which were selected on the basis of previous analyses of monthly river streamflows.
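Each input combination relates the current flow to a fixed number of previous monthly flows, in the form y_t = f(z_{t-1}, ..., z_{t-p}). A minimal Python sketch of how such a lagged input matrix could be built from a single series is given below. The function name, the lag count of 3, and the stand-in series are hypothetical and do not reproduce any specific combination from Table 1.

```python
import numpy as np

def make_lagged(z, n_lags):
    """Return (X, y) with X[k] = (z_{t-1}, ..., z_{t-n_lags}) and y[k] = z_t."""
    z = np.asarray(z, dtype=float)
    if not 1 <= n_lags < len(z):
        raise ValueError("n_lags must be between 1 and len(z) - 1")
    # Column for a given lag holds the series shifted back by that many steps.
    X = np.column_stack(
        [z[n_lags - lag: len(z) - lag] for lag in range(1, n_lags + 1)]
    )
    y = z[n_lags:]
    return X, y

# Stand-in series 0, 1, ..., 9 so the alignment is easy to inspect.
z = np.arange(10.0)
X, y = make_lagged(z, 3)
print(X[0], y[0])
```

The first n_lags observations are lost because they have no complete history, so a series of length n yields n - n_lags training rows.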
The computed results for the GMDH, LSSVM, and MARS models are given in Tables 2, 3, and 4, respectively. Table 2 presents the results for the GMDH model, where it can be seen that the M5 and M6 models perform better than the other input combinations for the Chenab and Indus Rivers, respectively. Similarly, Table 3 presents the results for the LSSVM model, which show that the M6 model works better for both the Indus and Chenab Rivers. From the results presented in Table 4 for the MARS model, it can be seen that the M3 and M6 models perform better for the Indus and Chenab Rivers, respectively. The computed values have been assessed using statistical measures. The accuracy of the models is shown in Table 5.
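The statistical measures used for this assessment can be computed as in the following Python sketch. MAE and RMSE follow their usual definitions; CE is implemented here as the Pearson correlation coefficient, which is an assumption based on the expansion of the abbreviation given in this article (in some hydrological studies CE denotes a coefficient of efficiency instead). The observed and predicted values are made-up numbers, not results from the study.

```python
import numpy as np

def mae(obs, pred):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(obs - pred)))

def rmse(obs, pred):
    """Root Mean Squared Error."""
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def correlation(obs, pred):
    """Pearson correlation coefficient between observed and predicted values."""
    return float(np.corrcoef(obs, pred)[0, 1])

# Made-up values for illustration only.
obs = np.array([3.0, 5.0, 7.0, 9.0])
pred = np.array([2.0, 6.0, 7.0, 10.0])

print("MAE :", mae(obs, pred))    # 0.75
print("RMSE:", rmse(obs, pred))
print("CE  :", correlation(obs, pred))
```

Smaller MAE and RMSE indicate a closer fit, while a correlation closer to 1 indicates that the forecasts follow the pattern of the observations more closely.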
Table 5 shows that, in terms of the statistical measures of accuracy, MAE and RMSE, the GMDH model achieves smaller errors than the LSSVM and MARS models. This is evidence that the GMDH model is better than the other two models. Furthermore, with regard to the robustness of the model, a large value of CE is observed for the GMDH model.

Table 1 (fragment). Input structures of the forecasting models:
$y_t = f(z_{t-1}, z_{t-2}, z_{t-3}, z_{t-4}, z_{t-5}, z_{t-6})$
M4: $y_t = f(z_{t-1}, z_{t-2}, z_{t-3}, z_{t-4}, z_{t-5}, z_{t-6}, z_{t-7}, z_{t-8})$

Conclusion
Three time-series forecasting models have been compared using computed statistical measures. The findings show that the GMDH model is more useful and more accurate than the other two models, LSSVM and MARS, in forecasting the observed values of the monthly river streamflow datasets for both the Indus and Chenab Rivers of Pakistan. Based on the numerical investigations, it is concluded that the forecasting performance of the GMDH model is more stable and robust than that of the LSSVM and MARS models.