Calculation of Stability Constant of Metal-thiosemicarbazone Complexes using MLR, PCR and ANN

Objectives: In this work, the stability constants logβ11 of complexes between thiosemicarbazones and metal ions were predicted by Quantitative Structure-Property Relationship (QSPR) modeling. Methods: The QSPR models were developed using Multiple Linear Regression (MLR), Principal Component Regression (PCR) and an Artificial Neural Network (ANN). Findings: The QSPR models were cross-validated, and their quality is shown by the following statistics. QSPR MLR model: R²train = 0.9446, R²adj = 0.939, Q²LOO = 0.9262, SE = 0.529 and Fstat = 160.817; QSPR PCR model: R²train = 0.949, R²adj = 0.942, Q²CV = 0.928, MSE = 0.292, RMSE = 0.540 and Fstat = 134.617; QSPR ANN model with architecture I(7)-HL(10)-O(1): R²train = 0.986, Q²CV = 0.984 and R²test = 0.983. Applications: The results of this work could serve in designing new thiosemicarbazone derivatives that are useful in analytical chemistry, pharmacy and environmental science.


Introduction
The bonding of metal ions with thiosemicarbazone ligands in aqueous solution plays an important role in recent studies 1, 2 as well as in studies of biological processes 3. Efforts have been made to design new ligands that bind selectively to a metal ion and allow metal ion extraction 1, 3. A large body of empirical data on the stability constants of thiosemicarbazone-metal complexes has now been collected 4-16. This provides a good opportunity to develop quantitative relationships between the structure and the stability constants of complexes, which can be used to design new thiosemicarbazone ligands that bind to metal ions 17, 18. Publications continue to appear in the literature showing that developing QSPR models with multivariate techniques is a good choice for predicting the stability constants of complexes 1.
On the other hand, QSPR models of the stability constants of metal-thiosemicarbazone complexes have been pursued for many practical applications. The molecular activities of thiosemicarbazone compounds and their complexes have been used to support analytical chemistry 19 and medicinal areas 3. The metal-thiosemicarbazone complexes are applied in medicine for their antibacterial, antifungal, anti-malarial, antitumor and antiviral activity 20-22. They are also used to catalyze chemical reactions 23 and in the environmental area 24. In these cases, applying QSPR models in practice is not simple, because the information needed to describe the molecules and the details of the calculation method are often insufficient.
Vol 12 (25) | July 2019 | www.indjst.org
To address the problems outlined above, we establish Quantitative Structure-Property Relationships (QSPR) for the stability constant (logβ11) of metal-thiosemicarbazone complexes of Ni²⁺, Co²⁺, Mo⁶⁺, Cu²⁺, Mn²⁺, Zn²⁺, Ag⁺, Pb²⁺, Fe²⁺ and Zn²⁺. All of these models are based on molecular descriptors of the complexes that result only from 2D and 3D molecular calculations. Some 3D molecular descriptors were calculated using the new semi-empirical quantum chemistry methods PM7 and PM7/sparkle. Here we report new QSPR models for the logβ11 stability constants of 10 transition metal ions with a set of diverse thiosemicarbazone ligands in aqueous solution at 298 K and an ionic strength of 0.1 M 4-16. The models were cross-validated and tested by an external validation process.
The experimental logβ11 constants reported by different authors for the same complex can differ considerably, as shown in Table 1. If several logβ11 constants are available for a ligand, the most recent value, or the value consistent across different experimental methods, is chosen. Thus, 19 (Ni²⁺), 16 (Co²⁺), 29 (Cu²⁺), 7 (Mn²⁺), 1 (Zn²⁺), 1 (Mo⁶⁺) and 1 (Ag⁺)-thiosemicarbazone complexes are taken for the QSPR modeling of the logβ11 constants, as exhibited in Figure 2. The logβ11 constants of the complexes vary over different ranges, as shown in Table 3. The normal distribution of the logβ11 values characterizes the dataset, as shown in Figure 2.
The metal-thiosemicarbazone complexes are formed by the following reaction between a metal ion (M) and a thiosemicarbazone ligand (L) in aqueous solution:

$$p\,\mathrm{M} + q\,\mathrm{L} \rightleftharpoons \mathrm{M}_p\mathrm{L}_q \qquad (1)$$

For the step with p = 1 and q = 1, the stability constant β11 is calculated by the following expression:

$$\beta_{11} = \frac{[\mathrm{ML}]}{[\mathrm{M}][\mathrm{L}]}$$
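As a small worked illustration of the 1:1 case, the sketch below evaluates logβ11 from equilibrium concentrations using the standard definition β11 = [ML]/([M][L]). The function name `log_beta11` and the concentration values are hypothetical and serve only to show the arithmetic; they are not data from this study.

```python
import math

def log_beta11(conc_ml, conc_m, conc_l):
    """Return log10 of beta11 = [ML] / ([M][L]) for a 1:1 complex."""
    if min(conc_ml, conc_m, conc_l) <= 0:
        raise ValueError("equilibrium concentrations must be positive")
    return math.log10(conc_ml / (conc_m * conc_l))

# Hypothetical equilibrium concentrations in mol/L (illustrative only).
value = log_beta11(conc_ml=1.0e-3, conc_m=1.0e-5, conc_l=1.0e-4)
print(round(value, 3))
```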

Molecular Descriptors
Optimized structures are used for calculating the molecular descriptors. The 2D and 3D molecular descriptors are calculated from different representations of the molecular structure. The molecular descriptors include the physicochemical descriptor LogP; the 2D descriptors xp3, xp5, xvch8, SaasC, nelem and nrings; and 3D descriptors such as ABSQ, Ovality and Surface 25. The BIOVIA Draw 2017 R2 program was used to reconstruct the 2D structures of the molecules. The topological and quantum descriptors were calculated on a Lenovo W540 PC using MOPAC2016 at the semi-empirical PM7 and PM7/sparkle levels 26 and the QSARIS program. The descriptors of each group served as the initial descriptor pool from which the various QSPR models were constructed using different techniques. Predictor selection is one of the most important steps in QSPR modeling. For the following QSPR modeling, the stability constant logβ11 was used as the dependent variable. The QSPR MLR models were constructed with predictor selection by the Genetic Algorithm (GA) of QSARIS 25 and the forward technique of the REGRESS program 27 in the MS-EXCEL Add-Ins 25. The QSPR PCR model was built with the XLSTAT2016 program 28. The Artificial Neural Network model QSPR ANN was developed using the Neural Network tool in the MATLAB 2016 program 29.

Regression Analysis
The dataset of 74 complexes is used as a training set. MLR and PCR regression analyses were performed on this set of 74 molecules to construct the QSPR models. The QSPR models were generated using the logβ11 values as the dependent variable and different descriptors as independent variables. The cross-validated correlation threshold is set at 0.7; the descriptors in the final equation are selected by combining the regression technique and genetic algorithms to obtain the QSPR MLR and QSPR PCR models. The statistics used to evaluate the QSPR models include the number of compounds in the regression model, the coefficient of determination R², the adjusted R²adj, the number of descriptors in the model k, the F-test for statistical significance, the cross-validated correlation coefficient Q²CV, the predictive correlation coefficient R²pred, and the Standard Error (SE).

MLR Analysis
Multiple Linear Regression (MLR) analysis is used to construct a linear relationship between a dependent variable y (logβ11) and independent variables x (molecular descriptors) 30.
The regression coefficients were estimated by least-squares fitting, in which the Sum of Squared Residuals (SSR) between the observed and predicted values is minimized 27, 28. The linear model produces a linear approximation to all observed data points 30, 31. In linear regression, the dependent variable y (logβ11) depends on the molecular descriptors x. The regression equation has the form:

$$y = c + \sum_{i=1}^{k} b_i x_i$$

Here y is the dependent variable logβ11, b_i is the regression coefficient corresponding to the molecular descriptor x_i, k is the number of descriptors, and c is a constant.
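The least-squares fit described above can be sketched in a few lines. This is a minimal illustration that solves the normal equations with Gaussian elimination; it is not the REGRESS or QSARIS implementation used in this work, and the helper names `solve_linear_system` and `fit_mlr` as well as the toy data are assumptions made for the example.

```python
def solve_linear_system(a, b):
    """Solve a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so that row operations update both sides together.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: descriptors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for j in range(col, n + 1):
                m[row][j] -= factor * m[col][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_mlr(descriptors, y):
    """Least-squares fit of y = c + sum_i b_i * x_i; returns (c, [b_1..b_k])."""
    rows = [[1.0] + list(r) for r in descriptors]  # leading 1 models the constant c
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    coef = solve_linear_system(xtx, xty)
    return coef[0], coef[1:]

# Hypothetical toy data: two descriptors with the exact relation y = 1 + 2*x1 - 0.5*x2.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]]
y = [1 + 2 * a - 0.5 * b for a, b in X]
c, b = fit_mlr(X, y)
print(round(c, 6), [round(v, 6) for v in b])
```

For real descriptor sets a numerically more robust solver (for example a QR-based least-squares routine) would usually be preferred over the normal equations.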

PCR Analysis
Principal Component Regression (PCR) analysis was used to evaluate the data based on the correlation between the dependent variable and the independent variables 28. The principal components expose the underlying structure of the data set, and the dependent variable is regressed on them.
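A compact way to see what PCR does is to write it out: autoscale the descriptors, extract principal components, regress the dependent variable on the component scores, and map the coefficients back to the original descriptors. The sketch below does this with NumPy. It is an illustration under stated assumptions (autoscaling, SVD-based components, the function name `fit_pcr`, random toy data) and is not the XLSTAT procedure used in this work.

```python
import numpy as np

def fit_pcr(X, y, n_components):
    """Principal component regression; returns (intercept, coefficients) on the original descriptors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, x_std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - x_mean) / x_std                     # autoscaled descriptors (assumption)
    # Rows of vt are the principal axes (loadings) of the scaled data.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    loadings = vt[:n_components].T               # shape (k, n_components)
    scores = Z @ loadings                        # principal component scores
    # Regress the centered response on the scores.
    gamma, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    coef = (loadings @ gamma) / x_std            # back to original descriptor units
    intercept = y.mean() - x_mean @ coef
    return intercept, coef

# Hypothetical descriptor matrix and response (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = 0.5 + X @ np.array([1.0, -2.0, 0.3])
intercept, coef = fit_pcr(X, y, n_components=3)
print(np.round(coef, 3), round(float(intercept), 3))
```

When all components are retained, PCR reproduces the ordinary least-squares solution; dropping the low-variance components is what distinguishes it from MLR.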

Artificial Neural Network
An Artificial Neural Network (ANN) receives processed input information and transmits it through interconnected neurons joined by weighted connections. Some of its basic features should be emphasized first 33. In the current article, the number of hidden layers and the appropriate number of epochs were determined carefully by trial and error. We used a feed-forward neural network trained with the Levenberg-Marquardt learning algorithm 35-37. This algorithm appears to be the fastest method for training medium-sized feed-forward neural networks. The ANN model is trained until the mean squared error (MSE ANN) is minimized, after which the network output is compared with the actual output values obtained from the test results 38. Training a neural network consists of adjusting the weights and biases of the network to optimize its performance. The performance function for feed-forward neural networks is the mean squared error of the ANN model (MSE ANN). The mean squared error between the network output (y_i) and the target output (t_i) is given by the following formula 37:

$$\mathrm{MSE}_{\mathrm{ANN}} = \frac{1}{n}\sum_{i=1}^{n}\left(t_i - y_i\right)^2$$

where n is the number of samples.
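To make the performance function concrete, the following sketch runs one forward pass through a small feed-forward network and evaluates the mean squared error between targets and outputs. It uses the I(7)-HL(10)-O(1) layout reported in this paper, but the linear output neuron, the random weights and the input values are assumptions for illustration. It does not implement Levenberg-Marquardt training.

```python
import math
import random

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: tanh hidden layer followed by a linear output neuron (assumption)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

def mse(targets, outputs):
    """Mean squared error between targets t_i and network outputs y_i."""
    return sum((t - y) ** 2 for t, y in zip(targets, outputs)) / len(targets)

random.seed(0)
n_in, n_hidden = 7, 10                      # I(7)-HL(10)-O(1)
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
b_out = 0.0

# Hypothetical scaled descriptor vectors and target values (illustrative only).
inputs = [[0.1 * (i + j) for j in range(n_in)] for i in range(3)]
targets = [5.0, 6.5, 7.2]
outputs = [forward(x, w_hidden, b_hidden, w_out, b_out) for x in inputs]
print(mse(targets, outputs))
```

A training algorithm would repeatedly adjust `w_hidden`, `b_hidden`, `w_out` and `b_out` to reduce the value returned by `mse`.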

Validation of QSPR Model
The optimum way to assess the quality of regression models is to perform internal validation of the QSPR models. The validation was mainly done by leave-one-out (LOO) cross-validation, in which one observation (logβ11 value) at a time is excluded from the training set, so that the training data are divided into subsets of equal size 25. The model is constructed from the remaining data, and the dependent variable of the excluded data point is then obtained as a predicted value. Q²LOO (the cross-validated correlation coefficient) is calculated in the same way as R²train, because all data points are considered in turn as predicted values in the LOO procedure. The same procedure is repeated after removing another object until every object has been left out once. The LOO cross-validation leads to statistically significant patterns for each regression model 28. R²train was calculated with the following formula:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad (6)$$

where y_i and ŷ_i are the experimental and calculated logβ11 values and ȳ is the mean experimental value. The QSPR models are screened based on the values of Q²LOO for the cross-validation set and R²test for the test set. These values are calculated using formula (6) to validate all models 25, 27, 39-42.
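The leave-one-out procedure can be sketched as a loop that refits the model without one observation and predicts that observation from the refitted model. To keep the example short it uses a single-descriptor linear model instead of the multi-descriptor QSPR models of this work; the function names and the data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = c + b*x with one descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(y_obs, y_pred):
    """1 - SS_res / SS_tot, with SS_tot taken about the mean of y_obs."""
    mean = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot

def q2_loo(x, y):
    """Leave-one-out Q2: each point is predicted by a model fitted without it."""
    preds = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        c, b = fit_line(x_train, y_train)
        preds.append(c + b * x[i])
    return r_squared(y, preds)

# Hypothetical descriptor values and responses (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
c, b = fit_line(x, y)
r2_train = r_squared(y, [c + b * xi for xi in x])
print(round(r2_train, 4), round(q2_loo(x, y), 4))
```

For least-squares models Q²LOO cannot exceed R²train, so a small gap between the two is the usual sign of a model that is not overfitted.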
The adjusted R² value (R²adj) is a measure that accounts for the number of independent variables in the QSPR models. The value of R²adj can be negative if the data set does not have a sufficient number of observations n. This coefficient is calculated only if the constant of the model has not been fixed by the user 27, 28. R²adj is defined by the formula:

$$R^2_{\mathrm{adj}} = 1 - \left(1 - R^2_{\mathrm{train}}\right)\frac{n - 1}{n - k - 1}$$

The R²adj value is used to correct R²train, taking into account the number of independent variables k used in the model. The mean squared error (MSE) 27 is determined by the following formula:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\log\beta_{11,\mathrm{exp},i} - \log\beta_{11,\mathrm{cal},i}\right)^2$$

where n is the number of test substances, and logβ11,exp and logβ11,cal are the experimental and calculated stability constants.
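As a quick numerical check of these definitions, the sketch below computes R²adj from R²train and the RMSE from the MSE. Taking n = 74 and k = 7 for the QSPR MLR model is an assumption based on the dataset size and descriptor count stated in the text; with R²train = 0.9446 the standard adjusted-R² formula then gives about 0.939, in line with the reported value, and the square root of the reported PCR MSE of 0.292 gives about 0.540. The function names are hypothetical.

```python
import math

def adjusted_r2(r2, n, k):
    """Adjusted R2 for n observations and k descriptors (model with a constant)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than descriptors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def mse(exp_values, cal_values):
    """Mean squared error between experimental and calculated log(beta11) values."""
    return sum((e - c) ** 2 for e, c in zip(exp_values, cal_values)) / len(exp_values)

# Reported R2_train of the QSPR MLR model; n = 74 and k = 7 are assumed from the text.
print(round(adjusted_r2(0.9446, n=74, k=7), 3))

# RMSE is the square root of MSE; 0.292 is the MSE reported for the QSPR PCR model.
print(round(math.sqrt(0.292), 3))
```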
The selected subsets for the QSPR MLR models are presented in Table 4. The number of descriptors k was varied in the range 1 to 8. During the modeling process, the dataset is split randomly into training and test subsets, with the training subset containing about 80% of the initial data set. The QSPR MLR models are cross-validated by the leave-one-out method through the statistical value Q²LOO.
The statistical parameters R²train, Q²LOO and RMSE are used to select the best subset. The best model has the highest R²train and Q²LOO values and the lowest RMSE value for a suitable number k.
The molecular descriptors in Table 4 were first screened using the genetic algorithm. From the descriptors of these subsets, the QSPR MLR model was rebuilt with the forward technique of the REGRESS system 21 in the MS-EXCEL Add-Ins.
The results in Table 4 show that when k increases to 8, the values of R²train and Q²LOO no longer increase, while the values of RMSEtrain and RMSECV go up. Adding an eighth descriptor therefore changes the statistics only insignificantly, and the best subset of descriptors, with k = 7, is selected for the QSPR MLR model in Equation (11). The best QSPR MLR model is shown in bold in Table 4 25, 27.

QSPR PCR Modeling
The best QSPR MLR model (11) is based on 7 molecular descriptors, as listed in Table 4. In this work we have also constructed the QSPR PCR model using this dataset with 8 molecular descriptors, as given in Table 4. This model was constructed from the results of the Principal Components Analysis (PCA). As before, the QSPR PCR modeling uses a training set containing 80% of the original data, and the remainder is the test set. The QSPR PCR model is also validated by the statistical values R²train, Q²LOO, explained variance and RMSE. The number of principal components in the QSPR PCR model influences the RMSE values. Increasing the number of components decreases the RMSE values for both the training and the validation process, as exhibited in Figure 4. The best QSPR PCR model consists of 7 principal components. It can be transformed into a QSPR PCR model in terms of the original molecular descriptors, as shown in Equation (12).
The Principal Component Regression (PCR) equation of the QSPR PCR model, with its statistical values, is given in Equation (12). Equation (12) is statistically significant. This equation explains 94.9% of the variance in the stability constants, as shown in Figure 4. From Equations (11) and (12), the change in the logβ11 stability constant can be explained by the molecular descriptors. The statistical importance of the molecular descriptors in the QSPR models can guide the search for new complexes. Consequently, the modeling results may direct the design of new thiosemicarbazone ligands, based on the structural descriptors, toward higher logβ11 stability constants.

Construction of QSPR ANN Model
An Artificial Neural Network was also used in this work to develop the QSPR ANN model. The network was trained with the back-propagation algorithm on the training set of 74 complexes, as given in Table 1. The neural network can then be used to predict the logβ11 stability constants of the external test set, as shown in Table 2.
Varying the number of training iterations and the number of neurons in the hidden layer creates several QSPR ANN models I(k)-HL(m)-O(n). Table 2 shows the best QSPR ANN model, with architecture I(7)-HL(10)-O(1). The QSPR ANN model was developed from the statistically significant descriptors of the QSPR MLR and QSPR PCR models.
The hyperbolic tangent sigmoid transfer function is used to train the neural network I(7)-HL(10)-O(1). The other training parameters are a learning rate of 0.01, a momentum of 0.9, a convergence goal of 10⁻¹⁰ and RMSE as the residual function. The QSPR ANN model I(7)-HL(10)-O(1) has the statistical values R²train of 0.9860, Q²CV of 0.9840 and R²test of 0.9830. These results indicate that the QSPR ANN model I(7)-HL(10)-O(1) is better than the QSPR MLR and QSPR PCR models: the QSPR ANN model explains 98.6% of the variation in the data set, whereas the QSPR MLR and QSPR PCR models explain 94.5% and 94.9%, respectively. The QSPR ANN model I(7)-HL(10)-O(1) also exhibits a better fit between the predicted and the experimental values. This can be seen in the statistical values ARE, % and MARE, %, as shown in Table 2.
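The error measures ARE, % and MARE, % are not defined explicitly in the text. The sketch below assumes the common definitions, namely the absolute relative error of one prediction in percent and its mean over all compounds; both the formulas and the sample values are assumptions for illustration, not values from Table 2.

```python
def are_percent(exp_value, pred_value):
    """Absolute relative error in percent (assumed definition)."""
    return abs(exp_value - pred_value) / abs(exp_value) * 100.0

def mare_percent(exp_values, pred_values):
    """Mean of the absolute relative errors in percent (assumed definition)."""
    errors = [are_percent(e, p) for e, p in zip(exp_values, pred_values)]
    return sum(errors) / len(errors)

# Hypothetical experimental and predicted log(beta11) values (illustrative only).
experimental = [10.0, 5.0, 8.0]
predicted = [9.0, 5.5, 8.0]
print(round(mare_percent(experimental, predicted), 3))
```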

External Validation
QSPR models must be tested against external validation criteria. It has been recommended that, in addition to the cross-validated Q²CV value, the multiple correlation coefficient R between the experimental and the predicted stability constants of an external test set be close to 1. In this study, we used an external data set of 10 metal-thiosemicarbazone complexes from the experimental literature to test the applicability of the constructed QSPR models, as given in Table 2. The QSPR models satisfied the criteria.
The MARE, % values of the QSPR models were also calculated, as shown in Table 2. They indicate that the QSPR ANN model has the highest predictability and that the logβ11 stability constants predicted by the QSPR ANN model are very close to the experimental values. In addition, one-way ANOVA is used to compare the experimental logβ11 stability constants with those predicted by the three QSPR models. The discrepancy between them is insignificant (F = 0.068598 < F0.05 = 2.866266). Thus, the QSPR models can be used to estimate the logβ11 stability constants of new complexes.
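The one-way ANOVA comparison can be sketched as follows. The function computes the F statistic from the between-group and within-group sums of squares; the group values are hypothetical. The critical value quoted in the text (2.866266) is consistent with 3 and 36 degrees of freedom, which is what four groups of ten values (experimental plus three models) would give, but that reading is an inference and is not stated in the source.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means about the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the values about their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical log(beta11) values: experimental and three model predictions.
experimental = [5.2, 7.9, 10.4, 6.1]
pred_mlr = [5.6, 7.4, 10.9, 6.0]
pred_pcr = [5.5, 7.5, 10.8, 6.3]
pred_ann = [5.3, 7.8, 10.5, 6.2]
f_stat, df1, df2 = one_way_anova_f([experimental, pred_mlr, pred_pcr, pred_ann])
print(round(f_stat, 4), df1, df2)
```

An F statistic below the critical value for (df1, df2) at the chosen significance level means the group means do not differ significantly.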

Conclusion
We conclude that QSPR modeling of transition metal complexes was carried out by combining multivariate regression with an Artificial Neural Network. The QSPR models were constructed successfully from molecular descriptors selected by the Genetic Algorithm and the forward-regression technique. The logβ11 stability constants of metal-thiosemicarbazone complexes calculated by the QSPR MLR, QSPR PCR and QSPR ANN models are in good agreement with the experimental data. The developed QSPR models are statistically satisfactory. These QSPR models are promising for accurately predicting the stability constants of complexes between new thiosemicarbazone ligands and metal ions. The results above indicate that the QSPR ANN model has the best predictability.