Predictive analytics approaches for software effort estimation: A review

Background/Objective: In Software Effort Estimation (SEE), predicting the amount of time, in person-hours or person-months, required for software development is a cumbersome process. SEE comprises both Software Development Effort Estimation (SDEE) and Software Maintenance Effort Estimation (SMEE). Overestimation or underestimation of software effort results in project cancellation or project failure. The objective of this study is to identify the best-performing model for software effort estimation through an experimental comparison of various machine learning algorithms. Methods: Software effort estimation was addressed using machine learning techniques such as Multilinear Regression, Ridge Regression, Lasso Regression, ElasticNet Regression, Random Forest, Support Vector Machine, Decision Tree and Neural Network to identify the best-performing model. The datasets used are the Desharnais, Maxwell, China and Albrecht datasets. The evaluation metrics considered are Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Square Error (RMSE) and R-Squared. Findings: Experiments with these machine learning algorithms show that Support Vector Machine produces the best performance compared with the other algorithms.


Introduction
Software Effort Estimation (SEE) is used to predict effort in terms of person-months or person-hours. Despite the many models that exist, SEE remains one of the most challenging tasks in successful software development, and several models have been proposed for it (1). Initially, software effort estimation was carried out using expert judgment, user stories, analogy-based estimation and the use case point approach. Later, various machine learning algorithms such as linear regression, logistic regression, multiple linear regression, stepwise regression, ridge regression, lasso regression, ElasticNet regression, decision trees, neural networks, support vector machines, random forests, naïve Bayes, etc., were used for estimation. Ensemble approaches have also gained attention and often produce more accurate predictions than individual algorithms. The following is a survey of the various models used for effort estimation.
In the expert judgment method (2), the estimate is produced through a judgmental process. It draws on the experience and advice of experts, based on the degree to which the new project matches projects the experts have previously completed. The techniques used for expert estimation are the Delphi technique and the Work Breakdown Structure (WBS) (3).
The user stories approach uses stories and feedback (4) to avoid inaccurate estimations. Analogy-based estimation is based on collected data about past projects (5). The project manager and team work together in this kind of approach: once the requirements of the present project are known, they search a database for a similar past project (6). Function point estimation is a reliable approach (7) that assesses the functionality of the system from the user's point of view; function point analysis is first used to estimate size, from which effort can then be estimated (8). The use case point approach (9) is based on the use cases involved in the project, where use cases generally describe user interactions with the system. The actors or users are categorized and weighted based on the type of work they perform (10).
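To make the use case point idea concrete, the sketch below computes an effort figure from hypothetical actor and use case counts, using the weights commonly cited in the UCP literature (actor weights 1/2/3, use case weights 5/10/15, roughly 20 person-hours per UCP). All counts and adjustment factors here are illustrative assumptions, not values from any of the cited studies.

```python
# Hypothetical use case point (UCP) calculation; counts and factors are illustrative.
actor_counts = {"simple": 2, "average": 3, "complex": 1}      # weighted 1, 2, 3
use_case_counts = {"simple": 4, "average": 6, "complex": 2}   # weighted 5, 10, 15

uaw = 1 * actor_counts["simple"] + 2 * actor_counts["average"] + 3 * actor_counts["complex"]
uucw = 5 * use_case_counts["simple"] + 10 * use_case_counts["average"] + 15 * use_case_counts["complex"]

tcf, ecf = 1.0, 1.0            # technical and environmental factors, assumed neutral here
ucp = (uaw + uucw) * tcf * ecf

productivity = 20              # person-hours per UCP, a commonly cited default
effort_hours = ucp * productivity
print(f"UAW={uaw}, UUCW={uucw}, UCP={ucp}, estimated effort={effort_hours} person-hours")
```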
Supervised machine learning techniques are most commonly used for estimation. A supervised learning problem consists of input (independent) variables and an output (target) variable that is to be predicted from the input variables. The types of supervised learning are as follows:
• Regression-In regression, the output attribute is continuous.
• Classification-In classification, the output attribute is discrete.
Regression produces a continuous output or dependent variable (11). It captures the relationship between two or more variables (12). There are various types of regression algorithms, such as simple linear regression, multiple linear regression, logistic regression, stepwise regression, ridge regression, lasso regression and ElasticNet regression.
In simple linear regression, the relationship between a single independent variable and the dependent variable is ascertained. Multiple linear regression is an extension of linear regression: it consists of 'n' independent variables denoted as X1, X2, X3, ..., Xn and a dependent variable Y, and Y is predicted from one or more of the independent variables X1, X2, X3, ..., Xn. In logistic regression, the dependent variable is a categorical value rather than a continuous value (13); the model is built on the sigmoid (logistic) function. It is of two kinds, binomial or multinomial logistic regression. Binomial logistic regression has only two possible outcomes, such as yes or no, good or bad, true or false, 1 or 0, etc. Multinomial logistic regression has more than two possible categorical outcomes, such as poor, average, good, very good or excellent, or very small, small, big or very big.
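As a minimal sketch of multiple linear regression of the kind used in this study, the snippet below fits scikit-learn's LinearRegression to a tiny made-up matrix of project attributes; the feature values, effort values and query point are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical project attributes (e.g. a size metric and team size) and effort values
X = np.array([[10, 3], [25, 5], [40, 8], [60, 12]])
y = np.array([120, 300, 520, 800])   # effort in person-hours (made-up values)

model = LinearRegression().fit(X, y)                      # learn Y from X1..Xn
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("predicted effort for [30, 6]:", model.predict([[30, 6]]))
```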
Stepwise regression (13) performs well when there are multiple independent (input) variables. The main purpose of this technique is to maximize prediction accuracy using the minimum number of input variables or predictors. Ridge regression, which applies L2 regularization, is mostly used when the predictors exhibit multicollinearity (are highly correlated) and when there is a large number of predictor variables (13). LASSO (Least Absolute Shrinkage and Selection Operator) regression (13) is similar to ridge regression but uses the L1 regularization technique to minimize the error between actual and predicted values. ElasticNet regression is a combination of ridge and lasso regression: it uses both the L1 and L2 regularization techniques and is used when there are many features that suffer from multicollinearity.
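The sketch below, using synthetic data and illustrative alpha values (not settings from the study), contrasts how Ridge (L2), Lasso (L1) and ElasticNet shrink the coefficients of two highly correlated features.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)        # nearly identical to x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=100)      # only x1 truly drives the target

# Ridge spreads weight across the correlated pair, Lasso tends to zero one of them out,
# ElasticNet behaves in between.
for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))
```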
Classification produces a discrete output or dependent variable. The various types of classification algorithms include the Decision Tree classifier, Support Vector Machines, the Naïve Bayes classifier, K-Nearest Neighbors, the Random Forest classifier, Neural Networks, etc. A Decision Tree classifier is a tree structure consisting of nodes and branches (14). Internal nodes represent attributes, branches represent decisions, and leaf nodes are the outcomes, which can be either categorical or continuous variables. Thus decision trees can be used for both classification and regression problems. The Support Vector Machine is a supervised algorithm (15). It can be used for both regression and classification problems but is predominantly used for classification, either two-class or multi-class.
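Since effort is a continuous target, studies of this kind typically use the regression form of the SVM (SVR). The sketch below is a minimal, assumed configuration: because SVMs are sensitive to feature scale, the feature is standardized inside a pipeline, and all data values and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = np.array([[100], [250], [400], [600], [900]])   # e.g. adjusted function points (illustrative)
y = np.array([12, 28, 43, 69, 105])                 # effort in person-months (made-up values)

# Standardize the input, then fit an RBF-kernel support vector regressor
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=0.1))
svr.fit(X, y)
print("predicted effort for size 500:", svr.predict([[500]]))
```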
The Naïve Bayes classifier is considered the fastest classification algorithm compared with other algorithms (16) and works well for large datasets. Its basic principle is that each pair of features being classified is assumed independent of one another, applying Bayes' theorem as P(A|B) = P(B|A) * P(A) / P(B), where A and B are events. K-Nearest Neighbor is used for both classification and regression problems, though mostly for classification. To predict the class of a test instance, the 'K' training instances closest to it are considered, and the majority class among them becomes the predicted class. Random Forest is also used for both classification and regression and is based on a forest of trees (17). The advantage of this algorithm is that as the number of trees increases, the accuracy also increases. The decision tree (CART) algorithm is the basis of the random forest algorithm (18). A Neural Network uses a layered approach (19) comprising an input layer, hidden layers and an output layer. Error is corrected by adjusting the weights until the error falls below a threshold value (20). In (21), effort estimation was carried out using a hybrid multilayer perceptron that captures the complex non-linear input-output relationship of a dataset.
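As an illustration of the tree-count effect mentioned above, the sketch below fits a random forest regressor with different numbers of trees on synthetic project-like data and reports the test R²; the data and settings are assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(200, 3))                        # three made-up project attributes
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=10, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
for n_trees in (10, 100):
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=1).fit(X_train, y_train)
    print(n_trees, "trees, test R^2:", round(rf.score(X_test, y_test), 3))
```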
Ensemble-based approaches have also become popular for effort estimation. Ensembling machine learning algorithms (22) provides more accurate results than individual predictive machine learning algorithms. Hybrids of fuzzy-based techniques with function point analysis (23) and ensembles of fuzzy logic with analogy-based estimation (24) have also attracted attention in effort estimation.
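A simple way to ensemble the regressors discussed above is to average their predictions. The sketch below uses scikit-learn's VotingRegressor on synthetic data; the choice of base models and their settings are illustrative assumptions, not the ensembles proposed in (22)-(24).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(50, 2))                 # two made-up project attributes
y = 3 * X[:, 0] + X[:, 1] + rng.normal(size=50)      # synthetic effort signal

# Average the predictions of three of the individual models discussed above
ensemble = VotingRegressor([
    ("ridge", Ridge(alpha=1.0)),
    ("svr", SVR(C=10.0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=2)),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:3]))
```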
Benefits and limitations of the aforementioned methods are discussed in Table 1.

Table 1. Benefits and limitations of effort estimation methods

Delphi Technique
Benefits: It incurs less cost. Experts can figure out the requirements for a future project from their experience with past projects.
Limitations: This method often leads to overoptimistic estimation.

Work Breakdown Structure
Benefits: Project risks can be identified during the earlier stages.
Limitations: It is a complex, step-by-step process.

Analogy based Estimation
Benefits: This approach is simple and fast.
Limitations: This approach is not always accurate in its estimation.

User Stories
Benefits: Story points are relative to the size of the project.
Limitations: User stories differ between teams in a project.

Function point Estimation
Benefits: This method can be applied during the earlier stages of software development. It is independent of any programming language.
Limitations: It is a time-consuming method and has lower accuracy, as it is based on a judgmental approach.

Use case point approach
Benefits: Use case points are a good measure for size prediction.
Limitations: Use cases are large units of work, and estimation can be done only once all use cases are written.

Supervised machine learning techniques - Regression algorithms

Linear regression
Benefits: It is the simplest method to find the relationship between two or more variables.
Limitations: This method can only capture relationships between the independent and dependent variables that are linear.

Logistic regression
Benefits: Logistic regression is used when the independent (input) variables are categorical and/or continuous. It is an efficient and easy method to implement.
Limitations: Only linear problems can be solved with this method; non-linear problems cannot.

Stepwise regression
Benefits: Stepwise regression can handle a large number of independent (predictor or input) variables.
Limitations: Only linear problems can be solved with this method; non-linear problems cannot.

Ridge regression
Benefits: A larger number of independent variables can be used.
Limitations: The resulting model is complex, which in turn leads to poor performance. This method generally produces high bias.

Lasso regression
Benefits: Lasso regression avoids overfitting, and feature selection can be performed.
Limitations: This method is not always stable, and the selection among highly correlated features is random.

Elastic Net regression
Benefits: Elastic Net is often preferred over Ridge or Lasso regression.
Limitations: Its computational cost is high.

Classification algorithms

Decision Tree Classifier
Benefits: It is a simple method and is better suited to estimating categorical data.
Limitations: It provides less accurate predictions when compared with other machine learning algorithms.

Support Vector Machine
Benefits: This method can also be applied to unstructured or semi-structured data. It performs well even with many attributes.
Limitations: It takes longer to make predictions on larger datasets.

Naive Bayes Classifier
Benefits: It is an easy method to implement and produces better results if the input variables are independent of one another.
Limitations: It always assumes that the input variables are independent, which is not always true.

K Nearest Neighbor algorithm
Benefits: It is well suited to larger sample inputs.
Limitations: It requires a large amount of storage and is sensitive to noise.

Random Forest classifier
Benefits: This method is user friendly and robust against overfitting. It can handle huge datasets.
Limitations: It is time consuming and complex, and it behaves as a black box.

Artificial Neural Network
Benefits: It can learn from previous data, is suitable for complex datasets, and handles both linear and nonlinear functions, thus producing highly accurate predictions of software effort.
Limitations: Slow convergence and overfitting problems occur.

Ensemble approaches
Benefits: They combine multiple models into an aggregated, better model.
Limitations: Ensemble approaches are computationally expensive.

Software effort estimation datasets
The datasets considered for effort estimation are Desharnais, Maxwell, China and Albrecht (25). Their repositories, attributes and numbers of records are detailed in Table 2. The following evaluation metrics are used to compare the models.

i. Mean Absolute Error (MAE)
It is the average of the absolute errors (26), where
Prediction error = Actual value − Predicted value
Absolute error = |Prediction error|
MAE, the average of all absolute errors, is given by Eq. (1):
MAE = (1/n) * Σ_i |Actual value_i − Predicted value_i|    (1)

ii. Mean Squared Error (MSE)
It is the average of the squared errors (27) in the data set and is given by Eq. (2):
MSE = (1/n) * Σ_i (Actual value_i − Predicted value_i)^2    (2)

iii. Root Mean Square Error (RMSE)
It measures the standard deviation of the prediction errors (residuals) (28) and is given by Eq. (3):
RMSE = sqrt(MSE) = sqrt((1/n) * Σ_i (Actual value_i − Predicted value_i)^2)    (3)

iv. R-Squared
It is also known as the coefficient of determination and measures the proportion of the variance in the dependent variable that is explained by the model:
R² = 1 − Σ_i (Actual value_i − Predicted value_i)^2 / Σ_i (Actual value_i − Mean actual value)^2
The higher the value of R-squared, the better the model.
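The four metrics can be computed directly with scikit-learn, as in the short sketch below; the true and predicted effort values here are illustrative numbers, not results from the study.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([120, 300, 520, 800])   # actual effort (illustrative)
y_pred = np.array([150, 280, 500, 760])   # predicted effort (illustrative)

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.2f}, MSE={mse:.2f}, RMSE={rmse:.2f}, R2={r2:.3f}")
```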

Results and Discussion
The machine learning algorithms were evaluated on each dataset using the metrics described above. Table 3 shows the performance measures of the machine learning algorithms on the Desharnais dataset (Figure 1), and Table 4 shows the corresponding measures for the Maxwell dataset. Table 5 shows the performance measures of the machine learning algorithms on the China dataset, where Lasso Regression (LR) produces the lowest error values. Table 6 shows the performance measures of the machine learning algorithms on the Albrecht dataset (Figure 4). Based on the inference from Table 3 and Table 6, Support Vector Machine produces better results than the other machine learning algorithms, while Table 4 and Table 5 show that the best results are produced by ElasticNet and Lasso Regression respectively. ElasticNet and Lasso regression are used to avoid overfitting problems, and they produce better results only when the independent attributes are strongly correlated with the output attribute, which is the case for the Maxwell and China datasets. SVM avoids overfitting problems, is suitable for both structured and unstructured data, and performs well with many input attributes. Overall, Support Vector Machine provides better performance than the other machine learning algorithms, and at the next level Lasso and ElasticNet regressions also provide good predictions.
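For orientation, the sketch below illustrates the general shape of such a comparison: each model is trained on the same split of one dataset and scored with MAE, MSE, RMSE and R². The file name, the "Effort" target column, the train/test split and the model settings are assumptions for illustration, not the study's actual experimental setup.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

df = pd.read_csv("desharnais.csv")                   # hypothetical local copy of one dataset
X, y = df.drop(columns=["Effort"]), df["Effort"]     # "Effort" assumed to be the target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "MLR": LinearRegression(), "Ridge": Ridge(), "Lasso": Lasso(),
    "ElasticNet": ElasticNet(), "RF": RandomForestRegressor(random_state=0),
    "SVM": SVR(), "DT": DecisionTreeRegressor(random_state=0),
    "NN": MLPRegressor(max_iter=2000, random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    mse = mean_squared_error(y_test, pred)
    print(name,
          "MAE", round(mean_absolute_error(y_test, pred), 1),
          "MSE", round(mse, 1),
          "RMSE", round(np.sqrt(mse), 1),
          "R2", round(r2_score(y_test, pred), 3))
```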

Conclusion
This study compares various machine learning algorithms, namely Multilinear Regression, Ridge Regression, Lasso Regression, ElasticNet Regression, Random Forest, Support Vector Machine, Decision Tree and Neural Network, using the Desharnais, Maxwell, China and Albrecht datasets. Software Effort Estimation (SEE) predicts the amount of time, in person-hours or person-months, required for software development; it is difficult to forecast during the initial stages due to uncertainty, yet the resulting estimates feed into pricing, project planning, iteration planning, budgeting and investment analysis. The evaluation metrics considered were Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Square Error (RMSE) and R-Squared. Based on this comparative study, it is found that the Support Vector Machine (SVM) outperforms the other algorithms.