Prediction of IPL matches using Machine Learning while tackling ambiguity in results

Background/Objectives: The IPL (Indian Premier League) is one of the most viewed cricket tournaments in the world. With a perpetual increase in its popularity and the advertising associated with it, forecasting IPL matches is becoming a need for advertisers and sponsors. This paper is centered on the use of machine learning to foretell the winner of an IPL match. Methods/Statistical analysis: Cricket in the T20 format is highly unpredictable; many features contribute to the result of a match, and each feature has a weighted impact on the outcome of a game. In this paper, first, a meaningful dataset was defined through data mining; next, essential features were derived using methods such as feature engineering and the Analytic Hierarchy Process. Besides, a key issue concerning data symmetry and the inability of models to handle it was identified, which extends to all classification models that compare two or more classes using similar features for each class. This concept is termed model ambiguity in this paper and occurs due to the model's asymmetric nature. Alongside, different machine learning classification algorithms, namely Naïve Bayes, SVM, k-Nearest Neighbors, Random Forest, Logistic Regression, ExtraTreesClassifier, and XGBoost, were adopted to train the models for predicting the winner. Findings: As per the investigation, tree-based classifiers provided better results with the derived model. The highest accuracy of 60.043%, with a standard deviation of 6.3% and an ambiguity of 1.4%, was observed with Random Forest. Novelty/Applications: Apart from reporting a more accurate result, the derived model has also solved the problem of multicollinearity and identified the issue of data symmetry (termed model ambiguity). It can be leveraged by brands, sponsors, and advertisers to shape their marketing strategies.


Introduction
The IPL (Indian Premier League) is a Twenty20 cricket league in India where eight teams (representing eight cities in India) play against each other. This game is India's biggest cricket festival, the most celebrated and the most viewed, where the action is not limited to the cricket field. The fanfare, promotional events, cheerleaders, advertisements, fan clubs, interactions, and betting are celebrated along with the players and the matches.
The entire revenue cycle of the IPL revolves around advertising. The IPL also utilizes television timeouts, and there are other enormous advertising opportunities associated with it. Apart from national and global broadcasts, the matches are transmitted to regional channels in eight different languages. The brand value of the IPL was ₹475 billion (US$6.7 billion) in 2019 (1). The IPL cricket league has proved to be a 'game-changer' for both cricket and the entire Indian advertising industry (2).
"Due to the saturated market, it is especially important for sports organizations to function with maximum efficiency and to make smart business decisions" (3). One of the most common areas where sports organizations use analytics is assessing an athlete's value to their brand and strategizing their marketing activities.
In this paper, models using machine learning to predict IPL matches' outcomes were developed. Figure 1 illustrates the entire process followed while conducting the research.
During the research, a multi-step approach was taken to gather and pre-process the historical data. Feature engineering (4,5) techniques were applied to derive more insights from the current dataset. Further, the essential features were analyzed using selection techniques, and simultaneously the best players were marked based on their performances. Optimized features from the players' performance were then added to the team data. The issue of multicollinearity, which occurs when multiple features are highly linearly related, was tackled. One of the main issues identified during the research was the symmetry in the dataset: it was observed that the models returned different results for the same input fed in two configurations. This concept is termed model ambiguity in this paper and occurs due to the model's inability to interpret data symmetry because of its asymmetric nature. The models were trained using multiple machine learning classification algorithms to develop a predictive model. The highest accuracy was observed with Random Forest, i.e., 60.043%, with a standard deviation of 6.3% and 1.4% ambiguity. Figure 1 depicts the complete process, from data scraping to the creation of optimized models with reduced ambiguity for predicting the IPL match winner.

Related works
Many researchers have contributed towards predicting the results of cricket matches. The authors of (6) proposed a paper on predicting the outcome of an IPL match, where they acquired the dataset available for all 11 seasons from the archives of the IPL website (7) and applied the concept of Multivariate Linear Regression to calculate the strength of a team using the data from the Player Points section of the official IPL website. Later, for prediction, the scholars utilized various classifiers, namely Naïve Bayes, Extreme Gradient Boosting, Support Vector Machine, Logistic Regression, Random Forests, and Multilayer Perceptron. In another study, the authors of (8) adopted the Team Composition method to predict the outcome of an ODI match. They utilized the players' career statistics (both recent and overall performance) to calculate each player's strength and aggregated these to finalize the team strength. They also included other features like venue and toss. The model derived from their research gives the best result with the KNN algorithm.
Although not related to cricket match prediction, the authors of (9) conducted a study to predict the performance of bowlers. They used a Multilayer Perceptron and created a new feature called CBR (Combined Bowling Rate) by calculating the harmonic mean of the Bowling Average, Bowling Economy, and Bowling Strike Rate. The authors of (10) used the pressure index of the team batting in the second innings to predict the match at different points of the chase; they devised a formula to calculate the pressure index at each point and used various methods to calculate the probability of a win based on the pressure index.
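The CBR feature described above, the harmonic mean of three bowling measures, can be sketched in a few lines. This is an illustrative reconstruction; the function name and the example numbers are assumptions, not taken from the cited study.

```python
# Combined Bowling Rate (CBR): harmonic mean of bowling average,
# economy, and strike rate, as described for the bowler-performance study.
# Function name and inputs are illustrative, not the study's own code.
def combined_bowling_rate(average: float, economy: float, strike_rate: float) -> float:
    """Harmonic mean of the three bowling measures (lower is better)."""
    return 3.0 / (1.0 / average + 1.0 / economy + 1.0 / strike_rate)

# Hypothetical bowler: average 25.0, economy 7.5, strike rate 20.0
cbr = combined_bowling_rate(25.0, 7.5, 20.0)
```

Because the harmonic mean is dominated by the smallest input, a bowler must do well on all three measures to obtain a low CBR.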

Dataset gathering
The historical dataset was obtained from various sources: Kaggle (11), ESPN Cricinfo (12), and iplt20 (7). The performance data of individual players was scraped from the ESPN Cricinfo website using the Python library Beautiful Soup (13). To produce accurate results, all the unnecessary features were eliminated from the dataset, for example, Umpire Name, Stadium Name, Date, DL Applied, and Player of the Match. The features that could result in data leakage, such as Win by Runs and Win by Wickets, were excluded. Further, all the match rows that were dismissed, drawn, or null were eradicated. Class Imbalance is a problem in machine learning where the class distribution is highly imbalanced (14,15). Predicting the results using the team's name was not feasible, as it can cause a massive Class Imbalance between the groups. For example, MI (Mumbai Indians) winning more than 100 matches whereas KTK (Kochi Tuskers Kerala) won fewer than 10 matches is a Class Imbalance problem; refer to Figure 2, which shows the number of times each participating team has won a match during the IPL. Hence, to rule out the class imbalance, the model was designed to predict the winner based on the essential features instead of the team names, declaring either Team 1 or Team 2 as the winner. Moreover, it was noticed that Team 1 won more often than Team 2. To resolve this issue and balance Team 1 wins and Team 2 wins in the label column, a few values of the Team 1 column were interchanged with the Team 2 column.
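The label-balancing step above, interchanging Team 1 and Team 2 in some rows so that the winner label is roughly balanced, can be sketched as below. The column names, label coding (1 = Team 1 won, 2 = Team 2 won), and toy data are illustrative assumptions, not the paper's actual schema.

```python
import pandas as pd

# Toy dataset where Team 1 always wins (label 1 = team1 won, 2 = team2 won).
df = pd.DataFrame({
    "team1": ["MI", "CSK", "RCB", "KKR"],
    "team2": ["KTK", "DD", "SRH", "RR"],
    "winner": [1, 1, 1, 1],
})

def swap_rows(df: pd.DataFrame, idx) -> pd.DataFrame:
    """Swap team1/team2 and flip the winner label for the given rows."""
    out = df.copy()
    out.loc[idx, ["team1", "team2"]] = out.loc[idx, ["team2", "team1"]].values
    out.loc[idx, "winner"] = out.loc[idx, "winner"].map({1: 2, 2: 1})
    return out

# Swap every other Team-1 win so the label column becomes balanced.
team1_wins = df.index[df["winner"] == 1]
balanced = swap_rows(df, team1_wins[::2])
```

After the swap, the toy label column holds two wins for each side, so a classifier cannot exploit a skewed "Team 1 usually wins" prior.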

Assumptions
A few assumptions were followed to make the model accurate and robust. The owners changed the names of a few teams due to legal actions or a change in ownership; however, the players and team dynamics did not change. Delhi Daredevils was renamed Delhi Capitals, Deccan Chargers became Sunrisers Hyderabad, and Pune Warriors became Rising Pune Supergiant. In these cases, each was treated as the same team irrespective of the change in name. Moreover, only the data of 11 players per team, chosen by the highest number of matches played during the IPL, was considered.
Refer to Table 1 for the features extracted from the gathered and pre-processed data. From the gathered and processed data, the following three meaningful features were extracted.

1. City
2. Toss Winner
3. Toss Decision
Since the algorithms cannot interpret string values, label encoding was applied to the above three features, as follows: 1. City: If the match is played on the home ground of Team 1, the city value is taken as zero. If the match is played on the home ground of Team 2, the city value is taken as 1, and if the match is played in some other city, the city value is taken as 2.
2. Toss Winner: If the Toss is won by Team 1, the Toss Winner value is taken as zero. If the Toss is won by Team 2, the Toss Winner value is taken as 1.
3. Toss Decision: If the toss winner chooses to bat, the value of the Toss Decision is taken as zero, and if the toss winner chooses to bowl, the value of the Toss Decision is taken as 1. For the Base Feature Distribution, refer to Figure 3; for the Dream 11 strength feature distribution, refer to the corresponding figure. Different measures highlight different aspects of a player's ability, which makes some features more essential than others. For example, the strike rate is a necessary feature for a game, especially T20: the number of overs is small, which makes this feature more crucial as it adds to the team's ability to score maximum runs. The features were weighted according to their relative importance over other measures (features) in the research. The Analytic Hierarchy Process (AHP) was adopted to determine these weights for each player to calculate their bowling and batting features. Besides, the weights for each team were calculated based on their past performance. The Analytic Hierarchy Process is a method for decision-making in complex conditions in which many variables or criteria are considered in prioritizing and selecting options (16). AHP generates a weight for each evaluation criterion: the higher the weight, the more important the corresponding criterion (refer to Appendix B). Finally, AHP combines the criteria weights and the option scores, thus determining a global score for each option and a consequent ranking. The global score for a given option is a weighted sum of the scores it obtained with respect to all the criteria (17).
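The three label encodings described above can be sketched as follows. The `encode_match` helper and the home-city lookup dictionary are illustrative assumptions; only the encoding rules (0/1/2 for city, 0/1 for toss winner and toss decision) come from the paper.

```python
# Hypothetical home-ground lookup (illustrative subset of teams).
HOME_CITY = {"MI": "Mumbai", "CSK": "Chennai"}

def encode_match(team1, team2, city, toss_winner, toss_decision):
    """Return (city_code, toss_winner_code, toss_decision_code)."""
    if city == HOME_CITY.get(team1):
        city_code = 0          # Team 1's home ground
    elif city == HOME_CITY.get(team2):
        city_code = 1          # Team 2's home ground
    else:
        city_code = 2          # neutral venue
    toss_winner_code = 0 if toss_winner == team1 else 1
    toss_decision_code = 0 if toss_decision == "bat" else 1
    return city_code, toss_winner_code, toss_decision_code

# MI vs CSK in Chennai; CSK wins the toss and elects to bowl.
codes = encode_match("MI", "CSK", "Chennai", "CSK", "bowl")
```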

Batting AHP
Priority Order: The attributes were arranged in decreasing order of importance based on knowledge and experience from T20 cricket matches, as below: Batting Average > Innings > Strike Rate > 50's > 100's > 0's. Subsequently, a matrix was created to compare the importance of each attribute; refer to Table 2. From this section, four essential features were formed. For the AHP Strength Feature Distribution, refer to Figure 5.

Using AHP, the coefficients for the win rate of each team against the others were derived. Assumption: the KTK (Kochi Tuskers Kerala) and GL (Gujarat Lions) teams were dropped while calculating the weights, as they never played against each other. Priority Order: The priority order through AHP was calculated with the dataset of win/loss records for each team against every other team. For example, CSK (Chennai Super Kings) and MI (Mumbai Indians) played 27 matches against each other; according to the dataset, MI won 16, and CSK won the remaining 11 games. In this instance, in the MI row the input will be 16/11 = 1.454545, and in the CSK row it will be the reciprocal, 11/16 = 1/1.454545 = 0.6875. Refer to Table 4. Further, the yearly ranks of each team based on their win ratios were noted, and the ranks were derived using AHP. Refer to Table 5. For KTK and GL, the mean value of 1 was taken as the coefficient, and two features were formed from this section, as below:

1. Team_1_Rank
2. Team_2_Rank
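The AHP weighting step described above can be sketched as below. The weights are approximated by normalizing each column of the pairwise comparison matrix and averaging across rows, a common approximation of the principal eigenvector; the 3x3 example matrix is illustrative and is not the paper's batting or team matrix.

```python
import numpy as np

def ahp_weights(pairwise: np.ndarray) -> np.ndarray:
    """Approximate AHP criterion weights from a pairwise comparison matrix."""
    norm = pairwise / pairwise.sum(axis=0)   # normalize each column to sum to 1
    return norm.mean(axis=1)                 # row means approximate the weights

# Illustrative matrix: criterion A is 3x as important as B, 5x as important as C,
# and B is 3x as important as C; reciprocals fill the lower triangle.
A = np.array([
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 3.0],
    [1 / 5, 1 / 3, 1.0],
])
w = ahp_weights(A)
```

The resulting weights sum to 1 and preserve the stated priority order, so the most important criterion receives the largest weight.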
For the AHP Rank Feature Distribution, refer to Figure 6. For a cricket match, the win rate largely determines the overall performance of a team. A team continuously winning matches against other teams is a sign that the team's form is good and that the probability of it winning upcoming matches is higher. On the other hand, a losing team reflects that it is not in good form and may lose further games. As the next step, the entire list of IPL matches played every year by each team from 2008 to 2019 was crawled. If two teams played against each other for the first time, the win rate was reset to 0 for both teams. Subsequently, all the played matches were checked and the winners of such occurrences were noted. This helped in defining a ratio for each team: for a match, each team's past win-rate ratio against its opponent was considered. Two features were formed from this section.
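The rolling head-to-head win-rate feature described above can be sketched as below. Before each match, each team's feature is its share of past wins in that pairing, reset to 0 the first time the two teams meet. The `(team1, team2, winner)` tuple layout is an assumption for illustration.

```python
from collections import defaultdict

def win_rate_features(matches):
    """Per-match (team1_win_rate, team2_win_rate) from past head-to-head results."""
    wins = defaultdict(int)      # (team, opponent) -> past wins in the pairing
    played = defaultdict(int)    # frozenset({t1, t2}) -> matches played so far
    feats = []
    for t1, t2, winner in matches:
        pair = frozenset((t1, t2))
        if played[pair] == 0:
            feats.append((0.0, 0.0))         # first meeting: both reset to 0
        else:
            feats.append((wins[(t1, t2)] / played[pair],
                          wins[(t2, t1)] / played[pair]))
        played[pair] += 1
        loser = t2 if winner == t1 else t1
        wins[(winner, loser)] += 1           # record the result after featurizing
    return feats

feats = win_rate_features([("MI", "CSK", "MI"),
                           ("MI", "CSK", "MI"),
                           ("CSK", "MI", "CSK")])
```

Recording the result only after the feature is emitted keeps the feature strictly historical, avoiding the data-leakage problem noted earlier.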

Transformation Features
With all the formulated Base and Intersection features, Transformed features were developed. These features were created by subtracting two base features or intersection features from the same category. For example, Team1_Team_Strength is subtracted from Team2_Team_Strength to create a new feature. Since many new features were created for the model based on the base and intersection features, multicollinearity (18) could occur. Multicollinearity occurs when multiple features in a model are highly linearly related, meaning one variable can be predicted quite accurately from another. The problem with multicollinearity is that it causes the model to overfit. To deal with multicollinearity in the derived model, all the base and intersection features that were used to create the new features were dropped.
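The transformation step above can be sketched as below: each pair of symmetric features is replaced by its difference, and the originals are dropped so the near-perfect linear relation between the pair and the new column cannot enter the model. Column names and values are illustrative assumptions.

```python
import pandas as pd

# Toy base features for three matches (values chosen to be exactly representable).
df = pd.DataFrame({
    "team1_strength": [8.0, 6.5, 7.25],
    "team2_strength": [7.0, 7.5, 6.25],
})

# Transformed feature: the difference between the two symmetric columns.
df["strength_diff"] = df["team1_strength"] - df["team2_strength"]

# Drop the originals so strength_diff has no collinear partners left.
df = df.drop(columns=["team1_strength", "team2_strength"])
```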

Addressing the Symmetry in Data
As per the primary assumption, every team's performance is independent of the opposition team, toss decision, home-field advantage, and progress into the series. This allowed the creation of independent team features that are present for both TEAM1 and TEAM2. The features generated can be broadly bucketed into Match and Team Features. As there are similar features for both TEAM1 and TEAM2, symmetry was observed in the dataset (refer to Table 6). It is apparent to a human that switching TEAM1 with TEAM2 leaves the result unchanged. However, a machine learning model is asymmetric in nature and is neither capable of identifying the symmetry of features nor has a way to receive information about it. Hence, this information was supplied to the model by generating a symmetric duplicate of every row in the training set (refer to Table 7). The below steps were applied to the train and test sets:
1. The original dataset was split into training and test sets using train_test_split from the sklearn (19) library, with 90% of the data in the training set and 10% in the test set.
2. The training set was then mirrored as shown above and appended to the original training set, doubling its size.
3. The test set was also mirrored, but the mirror was kept separate, yielding two test sets.
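The mirroring step above can be sketched as below: every TEAM1 feature is swapped with its TEAM2 counterpart, the asymmetric match encodings (city code, toss winner) and the label are flipped, and the mirror is appended to the training set. The column names and the 0/1 label coding are illustrative assumptions consistent with the encodings described earlier.

```python
import pandas as pd

def mirror(df: pd.DataFrame) -> pd.DataFrame:
    """Create the symmetric duplicate of each row (TEAM1 <-> TEAM2)."""
    m = df.copy()
    m[["team1_rank", "team2_rank"]] = df[["team2_rank", "team1_rank"]].values
    m["city"] = df["city"].map({0: 1, 1: 0, 2: 2})  # swap home-ground codes
    m["toss_winner"] = 1 - df["toss_winner"]        # 0 = team1 won toss, 1 = team2
    m["winner"] = 1 - df["winner"]                  # 0 = team1 wins, 1 = team2 wins
    return m

train = pd.DataFrame({"team1_rank": [0.3], "team2_rank": [0.7],
                      "city": [0], "toss_winner": [1], "winner": [0]})

# Step 2: append the mirror to the original training set, doubling its size.
augmented = pd.concat([train, mirror(train)], ignore_index=True)
```

For the test set, `mirror(test)` would instead be kept as a second, separate test set rather than appended.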

Model Ambiguity
The mirroring of the rows only tells the model about the existence of a symmetric scenario, but the model will still interpret the mirrored rows as new training set rows completely unrelated to the original rows. This asymmetric nature of the model leads to ambiguity in the results in certain rows ( Refer to Table 8 ). The model was tested for a given match in two configurations. The model interprets both the cases as two different test cases. As a result, sometimes, the model returns different predictions for the same case. Such an occurrence is called Model Ambiguity. Note: This occurrence is not an incorrect prediction, as the prediction will be counted correct in either test set 1 accuracy or test set 2 accuracy.
To tackle this phenomenon of Model Ambiguity, the model was evaluated using five parameters apart from just training and test accuracy:
• Training Accuracy: % of correct predictions on the mirrored and merged train set
• Test 1 Accuracy: % of correct predictions on the original test set
• Test 2 Accuracy: % of correct predictions on the mirrored test set
• Real Test Accuracy: % of correct predictions after discrediting the scores for ambiguous rows
• Ambiguity: % of rows in which ambiguity is observed
The objective of hyperparameter tuning was to maximize real test accuracy by driving down the ambiguity while evaluating the overfitting of the model using training accuracy and test 1 & 2 accuracies.
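The ambiguity and real-test-accuracy parameters can be sketched as below. Here `pred1` holds predictions on the original test set and `pred2` holds predictions on the mirrored test set mapped back into the original label frame; a row is ambiguous when the two disagree. Discrediting ambiguous rows while keeping the full row count in the denominator is an assumption about how "discrediting the scores" is computed.

```python
def ambiguity_metrics(pred1, pred2, y):
    """Ambiguity % and real test accuracy from the two test-set predictions."""
    n = len(y)
    ambiguous = [i for i in range(n) if pred1[i] != pred2[i]]
    kept = [i for i in range(n) if i not in ambiguous]
    # Discredit ambiguous rows: they count as neither correct nor incorrect
    # in the numerator, but the denominator stays the full test size.
    real_acc = sum(pred1[i] == y[i] for i in kept) / n
    return {"ambiguity": len(ambiguous) / n, "real_test_accuracy": real_acc}

# Toy example: the model flips its answer on row 2 when the teams are swapped.
m = ambiguity_metrics(pred1=[0, 1, 1, 0], pred2=[0, 1, 0, 0], y=[0, 1, 1, 1])
```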

Data set split
It was noted that changing the random state of the dataset split causes the accuracy to differ considerably. This change occurs because the training and testing datasets are split randomly based on the given state. To prevent such a scenario and to make the model robust, RepeatedStratifiedKFold (20) was used: 10 folds and 2 repeats were selected, giving a total of 20 folds. RepeatedStratifiedKFold was preferred over StratifiedKFold (20) because the dataset is small, and RepeatedStratifiedKFold gives more folds with a larger overall validation set.
Constant: Random State = 827 was taken throughout the project. The model was evaluated using accuracy, standard deviation, Cohen's Kappa (21), skewness (22,23), and kurtosis (22,23). To check and visualize the performance of the classification problem, AUC (Area Under the Curve) ROC (Receiver Operating Characteristic) curves were plotted. These curves are among the most important evaluation metrics for checking any classification model's performance (24).
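The evaluation protocol above can be sketched with sklearn's real API: `RepeatedStratifiedKFold` with 10 splits and 2 repeats (20 folds total) at the fixed random state. The toy data and the choice of classifier here are illustrative; only the cross-validation setup follows the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Illustrative stand-in for the match feature matrix and labels.
rng = np.random.default_rng(827)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# 10 stratified folds repeated twice = 20 validation scores per model.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=827)
clf = RandomForestClassifier(n_estimators=50, random_state=827)
scores = cross_val_score(clf, X, y, cv=cv)

mean_acc, std_acc = scores.mean(), scores.std()
```

Reporting the mean and standard deviation over all 20 folds is what makes the headline figures (e.g. 60.043% with a 6.3% standard deviation) stable across random states.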

Results and Discussions
Eight supervised algorithms were selected to train the derived model; each is covered in the subsections below.

Model implementation using Naïve Bayes
A Real Test Accuracy of 58.233% with a standard deviation of 5.5% and an ambiguity of 3.0% was derived (refer to Table 9). The Area Under the Curve is 0.63. The ROC curve was plotted with the best result using Naïve Bayes. The distribution of the Real Test Accuracy was plotted to derive its skewness and kurtosis (refer to Figure 11).

Model implementation using logistic regression
The model was tuned with over 1232 combinations; refer to Appendix C (a). The best results derived: Real Test Accuracy of 57.78% with a standard deviation of 5.8% and ambiguity of 2.2% (refer to Table 10). Further, the ROC curve with the best result was plotted, and an AUC value of 0.57 was derived. The Real Test Accuracy distribution was plotted for deriving the kurtosis and skewness; refer to Figure 12.
• Kurtosis of the Real Test Accuracy: 0.5892
• Skewness of the Real Test Accuracy: 1.4699

Model implementation using support vector machines
The model was tuned with over 25 combinations. Refer to Appendix C (b). Real test accuracy of 58.416% with a standard deviation of 5.69% and ambiguity of 0.24% was derived (Refer to Table 11 ). The Area under the Curve is 0.72. The ROC curve was plotted with the best result from Support Vector Machines. The distribution of Real Test Accuracy was done to derive Skewness and Kurtosis of the Real Test Accuracy (Refer to Figure 13 ).
• Kurtosis of the Real Test Accuracy: 1.6979
• Skewness of the Real Test Accuracy: 0.4171

Model implementation using k-Nearest neighbours
The model was tuned with over 300 combinations. Refer to Appendix C (c). Real test accuracy of 53.472% with a standard deviation of 5.2% and ambiguity of 1.90% was derived (Refer to Table 12 ).

Model Implementation using ADABOOST
The model was tuned with over 56 combinations; refer to Appendix C (d). The best result with the corresponding hyperparameters was derived: Real Test Accuracy of 60.035% with a standard deviation of 6.2% and ambiguity of 5.4% (refer to Table 13). Further, the ROC curve with the best result was plotted, and an AUC value of 0.62 was derived. The Real Test Accuracy distribution with ADABOOST was plotted for deriving the kurtosis and skewness (refer to Figure 15).
• Kurtosis of the Real Test Accuracy: -0.6021
• Skewness of the Real Test Accuracy: -0.4677

Model Implementation using XGBOOST
The model was tuned with over 3600 combinations; refer to Appendix C (e). The best result with the corresponding hyperparameters was derived: Real Test Accuracy of 55.42% with a standard deviation of 5.9% and ambiguity of 7% (refer to Table 14). Further, the ROC curve with the best result was plotted, and an AUC value of 0.62 was derived. The Real Test Accuracy distribution with XGBOOST was plotted for deriving the kurtosis and skewness (refer to Figure 16).
• Kurtosis of the Real Test Accuracy: -0.8633
• Skewness of the Real Test Accuracy: 0.0456

Model implementation using ExtraTreesClassifiers
The model was tuned with over 320 combinations. Refer to Appendix C (f). The best results derived: Real Test Accuracy of 59.506 % with a standard deviation of 5.9% and ambiguity of 4.3% (Refer to Table 15 ).

Model Implementation using Random Forest Classifier
The model was tuned with over 1200 combinations (25) . Refer to Appendix C (g). The best result with the corresponding hyper-parameters were derived -Real test accuracy is 60.043 % with a standard deviation of 6.3% and ambiguity of 1.4% (Refer to Table 16 ).
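The tuning step for the tree-based models can be sketched with sklearn's real search API. The search space and toy data below are illustrative assumptions; the paper's actual ~1200 hyperparameter combinations and features are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative stand-in for the match feature matrix and labels.
rng = np.random.default_rng(827)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Tiny illustrative grid; the paper's search was far larger.
grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=827), grid, cv=3)
search.fit(X, y)

best = search.best_params_  # hyperparameters of the best-scoring model
```

In the paper's setup, the scoring target of such a search would be the Real Test Accuracy (with ambiguity driven down), rather than plain cross-validated accuracy.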

Conclusion and Future Works
The research focused on predicting the winner of an IPL match using machine learning, utilizing the available historical IPL data from the 2008-2019 seasons. In the process, various data science methods were adopted to conduct the study, including data mining, visualization, preparation of the database, feature engineering, applying the Analytic Hierarchy Process, creating prediction models, and training classification techniques. The IPL dataset was gathered and pre-processed: missing values were removed, and variables were encoded into numerical format to make the dataset uniform. The essential features were then derived from the data using domain knowledge to extract raw data features via data mining techniques, and the results were derived from the model. Since the dataset available for the IPL is limited and small, multiple levels of features were created to make sure that the derived model does not underfit. Almost every feature that can affect the result of a match was derived. Further, the problem of multicollinearity was solved, and the issue of data symmetry was identified (termed model ambiguity). Several machine learning models were applied to the selected features to predict the IPL match results. The best results were obtained using tree-based classifiers: the highest accuracy of 60.043% with Random Forest, with a standard deviation of 6.3% and an ambiguity of 1.4%, was observed (refer to Table 17). In this research, the players' series-wise performance rather than their match-wise performance was taken while calculating player strength. For a more thorough approach to further develop this research, match-wise data can be considered. The research can also be further enhanced by adding other factors, like comparing players' performances at a particular stadium.

Appendices

Total number of 100s scored (16)
Φnum_4s: Total number of 4s scored (1)
Φnum_6s: Total number of 6s scored (2)
φstmp: Stumping / Run Out (direct) (12)
Φr_out: Run Out (Thrower/Catcher) (8/4)
Φfduck: Dismissal for a Duck (only for batsmen, wicket-keepers and all-rounders) (-2)
Φbat_innings: Number of times a player has batted in a match
Φbowl_innings: Number of times a player has bowled in a match
Φwickets: Number of wickets taken by a bowler in the season (25)
Φmaidens: Number of times a bowler has bowled an over without conceding any runs (8)
Φ4_wicket_houl: Number of times a player has taken 4 wickets in a single match (8)
Φ5_wicket_houl: Number of times a player has taken 5 wickets in a single match (16)
Φbowl_economy: Bowling