Classification and Prediction of Student Academic Performance in King Khalid University-A Machine Learning Approach

Objectives: Universities accumulate large amounts of student data in electronic form. Filtering this data manually against specific criteria is difficult, so tools that analyse it in statistical, descriptive or computational ways are important. Methods/Statistical Analysis: This study analyses ten leading machine learning algorithms used for classification and prediction. The WEKA tool is used to run the experiments and to measure accuracy and other result parameters for the categorical prediction of student performance. The parameters are also estimated for varying numbers of samples. Findings: The classification accuracy of 12 WEKA classifiers (REPTree, NaiveBayes, J48, Bagging, IBk, MultilayerPerceptron, RandomForest, RandomTree, Stacking, AdaBoost, Logistic and SMO) was compared on datasets with varying numbers of instances. Based on these results, the best 5 methods were chosen and compared on the entire dataset for prediction. Ten machine learning algorithms were considered, and results such as classification accuracy, the kappa statistic and mean absolute error were compared. Bagging, RandomForest, IBk and RandomTree passed the first-level filter, which was based on the kappa statistic. In the second-level filter, based on accuracy, IBk and RandomTree were selected as the final suitable models for the given dataset. Application/Improvements: A questionnaire for students and teachers is to be developed so that results can be evaluated and predicted from various angles and on various parameters. The positive and negative factors that contribute to the institution's results are to be analysed.


Introduction
Machine learning applies educational data mining techniques to predict student performance, which gives educational institutions a basis for improving their results by examining the parameters that affect their academic standing in the global market. Educational data mining also improves pedagogical strategy. Students' academic performance is a crucial deciding factor in building their future 1,2. Machine learning work typically involves developing a new model for the problem at hand. Although many machine learning algorithms exist, some are of interest across all fields of research. For categorical analysis and prediction, 43 algorithms are available for classifying data, but for ease of consolidation only the 10 algorithms that perform best on the parameters considered are analysed in this work. Many tools exist for testing machine learning algorithms on data, but WEKA is interactive and easy to use even for non-programmers. Hence WEKA is chosen as the tool to identify the algorithm that can serve as a base for developing a new model for predicting student performance. As shown in Figure 1, the machine learning process consists of the following major steps:
• Data gathering: collecting data from the real-time environment and segregating it according to the requirements of the prediction.
• Pre-processing the data.
• Classifying using a model.
• Saving the model trained on the training data.
• Applying the saved model to the test data.
• Predicting the result and estimating the accuracy parameters.
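The steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the Pass/Fail labels and the one-threshold "model" are hypothetical stand-ins, not the WEKA classifiers or the university data used in this study.

```python
import pickle

# Hypothetical raw records: (mid_semester_1, mid_semester_2, final_result).
raw_records = [
    ("12", "14", "Pass"),
    ("5", "7", "Fail"),
    ("", "9", "Fail"),   # incomplete record
    ("15", "13", "Pass"),
    ("6", "4", "Fail"),
    ("11", "12", "Pass"),
]

def preprocess(rows):
    """Drop incomplete rows and convert the marks to numbers."""
    return [((float(m1), float(m2)), label) for m1, m2, label in rows if m1 and m2]

def train(rows):
    """Fit a toy model: a pass threshold on the sum of both marks."""
    passed = [sum(marks) for marks, label in rows if label == "Pass"]
    failed = [sum(marks) for marks, label in rows if label == "Fail"]
    mean_pass = sum(passed) / len(passed)
    mean_fail = sum(failed) / len(failed)
    return {"threshold": (mean_pass + mean_fail) / 2}

def predict(model, marks):
    return "Pass" if sum(marks) >= model["threshold"] else "Fail"

data = preprocess(raw_records)              # gather and pre-process
train_rows, test_rows = data[:3], data[3:]
model = train(train_rows)                   # classify using a model
saved = pickle.dumps(model)                 # save the model (in memory here)
restored = pickle.loads(saved)              # apply the saved model to test data
predictions = [predict(restored, marks) for marks, _ in test_rows]
correct = sum(p == label for p, (_, label) in zip(predictions, test_rows))
accuracy = correct / len(test_rows)         # estimate accuracy
print(predictions, accuracy)
```

In practice the saved model would be written to a file and reloaded in a later session; serialising to bytes keeps the sketch self-contained.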

Related Works
Praneet et al. 4 highlight the importance of predicting student results in the field of education. A real-time dataset of student academic records is tested with various classification algorithms, namely Multilayer Perceptron, Naïve Bayes, SMO, J48 and REPTree, using WEKA, an open-source tool. Statistics are generated for each classification algorithm, and the five classifiers are compared to determine their accuracy and to find the best-performing one.
Ameerah et al. 5 provide an overview of the data mining techniques that have been used to predict students' performance. The prediction algorithms used to identify the important attributes in student data are identified. Factors such as internal assessments, psychometric factors, CGPA, social network interaction and student demographics were considered.
Raheela et al. 6 conducted a case study on predicting student academic performance using the cohort performance system. Only pre-university marks and the marks of 1st and 2nd year courses were considered, with no socio-economic or demographic features, to predict graduation performance in the 4th year at university.

Student Performance Model
WEKA is chosen as the tool for selecting the best algorithm to serve as a base in developing a new model for academic performance prediction. The academic results of the previous semester, described by 5 attributes such as Student id, name, Mid-semester 1 and Mid-semester 2 (the mid-semester marks contributing a major part of the semester_internal marks), are used to predict the final exam results. The results of this analysis are expected to improve when other contributing factors, such as assignments and quizzes, are included. As per the curriculum of King Khalid University, the marks of a course are split into two equal halves, semester_internal and semester_final, which share the total of 100 marks equally. The semester_internal marks include not only the Mid_semester 1 and Mid_semester 2 marks but also lab exams (if any), assignments, quizzes and activities, based on the course specification allocated for each course. As an initial step, this research predicts the results of the exam completed in the previous semester, considering only the Mid_semester 1 and Mid_semester 2 marks. Predicting the results of the current semester, considering various other factors, is proposed as a future direction of this research.
The major classifiers provided in WEKA for machine learning include 3:
• weka.classifiers.IBk: k-nearest neighbour learner
• weka.classifiers.j48.J48: C4.5
From the perspective of machine learning applications, there are ten major algorithms that suit the classification process of any research problem. Machine learning algorithms fall into three major categories, Linear, Non-linear and Ensemble, as shown in Table 1. Linear algorithms assume that the predicted attribute is a linear combination of the input attributes. Non-linear algorithms make no assumption about the relationship between the input attributes and the output attribute being predicted, whereas Ensemble methods combine the predictions from multiple models in order to make more robust predictions.
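To make the ensemble idea concrete, the sketch below combines three base predictors by majority vote. The three threshold rules and the pass mark of 10 per exam are hypothetical and only show how several predictions are merged into one; WEKA ensembles such as Bagging and RandomForest combine trained models rather than fixed rules.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most base models."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical base predictors on (mid_semester_1, mid_semester_2).
def rule_mid1(m1, m2):
    return "Pass" if m1 >= 10 else "Fail"

def rule_mid2(m1, m2):
    return "Pass" if m2 >= 10 else "Fail"

def rule_total(m1, m2):
    return "Pass" if m1 + m2 >= 20 else "Fail"

BASE_MODELS = [rule_mid1, rule_mid2, rule_total]

def ensemble_predict(m1, m2):
    """Collect one prediction per base model and take the majority."""
    return majority_vote([model(m1, m2) for model in BASE_MODELS])

print(ensemble_predict(12, 7))  # two of three rules say Fail
print(ensemble_predict(12, 9))  # two of three rules say Pass
```

An odd number of base models avoids ties in a two-class problem.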

Data Collection and Preparation Phase
For the analysis and prediction of academic results, only three main attributes, the Mid_semester 1 marks, Mid_semester 2 marks and Semester_internal marks, were taken as input attributes for classifying and predicting the results of the final exams. The odd-semester marks of 2018 from the College of Arts and Science, AhdRufaidah, a female branch of King Khalid University, are used for the analysis. The marks of 2350 students were analysed, of which 1880 records (80% of the data) were used as training data and the remaining 20% as test data for prediction. The prediction results are given in the following sections, on the basis of which the appropriate method is to be chosen for future implementation.
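The 80/20 split described above can be reproduced as follows. The shuffling step and the fixed seed are assumptions, since the text does not state how records were assigned to the two sets; the integer record ids stand in for the student records.

```python
import random

def split_train_test(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and cut it into training and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder ids for the 2350 student records.
student_ids = list(range(2350))
train_set, test_set = split_train_test(student_ids)
print(len(train_set), len(test_set))
```

With 2350 records and a training fraction of 0.8, the cut falls at 1880 records, which leaves 470 for testing.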

Data Analysis Phase
The research process for predicting student academic performance involves two main steps. The first step is to find a suitable machine learning algorithm that meets our requirement of predicting academic performance from the available dataset. The classification accuracy of 12 WEKA classifiers (REPTree, NaiveBayes, J48, Bagging, IBk, MultilayerPerceptron, RandomForest, RandomTree, Stacking, AdaBoost, Logistic and SMO) was compared on datasets with varying numbers of instances. Based on the results, the best 5 methods are chosen and compared on the entire dataset for prediction. The training set is used to create the prediction model, and the testing set is used to check the model's accuracy.

Results and Discussion
Ten machine learning algorithms were considered, and results such as classification accuracy, the kappa statistic and mean absolute error were compared. Initially, the twelve classifiers were evaluated on the prediction results of 2350 instances; the results are shown in Table 2. Based on the results of Table 2, the classifiers are first compared on the kappa statistic, defined as $\kappa = (P_o - P_e) / (1 - P_e)$, where $P_o$ is the relative observed agreement among raters and $P_e$ is the hypothetical probability of chance agreement.
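The kappa statistic can be computed directly from a confusion matrix: the observed agreement is the share of instances on the diagonal, and the chance agreement follows from the row and column totals. The sketch below shows this calculation; the 2x2 matrix is a made-up example, not a result from this study.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    p_o = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: product of matching row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (p_o - p_e) / (1 - p_e)

example = [[20, 5],
           [10, 15]]
kappa = cohen_kappa(example)
print(round(kappa, 3))
```

A value of 1 means perfect agreement between actual and predicted classes, while 0 means the agreement is no better than chance.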
The error measures are expected to be as low as possible, since a low error indicates a reliable prediction. The next parameter considered is accuracy, given by (TP+TN)/(TP+TN+FP+FN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives. On the basis of these three factors, the classifiers RandomTree, IBk, RandomForest, Bagging and J48 are selected, in that ranking order for the given training data, for predicting results on the test data. The three factors considered in analysing and deciding on the model are shown graphically in Figure 2.
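The accuracy and error measures can be written out as small functions. The counts and probabilities below are invented for illustration. How the mean absolute error is derived for class labels is not detailed in the text, so the sketch assumes it is taken between 0/1 class indicators and predicted class probabilities.

```python
def accuracy(tp, tn, fp, fn):
    """Share of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical counts from a two-class confusion matrix.
acc = accuracy(tp=40, tn=30, fp=10, fn=20)

# Hypothetical class indicators (1 = Pass) and predicted Pass probabilities.
indicators = [1, 0, 1, 1]
probabilities = [0.9, 0.2, 0.6, 1.0]
mae = mean_absolute_error(indicators, probabilities)
print(acc, round(mae, 3))
```

Higher accuracy and lower mean absolute error both point to a better classifier, which is why the two are read together.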
The chosen 5 classifiers are tested on the test data of 500 instances to decide on the prediction model. Table 3 shows that the J48 classifier is eliminated, since the remaining 4 classifiers, Bagging, RandomForest, IBk and RandomTree, have an accuracy of approximately 60% and a kappa statistic of approximately 0.37. On further filtering, IBk and RandomTree show a lower mean absolute error than the other classifiers, as shown in Table 4. Hence IBk, the WEKA implementation of the k-nearest neighbour algorithm, and RandomTree are chosen for the final evaluation.
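For readers unfamiliar with the k-nearest neighbour idea behind IBk, a minimal sketch follows. It is not the WEKA implementation: the training points, the Pass/Fail labels and the use of plain Euclidean distance without attribute normalisation are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(training, query, k=1):
    """Label a query point by majority vote among its k nearest training points."""
    neighbours = sorted(training, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training data: ((mid_semester_1, mid_semester_2), final_result).
training = [
    ((12, 14), "Pass"),
    ((5, 7), "Fail"),
    ((15, 13), "Pass"),
    ((6, 4), "Fail"),
    ((9, 10), "Pass"),
]

print(knn_predict(training, (6, 6), k=1))
print(knn_predict(training, (11, 12), k=3))
```

The method stores the training data and defers all work to prediction time, so its cost grows with the number of stored instances.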
Since the confusion matrix determines the accuracy statistics achieved, a comparison of the two classifiers is shown in Figure 2. Parameters are derived from the confusion matrix, and the resultant confusion matrix of the experiment is shown. In a confusion matrix, the correct predictions lie on the diagonal and the misclassifications lie off the diagonal. To improve the accuracy of prediction, the data were remodelled repeatedly with different classifier methods to create a balanced classification.
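The structure of a confusion matrix can be illustrated with a short sketch. The actual and predicted labels below are invented; the point is only that correct predictions accumulate on the diagonal and errors fall elsewhere.

```python
def confusion_matrix(actual, predicted, labels):
    """Count instances per (actual, predicted) pair; rows are actual classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Hypothetical labels for six students.
actual = ["Pass", "Pass", "Fail", "Fail", "Pass", "Fail"]
predicted = ["Pass", "Fail", "Fail", "Pass", "Pass", "Fail"]

matrix = confusion_matrix(actual, predicted, labels=["Pass", "Fail"])
correct = sum(matrix[i][i] for i in range(len(matrix)))      # diagonal entries
misclassified = sum(map(sum, matrix)) - correct              # off-diagonal entries
print(matrix, correct, misclassified)
```

Dividing the diagonal sum by the total number of instances gives the accuracy, so the matrix carries everything needed for the measures discussed earlier.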
Best rules found on prediction using Apriori algorithm:

Future Works and Conclusion
The academic result is not decided by a single factor; it depends on several factors. Based on the analysis in this research, a decision has been reached on the specific algorithm to use in predicting results from several factors. As future work, a questionnaire for students and teachers is to be developed to evaluate and predict the results from various angles and on various parameters.
The questionnaire will cover the factors that contribute to the results, such as the difficulties faced by students during the course and the difficulties faced by teachers that pull the results down. The positive factors will also be collected to estimate what favours academic performance.

Acknowledgement
Credit to the College of Arts and Science, AhdRufaidah, Abha, Kingdom of Saudi Arabia.