Predicting University Dropout through Data Mining: A Systematic Literature

Objectives: To conduct a systematic review of the literature on predicting university student dropout through data mining techniques. Methods/Analysis: The study was developed as a systematic review of empirical research results regarding the prediction of university dropout. In this phase, the review protocol, the selection requirements for potential studies and the method for analyzing the content of the selected studies were defined. The classification presented in section 3 allowed us to answer the main research question: What aspects are considered in the prediction of university student dropout through data mining? Findings: University dropout is a problem that affects universities around the world, with consequences such as reduced enrolment, reduced revenue for the university, and financial losses for the State that funds the studies; it also constitutes a social problem for students, their families, and society in general. Hence the importance of predicting university dropout, that is, identifying students at risk of dropping out in advance in order to design strategies to tackle this problem. Novelty/Improvement: This is the first work to perform an integral systematic literature review of university dropout prediction through data mining, with studies from 2006–2018.


Introduction
There is currently an increasing interest in researching the topic of university dropout around the world 1, with one of the main concerns being elevated rates of occurrence 2. Dropout negatively affects institutions through reduced enrolment and the non-achievement of institutional objectives 3. As a consequence, students, universities and governments are affected in both economic and social terms. Furthermore, dropout becomes a critical topic when university administrators do not possess the tools necessary to identify students who are at risk of leaving the institution. In turn, this limits the corrective measures 4 that might have enabled student retention at higher education institutions 5. In the same way, the early prediction of student dropout has become a major challenge, as has identifying the factors that contribute to this increasingly frequent phenomenon 6. One possible reason that university dropout rates remain high is that most of the prediction models applied to this problem are difficult to interpret 7. A significant effort has been made to close the university dropout gap and thus reduce dropout rates. Nonetheless, this effort has been insufficient 4; according to the Organization for Economic Cooperation and Development (OECD), in 2016, European dropout rates ranged between 30% and 50%, while in the United States the student dropout rate was 37% 8.
In some Latin American countries, such as Colombia, dropout rates exceeded 40%, while in Brazil they reached approximately 54%. In Costa Rica, the dropout rate reached 50% 9, with public universities presenting higher dropout rates than private ones 10. One of the measures to deal with university dropout is based on predicting it; for this purpose, data mining is used, which aims at developing methods to identify patterns in large datasets and thereby extract meaningful knowledge 2. This approach is widely used in the prediction process to study dropout, due to its acceptable degree of significance 11,12. In general, this process follows four stages, which range from data pre-processing to result evaluation (Figure 1).
Prior literature surveys on data mining and education 13,14 have covered topics such as learning management systems, intelligent tutoring systems, adaptive educational systems, learning analytics, student modeling, and predicting academic performance. However, none of these considers the topic of university dropout, despite the large number of studies regarding factors that influence university dropout and techniques for dropout prediction. For this reason, the present study aims to answer the following question: What aspects are considered in predicting university student dropout through data mining? To meet this objective, we propose a systematic literature review of the period 2006-2018, including journals indexed in Scimago Journal & Country Rank, from which we identified and analyzed 67 articles from nine academic publishers. The present article is organized in five sections. The first section is this introduction, followed by the methodology for the systematic literature review. Subsequently, the results and analysis of the selected documents are presented in the third section. The discussion and conclusions are then presented in the fourth and fifth sections, respectively.

Research Methodology
In order to perform this systematic review, we followed the methodology applied in 15, which consists of three stages:

Planning
Five research questions were proposed in order to determine the aspects that have been developed to predict university student dropout.
Articles from conferences and journals indexed in Scimago Journal & Country Rank (SJR) with impact factor were reviewed in the following databases: Science Direct, ACM Digital Library, IEEE Xplore, Springer, DOAJ, Taylor and Francis, Emerald, Proquest and Ebsco. For document selection, the inclusion and exclusion criteria presented in Table 1 were applied.
The following search criteria were considered: "dropout student" OR "drop out student" OR "dropping student" AND "data mining", which were applied to the title, abstract and keywords in the search period between January 2006 and December 2017.
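The search criteria above can be read as a boolean filter over each record's title, abstract and keywords. The following is a minimal sketch, not the authors' tooling: the function names are hypothetical, and it assumes the query is grouped as (any dropout phrase) AND "data mining", which the text does not state explicitly.

```python
import re

# Search phrases taken from the review protocol; everything else is illustrative.
DROPOUT_TERMS = ("dropout student", "drop out student", "dropping student")
REQUIRED_TERM = "data mining"


def matches_search(title: str, abstract: str, keywords: list[str]) -> bool:
    """Return True if the record matches (any dropout phrase) AND "data mining".

    Assumption: the OR terms are grouped before the AND is applied.
    """
    # Join the three searched fields and normalise case and whitespace.
    text = " ".join([title, abstract, *keywords]).lower()
    text = re.sub(r"\s+", " ", text)
    has_dropout_term = any(term in text for term in DROPOUT_TERMS)
    return has_dropout_term and REQUIRED_TERM in text


def in_search_period(year: int, first: int = 2006, last: int = 2017) -> bool:
    """Check a publication year against the stated search period."""
    return first <= year <= last
```

A record would be kept for screening only if both `matches_search` and `in_search_period` return True; the inclusion and exclusion criteria are then applied by hand.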

Table 1. Inclusion and exclusion criteria

Inclusion:
• Models to provide a solution to the problem of university student dropout.
• Documents that present factors influencing university dropout.
• Papers that include prediction based on data mining.
• Papers that present metrics to assess the quality of predictive models.
• Papers that respond to the research questions.

Exclusion:
• Prediction documents that are unrelated to university student dropout, such as those on primary, secondary and postgraduate education.
• Documents not related to data mining.
• Documents that do not have numeric experimentation.
• Documents that are not found within the established search period.

Result
Table 2 summarizes the total identified and selected documents by information source, Science Direct being the main source of information, with 40% of the primary selected studies. Meanwhile, Emerald and ACM Digital Library present rates of 4.47% and 1.49%, respectively. Figure 3 exhibits the increase in studies during the past 12 years and the interest that researchers have in solving the problem of university dropout prediction. Of the primary selected documents, 87% come from journals (58 studies out of 67), and 13% correspond to conference publications (9 studies out of 67), as presented in Figure 4. From the selected documents, we identified three aspects regarding university dropout prediction: factors, techniques and tools, all of which are specified in the framework of the present study.
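The journal and conference shares quoted above follow directly from the study counts. As a small sketch (the helper name `share` is hypothetical), the percentages can be recomputed as follows; only the counts 58, 9 and 67 come from the text.

```python
def share(count: int, total: int) -> float:
    """Percentage that `count` represents of `total`, rounded to two decimals."""
    return round(100 * count / total, 2)


TOTAL_SELECTED = 67  # selected primary studies, as stated in the text

journal_share = share(58, TOTAL_SELECTED)     # studies published in journals
conference_share = share(9, TOTAL_SELECTED)   # studies published in conferences
```

Rounded to whole numbers, these values give the 87% and 13% reported in the text.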

Dropout factors: The reasons for which students leave their studies 16.
Data mining techniques: The objective of these techniques is to discover patterns, profiles and trends through

Implementation
We performed the search process based on the strategies proposed in section 2. Once selected, each document's content was reviewed in order to determine whether it matched the established selection criteria. The systematic literature review process is presented in Figure 2.

In the pre-processing stage, eleven techniques were identified (Table 3). This stage allows the management of anomalies as well as the correction of atypical and

ADF40 Total failed courses 20

Vol 12 (4) | January 2019 | www.indjst.org

are considered internal factors of variability which are simple to define and measure 22.
Academic factors: These refer to the development of students in their formative process. We identified 40 academic factors, which correspond to 36% of the total identified factors, presented in Table 5.
Analysis of these factors shows that the university entrance test is the most frequently used factor in the literature. It bears mentioning that the learning process at university is closely related to preceding study levels, which affect further educational achievements 23. In the same way, the score that a student obtains in the university entrance examination is considered an indicator to explain success or failure in the academic trajectory at university 5. In this sense, many studies have analyzed the predictive validity of this factor, considering it a predictor of cognitive and attitudinal characteristics that are of the utmost importance for students to succeed at university 24.
Economic factors: These are related to students' ability to satisfy the economic requirements that present themselves during the academic program. In this dimension, 15 factors that affect dropout were identified, corresponding to approximately 13% of the total analyzed factors; they are presented in Table 6. These economic factors refer to material comforts and the ability of parents to allocate more and better resources for the academic performance of their children, which has a significant impact on academic achievements 25.
Social factors: These are aspects that affect students as a whole, and which are determined by their place and space, as presented in Table 7.
The social dimension focuses on the importance of the interaction between students and their social environment, in particular interaction in relation to the institution, academic norms, and study habits 26.

Institutional factors: The factors in this category relate to the structural and functional characteristics of an institution, which are presented in Table 8; these represent approximately 3.53% of the total analyzed factors.

c) Q3: What techniques are used for factor selection?
We identified ten techniques for factor selection, which are presented in Table 9. The objective of these techniques is to select the most relevant factors used as input

missing values 17. The purpose of these techniques is to improve the properties of the variables and solve data anomalies to optimize the search process of data mining algorithms 18. This is based on three activities: integration, cleaning and transformation of the information. All of the studies 10 involving the pre-processing stage are concentrated around the activity of data transformation, with the techniques of normalization and discretization being the most commonly used. However, integration and cleaning activities are also important; as 19,20 indicate, selecting the wrong variables in the data mining process can negatively affect the prediction accuracy of these techniques.
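To illustrate the two transformation techniques named above, the following minimal sketch shows min-max normalization and equal-width discretization. These are common variants chosen for illustration; the reviewed studies may use other forms. The function names and the sample scores are hypothetical.

```python
def min_max_normalize(values: list[float]) -> list[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values: list[float], n_bins: int) -> list[int]:
    """Discretize values into n_bins equal-width intervals (indices 0..n_bins-1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical entrance-test scores, invented for illustration only.
scores = [40.0, 55.0, 70.0, 85.0, 100.0]
normalized = min_max_normalize(scores)
binned = equal_width_bins(scores, n_bins=3)
```

Normalization keeps the ordering and relative spacing of the scores, whereas discretization deliberately discards detail in exchange for categorical inputs.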

b) Q2: What factors affect dropout?
We identified 112 factors to predict university dropout, which were classified according to the five dimensions (personal, academic, economic, social and institutional) proposed in 21.
Personal factors: These constitute characteristics that determine student behavior, such as feelings, thoughts or actions, which are decisive in the development of their educational environment. We identified 31 factors in the personal category, corresponding to approximately 28% of the total identified factors, as shown in Table 4. For many authors, personal factors are the main cause of students dropping out of university, and Table 4 supports this view. Age and gender are the most frequently used factors for prediction; this is because they

EDF15 Type of financial assistance 63,38,48,12

We identified 14 data mining techniques, which were classified into artificial intelligence and statistical method techniques; these are presented in Tables 10 and 11. Approximately 79% (22 out of 28 studies) used Decision tree classifiers. According to 22,30, this technique is used due to its flexibility when processing numerical and categorical data, its invariance to monotonic transformations of explanatory variables, and the ease of interpreting its results. Furthermore, it presents better accuracy rates. The authors of 31 and 32 mention that the ID3 algorithm (a Decision tree classifier) is effective in classifying data from student history records and is more sensitive in comparison to other algorithms.
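As background for the ID3 algorithm mentioned above: ID3 grows a tree by repeatedly splitting on the attribute with the highest information gain. The following minimal sketch computes that split criterion only, not a full tree. The attribute names and the four student records are invented for illustration and do not come from the reviewed studies.

```python
from collections import Counter
from math import log2


def entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows: list[dict], attribute: str, target: str) -> float:
    """Entropy reduction obtained by splitting `rows` on `attribute`."""
    base = entropy([row[target] for row in rows])
    # Group the target labels by the value of the candidate attribute.
    groups: dict[object, list[str]] = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical student records, invented for illustration only.
students = [
    {"failed_courses": "high", "scholarship": "no", "dropout": "yes"},
    {"failed_courses": "high", "scholarship": "yes", "dropout": "yes"},
    {"failed_courses": "low", "scholarship": "no", "dropout": "no"},
    {"failed_courses": "low", "scholarship": "yes", "dropout": "no"},
]

# ID3 would split first on the attribute with the largest gain.
best_attribute = max(
    ["failed_courses", "scholarship"],
    key=lambda a: information_gain(students, a, "dropout"),
)
```

In this toy dataset the number of failed courses separates the classes perfectly, so it is chosen as the first split, while the scholarship attribute carries no information about dropout.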
Neural network classifiers and support vector machines hold the second highest frequency of use, since these data mining approaches are considered powerful tools for solving classification problems 33 and are used frequently for their simplicity and ease of understanding 32. Four statistical techniques were identified, corresponding to a total of 36 references, or 3% (4 out of 14 techniques) of the total studies analyzed. Of these, 54% (21 out of 39 studies) applied Linear Regression and Logistic Regression,

as variables for dropout prediction models. Approximately 55% (23 out of 42 studies) used descriptive statistics, as this technique describes the dispersion, location and distribution of the variables 27. Additionally, the technique is frequently used to identify patterns regarding student characteristics and behaviors related to dropout. Of these 23 studies, 14 are oriented towards variable correlation and apply this type of analysis to evaluate the association and relationship of quantitative data in terms of directionality, through correlation coefficients 28. On the other hand, 12% (5 out of 42 studies) apply Principal Components Analysis to reduce the dimensionality of the observed variables to a smaller number of hypothetical variables; thus, groups of variables that correlate with one another are created. These variables are transformed into independent factors that are used in dropout prediction models 29.
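The correlation-based selection described above can be sketched with the Pearson correlation coefficient. The text does not say which coefficient the reviewed studies use, so this is one common choice and not the authors' method; the sample data are invented for illustration.

```python
from math import sqrt


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equally long samples.

    Undefined (division by zero) if either sample is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: failed courses per student and a 0/1 dropout flag.
failed_courses = [0.0, 1.0, 2.0, 3.0, 4.0]
dropped_out = [0.0, 0.0, 0.0, 1.0, 1.0]
r = pearson_r(failed_courses, dropped_out)
```

The sign of `r` gives the direction of the association and its magnitude the strength; a factor with a coefficient near zero would be a candidate for removal before modeling.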

d) Q4: What techniques are used for prediction and what are their levels of reliability?

Discussion
Of the 67 studies identified on university student dropout prediction, 18% consider the pre-processing phase. This phase is important for improving variable properties, solving data anomalies, and increasing accuracy rates. We found that 90% of the studies regarding dropout prediction consider the factor dimension, which evidences its relevance to the scientific community. Age, gender, ethnicity, and entrance exam performance are the most commonly used factors and correspond to the personal dimension. Although the factors are wide-ranging, their behavior changes from one context to another; therefore, there is much controversy over which factors prove to be most effective in university dropout prediction. With respect to factor selection techniques, 34% of studies used descriptive statistics and 7% used principal components analysis. Factor selection is one of the most relevant phases when predicting dropout because it reduces the dimensionality of the variables. Thus, it allows us to adequately select the most predominant factors used as input variables in dropout prediction models. With regard to the techniques used to predict dropout, statistical techniques are currently the most commonly used. However, these are gradually being replaced by artificial intelligence techniques, since the latter present higher accuracy rates. Nevertheless, these rates vary according to the factors and educational context, the educational environment, and the theoretical framework of the analysis.

Conclusions
This study presents a systematic literature review on the aspects of data mining considered for predicting university dropout. We identified 1,681 primary studies related to the topic, from among which 67 documents were selected according to the established inclusion and exclusion criteria, identifying five important dimensions: factors, pre-processing techniques, factor selection techniques, prediction, and tools. This study makes an

these are frequently used techniques for classification based on data characteristics, and are flexible in the use of categorical and continuous predictor variables 34.
On the other hand, regarding the accuracy of data mining techniques, the authors considered metrics such as sensitivity, specificity, and accuracy. Of these, accuracy is the ratio of correctly classified records, that is, True Positives (TP) plus True Negatives (TN), to the total number of records, as formulated in equation (1).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

where FP is the number of false positives and FN the number of false negatives. Tables 12 and 13 report the accuracy levels of the data mining techniques that reached an accuracy higher than 60% on datasets of more than 100 students.
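As a minimal sketch, the three metrics named above can be computed from the four confusion-matrix counts. Accuracy is the share of correctly classified records out of all records. Sensitivity and specificity use their standard textbook definitions, which the text names but does not spell out. The function name and the counts in the example are hypothetical.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        # Share of all records that were classified correctly.
        "accuracy": (tp + tn) / total,
        # Share of actual dropouts that were predicted as dropouts.
        "sensitivity": tp / (tp + fn),
        # Share of actual non-dropouts that were predicted as non-dropouts.
        "specificity": tn / (tn + fp),
    }


# Hypothetical confusion matrix for 100 students, invented for illustration.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

When dropouts are a small minority of the records, accuracy alone can look high even if few of them are detected, which is one reason to report sensitivity and specificity alongside it.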
The results show that, among the artificial intelligence techniques, the most accurate is the Decision Tree Classifier, with the classifiers C4.5, ID3, and CART reaching accuracies of 98%, 97.5%, and 97%, respectively. Among the statistical techniques, the most accurate is Linear Regression (87.8%). However, these results cannot be generalized, as they depend on the dataset and the variables considered.

e) Q5: What tools are used?
We identified four tools in studies with artificial intelligence techniques, and seven tools in those using statistical methods; these are presented in Tables 14 and 15, respectively. The results highlight that the most widely used tools are WEKA and SPSS Modeler, most likely due to their wide variety of machine learning algorithms for data mining tasks, flexibility in predictive modeling, and their facilities and functionalities 26.

ID   Technique              References
ES1  Logistic regression    7, 25, 63, 28, 69, 34, 39, 40, 54, 56, 58, 59, 33, 62, 73-75, 32, 82, 84
ES2  Linear regression      83, 60, 38, 47-50, 52, 57, 27, 70-72, 77-81
ES3  Discriminant analysis  24

Technique                         Reference  Accuracy (%)
Radial basis function             51         70
Support vector machine            51         79
Support vector machine            32         65
Logistic regression               32         65
Random forest                     32         86
Gradient boosting decision tree   32         88
Artificial neural networks        67         62
Decision trees and random forest  67         63
62,375
Artificial neural network         20         84
Decision tree classifier          20         82
Bayesian networks                 20         63
85
Support vector machine            63         90
Decision tree classifier          63         89
Logistic regression               63         80
Logistic regression               39         84
Naive Bayes                       39         83
Support vector machine            39         67
Decision tree classifier          39         83
128
Decision tree classifier          58         84
Logistic regression               58         84
Naive Bayes                       58         82
Artificial neural network         58         82
(Continued)

inventory of 112 factors that influence dropout prediction. These factors were classified into five dimensions: personal, academic, economic, social, and institutional; the most commonly studied was the personal dimension, which considers factors such as age, ethnicity and gender. Furthermore, we identified ten pre-processing techniques, the most widely used being normalization and discretization. There were ten techniques for factor selection, of which descriptive statistics and Principal Component Analysis were the most referenced. Additionally, fourteen techniques were identified for dropout prediction, and these were classified into statistics and artificial intelligence. The statistical techniques presented a higher frequency of use, while the artificial intelligence techniques presented greater accuracy rates. Finally, there are many data mining tools, of which the most used are WEKA and SPSS Modeler. Consequently, it is clear that university dropout prediction is of interest to the scientific community, as evidenced by the large volume of works on the topic and its socio-economic impact. To address the problem of dropout, highly accurate techniques are being developed; however, we cannot identify one technique that is clearly superior, since prediction accuracy depends mainly on the context, data and technique characteristics; any potential alternative must consider these factors.

References
• Question 1 (Q1): What techniques are used for data pre-processing?
• Question 2 (Q2): What factors affect dropout?
• Question 3 (Q3): What techniques are used for factor selection?
• Question 4 (Q4): What techniques are used for prediction and what are their levels of reliability?
• Question 5 (Q5): What tools are used?

Figure 1. Data mining process for university dropout prediction 24.

Figure 3. Temporal trend of selected publications on university dropout.

Table 3. Techniques for data pre-processing

Table 7. Social dimension factors

Table 9. Techniques for the selection of factors

Table 13. Accuracy of statistical techniques

Table 14. Tools used in studies applying artificial intelligence techniques

Table 15. Tools used in studies applying statistical techniques