Select the Best Machine Learning Algorithms for Prediction and Classification of Intrusions using KDD99 Intrusion Detection Dataset

Objectives/Methods: The growing prevalence of network attacks threatens the availability, confidentiality and integrity of critical company information. Intrusion detection systems are therefore increasingly used to identify unusual access or attacks and to secure internal networks. In this study, we outline the evolution of big data in intrusion detection systems and apply three supervised learning methods, namely Naïve Bayes, Random Tree and Support Vector Machines (SVM), to the KDD99 data set. The purpose of this research is to detect and predict attacks so that preventive action can be taken against intrusion risks. Findings: Experimental results show that Random Tree gives the highest accuracy, at 100%. The results will be useful in choosing the best classification algorithm for intrusion prediction. Application/Improvements: To simulate and test the performance of the algorithms, we used WEKA (Waikato Environment for Knowledge Analysis), which includes tools for data preparation, classification, regression, clustering, association rule extraction and visualization.


Introduction
Today, Intrusion Detection Systems (IDS) play a very important role in network security, especially as the number of attacks targeting confidential information is increasing, rising from 9 million attacks in June 2004 to over 33 million attacks in less than a year 1 . One of the solutions proposed to solve this problem is the use of Network Intrusion Detection Systems (NIDS), which detect attacks by monitoring network activity 2 . These systems must therefore be accurate and fast, so that attacks are reported to network administrators quickly enough for appropriate countermeasures to be taken.
Traditionally, Intrusion Detection Systems have relied on human analysts to distinguish between intrusive and normal traffic. However, the massive and increasing volume of data requires the use of machine learning techniques that provide decision tools for analysts and automatically generate rules to be applied in order to prevent unauthorized access to the computer network 3 .
In this study we use the Waikato Environment for Knowledge Analysis (WEKA), a data mining tool, for classification. We first classify the data set and then determine the best algorithm for diagnosing and predicting intrusions.
The main contributions of this work are: selection of the best classifier for the intrusion detection system, comparison of different data mining algorithms on the KDD99 intrusion detection data set, and identification of the best solution for intrusion prediction based on algorithm performance. The rest of the paper is organized as follows: IDS is presented in Section 2, related work is discussed in Section 3, Section 4 describes the experiment, Section 5 explains in detail the experiments using the proposed machine learning models, and Section 6 presents the conclusions and future prospects.

Intrusion Detection System
An IDS is a mechanism for identifying abnormal or suspicious activity on a given target so that problems can be remedied as soon as possible. IDSs are based on several approaches. The scenario approach uses a database of signatures and tries to match data obtained from the system's information sources against those known signatures. The behavioral approach detects violations of the system's security policy by observing the behavior of users and comparing it with a model of behavior considered normal, called a profile 4 .
In this paper we evaluate the performance of classifiers trained to identify attack signatures.
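To make the contrast between the two approaches concrete, the following is a minimal sketch and not the mechanism of any particular IDS. The signature strings, the function names and the three-standard-deviation threshold are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev
from typing import Sequence

# Hypothetical signature database: substrings assumed to indicate an attack.
SIGNATURES = {"/etc/passwd", "' OR 1=1", "../../"}


def matches_signature(payload: str) -> bool:
    """Scenario approach: flag the payload if it contains a known signature."""
    return any(sig in payload for sig in SIGNATURES)


def deviates_from_profile(value: float, history: Sequence[float], k: float = 3.0) -> bool:
    """Behavioral approach: flag a value that lies more than k standard
    deviations away from the mean of the observed normal profile."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A constant profile: any different value counts as a deviation.
        return value != mu
    return abs(value - mu) > k * sigma
```

The first function can only recognize attacks that are already in the database, while the second can flag unseen behavior at the cost of depending on how well the profile describes normal activity.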

Related Work
In paper 5, the authors present a machine learning framework for intrusion detection systems intended to protect wireless sensor networks. Their system is not limited to particular attacks: machine learning methods build the detection model from training data automatically, which reduces the human labor needed to write attack signatures or to specify the normal behavior of a sensor node.
In paper 6, the authors have presented two orthogonal and complementary approaches to reduce the number of false positives in intrusion detection by using alert post-processing via data mining and machine learning. Furthermore, these two methods can be used jointly in an alert-management system due to their complementary nature. These concepts have been verified on a variety of data sets, and achieved a significant reduction in the number of false positives in both simulated and real environments.
In paper 7, the authors use a hybrid intelligent approach that combines classifiers to make the best decision, which improves the performance of the resulting model. The procedure first filters all of the training data, in a supervised or unsupervised manner, using a classifier or a clustering method; the output is then passed to another classifier that classifies the data. They use a two-class classification strategy and 10-fold cross-validation to obtain the final results, which separate intrusions from normal traffic. The simulation shows that their proposed approach is effective, with a high detection rate and a low false alarm rate.
In paper 8, the authors examine four learning algorithms on a breast cancer data set. In an effort to predict breast cancer and reduce the risk of death, they use Random Forest, Naive Bayes, Support Vector Machines (SVM) and K-Nearest Neighbors (K-NN), and choose the most effective one.

Weka
Waikato Environment for Knowledge Analysis (WEKA) is a collection of machine learning algorithms designed to facilitate the application of machine learning techniques to a variety of real-world problems, including tools for data preparation, classification, regression, clustering, association rule extraction and visualization 9 .

KDD-99 Data Set
This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment. Many published studies have shown that KDD99 is the most widely used data set in the IDS and machine learning domains, and it is effectively the reference data set for these research areas 10 . The data set contains 22 intrusion types (Table 1), 42 attributes and 494020 instances (Web-1).
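As an illustration of how one instance of this data set might be read, the sketch below assumes that each record is a comma-separated line whose 42 attributes are 41 features followed by the class label, and that the label may end with a period, as in the raw KDD Cup 1999 files. The function names are hypothetical, and the two-class mapping is an illustrative simplification rather than the setup used in this study.

```python
def parse_record(line: str):
    """Split one comma-separated record into 41 feature strings and a class label."""
    fields = line.strip().split(",")
    if len(fields) != 42:
        raise ValueError(f"expected 42 fields, got {len(fields)}")
    *features, label = fields
    # Assumption: raw labels end with a period; rstrip is harmless otherwise.
    return features, label.rstrip(".")


def to_binary(label: str) -> str:
    """Collapse the intrusion types into a single 'attack' class."""
    return "normal" if label == "normal" else "attack"
```

For example, a record labelled `neptune.` would be parsed to the label `neptune` and mapped to `attack`.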

Classifiers Used
For our work we use the following classifiers:
1. Naïve Bayes simplifies learning by assuming that the features are independent given the class. Although this independence assumption rarely holds in practice, Naïve Bayes often competes well with more sophisticated classifiers 11 .
2. Random Tree is a supervised classifier that uses a bagging idea to create a random set of data from which a decision tree is constructed. This algorithm can be used for both classification and regression problems 12 .
3. SVMs are a learning technique that can be seen as a new method for training polynomial, neural network or radial basis function classifiers. Although SVMs are considered easier to use than neural networks, many users are not familiar with them 13 .
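The independence assumption behind Naïve Bayes can be illustrated with a minimal sketch for categorical features. This is not WEKA's implementation; the function names, the use of Laplace (add-one) smoothing and the toy data in the usage note are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and, for each class, feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    vocab = defaultdict(set)             # feature index -> distinct values seen
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, y)][v] += 1
            vocab[i].add(v)
    return class_counts, value_counts, vocab


def predict(model, row):
    """Return the class with the highest log-posterior score for the row."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for y, n_y in class_counts.items():
        score = math.log(n_y / total)  # log prior of the class
        for i, v in enumerate(row):
            # Independence assumption: each feature contributes its own factor.
            # Add-one smoothing avoids a zero probability for unseen values.
            count = value_counts[(i, y)][v]
            score += math.log((count + 1) / (n_y + len(vocab[i])))
        if score > best_score:
            best, best_score = y, score
    return best
```

With a toy training set such as `[("tcp", "http"), ("tcp", "http"), ("icmp", "ecr_i"), ("icmp", "ecr_i")]` labelled `normal, normal, attack, attack`, the model assigns `("icmp", "ecr_i")` to `attack`, because each feature value is more frequent in that class.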

Metrics
In this section we will describe the metrics, evaluate the machine learning methods used, and discuss the results. Accuracy: The accuracy of detection is given by the percentage of correctly classified instances. It is the number of correct predictions divided by the total number of instances in the data set.
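The accuracy can be measured by the following equation, where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

Recall: also known as sensitivity, is the rate of positive observations that are correctly predicted as positive. The sensitivity, or true positive rate (TPR), is defined by:

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (2)$$

while the specificity, or true negative rate (TNR), is given by:

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (3)$$

Precision: percentage of correctly classified elements for a given class:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (4)$$

F-measure: combination of precision and recall:

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5)$$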
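The metrics of this section can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name and the convention of returning 0.0 when a denominator is zero are assumptions, not part of the original study.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, recall, specificity, precision and F-measure
    from true/false positive and negative counts."""

    def ratio(num: float, den: float) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f_measure": ratio(2 * precision * recall, precision + recall),
    }
```

For instance, with 90 true positives, 80 true negatives, 20 false positives and 10 false negatives, the accuracy is 170/200 = 0.85 and the recall is 90/100 = 0.9.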

Result and Discussion
To implement and evaluate the classifiers, we apply 10-fold cross-validation, a method used to evaluate predictive models. It divides the original set into ten folds; in each round, nine folds form the training sample used to build the model and the remaining fold forms the test set used to evaluate it. After applying the preprocessing and preparation methods, we analyze the data and determine the distribution of values in terms of effectiveness and efficiency. Tables 2-4 present the results of the simulation. To compare the performance of the classifiers we used their weighted averages, based on the number of correctly classified instances, the number of incorrectly classified instances, precision and the model build time (Table 5).

Table 5. Weighted average of classifiers

These results are visualized in Figure 1, a graph that illustrates the performance of the classifiers. Random Tree is the best classifier for the KDD99 data set, with 100% precision, a 100% true positive rate and a 0% false positive rate. Its model takes longer to build than that of Naive Bayes, which achieves only 98.9% precision. SVM cannot classify the data set correctly, and it takes a long time to build a model (289.99 s).
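The 10-fold cross-validation procedure used for the evaluation can be sketched as a generator of train/test index splits. This is a simplified, unstratified version written for illustration, not the code WEKA runs; the function name, the shuffling seed and the round-robin fold assignment are assumptions.

```python
import random


def k_fold_indices(n: int, k: int = 10, seed: int = 0):
    """Yield (train, test) index lists for k-fold cross-validation over n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Each instance appears in exactly one test fold, so every instance is used for testing once and for training in the other nine rounds; the per-fold results are then averaged.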

Conclusion
In this article, we have analyzed the KDD99 intrusion detection data set using three machine learning algorithms, namely Naïve Bayes, Random Tree and SVM. The results show that the Random Tree algorithm is the best way to classify all the data. The overall performance of Naive Bayes and of the SVM algorithm is unacceptable. Our future work is therefore to optimise the intrusion detection system by employing a decision tree algorithm and using the Python programming language.