A Novel Negative Selection Algorithm with Optimal Worst-case Training Time Complexity for R-chunk Detectors

Objectives: To generate a complete and non-redundant detector set with optimal worst-case time complexity. Methods: In this study, a novel exact-matching, string-based Negative Selection Algorithm utilizing r-chunk detectors is proposed. The improved algorithms are tested on several data sets, and the experimental results are compared with recently published ones. Moreover, the algorithms' complexities are also proved mathematically. Findings: For string-based Artificial Immune Systems, the r-chunk detector is the most common detector type, and its generation complexity is one of the important factors considered in the literature. We propose optimal automata-based algorithms to represent all detectors. Novelty/applications: The algorithm can generate a representation of the complete and non-redundant detector set with optimal worst-case time complexity. To the best of our knowledge, it is the first algorithm to possess such worst-case training time complexity.


Introduction
The biological immune system is a cooperative system that provides a comprehensive line of defense for humans against pathogens. After millions of years of evolution, it has become a defensive system that is adaptive, inherently distributed, and incredibly robust. It possesses powerful capabilities such as pattern recognition, learning, and memory, which help it combat infections caused by pathogens (such as viruses), even though it needs no central control or coordination. In the training phase of a Negative Selection Algorithm (NSA), candidate detectors are first randomly generated by some process. They are then censored by being matched against the self-sample data given by a set S, where S might represent the system components. Any candidate detector that matches at least one element of S is discarded; the ones that survive are retained and stored in a set called the detector set. The flowchart of the detection phase is given in Figure 1b. This phase is used to discriminate between selves (system components) and non-selves (anomalies, outliers): if a new data instance matches any detector in the detector set, it is regarded as non-self [16].
For string-based NSAs, the two most well-known matching rules for the construction of detector sets are r-contiguous and r-chunk. For both rules, a major problem with existing NSA implementations is that the first phase, detector generation, might have exponential worst-case time complexity. The state-of-the-art algorithm for generating complete and non-redundant detector sets, proposed by Elberfeld and Textor [17], possesses time complexities of O(|S|ℓr|Σ|) for the detector generation (training) phase and O(ℓ) for the detection phase. While the worst-case time complexity for the detection phase is optimal (linear time), it remained open whether the training time of the NSAs in Ref. [17] could be improved further. In this article, we show that, at least for the r-chunk matching rule, the training time complexity can be improved: we propose a fast r-chunk-based NSA for generating non-redundant detector sets that requires only O(|S|ℓ|Σ|) training time while still maintaining the worst-case O(ℓ) time complexity for the detection phase. The removal of the factor r from the training time complexity is substantial, as in some applications such as intrusion detection r can be approximately 50 [18]. Moreover, it can easily be shown that such worst-case training time complexity is optimal (i.e. it cannot be further improved). Table 1 summarizes the worst-case time complexities of the previously published r-chunk detector-based algorithms and our proposed algorithm. In the table and the rest of the article, it is assumed that a binary alphabet is used (|Σ| = 2). Note that the parameter |D| in Table 1 is only relevant for the algorithms that generate detectors in explicit form. Our algorithm and the algorithms in Refs. [19][20] produce results that attain the maximal number of generated detectors.
The organization of the rest of the paper is as follows. Some basic terminologies and definitions related to (string) languages, automata, and matching rules (r-chunk, r-contiguous) are given in the next section. Section 3 details our proposed r-chunk-based negative selection algorithm. Experiments and discussions are given in Section 4. Finally, the paper is concluded in Section 5, where we also highlight some possible future work.

Background
To keep the paper self-contained and consistent, in this section some basic concepts are defined using notations similar to those in Ref. [17].

Strings, Substrings, Languages
Let Σ be a finite (non-empty) set of symbols called an alphabet. We define Σ* as the set of all strings over Σ; that is, any string s ∈ Σ* consists of a sequence of symbols taken from Σ. For each string s, the number of symbols in s defines its length (denoted |s|). When |s| = 0, s is called the empty string.

Prefix Trees, Prefix Directed Graphs, Automata
A rooted, directed tree T with edge labels from Σ is called a prefix tree over alphabet Σ if, for all c ∈ Σ and every node n in T, n has at most one outgoing edge labeled with c. A tree T contains a string s (written s ∈ T) if there is a path p in T from the root to a leaf of T such that the string concatenated along p equals s.
For a given tree T, the language L(T) = {s | s has a nonempty prefix in T}. For instance, for T as in Figure 2a, we can assert that 10 ∈ T and 0 ∈ T, but 1 ∉ T. Therefore, 0 ∈ L(T) and 01 ∈ L(T) since 0 ∈ T, but 11 ∉ L(T) since T does not contain any prefix of 11. Similar to prefix trees, a prefix DAG D is defined as a directed acyclic graph whose edges are labeled with symbols from an alphabet Σ. A string s ∈ D if there is a path p from a root to a leaf of D such that the string concatenated along p equals s.
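The two membership notions above (s ∈ T versus s ∈ L(T)) can be illustrated with a small Python sketch; the paper itself gives no code, so the dict-of-dicts representation and the example tree (containing the strings 0 and 10, matching the Figure 2a discussion) are our own illustrative choices.

```python
# A prefix tree over the binary alphabet as nested dicts; an empty dict is a leaf.
T = {"0": {}, "1": {"0": {}}}  # contains the strings "0" and "10"

def in_tree(tree, s):
    """s ∈ T: a root-to-leaf path spells exactly s."""
    node = tree
    for c in s:
        if c not in node:
            return False
        node = node[c]
    return len(node) == 0  # the path must end at a leaf

def in_language(tree, s):
    """s ∈ L(T): some nonempty prefix of s is contained in T."""
    return any(in_tree(tree, s[:i]) for i in range(1, len(s) + 1))

assert in_tree(T, "10") and in_tree(T, "0") and not in_tree(T, "1")
assert in_language(T, "0") and in_language(T, "01") and not in_language(T, "11")
```

The assertions reproduce the worked example above: 10 ∈ T and 0 ∈ T but 1 ∉ T, hence 01 ∈ L(T) while 11 ∉ L(T).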
For a node n in D, we define the language L(D, n) as the set of all strings s such that s has a (nonempty) prefix equal to the concatenated sequence of labels on the path from n to some leaf in D.
For example, for the DAG D in Figure 2b and its lower-left node n, L(D, n) consists of all strings that start with 11. We also define the language L(D) = ∪_{m is a root of D} L(D, m).
A finite automaton is defined as a five-tuple M = (Q, q0, Qa, Σ, ∆), where Q is a set of states with q0 ∈ Q called the initial state, Qa ⊆ Q is the set of accepting states, Σ is the alphabet of M, and ∆ ⊆ Q × Σ × Q is the transition map. The transition map is unambiguous in that for any q ∈ Q and c ∈ Σ, there is at most one q′ ∈ Q with (q, c, q′) ∈ ∆. We can use a graph G = (V, E) to represent the transition relation ∆ of an automaton M by setting the node set V = Q and letting E consist of labeled edges, where a c-labeled edge runs from q to q′ for any q, q′ ∈ Q with (q, c, q′) ∈ ∆.
A string s is accepted by an automaton M if, in the transition graph of M, there is a path from q0 to some q ∈ Qa such that its concatenated sequence of symbols equals s. The set of all strings accepted by an automaton M is its language L(M).
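The acceptance condition can be sketched directly from the five-tuple definition: store ∆ as a dict keyed by (state, symbol), which makes the unambiguity requirement automatic. The concrete automaton below (accepting binary strings that contain "11") is our own illustrative example, not one from the paper.

```python
def accepts(delta, q0, accepting, s):
    """Follow the unambiguous transition map; accept iff we end in Qa."""
    q = q0
    for c in s:
        if (q, c) not in delta:  # no transition defined: reject
            return False
        q = delta[(q, c)]
    return q in accepting

# Example: binary strings containing "11".
# q0: no progress; q1: last symbol was 1; q2: "11" has been seen.
delta = {
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q0", ("q1", "1"): "q2",
    ("q2", "0"): "q2", ("q2", "1"): "q2",
}
assert accepts(delta, "q0", {"q2"}, "0110")
assert not accepts(delta, "q0", {"q2"}, "1010")
```

Because each (q, c) key maps to at most one successor state, the dict enforces exactly the unambiguity condition stated above, and a length-ℓ string is decided in O(ℓ) steps.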

Detectors
Given an alphabet Σ, a string length ℓ, a self-set S ⊆ Σℓ, and r ∈ {1,...,ℓ} (called the matching parameter), we can define r-chunk detectors as follows [17]. Definition 1. An r-chunk detector is a tuple (d, i), where d ∈ Σr is a string of length r and i ∈ {1,...,ℓ − r + 1} is a position. An r-chunk detector (d, i) is said to match a string s if d occurs in s at position i.
Given a set of strings S, the set of r-chunk detectors that do not match any string in S, denoted CHUNK(S, r), is called the detector set for S. A string m ∈ Σℓ is called non-self w.r.t. S and its r-chunk detector set if m matches at least one detector from CHUNK(S, r); otherwise, m is regarded as self. The set of non-self strings of S w.r.t. its r-chunk detectors is denoted CHUNK-NONSELF(S, r).
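As a sanity check of Definition 1, CHUNK(S, r) can be enumerated naively: for each position, collect the chunks occurring in S and keep every (d, i) not among them. This brute-force sketch (our own, not the paper's algorithm) takes time exponential in r, which is precisely the blow-up the proposed NSA avoids; it is only meant to pin down the definitions.

```python
from itertools import product

def chunk_detectors(S, r, alphabet="01"):
    """Naive CHUNK(S, r): all (d, i), d ∈ Σ^r, that occur in no self string
    at position i (1-based, as in Definition 1)."""
    ell = len(next(iter(S)))
    detectors = set()
    for i in range(ell - r + 1):
        occurring = {s[i:i + r] for s in S}  # self chunks at this position
        for d in map("".join, product(alphabet, repeat=r)):
            if d not in occurring:
                detectors.add((d, i + 1))
    return detectors

def is_nonself(m, detectors):
    """m is non-self iff it matches at least one detector."""
    return any(m[i - 1:i - 1 + len(d)] == d for (d, i) in detectors)

S = {"000", "011"}
D = chunk_detectors(S, 2)
assert D == {("10", 1), ("11", 1), ("01", 2), ("10", 2)}
assert is_nonself("110", D) and not is_nonself("000", D)
```

For S = {000, 011} with r = 2, the self chunks are {00, 01} at position 1 and {00, 11} at position 2, so the four detectors above are exactly the missing chunks, and every string in S is classified as self, as the definition demands.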
Another popular form of detectors for NSAs is r-contiguous, defined as follows [17]. Definition 2. An r-contiguous detector can be any string d ∈ Σℓ. A detector d is said to match a string s ∈ Σℓ if there is a position i ∈ {1,..., ℓ − r + 1} such that d[i...i + r − 1] = s[i...i + r − 1]. Similar to r-chunk detectors, we denote the set of all r-contiguous detectors not matching any string in S as CONT(S, r). A string m ∈ Σℓ is non-self if it matches at least one r-contiguous detector in CONT(S, r); otherwise, it is called self.
Since an r-contiguous detector can be decomposed into ℓ − r + 1 overlapping r-chunk detectors, the r-chunk rule is considered a simplification of the r-contiguous matching rule [21]. It has been shown in Ref. [22] that chunk-based detectors can help NSAs work well on problems where contiguous regions in the sequence of input data are not semantically correlated, e.g. when the input sequences are network data packets.
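The decomposition claim can be made concrete: a length-ℓ detector d matches s under the r-contiguous rule exactly when one of its ℓ − r + 1 windows, viewed as an r-chunk detector at the corresponding position, matches s. The sketch below (our own illustration) verifies this equivalence exhaustively for a small case.

```python
from itertools import product

def cont_matches(d, s, r):
    """r-contiguous rule: d and s agree on r contiguous positions."""
    return any(d[i:i + r] == s[i:i + r] for i in range(len(s) - r + 1))

def chunks_of(d, r):
    """The l - r + 1 overlapping r-chunk detectors (d[i..i+r-1], i) of d."""
    return [(d[i:i + r], i) for i in range(len(d) - r + 1)]

def chunk_matches(chunks, s):
    """Some chunk occurs in s at its own position."""
    return any(s[i:i + len(c)] == c for (c, i) in chunks)

d, r = "10110", 3
for s in map("".join, product("01", repeat=5)):
    assert cont_matches(d, s, r) == chunk_matches(chunks_of(d, r), s)
```

Since both sides reduce to "d[i..i+r−1] = s[i..i+r−1] for some i", the loop over all 32 binary strings of length 5 confirms the two views coincide for this d and r.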
For the sake of comparison, we reuse the example from Ref. [17].

Negative Selection Algorithm with Chunk Detectors
Suppose that each self-string of S has an associated index, I = {si, i = 1,..., |S|}. We introduce two important data structures: two arrays Q and P, which are used to expand the prefix DAG step by step. The final automaton to decide the membership of strings in Σℓ is constructed in two stages. The first stage creates a DAG G such that L(G) ∩ Σℓ = Σℓ \ CHUNK-NONSELF(S, r) (Algorithm 1); in the second stage, this DAG is turned into an automaton M such that L(M) ∩ Σℓ = CHUNK-NONSELF(S, r) (Algorithm 2). In Algorithm 1, the notations NULL and new() are used with the usual semantics as in the C programming language.
The key idea behind our construction is that instead of generating all ℓ−r+1 prefix trees T1,...,Tℓ−r+1 as in Ref. [16], we only create one tree T1 for S[1..r] explicitly and then enlarge it by adding nodes and edges level by level to obtain a prefix DAG. After this process, the DAG encodes all positive detectors S[i...i+r−1], i = 1,...,ℓ−r+1. The DAG is then inverted to construct a compressed representation of CHUNK(S, r).
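The semantics that the prefix DAG must realize can be stated without the DAG itself: a length-ℓ string is self exactly when each of its r-windows is a "positive" chunk, i.e. occurs in some self string at that position. The sketch below uses an uncompressed dict-of-sets stand-in for the DAG (the self-set and parameters are our own examples); the paper's Algorithms 1 and 2 compute a compressed automaton for the same predicate, which is what yields the O(ℓ) detection time.

```python
def positive_chunks(S, r):
    """For each position i, the set of chunks S[i..i+r-1] occurring in S."""
    ell = len(next(iter(S)))
    return [{s[i:i + r] for s in S} for i in range(ell - r + 1)]

def is_self(m, pos_chunks, r):
    """m is self iff every r-window of m is positive at its position;
    equivalently, m matches no detector in CHUNK(S, r)."""
    return all(m[i:i + r] in pos_chunks[i] for i in range(len(pos_chunks)))

pos = positive_chunks({"00000", "01111"}, 3)
assert is_self("00000", pos, 3)
assert is_self("01111", pos, 3)
assert not is_self("00111", pos, 3)  # window "001" at position 1 is not positive
```

Note that a string such as 00111 is rejected even though every one of its windows occurs somewhere in S; the position-wise check is what distinguishes r-chunk semantics from plain substring containment.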

Algorithm 1
To generate the positive r-chunk detector set
1. Procedure Positive-R-chunk-Detector(S, ℓ, r, G)
…
6. If no edge with label c starts at n then
7. create new edge (n, n′) labeled with c
8. For every node n ∈ G do
9. If n′ is not reachable from n then
10. delete n
Example 2. Let ℓ = 5, r = 3 and let S be the self-set from Example 1. The prefix DAG generated by Algorithm 1 is illustrated in Figure 3a. After adjustment by Algorithm 2, this DAG is turned into an automaton as in Figure 3b.
Based on the same self-set from Example 1, the automaton generated by the algorithm in Ref. [17] contains 23 nodes and 25 edges, while the automaton in Figure 3b has only 14 nodes and 20 edges. This partly supports the better memory complexity of our proposed algorithm.
In our experiments, we use a popular flow-based dataset, NetFlow, and a random dataset. The flow-based NetFlow dataset, generated from the packet-based DARPA dataset [23], is used for experiment 1. It was encoded as a set of 105,033 binary strings of length 57. A randomly created dataset containing 50,000 strings of length 100 is used for experiment 2.
The parameter values and running times (in milliseconds) of the experiments are shown in Table 2.
In Table 2, the runtimes of the NSA in Ref. [16] for the two experiments are shown in columns a and c, respectively, while the runtimes of our proposed NSA are given in columns b and d. The results in the table show that there is a positive correlation between the threshold r and the ratio of the runtime of the NSA in Ref. [16] to the runtime of CHUNK-NSA (as in columns a/b and c/d). Figure 4 shows that, for random strings in experiment 2, the ratio gets closer to r as r increases (i.e. our proposed NSA is almost r times faster than the NSA proposed in Ref. [24]). The ratio increases more slowly as r increases in experiment 1 (on real data), approaching r/3. Overall, the experimental results are consistent with our theoretical proof (Theorem 1).

Conclusions
In this study, we have introduced a new NSA to generate complete and non-redundant r-chunk detector sets with optimal worst-case time complexity. Our theoretical proof and empirical experiments show that the proposed r-chunk detector algorithm, CHUNK-NSA, trains much faster than the state-of-the-art one.
A limitation of CHUNK-NSA is that it consumes more memory than the existing NSA, as it utilizes two extra arrays Q and P with memory complexities of |Σ|r and |S|r, respectively. In our opinion, this drawback is not serious, as modern computing systems support huge internal memory storage, especially when |Σ| is small (e.g. the binary alphabet with |Σ| = 2).