Instant fuzzy search using probabilistic-correlation based ranking

Background: Instant search recommends completions of the query on the fly and displays the results with every keystroke. It is desirable that these query results be robust against typographical errors that appear not only in the query but also in the documents. Additionally, instant search requires instant response time and ranking of the results to focus on the most important answers. Method: In this study, simple and efficient methods for instant fuzzy single-keyword and multi-keyword search are studied that are resilient to typographical errors and that employ no more than inverted and forward indices. While search results are computed incrementally using the cached results, the answers are ranked by their relevance to the query using probabilistic-correlation based ranking. Findings: Experiments are conducted on the DBLP and Medline data sets, and the execution time for obtaining answers to instant fuzzy single-keyword search is recorded for different prefix lengths. Similarly, the execution time for obtaining answers to instant fuzzy multi-keyword search is recorded for sub-queries of two keywords and three keywords for various prefix lengths on the same data sets. Furthermore, in order to measure the usefulness of the proposed correlation based ranking, precision is calculated for the search results. Experimental evaluation demonstrates the efficacy of the instant fuzzy search algorithms and the probabilistic-correlation based ranking. Applications: The proposed instant fuzzy keyword search for single and multiple keywords improves not only the efficiency but also the quality of the search results.


Introduction
In instant search, for each keyword or for the keyword that is presently being typed, the search engine must return results not only for the similar words but also for the words whose prefix is similar to the keyword.
One of the earliest examples of instant search is the Unix shell, which lists all file names starting with the characters typed so far on the command line. While the purpose of instant search is finding information, it is also employed in text editors for predicting user input (1)(2)(3). In the context of information retrieval systems, Bast et al. (4)(5)(6) proposed methods for indexing keywords and then querying for instant search. However, these techniques require large processing times and a lot of space. There are also studies on fuzzy instant search that is typo-tolerant and word-order independent (7)(8)(9)(10). More recently, instant search has been integrated into a number of search engines, for example Google Instant for searching the web, Facebook for finding relevant people, the Internet Movie Database for recommending movies, YouTube Instant for suggesting videos, and interactive query suggestions within an e-mail system (11). However, these search engines make use of massive logs of previous queries and fail to generate appropriate answers if the query posed is not in the log.
An additional challenge in search engines is dealing with massive amounts of documents in the data repository. One solution is to rank the query results, because ranking directs attention towards only the most relevant answers. Ranking algorithms are extensively studied in databases and information retrieval. Ranking models can be classified as vector space models (12), probabilistic models (13), statistical language models (14) and hybrid models (15).
In this article, simple and efficient methods for fuzzy instant search are studied that answer single-keyword and multi-keyword queries. The methods make use of no more than inverted and forward indices. Moreover, the methods compute answers incrementally using the cached results, and rank the answers by their relevance to the query using probabilistic-correlation based ranking.
The rest of the article is organized as follows: Section 2 presents preliminaries of the study. While Section 3 presents instant fuzzy keyword search, Section 4 describes instant fuzzy multi-keyword search, and Section 5 defines probabilistic-correlation based relevance ranking. Finally, Section 6 demonstrates the experiments conducted and Section 7 concludes the paper.

Preliminaries
Data Set: Let $R = \{r_1, r_2, \ldots\}$ be a set of records and $D = \{w_1, w_2, \ldots\}$ be a dictionary that contains all distinct keywords of $R$. Table 1 presents an example of a set of records, and Table 2 gives an inverted index of that set.
Table 1: An example set of records (rows $r_3$ to $r_6$)
ID      Record
$r_3$   Using linear algebra for intelligent information retrieval
$r_4$   Approaches to intelligent information retrieval
$r_5$   Freenet: a distributed anonymous information storage and retrieval system
$r_6$   Survivable information storage systems
Similarity Measurement: In this study, edit distance is employed to measure the dissimilarity between two keywords. It is defined as the minimum number of edit operations (insertion, deletion, or substitution of a single character) required to convert one keyword into the other. Let $\mathrm{edist}(s_1, s_2)$ denote the edit distance between two keywords $s_1$ and $s_2$; the two keywords are similar if $\mathrm{edist}(s_1, s_2) \le \tau$, where $\tau$ is the edit-distance threshold.
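The similarity test above can be illustrated with a minimal sketch. The function names `edist` and `similar` are chosen here to mirror the notation of the text and are not taken from any implementation by the authors; the dynamic-programming formulation is the standard one for Levenshtein distance.

```python
def edist(s1: str, s2: str) -> int:
    """Return the Levenshtein edit distance between s1 and s2."""
    # prev[j] = distance between the first i-1 characters of s1
    # and the first j characters of s2 (row-by-row dynamic programming).
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def similar(s1: str, s2: str, tau: int) -> bool:
    """Two keywords are similar when edist(s1, s2) <= tau."""
    return edist(s1, s2) <= tau
```

For instance, `edist("retrieval", "retreival")` is 2 (two substitutions), so the two strings are similar for $\tau = 2$ but not for $\tau = 1$.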

Instant fuzzy keyword search
Instant fuzzy keyword search consists of searching for keywords that have a prefix close to the query string. More formally, given a query string $q = c_1 c_2 \ldots$, instant fuzzy keyword search generates a ranked list of pairs $(w_i, r_j)$ such that a prefix $v$ of $w_i$ is similar to $q$ and $w_i \in D$ is a keyword of $r_j \in R$. In this section, a simple and efficient algorithm for instant fuzzy keyword search is presented, in which the information system accepts a sequence of queries from a user who is typing character by character, and computes the answers to each query from the answers generated for the previous query in the sequence. The answers are ranked by their relevance, which is described in Section 5.

Algorithm Description
Let $q = c_1 c_2 \ldots$ be the query being typed character by character, and $\tau$ be an edit-distance threshold. For the sub-queries $q_x = c_1 c_2 \ldots c_x$ with $x \le \tau$, the answer is the set of all pairs $(w_i, r_j)$ such that $w_i \in D$ is a keyword of record $r_j \in R$, since any prefix of length at most $\tau$ is within edit distance $\tau$ of such a sub-query.
For the sub-query $q_{\tau+1} = c_1 c_2 \ldots c_{\tau+1}$, the entire inverted index (Table 2) is scanned in order to generate the answer. Let $\phi_{\tau+1}$ be the answer for query $q_{\tau+1}$, and $v_{i,\tau+1}$ be the prefix of $w_i$ whose length is equal to $\tau + 1$. Then for each $w_i \in D$ the prefix $v_{i,\tau+1}$ is compared against the query $q_{\tau+1}$, and the pair $(w_i, r_j)$ is included in the answer set of query $q_{\tau+1}$ if and only if $v_{i,\tau+1}$ is similar to $q_{\tau+1}$. In other words, the answer to query $q_{\tau+1}$ is a set of pairs defined as follows:
\[ \phi_{\tau+1} = \{ (w_i, r_j) \mid w_i \in D,\ w_i \text{ is a keyword of } r_j \in R,\ \mathrm{edist}(v_{i,\tau+1}, q_{\tau+1}) \le \tau \} \tag{1} \]
Table 2: Inverted index of the set of records (excerpt)
Keyword      Records
anonymous    $r_5$
Freenet      $r_5$
intelligent  $r_3, r_4$
linear       $r_3$
retrieval    $r_1, r_2, r_3, r_4, r_5$
storage      $r_2, r_5, r_6$
structures   $r_1$
survivable   $r_6$
system       $r_5, r_6$
using        $r_3$
For the subsequent queries $q_x = c_1 c_2 \ldots c_x$ such that $x > \tau + 1$, the answer $\phi_x$ is computed from $\phi_{x-1}$ as follows. Instead of scanning the entire inverted index, only the keywords of the pairs included in $\phi_{x-1}$ are considered. At first, $\phi_x$ is initialized to be empty; then for each $(w_i, r_j) \in \phi_{x-1}$, the edit distance between the query $q_x$ and the prefix $v_{i,x}$ of $w_i$ whose length is equal to $x$ is computed, and the pair $(w_i, r_j)$ is included in $\phi_x$ if and only if $\mathrm{edist}(v_{i,x}, q_x) \le \tau$. Formally, the answer to query $q_x$ is a set of pairs defined as follows:
\[ \phi_x = \{ (w_i, r_j) \in \phi_{x-1} \mid \mathrm{edist}(v_{i,x}, q_x) \le \tau \} \tag{2} \]
The proposed algorithm is typo-tolerant, simple and efficient. It employs an inverted index but does not scan it in its entirety for every sub-query. Furthermore, while calculating the edit distance between the prefix of a keyword and the given query, only the prefix whose length is equal to the length of the query is considered.
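The incremental procedure described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the record ids are plain strings, the tokenizer simply lower-cases and keeps alphabetic runs (no stop-word removal or stemming), and a keyword shorter than the sub-query is compared as a whole. The names `build_inverted_index` and `instant_fuzzy_search` are hypothetical.

```python
import re
from collections import defaultdict

# Records r3..r6 of Table 1; the string ids are an assumption of this sketch.
RECORDS = {
    "r3": "Using linear algebra for intelligent information retrieval",
    "r4": "Approaches to intelligent information retrieval",
    "r5": "Freenet: a distributed anonymous information storage and retrieval system",
    "r6": "Survivable information storage systems",
}


def edist(s1, s2):
    """Levenshtein edit distance (row-by-row dynamic programming)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_inverted_index(records):
    """Map each distinct keyword to the sorted ids of the records containing it."""
    index = defaultdict(set)
    for rid, text in records.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(rid)
    return {w: sorted(rids) for w, rids in index.items()}


def instant_fuzzy_search(query, inverted_index, tau):
    """Return the answer sets phi_1, ..., phi_n for the sub-queries of `query`."""
    answers = []
    phi = []
    for x in range(1, len(query) + 1):
        q_x = query[:x]
        if x <= tau:
            # Sub-queries no longer than tau: every (keyword, record) pair qualifies.
            phi = [(w, r) for w, rids in inverted_index.items() for r in rids]
        elif x == tau + 1:
            # One full scan of the inverted index, comparing prefixes of length x.
            phi = [(w, r) for w, rids in inverted_index.items()
                   if edist(w[:x], q_x) <= tau for r in rids]
        else:
            # Later sub-queries: filter only the cached answer of the previous one.
            phi = [(w, r) for (w, r) in phi if edist(w[:x], q_x) <= tau]
        answers.append(phi)
    return answers


INDEX = build_inverted_index(RECORDS)
```

With $\tau = 1$, calling `instant_fuzzy_search("infr", INDEX, tau=1)` scans the full index only for the sub-query "in"; the sub-queries "inf" and "infr" are answered by filtering the cached pairs, and the final answer pairs the keyword "information" with each of the four records.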

Instant fuzzy multi-keyword search
A multi-keyword query $q_l$ consists of a sequence of keywords $(w_1, w_2, \ldots, w_l)$. In instant search, a query is generated for each character typed by the user, and instant fuzzy multi-keyword search consists of finding the records $r_j \in R$ for which the following two conditions hold:
• The record $r_j$ has a keyword similar to $w_i$ for every $1 \le i \le l - 1$.
• The record $r_j$ has a keyword with a prefix similar to $w_l$.

Algorithm description
Initially, as the user types the query character by character, the answer for the first keyword $w_1$ is computed using an inverted index as described in Section 3.1. Upon completion of the first keyword, the records that match it are cached. Subsequently, as the user continues typing and forms a multi-keyword query, a sub-query is generated for each character typed, and the results of the previous sub-query are cached. Let $w_l$ be the keyword being typed, where $l > 1$; then prefixes similar to $w_l$ are searched only in the records of the cached results of the previous sub-query rather than in the complete set of keywords. Let $R_{i-1}$ be the set of records cached for the sub-query $q_{i-1}$; the set $R_i$ of records that contain keywords similar to $w_1, w_2, \ldots, w_{l-1}$ and a keyword with a prefix similar to $w_l$ is computed for $q_i$ as follows. For each record $r \in R_{i-1}$, the forward index is used to determine whether $r$ contains a keyword with a prefix similar to $w_l$. If such a keyword exists, $r$ is included in $R_i$.
The algorithm described employs an inverted and a forward index. Furthermore, the algorithm is both typo-tolerant and word-order independent. The algorithm is efficient since it does not scan all records for every sub-query.
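The refinement of cached records through a forward index can be sketched as below. This is a simplified illustration rather than the authors' code: the helper names (`first_keyword_records`, `refine`, `multi_keyword_search`) are hypothetical, tokenization is a plain lower-casing split on alphabetic runs, and the completed first keyword is matched against whole keywords of the inverted index.

```python
import re

# Records r3..r6 of Table 1; the string ids are an assumption of this sketch.
RECORDS = {
    "r3": "Using linear algebra for intelligent information retrieval",
    "r4": "Approaches to intelligent information retrieval",
    "r5": "Freenet: a distributed anonymous information storage and retrieval system",
    "r6": "Survivable information storage systems",
}


def edist(s1, s2):
    """Levenshtein edit distance (row-by-row dynamic programming)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Forward index: record id -> keywords of that record.
FORWARD = {rid: re.findall(r"[a-z]+", text.lower()) for rid, text in RECORDS.items()}
# Inverted index: keyword -> ids of the records containing it.
INVERTED = {}
for rid, words in FORWARD.items():
    for w in words:
        INVERTED.setdefault(w, set()).add(rid)


def first_keyword_records(inverted_index, w1, tau):
    """Records containing a keyword similar to the completed first keyword."""
    return {r for w, rids in inverted_index.items()
            if edist(w, w1) <= tau for r in rids}


def refine(cached, forward_index, term, tau, prefix):
    """Compute R_i from R_{i-1}: keep cached records that contain a keyword
    (or, when prefix=True, a keyword prefix of len(term)) similar to term."""
    kept = set()
    for r in cached:
        for w in forward_index[r]:
            candidate = w[:len(term)] if prefix else w
            if edist(candidate, term) <= tau:
                kept.add(r)
                break
    return kept


def multi_keyword_search(keywords, inverted_index, forward_index, tau):
    """Answer (w_1, ..., w_l), l > 1, where w_l is still being typed."""
    if len(keywords) < 2:
        raise ValueError("expects at least two keywords")
    cached = first_keyword_records(inverted_index, keywords[0], tau)
    for w in keywords[1:-1]:  # completed keywords w_2 .. w_{l-1}
        cached = refine(cached, forward_index, w, tau, prefix=False)
    partial = keywords[-1]
    for x in range(1, len(partial) + 1):  # one sub-query per typed character
        cached = refine(cached, forward_index, partial[:x], tau, prefix=True)
    return cached
```

With $\tau = 1$, the query `["informaton", "retr"]` tolerates the typo in the first keyword and returns the records r3, r4 and r5; only the cached records are inspected for each new character, never the whole collection.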

Probabilistic-correlation based relevance Ranking
Keywords in a record must be associated with weights based on how well they distinguish that record from other records. In general, a keyword that occurs in many records is a poor discriminator and must be assigned a smaller weight than a keyword that occurs in few records.
Correlation between keywords is a measure of their inter-dependence and is calculated from conditional probabilities. The measure is symmetric, that is, $\mathrm{cor}(w_i, w_j) = \mathrm{cor}(w_j, w_i)$. Moreover, $\mathrm{cor}(w_i, w_j) = 1$ implies that the keywords $w_i$ and $w_j$ are interdependent and always appear together in the records; conversely, $\mathrm{cor}(w_i, w_j) = 0$ implies that the keywords $w_i$ and $w_j$ are independent of each other and never appear together.
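As one illustration of a co-occurrence measure with the properties stated above (symmetry, value 1 when two keywords always appear together, value 0 when they never do), the sketch below uses the Jaccard coefficient over the record sets of the two keywords, which equals the conditional probability that both keywords occur given that at least one does. This choice is an assumed stand-in made for illustration and is not necessarily the measure used in this study; the function name `cor` and the small index, taken from rows of Table 2, are likewise illustrative.

```python
# A few rows of the inverted index of Table 2 (keyword -> record ids).
INDEX = {
    "intelligent": {"r3", "r4"},
    "linear": {"r3"},
    "retrieval": {"r1", "r2", "r3", "r4", "r5"},
    "storage": {"r2", "r5", "r6"},
    "survivable": {"r6"},
    "using": {"r3"},
}


def cor(w_i, w_j, inverted_index):
    """Jaccard co-occurrence of two keywords (an assumed stand-in measure):
    |records with both| / |records with at least one|."""
    a = inverted_index.get(w_i, set())
    b = inverted_index.get(w_j, set())
    union = a | b
    if not union:
        return 0.0  # neither keyword occurs in any record
    return len(a & b) / len(union)
```

Here "linear" and "using" occur in exactly the same record, so their score is 1, while "linear" and "survivable" share no record and score 0. Restricting `inverted_index` to the cached results instead of the whole repository gives the cached variant of the computation.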
More generally, the correlation of a keyword $w_i$ with one or more keywords simultaneously can be calculated as follows. Let $R_l \in 2^R$ be a set of cached results and $W_l \in 2^D$ be the set of all distinct keywords in the cached results $R_l$. If keyword $w_i \in W_l$, then the correlation of $w_i$ given a set of keywords $W \subseteq W_l$ over the cached results $R_l$ is denoted as $\mathrm{cor}_l(w_i, W)$, and is computed from the cached results instead of the entire data repository $R$.

Relevance ranking
Let $r \in R_l$ be a record in the cached results $R_l$ consisting of keywords $r = \{k_1, k_2, \ldots, k_n\}$, where $k_i \in W_l$ for $1 \le i \le n$; the relevance, or importance, of keyword $k_i$ in record $r$ is denoted as $\mathrm{rel}(k_i, r)$. It can be noted that $0 \le \mathrm{rel}(k, r) \le 1$; $\mathrm{rel}(k, r) = 1$ implies that the keyword $k$ has the highest relevance in $r$, while $\mathrm{rel}(k, r) = 0$ implies no relevance of keyword $k$ in $r$.

Relevance Ranking of Single Keyword Query Results
Let $\phi_x$ be the set of answers to the query $q_x$ defined according to (2). Then, for every pair $(w_i, r_j) \in \phi_x$, the relevance of keyword $w_i$ in record $r_j$, denoted as $\mathrm{rel}(w_i, r_j)$, is the rank of the corresponding pair.

Relevance Ranking of Multi-Keyword Query Results
Let $q_l = (w_1, w_2, \ldots, w_l)$ be a multi-keyword query, and $R_l \in 2^R$ the corresponding set of cached results. The answers $r \in R_l$ are then ranked by their relevance to the multi-keyword query $q_l$.

Experiments
The performance of the proposed algorithms is evaluated on two real data sets, namely DBLP and Medline. For this study, 1000 DBLP log queries and 1000 Medline log queries are extracted, and the experiments are conducted on a machine with an Intel i3 CPU at 2.6 GHz and 2 GB of memory. The average time for generating answers to the instant fuzzy keyword search described in Section 3.1 is recorded for various prefix lengths and presented in Figure 1A. Similarly, in order to evaluate instant fuzzy multi-keyword search, the average running time for generating answers to sub-queries of two keywords and three keywords is recorded for different prefix lengths and shown in Figure 1B and Figure 1C respectively. It can be noted from Figure 1A, Figure 1B, and Figure 1C that the proposed algorithms performed consistently well across the different prefix lengths. To gauge the quality of the answers generated by the proposed algorithms and the ranking method, precision is measured by determining the percentage of the expected results generated by these approaches. Based on the investigation, the precision of instant fuzzy keyword search is 85% and that of instant fuzzy multi-keyword search is 90%.