Probabilistic multiple correlation based term weighting scheme for measuring similarity of unstructured text records

Background/Objectives: In this study, a term weighting scheme derived from probabilistic multiple correlation is defined for measuring similarity between unstructured text records. Methods: While intra-correlation is the correlation of terms in the same record, inter-correlation is the correlation of terms that exist in different records. Probabilistic multiple correlation based term weighting calculates the weight or relevance of a term by considering its intra-correlation with one or more terms simultaneously. Subsequently, the term weights are used in measuring the inter-correlation of terms and then the similarity between two text records. Findings: The experiments are run on unstructured text records that are incomplete and employ abbreviations. Precision, recall and f-score improve significantly with the probabilistic multiple correlation based term weighting scheme compared with the probabilistic simple correlation weighting scheme. Applications: Using the probabilistic multiple correlation based term weighting scheme can improve the overall accuracy in matching unstructured text records that contain abbreviations and incomplete data.


Introduction
Unstructured textual data is textual information that does not have a well-defined structure. It can include e-mail messages, transcripts, metadata, health records, comments and chat logs, to name a few. According to data analysts, 80 percent of all business information is unstructured (1,2). Management of unstructured data may include duplicate record detection, categorization, clustering, information extraction and integration, all of which typically depend on estimating similarity among unstructured text records. In this study, a term weighting scheme based on the probabilistic multiple correlation of a term with one or more terms simultaneously is discussed in the context of citation matching.
The key aspect of citation matching is determining which citations represent the same publication, which is a difficult problem because citations suffer from inaccuracies, missing references and incorrect references (3). Consequently, citation matching can be studied as approximate string matching. Table 1 illustrates several citations that represent the same document despite many superficial variations. Since citations are text strings, approximate string matching measures such as cosine similarity and edit distance (4) can be employed to quantify the similarity or dissimilarity, respectively, between two citation records. However, these metrics cannot be applied directly, owing to the differences among citation formats, which include abbreviations and incomplete data. Cohen (5) proposes a distance metric derived from term frequency and inverse document frequency that can measure the similarity of records in spite of different word orderings and incomplete data. However, citations are short strings in which a term usually appears only once, so the term frequency of a term is almost always one (6). Bilenko & Mooney (7) use machine learning techniques in which the distance metrics for each field are derived by learning, and a classifier that combines the results of different distance metrics is employed. Unfortunately, the methods of Bilenko & Mooney (7) cannot be applied to unstructured text records, since the records do not have well-defined fields. Pasula et al. (8) propose a probabilistic object identification method for citation matching, but it requires identification of the citation style and segmentation of the citation into author and title subfields. More recently, the probabilistic correlation-based similarity defined by Song et al. (6) successfully handles information formats that contain abbreviations and missing data.
In that approach, the intra-correlation between two terms is derived from the probability that the terms occur jointly in the same records. Subsequently, term weights are calculated from the degree of intra-correlation of a term with other terms in a record. Then, the probabilistic correlation-based similarity between two records is calculated from the inter-correlations of terms and the term weights. Generally, two or more terms simultaneously distinguish a text record. Unfortunately, Song et al. (6) measure the correlation between two terms without considering that both terms may be influenced by other terms in distinguishing a record. To overcome this drawback of simple correlation, one can employ multiple correlation, which studies the correlation of a term with one or more terms simultaneously. Therefore, this study investigates a term weighting scheme based on the probabilistic multiple correlation of terms, which considers the correlation of a term with one or more terms simultaneously rather than the simple correlation between two terms. Subsequently, a similarity measure derived from the probabilistic multiple correlation of terms, similar to that of Song et al. (6), is used in comparing two unstructured text records. Finally, an experimental evaluation is reported to demonstrate the efficacy of the proposed approach.
The remaining part of the paper is structured as follows: Section 2 introduces a term weighting scheme derived from probabilistic multiple correlation between two or more terms. Section 3 presents a correlation similarity measure similar to that of Song et al. (6). In Section 4, the performance of the proposed term weighting scheme is demonstrated. Finally, conclusions are drawn in Section 5.

Probabilistic multiple correlation based term weighting
In this section, we derive the probabilistic multiple intra-correlation of terms from the probabilistic simple correlation of terms defined by Song et al. (6). While intra-correlation is the correlation of terms in a single record, inter-correlation is the correlation of terms between two records.

Probabilistic simple intra-correlation of terms
Let R = {r_1, r_2, ..., r_m} be a set of records that are segmented into terms T = {t_1, t_2, ..., t_n}; then the correlation between two terms is calculated from the conditional probability. The conditional probability of the term t_i given term t_j is defined as follows (6): Subsequently, the probabilistic intra-correlation of terms can be defined as follows (6): It can be noted that cor(t_i, t_j) = cor(t_j, t_i). Besides, cor(t_i, t_j) = 1 implies that the terms t_i and t_j always exist jointly in the records, and cor(t_i, t_j) = 0 denotes that they never appear together in any record.
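As a rough illustration of these two definitions, the following Python sketch estimates the conditional probability as a ratio of record counts and takes the intra-correlation to be the product of the two conditional probabilities. Both forms are assumptions made here for illustration; they were chosen because they satisfy the properties stated above (symmetry, a value of 1 for terms that always co-occur, and 0 for terms that never do). The helper names `doc_freq`, `cond_prob` and `cor`, and the toy records, are hypothetical.

```python
def doc_freq(records, terms):
    """Number of records that contain every term in `terms`."""
    terms = set(terms)
    return sum(1 for record in records if terms <= record)


def cond_prob(records, ti, tj):
    """Assumed estimate of Pr(ti | tj): joint record count over count of tj."""
    denom = doc_freq(records, {tj})
    return doc_freq(records, {ti, tj}) / denom if denom else 0.0


def cor(records, ti, tj):
    """Assumed intra-correlation: product of both conditional probabilities."""
    return cond_prob(records, ti, tj) * cond_prob(records, tj, ti)


# Toy record space: each record is the set of its terms (illustrative only).
records = [
    {"hall", "dowling", "approximate"},
    {"hall", "dowling", "csur"},
    {"hall", "matching"},
]

print(cor(records, "hall", "dowling"))   # "dowling" never occurs without "hall"
print(cor(records, "csur", "matching"))  # never in the same record
```

Under these assumptions the value depends only on how often the two terms occur alone and together, so it can be precomputed once for the whole record space.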

Probabilistic multiple intra-correlation of terms
If T = {t_1, t_2, ..., t_n}, then 2^T is the power set of T, which contains all subsets of T. The conditional probability of term t_i given a set of multiple terms S ∈ 2^T can be defined as follows: Consequently, the correlation of term t_i with a set of multiple terms S ∈ 2^T can be defined as follows:
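The idea of correlating a term with a whole set of terms can be sketched in Python as below. The sketch assumes that the simple two-term form generalizes directly: the conditional probability of t_i given S is taken as the number of records containing S together with t_i divided by the number of records containing S, and the correlation as the product of the two conditional probabilities. This generalization, the helper names `doc_freq`, `multi_cor` and `candidate_subsets`, and the toy records are all assumptions for illustration, not the definitions used in the study.

```python
from itertools import combinations


def doc_freq(records, terms):
    """Number of records that contain every term in `terms`."""
    terms = set(terms)
    return sum(1 for record in records if terms <= record)


def multi_cor(records, ti, subset):
    """Assumed correlation of term `ti` with the term set `subset`."""
    subset = set(subset)
    joint = doc_freq(records, subset | {ti})
    df_s = doc_freq(records, subset)
    df_i = doc_freq(records, {ti})
    if df_s == 0 or df_i == 0:
        return 0.0
    # Pr(ti | S) * Pr(S | ti), both estimated from record counts.
    return (joint / df_s) * (joint / df_i)


def candidate_subsets(terms, ti):
    """All non-empty subsets of the vocabulary that exclude `ti`."""
    others = sorted(set(terms) - {ti})
    for size in range(1, len(others) + 1):
        for combo in combinations(others, size):
            yield frozenset(combo)


records = [
    {"hall", "dowling", "approximate"},
    {"hall", "dowling", "csur"},
    {"hall", "matching"},
]
terms = set().union(*records)

# Correlation of "hall" with every candidate term set S.
table = {s: multi_cor(records, "hall", s) for s in candidate_subsets(terms, "hall")}
for subset, value in sorted(table.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(subset), round(value, 3))
```

Because the power set grows exponentially with the vocabulary, a practical implementation would presumably restrict S, for example to term sets that actually occur in some record.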

Term weighting
Terms must be associated with weights based on their discriminability in distinguishing a record. Generally, a more frequent term is a bad discriminator and hence must be assigned a low weight. Conversely, a less frequent term is a good discriminator and must be assigned more weight. It can be observed that terms that appear less frequently have higher correlation to other terms than terms that appear more frequently. Therefore, less frequent terms with higher correlation are more likely to characterize the record. Consequently, terms with higher correlations can be regarded as essential features of the record. We define a term weighting scheme derived from the degree of term correlation with other terms as follows: where freq(S) is the frequency of S in record space R. Thus, the correlation weight lies between 0 and 1, and denotes the relevance of the term t_i in distinguishing a record. Consequently, a high correlation weight implies more relevance of the term in distinguishing the record and vice versa. For instance, let us consider the term set {Hall, Patrick, Geoff, Dowling, et al.} for the sake of simplicity. In Table 2, the terms present in each of the records of Table 1 are recorded. In Table 3, the probabilistic multiple correlation of the term "Hall" to other terms and the correlation weight of the term "Hall" are presented.

Probabilistic inter-correlation of terms
In this section, we discuss the similarity measure described in (6), which is derived from the probabilistic correlation. The cosine similarity measure is single-to-single: a term in one record is matched with only one term in another record. However, in the probabilistic correlation-based similarity measure, multiple-to-multiple correlations exist, that is, one term may be matched with several terms in another record. Consequently, three types of inter-correlations of terms exist between two records (6).
1. The first type of inter-correlation is defined between terms that match exactly, for instance, the inter-correlation of "Hall" in records 1 and 2 of Table 1.
2. The second type of inter-correlation exists between terms that are present in both records, for instance, the inter-correlation between "Hall" in record 1 and "CSUR" in record 2 of Table 1.
3. The third type of inter-correlation exists between terms where at least one of the terms does not occur in both records, for instance, "ACM" in record 4 and "CSUR" in record 5 of Table 1.
Consequently, the probabilistic inter-correlation between terms t_i and t_j in records r_1 and r_2, in that order, can be defined as follows (6): Let M_1 and M_2 constitute the sets of terms of records r_1 and r_2 respectively; then the correlation-similarity between records r_1 and r_2 can be calculated according to (6) as follows: where w_i and w_j denote the correlation weights cow(t_i) and cow(t_j) of terms t_i and t_j respectively, calculated according to Eq. (5), and cor(t_i, t_j) is the probabilistic inter-correlation between terms t_i and t_j calculated according to Eq. (6). Furthermore, in order to normalize the similarity measure, ∥r_1∥ · ∥r_2∥ is employed (6), where ∥r_1∥ is the norm of record r_1, and ∥r_2∥ can be computed in an analogous way.
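The overall shape of such a normalized, weighted similarity can be sketched in Python as follows. The sketch is a generic weighted, correlation-aware cosine and not a reproduction of the measure in (6): it assumes that the numerator sums w_i · w_j · cor(t_i, t_j) over all term pairs across the two records, and that each norm is the square root of the same sum taken within a single record. The inter-correlation is passed in as a function, since its piecewise definition is not reproduced here; `exact_match` is a hypothetical stand-in that only covers exactly matching terms, and the uniform weights are placeholders for the correlation weights.

```python
import math


def weighted_sum(m1, m2, weights, cor):
    """Sum of w_i * w_j * cor(t_i, t_j) over all term pairs of two records."""
    return sum(weights[ti] * weights[tj] * cor(ti, tj) for ti in m1 for tj in m2)


def correlation_similarity(m1, m2, weights, cor):
    """Assumed normalized similarity: cross-record sum over the two norms."""
    norm1 = math.sqrt(weighted_sum(m1, m1, weights, cor))
    norm2 = math.sqrt(weighted_sum(m2, m2, weights, cor))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return weighted_sum(m1, m2, weights, cor) / (norm1 * norm2)


def exact_match(ti, tj):
    """Stand-in inter-correlation: 1 for identical terms, 0 otherwise."""
    return 1.0 if ti == tj else 0.0


m1 = ["hall", "dowling", "acm"]
m2 = ["hall", "dowling", "csur"]
weights = {term: 1.0 for term in set(m1) | set(m2)}  # placeholder weights

sim = correlation_similarity(m1, m2, weights, exact_match)
print(sim)
```

With `exact_match` and uniform weights the measure reduces to ordinary cosine similarity over term sets; supplying a correlation function that also gives credit to non-identical but correlated terms (such as an abbreviation and its expansion) is what allows multiple-to-multiple matching.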

Experiments and Results
The Cora and Restaurant data sets are employed in the experiments, as in (6). Cora was compiled by McCallum (6,9) and contains 1295 distinct citations of 122 publications. Restaurant is a database compiled by Sheila Tejada (6,10) that contains 864 restaurant names along with addresses, including 112 duplicate records. To measure the efficacy of the proposed term weighting scheme derived from the multiple correlation of a term with one or more terms simultaneously, the distances between a particular record and its potential duplicates are computed. The two records with the highest similarity are considered to characterize the same entity. Precision, recall and f-score are adopted to measure the usefulness of the proposed term weighting scheme over the simple correlation weighting scheme of (6), and are defined below: precision = |R_a ∩ R_f| / |R_f|, recall = |R_a ∩ R_f| / |R_a|, and f-score = 2 · precision · recall / (precision + recall), where R_a represents the set of record pairs that actually correspond to the same entity and R_f represents the set of record pairs found to correspond to the same entity. While precision connotes "how relevant the results are", recall is the measure of "how comprehensive the results are". Consequently, as more pairs with lower similarity are reported as the same entity, recall increases while precision decreases. The f-score is the harmonic mean of precision and recall and is a measure of overall accuracy. In Table 4, the maximum f-score values of the multiple correlation and simple correlation based term weighting schemes are presented. Figure 1 compares the multiple correlation and simple correlation based term weighting schemes on the Cora data set, and Figure 2 compares the two term weighting schemes on the Restaurant data set. It can be noted that the term weighting scheme based on the multiple correlation of a term with one or more terms simultaneously is more effective than the term weighting scheme that employs the simple correlation of only two terms.
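The three evaluation measures can be computed from two sets of record pairs, as in the following minimal Python sketch. It uses the standard set-based definitions of precision, recall and f-score; the function name `precision_recall_f` and the example record identifiers are hypothetical, and pairs are treated as unordered.

```python
def precision_recall_f(actual_pairs, found_pairs):
    """Precision, recall and f-score over unordered record pairs."""
    actual = {frozenset(pair) for pair in actual_pairs}  # R_a
    found = {frozenset(pair) for pair in found_pairs}    # R_f
    hits = len(actual & found)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Illustrative record ids only; not taken from Cora or Restaurant.
actual = [(1, 2), (1, 3), (2, 3), (4, 5)]
found = [(2, 1), (1, 3), (4, 6)]

precision, recall, f_score = precision_recall_f(actual, found)
print(precision, recall, f_score)
```

Sweeping a similarity threshold and recomputing these values at each step gives the kind of precision, recall and f-score curves from which a maximum f-score can be read.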

Conclusions
In this study, a probabilistic multiple correlation based term weighting scheme for measuring the similarity of unstructured textual records is proposed. Term weights are assigned based on how a particular term correlates with one or more terms simultaneously in a record. The proposed weighting scheme is effective in dealing with text records that do not have a well-defined structure, that employ abbreviations, and that are incomplete. Furthermore, the experimental results demonstrate the improved overall accuracy of the proposed scheme.