A comparative analysis of Latent Semantic Analysis and Latent Dirichlet Allocation topic modeling methods using Bible data

Objective: To compare topic modeling techniques, since the no free lunch theorem states that, under a uniform distribution over search problems, all machine learning algorithms perform equally. Hence, we compare Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) to identify the better performer on an English Bible dataset, which has not been studied yet. Methods: This comparative study is divided into three levels. In the first level, Bible data were extracted from the sources and preprocessed to remove words and characters that were not useful for obtaining semantic structures or the patterns needed to build a meaningful corpus. In the second level, the preprocessed data were converted into a bag of words, and the numerical statistic TF-IDF (Term Frequency – Inverse Document Frequency) was used to assess how relevant a word is to a document in the corpus. In the third level, the Latent Semantic Analysis and Latent Dirichlet Allocation methods were applied to the resulting corpus to study the feasibility of the techniques. Findings: Based on our evaluation, we observed that LDA achieves 60 to 75% superior performance compared to LSA on document similarity within the corpus and document similarity with an unseen document. Additionally, LDA showed a better coherence score (0.58018) than LSA (0.50395). Moreover, for word association within the corpus, LDA showed better results. Some words are homonyms whose meaning depends on context; for example, in the Bible, 'bear' carries the senses of both punishment and birth. In our study, the LDA word association results are closer to human word associations than those of LSA. Novelty: LDA was found to be a computationally efficient and interpretable method for the English Bible dataset (New International Version), which had not been studied before.


Introduction
There are many text mining methods for turning unstructured textual data into actionable information. While traditional methods of analyzing texts are limited in processing large amounts of data, some researchers have applied text mining to qualitative research projects. Due to these research advancements, text mining is viewed as a viable qualitative research method in machine learning and natural language processing (1)(2)(3). These applications closely follow the paradigm of a common technique in text mining: topic modeling. Topic models analyze a set of documents based on the statistics of the words in each, to express what the topics might be and what each document's balance of topics is. The term also refers to probabilistic topic models that use statistical algorithms to discover the hidden topics of a collection of documents (4)(5)(6). A significant and crucial step for the accuracy and storage of information is managing and extracting quality information from what is present.
There are various text mining methods to identify the underlying topics in a text. This study compares the results of applying Latent Semantic Analysis (LSA) (7)(8)(9), a natural language processing technique, and Latent Dirichlet Allocation (LDA) (10)(11)(12)(13), a type of probabilistic topic modeling, to the text field. The outputs may help determine and demonstrate the feasibility of each technique, and whether the use of these two models leads to additional insights when applied to the English-language Bible as a dataset. The dataset used in this comparative study is from the New International Version (NIV), available online at http://www.biblegateway.com/passage/?search=Genesis+1&version=NIV.
The dataset includes text fields that describe each incident; the length of a text field varies from a few words to paragraphs with more than a few sentences. These text data were mined to reveal additional knowledge about the incidents in the Bible. Data were collected from the Book of Genesis, the first book of the Bible and of the Old Testament. It is an account of the creation, life on earth, the beginning of sin, the fallen state of the world, the need for a redeemer, and the promise of His coming. All of these centre on the covenants that link God to his chosen people and the people to the Promised Land.

The used methodologies
The details of the two well-known information retrieval methodologies, LSA and LDA, are presented in this section, which demonstrates how these two text mining algorithms use different mechanisms to automatically generate topics (a topic is a grouping of related words) from a text corpus.

Latent Semantic Analysis (LSA)
Latent Semantic Analysis is a method for representing and extracting the contextual meaning of words through statistical computations over a text corpus (6), (14,15). It was formerly known as Latent Semantic Indexing (LSI) (16). Before LSI, information was fetched by exactly matching the words in documents against the query, using lexical matching methods. These methods made information retrieval difficult because of two problems: synonymy (missing documents about "automobile" when querying on "car") and polysemy (retrieving documents about a financial bank when querying on a river bank) (16). To work around these two problems and other similar issues, the documents are expressed in terms of concealed concepts rather than terms. The hidden structure is not a fixed mapping between terms and hidden concepts; it depends on the underlying documents and the correlations between the words they contain.
Moreover, it has recently been established that it is possible to give a statistical interpretation of the traditional Latent Semantic Analysis (LSA) paradigm, which collects hidden concepts from the documents of the corpus using a linear algebra technique known as Singular Value Decomposition (SVD) (17). The SVD represents the n × m term-document matrix A as the product of three matrices, A = USVᵀ, where S = diag(σ_1, σ_2, …, σ_r) is an r × r diagonal matrix of singular values, U = (u_1, …, u_r) is an n × r matrix, V = (v_1, …, v_r) is an m × r matrix, the columns of U and V are orthonormal, and r ≤ min(n, m). As shown in the algorithm in Supplementary Table 1, LSA works by keeping the k largest singular values in the above decomposition, for some appropriate k. Let S_k = diag(σ_1, …, σ_k), U_k = (u_1, …, u_k), and V_k = (v_1, …, v_k). Then A_k = U_k S_k V_kᵀ is a matrix of rank k, which is our approximation of A. The rows of V_k S_k are then taken to correspond to the documents. This new space (the latent semantic space) is used to analyze semantic relatedness among documents (within the corpus and outside of it) and among words. It is also useful for information retrieval and information filtering, and it performs well if the corpus is a collection of meaningfully correlated documents (16), (18).
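The truncated SVD described above can be sketched in a few lines of numpy. The tiny term-document matrix and its row labels below are purely illustrative stand-ins for the Genesis document-term matrix, not values from the study.

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (terms x documents),
# standing in for the corpus document-term matrix.
A = np.array([
    [2, 0, 1, 0],   # "god"
    [1, 1, 0, 0],   # "create"
    [0, 2, 1, 0],   # "earth"
    [0, 0, 1, 2],   # "covenant"
    [0, 0, 0, 1],   # "promise"
], dtype=float)

# Full SVD: A = U S V^T, singular values returned in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values to form the rank-k approximation
# A_k = U_k S_k V_k^T that LSA works in.
k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
A_k = U_k @ S_k @ Vt_k

# Documents live in the k-dimensional latent space as the rows of V_k S_k.
doc_vectors = Vt_k.T @ S_k          # shape: (4 documents, k dimensions)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantic relatedness between documents 0 and 1 in the latent space.
sim_01 = cosine(doc_vectors[0], doc_vectors[1])
print(doc_vectors.shape)
```

Cosine similarities computed in this latent space, rather than on the raw counts, are what allow LSA to relate documents that share concepts but not exact terms.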

Latent Dirichlet Allocation (LDA)
Topic modeling algorithms are statistical methods that automatically analyze the words of unstructured texts to discover the themes that run through them. LDA is a generative probabilistic model for collections of discrete data such as a corpus. It was first introduced by David Blei et al. (10). The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a Dirichlet-distributed distribution over words; in other words, LDA identifies a set of topics by associating a set of words with each topic (19,20). The underlying assumption of LDA is that a text document will consist of multiple themes. It is a three-level hierarchical Bayesian model in which each item of a collection of texts is modeled as a finite mixture over an underlying set of topics, and each topic is, in turn, modeled as an infinite mixture over an underlying set of word probabilities.
According to the algorithm shown in Supplementary Table 2, in words, this means that there are K topics whose word distributions φ_1, …, φ_K are drawn from Dirichlet(β) and shared among all documents. Each document d_j in the corpus D is considered a mixture over these topics, indicated by its topic proportions θ_j. The words of document d_j are then generated by first sampling a topic assignment z_{j,t} from the topic proportions θ_j, and then sampling a word from the corresponding topic distribution φ_{z_{j,t}}. Here z_{j,t} is a variable denoting which topic from 1, …, K was selected for the t-th word of document d_j.
It is essential to identify some critical assumptions of this model. First, we assume that the number of topics K is a fixed quantity known beforehand, that the number of distinct words in the dictionary is fixed and known ahead of time, and that each φ_k is a fixed quantity to be estimated. The words within a document are conditionally independent given the topic proportions θ_j.
In this formulation, the joint distribution of the topic mixture Θ, the set of topic assignments Z, the words of the corpus W, and the topics Φ is

p(Θ, Z, W, Φ | α, β) = ∏_{k=1..K} p(φ_k | β) · ∏_{j=1..|D|} p(θ_j | α) · ∏_{t=1..N_j} p(z_{j,t} | θ_j) p(w_{j,t} | φ_{z_{j,t}})
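The generative process described above can be sketched directly with numpy's Dirichlet and categorical sampling. The vocabulary, hyperparameters, and corpus sizes below are illustrative toy values, not the settings used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and dimensions; all values here are illustrative.
vocab = ["god", "create", "earth", "covenant", "promise", "sin"]
K, D, N = 2, 3, 20          # topics, documents, words per document
alpha, beta = 0.5, 0.1      # Dirichlet hyperparameters

# 1) Draw each topic's word distribution phi_k ~ Dirichlet(beta).
phi = rng.dirichlet([beta] * len(vocab), size=K)     # shape (K, |V|)

documents = []
for j in range(D):
    # 2) Draw the document's topic proportions theta_j ~ Dirichlet(alpha).
    theta = rng.dirichlet([alpha] * K)
    words = []
    for t in range(N):
        # 3) Sample a topic assignment z_{j,t} from theta_j ...
        z = rng.choice(K, p=theta)
        # ... then sample the word w_{j,t} from phi_{z_{j,t}}.
        words.append(vocab[rng.choice(len(vocab), p=phi[z])])
    documents.append(words)

print(len(documents), len(documents[0]))
```

Inference in LDA runs this process in reverse: given only the observed words, it estimates the hidden φ_k and θ_j that most plausibly generated them.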

Experimental step of the analysis
The experimental model diagram in Figure 1 represents the steps of the analysis in this research study: the input (raw Bible data) is preprocessed to remove noise, stop words are removed, and each word is reduced to its root by a lemmatization process; the text is then converted to vectors, and the two topic modeling methods (LSA and LDA) are compared on document similarity within the corpus and with an unseen document, on word associations, and on the coherence score as a measure of topic comparison and of the goodness of the topic model.

Data selection and preprocessing
The Book of Genesis was selected from the Bible and divided into subtopics; each subtopic is considered a document in this corpus. Before the document-term matrix was created, some preprocessing was done on the data. First, a program was created to read the documents from the relevant files. A second program removes punctuation and other symbols that are not useful for text analysis, and the documents were then split into words. The program then examines each word and retains only those not in the "stop words" list; the words contained in the stop-word list were removed, as they are deemed to have no significance in describing the qualities of the data under study. The remaining words were converted into their base form using a lemmatization process. This reduces the scope of the data for document matching and yields more consistent results from both the LDA and LSA methods. The corpus is then ready for use. Both LDA and LSA take a document-term matrix as input: each row represents a document from the dataset, each column represents a word, and each cell holds the number of times the word designated by the column appears in the document designated by the row.
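The pipeline above can be sketched with the standard library alone. The stop-word list and the lemma table here are tiny illustrative stand-ins for NLTK's stop-word corpus and lemmatizer, and the two sample verses merely demonstrate the mechanics.

```python
import re
from collections import Counter

# Illustrative stand-ins for NLTK's stop-word list and lemmatizer.
STOP_WORDS = {"the", "and", "of", "a", "in", "was", "he", "it", "to", "over"}
LEMMAS = {"created": "create", "creating": "create", "waters": "water",
          "darkness": "dark", "beginning": "begin"}

def preprocess(text):
    # Strip punctuation/symbols, lowercase, and split into words.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words, then map each remaining word to its base form.
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

docs = [
    "In the beginning God created the heavens and the earth.",
    "Darkness was over the surface of the deep, and the Spirit "
    "of God was hovering over the waters.",
]
tokenized = [preprocess(d) for d in docs]

# Document-term matrix: rows are documents, columns are vocabulary words,
# entries are raw counts -- the input that both LSA and LDA expect.
vocab = sorted({w for doc in tokenized for w in doc})
dtm = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

print(tokenized[0])   # ['begin', 'god', 'create', 'heavens', 'earth']
```

In the actual study this role is played by GENSIM's `Dictionary` and `doc2bow`, which produce the same bag-of-words representation in sparse form.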

Analysis and comparison of the topic modeling methods
We used the GENSIM (topic modeling and preprocessing), NLTK (natural language processing), SCIPY (document comparison), and MATPLOTLIB (visualization) libraries in Python, searching through combinations of the parameters. LSA gives a direct output of document similarities in the form of a cosine similarity matrix, and a coherence score based on the material related to the Book of Genesis in this study (21)(22)(23). The text relevance values range from -1 to 1, where 1 is considered an exact match and -1 represents two documents that are complete opposites. This output is enough to create a matrix whose columns each represent a document and whose rows contain the documents in order of similarity to the document associated with that column.
Unlike LSA, LDA does not directly output document similarities. Instead, LDA outputs a matrix whose rows represent all the documents in the dataset and whose columns represent the topics. Each value represents a particular topic's weight in a document. The user specifies the total number of topics into which the words are sorted, so the number of columns in the matrix equals the user-defined number of topics. LDA was run with different numbers of topics until a good topic range was found for the dataset.
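Because LDA outputs topic weights rather than similarities, one extra step turns its document-topic matrix into a document-similarity matrix. The sketch below uses a hypothetical 3-document, 3-topic weight matrix with made-up values to show that step.

```python
import numpy as np

# Hypothetical LDA output: each row is one document's weight over
# 3 topics (rows sum to 1). Values are illustrative only.
doc_topics = np.array([
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.05, 0.15, 0.80],
])

# Convert the document-topic matrix into a document-similarity matrix
# by taking the cosine similarity between every pair of rows.
norms = np.linalg.norm(doc_topics, axis=1, keepdims=True)
unit = doc_topics / norms
similarity = unit @ unit.T          # shape (3, 3); entries in [0, 1]

# Rank the other documents by similarity to document 0
# (drop document 0 itself, which is always the top match).
order = np.argsort(-similarity[0])[1:]
print(order.tolist())   # [1, 2]
```

The resulting matrix has the same shape and interpretation as LSA's cosine similarity output, which is what makes the two methods directly comparable in the next step.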
The final step is to compare the document similarity matrices output by LDA and LSA. If only minor differences are found between them, it can be inferred that LSA and LDA are more or less equal in their ability to sort the data. However, if the two results differ significantly, the more effective algorithm is determined by comparing each document with the other documents: in each column of the matrix, the number of documents sharing the same core content is counted, and those core features of the data must be determined individually.

Performance evaluation
We compared the performance of the LSA and LDA models using two baselines, cosine similarity and coherence score, as the primary evaluation metrics; the following subsections illustrate and summarize the methods mentioned above. For document similarity within the corpus, all documents were classified into four similarity groups: 0% to 25%, 26% to 50%, 51% to 75%, and 76% to 100%. Documents were chosen from these groups together with their most similar documents in descending order of similarity; the same documents were taken from the other method's results, and we analyzed why differences appear between the results of the two methods.
As per the results obtained from the two methods, Figure 2 shows that both algorithms perform significantly better, and with almost the same results, in the 76 to 100% similarity group compared with the remaining three groups (0 to 75%). This is an essential finding for understanding the similarities between documents, and it suggests that these methods predict considerably better in that range.
Further, the similarity results were studied downstream to find which method gives the more relevant results. Table 1 shows the LSA results for the top ten documents most similar to the reference documents, and Table 2 shows the top ten most similar documents from LDA for the reference documents under study. In each table, the first column of the first row holds the reference document, and its corresponding top five topics occupy the next columns; from the second row onwards, the tables are filled with the top ten most similar documents and their top five topics, in descending order of similarity. In Table 3, the reference document is likewise placed in the first column of the first row with its top five topics in the respective columns, and from the second row onwards the table lists its top ten most similar documents and their top five topics in descending order. Comparatively, the LDA results in Table 2 identified some similarity; for example, document 2 was the most similar document to reference document 1, but the LSA output gives 51 to 75% similarity between the documents, whereas LDA gives only 0 to 25% similarity.
This result highlights that little correlation is found between the topics of the reference document and its top ten most similar documents in LSA (Table 1), whereas in LDA the correlation among the topics of the reference and the other documents is strongest, even when they have only 0-25% similarity with reference document 1. Owing to these results, a more in-depth downstream analysis was performed at the document level to check how many words the reference and most similar documents have in common, as the LSA and LDA results showed a clear difference. The complete difference was reported for document 1 with both methods under study, so the following analysis was performed with document 1 to understand the difference. Figure 3 represents document 1 and its topmost similar document, document 2. The words common to both are shown in a thick colour, whereas the non-common words are in a light colour. Document 1 describes the Garden of Eden, the creation of a woman from Adam, and God's command, while document 2 represents the entry of sin into humanity. However, the shared words do not convey the actual content of the documents. The LSA results showed 51 to 75% similarity even though there is not that much similarity between the documents, whereas the LDA result indicates 0 to 25% similarity between document 1 and document 2. From this result it is understandable that the LDA results are more appropriate than those of LSA.

Comparison based on similarity with the unknown document
Supplementary Table 3 lists the top ten documents most similar to the unseen document. The first row of the table shows that both LSA and LDA rank document 0 and document 1 as most similar to the unseen document, but from the third row onwards some discrepancy can be observed. In particular, document 2 is listed as the third most similar in the LDA results, but in LSA it sits in the eighth position of the descending similarity order. To understand the difference, observe Supplementary Tables 4 and 5, which contain the unseen document and its corresponding top five topic proportions in the first row, with the top ten documents most similar to the unseen document and their corresponding top five topics from the second row onwards.
From the results shown here, it is easily understood that the topics of document 2 are more correlated with the topics of the unseen document in LDA than in LSA. Hence the LDA results are more appropriate than LSA for finding documents similar to an unseen document.

Coherence score
Coherence is a state in which a set of topics or concepts supports each other, and it is computed from the relative distance between the terms in a topic. Topic coherence is a measure used to assess the goodness of topic models; such measurements help to distinguish semantically comprehensible topics. Here, the c_v coherence measure is used to calculate the score. The c_v measure is based on a Boolean sliding window, one-set segmentation of the top words, normalized pointwise mutual information (NPMI) for the agreement between individual words, and cosine similarity. Figure 4 shows the coherence scores of LSA and LDA: the coherence score given by LSA is 0.50395, whereas LDA scored 0.58018. So, the LDA results are better than the LSA results for the Bible text.
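The NPMI component that c_v is built on can be sketched compactly. This is only the building block, not the full c_v pipeline (which additionally segments top words and compares NPMI vectors by cosine similarity, as GENSIM's `CoherenceModel` does); the token stream below is a made-up example.

```python
import math

def sliding_windows(tokens, size):
    """Boolean sliding windows: the set of words seen in each window."""
    return [set(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def npmi(w1, w2, windows):
    """Normalized pointwise mutual information of two words over a set
    of Boolean sliding windows (the word-agreement measure inside c_v)."""
    n = len(windows)
    p1 = sum(w1 in win for win in windows) / n
    p2 = sum(w2 in win for win in windows) / n
    p12 = sum(w1 in win and w2 in win for win in windows) / n
    if p12 == 0:
        return -1.0                     # the words never co-occur
    pmi = math.log(p12 / (p1 * p2))
    return pmi / -math.log(p12)         # normalized into [-1, 1]

# Toy token stream; words that occur together score higher.
tokens = ("god create earth god create light "
          "covenant promise land covenant promise people").split()
wins = sliding_windows(tokens, 3)

print(npmi("god", "create", wins) > npmi("god", "promise", wins))  # True
```

A coherent topic is one whose top words have consistently high pairwise NPMI over such windows, which is why topics of frequently co-occurring Genesis words score well.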

Word association
The resemblance between two words can be estimated by whether they share a common topic. Here, we measured the word association between two words by calculating the cosine similarity between their topic proportions.
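This measure can be sketched as follows. The word-topic proportions below are hypothetical values chosen to mirror the 'bear' example from the abstract (a word split between a birth sense and a punishment sense); they are not estimates from the Genesis corpus.

```python
import numpy as np

# Hypothetical word-topic proportions: each row gives how strongly a
# word is associated with each of 3 topics (values are illustrative).
word_topics = {
    "bear":   np.array([0.45, 0.45, 0.10]),   # split across two senses
    "birth":  np.array([0.80, 0.10, 0.10]),
    "punish": np.array([0.10, 0.80, 0.10]),
    "land":   np.array([0.05, 0.05, 0.90]),
}

def association(w1, w2):
    """Word association as cosine similarity of topic proportions."""
    a, b = word_topics[w1], word_topics[w2]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "bear" associates with both "birth" and "punish", matching its two
# contextual senses, but only weakly with the unrelated "land".
print(association("bear", "birth") > association("bear", "land"))   # True
```

Because a homonym's topic proportions spread over the topics of its different senses, this cosine measure can recover both associations, which is the behaviour the study observed with LDA.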
From Table 3, it is established that the semantic relationships between words given by the LDA results are better than those of the LSA results.

Discussion
In this study, we discussed the results and emerging trends and how they can be understood from the perspective of earlier studies, including our comparisons. Examining the difference between the two methods using the Bible text as corpora, the results give some indication of how evenly words are distributed across the documents (24). The analysis shows that both techniques find the most significant percentage of instances, and assessments of the context in which the words appear, that contain words related to God's creation and his mandate for humanity. Generally, in human word associations, high-frequency words are more likely to be used as response words than low-frequency words. For example, Griffiths and Steyvers (25) compared the topic model with LSA in predicting word associations, finding that balancing the influence of word frequency against the semantic relatedness found by the topic model can yield better performance than LSA on this task. In our study, the main questions related not only to the extraction of word meaning in natural language processing but also to the extraction of its meaningful associations. For example, the word 'bear' in the Book of Genesis in our dataset is used in contexts of both punishment and birth (just as, in stock markets, a 'bear' indicates that the market is declining). Comparatively, the word 'bear' is associated and correlated more appropriately by LDA than by LSA.
However, the study conducted by Siti Qomariyah et al. in 2019 (26) using Twitter text data corroborates our results, as they concluded that LDA, which considers the relationships between documents in the corpus, gives the best topic coherence compared with LSA. Also, in comparative studies applying different text mining methods to short text data, LDA extracted more meaningful topics and obtained good results with topic coherence as the evaluation metric for characterizing the content of a document collection (6,27).
The overall results show clearly how the Book of Genesis is characterized by the two text mining methods, which complement each other. LSA and LDA agree on many of the texts and topics, yet each generated some topics that the other method did not identify. This indicates that using more than one text mining technique, each with a different mechanism for identifying topics, can result in more meaningful analysis and better identification of the semantic structure of the text. Furthermore, we recommend using LDA due to its superior performance; employing LDA also provides the system with a stronger explanation in arriving at its conclusions, since LDA is a probabilistic model. These insights can help in understanding the natural patterns in the data when necessary.

Conclusion
Based on the results, LDA showed better topic coherence (0.58018) than the coherence score given by LSA (0.50395). Therefore, this study shows that LDA achieves superior performance compared with LSA. The performance achieved by LDA on document similarity within the corpus, document similarity with an unseen document, and word associations also delivered the most meaningful topics and, implicitly, contextual word meaning from the Bible text corpora. Thus, the work presented in this comparative study can serve as a computationally efficient and vital reference for researchers on topic modeling.