Bag-of-Phrases (BoPh) and sentiment analysis of Arabic text in Twitter

Background/Objectives: Sentiment analysis plays a major role in many text mining problems. Although Arabic text mining is important, especially for sentiment analysis and for the many issues it bears on in Arab countries, research on it remains scarce. Arabic has many dialects that people use to express their feelings on social media. The objective of this study is to perform an experiment that captures the subjective opinion expressed in a text. Subjective analysis is one way to improve the accuracy of sentiment results for texts in dialects, such as the Saudi dialect, that hide various meanings behind the words. Methods/Statistical analysis: In this study, we manually annotated more than 8,000 tweets to build training and testing data sets with positive or negative words and phrases. Because the bag-of-words method is not sufficient in many cases, we proposed a "Bag of Phrases" methodology to analyze the sentiments in the texts, which helped to improve the performance of sentiment analysis. We then applied a Naive Bayes algorithm to test our method. Findings: The results show that the accuracy (true positives plus true negatives) is about 84% when compared with the manual annotation. The accuracy is calculated after taking into consideration the margin of error due to the manual annotation step and the subjective interpretation of the texts by the annotators. Novelty/Applications: The novelty of the study lies in providing a more accurate training data set than other works on the Saudi dialect of Arabic, and in proposing the BoPh concept.


Introduction
Recently, sentiment analysis has played a major role in helping organizations achieve their goals, for example by using customer reviews to develop or improve specific services or goods. Several critical points must be considered to do this well, such as the source of the collected data, the volume and comprehensiveness of the data, and the ability to process the data and draw results from it. In (1), the authors mentioned the enormous amount of data that the web and social media would produce during the following couple of years; the available data would be about 40 trillion gigabytes in 2020. Most studies of text data focus on classifying the text into two groups (Yes or No), three groups (such as Positive, Negative, and Neutral), or more (2). Although the source of the text data varies, most of it comes from social media applications such as Twitter, Facebook, and YouTube (3).
https://www.indjst.org/
Depending on various factors, such as the form of the data and the language of the text, the sentiment analysis process may or may not obtain good results. Processing Arabic text data might need more effort than processing English text data because of the language rules, the limited availability of data in Arabic, and the paucity of studies on Arabic sentiment analysis (4). Different techniques have been applied to the same Arabic text data sets, including the lexicon approach, the Part of Speech (POS) approach, word2vec, and others (5) (6). However, the results differed from one another, which indicates the complexity of processing Arabic text data. The differences have several causes, such as the special letters of Arabic and the language's dialects (3) (7).
The complexity of working with the Arabic language arises from the number of dialects (8), which fall into more than five main groups across more than 15 countries. In addition, Arabic speakers use common phrases that can express either a direct or an opposite meaning, and the meaning of some words varies according to the context. This paper presents related work on analyzing opinions and sentiments in Arabic text data in Section II. In Section III, we discuss the manual annotation approach that we applied, which annotates the text data at the aspect level (9) (10). Section IV presents and analyzes the dataset and the results of the manual annotation, along with building word-based and phrase-based lexicons for future work. Finally, Section V presents the conclusion and future work.
Text sentiment analysis focuses on exploring the meaning behind the text. Therefore, the accuracy of the results depends on the language and its meanings. In (11), the authors analyzed the opinions of consumers of an electricity company about its services. They applied a combination of two lexicons to Twitter text data, going a step beyond the traditional use of a single lexicon in order to increase the accuracy of the results. Since they used unsupervised learning algorithms to determine the results, there were no training or testing data. The results showed that negative and neutral tweets about the services of the new energy companies outnumbered positive opinions about the old ones.
In (5), the authors applied two different approaches to sentiment analysis, sentence-level and document-level. They used the general structure of Arabic sentences in the first approach to mining the data. Feature selection was done manually because of the difficulties of Arabic grammar. The results indicate an improvement in classifying the data when Arabic grammar and sentence structure are used. The authors in (12) combined a lexicon-based approach with a corpus-based approach to classify the sentiments in Arabic text more accurately. They selected two different datasets and three different lexicons, which represent emojis, positive words, and negative words. They applied the classification process using different machine learning algorithms, and the results show a slight improvement in some cases and failure in a few others.
The manual annotation process plays a major role in building a strong base for sentiment analysis. In (13), the authors applied it and discussed its importance and results. They built corpora annotated with five categories: positive, negative, neutral, dual, and spam. They used Arabic text data that did not follow the rules of Arabic grammar but was written in different dialects, and they annotated feedback from Facebook users. Two corpora were built in two different domains: news and arts. The results indicated that using these corpora in a lexicon-based classifier gives satisfactory performance, in a range between 73% and 96%.
We explored a very helpful review paper on Arabic Sentiment Analysis (ASA) (14) and found a couple of studies that vary in their goals and results. That variance reflects the difficulty of sentiment analysis on Arabic text, even though the reviewed papers applied different algorithms to different data sets. The most accurate algorithms were Naïve Bayes and Support Vector Machine (SVM), which led us to apply one of them in our study.

Twitter data
We targeted Twitter text data because Twitter is an open public environment that includes most people's opinions about various current issues. The complexity of analyzing tweets arises from dialect differences and from the new phrases used among young people. Also, since Arabic is spoken by people from different countries and cultures (14), the same words might express two different opinions about a person or an event depending on the dialect used in the tweet. Knowing the dialect might therefore help determine the sentiment.
We accessed the Twitter API and downloaded more than 20,000 tweets that were collected by searching for seven issues that were common at the time: two controversial people (أحمد العيسى and تركي آل الشيخ), two different sports clubs (الهلال and النصر), two different e-commerce websites (جولي شيك and نون), and one common employment exam (كفايات).
Although these keywords were the most active at that time, it should be noted that the collected data were restricted to the area of Saudi Arabia and to these specific topics. Therefore, we studied neither the differences between dialects nor issues or topics in other Arab countries.

Preprocessing and preparing data
After removing the retweeted tweets, we ended up with 8,247 valid tweets to be analyzed. Cleaning the data, that is, preparing it for analysis, is an important step in the study. We therefore applied a few formatting rules to prepare the data, such as removing non-Arabic letters, numbers, control characters, and graphics. However, we kept punctuation marks such as question and exclamation marks, because they might reflect sentiments.
Other preparation rules could be applied too, such as normalization, which unifies letters that might appear in different forms: for example, the letter Aleph with Hamzah (أ) or without (ا), and the letter Ya with two dots (ي) or without (ى). However, some of these changes could affect the real meaning, which is one of the complexities of working with Arabic text. We completed the normalization manually while annotating the tweets.
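The cleaning rules described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the code used in the study: the Unicode range taken to mean "Arabic letters", the `RT @` prefix used to detect retweets, and the names `is_retweet`, `clean_tweet`, and `prepare` are all illustrative choices.

```python
import re

# Assumption: "Arabic letters" are U+0621-U+064A. Whitespace, "?", "!" and the
# Arabic question mark (U+061F) are kept; everything else is replaced by a space.
NOT_KEPT = re.compile(r"[^\u0621-\u064A\u061F?!\s]")


def is_retweet(text):
    # Assumption: retweets are marked by a leading "RT @" prefix.
    return text.lstrip().startswith("RT @")


def clean_tweet(text):
    # Drop Latin letters, digits, control characters, emoji and other symbols,
    # then collapse runs of whitespace into single spaces.
    stripped = NOT_KEPT.sub(" ", text)
    return " ".join(stripped.split())


def prepare(tweets):
    # Remove retweets first, then clean what remains.
    return [clean_tweet(t) for t in tweets if not is_retweet(t)]
```

Keeping the question and exclamation marks in the allowed set reflects the observation that these marks might carry sentiment.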
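Although the normalization in this study was done by hand, the letter unification described here can be illustrated with a small Python sketch. The mapping table and the direction of unification (towards bare Aleph and dotted Ya) are assumptions made for illustration; the study does not specify them.

```python
# Assumption: variants are unified towards bare Aleph and dotted Ya.
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # Aleph with Hamzah above -> bare Aleph
    "\u0625": "\u0627",  # Aleph with Hamzah below -> bare Aleph
    "\u0622": "\u0627",  # Aleph with Madda -> bare Aleph
    "\u0649": "\u064A",  # Ya without dots (Aleph Maqsura) -> Ya with dots
})


def normalize_letters(text):
    # Replace each variant letter with its unified form.
    return text.translate(LETTER_MAP)
```

A blanket mapping like this is exactly where the meaning can change: for instance, the preposition على ("on") becomes indistinguishable from the name علي ("Ali"), which is one reason a manual pass may be preferable.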

Manual annotation approach
The manual annotation approach relies mainly on understanding the sentences and the meaning behind them. In sentiment analysis, this is called aspect-level sentiment analysis (15). The main goal of the work is to produce a reference lexicon that contains not only opinion words but also opinion phrases. We therefore divided the work into five groups; each group's work was completed by one native Arabic speaker.

Annotation technique
After cleaning the tweets, we annotated them using five categories:
1. Clear Positive: must include one or more positive words. This category can be identified by a lexicon-based approach.
2. Clear Negative: must include one or more negative words. Like the category above, it can be identified by a lexicon-based approach.
3. Positive: might include both positive and negative words at the same time, or no sentiment words at all. The manual annotation approach is needed to categorize it.
4. Negative: might include both positive and negative words at the same time, or no sentiment words at all. The manual annotation approach is needed to categorize it.
5. Neutral: any tweet that expresses no sentiment about any issue, for example questions or commercial ads.
During the annotation process, we came across some tweets that were not clear. We tried to include them after having them re-read more than once by different annotators. Also, some tweets were not related to the topic, such as ads and personal or general tweets. Nevertheless, some of them were annotated because they still carry sentiments.
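The split between lexicon-identifiable and manually annotated categories can be sketched as a simple routing rule. This is an illustrative reading of the category definitions, assuming that a "clear" tweet contains sentiment words of one polarity only; the function name `route` and the two example word sets are hypothetical.

```python
# Example entries only: the two words are taken from the paper's later
# discussion ("better" as positive, "injustice" as negative).
POSITIVE_WORDS = {"أفضل"}
NEGATIVE_WORDS = {"ظلم"}


def route(tokens, positive_words, negative_words):
    """Return a lexicon-based category, or "manual" when a human must decide."""
    has_positive = any(t in positive_words for t in tokens)
    has_negative = any(t in negative_words for t in tokens)
    if has_positive and not has_negative:
        return "clear positive"
    if has_negative and not has_positive:
        return "clear negative"
    # Mixed polarity or no sentiment words: positive, negative or neutral
    # can only be assigned by an annotator.
    return "manual"
```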

Building lexicons
There are many lexicons that have been built to provide a high level environment to process Arabic sentiments (16) . However, there are drawbacks in some of these lexicons: some lexicons have been built under specific circumstances or events as the data were collected using common keywords, and some lexicons have been built only for specific dialects or for standard Arabic text.

Positive and negative words lexicon
For the clear positive and clear negative categories, the annotators wrote the sentiment word or words in columns that were created to store positive and negative words. Thus, a list of positive words and a list of negative words were compiled.
We built a positive bag-of-words and negative bag-of-words (BOW) from all clear positive and clear negative annotated tweets.
• First, between one and three words were extracted from each tweet.
• Then, all the words from the seven files were compiled into one list.
• Finally, the duplicated words were removed from the final list.
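The three steps above amount to a capped, order-preserving merge with de-duplication. A minimal Python sketch follows; the input layout (one list of per-tweet word lists for each topic file) and the name `build_word_list` are assumptions for illustration.

```python
def build_word_list(files, max_words=3):
    """Merge annotated sentiment words from several files into one list.

    files: iterable of files, each a list of per-tweet word lists.
    At most max_words words are taken from each tweet, and duplicates
    are dropped while keeping first-seen order.
    """
    seen = set()
    merged = []
    for tweets in files:
        for words in tweets:
            for word in words[:max_words]:
                if word not in seen:
                    seen.add(word)
                    merged.append(word)
    return merged
```

The same function would be run once over the positive-word columns and once over the negative-word columns to obtain the two bags of words.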
It should be noted that the annotators normalized the text manually during the annotation process, to ensure that the annotated words would not be affected by errors involving the Hamza (ء), dots (.), Madd (~), or duplicated letters. For example, in the word (يجنن), which means "very nice", some people repeat the second-to-last letter several times, as in (يجننننن), to show emphasis or enthusiasm. This kind of normalization might affect the quality of the results, because people write sentiment words with these errors or repeat letters in a word to emphasize the meaning.
Also, some words may have different meanings depending on their usage or context. It is common among some Arabic speakers to write a word while meaning its opposite, which can certainly affect the sentiment of the text.
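For comparison with the manual normalization described here, letter elongation can also be reduced automatically. The sketch below is a common heuristic, not the procedure used in the study: it removes the Tatweel stretching character and collapses any run of three or more identical characters to two, which restores the example word but can be wrong for words whose base form has a single letter at that position.

```python
import re

TATWEEL = "\u0640"  # Arabic stretching character
_RUN = re.compile(r"(.)\1{2,}")  # the same character three or more times in a row


def collapse_elongation(word, keep=2):
    # Assumption: emphasis runs are reduced to `keep` copies of the letter.
    word = word.replace(TATWEEL, "")
    return _RUN.sub(lambda m: m.group(1) * keep, word)
```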

Positive and Negative Phrases Lexicon
In the project, for the positive and negative categories only, the annotators wrote down the phrases that helped to categorize the tweet as either positive or negative. A phrase could consist of one or more words that carry no sentiment on their own but express a sentiment when they come together in a sentence. Phrases could also contain common names or adjectives that have either a positive or a negative sentiment in the culture, such as (حسبي الله) together with (كفايات); a tweet that contains these two phrases most likely expresses a negative opinion.
For the neutral category, there were no sentiment words and no phrases. However, in some cases the annotators added notes indicating why they had annotated a tweet as neutral, for example because it contained a question mark.
After completing the manual annotation process, we had a general idea of the most popular positive and negative words that people used in their tweets to express their opinions about the tested issues. Some of the results are shown in Tables 1 and 2, which list the 10 most popular positive words and the 10 most popular negative words related to the topic of "Kifayat", which is related to jobs. In Figures 1 and 2, the word clouds show the positive and negative words that are repeated more than 6 times in "JollyChic" and "Alhilal", respectively. We analyzed these first results, that is, the statistics obtained from manual annotation, so that they can be compared with results obtained from machine learning algorithms such as SVM, Naive Bayes, or any other classification algorithm.
The results show that most of the topics share one or more common words that are either positive or negative. For example, the word "ظلم", which means "injustice", appears in the lists of most topics as a common negative word. However, determining the common words shared between different topics is not feasible in general, because the common words depend on the topic area, such as sports, politics, education, or markets.
The following charts show general statistics of the first results. In Figures 3, 4 and 5 we can see the difference between the common words that were repeated more than 6 times in these different topics; the difference is due to the different topic areas, as mentioned before. However, it can be clearly noticed that some words are commonly used across these topics, even though they may not be the most common ones in each topic. Figure 6 shows the common positive words related to "Alnasser". We can explore the common words shared by two topics in the same area. For example, "Alnasser" and "Alhilal" are both football teams in Saudi Arabia. From the above charts, we can notice that the word "أفضل", which means "better", is shared by the two topics. However, this word appears in the positive list of most topics, because it is one of the top positive words in most comparable topics.
Dealing with phrases to analyze sentiments in Arabic text is another goal of the project, so we manually categorized some phrases that might help to analyze the sentiments. Figure 7 shows an example: some phrases related to JollyChic that might help to analyze the sentiments in Arabic texts. For the purpose of our research, we tried to create a bag of phrases that could help in some areas to analyze the sentiments. Two additional columns were therefore created in our work to hold phrases from the tweets. These phrases could be used as a secondary resource in future to analyze the sentiments in Arabic texts. The expectations from this step are not very high, but it might help in certain ways, given that opinion mining is not easy at the aspect level.
In addition, the phrases that we collected might not include any sentiment word, which is why considering phrases is beneficial. This is the difference between traditional machine learning approaches such as the Bag-of-Words (BOW) approach and ours, the Bag-of-Phrases (BoPh) approach. For example, in some topics we found phrases with a negative meaning such as "أغسل يدك", which is commonly used in Saudi culture as a negative expression. It does not contain any sentiment word, and people who do not know the local phrases might understand it as having a positive meaning.
Also, in some cases we need to analyze the sentiments for a specific topic or issue at different times, such as the beginning, middle, and end of the year. In such cases, the proposed approach (BoPh) would help and improve the performance, as we noticed in our study. Tables 3 and 4 show some positive and negative phrases, respectively, that were collected manually from the tweets.
Table 3. Top 10 most positive phrases related to Kifayat hashtag
Table 4. Top 10 most negative phrases related to Kifayat hashtag
As the above tables show, there is some duplication in the phrases, for one of the following reasons:
• Within the same list (either positive or negative), phrases differ by at least one letter, such as "حسبي" (singular) and "حسبنا" (plural).
• Across the two lists, the words in a phrase are universal, such as "الحمدلله", which appears in both lists and whose meaning depends on the other words. For example, "الحمدلله على كل حال" commonly has a negative connotation (it appeared 29 times in the negative list) but sometimes has a positive connotation (it appeared 4 times in the positive list).
• Some short phrases might not carry any meaning on their own and fall in both lists, combining with other words to form a sentiment phrase; for example, "الحمدلله" becomes positive in "الحمدلله نجحت" and "الحمدلله عديت".
Figures 8 and 9 show the most positive and negative phrases on Kifayat. We can now see the final results of sentiment analysis using sentiment phrases in Arabic text related to Kifayat. In addition, we can notice the effect of the culture in Saudi Arabia, the type or nature of the topic, and the dialect used for the expression.
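The "most popular words" and "repeated more than 6 times" statistics mentioned here are plain frequency counts over the annotated words. A minimal sketch, with hypothetical function names and the threshold and list size exposed as parameters:

```python
from collections import Counter


def frequent_words(annotated_words, min_count=7):
    # "More than 6 times" is read as a count of at least 7.
    counts = Counter(annotated_words)
    return [(w, c) for w, c in counts.most_common() if c >= min_count]


def top_words(annotated_words, n=10):
    # The n most frequent words, as in a "top 10" table.
    return Counter(annotated_words).most_common(n)
```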
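The contrast between the two lookups can be made concrete with a short sketch. The lexicon entries are examples taken from the text, the sample tweet is invented for illustration, and the function names are hypothetical; the study's actual lexicons are much larger.

```python
WORD_LEXICON = {"ظلم": "negative", "أفضل": "positive"}
PHRASE = "أغسل يدك"  # negative expression with no sentiment word in it
PHRASE_LEXICON = {PHRASE: "negative"}
SAMPLE_TWEET = PHRASE + " من الموضوع"  # hypothetical tweet text


def words_found(text, lexicon):
    # Bag-of-Words lookup: single tokens only.
    return [w for w in text.split() if w in lexicon]


def phrases_found(text, lexicon):
    # Bag-of-Phrases lookup: whole phrases, matched on token boundaries.
    padded = " " + " ".join(text.split()) + " "
    return [p for p in lexicon if " " + p + " " in padded]
```

On the sample tweet the word lookup finds nothing, while the phrase lookup returns the negative phrase, which is the case the BoPh approach is meant to cover.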

Machine Learning Algorithms
In sentiment analysis there are different ways to find a good solution for a specific issue. However, the nature of the issue or the situation plays a major role in user sentiments; culture, language, and the domain of the issue are examples of influencing factors (17). Therefore, a sentiment analysis process that works well in some situations might not work in others. Furthermore, we might need to repeat the process at different points in time to get better results. After building the corpus, a machine learning algorithm was applied to test our proposed approach, i.e., Bag-of-Phrases (BoPh). After eliminating the "Neutral" tweets from the list, 1565 tweets remained that are either "Positive" or "Negative". We applied the concept of a Document-Term Matrix (DTM) (18): the corpus is split into terms, which form the header row, and the document (tweet) numbers form the first column; each cell then holds the number of times the term occurs in the document.
Tables 5, 6 and 7 show the DTM for the "Kifayat" topic. We use it to establish the corpus to which an appropriate machine learning algorithm, such as Naïve Bayes, can be applied. We created a corpus from the tweets and then applied a Naive Bayes algorithm to examine our test data. The data comprised 1566 documents (tweets) and 1766 terms, split into training data (1000 × 1766) and testing data (566 × 1766). The classifier is built from the tweets and the DTM.
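A Document-Term Matrix of the kind described can be built in a few lines of Python. This is a minimal sketch: the function name and the toy documents are illustrative, and the terms could equally be single words or whole phrases from the BoPh lexicon.

```python
from collections import Counter


def document_term_matrix(docs):
    """docs: list of token lists. Returns (terms, matrix).

    matrix[i][j] is the number of times terms[j] occurs in docs[i].
    """
    terms = sorted({t for doc in docs for t in doc})
    column = {t: j for j, t in enumerate(terms)}
    matrix = []
    for doc in docs:
        row = [0] * len(terms)
        for term, count in Counter(doc).items():
            row[column[term]] = count
        matrix.append(row)
    return terms, matrix
```

For a corpus of 1566 tweets and 1766 distinct terms, `matrix` would have 1566 rows and 1766 columns.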
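The text does not state which Naive Bayes variant or smoothing was used. As an illustration only, the sketch below implements a multinomial Naive Bayes classifier with add-one (Laplace) smoothing over tokenized documents; the class name, the smoothing parameter `alpha`, and the English toy data are assumptions, not the study's setup.

```python
import math
from collections import Counter


class MultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # assumed additive smoothing constant

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {term for doc in docs for term in doc}
        label_counts = Counter(labels)
        self.log_prior = {
            c: math.log(label_counts[c] / len(labels)) for c in self.classes
        }
        term_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            term_counts[label].update(doc)
        self.log_likelihood = {}
        for c in self.classes:
            denom = sum(term_counts[c].values()) + self.alpha * len(self.vocab)
            self.log_likelihood[c] = {
                t: math.log((term_counts[c][t] + self.alpha) / denom)
                for t in self.vocab
            }
        return self

    def predict(self, doc):
        # Terms never seen in training are ignored.
        def score(c):
            return self.log_prior[c] + sum(
                self.log_likelihood[c][t] for t in doc if t in self.vocab
            )
        return max(self.classes, key=score)


# Toy data for illustration only.
train_docs = [["good", "great"], ["bad", "awful"], ["good"], ["bad"]]
train_labels = ["positive", "negative", "positive", "negative"]
model = MultinomialNB().fit(train_docs, train_labels)
```

With the split described above, `fit` would receive the 1000 training tweets and `predict` would be called on each of the 566 test tweets.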

Results and Discussion
Although the first part of the research, the manual annotation, was completed by native Arabic speakers, there was some uncertainty in part of the results, which depended on human interpretation of the topics. Some annotators annotated the tweets based on their own understanding of, and emotions about, the topic itself. In addition, the annotators sometimes read a tweet from different perspectives. For example, in the "أحمد العيسى" topic, which concerns a controversial person who was the minister of education at that time, one annotator might annotate some tweets as positive while another might annotate the same tweets as negative, based on their feelings about him. Knowing about these differences of opinion in the manual annotation process helps to understand the nature of the results discussed in this section.
The research mainly focuses on building a new Bag-of-Words (BoW) and Bag-of-Phrases (BoPh) for Arabic text by collecting the sentiment words and phrases from tweets. Although a couple of Bag-of-Words lists have been provided by researchers in the field, most of them are either based on formal Arabic, which is not commonly used on social media nowadays, or incomplete.
The results show that the BoPh model helps in classifying the tweets based on different criteria such as the field of the topic and the geographical area. However, some phrases can be used with either positive or negative sentiment, such as "الحمدلله", which can mean "we accept that anyway". Nevertheless, in some cases we can determine the sentiment by applying a sliding-window technique over a limited number of words, e.g. "الحمدلله نجحت" for a positive sentiment or "الحمدلله على كل حال" for a negative sentiment. Some phrases convey the sentiment directly for the topic itself, like a commonly used bad or good phrase, such as "حسبي الله ونعم الوكيل".
The results also show that, when a Naive Bayes algorithm is applied to determine the accuracy of our model, the accuracy on the testing data is 84%, whereas 16% of the predictions were either false positives or false negatives based on the training data set. However, since the input data may carry some uncertainty or margin of error because of the manual annotation, this accuracy value can be considered acceptable.
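One way to realize a sliding window over a limited number of words is a longest-match scan: at each token position, try the longest window first and fall back to shorter ones. The sketch below is an assumed implementation for illustration; the window size, the function name, and the two-entry lexicon (built from the examples in the text) are not taken from the study.

```python
POSITIVE_EXAMPLE = "الحمدلله نجحت"
NEGATIVE_EXAMPLE = "الحمدلله على كل حال"
PHRASES = {POSITIVE_EXAMPLE: "positive", NEGATIVE_EXAMPLE: "negative"}


def window_matches(tokens, phrases, max_window=4):
    """Return (phrase, label) pairs found by a longest-match sliding window."""
    hits = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_window, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in phrases:
                hits.append((candidate, phrases[candidate]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts here; move the window by one token
    return hits
```

Because the ambiguous opening word is not a lexicon entry by itself, it only contributes a label when the following words complete a known phrase.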
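The accuracy figure is the share of test tweets whose predicted label matches the manual annotation, that is, (TP + TN) / (TP + TN + FP + FN). The sketch below computes it from gold and predicted labels; the function names and the four-tweet example are hypothetical and do not reproduce the study's data.

```python
from collections import Counter


def confusion_counts(gold, predicted, positive="positive"):
    # Count each (gold, predicted) pair, then read off the four cells.
    pairs = Counter(zip(gold, predicted))
    tp = sum(n for (g, p), n in pairs.items() if g == positive and p == positive)
    tn = sum(n for (g, p), n in pairs.items() if g != positive and p != positive)
    fp = sum(n for (g, p), n in pairs.items() if g != positive and p == positive)
    fn = sum(n for (g, p), n in pairs.items() if g == positive and p != positive)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels for illustration only.
gold = ["positive", "negative", "positive", "negative"]
predicted = ["positive", "negative", "negative", "negative"]
```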

Conclusion
In this study, manual annotation was applied to Arabic text data (tweets) in the Saudi dialect, and positive or negative sentiments were determined. We collected at most three positive or negative words from each tweet to create an Arabic Bag of Words. At the same time, no more than two phrases were collected from many tweets to create an Arabic Bag of Phrases (BoPh), which would help with some common expressions.
Then, a Naïve Bayes algorithm was applied to test our model. The results are acceptable and indicate 84% accuracy after taking into consideration the margin of error due to the manual annotation step and the subjective interpretation of the texts by the annotators. Nevertheless, the annotators took the utmost care to be as objective as possible in interpreting the text. Finally, this study provides a highly accurate training data set for Arabic text in the Saudi dialect.