SCF: Smart Big Data Classification Framework

Background/Objectives: Remote sensing produces huge volumes of data to be analyzed for different applications. The aim of this study is to develop a smart big data classification platform based on AI techniques. Methods: The data differ in type and format (text, images, audio, and video) and may be structured or unstructured. In addition, the data can be divided into categories, each of which needs to be analyzed separately. Hence, the first step in handling such massive data is to classify them according to their types; the classification phase is then followed by the analysis phase. We propose to utilize two AI algorithms: Fuzzy KNN and CNN. Findings: We propose a novel smart big data classification platform based on AI techniques. It also uses cloud computing as a distributed environment to speed up the classification of such huge data. The framework proposes a pre-analysis structure with suggested algorithms. The framework is evaluated against a regular (serial) approach and proves efficient for big data analysis.


Introduction
Recently, big data has opened a new era for complex data due to its nature and behavior. It demands careful consideration in terms of data classification and analysis, and it has affected data platform techniques owing to the heterogeneity in data format and size. Moreover, data can be generated from different sources in various forms. One example of a big data generator is the smart city [1]. Moreover, a Gartner report has stated that 20 billion devices will be connected by 2020, which would create huge amounts of data [2]. Hence, dealing with massive data is a challenge at the level of both classification and analysis [3]. Accordingly, various platform techniques have been developed to overcome the issues that arise in big data analysis [4]. At the same time, big data is gaining much attention at the software and hardware levels, because processing such data demands deep inspection and analysis.
Big data can be defined by the 5 Vs: velocity, variety, value, veracity, and volume [5]. Big data can originate in various domains, such as the internet, social media, short messages, and banking. Therefore, data analysis techniques play a vital role in analyzing these data in order to find useful information [6]. It is critical that the analysis produce accurate results, as its output may lead to decisions to act on or improve a certain issue [7]. However, the first step in analyzing massive data is the classification process, which calls for an efficient distributed classification approach. Therefore, this paper presents a novel smart framework for data classification that combines AI and the cloud computing paradigm.
The rest of this paper is organized as follows. Related work on the data analysis problem is reviewed in Section 2. Section 3 presents the proposed framework, and Sections 4 and 5 present the implementation of the proposed solution and discuss the results.

Literature Survey
During the last decades, qualitative research has benefited from theoretical and methodological research on data analysis. Recently, big data analysis has gained much attention in the research community [8]. Data analysis typically falls into one of two classes: content and thematic [9]. Content analysis studies the frequency of particular words or phrases; it includes word counts and the extraction of some word or phrase semantics, such as word position and synonyms. Content analysis is valuable in terms of efficiency and ease of implementation; however, it is limited in the richness of the summary data it produces [10]. Thematic analysis goes beyond counting words or extracting representative phrases. It focuses on identifying and describing both implicit and explicit ideas. The reliability of thematic analysis is of greater concern, because it generates codes from the text, which requires a deep understanding of the raw data.
Vol 12(37) | October 2019 | www.indjst.org
The analysis of big data goes through multiple phases, as shown in Figure 1 [1]. Together, these phases are called the analysis pipeline. The first phase is acquisition and recording, in which data are collected, possibly from various sources such as sensors or data entry. The second phase is data extraction, cleaning, and annotation. Cleaning is important because some fields, words, or noise may not be needed; it requires careful attention, because losing data might affect the overall analysis. The third phase is data classification, which leads to the analysis/modeling phase. The last phase is interpretation. These phases might not be easy to implement [11] due to a lack of expertise, unstructured data, or the variety of sources and formats.
There are many data classification and analysis techniques, including statistical methods, intelligent methods, and the rough set method. Among statistical methods [3], the mean and standard deviation are the best-known techniques. Both can be applied to one sample or two samples; however, they are useful only up to a certain level of confidence. Intelligent methods, on the other hand, include fuzzy controllers, neural networks, and neuro-fuzzy systems. The MapReduce technique [12] can be used to build fuzzy rule-based classification systems. Rough sets [13] can also be an efficient method for big data analysis. In addition, Machine Learning (ML), which is based on AI techniques, is one of the tools used for big data classification and analysis, since it acts without human intervention [14]. ML is classified as supervised, unsupervised, or reinforcement learning [15].
Furthermore, one of the main recent techniques used in big data analysis is deep learning. This technique has attracted more attention recently since it is not human engineered [16]; it involves supervised and unsupervised methods within the deep architecture. This paper utilizes deep learning in data classification; the next step is to test its efficiency in data analysis as well. Moreover, another study has proposed a novel technique based on hybrid cluster algorithms [17]. In that study, the authors showed that the hybrid cluster algorithm performs better than existing algorithms in terms of accuracy and the quality of the produced data.
This paper proposes a new classification framework to serve as a vehicle for big data analysis. The proposed framework is tested against sequential data classification.

Smart Classification Framework (SCF)
Based on our experience, data come from different sensing devices in various formats. The main problems in analyzing such data are the amount of data and the interrelations among them. Basing a decision on the analysis of only one type of data might not be effective. On the other hand, analyzing all types of received data is too expensive in terms of resources and time. Our model is motivated by these two problems. The proposed model handles the incoming data from sensing devices in a distributed manner and classifies them accordingly, which later enables distributed analysis of the data. Although the proposed approach is generic and can be used with different types of data, certain blocks are suggested for better performance. These include cloud computing, the MapReduce mechanism, and the RIS scheduling algorithm. Figure 1 shows the proposed framework diagram. The proposed solution consists of the following blocks:

Sensing Block
This block encompasses all the sensing devices, including cameras, sensors, software agents, websites, and any other data input. Massive data are therefore expected from this block, either in real time or offline. The types and amount of data flowing into the framework are determined by the type of system and the variety of sensing devices.

Classification Block
One of the main blocks in this framework is the classification block, where MapReduce is used to classify different types of data. MapReduce was introduced by Google in 2004 [20]. It is a distributed programming model for writing large, scalable, and fault-tolerant data applications that run on a cluster of computers to process big data sets. The MapReduce model is based on two main functions: the map function and the user-designed reduce function. In the first stage, the map function processes the input data and generates intermediate outcomes; in the second stage, these intermediate outcomes are fed into the reduce function, which combines them to produce the final output. See Figures 2, 3, and 4. Two techniques, Fuzzy KNN and CNN, are used along with MapReduce for data classification.
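The two-stage map and reduce flow described above can be illustrated with a minimal single-process sketch. It is not the framework's implementation: the record layout, the choice of the data type as the key, and the function names `map_record`, `shuffle`, `reduce_group`, and `map_reduce` are all assumptions made for illustration, and a real deployment would distribute the work across cluster nodes.

```python
from collections import defaultdict

def map_record(record):
    """Map step: emit one intermediate (key, value) pair per input record.

    The key here is the record's data type (a hypothetical field).
    """
    return (record["type"], 1)

def shuffle(pairs):
    """Group the intermediate values by key between the two stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_group(key, values):
    """Reduce step: combine all intermediate values that share a key."""
    return (key, sum(values))

def map_reduce(records):
    intermediate = [map_record(r) for r in records]
    grouped = shuffle(intermediate)
    return dict(reduce_group(k, v) for k, v in grouped.items())

# Illustrative input: three incoming records of two data types.
records = [
    {"type": "image", "payload": "a.tif"},
    {"type": "text", "payload": "log line"},
    {"type": "image", "payload": "b.tif"},
]
counts = map_reduce(records)
print(counts)  # {'image': 2, 'text': 1}
```

Only `map_record` and `reduce_group` are specific to the task; the surrounding plumbing stays the same when the reduce step is replaced by something else, such as a vote over class labels.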
The Fuzzy KNN classifier works by assigning the unlabeled signature a membership value for each class, which gives the framework the information it needs to estimate the certainty of the decision. The unlabeled signature belongs to each class to a degree determined by its fuzzy membership coefficient. This is one of the main advantages of fuzzy logic over crisp logic when a record is allocated to an unknown category. The Fuzzy KNN classifier relies on two things: the fuzzy information derived from the training data and the training data itself. Fuzzy KNN is one of the most popular techniques due to its simplicity and the consistency of its classification decisions.
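As a rough illustration of how such class memberships can be computed, the sketch below uses the common inverse-distance weighting formulation of fuzzy KNN. The paper does not give its exact formula, so the weighting, the fuzzifier `m`, the crisp (0 or 1) training memberships, and the toy data are all assumptions.

```python
import math

def fuzzy_knn(train, query, k=3, m=2.0, eps=1e-9):
    """Return {class: membership} for `query`; memberships sum to 1.

    train: list of (feature_vector, label) pairs with crisp labels (assumed).
    m:     fuzzifier (> 1); m = 2 weights each neighbor by 1 / distance^2.
    """
    # Keep the k training samples closest to the query (Euclidean distance).
    neighbors = sorted(
        ((math.dist(x, query), label) for x, label in train),
        key=lambda pair: pair[0],
    )[:k]

    # Closer neighbors get larger weights; eps avoids division by zero.
    exponent = 2.0 / (m - 1.0)
    weights = [(1.0 / (d ** exponent + eps), label) for d, label in neighbors]
    total = sum(w for w, _ in weights)

    # Each class receives the normalized weight of its neighbors.
    memberships = {label: 0.0 for _, label in train}
    for w, label in weights:
        memberships[label] += w / total
    return memberships

# Toy two-class data (illustrative values only).
train = [
    ((0.0, 0.0), "smoke"),
    ((0.0, 1.0), "smoke"),
    ((5.0, 5.0), "cloud"),
    ((6.0, 5.0), "cloud"),
]
memberships = fuzzy_knn(train, (1.0, 1.0))
predicted = max(memberships, key=memberships.get)
print(predicted, memberships)
```

Unlike a crisp KNN vote, the returned memberships show how certain the decision is: a result near 0.5 for two classes signals an ambiguous sample, while a value near 1 signals a confident one.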
With deep learning, classification is done differently: a Convolutional Neural Network (CNN) [23] is used, which works best for image classification. The input data therefore need to be preprocessed into an image-like format. In this research, the input is formed into a 28 × 28 pixel matrix to fit the CNN input; this is done in the "map" block. In addition, different activation functions can be used with a CNN. Table 1 shows a summary of the CNN activation functions.
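For reference, the activation functions named in this paper (Sigmoid, Softsign, ReLU, ELU, and SoftPlus) can be written as scalar functions. These are the standard textbook definitions, not equations taken from Table 1, and the ELU parameter `alpha = 1.0` is an assumed default.

```python
import math

def sigmoid(x):
    """1 / (1 + e^-x), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softsign(x):
    """x / (1 + |x|); saturates more slowly than sigmoid."""
    return x / (1.0 + abs(x))

def relu(x):
    """max(0, x)."""
    return max(0.0, x)

def elu(x, alpha=1.0):
    """x for x > 0, alpha * (e^x - 1) otherwise (alpha is an assumption)."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def softplus(x):
    """ln(1 + e^x), a smooth approximation of ReLU, in a stable form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

for name, fn in [("sigmoid", sigmoid), ("softsign", softsign),
                 ("relu", relu), ("elu", elu), ("softplus", softplus)]:
    print(name, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0)])
```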

[Table 1. CNN activation functions. Columns: Function, Equation, Shape. First row: Sigmoid.]

In both cases, we split the data into learning and evaluation sets: the learning sets comprise 75% of all data, and the remaining 25% form the test sets. MapReduce divides the data into chunks, and the chunk size is a function of the data size and the number of available nodes. As shown in Figures 2 and 3, the map function splits the data sets across several mappers, each of which receives the same number of samples. The reduce part takes the individual mappers' results and combines them to produce the final outcome. The idea is to build an individual classifier in each mapper. That classifier labels the test data and sends the class labels to the reducer function, which then takes the majority vote to decide the final class label for the test data.
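The split, map, and majority-vote steps described above can be sketched as follows. This is a single-process illustration under stated assumptions: the per-mapper classifier is a placeholder 1-nearest-neighbor rule on one-dimensional toy features (the paper uses Fuzzy KNN and CNN), leftover samples that do not divide evenly among mappers are dropped, and all function names are hypothetical.

```python
import random
from collections import Counter

def split_train_test(samples, train_fraction=0.75, seed=0):
    """Shuffle, then keep 75% for learning and 25% for evaluation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def make_chunks(samples, n_mappers):
    """Give every mapper the same number of samples (remainder dropped)."""
    size = len(samples) // n_mappers
    return [samples[i * size:(i + 1) * size] for i in range(n_mappers)]

def mapper(train_chunk, test_features):
    """Label each test point with a classifier built on this chunk only.

    Placeholder classifier: 1-nearest neighbor on a scalar feature.
    """
    def predict(x):
        nearest = min(train_chunk, key=lambda sample: abs(sample[0] - x))
        return nearest[1]
    return [predict(x) for x in test_features]

def reducer(labels_per_mapper):
    """Majority vote across mappers for each test point."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*labels_per_mapper)]

# Toy data: a scalar feature with two classes (illustrative only).
samples = [(float(x), "low" if x < 50 else "high") for x in range(100)]
train, test = split_train_test(samples)
chunks = make_chunks(train, n_mappers=3)
test_features = [x for x, _ in test]
labels_per_mapper = [mapper(chunk, test_features) for chunk in chunks]
final_labels = reducer(labels_per_mapper)
print(final_labels[:5])
```

Using an odd number of mappers avoids ties in a two-class vote; with more classes, a tie-breaking rule would have to be chosen explicitly.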

Experimental Results
This section presents the experimental results, which examine the efficiency of the proposed classifier. The USTC_SmokeRS dataset [18] is used to evaluate the proposed framework. The dataset consists of a total of 6225 RGB images from six classes: cloud, dust, haze, land, seaside, and smoke. Each image is saved in ".tif" format with a size of 256 × 256 and a spatial resolution of 1 km. The number of images in each class is shown in Table 2. Figure 5 [19] shows a sample of the dataset images.
After running the training process with different numbers of iterations, the accuracy of both algorithms is measured. The performance of CNN with different activation functions is shown in Figure 6, and Table 3 shows the average results for the activation functions. Performance, in this context, means the classification percentage. As can be seen, the different activation functions produce similar results; however, ReLU, ELU, and SoftPlus give the best performance. Performance also increases with the number of iterations, which supports the efficiency of the proposed framework, although a few drops occur for some functions such as Softsign, Sigmoid, and SoftPlus. Therefore, the best-performing activation functions are recommended for applications with similar classification requirements.

[Activation function table, continued. Columns: Function, Equation, Shape. Row: Softsign.]

Table 4 shows a comparison between the Fuzzy KNN and CNN performances. As can be seen, Fuzzy KNN outperforms CNN by a small percentage. In summary, both algorithms work well for data classification on our selected dataset. In addition, the proposed classifier was, on average, 50% faster than the sequential algorithm. This speed could be improved further by increasing the number of processors and the number of input channels.

Conclusion
We proposed a big data classification framework based on AI. We utilized two AI algorithms, Fuzzy KNN and CNN. Both algorithms were examined, and the performance results show that they are suitable for the framework; however, Fuzzy KNN seems to perform better than CNN. Future work will examine this framework on large-scale data and extend it to include the analysis phase as well.