Rice yield prediction and optimization using association rules and neural network methods to enhance agribusiness

Objectives: This study applies data analytics and machine learning approaches to rice data and establishes association rules on fixed attributes and their correlations for crop yield prediction. Methods: Rice crop data were collected from district Larkana for the defined parameters: area, production, yield, temperature, rainfall, humidity and wind speed. Pre-processing operations were applied to the prepared dataset before running the data analytics and machine learning algorithms. The processed data were then input into the Apriori algorithm to generate association rules. A neural network model was used to perform optimization on the resulting association rules. Findings: With minimum support and confidence values of 3 and 80 respectively, the Apriori algorithm produced 88 association rules from the prepared rice dataset. Among them, 28 rules predicted 'High Yield Production'. A neural network model was used to optimize the predicted yield of district Larkana, and an overall network performance of 97.8% was obtained. Previously, rice yield data of Larkana had not been statistically and digitally predicted and investigated. Application: A group of effective and well-built association rules for yield prediction is the core outcome of this study, which will help researchers, farmers and government officials improve the productivity of the rice crop.


Introduction
Crop yield prediction is helpful in food production and management, and its role in a country's economy is inescapable (1). Pakistan is a well-known agriculture-based country, and the agriculture sector contributes approximately 23.4% to its economy. The majority of the people live in rural areas and depend entirely on agricultural activities (2). Among the crops, rice has great importance because Pakistan ranks 4th in the world among rice exporters. The province of Sindh is the main contributor, with approximately 25% of rice produce coming from its various districts, especially Larkana.
https://www.indjst.org/ However, most farmers still practice traditional crop production methods instead of advanced technology and smart farming. Rapid changes in weather conditions result in low crop productivity, so essential measures should be considered for yield prediction.
The current challenges and threats faced by farmers are uncertain due to rapidly changing weather conditions, water shortages, poor irrigation facilities, uncontrolled costs and the lack of accurate yield prediction (3). Therefore, farmers need to be equipped with smart farming, particularly crop yield prediction using statistical and computational techniques. A research study shows that crop yield prediction would help farmers take precautionary actions to improve productivity. Early prediction is possible by collecting the previous experience of farmers, weather conditions and other influencing factors, and storing them in a large database with a variety of parameters such as area, temperature, rainfall, humidity and wind speed.
Modern technology can provide a large amount of information on agriculture-related activities, from which the necessary information can be extracted and sorted using data analysis techniques. Big data analytics and other applications are used for this purpose, to increase agricultural yield and reduce farming expenses by supporting appropriate decisions. Various machine learning techniques such as prediction, classification, regression and clustering are available for weather forecasting and crop yield prediction (4). In this study, the rice data of district Larkana were collected from the Agriculture Statistics Department, Islamabad. These data are used to predict the yield with the association rule based approach. The Artificial Neural Network (ANN) Feed Forward Back propagation method of supervised learning is used to perform optimization on the resulting association rules for decision-making.

Literature review
Data science is being utilized by researchers in many areas, including agriculture. Surya (5) carried out a survey of various agricultural factors using big data analytics techniques and suggested that implementing big data analytics in agriculture gives better results. Furthermore, crop yield prediction is mostly carried out by researchers using Data Mining Techniques (DMTs). Rao (6) used DMTs to predict crop yield in India. Many data mining algorithms for deriving association rules are reported in the literature; well-known examples are AprioriTid, Apriori and Eclat, applied to different variables to find associations. DMTs were also used by Ramakrishna (7) on agricultural land soil, with a reported improvement in crop productivity. Specifically, Apriori association rules were used to generate advisory reports that support farmers' decisions on crop rotation, fertilizer requirements and harvesting procedures. Supriya (8) used DMTs to analyze soil datasets for crop prediction and developed a complete system using K-Nearest Neighbor and Naive Bayes approaches for predicting crop yield. Additionally, Zingade (9) presented an Android system that used data analytics techniques, i.e. linear regression, to predict the most profitable crop under the current weather and soil conditions. Temperature, rainfall, soil parameters and past-year production were used as the selected attributes. Classification, clustering and association rule DMTs were analyzed by Chouhan (10) for analyzing crop production.
The literature review shows that ANN is commonly used by researchers for yield prediction and optimization. ANN and statistical techniques were compared by Paswan (11). Pandey (12) predicted potato crop yield using a Generalized Regression Neural Network (GRNN) and a Radial Basis Function Neural Network (RBFNN). The leaf area index, biomass and plant height were used as input parameters, while the potato yield was set as the output to train and test the NN. The author concluded that GRNN is a better predictor than RBFNN on the basis of its quick learning capability and lower spread constant parameters. Niedbala (13) used ANN with a Multi-Layer Perceptron (MLP) and presented three neural models for predicting winter rapeseed yield. Temperature, precipitation and information about mineral fertilization were used to predict the yield. The forecasting quality of the models was verified by determining the forecast errors. Similarly, Mathieu (14) used an NN classifier for forecasting extreme corn yield losses caused by weather extremes. The average temperature and the standardized precipitation evapo-transpiration index were used as predictors. Moreover, Khaki (15) used a deep NN to predict yield, check yield and yield difference of corn hybrids from genotype and environment data. The study concludes that environmental factors had a greater effect on crop yield than genotype.

Research methodology
Specific, predefined steps are required to complete the chosen research project. Figure 1 depicts the core research methodology adopted for rice yield prediction and optimization. Selection of appropriate variables is the first task, in which yield is selected as the dependent variable and area, temperature, rainfall, humidity, wind speed and rice production are considered independent variables. The required data are collected on the basis of these independent and dependent variables. The datasets are prepared separately for the preprocessing operations and the subsequent analytical experiments.
This research project is reported in two parts: one deals with yield prediction and the other with the optimization of the predicted rice yield. The association rule based approach with the Apriori algorithm is used for rice yield prediction, and the association rules are generated and analyzed using Frontline Excel Solver V2019. This software tool provides Analytic Solver, through which association rules are generated automatically for further processing. Furthermore, Matlab offers various optimization facilities, including an ANN tool; hence, an NN model is proposed and applied to the yield prediction outcomes for optimization. Developing custom code with carefully chosen parameters might give more accurate predictions, but it takes more time, so the built-in Matlab tool is used in this research project. All manually collected and published data were transferred into a spreadsheet file so that the rice dataset could be input to the machine for the development of association rules. Rice is a Kharif crop; in Larkana District its sowing starts in June and harvesting starts in November every year. Hence, the average maximum temperature, average monthly rainfall, average monthly relative humidity and average monthly wind speed for June to November of each year are taken for further preprocessing. The prepared datasets for the years 2010 to 2015 are given in Table 1.

Data collection and processing
The machine learning algorithms used here process data only when it is not in continuous form. Discretization is the process of mapping values to a limited number of possible states, and its output is a file containing ordered, discrete values. The developed dataset consists of several attributes with continuous data. In this stage, discretization is performed on the continuous data of all attributes of the rice crop dataset of Larkana District. The attributes area, temperature, rainfall, humidity, wind speed, production and yield are discretized into least, average and utmost levels using fixed threshold values for each attribute. The threshold values set for all variables are given in Table 2.
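The discretization step can be sketched in Python as follows. The cut points are hypothetical placeholders and not the thresholds of Table 2; the low/medium/high labels follow the levels used when the rules are discussed, and the function names are illustrative only.

```python
from bisect import bisect_right

# Hypothetical cut points (NOT the values from Table 2): a value below the
# first cut is "low", at or above the second cut is "high", otherwise "medium".
THRESHOLDS = {
    "temperature": (36.0, 38.0),  # degrees Celsius, assumed
    "rainfall": (5.0, 15.0),      # mm per month, assumed
}
LABELS = ("low", "medium", "high")


def discretize(attribute, value):
    """Map a continuous value to one of three ordered labels."""
    cuts = THRESHOLDS[attribute]
    return LABELS[bisect_right(cuts, value)]


def to_transaction(record):
    """Turn one year's record into a set of 'attribute=label' items."""
    return {f"{name}={discretize(name, value)}" for name, value in record.items()}


# Illustrative record, not taken from Table 1.
example = {"temperature": 38.4, "rainfall": 3.2}
print(sorted(to_transaction(example)))
```

Each yearly record becomes one transaction of labelled items, which is the input form an association rule miner expects.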

Association Rules
Association rules are an important concept in machine learning, commonly used by researchers to solve problems in big data analytics, including crop yield prediction. They help to uncover relationships between items in large databases. Here, the Apriori algorithm is used to obtain the empirical results. Association rules are if-then statements that show the probability of relationships between data items within large datasets. An association rule has two parts, an antecedent (if) and a consequent (then), both of which are lists of items. An antecedent is an item found within the data, while a consequent is an item found in combination with the antecedent. Various metrics are available to measure the strength of the association between the antecedent and the consequent.
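As a minimal sketch of how an if-then rule and its strength can be represented, the Python snippet below computes support and confidence for one rule. The transactions are invented for illustration and the function names are assumptions, not part of the tool used in the study.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Toy transactions (invented for illustration only).
transactions = [
    {"area=low", "temperature=high", "yield=high"},
    {"area=low", "temperature=high", "yield=high"},
    {"area=low", "temperature=medium", "yield=medium"},
    {"area=high", "temperature=low", "yield=low"},
]

# Rule: IF area=low AND temperature=high THEN yield=high
antecedent = {"area=low", "temperature=high"}
consequent = {"yield=high"}
print(support(transactions, antecedent | consequent))    # 0.5
print(confidence(transactions, antecedent, consequent))  # 1.0
```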

Experiments with Apriori algorithm
Searching for frequent patterns is vital, and the Apriori algorithm performs this task among the items stored in the developed database. The algorithm was applied effectively, and association rules were produced digitally from the chosen item sets. The pre-processed and discretized rice dataset of district Larkana is given as input to the algorithm to obtain the frequent item sets and strong association rules. The algorithm starts by creating the basic candidate item sets, each consisting of a single item. From the candidate item sets, the useful item sets are obtained by pruning all those whose support does not satisfy the minimum support threshold. The surviving item sets are then extended step by step into larger item sets that still meet the threshold. Furthermore, the strong association rules obtained from the experiments are examined to understand the correlation among the variables, i.e. area, temperature, rainfall, humidity, wind speed, production and yield.
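The level-wise procedure described above can be sketched in plain Python. This is a compact textbook-style version of Apriori written for illustration, not the implementation inside Analytic Solver; the transactions are invented and the function names are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset(itemset): count} for itemsets seen at least min_count times."""
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: single items that meet the minimum support count.
    items = {frozenset([i]) for t in transactions for i in t}
    current = {c: n for c, n in count(items).items() if n >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {c: n for c, n in count(candidates).items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent


def rules_from(frequent, min_conf):
    """Yield (antecedent, consequent, count, confidence) for rules meeting min_conf."""
    for itemset, n in frequent.items():
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = n / frequent[ante]
                if conf >= min_conf:
                    yield ante, itemset - ante, n, conf


# Six invented yearly transactions (not the Larkana data).
transactions = [
    {"area=low", "temperature=high", "yield=high"},
    {"area=low", "temperature=high", "yield=high"},
    {"area=low", "temperature=high", "yield=high"},
    {"area=low", "temperature=medium", "yield=high"},
    {"area=high", "temperature=high", "yield=medium"},
    {"area=high", "temperature=low", "yield=low"},
]
frequent = apriori(transactions, min_count=3)
for ante, cons, n, conf in rules_from(frequent, min_conf=0.8):
    print(sorted(ante), "->", sorted(cons), n, conf)
```

A minimum count of 3 and a minimum confidence of 0.8 mirror the thresholds reported in this study; on real data the number of rules depends entirely on the discretized records.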
The list of 88 association rules was generated automatically by applying the Apriori algorithm to the Larkana rice dataset. The generated list is divided into two parts: Tables 3 and 4 present rules 1 to 44 and rules 45 to 88 respectively. Both tables contain the attributes Rule id, A-support, C-support, Support, Confidence, Lift-ratio, Antecedent and Consequent. The attribute Rule id identifies the rules in the table, and Support, Confidence, Lift-ratio, Antecedent and Consequent are the core parameters, while the values of A-support and C-support are outside the scope of this research and are therefore not discussed in detail.

Measurements of support and confidence levels
The results obtained using the Apriori algorithm consist of a list of association rules. The minimum support and confidence values were set so as to obtain the maximum possible combinations and the strongest association rules among the items in the rice dataset. An item set is denoted by the variable X and its support is written supp(X). The support supp(X) is defined as the proportion of all transactions in the dataset that contain the item set X. After computing the support levels, the level of confidence is calculated using Eq.1 given below.
Note that supp(X∪Y) represents the support of transactions in which both X and Y occur. Furthermore, confidence is usually interpreted as an estimate of the probability P(Y|X), computed from the left and right sides of the obtained rules. The left side of the rule is then calculated using Eq.2 given below to observe the support percentage of the variables X and Y.
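For reference, the standard textbook definitions of confidence and support can be written in the notation used above. This is a sketch from the general association rule literature and may differ in form from the paper's own Eq.1 and Eq.2; the symbol T for the set of all transactions is an assumption introduced here.

```latex
\begin{align}
\mathrm{conf}(X \Rightarrow Y) &= \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} \approx P(Y \mid X) \\
\mathrm{supp}(X) &= \frac{\lvert \{\, t \in T : X \subseteq t \,\} \rvert}{\lvert T \rvert}
\end{align}
```

As a worked example, a rule whose item set X∪Y occurs in 4 records and whose antecedent X occurs in 5 records has a confidence of 4/5 = 80%.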
Only two support values were found during the experiments. Figure 2 presents the support value of each rule. The highest support value is 4 and the lowest is 3. Surprisingly, rules 3, 9, 11, 27, 28 and 30 have the highest support value of 4 and the remaining rules hold the lowest value; hence the minimum support value is set to 3.

Fig 2. Calculated support levels
Similar to the calculation of support, the confidence value is calculated to observe the relation between the rules and the fixed variables. Figure 3 presents the calculated confidence values for all generated rules. The lowest confidence value of 80 is observed for rules 9, 15 and 28. All other rules have the maximum confidence value of 100. Accordingly, a minimum confidence value of 80 is fixed for the analysis.

Predicted rice yield
Tables 3 and 4 present the outcomes of the association rules in terms of the support and confidence level of each rule. In many places, the values of temperature, wind speed and other variables are very high compared to the yield variable. Hence, the rules with yield at the high level are separated into a filtered list to understand the impact of the other variables on the dependent variable yield.
The list of 28 strong, filtered association rules that predict the rice yield at the maximum level, with their Antecedent, Consequent, Support and Confidence values, is given in Table 5. The support value shows how frequently the item set of each rule occurs in the dataset. The confidence value expresses the share of transactions containing the Antecedent in which the Consequent also holds; in other words, it gives the probability that the rule is correct on the provided dataset. The lift-ratio is the ratio of the observed support to the support that would be expected if the Antecedent and Consequent were independent.
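The filtering and the lift-ratio described above can be sketched as follows. The rows mimic the column layout of Tables 3 and 4 (A-support, C-support and Support as counts out of six yearly records), but every number and field name here is invented for illustration.

```python
N_TRANSACTIONS = 6  # six yearly records, 2010 to 2015

# Invented rows in the shape of Tables 3 and 4; values are illustrative only.
rules = [
    {"id": 1, "antecedent": {"area=low"}, "consequent": {"yield=high"},
     "a_support": 4, "c_support": 4, "support": 4},
    {"id": 2, "antecedent": {"temperature=high"}, "consequent": {"yield=high"},
     "a_support": 5, "c_support": 4, "support": 4},
    {"id": 3, "antecedent": {"area=low"}, "consequent": {"production=medium"},
     "a_support": 4, "c_support": 3, "support": 3},
]


def confidence_pct(rule):
    """Confidence in percent: rule support over antecedent support."""
    return 100.0 * rule["support"] / rule["a_support"]


def lift_ratio(rule, n=N_TRANSACTIONS):
    """Confidence divided by the consequent's relative support."""
    return (rule["support"] / rule["a_support"]) / (rule["c_support"] / n)


def high_yield_rules(rules, min_support=3, min_confidence=80.0):
    """Keep rules whose consequent includes high yield and that meet both thresholds."""
    return [r for r in rules
            if "yield=high" in r["consequent"]
            and r["support"] >= min_support
            and confidence_pct(r) >= min_confidence]


for r in high_yield_rules(rules):
    print(r["id"], confidence_pct(r), round(lift_ratio(r), 2))
```

A lift-ratio above 1 indicates that the antecedent and consequent occur together more often than independence would predict.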

Neural network classification
Various machine learning approaches have been used for crop yield prediction as well as optimization. The NN approach is important and gives an acceptable level of accuracy; hence, it is used for optimization of the rice yield data. With the help of the neural network, data optimization is performed on the extracted information, such as Area, Temperature, Rainfall and Humidity, against Yield and Production, which are the target parameters. The optimization uses the feed forward back propagation method, which is a supervised learning method. However, neural networks are applied in a different way by Gandhi (16) and Khaki (17) for crop yield prediction. In supervised learning, the algorithm is provided with samples of inputs and outputs, and the network calculates the error from them. The major goal of the algorithm is to adjust the connection weights in order to reduce the error, and this is how the NN learns from the training data. Figure 4 depicts the neural network developed for optimization of the predicted rice yield. The network has 5 inputs and 2 outputs. Area, Temperature, Rainfall, Humidity and Wind Speed are the factors affecting yield and production performance. Remote sensing data were also used by Putri (18) with specific variables. Thus, these five variables are selected as input variables, while Yield and Production are set as target values to train the network. Furthermore, ten hidden neurons are fixed as a hyperparameter of the developed network to get accurate and reliable results. Some researchers have argued that a convolutional neural network is more appropriate than a back propagation network because of time dependency (19).
In this project, time need not be considered in that way because the data cover only the rice cropping period, so the proposed neural network is sufficient for this kind of system. Training and testing were performed using a dataset prepared in an Excel worksheet and later imported into Matlab as two arrays of size 5x6 and 2x6, as shown in Figure 5.
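A feed forward network with 5 inputs, 10 hidden neurons and 2 outputs, trained by back propagation, can be sketched in pure Python as below. This is not the Matlab tool used in the study: the data are synthetic stand-ins, the tanh hidden activation, plain batch gradient descent and the learning rate of 0.1 are assumptions chosen for the toy data, and samples are stored as rows (6x5 and 6x2) rather than as the 5x6 and 2x6 arrays mentioned above.

```python
import math
import random

random.seed(0)
N_IN, N_HID, N_OUT = 5, 10, 2

# Synthetic stand-in data: 6 samples (one per year), inputs scaled to [0, 1].
X = [[random.random() for _ in range(N_IN)] for _ in range(6)]
# Invented targets standing in for yield and production.
T = [[sum(x) / N_IN, x[0] * x[1]] for x in X]

W1 = [[random.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_HID)]
b1 = [0.0] * N_HID
W2 = [[random.uniform(-0.5, 0.5) for _ in range(N_HID)] for _ in range(N_OUT)]
b2 = [0.0] * N_OUT


def forward(x):
    """Forward pass: tanh hidden layer, linear output layer."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W2, b2)]
    return h, y


def mse():
    """Mean squared error over all samples and outputs."""
    total = 0.0
    for x, t in zip(X, T):
        _, y = forward(x)
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total / (len(X) * N_OUT)


def train_epoch(lr):
    """One batch epoch: accumulate gradients over all samples, then update once."""
    gW1 = [[0.0] * N_IN for _ in range(N_HID)]
    gb1 = [0.0] * N_HID
    gW2 = [[0.0] * N_HID for _ in range(N_OUT)]
    gb2 = [0.0] * N_OUT
    scale = 2.0 / (len(X) * N_OUT)
    for x, t in zip(X, T):
        h, y = forward(x)
        dy = [scale * (yi - ti) for yi, ti in zip(y, t)]  # output error signal
        for k in range(N_OUT):
            gb2[k] += dy[k]
            for j in range(N_HID):
                gW2[k][j] += dy[k] * h[j]
        for j in range(N_HID):
            # Back-propagate through the output weights and the tanh derivative.
            dh = sum(dy[k] * W2[k][j] for k in range(N_OUT)) * (1.0 - h[j] ** 2)
            gb1[j] += dh
            for i in range(N_IN):
                gW1[j][i] += dh * x[i]
    for k in range(N_OUT):
        b2[k] -= lr * gb2[k]
        for j in range(N_HID):
            W2[k][j] -= lr * gW2[k][j]
    for j in range(N_HID):
        b1[j] -= lr * gb1[j]
        for i in range(N_IN):
            W1[j][i] -= lr * gW1[j][i]


initial_mse = mse()
for _ in range(500):
    train_epoch(lr=0.1)
final_mse = mse()
print(f"MSE before: {initial_mse:.4f}, after: {final_mse:.4f}")
```

With only six samples, a network of this size can fit the training data closely, so a low training error on its own says little about how well it generalizes.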
Using the above-mentioned values, the network was developed with the Feed Forward Back propagation method. The data were optimized to generate the best output for Yield and Production. During the training process, the best output was generated by the network at the 7th epoch out of 1000, as shown in Figure 6. Batch training was performed, in which samples pass through the learning algorithm before the weights are updated, at a learning rate of 0.001 per epoch. Figure 7 shows the gradient value generated by the network, which is 264971.27 up to the 7th epoch. The validation checks are also shown in the same figure, totalling 6 in 7 epochs.
The selected dataset is used for both training and testing of the network. The experimental results showed that admirable outcomes were achieved during the training and testing phases. Figure 8 shows that the best validation performance achieved by the network, measured by Mean Square Error (MSE), is 648 at the 1st epoch. The graph in Figure 9 shows the training and testing performance against the MSE. The overall network performance was 99.8% during network training, and the overall optimized result achieved was 97.8%.
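One possible reading of the reported validation checks is an early-stopping counter: training halts once the validation error has failed to improve for a fixed number of consecutive epochs. The sketch below illustrates that logic under this assumption, which the text does not state explicitly; the error history is invented apart from reusing 648 as the first value, and the function name and `max_fail` parameter are illustrative.

```python
def early_stopping(validation_mse, max_fail=6):
    """Return (best_epoch, stop_epoch), both 1-based, for per-epoch validation errors."""
    best_epoch, best_value, fails = 0, float("inf"), 0
    for epoch, value in enumerate(validation_mse, start=1):
        if value < best_value:
            # New best validation error: remember it and reset the counter.
            best_epoch, best_value, fails = epoch, value, 0
        else:
            fails += 1
            if fails >= max_fail:
                return best_epoch, epoch
    return best_epoch, len(validation_mse)


# Invented history: the first epoch is the best and the error never improves.
history = [648.0, 700.0, 810.0, 790.0, 905.0, 960.0, 1010.0, 1100.0]
print(early_stopping(history))  # (1, 7)
```

Under this reading, a best validation error at the 1st epoch followed by six non-improving epochs would end training at the 7th epoch.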

Discussion
The available rice data of Larkana district were analyzed with the Apriori algorithm using six years of data for the parameters area, temperature, rainfall, humidity, wind speed, production and yield. The last two variables are fixed as target variables. The rules that predict the rice yield with a satisfactory level of confidence are described in Table 6. After careful analysis of the rules and the predicted rice yield, it is clear that many factors directly affect the yield of the rice crop. For instance, rule number 3 states that if the area is low then the yield is high; this means that the farmer is not able to provide sufficient water to the rice crop. On the other hand, according to rule number 9, when the temperature is high compared to the normal level then the yield is also high, because low temperature reduces rice production. Rule 11 gives a surprising result for humidity: a medium level of this factor gives maximum rice yield.
However, rule number 23 tells us that low area and medium humidity also lead to high yield. Rule number 25 shows that if the area is low and the wind speed is high then the yield is high. Some parameters are directly related, such as area and production; for example, rule number 30 states that if the area is low and production is medium then the yield is high. The rice yield increases when the temperature is high and the humidity is average (see rule number 33). Rule 41 states that when both humidity and production are at appropriate levels then the yield is high. Rule 51 shows that if area, temperature and wind speed are all at the high level then the yield increases to an acceptable level. Rule number 55 states that if the area is low and the temperature is high then production is medium and the yield is high, whereas rule 56 shows that when the area is low, the temperature is high and production is medium then the yield is at the utmost level. Rule number 59 shows that if the area is low and humidity is medium then production is medium but farmers obtain a high yield. Additionally, rule 61 shows that a low cultivation area with average humidity and production gives maximum rice yield. It seldom happens, but rule number 64 shows that if the area is at the least level and wind speed and production are high and medium respectively then farmers can also get maximum rice yield. When rule 66 was examined, it was observed that the cultivation area may be low, but if the wind speed is high and production is average then a high yield can be obtained. Farmers are concerned about the low area reserved for rice cultivation, but rule number 73 states that if temperature and wind speed are high then the yield will be high, provided production is medium.
Similarly, rule number 74 shows that when the area is again low and the wind speed is high, temperature and yield are high, but production is not.
Moreover, rule 76 shows that when the area is low but wind speed and temperature are at the maximum level then the yield is high but production is medium. Rule number 77 was examined carefully, and it was found that if the area is low, the temperature is high and production is medium then wind speed and yield are both high. Most farmers grow rice on a small area because proper land is unavailable. They can obtain a high yield when the conditions of rule 79 are met: this rule holds if wind speed is high, production is medium and temperature is high. Rule number 84 was also examined; it states that if the area is low and both temperature and wind speed are high then the yield is high but production is medium. During discussions with farmers and experts, it was found that low wind speed decreases the rice yield; therefore, in line with rule 84, a high wind speed is beneficial for rice yield.

Summary and Conclusion
The aim of this study is to share yield predictions made on valuable historical rice data using analytical techniques and machine learning algorithms. The Apriori algorithm is applied to the rice crop dataset of district Larkana. The required data were collected from the relevant departments. The association rule approach showed that Rule 9, Rule 20, Rule 33, Rule 51 and others clearly result in "Yield is high". The predicted rice yield, optimized using the neural network, gives a high yield value that can be followed by rice farmers, growers and government officials, so that these personnel can take better decisions based on past trends and the forecast values of the correlated attributes discussed in this study. The obtained results point to bumper crop production that will strengthen the country's economy by increasing rice production and will also save farmers from monetary loss. The proposed neural network for optimizing the rice yield of district Larkana will be analytically compared with the results of advanced linear and statistical models.