A comparative study of different LSTM neural networks in predicting air pollutant concentrations

Aim/Objective: This study aims to identify key trends among different types of LSTM networks and to compare their performance and usage in predicting air pollutant (PM2.5 and PM10) concentrations. Methods: Extensive experiments were conducted for particulate matter (PM10 and PM2.5) prediction using several LSTM networks, namely Vanilla, Stacked, and Bidirectional LSTM. These were trained and tested on air quality data retrieved from the Central Pollution Control Board (CPCB) for the town of Bawana, Delhi. Real-time hourly data from 2018 to 2020 covering nine air pollutants were considered for the experimental analysis. A data preparation strategy was applied to select the best features and improve the quality of the data. An adequate number of experiments were conducted to choose the best hyperparameters using the Python package TensorFlow. Findings: MSE, MAE, RMSE, and R2 were used as the statistical criteria for evaluating model performance. The numerical experiments revealed that deep neural networks can predict particulate matter concentrations (µg/m3) with high accuracy. We found that Stacked LSTM, with the minimum MSE, MAE, and RMSE and the maximum R2, works better than the other two methods, i.e., Vanilla LSTM and Bidirectional LSTM, for PM2.5 and PM10 concentration prediction. The experimental analysis also shows that the Vanilla, Stacked, and Bidirectional LSTM models have comparatively lower MSE, MAE, and RMSE and higher R2 for PM2.5 than for PM10 concentration prediction. Applications: With the help of a predictive model, one can obtain reliable fine-particle concentration predictions for a particular area. The resulting information on relative performance can help researchers select an appropriate LSTM algorithm for their studies.
Keywords: Air pollutants; air quality index; PM concentrations; LSTM; TensorFlow; air pollution


Introduction
According to World Health Organization data, 1.5 million people in India died from chronic respiratory and asthma diseases caused by exposure to outdoor air pollution (1). In Delhi, due to vast industrialization and urbanization, the foremost environmental concern is air pollution in terms of PM concentrations and pollutants consisting of a complex mixture of solid and liquid particles. Hazardous air pollutants are likely to inflame the respiratory tract, may even damage the blood and nervous systems, and can cause death. Therefore, air quality analysis and prediction have become an extensive area of research.
The government of India has installed pollutant measuring sensors to monitor air pollution, and these sensors regularly capture the air pollutants. The air quality prediction is highly significant for the governmental emergency departments and citizens to implement protective measures and promptly mitigate serious pollution incidents. Concurrently, environmental monitoring using pollutant level prediction is a useful technical medium in executing scientific decision-making for air pollution control and prevention. For instance, Badarpur coal-fired power plant in Delhi has been shut down every winter to reduce the pollution level due to the predicted air quality (2) .
The Air Quality Index (AQI) is a value used by government organizations to reflect ambient air quality in the atmosphere for a particular area. According to Indian national ambient air quality standards, AQI is calculated from ambient concentration values of eight major pollutants: particulate matter 2.5 (PM2.5), particulate matter 10 (PM10), carbon monoxide (CO), nitrogen dioxide (NO2), sulfur dioxide (SO2), ozone (O3), ammonia (NH3), and lead (Pb). The AQI expresses the overall air quality on a scale from 0 to 500. It is categorized into six remarks: good, satisfactory, moderate, poor, very poor, and severe; these remarks indicate the possible impact on public health (3,4).
Different types of LSTM networks have become a dominant method in prediction and forecasting. AQI and air pollutant concentration prediction has recently emerged as a promising application area for these methods.
A stacked LSTM model has been used to predict future air pollutant concentrations. The authors considered features such as meteorological datasets and additional factors including festivals, national holidays, traffic, etc. They predicted pollutants (PM2.5, PM10, CO, NO2, SO2, O3) for the next 1 h, 6 h, and 12 h for major stations in Delhi and Agra (5). An LSTM neural network was implemented for the detection and prediction of PM2.5 in (6); the authors collected data from the Environmental Protection Administration of Taiwan for the years 2012 to 2016. Several kinds of LSTM architectures were used for air pollutant concentration prediction in the region of Madrid (7).
A single-layer LSTM network was used to predict 12 air pollutants on CPCB data for Visakhapatnam, India; the authors compared their results with various SVR kernels (8). The LSTM method has also been employed for AQI prediction based on air pollutant factors and meteorological data (9). Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), and LSTM methods were used to forecast PM10 on time-series AirNet data, and the paper also analyses their predictive performance (10).
A bidirectional LSTM was constructed to predict the air pollutant PM2.5. To increase the accuracy of the model, the authors included weather information while training on the dataset. They predicted PM2.5 for the next 6, 12, and 24 hours at multiple stations in New Delhi (11). A bidirectional LSTM RNN was applied to the spatio-temporal interpolation domain to predict PM2.5 concentration; the authors considered daily PM2.5 measurements for 2009 in the southeastern region of the USA (12). Air quality and meteorological information, as real time-series data, were considered for pollutant prediction using the RNN LSTM algorithm; the authors predicted PM2.5 for the next 8 h, 12 h, 16 h, 20 h, and 24 h in South Korea over one year (13).
PM concentrations (PM2.5 and PM10) have been predicted using LSTM and deep autoencoder (DAE) methods. The air quality data were collected from 25 stations in Seoul, South Korea, over three years, and the authors compared the algorithms' performance using different hyperparameters such as learning rate, epochs, and batch size (14). A denoising autoencoder deep network (DAEDN) model based on LSTM was designed and constructed for PM2.5 concentration prediction. Five years of air quality data for the city of Beijing were considered for the experimental analysis, and the proposed model achieved an RMSE and MAE of 15.504 and 6.789, respectively (15). The long short-term memory neural network extended (LSTME) model was proposed to predict air pollutant concentration. The experiment was carried out on hourly air quality data gathered from 12 monitoring stations in Beijing over two years and achieved a MAPE of 11.93% (16).
Out of the several pollutants mentioned, particulate matter PM10 and PM2.5 are of particular interest: their diameters are less than 10 and 2.5 micrometers, respectively, so they can quickly enter the human respiratory system. Hence, the prediction of PM10 and PM2.5 plays a vital role in determining air quality. This study uses a cutting-edge deep learning technique, RNN-based LSTM, which has the ability to process nonlinear problems and tolerate noise, improving forecasting efficiency and accuracy; the method therefore has high theoretical research value. The LSTM network is divided into three basic categories: Vanilla, Stacked, and Bidirectional. This study's main objective is to compare the performance of these LSTM networks in terms of air quality prediction and to identify the efficient ones among them. A comparison is only possible when a standard benchmark on the dataset and the scope is established; therefore, we chose identical hyperparameters implemented on the same data for comparison. We conducted a series of experiments to validate the intended approaches and present evidence demonstrating the proposed framework's effectiveness using evaluation parameters.

Data description
The dataset used in this study is collected from the monitoring station at Bawana, Delhi, for the period of 4th July 2018 to 30th June 2020 (17). The most polluting industrial units operating in Delhi's residential areas were relocated to Bawana, North West Delhi; thus, the air quality at Bawana became relatively poor. Therefore, air quality analysis is important in the region of Bawana. The dataset contains 17126 instances of hourly pollutant readings for PM10, PM2.5, NO, NO2, NOx, NH3, SO2, CO, and O3, provided by certified analyzers and electronic instruments and measured in µg/m3.

Data pre-processing
In the process of conducting air quality prediction on the data, the preprocessing stage plays a vital role (18). Due to some uncontrollable factors, the obtained data lacked certain values, such as timestamps and monitoring readings. Figure 1 reports the missing values, with the percentage loss present in the real-time data. The missing values for each factor are counted out of 17126 records, and the percentage loss is estimated. As shown in Figure 1, the number of missing values for the essential PM10 parameter reaches nearly fourteen hundred, which has a significant impact on the effectiveness of the model. Therefore, constructing reasonable and efficient data-cleaning procedures is a prerequisite for reducing data noise. It is also a widespread research issue in data science (19).

Data imputation
In dealing with the data-missing problem, the linear interpolation method is adopted. The missing values in a data field are imputed by interpolating the respective parameter. Equation (1) formulates the linear interpolation as

X_t = X_s + ((X_e - X_s) / n) * t    (1)

where X_t indicates the missing value at time t, X_s and X_e are the last observed value before the gap and the first observed value after it, and n indicates the time interval between X_s and X_e.
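This imputation step can be sketched with pandas' built-in linear interpolation, which fills each gap on a straight line between the nearest observed values, as in Equation (1). The series below is illustrative, not taken from the Bawana data:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly pollutant readings with a two-hour gap.
series = pd.Series([40.0, np.nan, np.nan, 70.0, 65.0])

# Linear interpolation: missing points lie on the line from 40.0 to 70.0.
filled = series.interpolate(method="linear")
print(filled.tolist())  # [40.0, 50.0, 60.0, 70.0, 65.0]
```

In the actual pipeline, the same call would be applied column-wise to each pollutant in the hourly DataFrame.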

Data statistics
Table 1 provides descriptive statistics of each factor in the dataset of the city Bawana, Delhi, including statistical properties such as the minimum, maximum, and average values.

Data normalization
When the scale of each factor is different, it can affect the training process, resulting in decreased accuracy. Therefore, to facilitate the network-training process and prevent "overfitting" problems, each factor's original data needed to be normalized to the same scale for better performance. We used Min-Max Normalization to scale the values into the range [0, 1]:

x' = (x - x_min) / (x_max - x_min)

where x_min and x_max are the minimum and maximum values of the respective factor.
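A minimal sketch of this scaling, applied per factor (the helper name and sample values are illustrative):

```python
import numpy as np

def min_max_scale(x):
    """Scale a 1-D array of pollutant readings into [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

scaled = min_max_scale([10.0, 30.0, 50.0])
print(scaled.tolist())  # [0.0, 0.5, 1.0]
```

In practice the minimum and maximum must be computed on the training split only and reused for the test split, so that no test-set information leaks into training.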

Feature selection
This paper uses Spearman's rank-order correlation to find monotonic relationships between the Bawana dataset features. The valid range of the correlation coefficient is from -1 to +1: -1 and +1 indicate strong negative and positive correlations, respectively, whereas 0 indicates a negligible correlation (20). The formula for Spearman's rank-order correlation is

ρ = cov(r_x, r_y) / (σ_rx · σ_ry)

where r_x and r_y denote the ranked values of variables x and y, cov stands for covariance, and σ indicates standard deviation. This study refers to Cohen's standard (21) for feature selection. Features with correlation values less than 0.30 are considered weak and hence removed from further processing. Based on the investigation in Figure 2, PM10 has a strong positive correlation with PM2.5; moderate positive correlations with NO, NO2, NOx, NH3, and CO; and a moderate negative correlation with O3. Likewise, PM2.5 has a strong positive correlation with PM10 and moderate positive correlations with NO, NO2, NOx, NH3, SO2, and CO. According to Cohen's standard, SO2 with 0.29 and O3 with -0.2 are discarded from the dataset while predicting PM2.5 and PM10 concentrations.
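The coefficient can be computed directly from its definition, ranking each variable and then taking the Pearson correlation of the ranks (this standalone helper is a sketch; for tied values a library routine such as `scipy.stats.spearmanr` handles average ranks):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranked values (no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # rank of each x value
    ry = np.argsort(np.argsort(y)).astype(float)  # rank of each y value
    return np.corrcoef(rx, ry)[0, 1]

# A perfectly monotonic but nonlinear relationship yields rho = 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
rho = spearman_rho(x, x ** 3)
```

Features whose |rho| with the target falls below the 0.30 threshold would then be dropped, as done for SO2 and O3 above.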

LSTM neural networks
The ability to memorize sequences of data makes the LSTM a special kind of RNN, capable of capturing information from past events and using it for future predictions (22), (23). In an LSTM, the memorization of earlier stages is performed through gates and incorporated memory lines. The basic LSTM unit consists of three filtering gates: the input gate, forget gate, and output gate, which control the flow of information. This study formulates PM2.5 and PM10 concentration prediction as a time-series problem. The fine pollutant concentration at time t+1 not only takes input data at time t but also takes predicted values at times t-1, t-2, …, t-N, where N is the memory length, because in our approach the current pollutant level depends on past pollutant levels. This paper predicts fine PM concentrations using three basic types of LSTM neural networks, described as follows:
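For reference, the standard formulation of the three gates and the memory line at time step t (this is the common textbook parameterization, not an equation reproduced from this paper) is:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate cell state}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{memory line update}\\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state / output}
\end{aligned}
```

Here sigma is the sigmoid function and the circled dot denotes element-wise multiplication; the gates decide what is forgotten, what new information enters the cell state, and what is exposed as output.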

Vanilla LSTM
A single-layer LSTM network is commonly known as a Vanilla LSTM. It consists of a single LSTM unit, a fully connected hidden layer, and an output layer to make a prediction. In this study, the Vanilla LSTM has a visible layer with one input, a hidden layer with 75 units, and an output layer that produces a single-value prediction.

Stacked LSTM
A Stacked LSTM comprises multiple hidden LSTM layers, each containing a number of memory cells, stacked on top of one another. This study stacked two LSTM layers, where the first LSTM layer returns its whole output sequence, while the second LSTM layer returns only the last step of its output sequence.

Bidirectional LSTM
The Bidirectional LSTM can simultaneously process information in both directions, i.e., forward and backward. Two hidden states preserve information from the past and the future, and the corresponding outputs of the hidden states are joined into the same output state. Because the time-series data are traversed in both directions, the lag problems that occur while training a unidirectional LSTM can be overcome. Consequently, the network receives additional information, improving the correctness of the predictions.
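The three architectures can be sketched in Keras as follows. This is an illustrative reconstruction, not the authors' code: the 64 units per recurrent layer follow the Development section, while the window length and feature count are placeholder values:

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Bidirectional

TIMESTEPS, FEATURES = 24, 7  # illustrative memory length and input features

def vanilla_lstm():
    # Single LSTM layer feeding a one-unit output for the next PM value.
    return Sequential([
        Input(shape=(TIMESTEPS, FEATURES)),
        LSTM(64),
        Dense(1),
    ])

def stacked_lstm():
    # First layer returns the full sequence so the second LSTM can consume
    # it; the second layer returns only its last time step.
    return Sequential([
        Input(shape=(TIMESTEPS, FEATURES)),
        LSTM(64, return_sequences=True),
        LSTM(64),
        Dense(1),
    ])

def bidirectional_lstm():
    # One LSTM wrapped so the sequence is processed forward and backward.
    return Sequential([
        Input(shape=(TIMESTEPS, FEATURES)),
        Bidirectional(LSTM(64)),
        Dense(1),
    ])
```

Each model consumes input of shape (batch, timesteps, features), which is why the data are later reshaped into three dimensions.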

Development of LSTM models
The experimental process framework for predicting PM concentrations using different LSTM neural networks is shown in Figure 3. The workflow includes data processing, which handles missing values and generates data for training and testing the models. Using the generated input data, we trained and evaluated the models; MSE, MAE, RMSE, and R2 are used to assess the prediction effectiveness of the LSTM networks. In order to develop the PM2.5 and PM10 prediction models, the dataset is separated into two sets. Specifically, 3/4 of the original dataset (12845 samples) is used to learn the model, called the training dataset; the remaining 1/4 of the data (4281 samples) is used to justify the model, called the testing dataset.
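Since the data form a time series, the split is chronological rather than shuffled; a minimal sketch with the counts stated above:

```python
import numpy as np

samples = np.arange(17126)  # stand-in for the 17126 hourly records

# Chronological 75/25 split: the first 12845 hours train the model,
# the remaining 4281 hours test it (no shuffling for time series).
split = 12845
train, test = samples[:split], samples[split:]
print(len(train), len(test))  # 12845 4281
```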
For simplicity, the number of units or neurons in each LSTM layer is taken as 64, and a dropout of 0.3 is used to avoid over-fitting. Moreover, fine-tuned hyperparameters are selected: the Nesterov Adam (Nadam) optimizer, a combination of NAG (Nesterov Accelerated Gradient) and Adam (24), with a learning rate of 0.001, and a loss function set to mean squared error. The activation function is set to sigmoid, which squeezes a large input space into the small range between 0 and 1. The epochs and batch size for model training are chosen from candidate sets of {25, 50, 100, 200} and {25, 75, 125}, respectively. The more iterations, the more accurate the prediction result, but the processing time increases. Based on several comparative experiments, the settings that yielded the best performance are chosen: the LSTM models achieved the best performance with 100 epochs and a batch size of 75. Therefore, in this study, the epochs and batch size are set to 100 and 75, respectively. Finally, the input data are reshaped into three dimensions to suit the architecture of the LSTM network.
The LSTM models are implemented using the Python programming language and its packages, namely Pandas, SciKit-Learn, Matplotlib, etc. The models are built and trained using the deep learning libraries TensorFlow and Keras. All the LSTM algorithms are designed using the same hyperparameters and verified in the same operating environment: Windows 8, Intel(R) Core(TM) i3-3110M CPU @ 2.40 GHz, 6 GB RAM.

Evaluation Parameters
To validate the models' effectiveness, we chose four standard metrics to compare the Vanilla LSTM, Stacked LSTM, and Bidirectional LSTM. The prediction accuracy is determined by the Mean Square Error (MSE), Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R square (R2) between measured and predicted values. The following acceptance criteria are used for evaluation: (1) low MSE, (2) low MAE, (3) low RMSE, and (4) high R2. The mathematical expressions of these functions are given below:

MSE = (1/n) Σ_{i=1}^{n} (y_i - y'_i)^2
MAE = (1/n) Σ_{i=1}^{n} |y_i - y'_i|
RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i - y'_i)^2 )
R2 = 1 - Σ_{i=1}^{n} (y_i - y'_i)^2 / Σ_{i=1}^{n} (y_i - ȳ)^2

where y_i denotes the actual value of the i-th sample, ȳ denotes the mean of the actual values, y'_i denotes the predicted value of the i-th sample, and n is the number of samples.
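The four metrics above can be computed in a few lines of NumPy (the helper name is illustrative):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, RMSE, R2) for measured vs. predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(mse)
    ss_res = np.sum(err ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, rmse, r2

# Perfect prediction gives zero errors and R2 = 1.
mse, mae, rmse, r2 = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(mse, mae, rmse, r2)  # 0.0 0.0 0.0 1.0
```

Equivalent implementations are also available in `sklearn.metrics` (`mean_squared_error`, `mean_absolute_error`, `r2_score`).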

Results and discussion
After developing the models, their performance is compared and analyzed using the evaluation parameters. Based on the training dataset, the predictive LSTM models (i.e., Vanilla, Stacked, and Bidirectional) are developed as described above. The true and predicted PM2.5 and PM10 concentrations are plotted and visualized in Tables 2 and 3. The performance of the developed LSTM models in Table 2 shows that all the models can generate reasonably accurate concentration estimates. Table 3 shows the scattering of the actual and predicted values. It can be seen that the trend of the prediction curve follows the trend of the actual curve, and the predicted values have a positive linear relationship with the actual values. Table 4 reports the performance indicators of the LSTM techniques for estimating the PM2.5 and PM10 concentrations in the training process, and Table 5 ranks the LSTM models. Based on these results, we found that Stacked LSTM achieves the best performance on all evaluation metrics in predicting the concentrations. For PM2.5 concentration prediction, Vanilla LSTM performed slightly better than Bidirectional LSTM, whereas for PM10 concentration prediction, Bidirectional LSTM performed marginally better than Vanilla LSTM.

Conclusion
This study focused on the comparative performance of different LSTM neural networks in air pollutant concentration prediction. We draw the following conclusions: (1) the problem of multivariate inputs and approximate nonlinear mapping is handled very well; (2) Stacked LSTM, with the minimum MSE, MAE, and RMSE and the maximum R2, works better than the other networks for PM2.5 and PM10 concentration prediction; (3) each LSTM model has comparatively lower MSE, MAE, and RMSE and higher R2 for PM2.5 than for PM10 concentration prediction; (4) feature normalization and feature selection play a major role in improving air quality prediction. This study is limited to predicting PM2.5 and PM10, as long-term prediction is a tedious task. Though each LSTM model performs well and has high precision and strong adaptive ability, large volumes of data on traffic, industry, and meteorology need to be explored in order to build a more comprehensive and effective air quality prediction system.