Evaluation of Multi-variate Time Series Clustering for Imputation of Air Pollution Data

Air pollution is one of the world’s leading risk factors for death, with 6.5 million deaths per year worldwide attributed to air pollution-related diseases. Understanding the behaviour of certain pollutants through air quality assessment can produce improvements in air quality management that will translate to health and economic benefits. However problems with missing data and uncertainty hinder that assessment. 5 We are motivated by the need to enhance the air pollution data available. We focus on the problem of missing air pollutant concentration data either because a limited set of pollutants is measured at a monitoring site or because an instrument is not operating, so a particular pollutant is not measured for a period of time. In our previous work, we have proposed models which can impute a whole missing time series to enhance air quality monitoring. Some of these models are based on a Multivariate Time Series (MVTS) clustering method. Here, we apply our method 10 to real data and show how different graphical and statistical model evaluation functions enable us to select the imputation model that produces the most plausible imputations. We then compare the Daily Air Quality Index (DAQI) values obtained after imputation with observed values incorporating missing data. Our results show that using an ensemble model that aggregates the spatial similarity obtained by the geographical correlation between monitoring stations and the fused temporal similarity between pollutants concentrations produced very good imputation results. Furthermore, the analysis enhances understanding 15 of the different pollutant behaviours, and of the characteristics of different stations according to their environmental type.

based on the similarity of their PM 2.5 composition profiles, then characterise these clusters based on chemical characteristics, emission profiles, geographic locations and population density. Ignaccolo et al. (2008) transformed the TS of pollutant daily observations into a functional form to smooth the TS, then classified the air quality monitoring network in Northern Italy using the Partitioning Around Medoids algorithm (PAM) to cluster three individual pollutants, namely NO 2 , PM 10 , and O 3 . Tuysuzoglu et al. (2019) applied different clustering algorithms such as k-means, Expectation Maximisation, and Canopy for 60 each air pollutants in the dataset (NO, NO 2 , SO 2 , PM 10 , and O 3 ), then aggregated the clustering results based on majority voting to identify one clustering solution for similar regions in terms of air quality.
On the other hand, there has been some research into similarity within MVTS. For example, Fontes and Budman (2017) proposed a MVTS clustering method based on extracted features from the univariate TS. In their work, Principal Component Analysis (PCA) is used to measure the similarity between MVTS, and fuzzy k-means is used to cluster these TS. This clustering approach was used for fault detection in a gas turbine. Zhou and Chan (2014) developed an algorithm for clustering MVTS by discovering each TS's temporal patterns. Their algorithm is based on k-means and aims to groups MVTS with similar temporal patterns together into the same cluster. D' Urso et al. (2018) proposed robust fuzzy clustering models for MVTS based on an exponential transformation of the dissimilarities. This algorithm was applied to real-world data on the concentrations of three pollutants (NO, NO 2 , and PM 10 ) in the Metropolitan City of Rome for the problem of detecting pollution alarms. 70 In our previous work (Alahamade et al., 2020), we compared different TS distance measures and imputation techniques to impute the missing observations and missing pollutants (TS). We found that using Shape-Based Distance (SBD) gives better separated cluster than Dynamic Time Warping (DTW). Also, using MICE to impute the TS missing observations is better than using some single imputation methods such as Simple Moving Average (SMA). We used a univariate TS clustering using k-medoids (PAM) to cluster stations and imputed the missing pollutants using the cluster average. In this work, we use the 75 k-means clustering algorithm and include a number of pollutants in the clustering, which make it MVTS clustering. This clustering algorithm was proposed in Alahamade et al. (2021) where more details can be found. bands; 'low' (1-3), 'moderate' (4-6), 'high' (7-9) and 'very high ' (10). The air quality is negatively correlated with DAQI index, meaning that a higher DAQI index represents worse air quality.

90
The MVTS clustering algorithm and our proposed imputation models were implemented in R, Version (3.5.2) and are fully explained in previous work (Alahamade et al., 2021). To provide a more robust testing scenario, we separate the 'model building' stage from the imputation testing stage. We use an initial data period of three years (2015-2017) as a training set to build the clustering, and then impute on the next year (2018) of the TS to evaluate the goodness of fit.

95
For evaluation purposes, we assume each pollutant from each station is missing entirely and impute it. For any given station, j, to impute the values of missing pollutant P j i , where i is represents the different pollutants (1 ≤ i ≤ 4), we use different models under two main similarity criteria: the similarity using clustering solutions and the similarity using geographical distance.
The k-means clustering algorithm is used to group the stations based on their temporal similarity, which is the similarity in time between the hourly pollutant concentrations using SBD as the temporal distance measure. While the geographical distance 100 is used to find the spatial similarity between stations locations. Adding to that, we use an ensemble model which calculates the median of all the previous imputation models; this model aggregates the temporal and the spatial imputation using both the time series clustering and the geographical location similarity. Then, we evaluate these models to select the one that gives the highest similarity to the real values which are known. We explain these models in detail in the following sections: 4.1.1 Imputation models using clustering results:

105
Once a clustering of our stations is obtained, we can use the clustering solution to impute missing TS (pollutants). If station j belongs to cluster C x , (1 ≤ x ≤ k, where k is the number of clusters) given the measured pollutants over time, then, to impute pollutant P i based on the clustering results, we use three models: 1. We impute the average of pollutant P i in cluster C x , which is the hourly average of pollutant P i in all the stations that fall in this cluster. We call this method Cluster Average (CA). 110 2. We impute the average of pollutant P i in cluster C x , but using only stations with the same environment type to station j within the cluster, such as 'Background Rural', 'Background Urban','Traffic', or 'Industrial'. We call this method (CA+ENV). This is in recognition that the type of station may be important and result more similar pollutant concentrations.
3. We impute the average of pollutant P i in cluster C x , for stations that belong to the same region. As defined by Defra 115 (DEFRA , 2021) there are 16 regions in the UK for air quality assessment, such as Eastern, North Wales, East Midlands, and the other UK regions; this method is called (CA+REG). 4.1.2 Imputation models by similarity using geographical distance: First, we measure the geographic distance using Harvison metric, which calculates geographic distance on earth based on longitude and latitude. We calculate the distance between station j and all other stations that measure pollutant P i . Then to 120 impute pollutant P i for station j we use: 1. The nearest neighbour (1NN) using the Harvison based distance to station j; this method is called (1NN).
2. The average of the two nearest neighbours (2NN) to station j; this method is called (2NN).

Imputation model by ensemble:
In this approach, for a given station j, to impute pollutant P i , we use the median value of all the imputed values from the 125 previous models, that are cluster average (CA), cluster average considering the station type (CA+ENV), cluster average considering the region (CA+REG), first nearest neighbour (1NN), and the average of the two nearest neighbours (2NN). This method is called (Median). This imputation approach may be computationally the most expensive as it needs for all others to be computed, but ensembles have the potential to provide very powerful solutions by combining predictions. 130 We evaluate how plausible the imputation is using different models by comparing truth values to imputed values. The models evaluation are based on the test dataset, which is the 2018 data. As earlier mentioned we do this by taking each existing TS for which we have values, one at a time, and consider them missing. We impute the whole TS by various models and compare that to the ground truth. We are evaluating our models against the real concentrations which contain missing values, hence, we ignore all the missing values in this evaluation. For each model, we can average the different imputation models' behaviour 135 from all the stations to establish the one that provides imputed values closest to the real values. Hence, for our experimental set up we take each existing TS for a given pollutant and station, P j i in turn, and impute it by the various models to obtain an imputed TS, P I j i . We compare the real values to the imputed values using different statistical and graphical model evaluation functions. The statistics function include Fraction of predictions within the factor of two (FAC2), Mean Bias (MB), Normalised Mean Bias (NMB), Root Mean Squared Error (RMSE), Coefficient of correlation (R), and Index of Agreement (IOA). These 140 measures are used to evaluate the temporal variation of air pollutants between imputed/modelled and observed concentrations.

Imputation model evaluation
The graphical functions include Conditional Quantile plot, Time Variation plot, and Taylor's Diagram. These are functions within the Openair Package, a freely available air quality data analysis tool in R (Carslaw and Ropkins, 2012), that present comparisons between the modelled and measured air pollutant concentrations and their statistics graphically. Model evaluation functions are beneficial when more than one model is involved in the comparison, and help us in understanding why a model 145 does not perform well.
The model that gives the lowest error on average, the highest correlation and the highest degree of agreement between imputed and observed concentrations for all stations (i.e. imputed TS) will be considered the best model. Note that the best model may change from one pollutant to another and may be affected by other factors such as station type (e.g. urban background, rural and roadside) or pollutant lifetime and spread.

DAQI calculation
In the UK, DAQI forecasts are issued on a national scale; they are produced by the Met Office in the morning for the current day as well as for the next four days. The forecast is improved by incorporating the recent observations of air quality recorded at the AURN stations. The overall air pollution index for a site or region is determined by the highest DAQI of the five pollutants.
The regional DAQI is the highest index among all the stations at that region.

155
For our evaluation, we calculated the daily DAQI value using the observed data for each station. This is because the DAQI value is not saved as part of the historical data available so we need to calculate it from the downloaded data. Defra has published a guide for the implementation of DAQI (DEFRA , 2013), which explains how the value is calculated and we follow that guidance. To calculate DAQI, each air pollutant is calculated as follows: -Ozone: the O 3 is measured hourly. To determine the DAQI we need to calculate the daily maximum 8-hourly running 160 mean concentration. First, for each hour we calculate the running 8-hourly mean from the previous hours. Then we find the maximum value of these 8-hourly running means. For this calculation 75% of the data must be captured to calculate the 8-hourly mean.
-Nitrogen dioxide: the NO 2 is measured based on hourly mean. We calculate the daily NO 2 contribution to the DAQI by taking the maximum observation in 24 hours every day from 0:00 to 23:00.

165
-Particles PM 10 PM 2.5 : are measured hourly. The DAQI is based on the 24 hours mean, which we calculate by taking the mean value from the hourly observations. For these pollutants 75% of the daily observations must be captured to calculate the mean, otherwise, the pollutant is considered as missing that day.
-We define the daily index for each pollutant separately. Then, for a station, we take the highest air pollutant index to be the value of the DAQI at that station. 170 We called the DAQI that is calculated based on observation 'observed DAQI', and the DAQI that is calculated based on imputation 'imputed DAQI'. We use the observed DAQI as a performance tool to evaluate our imputation model on its ability to reproduce the daily air quality index.

Results
In this section, we first analyse the proposed pollutant imputation models using some statistical and graphical air pollution 175 modeling evaluation functions. Then, we evaluate the imputation model performance based on the comparison between the observed and imputed DAQI.

Air pollution imputation modeling evaluation.
We first evaluate imputation models based on the statistical and then on the graphical analysis. In general, model 6 (Median), which is the model that uses the ensemble technique of other models, gives the lowest error average (RMSE), the highest Pearson correlation coefficient (R), and the highest agreement between imputed and observed 185 concentrations (IOA) for O 3 , PM 2.5 , and PM 10 . However, NO 2 shows different behaviour with model 2 (CA+ENV) achieving slightly higher performance with an increase of the correlation coefficient (by 0.049) and decrease of error average (by 0.826) compared to model 6 (Median). The Model Bias (MB) for model 2 is 50% higher than that of model 6. NO 2 shows local patterns, as it is concentrated where it is emitted in urban areas and near to the roadside. Adding to that NO 2 is shorter lived than other pollutants and shows greater spatial variability, with concentrations being strongly influenced by the environment 190 type (e.g. roadside, urban background, rural). This changes the NO 2 concentrations from one location to another based on the environmental type (CenterForCities, 2020).
All the selected models performed well, with 71-89 % of their imputations falling within a factor of two of the observed concentrations as shown in the FAC2 values in Table. 1. According to Dick Derwent and Murrells (2010), an air quality model minimum requirement is that the FAC2 value is higher than 0.50 and NMB values should be in the range between -0.2 and 195 +0.2. Both are met by our models. MNB measures if the model under or over predict, as it estimates the difference between the mean observed and imputed concentrations. Negative MMB means that the model under-predict and vice versa. All the models have very small biases.

Model evaluation based on Taylor's diagram analysis.
We use Taylor's diagram to analyse three main statistics: correlation coefficient R, the standard deviation (sigma) and the root-200 mean-square error (centred). These statistics can be plotted on one (2D) graph which can be represented through the Law of Cosines (Taylor, 2001).
The standard deviation represents the variability between modelled and observed concentrations. The observed variability is plotted on the x-axis. The magnitude of the variability is measured as the radial distance from the plot's origin. The black dashed line shows this for the observed value. The grey lines are isopleths for the correlation coefficient (R) as indicated by 205 the arc shaped axis; the correlation increases along the arc towards the x-axis. The centred root-mean square error (RMS) is represented by the concentric brown dashed lines. The furthest the points/models are from the observed value the worst performance they have (Carslaw and Ropkins, 2012). Fig. 1 shows Taylor Diagram plots for all models with all pollutants. In almost all cases the models exhibit less variability than the observed, indicated by the points being closer to the origin than the black dashed line. In general, Model 4 (1NN) followed by Model 5 (2NN) show variability that is most similar to the  Model 6 (Median), regardless of its ability to capture variability, is confirmed as having the highest correlation coefficient, and the lowest centred root means squared with all the pollutants except NO 2 , for which it is the second best behind Model 2 (CA+ENV).   In this analysis, we focus on the performance of model 6 (Median) and model 2 (CA+ENV), as those performed best for the 265 different pollutants in the previous section, but now we break down the analysis for the six environmental types (background rural, background urban, background suburban, and industrial urban, industrial suburban, and traffic urban) to which stations belong. Notice that a pollutant may/may not be measured in all stations and the number of stations of each type is different as shown in Table 2. We also use conditional quantiles to analyse our model's performance within each environmental type.  Table 2 shows the statistical measures of performance also broken down by environment type.  The most common sources for NO 2 are roads, however NO 2 concentrations are influenced by traffic density, road locations, and meteorological conditions, which cause variation from one roadside location to another. Fig. 4 shows that high NO 2 275 concentrations are found at traffic urban followed by industrial suburban, then background urban sites, while the background rural sites have the lowest NO 2 concentrations. Fig.5 shows the conditional quantile plots by station type for NO 2 imputation using model 2 (CA+ENV). Here, we see that modelled concentrations are higher than observed concentrations with all environmental types. This is confirmed by all the statistical model quality measures presented in Table 2 where we can observe positive mean bias for NO 2 .   From Table 2   emissions of nitric oxide from vehicles. Looking at model 6 (Median) performance in Table 2 based on the RMSE, the best model performance is associated with industrial urban stations and its average performance is associated with background rural stations (those with higher concentrations in Fig. 6), while its worst performance is associated with traffic urban stations (those with lower concentrations in Fig. 6). Fig. 7, shows the performance of model 6 (Median) for imputing O 3 for the six environ-295 mental types (Panels A to F). The model shows similar performance for industrial suburban (Panel D) and background rural stations (Panel A). For both types, the model is negatively biased (see also Table 2), meaning that the modelled concentrations tend to be lower than observed concentrations (the median lines are above the blue lines).

Conditional quantile analysis in
The worst performance based on the RMSE is associated with traffic urban stations (Panel F), which are the stations located at the roadsides. With those stations, the modelled concentrations are higher than observed concentrations, i.e. the modelled 300 histogram is shifted to the right. This is indicated by the model positive bias (0.503). The median line also extends beyond the blue line, which means that some modelled concentrations are much higher than observed measurements.
The best model performance is associated with industrial urban stations (Panel E) according to the RSME, even though background urban stations (Panel C) appear to have the best performance by looking at the conditional quantile plots. The histogram of Panel C indicates that the distribution of the observed and modelled concentrations tend to be closer to each other 305 for higher concentrations. However, the model overestimates the average concentrations at these stations (between 25 to 70 µg m −3 ) and underestimates the very low concentrations. That is consistent with the model performance at these sites. Fig. 9 shows corresponding conditional quantile plots by station types. Imputing PM 2.5 concentrations using model 6 (Median) gives similar performance for the different station types. In  worst performance for traffic urban (panel E), and this is also indicated by the highest RMSE (5.098) shown in Table 2. The model underestimates the concentrations at these stations, which is confirmed by the model bias (-0.073) in Table 2. On the 315 other hand, the model's best performance is associated with background suburban sites ( Fig. 9 (panel B)), even though it underestimates PM 2.5 concentrations with a mean bias of (-0.013).
Finally, PM 10 levels at background rural and urban areas are lower than those at industrial and traffic urban areas as shown in Fig. 10. For PM 10 , imputation performance shown in Fig.11 is similar for background urban and background rural sites (panels A and B). The model overestimates the concentrations of PM 10 that are ≤ 10µgm −3 , while it underestimates the high 320 concentrations of PM 10 at industrial urban (slightly) and traffic urban sites (panels C and D). That is confirmed by the model mean bias at these sites (-0.002, -0.106 ) as shown on Table 2.

Evaluating the imputed concentrations based on the Daily Air Quality Index (DAQI).
After imputing the measured pollutants in all the stations, we calculate DAQI from the imputed data, as explained in Section.
4.3. Then we compare it with the DAQI from the observed data to see our selected models' performances. The selected models 325 are model 6 (Median) for O 3 , PM 2.5 , and PM 10 and Model 2 (CA+ENV) for NO 2 .
We compare the imputed DAQI with the observed DAQI based on RMSE, and the number of days where there are agreements and disagreements. The total number of days in our data set is 60,955 days (167 stations * 365 days), there are 2,212 days with missing observed DAQI (DAQI = 0) that have resulted from missing observations on those days. The total number of days to compare is 58,743 days.

330
In general, the total average of RMSE from all days in all stations is (0.55). As the station type and the region may affect our imputation, Fig. 12  between imputed and observed DAQI.
We also study the correlation between number of measured pollutants in a station and the agreement between modelled and observed DAQI to see if number of measured pollutants impacts our model's performance. First, we classify stations based on number of measured pollutants to: stations that measured one, two, three and all four pollutants, as shown in Table 3. Each row in this table represents one group. The second column is the total number of days  We also compare the imputed and the observed DAQI based on the number of days where the imputed DAQI agrees and disagrees with the observed DAQI. In this work, we evaluated our proposed models to impute missing pollutants in a station based on statistical and graphical model evaluation functions (Taylor's diagrams and Conditional quantile plots), that are designed to evaluate air pollution modelling. We found that the best imputation model based on statistical analysis is model 6 (Median) for O 3 , PM 10 , and PM 2.5 and Model 2 (CA+ENV) for NO 2 imputation. The station environmental type plays an essential role with NO 2 imputation, 360 because NO 2 shows local patterns, as it is concentrated where it is emitted in urban areas and near to the roadside. Adding to that NO 2 is shorter lived than other pollutants and shows greater spatial variability, with concentrations being strongly influenced by the environment type (e.g. roadside, urban background, rural). This changes the NO 2 concentrations from one location to another based on the environmental type (CenterForCities, 2020).
On the other hand, the graphical model evaluation functions showed these models' performance based on the distribution Through our analysis, we also found that the variation of the model's performance with different environmental types is due to the pollutant behaviour and its emitted sources.

380
Model 6 (Median) performance with O 3 imputation changes from one environmental type to another due to the ozone's behaviour at these locations. As we know, ozone is not directly emitted into the air, but it is formed as a secondary pollutant by chemistry involving nitrogen oxides (NO x ), the sum of NO 2 , nitric oxide (NO) and volatile organic compounds (VOCs) in the presence of sunlight (Diaz et al., 2020). This chemistry is non-linear and newly emitted NO can react with O 3 leading to reductions in O 3 concentrations close to sources of NO (e.g. in urban areas and in particular, close to roads). Consequently, 385 ozone concentrations in urban areas are often lower than those at rural areas (H. Khan et al., 2017), as shown in Fig. 6. From the same figure (Panel C), as shown in the histogram for background urban stations, there is a high frequency of low concentrations (less than 10 µg m −3 ) at these stations that the model does not capture. This is consistent with the reaction of 395 newly emitted NO from urban roadside that reduces the concentrations of ozone at urban areas. Based on the RMSE and NMB, the model is a middle performing model. As shown in Table 2, the majority of stations measuring O 3 belong to this type.
As, we mentioned earlier that NO 2 is short lived so it has large differences between sites near sources (roadside) and those further away. Based on the RMSE Model 2 (CA+ENV) performs better with lower NO 2 concentrations than high values, and since these high NO 2 values exist near to traffic, the model performs the worst with traffic urban stations as shown in Fig.   400 5 (Panel F). In contrast, the model best performance is associated with background rural stations that have the lowest NO 2 concentrations.
PM 2.5 and PM 10 have many varied sources so in roads and industrial sites it can be associated with local sources, for example widespread primary sources (direct emissions) and diffused secondary sources (i.e. produced in the atmosphere following emissions of precursor gases). Whilst PM concentrations are often greater at roadside (DEFRA LAQM , 2021), the particles 405 can have lifetimes of several days in the atmosphere, meaning that they can be distributed widely. The larger particles are subject to greater loss via sedimentation, so PM 2.5 is more evenly distributed than PM 10 (National Statistics, 2020). This behaviour can also be observed with Model 6 (Median) performance, where there are less variation with the model performance under different environment types compared to the variation of NO 2 and O 3 , as shown in Table 2.
We also observed that the distributions of NO 2 , PM 2.5 and PM 10 are skewed to lower concentrations which impact model 410 performance at higher concentrations. All models perform worse for high concentrations with NO 2 , PM 2.5 and PM 10 than O 3 , indicated by the width of the quantiles at high values as shown in Figs. 2 and 3. Similarly, for lower concentrations, these models tend to perform better for NO 2 , PM 2.5 , and PM 10 than for O 3 . However, our selected models (model 6 (Median) and model 2 (CA+ENV)) are able to overcome this impact slightly.
Our approach enables us to impute/estimate plausible concentrations of multiple pollutants at stations across the UK, and the 415 modelled concentrations from the selected models correlated well with the observed concentrations. The performance of these models is very good with a slight underestimation in model 6 (Median), especially with high concentrations. At the opposite end, Model 2 (CA+ENV) slightly overestimates the NO 2 concentrations, due to the regional behaviour of this pollutant.
We also analysed the performance of these models based on the daily modelled concentrations under different weather types using Lamb Weather Types (LWTs), which are a synoptic classification of daily weather patterns across the UK Lamb (1972). 420 We found that these models work equally well for all LWTs, so we did not include this analysis in this work.
In conclusion, MVTS clustering enables imputation even when no measurement is available for a given pollutant since the station can be allocated to a cluster based on the value of the other pollutants measured. Our proposed imputation models, model 6 (Median) for O 3 , PM 10 , and PM 2.5 and Model 2 (CA+ENV), give the best performance for imputing these pollutants.
The advantage of these model is that they aggregate the spatial and temporal imputation. The spatial imputation is obtained 425 from the nearest stations and the temporal imputation is obtained by MVTS clustering that clusters the stations based on similarity in time.
In our future work, we aim to improve our imputation by considering more information about the stations, such as station altitude and location in relation to the weather effects. We may also consider the correlation between pollutants in our imputation, and include further analysis for the daily air quality index (DAQI), especially for those days when there is a variation between 430 imputed and observed DAQI. Finally, we need to study all possible uncertainty associated with this type of application, since the pollution level may change from year to year due to some pollution episodes caused by high temperature, wind, wildfire or other reasons.