Air pollution is one of the world's leading risk factors for death, with 6.5 million deaths per year worldwide attributed to air-pollution-related diseases. Understanding the behaviour of certain pollutants through air quality assessment can produce improvements in air quality management that will translate to health and economic benefits. However, problems with missing data and uncertainty hinder that assessment.

We are motivated by the need to enhance the air pollution data available. We focus on the problem of missing air pollutant concentration data, which arises either because only a limited set of pollutants is measured at a monitoring site or because an instrument is not operating, so a particular pollutant is not measured for a period of time.

In our previous work, we have proposed models which can impute a whole missing time series to enhance air quality monitoring. Some of these models are based on a multivariate time series (MVTS) clustering method. Here, we apply our method to real data and show how different graphical and statistical model evaluation functions enable us to select the imputation model that produces the most plausible imputations. We then compare the Daily Air Quality Index (DAQI) values obtained after imputation with observed values incorporating missing data. Our results show that using an ensemble model that aggregates the spatial similarity obtained by the geographical correlation between monitoring stations and the fused temporal similarity between pollutant concentrations produces very good imputation results. Furthermore, the analysis enhances understanding of the different pollutant behaviours and of the characteristics of different stations according to their environmental type.

Time series (TS) analysis has received much attention in recent decades due to its importance in many real-world applications such as earthquake prediction

The quality of the air in the UK is assessed based on five main pollutants. In this study we focus on the four main pollutants: particulate matter less than 2.5

The main challenge with analysing these pollutant TS is that not all the stations report all the pollutants. Even if a station does, it may not measure a particular pollutant all the time due to instrument downtime. In our previous work

Based on the clustering results and station geographical location, we proposed three models to impute the whole time series for the missing pollutant at a given station. In this paper, we apply multiple model evaluation functions to assess which model gives the best results and to demonstrate the validity of our models.

Our long-term goal is to reduce the uncertainty in air quality assessment by imputing all missing pollutants in the monitoring stations. This will allow us to calculate new air quality indices that may or may not agree with the previous indices; that is, the observed indices that incorporate missing data. This in turn will help us to identify where more measurements can be beneficial.

We refer to our approach as time series imputation because we use the observed time series to impute missing time series (whole TS) at stations where one pollutant is not measured but other pollutants are. In this process, we are not filling in missing values within a time series (e.g. by interpolation) but imputing a new TS. Also, we do not use predictive models; hence, we do not consider this a prediction task. However, it could be argued that our task is close to spatial interpolation

The paper's structure is as follows: Sect.

In this section, we briefly review some representative research on clustering techniques and their application in air pollution modelling. Data mining techniques have been widely applied to study air pollution data; however, most of this research focuses only on a single pollutant (univariate TS), while clustering multivariate time series remains a challenging task

On the other hand, there has been some research into similarity within MVTS. For example,

In our previous work

We study air pollution using the concentrations measured by the Automatic Urban and Rural Network (AURN) around the UK. The stations in the network are automatic and produce hourly pollutant concentrations. The data are collected and stored, then made directly available via the Web

The Daily Air Quality Index (DAQI) represents air pollution levels in the UK. This index is reported based on the highest individual DAQI derived for each of the five major air pollutants (O

The MVTS clustering algorithm and our proposed imputation models were implemented in R (version 3.5.2) and are fully explained in previous work

For evaluation purposes, we assume each pollutant from each station is missing entirely and impute it. For any given station,

The

Once a clustering of our stations is obtained, we can use the clustering solution to impute missing TS (pollutants). If station

We impute the average of pollutant

We impute the average of pollutant

We impute the average of pollutant
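The cluster-based imputation idea above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each station carries a cluster label and that a missing pollutant series is imputed as the mean, at each time step, over the other stations in the same cluster that measure that pollutant. The function name `cluster_average_impute`, the dictionaries `series` and `clusters`, and the toy values are hypothetical; the sketch does not reproduce the exact variants of the three cluster-based models or the original R implementation.

```python
import math

def cluster_average_impute(target, pollutant, series, clusters):
    """Impute a whole series for `pollutant` at station `target` as the
    per-time-step mean over donor stations in the same cluster.
    Time steps where no donor has a value stay missing (NaN)."""
    donors = [
        s for s in clusters
        if s != target
        and clusters[s] == clusters[target]
        and pollutant in series.get(s, {})
    ]
    if not donors:
        return []
    # Assumption: all donor series are aligned and have the same length.
    length = len(series[donors[0]][pollutant])
    imputed = []
    for t in range(length):
        values = [
            series[s][pollutant][t] for s in donors
            if not math.isnan(series[s][pollutant][t])
        ]
        imputed.append(sum(values) / len(values) if values else math.nan)
    return imputed

# Toy data: station D measures NO2 only and shares a cluster with A and B.
series = {
    "A": {"O3": [40.0, 50.0, math.nan]},
    "B": {"O3": [60.0, 70.0, 80.0]},
    "C": {"O3": [10.0, 10.0, 10.0]},
    "D": {"NO2": [5.0, 6.0, 7.0]},
}
clusters = {"A": 1, "B": 1, "C": 2, "D": 1}
imputed_o3_at_d = cluster_average_impute("D", "O3", series, clusters)
```

Station C is ignored because it belongs to a different cluster, and the third time step uses only station B because the value at station A is missing.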

First, we measure the geographic distance using the haversine formula, which calculates the great-circle distance between two points on Earth from their longitude and latitude. We calculate the distance between station

the nearest neighbour (1NN) using the haversine-based distance to station

the average of the two nearest neighbours (2NN) to station
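The distance-based models above can be sketched as follows. This is a minimal Python illustration, assuming the standard haversine formula with a mean Earth radius of 6371 km and donor series without missing values. The station names, coordinates, and the helper `knn_impute` are hypothetical and are not taken from the original implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def knn_impute(target, coords, series, k):
    """Average, per time step, the series of the k nearest donor stations."""
    donors = sorted(
        (s for s in series if s != target),
        key=lambda s: haversine_km(*coords[target], *coords[s]),
    )[:k]
    length = len(series[donors[0]])
    return [sum(series[s][t] for s in donors) / len(donors)
            for t in range(length)]

# Hypothetical stations as (latitude, longitude) in degrees.
coords = {
    "target": (52.63, 1.30),
    "near": (51.51, -0.13),
    "mid": (53.48, -2.24),
    "far": (55.95, -3.19),
}
# Only donor stations measure the pollutant of interest.
series = {
    "near": [30.0, 40.0],
    "mid": [50.0, 20.0],
    "far": [90.0, 90.0],
}
one_nn = knn_impute("target", coords, series, k=1)
two_nn = knn_impute("target", coords, series, k=2)
```

With `k=1` the imputed series is a copy of the nearest station's series, so it keeps that station's full variability; with `k=2` the averaging already smooths it.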

In this approach, for a given station

We evaluate how plausible the imputations from the different models are by comparing the imputed values with the true values. The model evaluations are based on the test dataset, which is the 2018 data. As mentioned earlier, we take each existing TS for which we have values, one at a time, and treat it as missing. We impute the whole TS with each model and compare the result with the ground truth. Because the observed concentrations themselves contain missing values, we ignore those time points in the evaluation. For each model, we average its performance over all stations to establish which model provides imputed values closest to the real values. Hence, for our experimental set-up we take each existing TS for a given pollutant and station,

Model evaluation functions are beneficial when more than one model is involved in the comparison, and they help us understand why a model does not perform well. The model that gives the lowest error on average, the highest correlation, and the highest degree of agreement between imputed and observed concentrations for all stations (i.e. imputed TS) is initially considered the best model. However, extensive evaluation with various graphical functions enables us to better assess the model quality and how it reflects uncertainty. Note that the best model may change from one pollutant to another and may be affected by other factors such as station type (e.g. urban background, rural, and roadside) or pollutant lifetime and spread.
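As an illustration of the statistical criteria mentioned above, the following Python sketch computes the root mean square error, the mean bias, and the Pearson correlation coefficient between an observed and an imputed series, skipping time points at which either value is missing. The function names and the toy values are hypothetical; the sketch uses the textbook definitions and is not the evaluation code used in the study.

```python
import math

def paired(observed, imputed):
    """Keep only time points where both values are present (not NaN)."""
    return [(o, m) for o, m in zip(observed, imputed)
            if not (math.isnan(o) or math.isnan(m))]

def rmse(observed, imputed):
    pairs = paired(observed, imputed)
    return math.sqrt(sum((m - o) ** 2 for o, m in pairs) / len(pairs))

def mean_bias(observed, imputed):
    """Positive values mean the imputation is, on average, too high."""
    pairs = paired(observed, imputed)
    return sum(m - o for o, m in pairs) / len(pairs)

def pearson_r(observed, imputed):
    pairs = paired(observed, imputed)
    n = len(pairs)
    mean_o = sum(o for o, _ in pairs) / n
    mean_m = sum(m for _, m in pairs) / n
    cov = sum((o - mean_o) * (m - mean_m) for o, m in pairs)
    var_o = sum((o - mean_o) ** 2 for o, _ in pairs)
    var_m = sum((m - mean_m) ** 2 for _, m in pairs)
    return cov / math.sqrt(var_o * var_m)

# Toy hourly series; the third observation is missing and is ignored.
observed = [10.0, 20.0, math.nan, 40.0]
imputed = [12.0, 18.0, 30.0, 44.0]
scores = {
    "rmse": rmse(observed, imputed),
    "mean_bias": mean_bias(observed, imputed),
    "r": pearson_r(observed, imputed),
}
```

Averaging such scores over all stations, separately for each model and pollutant, gives one way to rank the models before turning to the graphical checks.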

In the UK, DAQI forecasts are issued on a national scale; they are produced by the Met Office in the morning for the current day as well as for the next 4 days. The forecast is improved by incorporating the recent observations of air quality recorded at the AURN stations. The overall air pollution index for a site or region is determined by the highest DAQI of the five pollutants. The regional DAQI is the highest index among all the stations in that region.

For our evaluation, we calculated the daily DAQI value using the observed data for each station. This is because the DAQI value is not saved as part of the historical data available, so we need to calculate it from the downloaded data. DEFRA has published a guide for the implementation of DAQI

We define the daily index for each pollutant separately. Then, for a station, we take the highest air pollutant index to be the value of the DAQI at that station.
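The two-step rule above (one index per pollutant, then the highest index per station) can be sketched in Python as follows. The breakpoints in `PM25_UPPER_BOUNDS` are an assumption included only to make the example runnable and should be checked against the DEFRA guide; the function names and the example index values are hypothetical.

```python
import bisect

# ASSUMPTION: illustrative upper bounds of index bands 1 to 9 for the daily
# mean PM2.5 concentration; values above the last bound map to index 10.
# Verify against the DEFRA implementation guide before any real use.
PM25_UPPER_BOUNDS = [11, 23, 35, 41, 47, 53, 58, 64, 70]

def pollutant_index(concentration, upper_bounds):
    """Map an (already rounded) daily concentration to an index from 1 to 10."""
    return bisect.bisect_left(upper_bounds, concentration) + 1

def highest_index(indices):
    """Highest index among the available values; None marks a missing value."""
    available = [i for i in indices if i is not None]
    return max(available) if available else None

# A station that does not measure PM2.5: the DAQI uses the other pollutants.
station_indices = {"O3": 3, "NO2": 2, "PM10": 2, "PM2.5": None}
observed_daqi = highest_index(station_indices.values())

# After imputing PM2.5 (hypothetical daily mean of 36), the DAQI may change.
station_indices["PM2.5"] = pollutant_index(36, PM25_UPPER_BOUNDS)
imputed_daqi = highest_index(station_indices.values())

# The regional DAQI is the highest station DAQI in the region.
regional_daqi = highest_index([observed_daqi, 5, None])
```

In this toy case the station index rises from 3 to 4 once the missing pollutant is imputed, which is the kind of disagreement between observed and imputed DAQI that the comparison is designed to expose.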

We call the DAQI that is calculated from observations the “observed DAQI” and the DAQI that is calculated from imputations the “imputed DAQI”. We use the observed DAQI as a benchmark to evaluate our imputation model on its ability to reproduce the Daily Air Quality Index. Note that although we produce only one imputation and not multiple imputations at this stage, we believe the imputed values reflect the underlying uncertainty because they are based on a number of aggregated methods.

In this section, we first analyse the proposed pollutant imputation models using some statistical and graphical air pollution modelling evaluation functions. Then, we evaluate the imputation model performance based on the comparison between the observed and imputed DAQI.

We first evaluate imputation models based on the statistical and then on the graphical analysis.

Table

In general, model 6 (Median), which is the model that uses the ensemble technique of other models, gives the lowest error average (RMSE), the highest Pearson correlation coefficient (

All the selected models performed well, with 71 %–89 % of their imputations falling within a factor of 2 of the observed concentrations as shown in the FAC2 values in Table
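For reference, the factor-of-2 measure (FAC2) can be computed as in the following Python sketch: the fraction of imputed values that lie between half and twice the corresponding observed value. Skipping pairs with a missing value or a non-positive observation is an assumption of this sketch, and the function name and toy values are hypothetical.

```python
import math

def fac2(observed, imputed):
    """Fraction of pairs with 0.5 <= imputed / observed <= 2.
    Pairs with a missing value or a non-positive observation are skipped."""
    pairs = [
        (o, m) for o, m in zip(observed, imputed)
        if not (math.isnan(o) or math.isnan(m)) and o > 0
    ]
    within = sum(1 for o, m in pairs if 0.5 * o <= m <= 2.0 * o)
    return within / len(pairs)

observed = [10.0, 20.0, 30.0, 40.0, math.nan]
imputed = [5.0, 45.0, 29.0, 90.0, 12.0]
fac2_value = fac2(observed, imputed)  # two of the four valid pairs qualify
```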

Performance of the hourly pollutant concentration imputation models based on statistical measures. Best values are in bold for FAC2, RMSE,

We use a Taylor diagram to analyse three main statistics: correlation coefficient

The standard deviation represents the variability between modelled and observed concentrations. The observed variability is plotted on the

In almost all cases the models exhibit less variability than observed, as indicated by the points being closer to the origin than the black dashed line. In general, model 4 (1NN) followed by model 5 (2NN) show variability that is most similar to the observations, as indicated by their relative closeness to the black dashed line. However, these models tend to have the lowest correlation coefficients, as indicated by the grey lines, and the greatest RMSE, as indicated by the brown dashed lines.
Models 4 and 5 use the concentrations from a single site (i.e. the nearest stations) in the imputation, whereas the other models use a cluster average (CA, CA

Model 6 (Median), regardless of its ability to capture variability, is confirmed as having the highest correlation coefficient and the lowest centred root mean square error for all the pollutants except NO

Taylor diagrams comparing modelled and observed concentrations for O
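The three statistics summarised in a Taylor diagram are linked by a standard identity: the squared centred root mean square error equals the sum of the two variances minus twice the product of the standard deviations and the correlation coefficient. The Python sketch below computes the statistics for a toy pair of series and checks that identity numerically. The function name and the values are hypothetical, and population (divide by n) standard deviations are assumed.

```python
import math

def taylor_statistics(observed, modelled):
    """Return (sd_obs, sd_mod, r, centred_rmse) using population statistics."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_m = sum(modelled) / n
    dev_o = [o - mean_o for o in observed]
    dev_m = [m - mean_m for m in modelled]
    sd_o = math.sqrt(sum(d * d for d in dev_o) / n)
    sd_m = math.sqrt(sum(d * d for d in dev_m) / n)
    r = sum(a * b for a, b in zip(dev_o, dev_m)) / (n * sd_o * sd_m)
    # Centred RMSE: RMSE after removing the mean of each series.
    crmse = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_o, dev_m)) / n)
    return sd_o, sd_m, r, crmse

observed = [10.0, 20.0, 30.0, 40.0, 50.0]
modelled = [15.0, 22.0, 28.0, 35.0, 45.0]
sd_o, sd_m, r, crmse = taylor_statistics(observed, modelled)

# Identity behind the diagram's geometry (law of cosines).
identity_rhs = sd_o ** 2 + sd_m ** 2 - 2.0 * sd_o * sd_m * r
```

Because of this identity, a model can match the observed standard deviation closely and still have a large centred error if its correlation is low, which is the pattern reported for the nearest-neighbour models.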

We analyse the spread of the modelled and observed pollutant concentrations using conditional quantile plots. Figures

These plots show how the modelled concentrations compare with the observed concentrations and how the models capture the variability in the concentrations. The spread of the modelled concentrations around the perfect model line (blue line) is shown by the shaded portions and quantile intervals. A narrow spread indicates high agreement or precision between the modelled and observed concentrations. The quantile intervals also represent the uncertainty bands. In some cases these intervals do not extend as far as the median line because there are too few data points to calculate them. A model performs well when the median (red line) coincides with the perfect model line (blue line) and when the spread of the percentiles is as narrow as possible.
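The quantities behind such a plot can be sketched in Python as follows: the modelled concentrations are binned, and within each bin the median and an interquartile band of the corresponding observed concentrations are computed. The bin width, the minimum count needed for a band, the function name, and the toy values are all assumptions of this sketch; it is not the plotting routine used to produce the figures.

```python
import statistics
from collections import defaultdict

def conditional_quantiles(modelled, observed, bin_width=10.0, min_count=4):
    """For each bin of modelled values, return the median of the observed
    values and their interquartile band (None if too few points)."""
    bins = defaultdict(list)
    for m, o in zip(modelled, observed):
        bins[int(m // bin_width)].append(o)
    result = {}
    for b in sorted(bins):
        obs = bins[b]
        centre = (b + 0.5) * bin_width
        median = statistics.median(obs)
        if len(obs) >= min_count:
            q1, _, q3 = statistics.quantiles(obs, n=4, method="inclusive")
            band = (q1, q3)
        else:
            band = None  # too few points to estimate the band
        result[centre] = (median, band)
    return result

modelled = [2.0, 4.0, 6.0, 8.0, 12.0, 15.0]
observed = [1.0, 3.0, 5.0, 7.0, 20.0, 22.0]
quantiles_by_bin = conditional_quantiles(modelled, observed)
```

For a perfect model the median in each bin would lie close to the bin centre. In the toy data the second bin has only two points, so a median is available but no band, mirroring the case where the quantile intervals stop short of the median line.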

From these plots, in general, the histograms indicate that model 4 (1NN) (panel d) gives a better estimate of the observed variability, as noted before, even though the median line does not match the perfect model. This model is positively biased at high concentrations, as shown by the departure of the median line below the blue line for all pollutants. This result supports our analysis from the Taylor diagram that model 4 (1NN) has the smallest difference in variability between modelled and observed concentrations, but with a lower correlation coefficient and the highest centred root mean square error for all pollutants.

Conditional quantile plot of modelled and observed pollutant concentrations of O

Conditional quantile plot of modelled and observed pollutant concentrations of PM

In Fig.

Model 6 (Median) (panel f) has the best performance, as indicated by an overlapping median line with the blue line. This model has the lowest mean bias and the highest degree of agreement, as indicated by the narrow spread of the modelled concentration quantile intervals.

In the same figure (right), NO

The variation between PM

Model 6 (Median) (panel f) gives better performance, as indicated by the narrow spread of the modelled concentration quantile intervals and minimal bias, which is indicated by the overlaps between the red and blue lines compared to other models. Models for PM

In this analysis, we focus on the performance of model 6 (Median) and model 2 (CA

First, we show the monthly average concentrations for each pollutant under each environment type in our test dataset (year 2018) to understand the normal variation of the pollutant concentrations in different environment types. Figures

Monthly average concentrations of observed NO

Conditional quantile plot of modelled and observed pollutant concentrations of NO

The most common sources of NO

Figure

As NO

From Table

Monthly average concentrations of observed O

Conditional quantile plot of modelled and observed pollutant concentrations of O

For O

Conditional quantile analysis in Fig.

The worst performance based on the RMSE is associated with traffic urban stations (panel f), which are the stations located at roadsides. At those stations, the modelled concentrations are higher than the observed concentrations; i.e. the modelled histogram is shifted to the right. This is indicated by the model's positive bias (0.503). The median line also extends beyond the blue line, which means that some modelled concentrations are much higher than the observed measurements.

The best model performance is associated with industrial urban stations (panel e) according to the RMSE, even though background urban stations (panel c) appear to have the best performance by looking at the conditional quantile plots. The histogram in panel (c) indicates that the distributions of the observed and modelled concentrations tend to be closer to each other for higher concentrations. However, the model overestimates the average concentrations at these stations (between 25 and 70

Monthly average concentrations of observed PM

Conditional quantile plot of modelled and observed pollutant concentrations of PM

Figure

Monthly average concentrations of observed PM

Conditional quantile plot of modelled and observed pollutant concentrations of PM

Finally, PM

Performance of the hourly pollutant concentration imputation models using model 6 (Median) for O

Next, we show some examples of our imputed TS compared to the real TS for each pollutant using the selected imputation models in some stations. The following examples in Figs.

Figure

Imputed (black) and real (red) TS comparison for PM

Imputed (black) and real (red) TS comparison for PM

Figure

Figure

Imputed (black) and real (red) TS comparison for NO

Imputed (black) and real (red) TS comparison for O

After imputing the measured pollutants in all the stations, we calculate the DAQI from the imputed data, as explained in Sect.

We compare the imputed DAQI with the observed DAQI based on RMSE and the number of days on which they agree and disagree. The total number of days in our dataset is 60 955 days (167 stations

In general, the total average RMSE from all days in all stations is 0.55. As the station type and the region may affect our imputation, Fig.

The model performance based on DAQI RMSE:

We also study the correlation between the number of measured pollutants in a station and the agreement between modelled and observed DAQI to see if the number of measured pollutants impacts our model's performance.

First, we classify stations by the number of measured pollutants into those that measure one, two, three, and all four pollutants, as shown in Table

Comparing observed and modelled DAQI based on the number of measured pollutants in stations.

We also compare the imputed and the observed DAQI based on the number of days on which the imputed DAQI agrees and disagrees with the observed DAQI. Table

Number of days for which imputed DAQI agrees or disagrees with observed DAQI.
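The day-by-day comparison between imputed and observed DAQI can be sketched in Python as follows. Days on which either index is unavailable are skipped, which is an assumption of this sketch; the function name and the toy index values are hypothetical.

```python
import math

def compare_daqi(observed, imputed):
    """Count agreeing and disagreeing days and compute the RMSE between
    two daily index series; None marks a day without an index."""
    pairs = [(o, i) for o, i in zip(observed, imputed)
             if o is not None and i is not None]
    agree = sum(1 for o, i in pairs if o == i)
    rmse = math.sqrt(sum((i - o) ** 2 for o, i in pairs) / len(pairs))
    return {
        "days": len(pairs),
        "agree": agree,
        "disagree": len(pairs) - agree,
        "rmse": rmse,
    }

observed_daqi = [2, 3, 3, None, 4]
imputed_daqi = [2, 3, 2, 3, 6]
summary = compare_daqi(observed_daqi, imputed_daqi)
```

Applied per station and then aggregated by region or station type, such a summary yields both the agreement counts and the DAQI RMSE used in the comparison.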

In this work, we evaluated our proposed models to impute missing pollutants in a station based on statistical and graphical model evaluation functions (Taylor diagrams and conditional quantile plots) that are designed to evaluate air pollution modelling. We found that the best imputation model based on statistical analysis is model 6 (Median) for O

On the other hand, the graphical model evaluation functions showed these models' performance based on the distribution of the concentrations and the degree of agreement between imputed–modelled and observed concentrations. These functions help us to understand the relationship between the distributions of the observations and the model's performance. From the histograms in Figs.

Model 6 (Median) is based on the median concentrations from stations with temporal and spatial similarity, so with a normally distributed dataset it is expected to underestimate the highest values and overestimate the lowest values. We found that the model performance can vary with the environmental type and the nature of the pollutant, as shown in our analysis of model performance and DAQI RMSE.
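The damping effect of a median-based ensemble can be illustrated with a short Python sketch. It assumes, for illustration only, that the ensemble takes the median at each time step across the series produced by the individual models; the function name and the toy values are hypothetical and do not reproduce the original implementation.

```python
import statistics

def median_ensemble(model_outputs):
    """Element-wise median across several imputed series of equal length."""
    return [statistics.median(values) for values in zip(*model_outputs)]

# Three hypothetical imputed series for the same station and pollutant.
model_outputs = [
    [10.0, 50.0, 90.0],
    [20.0, 40.0, 70.0],
    [5.0, 45.0, 60.0],
]
ensemble = median_ensemble(model_outputs)

# The ensemble never leaves the range spanned by the individual models,
# so extreme highs and lows tend to be pulled towards the centre.
all_values = [v for series in model_outputs for v in series]
ensemble_range = (min(ensemble), max(ensemble))
models_range = (min(all_values), max(all_values))
```

In the toy data the ensemble spans 10 to 70 while the individual models span 5 to 90, which is consistent with the expectation that the highest values are underestimated and the lowest values overestimated.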

Through our analysis, we also found that the variation of the model's performance across environmental types is due to the behaviour of the pollutant and its emission sources.

Model 6 (Median) performance with O

Figure

From the same figure (panel c), as shown in the histogram for background urban stations, there is a high frequency of low concentrations (less than 10

As we mentioned earlier, NO

PM

We also observed that the distributions of NO

Our approach enables us to impute and/or estimate plausible concentrations of multiple pollutants at stations across the UK, and the modelled concentrations from the selected models correlated well with the observed concentrations. The performance of these models is very good, with a slight underestimation in model 6 (Median), especially with high concentrations. At the opposite end, model 2 (CA

We also analysed the performance of these models based on the daily modelled concentrations under different weather types using Lamb weather types (LWTs), which are a synoptic classification of daily weather patterns across the UK

In conclusion, MVTS clustering enables imputation even when no measurement is available for a given pollutant since the station can be allocated to a cluster based on the value of the other pollutants measured. Our proposed imputation models, model 6 (Median) for O

In future work, we aim to improve our imputation by considering more information about the stations, such as station altitude and location in relation to weather effects. We may also consider the correlation between pollutants in our imputation and include further analysis of the Daily Air Quality Index (DAQI), especially for those days when the imputed and observed DAQI differ. Finally, we need to study the uncertainties associated with this type of application, since the pollution level may change from year to year due to pollution episodes caused by high temperature, wind, wildfire, or other factors.

Code and data are available at

The experimentation and initial draft were produced by WA as part of her PhD. IL and CER contributed ideas, co-supervised the PhD, and revised the draft papers. BDLI was the main supervisor for the work and contributed to the draft revisions.

The contact author has declared that neither they nor their co-authors have any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

We thank the anonymous referees for their useful suggestions and Salvatore Grimaldi for editing.

This paper was edited by Salvatore Grimaldi and reviewed by two anonymous referees.