Evaluation of multivariate time series clustering for imputation of air pollution data

Alahamade, Wedad; Lake, Iain; Reeves, Claire E.; De La Iglesia, Beatriz

doi:https://doi.org/10.5194/gi-10-265-2021

Articles | Volume 10, issue 2

Research article

03 Nov 2021

Abstract

Air pollution is one of the world's leading risk factors for death, with 6.5 million deaths per year worldwide attributed to air-pollution-related diseases. Understanding the behaviour of certain pollutants through air quality assessment can produce improvements in air quality management that will translate to health and economic benefits. However, problems with missing data and uncertainty hinder that assessment.

We are motivated by the need to enhance the air pollution data available. We focus on the problem of missing air pollutant concentration data either because a limited set of pollutants is measured at a monitoring site or because an instrument is not operating, so a particular pollutant is not measured for a period of time.

In our previous work, we have proposed models which can impute a whole missing time series to enhance air quality monitoring. Some of these models are based on a multivariate time series (MVTS) clustering method. Here, we apply our method to real data and show how different graphical and statistical model evaluation functions enable us to select the imputation model that produces the most plausible imputations. We then compare the Daily Air Quality Index (DAQI) values obtained after imputation with observed values incorporating missing data. Our results show that using an ensemble model that aggregates the spatial similarity obtained by the geographical correlation between monitoring stations and the fused temporal similarity between pollutant concentrations produces very good imputation results. Furthermore, the analysis enhances understanding of the different pollutant behaviours and of the characteristics of different stations according to their environmental type.

Received: 26 Apr 2021 – Discussion started: 17 May 2021 – Revised: 22 Sep 2021 – Accepted: 04 Oct 2021 – Published: 03 Nov 2021

1 Introduction

Time series (TS) analysis has received much attention in recent decades due to its importance in many real-world applications such as earthquake prediction (Di Bello et al., 1996), weather forecasting (Carbajal-Hernández et al., 2012), air pollution forecasting (Du et al., 2020), and human activity recognition (Seto et al., 2015). Generally speaking, TS data can be described as a sequence of observations that a variable takes over time. When several variables are observed and recorded simultaneously, this becomes a multivariate time series (MVTS).

The quality of the air in the UK is assessed based on five main pollutants. In this study we focus on four of them: particulate matter less than 2.5 µm in diameter (PM₂.₅), particulate matter less than 10 µm in diameter (PM₁₀), ozone (O₃), and nitrogen dioxide (NO₂). These pollutants are measured hourly at various monitoring stations.

The main challenge with analysing these pollutant TS is that not all the stations report all the pollutants. Even if a station does, it may not measure a particular pollutant all the time due to instrument downtime. In our previous work (Alahamade et al., 2021), we applied an intermediate fusion approach to fuse the distance between stations using the similarity of the four pollutants. The similarity between pollutant TS was measured using shape-based distance (SBD) between hourly pollutant concentrations (TS), as we found that SBD is better than other measures on our dataset (Alahamade et al., 2020). Then we used the k-means clustering algorithm to cluster the stations based on the fused distance; we called that MVTS clustering. Our initial clustering analysis showed that using the basic k-means with the fused distance gives very compact geographical clustering that enhances our understanding of the UK's air pollutant behaviours. Adding to that, using the fused distance to measure the similarity between the pollutants helped us solve some of the uncertainty problems associated with missing pollutant values as the MVTS clustering enables imputation even when no measurement is available for a given pollutant. This is because the multivariate nature of the clustering enabled a station to be allocated to a cluster based on the value of the other pollutants measured.

Based on the clustering results and station geographical location, we proposed three models to impute the whole time series for the missing pollutant at a given station. In this paper, we apply multiple model evaluation functions to assess which model gives the best results and to demonstrate the validity of our models.

Our long-term goal is to reduce the uncertainty in air quality assessment by imputing all missing pollutants in the monitoring stations. This will allow us to calculate new air quality indices that may or may not agree with the previous indices; that is, the observed indices that incorporate missing data. This in turn will help us to identify where more measurements can be beneficial.

We refer to our approach as time series imputation because we used the observed time series to impute missing time series (whole TS) in stations where one pollutant is not measured but other pollutants are. In this process, we are not filling the missing values within the time series (e.g. interpolating) but imputing a new TS. Also, we do not use predictive models; hence, we do not consider this a prediction task. However, it could be argued that our task is close to spatial interpolation (Lam, 1983) even though it is not completely based on spatial information; that is, we did not use any geographical information within the proposed MVTS clustering. Geographical information, however, is used in nearest-neighbour approaches, which are used in the ensemble proposed. Nevertheless, the main goal of the spatial interpolation is to fill in the gaps (points and/or locations with unknown measurements) using points with known values to cover a certain geographical area (Lam, 1983). Our goal is to impute unmeasured pollutants (whole TS) in several stations where they are not measured using the fused similarity between stations of other pollutants or using an ensemble of techniques including the MVTS clustering approach. We would argue that our imputation approach incorporates some uncertainty by using a combination of values (within the clustering process and within the ensemble) to produce the imputed value.

The paper's structure is as follows: Sect. 2 discusses some of the existing TS clustering methods and their application in the air quality field. Section 3 gives a brief introduction of the air quality assessment in the UK and its challenges. Section 4 discusses all the methods we used in detail to impute the missing pollutants and evaluate our proposed solutions. Finally, in Sect. 5, we analyse the results of our imputation models. Then, we conclude the work with some final remarks and indication for further developments in Sect. 6.

2 Related work

In this section, we briefly review some representative research on clustering techniques and their application in air pollution modelling. Data mining techniques have been widely applied to study air pollution data; however, most of this research focuses only on a single pollutant (univariate TS), while clustering multivariate time series remains a challenging task (Liao, 2005). Partitioning algorithms such as k-means and k-medoids are very common among works related to TS clustering and have been applied in many papers (e.g. Ignaccolo et al., 2008; Austin et al., 2013; Tuysuzoglu et al., 2019).

Austin et al. (2013) used the k-means algorithm to identify spatial patterns in air pollution data to cluster US cities based on the similarity of their PM₂.₅ composition profiles, then characterize these clusters based on chemical characteristics, emission profiles, geographic locations, and population density. Ignaccolo et al. (2008) transformed the TS of pollutant daily observations into a functional form to smooth the TS, then classified the air quality monitoring network in northern Italy using the partitioning around medoids algorithm (PAM) to cluster three individual pollutants, namely NO₂, PM₁₀, and O₃. Tuysuzoglu et al. (2019) applied different clustering algorithms such as k-means, expectation maximization, and canopy for each air pollutant in the dataset (NO, NO₂, SO₂, PM₁₀, and O₃), then aggregated the clustering results based on majority voting to identify one clustering solution for similar regions in terms of air quality.

On the other hand, there has been some research into similarity within MVTS. For example, Fontes and Budman (2017) proposed an MVTS clustering method based on extracted features from the univariate TS. In their work, principal component analysis (PCA) is used to measure the similarity between MVTS, and fuzzy k-means is used to cluster these TS. This clustering approach was used for fault detection in a gas turbine. Zhou and Chan (2014) developed an algorithm for clustering MVTS by discovering each TS's temporal patterns. Their algorithm is based on k-means and aims to group MVTS with similar temporal patterns together into the same cluster. D'Urso et al. (2018) proposed robust fuzzy clustering models for MVTS based on an exponential transformation of the dissimilarities. This algorithm was applied to real-world data on the concentrations of three pollutants (NO, NO₂, and PM₁₀) in the Metropolitan City of Rome for the problem of detecting pollution alarms.

In our previous work (Alahamade et al., 2020), we compared different TS distance measures and imputation techniques to impute missing observations and missing pollutants (TS). We found that using shape-based distance (SBD) gives better separated clusters than dynamic time warping (DTW). Also, using MICE to impute the TS missing observations is better than using some single imputation methods such as simple moving average (SMA). We used a univariate TS clustering using k-medoids (PAM) to cluster stations and imputed the missing pollutants using the cluster average. In this work, we use the k-means clustering algorithm and include a number of pollutants in the clustering, which makes it MVTS clustering. This clustering algorithm was proposed in Alahamade et al. (2021) where more details can be found. Here we extend that work by applying the imputation solution to real data and using extensive evaluation methods to demonstrate its effectiveness. This enables us to extend our understanding of pollutant behaviour.

3 Air quality assessment

We will study air pollution using the concentrations measured at the Automatic Urban and Rural Network (AURN) around the UK. The stations in the network are automatic and produce hourly pollutant concentrations. The data are collected and stored, then made directly available via the Web (DEFRA, 2021). There are 167 stations with different environmental types: rural, urban, suburban background, roadside, and industrial.

The Daily Air Quality Index (DAQI) represents air pollution levels in the UK. This index is reported based on the highest individual DAQI derived for each of the five major air pollutants (O₃, NO₂, PM₁₀, PM₂.₅, and SO₂) based on their concentrations. If concentration data for some of these pollutants are not available, the DAQI is based on those pollutants for which data are available. The DAQI is used to provide an indication of the air quality and some associated information that may be used by at-risk groups as well as the general population (DEFRA, 2021). The DAQI ranges from 1 to 10 and is divided into four bands: “low” (1–3), “moderate” (4–6), “high” (7–9), and “very high” (10). The air quality is negatively correlated with the DAQI, meaning that a higher DAQI represents worse air quality.
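The banding and the "highest individual index" rule described above can be illustrated with a minimal Python sketch. The function names and the convention of marking an unavailable pollutant with `None` are assumptions made for illustration; they are not taken from the DEFRA guidance or from the authors' R implementation.

```python
def daqi_band(index):
    """Map a DAQI value (1-10) to the band names given in the text."""
    if not 1 <= index <= 10:
        raise ValueError("DAQI must be between 1 and 10")
    if index <= 3:
        return "low"
    if index <= 6:
        return "moderate"
    if index <= 9:
        return "high"
    return "very high"


def overall_daqi(pollutant_indices):
    """Overall DAQI: the highest index among pollutants that have data.

    `pollutant_indices` maps a pollutant name to its individual index,
    or to None when no data are available (hypothetical convention).
    """
    available = [v for v in pollutant_indices.values() if v is not None]
    if not available:
        raise ValueError("no pollutant index available")
    return max(available)
```

For example, a station reporting indices of 4 for O₃ and 2 for NO₂, with PM₁₀ unavailable, would have an overall DAQI of 4, which falls in the "moderate" band.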

4 Methods

The MVTS clustering algorithm and our proposed imputation models were implemented in R (version 3.5.2) and are fully explained in previous work (Alahamade et al., 2021). To provide a more robust testing scenario, we separate the “model building” stage from the imputation testing stage. We use an initial data period of 3 years (2015–2017) as a training set to build the clustering and then impute on the next year (2018) of the TS to evaluate the goodness of fit.

4.1 Imputation models of missing pollutant TS

For evaluation purposes, we assume each pollutant from each station is missing entirely and impute it. For any given station, j, to impute the values of missing pollutant $P_{i}^{j}$, where i represents the different pollutants ($1 \leq i \leq 4$), we use different models under two main similarity criteria: the similarity using clustering solutions and the similarity using geographical distance.

The k-means clustering algorithm is used to group the stations based on their temporal similarity, which is the similarity in time between the hourly pollutant concentrations using SBD as the temporal distance measure. This distance function is implemented in the “dtwclust” package in R (Sarda-Espinosa, 2017). The geographical distance is used to find the spatial similarity between station locations. Adding to that, we use an ensemble model which calculates the median of all the previous imputation models; this model aggregates the temporal and spatial imputation using both the time series clustering and the geographical location similarity. Then, we evaluate these models to select the one that gives the highest similarity to the real values which are known. We explain these models in detail in the following sections.
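As an illustration of the temporal distance mentioned above, shape-based distance is commonly defined as one minus the maximum normalised cross-correlation over all relative shifts of two series. The sketch below is a direct, quadratic-time Python version of that definition. It is not the dtwclust implementation, and it assumes equal-length series with no missing values; any normalisation of the series before the comparison is left to the caller.

```python
import math


def sbd(x, y):
    """Shape-based distance: 1 - max over shifts of the normalised cross-correlation.

    Returns a value in [0, 2]; 0 means the shapes match under some shift.
    """
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    n = len(x)
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0:
        raise ValueError("at least one series is all zeros")
    best = max(
        # Cross-correlation at shift s: overlap x[i] with y[i - s].
        sum(x[i] * y[i - s] for i in range(max(0, s), min(n, n + s)))
        for s in range(-(n - 1), n)
    )
    return 1.0 - best / norm
```

Because the maximum is taken over shifts, two series with the same shape but displaced in time are treated as close, which is the property that makes the measure attractive for comparing pollutant profiles between stations.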

4.1.1 Imputation models using clustering results

Once a clustering of our stations is obtained, we can use the clustering solution to impute missing TS (pollutants). If station j belongs to cluster C_x ($1 \leq x \leq k$, where k is the number of clusters) given the measured pollutants over time, then, to impute pollutant P_i based on the clustering results, we use three models.

We impute the average of pollutant P_i in cluster C_x, which is the hourly average of pollutant P_i in all the stations that fall in this cluster. We call this method cluster average (CA).
We impute the average of pollutant P_i in cluster C_x, but using only stations with the same environment type to station j within the cluster, such as “background rural”, “background urban”, “traffic”, or “industrial”. We call this method CA+ENV. This is in recognition of the fact that the type of station may be important and result in more similar pollutant concentrations.
We impute the average of pollutant P_i in cluster C_x for stations that belong to the same region. As defined by DEFRA (DEFRA, 2021) there are 16 regions in the UK for air quality assessment, such as eastern and northern Wales, the East Midlands, and the other UK regions; this method is called CA+REG.
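The three cluster-based models can be sketched as one averaging routine with an optional extra matching attribute. This is a minimal Python illustration, not the authors' R code: the station dictionary layout (`id`, `cluster`, `env`, `region`, `series`) is hypothetical, and skipping missing hourly values when averaging is an assumption.

```python
def hourly_mean(series_list):
    """Hour-by-hour mean across series, skipping None values."""
    out = []
    for values in zip(*series_list):
        present = [v for v in values if v is not None]
        out.append(sum(present) / len(present) if present else None)
    return out


def cluster_average(target, stations, pollutant, match=None):
    """Impute `pollutant` at `target` from stations in the same cluster.

    match=None     -> CA      (all stations in the cluster)
    match="env"    -> CA+ENV  (same environment type only)
    match="region" -> CA+REG  (same region only)
    Returns None when no donor station qualifies.
    """
    donors = [
        s["series"][pollutant]
        for s in stations
        if s["id"] != target["id"]
        and s["cluster"] == target["cluster"]
        and pollutant in s["series"]
        and (match is None or s[match] == target[match])
    ]
    return hourly_mean(donors) if donors else None
```

Restricting the donors by environment type or region reduces the number of stations contributing to the average, so a sparse cluster may yield no donors at all; the sketch returns `None` in that case rather than guessing a fallback.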

4.1.2 Imputation models by similarity using geographical distance

First, we measure the geographic distance using the haversine metric, which calculates the great-circle distance on Earth based on longitude and latitude. We calculate the distance between station j and all other stations that measure pollutant P_i. Then to impute pollutant P_i for station j we use the following:

the nearest neighbour (1NN) using the haversine-based distance to station j – this method is called 1NN; and
the average of the two nearest neighbours (2NN) to station j – this method is called 2NN.
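A minimal Python sketch of the two nearest-neighbour models follows. It uses the standard haversine formula with a mean Earth radius of 6371 km, which is an assumed constant; the station dictionary layout (`lat`, `lon`, `series`) and the skipping of missing hourly values are likewise assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def knn_impute(target, donors, k):
    """Average, hour by hour, the series of the k donors nearest to target.

    k=1 corresponds to 1NN and k=2 to 2NN. `donors` should contain only
    stations that measure the pollutant being imputed.
    """
    ranked = sorted(
        donors,
        key=lambda d: haversine_km(target["lat"], target["lon"], d["lat"], d["lon"]),
    )
    nearest = [d["series"] for d in ranked[:k]]
    imputed = []
    for values in zip(*nearest):
        present = [v for v in values if v is not None]
        imputed.append(sum(present) / len(present) if present else None)
    return imputed
```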

4.1.3 Imputation model by ensemble

In this approach, for a given station j, to impute pollutant P_i, we use the median value of all the imputed values from the previous models. Those are cluster average (CA), cluster average considering the station type (CA+ENV), cluster average considering the region (CA+REG), first nearest neighbour (1NN), and the average of the two nearest neighbours (2NN). This method is called Median. This imputation approach may be computationally the most expensive, as it requires all the other models to be computed first, but ensembles have the potential to provide very powerful solutions by combining predictions.
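The ensemble step amounts to an hour-by-hour median across the five model outputs. The Python sketch below illustrates this; how hours with missing model outputs are handled is not stated in the text, so skipping them is an assumption, as is the dictionary layout keyed by model name.

```python
from statistics import median


def median_ensemble(imputations):
    """Hour-by-hour median of the imputed series from several models.

    `imputations` maps a model name (e.g. "CA", "1NN") to its imputed
    series; None entries are skipped (an assumption of this sketch).
    """
    ensemble = []
    for values in zip(*imputations.values()):
        present = [v for v in values if v is not None]
        ensemble.append(median(present) if present else None)
    return ensemble
```

Using the median rather than the mean limits the influence of a single model that produces an extreme value at a given hour, for example a nearest neighbour located next to a busy road.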

4.2 Imputation model evaluation

We evaluate how plausible the imputation is using different models by comparing truth values to imputed values. The model evaluations are based on the test dataset, which is the 2018 data. As mentioned earlier, we do this by taking each existing TS for which we have values, one at a time, and treating it as missing. We impute the whole TS by various models and compare that to the ground truth. We are evaluating our models against the real concentrations, which contain missing values; hence, we ignore all the missing values in this evaluation. We can then average each imputation model's performance over all the stations to establish the one that provides imputed values closest to the real values. Hence, for our experimental set-up we take each existing TS for a given pollutant and station, $P_{i}^{j}$, in turn and impute it by the various models to obtain an imputed TS, $PI_{i}^{j}$. We compare the real values to the imputed values using different statistical and graphical model evaluation functions. The statistical functions include the fraction of predictions within a factor of 2 (FAC2), mean bias (MB), normalized mean bias (NMB), root mean squared error (RMSE), coefficient of correlation (R), and index of agreement (IOA). These measures are used to evaluate the temporal variation of air pollutants between imputed (modelled) and observed concentrations. The graphical functions include a conditional quantile plot, time variation plot, and Taylor diagram. These are functions within the “openair” package, a freely available air quality data analysis tool in R (Carslaw and Ropkins, 2012) that presents comparisons between the modelled and measured air pollutant concentrations and their statistics graphically. We use the R packages openair (Carslaw and Ropkins, 2012) and tidyverse (Wickham et al., 2017) for the evaluation.
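For readers who want to see the statistics spelled out, the following Python sketch computes them from paired observed and imputed series, dropping any hour where either value is missing. It is an illustration only and not the openair code: the IOA is written as the refined Willmott index, which is assumed here to match what openair reports, and observations equal to zero are counted as falling outside the factor-of-2 range, which is also an assumption.

```python
import math


def evaluate(observed, imputed):
    """Return FAC2, MB, NMB, RMSE, R and IOA for paired series (None = missing)."""
    pairs = [(o, m) for o, m in zip(observed, imputed) if o is not None and m is not None]
    if not pairs:
        raise ValueError("no paired values to compare")
    n = len(pairs)
    obs = [o for o, _ in pairs]
    mod = [m for _, m in pairs]
    mean_o, mean_m = sum(obs) / n, sum(mod) / n
    abs_err = sum(abs(m - o) for o, m in pairs)
    obs_dev = 2 * sum(abs(o - mean_o) for o in obs)  # scaling constant c = 2
    cov = sum((o - mean_o) * (m - mean_m) for o, m in pairs)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_m = sum((m - mean_m) ** 2 for m in mod)
    return {
        "FAC2": sum(1 for o, m in pairs if o != 0 and 0.5 <= m / o <= 2.0) / n,
        "MB": mean_m - mean_o,
        "NMB": sum(m - o for o, m in pairs) / sum(obs),
        "RMSE": math.sqrt(sum((m - o) ** 2 for o, m in pairs) / n),
        "R": cov / math.sqrt(var_o * var_m),  # undefined for constant series
        "IOA": 1 - abs_err / obs_dev if abs_err <= obs_dev else obs_dev / abs_err - 1,
    }
```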

Model evaluation functions are beneficial when more than one model is involved in the comparison and help us in understanding why a model does not perform well. The model that gives the lowest error on average, the highest correlation, and the highest degree of agreement between imputed and observed concentrations for all stations (i.e. imputed TS) is initially considered the best model. However, extensive evaluation with various graphical functions enables us to better assess the model quality and how it reflects uncertainty. Note that the best model may change from one pollutant to another and may be affected by other factors such as station type (e.g. urban background, rural, and roadside) or pollutant lifetime and spread.

4.3 DAQI calculation

In the UK, DAQI forecasts are issued on a national scale; they are produced by the Met Office in the morning for the current day as well as for the next 4 days. The forecast is improved by incorporating the recent observations of air quality recorded at the AURN stations. The overall air pollution index for a site or region is determined by the highest DAQI of the five pollutants. The regional DAQI is the highest index among all the stations in that region.

For our evaluation, we calculated the daily DAQI value using the observed data for each station. This is because the DAQI value is not saved as part of the historical data available, so we need to calculate it from the downloaded data. DEFRA has published a guide for the implementation of DAQI (DEFRA, 2013), which explains how the value is calculated, and we follow that guidance. To calculate DAQI, the daily contribution of each air pollutant is calculated as follows.

Ozone. The O₃ is measured hourly. To determine the DAQI we need to calculate the daily maximum 8-hourly running mean concentration. First, for each hour we calculate the running 8-hourly mean from the previous hours. Then we find the maximum value of these 8-hourly running means. For this calculation 75 % of the data must be captured to calculate the 8-hourly mean.
Nitrogen dioxide. The NO₂ is measured based on an hourly mean. We calculate the daily NO₂ contribution to the DAQI by taking the maximum observation in 24 h every day from 00:00 to 23:00 GMT.
Particulate matter (PM₁₀ and PM₂.₅). These are measured hourly. The DAQI is based on the 24 h mean, which we calculate by taking the mean value from the hourly observations. For these pollutants 75 % of the daily observations must be captured to calculate the mean; otherwise, the pollutant is considered missing that day.
We define the daily index for each pollutant separately. Then, for a station, we take the highest air pollutant index to be the value of the DAQI at that station.
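The daily statistics described above can be sketched in Python as follows. This is a simplified illustration, not the authors' R code: the function names are hypothetical, the ozone windows use only the hours supplied (so the caller would need to prepend hours from the previous day for windows that span midnight), and the conversion of each statistic into an index from 1 to 10 uses the breakpoints in the DEFRA guidance, which are not reproduced here.

```python
def mean_with_capture(values, min_fraction=0.75):
    """Mean of the non-missing values, or None if data capture is too low."""
    present = [v for v in values if v is not None]
    if len(present) < min_fraction * len(values):
        return None
    return sum(present) / len(present)


def o3_daily_stat(hourly):
    """Maximum of the 8-hourly running means, each window ending at a given hour."""
    means = [mean_with_capture(hourly[end - 7:end + 1]) for end in range(7, len(hourly))]
    valid = [m for m in means if m is not None]
    return max(valid) if valid else None


def no2_daily_stat(hourly):
    """Maximum hourly mean observed during the day."""
    present = [v for v in hourly if v is not None]
    return max(present) if present else None


def pm_daily_stat(hourly):
    """24 h mean, requiring 75 % of the hourly observations."""
    return mean_with_capture(hourly)
```

With 24 hourly values, the 75 % rule means that at least 18 observations are needed for a particulate matter mean and at least 6 of the 8 hours for each ozone running mean.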

We call the DAQI calculated from observations the “observed DAQI” and the DAQI calculated from imputed values the “imputed DAQI”. We use the observed DAQI as a reference to evaluate each imputation model on its ability to reproduce the Daily Air Quality Index. Note that although we produce only one imputation rather than multiple imputations at this stage, we believe it reflects the underlying uncertainty because it is based on a number of aggregated methods.

5 Results

In this section, we first analyse the proposed pollutant imputation models using some statistical and graphical air pollution modelling evaluation functions. Then, we evaluate the imputation model performance based on the comparison between the observed and imputed DAQI.

5.1 Air pollution imputation modelling evaluation

We first evaluate imputation models based on the statistical and then on the graphical analysis.

5.1.1 Model evaluation based on statistical analysis

Table 1 shows the statistical analysis results. In this table N is the number of stations that measure each pollutant. The table also shows the fraction of predictions within a factor of 2 (FAC2), mean bias (MB), normalized mean bias (NMB), root mean squared error (RMSE), coefficient of correlation (R), and index of agreement (IOA).

In general, model 6 (Median), which is the ensemble of the other models, gives the lowest average error (RMSE), the highest Pearson correlation coefficient (R), and the highest agreement between imputed and observed concentrations (IOA) for O₃, PM₂.₅, and PM₁₀. However, NO₂ shows different behaviour, with model 2 (CA+ENV) achieving slightly higher performance with an increase in the correlation coefficient (by 0.049) and a decrease in average error (by 0.826) compared to model 6 (Median). The model bias (MB) for model 2 is 50 % higher than that of model 6. NO₂ shows local patterns, as it is concentrated where it is emitted in urban areas and near the roadside. Adding to that, NO₂ is shorter-lived than other pollutants and shows greater spatial variability, with concentrations being strongly influenced by the environment type (e.g. roadside, urban background, rural). This changes the NO₂ concentrations from one location to another based on the environmental type (CenterForCities, 2020).

All the selected models performed well, with 71 %–89 % of their imputations falling within a factor of 2 of the observed concentrations as shown in the FAC2 values in Table 1. According to Derwent et al. (2010), an air quality model minimum requirement is that the FAC2 value is higher than 0.50 and NMB values should be in the range between −0.2 and +0.2. Both are met by our models. NMB measures if the model underpredicts or overpredicts, as it estimates the difference between the mean observed and imputed concentrations. Negative NMB means that the model underpredicts and vice versa. All the models have very small biases.
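The acceptance criteria quoted above translate into a simple check. The Python sketch below uses hypothetical function names, and treating the NMB bounds of −0.2 and +0.2 as inclusive is an assumption.

```python
def meets_minimum_requirements(fac2, nmb):
    """Minimum requirements quoted in the text: FAC2 above 0.5 and NMB within ±0.2."""
    return fac2 > 0.5 and -0.2 <= nmb <= 0.2


def bias_direction(nmb):
    """Negative NMB means the model underpredicts; positive means it overpredicts."""
    if nmb < 0:
        return "underpredicts"
    if nmb > 0:
        return "overpredicts"
    return "unbiased"
```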

Table 1. Performance of the hourly pollutant concentration imputation models based on statistical measures. Best values are in bold for FAC2, RMSE, R, and IOA.

5.1.2 Model evaluation based on Taylor diagram analysis

We use a Taylor diagram to analyse three main statistics: correlation coefficient R, the standard deviation (sigma), and the root mean square error (centred). These statistics can be plotted on one (2D) graph, which can be represented through the law of cosines (Taylor, 2001).
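For reference, the relationship that allows the three statistics to share one plot is the one given by Taylor (2001). Writing $\sigma_m$ and $\sigma_o$ for the standard deviations of the modelled and observed concentrations (symbols chosen here for illustration) and $E'$ for the centred root mean square error, it reads

$$E'^{2} = \sigma_m^{2} + \sigma_o^{2} - 2\,\sigma_m\,\sigma_o\,R$$

This has the same form as the law of cosines, $c^{2} = a^{2} + b^{2} - 2ab\cos\phi$, with $R = \cos\phi$. A model is therefore plotted at radial distance $\sigma_m$ from the origin and at angle $\phi = \arccos R$ from the x axis, and its distance from the observed point on the x axis equals $E'$.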

The standard deviation represents the variability between modelled and observed concentrations. The observed variability is plotted on the x axis. The magnitude of the variability is measured as the radial distance from the plot's origin. The black dashed line shows this for the observed value. The grey lines are isopleths for the correlation coefficient (R) as indicated by the arc-shaped axis; the correlation increases along the arc towards the x axis. The centred root mean square error (RMSE) is represented by the concentric brown dashed lines. The further the points or models are from the observed value, the worse performance they have (Carslaw and Ropkins, 2012). Figure 1 shows Taylor diagram plots for all models with all pollutants.

In almost all cases the models exhibit less variability than observed, as indicated by the points being closer to the origin than the black dashed line. In general, model 4 (1NN) followed by model 5 (2NN) show variability that is most similar to the observations, as indicated by their relative closeness to the black dashed line. However, these models tend to have the lowest correlation coefficients, as indicated by the grey lines, and the greatest RMSE, as indicated by the brown dashed lines. Models 4 and 5 use the concentrations from a single site (i.e. the nearest stations) in the imputation, whereas the other models use a cluster average (CA, CA+REG, CA+ENV) or a model ensemble average (Median), so it is reasonable for models 4 and 5 to have variability fairly similar to the observed concentrations. All the other models display less variability than the observed concentrations (as indicated by their points being further from the black dashed line); this may be consistent with their derivation methods, which may smooth out some of the variability.

Model 6 (Median), regardless of its ability to capture variability, is confirmed as having the highest correlation coefficient and the lowest centred root mean square error for all the pollutants except NO₂, for which it is second best behind model 2 (CA+ENV).

Figure 1. Taylor diagrams comparing modelled and observed concentrations for O₃, NO₂, PM₂.₅, and PM₁₀.
