17 May 2021

Review status: this preprint is currently under review for the journal GI.

Evaluation of Multi-variate Time Series Clustering for Imputation of Air Pollution Data

Wedad Alahamade1,3, Iain Lake2, Claire E. Reeves2, and Beatriz De La Iglesia1
  • 1School of Computing Sciences, University of East Anglia, Norwich NR4 7TJ, UK
  • 2School of Environmental Sciences, University of East Anglia, Norwich NR4 7TJ, UK
  • 3School of Computing Sciences, Taibah University, Medina 42353, Saudi Arabia

Abstract. Air pollution is one of the world's leading risk factors for death, with 6.5 million deaths per year worldwide attributed to air pollution-related diseases. Understanding the behaviour of certain pollutants through air quality assessment can produce improvements in air quality management that will translate to health and economic benefits. However, problems with missing data and uncertainty hinder that assessment.

We are motivated by the need to enhance the air pollution data available. We focus on the problem of missing air pollutant concentration data, which arises either because only a limited set of pollutants is measured at a monitoring site or because an instrument is not operating, so that a particular pollutant is not measured for a period of time.

In our previous work, we proposed models that can impute a whole missing time series to enhance air quality monitoring. Some of these models are based on a Multivariate Time Series (MVTS) clustering method. Here, we apply our method to real data and show how different graphical and statistical model evaluation functions enable us to select the imputation model that produces the most plausible imputations. We then compare the Daily Air Quality Index (DAQI) values obtained after imputation with observed values incorporating missing data. Our results show that an ensemble model, which aggregates the spatial similarity obtained from the geographical correlation between monitoring stations and the fused temporal similarity between pollutant concentrations, produces very good imputation results. Furthermore, the analysis enhances understanding of the different pollutant behaviours and of the characteristics of different stations according to their environmental type.
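The idea of an ensemble that combines a spatial and a temporal candidate can be illustrated with a minimal sketch. It is not the authors' model: the abstract does not say how the spatial similarity, the clustering, or the aggregation are computed, so the sketch assumes Euclidean distance between station coordinates, precomputed cluster labels, and an equal-weight average. All names (`ensemble_impute`, the station dictionaries, the concentration values) are hypothetical and chosen only for illustration.

```python
from math import hypot
from statistics import mean


def nearest_station_series(target, stations, pollutant):
    """Spatial candidate: series from the closest station measuring the pollutant.

    Assumption: closeness is plain Euclidean distance on (x, y) coordinates.
    """
    candidates = [
        s for s in stations
        if s["name"] != target["name"] and pollutant in s["series"]
    ]
    nearest = min(
        candidates,
        key=lambda s: hypot(s["x"] - target["x"], s["y"] - target["y"]),
    )
    return nearest["series"][pollutant]


def cluster_mean_series(target, stations, pollutant):
    """Temporal candidate: pointwise mean over stations in the target's cluster.

    Assumption: cluster labels were produced beforehand by some clustering step.
    """
    members = [
        s["series"][pollutant] for s in stations
        if s["name"] != target["name"]
        and s["cluster"] == target["cluster"]
        and pollutant in s["series"]
    ]
    return [mean(values) for values in zip(*members)]


def ensemble_impute(target, stations, pollutant):
    """Equal-weight average of the spatial and temporal candidates (assumed rule)."""
    spatial = nearest_station_series(target, stations, pollutant)
    temporal = cluster_mean_series(target, stations, pollutant)
    return [(a + b) / 2 for a, b in zip(spatial, temporal)]


# Made-up example: station A measures NO2 but not O3.
stations = [
    {"name": "A", "x": 0.0, "y": 0.0, "cluster": 1,
     "series": {"NO2": [10.0, 12.0, 11.0]}},
    {"name": "B", "x": 1.0, "y": 0.0, "cluster": 2,
     "series": {"O3": [40.0, 50.0, 60.0]}},
    {"name": "C", "x": 5.0, "y": 5.0, "cluster": 1,
     "series": {"O3": [20.0, 30.0, 40.0]}},
    {"name": "D", "x": 6.0, "y": 5.0, "cluster": 1,
     "series": {"O3": [30.0, 40.0, 50.0]}},
]
imputed = ensemble_impute(stations[0], stations, "O3")
print(imputed)
```

In this toy data the nearest station (B) and the cluster members (C and D) disagree, and the ensemble value lies between the two candidates. A real evaluation would compare such imputations against held-out observations, as the abstract describes.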

Status: final response (author comments only)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on gi-2021-11', Anonymous Referee #1, 15 Jun 2021
    • CC1: 'Reply on RC1', Wedad Alahamade, 15 Jun 2021
  • RC2: 'Comment on gi-2021-11', Anonymous Referee #2, 05 Sep 2021

Data sets

Modelled Concentrations (Wedad Alahamade)

Air pollution data set (Department for Environment, Food & Rural Affairs)

Model code and software

Model Evaluation code (Wedad Alahamade)

Latest update: 20 Sep 2021
Short summary
We aim to reduce the uncertainty in air quality assessment by imputing all missing pollutants at the monitoring stations and by identifying where more measurements would be beneficial. The proposed approach is based on spatial or temporal similarity between stations. We found that it enables us to impute plausible concentrations of multiple pollutants at stations across the UK, and that the modelled concentrations from the selected models correlate well with the observed concentrations.